Source file: "【2】Audio Content Generation Building digitalized human and humanized AI CUHK-SZ.pdf" (29 pages)
1 AUDIO CONTENT GENERATION: BUILDING DIGITALIZED HUMAN AND HUMANIZED AI
Wei Xue
Division of Emerging Interdisciplinary Areas
The Hong Kong University of Science and Technology

2 STEPPING INTO THE VIRTUAL WORLD
The pandemic has changed the way we live, and the value of building an online virtual community is now recognized.

3 STEPPING INTO THE VIRTUAL WORLD
The Metaverse and Web 3.0 are being built, defining new paradigms of communication, perception and interaction.
Digitalized Human and Humanized AI
- Humans and AI become indistinguishable (perception, emotion and creation)
- AI-generated content enables co-inspiration and co-creation between humans and AI

4 NEXT-GENERATION AUDIOS
We have been working to build the next generation of audio content generation and of interaction between humans, machines and environments.
Endless beautiful audio content will be generated in the virtual world.
We take music as our research focus. Music is a symbol of human creativity and the basis of the modern entertainment industry.
My team focuses on audio content generation:
- Virtual singers with realistic and emotional singing voices (digital twin and virtual human)
- Music composition with aesthetic feelings (co-creation, novelty)

5 NEXT-GENERATION AUDIOS
Our goal is ultimately to create a new hybrid of real and virtual environments where humans and AI can inspire, appreciate and interact with each other through audio.
- Can we create an AI superstar among the top 10 singers?
- Can ordinary people with little music knowledge express themselves through music?
- Can we talk, sing, and write songs with our idols?

6 OUTLINE
- AI Choir: Audio Generation
- CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model

7 HOW TO MAKE AN AI CHOIR?

8 HUMAN-AI SYMPHONY ORCHESTRA
The world's first human-AI collaborative symphony orchestra

9 FACTS OF THE AI CHOIR
The AI choir, which performed "Pearl of the Orient" (東方之珠), was one of the key parts of the orchestra.
320 virtual singers were created for the 4-part choral performance.

10 UNSUPERVISED VIRTUALIZATION OF HUMAN VOICE
Creating a labeled database for hundreds of singers in the choir is cost-prohibitive (100K HKD for 1 hour of data).
Even with a budget, it is hard to collect enough professional singing recordings.
[Figure: a typical commercial AI singer]
Non-supervision: make the machine "imagine" the singing voice of any person by only listening to unlabeled recordings in any language and with any content.
[Figure labels: Digital Twin; Unlabeled dataset; Singing new songs]

11 UNSUPERVISED VIRTUALIZATION OF HUMAN VOICE
Auto-encoder Framework
- Humans use existing skills for new tasks
- Four models were pretrained on public datasets for their specific tasks
Data sources: online sources; singing voice datasets

Question                 Pretrained model     Input                                        Output
How to get data?         Speech Enhancement   Mixture of singing speech, music and noise   Singing speech
What lyrics (content)?   Speech Recognition   Singing speech                               Phonetic distribution (lyrics information)
Who sings?               Speaker ID           Singing speech                               Identity embedding
What melody?             Pitch Extraction     Singing speech                               Pitch contour

12 UNSUPERVISED VIRTUALIZATION OF HUMAN VOICE
Auto-encoder Framework
[Figure panels: Overall Framework; Conformer-based Decoder; Building Blocks; Conformer Block; PostNet]
We design a conformer-based decoder to recover the time-frequency Mel spectrogram of the original speech.

13 UNSUPERVISED VIRTUALIZATION OF HUMAN VOICE
Auto-encoder Framework
- The model learns the "uniqueness" of the person from large-scale data
- It is also controllable over speech content and melody
(Virtualize the human's voice and make a digital twin)
[Figure: Unsupervised Training on Decoder (control2spec): Mel Spectrogram Computation; Loss]
[Figure: Unsupervised Training on Vocoder (spec2wav): Singing Voice Vocoder; Mel Spectrogram Computation; Loss]
[Figure: Unsupervised Training, predicted (Pred) vs. ground truth (GT)]
[Audio samples: Yike speech sample; generated Yike singing sample; male singer training sample; male singer generated sample]

14 UNSUPERVISED VIRTUALIZATION OF HUMAN VOICE
An Audio Digital Twin of Maria Callas
AI-generated sample ("On Wings of Song")
100th birthday this year

15 CREATING NEW HUMAN VOICE
Interpolation
Besides virtualizing existing humans, we can create an arbitrary number of non-existing singers through timbre interpolation.
[Figure label: New Digital Human]

16 CREATING NEW HUMAN VOICE
Interpolation
Interpolation weights are guided by musicians to produce a good choir sound.

17 CREATING AN AI CHOIR
8 prototype singers -> 320 virtual singers
Audio was collected from online sources or recorded in our studio.
[Audio samples: Individual singer; Choir]

18 OUTLINE
- AI Choir: Audio Generation
- CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model

19 OBJECTIVES OF SPEECH SYNTHESIS
- High audio quality
- Fast inference speed
- Beyond speaking
While diffusion models can produce good audio quality, they require many inference steps, which makes them impractical.
We achieve one-step sampling, which makes diffusion-based speech synthesis practical.

20 CONSISTENCY MODELS
Diffusion model as SDE: [equation missing]
Diffusion model as probability flow ODE: [equation missing]
[symbol missing]: "denoiser"

21 CONSISTENCY MODELS
Consistency Constraints
[symbol missing] is the "denoiser"
If we impose [equation missing]
Then [equation missing]

22 COMOSPEECH
Two stages:
a) train a diffusion-based teacher
b) consistency distillation

23 COMOSPEECH
Teacher Training: ODE [equation missing]
With the drift and diffusion coefficients as t and 1
We choose [equation missing]
Grad-TTS based framework

24 COMOSPEECH
Consistency Distillation
Consistency Constraints: already satisfied [equation missing]
We optimize [equation missing]
Estimated by teacher model [equation missing]

25 COMOSPEECH
We also support multi-step sampling

26 COMOSPEECH
Results

27 COMOSPEECH
Results
https://comospeech.github.io/
Sample text: "Now, as all books not primarily intended as picture-books consist principally of types composed to form letterpress"

System         NFE   RTF
Grad-TTS       50    0.1694
FastSpeech 2   1     0.0017
Teacher        50    0.1824
CoMoSpeech     1     0.0058

28 COMOSPEECH
Results
https://comospeech.github.io/
Sample lyric (translated from Chinese): "Oh, crossing the line you drew, I set the finish line for my courage"

System           NFE   RTF
FFTSinger        1     0.0032
DiffSinger       60    0.1338
Teacher-SVS      50    0.1282
CoMoSpeech-SVS   1     0.0048

29 THANKS!
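
Supplementary sketch for the CREATING NEW HUMAN VOICE and CREATING AN AI CHOIR slides: the slides say that new singers are obtained by timbre interpolation and that 8 prototype singers yield 320 virtual singers, but they do not give the procedure. The Python below is only a minimal illustration of one plausible reading, in which each virtual singer is a convex combination of the prototype identity embeddings. The embedding size, the random Dirichlet-style weights, and all function names are assumptions made for the example; in the talk the weights are guided by musicians, not sampled.

```python
import random

EMBED_DIM = 16  # hypothetical embedding size, chosen only for the example


def dirichlet_weights(n_proto, rng, concentration=1.0):
    """Sample non-negative weights that sum to 1 (normalized Gamma draws)."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(n_proto)]
    total = sum(draws)
    return [d / total for d in draws]


def interpolate_timbre(prototypes, weights):
    """Convex combination of prototype identity embeddings."""
    dim = len(prototypes[0])
    return [sum(w * p[i] for w, p in zip(weights, prototypes)) for i in range(dim)]


def make_choir(prototypes, n_virtual, seed=0):
    """Return a list of (weights, embedding) pairs, one per virtual singer."""
    rng = random.Random(seed)
    choir = []
    for _ in range(n_virtual):
        # Assumption: random weights stand in for the musician-guided weights.
        weights = dirichlet_weights(len(prototypes), rng)
        choir.append((weights, interpolate_timbre(prototypes, weights)))
    return choir


# Stand-in identity embeddings for 8 prototype singers (random, not real data).
proto_rng = random.Random(42)
prototypes = [[proto_rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(8)]

choir = make_choir(prototypes, n_virtual=320)
print(len(choir), len(choir[0][1]))  # 320 16
```

Because the weights are non-negative and sum to 1, every virtual embedding stays inside the region spanned by the prototypes, which is one simple way to keep interpolated timbres close to voices the model has already learned.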
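
Supplementary sketch for the CONSISTENCY MODELS and multi-step sampling slides: the equations on those slides did not survive extraction, so the code below does not reproduce the authors' formulas. It illustrates the general consistency-model idea on a toy problem where everything has a closed form. Assumptions, none of which come from the slides: the data is a 1-D Gaussian N(MU, S^2); noise is added as x_t = x_0 + t * z; the probability flow ODE is dx/dt = -t * score; and the constants EPS and T_MAX are arbitrary. Under these assumptions the consistency function (which plays the role of the trained "denoiser") is known exactly, so no network is needed.

```python
import math
import random

MU, S = 2.0, 0.5           # toy data distribution N(MU, S^2), hypothetical
EPS, T_MAX = 0.002, 80.0   # smallest and largest noise levels, assumed


def consistency_fn(x, t):
    """Map a point at noise level t to the start (t = EPS) of its ODE trajectory."""
    scale = math.sqrt(S**2 + EPS**2) / math.sqrt(S**2 + t**2)
    return MU + (x - MU) * scale


def ode_point(x, t_from, t_to):
    """Move along the toy probability flow ODE trajectory from t_from to t_to."""
    return MU + (x - MU) * math.sqrt(S**2 + t_to**2) / math.sqrt(S**2 + t_from**2)


def sample_one_step(rng):
    """One function evaluation: draw noise at T_MAX and map it straight to data."""
    x_T = rng.gauss(0.0, T_MAX)
    return consistency_fn(x_T, T_MAX)


def sample_multi_step(rng, times):
    """Alternate re-noising and denoising at decreasing noise levels."""
    x = sample_one_step(rng)
    for t in times:
        x_t = x + math.sqrt(t**2 - EPS**2) * rng.gauss(0.0, 1.0)
        x = consistency_fn(x_t, t)
    return x


rng = random.Random(0)
one_step = [sample_one_step(rng) for _ in range(5000)]
multi_step = [sample_multi_step(rng, times=[10.0, 1.0]) for _ in range(5000)]
print(sum(one_step) / len(one_step), sum(multi_step) / len(multi_step))
```

Both samplers should return values distributed close to N(MU, S^2). The one-step path uses a single function evaluation, which corresponds to the NFE of 1 reported for CoMoSpeech in the results, while the multi-step path spends extra evaluations; in a trained model, where the consistency function is only approximate, those extra steps can improve quality.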