
【2】Audio Content Generation Building digitalized human and humanized AI CUHK-SZ.pdf

Uploaded by: 2*** | ID: 129342 | 2023-05-01 | PDF | 29 pages | 3.74 MB

Part of the collection: 2023 Speech and Language Processing Technology Exchange Meeting (Shenzhen), guest speaker slides


Slide 1: AUDIO CONTENT GENERATION: BUILDING DIGITALIZED HUMAN AND HUMANIZED AI
Wei Xue
Division of Emerging Interdisciplinary Areas
The Hong Kong University of Science and Technology

Slide 2: STEPPING INTO THE VIRTUAL WORLD
The pandemic has changed the way of living, and the value of building an online virtual community is recognized.

Slide 3: STEPPING INTO THE VIRTUAL WORLD
Metaverse and Web 3.0 are being built which define new paradigms of communication, perception and interaction
Digitalized Human and Humanized AI
- Human and AI become Indistinguishable (Perception, emotion and creation)
- AI-generated contents enabling co-inspiring and co-creation between human and AI

Slide 4: NEXT-GENERATION AUDIOS
We have been working to build the next generation of audio content generation and the interaction between humans, machines and environments
Endless beautiful audio contents will be generated in the virtual world. We take the music as a research focus. Music is a symbol of human creativity and is the basis of modern entertainment industry.
My team focuses on
- Virtual singers with realistic and emotional singing voices (Digital twin and virtual human)
- Music composition with aesthetic feelings (Co-creation, novelty)
[Diagram label: Audio Content Generation]

Slide 5: NEXT-GENERATION AUDIOS
Our goal is to ultimately create a new hybrid of real and virtual environments where humans and AI can inspire, appreciate and interact through audios.
- Can we create an AI superstar among the top 10 singers?
- Can ordinary people with little music knowledge express themselves through music?
- Can we talk, sing, and write songs with our idols?

Slide 6: OUTLINE
- AI Choir: Audio Generation
- CoMoSpeech: One Step Speech and Singing Voice Synthesis via Consistency Model

Slide 7: HOW TO MAKE AN AI CHOIR?

Slide 8: HUMAN-AI SYMPHONY ORCHESTRA
The world's first human-AI collaborative symphony orchestra

Slide 9: FACTS OF THE AI CHOIR
The AI choir, which performed "Pearl of the Orient (東方之珠)", was one of the key parts of the orchestra.
320 virtual singers were created for the 4-part choral performance.

Slide 10: UNSUPERVISED VIRTUALIZATION OF HUMAN VOICE
Creating a labeled database for hundreds of singers in the choir is cost infeasible (100K HKD for 1 hour of data). Even though we have budgets, it is hard to collect enough professional singing sounds.
[Diagram label: A typical commercial AI singer]
Non-supervision: Make the machine "imagine" the singing voices of any person by only listening to unlabeled recordings of any language and content.
[Diagram labels: Digital Twin; Unlabeled dataset; Singing new songs]

Slide 11: UNSUPERVISED VIRTUALIZATION OF HUMAN VOICE
Auto-encoder Framework
- Humans use existing skills for new tasks
- Four models were pretrained using public datasets for the specific tasks
[Diagram: question, pretrained model, input and output]
- How to Get Data? Speech Enhancement: mixture of singing speech, music and noise (from online sources and singing voice datasets) -> singing speech
- What Lyrics (content)? Speech Recognition: singing speech -> phonetic distribution (lyrics information)
- Who Sings? Speaker ID: singing speech -> identity embedding
- What Melody? Pitch Extraction: singing speech -> pitch contour

Slide 12: UNSUPERVISED VIRTUALIZATION OF HUMAN VOICE
Auto-encoder Framework
[Diagram labels: Overall Framework; Conformer-based Decoder; Building Blocks; Conformer Block; PostNet]
We design a conformer-based decoder to recover the time-frequency Mel spectrogram of the original speech

Slide 13: UNSUPERVISED VIRTUALIZATION OF HUMAN VOICE
Auto-encoder Framework
- The model learns the "uniqueness" of the person from the large-scale data
- It is also controllable over speech content and melody
(Virtualize the human's voice and make a digital twin)
[Diagram labels: Unsupervised Training on Decoder (control2spec), with Mel Spectrogram Computation and Loss; Singing Voice Vocoder; Unsupervised Training on Vocoder (spec2wav), with Mel Spectrogram Computation and Loss; Pred; GT]
[Audio samples: Yike speech sample; Generated Yike singing sample; Male singer generated sample; Male singer training sample; Pred; GT]

Slide 14: UNSUPERVISED VIRTUALIZATION OF HUMAN VOICE
An Audio Digital Twin of Maria Callas
AI generated sample (on wings of songs)
100th birthday this year

Slide 15: CREATING NEW HUMAN VOICE
Interpolation
Besides virtualizing existing human, we can create an arbitrary number of non-existing singers through timbre interpolation
[Diagram label: New Digital Human]

Slide 16: CREATING NEW HUMAN VOICE
Interpolation
Interpolation weights are guided by musicians to produce good choir sound.

Slide 17: CREATING AN AI CHOIR
8 Prototype Singers -> 320 Virtual Singers
Audios were collected from online or in our studio
[Audio samples: Individual singer; Choir]

Slide 18: OUTLINE
- AI Choir: Audio Generation
- CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model

Slide 19: OBJECTIVES OF SPEECH SYNTHESIS
- High Audio Quality
- Fast Inference Speed
- Beyond Speaking
While diffusion models can produce good audio quality, they require many steps of inference which makes the algorithm not practical.
We achieve one-step sampling, achieving practical diffusion-based speech synthesis.

Slide 20: CONSISTENCY MODELS
Diffusion model as SDE: [equation not recovered]
Diffusion model as probability flow ODE: [equation not recovered], where [symbol not recovered] is the "denoiser"

Slide 21: CONSISTENCY MODELS
Consistency Constraints
[Symbol not recovered] is the "denoiser"
If we impose [condition not recovered], then [expression not recovered] = 0

Slide 22: COMOSPEECH
Two stages: a) train a diffusion-based teacher b) consistency distillation

Slide 23: COMOSPEECH
Teacher Training: ODE
With the drift and diffusion coefficients as t and 1
We choose [equation not recovered]
Grad-TTS based framework

Slide 24: COMOSPEECH
Consistency Distillation
Consistency Constraints: already satisfied
We optimize [equation not recovered]
Estimated by teacher model

Slide 25: COMOSPEECH
We also support multi-step sampling

Slide 26: COMOSPEECH
Results

Slide 27: COMOSPEECH
Results
https://comospeech.github.io/
Sample text: "Now, as all books not primarily intended as picture-books consist principally of types composed to form letterpress"
- Grad-TTS (NFE: 50, RTF: 0.1694)
- FastSpeech 2 (NFE: 1, RTF: 0.0017)
- Teacher (NFE: 50, RTF: 0.1824)
- CoMoSpeech (NFE: 1, RTF: 0.0058)

Slide 28: COMOSPEECH
Results
https://comospeech.github.io/
Sample lyrics: 哦越過你劃的線我定了勇氣的終點
- FFTSinger (NFE: 1, RTF: 0.0032)
- DiffSinger (NFE: 60, RTF: 0.1338)
- Teacher-SVS (NFE: 50, RTF: 0.1282)
- CoMoSpeech-SVS (NFE: 1, RTF: 0.0048)

Slide 29: THANKS!
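
The CONSISTENCY MODELS and COMOSPEECH slides name the probability flow ODE, the consistency constraints, and one-step versus multi-step sampling, but their formulas are not legible in this text. The following is only a toy sketch of the general idea, not the CoMoSpeech model: a consistency function maps any point on an ODE trajectory back to the trajectory's start, so a sample can be drawn with a single evaluation. It assumes 1-D Gaussian data with standard deviation `SIGMA_DATA` and a noise level equal to t, and it uses the closed-form ODE solution where a real system would use a trained network. All names and constants are hypothetical.

```python
import math
import random
from typing import Optional, Sequence

SIGMA_DATA = 0.5  # assumed std of the toy data distribution N(0, SIGMA_DATA^2)
EPS = 0.002       # smallest noise level, where each trajectory starts
T_MAX = 80.0      # largest noise level, where sampling begins


def ode_point(x_eps: float, t: float) -> float:
    """Point at time t on the probability flow ODE trajectory starting at x_eps.

    For N(0, SIGMA_DATA^2) data with noise level t, the ODE is
    dx/dt = t * x / (SIGMA_DATA^2 + t^2); this is its closed-form solution.
    """
    return x_eps * math.sqrt((SIGMA_DATA**2 + t**2) / (SIGMA_DATA**2 + EPS**2))


def consistency_fn(x_t: float, t: float) -> float:
    """Map (x_t, t) to the start of its trajectory (exact for this toy case)."""
    return x_t * math.sqrt((SIGMA_DATA**2 + EPS**2) / (SIGMA_DATA**2 + t**2))


def sample(steps: Sequence[float] = (), rng: Optional[random.Random] = None) -> float:
    """One-step sampling if `steps` is empty, otherwise multi-step sampling.

    Multi-step sampling alternates re-noising to level t with another
    application of the consistency function, for decreasing levels t.
    """
    rng = rng or random.Random()
    x = consistency_fn(rng.gauss(0.0, T_MAX), T_MAX)  # one function evaluation
    for t in steps:
        x_t = x + math.sqrt(t**2 - EPS**2) * rng.gauss(0.0, 1.0)
        x = consistency_fn(x_t, t)
    return x
```

Calling `sample()` uses a single evaluation, which corresponds to NFE 1 in the results above, while `sample(steps=(10.0, 1.0))` spends two more evaluations in the manner of the multi-step option the slides mention.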


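This document discusses audio content generation, in particular building digitalized virtual humans and humanized AI. It notes that, as the metaverse and Web 3.0 are built, the ways humans and AI communicate, perceive and interact are being redefined. The author's team works on virtual singers with realistic voices and emotional expression, and on music composition with aesthetic feeling and co-creation between AI and humans. They created an AI choir in which, without large amounts of labeled data and only by listening to unlabeled recordings, the AI can virtualize any person's voice. They also propose CoMoSpeech, a method that uses a consistency model for one-step speech and singing voice synthesis, improving audio quality and speeding up inference. Key figures include the 320 virtual singers used for the choir, the AI choir's performance of "Pearl of the Orient" (東方之珠), and the results of several comparison experiments that show the method's advantages in audio quality and inference speed.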
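
The slides describe going from 8 prototype singers to 320 virtual singers by timbre interpolation, with the interpolation weights guided by musicians. A minimal sketch of that idea follows, assuming each prototype is represented by a fixed-length identity embedding and each virtual singer is a convex combination of those embeddings. The embedding size, the random prototype vectors, and the randomly drawn weights are placeholders and are not taken from the talk.

```python
import random
from typing import List

NUM_PROTOTYPES = 8   # from the slides
NUM_VIRTUAL = 320    # from the slides
EMBED_DIM = 64       # assumed; the slides do not state the embedding size


def random_weights(rng: random.Random, n: int) -> List[float]:
    """Draw n non-negative weights that sum to 1 (stand-in for musician-guided weights)."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def interpolate_timbre(prototypes: List[List[float]], weights: List[float]) -> List[float]:
    """Blend prototype identity embeddings into one virtual-singer embedding."""
    if len(weights) != len(prototypes):
        raise ValueError("need exactly one weight per prototype")
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    dim = len(prototypes[0])
    return [sum(w * p[i] for w, p in zip(weights, prototypes)) for i in range(dim)]


rng = random.Random(0)
# Stand-in embeddings; a real system would take these from a speaker-ID model.
prototypes = [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(NUM_PROTOTYPES)]
virtual_singers = [
    interpolate_timbre(prototypes, random_weights(rng, NUM_PROTOTYPES))
    for _ in range(NUM_VIRTUAL)
]
```

Because the weights form a convex combination, every virtual embedding stays inside the range spanned by the prototypes, which is one plausible reason to interpolate rather than sample embeddings freely.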

How can audio content generation be automated? How can AI support music composition and performance? How far are virtual singers from real singers?
