FastSpeech: Algorithm Design and Optimization for Efficient Speech Synthesis


FastSpeech: Algorithm and Optimization for State-of-the-art Text to Speech
Xu Tan (Microsoft Research Asia) & Dabi Ahn (NVIDIA)

Outline
- The algorithm of FastSpeech, by Xu Tan, Microsoft Research Asia
- The optimization of FastSpeech, by Dabi Ahn, NVIDIA

About text to speech systems
A TTS pipeline runs: Text -> Text Frontend -> Phoneme -> Acoustic Model -> Mel-spectrogram -> Vocoder -> Speech. FastSpeech is the acoustic model, mapping phonemes to mel-spectrograms.

About FastSpeech
A fast, robust, controllable, high-quality and end-to-end text to speech (TTS) system:
- FastSpeech: Fast, Robust and Controllable Text to Speech, NeurIPS 2019 [1]
- FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, ICLR 2021 submission [2]
Widely supported by the community, and deployed in the Microsoft Azure TTS service to support all its languages.
[1] https://proceedings.neurips.cc/paper/2019/file/f63f65b503e22cb970527f23c9ad7db1-Paper.pdf
[2] https://speechresearch.github.io/fastspeech2/

Previous TTS models have these issues:
- Slow inference speed: autoregressive generation, whose inference time grows with sequence length (for 5s of speech, the mel length is about 500).
- Not robust: encoder-decoder attention is not accurate, causing word repeating and skipping.
- Lack of controllability: no control information is taken as input, and autoregressive generation cannot explicitly control the duration.

Our solution: FastSpeech
Key designs:
- Generate the mel-spectrogram in parallel (for speedup).
- Remove the attention mechanism between text and speech (for robustness).
- A variance adaptor introduces duration, pitch and energy (for controllability).
FastSpeech has the following advantages:
- Fast: 270x speedup on mel-spectrogram generation, 38x speedup on voice generation!
- Robust: no bad cases of word skipping and repeating.
- Controllable: can control voice speed and prosody.
- Voice quality: on par with or better than SOTA models.

Our solution: FastSpeech
- Feed-forward transformer: generates the mel-spectrogram in parallel in both training and inference (speedup).
- Removes the attention mechanism between the phoneme encoder and the mel-spectrogram decoder (robustness).
- Variance adaptor: provides duration to bridge the length mismatch between the phoneme and mel sequences, and provides pitch and energy as input to ease the one-to-many mapping problem (controllability, voice quality): one text maps to multiple speech variations (duration, pitch, sound volume, speaker style, emotion, etc.).
Architecture, bottom to top: Phoneme -> Phoneme Embedding -> Phoneme Encoder -> Variance Adaptor -> Mel Decoder -> Linear Layer, with positional encoding added to the encoder and decoder inputs [1].

FastSpeech: variance adaptor
Add more variance information (duration, pitch, energy) as input:
- Duration comes from text-audio forced alignment.
- Pitch is extracted from the audio.
- Energy comes from the L2 norm of the mel-spectrogram.
Training uses ground-truth values; inference uses the values predicted by the duration/pitch/energy predictors.

FastSpeech: duration predictor
The duration predictor outputs how many mel frames each phoneme covers, and the length regulator expands the phoneme hidden sequence accordingly; scaling the durations by a factor alpha controls the voice speed. For the phonemes "s p iy ch" with durations D = [2, 2, 3, 1]:
- 1.0x speed (alpha = 1): D = [2, 2, 3, 1] -> "s s p p iy iy iy ch"
- 2.0x speed (alpha = 0.5): D = [1, 1, 1.5, 0.5], rounded to [1, 1, 2, 1] -> "s p iy iy ch"
- 0.5x speed (alpha = 2): D = [4, 4, 6, 2] -> "s s s s p p p p iy iy iy iy iy iy ch ch"
A code sketch of the length regulator follows.
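In code, the length regulator is just a per-phoneme repeat along the time axis. Below is a minimal PyTorch sketch, assuming integer frame counts after rounding; the names (length_regulate, alpha) are illustrative, not from the official implementation.

    import torch

    def length_regulate(hidden, durations, alpha=1.0):
        # hidden:    (num_phonemes, hidden_dim) phoneme encoder outputs
        # durations: (num_phonemes,) predicted mel frames per phoneme
        # alpha:     speed control; alpha < 1 speeds speech up, alpha > 1 slows it down
        scaled = torch.round(durations.float() * alpha).long().clamp(min=0)
        # copy each phoneme vector scaled[i] times along the time axis
        return torch.repeat_interleave(hidden, scaled, dim=0)

    # The slide's example: "s p iy ch" with D = [2, 2, 3, 1]
    hidden = torch.randn(4, 256)
    d = torch.tensor([2, 2, 3, 1])
    print(length_regulate(hidden, d, alpha=1.0).shape)  # torch.Size([8, 256])
    print(length_regulate(hidden, d, alpha=2.0).shape)  # torch.Size([16, 256]), 0.5x speed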

FastSpeech: overview
- The overall model: Phoneme Encoder -> Variance Adaptor (Duration/Pitch/Energy Predictors + Length Regulator) -> Mel Decoder.
- The phoneme encoder and mel decoder are stacks of FFT (feed-forward transformer) blocks: Multi-Head Attention -> Add & Norm -> Conv1D + ReLU -> Add & Norm.
- Each duration/pitch/energy predictor is: Conv1D + ReLU -> LayerNorm + Dropout -> Conv1D + ReLU -> LayerNorm + Dropout -> Linear Layer. A sketch of this predictor stack follows.
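A minimal sketch of that predictor stack, assuming illustrative hyperparameters (256 channels, kernel size 3, dropout 0.5); the real model's values may differ.

    import torch
    from torch import nn

    class VariancePredictor(nn.Module):
        # Conv1D + ReLU -> LN + Dropout -> Conv1D + ReLU -> LN + Dropout -> Linear
        def __init__(self, hidden=256, kernel=3, dropout=0.5):
            super().__init__()
            self.conv1 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
            self.conv2 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
            self.ln1 = nn.LayerNorm(hidden)
            self.ln2 = nn.LayerNorm(hidden)
            self.dropout = nn.Dropout(dropout)
            self.linear = nn.Linear(hidden, 1)  # one scalar per phoneme, e.g. log-duration

        def forward(self, x):
            # x: (batch, seq_len, hidden); Conv1d expects (batch, channels, seq_len)
            y = self.conv1(x.transpose(1, 2)).relu().transpose(1, 2)
            y = self.dropout(self.ln1(y))
            y = self.conv2(y.transpose(1, 2)).relu().transpose(1, 2)
            y = self.dropout(self.ln2(y))
            return self.linear(y).squeeze(-1)  # (batch, seq_len)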

FastSpeech: end-to-end text-to-wave generation
Challenges:
- The waveform has more variance information (e.g., phase), which is hard to predict.
- It is hard to train on audio clips of full sentences.
Solutions:
- Adversarial training.
- An auxiliary mel decoder.
In this variant, a waveform decoder sits on top of the variance adaptor, in parallel with the mel decoder.

Training
- Mean absolute error (L1) loss: L_mel = ||y - f(x, d, p, e)||_1, where (x, y) is a text and speech pair and d/p/e represent duration, pitch and energy.
- Wav loss (adversarial, on the waveform decoder output): L_wav = (1 - D(G(x, d, p, e)))^2, where G generates the waveform and D is the discriminator.
- Duration/pitch/energy losses, with duration in the log scale and pitch under log-scale normalization: L_d = (d - f_d(x))^2, L_p = (p - f_p(x))^2, L_e = (e - f_e(x))^2.
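A minimal sketch of these loss terms in PyTorch; tensor shapes and the +1 inside the log are illustrative assumptions, and the adversarial term is omitted.

    import torch
    import torch.nn.functional as F

    def fastspeech_losses(pred_mel, mel, pred_log_dur, dur, pred_pitch, pitch, pred_energy, energy):
        l_mel = F.l1_loss(pred_mel, mel)                              # L1 on mel-spectrogram
        l_dur = F.mse_loss(pred_log_dur, torch.log(dur.float() + 1))  # duration in log scale
        l_pitch = F.mse_loss(pred_pitch, pitch)                       # normalized pitch
        l_energy = F.mse_loss(pred_energy, energy)
        return l_mel + l_dur + l_pitch + l_energy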

Inference speedup
Method                    | Latency (s)   | Speedup
Transformer TTS (Mel)     | 6.735 ± 3.969 | /
FastSpeech (Mel)          | 0.025 ± 0.005 | 269.40x
270x speedup for mel-spectrogram generation! 38x speedup for voice synthesis!

Robustness
Method          | Repeats | Skips | Error Sentences | Error Rate
Transformer TTS | 7       | 15    | 17              | 34%
FastSpeech      | 0       | 0     | 0               | 0%
Tested on 50 extremely hard sentences from a TTS product scenario. FastSpeech has no repeating, skipping or error sentences. Example hard sentences: "You can call me directly at 425 703 7344 or my cell 425 444 7474 or send a meeting request with all the appropriate information." / "Http0xx, Http1xx, Http2xx, Http3xx."

Speech synthesis quality
Method                             | MOS
GT                                 | 4.30 ± 0.07
GT (Mel + Vocoder)                 | 3.92 ± 0.08
Tacotron 2 (Shen et al., 2018)     | 3.70 ± 0.08
Transformer TTS (Li et al., 2019)  | 3.72 ± 0.07
FastSpeech [1]                     | 3.83 ± 0.08
FastSpeech achieves better voice quality than Tacotron 2 and Transformer TTS, and is close to ground-truth recordings.

[1] Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T.-Y. FastSpeech: Fast, Robust and Controllable Text to Speech. NeurIPS 2019. https://speechresearch.github.io/fastspeech/
[2] Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T.-Y. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. arXiv 2020. https://speechresearch.github.io/fastspeech2/

Product deployment of FastSpeech
- FastSpeech is extremely fast and high-quality, with 270x speedup on mel-spectrogram generation and 38x speedup on audio generation!
- FastSpeech is widely supported by the community: ESPnet, NVIDIA.
- FastSpeech is deployed on the Microsoft Azure Speech Service (TTS) for 54 languages/locales:
Arabic (ar-EG, ar-SA), Bulgarian (bg-BG), Catalan (ca-ES), Chinese (zh-CN, zh-HK, zh-TW), Croatian (hr-HR), Czech (cs-CZ), Danish (da-DK), Dutch (nl-NL), English (en-US, en-UK, en-AU, en-CA, en-IN, en-IE), Estonian (et-EE), Finnish (fi-FI), French (fr-FR, fr-CA, fr-CH), German (de-DE, de-AT, de-CH), Greek (el-GR), Hebrew (he-IL), Hindi (hi-IN), Hungarian (hu-HU), Indonesian (id-ID), Irish (ga-IE), Italian (it-IT), Japanese (ja-JP), Korean (ko-KR), Latvian (lv-LV), Lithuanian (lt-LT), Malay (ms-MY), Maltese (mt-MT), Norwegian (nb-NO), Polish (pl-PL), Portuguese (pt-BR, pt-PT), Romanian (ro-RO), Russian (ru-RU), Slovak (sk-SK), Slovenian (sl-SI), Spanish (es-ES, es-MX), Swedish (sv-SE), Tamil (ta-IN), Telugu (te-IN), Thai (th-TH), Turkish (tr-TR), Vietnamese (vi-VN)
https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/

Thanks!
Xu Tan, Senior Researcher, Microsoft Research Asia
https://www.microsoft.com/en-us/research/people/xuta/

THE OPTIMIZATION OF FASTSPEECH TRAINING

WHY ACCELERATE TRAINING
Training of a modern deep neural TTS model still takes on the order of days or weeks.
- For researchers: the process of designing models and the experiment cycles could be much faster.
- For TTS service providers: TTS services could be deployed more frequently (multiple speakers, multiple languages).

STEP 0: MINIMIZE CPU TIME (PyTorch 1.7+)
- Load data asynchronously to model training: set pin_memory to True in the DataLoader, and set non_blocking to True in .cuda().
- Parallelize data loading: use a DataLoader with many workers, e.g., set num_workers to 16.
With synchronous loading, each iteration alternates between DATA/CPU work and GPU training; with asynchronous, parallel loading, the DATA/CPU operations are hidden behind GPU training. A minimal sketch follows.
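A minimal sketch of these settings; the dataset here is a dummy stand-in and all sizes are illustrative.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # dummy stand-in for the real dataset: 1000 (phoneme, mel) pairs
    train_set = TensorDataset(torch.randint(0, 100, (1000, 128)),
                              torch.randn(1000, 500, 80))

    loader = DataLoader(
        train_set,
        batch_size=48,
        num_workers=16,    # parallelize data loading across worker processes
        pin_memory=True,   # page-locked host memory enables async host-to-device copies
    )

    for phonemes, mels in loader:
        # non_blocking=True returns immediately, so the copy overlaps with GPU work
        phonemes = phonemes.cuda(non_blocking=True)
        mels = mels.cuda(non_blocking=True)
        # ... forward/backward/step as usual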

STEP 0: MINIMIZE CPU TIME
Preprocess Input Data
- Preprocess the target mel/pitch/energy: for example, the target mel-spectrogram is obtained from the waveform by a Fourier transform.
- Preprocess the target duration: the target duration (e.g., [2, 2, 3, 1]) is obtained by aligning text and waveform with a forced aligner; the duration predictor is then trained against it with an MSE loss.
Doing this once, offline, keeps the expensive signal processing out of the training loop; a sketch is below.
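A minimal sketch of a dataset that only loads precomputed features, assuming mel/pitch/energy/duration were extracted offline and saved as per-utterance .npy files; the class name and file layout are illustrative.

    import numpy as np
    from torch.utils.data import Dataset

    class PreprocessedTTSDataset(Dataset):
        def __init__(self, utt_ids, feature_dir):
            self.utt_ids = utt_ids
            self.feature_dir = feature_dir

        def __len__(self):
            return len(self.utt_ids)

        def __getitem__(self, i):
            uid = self.utt_ids[i]
            # no STFT or forced alignment at training time: just load arrays from disk
            return {key: np.load(f"{self.feature_dir}/{uid}_{key}.npy")
                    for key in ("phoneme", "mel", "pitch", "energy", "duration")}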

STEP 0: MINIMIZE CPU TIME
Baseline Performance
- NVIDIA 1x V100 32GB (DGX-1); batch size 48 (Mel) and 6 (Wave) on the LJSpeech 1.1 dataset.
- Iterations #5 to #25 (20 steps). Time elapsed: 14.2s (Mel), 12.5s (Wave); per iteration: 710ms (Mel), 626ms (Wave).
- Throughput: 38,248 frames/s and 1,387,474 samples/s on average.

Model              | Batch Size | Time Elapsed (s) | Time Elapsed/iter (s) | Speed-up | Throughput
Mel Decoder        | 48         | 14.2037          | 0.710185              | 1.00x    | 38,248 frames/s
Wave Decoder (E2E) | 6          | 12.5294          | 0.62647               | 1.00x    | 1,387,474 samples/s

STEP 1: USE MIXED PRECISION
Mixed Precision Training
The idea: train the model in multiple precisions.
- Half precision (FP16) everywhere else, for speed and scale.
- Full precision (FP32) where needed, to maintain model accuracy.
Benefits:
- Accelerates math-bound operations: Tensor Cores are 8x faster than FP32.
- Accelerates memory-bound operations: FP16 halves memory traffic compared to FP32.
- Reduces memory consumption: FP16 halves the size of activation and gradient tensors.

STEP 1: USE MIXED PRECISION
Assume the forward and backward passes of math-bound ops (FC, Conv, BMM) account for 70% of the total GPU time; these can run up to 8x faster with Tensor Cores. By Amdahl's law, the total GPU time can theoretically shrink to 0.3 + 0.7/8 = 0.3875 of the original, i.e., up to a 2.58x speed-up.

STEP 1: USE MIXED PRECISION
Automatic Mixed Precision (AMP)
AMP enables mixed precision training without loss of accuracy, and it is as simple as two lines of code:

    import torch
    from apex import amp

    # amp.initialize allows AMP to perform automatic casting
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    for data, label in data_iter:
        out = model(data)
        loss = criterion(out, label)
        optimizer.zero_grad()
        # amp.scale_loss replaces loss.backward()
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()

STEP 1: USE MIXED PRECISION
Satisfy Shape Constraints
To enable Tensor Cores, satisfy these constraints:
- For linear layers: input size, output size and batch size should be multiples of 8.
- For convolutions: input and output channel counts should be multiples of 8.
Fortunately, the hyperparameters of the FastSpeech model already satisfy the shape constraints! A small illustration follows.
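A small illustration of the constraint with example sizes; pad_to_multiple_of_8 is a hypothetical helper, not part of any library.

    from torch import nn

    def pad_to_multiple_of_8(n):
        # round n up to the next multiple of 8
        return (n + 7) // 8 * 8

    hidden = pad_to_multiple_of_8(256)                # 256 is already 8-aligned
    fc = nn.Linear(hidden, pad_to_multiple_of_8(80))  # 80 mel bins are already 8-aligned
    conv = nn.Conv1d(hidden, hidden, kernel_size=9, padding=4)  # channel counts 8-aligned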

STEP 1: USE MIXED PRECISION
Performance
With similar batch size, dataset and iteration count:
- Time elapsed: 7.93s (Mel), 11.4s (Wave); per iteration: 397ms (Mel), 570ms (Wave).
- Speed-up of 1.79x (Mel), 1.10x (Wave).
- Throughput: 68,479 frames/s (Mel), 1,524,624 samples/s (Wave) on average.

Model              | Precision | Batch Size | Time Elapsed (s) | Time Elapsed/iter (s) | Speed-up | Throughput
Mel Decoder        | FP32      | 48         | 14.2037          | 0.710185              | 1.00x    | 38,248 frames/s
Mel Decoder        | MP        | 48         | 7.9332           | 0.39666               | 1.79x    | 68,479 frames/s
Wave Decoder (E2E) | FP32      | 6          | 12.5294          | 0.62647               | 1.00x    | 1,387,474 samples/s
Wave Decoder (E2E) | MP        | 6          | 11.4023          | 0.570115              | 1.10x    | 1,524,624 samples/s

STEP 2: MAXIMIZE COMPUTE UTILIZATION
Increase Batch Size
- Use a batch size at which all Streaming Multiprocessors (SMs) are fully utilized.
- With batch size 48 (Mel) and 6 (Wave), the math-bound operations do NOT fully utilize all SMs.
- With batch size 144 (Mel) and 18 (Wave), all SMs are now fully utilized, so you get further improved speed-ups and throughputs.
With mixed precision: speed-up of 1.89x (Mel) and 1.61x (Wave); throughput of 99,476 frames/s (Mel) and 3,384,163 samples/s (Wave) on average.

Model              | Batch Size | Time Elapsed (s) | Time Elapsed/iter (s) | Speed-up w/ MP | Throughput
Mel Decoder        | 48         | 7.9332           | 0.39666               | 1.79x          | 68,479 frames/s
Mel Decoder        | 144        | 16.3835          | 0.819175              | 1.89x          | 99,476 frames/s
Wave Decoder (E2E) | 6          | 11.4023          | 0.570115              | 1.10x          | 1,524,624 samples/s
Wave Decoder (E2E) | 18         | 15.4108          | 0.77054               | 1.61x          | 3,384,163 samples/s

STEP 2: MAXIMIZE COMPUTE UTILIZATION
Concatenate Matrix Multiplies
- Concatenate matrix multiplies that share an input, e.g., the query/key/value projection matrices (W_Q, W_K, W_V) in the Multi-Head Attention of each FFT block.
- Matrix multiplies of larger matrices utilize the SMs more.
- It reduces the number of CUDA kernel launches for these projections to a third.
- This gives a 4-5% speed-up.
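A minimal sketch of the idea: replace three separate Q/K/V projections (three GEMM kernel launches) with one concatenated projection (one launch). Sizes are illustrative.

    import torch
    from torch import nn

    class FusedQKVProjection(nn.Module):
        def __init__(self, d_model=256):
            super().__init__()
            # W_Q, W_K and W_V concatenated into a single weight matrix
            self.qkv = nn.Linear(d_model, 3 * d_model)

        def forward(self, x):
            # x: (batch, seq_len, d_model); one larger matmul keeps more SMs busy
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            return q, k, v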

STEP 3: FUSE OPERATIONS
Get a 3x speed-up by fusing a series of element-wise ops into one operation.
Fused Layer Norm
- Layer norm (mean/std -> normalize -> scale and shift) is used in every FFT block and in the Duration/Pitch/Energy Predictors.
- Simply use apex.normalization.FusedLayerNorm from Apex.
Fused Adam Optimizer
- Adam is used during both Mel decoder and Wave decoder training.
- Use apex.optimizers.FusedAdam from Apex.
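A minimal sketch of the two drop-in replacements, assuming Apex is installed and a CUDA device is available; the toy model is illustrative.

    import torch
    from apex.normalization import FusedLayerNorm
    from apex.optimizers import FusedAdam

    model = torch.nn.Sequential(torch.nn.Linear(256, 256)).cuda()

    norm = FusedLayerNorm(256).cuda()           # in place of torch.nn.LayerNorm(256)
    y = norm(torch.randn(8, 128, 256).cuda())   # fused mean/std, normalize, scale-and-shift

    optimizer = FusedAdam(model.parameters(), lr=1e-3)  # in place of torch.optim.Adam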

STEP 3: FUSE OPERATIONS
Performance
With mixed precision and batch size 144 (Mel) and 18 (Wave):
- The speed-up is now 2.04x (Mel) and 1.71x (Wave).
- Throughput: 107,903 frames/s (Mel) and 3,583,735 samples/s (Wave) on average.

Model              | Fused | Time Elapsed (s) | Time Elapsed/iter (s) | Speed-up w/ MP | Throughput
Mel Decoder        | No    | 16.3835          | 0.819175              | 1.89x          | 99,476 frames/s
Mel Decoder        | Yes   | 15.1041          | 0.755205              | 2.04x          | 107,903 frames/s
Wave Decoder (E2E) | No    | 15.4108          | 0.77054               | 1.61x          | 3,384,163 samples/s
Wave Decoder (E2E) | Yes   | 14.5526          | 0.72763               | 1.71x          | 3,583,735 samples/s

SUMMARY
Enjoy around a 2x speed-up during training by:
- Minimizing CPU time
- Using mixed precision
- Maximizing compute utilization
- Fusing operations
All without loss of accuracy, and with minimal code updates thanks to the NVIDIA Apex library.

Special thanks to Huihan Yao (姚惠涵), an intern in the Developer Technology (DevTech) team at NVIDIA, for her great support.

Thanks!
Dabi Ahn, Senior DL Engineer, NVIDIA
