2020 Year-End Conference, Natural Language Processing: 6-1 (ByteDance AI Lab / Volctrans)
Language Distributions
- More than 5,000 different languages in the world.
- [Figure: world map of language and script distributions, e.g. the Latin alphabet.]

Machine Translation Has Increased International Trade by over 10%
- Equal to making the world more than 26% smaller.
- Brynjolfsson, Hui and Liu, "Does Machine Translation Affect International Trade? Evidence from a Large Digital Platform", Management Science, Vol. 65, No. 12.
- The paper provides causal evidence that language barriers significantly hinder trade and that AI has already begun to improve economic efficiency in at least one domain. https://doi.org/10.1287/mnsc.2019.3388

Global Footprints of ByteDance
- Serving 150+ countries.
- Supporting 75+ languages.
- 230+ offices and R&D centers worldwide.
Products
- Global: BaBe, Helo, TopBuzz, Vigo Video, NewsRepublic.
- China: Toutiao, Douyin, TuChong, Huoshan, Xigua Video, Dongchedi, Faceu.

Xiaomingbot: Multilingual Reporter
- Xiaomingbot-WorldCup, a multilingual robot news reporter built by ByteDance AI Lab.
- https://xiaomingbot.github.io/

Business Translation
- Chat, Doc, Email, Sheet and Meeting translation for around 150 countries.

Video Translation
- Example: subtitling "things you never knew you needed from amazon" into Chinese.

Video Translation for TikTok
- Example: translating a TikTok creator's video (mahoganylox).

Cross-Language Live Broadcasting
- Takashi Murakami's (村上隆) first livestream in China, on Douyin, Saturday, December 12 at 2 p.m.

Machine Translation: Conditional Sequence Generation
- An encoder-decoder model translates a source sentence X into a target sentence Y.
- Training maximizes the conditional generation probability: P(Y|X) = ∏_t p(y_t | y_<t, X; θ).
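To make the factorization concrete, here is a minimal, hypothetical toy encoder-decoder (not the Volctrans production model) that scores log P(Y|X) with teacher forcing; all names, sizes and data are made up for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical toy encoder-decoder, only to illustrate
# P(Y|X) = prod_t p(y_t | y_<t, X; theta); dimensions are made up.
class TinySeq2Seq(nn.Module):
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.src_emb = nn.Embedding(vocab, dim)
        self.tgt_emb = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def log_prob(self, src, tgt):
        # Encode the source sentence X.
        _, h = self.encoder(self.src_emb(src))
        # Teacher forcing: feed the gold prefix y_<t, predict y_t.
        dec_out, _ = self.decoder(self.tgt_emb(tgt[:, :-1]), h)
        logp = torch.log_softmax(self.out(dec_out), dim=-1)   # (B, T-1, V)
        gold = tgt[:, 1:].unsqueeze(-1)                        # shifted targets
        # sum_t log p(y_t | y_<t, X)
        return logp.gather(-1, gold).squeeze(-1).sum(dim=-1)

model = TinySeq2Seq()
src = torch.randint(0, 1000, (2, 7))   # fake source token ids
tgt = torch.randint(0, 1000, (2, 9))   # fake target ids (BOS ... EOS)
print(model.log_prob(src, tgt))        # log P(Y|X) for each sentence in the batch
```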
The Ultimate Quest of Machine Translation
- How to build a universal MT system that is capable of making the most of data from documents, monolingual data, and multilingual and multi-modal corpora?
- Example: 貓坐在墊子上 → Encoder → Decoder → "The cat sat on the mat".

Knowledge Transfer from Pre-training Methods
- Parallel corpora are limited for both rich-resource and low-resource translation directions.
- Monolingual data is (almost) unlimited.
- Multilingual data is available at large scale.

Goal of Pre-training
- Build a universal, ready-to-use model from which we can derive a model for any specific task with the least effort.
- [Diagram: a pre-trained language model such as BERT fine-tuned for NER, QA and NLI.]
1. Towards Making the Most of BERT for NMT
- Fine-tuning leads to performance degradation on the original task.
- The situation is more severe for NMT fine-tuning.
- The high-capacity baseline needs a lot of updating.
- Updating too much makes the model forget the universal knowledge it gained from pre-training.
Not Tuning BERT Too Much: a Concerted Training Framework
- Rate-scheduled Learning
- Dynamic Switch
- Asymptotic Distillation
- [Diagram: the BERT encoder feeds the NMT encoder-decoder through a dynamic switch and inter-attention, alongside the usual self-attention and FFN blocks.]
Not Tuning BERT Too Much: Rate-scheduled Learning
- Gradually increase the learning-rate scalar of the BERT parameters from 0 to 1.
- Then decrease the learning-rate scalar of the BERT parameters from 1 back to 0.
- Afterwards, keep the BERT parameters frozen.
- [Figure: the learning-rate scalar applied to the BERT parameters over training.]
- Rate-scheduled learning is effectively a trade-off between fine-tuning BERT and keeping BERT fixed.
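A minimal sketch of such a schedule, assuming a simple linear ramp and decay (the warm-up and decay step counts below are invented; the slide does not specify them):

```python
def bert_lr_scale(step, warmup_steps=10_000, decay_steps=10_000):
    """Learning-rate scalar for the BERT parameters (step counts are invented).

    Ramps 0 -> 1 during warm-up, decays 1 -> 0, then stays at 0 so the
    BERT parameters are frozen for the rest of training.  The NMT
    parameters keep their usual schedule; only BERT is rescaled.
    """
    if step < warmup_steps:
        return step / warmup_steps
    step -= warmup_steps
    if step < decay_steps:
        return 1.0 - step / decay_steps
    return 0.0

# Usage: keep the BERT weights in their own parameter group and rescale it.
# optimizer = torch.optim.Adam([
#     {"params": nmt_params, "lr": base_lr},
#     {"params": bert_params, "lr": base_lr},
# ])
# optimizer.param_groups[1]["lr"] = base_lr * bert_lr_scale(step)
```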
12、nd fixed BERT17GA Volctrans#page#Not TuningmuchBERTtooDynamic SwitchUse a gate to dynamicallydecide which part is moreCombinedFeatureimportantFFNIfois learned to O,itCSelf AttentionSelf Attentiondegrade to the NMT個BERTEncodermodelIfois learned to 1,itLeamned parameter is used to balanceBERT encoder
13、and NMT encodersimply act as Bert fine-tune approachDynamic Switch is more flexible than rate-scheduled learning rate18A Volctrans#page#Not TuningmuchBERTtooAsymptotic DistillationDistillationThe pre-trained BERT servesasa teacher network while theencoder of the NMT modelservesasa studentBERTNMTMini
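A sketch of the gate, assuming a sigmoid over the concatenated BERT and NMT states (the exact parameterization is not given on the slide, so this is only an illustration):

```python
import torch
import torch.nn as nn

class DynamicSwitch(nn.Module):
    """Gate that blends BERT-encoder states with NMT-encoder states.

    g -> 0 recovers the plain NMT encoder, g -> 1 behaves like relying on
    the BERT features; anything in between is a learned mixture.  The
    sigmoid-over-concatenation parameterization is an assumption.
    """
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h_bert, h_nmt):
        g = torch.sigmoid(self.gate(torch.cat([h_bert, h_nmt], dim=-1)))
        return g * h_bert + (1.0 - g) * h_nmt

switch = DynamicSwitch(dim=512)
h_bert = torch.randn(2, 10, 512)    # states from the pre-trained BERT
h_nmt = torch.randn(2, 10, 512)     # states from the NMT Transformer encoder
combined = switch(h_bert, h_nmt)    # fed to the decoder's cross-attention
```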
Not Tuning BERT Too Much: Asymptotic Distillation
- The pre-trained BERT serves as a teacher network, while the encoder of the NMT model serves as the student.
- Minimize the MSE loss between the hidden states of the NMT encoder and of BERT to retain the pre-trained information: L_KD = ||h_bert - h_nmt||^2.
- A hyper-parameter balances the preference between the pre-training distillation term and the NMT objective.
- No additional parameters are introduced.
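A minimal sketch of the combined objective; the weight alpha and the choice of which encoder layer is matched are assumptions for illustration:

```python
import torch.nn.functional as F

def ctnmt_loss(nmt_logits, gold_tgt, h_nmt, h_bert, alpha=0.5, pad_id=0):
    """Translation loss plus asymptotic distillation (sketch only).

    nmt_logits: decoder output logits            (B, T, V)
    gold_tgt:   gold target token ids            (B, T)
    h_nmt:      NMT encoder hidden states        (B, S, D)
    h_bert:     pre-trained BERT hidden states   (B, S, D), used as teacher
    alpha:      hypothetical weight balancing NMT vs. distillation
    """
    nmt = F.cross_entropy(
        nmt_logits.reshape(-1, nmt_logits.size(-1)),
        gold_tgt.reshape(-1),
        ignore_index=pad_id,
    )
    # MSE between student (NMT encoder) and detached teacher (BERT) states.
    distill = F.mse_loss(h_nmt, h_bert.detach())
    return nmt + alpha * distill
```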
Experiments: Results

System                       Architecture                        En-De   En-Fr   En-Zh
Existing systems
  Vaswani et al. (2017)      Transformer base                    27.3    38.1    -
  Vaswani et al. (2017)      Transformer big                     28.4    41.0    -
  Lample and Conneau (2019)  Transformer big + Fine-tuning       27.7    -       -
  Lample and Conneau (2019)  Transformer big + Frozen Feature    28.7    -       -
  Chen et al. (2018)         RNMT+ MultiCol                      28.7    41.7    -
Our NMT systems
  CTNMT                      Transformer (base)                  27.2    41.0    37.3
  CTNMT                      + Rate-scheduling                   29.7    41.6    38.4
  CTNMT                      + Dynamic Switch                    29.4    41.4    38.6
  CTNMT                      + Asymptotic Distillation           29.2    41.6    38.3
  CTNMT                      + ALL                               30.1    42.3    38.9

- The three strategies each work well independently on WMT14 En-De, En-Fr and WMT18 En-Zh.
- The CTNMT base model achieves even better results than the Transformer big model.
Analysis: Pre-train the Encoder or the Decoder?

Pre-trained component       En-De BLEU
  BERT  -> NMT encoder        29.2
  BERT  -> NMT decoder        26.1
  GPT-2 -> NMT encoder        27.7
  GPT-2 -> NMT decoder        27.4

- Pre-training the encoder (especially with BERT) is clearly more helpful than pre-training the decoder.
Analysis: Learning-rate Scalar γ for the BERT Parameters

Setting                                            En-De BLEU
  γ = 1    (fine-tuning BERT)                        27.7
  γ = 0    (frozen BERT)                             28.4
  γ = 0.01 (not tuning BERT too much)                29.0
  γ = γ_w  (rate-scheduled learning-rate strategy)   29.7
Analysis: Asymptotic Distillation vs. Fine-tuning
- [Figure: En-De BLEU over training steps (up to 3600K) for the Transformer baseline, fine-tuning, and asymptotic distillation (AD).]
- AD outperforms fine-tuning stably.
- The gap is enlarged as the number of fine-tuning steps increases.

Take-Home Messages
- Pre-training methods for NMT have great potential.
- CTNMT with only a BERT encoder obtains +3 BLEU on the WMT14 En-De benchmark.
- Introducing pre-training methods for the decoder is promising but still difficult.
2. Prune-Tune: Finding Sparse Structures for Specific NMT
- [Diagram: a) General Model --pruning--> b) Informative General Subnetwork --generate network and fine-tune--> c) Lottery Domain Subnetwork --adapting to more domains--> d) Multi-domain Model.]
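A rough sketch of this pipeline under stated assumptions: simple magnitude pruning selects the general sub-network, which is then frozen while only the freed parameters are updated on the target domain (the actual Prune-Tune criterion and schedule may differ):

```python
import torch

def magnitude_masks(model, sparsity=0.1):
    """Mark the smallest-magnitude weights of the general model as 'free'.

    Returns a boolean mask per parameter: True for weights that stay in the
    general sub-network, False for the freed weights that the new domain
    is allowed to tune.
    """
    masks = {}
    for name, p in model.named_parameters():
        k = int(p.numel() * sparsity)
        if k == 0:
            masks[name] = torch.ones_like(p, dtype=torch.bool)
            continue
        threshold = p.detach().abs().flatten().kthvalue(k).values
        masks[name] = p.detach().abs() > threshold
    return masks

def domain_tune_step(model, masks, loss, optimizer):
    """One Prune-Tune style update on a target-domain batch: the general
    sub-network stays frozen, only the freed parameters are updated."""
    optimizer.zero_grad()
    loss.backward()
    for name, p in model.named_parameters():
        if p.grad is not None:
            p.grad[masks[name]] = 0.0   # block updates to general weights
    optimizer.step()
```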
21、mative Generalc)Lottery Domaind)Multi-domainModelSubnetworkSubnetwork26CA Volctrans#page#RobustTuning33.032.532.031.5BLEU31.030.530.0-Fine-tuningPrune-tune0.129.5Prune-tune0.3Prunetune 0.529.0-100002000400060008000training stepsFigure 3: BLEU results of fine-tuning and our proposedPRUNE-TUNE with 10
22、%,30%,50%sparsity when adapting27toIWSLT14ENDE.CA Volctrans#page#Overcomedegradationfine-tuningModelMixed Domain ModelGeneral Domain Model+Fine-tuning(Luong and Manning 2015)Target Domain Model+EWC-regularized (Thompson et al.2019)+LayerFreeze(Thompsonetal.2018)10%100%30%+Adapter(BapnaandFirat 2019)
23、OursFigure 4:Fine-tuning with different domain specific cor-pus.PRUNE-TUNE improves the baseline at different scaleswhile full fine-tuning suffers from over fitting.28CA Volctrans#page#3.Multi-lingual Pre-training for NMTData scarcity for lowlzero resource languagesTransfer knowledge between languag
What Do We Want?
- A universal pre-trained model for NMT across all language pairs.

Why Is This Interesting?
- We want a universal pre-trained MT model for many languages, which adapts easily to downstream tasks.
- [Diagram: just as a pre-trained LM (BERT) is fine-tuned for NER, QA and NLI, a pre-trained MT model (PMT) is fine-tuned for En-De, Zh-Jp and Nl-Pt.]
Further Pursuit: Unified Multilingual Representation
- It is expected to bridge the distributional representations of different languages.
- Utterances in different languages with the same semantics will be mapped to adjacent embedding spaces.
- Example (En / Fr / De / Es / It): "I love you." / "Je t'aime." / "Ich liebe dich." / "Te quiero." / "Ti amo."

Existing Pre-training Methods
- BERT / GPT: partial pre-training.
- GPT pre-trains a decoder only (Radford et al. 2018); BERT pre-trains an encoder only (Devlin et al. 2018).

Existing Pre-training Methods
- We want a BERT-like pre-training model for NMT.
- MASS / BART: a discrepancy exists between the denoising training objective and the translation objective.
- [Diagram: MASS and BART encoder-decoder reconstruction objectives.]

mRASP Approach
- mRASP: multilingual Random Aligned Substitution Pre-training.
- A multilingual pre-training approach for NMT.
- RAS: a specially designed training method to align semantic embeddings.
- [Diagram: a single shared encoder-decoder pre-trained on many language pairs.]
Intuition
- Pre-training in essence obtains an average model over all language pairs.
- Fine-tuning further derives specialized models (e.g. En-Fr, Ro-Es).

Overview of mRASP: Pre-training
- A single shared encoder-decoder is trained on parallel data from many language pairs (e.g. the French sentence "J'adore chanter et danser").
- Each token carries token, position and language-indicator embeddings, so the model knows which language it is reading and producing.
- Random Aligned Substitution (RAS) is applied to the source side during pre-training.
- Goal: utterances with the same meaning, e.g. "I love you." / "Je t'aime." / "Ich liebe dich." / "Te quiero." / "Ti amo.", end up close in embedding space.
Overview of mRASP: Fine-tuning
- After pre-training, the same model is fine-tuned on the target pair (e.g. En-Fr), with the same token, position and language-indicator inputs (the slide example is an English-French pair about playing basketball).

Method: RAS Training
- Random Aligned Substitution (RAS): randomly replace a source word with its synonym in a different language (e.g. "singing" -> "chanter", "dancing" -> "danser").
- This draws the embedding spaces of different languages closer together.
- Pre-training objective: minimize -log P_θ(x^j | C(x^i)), averaged over all parallel pairs (x^i, x^j) in the pre-training corpus, where C(x^i) is the source sentence after random aligned substitution.
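A toy sketch of RAS on a tokenized sentence, assuming a tiny word-level bilingual dictionary and a fixed replacement probability; both are illustrative stand-ins rather than the actual mRASP resources:

```python
import random

# Tiny, made-up En->Fr synonym dictionary purely for illustration.
EN_FR = {"love": "aime", "sing": "chanter", "dance": "danser"}

def random_aligned_substitution(tokens, dictionary, prob=0.3, seed=None):
    """Randomly replace source words with their synonyms in another language,
    which pushes the embeddings of aligned words closer together."""
    rng = random.Random(seed)
    return [dictionary[t] if t in dictionary and rng.random() < prob else t
            for t in tokens]

# Language indicators are prepended as special tokens, as in the overview slide.
src = ["<en>", "i", "love", "to", "sing", "and", "dance"]
tgt = ["<fr>", "j'", "adore", "chanter", "et", "danser"]
corrupted_src = random_aligned_substitution(src, EN_FR, seed=0)
print(corrupted_src)
# The pre-training loss is then the ordinary translation objective
# -log P(tgt | corrupted_src), averaged over every language pair in the corpus.
```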
Top 3 Questions
- Does mRASP work for both low-resource and high-resource scenarios?
- Does mRASP work when encountering unseen languages?
- Why does mRASP work?

Datasets: Pre-training
- PC32 (Parallel Corpus 32): 32 English-centric language pairs, resulting in 64 directed translation pairs in total.
- Contains a total of 110.4M public parallel sentence pairs.
- [Figure: per-pair corpus sizes on a logarithmic scale.]
Datasets: Fine-tuning (Indigenous Corpus)
- Indigenous corpus: language pairs included in the pre-training phase.
- Extremely low / low resource (around 100K and 1M sentence pairs): He, Tr, etc.
- Medium resource (1M to 10M sentence pairs): Zh, Fr, etc.
Datasets: Fine-tuning (Exotic Corpus)
- Exotic corpus: language pairs not included in the pre-training phase.
- Exotic Pair (De-Fr), Exotic Source/Target (En-Mr), Exotic Full (Nl-Pt).

Category         Source language in pre-training?   Target language in pre-training?   Pair in pre-training?
  Exotic Pair      yes                                yes                                no
  Exotic Source    no                                 yes                                no
  Exotic Target    yes                                no                                 no
  Exotic Full      no                                 no                                 no
Training Details
- Transformer big.
- Training data: 200 million sentence pairs.
- 64 NVIDIA V100 cards.
- Ongoing: larger data size (2 billion pairs) on 256*8 NVIDIA V100 cards.

(Extremely) Low Resource
- [Figure: BLEU of Direct training vs. mRASP fine-tuning on extremely-low-resource directions (En<->Be, En<->My, En<->Af, En<->Eo) and low-resource directions (En<->He, En<->Tr, En<->Ro, En<->Cs).]
- mRASP improves over direct training across these directions.

Medium & Rich Resource (Popular Benchmarks)
- Rich-resource benchmarks can be further improved (En-Fr +1.1 BLEU).
- [Figure: BLEU on En2Fr (WMT2014) and En2De (WMT2016) for Direct, CTNMT, mBART, XLM, MASS, mBERT and mRASP.]
Does mRASP Boost MT Performance for Exotic Languages?
- mRASP generalizes to all exotic scenarios.
- Exotic pair Fr-Zh (20K): Direct 0.7 -> mRASP 25.8 BLEU.
- Exotic pair De-Fr (9M): Direct 23.5 -> mRASP 29.9 BLEU.
- Exotic full Nl-Pt (12K): from near zero to 10+ BLEU gains.
- mRASP takes a step towards a universal semantic representation: there is a clear connection between the semantic representations being bridged and the MT models being improved.

Released!
- mRASP code and models are available at: https:/
- We thank Liwei Wu, Huadong Chen, Qianqian Dong, Zewei Sun, Yang Wei and Weiying Ma for their useful suggestions.

4. Rethinking Document Machine Translation
Despite the progress:
- Extra parameters.
- Small and identically distributed datasets.
- Unfair comparison (dropout 0.1 for sentence-level NMT vs. 0.2 for document-level NMT).
Is the progress from the coherence improvement, or just from parameter regularization / overfitting?
Rethinking Document Machine Translation
- Naive Doc2Doc does not work due to the data-sparsity problem.
- Multi-Resolutional Training (MR): split documents into smaller segments of several granularities.
- Pre-training with large-scale sentence pairs.
- Data matters: the amount of data has limited the success of document-to-document MT.
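A sketch of how multi-resolutional training examples could be built; the specific granularities (whole document, halves, quarters, ..., single sentences) are an assumption for illustration:

```python
def multi_resolution_segments(doc_sents):
    """Split one document (a list of sentences) into segments at several
    resolutions: the whole document, halves, quarters, ..., single
    sentences.  Every segment becomes an ordinary training example."""
    segments = []
    k = len(doc_sents)
    while k >= 1:
        for start in range(0, len(doc_sents), k):
            chunk = doc_sents[start:start + k]
            if chunk:
                segments.append(" ".join(chunk))
        k //= 2
    return segments

doc = ["s1 .", "s2 .", "s3 .", "s4 ."]
print(multi_resolution_segments(doc))
# ['s1 . s2 . s3 . s4 .', 's1 . s2 .', 's3 . s4 .', 's1 .', 's2 .', 's3 .', 's4 .']
```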
All-in-one Model for Arbitrary Length
[Figure: BLEU as a function of sequence length for Sent2Sent, SR Doc2Doc and MR Doc2Doc.]
1. Sent2Sent misses long sequences.
2. SR Doc2Doc misses short sequences.
3. MR Doc2Doc can handle them all.
Rethinking Document Machine Translation: Results
[Table: s-BLEU and d-BLEU on ZH-EN TED and on EN-DE TED, News and Europarl for Sent2Sent-Weak, Sent2Sent-Strong, DocT (Zhang et al., 2018), HAN (Miculicich et al., 2018), SAN (Maruf et al., 2019), QCN (Yang et al., 2019), MCN (Zheng et al., 2020), SR Doc2Doc, MR Doc2Doc, and the "+" variants trained with additional sentence-level data (Sent2Sent+, SR Doc2Doc+, MR Doc2Doc+).]
- NMT is able to acquire the capacity of translating long-range context as long as it cooperates with some shorter segments as assistance.
Takeaways:
1. With the extra sentence-level data, the SR Doc2Doc models are activated.
2. MR Doc2Doc obtains the best results and widens the gap (even with out-of-domain sentences), proving the advantage of MR.
3. Doc2Doc beats Doc2Sent because it can make use of more data, such as sentence pairs.
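For reference on the two metrics in the table above: s-BLEU scores translations sentence by sentence, while d-BLEU concatenates each document's sentences before scoring. A small sketch with sacrebleu (the toy documents here are hypothetical):

```python
import sacrebleu

# Hypothetical toy data: two documents, each a list of (hypothesis, reference).
docs = [
    [("the cat sat on the mat", "the cat sat on the mat"),
     ("it was happy", "it was very happy")],
    [("he plays basketball", "he plays basketball")],
]

# s-BLEU: corpus BLEU over individual sentences.
hyps = [h for doc in docs for h, _ in doc]
refs = [r for doc in docs for _, r in doc]
s_bleu = sacrebleu.corpus_bleu(hyps, [refs])

# d-BLEU: concatenate every document's sentences before scoring.
doc_hyps = [" ".join(h for h, _ in doc) for doc in docs]
doc_refs = [" ".join(r for _, r in doc) for doc in docs]
d_bleu = sacrebleu.corpus_bleu(doc_hyps, [doc_refs])

print(f"s-BLEU {s_bleu.score:.2f}  d-BLEU {d_bleu.score:.2f}")
```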
Doc2Doc Does Improve Discourse Coherence (En-Ru)
[Table: scores of Sent2Sent, Zheng et al. (2020) and MR Doc2Doc on an English-Russian contrastive test set covering deixis, lexical consistency, ellipsis (inflection) and ellipsis (verb phrase).]
- MR Doc2Doc improves over the sentence-level baseline on these discourse phenomena.

Doc2Doc Does Improve Discourse Coherence (Zh-En)
[Table: d-BLEU, discourse-related metrics (PT, TC, CP) and human evaluation (Man) for Sent2Sent, SR Doc2Doc, MR Doc2Doc and their "+" variants.]
- We propose a Zh-En test set and several metrics to evaluate discourse coherence.
- The results are similar to the En-Ru findings and match human evaluation.
The Ultimate Quest of Machine Translation
- Knowledge transfer from texts: monolingual -> bilingual -> multilingual; sentence -> document.
- Knowledge transfer from multimodal data: speech, visually guided translation.
- Simple, fast and accurate.

LightSeq
- A light, fast and accurate inference framework.
- [Table: supported models (Transformer, GPT, VAE) and decoding methods (beam search, diverse beam search, sampling), compared with Faster Transformer and Turbo Transformers; LightSeq covers all of them.]

Thanks
53、lingual-MultilingualM Sentence-Documento Knowledge transfer from Multimodal Bilingual-MultilingualM Sentence-Documento Knowledge transfer from MultimodalM Speech Visual guided Simple, Fast and Accurate人|條79CA Volctrans#page#Lightseq模型解碼方法功能支持DiverseVAEGPTBeam SearchSamplingTransformerBeam SearchXXFaster TransformerXTurbo TransformersLightSeqLight Fast and Accurate inference framework!80CA Volctrans#page#Thanks81CA Volctrans