Shuicheng Yan: Fellow of the Academy of Engineering Singapore, IEEE Fellow, Honorary Advisor of Kunlun Tech. Professor Yan currently serves as Honorary Advisor of Kunlun Tech; he previously served as Director of the 2050 Research Institute, Chief Scientist of Kunlun Tech and Skywork AI, and Chief Scientist of Sea. He is a Fellow of the Academy of Engineering Singapore and a Fellow of AAAI, ACM, IEEE, and IAPR. His research covers computer vision, machine learning, and multimedia analysis. To date he has published more than 800 papers in top international journals and conferences, with an H-index above 140. His team has won more than ten awards in the two core competitions Pascal VOC and ImageNet (ILSVRC), and has received more than ten Best Paper and Best Student Paper awards, including a grand slam at ACM Multimedia, the top multimedia conference: three Best Paper Awards, two Best Student Paper Awards, and a Best Demo Award.

Talk: Foundations of Foundation Models. Shuicheng YAN, Skywork AI.

Systems and Algorithms for Sustainable and Real-time AIGC. Training Scaling Law vs. Inference Scaling Law. Almighty Intelligence vs. Affordable Intelligence: "Scale is Everything" (暴力出奇跡, brute force works miracles) vs. "Tiny yet Powerful" (四兩撥千斤, a small force deflects a great weight).
Skywork products: Tiangong APP (天工APP), Club Coala, Sky Music, Short Drama, Sky Agent, AI Story. Six large models from 2050 Research: Large Language Model, Image/Multimodal LLM, Large Music Model, Large Audio Model, Large Video Model, Large 3D Model. Systems, Models, Algorithms: Research as Products (研究即產品).

AI Systems: AI = Almighty Intelligence. Advanced Search, Hierarchical Reasoning, Finance Expert, Research Expert. Melodio, a music streaming platform, and Mureka, a commercial music creation platform. User-generated music example: "Chopped Pepper Fish Head" (剁椒魚頭, Cantonese version).

Short Drama Co-Pilot pipeline: 1) Script Creation, 2) Character Gen., 3) Scene Generation, 4) Storyboard Gen., 5) Actor-show Gen., 6) Video-audio Gen.

AI Algorithms: AI = Affordable Intelligence.
Three sources of error: Approximation Error (network architecture), Estimation Error (data/loss), and Optimization Error (parameter optimizer). Four functions are involved:
- the ground-truth function for classification, generation, or representation (unknown);
- the optimal function in the hypothesis space (unknown; bio-inspired);
- the optimal empirical function in the hypothesis space, based on labeled and/or unlabeled data (unknown);
- the computed empirical function in the hypothesis space, based on the data (to be computed and output; theory-inspired).
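These four functions fit the classical excess-risk decomposition; the notation below is my own shorthand for the four functions listed above, not taken from the slides.

```latex
% R(\cdot) = expected risk; f^* = ground-truth function; f_{\mathcal H} = optimal
% function in the hypothesis space \mathcal H; \hat f_{\mathcal H} = optimal
% empirical function given the data; \tilde f = the computed empirical function.
R(\tilde f) - R(f^*)
  = \underbrace{R(f_{\mathcal H}) - R(f^*)}_{\text{approximation error (architecture)}}
  + \underbrace{R(\hat f_{\mathcal H}) - R(f_{\mathcal H})}_{\text{estimation error (data/loss)}}
  + \underbrace{R(\tilde f) - R(\hat f_{\mathcal H})}_{\text{optimization error (optimizer)}}
```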
Training: about US$65M per run; a 20-50% speedup in training or inference means saving the earth! On-device Intelligence.

Approximation Error: Model Architecture

NIN (Network in Network) brought the 1x1 convolution to the community: a full-neuron-function, purely convolutional design built from 1x1 convolution layers. Intuitively it overfits less globally and is more discriminative locally. [Figure: CNN vs. NIN.] The inner module can be any small network, e.g. an MLP, or another module for a particular target, but it must be SMALL. The parameter count is reduced to 1/10 or less, with a smaller receptive field (3x3 Conv vs. 1x1 Conv).

The 1x1 convolution is now a standard component of almost all later popular deep learning models, e.g. the GPT Transformer, GoogLeNet (22 layers, ILSVRC 2014), and ResNet (152 layers, ILSVRC 2015). [Figure: the Inception module concatenates 1x1, 3x3, and 5x5 convolutions and 3x3 max pooling over the previous layer, with additional 1x1 convolutions; in effect a "Network in Network in Network".]
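As a concrete illustration, here is a minimal PyTorch sketch of an NIN-style block in my own words, not code from the talk; the layer sizes are arbitrary. The 1x1 convolutions act as a tiny per-position MLP across channels.

```python
import torch
import torch.nn as nn

# Minimal sketch of an NIN-style "mlpconv" block: one spatial convolution
# followed by 1x1 convolutions, which behave like a small per-pixel MLP
# over the channel dimension. Channel sizes are illustrative only.
class MlpConvBlock(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=1),  # 1x1 conv = per-pixel fully connected layer
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1),  # second 1x1 conv deepens the per-pixel MLP
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

x = torch.randn(2, 3, 32, 32)        # batch of 2 RGB images
y = MlpConvBlock(3, 96, 96)(x)       # same spatial size, richer per-pixel features
print(y.shape)                       # torch.Size([2, 96, 32, 32])
```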
Assigned topic: how to achieve a 2x speedup for MoE?

Vanilla MoE, only FFN experts: a vanilla MoE layer typically consists of multiple structurally identical experts, where each expert is a standard feed-forward network (FFN). Fixed mixing mechanism: a vanilla MoE layer simply merges the outputs of the Top-K feed-forward networks.

MoE++ is a heterogeneous MoE framework that integrates both vanilla FFN experts and zero-computation experts.
It introduces three kinds of zero-computation experts alongside the FFN experts: zero experts, copy experts, and constant experts. More efficient and more effective: MoE++ expands the original single mixing mechanism to eight mixing strategies, and needs roughly half the computation load when the numbers of FFN experts and zero-computation experts are equal.
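A minimal sketch of the expert types follows. This is my own reading of the names, not the official MoE++ code: I assume a zero expert discards its token and outputs zeros, a copy expert returns the token unchanged, and a constant expert replaces the token with a learned constant vector; the exact formulation in the paper may differ.

```python
import torch
import torch.nn as nn

class FFNExpert(nn.Module):
    """Standard feed-forward expert, as in a vanilla MoE layer."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))
    def forward(self, x):
        return self.net(x)

class ZeroExpert(nn.Module):
    """Assumed behavior: discard the token (output zeros), no computation."""
    def forward(self, x):
        return torch.zeros_like(x)

class CopyExpert(nn.Module):
    """Assumed behavior: pass the token through unchanged (identity)."""
    def forward(self, x):
        return x

class ConstantExpert(nn.Module):
    """Assumed behavior: replace the token with a learned constant vector."""
    def __init__(self, d_model):
        super().__init__()
        self.const = nn.Parameter(torch.zeros(d_model))
    def forward(self, x):
        return self.const.expand_as(x)
```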
Pathway-aware router: because the experts now vary widely, MoE++ demands a more accurate gating function. We leverage gating residuals, enabling each token to take the pathway chosen in the previous layer into account when selecting experts: the routing scores from the previous layer are added to the routing scores predicted by the current layer. Gating residuals effectively reduce the variance of the routing scores and make routing more stable. [Figure: distribution of the number of activated experts.]
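A minimal sketch of such a router, with hypothetical names and a standard Top-K softmax gate; the only detail taken from the talk is that the previous layer's routing scores are added to the current layer's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PathwayAwareRouter(nn.Module):
    """Router with gating residuals: current routing scores are summed with
    the routing scores produced by the previous MoE layer (a sketch)."""
    def __init__(self, d_model, n_experts, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x, prev_scores=None):
        scores = self.gate(x)                      # (tokens, n_experts)
        if prev_scores is not None:
            scores = scores + prev_scores          # gating residual from the previous layer
        weights = F.softmax(scores, dim=-1)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)
        # Return the raw scores as well, so the next layer can add them as residuals.
        return top_w, top_idx, scores
```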
Because the computing and communication costs of these basic-operation experts are negligible, MoE++ achieves roughly 2x the expert forward throughput of a vanilla MoE model of the same size, and fewer FFN experts are activated. MoE++ delivers performance comparable to dense models with 2 to 3 times as many parameters; notably, it outperforms OpenMoE-8B/32E, a larger MoE model trained on more tokens.

Mixture-of-Head Attention (MoH)

Independent heads: in multi-head attention, the different attention heads are independent of one another and are computed in parallel. Standard sum: vanilla multi-head attention simply sums the outputs of the different heads. Heads as experts: we treat each attention head as an expert in an MoE model. Weighted sum: we upgrade the standard sum over heads to a weighted sum. Shared heads: we use shared heads to process shared knowledge. More efficient and more effective.
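A minimal sketch of the head-mixing step. This is my own simplification, not the official MoH code: a small gate picks Top-K routed heads per token, the first few heads are treated as always-active shared heads, and head outputs are scaled by their routing weights before the output projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoHCombine(nn.Module):
    """Sketch: combine per-head attention outputs with a routed, weighted sum.
    `head_out` has shape (batch, seq, n_heads, d_head); the first n_shared heads
    are always kept, the remaining heads are selected per token by a gate."""
    def __init__(self, d_model, n_heads, n_shared, top_k):
        super().__init__()
        self.n_shared, self.top_k = n_shared, top_k
        self.gate = nn.Linear(d_model, n_heads - n_shared, bias=False)
        self.proj = nn.Linear(n_heads * (d_model // n_heads), d_model)

    def forward(self, x, head_out):
        b, s, h, d = head_out.shape
        scores = F.softmax(self.gate(x), dim=-1)               # weights over routed heads
        top_w, top_idx = scores.topk(self.top_k, dim=-1)
        routed = torch.zeros_like(scores).scatter_(-1, top_idx, top_w)
        shared = torch.ones(b, s, self.n_shared,
                            device=x.device, dtype=top_w.dtype)
        weights = torch.cat([shared, routed], dim=-1)          # (b, s, h)
        # Scale each head by its routing weight; shared heads are always fully active.
        mixed = head_out * weights.unsqueeze(-1)
        return self.proj(mixed.reshape(b, s, h * d))
```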
Mixture-of-Head Attention in ViT: for ViT, MoH outperforms the current state-of-the-art model on ImageNet without adding extra parameters, while activating fewer attention heads.

Mixture-of-Head Attention in DiT: for DiT, MoH likewise delivers better generation quality on ImageNet than the standard DiT without increasing the parameter count, again with fewer activated attention heads.

Mixture-of-Head Attention in LLMs: for LLMs, MoH outperforms standard multi-head attention on language capability with the same number of parameters and fewer active attention heads.

Mixture-of-Head Attention for continue-training: an existing multi-head attention model can be continue-trained into a mixture-of-head attention model.
Optimization Error: New Optimizer

Two steps take us from classical to adaptive optimizers:
1) Adjust the learning rate element-wise by the average of the historical squared gradients. Effect: the learning rate is increased for coordinates with sparse gradients and decreased for coordinates with dense gradients.
2) Change the momentum update rule from accumulating to averaging (an exponential moving average).

Classical optimizers: gradient descent, classical momentum, and Nesterov momentum (take a step in the direction of the momentum, then calculate the gradient and make a correction). Adaptive optimizers apply both step 1), the adaptive learning rate, and step 2), the exponential moving average.

Core steps in Adan: a reformulated Nesterov momentum is used for the moment calculation, combined with an exponential moving average and an adaptive learning rate. Main idea of Adan: reformulate Nesterov momentum into a deep-learning-friendly version and combine this Nesterov momentum with the adaptive optimizer.
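To make the reformulation concrete, here is a sketch in notation I am supplying (theta_k the parameters, g_k the gradient at theta_k, eta the learning rate, beta the momentum coefficient). The actual Adan update uses exponential-moving-average coefficients and an adaptive second-moment term, so treat this only as the core idea.

```latex
% Classical Nesterov momentum: the gradient is evaluated at a look-ahead point.
m_k = \beta\, m_{k-1} + \nabla f\big(\theta_k - \eta \beta\, m_{k-1}\big),
\qquad \theta_{k+1} = \theta_k - \eta\, m_k.

% Deep-learning-friendly reformulation (sketch): approximate the look-ahead
% gradient with the current gradient plus a gradient difference, so only
% g_k = \nabla f(\theta_k) is ever required.
m_k = \beta\, m_{k-1} + \big[\, g_k + \beta\,(g_k - g_{k-1}) \,\big],
\qquad \theta_{k+1} = \theta_k - \eta\, m_k.
```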
[Figure: attraction-basin visualization comparing Adam, Adam after first taking one step with Adan, and Adan.]

Adan shows superiority in LLM training: about 1.25x speedup in training-from-scratch settings and 1.5-2x in continual-training settings. It is the SOTA optimizer across a wide range of architectures (15 kinds); for CNN, ViT, MAE, LSTM, BERT, GPT, and UNet it is 1.5-2x faster than previous SOTAs, and it is easy to plug and play. [Figures: Adan vs. AdamW training loss for an MoE with 8 experts, and continual-training loss with full parameters; 2x+ acceleration.]
Estimation Error: Data

Finding 1: there is a gradient correlation between training and testing; as the difficulty of the test increases, the need for more challenging training data to adjust the parameters also grows. [Experiments: train set Code, test data GSM8K, model Llama3-8B; train set Code, test data Olympic Math, model Llama3-8B.] Moving from GSM8K to Olympic-level data, the proportion of the most difficult code data in the training set rises from around 33% to 45%.

Finding 2: the gradient correlation between training and testing also differs across layers of the LLM; in the lower layers the test-data gradients more closely resemble those of the more challenging data, while in the upper layers the gradients of simpler data matter more for the updates. This generalizes to both code and math tasks. [Experiments: train set Code, test data LeetCode, model Llama3-8B; train set Code, test data Olympic Math, model Llama3-8B.]

Using data of different difficulty levels to train specific layers of the model: updating each layer sequentially with easy-to-hard data yields the best performance. Using data of different difficulty levels to train the whole model: easy-to-hard training improves steadily; hard-to-easy starts strong but declines; random ordering gives variable results. These experiments are based on Llama3-8B with MetaMath training data split into 5 difficulty levels, tested on the MATH dataset (5,000 test samples) and on GSM8K (1,319 samples).
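A minimal sketch of the easy-to-hard (curriculum) ordering described above; this is purely illustrative, with placeholder difficulty labels and a placeholder training call, not the authors' pipeline.

```python
# Sketch of easy-to-hard (curriculum) data ordering. `examples` is assumed to be
# a list of dicts with a precomputed "difficulty" level (1 = easiest, 5 = hardest),
# and `train_one_epoch` is a placeholder for whatever training step is used.
def curriculum_order(examples, reverse=False):
    """Sort training examples from easy to hard (or hard to easy)."""
    return sorted(examples, key=lambda ex: ex["difficulty"], reverse=reverse)

def train_easy_to_hard(model, examples, train_one_epoch, n_levels=5):
    """Feed the model one difficulty level at a time, easiest first."""
    for level in range(1, n_levels + 1):
        bucket = [ex for ex in examples if ex["difficulty"] == level]
        train_one_epoch(model, bucket)   # placeholder training call
    return model
```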
Conscious experience = coordinating different brain regions to reason toward a certain target. AGI should possess consciousness, and there is a useful theater-stage metaphor for consciousness (global workspace theory). [Figure: the theater of consciousness. The stage of working memory is lit (conscious) and holds sensory input, ideas, visual images, and inner speech, driving voluntary speech and action output. Backstage is dark (unconscious): perceptual input, intentions, expectations, the self. The audience that receives the conscious messages also sits in the dark (unconscious): interpreters, memories, language, automatisms (output), motor actions.]

Language is the common medium for all modalities in the global workspace (attention). A super-agent built on this metaphor has a director responsible for attention and dynamic reasoning, with self-reflection, self-correction, and exploration; one SOTA foundation model per modality (SOTA FM Modal-1, Modal-2, ..., Modal-n, ..., Modal-N); and memories and tools/APIs alongside.
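A minimal sketch of such a director loop; this is entirely my own illustration of the metaphor, and the class names, the `reason`/`call` interfaces, and the stopping rule are hypothetical.

```python
# Hypothetical sketch of a "super-agent" in the global-workspace style: a director
# coordinates modality-specific foundation models through a shared language
# workspace, with memories and tools available on the side.
class SuperAgent:
    def __init__(self, director, modality_models, memory, tools):
        self.director = director              # plans, reflects, self-corrects
        self.models = modality_models         # e.g. {"vision": fm1, "audio": fm2}
        self.memory = memory                  # long-term store, read/written as text
        self.tools = tools                    # external APIs callable by name

    def run(self, goal, max_steps=8):
        workspace = [f"GOAL: {goal}"]         # language is the common medium
        for _ in range(max_steps):
            step = self.director.reason(workspace, self.memory)   # hypothetical API
            if step.kind == "call_model":
                workspace.append(self.models[step.modality](step.prompt))
            elif step.kind == "call_tool":
                workspace.append(self.tools[step.name](step.args))
            elif step.kind == "finish":
                return step.answer
            self.memory.write(workspace[-1])  # reflections written back to memory
        return workspace[-1]
```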
O1-style new models are more and more like such a super-agent. [Figure labels: the director's reasoning is conscious; the modality models, memories, tools, and motor actions are unconscious.] AGI should possess consciousness, and global workspace theory presents a useful theater-stage metaphor for it.

Thank You!