Shuicheng Yan: Fellow of the Academy of Engineering Singapore, IEEE Fellow, Honorary Advisor of Kunlun Tech. Professor Yan currently serves as Honorary Advisor of Kunlun Tech; he previously served as Director of the 2050 Research Institute, Chief Scientist of Kunlun Tech and Skywork AI, and Chief Scientist of Sea. He is a Fellow of the Academy of Engineering Singapore and a Fellow of AAAI, ACM, IEEE, and IAPR. His research covers computer vision, machine learning, and multimedia analysis. To date he has published more than 800 papers in top international journals and conferences, with an H-index above 140. His team has won more than ten awards in the two core competitions Pascal VOC and ImageNet (ILSVRC), and has received more than ten Best Paper and Best Student Paper awards, including a grand slam at ACM Multimedia, the top multimedia conference: three Best Paper Awards, two Best Student Paper Awards, and a Best Demo Award.

Talk: Foundations of Foundation Models. Shuicheng YAN, Skywork AI.

Systems and Algorithms for Sustainable and Real-time AIGC. Training Scaling Law vs. Inference Scaling Law. Almighty Intelligence vs. Affordable Intelligence: "Scale is Everything" (暴力出奇跡, brute force works miracles) vs. "Tiny yet Powerful" (四兩撥千斤, a small force deflects a great weight).
Skywork products: Tiangong APP (天工APP), Club Coala, Sky Music, Short Drama, Sky Agent, AI Story. Six large models from 2050 Research: Large Language Model, Image/Multimodal LLM, Large Music Model, Large Audio Model, Large Video Model, Large 3D Model. Systems, Models, Algorithms: Research as Products (研究即產品).

AI Systems: AI = Almighty Intelligence. Advanced Search, Hierarchical Reasoning, Finance Expert, Research Expert. Melodio, a music streaming platform, and Mureka, a commercial music creation platform. User-generated music example: "Chopped Pepper Fish Head" (剁椒魚頭, Cantonese version).

Short Drama Co-Pilot pipeline: 1) Script Creation, 2) Character Gen., 3) Scene Generation, 4) Storyboard Gen., 5) Actor-show Gen., 6) Video-audio Gen.

AI Algorithms: AI = Affordable Intelligence.
Three sources of error: Approximation Error (network architecture), Estimation Error (data/loss), and Optimization Error (parameter optimizer). Four functions are involved:
- the ground-truth function for classification, generation, or representation (unknown);
- the optimal function in the hypothesis space (unknown; bio-inspired);
- the optimal empirical function in the hypothesis space, based on labeled and/or unlabeled data (unknown);
- the computed empirical function in the hypothesis space, based on the data (to be computed and output; theory-inspired).
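These four functions fit the classical excess-risk decomposition; the notation below is my own shorthand for the four functions listed above, not taken from the slides.

```latex
% R(\cdot) = expected risk; f^* = ground-truth function; f_{\mathcal H} = optimal
% function in the hypothesis space \mathcal H; \hat f_{\mathcal H} = optimal
% empirical function given the data; \tilde f = the computed empirical function.
R(\tilde f) - R(f^*)
  = \underbrace{R(f_{\mathcal H}) - R(f^*)}_{\text{approximation error (architecture)}}
  + \underbrace{R(\hat f_{\mathcal H}) - R(f_{\mathcal H})}_{\text{estimation error (data/loss)}}
  + \underbrace{R(\tilde f) - R(\hat f_{\mathcal H})}_{\text{optimization error (optimizer)}}
```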
Training: about US$65M per run; a 20-50% speedup in training or inference means saving the earth! On-device Intelligence.

Approximation Error: Model Architecture

NIN (Network in Network) brought the 1x1 convolution to the community: a full-neuron-function, purely convolutional design built from 1x1 convolution layers. Intuitively it overfits less globally and is more discriminative locally. [Figure: CNN vs. NIN.] The inner module can be any small network, e.g. an MLP, or another module for a particular target, but it must be SMALL. The parameter count is reduced to 1/10 or less, with a smaller receptive field (3x3 Conv vs. 1x1 Conv).

The 1x1 convolution is now a standard component of almost all later popular deep learning models, e.g. the GPT Transformer, GoogLeNet (22 layers, ILSVRC 2014), and ResNet (152 layers, ILSVRC 2015). [Figure: the Inception module concatenates 1x1, 3x3, and 5x5 convolutions and 3x3 max pooling over the previous layer, with additional 1x1 convolutions; in effect a "Network in Network in Network".]
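As a concrete illustration, here is a minimal PyTorch sketch of an NIN-style block in my own words, not code from the talk; the layer sizes are arbitrary. The 1x1 convolutions act as a tiny per-position MLP across channels.

```python
import torch
import torch.nn as nn

# Minimal sketch of an NIN-style "mlpconv" block: one spatial convolution
# followed by 1x1 convolutions, which behave like a small per-pixel MLP
# over the channel dimension. Channel sizes are illustrative only.
class MlpConvBlock(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=1),  # 1x1 conv = per-pixel fully connected layer
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1),  # second 1x1 conv deepens the per-pixel MLP
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

x = torch.randn(2, 3, 32, 32)        # batch of 2 RGB images
y = MlpConvBlock(3, 96, 96)(x)       # same spatial size, richer per-pixel features
print(y.shape)                       # torch.Size([2, 96, 32, 32])
```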
Assigned topic: how to achieve a 2x speedup for MoE?

Vanilla MoE, only FFN experts: a vanilla MoE layer typically consists of multiple structurally identical experts, where each expert is a standard feed-forward network (FFN). Fixed mixing mechanism: a vanilla MoE layer simply merges the outputs of the Top-K feed-forward networks.

MoE++ is a heterogeneous MoE framework that integrates both vanilla FFN experts and zero-computation experts.
It introduces three kinds of zero-computation experts alongside the FFN experts: zero experts, copy experts, and constant experts. More efficient and more effective: MoE++ expands the original single mixing mechanism to eight mixing strategies, and needs roughly half the computation load when the numbers of FFN experts and zero-computation experts are equal.
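A minimal sketch of the expert types follows. This is my own reading of the names, not the official MoE++ code: I assume a zero expert discards its token and outputs zeros, a copy expert returns the token unchanged, and a constant expert replaces the token with a learned constant vector; the exact formulation in the paper may differ.

```python
import torch
import torch.nn as nn

class FFNExpert(nn.Module):
    """Standard feed-forward expert, as in a vanilla MoE layer."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))
    def forward(self, x):
        return self.net(x)

class ZeroExpert(nn.Module):
    """Assumed behavior: discard the token (output zeros), no computation."""
    def forward(self, x):
        return torch.zeros_like(x)

class CopyExpert(nn.Module):
    """Assumed behavior: pass the token through unchanged (identity)."""
    def forward(self, x):
        return x

class ConstantExpert(nn.Module):
    """Assumed behavior: replace the token with a learned constant vector."""
    def __init__(self, d_model):
        super().__init__()
        self.const = nn.Parameter(torch.zeros(d_model))
    def forward(self, x):
        return self.const.expand_as(x)
```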
Pathway-aware router: because the experts now vary widely, MoE++ demands a more accurate gating function. We leverage gating residuals, enabling each token to take the pathway chosen in the previous layer into account when selecting experts: the routing scores from the previous layer are added to the routing scores predicted by the current layer. Gating residuals effectively reduce the variance of the routing scores and make routing more stable. [Figure: distribution of the number of activated experts.]
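A minimal sketch of such a router, with hypothetical names and a standard Top-K softmax gate; the only detail taken from the talk is that the previous layer's routing scores are added to the current layer's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PathwayAwareRouter(nn.Module):
    """Router with gating residuals: current routing scores are summed with
    the routing scores produced by the previous MoE layer (a sketch)."""
    def __init__(self, d_model, n_experts, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x, prev_scores=None):
        scores = self.gate(x)                      # (tokens, n_experts)
        if prev_scores is not None:
            scores = scores + prev_scores          # gating residual from the previous layer
        weights = F.softmax(scores, dim=-1)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)
        # Return the raw scores as well, so the next layer can add them as residuals.
        return top_w, top_idx, scores
```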
Because the computing and communication costs of these basic-operation experts are negligible, MoE++ achieves roughly 2x the expert forward throughput of a vanilla MoE model of the same size, and fewer FFN experts are activated. MoE++ delivers performance comparable to dense models with 2 to 3 times as many parameters; notably, it outperforms OpenMoE-8B/32E, a larger MoE model trained on more tokens.

Mixture-of-Head Attention (MoH)

Independent heads: in multi-head attention, the different attention heads are independent of one another and are computed in parallel. Standard sum: vanilla multi-head attention simply sums the outputs of the different heads. Heads as experts: we treat each attention head as an expert in an MoE model. Weighted sum: we upgrade the standard sum over heads to a weighted sum. Shared heads: we use shared heads to process shared knowledge. More efficient and more effective.
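A minimal sketch of the head-mixing step. This is my own simplification, not the official MoH code: a small gate picks Top-K routed heads per token, the first few heads are treated as always-active shared heads, and head outputs are scaled by their routing weights before the output projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoHCombine(nn.Module):
    """Sketch: combine per-head attention outputs with a routed, weighted sum.
    `head_out` has shape (batch, seq, n_heads, d_head); the first n_shared heads
    are always kept, the remaining heads are selected per token by a gate."""
    def __init__(self, d_model, n_heads, n_shared, top_k):
        super().__init__()
        self.n_shared, self.top_k = n_shared, top_k
        self.gate = nn.Linear(d_model, n_heads - n_shared, bias=False)
        self.proj = nn.Linear(n_heads * (d_model // n_heads), d_model)

    def forward(self, x, head_out):
        b, s, h, d = head_out.shape
        scores = F.softmax(self.gate(x), dim=-1)               # weights over routed heads
        top_w, top_idx = scores.topk(self.top_k, dim=-1)
        routed = torch.zeros_like(scores).scatter_(-1, top_idx, top_w)
        shared = torch.ones(b, s, self.n_shared,
                            device=x.device, dtype=top_w.dtype)
        weights = torch.cat([shared, routed], dim=-1)          # (b, s, h)
        # Scale each head by its routing weight; shared heads are always fully active.
        mixed = head_out * weights.unsqueeze(-1)
        return self.proj(mixed.reshape(b, s, h * d))
```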
Mixture-of-Head Attention in ViT: for ViT, MoH outperforms the current state-of-the-art model on ImageNet without adding extra parameters, while activating fewer attention heads.

Mixture-of-Head Attention in DiT: for DiT, MoH likewise delivers better generation quality on ImageNet than the standard DiT without increasing the parameter count, again with fewer activated attention heads.

Mixture-of-Head Attention in LLMs: for LLMs, MoH outperforms standard multi-head attention on language capability with the same number of parameters and fewer active attention heads.

Mixture-of-Head Attention for continue-training: an existing multi-head attention model can be continue-trained into a mixture-of-head attention model.
Optimization Error: New Optimizer

Two steps take us from classical to adaptive optimizers:
1) Adjust the learning rate element-wise by the average of the historical squared gradients. Effect: the learning rate is increased for coordinates with sparse gradients and decreased for coordinates with dense gradients.
2) Change the momentum update rule from accumulating to averaging (an exponential moving average).

Classical optimizers: gradient descent, classical momentum, and Nesterov momentum (take a step in the direction of the momentum, then calculate the gradient and make a correction). Adaptive optimizers apply both step 1), the adaptive learning rate, and step 2), the exponential moving average.

Core steps in Adan: a reformulated Nesterov momentum is used for the moment calculation, combined with an exponential moving average and an adaptive learning rate. Main idea of Adan: reformulate Nesterov momentum into a deep-learning-friendly version and combine this Nesterov momentum with the adaptive optimizer.
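To make the reformulation concrete, here is a sketch in notation I am supplying (theta_k the parameters, g_k the gradient at theta_k, eta the learning rate, beta the momentum coefficient). The actual Adan update uses exponential-moving-average coefficients and an adaptive second-moment term, so treat this only as the core idea.

```latex
% Classical Nesterov momentum: the gradient is evaluated at a look-ahead point.
m_k = \beta\, m_{k-1} + \nabla f\big(\theta_k - \eta \beta\, m_{k-1}\big),
\qquad \theta_{k+1} = \theta_k - \eta\, m_k.

% Deep-learning-friendly reformulation (sketch): approximate the look-ahead
% gradient with the current gradient plus a gradient difference, so only
% g_k = \nabla f(\theta_k) is ever required.
m_k = \beta\, m_{k-1} + \big[\, g_k + \beta\,(g_k - g_{k-1}) \,\big],
\qquad \theta_{k+1} = \theta_k - \eta\, m_k.
```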
[Figure: attraction-basin visualization comparing Adam, Adam after first taking one step with Adan, and Adan.]

Adan shows superiority in LLM training: about 1.25x speedup in training-from-scratch settings and 1.5-2x in continual-training settings. It is the SOTA optimizer across a wide range of architectures (15 kinds); for CNN, ViT, MAE, LSTM, BERT, GPT, and UNet it is 1.5-2x faster than previous SOTAs, and it is easy to plug and play. [Figures: Adan vs. AdamW training loss for an MoE with 8 experts, and continual-training loss with full parameters; 2x+ acceleration.]
Estimation Error: Data

Finding 1: there is a gradient correlation between training and testing; as the difficulty of the test increases, the need for more challenging training data to adjust the parameters also grows. [Experiments: train set Code, test data GSM8K, model Llama3-8B; train set Code, test data Olympic Math, model Llama3-8B.] Moving from GSM8K to Olympic-level data, the proportion of the most difficult code data in the training set rises from around 33% to 45%.

Finding 2: the gradient correlation between training and testing also differs across layers of the LLM; in the lower layers the test-data gradients more closely resemble those of the more challenging data, while in the upper layers the gradients of simpler data matter more for the updates. This generalizes to both code and math tasks. [Experiments: train set Code, test data LeetCode, model Llama3-8B; train set Code, test data Olympic Math, model Llama3-8B.]

Using data of different difficulty levels to train specific layers of the model: updating each layer sequentially with easy-to-hard data yields the best performance. Using data of different difficulty levels to train the whole model: easy-to-hard training improves steadily; hard-to-easy starts strong but declines; random ordering gives variable results. These experiments are based on Llama3-8B with MetaMath training data split into 5 difficulty levels, tested on the MATH dataset (5,000 test samples) and on GSM8K (1,319 samples).
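A minimal sketch of the easy-to-hard (curriculum) ordering described above; this is purely illustrative, with placeholder difficulty labels and a placeholder training call, not the authors' pipeline.

```python
# Sketch of easy-to-hard (curriculum) data ordering. `examples` is assumed to be
# a list of dicts with a precomputed "difficulty" level (1 = easiest, 5 = hardest),
# and `train_one_epoch` is a placeholder for whatever training step is used.
def curriculum_order(examples, reverse=False):
    """Sort training examples from easy to hard (or hard to easy)."""
    return sorted(examples, key=lambda ex: ex["difficulty"], reverse=reverse)

def train_easy_to_hard(model, examples, train_one_epoch, n_levels=5):
    """Feed the model one difficulty level at a time, easiest first."""
    for level in range(1, n_levels + 1):
        bucket = [ex for ex in examples if ex["difficulty"] == level]
        train_one_epoch(model, bucket)   # placeholder training call
    return model
```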
Conscious experience = coordinating different brain regions to reason toward a certain target. AGI should possess consciousness, and there is a useful theater-stage metaphor for consciousness (global workspace theory). [Figure: the theater of consciousness. The stage of working memory is lit (conscious) and holds sensory input, ideas, visual images, and inner speech, driving voluntary speech and action output. Backstage is dark (unconscious): perceptual input, intentions, expectations, the self. The audience that receives the conscious messages also sits in the dark (unconscious): interpreters, memories, language, automatisms (output), motor actions.]

Language is the common medium for all modalities in the global workspace (attention). A super-agent built on this metaphor has a director responsible for attention and dynamic reasoning, with self-reflection, self-correction, and exploration; one SOTA foundation model per modality (SOTA FM Modal-1, Modal-2, ..., Modal-n, ..., Modal-N); and memories and tools/APIs alongside.
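A minimal sketch of such a director loop; this is entirely my own illustration of the metaphor, and the class names, the `reason`/`call` interfaces, and the stopping rule are hypothetical.

```python
# Hypothetical sketch of a "super-agent" in the global-workspace style: a director
# coordinates modality-specific foundation models through a shared language
# workspace, with memories and tools available on the side.
class SuperAgent:
    def __init__(self, director, modality_models, memory, tools):
        self.director = director              # plans, reflects, self-corrects
        self.models = modality_models         # e.g. {"vision": fm1, "audio": fm2}
        self.memory = memory                  # long-term store, read/written as text
        self.tools = tools                    # external APIs callable by name

    def run(self, goal, max_steps=8):
        workspace = [f"GOAL: {goal}"]         # language is the common medium
        for _ in range(max_steps):
            step = self.director.reason(workspace, self.memory)   # hypothetical API
            if step.kind == "call_model":
                workspace.append(self.models[step.modality](step.prompt))
            elif step.kind == "call_tool":
                workspace.append(self.tools[step.name](step.args))
            elif step.kind == "finish":
                return step.answer
            self.memory.write(workspace[-1])  # reflections written back to memory
        return workspace[-1]
```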
O1-style new models are more and more like such a super-agent. [Figure labels: the director's reasoning is conscious; the modality models, memories, tools, and motor actions are unconscious.] AGI should possess consciousness, and global workspace theory presents a useful theater-stage metaphor for it.

Thank You!