Capability Alignment and Beyond for LLM-based Agents (大模型智能體能力對齊與超越)
Tao Gui, Fudan University, 2024/10/15

What is An Agent?
"If they find a parrot who could answer to everything, I would claim it to be an intelligent being without hesitation." (Denis Diderot, 1875)

Agent in Philosophy
- Agency: individuality, asymmetry, normativity.
- Generally speaking: entities with the capacity to act.
- Narrowly speaking: entities possessing desires, beliefs, intentions, and the ability to take actions.

OpenAI's Mission & Goal
https:/
"... thus building a living metric which measures how well an agent can achieve its user's intended goal in a wide range of environments."

What is an AI Agent?
Agents: artificial entities that are capable of perceiving their surroundings using sensors, making decisions, and then taking actions in response using actuators [1, 2].
- Perceiving surroundings
- Making decisions
- Taking actions
[1] Russell, S.J. Artificial Intelligence: A Modern Approach. Pearson Education, Inc., 2010.
[2] Wooldridge, M.J., N.R. Jennings. Intelligent agents: theory and practice. Knowl. Eng. Rev., 10(2):115-152, 1995.
Whom does the agent serve?

What is Alignment? (from "Training language models to follow instructions with human feedback")
- Helpful: follow instructions; ask relevant follow-up questions and obtain necessary details; redirect ill-informed requests.
- Honest: know who it is, and what it can and cannot do or know.
- Harmless: refuse inappropriate requests.

Two Steps of RLHF Alignment (from "Training language models to follow instructions with human feedback")
1. Preference modeling
2. Alignment training

Why alignment training is hard
- The language environment; reward design; the optimization algorithm.

PPO-Max for Stable Training
1. Evaluation metrics to monitor the training process
2. Implementation details in PPO
3. The PPO-max setup
Technical report: "Secrets of RLHF in Large Language Models Part I: PPO".
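The report's focus is what keeps PPO stable in the language setting. As a minimal sketch of two standard ingredients involved, a KL-penalized and clipped reward plus a clipped policy update with advantage whitening, here is an illustrative PyTorch fragment; the tensor shapes and coefficients are assumptions, not the report's exact PPO-max recipe.

```python
# Minimal sketch: clipped PPO surrogate with advantage whitening, plus a
# KL-penalized, clipped reward. Shapes and coefficients are illustrative only.
import torch

def ppo_policy_loss(logprobs, old_logprobs, advantages, clip_ratio=0.2):
    # Advantage whitening: normalize per batch to stabilize the update scale.
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    ratio = torch.exp(logprobs - old_logprobs)           # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantages
    return -torch.min(unclipped, clipped).mean()         # maximize the surrogate

def shaped_reward(rm_score, kl_to_sft, kl_coef=0.1, reward_clip=5.0):
    # Reward-model score minus a KL penalty to the SFT policy, then clipped
    # so a miscalibrated reward model cannot blow up the update.
    return torch.clamp(rm_score - kl_coef * kl_to_sft, -reward_clip, reward_clip)

# Toy usage with random tensors standing in for per-sequence statistics.
new_lp = torch.randn(8, requires_grad=True)
old_lp = new_lp.detach() + 0.05 * torch.randn(8)
adv = shaped_reward(torch.randn(8), torch.rand(8))       # crude stand-in
ppo_policy_loss(new_lp, old_lp, adv).backward()
```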
LLMs with Competitive Self-Play
- Competitive self-play | OpenAI
- "Toward Optimal LLM Alignments Using Two-Player Games"

Characteristics of the O1 model series
- Plan first: generate a plan, then answer.
- O1 results: test-time scaling, XOT.
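In the two-player formulation, an adversarial prompter and a defensive responder improve against each other. Here is a minimal, fully stubbed sketch of that loop; the three callables stand in for the prompter, the responder, and the reward model, and none of this is the paper's actual setup.

```python
# Fully stubbed two-player loop: a prompter proposes hard prompts, a responder
# answers, and a judge scores the exchange. The zero-sum signal trains both.
import random

def prompter(_history):                 # stub adversary
    return random.choice(["easy question", "tricky jailbreak attempt"])

def responder(prompt):                  # stub defender
    return "refusal" if "jailbreak" in prompt else "helpful answer"

def judge(prompt, response):            # stub reward model: high = safe + helpful
    return 1.0 if response in ("refusal", "helpful answer") else -1.0

for step in range(4):
    p = prompter([])
    r = responder(p)
    score = judge(p, r)
    # The responder maximizes `score`; the prompter maximizes `-score`, i.e.,
    # it is rewarded for exposing failures. RL updates would go here.
    print(step, p, "->", r, "| responder reward:", score)
```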
Process Supervision
- "Improving mathematical reasoning with process supervision" | OpenAI

LLMs with Process Supervision
- https:/ "... from a Single Demonstration"
- Reversed curriculum makes learning easier.
- R3: Reinforcement Learning for Reasoning with Reversed Curriculum, ICML 2024.

LLMs with Environment Feedback
- StepCoder: Improving Code Generation with Reinforcement Learning from Compiler Feedback, ACL 2024.
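StepCoder's training signal comes from actually executing generated code. As a sketch in that spirit, here is a generic pass/fail unit-test reward; it is not the paper's curriculum of code completion subtasks or its fine-grained optimization, just the execution-feedback idea in miniature.

```python
# Minimal sketch of an executor-feedback reward for code RL: run the candidate
# against tests in a subprocess and map the outcome to a scalar reward.
import os, subprocess, sys, tempfile

def execution_reward(candidate_code: str, test_code: str, timeout: float = 5.0) -> float:
    """+1 if the candidate passes the tests, 0 on test failure,
    -1 on timeout or a program that does not even parse."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return -1.0
    finally:
        os.remove(path)
    if proc.returncode == 0:
        return 1.0
    return -1.0 if b"SyntaxError" in proc.stderr else 0.0

print(execution_reward("def add(a, b):\n    return a + b\n",
                       "assert add(2, 3) == 5\n"))   # -> 1.0
```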
LLaVA (Visual Instruction Tuning)

This symbolic representation allows us to encode the image as an LLM-recognizable sequence. We use COCO images [28] and generate three types of instruction-following data. One example per type is shown in the bottom block of Table 1. For each type, we first manually design a few examples. They are the only human annotations we have during data collection, and are used as seed examples in in-context learning to query GPT-4.

- Conversation. We design a conversation between the assistant and a person asking questions about this photo. The answers are in a tone as if the assistant is seeing the image and answering the question. A diverse set of questions is asked about the visual content of the image, including the object types, counting the objects, object actions, object locations, and relative positions between objects. Only questions that have definite answers are considered. Please see Table 10 for the detailed prompt.
- Detailed description. To include a rich and comprehensive description for an image, we create a list of questions with such an intent. We prompt GPT-4 then curate the list, which is shown in Table 9 in the Appendix. For each image, we randomly sample one question from the list to ask GPT-4 to generate the detailed description.
- Complex reasoning. The above two types focus on the visual content itself, based on which we further create in-depth reasoning questions. The answers typically require a step-by-step reasoning process by following rigorous logic.

We collect 158K unique language-image instruction-following samples in total, including 58K in conversations, 23K in detailed description, and 77K in complex reasoning, respectively. We ablated the use of ChatGPT and GPT-4 in our early experiments, and found that GPT-4 can consistently provide higher-quality instruction-following data, such as spatial reasoning.

4 Visual Instruction Tuning
4.1 Architecture
The primary goal is to effectively leverage the capabilities of both the pre-trained LLM and visual model. The network architecture is illustrated in Figure 1. We choose LLaMA as our LLM f_φ(·), parameterized by φ, as its effectiveness has been demonstrated in several open-source language-only instruction-tuning works [43, 45, 34].

[Figure 1: LLaVA network architecture. An input image X_v passes through the vision encoder to give Z_v, then the projection W to give H_v; the language instruction X_q is embedded as H_q; the language model produces the response X_a.]

For an input image X_v, we consider the pre-trained CLIP visual encoder ViT-L/14 [36], which provides the visual feature Z_v = g(X_v). The grid features before and after the last Transformer layer are considered in our experiments. We consider a simple linear layer to connect image features into the word embedding space. Specifically, we apply a trainable projection matrix W to convert Z_v into language embedding tokens H_v, which have the same dimensionality as the word embedding space in the language model:

    H_v = W · Z_v, with Z_v = g(X_v)    (1)

Thus we have a sequence of visual tokens H_v. Note that our simple projection scheme is lightweight and cost-effective, which allows us to iterate data-centric experiments quickly. More sophisticated (but expensive) schemes to connect the image and language representations can also be considered, such as gated cross-attention in Flamingo [2] and the Q-Former in BLIP-2 [25], or other vision encoders such as SAM [21] that provide object-level features. We leave exploring possibly more effective and sophisticated architecture designs for LLaVA as future work.

Multimodal
- Vocabulary (token embedding alignment)
- PT (W): CC3M, 595K image-text pairs
- SFT (ViT): 150K
- CLIP ViT-L/14; LLaMA-13B; Vicuna-13B
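A minimal sketch of the connector in Eq. (1): a single trainable linear layer W mapping ViT grid features into the LLM word-embedding space. The dimensions are illustrative but plausible (ViT-L/14 features are 1024-d; LLaMA-13B embeddings are 5120-d).

```python
# Minimal sketch of the LLaVA-style linear connector: H_v = W * Z_v.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim=1024, lm_dim=5120):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim, bias=False)  # the matrix W

    def forward(self, z_v: torch.Tensor) -> torch.Tensor:
        # z_v: (batch, num_patches, vision_dim) grid features from the ViT
        return self.proj(z_v)  # H_v, same width as the LM's word embeddings

z_v = torch.randn(1, 256, 1024)   # e.g., a 16x16 patch grid
h_v = VisionProjector()(z_v)
print(h_v.shape)                  # torch.Size([1, 256, 5120])
```

The design point the paper makes is that this connector is deliberately cheap, so data-centric experiments iterate quickly; heavier alternatives (gated cross-attention, a Q-Former) trade that speed for expressiveness.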
Insufficient spatial capability

What's "up" with vision-language models? Investigating their struggle with spatial reasoning
Amita Kamath (1), Jack Hessel (2), Kai-Wei Chang (1)
(1) University of California, Los Angeles; (2) Allen Institute for AI
{kamatha,kwchang}@cs.ucla.edu, jackh@allenai.org

Abstract
Recent vision-language (VL) models are powerful, but can they reliably distinguish "right" from "left"? We curate three new corpora to quantify model comprehension of such basic spatial relations. These tests isolate spatial reasoning more precisely than existing datasets like VQAv2, e.g., our What'sUp benchmark contains sets of photographs varying only the spatial relations of objects, keeping their identity fixed (see Figure 1: models must comprehend not only the usual case of a dog under a table, but also the same dog on top of the same table). We evaluate 18 VL models, finding that all perform poorly, e.g., BLIP fine-tuned on VQAv2, which nears human parity on VQAv2, achieves 56% accuracy on our benchmarks vs. humans at 99%. We conclude by studying causes of this surprising behavior, finding: 1) that popular vision-language pretraining corpora like LAION-2B contain little reliable data for learning spatial relationships; and 2) that basic modeling interventions like up-weighting preposition-containing instances or fine-tuning on our corpora are not sufficient to address the challenges our benchmarks pose. We are hopeful that these corpora will facilitate further research, and we release our data and code at https:/

[Figure 1: Three tightly controlled benchmarks assess model capacity for fine-grained spatial reasoning; popular vision-language models fall far behind human performance when asked to select the correct spatial relation between two objects in an image (real examples shown, with caption options such as "A dog on a table", "A dog under a table", "A dog left of a table", "A dog right of a table").]

1 Introduction
Pre-trained vision-language models perform well on complex tasks such as VQAv2 (Goyal et al., 2016) and NoCaps (Agrawal et al., 2019), even in the zero-shot setting (Li et al., 2023). However, recent work has re-surfaced a concern that has long plagued vision-language models (Yatskar et al., 2016; Johnson et al., 2017): new multimodal models still exhibit poor behavior on simple tasks like attribute attachment, counting, etc. (Yamada et al., 2022; Thrush et al., 2022; Yuksekgonul et al., 2023; Parcalabescu et al., 2021). Despite improvements, models still fail to reliably capture even basic spatial factors of images, a prerequisite for more precise and complex reasoning benchmarks. But why? In this work, we study vision-language models' performance on basic spatial relations, such as "left of" and "right of". Existing benchmarks which aim to operationalize spatial understanding, such as VQAv2 and GQA (Hudson and Manning, 2019), often conflate the evaluation of spatial reasoning with other types of reasoning, such as in the GQA question "Is there a woman to the left of the person that is wearing a wetsuit?".

arXiv:2310.19785 [cs.CL], 30 Oct 2023
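A hedged sketch of the kind of controlled probe What'sUp performs: score one image against captions that differ only in the spatial preposition, and check whether the model ranks the correct one highest. The checkpoint name is a common public CLIP model used for illustration, and the blank PIL image stands in for a real benchmark photo.

```python
# Minimal sketch: rank preposition-only caption variants with CLIP.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))   # stand-in for a benchmark photo
captions = ["A dog under a table", "A dog on a table",
            "A dog left of a table", "A dog right of a table"]

inputs = processor(text=captions, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image   # shape (1, num_captions)
pred = captions[logits.argmax(-1).item()]
# Benchmark accuracy = fraction of images whose correct caption ranks first.
print("model's choice:", pred)
```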
Insufficient fine-grained capability

Table 1: Comparison of visual tokenizers of ViT-B with different pretraining strategies. The best result is bold while the second best is underlined (in the original).

| Joint tuning | Supervision | Visual tokenizer | #Pretraining images | VQA Acc | Captioning CIDEr | Captioning SPICE | OC Acc | MCI Acc | Avg |
|---|---|---|---|---|---|---|---|---|---|
| w/o | Fully | DeiT [16] | 1.28M | 48.3 | 65.8 | 15.9 | 37.5 | 83.6 | 58.8 |
| w/o | Self | DINO [19] | 1.28M | 50.1 | 45.0 | 13.5 | 46.5 | 80.8 | 55.6 |
| w/o | Self | MAE [18] | 1.28M | 48.4 | 37.3 | 11.8 | 47.5 | 82.7 | 53.4 |
| w/o | Self | DINOv2 [20] | 142M | 51.3 | 67.9 | 16.1 | 47.0 | 86.0 | 63.1 |
| w/o | Weakly | CLIP [17] | 400M | 52.2 | 69.3 | 16.6 | 42.5 | 86.0 | 62.5 |
| w/ | Fully | DeiT [16] | 1.28M | 50.7 | 38.4 | 10.0 | 41.0 | 86.9 | 54.3 |
| w/ | Self | DINO [19] | 1.28M | 47.3 | 54.1 | 14.5 | 44.5 | 86.6 | 58.1 |
| w/ | Self | MAE [18] | 1.28M | 48.9 | 48.0 | 14.2 | 47.5 | 88.7 | 58.2 |
| w/ | Self | DINOv2 [20] | 142M | 50.5 | 49.6 | 13.0 | 43.5 | 84.1 | 56.9 |
| w/ | Weakly | CLIP [17] | 400M | 47.7 | 64.2 | 15.4 | 45.5 | 88.0 | 61.4 |
2.2 Comparing Visual Tokenizers

On GVTBench, we evaluate visual tokenizers with the same architecture (ViT-B [34]) but different pretraining strategies, including fully supervised (DeiT [16]), self-supervised (DINO [19], DINOv2 [20], MAE [18]), and text-guided weakly supervised (CLIP [17]) pretraining. Based on the results in Table 1, we arrive at the following conclusions.

Fully/weakly supervised models capture more semantics than self-supervised ones, but the gap is narrowed by scaling up the pretraining dataset. With tokenizers pretrained on a relatively small-scale dataset (i.e., ImageNet-1k [35] with 1.28M images), DeiT demonstrates better image captioning performance (65.8 CIDEr) than the self-supervised models DINO (45.0) and MAE (37.3), without jointly tuning the visual tokenizer. However, with 142M images for pretraining, the self-supervised model DINOv2 outperforms the supervised DeiT on image captioning (67.9) and VQA (51.3), and is only inferior to CLIP, which is pretrained with weak supervision from a large-scale dataset with 400M image-text pairs. This indicates that supervision is beneficial for semantic representation capability, but this can also emerge from large-scale pretraining with self-supervision.

Self-supervised models are better at fine-grained perception, where patch-level supervision is particularly effective. On fine-grained visual understanding tasks, i.e., OC and MCI, self-supervised models demonstrate consistently better performance than those with supervision. When they are jointly tuned on the instruction dataset, their OC and MCI performance is mostly boosted, indicating that their fine-grained visual perception capability gets improved. Among all the self-supervised models, MAE achieves the best performance, indicating that patch-based supervision is particularly effective for improving fine-grained visual understanding.

Tuning a semantic-rich visual tokenizer leads to semantic loss on a small-scale instruction tuning dataset. When the tokenizer is jointly optimized on the instruction tuning dataset, the rich semantics obtained from large-scale pretraining in CLIP and DINOv2 drop noticeably (e.g., CLIP VQA 52.2 -> 47.7 and DINOv2 captioning 67.9 -> 49.6). We conjecture this is due to the relatively small scale of our instruction dataset (5M vs. 142M). As such, for modern MLLMs that are often tuned on small-scale and high-quality instruction datasets [7; 8], jointly tuning the visual tokenizer may not be a good option.

3 Unifying Semantic and Fine-grained Visual Understanding
3.1 CLIP with Region-based Training
The generalist MLLMs call for a versatile visual tokenizer that could properly represent an image's content at multiple levels. However, based on the results in Table 1, none of the existing pretraining methods leads to a good visual tokenizer that excels at both semantic and fine-grained visual perception.
Fudan MouSi (復旦眸思)
Different experts each have their own strengths: can the experts collaborate?
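One way to read "can the experts collaborate" is to fuse several vision encoders into a single token stream for the LLM. Below is a minimal sketch of that idea, with stubbed expert features and fusion-by-concatenation chosen purely for illustration; it is not necessarily MouSi's actual design.

```python
# Minimal sketch: project features from multiple vision "experts" into the
# LLM embedding space and concatenate them along the token dimension.
import torch
import torch.nn as nn

class PolyExpertFusion(nn.Module):
    def __init__(self, expert_dims, lm_dim=4096):
        super().__init__()
        # One projector per expert, so differently sized features share a space.
        self.projs = nn.ModuleList([nn.Linear(d, lm_dim) for d in expert_dims])

    def forward(self, expert_feats):
        # expert_feats: list of (batch, tokens_i, dim_i), one per expert
        fused = [p(f) for p, f in zip(self.projs, expert_feats)]
        return torch.cat(fused, dim=1)   # (batch, sum of tokens_i, lm_dim)

feats = [torch.randn(1, 256, 1024),      # e.g., a CLIP-like expert
         torch.randn(1, 196, 768)]       # e.g., a DINO-like expert
tokens = PolyExpertFusion([1024, 768])(feats)
print(tokens.shape)                       # torch.Size([1, 452, 4096])
```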
AnyGPT: Overview
- Multimodal tokenizers: image: SEED; speech: SpeechTokenizer; music: Encodec.
- Autoregressive LM.
- De-tokenizers: image: diffusion model; speech: SoundStorm + SpeechTokenizer decoder; music: Encodec.

Multimodal Alignment Pretraining
- Image-to-text, speech-to-text, music-to-text.
- Text-to-image, text-to-speech, text-to-music.
- Image-text interleaved data.
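For a single autoregressive LM to cover all of these modalities, each tokenizer's discrete codes must live in one shared vocabulary. A minimal sketch of the usual offset scheme follows, with illustrative codebook sizes rather than AnyGPT's actual configuration.

```python
# Minimal sketch: map modality-local token IDs into one shared LM vocabulary
# by giving each modality its own offset. Sizes are illustrative only.
TEXT_VOCAB = 32000     # base text vocabulary (assumed)
IMAGE_CODES = 8192     # e.g., an image codebook size (assumed)
SPEECH_CODES = 1024    # e.g., a speech codebook size (assumed)

OFFSETS = {"text": 0, "image": TEXT_VOCAB, "speech": TEXT_VOCAB + IMAGE_CODES}

def to_unified(modality: str, ids: list[int]) -> list[int]:
    """Shift modality-local IDs into the shared vocabulary."""
    return [OFFSETS[modality] + i for i in ids]

def from_unified(ids: list[int]) -> list[tuple[str, int]]:
    """Map shared-vocabulary IDs back to (modality, local ID) pairs."""
    out = []
    for i in ids:
        if i < OFFSETS["image"]:
            out.append(("text", i))
        elif i < OFFSETS["speech"]:
            out.append(("image", i - OFFSETS["image"]))
        else:
            out.append(("speech", i - OFFSETS["speech"]))
    return out

seq = to_unified("text", [5, 17]) + to_unified("image", [3, 8191])
print(seq, from_unified(seq))
```

With this mapping, the LM sees one flat token stream; the de-tokenizers then reconstruct each modality from its local IDs.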
Multimodal Large Models

A growing demand for information accessibility
- By the end of 2023, visual impairment was the most common disability in China, affecting 28.565 million people.
- In 2023, China's population aged 60 and above reached roughly 300 million, and it will keep growing for some time to come.
- Key user groups: visually impaired people and the elderly.

An LLM-based agent for phone control

The "Hear the World" (聽見世界) Agent

LLMs with Various Feedback
[Figure: methods arranged by model capability (up to and beyond human level) versus the amount of human supervision required: SFT, RLHF, collaboration, weak-to-strong, environment feedback.]
The tension: model capability keeps growing while human supervision keeps shrinking.

LLMs with Environment Feedback
- Imitation learning; exploration learning.
- "Evolving Large Language Model based Agents across Diverse Environments"
LLMs with Environment Feedback
Three pillars to achieve our goal (a self-evolving loop sketch follows this list):
1. Interactive training platform: diverse environments and tasks that allow the agents to evolve dynamically and comprehensively, rather than being confined to an isolated world, which may limit generalization.
2. Base agent with basic abilities and prior knowledge: we need a trajectory set of an appropriate size to train a base agent with preliminary instruction-following abilities and knowledge. This facilitates further exploration, since in diverse, complex environments it would be extremely inefficient for an agent to learn everything from scratch through trial and error.
3. Self-evolving algorithm: an effective and flexible evolution method that can adapt to environments of varying difficulty and elicit the generalizing ability of LLM-based agents. This involves how the agent interacts with the environment and how it utilizes the feedback.
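A minimal sketch of the explore-then-learn loop these pillars support: behavioral cloning on a seed trajectory set, then rounds of sampling in the environment, keeping rewarded trajectories, and fine-tuning on the survivors. Everything below is stubbed; it names the shape of the loop, not AgentEvol's exact objective.

```python
# Fully stubbed self-evolution loop: rollout -> filter by reward -> imitate.
import random

def rollout(policy, env_task):                 # stub environment interaction
    actions = [policy(env_task) for _ in range(3)]
    reward = 1.0 if "search" in actions else 0.0
    return {"task": env_task, "actions": actions, "reward": reward}

def fine_tune(policy, trajectories):           # stub SFT update
    good = [t for t in trajectories if t["reward"] > 0]
    if not good:
        return policy                          # nothing worth imitating yet
    return lambda task: "search"               # "imitate" the successes

def policy(task):                              # stub base agent
    return random.choice(["search", "click", "noop"])

for _round in range(2):                        # self-evolution iterations
    trajs = [rollout(policy, f"task-{i}") for i in range(4)]
    policy = fine_tune(policy, trajs)
print(rollout(policy, "webshop: find a pillow")["reward"])
```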
LLMs with Environment Feedback
AgentGym: Online Interactive Training and Evaluation Platform

[Figure: the AgentGym pipeline. 1. Data preparation; 2. Behavioral cloning from a trajectory set to obtain a base imitation agent; 3. Exploring & learning (AgentEvol: evolve through exploration and feedback); 4. Multi-task evaluation (single-task and multi-task performance). Environment servers are exposed over HTTP to environment clients, covering Web (WebShop, WebArena), Embodied (AlfWorld, ScienceWorld, BabyAI), Game (TextCraft, MAZE/Wordle), Tool (Weather, Todo), Academia (Movie, Sheet), and Code (BIRD-SQL).]

Trajectory formats mix reasoning-and-acting data with general data, e.g.:
- Reasoning and acting: Instruction: "Find me a pillow with blue and ..." / Thought: "I think I should search for pillows" / Action: search[pillow] / Observation: "Results: ... Item 1 ..."
- General domain chat: Instruction: "Hello! Can you translate this into Chinese for me?" / Response: "Sure! Here's the translation ..."
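A minimal sketch of how one such reasoning-and-acting trajectory might be stored and flattened into SFT text; the field names are illustrative, not AgentGym's actual schema.

```python
# Minimal sketch: a ReAct-style trajectory record and its SFT serialization.
import json

trajectory = {
    "env": "WebShop",
    "instruction": "Find me a pillow with blue and ...",  # truncated as in the slide
    "turns": [
        {"thought": "I think I should search for pillows",
         "action": "search[pillow]",
         "observation": "Results: ... Item 1 ..."},
    ],
    "reward": 1.0,
}

def to_training_text(traj: dict) -> str:
    """Flatten one trajectory into the plain-text form used for cloning."""
    lines = [f"Instruction: {traj['instruction']}"]
    for t in traj["turns"]:
        lines += [f"Thought: {t['thought']}",
                  f"Action: {t['action']}",
                  f"Observation: {t['observation']}"]
    return "\n".join(lines)

print(to_training_text(trajectory))
print(json.dumps(trajectory)[:60], "...")
```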
Summary
[Closing diagram: an LLM connecting reasoning/planning, reality, and alignment, surrounded by the works covered in this talk: Stable PPO, GPO, R3, StepCoder, SPA-VL, AgentGym, ToolSword, MouSi (眸思), and Hear the World (聽世界).]

Thanks
FudanNLP