In-Context Learning in Multimodal Large Language Models
Xu Yang, Southeast University

Speaker: Xu Yang, Associate Professor, Southeast University. Dr. Xu Yang received his Ph.D. in June 2021 from the School of Computer Science and Engineering, Nanyang Technological University, advised by Profs. Jianfei Cai and Hanwang Zhang. He is now an Associate Professor in the School of Computer Science and Engineering / School of Software / School of Artificial Intelligence at Southeast University, and Deputy Director of Southeast University's Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Ministry of Education). His research focuses on applications of vision-language multimodal large models and on Learngene, a new training-and-deployment paradigm for large models.

CONTENTS
1. Background
2. Heuristic-based configuration strategies
3. Learning-based configuration strategies
PART 01: Background
"Why do we need In-Context Learning?"

The Development of GPT
[Figure: timeline of the GPT family. GPT (2018): pre-training plus fine-tuning on data; GPT-2 (2019): 1.5B parameters, prompt engineering; GPT-3 (2020): 175B parameters, in-context learning; GPT-4 (2023): multimodal (image, text, video) with in-context examples.]
GPT-2's Capability of Prompt Engineering
- GPT-2 exhibits a distinctive feature known as "prompt engineering".
- This can be compared to the architecture of modern computers, where both data and commands exist in the form of 0s and 1s.

GPT-3's Capability of In-Context Learning
- GPT-3 possesses a unique capability known as "in-context learning" (ICL).
- It learns the representation of a task from the provided in-context examples.

Why In-Context Learning?
- In-context learning is a specialized form of prompt engineering: it unlocks the potential of LLMs and yields precise responses.
- It adapts to a task using only a few examples (few-shot).

References: Liu, Pengfei, et al. "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing." Dong, Qingxiu, et al. "A survey for in-context learning."
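To make the few-shot idea concrete, here is a minimal sketch (not from the slides) of how a 2-shot prompt for the sentiment task used later in this talk can be assembled; the demonstrations convey the task format and label space, and the model is left to complete the final label:

```python
# Minimal sketch (not from the slides): assembling a 2-shot sentiment prompt.
# The demonstrations define the task format ("Review:/Sentiment:") and the
# label space ({Positive, Negative}); the LLM completes the final label.
demonstrations = [
    ("Best movie ever.", "Positive"),
    ("Terrible dishes!", "Negative"),
]
query = "I like it."

prompt = ""
for review, sentiment in demonstrations:
    prompt += f"Review: {review}\nSentiment: {sentiment}\n\n"
prompt += f"Review: {query}\nSentiment:"  # left open for the model to fill in

print(prompt)
```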
Why In-Context Learning? (continued)
- ICL provides outside-in methodologies to unravel the inner properties of LLMs.

Pros of ICL
- Flexible controllability. Example dialogue: "How many meters does a 1-kilogram object fall in 1 second?" "4.9 m." "Objects fall with a constant acceleration due to gravity, regardless of their mass. What about 10-kilogram?" "4.9 m."
- Providing incorrect examples does not necessarily affect the LLM's ability to make correct judgments. Example: whether the demonstration is ["Best movie ever." Sentiment: Positive] or the mislabeled ["Best movie ever." Sentiment: Negative], the model still labels "I like it." as Positive.
- ICL can encapsulate more information than a bare prompt.
How about GPT-4?
GPT-4 is a Large Multimodal Model (LMM): it processes visual data and both understands and generates natural language.

What is an LMM?
- It refers to visual information in conversations and answers questions about the images.
- Excellent multimodal capabilities. Example dialogue: "These two images represent two different robots, respectively." "What color is the purse?" "Blue." "How does this food taste?" "Delicious, especially the cake!" The responses incorporate an understanding of visual content.
- Not open-source: its internal workings and training processes are opaque.
Why Multimodal In-Context Learning?
- Large models have developed from single-modal to multi-modal.
- ICL expands the application scope of the model to various image/video understanding tasks, e.g. Visual Question Answering ("Q: What color is the purse? A: Blue.") and Image Captioning ("A table with bread and milk on it. Classify: Table.").
- It imitates real humans and achieves multimodal analogy capabilities.
- There is less research on multimodal in-context learning: most existing work only considers natural language processing.
- Some large multimodal models are not well adapted to in-context learning, such as MiniGPT-4, LLaVA, and mPLUG-Owl.
- Large multimodal models with good in-context learning: Flamingo, Otter, IDEFICS.
PART 02: Heuristic-based configuration strategies
"Take IC and VQA as examples"

Exploring Diverse In-Context Configurations for Image Captioning (NeurIPS 2023)
Xu Yang, Yongliang Wu, Mingzhuo Yang, Haokun Chen, Xin Geng
arXiv: https://arxiv.org/abs/2305.14800  code: https:/

Exploring Diverse In-Context Caption: Background and Motivation
- Transitioning from the single-modal to the multi-modal setting leads to increased configuration complexity.
- In the image modality: which image optimizes testing?
- In the caption modality: what is the ideal choice for model generation?
- In NLP, one only retrieves demonstrations for the query from the candidates, e.g.: Review: "Delicious food!" Sentiment: Positive / Review: "The movie is awful." Sentiment: Negative / Review: "Terrible dishes!" Sentiment: Negative, then Query: Review: "Good meal!" Sentiment: ?
- In captioning, after retrieving candidate images one must also select among the ground-truth captions ("Which pair is better?"), e.g.: "A bunch of bananas hanging from a wire." / "A bunch of bananas stuck on a line above a kitchen." / "Bushels of bananas hanging from a rope in a store."

Given a test image, how to select the proper image?
- Random Selection (RS): randomly select k examples for few-shot in-context learning.
- Similarity-based Image-Image Retrieval (SIIR): retrieve the images most similar to the test image (sketched below).
- Similarity-based Image-Caption Retrieval (SICR).
- Diversity-based Image-Image Retrieval (DIIR).
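A minimal sketch of SIIR, assuming CLIP image features from the HuggingFace transformers library are used for retrieval (the checkpoint name and helper functions are illustrative, not the paper's exact setup):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: any CLIP checkpoint works here; this one is illustrative.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_images(paths):
    """Encode images with CLIP and L2-normalise the features."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def siir(test_image, candidate_images, k=4):
    """SIIR: return the k candidate images most similar to the test image."""
    q = embed_images([test_image])          # (1, d)
    c = embed_images(candidate_images)      # (n, d)
    sims = (c @ q.T).squeeze(-1)            # cosine similarity (features are unit-norm)
    topk = sims.topk(k).indices.tolist()
    return [candidate_images[i] for i in topk]
```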
Given the selected image, how to choose the suitable caption?
- Ground Truth Caption (GTC): each image has five human-annotated captions; we choose the first caption in our experiments.
- Model Generated Caption (MGC): use a VLM or an offline captioner to generate the corresponding caption.
- Model Generated Caption as Anchor (MGCA): compute which GTC has the higher CIDEr score with the generated caption.
- Iterative Prompting (IP): generate captions, then use these captions paired with the images to iteratively prompt the VLM for enhanced captions (sketched below).
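A rough sketch of the IP loop; `vlm_caption` is a hypothetical stand-in for running the VLM on interleaved image-caption demonstrations followed by a query image:

```python
def iterative_prompting(vlm_caption, demo_images, test_image, rounds=2):
    """IP: repeatedly regenerate demonstration captions, then caption the query.

    vlm_caption(demos, query) -> str is assumed to run the VLM on interleaved
    (image, caption) demonstration pairs followed by the query image.
    """
    # Round 0: caption each demonstration image without any examples.
    captions = [vlm_caption([], img) for img in demo_images]
    for _ in range(rounds):
        demos = list(zip(demo_images, captions))
        # Refresh every demonstration caption using the current pairs.
        captions = [vlm_caption(demos, img) for img in demo_images]
    return vlm_caption(list(zip(demo_images, captions)), test_image)
```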
Exploring Diverse In-Context Caption: Caption Assignment Strategies
Model Generated Caption (MGC)
- Given an image, we can use a vision-language model or an offline captioner to generate a caption as the in-context example, e.g. "A little girl eating a piece of chocolate cake."
- This might be helpful, since the generated captions usually have the same pattern as the output.

Model Generated Caption as Anchor (MGCA)
- Once we get the generated caption, we can compute CIDEr scores to find the best caption: the model-generated caption serves as an anchor for selecting the best caption from the human-annotated ones (sketched below).
- The selected caption has the advantages of both GTC and MGC: more precise expression and a more consistent pattern.
- Example: the model-generated caption "A little girl eating a piece of chocolate cake." is used to select among the ground-truth captions "A close up of a young person at a table eating cake." / "A small girl takes a bite of chocolate cake." / "A young girl eating a piece of chocolate cake." / "A little girl taking a big bite out of chocolate cake." / "A young child enjoying a serving of cake and ice cream."
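A minimal sketch of the MGCA selection step, assuming the pycocoevalcap package for CIDEr; note that CIDEr computed over such a tiny reference set only illustrates the mechanism, not the paper's exact evaluation pipeline:

```python
from pycocoevalcap.cider.cider import Cider

def mgca_select(generated_caption, human_captions):
    """MGCA: keep the human caption scoring highest against the MGC anchor."""
    # Score every human caption as a candidate, with the model-generated
    # caption as the (single) reference.
    refs = {i: [generated_caption] for i in range(len(human_captions))}
    cands = {i: [c] for i, c in enumerate(human_captions)}
    _, per_caption = Cider().compute_score(refs, cands)
    best = max(range(len(human_captions)), key=lambda i: per_caption[i])
    return human_captions[best]
```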
Exploring Diverse In-Context Caption: Conclusions
- Similar images lead to short-cut inference. Compared settings: (1) same as the test image, (2) similar images, (3) random images. From top to bottom, the outputs move from imitation to inferring from the vision cues.
- Ensure the captions are irrelevant to the images to avoid biased inferences.
- Simpler sentence patterns are more easily recognized by the VLM (comparing model-generated captions, top, against ground-truth captions, bottom).
- Ground-truth captions use more diverse words and complex patterns, which gives more precise expression.
- Model-generated captions mention more salient objects and use simpler patterns, which gives a more consistent pattern.
- There is a synergy effect between the two modalities.
- When similar images are used, lower-quality captions can become toxic examples.
- When dissimilar images are used, the negative effects of these low-quality captions are diminished.

Results by image similarity and caption quality:
Image similarity | Caption quality | 4-shot | 8-shot | 16-shot | 32-shot | Mean
High             | High            | 95.64  | 96.62  | 97.66   | 98.32   | 97.06
Low              | High            | 72.35  | 70.10  | 72.73   | 77.76   | 73.23
High             | Low             | 65.98  | 69.52  | 71.88   | 73.49   | 70.22
Low              | Low             | 70.45  | 73.92  | 74.83   | 77.00   | 74.05
How to Configure Good In-Context Sequence for Visual Question Answering
Li Li, Jiawei Peng, Huiyi Chen, Chongyang Gao, Xu Yang
arXiv: https://arxiv.org/abs/2312.01571  code: https:/

How to Configure Good In-Context Sequence for VQA: Background
- Explore effective in-context example configuration strategies: retrieving demonstrations with similar images & texts, adding instructions to the demonstrations, and using pseudo answers during retrieval before querying the LVLM.
- Gain a better understanding of the inner properties of the LVLM:
  - TR (task recognition) is more crucial than TL (task learning).
  - Short-cut inference exists in the LVLM.
  - The image and language decoders are not totally compatible.
How to Configure Good In-Context Sequence for VQA: Approach
Retrieving in-context examples. Each demonstration is a triplet (image, question, answer) drawn from a supporting set; the query triplet has a ground-truth answer.
- Random Sampling (RS): randomly sample k-shot demonstrations.
- Retrieving via Similar Image (SI): top-k demonstrations by CLIP embedding of the query image.
- Retrieving via Similar Questions (SQ): top-k demonstrations by CLIP embedding of the query question.
- Retrieving via Similar Question & Answer (SQA): retrieve question-answer pairs by CLIP embedding of the query question & ground-truth answer.
- Retrieving via Similar Question & Pseudo Answer (SQPA): first obtain a pseudo answer with RS/SI, then retrieve the top-k question-answer pairs by CLIP embedding of the query question & pseudo answer (sketched below).
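A sketch of the two-round SQPA procedure; `lvlm_answer` and `retrieve_by_text` are hypothetical helpers (the latter e.g. backed by CLIP text embeddings), standing in for the components in the slide's diagram:

```python
def sqpa(lvlm_answer, retrieve_by_text, supporting_set, query, k=4):
    """SQPA: retrieve with "query question + pseudo answer" in a second round.

    lvlm_answer(demos, query) -> str and retrieve_by_text(text, pool, k) are
    hypothetical helpers wrapping the LVLM and a text retriever.
    """
    # Round 1: obtain a pseudo answer with a cheap configuration (RS or SI).
    pseudo_answer = lvlm_answer(supporting_set[:k], query)
    # Round 2: retrieve demonstrations similar to question + pseudo answer.
    key = query["question"] + " " + pseudo_answer
    demos = retrieve_by_text(key, supporting_set, k)
    return lvlm_answer(demos, query)
```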
How to Configure Good In-Context Sequence for VQA: Approach
Manipulating examples. Mismatching the triplet:
- Mismatching Image (MI)
- Mismatching Answer (MA)
- Mismatching Question-Answer pair (MQA)
Using instructions, e.g. "According to the previous question and answer pair, answer the final question.", followed by demonstrations such as: Question: What number is on the bus? Short Answer: 284. Question: Where would a taxi park to wait for a customer? Short Answer: curb. Question: What is the man doing in the street? Short Answer:
How to Configure Good In-Context Sequence for VQA: Approach
Extending the TR and TL hypothesis to the VL domain.

Task Recognition (TR)
- Recognizes the distribution of the task and applies the pre-trained priors of the LLM.
- Involves the recall of pre-trained visual/language knowledge; identifies the task format, input distribution, and label space from the demonstrations.
[Figure: COCO_val2014_000000142722.jpg with the question "What color is the dog?", illustrating visual TR, linguistic TR, and format TR through the demonstrations' ground-truth answers.]

Task Learning (TL)
- Learns the mapping relationship between QA pairs from the demonstrations.
- Treats the QAs from demonstrations as "training samples": an implicit learning process analogous to explicit fine-tuning.

In ICL, TR and TL coexist simultaneously.
How to Configure Good In-Context Sequence for VQA: Analysis
Three important inner properties of the LVLM during ICL.

1. Limited TL capabilities
- As the number of shots increases, the improvement of the model diminishes.
- Replacing incorrect answers in demonstrations does not significantly impact the model's performance.
- Disentangling TR and TL shows that the accuracy of TR is significantly higher than that of TL.

2. The presence of a short-cut effect
- Example: demonstration Q: "What is the design on the sheets?" A: "alligators and bears"; SQ query Q: "What is the design of the bed cover?" is answered "alligators and bears" (ground truth: zebra).
- Example: demonstration Q: "What is the scientific name of this leaf?" A: "tulip"; the SQ query with the same question is answered "tulip" (ground truth: camellia).
Copy rate (%), i.e. how often the model copies an answer from its demonstrations (sketched below):
Method           | OFv1  | OFv2
RS               | 43.64 | 37.34
SI               | 50.44 | 54.38
SQ               | 77.26 | 79.84
SQA              | 87.74 | 89.47
SQA (sole)       | 47.39 | 45.82
SQA (sole wrong) | 37.07 | 45.71
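A sketch of how such a copy rate can be computed, assuming it is the fraction of queries whose predicted answer already appears among the demonstration answers (the paper's exact matching rule may differ):

```python
def copy_rate(records):
    """records: list of (predicted_answer, list_of_demonstration_answers)."""
    copied = sum(
        pred.strip().lower() in {a.strip().lower() for a in demo_answers}
        for pred, demo_answers in records
    )
    return 100.0 * copied / len(records)

# e.g. copy_rate([("alligators and bears", ["alligators and bears"])]) -> 100.0
```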
3. Partial compatibility between vision and language modules
- Linguistic TR plays a more substantial role than visual TR.
Method           | Dataset | 4-shot | 8-shot | 16-shot
RS (OFv1)        | VQAv2   | 44.56  | 47.38  | 48.71
instruct1 (OFv1) | VQAv2   | 43.75  | 46.91  | 48.67
RS (OFv2)        | VQAv2   | 48.82  | 51.05  | 50.89
instruct1 (OFv2) | VQAv2   | 49.93  | 52.71  | 50.95

- Some language reasoning abilities lose efficacy in the VL case.
How to Configure Good In-Context Sequence for VQA: Analysis
Effective configuration strategies.
[Figure: six bar-chart panels comparing RS, SI, SI-Q, SQ, SQ-I, and SQPA(SI-4) on (a) OFv1-VQAv2, (b) OFv2-VQAv2, (c) OFv1-VizWiz, (d) OFv2-VizWiz, (e) OFv1-OK-VQA, (f) OFv2-OK-VQA.]
- Similar images and texts lead to better performance.
- Similar images compensate for visual information that is missed or incorrectly recognized.
- Similar texts bring unstable improvements due to the presence of the short-cut.
- Instructions enhance the performance of linguistically advanced models by increasing the information density of the demonstrations, but they do not yield significant improvements for models with inferior language encoders (a prompt-composition sketch follows the table below).
- Instruct1: "According to the previous question and answer pair, answer the final question."
- Instruct2: "Consider the semantic relationship between the question and the image."
- Instruct3: "You will be engaged in a two-phase task. Phase 1: Absorb the information from a series of image-text pairs. Phase 2: Use that context, combined with an upcoming image and your own database of knowledge, to accurately answer a subsequent question."

Method    | Dataset | 4-shot | 8-shot | 16-shot
RS        | VQAv2   | 48.82  | 51.05  | 50.89
instruct1 | VQAv2   | 49.93  | 52.71  | 50.95
RS        | OK-VQA  | 34.82  | 38.54  | 39.55
instruct1 | OK-VQA  | 35.72  | 39.38  | 40.46
instruct2 | OK-VQA  | 36.45  | 40.17  | 41.11
instruct3 | OK-VQA  | 35.53  | 40.19  | 40.02
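As a concrete illustration, instruction-augmented demonstrations can be composed as below; the prompt template (the `<image>` markers and "Short Answer:" fields) is an assumption extrapolated from the examples above, not quoted from the paper:

```python
def build_vqa_prompt(instruction, demos, query_question):
    """Prepend an instruction to interleaved image-QA demonstrations."""
    parts = [instruction]
    for d in demos:
        parts.append(f"<image> Question: {d['question']} Short Answer: {d['answer']}")
    parts.append(f"<image> Question: {query_question} Short Answer:")
    return "\n".join(parts)

prompt = build_vqa_prompt(
    "According to the previous question and answer pair, answer the final question.",
    [{"question": "What number is on the bus?", "answer": "284"},
     {"question": "Where would a taxi park to wait for a customer?", "answer": "curb"}],
    "What is the man doing in the street?",
)
```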
- Pseudo answers have potential for expeditious enhancement of performance (first round: the LVLM produces a pseudo answer; second round: retrieve demonstrations with it).

Method      | Dataset | 4-shot
RS          | VQAv2   | 48.82
SQPA (RS-4) | VQAv2   | 49.85
SI          | VQAv2   | 50.36
SQPA (SI-4) | VQAv2   | 50.57
RS          | VizWiz  | 22.07
SQPA (RS-4) | VizWiz  | 30.02
SI          | VizWiz  | 36.30
SQPA (SI-4) | VizWiz  | 38.37
RS          | OK-VQA  | 34.82
SQPA (RS-4) | OK-VQA  | 38.92
SI          | OK-VQA  | 36.46
SQPA (SI-4) | OK-VQA  | 39.34

PART 03: Learning-based configuration strategies
"Take IC and VQA as examples"
50、delingYingzhe Peng,Xu Yang,Haoxuan Ma,Shuo Xu,Chi Zhang,Yucheng Han,Hanwang ZhangarXiv:https:/arxiv.org/abs/2312.10104code:https:/ Require selecting and reordering ICD sequences.ICD-LM:Traditional Configure ICD Methods3 Learning-based 42ICD-LM:Traditional Configure ICD Methodsl Different LVLMs have
ICD-LM: ICD Language Model
Based on the following observation: obtaining an optimal ICD sequence can be likened to sentence generation in a language model.
- One selects the most fluent "word" (an ICD) from a "vocabulary" (the ICD set) one by one (sketched below).
- Using a language model enables learning to select and arrange optimal ICDs.
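A sketch of this "ICD generation" view at inference time; `icd_lm_logits` is a hypothetical callable that scores every ICD in the vocabulary given the query and the partial sequence, wrapping the trained ICD-LM:

```python
import torch

def configure_icds(icd_lm_logits, query, shots=4):
    """Greedy "decoding" of an ICD sequence for one query.

    icd_lm_logits(query, partial_sequence) -> 1-D tensor over the ICD set is a
    hypothetical callable wrapping the trained ICD-LM.
    """
    sequence = []
    for _ in range(shots):
        logits = icd_lm_logits(query, sequence).clone()
        # Forbid re-selecting an ICD that is already in the sequence.
        logits[torch.tensor(sequence, dtype=torch.long)] = float("-inf")
        sequence.append(int(logits.argmax()))
    return sequence
```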
ICD-LM: Dataset construction
a) Anchor set selection: anchor samples simulate a query sample during testing; the other training samples are used as the supporting set.
b) Sub-supporting set sampling: reduces the time complexity.
c) Use a scorer to evaluate each candidate ICD sequence and obtain the optimal ICD sequence with a greedy algorithm (sketched below).
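A sketch of step (c) with beam size 1, i.e. pure greedy search; `score` is a hypothetical scorer that runs the LVLM with a candidate ICD sequence on the anchor sample (e.g. prediction confidence or CIDEr):

```python
def build_optimal_sequence(score, anchor, sub_supporting_set, shots=2):
    """Greedily grow the highest-scoring ICD sequence for one anchor sample.

    score(sequence, anchor) -> float is a hypothetical scorer that runs the
    LVLM with the candidate demonstrations on the anchor (query) sample.
    """
    sequence = []
    for _ in range(shots):
        remaining = [d for d in sub_supporting_set if d not in sequence]
        # Try appending every remaining candidate; keep the best-scoring one.
        best = max(remaining, key=lambda d: score(sequence + [d], anchor))
        sequence.append(best)
    return sequence  # one (anchor, ICD sequence) training pair for the ICD-LM
```

With beam size b > 1, the same loop would keep the b best partial sequences per step, which is the beam-size factor ablated later in the talk.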
ICD-LM: Training the LM
We use CLIP to extract multimodal features as the embeddings of the LM. The final embedding is the sum of:
a) a learnable embedding (randomly initialized),
b) the image embedding, and
c) the text embedding (sketched below).
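A minimal PyTorch sketch of this embedding sum; the dimensions, names, and projection layers are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ICDEmbedding(nn.Module):
    """Sum of a learnable embedding and projected CLIP image/text features."""

    def __init__(self, icd_vocab_size, lm_dim, clip_dim=512):
        super().__init__()
        self.learnable = nn.Embedding(icd_vocab_size, lm_dim)  # random init
        self.img_proj = nn.Linear(clip_dim, lm_dim)
        self.txt_proj = nn.Linear(clip_dim, lm_dim)

    def forward(self, icd_ids, clip_img_feats, clip_txt_feats):
        # icd_ids: (seq,); clip_*_feats: (seq, clip_dim) frozen CLIP features
        return (
            self.learnable(icd_ids)
            + self.img_proj(clip_img_feats)
            + self.txt_proj(clip_txt_feats)
        )
```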
ICD-LM: Experiment settings
Compared methods:
1. Random Sample (RS)
2. Similarity-based retrieval methods: Similarity-based Image-Image Retrieval (SIIR), Similarity-based Text-Text Retrieval (STTR), Similarity-based Image-Text Retrieval (SITR)

ICD-LM: Main result
We construct a 2-shot ICD-configuration dataset to train the ICD-LM.
- ICD-LM achieves the best performance compared with the other methods.
- The trained ICD-LM also excels at configuring 4-shot ICDs, showing strong length-extrapolation ability.
[Figure: bar charts of diverse ICL methods, averaged over different shot settings; on IC (CIDEr): RS, SITR, SIIR, ICD-LM; on VQA (accuracy): RS, SITR, SIIR, STTR, ICD-LM.]
ICD-LM: Ablations on the dataset construction
We select three factors for our ablation studies:
1. Beam size b.
2. The number of samples n in the anchor set.
3. The sampling method of the sub-supporting set:
   - Random: select randomly from the total supporting set.
   - Similar Text (Sim-T): select the samples with the highest textual similarity to the anchor sample from the total supporting set.
   - Similar Image (Sim-I): select the samples with the highest visual similarity to the anchor sample from the total supporting set.
1. Beam size
- Increasing the beam size correlates positively with ICD-LM performance.
- An excessively large beam size can negatively impact performance: lower-scoring ICD sequences introduced by a large beam mislead the ICD-LM during training.
[Figure: accuracy on VQA and CIDEr on IC for beam sizes b=1, b=5, b=10.]
2. The number of samples in the anchor set
- Using more anchor samples improves the interpolation performance on both IC and VQA.
- However, on IC, the extrapolation performance decays when n grows from 3000 to 5000.
[Figure: accuracy on VQA and CIDEr on IC for n=1000, n=3000, n=5000.]
3. The sampling method of the sub-supporting set
- We find Random is the best on both IC and VQA.
- We suppose this is because selecting ICDs similar to the anchor sample damages the diversity of the ICD sequence.
[Figure: accuracy on VQA and CIDEr on IC for Sim-I, Sim-T, Random.]

ICD-LM: Ablations on the scorer structure
- Using task-specific scorers increases the interpolation performance.
- Accuracy is not suitable as a scorer, since it is a binary metric.
[Figure: CIDEr of diverse scorers on IC, comparing RS, Confidence, CIDEr.]

THANKS