In-Context Learning in Multimodal Large Language Models
Xu Yang, Southeast University

Speaker: Xu Yang, Associate Professor, Southeast University. Dr. Xu Yang received his Ph.D. in June 2021 from the School of Computer Science and Engineering, Nanyang Technological University, advised by Profs. Jianfei Cai and Hanwang Zhang. He is now an Associate Professor in the School of Computer Science and Engineering / School of Software / School of Artificial Intelligence at Southeast University, and Deputy Director of Southeast University's Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Ministry of Education). His research focuses on applications of vision-language multimodal large models and on Learngene, a new training-and-deployment paradigm for large models.

CONTENTS
1. Background
2. Heuristic-based configuration strategies
3. Learning-based configuration strategies
PART 01: Background
"Why do we need In-Context Learning?"

The Development of GPT
[Figure: timeline of the GPT family. GPT (2018): pre-training plus fine-tuning on data; GPT-2 (2019): 1.5B parameters, prompt engineering; GPT-3 (2020): 175B parameters, in-context learning; GPT-4 (2023): multimodal (image, text, video) with in-context examples.]
GPT-2's Capability of Prompt Engineering
- GPT-2 exhibits a distinctive feature known as "prompt engineering".
- This can be compared to the architecture of modern computers, where both data and commands exist in the form of 0s and 1s.

GPT-3's Capability of In-Context Learning
- GPT-3 possesses a unique capability known as "in-context learning" (ICL).
- It learns the representation of a task from the provided in-context examples.

Why In-Context Learning?
- In-context learning is a specialized form of prompt engineering: it unlocks the potential of LLMs and yields precise responses.
- It adapts to a task using only a few examples (few-shot).

References: Liu, Pengfei, et al. "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing." Dong, Qingxiu, et al. "A survey for in-context learning."
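To make the few-shot idea concrete, here is a minimal sketch (not from the slides) of how a 2-shot prompt for the sentiment task used later in this talk can be assembled; the demonstrations convey the task format and label space, and the model is left to complete the final label:

```python
# Minimal sketch (not from the slides): assembling a 2-shot sentiment prompt.
# The demonstrations define the task format ("Review:/Sentiment:") and the
# label space ({Positive, Negative}); the LLM completes the final label.
demonstrations = [
    ("Best movie ever.", "Positive"),
    ("Terrible dishes!", "Negative"),
]
query = "I like it."

prompt = ""
for review, sentiment in demonstrations:
    prompt += f"Review: {review}\nSentiment: {sentiment}\n\n"
prompt += f"Review: {query}\nSentiment:"  # left open for the model to fill in

print(prompt)
```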
Why In-Context Learning? (continued)
- ICL provides outside-in methodologies to unravel the inner properties of LLMs.

Pros of ICL
- Flexible controllability. Example dialogue: "How many meters does a 1-kilogram object fall in 1 second?" "4.9 m." "Objects fall with a constant acceleration due to gravity, regardless of their mass. What about 10-kilogram?" "4.9 m."
- Providing incorrect examples does not necessarily affect the LLM's ability to make correct judgments. Example: whether the demonstration is ["Best movie ever." Sentiment: Positive] or the mislabeled ["Best movie ever." Sentiment: Negative], the model still labels "I like it." as Positive.
- ICL can encapsulate more information than a bare prompt.
How about GPT-4?
GPT-4 is a Large Multimodal Model (LMM): it processes visual data and both understands and generates natural language.

What is an LMM?
- It refers to visual information in conversations and answers questions about the images.
- Excellent multimodal capabilities. Example dialogue: "These two images represent two different robots, respectively." "What color is the purse?" "Blue." "How does this food taste?" "Delicious, especially the cake!" The responses incorporate an understanding of visual content.
- Not open-source: its internal workings and training processes are opaque.
Why Multimodal In-Context Learning?
- Large models have developed from single-modal to multi-modal.
- ICL expands the application scope of the model to various image/video understanding tasks, e.g. Visual Question Answering ("Q: What color is the purse? A: Blue.") and Image Captioning ("A table with bread and milk on it. Classify: Table.").
- It imitates real humans and achieves multimodal analogy capabilities.
- There is less research on multimodal in-context learning: most existing work only considers natural language processing.
- Some large multimodal models are not well adapted to in-context learning, such as MiniGPT-4, LLaVA, and mPLUG-Owl.
- Large multimodal models with good in-context learning: Flamingo, Otter, IDEFICS.
PART 02: Heuristic-based configuration strategies
"Take IC and VQA as examples"

Exploring Diverse In-Context Configurations for Image Captioning (NeurIPS 2023)
Xu Yang, Yongliang Wu, Mingzhuo Yang, Haokun Chen, Xin Geng
arXiv: https://arxiv.org/abs/2305.14800  code: https:/

Exploring Diverse In-Context Caption: Background and Motivation
- Transitioning from the single-modal to the multi-modal setting leads to increased configuration complexity.
- In the image modality: which image optimizes testing?
- In the caption modality: what is the ideal choice for model generation?
- In NLP, one only retrieves demonstrations for the query from the candidates, e.g.: Review: "Delicious food!" Sentiment: Positive / Review: "The movie is awful." Sentiment: Negative / Review: "Terrible dishes!" Sentiment: Negative, then Query: Review: "Good meal!" Sentiment: ?
- In captioning, after retrieving candidate images one must also select among the ground-truth captions ("Which pair is better?"), e.g.: "A bunch of bananas hanging from a wire." / "A bunch of bananas stuck on a line above a kitchen." / "Bushels of bananas hanging from a rope in a store."

Given a test image, how to select the proper image?
- Random Selection (RS): randomly select k examples for few-shot in-context learning.
- Similarity-based Image-Image Retrieval (SIIR): retrieve the images most similar to the test image (sketched below).
- Similarity-based Image-Caption Retrieval (SICR).
- Diversity-based Image-Image Retrieval (DIIR).
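A minimal sketch of SIIR, assuming CLIP image features from the HuggingFace transformers library are used for retrieval (the checkpoint name and helper functions are illustrative, not the paper's exact setup):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: any CLIP checkpoint works here; this one is illustrative.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_images(paths):
    """Encode images with CLIP and L2-normalise the features."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def siir(test_image, candidate_images, k=4):
    """SIIR: return the k candidate images most similar to the test image."""
    q = embed_images([test_image])          # (1, d)
    c = embed_images(candidate_images)      # (n, d)
    sims = (c @ q.T).squeeze(-1)            # cosine similarity (features are unit-norm)
    topk = sims.topk(k).indices.tolist()
    return [candidate_images[i] for i in topk]
```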
Given the selected image, how to choose the suitable caption?
- Ground Truth Caption (GTC): each image has five human-annotated captions; we choose the first caption in our experiments.
- Model Generated Caption (MGC): use a VLM or an offline captioner to generate the corresponding caption.
- Model Generated Caption as Anchor (MGCA): compute which GTC has the higher CIDEr score with the generated caption.
- Iterative Prompting (IP): generate captions, then use these captions paired with the images to iteratively prompt the VLM for enhanced captions (sketched below).
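A rough sketch of the IP loop; `vlm_caption` is a hypothetical stand-in for running the VLM on interleaved image-caption demonstrations followed by a query image:

```python
def iterative_prompting(vlm_caption, demo_images, test_image, rounds=2):
    """IP: repeatedly regenerate demonstration captions, then caption the query.

    vlm_caption(demos, query) -> str is assumed to run the VLM on interleaved
    (image, caption) demonstration pairs followed by the query image.
    """
    # Round 0: caption each demonstration image without any examples.
    captions = [vlm_caption([], img) for img in demo_images]
    for _ in range(rounds):
        demos = list(zip(demo_images, captions))
        # Refresh every demonstration caption using the current pairs.
        captions = [vlm_caption(demos, img) for img in demo_images]
    return vlm_caption(list(zip(demo_images, captions)), test_image)
```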
Exploring Diverse In-Context Caption: Caption Assignment Strategies
Model Generated Caption (MGC)
- Given an image, we can use a vision-language model or an offline captioner to generate a caption as the in-context example, e.g. "A little girl eating a piece of chocolate cake."
- This might be helpful, since the generated captions usually have the same pattern as the output.

Model Generated Caption as Anchor (MGCA)
- Once we get the generated caption, we can compute CIDEr scores to find the best caption: the model-generated caption serves as an anchor for selecting the best caption from the human-annotated ones (sketched below).
- The selected caption has the advantages of both GTC and MGC: more precise expression and a more consistent pattern.
- Example: the model-generated caption "A little girl eating a piece of chocolate cake." is used to select among the ground-truth captions "A close up of a young person at a table eating cake." / "A small girl takes a bite of chocolate cake." / "A young girl eating a piece of chocolate cake." / "A little girl taking a big bite out of chocolate cake." / "A young child enjoying a serving of cake and ice cream."
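A minimal sketch of the MGCA selection step, assuming the pycocoevalcap package for CIDEr; note that CIDEr computed over such a tiny reference set only illustrates the mechanism, not the paper's exact evaluation pipeline:

```python
from pycocoevalcap.cider.cider import Cider

def mgca_select(generated_caption, human_captions):
    """MGCA: keep the human caption scoring highest against the MGC anchor."""
    # Score every human caption as a candidate, with the model-generated
    # caption as the (single) reference.
    refs = {i: [generated_caption] for i in range(len(human_captions))}
    cands = {i: [c] for i, c in enumerate(human_captions)}
    _, per_caption = Cider().compute_score(refs, cands)
    best = max(range(len(human_captions)), key=lambda i: per_caption[i])
    return human_captions[best]
```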
Exploring Diverse In-Context Caption: Conclusions
- Similar images lead to short-cut inference. Compared settings: (1) same as the test image, (2) similar images, (3) random images. From top to bottom, the outputs move from imitation to inferring from the vision cues.
- Ensure the captions are irrelevant to the images to avoid biased inferences.
- Simpler sentence patterns are more easily recognized by the VLM (comparing model-generated captions, top, against ground-truth captions, bottom).
- Ground-truth captions use more diverse words and complex patterns, which gives more precise expression.
- Model-generated captions mention more salient objects and use simpler patterns, which gives a more consistent pattern.
- There is a synergy effect between the two modalities.
- When similar images are used, lower-quality captions can become toxic examples.
- When dissimilar images are used, the negative effects of these low-quality captions are diminished.

Results by image similarity and caption quality:
Image similarity | Caption quality | 4-shot | 8-shot | 16-shot | 32-shot | Mean
High             | High            | 95.64  | 96.62  | 97.66   | 98.32   | 97.06
Low              | High            | 72.35  | 70.10  | 72.73   | 77.76   | 73.23
High             | Low             | 65.98  | 69.52  | 71.88   | 73.49   | 70.22
Low              | Low             | 70.45  | 73.92  | 74.83   | 77.00   | 74.05
How to Configure Good In-Context Sequence for Visual Question Answering
Li Li, Jiawei Peng, Huiyi Chen, Chongyang Gao, Xu Yang
arXiv: https://arxiv.org/abs/2312.01571  code: https:/

How to Configure Good In-Context Sequence for VQA: Background
- Explore effective in-context example configuration strategies: retrieving demonstrations with similar images & texts, adding instructions to the demonstrations, and using pseudo answers during retrieval before querying the LVLM.
- Gain a better understanding of the inner properties of the LVLM:
  - TR (task recognition) is more crucial than TL (task learning).
  - Short-cut inference exists in the LVLM.
  - The image and language decoders are not totally compatible.
How to Configure Good In-Context Sequence for VQA: Approach
Retrieving in-context examples. Each demonstration is a triplet (image, question, answer) drawn from a supporting set; the query triplet has a ground-truth answer.
- Random Sampling (RS): randomly sample k-shot demonstrations.
- Retrieving via Similar Image (SI): top-k demonstrations by CLIP embedding of the query image.
- Retrieving via Similar Questions (SQ): top-k demonstrations by CLIP embedding of the query question.
- Retrieving via Similar Question & Answer (SQA): retrieve question-answer pairs by CLIP embedding of the query question & ground-truth answer.
- Retrieving via Similar Question & Pseudo Answer (SQPA): first obtain a pseudo answer with RS/SI, then retrieve the top-k question-answer pairs by CLIP embedding of the query question & pseudo answer (sketched below).
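A sketch of the two-round SQPA procedure; `lvlm_answer` and `retrieve_by_text` are hypothetical helpers (the latter e.g. backed by CLIP text embeddings), standing in for the components in the slide's diagram:

```python
def sqpa(lvlm_answer, retrieve_by_text, supporting_set, query, k=4):
    """SQPA: retrieve with "query question + pseudo answer" in a second round.

    lvlm_answer(demos, query) -> str and retrieve_by_text(text, pool, k) are
    hypothetical helpers wrapping the LVLM and a text retriever.
    """
    # Round 1: obtain a pseudo answer with a cheap configuration (RS or SI).
    pseudo_answer = lvlm_answer(supporting_set[:k], query)
    # Round 2: retrieve demonstrations similar to question + pseudo answer.
    key = query["question"] + " " + pseudo_answer
    demos = retrieve_by_text(key, supporting_set, k)
    return lvlm_answer(demos, query)
```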
How to Configure Good In-Context Sequence for VQA: Approach
Manipulating examples. Mismatching the triplet:
- Mismatching Image (MI)
- Mismatching Answer (MA)
- Mismatching Question-Answer pair (MQA)
Using instructions, e.g. "According to the previous question and answer pair, answer the final question.", followed by demonstrations such as: Question: What number is on the bus? Short Answer: 284. Question: Where would a taxi park to wait for a customer? Short Answer: curb. Question: What is the man doing in the street? Short Answer:
How to Configure Good In-Context Sequence for VQA: Approach
Extending the TR and TL hypothesis to the VL domain.

Task Recognition (TR)
- Recognizes the distribution of the task and applies the pre-trained priors of the LLM.
- Involves the recall of pre-trained visual/language knowledge; identifies the task format, input distribution, and label space from the demonstrations.
[Figure: COCO_val2014_000000142722.jpg with the question "What color is the dog?", illustrating visual TR, linguistic TR, and format TR through the demonstrations' ground-truth answers.]

Task Learning (TL)
- Learns the mapping relationship between QA pairs from the demonstrations.
- Treats the QAs from demonstrations as "training samples": an implicit learning process analogous to explicit fine-tuning.

In ICL, TR and TL coexist simultaneously.
How to Configure Good In-Context Sequence for VQA: Analysis
Three important inner properties of the LVLM during ICL.

1. Limited TL capabilities
- As the number of shots increases, the improvement of the model diminishes.
- Replacing incorrect answers in demonstrations does not significantly impact the model's performance.
- Disentangling TR and TL shows that the accuracy of TR is significantly higher than that of TL.

2. The presence of a short-cut effect
- Example: demonstration Q: "What is the design on the sheets?" A: "alligators and bears"; SQ query Q: "What is the design of the bed cover?" is answered "alligators and bears" (ground truth: zebra).
- Example: demonstration Q: "What is the scientific name of this leaf?" A: "tulip"; the SQ query with the same question is answered "tulip" (ground truth: camellia).
Copy rate (%), i.e. how often the model copies an answer from its demonstrations (sketched below):
Method           | OFv1  | OFv2
RS               | 43.64 | 37.34
SI               | 50.44 | 54.38
SQ               | 77.26 | 79.84
SQA              | 87.74 | 89.47
SQA (sole)       | 47.39 | 45.82
SQA (sole wrong) | 37.07 | 45.71
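A sketch of how such a copy rate can be computed, assuming it is the fraction of queries whose predicted answer already appears among the demonstration answers (the paper's exact matching rule may differ):

```python
def copy_rate(records):
    """records: list of (predicted_answer, list_of_demonstration_answers)."""
    copied = sum(
        pred.strip().lower() in {a.strip().lower() for a in demo_answers}
        for pred, demo_answers in records
    )
    return 100.0 * copied / len(records)

# e.g. copy_rate([("alligators and bears", ["alligators and bears"])]) -> 100.0
```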
3. Partial compatibility between vision and language modules
- Linguistic TR plays a more substantial role than visual TR.
Method           | Dataset | 4-shot | 8-shot | 16-shot
RS (OFv1)        | VQAv2   | 44.56  | 47.38  | 48.71
instruct1 (OFv1) | VQAv2   | 43.75  | 46.91  | 48.67
RS (OFv2)        | VQAv2   | 48.82  | 51.05  | 50.89
instruct1 (OFv2) | VQAv2   | 49.93  | 52.71  | 50.95

- Some language reasoning abilities lose efficacy in the VL case.
How to Configure Good In-Context Sequence for VQA: Analysis
Effective configuration strategies.
[Figure: six bar-chart panels comparing RS, SI, SI-Q, SQ, SQ-I, and SQPA(SI-4) on (a) OFv1-VQAv2, (b) OFv2-VQAv2, (c) OFv1-VizWiz, (d) OFv2-VizWiz, (e) OFv1-OK-VQA, (f) OFv2-OK-VQA.]
- Similar images and texts lead to better performance.
- Similar images compensate for visual information that is missed or incorrectly recognized.
- Similar texts bring unstable improvements due to the presence of the short-cut.
- Instructions enhance the performance of linguistically advanced models by increasing the information density of the demonstrations, but they do not yield significant improvements for models with inferior language encoders (a prompt-composition sketch follows the table below).
- Instruct1: "According to the previous question and answer pair, answer the final question."
- Instruct2: "Consider the semantic relationship between the question and the image."
- Instruct3: "You will be engaged in a two-phase task. Phase 1: Absorb the information from a series of image-text pairs. Phase 2: Use that context, combined with an upcoming image and your own database of knowledge, to accurately answer a subsequent question."

Method    | Dataset | 4-shot | 8-shot | 16-shot
RS        | VQAv2   | 48.82  | 51.05  | 50.89
instruct1 | VQAv2   | 49.93  | 52.71  | 50.95
RS        | OK-VQA  | 34.82  | 38.54  | 39.55
instruct1 | OK-VQA  | 35.72  | 39.38  | 40.46
instruct2 | OK-VQA  | 36.45  | 40.17  | 41.11
instruct3 | OK-VQA  | 35.53  | 40.19  | 40.02
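As a concrete illustration, instruction-augmented demonstrations can be composed as below; the prompt template (the `<image>` markers and "Short Answer:" fields) is an assumption extrapolated from the examples above, not quoted from the paper:

```python
def build_vqa_prompt(instruction, demos, query_question):
    """Prepend an instruction to interleaved image-QA demonstrations."""
    parts = [instruction]
    for d in demos:
        parts.append(f"<image> Question: {d['question']} Short Answer: {d['answer']}")
    parts.append(f"<image> Question: {query_question} Short Answer:")
    return "\n".join(parts)

prompt = build_vqa_prompt(
    "According to the previous question and answer pair, answer the final question.",
    [{"question": "What number is on the bus?", "answer": "284"},
     {"question": "Where would a taxi park to wait for a customer?", "answer": "curb"}],
    "What is the man doing in the street?",
)
```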
- Pseudo answers have potential for expeditious enhancement of performance (first round: the LVLM produces a pseudo answer; second round: retrieve demonstrations with it).

Method      | Dataset | 4-shot
RS          | VQAv2   | 48.82
SQPA (RS-4) | VQAv2   | 49.85
SI          | VQAv2   | 50.36
SQPA (SI-4) | VQAv2   | 50.57
RS          | VizWiz  | 22.07
SQPA (RS-4) | VizWiz  | 30.02
SI          | VizWiz  | 36.30
SQPA (SI-4) | VizWiz  | 38.37
RS          | OK-VQA  | 34.82
SQPA (RS-4) | OK-VQA  | 38.92
SI          | OK-VQA  | 36.46
SQPA (SI-4) | OK-VQA  | 39.34

PART 03: Learning-based configuration strategies
"Take IC and VQA as examples"
50、delingYingzhe Peng,Xu Yang,Haoxuan Ma,Shuo Xu,Chi Zhang,Yucheng Han,Hanwang ZhangarXiv:https:/arxiv.org/abs/2312.10104code:https:/ Require selecting and reordering ICD sequences.ICD-LM:Traditional Configure ICD Methods3 Learning-based 42ICD-LM:Traditional Configure ICD Methodsl Different LVLMs have
ICD-LM: ICD Language Model
Based on the following observation: obtaining an optimal ICD sequence can be likened to sentence generation in a language model.
- One selects the most fluent "word" (an ICD) from a "vocabulary" (the ICD set) one by one (sketched below).
- Using a language model enables learning to select and arrange optimal ICDs.
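A sketch of this "ICD generation" view at inference time; `icd_lm_logits` is a hypothetical callable that scores every ICD in the vocabulary given the query and the partial sequence, wrapping the trained ICD-LM:

```python
import torch

def configure_icds(icd_lm_logits, query, shots=4):
    """Greedy "decoding" of an ICD sequence for one query.

    icd_lm_logits(query, partial_sequence) -> 1-D tensor over the ICD set is a
    hypothetical callable wrapping the trained ICD-LM.
    """
    sequence = []
    for _ in range(shots):
        logits = icd_lm_logits(query, sequence).clone()
        # Forbid re-selecting an ICD that is already in the sequence.
        logits[torch.tensor(sequence, dtype=torch.long)] = float("-inf")
        sequence.append(int(logits.argmax()))
    return sequence
```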
ICD-LM: Dataset construction
a) Anchor set selection: anchor samples simulate a query sample during testing; the other training samples are used as the supporting set.
b) Sub-supporting set sampling: reduces the time complexity.
c) Use a scorer to evaluate each candidate ICD sequence and obtain the optimal ICD sequence with a greedy algorithm (sketched below).
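A sketch of step (c) with beam size 1, i.e. pure greedy search; `score` is a hypothetical scorer that runs the LVLM with a candidate ICD sequence on the anchor sample (e.g. prediction confidence or CIDEr):

```python
def build_optimal_sequence(score, anchor, sub_supporting_set, shots=2):
    """Greedily grow the highest-scoring ICD sequence for one anchor sample.

    score(sequence, anchor) -> float is a hypothetical scorer that runs the
    LVLM with the candidate demonstrations on the anchor (query) sample.
    """
    sequence = []
    for _ in range(shots):
        remaining = [d for d in sub_supporting_set if d not in sequence]
        # Try appending every remaining candidate; keep the best-scoring one.
        best = max(remaining, key=lambda d: score(sequence + [d], anchor))
        sequence.append(best)
    return sequence  # one (anchor, ICD sequence) training pair for the ICD-LM
```

With beam size b > 1, the same loop would keep the b best partial sequences per step, which is the beam-size factor ablated later in the talk.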
ICD-LM: Training the LM
We use CLIP to extract multimodal features as the embeddings of the LM. The final embedding is the sum of:
a) a learnable embedding (randomly initialized),
b) the image embedding, and
c) the text embedding (sketched below).
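A minimal PyTorch sketch of this embedding sum; the dimensions, names, and projection layers are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ICDEmbedding(nn.Module):
    """Sum of a learnable embedding and projected CLIP image/text features."""

    def __init__(self, icd_vocab_size, lm_dim, clip_dim=512):
        super().__init__()
        self.learnable = nn.Embedding(icd_vocab_size, lm_dim)  # random init
        self.img_proj = nn.Linear(clip_dim, lm_dim)
        self.txt_proj = nn.Linear(clip_dim, lm_dim)

    def forward(self, icd_ids, clip_img_feats, clip_txt_feats):
        # icd_ids: (seq,); clip_*_feats: (seq, clip_dim) frozen CLIP features
        return (
            self.learnable(icd_ids)
            + self.img_proj(clip_img_feats)
            + self.txt_proj(clip_txt_feats)
        )
```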
ICD-LM: Experiment settings
Compared methods:
1. Random Sample (RS)
2. Similarity-based retrieval methods: Similarity-based Image-Image Retrieval (SIIR), Similarity-based Text-Text Retrieval (STTR), Similarity-based Image-Text Retrieval (SITR)

ICD-LM: Main result
We construct a 2-shot ICD-configuration dataset to train the ICD-LM.
- ICD-LM achieves the best performance compared with the other methods.
- The trained ICD-LM also excels at configuring 4-shot ICDs, showing strong length-extrapolation ability.
[Figure: bar charts of diverse ICL methods, averaged over different shot settings; on IC (CIDEr): RS, SITR, SIIR, ICD-LM; on VQA (accuracy): RS, SITR, SIIR, STTR, ICD-LM.]
ICD-LM: Ablations on the dataset construction
We select three factors for our ablation studies:
1. Beam size b.
2. The number of samples n in the anchor set.
3. The sampling method of the sub-supporting set:
   - Random: select randomly from the total supporting set.
   - Similar Text (Sim-T): select the samples with the highest textual similarity to the anchor sample from the total supporting set.
   - Similar Image (Sim-I): select the samples with the highest visual similarity to the anchor sample from the total supporting set.
1. Beam size
- Increasing the beam size correlates positively with ICD-LM performance.
- An excessively large beam size can negatively impact performance: lower-scoring ICD sequences introduced by a large beam mislead the ICD-LM during training.
[Figure: accuracy on VQA and CIDEr on IC for beam sizes b=1, b=5, b=10.]
2. The number of samples in the anchor set
- Using more anchor samples improves the interpolation performance on both IC and VQA.
- However, on IC, the extrapolation performance decays when n grows from 3000 to 5000.
[Figure: accuracy on VQA and CIDEr on IC for n=1000, n=3000, n=5000.]
3. The sampling method of the sub-supporting set
- We find Random is the best on both IC and VQA.
- We suppose this is because selecting ICDs similar to the anchor sample damages the diversity of the ICD sequence.
[Figure: accuracy on VQA and CIDEr on IC for Sim-I, Sim-T, Random.]

ICD-LM: Ablations on the scorer structure
- Using task-specific scorers increases the interpolation performance.
- Accuracy is not suitable as a scorer, since it is a binary metric.
[Figure: CIDEr of diverse scorers on IC, comparing RS, Confidence, CIDEr.]

THANKS