Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Google¹

In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state of the art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks, achieving 26% to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

1. Introduction
We present our latest multimodal models from the Gemini line: Gemini 1.5 Pro and Gemini 1.5 Flash. They are members of Gemini 1.5, a new family of highly capable multimodal models which incorporates our latest innovations in sparse and dense scaling as well as major advances in training, distillation and serving infrastructure that allow it to push the boundary of efficiency, reasoning, planning, multi-linguality, function calling and long-context performance. Gemini 1.5 models are built to handle extremely long contexts; they have the ability to recall and reason over fine-grained information from up to at least 10M tokens. This scale is unprecedented among contemporary large language models (LLMs), and enables the processing of long-form
mixed-modality inputs, including entire collections of documents, multiple hours of video, and almost five days of audio.

The Gemini 1.5 Pro presented in this report is an update over the previous Gemini 1.5 Pro February version, and it outperforms its predecessor on most capabilities and benchmarks. All in all, the Gemini 1.5 series represents a generational leap in model performance and training efficiency. Gemini 1.5 Pro surpasses Gemini 1.0 Pro and 1.0 Ultra on a wide array of benchmarks while requiring significantly less compute to train. Similarly, Gemini 1.5 Flash performs uniformly better compared to 1.0 Pro, and even performs at a similar level to 1.0 Ultra on several benchmarks.

The ability to model data of increasingly longer contexts has tracked the development of more general and capable language models: from the now-toy 2-gram language model proposed by Shannon (1948), to the modern n-gram models of the 1990s & 2000s typically constrained to 5 tokens of context (Brants et al., 2007; Chen and Goodman, 1999; Jelinek, 1998; Kneser and Ney, 1995), to recurrent neural network language models from the 2010s which could effectively condition on hundreds of tokens (Jozefowicz et al., 2016; Mikolov et al., 2010), to the modern Transformer (Vaswani et al., 2017) which can condition on hundreds of thousands of tokens (Anthropic, 2023a). Gemini 1.5 Pro continues this trend by extending language model context lengths by over an order of magnitude. Scaling to millions of tokens, we find a continued improvement in predictive performance (Section 5.2.1.1), near-perfect recall (>99%) on synthetic retrieval tasks (Figure 1 and Section 5.2.1.2), and a host of surprising new capabilities like in-context learning from entire long documents and multimodal content (Section 5.2.2).

¹ Please send correspondence to gemini-1_5-. © 2024 Google. All rights reserved.
arXiv:2403.05530v5 [cs.CL] 16 Dec 2024

Figure 1 | Gemini 1.5 Pro achieves near-perfect "needle" recall (99.7%) up to 1M tokens of "haystack" in all modalities, i.e., text, video and audio. It even maintains this recall performance when extending to 10M tokens in the text modality (approximately 7M words); 9.7M tokens in the audio modality (up to 107 hours); and 9.9M tokens in the video modality (up to 10.5 hours). The x-axis represents the context window, and the y-axis the depth percentage of the needle placed for a given context length. The results are color-coded: green for successful retrievals and red for unsuccessful ones. Note that the performance for all modalities is obtained with the previously reported Gemini 1.5 Pro version from February.

To measure the effectiveness of our models' multimodal long-context capabilities, we conduct experiments on both synthetic and real-world tasks. In synthetic "needle-in-a-haystack" tasks inspired by Kamradt (2023) that probe how reliably the model can recall information amidst distractor context, we find that both Gemini 1.5 Pro and Gemini 1.5 Flash achieve near-perfect (>99%) "needle" recall up to multiple millions of tokens of "haystack" in all modalities, i.e., text, video and audio. As part of our experimental setup, we also assessed the performance of Gemini 1.5 Pro when extending the context to
10M tokens across all three modalities. We found that the recall performance was maintained even with this significant increase in context size.

Gemini 1.5 Pro | Relative to 1.5 Pro (Feb) | Relative to 1.0 Pro | Relative to 1.0 Ultra
Long-Context Text, Video & Audio | no change | from 32k up to 10M tokens | from 32k up to 10M tokens
Core Capabilities | Win-rate: 78.1% (25/32 benchmarks) | Win-rate: 88.0% (44/50 benchmarks) | Win-rate: 77.8% (35/45 benchmarks)
Text | Win-rate: 78.6% (11/14 benchmarks) | Win-rate: 95.8% (23/24 benchmarks) | Win-rate: 84.2% (16/19 benchmarks)
Vision | Win-rate: 92.3% (12/13 benchmarks) | Win-rate: 95.2% (20/21 benchmarks) | Win-rate: 85.7% (18/21 benchmarks)
Audio* | Win-rate: 80% (4/5 benchmarks) | Win-rate: 60% (3/5 benchmarks) | Win-rate: 40% (2/5 benchmarks)

Table 1 | Gemini 1.5 Pro win-rates compared to Gemini 1.5 Pro from the February release, as well as to the Gemini 1.0 family. Gemini 1.5 Pro maintains high levels of performance even as its context window increases. Detailed results are presented in Table 10. *In speech recognition, it is generally accepted that any difference in Word Error Rate (WER) that falls within a 3% relative range is not statistically significant and can be considered mere noise, and we grouped such instances as wins for the latest systems.
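The 3%-relative tie rule used for the audio comparisons above can be made concrete in a few lines. This is a sketch only: the report does not spell out which WER serves as the baseline for "relative", so taking the smaller of the two is our assumption.

```python
def wer_tie(wer_a, wer_b, rel_threshold=0.03):
    """Treat two word error rates as statistically indistinguishable when
    their relative difference is within rel_threshold (3% here).

    The baseline for "relative" is taken as the smaller WER; the report
    does not state the exact convention, so this is one reasonable reading.
    """
    return abs(wer_a - wer_b) / min(wer_a, wer_b) <= rel_threshold

# 5.0% vs 5.1% WER is a 2% relative difference, so it counts as a tie,
# whereas 5.0% vs 5.2% is a 4% relative difference and does not.
```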
Gemini 1.5 Flash | Relative to 1.0 Pro | Relative to 1.0 Ultra
Long-Context Text, Video & Audio | from 32k up to 10M tokens | from 32k up to 10M tokens
Core Capabilities | Win-rate: 82.0% (41/50 benchmarks) | Win-rate: 46.7% (21/44 benchmarks)
Text | Win-rate: 94.7% (18/19 benchmarks) | Win-rate: 42.1% (8/19 benchmarks)
Vision | Win-rate: 90.5% (19/21 benchmarks) | Win-rate: 61.9% (13/21 benchmarks)
Audio | Win-rate: 0% (0/5 benchmarks) | Win-rate: 0% (0/5 benchmarks)

Table 2 | Gemini 1.5 Flash win-rates compared to the Gemini 1.0 family. Gemini 1.5 Flash, while being smaller and much more efficient and faster to serve, maintains high levels of performance even as its context window increases. Detailed results are presented in Table 10.

In more realistic multimodal long-context benchmarks which require retrieval and reasoning over multiple parts of the context (such as answering questions from long documents or long videos), we also see Gemini 1.5 Pro outperforming all competing models across all modalities, even when these models are augmented with external retrieval methods. We showcase the in-context learning abilities of both Gemini 1.5 Pro and Gemini 1.5 Flash enabled by very long context: for example, learning to translate a new language from a
single set of linguistic documentation. With only instructional materials (a 500-page reference grammar, a dictionary, and ~400 extra parallel sentences) all provided in context, Gemini 1.5 Pro and Gemini 1.5 Flash are capable of learning to translate from English to Kalamang, a Papuan language with fewer than 200 speakers² and therefore almost no online presence, with quality similar to a person who learned from the same materials. Moreover, we add in 45 minutes of transcribed Kalamang speech recordings to demonstrate that Gemini 1.5, for the first time with an LLM, can leverage mixed-modal documentation to learn speech recognition for a new language in context. We further showcase how the long-context capability of Gemini 1.5 models breaks new ground on long-context automatic speech recognition, long-context video understanding, in-context planning, and unstructured multimodal data analytics tasks.

Importantly, this leap in long-context performance does not come at the expense of the core multimodal capabilities of the model.³ Across an extensive battery of evaluations, both Gemini 1.5 Pro and Gemini 1.5 Flash greatly surpass Gemini 1.0 Pro (44/50 for Gemini 1.5 Pro and 41/50 for Gemini 1.5 Flash). These include core capabilities such as Math, Science and Reasoning (+49.6% and +30.8%, respectively, Sec. 6.1.1), Multilinguality (+21.4% and +16.7%, Sec. 6.1.4), Video Understanding (+18.7% and +7.5%, Sec. 6.2.4), Natural Image Understanding (+21.7% and +18.9%, Sec. 6.2.3), Chart and Document Understanding (+63.9% and +35.9%, Sec. 6.2.2), Multimodal Reasoning (+31.5% and +15.6%, Sec. 6.2.1), Code (+21.5% and +10.3%, Sec. 6.1.3), and more (see Table 10 and Table 2 for full breakdowns). These evaluations additionally cover a series of "agentic" tasks including Function Calling (+72.8% and +54.6%, Sec. 6.1.5), planning (Sec. 5.2.2.7), and in-the-wild long-tail real-world use cases such as improving job productivity for professionals (Sec. 6.1.7). These advances are particularly striking when benchmarking against Gemini 1.0 Ultra, a state-of-the-art model across many capabilities. Despite using significantly less training compute and being more efficient to serve, Gemini 1.5 Pro performs better on more than half of the overall benchmarks (35/45), and the majority of vision (18/21) and text (16/19) benchmarks. For Gemini 1.5 Flash, which is substantially more efficient to serve and faster at inference time, we find it to be better than Ultra 1.0 on the majority of vision benchmarks (13/21) and almost half the text benchmarks (8/18).

In the following sections, we provide an overview of the model architecture and present the results of large-scale quantitative evaluations comparing Gemini 1.5 Pro and
1.5 Flash to other LLMs. We present detailed evaluations of the models' long-context capabilities, followed by evaluations of their core capabilities, similar to the Gemini 1.0 Technical Report (Gemini-Team et al., 2023), covering well-studied benchmarks across text, code, image, video and audio. Finally, we discuss our approach to responsible deployment, including our process for impact assessment, developing model policies, evaluations, and mitigations of harm before deployment decisions.⁴

2. An Improved Gemini 1.5 Pro
Since the initial release in February, Gemini 1.5 Pro has undergone a number of pre-training and post-training iterations. These iterations have led to significant improvements in performance across the spectrum of model capabilities. On average, we see more than a 10% relative improvement in evals over the previous version of 1.5 Pro. See Figure 2 for a highlight of performance across a selection of benchmarks. On reasoning benchmarks, 1.5 Pro's performance on MATH (Hendrycks et al., 2021b) has improved from 58.5% to 67.7%, while on GPQA (Rein et al., 2023) 1.5 Pro now scores 46.2% compared to 41.5% before. We see a similar picture on multimodal tasks, with 1.5 Pro improving on all image understanding benchmarks and most video understanding benchmarks; on MathVista (Lu et al., 2023) Gemini 1.5 Pro's performance improves from 52.1% to 63.9%, on InfographicVQA (Mathew et al., 2022) it improves from 72.7% to 81.0%, and on EgoSchema (Mangalam et al., 2023) it improves from 65.1% to 72.2%. Gemini 1.5 Pro now achieves state-of-the-art results on several multimodal benchmarks including AI2D, MathVista, ChartQA, DocVQA, InfographicVQA and EgoSchema.

² Kalamang language: https:/
³ We define the core capabilities as those capabilities of the model that are primarily non-long-context (e.g., math, science, reasoning, code), similar to capabilities covered in the Gemini 1.0 Technical Report (Gemini-Team et al., 2023).
⁴ See the model card (Mitchell et al., 2019a) in Appendix Section 12.1.

Benchmark | Gemini 1.5 Pro (Feb 2024) | Gemini 1.5 Pro (May 2024)
MATH | 58.5 | 67.7 (+9.2)
GPQA | 41.5 | 46.2 (+4.7)
BigBench-Hard | 84.0 | 89.2 (+5.2)
MMLU | 81.9 | 85.9 (+4.0)
HumanEval | 71.9 | 84.1 (+12.2)
Natural2Code | 77.7 | 82.6 (+4.9)
WMT23 | 75.2 | 75.3 (+0.1)
V*Bench | 48.0 | 71.7 (+23.7)
MathVista | 54.7 | 63.9 (+9.2)
MMMU | 58.5 | 62.2 (+3.7)
FLEURS (lower is better) | 6.6 | 6.5 (-0.1)
EgoSchema | 65.1 | 72.2 (+7.1)

Figure 2 | Comparison of Gemini 1.5 Pro (May 2024) to the initial version (Feb 2024) across several benchmarks. The latest Gemini 1.5 Pro makes improvements across all reasoning, coding, vision and video benchmarks, with audio and translation performance remaining neutral. Note that for FLEURS a lower score is better.

3. Model Architecture
3.1. Gemini 1.5 Pro
Gemini 1.5 Pro is a sparse mixture-of-experts (MoE) Transformer-based model that builds on Gemini 1.0's (Gemini-Team et al., 2023) research advances and multimodal capabilities. Gemini 1.5 Pro also builds on a much longer history of MoE research at Google (Clark et al., 2022; Du et al., 2022; Fedus et al., 2021; Lepikhin et al., 2020; Riquelme et al., 2021; Shazeer et al., 2017; Zoph et al., 2022) and language model research in the broader literature (Anil et al., 2023b; Anthropic, 2023a; Brown et al., 2020; Chowdhery et al., 2023b; Hoffmann et al., 2022; Jiang et al., 2024; Kim et al., 2021; OpenAI, 2023a; Rae et al., 2021; Raffel et al., 2020; Roller et al., 2021; Thoppilan et al., 2022; Touvron et al., 2023a,b; Vaswani et al., 2017). MoE models use a learned routing function to direct inputs to a subset of the model's parameters for processing. This form of conditional computation (Bengio et al., 2013; Davis and Arel, 2014; Jacobs et al., 1991) allows models to grow their total parameter count while keeping the number of parameters that are activated for any given input constant.
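The conditional-computation idea can be sketched as a generic top-k router in plain NumPy. This is an illustrative sketch of MoE routing in general; the actual Gemini 1.5 routing function, expert count, and architecture are not disclosed in this report.

```python
import numpy as np

def top_k_route(x, w_router, experts, k=2):
    """Route one token vector x to the top-k experts by router score.

    w_router: (d_model, n_experts) learned routing weights (random here).
    experts:  list of callables, each mapping (d_model,) -> (d_model,).
    Only k experts run per token, so the active parameter count stays
    constant even as the total number of experts grows.
    """
    logits = x @ w_router                      # (n_experts,) router scores
    top = np.argsort(logits)[-k:]              # indices of the k best experts
    # Softmax over the selected logits gives the mixing weights.
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
# Each "expert" is a toy linear map with its own weight matrix.
experts = [(lambda W: (lambda x: x @ W))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]
w_router = rng.normal(size=(d, n_experts))
y = top_k_route(rng.normal(size=d), w_router, experts, k=2)
```

With k fixed, doubling the expert count doubles total parameters while the per-token compute stays the same, which is the efficiency lever the paragraph above describes.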
A host of improvements made across nearly the entire model stack (architecture, data, optimization and systems) allows Gemini 1.5 Pro to achieve comparable quality to Gemini 1.0 Ultra (see Section 6), while using significantly less training compute and being significantly more efficient to serve. Gemini 1.5 Pro also incorporates a series of significant architecture changes that enable long-context understanding of inputs up to 10 million tokens without degrading performance. Translated into real-world data, this context length enables Gemini 1.5 Pro models to comfortably process almost five days of audio recordings (i.e., 107 hours), more than ten times the entirety of the 1440-page book (587,287 words) War and Peace, the entire Flax (Heek et al., 2023) codebase (41,070 lines of code), or 10.5 hours of video at 1 frame per second.
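For intuition, the modality-to-token rates implied by the numbers in this section can be checked with simple arithmetic. These are approximate rates back-derived from the report's own figures, not an official tokenization specification.

```python
# Back-of-envelope rates implied by the figures above (approximate):
#   video: ~9.9M tokens for 10.5 h at 1 frame/s -> ~260 tokens per frame
#   audio: ~9.7M tokens for 107 h               -> ~25 tokens per second
#   text : ~10M tokens for ~7M words            -> ~1.4 tokens per word
video_tokens_per_frame = 9.9e6 / (10.5 * 3600)   # 1 frame per second
audio_tokens_per_sec = 9.7e6 / (107 * 3600)
text_tokens_per_word = 10e6 / 7e6

# Example: roughly how many tokens would War and Peace (587,287 words) take?
war_and_peace_tokens = 587_287 * text_tokens_per_word
```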
Further, since the model is natively multimodal and supports interleaving of data from different modalities, it can support a mix of audio, visual, text, and code inputs in the same input sequence. In Section 5.1, we highlight some of the novel capabilities enabled by these advances, including evaluations that yielded positive results on context lengths up to 10 million tokens. We note that understanding the limits of these capabilities, and studying their exciting applications, remains an area of continued research exploration.

3.2. Gemini 1.5 Flash
Gemini 1.5 Flash is a Transformer decoder model with the same 2M+ context and multimodal capabilities as Gemini 1.5 Pro, designed for efficient utilization of tensor processing units (TPUs) with lower latency for model serving. For example, Gemini 1.5 Flash computes the attention and feedforward components in parallel (Chowdhery et al., 2023b), and is also online distilled (Agarwal et al., 2024b; Anil et al., 2018; Beyer et al., 2021; Bucila et al., 2006; Hinton et al., 2015) from the much larger Gemini 1.5 Pro model. It is trained with higher-order preconditioned methods (Becker and LeCun, 1989; Duchi et al., 2011; Heskes, 2000) for improved quality.

3.3. Serving efficiency and latency
In addition to serving 2M+ context, Gemini 1.5 models were designed for high efficiency and low latency at all
context lengths. For most queries, the time per output character dominates total request latency during Transformer inference (Pope et al., 2023), and this compounds for applications using chained prompts and agentic workflows.

To assess the latency of Gemini 1.5 compared with other foundation model APIs across multiple languages, we report the average time per output character on English, Japanese, Chinese, and French queries. Latency measurements were taken from the Vertex AI streaming API for Gemini 1.5 and Claude 3, and from the OpenAI streaming API for GPT-3.5 and GPT-4. To account for variance in traffic load, we calculate the mean time per output character for 32 queries and report the minimum value across the 32 requests.

Language | Gemini 1.5 Flash | Gemini 1.5 Pro | GPT-3.5 Turbo | GPT-4 Turbo | Claude 3 Haiku | Claude 3 Sonnet | Claude 3 Opus
English | 1.5 | 4.3 | 2.6 | 6.8 | 2.2 | 6.2 | 10.5
Japanese | 4.3 | 10.9 | 12.9 | 35.4 | 10.7 | 23.9 | 46.6
Chinese | 5.2 | 14.1 | 18.4 | 53.4 | 12.8 | 30 | 55.3
French | 1.9 | 4.7 | 4.1 | 10.7 | 3.0 | 6.8 | 13.4

Table 3 | Time per output character (ms) of various APIs for English, Japanese, Chinese, and French responses, given inputs of 10,000 characters. Gemini 1.5 Flash achieves the fastest output generation for all languages tested.
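The min-of-means protocol behind Table 3 can be sketched as follows. The timings here are hypothetical placeholders; real measurements would come from the streaming APIs named above.

```python
import random
random.seed(0)

def time_per_char_stat(request_timings):
    """request_timings: list of (elapsed_seconds, n_output_chars) per request.

    For each request, compute the mean time per output character in ms,
    then report the minimum across requests, which reduces sensitivity
    to transient traffic load during measurement.
    """
    per_request = [1000.0 * t / n for t, n in request_timings]  # ms/char
    return min(per_request)

# Hypothetical elapsed times for 32 streamed requests of ~1000 chars each.
timings = [(random.uniform(1.5, 3.0), 1000) for _ in range(32)]
best = time_per_char_stat(timings)
```

Taking the minimum rather than the mean across requests deliberately reports the best-case serving speed, filtering out requests slowed by load spikes.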
Across all four evaluated languages, Gemini 1.5 Flash yields the fastest output generation of all models, and Gemini 1.5 Pro shows faster generation than GPT-4 Turbo, Claude 3 Sonnet, and Claude 3 Opus (see Table 3). For English queries, Gemini 1.5 Flash generates over 650 characters per second, more than 30% faster than Claude 3 Haiku, the second fastest of the models evaluated.

4. Training Infrastructure and Dataset
Like the Gemini 1.0 series, Gemini 1.5 models are trained on multiple 4096-chip pods of Google's TPUv4 accelerators, distributed across multiple datacenters, and on a variety of multimodal and multilingual data. Our pre-training dataset includes data sourced across many different domains, including web documents and code, and incorporates image, audio, and video content. For the instruction-tuning phase, we fine-tuned Gemini 1.5 models on a collection of multimodal data (containing paired instructions and appropriate responses), with further tuning based on human preference data. We refer readers to the Gemini 1.0 Technical Report (Gemini-Team et al., 2023) for further information.

5. Evaluation Results
Existing evaluations are increasingly strained by the new and rapidly
advancing capabilities of large multimodal models. They typically focus on individual modalities and/or are restricted to tasks with shorter context lengths. Hence, there is a growing need for benchmarks which exemplify the nuanced requirements of real-world, long mixed-modality use cases. Among these, we highlight the quantitative assessment of reasoning capabilities across long mixed-modality sequences as a key challenge.

With the challenges of evaluating increasingly capable models in mind, our evaluation of the Gemini 1.5 series first focuses on understanding and evaluating its novel capabilities. Subsequently, we explore core benchmarks, covering capabilities studied in the Gemini 1.0 Technical Report (Gemini-Team et al., 2023). Specifically, we evaluate Gemini 1.5 in three main categories:⁵
1. Qualitative long-context multimodal evaluations: manually probe and stress-test the models' long-context abilities, especially for novel capabilities where no quantitative benchmarks exist.
2. Quantitative long-context multimodal evaluations: measure the models' long-context abilities on both synthetic and real-world tasks with well-defined metrics.
3. Quantitative core evaluations: identify progress and regressions in core capabilities (e.g., coding, math, science, multilinguality and instruction following).

5.1. Qualitative Examples of Multimodal Long-Context Capabilities
The ability to process multiple millions of tokens unlocks practical applications that were not possible before. In this section we demonstrate some surprising
interactions we observed with Gemini 1.5 Pro across code, text and video.⁶

As shown in Figure 3, Gemini 1.5 Pro is able to ingest entire large codebases such as JAX (746,152 tokens), and answer very specific queries about them. In Figure 4, we show Gemini 1.5 Pro's ability to learn a new language based only on reference materials given in its input (see Section 5.2.2.1 for quantitative metrics for this use case). Additionally, we test Gemini 1.5 Pro's ability to answer an image query given the entire text of Les Misérables, and observe that being natively multimodal allows it to locate a famous scene from a hand-drawn sketch, as shown in Figure 5. Lastly, in Figure 6 we ask Gemini 1.5 Pro questions about an entire 45-minute movie, which the model answers seamlessly while retrieving moments and timestamps down to the second.

⁵ We note that all the evaluations are from the same checkpoint of the Gemini 1.5 models that are instruction-tuned post pre-training, unless otherwise stated. All sampling evals throughout the report are done with temperature 0 (greedy decoding).
⁶ For additional short videos of demonstrations of the long-context abilities of Gemini 1.5 Pro across video, text, and code, see https://deepmind.google/technologies/gemini/.

Figure 3 | Given the entire 746,152-token JAX codebase in context, Gemini 1.5 Pro can identify the specific location of a core automatic differentiation method.

Figure 4 | Given a reference grammar
book and a bilingual wordlist (dictionary), Gemini 1.5 Pro is able to translate from English to Kalamang with similar quality to a human who learned from the same materials.

User prompt: Look at the event in this drawing. What page is this on?
Parsed PDF: les_miserables.pdf (1382 pages, 732,162 tokens)
File: les_miserables.pdf. "Les Miserables by Victor Hugo: PREFACE. So long as there shall exist, by virtue of law and custom, decrees of damnation pronounced by society, artificially creating hells amid the civilization of earth, and adding the element of human fate to divine destiny; ..."
Model output: This is on page 119. Page 119: "My friend," resumed the Bishop, "before you go, here are your candlesticks. Take them." He stepped to the chimney-piece, took the two silver candlesticks, and brought them to Jean Valjean. The two women looked on without uttering a word, without a gesture, without a look which could disconcert the Bishop.

Figure 5 | With the entire text of Les Misérables in the prompt (1382 pages, 732k tokens), Gemini 1.5 Pro is able to identify and locate a famous scene from a hand-drawn sketch.

Figure 6 | When prompted with a
45-minute Buster Keaton movie, "Sherlock Jr." (1924) (2,674 frames at 1 FPS, 684k tokens), Gemini 1.5 Pro retrieves and extracts textual information from a specific frame and provides the corresponding timestamp. At bottom right, the model identifies a scene in the movie from a hand-drawn sketch.

5.2. Long-context Evaluations
For the past few years, LLM research has prioritized expanding the context window from which models can incorporate information (Anthropic, 2023a; OpenAI, 2023a). This emphasis stems from the recognition that a wider context window allows models to incorporate a larger amount of new, task-specific information not found in the training data at inference time, leading to improved performance on various natural language or multimodal tasks. Recent approaches to improving the long-context capabilities of models fall into a few categories, including novel architectural approaches (Ainslie et al., 2023; Gu and Dao, 2023; Guo et al., 2021; Orvieto et al., 2023; Zaheer et al., 2020), post-training modifications (Bertsch et al., 2023; Chen et al., 2023b; Press et al., 2021; Xiong et al., 2023), retrieval-augmented models (Guu et al., 2020; Izacard et al., 2022; Jiang et al., 2022; Karpukhin et al., 2020; Santhanam et al., 2021), memory-augmented models (Bulatov et al., 2022, 2023; Martins et al., 2022; Mu et al., 2023; Wu et al., 2022a,b; Zhong et al., 2022), and techniques for building more coherent long-context datasets (Shi et al., 2023b; Staniszewski et al., 2023). This activity has resulted in measurable improvements in the long-context capabilities of LLMs over the past several months, with the recent concurrent work of Liu et al. (2024) exploring context windows of 7B models up to 1M multimodal tokens. Notably, among the state-of-the-art LLMs, Anthropic successfully extended the context of their text-only Claude 2 model to 100k tokens, while OpenAI recently released GPT-4 Turbo reaching 128k tokens. Finally, the latest addition to the series was Claude 3, with a context window of up to 1M tokens.

Figure 7 | Cumulative average negative log-likelihood (NLL) as a function of token position in long-document and code data (left panel: long documents, R² = 0.997; right panel: code, R² = 0.995; series: Gemini 1.5 Flash, Gemini 1.0 Pro, Gemini 1.5 Pro, and a power-law fit). A lower value demonstrates better prediction. Gemini 1.5 Pro shows improved predictions up to 1M tokens for long documents and 10M tokens for code, whereas Gemini 1.0 Pro improves only up to 32K tokens. Gemini 1.5 Flash shows improvement up to 1M tokens for long documents and 2M tokens for code. The NLL of Gemini 1.5 Pro follows a power-law trend up until 1M tokens (documents) and 2M tokens (code), with a deviating trend at 10M tokens.

Gemini 1.5 Pro significantly extends this context-length frontier to multiple millions of tokens with almost no degradation in performance, making it possible to process significantly larger inputs. Compared to Claude 2.1 with a 200k-token context window, Gemini 1.5 Pro achieves 100% recall at 200k tokens, surpassing Claude 2.1's 98%. This 100% recall is maintained up to 530k tokens, and recall is 99.7% at 1M tokens. When increasing from 1M tokens to 10M tokens, the model retains 99.2% recall. Moreover, Gemini 1.5 Pro's native multimodal capabilities enable the model to ingest multiple hours of audio and video recordings alongside or interleaved with text. Such recall capabilities are summarized in Figure 1. Below we report results on long-context evaluations across all three modalities, i.e., text, vision and audio. Similarly, Gemini 1.5 Flash achieves almost perfect recall across all three modalities up to 2M tokens, yielding 100% recall on text, 99.8% on video and 99.1% on audio.

The evaluation methodology we followed to measure the long-context capability of Gemini 1.5 models consists of both diagnostic-focused probing of long-context capabilities (e.g., perplexity over long sequences, needle-in-a-haystack retrieval studies) and realistic evaluations specifically designed for multimodal long-context tasks (e.g., long-document QA, long-context automatic speech recognition, learning to translate a new language from only one book, and long-context video QA). To provide a reference point, throughout this section we compare Gemini 1.5 models with the leading model available
externally for each task. With the evaluation harness we developed for Gemini 1.5 models, we are able to reliably quantify the quality of long-context understanding capabilities all the way up to 10M tokens.

5.2.1. Diagnostic Long-Context Evaluations
5.2.1.1 Perplexity over Long Sequences
We start by reporting results on the text modality. To evaluate the ability of the models to make use of very long contexts to improve next-token prediction, which is the objective function used to train language models, we record the negative log-likelihood (NLL) of tokens at different positions in the input sequences from held-out text (i.e., not used in training). Here, a lower value implies an improved prediction. Typically, we expect tokens at the beginning of a sequence to have high NLL, as there is little to no context that the model can use to predict them, and tokens later in the sequence to have lower NLL as more information becomes available to the model. The shape of the resulting curve indicates the ability of models to reason over long context. A downward trend signifies models making use of long context to reduce their uncertainty; an upward trend signifies that models are unable to effectively use information from the previous context and may be deteriorating in prediction quality, highlighting limitations in their long-context understanding capability.

We perform this analysis on two data sources: (a) a dataset of long documents with up to 1 million tokens, and (b) a dataset of code repositories constructed by first randomly shuffling all the files and then concatenating them. The code dataset contains sequences longer than 1 million tokens with some natural form of semantic association (e.g., a whole repository), allowing for further evaluation on sequences of up to 10M tokens. Figure 7 shows the cumulative NLL up to a specific token index.⁷ We also fit a power law of the form L(x) = a·x^b + c to these data points (dashed line).
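The cumulative-average NLL curve and a power-law fit can be reproduced in miniature with the standard library alone. This is a sketch on synthetic data; for simplicity it drops the additive constant c from the a·x^b + c form so the fit stays closed-form via log-log linear regression.

```python
import math

def cumulative_avg_nll(nlls):
    """Cumulative average NLL at each token position (lower is better)."""
    out, total = [], 0.0
    for i, v in enumerate(nlls, start=1):
        total += v
        out.append(total / i)
    return out

def fit_power_law(xs, ys):
    """Least-squares fit of y ~ a * x**b via log-log linear regression.

    Ignoring the additive constant c keeps the fit closed-form; a full
    a*x**b + c fit would need a nonlinear optimizer.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - b * mx)
    return a, b

# Synthetic check: data generated from y = 2 * x**-0.1 is recovered.
xs = [128 * 2 ** i for i in range(14)]        # 128 ... ~1M token positions
ys = [2.0 * x ** -0.1 for x in xs]
a, b = fit_power_law(xs, ys)
```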
101、ccuracy improves up to the tested sequence lengths(1M for long documents,and 10M for code),indicating that our models can make use of the whole input even at very long-context lengths.Thissuggests that Gemini 1.5 models are able to improve their predictions by finding useful patterns intokens,even i
102、f they occurred millions of tokens in the past,as in the case of code.Finally,we see this improved prediction follows a regular power-law structure.While it is wellknown that language models follow a power-law in terms of training compute to model performance(NLL)(Kaplan et al.,2020)up to a very lar
103、ge scale,we demonstrate that a power law can holdbetween log-loss and context length up to extremely long context lengths.We see the power-lawfit is quite accurate up to 1M tokens for long-documents and about 2M tokens for code for Gemini1.5 Pro.From inspecting longer code token predictions closer t
5.2.1.2. Text Haystack

Next, we move to testing long-context recall using the recently introduced needle-in-a-haystack evaluation (Kamradt, 2023), which tests a model's ability to retrieve a text (the "needle") inserted at various positions into a sequence (the "haystack"). Following prior work (Dhinakaran, 2024), we use a set of concatenated and repeated essays written by Paul Graham8 to fill the desired context length. We insert a needle at linearly spaced intervals from the beginning to the end of the context, where the needle takes the form "The special magic <city> number is: <number>", with the city and number varied for each query, and we query the model to return the magic number for a specific city. We report whether the magic number was recalled correctly at various context lengths (x-axis, the haystack size) as a function of the needle's position in the input sequence, expressed as a depth percentage (y-axis); e.g., a depth of 100% indicates a needle inserted at the very end of the input, whereas 0% indicates the very beginning.
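A minimal version of this construction can be sketched as follows. The filler text, city pool, and prompt wording are placeholders for illustration, not the exact strings used in the evaluation:

```python
import random

FILLER = "Essay filler text. " * 1000         # stands in for the repeated essays
CITIES = ["Paris", "Tokyo", "Lagos", "Lima"]  # placeholder city pool

def build_haystack(num_chars: int, depth_pct: float, rng: random.Random):
    """Return (prompt, city, number) with the needle inserted at depth_pct."""
    city = rng.choice(CITIES)
    number = rng.randint(1000, 9999)
    needle = f"The special magic {city} number is: {number}. "
    filler = (FILLER * (num_chars // len(FILLER) + 1))[:num_chars]
    pos = int(len(filler) * depth_pct / 100)  # 0% = start, 100% = end
    haystack = filler[:pos] + needle + filler[pos:]
    query = f"What is the special magic {city} number?"
    return haystack + "\n" + query, city, number

prompt, city, number = build_haystack(20_000, 50.0, random.Random(0))
print(str(number) in prompt)  # -> True
```

Sweeping num_chars over context sizes and depth_pct over linearly spaced depths yields the grid of cells plotted in Figure 8.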
7 We note that we are unable to obtain logits for other commercially available LLMs for comparison.
8 https:/

Figure 8 | Text Haystack. This figure compares Gemini 1.5 Pro with GPT-4 Turbo on the text needle-in-a-haystack task. Green cells indicate that the model successfully retrieved the secret number, gray cells indicate API errors, and red cells indicate that the model response did not contain the secret number. The top row shows results for Gemini 1.5 Pro, from 1k to 1M tokens (top left) and from 1M to 10M tokens (top right). The bottom row shows results for GPT-4 Turbo up to the maximum supported context length of 128k tokens.

As can be seen in Figure 8, Gemini 1.5 Pro achieves 100% recall up to 530k tokens and 99.7% recall up to 1M tokens. This task, while simple, provides a clear demonstration that Gemini 1.5 Pro is able to reliably retrieve information from long documents up to 1M tokens. For reference,
we report results for GPT-4 Turbo up to the 128K sequence length supported by their API. To test whether the capabilities demonstrated in the perplexity plots in Figure 7 transfer to sampling tasks, we continue to evaluate Gemini 1.5 Pro on the needle-in-a-haystack task beyond 1M tokens. The results in Figure 8 show that the model is still able to find and extract information with 99.2% accuracy up to 10M tokens. On the same task, we also evaluated Gemini 1.5 Flash up to 2M tokens and attained flawless recall of 100%, suggesting best-in-class long-context text retrieval performance within its tier of models.

In Section 9.4.1, we also showcase an "adversarial" version of this needle-in-a-haystack task for long-context safety evaluations.

5.2.1.3. Video Haystack

As Gemini 1.5 Pro is natively multimodal, its long-context abilities translate directly to other modalities, enabling it to retrieve specific information across multiple hours of video. To test this capability, we adapt the text needle-in-a-haystack evaluation into a cross-modal evaluation, wherein a needle is hidden in one modality while the retrieval query is given in text. Rather than asking the model to retrieve a randomly inserted phrase from a corpus of text, we ask it to retrieve information embedded in a random frame (the "needle") in a 10.5-hour-long video (the "haystack") sampled at one frame per second.

Concretely, we overlay the text "The secret word is needle" on a single randomly sampled video frame in a 10.5-hour video constructed by concatenating seven copies of the full AlphaGo documentary (Kohs, 2017) back-to-back (for a total of 37,994 frames, or 9.9M tokens). See Figure 28 in the Appendix for an example of such an embedded frame.
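As a back-of-envelope check using only the figures stated above (1 fps, 37,994 frames, roughly 9.9M tokens), the implied video length and per-frame token cost are:

```python
frames = 37_994     # seven back-to-back copies of the documentary at 1 fps
tokens = 9_900_000  # approximate total token count stated for the video

hours = frames / 3600               # one frame per second of video
tokens_per_frame = tokens / frames

print(round(hours, 1))          # -> 10.6 (quoted as a "10.5-hour" video)
print(round(tokens_per_frame))  # -> 261, i.e., a few hundred tokens per frame
```

The small gap from the quoted 10.5 hours comes from rounding in the stated frame and token counts.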
After feeding it the video, we ask the

Figure 9 | Video Haystack. This figure compares Gemini 1.5 Pro with
GPT-4V on the video needle-in-a-haystack task, where the models are given video clips of different lengths, up to 10.5 hours of video, and are asked to retrieve a secret word embedded as text at different points within the clip. All video clips are sampled at one frame per second (1 fps). The pair of 10×50 haystack plots on the left compares Gemini 1.5 Pro with GPT-4V on the first hour of the AlphaGo documentary. The x-axis represents the video duration, ranging from 1.2 minutes to 1 hour, and the y-axis represents the depth, namely the relative offset of the needle (e.g., the top-left cell represents providing the model with the first 1.2 minutes of video and inserting the needle in a randomly sampled frame within the first seven seconds of that trimmed video). A green cell indicates that the model successfully retrieved the needle, whereas a gray cell indicates an API error. Whereas the GPT-4V API supports video lengths only up to around the first 3 minutes, Gemini 1.5 Pro successfully retrieves the secret word inserted at all depth percentages for the full hour, as shown by the all-green plot. Finally, the 10×10 grid on the right shows Gemini 1.5 Pro's perfect retrieval across 10.5 hours of video, constructed by concatenating seven copies of the AlphaGo documentary back-to-back.

model to answer the question "What is the secret word?". As Figure 9 shows, Gemini 1.5 Pro successfully answers this question across a breadth of video lengths and a range of randomly inserted needle locations in the 10.5-hour
video. In contrast, the GPT-4V API supports video lengths only up to around the first 3 minutes. We also evaluated Gemini 1.5 Flash on the video-haystack problem with up to 2M tokens and obtained 99.8% recall, demonstrating best-in-class long-context retrieval performance in the vision modality.

5.2.1.4. Audio Haystack

We follow a similar strategy for testing Gemini 1.5 Pro's long-context capabilities on audio understanding. We hide a very short clip of audio lasting a few seconds, in which a speaker says "the secret keyword is needle", within an audio signal (the haystack) up to almost five days long (i.e., 107 hours). The task for the model is then to retrieve the secret keyword, given a question in text, hence requiring cross-modal reasoning. To further challenge the model beyond increasing context, the large audio signal is built from an unlabeled speech corpus from the VoxPopuli dataset (Wang et al., 2021) so that the
input signal contains multiple speakers.

Figure 10 | Audio Haystack. This figure presents the audio version of the needle-in-a-haystack experiment, comparing Gemini 1.5 Pro with a combination of Whisper and GPT-4 Turbo. In this setting, the needle is a short segment of audio inserted within a very large audio segment (of up to 107 hours) containing concatenated audio clips; the task is to retrieve the secret keyword, which is revealed in the needle. Red indicates that the model did not identify the keyword, whereas green indicates that the model identified the keyword correctly.

In Figure 10 we plot the result of the experiment when the input audio ranges from 12 minutes to 107 hours (or 9.9M tokens), inserting the needle at different positions across the signal. The red boxes indicate a score of 0.0 (meaning the model did not identify the keyword), and green
indicates a score of 1.0 (meaning the model identified the keyword correctly). The Gemini 1.5 models succeed at finding the secret keyword in all instances, with an overall accuracy of 100% for Gemini 1.5 Pro and 98.7% for Gemini 1.5 Flash on this task.

Unlike Gemini 1.5 Pro and Gemini 1.5 Flash, existing models cannot natively handle more than a few seconds of audio in the context. As such, in order to compare against them fairly, we need to employ a strategy where we first transcribe audio into text using windows of tens of seconds, and then rely on text models to extend beyond that limited window. Specifically, to compare against Whisper, we chunk the audio input into 30-second segments, transcribe the audio using that model to produce a text transcript, concatenate the transcripts for each chunk, and finally prompt GPT-4 Turbo to find the "secret keyword" given the text transcript.
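The cascade baseline described above can be sketched as a small pipeline. transcribe_fn and llm_fn below are injectable stand-ins for the Whisper and GPT-4 Turbo calls (not real API bindings), so the control flow can be shown without network access:

```python
from typing import Callable, List

def cascade_find_keyword(
    audio: List[float],
    sample_rate: int,
    transcribe_fn: Callable[[List[float]], str],  # stand-in for Whisper
    llm_fn: Callable[[str], str],                 # stand-in for GPT-4 Turbo
    chunk_seconds: int = 30,
) -> str:
    """Transcribe fixed-size chunks, concatenate, then query a text model."""
    chunk = chunk_seconds * sample_rate
    transcripts = [
        transcribe_fn(audio[i:i + chunk]) for i in range(0, len(audio), chunk)
    ]
    transcript = " ".join(transcripts)
    return llm_fn(f"Find the secret keyword in this transcript:\n{transcript}")

# Toy run with stubbed components: 90 s of "audio" at 16 kHz -> 3 chunks.
audio = [0.0] * (16_000 * 90)
calls = []
def fake_transcribe(chunk):
    calls.append(len(chunk))
    return "the secret keyword is needle" if len(calls) == 2 else "background chatter"
answer = cascade_find_keyword(
    audio, 16_000, fake_transcribe,
    lambda p: "needle" if "needle" in p else "not found",
)
print(answer, len(calls))  # -> needle 3
```

Chunked transcription loses cross-chunk context, which is one reason such a cascade can trail natively multimodal models.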
Figure 10 shows the performance for each depth percentage and number of hours. The overall accuracy of Whisper combined with GPT-4 Turbo at identifying the needle is around 94.5%.

5.2.1.5. Improved Diagnostics

Despite the excellent performance of Gemini 1.5 Pro and Gemini 1.5 Flash on the needle-in-a-haystack tasks for all three modalities, significantly surpassing previously reported results (99.7% for text, 100% for video and 100% for audio), we also present early findings on observed limitations. By design, the needle-in-a-haystack task is a retrieval task measuring recall, and so far we have considered the simplest possible setup. A natural extension is to increase the number of unique "needles" in each haystack and require the model to retrieve them all. For a context length of up to 1M tokens, we inserted 100 different needles and measured the total number of correct needles retrieved.
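Scoring this variant reduces to counting how many of the inserted city-number pairs the model's answer reproduces. The matching rule below (a regex requiring each number to follow its city) is our own illustrative choice, not the report's exact scoring:

```python
import re

def multi_needle_recall(model_answer: str, needles: dict) -> float:
    """Fraction of (city -> magic number) needles recovered in the answer."""
    found = sum(
        1 for city, number in needles.items()
        if re.search(rf"{city}\D*{number}", model_answer)  # number follows its city
    )
    return found / len(needles)

needles = {"Paris": 1111, "Tokyo": 2222, "Lagos": 3333}
answer = "Paris: 1111, Tokyo: 9999, Lagos: 3333"   # one of three is wrong
print(round(multi_needle_recall(answer, needles), 2))  # -> 0.67
```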
Figure 11 | Retrieval performance on the "multiple needles-in-a-haystack" task, which requires retrieving 100 unique needles in a single turn. Comparing Gemini 1.5 Pro to GPT-4 Turbo, we observe higher recall at shorter context lengths, and a very small decrease in recall towards 1M tokens.

Figure 11 compares the recall of Gemini 1.5 Pro9 and GPT-4 Turbo on this task. We see improved recall from Gemini 1.5 Pro over GPT-4 Turbo up until 128K tokens. It is important to note that GPT-4 Turbo's context length is limited to 128K tokens, and that its retrieval quality oscillates considerably at longer context lengths, with an average recall of around 50% at 128K tokens. In contrast, Gemini 1.5 Pro maintains around 70% recall up to 128K tokens, and 60% recall up to 1M tokens. We report further results on this task in Appendix 12.2, including results with different numbers of needles, where we observe consistent trends.

In line with other tasks in the LLM literature, we also observe that the choice of prompting method and the type of needle affect the final performance of models, and future versions of "needle(s)-in-a-haystack"-style tests should account for prompt robustness.

We also modulate retrieval difficulty along another axis: the similarity of the needles. In the Multi-round Co-reference Resolution (MRCR) task, the model is presented with a long conversation between a user and a model, in
which the user requests writing (e.g., poems, riddles, essays) on different topics, followed by the model's responses. In each conversation, two user requests containing topics and writing formats distinct from the rest of the conversation are randomly placed in the context. Given the conversation, the model must reproduce the model output (the needle) resulting from one of the two requests (the key). Either the formats, the topics, or both overlap in order to create a single key that is adversarially similar to the query key. For instance, the request "Reproduce the poem about penguins." requires the model to distinguish the poem about penguins from the poem about flamingos, and "Reproduce the first poem about penguins." requires the model to reason about ordering. We score MRCR via a string-similarity measure between the model output and the correct response.10
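As footnote 10 specifies, the string-similarity measure is the SequenceMatcher ratio from Python's standard difflib module; the scoring itself is a one-liner (the example strings here are invented):

```python
from difflib import SequenceMatcher

def mrcr_score(model_output: str, reference: str) -> float:
    """Similarity in [0, 1]; 1.0 means an exact reproduction of the needle."""
    return SequenceMatcher(None, model_output, reference).ratio()

print(mrcr_score("a poem about penguins", "a poem about penguins"))         # -> 1.0
print(mrcr_score("a poem about penguins", "a poem about flamingos") < 1.0)  # -> True
```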
Figure 12 compares the capabilities of Gemini 1.5 Pro, Gemini 1.5 Flash, the GPT-4 Turbo 01-25-2024 and 04-09-2024 models, the Claude 3 models, and Claude 2.1 on MRCR. Gemini 1.5 Pro overtakes GPT-4 Turbo 04-09-2024 and Claude 3 Opus at around 32K tokens, and GPT-4 Turbo 01-25-2024 at around 8K tokens, while Gemini 1.5 Flash consistently dominates Claude 3 Sonnet beyond around 5K tokens (performing comparably at 5K tokens), overtakes GPT-4 Turbo 01-25-2024 at around 20K tokens, and always strictly dominates Claude 3 Haiku. After 32K tokens, Gemini 1.5 Flash is

9 All the results for this task are obtained with the previously reported Gemini 1.5 Pro version from February.
10 SequenceMatcher ratio as implemented in https://docs.python.org/3/library/difflib.html

Figure 12 | Cumulative average string-similarity score as a function of context length over 2,000 instances of the MRCR task. Comparing Gemini 1.5 Pro and Gemini 1.5 Flash to GPT-4 Turbo and the Claude 3 models, we observe that after 32K tokens, Gemini 1.5 Pro outperforms Claude 3 Opus and GPT-4 Turbo (04-09-2024). Gemini 1.5 Flash outperforms Claude 3 Sonnet starting at around 5K tokens and GPT-4 Turbo (01-25-2024) at around 20K tokens, and always outperforms Claude 3 Haiku and Claude 2.1. Both Gemini 1.5 Pro and Gemini 1.5 Flash show smaller decreases in performance as a function of context length, up to 1M tokens, than all competitors.

on par with both GPT-4 Turbo 04-09-2024 and Claude 3 Opus, and retains this performance all the way to 1M tokens. Both Gemini models achieve an average score of around 75% at 1M tokens. GPT-4 Turbo and Claude 3 model performance falls off steadily as a function of context length, all the way up to 128K tokens, though GPT-4 Turbo 04-09-2024 and Claude 3 Opus have less negative slopes than the other models from their families. Claude 2.1 (with context length going up to 200K tokens) underperforms the other models by hallucinating that the needle is not in context and punting on requests to solve the task, despite our following Claude 2.1's prompting guidelines for retrieval tasks (Anthropic, 2023b).

We highlight that "multiple needles-in-a-haystack" and MRCR capture different aspects of the retrieval
task: MRCR is harder and requires stronger reasoning and disambiguation skills, while the multiple-needles challenge is a test of the model's recall ability, which explains the disparities between the model orderings up to 8K tokens. Gemini 1.5 Pro and Gemini 1.5 Flash impressively avoid serious degradation along both of these axes all the way up to 1M tokens.

While the "multiple needles-in-a-haystack" and MRCR evaluations offer two challenging setups that stress-test long-context retrieval and reasoning capabilities in different ways, we advocate for pushing the boundaries even further. Evaluating models on tasks that demand complex reasoning over multiple pieces of information scattered across a long context would provide even deeper insights into their true capabilities. This could involve tasks that require integrating disparate facts, drawing inferences, or resolving inconsistencies within the retrieved information. By incorporating such assessments alongside prompt-robustness studies, we can gain a more comprehensive and nuanced understanding of how effectively models can utilize long contexts for advanced reasoning and knowledge extraction.

5.2.2. Realistic Long-Context Evaluations

Having investigated the models' multimodal abilities at handling long context using a battery of diagnostic tests, we now turn to a series of novel multimodal tasks designed to better reflect the potential uses of this model, thus stress-testing mode
ls in a more realistic way.

5.2.2.1. In-context language learning: learning to translate a new language from one book

To test the in-context learning abilities enabled by very long context, we evaluate Gemini 1.5 Flash and 1.5 Pro on the Machine Translation from One Book (MTOB) benchmark (Tanzer et al., 2023). MTOB measures the ability to learn to perform sentence-level translation between English and Kalamang (ISO 639-3 language code: kgv) from instructional materials. Kalamang has fewer than 200 speakers and therefore virtually no presence on the web, which means that the model must rely on the data given in context (rather than knowledge stored in its weights at training time).11 The available resources for Kalamang are field linguistics documentation12 comprising a ~500-page reference grammar (Visser, 2020b), a ~2000-entry bilingual wordlist (Visser, 2020a), and a set of ~400 additional parallel sentences (Visser, 2020a). In total, the available resources for Kalamang add up to around 250k tokens. This task framing offers the promise of using extremely long-context models to support languages that are not sufficiently represented in pre-training corpora, with curated resources that can be created and deployed by independent parties.

To perform the task, we provide Gemini 1.5 Flash and 1.5 Pro with the full set of materials in their input context. To compare fairly against GPT-4 Turbo (version 2024-04-09) and Claude 3, whose publicly released context windows (128K and 200K tokens, respectively) cannot fit the full materials, we also report results using only half of the grammar book (~100k tokens). Moreover, to test to what extent the models are making use of information in the context, rather than relying on knowledge found in their pre-training data, we also run a 0-shot setup. Finally, we compare to MTOB's reference for human performance, in which a human learned Kalamang from the exact same full set of materials.13

To assess performance, we conduct a human evaluation where the same human language learner is given the input sentence and reference translation, and rates the quality of the predicted translation on a scale from 0 to 6, with 6
being an excellent translation. This rater is a non-native, non-fluent speaker who can identify their own translations, so the scores should be interpreted in context. We additionally report results using automatic metrics, i.e., BLEURT (Sellam et al., 2020) for Kalamang to English (kgv→eng) and chrF (Popović, 2015) for English to Kalamang (eng→kgv).

Gemini 1.5, GPT-4 Turbo, and Claude 3 all have essentially random performance in the 0-shot setting (i.e., with no additional Kalamang information in context). The models sometimes successfully copy proper nouns, identify loanwords from higher-resource languages like Malay, or narrow generation using style cues like question marks. Their generations for eng→kgv are identified by Google Translate as various other languages, often malformed. These results indicate that, as expected, no

11 Of course, the models do have some relevant knowledge to transfer to the task, such as competence at translation in general, understanding of linguistics reference grammars, and knowledge of loanwords or other languages with areal influence on Kalamang (though Kalamang is not known to be genealogically related to any other language) (Tanzer et al., 2023; Visser, 2020b).
12 Machine learning for indigenous languages can be culturally sensitive (Bird, 2020). In the case of MTOB, the field linguist who documented Kalamang is an author on the MTOB project; in addition to obtaining consent for the original data collection for linguistic research, the field linguist got renewed permission from their main community consultant to use the data specifically for machine learning research (Tanzer et al., 2023).
13 Note that this is different from the typical notion of human performance in machine translation, where native speakers rate the quality of the translations in absolute terms. MTOB is instead concerned with the ability to learn
to translate from limited reference materials, and how to bridge the gap to useful applications is a separate research question.

Context      GPT-4 Turbo   Claude 3 Haiku   Claude 3 Sonnet   Claude 3 Opus   Gemini 1.5 Flash   Gemini 1.5 Pro   Human language learner
0-shot       0.14 (30.0)   0.24 (33.4)      0.14 (30.0)       0.18 (32.7)     0.14 (31.5)        0.18 (30.0)      -
half book    2.04 (49.7)   2.80 (53.5)      3.40 (58.5)       3.74 (58.3)     3.00 (55.1)        4.14 (63.9)      -
full book    -             -                -                 -               3.14 (57.4)        4.00 (64.6)      5.52 (70.3)

Table 4 | Quantitative results for Kalamang→English translation on MTOB (Tanzer et al., 2023). We present human evaluation scores on a scale of 0 to 6, with 6 being an excellent translation. We include the automatic metric (BLEURT) in parentheses.

Context      GPT-4 Turbo   Claude 3 Haiku   Claude 3 Sonnet   Claude 3 Opus   Gemini 1.5 Flash   Gemini 1.5 Pro   Human language learner
0-shot       0.08 (15.0)   0.08 (15.3)      0.08 (17.3)       0.12 (18.7)     0.08 (15.4)        0.00 (12.0)      -
half book    3.90 (45.4)   4.46 (51.7)      4.64 (49.2)       5.18 (55.5)     4.94 (54.6)        5.38 (59.1)      -
full book    -             -                -                 -               4.66 (52.0)        5.46 (59.0)      5.60 (57.0)

Table 5 | Quantitative results for English→Kalamang translation on MTOB (Tanzer et al., 2023). We present human evaluation scores on a scale of 0 to 6, with 6 being an excellent translation. We include the automatic metric (chrF) in parentheses.

substantial Kalamang data, if any, was part of the models' training data.

Gemini 1.5 Pro in the half-book setting outperforms GPT-4 Turbo and Claude 3 on the same setup by a wide margin; see Tables 4 and 5. In the best setting, Gemini 1.5 Pro attains a 4.14 human evaluation score on kgv→eng translation, compared to 5.52
of the "human language learner" score, and 5.46 on eng→kgv translation, compared to 5.58 by the "human language learner". There is still a substantial qualitative gap for kgv→eng translation, but eng→kgv translation is similar to the human language learner on average.14 Gemini 1.5 Flash also performs admirably, falling short of Gemini 1.5 Pro as expected but handily outperforming GPT-4 Turbo and sitting somewhere between Claude 3 Haiku and Sonnet, or Sonnet and Opus, depending on the translation direction. See Appendix 12.17 for more details, experiments, and qualitative examples.

The performance of Gemini 1.5 Pro highlights the importance of long-context understanding and of providing sufficient context for learning new skills in-context. By leveraging the extensive

14 This is not to say that the task is solved; both the human and Gemini 1.5 Pro make avoidable errors, though typically of different kinds. The human errors tend to be retrieval failures, where they pick a suboptimal phrase because they could not find the ideal reference (because rereading the entire set of materials for each sentence is infeasible for a human). The model failures tend to be inconsistent application of rules, like that the word "se" is pronounced "he" after a vowel (this alternation is described in the phonology section of the grammar and reflected in the additional parallel sentence data, but the model may be confused by the fact that the underlying "se" form is used as the gloss throughout the examples within the grammar), or lack of reflection, like that the
word "kabor", although it is defined as "to be full" in the dictionary, is only used for stomachs/hunger in all examples of its use.

Gemini 1.5 Pro
text context \ audio context:   0-shot   5-shot   50-shot   800-shot
none                            35.0%    30.2%    27.5%     23.1%
wordlist                        29.7%    27.7%    24.8%     23.2%
sentences                       31.4%    27.2%    25.7%     22.9%
both                            32.5%    26.5%    25.8%     23.0%

Table 6 | Character Error Rate (CER, ↓) for Gemini 1.5 Pro learning Kalamang ASR in context.

field linguistics documentation provided in context, Gemini 1.5 Pro was able to achieve remarkable translation quality comparable to a human language learner, and was able to do so for a language to which it had essentially zero exposure during training. This finding opens up exciting possibilities for leveraging LLMs with sufficiently long-context capabilities to support the preservation and revitalization of endangered languages, as well as to facilitate communication and understanding across different linguistic communities. As research continues in this area, it will be crucial to explore techniques for improving the quality of translation in both directions, e.g., Kalamang-to-English, and to
address the challenges of evaluating the performance of LLMs on low-resource and under-represented languages, which we believe is also applicable to other areas of education and language learning.

5.2.2.2. In-context language learning: learning to transcribe speech in a new language in context

Gemini 1.5 has demonstrated exceptional performance at extremely long in-context learning for translation, both with Kalamang (on MTOB) and other low-resource languages (on standard benchmarks below). Kalamang, like many other endangered languages, is primarily oral; applications must therefore support speech in order to be socially useful. We take the next step towards these applications, and at the same time stress-test Gemini 1.5's mixed-modal in-context learning capabilities, by evaluating how well it can learn to transcribe Kalamang speech from text and audio documentation in context. This task framing is possible in
LLMs for the first time due to Gemini 1.5's native audio support.

We present a preview of results on a new benchmark, ASROB (Automatic Speech Recognition from One Book). ASROB extends MTOB with 104 speech recordings (15 total hours) of transcribed and translated Kalamang speech from The Kalamang Collection (Visser, 2020c).15 Here we report experiments on a subset of 6 recordings (45 minutes) with manually realigned phrase-level captions; we use 5 of the recordings (800 phrases) as the in-context train set and 1 (100 phrases) as the test set. The same speaker from the test recording is present in 3 of the train recordings.

We report Character Error Rate (CER) across various combinations of text context (the ~2000-entry bilingual wordlist and ~400 parallel sentences from MTOB) and audio context (up to 800 Kalamang speech/text pairs). We cannot compare directly to GPT-4 or Claude 3 because they do not provide access to audio input at the time of writing.16

15 When the linguist author of Visser (2020b) got the blessing of a community member to use the Kalamang data for machine learning for MTOB, she included speech data/tasks in the discussion.
16 We tried to compare to GPT-4 by cascading the speech input through Whisper, but we were unable to get Whisper to consistently produce transcriptions of the kind that Gemini 1.5 produces zero-shot. We tried using a) default settings, b) the language code for Indonesian, and c) prompts in English and Indonesian introducing "an audio sample in Kalamang, which is written with Indonesian orthography."
Gemini 1.5 Flash
text context \ audio context:   0-shot   5-shot   50-shot   800-shot
none                            73.3%    32.0%    32.3%     26.0%
wordlist                        45.4%    31.3%    31.3%     26.1%
sentences                       37.9%    33.2%    33.1%     25.3%
both                            37.5%    33.0%    33.4%     25.2%

Table 7 | Character Error Rate (CER, ↓) for Gemini 1.5 Flash learning Kalamang ASR in context.

We see in Table 6 that Gemini 1.5 Pro performs remarkably well without any in-context examples (where the prompt instructs the model to transcribe Kalamang speech with Indonesian orthography), achieving 35.0% CER. Judging from the outputs,
the model clearly hears the sounds of the language but does not know how to segment or spell words in it, especially affixes. As we add text and audio context for many-shot learning (Agarwal et al., 2024a), ASR quality improves relatively gracefully, reaching 22.9% CER in the best setting. These outputs are segmented and spelled much better, with some examples transcribed perfectly, but there is still significant room for improvement, in particular ensuring that all outputs are grammatically correct Kalamang. Gemini 1.5 Flash (Table 7) follows a similar trajectory to Pro, but with worse scores across the board.
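The CER reported in Tables 6 and 7 is character-level edit distance divided by reference length. A dependency-free sketch (the example strings echo the "se"/"he" alternation discussed for MTOB and are otherwise invented):

```python
def cer(hyp: str, ref: str) -> float:
    """Character Error Rate: Levenshtein(hyp, ref) / len(ref)."""
    prev = list(range(len(ref) + 1))        # edit distances for empty hypothesis
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (h != r)))   # substitution (0 if equal)
        prev = cur
    return prev[-1] / len(ref)

print(cer("kabor", "kabor"))        # -> 0.0
print(cer("se kabor", "he kabor"))  # -> 0.125 (one substitution over 8 chars)
```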
5.2.2.3. Scaling In-Context Learning for low-resource machine translation

[Figure 13 panels: test chrF vs. number of shots (2^0 to 2^12) for translation from English into Bemba, Kurdish, and Ewe (Flores) and into Acholi, Abkhaz, and Navajo (in-house); legend: Gemini 1.5 Pro, Gemini 1.5 Flash, GPT-4 Turbo.]

Figure 13 | In-context learning scaling results (chrF) on low-resource translation. Top: results on Flores test sets. We use Flores dev sets for prompting and scale the number of shots to 1K (including about 90K tokens). Bottom: results on in-house evaluation sets. We use Gatitos for prompting and scale the number of shots to 4K (including about 30K tokens). Gemini 1.5 yields
increasingly better translation performance as the number of shots grows, significantly surpassing GPT-4 Turbo.

The impressive in-context language learning capability of Gemini 1.5 inspires us to revisit traditional in-context learning (ICL) at scale. ICL allows LLMs to learn new tasks from input-output examples provided at inference time. While it has been widely observed across tasks and models, the number of in-context examples explored is often limited, ranging from a handful to a few dozen, because of context-length limitations and/or suboptimal long-context capabilities (Brown et al., 2020; Min et al., 2022; Zhang et al., 2023a). By contrast, Gemini 1.5's millions of tokens of context open new opportunities for scaling ICL to thousands of examples, also known as the many-shot ICL regime (Agarwal et al., 2024a; Bertsch et al., 2024). In this section, we explore to what extent Gemini 1.5 can leverage an increased number of in-context examples (or shots) to improve machine translation for low-resource languages, extending prior work exploring the limits of few-shot learning for machine translation (Garcia et al., 2023).

We evaluate translation from English into 6 diverse low-resource languages: Acholi (ISO 639-3: ach; Africa), Abkhaz (ISO 639-3: abk; Asia), Navajo (ISO 639-3: nav; Americas), Bemba (ISO 639-3: bem; Africa), Ewe (ISO 639-3: ewe; Africa), and Kurdish (ISO 639-3: kur; Asia). These languages each have between a few hundred thousand and a few million speakers, with relatively low exposure on the Internet. This low-resource nature requires the model to understand and learn from the given in-context examples rather than from the pretraining data, and to acquire new knowledge about the language when appropriate.

We create two evaluation setups, public and in-house, depending on the accessibility of the test data to the general public. Following Agarwal et al. (2024a), the public setup covers Bemba, Ewe, and Kurdish, using the dev set of Flores-200 (Team et al., 2022) (up to 997 examples and 90K tokens) as the in-context example set and the first 200 examples from its devtest set as the test set. The in-house setup covers Acholi, Abkhaz, and Navajo, using Gatitos (Jones et al., 2023) (up to 4K examples and 30K tokens) as the in-context example set and 200 held-out sentence pairs annotated by professional translators as the test set. We use the translation prompt presented in Appendix 12.16.15. We report averaged performance over three runs, each with freshly sampled in-context examples, and use chrF (Popović, 2015) as the evaluation metric.
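chrF, the metric used here, averages character n-gram F-scores (standardly n = 1..6 with β = 2, weighting recall twice as heavily as precision). The sketch below is a simplified re-implementation for illustration; real evaluations would use a standard implementation such as sacrebleu's, which also handles whitespace treatment and smoothing details omitted here:

```python
from collections import Counter

def chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: mean char n-gram F_beta over n = 1..max_n, in [0, 100]."""
    scores = []
    for n in range(1, max_n + 1):
        h = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        r = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not h or not r:
            continue                      # string shorter than n
        overlap = sum((h & r).values())   # clipped n-gram matches
        prec, rec = overlap / sum(h.values()), overlap / sum(r.values())
        f = (0.0 if prec + rec == 0
             else (1 + beta**2) * prec * rec / (beta**2 * prec + rec))
        scores.append(f)
    return 100 * sum(scores) / len(scores) if scores else 0.0

print(chrf("the cat sat", "the cat sat"))  # -> 100.0
```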
Figure 13 shows the ICL scaling performance. While previous ICL studies often see performance saturate after dozens of examples, Gemini 1.5 delivers almost consistent improvements as we scale the number of shots. The quality gain over zero-shot translation can be quite substantial, e.g., +11.1/+21.4 chrF on Bemba and +9.5/+15.9 chrF on Navajo for 1.5 Pro/Flash, although it varies greatly across languages. We see that Gemini 1.5 Flash degrades from few-shot ICL on Acholi, but further scaling ICL significantly improves its translation. Overall, Gemini 1.5 Flash presents more pronounced many-shot scaling than 1.5 Pro, which may be related to its smaller model size: it relies more on in-context examples to perform the translation than on its implicit knowledge. While GPT-4 Turbo also shows some positive trend as we scale the number of shots, Gemini 1.5 Pro often outperforms GPT-4 Turbo across languages and numbers of shots by a wide margin. One interesting exception is translation into Abkhaz, where Gemini 1.5 Pro lags behind GPT-4 Turbo with few shots but significantly surpasses it as ICL scales. This pattern also occurs with Gemini 1.5 Flash across languages: scaling up to 1K/4K examples improves Flash, enabling it to achieve performance superior to GPT-4 Turbo, e.g., +9.6/+6.4 chrF on Ewe/Acholi. In short,
213、rt,the Gemini 1.5 modelsexcel at translating low-resources languages and gracefully improve with in-context examples.5.2.2.4Long-document QAAfter testing Gemini 1.5 models in-context language learning capabilities up to 250k tokens,weproceed into another realistic evaluation setup.In this section we
present experiments on question answering: we create questions using the book "Les Misérables" (by Victor Hugo) and test the models' ability to answer them correctly when the entire 1,462-page book (i.e., 710K tokens) is provided as input. Evaluating a model's ability to answer questions about long documents (or collections of documents) presents a unique challenge. Unlike tasks that focus on specific facts or details that measure the retrieval capability of the models, such questions often require understanding relationships between pieces of information spanning large portions of text. For example, a question like "How is the concept of duality portrayed through the character who embodies both respect for authority and hatred of rebellion?" necessitates comprehending the overall narrative and character dynamics within the above book.

[Figure 14: bar chart of model strengths fitted via the Bradley-Terry model: 0k context Gemini 1.0 Pro 0.1041; RAG 4k context Gemini 1.0 Pro 0.2971; RAG 4k context GPT-4 Turbo 1.2994; 0k context GPT-4 Turbo 1.6424; 0k context Gemini 1.5 Pro 1.3746; RAG 4k context Gemini 1.5 Pro 1.7656; full 710k context Gemini 1.5 Pro 6.2417.]

Figure 14 | Answer quality based on side-by-side auto-rater (Gemini 1.5 Pro), rankings and scores (λ) computed via the Bradley-Terry model.

We compare Gemini 1.5 Pro against Gemini 1.0 Pro. Due to the limited context window of the latter, Gemini 1.0 Pro requires retrieval-augmented generation to access useful passages from the book. This method indexes passages using TF-IDF and stores the results in an external database. The question is then used as a query to re-rank passages by cosine similarity, and the most relevant passages are retrieved, up to a maximum of 4k tokens (roughly 41 passages). The retrieved passages are then put into context following a temporal ordering. In contrast, Gemini 1.5 Pro, due to its larger context window capable of accommodating much longer material, eliminates any need for additional data post-processing, indexing and retrieval pipelines.17
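The retrieval-augmented baseline described above can be sketched as follows. The whitespace tokenization, raw TF-IDF weighting, and word-count token budget are simplifying assumptions for illustration, not the exact pipeline used for Gemini 1.0 Pro.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Term-frequency / inverse-document-frequency vectors for a corpus."""
    tokenized = [t.lower().split() for t in texts]
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(texts)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(passages, query, token_budget=4000):
    """Rank passages by cosine similarity to the query, keep the most
    similar ones up to a token budget, then restore document order
    (the 'temporal ordering' mentioned above)."""
    vecs = tfidf_vectors(passages + [query])
    qvec = vecs[-1]
    ranked = sorted(range(len(passages)),
                    key=lambda i: cosine(vecs[i], qvec), reverse=True)
    kept, used = [], 0
    for i in ranked:
        cost = len(passages[i].split())  # crude proxy for token count
        if used + cost > token_budget:
            break
        kept.append(i)
        used += cost
    return [passages[i] for i in sorted(kept)]

passages = [
    "the bishop gives valjean the silver candlesticks",
    "javert pursues valjean through the streets of paris",
    "cosette grows up in the care of the thenardiers",
]
top = retrieve(passages, "who gave valjean the candlesticks", token_budget=7)
print(top)  # the candlestick passage ranks first and fits the budget
```

Restoring document order after selection matters: the retrieved passages are fed to the model in the order they appear in the book, not in relevance order.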
To evaluate the models' responses, we create a set of 100 questions. Generally, LLMs today can achieve high factual accuracy in the zero-shot setting for well-known works such as "Les Misérables". This makes it challenging to distinguish between models when using absolute performance measures. We therefore use side-by-side comparisons to assess the answer quality between models with varying context sizes. For a more detailed discussion of this methodology and its implications, see Bohnet et al. (2024). The side-by-side comparison allows us to rate whether models provide enough details to answer a question sufficiently. We use an auto-rater that takes a question and answers from two different systems and compares them against each other. The auto-rater response is either "system A is better", "system B is better", or "None" if both answers are non-factual, in which case they are both excluded.

17 See Appendix 12.3 for details of the automatic question generation pipeline.

Using these side-by-side comparison results, we provide an analysis of model strength using the Bradley-Terry model (Bradley and Terry, 1952). Such ranking models are used in many applications and are best known for their use in Chess or Go to rate player strength. The Bradley-Terry model assigns scores to a fixed set of models based on pairwise comparisons, where the log-odds of model i outperforming model j is given by the difference of their scores. The fitting of the parameters (λ_1, ..., λ_n) is performed via maximum likelihood estimation. The model strength has a direct mapping to the probability that an answer from model i is better than an answer from model j:

P(answer i is better than answer j) = λ_i / (λ_i + λ_j)    (1)
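Eq. (1) and the maximum-likelihood fit can be sketched as follows. The minorization-maximization (MM) update used here is one standard way to compute the maximum-likelihood Bradley-Terry scores; it is our assumption rather than the report's exact fitting procedure.

```python
def bt_win_prob(lam_i, lam_j):
    """Eq. (1): probability that model i's answer beats model j's."""
    return lam_i / (lam_i + lam_j)

def bt_fit(n_models, comparisons, iters=1000):
    """Maximum-likelihood Bradley-Terry scores via the classic MM update.
    `comparisons` is a list of (winner, loser) index pairs."""
    wins = [0] * n_models
    for w, _ in comparisons:
        wins[w] += 1
    lam = [1.0] * n_models
    for _ in range(iters):
        denom = [0.0] * n_models
        for w, l in comparisons:
            for m in (w, l):
                denom[m] += 1.0 / (lam[w] + lam[l])
        lam = [wins[m] / denom[m] if denom[m] else lam[m]
               for m in range(n_models)]
        scale = sum(lam) / n_models  # the scale is arbitrary; fix mean to 1
        lam = [x / scale for x in lam]
    return lam

# With the fitted strengths from Figure 14, full-context 1.5 Pro beats
# RAG 4k 1.5 Pro with probability 6.2417 / (6.2417 + 1.7656):
p = bt_win_prob(6.2417, 1.7656)
print(round(p, 4))  # 0.7795
```

Plugging in the other Figure 14 strengths reproduces the pairwise win rates quoted in the following paragraph.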
Figure 14 summarizes the results for this evaluation. When using the entire book "Les Misérables" as context, Gemini 1.5 Pro outperforms all other systems by a large margin. For example, full-context Gemini 1.5 Pro provides better answers than retrieval-augmented generation with 4k tokens using Gemini 1.5 Pro with probability 6.2417 / (6.2417 + 1.7656) = 0.7795, or in 78% of cases. Using the full book as context with Gemini 1.5 Pro provides a better answer compared to retrieval-augmented GPT-4 Turbo with 4k tokens in 83% of cases.

5.2.2.5 Long-context Audio

Next, we evaluate Gemini 1.5's long-context understanding capabilities on audio inputs. To evaluate long-context automatic speech recognition (ASR) performance, we test Gemini 1.5 models on an internal benchmark derived from 15-minute segments of YouTube videos. For this evaluation, we report results against the 1.0 Pro model, which is trained on audio segments much shorter in length. We also report the performance of the Universal Speech Model (USM) (Zhang et al., 2023b) and Whisper (OpenAI, 2023). Note that ASR tasks report a word error rate (WER) metric, where a lower number is better.

Table 8 below shows that the 1.0 Pro model, when evaluated on transcribing 15-minute videos without segmentation, has a WER of 100% due to a mismatch between training and testing audio lengths. When we segment the videos every 30 seconds and pass the textual content of the language model across each segment boundary, the 1.0 Pro model can achieve a WER of 7.8%. The USM model with a CTC decoder, while robust to long segments, achieves a WER of 8.8%. As indicated in the table, Whisper is not robust to long segments and hence requires audio to be segmented every 30 seconds to achieve a WER of 7.3%. In comparison, Gemini 1.5 Pro is much more robust on these longer-context tasks, thanks to its long-context capabilities and without the added complexity of extra input segmentation and pre-processing.
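The WER metric above is the word-level edit distance (substitutions, insertions, deletions) between hypothesis and reference transcripts, divided by the reference length. A minimal sketch, without the text normalization (casing, punctuation) that production ASR scoring typically applies:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # one substitution over six words
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is how an unsegmented model can score 100% on long audio.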
As a result, Gemini 1.5 Pro can transcribe 15-minute videos more accurately than other models, achieving a WER of 5.5%, while Gemini 1.5 Flash trails behind 1.0 Pro with a WER of 8.8%, a remarkable level of quality considering its smaller size and superior efficiency.

Model              Segmentation   WER
USM                none           8.8%
Whisper            none           12.5%
Whisper            30s            7.3%
Gemini 1.0 Pro     none           100%
Gemini 1.0 Pro     30s            7.8%
Gemini 1.5 Pro     none           5.5%
Gemini 1.5 Flash   none           8.8%

Table 8 | Word error rate (WER) for various models on 15-minute videos.

[Figure 15: Gemini 1.5 Pro accuracy on 1H-VideoQA (40-105 minutes) and EgoSchema (3 minutes) for a random baseline, the first frame only, linear subsampling to 16 or 150 frames, and all frames (1 fps).]

Figure 15 | Comparison between 1H-VideoQA and EgoSchema, reporting Gemini 1.5 Pro's accuracy when linearly subsampling to 1, 16, or 150 frames. We also show performance if we provide all the frames for each video for 1H-VideoQA, in yellow. Gemini 1.5 Pro achieves SotA accuracy on both benchmarks. Gemini 1.5 Pro's performance on 1H-VideoQA keeps increasing as we scale up to providing all frames in the video, while its performance on EgoSchema saturates after 16 frames, showing that 1H-VideoQA is more effective at differentiating among models that operate over different context lengths. *When provided with the first frame only, Gemini 1.5 Pro tends to avoid using one of the choices and instead indicates it lacks information to reply accurately.

5.2.2.6 Long-context Video QA

We next proceed to long-video question answering to test Gemini 1.5's efficacy on long-context video understanding. Question-answering benchmarks for long-context video understanding need to have at least two properties: first, they need to contain long videos, and second, their questions need to be designed in a way
that can differentiate among models that operate over different context lengths. Unfortunately, no existing benchmarks satisfy these properties for evaluating models that can handle hours-long videos like the Gemini 1.5 models. The publicly available question-answering benchmark with the longest videos is EgoSchema (Mangalam et al., 2023), but its videos are at most 3 minutes (i.e., 180 frames) in length. To bridge this evaluation gap, we introduce a new benchmark, 1H-VideoQA, composed of 125 five-way multiple-choice questions over public videos 40-105 minutes long. We collected annotations that require understanding one or multiple events, each spanning only a few seconds of the full video, so that the answer is extremely challenging to infer by looking at a few randomly sampled video frames.

We run experiments by extracting video frames at one frame per second, and further linearly subsampling long videos to a fixed context length. We also measure performance if we provide all frames for each video for 1H-VideoQA as a reference.

Model              16 frames   150 frames   Full video (1 fps)
GPT-4V             36.5%       52.3%        Not supported
Gemini 1.5 Pro     45.2%       56.3%        72.2%
Gemini 1.5 Flash   39.7%       50.8%        65.9%

Table 9 | Comparison between GPT-4V and Gemini 1.5 models on 1H-VideoQA. Experiments are run zero-shot, by sampling one video frame per second and linearly subsampling 16 or 150 frames. We also show performance if we provide all the frames for each video to the Gemini 1.5 models.
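The linear subsampling step above can be sketched as picking evenly spaced indices over the 1-fps frame sequence; treating "linear subsampling" as endpoint-inclusive even spacing is our assumption about the exact scheme.

```python
def linear_subsample(frames, k):
    """Pick k frames at evenly spaced positions (endpoints included);
    if the clip already fits the budget, keep every frame."""
    n = len(frames)
    if n <= k:
        return list(frames)
    if k == 1:
        return [frames[0]]
    idx = [round(i * (n - 1) / (k - 1)) for i in range(k)]
    # round() can repeat an index for short clips; dedupe, keeping order.
    seen, out = set(), []
    for i in idx:
        if i not in seen:
            seen.add(i)
            out.append(frames[i])
    return out

# A 60-minute video at 1 fps is 3600 frames; subsample to a 16-frame budget:
frames = list(range(3600))
sample = linear_subsample(frames, 16)
print(len(sample), sample[0], sample[-1])  # 16 0 3599
```

For a 40-105 minute video, a 16-frame budget keeps roughly one frame every 2.5-7 minutes, which is why events spanning only a few seconds are so hard to catch at that budget.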
Results are shown in Figure 15 and Table 9. Figure 15 illustrates the improvement of 1H-VideoQA over EgoSchema in terms of its ability to differentiate among models that operate over different numbers of frames. Gemini 1.5 Pro sets a new state-of-the-art of 70.2% accuracy on EgoSchema using only 16 frames (vs 55.6% for GPT-4V (Balažević et al., 2024)). However, we do not see clear gains from going to 150 frames, suggesting that many questions in EgoSchema can be easily solved with a limited number of frames. In contrast, on 1H-VideoQA the performance of Gemini 1.5 Pro consistently increases as the number of frames provided increases from the first frame to the full video, suggesting that a substantial proportion of questions in 1H-VideoQA can only be solved with more frames as context, thereby making 1H-VideoQA more effective as a long-context benchmark. Table 9 further shows that Gemini 1.5 Pro consistently outperforms GPT-4V on 1H-VideoQA, whether the video has been subsampled to 16 or to 150 frames. The fact that Gemini 1.5 Pro does not solve 1H-VideoQA perfectly (see examples in Appendix, Table 47), despite observing a frame every second, makes 1H-VideoQA a useful benchmark for evaluating and driving the development of future long-context video models. We additionally highlight the quality of Gemini 1.5 Flash, which is only slightly behind GPT-4V with 150 frames, and even better than GPT-4V with 16 frames, despite being significantly smaller and more efficient.

5.2.2.7 In-Context Planning

Reasoning and planning are fundamental human skills for problem-solving and decision-making. While LLMs have demonstrated recent progress on reasoning tasks, planning remains a more challenging area which has so far received limited attention. In this section, we report the performance of models on both classical planning benchmarks expressed in the standard Planning Domain Definition Language (PDDL) and more modern ones expressed in natural language. See Appendix 12.8 for example prompts for all planning tasks considered in this section.

We evaluate the planning capability of the model as we add more examples ("shots") into the context, inspired by the success of many-shot learning across a large number of tasks (Agarwal et al., 2024a). The challenge in "in-context planning" involves understanding a specific task and problem through a limited number of examples. Additionally, it requires the models to produce a solution without checking each planning step to confirm whether a proposed move is correct; the model has to create a plan in one go. To humans, this might be seen as thinking fast (instead of slow).

Figure 16 shows the in-context learning performance on classical planning and natural language benchmarks as we vary the number of shots.
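A many-shot planning prompt of this kind can be assembled by concatenating solved examples ahead of the query problem. The "Problem:"/"Plan:" delimiters and the toy BlocksWorld-style plans below are our illustration, not the actual prompt format from Appendix 12.8.

```python
def build_many_shot_prompt(examples, query, instruction):
    """Concatenate k solved (problem, plan) pairs ahead of the new
    problem; the model must then emit the full plan in one pass."""
    parts = [instruction, ""]
    for problem, plan in examples:
        parts += [f"Problem: {problem}", f"Plan: {plan}", ""]
    parts += [f"Problem: {query}", "Plan:"]
    return "\n".join(parts)

# Illustrative PDDL-style shots (hypothetical, for demonstration only):
shots = [
    ("stack A on B", "(unstack A C) (stack A B)"),
    ("stack B on C", "(pick-up B) (stack B C)"),
]
prompt = build_many_shot_prompt(shots, "stack C on A",
                                "Output a plan for each problem.")
print(prompt.endswith("Plan:"))  # True: the model continues from here
```

Scaling the number of shots in this setup only changes the length of `examples`, which is what the secondary "sentence pieces" axis in Figure 16 tracks.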
Overall, Gemini 1.5 Pro outperforms other models, indicating that the model not only can plan better with a fewer number of examples/shots, it can also make effective use of additional and longer context. Gemini 1.5 Flash, which is a smaller, faster and more efficient model, is outperformed by Gemini 1.5 Pro but occasionally matches GPT-4 Turbo performance. Below are the details of each task and their corresponding results.

[Figure 16: planning accuracy (%) vs. number of few-shot exemplars (log scale; secondary axis: prompt length in thousands of sentence pieces) for Gemini 1.5 Flash, Gemini 1.5 Pro, and GPT-4 Turbo 20240409, with error bars showing a 70% CI. Panels: (a) BlocksWorld, (b) Logistics, (c) Mini-Grid, (d) Trip Planning, (e) Calendar Scheduling.]

Figure 16 | PDDL planning and natural language planning with few shots. In all benchmarks, Gemini 1.5 Pro outperforms other models, indicating that the model not only can plan better with a fewer number of examples, it can also make effective use of a longer context.

BlocksWorld. BlocksWorld is a well-known planning problem from the International Planning Conference (IPC).18 This domain consists of a set of blocks, a table and a robot hand. The goal is to find a plan to move from one configuration of blocks to another. We generated BlocksWorld problem instances of 3 to 7 blocks. Figure 16a shows the performance of Gemini 1.5 models on this benchmark as we increase the number of few-shot examples. The 1-shot planning capability of Gemini 1.5 Pro and Gemini 1.5 Flash reaches 35% and 26% respectively, while GPT-4 Turbo performance is close to zero. Moreover, the 40-shot planning capability of Gemini 1.5 Pro reaches the 48% range, which performs better than
the best (200-shot) performance of GPT-4 Turbo, which peaks at 43%.

Logistics. Logistics is an AI planning problem from IPC-1998,19 expressed in PDDL, that involves arranging the delivery of packages to their destinations using trucks within cities and airplanes between cities. The aim is to optimize transportation modes under constraints like vehicle capacities and locations, showcasing a model's ability to manage multi-step logistics efficiently. The planning capability of Gemini 1.5 models on the Logistics benchmark is shown in Figure 16b. As one can see, the 1-shot planning capability of Gemini 1.5 Pro reaches 43% while GPT-4 Turbo only reaches 18%. Moreover, for Gemini 1.5 Pro more context leads to consistently better results, indicating that the model can make effective use of additional and longer context. This is not the case for GPT-4 Turbo, where the accuracy drops when more examples are provided.

Mini-Grid. The Mini-Grid problem is from the Artificial Intelligence Planning Systems (AIPS) 1998 competition,20 and is also expressed in PDDL. We create various floorplans with rooms containing random configurations of key shapes. The goal is for a robot to navigate from an initial position to a designated goal cell. Figure 16c shows the performance of Gemini 1.5 models as we increase the number of few-shot examples. The 1-shot planning capability of Gemini 1.5 Pro reaches 28% while GPT-4 Turbo achieves only 15%. More context leads to better performance for Gemini 1.5 Pro: with 400 shots, Gemini 1.5 Pro reaches 77% accuracy. GPT-4 Turbo performance also increases with the number of shots, but it remains far behind Gemini 1.5 Pro; with 80 shots, GPT-4 Turbo reaches 38% accuracy, which is 32% lower than the accuracy of Gemini 1.5 Pro. Gemini 1.5 Flash
is outperformed by Gemini 1.5 Pro but almost matches GPT-4 Turbo performance.

Trip Planning. Trip Planning is a task focused on planning a trip itinerary under given constraints, where the goal is to find the order in which N cities should be visited. We add enough constraints that there is only one solution to each task, which makes evaluating the predictions straightforward. Figure 16d shows the performance of Gemini 1.5 Pro on this benchmark as we increase the number of few-shot examples. The 1-shot performance of the GPT-4 Turbo model seems to be better than that of Gemini 1.5 Pro. However, as we increase the number of shots, the performance of Gemini 1.5 Pro improves dramatically: with 100 shots, Gemini 1.5 Pro reaches 42%, while the best (20-shot) performance of GPT-4 Turbo is 31%.

Calendar Scheduling. Calendar Scheduling is a task to schedule a meeting of either 30 minutes or an hour among up to 7 attendees. The attendees may have a busy schedule, or a light schedule with less than half of the working hours spent in meetings. The planning capability of Gemini 1.5 Pro on this benchmark is shown in Figure 16e. The 1-shot planning capability of Gemini 1.5 Pro reaches 33% while GPT-4 Turbo's accuracy is under 10%. More context also seems to lead to better performance for both the Gemini 1.5 and GPT-4 Turbo models: with 40 shots, GPT-4 Turbo achieves 36% accuracy while Gemini 1.5 Pro reaches 48%, and with 100 shots Gemini 1.5 Pro is able to reach 52%, indicating that the model can make effective use of the longer context.

In summary, planning empowers intelligent agents to look ahead and proactively determine a course of action to reach objectives (Russell and Norvig, 2016). Recently, prompting LLMs to extract common-sense knowledge has gained attention (Ding et al., 2023; Huang et al., 2022; Singh et al., 2023), and the effectiveness of LLMs in generating plans has been studied in Guan et al. (2024), Hao et al. (2023), and Valmeekam et al. (2024). Even though existing work finds current models to be incapable of few-shot planning, our results confirm that Gemini 1.5 Pro can perform well even in the 1-shot setting while making effective use of additional and longer context to further improve performance. Since planning is at the core of robotics, embodied environments, and the agentic space, leveraging this model capability in such applications has great potential.

5.2.2.8 Unstructured Multimodal Data Analytics Task

While performing data analytics on structured data is a very mature field with many successful methods, the majority of real-world data exists in unstructured formats like images and conversations. We investigate the potential of Large Language Models (LLMs) to enable unstructured data analytics and explore how LLMs can directly analyze this vast pool of multimodal information.

As an instance of unstructured data analytics, we perform an image structuralization task. We present LLMs with a set of 1024 images with the goal of extracting the
information that the images contain into a structured data sheet (see Appendix 12.7 for example prompts used in this study). As this is a long-context task, in cases where the context length of a model does not permit processing all of the images at once, we use mini-batches with different batch sizes to alleviate this shortcoming. In the end, the results of each mini-batch are concatenated to form the final structured table.

[Figure 17, left: an illustration of the task. Input: a batch of images ("Please process the following images ... image_0: image_1: image_2: ... Please output the table only in the format specified."). Output: structured extracted attributes, a table with columns ID, Category, Color, Semantic Attribute and rows such as (0, category_0, color_c, semantic_class_3). Right: accuracy (%) of all-attribute extraction vs. batch size (8 to 512) for Gemini 1.5 Pro, GPT-4 Turbo 20240409, and Claude 3 Opus.]

Figure 17 | Performance of models on unstructured data analytics tasks.

Figure 17 presents the results in terms of accuracy for different types of information extracted from images. We find that:

- The accuracy of Gemini 1.5 Pro on all-attribute extraction is improved by 9% (absolute), or 27% (relative), compared to GPT-4 Turbo's accuracy. At the time of evaluation, the Claude 3 API was not able to analyze more than 20 images, which is why the results for Claude 3 Opus are capped.
- For Gemini 1.5 Pro, more images lead to consistently better results, indicating that the model can make effective use of the additional and longer context. This is not the case for GPT-4 Turbo, where the accuracy drops as more images are provided.

6. Core Capability Evaluations

The final component of our evaluation harness for Gemini 1.5 Pro and Gemini 1.5 Flash measures the quality of the models' core capabilities (i.e., performance on non-long-context tasks). The evaluations in this section consist of benchmarks covering all three modalities: text, vision and audio. We rely on a combination of established benchmarks that are public and used by the community, along with some internal benchmarks that are held out and unleaked. Our selection criteria primarily aim to measure the improvement of the Gemini 1.5 series compared to its predecessor, the Gemini 1.0 series of models: Gemini 1.0 Pro and Gemini 1.0 Ultra. Our goal is to highlight the extent of the trade-off, if it exists, between
the 1.5 generation of Gemini models, which excel in long-context capabilities, and their performance on non-long-context tasks. In particular, as we develop the 1.5 series, we aim to enhance the models' proficiency in this new dimension of multimodal long context without compromising their quality across all other capabilities.

All in all, we find a clear generational improvement between the 1.0 and 1.5 series, with Gemini 1.5 Pro uniformly outperforming 1.0 Pro and approaching (often even surpassing) 1.0 Ultra, a state-of-the-art model on most benchmarks, despite being significantly more efficient to train. An outlier to this picture is the situation on the audio capability. The post-training data of the model contains 5 head languages, resulting in slight regressions on multilingual datasets that are not head-heavy (e.g., YouTube, FLEURS and CoVoST 2).

Core Capability              1.5 Pro relative to                    1.5 Flash relative to
                             1.5 Pro (Feb)  1.0 Pro   1.0 Ultra     1.0 Pro   1.0 Ultra
Text
  Math, Science & Reasoning  +5.9%          +49.6%    +18.1%        +30.8%    +4.1%
  Multilinguality            -0.7%          +21.4%    +5.9%         +16.7%    +2.1%
  Coding                     +11.6%         +21.5%    +11.7%        +10.3%    +1.5%
  Instruction following      n/a            +9.9%     -0.2%         +8.7%     -1.2%
  Function calling           n/a            +72.8%    n/a           +54.6%    n/a
Vision
  Multimodal reasoning       +15.5%         +31.5%    +14.8%        +15.6%    +1.0%
  Charts & Documents         +8.8%          +63.9%    +39.6%        +35.9%    +17.9%
  Natural images             +8.3%          +21.7%    +8.1%         +18.9%    +5.6%
  Video understanding        -0.3%          +18.7%    +2.1%         +7.5%     -8.1%
Audio
  Speech recognition*        +1.0%          +2.2%     -3.8%         -17.9%    -25.5%
  Speech translation         -1.7%          -1.5%     -3.9%         -9.8%     -11.9%

Table 10 | Detailed breakdown of the results presented in Table 1. *In speech recognition, it is generally accepted that any difference in Word Error Rate (WER) that falls within a 3% relative range is not statistically significant and can be considered mere noise, and we grouped such instances as wins for the latest systems.

6.1. Core Text Evals

We start by evaluating the Gemini models' performance on seven major core text capabilities: (1) Math and Science (Section 6.1.1); (2) General Reasoning (Section 6.1.2); (3) Coding (Section 6.1.3); (4) Multilinguality (Section 6.1.4); (5) Function Calling (Section 6.1.5); (6) Instruction Following (Section 6.1.6); and (7) Real-world and expert long-tail GenAI tasks (Section 6.1.7). See Table 11 for a summary of these results; see the Appendix for details on each, and for an additional evaluation on QA for Web Search Topics.

With web-scale pretraining of language models, decontamination of public benchmarks is a persistent challenge (Brown et al., 2020; Gemini-Team et al., 2023; OpenAI, 2023a). Gemini 1.5 employed standard n-gram-based decontamination procedures to help mitigate this issue; however, these n-gram-based procedures are imperfect.
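A minimal n-gram overlap check of the kind described can be sketched as follows; the 8-gram order and 50% threshold are illustrative choices, not the parameters used for Gemini.

```python
def ngrams(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc, test_example, n=8, threshold=0.5):
    """Flag a training document whose n-gram overlap with a benchmark
    example exceeds a threshold fraction of the example's n-grams."""
    test_grams = ngrams(test_example, n)
    if not test_grams:
        return False
    overlap = len(test_grams & ngrams(train_doc, n))
    return overlap / len(test_grams) >= threshold

test_example = "what is the capital of france answer paris is the capital"
train_doc = "lots of web text " + test_example + " and more web text"
print(is_contaminated(train_doc, test_example))  # True: exact copy detected
# A single changed token breaks every overlapping 8-gram, so a light
# paraphrase of a leaked example slips through the filter:
print(is_contaminated(train_doc.replace("answer", "response"), test_example))  # False
```

The second check illustrates why such procedures are imperfect: near-duplicates with minor edits or reformatting (e.g., JSON or HTML wrapping) can evade exact n-gram matching.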
295、ontamination,we alsoreport performance on a internally developed non-public evals,such as PhysicsFinals,HiddenMath,and Natural2Code.6.1.1.Math and ScienceWe find that 1.5 Pro consistently outperforms both 1.0 Ultra and 1.0 Pro on grade-school math(i.e.,GSM8K)and even shows material improvement over
296、the more demanding benchmarks where thereis more headroom for improvement,i.e.,+14.5%over 1.0 Ultra for middle-and high-school mathproblems(i.e.,Hendrycks MATH),+13.2%for the American Mathematical Competitions(i.e.,AMC)29Gemini 1.5:Unlocking multimodal understanding across millions of tokens of cont
297、extCapabilityBenchmarkGemini1.0 Pro1.0 Ultra1.5 Flash1.5 ProMath and ScienceGPQA:Graduate-Level Google-Proof Q&A.(Rein et al.,2023)27.9%4-shot35.7%4-shot39.5%0-shot46.2%0-shotMATH:Math problems rangingacross 5 levels of difficultyand 7 sub-disciplines.(Hendrycks et al.,2021b)32.6%4-shotMinervaprompt
298、53.2%4-shotMinervaprompt54.9%4-shotMinervaprompt67.7%4-shotMinervaprompt77.9%*maj64PhysicsFinals:61 undergraduatephysics problems that havenot appeared on the internet.31.1%0-shot41.0%0-shot57.4%0-shot63.9%0-shotHiddenMath 179 new math problemscreated from scratch.6.1%0-shot11.2%0-shot6.7%0-shot20.1
299、%0-shotFunctional MATH Functional variant of 1745MATH problems(December snapshot)39.9%0-shot55.8%0-shot53.6%0-shot64.6%0-shotAMC 2022-23:250 latest problemsincluding 100 AMC 12,100 AMC 10,and 50 AMC 8 problems.22.8%4-shot30%4-shot34.8%4-shot46.4%4-shotGSM8K:Grade-school math problems.(Cobbe et al.,2
300、021)77.9%11-shot88.9%11-shot86.2%11-shot90.8%11-shotGeneral ReasoningBigBench-Hard:A subset of hardertasks from Big Bench.(Srivastava et al.,2022;Suzgun et al.,2022)75.0%3-shot83.6%3-shot85.5%3-shot89.2%3-shotDROP:Reading comprehension&arithmetic.(Metric:F1-Score).(Dua et al.,2019)74.1Variableshots8
301、2.4Variableshots78.4Variableshots74.9VariableshotsMMLU:Multiple-choice questions in57 subjects(professional&academic).(Hendrycks et al.,2021a)71.8%5-shot83.7%5-shot90.0%*maj3278.9%5-shot85.9%5-shot91.7%*maj32Hellaswag(Zellers et al.,2019)84.7%10-shot87.8%10-shot86.5%10-shot93.3%10-shotCodingHumanEva
302、lchat preamble*(Metric:pass rate).(Chen et al.,2021)67.7%0-shot(PT)74.4%0-shot(PT)74.3%0-shot84.1%0-shotNatural2Codechat preamble*(Metric:pass rate).69.6%0-shot74.9%0-shot77.2%0-shot82.6%0-shotMultilingualityWMT23:sentence-level machinetranslation(Metric:BLEURT).(Tom et al.,2023)71.71-shot(PT)74.41-
303、shot(PT)74.11-shot75.31-shotMGSM:multilingual mathreasoning.(Shi et al.,2023a)63.5%8-shot(PT)79.0%8-shot(PT)82.6%8-shot87.5%8-shotTable 11|Evaluation results of Gemini 1.5 Pro,1.5 Flash and Gemini 1.0 models on standard coding,multilingual as well as math,science and reasoning benchmarks.Unless expl
304、icitly specified,alltasks are evaluated in terms of answer accuracy.Note that in this table,PT for the 1.0 Ultra andPro models denote tasks evaluated with model variants that have undergone a post-training(i.e.instruction-tuning)phase after pre-training.All numbers for the 1.5 Pro and 1.5 Flash are
305、obtainedafter instruction-tuning except for the ones marked with*which come from pretrained models,asdescribed in Section 4.30Gemini 1.5:Unlocking multimodal understanding across millions of tokens of contextand+5.8%on graduate-level science problems(i.e.,GPQA(Rein et al.,2023).21Gemini 1.5 Flash de
306、monstrates significant improvements over the 1.0 Pro version,achieving an11.6%increase on GPQA,a 22.3%increase on middle-and high-school math problems(HendrycksMATH),a 26.3%increase on undergraduate physics problems(PhysicsFinals),a 0.6%increase onHiddenMath problems,an 8.4%increase on AMC problems,
and an 8.3% increase on grade-school math problems (GSM8K).

Functional MATH (Srivastava et al., 2024), a new benchmark derived from the MATH dataset, comprises 1,745 problems in both original and modified forms. The benchmark aims to evaluate a model's ability to solve modified problems compared to its performance on the original versions. The Reasoning Gap, defined as the relative decrease in performance between original and modified problems, serves as a novel metric for assessing generalization capability. It quantifies a model's adaptability to modifications of problems previously published online, specifically those introduced after the training data cut-off date. As modifications often lead to problems of increased computational complexity, a performance gap is expected. The Functional MATH dataset provides a means to control and analyze the magnitude of this gap across different models. We specifically chose to utilize the December snapshot of Functional MATH. Our evaluation prioritizes both overall performance on the modified problem set and minimization of the Reasoning Gap. We test the models in a zero-shot setting wherein we present them with problems without any accompanying context or instructions. The automated evaluation process consists of two steps: first, we use the Gemini 1.0 Pro model to extract the proposed final answer from the model's output; then, this answer is compared to the ground truth using the same programmatic heuristic as employed in our evaluations of the MATH dataset.
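The Reasoning Gap defined above reduces to a one-line computation, the relative drop from original to modified accuracy; the figures below use the Functional MATH accuracies reported in this section.

```python
def reasoning_gap(original_acc, modified_acc):
    """Relative decrease in accuracy from original to modified problems."""
    return (original_acc - modified_acc) / original_acc

# Accuracies (%) on the December snapshot of Functional MATH:
results = {
    "Gemini 1.0 Pro":   (55.8, 39.9),
    "Gemini 1.5 Flash": (69.7, 53.6),
    "Gemini 1.0 Ultra": (74.5, 55.8),
    "Gemini 1.5 Pro":   (81.1, 64.6),
}
for model, (orig, mod) in results.items():
    print(f"{model}: reasoning gap = {100 * reasoning_gap(orig, mod):.1f}%")
```

Note that a smaller gap at higher absolute accuracy (as for 1.5 Pro) indicates better generalization, not merely better memorization of the original problems.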
Gemini 1.0 Pro solved 55.8% of the original problems and 39.9% of the modified problems (reasoning gap = 28.5%). Gemini 1.5 Flash solved 69.7% of the original problems and 53.6% of the modified problems (reasoning gap = 23.1%). Gemini 1.0 Ultra solved 74.5% of the original problems and 55.8% of the modified problems (reasoning gap = 25.1%), whereas Gemini 1.5 Pro solved 81.1% and 64.6% (reasoning gap = 20.3%).

We also evaluated Gemini 1.5 Pro on two new, unreleased internal benchmarks: PhysicsFinals and HiddenMath. PhysicsFinals comprises 61 undergraduate physics problems, curated by a group of physics professors for offline final exams, covering topics such as wave mechanics, quantum mechanics, special relativity, and introductory general relativity. Answers were graded by a physics professor. Gemini 1.5 Pro achieved a score of 39, significantly surpassing Gemini 1.0 Ultra (25) and Gemini 1.0 Pro (19). HiddenMath comprises 179 competition-level math problems, crafted by experts and evaluated automatically. Gemini 1.0 Pro solved 11 problems, Gemini 1.5 Flash solved 12, Gemini 1.0 Ultra solved 20, and Gemini 1.5 Pro solved 36.

Section 7 presents additional results obtained with a Math-Specialized Gemini 1.5 Pro.

6.1.2. General Reasoning

BigBench-Hard, DROP,
316、MMLU,and Hellaswag are benchmarks designed to assess a models ability tohandle complex relationships within text,perform multi-step reasoning,and apply common senseknowledge to novel situations.BigBench-Hard,a curated subset of challenging tasks from the broaderBigBench suite,requires models to enga
317、ge in intricate reasoning processes.Gemini 1.5 Pro achievesa state-of-the-art score of 89.2%on this benchmark.MMLU,encompassing a diverse range of 57subjects across professional and academic domains,sees Gemini 1.0 Ultra,1.5 Pro,and 1.5 Flash allexceeding 80%.Hellaswag,a benchmark designed to test c
318、ommon sense reasoning and the ability to21In Appendix 12.6,we analyze the impact of long-context prompting on the performance of Gemini 1.5 Pro on a selectionof 528 challenging problems from Hendrycks MATH(Level 4 and 5 Intermediate Algebra problems).31Gemini 1.5:Unlocking multimodal understanding a
319、cross millions of tokens of contextdistinguish plausible scenarios,sees Gemini 1.5 Pro achieving 93.3%.We observe a similar strongperformance for Gemini 1.5 Flash,with the model outperforming Gemini 1.0 Pro.6.1.3.CodeGemini 1.5 Pro is our best performing model in code to date,surpassing Gemini 1.0 U
320、ltra onHumanEval and Natural2Code,our internal held-out code generation test set made to prevent web-leakage.We see the same gains being transferred to Gemini 1.5 Flash with the model outperformingGemini Ultra 1.0.HumanEval leakageHumanEval is an industry standard open-source evaluation benchmark(Ch
321、enet al.,2021),but we found controlling for accidental leakage on webpages and open-source coderepositories to be a non-trivial task,even with conservative filtering heuristics.An analysis of the testdata leakage of Gemini 1.0 Ultra showed that continued pre-training on a dataset containing even asi
322、ngle epoch of the test split for HumanEval boosted scores from 74.4%to 89.0%,highlighting thedanger of data contamination.We found that this sharp increase persisted even when examples wereembedded in extraneous formats(e.g.JSON,HTML).We invite researchers assessing coding abilitiesof these models h
323、ead-to-head to always maintain a small set of truly held-out test functions that arewritten in-house,thereby minimizing the risk of leakage.The Natural2Code benchmark,which weannounced and used in the evaluation of Gemini 1.0 series of models,was created to fill this gap.Itfollows the exact same for
324、mat of HumanEval but with a different set of prompts and tests.6.1.4.MultilingualityFor our multilingual evaluations we use a multilingual math reasoning(MGSM;Shi et al.,2023a)benchmark and a machine translation benchmark(WMT23;Kocmi et al.,2023)which was constructedafter the models training data cu
325、t-off hence minimizing test set leakage risks.Both of these coverdiverse languages from different language families and resource groups,with MGSM covering 11languages and WMT23 eight languages for a total of 14 language pairs.We find that Gemini 1.5 Pro improves over Gemini 1.0 Ultra on both tasks,p
326、articularly showing asubstantial improvement of almost+9%on the MGSM dataset,in line with the English-only mathimprovements reported above.Interestingly,we find that these improvements are not limited to aparticular resource group;rather,1.5 Pro improves performance equally among differently-resourc
327、edlanguages.Particularly,on medium and low resource languages the gap between 1.0 Ultra and 1.5Pro increases to9%and7%,respectively.22In addition,Gemini 1.5 Flash achieves comparableperformance to Gemini 1.0 Ultra on WMT23,and surpasses it by3 on MGSM,despite its muchsmaller size.6.1.5.Function Call
328、ingThere is an increasing interest in LLMs as the core building block of AI systems(often called agents)thatoperate in environments to achieve complex goals.Environments may include anything from generalweb search,private documents or calendars of users,internal enterprise APIs to general programmin
329、ginterpreters and robot sensors.Due to their general-purpose nature,LLMs are expectedand have thepromiseto operate in many such environments without having seen them at training time.Here,we focus on evaluating this capability via Function Calling(FC),or zero-shot tool use:given thedescriptions and
type signatures of a set of functions or APIs, and a user prompt, the model has to infer what function calls have to be made to service the prompt, if any. Specifically, we use the Berkeley Function Calling Leaderboard (BFCL; Yan et al., 2024) and focus on a subset of BFCL splits.23

22 See Appendix 12.4 for a complete performance breakdown.

Task                 Gemini 1.0 Pro   Gemini 1.5 Flash   Gemini 1.5 Pro
Simple Functions     92.0%            88.0%              92.8%
Multiple Functions   90.0%            92.0%              90.5%
Parallel Functions   38.5%            73.5%              88.5%
Parallel Multiple    27.0%            73.5%              83.5%
Relevance            67.5%            75.4%              83.3%
Weighted Average     67.8%            81.8%              88.4%

Table 12 | Function calling performance on Berkeley Function Calling Leaderboard splits (excluding Java and JavaScript, and with various fixes).

In Table 12 we show performance in terms of accuracy of various Gemini FC endpoints. We see a substantial improvement from Gemini 1.0 Pro to 1.5 Pro in terms of overall weighted accuracy. This can be attributed in large part to the new support for parallel function calling (where one prompt triggers several independent functions), but also to the improved ability of Gemini to determine when not to call functions (cf. the “Relevance” column). We also find the 1.5 Flash FC endpoint to be extremely close to 1.5 Pro. Finally, we believe that while
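The function-calling setup described above can be sketched in code. The following is a minimal, hypothetical scoring routine for BFCL-style splits; the names Call, make_call, and score, and the exact-match rule, are illustrative assumptions, not the actual leaderboard harness.

```python
# Hypothetical sketch of scoring zero-shot function calling (FC): the model
# proposes zero or more calls for a prompt, which are compared to a reference.
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Call:
    """A proposed function call: a name plus sorted (argument, value) pairs."""
    name: str
    args: tuple[tuple[str, Any], ...]


def make_call(name: str, **kwargs: Any) -> Call:
    # Sort keyword arguments so argument order does not affect equality.
    return Call(name, tuple(sorted(kwargs.items())))


def score(predicted: list[Call], expected: list[Call]) -> bool:
    """Order-independent exact match on function names and arguments.

    - "Simple Functions": one expected call.
    - "Parallel Functions": one prompt triggers several independent calls.
    - "Relevance": expected is empty, so the model must abstain from calling.
    """
    def key(c: Call) -> tuple:
        return (c.name, c.args)
    return sorted(predicted, key=key) == sorted(expected, key=key)
```

For example, `score([], [])` rewards abstention (the “Relevance” behavior), while a parallel-call example passes regardless of the order in which the model emits the calls.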