Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Google¹

In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state of the art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks, achieving 26% to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

1. Introduction
We present our latest multimodal models from the Gemini line: Gemini 1.5 Pro and Gemini 1.5 Flash. They are members of Gemini 1.5, a new family of highly capable multimodal models which incorporates our latest innovations in sparse and dense scaling as well as major advances in training, distillation and serving infrastructure that allow it to push the boundary of efficiency, reasoning, planning, multi-linguality, function calling and long-context performance. Gemini 1.5 models are built to handle extremely long contexts; they have the ability to recall and reason over fine-grained information from up to at least 10M tokens. This scale is unprecedented among contemporary large language models (LLMs), and enables the processing of long-form
mixed-modality inputs, including entire collections of documents, multiple hours of video, and almost five days of audio.

The Gemini 1.5 Pro presented in this report is an update over the previous Gemini 1.5 Pro February version, and it outperforms its predecessor on most capabilities and benchmarks. All in all, the Gemini 1.5 series represents a generational leap in model performance and training efficiency. Gemini 1.5 Pro surpasses Gemini 1.0 Pro and 1.0 Ultra on a wide array of benchmarks while requiring significantly less compute to train. Similarly, Gemini 1.5 Flash performs uniformly better compared to 1.0 Pro, and even performs at a similar level to 1.0 Ultra on several benchmarks.

The ability to model data of increasingly longer contexts has tracked the development of more general and capable language models: from the now-toy 2-gram language model proposed by Shannon (1948), to the modern n-gram models of the 1990s & 2000s typically constrained to 5 tokens of context (Brants et al., 2007; Chen and Goodman, 1999; Jelinek, 1998; Kneser and Ney, 1995), to recurrent neural network language models from the 2010s which could effectively condition on hundreds of tokens (Jozefowicz et al., 2016; Mikolov et al., 2010), to the modern Transformer (Vaswani et al., 2017) which can condition on hundreds of thousands of tokens (Anthropic, 2023a). Gemini 1.5 Pro continues this trend by extending language model context lengths by over an order of magnitude. Scaling to millions of tokens, we find a continued improvement in predictive performance (Section 5.2.1.1), near-perfect recall (>99%) on synthetic retrieval tasks (Figure 1 and Section 5.2.1.2), and a host of surprising new capabilities like in-context learning from entire long documents and multimodal content (Section 5.2.2).

¹ Please send correspondence to gemini-1_5-. © 2024 Google. All rights reserved.
arXiv:2403.05530v5 [cs.CL] 16 Dec 2024

Figure 1 | Gemini 1.5 Pro achieves near-perfect "needle" recall (99.7%) up to 1M tokens of "haystack" in all modalities, i.e., text, video and audio. It even maintains this recall performance when extending to 10M tokens in the text modality (approximately 7M words); 9.7M tokens in the audio modality (up to 107 hours); and 9.9M tokens in the video modality (up to 10.5 hours). The x-axis represents the context window, and the y-axis the depth percentage of the needle placed for a given context length. The results are color-coded: green for successful retrievals and red for unsuccessful ones. Note that the performance for all modalities is obtained with the previously reported Gemini 1.5 Pro version from February.

To measure the effectiveness of our models' multimodal long-context capabilities, we conduct experiments on both synthetic and real-world tasks. In synthetic "needle-in-a-haystack" tasks inspired by Kamradt (2023) that probe how reliably the model can recall information amidst distractor context, we find that both Gemini 1.5 Pro and Gemini 1.5 Flash achieve near-perfect (>99%) "needle" recall up to multiple millions of tokens of "haystack" in all modalities, i.e., text, video and audio. As part of our experimental setup, we also assessed the performance of Gemini 1.5 Pro when extending the context to
10M tokens across all three modalities. We found that the recall performance was maintained even with this significant increase in context size.

Gemini 1.5 Pro | Relative to 1.5 Pro (Feb) | Relative to 1.0 Pro | Relative to 1.0 Ultra
Long-Context Text, Video & Audio | no change | from 32k up to 10M tokens | from 32k up to 10M tokens
Core Capabilities | Win-rate: 78.1% (25/32 benchmarks) | Win-rate: 88.0% (44/50 benchmarks) | Win-rate: 77.8% (35/45 benchmarks)
Text | Win-rate: 78.6% (11/14 benchmarks) | Win-rate: 95.8% (23/24 benchmarks) | Win-rate: 84.2% (16/19 benchmarks)
Vision | Win-rate: 92.3% (12/13 benchmarks) | Win-rate: 95.2% (20/21 benchmarks) | Win-rate: 85.7% (18/21 benchmarks)
Audio* | Win-rate: 80% (4/5 benchmarks) | Win-rate: 60% (3/5 benchmarks) | Win-rate: 40% (2/5 benchmarks)

Table 1 | Gemini 1.5 Pro win-rates compared to Gemini 1.5 Pro from the February release, as well as to the Gemini 1.0 family. Gemini 1.5 Pro maintains high levels of performance even as its context window increases. Detailed results are presented in Table 10. *In speech recognition, it is generally accepted that any difference in Word Error Rate (WER) that falls within a 3% relative range is not statistically significant and can be considered mere noise, and we grouped such instances as wins for the latest systems.
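The 3%-relative tie rule used for the audio comparisons above can be made concrete in a few lines. This is a sketch only: the report does not spell out which WER serves as the baseline for "relative", so taking the smaller of the two is our assumption.

```python
def wer_tie(wer_a, wer_b, rel_threshold=0.03):
    """Treat two word error rates as statistically indistinguishable when
    their relative difference is within rel_threshold (3% here).

    The baseline for "relative" is taken as the smaller WER; the report
    does not state the exact convention, so this is one reasonable reading.
    """
    return abs(wer_a - wer_b) / min(wer_a, wer_b) <= rel_threshold

# 5.0% vs 5.1% WER is a 2% relative difference, so it counts as a tie,
# whereas 5.0% vs 5.2% is a 4% relative difference and does not.
```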
Gemini 1.5 Flash | Relative to 1.0 Pro | Relative to 1.0 Ultra
Long-Context Text, Video & Audio | from 32k up to 10M tokens | from 32k up to 10M tokens
Core Capabilities | Win-rate: 82.0% (41/50 benchmarks) | Win-rate: 46.7% (21/44 benchmarks)
Text | Win-rate: 94.7% (18/19 benchmarks) | Win-rate: 42.1% (8/19 benchmarks)
Vision | Win-rate: 90.5% (19/21 benchmarks) | Win-rate: 61.9% (13/21 benchmarks)
Audio | Win-rate: 0% (0/5 benchmarks) | Win-rate: 0% (0/5 benchmarks)

Table 2 | Gemini 1.5 Flash win-rates compared to the Gemini 1.0 family. Gemini 1.5 Flash, while being smaller and much more efficient and faster to serve, maintains high levels of performance even as its context window increases. Detailed results are presented in Table 10.

In more realistic multimodal long-context benchmarks which require retrieval and reasoning over multiple parts of the context (such as answering questions from long documents or long videos), we also see Gemini 1.5 Pro outperforming all competing models across all modalities, even when these models are augmented with external retrieval methods. We showcase the in-context learning abilities of both Gemini 1.5 Pro and Gemini 1.5 Flash enabled by very long context: for example, learning to translate a new language from a
single set of linguistic documentation. With only instructional materials (a 500-page reference grammar, a dictionary, and ~400 extra parallel sentences) all provided in context, Gemini 1.5 Pro and Gemini 1.5 Flash are capable of learning to translate from English to Kalamang, a Papuan language with fewer than 200 speakers² and therefore almost no online presence, with quality similar to a person who learned from the same materials. Moreover, we add in 45 minutes of transcribed Kalamang speech recordings to demonstrate that Gemini 1.5, for the first time with an LLM, can leverage mixed-modal documentation to learn speech recognition for a new language in context. We further showcase how the long-context capability of Gemini 1.5 models breaks new ground on long-context automatic speech recognition, long-context video understanding, in-context planning, and unstructured multimodal data analytics tasks.

Importantly, this leap in long-context performance does not come at the expense of the core multimodal capabilities of the model.³ Across an extensive battery of evaluations, both Gemini 1.5 Pro and Gemini 1.5 Flash greatly surpass Gemini 1.0 Pro (44/50 for Gemini 1.5 Pro and 41/50 for Gemini 1.5 Flash). These include core capabilities such as Math, Science and Reasoning (+49.6% and +30.8%, respectively, Sec. 6.1.1), Multilinguality (+21.4% and +16.7%, Sec. 6.1.4), Video Understanding (+18.7% and +7.5%, Sec. 6.2.4), Natural Image Understanding (+21.7% and +18.9%, Sec. 6.2.3), Chart and Document Understanding (+63.9% and +35.9%, Sec. 6.2.2), Multimodal Reasoning (+31.5% and +15.6%, Sec. 6.2.1), Code (+21.5% and +10.3%, Sec. 6.1.3), and more (see Table 10 and Table 2 for full breakdowns). These evaluations additionally cover a series of "agentic" tasks including Function Calling (+72.8% and +54.6%, Sec. 6.1.5), planning (Sec. 5.2.2.7), and in-the-wild long-tail real-world use cases such as improving job productivity for professionals (Sec. 6.1.7). These advances are particularly striking when benchmarking against Gemini 1.0 Ultra, a state-of-the-art model across many capabilities. Despite using significantly less training compute and being more efficient to serve, Gemini 1.5 Pro performs better on more than half of the overall benchmarks (35/45), and the majority of vision (18/21) and text (16/19) benchmarks. For Gemini 1.5 Flash, which is substantially more efficient to serve and faster at inference time, we find it to be better than Ultra 1.0 on the majority of vision benchmarks (13/21) and almost half the text benchmarks (8/18).

In the following sections, we provide an overview of the model architecture and present the results of large-scale quantitative evaluations comparing Gemini 1.5 Pro and
1.5 Flash to other LLMs. We present detailed evaluations of the models' long-context capabilities, followed by evaluations of their core capabilities, similar to the Gemini 1.0 Technical Report (Gemini-Team et al., 2023), covering well-studied benchmarks across text, code, image, video and audio. Finally, we discuss our approach to responsible deployment, including our process for impact assessment, developing model policies, evaluations, and mitigations of harm before deployment decisions.⁴

2. An Improved Gemini 1.5 Pro
Since the initial release in February, Gemini 1.5 Pro has undergone a number of pre-training and post-training iterations. These iterations have led to significant improvements in performance across the spectrum of model capabilities. On average, we see more than a 10% relative improvement in evals over the previous version of 1.5 Pro. See Figure 2 for a highlight of performance across a selection of benchmarks. On reasoning benchmarks, 1.5 Pro's performance on MATH (Hendrycks et al., 2021b) has improved from 58.5% to 67.7%, while on GPQA (Rein et al., 2023) 1.5 Pro now scores 46.2% compared to 41.5% before. We see a similar picture on multimodal tasks, with 1.5 Pro improving on all image understanding benchmarks and most video understanding benchmarks; on MathVista (Lu et al., 2023) Gemini 1.5 Pro's performance improves from 52.1% to 63.9%, on InfographicVQA (Mathew et al., 2022) it improves from 72.7% to 81.0%, and on EgoSchema (Mangalam et al., 2023) it improves from 65.1% to 72.2%. Gemini 1.5 Pro now achieves state-of-the-art results on several multimodal benchmarks including AI2D, MathVista, ChartQA, DocVQA, InfographicVQA and EgoSchema.

² Kalamang language: https:/
³ We define the core capabilities as those capabilities of the model that are primarily non-long-context (e.g., math, science, reasoning, code), similar to capabilities covered in the Gemini 1.0 Technical Report (Gemini-Team et al., 2023).
⁴ See the model card (Mitchell et al., 2019a) in Appendix Section 12.1.

Benchmark | Gemini 1.5 Pro (Feb 2024) | Gemini 1.5 Pro (May 2024)
MATH | 58.5 | 67.7 (+9.2)
GPQA | 41.5 | 46.2 (+4.7)
BigBench-Hard | 84.0 | 89.2 (+5.2)
MMLU | 81.9 | 85.9 (+4.0)
HumanEval | 71.9 | 84.1 (+12.2)
Natural2Code | 77.7 | 82.6 (+4.9)
WMT23 | 75.2 | 75.3 (+0.1)
V*Bench | 48.0 | 71.7 (+23.7)
MathVista | 54.7 | 63.9 (+9.2)
MMMU | 58.5 | 62.2 (+3.7)
FLEURS (lower is better) | 6.6 | 6.5 (-0.1)
EgoSchema | 65.1 | 72.2 (+7.1)

Figure 2 | Comparison of Gemini 1.5 Pro (May 2024) to the initial version (Feb 2024) across several benchmarks. The latest Gemini 1.5 Pro makes improvements across all reasoning, coding, vision and video benchmarks, with audio and translation performance remaining neutral. Note that for FLEURS a lower score is better.

3. Model Architecture
3.1. Gemini 1.5 Pro
Gemini 1.5 Pro is a sparse mixture-of-experts (MoE) Transformer-based model that builds on Gemini 1.0's (Gemini-Team et al., 2023) research advances and multimodal capabilities. Gemini 1.5 Pro also builds on a much longer history of MoE research at Google (Clark et al., 2022; Du et al., 2022; Fedus et al., 2021; Lepikhin et al., 2020; Riquelme et al., 2021; Shazeer et al., 2017; Zoph et al., 2022) and language model research in the broader literature (Anil et al., 2023b; Anthropic, 2023a; Brown et al., 2020; Chowdhery et al., 2023b; Hoffmann et al., 2022; Jiang et al., 2024; Kim et al., 2021; OpenAI, 2023a; Rae et al., 2021; Raffel et al., 2020; Roller et al., 2021; Thoppilan et al., 2022; Touvron et al., 2023a,b; Vaswani et al., 2017). MoE models use a learned routing function to direct inputs to a subset of the model's parameters for processing. This form of conditional computation (Bengio et al., 2013; Davis and Arel, 2014; Jacobs et al., 1991) allows models to grow their total parameter count while keeping the number of parameters that are activated for any given input constant.
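The conditional-computation idea can be sketched as a generic top-k router in plain NumPy. This is an illustrative sketch of MoE routing in general; the actual Gemini 1.5 routing function, expert count, and architecture are not disclosed in this report.

```python
import numpy as np

def top_k_route(x, w_router, experts, k=2):
    """Route one token vector x to the top-k experts by router score.

    w_router: (d_model, n_experts) learned routing weights (random here).
    experts:  list of callables, each mapping (d_model,) -> (d_model,).
    Only k experts run per token, so the active parameter count stays
    constant even as the total number of experts grows.
    """
    logits = x @ w_router                      # (n_experts,) router scores
    top = np.argsort(logits)[-k:]              # indices of the k best experts
    # Softmax over the selected logits gives the mixing weights.
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
# Each "expert" is a toy linear map with its own weight matrix.
experts = [(lambda W: (lambda x: x @ W))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]
w_router = rng.normal(size=(d, n_experts))
y = top_k_route(rng.normal(size=d), w_router, experts, k=2)
```

With k fixed, doubling the expert count doubles total parameters while the per-token compute stays the same, which is the efficiency lever the paragraph above describes.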
A host of improvements made across nearly the entire model stack (architecture, data, optimization and systems) allows Gemini 1.5 Pro to achieve comparable quality to Gemini 1.0 Ultra (see Section 6), while using significantly less training compute and being significantly more efficient to serve. Gemini 1.5 Pro also incorporates a series of significant architecture changes that enable long-context understanding of inputs up to 10 million tokens without degrading performance. Translated into real-world data, this context length enables Gemini 1.5 Pro models to comfortably process almost five days of audio recordings (i.e., 107 hours), more than ten times the entirety of the 1440-page book (587,287 words) War and Peace, the entire Flax (Heek et al., 2023) codebase (41,070 lines of code), or 10.5 hours of video at 1 frame per second.
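For intuition, the modality-to-token rates implied by the numbers in this section can be checked with simple arithmetic. These are approximate rates back-derived from the report's own figures, not an official tokenization specification.

```python
# Back-of-envelope rates implied by the figures above (approximate):
#   video: ~9.9M tokens for 10.5 h at 1 frame/s -> ~260 tokens per frame
#   audio: ~9.7M tokens for 107 h               -> ~25 tokens per second
#   text : ~10M tokens for ~7M words            -> ~1.4 tokens per word
video_tokens_per_frame = 9.9e6 / (10.5 * 3600)   # 1 frame per second
audio_tokens_per_sec = 9.7e6 / (107 * 3600)
text_tokens_per_word = 10e6 / 7e6

# Example: roughly how many tokens would War and Peace (587,287 words) take?
war_and_peace_tokens = 587_287 * text_tokens_per_word
```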
Further, since the model is natively multimodal and supports interleaving of data from different modalities, it can support a mix of audio, visual, text, and code inputs in the same input sequence. In Section 5.1, we highlight some of the novel capabilities enabled by these advances, including evaluations that yielded positive results on context lengths up to 10 million tokens. We note that understanding the limits of these capabilities, and studying their exciting applications, remains an area of continued research exploration.

3.2. Gemini 1.5 Flash
Gemini 1.5 Flash is a Transformer decoder model with the same 2M+ context and multimodal capabilities as Gemini 1.5 Pro, designed for efficient utilization of tensor processing units (TPUs) with lower latency for model serving. For example, Gemini 1.5 Flash computes the attention and feedforward components in parallel (Chowdhery et al., 2023b), and is also online distilled (Agarwal et al., 2024b; Anil et al., 2018; Beyer et al., 2021; Bucila et al., 2006; Hinton et al., 2015) from the much larger Gemini 1.5 Pro model. It is trained with higher-order preconditioned methods (Becker and LeCun, 1989; Duchi et al., 2011; Heskes, 2000) for improved quality.

3.3. Serving efficiency and latency
In addition to serving 2M+ context, Gemini 1.5 models were designed for high efficiency and low latency at all
context lengths. For most queries, the time per output character dominates total request latency during Transformer inference (Pope et al., 2023), and this compounds for applications using chained prompts and agentic workflows.

To assess the latency of Gemini 1.5 compared with other foundation model APIs across multiple languages, we report the average time per output character on English, Japanese, Chinese, and French queries. Latency measurements were taken from the Vertex AI streaming API for Gemini 1.5 and Claude 3, and from the OpenAI streaming API for GPT-3.5 and GPT-4. To account for variance in traffic load, we calculate the mean time per output character for 32 queries and report the minimum value across the 32 requests.

Language | Gemini 1.5 Flash | Gemini 1.5 Pro | GPT-3.5 Turbo | GPT-4 Turbo | Claude 3 Haiku | Claude 3 Sonnet | Claude 3 Opus
English | 1.5 | 4.3 | 2.6 | 6.8 | 2.2 | 6.2 | 10.5
Japanese | 4.3 | 10.9 | 12.9 | 35.4 | 10.7 | 23.9 | 46.6
Chinese | 5.2 | 14.1 | 18.4 | 53.4 | 12.8 | 30 | 55.3
French | 1.9 | 4.7 | 4.1 | 10.7 | 3.0 | 6.8 | 13.4

Table 3 | Time per output character (ms) of various APIs for English, Japanese, Chinese, and French responses, given inputs of 10,000 characters. Gemini 1.5 Flash achieves the fastest output generation for all languages tested.
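The min-of-means protocol behind Table 3 can be sketched as follows. The timings here are hypothetical placeholders; real measurements would come from the streaming APIs named above.

```python
import random
random.seed(0)

def time_per_char_stat(request_timings):
    """request_timings: list of (elapsed_seconds, n_output_chars) per request.

    For each request, compute the mean time per output character in ms,
    then report the minimum across requests, which reduces sensitivity
    to transient traffic load during measurement.
    """
    per_request = [1000.0 * t / n for t, n in request_timings]  # ms/char
    return min(per_request)

# Hypothetical elapsed times for 32 streamed requests of ~1000 chars each.
timings = [(random.uniform(1.5, 3.0), 1000) for _ in range(32)]
best = time_per_char_stat(timings)
```

Taking the minimum rather than the mean across requests deliberately reports the best-case serving speed, filtering out requests slowed by load spikes.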
Across all four evaluated languages, Gemini 1.5 Flash yields the fastest output generation of all models, and Gemini 1.5 Pro shows faster generation than GPT-4 Turbo, Claude 3 Sonnet, and Claude 3 Opus (see Table 3). For English queries, Gemini 1.5 Flash generates over 650 characters per second, more than 30% faster than Claude 3 Haiku, the second fastest of the models evaluated.

4. Training Infrastructure and Dataset
Like the Gemini 1.0 series, Gemini 1.5 models are trained on multiple 4096-chip pods of Google's TPUv4 accelerators, distributed across multiple datacenters, and on a variety of multimodal and multilingual data. Our pre-training dataset includes data sourced across many different domains, including web documents and code, and incorporates image, audio, and video content. For the instruction-tuning phase, we fine-tuned Gemini 1.5 models on a collection of multimodal data (containing paired instructions and appropriate responses), with further tuning based on human preference data. We refer readers to the Gemini 1.0 Technical Report (Gemini-Team et al., 2023) for further information.

5. Evaluation Results
Existing evaluations are increasingly strained by the new and rapidly
advancing capabilities of large multimodal models. They typically focus on individual modalities and/or are restricted to tasks with shorter context lengths. Hence, there is a growing need for benchmarks which exemplify the nuanced requirements of real-world, long mixed-modality use cases. Among these, we highlight the quantitative assessment of reasoning capabilities across long mixed-modality sequences as a key challenge.

With the challenges of evaluating increasingly capable models in mind, our evaluation of the Gemini 1.5 series first focuses on understanding and evaluating its novel capabilities. Subsequently, we explore core benchmarks, covering capabilities studied in the Gemini 1.0 Technical Report (Gemini-Team et al., 2023). Specifically, we evaluate Gemini 1.5 in three main categories:⁵
1. Qualitative long-context multimodal evaluations: manually probe and stress-test the models' long-context abilities, especially for novel capabilities where no quantitative benchmarks exist.
2. Quantitative long-context multimodal evaluations: measure the models' long-context abilities on both synthetic and real-world tasks with well-defined metrics.
3. Quantitative core evaluations: identify progress and regressions in core capabilities (e.g., coding, math, science, multilinguality and instruction following).

5.1. Qualitative Examples of Multimodal Long-Context Capabilities
The ability to process multiple millions of tokens unlocks practical applications that were not possible before. In this section we demonstrate some surprising
interactions we observed with Gemini 1.5 Pro across code, text and video.⁶

As shown in Figure 3, Gemini 1.5 Pro is able to ingest entire large codebases such as JAX (746,152 tokens), and answer very specific queries about them. In Figure 4, we show Gemini 1.5 Pro's ability to learn a new language based only on reference materials given in its input (see Section 5.2.2.1 for quantitative metrics for this use case). Additionally, we test Gemini 1.5 Pro's ability to answer an image query given the entire text of Les Misérables, and observe that being natively multimodal allows it to locate a famous scene from a hand-drawn sketch, as shown in Figure 5. Lastly, in Figure 6 we ask Gemini 1.5 Pro questions about an entire 45-minute movie, which the model answers seamlessly while retrieving moments and timestamps down to the second.

⁵ We note that all the evaluations are from the same checkpoint of the Gemini 1.5 models that are instruction-tuned post pre-training, unless otherwise stated. All sampling evals throughout the report are done with temperature 0 (greedy decoding).
⁶ For additional short videos of demonstrations of the long-context abilities of Gemini 1.5 Pro across video, text, and code, see https://deepmind.google/technologies/gemini/.

Figure 3 | Given the entire 746,152-token JAX codebase in context, Gemini 1.5 Pro can identify the specific location of a core automatic differentiation method.

Figure 4 | Given a reference grammar
book and a bilingual wordlist (dictionary), Gemini 1.5 Pro is able to translate from English to Kalamang with similar quality to a human who learned from the same materials.

User prompt: Look at the event in this drawing. What page is this on?
Parsed PDF: les_miserables.pdf (1382 pages, 732,162 tokens)
File: les_miserables.pdf. "Les Miserables by Victor Hugo: PREFACE. So long as there shall exist, by virtue of law and custom, decrees of damnation pronounced by society, artificially creating hells amid the civilization of earth, and adding the element of human fate to divine destiny; ..."
Model output: This is on page 119. Page 119: "My friend," resumed the Bishop, "before you go, here are your candlesticks. Take them." He stepped to the chimney-piece, took the two silver candlesticks, and brought them to Jean Valjean. The two women looked on without uttering a word, without a gesture, without a look which could disconcert the Bishop.

Figure 5 | With the entire text of Les Misérables in the prompt (1382 pages, 732k tokens), Gemini 1.5 Pro is able to identify and locate a famous scene from a hand-drawn sketch.

Figure 6 | When prompted with a
45-minute Buster Keaton movie, "Sherlock Jr." (1924) (2,674 frames at 1 FPS, 684k tokens), Gemini 1.5 Pro retrieves and extracts textual information from a specific frame and provides the corresponding timestamp. At bottom right, the model identifies a scene in the movie from a hand-drawn sketch.

5.2. Long-context Evaluations
For the past few years, LLM research has prioritized expanding the context window from which models can incorporate information (Anthropic, 2023a; OpenAI, 2023a). This emphasis stems from the recognition that a wider context window allows models to incorporate a larger amount of new, task-specific information not found in the training data at inference time, leading to improved performance on various natural language or multimodal tasks. Recent approaches to improving the long-context capabilities of models fall into a few categories, including novel architectural approaches (Ainslie et al., 2023; Gu and Dao, 2023; Guo et al., 2021; Orvieto et al., 2023; Zaheer et al., 2020), post-training modifications (Bertsch et al., 2023; Chen et al., 2023b; Press et al., 2021; Xiong et al., 2023), retrieval-augmented models (Guu et al., 2020; Izacard et al., 2022; Jiang et al., 2022; Karpukhin et al., 2020; Santhanam et al., 2021), memory-augmented models (Bulatov et al., 2022, 2023; Martins et al., 2022; Mu et al., 2023; Wu et al., 2022a,b; Zhong et al., 2022), and techniques for building more coherent long-context datasets (Shi et al., 2023b; Staniszewski et al., 2023). This activity has resulted in measurable improvements in the long-context capabilities of LLMs over the past several months, with the recent concurrent work of Liu et al. (2024) exploring context windows of 7B models up to 1M multimodal tokens. Notably, among the state-of-the-art LLMs, Anthropic successfully extended the context of their text-only Claude 2 model to 100k tokens, while OpenAI recently released GPT-4 Turbo reaching 128k tokens. Finally, the latest addition to the series was Claude 3, with a context window of up to 1M tokens.

Figure 7 | Cumulative average negative log-likelihood (NLL) as a function of token position in long-document and code data (left panel: long documents, R² = 0.997; right panel: code, R² = 0.995; series: Gemini 1.5 Flash, Gemini 1.0 Pro, Gemini 1.5 Pro, and a power-law fit). A lower value demonstrates better prediction. Gemini 1.5 Pro shows improved predictions up to 1M tokens for long documents and 10M tokens for code, whereas Gemini 1.0 Pro improves only up to 32K tokens. Gemini 1.5 Flash shows improvement up to 1M tokens for long documents and 2M tokens for code. The NLL of Gemini 1.5 Pro follows a power-law trend up until 1M tokens (documents) and 2M tokens (code), with a deviating trend at 10M tokens.

Gemini 1.5 Pro significantly extends this context-length frontier to multiple millions of tokens with almost no degradation in performance, making it possible to process significantly larger inputs. Compared to Claude 2.1 with a 200k-token context window, Gemini 1.5 Pro achieves 100% recall at 200k tokens, surpassing Claude 2.1's 98%. This 100% recall is maintained up to 530k tokens, and recall is 99.7% at 1M tokens. When increasing from 1M tokens to 10M tokens, the model retains 99.2% recall. Moreover, Gemini 1.5 Pro's native multimodal capabilities enable the model to ingest multiple hours of audio and video recordings alongside or interleaved with text. Such recall capabilities are summarized in Figure 1. Below we report results on long-context evaluations across all three modalities, i.e., text, vision and audio. Similarly, Gemini 1.5 Flash achieves almost perfect recall across all three modalities up to 2M tokens, yielding 100% recall on text, 99.8% on video and 99.1% on audio.

The evaluation methodology we followed to measure the long-context capability of Gemini 1.5 models consists of both diagnostic-focused probing of long-context capabilities (e.g., perplexity over long sequences, needle-in-a-haystack retrieval studies) and realistic evaluations specifically designed for multimodal long-context tasks (e.g., long-document QA, long-context automatic speech recognition, learning to translate a new language from only one book, and long-context video QA). To provide a reference point, throughout this section we compare Gemini 1.5 models with the leading model available
externally for each task. With the evaluation harness we developed for Gemini 1.5 models, we are able to reliably quantify the quality of long-context understanding capabilities all the way up to 10M tokens.

5.2.1. Diagnostic Long-Context Evaluations
5.2.1.1 Perplexity over Long Sequences
We start by reporting results on the text modality. To evaluate the ability of the models to make use of very long contexts to improve next-token prediction, which is the objective function used to train language models, we record the negative log-likelihood (NLL) of tokens at different positions in the input sequences from held-out text (i.e., not used in training). Here, a lower value implies an improved prediction. Typically, we expect tokens at the beginning of a sequence to have high NLL, as there is little to no context that the model can use to predict them, and tokens later in the sequence to have lower NLL as more information becomes available to the model. The shape of the resulting curve indicates the ability of models to reason over long context. A downward trend signifies models making use of long context to reduce their uncertainty; an upward trend signifies that models are unable to effectively use information from the previous context and may be deteriorating in prediction quality, highlighting limitations in their long-context understanding capability.

We perform this analysis on two data sources: (a) a dataset of long documents with up to 1 million tokens, and (b) a dataset of code repositories constructed by first randomly shuffling all the files and then concatenating them. The code dataset contains sequences longer than 1 million tokens with some natural form of semantic association (e.g., a whole repository), allowing for further evaluation on sequences of up to 10M tokens. Figure 7 shows the cumulative NLL up to a specific token index.⁷ We also fit a power law of the form L(x) = a·x^b + c to these data points (dashed line).
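The cumulative-average NLL curve and a power-law fit can be reproduced in miniature with the standard library alone. This is a sketch on synthetic data; for simplicity it drops the additive constant c from the a·x^b + c form so the fit stays closed-form via log-log linear regression.

```python
import math

def cumulative_avg_nll(nlls):
    """Cumulative average NLL at each token position (lower is better)."""
    out, total = [], 0.0
    for i, v in enumerate(nlls, start=1):
        total += v
        out.append(total / i)
    return out

def fit_power_law(xs, ys):
    """Least-squares fit of y ~ a * x**b via log-log linear regression.

    Ignoring the additive constant c keeps the fit closed-form; a full
    a*x**b + c fit would need a nonlinear optimizer.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - b * mx)
    return a, b

# Synthetic check: data generated from y = 2 * x**-0.1 is recovered.
xs = [128 * 2 ** i for i in range(14)]        # 128 ... ~1M token positions
ys = [2.0 * x ** -0.1 for x in xs]
a, b = fit_power_law(xs, ys)
```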
101、ccuracy improves up to the tested sequence lengths(1M for long documents,and 10M for code),indicating that our models can make use of the whole input even at very long-context lengths.Thissuggests that Gemini 1.5 models are able to improve their predictions by finding useful patterns intokens,even i
102、f they occurred millions of tokens in the past,as in the case of code.Finally,we see this improved prediction follows a regular power-law structure.While it is wellknown that language models follow a power-law in terms of training compute to model performance(NLL)(Kaplan et al.,2020)up to a very lar
103、ge scale,we demonstrate that a power law can holdbetween log-loss and context length up to extremely long context lengths.We see the power-lawfit is quite accurate up to 1M tokens for long-documents and about 2M tokens for code for Gemini1.5 Pro.From inspecting longer code token predictions closer t
5.2.1.2. Text Haystack

Next, we move to testing long-context recall using the recently introduced needle-in-a-haystack evaluation (Kamradt, 2023), which tests a model's ability to retrieve a text (the "needle") inserted at various positions into a sequence (the "haystack"). Following prior work (Dhinakaran, 2024), we use a set of concatenated and repeated essays written by Paul Graham8 to fill the desired context length. We insert a needle at linearly spaced intervals from the beginning to the end of the context, where the needle takes the form "The special magic <city> number is: <number>", with the city and number varied for each query, and we query the model to return the magic number for a specific city. We report whether the magic number was recalled correctly at various context lengths (x-axis, the haystack size) as a function of the needle's position in the input sequence, expressed as a depth percentage (y-axis); e.g., a depth of 100% indicates a needle inserted at the very end of the input, whereas 0% indicates the very beginning.
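A minimal version of this construction can be sketched as follows. The filler text, city pool, and prompt wording are placeholders for illustration, not the exact strings used in the evaluation:

```python
import random

FILLER = "Essay filler text. " * 1000         # stands in for the repeated essays
CITIES = ["Paris", "Tokyo", "Lagos", "Lima"]  # placeholder city pool

def build_haystack(num_chars: int, depth_pct: float, rng: random.Random):
    """Return (prompt, city, number) with the needle inserted at depth_pct."""
    city = rng.choice(CITIES)
    number = rng.randint(1000, 9999)
    needle = f"The special magic {city} number is: {number}. "
    filler = (FILLER * (num_chars // len(FILLER) + 1))[:num_chars]
    pos = int(len(filler) * depth_pct / 100)  # 0% = start, 100% = end
    haystack = filler[:pos] + needle + filler[pos:]
    query = f"What is the special magic {city} number?"
    return haystack + "\n" + query, city, number

prompt, city, number = build_haystack(20_000, 50.0, random.Random(0))
print(str(number) in prompt)  # -> True
```

Sweeping num_chars over context sizes and depth_pct over linearly spaced depths yields the grid of cells plotted in Figure 8.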
7 We note that we are unable to obtain logits for other commercially available LLMs for comparison.
8 https:/

Figure 8 | Text Haystack. This figure compares Gemini 1.5 Pro with GPT-4 Turbo on the text needle-in-a-haystack task. Green cells indicate that the model successfully retrieved the secret number, gray cells indicate API errors, and red cells indicate that the model response did not contain the secret number. The top row shows results for Gemini 1.5 Pro, from 1k to 1M tokens (top left) and from 1M to 10M tokens (top right). The bottom row shows results for GPT-4 Turbo up to the maximum supported context length of 128k tokens.

As can be seen in Figure 8, Gemini 1.5 Pro achieves 100% recall up to 530k tokens and 99.7% recall up to 1M tokens. This task, while simple, provides a clear demonstration that Gemini 1.5 Pro is able to reliably retrieve information from long documents up to 1M tokens. For reference,
we report results for GPT-4 Turbo up to the 128K sequence length supported by their API. To test whether the capabilities demonstrated in the perplexity plots in Figure 7 transfer to sampling tasks, we continue to evaluate Gemini 1.5 Pro on the needle-in-a-haystack task beyond 1M tokens. The results in Figure 8 show that the model is still able to find and extract information with 99.2% accuracy up to 10M tokens. On the same task, we also evaluated Gemini 1.5 Flash up to 2M tokens and attained flawless recall of 100%, suggesting best-in-class long-context text retrieval performance within its tier of models.

In Section 9.4.1, we also showcase an "adversarial" version of this needle-in-a-haystack task for long-context safety evaluations.

5.2.1.3. Video Haystack

As Gemini 1.5 Pro is natively multimodal, its long-context abilities translate directly to other modalities, enabling it to retrieve specific information across multiple hours of video. To test this capability, we adapt the text needle-in-a-haystack evaluation into a cross-modal evaluation, wherein a needle is hidden in one modality while the retrieval query is given in text. Rather than asking the model to retrieve a randomly inserted phrase from a corpus of text, we ask it to retrieve information embedded in a random frame (the "needle") in a 10.5-hour-long video (the "haystack") sampled at one frame per second.

Concretely, we overlay the text "The secret word is needle" on a single randomly sampled video frame in a 10.5-hour video constructed by concatenating seven copies of the full AlphaGo documentary (Kohs, 2017) back-to-back (for a total of 37,994 frames, or 9.9M tokens). See Figure 28 in the Appendix for an example of such an embedded frame.
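As a back-of-envelope check using only the figures stated above (1 fps, 37,994 frames, roughly 9.9M tokens), the implied video length and per-frame token cost are:

```python
frames = 37_994     # seven back-to-back copies of the documentary at 1 fps
tokens = 9_900_000  # approximate total token count stated for the video

hours = frames / 3600               # one frame per second of video
tokens_per_frame = tokens / frames

print(round(hours, 1))          # -> 10.6 (quoted as a "10.5-hour" video)
print(round(tokens_per_frame))  # -> 261, i.e., a few hundred tokens per frame
```

The small gap from the quoted 10.5 hours comes from rounding in the stated frame and token counts.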
After feeding it the video, we ask the

Figure 9 | Video Haystack. This figure compares Gemini 1.5 Pro with
GPT-4V on the video needle-in-a-haystack task, where the models are given video clips of different lengths, up to 10.5 hours of video, and are asked to retrieve a secret word embedded as text at different points within the clip. All video clips are sampled at one frame per second (1 fps). The pair of 10×50 haystack plots on the left compares Gemini 1.5 Pro with GPT-4V on the first hour of the AlphaGo documentary. The x-axis represents the video duration, ranging from 1.2 minutes to 1 hour, and the y-axis represents the depth, namely the relative offset of the needle (e.g., the top-left cell represents providing the model with the first 1.2 minutes of video and inserting the needle in a randomly sampled frame within the first seven seconds of that trimmed video). A green cell indicates that the model successfully retrieved the needle, whereas a gray cell indicates an API error. Whereas the GPT-4V API supports video lengths only up to around the first 3 minutes, Gemini 1.5 Pro successfully retrieves the secret word inserted at all depth percentages for the full hour, as shown by the all-green plot. Finally, the 10×10 grid on the right shows Gemini 1.5 Pro's perfect retrieval across 10.5 hours of video, constructed by concatenating seven copies of the AlphaGo documentary back-to-back.

model to answer the question "What is the secret word?". As Figure 9 shows, Gemini 1.5 Pro successfully answers this question across a breadth of video lengths and a range of randomly inserted needle locations in the 10.5-hour
video. In contrast, the GPT-4V API supports video lengths only up to around the first 3 minutes. We also evaluated Gemini 1.5 Flash on the video-haystack problem with up to 2M tokens and obtained 99.8% recall, demonstrating best-in-class long-context retrieval performance in the vision modality.

5.2.1.4. Audio Haystack

We follow a similar strategy for testing Gemini 1.5 Pro's long-context capabilities on audio understanding. We hide a very short clip of audio lasting a few seconds, in which a speaker says "the secret keyword is needle", within an audio signal (the haystack) up to almost five days long (i.e., 107 hours). The task for the model is then to retrieve the secret keyword, given a question in text, hence requiring cross-modal reasoning. To further challenge the model beyond increasing context, the large audio signal is built from an unlabeled speech corpus from the VoxPopuli dataset (Wang et al., 2021) so that the
input signal contains multiple speakers.

Figure 10 | Audio Haystack. This figure presents the audio version of the needle-in-a-haystack experiment, comparing Gemini 1.5 Pro with a combination of Whisper and GPT-4 Turbo. In this setting, the needle is a short segment of audio inserted within a very large audio segment (of up to 107 hours) containing concatenated audio clips; the task is to retrieve the secret keyword, which is revealed in the needle. Red indicates that the model did not identify the keyword, whereas green indicates that the model identified the keyword correctly.

In Figure 10 we plot the result of the experiment when the input audio ranges from 12 minutes to 107 hours (or 9.9M tokens), inserting the needle at different positions across the signal. The red boxes indicate a score of 0.0 (meaning the model did not identify the keyword), and green
indicates a score of 1.0 (meaning the model identified the keyword correctly). The Gemini 1.5 models succeed at finding the secret keyword in all instances, with an overall accuracy of 100% for Gemini 1.5 Pro and 98.7% for Gemini 1.5 Flash on this task.

Unlike Gemini 1.5 Pro and Gemini 1.5 Flash, existing models cannot natively handle more than a few seconds of audio in the context. As such, in order to compare against them fairly, we need to employ a strategy where we first transcribe audio into text using windows of tens of seconds, and then rely on text models to extend beyond that limited window. Specifically, to compare against Whisper, we chunk the audio input into 30-second segments, transcribe the audio using that model to produce a text transcript, concatenate the transcripts for each chunk, and finally prompt GPT-4 Turbo to find the "secret keyword" given the text transcript.
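The cascade baseline described above can be sketched as a small pipeline. transcribe_fn and llm_fn below are injectable stand-ins for the Whisper and GPT-4 Turbo calls (not real API bindings), so the control flow can be shown without network access:

```python
from typing import Callable, List

def cascade_find_keyword(
    audio: List[float],
    sample_rate: int,
    transcribe_fn: Callable[[List[float]], str],  # stand-in for Whisper
    llm_fn: Callable[[str], str],                 # stand-in for GPT-4 Turbo
    chunk_seconds: int = 30,
) -> str:
    """Transcribe fixed-size chunks, concatenate, then query a text model."""
    chunk = chunk_seconds * sample_rate
    transcripts = [
        transcribe_fn(audio[i:i + chunk]) for i in range(0, len(audio), chunk)
    ]
    transcript = " ".join(transcripts)
    return llm_fn(f"Find the secret keyword in this transcript:\n{transcript}")

# Toy run with stubbed components: 90 s of "audio" at 16 kHz -> 3 chunks.
audio = [0.0] * (16_000 * 90)
calls = []
def fake_transcribe(chunk):
    calls.append(len(chunk))
    return "the secret keyword is needle" if len(calls) == 2 else "background chatter"
answer = cascade_find_keyword(
    audio, 16_000, fake_transcribe,
    lambda p: "needle" if "needle" in p else "not found",
)
print(answer, len(calls))  # -> needle 3
```

Chunked transcription loses cross-chunk context, which is one reason such a cascade can trail natively multimodal models.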
Figure 10 shows the performance for each depth percentage and number of hours. The overall accuracy of Whisper combined with GPT-4 Turbo at identifying the needle is around 94.5%.

5.2.1.5. Improved Diagnostics

Despite the excellent performance of Gemini 1.5 Pro and Gemini 1.5 Flash on the needle-in-a-haystack tasks for all three modalities, significantly surpassing previously reported results (99.7% for text, 100% for video and 100% for audio), we also present early findings on observed limitations. By design, the needle-in-a-haystack task is a retrieval task measuring recall, and so far we have considered the simplest possible setup. A natural extension is to increase the number of unique "needles" in each haystack and require the model to retrieve them all. For a context length of up to 1M tokens, we inserted 100 different needles and measured the total number of correct needles retrieved.
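Scoring this variant reduces to counting how many of the inserted city-number pairs the model's answer reproduces. The matching rule below (a regex requiring each number to follow its city) is our own illustrative choice, not the report's exact scoring:

```python
import re

def multi_needle_recall(model_answer: str, needles: dict) -> float:
    """Fraction of (city -> magic number) needles recovered in the answer."""
    found = sum(
        1 for city, number in needles.items()
        if re.search(rf"{city}\D*{number}", model_answer)  # number follows its city
    )
    return found / len(needles)

needles = {"Paris": 1111, "Tokyo": 2222, "Lagos": 3333}
answer = "Paris: 1111, Tokyo: 9999, Lagos: 3333"   # one of three is wrong
print(round(multi_needle_recall(answer, needles), 2))  # -> 0.67
```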
Figure 11 | Retrieval performance on the "multiple needles-in-a-haystack" task, which requires retrieving 100 unique needles in a single turn. Comparing Gemini 1.5 Pro to GPT-4 Turbo, we observe higher recall at shorter context lengths, and a very small decrease in recall towards 1M tokens.

Figure 11 compares the recall of Gemini 1.5 Pro9 and GPT-4 Turbo on this task. We see improved recall from Gemini 1.5 Pro over GPT-4 Turbo up until 128K tokens. It is important to note that GPT-4 Turbo's context length is limited to 128K tokens, and that its retrieval quality oscillates considerably at longer context lengths, with an average recall of around 50% at 128K tokens. In contrast, Gemini 1.5 Pro maintains around 70% recall up to 128K tokens, and 60% recall up to 1M tokens. We report further results on this task in Appendix 12.2, including results with different numbers of needles, where we observe consistent trends.

In line with other tasks in the LLM literature, we also observe that the choice of prompting method and the type of needle affect the final performance of models, and future versions of "needle(s)-in-a-haystack"-style tests should account for prompt robustness.

We also modulate retrieval difficulty along another axis: the similarity of the needles. In the Multi-round Co-reference Resolution (MRCR) task, the model is presented with a long conversation between a user and a model, in
which the user requests writing (e.g., poems, riddles, essays) on different topics, followed by the model's responses. In each conversation, two user requests containing topics and writing formats distinct from the rest of the conversation are randomly placed in the context. Given the conversation, the model must reproduce the model output (the needle) resulting from one of the two requests (the key). Either the formats, the topics, or both overlap in order to create a single key that is adversarially similar to the query key. For instance, the request "Reproduce the poem about penguins." requires the model to distinguish the poem about penguins from the poem about flamingos, and "Reproduce the first poem about penguins." requires the model to reason about ordering. We score MRCR via a string-similarity measure between the model output and the correct response.10
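As footnote 10 specifies, the string-similarity measure is the SequenceMatcher ratio from Python's standard difflib module; the scoring itself is a one-liner (the example strings here are invented):

```python
from difflib import SequenceMatcher

def mrcr_score(model_output: str, reference: str) -> float:
    """Similarity in [0, 1]; 1.0 means an exact reproduction of the needle."""
    return SequenceMatcher(None, model_output, reference).ratio()

print(mrcr_score("a poem about penguins", "a poem about penguins"))         # -> 1.0
print(mrcr_score("a poem about penguins", "a poem about flamingos") < 1.0)  # -> True
```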
Figure 12 compares the capabilities of Gemini 1.5 Pro, Gemini 1.5 Flash, the GPT-4 Turbo 01-25-2024 and 04-09-2024 models, the Claude 3 models, and Claude 2.1 on MRCR. Gemini 1.5 Pro overtakes GPT-4 Turbo 04-09-2024 and Claude 3 Opus at around 32K tokens, and GPT-4 Turbo 01-25-2024 at around 8K tokens, while Gemini 1.5 Flash consistently dominates Claude 3 Sonnet beyond around 5K tokens (performing comparably at 5K tokens), overtakes GPT-4 Turbo 01-25-2024 at around 20K tokens, and always strictly dominates Claude 3 Haiku. After 32K tokens, Gemini 1.5 Flash is

9 All the results for this task are obtained with the previously reported Gemini 1.5 Pro version from February.
10 SequenceMatcher ratio as implemented in https://docs.python.org/3/library/difflib.html

Figure 12 | Cumulative average string-similarity score as a function of context length over 2,000 instances of the MRCR task. Comparing Gemini 1.5 Pro and Gemini 1.5 Flash to GPT-4 Turbo and the Claude 3 models, we observe that after 32K tokens, Gemini 1.5 Pro outperforms Claude 3 Opus and GPT-4 Turbo (04-09-2024). Gemini 1.5 Flash outperforms Claude 3 Sonnet starting at around 5K tokens and GPT-4 Turbo (01-25-2024) at around 20K tokens, and always outperforms Claude 3 Haiku and Claude 2.1. Both Gemini 1.5 Pro and Gemini 1.5 Flash show smaller decreases in performance as a function of context length, up to 1M tokens, than all competitors.

on par with both GPT-4 Turbo 04-09-2024 and Claude 3 Opus, and retains this performance all the way to 1M tokens. Both Gemini models achieve an average score of around 75% at 1M tokens. GPT-4 Turbo and Claude 3 model performance falls off steadily as a function of context length, all the way up to 128K tokens, though GPT-4 Turbo 04-09-2024 and Claude 3 Opus have less negative slopes than the other models from their families. Claude 2.1 (with context length going up to 200K tokens) underperforms the other models by hallucinating that the needle is not in context and punting on requests to solve the task, despite our following Claude 2.1's prompting guidelines for retrieval tasks (Anthropic, 2023b).

We highlight that "multiple needles-in-a-haystack" and MRCR capture different aspects of the retrieval
task: MRCR is harder and requires stronger reasoning and disambiguation skills, while the multiple-needles challenge is a test of the model's recall ability, which explains the disparities between the model orderings up to 8K tokens. Gemini 1.5 Pro and Gemini 1.5 Flash impressively avoid serious degradation along both of these axes all the way up to 1M tokens.

While the "multiple needles-in-a-haystack" and MRCR evaluations offer two challenging setups that stress-test long-context retrieval and reasoning capabilities in different ways, we advocate for pushing the boundaries even further. Evaluating models on tasks that demand complex reasoning over multiple pieces of information scattered across a long context would provide even deeper insights into their true capabilities. This could involve tasks that require integrating disparate facts, drawing inferences, or resolving inconsistencies within the retrieved information. By incorporating such assessments alongside prompt-robustness studies, we can gain a more comprehensive and nuanced understanding of how effectively models can utilize long contexts for advanced reasoning and knowledge extraction.

5.2.2. Realistic Long-Context Evaluations

Having investigated the models' multimodal abilities at handling long context using a battery of diagnostic tests, we now turn to a series of novel multimodal tasks designed to better reflect the potential uses of this model, thus stress-testing mode
ls in a more realistic way.

5.2.2.1. In-context language learning: learning to translate a new language from one book

To test the in-context learning abilities enabled by very long context, we evaluate Gemini 1.5 Flash and 1.5 Pro on the Machine Translation from One Book (MTOB) benchmark (Tanzer et al., 2023). MTOB measures the ability to learn to perform sentence-level translation between English and Kalamang (ISO 639-3 language code: kgv) from instructional materials. Kalamang has fewer than 200 speakers and therefore virtually no presence on the web, which means that the model must rely on the data given in context (rather than knowledge stored in its weights at training time).11 The available resources for Kalamang are field linguistics documentation12 comprising a ~500-page reference grammar (Visser, 2020b), a ~2000-entry bilingual wordlist (Visser, 2020a), and a set of ~400 additional parallel sentences (Visser, 2020a). In total, the available resources for Kalamang add up to around 250k tokens. This task framing offers the promise of using extremely long-context models to support languages that are not sufficiently represented in pre-training corpora, with curated resources that can be created and deployed by independent parties.

To perform the task, we provide Gemini 1.5 Flash and 1.5 Pro with the full set of materials in their input context. To compare fairly against GPT-4 Turbo (version 2024-04-09) and Claude 3, whose publicly released context windows (128K and 200K tokens, respectively) cannot fit the full materials, we also report results using only half of the grammar book (~100k tokens). Moreover, to test to what extent the models are making use of information in the context, rather than relying on knowledge found in their pre-training data, we also run a 0-shot setup. Finally, we compare to MTOB's reference for human performance, in which a human learned Kalamang from the exact same full set of materials.13

To assess performance, we conduct a human evaluation where the same human language learner is given the input sentence and reference translation, and rates the quality of the predicted translation on a scale from 0 to 6, with 6
being an excellent translation. This rater is a non-native, non-fluent speaker who can identify their own translations, so the scores should be interpreted in context. We additionally report results using automatic metrics, i.e., BLEURT (Sellam et al., 2020) for Kalamang to English (kgv→eng) and chrF (Popović, 2015) for English to Kalamang (eng→kgv).

Gemini 1.5, GPT-4 Turbo, and Claude 3 all have essentially random performance in the 0-shot setting (i.e., with no additional Kalamang information in context). The models sometimes successfully copy proper nouns, identify loanwords from higher-resource languages like Malay, or narrow generation using style cues like question marks. Their generations for eng→kgv are identified by Google Translate as various other languages, often malformed. These results indicate that, as expected, no

11 Of course, the models do have some relevant knowledge to transfer to the task, such as competence at translation in general, understanding of linguistics reference grammars, and knowledge of loanwords or other languages with areal influence on Kalamang (though Kalamang is not known to be genealogically related to any other language) (Tanzer et al., 2023; Visser, 2020b).
12 Machine learning for indigenous languages can be culturally sensitive (Bird, 2020). In the case of MTOB, the field linguist who documented Kalamang is an author on the MTOB project; in addition to obtaining consent for the original data collection for linguistic research, the field linguist got renewed permission from their main community consultant to use the data specifically for machine learning research (Tanzer et al., 2023).
13 Note that this is different from the typical notion of human performance in machine translation, where native speakers rate the quality of the translations in absolute terms. MTOB is instead concerned with the ability to learn
to translate from limited reference materials, and how to bridge the gap to useful applications is a separate research question.

Context      GPT-4 Turbo   Claude 3 Haiku   Claude 3 Sonnet   Claude 3 Opus   Gemini 1.5 Flash   Gemini 1.5 Pro   Human language learner
0-shot       0.14 (30.0)   0.24 (33.4)      0.14 (30.0)       0.18 (32.7)     0.14 (31.5)        0.18 (30.0)      -
half book    2.04 (49.7)   2.80 (53.5)      3.40 (58.5)       3.74 (58.3)     3.00 (55.1)        4.14 (63.9)      -
full book    -             -                -                 -               3.14 (57.4)        4.00 (64.6)      5.52 (70.3)

Table 4 | Quantitative results for Kalamang→English translation on MTOB (Tanzer et al., 2023). We present human evaluation scores on a scale of 0 to 6, with 6 being an excellent translation. We include the automatic metric (BLEURT) in parentheses.

Context      GPT-4 Turbo   Claude 3 Haiku   Claude 3 Sonnet   Claude 3 Opus   Gemini 1.5 Flash   Gemini 1.5 Pro   Human language learner
0-shot       0.08 (15.0)   0.08 (15.3)      0.08 (17.3)       0.12 (18.7)     0.08 (15.4)        0.00 (12.0)      -
half book    3.90 (45.4)   4.46 (51.7)      4.64 (49.2)       5.18 (55.5)     4.94 (54.6)        5.38 (59.1)      -
full book    -             -                -                 -               4.66 (52.0)        5.46 (59.0)      5.60 (57.0)

Table 5 | Quantitative results for English→Kalamang translation on MTOB (Tanzer et al., 2023). We present human evaluation scores on a scale of 0 to 6, with 6 being an excellent translation. We include the automatic metric (chrF) in parentheses.

substantial Kalamang data, if any, was part of the models' training data.

Gemini 1.5 Pro in the half-book setting outperforms GPT-4 Turbo and Claude 3 on the same setup by a wide margin; see Tables 4 and 5. In the best setting, Gemini 1.5 Pro attains a 4.14 human evaluation score on kgv→eng translation, compared to 5.52
of the "human language learner" score, and 5.46 on eng→kgv translation, compared to 5.58 by the "human language learner". There is still a substantial qualitative gap for kgv→eng translation, but eng→kgv translation is similar to the human language learner on average.14 Gemini 1.5 Flash also performs admirably, falling short of Gemini 1.5 Pro as expected but handily outperforming GPT-4 Turbo and sitting somewhere between Claude 3 Haiku and Sonnet, or Sonnet and Opus, depending on the translation direction. See Appendix 12.17 for more details, experiments, and qualitative examples.

The performance of Gemini 1.5 Pro highlights the importance of long-context understanding and of providing sufficient context for learning new skills in-context. By leveraging the extensive

14 This is not to say that the task is solved; both the human and Gemini 1.5 Pro make avoidable errors, though typically of different kinds. The human errors tend to be retrieval failures, where they pick a suboptimal phrase because they could not find the ideal reference (because rereading the entire set of materials for each sentence is infeasible for a human). The model failures tend to be inconsistent application of rules, like that the word "se" is pronounced "he" after a vowel (this alternation is described in the phonology section of the grammar and reflected in the additional parallel sentence data, but the model may be confused by the fact that the underlying "se" form is used as the gloss throughout the examples within the grammar), or lack of reflection, like that the
word "kabor", although it is defined as "to be full" in the dictionary, is only used for stomachs/hunger in all examples of its use.

Gemini 1.5 Pro
text context \ audio context:   0-shot   5-shot   50-shot   800-shot
none                            35.0%    30.2%    27.5%     23.1%
wordlist                        29.7%    27.7%    24.8%     23.2%
sentences                       31.4%    27.2%    25.7%     22.9%
both                            32.5%    26.5%    25.8%     23.0%

Table 6 | Character Error Rate (CER, ↓) for Gemini 1.5 Pro learning Kalamang ASR in context.

field linguistics documentation provided in context, Gemini 1.5 Pro was able to achieve remarkable translation quality comparable to a human language learner, and was able to do so for a language to which it had essentially zero exposure during training. This finding opens up exciting possibilities for leveraging LLMs with sufficiently long-context capabilities to support the preservation and revitalization of endangered languages, as well as to facilitate communication and understanding across different linguistic communities. As research continues in this area, it will be crucial to explore techniques for improving the quality of translation in both directions, e.g., Kalamang-to-English, and to
address the challenges of evaluating the performance of LLMs on low-resource and under-represented languages, which we believe is also applicable to other areas of education and language learning.

5.2.2.2. In-context language learning: learning to transcribe speech in a new language in context

Gemini 1.5 has demonstrated exceptional performance at extremely long in-context learning for translation, both with Kalamang (on MTOB) and other low-resource languages (on standard benchmarks below). Kalamang, like many other endangered languages, is primarily oral; applications must therefore support speech in order to be socially useful. We take the next step towards these applications, and at the same time stress-test Gemini 1.5's mixed-modal in-context learning capabilities, by evaluating how well it can learn to transcribe Kalamang speech from text and audio documentation in context. This task framing is possible in
LLMs for the first time due to Gemini 1.5's native audio support.

We present a preview of results on a new benchmark, ASROB (Automatic Speech Recognition from One Book). ASROB extends MTOB with 104 speech recordings (15 total hours) of transcribed and translated Kalamang speech from The Kalamang Collection (Visser, 2020c).15 Here we report experiments on a subset of 6 recordings (45 minutes) with manually realigned phrase-level captions; we use 5 of the recordings (800 phrases) as the in-context train set and 1 (100 phrases) as the test set. The same speaker from the test recording is present in 3 of the train recordings.

We report Character Error Rate (CER) across various combinations of text context (the ~2000-entry bilingual wordlist and ~400 parallel sentences from MTOB) and audio context (up to 800 Kalamang speech/text pairs). We cannot compare directly to GPT-4 or Claude 3 because they do not provide access to audio input at the time of writing.16

15 When the linguist author of Visser (2020b) got the blessing of a community member to use the Kalamang data for machine learning for MTOB, she included speech data/tasks in the discussion.
16 We tried to compare to GPT-4 by cascading the speech input through Whisper, but we were unable to get Whisper to consistently produce transcriptions of the kind that Gemini 1.5 produces zero-shot. We tried using a) default settings, b) the language code for Indonesian, and c) prompts in English and Indonesian introducing "an audio sample in Kalamang, which is written with Indonesian orthography."
Gemini 1.5 Flash
text context \ audio context:   0-shot   5-shot   50-shot   800-shot
none                            73.3%    32.0%    32.3%     26.0%
wordlist                        45.4%    31.3%    31.3%     26.1%
sentences                       37.9%    33.2%    33.1%     25.3%
both                            37.5%    33.0%    33.4%     25.2%

Table 7 | Character Error Rate (CER, ↓) for Gemini 1.5 Flash learning Kalamang ASR in context.

We see in Table 6 that Gemini 1.5 Pro performs remarkably well without any in-context examples (where the prompt instructs the model to transcribe Kalamang speech with Indonesian orthography), achieving 35.0% CER. Judging from the outputs,
the model clearly hears the sounds of the language but does not know how to segment or spell words in it, especially affixes. As we add text and audio context for many-shot learning (Agarwal et al., 2024a), ASR quality improves relatively gracefully, reaching 22.9% CER in the best setting. These outputs are segmented and spelled much better, with some examples transcribed perfectly, but there is still significant room for improvement, in particular ensuring that all outputs are grammatically correct Kalamang. Gemini 1.5 Flash (Table 7) follows a similar trajectory to Pro, but with worse scores across the board.
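The CER reported in Tables 6 and 7 is character-level edit distance divided by reference length. A dependency-free sketch (the example strings echo the "se"/"he" alternation discussed for MTOB and are otherwise invented):

```python
def cer(hyp: str, ref: str) -> float:
    """Character Error Rate: Levenshtein(hyp, ref) / len(ref)."""
    prev = list(range(len(ref) + 1))        # edit distances for empty hypothesis
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (h != r)))   # substitution (0 if equal)
        prev = cur
    return prev[-1] / len(ref)

print(cer("kabor", "kabor"))        # -> 0.0
print(cer("se kabor", "he kabor"))  # -> 0.125 (one substitution over 8 chars)
```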
5.2.2.3. Scaling In-Context Learning for low-resource machine translation

[Figure 13 panels: test chrF vs. number of shots (2^0 to 2^12) for translation from English into Bemba, Kurdish, and Ewe (Flores) and into Acholi, Abkhaz, and Navajo (in-house); legend: Gemini 1.5 Pro, Gemini 1.5 Flash, GPT-4 Turbo.]

Figure 13 | In-context learning scaling results (chrF) on low-resource translation. Top: results on Flores test sets. We use Flores dev sets for prompting and scale the number of shots to 1K (including about 90K tokens). Bottom: results on in-house evaluation sets. We use Gatitos for prompting and scale the number of shots to 4K (including about 30K tokens). Gemini 1.5 yields
increasingly better translation performance as the number of shots grows, significantly surpassing GPT-4 Turbo.

The impressive in-context language learning capability of Gemini 1.5 inspires us to revisit traditional in-context learning (ICL) at scale. ICL allows LLMs to learn new tasks from input-output examples provided at inference time. While it has been widely observed across tasks and models, the number of in-context examples explored is often limited, ranging from a handful to a few dozen, because of context-length limitations and/or suboptimal long-context capabilities (Brown et al., 2020; Min et al., 2022; Zhang et al., 2023a). By contrast, Gemini 1.5's millions of tokens of context open new opportunities for scaling ICL to thousands of examples, also known as the many-shot ICL regime (Agarwal et al., 2024a; Bertsch et al., 2024). In this section, we explore to what extent Gemini 1.5 can leverage an increased number of in-context examples (or shots) to improve machine translation for low-resource languages, extending prior work exploring the limits of few-shot learning for machine translation (Garcia et al., 2023).

We evaluate translation from English into 6 diverse low-resource languages: Acholi (ISO 639-3: ach; Africa), Abkhaz (ISO 639-3: abk; Asia), Navajo (ISO 639-3: nav; Americas), Bemba (ISO 639-3: bem; Africa), Ewe (ISO 639-3: ewe; Africa), and Kurdish (ISO 639-3: kur; Asia). These languages each have between a few hundred thousand and a few million speakers, with relatively low exposure on the Internet. This low-resource nature requires the model to understand and learn from the given in-context examples rather than from the pretraining data, and to acquire new knowledge about the language when appropriate.

We create two evaluation setups, public and in-house, depending on the accessibility of the test data to the general public. Following Agarwal et al. (2024a), the public setup covers Bemba, Ewe, and Kurdish, using the dev set of Flores-200 (Team et al., 2022) (up to 997 examples and 90K tokens) as the in-context example set and the first 200 examples from its devtest set as the test set. The in-house setup covers Acholi, Abkhaz, and Navajo, using Gatitos (Jones et al., 2023) (up to 4K examples and 30K tokens) as the in-context example set and 200 held-out sentence pairs annotated by professional translators as the test set. We use the translation prompt presented in Appendix 12.16.15. We report averaged performance over three runs, each with freshly sampled in-context examples, and use chrF (Popović, 2015) as the evaluation metric.
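chrF, the metric used here, averages character n-gram F-scores (standardly n = 1..6 with β = 2, weighting recall twice as heavily as precision). The sketch below is a simplified re-implementation for illustration; real evaluations would use a standard implementation such as sacrebleu's, which also handles whitespace treatment and smoothing details omitted here:

```python
from collections import Counter

def chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: mean char n-gram F_beta over n = 1..max_n, in [0, 100]."""
    scores = []
    for n in range(1, max_n + 1):
        h = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        r = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not h or not r:
            continue                      # string shorter than n
        overlap = sum((h & r).values())   # clipped n-gram matches
        prec, rec = overlap / sum(h.values()), overlap / sum(r.values())
        f = (0.0 if prec + rec == 0
             else (1 + beta**2) * prec * rec / (beta**2 * prec + rec))
        scores.append(f)
    return 100 * sum(scores) / len(scores) if scores else 0.0

print(chrf("the cat sat", "the cat sat"))  # -> 100.0
```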
Figure 13 shows the ICL scaling performance. While previous ICL studies often see performance saturate after dozens of examples, Gemini 1.5 delivers almost consistent improvements as we scale the number of shots. The quality gain over zero-shot translation can be quite substantial, e.g., +11.1/+21.4 chrF on Bemba and +9.5/+15.9 chrF on Navajo for 1.5 Pro/Flash, although it varies greatly across languages. We see that Gemini 1.5 Flash degrades from few-shot ICL on Acholi, but further scaling ICL significantly improves its translation. Overall, Gemini 1.5 Flash presents more pronounced many-shot scaling than 1.5 Pro, which may be related to its smaller model size: it relies more on in-context examples to perform the translation than on its implicit knowledge. While GPT-4 Turbo also shows some positive trend as we scale the number of shots, Gemini 1.5 Pro often outperforms GPT-4 Turbo across languages and numbers of shots by a wide margin. One interesting exception is translation into Abkhaz, where Gemini 1.5 Pro lags behind GPT-4 Turbo with few shots but significantly surpasses it as ICL scales. This pattern also occurs with Gemini 1.5 Flash across languages: scaling up to 1K/4K examples improves Flash, enabling it to achieve performance superior to GPT-4 Turbo, e.g., +9.6/+6.4 chrF on Ewe/Acholi. In short,
213、rt,the Gemini 1.5 modelsexcel at translating low-resources languages and gracefully improve with in-context examples.5.2.2.4Long-document QAAfter testing Gemini 1.5 models in-context language learning capabilities up to 250k tokens,weproceed into another realistic evaluation setup.In this section we
present experiments on question answering: we create questions using the book "Les Misérables" (by Victor Hugo) and test the models' ability to answer them correctly when the entire 1,462-page book (i.e., 710K tokens) is provided as input. Evaluating a model's ability to answer questions about long documents (or collections of documents) presents a unique challenge. Unlike tasks that focus on specific facts or details that measure the retrieval capability of the models, such questions often require understanding relationships between pieces of information spanning large portions of text. For example, a question like "How is the concept of duality portrayed through the character who embodies both respect for authority and hatred of rebellion?" necessitates comprehending the overall narrative and character dynamics within the above book.

[Figure 14: bar chart of model strengths fitted via the Bradley-Terry model: 0k context Gemini 1.0 Pro 0.1041; RAG 4k context Gemini 1.0 Pro 0.2971; RAG 4k context GPT-4 Turbo 1.2994; 0k context GPT-4 Turbo 1.6424; 0k context Gemini 1.5 Pro 1.3746; RAG 4k context Gemini 1.5 Pro 1.7656; full 710k context Gemini 1.5 Pro 6.2417.]

Figure 14 | Answer quality based on side-by-side auto-rater (Gemini 1.5 Pro), rankings and scores (λ) computed via the Bradley-Terry model.

We compare Gemini 1.5 Pro against Gemini 1.0 Pro. Due to the limited context window of the latter, Gemini 1.0 Pro requires retrieval-augmented generation to access useful passages from the book. This method indexes passages using TF-IDF and stores the results in an external database. The question is then used as a query to re-rank passages by cosine similarity, and the most relevant passages are retrieved, up to a maximum of 4k tokens (roughly 41 passages). The retrieved passages are then put into context following a temporal ordering. In contrast, Gemini 1.5 Pro, due to its larger context window capable of accommodating much longer material, eliminates any need for additional data post-processing, indexing and retrieval pipelines.17
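The retrieval-augmented baseline described above can be sketched as follows. The whitespace tokenization, raw TF-IDF weighting, and word-count token budget are simplifying assumptions for illustration, not the exact pipeline used for Gemini 1.0 Pro.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Term-frequency / inverse-document-frequency vectors for a corpus."""
    tokenized = [t.lower().split() for t in texts]
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(texts)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(passages, query, token_budget=4000):
    """Rank passages by cosine similarity to the query, keep the most
    similar ones up to a token budget, then restore document order
    (the 'temporal ordering' mentioned above)."""
    vecs = tfidf_vectors(passages + [query])
    qvec = vecs[-1]
    ranked = sorted(range(len(passages)),
                    key=lambda i: cosine(vecs[i], qvec), reverse=True)
    kept, used = [], 0
    for i in ranked:
        cost = len(passages[i].split())  # crude proxy for token count
        if used + cost > token_budget:
            break
        kept.append(i)
        used += cost
    return [passages[i] for i in sorted(kept)]

passages = [
    "the bishop gives valjean the silver candlesticks",
    "javert pursues valjean through the streets of paris",
    "cosette grows up in the care of the thenardiers",
]
top = retrieve(passages, "who gave valjean the candlesticks", token_budget=7)
print(top)  # the candlestick passage ranks first and fits the budget
```

Restoring document order after selection matters: the retrieved passages are fed to the model in the order they appear in the book, not in relevance order.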
To evaluate the models' responses, we create a set of 100 questions. Generally, LLMs today can achieve high factual accuracy in the zero-shot setting for well-known works such as "Les Misérables". This makes it challenging to distinguish between models when using absolute performance measures. We therefore use side-by-side comparisons to assess the answer quality between models with varying context sizes. For a more detailed discussion of this methodology and its implications, see Bohnet et al. (2024). The side-by-side comparison allows us to rate whether models provide enough details to answer a question sufficiently. We use an auto-rater that takes a question and answers from two different systems and compares them against each other. The auto-rater response is either "system A is better", "system B is better", or "None" if both answers are non-factual, in which case they are both excluded.

17 See Appendix 12.3 for details of the automatic question generation pipeline.

Using these side-by-side comparison results, we provide an analysis of model strength using the Bradley-Terry model (Bradley and Terry, 1952). Such ranking models are used in many applications and are best known for their use in Chess or Go to rate player strength. The Bradley-Terry model assigns scores to a fixed set of models based on pairwise comparisons, where the log-odds of model i outperforming model j is given by the difference of their scores. The fitting of the parameters (λ_1, ..., λ_n) is performed via maximum likelihood estimation. The model strength has a direct mapping to the probability that an answer from model i is better than an answer from model j:

P(answer i is better than answer j) = λ_i / (λ_i + λ_j)    (1)
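Eq. (1) and the maximum-likelihood fit can be sketched as follows. The minorization-maximization (MM) update used here is one standard way to compute the maximum-likelihood Bradley-Terry scores; it is our assumption rather than the report's exact fitting procedure.

```python
def bt_win_prob(lam_i, lam_j):
    """Eq. (1): probability that model i's answer beats model j's."""
    return lam_i / (lam_i + lam_j)

def bt_fit(n_models, comparisons, iters=1000):
    """Maximum-likelihood Bradley-Terry scores via the classic MM update.
    `comparisons` is a list of (winner, loser) index pairs."""
    wins = [0] * n_models
    for w, _ in comparisons:
        wins[w] += 1
    lam = [1.0] * n_models
    for _ in range(iters):
        denom = [0.0] * n_models
        for w, l in comparisons:
            for m in (w, l):
                denom[m] += 1.0 / (lam[w] + lam[l])
        lam = [wins[m] / denom[m] if denom[m] else lam[m]
               for m in range(n_models)]
        scale = sum(lam) / n_models  # the scale is arbitrary; fix mean to 1
        lam = [x / scale for x in lam]
    return lam

# With the fitted strengths from Figure 14, full-context 1.5 Pro beats
# RAG 4k 1.5 Pro with probability 6.2417 / (6.2417 + 1.7656):
p = bt_win_prob(6.2417, 1.7656)
print(round(p, 4))  # 0.7795
```

Plugging in the other Figure 14 strengths reproduces the pairwise win rates quoted in the following paragraph.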
Figure 14 summarizes the results for this evaluation. When using the entire book "Les Misérables" as context, Gemini 1.5 Pro outperforms all other systems by a large margin. For example, full-context Gemini 1.5 Pro provides better answers than retrieval-augmented generation with 4k tokens using Gemini 1.5 Pro with probability 6.2417 / (6.2417 + 1.7656) = 0.7795, or in 78% of cases. Using the full book as context with Gemini 1.5 Pro provides a better answer compared to retrieval-augmented GPT-4 Turbo with 4k tokens in 83% of cases.

5.2.2.5 Long-context Audio

Next, we evaluate Gemini 1.5's long-context understanding capabilities on audio inputs. To evaluate long-context automatic speech recognition (ASR) performance, we test Gemini 1.5 models on an internal benchmark derived from 15-minute segments of YouTube videos. For this evaluation, we report results against the 1.0 Pro model, which is trained on audio segments much shorter in length. We also report the performance of the Universal Speech Model (USM) (Zhang et al., 2023b) and Whisper (OpenAI, 2023). Note that ASR tasks report a word error rate (WER) metric, where a lower number is better.

Table 8 below shows that the 1.0 Pro model, when evaluated on transcribing 15-minute videos without segmentation, has a WER of 100% due to a mismatch between training and testing audio lengths. When we segment the videos every 30 seconds and pass the textual content of the language model across each segment boundary, the 1.0 Pro model can achieve a WER of 7.8%. The USM model with a CTC decoder, while robust to long segments, achieves a WER of 8.8%. As indicated in the table, Whisper is not robust to long segments and hence requires audio to be segmented every 30 seconds to achieve a WER of 7.3%. In comparison, Gemini 1.5 Pro is much more robust on these longer-context tasks, thanks to its long-context capabilities and without the added complexity of extra input segmentation and pre-processing.
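The WER metric above is the word-level edit distance (substitutions, insertions, deletions) between hypothesis and reference transcripts, divided by the reference length. A minimal sketch, without the text normalization (casing, punctuation) that production ASR scoring typically applies:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # one substitution over six words
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is how an unsegmented model can score 100% on long audio.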
As a result, Gemini 1.5 Pro can transcribe 15-minute videos more accurately than other models, achieving a WER of 5.5%, while Gemini 1.5 Flash trails behind 1.0 Pro with a WER of 8.8%, a remarkable level of quality considering its smaller size and superior efficiency.

Model              Segmentation   WER
USM                none           8.8%
Whisper            none           12.5%
Whisper            30s            7.3%
Gemini 1.0 Pro     none           100%
Gemini 1.0 Pro     30s            7.8%
Gemini 1.5 Pro     none           5.5%
Gemini 1.5 Flash   none           8.8%

Table 8 | Word error rate (WER) for various models on 15-minute videos.

[Figure 15: Gemini 1.5 Pro accuracy on 1H-VideoQA (40-105 minutes) and EgoSchema (3 minutes) for a random baseline, the first frame only, linear subsampling to 16 or 150 frames, and all frames (1 fps).]

Figure 15 | Comparison between 1H-VideoQA and EgoSchema, reporting Gemini 1.5 Pro's accuracy when linearly subsampling to 1, 16, or 150 frames. We also show performance if we provide all the frames for each video for 1H-VideoQA, in yellow. Gemini 1.5 Pro achieves SotA accuracy on both benchmarks. Gemini 1.5 Pro's performance on 1H-VideoQA keeps increasing as we scale up to providing all frames in the video, while its performance on EgoSchema saturates after 16 frames, showing that 1H-VideoQA is more effective at differentiating among models that operate over different context lengths. *When provided with the first frame only, Gemini 1.5 Pro tends to avoid using one of the choices and instead indicates it lacks information to reply accurately.

5.2.2.6 Long-context Video QA

We next proceed to long-video question answering to test Gemini 1.5's efficacy on long-context video understanding. Question-answering benchmarks for long-context video understanding need to have at least two properties: first, they need to contain long videos, and second, their questions need to be designed in a way
that can differentiate among models that operate over different context lengths. Unfortunately, no existing benchmarks satisfy these properties for evaluating models that can handle hours-long videos like the Gemini 1.5 models. The publicly available question-answering benchmark with the longest videos is EgoSchema (Mangalam et al., 2023), but its videos are at most 3 minutes (i.e., 180 frames) in length. To bridge this evaluation gap, we introduce a new benchmark, 1H-VideoQA, composed of 125 five-way multiple-choice questions over public videos 40-105 minutes long. We collected annotations that require understanding one or multiple events, each spanning only a few seconds of the full video, so that the answer is extremely challenging to infer by looking at a few randomly sampled video frames.

We run experiments by extracting video frames at one frame per second, and further linearly subsampling long videos to a fixed context length. We also measure performance if we provide all frames for each video for 1H-VideoQA as a reference.

Model              16 frames   150 frames   Full video (1 fps)
GPT-4V             36.5%       52.3%        Not supported
Gemini 1.5 Pro     45.2%       56.3%        72.2%
Gemini 1.5 Flash   39.7%       50.8%        65.9%

Table 9 | Comparison between GPT-4V and Gemini 1.5 models on 1H-VideoQA. Experiments are run zero-shot, by sampling one video frame per second and linearly subsampling 16 or 150 frames. We also show performance if we provide all the frames for each video to the Gemini 1.5 models.
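The linear subsampling step above can be sketched as picking evenly spaced indices over the 1-fps frame sequence; treating "linear subsampling" as endpoint-inclusive even spacing is our assumption about the exact scheme.

```python
def linear_subsample(frames, k):
    """Pick k frames at evenly spaced positions (endpoints included);
    if the clip already fits the budget, keep every frame."""
    n = len(frames)
    if n <= k:
        return list(frames)
    if k == 1:
        return [frames[0]]
    idx = [round(i * (n - 1) / (k - 1)) for i in range(k)]
    # round() can repeat an index for short clips; dedupe, keeping order.
    seen, out = set(), []
    for i in idx:
        if i not in seen:
            seen.add(i)
            out.append(frames[i])
    return out

# A 60-minute video at 1 fps is 3600 frames; subsample to a 16-frame budget:
frames = list(range(3600))
sample = linear_subsample(frames, 16)
print(len(sample), sample[0], sample[-1])  # 16 0 3599
```

For a 40-105 minute video, a 16-frame budget keeps roughly one frame every 2.5-7 minutes, which is why events spanning only a few seconds are so hard to catch at that budget.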
Results are shown in Figure 15 and Table 9. Figure 15 illustrates the improvement of 1H-VideoQA over EgoSchema in terms of its ability to differentiate among models that operate over different numbers of frames. Gemini 1.5 Pro sets a new state-of-the-art of 70.2% accuracy on EgoSchema using only 16 frames (vs 55.6% for GPT-4V (Balažević et al., 2024)). However, we do not see clear gains from going to 150 frames, suggesting that many questions in EgoSchema can be easily solved with a limited number of frames. In contrast, on 1H-VideoQA the performance of Gemini 1.5 Pro consistently increases as the number of frames provided increases from the first frame to the full video, suggesting that a substantial proportion of questions in 1H-VideoQA can only be solved with more frames as context, thereby making 1H-VideoQA more effective as a long-context benchmark. Table 9 further shows that Gemini 1.5 Pro consistently outperforms GPT-4V on 1H-VideoQA, whether the video has been subsampled to 16 or to 150 frames. The fact that Gemini 1.5 Pro does not solve 1H-VideoQA perfectly (see examples in Appendix, Table 47), despite observing a frame every second, makes 1H-VideoQA a useful benchmark for evaluating and driving the development of future long-context video models. We additionally highlight the quality of Gemini 1.5 Flash, which is only slightly behind GPT-4V with 150 frames, and even better than GPT-4V with 16 frames, despite being significantly smaller and more efficient.

5.2.2.7 In-Context Planning

Reasoning and planning are fundamental human skills for problem-solving and decision-making. While LLMs have demonstrated recent progress on reasoning tasks, planning remains a more challenging area which has so far received limited attention. In this section, we report the performance of models on both classical planning benchmarks expressed in the standard Planning Domain Definition Language (PDDL) and more modern ones expressed in natural language. See Appendix 12.8 for example prompts for all planning tasks considered in this section.

We evaluate the planning capability of the model as we add more examples ("shots") into the context, inspired by the success of many-shot learning across a large number of tasks (Agarwal et al., 2024a). The challenge in "in-context planning" involves understanding a specific task and problem through a limited number of examples. Additionally, it requires the models to produce a solution without checking each planning step to confirm whether a proposed move is correct; the model has to create a plan in one go. To humans, this might be seen as thinking fast (instead of slow).

Figure 16 shows the in-context learning performance on classical planning and natural language benchmarks as we vary the number of shots.
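A many-shot planning prompt of this kind can be assembled by concatenating solved examples ahead of the query problem. The "Problem:"/"Plan:" delimiters and the toy BlocksWorld-style plans below are our illustration, not the actual prompt format from Appendix 12.8.

```python
def build_many_shot_prompt(examples, query, instruction):
    """Concatenate k solved (problem, plan) pairs ahead of the new
    problem; the model must then emit the full plan in one pass."""
    parts = [instruction, ""]
    for problem, plan in examples:
        parts += [f"Problem: {problem}", f"Plan: {plan}", ""]
    parts += [f"Problem: {query}", "Plan:"]
    return "\n".join(parts)

# Illustrative PDDL-style shots (hypothetical, for demonstration only):
shots = [
    ("stack A on B", "(unstack A C) (stack A B)"),
    ("stack B on C", "(pick-up B) (stack B C)"),
]
prompt = build_many_shot_prompt(shots, "stack C on A",
                                "Output a plan for each problem.")
print(prompt.endswith("Plan:"))  # True: the model continues from here
```

Scaling the number of shots in this setup only changes the length of `examples`, which is what the secondary "sentence pieces" axis in Figure 16 tracks.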
Overall, Gemini 1.5 Pro outperforms other models, indicating that the model not only can plan better with a fewer number of examples/shots, it can also make effective use of additional and longer context. Gemini 1.5 Flash, which is a smaller, faster and more efficient model, is outperformed by Gemini 1.5 Pro but occasionally matches GPT-4 Turbo performance. Below are the details of each task and their corresponding results.

[Figure 16: planning accuracy (%) vs. number of few-shot exemplars (log scale; secondary axis: prompt length in thousands of sentence pieces) for Gemini 1.5 Flash, Gemini 1.5 Pro, and GPT-4 Turbo 20240409, with error bars showing a 70% CI. Panels: (a) BlocksWorld, (b) Logistics, (c) Mini-Grid, (d) Trip Planning, (e) Calendar Scheduling.]

Figure 16 | PDDL planning and natural language planning with few shots. In all benchmarks, Gemini 1.5 Pro outperforms other models, indicating that the model not only can plan better with a fewer number of examples, it can also make effective use of a longer context.

BlocksWorld. BlocksWorld is a well-known planning problem from the International Planning Conference (IPC).18 This domain consists of a set of blocks, a table and a robot hand. The goal is to find a plan to move from one configuration of blocks to another. We generated BlocksWorld problem instances of 3 to 7 blocks. Figure 16a shows the performance of Gemini 1.5 models on this benchmark as we increase the number of few-shot examples. The 1-shot planning capability of Gemini 1.5 Pro and Gemini 1.5 Flash reaches 35% and 26% respectively, while GPT-4 Turbo performance is close to zero. Moreover, the 40-shot planning capability of Gemini 1.5 Pro reaches the 48% range, which performs better than
the best (200-shot) performance of GPT-4 Turbo, which peaks at 43%.

Logistics. Logistics is an AI planning problem from IPC-1998,19 expressed in PDDL, that involves arranging the delivery of packages to their destinations using trucks within cities and airplanes between cities. The aim is to optimize transportation modes under constraints like vehicle capacities and locations, showcasing a model's ability to manage multi-step logistics efficiently. The planning capability of Gemini 1.5 models on the Logistics benchmark is shown in Figure 16b. As one can see, the 1-shot planning capability of Gemini 1.5 Pro reaches 43% while GPT-4 Turbo only reaches 18%. Moreover, for Gemini 1.5 Pro more context leads to consistently better results, indicating that the model can make effective use of additional and longer context. This is not the case for GPT-4 Turbo, where the accuracy drops when more examples are provided.

Mini-Grid. The Mini-Grid problem is from the Artificial Intelligence Planning Systems (AIPS) 1998 competition,20 and is also expressed in PDDL. We create various floorplans with rooms containing random configurations of key shapes. The goal is for a robot to navigate from an initial position to a designated goal cell. Figure 16c shows the performance of Gemini 1.5 models as we increase the number of few-shot examples. The 1-shot planning capability of Gemini 1.5 Pro reaches 28% while GPT-4 Turbo achieves only 15%. More context leads to better performance for Gemini 1.5 Pro: with 400 shots, Gemini 1.5 Pro reaches 77% accuracy. GPT-4 Turbo performance also increases with the number of shots, but it remains far behind Gemini 1.5 Pro; with 80 shots, GPT-4 Turbo reaches 38% accuracy, which is 32% lower than the accuracy of Gemini 1.5 Pro. Gemini 1.5 Flash
is outperformed by Gemini 1.5 Pro but almost matches GPT-4 Turbo performance.

Trip Planning. Trip Planning is a task focused on planning a trip itinerary under given constraints, where the goal is to find the order in which N cities should be visited. We add enough constraints that there is only one solution to each task, which makes evaluating the predictions straightforward. Figure 16d shows the performance of Gemini 1.5 Pro on this benchmark as we increase the number of few-shot examples. The 1-shot performance of the GPT-4 Turbo model seems to be better than that of Gemini 1.5 Pro. However, as we increase the number of shots, the performance of Gemini 1.5 Pro improves dramatically: with 100 shots, Gemini 1.5 Pro reaches 42%, while the best (20-shot) performance of GPT-4 Turbo is 31%.

Calendar Scheduling. Calendar Scheduling is a task to schedule a meeting of either 30 minutes or an hour among up to 7 attendees. The attendees may have a busy schedule, or a light schedule with less than half of the working hours spent in meetings. The planning capability of Gemini 1.5 Pro on this benchmark is shown in Figure 16e. The 1-shot planning capability of Gemini 1.5 Pro reaches 33% while GPT-4 Turbo's accuracy is under 10%. More context also seems to lead to better performance for both the Gemini 1.5 and GPT-4 Turbo models: with 40 shots, GPT-4 Turbo achieves 36% accuracy while Gemini 1.5 Pro reaches 48%, and with 100 shots Gemini 1.5 Pro is able to reach 52%, indicating that the model can make effective use of the longer context.

In summary, planning empowers intelligent agents to look ahead and proactively determine a course of action to reach objectives (Russell and Norvig, 2016). Recently, prompting LLMs to extract common-sense knowledge has gained attention (Ding et al., 2023; Huang et al., 2022; Singh et al., 2023), and the effectiveness of LLMs in generating plans has been studied in Guan et al. (2024), Hao et al. (2023), and Valmeekam et al. (2024). Even though existing work finds current models to be incapable of few-shot planning, our results confirm that Gemini 1.5 Pro can perform well even in the 1-shot setting while making effective use of additional and longer context to further improve performance. Since planning is at the core of robotics, embodied environments, and the agentic space, leveraging this model capability in such applications has great potential.

5.2.2.8 Unstructured Multimodal Data Analytics Task

While performing data analytics on structured data is a very mature field with many successful methods, the majority of real-world data exists in unstructured formats like images and conversations. We investigate the potential of Large Language Models (LLMs) to enable unstructured data analytics and explore how LLMs can directly analyze this vast pool of multimodal information.

As an instance of unstructured data analytics, we perform an image structuralization task. We present LLMs with a set of 1024 images with the goal of extracting the
information that the images contain into a structured data sheet (see Appendix 12.7 for example prompts used in this study). As this is a long-context task, in cases where the context length of a model does not permit processing all of the images at once, we use mini-batches with different batch sizes to alleviate this shortcoming. In the end, the results of each mini-batch are concatenated to form the final structured table.

[Figure 17, left: an illustration of the task. Input: a batch of images ("Please process the following images ... image_0: image_1: image_2: ... Please output the table only in the format specified."). Output: structured extracted attributes, a table with columns ID, Category, Color, Semantic Attribute and rows such as (0, category_0, color_c, semantic_class_3). Right: accuracy (%) of all-attribute extraction vs. batch size (8 to 512) for Gemini 1.5 Pro, GPT-4 Turbo 20240409, and Claude 3 Opus.]

Figure 17 | Performance of models on unstructured data analytics tasks.

Figure 17 presents the results in terms of accuracy for different types of information extracted from images. We find that:

- The accuracy of Gemini 1.5 Pro on all-attribute extraction is improved by 9% (absolute), or 27% (relative), compared to GPT-4 Turbo's accuracy. At the time of evaluation, the Claude 3 API was not able to analyze more than 20 images, which is why the results for Claude 3 Opus are capped.
- For Gemini 1.5 Pro, more images lead to consistently better results, indicating that the model can make effective use of the additional and longer context. This is not the case for GPT-4 Turbo, where the accuracy drops as more images are provided.

6. Core Capability Evaluations

The final component of our evaluation harness for Gemini 1.5 Pro and Gemini 1.5 Flash measures the quality of the models' core capabilities (i.e., performance on non-long-context tasks). The evaluations in this section consist of benchmarks covering all three modalities: text, vision and audio. We rely on a combination of established benchmarks that are public and used by the community, along with some internal benchmarks that are held out and unleaked. Our selection criteria primarily aim to measure the improvement of the Gemini 1.5 series compared to its predecessor, the Gemini 1.0 series of models: Gemini 1.0 Pro and Gemini 1.0 Ultra. Our goal is to highlight the extent of the trade-off, if it exists, between
the 1.5 generation of Gemini models, which excel in long-context capabilities, and their performance on non-long-context tasks. In particular, as we develop the 1.5 series, we aim to enhance the models' proficiency in this new dimension of multimodal long context without compromising their quality across all other capabilities.

All in all, we find a clear generational improvement between the 1.0 and 1.5 series, with Gemini 1.5 Pro uniformly outperforming 1.0 Pro and approaching (often even surpassing) 1.0 Ultra, a state-of-the-art model on most benchmarks, despite being significantly more efficient to train. An outlier to this picture is the situation on the audio capability. The post-training data of the model contains 5 head languages, resulting in slight regressions on multilingual datasets that are not head-heavy (e.g., YouTube, FLEURS and CoVoST 2).

Core Capability              1.5 Pro relative to                    1.5 Flash relative to
                             1.5 Pro (Feb)  1.0 Pro   1.0 Ultra     1.0 Pro   1.0 Ultra
Text
  Math, Science & Reasoning  +5.9%          +49.6%    +18.1%        +30.8%    +4.1%
  Multilinguality            -0.7%          +21.4%    +5.9%         +16.7%    +2.1%
  Coding                     +11.6%         +21.5%    +11.7%        +10.3%    +1.5%
  Instruction following      n/a            +9.9%     -0.2%         +8.7%     -1.2%
  Function calling           n/a            +72.8%    n/a           +54.6%    n/a
Vision
  Multimodal reasoning       +15.5%         +31.5%    +14.8%        +15.6%    +1.0%
  Charts & Documents         +8.8%          +63.9%    +39.6%        +35.9%    +17.9%
  Natural images             +8.3%          +21.7%    +8.1%         +18.9%    +5.6%
  Video understanding        -0.3%          +18.7%    +2.1%         +7.5%     -8.1%
Audio
  Speech recognition*        +1.0%          +2.2%     -3.8%         -17.9%    -25.5%
  Speech translation         -1.7%          -1.5%     -3.9%         -9.8%     -11.9%

Table 10 | Detailed breakdown of the results presented in Table 1. *In speech recognition, it is generally accepted that any difference in Word Error Rate (WER) that falls within a 3% relative range is not statistically significant and can be considered mere noise, and we grouped such instances as wins for the latest systems.

6.1. Core Text Evals

We start by evaluating the Gemini models' performance on seven major core text capabilities: (1) Math and Science (Section 6.1.1); (2) General Reasoning (Section 6.1.2); (3) Coding (Section 6.1.3); (4) Multilinguality (Section 6.1.4); (5) Function Calling (Section 6.1.5); (6) Instruction Following (Section 6.1.6); and (7) Real-world and expert long-tail GenAI tasks (Section 6.1.7). See Table 11 for a summary of these results; see the Appendix for details on each, and for an additional evaluation on QA for Web Search Topics.

With web-scale pretraining of language models, decontamination of public benchmarks is a persistent challenge (Brown et al., 2020; Gemini-Team et al., 2023; OpenAI, 2023a). Gemini 1.5 employed standard n-gram-based decontamination procedures to help mitigate this issue; however, these n-gram-based procedures are imperfect.
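A minimal n-gram overlap check of the kind described can be sketched as follows; the 8-gram order and 50% threshold are illustrative choices, not the parameters used for Gemini.

```python
def ngrams(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc, test_example, n=8, threshold=0.5):
    """Flag a training document whose n-gram overlap with a benchmark
    example exceeds a threshold fraction of the example's n-grams."""
    test_grams = ngrams(test_example, n)
    if not test_grams:
        return False
    overlap = len(test_grams & ngrams(train_doc, n))
    return overlap / len(test_grams) >= threshold

test_example = "what is the capital of france answer paris is the capital"
train_doc = "lots of web text " + test_example + " and more web text"
print(is_contaminated(train_doc, test_example))  # True: exact copy detected
# A single changed token breaks every overlapping 8-gram, so a light
# paraphrase of a leaked example slips through the filter:
print(is_contaminated(train_doc.replace("answer", "response"), test_example))  # False
```

The second check illustrates why such procedures are imperfect: near-duplicates with minor edits or reformatting (e.g., JSON or HTML wrapping) can evade exact n-gram matching.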
295、ontamination,we alsoreport performance on a internally developed non-public evals,such as PhysicsFinals,HiddenMath,and Natural2Code.6.1.1.Math and ScienceWe find that 1.5 Pro consistently outperforms both 1.0 Ultra and 1.0 Pro on grade-school math(i.e.,GSM8K)and even shows material improvement over
296、the more demanding benchmarks where thereis more headroom for improvement,i.e.,+14.5%over 1.0 Ultra for middle-and high-school mathproblems(i.e.,Hendrycks MATH),+13.2%for the American Mathematical Competitions(i.e.,AMC)29Gemini 1.5:Unlocking multimodal understanding across millions of tokens of cont
297、extCapabilityBenchmarkGemini1.0 Pro1.0 Ultra1.5 Flash1.5 ProMath and ScienceGPQA:Graduate-Level Google-Proof Q&A.(Rein et al.,2023)27.9%4-shot35.7%4-shot39.5%0-shot46.2%0-shotMATH:Math problems rangingacross 5 levels of difficultyand 7 sub-disciplines.(Hendrycks et al.,2021b)32.6%4-shotMinervaprompt
298、53.2%4-shotMinervaprompt54.9%4-shotMinervaprompt67.7%4-shotMinervaprompt77.9%*maj64PhysicsFinals:61 undergraduatephysics problems that havenot appeared on the internet.31.1%0-shot41.0%0-shot57.4%0-shot63.9%0-shotHiddenMath 179 new math problemscreated from scratch.6.1%0-shot11.2%0-shot6.7%0-shot20.1
299、%0-shotFunctional MATH Functional variant of 1745MATH problems(December snapshot)39.9%0-shot55.8%0-shot53.6%0-shot64.6%0-shotAMC 2022-23:250 latest problemsincluding 100 AMC 12,100 AMC 10,and 50 AMC 8 problems.22.8%4-shot30%4-shot34.8%4-shot46.4%4-shotGSM8K:Grade-school math problems.(Cobbe et al.,2
300、021)77.9%11-shot88.9%11-shot86.2%11-shot90.8%11-shotGeneral ReasoningBigBench-Hard:A subset of hardertasks from Big Bench.(Srivastava et al.,2022;Suzgun et al.,2022)75.0%3-shot83.6%3-shot85.5%3-shot89.2%3-shotDROP:Reading comprehension&arithmetic.(Metric:F1-Score).(Dua et al.,2019)74.1Variableshots8
301、2.4Variableshots78.4Variableshots74.9VariableshotsMMLU:Multiple-choice questions in57 subjects(professional&academic).(Hendrycks et al.,2021a)71.8%5-shot83.7%5-shot90.0%*maj3278.9%5-shot85.9%5-shot91.7%*maj32Hellaswag(Zellers et al.,2019)84.7%10-shot87.8%10-shot86.5%10-shot93.3%10-shotCodingHumanEva
302、lchat preamble*(Metric:pass rate).(Chen et al.,2021)67.7%0-shot(PT)74.4%0-shot(PT)74.3%0-shot84.1%0-shotNatural2Codechat preamble*(Metric:pass rate).69.6%0-shot74.9%0-shot77.2%0-shot82.6%0-shotMultilingualityWMT23:sentence-level machinetranslation(Metric:BLEURT).(Tom et al.,2023)71.71-shot(PT)74.41-
303、shot(PT)74.11-shot75.31-shotMGSM:multilingual mathreasoning.(Shi et al.,2023a)63.5%8-shot(PT)79.0%8-shot(PT)82.6%8-shot87.5%8-shotTable 11|Evaluation results of Gemini 1.5 Pro,1.5 Flash and Gemini 1.0 models on standard coding,multilingual as well as math,science and reasoning benchmarks.Unless expl
304、icitly specified,alltasks are evaluated in terms of answer accuracy.Note that in this table,PT for the 1.0 Ultra andPro models denote tasks evaluated with model variants that have undergone a post-training(i.e.instruction-tuning)phase after pre-training.All numbers for the 1.5 Pro and 1.5 Flash are
305、obtainedafter instruction-tuning except for the ones marked with*which come from pretrained models,asdescribed in Section 4.30Gemini 1.5:Unlocking multimodal understanding across millions of tokens of contextand+5.8%on graduate-level science problems(i.e.,GPQA(Rein et al.,2023).21Gemini 1.5 Flash de
306、monstrates significant improvements over the 1.0 Pro version,achieving an11.6%increase on GPQA,a 22.3%increase on middle-and high-school math problems(HendrycksMATH),a 26.3%increase on undergraduate physics problems(PhysicsFinals),a 0.6%increase onHiddenMath problems,an 8.4%increase on AMC problems,
and an 8.3% increase on grade-school math problems (GSM8K).

Functional MATH (Srivastava et al., 2024), a new benchmark derived from the MATH dataset, comprises 1,745 problems in both original and modified forms. The benchmark aims to evaluate a model's ability to solve modified problems compared to its performance on the original versions. The Reasoning Gap, defined as the relative decrease in performance between original and modified problems, serves as a novel metric for assessing generalization capability. It quantifies a model's adaptability to modifications of problems previously published online, specifically those introduced after the training data cut-off date. As modifications often lead to problems of increased computational complexity, a performance gap is expected. The Functional MATH dataset provides a means to control and analyze the magnitude of this gap across different models. We specifically chose to utilize the December snapshot of Functional MATH. Our evaluation prioritizes both overall performance on the modified problem set and minimization of the Reasoning Gap. We test the models in a zero-shot setting wherein we present them with problems without any accompanying context or instructions. The automated evaluation process consists of two steps: first, we use the Gemini 1.0 Pro model to extract the proposed final answer from the model's output; then, this answer is compared to the ground truth using the same programmatic heuristic as employed in our evaluations of the MATH dataset.
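The Reasoning Gap defined above reduces to a one-line computation, the relative drop from original to modified accuracy; the figures below use the Functional MATH accuracies reported in this section.

```python
def reasoning_gap(original_acc, modified_acc):
    """Relative decrease in accuracy from original to modified problems."""
    return (original_acc - modified_acc) / original_acc

# Accuracies (%) on the December snapshot of Functional MATH:
results = {
    "Gemini 1.0 Pro":   (55.8, 39.9),
    "Gemini 1.5 Flash": (69.7, 53.6),
    "Gemini 1.0 Ultra": (74.5, 55.8),
    "Gemini 1.5 Pro":   (81.1, 64.6),
}
for model, (orig, mod) in results.items():
    print(f"{model}: reasoning gap = {100 * reasoning_gap(orig, mod):.1f}%")
```

Note that a smaller gap at higher absolute accuracy (as for 1.5 Pro) indicates better generalization, not merely better memorization of the original problems.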
Gemini 1.0 Pro solved 55.8% of the original problems and 39.9% of the modified problems (reasoning gap = 28.5%). Gemini 1.5 Flash solved 69.7% of the original problems and 53.6% of the modified problems (reasoning gap = 23.1%). Gemini 1.0 Ultra solved 74.5% of the original problems and 55.8% of the modified problems (reasoning gap = 25.1%), whereas Gemini 1.5 Pro solved 81.1% and 64.6% (reasoning gap = 20.3%).

We also evaluated Gemini 1.5 Pro on two new, unreleased internal benchmarks: PhysicsFinals and HiddenMath. PhysicsFinals comprises 61 undergraduate physics problems, curated by a group of physics professors for offline final exams, covering topics such as wave mechanics, quantum mechanics, special relativity, and introductory general relativity. Answers were graded by a physics professor. Gemini 1.5 Pro achieved a score of 39, significantly surpassing Gemini 1.0 Ultra (25) and Gemini 1.0 Pro (19). HiddenMath comprises 179 competition-level math problems, crafted by experts and evaluated automatically. Gemini 1.0 Pro solved 11 problems, Gemini 1.5 Flash solved 12, Gemini 1.0 Ultra solved 20, and Gemini 1.5 Pro solved 36.

Section 7 presents additional results obtained with a Math-Specialized Gemini 1.5 Pro.

6.1.2. General Reasoning

BigBench-Hard, DROP,
316、MMLU,and Hellaswag are benchmarks designed to assess a models ability tohandle complex relationships within text,perform multi-step reasoning,and apply common senseknowledge to novel situations.BigBench-Hard,a curated subset of challenging tasks from the broaderBigBench suite,requires models to enga
317、ge in intricate reasoning processes.Gemini 1.5 Pro achievesa state-of-the-art score of 89.2%on this benchmark.MMLU,encompassing a diverse range of 57subjects across professional and academic domains,sees Gemini 1.0 Ultra,1.5 Pro,and 1.5 Flash allexceeding 80%.Hellaswag,a benchmark designed to test c
318、ommon sense reasoning and the ability to21In Appendix 12.6,we analyze the impact of long-context prompting on the performance of Gemini 1.5 Pro on a selectionof 528 challenging problems from Hendrycks MATH(Level 4 and 5 Intermediate Algebra problems).31Gemini 1.5:Unlocking multimodal understanding a
319、cross millions of tokens of contextdistinguish plausible scenarios,sees Gemini 1.5 Pro achieving 93.3%.We observe a similar strongperformance for Gemini 1.5 Flash,with the model outperforming Gemini 1.0 Pro.6.1.3.CodeGemini 1.5 Pro is our best performing model in code to date,surpassing Gemini 1.0 U
320、ltra onHumanEval and Natural2Code,our internal held-out code generation test set made to prevent web-leakage.We see the same gains being transferred to Gemini 1.5 Flash with the model outperformingGemini Ultra 1.0.HumanEval leakageHumanEval is an industry standard open-source evaluation benchmark(Ch
321、enet al.,2021),but we found controlling for accidental leakage on webpages and open-source coderepositories to be a non-trivial task,even with conservative filtering heuristics.An analysis of the testdata leakage of Gemini 1.0 Ultra showed that continued pre-training on a dataset containing even asi
322、ngle epoch of the test split for HumanEval boosted scores from 74.4%to 89.0%,highlighting thedanger of data contamination.We found that this sharp increase persisted even when examples wereembedded in extraneous formats(e.g.JSON,HTML).We invite researchers assessing coding abilitiesof these models h
323、ead-to-head to always maintain a small set of truly held-out test functions that arewritten in-house,thereby minimizing the risk of leakage.The Natural2Code benchmark,which weannounced and used in the evaluation of Gemini 1.0 series of models,was created to fill this gap.Itfollows the exact same for
324、mat of HumanEval but with a different set of prompts and tests.6.1.4.MultilingualityFor our multilingual evaluations we use a multilingual math reasoning(MGSM;Shi et al.,2023a)benchmark and a machine translation benchmark(WMT23;Kocmi et al.,2023)which was constructedafter the models training data cu
325、t-off hence minimizing test set leakage risks.Both of these coverdiverse languages from different language families and resource groups,with MGSM covering 11languages and WMT23 eight languages for a total of 14 language pairs.We find that Gemini 1.5 Pro improves over Gemini 1.0 Ultra on both tasks,p
326、articularly showing asubstantial improvement of almost+9%on the MGSM dataset,in line with the English-only mathimprovements reported above.Interestingly,we find that these improvements are not limited to aparticular resource group;rather,1.5 Pro improves performance equally among differently-resourc
327、edlanguages.Particularly,on medium and low resource languages the gap between 1.0 Ultra and 1.5Pro increases to9%and7%,respectively.22In addition,Gemini 1.5 Flash achieves comparableperformance to Gemini 1.0 Ultra on WMT23,and surpasses it by3 on MGSM,despite its muchsmaller size.6.1.5.Function Call
328、ingThere is an increasing interest in LLMs as the core building block of AI systems(often called agents)thatoperate in environments to achieve complex goals.Environments may include anything from generalweb search,private documents or calendars of users,internal enterprise APIs to general programmin
329、ginterpreters and robot sensors.Due to their general-purpose nature,LLMs are expectedand have thepromiseto operate in many such environments without having seen them at training time.Here,we focus on evaluating this capability via Function Calling(FC),or zero-shot tool use:given thedescriptions and
type signatures of a set of functions or APIs, and a user prompt, the model has to infer what function calls have to be made to service the prompt, if any. Specifically, we use the Berkeley Function Calling Leaderboard (BFCL; Yan et al., 2024) and focus on a subset of BFCL splits.23

22 See Appendix 12.4 for a complete performance breakdown.

Task                 Gemini 1.0 Pro   Gemini 1.5 Flash   Gemini 1.5 Pro
Simple Functions     92.0%            88.0%              92.8%
Multiple Functions   90.0%            92.0%              90.5%
Parallel Functions   38.5%            73.5%              88.5%
Parallel Multiple    27.0%            73.5%              83.5%
Relevance            67.5%            75.4%              83.3%
Weighted Average     67.8%            81.8%              88.4%

Table 12 | Function calling performance on Berkeley Function Calling Leaderboard splits (excluding Java and JavaScript, and with various fixes).

In Table 12 we show performance in terms of accuracy of various Gemini FC endpoints. We see a substantial improvement from Gemini 1.0 Pro to 1.5 Pro in terms of overall weighted accuracy. This can be attributed in large part to the new support for parallel function calling (where one prompt triggers several independent functions), but also to the improved ability of Gemini to determine when not to call functions (cf. the “Relevance” column). We also find the 1.5 Flash FC endpoint to be extremely close to 1.5 Pro. Finally, we believe that while
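The function-calling setup described above can be sketched in code. The following is a minimal, hypothetical scoring routine for BFCL-style splits; the names Call, make_call, and score, and the exact-match rule, are illustrative assumptions, not the actual leaderboard harness.

```python
# Hypothetical sketch of scoring zero-shot function calling (FC): the model
# proposes zero or more calls for a prompt, which are compared to a reference.
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Call:
    """A proposed function call: a name plus sorted (argument, value) pairs."""
    name: str
    args: tuple[tuple[str, Any], ...]


def make_call(name: str, **kwargs: Any) -> Call:
    # Sort keyword arguments so argument order does not affect equality.
    return Call(name, tuple(sorted(kwargs.items())))


def score(predicted: list[Call], expected: list[Call]) -> bool:
    """Order-independent exact match on function names and arguments.

    - "Simple Functions": one expected call.
    - "Parallel Functions": one prompt triggers several independent calls.
    - "Relevance": expected is empty, so the model must abstain from calling.
    """
    def key(c: Call) -> tuple:
        return (c.name, c.args)
    return sorted(predicted, key=key) == sorted(expected, key=key)
```

For example, `score([], [])` rewards abstention (the “Relevance” behavior), while a parallel-call example passes regardless of the order in which the model emits the calls.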