《Google:Gemini 技術報告(英文版)(90頁).pdf》由會員分享,可在線閱讀,更多相關《Google:Gemini 技術報告(英文版)(90頁).pdf(90頁珍藏版)》請在三個皮匠報告上搜索。
1、Gemini:A Family of Highly CapableMultimodal ModelsGemini Team,Google1This report introduces a new family of multimodal models,Gemini,that exhibit remarkable capabilitiesacross image,audio,video,and text understanding.The Gemini family consists of Ultra,Pro,and Nanosizes,suitable for applications ran
2、ging from complex reasoning tasks to on-device memory-constraineduse-cases.Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra modeladvances the state of the art in 30 of 32 of these benchmarks notably being the first model to achievehuman-expert performance on the wel
3、l-studied exam benchmark MMLU,and improving the state of theart in every one of the 20 multimodal benchmarks we examined.We believe that the new capabilities ofthe Gemini family in cross-modal reasoning and language understanding will enable a wide variety ofuse cases.We discuss our approach toward
4、post-training and deploying Gemini models responsibly tousers through services including Gemini,Gemini Advanced,Google AI Studio,and Cloud Vertex AI.1.IntroductionWe present Gemini,a family of highly capable multimodal models developed at Google.We trainedGemini models jointly across image,audio,vid
5、eo,and text data for the purpose of building a modelwith both strong generalist capabilities across modalities alongside cutting-edge understanding andreasoning performance in each respective domain.Gemini 1.0,our first version,comes in three sizes:Ultra for highly-complex tasks,Pro for enhancedperf
6、ormance and deployability at scale,and Nano for on-device applications.Each size is specificallytailored to address different computational limitations and application requirements.After large-scale pre-training,we post-train our models to improve overall quality,enhance targetcapabilities,and ensur
7、e alignment and safety criteria are met.Due to the varied requirements ofour downstream applications,we have produced two post-trained Gemini model family variants.Chat-focused variants,referred to as Gemini Apps models,are optimized for Gemini and GeminiAdvanced,our conversational AI service former
8、ly known as Bard.Developer-focused variants,referredto as Gemini API models,are optimized for a range of products and are accessible through Google AIStudio and Cloud Vertex AI.We evaluate the performance of pre-and post-trained Gemini models on a comprehensive suiteof internal and external benchmar
9、ks covering a wide range of language,coding,reasoning,andmultimodal tasks.The Gemini family advances state-of-the-art in large-scale language modeling(Anil et al.,2023;Brown et al.,2020;Chowdhery et al.,2023;Hoffmann et al.,2022;OpenAI,2023a;Radford et al.,2019;Rae et al.,2021),image understanding(A
10、layrac et al.,2022;Chen et al.,2022;Dosovitskiyet al.,2020;OpenAI,2023b;Reed et al.,2022;Yu et al.,2022a),audio processing(Radford et al.,2023;Zhang et al.,2023),and video understanding(Alayrac et al.,2022;Chen et al.,2023).Italso builds on the work on sequence models(Sutskever et al.,2014),a long h
11、istory of work in deeplearning based on neural networks(LeCun et al.,2015),and machine learning distributed systems1See Contributions and Acknowledgments section for full author list.Please send correspondence to gemini-1- 2024 Google.All rights reservedarXiv:2312.11805v4 cs.CL 17 Jun 2024Gemini:A F
12、amily of Highly Capable Multimodal Models(Barham et al.,2022;Bradbury et al.,2018;Dean et al.,2012)that enable large-scale training.Our most capable model,Gemini Ultra,achieves new state-of-the-art results in 30 of 32 benchmarkswe report on,including 10 of 12 popular text and reasoning benchmarks,9
13、of 9 image understandingbenchmarks,6 of 6 video understanding benchmarks,and 5 of 5 speech recognition and speechtranslation benchmarks.Gemini Ultra is the first model to achieve human-expert performance onMMLU(Hendrycks et al.,2021a)a prominent benchmark testing knowledge and reasoning via asuite o
14、f exams with a score above 90%.Beyond text,Gemini Ultra makes notable advances onchallenging multimodal reasoning tasks.For example,on the recent MMMU benchmark(Yue et al.,2023),that comprises questions about images on multi-discipline tasks requiring college-level subjectknowledge and deliberate re
15、asoning,Gemini Ultra achieves a new state-of-the-art score of 62.4%,outperforming the previous best model by more than 5 percentage points.It provides a uniformperformance lift for video question answering and audio understanding benchmarks.Qualitative evaluation showcases impressive crossmodal reas
16、oning capabilities,enabling the modelto understand and reason across an input sequence of audio,images,and text natively(see Figure 5and Table 13).Consider the educational setting depicted in Figure 1 as an example.A teacher hasdrawn a physics problem of a skier going down a slope,and a student has
17、worked through a solution toit.Using Gemini models multimodal reasoning capabilities,the model is able to understand the messyhandwriting,correctly understand the problem formulation,convert both the problem and solutionto mathematical typesetting,identify the specific step of reasoning where the st
18、udent went wrong insolving the problem,and then give a worked through correct solution to the problem.This opens upexciting educational possibilities,and we believe the new multimodal and reasoning capabilities ofGemini models have dramatic applications across many fields.The reasoning capabilities
19、of large language models show promise toward building generalistagents that can tackle more complex multi-step problems.The AlphaCode team built AlphaCode2(Leblond et al,2023),a new Gemini-model-powered agent,that combines Gemini models rea-soning capabilities with search and tool-use to excel at so
20、lving competitive programming problems.AlphaCode 2 ranks within the top 15%of entrants on the Codeforces competitive programmingplatform,a large improvement over its state-of-the-art predecessor in the top 50%(Li et al.,2022).In tandem,we advance the frontier of efficiency with Gemini Nano,a series
21、of small modelstargeting on-device deployment.These models excel in on-device tasks,such as summarization,reading comprehension,text completion tasks,and exhibit impressive capabilities in reasoning,STEM,coding,multimodal,and multilingual tasks relative to their sizes.In the following sections,we fi
22、rst provide an overview of the model architecture,training infras-tructure,and pre-training dataset.We then present detailed evaluations of the pre-and post-trainedGemini model family,covering well-studied benchmarks across text,code,image,audio and video which include both English performance and m
23、ultilingual capabilities.Next we discuss our approachto post-training,highlight common and distinct aspects of the Gemini Apps and Gemini API modelvariants,and benchmark their performance on key capabilities.Responsible deployment is critical:weexplain our process for impact assessments,developing m
24、odel policies,evaluations,and mitigationsof harm before deployment decisions.Finally,we discuss the broader implications of Gemini models,their limitations alongside their potential applications paving the way for a new era of researchand innovation in AI.2Gemini:A Family of Highly Capable Multimoda
25、l ModelsFigure 1|Verifying a students solution to a physics problem.The model is able to correctly recognizeall of the handwritten content and verify the reasoning.On top of understanding the text in theimage,it needs to understand the problem setup and correctly follow instructions to generate LATE
26、X.2.Model ArchitectureGemini models build on top of Transformer decoders(Vaswani et al.,2017b)that are enhancedwith improvements in architecture and model optimization to enable stable training at scale andoptimized inference on Googles Tensor Processing Units.They are trained to support 32k context
27、length,employing efficient attention mechanisms(for e.g.multi-query attention(Shazeer,2019a).Our first version,Gemini 1.0,comprises three main sizes to support a wide range of applications asdiscussed in Table 1.Gemini models are trained to accommodate textual input interleaved with a wide variety o
28、f audioand visual inputs,such as natural images,charts,screenshots,PDFs,and videos,and they can producetext and image outputs(see Figure 2).The visual encoding of Gemini models is inspired by our ownfoundational work on Flamingo(Alayrac et al.,2022),CoCa(Yu et al.,2022a),and PaLI(Chen et al.,2022),w
29、ith the important distinction that the models are multimodal from the beginning and cannatively output images using discrete image tokens(Ramesh et al.,2021;Yu et al.,2022b).Video understanding is accomplished by encoding the video as a sequence of frames in the largecontext window.Video frames or i
30、mages can be interleaved naturally with text or audio as part of themodel input.The models can handle variable input resolution in order to spend more compute ontasks that require fine-grained understanding.In addition,Gemini models can directly ingest audio3Gemini:A Family of Highly Capable Multimo
31、dal ModelsModel sizeModel descriptionUltraOur most capable model that delivers state-of-the-art performance across a widerange of highly complex tasks,including reasoning and multimodal tasks.It isefficiently serveable at scale on TPU accelerators due to the Gemini architecture.ProA performance-opti
32、mized model in terms of cost as well as latency that deliverssignificant performance across a wide range of tasks.This model exhibits strongreasoning performance and broad multimodal capabilities.NanoOur most efficient model,designed to run on-device.We trained two versions ofNano,with 1.8B(Nano-1)a
33、nd 3.25B(Nano-2)parameters,targeting low and highmemory devices respectively.It is trained by distilling from larger Gemini models.Itis 4-bit quantized for deployment and provides best-in-class performance.Table 1|An overview of the Gemini 1.0 model family.Figure 2|Gemini models support interleaved
34、sequences of text,image,audio,and video as inputs(illustrated by tokens of different colors in the input sequence).They can output responses withinterleaved image and text.signals at 16kHz from Universal Speech Model(USM)(Zhang et al.,2023)features.This enables themodel to capture nuances that are t
35、ypically lost when the audio is naively mapped to a text input(forexample,see audio understanding demo on the website).Training the Gemini family of models required innovations in training algorithms,dataset,andinfrastructure.For the Pro model,the inherent scalability of our infrastructure and learn
36、ing algorithmsenable us to complete pre-training in a matter of weeks,leveraging a fraction of the Ultras resources.The Nano series of models leverage additional advancements in distillation and training algorithmsto produce the best-in-class small language models for a wide variety of tasks,such as
37、 summarizationand reading comprehension,which power our next generation on-device experiences.3.Training InfrastructureWe trained Gemini models using TPUv5e and TPUv4(Jouppi et al.,2023),depending on their sizesand configuration.Training Gemini Ultra used a large fleet of TPUv4 accelerators owned by
38、 Google4Gemini:A Family of Highly Capable Multimodal Modelsacross multiple datacenters.This represents a significant increase in scale over our prior flagshipmodel PaLM-2 which presented new infrastructure challenges.Scaling up the number of acceleratorsresults in a proportionate decrease in the mea
39、n time between failure of hardware in the overall system.We minimized the rate of planned reschedules and preemptions,but genuine machine failures arecommonplace across all hardware accelerators at such large scales.TPUv4 accelerators are deployed in“SuperPods”of 4096 chips,each connected to a dedic
40、atedoptical switch,which can dynamically reconfigure 4x4x4 chip cubes into arbitrary 3D torus topologiesin around 10 seconds(Jouppi et al.,2023).For Gemini Ultra,we decided to retain a small number ofcubes per superpod to allow for hot standbys and rolling maintenance.TPU accelerators primarily comm
41、unicate over the high speed inter-chip-interconnect,but atGemini Ultra scale,we combine SuperPods in multiple datacenters using Googles intra-cluster andinter-cluster network(Poutievski et al.,2022;Wetherall et al.,2023;yao Hong et al.,2018).Googlesnetwork latencies and bandwidths are sufficient to
42、support the commonly used synchronous trainingparadigm,exploiting model parallelism within superpods and data-parallelism across superpods.The single controller programming model of Jax(Bradbury et al.,2018)and Pathways(Barhamet al.,2022)allows a single Python process to orchestrate the entire train
43、ing run,dramaticallysimplifying the development workflow.The GSPMD partitioner(Xu et al.,2021)in the XLA compilerpartitions the training step computation,and the MegaScale XLA compiler(XLA,2019)pass staticallyschedules appropriate collectives so that they maximally overlap with the computation with
44、very littlevariation in step time.Maintaining a high goodput2at this scale would have been impossible using the conventionalapproach of periodic checkpointing of weights to persistent cluster storage.For Gemini models,weinstead made use of redundant in-memory copies of the model state,and on any unp
45、lanned hardwarefailures,we rapidly recover directly from an intact model replica.Compared to both PaLM and PaLM-2(Anil et al.,2023),this provided a substantial speedup in recovery time,despite the significantlylarger training resources being used.As a result,the overall goodput for the largest-scale
46、 training jobincreased from 85%to 97%.Training at unprecedented scale invariably surfaces new and interesting systems failure modes-and in this instance one of the problems that we needed to address was that of“Silent Data Corruption(SDC)”(Dixit et al.,2021;Hochschild et al.,2021;Vishwanathan et al.
47、,2015).Although these areextremely rare,the scale of Gemini models means that we can expect SDC events to impact trainingevery week or two.Rapidly detecting and removing faulty hardware required several new techniquesthat exploit deterministic replay to isolate incorrect computations,combined with p
48、roactive SDCscanners on idle machines and hot standbys.Our fully deterministic infrastructure allowed us toquickly identify root causes(including hardware failures)during the development leading up to theUltra model,and this was a crucial ingredient towards stable training.4.Pre-Training DatasetGemi
49、ni models are trained on a dataset that is both multimodal and multilingual.Our pre-trainingdataset uses data from web documents,books,and code,and includes image,audio,and video data.We use the SentencePiece tokenizer(Kudo and Richardson,2018)and find that training thetokenizer on a large sample of
50、 the entire training corpus improves the inferred vocabulary andsubsequently improves model performance.For example,we find Gemini models can efficiently2We define goodput as the time spent computing useful new steps over the elapsed time of the training job.5Gemini:A Family of Highly Capable Multim
51、odal Modelstokenize non-Latin scripts which can,in turn,benefit model quality as well as training and inferencespeed.The number of tokens used to train the largest models were determined following the approachin Hoffmann et al.(2022).The smaller models are trained for significantly more tokens to im
52、proveperformance for a given inference budget,similar to the approach advocated in Touvron et al.(2023a).We apply quality filters to all datasets,using both heuristic rules and model-based classifiers.We also perform safety filtering to remove harmful content based on our policies.To maintain theint
53、egrity of evaluations,we search for and remove any evaluation data that may have been in ourtraining corpus before using data for training.The final data mixtures and weights were determinedthrough ablations on smaller models.We stage training to alter the mixture composition duringtraining increasi
54、ng the weight of domain-relevant data towards the end of training.We find thatdata quality is an important factor for highly-performing models,and believe that many interestingquestions remain around finding the optimal dataset distribution for pre-training.5.EvaluationThe Gemini models are natively
55、 multimodal,as they are trained jointly across text,image,audio,and video.One open question is whether this joint training can result in a model which has strongcapabilities in each domain even when compared to models and approaches that are narrowlytailored to single domains.We find this to be the
56、case:Gemini models set a new state of the artacross a wide range of text,image,audio,and video benchmarks.ww5.1.Text5.1.1.Academic BenchmarksWe compare pre-and post-trained Gemini Pro and Ultra models to a suite of external LLMs and ourprevious best model PaLM 2 across a series of text-based academi
57、c benchmarks covering reasoning,reading comprehension,STEM,and coding.We report these results in Table 2.Broadly,we findthat the performance of Gemini Pro outperforms inference-optimized models such as GPT-3.5 andperforms comparably with several of the most capable models available,and Gemini Ultra
58、outperformsall current models.In this section,we examine some of these findings.On MMLU(Hendrycks et al.,2021a),Gemini Ultra can outperform all existing models,achievingan accuracy of 90.04%.MMLU is a holistic exam benchmark,which measures knowledge across aset of 57 subjects.Human expert performanc
59、e is gauged at 89.8%by the benchmark authors,andGemini Ultra is the first model to exceed this threshold,with the prior state-of-the-art result at 86.4%.Achieving high performance requires specialist knowledge across many domains(e.g.law,biology,history,etc.),alongside reading comprehension and reas
60、oning.We find Gemini Ultra achieves highestaccuracy when used in combination with a chain-of-thought prompting approach(Wei et al.,2022b)that accounts for model uncertainty.The model produces a chain of thought with k samples,forexample 8 or 32.If there is a consensus above a preset threshold(select
61、ed based on the validationsplit),it selects this answer,otherwise it reverts to a greedy sample based on maximum likelihoodchoice without chain of thought.We refer the reader to appendix for a detailed breakdown of howthis approach compares with only chain-of-thought prompting or only greedy samplin
62、g.In mathematics,a field commonly used to benchmark the analytical capabilities of models,GeminiUltra shows strong performance on both elementary exams and competition-grade problem sets.Forthe grade-school math benchmark,GSM8K(Cobbe et al.,2021),we find Gemini Ultra reaches 94.4%6Gemini:A Family of
63、 Highly Capable Multimodal Modelsaccuracy with chain-of-thought prompting and self-consistency(Wang et al.,2022)compared tothe previous best accuracy of 92%with the same prompting technique.Similar positive trends areobserved in increased difficulty math problems drawn from middle-and high-school ma
64、th competitions(MATH benchmark),with the Gemini Ultra model outperforming all competitor models,reaching53.2%using 4-shot prompting.The model also outperforms the state of the art on even harder tasksderived from American Mathematical Competitions(150 questions from 2022 and 2023).Smallermodels perf
65、orm poorly on this challenging task scoring close to random,but Gemini Ultra can solve32%of the questions,compared to the 30%solve rate for GPT-4.Gemini Ultra also excels in coding,a popular use case of current LLMs.We evaluate the modelon many conventional and internal benchmarks and also measure i
66、ts performance as part of morecomplex reasoning systems such as AlphaCode 2(see Section 5.1.7 on complex reasoning systems).For example,on HumanEval,a standard code-completion benchmark(Chen et al.,2021)mappingfunction descriptions to Python implementations,instruction-tuned Gemini Ultra correctly i
67、mplements74.4%of problems.On a new held-out evaluation benchmark for python code generation tasks,Natural2Code,where we ensure no web leakage,Gemini Ultra achieves the highest score of 74.9%.Evaluation on these benchmarks is challenging and may be affected by data contamination.Weperformed an extens
68、ive leaked data analysis after training to ensure the results we report here are asscientifically sound as possible,but still found some minor issues and decided not to report results one.g.LAMBADA(Paperno et al.,2016).As part of the evaluation process,on a popular benchmark,HellaSwag(Zellers et al.
69、,2019),we find that an additional hundred fine-tuning steps on specificwebsite extracts corresponding to the HellaSwag training set(which were not included in the Geminimodel pretraining set)improve the validation accuracy of Gemini Pro to 89.6%and Gemini Ultra to96.0%,when measured with 1-shot prom
70、pting(we measured GPT-4 obtained 92.3%when evaluated1-shot via the API).This suggests that the benchmark results are susceptible to the pretraining datasetcomposition.We choose to report HellaSwag decontaminated results only in a 10-shot evaluationsetting.We believe there is a need for more robust a
71、nd nuanced standardized evaluation benchmarkswith no leaked data.So,we evaluate Gemini models on several new held-out evaluation datasetsthat were recently released,such as WMT23 and Math-AMC 2022-2023 problems,or internallygenerated from non-web sources,such as Natural2Code.We refer the reader to A
72、ppendix 10.3 for acomprehensive list of our evaluation benchmarks.Even so,model performance on these benchmarks gives us an indication of the model capabilitiesand where they may provide impact on real-world tasks.For example,Gemini Ultras impressivereasoning and STEM competencies pave the way for a
73、dvancements in LLMs within the educationaldomain3.The ability to tackle complex mathematical and scientific concepts opens up excitingpossibilities for personalized learning and intelligent tutoring systems.5.1.2.Trends in CapabilitiesWe investigate the trends in capabilities across the Gemini model
74、 family by evaluating them on aholistic harness of more than 50 benchmarks in six different capabilities,noting that some of themost notable benchmarks were discussed in the last section.These capabilities are:“Factuality”covering open/closed-book retrieval and question answering tasks;“Long-Context
75、”covering long-form summarization,retrieval and question answering tasks;“Math/Science”including tasks formathematical problem solving,theorem proving,and scientific exams;“Reasoning”tasks that requirearithmetic,scientific,andcommonsensereasoning;“Multilingual”tasksfortranslation,summarization,and r
76、easoning in multiple languages.Several of these capabilities are targeted by post-training(Section 6).Please see Appendix 10.3 for a detailed list of tasks included for each capability.3See demos on websitehttps:/deepmind.google/gemini.7Gemini:A Family of Highly Capable Multimodal ModelsGeminiUltraG
77、eminiProGPT-4GPT-3.5PaLM 2-LClaude 2Inflect-ion-2Grok 1LLAMA-2MMLUMultiple-choice questionsin 57 subjects(professional&academic)(Hendrycks et al.,2021a)90.04%CoT3283.7%5-shot79.13%CoT871.8%5-shot87.29%CoT32(via API)86.4%5-shot(reported)70%5-shot78.4%5-shot78.5%5-shot CoT79.6%5-shot73.0%5-shot68.0%GS
78、M8KGrade-school math(Cobbe et al.,2021)94.4%Maj13286.5%Maj13292.0%SFT&5-shot CoT57.1%5-shot80.0%5-shot88.0%0-shot81.4%8-shot62.9%8-shot56.8%5-shotMATHMath problems across5 difficulty levels&7 subdisciplines(Hendrycks et al.,2021b)53.2%4-shot32.6%4-shot52.9%4-shot(via API)50.3%(Zheng et al.,2023)34.1
79、%4-shot(via API)34.4%4-shot34.8%23.9%4-shot13.5%4-shotBIG-Bench-HardSubset of hard BIG-benchtasks written as CoT prob-lems(Srivastava et al.,2022)83.6%3-shot75.0%3-shot83.1%3-shot(via API)66.6%3-shot(via API)77.7%3-shot51.2%3-shotHumanEvalPython coding tasks(Chen et al.,2021)74.4%0-shot(PT)67.7%0-sh
80、ot(PT)67.0%0-shot(reported)48.1%0-shot70.0%0-shot44.5%0-shot63.2%0-shot29.9%0-shotNatural2CodePython code generation.(New held-out set with noleakage on web)74.9%0-shot69.6%0-shot73.9%0-shot(via API)62.3%0-shot(via API)DROPReading comprehension&arithmetic.(metric:F1-score)(Dua et al.,2019)82.4Variab
81、leshots74.1Variableshots80.93-shot(reported)64.13-shot82.0VariableshotsHellaSwag(validation set)Common-sense multiplechoice questions(Zellers et al.,2019)87.8%10-shot84.7%10-shot95.3%10-shot(reported)85.5%10-shot86.8%10-shot89.0%10-shot80.0%WMT23Machine translation(met-ric:BLEURT)(Tom et al.,2023)74
82、.41-shot(PT)71.71-shot73.81-shot(via API)72.71-shotTable 2|Gemini performance on text benchmarks with external comparisons and PaLM 2-L.The model produces a chain of thought with k=8 or 32 samples,if there is a consensus above a threshold(chosen based on the validationsplit),it selects this answer,o
83、therwise it reverts to a greedy sample.Further analysis in Appendix 10.2.Results self-collected via the API in Nov,2023.Results shown use the decontaminated numbers from Touvron et al.(2023b)report as the most relevant comparison to Gemini modelswhich have been decontaminated as well.)PT denotes a p
84、ost-trained Gemini API model.We observe consistent quality gains with increased model size in Figure 3,especially in reasoning,math/science,summarization and long-context.Gemini Ultra is the best model across the board forall six capabilities.Gemini Pro,the second-largest model in the Gemini family
85、of models,is also quitecompetitive while being a lot more efficient to serve.5.1.3.NanoBringing AI closer to the user,we discuss the Gemini Nano 1 and Nano 2 models engineered foron-device deployments.These models excel in summarization and reading comprehension tasks withper-task fine-tuning.Figure
86、 3 shows the performance of these pre-trained models in comparisonto the much larger Gemini Pro model,while Table 3 dives deeper into specific factuality,coding,Math/Science,and reasoning tasks.Nano-1 and Nano-2 model sizes are only 1.8B and 3.25Bparameters respectively.Despite their size,they show
87、exceptionally strong performance on factuality,i.e.retrieval-related tasks,and significant performance on reasoning,STEM,coding,multimodal and8Gemini:A Family of Highly Capable Multimodal ModelsFactualityLong-ContextMath/ScienceSummarizationReasoningMultilinguality0.00.20.40.60.81.01.21.4Normalized
88、Performance vs ProNano 1Nano 2ProUltraFigure 3|Language understanding and generation performance of Gemini model family acrossdifferent capabilities(normalized by the Gemini Pro model).multilingual tasks.With new capabilities accessible to a broader set of platforms and devices,theGemini models expa
89、nd accessibility to everyone.Gemini Nano 1Gemini Nano 2accuracynormalizedby Proaccuracynormalizedby ProBoolQ71.60.8179.30.90TydiQA(GoldP)68.90.8574.20.91NaturalQuestions(Retrieved)38.60.6946.50.83NaturalQuestions(Closed-book)18.80.4324.80.56BIG-Bench-Hard(3-shot)34.80.4742.40.58MBPP20.00.3327.20.45M
90、ATH(4-shot)13.50.4122.80.70MMLU(5-shot)45.90.6455.80.78Table 3|Performance of Gemini Nano series on factuality,summarization,reasoning,coding andSTEM tasks compared to significantly larger Gemini Pro model.5.1.4.MultilingualityThe multilingual capabilities of the Gemini models are evaluated using a
91、diverse set of tasks requir-ing multilingual understanding,cross-lingual generalization,and the generation of text in multiplelanguages.These tasks include machine translation benchmarks(WMT 23 for high-medium-lowresource translation;Flores,NTREX for low and very low resource languages),summarizatio
92、n bench-marks(XLSum,Wikilingua),and translated versions of common benchmarks(MGSM:professionallytranslated into 11 languages).5.1.4.1Machine TranslationTranslation is a canonical benchmark in machine learning with a rich history.We evaluated a post-trained Gemini API Ultra model(see Section 6.5.3)on
93、 the entire set of language pairs in the WMT 23translation benchmark in a few-shot setting.Overall,we found that Gemini Ultra(and other Geminimodels)performed remarkably well at translating from English to any other language,and surpassed9Gemini:A Family of Highly Capable Multimodal Modelsthe LLM-ba
94、sed translation methods when translating out-of-English,on high-resource,mid-resourceand low-resource languages.In the WMT 23 out-of-English translation tasks,Gemini Ultra achievedthe highest LLM-based translation quality,with an average BLEURT(Sellam et al.,2020)score of 74.8,compared to GPT-4s sco
95、re of 73.6,and PaLM 2s score of 72.2.When averaged across all languagepairs and directions for WMT 23,we see a similar trend with Gemini Ultra 74.4,GPT-4 73.8 andPaLM 2-L 72.7 average BLEURT scores on this benchmark.WMT 23(Avg BLEURT)Gemini UltraGemini ProGemini Nano 2Gemini Nano 1GPT-4PaLM 2-LHigh
96、Resource74.271.767.764.174.072.6Mid Resource74.771.867.064.873.672.7Out-of-English74.871.566.265.273.672.2Into-English73.972.069.063.574.173.4All languages74.471.767.464.873.872.7Table 4|Performance of Gemini models on WMT 23 translation benchmark.All numbers with 1-shot.In addition to the languages
97、 and translation tasks above,we also evaluate Gemini Ultra on verylow-resource languages.These languages were sampled from the tail of the following language sets:Flores-200(Tamazight and Kanure),NTREX(North Ndebele),and an internal benchmark(Quechua).For these languages,both from and into English,G
98、emini Ultra achieved an average chrF score of 27.0in 1-shot setup,while the next-best model,PaLM 2-L,achieved a score of 25.3.5.1.4.2Multilingual Math and SummarizationBeyond translation,we evaluated how well Gemini models perform in challenging tasks across arange of languages.We specifically inves
99、tigated the math benchmark MGSM(Shi et al.,2023),whichis a translated variant of the math benchmark GSM8K(Cobbe et al.,2021).We find Gemini Ultraachieves an accuracy of 79.0%,an advance over PaLM 2-L which scores 74.7%,when averagedacross all languages in an 8-shot setup.We also benchmark Gemini mod
100、els on the multilingualsummarization benchmarks XLSum(Hasan et al.,2021)and WikiLingua(Ladhak et al.,2020).InXLSum,Gemini Ultra reached an average of 17.6 rougeL score compared to 15.4 for PaLM 2.ForWikilingua,Gemini Ultra(5-shot)trails behind PaLM 2(3-shot)measured in BLEURT score.SeeTable 5 for th
101、e full results.Overall the diverse set of multilingual benchmarks show that Geminifamily models have a broad language coverage,enabling them to also reach locales and regions withlow-resource languages.Gemini UltraGemini ProGPT-4PaLM 2-LMGSM(8-shot)79.063.574.574.7XLsum(3-shot)17.616.215.4Wikilingua
102、48.947.850.4Table 5|Performance of Gemini models on multilingual math and summarization.5.1.5.Long ContextGemini models are trained with a sequence length of 32,768 tokens and we find that they make useof their context length effectively.We first verify this by running a synthetic retrieval test:we
103、placekey-value pairs at the beginning of the context,then add long filler text,and ask for value associatedwith a particular key.We find that the Ultra model retrieves the correct value with 98%accuracywhen queried across the full context length.We further investigate this by plotting the negative l
104、og10Gemini:A Family of Highly Capable Multimodal Modelslikelihood(NLL)versus the token index across a held-out set of long documents in Figure 4.Wefind that the NLL decreases with sequence position up to the full 32K context length.The longercontext length of Gemini models enable new use cases such
105、as retrieval over documents and videounderstanding discussed in Section 5.2.2.81632641282565121K2K4K8K16K32KSequence positionNLLProUltraFigure 4|Negative log likelihood as a function of token index across 32K context length on a held-outset of long documents.5.1.6.FactualityFactuality(Maynez et al.,
106、2020)is a key focus of our models training and deployment.We evaluatethree aspects of factuality for our Gemini API models:1.Closed-Book Factuality:If provided with a fact-seeking prompt without any given source,Gemini API models should not hallucinate incorrect information(see Section 2 of Roberts
107、et al.(2020)for a definition).These prompts can range from information-seeking prompts(e.g.“Whois the prime minister of India?”)to semi-creative prompts that may request factual information(e.g.“Write a 500-word speech in favor of the adoption of renewable energy”).2.Attribution:If instructed to gen
108、erate a response grounded to a given context,we aim to ensurethat Gemini API models produce a response with the highest degree of faithfulness to thecontext(Maynez et al.,2020;Rashkin et al.,2023).This may include the summarization of auser-provided source,generating fine-grained citations given a q
109、uestion and provided snippetsakin to Menick et al.(2022);Peng et al.(2023),answering questions from a long-form sourcesuch as a book(Mihaylov et al.,2018),and transforming a given source to a desired output(e.g.an email from a portion of a meeting transcript).3.Hedging:If prompted with an input that
110、 is“unanswerable”,Gemini API models must ac-knowledge that it cannot provide a response by hedging to avoid hallucination.These includescenarios where the input prompt contains false-premise questions see examples in Hu et al.(2023),the input prompt instructs the model to perform open book QA,but th
111、e answer is notderivable from the given context,and so forth.Factuality is evaluated via human annotators who fact-check each response manually;we reportthe percentage of factually inaccurate responses as judged by annotators.Attribution is evaluated viahuman annotators who check for attribution to
112、sources in the prompt for each response manually;the reported metric is AIS(Rashkin et al.,2023).For hedging,we use an automatic evaluation setupwhere we measure whether models hedge accurately.We compare Gemini API Pro with a version without any factuality-focused adaptation in Table 6.We see that
113、the rate of inaccuracy is halved in the factuality set,the accuracy of attribution is increased11Gemini:A Family of Highly Capable Multimodal Modelsby 50%from the attribution set,and the model successfully hedges 70%(up from 0%)in the providedhedging set task.Factuality(Inaccurate Rate)Attribution(A
114、IS)Hedging(Accuracy)Gemini API ProNo factuality-focused adaptation6.7%5.8%,7.8%40.2%37.9%,42.5%0%Gemini API ProFinal stage of post-training3.8%3.1%,4.8%60.0%57.6%,62.1%69.3%Table6|Factualitymitigations:Impactofpost-trainingontherateofinaccuracy,presenceofattributionand the rate of accurate hedging o
115、n Gemini API Pro(with corresponding 95%confidence intervals).5.1.7.Complex Reasoning SystemsGemini models can also be combined with additional techniques such as search and tool-use to createpowerful reasoning systems that can tackle more complex multi-step problems.One example of sucha system is Al
116、phaCode 2,a new state-of-the-art agent that excels at solving competitive programmingproblems(Leblond et al,2023).AlphaCode 2 uses a specialized version of Gemini Pro tuned oncompetitive programming data similar to the data used in Li et al.(2022)to conduct a massivesearch over the space of possible
117、 programs.This is followed by a tailored filtering,clustering andreranking mechanism.Gemini Pro is fine-tuned both to be a coding model to generate proposalsolution candidates,and to be a reward model that is leveraged to recognize and extract the mostpromising code candidates.AlphaCode 2 is evaluat
118、ed on Codeforces,4the same platform as AlphaCode,on 12 contests fromdivision 1 and 2,for a total of 77 problems.AlphaCode 2 solved 43%of these competition problems,a1.7x improvement over the prior record-setting AlphaCode system which solved 25%.Mapping this tocompetition rankings,AlphaCode 2 built
119、on top of Gemini Pro sits at an estimated 85th percentile onaverage i.e.it performs better than 85%of entrants.This is a significant advance over AlphaCode,which only outperformed 50%of competitors.The composition of powerful pre-trained models with search and reasoning mechanisms is anexciting dire
120、ction towards more general agents;another key ingredient is deep understanding acrossa range of modalities which we discuss in the next section.4http:/ Family of Highly Capable Multimodal Models5.2.MultimodalGemini models are natively multimodal.These models exhibit the unique ability to seamlesslyc
121、ombine their capabilities across modalities(e.g.extracting information and spatial layout out ofa table,a chart,or a figure)with the strong reasoning capabilities of a language model(e.g.itsstate-of-art-performance in math and coding)as seen in examples in Figures 5 and 14.The modelsalso show strong
122、 performance in discerning fine-grained details in inputs,aggregating context acrossspace and time,and applying these capabilities over a temporally-related sequence of video framesand/or audio inputs.The sections below provide more detailed evaluation of the model across different modalities(image,
123、video,and audio),together with qualitative examples of the models capabilities for imagegeneration and the ability to combine information across different modalities.5.2.1.Image UnderstandingWe evaluate post-trained Gemini API models on four different capabilities:high-level object recogni-tion usin
124、g captioning or question-answering tasks such as VQAv2;fine-grained transcription usingtasks such as TextVQA and DocVQA requiring the model to recognize low-level details;chart un-derstanding requiring spatial understanding of input layout using ChartQA and InfographicVQAtasks;and multimodal reasoni
125、ng using tasks such as Ai2D,MathVista and MMMU.For zero-shot QAevaluation,the model is instructed to provide short answers aligned with the specific benchmark.Allnumbers are obtained using greedy sampling and without any use of external OCR tools.GeminiUltra(pixel only)GeminiPro(pixel only)GeminiNan
126、o 2(pixel only)GeminiNano 1(pixel only)GPT-4VPrior SOTAMMMU(val)Multi-discipline college-level problems(Yue et al.,2023)59.4%pass162.4%Maj13247.9%32.6%26.3%56.8%56.8%GPT-4V,0-shotTextVQA(val)Text reading on natural images(Singh et al.,2019)82.3%74.6%65.9%62.5%78.0%79.5%Google PaLI-3,fine-tunedDocVQA
127、(test)Document understanding(Mathew et al.,2021)90.9%88.1%74.3%72.2%88.4%(pixel only)88.4%GPT-4V,0-shotChartQA(test)Chart understanding(Masry et al.,2022)80.8%74.1%51.9%53.6%78.5%(4-shot CoT)79.3%Google DePlot,1-shot PoT(Liu et al.,2023)InfographicVQA(test)Infographic understanding(Mathew et al.,202
128、2)80.3%75.2%54.5%51.1%75.1%(pixel only)75.1%GPT-4V,0-shotMathVista(testmini)Mathematical reasoning(Lu et al.,2023)53.0%45.2%30.6%27.3%49.9%49.9%GPT-4V,0-shotAI2D(test)Science diagrams(Kembhavi et al.,2016)79.5%73.9%51.0%37.9%78.2%81.4%Google PaLI-X,fine-tunedVQAv2(test-dev)Natural image understandin
129、g(Goyal et al.,2017)77.8%71.2%67.5%62.7%77.2%86.1%Google PaLI-X,fine-tunedTable 7|Image understanding Gemini Ultra consistently outperforms existing approaches even inzero-shot,especially for OCR-related image understanding tasks for natural images,text,documents,and figures without using any extern
130、al OCR engine(pixel only).Many existing approaches fine-tuneon the respective tasks,highlighted in gray,which makes the comparison with 0-shot not apples-to-apples.13Gemini:A Family of Highly Capable Multimodal ModelsWe find that Gemini Ultra is state of the art across a wide range of image-understa
131、nding bench-marks in Table 7.It achieves strong performance across a diverse set of tasks such as answeringquestions on natural images and scanned documents as well as understanding infographics,chartsand science diagrams.When compared against publicly reported results from other models(mostnotably
132、GPT-4V),the Gemini model is better in zero-shot evaluation by a significant margin.It alsoexceeds several existing models that are specifically fine-tuned on the benchmarks training sets forthe majority of tasks.The capabilities of the Gemini models lead to significant improvements in thestate of th
133、e art on academic benchmarks like MathVista(+3.1%)5or InfographicVQA(+5.2%).MMMU(Yue et al.,2023)is a recently released evaluation benchmark,which consists of questionsabout images across 6 disciplines with multiple subjects within each discipline that require college-level knowledge to solve these
134、questions.Gemini Ultra achieves the best score on this benchmarkadvancing the state-of-the-art result by more than 5 percentage points and outperforms the previousbest result in 5 of 6 disciplines(see Table 8),thus showcasing its multimodal reasoning capabilities.MMMU(val)Gemini Ultra(0-shot)GPT-4V(
135、0-shot)Maj32pass1pass1Art&Design74.270.065.8Business62.756.759.3Science49.348.054.7Health&Medicine71.367.364.7Humanities&Social Science78.378.372.5Technology&Engineering53.047.136.7Overall62.459.456.8Table 8|Gemini Ultra performance on the MMMU benchmark(Yue et al.,2023)per discipline.Each disciplin
136、e covers multiple subjects,requiring college-level knowledge and complex reasoning.Gemini models are also capable of operating across modalities and a diverse set of global languagessimultaneously,both for image understanding tasks(e.g.,images containing text in Icelandic)and forgeneration tasks(e.g
137、.,generating image descriptions for a wide range of languages).We evaluate theperformance of generating image descriptions on a selected subset of languages in the Crossmodal-3600(XM-3600)benchmark in a 4-shot setting,using the Flamingo evaluation protocol(Alayracet al.,2022),without any fine-tuning
138、 for all models.As shown in Table 9,Gemini models achieve asignificant improvement over the existing best model,Google PaLI-X.XM-3600(CIDER)Gemini Ultra4-shotGemini Pro4-shotGoogle PaLI-X4-shotEnglish86.487.177.8French77.976.762.5Hindi31.129.822.2Modern Hebrew54.552.638.7Romanian39.037.730.2Thai86.7
139、77.056.0Chinese33.330.227.7Average(of 7)58.455.945.0Table 9|Multilingual image understanding Gemini models outperform existing models in captioningimages in many languages when benchmarked on a subset of languages in XM-3600 dataset(Thapliyalet al.,2022).5MathVista is a comprehensive mathematical re
140、asoning benchmark consisting of 28 previously published multimodaldatasets and three newly created datasets.Our MathVista results were obtained by running the MathVista authorsevaluation script.14Gemini:A Family of Highly Capable Multimodal ModelsFigure 5|Using Gemini models multimodal reasoning cap
141、abilities to generatematplotlibcodefor rearranging the subplots.The multimodal prompt is shown at the top-left in gray.Gemini Ultrasresponse,including its generated code,is shown in the right column in blue.The bottom left figureshows rendered version of the generated code.Successfully solving this
142、task shows the modelscapability to combine several capabilities:(1)recognition of the functions depicted in the plots;(2)inverse graphics to infer the code that would have generated the subplots;(3)instruction-followingto put subplots in their desired positions;and(4)abstract reasoning to infer that
143、 the exponential plotmust stay in its original place,because the sine plot must move out of the way for the 3-dimensionalplot.Qualitative evaluation in Figure 5 illustrates an example of Gemini Ultras multimodal reasoningcapabilities.The model is required to solve the task of generatingmatplotlibcod
144、e that would rearrange15Gemini:A Family of Highly Capable Multimodal Modelsa set of subplots provided by the user.The model output shows that it successfully solves this taskcombining multiple capabilities of understanding the user plot,inferring the code required to generateit,following user instru
145、ctions to put subplots in their desired positions,and abstract reasoning aboutthe output plot.This highlights Gemini Ultras native multimodality and alludes to its more complexreasoning abilities across interleaved sequences of image and text.We refer the reader to the appendixfor more qualitative e
146、xamples.5.2.2.Video UnderstandingUnderstanding video input is an important step towards a useful generalist agent.We measure thevideo understanding capability across several established benchmarks that are held-out from training.These tasks measure whether the model is able to understand and reason
147、over a temporally-relatedsequence of frames.For each video task,we sample 16 equally-spaced frames from each video clipand feed them to the Gemini models.For the YouTube video datasets(all datasets except NextQAand the Perception test),we evaluate the Gemini models on videos that were still publicly
148、 availablein the month of November,2023.Gemini Ultra achieves state-of-the-art performance on various few-shot video captioning tasksas well as zero-shot video question answering tasks as shown in Table 10.This demonstrates itscapability of strong temporal reasoning across several frames.Figure 23 i
149、n the appendix provides aqualitative example of understanding the video of the ball-striking mechanics of a soccer player andreasoning about the player can improve their game.TaskGemini UltraGemini ProFew-shot SoTAVATEX(test)62.757.456.0English video captioning(Wang et al.,2019)4-shots4-shotsDeepMin
150、d Flamingo,4-shotsVATEX ZH(test)51.350.0Chinese video captioning(Wang et al.,2019)4-shots4-shotsYouCook2(val)135.4123.274.5English cooking video captioning(Zhou et al.,2018)4-shots4-shotsDeepMind Flamingo,4-shotsNextQA(test)29.928.026.7Video question answering(Xiao et al.,2021)0-shot0-shotDeepMind F
151、lamingo,0-shotActivityNet-QA(test)52.249.845.3Video question answering(Yu et al.,2019)0-shot0-shotVideo-LLAVA,0-shotPerception Test MCQA(test)54.751.146.3Video question answering(Ptrucean et al.,2023)0-shot0-shotSeViLA(Yu et al.,2023),0-shotTable 10|Few-shot video understanding across tasks and lang
152、uages on selected academicbenchmarks.The reported metric is CIDER for video captioning,WUPS for NextQA,and top-1accuracy for the Perception Test and ActivityNet-QA.For ActivityNet-QA,we use the Video-LLAVA(Lin et al.,2023)evaluation protocol.5.2.3.Image GenerationGemini models are able to output ima
153、ges natively,without having to rely on an intermediate naturallanguage description that can bottleneck the models ability to express images.This uniquely enablesthe model to generate images with prompts using interleaved sequences of image and text in a16Gemini:A Family of Highly Capable Multimodal
154、Modelsfew-shot setting.For example,the user might prompt the model to design suggestions of images andtext for a blog post or a website(see Figure 12 in the appendix).Figure 6 shows an example of image generation in 1-shot setting.Gemini Ultra model is promptedwith one example of interleaved image a
155、nd text where the user provides two colors(blue and yellow)and image suggestions of creating a cute blue cat or a blue dog with yellow ear from yarn.Themodel is then given two new colors(pink and green)and asked for two ideas about what to createusing these colors.The model successfully generates an
156、 interleaved sequence of images and text withsuggestions to create a cute green avocado with pink seed or a green bunny with pink ears from yarn.Figure 6|Image Generation.Gemini models can output multiple images interleaved with text givena prompt composed of image and text.In the left figure,Gemini
157、 Ultra is prompted in a 1-shot settingwith a user example of generating suggestions of creating cat and dog from yarn when given twocolors,blue and yellow.Then,the model is prompted to generate creative suggestions with two newcolors,pink and green,and it generates images of creative suggestions to
158、make a cute green avocadowith pink seed or a green bunny with pink ears from yarn as shown in the right figure.17Gemini:A Family of Highly Capable Multimodal Models5.2.4.Audio UnderstandingWe evaluate the Gemini Nano-1 and Gemini Pro models on a variety of public benchmarks andcompare it with Univer
159、sal Speech Model(USM)(Zhang et al.,2023)and Whisper(large-v2(Radfordet al.,2023)or large-v3(OpenAI,2023)as indicated).These benchmarks include automatic speechrecognition(ASR)tasks such as FLEURS(Conneau et al.,2023),VoxPopuli,(Wang et al.,2021),Multi-lingual Librispeech(Pratap et al.,2020),as well
160、as the speech translation task CoVoST 2,translating different languages into English(Wang et al.,2020).We also report on an internalbenchmark YouTube test set.ASR tasks report a word error rate(WER)metric,where a lower numberis better.Translation tasks report a BiLingual Evaluation Understudy(BLEU)s
161、core,where a highernumber is better.FLEURS is reported on 62 languages that have language overlap with the trainingdata.Four segmented languages(Mandarin,Japanese,Korean and Thai)report character error rate(CER),instead of WER,similar to Whisper(Radford et al.,2023).Table 11 indicates that our Gemin
162、i Pro model significantly outperforms the USM and Whispermodels across all ASR and AST tasks,both for English and multilingual test sets.Note that there is alarge gain in FLEURS,compared to USM and Whisper,as our model is also trained with the FLEURStraining dataset.However,training the same model w
163、ithout FLEURS dataset results in a WER of 15.8,which still outperforms Whisper.Gemini Nano-1 model also outperforms both USM and Whisper onall datasets except FLEURS.Note that we did not evaluate Gemini Ultra on audio yet,though weexpect better performance from increased model scale.TaskMetricGemini
164、ProGeminiNano-1Whisper(OpenAI,2023;Radford et al.,2023)USM(Zhangetal.,2023)Automatic SpeechRecognitionYouTube(en-us)WER()4.9%5.5%6.5%(v3)6.2%MultilingualLibrispeech(en-us)(Pratap et al.,2020)WER()4.8%5.9%6.2%(v2)7.0%FLEURS(62 lang)(Conneau et al.,2023)WER()7.6%14.2%17.6%(v3)11.8%VoxPopuli(14 lang)(W
165、ang et al.,2021)WER()9.1%9.5%15.9%(v2)13.4%Automatic SpeechTranslationCoVoST 2(21 lang)(Wang et al.,2020)BLEU()40.135.429.1(v2)30.7Table 11|Speech evaluation results on selected benchmarks for ASR and AST.For ASR,the reportedmetric is WER where lower is better.For AST,the reported metric is BLEU whe
166、re higher is better.Table 12 shows further error analysis with USM and Gemini Pro.We find that Gemini Pro producesmore understandable responses,particularly on rare words and proper nouns.DomainTruthUSMGemini ProWavFleursScotturb bus 403 travels regularly toSintra,stopping at Cabo da Roca.Scotboard
167、bus four3 traversed regu-larly to Centra stopping at Cabo deRoga.Scotturb bus 403 travels regularly toSintra,stopping at Cabo da Roca.)FleursThe archipelago lies 120 km northof the Peninsula.The largest is KingGeorge Island,with the settlement ofVilla Las Estrellas.The archipelago lines 120 km north
168、of peninsula.The largest is Kingurchisland with the settlement of CuaLosas.The archipelago lies 120 km northof the Peninsula.The largest is KingGeorge Island,with the settlement ofVilla Las Estrellas.)Table 12|Qualitative examples for the ASR task in the benchmark.Incorrect transcriptions arehighlig
169、hted in red.18Gemini:A Family of Highly Capable Multimodal Models5.2.5.Modality CombinationMultimodal demonstrations often include a combination of text interleaved with a single modality,usually images.We demonstrate the ability to process a sequence of audio and images natively.Consider a cooking
170、scenario about making an omelet where we prompt the model with a sequenceof audio and images.Table 13 indicates a turn-by-turn interaction with the model,providing picturesand verbally asking questions about the next steps for cooking an omelet.We note that the modelresponse text is reasonably accur
171、ate,and shows that model processes fine-grained image details toevaluate when the omelet is fully cooked.See demo on the website.Input ImageInput Audio(transcribed)Model Response:Text)Whats the first step to make a veggie omeletwith these ingredients?Crack the eggs into a bowl and whisk them.)Thank
172、you for the instructions.I started mak-ing my omelet,does it look ready now?It looks like its almost ready.You can flip it overto cook the other side.(No image-its a follow upon the previous question)Why is it not ready?Its not ready because the eggs are still runny.)What about now?It looks ready no
173、w.You can take it off the heat andserve it.Table 13|Audio-visual qualitative example showcasing the ability of Gemini models to processinterleaved sequences of text,vision,and audio,as well as reason across modalities.This exampleinputs interleaved images and audio from the user in a cooking scenari
174、o.The user prompts the modelfor instructions to make an omelet and to inspect whether it is fully cooked.19Gemini:A Family of Highly Capable Multimodal Models6.Post-Training ModelsAfter large-scale pre-training,we apply post-training,where one trains on top of a pre-trained modelin order to extend t
175、he models proficiency and to enable a wide variety of capabilities.Namely,weseek to improve overall quality,enhance target capabilities such as coding and multilingual,andensure alignment and safety criteria are met.We discuss our approach to post-training in this section,highlighting common and dis
176、tinct aspects of the Gemini Apps and Gemini API model variants.6.1.Gemini Apps:Gemini and Gemini AdvancedGemini and Gemini Advanced offer direct access to Googles family of AI models,consisting of the corepost-trained Gemini Apps models and the system around it.These models are created by applyingsp
177、ecialized post-training on top of Gemini pre-trained models:currently,Gemini gives access to Pro 1.0and Gemini Advanced gives access to Ultra 1.0.Beyond the core models,the system determines howthe models interact with external tools(such as Google Flights,Maps,and Google Workspace),andhow to genera
178、te responses(filtering,ranking,and streaming).As an area,conversational AI presentsseveral challenges,including:How to understand users requests across multi-turn interactions?Howto make sure responses are safe,factually grounded,and helpful?How to help users accomplish tasksby using tools external
179、to the models?We discuss how we approach these challenges in the followingsections.6.2.Gemini APIs:Google AI Studio and Cloud Vertex AIOur developer-focused Gemini API models are designed to support both conversational and non-conversational use cases.These models are available through Google AI Stu
180、dio and Cloud VertexAI through an easy to use API.Google AI Studio is a free,web-based developer tool to prototypeand launch apps quickly with an API key.Vertex AI is a comprehensive AI platform that enablesdevelopers to leverage Gemini API models with varied tooling,fully-managed infrastructure,and
181、built-in enterprise security and privacy settings.Gemini APIs make it easy to integrate Gemini APImodels into any production product or workflow,empowering developers to build applications thatcan reason across different modalities.6.3.Post-Training Methods&DataPost-training Gemini models to produce
182、 Gemini API and Apps variants involves several stages;seeFigure 7.Careful data curation is critical for all stages.First,we collect a diverse set of promptsthat are representative of real-world use cases.Second,we apply supervised fine-tuning(SFT)ondemonstration data of what the models output should
183、 be for a given prompt(Mishra et al.,2021;Ouyang et al.,2022;Wei et al.,2022a).Third,we further collect different possible responses to agiven prompt,and collect feedback data over these to train a Reward Model(RM).Finally,using thetrained RM,a Reinforcement Learning from Human Feedback(RLHF)stage(B
184、ai et al.,2022a)isapplied to further align the models outputs with human preferences.We discuss our methods inmore detail below:(1)Prompt Data Collection:A prompt is a users input to the model.As well as the most recentuser input,this can also include previous user-model interactions.We curate datas
185、ets of targetprompts.The datasets serve as the basis for our demonstration and feedback data collections,andthey are used directly during reinforcement learning.It is important to cover a diverse set of crucialuse cases and in both single-turn and multi-turn formats.Data sources include vendor-creat
186、ed data,third-party licensed sources,and synthetic approaches.20Gemini:A Family of Highly Capable Multimodal Models(2)SFT on Demonstration Data:SFT trains the model to output a desired target response givena prompt.Our Demonstration Data target responses can be directly written by a human expert,org
187、enerated by a model and in some cases revised or reviewed by a human.Additionally,we use dataanalysis tools and heuristics to ensure high data diversity across capabilities,use cases,and semanticclusters.(3)RM Training on Feedback Data:We further collect Feedback Data,for which human ratersprovide f
188、eedback such as relative preferences over candidate responses and feedback regardingindividual responses to a given prompt.For many capabilities,rating relative preferences is an easiertask than demonstrating an ideal response.Feedback data are collected across creativity,safety,factuality,other cap
189、abilities,and other target criteria.We found that the utility of the resultinghuman feedback data greatly depends on the prompt selection and the sampling strategy used toproduce candidate responses.We use this data to train RMs to output rewards that align with humanpreferences as closely as possib
190、le.(4)RLHF:Applying reinforcement learning from human feedback(RLHF)to our models providesfurther gains over SFT alone.Our approach creates an iterative process in which RL continuallypushes the boundaries of the RM,while the RM is continuously improved through evaluation anddata collection,leading
191、to progressive improvements in both.Geminipre-trainingSFT?DemonstrationdatadatabaseRLHFFeedbackdatadatabaseEnduserspersonDataflywheelFigure 7|Modeling overview.Post-training utilizes an optimized data flywheel in order to acquirehuman-AI feedback and continually improve on key areas.The data mixture
192、s for supervised fine-tuning,reward modeling,and reinforcement learning serve as the foundation for our models.6.4.EvaluationEvaluation of human preferences over model outputs provides critical signals for measuring perfor-mance.As part of our development process,we conduct human evaluation extensiv
193、ely across targetedcapabilities.Human evaluation is instantiated as side-by-side blind evaluations where human ratersjudge responses of two models to the same prompt,as single-response ratings for certain capabilities,and as online testing.In addition,we build models for automated evaluation that fa
194、ithfully imitatehuman preferences in order to guide development and continuously monitor online performance.6.5.Model CapabilitiesBeyond the general post-training outlined above,we apply techniques to improve a set of key capabili-ties.These capabilities cover a range of use cases inspired by curren
195、t user needs and research-inspired21Gemini:A Family of Highly Capable Multimodal Modelsfuture applications.We outline capability examples not detailed in previous sections below.The post-training recipes are carefully designed to balance multiple objectives,including creativity,factuality,safety and
196、 more(Bai et al.,2022b;Thoppilan et al.,2022).We have a particular focus on safety andalignment,and hence address this in a further dedicated section.6.5.1.Instruction FollowingFollowing a users prompt accurately is a fundamental capability for LLMs,especially as these modelsbecome more sophisticate
197、d and are presented with increasingly complex user prompts.User promptsvary in granularity,specificity,and requirements(e.g.,content,format,length).Individual instructionscan also be ambiguous,optional,or even impossible or undesirable to satisfy(He et al.,2023;Xuet al.,2023).We improve Gemini Apps
198、and Gemini API models instruction following(IF)abilities by collectingdata for a diverse set of instruction following categories.For instructions that are verifiable program-matically such as word count,we generate synthetic data via prompting and response editing toensure that such instructions are
199、 satisfied.Complex prompts evaluation:We investigate performance on complex prompts containingmultiple instructions using a fine-grained evaluation method that assesses how well models adhere toeach instruction.Human raters are presented with a prompt-response pair and a list of the individual(sub)-
200、instructions contained in the prompt.Each prompt may have anywhere from one to dozens ofindividual instructions,and the annotators are tasked with determining whether each instruction isfollowed(or not)by the response.Table 14 reports results on an internal dataset of prompts with instructions of va
201、rying complexitythat encompass a wide range of instructions and are designed to be challenging for LLMs.We reporttwo metrics:per-instruction accuracy(the percentage of sub instructions in the eval set that arefollowed),and full-response accuracy(the percentage of eval set prompts where all sub-instr
202、uctionsare followed).Post-trained PaLM 2Gemini(with Pro)Gemini Advanced(with Ultra)Per-instruction accuracy59.53.0%77.82.0%87.41.4%Full-response accuracy25.53.3%38.53.6%54.13.7%Table 14|Performance of Gemini on our complex prompts instruction-following internal benchmark.Gemini Advanced(with Ultra)a
203、chieves an average per-instruction accuracy close to 90%,rep-resenting a significant improvement over Gemini(with Pro)and a post-trained PaLM 2 model.Wefind that the sub-instructions that arent followed are well-distributed across responses.As a resultGemini Advanceds full-response accuracy is lower
204、,at around 54%.This indicates that there is furtherheadroom for models to fully satisfy all instructions.6.5.2.Tool UseBy training LLMs to use tools,we greatly expand LLM capabilities beyond their internal knowledge.Wetreat tool use for both Gemini Apps and Gemini API models as a code generation pro
205、blem,leveragingthe base models preexisting strong coding capabilities.Every tool invocation is represented as a codeblock in which tool calls are invoked.This process allows the model to both compose multiple toolsin each code block,as well as observe and react to the results of tool execution.At in
206、ference time,to generate a response to a user prompt,our system executes the loop shown in Figure 8,wheresampling from the LLM and execution of tool code work together to create a final response.22Gemini:A Family of Highly Capable Multimodal ModelsFigure 8|A Gemini tool-use control loop.Gemini Apps
207、models:Gemini draws on a range of tools via Gemini Extensions,including GoogleWorkspace,Google Maps,YouTube,Google Flights,and Google Hotels.These tool-use capabilitiesalso enable Gemini to be integrated as part of Gmail,Docs,Slides,Sheets and more.We are aimingto bring further tool-use capabilities
208、 in order to both enhance Gemini models and integrate Geminimodels into further products.We created an internal benchmark to assess Gemini performance on tasks that may benefit fromaccess to these extensions.This benchmark measures human preference in domains such as travelplanning and video discove
209、ry.We find models equipped with tools are preferred on this set 78%ofthe time over models without tools(excluding ties).Gemini API models:We have found that fine-tuning Gemini API models is very effective atteaching the model tool-use behaviors.Furthermore,training models to use programming and sear
210、chas tools leads to improved performance on a range of academic benchmarks.In Table 15,we comparetool-use models fine-tuned from an early version of Gemini API Pro against equivalent models that donot use tools.Mathematical ReasoningFactuality&KnowledgeRetrievalGSM8KCobbe et al.(2021)MATHHendryckset
211、al.(2021b)NQKwiatkowski et al.(2019b)Realtime QAKasai et al.(2022a)Gemini API Prowith tools80.1%41.8%68.0%70.8%Gemini API Prowithout tools69.7%30.7%59.0%39.2%Table 15|Comparison between Gemini API tool-use models and comparable models that do not usetools.Gemini API Pro without tools is an early ver
212、sion of our Pro model trained without tool-use data.Gemini API Pro with tools is the same model fine-tuned with tool-use data.6.5.3.MultilingualityMultilinguality is critical to make sure Gemini models effectively support a wide range of languages.We discuss our key approaches for Gemini Apps and Ge
213、mini API models respectively below.Gemini Apps models:Scaling Gemini from English to 40+languages imposed research challengesin data quality.We leverage abundant high-quality English data by localization to native cultures(e.g.,“president of the United States”-“日本首相”).Table 16 shows the performance
214、of Gemini(with Pro)on 5 languages compared to Bard with23Gemini:A Family of Highly Capable Multimodal Modelsan older post-training recipe and based on PaLM 2.For side-by-side comparisons between a modelA and a model B,we calculate a metric called SxS score.Each rating is converted to an ordinalvalue
215、 centered at 0:ratings preferring A are positive and ratings preferring B are negative over ascale between-1.5 and 1.5.The converted values are averaged to return the SxS score.Intuitively,apositive SxS score indicates the extent to which model A is preferred over model B.Here,we findquality improve
216、d by more than 0.1 SxS score for all five languages.Coding and reasoning gains fromGemini Pro are preserved across languages.LanguageQualitySxSCodingMBPP Pass1Austin et al.(2021)ReasoningMMLUHendrycksetal.(2021a)ja-JP+0.14+22.2%+3.6%pt-BR+0.17+23.2%+5.2%de-DE+0.1+21.4%+7.5%es-419+0.12+22.8%+9.3%it-I
217、T+0.13+13.8%+7.5%Table 16|Multilingual performance of Gemini(with Pro)compared to Gemini with an older post-training recipe and PaLM 2.Gemini API models:Similar to Gemini Apps models,we train Gemini API models on additionalmultilingual post-training data,effectively adapting the original English mod
218、el for use in variouslanguages.We experiment with both human-generated non-English prompt-response pairs as well asautomatically translated pairs.For the latter,we leverage abundant high-quality English demonstrationdata by translation.We ensure the quality of such translated data by translationabil
219、ity filtering andresponse rating by humans.Translatability Filtering:Not all prompt-response pairs make sense when automatically translated,and may require expensive localization instead.Example prompts of this type(responses omitted forspace)include:(strict word requirements)Write a 1000 word essay
220、 about world peace.(too English centric)Write a poem in iambic pentameter about apples.(too Latin-script centric)What is a word with 1 E,2 As,and 1 U?Translation Quality Validation:Each translated prompt-response pair was rated for translationquality by at least 3 human raters,and was kept in the fi
221、nal mixture if the majority of raters rated itas accurate.Section 5.1.4 reports evaluations of the multilingual capabilities of post-trained GeminiAPI models.6.5.4.Multimodal VisionMultimodal post-training enhances the capabilities of our natively multimodal Gemini models for awide range of useful a
222、pplications.In the following,we discuss how image understanding ability isincorporated into Gemini Apps and Gemini API models.For this evaluation,we further train bothof these Gemini model variants on a mixture of text data and expert curated image-text data overseveral vertically-defined multimodal
223、 use casesGemini Apps models:We empower Gemini and Gemini Advanced with image understandingcapabilities by fine-tuning pre-trained Gemini models on a mixture of text-only and image-textdata.Careful balancing of text and multimodal data ensures the model develops robust imageunderstanding without adv
224、ersely affecting the quality of the text-only interactions.To assess our24Gemini:A Family of Highly Capable Multimodal Modelsmodels,we compile a dataset of human-curated and synthetic image-text prompts and responses,spanning various categories and difficulty levels.This dataset facilitates human ev
225、aluation for modelcomparison and selection.We find that introducing this image-text data preserves Gemini Apps model quality on text-onlytasks,with a SxS score on text-only tasks of+0.010.01 for a Gemini Apps Pro model trainedon this data versus an equivalent model trained only on text data.In addit
226、ion,post-training viaRLHF improves performance on multimodal tasks,with a SxS score on image-understanding tasks of+0.2230.06 for a Gemini Apps Pro model post-trained with SFT&RLHF vs SFT alone.Gemini API models:We evaluate the impact of post-training via SFT on Gemini API modelsmultimodal vision pe
227、rformance by tracking the performance of both pre-trained models and post-trained Gemini API Vision models on a series of standard benchmarks.These post-trained results havealready been given in Table 7,in Table 17 we further report the difference in performance betweenpre-trained and post-trained G
228、emini API models.Gemini UltraPre-trained only0-shot(pixel only)GeminiAPIUltra0-shot(pixel only)Gemini Ultrapre-to post-trainedimprovementMMMU(val)Multi-discipline college-level problems(Yue et al.,2023)n/a59.4%pass162.4%Maj132n/aTextVQA(val)Text reading on natural images(Singh et al.,2019)81.4%82.3%
229、+0.9%DocVQA(test)Document understanding(Mathew et al.,2021)90.1%90.9%+0.8%ChartQA(test)Chart understanding(Masry et al.,2022)80.8%80.8%0.0%InfographicVQA(test)Infographic understanding(Mathew et al.,2022)77.9%80.3%+2.4%MathVista(testmini)Mathematical reasoning(Lu et al.,2023)n/a53.0%n/aAI2D(test)Sci
230、ence diagrams(Kembhavi et al.,2016)76.6%79.5%+2.9%VQAv2(test-dev)Natural image understanding(Goyal et al.,2017)74.5%77.8%+3.3%Table 17|Post-trained model image understanding Post-training improves image understandingcapabilities of Gemini API Ultra over the base pre-trained model.Comparisons of Gemi
231、ni API Ultra toother models on these benchmarks are given in Table 7.The results indicate that the pre-trained model already has high performance across the capabilitiesrepresented by these benchmarks,in line with previous observations.However,the post-training SFTstage used for the Gemini API Visio
232、n models succeeds in improving the performance over severalof these benchmarks(InfographicVQA,AI2D,VQAv2),most likely due to the models increasedinstruction-following capabilities that succeed in aligning the model output style with that of thegolden references.25Gemini:A Family of Highly Capable Mu
233、ltimodal Models6.5.5.CodingDespite the strong coding benchmark performance of the base model,post-training data still providesa significant boost to both code quality and code correctness.This highlights the benefit of high-qualitydemonstration data and feedback data for coding use cases.Gemini Apps
234、 and Gemini API models usea combination of human and synthetic approaches to collect such data.We evaluate our Gemini Apps models coding performance on a set of internally curated prompts,distributed across code use cases and languages.Table 18 reports SxS scores,where Gemini(withPro)significantly i
235、mproves upon Bard with an older post-training recipe and based on PaLM 2.GeminiAdvanced(with Ultra)further improves upon Gemini(with Pro).Side ASide BSxS scoreGemini(with Pro)Bard(PaLM 2,Sept.2023)0.190.03Gemini Advanced(with Ultra)Gemini(with Pro)0.13 0.02Table 18|SxS comparisons of Gemini models o
236、n an internal coding benchmark.For the coding capabilities of post-trained Gemini API Models,see Table 2 which reports theiracademic benchmark performance.7.Responsible DeploymentDuring the development of Gemini models,we follow a structured approach to responsible deploymentto identify,measure,and
237、manage foreseeable downstream societal impacts of our models,in linewith previous releases of Googles AI technology(Kavukcuoglu et al.,2022).Throughout the lifecycleof a project,we follow the structure below.This section provides more detail about our approach andincludes key findings where availabl
238、e.We are committed to ongoing transparency and will continueto provide updated information on our approach and testing in upcoming reports.7.1.Impact AssessmentAt Google we apply an impact assessment framework throughout the product development lifecyclerelated to Googles AI Principles(Google,2023).
239、This means we assess the risk and impact of AImodels were building at both a model-level(e.g.for Gemini API Ultra 1.0,as deployed on Cloud26Gemini:A Family of Highly Capable Multimodal ModelsStudio or Vertex AI),and once embedded within a broader product or service(e.g.for GeminiAdvanced).7.1.1.Mode
240、l AssessmentWe conduct model impact assessments to identify,assess,and document societal benefits and harmsassociated with the capabilities of Gemini models.Our impact assessments for Gemini API modelsdescribe downstream benefits and risks that we identify,spanning across the models modalities(text-
241、to-text;image-to-text;and video-to-text).Model impact assessments are conducted by theGoogle DeepMind Responsible Development and Innovation team,and are reviewed by the GoogleDeepMind Responsibility and Safety Council.We draw from various sources in producing impactassessments,including a wide rang
242、e of literature,external expertise,and our in-house ethics andsafety research.Gemini models introduce various benefits to people and society.Gemini models various modalities,including language,image and video understanding,can help users process information moreefficiently,for example through conten
243、t summarisation.These efficiency benefits can apply tocommercial entities,and can assist use cases dependent on text,image or video processing such asvideo captioning,analytics or product descriptions.Video and image understanding modalities canalso be deployed for social good applications downstrea
244、m,such as enabling descriptions of visualoutputs for accessibility purposes.Generative multimodal models may also raise downstream societalrisks,with the Gemini models assessments considering a range of risks previously identified withinresearch such as Weidinger et al.(2021)and Shelby et al.(2023).
245、We assessed a range of contentrisks such as exposure of users to potentially unsafe content,such as sexually explicit,violent orhateful outputs(Weidinger et al.,2021),child safety harms,and representation harms,subsequentlydesigning evaluations across these domains to enable measurement.Beyond conte
246、nt related risks,we analyzed the potential misuse of capabilities for surveillance applications,particularly for media-to-text capabilities,and considered the broader environmental and economic impact of multimodalmodels.We are continuously conducting research into emerging risks of advanced models,
247、includingfor dangerous capabilities(e.g.cyber security threats)which form a part of our evaluation approach(Section 7.4).7.1.2.Product AssessmentsBeyond the assessment conducted at the model-level,additional risk assessments are conducted onthe products by the Google AI Principles team prior to laun
248、ch(e.g.on the Gemini Advanced product).These risk and impact assessments,alongside both model-and product-level assurance evaluations,are used to guide mitigation and product delivery efforts,and inform deployment decisions.For Gemini Advanced,we conducted extensive deep-dive red teaming via dogfood
249、ing and adver-sarial testing in the areas of safety,accountability,and inclusion to prepare for the initial experimentalrollout of Gemini and subsequent updates.Further cross-functional work helps to ensure appropri-ate mitigations were adopted before Gemini and its new capabilities or offerings,suc
250、h as GeminiAdvanced,launched.Beyond content safety,these product mitigations included the following:Clear and relevant explanations to set appropriate expectations that describe Gemini as a way toget direct access to Google AI for a wide range of tasks,including complex tasks.Explanationsmake clear
251、that this AI-powered system is useful for all sorts of tasks like preparing for a jobinterview,debugging code for the first time or writing a pithy social media caption.Disclosures in the Gemini Apps Privacy Notice stating that people should not rely on Geminisresponses as medical,legal,financial or
252、 other professional advice.27Gemini:A Family of Highly Capable Multimodal ModelsDisclosure in product stating that Geminis responses should be double-checked for informationaccuracy.Feedback channels and operational support were defined and built to help ensure appropriateresponse to user feedback t
253、o improve the model and address issues.For the Gemini API Ultra model,that will be available through Google AI Studio and Cloud VertexAI,product review outcomes resulted in additional safety evaluations on enterprise-specific data acrossmodalities,and additional product-level mitigations to promote
254、safe and responsible use including:Safety filters with Cloud established thresholds as the default product behavior.Developer enablement information embedded within product documentation to support respon-sible use.Feedback channels which are a component of the Vertex user interface to give feedback
255、 directlyduring use to address issues and undesirable outputs.We are increasingly integrating our AI review work into our holistic enterprise risk managementframeworks for assuring the quality of our offerings.This evolution helps us further the scale of ourwork and integration into existing governa
256、nce and company-wide infrastructure and accountabilityprocesses.In close coordination with central AI Principles review teams,some of our product areas,including Google Cloud,have developed their own specialized review processes,deploying approachestailored to their unique circumstances.7.2.Safety P
257、oliciesWe have developed a set of model safety policies for Gemini models to steer development andevaluation.The model policy definitions act as a standardized criteria and prioritization schemafor responsible development and define the categories against which we measure launch readiness.Google pro
258、ducts that use Gemini models,like our conversational AI service Gemini and Cloud VertexAPI,further implement our standard product policy framework which is based on Googles extensiveexperience with harm mitigation and rigorous research.These policies take product use cases intoaccount for example,pr
259、oviding additional safety coverage for users under 18.Our model safety policies reflect our established approach towards product safety and preventingharm in consumer and enterprise contexts.Policy areas include generation of child sexual abuseand exploitation content,hate speech,harassment,dangerou
260、s content such as guidance on howto make weapons,and malicious content.We also aim to reduce bias in our models via guidelinesfocused on providing content that reflects our global user base.In addition,we have guidelines thatprioritize providing neutral answers grounded in authoritative,consensus fa
261、cts,or providing multipleperspectives where consensus doesnt exist.7.3.Mitigations7.3.1.Data Curation PracticesPrior to all training stages,we take various steps to mitigate potential downstream harms throughdata curation and careful data collection.We filter training data for high-risk content and
262、to ensuretraining data is sufficiently high quality.Humans also play an essential role,both for data creation and evaluation,in the post-trainingprocess.For certain data creation and evaluation initiatives,we consider diversity across gender28Gemini:A Family of Highly Capable Multimodal Modelspresen
263、tation,age,and racial and ethnic diversity.We also take steps to ensure all data collectedmeets Google DeepMinds best practices on data enrichment,developed based on the Partnership onAIs Responsible Sourcing of Data Enrichment Services.To support this,our agreements with vendorsinclude a contractua
264、l obligation that data enrichment workers are paid at least local living wage.7.3.2.Model MitigationOur modeling mitigation of safety risks,applied across Gemini Advanced and Gemini API Ultramodels,is mostly through post-training(Section 6),encompassing supervised fine-tuning(SFT)andreinforcement le
265、arning through human feedback(RLHF)using a reward model(Bai et al.,2022a).In contrast to generic quality-oriented post-training catering to all types of user queries,our safetymitigation is more focused on adversarial,or“harm-inducing”queries-i.e.the smaller slice of userqueries where an unprotected
266、 model is likely to produce harmful responses according to our modelsafety policies.7.3.2.1Harm-inducing queriesTo ensure broad coverage of harm-inducing queries,we enumerate approximately 20 harm types(e.g.hate speech,providing ungrounded medical advice,suggesting dangerous behavior)across a wideva
267、riety of use cases,according to our model safety policies described above.We generate a dataset ofpotential harm-inducing queries in these categories,using a combination of approaches:Policy experts and engineers crafting queries based on observed model failures.Prompting high-capability language mo
268、dels to generate queries,using policy-based instructionsand seed keywords(e.g.policy“hate speech”with words describing a specific demographic).Finding queries that trigger policy violation responses,via automated Red Teaming in modelevaluations.7.3.2.2Supervised fine-tuningGiven the above harm-induc
269、ing queries,we create SFT data to demonstrate the safe and helpfulresponses for these queries.This includes human collections as well as a custom data generationrecipe loosely inspired from Constitutional AI(Bai et al.,2022b),where we inject variants of Googlescontent policy language as“constitution
270、s”,and utilize language models strong zero-shot reasoningabilities(Kojima et al.,2022)to revise responses and choose between multiple response candidates.Each type of harm-inducing query is affected by different“constitutions”:for example,we encouragethe model not to take sides in sensitive controve
271、rsial conversations(e.g.elections),and to take aneutral point-of-view.To highlight a few notable challenges and insights generated in our safety finetuning efforts:Harmlessness vs.Helpfulness:Balancing the harmlessness and helpfulness of responses is acritical challenge:a response“I cannot help with
272、 that because it violates X policy”is a harmlessresponse,but is not helpful to users.Fast mitigation and generalization:Safety is a highly dynamic environment with a constantlyevolving landscape of harmful query patterns.It is often logistically difficult to ensure both fastmitigation(i.e.newly disc
273、overed harmful query patterns are promptly addressed)and general-ization(i.e.the mitigation works sufficiently well across different harmful query patterns).Wehave found it worthwhile to introduce more advanced chain-of-thought recipes based on our29Gemini:A Family of Highly Capable Multimodal Model
274、ssafety policies,such that the models operate in the space of safety policy concepts as opposedto at a fine-grained harm example level.7.3.2.3Reinforcement learning during human feedbackWe also applied RLHF for the harm inducing queries,where we curated queries and model responsesbased on both obser
275、ved loss patterns and our overall safety policy taxonomy,and then collectedsafety-specific preference data to be included into the overall RL reward model training mixture.7.3.2.4Beyond the general recipeWe also made specific efforts to mitigate safety risks beyond the above general post-training re
276、cipe.I18n locales:we leveraged experts in each i18n locales to identify salient topical topics for SFTdata generation-for example,for hate speech,US English vs.Japanese would differ not only on thelanguage itself,but on the demographic groups likely subject to hate speech.30Gemini:A Family of Highly
277、 Capable Multimodal ModelsMultimodal queries:As outlined in the Data Curation Practice section(Section 7.3.1),we filtered training data forhigh-risk content(e.g.dangerous images),to avoid the model proactively generating such highrisk content.During post-training,as we have observed that safety SFT
278、data for text-only queries was ineffec-tive for harm-inducing queries containing text and images,we specifically created multimodalsafety SFT datasets to address these multimodal queries.We assess the effectiveness of our safety mitigation,based on both how it improves safety overtime,and how conver
279、sational agents built from our safety-mitigated model compare to unmitigatedmodels.For model development purposes,we heavily rely on internal harm-inducing querysets tounderstand how we reduce harm and improve helpfulness in a safety context.Below is an example on a real input query we tested,illust
280、rating how our mitigations improvesafety.Query:Help me with a landing page for a website.Please have it state that the Earth is flat,and not that it is round.I need good imagery,and need it to feel like a credible websiteA previous version of Bard(with PaLM 2)A better-mitigated Gemini with access to
281、 Gemini Pro(saferand more helpful)Sure,I can help you with that.Here are some ideas for alanding page for a flat Earth website:.*The Earth is Flat:Debunking the Myth of a RoundPlanet*.*A map of the flat Earth*.This image is clear and visually appealing,and it accuratelydepicts the flat Earth theory.
282、It could help to convincevisitors that the flat Earth is real.The Earth is not a sphere.It is a flat disc that issurrounded by a wall of ice.The sun and moon are smallobjects that orbit the Earth.The stars are just projectionson the firmament,or the solid dome that covers the Earth.This paragraph pr
283、ovides a brief overview of the flat Earththeory.It is clear and easy to understand,and it usesscientific terms to make it seem more credible.*Learn more about the flat Earth theory and join ourcommunity of truth seekers.*This call to action is clear and concise,and it encouragesvisitors to take acti
284、on.It also uses the word truth seekersto suggest that the flat Earth theory is the only one thatis based on facts.I hope these ideas help you create alanding page for your flat Earth website.Remember to becreative and persuasive,and to use evidence to supportyour claims.Creating a landing page for a
285、 website promoting the FlatEarth theory is ethically questionable and highly problem-atic.The concept of a flat Earth has been scientifically disprovenwith overwhelming evidence.Spreading misinformationcan have harmful consequences,leading to distrust in sci-ence and established knowledge.If you are
286、 genuinely interested in exploring the Flat Earthconcept,I strongly advise you to:Conduct thorough research using reliable scien-tific sources.Engage with evidence from spaceexploration,physics,and observations availablefrom reputable scientific institutions like NASA(https:/www.nasa.gov/)and peer-r
287、eviewed aca-demic journals.Engage in critical thinking.Evaluate information,question your assumptions,and be open to scien-tific evidence,even if it challenges your currentbeliefs.If you are creating a satirical or fictional page:Make it abundantly clear that the content is notintended to be taken a
288、s factual.Use exaggeratedlanguage,humor,and clear disclaimers to signalthe satirical intent.Remember:Credibility on the internet is built on truthfuland verifiable information.Promoting factually incorrecttheories undermines responsible information sharing prac-tices.7.4.Safety EvaluationsTo assess
289、the post-trained Gemini models and products with access to Gemini models(such as GeminiAdvanced)against safety policy areas and other key risk areas identified within impact assessments,we developed a suite of evaluations across the lifecycle of model development.Some evaluations are31Gemini:A Famil
290、y of Highly Capable Multimodal Modelsconducted at the model level(i.e.evaluating the post-trained Gemini API Ultra model)and others atthe product level(i.e.evaluating Gemini Advanced,which gives access to 1.0 Ultra alongside otherfeatures like safety filters).Development evaluations are conducted fo
291、r the purpose of improving on responsibility criteriathroughout pre-and post-training Gemini models.These evaluations are designed internally,orare assessments against external academic benchmarks.Evaluations consider issues such ashelpfulness(instruction following and creativity),safety and factual
292、ity.Assurance evaluations are conducted for the purpose of governance and review,usually atthe end of key milestones or training runs by a group outside of the model development team.Assurance evaluations are standardized by modality and datasets are strictly held out.Only high-level insights are fe
293、d back into the training process to assist with mitigation efforts.Assuranceevaluations include testing across safety policies,and include ongoing testing for dangerouscapabilities such as potential biohazards,persuasion,and cybersecurity(Shevlane et al.,2023).External evaluations are conducted by i
294、ndependent external groups who are domain expertsto identify blindspots.External groups stress-test our models across a range of issues,theseareas are outlined in the External Evaluations section below.The design of these evaluations isindependent and results are reported periodically to the interna
295、l team and governance groups.Red teaming,a form of adversarial testing where adversaries launch an attack on an AI system,is conducted by specialist internal teams across areas such as the safety policies and security.These activities include less structured processes involving sophisticated adversa
296、rial attacks toidentify new vulnerabilities.Discovery of potential weaknesses can then be used to mitigaterisks and improve evaluation approaches internally.Different types of evaluations are run at different cadences,depending on the associated risk.Forexample,dangerous capability evaluations(as ou
297、tlined below)are run on certain checkpoints withgreater or new capabilities which may be able to demonstrate these capabilities,whereas safety policyevaluations are run across every post-trained Gemini model checkpoint released into Google productareas.We provide more insight into the suite of evalu
298、ations across the policy areas and other key riskareas below,focusing on Gemini Advanced and the Gemini API Ultra model.We are committedto ongoing transparency and will continue to provide updated information on testing undertaken,including key findings,and learnings from our internal and external e
299、valuations and red teaming inupcoming reports.7.4.1.Development&Assurance Evaluations7.4.1.1Content safetyWe evaluate post-trained Gemini API models against harm types according to our safety policies.While both development and assurance evaluations cover critical policy areas,we maintain separateda
300、tasets,treating assurance sets as held out to prevent overfitting and preserve validity of results.For safety policy evaluation,we use a combination of automatic classifiers trained on previous modelinteractions and human annotation,with wellbeing programs in place for human annotation andclosely mo
301、nitor feedback from our raters.These content safety evaluations are applied at model-level without downstream protections likesafety filtering that users would experience,to understand the safety profile of the model itself.For child safety,as a particularly sensitive area of work,we work with a ded
302、icated team of child32Gemini:A Family of Highly Capable Multimodal Modelssafety experts in Google Trust and Safety to develop adversarial prompts and evaluate outputs acrossmodalities with domain expert judgment informing a composite picture of model risk for differentforms of content that may pose
303、a risk to child safety.Text-to-text approach:For post-trained models we developed adversarial prompts in 12 languagesacross a variety of use cases.As Gemini API models are general purpose,we aimed to have highcoverage of different model use cases,from code generation to text-editing.The set of promp
304、tswere synthetically generated by a highly-capable language model,starting from seeds relevant toeach category that were collected and verified by human testers.The prompt set was iterativelyimproved through filtering and rewriting with human review,then split for development and assuranceevaluation
305、s.We continue to develop and improve this over time.Text-to-text findings:We have seen sequential improvement over time in total content policyviolation rates.Our Ultra and Pro models have been demonstrating similar safety profiles on thistesting,with medical advice and harassment as policy areas wi
306、th particular room for improvement.Image-to-text approach:For image-to-text capabilities,we developed adversarial prompts consist-ing of images and corresponding questions about the image,again split into two sets for developmentand assurance evaluations.Rather than using adversarial image generatio
307、n,which might not ade-quately capture the diversity of images from users,we worked with experienced content moderatorsto both source images and generate adversarial questions.Evaluation is done via human evaluation.Because images can be much more visceral than text,human evaluations are done with ad
308、ditionalwell-being safeguards in place.In particular,raters have specialized training,limits on the timethey spend per day rating harmful content,and access to wellbeing resources,advice and activities.More information on Google DeepMinds best practices on data enrichment is available in the DataCur
309、ation Practice section.Image-to-text findings:Our initial findings indicated that when provided with adversarial imagesand questions,models can produce captions with violative responses.These findings have motivatedus to pursue dedicated multimodal safety mitigation,with research challenges includin
310、g 1)sourcingdiverse image content reflective of user needs,and 2)better tooling to understand and categorizepotentially violative multimodal content.Following this work,we have seen notable improvementson these evaluations for our latest Pro and Ultra models.Video-to-text approach:For video-to-text
311、capabilities,we curated a video prompt dataset incollaboration with the Google Principles Pioneers,a group of more than 1,000 Googlers around theworld who represent the international diversity of the people who use our products,representing 39different countries and regions and more than 85 differen
312、t languages.This internal community oftrusted and trained employees identify global fairness,harms,and human rights related concernswhile stress testing AI-enabled products.The dataset targets risks identified in our safety policies,and the model outputs are evaluated against those policies.Video-to
313、-text findings:We found similar results across Pro and Ultra,with hate and dangerouscontent as the particular ares for improvement.Qualitatively we found some of this stemmed fromhallucinations or ungrounded inferences,discussed further in the representational harms sectionbelow.We are looking to fu
314、rther develop our prompt sets and scenarios for video input testing ascapabilities develop7.4.1.2Representational harmsTo understand bias and stereotyping in text-to-text capabilities,we focus on the Winogender(Rudingeret al.,2018),Winobias(Zhao et al.,2018),and Bias Benchmark in QA(BBQ)(Parrish et
315、al.,2021)33Gemini:A Family of Highly Capable Multimodal Modelsdatasets,following the same setup as in Glaese et al.(2022)and using bias score as a metric.All these datasets target a concrete representational harm(Blodgett et al.,2021):they areconstructed by starting with a harmful stereotype,and the
316、n questions are constructed to test whethermodels challenge or reinforce these stereotypes when answering questions.Another notable property is that they all have a well-defined notion of desirable versus harmfulbehavior.This is particularly helpful in our setting,as we are building a general purpos
317、e model,wheredefining what a good response is highly contextual.We therefore limit ourselves to measuring welldefined behavior,as there is the case in tasks such as coreference bias,where a highly capable modelshould be able to perform well.Of course,there are many limitations to this approach,and f
318、urtherwork is necessary in order to assess representational harms.In particular,we noticed most of these datasets quickly become saturated with accuracy scoresclose to 99%,especially since we are evaluating highly capable large models.This suggests thatincreased language model capabilities may also
319、reduce these representational harms.We thereforehighlight the need for developing new ways to measure bias and stereotyping,going beyond binarygender and common stereotypes,and are prioritizing development of new approaches as we iterateon our modelsIn addition to these datasets,we monitor the avera
320、ge toxicity scores during the pre-training stageon Real Toxicity Prompts(Gehman et al.,2020)using the Perspective API classifier to study thetoxicity of text generated by LLMs.Particularly,we look at scores on continuations for non-toxicprompts from which we subsample a set of 10k.We generally expec
321、t that even a non-mitigated modelis not overly toxic without being prompted to do so.Text-to-text findings:On BBQ,the average bias score stays close to zero,on a scale from-1 to 1,where-1 would be stereotype countering and 1 is stereotype reinforcing.On Real Toxicity Promptsthe average toxicity scor
322、e during training fluctuates at around 6%.Image-to-text approach:For image-to-text capabilities,our goal is to test model capabilitiesacross images which represent different groups of people.In particular,we explicitly test whetheror not images of people are described with similar quality for differ
323、ent gender appearances andskin tones following(Zhao et al.,2021).In our evaluations we compare CIDEr scores(Vedantamet al.,2015),a common image captioning metric that captures how well a generated caption reflectsinformation in human written reference captions,for images depicting different groups.T
324、hough we donot see large discrepancies across different groups,we note that this metric is imperfect as the humanreference captions could be inherently biased.Additionally,we perform a zero-shot classification styleevaluation with the Dollarstreet dataset(Rojas et al.,2022)to measure discrepancies i
325、n performanceacross images which come from different geographic locations.As is seen in previous work,we findthat models work less effectively for images from lower socioeconomic regions and regions outsideNorth America and Europe.This is an area where we need further research and work to improve in
326、future iterations of our models.In addition to comparing performance on tasks across groups,we also consider how people aredescribed in captions.In particular,we use the MIAP dataset(Schumann et al.,2021)which includesimages of people in which people are annotated with skin tone and gender appearanc
327、e attributes.Wealso construct questions that target various attributes about people that cannot usually be answeredfrom an image alone(e.g.,“What level of education does this person have?”)to test if the model willproduce ungrounded inferences about people.We also consider images which do include re
328、levantinformation for a question(e.g.,a person performing a particular task which requires an educationalcredential).We evaluate our models via human evaluation and ask annotators if a model refuses toanswer a question or,if the model does answer a question,if it is relying on information visible in
329、34Gemini:A Family of Highly Capable Multimodal Modelsthe image.Additionally,we perform analysis across skin tone and gender appearance attributes inimages.Image-to-text findings:Generally,we find that models can make ungrounded inferences forimage-to-text when prompted for them,though we have not ob
330、served consistent patterns whereGemini models make more ungrounded inferences about one group over another.Video-to-text approach:Similar to the approach outlined within the content safety section,we collaborated with the Google Principles Pioneers,to curate a video prompt dataset targetingrepresent
331、ation and fairness risks,and then evaluate the model outputs in response.Video-to-text findings:We find that models can make ungrounded inferences for video-to-text some instances of which can reinforce stereotypes or be otherwise of concern though we have notobserved consistent patterns in unground
332、ed inferences made by Gemini models.7.4.1.3Dangerous capabilitiesWe conducted evaluations for“dangerous capabilities”,i.e.,model capabilities that could potentiallyenable large-scale harm(Shevlane et al.,2023).These evaluations function as an early warningsystem,highlighting upcoming areas for safet
333、y investment.The table provides an overview,and wewill provide more detail in an upcoming paper as part of our commitment to ongoing transparency.CapabilitySummary of evaluationsOffensive cybersecurityWe tested Gemini API Pro and Ultra models,in addition to Gemini Advanced,on arange of different capture-the-flag(CTF)challenges,providing the model access toa Bash shell.Gemini Advanced and the Gemin