ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Team GLM*
Zhipu AI / Tsinghua University

Abstract

We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models, trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models are pre-trained on ten trillion tokens, mostly in Chinese and English, along with a small corpus covering 24 languages, and are aligned primarily for Chinese and English usage. The high-quality alignment is achieved via a multi-stage post-training process, which involves supervised fine-tuning and learning from human feedback. Evaluations show that GLM-4: 1) closely rivals or outperforms GPT-4 in terms of general metrics such as MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval; 2) gets close to GPT-4-Turbo in instruction following as measured by IFEval; 3) matches GPT-4 Turbo (128K) and Claude 3 for long context tasks; and 4) outperforms GPT-4 in Chinese alignment as measured by AlignBench. The GLM-4 All Tools model is further aligned to understand user intent and autonomously decide when and which tool(s) to use, including web browser, Python interpreter, text-to-image model, and user-defined functions, to effectively complete complex tasks. In practical applications, it matches and even surpasses GPT-4 All Tools in tasks like accessing online information via web browsing and solving math problems using the Python interpreter. Over the course, we have open-sourced a series of models, including ChatGLM-6B (three generations), GLM-4-9B (128K, 1M), GLM-4V-9B, WebGLM, and CodeGeeX, attracting over 10 million downloads on Hugging Face in the year 2023 alone. The open models can be accessed through https:// and https://huggingface.co/THUDM.

*Team GLM: Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Jingyu Sun, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, Zihan Wang. Team members are listed alphabetically by first name.

Preprint. Under review. arXiv:2406.12793v2 [cs.CL] 30 Jul 2024
[Figure 1 omitted: a timeline of the GLM family of models, from GLM and GLM-10B (2021) and GLM-130B, mGLM-1B, and CodeGeeX-13B (2022), through the ChatGLM, ChatGLM2, and ChatGLM3 series along with VisualGLM, CogVLM, CharacterGLM, and WebGLM (2023), to GLM-4, GLM-4-Air, GLM-4-9B, GLM-4V-9B, CogVLM2, and GLM-4 All Tools (2024), grouped into APIs, open LLMs, open VLMs, and agent models.]

Figure 1: Timeline of the GLM family of language, code, vision, and agent models. The focus of this report is primarily on the language models, i.e., ChatGLM. The APIs are publicly available at https:// and open models can be accessed through https://.

1 Introduction

The rapid development of large language models (LLMs) has been phenomenal [57]. Take one of the most successful model series, the OpenAI GPT models, as an example:
the original GPT-3 model released in 2020 [3] marked a significant scale-up, from GPT-1's 117 million parameters and GPT-2's 1.5 billion parameters to 175 billion parameters. This scale-up equips the decoder-only, transformer-based GPT-3 model with in-context learning and generalized capabilities. According to OpenAI, the GPT-3.5 series improved upon GPT-3 by incorporating instruction tuning, supervised fine-tuning (SFT), and/or reinforcement learning from human feedback (RLHF) [29]. This has now become a standard procedure for creating performant LLMs, including the PaLM models [6], the LLaMA models [41], the Gemini models [40], and many more.

In a parallel line to the popularly adopted LLM development practices, we proposed the General Language Model (GLM) architecture [11], featuring the autoregressive blank infilling objective, and open-sourced the GLM-10B model in 2021 (see the GLM timeline in Figure 1). Starting in late 2021, we began pre-training GLM-130B [53]. The goal was to train a 100B-scale model to match or surpass GPT-3 (davinci) while also verifying the techniques for successfully training models at this scale, along with other contemporary efforts such as OPT-175B [54] and BLOOM-176B [33]. We completed the 400B-token training and evaluation of GLM-130B in July, and subsequently released the model and pre-training details [53] in August 2022. According to HELM in November 2022, GLM-130B matches GPT-3 (davinci) across various dimensions [20].

Following this, we initiated instruction tuning on GLM-130B. Later, ChatGPT further motivated us to align the base models with SFT and RLHF. We created and crafted the prompt-response pairs from scratch and performed SFT, while also starting to examine how to effectively apply RLHF. On March 14, 2023, the aligned model, ChatGLM-130B, went live on https://.
[Figure 2 shows GLM-4 All Tools handling the user query "Search for the global population from 2000 to 2023, then calculate the average annual growth rate." The model first searches the web for "global population from 2000 to 2023" (top results include the World Population Clock at worldometers.info and a World Population 1950-2024 page), reports that the global population grew from about 6.15 billion in 2000 to about 8.05 billion in 2023, and then generates and executes Python code:

    # Starting and ending population values
    starting_population = 6.15e9  # 2000 population
    ending_population = 8.05e9    # 2023 population
    # Number of years between 2000 and 2023
    years = 2023 - 2000
    # CAGR formula
    cagr = (ending_population / starting_population) ** (1 / years) - 1
    cagr * 100  # convert to percentage

The execution result is 1.17739919480071, and the model answers that the average annual growth rate of the global population from 2000 to 2023 was approximately 1.18%.]

Figure 2: An Illustrative Example of GLM-4 All Tools.

In addition, a smaller version, ChatGLM-6B [13], was open-sourced on the same day, attracting significantly more attention than anticipated.
It was designed to have 6.2 billion parameters for 1) facilitating fast iteration of pre- and post-training techniques as well as data selection, and 2) enabling local deployment on consumer-grade graphics cards using INT4 quantization. Since then, we have been rapidly exploring and refining our pre-training and alignment techniques, leading to the second and third generations of the ChatGLM series, each arriving roughly three months after the last, both pre-trained entirely from scratch.

ChatGLM-6B was pre-trained on approximately one trillion tokens of Chinese and English corpus with a context length of 2,048 (2K), supplemented mostly by SFT. Released in June, ChatGLM2-6B was pre-trained and aligned with more high-quality data, leading to substantial improvements over its predecessor, including a 23% improvement on MMLU, 571% on GSM8K, and 60% on BBH. By adopting the FlashAttention technique [8], its context length was extended to 32K. Additionally, the integration of Multi-Query Attention [35] contributed to a 42% increase in inference speed. Taking this further, our second-generation code model CodeGeeX2-6B was developed by pre-training on an additional 600 billion code tokens. It demonstrated Pass@1 improvements over the initial generation, CodeGeeX-13B [58], with increases of 57% in Python, 71% in C++, 54% in Java, 83% in JavaScript, and 56% in Go, as measured by HumanEval-X. For character-based dialogues, CharacterGLM [61] allows effective and safe character customization on LLMs. By further adopting more diverse training datasets, more sufficient training steps, and more optimized training strategies, ChatGLM3-6B topped 42 benchmarks across semantics, mathematics, reasoning, code, and knowledge. Starting from this generation, ChatGLM also supports function call and code interpreter, as well as complex agent tasks [22; 52; 18]. In the course of these developments, we also developed models with 1.5B, 3B, 12B, 32B, 66B, and 130B parameters, allowing us to validate our observations and establish our own scaling laws.
With all the lessons learned and experiences accumulated, we kicked off the training of GLM-4. The first cutoff checkpoint then underwent a multi-stage post-training process (e.g., SFT, RLHF, safety alignment), with a focus on the Chinese and English languages for now. Subsequently, it was developed into two distinct versions: GLM-4 and GLM-4 All Tools, both supporting a 128K context length. Since January 16, 2024, GLM-4 (0116) has been made available through the GLM-4 API at https://, and GLM-4 All Tools is accessible via the website https:// and mobile applications that support the creation of one's own agent GLMs. The latest models are GLM-4 (0520) and GLM-4-Air (0605), with upgrades to both pre-training and alignment. GLM-4-Air achieves comparable performance to GLM-4 (0116) with lower latency and inference cost.

Evaluations of GLM-4 were performed on a variety of language benchmarks. These evaluations assess GLM-4's general abilities in English, instruction following in both English and Chinese, and alignment, long-context, and agent capacities in Chinese.
Table 1: Performance of open ChatGLM-6B, ChatGLM2-6B, ChatGLM3-6B, and GLM-4-9B.

Language  Dataset        ChatGLM-6B    ChatGLM2-6B   ChatGLM3-6B-Base  GLM-4-9B
                         (2023-03-14)  (2023-06-25)  (2023-10-27)      (2024-06-05)
English   GSM8K          1.5           25.9          72.3              84.0
          MATH           3.1           6.9           25.7              30.4
          BBH            0.0           29.2          66.1              76.3
          MMLU           25.2          45.2          61.4              74.7
          GPQA           -             -             26.8              34.3
          HumanEval      0.0           9.8           58.5              70.1
          BoolQ          51.8          79.0          87.9              89.6
          CommonSenseQA  20.5          65.4          86.5              90.7
          HellaSwag      30.4          57.0          79.7              82.6
          PIQA           65.7          69.6          80.1              79.1
          DROP           3.9           25.6          70.9              77.2
Chinese   C-Eval         23.7          51.7          69.0              77.1
          CMMLU          25.3          50.0          67.5              75.1
          GAOKAO-Bench   26.8          46.4          67.3              74.5
          C3             35.1          58.6          73.9              77.2
First, on the most commonly used English academic benchmarks (MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval), GLM-4 (0520) achieves performance closely comparable to that of GPT-4 (0613) [28] and Gemini 1.5 Pro [40]. For example, it scores 83.3 vs. 86.4 and 83.7 on MMLU, respectively. Second, according to IFEval [62], GLM-4's instruction following capacities at both the prompt and instruction levels are approximately as effective as GPT-4-Turbo's in both English and Chinese. Third, in terms of Chinese language alignment, GLM-4 outperforms GPT-4 and matches GPT-4-Turbo across eight dimensions in AlignBench [23]. Finally, for long-context tasks, the GLM-4 (128K) model matches the performance of GPT-4 Turbo and Claude 3 Opus as measured by LongBench-Chat [1], i.e., 87.3 vs. 87.2 and 87.7, respectively.

The GLM-4 All Tools model is specifically aligned to better understand user intent and autonomously select the most appropriate tool(s) for task completion. For example, it can access online information via a web browser in a multi-round manner, use the Python interpreter to solve math problems, leverage a text-to-image model to generate images, and call user-defined functions. Figure 2 illustrates an example of GLM-4 All Tools using a web browser and Python interpreter to address the user query "Search for the global population from 2000 to 2023, then calculate the average annual growth rate". Our first-hand test shows that it not only matches but often surpasses the capabilities of GPT-4 All Tools for common tasks.
Following our three generations of open ChatGLM-6B models, we also openly released the GLM-4-9B (128K and 1M context length) model. GLM-4-9B is pre-trained on approximately ten trillion tokens of multilingual corpus with a context length of 8,192 (8K) and post-trained with the same pipeline and data used for GLM-4 (0520). With less training compute, it outperforms Llama-3-8B [26] and supports all the functionality of All Tools in GLM-4. We also provide an experimental model, GLM-4-9B-Chat-1M, with a 1 million (1M) context length (about 2 million Chinese characters). Table 1 shows the performance of the three generations of ChatGLM-6B models and GLM-4-9B, illustrating the progressive improvements of ChatGLM over time. Figure 3 summarizes the major improvements and features from GLM-130B to GLM-4 All Tools.

Throughout this journey, we have also contributed to the open development of code LLMs (CodeGeeX [58]) as well as visual language models for image understanding (CogVLM [45] and CogAgent [16]) and text-to-image generation (CogView [9; 10; 59]). The open models and data can be accessed via https:// and https://huggingface.co/THUDM.
[Figure 3 omitted: the progression from GLM-130B (MMLU: 44.8%) to ChatGLM (aligned with human intent), ChatGLM2 (MMLU: 66.6%; better architecture with 32K context), ChatGLM3 (MMLU: 71.0%; native agent and function call capabilities), GLM-4 (MMLU: 83.3%; powerful agents with 128K context), and GLM-4 All Tools (experimental 1M context length and vision).]

Figure 3: From GLM-130B to ChatGLM to ChatGLM2/3 to GLM-4 All Tools.

2 ChatGLM Techniques

In this section, we introduce both the pre- and post-training techniques we adopted and developed in ChatGLM, including the model architecture, pre-training data, alignment, and All Tools. We have detailed technical reports introducing each of the major techniques we used to reach GLM-4.
Pre-Training Data. Our pre-training corpus consists of multilingual (mostly English and Chinese) documents from a mixture of different sources, including webpages, Wikipedia, books, code, and research papers. The data processing pipeline mainly includes three stages: deduplication, filtering, and tokenization. The deduplication stage improves data diversity by removing duplicated or similar documents, with both exact and fuzzy deduplication. The filtering stage for webpages improves data quality by removing noisy documents that contain offensive language, placeholder text, source code, etc. The tokenization stage converts text into a sequence of tokens for further processing. The number of tokens in the pre-training data directly affects model training speed. To optimize this aspect, we employ the byte-level byte pair encoding (BPE) algorithm [34] to separately learn the Chinese and multilingual tokens and merge them with the tokens of the cl100k_base tokenizer in tiktoken [27] into a unified vocabulary with a size of 150,000. In the final training set, we re-weight different sources to increase the importance of high-quality and educational sources like books and Wikipedia. To this end, the pre-training corpus consists of around ten trillion tokens.
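As a minimal sketch of this merged-vocabulary idea (following tiktoken's documented pattern for constructing a derived Encoding; the added tokens, ranks, and encoding name below are illustrative assumptions, not our actual vocabulary):

    import tiktoken

    # Start from the cl100k_base merge table and append separately learned
    # (e.g., Chinese) byte-level BPE tokens with ranks above the base ones.
    cl100k = tiktoken.get_encoding("cl100k_base")
    extra_tokens = ["你好".encode("utf-8"), "世界".encode("utf-8")]  # illustrative
    extra_ranks = {tok: len(cl100k._mergeable_ranks) + i
                   for i, tok in enumerate(extra_tokens)}
    # In a real vocabulary, a BPE trainer guarantees that every new token is
    # reachable by pairwise merges of existing tokens (built bottom-up).
    unified = tiktoken.Encoding(
        name="glm_unified_sketch",
        pat_str=cl100k._pat_str,  # reuse the base pre-tokenization regex
        mergeable_ranks={**cl100k._mergeable_ranks, **extra_ranks},
        special_tokens=dict(cl100k._special_tokens),
    )
    print(len(unified._mergeable_ranks), unified.encode("你好, world"))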
Throughout the four generations of ChatGLM development, our findings align with existing studies [60]: data quality and diversity are crucial for building effective LLMs. Despite the empirical lessons and insights gained, we have yet to identify a fundamental principle that could guide the processes of data collection, cleaning, and selection, which might inspire future research directions.

Architecture. The GLM family of LLMs is built on the Transformer [43]. In GLM-130B [53], we explored various options to stabilize its pre-training, taking into account the hardware constraints we faced at the time. Specifically, GLM-130B leveraged DeepNorm [44] as the layer normalization strategy and used Rotary Positional Encoding (RoPE) [38] as well as the Gated Linear Unit [36] with the GeLU [15] activation function in FFNs. Throughout our exploration, we have investigated different strategies to enhance model performance and inference efficiency. The recent GLM-4 model adopts the following architecture design choices (a minimal sketch follows the list):

- No Bias Except QKV: To increase training speed, we removed all bias terms, with the exception of the biases in the Query, Key, and Value (QKV) projections of the attention layers. In doing so, we observed a slight improvement in length extrapolation.
- RMSNorm and SwiGLU: We adopted RMSNorm and SwiGLU to replace LayerNorm and ReLU, respectively. These two strategies brought better model performance.
- Rotary Positional Embeddings (RoPE): We extended RoPE to a two-dimensional form to accommodate the 2D positional encoding in GLM.
- Group Query Attention (GQA): We replaced Multi-Head Attention (MHA) with Group Query Attention (GQA) to cut down the KV cache size during inference. Given that GQA uses fewer parameters than MHA, we increased the FFN parameter count to maintain the same model size, i.e., setting d_ffn to 10/3 of the hidden size.
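The sketch below combines these four choices in PyTorch. The layer sizes (hidden size 4096, 32 query heads, 2 KV heads) and the norm epsilon are illustrative assumptions, and the RoPE rotation is omitted for brevity; this is not the actual GLM-4 implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RMSNorm(nn.Module):
        def __init__(self, dim, eps=1e-5):
            super().__init__()
            self.weight = nn.Parameter(torch.ones(dim))
            self.eps = eps

        def forward(self, x):
            return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    class SwiGLU(nn.Module):
        def __init__(self, dim, ffn_dim):
            super().__init__()
            self.gate = nn.Linear(dim, ffn_dim, bias=False)  # no bias outside QKV
            self.up = nn.Linear(dim, ffn_dim, bias=False)
            self.down = nn.Linear(ffn_dim, dim, bias=False)

        def forward(self, x):
            return self.down(F.silu(self.gate(x)) * self.up(x))

    class GLMStyleBlock(nn.Module):
        def __init__(self, dim=4096, n_heads=32, n_kv_heads=2):
            super().__init__()
            self.n_heads, self.n_kv = n_heads, n_kv_heads
            self.head_dim = dim // n_heads
            # Bias is kept only on the Q, K, V projections.
            self.q = nn.Linear(dim, n_heads * self.head_dim, bias=True)
            self.kv = nn.Linear(dim, 2 * n_kv_heads * self.head_dim, bias=True)
            self.o = nn.Linear(n_heads * self.head_dim, dim, bias=False)
            self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)
            # d_ffn = 10/3 of the hidden size, compensating for GQA's
            # smaller KV projections.
            self.ffn = SwiGLU(dim, int(dim * 10 / 3))

        def forward(self, x):
            b, t, d = x.shape
            h = self.attn_norm(x)
            q = self.q(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
            k, v = self.kv(h).chunk(2, dim=-1)
            k = k.view(b, t, self.n_kv, self.head_dim).transpose(1, 2)
            v = v.view(b, t, self.n_kv, self.head_dim).transpose(1, 2)
            # GQA: each group of query heads shares one KV head,
            # shrinking the KV cache during inference.
            k = k.repeat_interleave(self.n_heads // self.n_kv, dim=1)
            v = v.repeat_interleave(self.n_heads // self.n_kv, dim=1)
            attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            x = x + self.o(attn.transpose(1, 2).reshape(b, t, d))
            return x + self.ffn(self.ffn_norm(x))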
The context length of our models was extended from 2K (ChatGLM) to 32K (ChatGLM2 and ChatGLM3), and then to 128K and 1M (GLM-4). These expansions were achieved not only through context extension (position encoding extension [31; 5] and continual training [47] on long text) but also through long context alignment, enabling GLM-4 to effectively handle very long contexts (cf. [1] for technical details).
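One common form of position encoding extension is linear position interpolation in the spirit of [5]: positions are rescaled so that a longer input still falls inside the position range seen during pre-training, before continual training on long text. A minimal sketch (the function name, RoPE base, and lengths are illustrative assumptions):

    import torch

    def interpolated_rope_angles(seq_len, head_dim, train_len, target_len, base=10000.0):
        # Shrink positions by train_len / target_len so position target_len - 1
        # maps back into the trained range [0, train_len).
        positions = torch.arange(seq_len, dtype=torch.float32) * (train_len / target_len)
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
        return torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2) rotation angles

    # Extend an 8K-trained model toward a 128K window: the same code path,
    # only with rescaled positions.
    angles = interpolated_rope_angles(131072, head_dim=128, train_len=8192, target_len=131072)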
Alignment. Pre-training builds the foundation of LLMs, while post-training [29] further refines these models to align with human preferences, such as understanding human intents, following instructions, and facilitating multi-turn dialogues. For GLM-4, the alignment is mostly achieved with supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) [17]. In SFT, we find that authentic human prompts and interactions, instead of template-based or model-generated responses, are vital to the alignment quality. While SFT largely aligns the base models with human preferences, RLHF can further help mitigate issues of response rejection, safety, mixture of bilingual tokens generated, and multi-turn coherence, among others.

For the first generation of our models (ChatGLM-6B and ChatGLM-130B), the prompt-response pairs were mostly annotated by the model developers. For later models, the alignment data is a combination of in-house annotation and proprietary data acquired from third parties, subject to strict quality control measures. Similar to existing practices [42], annotators are instructed to score model responses across several dimensions, including safety, factuality, relevance, helpfulness, and human preferences.
ChatGLM Techniques. Throughout the development of ChatGLM, we have introduced and will publish techniques that are used to enhance its performance.

- Emergent Abilities of LLMs [12]: We examined the relationship between pre-training loss and performance on downstream tasks and found that, with the same pre-training loss, LLMs of different model sizes and training tokens generate the same downstream performance. We also found that on some tasks (such as MMLU and GSM8K), performance improves beyond random chance only when the pre-training loss falls below a certain threshold. We thus redefine emergent abilities as those exhibited by models with lower pre-training losses [12].
- LongAlign [1]: To extend LLMs' context window size, we proposed LongAlign, a comprehensive recipe for long context alignment. It enables GLM-4 to process long context texts (up to 128K tokens) with performance comparable to that of Claude 2 and GPT-4 Turbo (1106).
- ChatGLM-Math [48]: To improve math problem solving in LLMs, we introduced ChatGLM-Math, which leverages self-critique rather than external models or manual annotations for data selection.
- ChatGLM-RLHF [17]: To align LLMs with human feedback, we introduced ChatGLM-RLHF, our practices of applying PPO and DPO to LLMs.
- Self-Contrast [24]: To avoid the need for expensive human preference feedback data, we developed a feedback-free alignment strategy, Self-Contrast. It utilizes the target LLM to self-generate massive negative samples for its RLHF alignment.
- AgentTuning [52]: To improve LLMs' agent capabilities, we developed the AgentTuning framework with the AgentInstruct instruction-tuning dataset, which includes high-quality interaction trajectories between agents and environments.
- APAR [21]: To improve the inference speed of LLMs for responses with hierarchical structures, we presented an auto-parallel auto-regressive (APAR) generation approach. It leverages instruct tuning to train LLMs to plan their (parallel) generation process and execute APAR generation.
- Benchmarks: We also developed several open LLM benchmarks, including AgentBench [25] for evaluating LLMs as agents, LongBench [2] for evaluating the long context handling performance of LLMs, AlignBench [23] to measure the alignment quality of ChatGLM with Chinese language content, HumanEval-X [58] to evaluate HumanEval [4] problems in programming languages beyond Python, as well as NaturalCodeBench (NCB) to measure models' capacities to solve practical programming tasks.
GLM-4 All Tools. The latest ChatGLM models are GLM-4 and GLM-4 All Tools, both of which were trained and aligned using the techniques above. GLM-4 All Tools is a model version further aligned to support intelligent agents and related tasks. It is trained to autonomously understand user intent, plan complex instructions, and call one or multiple tools (e.g., web browser, Python interpreter, and the text-to-image model) to complete complex tasks. Figure 4 presents the overall pipeline of the GLM-4 All Tools system.

[Figure 4 omitted: a user request flows to GLM-4 (All Tools), which plans and analyzes the task with a 128K context, recursively executes tool calls (Python, CogView, web browser, external knowledge, function call), and incorporates feedback, memory, and execution results; customized GLMs build on the same pipeline.]

Figure 4: The overall pipeline of GLM-4 All Tools and customized GLMs (agents).

When a user issues a complex request, the model analyzes the task and plans the problem-solving process step by step. If it determines that it cannot complete the task independently, it sequentially calls one or multiple external tools, utilizing their intermediate feedback and results to help solve the task (a minimal sketch of this loop follows).
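The following is a minimal sketch of such a plan-execute-feedback loop. The message format, tool registry, and model interface are illustrative assumptions, not the actual All Tools implementation:

    import json

    TOOLS = {  # toy stand-ins for the Python interpreter and web browser
        "python": lambda code: repr(eval(code)),
        "web_browser": lambda query: f"top search results for {query!r}",
    }

    def run_all_tools(user_request, chat_model, max_steps=8):
        """chat_model(messages) returns either {"answer": ...} when it can finish
        independently, or {"tool_call": {"name": ..., "arguments": ...}}."""
        messages = [{"role": "user", "content": user_request}]
        for _ in range(max_steps):
            step = chat_model(messages)
            if "answer" in step:          # the model completed the task itself
                return step["answer"]
            call = step["tool_call"]      # otherwise call the chosen tool ...
            result = TOOLS[call["name"]](call["arguments"])
            messages.append({             # ... and feed the result back as context
                "role": "tool",
                "content": json.dumps({"name": call["name"], "result": result}),
            })
        return "stopped: tool-call budget exhausted"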
Built on GLM-4's all-tools capabilities, we also developed the GLMs application platform that allows users to create and customize their own agents for specific tasks. The GLMs support not only the embedded Python interpreter, web browser, and text-to-image model, but also user-defined functions, APIs, and external knowledge bases to more effectively address user needs.

3 GLM-4 Capabilities
We examine the capabilities of the GLM-4 model from diverse perspectives, including the base capacity on academic benchmarks, code problem-solving, and agent abilities in English, instruction following and long context for both Chinese and English, as well as alignment in Chinese. As mentioned, GLM-4 was pre-trained mostly in Chinese and English and aligned predominantly to Chinese. In this section, we report results primarily for the latest GLM-4 versions, i.e., GLM-4 (0520) and GLM-4-Air (0605), as GLM-4 (0520) is slightly better than its original 0116 version across the evaluated benchmarks. During evaluation, both GLM-4 and GLM-4-Air are deployed with BFloat16 precision. For baselines, we present results for GPT-4 (0613), GPT-4 Turbo (1106, 2024-04-09), Claude 2, Claude 3 Opus, and Gemini 1.5 Pro, all of which were extracted from the corresponding technical reports or tested through their public APIs.

Overall, GLM-4 gets close to the state-of-the-art models (GPT-4-Turbo, Gemini 1.5 Pro, and Claude 3 Opus) on the standard benchmarks, as well as in instruction following, long context, code problem-solving, and agent abilities in English environments. For Chinese alignment, it generates strong performance against SOTA models across various domains, such as fundamental language ability, advanced Chinese understanding, professional knowledge, and open-ended question answering. In summary, GLM-4 is among the best for Chinese language tasks. It also demonstrates comparable performance to GPT-4 and Claude 3 Opus in Chinese math and logic reasoning capabilities, though it lags behind GPT-4 Turbo.
3.1 Evaluation of Academic Benchmarks

To evaluate the general performance of the base model, we select six commonly used benchmarks spanning knowledge, math, reasoning, commonsense, and coding:

- MMLU [14]: Multi-choice questions collected from various examinations, including mathematics, history, computer science, and more. We present all answers to the model and ask it to choose the letter of the answer.
- GSM8K [7]: 8,500 grade school math word problems (1,000 in the test set) that require the model to solve real-life situational problems using mathematical concepts. We use chain-of-thought prompting [46] for this benchmark (see the sketch after this list).
- MATH: 12,500 challenging competition-level mathematics problems (5,000 in the test set). We use chain-of-thought prompting [46] for this benchmark.
- BBH [39]: A suite of 23 challenging BIG-Bench [37] tasks. We use chain-of-thought prompting [46] for this benchmark.
- GPQA [32]: A graduate-level multi-choice benchmark in biology, chemistry, and physics.
- HumanEval [4]: A coding benchmark that measures the correctness of synthesized functions with automatic test-case checking.
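In chain-of-thought prompting, each few-shot exemplar pairs a question with a worked rationale that ends in a final answer, and the graded answer is parsed from the model's completion. A minimal sketch (the exemplar wording and the extraction regex are illustrative assumptions, not the exact evaluation prompts):

    import re

    FEW_SHOT = """Q: A farm has 3 pens with 4 pigs each. 2 pigs are sold. How many pigs remain?
    A: There are 3 * 4 = 12 pigs. After selling 2, 12 - 2 = 10 remain. The answer is 10.

    Q: {question}
    A:"""

    def extract_answer(completion):
        # Grade by the last "The answer is N" pattern in the sampled completion.
        matches = re.findall(r"The answer is (-?[\d,.]+)", completion)
        return matches[-1].replace(",", "") if matches else None

    prompt = FEW_SHOT.format(question="Janet has 16 eggs and uses 3. How many are left?")
    # `prompt` is sent to the model; extract_answer() parses its completion.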
We compare the performance of GLM-4 with the original GPT-4 [28]. The results are shown in Table 2. We can observe that GLM-4 achieves 96.3% of GPT-4's accuracy on MMLU and outperforms GPT-4 on the other benchmarks. Overall, the base capacity of GLM-4 approaches that of GPT-4-Turbo and Claude 3 Opus.

Table 2: GLM-4 performance on academic benchmarks.

Model                      MMLU   GSM8K  MATH   BBH    GPQA   HumanEval
GPT-4 (0314)               86.4   92.0   52.9   83.1   35.7   67.0
GPT-4 Turbo (1106)         84.7   95.7   64.3   88.3   42.5   83.7
GPT-4 Turbo (2024-04-09)   86.7   95.6   73.4   88.2   49.3   88.2
Claude 3 Opus              86.8   95.0   60.1   86.8   50.4   84.9
Gemini 1.5 Pro             85.9   90.8   67.7   89.2   46.2   84.1
GLM-4-9B-Chat              72.4   79.6   50.6   76.3   28.8   71.8
GLM-4-Air (0605)           81.9   90.9   57.9   80.4   38.4   75.7
GLM-4 (0116)               81.5   87.6   47.9   82.3   35.7   72.0
GLM-4 (0520)               83.3   93.3   61.3   84.7   39.9   78.5
3.2 Evaluation of Instruction Following

We assess the proficiency of GLM-4 in following instructions with the recently introduced IFEval dataset [62]. The dataset comprises 541 prompts derived from 25 distinct instructions that are verifiable through explicit criteria (e.g., "end your email with: P.S. I do like the cake" can be verified via string matching). We adhere to the methodologies outlined in [62] to calculate prompt-level and instruction-level accuracy in both strict mode and loose mode. To further evaluate model performance on following instructions in Chinese, we translate the original prompts into Chinese, omit instructions that are not applicable in Chinese (such as capitalization), and adjust the scoring scripts to accommodate Chinese data.
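Such criteria make scoring fully programmatic: each instruction maps to a check function, and instruction-level accuracy is the share of individual checks that pass. A minimal sketch (the check registry and accuracy helper are illustrative assumptions, not the official IFEval scripts):

    CHECKS = {  # instruction id -> verifier over the model response
        "end_with_ps": lambda r: r.rstrip().endswith("P.S. I do like the cake"),
        "no_commas": lambda r: "," not in r,
        "min_300_words": lambda r: len(r.split()) >= 300,
    }

    def instruction_level_accuracy(samples):
        """samples: list of (response, instruction_ids) pairs."""
        followed = total = 0
        for response, instruction_ids in samples:
            for inst in instruction_ids:
                total += 1
                followed += bool(CHECKS[inst](response))
        return followed / max(total, 1)

    print(instruction_level_accuracy([
        ("Sure, see you! P.S. I do like the cake", ["end_with_ps", "no_commas"]),
    ]))  # 0.5: the ending matches, but the response contains a comma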
Table 3: GLM-4 performance on IFEval [62], an LLM instruction following benchmark. L stands for Loose and S for Strict; P stands for Prompt and I for Instruction.

                           English                  Chinese
Model                      L-P   S-P   L-I   S-I    L-P   S-P   L-I   S-I
GPT-4 (0613)               79.5  77.1  85.5  83.7   72.4  68.9  80.0  75.7
GPT-4 Turbo (1106)         79.1  75.4  85.1  82.4   74.3  69.1  80.8  76.5
GPT-4 Turbo (2024-04-09)   84.5  81.2  88.7  85.9   79.3  72.6  84.2  79.1
Claude 2                   75.0  58.0  81.7  67.7   57.1  46.5  64.9  55.1
Claude 3 Opus              90.6  85.5  93.7  90.0   78.3  73.3  84.3  80.4
GLM-4-9B-Chat              73.0  69.0  80.3  77.2   73.0  69.0  80.3  77.2
GLM-4-Air (0605)           80.4  75.2  86.1  82.3   79.3  71.2  84.0  77.3
GLM-4 (0520)               83.7  79.1  88.7  85.0   79.7  71.9  84.2  78.0

In loose mode, GLM-4 matches the instruction-level accuracy achieved by GPT-4 Turbo in both English and Chinese. In strict mode, GLM-4 achieves 99.0% and 98.6% of the instruction-level accuracy of GPT-4 Turbo (2024-04-09) in English and Chinese, respectively.
3.3 Evaluation of Alignment

AlignBench [23] provides an automatic LLMs-as-judge method to benchmark the alignment of LLMs in Chinese contexts. It consists of 683 queries spanning 8 different categories and evaluates model responses using a GPT-4-based, multi-dimensional, rule-calibrated, pointwise, reference-based scoring method. We evaluate on AlignBench-v1.1, which more carefully improves the reference generation quality, especially by complementing human-collected evidence from webpages (with URLs) for knowledge-related questions, which take up 66.5% of the total queries. On this version, almost all LLMs achieve lower scores than they do on the previous AlignBench.
Table 4: GLM-4 performance on AlignBench [23], an LLM benchmark for alignment in Chinese.

Model                      Math  Logic  Language  Chinese  QA    Writing  Role Play  Professional  Overall
GPT-4 (0613)               7.54  7.17   7.82      7.02     7.39  7.67     8.20       7.29          7.46
GPT-4 Turbo (1106)         7.85  7.66   7.90      7.22     8.24  8.53     8.46       7.95          7.90
GPT-4 Turbo (2024-04-09)   8.32  7.67   7.60      7.57     8.37  7.75     8.18       8.59          8.00
Claude 2                   6.39  5.85   6.75      5.72     6.68  5.87     6.86       6.56          6.26
Claude 3 Opus              7.27  7.11   7.94      7.71     8.21  7.61     7.73       8.02          7.53
Gemini 1.5 Pro             7.07  7.77   7.31      7.22     8.55  7.83     7.79       8.52          7.47
GLM-4-9B-Chat              7.00  6.01   6.69      7.26     7.97  7.59     8.10       7.52          7.01
GLM-4-Air (0605)           7.69  6.95   7.53      8.00     7.90  8.01     8.35       8.09          7.65
GLM-4 (0116)               7.20  7.20   7.60      8.19     8.45  7.88     8.05       8.56          7.66
GLM-4 (0520)               7.89  7.95   8.00      7.86     8.11  8.04     8.06       8.47          8.00

Results are shown in Table 4. GLM-4 outperforms GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro in general, achieving the highest overall score among the baselines. Especially on the Chinese Logic Reasoning and Language Understanding tasks, GLM-4 significantly outperforms all other powerful models. These results demonstrate its strong grasp of the Chinese language and knowledge. The current performance gap between GLM-4 and GPT-4 Turbo (2024-04-09) mostly lies in the Mathematics dimension. We have been employing techniques introduced in ChatGLM-Math [48], such as self-critique, to continuously enhance the GLM models' math reasoning capabilities.
3.4 Evaluation of Long Context Handling Abilities

To assess the performance of GLM-4 on long text tasks, we carry out evaluations on LongBench-Chat [1], a benchmark set with context lengths ranging from 10K to 100K, encompassing a wide range of long text scenarios frequently utilized by users, such as document Q&A, summarization, and coding. To provide a more detailed comparison of GLM-4's performance across languages, we also segregate LongBench-Chat by language into two distinct portions, Chinese and English, and report the results for both segments separately, offering a fine-grained overview of GLM-4's cross-linguistic capabilities.

Regarding the specific evaluation settings, we score the outputs of each model with GPT-4, adopting a few-shot strategy within LongBench-Chat. Moreover, to minimize score variations and reach a more reliable statistical conclusion, we repeated the evaluations multiple times and report the average in Table 5, ensuring that the final performance metric reflects how GLM-4 behaves under diverse conditions. The results clearly suggest that the performance of GLM-4 aligns with that of GPT-4 Turbo and Claude 3 Opus on English prompts, and that it outperforms the best of them on Chinese prompts.
Table 5: GLM-4 performance on LongBench-Chat [1].

Model                      English  Chinese
GPT-4 Turbo (1106)         87.2     71.4
GPT-4 Turbo (2024-04-09)   85.0     82.1
Claude 2                   81.3     76.2
Claude 3 Opus              87.7     82.7
GLM-4-9B-Chat              76.8     79.0
GLM-4-Air (0605)           82.4     81.0
GLM-4 (0520)               87.3     84.0
3.5 Evaluation of Coding Abilities on Real-world User Prompts

While HumanEval [4] has been widely adopted for evaluating LLMs' code generation, most of its problems are about introductory algorithms. In practice, however, users ask complicated questions to complete their daily work, whose difficulty is usually far beyond the scope of HumanEval. Additionally, previous works have reported HumanEval-contaminated training data [28; 19; 50] in their own or other LLMs, making results on HumanEval less trustworthy than before. As a result, besides HumanEval we evaluate GLM-4 on NaturalCodeBench (NCB) [55], a challenging bilingual coding benchmark derived from real user prompts to mirror the complexity of real-world coding tasks. As shown in Table 6, GLM-4's coding performance is close to that of Claude 3 Opus in practical scenarios. While there are still some gaps to the GPT-4 models, considering GLM-4's bilingually balanced nature, there is considerable potential to improve its performance on NCB via better training strategies and data curation in our following iterations.
Table 6: GLM-4 performance on NaturalCodeBench (NCB) [55], a benchmark with real coding prompts in two programming languages (Python and Java) for English and Chinese.

Model                      Python (en)  Java (en)  Python (zh)  Java (zh)  Overall
GPT-4 (0613)               55.7         51.1       53.4         51.1       52.8
GPT-4 Turbo (1106)         51.9         55.0       47.3         51.9       51.5
GPT-4 Turbo (2024-04-09)   57.5         52.3       53.1         52.3       53.8
Claude 2                   34.4         36.6       33.6         32.8       34.4
Claude 3 Opus              48.9         48.9       45.0         50.4       48.3
Gemini 1.5 Pro             45.0         39.7       41.5         43.1       42.3
GLM-4-9B-Chat              33.9         29.8       30.8         34.4       32.2
GLM-4-Air (0605)           40.8         39.7       43.1         39.7       40.8
GLM-4 (0520)               51.6         42.8       45.4         48.9       47.1
3.6 Evaluation of Function Call

To evaluate the performance of GLM models on function call, we carry out evaluations on the Berkeley Function Call Leaderboard [49], a benchmark with 2k question-function-answer pairs. The benchmark evaluates models' ability to call functions in three categories: evaluation by Abstract Syntax Tree (AST), evaluation by executing APIs, and relevance detection. The first category compares the model's output functions against function documents and possible answers with AST analysis. The second category checks for response correctness by executing the generated function calls. Relevance detection evaluates the model's capacity to recognize functions that are not suitable to address the user's question.
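A minimal sketch of the first category, matching an emitted call against a function document and a set of acceptable argument values (the schema and matcher are illustrative assumptions, not the leaderboard's code):

    import ast

    def ast_match(model_output, expected_name, allowed_args):
        """allowed_args: {param: set of acceptable values}; True when the call
        uses the right function and every argument takes a permitted value."""
        call = ast.parse(model_output, mode="eval").body
        if not isinstance(call, ast.Call) or getattr(call.func, "id", None) != expected_name:
            return False
        passed = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
        return all(passed.get(p) in vals for p, vals in allowed_args.items())

    print(ast_match('get_weather(city="Beijing", unit="celsius")',
                    "get_weather",
                    {"city": {"Beijing"}, "unit": {"celsius", "fahrenheit"}}))  # True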
The results are shown in Table 7. We can observe that the function-call capability of GLM-4 (0520) aligns with that of GPT-4 Turbo (2024-04-09), while GLM-4-9B-Chat significantly outperforms Llama-3-8B-Instruct. Another observation is that the overall accuracy does not improve with model size; GLM-4-9B-Chat can even outperform GLM-4-Air. On the other hand, we observe that the performance on execution summary, which evaluates the execution results of real-world APIs, improves smoothly with model size.

Table 7: GLM performance on the Berkeley Function Call Leaderboard.

Model                      AST Summary  Exec Summary  Relevance  Overall
Llama-3-8B-Instruct        59.25        70.01         45.83      58.88
GPT-4 Turbo (2024-04-09)   82.14        78.61         88.75      81.24
GPT-4o (2024-05-13)        85.23        80.37         81.25      82.94
ChatGLM3-6B                62.18        69.78         5.42       57.88
GLM-4-9B-Chat              80.26        84.40         87.92      81.00
GLM-4-Air (0605)           84.34        85.93         68.33      80.94
GLM-4 (0520)               82.59        87.78         84.17      81.76
3.7 Evaluation of Agent Abilities

It is widely observed that LLMs are capable of serving as intelligent agents in versatile environments and contexts [30; 51], known as LLMs-as-Agents [25]. As a result, we evaluate GLM-4 together with other comparison LLMs on AgentBench [25], a comprehensive agentic benchmark for text-based LLMs across an array of practical environments, including code-based, game-based, and web-based contexts. Specifically, we evaluate on 7 out of 8 AgentBench environments, excluding Digital Card Game, which is too time-consuming to interact with. Overall scores are calculated using the original per-dataset weights provided in AgentBench [25].

Table 8: GLM-4 performance on AgentBench [25]. (OS = Operating System, DB = Database, KG = Knowledge Graph, LTP = Lateral Thinking Puzzles, HH = House-Holding, WS = Web Shopping, WB = Web Browsing.)

Model                      OS    DB    KG    LTP   HH    WS    WB    Overall
GPT-4 (0613)               42.4  32.0  58.8  16.6  78.0  61.1  29.0  3.69
GPT-4 Turbo (1106)         40.3  52.7  54.0  17.7  70.0  52.8  30.0  3.77
GPT-4 Turbo (2024-04-09)   41.0  46.7  53.2  19.4  72.0  55.1  19.0  3.68
Claude 2                   18.1  27.3  41.3  8.4   54.0  61.4  0.0   2.03
Claude 3 Opus              23.6  55.0  53.4  20.0  70.0  48.5  28.0  3.62
GLM-4-Air (0605)           31.9  51.0  53.8  12.3  78.0  69.2  30.0  3.58
GLM-4 (0520)               36.8  52.7  51.4  15.3  82.0  68.3  29.0  3.79

The results are presented in Table 8. As it shows, the GLM-4 models present quite impressive performance on agent tasks, with GLM-4-Air comparable to, and GLM-4 outperforming, GPT-4 Turbo and Claude 3 Opus. In terms of specific environments, we find that the GLM-4 series performs especially well on Database, House-Holding, and Web Shopping tasks, while still demonstrating a gap to the GPT-4 series on Operating System, Knowledge Graph, and Lateral Thinking Puzzles. The gap suggests that there is still room for GLM-4 to improve its performance on code-related agentic tasks and highly interactive language tasks.
3.8 Evaluation of All Tools

GLM-4 is further aligned to support intelligent agents and user-configured GLMs functionalities on https://, and the resultant model is GLM-4 All Tools. As mentioned, GLM-4 All Tools can complete complex tasks by autonomously understanding user intent, planning step-by-step instructions, and calling multiple tools, including the web browser, Python interpreter, and the text-to-image model (e.g., CogView3 [59]). Table 9 shows that GLM-4 All Tools (Web) achieves performance similar to that of ChatGPT-4 (Web) when using the Python interpreter for solving math problems and the browser for information seeking.

Table 9: Performance of GLM-4 All Tools.

Tool                Task                 GLM-4 All Tools  GPT-4
                                         (Web, 0116)      (Web, 0110)
Python Interpreter  GSM8K                91.59            92.72
Python Interpreter  MATH                 63.60            65.00
Python Interpreter  Math23K              88.50            88.40
Web Browser         Information Seeking  78.08            67.12
4 Safety and Risks

We are committed to ensuring that GLM-4 operates as a safe, responsible, and unbiased model. In addition to addressing common ethical and fairness concerns, we carefully assess and mitigate potential harms that the model may pose to users in real-world scenarios.

Table 10: GLM-4 performance on SafetyBench [56], compared to GPT-4 models and Claude 3 Opus.

Model                      Ethics &  Illegal     Mental  Offen-    Physical  Privacy &  Unfairness  Overall
                           Morality  Activities  Health  siveness  Health    Property   & Bias
GPT-4 (0613)               92.7      93.3        93.0    87.7      96.7      91.3       73.3        89.7
GPT-4 Turbo (1106)         91.0      92.0        93.0    86.0      92.0      88.7       74.3        88.1
GPT-4 Turbo (2024-04-09)   90.3      91.3        91.7    85.3      92.0      89.3       75.0        87.9
Claude 3 Opus              92.7      91.7        92.7    86.3      94.7      88.7       66.0        87.5
GLM-4 (0520)               92.3      91.3        93.3    86.3      92.3      88.6       66.0        87.2
Risk Mitigation. We carefully cleaned data in the pre-training stage by removing text containing sensitive keywords and web pages from a pre-defined blacklist. In the alignment phase, we evaluate each training sample for safety and remove any that pose potential risks. Harmlessness is also an important criterion for preference alignment when comparing multiple model outputs.

We have a red team that constantly challenges the model with tricky questions that tend to cause unsafe answers. We collect all harmful question-answer pairs from GLM-4 and improve them with human annotations for further model alignment.

Safety Evaluation. We evaluate the GLM-4 model on SafetyBench [56], which assesses each model along 7 dimensions: Ethics and Morality (unethical behaviors), Illegal Activities (basic knowledge of law), Mental Health (adverse impacts on mental health), Offensiveness (offensive behaviors), Physical Health (dangerous behaviors that can cause physical harm), Privacy and Property (privacy breach or property loss), and Unfairness and Bias. We evaluate different models on the Chinese subset of SafetyBench, which is created by removing highly sensitive questions that tend to be censored, to mitigate interference from different API safety policies.

Table 10 shows the safety results of GLM-4 and SOTA models. On most dimensions GLM-4 (0520) shows competitive safety performance; overall, it achieves performance comparable to Claude 3 Opus. GLM-4 slightly falls behind the GPT-4 family, especially on the Physical Health dimension, which demands robust commonsense knowledge about the physical world to avoid potential risks. More efforts have been put into this direction to develop a more capable and safe GLM model.
5 Conclusion

In this report, we introduce the ChatGLM family of large language models, from GLM-130B to GLM-4 (All Tools). Over the past one and a half years, we have made great progress in understanding various perspectives of large language models from our first-hand experiences. With the development of each model generation, the team has learned and applied more effective and efficient strategies for both model pre-training and alignment. The recent ChatGLM models, GLM-4 (0116, 0520), GLM-4-Air (0605), and GLM-4 All Tools, demonstrate significant advancements in understanding and executing complex tasks by autonomously employing external tools and functions. These GLM-4 models have achieved performance on par with, and in some cases surpassing, state-of-the-art models such as GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro, particularly in handling tasks relevant to the Chinese language. In addition, we are committed to promoting the accessibility and safety of LLMs through openly releasing our model weights and the techniques developed throughout this journey. Our open models, including language, code, and vision models, have attracted over 10 million downloads on Hugging Face in the year 2023 alone. Currently, we are working on more capable models with everything we have learned to date. In the future, we will continue democratizing cutting-edge LLM technologies through open sourcing, and push the boundary of model capabilities towards the mission of teaching machines to think like humans.

Acknowledgement. We would like to thank all the data annotators, infra operating staff, collaborators, and partners, as well as everyone at Zhipu AI and Tsinghua University not explicitly mentioned in the report, who have provided support, feedback, and contributions to ChatGLM. We would also like to thank Yuxuan Zhang and Wei Jia from Zhipu AI, as well as the teams at Hugging Face, ModelScope, WiseModel, and others, for their help with the open-sourcing efforts of the GLM family of models.

References
133、 inthe report who have provided support,feedback,and contributed to ChatGLM.We would also like tothank Yuxuan Zhang and Wei Jia from Zhipu AI as well as the teams at Hugging Face,ModelScope,WiseModel,and others for their help on the open-sourcing efforts of the GLM family of models.References1Y.Bai,
134、X.Lv,J.Zhang,Y.He,J.Qi,L.Hou,J.Tang,Y.Dong,and J.Li.Longalign:A recipefor long context alignment of large language models,2024.2Y.Bai,X.Lv,J.Zhang,H.Lyu,J.Tang,Z.Huang,Z.Du,X.Liu,A.Zeng,L.Hou,Y.Dong,J.Tang,and J.Li.Longbench:A bilingual,multitask benchmark for long context understanding,2023.3T.B.Br
135、own,B.Mann,N.Ryder,M.Subbiah,J.Kaplan,P.Dhariwal,A.Neelakantan,P.Shyam,G.Sastry,A.Askell,S.Agarwal,A.Herbert-Voss,G.Krueger,T.Henighan,R.Child,A.Ramesh,D.M.Ziegler,J.Wu,C.Winter,C.Hesse,M.Chen,E.Sigler,M.Litwin,S.Gray,B.Chess,J.Clark,C.Berner,S.McCandlish,A.Radford,I.Sutskever,and D.Amodei.Lan-guage
136、 models are few-shot learners.In Proceedings of the 34th International Conferenceon Neural Information Processing Systems,NIPS20,Red Hook,NY,USA,2020.CurranAssociates Inc.4M.Chen,J.Tworek,H.Jun,Q.Yuan,H.P.de Oliveira Pinto,J.Kaplan,H.Edwards,Y.Burda,N.Joseph,G.Brockman,A.Ray,R.Puri,G.Krueger,M.Petro
137、v,H.Khlaaf,G.Sastry,P.Mishkin,B.Chan,S.Gray,N.Ryder,M.Pavlov,A.Power,L.Kaiser,M.Bavarian,C.Winter,P.Tillet,F.P.Such,D.Cummings,M.Plappert,F.Chantzis,E.Barnes,A.Herbert-Voss,W.H.Guss,A.Nichol,A.Paino,N.Tezak,J.Tang,I.Babuschkin,S.Balaji,S.Jain,W.Saunders,C.Hesse,A.N.Carr,J.Leike,J.Achiam,V.Misra,E.Mo
138、rikawa,A.Radford,M.Knight,M.Brundage,M.Murati,K.Mayer,P.Welinder,B.McGrew,D.Amodei,S.McCandlish,I.Sutskever,and W.Zaremba.Evaluating large language models trained on code.CoRR,abs/2107.03374,2021.5S.Chen,S.Wong,L.Chen,and Y.Tian.Extending context window of large language modelsvia positional interpo
139、lation.arXiv preprint arXiv:2306.15595,2023.6A.Chowdhery,S.Narang,J.Devlin,M.Bosma,G.Mishra,A.Roberts,P.Barham,H.W.Chung,C.Sutton,S.Gehrmann,et al.Palm:Scaling language modeling with pathways.arXivpreprint arXiv:2204.02311,2022.7K.Cobbe,V.Kosaraju,M.Bavarian,M.Chen,H.Jun,L.Kaiser,M.Plappert,J.Tworek
140、,J.Hilton,R.Nakano,C.Hesse,and J.Schulman.Training verifi ers to solve math wordproblems.CoRR,abs/2110.14168,2021.138T.Dao,D.Fu,S.Ermon,A.Rudra,and C.R.Flashattention:Fast and memory-effi cient exactattention with io-awareness.Advances in Neural Information Processing Systems,35:1634416359,2022.9M.D
141、ing,Z.Yang,W.Hong,W.Zheng,C.Zhou,D.Yin,J.Lin,X.Zou,Z.Shao,H.Yang,andJ.Tang.Cogview:Mastering text-to-image generation via transformers,2021.10M.Ding,W.Zheng,W.Hong,and J.Tang.Cogview2:Faster and better text-to-imagegeneration via hierarchical transformers.Advances in Neural Information Processing Sy
142、stems,35:1689016902,2022.11Z.Du,Y.Qian,X.Liu,M.Ding,J.Qiu,Z.Yang,and J.Tang.Glm:General language modelpretraining with autoregressive blank infi lling.In Proceedings of the 60th Annual Meeting ofthe Association for Computational Linguistics(Volume 1:Long Papers),pages 320335,2022.12Z.Du,A.Zeng,Y.Don
143、g,and J.Tang.Understanding emergent abilities of language modelsfrom the loss perspective,2024.13T.GLM.Chatglm-6b:An open bilingual dialogue language model.https:/ J.Steinhardt.Mea-suring massive multitask language understanding.In International Conference on LearningRepresentations,2021.15D.Hendryc
144、ks and K.Gimpel.Gaussian error linear units(gelus).arXiv preprintarXiv:1606.08415,2016.16W.Hong,W.Wang,Q.Lv,J.Xu,W.Yu,J.Ji,Y.Wang,Z.Wang,Y.Zhang,J.Li,B.Xu,Y.Dong,M.Ding,and J.Tang.Cogagent:A visual language model for gui agents,2023.17Z.Hou,Y.Niu,Z.Du,X.Zhang,X.Liu,A.Zeng,Q.Zheng,M.Huang,H.Wang,J.Ta
145、ng,and Y.Dong.Chatglm-rlhf:Practices of aligning large language models with human feedback,2024.18H.Lai,X.Liu,I.L.Iong,S.Yao,Y.Chen,P.Shen,H.Yu,H.Zhang,X.Zhang,Y.Dong,et al.Autowebglm:Bootstrap and reinforce a large language model-based web navigating agent.arXiv preprint arXiv:2404.03648,2024.19Y.L
146、i,S.Bubeck,R.Eldan,A.D.Giorno,S.Gunasekar,and Y.T.Lee.Textbooks are all youneed ii:phi-1.5 technical report,2023.20P.Liang,R.Bommasani,T.Lee,D.Tsipras,D.Soylu,M.Yasunaga,Y.Zhang,D.Narayanan,Y.Wu,A.Kumar,B.Newman,B.Yuan,B.Yan,C.Zhang,C.Cosgrove,C.D.Manning,C.R,D.Acosta-Navas,D.A.Hudson,E.Zelikman,E.D
147、urmus,F.Ladhak,F.Rong,H.Ren,H.Yao,J.Wang,K.Santhanam,L.Orr,L.Zheng,M.Yuksekgonul,M.Suzgun,N.Kim,N.Guha,N.Chatterji,O.Khattab,P.Henderson,Q.Huang,R.Chi,S.M.Xie,S.Santurkar,S.Ganguli,T.Hashimoto,T.Icard,T.Zhang,V.Chaudhary,W.Wang,X.Li,Y.Mai,Y.Zhang,and Y.Koreeda.Holistic evaluation of language models,
148、2023.21M.Liu,A.Zeng,B.Wang,P.Zhang,J.Tang,and Y.Dong.Apar:Llms can do auto-parallelauto-regressive decoding.ArXiv,abs/2401.06761,2024.22X.Liu,H.Lai,H.Yu,Y.Xu,A.Zeng,Z.Du,P.Zhang,Y.Dong,and J.Tang.Webglm:Towards an effi cient web-enhanced question answering system with human preferences.InProceedings
149、 of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining,pages 45494560,2023.23X.Liu,X.Lei,S.Wang,Y.Huang,Z.Feng,B.Wen,J.Cheng,P.Ke,Y.Xu,W.L.Tam,X.Zhang,L.Sun,H.Wang,J.Zhang,M.Huang,Y.Dong,and J.Tang.Alignbench:Benchmarkingchinese alignment of large language models,2023.24X.Liu,X.So
150、ng,Y.Dong,and J.Tang.Extensive self-contrast enables feedback-free languagemodel alignment,2024.1425X.Liu,H.Yu,H.Zhang,Y.Xu,X.Lei,H.Lai,Y.Gu,H.Ding,K.Men,K.Yang,S.Zhang,X.Deng,A.Zeng,Z.Du,C.Zhang,S.Shen,T.Zhang,Y.Su,H.Sun,M.Huang,Y.Dong,andJ.Tang.Agentbench:Evaluating llms as agents,2023.26Meta.Intr
151、oducing meta llama 3:The most capable openly available llm to date.https:/ OpenAI.tiktoken.https:/ R.OpenAI.Gpt-4 technical report.arXiv,pages 230308774,2023.29L.Ouyang,J.Wu,X.Jiang,D.Almeida,C.Wainwright,P.Mishkin,C.Zhang,S.Agarwal,K.Slama,A.Ray,et al.Training language models to follow instructions
152、 with human feedback.Advances in Neural Information Processing Systems,35:2773027744,2022.30J.S.Park,J.OBrien,C.J.Cai,M.R.Morris,P.Liang,and M.S.Bernstein.Generative agents:Interactive simulacra of human behavior.In Proceedings of the 36th Annual ACM Symposiumon User Interface Software and Technolog
153、y,pages 122,2023.31O.Press,N.Smith,and M.Lewis.Train short,test long:Attention with linear biases enablesinput length extrapolation.In International Conference on Learning Representations,2022.32D.Rein,B.L.Hou,A.C.Stickland,J.Petty,R.Y.Pang,J.Dirani,J.Michael,and S.R.Bowman.GPQA:A graduate-level goo
154、gle-proof q&a benchmark.CoRR,abs/2311.12022,2023.33T.L.Scao,A.Fan,C.Akiki,E.Pavlick,S.Ili c,D.Hesslow,R.Castagn,A.S.Luccioni,F.Yvon,M.Gall,et al.Bloom:A 176b-parameter open-access multilingual language model.arXiv preprint arXiv:2211.05100,2022.34R.Sennrich,B.Haddow,and A.Birch.Neural machine transl
155、ation of rare words with subwordunits.In Proceedings of the 54th Annual Meeting of the Association for ComputationalLinguistics(Volume 1:Long Papers),pages 17151725,Berlin,Germany,2016.Associationfor Computational Linguistics.35N.Shazeer.Fast transformer decoding:One write-head is all you need.arXiv
156、 preprintarXiv:1911.02150,2019.36 N.Shazeer.Glu variants improve transformer,2020.37A.Srivastava,A.Rastogi,A.Rao,A.A.M.Shoeb,A.Abid,A.Fisch,A.R.Brown,A.Santoro,A.Gupta,A.Garriga-Alonso,A.Kluska,A.Lewkowycz,A.Agarwal,A.Power,A.Ray,A.Warstadt,A.W.Kocurek,A.Safaya,A.Tazarv,A.Xiang,A.Parrish,A.Nie,A.Hus
157、sain,A.Askell,A.Dsouza,A.Rahane,A.S.Iyer,A.Andreassen,A.Santilli,A.Stuhlmller,A.M.Dai,A.La,A.K.Lampinen,A.Zou,A.Jiang,A.Chen,A.Vuong,A.Gupta,A.Gottardi,A.Norelli,A.Venkatesh,A.Gholamidavoodi,A.Tabassum,A.Menezes,A.Kirubarajan,A.Mullokandov,A.Sabharwal,A.Herrick,A.Efrat,A.Erdem,A.Karakas,and et al.Be
158、yondthe imitation game:Quantifying and extrapolating the capabilities of language models.CoRR,abs/2206.04615,2022.38J.Su,Y.Lu,S.Pan,A.Murtadha,B.Wen,and Y.Liu.Roformer:Enhanced transformer withrotary position embedding.arXiv preprint arXiv:2104.09864,2021.39M.Suzgun,N.Scales,N.Schrli,S.Gehrmann,Y.Ta
159、y,H.W.Chung,A.Chowdhery,Q.V.Le,E.H.Chi,D.Zhou,and J.Wei.Challenging big-bench tasks and whether chain-of-thoughtcan solve them.In A.Rogers,J.L.Boyd-Graber,and N.Okazaki,editors,Findings of theAssociation for Computational Linguistics:ACL 2023,Toronto,Canada,July 9-14,2023,pages1300313051.Association
160、 for Computational Linguistics,2023.40G.Team,R.Anil,S.Borgeaud,Y.Wu,J.-B.Alayrac,J.Yu,R.Soricut,J.Schalkwyk,A.M.Dai,A.Hauth,K.Millican,D.Silver,S.Petrov,M.Johnson,I.Antonoglou,J.Schrittwieser,A.Glaese,J.Chen,E.Pitler,T.Lillicrap,A.Lazaridou,O.Firat,J.Molloy,M.Isard,P.R.Barham,T.Hennigan,B.Lee,F.Viol
161、a,M.Reynolds,Y.Xu,R.Doherty,E.Collins,C.Meyer,15E.Rutherford,E.Moreira,K.Ayoub,M.Goel,G.Tucker,E.Piqueras,M.Krikun,I.Barr,N.Savinov,I.Danihelka,B.Roelofs,A.White,A.Andreassen,T.von Glehn,L.Yagati,M.Kazemi,L.Gonzalez,M.Khalman,J.Sygnowski,A.Frechette,C.Smith,L.Culp,L.Pro-leev,Y.Luan,X.Chen,J.Lottes,N
162、.Schucher,F.Lebron,A.Rrustemi,N.Clay,P.Crone,T.Kocisky,J.Zhao,B.Perz,D.Yu,H.Howard,A.Bloniarz,J.W.Rae,H.Lu,L.Sifre,M.Maggioni,F.Alcober,D.Garrette,M.Barnes,S.Thakoor,J.Austin,G.Barth-Maron,W.Wong,R.Joshi,R.Chaabouni,D.Fatiha,A.Ahuja,R.Liu,Y.Li,S.Cogan,J.Chen,C.Jia,C.Gu,Q.Zhang,J.Grimstad,A.J.Hartman
163、,M.Chadwick,G.S.Tomar,X.Garcia,E.Senter,E.Taropa,T.S.Pillai,J.Devlin,M.Laskin,D.de Las Casas,D.Valter,C.Tao,L.Blanco,A.P.Badia,D.Reitter,M.Chen,J.Brennan,C.Rivera,S.Brin,S.Iqbal,G.Surita,J.Labanowski,A.Rao,S.Winkler,E.Parisotto,Y.Gu,K.Olszewska,Y.Zhang,R.Addanki,A.Miech,A.Louis,L.E.Shafey,D.Teplyash
164、in,G.Brown,E.Catt,N.Attaluri,J.Balaguer,J.Xiang,P.Wang,Z.Ashwood,A.Briukhov,A.Webson,S.Ganapathy,S.Sanghavi,A.Kannan,M.-W.Chang,A.Stjerngren,J.Djolonga,Y.Sun,A.Bapna,M.Aitchison,P.Pejman,H.Michalewski,T.Yu,C.Wang,J.Love,J.Ahn,D.Bloxwich,K.Han,P.Humphreys,T.Sellam,J.Bradbury,V.God-bole,S.Samangooei,B
165、.Damoc,A.Kaskasoli,S.M.R.Arnold,V.Vasudevan,S.Agrawal,J.Riesa,D.Lepikhin,R.Tanburn,S.Srinivasan,H.Lim,S.Hodkinson,P.Shyam,J.Ferret,S.Hand,A.Garg,T.L.Paine,J.Li,Y.Li,M.Giang,A.Neitz,Z.Abbas,S.York,M.Reid,E.Cole,A.Chowdhery,D.Das,D.Rogozi nska,V.Nikolaev,P.Sprechmann,Z.Nado,L.Zilka,F.Prost,L.He,M.Mont
166、eiro,G.Mishra,C.Welty,J.Newlan,D.Jia,M.Allamanis,C.H.Hu,R.de Liedekerke,J.Gilmer,C.Saroufi m,S.Rijhwani,S.Hou,D.Shrivastava,A.Baddepudi,A.Goldin,A.Ozturel,A.Cassirer,Y.Xu,D.Sohn,D.Sachan,R.K.Amplayo,C.Swan-son,D.Petrova,S.Narayan,A.Guez,S.Brahma,J.Landon,M.Patel,R.Zhao,K.Villela,L.Wang,W.Jia,M.Rahtz
167、,M.Gimnez,L.Yeung,H.Lin,J.Keeling,P.Georgiev,D.Mincu,B.Wu,S.Haykal,R.Saputro,K.Vodrahalli,J.Qin,Z.Cankara,A.Sharma,N.Fernando,W.Hawkins,B.Neyshabur,S.Kim,A.Hutter,P.Agrawal,A.Castro-Ros,G.van den Driessche,T.Wang,F.Yang,S.yiin Chang,P.Komarek,R.McIlroy,M.Lu ci c,G.Zhang,W.Farhan,M.Sharman,P.Natsev,P
168、.Michel,Y.Cheng,Y.Bansal,S.Qiao,K.Cao,S.Shakeri,C.But-terfi eld,J.Chung,P.K.Rubenstein,S.Agrawal,A.Mensch,K.Soparkar,K.Lenc,T.Chung,A.Pope,L.Maggiore,J.Kay,P.Jhakra,S.Wang,J.Maynez,M.Phuong,T.Tobin,A.Tacchetti,M.Trebacz,K.Robinson,Y.Katariya,S.Riedel,P.Bailey,K.Xiao,N.Ghelani,L.Aroyo,A.Slone,N.Houls
169、by,X.Xiong,Z.Yang,E.Gribovskaya,J.Adler,M.Wirth,L.Lee,M.Li,T.Kagohara,J.Pavagadhi,S.Bridgers,A.Bortsova,S.Ghemawat,Z.Ahmed,T.Liu,R.Pow-ell,V.Bolina,M.Iinuma,P.Zablotskaia,J.Besley,D.-W.Chung,T.Dozat,R.Comanescu,X.Si,J.Greer,G.Su,M.Polacek,R.L.Kaufman,S.Tokumine,H.Hu,E.Buchatskaya,Y.Miao,M.Elhawaty,A
170、.Siddhant,N.Tomasev,J.Xing,C.Greer,H.Miller,S.Ashraf,A.Roy,Z.Zhang,A.Ma,A.Filos,M.Besta,R.Blevins,T.Klimenko,C.-K.Yeh,S.Changpinyo,J.Mu,O.Chang,M.Pajarskas,C.Muir,V.Cohen,C.L.Lan,K.Haridasan,A.Marathe,S.Hansen,S.Douglas,R.Samuel,M.Wang,S.Austin,C.Lan,J.Jiang,J.Chiu,J.A.Lorenzo,L.L.Sjsund,S.Cevey,Z.G
171、leicher,T.Avrahami,A.Boral,H.Srinivasan,V.Selo,R.May,K.Aiso-pos,L.Hussenot,L.B.Soares,K.Baumli,M.B.Chang,A.Recasens,B.Caine,A.Pritzel,F.Pavetic,F.Pardo,A.Gergely,J.Frye,V.Ramasesh,D.Horgan,K.Badola,N.Kassner,S.Roy,E.Dyer,V.Campos,A.Tomala,Y.Tang,D.E.Badawy,E.White,B.Mustafa,O.Lang,A.Jindal,S.Vikram,
172、Z.Gong,S.Caelles,R.Hemsley,G.Thornton,F.Feng,W.Stokowiec,C.Zheng,P.Thacker,aglar nl,Z.Zhang,M.Saleh,J.Svensson,M.Bileschi,P.Patil,A.Anand,R.Ring,K.Tsihlas,A.Vezer,M.Selvi,T.Shevlane,M.Rodriguez,T.Kwiatkowski,S.Daruki,K.Rong,A.Dafoe,N.FitzGerald,K.Gu-Lemberg,M.Khan,L.A.Hendricks,M.Pellat,V.Feinberg,J
173、.Cobon-Kerr,T.Sainath,M.Rauh,S.H.Hashemi,R.Ives,Y.Hasson,Y.Li,E.Noland,Y.Cao,N.Byrd,L.Hou,Q.Wang,T.Sottiaux,M.Paganini,J.-B.Lespiau,A.Moufarek,S.Hassan,K.Shivakumar,J.van Amersfoort,A.Mandhane,P.Joshi,A.Goyal,M.Tung,A.Brock,H.Sheahan,V.Misra,C.Li,N.Raki cevi c,M.Dehghani,F.Liu,S.Mittal,J.Oh,S.Noury,
174、E.Sezener,F.Huot,M.Lamm,N.D.Cao,C.Chen,G.Elsayed,E.Chi,M.Mahdieh,I.Tenney,N.Hua,I.Petrychenko,P.Kane,D.Scandinaro,R.Jain,J.Uesato,R.Datta,A.Sadovsky,O.Bunyan,D.Rabiej,S.Wu,J.Zhang,G.Vasudevan,E.Leurent,M.Al-nahlawi,I.Georgescu,N.Wei,I.Zheng,B.Chan,P.G.Rabinovitch,P.Stanczyk,Y.Zhang,D.Steiner,S.Naska
175、r,M.Azzam,M.Johnson,A.Paszke,C.-C.Chiu,J.S.Elias,A.Mohiud-din,F.Muhammad,J.Miao,A.Lee,N.Vieillard,S.Potluri,J.Park,E.Davoodi,J.Zhang,J.Stanway,D.Garmon,A.Karmarkar,Z.Dong,J.Lee,A.Kumar,L.Zhou,J.Evens,W.Isaac,Z.Chen,J.Jia,A.Levskaya,Z.Zhu,C.Gorgolewski,P.Grabowski,Y.Mao,A.Magni,K.Yao,16J.Snaider,N.Ca
176、sagrande,P.Suganthan,E.Palmer,G.Irving,E.Loper,M.Faruqui,I.Arkatkar,N.Chen,I.Shafran,M.Fink,A.Castao,I.Giannoumis,W.Kim,M.Rybinski,A.Sreevatsa,J.Prendki,D.Soergel,A.Goedeckemeyer,W.Gierke,M.Jafari,M.Gaba,J.Wiesner,D.G.Wright,Y.Wei,H.Vashisht,Y.Kulizhskaya,J.Hoover,M.Le,L.Li,C.Iwuanyanwu,L.Liu,K.Rami
177、rez,A.Khorlin,A.Cui,T.LIN,M.Georgiev,M.Wu,R.Aguilar,K.Pallo,A.Chakladar,A.Repina,X.Wu,T.van der Weide,P.Ponnapalli,C.Kaplan,J.Simsa,S.Li,O.Dousse,F.Yang,J.Piper,N.Ie,M.Lui,R.Pasumarthi,N.Lintz,A.Vijayakumar,L.N.Thiet,D.Andor,P.Valenzuela,C.Paduraru,D.Peng,K.Lee,S.Zhang,S.Greene,D.D.Nguyen,P.Kurylowi
178、cz,S.Velury,S.Krause,C.Hardin,L.Dixon,L.Janzer,K.Choo,Z.Feng,B.Zhang,A.Singhal,T.Latkar,M.Zhang,Q.Le,E.A.Abellan,D.Du,D.McKinnon,N.Antropova,T.Boluk-basi,O.Keller,D.Reid,D.Finchelstein,M.A.Raad,R.Crocker,P.Hawkins,R.Dadashi,C.Gaffney,S.Lall,K.Franko,E.Filonov,A.Bulanova,R.Leblond,V.Yadav,S.Chung,H.A
179、skham,L.C.Cobo,K.Xu,F.Fischer,J.Xu,C.Sorokin,C.Alberti,C.-C.Lin,C.Evans,H.Zhou,A.Dimitriev,H.Forbes,D.Banarse,Z.Tung,J.Liu,M.Omernick,C.Bishop,C.Kumar,R.Sterneck,R.Foley,R.Jain,S.Mishra,J.Xia,T.Bos,G.Cideron,E.Amid,F.Piccinno,X.Wang,P.Banzal,P.Gurita,H.Noga,P.Shah,D.J.Mankowitz,A.Polozov,N.Kushman,V
180、.Krakovna,S.Brown,M.Bateni,D.Duan,V.Firoiu,M.Thotakuri,T.Natan,A.Mohananey,M.Geist,S.Mudgal,S.Girgin,H.Li,J.Ye,O.Roval,R.Tojo,M.Kwong,J.Lee-Thorp,C.Yew,Q.Yuan,S.Bagri,D.Sinopalnikov,S.Ramos,J.Mellor,A.Sharma,A.Severyn,J.Lai,K.Wu,H.-T.Cheng,D.Miller,N.Sonnerat,D.Vnukov,R.Greig,J.Beattie,E.Caveness,L.
181、Bai,J.Eisenschlos,A.Korchemniy,T.Tsai,M.Jasarevic,W.Kong,P.Dao,Z.Zheng,F.Liu,F.Yang,R.Zhu,M.Geller,T.H.Teh,J.Sanmiya,E.Gladchenko,N.Trdin,A.Sozanschi,D.Toyama,E.Rosen,S.Tavakkol,L.Xue,C.Elkind,O.Woodman,J.Carpenter,G.Papamakarios,R.Kemp,S.Kafle,T.Grunina,R.Sinha,A.Talbert,A.Goyal,D.Wu,D.Owusu-Afriyi
182、e,C.Du,C.Thornton,J.Pont-Tuset,P.Narayana,J.Li,S.Fatehi,J.Wieting,O.Ajmeri,B.Uria,T.Zhu,Y.Ko,L.Knight,A.Hliou,N.Niu,S.Gu,C.Pang,D.Tran,Y.Li,N.Levine,A.Stolovich,N.Kalb,R.Santamaria-Fernandez,S.Goenka,W.Yustalim,R.Strudel,A.Elqursh,B.Lakshminarayanan,C.Deck,S.Upadhyay,H.Lee,M.Dusenberry,Z.Li,X.Wang,K
183、.Levin,R.Hoffmann,D.Holtmann-Rice,O.Bachem,S.Yue,S.Arora,E.Malmi,D.Mirylenka,Q.Tan,C.Koh,S.H.Yeganeh,S.Pder,S.Zheng,F.Pongetti,M.Tariq,Y.Sun,L.Ionita,M.Seyedhosseini,P.Tafti,R.Kotikalapudi,Z.Liu,A.Gulati,J.Liu,X.Ye,B.Chrzaszcz,L.Wang,N.Sethi,T.Li,B.Brown,S.Singh,W.Fan,A.Parisi,J.Stanton,C.Kuang,V.Ko
184、verkathu,C.A.Choquette-Choo,Y.Li,T.Lu,A.Ittycheriah,P.Shroff,P.Sun,M.Varadarajan,S.Bahargam,R.Willoughby,D.Gaddy,I.Dasgupta,G.Desjardins,M.Cor-nero,B.Robenek,B.Mittal,B.Albrecht,A.Shenoy,F.Moiseev,H.Jacobsson,A.Ghaffarkhah,M.Rivire,A.Walton,C.Crepy,A.Parrish,Y.Liu,Z.Zhou,C.Farabet,C.Radebaugh,P.Srin
185、i-vasan,C.van der Salm,A.Fidjeland,S.Scellato,E.Latorre-Chimoto,H.Klimczak-Plucinska,D.Bridson,D.de Cesare,T.Hudson,P.Mendolicchio,L.Walker,A.Morris,I.Penchev,M.Mauger,A.Guseynov,A.Reid,S.Odoom,L.Loher,V.Cotruta,M.Yenugula,D.Grewe,A.Petrushkina,T.Duerig,A.Sanchez,S.Yadlowsky,A.Shen,A.Globerson,A.Kur
186、zrok,L.Webb,S.Dua,D.Li,P.Lahoti,S.Bhupatiraju,D.Hurt,H.Qureshi,A.Agarwal,T.Shani,M.Eyal,A.Khare,S.R.Belle,L.Wang,C.Tekur,M.S.Kale,J.Wei,R.Sang,B.Saeta,T.Liechty,Y.Sun,Y.Zhao,S.Lee,P.Nayak,D.Fritz,M.R.Vuyyuru,J.Aslanides,N.Vyas,M.Wicke,X.Ma,T.Bilal,E.Eltyshev,D.Balle,N.Martin,H.Cate,J.Manyika,K.Amiri
187、,Y.Kim,X.Xiong,K.Kang,F.Luisier,N.Tripuraneni,D.Madras,M.Guo,A.Waters,O.Wang,J.Ainslie,J.Baldridge,H.Zhang,G.Pruthi,J.Bauer,F.Yang,R.Mansour,J.Gelman,Y.Xu,G.Polovets,J.Liu,H.Cai,W.Chen,X.Sheng,E.Xue,S.Ozair,A.Yu,C.Angermueller,X.Li,W.Wang,J.Wiesinger,E.Koukoumidis,Y.Tian,A.Iyer,M.Gurumurthy,M.Golden
188、son,P.Shah,M.Blake,H.Yu,A.Urbanowicz,J.Palomaki,C.Fernando,K.Brooks,K.Durden,H.Mehta,N.Momchev,E.Rahimtoroghi,M.Georgaki,A.Raul,S.Ruder,M.Redshaw,J.Lee,K.Jalan,D.Li,G.Perng,B.Hechtman,P.Schuh,M.Nasr,M.Chen,K.Milan,V.Mikulik,T.Strohman,J.Franco,T.Green,D.Hassabis,K.Kavukcuoglu,J.Dean,and O.Vinyals.Ge
189、mini:A family of highly capable multimodal models,2023.41H.Touvron,T.Lavril,G.Izacard,X.Martinet,M.-A.Lachaux,T.Lacroix,B.Rozire,N.Goyal,E.Hambro,F.Azhar,A.Rodriguez,A.Joulin,E.Grave,and G.Lample.Llama:Open andeffi cient foundation language models,2023.42H.Touvron,L.Martin,K.Stone,P.Albert,A.Almahai
190、ri,Y.Babaei,N.Bashlykov,S.Batra,P.Bhargava,S.Bhosale,D.Bikel,L.Blecher,C.C.Ferrer,M.Chen,G.Cucurull,D.Esiobu,17J.Fernandes,J.Fu,W.Fu,B.Fuller,C.Gao,V.Goswami,N.Goyal,A.Hartshorn,S.Hosseini,R.Hou,H.Inan,M.Kardas,V.Kerkez,M.Khabsa,I.Kloumann,A.Korenev,P.S.Koura,M.-A.Lachaux,T.Lavril,J.Lee,D.Liskovich,
191、Y.Lu,Y.Mao,X.Martinet,T.Mihaylov,P.Mishra,I.Molybog,Y.Nie,A.Poulton,J.Reizenstein,R.Rungta,K.Saladi,A.Schelten,R.Silva,E.M.Smith,R.Subramanian,X.E.Tan,B.Tang,R.Taylor,A.Williams,J.X.Kuan,P.Xu,Z.Yan,I.Zarov,Y.Zhang,A.Fan,M.Kambadur,S.Narang,A.Rodriguez,R.Stojnic,S.Edunov,andT.Scialom.Llama 2:Open fou
192、ndation and fi ne-tuned chat models,2023.43A.Vaswani,N.Shazeer,N.Parmar,J.Uszkoreit,L.Jones,A.N.Gomez,L.Kaiser,andI.Polosukhin.Attention is all you need,2023.44H.Wang,S.Ma,L.Dong,S.Huang,D.Zhang,and F.Wei.Deepnet:Scaling transformers to1,000 layers,2022.45W.Wang,Q.Lv,W.Yu,W.Hong,J.Qi,Y.Wang,J.Ji,Z.Y
193、ang,L.Zhao,X.Song,J.Xu,B.Xu,J.Li,Y.Dong,M.Ding,and J.Tang.Cogvlm:Visual expert for pretrained language models,2023.46J.Wei,X.Wang,D.Schuurmans,M.Bosma,B.Ichter,F.Xia,E.H.Chi,Q.V.Le,andD.Zhou.Chain-of-thought prompting elicits reasoning in large language models.In S.Koyejo,S.Mohamed,A.Agarwal,D.Belgr
194、ave,K.Cho,and A.Oh,editors,Advances in NeuralInformation Processing Systems 35:Annual Conference on Neural Information ProcessingSystems 2022,NeurIPS 2022,New Orleans,LA,USA,November 28-December 9,2022,2022.47W.Xiong,J.Liu,I.Molybog,H.Zhang,P.Bhargava,R.Hou,L.Martin,R.Rungta,K.A.Sankararaman,B.Oguz,
195、et al.Effective long-context scaling of foundation models.arXivpreprint arXiv:2309.16039,2023.48Y.Xu,X.Liu,X.Liu,Z.Hou,Y.Li,X.Zhang,Z.Wang,A.Zeng,Z.Du,W.Zhao,J.Tang,and Y.Dong.Chatglm-math:Improving math problem-solving in large language models with aself-critique pipeline,2024.49F.Yan,H.Mao,C.C.-J.
196、Ji,T.Zhang,S.G.Patil,I.Stoica,and J.E.Gonzalez.Berkeley functioncalling leaderboard.2024.50S.Yang,W.-L.Chiang,L.Zheng,J.E.Gonzalez,and I.Stoica.Rethinking benchmark andcontamination for language models with rephrased samples.arXiv preprint arXiv:2311.04850,2023.51S.Yao,J.Zhao,D.Yu,N.Du,I.Shafran,K.N
197、arasimhan,and Y.Cao.React:Synergizingreasoning and acting in language models.arXiv preprint arXiv:2210.03629,2022.52A.Zeng,M.Liu,R.Lu,B.Wang,X.Liu,Y.Dong,and J.Tang.Agenttuning:Enablinggeneralized agent abilities for llms,2023.53A.Zeng,X.Liu,Z.Du,Z.Wang,H.Lai,M.Ding,Z.Yang,Y.Xu,W.Zheng,X.Xia,et al.G
198、lm-130b:An open bilingual pre-trained model.arXiv preprint arXiv:2210.02414,2022.54S.Zhang,S.Roller,N.Goyal,M.Artetxe,M.Chen,S.Chen,C.Dewan,M.Diab,X.Li,X.V.Lin,etal.Opt:Openpre-trainedtransformerlanguagemodels.arXivpreprintarXiv:2205.01068,2022.55S.Zhang,H.Zhao,X.Liu,Q.Zheng,Z.Qi,X.Gu,X.Zhang,Y.Dong
199、,and J.Tang.Natural-codebench:Examining coding performance mismatch on humaneval and natural user prompts.arXiv preprint arXiv:2405.04520,2024.56Z.Zhang,L.Lei,L.Wu,R.Sun,Y.Huang,C.Long,X.Liu,X.Lei,J.Tang,and M.Huang.Safetybench:Evaluating the safety of large language models with multiple choice ques
200、tions.arXiv preprint arXiv:2309.07045,2023.57W.X.Zhao,K.Zhou,J.Li,T.Tang,X.Wang,Y.Hou,Y.Min,B.Zhang,J.Zhang,Z.Dong,et al.A survey of large language models.arXiv preprint arXiv:2303.18223,2023.1858Q.Zheng,X.Xia,X.Zou,Y.Dong,S.Wang,Y.Xue,Z.Wang,L.Shen,A.Wang,Y.Li,T.Su,Z.Yang,and J.Tang.Codegeex:A pre-
201、trained model for code generation with multilingualevaluations on humaneval-x,2023.59W.Zheng,J.Teng,Z.Yang,W.Wang,J.Chen,X.Gu,Y.Dong,M.Ding,and J.Tang.Cogview3:Finer and faster text-to-image generation via relay diffusion,2024.60C.Zhou,P.Liu,P.Xu,S.Iyer,J.Sun,Y.Mao,X.Ma,A.Efrat,P.Yu,L.Yu,S.Zhang,G.G
202、hosh,M.Lewis,L.Zettlemoyer,and O.Levy.Lima:Less is more for alignment,2023.61J.Zhou,Z.Chen,D.Wan,B.Wen,Y.Song,J.Yu,Y.Huang,L.Peng,J.Yang,X.Xiao,et al.Characterglm:Customizing chinese conversational ai characters with large language models.arXiv preprint arXiv:2311.16832,2023.62J.Zhou,T.Lu,S.Mishra,S.Brahma,S.Basu,Y.Luan,D.Zhou,and L.Hou.Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911,2023.19