ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Team GLM*
Zhipu AI / Tsinghua University

Abstract

We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models, trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models are pre-trained on ten trillion tokens, mostly in Chinese and English, along with a small corpus covering 24 languages, and are aligned primarily for Chinese and English usage. The high-quality alignment is achieved via a multi-stage post-training process, which involves supervised fine-tuning and learning from human feedback. Evaluations show that GLM-4: 1) closely rivals or outperforms GPT-4 in terms of general metrics such as MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval; 2) gets close to GPT-4-Turbo in instruction following as measured by IFEval; 3) matches GPT-4 Turbo (128K) and Claude 3 for long context tasks; and 4) outperforms GPT-4 in Chinese alignment as measured by AlignBench. The GLM-4 All Tools model is further aligned to understand user intent and autonomously decide when and which tool(s) to use, including web browser, Python interpreter, text-to-image model, and user-defined functions, to effectively complete complex tasks. In practical applications, it matches and even surpasses GPT-4 All Tools in tasks like accessing online information via web browsing and solving math problems using the Python interpreter. Over the course, we have open-sourced a series of models, including ChatGLM-6B (three generations), GLM-4-9B (128K, 1M), GLM-4V-9B, WebGLM, and CodeGeeX, attracting over 10 million downloads on Hugging Face in the year 2023 alone. The open models can be accessed through https:// and https://huggingface.co/THUDM.

*Team GLM: Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Jingyu Sun, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, Zihan Wang. Team members are listed alphabetically by first name.

Preprint. Under review. arXiv:2406.12793v2 [cs.CL] 30 Jul 2024
[Figure 1 omitted: a timeline of the GLM family of models, from GLM and GLM-10B (2021) and GLM-130B, mGLM-1B, and CodeGeeX-13B (2022), through the ChatGLM, ChatGLM2, and ChatGLM3 series along with VisualGLM, CogVLM, CharacterGLM, and WebGLM (2023), to GLM-4, GLM-4-Air, GLM-4-9B, GLM-4V-9B, CogVLM2, and GLM-4 All Tools (2024), grouped into APIs, open LLMs, open VLMs, and agent models.]

Figure 1: Timeline of the GLM family of language, code, vision, and agent models. The focus of this report is primarily on the language models, i.e., ChatGLM. The APIs are publicly available at https:// and open models can be accessed through https://.

1 Introduction

The rapid development of large language models (LLMs) has been phenomenal [57]. Take one of the most successful model series, the OpenAI GPT models, as an example:
the original GPT-3 model released in 2020 [3] marked a significant scale-up, from GPT-1's 117 million parameters and GPT-2's 1.5 billion parameters to 175 billion parameters. This scale-up equips the decoder-only, transformer-based GPT-3 model with in-context learning and generalized capabilities. According to OpenAI, the GPT-3.5 series improved upon GPT-3 by incorporating instruction tuning, supervised fine-tuning (SFT), and/or reinforcement learning from human feedback (RLHF) [29]. This has now become a standard procedure for creating performant LLMs, including the PaLM models [6], the LLaMA models [41], the Gemini models [40], and many more.

In a parallel line to the popularly adopted LLM development practices, we proposed the General Language Model (GLM) architecture [11], featuring the autoregressive blank infilling objective, and open-sourced the GLM-10B model in 2021 (see the GLM timeline in Figure 1). Starting in late 2021, we began pre-training GLM-130B [53]. The goal was to train a 100B-scale model to match or surpass GPT-3 (davinci) while also verifying the techniques for successfully training models at this scale, along with other contemporary efforts such as OPT-175B [54] and BLOOM-176B [33]. We completed the 400B-token training and evaluation of GLM-130B in July, and subsequently released the model and pre-training details [53] in August 2022. According to HELM in November 2022, GLM-130B matches GPT-3 (davinci) across various dimensions [20].

Following this, we initiated instruction tuning on GLM-130B. Later, ChatGPT further motivated us to align the base models with SFT and RLHF. We created and crafted the prompt-response pairs from scratch and performed SFT, while also starting to examine how to effectively apply RLHF. On March 14, 2023, the aligned model, ChatGLM-130B, went live on https://.
[Figure 2 shows GLM-4 All Tools handling the user query "Search for the global population from 2000 to 2023, then calculate the average annual growth rate." The model first searches the web for "global population from 2000 to 2023" (top results include the World Population Clock at worldometers.info and a World Population 1950-2024 page), reports that the global population grew from about 6.15 billion in 2000 to about 8.05 billion in 2023, and then generates and executes Python code:

    # Starting and ending population values
    starting_population = 6.15e9  # 2000 population
    ending_population = 8.05e9    # 2023 population
    # Number of years between 2000 and 2023
    years = 2023 - 2000
    # CAGR formula
    cagr = (ending_population / starting_population) ** (1 / years) - 1
    cagr * 100  # convert to percentage

The execution result is 1.17739919480071, and the model answers that the average annual growth rate of the global population from 2000 to 2023 was approximately 1.18%.]

Figure 2: An Illustrative Example of GLM-4 All Tools.

In addition, a smaller version, ChatGLM-6B [13], was open-sourced on the same day, attracting significantly more attention than anticipated.
It was designed to have 6.2 billion parameters for 1) facilitating fast iteration of pre- and post-training techniques as well as data selection, and 2) enabling local deployment on consumer-grade graphics cards using INT4 quantization. Since then, we have been rapidly exploring and refining our pre-training and alignment techniques, leading to the second and third generations of the ChatGLM series, each arriving roughly three months after the last, both pre-trained entirely from scratch.

ChatGLM-6B was pre-trained on approximately one trillion tokens of Chinese and English corpus with a context length of 2,048 (2K), supplemented mostly by SFT. Released in June, ChatGLM2-6B was pre-trained and aligned with more high-quality data, leading to substantial improvements over its predecessor, including a 23% improvement on MMLU, 571% on GSM8K, and 60% on BBH. By adopting the FlashAttention technique [8], its context length was extended to 32K. Additionally, the integration of Multi-Query Attention [35] contributed to a 42% increase in inference speed. Taking this further, our second-generation code model CodeGeeX2-6B was developed by pre-training on an additional 600 billion code tokens. It demonstrated Pass@1 improvements over the initial generation, CodeGeeX-13B [58], with increases of 57% in Python, 71% in C++, 54% in Java, 83% in JavaScript, and 56% in Go, as measured by HumanEval-X. For character-based dialogues, CharacterGLM [61] allows effective and safe character customization on LLMs. By further adopting more diverse training datasets, more sufficient training steps, and more optimized training strategies, ChatGLM3-6B topped 42 benchmarks across semantics, mathematics, reasoning, code, and knowledge. Starting from this generation, ChatGLM also supports function call and code interpreter, as well as complex agent tasks [22; 52; 18]. In the course of these developments, we also developed models with 1.5B, 3B, 12B, 32B, 66B, and 130B parameters, allowing us to validate our observations and establish our own scaling laws.
With all the lessons learned and experiences accumulated, we kicked off the training of GLM-4. The first cutoff checkpoint then underwent a multi-stage post-training process (e.g., SFT, RLHF, safety alignment), with a focus on the Chinese and English languages for now. Subsequently, it was developed into two distinct versions: GLM-4 and GLM-4 All Tools, both supporting a 128K context length. Since January 16, 2024, GLM-4 (0116) has been made available through the GLM-4 API at https://, and GLM-4 All Tools is accessible via the website https:// and mobile applications that support the creation of one's own agent GLMs. The latest models are GLM-4 (0520) and GLM-4-Air (0605), with upgrades to both pre-training and alignment. GLM-4-Air achieves comparable performance to GLM-4 (0116) with lower latency and inference cost.

Evaluations of GLM-4 were performed on a variety of language benchmarks. These evaluations assess GLM-4's general abilities in English, instruction following in both English and Chinese, and alignment, long-context, and agent capacities in Chinese.
Table 1: Performance of open ChatGLM-6B, ChatGLM2-6B, ChatGLM3-6B, and GLM-4-9B.

Language  Dataset        ChatGLM-6B    ChatGLM2-6B   ChatGLM3-6B-Base  GLM-4-9B
                         (2023-03-14)  (2023-06-25)  (2023-10-27)      (2024-06-05)
English   GSM8K          1.5           25.9          72.3              84.0
          MATH           3.1           6.9           25.7              30.4
          BBH            0.0           29.2          66.1              76.3
          MMLU           25.2          45.2          61.4              74.7
          GPQA           -             -             26.8              34.3
          HumanEval      0.0           9.8           58.5              70.1
          BoolQ          51.8          79.0          87.9              89.6
          CommonSenseQA  20.5          65.4          86.5              90.7
          HellaSwag      30.4          57.0          79.7              82.6
          PIQA           65.7          69.6          80.1              79.1
          DROP           3.9           25.6          70.9              77.2
Chinese   C-Eval         23.7          51.7          69.0              77.1
          CMMLU          25.3          50.0          67.5              75.1
          GAOKAO-Bench   26.8          46.4          67.3              74.5
          C3             35.1          58.6          73.9              77.2
First, on the most commonly used English academic benchmarks (MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval), GLM-4 (0520) achieves performance closely comparable to that of GPT-4 (0613) [28] and Gemini 1.5 Pro [40]. For example, it scores 83.3 vs. 86.4 and 83.7 on MMLU, respectively. Second, according to IFEval [62], GLM-4's instruction following capacities at both the prompt and instruction levels are approximately as effective as GPT-4-Turbo's in both English and Chinese. Third, in terms of Chinese language alignment, GLM-4 outperforms GPT-4 and matches GPT-4-Turbo across eight dimensions in AlignBench [23]. Finally, for long-context tasks, the GLM-4 (128K) model matches the performance of GPT-4 Turbo and Claude 3 Opus as measured by LongBench-Chat [1], i.e., 87.3 vs. 87.2 and 87.7, respectively.

The GLM-4 All Tools model is specifically aligned to better understand user intent and autonomously select the most appropriate tool(s) for task completion. For example, it can access online information via a web browser in a multi-round manner, use the Python interpreter to solve math problems, leverage a text-to-image model to generate images, and call user-defined functions. Figure 2 illustrates an example of GLM-4 All Tools using a web browser and Python interpreter to address the user query "Search for the global population from 2000 to 2023, then calculate the average annual growth rate". Our first-hand test shows that it not only matches but often surpasses the capabilities of GPT-4 All Tools for common tasks.
Following our three generations of open ChatGLM-6B models, we also openly released the GLM-4-9B (128K and 1M context length) model. GLM-4-9B is pre-trained on approximately ten trillion tokens of multilingual corpus with a context length of 8,192 (8K) and post-trained with the same pipeline and data used for GLM-4 (0520). With less training compute, it outperforms Llama-3-8B [26] and supports all the functionality of All Tools in GLM-4. We also provide an experimental model, GLM-4-9B-Chat-1M, with a 1 million (1M) context length (about 2 million Chinese characters). Table 1 shows the performance of the three generations of ChatGLM-6B models and GLM-4-9B, illustrating the progressive improvements of ChatGLM over time. Figure 3 summarizes the major improvements and features from GLM-130B to GLM-4 All Tools.

Throughout this journey, we have also contributed to the open development of code LLMs (CodeGeeX [58]) as well as visual language models for image understanding (CogVLM [45] and CogAgent [16]) and text-to-image generation (CogView [9; 10; 59]). The open models and data can be accessed via https:// and https://huggingface.co/THUDM.
[Figure 3 omitted: the progression from GLM-130B (MMLU: 44.8%) to ChatGLM (aligned with human intent), ChatGLM2 (MMLU: 66.6%; better architecture with 32K context), ChatGLM3 (MMLU: 71.0%; native agent and function call capabilities), GLM-4 (MMLU: 83.3%; powerful agents with 128K context), and GLM-4 All Tools (experimental 1M context length and vision).]

Figure 3: From GLM-130B to ChatGLM to ChatGLM2/3 to GLM-4 All Tools.

2 ChatGLM Techniques

In this section, we introduce both the pre- and post-training techniques we adopted and developed in ChatGLM, including the model architecture, pre-training data, alignment, and All Tools. We have detailed technical reports introducing each of the major techniques we used to reach GLM-4.
Pre-Training Data. Our pre-training corpus consists of multilingual (mostly English and Chinese) documents from a mixture of different sources, including webpages, Wikipedia, books, code, and research papers. The data processing pipeline mainly includes three stages: deduplication, filtering, and tokenization. The deduplication stage improves data diversity by removing duplicated or similar documents, with both exact and fuzzy deduplication. The filtering stage for webpages improves data quality by removing noisy documents that contain offensive language, placeholder text, source code, etc. The tokenization stage converts text into a sequence of tokens for further processing. The number of tokens in the pre-training data directly affects model training speed. To optimize this aspect, we employ the byte-level byte pair encoding (BPE) algorithm [34] to separately learn the Chinese and multilingual tokens and merge them with the tokens of the cl100k_base tokenizer in tiktoken [27] into a unified vocabulary with a size of 150,000. In the final training set, we re-weight different sources to increase the importance of high-quality and educational sources like books and Wikipedia. To this end, the pre-training corpus consists of around ten trillion tokens.
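As a minimal sketch of this merged-vocabulary idea (following tiktoken's documented pattern for constructing a derived Encoding; the added tokens, ranks, and encoding name below are illustrative assumptions, not our actual vocabulary):

    import tiktoken

    # Start from the cl100k_base merge table and append separately learned
    # (e.g., Chinese) byte-level BPE tokens with ranks above the base ones.
    cl100k = tiktoken.get_encoding("cl100k_base")
    extra_tokens = ["你好".encode("utf-8"), "世界".encode("utf-8")]  # illustrative
    extra_ranks = {tok: len(cl100k._mergeable_ranks) + i
                   for i, tok in enumerate(extra_tokens)}
    # In a real vocabulary, a BPE trainer guarantees that every new token is
    # reachable by pairwise merges of existing tokens (built bottom-up).
    unified = tiktoken.Encoding(
        name="glm_unified_sketch",
        pat_str=cl100k._pat_str,  # reuse the base pre-tokenization regex
        mergeable_ranks={**cl100k._mergeable_ranks, **extra_ranks},
        special_tokens=dict(cl100k._special_tokens),
    )
    print(len(unified._mergeable_ranks), unified.encode("你好, world"))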
Throughout the four generations of ChatGLM development, our findings align with existing studies [60]: data quality and diversity are crucial for building effective LLMs. Despite the empirical lessons and insights gained, we have yet to identify a fundamental principle that could guide the processes of data collection, cleaning, and selection, which might inspire future research directions.

Architecture. The GLM family of LLMs is built on the Transformer [43]. In GLM-130B [53], we explored various options to stabilize its pre-training, taking into account the hardware constraints we faced at the time. Specifically, GLM-130B leveraged DeepNorm [44] as the layer normalization strategy and used Rotary Positional Encoding (RoPE) [38] as well as the Gated Linear Unit [36] with the GeLU [15] activation function in FFNs. Throughout our exploration, we have investigated different strategies to enhance model performance and inference efficiency. The recent GLM-4 model adopts the following architecture design choices (a minimal sketch follows the list):

- No Bias Except QKV: To increase training speed, we removed all bias terms, with the exception of the biases in the Query, Key, and Value (QKV) projections of the attention layers. In doing so, we observed a slight improvement in length extrapolation.
- RMSNorm and SwiGLU: We adopted RMSNorm and SwiGLU to replace LayerNorm and ReLU, respectively. These two strategies brought better model performance.
- Rotary Positional Embeddings (RoPE): We extended RoPE to a two-dimensional form to accommodate the 2D positional encoding in GLM.
- Group Query Attention (GQA): We replaced Multi-Head Attention (MHA) with Group Query Attention (GQA) to cut down the KV cache size during inference. Given that GQA uses fewer parameters than MHA, we increased the FFN parameter count to maintain the same model size, i.e., setting d_ffn to 10/3 of the hidden size.
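The sketch below combines these four choices in PyTorch. The layer sizes (hidden size 4096, 32 query heads, 2 KV heads) and the norm epsilon are illustrative assumptions, and the RoPE rotation is omitted for brevity; this is not the actual GLM-4 implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RMSNorm(nn.Module):
        def __init__(self, dim, eps=1e-5):
            super().__init__()
            self.weight = nn.Parameter(torch.ones(dim))
            self.eps = eps

        def forward(self, x):
            return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    class SwiGLU(nn.Module):
        def __init__(self, dim, ffn_dim):
            super().__init__()
            self.gate = nn.Linear(dim, ffn_dim, bias=False)  # no bias outside QKV
            self.up = nn.Linear(dim, ffn_dim, bias=False)
            self.down = nn.Linear(ffn_dim, dim, bias=False)

        def forward(self, x):
            return self.down(F.silu(self.gate(x)) * self.up(x))

    class GLMStyleBlock(nn.Module):
        def __init__(self, dim=4096, n_heads=32, n_kv_heads=2):
            super().__init__()
            self.n_heads, self.n_kv = n_heads, n_kv_heads
            self.head_dim = dim // n_heads
            # Bias is kept only on the Q, K, V projections.
            self.q = nn.Linear(dim, n_heads * self.head_dim, bias=True)
            self.kv = nn.Linear(dim, 2 * n_kv_heads * self.head_dim, bias=True)
            self.o = nn.Linear(n_heads * self.head_dim, dim, bias=False)
            self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)
            # d_ffn = 10/3 of the hidden size, compensating for GQA's
            # smaller KV projections.
            self.ffn = SwiGLU(dim, int(dim * 10 / 3))

        def forward(self, x):
            b, t, d = x.shape
            h = self.attn_norm(x)
            q = self.q(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
            k, v = self.kv(h).chunk(2, dim=-1)
            k = k.view(b, t, self.n_kv, self.head_dim).transpose(1, 2)
            v = v.view(b, t, self.n_kv, self.head_dim).transpose(1, 2)
            # GQA: each group of query heads shares one KV head,
            # shrinking the KV cache during inference.
            k = k.repeat_interleave(self.n_heads // self.n_kv, dim=1)
            v = v.repeat_interleave(self.n_heads // self.n_kv, dim=1)
            attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            x = x + self.o(attn.transpose(1, 2).reshape(b, t, d))
            return x + self.ffn(self.ffn_norm(x))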
The context length of our models was extended from 2K (ChatGLM) to 32K (ChatGLM2 and ChatGLM3), and then to 128K and 1M (GLM-4). These expansions were achieved not only through context extension (position encoding extension [31; 5] and continual training [47] on long text) but also through long context alignment, enabling GLM-4 to effectively handle very long contexts (cf. [1] for technical details).
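One common form of position encoding extension is linear position interpolation in the spirit of [5]: positions are rescaled so that a longer input still falls inside the position range seen during pre-training, before continual training on long text. A minimal sketch (the function name, RoPE base, and lengths are illustrative assumptions):

    import torch

    def interpolated_rope_angles(seq_len, head_dim, train_len, target_len, base=10000.0):
        # Shrink positions by train_len / target_len so position target_len - 1
        # maps back into the trained range [0, train_len).
        positions = torch.arange(seq_len, dtype=torch.float32) * (train_len / target_len)
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
        return torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2) rotation angles

    # Extend an 8K-trained model toward a 128K window: the same code path,
    # only with rescaled positions.
    angles = interpolated_rope_angles(131072, head_dim=128, train_len=8192, target_len=131072)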
Alignment. Pre-training builds the foundation of LLMs, while post-training [29] further refines these models to align with human preferences, such as understanding human intents, following instructions, and facilitating multi-turn dialogues. For GLM-4, the alignment is mostly achieved with supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) [17]. In SFT, we find that authentic human prompts and interactions, instead of template-based or model-generated responses, are vital to the alignment quality. While SFT largely aligns the base models with human preferences, RLHF can further help mitigate issues of response rejection, safety, mixture of bilingual tokens generated, and multi-turn coherence, among others.

For the first generation of our models (ChatGLM-6B and ChatGLM-130B), the prompt-response pairs were mostly annotated by the model developers. For later models, the alignment data is a combination of in-house annotation and proprietary data acquired from third parties, subject to strict quality control measures. Similar to existing practices [42], annotators are instructed to score model responses across several dimensions, including safety, factuality, relevance, helpfulness, and human preferences.
ChatGLM Techniques. Throughout the development of ChatGLM, we have introduced and will publish techniques that are used to enhance its performance.

- Emergent Abilities of LLMs [12]: We examined the relationship between pre-training loss and performance on downstream tasks and found that, with the same pre-training loss, LLMs of different model sizes and training tokens generate the same downstream performance. We also found that on some tasks (such as MMLU and GSM8K), performance improves beyond random chance only when the pre-training loss falls below a certain threshold. We thus redefine emergent abilities as those exhibited by models with lower pre-training losses [12].
- LongAlign [1]: To extend LLMs' context window size, we proposed LongAlign, a comprehensive recipe for long context alignment. It enables GLM-4 to process long context texts (up to 128K tokens) with performance comparable to that of Claude 2 and GPT-4 Turbo (1106).
- ChatGLM-Math [48]: To improve math problem solving in LLMs, we introduced ChatGLM-Math, which leverages self-critique rather than external models or manual annotations for data selection.
- ChatGLM-RLHF [17]: To align LLMs with human feedback, we introduced ChatGLM-RLHF, our practices of applying PPO and DPO to LLMs.
- Self-Contrast [24]: To avoid the need for expensive human preference feedback data, we developed a feedback-free alignment strategy, Self-Contrast. It utilizes the target LLM to self-generate massive negative samples for its RLHF alignment.
- AgentTuning [52]: To improve LLMs' agent capabilities, we developed the AgentTuning framework with the AgentInstruct instruction-tuning dataset, which includes high-quality interaction trajectories between agents and environments.
- APAR [21]: To improve the inference speed of LLMs for responses with hierarchical structures, we presented an auto-parallel auto-regressive (APAR) generation approach. It leverages instruct tuning to train LLMs to plan their (parallel) generation process and execute APAR generation.
- Benchmarks: We also developed several open LLM benchmarks, including AgentBench [25] for evaluating LLMs as agents, LongBench [2] for evaluating the long context handling performance of LLMs, AlignBench [23] to measure the alignment quality of ChatGLM with Chinese language content, HumanEval-X [58] to evaluate HumanEval [4] problems in programming languages beyond Python, as well as NaturalCodeBench (NCB) to measure models' capacities to solve practical programming tasks.
GLM-4 All Tools. The latest ChatGLM models are GLM-4 and GLM-4 All Tools, both of which were trained and aligned using the techniques above. GLM-4 All Tools is a model version further aligned to support intelligent agents and related tasks. It is trained to autonomously understand user intent, plan complex instructions, and call one or multiple tools (e.g., web browser, Python interpreter, and the text-to-image model) to complete complex tasks. Figure 4 presents the overall pipeline of the GLM-4 All Tools system.

[Figure 4 omitted: a user request flows to GLM-4 (All Tools), which plans and analyzes the task with a 128K context, recursively executes tool calls (Python, CogView, web browser, external knowledge, function call), and incorporates feedback, memory, and execution results; customized GLMs build on the same pipeline.]

Figure 4: The overall pipeline of GLM-4 All Tools and customized GLMs (agents).

When a user issues a complex request, the model analyzes the task and plans the problem-solving process step by step. If it determines that it cannot complete the task independently, it sequentially calls one or multiple external tools, utilizing their intermediate feedback and results to help solve the task (a minimal sketch of this loop follows).
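The following is a minimal sketch of such a plan-execute-feedback loop. The message format, tool registry, and model interface are illustrative assumptions, not the actual All Tools implementation:

    import json

    TOOLS = {  # toy stand-ins for the Python interpreter and web browser
        "python": lambda code: repr(eval(code)),
        "web_browser": lambda query: f"top search results for {query!r}",
    }

    def run_all_tools(user_request, chat_model, max_steps=8):
        """chat_model(messages) returns either {"answer": ...} when it can finish
        independently, or {"tool_call": {"name": ..., "arguments": ...}}."""
        messages = [{"role": "user", "content": user_request}]
        for _ in range(max_steps):
            step = chat_model(messages)
            if "answer" in step:          # the model completed the task itself
                return step["answer"]
            call = step["tool_call"]      # otherwise call the chosen tool ...
            result = TOOLS[call["name"]](call["arguments"])
            messages.append({             # ... and feed the result back as context
                "role": "tool",
                "content": json.dumps({"name": call["name"], "result": result}),
            })
        return "stopped: tool-call budget exhausted"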
Built on GLM-4's all-tools capabilities, we also developed the GLMs application platform that allows users to create and customize their own agents for specific tasks. The GLMs support not only the embedded Python interpreter, web browser, and text-to-image model, but also user-defined functions, APIs, and external knowledge bases to more effectively address user needs.

3 GLM-4 Capabilities
We examine the capabilities of the GLM-4 model from diverse perspectives, including the base capacity on academic benchmarks, code problem-solving, and agent abilities in English, instruction following and long context for both Chinese and English, as well as alignment in Chinese. As mentioned, GLM-4 was pre-trained mostly in Chinese and English and aligned predominantly to Chinese. In this section, we report results primarily for the latest GLM-4 versions, i.e., GLM-4 (0520) and GLM-4-Air (0605), as GLM-4 (0520) is slightly better than its original 0116 version across the evaluated benchmarks. During evaluation, both GLM-4 and GLM-4-Air are deployed with BFloat16 precision. For baselines, we present results for GPT-4 (0613), GPT-4 Turbo (1106, 2024-04-09), Claude 2, Claude 3 Opus, and Gemini 1.5 Pro, all of which were extracted from the corresponding technical reports or tested through their public APIs.

Overall, GLM-4 gets close to the state-of-the-art models (GPT-4-Turbo, Gemini 1.5 Pro, and Claude 3 Opus) on the standard benchmarks, as well as in instruction following, long context, code problem-solving, and agent abilities in English environments. For Chinese alignment, it generates strong performance against SOTA models across various domains, such as fundamental language ability, advanced Chinese understanding, professional knowledge, and open-ended question answering. In summary, GLM-4 is among the best for Chinese language tasks. It also demonstrates comparable performance to GPT-4 and Claude 3 Opus in Chinese math and logic reasoning capabilities, though it lags behind GPT-4 Turbo.
3.1 Evaluation of Academic Benchmarks

To evaluate the general performance of the base model, we select six commonly used benchmarks spanning knowledge, math, reasoning, commonsense, and coding:

- MMLU [14]: Multi-choice questions collected from various examinations, including mathematics, history, computer science, and more. We present all answers to the model and ask it to choose the letter of the answer.
- GSM8K [7]: 8,500 grade school math word problems (1,000 in the test set) that require the model to solve real-life situational problems using mathematical concepts. We use chain-of-thought prompting [46] for this benchmark (see the sketch after this list).
- MATH: 12,500 challenging competition-level mathematics problems (5,000 in the test set). We use chain-of-thought prompting [46] for this benchmark.
- BBH [39]: A suite of 23 challenging BIG-Bench [37] tasks. We use chain-of-thought prompting [46] for this benchmark.
- GPQA [32]: A graduate-level multi-choice benchmark in biology, chemistry, and physics.
- HumanEval [4]: A coding benchmark that measures the correctness of synthesized functions with automatic test-case checking.
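In chain-of-thought prompting, each few-shot exemplar pairs a question with a worked rationale that ends in a final answer, and the graded answer is parsed from the model's completion. A minimal sketch (the exemplar wording and the extraction regex are illustrative assumptions, not the exact evaluation prompts):

    import re

    FEW_SHOT = """Q: A farm has 3 pens with 4 pigs each. 2 pigs are sold. How many pigs remain?
    A: There are 3 * 4 = 12 pigs. After selling 2, 12 - 2 = 10 remain. The answer is 10.

    Q: {question}
    A:"""

    def extract_answer(completion):
        # Grade by the last "The answer is N" pattern in the sampled completion.
        matches = re.findall(r"The answer is (-?[\d,.]+)", completion)
        return matches[-1].replace(",", "") if matches else None

    prompt = FEW_SHOT.format(question="Janet has 16 eggs and uses 3. How many are left?")
    # `prompt` is sent to the model; extract_answer() parses its completion.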
We compare the performance of GLM-4 with the original GPT-4 [28]. The results are shown in Table 2. We can observe that GLM-4 achieves 96.3% of GPT-4's accuracy on MMLU and outperforms GPT-4 on the other benchmarks. Overall, the base capacity of GLM-4 approaches that of GPT-4-Turbo and Claude 3 Opus.

Table 2: GLM-4 performance on academic benchmarks.

Model                      MMLU   GSM8K  MATH   BBH    GPQA   HumanEval
GPT-4 (0314)               86.4   92.0   52.9   83.1   35.7   67.0
GPT-4 Turbo (1106)         84.7   95.7   64.3   88.3   42.5   83.7
GPT-4 Turbo (2024-04-09)   86.7   95.6   73.4   88.2   49.3   88.2
Claude 3 Opus              86.8   95.0   60.1   86.8   50.4   84.9
Gemini 1.5 Pro             85.9   90.8   67.7   89.2   46.2   84.1
GLM-4-9B-Chat              72.4   79.6   50.6   76.3   28.8   71.8
GLM-4-Air (0605)           81.9   90.9   57.9   80.4   38.4   75.7
GLM-4 (0116)               81.5   87.6   47.9   82.3   35.7   72.0
GLM-4 (0520)               83.3   93.3   61.3   84.7   39.9   78.5
3.2 Evaluation of Instruction Following

We assess the proficiency of GLM-4 in following instructions with the recently introduced IFEval dataset [62]. The dataset comprises 541 prompts derived from 25 distinct instructions that are verifiable through explicit criteria (e.g., "end your email with: P.S. I do like the cake" can be verified via string matching). We adhere to the methodologies outlined in [62] to calculate prompt-level and instruction-level accuracy in both strict mode and loose mode. To further evaluate model performance on following instructions in Chinese, we translate the original prompts into Chinese, omit instructions that are not applicable in Chinese (such as capitalization), and adjust the scoring scripts to accommodate Chinese data.
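Such criteria make scoring fully programmatic: each instruction maps to a check function, and instruction-level accuracy is the share of individual checks that pass. A minimal sketch (the check registry and accuracy helper are illustrative assumptions, not the official IFEval scripts):

    CHECKS = {  # instruction id -> verifier over the model response
        "end_with_ps": lambda r: r.rstrip().endswith("P.S. I do like the cake"),
        "no_commas": lambda r: "," not in r,
        "min_300_words": lambda r: len(r.split()) >= 300,
    }

    def instruction_level_accuracy(samples):
        """samples: list of (response, instruction_ids) pairs."""
        followed = total = 0
        for response, instruction_ids in samples:
            for inst in instruction_ids:
                total += 1
                followed += bool(CHECKS[inst](response))
        return followed / max(total, 1)

    print(instruction_level_accuracy([
        ("Sure, see you! P.S. I do like the cake", ["end_with_ps", "no_commas"]),
    ]))  # 0.5: the ending matches, but the response contains a comma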
Table 3: GLM-4 performance on IFEval [62], an LLM instruction following benchmark. L stands for Loose and S for Strict; P stands for Prompt and I for Instruction.

                           English                  Chinese
Model                      L-P   S-P   L-I   S-I    L-P   S-P   L-I   S-I
GPT-4 (0613)               79.5  77.1  85.5  83.7   72.4  68.9  80.0  75.7
GPT-4 Turbo (1106)         79.1  75.4  85.1  82.4   74.3  69.1  80.8  76.5
GPT-4 Turbo (2024-04-09)   84.5  81.2  88.7  85.9   79.3  72.6  84.2  79.1
Claude 2                   75.0  58.0  81.7  67.7   57.1  46.5  64.9  55.1
Claude 3 Opus              90.6  85.5  93.7  90.0   78.3  73.3  84.3  80.4
GLM-4-9B-Chat              73.0  69.0  80.3  77.2   73.0  69.0  80.3  77.2
GLM-4-Air (0605)           80.4  75.2  86.1  82.3   79.3  71.2  84.0  77.3
GLM-4 (0520)               83.7  79.1  88.7  85.0   79.7  71.9  84.2  78.0

In loose mode, GLM-4 matches the instruction-level accuracy achieved by GPT-4 Turbo in both English and Chinese. In strict mode, GLM-4 achieves 99.0% and 98.6% of the instruction-level accuracy of GPT-4 Turbo (2024-04-09) in English and Chinese, respectively.
3.3 Evaluation of Alignment

AlignBench [23] provides an automatic LLMs-as-judge method to benchmark the alignment of LLMs in Chinese contexts. It consists of 683 queries spanning 8 different categories and evaluates model responses using a GPT-4-based, multi-dimensional, rule-calibrated, pointwise, reference-based scoring method. We evaluate on AlignBench-v1.1, which more carefully improves the reference generation quality, especially by complementing human-collected evidence from webpages (with URLs) for knowledge-related questions, which take up 66.5% of the total queries. On this version, almost all LLMs achieve lower scores than they do on the previous AlignBench.
Table 4: GLM-4 performance on AlignBench [23], an LLM benchmark for alignment in Chinese.

Model                      Math  Logic  Language  Chinese  QA    Writing  Role Play  Professional  Overall
GPT-4 (0613)               7.54  7.17   7.82      7.02     7.39  7.67     8.20       7.29          7.46
GPT-4 Turbo (1106)         7.85  7.66   7.90      7.22     8.24  8.53     8.46       7.95          7.90
GPT-4 Turbo (2024-04-09)   8.32  7.67   7.60      7.57     8.37  7.75     8.18       8.59          8.00
Claude 2                   6.39  5.85   6.75      5.72     6.68  5.87     6.86       6.56          6.26
Claude 3 Opus              7.27  7.11   7.94      7.71     8.21  7.61     7.73       8.02          7.53
Gemini 1.5 Pro             7.07  7.77   7.31      7.22     8.55  7.83     7.79       8.52          7.47
GLM-4-9B-Chat              7.00  6.01   6.69      7.26     7.97  7.59     8.10       7.52          7.01
GLM-4-Air (0605)           7.69  6.95   7.53      8.00     7.90  8.01     8.35       8.09          7.65
GLM-4 (0116)               7.20  7.20   7.60      8.19     8.45  7.88     8.05       8.56          7.66
GLM-4 (0520)               7.89  7.95   8.00      7.86     8.11  8.04     8.06       8.47          8.00

Results are shown in Table 4. GLM-4 outperforms GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro in general, achieving the highest overall score among the baselines. Especially on the Chinese Logic Reasoning and Language Understanding tasks, GLM-4 significantly outperforms all other powerful models. These results demonstrate its strong grasp of the Chinese language and knowledge. The current performance gap between GLM-4 and GPT-4 Turbo (2024-04-09) mostly lies in the Mathematics dimension. We have been employing techniques introduced in ChatGLM-Math [48], such as self-critique, to continuously enhance the GLM models' math reasoning capabilities.
3.4 Evaluation of Long Context Handling Abilities

To assess the performance of GLM-4 on long text tasks, we carry out evaluations on LongBench-Chat [1], a benchmark set with context lengths ranging from 10K to 100K, encompassing a wide range of long text scenarios frequently utilized by users, such as document Q&A, summarization, and coding. To provide a more detailed comparison of GLM-4's performance across languages, we also segregate LongBench-Chat by language into two distinct portions, Chinese and English, and report the results for both segments separately, offering a fine-grained overview of GLM-4's cross-linguistic capabilities.

Regarding the specific evaluation settings, we score the outputs of each model with GPT-4, adopting a few-shot strategy within LongBench-Chat. Moreover, to minimize score variations and reach a more reliable statistical conclusion, we repeated the evaluations multiple times and report the average in Table 5, ensuring that the final performance metric reflects how GLM-4 behaves under diverse conditions. The results clearly suggest that the performance of GLM-4 aligns with that of GPT-4 Turbo and Claude 3 Opus on English prompts, and that it outperforms the best of them on Chinese prompts.
Table 5: GLM-4 performance on LongBench-Chat [1].

Model                      English  Chinese
GPT-4 Turbo (1106)         87.2     71.4
GPT-4 Turbo (2024-04-09)   85.0     82.1
Claude 2                   81.3     76.2
Claude 3 Opus              87.7     82.7
GLM-4-9B-Chat              76.8     79.0
GLM-4-Air (0605)           82.4     81.0
GLM-4 (0520)               87.3     84.0
3.5 Evaluation of Coding Abilities on Real-world User Prompts

While HumanEval [4] has been widely adopted for evaluating LLMs' code generation, most of its problems are about introductory algorithms. In practice, however, users ask complicated questions to complete their daily work, whose difficulty is usually far beyond the scope of HumanEval. Additionally, previous works have reported HumanEval-contaminated training data [28; 19; 50] in their own or other LLMs, making results on HumanEval less trustworthy than before. As a result, besides HumanEval we evaluate GLM-4 on NaturalCodeBench (NCB) [55], a challenging bilingual coding benchmark derived from real user prompts to mirror the complexity of real-world coding tasks. As shown in Table 6, GLM-4's coding performance is close to that of Claude 3 Opus in practical scenarios. While there are still some gaps to the GPT-4 models, considering GLM-4's bilingually balanced nature, there is considerable potential to improve its performance on NCB via better training strategies and data curation in our following iterations.
Table 6: GLM-4 performance on NaturalCodeBench (NCB) [55], a benchmark with real coding prompts in two programming languages (Python and Java) for English and Chinese.

Model                      Python (en)  Java (en)  Python (zh)  Java (zh)  Overall
GPT-4 (0613)               55.7         51.1       53.4         51.1       52.8
GPT-4 Turbo (1106)         51.9         55.0       47.3         51.9       51.5
GPT-4 Turbo (2024-04-09)   57.5         52.3       53.1         52.3       53.8
Claude 2                   34.4         36.6       33.6         32.8       34.4
Claude 3 Opus              48.9         48.9       45.0         50.4       48.3
Gemini 1.5 Pro             45.0         39.7       41.5         43.1       42.3
GLM-4-9B-Chat              33.9         29.8       30.8         34.4       32.2
GLM-4-Air (0605)           40.8         39.7       43.1         39.7       40.8
GLM-4 (0520)               51.6         42.8       45.4         48.9       47.1
3.6 Evaluation of Function Call

To evaluate the performance of GLM models on function call, we carry out evaluations on the Berkeley Function Call Leaderboard [49], a benchmark with 2k question-function-answer pairs. The benchmark evaluates models' ability to call functions in three categories: evaluation by Abstract Syntax Tree (AST), evaluation by executing APIs, and relevance detection. The first category compares the model's output functions against function documents and possible answers with AST analysis. The second category checks for response correctness by executing the generated function calls. Relevance detection evaluates the model's capacity to recognize functions that are not suitable to address the user's question.
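A minimal sketch of the first category, matching an emitted call against a function document and a set of acceptable argument values (the schema and matcher are illustrative assumptions, not the leaderboard's code):

    import ast

    def ast_match(model_output, expected_name, allowed_args):
        """allowed_args: {param: set of acceptable values}; True when the call
        uses the right function and every argument takes a permitted value."""
        call = ast.parse(model_output, mode="eval").body
        if not isinstance(call, ast.Call) or getattr(call.func, "id", None) != expected_name:
            return False
        passed = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
        return all(passed.get(p) in vals for p, vals in allowed_args.items())

    print(ast_match('get_weather(city="Beijing", unit="celsius")',
                    "get_weather",
                    {"city": {"Beijing"}, "unit": {"celsius", "fahrenheit"}}))  # True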
The results are shown in Table 7. We can observe that the function-call capability of GLM-4 (0520) aligns with that of GPT-4 Turbo (2024-04-09), while GLM-4-9B-Chat significantly outperforms Llama-3-8B-Instruct. Another observation is that the overall accuracy does not improve with model size; GLM-4-9B-Chat can even outperform GLM-4-Air. On the other hand, we observe that the performance on execution summary, which evaluates the execution results of real-world APIs, improves smoothly with model size.

Table 7: GLM performance on the Berkeley Function Call Leaderboard.

Model                      AST Summary  Exec Summary  Relevance  Overall
Llama-3-8B-Instruct        59.25        70.01         45.83      58.88
GPT-4 Turbo (2024-04-09)   82.14        78.61         88.75      81.24
GPT-4o (2024-05-13)        85.23        80.37         81.25      82.94
ChatGLM3-6B                62.18        69.78         5.42       57.88
GLM-4-9B-Chat              80.26        84.40         87.92      81.00
GLM-4-Air (0605)           84.34        85.93         68.33      80.94
GLM-4 (0520)               82.59        87.78         84.17      81.76
3.7 Evaluation of Agent Abilities

It is widely observed that LLMs are capable of serving as intelligent agents in versatile environments and contexts [30; 51], known as LLMs-as-Agents [25]. As a result, we evaluate GLM-4 together with other comparison LLMs on AgentBench [25], a comprehensive agentic benchmark for text-based LLMs across an array of practical environments, including code-based, game-based, and web-based contexts. Specifically, we evaluate on 7 out of 8 AgentBench environments, excluding Digital Card Game, which is too time-consuming to interact with. Overall scores are calculated using the original per-dataset weights provided in AgentBench [25].

Table 8: GLM-4 performance on AgentBench [25]. (OS = Operating System, DB = Database, KG = Knowledge Graph, LTP = Lateral Thinking Puzzles, HH = House-Holding, WS = Web Shopping, WB = Web Browsing.)

Model                      OS    DB    KG    LTP   HH    WS    WB    Overall
GPT-4 (0613)               42.4  32.0  58.8  16.6  78.0  61.1  29.0  3.69
GPT-4 Turbo (1106)         40.3  52.7  54.0  17.7  70.0  52.8  30.0  3.77
GPT-4 Turbo (2024-04-09)   41.0  46.7  53.2  19.4  72.0  55.1  19.0  3.68
Claude 2                   18.1  27.3  41.3  8.4   54.0  61.4  0.0   2.03
Claude 3 Opus              23.6  55.0  53.4  20.0  70.0  48.5  28.0  3.62
GLM-4-Air (0605)           31.9  51.0  53.8  12.3  78.0  69.2  30.0  3.58
GLM-4 (0520)               36.8  52.7  51.4  15.3  82.0  68.3  29.0  3.79

The results are presented in Table 8. As it shows, the GLM-4 models present quite impressive performance on agent tasks, with GLM-4-Air comparable to, and GLM-4 outperforming, GPT-4 Turbo and Claude 3 Opus. In terms of specific environments, we find that the GLM-4 series performs especially well on Database, House-Holding, and Web Shopping tasks, while still demonstrating a gap to the GPT-4 series on Operating System, Knowledge Graph, and Lateral Thinking Puzzles. The gap suggests that there is still room for GLM-4 to improve its performance on code-related agentic tasks and highly interactive language tasks.
3.8 Evaluation of All Tools

GLM-4 is further aligned to support intelligent agents and user-configured GLMs functionalities on https://, and the resultant model is GLM-4 All Tools. As mentioned, GLM-4 All Tools can complete complex tasks by autonomously understanding user intent, planning step-by-step instructions, and calling multiple tools, including the web browser, Python interpreter, and the text-to-image model (e.g., CogView3 [59]). Table 9 shows that GLM-4 All Tools (Web) achieves performance similar to that of ChatGPT-4 (Web) when using the Python interpreter for solving math problems and the browser for information seeking.

Table 9: Performance of GLM-4 All Tools.

Tool                Task                 GLM-4 All Tools  GPT-4
                                         (Web, 0116)      (Web, 0110)
Python Interpreter  GSM8K                91.59            92.72
Python Interpreter  MATH                 63.60            65.00
Python Interpreter  Math23K              88.50            88.40
Web Browser         Information Seeking  78.08            67.12
4 Safety and Risks

We are committed to ensuring that GLM-4 operates as a safe, responsible, and unbiased model. In addition to addressing common ethical and fairness concerns, we carefully assess and mitigate potential harms that the model may pose to users in real-world scenarios.

Table 10: GLM-4 performance on SafetyBench [56], compared to GPT-4 models and Claude 3 Opus.

Model                      Ethics &  Illegal     Mental  Offen-    Physical  Privacy &  Unfairness  Overall
                           Morality  Activities  Health  siveness  Health    Property   & Bias
GPT-4 (0613)               92.7      93.3        93.0    87.7      96.7      91.3       73.3        89.7
GPT-4 Turbo (1106)         91.0      92.0        93.0    86.0      92.0      88.7       74.3        88.1
GPT-4 Turbo (2024-04-09)   90.3      91.3        91.7    85.3      92.0      89.3       75.0        87.9
Claude 3 Opus              92.7      91.7        92.7    86.3      94.7      88.7       66.0        87.5
GLM-4 (0520)               92.3      91.3        93.3    86.3      92.3      88.6       66.0        87.2
Risk Mitigation. We carefully cleaned data in the pre-training stage by removing text containing sensitive keywords and web pages from a pre-defined blacklist. In the alignment phase, we evaluate each training sample for safety and remove any that pose potential risks. Harmlessness is also an important criterion for preference alignment when comparing multiple model outputs.

We have a red team that constantly challenges the model with tricky questions that tend to cause unsafe answers. We collect all harmful question-answer pairs from GLM-4 and improve them with human annotations for further model alignment.

Safety Evaluation. We evaluate the GLM-4 model on SafetyBench [56], which assesses each model along 7 dimensions: Ethics and Morality (unethical behaviors), Illegal Activities (basic knowledge of law), Mental Health (adverse impacts on mental health), Offensiveness (offensive behaviors), Physical Health (dangerous behaviors that can cause physical harm), Privacy and Property (privacy breach or property loss), and Unfairness and Bias. We evaluate different models on the Chinese subset of SafetyBench, which is created by removing highly sensitive questions that tend to be censored, to mitigate interference from different API safety policies.

Table 10 shows the safety results of GLM-4 and SOTA models. On most dimensions GLM-4 (0520) shows competitive safety performance; overall, it achieves performance comparable to Claude 3 Opus. GLM-4 slightly falls behind the GPT-4 family, especially on the Physical Health dimension, which demands robust commonsense knowledge about the physical world to avoid potential risks. More efforts have been put into this direction to develop a more capable and safe GLM model.
5 Conclusion

In this report, we introduce the ChatGLM family of large language models, from GLM-130B to GLM-4 (All Tools). Over the past one and a half years, we have made great progress in understanding various perspectives of large language models from our first-hand experiences. With the development of each model generation, the team has learned and applied more effective and efficient strategies for both model pre-training and alignment. The recent ChatGLM models, GLM-4 (0116, 0520), GLM-4-Air (0605), and GLM-4 All Tools, demonstrate significant advancements in understanding and executing complex tasks by autonomously employing external tools and functions. These GLM-4 models have achieved performance on par with, and in some cases surpassing, state-of-the-art models such as GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro, particularly in handling tasks relevant to the Chinese language. In addition, we are committed to promoting the accessibility and safety of LLMs through openly releasing our model weights and the techniques developed throughout this journey. Our open models, including language, code, and vision models, have attracted over 10 million downloads on Hugging Face in the year 2023 alone. Currently, we are working on more capable models with everything we have learned to date. In the future, we will continue democratizing cutting-edge LLM technologies through open sourcing, and push the boundary of model capabilities towards the mission of teaching machines to think like humans.

Acknowledgement. We would like to thank all the data annotators, infra operating staff, collaborators, and partners, as well as everyone at Zhipu AI and Tsinghua University not explicitly mentioned in the report, who have provided support, feedback, and contributions to ChatGLM. We would also like to thank Yuxuan Zhang and Wei Jia from Zhipu AI, as well as the teams at Hugging Face, ModelScope, WiseModel, and others, for their help with the open-sourcing efforts of the GLM family of models.

References
133、 inthe report who have provided support,feedback,and contributed to ChatGLM.We would also like tothank Yuxuan Zhang and Wei Jia from Zhipu AI as well as the teams at Hugging Face,ModelScope,WiseModel,and others for their help on the open-sourcing efforts of the GLM family of models.References1Y.Bai,
134、X.Lv,J.Zhang,Y.He,J.Qi,L.Hou,J.Tang,Y.Dong,and J.Li.Longalign:A recipefor long context alignment of large language models,2024.2Y.Bai,X.Lv,J.Zhang,H.Lyu,J.Tang,Z.Huang,Z.Du,X.Liu,A.Zeng,L.Hou,Y.Dong,J.Tang,and J.Li.Longbench:A bilingual,multitask benchmark for long context understanding,2023.3T.B.Br
135、own,B.Mann,N.Ryder,M.Subbiah,J.Kaplan,P.Dhariwal,A.Neelakantan,P.Shyam,G.Sastry,A.Askell,S.Agarwal,A.Herbert-Voss,G.Krueger,T.Henighan,R.Child,A.Ramesh,D.M.Ziegler,J.Wu,C.Winter,C.Hesse,M.Chen,E.Sigler,M.Litwin,S.Gray,B.Chess,J.Clark,C.Berner,S.McCandlish,A.Radford,I.Sutskever,and D.Amodei.Lan-guage
136、 models are few-shot learners.In Proceedings of the 34th International Conferenceon Neural Information Processing Systems,NIPS20,Red Hook,NY,USA,2020.CurranAssociates Inc.4M.Chen,J.Tworek,H.Jun,Q.Yuan,H.P.de Oliveira Pinto,J.Kaplan,H.Edwards,Y.Burda,N.Joseph,G.Brockman,A.Ray,R.Puri,G.Krueger,M.Petro
137、v,H.Khlaaf,G.Sastry,P.Mishkin,B.Chan,S.Gray,N.Ryder,M.Pavlov,A.Power,L.Kaiser,M.Bavarian,C.Winter,P.Tillet,F.P.Such,D.Cummings,M.Plappert,F.Chantzis,E.Barnes,A.Herbert-Voss,W.H.Guss,A.Nichol,A.Paino,N.Tezak,J.Tang,I.Babuschkin,S.Balaji,S.Jain,W.Saunders,C.Hesse,A.N.Carr,J.Leike,J.Achiam,V.Misra,E.Mo
138、rikawa,A.Radford,M.Knight,M.Brundage,M.Murati,K.Mayer,P.Welinder,B.McGrew,D.Amodei,S.McCandlish,I.Sutskever,and W.Zaremba.Evaluating large language models trained on code.CoRR,abs/2107.03374,2021.5S.Chen,S.Wong,L.Chen,and Y.Tian.Extending context window of large language modelsvia positional interpo
139、lation.arXiv preprint arXiv:2306.15595,2023.6A.Chowdhery,S.Narang,J.Devlin,M.Bosma,G.Mishra,A.Roberts,P.Barham,H.W.Chung,C.Sutton,S.Gehrmann,et al.Palm:Scaling language modeling with pathways.arXivpreprint arXiv:2204.02311,2022.7K.Cobbe,V.Kosaraju,M.Bavarian,M.Chen,H.Jun,L.Kaiser,M.Plappert,J.Tworek
140、,J.Hilton,R.Nakano,C.Hesse,and J.Schulman.Training verifi ers to solve math wordproblems.CoRR,abs/2110.14168,2021.138T.Dao,D.Fu,S.Ermon,A.Rudra,and C.R.Flashattention:Fast and memory-effi cient exactattention with io-awareness.Advances in Neural Information Processing Systems,35:1634416359,2022.9M.D
141、ing,Z.Yang,W.Hong,W.Zheng,C.Zhou,D.Yin,J.Lin,X.Zou,Z.Shao,H.Yang,andJ.Tang.Cogview:Mastering text-to-image generation via transformers,2021.10M.Ding,W.Zheng,W.Hong,and J.Tang.Cogview2:Faster and better text-to-imagegeneration via hierarchical transformers.Advances in Neural Information Processing Sy
142、stems,35:1689016902,2022.11Z.Du,Y.Qian,X.Liu,M.Ding,J.Qiu,Z.Yang,and J.Tang.Glm:General language modelpretraining with autoregressive blank infi lling.In Proceedings of the 60th Annual Meeting ofthe Association for Computational Linguistics(Volume 1:Long Papers),pages 320335,2022.12Z.Du,A.Zeng,Y.Don
143、g,and J.Tang.Understanding emergent abilities of language modelsfrom the loss perspective,2024.13T.GLM.Chatglm-6b:An open bilingual dialogue language model.https:/ J.Steinhardt.Mea-suring massive multitask language understanding.In International Conference on LearningRepresentations,2021.15D.Hendryc
144、ks and K.Gimpel.Gaussian error linear units(gelus).arXiv preprintarXiv:1606.08415,2016.16W.Hong,W.Wang,Q.Lv,J.Xu,W.Yu,J.Ji,Y.Wang,Z.Wang,Y.Zhang,J.Li,B.Xu,Y.Dong,M.Ding,and J.Tang.Cogagent:A visual language model for gui agents,2023.17Z.Hou,Y.Niu,Z.Du,X.Zhang,X.Liu,A.Zeng,Q.Zheng,M.Huang,H.Wang,J.Ta
145、ng,and Y.Dong.Chatglm-rlhf:Practices of aligning large language models with human feedback,2024.18H.Lai,X.Liu,I.L.Iong,S.Yao,Y.Chen,P.Shen,H.Yu,H.Zhang,X.Zhang,Y.Dong,et al.Autowebglm:Bootstrap and reinforce a large language model-based web navigating agent.arXiv preprint arXiv:2404.03648,2024.19Y.L
146、i,S.Bubeck,R.Eldan,A.D.Giorno,S.Gunasekar,and Y.T.Lee.Textbooks are all youneed ii:phi-1.5 technical report,2023.20P.Liang,R.Bommasani,T.Lee,D.Tsipras,D.Soylu,M.Yasunaga,Y.Zhang,D.Narayanan,Y.Wu,A.Kumar,B.Newman,B.Yuan,B.Yan,C.Zhang,C.Cosgrove,C.D.Manning,C.R,D.Acosta-Navas,D.A.Hudson,E.Zelikman,E.D
147、urmus,F.Ladhak,F.Rong,H.Ren,H.Yao,J.Wang,K.Santhanam,L.Orr,L.Zheng,M.Yuksekgonul,M.Suzgun,N.Kim,N.Guha,N.Chatterji,O.Khattab,P.Henderson,Q.Huang,R.Chi,S.M.Xie,S.Santurkar,S.Ganguli,T.Hashimoto,T.Icard,T.Zhang,V.Chaudhary,W.Wang,X.Li,Y.Mai,Y.Zhang,and Y.Koreeda.Holistic evaluation of language models,
148、2023.21M.Liu,A.Zeng,B.Wang,P.Zhang,J.Tang,and Y.Dong.Apar:Llms can do auto-parallelauto-regressive decoding.ArXiv,abs/2401.06761,2024.22X.Liu,H.Lai,H.Yu,Y.Xu,A.Zeng,Z.Du,P.Zhang,Y.Dong,and J.Tang.Webglm:Towards an effi cient web-enhanced question answering system with human preferences.InProceedings
149、 of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining,pages 45494560,2023.23X.Liu,X.Lei,S.Wang,Y.Huang,Z.Feng,B.Wen,J.Cheng,P.Ke,Y.Xu,W.L.Tam,X.Zhang,L.Sun,H.Wang,J.Zhang,M.Huang,Y.Dong,and J.Tang.Alignbench:Benchmarkingchinese alignment of large language models,2023.24X.Liu,X.So
150、ng,Y.Dong,and J.Tang.Extensive self-contrast enables feedback-free languagemodel alignment,2024.1425X.Liu,H.Yu,H.Zhang,Y.Xu,X.Lei,H.Lai,Y.Gu,H.Ding,K.Men,K.Yang,S.Zhang,X.Deng,A.Zeng,Z.Du,C.Zhang,S.Shen,T.Zhang,Y.Su,H.Sun,M.Huang,Y.Dong,andJ.Tang.Agentbench:Evaluating llms as agents,2023.26Meta.Intr
151、oducing meta llama 3:The most capable openly available llm to date.https:/ OpenAI.tiktoken.https:/ R.OpenAI.Gpt-4 technical report.arXiv,pages 230308774,2023.29L.Ouyang,J.Wu,X.Jiang,D.Almeida,C.Wainwright,P.Mishkin,C.Zhang,S.Agarwal,K.Slama,A.Ray,et al.Training language models to follow instructions
152、 with human feedback.Advances in Neural Information Processing Systems,35:2773027744,2022.30J.S.Park,J.OBrien,C.J.Cai,M.R.Morris,P.Liang,and M.S.Bernstein.Generative agents:Interactive simulacra of human behavior.In Proceedings of the 36th Annual ACM Symposiumon User Interface Software and Technolog
153、y,pages 122,2023.31O.Press,N.Smith,and M.Lewis.Train short,test long:Attention with linear biases enablesinput length extrapolation.In International Conference on Learning Representations,2022.32D.Rein,B.L.Hou,A.C.Stickland,J.Petty,R.Y.Pang,J.Dirani,J.Michael,and S.R.Bowman.GPQA:A graduate-level goo
154、gle-proof q&a benchmark.CoRR,abs/2311.12022,2023.33T.L.Scao,A.Fan,C.Akiki,E.Pavlick,S.Ili c,D.Hesslow,R.Castagn,A.S.Luccioni,F.Yvon,M.Gall,et al.Bloom:A 176b-parameter open-access multilingual language model.arXiv preprint arXiv:2211.05100,2022.34R.Sennrich,B.Haddow,and A.Birch.Neural machine transl
155、ation of rare words with subwordunits.In Proceedings of the 54th Annual Meeting of the Association for ComputationalLinguistics(Volume 1:Long Papers),pages 17151725,Berlin,Germany,2016.Associationfor Computational Linguistics.35N.Shazeer.Fast transformer decoding:One write-head is all you need.arXiv
156、 preprintarXiv:1911.02150,2019.36 N.Shazeer.Glu variants improve transformer,2020.37A.Srivastava,A.Rastogi,A.Rao,A.A.M.Shoeb,A.Abid,A.Fisch,A.R.Brown,A.Santoro,A.Gupta,A.Garriga-Alonso,A.Kluska,A.Lewkowycz,A.Agarwal,A.Power,A.Ray,A.Warstadt,A.W.Kocurek,A.Safaya,A.Tazarv,A.Xiang,A.Parrish,A.Nie,A.Hus
157、sain,A.Askell,A.Dsouza,A.Rahane,A.S.Iyer,A.Andreassen,A.Santilli,A.Stuhlmller,A.M.Dai,A.La,A.K.Lampinen,A.Zou,A.Jiang,A.Chen,A.Vuong,A.Gupta,A.Gottardi,A.Norelli,A.Venkatesh,A.Gholamidavoodi,A.Tabassum,A.Menezes,A.Kirubarajan,A.Mullokandov,A.Sabharwal,A.Herrick,A.Efrat,A.Erdem,A.Karakas,and et al.Be
158、yondthe imitation game:Quantifying and extrapolating the capabilities of language models.CoRR,abs/2206.04615,2022.38J.Su,Y.Lu,S.Pan,A.Murtadha,B.Wen,and Y.Liu.Roformer:Enhanced transformer withrotary position embedding.arXiv preprint arXiv:2104.09864,2021.39M.Suzgun,N.Scales,N.Schrli,S.Gehrmann,Y.Ta
159、y,H.W.Chung,A.Chowdhery,Q.V.Le,E.H.Chi,D.Zhou,and J.Wei.Challenging big-bench tasks and whether chain-of-thoughtcan solve them.In A.Rogers,J.L.Boyd-Graber,and N.Okazaki,editors,Findings of theAssociation for Computational Linguistics:ACL 2023,Toronto,Canada,July 9-14,2023,pages1300313051.Association
160、 for Computational Linguistics,2023.40G.Team,R.Anil,S.Borgeaud,Y.Wu,J.-B.Alayrac,J.Yu,R.Soricut,J.Schalkwyk,A.M.Dai,A.Hauth,K.Millican,D.Silver,S.Petrov,M.Johnson,I.Antonoglou,J.Schrittwieser,A.Glaese,J.Chen,E.Pitler,T.Lillicrap,A.Lazaridou,O.Firat,J.Molloy,M.Isard,P.R.Barham,T.Hennigan,B.Lee,F.Viol
161、a,M.Reynolds,Y.Xu,R.Doherty,E.Collins,C.Meyer,15E.Rutherford,E.Moreira,K.Ayoub,M.Goel,G.Tucker,E.Piqueras,M.Krikun,I.Barr,N.Savinov,I.Danihelka,B.Roelofs,A.White,A.Andreassen,T.von Glehn,L.Yagati,M.Kazemi,L.Gonzalez,M.Khalman,J.Sygnowski,A.Frechette,C.Smith,L.Culp,L.Pro-leev,Y.Luan,X.Chen,J.Lottes,N
162、.Schucher,F.Lebron,A.Rrustemi,N.Clay,P.Crone,T.Kocisky,J.Zhao,B.Perz,D.Yu,H.Howard,A.Bloniarz,J.W.Rae,H.Lu,L.Sifre,M.Maggioni,F.Alcober,D.Garrette,M.Barnes,S.Thakoor,J.Austin,G.Barth-Maron,W.Wong,R.Joshi,R.Chaabouni,D.Fatiha,A.Ahuja,R.Liu,Y.Li,S.Cogan,J.Chen,C.Jia,C.Gu,Q.Zhang,J.Grimstad,A.J.Hartman
163、,M.Chadwick,G.S.Tomar,X.Garcia,E.Senter,E.Taropa,T.S.Pillai,J.Devlin,M.Laskin,D.de Las Casas,D.Valter,C.Tao,L.Blanco,A.P.Badia,D.Reitter,M.Chen,J.Brennan,C.Rivera,S.Brin,S.Iqbal,G.Surita,J.Labanowski,A.Rao,S.Winkler,E.Parisotto,Y.Gu,K.Olszewska,Y.Zhang,R.Addanki,A.Miech,A.Louis,L.E.Shafey,D.Teplyash
164、in,G.Brown,E.Catt,N.Attaluri,J.Balaguer,J.Xiang,P.Wang,Z.Ashwood,A.Briukhov,A.Webson,S.Ganapathy,S.Sanghavi,A.Kannan,M.-W.Chang,A.Stjerngren,J.Djolonga,Y.Sun,A.Bapna,M.Aitchison,P.Pejman,H.Michalewski,T.Yu,C.Wang,J.Love,J.Ahn,D.Bloxwich,K.Han,P.Humphreys,T.Sellam,J.Bradbury,V.God-bole,S.Samangooei,B
165、.Damoc,A.Kaskasoli,S.M.R.Arnold,V.Vasudevan,S.Agrawal,J.Riesa,D.Lepikhin,R.Tanburn,S.Srinivasan,H.Lim,S.Hodkinson,P.Shyam,J.Ferret,S.Hand,A.Garg,T.L.Paine,J.Li,Y.Li,M.Giang,A.Neitz,Z.Abbas,S.York,M.Reid,E.Cole,A.Chowdhery,D.Das,D.Rogozi nska,V.Nikolaev,P.Sprechmann,Z.Nado,L.Zilka,F.Prost,L.He,M.Mont
166、eiro,G.Mishra,C.Welty,J.Newlan,D.Jia,M.Allamanis,C.H.Hu,R.de Liedekerke,J.Gilmer,C.Saroufi m,S.Rijhwani,S.Hou,D.Shrivastava,A.Baddepudi,A.Goldin,A.Ozturel,A.Cassirer,Y.Xu,D.Sohn,D.Sachan,R.K.Amplayo,C.Swan-son,D.Petrova,S.Narayan,A.Guez,S.Brahma,J.Landon,M.Patel,R.Zhao,K.Villela,L.Wang,W.Jia,M.Rahtz
167、,M.Gimnez,L.Yeung,H.Lin,J.Keeling,P.Georgiev,D.Mincu,B.Wu,S.Haykal,R.Saputro,K.Vodrahalli,J.Qin,Z.Cankara,A.Sharma,N.Fernando,W.Hawkins,B.Neyshabur,S.Kim,A.Hutter,P.Agrawal,A.Castro-Ros,G.van den Driessche,T.Wang,F.Yang,S.yiin Chang,P.Komarek,R.McIlroy,M.Lu ci c,G.Zhang,W.Farhan,M.Sharman,P.Natsev,P
168、.Michel,Y.Cheng,Y.Bansal,S.Qiao,K.Cao,S.Shakeri,C.But-terfi eld,J.Chung,P.K.Rubenstein,S.Agrawal,A.Mensch,K.Soparkar,K.Lenc,T.Chung,A.Pope,L.Maggiore,J.Kay,P.Jhakra,S.Wang,J.Maynez,M.Phuong,T.Tobin,A.Tacchetti,M.Trebacz,K.Robinson,Y.Katariya,S.Riedel,P.Bailey,K.Xiao,N.Ghelani,L.Aroyo,A.Slone,N.Houls
169、by,X.Xiong,Z.Yang,E.Gribovskaya,J.Adler,M.Wirth,L.Lee,M.Li,T.Kagohara,J.Pavagadhi,S.Bridgers,A.Bortsova,S.Ghemawat,Z.Ahmed,T.Liu,R.Pow-ell,V.Bolina,M.Iinuma,P.Zablotskaia,J.Besley,D.-W.Chung,T.Dozat,R.Comanescu,X.Si,J.Greer,G.Su,M.Polacek,R.L.Kaufman,S.Tokumine,H.Hu,E.Buchatskaya,Y.Miao,M.Elhawaty,A
170、.Siddhant,N.Tomasev,J.Xing,C.Greer,H.Miller,S.Ashraf,A.Roy,Z.Zhang,A.Ma,A.Filos,M.Besta,R.Blevins,T.Klimenko,C.-K.Yeh,S.Changpinyo,J.Mu,O.Chang,M.Pajarskas,C.Muir,V.Cohen,C.L.Lan,K.Haridasan,A.Marathe,S.Hansen,S.Douglas,R.Samuel,M.Wang,S.Austin,C.Lan,J.Jiang,J.Chiu,J.A.Lorenzo,L.L.Sjsund,S.Cevey,Z.G
171、leicher,T.Avrahami,A.Boral,H.Srinivasan,V.Selo,R.May,K.Aiso-pos,L.Hussenot,L.B.Soares,K.Baumli,M.B.Chang,A.Recasens,B.Caine,A.Pritzel,F.Pavetic,F.Pardo,A.Gergely,J.Frye,V.Ramasesh,D.Horgan,K.Badola,N.Kassner,S.Roy,E.Dyer,V.Campos,A.Tomala,Y.Tang,D.E.Badawy,E.White,B.Mustafa,O.Lang,A.Jindal,S.Vikram,
172、Z.Gong,S.Caelles,R.Hemsley,G.Thornton,F.Feng,W.Stokowiec,C.Zheng,P.Thacker,aglar nl,Z.Zhang,M.Saleh,J.Svensson,M.Bileschi,P.Patil,A.Anand,R.Ring,K.Tsihlas,A.Vezer,M.Selvi,T.Shevlane,M.Rodriguez,T.Kwiatkowski,S.Daruki,K.Rong,A.Dafoe,N.FitzGerald,K.Gu-Lemberg,M.Khan,L.A.Hendricks,M.Pellat,V.Feinberg,J
173、.Cobon-Kerr,T.Sainath,M.Rauh,S.H.Hashemi,R.Ives,Y.Hasson,Y.Li,E.Noland,Y.Cao,N.Byrd,L.Hou,Q.Wang,T.Sottiaux,M.Paganini,J.-B.Lespiau,A.Moufarek,S.Hassan,K.Shivakumar,J.van Amersfoort,A.Mandhane,P.Joshi,A.Goyal,M.Tung,A.Brock,H.Sheahan,V.Misra,C.Li,N.Raki cevi c,M.Dehghani,F.Liu,S.Mittal,J.Oh,S.Noury,
174、E.Sezener,F.Huot,M.Lamm,N.D.Cao,C.Chen,G.Elsayed,E.Chi,M.Mahdieh,I.Tenney,N.Hua,I.Petrychenko,P.Kane,D.Scandinaro,R.Jain,J.Uesato,R.Datta,A.Sadovsky,O.Bunyan,D.Rabiej,S.Wu,J.Zhang,G.Vasudevan,E.Leurent,M.Al-nahlawi,I.Georgescu,N.Wei,I.Zheng,B.Chan,P.G.Rabinovitch,P.Stanczyk,Y.Zhang,D.Steiner,S.Naska
175、r,M.Azzam,M.Johnson,A.Paszke,C.-C.Chiu,J.S.Elias,A.Mohiud-din,F.Muhammad,J.Miao,A.Lee,N.Vieillard,S.Potluri,J.Park,E.Davoodi,J.Zhang,J.Stanway,D.Garmon,A.Karmarkar,Z.Dong,J.Lee,A.Kumar,L.Zhou,J.Evens,W.Isaac,Z.Chen,J.Jia,A.Levskaya,Z.Zhu,C.Gorgolewski,P.Grabowski,Y.Mao,A.Magni,K.Yao,16J.Snaider,N.Ca
176、sagrande,P.Suganthan,E.Palmer,G.Irving,E.Loper,M.Faruqui,I.Arkatkar,N.Chen,I.Shafran,M.Fink,A.Castao,I.Giannoumis,W.Kim,M.Rybinski,A.Sreevatsa,J.Prendki,D.Soergel,A.Goedeckemeyer,W.Gierke,M.Jafari,M.Gaba,J.Wiesner,D.G.Wright,Y.Wei,H.Vashisht,Y.Kulizhskaya,J.Hoover,M.Le,L.Li,C.Iwuanyanwu,L.Liu,K.Rami
177、rez,A.Khorlin,A.Cui,T.LIN,M.Georgiev,M.Wu,R.Aguilar,K.Pallo,A.Chakladar,A.Repina,X.Wu,T.van der Weide,P.Ponnapalli,C.Kaplan,J.Simsa,S.Li,O.Dousse,F.Yang,J.Piper,N.Ie,M.Lui,R.Pasumarthi,N.Lintz,A.Vijayakumar,L.N.Thiet,D.Andor,P.Valenzuela,C.Paduraru,D.Peng,K.Lee,S.Zhang,S.Greene,D.D.Nguyen,P.Kurylowi
178、cz,S.Velury,S.Krause,C.Hardin,L.Dixon,L.Janzer,K.Choo,Z.Feng,B.Zhang,A.Singhal,T.Latkar,M.Zhang,Q.Le,E.A.Abellan,D.Du,D.McKinnon,N.Antropova,T.Boluk-basi,O.Keller,D.Reid,D.Finchelstein,M.A.Raad,R.Crocker,P.Hawkins,R.Dadashi,C.Gaffney,S.Lall,K.Franko,E.Filonov,A.Bulanova,R.Leblond,V.Yadav,S.Chung,H.A
179、skham,L.C.Cobo,K.Xu,F.Fischer,J.Xu,C.Sorokin,C.Alberti,C.-C.Lin,C.Evans,H.Zhou,A.Dimitriev,H.Forbes,D.Banarse,Z.Tung,J.Liu,M.Omernick,C.Bishop,C.Kumar,R.Sterneck,R.Foley,R.Jain,S.Mishra,J.Xia,T.Bos,G.Cideron,E.Amid,F.Piccinno,X.Wang,P.Banzal,P.Gurita,H.Noga,P.Shah,D.J.Mankowitz,A.Polozov,N.Kushman,V
180、.Krakovna,S.Brown,M.Bateni,D.Duan,V.Firoiu,M.Thotakuri,T.Natan,A.Mohananey,M.Geist,S.Mudgal,S.Girgin,H.Li,J.Ye,O.Roval,R.Tojo,M.Kwong,J.Lee-Thorp,C.Yew,Q.Yuan,S.Bagri,D.Sinopalnikov,S.Ramos,J.Mellor,A.Sharma,A.Severyn,J.Lai,K.Wu,H.-T.Cheng,D.Miller,N.Sonnerat,D.Vnukov,R.Greig,J.Beattie,E.Caveness,L.
181、Bai,J.Eisenschlos,A.Korchemniy,T.Tsai,M.Jasarevic,W.Kong,P.Dao,Z.Zheng,F.Liu,F.Yang,R.Zhu,M.Geller,T.H.Teh,J.Sanmiya,E.Gladchenko,N.Trdin,A.Sozanschi,D.Toyama,E.Rosen,S.Tavakkol,L.Xue,C.Elkind,O.Woodman,J.Carpenter,G.Papamakarios,R.Kemp,S.Kafle,T.Grunina,R.Sinha,A.Talbert,A.Goyal,D.Wu,D.Owusu-Afriyi
182、e,C.Du,C.Thornton,J.Pont-Tuset,P.Narayana,J.Li,S.Fatehi,J.Wieting,O.Ajmeri,B.Uria,T.Zhu,Y.Ko,L.Knight,A.Hliou,N.Niu,S.Gu,C.Pang,D.Tran,Y.Li,N.Levine,A.Stolovich,N.Kalb,R.Santamaria-Fernandez,S.Goenka,W.Yustalim,R.Strudel,A.Elqursh,B.Lakshminarayanan,C.Deck,S.Upadhyay,H.Lee,M.Dusenberry,Z.Li,X.Wang,K
183、.Levin,R.Hoffmann,D.Holtmann-Rice,O.Bachem,S.Yue,S.Arora,E.Malmi,D.Mirylenka,Q.Tan,C.Koh,S.H.Yeganeh,S.Pder,S.Zheng,F.Pongetti,M.Tariq,Y.Sun,L.Ionita,M.Seyedhosseini,P.Tafti,R.Kotikalapudi,Z.Liu,A.Gulati,J.Liu,X.Ye,B.Chrzaszcz,L.Wang,N.Sethi,T.Li,B.Brown,S.Singh,W.Fan,A.Parisi,J.Stanton,C.Kuang,V.Ko
184、verkathu,C.A.Choquette-Choo,Y.Li,T.Lu,A.Ittycheriah,P.Shroff,P.Sun,M.Varadarajan,S.Bahargam,R.Willoughby,D.Gaddy,I.Dasgupta,G.Desjardins,M.Cor-nero,B.Robenek,B.Mittal,B.Albrecht,A.Shenoy,F.Moiseev,H.Jacobsson,A.Ghaffarkhah,M.Rivire,A.Walton,C.Crepy,A.Parrish,Y.Liu,Z.Zhou,C.Farabet,C.Radebaugh,P.Srin
185、i-vasan,C.van der Salm,A.Fidjeland,S.Scellato,E.Latorre-Chimoto,H.Klimczak-Plucinska,D.Bridson,D.de Cesare,T.Hudson,P.Mendolicchio,L.Walker,A.Morris,I.Penchev,M.Mauger,A.Guseynov,A.Reid,S.Odoom,L.Loher,V.Cotruta,M.Yenugula,D.Grewe,A.Petrushkina,T.Duerig,A.Sanchez,S.Yadlowsky,A.Shen,A.Globerson,A.Kur
186、zrok,L.Webb,S.Dua,D.Li,P.Lahoti,S.Bhupatiraju,D.Hurt,H.Qureshi,A.Agarwal,T.Shani,M.Eyal,A.Khare,S.R.Belle,L.Wang,C.Tekur,M.S.Kale,J.Wei,R.Sang,B.Saeta,T.Liechty,Y.Sun,Y.Zhao,S.Lee,P.Nayak,D.Fritz,M.R.Vuyyuru,J.Aslanides,N.Vyas,M.Wicke,X.Ma,T.Bilal,E.Eltyshev,D.Balle,N.Martin,H.Cate,J.Manyika,K.Amiri
187、,Y.Kim,X.Xiong,K.Kang,F.Luisier,N.Tripuraneni,D.Madras,M.Guo,A.Waters,O.Wang,J.Ainslie,J.Baldridge,H.Zhang,G.Pruthi,J.Bauer,F.Yang,R.Mansour,J.Gelman,Y.Xu,G.Polovets,J.Liu,H.Cai,W.Chen,X.Sheng,E.Xue,S.Ozair,A.Yu,C.Angermueller,X.Li,W.Wang,J.Wiesinger,E.Koukoumidis,Y.Tian,A.Iyer,M.Gurumurthy,M.Golden
188、son,P.Shah,M.Blake,H.Yu,A.Urbanowicz,J.Palomaki,C.Fernando,K.Brooks,K.Durden,H.Mehta,N.Momchev,E.Rahimtoroghi,M.Georgaki,A.Raul,S.Ruder,M.Redshaw,J.Lee,K.Jalan,D.Li,G.Perng,B.Hechtman,P.Schuh,M.Nasr,M.Chen,K.Milan,V.Mikulik,T.Strohman,J.Franco,T.Green,D.Hassabis,K.Kavukcuoglu,J.Dean,and O.Vinyals.Ge
189、mini:A family of highly capable multimodal models,2023.41H.Touvron,T.Lavril,G.Izacard,X.Martinet,M.-A.Lachaux,T.Lacroix,B.Rozire,N.Goyal,E.Hambro,F.Azhar,A.Rodriguez,A.Joulin,E.Grave,and G.Lample.Llama:Open andeffi cient foundation language models,2023.42H.Touvron,L.Martin,K.Stone,P.Albert,A.Almahai
190、ri,Y.Babaei,N.Bashlykov,S.Batra,P.Bhargava,S.Bhosale,D.Bikel,L.Blecher,C.C.Ferrer,M.Chen,G.Cucurull,D.Esiobu,17J.Fernandes,J.Fu,W.Fu,B.Fuller,C.Gao,V.Goswami,N.Goyal,A.Hartshorn,S.Hosseini,R.Hou,H.Inan,M.Kardas,V.Kerkez,M.Khabsa,I.Kloumann,A.Korenev,P.S.Koura,M.-A.Lachaux,T.Lavril,J.Lee,D.Liskovich,
191、Y.Lu,Y.Mao,X.Martinet,T.Mihaylov,P.Mishra,I.Molybog,Y.Nie,A.Poulton,J.Reizenstein,R.Rungta,K.Saladi,A.Schelten,R.Silva,E.M.Smith,R.Subramanian,X.E.Tan,B.Tang,R.Taylor,A.Williams,J.X.Kuan,P.Xu,Z.Yan,I.Zarov,Y.Zhang,A.Fan,M.Kambadur,S.Narang,A.Rodriguez,R.Stojnic,S.Edunov,andT.Scialom.Llama 2:Open fou
192、ndation and fi ne-tuned chat models,2023.43A.Vaswani,N.Shazeer,N.Parmar,J.Uszkoreit,L.Jones,A.N.Gomez,L.Kaiser,andI.Polosukhin.Attention is all you need,2023.44H.Wang,S.Ma,L.Dong,S.Huang,D.Zhang,and F.Wei.Deepnet:Scaling transformers to1,000 layers,2022.45W.Wang,Q.Lv,W.Yu,W.Hong,J.Qi,Y.Wang,J.Ji,Z.Y
193、ang,L.Zhao,X.Song,J.Xu,B.Xu,J.Li,Y.Dong,M.Ding,and J.Tang.Cogvlm:Visual expert for pretrained language models,2023.46J.Wei,X.Wang,D.Schuurmans,M.Bosma,B.Ichter,F.Xia,E.H.Chi,Q.V.Le,andD.Zhou.Chain-of-thought prompting elicits reasoning in large language models.In S.Koyejo,S.Mohamed,A.Agarwal,D.Belgr
194、ave,K.Cho,and A.Oh,editors,Advances in NeuralInformation Processing Systems 35:Annual Conference on Neural Information ProcessingSystems 2022,NeurIPS 2022,New Orleans,LA,USA,November 28-December 9,2022,2022.47W.Xiong,J.Liu,I.Molybog,H.Zhang,P.Bhargava,R.Hou,L.Martin,R.Rungta,K.A.Sankararaman,B.Oguz,
195、et al.Effective long-context scaling of foundation models.arXivpreprint arXiv:2309.16039,2023.48Y.Xu,X.Liu,X.Liu,Z.Hou,Y.Li,X.Zhang,Z.Wang,A.Zeng,Z.Du,W.Zhao,J.Tang,and Y.Dong.Chatglm-math:Improving math problem-solving in large language models with aself-critique pipeline,2024.49F.Yan,H.Mao,C.C.-J.
196、Ji,T.Zhang,S.G.Patil,I.Stoica,and J.E.Gonzalez.Berkeley functioncalling leaderboard.2024.50S.Yang,W.-L.Chiang,L.Zheng,J.E.Gonzalez,and I.Stoica.Rethinking benchmark andcontamination for language models with rephrased samples.arXiv preprint arXiv:2311.04850,2023.51S.Yao,J.Zhao,D.Yu,N.Du,I.Shafran,K.N
197、arasimhan,and Y.Cao.React:Synergizingreasoning and acting in language models.arXiv preprint arXiv:2210.03629,2022.52A.Zeng,M.Liu,R.Lu,B.Wang,X.Liu,Y.Dong,and J.Tang.Agenttuning:Enablinggeneralized agent abilities for llms,2023.53A.Zeng,X.Liu,Z.Du,Z.Wang,H.Lai,M.Ding,Z.Yang,Y.Xu,W.Zheng,X.Xia,et al.G
198、lm-130b:An open bilingual pre-trained model.arXiv preprint arXiv:2210.02414,2022.54S.Zhang,S.Roller,N.Goyal,M.Artetxe,M.Chen,S.Chen,C.Dewan,M.Diab,X.Li,X.V.Lin,etal.Opt:Openpre-trainedtransformerlanguagemodels.arXivpreprintarXiv:2205.01068,2022.55S.Zhang,H.Zhao,X.Liu,Q.Zheng,Z.Qi,X.Gu,X.Zhang,Y.Dong
199、,and J.Tang.Natural-codebench:Examining coding performance mismatch on humaneval and natural user prompts.arXiv preprint arXiv:2405.04520,2024.56Z.Zhang,L.Lei,L.Wu,R.Sun,Y.Huang,C.Long,X.Liu,X.Lei,J.Tang,and M.Huang.Safetybench:Evaluating the safety of large language models with multiple choice ques
200、tions.arXiv preprint arXiv:2309.07045,2023.57W.X.Zhao,K.Zhou,J.Li,T.Tang,X.Wang,Y.Hou,Y.Min,B.Zhang,J.Zhang,Z.Dong,et al.A survey of large language models.arXiv preprint arXiv:2303.18223,2023.1858Q.Zheng,X.Xia,X.Zou,Y.Dong,S.Wang,Y.Xue,Z.Wang,L.Shen,A.Wang,Y.Li,T.Su,Z.Yang,and J.Tang.Codegeex:A pre-
201、trained model for code generation with multilingualevaluations on humaneval-x,2023.59W.Zheng,J.Teng,Z.Yang,W.Wang,J.Chen,X.Gu,Y.Dong,M.Ding,and J.Tang.Cogview3:Finer and faster text-to-image generation via relay diffusion,2024.60C.Zhou,P.Liu,P.Xu,S.Iyer,J.Sun,Y.Mao,X.Ma,A.Efrat,P.Yu,L.Yu,S.Zhang,G.G
202、hosh,M.Lewis,L.Zettlemoyer,and O.Levy.Lima:Less is more for alignment,2023.61J.Zhou,Z.Chen,D.Wan,B.Wen,Y.Song,J.Yu,Y.Huang,L.Peng,J.Yang,X.Xiao,et al.Characterglm:Customizing chinese conversational ai characters with large language models.arXiv preprint arXiv:2311.16832,2023.62J.Zhou,T.Lu,S.Mishra,S.Brahma,S.Basu,Y.Luan,D.Zhou,and L.Hou.Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911,2023.19