DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

Qihao Zhu*, Daya Guo*, Zhihong Shao*, Dejian Yang*, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bingxuan Wang, Junxiao Song, Deli Chen, Xin Xie, Kang Guan, Yuxiang You, Aixin Liu, Qiushi Du, Wenjun Gao, Xuan Lu, Qinyu Chen, Yaohui Wang, Chengqi Deng, Jiashi Li, Chenggang Zhao, Chong Ruan, Fuli Luo, Wenfeng Liang

DeepSeek-AI

Abstract

We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with an additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2, while maintaining comparable performance in general language tasks. Compared to DeepSeek-Coder-33B, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K. In standard benchmark evaluations, DeepSeek-Coder-V2 achieves superior performance compared to closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks.

[Figure 1: grouped bar charts of accuracy (%) on HumanEval, MBPP+, MATH, GSM8K, Aider, LiveCodeBench, and SWE-Bench for DeepSeek-Coder-V2, GPT-4-Turbo-0409, Gemini-1.5-Pro, Claude-3-Opus, Llama-3-70B, and Codestral.]
Figure 1 | The performance of DeepSeek-Coder-V2 on math and code benchmarks.

*Core contributors
arXiv:2406.11931v1 [cs.SE] 17 Jun 2024

1. Introduction
The open-source community has made significant strides in advancing code intelligence through the development of open-source code models such as StarCoder (Li et al., 2023b; Lozhkov et al., 2024), CodeLlama (Roziere et al., 2023), DeepSeek-Coder (Guo et al., 2024), and Codestral (MistralAI, 2024). These models have steadily approached the performance levels of closed-source counterparts, contributing to the progress of code intelligence. However, there remains a discernible gap when comparing them to state-of-the-art closed-source models like GPT4-Turbo (OpenAI, 2023), Claude 3 Opus (Anthropic, 2024), and Gemini 1.5 Pro (Reid et al., 2024). To bridge this gap and further propel the development of open-source code models, we introduce the DeepSeek-Coder-V2 series. These models are built upon the foundation of DeepSeek-V2 (DeepSeek-AI, 2024) and are further pre-trained on an additional corpus of 6 trillion tokens.

In the pre-training phase, the dataset of DeepSeek-Coder-V2 is created with a composition of 60% source code, 10% math corpus, and 30% natural language corpus. The source code consists of 1,170B code-related tokens sourced from GitHub and CommonCrawl, using the same pipeline as DeepSeekMath (Shao et al., 2024). This corpus expands coverage from 86 to 338 programming languages compared to the code corpus used to train DeepSeek-Coder. To demonstrate the effectiveness of the new code corpus, we conduct ablation studies with the 1B parameter model and observe improvements of 6.7% and 9.4% in accuracy on the HumanEval (from 30.5% to 37.2%) and MBPP (from 44.6% to 54.0%) benchmarks (Austin et al., 2021a; Chen et al., 2021), respectively. For the math corpus, we collect 221B math-related tokens sourced from CommonCrawl using the same pipeline, which approximately doubles the size of the 120B DeepSeekMath corpus (Shao et al., 2024), while for the natural language corpus, we directly sample from the training corpus of DeepSeek-V2. In total, DeepSeek-Coder-V2 has been exposed to 10.2T training tokens, where 4.2 trillion tokens originate from the DeepSeek-V2 dataset, while the remaining 6 trillion tokens come from the DeepSeek-Coder-V2 dataset.

To accommodate longer code inputs and enhance applicability across various programming scenarios, we extend the context length from 16K to 128K tokens, allowing our models to handle more complex and extensive coding tasks. After continued pre-training of DeepSeek-V2 on this multi-source corpus, we find that DeepSeek-Coder-V2 significantly enhances the model's capabilities in coding and mathematical reasoning while maintaining comparable general language performance.

In the alignment phase, we first construct an instruction training dataset that includes code and math data from DeepSeek-Coder (Guo et al., 2024) and DeepSeek-Math (Shao et al., 2024), as well as general instruction data from DeepSeek-V2 (DeepSeek-AI, 2024). This dataset is used to fine-tune the base model. Then, in the reinforcement learning phase, we employ the Group Relative Policy Optimization (GRPO) algorithm to align its behavior with human preferences. Preference data is collected in the coding domain using compiler feedback and test cases, and a reward model is developed to guide the training of the policy model. This approach ensures that the model's responses are optimized for correctness and human preference in coding tasks. To enable the model to support code completion after alignment, we also utilize the Fill-In-the-Middle approach (Guo et al., 2024) during the fine-tuning of the base model with 16B parameters.
1.1. Contributions

In summary, our main contributions are:

• We introduce DeepSeek-Coder-V2 with 16B and 236B parameters based on the DeepSeekMoE framework, with active parameters of only 2.4B and 21B, efficiently supporting diverse computational and application needs. Additionally, DeepSeek-Coder-V2 supports 338 programming languages and a maximum context length of 128K tokens.

• We make the first attempt to develop an open-source hundred-billion-parameter code model to advance the field of code intelligence. Experimental results indicate that DeepSeek-Coder-V2 236B outperforms state-of-the-art closed-source models, such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro, in both coding and mathematics tasks.

• DeepSeek-Coder-V2 models are released publicly under a permissive license, allowing both research and unrestricted commercial use.
1.2. Summary of Evaluations and Metrics

• Code: Regarding code generation benchmark evaluation, DeepSeek-Coder-V2 demonstrates remarkable superiority over all open-source models while exhibiting performance on par with the leading closed-source models, such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro. Notably, we achieve a 90.2% score on HumanEval (Chen et al., 2021), a 76.2% score on MBPP (Austin et al., 2021a) (establishing a new state-of-the-art result with the EvalPlus evaluation pipeline), and a 43.4% score on LiveCodeBench (Jain et al., 2024) (questions from Dec. 2023 to June 2024). Additionally, DeepSeek-Coder-V2 is the first open-source model that surpasses a score of 10% on SWE-bench (Jimenez et al., 2023).

• Math: DeepSeek-Coder-V2 exhibits strong mathematical reasoning abilities, rivaling top closed-source models such as GPT-4o, Gemini 1.5 Pro, and Claude 3 Opus on both elementary benchmarks like GSM8K (Cobbe et al., 2021) and advanced competition-level benchmarks including MATH (Hendrycks et al., 2021), AIME (MAA, 2024), and Math Odyssey (Netmind.AI, 2024). Notably, DeepSeek-Coder-V2 attains an accuracy of 75.7% on the MATH benchmark, nearly matching the state-of-the-art accuracy of 76.6% achieved by GPT-4o. Furthermore, it surpasses the performance of these closed-source models in the AIME 2024 competition.

• Natural Language: DeepSeek-Coder-V2 maintains comparable general language performance to DeepSeek-V2. For example, DeepSeek-Coder-V2 achieves 79.2% on MMLU with the OpenAI simple-eval pipeline. In subjective evaluation with GPT-4 as a judge, DeepSeek-Coder-V2 achieves 65.0 on Arena-Hard (Li et al., 2024), 8.77 on MT-Bench (Zheng et al., 2023), and 7.84 on AlignBench (Liu et al., 2023c). These scores are significantly better than those of other code-specific models, and even comparable with general open-source models.

2. Data Collection
The pre-training data for DeepSeek-Coder-V2 primarily consists of 60% source code, 10% math corpus, and 30% natural language corpus. Since the natural language corpus is directly sampled from the training dataset of DeepSeek-V2, this section focuses on the collection, cleaning, and filtering processes of the code and math data. We further validate the quality of this data through comparative analysis experiments.

We collect public repositories created before November 2023 on GitHub. We first apply the same filtering rules and near-deduplication as those used for DeepSeek-Coder (Guo et al., 2024) to filter out lower-quality and duplicated source code. To make the paper self-contained, we briefly describe the filtering rules. First, we filter out files with an average line length exceeding 100 characters or a maximum line length surpassing 1000 characters. Additionally, we remove files with fewer than 25% alphabetic characters. Except for the XSLT programming language, we further filter out files where the string "<?xml version=" appears in the first 100 characters. For HTML files, we consider the ratio of visible text to HTML code, and retain files where the visible text constitutes at least 20% of the code and is no less than 100 characters. For JSON and YAML files, which typically contain more data, we only keep files with a character count ranging from 50 to 5000 characters; this effectively removes most data-heavy files. By applying these filtering rules and near-deduplication, we obtain 821B tokens of code encompassing 338 programming languages and 185B tokens of code-related text, such as markdown and issues. The list of supported programming languages can be found in Appendix A. We use the same tokenizer as DeepSeek-V2, detailed in DeepSeek-AI (2024).
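To make these rules concrete, the following is a minimal sketch of how such filters could be implemented; the thresholds mirror the numbers quoted above, while the function shape, the file-extension checks, and the omission of the HTML visible-text heuristic are simplifying assumptions rather than the production pipeline.

```python
# Illustrative sketch of the quality filters described above (not the actual pipeline).

def keep_source_file(path: str, text: str) -> bool:
    lines = text.splitlines() or [""]
    # Rule 1: drop files with average line length > 100 or maximum line length > 1000.
    if sum(len(l) for l in lines) / len(lines) > 100 or max(len(l) for l in lines) > 1000:
        return False
    # Rule 2: drop files with fewer than 25% alphabetic characters.
    if text and sum(c.isalpha() for c in text) / len(text) < 0.25:
        return False
    # Rule 3: drop likely-XML files ("<?xml version=" within the first 100 characters),
    # except for the XSLT language.
    if not path.endswith((".xsl", ".xslt")) and "<?xml version=" in text[:100]:
        return False
    # Rule 4: JSON/YAML files are kept only if they contain 50 to 5000 characters.
    if path.endswith((".json", ".yaml", ".yml")) and not (50 <= len(text) <= 5000):
        return False
    # The HTML visible-text ratio check (>= 20% visible text and >= 100 characters)
    # is omitted here because it needs an HTML parser.
    return True
```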
To collect code-related and math-related web texts from Common Crawl, we follow the same pipeline as DeepSeekMath (Shao et al., 2024). Specifically, we select coding forums such as StackOverflow, library sites such as the PyTorch documentation, and mathematics websites such as StackExchange as our initial seed corpus. Using this seed corpus, we train a fastText model (Joulin et al., 2016) to recall more coding-related and math-related web pages. Since tokenization for languages like Chinese cannot be done through spaces, we use the Byte Pair Encoding (BPE) tokenizer from DeepSeek-V2, which significantly improves the recall accuracy of fastText. For each domain, we calculate the percentage of web pages collected in the first iteration; domains with over 10% of their web pages collected are classified as code-related or math-related. We then annotate the URLs associated with code-related or math-related content within these identified domains, and uncollected web pages linked to these URLs are added to the seed corpus. After three iterations of data collection, we gather 70 billion code-related tokens and 221B math-related tokens from web pages. To further collect high-quality source code from GitHub, we also apply the same pipeline on GitHub with two iterations of data collection and collect 94B tokens of source code; here, the initial seed corpus is constructed by manually collecting high-quality source code, such as code containing detailed descriptions. Finally, the new code corpus consists of 1,170B code-related tokens sourced from GitHub and CommonCrawl.
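For intuition, the recall step described above can be sketched as follows; it assumes the `fasttext` Python package, `bpe_tokenize` is a stand-in for the DeepSeek-V2 BPE tokenizer rather than a real API, and the file name is hypothetical.

```python
# Hedged sketch of fastText-based recall of code/math-related web pages.
import fasttext

def bpe_tokenize(text: str) -> str:
    # Placeholder for the DeepSeek-V2 BPE tokenizer mentioned above, which is used so
    # that space-free languages (e.g. Chinese) can be handled; here we only normalize
    # whitespace and case.
    return " ".join(text.lower().split())

# train.txt holds lines like "__label__relevant <tokenized page>" built from the seed corpus.
model = fasttext.train_supervised(input="train.txt", epoch=3, wordNgrams=2)

def is_relevant(page_text: str, threshold: float = 0.5) -> bool:
    labels, probs = model.predict(bpe_tokenize(page_text))
    return labels[0] == "__label__relevant" and probs[0] >= threshold
```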
To demonstrate the effectiveness of the new code corpus, we conducted ablation studies (see Table 1) using a 1B parameter model, comparing it with the corpus used to train DeepSeek-Coder. Pre-training the 1B model on the new code corpus with 1T tokens resulted in improvements of 5.5% and 4.4% in accuracy on the HumanEval (from 30.5% to 36.0%) and MBPP (from 44.6% to 49.0%) benchmarks, respectively. Further training the 1B model with 2T tokens led to additional improvements, with HumanEval and MBPP scores rising to 37.2% and 54.0%, respectively. Therefore, the new code corpus is superior to the code corpus used to train DeepSeek-Coder.

Model | Tokens | Python | C++ | Java | PHP | TS | C# | Bash | JS | Avg | MBPP
DeepSeek-Coder-1B | 1T | 30.5% | 28.0% | 31.7% | 23.0% | 30.8% | 31.7% | 9.5% | 28.6% | 26.7% | 44.6%
DeepSeek-Coder-V2-1B | 1T | 36.0% | 34.8% | 31.7% | 27.3% | 37.7% | 34.2% | 6.3% | 38.5% | 31.2% | 49.0%
DeepSeek-Coder-V2-1B | 2T | 37.2% | 39.1% | 32.3% | 31.7% | 34.6% | 36.7% | 12.0% | 32.9% | 32.0% | 54.0%

Table 1 | Performance of the 1B base model trained on the DeepSeek-Coder and DeepSeek-Coder-V2 corpora.

3. Training Policy

3.1. Training Strategy
We use two training objectives for DeepSeek-Coder-V2 16B: Next-Token-Prediction and Fill-In-the-Middle (FIM) (Bavarian et al., 2022; Guo et al., 2024; Li et al., 2023b). For DeepSeek-Coder-V2 236B, we only utilize the Next-Token-Prediction objective. Here we give a brief introduction of the FIM training policy. We adopt the FIM training approach for the development of DeepSeek-Coder-V2 16B, leveraging the PSM (Prefix, Suffix, Middle) mode, which structures content reconstruction in the sequence Prefix, Suffix, Middle, as illustrated in the sketch below. This structure is applied at the document level as part of the pre-packing process. FIM is utilized at a rate of 0.5, consistent with the PSM framework, to enhance training efficacy and model performance.
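A minimal sketch of this PSM-style packing is given below; the sentinel strings are placeholders rather than the exact special tokens used by DeepSeek-Coder-V2, and the random choice of split points is illustrative only.

```python
import random

# Hedged sketch of PSM (Prefix, Suffix, Middle) packing applied at the document level.
# The sentinel strings below are placeholders for the model's actual special tokens.
FIM_BEGIN, FIM_HOLE, FIM_END, EOS = "<fim_begin>", "<fim_hole>", "<fim_end>", "<eos>"

def pack_document(doc: str, fim_rate: float = 0.5) -> str:
    if len(doc) < 2 or random.random() >= fim_rate:
        return doc + EOS                       # plain next-token-prediction sample
    # Split the document into prefix / middle / suffix at two random points.
    i, j = sorted(random.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # PSM order: prefix and suffix are given first, the middle is reconstructed last.
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}{EOS}"
```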
3.2. Model Architecture

Our architecture aligns with that of DeepSeek-V2 (DeepSeek-AI, 2024). The hyper-parameter settings of the 16B and 236B models correspond to those used in DeepSeek-V2-Lite and DeepSeek-V2, respectively. Notably, we encountered instability during training, with spikes in gradient values, which we attributed to the exponential normalization technique. To address this, we reverted to the conventional normalization method.

3.3. Training Hyper-Parameters

Consistent with the DeepSeek-V2 methodology (DeepSeek-AI, 2024), we utilize the AdamW optimizer (Loshchilov and Hutter, 2019), configured with β1 = 0.9, β2 = 0.95, and a weight decay of 0.1. Batch sizes and learning rates are adjusted according to the DeepSeek-V2 specifications. For learning rate scheduling, we employ a cosine decay strategy, starting with 2000 warm-up steps and gradually reducing the learning rate to 10% of its initial value.
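A minimal PyTorch-style sketch of this optimizer and schedule is shown below; the peak learning rate and total step count are placeholders, since the actual values follow the DeepSeek-V2 specifications.

```python
import math
import torch

# Hedged sketch of the setup described above: AdamW with beta1=0.9, beta2=0.95,
# weight decay 0.1, 2000 warm-up steps, then cosine decay to 10% of the peak
# learning rate. peak_lr and total_steps are placeholders, not the paper's values.
def build_optimizer_and_scheduler(model, peak_lr=3e-4, total_steps=100_000, warmup_steps=2000):
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.95), weight_decay=0.1)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                       # linear warm-up
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
        return 0.1 + 0.9 * cosine                                    # decay to 10% of peak

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```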
Both DeepSeek-Coder-V2 and DeepSeek-Coder-V2-Lite are trained using the same methodology. To maintain robust natural language understanding capabilities in DeepSeek-Coder-V2, we continue the pre-training process from an intermediate checkpoint of DeepSeek-V2 that was initially trained on 4.2T tokens. Consequently, DeepSeek-Coder-V2 has been exposed to a total of 10.2T high-quality tokens during the pre-training phase.

Model | DeepSeek-Coder-V2-Lite | DeepSeek-Coder-V2
#Total Parameters (#TP) | 16B | 236B
#Active Parameters (#AP) | 2.4B | 21B
Pre-training Tokens | 4.2T + 6T | 4.2T + 6T
LR Scheduler | Cosine | Cosine
FIM | Enabled | Disabled

Table 2 | Training settings of DeepSeek-Coder-V2.

3.4. Long Context Extension

Following DeepSeek-V2, we extend the context length of DeepSeek-Coder-V2 to 128K using YaRN (Peng et al., 2023). The hyper-parameters of YaRN are the same as in DeepSeek-V2: the scale s is set to 40, α to 1, and β to 32. We further continue training the model in two stages to enhance its capability for handling long contexts. In the first stage, we use a sequence length of 32K and a batch size of 1152 for 1000 steps. In the second stage, we train the model for an additional 1000 steps, employing a sequence length of 128K and a batch size of 288 sequences.
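For intuition, the sketch below shows a YaRN-style ("NTK-by-parts") adjustment of RoPE frequencies with the hyper-parameters quoted above (s = 40, α = 1, β = 32); it is a simplified reading of Peng et al. (2023) that omits details such as attention-temperature scaling, and the RoPE base and original context length are assumed values rather than figures taken from this report.

```python
import math

# Hedged sketch of YaRN-style RoPE frequency adjustment (Peng et al., 2023).
# Dimensions that rotate many times within the original window (r > beta) keep
# their frequency; dimensions that rotate fewer than alpha times are fully
# interpolated (divided by the scale s); the rest are blended linearly.
def yarn_frequencies(dim=128, base=10000.0, orig_ctx=4096, s=40, alpha=1, beta=32):
    freqs = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)                  # original RoPE frequency
        rotations = orig_ctx * theta / (2 * math.pi)
        ramp = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        freqs.append(theta * ((1 - ramp) / s + ramp))
    return freqs
```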
[Figure 2: "Pressure Testing DeepSeek-Coder-V2 Base 128K Context via Needle In A Haystack" heatmap; x-axis: context length from 1K to 128K tokens, y-axis: document depth percent, color: score from 1 to 10.]
Figure 2 | Evaluation results on the "Needle In A Haystack" (NIAH) tests. DeepSeek-Coder-V2 performs well across all context window lengths up to 128K.

It should be noted that we upsample the ratio of long-context data during long-context extension. As shown in Figure 2, the results on the "Needle In A Haystack" (NIAH) tests indicate that DeepSeek-Coder-V2 performs well across all context window lengths up to 128K.

3.5. Alignment

3.5.1. Supervised Fine-Tuning

To build DeepSeek-Coder-V2 Chat, we construct an instruction training dataset mixed with code and math data. We first collect 20k code-related and 30k math-related instruction data from DeepSeek-Coder and DeepSeek-Math. To maintain general ability, we also sample several data from the instruction data of DeepSeek-V2. Finally, we use an instruction dataset of 300M tokens.
For training, we use a cosine schedule with 100 warm-up steps and an initial learning rate of 5e-6. We also use a batch size of 1M tokens and 1B tokens in total.

3.5.2. Reinforcement Learning

We further employ Reinforcement Learning (RL) techniques to fully stimulate the capabilities of DeepSeek-Coder-V2, which proves to be quite effective.

Prompts. Considerable effort was spent collecting prompts related to code and math from various sources, and each code prompt comes with corresponding test cases. After filtering the prompts, there are approximately 40k prompts in total.

Reward Modeling. Reward models play crucial roles in RL training. For mathematical preference data, we obtain labels using the ground-truth answers. For code preference data, although the code compiler itself can already provide 0-1 feedback (whether the code passes all test cases or not), some code prompts have a limited number of test cases that do not provide full coverage, so directly using 0-1 feedback from the compiler may be noisy and sub-optimal. We therefore still train a reward model on the data provided by the compiler, and use the reward model to provide the signal during RL training, which is more robust and has better generalization ability compared with the raw compiler signal. As illustrated in Figure 3, on our in-house test sets (LeetCode and LeetCode-zh), using a reward model to provide the RL training signal clearly outperforms using the raw compiler signal. Hence, we use the reward-model signal rather than the compiler signal in all subsequent experiments.
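As an illustration of the 0-1 compiler feedback discussed above, the snippet below runs a candidate solution against its test cases in a subprocess and returns a binary outcome; in the setup described here this signal is used to build the data on which the reward model is trained, and the helper itself is a simplified assumption, not the in-house harness.

```python
import subprocess
import tempfile

# Hedged sketch: execute a candidate program together with its test cases and
# return 0/1 feedback (1 = all tests pass). Real harnesses add sandboxing and
# per-test resource limits; this only illustrates the signal being collected.
def binary_feedback(candidate_code: str, test_code: str, timeout_s: int = 10) -> int:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return int(result.returncode == 0)
    except subprocess.TimeoutExpired:
        return 0
```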
Reinforcement Learning Algorithm. We employ Group Relative Policy Optimization (GRPO) (Shao et al., 2024) as our RL algorithm, the same algorithm used by DeepSeek-V2. Notably, GRPO has proven quite effective and has a lower cost compared with PPO, since there is no need to maintain an additional critic model.
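The group-relative baseline that gives GRPO its name can be sketched as follows, based on the description in Shao et al. (2024): several responses are sampled per prompt, and each reward is normalized against the group mean and standard deviation, so no separate critic model is needed. This is a simplified illustration, not the full GRPO objective with its clipping and KL regularization.

```python
import statistics

# Hedged sketch of GRPO's group-relative advantages (Shao et al., 2024): for one
# prompt, a group of G sampled responses is scored by the reward model, and each
# response's advantage is its reward normalized within the group.
def group_relative_advantages(rewards: list[float]) -> list[float]:
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0       # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Example: reward-model scores for 4 sampled completions of one prompt.
print(group_relative_advantages([0.9, 0.2, 0.4, 0.7]))
```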
[Figure 3: Pass@1 on LeetCode and LeetCode-zh over RL training steps, comparing the compiler signal, the reward-model signal, and the SFT baseline.]
Figure 3 | Performances of different methods.

4. Experimental Results

In this section, we evaluate DeepSeek-Coder-V2 on three types of tasks, including coding, mathematics, and general natural language, and compare DeepSeek-Coder-V2 with the previous state-of-the-art large language models.

• CodeLlama (Roziere et al., 2023) consists of a series of code language models based on Llama 2 (Touvron et al., 2023), continually pre-trained on datasets ranging from 500 billion to 1,000 billion code tokens. These models are available in four sizes: 7B, 13B, 34B, and 70B.

• StarCoder (Li et al., 2023b) is a publicly accessible model with 15 billion parameters. It is specifically trained on a meticulously curated subset of the Stack dataset (Kocetkov et al., 2022), covering 86 programming languages.

• StarCoder2 (Lozhkov et al., 2024) consists of 3B, 7B, and 15B parameter models trained on 3.3 to 4.3 trillion tokens of the Stack v2 dataset (Lozhkov et al., 2024), spanning 619 programming languages.

• DeepSeek-Coder (Guo et al., 2024) comprises a series of code language models ranging from 1 billion to 33 billion parameters. Each model is trained from scratch on 2 trillion tokens, with a composition of 87% code and 13% natural language in both English and Chinese. These models are pre-trained on a project-level code corpus using a window size of 16K and an additional fill-in-the-blank task, enabling support for project-level code completion and infilling.

• Codestral (MistralAI, 2024) is a 22B parameter model developed by Mistral. It is trained on a diverse dataset of over 80 programming languages, including popular ones such as Python, Java, and JavaScript, as well as more specialized languages like Swift and Fortran.

• General language models that we compare include Llama3 70B (Meta, 2024), GPT-4 (OpenAI, 2023), Claude 3 Opus (Anthropic, 2024), and Gemini 1.5 Pro (Reid et al., 2024). While they are not specifically trained on large code corpora, they achieve state-of-the-art performance in coding.
4.1. Code Generation

HumanEval and MBPP Benchmarks. The HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021b) benchmarks are commonly utilized for assessing the performance of code-generating Large Language Models (LLMs). HumanEval comprises 164 Python tasks that are verified through test cases to evaluate the performance of Code LLMs in a zero-shot scenario. For MBPP, we use the MBPP-Plus version (Liu et al., 2023a) to evaluate the models. To test the multilingual abilities of models, we extended the HumanEval benchmark problems into additional languages: C++, Java, PHP, TypeScript, C#, Bash, JavaScript, Swift, R, Julia, D, Rust, and Racket. For both benchmarks, we employed a greedy search strategy and recreated the baseline results using identical scripts and environments to ensure a fair comparison. For HumanEval, the instruction prompt is built from the template "Please complete the python function below. The final complete version of your function must be returned within a code block. Here is the unfinished function:", followed by the unfinished function in a Python code block.
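As a concrete illustration, such a prompt can be assembled as follows; wrapping the unfinished function in a fenced Python code block is an assumption about the exact formatting.

```python
# Hedged sketch: build the HumanEval-style instruction prompt from the template
# quoted above. The code-fence formatting is assumed.
TEMPLATE = (
    "Please complete the python function below. The final complete version of your "
    "function must be returned within a code block. Here is the unfinished function:\n"
    "```python\n{problem_description}\n```\n"
)

def build_prompt(problem_description: str) -> str:
    return TEMPLATE.format(problem_description=problem_description)

print(build_prompt('def add(a, b):\n    """Return the sum of a and b."""'))
```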
Model | #TP | #AP | Python | Java | C++ | C# | TS | JS | PHP | Bash
Closed-Source Models
Gemini-1.5-Pro | - | - | 83.5% | 81.0% | 78.3% | 75.3% | 77.4% | 80.8% | 74.5% | 39.9%
Claude-3-Opus | - | - | 84.2% | 78.5% | 81.4% | 74.7% | 76.1% | 75.8% | 78.3% | 48.7%
GPT-4-1106 | - | - | 87.8% | 82.3% | 78.9% | 80.4% | 81.8% | 80.1% | 77.6% | 55.7%
GPT-4-Turbo-0409 | - | - | 88.2% | 81.7% | 78.3% | 79.1% | 79.3% | 80.8% | 78.9% | 55.1%
GPT-4o-0513 | - | - | 91.0% | 80.4% | 87.0% | 82.9% | 86.2% | 87.6% | 79.5% | 53.8%
Open-Source Models
Codestral | 22B | 22B | 78.1% | 71.5% | 71.4% | 77.2% | 72.3% | 73.9% | 69.6% | 47.5%
DS-Coder-instruct | 33B | 33B | 79.3% | 73.4% | 68.9% | 74.1% | 67.9% | 73.9% | 72.7% | 43.0%
Llama3-Instruct | 70B | 70B | 81.1% | 67.7% | 64.0% | 69.6% | 69.8% | 70.2% | 65.8% | 36.1%
DS-Coder-V2-Lite-Instruct | 16B | 2.4B | 81.1% | 76.6% | 75.8% | 76.6% | 80.5% | 77.6% | 74.5% | 43.0%
DS-Coder-V2-Instruct | 236B | 21B | 90.2% | 82.3% | 84.8% | 82.3% | 83.0% | 84.5% | 79.5% | 52.5%

Model | #TP | #AP | Swift | R | Julia | D | Rust | Racket | MBPP+ | Average
Closed-Source Models
Gemini-1.5-Pro | - | - | 66.5% | 53.4% | 71.7% | 55.8% | 73.1% | 48.4% | 74.6% | 68.9%
Claude-3-Opus | - | - | 63.9% | 55.9% | 76.1% | 60.3% | 71.2% | 64.6% | 72.0% | 70.8%
GPT-4-1106 | - | - | 62.7% | 57.8% | 69.2% | 60.9% | 78.8% | 64.0% | 69.3% | 72.5%
GPT-4-Turbo-0409 | - | - | 63.9% | 56.5% | 69.8% | 61.5% | 78.8% | 63.4% | 72.2% | 72.3%
GPT-4o-0513 | - | - | 75.9% | 65.2% | 78.0% | 60.9% | 80.1% | 64.6% | 73.5% | 76.4%
Open-Source Models
Codestral | 22B | 22B | 63.3% | 49.7% | 67.9% | 32.1% | 67.3% | 37.3% | 68.2% | 63.2%
DS-Coder-instruct | 33B | 33B | 61.4% | 44.7% | 53.5% | 31.4% | 68.6% | 46.0% | 70.1% | 61.9%
Llama3-Instruct | 70B | 70B | 55.1% | 46.0% | 62.9% | 48.1% | 58.3% | 46.0% | 68.8% | 60.6%
DS-Coder-V2-Lite-Instruct | 16B | 2.4B | 64.6% | 47.8% | 67.3% | 45.5% | 62.2% | 41.6% | 68.8% | 65.6%
DS-Coder-V2-Instruct | 236B | 21B | 72.2% | 64.0% | 72.3% | 64.1% | 78.2% | 63.4% | 76.2% | 75.3%

Table 3 | Performance metrics for various models on the HumanEval and MBPP benchmarks.

Table 3 provides an extensive overview of the performance metrics for various models across multiple programming languages on the HumanEval and MBPP+ benchmarks. DeepSeek-Coder-V2-Instruct demonstrates exceptional performance, securing the second-highest average score of 75.3%. This performance is notable as it breaks the dominance typically seen from closed-source models, standing out as a leading open-source contender. It is surpassed only by GPT-4o, which leads with an average score of 76.4%. DeepSeek-Coder-V2-Instruct shows top-tier results across a variety of languages, including the highest scores in Java and PHP, and strong performances in Python, C++, C#, TypeScript, and JavaScript, underscoring its robustness and versatility in handling diverse coding challenges.

Furthermore, DeepSeek-Coder-V2-Lite-Instruct also performs impressively, surpassing the larger 33B model by a considerable margin in average performance (65.6% vs. 61.9%). This highlights the effectiveness of the 16B model in delivering competitive results despite its smaller size, and underscores the model's efficiency and the advancements in model architecture and training methodologies that allow it to outperform larger counterparts.
Competitive Programming. To further validate the model's capability on real-world competitive programming problems, we utilize the LiveCodeBench (Jain et al., 2024) and USACO benchmarks (Shi et al., 2024) to estimate the effectiveness of DeepSeek-Coder-V2. LiveCodeBench is a meticulous and contamination-free assessment of Large Language Models (LLMs) for code generation, systematically gathering novel challenges over time from three prominent competitive programming platforms: LeetCode, AtCoder, and CodeForces. Since the cut-off of the training data is before November 2023, we use the subset (1201-0601) of LiveCodeBench. The USACO benchmark contains 307 problems from the USA Computing Olympiad, along with high-quality unit tests, reference code, and official analyses for each problem.

Model | #TP | #AP | LCB Easy (82) | LCB Medium (87) | LCB Hard (57) | LCB Overall (226) | USACO
Closed-Source Models
Gemini-1.5-Pro | - | - | 74.9% | 16.8% | 1.8% | 34.1% | 4.9%
Claude-3-Opus | - | - | 77.2% | 16.7% | 0.7% | 34.6% | 7.8%
GPT-4-1106 | - | - | 78.4% | 20.2% | 3.5% | 37.1% | 11.1%
GPT-4-Turbo-0409 | - | - | 84.1% | 35.4% | 6.1% | 45.7% | 12.3%
GPT-4o-0513 | - | - | 87.4% | 27.5% | 4.9% | 43.4% | 18.8%
Open-Source Models
Codestral | 22B | 22B | 66.5% | 17.7% | 0.2% | 31.0% | 4.6%
DS-Coder-instruct | 33B | 33B | 51.6% | 9.7% | 0.4% | 22.5% | 4.2%
Llama3-Instruct | 70B | 70B | 62.4% | 14.4% | 2.1% | 28.7% | 3.3%
DS-Coder-V2-Lite-Instruct | 16B | 2.4B | 58.5% | 8.0% | 0.0% | 24.3% | 6.5%
DS-Coder-V2-Instruct | 236B | 21B | 84.1% | 29.9% | 5.3% | 43.4% | 12.1%

Table 4 | Performance on the LiveCodeBench (LCB) and USACO benchmarks.

Table 4 showcases the performance of various language models on the two benchmarks. Notably, DeepSeek-Coder-V2-Instruct delivers a standout performance, tying for the highest score among large models at 43.4%, on par with GPT-4o. This exceptional result places it second overall, just behind GPT-4-Turbo-0409, which leads with an overall performance of 45.7%. DeepSeek-Coder-V2-Instruct's impressive ability to handle complex coding challenges firmly establishes it as a top contender, closely trailing the leading GPT-4-Turbo variant.
4.2. Code Completion

4.2.1. Repository-Level Code Completion Evaluation

We use RepoBench (Liu et al., 2023b) to evaluate the capabilities of currently available open-source code models with sizes below 35B in repository-level code completion tasks. This dataset is constructed from a diverse set of real-world, open-sourced, permissively licensed repositories in two popular programming languages: Python and Java. Notably, the latest version (v1.1) of RepoBench sources its data from GitHub repositories created between October 6th and December 31st, 2023, while our pre-training data includes code created before November 2023. To ensure this dataset was not present in our pre-training data and to avoid data leakage, we only use data from December 2023.

Our evaluation includes five context length levels (2k, 4k, 8k, 12k, and 16k tokens) across three settings: cross-file-first, cross-file-random, and in-file. We use greedy search for all models under evaluation. The models were constrained to generate a maximum of 64 new tokens per prompt, and the first non-empty and non-comment line of the output was selected as the prediction. The maximum token length for prompts was set to 15,800 by truncating excess cross-file context. We report the average exact match for the different context length levels.
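The post-processing just described (keeping the first non-empty, non-comment line of the completion and scoring it by exact match) can be sketched as follows; the comment heuristics are simplified assumptions that cover only Python- and Java-style comments.

```python
# Hedged sketch of the RepoBench post-processing described above: from at most
# 64 newly generated tokens, keep the first non-empty, non-comment line and
# score it by exact match against the ground-truth next line.
def first_code_line(completion: str) -> str:
    for line in completion.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(("#", "//", "/*", "*")):
            return stripped
    return ""

def exact_match(completion: str, reference: str) -> bool:
    return first_code_line(completion) == reference.strip()
```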
Model | #TP | #AP | Python 2k | 4k | 8k | 12k | 16k | Avg | Java 2k | 4k | 8k | 12k | 16k | Avg
StarCoder2-Base | 15B | 15B | 35.7% | 36.7% | 34.6% | 27.4% | 25.1% | 32.1% | 46.2% | 45.0% | 39.8% | 30.5% | 30.7% | 38.7%
CodeLlama-Base | 7B | 7B | 32.0% | 34.4% | 35.3% | 33.3% | 32.2% | 33.5% | 43.1% | 42.1% | 40.4% | 37.0% | 40.3% | 40.6%
CodeLlama-Base | 13B | 13B | 33.0% | 36.5% | 37.0% | 34.6% | 35.0% | 35.2% | 43.5% | 44.8% | 40.7% | 38.6% | 41.1% | 41.8%
CodeLlama-Base | 34B | 34B | 35.3% | 37.5% | 39.5% | 34.9% | 35.6% | 36.6% | 45.9% | 45.4% | 42.5% | 41.0% | 41.2% | 43.3%
DS-Coder-Base | 6.7B | 6.7B | 36.1% | 37.5% | 38.2% | 34.0% | 35.0% | 36.2% | 46.8% | 46.4% | 42.9% | 38.8% | 40.8% | 43.3%
DS-Coder-Base | 33B | 33B | 39.7% | 40.1% | 40.0% | 36.9% | 38.5% | 39.1% | 47.9% | 47.7% | 43.3% | 40.9% | 43.6% | 44.8%
Codestral | 22B | 22B | 42.1% | 44.3% | 46.6% | 46.6% | 51.5% | 46.1% | 48.3% | 47.8% | 46.0% | 42.2% | 43.9% | 45.7%
DS-Coder-V2-Lite-Base | 16B | 2.4B | 38.3% | 38.6% | 40.6% | 38.3% | 38.7% | 38.9% | 48.8% | 45.7% | 42.4% | 38.1% | 41.1% | 43.3%

Table 5 | Performance of different models on the December subset of RepoBench v1.1.

As shown in Table 5, the results indicate that the DeepSeek-Coder-V2-Lite-Base model, despite having only 2.4 billion active parameters, achieves code completion capabilities in Python comparable to the DeepSeek-Coder-Base 33B model and in Java comparable to the DeepSeek-Coder-Base 7B model. Compared to Codestral, DeepSeek-Coder-V2-Lite-Base has only one-tenth of the active parameters, resulting in lower performance in code completion tasks. However, we believe that the smaller number of active parameters makes DeepSeek-Coder-V2-Lite faster for code completion scenarios.
4.2.2. Fill-in-the-Middle Code Completion

DeepSeek-Coder-V2-Lite is trained with a 0.5 Fill-In-the-Middle (FIM) rate during its pre-training phase. This approach allows the model to complete code by filling in blanks using the surrounding context, which includes both the preceding and following code segments, an ability that is particularly advantageous for code completion tools. Several open-source models, such as SantaCoder (Allal et al., 2023), StarCoder (Li et al., 2023b), and CodeLlama (Roziere et al., 2023), also leverage similar capabilities and have established high standards in the domain of code generation and completion.

To evaluate the performance of DeepSeek-Coder-V2 models, we conducted a comparative analysis against leading models. The assessment was based on the Single-Line Infilling benchmarks, covering three different programming languages as described by Allal et al. (2023). The main metric for this evaluation was the line exact match accuracy; we use the first generated line rather than the whole generated chunk, so the result differs slightly from that reported for DeepSeek-Coder.

Model | #TP | #AP | python | java | javascript | Mean
StarCoder | 16B | 16B | 71.5% | 82.3% | 83.0% | 80.2%
CodeLlama-Base | 7B | 7B | 58.6% | 70.6% | 70.7% | 68.0%
CodeLlama-Base | 13B | 13B | 60.7% | 74.3% | 78.5% | 73.1%
DS-Coder-Base | 1B | 1B | 74.1% | 85.1% | 82.9% | 81.8%
DS-Coder-Base | 7B | 7B | 79.8% | 89.6% | 86.3% | 86.1%
DS-Coder-Base | 33B | 33B | 80.5% | 88.4% | 86.6% | 86.4%
Codestral | 22B | 22B | 77.2% | 83.2% | 85.9% | 83.0%
DS-Coder-V2-Lite-Base | 16B | 2.4B | 80.0% | 89.1% | 87.2% | 86.4%

Table 6 | Performance of different approaches on the FIM tasks.

The table presents the performance of various coding models on FIM (Fill-in-the-Middle) tasks across three programming languages (Python, Java, and JavaScript), with a Mean score indicating overall effectiveness. Among the compared models, DeepSeek-Coder-V2-Lite-Base, with 2.4B active parameters, achieves outstanding results. It scores 80.0% in Python, 89.1% in Java, and 87.2% in JavaScript, leading to a top Mean score of 86.4%. This demonstrates the effectiveness of DeepSeek-Coder-V2-Lite-Base in handling FIM tasks across different programming languages, achieving performance comparable with other, larger models in the evaluation.
4.3. Code Fixing

To evaluate the bug-fixing capabilities of the model, we use the Defects4J, SWE-bench (Jimenez et al., 2023), and Aider datasets for testing. Defects4J is a widely used dataset in the field of software engineering, specifically designed for the purpose of evaluating and testing program repair techniques. It consists of a collection of real-world software bugs from various open-source projects, including but not limited to Apache Commons, JFreeChart, and Closure Compiler. Each bug in the dataset is accompanied by a test suite that can be used to validate the effectiveness of program repair tools. Since the original bugs in Defects4J may require modifying several files in a repository, resulting in a long context, we collect 238 bugs from this benchmark that only require modifying a single method.

SWE-bench is a comprehensive benchmark designed to evaluate the performance of large language models in addressing real-world software issues sourced from GitHub. The benchmark presents a codebase alongside a specific issue, challenging the language model to generate a patch that effectively resolves the described problem. This rigorous evaluation framework ensures that the language model's ability to understand and fix real-world software issues is thoroughly tested, providing a clear measure of its practical utility and effectiveness in software development tasks.

Aider's code editing benchmark evaluates the LLM's ability to modify Python source files, completing 133 distinct coding tasks. This benchmark not only tests the LLM's coding skills but also checks its consistency in producing code edits according to the specifications in the prompt. For DeepSeek-Coder-V2 models, we use the whole-file edit format for evaluation.

Model | #TP | #AP | Defects4J | SWE-Bench | Aider
Closed-Source Models
Gemini-1.5-Pro | - | - | 18.6% | 19.3% | 57.1%
Claude-3-Opus | - | - | 25.5% | 11.7% | 68.4%
GPT-4-1106 | - | - | 22.8% | 22.7% | 65.4%
GPT-4-Turbo-0409 | - | - | 24.3% | 18.3% | 63.9%
GPT-4o-0513 | - | - | 26.1% | 26.7% | 72.9%
Open-Source Models
Codestral | 22B | 22B | 17.8% | 2.7% | 51.1%
DS-Coder-Instruct | 33B | 33B | 11.3% | 0.0% | 54.5%
Llama3-Instruct | 70B | 70B | 16.2% | - | 49.2%
DS-Coder-V2-Lite-Instruct | 16B | 2.4B | 9.2% | 0.0% | 44.4%
DS-Coder-V2-Instruct | 236B | 21B | 21.0% | 12.7% | 73.7%

Table 7 | Performances of different models on repair benchmarks. We do not evaluate Llama3-Instruct on SWE-Bench as it only supports an 8K context length.

Table 7 outlines the performances of different language models on software repair benchmarks, including Defects4J, SWE-Bench, and Aider. Among open-source models, DeepSeek-Coder-V2-Instruct emerges as a standout, achieving the best performance among them. It scores 21.0% on Defects4J and 12.7% on SWE-Bench, closely approaching the results of leading closed-source models and demonstrating significant capability in handling longer code sequences. Notably, DeepSeek-Coder-V2-Instruct achieves the highest score of 73.7% on Aider, surpassing all other models listed, including closed-source counterparts. This superior performance highlights its efficiency and robustness in automated code repair tasks, positioning DeepSeek-Coder-V2-Instruct as the top open-source model and a formidable competitor to closed-source alternatives in the field.
4.4. Code Understanding and Reasoning

To assess the code reasoning capabilities of our models, we utilize the CRUXEval benchmark. This benchmark comprises 800 Python functions paired with corresponding input-output examples. It is divided into two distinct tasks: CRUXEval-I, which requires the large language model (LLM) to predict an input of the function from a known output, and CRUXEval-O, where the model must predict the output for a given input. This structure challenges the model's ability to understand and reason through Python code in both forward and reverse directions.
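For illustration, a CRUXEval-style item pairs a small Python function with an input-output example; the toy function below is ours, not taken from the benchmark.

```python
# Toy CRUXEval-style item (illustrative only, not from the benchmark).
def f(xs):
    return [x * 2 for x in xs if x % 2 == 1]

# CRUXEval-O direction: given the input, predict the output.
assert f([1, 2, 3, 4]) == [2, 6]
# CRUXEval-I direction: given the output [2, 6], propose an input that produces it,
# e.g. [1, 3] (many valid answers exist).
assert f([1, 3]) == [2, 6]
```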
115、T and CruxEval-O-COT.Among the open-sourcemodels,DeepSeek-Coder-V2-Instruct stands out significantly.It scores 70.0%on CruxEval-I-COT and 75.1%on CruxEval-O-COT,showcasing its superior capability within the open-sourcedomain.However,when compared to larger closed-source models,there is a performance
116、 gap.This performance gap may largely be attributed to the fact that DeepSeek-Coder-V2-Instructoperates with only 21 billion activation parameters,which is considerably fewer than those inlarger,more advanced closed-source models like GPT-4o.This limitation in model complexitycould restrict its lear
117、ning and problem-solving capacities.12Model#TP#APCruxEval-I-COTCruxEval-O-COTClosed-Source ModelsGemini-1.5-Pro-67.0%77.5%Claude-3-Opus-73.4%82.0%GPT-4-1106-75.5%77.1%GPT-4-Turbo-0409-75.7%82.0%GPT-4o-0513-77.4%88.7%Open-Source ModelsCodestral22B22B48.0%60.6%DS-Coder-Instruct33B33B47.3%50.6%Llama3-I
118、nstruct70B70B61.1%64.3%DS-Coder-V2-Lite-Instruct16B2.4B53.0%52.9%DS-Coder-V2-Instruct236B21B70.0%75.1%Table 8|Performance of different models on the CruxEval benchmark.4.5.Mathematical ReasoningTo assess the mathematical reasoning capabilities of DeepSeekCoder-V2,we utilized the populargrade-school
To assess the mathematical reasoning capabilities of DeepSeek-Coder-V2, we utilize the popular grade-school benchmark GSM8K (Cobbe et al., 2021), along with advanced competition-level benchmarks including MATH (Hendrycks et al., 2021), the American Invitational Mathematics Examination (AIME) 2024 (MAA, 2024), and Math Odyssey (Netmind.AI, 2024). The performance of DeepSeek-Coder-V2 on the four mathematical benchmarks was obtained with zero-shot chain-of-thought prompting; each test question was concatenated with the instruction "Please reason step by step, and put your final answer within \boxed{}."

Model | #TP | #AP | GSM8K | MATH | AIME 2024 | Math Odyssey
Closed-Source Models
Gemini 1.5 Pro | - | - | 90.8% | 67.7% | 2/30 | 45.0%
Claude-3-Opus | - | - | 95.0% | 60.1% | 2/30 | 40.6%
GPT-4-1106 | - | - | 91.4% | 64.3% | 1/30 | 49.1%
GPT-4-Turbo-0409 | - | - | 93.7% | 73.4% | 3/30 | 46.8%
GPT-4o-0513 | - | - | 95.8% | 76.6% | 2/30 | 53.2%
Open-Source Models
Llama3-Instruct | 70B | 70B | 93.0% | 50.4% | 1/30 | 27.9%
DS-Coder-V2-Lite-Instruct | 16B | 2.4B | 86.4% | 61.8% | 0/30 | 44.4%
DS-Coder-V2-Instruct | 236B | 21B | 94.9% | 75.7% | 4/30 | 53.7%

Table 9 | Performance of different models on mathematical reasoning benchmarks. DeepSeek-Coder-V2-Instruct can achieve 5/30 on AIME 2024 with maj@64.

The results, presented in Table 9, were obtained using greedy decoding without the aid of tools or voting techniques, unless otherwise specified. DeepSeek-Coder-V2 achieves an accuracy of 75.7% on the MATH benchmark and 53.7% on Math Odyssey, comparable to the state-of-the-art GPT-4o. Additionally, DeepSeek-Coder-V2 solves more problems from AIME 2024 than the other models, demonstrating its strong mathematical reasoning capabilities.

4.6. General Natural Language
As DeepSeek-Coder-V2 is built upon DeepSeek-V2, it inherits strong natural language capability, even surpassing DeepSeek-V2 on reasoning-related benchmarks. We compare DeepSeek-Coder-V2 Instruct with DeepSeek-V2 Chat on standard benchmarks covering both English and Chinese, including BigBench Hard (BBH) (Suzgun et al., 2022), MMLU (Hendrycks et al., 2020), ARC (Clark et al., 2018), TriviaQA (Joshi et al., 2017), NaturalQuestions (Kwiatkowski et al., 2019), AGIEval (Zhong et al., 2023), CLUEWSC (Xu et al., 2020), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023a). In addition, we evaluate the open-ended generation ability of the models on Arena-Hard (Li et al., 2024), AlpacaEval 2.0 (Dubois et al., 2024), MT-Bench (Zheng et al., 2023), and AlignBench (Liu et al., 2023c). The evaluation pipeline and metrics are the same as in DeepSeek-V2, where MMLU is evaluated using the OpenAI simple-eval package.

Benchmark (Metric) | #Shots | DeepSeek-V2-Lite Chat | DeepSeek-Coder-V2-Lite Instruct | DeepSeek-V2 Chat | DeepSeek-Coder-V2 Instruct
#Active Params | - | 2.4B | 2.4B | 21B | 21B
#Total Params | - | 16B | 16B | 236B | 236B
#Training Tokens | - | 5.7T | 10.2T | 8.1T | 10.2T
English
BBH (EM) | 3-shot | 48.1 | 61.2 | 79.7 | 83.9
MMLU (Acc.) | 5-shot | 55.7 | 60.1 | 78.1 | 79.2
ARC-Easy (Acc.) | 25-shot | 86.1 | 88.9 | 98.1 | 97.4
ARC-Challenge (Acc.) | 25-shot | 73.4 | 77.4 | 92.3 | 92.8
TriviaQA (EM) | 5-shot | 65.2 | 59.5 | 86.7 | 82.3
NaturalQuestions (EM) | 5-shot | 35.5 | 30.8 | 53.4 | 47.5
AGIEval (Acc.) | 0-shot | 42.8 | 28.7 | 61.4 | 60.0
Chinese
CLUEWSC (EM) | 5-shot | 80.0 | 76.5 | 89.9 | 85.9
C-Eval (Acc.) | 5-shot | 60.1 | 61.6 | 78.0 | 79.4
CMMLU (Acc.) | 5-shot | 62.5 | 62.7 | 81.6 | 80.9
Open-ended
Arena-Hard | - | 11.40 | 38.10 | 41.60 | 65.00
AlpacaEval 2.0 | - | 16.85 | 17.74 | 38.90 | 36.92
MT-Bench | - | 7.37 | 7.81 | 8.97 | 8.77
AlignBench | - | 6.02 | 6.83 | 7.91 | 7.84

Table 10 | A comparison of DeepSeek-Coder-V2 Instruct with DeepSeek-V2 Chat.

When comparing the 16B models, it is evident that DeepSeek-Coder-V2-Lite-Instruct outperforms DeepSeek-V2-Lite-Chat on benchmarks like BBH and Arena-Hard. These benchmarks place a high demand on the model's reasoning ability, at which DeepSeek-Coder-V2-Lite-Instruct excels. However, DeepSeek-Coder-V2-Lite-Instruct falls behind on knowledge-intensive benchmarks like TriviaQA, primarily due to the relatively smaller amount of web data used during pre-training.

Moving on to the 236B models, DeepSeek-Coder-V2 Instruct exhibits greater strength on reasoning benchmarks, particularly on Arena-Hard, which comprises a substantial proportion of code, math, and reasoning questions. On the other hand, DeepSeek-V2 Chat demonstrates slightly better results on benchmarks such as MT-Bench (Zheng et al., 2023), AlpacaEval 2.0 (Dubois et al., 2024), and AlignBench (Liu et al., 2023c). This advantage can be attributed to the general-purpose alignment stage of DeepSeek-V2 Chat.
132、d to thegeneral-purpose alignment stage of DeepSeek-V2 Chat.5.ConclusionIn this paper,we introduce DeepSeek-Coder-V2 to further advance the field of code intelligence,which is continually pre-trained from DeepSeek-V2 with 6 trillion tokens sourced from a high-quality and multi-source corpus.Through
133、this continued pre-training,we find that DeepSeek-14Coder-V2 significantly enhances the models capabilities in coding and mathematical reasoningwhile maintaining comparable general language performance to DeepSeek-V2.Compared toDeepSeek-Coder,DeepSeek-Coder-V2 supports a significantly larger number
134、of programminglanguages,increasing from 86 to 338,and extends the maximum context length from 16K to128K tokens.Experimental results demonstrate that DeepSeek-Coder-V2 achieves performancecomparable to state-of-the-art closed-source models such as GPT-4 Turbo,Claude 3 Opus,andGemini 1.5 Pro in code
135、and math-specific tasks.Although DeepSeek-Coder-V2 achieves impressive performance on standard benchmarks,we find that there is still a significant gap in instruction-following capabilities compared tocurrent state-of-the-art models like GPT-4 Turbo.This gap leads to poor performance in complexscena
136、rios and tasks such as those in SWEbench.Therefore,we believe that a code model needsnot only strong coding abilities but also exceptional instruction-following capabilities to handlereal-world complex programming scenarios.In the future,we will focus more on improving themodels instruction-followin
References

L. B. Allal, R. Li, D. Kocetkov, C. Mou, C. Akiki, C. M. Ferrandis, N. Muennighoff, M. Mishra, A. Gu, M. Dey, et al. SantaCoder: don't reach for the stars! arXiv preprint arXiv:2301.03988, 2023.

A. Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Claude-3 Model Card, 2024.

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program synthesis with large language models, 2021a.

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021b.

M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, J. Tworek, and M. Chen. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255, 2022.

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018. URL http://arxiv.org/abs/1803.05457.

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model, 2024.

Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.

D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. Li, et al. DeepSeek-Coder: When the large language model meets programming - the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.

Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023.

N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code, 2024.

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.

M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In R. Barzilay and M.-Y. Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601-1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147.

A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.

D. Kocetkov, R. Li, L. Jia, C. Mou, Y. Jernite, M. Mitchell, C. M. Ferrandis, S. Hughes, T. Wolf, D. Bahdanau, et al. The Stack: 3 TB of permissively licensed source code. Transactions on Machine Learning Research, 2022.

T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural Questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguistics, 7:452-466, 2019. doi: 10.1162/tacl_a_00276. URL https://doi.org/10.1162/tacl_a_00276.

H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212, 2023a.

R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, et al. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023b.

T. Li, W.-L. Chiang, E. Frick, L. Dunlap, B. Zhu, J. E. Gonzalez, and I. Stoica. From live data to high-quality benchmarks: The Arena-Hard pipeline, April 2024. URL https://lmsys.org/blog/2024-04-19-arena-hard/.

J. Liu, C. S. Xia, Y. Wang, and L. Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a.

T. Liu, C. Xu, and J. McAuley. RepoBench: Benchmarking repository-level code auto-completion systems. In The Twelfth International Conference on Learning Representations, 2023b.

X. Liu, X. Lei, S. Wang, Y. Huang, Z. Feng, B. Wen, J. Cheng, P. Ke, Y. Xu, W. L. Tam, X. Zhang, L. Sun, H. Wang, J. Zhang, M. Huang, Y. Dong, and J. Tang. AlignBench: Benchmarking Chinese alignment of large language models. CoRR, abs/2311.18743, 2023c. doi: 10.48550/ARXIV.2311.18743. URL https://doi.org/10.48550/arXiv.2311.18743.

I. Loshchilov and F. Hutter. Decoupled weight decay regularization, 2019.

A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei, et al. StarCoder 2 and The Stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024.

MAA. American Invitational Mathematics Examination - AIME 2024, 2024. URL https://maa.org/math-competitions/american-invitational-mathematics-examination-aime.

Meta. Introducing Meta Llama 3: The most capable openly available LLM to date, 2024.

MistralAI. Codestral, 2024. URL https://mistral.ai/news/codestral/. Accessed: 2024-05-29.

Netmind.AI. Odyssey-math, 2024.

OpenAI. GPT-4 technical report, 2023.

B. Peng, J. Quesnelle, H. Fan, and E. Shippole. YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.

M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-b. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

Q. Shi, M. Tang, K. Narasimhan, and S. Yao. Can language models solve olympiad programming? arXiv preprint arXiv:2404.10952, 2024.

M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, Y. Tian, Q. Dong, W. Liu, B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou, S. Liu, Z. Zhao, Q. Zhao, C. Yue, X. Zhang, Z. Yang, K. Richardson, and Z. Lan. CLUE: A Chinese language understanding evaluation benchmark. In D. Scott, N. Bel, and C. Zong, editors, Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pages 4762-4772. International Committee on Computational Linguistics, 2020. doi: 10.18653/V1/2020.COLING-MAIN.419. URL https://doi.org/10.18653/v1/2020.coling-main.419.

L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.

W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. AGIEval: A human-centric benchmark for evaluating foundation models. CoRR, abs/2304.06364, 2023. doi: 10.48550/arXiv.2304.06364. URL https://doi.org/10.48550/arXiv.2304.06364.
A. Supported Programming Languages

ABAP, ActionScript, Ada, Agda, AGS Script, Alloy, AmbientTalk, AMD GPU, AMPL, ANSYS Parametric Design Language, ANTLR, Apache Configuration, APL, AppleScript, Arc, Arduino, ASP, AspectJ, Assembly, Asymptote, Augeas, AutoHotkey, AutoIt, AWK, BC, Berry, BitBake, BlitzBasic, BlitzMax, Bluespec, BNF, Boo, Boogie, Brainfuck, BrightScript, Bro, BST, C, C#, C2HS Haskell, CADL, CapDL, Ceylon, Chapel, ChucK, Cirru, Click, Clojure, CMake, COBOL, COBOLFree, CoffeeScript, ColdFusion CFC, Common Lisp, C++, Crystal, Csound, Csound Score, CSS, CUDA, Cypher, Cython, Darcs Patch, Dart, DASM16, Debian Control File, DeviceTree, Diff, DM, Docker, Dockerfile, Dylan, EBNF, eC, Eiffel, Elixir, Elm, ELPi, Emacs Lisp, EmberScript, Erlang, Execline, F#, Factor, Fancy, Fantom, Felix, Fennel, Fish, Flux, Fortran, Fortran Fixed Form, FoxPro, FreeFem, FreeMarker, F*, Futhark, G-Code, GAP, GAS, GDScript, Genshi, Gentoo Ebuild, Gentoo Eclass, Gettext Catalog, GLSL, Glyph, Gnuplot, Go, Gosu, Grace, Gradle, Grammatical Framework, GraphQL, Graphviz DOT, Groff, Groovy, Groovy Server Pages, GSQL, Handlebars, Haskell, Haxe, HCL, HLSL, HTML, HTML Django, HTML ERB, HTML PHP, HTTP, Hy, Idris, IGOR Pro, Inform 6 Template, Inno Setup, Io, Isabelle, J, Jade, JAGS, Jasmin, Java, Java Server Pages, JavaScript, JavaScriptMozPreproc, JCL, JFlex, JSON, JSONiq, JSX, Julia, Jupyter Notebook, K, Kconfig, Koka, Kotlin, KRL, Lean, Less, Lex, LFE, Lighttpd Configuration File, LilyPond, Limbo, Linker Script, Liquid, Literate Agda, Literate CoffeeScript, LLVM, Logtalk, LSL, Lua, M4, Makefile, Mako, Mason, MATLAB, Maxima, Meson, Metal, MiniScript, Mirah, Mizar, Modelica, Modula-2, Monkey, MooCode, MoonScript, Mosel, MQL, MUF, MuPAD, NASM, NCL, NetLinx, Nginx Configuration File, Nimrod, Ninja, Nit, Nix, NSIS, Nu, NuSMV, Objdump, Objective-C, Objective-C++, OCaml, Octave, Odin, OMG Interface Definition Language, ooc, Opa, OpenCL, OpenEdge ABL, OpenSCAD, Ox, Oz, Papyrus, Parrot Internal Representation, Pascal, PAWN, PEG, Perl, Perl 6, PHP, Pike, PkgConfig, POD, Pony, POV-Ray, PowerShell, Praat, Processing, Propeller Spin, Protocol Buffer, Pug, Puppet, PureBasic, PureScript, Python, Q, QML, QVTO, R, Racket, Ragel in Ruby Host, RAML, RConsole, Rd, REALbasic, ReasonML, Red, RenderScript, RenPy, REXX, RHTML, Ride, Robot Framework, Rouge, Ruby, Rust, S, Sage, SARL, SAS, Sass, Scala, Scheme, Scilab, SCSS, Self, Shell, ShExC, Sieve, Silver, Singularity, Slim, Smali, Smarty, Smithy, SMT, Solidity, SourcePawn, SPARQL, SQF, SQL, Squirrel, Stan, Standard ML, Stata, Stylus, SuperCollider, Swift, SWIG, SystemVerilog, Tcl, Tcsh, Tea, Terminfo, TeX, Thrift, Transact-SQL, Treetop, Turing, Twig, TypeScript, TypoScript, Unity3D Asset, Uno, UnrealScript, UrWeb, USD, Vala, VBScript, VCL, Velocity, Verilog, VHDL, VimL, Visual Basic, Vue, WebAssembly, Web IDL, Whiley, X10, XBase, XC, XML, XML Lasso, XQuery, XS, XSLT, Xtend, Xtlang, YANG, Zeek, Zephir, Zig, Zimpl