DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence

Daya Guo*1, Qihao Zhu1,2, Dejian Yang1, Zhenda Xie1, Kai Dong1, Wentao Zhang1, Guanting Chen1, Xiao Bi1, Y. Wu1, Y.K. Li1, Fuli Luo1, Yingfei Xiong2, Wenfeng Liang1
1 DeepSeek-AI
2 Key Lab of HCST (PKU), MOE; SCS, Peking University

Abstract

The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.

Figure 1 | The Performance of DeepSeek-Coder.

*Core contributors, ordered alphabetically by name.

arXiv:2401.14196v2 [cs.SE] 26 Jan 2024

1. Introduction

The field of software development has been significantly transformed by the swift advancement of large language models (OpenAI, 2023; Touvron et al., 2023), which have brought about a new era of code intelligence.
These models have the potential to automate and streamline many aspects of coding, from bug detection to code generation, thereby enhancing productivity and reducing the likelihood of human error. However, a major challenge in this field is the performance gap between open-source models (Li et al., 2023; Nijkamp et al., 2022; Roziere et al., 2023; Wang et al., 2021) and closed-source models (Gemini Team, 2023; OpenAI, 2023). The giant closed-source models, while powerful, are often inaccessible to many researchers and developers due to their proprietary nature.

In response to this challenge, we present the DeepSeek-Coder series. This series comprises a range of open-source code models, varying in size from 1.3B to 33B, including the base version and instructed version for each size. Each model in the series has been trained from scratch on 2 trillion tokens sourced from 87 programming languages, ensuring a comprehensive understanding of coding languages and syntax. Besides, we attempt to organize the pre-training data at the repository level to enhance the pre-trained model's understanding capability within the context of cross-file dependencies within a repository. In addition to employing the next token prediction loss during pre-training, we have also incorporated the Fill-In-Middle (FIM) approach (Bavarian et al., 2022; Li et al., 2023). This approach is designed to further bolster the model's code completion capabilities. To meet the requirements of handling longer code inputs, we have extended the context length to 16K. This adjustment allows our models to handle more complex and extensive coding tasks, thereby increasing their versatility and applicability in various coding scenarios.

We have carried out comprehensive experiments using a variety of public code-related benchmarks. The findings reveal that among open-source models, DeepSeek-Coder-Base 33B consistently delivers superior performance across all benchmarks. Furthermore, DeepSeek-Coder-Instruct 33B surpasses OpenAI GPT-3.5 Turbo in the majority of the evaluation benchmarks, significantly narrowing the performance gap between OpenAI GPT-4 and open-source models. Remarkably, despite having fewer parameters, DeepSeek-Coder-Base 7B demonstrates competitive performance when compared to models that are five times larger, such as CodeLlama-33B (Roziere et al., 2023). To summarize, our main contributions are:
- We introduce DeepSeek-Coder-Base and DeepSeek-Coder-Instruct, our advanced code-focused large language models (LLMs). Developed through extensive training on an expansive code corpus, these models exhibit proficiency in understanding 87 programming languages. Additionally, they are available in various model scales to cater to a wide range of computational and application needs.
- We make the first attempt to incorporate repository-level data construction during the pre-training phase of our models. We find that it can significantly boost the capability of cross-file code generation.
- Our analysis rigorously examines the impact of FIM training strategies on the pretraining phase of code models. The outcomes of these comprehensive studies shed light on intriguing aspects of FIM configurations, offering valuable insights that significantly contribute to the enhancement and development of code pretrained models.
- We conduct extensive evaluations of our code LLMs against a wide array of benchmarks encompassing numerous code-related tasks. The findings demonstrate that DeepSeek-Coder-Base surpasses all existing open-source code LLMs across these benchmarks. Furthermore, with meticulous fine-tuning using instructional data, DeepSeek-Coder-Instruct achieves better performance compared to the OpenAI GPT-3.5 Turbo model in code-related tasks.
2. Data Collection

The training dataset of DeepSeek-Coder is composed of 87% source code, 10% English code-related natural language corpus, and 3% code-unrelated Chinese natural language corpus. The English corpus consists of materials from GitHub's Markdown and StackExchange, which are used to enhance the model's understanding of code-related concepts and improve its ability to handle tasks like library usage and bug fixing. Meanwhile, the Chinese corpus consists of high-quality articles aimed at improving the model's proficiency in understanding the Chinese language. In this section, we provide an overview of how we construct the code training data. This process involves data crawling, rule-based filtering, dependency parsing, repository-level deduplication, and quality screening, as illustrated in Figure 2. In the following, we describe the data creation procedure step by step.

Figure 2 | The Procedure of Dataset Creation: Data Crawling -> Rule Filtering -> Dependency Parsing -> Repo-level Deduplication -> Quality Screening.
2.1. GitHub Data Crawling and Filtering

We collect public repositories created before February 2023 on GitHub and retain only 87 programming languages, as listed in Table 1. To reduce the amount of data to be processed, we apply filtering rules similar to those used in the StarCoder project (Li et al., 2023) to preliminarily filter out lower-quality code. By applying these filtering rules, we reduce the total amount of data to only 32.8% of its original size. To make the paper self-contained, we briefly describe the filter rules used in the StarCoder Data project:

Firstly, we filter out files with an average line length exceeding 100 characters or a maximum line length surpassing 1000 characters. Additionally, we remove files with fewer than 25% alphabetic characters. Except for the XSLT programming language, we further filter out files where the string "<?xml version=" appears in the first 100 characters. For HTML files, we consider the ratio of visible text to HTML code. We retain files where the visible text constitutes at least 20% of the code and is no less than 100 characters. For JSON and YAML files, which typically contain more data, we only keep files that have a character count ranging from 50 to 5000 characters. This effectively removes most data-heavy files.
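These rules map directly onto simple per-file checks. The following is a minimal sketch of how such a filter might be applied; the thresholds come from the description above, while the function name and the plain-text file interface are illustrative assumptions rather than the project's actual tooling (the HTML visible-text rule is omitted because it requires an HTML parser).

```python
def passes_basic_filters(path: str, text: str) -> bool:
    """Approximate the StarCoder-style filtering rules from Section 2.1."""
    lines = text.splitlines() or [""]
    avg_len = sum(len(l) for l in lines) / len(lines)
    max_len = max(len(l) for l in lines)
    if avg_len > 100 or max_len > 1000:
        return False                      # overly long lines suggest generated or minified code
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_ratio < 0.25:
        return False                      # mostly non-alphabetic content (data blobs, encodings)
    if not path.endswith(".xslt") and "<?xml version=" in text[:100]:
        return False                      # XML data files, except the XSLT language itself
    if path.endswith((".json", ".yaml", ".yml")) and not (50 <= len(text) <= 5000):
        return False                      # drop data-heavy JSON/YAML files
    return True
```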
2.2. Dependency Parsing

In previous works (Chen et al., 2021; Li et al., 2023; Nijkamp et al., 2022; Roziere et al., 2023), large language models for code are mainly pre-trained on file-level source code, which ignores the dependencies between different files in a project. However, in practical applications, such models struggle to effectively scale to handle entire project-level code scenarios. Therefore, in this step we consider how to leverage the dependencies between files within the same repository. Specifically, we first parse the dependencies between files and then arrange these files in an order that ensures the context each file relies on is placed before that file in the input sequence. By aligning the files in accordance with their dependencies, our dataset more accurately represents real coding practices and structures. This enhanced alignment not only makes our dataset more relevant but also potentially increases the practicality and applicability of the model in handling project-level code scenarios. It is worth noting that we only consider the invocation relationships between files and use regular expressions to extract them, such as "import" in Python, "using" in C#, and "include" in C.

Algorithm 1 Topological Sort for Dependency Analysis

```
procedure TOPOLOGICALSORT(files)
    graphs   <- empty adjacency list
    inDegree <- empty dictionary of in-degrees
    for each file in files do
        graphs[file]   <- []
        inDegree[file] <- 0
    end for
    for each fileA in files do
        for each fileB in files do
            if HASDEPENDENCY(fileA, fileB) then      // fileA depends on fileB
                graphs[fileB].append(fileA)          // add edge from B to A
                inDegree[fileA] <- inDegree[fileA] + 1   // increment in-degree of A
            end if
        end for
    end for
    subgraphs <- getDisconnectedSubgraphs(graphs)    // identify disconnected subgraphs
    allResults <- []
    for each subgraph in subgraphs do
        results <- []
        while length(results) < NumberOfNodes(subgraph) do
            node <- argmin(inDegree[v] for v in subgraph with v not in results)
            for each successor in graphs[node] do
                inDegree[successor] <- inDegree[successor] - 1
            end for
            results.append(node)
        end while
        allResults.append(results)
    end for
    return allResults
end procedure
```

Algorithm 1 describes a topological sort for dependency analysis on a list of files within the same project. Initially, it sets up two data structures: an empty adjacency list named graphs to represent dependencies between files and an empty dictionary called inDegree for storing the in-degrees of each file. The algorithm then iterates over each file pair to identify dependencies, updating graphs and inDegree accordingly. Next, it identifies any disconnected subgraphs within the overall dependency graph. For each subgraph, the algorithm employs a modified topological sort. Unlike the standard approach that selects nodes with zero in-degree, this algorithm selects nodes with minimal in-degree, which allows it to handle cycles within the graph. Selected nodes are added to a results list, and the in-degrees of their connected nodes are decreased. This process continues until a topologically sorted sequence is generated for each subgraph. The algorithm concludes by returning a list of these sorted sequences, and each sequence's files are concatenated to form a single training sample. To incorporate file path information, a comment indicating the file's path is added at the beginning of each file. This method ensures that the path information is preserved in the training data.
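A compact executable version of this modified topological sort is sketched below. It follows the prose description above (minimal in-degree selection so that cycles do not block progress); the dependency-extraction regex, helper names, and module-resolution logic are illustrative assumptions, not the exact ones used in the data pipeline.

```python
import re
from collections import defaultdict

# Illustrative dependency pattern for Python files only; the real pipeline
# uses per-language regexes (e.g. "using" for C#, "include" for C).
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([\w\.]+)", re.MULTILINE)

def topo_order(files: dict) -> list:
    """Order files so that (most) dependencies appear before their dependents."""
    graph, in_degree = defaultdict(list), {f: 0 for f in files}
    modules = {f.rsplit("/", 1)[-1].removesuffix(".py"): f for f in files}
    for f, text in files.items():
        for m in IMPORT_RE.findall(text):
            dep = modules.get(m.split(".")[0])
            if dep and dep != f:
                graph[dep].append(f)      # edge: dependency -> dependent
                in_degree[f] += 1
    ordered, remaining = [], set(files)
    while remaining:
        # pick the remaining file with minimal in-degree (handles cycles)
        nxt = min(remaining, key=lambda f: in_degree[f])
        remaining.remove(nxt)
        ordered.append(nxt)
        for succ in graph[nxt]:
            in_degree[succ] -= 1
    return ordered

# Example: concatenate one repository into a single training sample,
# prefixing each file with a path comment as described above.
repo = {"pkg/utils.py": "def add(a, b): return a + b\n",
        "pkg/main.py": "from utils import add\nprint(add(1, 2))\n"}
sample = "\n".join(f"# {path}\n{repo[path]}" for path in topo_order(repo))
```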
2.3. Repo-Level Deduplication

Recent studies have demonstrated the significant performance improvements that can be achieved by deduplicating training datasets for Large Language Models (LLMs). Lee et al. (2022) have shown that language model training corpora often contain numerous near-duplicates, and the performance of LLMs can be enhanced by removing long repetitive substrings. Kocetkov et al. (2022) have applied a near-deduplication method to training data, resulting in dramatic improvements, and they emphasize that near-deduplication is a crucial preprocessing step for achieving competitive performance on code benchmark tasks. In our dataset, we have also employed near-deduplication. However, there is a distinction in our approach compared to previous works. We perform deduplication at the repository level of code, rather than at the file level, as the latter approach may filter out certain files within a repository, potentially disrupting the structure of the repository. Specifically, we treat the concatenated code from the repository level as a single sample and apply the same near-deduplication algorithm to ensure the integrity of the repository structure.
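The only change relative to file-level pipelines is the unit of comparison: the concatenated repository, not the individual file. A minimal sketch of that idea is shown below, using plain Jaccard similarity over token shingles as a stand-in for the actual near-deduplication algorithm; the threshold and shingling scheme are assumptions, since the report does not specify them.

```python
def shingles(text: str, n: int = 5) -> set:
    """Token n-gram shingles used to compare two concatenated repositories."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}

def near_duplicates(a: str, b: str, threshold: float = 0.85) -> bool:
    """Jaccard similarity over shingles; large systems typically use MinHash/LSH instead."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / max(len(sa | sb), 1) >= threshold

def dedup_repos(repo_samples: list) -> list:
    """Keep one representative per group of near-duplicate repositories."""
    kept = []
    for sample in repo_samples:
        if not any(near_duplicates(sample, k) for k in kept):
            kept.append(sample)
    return kept
```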
2.4. Quality Screening and Decontamination

In addition to applying the filtering rules mentioned in Section 2.1, we also employ a compiler and a quality model, combined with heuristic rules, to further filter out low-quality data. This includes code with syntax errors, poor readability, and low modularity. We provide the statistical summary of source code in Table 1, which includes a total of 87 languages, detailing the disk size, number of files, and percentage for each language. The total data volume is 798 GB with 603 million files. To ensure that our code training data is not contaminated by information from the test set, which may be present on GitHub, we have implemented an n-gram filtering process. This process involves the removal of any code segments that match specific criteria. Specifically, we filter out files containing docstrings, questions, and solutions from sources such as HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021). For the filtering criteria, we apply the following rules: if a piece of code includes a 10-gram string identical to any in the test data, it is excluded from our training data. In cases where the test data comprises strings that are shorter than 10-grams but no less than 3-grams, we use an exact match approach for filtering.
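A small sketch of this decontamination rule is given below; it assumes whitespace tokenization and treats the two cases exactly as stated above (10-gram overlap for long test strings, exact match for strings of 3 to 9 tokens). Tokenization details in the real pipeline may differ.

```python
def ngrams(tokens: list, n: int) -> set:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_test_index(test_strings: list):
    """Index benchmark docstrings/questions/solutions for contamination checks."""
    long_ngrams, short_exact = set(), set()
    for s in test_strings:
        toks = s.split()
        if len(toks) >= 10:
            long_ngrams |= ngrams(toks, 10)
        elif len(toks) >= 3:
            short_exact.add(tuple(toks))
    return long_ngrams, short_exact

def is_contaminated(code: str, long_ngrams, short_exact) -> bool:
    toks = code.split()
    if ngrams(toks, 10) & long_ngrams:
        return True                       # shares a 10-gram with the test set
    grams = {tuple(toks[i:i + n]) for n in range(3, 10)
             for i in range(len(toks) - n + 1)}
    return bool(grams & short_exact)      # exact match against short test strings
```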
| Language | Size (GB) | Files (k) | Prop. (%) | Language | Size (GB) | Files (k) | Prop. (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ada | 0.91 | 126 | 0.11 | Literate Haskell | 0.16 | 20 | 0.02 |
| Agda | 0.26 | 59 | 0.03 | Lua | 0.82 | 138 | 0.10 |
| Alloy | 0.07 | 24 | 0.01 | Makefile | 0.92 | 460 | 0.12 |
| ANTLR | 0.19 | 38 | 0.02 | Maple | 0.03 | 6 | 0.00 |
| AppleScript | 0.03 | 17 | 0.00 | Mathematica | 0.82 | 10 | 0.10 |
| Assembly | 0.91 | 794 | 0.11 | MATLAB | 0.01 | 1 | 0.00 |
| Augeas | 0.00 | 1 | 0.00 | OCaml | 0.91 | 139 | 0.11 |
| AWK | 0.09 | 53 | 0.01 | Pascal | 0.79 | 470 | 0.10 |
| Batchfile | 0.92 | 859 | 0.12 | Perl | 0.81 | 148 | 0.10 |
| Bluespec | 0.10 | 15 | 0.01 | PHP | 58.92 | 40,627 | 7.38 |
| C | 28.64 | 27,111 | 3.59 | PowerShell | 0.91 | 236 | 0.11 |
| C# | 58.56 | 53,739 | 7.34 | Prolog | 0.03 | 5 | 0.00 |
| Clojure | 0.90 | 295 | 0.11 | Protocol Buffer | 0.92 | 391 | 0.12 |
| CMake | 0.90 | 359 | 0.11 | Python | 120.68 | 75,188 | 15.12 |
| CoffeeScript | 0.92 | 361 | 0.12 | R | 0.92 | 158 | 0.11 |
| Common Lisp | 0.92 | 105 | 0.11 | Racket | 0.09 | 13 | 0.01 |
| C++ | 90.87 | 36,006 | 11.39 | RMarkdown | 6.83 | 1,606 | 0.86 |
| CSS | 5.63 | 11,638 | 0.71 | Ruby | 15.01 | 18,526 | 1.88 |
| CUDA | 0.91 | 115 | 0.11 | Rust | 0.61 | 692 | 0.08 |
| Dart | 0.89 | 264 | 0.11 | SAS | 0.92 | 70 | 0.11 |
| Dockerfile | 0.04 | 48 | 0.00 | Scala | 0.81 | 971 | 0.10 |
| Elixir | 0.91 | 549 | 0.11 | Scheme | 0.92 | 216 | 0.12 |
| Elm | 0.92 | 232 | 0.12 | Shell | 13.92 | 10,890 | 1.74 |
| Emacs Lisp | 0.91 | 148 | 0.11 | Smalltalk | 0.92 | 880 | 0.12 |
| Erlang | 0.92 | 145 | 0.12 | Solidity | 0.85 | 83 | 0.11 |
| F# | 0.91 | 340 | 0.11 | Sparql | 0.10 | 88 | 0.01 |
| Fortran | 1.67 | 654 | 0.21 | SQL | 15.14 | 7,009 | 1.90 |
| GLSL | 0.92 | 296 | 0.11 | Stan | 0.20 | 41 | 0.03 |
| Go | 2.58 | 1,365 | 0.32 | Standard ML | 0.74 | 117 | 0.09 |
| Groovy | 0.89 | 340 | 0.11 | Stata | 0.91 | 122 | 0.11 |
| Haskell | 0.87 | 213 | 0.11 | SystemVerilog | 0.91 | 165 | 0.11 |
| HTML | 30.05 | 14,998 | 3.77 | TCL | 0.90 | 110 | 0.11 |
| Idris | 0.11 | 32 | 0.01 | Tcsh | 0.17 | 53 | 0.02 |
| Isabelle | 0.74 | 39 | 0.09 | Tex | 20.46 | 2,867 | 2.56 |
| Java | 148.66 | 134,367 | 18.63 | Thrift | 0.05 | 21 | 0.01 |
| Java Server Pages | 0.86 | 1,072 | 0.11 | TypeScript | 60.62 | 62,432 | 7.60 |
| JavaScript | 53.84 | 71,895 | 6.75 | Verilog | 0.01 | 1 | 0.00 |
| JSON | 4.61 | 11,956 | 0.58 | VHDL | 0.85 | 392 | 0.11 |
| Julia | 0.92 | 202 | 0.12 | Visual Basic | 0.75 | 73 | 0.09 |
| Jupyter Notebook | 14.38 | 2,555 | 1.80 | XSLT | 0.36 | 48 | 0.04 |
| Kotlin | 6.00 | 3,121 | 0.75 | Yacc | 0.72 | 67 | 0.09 |
| Lean | 0.52 | 68 | 0.07 | YAML | 0.74 | 890 | 0.09 |
| Literate Agda | 0.05 | 4 | 0.01 | Zig | 0.81 | 70 | 0.10 |
| Literate CoffeeScript | 0.01 | 3 | 0.00 | Total | 797.92 | 603,173 | 100.00 |

Table 1 | A summary of the cleaned training data for the selected programming languages.
3. Training Policy

3.1. Training Strategy

3.1.1. Next Token Prediction

The first training objective for our model is known as next token prediction. In this process, various files are concatenated to form a fixed-length entry. Then, these entries are used to train the model, enabling it to predict the subsequent token based on the provided context.

3.1.2. Fill-in-the-Middle

The second training objective for our model is known as fill-in-the-middle. In the code pre-training scenario, it is often necessary to generate corresponding inserted content based on the given context and subsequent text. Due to specific dependencies in a programming language, relying solely on next token prediction is insufficient to learn this fill-in-the-middle capability. Therefore, several approaches (Bavarian et al., 2022; Li et al., 2023) propose the pre-training method of Fill-in-the-Middle (FIM). This approach involves randomly dividing the text into three parts, then shuffling the order of these parts and connecting them with special tokens. This method aims to incorporate a fill-in-the-blank pre-training task during the training process. Within the FIM methodology, two distinct modes are employed: PSM (Prefix-Suffix-Middle) and SPM (Suffix-Prefix-Middle). In the PSM mode, the training corpus is organized in the sequence of prefix, suffix, middle, aligning the text in a way that the middle segment is flanked by the prefix and suffix. Conversely, the SPM mode arranges the segments as suffix, prefix, middle, presenting a different structural challenge. These modes are instrumental in enhancing the model's capability to handle various structural arrangements in code, providing a robust training framework for advanced code prediction tasks.
Figure 3 | The effectiveness of using the FIM objective. The three panels track Benchmark/HumanEval-Pass@1, Benchmark/HumanFIM-Pass@1, and Benchmark/MBPP-Pass@1 over training steps for the fim_0, fim_0.5, fim_1.0, and msp_0.5 configurations.

To determine the effectiveness of various hyperparameters within the FIM approach, we conducted a series of ablation experiments.
Experiment Settings: In this experiment, we employ DeepSeek-Coder-Base 1.3B as our model architecture. We focused on a Python subset from our training dataset to streamline the experimental process. Our primary objective was to assess the efficacy of the Fill-in-the-Middle (FIM) technique, utilizing the HumanEval-FIM benchmark (Fried et al., 2022). This benchmark specializes in a single-line FIM task for Python, in which one line of code from a HumanEval solution is randomly obscured, testing the model's proficiency in predicting the missing line. We hypothesize that the PSM mode may exhibit subtle differences compared to the traditional next-token prediction objective. This is primarily because PSM involves rearranging the order of the original text, potentially impacting the learning dynamics of the model. Therefore, we implement the PSM mode for FIM across four distinct configurations: 0% FIM rate, 50% FIM rate, 100% FIM rate, and 50% MSP rate. The Masked Span Prediction (MSP) strategy, initially introduced in T5 (Raffel et al., 2023), conceals multiple text spans and trains the model to reconstruct these segments. According to CodeGen2.5 (Nijkamp et al., 2023), MSP may enhance FIM performance compared to PSM. Thus, we include this method in our comparative analysis.

Results: The outcomes of our experiment are illustrated in Figure 3. While the model demonstrates peak performance on HumanEval-FIM with a 100% FIM rate, this configuration also results in the weakest code completion capability. This indicates a trade-off between FIM and code completion abilities. Moreover, we observe that with a 50% PSM rate, the model outperforms the MSP strategy. To achieve a balance between FIM efficiency and code completion proficiency, we ultimately choose the 50% PSM rate as our preferred training policy.

In our implementation, we have introduced three sentinel tokens specifically for this task. For each code file, we initially divide its content into three segments: a prefix, a middle, and a suffix. Using the PSM mode, we construct each training example by placing the prefix first, then the suffix, and finally the middle, with the sentinel tokens marking the segment boundaries. We implement the Fill-in-the-Middle (FIM) method at the document level before the packing process, as proposed in the original work by Bavarian et al. (2022). This is done with an FIM rate of 0.5, following the PSM mode.
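The sketch below illustrates document-level PSM construction as described above. The sentinel token strings are placeholders chosen for illustration only (the report does not spell out the exact tokens it uses), and the uniformly sampled split points are one common choice rather than a detail taken from the paper.

```python
import random

# Placeholder sentinel strings; the model's actual special tokens may differ.
FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"

def to_psm_example(document: str, fim_rate: float = 0.5) -> str:
    """Apply FIM to a single document (before packing), using PSM ordering."""
    if random.random() >= fim_rate:
        return document                   # the rest stays plain next-token data
    i, j = sorted(random.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # PSM: prefix, then suffix, then the middle the model must reconstruct.
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"

example = to_psm_example("def add(a, b):\n    return a + b\n", fim_rate=1.0)
```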
3.2. Tokenizer

For the tokenization process, we employ the HuggingFace Tokenizer library to train Byte Pair Encoding (BPE) tokenizers, as outlined in Sennrich et al. (2015), on a subset of our training corpus. Ultimately, we utilize a tokenizer configured with a vocabulary size of 32,000.
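For reference, training a BPE tokenizer of this size with the HuggingFace tokenizers library looks roughly like the sketch below; the file list, special tokens, and byte-level pre-tokenization are illustrative assumptions, since the report only states the library and the 32,000-entry vocabulary.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE trained on a sample of the code corpus (assumed setup).
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=["<pad>", "<bos>", "<eos>"],  # placeholder special tokens
)
tokenizer.train(files=["corpus_sample.txt"], trainer=trainer)
tokenizer.save("bpe_tokenizer_sketch.json")
```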
3.3. Model Architecture

We develop a range of models with varying parameters to cater to diverse applications, including models with 1.3B, 6.7B, and 33B parameters. These models are built upon the same framework as the DeepSeek Large Language Model (LLM) outlined by DeepSeek-AI (2024). Each model is a decoder-only Transformer, incorporating Rotary Position Embedding (RoPE) as described by Su et al. (2023). Notably, the DeepSeek-Coder 33B model integrates Grouped-Query Attention (GQA) with a group size of 8, enhancing both training and inference efficiency. Additionally, we employ FlashAttention v2 (Dao, 2023) to expedite the computation involved in the attention mechanism. The architectural details of our models are summarized in Table 2.
3.4. Optimization

Following DeepSeek LLM (DeepSeek-AI, 2024), we use AdamW (Loshchilov and Hutter, 2019) as the optimizer, with β1 and β2 values of 0.9 and 0.95. We adapt batch sizes and learning rates according to the scaling laws suggested in DeepSeek LLM. For the learning rate scheduling, we implement a three-stage policy, which includes 2000 warm-up steps, and set the final learning rate to 10% of the initial rate. Notably, the learning rate at each stage is scaled down to √(1/10) of the preceding stage's rate, following the guidelines established in DeepSeek LLM (DeepSeek-AI, 2024).
3.5. Environments

Our experiments are conducted using the HAI-LLM (High-Flyer, 2023) framework, known for its efficiency and lightweight approach in training large language models. This framework incorporates a variety of parallelism strategies to optimize computational efficiency. These include tensor parallelism (Korthikanti et al., 2023), alongside ZeRO data parallelism (Rajbhandari et al., 2020) and PipeDream pipeline parallelism (Narayanan et al., 2019).

| | DeepSeek-Coder 1.3B | DeepSeek-Coder 6.7B | DeepSeek-Coder 33B |
| --- | --- | --- | --- |
| Hidden Activation | SwiGLU | SwiGLU | SwiGLU |
| Hidden size | 2048 | 4096 | 7168 |
| Intermediate size | 5504 | 11008 | 19200 |
| Hidden layers number | 24 | 32 | 62 |
| Attention heads number | 16 | 32 | 56 |
| Attention | Multi-head | Multi-head | Grouped-query (8) |
| Batch Size | 1024 | 2304 | 3840 |
| Max Learning Rate | 5.3e-4 | 4.2e-4 | 3.5e-4 |

Table 2 | Hyperparameters of DeepSeek-Coder.

Our experiments utilize clusters outfitted with NVIDIA A100 and H800 GPUs. In the A100 cluster, each node is configured with 8 GPUs, interconnected in pairs using NVLink bridges. The H800 cluster is similarly arranged, with each node containing 8 GPUs. These GPUs are interconnected using a combination of NVLink and NVSwitch technologies, ensuring efficient data transfer within nodes. To facilitate seamless communication between nodes in both A100 and H800 clusters, we employ InfiniBand interconnects, known for their high throughput and low latency. This setup provides a robust and efficient infrastructure for our computational experiments.
3.6. Long Context

To enhance the capabilities of DeepSeek-Coder in handling extended contexts, particularly for scenarios like repository-level code processing, we have reconfigured the RoPE (Su et al., 2023) parameters to extend the default context window. Following previous practices (Chen et al., 2023; kaiokendev, 2023), we employed a linear scaling strategy, increasing the scaling factor from 1 to 4 and altering the base frequency from 10000 to 100000. The model underwent an additional 1000 steps of training, using a batch size of 512 and a sequence length of 16K. The learning rate was maintained as in the final pre-training phase. Theoretically, these modifications enable our model to process up to 64K tokens in context. However, empirical observations suggest that the model delivers its most reliable outputs within a 16K token range. Future research will continue to refine and evaluate the long-context adaptation methodology, aiming to further enhance DeepSeek-Coder's efficiency and user-friendliness in processing extended contexts.
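In concrete terms, linear scaling divides the position indices fed to RoPE by the scaling factor, and the base change alters the frequency spectrum. The snippet below sketches both adjustments with the parameters quoted above; the function layout is a minimal illustration, not the HAI-LLM implementation.

```python
import numpy as np

def rope_angles(positions, head_dim, base=100_000.0, linear_scale=4.0):
    """RoPE rotation angles with linear position scaling (factor 1 -> 4)
    and base frequency raised from 10000 to 100000, as in Section 3.6."""
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    scaled_pos = np.asarray(positions, dtype=np.float64) / linear_scale
    return np.outer(scaled_pos, inv_freq)   # shape: [len(positions), head_dim // 2]

# A position at raw index 16383 now maps to effective index ~4096, so the 16K
# window is covered by rotation magnitudes close to those seen in pre-training.
angles = rope_angles(range(16_384), head_dim=128)
```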
3.7. Instruction Tuning

We develop DeepSeek-Coder-Instruct by enhancing DeepSeek-Coder-Base through instruction-based fine-tuning using high-quality data. This data comprises helpful and impartial human instructions, structured by the Alpaca Instruction format (Taori et al., 2023). To demarcate each dialogue turn, we employed a unique delimiter token to signify the conclusion of each segment. For training, we use a cosine schedule with 100 warm-up steps and an initial learning rate of 1e-5. We also use a batch size of 4M tokens and 2B tokens in total.

An example of using DeepSeek-Coder-Instruct 33B is depicted in Figure 4. This example is a multi-turn dialogue scenario for building a snake game. Initially, we ask the model to write a snake game using pygame. The model successfully creates a basic snake game that can run without bugs. To improve the game, we further request adding a scoring system in the top left corner. The model then introduces a score variable and a display_score function, along with an explanation of how to integrate these features. This example illustrates DeepSeek-Coder-Instruct's ability to provide complete solutions in multi-turn dialogue settings. More cases can be found in Appendix A.
Q1: Write a game snake using pygame

Sure, here is a simple implementation of the classic Snake game using Pygame. This code will create a window with a snake that can be controlled using the arrow keys. The snake will grow in length each time it eats the food. The game will end if the snake hits the edge of the window or its own tail.

```python
import pygame
import random

# ... (setup code shortened in the figure) ...

running = True
while running:
    # Draw everything
    screen.fill(BLACK)
    snake.draw()
    food.draw()
    pygame.display.flip()

    # Cap the frame rate
    clock.tick(SPEED)

# Quit Pygame
pygame.quit()
```

This code sets up a basic game loop where the snake moves in response to key presses, and the game ends if the snake hits the edge of the window or its own tail. The snake grows in length each time it eats the food. The game continues until the player closes the window. [Run Code]

Q2: Add a scoring system in the top left corner

To add a scoring system to the top left corner of the game window, you can create a function to display the score and increment it each time the snake eats the food. Here is how you can modify the code to include a scoring system:

```python
# ... (setup code shortened in the figure) ...

running = True
while running:
    # Draw everything
    screen.fill(BLACK)
    snake.draw()
    food.draw()
    display_score(score)  # Display the score
    pygame.display.flip()

    # Cap the frame rate
    clock.tick(SPEED)

# Quit Pygame
pygame.quit()
```

In this updated code, we've added a score variable to keep track of the player's score, which is incremented each time the snake eats the food. The display_score function is used to render the score text and display it in the top left corner of the window. The display_score function is called after the snake has moved and the screen has been filled with the black color, but before the pygame.display.flip() call, so that the score is drawn on top of everything else. [Run Code]

Figure 4 | An example of responses from DeepSeek-Coder-Instruct 33B in a multi-turn setting.
4. Experimental Results

In this section, we evaluate DeepSeek-Coder on four tasks, including code generation (4.1), FIM code completion (4.2), cross-file code completion (4.3) and program-based math reasoning (4.4). We compare DeepSeek-Coder with the previous state-of-the-art large language models:

- CodeGeeX2 (Zheng et al., 2023) represents the second generation of the multilingual code generation model CodeGeeX. It is developed using the ChatGLM2 (Du et al., 2022) architecture and is enhanced with an extensive dataset of coding examples.
- StarCoder (Li et al., 2023) is a publicly accessible model with a substantial parameter count of 15 billion. It is specifically trained on a meticulously curated subset of the Stack dataset (Kocetkov et al., 2022), covering 86 programming languages, ensuring its proficiency across a wide range of coding tasks.
- CodeLlama (Roziere et al., 2023) encompasses a series of code-centric Large Language Models (LLMs) that are derivatives of LLaMA2 (Touvron et al., 2023). Available in three sizes (7B, 13B, and 34B), these models undergo continued training on a vast 500 billion token code corpus, building upon the foundational LLaMA2 architecture.
- code-cushman-001 (Chen et al., 2021) is a 12 billion parameter model developed by OpenAI and served as the initial model for GitHub Copilot.
- GPT-3.5 and GPT-4 (OpenAI, 2023) are advanced generative AI models developed by OpenAI. While they are not explicitly trained for code generation, they also demonstrate notable performance in this domain. Their effectiveness in handling code generation tasks is largely attributed to their massive scale in terms of parameter count.
4.1. Code Generation

HumanEval and MBPP Benchmarks. The HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) benchmarks are widely used for evaluating code LLMs. HumanEval consists of 164 hand-written Python problems that are validated using test cases to assess the code generated by a Code LLM in a zero-shot setting, while the MBPP benchmark includes 500 problems in a few-shot setting. To evaluate the model's multilingual capabilities, we expanded the Python problems of the HumanEval benchmark to seven additional commonly used programming languages, namely C++, Java, PHP, TypeScript (TS), C#, Bash, and JavaScript (JS) (Cassano et al., 2023). For both benchmarks, we adopted a greedy search approach and re-implemented the baseline results using the same script and environment for fair comparison.
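With greedy decoding, pass@1 reduces to the fraction of problems whose single generated completion passes all test cases. The harness below is a minimal, generic sketch of that evaluation loop (direct, unsandboxed execution with a simple timeout), not the exact script used for the results in this section.

```python
import os, subprocess, tempfile

def passes_tests(completion: str, test_code: str, timeout_s: int = 10) -> bool:
    """Run one generated solution against its benchmark tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(completion + "\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def pass_at_1(samples: list) -> float:
    """samples: (greedy completion, test code) pairs, one per problem."""
    return sum(passes_tests(c, t) for c, t in samples) / len(samples)
```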
| Model | Size | Python | C++ | Java | PHP | TS | C# | Bash | JS | Avg | MBPP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Multilingual Base Models | | | | | | | | | | | |
| code-cushman-001 | 12B | 33.5% | 31.9% | 30.6% | 28.9% | 31.3% | 22.1% | 11.7% | - | - | - |
| CodeGeeX2 | 6B | 36.0% | 29.2% | 25.9% | 23.6% | 20.8% | 29.7% | 6.3% | 24.8% | 24.5% | 36.2% |
| StarCoderBase | 16B | 31.7% | 31.1% | 28.5% | 25.4% | 34.0% | 34.8% | 8.9% | 29.8% | 28.0% | 42.8% |
| CodeLlama | 7B | 31.7% | 29.8% | 34.2% | 23.6% | 36.5% | 36.7% | 12.0% | 29.2% | 29.2% | 38.6% |
| CodeLlama | 13B | 36.0% | 37.9% | 38.0% | 34.2% | 45.2% | 43.0% | 16.5% | 32.3% | 35.4% | 48.4% |
| CodeLlama | 34B | 48.2% | 44.7% | 44.9% | 41.0% | 42.1% | 48.7% | 15.8% | 42.2% | 41.0% | 55.2% |
| DeepSeek-Coder-Base | 1.3B | 34.8% | 31.1% | 32.3% | 24.2% | 28.9% | 36.7% | 10.1% | 28.6% | 28.3% | 46.2% |
| DeepSeek-Coder-Base | 6.7B | 49.4% | 50.3% | 43.0% | 38.5% | 49.7% | 50.0% | 28.5% | 48.4% | 44.7% | 60.6% |
| DeepSeek-Coder-Base | 33B | 56.1% | 58.4% | 51.9% | 44.1% | 52.8% | 51.3% | 32.3% | 55.3% | 50.3% | 66.0% |
| Instruction-Tuned Models | | | | | | | | | | | |
| GPT-3.5-Turbo | - | 76.2% | 63.4% | 69.2% | 60.9% | 69.1% | 70.8% | 42.4% | 67.1% | 64.9% | 70.8% |
| GPT-4 | - | 84.1% | 76.4% | 81.6% | 77.2% | 77.4% | 79.1% | 58.2% | 78.0% | 76.5% | 80.0% |
| DeepSeek-Coder-Instruct | 1.3B | 65.2% | 45.3% | 51.9% | 45.3% | 59.7% | 55.1% | 12.7% | 52.2% | 48.4% | 49.4% |
| DeepSeek-Coder-Instruct | 6.7B | 78.6% | 63.4% | 68.4% | 68.9% | 67.2% | 72.8% | 36.7% | 72.7% | 66.1% | 65.4% |
| DeepSeek-Coder-Instruct | 33B | 79.3% | 68.9% | 73.4% | 72.7% | 67.9% | 74.1% | 43.0% | 73.9% | 69.2% | 70.0% |

Table 3 | Performance of approaches on the Multilingual HumanEval and MBPP Benchmarks.
The results are presented in Table 3. As we can see, DeepSeek-Coder-Base achieves state-of-the-art performance with an average accuracy of 50.3% on HumanEval and 66.0% on MBPP. In comparison to the similarly sized open-source model CodeLlama-Base 34B, our model has demonstrated a notable improvement of 9% and 11% in accuracy, respectively. It is worth noting that even our smaller model, DeepSeek-Coder-Base 6.7B, surpasses the performance of CodeLlama-Base 34B. After instruction fine-tuning, our model surpasses the closed-source GPT-3.5-Turbo model on the HumanEval benchmark, significantly reducing the performance gap between OpenAI GPT-4 and open-source models.

DS-1000 Benchmark. HumanEval and MBPP have a significant drawback in that they rely heavily on straightforward programming tasks that may not accurately represent the kind of code most programmers typically write. In contrast, the DS-1000 benchmark, as introduced in the work by Lai et al. (2023), offers a comprehensive collection of 1,000 practical and realistic data science workflows across seven different libraries. This benchmark evaluates code generation by executing it against specific test cases. What sets DS-1000 apart is its categorization of problems based on the libraries involved, which encompass Matplotlib, NumPy, Pandas, SciPy, Scikit-Learn, PyTorch, and TensorFlow. The benchmark assesses the performance of base models in the code completion setting, and we provide pass@1 results for each library, as well as an overall score. The results of the DS-1000 benchmark are shown in Table 4. As can be seen from the table, the DeepSeek-Coder model achieves relatively high accuracy in all libraries, demonstrating that our model is not only capable of generating good code but also of using libraries more accurately in real data science workflows.
| Model | Size | Matplotlib | Numpy | Pandas | Pytorch | Scipy | Scikit-Learn | Tensorflow | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CodeGeeX2 | 6B | 38.7% | 26.8% | 14.4% | 11.8% | 19.8% | 27.0% | 17.8% | 22.9% |
| StarCoder-Base | 16B | 43.2% | 29.1% | 11.0% | 20.6% | 23.6% | 32.2% | 15.6% | 24.6% |
| CodeLlama-Base | 7B | 41.9% | 24.6% | 14.8% | 16.2% | 18.9% | 17.4% | 17.8% | 22.1% |
| CodeLlama-Base | 13B | 46.5% | 28.6% | 18.2% | 19.1% | 18.9% | 27.8% | 33.3% | 26.8% |
| CodeLlama-Base | 34B | 50.3% | 42.7% | 23.0% | 25.0% | 28.3% | 33.9% | 40.0% | 34.3% |
| DeepSeek-Coder-Base | 1.3B | 32.3% | 21.4% | 9.3% | 8.8% | 8.5% | 16.5% | 8.9% | 16.2% |
| DeepSeek-Coder-Base | 6.7B | 48.4% | 35.5% | 20.6% | 19.1% | 22.6% | 38.3% | 24.4% | 30.5% |
| DeepSeek-Coder-Base | 33B | 56.1% | 49.6% | 25.8% | 36.8% | 36.8% | 40.0% | 46.7% | 40.2% |

Table 4 | Performance of different approaches on the DS-1000 tasks.
LeetCode Contest Benchmark. To further validate the model's capability in real-world programming problems, we construct the LeetCode Contest benchmark. LeetCode presents competition-level problems, offering significant challenges that test the model's problem understanding and code generation skills. We collected the latest problems from LeetCode Contests to prevent the appearance of both the problems and their solutions in our pre-training data. A total of 180 problems were collected from July 2023 to January 2024. For each problem, we collected 100 test cases to ensure test coverage. We use the template "{problem_description}\nPlease complete the code below to solve the above problem:\n```python\n{code_template}\n```" to build the instruction prompt.
The evaluation results are shown in Table 5. In our evaluation, the DeepSeek-Coder models demonstrate remarkable performance over current open-source coding models. Specifically, DeepSeek-Coder-Instruct 6.7B and 33B achieve Pass@1 scores of 19.4% and 27.8% respectively on this benchmark. This performance notably surpasses existing open-source models such as CodeLlama-33B. DeepSeek-Coder-Instruct 33B is the only open-sourced model that outperforms OpenAI's GPT-3.5-Turbo in this task. However, there remains a substantial performance gap when compared to the more advanced GPT-4-Turbo.

Our analysis indicates that the implementation of Chain-of-Thought (CoT) prompting notably enhances the capabilities of DeepSeek-Coder-Instruct models. This improvement becomes particularly evident in the more challenging subsets of tasks. By adding the directive "You need first to write a step-by-step outline and then write the code." following the initial prompt, we have observed enhancements in performance. This observation leads us to believe that the process of first crafting detailed code descriptions assists the model in more effectively understanding and addressing the intricacies of logic and dependencies in coding tasks, particularly those of higher complexity. Therefore, we strongly recommend employing CoT prompting strategies when utilizing DeepSeek-Coder-Instruct models for complex coding challenges. Such an approach promotes a more methodical and logical framework for problem-solving, potentially resulting in more precise and efficient outcomes in code generation tasks.

We have published this benchmark.

Table 5 | Performance of different models on the LeetCode Contest Benchmark.

It is important to acknowledge that despite our diligent efforts to gather the most recent code questions for model testing, the possibility of data contamination cannot be entirely ruled out. We observed that the GPT-4-Turbo and DeepSeek-Coder models achieved higher scores in the LeetCode Contests held in July and August. We encourage the research community to consider the potential issue of data contamination when evaluating models in future studies using our released LeetCode data.
4.2. Fill-in-the-Middle Code Completion

DeepSeek-Coder models are trained with a 0.5 FIM (Fill-In-the-Middle) rate during their pre-training phase. This specialized training strategy empowers the model to proficiently generate code by filling in blanks based on the surrounding context, both prefix and suffix, of the given code snippet. This capability is particularly advantageous in the realm of code completion tools. Several open-source models have emerged with similar capabilities. Notable among these are SantaCoder (Allal et al., 2023), StarCoder (Li et al., 2023), and CodeLlama (Roziere et al., 2023). These models have set a precedent in the field of code generation and completion. In evaluating the performance of DeepSeek-Coder models, we conducted a comparative analysis with the aforementioned models. The benchmark for this comparison was the Single-Line Infilling benchmarks, encompassing three different programming languages, as proposed by Allal et al. (2023). This benchmark uses the line exact match accuracy as the evaluation metric.
| Model | Size | python | java | javascript | Mean |
| --- | --- | --- | --- | --- | --- |
| SantaCoder | 1.1B | 44.0% | 62.0% | 74.0% | 69.0% |
| StarCoder | 16B | 62.0% | 73.0% | 74.0% | 69.7% |
| CodeLlama-Base | 7B | 67.6% | 74.3% | 80.2% | 69.7% |
| CodeLlama-Base | 13B | 68.3% | 77.6% | 80.7% | 75.5% |
| DeepSeek-Coder-Base | 1B | 57.4% | 82.2% | 71.7% | 70.4% |
| DeepSeek-Coder-Base | 7B | 66.6% | 88.1% | 79.7% | 80.7% |
| DeepSeek-Coder-Base | 33B | 65.4% | 86.6% | 82.5% | 81.2% |

Table 6 | Performance of different approaches on the FIM tasks.
The evaluation results are shown in Table 6. Despite being the smallest model, with a capacity of 1.3 billion parameters, DeepSeek-Coder outperforms its larger counterparts, StarCoder and CodeLlama, in these benchmarks. This superior performance can be attributed to the high quality of the pre-trained data utilized by DeepSeek-Coder. Furthermore, a notable trend observed is the correlation between the size of the model and its performance. As the model size increases, there is a corresponding enhancement in performance. This trend underscores the importance of model capacity in achieving higher accuracy in code completion tasks. Based on these findings, we recommend the deployment of the DeepSeek-Coder-Base 6.7B model in code completion tools. This recommendation is grounded in the model's demonstrated balance between efficiency and accuracy. The DeepSeek-Coder-Base 6.7B model, with its substantial parameter size, has proven to be highly effective in the context of code completion, making it an ideal choice for integrating advanced computational capabilities into coding environments.
4.3. Cross-File Code Completion

In this section, we evaluate the performance of existing open-source models on cross-file code completion tasks. Unlike code generation discussed in the previous section, cross-file code completion requires the model to access and understand repositories that span multiple files with numerous cross-file dependencies. We use CrossCodeEval (Ding et al., 2023) to evaluate the capabilities of currently available open-source code models of 7B scale on cross-file completion tasks. This dataset is constructed on a diverse set of real-world, open-sourced, permissively licensed repositories in four popular programming languages: Python, Java, TypeScript, and C#. The dataset is specifically designed to strictly require cross-file context for accurate completion. Notably, this dataset was constructed from repositories created between March and June 2023, while our pre-training data only includes code created before February 2023, which ensures that this dataset was not present in our pre-training data, thus avoiding data leakage.
| Model | Size | Python EM | Python ES | Java EM | Java ES | TypeScript EM | TypeScript ES | C# EM | C# ES |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CodeGeex2 | 6B | 8.11% | 59.55% | 7.34% | 59.60% | 6.14% | 55.50% | 1.70% | 51.66% |
| +Retrieval | | 10.73% | 61.76% | 10.10% | 59.56% | 7.72% | 55.17% | 4.64% | 52.30% |
| StarCoder-Base | 7B | 6.68% | 59.55% | 8.65% | 62.57% | 5.01% | 48.83% | 4.75% | 59.53% |
| +Retrieval | | 13.06% | 64.24% | 15.61% | 64.78% | 7.54% | 42.06% | 14.20% | 65.03% |
| CodeLlama-Base | 7B | 7.32% | 59.66% | 9.68% | 62.64% | 8.19% | 58.50% | 4.07% | 59.19% |
| +Retrieval | | 13.02% | 64.30% | 16.41% | 64.64% | 12.34% | 60.64% | 13.19% | 63.04% |
| DeepSeek-Coder-Base | 6.7B | 9.53% | 61.65% | 10.80% | 61.77% | 9.59% | 60.17% | 5.26% | 61.32% |
| +Retrieval | | 16.14% | 66.51% | 17.72% | 63.18% | 14.03% | 61.77% | 16.23% | 63.42% |
| +Retrieval w/o Repo Pre-training | | 16.02% | 66.65% | 16.64% | 61.88% | 13.23% | 60.92% | 14.48% | 62.38% |

Table 7 | Performance of different models on cross-file code completion.
In our evaluation of various models, we set the maximum sequence length to 2048 tokens, the maximum output length to 50 tokens, and a limit of 512 tokens for the cross-file context. For the cross-file context, we utilize the official BM25 search results provided by Ding et al. (2023). Evaluation metrics include exact match and edit similarity. The results, presented in Table 7, demonstrate that DeepSeek-Coder consistently outperforms other models in cross-file completion tasks across multiple languages, showcasing its superior practical application capabilities. When only utilizing the file-level code corpus (w/o Repo Pre-training) to pre-train DeepSeek-Coder, we observe a decrease in performance in the Java, TypeScript, and C# languages, indicating the effectiveness of the repository-level pre-training.
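For readers unfamiliar with the two metrics used above, the sketch below gives minimal reference implementations: exact match after whitespace normalization, and a character-level similarity ratio as a simple stand-in for edit similarity. CrossCodeEval's official scoring may normalize and compute these differently, so this is an illustration rather than the evaluation code.

```python
from difflib import SequenceMatcher

def exact_match(prediction: str, reference: str) -> bool:
    """Whitespace-normalized exact match between completion and ground truth."""
    return " ".join(prediction.split()) == " ".join(reference.split())

def edit_similarity(prediction: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; higher means closer to the reference."""
    return SequenceMatcher(None, prediction, reference).ratio()

print(exact_match("return x + 1", "return x+1"))        # False: '+' spacing differs
print(edit_similarity("return x + 1", "return x+1"))    # high, close to 1.0
```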
4.4. Program-based Math Reasoning

Program-based math reasoning involves evaluating a model's ability to understand and solve mathematical problems through programming. This type of reasoning is critical in fields such as data analysis and scientific computing. To conduct this assessment, we utilize the Program-Aided Math Reasoning (PAL) method as outlined in Gao et al. (2023). This approach is applied across seven distinct benchmarks, each offering unique challenges and contexts. These benchmarks include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), GSM-Hard (Gao et al., 2023), SVAMP (Patel et al., 2021), TabMWP (Lu et al., 2022), ASDiv (Miao et al., 2020) and MAWPS (Gou et al., 2023). In each of these benchmarks, the model is prompted to alternately describe a solution step in natural language and then execute that step with code. As seen in Table 8, DeepSeek-Coder models achieve remarkable performance across all benchmarks, especially the 33B variant, which demonstrates the potential of using such models in applications that require complex mathematical computations and problem-solving abilities.
| Model | Size | GSM8k | MATH | GSM-Hard | SVAMP | TabMWP | ASDiv | MAWPS | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Multilingual Base Models | | | | | | | | | |
| CodeGeex-2 | 7B | 22.2% | 9.7% | 23.6% | 39.0% | 44.6% | 48.5% | 66.0% | 36.2% |
| StarCoder-Base | 16B | 23.4% | 10.3% | 23.0% | 42.4% | 45.0% | 54.9% | 81.1% | 40.0% |
| CodeLlama-Base | 7B | 31.2% | 12.1% | 30.2% | 54.2% | 52.9% | 59.6% | 82.6% | 46.1% |
| CodeLlama-Base | 13B | 43.1% | 14.4% | 40.2% | 59.2% | 60.3% | 63.6% | 85.3% | 52.3% |
| CodeLlama-Base | 34B | 58.2% | 21.2% | 51.8% | 70.3% | 69.8% | 70.7% | 91.8% | 62.0% |
| DeepSeek-Coder-Base | 1.3B | 14.6% | 16.8% | 14.5% | 36.7% | 30.0% | 48.2% | 62.3% | 31.9% |
| DeepSeek-Coder-Base | 6.7B | 43.2% | 19.2% | 40.3% | 58.4% | 67.9% | 67.2% | 87.0% | 54.7% |
| DeepSeek-Coder-Base | 33B | 60.7% | 29.1% | 54.1% | 71.6% | 75.3% | 76.7% | 93.3% | 65.8% |

Table 8 | Performance of different approaches on the program-aided math reasoning tasks.
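In the PAL setting, the model writes a short program whose execution yields the numeric answer, rather than producing the answer directly. A minimal sketch of that loop is shown below; the prompt wording, the solution() convention, and the direct use of exec are illustrative simplifications of Gao et al.'s method, not the exact evaluation harness.

```python
def run_pal(model_generate, question: str) -> str:
    """Ask the model for a Python solution, execute it, and return the answer."""
    prompt = (
        "Solve the math word problem by writing a Python function solution() "
        "that returns the final answer. Reason step by step in comments.\n\n"
        f"Question: {question}\n"
    )
    program = model_generate(prompt)       # e.g. a call into DeepSeek-Coder
    namespace = {}
    exec(program, namespace)               # execute the generated program
    return str(namespace["solution"]())

# A toy stand-in for the model, showing the expected shape of its output.
fake_model = lambda _: "def solution():\n    # 3 apples + 4 apples\n    return 3 + 4\n"
print(run_pal(fake_model, "Tom has 3 apples and buys 4 more. How many does he have?"))  # 7
```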
5. Continue Pre-Training From General LLM

To further enhance the natural language understanding and mathematical reasoning abilities of the DeepSeek-Coder model, we perform additional pre-training from the general language model DeepSeek-LLM-7B Base (DeepSeek-AI, 2024) on 2 trillion tokens, resulting in DeepSeek-Coder-v1.5 7B. For this pre-training, we specifically use the data sources listed in Table 9. Unlike DeepSeek-Coder, DeepSeek-Coder-v1.5 employs solely a next token prediction objective with a 4K context length during its pre-training phase.

| Data Source | Percentage |
| --- | --- |
| Source Code | 70% |
| Markdown and StackExchange | 10% |
| Natural language related to code | 7% |
| Natural language related to math | 7% |
| Bilingual (Chinese-English) natural language | 6% |

Table 9 | Data sources for DeepSeek-Coder-v1.5 7B pre-training.
We conduct a comparison between DeepSeek-Coder-v1.5 7B and DeepSeek-Coder 6.7B, and re-run all benchmarks using our evaluation pipeline to ensure a fair comparison. We evaluate performance across a wide range of tasks, which can be categorized as follows:

- Programming: This category includes evaluations in a multilingual setting using the HumanEval dataset by Chen et al. (2021), as well as evaluations in a Python setting using the MBPP dataset by Austin et al. (2021).
- Math Reasoning: We assess performance on math reasoning tasks using the GSM8K benchmark (Cobbe et al., 2021) and the MATH benchmark (Hendrycks et al., 2021). These tasks involve solving math problems by generating programs.
- Natural Language: Our evaluation in natural language tasks includes the MMLU (Hendrycks et al., 2020), BBH (Suzgun et al., 2022), HellaSwag (Zellers et al., 2019), Winogrande (Sakaguchi et al., 2021), and ARC-Challenge (Clark et al., 2018) benchmarks.
The results for the Base and Instruct models are presented in Table 10. It is observed that the DeepSeek-Coder-Base-v1.5 model, despite a slight decrease in coding performance, shows marked improvements across most tasks when compared to the DeepSeek-Coder-Base model. In particular, in the Math Reasoning and Natural Language categories, DeepSeek-Coder-Base-v1.5 significantly outperforms its predecessor across all benchmarks, demonstrating substantial gains in mathematical reasoning and natural language processing capabilities.

| Models | Size | HumanEval | MBPP | GSM8K | MATH | MMLU | BBH | HellaSwag | WinoG | ARC-C |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-Coder-Base | 6.7B | 44.7% | 60.6% | 43.2% | 19.2% | 36.6% | 44.3% | 53.8% | 57.1% | 32.5% |
| DeepSeek-Coder-Base-v1.5 | 6.9B | 43.2% | 60.4% | 62.4% | 24.7% | 49.1% | 55.2% | 69.9% | 63.8% | 47.2% |
| DeepSeek-Coder-Instruct | 6.7B | 66.1% | 65.4% | 62.8% | 28.6% | 37.2% | 46.9% | 55.0% | 57.6% | 37.4% |
| DeepSeek-Coder-Instruct-v1.5 | 6.9B | 64.1% | 64.6% | 72.6% | 34.1% | 49.5% | 53.3% | 72.2% | 63.4% | 48.1% |

Table 10 | Comparative analysis of performance between DeepSeek-Coder-Base and DeepSeek-Coder-Base-v1.5. Math tasks are solved through programming.
6. Conclusion

In this technical report, we introduce a series of specialized Large Language Models (LLMs) for coding, named DeepSeek-Coder, available in three distinct scales: 1.3B, 6.7B, and 33B parameters. These models are uniquely trained on a meticulously curated project-level code corpus, utilizing a fill-in-the-blank pre-training objective to enhance code infilling capabilities. A significant advancement is the extension of the model's context window to 16,384 tokens, thereby greatly improving its effectiveness in handling extensive code generation tasks. Our evaluations reveal that the most advanced model in our series, DeepSeek-Coder-Base 33B, surpasses existing open-source code models across a variety of standard tests. Impressively, the DeepSeek-Coder-Base 6.7B model, despite its smaller scale, delivers performance on par with the 34B parameter CodeLlama, a testament to the high quality of our pretraining corpus.

To augment the zero-shot instruction capabilities of the DeepSeek-Coder-Base models, we have fine-tuned them with high-quality instructional data. This has led to the DeepSeek-Coder-Instruct 33B model outperforming OpenAI's GPT-3.5 Turbo in a range of coding-related tasks, showcasing its exceptional proficiency in code generation and understanding.

To further improve the natural language understanding capabilities of the DeepSeek-Coder-Base models, we have conducted additional pretraining based on the DeepSeek-LLM 7B checkpoint. This additional training involved processing a diverse dataset comprising 2 trillion tokens, including natural language, code, and mathematical data. The result is the creation of a new and improved code model, DeepSeek-Coder-v1.5. Our observations indicate that DeepSeek-Coder-v1.5 not only maintains its predecessor's high-level coding performance but also exhibits enhanced natural language comprehension. This advancement underscores our belief that the most effective code-focused Large Language Models (LLMs) are those built upon robust general LLMs. The reason is evident: to effectively interpret and execute coding tasks, these models must also possess a deep understanding of human instructions, which often come in various forms of natural language. Looking ahead, our commitment is to develop and openly share even more powerful code-focused LLMs based on larger-scale general LLMs.

Acknowledgements

We would like to express our gratitude to Bo Liu, Chengqi Deng, Chong Ruan, Damai Dai, Jiashi Li, Kang Guan, Mingchuan Zhang, Panpan Huang, Shuiping Yu, Shirong Ma, Yaofeng Sun, Yishi Piao, Zhihong Shao, and Zhewen Hao for their invaluable discussions and assistance during training DeepSeek-Coder models.
References

L. B. Allal, R. Li, D. Kocetkov, C. Mou, C. Akiki, C. M. Ferrandis, N. Muennighoff, M. Mishra, A. Gu, M. Dey, et al. SantaCoder: don't reach for the stars! arXiv preprint arXiv:2301.03988, 2023.

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program synthesis with large language models, 2021.

M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, J. Tworek, and M. Chen. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255, 2022.

F. Cassano, J. Gouwar, D. Nguyen, S. Nguyen, L. Phipps-Costin, D. Pinckney, M.-H. Yee, Y. Zi, C. J. Anderson, M. Q. Feldman, et al. MultiPL-E: a scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering, 2023.

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

S. Chen, S. Wong, L. Chen, and Y. Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

T. Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning, 2023.

DeepSeek-AI. DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.

Y. Ding, Z. Wang, W. U. Ahmad, H. Ding, M. Tan, N. Jain, M. K. Ramanathan, R. Nallapati, P. Bhatia, D. Roth, et al. CrossCodeEval: A diverse and multilingual benchmark for cross-file code completion. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.

Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320-335, 2022.

D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, W.-t. Yih, L. Zettlemoyer, and M. Lewis. InCoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999, 2022.

L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig. PAL: Program-aided language models. In International Conference on Machine Learning, pages 10764-10799. PMLR, 2023.

Gemini Team. Gemini: A family of highly capable multimodal models, 2023. URL https://goo.gle/GeminiPaper.

Z. Gou, Z. Shao, Y. Gong, Y. Yang, M. Huang, N. Duan, W. Chen, et al. ToRA: A tool-integrated reasoning agent for mathematical problem solving. arXiv preprint arXiv:2309.17452, 2023.

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020.

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.

High-Flyer. HAI-LLM: An efficient and lightweight tool for training large models, 2023.

kaiokendev. Things I'm learning while training SuperHOT. https://kaiokendev.github.io/til#extending-context-to-8k, 2023.

D. Kocetkov, R. Li, L. Jia, C. Mou, Y. Jernite, M. Mitchell, C. M. Ferrandis, S. Hughes, T. Wolf, D. Bahdanau, et al. The Stack: 3 TB of permissively licensed source code. Transactions on Machine Learning Research, 2022.

V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023.

Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettlemoyer, W.-t. Yih, D. Fried, S. Wang, and T. Yu. DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pages 18319-18345. PMLR, 2023.

K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, and N. Carlini. Deduplicating training data makes language models better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8424-8445, 2022.

R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, et al. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.

I. Loshchilov and F. Hutter. Decoupled weight decay regularization, 2019.

P. Lu, L. Qiu, K.-W. Chang, Y. N. Wu, S.-C. Zhu, T. Rajpurohit, P. Clark, and A. Kalyan. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. In The Eleventh International Conference on Learning Representations, 2022.

S.-Y. Miao, C.-C. Liang, and K.-Y. Su. A diverse corpus for evaluating and developing English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975-984, 2020.

D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1-15, 2019.

E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.

E. Nijkamp, H. Hayashi, C. Xiong, S. Savarese, and Y. Zhou. CodeGen2: Lessons for training LLMs on programming and natural languages, 2023.

OpenAI. GPT-4 technical report, 2023.

A. Patel, S. Bhattamishra, and N. Goyal. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080-2094, 2021.

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023.

S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-16. IEEE, 2020.

B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.

K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99-106, 2021.

R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.

J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu. RoFormer: Enhanced transformer with rotary position embedding, 2023.

M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.

R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model, 2023.

H. Touvron, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Y. Wang, W. Wang, S. Joty, and S. C. Hoi. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv preprint arXiv:2109.00859, 2021.

R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791-4800, 2019.

Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, L. Shen, Z. Wang, A. Wang, Y. Li, et al. CodeGeeX: A pre-trained model for code generation with multilingual benchmarking on HumanEval-X. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5673-5684, 2023.
A. Cases of Chatting with DeepSeek-Coder-Instruct

We present two cases of interactions with DeepSeek-Coder-Instruct: one involving a multi-turn conversation about creating a database and performing data analysis, and the other centered around using the model to solve a sample problem from LeetCode.

In the first scenario, depicted in Figure 5, we instruct the model to build a student database using Python and randomly insert 10 pieces of information. Subsequently, in the second round of the conversation, we continue to ask the model to analyze the age distribution of the students. From Figure 5, it is evident that the model can generate bug-free and comprehensive code, accompanied by explanatory details. In the second scenario, as illustrated in Figure 6, we further assess the model's capabilities by testing it on an out-of-domain LeetCode contest problem. This particular problem was released in November 2023, after our data collection, and thus is not part of our model's training data. The results show that our model excels at solving problems that extend beyond its training distribution.
Q1: Build a database in Python that includes various pieces of information about students, insert ten random records and display them.

Sure, here is a simple example of how you can do this using SQLite in Python.

```python
import sqlite3
import random

# Connect to SQLite database
conn = sqlite3.connect('students.db')

# Create a cursor object
cur = conn.cursor()

# Create table
cur.execute('CREATE TABLE IF NOT EXISTS students (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, grade TEXT)')

# ... (insertion of ten random records shortened in the figure) ...

# Display all records
cur.execute('SELECT * FROM students')
```

This script will create a SQLite database named students.db and a table named students with four columns: id, name, age, and grade. It will then insert 10 random records into the table and display all records. [Run Code]

Q2: Create a graph depicting the age distribution.

To create a graph depicting the age distribution of the students, you can use the matplotlib library in Python. Here's how you can do it:

```python
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

# Connect to SQLite database
conn = sqlite3.connect('students.db')

# Read the data from the database into a pandas DataFrame
df = pd.read_sql_query('SELECT * from students', conn)

# Create a histogram of the age distribution
plt.hist(df['age'], bins=range(10, 30, 2), edgecolor='black')
plt.title('Age Distribution of Students')
plt.xlabel('Age')
plt.ylabel('Number of Students')
plt.show()

# Close the connection
conn.close()
```

This script will connect to the SQLite database, read the data into a pandas DataFrame, then create a histogram of the age distribution. The histogram will show the number of students in each age range. [Run Code]

Figure 5 | An example of building a database and data analysis.
Q: There are n teams numbered from 0 to n-1 in a tournament; each team is also a node in a DAG. You are given the integer n and a 0-indexed 2D integer array edges of length m representing the DAG, where edges[i] = [ui, vi] indicates that there is a directed edge from team ui to team vi in the graph. A directed edge from a to b in the graph means that team a is stronger than team b and team b is weaker than team a. Team a will be the champion of the tournament if there is no team b that is stronger than team a. Return the team that will be the champion of the tournament if there is a unique champion; otherwise, return -1.

Notes: A cycle is a series of nodes a1, a2, ..., an, an+1 such that node a1 is the same node as node an+1, the nodes a1, a2, ..., an are distinct, and there is a directed edge from node ai to node ai+1 for every i in the range [1, n]. A DAG is a directed graph that does not have any cycle.

Please follow the template below to solve the problem:

```python
class Solution:
    def findChampion(self, n: int, edges: List[List[int]]) -> int:
```

To solve this problem, we can use a topological sorting algorithm.

```python
from typing import List

class Solution:
    def findChampion(self, n: int, edges: List[List[int]]) -> int:
        # Create an adjacency list to represent the DAG
        adj_list = {i: [] for i in range(n)}
        for u, v in edges:
            adj_list[u].append(v)

        # Initialize a list to keep track of the in-degree of each team
        in_degree = [0] * n

        # Calculate the in-degree for each team
        for u in adj_list:
            for v in adj_list[u]:
                in_degree[v] += 1

        # Collect the teams with no incoming edges
        no_incoming_edges = [i for i in range(n) if in_degree[i] == 0]

        # If there is more than one team with no incoming edges, there is no unique champion
        if len(no_incoming_edges) != 1:
            return -1

        # Return the team with no incoming edges, which is the champion
        return no_incoming_edges[0]
```

This code first creates an adjacency list to represent the DAG. [Submit Code]

Figure 6 | An example of solving a LeetCode problem.
B. Benchmark curves during training of DeepSeek-Coder-Base

In Figure 7, we present the benchmark curves illustrating the performance of DeepSeek-Coder-Base models during their training phase. For validation, a carefully curated subset of the training corpus was employed, consisting of 8,000 code files. This subset was deliberately chosen to ensure a diverse and representative sample, critical for an accurate assessment of the models' capabilities. The performance metrics of these models are specifically detailed in the final two sub-figures of Figure 7, offering a clear visual representation of their efficacy throughout the training process.

Figure 7 | Benchmark curves during training of DeepSeek-Coder-Base (1.3B, 6.7B, and 33B). The sub-figures plot HumanEval-Pass@1, HumanEval-cpp-Pass@1, HumanEval-java-Pass@1, HumanEvalFIM-EM, MBPP-Pass@1, MeanHumanEval, Validation-Completion-EM, and Validation-CompletionFIM-EM against training tokens (billions).