DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence

Daya Guo*1, Qihao Zhu1,2, Dejian Yang1, Zhenda Xie1, Kai Dong1, Wentao Zhang1, Guanting Chen1, Xiao Bi1, Y. Wu1, Y.K. Li1, Fuli Luo1, Yingfei Xiong2, Wenfeng Liang1
1 DeepSeek-AI
2 Key Lab of HCST (PKU), MOE; SCS, Peking University

Abstract

The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.

Figure 1 | The Performance of DeepSeek-Coder.

*Core contributors, ordered alphabetically by name.

arXiv:2401.14196v2 [cs.SE] 26 Jan 2024

1. Introduction

The field of software development has been significantly transformed by the swift advancement of large language models (OpenAI, 2023; Touvron et al., 2023), which have brought about a new era of code intelligence.
These models have the potential to automate and streamline many aspects of coding, from bug detection to code generation, thereby enhancing productivity and reducing the likelihood of human error. However, a major challenge in this field is the performance gap between open-source models (Li et al., 2023; Nijkamp et al., 2022; Roziere et al., 2023; Wang et al., 2021) and closed-source models (Gemini Team, 2023; OpenAI, 2023). The giant closed-source models, while powerful, are often inaccessible to many researchers and developers due to their proprietary nature.

In response to this challenge, we present the DeepSeek-Coder series. This series comprises a range of open-source code models, varying in size from 1.3B to 33B, including the base version and instructed version for each size. Each model in the series has been trained from scratch on 2 trillion tokens sourced from 87 programming languages, ensuring a comprehensive understanding of coding languages and syntax. Besides, we attempt to organize the pre-training data at the repository level to enhance the pre-trained model's understanding capability within the context of cross-file dependencies within a repository. In addition to employing the next token prediction loss during pre-training, we have also incorporated the Fill-In-Middle (FIM) approach (Bavarian et al., 2022; Li et al., 2023). This approach is designed to further bolster the model's code completion capabilities. To meet the requirements of handling longer code inputs, we have extended the context length to 16K. This adjustment allows our models to handle more complex and extensive coding tasks, thereby increasing their versatility and applicability in various coding scenarios.

We have carried out comprehensive experiments using a variety of public code-related benchmarks. The findings reveal that among open-source models, DeepSeek-Coder-Base 33B consistently delivers superior performance across all benchmarks. Furthermore, DeepSeek-Coder-Instruct 33B surpasses OpenAI GPT-3.5 Turbo in the majority of the evaluation benchmarks, significantly narrowing the performance gap between OpenAI GPT-4 and open-source models. Remarkably, despite having fewer parameters, DeepSeek-Coder-Base 7B demonstrates competitive performance when compared to models that are five times larger, such as CodeLlama-33B (Roziere et al., 2023). To summarize, our main contributions are:
- We introduce DeepSeek-Coder-Base and DeepSeek-Coder-Instruct, our advanced code-focused large language models (LLMs). Developed through extensive training on an expansive code corpus, these models exhibit proficiency in understanding 87 programming languages. Additionally, they are available in various model scales to cater to a wide range of computational and application needs.
- We make the first attempt to incorporate repository-level data construction during the pre-training phase of our models. We find that it can significantly boost the capability of cross-file code generation.
- Our analysis rigorously examines the impact of FIM training strategies on the pretraining phase of code models. The outcomes of these comprehensive studies shed light on intriguing aspects of FIM configurations, offering valuable insights that significantly contribute to the enhancement and development of code pretrained models.
- We conduct extensive evaluations of our code LLMs against a wide array of benchmarks encompassing numerous code-related tasks. The findings demonstrate that DeepSeek-Coder-Base surpasses all existing open-source code LLMs across these benchmarks. Furthermore, with meticulous fine-tuning using instructional data, DeepSeek-Coder-Instruct achieves better performance compared to the OpenAI GPT-3.5 Turbo model in code-related tasks.
2. Data Collection

The training dataset of DeepSeek-Coder is composed of 87% source code, 10% English code-related natural language corpus, and 3% code-unrelated Chinese natural language corpus. The English corpus consists of materials from GitHub's Markdown and StackExchange, which are used to enhance the model's understanding of code-related concepts and improve its ability to handle tasks like library usage and bug fixing. Meanwhile, the Chinese corpus consists of high-quality articles aimed at improving the model's proficiency in understanding the Chinese language. In this section, we provide an overview of how we construct the code training data. This process involves data crawling, rule-based filtering, dependency parsing, repository-level deduplication, and quality screening, as illustrated in Figure 2. In the following, we describe the data creation procedure step by step.

Figure 2 | The Procedure of Dataset Creation: Data Crawling -> Rule Filtering -> Dependency Parsing -> Repo-level Deduplication -> Quality Screening.
2.1. GitHub Data Crawling and Filtering

We collect public repositories created before February 2023 on GitHub and retain only 87 programming languages, as listed in Table 1. To reduce the amount of data to be processed, we apply filtering rules similar to those used in the StarCoder project (Li et al., 2023) to preliminarily filter out lower-quality code. By applying these filtering rules, we reduce the total amount of data to only 32.8% of its original size. To make the paper self-contained, we briefly describe the filter rules used in the StarCoder Data project:

Firstly, we filter out files with an average line length exceeding 100 characters or a maximum line length surpassing 1000 characters. Additionally, we remove files with fewer than 25% alphabetic characters. Except for the XSLT programming language, we further filter out files where the string "<?xml version=" appears in the first 100 characters. For HTML files, we consider the ratio of visible text to HTML code. We retain files where the visible text constitutes at least 20% of the code and is no less than 100 characters. For JSON and YAML files, which typically contain more data, we only keep files that have a character count ranging from 50 to 5000 characters. This effectively removes most data-heavy files.
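These rules map directly onto simple per-file checks. The following is a minimal sketch of how such a filter might be applied; the thresholds come from the description above, while the function name and the plain-text file interface are illustrative assumptions rather than the project's actual tooling (the HTML visible-text rule is omitted because it requires an HTML parser).

```python
def passes_basic_filters(path: str, text: str) -> bool:
    """Approximate the StarCoder-style filtering rules from Section 2.1."""
    lines = text.splitlines() or [""]
    avg_len = sum(len(l) for l in lines) / len(lines)
    max_len = max(len(l) for l in lines)
    if avg_len > 100 or max_len > 1000:
        return False                      # overly long lines suggest generated or minified code
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_ratio < 0.25:
        return False                      # mostly non-alphabetic content (data blobs, encodings)
    if not path.endswith(".xslt") and "<?xml version=" in text[:100]:
        return False                      # XML data files, except the XSLT language itself
    if path.endswith((".json", ".yaml", ".yml")) and not (50 <= len(text) <= 5000):
        return False                      # drop data-heavy JSON/YAML files
    return True
```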
2.2. Dependency Parsing

In previous works (Chen et al., 2021; Li et al., 2023; Nijkamp et al., 2022; Roziere et al., 2023), large language models for code are mainly pre-trained on file-level source code, which ignores the dependencies between different files in a project. However, in practical applications, such models struggle to effectively scale to handle entire project-level code scenarios. Therefore, in this step we consider how to leverage the dependencies between files within the same repository. Specifically, we first parse the dependencies between files and then arrange these files in an order that ensures the context each file relies on is placed before that file in the input sequence. By aligning the files in accordance with their dependencies, our dataset more accurately represents real coding practices and structures. This enhanced alignment not only makes our dataset more relevant but also potentially increases the practicality and applicability of the model in handling project-level code scenarios. It is worth noting that we only consider the invocation relationships between files and use regular expressions to extract them, such as "import" in Python, "using" in C#, and "include" in C.

Algorithm 1 Topological Sort for Dependency Analysis

```
procedure TOPOLOGICALSORT(files)
    graphs   <- empty adjacency list
    inDegree <- empty dictionary of in-degrees
    for each file in files do
        graphs[file]   <- []
        inDegree[file] <- 0
    end for
    for each fileA in files do
        for each fileB in files do
            if HASDEPENDENCY(fileA, fileB) then      // fileA depends on fileB
                graphs[fileB].append(fileA)          // add edge from B to A
                inDegree[fileA] <- inDegree[fileA] + 1   // increment in-degree of A
            end if
        end for
    end for
    subgraphs <- getDisconnectedSubgraphs(graphs)    // identify disconnected subgraphs
    allResults <- []
    for each subgraph in subgraphs do
        results <- []
        while length(results) < NumberOfNodes(subgraph) do
            node <- argmin(inDegree[v] for v in subgraph with v not in results)
            for each successor in graphs[node] do
                inDegree[successor] <- inDegree[successor] - 1
            end for
            results.append(node)
        end while
        allResults.append(results)
    end for
    return allResults
end procedure
```

Algorithm 1 describes a topological sort for dependency analysis on a list of files within the same project. Initially, it sets up two data structures: an empty adjacency list named graphs to represent dependencies between files and an empty dictionary called inDegree for storing the in-degrees of each file. The algorithm then iterates over each file pair to identify dependencies, updating graphs and inDegree accordingly. Next, it identifies any disconnected subgraphs within the overall dependency graph. For each subgraph, the algorithm employs a modified topological sort. Unlike the standard approach that selects nodes with zero in-degree, this algorithm selects nodes with minimal in-degree, which allows it to handle cycles within the graph. Selected nodes are added to a results list, and the in-degrees of their connected nodes are decreased. This process continues until a topologically sorted sequence is generated for each subgraph. The algorithm concludes by returning a list of these sorted sequences, and each sequence's files are concatenated to form a single training sample. To incorporate file path information, a comment indicating the file's path is added at the beginning of each file. This method ensures that the path information is preserved in the training data.
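A compact executable version of this modified topological sort is sketched below. It follows the prose description above (minimal in-degree selection so that cycles do not block progress); the dependency-extraction regex, helper names, and module-resolution logic are illustrative assumptions, not the exact ones used in the data pipeline.

```python
import re
from collections import defaultdict

# Illustrative dependency pattern for Python files only; the real pipeline
# uses per-language regexes (e.g. "using" for C#, "include" for C).
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([\w\.]+)", re.MULTILINE)

def topo_order(files: dict) -> list:
    """Order files so that (most) dependencies appear before their dependents."""
    graph, in_degree = defaultdict(list), {f: 0 for f in files}
    modules = {f.rsplit("/", 1)[-1].removesuffix(".py"): f for f in files}
    for f, text in files.items():
        for m in IMPORT_RE.findall(text):
            dep = modules.get(m.split(".")[0])
            if dep and dep != f:
                graph[dep].append(f)      # edge: dependency -> dependent
                in_degree[f] += 1
    ordered, remaining = [], set(files)
    while remaining:
        # pick the remaining file with minimal in-degree (handles cycles)
        nxt = min(remaining, key=lambda f: in_degree[f])
        remaining.remove(nxt)
        ordered.append(nxt)
        for succ in graph[nxt]:
            in_degree[succ] -= 1
    return ordered

# Example: concatenate one repository into a single training sample,
# prefixing each file with a path comment as described above.
repo = {"pkg/utils.py": "def add(a, b): return a + b\n",
        "pkg/main.py": "from utils import add\nprint(add(1, 2))\n"}
sample = "\n".join(f"# {path}\n{repo[path]}" for path in topo_order(repo))
```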
2.3. Repo-Level Deduplication

Recent studies have demonstrated the significant performance improvements that can be achieved by deduplicating training datasets for Large Language Models (LLMs). Lee et al. (2022) have shown that language model training corpora often contain numerous near-duplicates, and the performance of LLMs can be enhanced by removing long repetitive substrings. Kocetkov et al. (2022) have applied a near-deduplication method to training data, resulting in dramatic improvements, and they emphasize that near-deduplication is a crucial preprocessing step for achieving competitive performance on code benchmark tasks. In our dataset, we have also employed near-deduplication. However, there is a distinction in our approach compared to previous works. We perform deduplication at the repository level of code, rather than at the file level, as the latter approach may filter out certain files within a repository, potentially disrupting the structure of the repository. Specifically, we treat the concatenated code from the repository level as a single sample and apply the same near-deduplication algorithm to ensure the integrity of the repository structure.
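The only change relative to file-level pipelines is the unit of comparison: the concatenated repository, not the individual file. A minimal sketch of that idea is shown below, using plain Jaccard similarity over token shingles as a stand-in for the actual near-deduplication algorithm; the threshold and shingling scheme are assumptions, since the report does not specify them.

```python
def shingles(text: str, n: int = 5) -> set:
    """Token n-gram shingles used to compare two concatenated repositories."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}

def near_duplicates(a: str, b: str, threshold: float = 0.85) -> bool:
    """Jaccard similarity over shingles; large systems typically use MinHash/LSH instead."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / max(len(sa | sb), 1) >= threshold

def dedup_repos(repo_samples: list) -> list:
    """Keep one representative per group of near-duplicate repositories."""
    kept = []
    for sample in repo_samples:
        if not any(near_duplicates(sample, k) for k in kept):
            kept.append(sample)
    return kept
```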
2.4. Quality Screening and Decontamination

In addition to applying the filtering rules mentioned in Section 2.1, we also employ a compiler and a quality model, combined with heuristic rules, to further filter out low-quality data. This includes code with syntax errors, poor readability, and low modularity. We provide the statistical summary of source code in Table 1, which includes a total of 87 languages, detailing the disk size, number of files, and percentage for each language. The total data volume is 798 GB with 603 million files. To ensure that our code training data is not contaminated by information from the test set, which may be present on GitHub, we have implemented an n-gram filtering process. This process involves the removal of any code segments that match specific criteria. Specifically, we filter out files containing docstrings, questions, and solutions from sources such as HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021). For the filtering criteria, we apply the following rules: if a piece of code includes a 10-gram string identical to any in the test data, it is excluded from our training data. In cases where the test data comprises strings that are shorter than 10-grams but no less than 3-grams, we use an exact match approach for filtering.
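A small sketch of this decontamination rule is given below; it assumes whitespace tokenization and treats the two cases exactly as stated above (10-gram overlap for long test strings, exact match for strings of 3 to 9 tokens). Tokenization details in the real pipeline may differ.

```python
def ngrams(tokens: list, n: int) -> set:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_test_index(test_strings: list):
    """Index benchmark docstrings/questions/solutions for contamination checks."""
    long_ngrams, short_exact = set(), set()
    for s in test_strings:
        toks = s.split()
        if len(toks) >= 10:
            long_ngrams |= ngrams(toks, 10)
        elif len(toks) >= 3:
            short_exact.add(tuple(toks))
    return long_ngrams, short_exact

def is_contaminated(code: str, long_ngrams, short_exact) -> bool:
    toks = code.split()
    if ngrams(toks, 10) & long_ngrams:
        return True                       # shares a 10-gram with the test set
    grams = {tuple(toks[i:i + n]) for n in range(3, 10)
             for i in range(len(toks) - n + 1)}
    return bool(grams & short_exact)      # exact match against short test strings
```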
| Language | Size (GB) | Files (k) | Prop. (%) | Language | Size (GB) | Files (k) | Prop. (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ada | 0.91 | 126 | 0.11 | Literate Haskell | 0.16 | 20 | 0.02 |
| Agda | 0.26 | 59 | 0.03 | Lua | 0.82 | 138 | 0.10 |
| Alloy | 0.07 | 24 | 0.01 | Makefile | 0.92 | 460 | 0.12 |
| ANTLR | 0.19 | 38 | 0.02 | Maple | 0.03 | 6 | 0.00 |
| AppleScript | 0.03 | 17 | 0.00 | Mathematica | 0.82 | 10 | 0.10 |
| Assembly | 0.91 | 794 | 0.11 | MATLAB | 0.01 | 1 | 0.00 |
| Augeas | 0.00 | 1 | 0.00 | OCaml | 0.91 | 139 | 0.11 |
| AWK | 0.09 | 53 | 0.01 | Pascal | 0.79 | 470 | 0.10 |
| Batchfile | 0.92 | 859 | 0.12 | Perl | 0.81 | 148 | 0.10 |
| Bluespec | 0.10 | 15 | 0.01 | PHP | 58.92 | 40,627 | 7.38 |
| C | 28.64 | 27,111 | 3.59 | PowerShell | 0.91 | 236 | 0.11 |
| C# | 58.56 | 53,739 | 7.34 | Prolog | 0.03 | 5 | 0.00 |
| Clojure | 0.90 | 295 | 0.11 | Protocol Buffer | 0.92 | 391 | 0.12 |
| CMake | 0.90 | 359 | 0.11 | Python | 120.68 | 75,188 | 15.12 |
| CoffeeScript | 0.92 | 361 | 0.12 | R | 0.92 | 158 | 0.11 |
| Common Lisp | 0.92 | 105 | 0.11 | Racket | 0.09 | 13 | 0.01 |
| C++ | 90.87 | 36,006 | 11.39 | RMarkdown | 6.83 | 1,606 | 0.86 |
| CSS | 5.63 | 11,638 | 0.71 | Ruby | 15.01 | 18,526 | 1.88 |
| CUDA | 0.91 | 115 | 0.11 | Rust | 0.61 | 692 | 0.08 |
| Dart | 0.89 | 264 | 0.11 | SAS | 0.92 | 70 | 0.11 |
| Dockerfile | 0.04 | 48 | 0.00 | Scala | 0.81 | 971 | 0.10 |
| Elixir | 0.91 | 549 | 0.11 | Scheme | 0.92 | 216 | 0.12 |
| Elm | 0.92 | 232 | 0.12 | Shell | 13.92 | 10,890 | 1.74 |
| Emacs Lisp | 0.91 | 148 | 0.11 | Smalltalk | 0.92 | 880 | 0.12 |
| Erlang | 0.92 | 145 | 0.12 | Solidity | 0.85 | 83 | 0.11 |
| F# | 0.91 | 340 | 0.11 | Sparql | 0.10 | 88 | 0.01 |
| Fortran | 1.67 | 654 | 0.21 | SQL | 15.14 | 7,009 | 1.90 |
| GLSL | 0.92 | 296 | 0.11 | Stan | 0.20 | 41 | 0.03 |
| Go | 2.58 | 1,365 | 0.32 | Standard ML | 0.74 | 117 | 0.09 |
| Groovy | 0.89 | 340 | 0.11 | Stata | 0.91 | 122 | 0.11 |
| Haskell | 0.87 | 213 | 0.11 | SystemVerilog | 0.91 | 165 | 0.11 |
| HTML | 30.05 | 14,998 | 3.77 | TCL | 0.90 | 110 | 0.11 |
| Idris | 0.11 | 32 | 0.01 | Tcsh | 0.17 | 53 | 0.02 |
| Isabelle | 0.74 | 39 | 0.09 | Tex | 20.46 | 2,867 | 2.56 |
| Java | 148.66 | 134,367 | 18.63 | Thrift | 0.05 | 21 | 0.01 |
| Java Server Pages | 0.86 | 1,072 | 0.11 | TypeScript | 60.62 | 62,432 | 7.60 |
| JavaScript | 53.84 | 71,895 | 6.75 | Verilog | 0.01 | 1 | 0.00 |
| JSON | 4.61 | 11,956 | 0.58 | VHDL | 0.85 | 392 | 0.11 |
| Julia | 0.92 | 202 | 0.12 | Visual Basic | 0.75 | 73 | 0.09 |
| Jupyter Notebook | 14.38 | 2,555 | 1.80 | XSLT | 0.36 | 48 | 0.04 |
| Kotlin | 6.00 | 3,121 | 0.75 | Yacc | 0.72 | 67 | 0.09 |
| Lean | 0.52 | 68 | 0.07 | YAML | 0.74 | 890 | 0.09 |
| Literate Agda | 0.05 | 4 | 0.01 | Zig | 0.81 | 70 | 0.10 |
| Literate CoffeeScript | 0.01 | 3 | 0.00 | Total | 797.92 | 603,173 | 100.00 |

Table 1 | A summary of the cleaned training data for the selected programming languages.
3. Training Policy

3.1. Training Strategy

3.1.1. Next Token Prediction

The first training objective for our model is known as next token prediction. In this process, various files are concatenated to form a fixed-length entry. Then, these entries are used to train the model, enabling it to predict the subsequent token based on the provided context.

3.1.2. Fill-in-the-Middle

The second training objective for our model is known as fill-in-the-middle. In the code pre-training scenario, it is often necessary to generate corresponding inserted content based on the given context and subsequent text. Due to specific dependencies in a programming language, relying solely on next token prediction is insufficient to learn this fill-in-the-middle capability. Therefore, several approaches (Bavarian et al., 2022; Li et al., 2023) propose the pre-training method of Fill-in-the-Middle (FIM). This approach involves randomly dividing the text into three parts, then shuffling the order of these parts and connecting them with special tokens. This method aims to incorporate a fill-in-the-blank pre-training task during the training process. Within the FIM methodology, two distinct modes are employed: PSM (Prefix-Suffix-Middle) and SPM (Suffix-Prefix-Middle). In the PSM mode, the training corpus is organized in the sequence of prefix, suffix, middle, aligning the text in a way that the middle segment is flanked by the prefix and suffix. Conversely, the SPM mode arranges the segments as suffix, prefix, middle, presenting a different structural challenge. These modes are instrumental in enhancing the model's capability to handle various structural arrangements in code, providing a robust training framework for advanced code prediction tasks.
Figure 3 | The effectiveness of using the FIM objective. The three panels track Benchmark/HumanEval-Pass@1, Benchmark/HumanFIM-Pass@1, and Benchmark/MBPP-Pass@1 over training steps for the fim_0, fim_0.5, fim_1.0, and msp_0.5 configurations.

To determine the effectiveness of various hyperparameters within the FIM approach, we conducted a series of ablation experiments.
Experiment Settings: In this experiment, we employ DeepSeek-Coder-Base 1.3B as our model architecture. We focused on a Python subset from our training dataset to streamline the experimental process. Our primary objective was to assess the efficacy of the Fill-in-the-Middle (FIM) technique, utilizing the HumanEval-FIM benchmark (Fried et al., 2022). This benchmark specializes in a single-line FIM task for Python, in which one line of code from a HumanEval solution is randomly obscured, testing the model's proficiency in predicting the missing line. We hypothesize that the PSM mode may exhibit subtle differences compared to the traditional next-token prediction objective. This is primarily because PSM involves rearranging the order of the original text, potentially impacting the learning dynamics of the model. Therefore, we implement the PSM mode for FIM across four distinct configurations: 0% FIM rate, 50% FIM rate, 100% FIM rate, and 50% MSP rate. The Masked Span Prediction (MSP) strategy, initially introduced in T5 (Raffel et al., 2023), conceals multiple text spans and trains the model to reconstruct these segments. According to CodeGen2.5 (Nijkamp et al., 2023), MSP may enhance FIM performance compared to PSM. Thus, we include this method in our comparative analysis.

Results: The outcomes of our experiment are illustrated in Figure 3. While the model demonstrates peak performance on HumanEval-FIM with a 100% FIM rate, this configuration also results in the weakest code completion capability. This indicates a trade-off between FIM and code completion abilities. Moreover, we observe that with a 50% PSM rate, the model outperforms the MSP strategy. To achieve a balance between FIM efficiency and code completion proficiency, we ultimately choose the 50% PSM rate as our preferred training policy.

In our implementation, we have introduced three sentinel tokens specifically for this task. For each code file, we initially divide its content into three segments: a prefix, a middle, and a suffix. Using the PSM mode, we construct each training example by placing the prefix first, then the suffix, and finally the middle, with the sentinel tokens marking the segment boundaries. We implement the Fill-in-the-Middle (FIM) method at the document level before the packing process, as proposed in the original work by Bavarian et al. (2022). This is done with an FIM rate of 0.5, following the PSM mode.
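The sketch below illustrates document-level PSM construction as described above. The sentinel token strings are placeholders chosen for illustration only (the report does not spell out the exact tokens it uses), and the uniformly sampled split points are one common choice rather than a detail taken from the paper.

```python
import random

# Placeholder sentinel strings; the model's actual special tokens may differ.
FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"

def to_psm_example(document: str, fim_rate: float = 0.5) -> str:
    """Apply FIM to a single document (before packing), using PSM ordering."""
    if random.random() >= fim_rate:
        return document                   # the rest stays plain next-token data
    i, j = sorted(random.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # PSM: prefix, then suffix, then the middle the model must reconstruct.
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"

example = to_psm_example("def add(a, b):\n    return a + b\n", fim_rate=1.0)
```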
3.2. Tokenizer

For the tokenization process, we employ the HuggingFace Tokenizer library to train Byte Pair Encoding (BPE) tokenizers, as outlined in Sennrich et al. (2015), on a subset of our training corpus. Ultimately, we utilize a tokenizer configured with a vocabulary size of 32,000.
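For reference, training a BPE tokenizer of this size with the HuggingFace tokenizers library looks roughly like the sketch below; the file list, special tokens, and byte-level pre-tokenization are illustrative assumptions, since the report only states the library and the 32,000-entry vocabulary.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE trained on a sample of the code corpus (assumed setup).
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=["<pad>", "<bos>", "<eos>"],  # placeholder special tokens
)
tokenizer.train(files=["corpus_sample.txt"], trainer=trainer)
tokenizer.save("bpe_tokenizer_sketch.json")
```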
3.3. Model Architecture

We develop a range of models with varying parameters to cater to diverse applications, including models with 1.3B, 6.7B, and 33B parameters. These models are built upon the same framework as the DeepSeek Large Language Model (LLM) outlined by DeepSeek-AI (2024). Each model is a decoder-only Transformer, incorporating Rotary Position Embedding (RoPE) as described by Su et al. (2023). Notably, the DeepSeek-Coder 33B model integrates Grouped-Query Attention (GQA) with a group size of 8, enhancing both training and inference efficiency. Additionally, we employ FlashAttention v2 (Dao, 2023) to expedite the computation involved in the attention mechanism. The architectural details of our models are summarized in Table 2.
3.4. Optimization

Following DeepSeek LLM (DeepSeek-AI, 2024), we use AdamW (Loshchilov and Hutter, 2019) as the optimizer, with β1 and β2 values of 0.9 and 0.95. We adapt batch sizes and learning rates according to the scaling laws suggested in DeepSeek LLM. For the learning rate scheduling, we implement a three-stage policy, which includes 2000 warm-up steps, and set the final learning rate to 10% of the initial rate. Notably, the learning rate at each stage is scaled down to √(1/10) of the preceding stage's rate, following the guidelines established in DeepSeek LLM (DeepSeek-AI, 2024).
3.5. Environments

Our experiments are conducted using the HAI-LLM (High-Flyer, 2023) framework, known for its efficiency and lightweight approach in training large language models. This framework incorporates a variety of parallelism strategies to optimize computational efficiency. These include tensor parallelism (Korthikanti et al., 2023), alongside ZeRO data parallelism (Rajbhandari et al., 2020) and PipeDream pipeline parallelism (Narayanan et al., 2019).

| | DeepSeek-Coder 1.3B | DeepSeek-Coder 6.7B | DeepSeek-Coder 33B |
| --- | --- | --- | --- |
| Hidden Activation | SwiGLU | SwiGLU | SwiGLU |
| Hidden size | 2048 | 4096 | 7168 |
| Intermediate size | 5504 | 11008 | 19200 |
| Hidden layers number | 24 | 32 | 62 |
| Attention heads number | 16 | 32 | 56 |
| Attention | Multi-head | Multi-head | Grouped-query (8) |
| Batch Size | 1024 | 2304 | 3840 |
| Max Learning Rate | 5.3e-4 | 4.2e-4 | 3.5e-4 |

Table 2 | Hyperparameters of DeepSeek-Coder.

Our experiments utilize clusters outfitted with NVIDIA A100 and H800 GPUs. In the A100 cluster, each node is configured with 8 GPUs, interconnected in pairs using NVLink bridges. The H800 cluster is similarly arranged, with each node containing 8 GPUs. These GPUs are interconnected using a combination of NVLink and NVSwitch technologies, ensuring efficient data transfer within nodes. To facilitate seamless communication between nodes in both A100 and H800 clusters, we employ InfiniBand interconnects, known for their high throughput and low latency. This setup provides a robust and efficient infrastructure for our computational experiments.
3.6. Long Context

To enhance the capabilities of DeepSeek-Coder in handling extended contexts, particularly for scenarios like repository-level code processing, we have reconfigured the RoPE (Su et al., 2023) parameters to extend the default context window. Following previous practices (Chen et al., 2023; kaiokendev, 2023), we employed a linear scaling strategy, increasing the scaling factor from 1 to 4 and altering the base frequency from 10000 to 100000. The model underwent an additional 1000 steps of training, using a batch size of 512 and a sequence length of 16K. The learning rate was maintained as in the final pre-training phase. Theoretically, these modifications enable our model to process up to 64K tokens in context. However, empirical observations suggest that the model delivers its most reliable outputs within a 16K token range. Future research will continue to refine and evaluate the long-context adaptation methodology, aiming to further enhance DeepSeek-Coder's efficiency and user-friendliness in processing extended contexts.
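In concrete terms, linear scaling divides the position indices fed to RoPE by the scaling factor, and the base change alters the frequency spectrum. The snippet below sketches both adjustments with the parameters quoted above; the function layout is a minimal illustration, not the HAI-LLM implementation.

```python
import numpy as np

def rope_angles(positions, head_dim, base=100_000.0, linear_scale=4.0):
    """RoPE rotation angles with linear position scaling (factor 1 -> 4)
    and base frequency raised from 10000 to 100000, as in Section 3.6."""
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    scaled_pos = np.asarray(positions, dtype=np.float64) / linear_scale
    return np.outer(scaled_pos, inv_freq)   # shape: [len(positions), head_dim // 2]

# A position at raw index 16383 now maps to effective index ~4096, so the 16K
# window is covered by rotation magnitudes close to those seen in pre-training.
angles = rope_angles(range(16_384), head_dim=128)
```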
3.7. Instruction Tuning

We develop DeepSeek-Coder-Instruct by enhancing DeepSeek-Coder-Base through instruction-based fine-tuning using high-quality data. This data comprises helpful and impartial human instructions, structured by the Alpaca Instruction format (Taori et al., 2023). To demarcate each dialogue turn, we employed a unique delimiter token to signify the conclusion of each segment. For training, we use a cosine schedule with 100 warm-up steps and an initial learning rate of 1e-5. We also use a batch size of 4M tokens and 2B tokens in total.

An example of using DeepSeek-Coder-Instruct 33B is depicted in Figure 4. This example is a multi-turn dialogue scenario for building a snake game. Initially, we ask the model to write a snake game using pygame. The model successfully creates a basic snake game that can run without bugs. To improve the game, we further request adding a scoring system in the top left corner. The model then introduces a score variable and a display_score function, along with an explanation of how to integrate these features. This example illustrates DeepSeek-Coder-Instruct's ability to provide complete solutions in multi-turn dialogue settings. More cases can be found in Appendix A.
Q1: Write a game snake using pygame

Sure, here is a simple implementation of the classic Snake game using Pygame. This code will create a window with a snake that can be controlled using the arrow keys. The snake will grow in length each time it eats the food. The game will end if the snake hits the edge of the window or its own tail.

```python
import pygame
import random

# ... (setup code shortened in the figure) ...

running = True
while running:
    # Draw everything
    screen.fill(BLACK)
    snake.draw()
    food.draw()
    pygame.display.flip()

    # Cap the frame rate
    clock.tick(SPEED)

# Quit Pygame
pygame.quit()
```

This code sets up a basic game loop where the snake moves in response to key presses, and the game ends if the snake hits the edge of the window or its own tail. The snake grows in length each time it eats the food. The game continues until the player closes the window. [Run Code]

Q2: Add a scoring system in the top left corner

To add a scoring system to the top left corner of the game window, you can create a function to display the score and increment it each time the snake eats the food. Here is how you can modify the code to include a scoring system:

```python
# ... (setup code shortened in the figure) ...

running = True
while running:
    # Draw everything
    screen.fill(BLACK)
    snake.draw()
    food.draw()
    display_score(score)  # Display the score
    pygame.display.flip()

    # Cap the frame rate
    clock.tick(SPEED)

# Quit Pygame
pygame.quit()
```

In this updated code, we've added a score variable to keep track of the player's score, which is incremented each time the snake eats the food. The display_score function is used to render the score text and display it in the top left corner of the window. The display_score function is called after the snake has moved and the screen has been filled with the black color, but before the pygame.display.flip() call, so that the score is drawn on top of everything else. [Run Code]

Figure 4 | An example of responses from DeepSeek-Coder-Instruct 33B in a multi-turn setting.
4. Experimental Results

In this section, we evaluate DeepSeek-Coder on four tasks, including code generation (4.1), FIM code completion (4.2), cross-file code completion (4.3) and program-based math reasoning (4.4). We compare DeepSeek-Coder with the previous state-of-the-art large language models:

- CodeGeeX2 (Zheng et al., 2023) represents the second generation of the multilingual code generation model CodeGeeX. It is developed using the ChatGLM2 (Du et al., 2022) architecture and is enhanced with an extensive dataset of coding examples.
- StarCoder (Li et al., 2023) is a publicly accessible model with a substantial parameter count of 15 billion. It is specifically trained on a meticulously curated subset of the Stack dataset (Kocetkov et al., 2022), covering 86 programming languages, ensuring its proficiency across a wide range of coding tasks.
- CodeLlama (Roziere et al., 2023) encompasses a series of code-centric Large Language Models (LLMs) that are derivatives of LLaMA2 (Touvron et al., 2023). Available in three sizes (7B, 13B, and 34B), these models undergo continued training on a vast 500 billion token code corpus, building upon the foundational LLaMA2 architecture.
- code-cushman-001 (Chen et al., 2021) is a 12 billion parameter model developed by OpenAI and served as the initial model for GitHub Copilot.
- GPT-3.5 and GPT-4 (OpenAI, 2023) are advanced generative AI models developed by OpenAI. While they are not explicitly trained for code generation, they also demonstrate notable performance in this domain. Their effectiveness in handling code generation tasks is largely attributed to their massive scale in terms of parameter count.
4.1. Code Generation

HumanEval and MBPP Benchmarks. The HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) benchmarks are widely used for evaluating code LLMs. HumanEval consists of 164 hand-written Python problems that are validated using test cases to assess the code generated by a Code LLM in a zero-shot setting, while the MBPP benchmark includes 500 problems in a few-shot setting. To evaluate the model's multilingual capabilities, we expanded the Python problems of the HumanEval benchmark to seven additional commonly used programming languages, namely C++, Java, PHP, TypeScript (TS), C#, Bash, and JavaScript (JS) (Cassano et al., 2023). For both benchmarks, we adopted a greedy search approach and re-implemented the baseline results using the same script and environment for fair comparison.
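With greedy decoding, pass@1 reduces to the fraction of problems whose single generated completion passes all test cases. The harness below is a minimal, generic sketch of that evaluation loop (direct, unsandboxed execution with a simple timeout), not the exact script used for the results in this section.

```python
import os, subprocess, tempfile

def passes_tests(completion: str, test_code: str, timeout_s: int = 10) -> bool:
    """Run one generated solution against its benchmark tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(completion + "\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def pass_at_1(samples: list) -> float:
    """samples: (greedy completion, test code) pairs, one per problem."""
    return sum(passes_tests(c, t) for c, t in samples) / len(samples)
```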
| Model | Size | Python | C++ | Java | PHP | TS | C# | Bash | JS | Avg | MBPP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Multilingual Base Models | | | | | | | | | | | |
| code-cushman-001 | 12B | 33.5% | 31.9% | 30.6% | 28.9% | 31.3% | 22.1% | 11.7% | - | - | - |
| CodeGeeX2 | 6B | 36.0% | 29.2% | 25.9% | 23.6% | 20.8% | 29.7% | 6.3% | 24.8% | 24.5% | 36.2% |
| StarCoderBase | 16B | 31.7% | 31.1% | 28.5% | 25.4% | 34.0% | 34.8% | 8.9% | 29.8% | 28.0% | 42.8% |
| CodeLlama | 7B | 31.7% | 29.8% | 34.2% | 23.6% | 36.5% | 36.7% | 12.0% | 29.2% | 29.2% | 38.6% |
| CodeLlama | 13B | 36.0% | 37.9% | 38.0% | 34.2% | 45.2% | 43.0% | 16.5% | 32.3% | 35.4% | 48.4% |
| CodeLlama | 34B | 48.2% | 44.7% | 44.9% | 41.0% | 42.1% | 48.7% | 15.8% | 42.2% | 41.0% | 55.2% |
| DeepSeek-Coder-Base | 1.3B | 34.8% | 31.1% | 32.3% | 24.2% | 28.9% | 36.7% | 10.1% | 28.6% | 28.3% | 46.2% |
| DeepSeek-Coder-Base | 6.7B | 49.4% | 50.3% | 43.0% | 38.5% | 49.7% | 50.0% | 28.5% | 48.4% | 44.7% | 60.6% |
| DeepSeek-Coder-Base | 33B | 56.1% | 58.4% | 51.9% | 44.1% | 52.8% | 51.3% | 32.3% | 55.3% | 50.3% | 66.0% |
| Instruction-Tuned Models | | | | | | | | | | | |
| GPT-3.5-Turbo | - | 76.2% | 63.4% | 69.2% | 60.9% | 69.1% | 70.8% | 42.4% | 67.1% | 64.9% | 70.8% |
| GPT-4 | - | 84.1% | 76.4% | 81.6% | 77.2% | 77.4% | 79.1% | 58.2% | 78.0% | 76.5% | 80.0% |
| DeepSeek-Coder-Instruct | 1.3B | 65.2% | 45.3% | 51.9% | 45.3% | 59.7% | 55.1% | 12.7% | 52.2% | 48.4% | 49.4% |
| DeepSeek-Coder-Instruct | 6.7B | 78.6% | 63.4% | 68.4% | 68.9% | 67.2% | 72.8% | 36.7% | 72.7% | 66.1% | 65.4% |
| DeepSeek-Coder-Instruct | 33B | 79.3% | 68.9% | 73.4% | 72.7% | 67.9% | 74.1% | 43.0% | 73.9% | 69.2% | 70.0% |

Table 3 | Performance of approaches on the Multilingual HumanEval and MBPP Benchmarks.
The results are presented in Table 3. As we can see, DeepSeek-Coder-Base achieves state-of-the-art performance with an average accuracy of 50.3% on HumanEval and 66.0% on MBPP. In comparison to the similarly sized open-source model CodeLlama-Base 34B, our model has demonstrated a notable improvement of 9% and 11% in accuracy, respectively. It is worth noting that even our smaller model, DeepSeek-Coder-Base 6.7B, surpasses the performance of CodeLlama-Base 34B. After instruction fine-tuning, our model surpasses the closed-source GPT-3.5-Turbo model on the HumanEval benchmark, significantly reducing the performance gap between OpenAI GPT-4 and open-source models.

DS-1000 Benchmark. HumanEval and MBPP have a significant drawback in that they rely heavily on straightforward programming tasks that may not accurately represent the kind of code most programmers typically write. In contrast, the DS-1000 benchmark, as introduced in the work by Lai et al. (2023), offers a comprehensive collection of 1,000 practical and realistic data science workflows across seven different libraries. This benchmark evaluates code generation by executing it against specific test cases. What sets DS-1000 apart is its categorization of problems based on the libraries involved, which encompass Matplotlib, NumPy, Pandas, SciPy, Scikit-Learn, PyTorch, and TensorFlow. The benchmark assesses the performance of base models in the code completion setting, and we provide pass@1 results for each library, as well as an overall score. The results of the DS-1000 benchmark are shown in Table 4. As can be seen from the table, the DeepSeek-Coder model achieves relatively high accuracy in all libraries, demonstrating that our model is not only capable of generating good code but also of using libraries more accurately in real data science workflows.
| Model | Size | Matplotlib | Numpy | Pandas | Pytorch | Scipy | Scikit-Learn | Tensorflow | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CodeGeeX2 | 6B | 38.7% | 26.8% | 14.4% | 11.8% | 19.8% | 27.0% | 17.8% | 22.9% |
| StarCoder-Base | 16B | 43.2% | 29.1% | 11.0% | 20.6% | 23.6% | 32.2% | 15.6% | 24.6% |
| CodeLlama-Base | 7B | 41.9% | 24.6% | 14.8% | 16.2% | 18.9% | 17.4% | 17.8% | 22.1% |
| CodeLlama-Base | 13B | 46.5% | 28.6% | 18.2% | 19.1% | 18.9% | 27.8% | 33.3% | 26.8% |
| CodeLlama-Base | 34B | 50.3% | 42.7% | 23.0% | 25.0% | 28.3% | 33.9% | 40.0% | 34.3% |
| DeepSeek-Coder-Base | 1.3B | 32.3% | 21.4% | 9.3% | 8.8% | 8.5% | 16.5% | 8.9% | 16.2% |
| DeepSeek-Coder-Base | 6.7B | 48.4% | 35.5% | 20.6% | 19.1% | 22.6% | 38.3% | 24.4% | 30.5% |
| DeepSeek-Coder-Base | 33B | 56.1% | 49.6% | 25.8% | 36.8% | 36.8% | 40.0% | 46.7% | 40.2% |

Table 4 | Performance of different approaches on the DS-1000 tasks.
LeetCode Contest Benchmark. To further validate the model's capability in real-world programming problems, we construct the LeetCode Contest benchmark. LeetCode presents competition-level problems, offering significant challenges that test the model's problem understanding and code generation skills. We collected the latest problems from LeetCode Contests to prevent the appearance of both the problems and their solutions in our pre-training data. A total of 180 problems were collected from July 2023 to January 2024. For each problem, we collected 100 test cases to ensure test coverage. We use the template "{problem_description}\nPlease complete the code below to solve the above problem:\n```python\n{code_template}\n```" to build the instruction prompt.
The evaluation results are shown in Table 5. In our evaluation, the DeepSeek-Coder models demonstrate remarkable performance over current open-source coding models. Specifically, DeepSeek-Coder-Instruct 6.7B and 33B achieve Pass@1 scores of 19.4% and 27.8% respectively on this benchmark. This performance notably surpasses existing open-source models such as CodeLlama-33B. DeepSeek-Coder-Instruct 33B is the only open-sourced model that outperforms OpenAI's GPT-3.5-Turbo in this task. However, there remains a substantial performance gap when compared to the more advanced GPT-4-Turbo.

Our analysis indicates that the implementation of Chain-of-Thought (CoT) prompting notably enhances the capabilities of DeepSeek-Coder-Instruct models. This improvement becomes particularly evident in the more challenging subsets of tasks. By adding the directive "You need first to write a step-by-step outline and then write the code." following the initial prompt, we have observed enhancements in performance. This observation leads us to believe that the process of first crafting detailed code descriptions assists the model in more effectively understanding and addressing the intricacies of logic and dependencies in coding tasks, particularly those of higher complexity. Therefore, we strongly recommend employing CoT prompting strategies when utilizing DeepSeek-Coder-Instruct models for complex coding challenges. Such an approach promotes a more methodical and logical framework for problem-solving, potentially resulting in more precise and efficient outcomes in code generation tasks.

We have published this benchmark.

Table 5 | Performance of different models on the LeetCode Contest Benchmark.

It is important to acknowledge that despite our diligent efforts to gather the most recent code questions for model testing, the possibility of data contamination cannot be entirely ruled out. We observed that the GPT-4-Turbo and DeepSeek-Coder models achieved higher scores in the LeetCode Contests held in July and August. We encourage the research community to consider the potential issue of data contamination when evaluating models in future studies using our released LeetCode data.
4.2. Fill-in-the-Middle Code Completion

DeepSeek-Coder models are trained with a 0.5 FIM (Fill-In-the-Middle) rate during their pre-training phase. This specialized training strategy empowers the model to proficiently generate code by filling in blanks based on the surrounding context, both prefix and suffix, of the given code snippet. This capability is particularly advantageous in the realm of code completion tools. Several open-source models have emerged with similar capabilities. Notable among these are SantaCoder (Allal et al., 2023), StarCoder (Li et al., 2023), and CodeLlama (Roziere et al., 2023). These models have set a precedent in the field of code generation and completion. In evaluating the performance of DeepSeek-Coder models, we conducted a comparative analysis with the aforementioned models. The benchmark for this comparison was the Single-Line Infilling benchmarks, encompassing three different programming languages, as proposed by Allal et al. (2023). This benchmark uses the line exact match accuracy as the evaluation metric.
| Model | Size | python | java | javascript | Mean |
| --- | --- | --- | --- | --- | --- |
| SantaCoder | 1.1B | 44.0% | 62.0% | 74.0% | 69.0% |
| StarCoder | 16B | 62.0% | 73.0% | 74.0% | 69.7% |
| CodeLlama-Base | 7B | 67.6% | 74.3% | 80.2% | 69.7% |
| CodeLlama-Base | 13B | 68.3% | 77.6% | 80.7% | 75.5% |
| DeepSeek-Coder-Base | 1B | 57.4% | 82.2% | 71.7% | 70.4% |
| DeepSeek-Coder-Base | 7B | 66.6% | 88.1% | 79.7% | 80.7% |
| DeepSeek-Coder-Base | 33B | 65.4% | 86.6% | 82.5% | 81.2% |

Table 6 | Performance of different approaches on the FIM tasks.
The evaluation results are shown in Table 6. Despite being the smallest model, with a capacity of 1.3 billion parameters, DeepSeek-Coder outperforms its larger counterparts, StarCoder and CodeLlama, in these benchmarks. This superior performance can be attributed to the high quality of the pre-trained data utilized by DeepSeek-Coder. Furthermore, a notable trend observed is the correlation between the size of the model and its performance. As the model size increases, there is a corresponding enhancement in performance. This trend underscores the importance of model capacity in achieving higher accuracy in code completion tasks. Based on these findings, we recommend the deployment of the DeepSeek-Coder-Base 6.7B model in code completion tools. This recommendation is grounded in the model's demonstrated balance between efficiency and accuracy. The DeepSeek-Coder-Base 6.7B model, with its substantial parameter size, has proven to be highly effective in the context of code completion, making it an ideal choice for integrating advanced computational capabilities into coding environments.
4.3. Cross-File Code Completion

In this section, we evaluate the performance of existing open-source models on cross-file code completion tasks. Unlike code generation discussed in the previous section, cross-file code completion requires the model to access and understand repositories that span multiple files with numerous cross-file dependencies. We use CrossCodeEval (Ding et al., 2023) to evaluate the capabilities of currently available open-source code models of 7B scale on cross-file completion tasks. This dataset is constructed on a diverse set of real-world, open-sourced, permissively licensed repositories in four popular programming languages: Python, Java, TypeScript, and C#. The dataset is specifically designed to strictly require cross-file context for accurate completion. Notably, this dataset was constructed from repositories created between March and June 2023, while our pre-training data only includes code created before February 2023, which ensures that this dataset was not present in our pre-training data, thus avoiding data leakage.
| Model | Size | Python EM | Python ES | Java EM | Java ES | TypeScript EM | TypeScript ES | C# EM | C# ES |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CodeGeex2 | 6B | 8.11% | 59.55% | 7.34% | 59.60% | 6.14% | 55.50% | 1.70% | 51.66% |
| +Retrieval | | 10.73% | 61.76% | 10.10% | 59.56% | 7.72% | 55.17% | 4.64% | 52.30% |
| StarCoder-Base | 7B | 6.68% | 59.55% | 8.65% | 62.57% | 5.01% | 48.83% | 4.75% | 59.53% |
| +Retrieval | | 13.06% | 64.24% | 15.61% | 64.78% | 7.54% | 42.06% | 14.20% | 65.03% |
| CodeLlama-Base | 7B | 7.32% | 59.66% | 9.68% | 62.64% | 8.19% | 58.50% | 4.07% | 59.19% |
| +Retrieval | | 13.02% | 64.30% | 16.41% | 64.64% | 12.34% | 60.64% | 13.19% | 63.04% |
| DeepSeek-Coder-Base | 6.7B | 9.53% | 61.65% | 10.80% | 61.77% | 9.59% | 60.17% | 5.26% | 61.32% |
| +Retrieval | | 16.14% | 66.51% | 17.72% | 63.18% | 14.03% | 61.77% | 16.23% | 63.42% |
| +Retrieval w/o Repo Pre-training | | 16.02% | 66.65% | 16.64% | 61.88% | 13.23% | 60.92% | 14.48% | 62.38% |

Table 7 | Performance of different models on cross-file code completion.
In our evaluation of various models, we set the maximum sequence length to 2048 tokens, the maximum output length to 50 tokens, and a limit of 512 tokens for the cross-file context. For the cross-file context, we utilize the official BM25 search results provided by Ding et al. (2023). Evaluation metrics include exact match and edit similarity. The results, presented in Table 7, demonstrate that DeepSeek-Coder consistently outperforms other models in cross-file completion tasks across multiple languages, showcasing its superior practical application capabilities. When only utilizing the file-level code corpus (w/o Repo Pre-training) to pre-train DeepSeek-Coder, we observe a decrease in performance in the Java, TypeScript, and C# languages, indicating the effectiveness of the repository-level pre-training.
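For readers unfamiliar with the two metrics used above, the sketch below gives minimal reference implementations: exact match after whitespace normalization, and a character-level similarity ratio as a simple stand-in for edit similarity. CrossCodeEval's official scoring may normalize and compute these differently, so this is an illustration rather than the evaluation code.

```python
from difflib import SequenceMatcher

def exact_match(prediction: str, reference: str) -> bool:
    """Whitespace-normalized exact match between completion and ground truth."""
    return " ".join(prediction.split()) == " ".join(reference.split())

def edit_similarity(prediction: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; higher means closer to the reference."""
    return SequenceMatcher(None, prediction, reference).ratio()

print(exact_match("return x + 1", "return x+1"))        # False: '+' spacing differs
print(edit_similarity("return x + 1", "return x+1"))    # high, close to 1.0
```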
4.4. Program-based Math Reasoning

Program-based math reasoning involves evaluating a model's ability to understand and solve mathematical problems through programming. This type of reasoning is critical in fields such as data analysis and scientific computing. To conduct this assessment, we utilize the Program-Aided Math Reasoning (PAL) method as outlined in Gao et al. (2023). This approach is applied across seven distinct benchmarks, each offering unique challenges and contexts. These benchmarks include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), GSM-Hard (Gao et al., 2023), SVAMP (Patel et al., 2021), TabMWP (Lu et al., 2022), ASDiv (Miao et al., 2020) and MAWPS (Gou et al., 2023). In each of these benchmarks, the model is prompted to alternately describe a solution step in natural language and then execute that step with code. As seen in Table 8, DeepSeek-Coder models achieve remarkable performance across all benchmarks, especially the 33B variant, which demonstrates the potential of using such models in applications that require complex mathematical computations and problem-solving abilities.
| Model | Size | GSM8k | MATH | GSM-Hard | SVAMP | TabMWP | ASDiv | MAWPS | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Multilingual Base Models | | | | | | | | | |
| CodeGeex-2 | 7B | 22.2% | 9.7% | 23.6% | 39.0% | 44.6% | 48.5% | 66.0% | 36.2% |
| StarCoder-Base | 16B | 23.4% | 10.3% | 23.0% | 42.4% | 45.0% | 54.9% | 81.1% | 40.0% |
| CodeLlama-Base | 7B | 31.2% | 12.1% | 30.2% | 54.2% | 52.9% | 59.6% | 82.6% | 46.1% |
| CodeLlama-Base | 13B | 43.1% | 14.4% | 40.2% | 59.2% | 60.3% | 63.6% | 85.3% | 52.3% |
| CodeLlama-Base | 34B | 58.2% | 21.2% | 51.8% | 70.3% | 69.8% | 70.7% | 91.8% | 62.0% |
| DeepSeek-Coder-Base | 1.3B | 14.6% | 16.8% | 14.5% | 36.7% | 30.0% | 48.2% | 62.3% | 31.9% |
| DeepSeek-Coder-Base | 6.7B | 43.2% | 19.2% | 40.3% | 58.4% | 67.9% | 67.2% | 87.0% | 54.7% |
| DeepSeek-Coder-Base | 33B | 60.7% | 29.1% | 54.1% | 71.6% | 75.3% | 76.7% | 93.3% | 65.8% |

Table 8 | Performance of different approaches on the program-aided math reasoning tasks.
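In the PAL setting, the model writes a short program whose execution yields the numeric answer, rather than producing the answer directly. A minimal sketch of that loop is shown below; the prompt wording, the solution() convention, and the direct use of exec are illustrative simplifications of Gao et al.'s method, not the exact evaluation harness.

```python
def run_pal(model_generate, question: str) -> str:
    """Ask the model for a Python solution, execute it, and return the answer."""
    prompt = (
        "Solve the math word problem by writing a Python function solution() "
        "that returns the final answer. Reason step by step in comments.\n\n"
        f"Question: {question}\n"
    )
    program = model_generate(prompt)       # e.g. a call into DeepSeek-Coder
    namespace = {}
    exec(program, namespace)               # execute the generated program
    return str(namespace["solution"]())

# A toy stand-in for the model, showing the expected shape of its output.
fake_model = lambda _: "def solution():\n    # 3 apples + 4 apples\n    return 3 + 4\n"
print(run_pal(fake_model, "Tom has 3 apples and buys 4 more. How many does he have?"))  # 7
```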
5. Continue Pre-Training From General LLM

To further enhance the natural language understanding and mathematical reasoning abilities of the DeepSeek-Coder model, we perform additional pre-training from the general language model DeepSeek-LLM-7B Base (DeepSeek-AI, 2024) on 2 trillion tokens, resulting in DeepSeek-Coder-v1.5 7B. For this pre-training, we specifically use the data sources listed in Table 9. Unlike DeepSeek-Coder, DeepSeek-Coder-v1.5 employs solely a next token prediction objective with a 4K context length during its pre-training phase.

| Data Source | Percentage |
| --- | --- |
| Source Code | 70% |
| Markdown and StackExchange | 10% |
| Natural language related to code | 7% |
| Natural language related to math | 7% |
| Bilingual (Chinese-English) natural language | 6% |

Table 9 | Data sources for DeepSeek-Coder-v1.5 7B pre-training.
We conduct a comparison between DeepSeek-Coder-v1.5 7B and DeepSeek-Coder 6.7B, and re-run all benchmarks using our evaluation pipeline to ensure a fair comparison. We evaluate performance across a wide range of tasks, which can be categorized as follows:

- Programming: This category includes evaluations in a multilingual setting using the HumanEval dataset by Chen et al. (2021), as well as evaluations in a Python setting using the MBPP dataset by Austin et al. (2021).
- Math Reasoning: We assess performance on math reasoning tasks using the GSM8K benchmark (Cobbe et al., 2021) and the MATH benchmark (Hendrycks et al., 2021). These tasks involve solving math problems by generating programs.
- Natural Language: Our evaluation in natural language tasks includes the MMLU (Hendrycks et al., 2020), BBH (Suzgun et al., 2022), HellaSwag (Zellers et al., 2019), Winogrande (Sakaguchi et al., 2021), and ARC-Challenge (Clark et al., 2018) benchmarks.
The results for the Base and Instruct models are presented in Table 10. It is observed that the DeepSeek-Coder-Base-v1.5 model, despite a slight decrease in coding performance, shows marked improvements across most tasks when compared to the DeepSeek-Coder-Base model. In particular, in the Math Reasoning and Natural Language categories, DeepSeek-Coder-Base-v1.5 significantly outperforms its predecessor across all benchmarks, demonstrating substantial gains in mathematical reasoning and natural language processing capabilities.

| Models | Size | HumanEval | MBPP | GSM8K | MATH | MMLU | BBH | HellaSwag | WinoG | ARC-C |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-Coder-Base | 6.7B | 44.7% | 60.6% | 43.2% | 19.2% | 36.6% | 44.3% | 53.8% | 57.1% | 32.5% |
| DeepSeek-Coder-Base-v1.5 | 6.9B | 43.2% | 60.4% | 62.4% | 24.7% | 49.1% | 55.2% | 69.9% | 63.8% | 47.2% |
| DeepSeek-Coder-Instruct | 6.7B | 66.1% | 65.4% | 62.8% | 28.6% | 37.2% | 46.9% | 55.0% | 57.6% | 37.4% |
| DeepSeek-Coder-Instruct-v1.5 | 6.9B | 64.1% | 64.6% | 72.6% | 34.1% | 49.5% | 53.3% | 72.2% | 63.4% | 48.1% |

Table 10 | Comparative analysis of performance between DeepSeek-Coder-Base and DeepSeek-Coder-Base-v1.5. Math tasks are solved through programming.
6. Conclusion

In this technical report, we introduce a series of specialized Large Language Models (LLMs) for coding, named DeepSeek-Coder, available in three distinct scales: 1.3B, 6.7B, and 33B parameters. These models are uniquely trained on a meticulously curated project-level code corpus, utilizing a fill-in-the-blank pre-training objective to enhance code infilling capabilities. A significant advancement is the extension of the model's context window to 16,384 tokens, thereby greatly improving its effectiveness in handling extensive code generation tasks. Our evaluations reveal that the most advanced model in our series, DeepSeek-Coder-Base 33B, surpasses existing open-source code models across a variety of standard tests. Impressively, the DeepSeek-Coder-Base 6.7B model, despite its smaller scale, delivers performance on par with the 34B parameter CodeLlama, a testament to the high quality of our pretraining corpus.

To augment the zero-shot instruction capabilities of the DeepSeek-Coder-Base models, we have fine-tuned them with high-quality instructional data. This has led to the DeepSeek-Coder-Instruct 33B model outperforming OpenAI's GPT-3.5 Turbo in a range of coding-related tasks, showcasing its exceptional proficiency in code generation and understanding.

To further improve the natural language understanding capabilities of the DeepSeek-Coder-Base models, we have conducted additional pretraining based on the DeepSeek-LLM 7B checkpoint. This additional training involved processing a diverse dataset comprising 2 trillion tokens, including natural language, code, and mathematical data. The result is the creation of a new and improved code model, DeepSeek-Coder-v1.5. Our observations indicate that DeepSeek-Coder-v1.5 not only maintains its predecessor's high-level coding performance but also exhibits enhanced natural language comprehension. This advancement underscores our belief that the most effective code-focused Large Language Models (LLMs) are those built upon robust general LLMs. The reason is evident: to effectively interpret and execute coding tasks, these models must also possess a deep understanding of human instructions, which often come in various forms of natural language. Looking ahead, our commitment is to develop and openly share even more powerful code-focused LLMs based on larger-scale general LLMs.

Acknowledgements

We would like to express our gratitude to Bo Liu, Chengqi Deng, Chong Ruan, Damai Dai, Jiashi Li, Kang Guan, Mingchuan Zhang, Panpan Huang, Shuiping Yu, Shirong Ma, Yaofeng Sun, Yishi Piao, Zhihong Shao, and Zhewen Hao for their invaluable discussions and assistance during training DeepSeek-Coder models.
References

L. B. Allal, R. Li, D. Kocetkov, C. Mou, C. Akiki, C. M. Ferrandis, N. Muennighoff, M. Mishra, A. Gu, M. Dey, et al. SantaCoder: don't reach for the stars! arXiv preprint arXiv:2301.03988, 2023.

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program synthesis with large language models, 2021.

M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, J. Tworek, and M. Chen. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255, 2022.

F. Cassano, J. Gouwar, D. Nguyen, S. Nguyen, L. Phipps-Costin, D. Pinckney, M.-H. Yee, Y. Zi, C. J. Anderson, M. Q. Feldman, et al. MultiPL-E: a scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering, 2023.

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

S. Chen, S. Wong, L. Chen, and Y. Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

T. Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning, 2023.

DeepSeek-AI. DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.

Y. Ding, Z. Wang, W. U. Ahmad, H. Ding, M. Tan, N. Jain, M. K. Ramanathan, R. Nallapati, P. Bhatia, D. Roth, et al. CrossCodeEval: A diverse and multilingual benchmark for cross-file code completion. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.

Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320-335, 2022.

D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, W.-t. Yih, L. Zettlemoyer, and M. Lewis. InCoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999, 2022.

L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig. PAL: Program-aided language models. In International Conference on Machine Learning, pages 10764-10799. PMLR, 2023.

Gemini Team. Gemini: A family of highly capable multimodal models, 2023. URL https://goo.gle/GeminiPaper.

Z. Gou, Z. Shao, Y. Gong, Y. Yang, M. Huang, N. Duan, W. Chen, et al. ToRA: A tool-integrated reasoning agent for mathematical problem solving. arXiv preprint arXiv:2309.17452, 2023.

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020.

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.

High-Flyer. HAI-LLM: An efficient and lightweight tool for training large models, 2023.

kaiokendev. Things I'm learning while training SuperHOT. https://kaiokendev.github.io/til#extending-context-to-8k, 2023.

D. Kocetkov, R. Li, L. Jia, C. Mou, Y. Jernite, M. Mitchell, C. M. Ferrandis, S. Hughes, T. Wolf, D. Bahdanau, et al. The Stack: 3 TB of permissively licensed source code. Transactions on Machine Learning Research, 2022.

V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023.

Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettlemoyer, W.-t. Yih, D. Fried, S. Wang, and T. Yu. DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pages 18319-18345. PMLR, 2023.

K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, and N. Carlini. Deduplicating training data makes language models better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8424-8445, 2022.

R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, et al. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.

I. Loshchilov and F. Hutter. Decoupled weight decay regularization, 2019.

P. Lu, L. Qiu, K.-W. Chang, Y. N. Wu, S.-C. Zhu, T. Rajpurohit, P. Clark, and A. Kalyan. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. In The Eleventh International Conference on Learning Representations, 2022.

S.-Y. Miao, C.-C. Liang, and K.-Y. Su. A diverse corpus for evaluating and developing English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975-984, 2020.

D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1-15, 2019.

E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.

E. Nijkamp, H. Hayashi, C. Xiong, S. Savarese, and Y. Zhou. CodeGen2: Lessons for training LLMs on programming and natural languages, 2023.

OpenAI. GPT-4 technical report, 2023.

A. Patel, S. Bhattamishra, and N. Goyal. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080-2094, 2021.

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023.

S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-16. IEEE, 2020.

B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.

K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99-106, 2021.

R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.

J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu. RoFormer: Enhanced transformer with rotary position embedding, 2023.

M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.

R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model, 2023.

H. Touvron, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Y. Wang, W. Wang, S. Joty, and S. C. Hoi. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv preprint arXiv:2109.00859, 2021.

R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791-4800, 2019.

Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, L. Shen, Z. Wang, A. Wang, Y. Li, et al. CodeGeeX: A pre-trained model for code generation with multilingual benchmarking on HumanEval-X. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5673-5684, 2023.
A. Cases of Chatting with DeepSeek-Coder-Instruct

We present two cases of interactions with DeepSeek-Coder-Instruct: one involving a multi-turn conversation about creating a database and performing data analysis, and the other centered around using the model to solve a sample problem from LeetCode.

In the first scenario, depicted in Figure 5, we instruct the model to build a student database using Python and randomly insert 10 pieces of information. Subsequently, in the second round of the conversation, we continue to ask the model to analyze the age distribution of the students. From Figure 5, it is evident that the model can generate bug-free and comprehensive code, accompanied by explanatory details. In the second scenario, as illustrated in Figure 6, we further assess the model's capabilities by testing it on an out-of-domain LeetCode contest problem. This particular problem was released in November 2023, after our data collection, and thus is not part of our model's training data. The results show that our model excels at solving problems that extend beyond its training distribution.
Q1: Build a database in Python that includes various pieces of information about students, insert ten random records and display them.

Sure, here is a simple example of how you can do this using SQLite in Python.

```python
import sqlite3
import random

# Connect to SQLite database
conn = sqlite3.connect('students.db')

# Create a cursor object
cur = conn.cursor()

# Create table
cur.execute('CREATE TABLE IF NOT EXISTS students (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, grade TEXT)')

# ... (insertion of ten random records shortened in the figure) ...

# Display all records
cur.execute('SELECT * FROM students')
```

This script will create a SQLite database named students.db and a table named students with four columns: id, name, age, and grade. It will then insert 10 random records into the table and display all records. [Run Code]

Q2: Create a graph depicting the age distribution.

To create a graph depicting the age distribution of the students, you can use the matplotlib library in Python. Here's how you can do it:

```python
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

# Connect to SQLite database
conn = sqlite3.connect('students.db')

# Read the data from the database into a pandas DataFrame
df = pd.read_sql_query('SELECT * from students', conn)

# Create a histogram of the age distribution
plt.hist(df['age'], bins=range(10, 30, 2), edgecolor='black')
plt.title('Age Distribution of Students')
plt.xlabel('Age')
plt.ylabel('Number of Students')
plt.show()

# Close the connection
conn.close()
```

This script will connect to the SQLite database, read the data into a pandas DataFrame, then create a histogram of the age distribution. The histogram will show the number of students in each age range. [Run Code]

Figure 5 | An example of building a database and data analysis.
Q: There are n teams numbered from 0 to n-1 in a tournament; each team is also a node in a DAG. You are given the integer n and a 0-indexed 2D integer array edges of length m representing the DAG, where edges[i] = [ui, vi] indicates that there is a directed edge from team ui to team vi in the graph. A directed edge from a to b in the graph means that team a is stronger than team b and team b is weaker than team a. Team a will be the champion of the tournament if there is no team b that is stronger than team a. Return the team that will be the champion of the tournament if there is a unique champion; otherwise, return -1.

Notes: A cycle is a series of nodes a1, a2, ..., an, an+1 such that node a1 is the same node as node an+1, the nodes a1, a2, ..., an are distinct, and there is a directed edge from node ai to node ai+1 for every i in the range [1, n]. A DAG is a directed graph that does not have any cycle.

Please follow the template below to solve the problem:

```python
class Solution:
    def findChampion(self, n: int, edges: List[List[int]]) -> int:
```

To solve this problem, we can use a topological sorting algorithm.

```python
from typing import List

class Solution:
    def findChampion(self, n: int, edges: List[List[int]]) -> int:
        # Create an adjacency list to represent the DAG
        adj_list = {i: [] for i in range(n)}
        for u, v in edges:
            adj_list[u].append(v)

        # Initialize a list to keep track of the in-degree of each team
        in_degree = [0] * n

        # Calculate the in-degree for each team
        for u in adj_list:
            for v in adj_list[u]:
                in_degree[v] += 1

        # Collect the teams with no incoming edges
        no_incoming_edges = [i for i in range(n) if in_degree[i] == 0]

        # If there is more than one team with no incoming edges, there is no unique champion
        if len(no_incoming_edges) != 1:
            return -1

        # Return the team with no incoming edges, which is the champion
        return no_incoming_edges[0]
```

This code first creates an adjacency list to represent the DAG. [Submit Code]

Figure 6 | An example of solving a LeetCode problem.
B. Benchmark curves during training of DeepSeek-Coder-Base

In Figure 7, we present the benchmark curves illustrating the performance of DeepSeek-Coder-Base models during their training phase. For validation, a carefully curated subset of the training corpus was employed, consisting of 8,000 code files. This subset was deliberately chosen to ensure a diverse and representative sample, critical for an accurate assessment of the models' capabilities. The performance metrics of these models are specifically detailed in the final two sub-figures of Figure 7, offering a clear visual representation of their efficacy throughout the training process.

Figure 7 | Benchmark curves during training of DeepSeek-Coder-Base (1.3B, 6.7B, and 33B). The sub-figures plot HumanEval-Pass@1, HumanEval-cpp-Pass@1, HumanEval-java-Pass@1, HumanEvalFIM-EM, MBPP-Pass@1, MeanHumanEval, Validation-Completion-EM, and Validation-CompletionFIM-EM against training tokens (billions).