DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y.K. Li, Wenfeng Liang, Fangyun Lin, A.X. Liu, Bo Liu, Wen Liu, Xiaodong Liu, Xin Liu, Yiyuan Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma, Xiaotao Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu, Tongzheng Ren, Zehui Ren, Chong Ruan, Zhangli Sha, Zhihong Shao, Junxiao Song, Xuecheng Su, Jingxiang Sun, Yaofeng Sun, Minghui Tang, Bingxuan Wang, Peiyi Wang, Shiyu Wang, Yaohui Wang, Yongji Wang, Tong Wu, Y. Wu, Xin Xie, Zhenda Xie, Ziwei Xie, Yiliang Xiong, Hanwei Xu, R.X. Xu, Yanhong Xu, Dejian Yang, Yuxiang You, Shuiping Yu, Xingkai Yu, B. Zhang, Haowei Zhang, Lecong Zhang, Liyue Zhang, Mingchuan Zhang, Minghua Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Qihao Zhu, Yuheng Zou*

DeepSeek-AI

Abstract
The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling laws described in previous literature present varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate the scaling of large scale models in two prevalently used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and direct preference optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B across a range of benchmarks, especially in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that our DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

* Authors are ordered alphabetically by the last name.
arXiv:2401.02954v1 [cs.CL] 5 Jan 2024

Contents
1 Introduction
2 Pre-Training
  2.1 Data
  2.2 Architecture
  2.3 Hyperparameters
  2.4 Infrastructures
3 Scaling Laws
  3.1 Scaling Laws for Hyperparameters
  3.2 Estimating Optimal Model and Data Scaling
  3.3 Scaling Laws with Different Data
4 Alignment
5 Evaluation
  5.1 Public Benchmark Evaluation
    5.1.1 Base Model
    5.1.2 Chat Model
  5.2 Open-Ended Evaluation
    5.2.1 Chinese Open-Ended Evaluation
    5.2.2 English Open-Ended Evaluation
  5.3 Held-Out Evaluation
  5.4 Safety Evaluation
  5.5 Discussion
6 Conclusion, Limitation, and Future Work
A Appendix
  A.1 Acknowledgments
  A.2 Different Model Scale Representations
  A.3 Benchmark Metrics Curves
  A.4 Comparison with Code or Math Specific Models
  A.5 Benchmark Results w/ DPO Stage
  A.6 Evaluation Formats
1. Introduction
Over the past few years, Large Language Models (LLMs) based on decoder-only Transformers (Vaswani et al., 2017) have increasingly become the cornerstone and pathway to achieving Artificial General Intelligence (AGI). By predicting the next word in continuous text, LLMs undergo self-super
12、vised pre-training on massive datasets,enabling them to achieve various purposes andpossess many abilities,such as novel creation,text summarization,code completion,and more.Subsequent developments like supervised fine-tuning and reward modeling have enabled LargeLanguage Models(LLMs)to better follo
13、w user intentions and instructions.This has endowedthem with more versatile conversational capabilities and rapidly expanded their influence.This wave is sparked withclosed products,such as ChatGPT(OpenAI,2022),Claude(An-thropic,2023),and Bard(Google,2023),which are developed with extensive computat
14、ionalresources and substantial annotation costs.These products have significantly raised the commu-nitys expectations for the capabilities of open-source LLMs,consequently inspiring a series ofwork(Bai et al.,2023;Du et al.,2022;Jiang et al.,2023;Touvron et al.,2023a,b;Yang et al.,2023).Among these,
15、the LLaMA series models(Touvron et al.,2023a,b)stand out.It consolidates arange of works to create an efficient and stable architecture,building well-performing modelsranging from 7B to 70B parameters.Consequently,the LLaMA series has become the de factobenchmark for architecture and performance amo
16、ng open-source models.Following LLaMA,the open-source community has primarily focused on training fixed-size(7B,13B,34B,and 70B),high-quality models,often neglecting research exploration into LLMscaling laws(Hoffmann et al.,2022;Kaplan et al.,2020).Nonetheless,research on scaling laws isof utmost im
17、portance,considering that the current open-source models are merely at the initialstage of Artificial General Intelligence(AGI)development.In addition,early works(Hoffmannet al.,2022;Kaplan et al.,2020)reached varying conclusions on the scaling of model and datawith increased compute budgets and ina
18、dequately addressed hyperparameter discussions.Inthis paper,we extensively investigate the scaling behavior of language models and apply ourfindings in two widely used large-scale model configurations,namely 7B and 67B.Our studyaims to lay the groundwork for future scaling of open-source LLMs,paving
19、 the way for furtheradvancements in this domain.Specifically,we first examined the scaling laws of batch sizeand learning rate,and found their trends with model size.Building on this,we conducted acomprehensive study of the scaling laws of the data and model scale,successfully revealing theoptimal m
20、odel/data scaling-up allocation strategy and predicting the expected performanceof our large-scale models.Additionally,during development,we discovered that the scalinglaws derived from different datasets show significant differences.This suggests that choiceof dataset remarkably affects the scaling
21、 behavior,indicating that caution should be exercisedwhen generalizing scaling laws across datasets.Under the guidance of our scaling laws,we build from scratch open-source large languagemodels,and release as much information as possible for community reference.We collect2 trillion tokens for pre-tr
22、aining,primarily in Chinese and English.At the model level,wegenerally followed the architecture of LLaMA,but replaced the cosine learning rate schedulerwith a multi-step learning rate scheduler,maintaining performance while facilitating continualtraining.We collected over 1 million instances for su
23、pervised fine-tuning(SFT)(Ouyang et al.,2022)from diverse sources.This paper shares our experiences with different SFT strategiesand findings in data ablation techniques.Additionally,we have utilized direct preferenceoptimization(DPO)(Rafailov et al.,2023)to improve the conversational performance of
the model.
We conduct extensive evaluations using our base and chat models. The evaluation results demonstrate that DeepSeek LLM surpasses LLaMA-2 70B across various benchmarks, particularly in the fields of code, mathematics, and reasoning. Following SFT and DPO, the DeepSeek 67B chat model outperforms GP
25、T-3.5 in both Chinese and English open-ended evaluations.Thishighlights the superior performance of DeepSeek 67B in generating high-quality responses andengaging in meaningful conversations in both languages.Furthermore,the safety evaluationindicates that DeepSeek 67B Chat can provide harmless respo
26、nses in practice.In the rest of this paper,we first introduce our pre-training basic concepts of DeepSeekLLM in Section 2,including the composition of data,model architecture,infrastructure,andhyperparameters.In Section 3,we provide a detailed explanation of the scaling laws we havediscovered and it
27、s implications.Additionally,we discuss the rationale behind our selection ofpre-training hyperparameters,taking into account the insights gained from the scaling lawsanalysis.In Section 4,we discuss our fine-tuning methodology,encompassing the compositionof fine-tuning data and specific methods duri
28、ng the SFT and DPO stages.We then presentthe detailed evaluation results of DeepSeek LLM in Section 5,covering both the base and chatmodels,as well as their performance in open-ended evaluations and safety evaluations.Finally,we discuss the current limitations and future directions of DeepSeek LLM i
n Section 6.

2. Pre-Training
2.1. Data
Our main objective is to comprehensively enhance the richness and diversity of the dataset. We have gained valuable insights from reputable sources such as (Computer, 2023; Gao et al., 2020; Penedo et al., 2023; Touvron et al., 2023a). To achieve these goals, we have organized our approach into three essential stages: deduplication, filtering, and remixing. The deduplication and remixing stages ensure a diverse representation of the data by sampling unique instances. The filtering stage enhances the density of information, thereby enabling more efficient and effective model training.
We adopted an aggressive deduplication strategy, expanding the deduplication scope. Our analysis revealed that deduplicating the entire Common Crawl corpus results in higher removal of duplicate instances compared to deduplicating within a single dump. Table 1 illustrates that deduplicating across 91 dumps eliminates four times more documents than the single-dump method.

Dumps Used | 1 | 2 | 6 | 12 | 16 | 22 | 41 | 91
Deduplication Rate (%) | 22.2 | 46.7 | 55.7 | 69.9 | 75.7 | 76.3 | 81.6 | 89.8
Table 1 | Deduplication ratios for various Common Crawl dumps.

In the filtering stage, we focus on developing robust criteria for document quality assessment. This involves a detailed analysis incorporating both linguistic and semantic evaluations, providing a view of data quality from individual and global perspectives. In the remixing phase, we adjust our approach to address data imbalances, focusing on increasing the presence of underrepresented domains. This adjustment aims to achieve a more balanced and inclusive dataset, ensuring that diverse perspectives and information are adequately represented.
For our tokenizer, we implemented the Byte-level Byte-Pair Encoding (BBPE) algorithm based on the tokenizers library (Huggingface Team, 2019). Pre-tokenization was employed to prevent the merging of tokens from different character categories, such as new lines, punctuation, and Chinese-Japanese-Korean (CJK) symbols, similar to GPT-2 (Radford et al., 2019). We also chose to split numbers into individual digits following the approach used in (Touvron et al., 2023a,b). Based on our prior experience, we set the number of conventional tokens in the vocabulary at 100000. The tokenizer was trained on a multilingual corpus of approximately 24 GB, and we augmented the final vocabulary with 15 special tokens, bringing the total size to 100015. To ensure computational efficiency during training and to reserve space for any additional special tokens that might be needed in the future, we configured the model's vocabulary size to 102400 for training.
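To make the tokenizer recipe concrete, the following is a minimal sketch, not DeepSeek's released code, of training a byte-level BPE tokenizer with the HuggingFace tokenizers library cited above; the corpus path and special-token names are hypothetical placeholders, and the pre-tokenization rules are simplified relative to the paper's description.

```python
# Minimal sketch: byte-level BPE with digit splitting, using the `tokenizers` library.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),    # split numbers into single digits
    pre_tokenizers.ByteLevel(add_prefix_space=False),  # GPT-2-style byte-level pre-tokenization
])
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=100000,                                           # conventional tokens
    special_tokens=["<|begin_of_text|>", "<|end_of_text|>"],     # hypothetical names
)
tokenizer.train(["multilingual_corpus.txt"], trainer=trainer)    # placeholder corpus file
tokenizer.save("bbpe_tokenizer.json")
# The paper then adds 15 special tokens (total 100015) and pads the model's
# vocabulary to 102400 rows for efficiency and future special tokens.
```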
2.2. Architecture

Params | n_layers | d_model | n_heads | n_kv_heads | Context Length | Sequence Batch Size | Learning Rate | Tokens
7B | 30 | 4096 | 32 | 32 | 4096 | 2304 | 4.2e-4 | 2.0T
67B | 95 | 8192 | 64 | 8 | 4096 | 4608 | 3.2e-4 | 2.0T
Table 2 | Detailed specs of the DeepSeek LLM family of models. We choose the hyperparameters based on our findings in Section 3.

The micro design of DeepSeek LLM largely follows the design of LLaMA (Touvron et al., 2023a,b), adopting a Pre-Norm structure with the RMSNorm (Zhang and Sennrich, 2019) function and using SwiGLU (Shazeer, 2020) as the activation function for the Feed-Forward Network (FFN), with an intermediate layer dimension of 8/3 d_model. It also incorporates Rotary Embedding (Su et al., 2024) for positional encoding. To optimize inference cost, the 67B model uses Grouped-Query Attention (GQA) (Ainslie et al., 2023) instead of the traditional Multi-Head Attention (MHA).
However, in terms of macro design, DeepSeek LLM differs slightly. Specifically, DeepSeek LLM 7B is a 30-layer network, while DeepSeek LLM 67B has 95 layers. These layer adjustments, while maintaining parameter consistency with other open-source models, also facilitate model pipeline partitioning to optimize training and inference.
Unlike most works using Grouped-Query Attention (GQA), we expanded the 67B model's parameters in network depth rather than the common practice of widening the intermediate width of FFN layers, aiming for better performance. Detailed network specifications can be found in Table 2.
2.3. Hyperparameters
DeepSeek LLM is initialized with a standard deviation of 0.006 and trained using the AdamW optimizer (Loshchilov and Hutter, 2017), with the following hyperparameters: beta_1 = 0.9, beta_2 = 0.95, and weight_decay = 0.1.
A multi-step learning rate scheduler is employed during pre-training instead of the typical cosine scheduler. Specifically, the learning rate of the model reaches its maximum value after 2000 warmup steps, and then decreases to 31.6% of the maximum value after processing 80% of the training tokens. It further reduces to 10% of the maximum value after 90% of the tokens. The gradient clipping during the training phase is set to 1.0.

Figure 1 | Training loss curves with different learning rate schedulers or different parameters for schedulers. Panels: (a) multi-step vs. cosine learning rate decay; (b) different proportions of multi-step stages. The model size is 1.6 billion parameters, trained on a dataset of 100 billion tokens.

Based on our empirical findings, we observed that despite differences in the loss reduction trend during training, the final performance using a multi-step learning rate scheduler is essentially consistent with that of a cosine scheduler, as shown in Figure 1(a). When adjusting the training scale while keeping the model size fixed, the multi-step learning rate scheduler allows for the reuse of training from the first phase, offering a unique convenience for continual training. Therefore, we chose the multi-step learning rate scheduler as our default setting.
We also demonstrate in Figure 1(b) that adjusting the proportions of different stages in the multi-step learning rate scheduler can yield slightly better performance. However, for the sake of balancing reuse ratios in continual training and model performance, we opted for the aforementioned distribution of 80%, 10%, and 10% for the three stages, respectively.
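As an illustration of the schedule described above, here is a minimal sketch of the stage logic; the function and its warmup shape (assumed linear here) are our own, not the training framework's actual scheduler.

```python
# Minimal sketch of the multi-step learning rate schedule: 2000 warmup steps, then
# the peak value until 80% of the training tokens, 31.6% until 90%, and 10% after.
def multi_step_lr(step: int, tokens_seen: int, total_tokens: int,
                  peak_lr: float, warmup_steps: int = 2000) -> float:
    if step < warmup_steps:                   # warmup to the peak value (assumed linear)
        return peak_lr * (step + 1) / warmup_steps
    progress = tokens_seen / total_tokens
    if progress < 0.8:                        # stage 1: constant at the peak
        return peak_lr
    if progress < 0.9:                        # stage 2: 31.6% of the peak (~1/sqrt(10))
        return 0.316 * peak_lr
    return 0.1 * peak_lr                      # stage 3: 10% of the peak
```

Because the first stage runs at a constant peak rate, a checkpoint taken before the 80% boundary can be reused directly when the training token budget is enlarged, which is the continual-training convenience the paper refers to.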
The batch size and learning rate vary with the model size. Specific parameters for the pre-training phases of the 7B and 67B models can be found in Table 2.

2.4. Infrastructures
We use an efficient and light-weight training fra
49、mework named HAI-LLM(High-flyer,2023)to train and evaluate large language models.Data parallelism,tensor parallelism,sequenceparallelism,and 1F1B pipeline parallelism are integrated into this framework as done in Mega-tron(Korthikanti et al.,2023;Narayanan et al.,2021;Shoeybi et al.,2019).We also le
50、verage theflash attention(Dao,2023;Dao et al.,2022)technique to improve hardware utilization.ZeRO-1(Rajbhandari et al.,2020)is exploited to partition optimizer states over data parallel ranks.Ef-forts are also made to overlap computation and communication to minimize additional waitingoverhead,inclu
51、ding the backward procedure of the last micro-batch and reduce-scatter oper-ation in ZeRO-1,and GEMM computation and all-gather/reduce-scatter in sequence parallel.Some layers/operators are fused to speed up training,including LayerNorm,GEMM wheneverpossible,and Adam updates.To improve model trainin
52、g stability,we train the model in bf16precision but accumulate gradients in fp32 precision.In-place cross-entropy is performed toreduce GPU memory consumption,i.e.:we convert bf16 logits to fp32 precision on the fly inthe cross-entropy CUDA kernel(instead of converting it beforehand in HBM),calculat
53、e thecorresponding bf16 gradient,and overwrite logits with its gradient.Model weights and optimizer states are saved every 5 minutes asynchronously,whichmeans we will lose no more than 5 minutes of training in the worst case of occasional hardwareor network failures.These temporary model checkpoints
are cleared up regularly to avoid consuming too much storage space. We also support resuming training from a different 3D parallel configuration to cope with dynamic changes in computing cluster load. As for evaluation, we employ vLLM (Kwon et al., 2023) in generative tasks, and continuous batching in non-g
enerative tasks to avoid manual batch size tuning and reduce token padding.

3. Scaling Laws
Research on scaling laws (Hestness et al., 2017) predates the emergence of large language models. Scaling laws (Henighan et al., 2020; Hoffmann et al., 2022; Kaplan et al., 2020) suggest that model performance can be predictably improved with increases in compute budget C, model scale N, and data scale D. When the model scale N is represented by model parameters and the data scale D by the number of tokens, C can be approximated as C = 6ND. Therefore, how to optimize the allocation between model and data scales when increasing the compute budget is als
57、o a crucialresearch objective in scaling laws.The development of LLMs(Dai et al.,2019;Radford et al.,2019),with larger models achievingunexpected and significant performance improvements,has brought scaling laws research to anew peak.Results in scaling laws demonstrate that expanding the compute bud
58、get continues toyield significant benefits,which further encourages the increase in model scales(Brown et al.,2020;Smith et al.,2022).However,as shown in Table 4,early works(Hoffmann et al.,2022;Kaplan et al.,2020)onthe optimal model/data scaling-up allocation strategy have shown varying conclusions
59、,raisingdoubts about the general applicability of scaling laws.Moreover,these studies often lacked acomplete description of hyperparameter settings,leaving it uncertain whether models underdifferent compute budgets reached optimal performance.Therefore,we revisit scaling laws inthis section to addre
60、ss these uncertainties and ensure we are on the right path to efficiently scale-up compute,which reflects the long-term perspective and is key to developing continuouslyimproving models.To ensure that models under different compute budgets can achieve optimal performance,we first studied the scaling
61、 laws of hyperparameters.Empirically,it has been observed thatthe optimal values of most parameters during training do not change when varying computebudgets.Therefore,these parameters are consistent with those outlined in Section 2.3 and remainunchanged across different compute budgets.However,the
62、hyperparameters that have themost significant impact on performance,namely batch size and learning rate,were re-examined.Early works(Goyal et al.,2017;McCandlish et al.,2018;Shallue et al.,2019;Smith et al.,2017;Zhang et al.,2019)provided some empirical observations for setting batch size and learni
63、ngrate,but we found these observations have limited applicability in our preliminary experiments.Through extensive experiments,we modeled the power law relationship between the computebudgetand the optimal batch size and learning rate.This relationship,which we refer to as thescaling laws of hyperpa
64、rameters,provides an empirical framework for determining the optimalhyperparameters.This methodology ensures that models across different compute budgets canreach their near-optimal performance.We then study the scaling laws of the model and data scales.To reduce experimental costsand fitting diffic
ulties, we adopted the IsoFLOP profile approach from Chinchilla (Hoffmann et al., 2022) to fit the scaling curve. To represent the model scale more accurately, we utilized a new model scale representation, non-embedding FLOPs/token M, replacing the earlier-used model parameters N, and substituted the approximate compute budget formula C = 6ND with the more precise C = MD. The experimental results provided insights into the optimal model/data scaling-up allocation strategy and performance predictions, and also accurately forecasted the expected performance of DeepSeek LLM 7B and 67B models. Additionally, in the process of explo
67、ring scaling laws,the data we used underwent multipleiterations,continually improving in quality.We attempted to fit the scaling curve on variousdatasets and found that the data quality significantly influences the optimal model/data scaling-up allocation strategy.The higher the data quality,the mor
68、e the increased compute budgetshould be allocated to model scaling.This implies that high-quality data can drive the training oflarger models given the same data scale.The differences in the optimal model/data scaling-upallocation strategy may also serve as an indirect approach to assess the quality
of data. We will continue to pay close attention to the changes in data quality and its impact on scaling laws, and provide more analysis in future works.
In summary, our contributions and findings in scaling laws can be summarized as follows:
• We established the scaling laws for hyperparameters, providing an empirical framework for determining the optimal hyperparameters.
• Instead of model parameters N, we adopt non-embedding FLOPs/token M to represent the model scale, leading to a more accurate optimal model/data scaling-up allocation strategy and a better prediction of generalization loss for large-scale models.
• The quality of pre-training data impacts the optimal model/data scaling-up allocation strategy. The higher the data quality, the more the increased compute budget should be allocated to model scaling.

3.1. Scaling Laws for Hyperparameters
We initially conducted a grid search for batch size and learning rate on small-scale experiments with a compute budget of 1e17, and the results for a specific model size (177M FLOPs/token) are illustrated in Figure 2(a). The results demonstrate that the generalization error remains stable across a wide range of choices of batch sizes and learning rates. This indicates that near-optimal performance can be achieved within a relatively wide parameter space.

Figure 2 | Training loss w.r.t. batch size and learning rate with 1e17 and 1e20 FLOPs. Panels: (a) 1e17 FLOPs (177M FLOPs/token); (b) 1e20 FLOPs (2.94B FLOPs/token).

Then, we utilized the aforementioned multi-step learning rate scheduler to effectively train multiple models with different batch sizes, learning rates, and compute budgets ranging from 1e17 to 2e19 by reusing the first stage. Considering the redundancy in the parameter space, we regarded the parameters used by models whose generalization error exceeded the minimum by no more than 0.25% as near-optimal hyperparameters. We then fitted the batch size B and learning rate eta with respect to the compute budget C. The fitting results, as shown in Figure 3, reveal that the optimal batch size B gradually increases with the increase in compute budget C, while the optimal learning rate eta gradually decreases. This is in line with the intuitive empirical settings for batch size and learning rate when scaling up models. Moreover, all near-optimal hyperparameters fall within a broad band range, indicating that it is relatively easy to choose near-optimal parameters within this interval. The final formulae we fitted for batch size and learning rate are as follows:

eta_opt = 0.3118 * C^(-0.1250)
B_opt   = 0.2920 * C^(0.3271)        (1)

Figure 3 | Scaling curves of batch size and learning rate. The grey circles represent models whose generalization error exceeded the minimum by no more than 0.25%. The dotted line represents the power law fitting the smaller models. The blue stars represent DeepSeek LLM 7B and 67B.
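As a sanity check of Formula (1), the sketch below evaluates the fitted power laws at a back-of-the-envelope estimate of the 7B run's compute budget (C = M*D with M roughly 4.3e10 non-embedding FLOPs/token and D = 2e12 tokens, our rough figures rather than numbers quoted in the paper); the results land close to the 4.2e-4 learning rate and 2304-sequence batch size in Table 2, which suggests the fitted batch size is measured in tokens.

```python
# Minimal sketch applying Formula (1): near-optimal learning rate and batch size
# as power laws of the compute budget C (in FLOPs).
def near_optimal_hyperparameters(compute_budget: float):
    eta_opt = 0.3118 * compute_budget ** (-0.1250)   # learning rate shrinks as C grows
    batch_opt = 0.2920 * compute_budget ** 0.3271    # batch size (apparently in tokens) grows with C
    return eta_opt, batch_opt

if __name__ == "__main__":
    c = 4.3e10 * 2e12                                # our rough estimate of the 7B budget (~8.6e22)
    eta, batch_tokens = near_optimal_hyperparameters(c)
    print(f"learning rate ~= {eta:.2e}")                              # ~4.2e-4
    print(f"batch size ~= {batch_tokens / 4096:.0f} sequences of 4096 tokens")  # ~2300
```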
We validated our formulae on a series of models with a 1e20 compute budget, and the results for a specific model size (2.94B FLOPs per token) are shown in Figure 2(b). The results indic
79、atethat the fitted parameters are centered in the optimal parameter space.Subsequent sections alsoshow that the parameters we fitted for DeepSeek LLM 7B and 67B models similarly achievedgood performance.However,its important to note that we have not yet considered the impact of factors beyondthe com
80、pute budgeton the optimal hyperparameters.This is inconsistent with some earlierworks(Kaplan et al.,2020;McCandlish et al.,2018)which suggested that the optimal batch sizecan be modeled as being solely related to the generalization error.Furthermore,we observedthat in models with the same compute bu
dget but different model/data allocations, the optimal parameter space varies slightly. This suggests that further research is needed to understand the selection of hyperparameters and training dynamics. We will explore these aspects in future works.

3.2. Estimating Optimal Model and Data Scaling
After deriving the formulae for fitting near-optimal hyperparameters, we started fitting the scaling curve and analyzing the optimal model/data scaling-up allocation strategy. This strategy involves finding the model scaling exponent a and the data scaling exponent b that satisfy M_opt ∝ C^a and D_opt ∝ C^b, respectively. The data scale D can be consistently represented by the number of tokens in the dataset. In previous works, the model scale was typically represented by model parameters, with non-embedding parameters N1 (Kaplan et al., 2020) and complete parameters N2 (Hoffmann et al., 2022). The relationship between compute budget C and model/data scale could be approximately described as C = 6ND, meaning we could use 6N1 or 6N2 to approximate the model scale. However, since both 6N1 and 6N2 do not account for the computational overhead of the attention operation, and 6N2 also includes the vocabulary computation, which contributes less to the model's capacity, they both have significant approximation errors under certain settings.
To mitigate these errors, we introduced a new model scale representation: non-embedding FLOPs/token M. M includes the computational overhead of the attention operation but does not take into account the vocabulary computation. With the model scale represented by M, the compute budget C can be simply expressed as C = MD. The specific differences between 6N1, 6N2, and M are as shown in the following formulae:

6N1 = 72 n_layer d_model^2
6N2 = 72 n_layer d_model^2 + 6 n_vocab d_model
M   = 72 n_layer d_model^2 + 12 n_layer d_model l_seq        (2)

where n_layer represents the number of layers, d_model represents the model width, n_vocab is the vocabulary size, and l_seq is the sequence length. We assessed the differences between these three representations across models of varying scales, as shown in Table 3. The results indicate that both 6N1 and 6N2 either overestimate or underestimate the computational cost in models of different scales. This discrepancy is particularly pronounced in small-scale models, with differences reaching up to 50%. Such inaccuracies can introduce substantial statistical errors when fitting the scaling curve. Please refer to Appendix A.2 for further analysis regarding different representations of model scale.

n_layers | d_model | N1 | N2 | M | 6N1/M | 6N2/M
8 | 512 | 25.2M | 77.6M | 352M | 0.43 | 1.32
12 | 768 | 84.9M | 164M | 963M | 0.53 | 1.02
24 | 1024 | 302M | 407M | 3.02B | 0.60 | 0.81
24 | 2048 | 1.21B | 1.42B | 9.66B | 0.75 | 0.88
32 | 4096 | 6.44B | 6.86B | 45.1B | 0.85 | 0.91
40 | 5120 | 12.6B | 13.1B | 85.6B | 0.88 | 0.92
80 | 8192 | 64.4B | 65.3B | 419B | 0.92 | 0.94
Table 3 | Difference in model scale representations and disparities of non-embedding parameters N1 and complete parameters N2 relative to non-embedding FLOPs/token M (n_vocab = 102400 and l_seq = 4096 for all rows).
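For reference, a minimal sketch of Formula (2) that reproduces one row of Table 3; the helper below is our illustration and assumes the default n_vocab = 102400 and l_seq = 4096.

```python
# Minimal sketch: the three model-scale representations of Formula (2).
def scale_representations(n_layer: int, d_model: int,
                          n_vocab: int = 102400, l_seq: int = 4096):
    """Return (6*N1, 6*N2, M) for a dense Transformer configuration."""
    six_n1 = 72 * n_layer * d_model ** 2                              # 6 * non-embedding params
    six_n2 = six_n1 + 6 * n_vocab * d_model                           # 6 * complete params
    m = 72 * n_layer * d_model ** 2 + 12 * n_layer * d_model * l_seq  # non-embedding FLOPs/token
    return six_n1, six_n2, m

if __name__ == "__main__":
    six_n1, six_n2, m = scale_representations(n_layer=32, d_model=4096)
    print(f"N1 = {six_n1 / 6 / 1e9:.2f}B, N2 = {six_n2 / 6 / 1e9:.2f}B, M = {m / 1e9:.1f}B")
    print(f"6N1/M = {six_n1 / m:.2f}, 6N2/M = {six_n2 / m:.2f}")  # ~0.85 and ~0.91, as in Table 3
```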
After adopting M to represent the model scale, our objective could be described more clearly as: given a computing budget C = MD, find the optimal model scale M_opt and data scale D_opt that minimize the generalization error of the model. This target could be formalized as:

M_opt(C), D_opt(C) = argmin_{M, D s.t. C = MD} L(M, D)        (3)

To reduce experimental costs and fitting difficulties, the IsoFLOP profile approach from Chinchilla (Hoffmann et al., 2022) was used to fit the scaling curve. We selected 8 different compute budgets ranging from 1e17 to 3e20, and designed around 10 different model/data scale allocations for each budget. The hyperparameters for each budget were determined by Formula (1), and the generalization error was calculated on an independent validation set, distributed similarly to the training set and containing 100M tokens.

Figure 4 | IsoFLOP curve and optimal model/data allocation. Panels: (a) IsoFLOP curve; (b) optimal model scaling; (c) optimal data scaling. The metric in the IsoFLOP curve is bits-per-byte on the validation set. The dotted lines in the optimal model/data scaling curves represent the power law fitting the smaller models (grey circles).

Figure 4 demonstrates the IsoFLOP curve and model/data scaling curves, which are fitted by using the optimal model/data allocation for each compute budget. The specific formulae for the optimal non-embedding FLOPs/token M_opt and optimal tokens D_opt are as follows:

M_opt = M_base * C^a,  M_base = 0.1715,  a = 0.5243
D_opt = D_base * C^b,  D_base = 5.8316,  b = 0.4757        (4)
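A minimal sketch applying Formula (4); the compute budget used in the example is an arbitrary illustrative value, not one of the budgets used in the paper.

```python
# Minimal sketch: optimal model/data allocation for a given compute budget C (FLOPs).
def optimal_allocation(compute_budget: float):
    m_opt = 0.1715 * compute_budget ** 0.5243   # optimal model scale (non-embedding FLOPs/token)
    d_opt = 5.8316 * compute_budget ** 0.4757   # optimal data scale (tokens)
    return m_opt, d_opt

if __name__ == "__main__":
    c = 1e21                                     # hypothetical compute budget
    m_opt, d_opt = optimal_allocation(c)
    print(f"M_opt = {m_opt:.3e} FLOPs/token, D_opt = {d_opt:.3e} tokens")
    print(f"M_opt * D_opt = {m_opt * d_opt:.3e}")  # exponents sum to 1, so the product recovers ~C
```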
Additionally, we fitted the loss scaling curve according to the compute budget C and the optimal generalization error, and predicted the generalization error for DeepSeek LLM 7B and 67B, as shown in Figure 5. The results indicate that using small-scale experiments can accurately predict the performance of models with 1000x compute budget. This provides both confidence and guidance for training models on a larger scale.

Figure 5 | Performance scaling curve. The metric is the bits-per-byte on the validation set. The dotted line represents the power law fitting the smaller models (grey circles). The blue stars represent DeepSeek LLM 7B and 67B. Their performance is well-predicted by the scaling curve.

3.3. Scaling Laws with Different Data
In the development process of DeepSeek LLM, the dataset was iteratively refined multiple times, with adjustments in the proportions of different data sources while enhancing the overall quality. This allowed us to further analyze the impact of different datasets on scaling laws.
We studied the scaling laws using three different datasets: early in-house data, current in-house data, and OpenWebText2, which was utilized in the previous study of scaling laws (Kaplan et al., 2020). Our internal data assessment revealed that current in-house data has higher data quality than early in-house data. Furthermore, the quality of OpenWebText2 even surpasses the current in-house data, due to its smaller scale which allows for more meticulous processing.

Approach | Coeff. a where M_opt ∝ C^a | Coeff. b where D_opt ∝ C^b
OpenAI (OpenWebText2) | 0.73 | 0.27
Chinchilla (MassiveText) | 0.49 | 0.51
Ours (Early Data) | 0.450 | 0.550
Ours (Current Data) | 0.524 | 0.476
Ours (OpenWebText2) | 0.578 | 0.422
Table 4 | Coefficients of model scaling and data scaling vary with training data distribution.

An interesting observation from the analysis is that the optimal model/data scaling-up allocation strategy across these three datasets showed consistency with data quality. As illustrated in Table 4, as data quality improves, the model scaling exponent a gradually increases, while the data scaling exponent b decreases, which suggests that the increased compute budget should be allocated more to the model instead of the data. This finding might also explain
102、the significantdifferences in optimal model/data scaling-up allocation observed in earlier studies of scalinglaws.An intuitive speculation for this finding is that high-quality data usually implies logicalclarity and less predictive difficulty after sufficient training.Therefore,its more advantageou
s to scale up the model size when increasing compute budget. We will continue to pay close attention to the changes in data quality and its impact on scaling laws, and provide more analysis in future works.

4. Alignment
We collect around 1.5 million instruction data instances in English and Chinese, covering
104、 a widerange of helpfulness and harmlessness topics.Our helpful data contains 1.2 million instances,with a distribution of 31.2%for general language tasks,46.6%for mathematical problems,and22.2%for coding exercises.The safety data consists of 300K instances,covering various sensitivetopics.Our align
ment pipeline contains two stages.
Supervised Fine-Tuning: We fine-tuned our 7B model with 4 epochs, but only 2 epochs for the 67B model, since we observed that the overfitting problem is serious on the 67B model. We observed that GSM8K (Cobbe et al., 2021) and HumanEval (Chen et al., 2021) are improved consistently
106、 for the 7B model,while the 67B model hits the upper bound soon.The learning rateis 1e-5 and 5e-6 for 7B and 67B models,respectively.In addition to monitoring the benchmarkaccuracy,we also assess the repetition ratio of a chat model during the fine-tuning process.We gathered a total of 3868 Chinese
107、and English prompts and determined the proportion ofgenerated responses that fail to terminate and instead endlessly repeat a sequence of text.Weobserved that the repetition ratio tends to rise as the quantity of math SFT data increases.This can be attributed to the fact that math SFT data occasiona
108、lly includes similar patterns inreasoning.Consequently,weaker models struggle to grasp such reasoning patterns,resulting inrepetitive responses.To tackle the problem,we tried two-stage fine-tuning and DPO(Rafailovet al.,2023),both of which could almost keep the benchmark score and reduce the repetit
ion significantly.
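As a rough illustration of how the repetition ratio above can be measured, the heuristic below flags generations that hit the length limit without terminating and whose tail keeps cycling through the same character n-gram; this is our own simplified criterion, not the exact one used in the paper.

```python
# Minimal sketch: flag non-terminating, looping generations and compute their share.
def looks_repetitive(text: str, n: int = 20, min_repeats: int = 4) -> bool:
    tail = text[-n * min_repeats:]
    chunk = tail[-n:]
    return chunk * min_repeats == tail            # the last n characters repeat back-to-back

def repetition_ratio(responses, hit_length_limit) -> float:
    bad = sum(1 for resp, truncated in zip(responses, hit_length_limit)
              if truncated and looks_repetitive(resp))
    return bad / max(len(responses), 1)
```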
DPO: To further enhance the model's ability, we used the direct preference optimization algorithm (Rafailov et al., 2023), which is proven to be a simple but effective method for LLM alignment. We constructed the preference data for DPO training in terms of helpfulness and harmlessness. For helpfulness data, we collected multilingual prompts, which cover categories including creative writing, question answering, instruction following, and so on. Then we generated responses using our DeepSeek Chat models as response candidates. Similar operations are applied to harmlessness preference data constru
ction. We trained an epoch for DPO, with a learning rate of 5e-6 and a batch size of 512, and we used a learning rate warmup and cosine learning rate scheduler. We found that DPO can strengthen the model's open-ended generation skill, while engendering little difference in performance among standard benchmarks.
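For completeness, a minimal sketch of the DPO objective (Rafailov et al., 2023) on a batch of preference pairs; the log-probabilities are dummy values and beta is an illustrative choice, not a hyperparameter reported in the paper.

```python
# Minimal sketch of the DPO loss: maximize the reward margin between chosen and
# rejected responses, with rewards defined against a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy summed log-probabilities for a batch of two preference pairs:
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-15.0, -9.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-14.0, -9.1]))
```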
5. Evaluation
5.1. Public Benchmark Evaluation
We evaluate our models on a series of public benchmarks both in English and Chinese, based on our internal evaluation framework.
• Multi-subject multiple-choice datasets including MMLU (Hendrycks et al., 2020), C-Eval (Huang et al., 2023) and CMMLU (Li et al., 2023).
• Language understanding and reasoning datasets including HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018) and BigBench Hard (BBH) (Suzgun et al., 2022).
• Closed-book question answering datasets including TriviaQA (Joshi et al., 2017) and NaturalQuestions (Kwiatkowski et al., 2019).
• Reading comprehension datasets including RACE (Lai et al., 2017), DROP (Dua et al., 2019) and C3 (Sun et al., 2019).
• Reference disambiguation datasets including WinoGrande (Sakaguchi et al., 2019) and CLUEWSC (Xu et al., 2020).
• Language modeling datasets including Pile (Gao et al., 2020).
• Chinese understanding and culture datasets including CHID (Zheng et al., 2019) and CCPM (Li et al., 2021).
• Math datasets including GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021) and CMath (Wei et al., 2023).
• Code datasets including HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021).
• Standardized exams including AGIEval (Zhong et al., 2023).
We apply perplexity-based evaluation to datasets that require answers to be chosen from several options. These datasets include HellaSwag, PIQA, WinoGrande, RACE-Middle, RACE-High, MMLU, ARC-Easy, ARC-Challenge, OpenBookQA, CHID, C-Eval, CMMLU, C3 and CCPM. The perplexity-based evaluation here refers to calculating the perplexity of each option and selecting the lowest one as the model prediction. For ARC and OpenBookQA, we calculate the perplexity with unconditional normalization (Brown et al., 2020), and for other datasets we use length normalization.
We apply generation-based evaluation for TriviaQA, NaturalQuestions, DROP, MATH, GSM8K, HumanEval, MBPP, BBH, AGIEval, CLUEWSC, and CMath. The generation-based evaluation here refers to letting the model generate free texts and parsing results from generated texts. For generation-based evaluation, we use greedy decoding.
We apply language-modeling-based evaluation for Pile-test, which means calculating the bits-per-byte on the test corpus.
We use 2048 or 4096 as the maximum sequence length for different benchmarks. Details of evaluation formats can be found in Appendix A.6.
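A minimal sketch of the perplexity-based multiple-choice evaluation described above, assuming a HuggingFace-style causal language model and tokenizer; length normalization is shown, while the unconditional normalization used for ARC and OpenBookQA is omitted for brevity.

```python
# Minimal sketch: score each option by the model's loss on its tokens and pick the lowest.
import torch

@torch.no_grad()
def pick_option(model, tokenizer, context: str, options: list[str],
                length_normalize: bool = True) -> int:
    scores = []
    for opt in options:
        ids = tokenizer(context + opt, return_tensors="pt").input_ids   # (1, T)
        n_ctx = len(tokenizer(context).input_ids)
        logits = model(ids).logits[:, :-1]                              # predict token t+1 from t
        targets = ids[:, 1:]
        losses = torch.nn.functional.cross_entropy(
            logits.transpose(1, 2), targets, reduction="none")[0, n_ctx - 1:]
        score = losses.mean() if length_normalize else losses.sum()
        scores.append(score.item())
    return min(range(len(options)), key=scores.__getitem__)             # lowest loss wins
```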
5.1.1. Base Model
Table 5 presents the main results on the evaluation benchmarks. Although DeepSeek models are pre-trained on a 2T bilingual corpus, they show comparable performance on English language understanding benchmarks with LLaMA2 models, which also consume 2T tokens but focus on English. Furthermore, DeepSeek 67B achieves considerably better performance on MATH, GSM8K, HumanEval, MBPP, BBH, and Chinese benchmarks compared to LLaMA2 70B. We show the benchmark curves in Appendix A.3. We can see that some task performance is boosted as the model scales up, such as GSM8K and BBH. Given that we train both 7B and 67B on the same dataset, the emergence of this improvement can be attributed to the powerful few-shot learning ability of large models. However, as the proportion of mathematical data increases, the disparity between small and large models may diminish.
An interesting observation is that the advantage of DeepSeek 67B over LLaMA2 70B is larger than that of DeepSeek 7B over LLaMA2 7B. This phenomenon highlights the greater influence of language conflict on smaller models. Additionally, LLaMA2 demonstrates impressive performance on certain Chinese tasks, such as CMath, despite not being specifically trained on Chinese data. This suggests that certain fundamental abilities, such as mathematical reasoning, can be effectively transferred across languages. However, tasks like CHID, which involve evaluating the usage of Chinese idioms, require the model to consume a significant number of Chinese tokens during pre-training. In this case, LLaMA2 significantly underperforms compared to DeepSeek
LLM.

Language | Benchmark | Test-shots | LLaMA2 7B | DeepSeek 7B | LLaMA2 70B | DeepSeek 67B
English | HellaSwag | 0-shot | 75.6 | 75.4 | 84.0 | 84.0
English | PIQA | 0-shot | 78.0 | 79.2 | 82.0 | 83.6
English | WinoGrande | 0-shot | 69.6 | 70.5 | 80.4 | 79.8
English | RACE-Middle | 5-shot | 60.7 | 63.2 | 70.1 | 69.9
English | RACE-High | 5-shot | 45.8 | 46.5 | 54.3 | 50.7
English | TriviaQA | 5-shot | 63.8 | 59.7 | 79.5 | 78.9
English | NaturalQuestions | 5-shot | 25.5 | 22.2 | 36.1 | 36.6
English | MMLU | 5-shot | 45.8 | 48.2 | 69.0 | 71.3
English | ARC-Easy | 0-shot | 69.1 | 67.9 | 76.5 | 76.9
English | ARC-Challenge | 0-shot | 49.0 | 48.1 | 59.5 | 59.0
English | OpenBookQA | 0-shot | 57.4 | 55.8 | 60.4 | 60.2
English | DROP | 1-shot | 39.8 | 41.0 | 69.2 | 67.9
English | MATH | 4-shot | 2.5 | 6.0 | 13.5 | 18.7
English | GSM8K | 8-shot | 15.5 | 17.4 | 58.4 | 63.4
English | HumanEval | 0-shot | 14.6 | 26.2 | 28.7 | 42.7
English | MBPP | 3-shot | 21.8 | 39.0 | 45.6 | 57.4
English | BBH | 3-shot | 38.5 | 39.5 | 62.9 | 68.7
English | AGIEval | 0-shot | 22.8 | 26.4 | 37.2 | 41.3
English | Pile-test | - | 0.741 | 0.725 | 0.649 | 0.642
Chinese | CLUEWSC | 5-shot | 64.0 | 73.1 | 76.5 | 81.0
Chinese | CHID | 0-shot | 37.9 | 89.3 | 55.5 | 92.1
Chinese | C-Eval | 5-shot | 33.9 | 45.0 | 51.4 | 66.1
Chinese | CMMLU | 5-shot | 32.6 | 47.2 | 53.1 | 70.8
Chinese | CMath | 3-shot | 25.1 | 34.5 | 53.9 | 63.0
Chinese | C3 | 0-shot | 47.4 | 65.4 | 61.7 | 75.3
Chinese | CCPM | 0-shot | 60.7 | 76.9 | 66.2 | 88.5
Table 5 | Main results. The evaluation results we report are based on the internal evaluation framework. Bold numbers indicate the best results among the 4 models. For Pile-test we report bits-per-byte (BPB), for DROP we report F1 score, and for other tasks we report accuracy. Note that the test-shots value is the maximum, and fewer shots might be applied because of limited context length or limited few-shot examples available in the same passage for reading comprehension tasks such as RACE.

5.1.2. Chat Model
Table 6 demonstrates the results of the DeepSeek Chat models, showcasing overall improvements in most tasks following tuning. However, there were a few instances where the performance of certain tasks declined.
Knowledge: We have observed fluctuations of base and chat models in knowledge-related tasks, such as TriviaQA, MMLU, and C-Eval. However, we do not believe that such minor fluctuations indicate the acquisition or loss of knowledge after SFT. The value of SFT lies in the ability to l
131、earn to achieve comparable scores to the base models few-shot setting in the chatmodels zero-shot setting,which is aligned with real scenarios.For example,0-shot MMLUperformance of a chat model is comparable with 5-shot MMLU performance of a base model.Reasoning:As a significant proportion of the SF
T instances are in the CoT format (Wei et al., 2022), the chat models demonstrate slight improvements in reasoning tasks, such as BBH and NaturalQuestions. However, we believe that the SFT stage does not learn reasoning capabilities but rather the correct format for reasoning paths.
Language | Benchmark | DeepSeek 7B Base | DeepSeek 7B Chat | DeepSeek 67B Base | DeepSeek 67B Chat
English | HellaSwag | 75.4 | 68.5 | 84.0 | 75.7
English | PIQA | 79.2 | 77.6 | 83.6 | 82.6
English | WinoGrande | 70.5 | 66.9 | 79.8 | 76.0
English | RACE-Middle | 63.2 | 65.2 | 69.9 | 70.9
English | RACE-High | 46.5 | 50.8 | 50.7 | 56.0
English | TriviaQA | 59.7 | 57.9 | 78.9 | 81.5
English | NaturalQuestions | 22.2 | 32.5 | 36.6 | 47.0
English | MMLU | 48.2 | 49.4 | 71.3 | 71.1
English | ARC-Easy | 67.9 | 71.0 | 76.9 | 81.6
English | ARC-Challenge | 48.1 | 49.4 | 59.0 | 64.1
English | GSM8K | 17.4 | 63.0 | 63.4 | 84.1
English | MATH | 6.0 | 15.8 | 18.7 | 32.6
English | HumanEval | 26.2 | 48.2 | 42.7 | 73.8
English | MBPP | 39.0 | 35.2 | 57.4 | 61.4
English | DROP | 41.0 | 49.1 | 67.9 | 71.9
English | OpenBookQA | 55.8 | 54.8 | 60.2 | 63.2
English | BBH | 39.5 | 42.3 | 68.7 | 71.7
English | AGIEval | 26.4 | 19.3 | 41.3 | 46.4
Chinese | CLUEWSC | 73.1 | 71.9 | 81.0 | 60.0
Chinese | CHID | 89.3 | 64.9 | 92.1 | 72.6
Chinese | C-Eval | 45.0 | 47.0 | 66.1 | 65.2
Chinese | CMMLU | 47.2 | 49.7 | 70.8 | 67.8
Chinese | CMath | 34.5 | 68.4 | 63.0 | 80.3
Chinese | C3 | 65.4 | 66.4 | 75.3 | 77.0
Chinese | CCPM | 76.9 | 76.5 | 88.5 | 84.9
Table 6 | The comparison between base and chat models. We evaluate chat models with 0-shot for MMLU, GSM8K, MATH, C-Eval, and CMMLU, while base model results are still obtained in the few-shot setting.

Performance Drop Tasks: The performance of a few tasks consistently de
136、clines after fine-tuning,regardless of the model size or pre-trained checkpoint selected.These particular taskstypically involve cloze tasks or sentence completion tasks,such as HellaSwag.It is reasonable toassume that pure language models are better equipped to handle such tasks.Math and Code:Our m
137、odel exhibits significant improvements in math and coding tasksafter fine-tuning.For instance,HumanEval and GSM8K scores are improved by over 20 points.Our explanation for this is that the base model was initially underfitted for these tasks,and theSFT stage has learned additional knowledge in codin
138、g and mathematics through the extensiveSFT data.However,it is important to note that the models capabilities may be primarily focusedon code completion and algebraic questions.To develop a comprehensive understanding ofmathematics and coding,it is crucial to incorporate a diverse range of data durin
139、g the pre-training stage,which is left as future work.We conducted a detailed analysis of code and mathtasks in Appendix A.4.In the 7B model fine-tuning,we initially fine-tune the model using all data.Subsequently,asecond stage is introduced,which excludes math and code data.The motivation behind th
is approach is that the stage-1 model exhibits a repetition ratio of 2.0%, which is reduced to 1.4% after stage-2 tuning, while maintaining the benchmark score. In the case of the 67B model, the repetition ratio is already below 1% following the first stage fine-tuning, and the second stage hurts the model score on the benchmark. Therefore, only one stage of SFT is done for the 67B model.

Model | Overall | Reasoning Avg. | Math. | Logi. | Language Avg. | Fund. | Chi. | Open. | Writ. | Role. | Pro.
gpt-4-1106-preview | 8.01 | 7.73 | 7.80 | 7.66 | 8.29 | 7.99 | 7.33 | 8.61 | 8.67 | 8.47 | 8.65
gpt-4-0613 | 7.53 | 7.47 | 7.56 | 7.37 | 7.59 | 7.81 | 6.93 | 7.42 | 7.93 | 7.51 | 7.94
DeepSeek-67B-Chat-DPO* | 6.69 | 5.77 | 6.13 | 5.41 | 7.60 | 7.29 | 7.47 | 7.82 | 7.51 | 7.83 | 7.71
DeepSeek-67B-Chat* | 6.43 | 5.75 | 5.71 | 5.79 | 7.11 | 7.12 | 6.52 | 7.58 | 7.20 | 6.91 | 7.37
chatglm-turbo (智譜清言) | 6.24 | 5.00 | 4.74 | 5.26 | 7.49 | 6.82 | 7.17 | 8.16 | 7.77 | 7.76 | 7.24
erniebot-3.5 (文心一言) | 6.14 | 5.15 | 5.03 | 5.27 | 7.13 | 6.62 | 7.60 | 7.26 | 7.56 | 6.83 | 6.90
gpt-3.5-turbo-0613 | 6.08 | 5.35 | 5.68 | 5.02 | 6.82 | 6.71 | 5.81 | 7.29 | 7.03 | 7.28 | 6.77
chatglm-pro (智譜清言) | 5.83 | 4.65 | 4.54 | 4.75 | 7.01 | 6.51 | 6.76 | 7.47 | 7.07 | 7.34 | 6.89
spark_desk_v2 (訊飛星火) | 5.74 | 4.73 | 4.71 | 4.74 | 6.76 | 5.84 | 6.97 | 7.29 | 7.18 | 6.92 | 6.34
Qwen-14B-Chat | 5.72 | 4.81 | 4.91 | 4.71 | 6.63 | 6.90 | 6.36 | 6.74 | 6.64 | 6.59 | 6.56
Baichuan2-13B-Chat | 5.25 | 3.92 | 3.76 | 4.07 | 6.59 | 6.22 | 6.05 | 7.11 | 6.97 | 6.75 | 6.43
ChatGLM3-6B | 4.97 | 3.85 | 3.55 | 4.14 | 6.10 | 5.75 | 5.29 | 6.71 | 6.83 | 6.28 | 5.73
Baichuan2-7B-Chat | 4.97 | 3.66 | 3.56 | 3.75 | 6.28 | 5.81 | 5.50 | 7.13 | 6.84 | 6.53 | 5.84
InternLM-20B | 4.96 | 3.66 | 3.39 | 3.92 | 6.26 | 5.96 | 5.50 | 7.18 | 6.19 | 6.49 | 6.22
Qwen-7B-Chat | 4.91 | 3.73 | 3.62 | 3.83 | 6.09 | 6.40 | 5.74 | 6.26 | 6.31 | 6.19 | 5.66
ChatGLM2-6B | 4.48 | 3.39 | 3.16 | 3.61 | 5.58 | 4.91 | 4.52 | 6.66 | 6.25 | 6.08 | 5.08
InternLM-Chat-7B | 3.65 | 2.56 | 2.45 | 2.66 | 4.75 | 4.34 | 4.09 | 5.82 | 4.89 | 5.32 | 4.06
Chinese-LLaMA-2-7B-Chat | 3.57 | 2.68 | 2.29 | 3.07 | 4.46 | 4.31 | 4.26 | 4.50 | 4.63 | 4.91 | 4.13
LLaMA-2-13B-Chinese-Chat | 3.35 | 2.47 | 2.21 | 2.73 | 4.23 | 4.13 | 3.31 | 4.79 | 3.93 | 4.53 | 4.71
Table 7 | AlignBench leaderboard rated by gpt-4-0613. The Reasoning columns cover Chinese reasoning (average, mathematical calculation, logical reasoning) and the Language columns cover Chinese language ability (average, fundamental tasks, Chinese understanding, open-ended QA, text writing, role-playing, professional knowledge). Models are ranked in descending order of total score. Results with * are our evaluation results based on the official AlignBench repository, whereas all other results are derived from the AlignBench paper. We found that our DeepSeek-67B-Chat model surpasses ChatGPT and other baseline models by a clear margin, which indicates the superior performance of our model in both basic Chinese language tasks and advanced Chinese reasoning tasks. Besides, we can find that the DPO process has brought improvements in almost all fields.

5.2. Open-Ended Evaluation
For chat models, in addition to observing metrics on standard benchmarks, the quality of results
generated in open domains and on open-ended questions directly affects the actual user experience. Hence, we separately tested the open-ended generation capabilities of our chat model in both Chinese and English tasks.

5.2.1. Chinese Open-Ended Evaluation
For Chinese open-ended evaluation, we tested the comprehensive capabilities of our chat model in different domains on a
149、high-quality open-ended question testset AlignBench(Liu et al.,2023).AlignBenchincludes a total of 8 primary categories,36 secondary categories,and encompasses 683 questions.For each question,in addition to the prompt,AlignBench also provides professional referenceanswers and rating templates for GP
150、T-4 to judge the quality of the response.We utilized the official AlignBench Github code repository to implement the evaluation of17our model.We strictly aligned the key temperature parameter with the original setting:forrole-playing,writing ability,and open-ended questions,the generation temperatur
151、e was set to0.7;whereas for other tasks,the generation temperature was set to 0.1.The AlignBench leaderboard is shown in Table 7.We can find that our DeepSeek 67B Chatmodel surpasses ChatGPT and other baseline models,and is only after the two versions ofGPT-4.This demonstrates the excellent performa
152、nce of our model across various Chinese tasks,compared to other open-source or proprietary Chinese Large Language Models.The DPO modelhas shown improvement across almost all metrics,which demonstrates the positive impact ofthe DPO training process on model alignment.For the basic Chinese Language ta
153、sks,our model is in the first tier among all models,and theChinese fundamental language ability of our DPO model is even higher than the newest versionof GPT-4.For the advanced Chinese Reasoning tasks,our models scores are significantly higherthan those of other Chinese LLMs with a clear margin,demo
154、nstrating the superior performanceof our model in more complex Chinese logical reasoning and mathematical calculations.5.2.2.English Open-Ended EvaluationFor English open-ended evaluation,we use the MT-Bench benchmark(Zheng et al.,2023),whichcontains 8 different categories of multi-turn questions.As
illustrated in Table 8, our DeepSeek LLM 67B Chat outperforms other open-source models such as LLaMA-2-Chat 70B (Touvron et al., 2023b), Xwin 70b v0.1, and TÜLU 2+DPO 70B (Ivison et al., 2023), and achieves a score of 8.35, comparable with GPT-3.5-turbo. Besides, after the DPO stage, our DeepSeek LLM 67B Chat DPO further improves the average score to 8.76, which is only behind GPT-4 (OpenAI, 2023). These results illustrate the strong multi-turn open-ended generation ability of DeepSeek LLM.

Model | STEM | Humanities | Reasoning | Coding | Math | Extraction | Roleplay | Writing | Average
GPT-4-1106-preview | 9.90 | 9.95 | 8.10 | 9.05 | 7.95 | 9.90 | 9.50 | 9.70 | 9.26
GPT-3.5-turbo-0613 | 9.55 | 9.95 | 6.20 | 7.05 | 7.05 | 9.00 | 8.65 | 9.65 | 8.39
LLAMA-2-Chat 7B | 8.65 | 8.75 | 4.25 | 3.00 | 2.40 | 6.50 | 7.70 | 8.90 | 6.27
LLAMA-2-Chat 13B | 8.63 | 9.75 | 5.10 | 3.00 | 3.45 | 6.93 | 7.50 | 8.85 | 6.65
LLAMA-2-Chat 70B | 8.93 | 9.63 | 5.80 | 3.15 | 3.30 | 7.25 | 7.50 | 9.30 | 6.86
Zephyr-Beta 7B | 9.03 | 9.63 | 5.60 | 5.10 | 4.45 | 7.45 | 8.20 | 9.35 | 7.35
Xwin 70b v0.1 | 9.68 | 9.95 | 6.55 | 4.25 | 3.30 | 8.75 | 8.25 | 9.55 | 7.53
Xwin 13b v0.2 | 9.55 | 9.88 | 5.20 | 3.60 | 2.85 | 7.70 | 8.60 | 8.68 | 7.01
TÜLU 2+DPO 70B | 9.00 | 9.90 | 7.00 | 4.70 | 4.65 | 9.35 | 9.25 | 9.25 | 7.89
DeepSeek LLM 67B Chat | 9.60 | 9.70 | 8.00 | 7.35 | 6.25 | 8.40 | 8.20 | 9.30 | 8.35
DeepSeek LLM 67B Chat DPO | 9.70 | 9.80 | 9.05 | 6.75 | 6.65 | 9.30 | 9.10 | 9.75 | 8.76
Table 8 | MT-Bench evaluation. Results with * are reported in Ivison et al. (2023).

5.3. Held-Out Evaluation
Data contamination and benchmark overfitting are two challenges in evaluating LLMs. One common practice is to utilize testsets published recently to evaluate the model as held-out testsets.
LeetCode: To assess the coding proficiency of the model, we have utilized problems from the LeetCode
160、Weekly Contest(Weekly Contest 351-372,Bi-Weekly Contest 108-117,from July2023 to Nov 2023).We have obtained these problems by crawling data from LeetCode,whichconsists of 126 problems with over 20 test cases for each.The evaluation metric employed is akinto that of HumanEval.In this regard,if a mode
161、ls outputs successfully pass all test cases,themodel is considered to have effectively solved the problem.The models coding capabilities are18depicted in the Figure below,where the y-axis represents the pass1 score on in-domain humanevaluation testing,and the x-axis represents the pass1 score on out
162、-domain LeetCode WeeklyContest problems.The LeetCode test data will be released accompanied with the DeepSeekCoder technique report soon.Hungarian National High-School Exam:In line with Grok-1,we have evaluated the modelsmathematical capabilities using the Hungarian National High School Exam.This ex
163、am com-prises 33 problems,and the models scores are determined through human annotation.Wefollow the scoring metric in the solution.pdf to evaluate all models.Instruction Following Evaluation:On Nov 15th,2023,Google released an instructionfollowing the evaluation dataset(Zhou et al.,2023).They ident
ified 25 types of verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We use the prompt-level loose metric to evaluate all models.

Model | LeetCode | Hungarian Exam | IFEval
GPT-4 | 48.4 | 68 | 79.3
ChatGLM3 6B | 2.4 | 32 | 29.7
DeepSeek LLM 7B Chat | 4.7 | 28.5 | 41.2
Baichuan2-Chat 13B | 1.6 | 19.5 | 44.5
Yi-Chat 34B | 7.9 | 39 | 48.4
Qwen 72B Chat | 12.7 | 52 | 50.8
DeepSeek LLM 67B Chat | 17.5 | 58 | 55.5
Table 9 | Held-out dataset evaluation.

We have conducted a comparative analysis of our model against various baseline models of different sizes, namely Qwen 72B Chat (Bai et al., 2023), ChatGLM3 (Du et al., 20
166、22),Baichuan2(Yang et al.,2023),and Yi-34B Chat.Our observations indicate that there exists a significantperformance gap between large models and small models on these held-out datasets,evenif certain small models achieve promising results on conventional benchmarks.For instance,ChatGLM3 achieves a
167、score of 52.4 on MBPP,a code testset,which is close to DeepSeek 67B.However,when evaluated on new benchmarks,its performance falls considerably short com-pared to DeepSeek 67B.A similar trend is also observed in math datasets,where ChatGLM3is very strong on GSM8K(72.3),but its performance in the Hun
168、garian Exam score is inferiorto large models.Furthermore,the capability of instruction following demonstrates that totalcomputing plays a crucial role.The DeepSeek 7B and 67B models utilize the same training pipeline,but there is a significantdisparity in their performance.Through our subjective eva
luation, we have observed a notable discrepancy in intelligence across various tasks when scaling the model size to 67B. While DeepSeek 7B falls behind other smaller language models on standard benchmarks, its performance on held-out tasks is relatively commendable when compared to others.

5.4. Safety Evaluation
We profoundly recognize the importance of safety for general artificial intelligence. The premise for establishing a truly helpful artificial intelligence model is that it possesses values consistent with those of humans and exhibits friendliness towards humanity. We incorporate the assurance of model safety throughout the entire training process, including pre-training, SFT, and DPO.

Category | Subcategories | # Safety Answers / # Total Cases
Discrimination and Prejudice Questions | Ethnic and Racial, Religious Belief, Nationality and Geography, Gender, Age, Occupation, Health, Discrimination in Other Aspects | 486/500
Infringement of Others' Legal Rights | Physical and Mental Health, Legitimate Property, Portrait Rights, Reputation Rights, Honor Rights, Privacy Rights, Information Rights, Other Legal Rights | 473/500
Trade Secrets and Intellectual Property Rights | Infringing Others' Intellectual Property Rights, Monopolistic and Unfair Competitive Actions, Other Commercially Illegal and Non-compliant Behaviors, Violating Business Ethics, Disclosing Others' Trade Secrets | 281/300
Illegal and Non-compliant Behavior | Cults and Superstition, Pornography, Gambling, Drugs and Prohibited Items, Insults and Abuse, Violent Behavior, Involvement in Organized Crime, Other Illegal and Non-compliant Behaviors | 290/300
Other Safety Issues | Issues of Illusion and Reality, Time-sensitive Issues, Self-recognition Problems, Other Sensitive Topics | 767/800
Table 10 | Our taxonomy for safety evaluation. The total number of test cases for each category and the number of safe answers provided by our model (DeepSeek-67B-Chat) are listed in the far-right column of the table. The annotation of test questions and the evaluation of generated results are carried out by a professional human team. We can observe that our model demonstrates strong security across various types of safety test sets.

To validate the safety of our model, we established a 20-person expert team from various disciplines and constructed a safety content classification system that aligns with human values (the safety evaluation taxonomy is shown in Table 10). Subsequently, the expert team constructed dozens of high-quality test cases for each
178、 safety subcategory manually.In addition to focusingon the diversity of safety content areas,we also pay attention to the diversity of formats insafety content.The infamous grandmother loophole indicates that models can be deceivedby the surface format of a query into providing unsafe responses.Ther
179、efore,when devisingquestions,the expert team also pays attention to diversifying the ways of inquiry.They constructdiverse safety issues through means such as inducement,role-playing,multi-turn dialogues,preset positions,and etc.Ultimately,we obtained a safety test set comprising 2400 questions.Inad
180、dition,the expert team has constructed a basic guideline constitution for safety reviews foreach different content type and format type.For the output results of our model on this test set,we manually inspected its safety.Ourreview team was well-trained and cross-verification was performed on the an
181、notation results.The annotators perform a three-category annotation for each question:safe,unsafe,and modelrefusal.We tested the safety of our DeepSeek 67B Chat model,and the results are presentedin Table 10.The number of test questions for each safety category and the number of safetytests passed b
182、y our model are listed in the table.We label both the securely answered and themodel-refused test cases as secure responses.The results indicate that our model exhibits goodsecurity performance across numerous safety test categories.Complementing our existing approach to safety,we further enriched o
183、ur evaluation usingthe Do-Not-Answer dataset(Wang et al.,2023)to evaluate the safety mechanisms of ourDeepSeek 67B Chat model.The datasets 939 risk-categorized prompts were instrumental inhighlighting our models enhanced capabilities.As shown in Table 11,DeepSeek 67B Chatmodel has demonstrated notab
184、le performance,achieving a score of 97.8,which is higher thanboth ChatGPT and GPT-4.This score not only benchmarks our models capability to safelyhandle sensitive queries but also places it competitively among leading models in the field.5.5.DiscussionThroughout the development process,we have disco
vered some interesting findings in building LLMs.

Model | Do-Not-Answer Score
LLAMA-2-7B-Chat | 99.4
Claude | 98.3
DeepSeek-67B-Chat* | 97.8
ChatGPT | 97.7
GPT-4 | 96.5
Vicuna-7B | 94.9
ChatGLM2 | 92.9
Table 11 | Do-Not-Answer score (Wang et al., 2023); a higher score signifies greater model safety. Results with * are our evaluation results based on the official repository, whereas all other results are derived from the original paper. We can find that our model has a higher safety score than both ChatGPT and GPT-4, placing it amongst the ranks of the safest models.

Staged Fine-Tuning: As we mentioned above, small models need longer fine-tuning o
187、n mathand code dataset,but it will hurt the model conversation ability,such as increasing repetitionbehavior.To address this issue,we have implemented a staged fine-tuning process.In thisapproach,the first stage involves fine-tuning with all available data,while the second stagefocuses specifically
188、on fine-tuning with conversational data.ModelHumanEvalGSM8KRepetitionIFEvalDeepSeek LLM 7B Chat Stage148.263.90.02038.0DeepSeek LLM 7B Chat Stage248.263.00.01441.2Table 12|Two-stage fine-tuning results.The repetition ratio is computed when the temperatureis 0.The lower repetition ratio is better.The
189、 IFEval result is the prompt-level loose accuracy.Table 12 displays the results obtained from the two-stage training process.These resultsclearly demonstrate that the second stage does not compromise the models proficiency in codeand math,while simultaneously decreasing the repetition behavior and e
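The exact repetition criterion is not spelled out here; the following sketch shows one plausible way to flag repetitive greedy (temperature 0) generations by counting repeated n-grams. The n-gram size and threshold are illustrative assumptions, not the precise metric we used.

from collections import Counter

def is_repetitive(text: str, n: int = 4, max_repeats: int = 10) -> bool:
    """Flag a generation as repetitive if any n-gram occurs more than max_repeats times."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return False
    return max(Counter(ngrams).values()) > max_repeats

# Repetition ratio over a set of greedy generations.
generations = ["the answer is 42 " * 30, "The capital of France is Paris."]
ratio = sum(is_repetitive(g) for g in generations) / len(generations)
print(f"repetition ratio: {ratio:.3f}")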
Multi-Choice Question: It is a common practice to test a model with multi-choice-style evaluation data, such as MMLU, AGIEval, and C-Eval. Multi-choice questions require the model not only to have the corresponding knowledge but also to understand what the options refer to. During the alignment stage, we tested adding 20 million Chinese multi-choice questions and obtained the performance shown in Table 13. It is important to note that we conducted deduplication against the C-Eval validation set and the CMMLU test set to prevent data contamination.

Model | MMLU | C-Eval | CMMLU | TriviaQA | ChineseQA
DeepSeek LLM 7B Chat | 49.4 | 47.0 | 49.7 | 57.9 | 75.0
DeepSeek LLM 7B Chat + MC | 60.9 | 71.3 | 73.8 | 57.9 | 74.4

Table 13 | The impact of adding multi-choice question data.

The inclusion of an additional 20M MC (multiple-choice) data has proven to be beneficial not only for Chinese multiple-choice benchmarks but also for improving English benchmarks. This indicates that the model's capability to solve MC problems has been enhanced. However, we have observed that this improvement does not extend to the model's performance on other evaluations that do not use the multiple-choice format, such as TriviaQA and our in-house ChineseQA test sets, which are generative evaluation benchmarks. This suggests that users may not perceive the model as becoming more intelligent during conversational interactions, as these interactions involve generating responses rather than solving multiple-choice problems. Therefore, we have chosen to exclude MC data from both the pre-training and fine-tuning stages, as including it would result in overfitting to benchmarks and would not contribute to achieving true intelligence in the model.
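The deduplication procedure itself is not detailed here; a minimal contamination-filtering sketch based on exact n-gram overlap with the benchmark questions is shown below. The n-gram size, the matching rule, and the example strings are assumptions for illustration.

def char_ngrams(text: str, n: int = 13) -> set:
    """Character-level n-grams; whitespace is stripped before building the n-grams."""
    text = "".join(text.split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_benchmark_index(benchmark_questions, n: int = 13) -> set:
    index = set()
    for q in benchmark_questions:
        index |= char_ngrams(q, n)
    return index

def is_contaminated(candidate: str, index: set, n: int = 13) -> bool:
    """Drop a training question if it shares any n-gram with a benchmark question."""
    return bool(char_ngrams(candidate, n) & index)

# Usage: filter the multiple-choice training pool against benchmark test questions.
benchmark_index = build_benchmark_index(["Which of the following is a prime number? A. 4 B. 6 C. 7 D. 9"])
training_pool = [
    "Which of the following is a prime number? A. 4 B. 6 C. 7 D. 9",  # exact benchmark leak
    "Which dynasty built the Great Wall sections near Beijing?",
]
clean_pool = [q for q in training_pool if not is_contaminated(q, benchmark_index)]
print(len(clean_pool), "questions kept after contamination filtering")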
Instruction Data in Pre-Training: It is widely acknowledged that incorporating instruction data during the latter part of the pre-training phase enhances the performance of a base model on benchmark tasks. In our study, we integrated 5 million instances of instruction data, primarily consisting of multi-choice questions, during the final 10% of the pre-training stage. We observed that the base model did exhibit improved performance on the benchmarks. However, the final outcomes were nearly identical to those achieved by adding the same data during the SFT stage. We conclude that while this approach strengthens the base model's performance on the benchmarks, its overall potential is equivalent to not incorporating these instruction data. If the instruction data is substantial in size, it is acceptable to incorporate it into the pre-training process. Due to our preference for excluding multi-choice questions and the limited availability of non-multi-choice questions, we made the decision not to include instruction data in the pre-training process.
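Purely for illustration, the sketch below shows one way such a schedule could be wired up: plain pre-training batches for the first 90% of steps, then batches mixed with instruction data for the final 10%. The mixing ratio and the sampler interface are hypothetical and are not the exact configuration used in our experiments.

import random

def sample_example(step: int, total_steps: int, pretrain_pool, instruction_pool,
                   instruction_fraction: float = 0.2):
    """Return one training example; instruction data only enters in the last 10% of steps."""
    in_final_phase = step >= int(0.9 * total_steps)
    if in_final_phase and random.random() < instruction_fraction:
        return random.choice(instruction_pool)
    return random.choice(pretrain_pool)

pretrain_pool = ["web document ...", "book excerpt ..."]
instruction_pool = ["Q: ... A: ..."]
total_steps = 100
examples = [sample_example(s, total_steps, pretrain_pool, instruction_pool) for s in range(total_steps)]
print(sum(e.startswith("Q:") for e in examples), "instruction examples drawn")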
System Prompt: A well-designed system prompt should effectively guide a model to generate responses that are both helpful and respectful. We slightly modified the prompt introduced by LLaMA-2 and use it as our system prompt.

System prompt: You are DeepSeek Chat, a helpful, respectful and honest AI assistant developed by DeepSeek. The knowledge cut-off date for your training data is up to May 2023. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.

We have observed an intriguing phenomenon wherein the performance of a 7B LLM experiences a slight degradation when a system prompt is introduced. However, when utilizing a 67B LLM, the addition of a prompt leads to significantly improved results, as illustrated in Table 14. Our explanation for this disparity is that larger models possess a better understanding of the intended meaning behind the system prompt, enabling them to follow instructions more effectively and generate superior responses. Smaller models, on the other hand, struggle to grasp the system prompt adequately, and the inconsistency between training and testing might negatively impact their performance.

Model | MT-Bench
DeepSeek LLM 7B Chat | 7.15
DeepSeek LLM 7B Chat + System Prompt | 7.11
DeepSeek LLM 67B Chat | 8.35
DeepSeek LLM 67B Chat + System Prompt | 8.58

Table 14 | The impact of adding a system prompt.
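As an illustration of how such a system prompt might be applied at inference time, the snippet below prepends it to a user query using a generic system/user/assistant layout; the exact chat template used by DeepSeek Chat is not specified here, so the role markers are an assumption.

SYSTEM_PROMPT = (
    "You are DeepSeek Chat, a helpful, respectful and honest AI assistant "
    "developed by DeepSeek. ..."  # truncated for brevity; see the full prompt above
)

def build_chat_input(user_message: str, use_system_prompt: bool = True) -> str:
    """Assemble a single prompt string with a generic chat layout (illustrative only)."""
    parts = []
    if use_system_prompt:
        parts.append(f"System: {SYSTEM_PROMPT}")
    parts.append(f"User: {user_message}")
    parts.append("Assistant:")
    return "\n".join(parts)

print(build_chat_input("Explain why the sky is blue."))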
6. Conclusion, Limitation, and Future Work

We introduce DeepSeek LLMs, a series of open-source models trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese. In this paper, we provide an in-depth explanation of hyper-parameter selection and scaling laws, as well as the various fine-tuning attempts we made. We calibrate the scaling laws described in previous work and propose a new optimal model/data scaling-up allocation strategy. In addition, we present a method to predict the near-optimal batch size and learning rate for a given compute budget. We further conclude that the scaling law is closely related to data quality, which might be the root cause of the varying scaling behavior reported in different works. Guided by the scaling laws, we conduct pre-training with the best hyper-parameters and provide a comprehensive evaluation. We avoid benchmark decoration and dark secrets in all training stages.

DeepSeek Chat shares the limitations commonly acknowledged in other LLMs, including the lack of ongoing knowledge updates after pre-training, the possibility of generating non-factual information such as unverified advice, and a tendency to produce hallucinations. Moreover, it is important to note that our initial version of Chinese data is not exhaustive, which may result in suboptimal performance on certain Chinese-specific topics. Since our data primarily consists of Chinese and English sources, the model's proficiency in other languages remains limited and should be approached with caution.

DeepSeek LLM is a long-term project committed to advancing open-source language models. Soon, we will release our technical reports on code intelligence and Mixture-of-Experts (MoE) models, respectively. They will show how we create high-quality code data for pre-training and how we design a sparse model that achieves dense-model performance.

At present, we are constructing a larger and improved dataset for the upcoming version of DeepSeek LLM. We hope the reasoning, Chinese knowledge, math, and code capabilities will be significantly improved in the next version. Our alignment team is dedicated to studying ways to deliver a model that is helpful, honest, and safe to the public. Our initial experiments show that reinforcement learning could boost the model's complex reasoning capability.
References

J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.

Anthropic. Introducing Claude, 2023.

J. Austin et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. PIQA: Reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pages 7432–7439. AAAI Press, 2020.

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners, 2020.

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021.

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018.

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Together Computer. RedPajama: An open dataset for training large language models, 2023.

Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.

T. Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. 2023.

T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.

Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, 2022.

D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, pages 2368–2378. Association for Computational Linguistics, 2019.

L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.

Google. An important next step on our AI journey, 2023. URL https://blog.google/technology/ai/bard-google-ai-search-updates/.

Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, M. Huang, N. Duan, and W. Chen. ToRA: A tool-integrated reasoning agent for mathematical problem solving. CoRR, abs/2309.17452, 2023.

P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.

T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse, J. Jackson, H. Jun, T. B. Brown, P. Dhariwal, S. Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.

J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M. M. A. Patwary, Y. Yang, and Y. Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.

High-Flyer. HAI-LLM: An efficient and lightweight training tool for large models, 2023.

J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre. Training compute-optimal large language models. CoRR, abs/2203.15556, 2022.

Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023.

Huggingface Team. Tokenizers: Fast state-of-the-art tokenizers optimized for research and production, 2019.

F. Shi et al. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, ICLR 2023, 2023.

H. Ivison et al. Camels in a changing climate: Enhancing LM adaptation with Tulu 2. 2023.

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.

M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, 2017.

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. CoRR, abs/2001.08361, 2020.

V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023.

T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019.

W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

G. Lai, Q. Xie, H. Liu, Y. Yang, and E. H. Hovy. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, pages 785–794. Association for Computational Linguistics, 2017.

H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212, 2023.

W. Li, F. Qi, M. Sun, X. Yi, and J. Zhang. CCPM: A Chinese classical poetry matching dataset, 2021.

X. Liu, X. Lei, S. Wang, Y. Huang, Z. Feng, B. Wen, J. Cheng, P. Ke, Y. Xu, W. L. Tam, X. Zhang, L. Sun, H. Wang, J. Zhang, M. Huang, Y. Dong, and J. Tang. AlignBench: Benchmarking Chinese alignment of large language models. CoRR, abs/2311.18743, 2023.

I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin, S. Chen, and D. Zhang. WizardMath: Empowering mathematical reasoning for large language models via reinforced Evol-Instruct. arXiv preprint arXiv:2308.09583, 2023.

S. McCandlish, J. Kaplan, D. Amodei, and the OpenAI Dota Team. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162, 2018.

T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering, 2018.

D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, et al. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2021.

OpenAI. Introducing ChatGPT, 2022.

OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, and J. Launay. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. 2023.

S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.

K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. WinoGrande: An adversarial Winograd schema challenge at scale, 2019.

C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl. Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research, 20(112):1–49, 2019.

N. Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.

M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti, et al. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022.

S. L. Smith, P.-J. Kindermans, C. Ying, and Q. V. Le. Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.

J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

K. Sun, D. Yu, D. Yu, and C. Cardie. Investigating prior knowledge for challenging Chinese machine reading comprehension, 2019.

M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023b.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Y. Wang, H. Li, X. Han, P. Nakov, and T. Baldwin. Do-Not-Answer: A dataset for evaluating safeguards in LLMs. CoRR, abs/2308.13387, 2023.

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022.

T. Wei, J. Luan, W. Liu, S. Dong, and B. Wang. CMATH: Can your language model pass Chinese elementary school math test?, 2023.

L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, Y. Tian, Q. Dong, W. Liu, B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou, S. Liu, Z. Zhao, Q. Zhao, C. Yue, X. Zhang, Z. Yang, K. Richardson, and Z. Lan. CLUE: A Chinese language understanding evaluation benchmark. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, pages 4762–4772. International Committee on Computational Linguistics, 2020.

A. Yang, B. Xiao, B. Wang, B. Zhang, C. Yin, C. Lv, D. Pan, D. Wang, D. Yan, F. Yang, F. Deng, F. Wang, F. Liu, G. Ai, G. Dong, H. Zhao, H. Xu, H. Sun, H. Zhang, H. Liu, J. Ji, J. Xie, J. Dai, K. Fang, L. Su, L. Song, L. Liu, L. Ru, L. Ma, M. Wang, M. Liu, M. Lin, N. Nie, P. Guo, R. Sun, T. Zhang, T. Li, T. Li, W. Cheng, W. Chen, X. Zeng, X. Wang, X. Chen, X. Men, X. Yu, X. Pan, Y. Shen, Y. Wang, Y. Li, Y. Jiang, Y. Gao, Y. Zhang, Z. Zhou, and Z. Wu. Baichuan 2: Open large-scale language models. Technical report, Baichuan Inc., 2023.

L. Yu et al. MetaMath: Bootstrap your own mathematical questions for large language models. CoRR, abs/2309.12284, 2023.

R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, pages 4791–4800. Association for Computational Linguistics, 2019.

B. Zhang and R. Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.

G. Zhang, L. Li, Z. Nado, J. Martens, S. Sachdeva, G. Dahl, C. Shallue, and R. B. Grosse. Which algorithmic choices matter at which batch sizes? Insights from a noisy quadratic model. Advances in Neural Information Processing Systems, 32, 2019.

C. Zheng, M. Huang, and A. Sun. ChID: A large-scale Chinese idiom dataset for cloze test. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, pages 778–787. Association for Computational Linguistics, 2019.

L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. 2023.

W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. AGIEval: A human-centric benchmark for evaluating foundation models. CoRR, abs/2304.06364, 2023.

J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.

A. Appendix

A.1. Acknowledgments
This project was realized thanks to the efforts of numerous contributors. We offer our extended thanks to the following individuals for their help (contributors are ordered alphabetically by last name):

Data Annotation Team: Jialu Cai, Ruijian Chen, Ruyi Chen, Bei Feng, Yanping Huang, Zhen Huang, Pin Jiang, Rongli Jin, Xiangyue Jin, Ziyun Ke, Hui Li, Meng Li, Sangsang Li, Xiaoqian Li, Yaohui Li, Yunxian Ma, Jiaqi Ni, Xiaojin Shen, Xinnan Song, Tianyu Sun, Xiaosha Chen, Haoyuan Tian, Xiaohan Wang, Xiaoxiang Wang, Yuhao Wang, Fanyi Xia, Lei Xu, Zeyuan Xu, Zhipeng Xu, Tian Yuan, Zhongyu Zhang, Yi Zheng, Shuang Zhou, Xinyi Zhou, Yuchen Zhu, Yuxuan Zhu.

Compliance Team: Jin Chen, Ying Tang, Miaojun Wang, Xianzu Wang, Shaoqing Wu, Leyi Xia, W. L. Xiao.

Business Team: Jian Liang, Mingming Li, T. Wang, Xianzu Wang, Zhiniu Wen, Shengfeng Ye, Peng Zhang, Zhen Zhang.

Design Team: Wei An, Yukun Zha.

A.2. Different Model Scale Representations
We refitted the scaling curves for different model scale representations, reusing the experiments from the IsoFLOP profile. We recalculated the compute FLOPs using 6N₁ and 6N₂ as model scale representations and refitted the performance scaling curves. As shown in Figure 6, the results indicate that the deviation of the optimal model/data allocation among these three representations is not significant at higher compute budgets, but there are noticeable differences at lower budgets.

Figure 6 | Performance scaling curves using different model scale representations: (a) compute budget computed as 6N₁·D, (b) as 6N₂·D, and (c) as M·D. The metric is the bits-per-byte on the validation set. The dotted line represents the power law fitting the smaller models (grey circles). The blue stars represent DeepSeek LLM 7B and 67B. N₁, N₂, and M represent the non-embedding parameters, complete parameters, and non-embedding FLOPs/token of the model, respectively.

When using 6N₁ as the model scale representation, the fitted performance scaling curve tends to overestimate the performance of large-scale models. Conversely, when using 6N₂, the curve tends to underestimate their performance. Using M as the model scale representation, however, achieves the most accurate predictions.
A.3. Benchmark Metrics Curves

Figure 7 | Benchmark metrics curves of DeepSeek LLM Base. ChineseQA is our in-house test set, constructed in a manner akin to TriviaQA.

Figure 7 shows the benchmark metrics curves across different training steps. We can see consistent improvement on these benchmarks from the start to the end of training. We believe the performance would be further improved if training continued.

Model | Size | HumanEval (Python) | HumanEval (Multilingual) | MBPP
Pre-Trained Models
Codex-001 | - | 33.5% | 26.1% | 45.9%
StarCoder | 16B | 36.0% | 28.7% | 46.8%
CodeGeeX2 | 6B | 36.0% | 24.5% | 42.4%
CodeLlama | 7B | 31.7% | 29.2% | 41.6%
CodeLlama | 13B | 36.0% | 35.4% | 48.4%
CodeLlama | 34B | 48.2% | 41.0% | 55.2%
DeepSeek-LLM-Base | 67B | 42.7% | 37.2% | 57.4%
Instruction-Tuned Models
Wizard-Coder | 34B | 73.2% | 48.8% | 61.2%
DeepSeek-LLM-Chat | 67B | 73.8% | 53.3% | 61.4%

Table 15 | Comparison with code-specific models.

A.4. Comparison with Code or Math Specific Models
We have conducted a comparison between our model and specific code and math language models. Table 15 demonstrates that DeepSeek LLM 67B is capable of achieving performance similar to CodeLlama, despite having access to less code data. It is worth noting that DeepSeek LLM possesses greater capabilities in areas other than code.

Likewise, Table 16 presents the results obtained on various math-related benchmarks, such as GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), MGSM-zh (Shi et al., 2023), and CMath (Wei et al., 2023). DeepSeek 67B exhibits exceptional performance on math-related tasks across different languages, showcasing its superiority in this domain. In addition, DeepSeek LLM can utilize programs to solve math problems, which demonstrates better performance than chain-of-thought prompting. It is significantly better than the previous SOTA model, ToRA (Gou et al., 2023), on these benchmarks.

Model | Inference | GSM8K | MATH | MGSM-zh | CMath
Chain-of-Thought
MetaMath 70B (Yu et al., 2023) | CoT | 82.3% | 26.6% | 66.4% | 70.9%
WizardMath 70B (Luo et al., 2023) | CoT | 81.6% | 22.7% | 64.8% | 65.4%
DeepSeek LLM 67B Chat | CoT | 84.1% | 32.6% | 74.0% | 80.3%
Tool-Integrated Reasoning
ToRA-Code 34B (Gou et al., 2023) | Tool-Integrated | 80.7% | 50.8% | 41.2% | 53.4%
DeepSeek LLM 67B Chat | Tool-Integrated | 86.7% | 51.1% | 76.4% | 85.4%

Table 16 | Comparison with math-specific models.
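To make the "Tool-Integrated" rows concrete, here is an illustrative example of the program-aided style: the model emits a short program for a word problem and the printed output is taken as the final answer. The problem and code are invented for illustration and are not drawn from the benchmarks above.

# Problem (illustrative): A school buys 12 boxes of pencils, each box holds 36 pencils,
# and 58 pencils are handed out. How many pencils are left?
# A tool-integrated model would emit a program like the following instead of prose reasoning:
boxes = 12
pencils_per_box = 36
handed_out = 58
remaining = boxes * pencils_per_box - handed_out
print(remaining)  # 374; the printed value is parsed and used as the final answer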
A.5. Benchmark Results w/ DPO Stage

Table 17 presents the benchmark results obtained with the DPO stage. Based on these results, we can conclude that the DPO stage does not significantly impact the fundamental capability of an LLM.

Benchmark | DeepSeek 67B Chat | DeepSeek 67B Chat DPO
HellaSwag | 75.7 | 76.1
TriviaQA | 81.5 | 82.9
NaturalQuestions | 47.0 | 48.8
MMLU | 71.1 | 70.9
GSM8K | 84.1 | 85.2
MATH | 32.6 | 30.2
HumanEval | 73.8 | 71.3
BBH | 71.7 | 70.8
AGIEval | 46.4 | 46.1
CEval | 65.2 | 64.3
CMMLU | 67.8 | 68.2

Table 17 | The benchmark metrics before and after the DPO stage.

A.6. Evaluation Formats

Table 18 through Table 40 present examples of our evaluation formats on the different benchmarks.

PROMPT
以下是一道中國高考生物選擇題，請選擇正確的答案。
問題：下列有關高爾基體、線粒體和葉綠體的敘述，正確的是
選項：(A)三者都存在于藍藻中 (B)三者都含有DNA (C)三者都是ATP合成的場所 (D)三者的膜結構中都含有蛋白質
答案：從A到D，我們應選擇

Table 18 | An example of AGIEval.

PROMPT
Question: Use the information below to answer the question. Cotton is a plant product used to make fabric. Cotton is made of cellulose, a
284、 fiber notdigestible by humans.Cellulose is composed of many sugar molecules bondedtogether into long chains.Each sugar molecule contains carbon,hydrogen,and oxygen atoms.When cotton fabric is washed,wrinkles often form.Theclothing industry uses chemicals to manufacture some cotton fabrics that arew
285、rinkle-free.Dyes are also added to color the cellulose fibers in cotton.Howwould a clothing manufacturer separate colors to determine the purity of thedyes?Answer:OPTIONS-through filtration-by their boiling points-by their freezing points-through paper chromatographyTable 19|An example of ARC.33PROM
286、PTEvaluate the result of a random Boolean expression.Q:not(not not True)isA:Lets think step by step.Remember that(i)expressions inside brackets are always evaluated first andthat(ii)the order of operations from highest priority to lowest priority is not,and,or,respectively.We first simplify this exp
287、ression Z as follows:Z=not(not not True)=not(A)where A=not not True.Lets evaluate A:A=not not True=not(not True)=not False=True.Plugging in A,we get:Z=not(A)=not(True)=not True=False.So the answer is False.Q:True and False and not True and True isA:Lets think step by step.Remember that(i)expressions
288、 inside brackets are always evaluated first andthat(ii)the order of operations from highest priority to lowest priority is not,and,or,respectively.We first simplify this expression Z as follows:Z=True and False and not True and True=A and B where A=True and Falseand B=not True and True.Lets evaluate
289、 A:A=True and False=False.Letsevaluate B:B=not True and True=not(True and True)=not(True)=False.Plugging in A and B,we get:Z=A and B=False and False=False.So theanswer is False.Q:not not(not(False)isA:Lets think step by step.Remember that(i)expressions inside brackets are always evaluated first andt
290、hat(ii)the order of operations from highest priority to lowest priority is not,and,or,respectively.We first simplify this expression Z as follows:Z=not not(not(False)=not not(A)where A=not(False).Lets evaluateA:A=not(False)=not False=True.Plugging in A,we get:Z=not not(A)=not not(True)=not not False
291、=True.So the answer is True.Q:False and False and False or not False isA:Lets think step by step.Table 20|An example of BBH.34PROMPT以下是中國關于教育學考試的單項選擇題,請選出其中的正確答案。根據我國心理學家馮忠良教授的學習分類,培養學生品德要通過_。A.知識的學習B.技能的學習C.行為規范的學習D.態度的學習答案:C開設跨學科課程或建立跨學科專業體現了高等教育課程發展的_。A.綜合化趨勢B.多樣化趨勢C.人文化趨勢D.科學化趨勢答案:A心智技能的特點有_。A.物
292、質性、外顯性、簡縮性B.觀念性、內潛性、簡縮性C.物質性、外顯性、展開性D.觀念性、內潛性、展開性答案:B下列關于大學生的情緒與理智關系的說法中正確的是_。A.能冷靜控制自己情緒B.感情用事,難以用理智控制情緒C.遇事能堅持自己正確認識D.已發展到不為小事而發怒和慪氣答案:B在學完一篇邏輯結構嚴密的課文以后,勾畫出課文的論點論據的邏輯關系圖以幫助理解和記憶。這種學習方法屬于_。A.精細加工策略B.組織策略C.復述策略D.做筆記策略答案:B有學者強調,教育要根據一個民族固有的特征來定,這種觀點體現了_A.生產力對教育的影響和制約B.政治制度對教育的影響和制約C.文化對教育的影響和制約D.經濟制度
293、對教育的影響和制約答案:OPTIONS-A-B-C-DTable 21|An example of C-Eval.35PROMPT女:這些藥怎么吃?男:一天三次,一次兩片。請根據上文回答問題:他們在哪兒?答案:OPTIONS-商店-飯店-醫院-教室Table 22|An example of C3.PROMPT以下是將某句古詩文翻譯而成的現代表述:春天已至,萬物復蘇,春風如一位美麗而又心靈手巧的姑娘,邁著纖纖細步款款而來,她揮舞剪刀,盡情地展示那高超的女工技巧,她先裁出了柳葉,隨著柳條裊裊依依地舞蹈,又裁出杏葉,桃葉。該翻譯所對應的古詩文是:OPTIONS-春風騁巧如翦刀-剪裁無巧似春風-風吹
294、怨恨快如刀-春風欲擅秋風巧Table 23|An example of CCPM.36PROMPTQ:某 小 學 在“獻 愛 心為 汶 川 地 震 區 捐 款”活 動 中,六 年 級 五 個 班 共捐款8000元,其中一班捐款1500元,二班比一班多捐款200元,三班捐款1600元,四班與五班捐款數之比是3:5四班捐款多少元?A:一 班 捐 款1500元,而 二 班 比 一 班 多 捐200元,所 以 二 班 捐款1500+200=1700元,又知道六年級五個班一共捐款8000元,所以四班和五班捐款之和=一共捐款-一班和二班和三班捐款之和,即8000-1500-1700-1600=3200元,
295、而題目說四班與五班捐款數之比是3:5,則四班捐款了3200/(3+5)*3=1200元。所以答案是:1200。Q:小俊在東西大道上跑步,若規定向東為正。他先向東跑了800米,然后又跑了一段之后,他位于出發點西邊100米處,小俊第二段跑了多少米?A:小俊第二段跑完后位于出發點西邊,所以第二段應該是向西跑,第二段跑的長度-第一段跑的長度=100,第二段跑了100+800=900米。所以答案是:900。Q:A車和B車同時從甲、乙兩地相向開出,經過5小時相遇然后,它們又各自按原速原方向繼續行駛3小時,這時A車離乙地還有135千米,B車離甲地還有165千米甲、乙兩地相距多少千米?A:假設A車的速度為x千
296、米每小時,B車的速度為y千米每小時,根據而A、B相遇時A車行駛了5小時,A車行駛3小時后離乙地還有135千米,B車行駛3小時后距離甲地還有165千米,可以得到甲乙兩地相距=5x+5y=135+8x=165+8y,變換得到:10(x+y)=300+8(x+y),于是x+y=150,甲乙兩地相距5(x+y)=750千米。所以答案是:750。Q:在一個底面半徑為10厘米的圓柱形容器內,倒入10厘米深的水,然后將一個底面直徑4厘米,高6厘米的圓錐形鉛錘放入水中,容器中水面上升多少厘米?A:Table 24|An example of CMATH.37PROMPT以下是關于解剖學的單項選擇題,請直接給出
297、正確答案的選項。題目:壁胸膜的分部不包括A.肋胸膜B.肺胸膜C.膈胸膜D.胸膜頂答案是:B題目:屬于蝶骨上的結構為A.垂體窩B.棘孔C.破裂孔D.視神經管答案是:B題目:屬于右心房的結構是A.肉柱B.室上嵴C.乳頭肌D.梳狀肌答案是:D題目:咽的分部A.咽隱窩B.口咽部C.鼻咽部D.喉咽部答案是:C題目:舌下神經核位于A.間腦B.延髓C.中腦D.腦撟答案是:B題目:從腦干背側出腦的腦神經是A.副神經B.三叉神經C.舌下神經D.滑車神經答案是:OPTIONS-A-B-C-DTable 25|An example of CMMLU.38PROMPTPassage:The median age in
298、 the city was 22.1 years.10.1%of residents wereunder the age of 18;56.2%were between the ages of 18 and 24;16.1%werefrom 25 to 44;10.5%were from 45 to 64;and 7%were 65 years of age or older.The gender makeup of the city was 64.3%male and 35.7%female.Answer the following questions based on the above
299、passage,please calculatecarefully if calculation is necessary.Q:How many percent were not from 25 to 44?A:The answer type is number.So according to above Passage,the answer is83.9.Q:How many in percent werent 25 to 44?A:The answer type is number.So according to above Passage,the answer isTable 26|An
300、 example of DROP.PROMPT中新網12月7日電綜合外媒6日報道,在美國得克薩斯州,負責治療新冠肺炎患者的醫生約瑟夫瓦隆(Joseph Varon)已連續上班超260天,每天只睡不超過2小時。瓦隆日前接受采訪時呼吁,美國民眾應遵從防疫規定,一線的醫護人員“已OPTIONS-神清氣爽”。-詭計多端”。-精疲力竭”。-分工合作”。-寅吃卯糧”。-土豪劣紳”。-蕓蕓眾生”。Table 27|An example of CHID.39PROMPT胡雪巖離船登岸,坐轎進城,等王有齡到家,他接著也到了他那里,臉上是掩抑不住的笑容,王有齡夫婦都覺得奇怪,問他什么事這么高興。上面的句子中的他指
301、的是胡雪巖漸漸地,湯中凝結出一團團塊狀物,將它們撈起放進盆里冷卻,肥皂便出現在世上了。上面的句子中的它們指的是塊狀物“她序上明明引著JulesTellier的比喻,說有個生脫發病的人去理發,那剃頭的對他說不用剪發,等不了幾天,頭毛壓兒全掉光了;大部分現代文學也同樣的不值批評。這比喻還算俏皮?!鄙厦娴木渥又械乃傅氖巧摪l病的人在洛倫佐大街的盡頭處,矗立著著名的圣三一大教堂。它有著巨大的穹頂,還有明亮的彩色玻璃窗,上面描繪著舊約和新約的場景。上面的句子中的它指的是圣三一大教堂他伯父還有許多女弟子,大半是富商財主的外室;這些財翁白天忙著賺錢,怕小公館里的情婦長日無聊,要不安分,常常叫她們學點玩藝兒
302、消遣。上面的句子中的她們指的是情婦趙雨又拿出了一個杯子,我們熱情地請老王入座,我邊給他倒酒邊問:1962年的哪次記得嗎?“上面的句子中的他指的是Table 28|An example of CLUEWSC.40PROMPTQ:Max can mow the lawn in 40 minutes.If it takes him twice that long to fertilize thelawn,how long will it take him to both mow and fertilize the lawn?A:Lets think step by step.It takes Max
303、 2*40 minutes=80 minutes to fertilize thelawn.In total,Max takes 80 minutes+40 minutes=120 minutes to both mow andfertilize the lawn.The answer is 120.Q:The bagels cost$2.25 each,or a dozen for$24.How much is saved,per bagel,incents,by buying a dozen at a time?A:Lets think step by step.They cost 2.2
304、5*100=225 cents each.At the bulk rate,theyare 24/12=2 dollar each.They cost 2*100=200 cents each.225-200=25 cents are savedper bagel.The answer is 25.Q:Tim is 5 years old.His cousin,Rommel,is thrice as old as he is.His other cousin,Jenny,is 2 years older than Rommel.How many years younger is Tim tha
305、n Jenny?A:Lets think step by step.Rommel is 5 x 3=15 years old.Jenny is 15+2=17 yearsold.So,Tim is 17-5=12 years younger than Jenny.The answer is 12.Q:The school has 14 boys and 10 girls.If 4 boys and 3 girls drop out,how many boysand girls are left?A:Lets think step by step.There are 14 boys-4 boys
306、=10 boys left.There are 10 girls-3 girls=7 girls left.In total there are 10 boys+7 girls=17 boys and girls left.Theanswer is 17.Q:Building one birdhouse requires 7 planks and 20 nails.If 1 nail costs 0.05,and oneplank costs 3,what is the cost,in dollars,to build 4 birdhouses?A:Lets think step by ste
307、p.The cost of the planks for one birdhouse is 7*3=21.Andthe nails are a cost of 20*0.05=1 for each birdhouse.So to build one birdhouse onewill need 21+1=22.So the cost of building 4 birdhouses is at 4*22=88.The answeris 88.Q:Danny brings 3 watermelons to his family picnic.He cuts each watermelon int
308、o 10slices.His sisterbrings 1 watermelon to thefamily picnic,and she cuts the watermeloninto 15 slices.How many watermelon slices are there in total at the picnic?A:Lets think step by step.From Danny,there are 3*10=30 watermelon slices.Fromhis sister,there are 1*15=15 watermelon slices.There are a t
309、otal of 30+15=45watermelon slices.The answer is 45.Q:Angela is a bike messenger in New York.She needs to deliver 8 times as manypackages as meals.If she needs to deliver 27 meals and packages combined,howmany meals does she deliver?A:Lets think step by step.Let p be the number of packages Angela del
310、ivers andm be the number of meals.We know that p+m=27 and p=8m.Substituting thesecond equation into the first equation,we get 8m+m=27.Combining like terms,we get 9m=27.Dividing both sides by 9,we get m=3.The answer is 3.Q:Cori is 3 years old today.In 5 years,she will be one-third the age of her aunt
311、.Howold is her aunt today?A:Lets think step by step.In 5 years,Cori will be 3+5=8 years old.In 5 years,Coris aunt will be 8 x 3=24 years old.Today,her aunt is 24-5=19 years old.Theanswer is 19.Q:Indras has 6 letters in her name.Her sisters name has 4 more letters than half ofthe letters in Indras na
312、me.How many letters are in Indras and her sisters names?A:Lets think step by step.Table 29|An example of GSM8K.41PROMPTPlaying piano:A man is seated at a piano.HeOPTIONS-is playing the piano with his hands and his face.-bigins to play a song by timbaland on the piano.-plays slowly,and pauses to snap
his fingers.
-is playing a song in front of him.

Table 30 | An example of HellaSwag.

PROMPT
def starts_one_ends(n):
    """
    Given a positive integer n, return the count of the numbers of n-digit
    positive integers that start or end with 1.
    """

Table 31 | An example of HumanEval.

PROMPT
Problem: Find the domain of the expression $\frac{\sqrt{x-2}}{\sqrt{5-x}}$.
Solution: The expressions inside each square root must be non-negative. Therefore, $x-2 \ge 0$, so $x \ge 2$, and $5-x \ge 0$, so $x \le 5$. Also, the denominator cannot be equal to zero, so $5-x > 0$, which gives $x < 5$. Therefore, the domain of the expression is $\boxed{[2,5)}$.
Final Answer: The final answer is $[2,5)$. I hope it is correct.

Problem: If $\det \mathbf{A} = 2$ and $\det \mathbf{B} = 12,$ then find $\det(\mathbf{A} \mathbf{B}).$
Solution: We have that $\det(\mathbf{A} \mathbf{B}) = (\det \mathbf{A})(\det \mathbf{B}) = (2)(12) = \boxed{24}.$
Final Answer: The final answer is $24$. I hope it is correct.

Problem: Terrell usually lifts two 20-pound weights 12 times. If he uses two 15-pound weights instead, how many times must Terrell lift them in order to lift the same total weight?
Solution: If Terrell lifts two 20-pound weights 12 times, he lifts a total of $2\cdot 12\cdot 20=480$ pounds of weight. If he lifts two 15-pound weights instead for $n$ times, he will lift a total of $2\cdot 15\cdot n=30n$ pounds of weight. Equating this to 480 pounds, we can solve for $n$:
\begin{align*}
30n&=480\\
\Rightarrow\qquad n&=480/30=\boxed{16}
\end{align*}
Final Answer: The final answer is $16$. I hope it is correct.

Problem: If the system of equations
\begin{align*}
6x-4y&=a,\\
6y-9x&=b.
\end{align*}
has a solution $(x, y)$ where $x$ and $y$ are both nonzero, find $\frac{a}{b},$ assuming $b$ is nonzero.
Solution: If we multiply the first equation by $-\frac{3}{2}$, we obtain $6y-9x=-\frac{3}{2}a.$ Since we also know that $6y-9x=b$, we have $-\frac{3}{2}a=b\Rightarrow\frac{a}{b}=\boxed{-\frac{2}{3}}.$
Final Answer: The final answer is $-\frac{2}{3}$. I hope it is correct.

Problem: Evaluate $\log_2 1$.
Solution:

Table 32 | An example of MATH.

PROMPT
You are an expert Python programmer, and here is your task: Write a function to find the similar elements from the given two tuple lists. Your code should pass these tests:
assert similar_elements((3, 4, 5, 6), (5, 7, 4, 10)) == (4, 5)
assert similar_elements((1, 2, 3, 4), (5, 4, 3, 7)) == (3, 4)
assert similar_elements((11, 12, 14, 13), (17, 15, 14, 13)) == (13, 14)
BEGIN
def similar_elements(test_tup1, test_tup2):
    res = tuple(set(test_tup1) & set(test_tup2))
    return (res)
DONE

You are an expert Python programmer, and here is your task: Write a python function to identify non-prime numbers. Your code should pass these tests:
assert is_not_prime(2) == False
assert is_not_prime(10) == True
assert is_not_prime(35) == True
BEGIN
import math
def is_not_prime(n):
    result = False
    for i in range(2, int(math.sqrt(n)) + 1):
        if n % i == 0:
            result = True
    return result
DONE

You are an expert Python programmer, and here is your task: Write a function to find the largest integers from a given list of numbers using heap queue algorithm. Your code should pass these tests:
assert heap_queue_largest([25, 35, 22, 85, 14, 65, 75, 22, 58], 3) == [85, 75, 65]
assert heap_queue_largest([25, 35, 22, 85, 14, 65, 75, 22, 58], 2) == [85, 75]
assert heap_queue_largest([25, 35, 22, 85, 14, 65, 75, 22, 58], 5) == [85, 75, 65, 58, 35]
BEGIN
import heapq as hq
def heap_queue_largest(nums, n):
    largest_nums = hq.nlargest(n, nums)
    return largest_nums
DONE

You are an expert Python programmer, and here is your task: Write a function to return the sum of all divisors of a number. Your code should pass these tests:
assert sum_div(8) == 7
assert sum_div(12) == 16
assert sum_div(7) == 1
BEGIN

Table 33 | An example of MBPP.

PROMPT
The following are multiple choice questions (with answers) about miscellaneous.

How many axles does a standard automobile have?
A. one
B. two
C. four
D. eight
Answer: B

What place is named in the title of the 1979 li
325、ve album by rock legends CheapTrick?A.BudapestB.BudokanC.BhutanD.BritainAnswer:BWho is the shortest man to ever win an NBA slam dunk competition?A.Anthony Spud WebbB.Michael Air JordanC.Tyrone Muggsy BoguesD.Julius Dr J ErvingAnswer:AWhat is produced during photosynthesis?A.hydrogenB.nylonC.oxygenD.
326、lightAnswer:CWhich of these songs was a Top 10 hit for the rock band The Police?A.Radio Ga-GaB.Ob-la-di Ob-la-daC.De Do Do Do De Da Da DaD.In-a-Gadda-Da-VidaAnswer:CWhich of the Three Stooges was not related to the others?A.MoeB.LarryC.CurlyD.ShempAnswer:OPTIONS-A-B-C-DTable 34|An example of MMLU.45
327、PROMPTAnswer these questions:Q:Who is hosting the fifa world cup in 2022?A:QatarQ:Who won the first women s fifa world cup?A:United StatesQ:When did miami vice go off the air?A:1989Q:Who wrote the song shout to the lord?A:Darlene ZschechQ:Who was thrown in the lion s den?A:DanielQ:What is the meanin
328、g of the name habib?A:Table 35|An example of NaturalQuestions.PROMPTA woman notices that she is depressed every autumn,and wonders why.Afriend suggests to her that perhaps certain changes that take place as seasonsmove from warm to cold may be having an effect on her.When pressed for anexample of th
329、ese changes,the friend citesOPTIONS-flowers blooming-grass turning brown-trees growing-blossoms bloomingTable 36|An example of OpenBookQA.PROMPTTo make it easier to push the reset button of the garbage disposable machinewhich is located underneath the machine,OPTIONS-place a wall mirror on the floor
330、 of the cabinet-hold a hand mirror under the garbage disposable machineTable 37|An example of PIQA.46PROMPTArticle:When you read an article you will understand and remember it better if you can work outhow the writer has put the ideas together.Sometimes a writer puts ideas together by askingquestion
331、s and then answering them.For example,if the article is about groundhogs,the setof questions in the writers head might be:What does a groundhog look like?Where do groundhogs live?What do they eat?.In the article,the author might answer those questions.Sometimes an author writes out her questions in
332、the article.These questions give yousignals.They tell you what the author is going to write next.Often an author has a questionin her head but she doesnt write it out for you.You have to work out her question foryourself.Heres a sample reading for you to practice this method.EarthwormsDo you know ho
333、w many kinds of earthworms there are?There are about 1800 kinds in theworld!They can be brown,purple,green.They can be as small as 3 cm long and as large as3 m long.The best time to see earthworms is at night,especially a cool,damp night.Thats whenthey come up from their burrows to hunt for food.Earthworms dont like to be in the sun.Thats because they breathe through their skin,and they cant breat