DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI

Abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. The model checkpoints are available at https:/

Figure 1 | (a) MMLU accuracy vs. activated parameters (billions), among different open-source models (the LLaMA 1/2/3, Mixtral, Command R, and Qwen1.5 families, Mistral 7B, Grok-1, DBRX, DeepSeek 67B, and DeepSeek-V2). (b) Training costs and inference efficiency of DeepSeek 67B (Dense) and DeepSeek-V2: training costs (K GPU Hours/T Tokens, saving 42.5%), KV cache for generation (KB/Token, reduced by 93.3%), and maximum generation throughput (Tokens/Sec, 576% of DeepSeek 67B).

arXiv:2405.04434v5 [cs.CL] 19 Jun 2024
Contents

1 Introduction
2 Architecture
  2.1 Multi-Head Latent Attention: Boosting Inference Efficiency
    2.1.1 Preliminaries: Standard Multi-Head Attention
    2.1.2 Low-Rank Key-Value Joint Compression
    2.1.3 Decoupled Rotary Position Embedding
    2.1.4 Comparison of Key-Value Cache
  2.2 DeepSeekMoE: Training Strong Models at Economical Costs
    2.2.1 Basic Architecture
    2.2.2 Device-Limited Routing
    2.2.3 Auxiliary Loss for Load Balance
    2.2.4 Token-Dropping Strategy
3 Pre-Training
  3.1 Experimental Setups
    3.1.1 Data Construction
    3.1.2 Hyper-Parameters
    3.1.3 Infrastructures
    3.1.4 Long Context Extension
  3.2 Evaluations
    3.2.1 Evaluation Benchmarks
    3.2.2 Evaluation Results
    3.2.3 Training and Inference Efficiency
4 Alignment
  4.1 Supervised Fine-Tuning
  4.2 Reinforcement Learning
  4.3 Evaluation Results
  4.4 Discussion
5 Conclusion, Limitation, and Future Work
A Contributions and Acknowledgments
B DeepSeek-V2-Lite: A 16B Model Equipped with MLA and DeepSeekMoE
  B.1 Model Description
  B.2 Performance Evaluation
C Full Formulas of MLA
D Ablation of Attention Mechanisms
  D.1 Ablation of MHA, GQA, and MQA
  D.2 Comparison Between MLA and MHA
E Discussion About Pre-Training Data Debiasing
F Additional Evaluations on Math and Code
G Evaluation Formats

1. Introduction

In the past few years, Large Language Models (LLMs) (Anthropic, 2023; Google, 2023; OpenAI, 2022, 2023) have undergone rapid development, offering a glimpse into the dawn of Artificial General Intelligence (AGI). In general, the int
13、elligence of an LLM tends to improve as the numberof parameters increases,allowing it to exhibit emergent capabilities across various tasks(Weiet al.,2022).However,the improvement comes at the cost of larger computing resources fortraining and a potential decrease in inference throughput.These const
14、raints present significantchallenges that impede the widespread adoption and utilization of LLMs.In order to tackle thisproblem,we introduce DeepSeek-V2,a strong open-source Mixture-of-Experts(MoE)languagemodel,characterized by economical training and efficient inference through an innovativeTransfo
15、rmer architecture.It is equipped with a total of 236B parameters,of which 21B areactivated for each token,and supports a context length of 128K tokens.We optimize the attention modules and Feed-Forward Networks(FFNs)within the Trans-former framework(Vaswani et al.,2017)with our proposed Multi-head L
16、atent Attention(MLA)and DeepSeekMoE.(1)In the context of attention mechanisms,the Key-Value(KV)cacheof the Multi-Head Attention(MHA)(Vaswani et al.,2017)poses a significant obstacle to theinference efficiency of LLMs.Various approaches have been explored to address this issue,including Grouped-Query
17、 Attention(GQA)(Ainslie et al.,2023)and Multi-Query Attention(MQA)(Shazeer,2019).However,these methods often compromise performance in their attemptto reduce the KV cache.In order to achieve the best of both worlds,we introduce MLA,anattention mechanism equipped with low-rank key-value joint compres
18、sion.Empirically,MLAachieves superior performance compared with MHA,and meanwhile significantly reducesthe KV cache during inference,thus boosting the inference efficiency.(2)For Feed-ForwardNetworks(FFNs),we follow the DeepSeekMoE architecture(Dai et al.,2024),which adoptsfine-grained expert segmen
19、tation and shared expert isolation for higher potential in expertspecialization.The DeepSeekMoE architecture demonstrates great advantages compared withconventional MoE architectures like GShard(Lepikhin et al.,2021),enabling us to train strongmodels at an economical cost.As we employ expert paralle
20、lism during training,we also devisesupplementary mechanisms to control communication overheads and ensure load balance.By combining these two techniques,DeepSeek-V2 features strong performance(Figure 1(a),economical training costs,and efficient inference throughput(Figure 1(b),simultaneously.We cons
21、truct a high-quality and multi-source pre-training corpus consisting of 8.1T tokens.ComparedwiththecorpususedinDeepSeek67B(ourpreviousrelease)(DeepSeek-AI,2024),thiscorpus features an extended amount of data,especially Chinese data,and higher data quality.Wefirst pretrain DeepSeek-V2 on the full pre
22、-training corpus.Then,we collect 1.5M conversationalsessions,which encompass various domains such as math,code,writing,reasoning,safety,andmore,to perform Supervised Fine-Tuning(SFT)for DeepSeek-V2 Chat(SFT).Finally,we followDeepSeekMath(Shao et al.,2024)to employ Group Relative Policy Optimization(
23、GRPO)tofurther align the model with human preference and produce DeepSeek-V2 Chat(RL).We evaluate DeepSeek-V2 on a wide range of benchmarks in English and Chinese,andcompare it with representative open-source models.Evaluation results show that even with only21B activated parameters,DeepSeek-V2 stil
24、l achieves top-tier performance among open-sourcemodels and becomes the strongest open-source MoE language model.Figure 1(a)highlightsthat,on MMLU,DeepSeek-V2 achieves top-ranking performance with only a small numberof activated parameters.In addition,as shown in Figure 1(b),compared with DeepSeek 6
7B, DeepSeek-V2 saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times.

Figure 2 | Illustration of the architecture of DeepSeek-V2. MLA ensures efficient inference by significantly reducing the KV cache for generation, and DeepSeekMoE enables training strong models at an economical cost through the sparse architecture.

We also evaluate DeepSeek-V2 Chat (SFT) and DeepSeek-V2 Chat (RL) on open-ended benchmarks. Notably, DeepSeek-V2 Chat (RL) achieves 38.9 length-controlled win rate on AlpacaEval 2.0 (Dubois et al., 2024), 8.97 overall score on MT-
28、Bench(Zheng et al.,2023),and 7.91 overall score on AlignBench(Liu et al.,2023).TheEnglish open-ended conversation evaluations demonstrate that DeepSeek-V2 Chat(RL)has top-tier performance among open-source chat models.In addition,the evaluation on AlignBenchindicates that in Chinese,DeepSeek-V2 Chat
29、(RL)outperforms all of open-source models,andeven beats most of closed-source models.In order to facilitate further research and development on MLA and DeepSeekMoE,we alsorelease DeepSeek-V2-Lite,a smaller model equipped with MLA and DeepSeekMoE,for theopen-source community.It has a total of 15.7B p
30、arameters,where 2.4B are activated for eachtoken.Detailed descriptions about DeepSeek-V2-Lite can be found in Appendix B.In the rest of this paper,we first provide a detailed description of the model architecture ofDeepSeek-V2(Section 2).Subsequently,we introduce our pre-training endeavors,including
31、 thetraining data construction,hyper-parameter settings,infrastructures,long context extension,and the evaluation of model performance and efficiency(Section 3).Following this,we demon-strate our efforts in alignment,encompassing Supervised Fine-Tuning(SFT),Reinforcement5Learning(RL),the evaluation
32、results,and other discussion(Section 4).Finally,we summarizethe conclusion,deliberate on the current limitations of DeepSeek-V2,and outline our futurework(Section 5).2.ArchitectureBy and large,DeepSeek-V2 is still in the Transformer architecture(Vaswani et al.,2017),whereeach Transformer block consi
sts of an attention module and a Feed-Forward Network (FFN). However, for both the attention module and the FFN, we design and employ innovative architectures. For attention, we design MLA, which utilizes low-rank key-value joint compression to eliminate the bottleneck of inference-time key-value cache, thus supporting efficient inference. For FFNs, we adopt the DeepSeekMoE architecture (Dai et al., 2024), a high-performance MoE architecture that enables training strong models at an economical cost. An illustration of the architecture of DeepSeek-V2 is presented in Figure 2, and we will introduce the details of MLA and DeepSeekMoE in this section. For other tiny details (e.g., layer normalization and the activation function in FFNs), unless specifically stated, DeepSeek-V2 follows the settings of DeepSeek 67B (DeepSeek-AI, 2024).

2.1. Multi-Head Latent Attention: Boosting Inference Efficiency

Conventional Transformer models usually adopt Multi-Head Attention (MHA) (Vaswani et al., 2017), but during generation, its heavy Key-Value (KV) cache becomes the bottleneck that limits the inference efficiency. In order to reduce the KV cache, Multi-Query Attention (MQA) (Shazeer, 2019) and Grouped-Query Attention (GQA) (Ainslie et al., 2023) have been proposed. They require a smaller amount of KV cache, but their performance does not match MHA (we provide an ablation of MHA, GQA, and MQA in Appendix D.1).

For DeepSeek-V2, we design an innovative attention mechanism called Multi-head Latent Attention (MLA). Equipped with low-rank key-value joint compression, MLA achieves better performance than MHA, but requires a significantly smaller amount of KV cache. We introduce its architecture in the following, and also provide a comparison between MLA and MHA in Appendix D.2.

2.1.1. Preliminaries: Standard Multi-Head Attention
We first introduce the standard MHA mechanism as background. Let $d$ be the embedding dimension, $n_h$ be the number of attention heads, $d_h$ be the dimension per head, and $\mathbf{h}_t \in \mathbb{R}^{d}$ be the attention input of the $t$-th token at an attention layer. Standard MHA first produces $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t \in \mathbb{R}^{d_h n_h}$ through three matrices $W^Q, W^K, W^V \in \mathbb{R}^{d_h n_h \times d}$, respectively:

$$\mathbf{q}_t = W^Q \mathbf{h}_t, \quad (1)$$
$$\mathbf{k}_t = W^K \mathbf{h}_t, \quad (2)$$
$$\mathbf{v}_t = W^V \mathbf{h}_t. \quad (3)$$

Then, $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t$ will be sliced into $n_h$ heads for the multi-head attention computation:

$$[\mathbf{q}_{t,1}; \mathbf{q}_{t,2}; \dots; \mathbf{q}_{t,n_h}] = \mathbf{q}_t, \quad (4)$$
$$[\mathbf{k}_{t,1}; \mathbf{k}_{t,2}; \dots; \mathbf{k}_{t,n_h}] = \mathbf{k}_t, \quad (5)$$
$$[\mathbf{v}_{t,1}; \mathbf{v}_{t,2}; \dots; \mathbf{v}_{t,n_h}] = \mathbf{v}_t, \quad (6)$$
$$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \mathrm{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{\top}\mathbf{k}_{j,i}}{\sqrt{d_h}}\right)\mathbf{v}_{j,i}, \quad (7)$$
$$\mathbf{u}_t = W^O[\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \dots; \mathbf{o}_{t,n_h}], \quad (8)$$

where $\mathbf{q}_{t,i}, \mathbf{k}_{t,i}, \mathbf{v}_{t,i} \in \mathbb{R}^{d_h}$ denote the query, key, and value of the $i$-th attention head, respectively, and $W^O \in \mathbb{R}^{d \times d_h n_h}$ denotes the output projection matrix. During inference, all keys and values need to be cached to accelerate inference, so MHA needs to cache $2 n_h d_h l$ elements for each token, where $l$ denotes the number of layers. In model deployment, this heavy KV cache is a large bottleneck that limits the maximum batch size and sequence length.

Figure 3 | Simplified illustration of Multi-Head Attention (MHA), Grouped-Query Attention (GQA), Multi-Query Attention (MQA), and Multi-head Latent Attention (MLA). Through jointly compressing the keys and values into a latent vector, MLA significantly reduces the KV cache during inference.
2.1.2. Low-Rank Key-Value Joint Compression

The core of MLA is the low-rank joint compression for keys and values to reduce the KV cache:

$$\mathbf{c}_t^{KV} = W^{DKV}\mathbf{h}_t, \quad (9)$$
$$\mathbf{k}_t^{C} = W^{UK}\mathbf{c}_t^{KV}, \quad (10)$$
$$\mathbf{v}_t^{C} = W^{UV}\mathbf{c}_t^{KV}, \quad (11)$$

where $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values; $d_c$ ($\ll d_h n_h$) denotes the KV compression dimension; $W^{DKV} \in \mathbb{R}^{d_c \times d}$ is the down-projection matrix; and $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively. During inference, MLA only needs to cache $\mathbf{c}_t^{KV}$, so its KV cache has only $d_c l$ elements, where $l$ denotes the number of layers. In addition, during inference, since $W^{UK}$ can be absorbed into $W^Q$ and $W^{UV}$ can be absorbed into $W^O$, we do not even need to compute keys and values out explicitly for attention. Figure 3 intuitively illustrates how the KV joint compression in MLA reduces the KV cache.

Moreover, in order to reduce the activation memory during training, we also perform low-rank compression for the queries, even though it cannot reduce the KV cache:

$$\mathbf{c}_t^{Q} = W^{DQ}\mathbf{h}_t, \quad (12)$$
$$\mathbf{q}_t^{C} = W^{UQ}\mathbf{c}_t^{Q}, \quad (13)$$

where $\mathbf{c}_t^{Q} \in \mathbb{R}^{d_c'}$ is the compressed latent vector for queries; $d_c'$ ($\ll d_h n_h$) denotes the query compression dimension; and $W^{DQ} \in \mathbb{R}^{d_c' \times d}$ and $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'}$ are the down-projection and up-projection matrices for queries, respectively.
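To make the projections above concrete, the following PyTorch-style sketch wires up Equations (9)-(13). It is a minimal illustration under our own naming and shape choices (the class name, the use of plain nn.Linear layers, and defaults mirroring the hyper-parameters reported in Section 3.1.2); it is not the released DeepSeek-V2 implementation, which additionally absorbs $W^{UK}$ into $W^Q$ and $W^{UV}$ into $W^O$ at inference time.

```python
import torch
import torch.nn as nn

class MLAKVCompression(nn.Module):
    """Sketch of MLA's low-rank joint compression (Eqs. 9-13).

    At inference time, only the compressed latent c_kv (plus the small shared
    decoupled-RoPE key introduced in Section 2.1.3) needs to be cached.
    """

    def __init__(self, d_model=5120, n_heads=128, d_head=128, d_c=512, d_c_q=1536):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.w_dkv = nn.Linear(d_model, d_c, bias=False)            # W^{DKV}, Eq. (9)
        self.w_dq = nn.Linear(d_model, d_c_q, bias=False)           # W^{DQ},  Eq. (12)
        self.w_uk = nn.Linear(d_c, n_heads * d_head, bias=False)    # W^{UK},  Eq. (10)
        self.w_uv = nn.Linear(d_c, n_heads * d_head, bias=False)    # W^{UV},  Eq. (11)
        self.w_uq = nn.Linear(d_c_q, n_heads * d_head, bias=False)  # W^{UQ},  Eq. (13)

    def forward(self, h):                      # h: [batch, seq_len, d_model]
        c_kv = self.w_dkv(h)                   # cached during inference
        c_q = self.w_dq(h)                     # queries are recomputed, never cached
        k, v, q = self.w_uk(c_kv), self.w_uv(c_kv), self.w_uq(c_q)
        split = lambda x: x.view(*x.shape[:2], self.n_heads, self.d_head)
        return split(q), split(k), split(v), c_kv
```

Caching c_kv instead of per-head keys and values is what shrinks the per-token cache from $2 n_h d_h l$ elements to $(d_c + d_h^R)l$ elements, as quantified in Section 2.1.4.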
2.1.3. Decoupled Rotary Position Embedding

Following DeepSeek 67B (DeepSeek-AI, 2024), we intend to use the Rotary Position Embedding (RoPE) (Su et al., 2024) for DeepSeek-V2. However, RoPE is incompatible with low-rank KV compression. To be specific, RoPE is position-sensitive for both keys and queries. If we apply RoPE to the keys $\mathbf{k}_t^{C}$, $W^{UK}$ in Equation 10 will be coupled with a position-sensitive RoPE matrix. In this way, $W^{UK}$ cannot be absorbed into $W^Q$ any more during inference, since a RoPE matrix related to the currently generating token will lie between $W^Q$ and $W^{UK}$, and matrix multiplication does not obey a commutative law. As a result, we must recompute the keys for all the prefix tokens during inference, which will significantly hinder the inference efficiency.

As a solution, we propose the decoupled RoPE strategy that uses additional multi-head queries $\mathbf{q}_{t,i}^{R} \in \mathbb{R}^{d_h^R}$ and a shared key $\mathbf{k}_t^{R} \in \mathbb{R}^{d_h^R}$ to carry RoPE, where $d_h^R$ denotes the per-head dimension of the decoupled queries and key. Equipped with the decoupled RoPE strategy, MLA performs the following computation:

$$[\mathbf{q}_{t,1}^{R}; \mathbf{q}_{t,2}^{R}; \dots; \mathbf{q}_{t,n_h}^{R}] = \mathbf{q}_t^{R} = \mathrm{RoPE}(W^{QR}\mathbf{c}_t^{Q}), \quad (14)$$
$$\mathbf{k}_t^{R} = \mathrm{RoPE}(W^{KR}\mathbf{h}_t), \quad (15)$$
$$\mathbf{q}_{t,i} = [\mathbf{q}_{t,i}^{C}; \mathbf{q}_{t,i}^{R}], \quad (16)$$
$$\mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^{C}; \mathbf{k}_t^{R}], \quad (17)$$
$$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \mathrm{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{\top}\mathbf{k}_{j,i}}{\sqrt{d_h + d_h^R}}\right)\mathbf{v}_{j,i}^{C}, \quad (18)$$
$$\mathbf{u}_t = W^O[\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \dots; \mathbf{o}_{t,n_h}], \quad (19)$$

where $W^{QR} \in \mathbb{R}^{d_h^R n_h \times d_c'}$ and $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ are the matrices that produce the decoupled queries and key, respectively; $\mathrm{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices; and $[\cdot;\cdot]$ denotes the concatenation operation. During inference, the decoupled key should also be cached. Therefore, DeepSeek-V2 requires a total KV cache containing $(d_c + d_h^R)l$ elements.

In order to demonstrate the complete computation process of MLA, we also organize and provide its full formulas in Appendix C.
2.1.4. Comparison of Key-Value Cache

We demonstrate a comparison of the KV cache per token among different attention mechanisms in Table 1. MLA requires only a small amount of KV cache, equal to GQA with only 2.25 groups, but can achieve stronger performance than MHA.

| Attention Mechanism | KV Cache per Token (# Element) | Capability |
| Multi-Head Attention (MHA) | $2 n_h d_h l$ | Strong |
| Grouped-Query Attention (GQA) | $2 n_g d_h l$ | Moderate |
| Multi-Query Attention (MQA) | $2 d_h l$ | Weak |
| MLA (Ours) | $(d_c + d_h^R)l \approx \frac{9}{2} d_h l$ | Stronger |

Table 1 | Comparison of the KV cache per token among different attention mechanisms. $n_h$ denotes the number of attention heads, $d_h$ denotes the dimension per attention head, $l$ denotes the number of layers, $n_g$ denotes the number of groups in GQA, and $d_c$ and $d_h^R$ denote the KV compression dimension and the per-head dimension of the decoupled queries and key in MLA, respectively. The amount of KV cache is measured by the number of elements, regardless of the storage precision. For DeepSeek-V2, $d_c$ is set to $4 d_h$ and $d_h^R$ is set to $\frac{d_h}{2}$. So, its KV cache is equal to GQA with only 2.25 groups, but its performance is stronger than MHA.
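The arithmetic behind Table 1 is easy to verify. The snippet below is an illustrative helper (the function name and defaults are ours), evaluated with the DeepSeek-V2 settings quoted in Section 3.1.2 ($n_h = 128$, $d_h = 128$, $l = 60$, $d_c = 4 d_h$, $d_h^R = d_h/2$).

```python
def kv_cache_elements_per_token(n_heads, d_head, n_layers,
                                n_groups=None, d_c=None, d_r=None, mechanism="mha"):
    """Cached elements per token for the attention variants in Table 1."""
    if mechanism == "mha":
        return 2 * n_heads * d_head * n_layers
    if mechanism == "gqa":
        return 2 * n_groups * d_head * n_layers
    if mechanism == "mqa":
        return 2 * d_head * n_layers
    if mechanism == "mla":
        return (d_c + d_r) * n_layers
    raise ValueError(f"unknown mechanism: {mechanism}")

d_h, n_h, l = 128, 128, 60
mla = kv_cache_elements_per_token(n_h, d_h, l, d_c=4 * d_h, d_r=d_h // 2, mechanism="mla")
mha = kv_cache_elements_per_token(n_h, d_h, l, mechanism="mha")
print(mla / (2 * d_h * l))  # 2.25 -> same cache as GQA with 2.25 groups
print(mha / mla)            # ~56.9 -> MHA would cache ~57x more elements at n_h = 128
```

Note that the 93.3% KV cache reduction reported in the abstract is measured against DeepSeek 67B, whose head and layer counts differ from the numbers used in this check.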
2.2. DeepSeekMoE: Training Strong Models at Economical Costs

2.2.1. Basic Architecture

For FFNs, we employ the DeepSeekMoE architecture (Dai et al., 2024). DeepSeekMoE has two key ideas: segmenting experts into finer granularity for higher expert specialization and more accurate knowledge acquisition, and isolating some shared experts for mitigating knowledge redundancy among routed experts. With the same number of activated and total expert parameters, DeepSeekMoE can outperform conventional MoE architectures like GShard (Lepikhin et al., 2021) by a large margin.

Let $\mathbf{u}_t$ be the FFN input of the $t$-th token; we compute the FFN output $\mathbf{h}_t'$ as follows:

$$\mathbf{h}_t' = \mathbf{u}_t + \sum_{i=1}^{N_s} \mathrm{FFN}_i^{(s)}(\mathbf{u}_t) + \sum_{i=1}^{N_r} g_{i,t}\,\mathrm{FFN}_i^{(r)}(\mathbf{u}_t), \quad (20)$$
$$g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \mathrm{Topk}(\{s_{j,t} \mid 1 \leq j \leq N_r\}, K_r), \\ 0, & \text{otherwise}, \end{cases} \quad (21)$$
$$s_{i,t} = \mathrm{Softmax}_i\!\left(\mathbf{u}_t^{\top}\mathbf{e}_i\right), \quad (22)$$

where $N_s$ and $N_r$ denote the numbers of shared experts and routed experts, respectively; $\mathrm{FFN}_i^{(s)}(\cdot)$ and $\mathrm{FFN}_i^{(r)}(\cdot)$ denote the $i$-th shared expert and the $i$-th routed expert, respectively; $K_r$ denotes the number of activated routed experts; $g_{i,t}$ is the gate value for the $i$-th expert; $s_{i,t}$ is the token-to-expert affinity; $\mathbf{e}_i$ is the centroid of the $i$-th routed expert in this layer; and $\mathrm{Topk}(\cdot, K_r)$ denotes the set comprising the $K_r$ highest scores among the affinity scores calculated for the $t$-th token and all routed experts.
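A minimal sketch of the routing in Equations (20)-(22) is given below. Expert modules are passed in as callables, and the dense per-expert loop is purely for readability; these are illustrative assumptions of the sketch, since real MoE kernels dispatch tokens to experts in a batched, parallel fashion.

```python
import torch
import torch.nn.functional as F

def deepseekmoe_ffn(u, shared_experts, routed_experts, centroids, k_r):
    """Sketch of Eqs. (20)-(22): shared experts always fire; the top-K_r routed
    experts are selected by softmax affinity between the token and expert centroids.

    u:         [n_tokens, d_model]
    centroids: [n_routed, d_model]  (e_i in Eq. 22)
    """
    affinity = F.softmax(u @ centroids.t(), dim=-1)       # s_{i,t}, Eq. (22)
    gate = torch.zeros_like(affinity)
    top_val, top_idx = affinity.topk(k_r, dim=-1)         # Topk(., K_r), Eq. (21)
    gate.scatter_(-1, top_idx, top_val)                   # g_{i,t}

    out = u.clone()                                       # residual term in Eq. (20)
    for expert in shared_experts:                         # always-active shared experts
        out = out + expert(u)
    for i, expert in enumerate(routed_experts):           # sparse routed experts
        mask = gate[:, i] > 0
        if mask.any():
            out[mask] = out[mask] + gate[mask, i, None] * expert(u[mask])
    return out
```

With $N_s = 2$ shared experts, $N_r = 160$ routed experts, and $K_r = 6$ (the DeepSeek-V2 configuration reported in Section 3.1.2), only a small fraction of expert parameters is touched per token.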
59、ken andall routed experts.2.2.2.Device-Limited RoutingWe design a device-limited routing mechanism to bound MoE-related communication costs.When expert parallelism is employed,the routed experts will be distributed across multipledevices.For each token,its MoE-related communication frequency is prop
60、ortional to thenumber of devices covered by its target experts.Due to the fine-grained expert segmentation inDeepSeekMoE,the number of activated experts can be large,so the MoE-related communicationwill be more costly if we apply expert parallelism.9For DeepSeek-V2,beyond the naive top-K selection o
61、f routed experts,we additionally ensurethat the target experts of each token will be distributed on at mostdevices.To be specific,foreach token,we first selectdevices that have experts with the highest affinity scores in them.Then,we perform top-K selection among experts on thesedevices.In practice,
62、we find thatwhen3,the device-limited routing can achieve a good performance roughly aligned withthe unrestricted top-K routing.2.2.3.Auxiliary Loss for Load BalanceWe take the load balance into consideration for automatically learned routing strategies.Firstly,unbalanced load will raise the risk of
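The device-limited selection can be viewed as a two-stage top-k, as in the sketch below. It assumes a dense [tokens, experts] affinity matrix and a static expert-to-device placement, which are simplifications of ours; the actual expert-parallel implementation is more involved.

```python
import torch

def device_limited_topk(affinity, expert_to_device, n_devices, m_devices, k_r):
    """Two-stage routing: keep only experts on the M devices with the highest
    per-device affinity for each token, then take the usual top-K_r experts.

    affinity:         [n_tokens, n_experts] token-to-expert scores (s_{i,t})
    expert_to_device: [n_experts] device id of each routed expert
    """
    n_tokens, n_experts = affinity.shape
    # Best affinity available on each device, per token.
    device_score = affinity.new_full((n_tokens, n_devices), float("-inf"))
    device_score.scatter_reduce_(1, expert_to_device.expand(n_tokens, -1),
                                 affinity, reduce="amax")
    # Keep the M best devices for each token.
    kept_devices = device_score.topk(m_devices, dim=-1).indices      # [n_tokens, M]
    allowed = torch.zeros(n_tokens, n_devices, dtype=torch.bool)
    allowed.scatter_(1, kept_devices, True)
    # Mask out experts on non-selected devices, then do the standard top-K.
    masked = affinity.masked_fill(~allowed[:, expert_to_device], float("-inf"))
    return masked.topk(k_r, dim=-1).indices                          # selected expert ids
```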
2.2.3. Auxiliary Loss for Load Balance

We take the load balance into consideration for automatically learned routing strategies. Firstly, unbalanced load will raise the risk of routing collapse (Shazeer et al., 2017), preventing some experts from being fully trained and utilized. Secondly, when expert parallelism is employed, unbalanced load will diminish computation efficiency. During the training of DeepSeek-V2, we design three kinds of auxiliary losses, for controlling expert-level load balance ($\mathcal{L}_{\mathrm{ExpBal}}$), device-level load balance ($\mathcal{L}_{\mathrm{DevBal}}$), and communication balance ($\mathcal{L}_{\mathrm{CommBal}}$), respectively.

Expert-Level Balance Loss. We use an expert-level balance loss (Fedus et al., 2021; Lepikhin et al., 2021) to mitigate the risk of routing collapse:

$$\mathcal{L}_{\mathrm{ExpBal}} = \alpha_1 \sum_{i=1}^{N_r} f_i P_i, \quad (23)$$
$$f_i = \frac{N_r}{K_r T} \sum_{t=1}^{T} \mathbb{1}(\text{Token } t \text{ selects Expert } i), \quad (24)$$
$$P_i = \frac{1}{T} \sum_{t=1}^{T} s_{i,t}, \quad (25)$$

where $\alpha_1$ is a hyper-parameter called the expert-level balance factor, $\mathbb{1}(\cdot)$ denotes the indicator function, and $T$ denotes the number of tokens in a sequence.

Device-Level Balance Loss. In addition to the expert-level balance loss, we additionally design a device-level balance loss to ensure balanced computation across different devices. In the training process of DeepSeek-V2, we partition all routed experts into $D$ groups $\{\mathcal{E}_1, \mathcal{E}_2, \dots, \mathcal{E}_D\}$, and deploy each group on a single device. The device-level balance loss is computed as follows:

$$\mathcal{L}_{\mathrm{DevBal}} = \alpha_2 \sum_{i=1}^{D} f_i' P_i', \quad (26)$$
$$f_i' = \frac{1}{|\mathcal{E}_i|} \sum_{j \in \mathcal{E}_i} f_j, \quad (27)$$
$$P_i' = \sum_{j \in \mathcal{E}_i} P_j, \quad (28)$$

where $\alpha_2$ is a hyper-parameter called the device-level balance factor.

Communication Balance Loss. Finally, we introduce a communication balance loss to ensure that the communication of each device is balanced. Although the device-limited routing mechanism guarantees that the sending communication of each device is bounded, if a certain device receives more tokens than other devices, the practical communication efficiency will also be affected. In order to mitigate this issue, we design a communication balance loss as follows:

$$\mathcal{L}_{\mathrm{CommBal}} = \alpha_3 \sum_{i=1}^{D} f_i'' P_i'', \quad (29)$$
$$f_i'' = \frac{D}{M T} \sum_{t=1}^{T} \mathbb{1}(\text{Token } t \text{ is sent to Device } i), \quad (30)$$
$$P_i'' = \sum_{j \in \mathcal{E}_i} P_j, \quad (31)$$

where $\alpha_3$ is a hyper-parameter called the communication balance factor. The device-limited routing mechanism operates on the principle of ensuring that each device transmits at most $MT$ hidden states to other devices. Simultaneously, the communication balance loss is employed to encourage each device to receive around $MT$ hidden states from other devices. The communication balance loss guarantees a balanced exchange of information among devices, promoting efficient communications.
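As an illustration, the expert-level loss of Equations (23)-(25) can be computed from the affinity matrix and the selected expert indices as follows; this is a sketch with our own function signature, and the device-level and communication losses aggregate the same per-expert quantities per device.

```python
import torch

def expert_level_balance_loss(affinity, topk_idx, alpha1):
    """Sketch of Eqs. (23)-(25).

    affinity: [T, N_r] softmax affinities s_{i,t} for one sequence of T tokens
    topk_idx: [T, K_r] indices of the experts each token actually selects
    """
    T, n_r = affinity.shape
    k_r = topk_idx.shape[1]
    selected = torch.zeros_like(affinity)
    selected.scatter_(1, topk_idx, 1.0)        # indicator 1(token t selects expert i)
    f = selected.mean(dim=0) * n_r / k_r       # f_i, Eq. (24)
    p = affinity.mean(dim=0)                   # P_i, Eq. (25)
    return alpha1 * (f * p).sum()              # Eq. (23)
```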
70、e of information among devices,promoting efficientcommunications.2.2.4.Token-Dropping StrategyWhile balance losses aim to encourage a balanced load,it is important to acknowledge thatthey cannot guarantee a strict load balance.In order to further mitigate the computationwastage caused by unbalanced
71、load,we introduce a device-level token-dropping strategy duringtraining.This approach first computes the average computational budget for each device,whichmeans that the capacity factor for each device is equivalent to 1.0.Then,inspired by Riquelmeet al.(2021),we drop tokens with the lowest affinity
72、 scores on each device until reaching thecomputational budget.In addition,we ensure that the tokens belonging to approximately 10%of the training sequences will never be dropped.In this way,we can flexibly decide whetherto drop tokens during inference according to the efficiency requirements,and alw
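The per-device dropping rule amounts to a top-k over routing affinities, as in this sketch; the function name is ours, and the exemption for the ~10% of never-dropped training sequences is assumed to be handled outside the function.

```python
import torch

def device_level_token_drop(affinity_on_device, capacity):
    """Keep at most `capacity` tokens on one device (capacity factor 1.0),
    dropping the tokens with the lowest routing affinity."""
    n_tokens = affinity_on_device.shape[0]
    if n_tokens <= capacity:
        return torch.arange(n_tokens)
    keep = affinity_on_device.topk(capacity).indices   # highest-affinity tokens survive
    return keep.sort().values                          # preserve original token order
```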
3. Pre-Training

3.1. Experimental Setups

3.1.1. Data Construction

While maintaining the same data processing stages as for DeepSeek 67B (DeepSeek-AI, 2024), we extend the amount of data and elevate the data quality. In order to enlarge our pre-training corpus, we explore the potential of the internet data and optimize our cleaning processes, thus recovering a large amount of mistakenly deleted data. Moreover, we incorporate more Chinese data, aiming to better leverage the corpus available on the Chinese internet. In addition to the amount of data, we also focus on the data quality. We enrich our pre-training corpus with high-quality data from various sources, and meanwhile improve the quality-based filtering algorithm. The improved algorithm ensures that a large amount of non-beneficial data will be removed, while the valuable data will be mostly retained. In addition, we filter out contentious content from our pre-training corpus to mitigate the data bias introduced from specific regional cultures. A detailed discussion about the influence of this filtering strategy is presented in Appendix E.

We adopt the same tokenizer as used in DeepSeek 67B, which is built based on the Byte-level Byte-Pair Encoding (BBPE) algorithm and has a vocabulary size of 100K. Our tokenized pre-training corpus contains 8.1T tokens, where Chinese tokens are approximately 12% more than English ones.
3.1.2. Hyper-Parameters

Model Hyper-Parameters. We set the number of Transformer layers to 60 and the hidden dimension to 5120. All learnable parameters are randomly initialized with a standard deviation of 0.006. In MLA, we set the number of attention heads $n_h$ to 128 and the per-head dimension $d_h$ to 128. The KV compression dimension $d_c$ is set to 512, and the query compression dimension $d_c'$ is set to 1536. For the decoupled queries and key, we set the per-head dimension $d_h^R$ to 64. Following Dai et al. (2024), we substitute all FFNs except for the first layer with MoE layers. Each MoE layer consists of 2 shared experts and 160 routed experts, where the intermediate hidden dimension of each expert is 1536. Among the routed experts, 6 experts will be activated for each token. In addition, the low-rank compression and fine-grained expert segmentation will impact the output scale of a layer. Therefore, in practice, we employ additional RMS Norm layers after the compressed latent vectors, and multiply additional scaling factors at the width bottlenecks (i.e., the compressed latent vectors and the intermediate hidden states of routed experts) to ensure stable training. Under this configuration, DeepSeek-V2 comprises 236B total parameters, of which 21B are activated for each token.

Training Hyper-Parameters. We employ the AdamW optimizer (Loshchilov and Hutter, 2017) with hyper-parameters set to $\beta_1 = 0.9$, $\beta_2 = 0.95$, and weight_decay $= 0.1$. The learning rate is scheduled using a warmup-and-step-decay strategy (DeepSeek-AI, 2024). Initially, the learning rate linearly increases from 0 to the maximum value during the first 2K steps. Subsequently, the learning rate is multiplied by 0.316 after training about 60% of tokens, and again by 0.316 after training about 90% of tokens. The maximum learning rate is set to $2.4 \times 10^{-4}$, and the gradient clipping norm is set to 1.0. We also use a batch size scheduling strategy, where the batch size is gradually increased from 2304 to 9216 in the training of the first 225B tokens, and then stays at 9216 for the remaining training. We set the maximum sequence length to 4K, and train DeepSeek-V2 on 8.1T tokens. We leverage pipeline parallelism to deploy different layers of a model on different devices, and for each layer, the routed experts are uniformly deployed on 8 devices ($D = 8$). As for the device-limited routing, each token will be sent to at most 3 devices ($M = 3$). As for balance losses, we set $\alpha_1$ to 0.003, $\alpha_2$ to 0.05, and $\alpha_3$ to 0.02. We employ the token-dropping strategy during training for acceleration, but do not drop any tokens for evaluation.
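For clarity, the warmup-and-step-decay schedule described above can be written as a small function; here the training progress is expressed as a fraction of total steps, which we use as a stand-in for the token-based milestones in the text.

```python
def warmup_step_decay_lr(step, total_steps, max_lr=2.4e-4, warmup_steps=2000):
    """Linear warmup to max_lr over the first 2K steps, then multiply the rate
    by 0.316 at ~60% of training and again at ~90% (step fraction used as a
    proxy for the token fraction described in the text)."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    if step < 0.6 * total_steps:
        return max_lr
    if step < 0.9 * total_steps:
        return max_lr * 0.316
    return max_lr * 0.316 ** 2
```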
3.1.3. Infrastructures

DeepSeek-V2 is trained based on the HAI-LLM framework (High-flyer, 2023), an efficient and light-weight training framework developed internally by our engineers. It employs a 16-way zero-bubble pipeline parallelism (Qi et al., 2023), an 8-way expert parallelism (Lepikhin et al., 2021), and ZeRO-1 data parallelism (Rajbhandari et al., 2020). Given that DeepSeek-V2 has relatively few activated parameters, and a portion of the operators are recomputed to save activation memory, it can be trained without the necessity of tensor parallelism, thereby decreasing the communication overhead. Moreover, in order to further improve the training efficiency, we overlap the computation of shared experts with the expert parallel all-to-all communication. We also customize faster CUDA kernels for communications, routing algorithms, and fused linear computations across different experts. In addition, MLA is also optimized based on an improved version of FlashAttention-2 (Dao, 2023).

We conduct all experiments on a cluster equipped with NVIDIA H800 GPUs. Each node in the H800 cluster contains 8 GPUs connected using NVLink and NVSwitch within nodes. Across nodes, InfiniBand interconnects are utilized to facilitate communications.

Figure 4 | Evaluation results on the "Needle In A Haystack" (NIAH) tests. DeepSeek-V2 performs well across all context window lengths up to 128K.
3.1.4. Long Context Extension

After the initial pre-training of DeepSeek-V2, we employ YaRN (Peng et al., 2023) to extend the default context window length from 4K to 128K. YaRN was specifically applied to the decoupled shared key $\mathbf{k}_t^{R}$, as it is responsible for carrying RoPE (Su et al., 2024). For YaRN, we set the scale $s$ to 40, $\alpha$ to 1, $\beta$ to 32, and the target maximum context length to 160K. Under these settings, we can expect the model to respond well for a context length of 128K. Slightly diverging from the original YaRN, due to our distinct attention mechanism, we adjust the length scaling factor to modulate the attention entropy. The factor $\sqrt{t}$ is computed as $\sqrt{t} = 0.0707 \ln s + 1$, aiming at minimizing the perplexity.

We additionally train the model for 1000 steps, with a sequence length of 32K and a batch size of 576 sequences. Although the training is conducted solely at the sequence length of 32K, the model still demonstrates robust performance when being evaluated at a context length of 128K. As shown in Figure 4, the results on the "Needle In A Haystack" (NIAH) tests indicate that DeepSeek-V2 performs well across all context window lengths up to 128K.
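For reference, the adjusted length scaling factor works out to roughly 1.26 at the YaRN scale used here. The snippet below only reproduces the stated formula; the remark about where the factor acts (scaling the pre-softmax attention logits, as in standard YaRN temperature scaling) is our reading rather than a statement from this report.

```python
import math

def mla_length_scale(s: float = 40.0) -> float:
    """sqrt(t) = 0.0707 * ln(s) + 1, the entropy-modulating factor described above."""
    return 0.0707 * math.log(s) + 1.0

print(mla_length_scale())  # ~1.261 for s = 40
```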
3.2. Evaluations

3.2.1. Evaluation Benchmarks

DeepSeek-V2 is pretrained on a bilingual corpus, so we evaluate it on a series of benchmarks in English and Chinese. Our evaluation is based on our internal evaluation framework integrated in our HAI-LLM framework. Included benchmarks are categorized and listed as follows, where underlined benchmarks are in Chinese:

- Multi-subject multiple-choice datasets include MMLU (Hendrycks et al., 2020), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023).
- Language understanding and reasoning datasets include HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), and BigBench Hard (BBH) (Suzgun et al., 2022).
- Closed-book question answering datasets include TriviaQA (Joshi et al., 2017) and NaturalQuestions (Kwiatkowski et al., 2019).
- Reading comprehension datasets include RACE (Lai et al., 2017), DROP (Dua et al., 2019), C3 (Sun et al., 2019), and CMRC (Cui et al., 2019).
- Reference disambiguation datasets include WinoGrande (Sakaguchi et al., 2019) and CLUEWSC (Xu et al., 2020).
- Language modeling datasets include Pile (Gao et al., 2020).
- Chinese understanding and culture datasets include CHID (Zheng et al., 2019) and CCPM (Li et al., 2021).
- Math datasets include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and CMath (Wei et al., 2023).
- Code datasets include HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and CRUXEval (Gu et al., 2024).
- Standardized exams include AGIEval (Zhong et al., 2023). Note that AGIEval includes both English and Chinese subsets.

Following our previous work (DeepSeek-AI, 2024), we adopt perplexity-based evaluation for datasets including HellaSwag, PIQA, WinoGrande, RACE-Middle, RACE-High, MMLU, ARC-Easy, ARC-Challenge, CHID, C-Eval, CMMLU, C3, and CCPM, and adopt generation-based evaluation for TriviaQA, NaturalQuestions, DROP, MATH, GSM8K, HumanEval, MBPP, CRUXEval, BBH, AGIEval, CLUEWSC, CMRC, and CMath. In addition, we perform language-modeling-based evaluation for Pile-test and use Bits-Per-Byte (BPB) as the metric to guarantee fair comparison among models with different tokenizers.

For an intuitive overview of these benchmarks, we additionally provide our evaluation formats for each benchmark in Appendix G.
3.2.2. Evaluation Results

In Table 2, we compare DeepSeek-V2 with several representative open-source models, including DeepSeek 67B (DeepSeek-AI, 2024) (our previous release), Qwen1.5 72B (Bai et al., 2023), LLaMA3 70B (AIMeta, 2024), and Mixtral 8x22B (Mistral, 2024). We evaluate all these models with our internal evaluation framework, and ensure that they share the same evaluation setting. Overall, with only 21B activated parameters, DeepSeek-V2 significantly outperforms DeepSeek 67B on almost all benchmarks, and achieves top-tier performance among open-source models.

| Benchmark (Metric) | # Shots | DeepSeek 67B | Qwen1.5 72B | Mixtral 8x22B | LLaMA 3 70B | DeepSeek-V2 |
| Architecture | - | Dense | Dense | MoE | Dense | MoE |
| # Activated Params | - | 67B | 72B | 39B | 70B | 21B |
| # Total Params | - | 67B | 72B | 141B | 70B | 236B |
| English: Pile-test (BPB) | - | 0.642 | 0.637 | 0.623 | 0.602 | 0.606 |
| BBH (EM) | 3-shot | 68.7 | 59.9 | 78.9 | 81.0 | 78.9 |
| MMLU (Acc.) | 5-shot | 71.3 | 77.2 | 77.6 | 78.9 | 78.5 |
| DROP (F1) | 3-shot | 69.7 | 71.5 | 80.4 | 82.5 | 80.1 |
| ARC-Easy (Acc.) | 25-shot | 95.3 | 97.1 | 97.3 | 97.9 | 97.6 |
| ARC-Challenge (Acc.) | 25-shot | 86.4 | 92.8 | 91.2 | 93.3 | 92.4 |
| HellaSwag (Acc.) | 10-shot | 86.3 | 85.8 | 86.6 | 87.9 | 84.2 |
| PIQA (Acc.) | 0-shot | 83.6 | 83.3 | 83.6 | 85.0 | 83.7 |
| WinoGrande (Acc.) | 5-shot | 84.9 | 82.4 | 83.7 | 85.7 | 84.9 |
| RACE-Middle (Acc.) | 5-shot | 69.9 | 63.4 | 73.3 | 73.3 | 73.1 |
| RACE-High (Acc.) | 5-shot | 50.7 | 47.0 | 56.7 | 57.9 | 52.7 |
| TriviaQA (EM) | 5-shot | 78.9 | 73.1 | 82.1 | 81.6 | 79.9 |
| NaturalQuestions (EM) | 5-shot | 36.6 | 35.6 | 39.6 | 40.2 | 38.7 |
| AGIEval (Acc.) | 0-shot | 41.3 | 64.4 | 43.4 | 49.8 | 51.2 |
| Code: HumanEval (Pass@1) | 0-shot | 45.1 | 43.9 | 53.1 | 48.2 | 48.8 |
| MBPP (Pass@1) | 3-shot | 57.4 | 53.6 | 64.2 | 68.6 | 66.6 |
| CRUXEval-I (Acc.) | 2-shot | 42.5 | 44.3 | 52.4 | 49.4 | 52.8 |
| CRUXEval-O (Acc.) | 2-shot | 41.0 | 42.3 | 52.8 | 54.3 | 49.8 |
| Math: GSM8K (EM) | 8-shot | 63.4 | 77.9 | 80.3 | 83.0 | 79.2 |
| MATH (EM) | 4-shot | 18.7 | 41.4 | 42.5 | 42.2 | 43.6 |
| CMath (EM) | 3-shot | 63.0 | 77.8 | 72.3 | 73.9 | 78.7 |
| Chinese: CLUEWSC (EM) | 5-shot | 81.0 | 80.5 | 77.5 | 78.3 | 82.2 |
| C-Eval (Acc.) | 5-shot | 66.1 | 83.7 | 59.6 | 67.5 | 81.7 |
| CMMLU (Acc.) | 5-shot | 70.8 | 84.3 | 60.0 | 69.3 | 84.0 |
| CMRC (EM) | 1-shot | 73.4 | 66.6 | 73.1 | 73.3 | 77.5 |
| C3 (Acc.) | 0-shot | 75.3 | 78.2 | 71.4 | 74.0 | 77.4 |
| CHID (Acc.) | 0-shot | 92.1 | - | 57.0 | 83.2 | 92.7 |
| CCPM (Acc.) | 0-shot | 88.5 | 88.1 | 61.0 | 68.1 | 93.1 |

Table 2 | Comparison among DeepSeek-V2 and other representative open-source models. All models are evaluated in our internal framework and share the same evaluation setting. Bold denotes the best and underlined denotes the second-best. Scores with a gap smaller than 0.3 are regarded as at the same level. With only 21B activated parameters, DeepSeek-V2 achieves top-tier performance among open-source models.

Further, we elaborately compare DeepSeek-V2 with its open-source counterparts one by one. (1) Compared with Qwen1.5 72B, another model that supports both Chinese and English, DeepSeek-V2 demonstrates overwhelming advantages on the majority of English, code, and math benchmarks. As for Chinese benchmarks, Qwen1.5 72B shows better performance on multi-subject multiple-choice tasks, while DeepSeek-V2 is comparable or better on others. Note that for the CHID benchmark, the tokenizer of Qwen1.5 72B encounters errors in our evaluation framework, so we leave the CHID score blank for Qwen1.5 72B.
(2) Compared with Mixtral 8x22B, DeepSeek-V2 achieves comparable or better English performance, except for TriviaQA, NaturalQuestions, and HellaSwag, which are closely related to English commonsense knowledge. Notably, DeepSeek-V2 outperforms Mixtral 8x22B on MMLU. On code and math benchmarks, DeepSeek-V2 demonstrates comparable performance with Mixtral 8x22B. Since Mixtral 8x22B is not specifically trained on Chinese data, its Chinese capability lags far behind DeepSeek-V2. (3) Compared with LLaMA3 70B, DeepSeek-V2 is trained on fewer than a quarter of the English tokens. Therefore, we acknowledge that DeepSeek-V2 still has a slight gap in basic English capabilities with LLaMA3 70B. However, even with much fewer training tokens and activated parameters, DeepSeek-V2 still demonstrates comparable code and math capability with LLaMA3 70B. Also, as a bilingual language model, DeepSeek-V2 outperforms LLaMA3 70B overwhelmingly on Chinese benchmarks.

Finally, it is worth mentioning that certain prior studies (Hu et al., 2024) incorporate SFT data during the pre-training stage, whereas DeepSeek-V2 has never been exposed to SFT data during pre-training.
3.2.3. Training and Inference Efficiency

Training Costs. Since DeepSeek-V2 activates fewer parameters for each token and requires fewer FLOPs than DeepSeek 67B, training DeepSeek-V2 is theoretically more economical than training DeepSeek 67B. Although training an MoE model introduces additional communication overheads, through our operator and communication optimizations, the training for DeepSeek-V2 can attain a relatively high Model FLOPs Utilization (MFU). During our practical training on the H800 cluster, for training on each trillion tokens, DeepSeek 67B requires 300.6K GPU hours, while DeepSeek-V2 needs only 172.8K GPU hours, i.e., sparse DeepSeek-V2 can save 42.5% of training costs compared with dense DeepSeek 67B.

Inference Efficiency. In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of FP8. In addition, we also perform KV cache quantization (Hooper et al., 2024; Zhao et al., 2023) for DeepSeek-V2 to further compress each element in its KV cache into 6 bits on average. Benefiting from MLA and these optimizations, the actually deployed DeepSeek-V2 requires significantly less KV cache than DeepSeek 67B, and thus can serve a much larger batch size. We evaluate the generation throughput of DeepSeek-V2 based on the prompt and generation length distribution from the actually deployed DeepSeek 67B service. On a single node with 8 H800 GPUs, DeepSeek-V2 achieves a generation throughput exceeding 50K tokens per second, which is 5.76 times the maximum generation throughput of DeepSeek 67B. In addition, the prompt input throughput of DeepSeek-V2 exceeds 100K tokens per second.
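As a back-of-the-envelope illustration of why the deployed cache is small, combining the MLA cache size from Section 2.1.4 with the 6-bit average quantization mentioned above gives roughly 25 KB per token. This is our own rough arithmetic under simplifying assumptions (it ignores quantization scales and other per-token metadata), not a figure reported in this report.

```python
def approx_kv_cache_kb_per_token(d_c=512, d_r=64, n_layers=60, bits_per_element=6):
    """Rough MLA cache size per token: (d_c + d_h^R) elements per layer at ~6 bits each."""
    elements = (d_c + d_r) * n_layers
    return elements * bits_per_element / 8 / 1024

print(approx_kv_cache_kb_per_token())  # ~25.3 KB per token under these assumptions
```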
4. Alignment

4.1. Supervised Fine-Tuning

Building upon our prior research (DeepSeek-AI, 2024), we curate our instruction tuning datasets to include 1.5M instances, comprising 1.2M instances for helpfulness and 0.3M instances for safety. In comparison to the initial version, we improve the data quality to mitigate hallucinatory responses and enhance writing proficiency. We fine-tune DeepSeek-V2 with 2 epochs, and the learning rate is set to $5 \times 10^{-6}$. For the evaluation of DeepSeek-V2 Chat (SFT), we mainly include generation-based benchmarks, except for several representative multiple-choice tasks (MMLU and ARC). We also conduct an instruction-following evaluation (IFEval) (Zhou et al., 2023) for DeepSeek-V2 Chat (SFT), using prompt-level loose accuracy as the metric. Moreover, we employ LiveCodeBench (Jain et al., 2024) questions from September 1st, 2023 to April 1st, 2024 to evaluate chat models. In addition to the standard benchmarks, we further evaluate our model on open-ended conversation benchmarks including MT-Bench (Zheng et al., 2023), AlpacaEval 2.0 (Dubois et al., 2024), and AlignBench (Liu et al., 2023). For comparison, we also evaluate Qwen1.5 72B Chat, LLaMA-3-70B Instruct, and Mistral-8x22B Instruct in our evaluation framework and settings. As for DeepSeek 67B Chat, we directly refer to the evaluation results reported in our previous release.
4.2. Reinforcement Learning

In order to further unlock the potential of DeepSeek-V2 and align it with human preference, we conduct Reinforcement Learning (RL) to adjust its preference.

Reinforcement Learning Algorithm. In order to save the training costs of RL, we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically of the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \dots, o_G\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_\theta$ by maximizing the following objective:

$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\!\left[q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)\right] \frac{1}{G}\sum_{i=1}^{G}\left(\min\!\left(\frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}A_i,\ \mathrm{clip}\!\left(\frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)},\ 1-\varepsilon,\ 1+\varepsilon\right)A_i\right) - \beta\,\mathbb{D}_{KL}\!\left(\pi_\theta \,\|\, \pi_{ref}\right)\right), \quad (32)$$

$$\mathbb{D}_{KL}\!\left(\pi_\theta \,\|\, \pi_{ref}\right) = \frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - \log\frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - 1, \quad (33)$$

where $\varepsilon$ and $\beta$ are hyper-parameters, and $A_i$ is the advantage, computed using a group of rewards $\{r_1, r_2, \dots, r_G\}$ corresponding to the outputs within each group:

$$A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \dots, r_G\})}{\mathrm{std}(\{r_1, r_2, \dots, r_G\})}. \quad (34)$$
127、?,(32)D?|?=(|)(|)log(|)(|)1,(33)whereandare hyper-parameters;andis the advantage,computed using a group ofrewards 1,2,.,corresponding to the outputs within each group:=m(1,2,)s(1,2,).(34)Training Strategy.In our preliminary experiments,we find that the RL training on reasoningdata,such as code and m
128、ath prompts,exhibits unique characteristics that are distinct from thetraining on general data.For example,the mathematical and coding abilities of our model cankeep improving over a longer period of training steps.Therefore,we employ a two-stage RLtraining strategy,which first performs reasoning al
129、ignment,and then performs human prefer-ence alignment.In the first reasoning alignment stage,we train a reward modelforcode and math reasoning tasks,and optimize the policy model with the feedback of:=().(35)In the second human preference alignment stage,we adopt a multi-reward framework,whichacquir
130、es rewards from a helpful reward model,a safety reward model,and arule-based reward model.The final reward of a responseis=1()+2()+3(),(36)where1,2,and3are corresponding coefficients.In order to obtain reliable reward models that play crucial roles in the RL training,wecarefully collect preference d
131、ata,and meticulously conduct quality filtering and proportionadjustments.We obtain code preference data based on compiler-feedback,and mathematicalpreference data based on the ground-truth labels.For reward model training,we initializethe reward models with DeepSeek-V2 Chat(SFT)and train them with e
132、ither a point-wise ora pair-wise loss.In our experiments,we observe that the RL training can fully tap into andactivate the potential of our model,enabling it to select the correct and satisfactory answer frompossible responses.17Optimizations for Training Efficiency.Conducting RL training on extrem
133、ely large modelsplaces high demands on the training framework.It requires careful engineering optimization tomanage the GPU memory and RAM pressure,and meanwhile maintain a fast training speed.For this goal,we implement the following engineering optimizations.(1)Firstly,we propose ahybrid engine tha
134、t adopts different parallel strategies for training and inference respectively toachieve higher GPU utilization.(2)Secondly,we leverage vLLM(Kwon et al.,2023)with largebatch sizes as our inference backend to accelerate the inference speed.(3)Thirdly,we carefullydesign a scheduling strategy for offlo
135、ading models to CPUs and loading models back to GPUs,which achieves a near-optimal balance between the training speed and memory consumption.4.3.Evaluation ResultsEvaluations on Standard Benchmarks.Initially,we evaluate DeepSeek-V2 Chat(SFT)andDeepSeek-V2 Chat(RL)on standard benchmarks.Notably,DeepS
136、eek-V2 Chat(SFT)demon-strates substantial improvements in GSM8K,MATH,and HumanEval evaluations comparedwith its base version.This progress can be attributed to the inclusion of our SFT data,whichcomprises a considerable volume of math and code related content.In addition,DeepSeek-V2Chat(RL)further b
137、oosts the performance on math and code benchmarks.We show more codeand math evaluations in Appendix F.As for the comparisons with other models,we first compare DeepSeek-V2 Chat(SFT)with Qwen1.5 72B Chat,and find that DeepSeek-V2 Chat(SFT)surpasses Qwen1.5 72B Chaton almost all of English,math,and co
138、de benchmarks.On Chinese benchmarks,DeepSeek-V2 Chat(SFT)demonstrates slightly lower scores than Qwen1.5 72B Chat on multi-subjectmultiple-choice tasks,consistent with the performance observed from their base versions.Whencompared with the state-of-the-art open-source MoE model,Mixtral 8x22B Instruc
139、t,DeepSeek-V2 Chat(SFT)exhibits better performance on most benchmarks,except for NaturalQuestionsand IFEval.Furthermore,in comparison to the state-of-the-art open-source model LLaMA3 70BChat,DeepSeek-V2 Chat(SFT)shows similar performance in code and math related benchmarks.LLaMA3 70B Chat exhibits b
140、etter performance on MMLU and IFEval,while DeepSeek-V2 Chat(SFT)showcases stronger performance on Chinese tasks.Ultimately,DeepSeek-V2 Chat(RL)demonstrates further enhanced performance in both mathematical and coding tasks comparedwith DeepSeek-V2 Chat(SFT).These comparisons highlight the strengths
141、of DeepSeek-V2 Chatin relation to other language models in various domains and languages.Evaluations on Open-Ended Generation.We proceed with additional evaluations of ourmodels on open-ended conversation benchmarks.For English open-ended conversation genera-tion,we utilize MT-Bench and AlpacaEval 2
142、.0 as the benchmarks.Evaluation results presentedin Table 4 demonstrate a significant performance advantage of DeepSeek-V2 Chat(RL)overDeepSeek-V2 Chat(SFT).This outcome showcases the effectiveness of our RL training in achiev-ing improved alignment.In comparison to other open-source models,DeepSeek
143、-V2 Chat(RL)demonstrates superior performance over Mistral 8x22B Instruct and Qwen1.5 72B Chat on bothbenchmarks.When compared with LLaMA3 70B Instruct,DeepSeek-V2 Chat(RL)showcasescompetitive performance on MT-Bench and notably outperforms it on AlpacaEval 2.0.Theseresults highlight the strong perf
144、ormance of DeepSeek-V2 Chat(RL)in generating high-qualityand contextually relevant responses,particularly in instruction-based conversation tasks.In addition,we evaluate the Chinese open-ended generation capability based on AlignBench.As presented in Table 5,DeepSeek-V2 Chat(RL)exhibits a slight adv
145、antage over DeepSeek-V2Chat(SFT).Notably,DeepSeek-V2 Chat(SFT)surpasses all open-source Chinese models by asignificant margin.It significantly outperforms the second-best open-source model,Qwen1.518Benchmark#ShotsDeepSeek Qwen 1.5 LLaMA3MixtralDeepSeek-V2 DeepSeek-V267B Chat72B Chat 70B Inst.8x22B I
146、nst.Chat(SFT)Chat(RL)Context Length-4K32K8K64K128K128KArchitecture-DenseDenseDenseMoEMoEMoE#Activated Params-67B72B70B39B21B21B#Total Params-67B72B70B141B236B236BEnglishTriviaQA5-shot81.579.669.180.085.486.7NaturalQuestions5-shot47.046.944.654.951.953.4MMLU5-shot71.176.280.377.878.477.8ARC-Easy25-sh
147、ot96.696.896.997.197.698.1ARC-Challenge25-shot88.991.792.690.092.592.3BBH3-shot71.765.980.178.481.379.7AGIEval0-shot46.462.856.641.463.261.4IFEval0-shot55.557.379.772.164.163.8CodeHumanEval0-shot73.868.976.275.076.881.1MBPP3-shot61.452.269.864.470.472.0CRUXEval-I-COT2-shot49.151.461.159.459.561.5CRU
148、XEval-O-COT 2-shot50.956.563.663.660.763.0LiveCodeBench0-shot18.318.830.525.028.732.5MathGSM8K8-shot84.181.993.287.990.892.2MATH4-shot32.640.648.549.852.753.9CMath0-shot80.382.879.275.182.081.9ChineseCLUEWSC5-shot78.590.185.475.888.689.9C-Eval5-shot65.282.267.960.080.978.0CMMLU5-shot67.882.970.761.0
149、82.481.6Table 3|Comparison among DeepSeek-V2 Chat(SFT),DeepSeek-V2 Chat(RL),and otherrepresentative open-source chat models.Regarding TriviaQA and NaturalQuestions,it isworth noting that chat models,such as LLaMA3 70B Instruct,might not strictly adhere to theformat constraints typically specified in
150、 the few-shot setting.Consequently,this can lead tounderestimation of certain models in our evaluation framework.ModelMT-BenchAlpacaEval 2.0DeepSeek 67B Chat8.3516.6Mistral 8x22B Instruct v0.18.6630.9Qwen1.5 72B Chat8.6136.6LLaMA3 70B Instruct8.9534.4DeepSeek-V2 Chat(SFT)8.6230.0DeepSeek-V2 Chat(RL)
151、8.9738.9Table 4|English open-ended conversation evaluations.For AlpacaEval 2.0,we use the length-controlled win rate as the metric.72B Chat on both Chinese reasoning and language.Moreover,both DeepSeek-V2 Chat(SFT)and DeepSeek-V2 Chat(RL)outperform GPT-4-0613 and ERNIEBot 4.0,solidifying the positio
152、nof our models in the top-tier LLMs that support Chinese.Specifically,DeepSeek-V2 Chat(RL)shows remarkable performance in Chinese language understanding,which outperforms allmodels including GPT-4-Turbo-1106-Preview.On the other hand,the reasoning capability ofDeepSeek-V2 Chat(RL)still lags behind g
iant models, such as ERNIEBot-4.0 and the GPT-4 series.

| Model | Overall | Reasoning Avg. | Math. | Logi. | Language Avg. | Fund. | Chi. | Open. | Writ. | Role. | Pro. |
| GPT-4-1106-Preview | 8.01 | 7.73 | 7.80 | 7.66 | 8.29 | 7.99 | 7.33 | 8.61 | 8.67 | 8.47 | 8.65 |
| DeepSeek-V2 Chat (RL) | 7.91 | 7.45 | 7.77 | 7.14 | 8.36 | 8.10 | 8.28 | 8.37 | 8.53 | 8.33 | 8.53 |
| ERNIEBot-4.0-202404* (文心一言) | 7.89 | 7.61 | 7.81 | 7.41 | 8.17 | 7.56 | 8.53 | 8.13 | 8.45 | 8.24 | 8.09 |
| DeepSeek-V2 Chat (SFT) | 7.74 | 7.30 | 7.34 | 7.26 | 8.17 | 8.04 | 8.26 | 8.13 | 8.00 | 8.10 | 8.49 |
| GPT-4-0613 | 7.53 | 7.47 | 7.56 | 7.37 | 7.59 | 7.81 | 6.93 | 7.42 | 7.93 | 7.51 | 7.94 |
| ERNIEBot-4.0-202312* (文心一言) | 7.36 | 6.84 | 7.00 | 6.67 | 7.88 | 7.47 | 7.88 | 8.05 | 8.19 | 7.84 | 7.85 |
| Moonshot-v1-32k-202404* (月之暗面) | 7.22 | 6.42 | 6.41 | 6.43 | 8.02 | 7.82 | 7.58 | 8.00 | 8.22 | 8.19 | 8.29 |
| Qwen1.5-72B-Chat* | 7.19 | 6.45 | 6.58 | 6.31 | 7.93 | 7.38 | 7.77 | 8.15 | 8.02 | 8.05 | 8.24 |
| DeepSeek-67B-Chat | 6.43 | 5.75 | 5.71 | 5.79 | 7.11 | 7.12 | 6.52 | 7.58 | 7.20 | 6.91 | 7.37 |
| ChatGLM-Turbo (智譜清言) | 6.24 | 5.00 | 4.74 | 5.26 | 7.49 | 6.82 | 7.17 | 8.16 | 7.77 | 7.76 | 7.24 |
| ERNIEBot-3.5 (文心一言) | 6.14 | 5.15 | 5.03 | 5.27 | 7.13 | 6.62 | 7.60 | 7.26 | 7.56 | 6.83 | 6.90 |
| Yi-34B-Chat* | 6.12 | 4.86 | 4.97 | 4.74 | 7.38 | 6.72 | 7.28 | 7.76 | 7.44 | 7.58 | 7.53 |
| GPT-3.5-Turbo-0613 | 6.08 | 5.35 | 5.68 | 5.02 | 6.82 | 6.71 | 5.81 | 7.29 | 7.03 | 7.28 | 6.77 |
| ChatGLM-Pro (智譜清言) | 5.83 | 4.65 | 4.54 | 4.75 | 7.01 | 6.51 | 6.76 | 7.47 | 7.07 | 7.34 | 6.89 |
| SparkDesk-V2 (訊飛星火) | 5.74 | 4.73 | 4.71 | 4.74 | 6.76 | 5.84 | 6.97 | 7.29 | 7.18 | 6.92 | 6.34 |
| Qwen-14B-Chat | 5.72 | 4.81 | 4.91 | 4.71 | 6.63 | 6.90 | 6.36 | 6.74 | 6.64 | 6.59 | 6.56 |
| Baichuan2-13B-Chat | 5.25 | 3.92 | 3.76 | 4.07 | 6.59 | 6.22 | 6.05 | 7.11 | 6.97 | 6.75 | 6.43 |
| ChatGLM3-6B | 4.97 | 3.85 | 3.55 | 4.14 | 6.10 | 5.75 | 5.29 | 6.71 | 6.83 | 6.28 | 5.73 |
| Baichuan2-7B-Chat | 4.97 | 3.66 | 3.56 | 3.75 | 6.28 | 5.81 | 5.50 | 7.13 | 6.84 | 6.53 | 5.84 |
| InternLM-20B | 4.96 | 3.66 | 3.39 | 3.92 | 6.26 | 5.96 | 5.50 | 7.18 | 6.19 | 6.49 | 6.22 |
| Qwen-7B-Chat | 4.91 | 3.73 | 3.62 | 3.83 | 6.09 | 6.40 | 5.74 | 6.26 | 6.31 | 6.19 | 5.66 |
| ChatGLM2-6B | 4.48 | 3.39 | 3.16 | 3.61 | 5.58 | 4.91 | 4.52 | 6.66 | 6.25 | 6.08 | 5.08 |
| InternLM-Chat-7B | 3.65 | 2.56 | 2.45 | 2.66 | 4.75 | 4.34 | 4.09 | 5.82 | 4.89 | 5.32 | 4.06 |
| Chinese-LLaMA-2-7B-Chat | 3.57 | 2.68 | 2.29 | 3.07 | 4.46 | 4.31 | 4.26 | 4.50 | 4.63 | 4.91 | 4.13 |
| LLaMA-2-13B-Chinese-Chat | 3.35 | 2.47 | 2.21 | 2.73 | 4.23 | 4.13 | 3.31 | 4.79 | 3.93 | 4.53 | 4.71 |

Table 5 | AlignBench leaderboard rated by GPT-4-0613. Models are ranked in descending order based on the overall score. Models marked with * represent that we evaluate them through their API service or open-weighted model, instead of referring to the results reported in their original papers. Suffixes of ERNIEBot-4.0 and Moonshot denote the timestamps when we called their API.

4.4. Discussion
160、Amount of SFT Data.The discussion surrounding the necessity of a large SFT corpus has beena topic of intense debate.Previous works(Young et al.,2024;Zhou et al.,2024)argue that fewerthan 10K instances of SFT data are enough to produce satisfactory results.However,in ourexperiments,we observe a signi
161、ficant performance decline on the IFEval benchmark if we usefewer than 10K instances.A possible explanation is that,a language model necessitates a certainamount of data to develop specific skills.Although the requisite data amount may diminishwith the model size increasing,it cannot be entirely eli
162、minated.Our observation underscores thecritical need for sufficient data to equip an LLM with desired capabilities.Moreover,the qualityof SFT data is also crucial,especially for tasks involving writing or open-ended questions.Alignment Tax of Reinforcement Learning.During human preference alignment,
163、we observea significant performance enhancement on the open-ended generation benchmarks,in termsof the scores rated by both AI and human evaluators.However,we also notice a phenomenonof“alignment tax”(Ouyang et al.,2022),i.e.,the alignment process can negatively impact theperformance on some standar
164、d benchmarks such as BBH.In order to alleviate the alignmenttax,during the RL stage,we make significant efforts in data processing and improving trainingstrategies,finally achieving a tolerable trade-off between the performance on standard andopen-ended benchmarks.Exploring how to align a model with
165、 human preferences without20compromising its general performance presents a valuable direction for future research.Online Reinforcement Learning.In our preference alignment experiments,we find that theonlineapproachsignificantlyoutperformstheofflineapproach.Therefore,weinvesttremendousefforts in imp
166、lementing an online RL framework for aligning DeepSeek-V2.The conclusionabout online or offline preference alignment can vary in different contexts,and we reserve amore thorough comparison and analysis between them for future work.5.Conclusion,Limitation,and Future WorkIn this paper,we introduce Dee
167、pSeek-V2,a large MoE language model that supports 128Kcontext length.In addition to strong performance,it is also characterized by economicaltraining and efficient inference,benefiting from its innovative architecture including MLA andDeepSeekMoE.In practice,compared with DeepSeek 67B,DeepSeek-V2 ac
168、hieves significantlystronger performance,and meanwhile saves 42.5%of training costs,reduces the KV cacheby 93.3%,and boosts the maximum generation throughput to 5.76 times.Evaluation resultsfurther demonstrate that with only 21B activated parameters,DeepSeek-V2 achieves top-tierperformance among ope
169、n-source models and becomes the strongest open-source MoE model.DeepSeek-V2 and its chat versions share the acknowledged limitations commonly found inother LLMs,including the lack of ongoing knowledge updates after pre-training,the possibilityof generating non-factual information such as unverified
170、advice,and a chance to producehallucinations.In addition,since our data primarily consist of Chinese and English content,ourmodel may exhibit limited proficiency in other languages.In scenarios beyond Chinese andEnglish,it should be used with caution.DeepSeek will continuously invest in open-source
171、large models with longtermism,aimingto progressively approach the goal of artificial general intelligence.In our ongoing exploration,we are dedicated to devising methods that enable furtherscaling up MoE models while maintaining economical training and inference costs.Thegoal of our next step is to
172、achieve performance on par with GPT-4 in our upcoming release.Our alignment team continuously strives to enhance our models,aiming to developa model that is not only helpful but also honest and safe for worldwide users.Ourultimate objective is to align the values of our model with human values,while
173、 minimizingthe need for human supervision.By prioritizing ethical considerations and responsibledevelopment,we are dedicated to creating a positive and beneficial impact on society.Currently,DeepSeek-V2 is designed to support the text modality exclusively.In ourforward-looking agenda,we intend to en
174、able our model to support multiple modalities,enhancing its versatility and utility in a wider range of scenarios.ReferencesAIMeta.Llama 3 model card,2024.URLhttps:/ Jong,Y.Zemlyanskiy,F.Lebrn,and S.Sanghai.Gqa:Traininggeneralized multi-query transformer models from multi-head checkpoints.arXiv prep
175、rintarXiv:2305.13245,2023.21Anthropic.Introducing Claude,2023.URLhttps:/ al.Program synthesis with large language models.arXiv preprint arXiv:2108.07732,2021.J.Bai,S.Bai,Y.Chu,Z.Cui,K.Dang,X.Deng,Y.Fan,W.Ge,Y.Han,F.Huang,B.Hui,L.Ji,M.Li,J.Lin,R.Lin,D.Liu,G.Liu,C.Lu,K.Lu,J.Ma,R.Men,X.Ren,X.Ren,C.Tan,
218、al Information Processing Systems,36,2024.J.Zhou,T.Lu,S.Mishra,S.Brahma,S.Basu,Y.Luan,D.Zhou,and L.Hou.Instruction-followingevaluation for large language models.arXiv preprint arXiv:2311.07911,2023.26AppendixA.Contributions and AcknowledgmentsResearch&EngineeringAixin LiuBingxuan WangBo LiuChenggang
219、 ZhaoChengqi DengChong RuanDamai DaiDaya GuoDejian YangDeli ChenErhang LiFangyun LinFuli LuoGuangbo HaoGuanting ChenGuowei LiH.ZhangHanwei XuHao YangHaowei ZhangHonghui DingHuajian XinHuazuo GaoHui QuJianzhong GuoJiashi LiJingyang YuanJunjie QiuJunxiao SongKai DongKaige GaoKang GuanLean WangLecong Z
220、hangLiang ZhaoLiyue ZhangMingchuan ZhangMinghua ZhangMinghui TangPanpan HuangPeiyi WangQihao ZhuQinyu ChenQiushi DuRuiqi GeRuizhe PanRunxin XuShanghao LuShangyan ZhouShanhuang ChenShengfeng YeShirong MaShiyu WangShuiping YuShunfeng ZhouSize ZhengTian PeiWangding ZengWen LiuWenfeng LiangWenjun GaoWen
221、tao ZhangXiao BiXiaohan WangXiaodong LiuXiaokang ChenXiaotao NieXin LiuXin XieXingkai YuXinyu YangXuan LuXuecheng SuY.WuY.K.LiY.X.WeiYanhong XuYao LiYao ZhaoYaofeng SunYaohui WangYichao ZhangYiliang XiongYilong ZhaoYing HeYishi PiaoYixin DongYixuan TanYiyuan Liu27Yongji WangYongqiang GuoYuduan WangY
222、uheng ZouYuxiang YouYuxuan LiuZ.Z.RenZehui RenZhangli ShaZhe FuZhenda XieZhewen HaoZhihong ShaoZhuoshu LiZihan WangZihui GuZilin LiZiwei XieData AnnotationBei FengHui LiJ.L.CaiJiaqi NiLei XuMeng LiNing TianR.J.ChenR.L.JinRuyi ChenS.S.LiShuang ZhouTian YuanTianyu SunX.Q.LiXiangyue JinXiaojin ShenXiao
223、sha ChenXiaowen SunXiaoxiang WangXinnan SongXinyi ZhouY.X.ZhuYanhong XuYanping HuangYaohui LiYi ZhengYuchen ZhuYunxian MaZhen HuangZhipeng XuZhongyu ZhangBusiness&ComplianceBin WangDongjie JiJian LiangJin ChenLeyi XiaMiaojun WangMingming LiPeng ZhangShaoqing WuShengfeng YeT.WangW.L.XiaoWei AnXianzu
224、WangYing TangYukun ZhaYuting YanZhen ZhangZhiniu WenWithin each role,authors are listed alphabetically by first name.Especially,Huazuo Gaoand Wangding Zeng have made key innovations in the research of the MLA architecture.Furthermore,wed like to thank Jianlin Su for his helpful discussion on positio
225、n embedding.We thank all those who have contributed to DeepSeek-V2 but are not mentioned in the paper.DeepSeek believes that innovation,novelty,and curiosity are essential in the path to AGI.28B.DeepSeek-V2-Lite:A 16B Model Equipped with MLA and DeepSeekMoEB.1.Model DescriptionArchitectures.DeepSeek
Architectures. DeepSeek-V2-Lite has 27 layers and a hidden dimension of 2048. It also employs MLA and has 16 attention heads, where each head has a dimension of 128. Its KV compression dimension is 512, but, slightly different from DeepSeek-V2, it does not compress the queries. For the decoupled queries and key, it has a per-head dimension of 64. DeepSeek-V2-Lite also employs DeepSeekMoE, and all FFNs except for the first layer are replaced with MoE layers. Each MoE layer consists of 2 shared experts and 64 routed experts, where the intermediate hidden dimension of each expert is 1408. Among the routed experts, 6 experts will be activated for each token. Under this configuration, DeepSeek-V2-Lite comprises 15.7B total parameters, of which 2.4B are activated for each token.

Benchmark | DeepSeek 7B | DeepSeekMoE 16B | DeepSeek-V2-Lite
Architecture | MHA + Dense | MHA + MoE | MLA + MoE
Context Length | 4K | 4K | 32K
# Activated Params | 6.9B | 2.8B | 2.4B
# Total Params | 6.9B | 16.4B | 15.7B
# Training Tokens | 2T | 2T | 5.7T
English: MMLU | 48.2 | 45.0 | 58.3
English: BBH | 39.5 | 38.9 | 44.1
English: TriviaQA | 59.7 | 64.8 | 64.2
English: NaturalQuestions | 22.2 | 25.5 | 26.0
English: ARC-Easy | 67.9 | 68.1 | 70.9
English: ARC-Challenge | 48.1 | 49.8 | 51.2
English: AGIEval | 26.4 | 17.4 | 33.2
Code: HumanEval | 26.2 | 26.8 | 29.9
Code: MBPP | 39.0 | 39.2 | 43.2
Math: GSM8K | 17.4 | 18.8 | 41.1
Math: MATH | 3.3 | 4.3 | 17.1
Math: CMath | 34.5 | 40.4 | 58.4
Chinese: CLUEWSC | 73.1 | 72.1 | 74.3
Chinese: C-Eval | 45.0 | 40.6 | 60.3
Chinese: CMMLU | 47.2 | 42.5 | 64.3
Table 6 | Performance of DeepSeek-V2-Lite, DeepSeekMoE 16B, and DeepSeek 7B.
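For convenience, the architecture described above can be collected in one place. The following is a minimal illustrative sketch as a plain Python dict; the field names are hypothetical and do not mirror DeepSeek's official configuration format.

# Illustrative summary of the DeepSeek-V2-Lite architecture described in B.1.
# Field names are hypothetical; they do not follow the official config schema.
deepseek_v2_lite_config = {
    "num_layers": 27,
    "hidden_dim": 2048,
    "attention": {
        "type": "MLA",
        "num_heads": 16,
        "head_dim": 128,
        "kv_compression_dim": 512,      # latent KV dimension; queries are not compressed
        "decoupled_rope_head_dim": 64,  # per-head dimension of decoupled queries/key
    },
    "moe": {                            # all FFNs except the first layer are MoE layers
        "num_shared_experts": 2,
        "num_routed_experts": 64,
        "num_activated_routed_experts": 6,
        "expert_intermediate_dim": 1408,
    },
    "total_params": "15.7B",
    "activated_params_per_token": "2.4B",
}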
Training Details. DeepSeek-V2-Lite is also trained from scratch on the same pre-training corpus of DeepSeek-V2, which is not polluted by any SFT data. It uses the AdamW optimizer with hyper-parameters set to β1 = 0.9, β2 = 0.95, and weight_decay = 0.1. The learning rate is scheduled using a warmup-and-step-decay strategy. Initially, the learning rate linearly increases from 0 to the maximum value during the first 2K steps. Subsequently, the learning rate is multiplied by 0.316 after training about 80% of the tokens, and again by 0.316 after training about 90% of the tokens. The maximum learning rate is set to 4.2 × 10⁻⁴, and the gradient clipping norm is set to 1.0. We do not employ the batch size scheduling strategy for it, and it is trained with a constant batch size of 4608 sequences. During pre-training, we set the maximum sequence length to 4K and train DeepSeek-V2-Lite on 5.7T tokens. We leverage pipeline parallelism to deploy different layers of it on different devices, but for each layer, all experts are deployed on the same device. Therefore, we only employ a small expert-level balance loss with α1 = 0.001, and do not employ the device-level balance loss and the communication balance loss for it. After pre-training, we also perform long context extension and SFT for DeepSeek-V2-Lite and obtain a chat model called DeepSeek-V2-Lite Chat.
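For concreteness, here is a minimal sketch of the warmup-and-step-decay schedule described above. The function expresses the 80%/90% decay points in terms of training steps for simplicity, whereas the paper states them in terms of trained tokens; the helper name and signature are ours, not part of any DeepSeek codebase.

# Hedged sketch of the warmup-and-step-decay learning-rate schedule described above.
def lr_at_step(step: int, total_steps: int,
               max_lr: float = 4.2e-4, warmup_steps: int = 2000) -> float:
    if step < warmup_steps:                     # linear warmup from 0 to max_lr
        return max_lr * step / warmup_steps
    progress = step / total_steps
    if progress < 0.8:                          # constant until ~80% of training
        return max_lr
    if progress < 0.9:                          # first decay: multiply by 0.316
        return max_lr * 0.316
    return max_lr * 0.316 * 0.316               # second decay after ~90% of training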
Benchmark | DeepSeek 7B Chat | DeepSeekMoE 16B Chat | DeepSeek-V2-Lite Chat
Architecture | MHA + Dense | MHA + MoE | MLA + MoE
Context Length | 4K | 4K | 32K
# Activated Params | 6.9B | 2.8B | 2.4B
# Total Params | 6.9B | 16.4B | 15.7B
# Training Tokens | 2T | 2T | 5.7T
English: MMLU | 49.7 | 47.2 | 55.7
English: BBH | 43.1 | 42.2 | 48.1
English: TriviaQA | 59.5 | 63.3 | 65.2
English: NaturalQuestions | 32.7 | 35.1 | 35.5
English: ARC-Easy | 70.2 | 69.9 | 74.3
English: ARC-Challenge | 50.2 | 50.0 | 51.5
English: AGIEval | 17.6 | 19.7 | 42.8
Code: HumanEval | 45.1 | 45.7 | 57.3
Code: MBPP | 39.0 | 46.2 | 45.8
Math: GSM8K | 62.6 | 62.2 | 72.0
Math: MATH | 14.7 | 15.2 | 27.9
Math: CMath | 66.4 | 67.9 | 71.7
Chinese: CLUEWSC | 66.2 | 68.2 | 80.0
Chinese: C-Eval | 44.7 | 40.0 | 60.1
Chinese: CMMLU | 51.2 | 49.3 | 62.5
Table 7 | Performance of DeepSeek-V2-Lite Chat, DeepSeekMoE 16B Chat, and DeepSeek 7B Chat.

B.2. Performance Evaluation

Base Model. We evaluate the performance of DeepSeek-V2-Lite and compare it with our previous small-size base models in Table 6. DeepSeek-V2-Lite exhibits overwhelming performance advantages, especially in reasoning, coding, and math.

Chat Model. We evaluate the performance of DeepSeek-V2-Lite Chat and compare it with our previous small-size chat models in Table 7. DeepSeek-V2-Lite Chat also outperforms our previous small-size chat models by a large margin.
C. Full Formulas of MLA

In order to demonstrate the complete computation process of MLA, we provide its full formulas in the following:

$\mathbf{c}_t^{Q} = W^{DQ} \mathbf{h}_t,$  (37)
$[\mathbf{q}_{t,1}^{C}; \mathbf{q}_{t,2}^{C}; \ldots; \mathbf{q}_{t,n_h}^{C}] = \mathbf{q}_t^{C} = W^{UQ} \mathbf{c}_t^{Q},$  (38)
$[\mathbf{q}_{t,1}^{R}; \mathbf{q}_{t,2}^{R}; \ldots; \mathbf{q}_{t,n_h}^{R}] = \mathbf{q}_t^{R} = \mathrm{RoPE}(W^{QR} \mathbf{c}_t^{Q}),$  (39)
$\mathbf{q}_{t,i} = [\mathbf{q}_{t,i}^{C}; \mathbf{q}_{t,i}^{R}],$  (40)
$\boxed{\mathbf{c}_t^{KV}} = W^{DKV} \mathbf{h}_t,$  (41)
$[\mathbf{k}_{t,1}^{C}; \mathbf{k}_{t,2}^{C}; \ldots; \mathbf{k}_{t,n_h}^{C}] = \mathbf{k}_t^{C} = W^{UK} \mathbf{c}_t^{KV},$  (42)
$\boxed{\mathbf{k}_t^{R}} = \mathrm{RoPE}(W^{KR} \mathbf{h}_t),$  (43)
$\mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^{C}; \mathbf{k}_t^{R}],$  (44)
$[\mathbf{v}_{t,1}^{C}; \mathbf{v}_{t,2}^{C}; \ldots; \mathbf{v}_{t,n_h}^{C}] = \mathbf{v}_t^{C} = W^{UV} \mathbf{c}_t^{KV},$  (45)
$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \mathrm{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{\top} \mathbf{k}_{j,i}}{\sqrt{d_h + d_h^{R}}}\right) \mathbf{v}_{j,i}^{C},$  (46)
$\mathbf{u}_t = W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}],$  (47)

where the boxed vectors $\mathbf{c}_t^{KV}$ and $\mathbf{k}_t^{R}$ (shown in blue in the original figure) need to be cached for generation. During inference, the naive formulas need to recover $\mathbf{k}_t^{C}$ and $\mathbf{v}_t^{C}$ from $\mathbf{c}_t^{KV}$ for attention. Fortunately, due to the associative law of matrix multiplication, we can absorb $W^{UK}$ into $W^{UQ}$, and $W^{UV}$ into $W^{O}$. Therefore, we do not need to compute keys and values out for each query. Through this optimization, we avoid the computational overhead of recomputing $\mathbf{k}_t^{C}$ and $\mathbf{v}_t^{C}$ during inference.
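To make the absorption argument concrete, here is a small numerical sketch (not DeepSeek's implementation; the dimensions and variable names are illustrative). It checks that folding the key up-projection into the query side yields exactly the same attention logit while only the cached latent vector is touched at decode time.

# Minimal sketch of the matrix-absorption trick: q^T (W_UK c) == (W_UK^T q)^T c,
# so W_UK can be folded into the query side and only the latent c needs caching.
import numpy as np

rng = np.random.default_rng(0)
d_c, d_h = 512, 128                         # illustrative latent and per-head dims
W_UK = rng.standard_normal((d_h, d_c))      # up-projection that recovers keys from the latent
q = rng.standard_normal(d_h)                # one query head
c = rng.standard_normal(d_c)                # cached latent KV vector of one past token

# Naive path: recover the key, then take the dot product with the query.
k = W_UK @ c
score_naive = q @ k

# Absorbed path: fold W_UK into the query once, then work directly on the cached latent.
q_absorbed = W_UK.T @ q
score_absorbed = q_absorbed @ c

assert np.allclose(score_naive, score_absorbed)   # identical attention logits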
D. Ablation of Attention Mechanisms

D.1. Ablation of MHA, GQA, and MQA

We show the evaluation results for 7B dense models with MHA, GQA, and MQA on four hard benchmarks in Table 8. All three models are trained on 1.33T tokens and share the same architecture except for the attention mechanism. In addition, for a fair comparison, we align their parameter counts to around 7B by adjusting the number of layers. From the table, we can find that MHA demonstrates significant advantages over GQA and MQA on these benchmarks.

D.2. Comparison Between MLA and MHA

In Table 9, we show the evaluation results for MoE models equipped with MLA and MHA, respectively, on four hard benchmarks. For a solid conclusion, we train and evaluate models across two scales. The two small MoE models comprise about 16B total parameters, and we train them on 1.33T tokens. The two large MoE models comprise about 250B total parameters, and we train them on 420B tokens. Also, the two small MoE models and the two large MoE models respectively share the same architecture except for the attention mechanism. From the table, we can observe that MLA shows better performance than MHA. More importantly, MLA requires a significantly smaller amount of KV cache (14% for the small MoE models and 4% for the large MoE models) than MHA.
Benchmark (Metric) | # Shots | Dense 7B w/ MQA | Dense 7B w/ GQA (8 Groups) | Dense 7B w/ MHA
# Params | - | 7.1B | 6.9B | 6.9B
BBH (EM) | 3-shot | 33.2 | 35.6 | 37.0
MMLU (Acc.) | 5-shot | 37.9 | 41.2 | 45.2
C-Eval (Acc.) | 5-shot | 30.0 | 37.7 | 42.9
CMMLU (Acc.) | 5-shot | 34.6 | 38.4 | 43.5
Table 8 | Comparison among 7B dense models with MHA, GQA, and MQA, respectively. MHA demonstrates significant advantages over GQA and MQA on hard benchmarks.

Benchmark (Metric) | # Shots | Small MoE w/ MHA | Small MoE w/ MLA | Large MoE w/ MHA | Large MoE w/ MLA
# Activated Params | - | 2.5B | 2.4B | 25.0B | 21.5B
# Total Params | - | 15.8B | 15.7B | 250.8B | 247.4B
KV Cache per Token (# Element) | - | 110.6K | 15.6K | 860.2K | 34.6K
BBH (EM) | 3-shot | 37.9 | 39.0 | 46.6 | 50.7
MMLU (Acc.) | 5-shot | 48.7 | 50.0 | 57.5 | 59.0
C-Eval (Acc.) | 5-shot | 51.6 | 50.9 | 57.9 | 59.2
CMMLU (Acc.) | 5-shot | 52.3 | 53.4 | 60.7 | 62.5
Table 9 | Comparison between MLA and MHA on hard benchmarks. MLA shows better performance than MHA, while requiring a significantly smaller amount of KV cache.
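As a rough sanity check on the "KV Cache per Token (# Element)" row, the element counts can be reproduced from per-layer cache sizes. The sketch below assumes the small MoE models use the same attention dimensions as DeepSeek-V2-Lite in Appendix B.1 (27 layers, 16 heads of dimension 128, KV compression dimension 512, decoupled RoPE dimension 64); the exact match with the table suggests they do, but this is our inference rather than a statement from the paper.

# Hedged back-of-the-envelope check of the KV cache sizes in Table 9 (small MoE column),
# assuming the DeepSeek-V2-Lite attention dimensions from Appendix B.1.
num_layers, n_heads, d_head = 27, 16, 128
d_kv_compression, d_rope = 512, 64

mha_elems_per_token = 2 * n_heads * d_head * num_layers          # keys + values, every layer
mla_elems_per_token = (d_kv_compression + d_rope) * num_layers   # latent c_t^KV + shared k_t^R

print(mha_elems_per_token)  # 110592  ~ 110.6K
print(mla_elems_per_token)  # 15552   ~ 15.6K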
E. Discussion About Pre-Training Data Debiasing

During pre-training data preparation, we identify and filter out contentious content, such as values influenced by regional cultures, to avoid our model exhibiting unnecessary subjective biases on these controversial topics. Consequently, we observe that DeepSeek-V2 performs slightly worse on the test sets that are closely associated with specific regional cultures. For example, when evaluated on MMLU, although DeepSeek-V2 achieves comparable or superior performance on the majority of test sets compared with its competitors such as Mixtral 8x22B, it still lags behind on the Humanity-Moral subset, which is mainly associated with American values.

Further, we conduct a manual analysis on this subset. Three well-educated human annotators conduct independent annotations on 420 moral scenarios from the MMLU Humanity-Moral subset. Then, we compute the agreement among their annotations and the ground-truth label. As shown in Table 10, the three human annotators and the ground-truth label exhibit a low agreement with each other. Therefore, we attribute the abnormal performance of DeepSeek-V2 on these value-sensitive test sets to our efforts in debiasing the pre-training corpus.

Agreement | Ground-Truth Label | Annotator 1 | Annotator 2 | Annotator 3
Ground-Truth Label | 100.0% | 66.7% | 59.8% | 42.1%
Annotator 1 | 66.7% | 100.0% | 57.9% | 69.0%
Annotator 2 | 59.8% | 57.9% | 100.0% | 65.5%
Annotator 3 | 42.1% | 69.0% | 65.5% | 100.0%
Table 10 | Three well-educated human annotators conduct independent annotations on 420 moral scenarios from the MMLU Humanity-Moral subset, on which DeepSeek-V2 and its competitive models demonstrate performance inconsistency. The three annotators and the ground-truth label exhibit a low agreement with each other, which indicates that the answers to the Humanity-Moral subset can be contentious depending on specific regional cultures.

F. Additional Evaluations on Math and Code

The evaluation employs the SC-Math6 corpus, which consists of thousands of Chinese math problems. As shown in Table 11, DeepSeek-V2 Chat (RL) outperforms all Chinese LLMs, including both open-source and closed-source models.

Model Name | R Level | Comp. Score | Reas. Steps Score | OvrAcc Score
GPT-4-1106-Preview | 5 | 90.71 | 91.65 | 89.77
GPT-4 | 5 | 88.40 | 89.10 | 87.71
DeepSeek-V2 Chat (RL) | 5 | 83.35 | 85.73 | 84.54
Ernie-bot 4.0 | 5 | 85.60 | 86.82 | 84.38
Qwen-110B-Chat | 5 | 83.25 | 84.93 | 84.09
GLM-4 | 5 | 84.24 | 85.72 | 82.77
Xinghuo 3.5 | 5 | 83.73 | 85.37 | 82.09
Qwen-72B-Chat | 4 | 78.42 | 80.07 | 79.25
ChatGLM-Turbo | 4 | 57.70 | 60.32 | 55.09
GPT-3.5-Turbo | 4 | 57.05 | 59.61 | 54.50
Qwen-14B-Chat | 4 | 53.12 | 55.99 | 50.26
ChatGLM3-6B | 3 | 40.90 | 44.20 | 37.60
Xinghuo 3.0 | 3 | 40.08 | 45.27 | 34.89
Baichuan2-13B-Chat | 3 | 39.40 | 42.63 | 36.18
Ernie-3.5-turbo | 2 | 25.19 | 27.70 | 22.67
Chinese-Alpaca2-13B | 2 | 20.55 | 22.52 | 18.58
Table 11 | SC-Math6 Model Reasoning Level. "R Level" stands for Reasoning Level, "Comp. Score" stands for Comprehensive Score, "Reas. Steps Score" stands for Reasoning Steps Score, and "OvrAcc Score" stands for Overall Accuracy Score.

We further share more results in Figure 5 on HumanEval and LiveCodeBench, where the questions of LiveCodeBench are selected from the period between September 1st, 2023 and April 1st, 2024. As shown in the figure, DeepSeek-V2 Chat (RL) demonstrates considerable proficiency on LiveCodeBench, achieving a Pass@1 score that even surpasses some giant models. This performance highlights the strong capability of DeepSeek-V2 Chat (RL) in tackling live coding tasks.

Figure 5 | Evaluation results on HumanEval (Pass@1) and LiveCodeBench (Pass@1). The questions of LiveCodeBench are selected from the period between September 1st, 2023 and April 1st, 2024. (Plots compare DeepSeek-V2-Chat-RL, DeepSeek 67B, Mixtral 8x22B, Mistral Large, Qwen Max, Qwen1.5 72B, GPT-4-Turbo-1106, GPT-4-0613, Claude Opus, Claude Sonnet, Claude Haiku, and LLaMA3-70B-Chat; the figure itself is not reproduced in this text extraction.)

G. Evaluation Formats

We present our evaluation formats for each benchmark in Tables 12-37, respectively.

PROMPT
以下是一道中
260、國高考生物選擇題,請選擇正確的答案。問題:下列有關高爾基體、線粒體和葉綠體的敘述,正確的是選項:(A)三者都存在于藍藻中(B)三者都含有DNA(C)三者都是ATP合成的場所(D)三者的膜結構中都含有蛋白質答案:從A到D,我們應選擇Table 12|An example of AGIEval.34PROMPTQuestion:A sample in a cylindrical container has a cylindrical shape and afixed volume.The state of matter of the sample _A.must be solidB.could b
261、e either solid or liquidC.must be liquidD.could be either liquid or gasAnswer:BQuestion:The speed of sound is generally greatest in _A.solids and lowest in liquidsB.solids and lowest in gasesC.gases and lowest in liquidsD.gases and lowest in solidsAnswer:BQuestion:When oil and water are mixed togeth
262、er,they form a _A.gasB.solidC.compoundD.suspensionAnswer:DQuestion:A container of liquid water was placed outside during the day whenthe temperature was 3C.At night the outside temperature dropped to-2C.This temperature change most likely caused the water to _A.condenseB.evaporateC.remain a liquidD.
263、become a solidAnswer:Table 13|An example of ARC.35PROMPTEvaluate the result of a random Boolean expression.Q:not(not not True)isA:Lets think step by step.Remember that(i)expressions inside brackets are always evaluated first andthat(ii)the order of operations from highest priority to lowest priority
264、 is not,and,or,respectively.We first simplify this expression Z as follows:Z=not(not not True)=not(A)where A=not not True.Lets evaluate A:A=not not True=not(not True)=not False=True.Plugging in A,we get:Z=not(A)=not(True)=not True=False.So the answer is False.Q:True and False and not True and True i
265、sA:Lets think step by step.Remember that(i)expressions inside brackets are always evaluated first andthat(ii)the order of operations from highest priority to lowest priority is not,and,or,respectively.We first simplify this expression Z as follows:Z=True and False and not True and True=A and B where
266、 A=True and Falseand B=not True and True.Lets evaluate A:A=True and False=False.Letsevaluate B:B=not True and True=not(True and True)=not(True)=False.Plugging in A and B,we get:Z=A and B=False and False=False.So theanswer is False.Q:not not(not(False)isA:Lets think step by step.Remember that(i)expre
267、ssions inside brackets are always evaluated first andthat(ii)the order of operations from highest priority to lowest priority is not,and,or,respectively.We first simplify this expression Z as follows:Z=not not(not(False)=not not(A)where A=not(False).Lets evaluateA:A=not(False)=not False=True.Pluggin
268、g in A,we get:Z=not not(A)=not not(True)=not not False=True.So the answer is True.Q:False and False and False or not False isA:Lets think step by step.Table 14|An example of BBH.36PROMPT以下是中國關于教育學考試的單項選擇題,請選出其中的正確答案。根據我國心理學家馮忠良教授的學習分類,培養學生品德要通過_。A.知識的學習B.技能的學習C.行為規范的學習D.態度的學習答案:C開設跨學科課程或建立跨學科專業體現了高等
269、教育課程發展的_。A.綜合化趨勢B.多樣化趨勢C.人文化趨勢D.科學化趨勢答案:A心智技能的特點有_。A.物質性、外顯性、簡縮性B.觀念性、內潛性、簡縮性C.物質性、外顯性、展開性D.觀念性、內潛性、展開性答案:B下列關于大學生的情緒與理智關系的說法中正確的是_。A.能冷靜控制自己情緒B.感情用事,難以用理智控制情緒C.遇事能堅持自己正確認識D.已發展到不為小事而發怒和慪氣答案:B在學完一篇邏輯結構嚴密的課文以后,勾畫出課文的論點論據的邏輯關系圖以幫助理解和記憶。這種學習方法屬于_。A.精細加工策略B.組織策略C.復述策略D.做筆記策略答案:B有學者強調,教育要根據一個民族固有的特征來定,這種
270、觀點體現了_A.生產力對教育的影響和制約B.政治制度對教育的影響和制約C.文化對教育的影響和制約D.經濟制度對教育的影響和制約答案:OPTIONS-A-B-C-DTable 15|An example of C-Eval.37PROMPT女:這些藥怎么吃?男:一天三次,一次兩片。請根據上文回答問題:他們在哪兒?答案:OPTIONS-商店-飯店-醫院-教室Table 16|An example of C3.PROMPT以下是將某句古詩文翻譯而成的現代表述:春天已至,萬物復蘇,春風如一位美麗而又心靈手巧的姑娘,邁著纖纖細步款款而來,她揮舞剪刀,盡情地展示那高超的女工技巧,她先裁出了柳葉,隨著柳條裊
271、裊依依地舞蹈,又裁出杏葉,桃葉。該翻譯所對應的古詩文是:OPTIONS-春風騁巧如翦刀-剪裁無巧似春風-風吹怨恨快如刀-春風欲擅秋風巧Table 17|An example of CCPM.38PROMPTQ:某 小 學 在“獻 愛 心為 汶 川 地 震 區 捐 款”活 動 中,六 年 級 五 個 班 共捐款8000元,其中一班捐款1500元,二班比一班多捐款200元,三班捐款1600元,四班與五班捐款數之比是3:5四班捐款多少元?A:一 班 捐 款1500元,而 二 班 比 一 班 多 捐200元,所 以 二 班 捐款1500+200=1700元,又知道六年級五個班一共捐款8000元,所以四
272、班和五班捐款之和=一共捐款-一班和二班和三班捐款之和,即8000-1500-1700-1600=3200元,而題目說四班與五班捐款數之比是3:5,則四班捐款了3200/(3+5)*3=1200元。所以答案是:1200。Q:小俊在東西大道上跑步,若規定向東為正。他先向東跑了800米,然后又跑了一段之后,他位于出發點西邊100米處,小俊第二段跑了多少米?A:小俊第二段跑完后位于出發點西邊,所以第二段應該是向西跑,第二段跑的長度-第一段跑的長度=100,第二段跑了100+800=900米。所以答案是:900。Q:A車和B車同時從甲、乙兩地相向開出,經過5小時相遇然后,它們又各自按原速原方向繼續行駛3
273、小時,這時A車離乙地還有135千米,B車離甲地還有165千米甲、乙兩地相距多少千米?A:假設A車的速度為x千米每小時,B車的速度為y千米每小時,根據而A、B相遇時A車行駛了5小時,A車行駛3小時后離乙地還有135千米,B車行駛3小時后距離甲地還有165千米,可以得到甲乙兩地相距=5x+5y=135+8x=165+8y,變換得到:10(x+y)=300+8(x+y),于是x+y=150,甲乙兩地相距5(x+y)=750千米。所以答案是:750。Q:在一個底面半徑為10厘米的圓柱形容器內,倒入10厘米深的水,然后將一個底面直徑4厘米,高6厘米的圓錐形鉛錘放入水中,容器中水面上升多少厘米?A:Tab
274、le 18|An example of CMATH.39PROMPT以下是關于解剖學的單項選擇題,請直接給出正確答案的選項。題目:壁胸膜的分部不包括A.肋胸膜B.肺胸膜C.膈胸膜D.胸膜頂答案是:B題目:屬于蝶骨上的結構為A.垂體窩B.棘孔C.破裂孔D.視神經管答案是:B題目:屬于右心房的結構是A.肉柱B.室上嵴C.乳頭肌D.梳狀肌答案是:D題目:咽的分部A.咽隱窩B.口咽部C.鼻咽部D.喉咽部答案是:C題目:舌下神經核位于A.間腦B.延髓C.中腦D.腦撟答案是:B題目:從腦干背側出腦的腦神經是A.副神經B.三叉神經C.舌下神經D.滑車神經答案是:OPTIONS-A-B-C-DTable 19
275、|An example of CMMLU.40PROMPT文章:英雄廣場(Heldenplatz)是奧地利首都維也納的一個廣場。在此曾發生許多重要事件最著名的是1938年希特勒在此宣告德奧合并。英雄廣場是霍夫堡皇宮的外部廣場,興建于皇帝弗朗茨約瑟夫一世統治時期,是沒有完全建成的所謂“帝國廣場”(Kaiserforum)的一部分。其東北部是霍夫堡皇宮的Leopoldinian Tract,東南方是新霍夫堡,西南方的內環路,將其與“城門外”(ueres Burgtor)隔開。西北部沒有任何建筑物,可以很好地眺望內環路、國會大廈、市政廳,以及城堡劇院。廣場上有2尊軍事領袖的騎馬像:歐根親王和卡爾大公
276、。根據上文回答下面的問題。問題:英雄廣場是哪個皇宮的外部廣場?答案:霍夫堡皇宮問題:廣場上有哪兩位軍事領袖的騎馬像?答案:Table 20|An example of CMRC2018.PROMPTPassage:The median age in the city was 22.1 years.10.1%of residents wereunder the age of 18;56.2%were between the ages of 18 and 24;16.1%werefrom 25 to 44;10.5%were from 45 to 64;and 7%were 65 years of
277、 age or older.The gender makeup of the city was 64.3%male and 35.7%female.Answer the following questions based on the above passage,please calculatecarefully if calculation is necessary.Q:How many percent were not from 25 to 44?A:The answer type is number.So according to above Passage,the answer is8
278、3.9.Q:How many in percent werent 25 to 44?A:The answer type is number.So according to above Passage,the answer isTable 21|An example of DROP.PROMPT中新網12月7日電綜合外媒6日報道,在美國得克薩斯州,負責治療新冠肺炎患者的醫生約瑟夫瓦隆(Joseph Varon)已連續上班超260天,每天只睡不超過2小時。瓦隆日前接受采訪時呼吁,美國民眾應遵從防疫規定,一線的醫護人員“已OPTIONS-神清氣爽”。-詭計多端”。-精疲力竭”。-分工合作”。-寅吃卯
279、糧”。-土豪劣紳”。-蕓蕓眾生”。Table 22|An example of CHID.41PROMPT胡雪巖離船登岸,坐轎進城,等王有齡到家,他接著也到了他那里,臉上是掩抑不住的笑容,王有齡夫婦都覺得奇怪,問他什么事這么高興。上面的句子中的他指的是胡雪巖漸漸地,湯中凝結出一團團塊狀物,將它們撈起放進盆里冷卻,肥皂便出現在世上了。上面的句子中的它們指的是塊狀物“她序上明明引著JulesTellier的比喻,說有個生脫發病的人去理發,那剃頭的對他說不用剪發,等不了幾天,頭毛壓兒全掉光了;大部分現代文學也同樣的不值批評。這比喻還算俏皮?!鄙厦娴木渥又械乃傅氖巧摪l病的人在洛倫佐大街的盡頭處,矗
280、立著著名的圣三一大教堂。它有著巨大的穹頂,還有明亮的彩色玻璃窗,上面描繪著舊約和新約的場景。上面的句子中的它指的是圣三一大教堂他伯父還有許多女弟子,大半是富商財主的外室;這些財翁白天忙著賺錢,怕小公館里的情婦長日無聊,要不安分,常常叫她們學點玩藝兒消遣。上面的句子中的她們指的是情婦趙雨又拿出了一個杯子,我們熱情地請老王入座,我邊給他倒酒邊問:1962年的哪次記得嗎?“上面的句子中的他指的是Table 23|An example of CLUEWSC.42PROMPTQ:Max can mow the lawn in 40 minutes.If it takes him twice that l
281、ong tofertilize the lawn,how long will it take him to both mow and fertilize the lawn?A:Lets think step by step.It takes Max 2*40 minutes=80 minutes to fertilizethe lawn.In total,Max takes 80 minutes+40 minutes=120 minutes to bothmow and fertilize the lawn.The answer is 120.Q:The bagels cost$2.25 ea
282、ch,or a dozen for$24.How much is saved,perbagel,in cents,by buying a dozen at a time?A:Lets think step by step.They cost 2.25*100=225 cents each.At the bulk rate,they are 24/12=2 dollar each.They cost 2*100=200 cents each.225-200=25 centsare saved per bagel.The answer is 25.Q:Tim is 5 years old.His
283、cousin,Rommel,is thrice as old as he is.His othercousin,Jenny,is 2 years older than Rommel.How many years younger is Timthan Jenny?A:Lets think step by step.Rommel is 5 x 3=15 years old.Jenny is 15+2=17years old.So,Tim is 17-5=12 years younger than Jenny.The answer is 12.Q:The school has 14 boys and
284、 10 girls.If 4 boys and 3 girls drop out,howmany boys and girls are left?A:Lets think step by step.There are 14 boys-4 boys=10 boys left.There are10 girls-3 girls=7 girls left.In total there are 10 boys+7 girls=17 boys andgirls left.The answer is 17.Q:Building one birdhouse requires 7 planks and 20
285、nails.If 1 nail costs 0.05,and one plank costs 3,what is the cost,in dollars,to build 4 birdhouses?A:Lets think step by step.The cost of the planks for one birdhouse is 7*3=21.And the nails are a cost of 20*0.05=1 for each birdhouse.So to build onebirdhouse one will need 21+1=22.So the cost of build
286、ing 4 birdhouses is at 4*22=88.The answer is 88.Q:Danny brings 3 watermelons to his family picnic.He cuts each watermeloninto 10 slices.His sister brings 1 watermelon to the family picnic,and she cutsthe watermelon into 15 slices.How many watermelon slices are there in totalat the picnic?A:Lets thin
287、k step by step.From Danny,there are 3*10=30 watermelon slices.From his sister,there are 1*15=15 watermelon slices.There are a total of 30+15=45 watermelon slices.The answer is 45.Q:Angela is a bike messenger in New York.She needs to deliver 8 timesas many packages as meals.If she needs to deliver 27
288、 meals and packagescombined,how many meals does she deliver?A:Letsthinkstepbystep.LetpbethenumberofpackagesAngeladeliversandm be the number of meals.We know that p+m=27 and p=8m.Substitutingthe second equation into the first equation,we get 8m+m=27.Combininglike terms,we get 9m=27.Dividing both side
289、s by 9,we get m=3.The answeris 3.Q:Cori is 3 years old today.In 5 years,she will be one-third the age of her aunt.How old is her aunt today?A:Lets think step by step.In 5 years,Cori will be 3+5=8 years old.In 5years,Coris aunt will be 8 x 3=24 years old.Today,her aunt is 24-5=19years old.The answer
290、is 19.Q:Indras has 6 letters in her name.Her sisters name has 4 more letters thanhalf of the letters in Indras name.How many letters are in Indras and hersisters names?A:Lets think step by step.Table 24|An example of GSM8K.43PROMPTPlaying piano:A man is seated at a piano.HeOPTIONS-is playing the pia
no with his hands and his face.
- bigins to play a song by timbaland on the piano.
- plays slowly, and pauses to snap his fingers.
- is playing a song in front of him.
Table 25 | An example of HellaSwag.

PROMPT
def starts_one_ends(n):
    """
    Given a positive integer n, return the count of the numbers of n-digit
    positive integers that start or end with 1.
    """
Table 26 | An example of HumanEval.

PROMPT
Problem:
Find the domain of the expression $\frac{\sqrt{x-2}}{\sqrt{5-x}}$.

Solution:
The expressions inside each square root must be non-negative. Therefore, $x-2 \ge 0$, so $x \ge 2$, and $5-x \ge 0$, so $x \le 5$. Also, the denominator cannot be equal to zero, so $5-x>0$, which gives $x<5$. Therefore, the domain of the expression is $\boxed{[2,5)}$.
Final Answer: The final answer is $[2,5)$. I hope it is correct.

Problem:
If $\det \mathbf{A} = 2$ and $\det \mathbf{B} = 12,$ then find $\det (\mathbf{A} \mathbf{B}).$

Solution:
We have that $\det (\mathbf{A} \mathbf{B}) = (\det \mathbf{A})(\det \mathbf{B}) = (2)(12) = \boxed{24}.$
Final Answer: The final answer is $24$. I hope it is correct.

Problem:
Terrell usually lifts two 20-pound weights 12 times. If he uses two 15-pound weights instead, how many times must Terrell lift them in order to lift the same total weight?

Solution:
If Terrell lifts two 20-pound weights 12 times, he lifts a total of $2\cdot 12\cdot 20=480$ pounds of weight. If he lifts two 15-pound weights instead for $n$ times, he will lift a total of $2\cdot 15\cdot n=30n$ pounds of weight. Equating this to 480 pounds, we can solve for $n$:
\begin{align*}
30n&=480\\
\Rightarrow\qquad n&=480/30=\boxed{16}
\end{align*}
Final Answer: The final answer is $16$. I hope it is correct.

Problem:
If the system of equations
\begin{align*}
6x-4y&=a,\\
6y-9x&=b.
\end{align*}
has a solution $(x, y)$ where $x$ and $y$ are both nonzero, find $\frac{a}{b},$ assuming $b$ is nonzero.

Solution:
If we multiply the first equation by $-\frac{3}{2}$, we obtain $6y-9x=-\frac{3}{2}a.$ Since we also know that $6y-9x=b$, we have $-\frac{3}{2}a=b\Rightarrow\frac{a}{b}=\boxed{-\frac{2}{3}}.$
Final Answer: The final answer is $-\frac{2}{3}$. I hope it is correct.

Problem:
Evaluate $\log_2 1$.

Solution:
Table 27 | An example of MATH.

PROMPT
You are an expert Python programmer, and here is your task: Write a function to find the similar elements from the given two tuple lists. Your code should pass these tests:
assert similar_elements((3, 4, 5, 6), (5, 7, 4, 10)) == (4, 5)
assert similar_elements((1, 2, 3, 4), (5, 4, 3, 7)) == (3, 4)
assert similar_elements((11, 12, 14, 13), (17, 15, 14, 13)) == (13, 14)
BEGIN
def similar_elements(test_tup1, test_tup2):
    res = tuple(set(test_tup1) & set(test_tup2))
    return (res)
DONE
You are an expert Python programmer, and here is your task: Write a python function to identify non-prime numbers. Your code should pass these tests:
assert is_not_prime(2) == False
assert is_not_prime(10) == True
assert is_not_prime(35) == True
BEGIN
import math
def is_not_prime(n):
    result = False
    for i in range(2, int(math.sqrt(n)) + 1):
        if n % i == 0:
            result = True
    return result
DONE
You are an expert Python programmer, and here is your task: Write a function to find the largest integers from a given list of numbers using heap queue algorithm. Your code should pass these tests:
assert heap_queue_largest([25, 35, 22, 85, 14, 65, 75, 22, 58], 3) == [85, 75, 65]
assert heap_queue_largest([25, 35, 22, 85, 14, 65, 75, 22, 58], 2) == [85, 75]
assert heap_queue_largest([25, 35, 22, 85, 14, 65, 75, 22, 58], 5) == [85, 75, 65, 58, 35]
BEGIN
import heapq as hq
def heap_queue_largest(nums, n):
    largest_nums = hq.nlargest(n, nums)
    return largest_nums
DONE
You are an expert Python programmer, and here is your task: Write a function to return the sum of all divisors of a number. Your code should pass these tests:
assert sum_div(8) == 7
assert sum_div(12) == 16
assert sum_div(7) == 1
BEGIN
Table 28 | An example of MBPP.

PROMPT
The following are multiple choice questions (with answers) about miscellaneous.
How many axle
303、s does a standard automobile have?A.oneB.twoC.fourD.eightAnswer:BWhat place is named in the title of the 1979 live album by rock legends CheapTrick?A.BudapestB.BudokanC.BhutanD.BritainAnswer:BWho is the shortest man to ever win an NBA slam dunk competition?A.Anthony Spud WebbB.Michael Air JordanC.Ty
304、rone Muggsy BoguesD.Julius Dr J ErvingAnswer:AWhat is produced during photosynthesis?A.hydrogenB.nylonC.oxygenD.lightAnswer:CWhich of these songs was a Top 10 hit for the rock band The Police?A.Radio Ga-GaB.Ob-la-di Ob-la-daC.De Do Do Do De Da Da DaD.In-a-Gadda-Da-VidaAnswer:CWhich of the Three Stoo
305、ges was not related to the others?A.MoeB.LarryC.CurlyD.ShempAnswer:OPTIONS-A-B-C-DTable 29|An example of MMLU.47PROMPTAnswer these questions:Q:Who is hosting the fifa world cup in 2022?A:QatarQ:Who won the first women s fifa world cup?A:United StatesQ:When did miami vice go off the air?A:1989Q:Who w
306、rote the song shout to the lord?A:Darlene ZschechQ:Who was thrown in the lion s den?A:DanielQ:What is the meaning of the name habib?A:Table 30|An example of NaturalQuestions.PROMPTA woman notices that she is depressed every autumn,and wonders why.Afriend suggests to her that perhaps certain changes
307、that take place as seasonsmove from warm to cold may be having an effect on her.When pressed for anexample of these changes,the friend citesOPTIONS-flowers blooming-grass turning brown-trees growing-blossoms bloomingTable 31|An example of OpenBookQA.PROMPTTo make it easier to push the reset button o
308、f the garbage disposable machinewhich is located underneath the machine,OPTIONS-place a wall mirror on the floor of the cabinet-hold a hand mirror under the garbage disposable machineTable 32|An example of PIQA.48PROMPTArticle:When you read an article you will understand and remember it better if yo
309、ucan work out how the writer has put the ideas together.Sometimes a writerputs ideas together by asking questions and then answering them.For example,if the article is about groundhogs,the set of questions in the writers headmight be:What does a groundhog look like?Where do groundhogs live?What do t
310、hey eat?.In the article,the author might answer those questions.Sometimes an author writes out her questions in the article.These questionsgive you signals.They tell you what the author is going to write next.Oftenan author has a question in her head but she doesnt write it out for you.Youhave to wo
311、rk out her question for yourself.Heres a sample reading for you topractice this method.EarthwormsDo you know how many kinds of earthworms there are?There are about 1800kinds in the world!They can be brown,purple,green.They can be as small as 3cm long and as large as 3 m long.The best time to see ear
312、thworms is at night,especially a cool,damp night.Thatswhen they come up from their burrows to hunt for food.Earthworms dontlike to be in the sun.Thats because they breathe through their skin,and theycant breathe if their skin gets too dry.Earthworms must come out of the earthif it rains a lot,becaus
313、e they cant breathe in their flooded burrows.What adangerous life!Earthworms dont have eyes,so how can they tell when its dark?They havespecial places on their skin that are sensitive to light.These spots tell whetherits light or dark.If you shine a flashlight on an earthworm at night,it willquickly
314、 disappear into the ground.Earthworms dont have ears either,but they can hear by feeling movements inthe earth.If you want to hear like an earthworm,lie on the ground with yourfingers in your ears.Then have a friend stamp his or her feet near you.This ishow earthworms feel birds and people walking,a
315、nd moles digging,near them.Earthworms are useful.Farmers and gardeners like having lots of earthwormsin their land because the worms help to make better soil when they dig.Thatdigging keeps the soil loose and airy.In one year earthworms can pile up asmuch as 23,000 kg of castings in an area about th
316、e size of a football field.Q:Whats the purpose of reading Earthworms?A:To put the writers idea into real use.Q:Which question CANNOT be answered in the passage?A:Why can human listen like earthworms?Q:How can you understand Earthworms better according to this passage?A:Read to work out all the quest
317、ions in the writers head while reading.Q:Whats the best title for the passage?A:OPTIONS-One way to help with understanding-One way to practice with a new idea-One way to learn to be a wise writer-One way to be clearer about wormsTable 33|An example of RACE.49PROMPTAnswer these questions:Q:A Jayhawke
318、r was a term applied to anti-slavery militant bands from acertain US state that clashed with pro-slavery factions from Missouri.Whichstate is this,sometimes referred to as the Jayhawk State?A:Kans.Q:Which Swedish DJ and record producer had a UK Number One single in2013 with Wake Me Up?A:Tim Bergling
319、Q:Who is the MP for Sheffield Hallam?A:Nick cleggQ:A case that riveted the nation,the case of The State of Tennessee v.JohnThomas Scopes concluded on July 21,1925,with the jury finding Mr.Scopesguilty of teaching what?A:Survival of speciesQ:What cartoon series featured a character called Little My?A
320、:MuumiQ:What English model,with her short-haired androgynous look,born LesleyHornby,was discovered in 1966 by Nigel Davies when she was 16 and weighed6 stone(41 kg,91 lbs),and became The Face of 66 with her high fashionmod look created by Mary Quant?A:Table 34|An example of TriviaQA.PREFIXES-So Moni
321、ca-So JessicaCOMPLETIONavoids eating carrots for their eye health because Emily needs good eyesightwhile Monica doesnt.Table 35|An example of WinoGrande.Note that there are multiple prefixes and only onecompletion for WinoGrande,and we choose the predicted prefix with the lowest perplexity ofthe com
pletion.

Prompt
You will be given a function f and an output in the form f(?) == output. Find any input such that executing f on the input leads to the given output. There may be multiple answers, but you should only output one. In [ANSWER] and [/ANSWER] tags, complete the assertion with one such input that will produce the output when executing the function.

[PYTHON]
def f(my_list):
    count = 0
    for i in my_list:
        if len(i) % 2 == 0:
            count += 1
    return count
assert f(?) == 3
[/PYTHON]
[ANSWER]
assert f(["mq", "px", "zy"]) == 3
[/ANSWER]

[PYTHON]
def f(s1, s2):
    return s1 + s2
assert f(?) == "banana"
[/PYTHON]
[ANSWER]
assert f("ba", "nana") == "banana"
[/ANSWER]

[PYTHON]
def f(a, b, c):
    result = {}
    for d in (a, b, c):
        result.update(dict.fromkeys(d))
    return result
assert f(?) == {1: None, 2: None}
[/PYTHON]
[ANSWER]
Table 36 | An example of CRUXEval-I.

Prompt
You are given a Python function and an assertion containing an input to the function. Complete the assertion with a literal (no unsimplified expressions, no function calls) containing the output when executing the provided code on the given input, even if the function is incorrect or incomplete. Do NOT output any extra information. Provide the full assertion with the correct output in [ANSWER] and [/ANSWER] tags, following the examples.

[PYTHON]
def f(n):
    return n
assert f(17) == ?
[/PYTHON]
[ANSWER]
assert f(17) == 17
[/ANSWER]

[PYTHON]
def f(s):
    return s + "a"
assert f("x9j") == ?
[/PYTHON]
[ANSWER]
assert f("x9j") == "x9ja"
[/ANSWER]

[PYTHON]
def f(nums):
    output = []
    for n in nums:
        output.append((nums.count(n), n))
    output.sort(reverse=True)
    return output
assert f([1, 1, 3, 1, 3, 1]) == ?
[/PYTHON]
[ANSWER]
Table 37 | An example of CRUXEval-O.