DeepSeek-V3 Technical Report

DeepSeek-AI

Abstract
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
Figure 1 | Benchmark performance of DeepSeek-V3 and its counterparts. (Benchmarks shown include MATH-500 (EM), AIME 2024 (Pass@1), Codeforces (Percentile), and SWE-bench Verified (Resolved), with accuracy/percentile (%) on the vertical axis; models compared: DeepSeek-V3, DeepSeek-V2.5, Qwen2.5-72B-Inst, Llama-3.1-405B-Inst, GPT-4o-0513, Claude-3.5-Sonnet-1022.)

arXiv:2412.19437v2 [cs.CL] 18 Feb 2025

Contents
1 Introduction 4
2 Architecture 6
  2.1 Basic Architecture 6
    2.1.1 Multi-Head Latent Attention 7
    2.1.2 DeepSeekMoE with Auxiliary-Loss-Free Load Balancing 8
  2.2 Multi-Token Prediction 10
3 Infrastructures 11
  3.1 Compute Clusters 11
  3.2 Training Framework 12
    3.2.1 DualPipe and Computation-Communication Overlap 12
    3.2.2 Efficient Implementation of Cross-Node All-to-All Communication 13
    3.2.3 Extremely Memory Saving with Minimal Overhead 14
  3.3 FP8 Training 14
    3.3.1 Mixed Precision Framework 15
    3.3.2 Improved Precision from Quantization and Multiplication 16
    3.3.3 Low-Precision Storage and Communication 18
  3.4 Inference and Deployment 18
    3.4.1 Prefilling 19
    3.4.2 Decoding 19
  3.5 Suggestions on Hardware Design 20
    3.5.1 Communication Hardware 20
    3.5.2 Compute Hardware 20
4 Pre-Training 21
  4.1 Data Construction 21
  4.2 Hyper-Parameters 22
  4.3 Long Context Extension 23
  4.4 Evaluations 24
    4.4.1 Evaluation Benchmarks 24
    4.4.2 Evaluation Results 24
  4.5 Discussion 26
    4.5.1 Ablation Studies for Multi-Token Prediction 26
    4.5.2 Ablation Studies for the Auxiliary-Loss-Free Balancing Strategy 26
    4.5.3 Batch-Wise Load Balance VS. Sequence-Wise Load Balance 27
5 Post-Training 28
  5.1 Supervised Fine-Tuning 28
  5.2 Reinforcement Learning 29
    5.2.1 Reward Model 29
    5.2.2 Group Relative Policy Optimization 30
  5.3 Evaluations 30
    5.3.1 Evaluation Settings 30
    5.3.2 Standard Evaluation 31
    5.3.3 Open-Ended Evaluation 33
    5.3.4 DeepSeek-V3 as a Generative Reward Model 33
  5.4 Discussion 34
    5.4.1 Distillation from DeepSeek-R1 34
    5.4.2 Self-Rewarding 34
    5.4.3 Multi-Token Prediction Evaluation 35
6 Conclusion, Limitations, and Future Directions 35
A Contributions and Acknowledgments 45
B Ablation Studies for Low-Precision Training 47
  B.1 FP8 v.s. BF16 Training 47
  B.2 Discussion About Block-Wise Quantization 47
C Expert Specialization Patterns of the 16B Aux-Loss-Based and Aux-Loss-Free Models 48

1. Introduction
In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards
Artificial General Intelligence (AGI). Beyond closed-source models, open-source models, including DeepSeek series (DeepSeek-AI, 2024a,b,c; Guo et al., 2024), LLaMA series (AI@Meta, 2024a,b; Touvron et al., 2023a,b), Qwen series (Qwen, 2023, 2024a,b), and Mistral series (Jiang et al., 2023; Mistral, 2024), are also making significant strides, endeavoring to close the gap with their closed-source counterparts. To further push the boundaries of open-source model capabilities, we scale up our models and introduce DeepSeek-V3, a large Mixture-of-Experts (MoE) model with 671B parameters, of which 37B are activated for each token.

With a forward-looking perspective, we consistently strive for strong model performance and economical costs. Therefore, in terms of architecture, DeepSeek-V3 still adopts Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training. These two architectures have been validated in DeepSeek-V2 (DeepSeek-AI, 2024c), demonstrating their capability to maintain robust model performance while achieving efficient training and inference. Beyond the basic architecture, we implement two additional strategies to further enhance the model capabilities. Firstly, DeepSeek-V3 pioneers an auxiliary-loss-free strategy (Wang et al., 2024a) for load balancing, with the aim of minimizing the adverse impact on model performance that arises from the effort to encourage load balancing. Secondly, DeepSeek-V3 employs a multi-token prediction training objective, which we have observed to enhance the overall performance on evaluation benchmarks.

In order to achieve efficient training, we support the FP8 mixed precision training and implement comprehensive optimizations for the training framework. Low-precision training has emerged as a promising solution for efficient training (Dettmers et al., 2022; Kalamkar et al., 2019; Narang et al., 2017; Peng et al., 2023b), its evolution being closely tied to advancements in hardware capabilities (Luo et al., 2024; Micikevicius et al., 2022; Rouhani et al., 2023a). In this work, we introduce an FP8 mixed precision training framework and, for the first time, validate its effectiveness on an extremely large-scale model. Through the support for FP8 computation and storage, we achieve both accelerated training and reduced GPU memory usage. As for the training framework, we design the DualPipe algorithm for efficient pipeline parallelism, which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap. This overlap ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead. In addition, we also develop efficient cross-node all-to-all communication kernels to fully utilize InfiniBand (IB) and NVLink bandwidths. Furthermore, we meticulously optimize the memory footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism. Combining these efforts, we achieve high training efficiency.

During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens. The pre-training process is remarkably stable. Throughout the entire training process, we did not encounter any irrecoverable loss spikes or have to roll back. Next, we conduct a two-stage context length extension for DeepSeek-V3. In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. Following this, we conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on the base model of DeepSeek-V3, to align it with human preferences and further unlock its potential. During the post-training stage, we distill the reasoning capability from the DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model accuracy and generation length.
Table 1 | Training costs of DeepSeek-V3, assuming the rental price of H800 is $2 per GPU hour.

                      Pre-Training   Context Extension   Post-Training   Total
in H800 GPU Hours     2664K          119K                5K              2788K
in USD                $5.328M        $0.238M             $0.01M          $5.576M

We evaluate DeepSeek-V3 on a comprehensive array of benchmarks. Despite its economical training costs, comprehensive evaluations reveal that DeepSeek-V3-Base has emerged as the strongest open-source base model currently available, especially in code and math. Its chat version also outperforms other open-source models and achieves performance comparable to leading closed-source models, including GPT-4o and Claude-3.5-Sonnet, on a series of standard and open-ended benchmarks.

Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.
Our main contributions include:

Architecture: Innovative Load Balancing Strategy and Training Objective
• On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing.
• We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. It can also be used for speculative decoding for inference acceleration.

Pre-Training: Towards Ultimate Training Efficiency
• We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model.
• Through the co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, achieving near-full computation-communication overlap. This significantly enhances our training efficiency and reduces the training costs, enabling us to further scale up the model size without additional overhead.
• At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model. The subsequent training stages after pre-training require only 0.1M GPU hours.

Post-Training: Knowledge Distillation from DeepSeek-R1
• We introduce an innovative methodology to distill reasoning capabilities from the long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek R1 series models, into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain control over the output style and length of DeepSeek-V3.
Summary of Core Evaluation Results
• Knowledge: (1) On educational benchmarks such as MMLU, MMLU-Pro, and GPQA, DeepSeek-V3 outperforms all other open-source models, achieving 88.5 on MMLU, 75.9 on MMLU-Pro, and 59.1 on GPQA. Its performance is comparable to leading closed-source models like GPT-4o and Claude-Sonnet-3.5, narrowing the gap between open-source and closed-source models in this domain. (2) For factuality benchmarks, DeepSeek-V3 demonstrates superior performance among open-source models on both SimpleQA and Chinese SimpleQA. While it trails behind GPT-4o and Claude-Sonnet-3.5 in English factual knowledge (SimpleQA), it surpasses these models in Chinese factual knowledge (Chinese SimpleQA), highlighting its strength in Chinese factual knowledge.
• Code, Math, and Reasoning: (1) DeepSeek-V3 achieves state-of-the-art performance on math-related benchmarks among all non-long-CoT open-source and closed-source models. Notably, it even outperforms o1-preview on specific benchmarks, such as MATH-500, demonstrating its robust mathematical reasoning capabilities. (2) On coding-related tasks, DeepSeek-V3 emerges as the top-performing model for coding competition benchmarks, such as LiveCodeBench, solidifying its position as the leading model in this domain. For engineering-related tasks, while DeepSeek-V3 performs slightly below Claude-Sonnet-3.5, it still outpaces all other models by a significant margin, demonstrating its competitiveness across diverse technical benchmarks.

In the remainder of this paper, we first present a detailed exposition of our DeepSeek-V3 model architecture (Section 2). Subsequently, we introduce our infrastructures, encompassing our compute clusters, the training framework, the support for FP8 training, the inference deployment strategy, and our suggestions on future hardware design. Next, we describe our pre-training process, including the construction of training data, hyper-parameter settings, long-context extension techniques, the associated evaluations, as well as some discussions (Section 4). Thereafter, we discuss our efforts on post-training, which include Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the corresponding evaluations, and discussions (Section 5). Lastly, we conclude this work, discuss existing limitations of DeepSeek-V3, and propose potential directions for future research (Section 6).
2. Architecture
We first introduce the basic architecture of DeepSeek-V3, featured by Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for economical training. Then, we present a Multi-Token Prediction (MTP) training objective, which we have observed to enhance the overall performance on evaluation benchmarks. For other minor details not explicitly mentioned, DeepSeek-V3 adheres to the settings of DeepSeek-V2 (DeepSeek-AI, 2024c).

2.1. Basic Architecture
The basic architecture of DeepSeek-V3 is still within the Transformer (Vaswani et al., 2017) framework. For efficient inference and economical training, DeepSeek-V3 also adopts MLA and DeepSeekMoE, which have been thoroughly validated by DeepSeek-V2. Compared with DeepSeek-V2, an exception is that we additionally introduce an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) for DeepSeekMoE to mitigate the performance degradation induced by the effort to ensure load balance. Figure 2 illustrates the basic architecture of DeepSeek-V3, and we will briefly review the details of MLA and DeepSeekMoE in this section.

Figure 2 | Illustration of the basic architecture of DeepSeek-V3. Following DeepSeek-V2, we adopt MLA and DeepSeekMoE for efficient inference and economical training.
2.1.1. Multi-Head Latent Attention
For attention, DeepSeek-V3 adopts the MLA architecture. Let d denote the embedding dimension, n_h denote the number of attention heads, d_h denote the dimension per head, and h_t ∈ R^d denote the attention input for the t-th token at a given attention layer. The core of MLA is the low-rank joint compression for attention keys and values to reduce Key-Value (KV) cache during inference:

    c_t^{KV} = W^{DKV} h_t,    (1)
    [k_{t,1}^C; k_{t,2}^C; ...; k_{t,n_h}^C] = k_t^C = W^{UK} c_t^{KV},    (2)
    k_t^R = RoPE(W^{KR} h_t),    (3)
    k_{t,i} = [k_{t,i}^C; k_t^R],    (4)
    [v_{t,1}^C; v_{t,2}^C; ...; v_{t,n_h}^C] = v_t^C = W^{UV} c_t^{KV},    (5)

where c_t^{KV} ∈ R^{d_c} is the compressed latent vector for keys and values; d_c (≪ d_h n_h) indicates the KV compression dimension; W^{DKV} ∈ R^{d_c × d} denotes the down-projection matrix; W^{UK}, W^{UV} ∈ R^{d_h n_h × d_c} are the up-projection matrices for keys and values, respectively; W^{KR} ∈ R^{d_h^R × d} is the matrix used to produce the decoupled key that carries Rotary Positional Embedding (RoPE) (Su et al., 2024); RoPE(·) denotes the operation that applies RoPE matrices; and [·; ·] denotes concatenation. Note that for MLA, only the blue-boxed vectors (i.e., c_t^{KV} and k_t^R) need to be cached during generation, which results in significantly reduced KV cache while maintaining performance comparable to standard Multi-Head Attention (MHA) (Vaswani et al., 2017).

For the attention queries, we also perform a low-rank compression, which can reduce the activation memory during training:

    c_t^Q = W^{DQ} h_t,    (6)
    [q_{t,1}^C; q_{t,2}^C; ...; q_{t,n_h}^C] = q_t^C = W^{UQ} c_t^Q,    (7)
    [q_{t,1}^R; q_{t,2}^R; ...; q_{t,n_h}^R] = q_t^R = RoPE(W^{QR} c_t^Q),    (8)
    q_{t,i} = [q_{t,i}^C; q_{t,i}^R],    (9)

where c_t^Q ∈ R^{d_c'} is the compressed latent vector for queries; d_c' (≪ d_h n_h) denotes the query compression dimension; W^{DQ} ∈ R^{d_c' × d}, W^{UQ} ∈ R^{d_h n_h × d_c'} are the down-projection and up-projection matrices for queries, respectively; and W^{QR} ∈ R^{d_h^R n_h × d_c'} is the matrix to produce the decoupled queries that carry RoPE.

Ultimately, the attention queries (q_{t,i}), keys (k_{j,i}), and values (v_{j,i}^C) are combined to yield the final attention output u_t:

    o_{t,i} = Σ_{j=1}^{t} Softmax_j( q_{t,i}^T k_{j,i} / sqrt(d_h + d_h^R) ) v_{j,i}^C,    (10)
    u_t = W^O [o_{t,1}; o_{t,2}; ...; o_{t,n_h}],    (11)

where W^O ∈ R^{d × d_h n_h} denotes the output projection matrix.
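To make the bookkeeping above concrete, the following is a minimal NumPy sketch of the low-rank KV compression path (Equations 1-5) for a single token. The weight names mirror the equations, but the dimensions are small toy values rather than DeepSeek-V3's actual settings, and RoPE is replaced by an identity stub; the point is only to show that c_t^{KV} and k_t^R are what gets cached.

```python
import numpy as np

# Toy dimensions (DeepSeek-V3's real settings are d=7168, n_h=128, d_h=128, d_c=512, d_hR=64).
d, n_h, d_h, d_c, d_hR = 32, 4, 8, 6, 4
rng = np.random.default_rng(0)

W_DKV = rng.standard_normal((d_c, d)) * 0.02          # down-projection for keys/values, Eq. (1)
W_UK  = rng.standard_normal((n_h * d_h, d_c)) * 0.02  # key up-projection, Eq. (2)
W_UV  = rng.standard_normal((n_h * d_h, d_c)) * 0.02  # value up-projection, Eq. (5)
W_KR  = rng.standard_normal((d_hR, d)) * 0.02         # decoupled RoPE key, Eq. (3)

def rope(x):
    # Placeholder: the real operator rotates channel pairs by position-dependent angles.
    return x

def cache_token(h_t):
    """Only c_KV (d_c values) and k_R (d_hR values) are stored per token."""
    c_KV = W_DKV @ h_t          # Eq. (1)
    k_R  = rope(W_KR @ h_t)     # Eq. (3)
    return c_KV, k_R

def expand_keys_values(c_KV, k_R):
    """Reconstruct the per-head keys and values from the cached latents on the fly."""
    k_C = (W_UK @ c_KV).reshape(n_h, d_h)                                     # Eq. (2)
    v_C = (W_UV @ c_KV).reshape(n_h, d_h)                                     # Eq. (5)
    k   = np.concatenate([k_C, np.broadcast_to(k_R, (n_h, d_hR))], axis=-1)   # Eq. (4)
    return k, v_C

h_t = rng.standard_normal(d)
c_KV, k_R = cache_token(h_t)
print("cached per token:", c_KV.size + k_R.size, "values")    # d_c + d_hR
print("standard MHA would cache:", 2 * n_h * d_h, "values")   # keys + values
```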
2.1.2. DeepSeekMoE with Auxiliary-Loss-Free Load Balancing
Basic Architecture of DeepSeekMoE. For Feed-Forward Networks (FFNs), DeepSeek-V3 employs the DeepSeekMoE architecture (Dai et al., 2024). Compared with traditional MoE architectures like GShard (Lepikhin et al., 2021), DeepSeekMoE uses finer-grained experts and isolates some experts as shared ones. Let u_t denote the FFN input of the t-th token, we compute the FFN output h_t' as follows:

    h_t' = u_t + Σ_{i=1}^{N_s} FFN_i^{(s)}(u_t) + Σ_{i=1}^{N_r} g_{i,t} FFN_i^{(r)}(u_t),    (12)
    g_{i,t} = g'_{i,t} / Σ_{j=1}^{N_r} g'_{j,t},    (13)
    g'_{i,t} = s_{i,t} if s_{i,t} ∈ Topk({s_{j,t} | 1 ≤ j ≤ N_r}, K_r), and 0 otherwise,    (14)
    s_{i,t} = Sigmoid(u_t^T e_i),    (15)

where N_s and N_r denote the numbers of shared experts and routed experts, respectively; FFN_i^{(s)}(·) and FFN_i^{(r)}(·) denote the i-th shared expert and the i-th routed expert, respectively; K_r denotes the number of activated routed experts; g_{i,t} is the gating value for the i-th expert; s_{i,t} is the token-to-expert affinity; e_i is the centroid vector of the i-th routed expert; and Topk(·, K) denotes the set comprising the K highest scores among the affinity scores calculated for the t-th token and all routed experts.
Slightly different from DeepSeek-V2, DeepSeek-V3 uses the sigmoid function to compute the affinity scores, and applies a normalization among all selected affinity scores to produce the gating values.

Auxiliary-Loss-Free Load Balancing. For MoE models, an unbalanced expert load will lead to routing collapse (Shazeer et al., 2017) and diminish computational efficiency in scenarios with expert parallelism. Conventional solutions usually rely on the auxiliary loss (Fedus et al., 2021; Lepikhin et al., 2021) to avoid unbalanced load. However, too large an auxiliary loss will impair the model performance (Wang et al., 2024a). To achieve a better trade-off between load balance and model performance, we pioneer an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) to ensure load balance. To be specific, we introduce a bias term b_i for each expert and add it to the corresponding affinity scores to determine the top-K routing:

    g'_{i,t} = s_{i,t} if s_{i,t} + b_i ∈ Topk({s_{j,t} + b_j | 1 ≤ j ≤ N_r}, K_r), and 0 otherwise.    (16)

Note that the bias term is only used for routing. The gating value, which will be multiplied with the FFN output, is still derived from the original affinity score s_{i,t}. During training, we keep monitoring the expert load on the whole batch of each training step. At the end of each step, we will decrease the bias term by γ if its corresponding expert is overloaded, and increase it by γ if its corresponding expert is underloaded, where γ is a hyper-parameter called bias update speed. Through the dynamic adjustment, DeepSeek-V3 keeps balanced expert load during training, and achieves better performance than models that encourage load balance through pure auxiliary losses.
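For illustration, here is a minimal sketch of the bias-update rule and the biased top-K selection described above, assuming per-expert token counts for the current batch are available. Treating load above the batch mean as "overloaded" is one reasonable reading of the rule, and the variable names are illustrative rather than the training framework's actual identifiers.

```python
import numpy as np

def update_expert_bias(expert_bias, tokens_per_expert, bias_update_speed=1e-3):
    """Auxiliary-loss-free balancing: nudge per-expert routing biases against observed load.

    expert_bias:       (N_r,) bias terms b_i, added to affinity scores only for top-K selection.
    tokens_per_expert: (N_r,) tokens routed to each expert over the whole batch of this step.
    """
    mean_load = tokens_per_expert.mean()
    # Decrease b_i for overloaded experts, increase it for underloaded ones.
    return np.where(tokens_per_expert > mean_load,
                    expert_bias - bias_update_speed,
                    expert_bias + bias_update_speed)

def route_one_token(affinities, expert_bias, k_r):
    """Selection uses s_{i,t} + b_i (Eq. 16); gating values use the original s_{i,t} (Eq. 13)."""
    selected = np.argsort(affinities + expert_bias)[-k_r:]
    gates = affinities[selected] / affinities[selected].sum()
    return selected, gates
```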
Complementary Sequence-Wise Auxiliary Loss. Although DeepSeek-V3 mainly relies on the auxiliary-loss-free strategy for load balance, to prevent extreme imbalance within any single sequence, we also employ a complementary sequence-wise balance loss:

    L_Bal = α Σ_{i=1}^{N_r} f_i P_i,    (17)
    f_i = (N_r / (K_r T)) Σ_{t=1}^{T} 1( s_{i,t} ∈ Topk({s_{j,t} | 1 ≤ j ≤ N_r}, K_r) ),    (18)
    s'_{i,t} = s_{i,t} / Σ_{j=1}^{N_r} s_{j,t},    (19)
    P_i = (1/T) Σ_{t=1}^{T} s'_{i,t},    (20)

where the balance factor α is a hyper-parameter, which will be assigned an extremely small value for DeepSeek-V3; 1(·) denotes the indicator function; and T denotes the number of tokens in a sequence. The sequence-wise balance loss encourages the expert load on each sequence to be balanced.

Figure 3 | Illustration of our Multi-Token Prediction (MTP) implementation. We keep the complete causal chain for the prediction of each token at each depth.

Node-Limited Routing. Like the device-limited routing used by DeepSeek-V2, DeepSeek-V3 also uses a restricted routing mechanism to limit communication costs during training.
In short, we ensure that each token will be sent to at most M nodes, which are selected according to the sum of the highest K_r/M affinity scores of the experts distributed on each node. Under this constraint, our MoE training framework can nearly achieve full computation-communication overlap.

No Token-Dropping. Due to the effective load balancing strategy, DeepSeek-V3 keeps a good load balance during its full training. Therefore, DeepSeek-V3 does not drop any tokens during training. In addition, we also implement specific deployment strategies to ensure inference load balance, so DeepSeek-V3 also does not drop tokens during inference.

2.2. Multi-Token Prediction
Inspired by Gloeckle et al. (2024), we investigate and set a Multi-Token Prediction (MTP) objective for DeepSeek-V3, which extends the prediction scope to multiple future tokens at each position. On the one hand, an MTP objective densifies the training signals and may improve data efficiency. On the other hand, MTP may enable the model to pre-plan its representations for better prediction of future tokens. Figure 3 illustrates our implementation of MTP. Different from Gloeckle et al. (2024), which parallelly predicts D additional tokens using independent output heads, we sequentially predict additional tokens and keep the complete causal chain at each prediction depth. We introduce the details of our MTP implementation in this section.
MTP Modules. To be specific, our MTP implementation uses D sequential modules to predict D additional tokens. The k-th MTP module consists of a shared embedding layer Emb(·), a shared output head OutHead(·), a Transformer block TRM_k(·), and a projection matrix M_k ∈ R^{d × 2d}. For the i-th input token t_i, at the k-th prediction depth, we first combine the representation of the i-th token at the (k−1)-th depth h_i^{k−1} ∈ R^d and the embedding of the (i+k)-th token Emb(t_{i+k}) ∈ R^d with the linear projection:

    h_i'^k = M_k [RMSNorm(h_i^{k−1}); RMSNorm(Emb(t_{i+k}))],    (21)

where [·; ·] denotes concatenation. Especially, when k = 1, h_i^{k−1} refers to the representation given by the main model. Note that for each MTP module, its embedding layer is shared with the main model. The combined h_i'^k serves as the input of the Transformer block at the k-th depth to produce the output representation at the current depth h_i^k:

    h_{1:T−k}^k = TRM_k(h_{1:T−k}'^k),    (22)

where T represents the input sequence length and i:j denotes the slicing operation (inclusive of both the left and right boundaries). Finally, taking h_i^k as the input, the shared output head will compute the probability distribution for the k-th additional prediction token P_{i+1+k}^k ∈ R^V, where V is the vocabulary size:

    P_{i+k+1}^k = OutHead(h_i^k).    (23)

The output head OutHead(·) linearly maps the representation to logits and subsequently applies the Softmax(·) function to compute the prediction probabilities of the k-th additional token. Also, for each MTP module, its output head is shared with the main model.
Our principle of maintaining the causal chain of predictions is similar to that of EAGLE (Li et al., 2024b), but its primary objective is speculative decoding (Leviathan et al., 2023; Xia et al., 2023), whereas we utilize MTP to improve training.

MTP Training Objective. For each prediction depth, we compute a cross-entropy loss L_MTP^k:

    L_MTP^k = CrossEntropy(P_{2+k:T+1}^k, t_{2+k:T+1}) = −(1/T) Σ_{i=2+k}^{T+1} log P_i^k[t_i],    (24)

where T denotes the input sequence length, t_i denotes the ground-truth token at the i-th position, and P_i^k[t_i] denotes the corresponding prediction probability of t_i, given by the k-th MTP module. Finally, we compute the average of the MTP losses across all depths and multiply it by a weighting factor λ to obtain the overall MTP loss L_MTP, which serves as an additional training objective for DeepSeek-V3:

    L_MTP = (λ/D) Σ_{k=1}^{D} L_MTP^k.    (25)
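For reference, Equations (21)-(24) for a single MTP depth can be condensed into a short sketch. This is a simplified, framework-agnostic rendering: emb, out_head, trm_k, and rmsnorm are placeholder callables standing in for the shared embedding layer, shared output head, the module's Transformer block, and RMSNorm, and the loss over one sequence is written with an explicit softmax rather than a library cross-entropy. Equation (25) would then average such losses over the D depths and scale by λ.

```python
import numpy as np

def mtp_depth_loss(h_prev, tokens, emb, out_head, trm_k, M_k, rmsnorm, k=1):
    """Cross-entropy loss of the k-th MTP module for one sequence (Eqs. 21-24).

    h_prev: (T, d) representations at depth k-1 (the main model's outputs when k == 1).
    tokens: (T,) integer token ids t_1..t_T (0-indexed here).
    M_k:    (d, 2d) projection matrix of the k-th MTP module.
    """
    T, d = h_prev.shape
    # Eq. (21): combine the depth-(k-1) state of token i with the embedding of token i+k.
    h_prime = np.concatenate([rmsnorm(h_prev[: T - k]),
                              rmsnorm(emb(tokens[k:]))], axis=-1) @ M_k.T
    h_k = trm_k(h_prime)                          # Eq. (22): Transformer block of depth k
    logits = out_head(h_k)                        # Eq. (23): shared output head, shape (T-k, V)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Eq. (24): position i predicts token t_{i+k+1}; the final prediction has no in-sequence target.
    targets = tokens[k + 1:]
    picked = probs[np.arange(len(targets)), targets]
    return -np.log(picked).mean()
```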
MTP in Inference. Our MTP strategy mainly aims to improve the performance of the main model, so during inference, we can directly discard the MTP modules and the main model can function independently and normally. Additionally, we can also repurpose these MTP modules for speculative decoding to further improve the generation latency.

3. Infrastructures
3.1. Compute Clusters
DeepSeek-V3 is trained on a cluster equipped with 2048 NVIDIA H800 GPUs. Each node in the H800 cluster contains 8 GPUs connected by NVLink and NVSwitch within nodes. Across different nodes, InfiniBand (IB) interconnects are utilized to facilitate communications.

Figure 4 | Overlapping strategy for a pair of individual forward and backward chunks (the boundaries of the transformer blocks are not aligned). Orange denotes forward, green denotes backward for input, blue denotes backward for weights, purple denotes PP communication, and red denotes barriers. Both all-to-all and PP communication can be fully hidden.
3.2. Training Framework
The training of DeepSeek-V3 is supported by the HAI-LLM framework, an efficient and lightweight training framework crafted by our engineers from the ground up. On the whole, DeepSeek-V3 applies 16-way Pipeline Parallelism (PP) (Qi et al., 2023a), 64-way Expert Parallelism (EP) (Lepikhin et al., 2021) spanning 8 nodes, and ZeRO-1 Data Parallelism (DP) (Rajbhandari et al., 2020).

In order to facilitate efficient training of DeepSeek-V3, we implement meticulous engineering optimizations. Firstly, we design the DualPipe algorithm for efficient pipeline parallelism. Compared with existing PP methods, DualPipe has fewer pipeline bubbles. More importantly, it overlaps the computation and communication phases across forward and backward processes, thereby addressing the challenge of heavy communication overhead introduced by cross-node expert parallelism. Secondly, we develop efficient cross-node all-to-all communication kernels to fully utilize IB and NVLink bandwidths and conserve Streaming Multiprocessors (SMs) dedicated to communication. Finally, we meticulously optimize the memory footprint during training, thereby enabling us to train DeepSeek-V3 without using costly Tensor Parallelism (TP).
3.2.1. DualPipe and Computation-Communication Overlap
For DeepSeek-V3, the communication overhead introduced by cross-node expert parallelism results in an inefficient computation-to-communication ratio of approximately 1:1. To tackle this challenge, we design an innovative pipeline parallelism algorithm called DualPipe, which not only accelerates model training by effectively overlapping forward and backward computation-communication phases, but also reduces the pipeline bubbles.

The key idea of DualPipe is to overlap the computation and communication within a pair of individual forward and backward chunks. To be specific, we divide each chunk into four components: attention, all-to-all dispatch, MLP, and all-to-all combine. Specially, for a backward chunk, both attention and MLP are further split into two parts, backward for input and backward for weights, like in ZeroBubble (Qi et al., 2023b). In addition, we have a PP communication component. As illustrated in Figure 4, for a pair of forward and backward chunks, we rearrange these components and manually adjust the ratio of GPU SMs dedicated to communication versus computation. In this overlapping strategy, we can ensure that both all-to-all and PP communication can be fully hidden during execution. Given the efficient overlapping strategy, the full DualPipe scheduling is illustrated in Figure 5. It employs a bidirectional pipeline scheduling, which feeds micro-batches from both ends of the pipeline simultaneously and a significant portion of communications can be fully overlapped. This overlap also ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead.
Figure 5 | Example DualPipe scheduling for 8 PP ranks and 20 micro-batches in two directions. The micro-batches in the reverse direction are symmetric to those in the forward direction, so we omit their batch ID for illustration simplicity. Two cells enclosed by a shared black border have mutually overlapped computation and communication.

Table 2 | Comparison of pipeline bubbles and memory usage across different pipeline parallel methods. F denotes the execution time of a forward chunk, B denotes the execution time of a full backward chunk, W denotes the execution time of a "backward for weights" chunk, and F&B denotes the execution time of two mutually overlapped forward and backward chunks.

Method            Bubble                  Parameter   Activation
1F1B              (PP−1)(F+B)             1×          PP
ZB1P              (PP−1)(F+B−2W)          1×          PP
DualPipe (Ours)   (PP/2−1)(F&B+B−3W)      2×          PP+1

In addition, even in more general scenarios without a heavy communication burden, DualPipe still exhibits efficiency advantages.
In Table 2, we summarize the pipeline bubbles and memory usage across different PP methods. As shown in the table, compared with ZB1P (Qi et al., 2023b) and 1F1B (Harlap et al., 2018), DualPipe significantly reduces the pipeline bubbles while only increasing the peak activation memory by 1/PP times. Although DualPipe requires keeping two copies of the model parameters, this does not significantly increase the memory consumption since we use a large EP size during training. Compared with Chimera (Li and Hoefler, 2021), DualPipe only requires that the pipeline stages and micro-batches be divisible by 2, without requiring micro-batches to be divisible by pipeline stages. In addition, for DualPipe, neither the bubbles nor activation memory will increase as the number of micro-batches grows.
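As a quick sanity check on the Table 2 formulas, the bubble sizes can be compared numerically. The chunk times below (F, B, W, and the overlapped F&B) are made-up illustrative values, not measured DeepSeek-V3 numbers.

```python
# Illustrative chunk times in arbitrary units; real values depend on the model and hardware.
F, B, W = 1.0, 2.0, 0.5      # forward, full backward, backward-for-weights
FB = 2.5                     # two mutually overlapped forward and backward chunks (F&B)
PP = 16                      # pipeline parallel ranks

bubble = {
    "1F1B":     (PP - 1) * (F + B),
    "ZB1P":     (PP - 1) * (F + B - 2 * W),
    "DualPipe": (PP / 2 - 1) * (FB + B - 3 * W),
}
for method, size in bubble.items():
    print(f"{method:>8}: bubble = {size:.1f}")
```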
3.2.2. Efficient Implementation of Cross-Node All-to-All Communication
In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster. To be specific, in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications are handled via NVLink. NVLink offers a bandwidth of 160 GB/s, roughly 3.2 times that of IB (50 GB/s). To effectively leverage the different bandwidths of IB and NVLink, we limit each token to be dispatched to at most 4 nodes, thereby reducing IB traffic. For each token, when its routing decision is made, it will first be transmitted via IB to the GPUs with the same in-node index on its target nodes. Once it reaches the target nodes, we will endeavor to ensure that it is instantaneously forwarded via NVLink to specific GPUs that host their target experts, without being blocked by subsequently arriving tokens. In this way, communications via IB and NVLink are fully overlapped, and each token can efficiently select an average of 3.2 experts per node without incurring additional overhead from NVLink. This implies that, although DeepSeek-V3 selects only 8 routed experts in practice, it can scale up this number to a maximum of 13 experts (4 nodes × 3.2 experts/node) while preserving the same communication cost. Overall, under such a communication strategy, only 20 SMs are sufficient to fully utilize the bandwidths of IB and NVLink.
In detail, we employ the warp specialization technique (Bauer et al., 2014) and partition 20 SMs into 10 communication channels. During the dispatching process, (1) IB sending, (2) IB-to-NVLink forwarding, and (3) NVLink receiving are handled by respective warps. The number of warps allocated to each communication task is dynamically adjusted according to the actual workload across all SMs. Similarly, during the combining process, (1) NVLink sending, (2) NVLink-to-IB forwarding and accumulation, and (3) IB receiving and accumulation are also handled by dynamically adjusted warps. In addition, both dispatching and combining kernels overlap with the computation stream, so we also consider their impact on other SM computation kernels. Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.

3.2.3. Extremely Memory Saving with Minimal Overhead
In order to reduce the memory footprint during training, we employ the following techniques.

Recomputation of RMSNorm and MLA Up-Projection. We recompute all RMSNorm operations and MLA up-projections during back-propagation, thereby eliminating the need to persistently store their output activations. With a minor overhead, this strategy significantly reduces memory requirements for storing activations.

Exponential Moving Average in CPU. During training, we preserve the Exponential Moving Average (EMA) of the model parameters for early estimation of the model performance after learning rate decay. The EMA parameters are stored in CPU memory and are updated asynchronously after each training step. This method allows us to maintain EMA parameters without incurring additional memory or time overhead.
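A minimal PyTorch-style sketch of a CPU-resident EMA of this kind is shown below; the decay value is an illustrative choice, and the asynchronous overlap with the next training step is noted in a comment rather than implemented.

```python
import torch

class CpuEMA:
    """Exponential moving average of model parameters, kept in CPU memory."""

    def __init__(self, model, decay=0.999):
        self.decay = decay
        # Shadow copies live on the CPU, so they consume no GPU memory.
        self.shadow = {name: p.detach().to("cpu", copy=True)
                       for name, p in model.named_parameters()}

    @torch.no_grad()
    def update(self, model):
        # Called after each optimizer step. In the setup described above, the device-to-host
        # copies and EMA arithmetic are overlapped asynchronously with the next training step;
        # this sketch performs them synchronously for simplicity.
        for name, p in model.named_parameters():
            self.shadow[name].mul_(self.decay).add_(p.detach().cpu(), alpha=1 - self.decay)
```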
Shared Embedding and Output Head for Multi-Token Prediction. With the DualPipe strategy, we deploy the shallowest layers (including the embedding layer) and deepest layers (including the output head) of the model on the same PP rank. This arrangement enables the physical sharing of parameters and gradients, of the shared embedding and output head, between the MTP module and the main model. This physical sharing mechanism further enhances our memory efficiency.

3.3. FP8 Training
Inspired by recent advances in low-precision training (Dettmers et al., 2022; Noune et al., 2022; Peng et al., 2023b), we propose a fine-grained mixed precision framework utilizing the FP8 data format for training DeepSeek-V3. While low-precision training holds great promise, it is often limited by the presence of outliers in activations, weights, and gradients (Fishman et al., 2024; He et al.; Sun et al., 2024). Although significant progress has been made in inference quantization (Frantar et al., 2022; Xiao et al., 2023), there are relatively few studies demonstrating successful application of low-precision techniques in large-scale language model pre-training (Fishman et al., 2024). To address this challenge and effectively extend the dynamic range of the FP8 format, we introduce a fine-grained quantization strategy: tile-wise grouping with 1 × N_c elements or block-wise grouping with N_c × N_c elements. The associated dequantization overhead is largely mitigated under our increased-precision accumulation process, a critical aspect for achieving accurate FP8 General Matrix Multiplication (GEMM). Moreover, to further reduce memory and communication overhead in MoE training, we cache and dispatch activations in FP8, while storing low-precision optimizer states in BF16. We validate the proposed FP8 mixed precision framework on two model scales similar to DeepSeek-V2-Lite and DeepSeek-V2, training for approximately 1 trillion tokens (see more details in Appendix B.1). Notably, compared with the BF16 baseline, the relative loss error of our FP8-training model remains consistently below 0.25%, a level well within the acceptable range of training randomness.

Figure 6 | The overall mixed precision framework with FP8 data format. For clarification, only the Linear operator is illustrated.
3.3.1. Mixed Precision Framework
Building upon widely adopted techniques in low-precision training (Kalamkar et al., 2019; Narang et al., 2017), we propose a mixed precision framework for FP8 training. In this framework, most compute-density operations are conducted in FP8, while a few key operations are strategically maintained in their original data formats to balance training efficiency and numerical stability. The overall framework is illustrated in Figure 6.

Firstly, in order to accelerate model training, the majority of core computation kernels, i.e., GEMM operations, are implemented in FP8 precision. These GEMM operations accept FP8 tensors as inputs and produce outputs in BF16 or FP32. As depicted in Figure 6, all three GEMMs associated with the Linear operator, namely Fprop (forward pass), Dgrad (activation backward pass), and Wgrad (weight backward pass), are executed in FP8. This design theoretically doubles the computational speed compared with the original BF16 method. Additionally, the FP8 Wgrad GEMM allows activations to be stored in FP8 for use in the backward pass. This significantly reduces memory consumption.

Despite the efficiency advantage of the FP8 format, certain operators still require a higher precision due to their sensitivity to low-precision computations. Besides, some low-cost operators can also utilize a higher precision with a negligible overhead to the overall training cost. For this reason, after careful investigations, we maintain the original precision (e.g., BF16 or FP32) for the following components: the embedding module, the output head, MoE gating modules, normalization operators, and attention operators. These targeted retentions of high precision ensure stable training dynamics for DeepSeek-V3. To further guarantee numerical stability, we store the master weights, weight gradients, and optimizer states in higher precision. While these high-precision components incur some memory overheads, their impact can be minimized through efficient sharding across multiple DP ranks in our distributed training system.

Figure 7 | (a) We propose a fine-grained quantization method to mitigate quantization errors caused by feature outliers; for illustration simplicity, only Fprop is illustrated. (b) In conjunction with our quantization strategy, we improve the FP8 GEMM precision by promoting to CUDA Cores at an interval of N_C = 128 elements MMA for the high-precision accumulation.
3.3.2. Improved Precision from Quantization and Multiplication
Based on our mixed precision FP8 framework, we introduce several strategies to enhance low-precision training accuracy, focusing on both the quantization method and the multiplication process.

Fine-Grained Quantization. In low-precision training frameworks, overflows and underflows are common challenges due to the limited dynamic range of the FP8 format, which is constrained by its reduced exponent bits. As a standard practice, the input distribution is aligned to the representable range of the FP8 format by scaling the maximum absolute value of the input tensor to the maximum representable value of FP8 (Narang et al., 2017). This method makes low-precision training highly sensitive to activation outliers, which can heavily degrade quantization accuracy. To solve this, we propose a fine-grained quantization method that applies scaling at a more granular level. As illustrated in Figure 7 (a), (1) for activations, we group and scale elements on a 1x128 tile basis (i.e., per token per 128 channels); and (2) for weights, we group and scale elements on a 128x128 block basis (i.e., per 128 input channels per 128 output channels). This approach ensures that the quantization process can better accommodate outliers by adapting the scale according to smaller groups of elements. In Appendix B.2, we further discuss the training instability when we group and scale activations on a block basis in the same way as weights quantization.

One key modification in our method is the introduction of per-group scaling factors along the inner dimension of GEMM operations. This functionality is not directly supported in the standard FP8 GEMM. However, combined with our precise FP32 accumulation strategy, it can be efficiently implemented. Notably, our fine-grained quantization strategy is highly consistent with the idea of microscaling formats (Rouhani et al., 2023b), while the Tensor Cores of NVIDIA next-generation GPUs (Blackwell series) have announced the support for microscaling formats with smaller quantization granularity (NVIDIA, 2024a). We hope our design can serve as a reference for future work to keep pace with the latest GPU architectures.
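To make the two granularities concrete, here is a minimal NumPy emulation of the grouping and scale computation. The function names are illustrative, the FP8 cast is only simulated by clipping at 448 (the largest finite E4M3 magnitude), and the dimensions are assumed to be divisible by 128; real kernels perform the cast in hardware and keep the scales for dequantization inside the GEMM.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3

def quantize_activation_tiles(x, tile=128):
    """Tile-wise quantization: one scale per token per `tile` channels (1 x 128 tiles)."""
    tokens, channels = x.shape                       # channels assumed divisible by `tile`
    x = x.reshape(tokens, channels // tile, tile)
    amax = np.abs(x).max(axis=-1, keepdims=True)     # online max per tile
    scale = amax / FP8_E4M3_MAX
    x_q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)   # stand-in for the FP8 cast
    return x_q.reshape(tokens, channels), scale.squeeze(-1)

def quantize_weight_blocks(w, block=128):
    """Block-wise quantization: one scale per 128 x 128 block of the weight matrix."""
    out_c, in_c = w.shape                            # both assumed divisible by `block`
    w = w.reshape(out_c // block, block, in_c // block, block)
    amax = np.abs(w).max(axis=(1, 3), keepdims=True)
    scale = amax / FP8_E4M3_MAX
    w_q = np.clip(w / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return w_q.reshape(out_c, in_c), scale.squeeze((1, 3))

# An outlier in one tile no longer inflates the scale of every other tile:
x = np.random.randn(4, 256).astype(np.float32)
x[0, 7] = 1e4                                        # a single activation outlier
x_q, scales = quantize_activation_tiles(x)
print(scales[0])                                     # only the first tile of token 0 gets a large scale
```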
Increasing Accumulation Precision. Low-precision GEMM operations often suffer from underflow issues, and their accuracy largely depends on high-precision accumulation, which is commonly performed in an FP32 precision (Kalamkar et al., 2019; Narang et al., 2017). However, we observe that the accumulation precision of FP8 GEMM on NVIDIA H800 GPUs is limited to retaining around 14 bits, which is significantly lower than FP32 accumulation precision. This problem will become more pronounced when the inner dimension K is large (Wortsman et al., 2023), a typical scenario in large-scale model training where the batch size and model width are increased. Taking GEMM operations of two random matrices with K = 4096 for example, in our preliminary test, the limited accumulation precision in Tensor Cores results in a maximum relative error of nearly 2%. Despite these problems, the limited accumulation precision is still the default option in a few FP8 frameworks (NVIDIA, 2024b), severely constraining the training accuracy.

In order to address this issue, we adopt the strategy of promotion to CUDA Cores for higher precision (Thakkar et al., 2023). The process is illustrated in Figure 7 (b). To be specific, during MMA (Matrix Multiply-Accumulate) execution on Tensor Cores, intermediate results are accumulated using the limited bit width. Once an interval of N_C is reached, these partial results will be copied to FP32 registers on CUDA Cores, where full-precision FP32 accumulation is performed. As mentioned before, our fine-grained quantization applies per-group scaling factors along the inner dimension K. These scaling factors can be efficiently multiplied on the CUDA Cores as the dequantization process with minimal additional computational cost.

It is worth noting that this modification reduces the WGMMA (Warpgroup-level Matrix Multiply-Accumulate) instruction issue rate for a single warpgroup. However, on the H800 architecture, it is typical for two WGMMA to persist concurrently: while one warpgroup performs the promotion operation, the other is able to execute the MMA operation. This design enables overlapping of the two operations, maintaining high utilization of Tensor Cores. Based on our experiments, setting N_C = 128 elements, equivalent to 4 WGMMAs, represents the minimal accumulation interval that can significantly improve precision without introducing substantial overhead.
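The effect of the promotion interval can be illustrated with a small emulation. The sketch below is not the Tensor Core datapath: it uses float16 partial sums as a crude stand-in for the limited-precision accumulator (the actual hardware retains roughly 14 bits) and promotes the partial result into an FP32 accumulator every `interval` elements, mirroring the N_C = 128 promotion described above.

```python
import numpy as np

def dot_with_promotion(a16, b16, interval=128):
    """Dot product over the inner dimension K, emulating a limited-precision accumulator
    (float16 here as a crude stand-in) with periodic promotion to FP32."""
    acc_fp32 = np.float32(0.0)
    partial = np.float16(0.0)
    for k in range(len(a16)):
        partial = np.float16(partial + a16[k] * b16[k])   # low-precision accumulation
        if (k + 1) % interval == 0:                       # promote every `interval` elements
            acc_fp32 += np.float32(partial)
            partial = np.float16(0.0)
    return acc_fp32 + np.float32(partial)

rng = np.random.default_rng(0)
a16 = rng.random(4096).astype(np.float16)
b16 = rng.random(4096).astype(np.float16)
reference = np.dot(a16.astype(np.float64), b16.astype(np.float64))   # exact accumulation

for interval in (1 << 30, 128):   # "never promote" vs. promote every 128 elements
    approx = dot_with_promotion(a16, b16, interval)
    print(f"interval={interval:>10}: relative error = {abs(approx - reference) / reference:.2e}")
```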
Mantissa over Exponents. In contrast to the hybrid FP8 format adopted by prior work (NVIDIA, 2024b; Peng et al., 2023b; Sun et al., 2019b), which uses E4M3 (4-bit exponent and 3-bit mantissa) in Fprop and E5M2 (5-bit exponent and 2-bit mantissa) in Dgrad and Wgrad, we adopt the E4M3 format on all tensors for higher precision. We attribute the feasibility of this approach to our fine-grained quantization strategy, i.e., tile- and block-wise scaling. By operating on smaller element groups, our methodology effectively shares exponent bits among these grouped elements, mitigating the impact of the limited dynamic range.

Online Quantization. Delayed quantization is employed in tensor-wise quantization frameworks (NVIDIA, 2024b; Peng et al., 2023b), which maintains a history of the maximum absolute values across prior iterations to infer the current value. In order to ensure accurate scales and simplify the framework, we calculate the maximum absolute value online for each 1x128 activation tile or 128x128 weight block. Based on it, we derive the scaling factor and then quantize the activation or weight online into the FP8 format.
3.3.3. Low-Precision Storage and Communication
In conjunction with our FP8 training framework, we further reduce the memory consumption and communication overhead by compressing cached activations and optimizer states into lower-precision formats.

Low-Precision Optimizer States. We adopt the BF16 data format instead of FP32 to track the first and second moments in the AdamW (Loshchilov and Hutter, 2017) optimizer, without incurring observable performance degradation. However, the master weights (stored by the optimizer) and gradients (used for batch size accumulation) are still retained in FP32 to ensure numerical stability throughout training.

Low-Precision Activation. As illustrated in Figure 6, the Wgrad operation is performed in FP8. To reduce the memory consumption, it is a natural choice to cache activations in FP8 format for the backward pass of the Linear operator. However, special considerations are taken on several operators for low-cost high-precision training:
(1) Inputs of the Linear after the attention operator. These activations are also used in the backward pass of the attention operator, which makes it sensitive to precision. We adopt a customized E5M6 data format exclusively for these activations. Additionally, these activations will be converted from a 1x128 quantization tile to a 128x1 tile in the backward pass. To avoid introducing extra quantization error, all the scaling factors are round scaled, i.e., integral power of 2.
(2) Inputs of the SwiGLU operator in MoE. To further reduce the memory cost, we cache the inputs of the SwiGLU operator and recompute its output in the backward pass. These activations are also stored in FP8 with our fine-grained quantization method, striking a balance between memory efficiency and computational accuracy.

Low-Precision Communication. Communication bandwidth is a critical bottleneck in the training of MoE models. To alleviate this challenge, we quantize the activation before MoE up-projections into FP8 and then apply dispatch components, which is compatible with FP8 Fprop in MoE up-projections. Like the inputs of the Linear after the attention operator, scaling factors for this activation are integral power of 2. A similar strategy is applied to the activation gradient before MoE down-projections. For both the forward and backward combine components, we retain them in BF16 to preserve training precision in critical parts of the training pipeline.
155、 critical parts of the trainingpipeline.3.4.Inference and DeploymentWe deploy DeepSeek-V3 on the H800 cluster,where GPUs within each node are interconnectedusing NVLink,and all GPUs across the cluster are fully interconnected via IB.To simultaneouslyensure both the Service-Level Objective(SLO)for on
156、line services and high throughput,weemploy the following deployment strategy that separates the prefilling and decoding stages.183.4.1.PrefillingThe minimum deployment unit of the prefilling stage consists of 4 nodes with 32 GPUs.Theattentionpart employs 4-way Tensor Parallelism(TP4)with Sequence Pa
157、rallelism(SP),com-bined with 8-way Data Parallelism(DP8).Its small TP size of 4 limits the overhead of TPcommunication.For theMoEpart,we use 32-way Expert Parallelism(EP32),which ensures thateach expert processes a sufficiently large batch size,thereby enhancing computational efficiency.For theMoEal
158、l-to-all communication,we use the same method as in training:first transferringtokens across nodes via IB,and then forwarding among the intra-node GPUs via NVLink.Inparticular,we use 1-way Tensor Parallelism for the dense MLPs in shallow layers to save TPcommunication.To achieve load balancing among
159、 different experts in theMoEpart,we need to ensure thateach GPU processes approximately the same number of tokens.To this end,we introduce adeployment strategy of redundant experts,which duplicates high-load experts and deploys themredundantly.The high-load experts are detected based on statistics c
160、ollected during the onlinedeployment and are adjusted periodically(e.g.,every 10 minutes).After determining the setof redundant experts,we carefully rearrange experts among GPUs within a node based on theobserved loads,striving to balance the load across GPUs as much as possible without increasingth
161、e cross-node all-to-all communication overhead.For the deployment of DeepSeek-V3,we set32 redundant experts for the prefilling stage.For each GPU,besides the original 8 experts ithosts,it will also host one additional redundant expert.Furthermore,in the prefilling stage,to improve the throughput and
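As an illustration of how such a redundancy set might be derived from the collected statistics, the sketch below simply duplicates the most-loaded experts up to the redundancy budget (32 in the prefilling setup). This is a simplified stand-in: the deployed system additionally rearranges expert placement within each node, which is not modeled here, and the load statistics are synthetic.

```python
import numpy as np

def choose_redundant_experts(tokens_per_expert, num_redundant=32):
    """Return the ids of the experts to duplicate: one redundant copy per heaviest expert."""
    heaviest_first = np.argsort(tokens_per_expert)[::-1]
    return heaviest_first[:num_redundant].tolist()

# Synthetic load statistics for 256 routed experts, skewed so that a few experts are "hot".
rng = np.random.default_rng(0)
observed_load = rng.pareto(2.0, size=256)
print(choose_redundant_experts(observed_load)[:8], "...")
```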
Furthermore, in the prefilling stage, to improve the throughput and hide the overhead of all-to-all and TP communication, we simultaneously process two micro-batches with similar computational workloads, overlapping the attention and MoE of one micro-batch with the dispatch and combine of another.

Finally, we are exploring a dynamic redundancy strategy for experts, where each GPU hosts more experts (e.g., 16 experts), but only 9 will be activated during each inference step. Before the all-to-all operation at each layer begins, we compute the globally optimal routing scheme on the fly. Given the substantial computation involved in the prefilling stage, the overhead of computing this routing scheme is almost negligible.

3.4.2. Decoding
During decoding, we treat the shared expert as a routed one. From this perspective, each token will select 9 experts during routing, where the shared expert is regarded as a heavy-load one that will always be selected. The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs. The attention part employs TP4 with SP, combined with DP80, while the MoE part uses EP320. For the MoE part, each GPU hosts only one expert, and 64 GPUs are responsible for hosting redundant experts and shared experts. All-to-all communication of the dispatch and combine parts is performed via direct point-to-point transfers over IB to achieve low latency. Additionally, we leverage the IBGDA (NVIDIA, 2022) technology to further minimize latency and enhance communication efficiency.

Similar to prefilling, we periodically determine the set of redundant experts in a certain interval, based on the statistical expert load from our online service. However, we do not need to rearrange experts since each GPU only hosts one expert. We are also exploring the dynamic redundancy strategy for decoding. However, this requires more careful optimization of the algorithm that computes the globally optimal routing scheme and the fusion with the dispatch kernel to reduce overhead.

Additionally, to enhance throughput and hide the overhead of all-to-all communication, we are also exploring processing two micro-batches with similar computational workloads simultaneously in the decoding stage. Unlike prefilling, attention consumes a larger portion of time in the decoding stage. Therefore, we overlap the attention of one micro-batch with the dispatch+MoE+combine of another. In the decoding stage, the batch size per expert is relatively small (usually within 256 tokens), and the bottleneck is memory access rather than computation. Since the MoE part only needs to load the parameters of one expert, the memory access overhead is minimal, so using fewer SMs will not significantly affect the overall performance. Therefore, to avoid impacting the computation speed of the attention part, we can allocate only a small portion of SMs to dispatch+MoE+combine.
3.5. Suggestions on Hardware Design
Based on our implementation of the all-to-all communication and FP8 training scheme, we propose the following suggestions on chip design to AI hardware vendors.

3.5.1. Communication Hardware
In DeepSeek-V3, we implement the overlap between computation and communication to hide the communication latency during computation. This significantly reduces the dependency on communication bandwidth compared to serial computation and communication. However, the current communication implementation relies on expensive SMs (e.g., we allocate 20 out of the 132 SMs available in the H800 GPU for this purpose), which will limit the computational throughput. Moreover, using SMs for communication results in significant inefficiencies, as tensor cores remain entirely under-utilized.

Currently, the SMs primarily perform the following tasks for all-to-all communication:
• Forwarding data between the IB (InfiniBand) and NVLink domain while aggregating IB traffic destined for multiple GPUs within the same node from a single GPU.
• Transporting data between RDMA buffers (registered GPU memory regions) and input/output buffers.
• Executing reduce operations for all-to-all combine.
• Managing fine-grained memory layout during chunked data transferring to multiple experts across the IB and NVLink domain.

We aspire to see future vendors developing hardware that offloads these communication tasks from the valuable computation unit SM, serving as a GPU co-processor or a network co-processor like NVIDIA SHARP (Graham et al., 2016). Furthermore, to reduce application programming complexity, we aim for this hardware to unify the IB (scale-out) and NVLink (scale-up) networks from the perspective of the computation units. With this unified interface, computation units can easily accomplish operations such as read, write, multicast, and reduce across the entire IB-NVLink-unified domain via submitting communication requests based on simple primitives.
178、 GEMM suffers from limited accumula-tion precision.After aligning 32 mantissa products by right-shifting based on the maximumexponent,the Tensor Core only uses the highest 14 bits of each mantissa product for addition,20and truncates bits exceeding this range.The accumulation of addition results int
179、o registers alsoemploys 14-bit precision.Our implementation partially mitigates the limitation by accumulatingthe addition results of 128 FP8FP8 multiplications into registers with FP32 precision in theCUDA core.Although helpful in achieving successful FP8 training,it is merely a compromisedue to th
180、e Hopper architectures hardware deficiency in FP8 GEMM accumulation precision.Future chips need to adopt higher precision.Support for Tile-and Block-Wise Quantization.Current GPUs only support per-tensorquantization,lacking the native support for fine-grained quantization like our tile-and block-wis
181、e quantization.In the current implementation,when theinterval is reached,the partialresults will be copied from Tensor Cores to CUDA cores,multiplied by the scaling factors,andadded to FP32 registers on CUDA cores.Although the dequantization overhead is significantlymitigated combined with our preci
182、se FP32 accumulation strategy,the frequent data movementsbetween Tensor Cores and CUDA cores still limit the computational efficiency.Therefore,werecommend future chips to support fine-grained quantization by enabling Tensor Cores toreceive scaling factors and implement MMA with group scaling.In thi
183、s way,the whole partialsum accumulation and dequantization can be completed directly inside Tensor Cores until thefinal result is produced,avoiding frequent data movements.Support for Online Quantization.The current implementations struggle to effectively supportonline quantization,despite its effec
184、tiveness demonstrated in our research.In the existingprocess,we need to read 128 BF16 activation values(the output of the previous computation)from HBM(High Bandwidth Memory)for quantization,and the quantized FP8 values arethen written back to HBM,only to be read again for MMA.To address this ineffi
185、ciency,werecommend that future chips integrate FP8 cast and TMA(Tensor Memory Accelerator)accessinto a single fused operation,so quantization can be completed during the transfer of activationsfrom global memory to shared memory,avoiding frequent memory reads and writes.We alsorecommend supporting a
186、 warp-level cast instruction for speedup,which further facilitates thebetter fusion of layer normalization and FP8 cast.Alternatively,a near-memory computingapproach can be adopted,where compute logic is placed near the HBM.In this case,BF16elements can be cast to FP8 directly as they are read from
187、HBM into the GPU,reducing off-chipmemory access by roughly 50%.Support for Transposed GEMM Operations.The current architecture makes it cumbersometo fuse matrix transposition with GEMM operations.In our workflow,activations during theforward pass are quantized into1x128FP8 tiles and stored.During th
Support for Transposed GEMM Operations. The current architecture makes it cumbersome to fuse matrix transposition with GEMM operations. In our workflow, activations during the forward pass are quantized into 1x128 FP8 tiles and stored. During the backward pass, the matrix needs to be read out, dequantized, transposed, re-quantized into 128x1 tiles, and stored in HBM. To reduce memory operations, we recommend future chips to enable direct transposed reads of matrices from shared memory before MMA operation, for those precisions required in both training and inference. Combined with the fusion of FP8 format conversion and TMA access, this enhancement will significantly streamline the quantization workflow.
4. Pre-Training

4.1. Data Construction

Compared with DeepSeek-V2, we optimize the pre-training corpus by enhancing the ratio of mathematical and programming samples, while expanding multilingual coverage beyond English and Chinese. Also, our data processing pipeline is refined to minimize redundancy while maintaining corpus diversity. Inspired by Ding et al. (2024), we implement the document packing method for data integrity but do not incorporate cross-sample attention masking during training. Finally, the training corpus for DeepSeek-V3 consists of 14.8T high-quality and diverse tokens in our tokenizer.

In the training process of DeepSeekCoder-V2 (DeepSeek-AI, 2024a), we observe that the Fill-in-Middle (FIM) strategy does not compromise the next-token prediction capability while enabling the model to accurately predict middle text based on contextual cues. In alignment with DeepSeekCoder-V2, we also incorporate the FIM strategy in the pre-training of DeepSeek-V3. To be specific, we employ the Prefix-Suffix-Middle (PSM) framework to structure data as follows:

<|fim_begin|>f_pre<|fim_hole|>f_suf<|fim_end|>f_middle<|eos_token|>.

This structure is applied at the document level as a part of the pre-packing process. The FIM strategy is applied at a rate of 0.1, consistent with the PSM framework.
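A minimal sketch of how such PSM-style FIM samples could be assembled at a 0.1 rate is given below. The split points, the sentinel strings, and the helper names are illustrative assumptions, not our production pipeline.

```python
import random

# Hedged sketch: build Prefix-Suffix-Middle (PSM) Fill-in-Middle samples at a
# fixed rate. Sentinel strings and the uniform split heuristic are assumptions.
FIM_RATE = 0.1
FIM_BEGIN, FIM_HOLE, FIM_END = "<|fim_begin|>", "<|fim_hole|>", "<|fim_end|>"

def maybe_apply_fim(document: str, rng: random.Random) -> str:
    """Return either the original document or its PSM-reordered FIM variant."""
    if rng.random() >= FIM_RATE or len(document) < 3:
        return document                      # plain next-token prediction sample
    # Pick two cut points that split the document into prefix / middle / suffix.
    i, j = sorted(rng.sample(range(1, len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # PSM order: prefix and suffix are given as context, the middle is predicted last.
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"

rng = random.Random(0)
docs = ["def add(a, b):\n    return a + b\n"] * 5
print([maybe_apply_fim(d, rng) for d in docs])
```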
The tokenizer for DeepSeek-V3 employs Byte-level BPE (Shibata et al., 1999) with an extended vocabulary of 128K tokens. The pretokenizer and training data for our tokenizer are modified to optimize multilingual compression efficiency. In addition, compared with DeepSeek-V2, the new pretokenizer introduces tokens that combine punctuations and line breaks. However, this trick may introduce the token boundary bias (Lundberg, 2023) when the model processes multi-line prompts without terminal line breaks, particularly for few-shot evaluation prompts. To address this issue, we randomly split a certain proportion of such combined tokens during training, which exposes the model to a wider array of special cases and mitigates this bias.
4.2. Hyper-Parameters

Model Hyper-Parameters. We set the number of Transformer layers to 61 and the hidden dimension to 7168. All learnable parameters are randomly initialized with a standard deviation of 0.006. In MLA, we set the number of attention heads to 128 and the per-head dimension to 128. The KV compression dimension is set to 512, and the query compression dimension is set to 1536. For the decoupled queries and key, we set the per-head dimension to 64. We substitute all FFNs except for the first three layers with MoE layers. Each MoE layer consists of 1 shared expert and 256 routed experts, where the intermediate hidden dimension of each expert is 2048. Among the routed experts, 8 experts will be activated for each token, and each token will be ensured to be sent to at most 4 nodes. The multi-token prediction depth is set to 1, i.e., besides the exact next token, each token will predict one additional token. As DeepSeek-V2, DeepSeek-V3 also employs additional RMSNorm layers after the compressed latent vectors, and multiplies additional scaling factors at the width bottlenecks. Under this configuration, DeepSeek-V3 comprises 671B total parameters, of which 37B are activated for each token.
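Collected in one place, and only as an illustrative configuration record (field names are ours, not an official API), the model hyper-parameters above read as follows:

```python
from dataclasses import dataclass

# Hedged sketch: the stated model hyper-parameters gathered into a config record.
# Field names are illustrative and do not correspond to any official DeepSeek API.
@dataclass(frozen=True)
class DeepSeekV3Config:
    num_layers: int = 61                 # Transformer layers
    hidden_dim: int = 7168
    init_std: float = 0.006
    num_heads: int = 128                 # MLA attention heads
    head_dim: int = 128
    kv_compression_dim: int = 512
    query_compression_dim: int = 1536
    decoupled_head_dim: int = 64         # decoupled queries / key per-head dimension
    first_dense_layers: int = 3          # FFN layers kept dense before MoE starts
    num_shared_experts: int = 1
    num_routed_experts: int = 256
    expert_hidden_dim: int = 2048
    experts_per_token: int = 8
    max_nodes_per_token: int = 4         # node-limited routing
    mtp_depth: int = 1                   # one extra predicted token

print(DeepSeekV3Config())
```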
Training Hyper-Parameters. We employ the AdamW optimizer (Loshchilov and Hutter, 2017) with hyper-parameters set to β1 = 0.9, β2 = 0.95, and weight_decay = 0.1. We set the maximum sequence length to 4K during pre-training, and pre-train DeepSeek-V3 on 14.8T tokens. As for the learning rate scheduling, we first linearly increase it from 0 to 2.2 × 10^-4 during the first 2K steps. Then, we keep a constant learning rate of 2.2 × 10^-4 until the model consumes 10T training tokens. Subsequently, we gradually decay the learning rate to 2.2 × 10^-5 in 4.3T tokens, following a cosine decay curve. During the training of the final 500B tokens, we keep a constant learning rate of 2.2 × 10^-5 in the first 333B tokens, and switch to another constant learning rate of 7.3 × 10^-6 in the remaining 167B tokens. The gradient clipping norm is set to 1.0. We employ a batch size scheduling strategy, where the batch size is gradually increased from 3072 to 15360 in the training of the first 469B tokens, and then keeps 15360 in the remaining training. We leverage pipeline parallelism to deploy different layers of a model on different GPUs, and for each layer, the routed experts will be uniformly deployed on 64 GPUs belonging to 8 nodes. As for the node-limited routing, each token will be sent to at most 4 nodes. For auxiliary-loss-free load balancing, we set the bias update speed to 0.001 for the first 14.3T tokens, and to 0.0 for the remaining 500B tokens. For the balance loss, we set the coefficient to 0.0001, just to avoid extreme imbalance within any single sequence. The MTP loss weight is set to 0.3 for the first 10T tokens, and to 0.1 for the remaining 4.8T tokens.
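The learning-rate and batch-size schedules described above can be written down as simple piecewise functions of the number of tokens consumed. The sketch below is a hedged reconstruction from the prose, not the training code: the warmup-token count is an assumed conversion of the 2K warmup steps, and the linear batch-size ramp is likewise assumed.

```python
import math

# Hedged sketch of the pre-training schedules as functions of tokens consumed.
LR_PEAK, LR_MID, LR_FINAL = 2.2e-4, 2.2e-5, 7.3e-6

def learning_rate(tokens: float, warmup_tokens: float = 2.5e10) -> float:
    # warmup_tokens: ~2K steps at the initial batch size and 4K sequence length (assumed).
    if tokens < warmup_tokens:
        return LR_PEAK * tokens / warmup_tokens
    if tokens < 10.0e12:                             # constant until 10T tokens
        return LR_PEAK
    if tokens < 14.3e12:                             # cosine decay over 4.3T tokens
        progress = (tokens - 10.0e12) / 4.3e12
        return LR_MID + 0.5 * (LR_PEAK - LR_MID) * (1 + math.cos(math.pi * progress))
    if tokens < 14.3e12 + 0.333e12:                  # first 333B of the final 500B tokens
        return LR_MID
    return LR_FINAL                                  # remaining 167B tokens

def batch_size(tokens: float) -> int:
    if tokens >= 0.469e12:
        return 15360
    # ramp from 3072 to 15360 over the first 469B tokens (linear ramp assumed)
    return int(3072 + (15360 - 3072) * tokens / 0.469e12)

for t in (1e9, 5e12, 12e12, 14.5e12, 14.7e12):
    print(f"{t:.2e} tokens -> lr={learning_rate(t):.2e}, batch={batch_size(t)}")
```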
Figure 8 | Evaluation results on the "Needle In A Haystack" (NIAH) tests. DeepSeek-V3 performs well across all context window lengths up to 128K.

4.3. Long Context Extension

We adopt a similar approach to DeepSeek-V2 (DeepSeek-AI, 2024c) to enable long context capabilities in DeepSeek-V3. After the pre-training stage, we apply YaRN (Peng et al., 2023a) for context extension and perform two additional training phases, each comprising 1000 steps, to progressively expand the context window from 4K to 32K and then to 128K. The YaRN configuration is consistent with that used in DeepSeek-V2, being applied exclusively to the decoupled shared key. The hyper-parameters remain identical across both phases, with the scale s = 40, α = 1, β = 32, and the scaling factor √t = 0.1 ln s + 1. In the first phase, the sequence length is set to 32K, and the batch size is 1920. During the second phase, the sequence length is increased to 128K, and the batch size is reduced to 480. The learning rate for both phases is set to 7.3 × 10^-6, matching the final learning rate from the pre-training stage.

Through this two-phase extension training, DeepSeek-V3 is capable of handling inputs up to 128K in length while maintaining strong performance. Figure 8 illustrates that DeepSeek-V3, following supervised fine-tuning, achieves notable performance on the Needle In A Haystack (NIAH) test, demonstrating consistent robustness across context window lengths up to 128K.
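As a quick numeric check of the configuration above, the YaRN attention-scaling factor and the nominal context-extension ratios of the two phases work out as follows (a hedged sketch using only the values stated in the text):

```python
import math

# Hedged sketch: the YaRN scaling factor sqrt(t) = 0.1 * ln(s) + 1 for s = 40,
# and the nominal context-extension ratios of the two training phases.
s = 40
sqrt_t = 0.1 * math.log(s) + 1.0
print(f"scaling factor sqrt(t) = {sqrt_t:.4f}")     # ~1.3689

base_context = 4_096
for phase, target in enumerate((32_768, 131_072), start=1):
    print(f"phase {phase}: {base_context} -> {target} tokens "
          f"(extension ratio {target / base_context:.0f}x)")
```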
4.4. Evaluations

4.4.1. Evaluation Benchmarks

The base model of DeepSeek-V3 is pretrained on a multilingual corpus with English and Chinese constituting the majority, so we evaluate its performance on a series of benchmarks primarily in English and Chinese, as well as on a multilingual benchmark. Our evaluation is based on our internal evaluation framework integrated in our HAI-LLM framework. Considered benchmarks are categorized and listed as follows, where underlined benchmarks are in Chinese and double-underlined benchmarks are multilingual ones:

Multi-subject multiple-choice datasets include MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024b), MMMLU (OpenAI, 2024b), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023).

Language understanding and reasoning datasets include HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), and BigBench Hard (BBH) (Suzgun et al., 2022).

Closed-book question answering datasets include TriviaQA (Joshi et al., 2017) and NaturalQuestions (Kwiatkowski et al., 2019).

Reading comprehension datasets include RACE (Lai et al., 2017), DROP (Dua et al., 2019), C3 (Sun et al., 2019a), and CMRC (Cui et al., 2019).

Reference disambiguation datasets include CLUEWSC (Xu et al., 2020) and WinoGrande (Sakaguchi et al., 2019).

Language modeling datasets include Pile (Gao et al., 2020).

Chinese understanding and culture datasets include CCPM (Li et al., 2021).

Math datasets include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), MGSM (Shi et al., 2023), and CMath (Wei et al., 2023).

Code datasets include HumanEval (Chen et al., 2021), LiveCodeBench-Base (0801-1101) (Jain et al., 2024), MBPP (Austin et al., 2021), and CRUXEval (Gu et al., 2024).

Standardized exams include AGIEval (Zhong et al., 2023). Note that AGIEval includes both English and Chinese subsets.

Following our previous work (DeepSeek-AI, 2024b,c), we adopt perplexity-based evaluation for datasets including HellaSwag, PIQA, WinoGrande, RACE-Middle, RACE-High, MMLU, MMLU-Redux, MMLU-Pro, MMMLU, ARC-Easy, ARC-Challenge, C-Eval, CMMLU, C3, and CCPM, and adopt generation-based evaluation for TriviaQA, NaturalQuestions, DROP, MATH, GSM8K, MGSM, HumanEval, MBPP, LiveCodeBench-Base, CRUXEval, BBH, AGIEval, CLUEWSC, CMRC, and CMath. In addition, we perform language-modeling-based evaluation for Pile-test and use Bits-Per-Byte (BPB) as the metric to guarantee fair comparison among models using different tokenizers.
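Bits-Per-Byte normalizes the language-modeling loss by the byte length of the text rather than by token count, which is what makes it tokenizer-independent. A minimal sketch of the computation (function and variable names are ours):

```python
import math

# Hedged sketch: Bits-Per-Byte from a summed cross-entropy loss.
# total_nll_nats is the sum of per-token negative log-likelihoods in nats over
# the evaluated text; num_bytes is the UTF-8 byte length of that same text.
def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    return total_nll_nats / (num_bytes * math.log(2))   # convert nats -> bits, per byte

# Usage: a model with mean loss 1.1 nats/token and 4 bytes/token on average.
tokens, mean_loss, bytes_per_token = 10_000, 1.1, 4
print(bits_per_byte(tokens * mean_loss, tokens * bytes_per_token))  # ~0.397
```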
4.4.2. Evaluation Results

In Table 3, we compare the base model of DeepSeek-V3 with the state-of-the-art open-source base models, including DeepSeek-V2-Base (DeepSeek-AI, 2024c) (our previous release), Qwen2.5 72B Base (Qwen, 2024b), and LLaMA-3.1 405B Base (AI@Meta, 2024b). We evaluate all these models with our internal evaluation framework, and ensure that they share the same evaluation setting. Note that due to the changes in our evaluation framework over the past months, the performance of DeepSeek-V2-Base exhibits a slight difference from our previously reported results. Overall, DeepSeek-V3-Base comprehensively outperforms DeepSeek-V2-Base and Qwen2.5 72B Base, and surpasses LLaMA-3.1 405B Base in the majority of benchmarks, essentially becoming the strongest open-source model.

Benchmark (Metric) | #Shots | DeepSeek-V2 Base | Qwen2.5 72B Base | LLaMA-3.1 405B Base | DeepSeek-V3 Base
Architecture | - | MoE | Dense | Dense | MoE
#Activated Params | - | 21B | 72B | 405B | 37B
#Total Params | - | 236B | 72B | 405B | 671B
English
Pile-test (BPB) | - | 0.606 | 0.638 | 0.542 | 0.548
BBH (EM) | 3-shot | 78.8 | 79.8 | 82.9 | 87.5
MMLU (EM) | 5-shot | 78.4 | 85.0 | 84.4 | 87.1
MMLU-Redux (EM) | 5-shot | 75.6 | 83.2 | 81.3 | 86.2
MMLU-Pro (EM) | 5-shot | 51.4 | 58.3 | 52.8 | 64.4
DROP (F1) | 3-shot | 80.4 | 80.6 | 86.0 | 89.0
ARC-Easy (EM) | 25-shot | 97.6 | 98.4 | 98.4 | 98.9
ARC-Challenge (EM) | 25-shot | 92.2 | 94.5 | 95.3 | 95.3
HellaSwag (EM) | 10-shot | 87.1 | 84.8 | 89.2 | 88.9
PIQA (EM) | 0-shot | 83.9 | 82.6 | 85.9 | 84.7
WinoGrande (EM) | 5-shot | 86.3 | 82.3 | 85.2 | 84.9
RACE-Middle (EM) | 5-shot | 73.1 | 68.1 | 74.2 | 67.1
RACE-High (EM) | 5-shot | 52.6 | 50.3 | 56.8 | 51.3
TriviaQA (EM) | 5-shot | 80.0 | 71.9 | 82.7 | 82.9
NaturalQuestions (EM) | 5-shot | 38.6 | 33.2 | 41.5 | 40.0
AGIEval (EM) | 0-shot | 57.5 | 75.8 | 60.6 | 79.6
Code
HumanEval (Pass@1) | 0-shot | 43.3 | 53.0 | 54.9 | 65.2
MBPP (Pass@1) | 3-shot | 65.0 | 72.6 | 68.4 | 75.4
LiveCodeBench-Base (Pass@1) | 3-shot | 11.6 | 12.9 | 15.5 | 19.4
CRUXEval-I (EM) | 2-shot | 52.5 | 59.1 | 58.5 | 67.3
CRUXEval-O (EM) | 2-shot | 49.8 | 59.9 | 59.9 | 69.8
Math
GSM8K (EM) | 8-shot | 81.6 | 88.3 | 83.5 | 89.3
MATH (EM) | 4-shot | 43.4 | 54.4 | 49.0 | 61.6
MGSM (EM) | 8-shot | 63.6 | 76.2 | 69.9 | 79.8
CMath (EM) | 3-shot | 78.7 | 84.5 | 77.3 | 90.7
Chinese
CLUEWSC (EM) | 5-shot | 82.0 | 82.5 | 83.0 | 82.7
C-Eval (EM) | 5-shot | 81.4 | 89.2 | 72.5 | 90.1
CMMLU (EM) | 5-shot | 84.0 | 89.5 | 73.7 | 88.8
CMRC (EM) | 1-shot | 77.4 | 75.8 | 76.0 | 76.3
C3 (EM) | 0-shot | 77.4 | 76.7 | 79.7 | 78.6
CCPM (EM) | 0-shot | 93.0 | 88.5 | 78.6 | 92.0
Multilingual
MMMLU-non-English (EM) | 5-shot | 64.0 | 74.8 | 73.8 | 79.4

Table 3 | Comparison among DeepSeek-V3-Base and other representative open-source base models. All models are evaluated in our internal framework and share the same evaluation setting. Scores with a gap not exceeding 0.3 are considered to be at the same level. DeepSeek-V3-Base achieves the best performance on most benchmarks, especially on math and code tasks.

From a more detailed perspective, we compare DeepSeek-V3-Base with the other open-source base models individually. (1) Compared with DeepSeek-V2-Base, due to the improvements in our model architecture, the scale-up of the model size and training tokens, and the enhancement of data quality, DeepSeek-V3-Base achieves significantly better performance as expected. (2) Compared with Qwen2.5 72B Base, the state-of-the-art Chinese open-source model, with only half of the activated parameters, DeepSeek-V3-Base also demonstrates remarkable advantages, especially on English, multilingual, code, and math benchmarks. As for Chinese benchmarks, except for CMMLU, a Chinese multi-subject multiple-choice task, DeepSeek-V3-Base also shows better performance than Qwen2.5 72B. (3) Compared with LLaMA-3.1 405B Base, the largest open-source model with 11 times the activated parameters, DeepSeek-V3-Base also exhibits much better performance on multilingual, code, and math benchmarks. As for English and Chinese language benchmarks, DeepSeek-V3-Base shows competitive or better performance, and is especially good on BBH, MMLU-series, DROP, C-Eval, CMMLU, and CCPM.

Due to our efficient architectures and comprehensive engineering optimizations, DeepSeek-V3 achieves extremely high training efficiency. Under our training framework and infrastructures, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, which is much cheaper than training 72B or 405B dense models.
Benchmark (Metric) | #Shots | Small MoE Baseline | Small MoE w/ MTP | Large MoE Baseline | Large MoE w/ MTP
#Activated Params (Inference) | - | 2.4B | 2.4B | 20.9B | 20.9B
#Total Params (Inference) | - | 15.7B | 15.7B | 228.7B | 228.7B
#Training Tokens | - | 1.33T | 1.33T | 540B | 540B
Pile-test (BPB) | - | 0.729 | 0.729 | 0.658 | 0.657
BBH (EM) | 3-shot | 39.0 | 41.4 | 70.0 | 70.7
MMLU (EM) | 5-shot | 50.0 | 53.3 | 67.5 | 66.6
DROP (F1) | 1-shot | 39.2 | 41.3 | 68.5 | 70.6
TriviaQA (EM) | 5-shot | 56.9 | 57.7 | 67.0 | 67.3
NaturalQuestions (EM) | 5-shot | 22.7 | 22.3 | 27.2 | 28.5
HumanEval (Pass@1) | 0-shot | 20.7 | 26.8 | 44.5 | 53.7
MBPP (Pass@1) | 3-shot | 35.8 | 36.8 | 61.6 | 62.2
GSM8K (EM) | 8-shot | 25.4 | 31.4 | 72.3 | 74.0
MATH (EM) | 4-shot | 10.7 | 12.6 | 38.6 | 39.8

Table 4 | Ablation results for the MTP strategy. The MTP strategy consistently enhances the model performance on most of the evaluation benchmarks.

4.5. Discussion

4.5.1. Ablation Studies for Multi-Token Prediction

In Table 4, we show the ablation results for the MTP strategy. To be specific, we validate the MTP strategy on top of two baseline models across different scales. At the small scale, we train a baseline MoE model comprising 15.7B total parameters on 1.33T tokens. At the large scale, we train a baseline MoE model comprising 228.7B total parameters on 540B tokens. On top of them, keeping the training data and the other architectures the same, we append a 1-depth MTP module onto them and train two models with the MTP strategy for comparison. Note that during inference, we directly discard the MTP module, so the inference costs of the compared models are exactly the same. From the table, we can observe that the MTP strategy consistently enhances the model performance on most of the evaluation benchmarks.
4.5.2. Ablation Studies for the Auxiliary-Loss-Free Balancing Strategy

In Table 5, we show the ablation results for the auxiliary-loss-free balancing strategy. We validate this strategy on top of two baseline models across different scales. At the small scale, we train a baseline MoE model comprising 15.7B total parameters on 1.33T tokens. At the large scale, we train a baseline MoE model comprising 228.7B total parameters on 578B tokens.

Benchmark (Metric) | #Shots | Small MoE Aux-Loss-Based | Small MoE Aux-Loss-Free | Large MoE Aux-Loss-Based | Large MoE Aux-Loss-Free
#Activated Params | - | 2.4B | 2.4B | 20.9B | 20.9B
#Total Params | - | 15.7B | 15.7B | 228.7B | 228.7B
#Training Tokens | - | 1.33T | 1.33T | 578B | 578B
Pile-test (BPB) | - | 0.727 | 0.724 | 0.656 | 0.652
BBH (EM) | 3-shot | 37.3 | 39.3 | 66.7 | 67.9
MMLU (EM) | 5-shot | 51.0 | 51.8 | 68.3 | 67.2
DROP (F1) | 1-shot | 38.1 | 39.0 | 67.1 | 67.1
TriviaQA (EM) | 5-shot | 58.3 | 58.5 | 66.7 | 67.7
NaturalQuestions (EM) | 5-shot | 23.2 | 23.4 | 27.1 | 28.1
HumanEval (Pass@1) | 0-shot | 22.0 | 22.6 | 40.2 | 46.3
MBPP (Pass@1) | 3-shot | 36.6 | 35.8 | 59.2 | 61.2
GSM8K (EM) | 8-shot | 27.1 | 29.6 | 70.7 | 74.5
MATH (EM) | 4-shot | 10.9 | 11.1 | 37.2 | 39.6

Table 5 | Ablation results for the auxiliary-loss-free balancing strategy. Compared with the purely auxiliary-loss-based method, the auxiliary-loss-free strategy consistently achieves better model performance on most of the evaluation benchmarks.

Both of the baseline models purely use auxiliary losses to encourage load balance, and use the sigmoid gating function with top-K affinity normalization. Their hyper-parameters to control the strength of auxiliary losses are the same as DeepSeek-V2-Lite and DeepSeek-V2, respectively. On top of these two baseline models, keeping the training data and the other architectures the same, we remove all auxiliary losses and introduce the auxiliary-loss-free balancing strategy for comparison. From the table, we can observe that the auxiliary-loss-free strategy consistently achieves better model performance on most of the evaluation benchmarks.
4.5.3. Batch-Wise Load Balance vs. Sequence-Wise Load Balance

The key distinction between auxiliary-loss-free balancing and sequence-wise auxiliary loss lies in their balancing scope: batch-wise versus sequence-wise. Compared with the sequence-wise auxiliary loss, batch-wise balancing imposes a more flexible constraint, as it does not enforce in-domain balance on each sequence. This flexibility allows experts to better specialize in different domains. To validate this, we record and analyze the expert load of a 16B auxiliary-loss-based baseline and a 16B auxiliary-loss-free model on different domains in the Pile test set. As illustrated in Figure 9, we observe that the auxiliary-loss-free model demonstrates greater expert specialization patterns as expected.

To further investigate the correlation between this flexibility and the advantage in model performance, we additionally design and validate a batch-wise auxiliary loss that encourages load balance on each training batch instead of on each sequence. The experimental results show that, when achieving a similar level of batch-wise load balance, the batch-wise auxiliary loss can also achieve similar model performance to the auxiliary-loss-free method. To be specific, in our experiments with 1B MoE models, the validation losses are: 2.258 (using a sequence-wise auxiliary loss), 2.253 (using the auxiliary-loss-free method), and 2.253 (using a batch-wise auxiliary loss). We also observe similar results on 3B MoE models: the model using a sequence-wise auxiliary loss achieves a validation loss of 2.085, and the models using the auxiliary-loss-free method or a batch-wise auxiliary loss achieve the same validation loss of 2.080.
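The sketch below contrasts the two balancing scopes for a standard f·P-style auxiliary loss: the sequence-wise variant applies the loss within every sequence, while the batch-wise variant applies it once over all tokens in the batch. It is a hedged illustration under assumed shapes and loss form, not the exact losses used in these experiments.

```python
import torch

# Hedged sketch contrasting sequence-wise and batch-wise auxiliary balance
# losses for an MoE router. Shapes, the f_i * P_i loss form, and all names are
# illustrative assumptions.

def balance_loss(probs, topk_idx, n_experts, alpha):
    """probs: [T, E] routing probabilities, topk_idx: [T, K] selected experts."""
    T = probs.shape[0]
    # f_i: fraction of tokens routed to expert i (scaled by E / K as is common)
    counts = torch.zeros(n_experts).scatter_add_(
        0, topk_idx.reshape(-1), torch.ones(topk_idx.numel()))
    f = counts * n_experts / (topk_idx.shape[1] * T)
    # P_i: mean routing probability of expert i
    P = probs.mean(dim=0)
    return alpha * torch.sum(f * P)

def sequence_wise_loss(probs, topk_idx, seq_lens, n_experts, alpha):
    """Apply the balance loss inside each sequence, then average."""
    losses, start = [], 0
    for L in seq_lens:
        losses.append(balance_loss(probs[start:start + L], topk_idx[start:start + L],
                                   n_experts, alpha))
        start += L
    return torch.stack(losses).mean()

def batch_wise_loss(probs, topk_idx, n_experts, alpha):
    """Apply the balance loss once over the whole batch of tokens."""
    return balance_loss(probs, topk_idx, n_experts, alpha)

# Tiny usage example: 2 sequences of 8 tokens, 16 experts, top-2 routing.
E, K = 16, 2
probs = torch.softmax(torch.randn(16, E), dim=-1)
topk_idx = probs.topk(K, dim=-1).indices
print(sequence_wise_loss(probs, topk_idx, [8, 8], E, alpha=1e-4))
print(batch_wise_loss(probs, topk_idx, E, alpha=1e-4))
```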
Figure 9 | Expert load of auxiliary-loss-free and auxiliary-loss-based models on three domains in the Pile test set (Wikipedia (en), Github, and DM Mathematics), shown for Layer 9 and Layer 18. The auxiliary-loss-free model shows greater expert specialization patterns than the auxiliary-loss-based one. The relative expert load denotes the ratio between the actual expert load and the theoretically balanced expert load. Due to space constraints, we only present the results of two layers as an example, with the results of all layers provided in Appendix C.

In addition, although the batch-wise load balancing methods show consistent performance advantages, they also face two potential challenges in efficiency: (1) load imbalance within certain sequences or small batches, and (2) domain-shift-induced load imbalance during inference. The first challenge is naturally addressed by our training framework that uses large-scale expert parallelism and data parallelism, which guarantees a large size of each micro-batch. For the second challenge, we also design and implement an efficient inference framework with redundant expert deployment, as described in Section 3.4, to overcome it.
5. Post-Training

5.1. Supervised Fine-Tuning

We curate our instruction-tuning datasets to include 1.5M instances spanning multiple domains, with each domain employing distinct data creation methods tailored to its specific requirements.

Reasoning Data. For reasoning-related datasets, including those focused on mathematics, code competition problems, and logic puzzles, we generate the data by leveraging an internal DeepSeek-R1 model. Specifically, while the R1-generated data demonstrates strong accuracy, it suffers from issues such as overthinking, poor formatting, and excessive length. Our objective is to balance the high accuracy of R1-generated reasoning data and the clarity and conciseness of regularly formatted reasoning data.

To establish our methodology, we begin by developing an expert model tailored to a specific domain, such as code, mathematics, or general reasoning, using a combined Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training pipeline. This expert model serves as a data generator for the final model. The training process involves generating two distinct types of SFT samples for each instance: the first couples the problem with its original response in the format of <problem, original response>, while the second incorporates a system prompt alongside the problem and the R1 response in the format of <system prompt, problem, R1 response>. The system prompt is meticulously designed to include instructions that guide the model toward producing responses enriched with mechanisms for reflection and verification. During the RL phase, the model leverages high-temperature sampling to generate responses that integrate patterns from both the R1-generated and original data, even in the absence of explicit system prompts. After hundreds of RL steps, the intermediate RL model learns to incorporate R1 patterns, thereby enhancing overall performance strategically.

Upon completing the RL training phase, we implement rejection sampling to curate high-quality SFT data for the final model, where the expert models are used as data generation sources. This method ensures that the final training data retains the strengths of DeepSeek-R1 while producing responses that are concise and effective.

Non-Reasoning Data. For non-reasoning data, such as creative writing, role-play, and simple question answering, we utilize DeepSeek-V2.5 to generate responses and enlist human annotators to verify the accuracy and correctness of the data.

SFT Settings. We fine-tune DeepSeek-V3-Base for two epochs using the SFT dataset, using the cosine decay learning rate scheduling that starts at 5 × 10^-6 and gradually decreases to 1 × 10^-6. During training, each single sequence is packed from multiple samples. However, we adopt a sample masking strategy to ensure that these examples remain isolated and mutually invisible.
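The sample-masking idea can be pictured as a block-diagonal attention mask over a packed sequence: tokens may attend causally within their own sample but never across sample boundaries. The sketch below builds such a mask; shapes and names are illustrative assumptions, not the training code.

```python
import torch

# Hedged sketch: build a causal, block-diagonal attention mask for a sequence
# packed from several independent SFT samples, so samples stay mutually invisible.
def packed_causal_mask(sample_lengths):
    """True means 'may attend'. Tokens attend causally within their own sample only."""
    total = sum(sample_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for length in sample_lengths:
        block = torch.tril(torch.ones(length, length)).bool()
        mask[start:start + length, start:start + length] = block
        start += length
    return mask

# Usage: one packed sequence containing samples of length 3, 2, and 4.
mask = packed_causal_mask([3, 2, 4])
print(mask.int())
# Token 3 (first token of the second sample) cannot attend to tokens 0-2.
```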
5.2. Reinforcement Learning

5.2.1. Reward Model

We employ a rule-based Reward Model (RM) and a model-based RM in our RL process.

Rule-Based RM. For questions that can be validated using specific rules, we adopt a rule-based reward system to determine the feedback. For instance, certain math problems have deterministic results, and we require the model to provide the final answer within a designated format (e.g., in a box), allowing us to apply rules to verify the correctness. Similarly, for LeetCode problems, we can utilize a compiler to generate feedback based on test cases. By leveraging rule-based validation wherever possible, we ensure a higher level of reliability, as this approach is resistant to manipulation or exploitation.
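A rule-based reward of this kind can be as simple as the sketch below: extract a boxed final answer and compare it against the reference, or run a candidate program against test cases. The regular expression, the exact-match criterion, and the function names are illustrative assumptions, not the production checker.

```python
import re

# Hedged sketch of a rule-based reward: 1.0 if the boxed final answer matches
# the reference exactly (after whitespace normalization), else 0.0.
def math_reward(response: str, reference: str) -> float:
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    if match is None:
        return 0.0                                   # no answer in the designated format
    answer = match.group(1).strip()
    return 1.0 if answer == reference.strip() else 0.0

def code_reward(program_src: str, test_cases) -> float:
    """Fraction of (args, expected) pairs a submitted `solve` function passes."""
    namespace = {}
    exec(program_src, namespace)                     # stand-in for a sandboxed compiler run
    solve = namespace["solve"]
    passed = sum(1 for args, expected in test_cases if solve(*args) == expected)
    return passed / len(test_cases)

print(math_reward(r"... so the result is \boxed{42}.", "42"))                              # 1.0
print(code_reward("def solve(a, b):\n    return a + b\n", [((1, 2), 3), ((5, 5), 10)]))    # 1.0
```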
Model-Based RM. For questions with free-form ground-truth answers, we rely on the reward model to determine whether the response matches the expected ground-truth. Conversely, for questions without a definitive ground-truth, such as those involving creative writing, the reward model is tasked with providing feedback based on the question and the corresponding answer as inputs. The reward model is trained from the DeepSeek-V3 SFT checkpoints. To enhance its reliability, we construct preference data that not only provides the final reward but also includes the chain-of-thought leading to the reward. This approach helps mitigate the risk of reward hacking in specific tasks.
5.2.2. Group Relative Policy Optimization

Similar to DeepSeek-V2 (DeepSeek-AI, 2024c), we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically with the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \cdots, o_G\}$ from the old policy model $\pi_{\theta_{\mathrm{old}}}$ and then optimizes the policy model $\pi_\theta$ by maximizing the following objective:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\!\left[q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)\right] \frac{1}{G}\sum_{i=1}^{G}\left(\min\!\left(\frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)} A_i,\ \mathrm{clip}\!\left(\frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}, 1-\varepsilon, 1+\varepsilon\right) A_i\right) - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}}\right)\right), \tag{26}$$

$$\mathbb{D}_{\mathrm{KL}}\!\left(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}}\right) = \frac{\pi_{\mathrm{ref}}(o_i \mid q)}{\pi_{\theta}(o_i \mid q)} - \log\frac{\pi_{\mathrm{ref}}(o_i \mid q)}{\pi_{\theta}(o_i \mid q)} - 1, \tag{27}$$

where $\varepsilon$ and $\beta$ are hyper-parameters; $\pi_{\mathrm{ref}}$ is the reference model; and $A_i$ is the advantage, derived from the rewards $\{r_1, r_2, \ldots, r_G\}$ corresponding to the outputs within each group:

$$A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \cdots, r_G\})}{\mathrm{std}(\{r_1, r_2, \cdots, r_G\})}. \tag{28}$$
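The core of the objective is easy to state in code: rewards within each sampled group are standardized to form advantages, and a clipped importance-weighted surrogate, minus a KL penalty toward the reference model, is maximized. The sketch below implements that computation under simplified assumptions (one likelihood ratio per output rather than per token; hyper-parameter values and names are illustrative):

```python
import torch

# Hedged sketch of the GRPO surrogate for one group of G sampled outputs.
# logp_* are summed log-probabilities of each output under the current, old,
# and reference policies; rewards are the scalar rewards of the outputs.
def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.01):
    # Group-relative advantages: standardize rewards within the group (Eq. 28).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)                       # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)
    # Unbiased-style KL estimate toward the reference policy (Eq. 27).
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0
    # Return a loss to minimize (negative of the objective to maximize).
    return -(surrogate - beta * kl).mean()

# Usage with a group of G = 4 outputs.
G = 4
logp_new = torch.randn(G, requires_grad=True)
logp_old = logp_new.detach() + 0.1 * torch.randn(G)
logp_ref = torch.randn(G)
rewards = torch.tensor([1.0, 0.0, 0.5, 0.0])
grpo_loss(logp_new, logp_old, logp_ref, rewards).backward()
```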
We incorporate prompts from diverse domains, such as coding, math, writing, role-playing, and question answering, during the RL process. This approach not only aligns the model more closely with human preferences but also enhances performance on benchmarks, especially in scenarios where available SFT data are limited.

5.3. Evaluations

5.3.1. Evaluation Settings

Evaluation Benchmarks. Apart from the benchmarks we used for base model testing, we further evaluate instructed models on IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), LongBench v2 (Bai et al., 2024), GPQA (Rein et al., 2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI, 2024d), Aider, LiveCodeBench (Jain et al., 2024) (questions from August 2024 to November 2024), Codeforces, Chinese National High School Mathematics Olympiad (CNMO 2024), and American Invitational Mathematics Examination 2024 (AIME 2024) (MAA, 2024).

Compared Baselines. We conduct comprehensive evaluations of our chat model against several strong baselines, including DeepSeek-V2-0506, DeepSeek-V2.5-0905, Qwen2.5 72B Instruct, LLaMA-3.1 405B Instruct, Claude-Sonnet-3.5-1022, and GPT-4o-0513. For the DeepSeek-V2 model series, we select the most representative variants for comparison. For closed-source models, evaluations are performed through their respective APIs.

Detailed Evaluation Configurations. For standard benchmarks including MMLU, DROP, GPQA, and SimpleQA, we adopt the evaluation prompts from the simple-evals framework. We utilize the Zero-Eval prompt format (Lin, 2024) for MMLU-Redux in a zero-shot setting. For other datasets, we follow their original evaluation protocols with default prompts as provided by the dataset creators. For code and math benchmarks, the HumanEval-Mul dataset includes 8 mainstream programming languages (Python, Java, C++, C#, JavaScript, TypeScript, PHP, and Bash) in total. We use CoT and non-CoT methods to evaluate model performance on LiveCodeBench, where the data are collected from August 2024 to November 2024. The Codeforces dataset is measured using the percentage of competitors. SWE-Bench Verified is evaluated using the agentless framework (Xia et al., 2024). We use the "diff" format to evaluate the Aider-related benchmarks. For mathematical assessments, AIME and CNMO 2024 are evaluated with a temperature of 0.7, and the results are averaged over 16 runs, while MATH-500 employs greedy decoding. We allow all models to output a maximum of 8192 tokens for each benchmark.
Benchmark (Metric) | DeepSeek-V2-0506 | DeepSeek-V2.5-0905 | Qwen2.5 72B-Inst. | LLaMA-3.1 405B-Inst. | Claude-3.5-Sonnet-1022 | GPT-4o-0513 | DeepSeek-V3
Architecture | MoE | MoE | Dense | Dense | - | - | MoE
#Activated Params | 21B | 21B | 72B | 405B | - | - | 37B
#Total Params | 236B | 236B | 72B | 405B | - | - | 671B
English
MMLU (EM) | 78.2 | 80.6 | 85.3 | 88.6 | 88.3 | 87.2 | 88.5
MMLU-Redux (EM) | 77.9 | 80.3 | 85.6 | 86.2 | 88.9 | 88.0 | 89.1
MMLU-Pro (EM) | 58.5 | 66.2 | 71.6 | 73.3 | 78.0 | 72.6 | 75.9
DROP (3-shot F1) | 83.0 | 87.8 | 76.7 | 88.7 | 88.3 | 83.7 | 91.6
IF-Eval (Prompt Strict) | 57.7 | 80.6 | 84.1 | 86.0 | 86.5 | 84.3 | 86.1
GPQA-Diamond (Pass@1) | 35.3 | 41.3 | 49.0 | 51.1 | 65.0 | 49.9 | 59.1
SimpleQA (Correct) | 9.0 | 10.2 | 9.1 | 17.1 | 28.4 | 38.2 | 24.9
FRAMES (Acc.) | 66.9 | 65.4 | 69.8 | 70.0 | 72.5 | 80.5 | 73.3
LongBench v2 (Acc.) | 31.6 | 35.4 | 39.4 | 36.1 | 41.0 | 48.1 | 48.7
Code
HumanEval-Mul (Pass@1) | 69.3 | 77.4 | 77.3 | 77.2 | 81.7 | 80.5 | 82.6
LiveCodeBench (Pass@1-COT) | 18.8 | 29.2 | 31.1 | 28.4 | 36.3 | 33.4 | 40.5
LiveCodeBench (Pass@1) | 20.3 | 28.4 | 28.7 | 30.1 | 32.8 | 34.2 | 37.6
Codeforces (Percentile) | 17.5 | 35.6 | 24.8 | 25.3 | 20.3 | 23.6 | 51.6
SWE Verified (Resolved) | - | 22.6 | 23.8 | 24.5 | 50.8 | 38.8 | 42.0
Aider-Edit (Acc.) | 60.3 | 71.6 | 65.4 | 63.9 | 84.2 | 72.9 | 79.7
Aider-Polyglot (Acc.) | - | 18.2 | 7.6 | 5.8 | 45.3 | 16.0 | 49.6
Math
AIME 2024 (Pass@1) | 4.6 | 16.7 | 23.3 | 23.3 | 16.0 | 9.3 | 39.2
MATH-500 (EM) | 56.3 | 74.7 | 80.0 | 73.8 | 78.3 | 74.6 | 90.2
CNMO 2024 (Pass@1) | 2.8 | 10.8 | 15.9 | 6.8 | 13.1 | 10.8 | 43.2
Chinese
CLUEWSC (EM) | 89.9 | 90.4 | 91.4 | 84.7 | 85.4 | 87.9 | 90.9
C-Eval (EM) | 78.6 | 79.5 | 86.1 | 61.5 | 76.7 | 76.0 | 86.5
C-SimpleQA (Correct) | 48.5 | 54.1 | 48.4 | 50.4 | 51.3 | 59.3 | 64.8

Table 6 | Comparison between DeepSeek-V3 and other representative chat models. All models are evaluated in a configuration that limits the output length to 8K. Benchmarks containing fewer than 1000 samples are tested multiple times using varying temperature settings to derive robust final results. DeepSeek-V3 stands as the best-performing open-source model, and also exhibits competitive performance against frontier closed-source models.
5.3.2. Standard Evaluation

Table 6 presents the evaluation results, showcasing that DeepSeek-V3 stands as the best-performing open-source model. Additionally, it is competitive against frontier closed-source models like GPT-4o and Claude-3.5-Sonnet.

English Benchmarks. MMLU is a widely recognized benchmark designed to assess the performance of large language models across diverse knowledge domains and tasks. DeepSeek-V3 demonstrates competitive performance, standing on par with top-tier models such as LLaMA-3.1-405B, GPT-4o, and Claude-Sonnet 3.5, while significantly outperforming Qwen2.5 72B. Moreover, DeepSeek-V3 excels in MMLU-Pro, a more challenging educational knowledge benchmark, where it closely trails Claude-Sonnet 3.5. On MMLU-Redux, a refined version of MMLU with corrected labels, DeepSeek-V3 surpasses its peers. In addition, on GPQA-Diamond, a PhD-level evaluation testbed, DeepSeek-V3 achieves remarkable results, ranking just behind Claude 3.5 Sonnet and outperforming all other competitors by a substantial margin.

In long-context understanding benchmarks such as DROP, LongBench v2, and FRAMES, DeepSeek-V3 continues to demonstrate its position as a top-tier model. It achieves an impressive 91.6 F1 score in the 3-shot setting on DROP, outperforming all other models in this category. On FRAMES, a benchmark requiring question-answering over 100k token contexts, DeepSeek-V3 closely trails GPT-4o while outperforming all other models by a significant margin. This demonstrates the strong capability of DeepSeek-V3 in handling extremely long-context tasks. The long-context capability of DeepSeek-V3 is further validated by its best-in-class performance on LongBench v2, a dataset that was released just a few weeks before the launch of DeepSeek-V3. On the factual knowledge benchmark SimpleQA, DeepSeek-V3 falls behind GPT-4o and Claude-Sonnet, primarily due to its design focus and resource allocation. DeepSeek-V3 assigns more training tokens to learn Chinese knowledge, leading to exceptional performance on C-SimpleQA. On the instruction-following benchmark, DeepSeek-V3 significantly outperforms its predecessor, the DeepSeek-V2 series, highlighting its improved ability to understand and adhere to user-defined format constraints.

Code and Math Benchmarks. Coding is a challenging and practical task for LLMs, encompassing engineering-focused tasks like SWE-Bench-Verified and Aider, as well as algorithmic tasks such as HumanEval and LiveCodeBench. In engineering tasks, DeepSeek-V3 trails behind Claude-Sonnet-3.5-1022 but significantly outperforms open-source models. The open-source DeepSeek-V3 is expected to foster advancements in coding-related engineering tasks. By providing access to its robust capabilities, DeepSeek-V3 can drive innovation and improvement in areas such as software engineering and algorithm development, empowering developers and researchers to push the boundaries of what open-source models can achieve in coding tasks. In algorithmic tasks, DeepSeek-V3 demonstrates superior performance, outperforming all baselines on benchmarks like HumanEval-Mul and LiveCodeBench. This success can be attributed to its advanced knowledge distillation technique, which effectively enhances its code generation and problem-solving capabilities in algorithm-focused tasks.

On math benchmarks, DeepSeek-V3 demonstrates exceptional performance, significantly surpassing baselines and setting a new state-of-the-art for non-o1-like models. Specifically, on AIME, MATH-500, and CNMO 2024, DeepSeek-V3 outperforms the second-best model, Qwen2.5 72B, by approximately 10% in absolute scores, which is a substantial margin for such challenging benchmarks. This remarkable capability highlights the effectiveness of the distillation technique from DeepSeek-R1, which has been proven highly beneficial for non-o1-like models.
Chinese Benchmarks. Qwen and DeepSeek are two representative model series with robust support for both Chinese and English. On the factual benchmark Chinese SimpleQA, DeepSeek-V3 surpasses Qwen2.5-72B by 16.4 points, despite Qwen2.5 being trained on a larger corpus comprising 18T tokens, which is 20% more than the 14.8T tokens that DeepSeek-V3 is pre-trained on. On C-Eval, a representative benchmark for Chinese educational knowledge evaluation, and CLUEWSC (Chinese Winograd Schema Challenge), DeepSeek-V3 and Qwen2.5-72B exhibit similar performance levels, indicating that both models are well-optimized for challenging Chinese-language reasoning and educational tasks.

Model | Arena-Hard | AlpacaEval 2.0
DeepSeek-V2.5-0905 | 76.2 | 50.5
Qwen2.5-72B-Instruct | 81.2 | 49.1
LLaMA-3.1 405B | 69.3 | 40.5
GPT-4o-0513 | 80.4 | 51.1
Claude-Sonnet-3.5-1022 | 85.2 | 52.0
DeepSeek-V3 | 85.5 | 70.0

Table 7 | English open-ended conversation evaluations. For AlpacaEval 2.0, we use the length-controlled win rate as the metric.
5.3.3. Open-Ended Evaluation

In addition to standard benchmarks, we also evaluate our models on open-ended generation tasks using LLMs as judges, with the results shown in Table 7. Specifically, we adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Li et al., 2024a), which leverage GPT-4-Turbo-1106 as judges for pairwise comparisons. On Arena-Hard, DeepSeek-V3 achieves an impressive win rate of over 86% against the baseline GPT-4-0314, performing on par with top-tier models like Claude-Sonnet-3.5-1022. This underscores the robust capabilities of DeepSeek-V3, especially in dealing with complex prompts, including coding and debugging tasks. Furthermore, DeepSeek-V3 achieves a groundbreaking milestone as the first open-source model to surpass 85% on the Arena-Hard benchmark. This achievement significantly bridges the performance gap between open-source and closed-source models, setting a new standard for what open-source models can accomplish in challenging domains.

Similarly, DeepSeek-V3 showcases exceptional performance on AlpacaEval 2.0, outperforming both closed-source and open-source models. This demonstrates its outstanding proficiency in writing tasks and handling straightforward question-answering scenarios. Notably, it surpasses DeepSeek-V2.5-0905 by a significant margin of 20%, highlighting substantial improvements in tackling simple tasks and showcasing the effectiveness of its advancements.
5.3.4. DeepSeek-V3 as a Generative Reward Model

We compare the judgment ability of DeepSeek-V3 with state-of-the-art models, namely GPT-4o and Claude-3.5. Table 8 presents the performance of these models in RewardBench (Lambert et al., 2024). DeepSeek-V3 achieves performance on par with the best versions of GPT-4o-0806 and Claude-3.5-Sonnet-1022, while surpassing other versions. Additionally, the judgment ability of DeepSeek-V3 can also be enhanced by the voting technique. Therefore, we employ DeepSeek-V3 along with voting to offer self-feedback on open-ended questions, thereby improving the effectiveness and robustness of the alignment process.

Model | Chat | Chat-Hard | Safety | Reasoning | Average
GPT-4o-0513 | 96.6 | 70.4 | 86.7 | 84.9 | 84.7
GPT-4o-0806 | 96.1 | 76.1 | 88.1 | 86.6 | 86.7
GPT-4o-1120 | 95.8 | 71.3 | 86.2 | 85.2 | 84.6
Claude-3.5-Sonnet-0620 | 96.4 | 74.0 | 81.6 | 84.7 | 84.2
Claude-3.5-Sonnet-1022 | 96.4 | 79.7 | 91.1 | 87.6 | 88.7
DeepSeek-V3 | 96.9 | 79.8 | 87.0 | 84.3 | 87.0
DeepSeek-V3 (maj@6) | 96.9 | 82.6 | 89.5 | 89.2 | 89.6

Table 8 | Performances of GPT-4o, Claude-3.5-Sonnet and DeepSeek-V3 on RewardBench.

Model | LiveCodeBench-CoT Pass@1 | Length | MATH-500 Pass@1 | Length
DeepSeek-V2.5 Baseline | 31.1 | 718 | 74.6 | 769
DeepSeek-V2.5 + R1 Distill | 37.4 | 783 | 83.2 | 1510

Table 9 | The contribution of distillation from DeepSeek-R1. The evaluation settings of LiveCodeBench and MATH-500 are the same as in Table 6.
5.4. Discussion

5.4.1. Distillation from DeepSeek-R1

We ablate the contribution of distillation from DeepSeek-R1 based on DeepSeek-V2.5. The baseline is trained on short CoT data, whereas its competitor uses data generated by the expert checkpoints described above.

Table 9 demonstrates the effectiveness of the distillation data, showing significant improvements in both LiveCodeBench and MATH-500 benchmarks. Our experiments reveal an interesting trade-off: the distillation leads to better performance but also substantially increases the average response length. To maintain a balance between model accuracy and computational efficiency, we carefully selected optimal settings for DeepSeek-V3 in distillation.

Our research suggests that knowledge distillation from reasoning models presents a promising direction for post-training optimization. While our current work focuses on distilling data from mathematics and coding domains, this approach shows potential for broader applications across various task domains. The effectiveness demonstrated in these specific areas indicates that long-CoT distillation could be valuable for enhancing model performance in other cognitive tasks requiring complex reasoning. Further exploration of this approach across different domains remains an important direction for future research.
5.4.2. Self-Rewarding

Rewards play a pivotal role in RL, steering the optimization process. In domains where verification through external tools is straightforward, such as some coding or mathematics scenarios, RL demonstrates exceptional efficacy. However, in more general scenarios, constructing a feedback mechanism through hard coding is impractical. During the development of DeepSeek-V3, for these broader contexts, we employ the constitutional AI approach (Bai et al., 2022), leveraging the voting evaluation results of DeepSeek-V3 itself as a feedback source. This method has produced notable alignment effects, significantly enhancing the performance of DeepSeek-V3 in subjective evaluations. By integrating additional constitutional inputs, DeepSeek-V3 can optimize towards the constitutional direction. We believe that this paradigm, which combines supplementary information with LLMs as a feedback source, is of paramount importance. The LLM serves as a versatile processor capable of transforming unstructured information from diverse scenarios into rewards, ultimately facilitating the self-improvement of LLMs. Beyond self-rewarding, we are also dedicated to uncovering other general and scalable rewarding methods to consistently advance the model capabilities in general scenarios.
5.4.3. Multi-Token Prediction Evaluation

Instead of predicting just the next single token, DeepSeek-V3 predicts the next 2 tokens through the MTP technique. Combined with the framework of speculative decoding (Leviathan et al., 2023; Xia et al., 2023), it can significantly accelerate the decoding speed of the model. A natural question arises concerning the acceptance rate of the additionally predicted token. Based on our evaluation, the acceptance rate of the second token prediction ranges between 85% and 90% across various generation topics, demonstrating consistent reliability. This high acceptance rate enables DeepSeek-V3 to achieve a significantly improved decoding speed, delivering 1.8 times TPS (Tokens Per Second).
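The reported speedup is consistent with a back-of-the-envelope estimate: if the extra predicted token is accepted with probability p, each decoding step emits 1 + p tokens on average, so the ideal speedup is 1 + p before accounting for the MTP module's own overhead. A hedged sketch of that arithmetic:

```python
# Hedged sketch: idealized speedup of speculative decoding with one extra
# MTP-predicted token, ignoring the verification/MTP-module overhead.
def ideal_speedup(acceptance_rate: float) -> float:
    return 1.0 + acceptance_rate            # expected tokens emitted per decoding step

for p in (0.85, 0.90):
    print(f"acceptance {p:.0%} -> up to {ideal_speedup(p):.2f}x TPS")
# With the real overhead included, the observed gain is about 1.8x.
```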
6. Conclusion, Limitations, and Future Directions

In this paper, we introduce DeepSeek-V3, a large MoE language model with 671B total parameters and 37B activated parameters, trained on 14.8T tokens. In addition to the MLA and DeepSeekMoE architectures, it also pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. The training of DeepSeek-V3 is cost-effective due to the support of FP8 training and meticulous engineering optimizations. The post-training also makes a success in distilling the reasoning capability from the DeepSeek-R1 series of models. Comprehensive evaluations demonstrate that DeepSeek-V3 has emerged as the strongest open-source model currently available, and achieves performance comparable to leading closed-source models like GPT-4o and Claude-3.5-Sonnet. Despite its strong performance, it also maintains economical training costs. It requires only 2.788M H800 GPU hours for its full training, including pre-training, context length extension, and post-training.

While acknowledging its strong performance and cost-effectiveness, we also recognize that DeepSeek-V3 has some limitations, especially on the deployment. Firstly, to ensure efficient inference, the recommended deployment unit for DeepSeek-V3 is relatively large, which might pose a burden for small-sized teams. Secondly, although our deployment strategy for DeepSeek-V3 has achieved an end-to-end generation speed of more than two times that of DeepSeek-V2, there still remains potential for further enhancement. Fortunately, these limitations are expected to be naturally addressed with the development of more advanced hardware.

DeepSeek consistently adheres to the route of open-source models with longtermism, aiming to steadily approach the ultimate goal of AGI (Artificial General Intelligence). In the future, we plan to strategically invest in research across the following directions.

• We will consistently study and refine our model architectures, aiming to further improve both the training and inference efficiency, striving to approach efficient support for infinite context length. Additionally, we will try to break through the architectural limitations of Transformer, thereby pushing the boundaries of its modeling capabilities.

• We will continuously iterate on the quantity and quality of our training data, and explore the incorporation of additional training signal sources, aiming to drive data scaling across a more comprehensive range of dimensions.

• We will consistently explore and iterate on the deep thinking capabilities of our models, aiming to enhance their intelligence and problem-solving abilities by expanding their reasoning length and depth.

• We will explore more comprehensive and multi-dimensional model evaluation methods to prevent the tendency towards optimizing a fixed set of benchmarks during research, which may create a misleading impression of the model capabilities and affect our foundational assessment.

References

AI@Meta. Llama 3 model card, 2024a.

AI@Meta. Llama 3.1 model card, 2024b.

Anthropic. Claude 3.5 Sonnet, 2024.

J. Austin, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.

Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, J. Tang, and J. Li. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv preprint arXiv:2412.15204, 2024.

M. Bauer, S. Treichler, and A. Aiken. Singe: leveraging warp specialization for high performance on GPUs. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '14, pages 119-130, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450326568. doi: 10.1145/2555243.2555258. URL https://doi.org/10.1145/2555243.2555258.

Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7432-7439. AAAI Press, 2020. doi: 10.1609/aaai.v34i05.6239. URL https://doi.org/10.1609/aaai.v34i05.6239.

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto