DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao1,2, Peiyi Wang1,3, Qihao Zhu1,3, Runxin Xu1, Junxiao Song1, Xiao Bi1, Haowei Zhang1, Mingchuan Zhang1, Y.K. Li1, Y. Wu1, Daya Guo1
1DeepSeek-AI, 2Tsinghua University, 3Peking University
zhihongshao, wangpeiyi, zhuqh, https://
arXiv:2402.03300v3 [cs.CL] 27 Apr 2024

Abstract

Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.

Figure 1 | Top1 accuracy of open-source models on the competition-level MATH benchmark (Hendrycks et al., 2021) without the use of external toolkits and voting techniques.

Core contributors. Work done during internship at DeepSeek-AI.
1. Introduction

Large language models (LLM) have revolutionized the approach to mathematical reasoning in artificial intelligence, spurring significant advancements in both the quantitative reasoning benchmark (Hendrycks et al., 2021) and the geometry reasoning benchmark (Trinh et al., 2024). Moreover, these models have proven instrumental in assisting humans in solving complex mathematical problems (Tao, 2023). However, cutting-edge models such as GPT-4 (OpenAI, 2023) and Gemini-Ultra (Anil et al., 2023) are not publicly available, and the currently accessible open-source models considerably trail behind in performance.

In this study, we introduce DeepSeekMath, a domain-specific language model that significantly outperforms the mathematical capabilities of open-source models and approaches the performance level of GPT-4 on academic benchmarks. To achieve this, we create the DeepSeekMath Corpus, a large-scale high-quality pre-training corpus comprising 120B math tokens. This dataset is extracted from the Common Crawl (CC) using a fastText-based classifier (Joulin et al., 2016). In the initial iteration, the classifier is trained using instances from OpenWebMath (Paster et al., 2023) as positive examples, while incorporating a diverse selection of other web pages to serve as negative examples. Subsequently, we employ the classifier to mine additional positive instances from the CC, which are further refined through human annotation. The classifier is then updated with this enhanced dataset to improve its performance. The evaluation results indicate that the large-scale corpus is of high quality, as our base model DeepSeekMath-Base 7B achieves 64.2% on GSM8K (Cobbe et al., 2021) and 36.2% on the competition-level MATH dataset (Hendrycks et al., 2021), outperforming Minerva 540B (Lewkowycz et al., 2022a). In addition, the DeepSeekMath Corpus is multilingual, so we notice an improvement in Chinese mathematical benchmarks (Wei et al., 2023; Zhong et al., 2023). We believe that our experience in mathematical data processing is a starting point for the research community, and there is significant room for improvement in the future.

DeepSeekMath-Base is initialized with DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024), as we notice that starting from a code training model is a better choice compared to a general LLM. Furthermore, we observe the math training also improves model capability on MMLU (Hendrycks et al., 2020) and BBH benchmarks (Suzgun et al., 2022), indicating it does not only enhance the model's mathematical abilities but also amplifies general reasoning capabilities.

After pre-training, we apply mathematical instruction tuning to DeepSeekMath-Base with chain-of-thought (Wei et al., 2022), program-of-thought (Chen et al., 2022; Gao et al., 2023), and tool-integrated reasoning (Gou et al., 2023) data. The resulting model DeepSeekMath-Instruct 7B beats all 7B counterparts and is comparable with 70B open-source instruction-tuned models.

Furthermore, we introduce the Group Relative Policy Optimization (GRPO), a variant reinforcement learning (RL) algorithm of Proximal Policy Optimization (PPO) (Schulman et al., 2017). GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources. By solely using a subset of English instruction tuning data, GRPO obtains a substantial improvement over the strong DeepSeekMath-Instruct, including both in-domain (GSM8K: 82.9% → 88.2%, MATH: 46.8% → 51.7%) and out-of-domain mathematical tasks (e.g., CMATH: 84.6% → 88.8%) during the reinforcement learning phase. We also provide a unified paradigm to understand different methods, such as Rejection Sampling Fine-Tuning (RFT) (Yuan et al., 2023a), Direct Preference Optimization (DPO) (Rafailov et al., 2023), PPO and GRPO. Based on such a unified paradigm, we find that all these methods are conceptualized as either direct or simplified RL techniques. We also conduct extensive experiments, e.g., online vs. offline training, outcome vs. process supervision, single-turn vs. iterative RL and so on, to deeply investigate the essential elements of this paradigm. At last, we explain why our RL boosts the performance of instruction-tuned models, and further summarize potential directions to achieve more effective RL based on this unified paradigm.
1.1. Contributions

Our contribution includes scalable math pre-training, along with the exploration and analysis of reinforcement learning.

Math Pre-Training at Scale
- Our research provides compelling evidence that the publicly accessible Common Crawl data contains valuable information for mathematical purposes. By implementing a meticulously designed data selection pipeline, we successfully construct the DeepSeekMath Corpus, a high-quality dataset of 120B tokens from web pages filtered for mathematical content, which is almost 7 times the size of the math web pages used by Minerva (Lewkowycz et al., 2022a) and 9 times the size of the recently released OpenWebMath (Paster et al., 2023).
- Our pre-trained base model DeepSeekMath-Base 7B achieves comparable performance with Minerva 540B (Lewkowycz et al., 2022a), indicating the number of parameters is not the only key factor in mathematical reasoning capability. A smaller model pre-trained on high-quality data could achieve strong performance as well.
- We share our findings from math training experiments. Code training prior to math training improves models' ability to solve mathematical problems both with and without tool use. This offers a partial answer to the long-standing question: does code training improve reasoning abilities? We believe it does, at least for mathematical reasoning.
- Although training on arXiv papers is common, especially in many math-related papers, it brings no notable improvements on all mathematical benchmarks adopted in this paper.

Exploration and Analysis of Reinforcement Learning
- We introduce Group Relative Policy Optimization (GRPO), an efficient and effective reinforcement learning algorithm. GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources compared to Proximal Policy Optimization (PPO).
- We demonstrate that GRPO significantly enhances the performance of our instruction-tuned model DeepSeekMath-Instruct, by solely using the instruction-tuning data. Furthermore, we observe enhancements in the out-of-domain performance during the reinforcement learning process.
- We provide a unified paradigm to understand different methods, such as RFT, DPO, PPO, and GRPO. We also conduct extensive experiments, e.g., online vs. offline training, outcome vs. process supervision, single-turn vs. iterative reinforcement learning, and so on, to deeply investigate the essential elements of this paradigm.
- Based on our unified paradigm, we explore the reasons behind the effectiveness of reinforcement learning, and summarize several potential directions to achieve more effective reinforcement learning of LLMs.
1.2. Summary of Evaluations and Metrics

- English and Chinese Mathematical Reasoning: We conduct comprehensive assessments of our models on English and Chinese benchmarks, covering mathematical problems from grade-school level to college level. English benchmarks include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), SAT (Azerbayev et al., 2023), OCW Courses (Lewkowycz et al., 2022a), and MMLU-STEM (Hendrycks et al., 2020). Chinese benchmarks include MGSM-zh (Shi et al., 2023), CMATH (Wei et al., 2023), Gaokao-MathCloze (Zhong et al., 2023), and Gaokao-MathQA (Zhong et al., 2023). We evaluate models' ability to generate self-contained text solutions without tool use, and also the ability to solve problems using Python.
  On English benchmarks, DeepSeekMath-Base is competitive with the closed-source Minerva 540B (Lewkowycz et al., 2022a), and surpasses all open-source base models (e.g., Mistral 7B (Jiang et al., 2023) and Llemma-34B (Azerbayev et al., 2023)), regardless of whether they've undergone math pre-training or not, often by a significant margin. Notably, DeepSeekMath-Base is superior on Chinese benchmarks, likely because we don't follow previous works (Azerbayev et al., 2023; Lewkowycz et al., 2022a) to collect English-only math pre-training data, and also include high-quality non-English ones. With mathematical instruction tuning and reinforcement learning, the resulting DeepSeekMath-Instruct and DeepSeekMath-RL demonstrate strong performance, obtaining an accuracy of over 50% on the competition-level MATH dataset for the first time within the open-source community.
- Formal Mathematics: We evaluate DeepSeekMath-Base using the informal-to-formal theorem proving task from (Jiang et al., 2022) on miniF2F (Zheng et al., 2021) with Isabelle (Wenzel et al., 2008) chosen to be the proof assistant. DeepSeekMath-Base demonstrates strong few-shot autoformalization performance.
- Natural Language Understanding, Reasoning, and Code: To build a comprehensive profile of models' general understanding, reasoning, and coding capabilities, we evaluate DeepSeekMath-Base on the Massive Multitask Language Understanding (MMLU) benchmark (Hendrycks et al., 2020), which encompasses 57 multiple-choice tasks covering diverse subjects, BIG-Bench Hard (BBH) (Suzgun et al., 2022), which consists of 23 challenging tasks that mostly require multi-step reasoning to solve, as well as HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), which are widely used to evaluate code language models. Math pre-training benefits both language understanding and reasoning performance.
2. Math Pre-Training

2.1. Data Collection and Decontamination

In this section, we will outline the process of constructing the DeepSeekMath Corpus from Common Crawl. As depicted in Figure 2, we present an iterative pipeline that demonstrates how to systematically gather a large-scale mathematical corpus from Common Crawl, starting with a seed corpus (e.g., a small but high-quality collection of math-related data). It's worth noting that this approach is also applicable to other domains, such as coding.

Figure 2 | An iterative pipeline that collects mathematical web pages from Common Crawl. (Stages shown in the figure: 1. Train a fastText model on the math seed corpus; 2. Recall math-related web pages from deduplicated Common Crawl (40B HTML pages); 3. Discover math-related domains; 4. Annotate math-related URL paths with labelers.)

First, we choose OpenWebMath (Paster et al., 2023), a collection of high-quality mathematical web texts, as our initial seed corpus. Using this corpus, we train a fastText model (Joulin et al., 2016) to recall more OpenWebMath-like mathematical web pages. Specifically, we randomly select 500,000 data points from the seed corpus as positive training examples and another 500,000 web pages from Common Crawl as negative ones. We employ an open-source library¹ for training, configuring the vector dimension to 256, the learning rate to 0.1, the maximum length of word n-grams to 3, the minimum number of word occurrences to 3, and the number of training epochs to 3.

¹ https://fasttext.cc
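The classifier training described above can be reproduced with the open-source fasttext library linked in the footnote; the sketch below mirrors the stated hyperparameters, while the file paths and label names are hypothetical placeholders, not the authors' actual setup.

```python
# Sketch of the fastText classifier used to recall math pages.
# Hyperparameters follow the text; paths and labels are illustrative.
import fasttext

# train.txt is assumed to hold one example per line in fastText format, e.g.
# "__label__math <page text>" for seed-corpus pages and
# "__label__non_math <page text>" for randomly sampled Common Crawl pages.
model = fasttext.train_supervised(
    input="train.txt",
    dim=256,          # vector dimension
    lr=0.1,           # learning rate
    wordNgrams=3,     # maximum length of word n-grams
    minCount=3,       # minimum number of word occurrences
    epoch=3,          # number of training epochs
)

# Score an unseen page; the predicted probability of the math label can be used
# to rank recalled pages and keep only the top-ranking ones.
labels, probs = model.predict("Prove that the sum of two even integers is even.", k=1)
print(labels[0], probs[0])
```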
To reduce the size of the original Common Crawl, we employ URL-based deduplication and near-deduplication techniques, resulting in 40B HTML web pages. We then recall mathematical web pages from deduplicated Common Crawl with the fastText model. To filter out low-quality mathematical content, we rank the collected pages according to their scores predicted by the fastText model, and only preserve the top-ranking ones. The volume of data preserved is assessed through pre-training experiments on the top 40B, 80B, 120B, and 160B tokens. In the first iteration, we choose to keep the top 40B tokens.

After the first iteration of data collection, numerous mathematical web pages remain uncollected, mainly because the fastText model is trained on a set of positive examples that lacks sufficient diversity. We therefore identify additional mathematical web sources to enrich the seed corpus, so that we can optimize the fastText model. Specifically, we first organize the entire Common Crawl into disjoint domains; a domain is defined as web pages sharing the same base URL. For each domain, we calculate the percentage of web pages that are collected in the first iteration. Domains where over 10% of the web pages have been collected are classified as math-related. Subsequently, we manually annotate the URL paths associated with mathematical content within these identified domains. Web pages linked to these URLs, yet uncollected, will be added to the seed corpus. This approach enables us to gather more positive examples, thereby training an improved fastText model capable of recalling more mathematical data in the subsequent iteration. After four iterations of data collection, we end up with 35.5M mathematical web pages, totaling 120B tokens. In the fourth iteration, we notice that nearly 98% of the data has already been collected in the third iteration, so we decide to cease data collection.
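The domain-discovery step above amounts to a simple per-domain statistic. The following sketch shows one way it could be computed; the function and variable names are assumptions for illustration, not the authors' implementation.

```python
# Group URLs by their base URL (domain), compute the fraction already recalled
# in the previous iteration, and flag domains above the 10% threshold.
from collections import defaultdict
from urllib.parse import urlparse

def math_related_domains(all_urls, collected_urls, threshold=0.10):
    collected = set(collected_urls)
    total = defaultdict(int)
    hits = defaultdict(int)
    for url in all_urls:
        domain = urlparse(url).netloc          # pages sharing the same base URL
        total[domain] += 1
        if url in collected:
            hits[domain] += 1
    return {d for d in total if hits[d] / total[d] > threshold}

# Uncollected pages under manually annotated math URL paths of these domains
# would then be added to the seed corpus for the next fastText iteration.
```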
To avoid benchmark contamination, we follow Guo et al. (2024) to filter out web pages containing questions or answers from English mathematical benchmarks such as GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) and Chinese benchmarks such as CMATH (Wei et al., 2023) and AGIEval (Zhong et al., 2023). The filtering criteria are as follows: any text segment containing a 10-gram string that matches exactly with any sub-string from the evaluation benchmarks is removed from our math training corpus. For benchmark texts that are shorter than 10 grams but have at least 3 grams, we employ exact matching to filter out contaminated web pages.
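A minimal sketch of this n-gram decontamination rule follows. Whitespace tokenization is used purely for illustration; the actual tokenization and matching pipeline is not specified here.

```python
# Drop any training text containing a 10-gram that exactly matches a benchmark
# sub-string; use exact matching for benchmark texts with 3 to 9 tokens.
def ngrams(tokens, n):
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_filters(benchmark_texts, n=10, min_len=3):
    banned_ngrams, banned_short = set(), set()
    for text in benchmark_texts:
        tokens = text.split()
        if len(tokens) >= n:
            banned_ngrams |= ngrams(tokens, n)
        elif len(tokens) >= min_len:
            banned_short.add(" ".join(tokens))
    return banned_ngrams, banned_short

def is_contaminated(page_text, banned_ngrams, banned_short, n=10):
    tokens = page_text.split()
    if ngrams(tokens, n) & banned_ngrams:
        return True
    return any(short in page_text for short in banned_short)
```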
2.2. Validating the Quality of the DeepSeekMath Corpus

We run pre-training experiments to investigate how the DeepSeekMath Corpus compares with the recently released math-training corpora:
- MathPile (Wang et al., 2023c): a multi-source corpus (8.9B tokens) aggregated from textbooks, Wikipedia, ProofWiki, CommonCrawl, StackExchange, and arXiv, with the majority (over 85%) sourced from arXiv;
- OpenWebMath (Paster et al., 2023): CommonCrawl data filtered for mathematical content, totaling 13.6B tokens;
- Proof-Pile-2 (Azerbayev et al., 2023): a mathematical corpus consisting of OpenWebMath, AlgebraicStack (10.3B tokens of mathematical code), and arXiv papers (28.0B tokens). When experimenting on Proof-Pile-2, we follow Azerbayev et al. (2023) to use an arXiv:Web:Code ratio of 2:4:1.
2.2.1. Training Setting

We apply math training to a general pre-trained language model with 1.3B parameters, which shares the same framework as the DeepSeek LLMs (DeepSeek-AI, 2024), denoted as DeepSeek-LLM 1.3B. We separately train a model on each mathematical corpus for 150B tokens. All experiments are conducted using the efficient and light-weight HAI-LLM (High-flyer, 2023) training framework. Following the training practice of DeepSeek LLMs, we use the AdamW optimizer (Loshchilov and Hutter, 2017) with β1 = 0.9, β2 = 0.95, and weight_decay = 0.1, along with a multi-step learning rate schedule where the learning rate reaches the peak after 2,000 warmup steps, decreases to 31.6% of the peak after 80% of the training process, and further decreases to 10.0% of the peak after 90% of the training process. We set the maximum value of the learning rate to 5.3e-4, and use a batch size of 4M tokens with a 4K context length.
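The multi-step schedule above is fully described by the warmup length, the peak value, and the two step-down points; the short sketch below restates it as a function of the training step. The function signature is illustrative, not the HAI-LLM implementation.

```python
# Multi-step learning-rate schedule: linear warmup to the peak over 2,000 steps,
# then 31.6% of the peak after 80% of training and 10% of the peak after 90%.
def multi_step_lr(step, total_steps, peak_lr=5.3e-4, warmup_steps=2000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < 0.8 * total_steps:
        return peak_lr
    if step < 0.9 * total_steps:
        return peak_lr * 0.316
    return peak_lr * 0.10
```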
Math Corpus | Size | GSM8K | MATH | OCW | SAT | MMLU STEM | CMATH | Gaokao MathCloze | Gaokao MathQA
No Math Training | N/A | 2.9% | 3.0% | 2.9% | 15.6% | 19.5% | 12.3% | 0.8% | 17.9%
MathPile | 8.9B | 2.7% | 3.3% | 2.2% | 12.5% | 15.7% | 1.2% | 0.0% | 2.8%
OpenWebMath | 13.6B | 11.5% | 8.9% | 3.7% | 31.3% | 29.6% | 16.8% | 0.0% | 14.2%
Proof-Pile-2 | 51.9B | 14.3% | 11.2% | 3.7% | 43.8% | 29.2% | 19.9% | 5.1% | 11.7%
DeepSeekMath Corpus | 120.2B | 23.8% | 13.6% | 4.8% | 56.3% | 33.1% | 41.5% | 5.9% | 23.6%

Table 1 | Performance of DeepSeek-LLM 1.3B trained on different mathematical corpora, evaluated using few-shot chain-of-thought prompting. Corpus sizes are calculated using our tokenizer with a vocabulary size of 100K.
2.2.2. Evaluation Results

The DeepSeekMath Corpus is of high quality, covers multilingual mathematical content, and is the largest in size.

High-quality: We evaluate downstream performance on 8 mathematical benchmarks using few-shot chain-of-thought prompting (Wei et al., 2022). As shown in Table 1, there is a clear performance lead of the model trained on the DeepSeekMath Corpus. Figure 3 shows that the model trained on the DeepSeekMath Corpus demonstrates better performance than Proof-Pile-2 at 50B tokens (1 full epoch of Proof-Pile-2), indicating the average quality of the DeepSeekMath Corpus is higher.

Figure 3 | Benchmark curves of DeepSeek-LLM 1.3B trained on different mathematical corpora.

Multilingual: The DeepSeekMath Corpus encompasses data in multiple languages, predominantly featuring English and Chinese as the two most represented languages. As shown in Table 1, training on the DeepSeekMath Corpus enhances mathematical reasoning performance in both English and Chinese. In contrast, existing mathematical corpora, which are primarily English-centric, show limited improvement and may even hinder performance in Chinese mathematical reasoning.

Large-scale: The DeepSeekMath Corpus is several times larger than existing mathematical corpora. As depicted in Figure 3, DeepSeek-LLM 1.3B, when trained on the DeepSeekMath Corpus, shows a steeper learning curve along with more lasting improvements. In contrast, the baseline corpora are much smaller, and have already been repeated multiple rounds during training, with the resulting model performance quickly reaching a plateau.
2.3. Training and Evaluating DeepSeekMath-Base 7B

In this section, we introduce DeepSeekMath-Base 7B, a base model with strong reasoning abilities, especially in mathematics. Our model is initialized with DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024) and trained for 500B tokens. The distribution of the data is as follows: 56% is from the DeepSeekMath Corpus, 4% from AlgebraicStack, 10% from arXiv, 20% is Github code, and the remaining 10% is natural language data from Common Crawl in both English and Chinese. We mainly adopt the training setting specified in Section 2.2.1, except that we set the maximum value of the learning rate to 4.2e-4 and use a batch size of 10M tokens.

We conduct a comprehensive assessment of the mathematical capabilities of DeepSeekMath-Base 7B, focusing on its ability to produce self-contained mathematical solutions without relying on external tools, solve mathematical problems using tools, and conduct formal theorem proving. Beyond mathematics, we also provide a more general profile of the base model, including its performance of natural language understanding, reasoning, and programming skills.
Mathematical Problem Solving with Step-by-Step Reasoning  We evaluate DeepSeekMath-Base's performance of solving mathematical problems using few-shot chain-of-thought prompting (Wei et al., 2022), across eight benchmarks in English and Chinese. These benchmarks encompass quantitative reasoning (e.g., GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and CMATH (Wei et al., 2023)) and multiple-choice problems (e.g., MMLU-STEM (Hendrycks et al., 2020) and Gaokao-MathQA (Zhong et al., 2023)), covering diverse fields of mathematics from elementary to college-level complexity.

As shown in Table 2, DeepSeekMath-Base 7B leads in performance across all eight benchmarks among the open-source base models (including the widely-used general model Mistral 7B (Jiang et al., 2023) and the recently released Llemma 34B (Azerbayev et al., 2023), which underwent math training on Proof-Pile-2 (Azerbayev et al., 2023)). Notably, on the competition-level MATH dataset, DeepSeekMath-Base surpasses existing open-source base models by over 10% absolute, and outperforms Minerva 540B (Lewkowycz et al., 2022a), a closed-source base model 77 times larger which builds on PaLM (Lewkowycz et al., 2022b) and is further trained on mathematical texts.

Model | Size | GSM8K | MATH | OCW | SAT | MMLU STEM | CMATH | Gaokao MathCloze | Gaokao MathQA
Closed-Source Base Model
Minerva | 7B | 16.2% | 14.1% | 7.7% | - | 35.6% | - | - | -
Minerva | 62B | 52.4% | 27.6% | 12.0% | - | 53.9% | - | - | -
Minerva | 540B | 58.8% | 33.6% | 17.6% | - | 63.9% | - | - | -
Open-Source Base Model
Mistral | 7B | 40.3% | 14.3% | 9.2% | 71.9% | 51.1% | 44.9% | 5.1% | 23.4%
Llemma | 7B | 37.4% | 18.1% | 6.3% | 59.4% | 43.1% | 43.4% | 11.9% | 23.6%
Llemma | 34B | 54.0% | 25.3% | 10.3% | 71.9% | 52.9% | 56.1% | 11.9% | 26.2%
DeepSeekMath-Base | 7B | 64.2% | 36.2% | 15.4% | 84.4% | 56.5% | 71.7% | 20.3% | 35.3%

Table 2 | Comparisons between DeepSeekMath-Base 7B and strong base models on English and Chinese mathematical benchmarks. Models are evaluated with chain-of-thought prompting. Minerva results are quoted from Lewkowycz et al. (2022a).
Mathematical Problem Solving with Tool Use  We evaluate program-aided mathematical reasoning on GSM8K and MATH using few-shot program-of-thought prompting (Chen et al., 2022; Gao et al., 2023). Models are prompted to solve each problem by writing a Python program where libraries such as math and sympy can be utilized for intricate computations. The execution result of the program is evaluated as the answer. As shown in Table 3, DeepSeekMath-Base 7B outperforms the prior state-of-the-art Llemma 34B.

Model | Size | GSM8K+Python | MATH+Python | miniF2F-valid | miniF2F-test
Mistral | 7B | 48.5% | 18.2% | 18.9% | 18.0%
CodeLlama | 7B | 27.1% | 17.2% | 16.3% | 17.6%
CodeLlama | 34B | 52.7% | 23.5% | 18.5% | 18.0%
Llemma | 7B | 41.0% | 18.6% | 20.6% | 22.1%
Llemma | 34B | 64.6% | 26.3% | 21.0% | 21.3%
DeepSeekMath-Base | 7B | 66.9% | 31.4% | 25.8% | 24.6%

Table 3 | Few-shot evaluation of base models' ability to solve mathematical problems using tools (left two columns) and the ability to conduct informal-to-formal theorem proving in Isabelle (right two columns).
Formal Mathematics  Formal proof automation is beneficial to ensure the accuracy and reliability of mathematical proofs and enhance efficiency, and has received increasing attention in recent years. We evaluate DeepSeekMath-Base 7B on the task of informal-to-formal proving from (Jiang et al., 2022), which is to generate a formal proof based on an informal statement, a formal counterpart of the statement, and an informal proof. We evaluate on miniF2F (Zheng et al., 2021), a benchmark for formal Olympiad-level mathematics, and generate a formal proof in Isabelle for each problem with few-shot prompting. Following Jiang et al. (2022), we leverage models to generate proof sketches, and execute the off-the-shelf automated prover Sledgehammer (Paulson, 2010) to fill in the missing details. As shown in Table 3, DeepSeekMath-Base 7B demonstrates strong performance in proof autoformalization.

Model | Size | MMLU | BBH | HumanEval (Pass@1) | MBPP (Pass@1)
Mistral | 7B | 62.4% | 55.7% | 28.0% | 41.4%
DeepSeek-Coder-Base-v1.5† | 7B | 42.9% | 42.9% | 40.2% | 52.6%
DeepSeek-Coder-Base-v1.5 | 7B | 49.1% | 55.2% | 43.2% | 60.4%
DeepSeekMath-Base | 7B | 54.9% | 59.5% | 40.9% | 52.6%

Table 4 | Evaluation on natural language understanding, reasoning, and code benchmarks. DeepSeek-Coder-Base-v1.5† is the checkpoint right before learning rate decay, which is used to train DeepSeekMath-Base. On MMLU and BBH, we use few-shot chain-of-thought prompting. On HumanEval and MBPP, we evaluate model performance under the zero-shot setting and a few-shot setting, respectively.
Natural Language Understanding, Reasoning, and Code  We evaluate model performance of natural language understanding on MMLU (Hendrycks et al., 2020), reasoning on BBH (Suzgun et al., 2022), and coding capabilities on HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021). As shown in Table 4, DeepSeekMath-Base 7B exhibits significant enhancements in performance on MMLU and BBH over its precursor, DeepSeek-Coder-Base-v1.5 (Guo et al., 2024), illustrating the positive impact of math training on language understanding and reasoning. Additionally, by including code tokens for continual training, DeepSeekMath-Base 7B effectively maintains the performance of DeepSeek-Coder-Base-v1.5 on the two coding benchmarks. Overall, DeepSeekMath-Base 7B significantly outperforms the general model Mistral 7B (Jiang et al., 2023) on the three reasoning and coding benchmarks.
3. Supervised Fine-Tuning

3.1. SFT Data Curation

We construct a mathematical instruction-tuning dataset covering English and Chinese problems from different mathematical fields and of varying complexity levels: problems are paired with solutions in chain-of-thought (CoT) (Wei et al., 2022), program-of-thought (PoT) (Chen et al., 2022; Gao et al., 2023), and tool-integrated reasoning format (Gou et al., 2023). The total number of training examples is 776K.
- English mathematical datasets: We annotate GSM8K and MATH problems with tool-integrated solutions, and adopt a subset of MathInstruct (Yue et al., 2023) along with the training set of Lila-OOD (Mishra et al., 2022), where problems are solved with CoT or PoT. Our English collection covers diverse fields of mathematics, e.g., algebra, probability, number theory, calculus, and geometry.
- Chinese mathematical datasets: We collect Chinese K-12 mathematical problems spanning 76 sub-topics such as linear equations, with solutions annotated in both CoT and tool-integrated reasoning format.
3.2. Training and Evaluating DeepSeekMath-Instruct 7B

In this section, we introduce DeepSeekMath-Instruct 7B, which undergoes mathematical instruction tuning based on DeepSeekMath-Base. Training examples are randomly concatenated until reaching a maximum context length of 4K tokens. We train the model for 500 steps with a batch size of 256 and a constant learning rate of 5e-5.

We evaluate models' mathematical performance both without and with tool use, on 4 quantitative reasoning benchmarks in English and Chinese. We benchmark our model against the leading models of the time:
- Closed-source models include: (1) the GPT family, among which GPT-4 (OpenAI, 2023) and GPT-4 Code Interpreter are the most capable ones, (2) Gemini Ultra and Pro (Anil et al., 2023), (3) Inflection-2 (Inflection AI, 2023), (4) Grok-1, as well as models recently released by Chinese companies including (5) Baichuan-3 and (6) the latest GLM-4 from the GLM family (Du et al., 2022). These models are for general purposes, most of which have undergone a series of alignment procedures.
- Open-source models include: general models like (1) DeepSeek-LLM-Chat 67B (DeepSeek-AI, 2024), (2) Qwen 72B (Bai et al., 2023), (3) SeaLLM-v2 7B (Nguyen et al., 2023), and (4) ChatGLM3 6B (ChatGLM3 Team, 2023), as well as models with enhancements in mathematics including (5) InternLM2-Math 20B, which builds on InternLM2 and underwent math training followed by instruction tuning, (6) Math-Shepherd-Mistral 7B, which applies PPO training (Schulman et al., 2017) to Mistral 7B (Jiang et al., 2023) with a process-supervised reward model, (7) the WizardMath series (Luo et al., 2023), which improves mathematical reasoning in Mistral 7B and Llama-2 70B (Touvron et al., 2023) using evolve-instruct (i.e., a version of instruction tuning that uses AI-evolved instructions) and PPO training with training problems primarily sourced from GSM8K and MATH, (8) MetaMath 70B (Yu et al., 2023), which is Llama-2 70B fine-tuned on an augmented version of GSM8K and MATH, (9) ToRA 34B (Gou et al., 2023), which is CodeLlama 34B fine-tuned to do tool-integrated mathematical reasoning, and (10) MAmmoTH 70B (Yue et al., 2023), which is Llama-2 70B instruction-tuned on MathInstruct.
As shown in Table 5, under the evaluation setting where tool use is disallowed, DeepSeekMath-Instruct 7B demonstrates strong performance of step-by-step reasoning. Notably, on the competition-level MATH dataset, our model surpasses all open-source models and the majority of proprietary models (e.g., Inflection-2 and Gemini Pro) by at least 9% absolute. This is true even for models that are substantially larger (e.g., Qwen 72B) or have been specifically enhanced through math-focused reinforcement learning (e.g., WizardMath-v1.1 7B). While DeepSeekMath-Instruct rivals the Chinese proprietary models GLM-4 and Baichuan-3 on MATH, it still underperforms GPT-4 and Gemini Ultra.

Under the evaluation setting where models are allowed to integrate natural language reasoning and program-based tool use for problem solving, DeepSeekMath-Instruct 7B approaches an accuracy of 60% on MATH, surpassing all existing open-source models. On the other benchmarks, our model is competitive with DeepSeek-LLM-Chat 67B, the prior state-of-the-art that is 10 times larger.

Model | Size | GSM8K | MATH | MGSM-zh | CMATH
Chain-of-Thought Reasoning
Closed-Source Model
Gemini Ultra | - | 94.4% | 53.2% | - | -
GPT-4 | - | 92.0% | 52.9% | - | 86.0%
Inflection-2 | - | 81.4% | 34.8% | - | -
GPT-3.5 | - | 80.8% | 34.1% | - | 73.8%
Gemini Pro | - | 86.5% | 32.6% | - | -
Grok-1 | - | 62.9% | 23.9% | - | -
Baichuan-3 | - | 88.2% | 49.2% | - | -
GLM-4 | - | 87.6% | 47.9% | - | -
Open-Source Model
InternLM2-Math | 20B | 82.6% | 37.7% | - | -
Qwen | 72B | 78.9% | 35.2% | - | -
Math-Shepherd-Mistral | 7B | 84.1% | 33.0% | - | -
WizardMath-v1.1 | 7B | 83.2% | 33.0% | - | -
DeepSeek-LLM-Chat | 67B | 84.1% | 32.6% | 74.0% | 80.3%
MetaMath | 70B | 82.3% | 26.6% | 66.4% | 70.9%
SeaLLM-v2 | 7B | 78.2% | 27.5% | 64.8% | -
ChatGLM3 | 6B | 72.3% | 25.7% | - | -
WizardMath-v1.0 | 70B | 81.6% | 22.7% | 64.8% | 65.4%
DeepSeekMath-Instruct | 7B | 82.9% | 46.8% | 73.2% | 84.6%
DeepSeekMath-RL | 7B | 88.2% | 51.7% | 79.6% | 88.8%
Tool-Integrated Reasoning
Closed-Source Model
GPT-4 Code Interpreter | - | 97.0% | 69.7% | - | -
Open-Source Model
InternLM2-Math | 20B | 80.7% | 54.3% | - | -
DeepSeek-LLM-Chat | 67B | 86.7% | 51.1% | 76.4% | 85.4%
ToRA | 34B | 80.7% | 50.8% | 41.2% | 53.4%
MAmmoTH | 70B | 76.9% | 41.8% | - | -
DeepSeekMath-Instruct | 7B | 83.7% | 57.4% | 72.0% | 84.3%
DeepSeekMath-RL | 7B | 86.7% | 58.8% | 78.4% | 87.6%

Table 5 | Performance of open- and closed-source models with both chain-of-thought and tool-integrated reasoning on English and Chinese benchmarks. Scores in gray denote majority votes with 32 candidates; the others are Top1 scores. DeepSeekMath-RL 7B beats all open-source models from 7B to 70B, as well as the majority of closed-source models. Although DeepSeekMath-RL 7B is only further trained on chain-of-thought-format instruction tuning data of GSM8K and MATH, it improves over DeepSeekMath-Instruct 7B on all benchmarks.
4. Reinforcement Learning

4.1. Group Relative Policy Optimization

Reinforcement learning (RL) has been proven to be effective in further improving the mathematical reasoning ability of LLMs after the Supervised Fine-Tuning (SFT) stage (Luo et al., 2023; Wang et al., 2023b). In this section, we introduce our efficient and effective RL algorithm, Group Relative Policy Optimization (GRPO).

4.1.1. From PPO to GRPO

Proximal Policy Optimization (PPO) (Schulman et al., 2017) is an actor-critic RL algorithm that is widely used in the RL fine-tuning stage of LLMs (Ouyang et al., 2022). In particular, it optimizes LLMs by maximizing the following surrogate objective:

$$\mathcal{J}_{PPO}(\theta)=\mathbb{E}_{q\sim P(Q),\,o\sim\pi_{\theta_{old}}(O|q)}\,\frac{1}{|o|}\sum_{t=1}^{|o|}\min\!\Big[\frac{\pi_\theta(o_t|q,o_{<t})}{\pi_{\theta_{old}}(o_t|q,o_{<t})}A_t,\ \mathrm{clip}\Big(\frac{\pi_\theta(o_t|q,o_{<t})}{\pi_{\theta_{old}}(o_t|q,o_{<t})},1-\varepsilon,1+\varepsilon\Big)A_t\Big], \qquad (1)$$

where $\pi_\theta$ and $\pi_{\theta_{old}}$ are the current and old policy models, and $q$, $o$ are questions and outputs sampled from the question dataset and the old policy $\pi_{\theta_{old}}$, respectively. $\varepsilon$ is a clipping-related hyper-parameter introduced in PPO for stabilizing training. $A_t$ is the advantage, which is computed by applying Generalized Advantage Estimation (GAE) (Schulman et al., 2015), based on the rewards $\{r_{\ge t}\}$ and a learned value function $V_\psi$. Thus, in PPO, a value function needs to be trained alongside the policy model, and to mitigate over-optimization of the reward model, the standard approach is to add a per-token KL penalty from a reference model in the reward at each token (Ouyang et al., 2022), i.e.,

$$r_t = r_\varphi(q, o_{\le t}) - \beta\,\log\frac{\pi_\theta(o_t|q,o_{<t})}{\pi_{ref}(o_t|q,o_{<t})}, \qquad (2)$$

where $r_\varphi$ is the reward model, $\pi_{ref}$ is the reference model, which is usually the initial SFT model, and $\beta$ is the coefficient of the KL penalty.
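A minimal sketch of the per-token reward in Equation (2) follows, assuming the reward model assigns a single scalar to the whole response and per-token log-probabilities are available for one sampled output; the names and default value are illustrative, not the authors' implementation.

```python
# Per-token KL-penalized reward, following Equation (2):
# r_t = r_phi(q, o_<=t) - beta * log(pi_theta(o_t|...) / pi_ref(o_t|...)).
import torch

def per_token_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.04):
    # KL penalty term at every token of the sampled output.
    rewards = -beta * (policy_logprobs - ref_logprobs)
    # In the common setup, only the final token receives the reward-model score.
    rewards[-1] += rm_score
    return rewards
```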
As the value function employed in PPO is typically another model of comparable size as the policy model, it brings a substantial memory and computational burden. Additionally, during RL training, the value function is treated as a baseline in the calculation of the advantage for variance reduction. While in the LLM context, usually only the last token is assigned a reward score by the reward model, which may complicate the training of a value function that is accurate at each token. To address this, as shown in Figure 4, we propose Group Relative Policy Optimization (GRPO), which obviates the need for additional value function approximation as in PPO, and instead uses the average reward of multiple sampled outputs, produced in response to the same question, as the baseline.

Figure 4 | Demonstration of PPO and our GRPO. GRPO foregoes the value model, instead estimating the baseline from group scores, significantly reducing training resources. (The diagram contrasts PPO, which trains a policy model and a value model and computes advantages with GAE using a reward model and a frozen reference model, with GRPO, which replaces the value model by a group computation over sampled outputs.)

More specifically, for each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \ldots, o_G\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model by maximizing the following objective:

$$\mathcal{J}_{GRPO}(\theta)=\mathbb{E}_{q\sim P(Q),\,\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{old}}(O|q)}\,\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big\{\min\!\Big[\frac{\pi_\theta(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q,o_{i,<t})}\hat{A}_{i,t},\ \mathrm{clip}\Big(\frac{\pi_\theta(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q,o_{i,<t})},1-\varepsilon,1+\varepsilon\Big)\hat{A}_{i,t}\Big]-\beta\,\mathbb{D}_{KL}\big[\pi_\theta\,\|\,\pi_{ref}\big]\Big\}, \qquad (3)$$

where $\varepsilon$ and $\beta$ are hyper-parameters, and $\hat{A}_{i,t}$ is the advantage calculated based on relative rewards of the outputs inside each group only, which will be detailed in the following subsections. The group relative way that GRPO leverages to calculate the advantages aligns well with the comparative nature of reward models, as reward models are typically trained on datasets of comparisons between outputs on the same question. Also note that, instead of adding KL penalty in the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, avoiding complicating the calculation of $\hat{A}_{i,t}$.
And different from the KL penalty term used in (2), we estimate the KL divergence with the following unbiased estimator (Schulman, 2020):

$$\mathbb{D}_{KL}\big[\pi_\theta\,\|\,\pi_{ref}\big]=\frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_\theta(o_{i,t}|q,o_{i,<t})}-\log\frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_\theta(o_{i,t}|q,o_{i,<t})}-1, \qquad (4)$$

which is guaranteed to be positive.
117、tingthe group average and dividing by the group standard deviation.Outcome supervision providesthe normalized reward at the end of each outputand sets the advantages,of all tokens inthe output as the normalized reward,i.e.,=e=mean(r)std(r),and then optimizes the policy bymaximizing the objective def
118、ined in equation(3).4.1.3.Process Supervision RL with GRPOOutcome supervision only provides a reward at the end of each output,which may not besufficient and efficient to supervise the policy in complex mathematical tasks.Following Wanget al.(2023b),we also explore process supervision,which provides
4.1.3. Process Supervision RL with GRPO

Outcome supervision only provides a reward at the end of each output, which may not be sufficient and efficient to supervise the policy in complex mathematical tasks. Following Wang et al. (2023b), we also explore process supervision, which provides a reward at the end of each reasoning step. Formally, given the question $q$ and $G$ sampled outputs $\{o_1, o_2, \ldots, o_G\}$, a process reward model is used to score each step of the outputs, yielding corresponding rewards: $\mathbf{R}=\big\{\{r_1^{index(1)},\ldots,r_1^{index(K_1)}\},\ldots,\{r_G^{index(1)},\ldots,r_G^{index(K_G)}\}\big\}$, where $index(j)$ is the end token index of the $j$-th step, and $K_i$ is the total number of steps in the $i$-th output. We also normalize these rewards with the average and the standard deviation, i.e., $\tilde{r}_i^{index(j)}=\frac{r_i^{index(j)}-\mathrm{mean}(\mathbf{R})}{\mathrm{std}(\mathbf{R})}$. Subsequently, the process supervision calculates the advantage of each token as the sum of the normalized rewards from the following steps, i.e., $\hat{A}_{i,t}=\sum_{index(j)\ge t}\tilde{r}_i^{index(j)}$, and then optimizes the policy by maximizing the objective defined in equation (3).
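A small sketch of the process-supervision advantage follows: step rewards are normalized over all steps of all outputs in the group, and each token's advantage sums the normalized rewards of the steps that end at or after that token. Data layout and names are illustrative assumptions.

```python
# Process-supervision advantages for a group of outputs.
import torch

def process_advantages(step_rewards, step_end_indices, seq_len):
    # step_rewards: list of 1-D tensors, one per output (reward per step).
    # step_end_indices: matching lists of end-token indices for each step.
    flat = torch.cat(step_rewards)
    mean, std = flat.mean(), flat.std() + 1e-8
    advantages = []
    for rewards, ends in zip(step_rewards, step_end_indices):
        norm = (rewards - mean) / std
        adv = torch.zeros(seq_len)
        for t in range(seq_len):
            # Sum normalized rewards of steps whose end index is >= t.
            adv[t] = norm[[j for j, e in enumerate(ends) if e >= t]].sum()
        advantages.append(adv)
    return advantages
```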
4.1.4. Iterative RL with GRPO

As the reinforcement learning training process progresses, the old reward model may not be sufficient to supervise the current policy model. Therefore, we also explore the iterative RL with GRPO. As shown in Algorithm 1, in iterative GRPO, we generate new training sets for the reward model based on the sampling results from the policy model and continually train the old reward model using a replay mechanism that incorporates 10% of historical data. Then, we set the reference model as the policy model, and continually train the policy model with the new reward model.

Algorithm 1: Iterative Group Relative Policy Optimization
Input: initial policy model $\pi_{\theta_{init}}$; reward models $r_\varphi$; task prompts $\mathcal{D}$; hyperparameters $\varepsilon$, $\beta$, $\mu$
1: policy model $\pi_\theta \leftarrow \pi_{\theta_{init}}$
2: for iteration = 1, ..., I do
3:   reference model $\pi_{ref} \leftarrow \pi_\theta$
4:   for step = 1, ..., M do
5:     Sample a batch $\mathcal{D}_b$ from $\mathcal{D}$
6:     Update the old policy model $\pi_{\theta_{old}} \leftarrow \pi_\theta$
7:     Sample $G$ outputs $\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(\cdot|q)$ for each question $q \in \mathcal{D}_b$
8:     Compute rewards $\{r_i\}_{i=1}^{G}$ for each sampled output $o_i$ by running $r_\varphi$
9:     Compute $\hat{A}_{i,t}$ for the $t$-th token of $o_i$ through group relative advantage estimation
10:    for GRPO iteration = 1, ..., $\mu$ do
11:      Update the policy model $\pi_\theta$ by maximizing the GRPO objective (Equation 21)
12:  Update $r_\varphi$ through continuous training using a replay mechanism
Output: $\pi_\theta$
4.2. Training and Evaluating DeepSeekMath-RL

We conduct RL based on DeepSeekMath-Instruct 7B. The training data of RL are chain-of-thought-format questions related to GSM8K and MATH from the SFT data, which consists of around 144K questions. We exclude other SFT questions to investigate the impact of RL on benchmarks that lack data throughout the RL phase. We construct the training set of reward models following (Wang et al., 2023b). We train our initial reward model based on the DeepSeekMath-Base 7B with a learning rate of 2e-5. For GRPO, we set the learning rate of the policy model as 1e-6. The KL coefficient is 0.04. For each question, we sample 64 outputs. The max length is set to 1024, and the training batch size is 1024. The policy model only has a single update following each exploration stage.
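For readability, the hyper-parameters in this subsection can be gathered into a plain configuration object, as in the sketch below; this is only a restatement of the stated settings with field names of our own choosing, not a training script.

```python
# RL training configuration restating the settings above.
from dataclasses import dataclass

@dataclass
class GRPOTrainingConfig:
    reward_model_lr: float = 2e-5     # reward model trained from DeepSeekMath-Base 7B
    policy_lr: float = 1e-6           # policy model learning rate
    kl_coeff: float = 0.04            # beta in the GRPO objective
    group_size: int = 64              # sampled outputs per question
    max_length: int = 1024            # maximum generation length
    batch_size: int = 1024            # training batch size
    updates_per_exploration: int = 1  # single policy update after each exploration stage
```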
We evaluate DeepSeekMath-RL 7B on benchmarks following DeepSeekMath-Instruct 7B. For DeepSeekMath-RL 7B, GSM8K and MATH with chain-of-thought reasoning can be regarded as in-domain tasks and all the other benchmarks can be regarded as out-of-domain tasks.

Table 5 demonstrates the performance of open- and closed-source models with both chain-of-thought and tool-integrated reasoning on English and Chinese benchmarks. We find that: 1) DeepSeekMath-RL 7B attains accuracies of 88.2% and 51.7% on GSM8K and MATH, respectively, utilizing chain-of-thought reasoning. This performance surpasses that of all open-source models in the 7B to 70B range, as well as the majority of closed-source models. 2) Crucially, DeepSeekMath-RL 7B is only trained on chain-of-thought-format instruction tuning data of GSM8K and MATH, starting from DeepSeekMath-Instruct 7B. Despite the constrained scope of its training data, it outperforms DeepSeekMath-Instruct 7B across all evaluation metrics, showcasing the effectiveness of reinforcement learning.
5. Discussion

In this section, we will share our findings in pre-training and RL experiments.

5.1. Lessons Learnt in Pre-Training

We first share our experience in pre-training. Unless otherwise specified, we will adhere to the training settings outlined in Section 2.2.1. It is worth noting that, when referring to the DeepSeekMath Corpus in this section, we use an 89B-token dataset from the second iteration of the data collection process.

5.1.1. Code Training Benefits Mathematical Reasoning

A popular yet unverified hypothesis suggests that code training improves reasoning. We attempt to offer a partial response to this, particularly within the mathematical domain: code training improves models' ability to do mathematical reasoning both with and without tool use.

Training Setting | General Tokens | Code Tokens | Math Tokens | GSM8K | MATH | CMATH | GSM8K+Python | MATH+Python
No Continual Training | - | - | - | 2.9% | 3.0% | 12.3% | 2.7% | 2.3%
Two-Stage Training
Stage 1: General Training | 400B | - | - | 2.9% | 3.2% | 14.8% | 3.3% | 2.3%
Stage 2: Math Training | - | - | 150B | 19.1% | 14.4% | 37.2% | 14.3% | 6.7%
Stage 1: Code Training | - | 400B | - | 5.9% | 3.6% | 19.9% | 12.4% | 10.0%
Stage 2: Math Training | - | - | 150B | 21.9% | 15.3% | 39.7% | 17.4% | 9.4%
One-Stage Training
Math Training | - | - | 150B | 20.5% | 13.1% | 37.6% | 11.4% | 6.5%
Code & Math Mixed Training | - | 400B | 150B | 17.6% | 12.1% | 36.3% | 19.7% | 13.5%

Table 6 | Investigation of how code affects mathematical reasoning under different training settings. We experiment with DeepSeek-LLM 1.3B, and evaluate its mathematical reasoning performance without and with tool use via few-shot chain-of-thought prompting and few-shot program-of-thought prompting, respectively (GSM8K, MATH, and CMATH are without tool use; GSM8K+Python and MATH+Python are with tool use).
To study how code training affects mathematical reasoning, we experimented with the following two-stage training and one-stage training settings:

Two-Stage Training
- Code Training for 400B Tokens → Math Training for 150B Tokens: We train DeepSeek-LLM 1.3B for 400B code tokens followed by 150B math tokens;
- General Training for 400B Tokens → Math Training for 150B Tokens: As a control experiment, we also experiment with general tokens (sampled from a large-scale general corpus created by DeepSeek-AI) instead of code tokens in the first stage of training, in an attempt to investigate the advantages of code tokens over general tokens in improving mathematical reasoning.

One-Stage Training
- Math Training for 150B Tokens: We train DeepSeek-LLM 1.3B for 150B math tokens;
- Training on a mixture of 400B Code Tokens and 150B Math Tokens: Math training following code training degrades coding performance. We investigate whether code tokens, when mixed with math tokens for one-stage training, would still improve mathematical reasoning and also alleviate the problem of catastrophic forgetting.
Results  Table 6 and Table 7 demonstrate the downstream performance under different training settings.

Code training benefits program-aided mathematical reasoning, both under the two-stage training and one-stage training settings. As shown in Table 6, under the two-stage training setting, code training alone already significantly enhances the ability to solve GSM8K and MATH problems using Python. Math training in the second stage yields further improvements. Interestingly, under the one-stage training setting, mixing code tokens and math tokens effectively mitigates the issue of catastrophic forgetting that arises from two-stage training, and also synergizes coding (Table 7) and program-aided mathematical reasoning (Table 6).

Training Setting | General Tokens | Code Tokens | Math Tokens | MMLU | BBH | HumanEval (Pass@1) | MBPP (Pass@1)
No Continual Training | - | - | - | 24.5% | 28.1% | 12.2% | 13.0%
Two-Stage Training
Stage 1: General Training | 400B | - | - | 25.9% | 27.7% | 15.2% | 13.6%
Stage 2: Math Training | - | - | 150B | 33.1% | 32.7% | 12.8% | 13.2%
Stage 1: Code Training | - | 400B | - | 25.0% | 31.5% | 25.0% | 40.0%
Stage 2: Math Training | - | - | 150B | 36.2% | 35.3% | 12.2% | 17.0%
One-Stage Training
Math Training | - | - | 150B | 32.3% | 32.5% | 11.6% | 13.2%
Code & Math Mixed Training | - | 400B | 150B | 33.5% | 35.6% | 29.3% | 39.4%

Table 7 | Investigation of how different settings of code and math training affect model performance of language understanding, reasoning, and coding. We experiment with DeepSeek-LLM 1.3B. We evaluate the models on MMLU and BBH using few-shot chain-of-thought prompting. On HumanEval and MBPP, we conduct zero-shot and few-shot evaluations, respectively.
Code training also improves mathematical reasoning without tool use. Under the two-stage training setting, the initial stage of code training already results in moderate enhancements. It also boosts the efficiency of the subsequent math training, eventually leading to the best performance. However, combining code tokens and math tokens for one-stage training compromises mathematical reasoning without tool use. One conjecture is that DeepSeek-LLM 1.3B, due to its limited scale, lacks the capacity to fully assimilate both code and mathematical data simultaneously.
5.1.2. ArXiv Papers Seem Ineffective in Improving Mathematical Reasoning

ArXiv papers are commonly included as a component of math pre-training data (Azerbayev et al., 2023; Lewkowycz et al., 2022a; Polu and Sutskever, 2020; Wang et al., 2023c). However, detailed analysis regarding their impact on mathematical reasoning has not been extensively conducted. Perhaps counter-intuitively, according to our experiments, arXiv papers seem ineffective in improving mathematical reasoning. We experiment with models of different sizes, including DeepSeek-LLM 1.3B and DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024), using arXiv corpora that underwent varied processing pipelines:
- MathPile (Wang et al., 2023c): an 8.9B-token corpus developed with cleaning and filtering heuristic rules, over 85% of which are scientific arXiv papers;
- ArXiv-RedPajama (Computer, 2023): the entirety of arXiv LaTeX files with preambles, comments, macros, and bibliographies removed, totaling 28.0B tokens.

In our experiments, we separately train DeepSeek-LLM 1.3B for 150B tokens and DeepSeek-Coder-Base-v1.5 7B for 40B tokens on each arXiv corpus. It seems that arXiv papers are ineffective in improving mathematical reasoning. When trained on an arXiv-only corpus, both models display no notable improvements or even deterioration across various mathematical benchmarks of different complexities employed in this study. These benchmarks include quantitative reasoning datasets like GSM8K and MATH (Table 8), multiple-choice challenges like MMLU-STEM (Table 8), and formal mathematics like miniF2F (Table 9).

Model | Size | ArXiv Corpus | GSM8K | MATH | OCW | SAT | MMLU STEM | CMATH | Gaokao MathCloze | Gaokao MathQA
DeepSeek-LLM | 1.3B | No Math Training | 2.9% | 3.0% | 2.9% | 15.6% | 19.5% | 12.3% | 0.8% | 17.9%
DeepSeek-LLM | 1.3B | MathPile | 2.7% | 3.3% | 2.2% | 12.5% | 15.7% | 1.2% | 0.0% | 2.8%
DeepSeek-LLM | 1.3B | ArXiv-RedPajama | 3.3% | 3.4% | 4.0% | 9.4% | 9.0% | 7.4% | 0.8% | 2.3%
DeepSeek-Coder-Base-v1.5 | 7B | No Math Training | 29.0% | 12.5% | 6.6% | 40.6% | 38.1% | 45.9% | 5.9% | 21.1%
DeepSeek-Coder-Base-v1.5 | 7B | MathPile | 23.6% | 11.5% | 7.0% | 46.9% | 35.8% | 37.9% | 4.2% | 25.6%
DeepSeek-Coder-Base-v1.5 | 7B | ArXiv-RedPajama | 28.1% | 11.1% | 7.7% | 50.0% | 35.2% | 42.6% | 7.6% | 24.8%

Table 8 | Effect of math training on different arXiv datasets. Model performance is evaluated with few-shot chain-of-thought prompting.

ArXiv Corpus | miniF2F-valid | miniF2F-test
No Math Training | 20.1% | 21.7%
MathPile | 16.8% | 16.4%
ArXiv-RedPajama | 14.8% | 11.9%

Table 9 | Effect of math training on different arXiv corpora, the base model being DeepSeek-Coder-Base-v1.5 7B. We evaluate informal-to-formal proving in Isabelle.

However, this conclusion has its limitations and should be taken with a grain of salt. We have not yet studied:
- The impact of arXiv tokens on specific math-related tasks not included in this research, such as informalization of theorems, which is to convert formal statements or proofs to their informal versions;
- The effect of arXiv tokens when combined with other types of data;
- Whether the benefits of arXiv papers would manifest themselves at a larger model scale.
Thus, further exploration is required, which we leave for future studies.

5.2. Insights of Reinforcement Learning

5.2.1. Towards a Unified Paradigm

In this section, we provide a unified paradigm to analyze different training methods, such as SFT, RFT, DPO, PPO, and GRPO, and further conduct experiments to explore the factors of the unified paradigm. Generally, the gradient with respect to the parameter $\theta$ of a training method can be written as:

$$\nabla_\theta\mathcal{J}_{\mathcal{A}}(\theta)=\mathbb{E}_{(q,o)\sim\mathcal{D}}\Big[\frac{1}{|o|}\sum_{t=1}^{|o|}GC_{\mathcal{A}}(q,o,t,\pi_{rf})\,\nabla_\theta\log\pi_\theta(o_t\mid q,o_{<t})\Big], \qquad (5)$$

where the expectation over $(q,o)\sim\mathcal{D}$ is taken with respect to the data source and $GC_{\mathcal{A}}(q,o,t,\pi_{rf})$ is the gradient coefficient. There exist three key components: 1) Data Source $\mathcal{D}$, which determines the training data; 2) Reward Function $\pi_{rf}$, which is the source of the training reward signal; 3) Algorithm $\mathcal{A}$, which processes the training data and the reward signal to the gradient coefficient $GC$ that determines the magnitude of the penalty or reinforcement for the data. We analyze several representative methods based on such a unified paradigm:
- Supervised Fine-tuning (SFT): SFT fine-tunes a pretrained model on human-selected SFT data.
- Rejection Sampling Fine-tuning (RFT): RFT further fine-tunes the SFT model on the filtered outputs sampled from the SFT model based on SFT questions. RFT filters the outputs based on the correctness of their answers.
- Direct Preference Optimization (DPO): DPO further refines the SFT model by fine-tuning it on augmented outputs sampled from the SFT model, using pair-wise DPO loss.
- Online Rejection Sampling Fine-tuning (Online RFT): Different from RFT, Online RFT initiates the policy model using the SFT model and refines it by fine-tuning with the augmented outputs sampled from the real-time policy model.
- PPO/GRPO: PPO/GRPO initializes the policy model using the SFT model and reinforces it with the outputs sampled from the real-time policy model.

We summarize the components of these methods in Table 10. Please refer to Appendix A.1 for a more detailed derivation process.

Methods | Data Source | Reward Function | Gradient Coefficient
SFT | $q, o \sim P_{sft}(Q, O)$ | - | 1
RFT | $q \sim P_{sft}(Q),\ o \sim \pi_{sft}(O|q)$ | Rule | Equation 10
DPO | $q \sim P_{sft}(Q),\ o^+, o^- \sim \pi_{sft}(O|q)$ | Rule | Equation 14
Online RFT | $q \sim P_{sft}(Q),\ o \sim \pi_\theta(O|q)$ | Rule | Equation 10
PPO | $q \sim P_{sft}(Q),\ o \sim \pi_\theta(O|q)$ | Model | Equation 18
GRPO | $q \sim P_{sft}(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_\theta(O|q)$ | Model | Equation 21

Table 10 | The data source and gradient coefficient of different methods. $P_{sft}$ denotes the data distribution of supervised fine-tuning datasets. $\pi_{sft}$ and $\pi_\theta$ denote the supervised fine-tuned model and the real-time policy model during the online training process, respectively.

Figure 5 | Performance of the DeepSeekMath-Instruct 1.3B model, which was further trained using various methods, on two benchmarks. (Curves: RFT, Online RFT, GRPO+OS, and GRPO+PS; accuracy on GSM8K and MATH over training steps.)
Observation about Data Source  We divide the data source into two categories, online sampling and offline sampling. Online sampling denotes that the training data is from the exploration results of the real-time training policy model, while offline sampling denotes that the training data is from the sampling results of the initial SFT model. RFT and DPO follow the offline style, while Online RFT and GRPO follow the online style.

As shown in Figure 5, we find that Online RFT significantly outperforms RFT on two benchmarks. Specifically, Online RFT is comparable to RFT in the early stage of training but gains an absolute advantage in the later stage, demonstrating the superiority of online training. This is intuitive, as in the initial stage, the actor and the SFT model exhibit close resemblance, with the sampled data revealing only minor differences. In the later stage, however, the data sampled from the actor will exhibit more significant differences, and real-time data sampling will offer greater advantages.
Observation about Gradient Coefficient  The algorithm processes the reward signal to the gradient coefficient to update the model parameter. We divide the reward function as Rule and Model in our experiments. Rule refers to judging the quality of a response based on the correctness of the answer, and Model denotes that we train a reward model to score each response. The training data of the reward model is based on the rule judgment. Equations 10 and 21 highlight a key difference between GRPO and Online RFT: GRPO uniquely adjusts its gradient coefficient based on the reward value provided by the reward model. This allows for differential reinforcement and penalization of responses according to their varying magnitudes. In contrast, Online RFT lacks this feature; it does not penalize incorrect responses and uniformly reinforces all responses with correct answers at the same level of intensity.

As demonstrated in Figure 5, GRPO surpasses Online RFT, thereby highlighting the efficiency of altering positive and negative gradient coefficients. In addition, GRPO+PS shows superior performance compared to GRPO+OS, indicating the benefits of using fine-grained, step-aware gradient coefficients. Furthermore, we explore the iterative RL; in our experiments, we conduct two rounds of iteration. As shown in Figure 6, we notice that the iterative RL significantly improves the performance, especially at the first iteration.

Figure 6 | Performance of iterative reinforcement learning with DeepSeekMath-Instruct 7B on two benchmarks. (Curves: Iteration-0, Iteration-1, and Iteration-2; accuracy on GSM8K and MATH over training steps.)
5.2.2. Why RL Works?

In this paper, we conduct reinforcement learning based on a subset of instruction tuning data, and it achieves significant performance enhancement upon the instruction tuning model. To further explain why reinforcement learning works, we evaluate the Pass@K and Maj@K accuracy of the Instruct and RL models on two benchmarks. As shown in Figure 7, RL enhances Maj@K's performance but not Pass@K. These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from Top-K rather than the enhancement of fundamental capabilities. Similarly, Wang et al. (2023a) identified a misalignment problem in reasoning tasks within the SFT model, showing that the reasoning performance of SFT models can be improved through a series of preference alignment strategies (Song et al., 2023; Wang et al., 2023a; Yuan et al., 2023b).

Figure 7 | The Maj@K and Pass@K of SFT and RL DeepSeekMath 7B on GSM8K and MATH (temperature 0.7). It was noted that RL enhances Maj@K but not Pass@K. (Curves: Maj@K-Instruct, Maj@K-RL, Pass@K-Instruct, Pass@K-RL over K = 1 to 64 candidates.)
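The two metrics compared above can be written down compactly; the sketch below computes both from K sampled answers for one question, with names and data layout chosen for illustration.

```python
# Pass@K: is any of the K sampled answers correct?
# Maj@K: is the majority-voted answer among the K samples correct?
from collections import Counter

def pass_at_k(samples, reference):
    return float(any(ans == reference for ans in samples))

def maj_at_k(samples, reference):
    voted, _ = Counter(samples).most_common(1)[0]
    return float(voted == reference)

# Averaging these indicators over a benchmark gives curves like those in Figure 7:
# RL mainly shifts probability mass toward answers that already appear among the
# Top-K samples, rather than making new answers reachable.
```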
5.2.3. How to Achieve More Effective RL?

We demonstrate that RL works well in mathematical reasoning tasks. We also provide a unified paradigm to understand different representative training methods. Within this paradigm, all methods are conceptualized as either direct or simplified RL techniques. As summarized in Equation 5, there exist three key components: Data Source, Algorithm, and Reward Function. We provide some potential future directions about the three components.

Data Source  Data source is the raw material of all training methods. In the context of RL, we specifically refer to the data source as the unlabeled questions with the outputs sampled from the policy model. In this paper, we only use the questions from the instruction tuning stage and a naive nucleus sampling to sample outputs. We think this is a potential reason that our RL pipeline only improves the Maj@K performance. In the future, we will explore our RL pipeline on out-of-distribution question prompts, in conjunction with advanced sampling (decoding) strategies, like those based on tree-search methods (Yao et al., 2023). Also, the efficient inference techniques (Kwon et al., 2023; Leviathan et al., 2023; Xia et al., 2023, 2024), which determine the exploration efficiency of policy models, also play an exceedingly important role.
Algorithms  Algorithms process the data and reward signal to the gradient coefficient to update the model parameter. Based on Equation 5, to some extent, all methods now fully TRUST the signal of the reward function to increase or decrease the conditional probability of a certain token. However, it is impossible to ensure the reward signal is always reliable, especially in extremely complex tasks. For example, even the PRM800K datasets (Lightman et al., 2023), which have been carefully annotated by well-trained annotators, still contain approximately 20% incorrect annotations. To this end, we will explore reinforcement learning algorithms that are robust against noisy reward signals. We believe such WEAK-TO-STRONG (Burns et al., 2023) alignment methods will bring a fundamental change to the learning algorithms.
Reward Function  The reward function is the source of the training signal. In RL, the reward function is usually the neural reward model. We think there exist three important directions for reward models: 1) How to enhance the generalization ability of the reward model. The reward model must be effectively generalized to handle out-of-distribution questions and advanced decoding outputs; otherwise, reinforcement learning may merely stabilize the distribution of LLMs rather than improve their fundamental capabilities; 2) How to reflect the uncertainty of the reward model. The uncertainty could potentially act as a linking bridge between the weak reward model and the weak-to-strong learning algorithms; 3) How to efficiently build high-quality process reward models that can provide fine-grained training signals for the reasoning process (Lightman et al., 2023; Wang et al., 2023b).
6. Conclusion, Limitation, and Future Work

We present DeepSeekMath, which outperforms all open-source models on the competition-level MATH benchmark and approaches the performance of closed models. DeepSeekMath is initialized with DeepSeek-Coder-v1.5 7B and undergoes continual training for 500B tokens, with a significant component of the training data being 120B math tokens sourced from Common Crawl. Our extensive ablation study shows web pages offer significant potential for high-quality mathematical data, while arXiv may not be as beneficial as we expected. We introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), which can notably improve mathematical reasoning capabilities with less memory consumption. The experiment results show that GRPO is effective even if DeepSeekMath-Instruct 7B has reached a high score on benchmarks. We also provide a unified paradigm to understand a series of methods and summarize several potential directions for more effective reinforcement learning.

Although DeepSeekMath achieves impressive scores on quantitative reasoning benchmarks, its capabilities on geometry and theorem proving are relatively weaker than those of closed models. For instance, in our dry run, the model cannot handle problems related to triangles and ellipses, which may indicate data selection bias in pre-training and fine-tuning. In addition, restricted by the model scale, DeepSeekMath is worse than GPT-4 on few-shot capability. GPT-4 could improve its performance with few-shot inputs, while DeepSeekMath shows similar performance in zero-shot and few-shot evaluation. In the future, we will further improve our engineered data selection pipeline to construct a more high-quality pre-trained corpus. In addition, we will explore the potential directions (Section 5.2.3) for more effective reinforcement learning of LLMs.
[7] https:/

References
R. Anil et al. Gemini: A family of highly capable multimodal models. CoRR, abs/2312.11805, 2023. doi: 10.48550/ARXIV.2312.11805. URL https://doi.org/10.48550/arXiv.2312.11805.
J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
Z. Azerbayev, H. Schoelkopf, K. Paster, M. D. Santos, S. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and S. Welleck. Llemma: An open language model for mathematics. arXiv preprint arXiv:2310.10631, 2023.
J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
C. Burns, P. Izmailov, J. H. Kirchner, B. Baker, L. Gao, L. Aschenbrenner, Y. Chen, A. Ecoffet, M. Joglekar, J. Leike, et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023.
ChatGLM3 Team. Chatglm3 series: Open bilingual chat llms, 2023.
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.
W. Chen, X. Ma, X. Wang, and W. W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. CoRR, abs/2211.12588, 2022. doi: 10.48550/ARXIV.2211.12588. URL https://doi.org/10.48550/arXiv.2211.12588.
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
T. Computer. Redpajama: an open dataset for training large language models, Oct. 2023.
DeepSeek-AI. Deepseek LLM: scaling open-source language models with longtermism. CoRR, abs/2401.02954, 2024. doi: 10.48550/ARXIV.2401.02954. URL https://doi.org/10.48550/arXiv.2401.02954.
Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320-335, 2022.
L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig. PAL: program-aided language models. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 10764-10799. PMLR, 2023. URL https://proceedings.mlr.press/v202/gao23f.html.
Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, M. Huang, N. Duan, and W. Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving. CoRR, abs/2309.17452, 2023. doi: 10.48550/ARXIV.2309.17452. URL https://doi.org/10.48550/arXiv.2309.17452.
D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang. Deepseek-coder: When the large language model meets programming - the rise of code intelligence, 2024.
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
High-flyer. HAI-LLM: An efficient and lightweight training tool for large models, 2023.
Inflection AI. Inflection-2, 2023. URL https://inflection.ai/inflection-2.
A. Q. Jiang, S. Welleck, J. P. Zhou, W. Li, J. Liu, M. Jamnik, T. Lacroix, Y. Wu, and G. Lample. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. arXiv preprint arXiv:2210.12283, 2022.
A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.
W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
Y. Leviathan, M. Kalman, and Y. Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274-19286. PMLR, 2023.
A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843-3857, 2022a.
A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur, G. Gur-Ari, and V. Misra. Solving quantitative reasoning problems with language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022b. URL http://papers.nips.cc/paper_files/paper/2022/hash/18abbeef8cfe9203fdf9053c9c4fe191-Abstract-Conference.html.
H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023.
I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin, S. Chen, and D. Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023.
S. Mishra, M. Finlayson, P. Lu, L. Tang, S. Welleck, C. Baral, T. Rajpurohit, O. Tafjord, A. Sabharwal, P. Clark, and A. Kalyan. LILA: A unified benchmark for mathematical reasoning. In Y. Goldberg, Z. Kozareva, and Y. Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 5807-5832. Association for Computational Linguistics, 2022. doi: 10.18653/V1/2022.EMNLP-MAIN.392. URL https://doi.org/10.18653/v1/2022.emnlp-main.392.
X. Nguyen, W. Zhang, X. Li, M. M. Aljunied, Q. Tan, L. Cheng, G. Chen, Y. Deng, S. Yang, C. Liu, H. Zhang, and L. Bing. Seallms - large language models for southeast asia. CoRR, abs/2312.00738, 2023. doi: 10.48550/ARXIV.2312.00738. URL https://doi.org/10.48550/arXiv.2312.00738.
OpenAI. GPT4 technical report. arXiv preprint arXiv:2303.08774, 2023.
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744, 2022.
K. Paster, M. D. Santos, Z. Azerbayev, and J. Ba. Openwebmath: An open dataset of high-quality mathematical web text. CoRR, abs/2310.06786, 2023. doi: 10.48550/ARXIV.2310.06786. URL https://doi.org/10.48550/arXiv.2310.06786.
L. C. Paulson. Three years of experience with sledgehammer, a practical link between automatic and interactive theorem provers. In R. A. Schmidt, S. Schulz, and B. Konev, editors, Proceedings of the 2nd Workshop on Practical Aspects of Automated Reasoning, PAAR-2010, Edinburgh, Scotland, UK, July 14, 2010, volume 9 of EPiC Series in Computing, pages 1-10. EasyChair, 2010. doi: 10.29007/TNFD. URL https://doi.org/10.29007/tnfd.
S. Polu and I. Sutskever. Generative language modeling for automated theorem proving. CoRR, abs/2009.03393, 2020. URL https://arxiv.org/abs/2009.03393.
R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. 2023.
J. Schulman. Approximating KL divergence, 2020.
J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, and J. Wei. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
F. Song, B. Yu, M. Li, H. Yu, F. Huang, Y. Li, and H. Wang. Preference ranking optimization for human alignment. arXiv preprint arXiv:2306.17492, 2023.
M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
T. Tao. Embracing change and resetting expectations, 2023.
H. Touvron, L. Martin, K. Stone, et al. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023. doi: 10.48550/arXiv.2307.09288. URL https://doi.org/10.48550/arXiv.2307.09288.
T. H. Trinh, Y. Wu, Q. V. Le, H. He, and T. Luong. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476-482, 2024.
P. Wang, L. Li, L. Chen, F. Song, B. Lin, Y. Cao, T. Liu, and Z. Sui. Making large language models better reasoners with alignment. arXiv preprint arXiv:2309.02144, 2023a.
P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. CoRR, abs/2312.08935, 2023b.
Z. Wang, R. Xia, and P. Liu. Generative AI for math: Part I - mathpile: A billion-token-scale pretraining corpus for math. CoRR, abs/2312.17120, 2023c. doi: 10.48550/ARXIV.2312.17120. URL https://doi.org/10.48550/arXiv.2312.17120.
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html.
T. Wei, J. Luan, W. Liu, S. Dong, and B. Wang. Cmath: Can your language model pass chinese elementary school math test?, 2023.
M. Wenzel, L. C. Paulson, and T. Nipkow. The isabelle framework. In O. A. Mohamed, C. A. Muñoz, and S. Tahar, editors, Theorem Proving in Higher Order Logics, 21st International Conference, TPHOLs 2008, Montreal, Canada, August 18-21, 2008. Proceedings, volume 5170 of Lecture Notes in Computer Science, pages 33-38. Springer, 2008. doi: 10.1007/978-3-540-71067-7_7. URL https://doi.org/10.1007/978-3-540-71067-7_7.
H. Xia, T. Ge, P. Wang, S.-Q. Chen, F. Wei, and Z. Sui. Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3909-3925, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.257. URL https://aclanthology.org/2023.findings-emnlp.257.
H. Xia, Z. Yang, Q. Dong, P. Wang, Y. Li, T. Ge, T. Liu, W. Li, and Z. Sui. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. arXiv preprint arXiv:2401.07851, 2024.
S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
L. Yu, W. Jiang, H. Shi, J. Yu, Z. Liu, Y. Zhang, J. T. Kwok, Z. Li, A. Weller, and W. Liu. Metamath: Bootstrap your own mathematical questions for large language models. CoRR, abs/2309.12284, 2023. doi: 10.48550/ARXIV.2309.12284. URL https://doi.org/10.48550/arXiv.2309.12284.
Z. Yuan, H. Yuan, C. Li, G. Dong, C. Tan, and C. Zhou. Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825, 2023a.
Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, and F. Huang. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302, 2023b.
X. Yue, X. Qu, G. Zhang, Y. Fu, W. Huang, H. Sun, Y. Su, and W. Chen. Mammoth: Building math generalist models through hybrid instruction tuning. CoRR, abs/2309.05653, 2023. doi: 10.48550/ARXIV.2309.05653. URL https://doi.org/10.48550/arXiv.2309.05653.
K. Zheng, J. M. Han, and S. Polu. Minif2f: a cross-system benchmark for formal olympiad-level mathematics. arXiv preprint arXiv:2109.00110, 2021.
W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. AGIEval: A human-centric benchmark for evaluating foundation models. CoRR, abs/2304.06364, 2023. doi: 10.48550/arXiv.2304.06364. URL https://doi.org/10.48550/arXiv.2304.06364.
A. Appendix

A.1. Analysis of Reinforcement Learning
We provide the detailed derivation of the data source and gradient coefficient (algorithm and reward function) across various methods, including SFT, RFT, Online RFT, DPO, PPO, and GRPO.

A.1.1. Supervised Fine-tuning
The objective of Supervised Fine-tuning is maximizing the following objective:
$$\mathcal{J}_{SFT}(\theta)=\mathbb{E}_{[q,o\sim P_{sft}(Q,O)]}\left(\frac{1}{|o|}\sum_{t=1}^{|o|}\log\pi_\theta(o_t\mid q,o_{<t})\right).\tag{6}$$
The gradient of $\mathcal{J}_{SFT}(\theta)$ is:
$$\nabla_\theta\mathcal{J}_{SFT}(\theta)=\mathbb{E}_{[q,o\sim P_{sft}(Q,O)]}\left(\frac{1}{|o|}\sum_{t=1}^{|o|}\nabla_\theta\log\pi_\theta(o_t\mid q,o_{<t})\right).\tag{7}$$
Data Source: The dataset employed for SFT. Reward Function: This can be regarded as human selection. Gradient Coefficient: always set to 1.
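As a concrete, simplified illustration, the sketch below implements Equation (6) as a masked, length-normalized log-likelihood in which every response token receives the same gradient coefficient of 1. The tensor shapes and the response mask are assumptions made for the example.

```python
# Sketch of the SFT objective (Eq. 6): per-sequence averaged log-likelihood of
# the labeled response, i.e. gradient coefficient = 1 for every response token.
import torch
import torch.nn.functional as F

def sft_loss(logits, response_ids, mask):
    """
    logits:       (batch, seq_len, vocab) policy logits
    response_ids: (batch, seq_len) target tokens o_t
    mask:         (batch, seq_len) 1 on response tokens, 0 on prompt/padding
    """
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    per_seq = (token_logp * mask).sum(dim=1) / mask.sum(dim=1)  # (1/|o|) * sum over t
    return -per_seq.mean()  # maximize J  <=>  minimize -J

loss = sft_loss(torch.randn(2, 6, 50, requires_grad=True),
                torch.randint(0, 50, (2, 6)),
                torch.ones(2, 6))
loss.backward()
```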
A.1.2. Rejection Sampling Fine-tuning
Rejection Sampling Fine-tuning first samples multiple outputs from the supervised fine-tuned LLMs for each question, and then trains LLMs on the sampled outputs with the correct answer. Formally, the objective of RFT is to maximize the following objective:
$$\mathcal{J}_{RFT}(\theta)=\mathbb{E}_{[q\sim P_{sft}(Q),\,o\sim\pi_{sft}(O\mid q)]}\left(\frac{1}{|o|}\sum_{t=1}^{|o|}\mathbb{I}(o)\log\pi_\theta(o_t\mid q,o_{<t})\right).\tag{8}$$
The gradient of $\mathcal{J}_{RFT}(\theta)$ is:
$$\nabla_\theta\mathcal{J}_{RFT}(\theta)=\mathbb{E}_{[q\sim P_{sft}(Q),\,o\sim\pi_{sft}(O\mid q)]}\left(\frac{1}{|o|}\sum_{t=1}^{|o|}\mathbb{I}(o)\nabla_\theta\log\pi_\theta(o_t\mid q,o_{<t})\right).\tag{9}$$
Data Source: questions in the SFT dataset with outputs sampled from the SFT model. Reward Function: Rule (whether the answer is correct or not). Gradient Coefficient:
$$GC_{RFT}(q,o,t)=\mathbb{I}(o)=\begin{cases}1 & \text{the answer of } o \text{ is correct}\\ 0 & \text{the answer of } o \text{ is incorrect}\end{cases}\tag{10}$$
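A small sketch of the rule-based coefficient in Equation (10) follows: sampled outputs whose final answer matches the reference are kept (I(o) = 1) and the rest are discarded (I(o) = 0). The answer-extraction heuristic is an assumption made for the example; sampling the candidates from the current policy instead of the frozen SFT model would give the Online RFT variant described next.

```python
# Sketch of RFT data construction: keep sampled outputs whose final answer is
# correct (gradient coefficient I(o) = 1), discard the rest (I(o) = 0).
import re
from typing import List, Tuple

def extract_answer(output: str) -> str:
    """Assumed convention: the last number in the output is the final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output)
    return numbers[-1] if numbers else ""

def rft_filter(question: str, outputs: List[str], gold: str) -> List[Tuple[str, str]]:
    """Return (question, output) pairs with I(o) = 1 for further fine-tuning."""
    kept = []
    for o in outputs:
        if extract_answer(o) == gold:   # the rule reward: answer correctness
            kept.append((question, o))
    return kept

samples = ["17 * 24 = 408, so the answer is 408.", "The answer is 398."]
print(rft_filter("What is 17 * 24?", samples, gold="408"))
```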
A.1.3. Online Rejection Sampling Fine-tuning
The only difference between RFT and Online RFT is that the outputs of Online RFT are sampled from the real-time policy model $\pi_\theta$, rather than from the SFT model. Therefore, the gradient of Online RFT is:
$$\nabla_\theta\mathcal{J}_{OnRFT}(\theta)=\mathbb{E}_{[q\sim P_{sft}(Q),\,o\sim\pi_\theta(O\mid q)]}\left(\frac{1}{|o|}\sum_{t=1}^{|o|}\mathbb{I}(o)\nabla_\theta\log\pi_\theta(o_t\mid q,o_{<t})\right).\tag{11}$$
A.1.4. Direct Preference Optimization (DPO)
The objective of DPO is:
$$\mathcal{J}_{DPO}(\theta)=\mathbb{E}_{[q\sim P_{sft}(Q),\,o^+,o^-\sim\pi_{sft}(O\mid q)]}\log\sigma\left(\beta\frac{1}{|o^+|}\sum_{t=1}^{|o^+|}\log\frac{\pi_\theta(o_t^+\mid q,o_{<t}^+)}{\pi_{ref}(o_t^+\mid q,o_{<t}^+)}-\beta\frac{1}{|o^-|}\sum_{t=1}^{|o^-|}\log\frac{\pi_\theta(o_t^-\mid q,o_{<t}^-)}{\pi_{ref}(o_t^-\mid q,o_{<t}^-)}\right)\tag{12}$$
The gradient of $\mathcal{J}_{DPO}(\theta)$ is:
$$\nabla_\theta\mathcal{J}_{DPO}(\theta)=\mathbb{E}_{[q\sim P_{sft}(Q),\,o^+,o^-\sim\pi_{sft}(O\mid q)]}\left(\frac{1}{|o^+|}\sum_{t=1}^{|o^+|}GC_{DPO}(q,o,t)\nabla_\theta\log\pi_\theta(o_t^+\mid q,o_{<t}^+)-\frac{1}{|o^-|}\sum_{t=1}^{|o^-|}GC_{DPO}(q,o,t)\nabla_\theta\log\pi_\theta(o_t^-\mid q,o_{<t}^-)\right)\tag{13}$$
Data Source: questions in the SFT dataset with outputs sampled from the SFT model. Reward Function: human preference in the general domain (can be Rule in mathematical tasks). Gradient Coefficient:
$$GC_{DPO}(q,o,t)=\sigma\left(\beta\log\frac{\pi_\theta(o_t^-\mid q,o_{<t}^-)}{\pi_{ref}(o_t^-\mid q,o_{<t}^-)}-\beta\log\frac{\pi_\theta(o_t^+\mid q,o_{<t}^+)}{\pi_{ref}(o_t^+\mid q,o_{<t}^+)}\right)\tag{14}$$
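The sketch below mirrors Equation (12) with sequence-averaged log-ratios; the β value, the tensor shapes, and the omission of padding masks are simplifying assumptions made for the example.

```python
# Sketch of the DPO loss: -log sigmoid(beta * (avg log-ratio of the preferred
# output o+  minus  avg log-ratio of the dispreferred output o-)).
import torch
import torch.nn.functional as F

def dpo_loss(logp_pos, ref_logp_pos, logp_neg, ref_logp_neg, beta: float = 0.1):
    """
    Each input has shape (batch, seq_len): per-token log-probs of o+ / o-
    under the policy (logp_*) and the frozen reference model (ref_logp_*).
    Padding is omitted for brevity.
    """
    # (1/|o|) * sum_t log(pi_theta / pi_ref), for preferred and dispreferred outputs
    ratio_pos = (logp_pos - ref_logp_pos).mean(dim=1)
    ratio_neg = (logp_neg - ref_logp_neg).mean(dim=1)
    return -F.logsigmoid(beta * (ratio_pos - ratio_neg)).mean()

b, t = 2, 8
loss = dpo_loss(torch.randn(b, t, requires_grad=True), torch.randn(b, t),
                torch.randn(b, t, requires_grad=True), torch.randn(b, t))
loss.backward()
```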
A.1.5. Proximal Policy Optimization (PPO)
The objective of PPO is:
$$\mathcal{J}_{PPO}(\theta)=\mathbb{E}_{[q\sim P_{sft}(Q),\,o\sim\pi_{\theta_{old}}(O\mid q)]}\frac{1}{|o|}\sum_{t=1}^{|o|}\min\left[\frac{\pi_\theta(o_t\mid q,o_{<t})}{\pi_{\theta_{old}}(o_t\mid q,o_{<t})}A_t,\ \mathrm{clip}\left(\frac{\pi_\theta(o_t\mid q,o_{<t})}{\pi_{\theta_{old}}(o_t\mid q,o_{<t})},1-\varepsilon,1+\varepsilon\right)A_t\right].\tag{15}$$
To simplify the analysis, it is assumed that the model only has a single update following each exploration stage, thereby ensuring that $\pi_{\theta_{old}}=\pi_\theta$. In this case, we can remove the $\min$ and $\mathrm{clip}$ operations:
$$\mathcal{J}_{PPO}(\theta)=\mathbb{E}_{[q\sim P_{sft}(Q),\,o\sim\pi_{\theta_{old}}(O\mid q)]}\frac{1}{|o|}\sum_{t=1}^{|o|}\frac{\pi_\theta(o_t\mid q,o_{<t})}{\pi_{\theta_{old}}(o_t\mid q,o_{<t})}A_t.\tag{16}$$
The gradient of $\mathcal{J}_{PPO}(\theta)$ is:
$$\nabla_\theta\mathcal{J}_{PPO}(\theta)=\mathbb{E}_{[q\sim P_{sft}(Q),\,o\sim\pi_{\theta_{old}}(O\mid q)]}\frac{1}{|o|}\sum_{t=1}^{|o|}A_t\nabla_\theta\log\pi_\theta(o_t\mid q,o_{<t}).\tag{17}$$
Data Source: questions in the SFT dataset with outputs sampled from the policy model. Reward Function: reward model. Gradient Coefficient:
$$GC_{PPO}(q,o,t,\pi_{rm})=A_t,\tag{18}$$
where $A_t$ is the advantage, which is computed by applying Generalized Advantage Estimation (GAE) (Schulman et al., 2015), based on the rewards and a learned value function.
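To make the coefficient in Equation (18) concrete, here is a minimal sketch of Generalized Advantage Estimation over per-token rewards and values. Placing the scalar reward only on the final token and the particular γ and λ values are illustrative assumptions.

```python
# Sketch of GAE: the PPO gradient coefficient A_t computed from per-token
# rewards and a learned value function V(s_t).
from typing import List

def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """values must contain one extra bootstrap entry V(s_T) at the end."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Outcome-style reward: a single scalar on the last token of the sampled output.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.2, 0.3, 0.5, 0.8, 0.0]          # V(s_0..s_3) plus bootstrap V(s_4) = 0
print(gae_advantages(rewards, values))       # these A_t become GC_PPO(q, o, t)
```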
A.1.6. Group Relative Policy Optimization (GRPO)
The objective of GRPO is (assume $\pi_{\theta_{old}}=\pi_\theta$ for simplified analysis):
$$\mathcal{J}_{GRPO}(\theta)=\mathbb{E}_{[q\sim P_{sft}(Q),\,\{o_i\}_{i=1}^G\sim\pi_{\theta_{old}}(O\mid q)]}\frac{1}{G}\sum_{i=1}^G\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left[\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}\mid q,o_{i,<t})}\hat{A}_{i,t}-\beta\left(\frac{\pi_{ref}(o_{i,t}\mid q,o_{i,<t})}{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}-\log\frac{\pi_{ref}(o_{i,t}\mid q,o_{i,<t})}{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}-1\right)\right].\tag{19}$$
The gradient of $\mathcal{J}_{GRPO}(\theta)$ is:
$$\nabla_\theta\mathcal{J}_{GRPO}(\theta)=\mathbb{E}_{[q\sim P_{sft}(Q),\,\{o_i\}_{i=1}^G\sim\pi_{\theta_{old}}(O\mid q)]}\frac{1}{G}\sum_{i=1}^G\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left[\hat{A}_{i,t}+\beta\left(\frac{\pi_{ref}(o_{i,t}\mid q,o_{i,<t})}{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}-1\right)\right]\nabla_\theta\log\pi_\theta(o_{i,t}\mid q,o_{i,<t}).\tag{20}$$
Data Source: questions in the SFT dataset with outputs sampled from the policy model. Reward Function: reward model. Gradient Coefficient:
$$GC_{GRPO}(q,o,t,\pi_{rm})=\hat{A}_{i,t}+\beta\left(\frac{\pi_{ref}(o_{i,t}\mid q,o_{i,<t})}{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}-1\right),\tag{21}$$
where $\hat{A}_{i,t}$ is computed based on the group reward scores.
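As a final illustration, the sketch below computes group-relative advantages in the outcome-supervision setting: each output's reward is normalized by the mean and standard deviation of its group, and the normalized value is shared by all tokens of that output. The group size, rewards, and output lengths are made up for the example.

```python
# Sketch of group-relative advantages under outcome supervision: normalize the
# reward of each sampled output within its group, then broadcast the normalized
# reward to all tokens of that output as A_hat_{i,t}.
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], lengths: List[int],
                              eps: float = 1e-6) -> List[List[float]]:
    mu, sigma = mean(rewards), pstdev(rewards)
    normalized = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token of output o_i shares the same outcome-level advantage.
    return [[a] * n for a, n in zip(normalized, lengths)]

# One question, a group of G = 4 sampled outputs scored by the reward model.
rewards = [0.1, 0.9, 0.4, 0.9]
lengths = [5, 7, 6, 4]               # |o_i| in tokens
for adv in group_relative_advantages(rewards, lengths):
    print(adv)
```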