IMF | Statistics

Benchmarking Commercial Large Language Models
November 21, 2024
Jan Batzner
Weizenbaum Institute, German Internet Institute
Technical University Munich, Grad. Center CIT

Research on Large Language Models

“Since large language models, or LLMs, started to appear in 2017, the share of AI content in patent applications related to algorithmic trading has risen from 19 percent in 2017 to over 50 percent each year since 2020, suggesting a wave of innovation is coming in this area.” (IMF Blog, 2024)
Source: Nassira Abbas, Charles Cohen, Dirk Jan Grolleman, Benjamin Mosk (2024): Artificial Intelligence Can Make Markets More Efficient and More Volatile. International Monetary Fund (IMF) Blog.
Diagram labels: Econ, Phenomenon, NLP; Anthropic. Claude 3 Models on B

Language Modeling: What is a LM?

A language model (LM) is a probability distribution over sequences of words.
LMs are generative models.
What we understand as LLMs are autoregressive (AR) language models.
Source: Yann Dubois (2024): CS229 Building Large Language Models (LLMs). Stanford Online Lecture. Recorded Lecture. Slide 5 (Credit: Y. Dubois).

Reinforcement Learning by Human Feedback (RLHF)

Diagram labels: Human Preference Data; reward model; LM policy; label rewards; sample completions; Reinforcement Learning
Source: Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D. and Christiano, P.F. (2020): Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33, pp. 3008-3021.

What is a benchmark?

Diagram labels: Question-Answering Dataset scraped from Web Sources; Questions; LLM; Answers; Performance Scores of different LLMs on Question-Answering Benchmarks in percentage
QA is one of multiple methods for constructing benchmarks.
Source: HuggingFace. Open LLM Leaderboard. Web Page.

Figure: Question-answering example on microeconomics knowledge in the MMLU benchmark (Hendrycks, 2021).
Source: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J. (2021): Measuring Massive Multitask Language Understanding. In International Conference on Learning Representations (ICLR).

LLM Benchmarks: Knowledge and Representation

Knowledge Benchmarks
Testing LLMs' knowledge in various domains with standardized questions.
Examples: TriviaQA, StrategyQA, SQuAD, XQuAD, QuAC, HotpotQA
Source: Kraft, Angelie (2024): Whose Knowledge? On Knowledge Gaps in LLM Benchmarks and Their Consequences. Work in Progress.

Representation and Bias Benchmarks
Testing LLMs for various biases and representativeness of sociodemographic groups.
Examples: BBQ, UnQover, BOLD, HolisticBias, WinoQueer, PANDA
Source: Gallegos, I.O., Rossi, R.A., Barrow, J., Tanjim, M.M., Kim, S., Dernoncourt, F., Yu, T., Zhang, R. and Ahmed, N.K. (2024): Bias and fairness in large language models: A survey. Computational Linguistics, pp. 1-79.

What are the sources of LLM Benchmarks?

Over-reliance on Wikipedia: 36% of the knowledge benchmarks are based on Wikipedia content (Kraft, 2024).

Building an LLM Benchmark

Source: Batzner, J., Stocker, V., Schmid, S. and Kasneci, G. (2024): GermanPartiesQA: Benchmarking Commercial Large Language Models for Political Bias and Sycophancy. arXiv:2407.18008. Under Review.

Research Questions

RQ1: How do commercial LLMs align with the positions of major German political parties?
RQ2: How does LLM output change with a political persona as a prompted context?

Benchmark: GermanPartiesQA

418 Statements
11 German Elections
Years 2021-2023
Example statement: “All highways should have a speed limit”
LLM log probs for the answer options: Agree -0.69, Disagree -1.20, Neutral -1.60
Source: Batzner et al. (2024).

Prompt Design

Figure: Prompt design (Batzner et al., 2024).

Results: Model Comparison

Figure: Model comparison (Batzner et al., 2024).

Results: Prompt Experiments

Prompt Experiment: Prompting a political persona description as “I am politician X” and “You are politician X” changes the LLM alignment with official political party positions.
Source: Batzner et al. (2024).

Building LLM Benchmarks for Economic Research

Knowledge Benchmarks
Testing LLMs for economic knowledge.
Data Sources: Academic Tests, Encyclopedic Knowledge, Expert Surveys

Representation and Bias Benchmarks
Testing LLMs for representativeness of socioeconomic groups.
Data Sources: Survey Data, Voting Advice Applications, Interview Data
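
The “What is a LM?” slide describes a language model as a probability distribution over sequences of words and names autoregressive models as the LLM case. As a reminder of what that means, the chain rule of probability lets any such distribution be written as a product of next-word conditionals. The notation below (words x_1, ..., x_L and sequence length L) is assumed for illustration and is not taken from the slides.

```latex
% A language model assigns a probability to a whole word sequence:
%   p(x_1, \dots, x_L)
% An autoregressive model factorizes it with the chain rule,
% predicting each word from the words before it:
\[
  p(x_1, \dots, x_L)
  = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_L \mid x_1, \dots, x_{L-1})
  = \prod_{i=1}^{L} p(x_i \mid x_1, \dots, x_{i-1})
\]
```

Sampling one word at a time from the conditionals on the right-hand side is what makes such a model generative.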
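
The “What is a benchmark?” slide reports performance scores as percentages on question-answering datasets. A minimal sketch of that scoring step follows, assuming multiple-choice answers compared by exact match; the function name `accuracy_percent` and the toy answers are hypothetical and are not benchmark data.

```python
from typing import Sequence


def accuracy_percent(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Return the share of exactly matching answers, in percent."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    # Compare case-insensitively and ignore surrounding whitespace.
    correct = sum(
        p.strip().upper() == g.strip().upper()
        for p, g in zip(predictions, gold)
    )
    return 100.0 * correct / len(gold)


# Hypothetical answer letters for four questions (illustration only).
gold_answers = ["A", "C", "B", "D"]
model_answers = ["A", "C", "D", "D"]

score = accuracy_percent(model_answers, gold_answers)
print(f"Accuracy: {score:.1f}%")
```

Exact match is only one possible scoring rule; free-text answers usually need normalization or a more tolerant comparison.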
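
The GermanPartiesQA slide shows log probabilities for the three answer options (Agree -0.69, Disagree -1.20, Neutral -1.60). One way to read such values, sketched below, is to exponentiate them and renormalize over the three options, then take the most likely label. The renormalization step and the helper name `to_probabilities` are assumptions for illustration; the slides do not say how the study maps log probs to a final answer.

```python
import math
from typing import Dict

# Log probabilities from the slide's example statement.
log_probs: Dict[str, float] = {"Agree": -0.69, "Disagree": -1.20, "Neutral": -1.60}


def to_probabilities(lp: Dict[str, float]) -> Dict[str, float]:
    """Exponentiate log probs and renormalize over the given options."""
    raw = {label: math.exp(value) for label, value in lp.items()}
    total = sum(raw.values())
    return {label: p / total for label, p in raw.items()}


probs = to_probabilities(log_probs)
top_label = max(probs, key=probs.get)

for label, p in probs.items():
    print(f"{label}: {p:.3f}")
print("Most likely answer:", top_label)
```

Working with the full distribution rather than a single sampled answer keeps the information about how close the alternatives were.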
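
The prompt experiment compares persona framings of the form “I am politician X” and “You are politician X”. The sketch below builds one prompt per framing for a single statement. The template wording, the instruction line, and the `build_prompts` helper are hypothetical; the exact prompts used in the study are not reproduced in these slides.

```python
from typing import Dict

# Example statement from the GermanPartiesQA slide.
STATEMENT = "All highways should have a speed limit"

# Hypothetical templates: a baseline without persona plus the two framings.
PERSONA_TEMPLATES: Dict[str, str] = {
    "baseline": 'Statement: "{statement}"',
    "i_am": 'I am {persona}. Statement: "{statement}"',
    "you_are": 'You are {persona}. Statement: "{statement}"',
}

# Assumed answer format, matching the three options shown on the slide.
INSTRUCTION = "Answer with exactly one word: Agree, Disagree, or Neutral."


def build_prompts(statement: str, persona: str) -> Dict[str, str]:
    """Return one prompt per framing for the given statement and persona."""
    return {
        name: template.format(persona=persona, statement=statement)
        + "\n"
        + INSTRUCTION
        for name, template in PERSONA_TEMPLATES.items()
    }


prompts = build_prompts(STATEMENT, "politician X")
for name, prompt in prompts.items():
    print(f"[{name}]\n{prompt}\n")
```

Keeping the statement and instruction identical across framings means any change in the model's answer can be attributed to the persona line alone.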