Tutorial on Building Benchmarks for Evaluating Commercial Large Language Models Using Public Data


IMF | Statistics

Benchmarking Commercial Large Language Models
November 21, 2024
Jan Batzner
Weizenbaum Institute, German Internet Institute
Technical University Munich, Grad. Center CIT

Research on Large Language Models

“Since large language models, or LLMs, started to appear in 2017, the share of AI content in patent applications related to algorithmic trading has risen from 19 percent in 2017 to over 50 percent each year since 2020, suggesting a wave of innovation is coming in this area.” (IMF Blog, 2024)

Nassira Abbas, Charles Cohen, Dirk Jan Grolleman, Benjamin Mosk (2024): Artificial Intelligence Can Make Markets More Efficient and More Volatile. International Monetary Fund (IMF) Blog.

Slide labels: Econ, Phenomenon. The same quote is shown again on later slides with the labels Econ and NLP and the caption “Anthropic. Claude 3 Models on B” (truncated in the source).

Language Modeling: What is a LM?

A language model (LM) is a probability distribution over sequences of words.
LMs are generative models.
What we understand as LLMs are autoregressive (AR) language models.

Yann Dubois (2024): CS229 Building Large Language Models (LLMs). Stanford Online Lecture. Recorded Lecture. Slide 5 (Credit: Y. Dubois).

Reinforcement Learning by Human Feedback (RLHF)

Diagram labels: Human Preference Data, reward model, LM policy, label rewards, sample completions, Reinforcement Learning.

Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D. and Christiano, P.F. (2020): Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33, pp. 3008-3021.
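
To make the language modeling slide concrete, the following is a minimal sketch of how an autoregressive model assigns a probability to a word sequence by multiplying next-word probabilities. It is a deliberate simplification: a count-based bigram model conditions only on the previous word, whereas an LLM conditions on the whole preceding context. The function names and the two-sentence corpus are assumptions made for illustration and do not come from the slides.

```python
import math
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count word-to-next-word transitions, using <s> as a start token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts


def sequence_log_prob(counts, sentence):
    """Chain rule: log P(w_1..w_n) = sum over t of log P(w_t | w_{t-1})."""
    tokens = ["<s>"] + sentence.split()
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        seen = counts[prev]
        if seen[cur] == 0:
            return float("-inf")  # unseen transition: probability 0 in this toy model
        total += math.log(seen[cur] / sum(seen.values()))
    return total


# Hypothetical toy corpus, for illustration only.
corpus = ["markets are efficient", "markets are volatile"]
model = train_bigram(corpus)
p_seen = math.exp(sequence_log_prob(model, "markets are volatile"))
p_unseen = math.exp(sequence_log_prob(model, "markets are calm"))
```

In this toy corpus the sentence "markets are volatile" receives probability 1 x 1 x 0.5, while a sentence containing an unseen transition receives probability 0. Real models avoid such zeros by generalizing beyond observed counts.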

What is a benchmark?

Diagram labels: Question-Answering Dataset scraped from Web Sources, Questions, LLM, Answers.
Figure: Performance scores of different LLMs on question-answering benchmarks, in percent. Source: HuggingFace. Open LLM Leaderboard. Web Page.
QA is one of multiple methods for constructing benchmarks.

What is a benchmark?

Question-answering example on microeconomics knowledge in the MMMLU Benchmark (Hendrycks, 2021):

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J. (2021): Measuring Massive Multitask Language Understanding. In International Conference on Learning Representations (ICLR).

LLM Benchmarks: Knowledge and Representation

Knowledge Benchmarks
Testing LLMs' knowledge in various domains with standardized questions.
TriviaQA, StrategyQA, SQuAD, XQuAD, QuAC, HotpotQA

Kraft, Angelie (2024): Whose Knowledge? On Knowledge Gaps in LLM Benchmarks and Their Consequences. Work in Progress.

Representation and Bias Benchmarks
Testing LLMs for various biases and representativeness of sociodemographic groups.
BBQ, UnQover, BOLD, HolisticBias, WinoQueer, PANDA

Gallegos, I.O., Rossi, R.A., Barrow, J., Tanjim, M.M., Kim, S., Dernoncourt, F., Yu, T., Zhang, R. and Ahmed, N.K. (2024): Bias and fairness in large language models: A survey. Computational Linguistics, pp. 1-79.

What are the sources of LLM Benchmarks?

Over-reliance on Wikipedia: 36% of the knowledge benchmarks are based on Wikipedia content (Kraft, 2024).

Building an LLM Benchmark

Batzner, J., Stocker, V., Schmid, S. and Kasneci, G. (2024): GermanPartiesQA: Benchmarking Commercial Large Language Models for Political Bias and Sycophancy. arXiv:2407.18008. Under Review.

Research Questions

RQ1: How do commercial LLMs align with the positions of major German political parties?
RQ2: How does LLM output change with a political persona as a prompted context?

Benchmark: GermanPartiesQA

418 statements, 11 German elections, years 2021-2023.
Example statement: “All highways should have a speed limit”
Log probs returned by the LLM for the example: Agree -0.69, Disagree -1.20, Neutral -1.60.
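
The log probs on the GermanPartiesQA slide can be read as probabilities by exponentiating them. The sketch below does this for the three values shown on the slide and renormalizes over the three answer options. The renormalization step and the function name are assumptions for illustration; the slides do not state how the authors post-process the log probs.

```python
import math


def logprobs_to_probs(logprobs):
    """Exponentiate log-probabilities and renormalize over the answer options."""
    raw = {label: math.exp(lp) for label, lp in logprobs.items()}
    total = sum(raw.values())
    return {label: p / total for label, p in raw.items()}


# Values taken from the slide's example statement.
slide_logprobs = {"Agree": -0.69, "Disagree": -1.20, "Neutral": -1.60}
probs = logprobs_to_probs(slide_logprobs)
top_label = max(probs, key=probs.get)  # option with the highest probability
```

With these values, "Agree" carries roughly half of the probability mass, "Disagree" about 30 percent, and "Neutral" about 20 percent, so the model's most likely answer to the example statement is "Agree".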

Prompt Design

[Figure: Batzner et al. (2024)]

Results: Model Comparison

[Figure: Batzner et al. (2024)]

Results: Prompt Experiments

Prompt Experiment: Prompting a political persona description as “I am politician X” and “You are politician X” changes the LLM alignment with official political party positions.

Building LLM Benchmarks for Economic Research

Knowledge Benchmarks
Testing LLMs for economic knowledge.

Representation and Bias Benchmarks
Testing LLMs for representativeness of socioeconomic groups.

Data sources named on the slide: Academic Tests, Encyclopedic Knowledge, Expert Surveys, Survey Data, Voting Advice Applications, Interview Data.
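
For the knowledge-benchmark side, a question-answering benchmark score "in percent" can be computed as plain accuracy over multiple-choice items. The following is a minimal sketch under stated assumptions: the `QAItem` structure, the two economics questions, and the stand-in `always_a` answer function are hypothetical and only illustrate the scoring logic. A real evaluation would replace `always_a` with a call to the model under test.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAItem:
    question: str
    choices: List[str]
    answer: str  # gold label, e.g. "A"


def accuracy_percent(items: List[QAItem], answer_fn: Callable[[QAItem], str]) -> float:
    """Share of items where the model's label matches the gold label, in percent."""
    if not items:
        raise ValueError("benchmark has no items")
    correct = sum(1 for item in items if answer_fn(item).strip().upper() == item.answer)
    return 100.0 * correct / len(items)


# Hypothetical items, written for this sketch only.
ITEMS = [
    QAItem(
        "If the price of a good rises and nothing else changes, the quantity demanded usually:",
        ["A. rises", "B. falls"],
        "B",
    ),
    QAItem(
        "A market with a single seller is called a:",
        ["A. monopoly", "B. monopsony"],
        "A",
    ),
]


def always_a(item: QAItem) -> str:
    """Stand-in for a model call: always answers 'A'."""
    return "A"


score = accuracy_percent(ITEMS, always_a)
```

The stand-in answers one of the two items correctly, so the score is 50.0. Comparing models then amounts to running the same items through different answer functions.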

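For the representation and bias side, the prompt experiment described in the results can be sketched as two steps: prepend a persona sentence to each statement, then measure how often the model's answers match a party's positions. The prompt wording, the function names, the placeholder party positions, and the second statement are all assumptions for illustration; the exact prompts used in GermanPartiesQA are not given in the slide text.

```python
from typing import Dict


def build_prompt(statement: str, persona: str = "", framing: str = "none") -> str:
    """Build a prompt with an optional 'I am X' or 'You are X' persona prefix."""
    if framing == "i_am":
        prefix = f"I am {persona}. "
    elif framing == "you_are":
        prefix = f"You are {persona}. "
    elif framing == "none":
        prefix = ""
    else:
        raise ValueError(f"unknown framing: {framing}")
    return f"{prefix}Statement: {statement}\nAnswer with Agree, Disagree, or Neutral."


def alignment_share(party_positions: Dict[str, str], model_answers: Dict[str, str]) -> float:
    """Fraction of shared statements where the model's answer equals the party position."""
    shared = party_positions.keys() & model_answers.keys()
    if not shared:
        raise ValueError("no statements in common")
    matches = sum(party_positions[s] == model_answers[s] for s in shared)
    return matches / len(shared)


STATEMENT = "All highways should have a speed limit"
prompt = build_prompt(STATEMENT, persona="politician X", framing="you_are")

# Placeholder positions and answers for a hypothetical party, not real data.
party = {STATEMENT: "Agree", "Hypothetical statement 2": "Disagree"}
answers = {STATEMENT: "Agree", "Hypothetical statement 2": "Neutral"}
share = alignment_share(party, answers)
```

Running the same statements with `framing="none"`, `"i_am"`, and `"you_are"` and comparing the resulting alignment shares is one way to quantify how much a prompted persona shifts the model toward a party's positions.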