Confidential

Predictable Scaling and Infrastructure
Trevor Cai

Predictable Scaling

The What and How of ChatGPT
What
- Collect a dataset of text, code, images, audio, math.
- Pre-train a model to predict the next word.
- Post-train it to e.g. follow instructions, be conversational, use tools.
How
- Synchronous SGD of a transformer on a large cluster of accelerators.
- Combine many forms of data and model parallelism.
- Reinforcement learning from human feedback.

The Sentiment Neuron (2017)
- Trained a character-level neural network on product reviews.
- Observed: there is a neuron encoding sentiment!
- Results in state-of-the-art sentiment analysis.
Source: Learning to Generate Reviews and Discovering Sentiment, Radford et al. (2017)

Prediction is Compression
- Next word prediction models the underlying generative process.
- If the data is the internet, the underlying generative process is the world.
- Grandiose. But theoretically justified via Solomonoff Inductive Inference.
Source: Learning to Generate Reviews and Discovering Sentiment, Radford et al. (2017)

Returns to Scale (2018-2023)
Release             New Behaviors
GPT-1, June 2018    State-of-the-art language understanding (using task-specific fine-tuning).
GPT-2, Feb 2019     Coherent text generation and zero-shot transfer.
GPT-3, May 2020     In-context learning.
GPT-4, Mar 2023     Actually being useful.

Scale Works (2023)
This example required GPT-4 to:
- Understand both English and French.
- Interpret a diagram in context of the text.
- Solve a physics problem!
Source: GPT-4 Technical Report, OpenAI (2023)

Predictable Scaling
Source: GPT-4 Technical Report, OpenAI (2023)

Predictable Scaling of Practical Capabilities
Source: GPT-4 Technical Report, OpenAI (2023)

What Log-Log Plots Obscure
Source: How predictable is language model benchmark performance?, Owen (2024)

Recap: Scaling Laws for AI Models
1. Next-word prediction is meaningful.
2. There are returns to scale...
3. ...which are predictable and extrapolative(!)

Implications for Infrastructure

Industry Compute Trends
Source: Training Compute of Frontier AI Models Grows by 4-5x per Year, Sevilla and Roldán (2024)

OpenAI Compute Trends
Source: Microsoft Build 2024

Industry Compute Trends
Source: Training Compute of Frontier AI Models Grows by 4-5x per Year, Sevilla and Roldán (2024)

Inference Demand is Driven By Intelligence

The Bull Case for AI Compute
1. Compute scaling has been predictable and looks to continue.
2. Intelligence drives inference demand.
3. Technology and economics are ripe for scale this decade.

"Sometimes lines really do go up"
Source: Photovoltaic growth: reality versus projections of the International Energy Agency, Hoekstra (2018)

"Sometimes lines really do go up"
Source: What is Moore's Law?, Roser and Ritchie (2020)

Design for Mass Deployment

Cluster-Level RAS
- Optics MTBF alone is measured in minutes.
- Not to mention HBM DUE, board failures, etc.
- SDCs: disturbingly common, and sometimes unreproducible.
- Failures have a very wide blast radius.

Cluster-Level RAS
- Minimize cost of repair.
  - Exception → Process Restart → GPU Reset → Node Reboot → RMA.
  - Example: Ideally, a failed write over scale-up is a catchable exception.
- Minimize blast radius.
  - Example: Link flaps on one port should not affect neighboring ports.
  - Example: Ideally, an uncorrectable memory error only affects its own GPU, even in the presence of a coherent memory fabric.

Cluster-Level RAS
- Consider graceful degradation.
  - Some failures are more worth technician time than others.
  - Example: Disable faulty banks of second-tier memory instead of requiring RMA.
- Validation must be automated, fast, extensive, and performable in-field.
  - Example: In-depth correctness checks after SDC accusation.

Power Management
- Power bottlenecks mean we need to make the most of the power we have.
- Synchronized training steps result in power draw jitter.
- Need: Low-latency power telemetry and OOB power management.
- Want: Dynamic power sloshing.

Takeaways
1. Predictable scaling motivates rapidly expanding AI training compute usage.
2. Delivering AI to the world will demand massive infrastructure buildout.
3. Design for mass deployment.
4. Performance is only one of many requirements.

Thank You!
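
Backup: a minimal sketch of what "predictable and extrapolative" means in practice. The idea is to fit a power law to the final losses of a few small training runs, then extrapolate it to a much larger compute budget. Everything below is an assumption made for illustration: the data points are synthetic (generated from a made-up law, not measured results), the function names are hypothetical, and the pure form loss = a * compute^b is a simplification that leaves out the irreducible-loss offset that published fits often include.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss = a * compute**b by ordinary least squares in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope on the log-log plot
    a = math.exp(mean_y - b * mean_x)   # intercept, mapped back to linear space
    return a, b


def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a given compute budget."""
    return a * compute ** b


# Synthetic "small runs": generated from a made-up law, NOT real measurements.
TRUE_A, TRUE_B = 10.0, -0.05
small_compute = [1e18, 1e19, 1e20, 1e21]   # hypothetical training FLOPs
small_loss = [TRUE_A * c ** TRUE_B for c in small_compute]

a, b = fit_power_law(small_compute, small_loss)

# Extrapolate three orders of magnitude beyond the largest small run.
target_compute = 1e24
predicted = predict_loss(a, b, target_compute)
print(f"fitted exponent b = {b:.4f}")
print(f"predicted loss at {target_compute:.0e} FLOPs = {predicted:.4f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the exponent and the extrapolation is exact. Real runs are noisy, and a straight line on a log-log plot can hide errors that look large in linear units, which is the caveat behind the "What Log-Log Plots Obscure" slide.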