HotChips_AI_Keynote_V1.5_Final.pdf (87 pages)
Beyond Compute: Enabling AI Through System Integration

Computing: Input Data -> Processing -> Useful Outputs
Pictured: ENIAC, Calculator, Personal Computer, Cray-1, Laptop Computer, Datacenters, Smart Phone Computer, Consoles, Training Datacenters, Training Server, IoT Computer, FSD Computer, Human Computer, Abacus, Diff Engine
Source: Google Images

Big Data - Data Explosion
[Chart: Vol of Data (Zettabytes), 2010-2025; legend: Projections In 2011, Reality + New Projections]
Data labels by year: 2010: 2; 2011: 5; 2012: 6.5; 2013: 9; 2014: 12.5; 2015: 15.5; 2016: 18; 2017: 26; 2018: 33; 2019: 41; 2020: 64.2; 2021: 79; 2022: 97; 2023: 120; 2024: 147; 2025: 181

Corporate Data Type Changes
More than 80% of data is unstructured in nature.
[Chart: Worldwide Corporate Data Growth, in exabytes (0 to 45,000), 2010-2020; series: Structured Data, Unstructured Data]
80% of data growth is unstructured.
Products & Profits from Data Processing:
- New types of processing: Machine Learning (ML)/AI (Software 2.0)
- Traditional processing: Algorithmic codes (Software 1.0)
Source: IDC

“The history of numerical weather prediction and climate simulation is almost exactly coincident with the history of digital computing itself.”
- V. Balaji, Climbing down Charney's ladder: Machine Learning and the post-Dennard era of
computational climate science

Climate Super-Computing Architectures Over Time
[Chart: History of GFDL Computing: Growth of Computational Power with Time. x-axis: Years of Delivery, 1950-2020; y-axis: Log(Computer Power); architecture eras: Scalar, Vector, Parallel Vector, Scalable]
Machines plotted: IBM 701, IBM 704, IBM 7090, IBM 7030/Stretch, CDC 6600, UNIVAC 1108, IBM 360/91, IBM 360/195, TI ASC, CDC CYBER 205 (2X), CRAY Y-MP, CRAY C90, T3E, CRAY T90/16, CRAY T90/24, CRAY T90/30, SGI CLUSTER/01, SGI CLUSTER/02, SGI CLUSTER/04, R&D HPCS, R&D HPCS UPGRADE, NCAA ORNL, NCAA ORNL UPGRADE
Milestones annotated on the chart:
- First simulation of water, cloud-radiation, ice albedo feedbacks
- First GFDL Hurricane Model
- Coupled model on which first IPCC was based
- First simulation of chemistry-transport-radiation of Antarctic Ozone Hole
- First estimate of the effects of 2x CO2
- GFDL Hurricane Model goes operational
- Interactive atmospheric chemistry and aerosols
- Simulation and prediction of Category 4-5 hurricanes
- Projections of fish stock under climate change
- FV3 models used both for climate studies and US operational weather forecasting
- Coupled ENSO forecasts
- Attribution of role of Ozone depletion in climate change
History of computational power at the NOAA Geophysical Fluid Dynamics Laboratory. Computational power is measured in aggregate floating point operations per second.
Source: V. Balaji, Princeton-NOAA: Climbing down Charney's ladder: Machine Learning and the post-Dennard era of computational climate science

“this represents a sea change in computational Earth system science that rivals the von Neumann revolution.”
- On the increase of Machine Learning techniques in climate computing. As quoted in
Machine Learning and the post-Dennard era of computational climate science

[Chart repeated; labels: ML/AI, Scalar, Vector, Parallel Vector, Scalable]

Traditional Data - New Processing Methods
Climate AI, a pioneer in applying artificial intelligence to climate risk modeling, today announced its team has solved a critical weather forecasting challenge.
Leveraging advances in AI to improve weather and climate forecasts.
“Artificial intelligence and machine learning breakthroughs are changing weather forecasting, and resource-heavy regional weather models might soon be completely replaced by machine learning approaches.”
- Dr. Stephan Rasp, Lead Data Scientist, Climate AI
Source: Climate AI
Source: Los Alamos National Lab

Real World Data
Only Machine Learning techniques can enable these.

Exploding ML Use Cases
New Data Types; Traditional Data; Real World Data

Computing Architecture Categories
- Accelerators: Strictly Structured
- Algorithmic Computers: Semi-Structured
- Learning Computers: Unstructured/Any type
Input Data -> Processing -> Useful Outputs

Processing for Learning: AI-ML-DL
- AI: Tasks requiring near human intelligence in real world settings
- ML: Subset of AI for specific tasks by learning from data & making predictions
- DL: Subset of ML using deep neural network architectures
[Diagram: Artificial Intelligence, Machine Learning, Deep Learning. Image Courtesy: Buffaloboy]

Why Do We Need Different Compute Platforms?
- Traditional Computers: Input Data + Program Logic -> Useful Outputs
- Learning Computers, Training: Input Data + Output Data -> Trained Logic
- Learning Computers, Inference: Input Data + Trained Logic -> Useful Outputs

Exponential Rate Gaps in Training Systems
[Chart: AI Model Compute Needs (Log Scale) vs Moore's Law; annotation: Huge Gap]
Source: OpenAI (https:/

Designing Solutions for AI Level Needs
AI Adept Systems: Datasets, Models, Compute, Scale, SW, HW

Typical ML Training Flows (AI Training Systems)
- Define goals
- Dataset(s)
- Model(s)
- Desired outputs
- Iterative flows
High Manual Effort; Human Reviews

2D Image Labeling of Real World Inputs

4D - Space + Time Labeling
- 100X Labeling Throughput
- Label Once, Simultaneously Labels All Cameras at Many Frames

Data Labeling Growth
- 1,000-Person In-House Data Labeling Team
- Fully Custom Built Data Labeling & Analytics Infrastructure
[Chart: #labels vs time]

Source: MetaAI - How
Facebook Annotates Multimodal Training Data for ML

Other Frameworks
“We refer to this framework as Human-AI loop (Halo) ... researchers can streamline annotation tasks, visualize the results and accuracy metrics of annotations, and export the annotations to start their training modules.”
- MetaAI

Historical Parallels
“Once the human element is eliminated, the increase in speed is enormous”
- Alan Turing (as stated in his report from the early 1940s)
Led to the advent of programmable computers: Colossus/ENIAC/ACE
Source: Jack Copeland, Alan Turing's Electronic Brain

How to do this for AI?
Chicken and Egg Problem? Solution in ML/AI Space Itself!
- Recursive Loops
- Use of Offline Models for Real World Dataset Curation/Labeling With Reduced Human Loop Dependence
[Chart: Network Complexity (1 to 10^4), Resnet vs Automation N/Ws; annotation: Many Orders of Magnitude]

Dataset Labeling Automation
[Diagram: Autolabeling. Clip (Videos, GPS, IMU, Odometry) -> Offline Neural Networks -> Ego Trajectory & Static World Reconstruction, Moving Objects & Kinematics -> Labels]

Ask the Test Fleet for Interesting Clips
[Query shown on the slide; brackets and quoting were lost in extraction]
  name: cipv-low-vis
  requester: img-vid-cipv-low-vis-seq
  description: Low visibility with a CIPV
  query: $and: $eq: $decimate: $conv: $and:
    $eq: active-gear, 4    // In drive
    $not: VisionSceneTags.main.scene_tag_array13.activated    // GARAGE_DOOR_CLOSED
    $not: VisionSceneTags.main.scene_tag_array15.activated    // INDOOR
    $gt: TelemetryOutput.distance_travelled_m, 1000
    $not: lss_app.right_lane.lane_change    // No right lane change
    $not: lss_app.left_lane.lane_change    // No left lane change
    $not: moving_object_output0.cutin_active_in_scene    // No cutin
    $lt: moving_object_output0.max_region_tag_cutin_prob, 0.1
    $lt: moving_object_output2.max_region_tag_cutin_prob, 0.1
    $gt: veh-speed-mps, 2.2    // 5 mph
    h: 1,1,1,1,1,1,1,1,1,1    // 10s
    N: 50    // 1s period
    stateless-child: true
    10

And the Test Fleet Giveth Back
10k Such Clips Collected & Automatically Labelled Within a Week

Investments for Offline Dataset Speedup
[Chart: Labeling Effort vs Time; curves: Manual, Automation; annotation: More ML Resources]

Video Training Modules
- Recurrent Neural Net
- Transformer (MHSA with Key, Query, Value; Read out Token)
- 3D CONV
Tensor shapes shown for each module: 20 x 80 x 300 x 12 (W x H x C x T) and 20 x 80 x 300 (W x H x C)

Many More in Research
Knowledge graphs; Semantic networks; ART Networks; Multi modal AI; BigGAN; Transfer Learning; Datacentric AI
Image Source: Google Images

Exponential Rate Gaps in Training Systems: The Reality
Source: OpenAI (https:/
[Chart: AI Model Compute Needs (Log Scale), 1.E+00 to 1.E+07, 2012-2019]
Growth in effective compute:
- Moore's Law: 8x
- Algorithmic efficiency: 25x
- Scale & $s: 37,500x
[Chart: Effective Compute Over Time, 1.E+00 to 1.E+07, 2012-2019; annotations on the repeated chart: Scale Up, Flexibility, Scale Out]
Source: D. Hernandez, T. Brown, OpenAI: Measuring the Algorithmic Efficiency of Neural Networks

Climbing Up to ML
Big Data; Big Compute; Big Models
2018-Now: Machine Learning

Climbing Up to AI
Big Data; Big Compute; Big Models; Better Compute; Better Models; Realworld Data

AI System Traits
- Flexible Compute
- Real world datasets
- Gigantic models
- Huge Scale out
- Real time performance
- Feed the beast

Traditional Hierarchies
Chip; Package; Boards; Boxes; Racks; Datacenter/Buildings
Example Hierarchy. Source: Graphcore. Source: Nvidia

Traditional BW & Latency Scaling Discontinuities
[Chart: BW/Latency, orders of magnitude (10 to 10,000), across Chip, Package, System]

Traditional Hierarchy Power
[Chart: pJ per 64-bit operation (1 to 10,000) for DP Flop, Register, 1 mm on-chip, 5 mm on-chip, 15 mm on-chip, Off-chip/DRAM, Local Interconnect, Cross System; series: 2008 (45 nm), 2018 (11 nm)]
Source: Kogge & Shalf, article in Computing in Science & Engineering

Mitigation for Integration Hierarchy Discontinuities
- Reticle Sized Dies
- 2-3x Reticle Sized Interposers/EMIBs/MCMs

Power Trend
[Chart: GPU TDP Trend, TDP (W) from 250 to 700, 2014-2022]

Cooling Difficulties
[Chart: cooling difficulty, Inverse Thermal Resistance vs Year of Introduction; 2000-2010 Single Core Trend; 2011-2017 Multi-Core Trend; 2018-2025 Power War Trend; labels: 3U (135mm), 8U (356mm), 2016, 2022]
Source: White-paper on Emergence and Expansion of Liquid Cooling in Mainstream Data Centers (https:/

Lateral Power Delivery Challenges
[Chart: Process Vmin (V), 0.50 to 1.00, 2013-2022]
Source: https:/ Xu, Power Delivery in High Current 3-D Systems

Tech Scaling Gaps
[Chart: Performance/Capabilities vs Time for Compute, Memory, Comm., Storage; regions added across builds: Edge Compute, Traditional Systems, AI Training]

Scaling Gaps for
Efficient Scaleout
- BW & Latency Losses
- Power To Traverse Hierarchies
- Device vs I/O Scaling
- Integration Platform Constraints
- Cooling Needs
- Power Delivery
Need Integrated Solutions With the Whole System in Mind!
Discrete/Chip Centric Approaches Are Severely Limiting Value Propositions

Datacenter Arch Evolution
[Diagram sequence, repeated across builds; blocks: Sys Mem, CPUs, PCIe SW, PCIe Slots, Acc, Mem, Switch]

Beyond Compute - Communication Focus
[Diagram labels, in slide order: Internet; Optimized Latencies, Packet Sizes, Bandwidth; Legacy Support Baggage; HPC/CPU Centric; Rack-Rack; Optimized Latencies, Bandwidth, Packet Sizes; AI Centric]
[Diagram labels: H100 (x9), NV Switch, NVLink-Network OSFP Connectors, NVLink Cables, Gen-4 NV Link, PCIe Gen 5 x16]
Source: https:/ spring conference 2021

Next Generation ML Infrastructure for Large Model Training
TPU v4 chips are networked together into a Cloud TPU v4 pod by ultra-fast interconnect that provides 10x the bandwidth per chip at scale compared to typical GPU-based large scale training systems. Large models are very communication intensive: local computation often depends on results from remote computation that are communicated across the network. TPU v4's ultra-fast interconnect has an outsized impact on computational efficiency of large models by eliminating latency and congestion in the network.

[Diagram labels: Tensix core, Ethernet, NoC, Integration, Wormhole chip]

Accelerating the AI Accelerators
[Chart: Achieved Performance vs Time; Rate of Change; curves: Typical HPC Improvements; Traditional Integration, Comm N/W, Cooling et al]

Numerous Opportunities To Do Better/Quicker
- Wider aperture beyond chips and into systems
- Reduce the drag coefficient from the traditional hierarchies
- Clean abstraction exists from frameworks to underlying HW
- Flexible vs fixed ratios of Compute-Memory-I/Os by disaggregation
- Concentrate on the full solution stack

Can We Do Better?
What would you do if designing from first principles for AI?

Dojo D1 Chip
- 645 mm2
- 7 nm Technology
- 50 Billion Transistors
- 11+ Miles Of Wires
- 362 TFLOPs BF16/CFP8
- 22.6 TFLOPs FP32
- 10 TBps/dir. On-Chip Bandwidth
- 4 TBps/edge Off-Chip Bandwidth
- 400 W TDP

Dojo Unique Innovation: Hierarchies
Silicon Wafer; Known Good Dies; Test & Sort; Pkg; PCB

Dojo
Unique Innovation: Flattened Hierarchies
Silicon Wafer; Known Good Dies; Test & Sort; New Integration: Reconstructed Fanout Wafer
- 9 PFLOPs (BFP16/CFP8)
- 40 TB/s Bisec BW (X+Y)
- 36 TB/s I/O BW

Vertical Power Delivery
[Chart: Impedance vs Frequency; curves: Vertical, Lateral]

Training Tile
- 25x D1 Dies
- 9 PFLOPs
- 36 TB/s I/O BW

Datacenter/Buildings

Disaggregated Scalable System
Compute; I/O; Memory
Tile; Interface Processor; Network Interface

Feeding the Beast(s)
[Diagram, repeated; blocks: Sys Mem, CPUs, PCIe SW, Acc, Mem, Switch]

Traditional ML Model Fitting
Models Have To Fit Into Each Accelerator

Traditional ML Gigantic Model Fitting Issues
- Spilling to System Memory or Local Pool of DRAMs
- Can't Assume Future Models Will Fit Into One Accelerator

Gigantic Models
Designed from the get-go to split models across multiple Chips/Tiles
[Diagram label: Datacenter Network]
Source: Google Research: PaLM: Scaling Language Modeling with Pathways

AI Focussed System Architectures
“A large body of programming must be completed beforehand, if any serious work is to be done on the machine when it is made.”
- Alan Turing
Source: http:/
Ease of programmability is essential: Fully Flexible, Compiler Friendly Yet High Perf Architectures

Software Stack
[Diagram labels: Neural Net Models; PyTorch-Extension (Dojo Extension); Dojo Compiler Engine (JIT NN Compiler, LLVM Backend); LLVM; Dojo Drivers; Multi-Host, Multi Partition Management; Dojo Interface Processors (Ingest & Shared Mem); PCIe; Serdes; ExaPOD]

HW Help for SW Stacks
- Compiler Friendly ISAs
- HW Sync/Barriers
- Flexible State Machines for ML Layers
- Fire and Forget Communication Protocols
- Fault Tolerance

Software Stack Opportunities
- Take Advantage of Clean Abstraction Layers for ML
- Need Changes for Massively Parallel Architectures
- More Compilers, Fewer Kernels
- Need Renewed Focus on Distributed Compiler Technology
- Reduced OS Roles

Beyond Compute: Many Aspects - One Goal
- Scaleout: Seamless Scale Out Fabric
- Communication: Terabytes per Sec
- Memory: GBs/TBs
- Disaggregation: Ratios Move With Workloads
- Compiler Technologies: Truly Distributed Compilers
- Networking Topologies: TTP/TTPoEs
- Framework Enhancements: Dojo-PyTorch.
- DataTypes: CFP Formats for Efficiency
- Compute: CPU+GPU+NPU+NNA

Beyond Compute: Scaleout From ML to AI!
ExaPod
- Job Specific Sizing of Resources
- Tightly Integrated Yet Disaggregated System

Innovation Opportunities Beyond Compute for AI Systems
- Architectures: Scale Out Focused Parallel HW Architectures
- Integration: Reduce Traditional Hierarchy Imposed Tax on Performance and Power
- Disaggregation: Alterable Ratios of Compute/Memory/Comm./Storage
- Abstractions: Taking Advantage of Clean Abstraction Layers of Frameworks
- Algorithms: Flexibility of Compute To Adapt to New Algorithms and Workloads
- Compilers: Explore/Revive Distributed Compiler Technologies
Design Approach of Exploring the Full Solution Space Across System and Software

Next Phase in Computing Evolution: INTEGRATED SYSTEMS
[Chart: 122 Years of Moore's Law; annotation: Not on Same Scale As the Above Chart]

Learn more at T You!
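
A quick arithmetic check on two figures quoted in the slides above, offered only as a minimal sketch. It assumes the three effective-compute growth factors (Moore's Law 8x, algorithmic efficiency 25x, scale and $s 37,500x) compound multiplicatively, and that the Training Tile's BF16/CFP8 throughput is simply the per-die D1 figure summed over 25 dies. The variable names are illustrative and do not come from the talk.

```python
# Growth factors as quoted on the effective-compute slide.
MOORES_LAW_GAIN = 8
ALGORITHMIC_EFFICIENCY_GAIN = 25
SCALE_AND_SPEND_GAIN = 37_500

# Assumption: the three factors multiply to give total effective-compute growth.
effective_compute_gain = (
    MOORES_LAW_GAIN * ALGORITHMIC_EFFICIENCY_GAIN * SCALE_AND_SPEND_GAIN
)

# Per-die and per-tile figures as quoted on the Dojo slides.
D1_TFLOPS_BF16_CFP8 = 362
DIES_PER_TRAINING_TILE = 25

# Assumption: tile throughput is the plain sum over its dies (TFLOPs -> PFLOPs).
tile_pflops = D1_TFLOPS_BF16_CFP8 * DIES_PER_TRAINING_TILE / 1000

if __name__ == "__main__":
    print(f"Effective compute growth: {effective_compute_gain:,}x")
    print(f"Training Tile throughput: {tile_pflops:.2f} PFLOPs")
```

Under these assumptions the product comes to 7,500,000x, in line with the 1.E+07 top of the chart's axis, and the tile figure comes to 9.05 PFLOPs, consistent with the quoted 9 PFLOPs.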