HotChips_AI_Keynote_V1.5_Final.pdf

PDF, 87 pages, 79.85 MB

Beyond Compute: Enabling AI Through System Integration

Computing
Input Data -> Processing -> Useful Outputs
Examples shown: ENIAC, Calculator, Personal Computer, Cray-1, Laptop Computer, Datacenters, Smart Phone Computer, Consoles, Training Datacenters, Training Server, IoT Computer, FSD Computer, Human Computer, Abacus, Diff Engine
Source: Google Images

Big Data - Data Explosion
Chart: Vol of Data (Zettabytes), 2010 to 2025. Legend: Projections In 2011; Reality + New Projections.
Data labels by year:
2010: 2, 2011: 5, 2012: 6.5, 2013: 9, 2014: 12.5, 2015: 15.5, 2016: 18, 2017: 26,
2018: 33, 2019: 41, 2020: 64.2, 2021: 79, 2022: 97, 2023: 120, 2024: 147, 2025: 181

Corporate Data Type Changes
- More than 80% data is unstructured in nature
- 80% of Data Growth is Unstructured
- Traditional Processing: Algorithmic Codes (Software 1.0)
- New Types of Processing: Machine Learning (ML)/AI (Software 2.0)
- Products & Profits from Data Processing
Chart: Worldwide Corporate Data Growth (Exabytes, 0 to 45,000), 2010 to 2020, Structured Data vs Unstructured Data
Source: IDC

“The history of numerical weather prediction and climate simulation is almost exactly coincident with the history of digital computing itself.”
- V. Balaji, Climbing down Charney's ladder: Machine Learning and the post-Dennard era of computational climate science

Climate Super-Computing Architectures Over Time
Chart: History of GFDL Computing, Growth of Computational Power with Time. X axis: Years of Delivery (1950 to 2020). Y axis: Log(Computer Power), 10^1 to 10^10.
Architecture eras: Scalar, Vector, Parallel Vector, Scalable, ML/AI
Machines: IBM 701, IBM 704, IBM 7090, IBM 7030/Stretch, CDC 6600, UNIVAC 1108, IBM 360/91, IBM 360/195, TI ASC, CDC CYBER 205 (2X), CRAY Y-MP, CRAY C90, T3E, CRAY T90/16, CRAY T90/24, CRAY T90/30, SGI CLUSTER/01, SGI CLUSTER/02, SGI CLUSTER/04, R&D HPCS, R&D HPCS UPGRADE, NCAA ORNL, NCAA ORNL UPGRADE
Milestones:
- First simulation of water, cloud-radiation, ice albedo feedbacks
- First GFDL Hurricane Model
- Coupled model on which first IPCC was based
- First simulation of chemistry-transport-radiation of Antarctic Ozone Hole
- First estimate effects of 2x CO2
- GFDL Hurricane Model goes operational
- Interactive atmospheric chemistry and aerosols
- Simulation and prediction of Category 4-5 hurricanes
- Projections of fish stock under climate change
- FV3 models used both for climate studies and US operational weather forecasting
- Coupled ENSO forecasts
- Attribution of role of Ozone depletion in climate change
Caption: History of computational power at the NOAA Geophysical Fluid Dynamics Laboratory. Computational power is measured in aggregate floating point operations per second.
Source: V Balaji, Princeton-NOAA: Climbing down Charney's ladder: Machine Learning and the post-Dennard era of computational climate science

“this represents a sea change in computational Earth system science that rivals the von Neumann revolution.”
- On the Increase of Machine Learning techniques in climate computing. As quoted in Machine Learning and the post-Dennard era of computational climate science

Traditional Data - New Processing Methods
- Climate AI, a pioneer in applying artificial intelligence to climate risk modeling, today announced its team has solved a critical weather forecasting challenge.
- Leveraging advances in AI to improve weather and climate forecasts.
“Artificial intelligence and machine learning breakthroughs are changing weather forecasting, and resource-heavy regional weather models might soon be completely replaced by machine learning approaches.”
- Dr. Stephan Rasp, Lead Data Scientist, Climate AI
Source: Climate AI
Source: Los Alamos National Lab

Exploding ML Use Cases
- Traditional Data
- New Data Types
- Real World Data: Only Machine Learning techniques can enable these

Computing Architecture Categories
- Accelerators: Strictly Structured
- Algorithmic Computers: Semi-Structured
- Learning Computers: Unstructured/Any type

Processing for Learning: AI-ML-DL
- AI (Artificial Intelligence) - Tasks requiring near human intelligence in real world settings
- ML (Machine Learning) - Subset of AI for specific tasks by learning from data & making predictions
- DL (Deep Learning) - Subset of ML using Deep neural network architectures
Image Courtesy: Buffaloboy

Why Do We Need Different Compute Platforms?
- Traditional Computers: Input Data + Program Logic -> Useful Outputs
- Learning Computers, Training: Input Data + Output Data -> Trained Logic
- Learning Computers, Inference: Input Data + Trained Logic -> Useful Outputs

Exponential Rate Gaps in Training Systems
Chart: AI Model Compute Needs (Log Scale) vs Moore's Law, with a "Huge Gap" between the two
Source: OpenAI (https:/

Designing Solutions for AI Level Needs
AI Adept Systems / AI Training Systems: Datasets, Models, Compute, Scale, SW, HW

Typical ML Training Flows
- Define goals
- Dataset(s)
- Model(s)
- Desired outputs
- Iterative flows
- High Manual Effort
- Human Reviews

Datasets
- 2D Image Labeling of Real World Inputs
- 4D - Space + Time Labeling
- 100X Labeling Throughput
- Label Once, Simultaneously Labels All Cameras at Many Frames
- 1,000-Person In-House Data Labeling Team
- Fully Custom Built Data Labeling & Analytics Infrastructure
Chart: Data Labeling Growth (#labels vs time)

Other Frameworks
“We refer to this framework as Human-AI loop (Halo.) researchers can streamline annotation tasks, visualize the results and accuracy metrics of annotations, and export the annotations to start their training modules.”: MetaAI
Source: MetaAI - How Facebook Annotates Multimodal Training Data for ML

Historical Parallels
“Once the human element is eliminated, the increase in speed is enormous”
- Alan Turing (as stated in his report from early 1940s)
- Led to Advent of programmable computers: Colossus/Eniac/ACE
- How to do this for AI?
Source: Jack Copeland, Alan Turing's Electronic Brain

Solution in ML/AI Space Itself!
- Chicken and Egg Problem?
- Recursive Loops
- Use of Offline Models for Real World Dataset Curation/Labeling With Reduced Human Loop Dependence
Chart: Network Complexity (1, 10^1, 10^2, 10^3, 10^4), Resnet vs Automation N/Ws: Many Orders of Magnitude

Dataset Labeling Automation
Autolabeling: Clip (Videos, GPS, IMU, Odometry) -> Offline Neural Networks -> Ego Trajectory & Static World Reconstruction, Moving Objects & Kinematics -> Labels

Ask the Test Fleet for Interesting Clips
Clip query shown on the slide (brackets and quoting were lost in extraction):
name: cipv-low-vis
requester: img-vid-cipv-low-vis-seq
description: Low visibility with a CIPV
query: $and: $eq: $decimate: $conv: $and:
  $eq: active-gear, 4, / In drive
  $not: VisionSceneTags.main.scene_tag_array13.activated, / GARAGE_DOOR_CLOSED
  $not: VisionSceneTags.main.scene_tag_array15.activated, / INDOOR
  $gt: TelemetryOutput.distance_travelled_m, 1000,
  $not: lss_app.right_lane.lane_change, / No right lane change
  $not: lss_app.left_lane.lane_change, / No left lane change
  $not: moving_object_output0.cutin_active_in_scene, / No cutin
  $lt: moving_object_output0.max_region_tag_cutin_prob, 0.1,
  $lt: moving_object_output2.max_region_tag_cutin_prob, 0.1,
  $gt: veh-speed-mps, 2.2 / 5 mph
  h: 1,1,1,1,1,1,1,1,1,1 / 10s
  N: 50, / 1s period
  stateless-child: true, 10,

And the Test Fleet Giveth Back
- 10k Such Clips Collected & Automatically Labelled Within a Week

Investments for Offline Dataset Speedup
Chart: Labeling Effort vs Time, Manual vs Automation
- More ML Resources

Models
Video Training Modules
- Recurrent Neural Net: 20 x 80 x 300 (W x H x C), 20 x 80 x 300 x 12 (W x H x C x T)
- Transformer: 20 x 80 x 300 x 12 (W x H x C x T), Read out Token, 20 x 80 x 300 (W x H x C), MHSA (Key, Query, Value)
- 3D CONV: 20 x 80 x 300 (W x H x C), 20 x 80 x 300 x 12 (W x H x C x T)

Many More in Research
- Knowledge graphs
- Semantic networks
- ART Networks
- Multi modal AI
- BigGAN
- Transfer Learning
- Datacentric AI
Image Source: Google Images

Exponential Rate Gaps in Training Systems: The Reality
Source: OpenAI (https:/
Chart: AI Model Compute Needs (Log Scale); Growth in Effective Compute (1.E+00 to 1.E+07), 2012 to 2019
- Moore's Law: 8x
- Algorithmic Efficiency: 25x

- Scale & $s: 37,500x
- Overlay labels: Scale Up, Flexibility, Scale Out
Source: D Hernandez, T Brown, OpenAI: Measuring the Algorithmic Efficiency of Neural Networks, Effective Compute Over Time

Climbing Up to ML
- Big Data, Big Compute, Big Models
- 2018-Now: Machine Learning

Climbing Up to AI
- Big Data, Big Compute, Big Models
- Better Compute, Better Models, Realworld Data

AI System Traits
- Flexible Compute
- Real world datasets
- Gigantic models
- Huge Scale out
- Real time performance
- Feed the beast

Traditional Hierarchies
Chip, Package, Boards, Boxes, Racks, Datacenter/Buildings
Example Hierarchy. Source: Graphcore. Source: Nvidia.

Traditional BW & Latency Scaling Discontinuities
Chart: BW/Latency Orders of Magnitude (10, 100, 1000, 10000) across Chip, Package, System

Traditional Hierarchy Power
Chart: pJ per 64-bit operation (1 to 10,000, log scale), 2008 (45 nm) vs 2018 (11 nm), for DP Flop, Register, 1 mm on-chip, 5 mm on-chip, 15 mm on-chip, Off-chip/DRAM, Local Interconnect, Cross System
Source: Kogge & Shalf, article in Computing in Science & Engineering

Mitigation for Integration Hierarchy Discontinuities
- Reticle Sized Dies
- 2-3x Reticle Sized Interposers/EMIBs/MCMs

Power Trend
Chart: GPU TDP Trend, TDP (W) from 250 to 700, for 2014, 2016, 2019, 2021, 2022

Cooling Difficulties
Charts: … of Cooling Difficulty vs Year of Introduction; Inverse Thermal Resistance vs Year of Introduction
- 2000-2010 Single Core Trend
- 2011-2017 Multi-Core Trend
- 2018-2025 Power War Trend
- 2016: 3U (135mm); 2022: 8U (356mm)
Source: White-paper on Emergence and Expansion of Liquid Cooling in Mainstream Data Centers https:/

Lateral Power Delivery Challenges
Chart: Process Vmin (V), 0.50 to 1.00, for 2013, 2016, 2019, 2022
Source: https:/ Xu, Power Delivery in High Current 3-D Systems

Tech Scaling Gaps
Chart: Performance/Capabilities vs Time for Compute, Memory, Comm., Storage; regions marked Edge Compute, Traditional Systems, AI Training

Scaling Gaps for Efficient Scaleout
- BW & Latency Losses
- Power To Traverse Hierarchies
- Device vs I/O Scaling
- Integration Platform Constraints
- Cooling Needs
- Power Delivery
Need Integrated Solutions With the Whole System in Mind!
Discrete/Chip Centric Approaches Are Severely Limiting Value Propositions

Datacenter Arch Evolution
Diagram sequence (labels: Sys Mem, CPUs, PCIe SW, PCIe Slots, Acc, Mem, Switch):
- Servers with Sys Mem, CPUs, PCIe SW
- Accelerators (Acc + Mem) attached through PCIe Slots
- Accelerators (Acc + Mem) attached through PCIe SW
- Accelerators (Acc + Mem) connected through a Switch
Network characteristics:
- Internet Optimized: Latencies, Packet Sizes, Bandwidth
- Legacy Support Baggage
- HPC/CPU Centric
- Rack-Rack Optimized: Latencies, Bandwidth, Packet Sizes
- AI Centric

Beyond Compute - Communication Focus
- Diagram labels: H100, NV Switch, NVLink-Network OSFP Connectors, NVLink Cables, Gen-4 NV Link, PCIe Gen 5 x 16
Source: https:/ spring conference 2021
- Next Generation ML Infrastructure for large Model Training: TPU v4 chips are networked together into a Cloud TPU v4 pod by ultra-fast interconnect that provides 10x the bandwidth per chip at scale compared to typical GPU-based large scale training systems. Large models are very communication intensive: local computation often depends on results from remote computation that are communicated across the network. TPU v4's ultra-fast interconnect has an outsized impact on computational efficiency of large models by eliminating latency and congestion in the network.
- Diagram labels: Tensix core, Ethernet, NoC, Integration, Wormhole chip

Accelerating the AI Accelerators
Chart: Achieved Performance vs Time (Rate of Change)
- Typical HPC Improvements
- Traditional Integration, Comm N/W, Cooling et al

Numerous Opportunities To Do Better/Quicker
- Wider aperture beyond chips and into Systems
- Reduce the Drag coefficient from the traditional hierarchies
- Clean abstraction exists from frameworks to underlying HW
- Flexible vs Fixed ratios of Compute-Memory-I/Os by disaggregation
- Concentrate on the full solution stack

Can We Do Better?
What would you do if designing from first principles for AI?

Dojo D1 Chip
- 645 mm², 7 nm Technology
- 50 Billion Transistors
- 11+ Miles Of Wires
- 362 TFLOPs BF16/CFP8
- 22.6 TFLOPs FP32
- 10 TBps/dir. On-Chip Bandwidth
- 4 TBps/edge. Off-Chip Bandwidth
- 400 W TDP

Dojo Unique Innovation: Hierarchies
Silicon Wafer, Known Good Dies, Test & Sort, Pkg, PCB

Dojo Unique Innovation: Flattened Hierarchies
Silicon Wafer, Known Good Dies, Test & Sort, New Integration: Reconstructed Fanout Wafer
- 9 PFLOPs (BF16/CFP8)
- 40 TB/s Bisec BW (X+Y)
- 36 TB/s I/O BW

Vertical Power Delivery
Chart: Impedance vs Frequency, Vertical vs Lateral

Training Tile
- 25x D1 Dies
- 9 PFLOPs
- 36 TB/s I/O BW
Hierarchy: Training Tile, Datacenter (compared with Datacenter/Buildings in the traditional hierarchy)

Disaggregated Scalable System
- Compute: Tile
- I/O: Network Interface
- Memory: Interface Processor

Feeding the Beast(s)
Diagram labels: Sys Mem, CPUs, PCIe SW, Acc, Mem, Switch

Traditional ML Model Fitting
- Models Have To Fit Into Each Accelerator

Traditional ML Gigantic Model Fitting Issues
- Spilling to System Memory or Local Pool of DRAMs
- Can't Assume Future Models Will Fit Into One Accelerator

Gigantic Models
- Designed from the get-go to split models across multiple Chips/Tiles
- Datacenter Network
Source: Google Research: PaLM: Scaling Language Modeling with Pathways

AI Focussed System Architectures
“A Large body of programming must be completed beforehand, If any serious work is to be done on the machine when it is made.”
- Alan Turing
Source: http:/
Ease of programmability is essential: Fully Flexible, Compiler Friendly Yet High Perf Architectures

Software Stack
Diagram labels: Dojo Extension, Dojo Compiler Engine, LLVM, PyTorch-Extension, JIT NN Compiler, LLVM Backend, Dojo Drivers, Multi-Host Multi Partition Management, Dojo Interface Processors, Ingest & Shared Mem, ExaPOD, Neural Net Models, PCIe, Serdes

HW Help for SW Stacks
- Compiler Friendly ISAs

- HW Sync/Barriers
- Flexible State Machines for ML Layers
- Fire and Forget Communication Protocols
- Fault Tolerance

Software Stack Opportunities
- Take Advantage of Clean Abstraction Layers for ML
- Need Changes for Massively Parallel Architectures
- More Compilers, Less Kernels
- Need Renewed Focus on Distributed Compiler Technology
- Reduced OS Roles

Beyond Compute: Many Aspects - One Goal
- Scaleout: Seamless Scale Out Fabric
- Communication: Terabytes per Sec
- Memory: GBs/TBs
- Disaggregation: Ratios Move With Workloads
- Compiler Technologies: Truly Distributed Compilers
- Networking Topologies: TTP/TTPoEs
- Framework Enhancements: Dojo-PyTorch.
- DataTypes: CFP Formats for Efficiency
- Compute: CPU+GPU+NPU+NNA

Beyond Compute: Scaleout From ML to AI!
- ExaPod
- Job Specific Sizing of Resources
- Tightly Integrated Yet Disaggregated System

Innovation Opportunities Beyond Compute for AI Systems
- Architectures: Scale Out Focused Parallel HW Architectures
- Integration: Reduce Traditional Hierarchy Imposed Tax on Performance and Power
- Disaggregation: Alterable Ratios of Compute/Memory/Comm./Storage
- Abstractions: Taking Advantage of Clean Abstraction Layers of Frameworks
- Algorithms: Flexibility of Compute To Adapt to New Algorithms and Workloads
- Compilers: Explore/Revive Distributed Compiler Technologies
- Design Approach of Exploring the Full Solution Space Across System and Software

Next Phase in Computing Evolution
- INTEGRATED SYSTEMS
- 122 Years of Moore's Law
- Not on Same Scale As the Above Chart

Learn more at T You!
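
As a rough cross-check of the Dojo figures quoted earlier in the deck (362 TFLOPs BF16/CFP8 and 22.6 TFLOPs FP32 per D1 die, 25 D1 dies and 9 PFLOPs per training tile), the tile number can be reproduced by simple multiplication. The sketch below is only that arithmetic. The function name and constants are illustrative, and it assumes peak per-die throughput adds linearly across the 25 dies, which the slides imply but do not state.

```python
# Figures taken from the slide text; everything else here is illustrative.
D1_BF16_TFLOPS = 362.0   # per-die peak, BF16/CFP8
D1_FP32_TFLOPS = 22.6    # per-die peak, FP32
DIES_PER_TILE = 25       # "25x D1 Dies" per training tile


def tile_pflops(per_die_tflops: float, dies: int = DIES_PER_TILE) -> float:
    """Aggregate peak throughput of one tile in PFLOPs.

    Assumes peak throughput scales linearly with die count
    (an assumption, not a claim made in the slides).
    """
    return per_die_tflops * dies / 1000.0


if __name__ == "__main__":
    print(f"BF16/CFP8 per tile: {tile_pflops(D1_BF16_TFLOPS):.2f} PFLOPs")
    print(f"FP32 per tile:      {tile_pflops(D1_FP32_TFLOPS):.3f} PFLOPs")
```

Under that assumption the BF16/CFP8 result is about 9.05 PFLOPs, consistent with the "9 PFLOPs" quoted for the training tile. The FP32 figure is derived here and does not appear in the slides.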
