Agnostic Software for AI Workloads
Decoupling Software from Hardware

Change is Coming
Adapt or …

Workloads are Evolving at an Accelerated Pace
Observations:
- Models are increasing in size exponentially, with no foreseeable slowdown
- Network architectures are changing rapidly; Transformers are not the only solution that will exist
- The balance between training and inference is 50/50, even at scale
Insight: The progression from ML to AI to Agents is not slowing down

Machine Learning is the Gateway to Autonomous Agents
- Software 1.0 (1957): rule-based and deterministic
- Software 2.0 (2011): data-driven and discriminative
- Software 3.0 (2018): data-driven and generative
- Software 4.0 (2024): goal-directed agents
- Software 5.0 (2030): semi-autonomous agents
- Software 6.0 (2036): general-purpose agents

Can Hardware Architecture Evolution Keep Up?
Observations:
- Industry is hungry for a GPU alternative that has better efficiency
- We've reached the limit of physics for improved efficiency
- Even at max capacity, fabs cannot supply enough hardware, GPU or otherwise
- Physical build-out of data centers can only give 10% of the capacity we need
- Novel hardware architectures could help, but software takes a decade to catch up
Insight: There's enough capacity in today's compute architectures if we could maximize utilization.

If Not Specialized Hardware, Then What?
What We Have:
- Build-out is insufficient and cost-prohibitive
- Models gain adoption based on how well they map to GPUs
- Increased reliance on GPUs leaves increasingly unused compute elsewhere
- Barriers to adopting any alternatives to GPUs are high
What We Need:
- Heterogeneous compute is necessary to meet the compute demand of AI
- Automated performance optimization on heterogeneous clusters
- Architectures and compilers designed in the context of a cluster unit
GAP = 6 orders of magnitude* by 2032

Can Software Tools and Performance Engineering Bridge the Gap?
Observations:
- Software tools lagged 10 years behind the advent of parallel computing
- Discrete compilers for each architecture are ineffective at maximizing utilization
- The onus of performance and hardware efficiency falls to a finite pool of performance engineers
- Finite resources operating a slow, tedious performance-engineering process cannot keep pace with workloads and models
- The hardware architecture that wins is the one that is most easily programmable
Insight: Efficient utilization requires software innovation.

Generality is Necessary for Scale

Generality
Current state:
- Software infrastructure and tools lack portability, which limits adoption of new hardware architectures
- Dependency on massive kernel-library availability to evaluate and adopt new hardware architectures
- The AI hardware industry is mimicking disconnected elements of the HPC, cloud, and commodity industries
Future state:
- Scalable abstractions that span the unit of compute from a component to a cluster, creating more opportunities
- Software should mimic elements of the HPC, cloud, and commodity industries
- Replace platform-specific software with a universal software stack

Creating Scalability by Focusing on Compute Clusters

Hardware Lock-In Caused by Compilers
- The Missing Middle End: no common interchange for mapping compute IR to hardware IR
- Explosion of Kernels: optimally transformed graphs introduce an excessive reliance on hand-written kernels
- The World is Dynamic: workloads and the hardware each contribute to variability in execution time, so static scheduling introduces overhead
- The Operators are Not the Kernels: the operators expressed in applications imply false memory barriers
The result: increased overhead, limited performance, limited utilization, increased vendor lock-in
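The "operators are not the kernels" point can be made concrete with a minimal Python sketch (an illustration only, not the compiler described in this deck): each framework-level operator materializes its own output buffer, imposing a memory barrier between ops that the underlying computation never required, whereas a fused kernel makes one pass with no intermediates.

```python
# Illustrative sketch: three framework operators, each allocating a temporary.

def relu(xs):          # operator 1: allocates a buffer
    return [max(x, 0.0) for x in xs]

def scale(xs, a):      # operator 2: allocates another buffer
    return [a * x for x in xs]

def add_bias(xs, b):   # operator 3: allocates a third buffer
    return [x + b for x in xs]

def unfused(xs):
    # Three passes over memory, two intermediate buffers: the operator
    # boundaries act as false memory barriers.
    return add_bias(scale(relu(xs), 2.0), 1.0)

def fused(xs):
    # One pass, no intermediates: a compiler that sees past the operator
    # boundaries is free to merge the loops.
    return [2.0 * max(x, 0.0) + 1.0 for x in xs]

assert unfused([-1.0, 0.5, 3.0]) == fused([-1.0, 0.5, 3.0]) == [1.0, 2.0, 7.0]
```

The result is identical, but the fused form touches memory once, which is the utilization gap the slide is pointing at.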
Easy to Use Portable Performance
- Automatically fuses and partitions compute graphs
- Automatically lowers hardware-specific custom kernel code to machine code
- Controls dynamic variance in execution time to directly manage memory, power, and utilization

Compiler Innovation #1: The Operators are Not the Kernels
Bind Compute IR and Hardware IR
- Enumerate any specific target hardware into a hardware intermediate representation
- Precisely optimize workloads to target hardware throughout the full stack
- Comprehend memory and compute boundaries from the frontend onwards
- Reduce overhead with a generic framework to easily enumerate hardware

Compiler Innovation #2: The Missing Middle End
Parallel-Programming-Pattern Intermediate Representation
- Represents workloads holistically, including dynamic shapes, dynamic control flow, and serial code
- Maintains high-level semantic information
- Supports idiomatic graph transformation
- Delivers aggressive and comprehensive optimization to achieve performance
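A middle-end graph transformation of the kind described above can be sketched in a few lines. This is a hypothetical toy IR (the names `OPS` and `fuse_elementwise` are invented for illustration, not the deck's actual representation): a compute graph of elementwise ops, and a pass that collapses an op chain into a single fused kernel before any hardware-specific lowering.

```python
# Toy compute "IR": a graph is just an ordered list of elementwise op names.
OPS = {"relu": lambda x: max(x, 0.0),
       "double": lambda x: 2.0 * x}

def fuse_elementwise(graph):
    """Middle-end pass: collapse a chain of elementwise ops into one node.

    Returns a one-element graph whose single kernel applies the whole chain,
    so the backend lowers (and launches) one kernel instead of len(graph).
    """
    fns = [OPS[name] for name in graph]
    def fused_kernel(x):
        for fn in fns:
            x = fn(x)
        return x
    return [("fused:" + "+".join(graph), fused_kernel)]

graph = ["relu", "double"]          # two separate kernels before the pass
name, kernel = fuse_elementwise(graph)[0]
assert name == "fused:relu+double"
assert kernel(-3.0) == 0.0 and kernel(2.5) == 5.0
```

A real middle end would of course operate on a richer IR with shapes, control flow, and cost models, but the shape of the transformation (graph in, smaller graph out, semantics preserved) is the same.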
Compiler Innovation #3: Explosion of Kernels
Procedural Kernel-Code Generation
- Comprehensive graph transformation requires support for any kernel
- Utilize high-level semantic information for optimal lowering
- Leverage general HPC techniques incorporating hardware-architecture insight
- Exceed the performance of expert kernel ninjas and create portability

Compiler Innovation #4: The World is Dynamic
Heterogeneous Hierarchical Dynamic Runtime
- Directly scheduled data movement for performance and power consumption
- Take advantage of load-balance asymmetry
- Protect from NUMA effects and dark silicon
- Achieve higher utilization

Improve Performance by 10x
- Portability: no-friction migration reducing barriers to hardware adoption
- Performance: reduced latency, reduced energy, and increased throughput
- Productivity: achieve desired performance quickly, without expert performance engineering
- Scalability: compiler abstractions and architecture for scale-out to cluster-level compute, by design

Scale Requires Heterogeneity
Heterogeneity Requires Full-Stack-Level Innovations

Thank You
Jay Dawani