Beyond Just Hardware: Full-stack Optimization Towards Efficient AI Inference
Hyunsik Choi, Head of SW Platform; Jihoon Yoon, Product Marketing Manager
FuriosaAI Inc., AI Hardware Summit 2024

Company timeline
- 2017-2021: FuriosaAI founded; Gen 1 vision NPU launched
- 2021: GPT-3 inspires RNGD
- 2022: RNGD development kick-off
- May 2024: RNGD raw silicon samples arrive
- July 2024: First LLM demo

Key Points
01 Mass AI adoption is bottlenecked
02 Energy-efficient AI inference
03 Full-stack optimization for achieving efficiency
AI has broken energy efficiency
[Figure: accelerator generations from V100 and Gaudi 1 through A100, MI100, MI250X, H100, Gaudi 2, MI300X, Gaudi 3, and B200]
Source: Masanet et al. (2020), Cisco, IEA, Goldman Sachs Research

Electricity is already a huge financial and environmental burden on data centers.
Source: HARTING White Paper (2024)
AI inference will be everywhere. But is our infrastructure ready?

"Average server rack densities are increasing but remain below 8 kW. The majority of facilities do not have racks above 30 kW, and those that do have only a few."
- Uptime Institute Global Datacenter Summary 2024

What if there were a more energy-efficient AI inference solution that could be deployed anywhere within existing infrastructure?

FuriosaAI's Mission
Make AI computing sustainable, enabling access to powerful AI for everyone on Earth.

RNGD: Powerfully Efficient AI Inference
A data center AI accelerator built for the era of LLMs and other generative AI models.

RNGD at a glance
- Compute: 512 TFLOPS (FP8), i.e. 64 TFLOPS x 8 Processing Elements; also INT8 (512 TOPS), BF16 (256 TFLOPS), INT4 (1 POPS)
- Memory: 48 GB capacity (2x HBM3), 1.5 TB/s memory bandwidth
- On-chip: 256 MB SRAM, 384 TB/s on-chip bandwidth
- Power: 150 W TDP, targeting air-cooled data centers
- Packaging: CoWoS-S
- For LLMs: PCIe P2P support
- For cloud: multi-instance support, virtualization, secure boot and model encryption
Early performance numbers: 60% higher perf/watt than current inference solutions
GPT-J 6B MLPerf benchmark scenario (99% accuracy):

                          RNGD         NVIDIA L40S   Intel Gaudi 2   Google TPU v5e
Performance (queries/s)   11.5 (FP8)   12.3 (FP8)    10.5            12.5
Power (W)                 185          320           Unknown         Unknown
Data source               measured     measured      MLPerf 3.1      MLPerf 4.0

(11.5 / 185 ≈ 0.062 queries/s per watt for RNGD versus 12.3 / 320 ≈ 0.038 for the L40S, roughly a 1.6x advantage.)
Disclaimer: as of Aug 2024, unverified by MLPerf.

Lower total cost of ownership, with less energy usage and fewer racks. Compatible with the air-cooled data centers of today.
- DGX H100 x 1 server: 13,853 tokens/s
- RNGD server x 7 servers: 48,727 tokens/s, about 3.5x the compute per rack (48,727 / 13,853 ≈ 3.5)
- Most data center racks today are below 15 kW
- Figures above are for running Llama 3 70B

Beyond hardware: full-stack innovation and optimization for maximized efficiency in AI inference.

Agenda: Model Execution | Serving | Utilization
Tensor contraction: the core computation in deep learning
[Figure: FLOP analysis for BERT, dominated by tensor contractions]
"Data movement is the major bottleneck for efficiency."
Source: "Data Movement is All You Need," MLSys 2021

TCP (Tensor Contraction Processor)
"TCP aims at exploiting the rich parallelism and data locality inherent in tensor contractions, thereby enhancing both efficiency and performance of AI workloads."
"TCP: A Tensor Contraction Processor for AI Workloads," presented at ISCA (International Symposium on Computer Architecture), 2024
Tensor contraction, not matmul, as a primitive
- Tensor contraction is a higher-dimensional generalization of matrix multiplication (see the NumPy sketch below).
- Tensor contraction is declarative: no explicit memory layout for data, no explicit scheduling for computation.
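To make the primitive concrete, here is a minimal NumPy sketch (illustrative only, not Furiosa code) showing matmul as the rank-2 special case of a contraction, and an attention-style contraction one rank higher:

```python
# Illustrative only: tensor contraction expressed declaratively with einsum.
import numpy as np

# Matrix multiplication is the rank-2 special case of tensor contraction:
# C[m, n] = sum_k A[m, k] * B[k, n]
A = np.random.rand(64, 128)
B = np.random.rand(128, 32)
C = np.einsum("mk,kn->mn", A, B)          # same result as A @ B

# A batched attention-style contraction is the same primitive, one rank higher:
# S[b, h, q, k] = sum_d Q[b, h, q, d] * K[b, h, k, d]
Q = np.random.rand(8, 12, 256, 64)
K = np.random.rand(8, 12, 256, 64)
S = np.einsum("bhqd,bhkd->bhqk", Q, K)

# Note what the einsum string does NOT say: no memory layout, no loop order,
# no tiling. That scheduling freedom is exactly what a compiler can exploit.
assert np.allclose(C, A @ B)
```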
DNN Graph Compiler: end-to-end model efficiency
- Optimal memory layout and operation scheduling for maximum data reusability
- Temporal pipeline opportunities
- Operator fusion and memory allocation; split/merge scheduling (a generic fusion example follows)
[Diagram: computational graph lowered to low-level operations mapped onto the chip's compute units and SRAM partitions]
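As a generic illustration of what operator fusion buys, here is a sketch using PyTorch 2.0's compiler as a stand-in; Furiosa's DNN graph compiler performs analogous rewrites for RNGD, but this is not its API:

```python
# Generic illustration of operator fusion via a graph compiler.
import torch

def mlp_block(x, w, b):
    # Unfused, each op writes its intermediate to memory: matmul, then
    # bias-add, then GeLU is three round trips through memory.
    return torch.nn.functional.gelu(x @ w + b)

compiled = torch.compile(mlp_block)  # traces the graph and fuses elementwise ops

x = torch.randn(512, 1024)
w = torch.randn(1024, 4096)
b = torch.randn(4096)
y = compiled(x, w, b)  # the compiler can fuse the bias-add and GeLU into the
                       # matmul epilogue, cutting the data movement that the
                       # MLSys 2021 paper identifies as the bottleneck
```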
Quantization becomes more critical as model sizes grow
Efficiency gains through quantization: inference latency, computation time, memory footprint, energy consumption.
[Figure: energy consumption per operation; numbers are rough approximations for 45 nm. "Computing's Energy Problem," M. Horowitz, ISSCC 2014; slide courtesy of Prof. Shao]

Furiosa Quantizer: a graph-based automated tool
- End-to-end automated quantization
- Supports arbitrary customized LLM models using graph pattern search
- Schemes: BF16, INT8 weight-only (W8A16), FP8 (W8A8), INT8 SmoothQuant (W8A8), INT4 weight-only (W4A16, AWQ/GPTQ) (a generic sketch follows this list)
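For intuition, here is a minimal sketch of generic symmetric per-channel INT8 weight-only quantization, the basic idea behind a W8A16 scheme. It is not the Furiosa Quantizer API:

```python
# Minimal sketch: symmetric per-channel INT8 weight-only quantization.
import numpy as np

def quantize_w8(w: np.ndarray):
    """Quantize weights to INT8 with one scale per output channel (row)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0   # per-channel scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray):
    return q.astype(np.float32) * scale                    # back to high precision

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_w8(w)
# Weights now occupy 1/4 the memory of FP32 (1/2 of BF16); activations stay
# high precision, so matmuls dequantize on the fly: y = x @ dequantize(q, s).T
err = np.abs(w - dequantize(q, scale)).mean()
```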
Agenda: Model Execution | Serving | Utilization

Generative inference basics
- Prompt phase: the decoder processes all input (prompt) tokens at once, populating the KV cache.
- Decode phase: the decoder runs once per output token (repeating N times), reusing the KV cache and weights; greedy search picks each next token until the final output token (a schematic follows this list).
- The KV cache lives for the entire lifetime of a request.
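A schematic of the two phases in plain Python; `toy_decoder` is a stand-in for a real model, so shapes and logic are purely illustrative:

```python
# Schematic greedy decoding with a KV cache: one prefill pass, then one
# decoder step per generated token.
import numpy as np

VOCAB = 100

def toy_decoder(token_ids, kv_cache):
    """Pretend decoder: appends a (K, V) entry per new token and returns
    logits for the next token."""
    for _ in token_ids:
        kv_cache.append(np.random.rand(2, 64))   # (K, V) for one position
    return np.random.rand(VOCAB)                 # next-token logits

prompt = [17, 3, 52, 9]
kv_cache = []                                    # lives for the whole request

# Prompt (prefill) phase: all prompt tokens processed at once.
logits = toy_decoder(prompt, kv_cache)

# Decode phase: one token per step, reusing the cache instead of recomputing.
output = []
for _ in range(16):
    next_id = int(np.argmax(logits))             # greedy search
    output.append(next_id)
    logits = toy_decoder([next_id], kv_cache)    # only the new token runs

print(output, "cache length:", len(kv_cache))    # len == prompt + generated
```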
Challenges in generative model serving
Auto-regressive execution creates three problems (illustrated with batched prompts such as "Hi", "There", "What is LLM?" and their generated texts along the batch axis, padded to max context length):
1) Computation/IO waste: sequences that finish earlier still occupy their batch slot.
2) Memory waste: KV cache is reserved up to the max context length regardless of actual sequence length.
3) Waiting for new requests: incoming requests must wait for the current batch to finish.

Furiosa LLM: a high-throughput serving engine for LLMs
[Diagram: gRPC/REST requests enter a Request Pool; a Scheduler drives generate() on an Iterative Decoder, whose forward() runs through the Furiosa Generator on the Furiosa Runtime; a KV Block Manager alloc()s logical KV cache blocks backed by physical memory blocks]
High-throughput serving with state-of-the-art optimizations:
- Continuous batching starts incoming requests immediately when resources become available.
- PagedAttention eliminates compute and IO waste.
- Blocked KV cache management significantly reduces memory waste (see the sketch after this list).
Result: 6x increase in inference performance.
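A minimal sketch of the block-table idea behind blocked KV cache management (the general mechanism popularized by PagedAttention); class and method names are illustrative, not the Furiosa LLM API:

```python
# Blocked KV cache sketch: logical cache blocks map to physical blocks on
# demand, like OS paging, so no memory is reserved up to max context length.
BLOCK_TOKENS = 16                       # KV entries per block

class KVBlockManager:
    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))   # free physical blocks
        self.tables = {}     # request id -> list of physical block indices
        self.lengths = {}    # request id -> tokens stored so far

    def append_token(self, req_id):
        """Reserve cache space for one token; grab a block only at boundaries."""
        n = self.lengths.get(req_id, 0)
        if n % BLOCK_TOKENS == 0:                      # block boundary
            if not self.free:
                raise MemoryError("no free KV blocks; request must wait")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        """Return blocks when a sequence finishes, so new requests start at once."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

mgr = KVBlockManager(num_physical_blocks=8)
for _ in range(20):
    mgr.append_token("req-A")       # 20 tokens -> 2 blocks, no padding waste
mgr.release("req-A")                # freed blocks go straight to the next request
```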
Agenda: Model Execution | Serving | Utilization

Spatial partitioning for container and VM environments
- A single RNGD has 8 Processing Elements (PEs).
- An RNGD can be spatially partitioned into several individual NPUs.
- Up to 4 PEs can operate together as a single NPU.
- Example layouts (see the sketch after this list): four 2-PE NPUs (NPU 0-1, 2-3, 4-5, 6-7) shared across VM 1, VM 2, and VM 3; or two 4-PE NPUs (NPU 0-3 and NPU 4-7) assigned to VM 1 and VM 2.
- Furiosa RNGD supports SR-IOV (Single Root I/O Virtualization) for multiple isolated accesses from VMs.
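An illustrative check of which partition layouts would be valid, assuming (from the configurations shown on the slides, not an official constraint list) that each NPU is a contiguous, aligned group of 1, 2, or 4 PEs:

```python
# Illustrative partition validation for 8 PEs; constraints are assumptions
# inferred from the example layouts, not documented RNGD rules.
def valid_partition(groups):
    """groups: list of (first_pe, size) tuples that should cover PEs 0..7."""
    covered = []
    for first, size in groups:
        if size not in (1, 2, 4) or first % size != 0:
            return False                       # size/alignment constraint
        covered.extend(range(first, first + size))
    return sorted(covered) == list(range(8))   # exact cover, no overlap

print(valid_partition([(0, 2), (2, 2), (4, 2), (6, 2)]))  # four 2-PE NPUs -> True
print(valid_partition([(0, 4), (4, 4)]))                  # two 4-PE NPUs  -> True
print(valid_partition([(0, 4), (4, 2), (6, 2)]))          # mixed sizes    -> True
print(valid_partition([(0, 4), (2, 4)]))                  # misaligned     -> False
```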
Furiosa software stack: key features
- PyTorch 2.0 integration
- Quantization toolkit (FP8, INT8, INT4, ...)
- 3D model parallelism support
- Graph compiler for DNN models
- Performance profiling tools
- LLM serving framework compatible with vLLM (see the client sketch after this list)
- Kubernetes device plugin and NPU operator
- Virtual machine support
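A hypothetical client-side sketch: if "compatible with vLLM" extends to vLLM's OpenAI-compatible HTTP API, a standard client would work unchanged. The endpoint URL and model name below are placeholders, not documented values:

```python
# Hypothetical usage against a vLLM-style OpenAI-compatible serving endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # placeholder serving endpoint
    api_key="unused",                      # local servers often ignore the key
)

resp = client.completions.create(
    model="llama-3-70b",                   # placeholder model name
    prompt="Explain tensor contraction in one sentence.",
    max_tokens=64,
)
print(resp.choices[0].text)
```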
In summary: delivering peak AI performance with high efficiency requires
- Maximized model efficiency: the RNGD chip, compiler, and Furiosa Quantizer deliver peak performance with low-precision inference for speed and efficiency.
- Enhanced serving capabilities: boost throughput and reduce latency in production with PagedAttention, blocked KV cache, and continuous batching.
- Flexible resource utilization: RNGD's spatial partitioning and SR-IOV ensure optimal resource allocation, maximizing NPU utilization in virtualized and containerized environments.

To solve for mass AI adoption, we have to think beyond just hardware.