Beyond Just Hardware: Full-Stack Optimization Towards Efficient AI Inference


FuriosaAI Inc., AI Hardware Summit 2024
Hyunsik Choi, Head of SW Platform; Jihoon Yoon, Product Marketing Manager

Company timeline:
- 2017-2021: FuriosaAI founded; Gen 1 vision NPU launched
- 2021: GPT-3 inspired RNGD
- 2022: RNGD development kick-off
- May 2024: RNGD raw silicon sample arrival
- July 2024: first LLM demo

Key Points
01 Mass AI adoption is bottlenecked
02 Energy-efficient AI inference
03 Full-stack optimization for achieving efficiency

AI has broken energy efficiency.
[Chart: accelerator power trend across V100, Gaudi 1, A100, MI100, MI250X, H100, Gaudi 2, MI300X, Gaudi 3, B200. Source: Masanet et al. (2020), Cisco, IEA, Goldman Sachs Research]

Electricity is already a huge financial and environmental burden on data centers. (Source: HARTING White Paper, 2024)

AI inference will be everywhere. But is our infrastructure ready?

"Average server rack densities are increasing but remain below 8 kW. The majority of facilities do not have racks above 30 kW, and those that do have only a few." - Uptime Institute Global Datacenter Summary 2024

What if there were a more energy-efficient AI inference solution that could be deployed anywhere within existing infrastructure?

FuriosaAI's mission: make AI computing sustainable, enabling access to powerful AI for everyone on Earth.

RNGD: Powerfully Efficient AI Inference
A data center AI accelerator built for the era of LLMs and other generative AI models.

- 512 TFLOPS (FP8): 64 TFLOPS x 8 Processing Elements
- Compute: INT8 (512 TOPS), BF16 (256 TFLOPS), INT4 (1 POPS), FP8 (512 TFLOPS)
- 48 GB memory capacity (2x HBM3), 1.5 TB/s memory bandwidth
- 256 MB SRAM, 384 TB/s on-chip bandwidth
- 150 W TDP, targeting air-cooled data centers
- CoWoS-S packaging
- For LLMs: PCIe P2P support
- For cloud: multiple-instance support, virtualization, secure boot & model encryption

Early performance numbers: 60% higher perf/watt than current inference solutions.

GPT-J 6B MLPerf benchmark scenario (99% accuracy):

                          RNGD        NVIDIA L40S   Intel Gaudi 2   Google TPU v5e
Performance (queries/sec) 11.5 (FP8)  12.3 (FP8)    10.5            12.5
Power (watt)              185         320           Unknown         Unknown
Data source               measured    measured      MLPerf 3.1      MLPerf 4.0

Disclaimer: as of Aug 2024, unverified by MLPerf.

Lower total cost of ownership, with less energy usage and fewer racks; compatible with today's air-cooled data centers.
- DGX H100 x 1 server: 13,853 tokens/s
- RNGD server x 7 servers: 48,727 tokens/s
- 3.5x compute per rack (data above is for running Llama 3 70B; most data center racks today are below 15 kW)

Beyond hardware: full-stack innovation and optimization for maximized efficiency in AI inference, across three layers: Model Execution, Serving, Utilization.
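As a back-of-envelope check of the perf/watt claim in the benchmark table above, using only the two systems whose power draw is listed (this is simple arithmetic on the slide's numbers, not an official benchmark result):

```python
# Sanity-check the "60% higher perf/watt" claim with the GPT-J 6B numbers
# from the table above. Gaudi 2 and TPU v5e are excluded because their
# power draw is listed as unknown.
rngd = {"qps": 11.5, "watts": 185}
l40s = {"qps": 12.3, "watts": 320}

def perf_per_watt(d):
    return d["qps"] / d["watts"]

gain = perf_per_watt(rngd) / perf_per_watt(l40s) - 1
print(f"RNGD perf/watt advantage over L40S: {gain:.0%}")  # ~62%
```

The result, roughly 62%, is consistent with the "60% higher perf/watt" headline.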

Tensor contraction is the core computation in deep learning. A FLOP analysis for BERT shows that tensor contractions dominate the compute, yet "data movement is the major bottleneck for efficiency." (Source: "Data Movement is All You Need," MLSys 2021)

TCP (Tensor Contraction Processor): "TCP aims at exploiting the rich parallelism and data locality inherent in tensor contractions, thereby enhancing both efficiency and performance of AI workloads." - "TCP: A Tensor Contraction Processor for AI Workloads," presented at ISCA (International Symposium on Computer Architecture), 2024

Tensor Contraction, not Matmul, as a Primitive
Tensor contraction is a higher-dimensional generalization of matrix multiplication, and it is declarative:
- No explicit memory layout for data
- No explicit scheduling for computation

DNN Graph Compiler: End-to-End Model Efficiency
- Optimal memory layout and operation scheduling for maximum data reusability
- Temporal pipeline opportunities
- Operator fusion and memory allocation, split/merge scheduling
[Diagram: computational graph lowered to low-level operations, mapped onto the chip's compute units and SRAM partitions]
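To illustrate why tensor contraction is a natural primitive (a generic einsum sketch, not Furiosa's API), note that the subscript notation states only *which* indices are summed, with no loop order or memory layout, and matrix multiplication falls out as the two-index special case:

```python
import numpy as np

# Tensor contraction written declaratively with einsum: the subscripts say
# what indices are contracted; layout and scheduling are left to the backend.
A = np.random.rand(4, 5)
B = np.random.rand(5, 6)
matmul = np.einsum("ik,kj->ij", A, B)     # ordinary matmul as a contraction
assert np.allclose(matmul, A @ B)

# A higher-dimensional contraction, e.g. batched attention-style scores:
# contract the feature dimension d between queries and keys.
Q = np.random.rand(2, 8, 16)              # (batch, seq, d)
K = np.random.rand(2, 8, 16)
scores = np.einsum("bqd,bkd->bqk", Q, K)  # (batch, seq, seq)
assert scores.shape == (2, 8, 8)
```

A hardware or compiler backend is then free to choose the loop order, tiling, and data placement that maximize reuse, which is precisely the freedom TCP exploits.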

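A minimal sketch of why operator fusion reduces data movement (an illustration of the general technique, not the Furiosa compiler): unfused, each elementwise op writes a full intermediate tensor; fused, each element flows through all ops in one pass.

```python
import numpy as np

# Illustrative sketch of operator fusion (not the Furiosa compiler).
# Unfused, y = relu(x * w + b) materializes two intermediate tensors;
# fused, the same result is produced in one traversal of the data.
def unfused(x, w, b):
    t0 = x * w                    # first intermediate, written to memory
    t1 = t0 + b                   # second intermediate
    return np.maximum(t1, 0.0)    # ReLU

def fused(x, w, b):
    # In a real compiler this becomes a single kernel, so the
    # intermediates never leave on-chip SRAM.
    return np.maximum(x * w + b, 0.0)

x, w, b = np.random.rand(3, 1024), np.random.rand(3, 1024), 0.5
assert np.allclose(unfused(x, w, b), fused(x, w, b))
```

The arithmetic is identical; what changes is how many times the data crosses the memory hierarchy, which is exactly the bottleneck identified above.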
Quantization Becomes More Critical as Model Sizes Grow
Efficiency gains through quantization:
- Inference latency
- Computation time
- Memory footprint
- Energy consumption
[Chart: energy consumption per operation; numbers are rough approximations for 45 nm. Source: "Computing's Energy Problem," M. Horowitz, ISSCC 2014. Slide courtesy of Prof. Shao]

Furiosa Quantizer: Graph-Based Automated Tool
- End-to-end automated quantization
- Supports arbitrary customized LLM models using graph pattern search
- Supported schemes: BF16, INT8 weight-only (W8A16), FP8 (W8A8), INT8 SmoothQuant (W8A8), INT4 weight-only (W4A16, AWQ/GPTQ)

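To make the weight-only schemes concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization (a generic illustration of the technique, not the Furiosa Quantizer): weights are stored as int8 plus one floating-point scale, quartering the memory footprint relative to FP32.

```python
import numpy as np

# Minimal sketch of symmetric per-tensor INT8 weight-only quantization
# (generic illustration, not the Furiosa Quantizer API).
def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0                    # widest weight -> ±127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
assert err <= scale / 2 + 1e-6    # rounding error is at most half a step
```

Production schemes such as SmoothQuant, AWQ, and GPTQ refine this idea with per-channel scales and activation-aware calibration, but the storage and bandwidth savings come from the same int8-plus-scale representation.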
Model Execution | Serving | Utilization

Generative Inference Basics
- Prompt phase: the decoder runs once over all input tokens (prompt inference) and emits output token 1.
- Decode phase: the decoder runs token inference repeatedly (output token i, repeating N times) until the final output token, reading the model weights and the growing KV cache at every step.
- Greedy search selects the most likely token at each step.
- The KV cache lives from the prompt phase until the query completes (KV cache lifetime).

Challenges in Generative Model Serving
[Diagram: a batch of prompts ("Hi There", "What is LLM?", "What is Paged") and their generated texts along the batch axis; sequences finish at different times, while the batch is sized to the max context length]
Challenges of auto-regressive execution in serving:
1) Computation/IO waste on sequences that finished earlier
2) Memory waste from reserving the max context length for current sequences
3) New requests must wait for the running batch to finish

Furiosa LLM: High-Throughput Serving Engine for LLMs
Architecture: requests arrive via gRPC/REST into a request pool; a scheduler drives an iterative decoder (generate() / forward()); a KV block manager maps logical KV cache blocks to physical memory blocks allocated through the Furiosa Runtime (alloc()), on top of the Furiosa Generator.

High-throughput serving with state-of-the-art optimizations:
- Continuous batching starts incoming requests immediately when resources are available.
- PagedAttention eliminates compute and IO waste.
- Blocked KV cache management significantly reduces memory waste.
Result: 6x increase in inference performance.

Model Execution | Serving | Utilization

Spatial Partitioning for Container and VM Environments
- A single RNGD has 8 Processing Elements (PEs).
- An RNGD can be spatially partitioned into many individual NPUs.
- Up to 4 PEs can operate together as a single NPU.
Example: four 2-PE NPUs (NPU 0-1, 2-3, 4-5, 6-7) shared across VM 1, VM 2, and VM 3.
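Returning to the serving optimizations above, a minimal sketch of blocked (paged) KV cache management in the spirit of PagedAttention (a generic illustration, not Furiosa's implementation): each sequence owns logical blocks that map to physical blocks drawn from a shared pool, so no request reserves max-context-length memory up front, and finished requests free blocks immediately for continuous batching.

```python
# Minimal sketch of blocked (paged) KV cache management, in the spirit of
# PagedAttention -- a generic illustration, not Furiosa's implementation.
BLOCK_TOKENS = 16                      # tokens stored per KV block (assumed)

class KVBlockManager:
    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))   # free physical blocks
        self.table = {}                                # seq_id -> [physical ids]

    def append_token(self, seq_id: str, pos: int) -> int:
        """Map token position `pos`; a new block is taken per BLOCK_TOKENS tokens."""
        blocks = self.table.setdefault(seq_id, [])
        if pos // BLOCK_TOKENS >= len(blocks):         # logical block missing
            if not self.free:
                raise MemoryError("no free KV blocks; request must wait")
            blocks.append(self.free.pop())
        return blocks[pos // BLOCK_TOKENS]             # physical block id

    def release(self, seq_id: str):
        """A finished sequence returns its blocks to the pool immediately."""
        self.free.extend(self.table.pop(seq_id, []))

mgr = KVBlockManager(num_physical_blocks=8)
for pos in range(40):                  # a 40-token sequence uses only 3 blocks
    mgr.append_token("req-1", pos)
assert len(mgr.table["req-1"]) == 3
mgr.release("req-1")                   # freed blocks can serve a new request
assert len(mgr.free) == 8
```

This is why memory waste drops: allocation tracks actual sequence length in block-sized steps instead of the max context length.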

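The spatial-partitioning rule (8 PEs, with up to 4 PEs forming a single NPU) can be sketched as a validity check. The allowed group sizes {1, 2, 4} and the function below are my assumptions for illustration, not Furiosa's API:

```python
# Sketch of the spatial-partitioning rule described above: an RNGD's
# 8 Processing Elements are grouped into NPUs of up to 4 PEs each.
# The allowed sizes {1, 2, 4} are an assumption, not a documented API.
TOTAL_PES = 8
VALID_NPU_SIZES = {1, 2, 4}

def valid_partition(npu_sizes):
    """True if the NPU sizes exactly tile the 8 PEs with allowed group sizes."""
    return (sum(npu_sizes) == TOTAL_PES
            and all(s in VALID_NPU_SIZES for s in npu_sizes))

assert valid_partition([2, 2, 2, 2])   # four 2-PE NPUs (NPU 0-1 ... NPU 6-7)
assert valid_partition([4, 4])         # two 4-PE NPUs (NPU 0-3, NPU 4-7)
assert not valid_partition([8])        # one 8-PE NPU exceeds the 4-PE limit
```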
Spatial Partitioning for Container and VM Environments (continued)
- Example with larger partitions: NPU 0-3 and NPU 4-7, each a 4-PE NPU, assigned to VM 1 and VM 2.
- Furiosa RNGD supports SR-IOV (Single Root I/O Virtualization) for multiple isolated accesses from VMs.

Furiosa Software Stack Key Features
- PyTorch 2.0 integration
- Quantization toolkit (FP8, INT8, INT4, ...)
- 3D model parallelism support
- Graph compiler for DNN models
- Performance profiling tools
- LLM serving framework compatible with vLLM
- Kubernetes device plugin and NPU operator
- Virtual machine support

In Summary
Delivering peak AI performance with high efficiency requires:
- Maximized model efficiency: the RNGD chip, compiler, and Furiosa Quantizer deliver peak performance with low-precision inference for speed and efficiency.
- Enhanced serving capabilities: boost throughput and reduce latency in production with PagedAttention, blocked KV cache, and continuous batching.
- Flexible resource utilization: RNGD's spatial partitioning and SR-IOV ensure optimal resource allocation, maximizing NPU utilization in virtualized and containerized environments.

In order to solve for mass AI adoption, we have to think beyond just hardware.
