Speaker: Michael Wong is VP of R&D at Codeplay, chair of the C++ committee's embedded development study group (SG14) and machine learning study group (SG19), chair of the C++ language direction and evolution group, and head of the Canadian delegation to the C++ Standards Committee. Michael has extensive experience in C++ parallel computing, high-performance computing, and machine learning; he led the creation of SYCL, the C++ heterogeneous programming standard for GPU application development, and has deep insight into low-level TensorFlow performance optimization. His work spans parallel programming, neural networks, computer vision, and autonomous driving. He previously served as a senior technical expert at IBM, leading development of the IBM XL C++ and XL C compilers.

Talk: Unlocking AI Potential: Navigating the Diverse Hardware Landscape

David Edelsohn, AI Alliance, IBM
Andrew Richards, UXL Foundation
Michael Wong, UXL Foundation, Khronos, ISOCPP, RISC-V
(co-authors from UXL Foundation, LF AI & Data)
Thanks to AMD, Meta, Modular
by Michael Wong, Distinguished Engineer, VP
Introduction
1. Key Point: The AI landscape is booming, with innovation in hardware accelerators driving breakthroughs.
2. Unprecedented compute demands: The AI field is experiencing unprecedented demands for computational power.
3. Hardware proliferation: There is a rapid proliferation of specialized AI hardware accelerators.
4. Framework diversity: A diverse ecosystem of AI software frameworks has emerged.

The Challenge of Diversity
Key Point: The proliferation of AI hardware (GPUs, TPUs, FPGAs) and software frameworks (TensorFlow, PyTorch, JAX) creates a complex ecosystem.
- Hardware Diversity: GPUs, TPUs, and FPGAs offer different performance characteristics and optimizations for AI workloads.
- Software Framework Variety: TensorFlow, PyTorch, and JAX each have unique strengths and ecosystems, adding complexity to development choices.
- Integration Challenges: Combining diverse hardware and software creates integration challenges for AI developers and researchers.
Understanding AI Accelerators
AI accelerators are specialized hardware for speeding up AI tasks, each with tradeoffs.
Global Players:
- NVIDIA - GPU dominance
- Google - TPU innovation
- AMD - ROCm ecosystem
- Intel - Gaudi architecture
Chinese Innovation:
- Huawei (Ascend)
- Baidu (Kunlun)
- Alibaba (Hanguang)

AI Frameworks
Key Point: AI frameworks provide the software foundation for building and deploying AI models.

Framework  | Key Features                       | Primary Use Cases
PyTorch    | Dynamic computation graphs         | Research, rapid prototyping
TensorFlow | Static graphs, production-ready    | Large-scale deployment
JAX        | Functional programming, auto-diff  | Scientific computing, research

# Modern AI Framework Comparison
# Simplified Framework Comparison
import torch
import tensorflow as tf
import paddle

def pytorch_example():
    model = torch.nn.Sequential(
        torch.nn.Linear(784, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, 10),
    ).to("cuda")  # Hardware acceleration (requires a CUDA device)

def tensorflow_example():
    with tf.device("/GPU:0"):  # Explicit device placement
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(10),
        ])

def paddlepaddle_example():
    model = paddle.nn.Sequential(
        paddle.nn.Linear(784, 256),
        paddle.nn.ReLU(),
        paddle.nn.Linear(256, 10),
    )
PyTorch: A Flexible Powerhouse
Key Point: PyTorch is favored for research and rapid prototyping due to its flexibility and dynamic computation graphs.
1. Eager Execution: Eager execution mode for quick debugging and iteration.
2. Dynamic Graphs: Dynamic computation graphs support complex AI models.
3. Wide Adoption: Widely adopted in academia and research.
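Eager execution means the forward pass is ordinary Python, so data-dependent control flow needs no special graph constructs. A minimal sketch of what "dynamic graphs" means in practice (the layer sizes and the branching rule are illustrative assumptions, not from the talk):

import torch

class DynamicNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(784, 256)
        self.fc2 = torch.nn.Linear(256, 10)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        # Ordinary Python control flow: the graph is built on the fly,
        # so data-dependent branching works directly in eager mode.
        if h.norm() > 10:
            h = h * 0.5
        return self.fc2(h)

out = DynamicNet()(torch.randn(4, 784))  # tensors can be inspected and debugged immediately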
TensorFlow and JAX: Building for Scale
Key Point: TensorFlow and JAX are designed for large-scale AI, with robust support for distributed computing and diverse hardware.
- TensorFlow's Ecosystem: TensorFlow's ecosystem supports industrial-scale applications.
- JAX Capabilities: JAX supports functional programming and automatic differentiation.
- XLA Integration: Both leverage XLA for optimized linear algebra computations.
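To make "functional programming and automatic differentiation" concrete, here is a minimal JAX sketch (the toy linear model and shapes are illustrative assumptions): jax.grad derives a gradient function and jax.jit hands it to XLA for compilation.

import jax
import jax.numpy as jnp

def mse_loss(w, x, y):
    pred = x @ w                       # toy linear model
    return jnp.mean((pred - y) ** 2)

grad_fn = jax.jit(jax.grad(mse_loss))  # automatic differentiation, then XLA compilation
w = jnp.zeros(3)
x = jnp.ones((8, 3))
y = jnp.ones(8)
print(grad_fn(w, x, y))                # gradient with respect to w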
PaddlePaddle: The Chinese Champion
Key Point: PaddlePaddle is a leading AI framework developed by Baidu, optimized for ultra-large-scale AI and widely used in China.
1. Scalability: Focus on scalability for ultra-large AI models.
2. User-Friendly: User-friendly interface and Chinese documentation.
3. Adoption: Significant adoption in research and industry within China.

XLA: Accelerated Linear Algebra
Key Point: XLA is a domain-specific compiler that optimizes linear algebra computations, the foundation of many AI algorithms.
Features:
- Optimizes across CPUs, GPUs, and TPUs.
- Advanced optimizations like operator fusion and memory layout.
- Integral to TensorFlow and JAX for performance gains.
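In TensorFlow, XLA can be requested per function; the matmul-bias-ReLU chain below is the kind of pattern XLA can fuse into fewer kernels (a sketch assuming TensorFlow 2.x; shapes are illustrative):

import tensorflow as tf

@tf.function(jit_compile=True)                 # compile this function with XLA
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)    # candidates for XLA operator fusion

x = tf.random.normal((32, 784))
w = tf.random.normal((784, 256))
b = tf.zeros((256,))
y = dense_relu(x, w, b)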
IREE: MLIR-Based Optimization
Key Point: IREE is an innovative framework that leverages MLIR to optimize AI models across various hardware platforms.
1. MLIR Optimization: Uses MLIR for advanced optimization and code generation.
2. Framework Support: Supports frameworks like PyTorch, TensorFlow, and JAX.
3. Community Adoption: Gaining support in the open-source AI community.

AI Model Acceleration and Optimization: Modular, Mojo, Max

Feature | XLA | IREE | TVM | Modular (Mojo)
Programming model | Not applicable (compiler) | Not applicable (compiler) | Not applicable (compiler) | Pythonic, similar to Python
Compilation | AOT, JIT | Multi-level compilation, leveraging MLIR | Multi-stage compilation, tensor expression based | MLIR-based, with custom dialects
Hardware acceleration | CPUs, GPUs, TPUs | CPUs, GPUs, TPUs, some NPUs, targeting more | CPUs, GPUs, specialized accelerators | CPUs, GPUs, potentially targeting more with MLIR
Flexibility | Lower (strict functional requirements) | High (supports various frameworks and models) | High (flexible IR and scheduling) | High (Pythonic syntax, MLIR-based optimization)
Customization | Customizable through XLA passes | Highly customizable through MLIR dialects | Customizable through schedules and templates | Customizable through MLIR dialects
Community and ecosystem | Large, mature (part of TensorFlow) | Growing, focus on cross-platform/hardware deployment | Large and active community, strong in research | Still developing, growing community
Scalability | High (distributed training support) | Designed for scalability across devices | Scalable (auto-tuning and distributed compilation) | High (leveraging MLIR)
Primary use cases | Primarily model optimization | Deployment to diverse targets, research, production | Deployment, embedded systems, research | High-performance AI, potentially a broader range later
Edge AI: ExecuTorch and TensorFlow Lite
Key Point: Edge AI requires specialized frameworks like ExecuTorch and TensorFlow Lite, designed for resource-constrained mobile and embedded devices.
- ExecuTorch optimizes PyTorch models for mobile devices.
- TensorFlow Lite is a lightweight version of TensorFlow for efficient on-device ML.
- Crucial for applications like mobile assistants, smart cameras, and wearables.
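As an illustration of the on-device workflow, a Keras model can be converted to a TensorFlow Lite flatbuffer for deployment (a sketch; the tiny model and the optimization flag are illustrative assumptions):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
])

# Convert the Keras model into a compact TensorFlow Lite flatbuffer.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables e.g. post-training quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)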
ONNX: The Bridge for Interoperability
Key Point: ONNX is an open standard for representing AI models, enabling interoperability between frameworks and simplifying hardware deployment.
Features of ONNX:
- Allows model training in one framework (e.g., PyTorch) and deployment in another (e.g., TensorFlow).
- Supports cross-platform compatibility.
- Growing ecosystem of tools and libraries.
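A common interoperability path is exporting a PyTorch model to ONNX and running it with ONNX Runtime; a minimal sketch (the single-layer model, file name, and tensor names are illustrative assumptions):

import torch
import onnxruntime as ort

model = torch.nn.Linear(784, 10)
dummy = torch.randn(1, 784)

# Export the PyTorch model to the framework-neutral ONNX format.
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["logits"])

# Run the exported model without PyTorch, via ONNX Runtime.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
(logits,) = session.run(None, {"input": dummy.numpy()})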
PolyBlocks: High-Dimensional Data
Key Point: PolyBlocks is a specialized compiler that excels at optimizing AI workloads with high-dimensional data, common in deep learning models.
- Polyhedral Optimization: Uses polyhedral optimization techniques for efficient tensor handling.
- Hardware Support: Supports CPUs, GPUs, and specialized AI accelerators.
- Application: Valuable for cutting-edge deep learning models.

TVM: End-to-End Optimization
Key Point: TVM is a comprehensive compiler stack for optimizing and deploying AI models on a wide range of hardware platforms.
- Modular design allows for customization and extension.
- Automatic tuning capabilities for efficient deployment on different platforms.
- Used by various companies for production deployments.
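A rough sketch of TVM's compile-and-deploy flow, assuming the Relay frontend and an existing model.onnx file (newer TVM releases are migrating toward the Relax IR, so details vary by version):

import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

onnx_model = onnx.load("model.onnx")                        # assumed input model
mod, params = relay.frontend.from_onnx(onnx_model,
                                       shape={"input": (1, 784)})

target = "llvm"                                             # plain CPU; "cuda", "rocm", etc. also work
with tvm.transform.PassContext(opt_level=3):                # enable aggressive optimizations
    lib = relay.build(mod, target=target, params=params)

dev = tvm.cpu()
runtime = graph_executor.GraphModule(lib["default"](dev))   # deployable runtime module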
The Acceleration Languages: CUDA, ROCm, SYCL
- CUDA (NVIDIA): Provides a mature ecosystem for GPUs, known for its high performance and extensive tools.
- ROCm (AMD): An open-source alternative to CUDA, offering a more flexible and open approach.
- SYCL (Khronos): A single-source C++ approach, enabling developers to write code that can run on a variety of hardware, including CPUs, GPUs, and other accelerators.
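One practical consequence for Python users: PyTorch ships both CUDA and ROCm builds behind the same "cuda" device string, so much of the portability burden is absorbed by these acceleration layers. A small sketch (illustrative, not from the talk; torch.version.hip is None on non-ROCm builds):

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
backend = "ROCm/HIP" if torch.version.hip else ("CUDA" if torch.version.cuda else "CPU")
print("Running on", device, "via", backend)

# The model code itself does not change between the two GPU stacks.
model = torch.nn.Linear(784, 10).to(device)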
OpenVX and Vulkan: Expanding the AI Landscape
Key Point: OpenVX and Vulkan are expanding the AI landscape by enabling acceleration for computer vision and general-purpose GPU computing.
1. OpenVX: An API for computer vision and ML acceleration, well suited for edge and embedded systems.
2. Vulkan: Initially for graphics, now widely used for general-purpose GPU computing in AI tasks.

Challenges of Compatibility
Key Point: The diversity of AI hardware and software presents compatibility challenges, requiring consideration of framework support, hardware-specific libraries, and optimization techniques.
1. Framework-Hardware Compatibility: Compatibility between AI frameworks and hardware is crucial for performance.
2. Varying Framework Support: Framework support varies, as some frameworks require specific libraries or configurations.
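Before committing to a framework/hardware pairing, it is worth checking what each framework can actually see on the target machine; a minimal probe (assumes both PyTorch and TensorFlow are installed):

import torch
import tensorflow as tf

# Quick compatibility check: which accelerators does each framework detect?
print("PyTorch CUDA available:", torch.cuda.is_available())
print("PyTorch device count:  ", torch.cuda.device_count())
print("TensorFlow GPUs:       ", tf.config.list_physical_devices("GPU"))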
Performance Optimization Techniques
Key Point: Various techniques exist for optimizing AI performance, including hardware-specific libraries, mixed-precision training, and data pipeline optimization.
- Hardware-specific Libraries: Libraries like cuDNN can significantly accelerate computations.
- Mixed-precision Training (next slide): Allows faster training with maintained accuracy.
- Data Pipeline Optimization: Reduces data transfer bottlenecks (a sketch follows the mixed-precision example below).

Mixed Precision Training Example

import torch

class OptimizedTraining:
    def __init__(self):
        self.model = Model()                        # any torch.nn.Module, defined elsewhere
        self.scaler = torch.cuda.amp.GradScaler()

    def training_step(self, data):
        with torch.cuda.amp.autocast():             # Automatic mixed precision
            output = self.model(data)
            loss = criterion(output)                # criterion and optimizer defined elsewhere
        self.scaler.scale(loss).backward()
        self.scaler.step(optimizer)
        self.scaler.update()
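For the data-pipeline point above, a minimal PyTorch sketch of reducing transfer bottlenecks with worker processes, pinned host memory, and asynchronous copies (the synthetic dataset, batch size, and worker count are illustrative assumptions):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10_000, 784), torch.randint(0, 10, (10_000,)))

# Overlap data loading with compute: worker processes plus pinned host memory.
loader = DataLoader(dataset, batch_size=256, shuffle=True,
                    num_workers=4, pin_memory=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for x, y in loader:
    x = x.to(device, non_blocking=True)   # asynchronous host-to-device copy
    y = y.to(device, non_blocking=True)
    # ... forward/backward pass here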
Navigating the Ecosystem
Key Point: Choosing the right AI framework and hardware requires careful consideration of project requirements, compatibility, scalability, and budget.
1. Project Requirements: Consider model complexity, dataset size, and performance needs.
2. Compatibility: Evaluate framework-hardware compatibility.
3. Scalability and Budget: Consider scalability and budget constraints.

Leading Hardware Vendors
Key Point: NVIDIA, AMD, Intel, and Google are key players in the AI hardware market, each with its own strategies and offerings.
- NVIDIA: Dominates with powerful GPUs and the CUDA ecosystem.
- AMD: Offers the ROCm platform and competitive GPUs.
- Intel: Developing AI hardware and software solutions.
- Google: TPUs deliver high performance for deep learning.

Emerging Technologies
Key Point: Emerging technologies like Triton, IREE, and MLIR are shaping the future of AI by enhancing cross-platform compatibility and optimization.
1. Triton: Simplifies AI model deployment on various hardware backends.
2. IREE: Uses MLIR for cross-platform compatibility.
3. MLIR: Enables modularity and extensibility in AI compiler infrastructure.

The Future of AI: A Unified Ecosystem
Key Point: The future of AI lies in a more unified ecosystem, with increased standardization, interoperability, and streamlined workflows.
1. Breaking Down Barriers: Breaking down barriers between frameworks and hardware.
2. Standardization: Standardization efforts like ONNX are crucial for unification.
3. Accelerating Innovation: A unified ecosystem will accelerate innovation.
Best Practices for AI Performance
Key Point: Developers should adopt best practices like hybrid optimization, leveraging optimized libraries, and utilizing framework-level and runtime optimizations.
- Hybrid Optimization: Hybrid optimization combines various techniques for maximum performance.
- Optimized Libraries: Use optimized libraries like cuDNN for specific hardware.
- Framework and Runtime Optimizations: Employ framework-level and runtime optimizations.
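Two concrete examples of these practices in PyTorch: letting the hardware-specific library (cuDNN) auto-tune its kernels, and applying a framework-level optimizer such as torch.compile (a sketch assuming PyTorch 2.x with a CUDA/cuDNN build; the small model is illustrative):

import torch

# Hardware-specific library tuning: let cuDNN benchmark convolution
# algorithms and keep the fastest one for the observed input shapes.
torch.backends.cudnn.benchmark = True

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, kernel_size=3, padding=1),
    torch.nn.ReLU(),
)

# Framework-level/runtime optimization: torch.compile traces the model
# and generates fused, specialized kernels.
compiled_model = torch.compile(model)
out = compiled_model(torch.randn(1, 3, 224, 224))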
Future Trends: Automation and Openness
Key Point: Future trends in AI include increased automation of optimization and a more open and unified ecosystem.
1. Quantum Acceleration: Emerging technology for AI acceleration.
2. Neuromorphic Computing: Brain-inspired computing for AI tasks.
3. Edge AI: AI processing at the edge of networks.
4. Green Computing: Environmentally friendly AI solutions.

Conclusion
Key Point: Understanding the complex and evolving AI landscape is crucial for success in AI development and deployment.
1. Vast Potential: The vast potential of AI is illustrated by diverse applications and technologies.

Q&A
Key Point: Open the floor for questions from the audience.
1. Contact Information: Michael Wong:
2. Resources: Home | The AI Accelerator Software Ecosystem Guide, https://the-ai-alliance.github.io/ai-accelerator-software-ecosystem-guide/
Additional Resources
To further explore the topics covered in this presentation, we've compiled a list of additional resources and references.

Resource Type | Description
Ecosystem Guide | Current version of the guide, updated as a "living" document. (The guide will be separated into chapters.)
Ecosystem Guide Podcast | An AI-generated podcast created by Google NotebookLM presenting an overview of the Ecosystem Guide.
Draft Ecosystem Guide PDF | Diversified Hardware Landscape
Contributing to This Document | We welcome your contributions! Here's how you can contribute.

AI Acceleration Languages and Technologies
This presentation explores the key languages, technologies, and trends shaping the landscape of AI acceleration. We'll cover essential programming languages, emerging technologies, hardware vendors, and best practices for optimizing AI performance.
by Michael Wong, Distinguished Engineer, VP