AI System Validation: Meta Perspective
By: Carlos Fernandez, HW Validation Engineer

ARTIFICIAL INTELLIGENCE (AI)

Platform Overview (GT-Training)
Overview: The Grand Teton system is designed with a modular and scalable architecture, allowing it to efficiently handle large-scale AI workloads. It typically includes a combination of CPUs, GPUs, and other accelerators interconnected to maximize performance.

Interconnects
CPU to CPU: Efficiently manages distributed computing tasks and coordinates data processing across multiple CPU sockets.
CPU to Accelerator: Offloads parallelizable tasks, such as matrix multiplications or neural network computations, to accelerators.
Accelerator Interconnect: Enables direct communication between accelerators, bypassing the CPU for certain tasks to reduce latency.

Data Flow
Locally: Data blocks are processed by the CPU and then transferred to a switch, which routes the data to the appropriate GPU or accelerator via PCIe.
Remotely: Data moves through a scale-out network, allowing it to be transferred to accelerators located in other hosts.

System Topology Discussion Points

Workloads, Stress, and Silent Errors

Accelerator Compute
GEMM (General Matrix Multiply) is a fundamental operation in linear algebra involving the multiplication of two matrices. It is a core component of many scientific and engineering applications, including machine learning and vector mathematics.
Matrix Sizing: The size of the matrices involved in a computation can impact the power consumption of a system. Larger matrices require more computational resources, leading to increased power usage. This can also lead to power capping and reduced clocks, so optimizing matrix size for validation purposes is critical.
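A matrix-size sweep like the one described above can be sketched in a few lines of Python. This is a minimal illustration, not Meta's workload: the naive triple-loop GEMM, the tiny sizes, and the rough FLOP/s estimate are stand-ins for what a real harness would do with a tuned BLAS or GPU kernel.

```python
import random
import time

def gemm(a, b):
    """Naive general matrix multiply, C = A x B (illustration only)."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c

def sweep(sizes):
    """Time GEMM at several square matrix sizes. On real hardware, larger
    sizes draw more power and are more likely to trip power caps or
    clock reductions."""
    results = {}
    for n in sizes:
        a = [[random.random() for _ in range(n)] for _ in range(n)]
        b = [[random.random() for _ in range(n)] for _ in range(n)]
        t0 = time.perf_counter()
        gemm(a, b)
        elapsed = time.perf_counter() - t0
        results[n] = 2.0 * n ** 3 / elapsed  # approx FLOP/s (2*n^3 FLOPs)
    return results

print(sweep([8, 16, 32]))
```

On real hardware the interesting signal is how measured FLOP/s and power draw change as the sizes grow, since that is where power capping and clock reductions show up.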
HBM (High Bandwidth Memory) is used in high-performance computing systems, offering higher bandwidth (due to increased parallelism) and lower power consumption compared to traditional memory types. The 3D stacking of memory dies can lead to increased heat density, making thermal management more challenging.
Sequential Stress: Sequential stress tests are designed to push the HBM to its limits by continuously reading and writing large blocks of data in a linear fashion. This helps assess the memory's ability to handle the high-throughput scenarios typical of graphics rendering and data-intensive computations.
Random Access Patterns: Random access pattern tests evaluate the HBM's performance when data is accessed in a non-sequential manner. This is important for applications that involve irregular data access, such as AI inference, database operations, or certain types of scientific computation.

Accelerator Stress

System Validation Strategies
CPU Computation: Test the CPU's ability to handle complex arithmetic operations under load, and verify that its data pipelines also deliver the expected level of performance and power consumption.
Cache Coherency: Ensure data consistency across all threads and sockets.
IO Throughput: Assess whether data transfer rates and integrity are sustained while transferring large blocks of random data across all supported components. Also ensure that, when errors occur, sufficient detail is provided to perform fault isolation.
Interconnect Performance: Evaluate latency and fault tolerance. Test interconnects like PCIe for signal integrity and stability, ensuring the system handles failures gracefully.
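The IO throughput strategy above (transfer large blocks of random data, confirm integrity, and report enough detail for fault isolation) can be sketched with per-block digests. A hedged illustration only: BLOCK_SIZE, the block count, and the choice of SHA-256 are assumptions, and the actual transfer across the component under test is stubbed out.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # hypothetical block size for this sketch

def make_blocks(count):
    """Generate random data blocks and record a digest for each one."""
    blocks = [os.urandom(BLOCK_SIZE) for _ in range(count)]
    digests = [hashlib.sha256(b).hexdigest() for b in blocks]
    return blocks, digests

def verify_blocks(blocks, digests):
    """Re-hash each block after the transfer. On mismatch, report which
    block failed so fault isolation has something concrete to start from."""
    errors = []
    for i, (blk, want) in enumerate(zip(blocks, digests)):
        got = hashlib.sha256(blk).hexdigest()
        if got != want:
            errors.append({"block": i, "expected": want, "actual": got})
    return errors

blocks, digests = make_blocks(8)
# ... transfer `blocks` across the component under test here ...
blocks[3] = os.urandom(BLOCK_SIZE)  # simulate a silent corruption
print(verify_blocks(blocks, digests))
```

Because the check is per block, a failure names the exact block (and could be extended to name the offset), rather than only reporting that the transfer as a whole went bad.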
Memory Subsystem
DDR Memory Stability: Perform stress tests and verify ECC. Run intensive read/write operations to test stability and ensure that the error-correcting code recovers single-bit errors and detects multi-bit errors.
Data & Access Patterns: Perform sweeps across different data and access patterns to characterize signal integrity and performance. Characterize the impact of rank crossing and random data transfers on memory bandwidth.
Power Spiking: Simulate spikes and ensure the thermal and power delivery networks remain stable while performing large idle-to-full-performance cyclical testing.

Compute Stress (CPU + DDR)

CPU Performance Impact
Thermal Issues: Excessive heat generation can lead to a reduction in clocks or overall CPU performance. This can result in:
Increased Latency: The CPU may take longer to process tasks, affecting overall system responsiveness.
Reduced Bandwidth: The ability to transfer data efficiently is compromised, impacting performance.
Correctable Error Floods: These errors trigger interrupts and writebacks to the device for correction, which can:
Significantly Reduce Bandwidth Performance: The system spends more time correcting errors (writebacks/interrupts) than processing data.
Lead to System Crashes: Floods of correctable errors reduce the headroom available for error recovery.

Accelerator Performance Impact
Thermal Issues: Like CPUs, accelerators can experience reduced performance due to excessive heat, leading to:
Increased Latency: Slower processing times for tasks handled by the accelerator.
GPU Falls off the Bus: A reduction in accelerator resources available to complete training or inference jobs.
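One cheap way to use performance as a health signal, in the spirit of the impact lists above, is to time a fixed workload repeatedly and flag iterations that run well below the best observed rate. A sketch under stated assumptions: the workload, the iteration count, and the 1.5x slowdown threshold are placeholders, not a calibrated detector.

```python
import time

def fixed_workload():
    """A small fixed CPU workload; a real harness would run a tuned kernel."""
    s = 0
    for i in range(200_000):
        s += i * i
    return s

def measure(iters=5):
    """Time the same workload repeatedly and flag iterations that run
    significantly slower than the best one: a cheap proxy for spotting
    clock throttling or correctable-error floods eating into bandwidth."""
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fixed_workload()
        times.append(time.perf_counter() - t0)
    best = min(times)
    # 1.5x-of-best threshold is an assumption; tune against known-good runs.
    flagged = [i for i, t in enumerate(times) if t > 1.5 * best]
    return times, flagged

times, flagged = measure()
print("suspect iterations:", flagged)
```

The same pattern scales up: track sustained throughput of a standard workload over a fleet, and treat a persistent drop on one host as a reason to pull it for thermal or RAS investigation.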
System Performance as a Health Indicator

Concurrency Testing (Red Under Test)

Silent Error Debug Example
Walk through a silent error example.
What exactly is a silent error? Errors that escape the RAS subsystem. For example, a DDR read from an incorrect address will generate the correct check bits, but the data captured will be incorrect. Other cases include core execution errors and data transfers over unprotected buses.
What kind of workload and access patterns generated the silent error?
Discuss how our workload loads the accelerator.
Discuss how data pattern checks capture silent errors.

Debug Example Talking Points

Workload Standardization
How can the industry align on a set of workloads that:
Provide cross-architectural support (x86/Arm/etc.).
Provide a sufficient level of configuration to target specific use cases, access patterns, and data patterns.
Generate standardized logging which automation can easily consume.

Establish a Collaborative Framework for Validation
Utilize the OCP's collaborative environment to create a shared platform where system validation teams, hyperscalers, and vendors can contribute to and access validation data. Develop agreements that outline the terms of data sharing, ensuring that all parties are comfortable with the level of transparency.

Leverage Industry
By sharing validation data, all participants in the industry can benefit from collective learning and innovation. This collaborative approach can lead to faster identification of systemic issues and faster root-causing.
Feedback Loops: Establish feedback loops where insights gained from shared data can be fed back into the design and validation processes, improving the overall quality and reliability of systems.

Call to Action

Thank You!