AI System Validation: Meta Perspective
By: Carlos Fernandez, HW Validation Engineer

ARTIFICIAL INTELLIGENCE (AI)

Platform Overview (GT-Training)
Overview: The Grand Teton system is designed with a modular and scalable architecture, allowing it to efficiently handle large-scale AI workloads. It typically includes a combination of CPUs, GPUs, and other accelerators interconnected to maximize performance.

Interconnects
CPU to CPU: Efficiently manages distributed computing tasks and coordinates data processing across multiple CPU sockets.
CPU to Accelerator: Offloads parallelizable tasks, such as matrix multiplications or neural network computations, to accelerators.
Accelerator Interconnect: Enables direct communication between accelerators, bypassing the CPU for certain tasks to reduce latency.

Data Flow
Locally: Data blocks are processed by the CPU and then transferred to a switch, which routes the data to the appropriate GPU or accelerator via PCIe.
Remotely: Data moves through a scale-out network, allowing it to be transferred to accelerators located in other hosts.

System Topology Discussion Points

Workloads, Stress, and Silent Errors

Accelerator Compute
GEMM (General Matrix Multiply) is a fundamental operation in linear algebra involving the multiplication of two matrices. It is a core component of many scientific and engineering applications, including machine learning and vector mathematics.
Matrix Sizing: The size of the matrices involved in a computation can impact the power consumption of a system. Larger matrices require more computational resources, leading to increased power usage. This can also lead to power capping and reduced clocks, so optimizing matrix size for validation purposes is critical.
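A matrix-size sweep like the one described above can be sketched in a few lines of Python. This is a minimal illustration, not Meta's workload: the naive triple-loop GEMM, the tiny sizes, and the rough FLOP/s estimate are stand-ins for what a real harness would do with a tuned BLAS or GPU kernel.

```python
import random
import time

def gemm(a, b):
    """Naive general matrix multiply, C = A x B (illustration only)."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c

def sweep(sizes):
    """Time GEMM at several square matrix sizes. On real hardware, larger
    sizes draw more power and are more likely to trip power caps or
    clock reductions."""
    results = {}
    for n in sizes:
        a = [[random.random() for _ in range(n)] for _ in range(n)]
        b = [[random.random() for _ in range(n)] for _ in range(n)]
        t0 = time.perf_counter()
        gemm(a, b)
        elapsed = time.perf_counter() - t0
        results[n] = 2.0 * n ** 3 / elapsed  # approx FLOP/s (2*n^3 FLOPs)
    return results

print(sweep([8, 16, 32]))
```

On real hardware the interesting signal is how measured FLOP/s and power draw change as the sizes grow, since that is where power capping and clock reductions show up.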
HBM (High Bandwidth Memory) is used in high-performance computing systems, offering higher bandwidth (due to increased parallelism) and lower power consumption compared to traditional memory types. The 3D stacking of memory dies can lead to increased heat density, making thermal management more challenging.
Sequential Stress: Sequential stress tests are designed to push the HBM to its limits by continuously reading and writing large blocks of data in a linear fashion. This helps assess the memory's ability to handle the high-throughput scenarios typical of graphics rendering and data-intensive computations.
Random Access Patterns: Random access pattern tests evaluate the HBM's performance when data is accessed in a non-sequential manner. This is important for applications that involve irregular data access, such as AI inference, database operations, or certain types of scientific computation.

Accelerator Stress

System Validation Strategies
CPU Computation: Test the CPU's ability to handle complex arithmetic operations under load, and verify that its data pipelines also deliver the expected level of performance and power consumption.
Cache Coherency: Ensure data consistency across all threads and sockets.
IO Throughput: Assess whether data transfer rates and integrity are sustained while transferring large blocks of random data across all supported components. Also ensure that, when errors occur, sufficient detail is provided to perform fault isolation.
Interconnect Performance: Evaluate latency and fault tolerance. Test interconnects like PCIe for signal integrity and stability, ensuring the system handles failures gracefully.
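The IO throughput strategy above (transfer large blocks of random data, confirm integrity, and report enough detail for fault isolation) can be sketched with per-block digests. A hedged illustration only: BLOCK_SIZE, the block count, and the choice of SHA-256 are assumptions, and the actual transfer across the component under test is stubbed out.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # hypothetical block size for this sketch

def make_blocks(count):
    """Generate random data blocks and record a digest for each one."""
    blocks = [os.urandom(BLOCK_SIZE) for _ in range(count)]
    digests = [hashlib.sha256(b).hexdigest() for b in blocks]
    return blocks, digests

def verify_blocks(blocks, digests):
    """Re-hash each block after the transfer. On mismatch, report which
    block failed so fault isolation has something concrete to start from."""
    errors = []
    for i, (blk, want) in enumerate(zip(blocks, digests)):
        got = hashlib.sha256(blk).hexdigest()
        if got != want:
            errors.append({"block": i, "expected": want, "actual": got})
    return errors

blocks, digests = make_blocks(8)
# ... transfer `blocks` across the component under test here ...
blocks[3] = os.urandom(BLOCK_SIZE)  # simulate a silent corruption
print(verify_blocks(blocks, digests))
```

Because the check is per block, a failure names the exact block (and could be extended to name the offset), rather than only reporting that the transfer as a whole went bad.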
Memory Subsystem
DDR Memory Stability: Perform stress tests and verify ECC. Run intensive read/write operations to test stability and ensure that the error-correcting code recovers single-bit errors and detects multi-bit errors.
Data & Access Patterns: Perform sweeps across different data and access patterns to characterize signal integrity and performance. Characterize the impact of rank crossing and random data transfers on memory bandwidth.
Power Spiking: Simulate spikes and ensure the thermal and power delivery networks remain stable while performing large idle-to-full-performance cyclical testing.

Compute Stress (CPU + DDR)

CPU Performance Impact
Thermal Issues: Excessive heat generation can lead to a reduction in clocks or overall CPU performance. This can result in:
Increased Latency: The CPU may take longer to process tasks, affecting overall system responsiveness.
Reduced Bandwidth: The ability to transfer data efficiently is compromised, impacting performance.
Correctable Error Floods: These errors trigger interrupts and writebacks to the device for correction, which can:
Significantly Reduce Bandwidth Performance: The system spends more time correcting errors (writebacks/interrupts) than processing data.
Lead to System Crashes: Floods of correctable errors reduce the headroom available for error recovery.

Accelerator Performance Impact
Thermal Issues: Like CPUs, accelerators can experience reduced performance due to excessive heat, leading to:
Increased Latency: Slower processing times for tasks handled by the accelerator.
GPU Falls off the Bus: A reduction in accelerator resources available to complete training or inference jobs.
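One cheap way to use performance as a health signal, in the spirit of the impact lists above, is to time a fixed workload repeatedly and flag iterations that run well below the best observed rate. A sketch under stated assumptions: the workload, the iteration count, and the 1.5x slowdown threshold are placeholders, not a calibrated detector.

```python
import time

def fixed_workload():
    """A small fixed CPU workload; a real harness would run a tuned kernel."""
    s = 0
    for i in range(200_000):
        s += i * i
    return s

def measure(iters=5):
    """Time the same workload repeatedly and flag iterations that run
    significantly slower than the best one: a cheap proxy for spotting
    clock throttling or correctable-error floods eating into bandwidth."""
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fixed_workload()
        times.append(time.perf_counter() - t0)
    best = min(times)
    # 1.5x-of-best threshold is an assumption; tune against known-good runs.
    flagged = [i for i, t in enumerate(times) if t > 1.5 * best]
    return times, flagged

times, flagged = measure()
print("suspect iterations:", flagged)
```

The same pattern scales up: track sustained throughput of a standard workload over a fleet, and treat a persistent drop on one host as a reason to pull it for thermal or RAS investigation.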
System Performance as a Health Indicator

Concurrency Testing (Red Under Test)

Silent Error Debug Example
Walk through a silent error example.
What exactly is a silent error? Errors that escape the RAS subsystem. For example, a DDR read from an incorrect address will generate the correct check bits, but the data captured will be incorrect. Other cases include core execution errors and data transfers over unprotected buses.
What kind of workload and access patterns generated the silent error?
Discuss how our workload loads the accelerator.
Discuss how data pattern checks capture silent errors.

Debug Example Talking Points

Workload Standardization
How can the industry align on a set of workloads that:
Provide cross-architectural support (x86/Arm/etc.).
Provide a sufficient level of configuration to target specific use cases, access patterns, and data patterns.
Generate standardized logging which automation can easily consume.

Establish a Collaborative Framework for Validation
Utilize the OCP's collaborative environment to create a shared platform where system validation teams, hyperscalers, and vendors can contribute to and access validation data. Develop agreements that outline the terms of data sharing, ensuring that all parties are comfortable with the level of transparency.

Leverage Industry
By sharing validation data, all participants in the industry can benefit from collective learning and innovation. This collaborative approach can lead to faster identification of systemic issues and faster root-causing.
Feedback Loops: Establish feedback loops where insights gained from shared data can be fed back into the design and validation processes, improving the overall quality and reliability of systems.

Call to Action

Thank You!