Open Systems for AI Strategic Initiative Contribution: "A Blueprint for Scalable Infrastructure"
A Directional White Paper
James Kelly, OCP Foundation
Summit Track: Artificial Intelligence (AI)

NEW 100-page whitepaper

From Component to AI Cluster:
- Standardized rack architectures supporting power envelopes of 250 kW to 1 MW
- Advanced liquid cooling solutions for high-density nodes
- High-voltage, high-efficiency power delivery systems
- Support for multiple, evolving scale-up and scale-out fabrics for performance
- Full stack development to enable a universal and truly open system for AI
- Comprehensive management frameworks for near-autonomous operations

Vision for Future Open Systems for AI Clusters
Processors, HPM, Chassis/Tray, Pod, Cluster

Taking a Holistic System View
Full-Stack thinking applied to integrate 3 manageable work areas:
- Systems Management
- DC IT Infrastructure
- DC Physical Infrastructure

DC Physical Infrastructure Key Challenges
AI-driven Trends:
- Chip power is rising
- More chips co-located within a rack
- Higher volume consumed by IT inside the rack
- Reduced volume inside rack for power, battery, networking and cooling
- Increase in rack power and power density
- More stringent serviceability requirements
New Technology Challenges:
- Power delivery to the chip
- Heat extraction from the chip, system/racks
- Rack power outpacing battery PSU improvements

DC Physical Infrastructure Planned Contributions
- Case studies for cooling of the accelerator chip
- High power rack (HPR) reference designs
- Understanding rack high-voltage power trade-offs
- Reference designs for sidecar rack units
- Reference designs:
  - High-density power distribution
  - High-power rack liquid cooling
  - Liquid cooling Coolant Distribution Units (CDUs)
  - High-voltage modular data centers
- Framework for data center sustainability metrics
- OCP Ready Data Center for AI clusters
- AI cluster design cost modeling tools

DC IT Infrastructure Key Challenges
AI-driven Trends:
- Higher-density servers with many GPUs
- Higher interconnect bandwidth demands
- Different requirements for xPU-to-xPU (back end) than for external-to-cluster (front end)
- Higher-capacity power supply shelves
- More resilient, automated system management
New Technology Challenges:
- Compute trays with liquid cooling, high-capacity power, and connections to scale-out networks
- Higher-capacity power supply trays
- Higher-bandwidth switch trays for scale-up
- In-rack highly redundant system management tray

DC IT Infrastructure Planned Contributions
Current Specification Work Efforts:
- High-density compute trays: power, cooling, interconnects, storage, UEC
- High-bandwidth switch trays: support for UALink, low latency, high port density
- High-capacity ORV3 power supply trays
- Automation-capable system management trays
Gaps for Future Consideration:
- Designs for increasing processing power
- New methods for storage and data management
- Novel compute tray power and cooling technologies
- Automation for scalability and flexibility

System Management Key Challenges
AI-driven Trends:
- Increased complexity within the AI cluster
- Rapid response to exception conditions
- Advanced orchestration capabilities
- Increased resiliency for improved uptime
New Technology Challenges:
- Standardized telemetry interfaces
- Common protocols for dynamic infrastructure management
- Autonomous operations

System Management Planned Contributions
Current Work Efforts:
- Development of Redfish profiles
- Review of draft Redfish profiles
- CDU and Rear-Door HX management
- Server Component Resilience specification
- Accelerator management interfaces
- Accelerator RAS requirements
- CPU firmware specification
- Rack manager interface guide and profile
Gaps for Future Consideration:
- GPU Management Stack
- Hardware Abstraction Layers
- Automation (Digital twins)

Join Open Systems for AI: opencompute.org/ai
- Join: Project Mailing Lists, Project Calls, Open Systems for AI Initiative
- Drive: Project Focus, Thought Leadership
- Run: Projects and Workstreams, Technical Steering Committee
- Contribute: Effort, Specifications, Products and Facilities
- Attend: OCP Summits, OCP Panels, Engineering Workshops, OCP Webinars, 3rd Party Events

Thank You!