AI Hardware & Systems

Mastering AI Cluster Management
Phil Pokorny, CTO, Penguin Solutions

Consider this

For an organization to make effective use of an AI cluster, it is important to take into consideration the entire process of designing, building, deploying and managing the resource. At each step, a cluster for AI presents new and different challenges that even experienced IT team members may not have encountered before.

Agenda

- Explore AI clusters from design to daily management
- Key considerations when designing an AI cluster
  - Cooling
  - Power
- How software complexities factor into cluster operations
  - Cluster start up
  - Day to day management
  - Change control

2024 Penguin Solutions. All Rights Reserved.

Power is the foundation for AI

- CPU rack (40 nodes): 10 kW
- A100 rack (2 nodes): 13 kW
- H100 rack (2 nodes): 22 kW
- H100 rack (4 nodes): 44 kW
- B100 rack (GTC): 120 kW

Air cooling limitations and the move to liquid or immersion cooling

- Commonly accepted limit of 30 kW per rack on air
  - Requires careful hot/cold aisle sealing
  - Wider cold and hot aisles for more airflow
- Liquid cooling can handle 10x more power per rack
- Penguin offers a variety of server designs
  - Traditional air cooled
  - Direct-to-chip liquid cooled
  - "Born to be immersed" designs
- Custom designed servers and cooling
  - 100% capture, 80/20 or 75/25 liquid/air cooling
  - Material compatibility requires choosing a complete solution
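The rack power figures and the 30 kW air limit above can be combined into a quick feasibility check. The following is a minimal sketch, not a sizing tool: the function names `air_load_kw` and `fits_on_air` are hypothetical, and it assumes that a liquid/air split such as 80/20 means 80% of the rack's heat is removed by liquid and the remaining 20% must still be handled by room air.

```python
# Commonly accepted per-rack limit for air cooling, as quoted on the slide.
AIR_LIMIT_KW = 30.0

# Rack power figures from the "Power is the foundation for AI" slide.
RACKS_KW = {
    "CPU rack (40 nodes)": 10.0,
    "A100 rack (2 nodes)": 13.0,
    "H100 rack (2 nodes)": 22.0,
    "H100 rack (4 nodes)": 44.0,
    "B100 rack (GTC)": 120.0,
}


def air_load_kw(rack_kw: float, liquid_fraction: float) -> float:
    """Heat in kW left for room air once liquid removes `liquid_fraction` of it."""
    if not 0.0 <= liquid_fraction <= 1.0:
        raise ValueError("liquid_fraction must be between 0 and 1")
    return rack_kw * (1.0 - liquid_fraction)


def fits_on_air(rack_kw: float, liquid_fraction: float = 0.0,
                limit_kw: float = AIR_LIMIT_KW) -> bool:
    """True if the residual air-side heat load is within the air cooling limit."""
    return air_load_kw(rack_kw, liquid_fraction) <= limit_kw


if __name__ == "__main__":
    # Splits named on the slide: air only, 75/25, 80/20 and 100% capture.
    for name, kw in RACKS_KW.items():
        for fraction in (0.0, 0.75, 0.80, 1.0):
            residual = air_load_kw(kw, fraction)
            verdict = "ok" if fits_on_air(kw, fraction) else "exceeds air limit"
            print(f"{name}: {fraction:.0%} liquid -> {residual:.1f} kW on air ({verdict})")
```

Under these assumptions, the 4-node H100 rack and the B100 rack both exceed the air limit when cooled by air alone, which is the motivation the slides give for liquid or immersion cooling.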
Power and the need to move to 240 or 277V

- 120/208V is a problem for high-power servers
  - Existing data center infrastructure
  - Low current connectors (30 or 50 Amp)
- 240/415V is an obvious first consideration
  - Twice the power for the same current rating
  - Compatible with all international AC power inputs
- 277/480V provides additional benefits
  - Saves money on infrastructure
  - Increases available power for the same copper
  - But requires new power supply standards
- Penguin has Tundra OCP and EIA 19-inch servers powered at 277V

How Penguin's unique Scyld ClusterWare software addresses complexities

- What is Scyld (pronounced like "highly skilled") and Scyld ClusterWare?
  - The name Scyld comes from the epic Beowulf; Beowulf-style clusters are the dominant form of HPC computing today
  - ClusterWare is the result of almost 30 years of development and innovation in cluster management software
- A complete system of configuration, services and processes to make multiple, independent servers act in a coordinated fashion to solve the world's most challenging problems
  - Network services to support boot and configuration
  - Features to manage changes to the cluster over time
  - Monitoring and data collection to detect problems and report status

Using Scyld ClusterWare to boot a new cluster

- Scyld requires only a minimum of functionality to get a node into the cluster
  - A functioning network interface with PXE support
- Once booted, the node is part of the cluster and can be inspected and configured if necessary
  - Using a state machine and Ansible to ensure consistent hardware configuration
- Provision a software load
  - Multiple local OS options and protocols
  - Local storage, read-only RAM image, shared RO root, etc.
  - HTTP, NFS, iSCSI and others

Using Scyld ClusterWare day to day

- Everything you need for clustering is included, but also many options
  - Open source Torque scheduler or SchedMD SLURM
  - MPI libraries
  - High performance network software
- And with so much software, there are always updates users want or need
- Change control is how you ensure that every machine is and remains consistent
  - Git repositories for configuration files
  - Multiple OS images (last, next, testing, special users) can be easily assigned to nodes
  - Roll back from unexpected deployments by booting into the previous known good image
  - Try different configurations for specific users
- Cluster telemetry to a central database for plotting and reporting

Conclusions

In summary:
- High power servers make demands on data centers designed for yesterday's servers
- Invest in upgrading infrastructure for the future (higher voltage and liquid cooling)
- Work with experts

Q&A

Questions? Open discussion

Thank you!
Phil.PokornyPenguinS | Connect with me on LinkedIn
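Supplementary worked example: the power slide states that 240/415V gives twice the power of 120/208V for the same current rating, and that 277/480V increases it further. The sketch below illustrates that arithmetic for a balanced three-phase feed. The function names `three_phase_kw` and `feeds_needed` are hypothetical, and the calculation assumes unity power factor and no continuous-load derating, so real circuits will deliver somewhat less.

```python
import math


def three_phase_kw(line_to_neutral_v: float, amps: float) -> float:
    """Ideal power of a balanced three-phase feed: 3 x V(line-to-neutral) x I.

    Assumes unity power factor and no derating (illustrative simplification).
    """
    return 3 * line_to_neutral_v * amps / 1000.0


def feeds_needed(rack_kw: float, line_to_neutral_v: float, amps: float) -> int:
    """Minimum number of such feeds needed to supply `rack_kw` (no redundancy)."""
    return math.ceil(rack_kw / three_phase_kw(line_to_neutral_v, amps))


if __name__ == "__main__":
    # The three systems named on the slide, keyed by line-to-neutral voltage.
    systems = {"120/208V": 120.0, "240/415V": 240.0, "277/480V": 277.0}
    rack_kw = 44.0  # the 4-node H100 rack figure from the slides
    for amps in (30, 50):  # connector ratings mentioned on the slide
        for label, volts in systems.items():
            kw = three_phase_kw(volts, amps)
            count = feeds_needed(rack_kw, volts, amps)
            print(f"{label} at {amps} A: {kw:.2f} kW per feed, "
                  f"{count} feed(s) for a {rack_kw:.0f} kW rack")
```

Because the current rating is fixed by the connector and the copper, raising the line-to-neutral voltage is the only term in this formula that increases the power available per feed.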