Mastering AI Cluster Management.pdf

No. 464907, PDF, 11 pages, 1.57 MB

AI Hardware & Systems

Mastering AI Cluster Management
Phil Pokorny, CTO, Penguin Solutions

Consider this

For an organization to make effective use of an AI cluster, it is important to take into consideration the entire process of designing, building, deploying and managing the resource. At each step, a cluster for AI presents new and different challenges that even experienced IT team members may not have encountered before.

Agenda

Explore AI clusters from design to daily management
- Key considerations when designing an AI cluster
  - Cooling
  - Power
- How software complexities factor into cluster operations
  - Cluster start up
  - Day to day management
  - Change control

2024 Penguin Solutions. All Rights Reserved.

POWER IS THE FOUNDATION FOR AI

Rack configuration        Power per rack
CPU rack (40 nodes)       10 kW
A100 rack (2 nodes)       13 kW
H100 rack (2 nodes)       22 kW
H100 rack (4 nodes)       44 kW
B100 rack (GTC)           120 kW

Air cooling limitations and the move to liquid or immersion cooling

- Commonly accepted limit of 30 kW per rack on air
  - Requires careful hot/cold aisle sealing
  - Wider cold and hot aisles for more airflow
- Liquid cooling can handle 10x more power per rack
- Penguin offers a variety of server designs
  - Traditional air cooled
  - Direct to chip liquid cooled
  - "Born to be immersed" designs
- Custom designed servers and cooling
  - 100% capture, 80/20 or 75/25 liquid/air cooling
  - Material compatibility requires choosing a complete solution

Power and the need to move to 240 or 277V

- 120/208V is a problem for high-power servers
  - Existing data center infrastructure
  - Low current connectors (30 or 50 Amp)
- 240/415V is an obvious first consideration
  - Twice the power for the same current rating
  - Compatible with all international AC power inputs
- 277/480V provides additional benefits
  - Saves money on infrastructure
  - Increases available power for the same copper
  - But requires new power supply standards
- Penguin has Tundra OCP and EIA 19 inch 277V powered servers

How Penguin's unique Scyld ClusterWare software addresses complexities

- What is Scyld (pronounced like "highly skilled") and Scyld ClusterWare?
  - The name Scyld comes from the epic Beowulf; Beowulf-style clusters are the dominant form of HPC computing today
  - ClusterWare is the result of almost 30 years of development and innovation in cluster management software
- A complete system of configuration, services and processes to make multiple, independent servers act in a coordinated fashion to solve the world's most challenging problems
  - Network services to support boot and configuration
  - Features to manage changes to the cluster over time
  - Monitoring and data collection to detect problems and report status

Using Scyld ClusterWare to boot a new cluster

- Scyld requires only a minimum of functionality to get a node into the cluster
  - A functioning network interface with PXE support
- Once booted, the node is part of the cluster and can be inspected and configured if necessary
  - Using a state machine and Ansible to ensure consistent hardware configuration
- Provision a software load
  - Multiple local OS options and protocols
    - Local storage
    - Read-only RAM image
    - Shared RO root, etc.
    - HTTP, NFS, iSCSI and others

Using Scyld ClusterWare day to day

- Everything you need for clustering is included, but also many options
  - Open source Torque scheduler or SchedMD SLURM
  - MPI libraries
  - High performance network software
- And with so much software, there are always updates users want or need

Change control is how you ensure that every machine is and remains consistent

- Git repositories for configuration files
- Multiple OS images (last, next, testing, special users) can be easily assigned to nodes
  - Roll back from unexpected deployments by booting into the previous known good image
  - Try different configurations for specific users
- Cluster telemetry to a central database for plotting and reporting

Conclusions

In summary:
- High power servers make demands on data centers designed for yesterday's servers
- Invest in upgrading infrastructure for the future (higher voltage and liquid cooling)
- Work with experts

Q&A

Questions? Open discussion
Thank you!
Phil.PokornyPenguinS | Connect with me on LinkedIn
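
The power slides claim that moving from 120/208V to 240/415V gives "twice the power for the same current rating". A rough way to check this is the balanced three-phase formula P = sqrt(3) x V(line-to-line) x I. The sketch below applies it to the rack figures quoted in the deck. The 80% continuous-load derating, the unity power factor, and the function names are assumptions made for illustration; they do not come from the slides.

```python
import math

# Rack power figures quoted on the "Power is the foundation for AI" slide.
RACK_KW = {
    "CPU rack (40 nodes)": 10,
    "A100 rack (2 nodes)": 13,
    "H100 rack (2 nodes)": 22,
    "H100 rack (4 nodes)": 44,
    "B100 rack (GTC)": 120,
}


def three_phase_kw(line_to_line_volts, breaker_amps, derate=0.8, power_factor=1.0):
    """Usable power of one balanced three-phase circuit, in kW.

    derate=0.8 and power_factor=1.0 are illustrative assumptions,
    not values taken from the slides.
    """
    volt_amps = math.sqrt(3) * line_to_line_volts * breaker_amps
    return volt_amps * derate * power_factor / 1000.0


def circuits_needed(rack_kw, circuit_kw):
    """Smallest whole number of circuits that covers the rack load."""
    return math.ceil(rack_kw / circuit_kw)


if __name__ == "__main__":
    for volts in (208, 415, 480):
        per_circuit = three_phase_kw(volts, 30)
        print(f"{volts} V, 30 A: {per_circuit:.1f} kW per circuit")
        for name, kw in RACK_KW.items():
            print(f"  {name}: {circuits_needed(kw, per_circuit)} circuit(s)")
```

Under those assumptions a 30 A circuit delivers roughly 8.6 kW at 208V, 17.3 kW at 415V and 20.0 kW at 480V, so the 44 kW H100 rack needs six circuits at 208V but only three at 415V. Redundant feeds, which real deployments usually add, are not modeled here.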

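The change-control slide describes assigning named OS images (last, next, testing) to nodes and rolling back by booting the previous known good image. The following is a minimal sketch of that idea as a plain data structure. It is not the Scyld ClusterWare API: the `Node` class, its methods, and `inconsistent_nodes` are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical record of which OS image a node should boot."""
    name: str
    history: list = field(default_factory=list)  # images assigned, oldest first

    @property
    def current_image(self):
        return self.history[-1] if self.history else None

    def assign(self, image):
        """Record a new image; the node would boot it on its next reboot."""
        self.history.append(image)

    def roll_back(self):
        """Return to the previously assigned image."""
        if len(self.history) < 2:
            raise ValueError(f"{self.name} has no earlier image to roll back to")
        self.history.pop()
        return self.current_image


def inconsistent_nodes(nodes, expected_image):
    """Names of nodes whose current image differs from the expected one."""
    return [n.name for n in nodes if n.current_image != expected_image]


if __name__ == "__main__":
    nodes = [Node(f"n{i}") for i in range(4)]
    for node in nodes:
        node.assign("last")
    nodes[0].assign("testing")               # trial image on a single node
    print(inconsistent_nodes(nodes, "last"))
    nodes[0].roll_back()                     # back to the known good image
    print(inconsistent_nodes(nodes, "last"))
```

Keeping the assignment history per node is what makes the rollback a single step, and the consistency check is the kind of query that cluster telemetry could run on a schedule.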