MANAGING COMPOSABLE DISAGGREGATED INFRASTRUCTURE WITH OFA SUNFISH

Christian Pinto
Staff Research Scientist, IBM Research Europe
Co-chair, OpenFabrics Alliance Management Framework Workgroup
2024 OFA Virtual Workshop

CONTRIBUTORS
- Michele Gazzetti (IBM Research Europe)
- Phil Cayton (Intel)
- Russ Herrell (Hewlett Packard Enterprise)
- Michael Aguilar (Sandia National Labs)
- Brian Pan (H3 Platform)
- Ziyan Zhuang (H3 Platform)
- Jin Hase (Fsas Technologies Inc.)
- Naoki Oguchi (Fsas Technologies Inc.)

OFA SUNFISH
- Network (fabric)-disaggregated infrastructure is becoming the state of the art.
- No common fabric manager interface or fabric model is available to link applications with remote resources.
- Administrators are asked to manage an increasingly heterogeneous fabric infrastructure.
- Automation is difficult because different fabrics require different optimizations.

OUR PROMISE
[Figure: Sunfish Framework architecture diagram. Labels recovered from the slide:
- Domains: Application Domain, Administration Domain
- Sunfish Services: RESTful API (RF/SF); Sunfish Core Services (Resource Inventory, Data Store, RF tree management, Authentication, Access Control, Events & Logs)
- Sunfish Agents: InfiniBand Agent, CXL Agent, NVMeoF Agent, Additional Agents; Redfish/Native API Translation (Redfish API, Vendor Native API)
- Hardware Managers: CXL Manager(s), InfiniBand Manager, NVMeoF Manager, Additional Manager
- Fabric Resources: CXL HW, NVMe HW, InfiniBand HW, ? HW
- Other labels: Clients; Composability and Resources Managers; Resource Managers (e.g., Compute, FAM, Storage, Fabric); Composability Layer; Composition Policies; Data Store; Monitoring; Infrastructure management; Systems composition, Systems update; App-driven system reconfiguration; Events & Logs; Events; Resource Configuration; Fabric Configuration]

OUR FOCUS SO FAR
[Figure: the same Sunfish Framework architecture diagram, shown again on three consecutive slides with identical labels.]
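The architecture diagram has clients reaching Sunfish through a RESTful API (RF/SF). As a minimal sketch only, the Python below shows how a client might list the members of a Redfish-style collection returned by such an API. The server address `SUNFISH_URL`, the helper names `fetch_collection` and `member_ids`, and the sample payload are illustrative assumptions and are not taken from the Sunfish reference implementation; only the `/redfish/v1/Systems` path and the `Members` / `@odata.id` fields follow general Redfish conventions.

```python
import json
from urllib.request import urlopen

# Hypothetical address of a Sunfish server (assumption, not from the slides).
SUNFISH_URL = "http://localhost:5000"


def member_ids(collection: dict) -> list[str]:
    """Return the @odata.id of every member of a Redfish-style collection."""
    return [member["@odata.id"] for member in collection.get("Members", [])]


def fetch_collection(path: str, base_url: str = SUNFISH_URL) -> dict:
    """GET a Redfish resource over HTTP and decode its JSON body."""
    with urlopen(base_url + path) as response:
        return json.load(response)


# Illustrative payload shaped like a Redfish collection of computer systems.
SAMPLE_SYSTEMS = {
    "@odata.id": "/redfish/v1/Systems",
    "Members@odata.count": 2,
    "Members": [
        {"@odata.id": "/redfish/v1/Systems/Sys1"},
        {"@odata.id": "/redfish/v1/Systems/Sys2"},
    ],
}

if __name__ == "__main__":
    # Parse the sample offline; against a live server one would call
    # member_ids(fetch_collection("/redfish/v1/Systems")) instead.
    print(member_ids(SAMPLE_SYSTEMS))
```

The parsing step is kept separate from the HTTP call so it can be exercised without a running server.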
WHY SUNFISH?
[Figure: physical and logical views of the infrastructure. Labels: Root; Fabrics (CXL, NVMeoF, PCIe); Systems (Sys1 to Sys5); resource pools (Mem, Acc, NVMe, GPU); interconnect fabrics; compute nodes; Physical; Logical.]
- Only composed systems are of interest to the client.
- Sunfish abstracts away the fabric and connection details, providing the logical view of a "classic" computer system.

KUBERNETES COMBINED WITH CDI
- A user requests a pod that uses a GPU from Kubernetes (K8s).
- If no adequate GPU is available, Kubernetes asks CDI to attach a GPU to a worker node.
- Kubernetes maps the scheduled pod to the attached GPU.
- CDI follows the Sunfish architecture.
- DRA: the Dynamic Resource Allocation feature in Kubernetes.
[Figure: labels: User; Kubernetes Cluster; Control Plane; Controller (DRA controller); Worker node (Composed Baremetal); Plugin (DRA kubelet plugin); Resource Claim; POD (Container); GPU; Request resource (GPU); CDI; CDI manager; Sunfish Composition Service (GUI/CLI/Redfish API); Sunfish Service (Core library); Fabric Manager; Resource Pool (GPU, SSD, DPU, Memory).]

CDI USE CASE
Power saving by adjusting the number of GPUs in the base station

Conventional operation
- The maximum number of GPUs is always running, on the assumption of maximum load. In the example below, 10 GPUs are always powered on.
[Chart: load (data traffic) and number of GPUs powered on, over time.]

Problem
- Power consumption is constantly high (e.g., an H100 draws 49 W at idle per GPU).

Solution by CDI
- CDI automatically increases or decreases the number of GPUs based on load.
- Unused GPUs are returned to the resource pool to be powered down, which saves power.
[Chart: load (data traffic), number of GPUs running, and number of GPUs not running (powered off) in the resource pool, over time. GPUs are reduced at low load and increased at heavy load.]

Improvement
- 6% reduction in power consumption by powering down idle GPUs.
- 25% of total GPU cycles are then available for other workloads.

SUNFISH IS FINALLY OUT!
The OFMF Workgroup is happy to announce the first official release of the Sunfish Framework:
- Official documentation
- Reference software implementation

DOCUMENTATION
- Requirements and normative references for implementing a fully compliant Sunfish Framework, Hardware Agent and Client
- Sunfish framework components design and interactions description
- Interactions between Sunfish and Hardware Agents
- Hardware Agents lifecycle management (registration, failover, etc.)
- Redfish/Swordfish schema objects adopted
- Additions to Redfish schema
- Sunfish-specific Redfish modeling requirements (e.g., CXL Fabric Attached Memory)
Documentation download

REFERENCE SW IMPLEMENTATION
Reference Sunfish Core Library
- Code: https:/
- Implements the Sunfish core services as a Python library
  - Redfish tree management
  - Interactions with Hardware Agents
  - Events brokerage
Reference Sunfish Server
- Code: https:/
- RESTful API for the Sunfish core library
Reference Sunfish Hardware Agent
- Work on the CXL Hardware Agent is in progress
- Agent API to the Sunfish Server is being developed for CXL FAM
- Agent backend is being developed for CXL fabric mock-ups

THE SUNFISH COMMUNITY
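The reference Sunfish Core Library is described above as a Python library that provides Redfish tree management and events brokerage. The toy below is a rough sketch of those two ideas only, and makes no claim about how the real library is written. The class `ToyRedfishTree`, its methods, and the callback signature are all hypothetical names introduced for illustration; the event-type strings are borrowed from Redfish event naming purely as labels.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical callback type: receives the event type and the stored object.
Callback = Callable[[str, dict], None]


class ToyRedfishTree:
    """In-memory store of Redfish-style objects keyed by their @odata.id."""

    def __init__(self) -> None:
        self._objects: dict[str, dict] = {}
        self._subscribers: dict[str, list[Callback]] = defaultdict(list)

    def subscribe(self, event_type: str, callback: Callback) -> None:
        """Register a callback to be invoked for one event type."""
        self._subscribers[event_type].append(callback)

    def put(self, obj: dict) -> None:
        """Insert or replace an object, then notify matching subscribers."""
        path = obj["@odata.id"]
        event_type = "ResourceUpdated" if path in self._objects else "ResourceAdded"
        self._objects[path] = obj
        for callback in self._subscribers[event_type]:
            callback(event_type, obj)

    def get(self, path: str) -> dict:
        """Return the object stored at the given @odata.id."""
        return self._objects[path]

    def children(self, path: str) -> list[str]:
        """List the ids stored exactly one level below the given path."""
        prefix = path.rstrip("/") + "/"
        return sorted(
            p for p in self._objects
            if p.startswith(prefix) and "/" not in p[len(prefix):]
        )


if __name__ == "__main__":
    tree = ToyRedfishTree()
    tree.subscribe("ResourceAdded", lambda kind, obj: print(kind, obj["@odata.id"]))
    tree.put({"@odata.id": "/redfish/v1/Fabrics/CXL"})
    print(tree.children("/redfish/v1/Fabrics"))
```

Keying objects by `@odata.id` keeps the sketch close to how Redfish resources address one another, but a real service would also need persistence, validation against schemas, and authentication, none of which is attempted here.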
FIRST HARDWARE AGENT FOR CXL MEMORY

DEMONSTRATION
[Figure: demonstration setup. Labels: Root; Fabrics: CXL (H3 Platform), PCIe (Liqid); Systems: Sys1, Sys2, Sys3; Mem; GPU; Endpoints: S1I3, S1T11; Endpoints: gpu1, pcpu1; Connection to be established.]
[Figure: Sunfish Framework with a CXL Agent and CXL Manager in front of the Memory Pool (H3), and a PCIe Agent and PCIe Manager in front of the GPU Pool (Liqid). Steps shown:
1. Registration Event (one per agent)
2. Agent tree crawling]
[Figure: the same setup driven by a client using Curl. Steps shown:
1. Attach GPU or CXL memory to hosts
2. Attach GPU / Attach Memory]

CONCLUSIONS AND NEXT STEPS
- The Sunfish community is growing rapidly, and we are targeting further hardware vendors to create an ecosystem of agents.
- Focus on integrating with clients (e.g., Kubernetes, Flux) to demonstrate the value of a single API approach.
- Sunfish will be at SC24 in Atlanta, GA.

Join the community
Contributions welcome:
- Workload managers integration
- Parallel computing libraries integration
- More agents for real disaggregated hardware products
How to join:
- Meeting weekly on Fridays, 7am Pacific Time: https://www.openfabrics.org/my-calendar/#mc_calendar_05_2802-calendar-details-my-calendar
- Join the mailing list: https://lists.openfabrics.org/mailman/listinfo/ofmfwg
- Reach out for information:
  - Christian Pinto:
  - Michael Aguilar: mjaguil@sandia.gov

THANK YOU
Christian Pinto
IBM Research Europe
2024 OFA Virtual Workshop