Life cycle management for hyperscale introduces unique challenges for manageability. This presentation showcases a blueprint for a data model that can be aggregated across various levels and stages of hyperscale evolution.

Data Center Life Cycle Management Scale
Nirav Shah, Cloud Software Architect, Intel
Jim Harford, System Architect, Broadcom
Scott Ramsey, Technologist, Dell Technologies

SUSTAINABLE SCALABLE COMPUTATIONAL INFRASTRUCTURE
HW MGMT

CSM WG Overview
Purpose
- Define a comprehensive at-scale remote service model (inclusive of different sizes)
- Standardize interfaces for the at-scale remote service model and enable added services
- Deliver services across the boundaries of ownership
- Ability to integrate vendor tools using a common framework
- Influence spans across multiple OCP disciplines

Life Cycle Opportunities Recap
Life cycle stages: 1 Plan/Design, 2 Procure/Deploy, 3 Operate, 4 Decommission
Areas shown: Sustainability, Modular Data Center, Ready Facility Recognition, Networking, Security, Storage, Server, Power, F/W, Hardware Management, Fault Management

Data Center scale
Tens of thousands of geo-located and/or distributed servers
[Diagram: racks labeled Rack 1, Rack N (geo-located), Rack Z (distributed)]
A universal data model
across components, systems and physical/virtual aggregates.
Use-case-driven, top-down approach to managing a hyperscale data center.
[Diagram labels: Physical Aggregate, Virtual Aggregate]

Life Cycle Interconnect
[Flowchart across the four life cycle stages (1 Plan/Design, 2 Procure/Deploy, 3 Operate, 4 Decommission). Steps: Define, Acquire, Configure, Validate, Utilize, Decommission. Decision points: Target Performance?, Healthy?, Config issue?, Bad Parts?]
- Life cycle cannot be a waterfall
- Need to (re)plan, (re)procure/deploy, decommission at any stage

Utilization Use cases
- Define power budget and optimal utilization/power ratio
- Identify system/aggregate configurations meeting power and utilization budgets
- Identify component vendors supporting configurations and budgets
- Measure utilization for debuggability, efficiency and/or metering
- Measure average performance o/power consumption for TCO
- Measure conformance to spec'd performance targets
- Detect thermal margining overloads
- Detect power margining overloads
- Compare the target utilization to actual operation

Utilization
Utilization Requirements Example
- Utilization: snapshot of component/system/aggregate utilization
- Power utilization: snapshot of power utilized by component/system/aggregate
- Thermal utilization: snapshot of operating temperature of component/system/aggregate
- Uptime: time in seconds since the last reset/power cycle
- Average utilization: utilization averaged over uptime (set to 0 on reset/power cycle)
- Average power utilization: power utilization averaged over uptime (set to 0 on reset/power cycle)
- Average thermal utilization: thermal output averaged over uptime (set to 0 on reset/power cycle)
- Average Uptime: uptime in seconds averaged over the number of resets
- Power threshold: highest power utilization above which component/system/aggregate “may” be throttled/reset
- Thermal threshold: highest temperature above which component/system/aggregate “may” be power-cycled
- Utilization threshold: lowest utilization at which component/system/aggregate “may” not operate or sleep

Health Status
Health Status Use cases
- Define availability/redundancy targets
- Define fault tolerance targets
- Define fault recovery/sparing targets
- Identify suppliers/configurations that meet availability, fault tolerance and recovery targets
- Measure correctable and uncorrectable faults against thresholds
- Track health status to prevent failures
- Spare failed or about-to-fail hardware
- Identify patterns of failures and proactively address those failures
- Back up and offline faulty hardware
- Postmortem analysis
- Sustainably decommission faulty hardware

Health Requirements Example
Fault/Error
- Failure Id: 256-bit unique ID allocated to any new failure with distinct symptoms
- Type: correctable, uncorrectable but not fatal, or fatal
- Count: number of error occurrences since component/system/aggregate installed
- Fix Status: Unknown/Fixed/Not available/Available/Corrected
- Telemetry Signature: log of a unique set of events leading up to the error
Health
- Correctable error #: count of all corrected errors since component/system/aggregate installed
- Fatal error #: count of all fatal errors since component/system/aggregate installed
- Uncorrectable non-fatal error #: count of all uncorrectable non-fatal errors since component/system/aggregate installed
- Health Score: a positive, decreasing number representing the current health of a component/system/aggregate
- Health Warning Threshold: threshold below which component/system/aggregate operates sub-optimally
- Health Critical Threshold: threshold below which component/system/aggregate “may” break down

Configuration
Network Configuration Use Cases Example
- Define required NIC port speeds and partitioning of PCIe PFs among NIC ports
- Define TX traffic shaping characteristics for multiple classes of service to be used on NICs
- Upgrade NIC firmware, driver(s), and/or SW tool(s)
- Configure network and link parameters needed for basic connectivity to the network
- Configure TX traffic shaping characteristics for multiple classes of
  service to be used on NIC
- Optimize performance via adjustments to BIOS settings, NIC resource allocation, flow steering, and pinning of interrupts to specific CPUs

Configuration Requirement Example
- Initiation and monitoring of upgrades to NIC firmware, driver(s), and SW tool(s)
- Configuration of NIC hardware parameters
- Configuration of server BIOS settings
- Download of scripts to local storage used by OS running on server
- Execution of configuration scripts by OS running on server
- Discovery of results from other (opaque to OCP) configuration mechanisms, for example switch-NIC neighbor configuration via LLDP

Performance
Performance Use cases
- Determine benchmarks for off-line performance testing of components/systems/aggregates
  o “micro-benchmarks” (perf, all-gather, fio, SPEChpc)
  o synthetic workloads that resemble the demands of production workloads
- Run off-line performance benchmarks on properly configured elements
- Measure and record performance statistics while running production workloads
- Compare performance results of off-line benchmarks versus production workloads

Performance Requirement Example
- Availability of performance “micro-benchmarks”
- Availability of synthetic workloads relevant to a given data center or cloud environment
- Availability of aggregate-level benchmarks to simulate a data center environment
- Installation, initiation, and monitoring of benchmark programs on server(s)
- Configuration of any HW parameters that affect performance
- Download of test scripts to local storage used by OS running on server
- Execution of test scripts by OS running on server
- Access to log files or other artifacts containing measured performance statistics
- Integration of the logs into system-level telemetry for a customizable view
- Configurability at component/system/aggregate level for the scale of telemetry

Summary
- Data centers are growing at an exponential scale
- Data centers already at hyperscale introduce unique challenges
- Not just the scale but also the required sophistication presents a need for a top-down model
- Identify opportunities for additional needed standardization
- Walk through scenarios to present unique use cases and requirements
- Invite industry to help define the framework that meets these requirements
- Publish a white paper with requirements for hyperscale life cycle manageability

Call to Action
Engage with the OCP Hardware Management Project “OCP-Cloud Service Model” workstream to participate in defining “Data Center Life Cycle Management Scale” requirements and interfaces
https://ocp-all.groups.io/g/Cloud-Service-Model/calendar
https://www.opencompute.org/projects/hardware-management

Thank you!
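
Illustrative sketch: the utilization fields listed under the Utilization Requirements Example could be modeled as a small record that is sampled per component and rolled up into systems and aggregates. This is only a sketch under stated assumptions. The names `UtilizationRecord`, `sample`, `reset`, `alerts` and `roll_up`, the units (fraction, watts, Celsius), and the roll-up rules (sum power, take the hottest member, average utilization) are hypothetical and are not defined by the presentation or by any OCP schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UtilizationRecord:
    """Utilization fields for one component, system, or aggregate.

    Field names and units are illustrative assumptions, not an OCP schema.
    """
    name: str
    utilization: float = 0.0        # snapshot, fraction 0.0-1.0 (assumed unit)
    power_w: float = 0.0            # snapshot of power drawn, watts (assumed unit)
    temperature_c: float = 0.0      # snapshot of operating temperature, Celsius
    uptime_s: float = 0.0           # seconds since the last reset/power cycle
    avg_utilization: float = 0.0    # averages over uptime; 0 after a reset
    avg_power_w: float = 0.0
    avg_temperature_c: float = 0.0
    power_threshold_w: Optional[float] = None
    thermal_threshold_c: Optional[float] = None
    utilization_threshold: Optional[float] = None

    def sample(self, utilization: float, power_w: float,
               temperature_c: float, interval_s: float) -> None:
        """Store a snapshot that held for interval_s seconds and fold it
        into the time-weighted averages over uptime."""
        if interval_s <= 0:
            raise ValueError("interval_s must be positive")
        total = self.uptime_s + interval_s
        w_old, w_new = self.uptime_s / total, interval_s / total
        self.avg_utilization = self.avg_utilization * w_old + utilization * w_new
        self.avg_power_w = self.avg_power_w * w_old + power_w * w_new
        self.avg_temperature_c = self.avg_temperature_c * w_old + temperature_c * w_new
        self.utilization = utilization
        self.power_w = power_w
        self.temperature_c = temperature_c
        self.uptime_s = total

    def reset(self) -> None:
        """Model a reset/power cycle: uptime and averages are set to 0."""
        self.uptime_s = 0.0
        self.avg_utilization = 0.0
        self.avg_power_w = 0.0
        self.avg_temperature_c = 0.0

    def alerts(self) -> List[str]:
        """Return which thresholds the current snapshot crosses."""
        out: List[str] = []
        if self.power_threshold_w is not None and self.power_w > self.power_threshold_w:
            out.append("power")        # "may" be throttled/reset
        if self.thermal_threshold_c is not None and self.temperature_c > self.thermal_threshold_c:
            out.append("thermal")      # "may" be power-cycled
        if self.utilization_threshold is not None and self.utilization < self.utilization_threshold:
            out.append("utilization")  # "may" not operate or sleep
        return out


def roll_up(name: str, members: List[UtilizationRecord]) -> UtilizationRecord:
    """Build an aggregate snapshot from member snapshots.

    Assumed rules: power is summed, temperature is the hottest member,
    utilization is the unweighted mean.
    """
    if not members:
        raise ValueError("an aggregate needs at least one member")
    return UtilizationRecord(
        name=name,
        utilization=sum(m.utilization for m in members) / len(members),
        power_w=sum(m.power_w for m in members),
        temperature_c=max(m.temperature_c for m in members),
    )
```

Because `roll_up` returns the same record type it consumes, the same call can be repeated from component to system to physical or virtual aggregate, which is one way to read the "universal data model" idea. A real design would likely weight utilization by capacity instead of using a plain mean.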