NVIDIA's Power Tool for Supercomputing Cluster Network Management: UFM CYBER-AI
Dec 2020
#page#
REVOLUTIONIZING SUPERCOMPUTING
InfiniBand Data Center Cyber Intelligence and Analytics Powered by Artificial Intelligence
"Supercomputers Hacked Across Europe to Mine Cryptocurrency"
"Cost of Hourly Downtime Increases: 88% of Firms Say Hourly Downtime Costs Exceed $300k"
Tech, March/April 2020
Scalable supercomputers host many application users and a variety of research and simulation jobs
Based on the ITIC 2020 global server hardware survey*, 88% of firms say hourly downtime costs exceed $300k
Users may leverage data center access to run prohibited applications, resulting in an increase in
operation costs and misuse of compute resources
The new NVIDIA Mellanox UFM Cyber-AI platform combines enhanced, real-time network telemetry with AI-powered cyber intelligence and analytics to discover operational anomalies and predict network failures for preventive maintenance
* /cas/DVOxZV6F
#page#
UFM PLATFORMS OPTIMIZE SUPERCOMPUTING OPEX
InfiniBand Data Center Cyber Intelligence and Analytics Powered by Artificial Intelligence
Build a rich database with real-time network telemetry information, workload usage, system configuration and more
Provide enhanced network monitoring and management, workload optimizations and periodic configuration checks
Learn the system's normal operation, condition, and usage
Detect performance degradations over time or changes in conditions
Provide predictive analytics and alerts of abnormal system and application behavior, potential system failures and security threats, and address future
failures more efficiently, saving OPEX and maintaining end-user SLAs
Predictability is optimized over time as system data is collected
UFM Cyber-AI Video
#page#
REVOLUTIONIZING SUPERCOMPUTING
AI-Powered InfiniBand Cyber Intelligence and Analytics Platform
Management and Orchestration
Cyber-security and Anomaly Detection
Telemetry and Monitoring
Predictive and Preventive Maintenance
#page#
UFM IN THE FABRIC
Synchronization, heartbeat
Manages Subnet Manager and SHARP services
Software or appliance form factor
High availability: 2 or more
Switch and adapter management
Full management or monitoring only
Layer 2 level monitoring
REST API for configuration/monitoring
Single interface for all network
#page#
CENTRALIZE DEVICE MANAGEMENT
Manage an inventory of assets, switches and nodes
Centrally upgrade firmware and software across all managed and unmanaged systems
#page#
UFM DASHBOARD
Network Validation
Congestion Mapping
Health Reports
Inventory Mapping
Real-Time Analysis
Secure Cable Management
Prediction Dashboard
Performance Monitoring
#page#
MULTI CLOUD SOLUTION
Single main dashboard for all managed cloud/clusters
Alerts, traffic utilization, congestion, cable info, health
#page#
Daily Fabric Report - 24H Activity Summary
[Screenshot: Fabric, Hosts, Switches]
#page#
LINK ANALYSIS
Top 10 Nodes by Link Failure Prediction
[Screenshot: UFM Cyber-AI Prediction Dashboard, tabs: Link Analysis, Suspicious Behavior]
#page#
LINK ANALYSIS
Correlation between Events Raised the Alert
[Screenshot: anomaly events in last 6 hours; plots of RX Power (dB), Peer Switch Temperature (C), Normalized Congestion (%)]
#page#
LINK ANALYSIS
Prediction and Recommendation to Check the Peer Switch
[Screenshot: event table for klir2n03 HCA-1]
#page#
DETECT ANOMALIES
Identify Network-Level Anomaly Behavior
Analyze and learn the data center's normal-operation communication patterns
Communication patterns can indicate the type of workloads
Deviations from these patterns generate abnormal behavior alerts
Data center IT managers can check whether an unauthorized
user or unauthorized application is running
#page#
ABNORMAL SYSTEM BEHAVIOR
Suspicious Network Behavior Alert
[Screenshot: UFM Cyber-AI Prediction Dashboard with Refresh Dashboard control; tabs: Link Analysis, Suspicious Behavior; panels: Cable Events, Anomaly Dashboard]
#page#
ABNORMAL SYSTEM BEHAVIOR
Suspicious Network Behavior Alert
[Screenshot: alert list with a chart of Weekly Average Traffic, Last Week Traffic and This Week Traffic]
#page#
[Screenshot: alert list for klir2n03 HCA-1 with a chart of This Week Traffic, Weekly Average Traffic and Last Week Traffic]
#page#
SECURE CABLE MANAGEMENT
Protecting the Infrastructure
Monitors the physical connectivity of cables
Unique ID for each cable
UFM tracks which cable connects to which port
Any change in connectivity raises an alert
A change may indicate a malicious connection
#page#
UFM PLATFORMS PORTFOLIO
UFM Telemetry: Real-Time Monitoring
UFM Enterprise: Management, Monitoring and Orchestration
UFM Cyber-AI: Cyber Intelligence and Analytics
#page#
UFM TELEMETRY PLATFORM
Real-Time Monitoring
Network and systems validation (adapters, switches, cables, transceivers) and connectivity checks
Network performance tests
Application tests
Streaming of telemetry information into an on-premises or cloud-based database
Platform options: Docker container, software, or UFM Telemetry appliance
#page#
UFM ENTERPRISE PLATFORM
Management, Monitoring and Orchestration
Includes all UFM Telemetry services
Network setup, connectivity validation and secure cable management
Automated network discovery and network provisioning
Performance, traffic and congestion monitoring, health and fault monitoring
Job scheduler provisioning, integrated with Slurm and Platform LSF
Integrated with OpenStack, Azure Cloud and VMware
#page#
UFM CYBER-AI PLATFORM
Cyber Intelligence and Analytics
Includes all UFM Enterprise services
Learns the system heartbeat: operation mode, condition, usage, workload
Builds an enhanced database of telemetry information and discovers correlations
Detects performance degradations, usage and profile changes over time
Provides alerts of abnormal system and application behavior
System administrators can quickly respond to potential future failures or security threats
Predictability is optimized over time as system data is collected
Platform: UFM Cyber-AI appliance
#page#
MELLANOX CARE MONITORING AND NOC SERVICES
Remote NOC, network management, and monitoring services
Alert diagnostics and data center management
Performance and health monitoring
[Diagram labels: Alerts, UFM Cyber-AI, Management]
#page#
UFM SOFTWARE ARCHITECTURE
[Diagram labels: Slurm, Platform LSF, OpenStack, Azure; job scheduler integration, cloud integration, Python SDK, web-based GUI; northbound REST APIs; Unified Fabric Manager (UFM) Server Software; InfiniBand Subnet Manager (SM), Aggregation Manager (SHARP), Performance Monitoring (IBPM), Syslog; servers, MLNX-OS switches, external switch]
#page#
NVIDIA
#page#
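The SECURE CABLE MANAGEMENT slide describes giving each cable a unique ID, tracking which cable connects to which port, and alerting on any change. The deck does not show how UFM implements this, so the following is only a minimal sketch of the idea: it compares two cabling snapshots and reports every difference. The snapshot layout, the `diff_cabling` function, and the cable, switch and port names are all hypothetical and are not part of any UFM API.

```python
from typing import Dict, List, Tuple

# Hypothetical snapshot layout: cable ID -> (switch name, port number).
Snapshot = Dict[str, Tuple[str, int]]


def diff_cabling(baseline: Snapshot, current: Snapshot) -> List[str]:
    """Return one alert string per cable whose connectivity changed."""
    alerts = []
    for cable_id in sorted(set(baseline) | set(current)):
        before = baseline.get(cable_id)
        after = current.get(cable_id)
        if before == after:
            continue  # same switch and port as in the baseline
        if before is None:
            alerts.append(f"{cable_id}: unknown cable appeared at {after[0]}:{after[1]}")
        elif after is None:
            alerts.append(f"{cable_id}: cable removed from {before[0]}:{before[1]}")
        else:
            alerts.append(
                f"{cable_id}: moved from {before[0]}:{before[1]} to {after[0]}:{after[1]}"
            )
    return alerts


# Illustrative data only (not taken from the slides).
baseline = {"cable-A": ("sw1", 1), "cable-B": ("sw1", 2)}
current = {"cable-A": ("sw1", 1), "cable-B": ("sw2", 7), "cable-C": ("sw1", 2)}

for alert in diff_cabling(baseline, current):
    print(alert)
```

Keying the snapshot by cable ID rather than by port is what lets a moved cable be told apart from a swapped one: a port that still has a link but reports a different cable ID shows up as a change.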
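The ABNORMAL SYSTEM BEHAVIOR screenshots chart This Week Traffic against Weekly Average Traffic and Last Week Traffic. The deck does not describe the model UFM Cyber-AI uses, so the sketch below substitutes a deliberately simple rule: flag a node when this week's traffic deviates from its historical weekly mean by more than a tolerance. The `traffic_anomalies` function, the 50% tolerance, the node names and the traffic figures are assumptions made for illustration.

```python
from statistics import mean
from typing import Dict, List


def traffic_anomalies(
    history: Dict[str, List[float]],
    this_week: Dict[str, float],
    tolerance: float = 0.5,
) -> Dict[str, float]:
    """Return {node: relative deviation} for nodes outside the tolerance band."""
    flagged = {}
    for node, weekly_totals in history.items():
        baseline = mean(weekly_totals)  # weekly average traffic for this node
        if baseline == 0 or node not in this_week:
            continue  # no usable baseline or no current measurement
        deviation = (this_week[node] - baseline) / baseline
        if abs(deviation) > tolerance:
            flagged[node] = round(deviation, 2)
    return flagged


# Illustrative weekly traffic totals (arbitrary units, not from the slides).
history = {"node-a": [100.0, 110.0, 90.0], "node-b": [200.0, 210.0, 190.0]}
this_week = {"node-a": 105.0, "node-b": 420.0}

print(traffic_anomalies(history, this_week))
```

With this sample data only node-b is flagged, since its traffic is more than double its weekly average while node-a stays within 5% of its own. A fixed tolerance is the crudest possible baseline; a learned model would adapt the band per node and per time of week.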