NVIDIA HIGH PERFORMANCE E2E ETHERNET SOLUTION: ACCELERATING RECOMMENDER SYSTEMS
GTC China, Oct 2020
#page#
Recommendation Pipelines: Example
[Figure: end-to-end pipeline - experimentation over a data lake (train data, TBs to PBs) with feature engineering, data pre-processing and model training (train data, GBs to TBs); production re-training (weekly/daily) with data preprocessing, feature engineering, candidate generation and model training feeding back improved accuracy; production inference serving the recommender system.]
#page#
Recommendation Pipelines: Challenges
- Data (ETL) / feature exploration: huge data sets (TBs, PBs or more); complex data preprocessing and feature engineering pipelines; many iterations required.
- Training / data loading: data loading can be 50% of total training time; tabular data loading scales poorly with an item-by-item approach.
- Training / huge embedding tables: large embedding tables exceed the memory of a single GPU; it is hard to achieve high scaling efficiency with both model and data parallelism; feature lookup ops have sub-optimal implementations (a rough sizing sketch follows after this slide).
- Inference / throughput & latency: it is difficult to achieve high throughput and low latency when ranking a huge number of items.
- Performance & accuracy: longer iteration cycles reduce the ability to reach higher accuracies quickly.
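To make the "embedding tables exceed single-GPU memory" point concrete, here is a minimal sizing sketch; the feature cardinalities and embedding widths are illustrative assumptions, not numbers from the slides.

```python
# Rough sizing of recommender embedding tables vs. a single GPU's memory.
# The row counts and embedding dimensions below are illustrative assumptions.

GPU_MEMORY_GB = 40  # e.g. a single 40 GB GPU

# (feature name, number of unique ids, embedding dimension)
tables = [
    ("user_id",       500_000_000, 128),
    ("item_id",       100_000_000, 128),
    ("user_history",  100_000_000,  64),
    ("context_cross",  50_000_000,  32),
]

bytes_per_value = 4  # fp32 parameters

total_bytes = sum(rows * dim * bytes_per_value for _, rows, dim in tables)
total_gb = total_bytes / 1e9

for name, rows, dim in tables:
    gb = rows * dim * bytes_per_value / 1e9
    print(f"{name:>13}: {gb:8.1f} GB")

print(f"{'total':>13}: {total_gb:8.1f} GB "
      f"(~{total_gb / GPU_MEMORY_GB:.1f}x a {GPU_MEMORY_GB} GB GPU)")
```

With tables several times larger than one GPU's memory, the embedding layer has to be sharded across GPUs (model parallel) while the dense layers stay data parallel, which is exactly why the interconnect bandwidth discussed in the following slides matters.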
#page#
NVIDIA Ethernet Switches Address the Challenges
- Speed, feed and latency: fast interconnect; fast access to the dataset.
- RDMA and RoCE: low-latency access to GPU memory; low-latency access to external datasets.
- Monitoring and management.
#page#
SPEED AND FEED - THE NEED FOR BANDWIDTH
- Intra-layer model parallelism leaves collectives exposed; data parallelism adds allreduce traffic.
- Communication speedup must match math speedup, otherwise communication suffers from the basic Amdahl's law problem: accelerating math without accelerating communication achieves little end-to-end speedup (worked example below).
- Typically collectives span the NVLink domain only, but allreduce spans both the NVLink and networking domains: bandwidth must be available in each.
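A minimal numeric illustration of the Amdahl's law point; the 30% communication fraction and the math speedups are assumed for illustration, not figures from the talk.

```python
# Amdahl's law: if a fraction p of step time is exposed communication and only
# the math part (1 - p) is sped up by a factor s, the end-to-end speedup is
#   speedup = 1 / (p + (1 - p) / s)
# The numbers below are illustrative assumptions, not measurements.

def e2e_speedup(comm_fraction: float, math_speedup: float) -> float:
    return 1.0 / (comm_fraction + (1.0 - comm_fraction) / math_speedup)

p = 0.30  # 30% of step time spent in exposed collectives (assumed)
for s in (2, 4, 8, 100):
    print(f"math {s:>3}x faster -> end-to-end {e2e_speedup(p, s):.2f}x")

# Even with infinitely fast math the speedup is capped at 1/p = 3.33x,
# which is why communication bandwidth must scale along with compute.
```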
#page#
NVIDIA'S MULTI-GPU, MULTI-NODE NETWORKING AND STORAGE IO OPTIMIZATION STACK
Build larger and lower-latency resource pools.
[Figure: Magnum IO stack - UCX, NCCL and OpenMPI on top; NVLink fabric, GPUDirect P2P, GPUDirect RDMA and GPUDirect Storage underneath; interconnect, topology, storage and transport layers spanning NVLink/XBAR/NVLink Switch on chip and within the node, and InfiniBand/RoCE with GPUDirect over RoCE or IB across nodes.]
#page#
NCCL Ring on Multi-Node
[Figure: NCCL ring spanning the NVLink domain (within a node) and the IB/Ethernet network domain (across nodes).]
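As a concrete sketch of the collective that rides this ring, here is a minimal multi-node allreduce using PyTorch's NCCL backend; it assumes a PyTorch environment launched with torchrun and is not code from the talk.

```python
# Minimal multi-node allreduce over NCCL, launched e.g. with:
#   torchrun --nnodes=2 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=<host>:29500 allreduce.py
# NCCL rings traverse NVLink inside each node and the IB/RoCE fabric between
# nodes, so both domains need the bandwidth discussed above.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")      # NCCL backend for GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    # One gradient-sized buffer per rank; allreduce sums it across all GPUs.
    grad = torch.full((64 * 1024 * 1024,), float(dist.get_rank()),
                      device="cuda", dtype=torch.float32)
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        print("world size:", dist.get_world_size(),
              "summed element:", grad[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```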
#page#
NETWORK DESIGN FOR AI CLUSTER
Separate fabrics per traffic class, to guarantee low interference between applications and scalability:
- Compute network: sized to the server design, single-plane or multi-plane.
- Storage network: technology depends on the dataset and the storage software, single-plane or multi-plane.
- Management network: separate if the deployment is large enough.
[Figure: GPU servers attached to separate compute, storage and management fabrics.]
#page#
ETHERNET AI SWITCHES
Purpose-built for rack-level and multi-rack deep learning solutions.
- NVIDIA certified performance: best-in-class latency and throughput; RoCE acceleration for GPUDirect and storage; adaptive routing.
- DeepOps integration: automated network configuration; automatic advanced network monitoring; auto-verification of the health of the DGX POD deployment.
- Native 200 Gigabit Ethernet without splitters, just like DGX A100; 32-port (SN3700, 32x200GbE) or 64-port (SN4600, 64x200GbE) switches for the DGX POD.
#page#
ENABLING WORLD-CLASS AI SOLUTIONS
Fast interconnect, fast compute, fast storage.
[Figure: partner logos - NetApp, Pure Storage, Excelero, Nutanix, WekaIO, NVIDIA.]
#page#
BROAD ETHERNET SWITCH PORTFOLIO
AI, edge, leaf and spine switches:
SN2100 (16x100G), SN2010 (18x25G + 4x100G), SN2410 (48x25G + 8x100G), SN2700 (32x100G), SN3420 (48x25G + 12x100G), SN3510 (48x50G + 6x400G), SN3700-C (32x100G), SN3700-V (32x200G), SN4410 (48x100G + 8x400G), SN4600-C (64x100G), SN4600-V (64x200G), SN4700 (32x400G), SN4800 (128x100G).
Best-in-class ASICs (Spectrum, Spectrum-2, Spectrum-3): power consumption, throughput and latency, fair traffic distribution, security and virtualization scale.
#page#
SPECTRUM SWITCH ADVANTAGES
- Congestion management and avoidable packet loss.
- Fairness and QoS.
- Microburst absorption capability.
- Lowest latency.
[Figure: Tolly test charts comparing Spectrum with the competition on microburst absorption across 64B-9KB packet sizes, fairness, and latency.]
#page#
LARGER FABRICS WITH SPECTRUM
Up to 65,000 non-blocking 100GbE ports in a 3-tier leaf/spine network.
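Where a number of that order comes from can be illustrated with standard folded-Clos arithmetic; the sketch below assumes 64-port switches (e.g. 64x100GbE) and is an illustration, not the exact sizing used on the slide.

```python
# Non-blocking folded-Clos (fat-tree) capacity with k-port switches.
# 2-tier leaf/spine: each leaf splits its k ports half down / half up, and
# there can be at most k leaves           ->  k * (k/2) host ports.
# 3-tier fat-tree: the classic result is  ->  k^3 / 4   host ports.
# k = 64 is an assumption (a 64x100GbE switch), not a figure from the slide.

def two_tier_ports(k: int) -> int:
    return k * (k // 2)

def three_tier_ports(k: int) -> int:
    return k ** 3 // 4

k = 64
print(f"2-tier leaf/spine with {k}-port switches: {two_tier_ports(k):,} ports")
print(f"3-tier fat-tree with {k}-port switches:   {three_tier_ports(k):,} ports")
# -> 2,048 and 65,536 ports: the ~65,000 figure corresponds to a 3-tier design.
```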
#page#
RDMA AND ROCE
#page#
ROCE ACCELERATED AI SOLUTIONS
RDMA supercharges leading AI frameworks.
[Figure: framework logos (Microsoft, Caffe2, PaddlePaddle, TensorFlow) and customer logos (Tencent, Baidu, Oracle Cloud, Alibaba Group, ByteDance, Sony, JD); headline metrics including up to 2.5X better performance and up to 95% scaling efficiency, along with higher ROI and savings on capital & operation cost.]
#page#
GPU DIRECT RDMA / ROCE TECHNOLOGY
- GPUDirect with RoCE/RDMA: moves RoCE/RDMA traffic between the network and GPU memory directly, bypassing the CPU and CPU memory.
- GPUDirect with Ethernet UDP/IP: moves Ethernet UDP/multicast traffic between the network and GPU memory directly, bypassing the CPU and CPU memory.
- GPUDirect powered by RoCE delivers 10X better performance.
[Figure: data path without GPUDirect (through CPU memory) vs. with GPUDirect (NIC to GPU memory directly).]
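To show what this looks like from the application side, here is a hedged sketch of pointing an NCCL job at the RoCE fabric with GPUDirect RDMA preferred; the device name and GID index are deployment-specific assumptions, and the NCCL_DEBUG output is the way to confirm which path was actually taken.

```python
# Sketch: environment for a PyTorch/NCCL job that should use RoCE + GPUDirect
# RDMA. The device name (mlx5_0) and GID index (3, commonly the RoCE v2 entry
# on ConnectX NICs) are deployment-specific assumptions - verify them first.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_IB_DISABLE", "0")       # allow the IB/RoCE transport
os.environ.setdefault("NCCL_IB_HCA", "mlx5_0")      # which RDMA device to use (assumed name)
os.environ.setdefault("NCCL_IB_GID_INDEX", "3")     # GID entry for RoCE v2 (assumed index)
os.environ.setdefault("NCCL_NET_GDR_LEVEL", "SYS")  # allow GPUDirect RDMA broadly
os.environ.setdefault("NCCL_DEBUG", "INFO")         # log the chosen transport

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

x = torch.ones(1 << 24, device="cuda")
dist.all_reduce(x)   # check the INFO log for the NET/IB path and GPU Direct RDMA
dist.destroy_process_group()
```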
#page#
RoCE ACCELERATES SCALE-OUT STORAGE
Storage is getting a lot faster.
[Figure: Windows Server 2016 Storage Spaces Direct (S2D) read and write IOPS scaling over hundreds of SATA SSDs.]
#page#
RoCE IN A NUTSHELL
What is RoCE (RDMA over Converged Ethernet)?
- [Wikipedia] RoCE is a network protocol that allows RDMA over an Ethernet network. It does this by encapsulating an InfiniBand transport packet over Ethernet. There are two RoCE versions: RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain; RoCE v2 is an internet layer protocol, which means that RoCE v2 packets can be routed.
- RoCE is a standard for RDMA over Ethernet defined by the IBTA (InfiniBand Trade Association).
How to RoCE?
- Quality of Service (QoS).
- Congestion management (ECN and the DCQCN algorithm); a marking sketch follows after this list.
- Flow control (L2 PCP / L3 DSCP).
- Advanced algorithms (part of ZTR development).
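As a rough illustration of the ECN half of DCQCN (the switch side), here is a sketch of RED-style probabilistic ECN marking as a function of queue depth; the Kmin/Kmax thresholds and the marking probability are illustrative values, not the switch's actual defaults.

```python
# RED/WRED-style ECN marking used by DCQCN-capable switches (sketch):
# below Kmin the queue never marks, above Kmax it always marks, and in
# between the marking probability rises linearly up to Pmax. Marked packets
# make the receiver emit CNPs, and the sender's DCQCN logic cuts its rate.
# Kmin, Kmax and Pmax below are illustrative, not real switch defaults.
import random

K_MIN_KB = 150      # start marking above this queue depth
K_MAX_KB = 1500     # mark every packet above this depth
P_MAX = 0.2         # marking probability at K_MAX_KB

def ecn_mark(queue_depth_kb: float) -> bool:
    if queue_depth_kb <= K_MIN_KB:
        return False
    if queue_depth_kb >= K_MAX_KB:
        return True
    p = P_MAX * (queue_depth_kb - K_MIN_KB) / (K_MAX_KB - K_MIN_KB)
    return random.random() < p

# Expected marking rate at a few queue depths (10,000 trials each):
for depth in (100, 300, 800, 1600):
    rate = sum(ecn_mark(depth) for _ in range(10_000)) / 10_000
    print(f"queue {depth:>4} KB -> ~{rate:.0%} of packets ECN-marked")
```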
#page#
WHAT MAKES A GREAT RoCE SWITCH
- Simple configuration: 1-command CLI config; 1-click GUI config.
- High performance: high PPS and low latency; fair and predictable performance.
- Advanced congestion control: early detection and prevention; RoCE over VXLAN.
- Extensive visibility: single pane of glass; real-time RoCE telemetry.
#page#
The Entire NVIDIA Ethernet Switch Line Supports RoCE
- One-command RDMA deployment: supports lossless, semi-lossless and lossy RDMA deployment modes; supports mixed RDMA and non-RDMA deployment; a single "roce" CLI command vs. 26+ commands in other NOSes; supports RoCE over VXLAN.
- Hardware designed for RDMA: Fast ECN support; low forwarding latency and an excellent shared-buffer design.
- End-to-end management with the NEO network management software.
[Figure: RoCE v2 packet format (Ethernet header, IP header carrying DSCP and ECN, UDP destination port 4791, BTH) and an example QoS configuration - port trust mode, PFC enabled on priority 3 (RoCE, lossless), CNPs on TC6 with strict priority, other traffic on lossy queues with WRR scheduling and ECN marking.]
#page#
MONITOR & MANAGEMENT
#page#
WJH - Lightweight, Deployable, Event-Driven Telemetry
The important questions: WHO is being impacted, WHEN it happened, WHAT is causing the problem, WHERE the problem is, WHY it is happening.
1. The switch SDK generates WJH (What Just Happened) messages: the packet 12-tuple plus metadata and a very detailed drop description.
2. An agent collects the data and streams it to a database; consumers include Grafana, Kibana, NetQ, NEO and Apstra.
3. The presentation layer shows the root cause and how to fix it.
#page#
WJH Accelerates the Time to Root-Cause
[Figure: comparison with traditional SNMP/syslog troubleshooting.]
#page#
WJH - What Does It Monitor?
Packet-drop events:
- L1: bad CRC, flaky cable.
- L2/L3: VLAN issues; route validation (e.g. a packet doesn't reach the firewall); ACLs (deny based on IP, deny based on VLAN).
- Buffer: congestion drops, rate limit.
No-packet-drop events:
- Congestion: incast, busy storage device, pause frames.
- Latency: congestion latency, packets going through a suboptimal path.
- Load balance validation: suboptimal ECMP, suboptimal LAG.
A sketch of consuming such an event stream follows below.
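Since WJH streams drop events to a database, a consumer can aggregate them by reason; the sketch below assumes a hypothetical JSON event format (fields such as drop_group, drop_reason, ingress_port, src_ip and dst_ip are placeholders, not the actual WJH schema).

```python
# Aggregate streamed WJH-style drop events by drop group and reason.
# The event schema below is a hypothetical placeholder, not the real
# WJH message format.
import json
from collections import Counter

sample_stream = [
    '{"drop_group": "L2/L3",  "drop_reason": "ACL deny based on IP",   "ingress_port": "swp3",  "src_ip": "10.0.1.7", "dst_ip": "10.0.9.2"}',
    '{"drop_group": "Buffer", "drop_reason": "tail drop (congestion)", "ingress_port": "swp7",  "src_ip": "10.0.2.4", "dst_ip": "10.0.9.2"}',
    '{"drop_group": "Buffer", "drop_reason": "tail drop (congestion)", "ingress_port": "swp7",  "src_ip": "10.0.2.5", "dst_ip": "10.0.9.2"}',
    '{"drop_group": "L1",     "drop_reason": "bad CRC",                "ingress_port": "swp12", "src_ip": "-",        "dst_ip": "-"}',
]

counts = Counter()
for line in sample_stream:
    event = json.loads(line)
    counts[(event["drop_group"], event["drop_reason"])] += 1

# Most frequent drop reasons first - the "what" and "why" at a glance.
for (group, reason), n in counts.most_common():
    print(f"{n:>3}x  [{group}] {reason}")
```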
#page#
HISTOGRAMS - A Powerful Buffer Statistics Tool
- Queue histograms (egress queue utilization): fill level, latency, micro-burst detection; sampling at 64 ns granularity, 10 buckets per histogram.
- Port histograms: RX/TX bandwidth, PFC duration.
- Watermarks capture the highest value seen; thresholds trigger events (with hysteresis).
[Figure: example egress-queue utilization histogram with 10 sample buckets.]
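As a toy illustration of how bucketed occupancy histograms plus a watermark and a hysteresis threshold reveal microbursts that averaged counters miss, here is a sketch; the bucket boundaries, thresholds and synthetic samples are made up for illustration.

```python
# Toy model of a 10-bucket egress-queue occupancy histogram with a
# high/low watermark threshold pair (hysteresis). All numbers are
# illustrative; real Spectrum histograms sample at 64 ns granularity.

BUCKETS = 10
QUEUE_SIZE_KB = 1000                 # bucket width = 100 KB
HIGH_KB, LOW_KB = 800, 300           # assert above HIGH, clear below LOW

def bucket(depth_kb: float) -> int:
    return min(BUCKETS - 1, int(depth_kb / (QUEUE_SIZE_KB / BUCKETS)))

# Synthetic per-sample queue depths: mostly idle with one short microburst.
samples = [20] * 50 + [950, 980, 990, 970] + [30] * 50

histogram = [0] * BUCKETS
watermark, congested, events = 0, False, 0
for depth in samples:
    histogram[bucket(depth)] += 1
    watermark = max(watermark, depth)
    if not congested and depth >= HIGH_KB:
        congested, events = True, events + 1   # threshold event fires once
    elif congested and depth <= LOW_KB:
        congested = False                      # hysteresis: re-arm below LOW

print("histogram buckets:", histogram)
print(f"watermark: {watermark} KB, threshold events: {events}, "
      f"mean depth: {sum(samples)/len(samples):.0f} KB  # the average hides the burst")
```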
#page#
CUMULUS NETQ
A modern toolset that accelerates the network transformation: CI/CD, fabric-wide analytics and diagnostics, containers and microservices, streaming telemetry, visibility.
#page#
NVIDIA Focuses on the Needs of High-Performance Interconnect Applications
- Many-to-one traffic and microbursts: an optimized buffer design absorbs bursts and reduces packet loss. In normal application operation, many-to-one traffic causes packet loss, which in turn increases latency and degrades performance; microbursts cause packet loss that is hard to observe and leads to performance problems.
- Requirements driven by distance: an optimized buffer design directly accommodates long-distance RTTs, lowering deployment cost.
- Performance (latency, bandwidth and PPS): low forwarding latency and high bandwidth, a better fit for high-performance systems.
- Working together with the NIC: RoCE for application acceleration; iptables acceleration for container applications.
- Form factors and deployment: half-width platforms suit integrated delivery, edge computing and mobile computing; large table sizes easily meet the requirements of container applications.
#page#
NVIDIA