HOT CHIPS 2024

Tesla Transport Protocol over Ethernet (TTPoE)
A new lossy, Exa-Scale fabric for the Dojo AI Supercomputer
Eric Quinnell, Ph.D.
Dojo Fabric Lead

Problem Statement
- TCP/IP is too slow for scaled AI interconnect
  - Bound by CPU SW kernel
- Lossless fabrics are complex and brittle
  - Priority Flow Control (PFC) affects the global network
- Ideal fabric:
  - Lowest latency
  - Highest bandwidth
  - Simple software
- For Tesla AI:
  - Layer 2 only
  - Collective communications and ingest
  - Low congestion, single application

TTPoE
Tesla Transport Protocol over Ethernet (TTPoE) is a peer-to-peer Ethernet Transport Layer protocol executed entirely in hardware.

Why a custom transport protocol?
1. Vertical integration: extend Dojo RDMA onto optical fabric
2. “Lossy” Ethernet network: ease of scaling, cost, congestion management
3. Use 3rd-party hardware: Ethernet II frames “just work”

TCP got it right: just do it in hardware.

Dojo OSI Layers

OSI layer        Standard stack (TCP/IP)                 Dojo stack
7 Application    HTTP, Telnet, FTP                       PyTorch, Dojotorch
6 Presentation   JPEG, PNG, MPEG                         FFMPEG, HEVC, YUV
5 Session        NFS, SQL                                Dojo RDMA Descriptors
4 Transport      TCP, UDP                                TTP
3 Network        IPv4/IPv6                               IPv4/IPv6 (optional)
2 Data Link      Ethernet frames, MAC addresses, VLAN    Ethernet frames, MAC addresses, VLAN
1 Physical       Data encoding, physical specs           Data encoding, physical specs

[Each stack is annotated with Hardware and Software regions.]

TTP Transaction Examples
[Two sequence diagrams of TTP link communication between TTP Device A and TTP Device B, plotted against time: a clean TTP transfer example, and a NACK TTP transfer example in which TTP_PAYLOAD, ID=3 is either lost or out of order.]
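
The NACK transfer example above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the TTP implementation: the `Receiver` class, the `transfer` function, the "ACK"/"NACK"/"DROP"/"TIMEOUT" log tags, and the replay-from-first-missing-ID policy are all hypothetical names and choices made for this sketch. The slides only state that a lost or out-of-order TTP_PAYLOAD triggers a NACK.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Hypothetical in-order receiver: accepts only the next expected payload ID."""
    expected_id: int = 1
    delivered: list = field(default_factory=list)

    def on_payload(self, payload_id, data):
        if payload_id == self.expected_id:
            self.delivered.append(data)
            self.expected_id += 1
            return ("ACK", payload_id)
        # Lost or out-of-order payload: ask the sender to replay from expected_id.
        return ("NACK", self.expected_id)


def transfer(payloads, drop_once):
    """Send payloads with IDs 1..n; IDs in drop_once are lost on their first attempt."""
    rx = Receiver()
    already_dropped = set()
    log = []
    next_id = 1
    while rx.expected_id <= len(payloads):
        if next_id > len(payloads):
            # Assumed behavior: no ACK for the tail, so the sender times out and replays.
            log.append(("TIMEOUT", rx.expected_id))
            next_id = rx.expected_id
            continue
        if next_id in drop_once and next_id not in already_dropped:
            already_dropped.add(next_id)      # packet vanishes in the lossy fabric
            log.append(("DROP", next_id))
            next_id += 1                      # sender keeps transmitting speculatively
            continue
        kind, ack_id = rx.on_payload(next_id, payloads[next_id - 1])
        log.append((kind, ack_id))
        next_id = next_id + 1 if kind == "ACK" else ack_id
    return rx.delivered, log


if __name__ == "__main__":
    delivered, log = transfer([b"a", b"b", b"c", b"d"], drop_once={3})
    print(delivered)
    print(log)
```

With ID=3 dropped once, the receiver sees ID=4 while still expecting ID=3, answers with a NACK, and the sender replays from ID=3, so all four payloads are still delivered in order.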

Transport Layer State Machines
[Two state diagrams labeled TTP STATE MACHINE and TCP STATE MACHINE, with the annotation HW CONSTRAINED. TTP states: CLOSED, OPEN SENT, OPEN RECD, OPEN, CLOSE RECD, CLOSE SENT. Transition labels: OPEN(TX); OPEN(RX); OPEN_ACK(RX); OPEN_ACK(TX); OPEN_NACK(TX); OPEN_NACK; CLOSE(RX); CLOSE(TX); !CLOSE(RX) & !idle timer(TX) & !victim(TX); CLOSE_ACK(RX); CLOSE_ACK(TX); CLOSE_NACK(RX); Timeout, Resend CLOSE(TX); !quiesced, CLOSE_NACK(TX); Timeout, Resend OPEN(TX). Reference on slide: IETF RFC-793.]

Modifications made for hardware-only execution:
- 2 millisecond quiesce in a microsecond protocol is too long
- No reliance on virtual memory: physical memory only
- Automatic OPEN/CLOSE with no SW involvement

TTP Header Frame
- TTP uses simple Ethernet-II formats with optional standard layers
  - Dojo at scale uses only Layer 2, currently not using Layer 3
- MAC addresses are a hardware hash of the SOW Physical Address (PA)
- A TTP endpoint can concurrently handle 512 unique links, dynamically replaced via victimization and LRU
- Virtual channels (VCs) allow for non-blocking control, semaphore, completion, and data movement

Lossy Protocol
- TTPoE is a lossy transport protocol
- Lossy transport means the underlying medium is expected to lose packets and retry; full packet transmission is still guaranteed
  - Similar to TCP and unlike UDP
- TTP will default to packet drops and replays in corner cases of congestion, backpressure, or errors
- Speculative transmission is limited by SRAM size before an RTT ACK. This, in effect, forces a “TTP window size” beyond which bandwidth is lost
- Local SRAM lines are not retired/deallocated until the ACK comes back, allowing HW to replay the line
- Replay amounts are also limited by SRAM, constraining the scale of replay storms

Congestion Management
- Congestion management is distributed
- Exponential backoff, rate control, and algorithms are handled by local link TX channels, not by a central network or switch
- Fault-tolerant flow “flushes” the TTP network and removes a bad link before continuing training
- No PFC, no Nagle's algorithm, no QoS, no tokens, no lossless artifacts

TTP MAC IP
- The Transport Layer hardware is an IP block between a NOC and a standard Ethernet MAC
- Translates and coalesces 64B/cycle NOC packets into up to 1kB TTP Ethernet packets
- Speaks AXI-S or SOP/EOP formats
- Optionally activates standard MAC features: pause packets, counters, stats, LLDP
- IP block instantiated in FPGA and silicon implementations
[Block diagram labels: NOC TX, NOC RX, TTP MAC, AXI-S, ETHERNET MAC, MII, PCS, PCS/FEC 64/66B, PMA/PHY IEEE 802.3, TX serdes 3:0, RX serdes 3:0. The Ethernet MAC, PCS, and PHY blocks are marked Standard Ethernet IP.]

TTP MAC Micro-Architecture
- TTP's micro-architecture uses techniques from SMP caches, snoop filters, and CPUs
- 4-stage Read-Modify-Write (RMW) pipeline
- TX buffer size determines maximum outstanding packets before stall/backpressure
- ACK packets “retire” a packet from the common buffer
- 1MB TX buffer allows for 80 microseconds of RTT latency tolerance
- Virtual channels to prioritize and avoid livelock/deadlock
- Multi-channel “coherent” arbitration to update link and use the TX physical channel
- DMA descriptors issue to TTP MAC
  - Can be PUSH for implicit pass-thru local-to-remote
  - Can be explicit HBM2HBM fabric memcpy

“Mojo” 100Gbps Dumb-NIC
[Block diagram labels: Mojo Interface Processor (MIP), PCIe Controller, TTP Ethernet Controller, DMA Engine, DDR Memctl, Clocks, Reset, Power, Debug, CSR/Perfmons, NOC, 8x GB DDR4, PCIe x16 Gen3, QSFP28 100Gbps.]

Feature          Spec
Ethernet speed   100Gbps QSFP
PCI-e            Gen3 x16
Memory           8GB DDR4
Power            20W max
Reliability      5-year tested
DMA engine       Dojo DMA
CPU+OS           None
Active links     512 unique, 2-way, LRU

First integration box: D1 Die (Hot Chips 34)
- TSMC 7nm, 645 mm²
- Physically and logically arranged as a 2D array
  - 354 DOJO processing nodes on die
- Extremely modular design
- 362 TFlops BF16/CFP8, 22 TFlops FP32 at 2 GHz
- 440 MB SRAM
- Custom low-power serdes channels on all edges
  - 576 bidirectional channels
  - 2 TB/s bandwidth on each edge
- Seamless connection to neighboring dies

Second integration box: Dojo Training Tile (Hot Chips 34)
- 5x5 array of known good D1 chips
  - 4.5TB/s off-tile bandwidth per edge
  - Half of in-tile bandwidth
- Fully integrated module
  - Electrical + thermal + mechanical
  - 15kW of power delivery
- Custom power delivery
  - Horizontal data communication plane
  - Vertical power delivery and cooling
  - 15kW per module
- Custom high-density connectors
  - Seamless connection to neighboring training tiles

V1 Dojo Interface Processor (Hot Chips 34)
- 32GB High-Bandwidth Memory
  - 800 GB/s total memory bandwidth
- 900 GB/s TTP interface
  - Tesla Transport Protocol (TTP)
  - Full custom protocol
  - Provides full DRAM bandwidth to Training Tile
- 50 GB/s TTP over Ethernet (TTPoE)
  - Enables extending communication over standard Ethernet
  - Native hardware support
- 32 GB/s Gen4 PCIe interface

“Mojo” Hosts: Variable Ingest via TTP Network
- Vision networks can be heavily ingest limited
  - Vision-based tensors and training clips in GBs
- “Mojo” Hosts are scheduled on demand from a generic compute pool
- Forward/Backward pass TTP traffic is mutually exclusive
  - i.e. ingest and all-reduce share the same TTP DIP ports but execute during different phases of training
[Diagram labels: D1 SOW, DIP (multiple), Main Host, Remote Mojo Host with NICs (multiple), TTP Network, PCIe Gen4, PCIe Gen3, 2 Tbps, 1 Tbps TTP, 100 Gbps TTP, Variable Ingest (Forward Pass), Local Ingest (Forward Pass), All Reduce (Backward Pass), To Other Partitions.]

MDCH: Mojo Dojo Compute Hall

Dojo Engineering System
- 4x ExaFLOP BF16/FP16 cluster
- 40 PB local storage
- 40,960 main host cores
- 61,440 Mojo host cores
- 320 Tbps TTP all-reduce I/O (endpoint)
- 128 Tbps TTP ingest I/O (endpoint)
- 208 Tbps TCP/IP (endpoint)
- Converged and non-converged network experiments
[Network diagram: leaf-spine topology with LEAF-1 to LEAF-8 and SPINE-1 to SPINE-4. Link labels include 80 Tbps TTP, 32 Tbps TTP, 16 Tbps TCP/IP, and 36 Tbps TCP/IP. Regions are marked Converged Ethernet and Independent Networks, with segments labeled TCP/IP Only, TTP + TCP/IP Converged, and TTP Only. Overlay labels: EVPN, VXLAN.]

Results
- Measured on Arista 7060, 7808, and 7816 switches
- RTT latency is random sampling of in-flight packets + ACK return
- Gbps is wall-time real-data movement
- All-reduce measure is network only, non-pipelined
  - SOW has all-reduce not shown (pre-network)
- All-reduce throughput is determined by the slowest node in the system

TTPoE in Ultra Ethernet Consortium (UEC)
https://ultraethernet.org/
https://ultraethernet.org/wp-content/uploads/sites/20/2023/10/23.07.12-UEC-1.0-Overview-FINAL-WITH-LOGO.pdf
- Tesla has achieved Exa-scale with a lossy fabric, executing real training runs deployed in FSD
- Tesla is joining the UEC and offering the TTPoE protocol publicly

Team Acknowledgements
Prototyping is Easy. Scaling is Hard.
Thanks to the TTPoE Original Inventors, Network Deployment Team, Silicon Design Team, System and Infrastructure Team, SW and Drivers Team, Linux Patch Team, SDN Team, DevOps Team, QA Team, DC Tech Team, Supply Team, and all TTP/Mojo Interns.

Backup: Latencies
- Intended de-emphasis on synthetic latency measurements
- Differences of greater consequence:
  - lossy vs lossless
  - centralized vs distributed congestion
  - proprietary vs open source
  - sustained bandwidths at scale
[Latency chart labels as extracted: TTPoE, TCP/IP, Spectrum3 SN4700, IB, Spectrum 9700 IB, Nvlink DGX-H100 NvSwitch level1 (internal), RoCEv2 7812 R3. Chart note: inconsistent methodology and hardware, not at scale.]
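
The TX buffer figure on the TTP MAC Micro-Architecture slide (1MB buffer, 80 microseconds of RTT tolerance) is consistent with a simple bandwidth-delay calculation at the 100 Gbps Mojo link speed. The sketch below is a back-of-the-envelope check, not a description of the hardware. It assumes 1 MB means 10^6 bytes and that the figure refers to a 100 Gbps link. The function names are hypothetical.

```python
def rtt_tolerance_us(buffer_bytes, link_gbps):
    """Longest RTT (in microseconds) the TX buffer can cover at full line rate."""
    bytes_per_us = link_gbps * 125          # 1 Gbps = 125 bytes per microsecond
    return buffer_bytes / bytes_per_us


def sustained_gbps(buffer_bytes, link_gbps, rtt_us):
    """Throughput once the un-ACKed window is capped by the TX buffer size."""
    window_gbps = buffer_bytes * 8 / rtt_us / 1e3   # bits per microsecond -> Gbps
    return min(link_gbps, window_gbps)


if __name__ == "__main__":
    one_mb = 1_000_000  # assumption: decimal megabyte
    print(rtt_tolerance_us(one_mb, 100))        # RTT tolerance at 100 Gbps
    print(sustained_gbps(one_mb, 100, 40))      # RTT below tolerance: link-limited
    print(sustained_gbps(one_mb, 100, 160))     # RTT above tolerance: window-limited
```

Under these assumptions the tolerance works out to 80 microseconds, matching the slide. Doubling the RTT to 160 microseconds halves the sustainable rate, which is the “TTP window size” effect described on the Lossy Protocol slide.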