Re-Configurable Photonic Interconnects for AI and ML
William Koss, CEO of Drut Technologies
SPECIAL FOCUS: OPTICS

A technical journey to develop and exploit the advantages of an all-photonic switch for AI/ML cluster designs.

A scalable photonic fabric for GPU-to-GPU connectivity under software control is an alternative architecture to legacy electrical packet-switched networks, which never have enough buffers, bandwidth, or accelerators to overcome congestion controls. All-photonic fabrics provide significant advantages in operating cost, power consumption, latency, resource efficiency, network-hop reduction, and multi-vendor openness. By using a system-level architecture designed to exploit an all-optical fabric, a far better overall TCO for the accelerator era can be realized through dynamic reconfigurability, finer fidelity of resource upgrades, and workload resource slicing.

WHY TRANSITION TO A PHOTONIC FABRIC
Photonic fabric versus electrical packet switched:
- Reconfigurability: a dynamic set of wires that can be changed through software
- Data-rate and format agnostic; extensible across multiple generations for a long useful life
- Low latency and low power per port
- Large, extensible radix: a flat network fabric with fewer layers
- Slower switching time, but offset by software control and fabric topology slicing
- 150-175 W optical switches versus 1,500-1,700 W electrical packet switches
- Supports multiple generations of future fabric speeds
- Deterministic traffic patterns: users fit the fabric around the workload, rather than the workload trying to fit around the fabric

DYNAMIC PHOTONIC INTERCONNECTS NEEDED
- AI workloads (TensorFlow), such as Large Language Model (LLM) creation, need to move large amounts of data between GPUs
- GPU-to-CPU memory swaps occur due to GPU memory page sizes
- ML models such as BBN, Isolation Forest, and fuzzy logic compute over vast amounts of historical data
- Applications are heterogeneous (AI, ML, and storage)

Accelerator-era scale-out has several challenges:
- GPU memory IO is increasingly the bottleneck for AI training
- AI infrastructure scale-out requires sprawl
- Static network topologies are hard to tune to AI/ML flows
- Infrastructure performance is not keeping pace with model growth

This bottleneck further exacerbates:
- Resource utilization inefficiency, which underpins high energy waste
- ML/AI software complexity (it needs topology knowledge)

The result is increased OpEx and CapEx cost and inadequate ML/AI performance.
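The "dynamic set of wires" idea above can be sketched in code. The following is a minimal, hypothetical model (class and method names are illustrative, not Drut's actual API): an all-photonic switch as a port-to-port cross-connect mapping that software rewrites at runtime, plus the per-switch power figures cited in the deck.

```python
from __future__ import annotations


class PhotonicCrossConnect:
    """A dynamic 'set of wires': a symmetric port-to-port mapping that
    software can rewrite at runtime. Because the switch is data-rate and
    format agnostic, the mapping carries no notion of link speed."""

    def __init__(self, radix: int):
        self.radix = radix
        self.links: dict[int, int] = {}  # port -> port, kept symmetric

    def connect(self, a: int, b: int) -> None:
        # Tear down any existing circuit on either port, then cross-connect.
        for p in (a, b):
            peer = self.links.pop(p, None)
            if peer is not None:
                self.links.pop(peer, None)
        self.links[a] = b
        self.links[b] = a

    def peer(self, p: int) -> int | None:
        return self.links.get(p)


# Reconfigure GPU-to-GPU circuits under software control:
fabric = PhotonicCrossConnect(radix=64)
fabric.connect(0, 1)   # gpu0 <-> gpu1
fabric.connect(0, 2)   # retarget: gpu0 <-> gpu2, old circuit torn down
assert fabric.peer(0) == 2 and fabric.peer(1) is None

# Per-switch power figures from the slide above:
optical_w, electrical_w = 175, 1700
print(f"electrical/optical power ratio ~ {electrical_w / optical_w:.1f}x")
# prints "electrical/optical power ratio ~ 9.7x"
```

The point of the model is that "switching" here is circuit reconfiguration, not per-packet forwarding, which is why the slower reconfiguration time can be offset by software-driven topology slicing.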
SOLVING HARD SYSTEM-LEVEL PROBLEMS
Make it all work:
- Disaggregate and attach the server bus (e.g., PCIe, CXL) to an all-photonic fabric
- Attach RDMA-capable NICs to the all-photonic fabric
- Transition from pluggable optics to co-packaged/near-packaged optics with internal lasers (to overcome server form-factor challenges)
- Build workload-aware topology management software to exploit the dynamic nature of all-photonic switches
- Discover endpoints (CPU, GPU, FPGA, storage, memory, etc.) around the all-photonic fabric, as well as fabric paths, to construct workload topologies

WHY PCIe OVER PHOTONICS (PoPh) AND CXL
PCIe 5.0, CXL 3.0, UCIe
- Full PCIe bus disaggregation removes the constraint of the rigid sheet-metal box
- Requires the ability to maintain side-band signaling
- Adds a dynamic nature to hardware configuration; think of it as a hardware hypervisor
- Enables discrete component upgrade paths and a differentiated TCO
- Requires modification of the Linux kernel
- Eliminates the need to use complex hierarchies of network layers to connect resources
- Nodes can be composed of resources, and clusters of nodes can be composed into homogeneous topology slices

WHY RDMA OVER PHOTONICS (RoPh)
- Resolves the overhead created by data copies between user space and the kernel
- Removes the end-to-end latency created by TCP windowing
- Frees CPU cores to perform application tasks
- Removes the high number of CPU interrupts needed for traditional high-speed network protocols (TCP/IP)
- An RDMA-enabled NIC provides 6x lower latency than iWARP: ~1 µs latency (ns range)
- RDMA over Converged Ethernet (RoCE): 50 to 80 ms end to end with queuing and buffering in a traditional Ethernet environment

CHALLENGES WITH RDMA OVER ETHERNET
- The cost of Clos, rail, and fully meshed network topologies
- Ethernet networks are designed to be heterogeneous
- Ethernet requires implementing physically different networks: "frontend" (storage and management) and "backend" (RDMA)
- Intel and Marvell state no more than two switch hops for end-to-end RDMA connectivity to maintain RDMA benefits and prevent congestion and packet loss
- Ethernet: a Broadcom Trident switch typically has a latency of 1 to 2 µs, plus ~20 ns of cable latency (roughly 1.01 µs, or 1,010 ns, per hop)
- DCBx, ECN, and 802.3ad hash enhancements are used to "hopefully" prevent congestion
- RoCEv2 uses IP/UDP, and with multiple NICs it is impossible to keep 100% of the peer-to-peer communication separate unless every node is multi-homed to every switch

HOW IS A 3D-TORUS PHOTONIC FABRIC DIFFERENT?
- Eliminates the cost of electrical packet switching at scale
- A heterogeneous fabric with homogeneous virtual pods
- Shape and reconfigure the topology to match workloads
- Eliminates network hops and silicon latency
- Eliminates the need for dynamic protocols (DCBx, 802.3ad)
- Removes congestion issues entirely, with dedicated path bandwidth exceeding PCIe lane bandwidth
- Reduces high power and cooling costs
- Dedicated point-to-point paths through a dynamic optical fabric
- RDMC (RDMA-enabled multicast) to vPOD mapping
- Reduces resource stranding

[Figure: 3D torus across the x, y, and z axes — all-photonic fabric plus a management network.] Ethernet is used for out-of-band and traditional networking functions only.

THIS HAS BEEN DONE BEFORE, BUT NOT BY MANY
Google has a long history of deploying optical fabrics for AI and ML using all-optical MEMS-based switching.

EVOLVING TO DYNAMIC FROM STATIC DESIGNS
"TOPOLOGY SLICING" (application-tuned allocations):
- Up to 64 nodes: 1x1x2 (2), 1x2x2 (4), 2x2x2 (8), 2x2x4 (16), etc.
- 64-191 nodes: 4x4x4 (64), 4x4x8 (128), 4x4x12 (192), etc.
- 192-511 nodes: 4x8x8 (256), 4x4x16 (256), 4x8x12 (384), etc.
- 512-1,023 nodes: 8x8x8 (512), 4x8x16 (512), 4x4x32 (512), 8x8x12 (768), etc.
- 1,024 (1K)-2,047 (2K) nodes: 8x8x16 (1K), 4x16x16 (1K), 4x8x32 (1K), 8x12x16 (1.5K), 4x4x96 (1.5K), etc.
- 2,048 (2K)-3,071 (3K) nodes: 8x16x16 (2K), 12x16x16 (3K), 4x4x192 (3K)
- 3,072 (3K)-4,096 (4K) nodes: 15x15x15 (3,375), 16x16x16 (4K)

"Build one rack at a time" (total racks per cluster: 64; total GPUs: 4,096).

[Figure: 4x4x4 grid of racks along the x, y, and z axes, Rack 000 through Rack 333.]

Four (4) Availability Zones: the fabric can be built with four Availability Zones (AZs). Each AZ can contain up to 4,096 GPUs, providing a cluster capacity of 16,384 GPUs for slicing. This is the magic: a dynamic interconnect fabric that can be tailored to the needs of the application, user, or workload.

WORKLOADS ARE NOT THE SAME
AI and ML training have different properties. RDMA over Photonics (RoPh) provides:
- GPU-to-GPU acceleration with RDMA
- PyTorch and TensorFlow on ROCm
- GPU Direct for ROCm with RDMA (AMD)
- GPU Direct for CUDA with RDMA (Nvidia)
- GPU Direct for TensorFlow with RDMA (Intel)
- Application user-space storage: SCSI, SMB3, and NFS over RDMA
- Ceph: 30% to 50% higher performance with RoCE; 60% to 70% higher performance guaranteed
- Significantly lower latency with RoPh
- Dedicated paths are a must with high interface utilization

CALL TO ACTION: IT IS ABOUT SUSTAINABILITY
- We need more focus on eliminating layers upon layers of static interconnect
- We need better utilization of resources inside the data center
- We have reached the point where hyperscalers are buying or building nuclear power plants to power infrastructure
- We need to exploit the benefits of photonics at scale inside the data center for more than just point-to-point links

Thank you!
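Appendix: the topology-slicing tiers above can be sketched as a small search over 3D-torus shapes. This is a minimal sketch, not Drut's allocator; the function name and the per-axis cap (192, the largest axis the table uses) are assumptions for illustration.

```python
def torus_shapes(nodes: int, max_axis: int = 192) -> list[tuple[int, int, int]]:
    """All torus shapes x <= y <= z with x * y * z == nodes and z <= max_axis.

    Each shape is one way to slice the photonic fabric into a torus of
    exactly `nodes` endpoints, as in the 'TOPOLOGY SLICING' table.
    """
    shapes = []
    # x can be at most the cube root of nodes, since x <= y <= z.
    for x in range(1, int(round(nodes ** (1 / 3))) + 1):
        if nodes % x:
            continue
        rest = nodes // x
        # y can be at most the square root of the remainder, since y <= z.
        for y in range(x, int(rest ** 0.5) + 1):
            if rest % y:
                continue
            z = rest // y
            if z <= max_axis:
                shapes.append((x, y, z))
    return shapes


# Shapes with every axis >= 4 match the table's 512-node row:
print([s for s in torus_shapes(512) if s[0] >= 4])
# → [(4, 4, 32), (4, 8, 16), (8, 8, 8)]
```

Enumerating the candidate shapes like this is what lets topology management software pick the slice whose dimensions best match a given workload's communication pattern.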
25、ith high interface utilizationWe need more focus eliminating layers on top of layers of static interconnectWe need better utilization of resources inside the data centerWe have reached a point that hyperscalers are buying or building nuclear power plants to power infrastructureWe need to exploit the benefits of photonics at scale inside the data canter for more that just point-to-point linksCALL TO ACTION IT IS ABOUT SUSTAINABILTYSUSTAINABILITYOPTICSThank you!