Photonic Fabric(TM)-Based Scale-Up Network for Chip-to-Chip & Chip-to-Memory Connectivity
Preet Virk, Co-Founder & COO, Celestial AI
AI Hardware Summit, San Jose, 11 September 2024
(c) 2024 Celestial AI Inc. All rights reserved. Celestial AI, the C logo, and Photonic Fabric are trademarks or registered trademarks of Celestial AI Inc. in the United States and other countries.

AI: The Largest Technological Wave We Have Ever Seen
- Mainframe (1960-1980, ~$200Bn): hundreds of thousands of users
- Client/Server (1980-2000, ~$600Bn): millions of users
- Cloud/Mobile (2000-2020, ~$1.2Tn): billions of users
- AI / Large Language Models (2020 and beyond, ~$20Tn): tens of billions of connected people, devices, and applications
Source: The economic potential of generative AI (McKinsey, June 2023); total AI economic potential estimated to range from $17.1Tn to $25.6Tn.

Unprecedented Growth & Market Concentration
- Four hyperscalers represent 70% of the data center market (source: Morgan Stanley)
- All are building their own AI processors and data center architectures

AI Driving a New Generation of Optical Compute Interconnect
- HBM3/HBM4-equivalent bandwidth: the minimum requirement for accelerated computing
- Optical interconnect scale-up network for XPU-to-XPU connectivity: enabling cluster-scale processing of AI models
- Celestial AI's Photonic Fabric: optical interconnect for accelerated computing
[Figure: total off-package bandwidth (Tbps, log scale from 10 to 10,000) vs. year (2016-2034); industry trend based on data by John Wilson, Nvidia Research, "High Bandwidth Density, Energy Efficient Short Reach Signaling that Enables Massively Scalable Parallelism"; CPO reaches 17.8 Tbps.]
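The chart's per-XPU roadmap points can be cross-checked against the per-chiplet rates quoted later in the deck. A minimal sketch, using only the chart's figures and its footnoted assumption of 6 chiplets per XPU:

```python
# Off-package bandwidth roadmap points from the chart (Tbps per XPU),
# quoted assuming 6 photonic chiplets per XPU (chart footnote 2).
per_xpu_tbps = {
    "Gen1 chiplet (2024)": 86.4,
    "Gen2 chiplet (2025-2026)": 172.8,
}
CHIPLETS_PER_XPU = 6

for gen, total_tbps in per_xpu_tbps.items():
    per_chiplet = total_tbps / CHIPLETS_PER_XPU
    print(f"{gen}: {total_tbps} Tbps per XPU -> {per_chiplet:.1f} Tbps per chiplet")
# 86.4 / 6 = 14.4 Tbps (Gen1) and 172.8 / 6 = 28.8 Tbps (Gen2) per chiplet,
# matching the 14.4 / 28.8 Tbps chiplet figures quoted later in the deck.
```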
[Figure, continued: package options shown are package bottom, MCM package (chiplet), and interposer; Photonic Fabric advantage points: Gen1 chiplet (2024) 86.4 Tbps and Gen2 chiplet (2025-2026) 172.8 Tbps, assuming 6 chiplets per XPU; Gen1 IP/interposer (2024-2026) 390 Tbps; Gen2 IP/interposer (2026-) 650 Tbps.]

Revolutionary Integration of Silicon Photonics for Accelerated Computing
- Compact, thermally stable optical modulator: enables chip-to-chip packaging with XPUs dissipating hundreds of watts
- Photonic Fabric IP: optical waveguides, grating couplers, and fiber array units (FAUs) on a photonic interposer (OMIB), connecting chiplets/ASICs/XPUs/HBM over the substrate
- No-DSP linear-drive optics: high SNR, low BER, and close proximity of optics to electronics eliminate the need for a DSP
- Multiple packaging options tailored for customer applications
- Integration of advanced CMOS with Si photonics
- Full-stack E-O-E link optimization: protocol-adaptive Network Convergence Layer (NCL); full electrical-optical-electrical (E-O-E) link management with FEC, CRC, and FLIT replay

Why Photonic Fabric(TM) vs. Copper?
Photonic connectivity for accelerated computing, compute-to-compute and compute-to-memory:
- Higher off-package bandwidth: hundreds of Tbps, unrestricted by beachfront
- Efficient data movement: zero-mass photons vs. I2R losses from electrons
- Lower latency and power: eliminates DSPs, deep FEC, and re-timers
- Fewer system connections: higher overall system reliability
- Longer reach: connect multiple racks with less power; more efficient remote memory transactions (RDMA)
- 10 pJ/bit (including 2.4 pJ/bit for optics) for up to 50 m, vs. 60 pJ/bit for copper for up to 1 m

Photonic Fabric(TM) Link: Module & Appliance
Shattering the memory and interconnect bandwidth wall for accelerated computing:
- Celestial AI-designed system-in-package (SiP): memory controllers + Photonic Fabric Link + network switch
- 2.07 TB memory capacity at 7.2 Tbps bandwidth with 100 ns latency
- HBM3E operates as a write-through cache for DDR: the bandwidth and latency of HBM3E with the capacity and cost of DDR5
- Module (side view): DDR5 DIMMs and HBM3E stacks around the Photonic Fabric ASIC, PIC, and fiber array unit (FAU)
- Photonic Fabric Appliance: 16 Photonic Fabric memory modules in a 2U appliance; 33 TB memory capacity; 115 Tbps network switch enabling a backend/scale-up AI fabric

Photonic Fabric(TM) Link Implementation: The Chiplet Approach
Full-stack high-bandwidth optical interconnect solution for accelerated computing:
[Figure: chiplet floorplan, 1.4 mm tall, with microcontroller, AMS blocks, fiber array unit (FAU), protocol-adaptive layer, and an array of UCIe-A interfaces facing the customer AI processor.]
- Standard die-to-die interface: UCIe-A or MAX PHY
- 2.4 pJ/bit photonic link power (Gen1); Photonic Fabric IP macros with a protocol-adaptive layer
- 14.4 Tbps per chiplet (Gen1) and 28.8 Tbps per chiplet (Gen2): full HBM3E bandwidth, 3.6x more bandwidth than a CPO chiplet
- 10 Gen1 chiplets per XPU package = 144 Tbps; 10 Gen2 chiplets per XPU package = 288 Tbps
- Protocol adaptive: AXI, HBM/DDR, UAL, CXL, etc.

Photonic Fabric Delivers Superior Off-Package Bandwidth
Photonic Fabric-based chiplets offer more package bandwidth, with full link management:
- Current SoTA: 4 Tbps per chiplet, 40 Tbps per package (4x-reticle package with 10 chiplets)
- Photonic Fabric: 14.4 Tbps (Gen1) / 28.8 Tbps (Gen2) per chiplet, 144 Tbps (Gen1) / 288 Tbps (Gen2) per package (4x-reticle package with 10 chiplets)

Photonic Fabric(TM) 14.4 Tbps Photonic Interface Chiplet
Seamless integration with existing AI accelerators and XPUs:
- Compute-to-compute: scale-up/backend networks
- Compute-to-memory: photonically scalable disaggregated memory at full HBM3E bandwidth
- Unlocks photonic connectivity (E-O-E) for AI XPUs
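The deck's link-efficiency comparison (10 pJ/bit optical for up to 50 m vs. 60 pJ/bit copper for up to 1 m) translates directly into joules per volume of data moved. A minimal sketch using only the quoted pJ/bit figures; the 1 TB transfer size is an illustrative choice, not from the deck:

```python
# Energy to move data off-package at the quoted link efficiencies.
PJ = 1e-12  # one picojoule, in joules

def transfer_energy_joules(num_bytes: float, pj_per_bit: float) -> float:
    """Energy for a transfer at a given link efficiency (pJ/bit)."""
    return num_bytes * 8 * pj_per_bit * PJ

ONE_TB = 1e12  # bytes
optical = transfer_energy_joules(ONE_TB, 10.0)  # Photonic Fabric, up to 50 m
copper = transfer_energy_joules(ONE_TB, 60.0)   # copper, up to 1 m

print(f"optical: {optical:.0f} J/TB, copper: {copper:.0f} J/TB, "
      f"ratio: {copper / optical:.0f}x")
# -> optical: 80 J/TB, copper: 480 J/TB, ratio: 6x
```

At these figures the optical link moves a terabyte for a sixth of the energy, while also reaching 50x the distance.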
- Protocol adaptive: standard protocols (AXI, HBM/DDR, CXL), emerging protocols (UAL), and proprietary protocols
- Standard D2D interface: UCIe / MAX PHY
- Standard 2.5D packaging flows from multiple large OSATs; automated high-volume, high-throughput fiber-attach process
- Less beachfront than one HBM stack

Compute-to-Compute & Compute-to-Memory Photonic Connectivity
The Photonic Fabric(TM) technology platform enables cluster-scale AI processing:
- Photonic Fabric(TM) Link: 115 Tbps switch for the backend/scale-up network; all-to-all connectivity for efficient collective comms
- Photonic Fabric(TM) Appliance: 33 TB memory; broadcast and reduce across all connected XPUs; 33 TB unified/shared memory space

Photonic Fabric Link & Appliance: AI Efficiency & Performance
Photonic Fabric Link and Appliance deliver compelling benefits for AI workloads:
- 56 conventional XPUs with 192 GB of HBM3 each are required to process a 10T-parameter DLRM model; each conventional XPU holds 1/56th (1.79%) of the model in its HBM (192 GB)
- 16 optically connected XPUs attached to a Photonic Fabric Appliance (33 TB memory capacity, full HBM3 bandwidth, 115 Tbps switch) process the same 10T-parameter DLRM model; each XPU has access to the full model stored in the PFMA
- Up to 71% XPU CapEx and power reduction; higher compute density; memory resources scalable independently of compute; 12.5x DLRM performance speed-up

Photonic Memory Fabric: Delivers Higher Throughput on GPT4
- Conventional 16-XPU / 2-server rack: front-end Ethernet (400/800 Gbps) via CPUs, PCIe switches, and NICs; scale-up/backend network constrained; the 2 servers hold 3 TB of HBM, so GPT4 (1.8T-parameter) MoE models are memory-restricted: 1.8 TB of model weights (fp8) leaves 1.2 TB of the 3 TB for inferencing (KV cache, context lengths, batch size)
- 16-XPU / 2-server rack + Photonic Fabric Appliance (33 TB memory capacity, full HBM3 bandwidth, 115 Tbps switch): 36 TB (12x) more fast memory plus a 7.2 Tbps scale-up (backend) network per XPU; 1.8 TB of model weights (fp8) leaves 34.2 TB of the 36 TB for inferencing, easily serving GPT4 (1.8T) MoE models with large context lengths and batch sizes

128 XPU Cluster-Scale Photonic Fabric for Accelerated Computing
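The memory figures behind the preceding appliance, DLRM, and GPT4 comparisons reduce to simple arithmetic. A sketch using only numbers quoted in the deck (fp8 taken as 1 byte per parameter):

```python
# Appliance capacity: 16 modules of 2.07 TB each in a 2U appliance.
appliance_tb = 16 * 2.07
print(f"appliance capacity: {appliance_tb:.2f} TB")  # ~33 TB as quoted

# DLRM: a 10T-parameter model sharded evenly over 56 conventional XPUs.
per_xpu_share = 1 / 56
print(f"per-XPU model share: {per_xpu_share:.2%}")   # 1.79% as quoted

# GPT4 (1.8T parameters) at fp8 = 1 byte/parameter -> 1.8 TB of weights.
weights_tb = 1.8e12 * 1 / 1e12
for label, fast_mem_tb in [("conventional 2-server HBM (3 TB)", 3.0),
                           ("with Photonic Fabric Appliance (36 TB)", 36.0)]:
    free_tb = fast_mem_tb - weights_tb
    print(f"{label}: {free_tb:.1f} TB left for KV cache / context / batch")
# -> 1.2 TB and 34.2 TB respectively, matching the slide.
```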
(Illustrative picture; not all photonic links shown.)
- 128-XPU backend/scale-up fabric at 7.2 Tbps per XPU, built from 16 Photonic Fabric Appliances
- Front-end network: Ethernet (400/800 Gbps) via CPUs, PCIe switches, and NICs
- Up to 792 TB system memory capacity
- Enables very large clusters with a significantly lower carbon/energy (TCO2) impact

Celestial AI Photonic Fabric(TM) vs. Current State-of-the-Art
Photonic Fabric delivers disruptive performance, the lowest latency, improved energy efficiency for data movement and compute, and lower TCO:
- 3.6x more bandwidth per optical chiplet(1) (Gen1 chiplet vs. a competitive solution, against the HBM3E reference of 8 Tbps); 16x more off-package bandwidth
- 5x lower RDMA latency(2)
- 8x better power efficiency (pJ/b)(3)
- 26x lower cost ($/GB)(4)
SOTA benchmarks: (1) Ayar Labs TeraPHY; (2) DGX H200 remote direct memory access (RDMA); (3) RDMA power for fourth-generation NVLink; (4) NVIDIA DGX H200.

The Photonic Fabric(TM): Designed for Volume Deployment
The right technology at the right time, a full-stack optical interconnectivity platform:
- Volume-manufactured silicon photonics, driven by data communications
- Silicon photonics with advanced CMOS: control circuitry, SerDes, router/switch
- Opto-electronic systems-in-package; wafer-scale assembly and test
- Leveraging established supply chains

How Does the Photonic Fabric Transform AI?
A new vision of AI infrastructure, unconstrained by fast-memory capacity and scale-up bandwidth:
- Efficiently process very large LLMs: train and serve 1T+ models such as OpenAI GPT4-1.8T MoE and Anthropic Claude Opus (2T)
- Accelerates multi-modality: text-to-image and text-to-video (OpenAI Sora, RunwayML)
- Supports larger context lengths: larger context lengths activate new use cases
- Simplifies the AI software stack: all-to-all/broadcast interconnect simplifies collective comms and reduces the need for sharding
- Democratizes AI: lowers the cost of AI
- Mitigates AI carbon impact: helps meet carbon-neutrality targets even with increasing AI usage

info@celestial.ai | www.celestial.ai