ISSCC 2025
SESSION 16: Invited Industry

16.1: Tomahawk 5 51.2Tbps Ethernet Switch
2025 IEEE International Solid-State Circuits Conference

Tomahawk 5: 51.2Tbps Ethernet Switch
Asad Khamisy, PhD

Outline
- Overview
- Technology Highlights
- Packet Flow
- Monolithic Chip Drivers
- Low Power, Air Cooled Pizza Box Design
- TH5-Bailly: Direct Drive Co-Packaged Optics
- Summary

Overview
- 51.2 Tbps Ethernet/IP Switching Bandwidth
  - 2X bandwidth every 2 years at lower $/Tbps, maintaining Moore's Law
- Accelerates AI Workloads
  - Cognitive routing, advanced telemetry and congestion control
- Diverse Physical Connectivity
  - 4m DAC, KR backplane, pluggable optics, LPO/LRO, co-packaged optics

[...]

16.2 RNGD: A 5nm Tensor-Contraction Processor for Power-Efficient Inference on Large Language Models
2025 IEEE International Solid-State Circuits Conference

[...] 50% compared to a typical clock tree
- On-chip-variation mitigation
[Figure: clock distribution from a root driver to Slice_0 ... Slice_31; labeled lengths 1220 µm, 600 µm, 280 µm]

HBM Channel Routing Layout
[Figure, top view: 16-channel routing (Channel 0 to Channel 15, DWORD 0-1, DQ 0-31) between the DRAM side and the PHY side; layers RDL, M4, M3, M2, M1; Layer M3, M4 and Layer M2, M1 with shielding; labeled dimensions 2978 µm, 2868 µm, 234 µm]
Cross-Section View
- Metal 5: VSS
- Metal 4: Signal
- Metal 3: VSS
- Metal 2: Signal
- Metal 1: VSS
- Labeled dimensions: 1.3 µm, 2.0 µm

HBM Signal and Power Integrity
- Interposer RLC extracted with 3D full-wave EM modeling
- Linearized switching DC-DC power supply modeling included
[Figure, Signal Integrity: S-parameters (return loss, insertion loss), eye diagram, near-end crosstalk, far-end crosstalk; labels Ch A, 0.78 UI, 0.5 UI]
[Figure, Power Integrity: PDN Z-profile and voltage fluctuation (Vpp); target impedance sweep with different current profile frequency; DC current, profile frequency sweep]

HBM3 power rails (PDN Z-profile and voltage fluctuation):
Rail            DC (mΩ)   Zpeak (mΩ)          Vpp (%)
VDDQL (0.4 V)   4.39      5.71 at 13.8 MHz    3.65
VDD (0.75 V)    5.78      52.0 at 7.4 MHz     7.48

Core Power Integrity
- Package bump resistance and inductance simulated
- Bumps are merged by 10x10 grid for simulation
[Figure labels: PCB, interposer, 10x10 u-bump, ball, package, C4 bump, probing; package bump R&L (resistance, inductance); CPM 10x10]
- Workload transient
  - VDDC_CLUS0: peak 307 A
  - VDDC: peak 73 A
- DC IR: (750-685)/750 = 8.67%; AC IR (Vmin): (685-678)/750 = 0.93%; Vavg: 685 mV; Vmin: 678 mV
- DC IR: (750-648)/750 = 13.6%; AC IR (Vmin): (648-642)/750 = 0.8%; Vavg: 648 mV; Vmin: 642 mV
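
The IR-drop percentages on the Core Power Integrity slide follow directly from the quoted voltages. A minimal sketch that reproduces the arithmetic, assuming a 750 mV nominal supply as in the slide's formulas; the function name `ir_drop_percent` is illustrative and not from the talk:

```python
def ir_drop_percent(v_nominal_mv, v_avg_mv, v_min_mv):
    """Return (DC IR drop %, AC IR drop %) relative to the nominal supply."""
    # DC drop: nominal supply down to the average on-die voltage.
    dc = (v_nominal_mv - v_avg_mv) / v_nominal_mv * 100
    # AC drop: average voltage down to the transient minimum.
    ac = (v_avg_mv - v_min_mv) / v_nominal_mv * 100
    return dc, ac


# The two result sets quoted on the slide (Vavg/Vmin in mV).
first_dc, first_ac = ir_drop_percent(750, 685, 678)
second_dc, second_ac = ir_drop_percent(750, 648, 642)

print(f"set 1: DC {first_dc:.2f}%  AC {first_ac:.2f}%")
print(f"set 2: DC {second_dc:.2f}%  AC {second_ac:.2f}%")
```

Both percentages are normalized to the nominal 750 mV, not to the average voltage, which is why the AC figures stay below 1% even though Vmin sits well under nominal.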

PVT Sensors with Thermal Model and Heat Sink
- Custom heat sink, case, and bracket for air cooling
- Infrared camera photo without heat sink for BERT
[Figure: chip floorplan with sensor locations; legend: process monitor; Group A: temp. sensor, voltage monitor; Group B: temp. sensor, voltage monitor; blocks: PE (x8), CPU, Security, NoC, HBM3 sub-system, PCIe]

Reliability & Yield
- Extra slice for each PE
- ECC for SRAM, DRAM, NoC
- Timing margins are continuously monitored
  - Long-term aging is monitored
- Voltage droop detector
  - Fully digital and wide-bandwidth
  - Monitor localized fast supply voltage drops
  - Trigger interrupts if the voltage falls below a pre-defined threshold
- Supports encryption for secure booting
[Figure: timing margin map at 755 mV, 53 °C; scale 12.00 to 31.00; floorplan blocks: PE (x8), CPU, Security, NoC, HBM3 sub-system, PCIe]
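
The threshold behaviour listed for the voltage droop detector can be illustrated in software. This is a sketch only: the real detector is a digital hardware block, and the sample values, the 700 mV threshold and the name `droop_interrupts` are hypothetical.

```python
def droop_interrupts(samples_mv, threshold_mv):
    """Return the sample indices at which the supply first drops below the threshold.

    One event is reported per excursion (on the falling crossing), mimicking an
    interrupt that fires when the voltage falls below a pre-defined level.
    """
    events = []
    below = False
    for i, v in enumerate(samples_mv):
        if v < threshold_mv and not below:
            events.append(i)  # falling crossing: raise an interrupt here
        below = v < threshold_mv
    return events


# Hypothetical supply trace in mV with two separate droops.
trace = [750, 740, 690, 680, 720, 695, 745]
events = droop_interrupts(trace, threshold_mv=700)
print(events)
```

Reporting only the falling crossing, not every sample under the threshold, keeps a long droop from flooding the interrupt handler.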

Dynamic Voltage and Frequency Scaling
- Temperature and peak performance are efficiently balanced for the PE and NoC
- Total board power is controlled [...]

[...]

16.4: SambaNova SN40L: A 5nm 2.5D Dataflow Accelerator with Three Memory Tiers for Trillion Parameter AI
2025 IEEE International Solid-State Circuits Conference

[...] 100 TBps
- RDU High Bandwidth Memory: 64 GB
- RDU High Capacity DDR Memory: up to 1.5 TB
- Bandwidth labels: 100 GB/s, 1.6 TB/s
- High throughput inference with caching
- Low Latency Model Switching
- Dataflow enabled by large On-Chip Memory

SN40L: Chip Overview
- RDU Core consists of programmable compute and memory units with meshed network
- Top Level Network (TLN) provides a bridge to off-chip communication: P2P, HBM, DDR and PCIe
- Die-to-Die connectivity enhances core-to-core communication
[Figure: chip block diagram with RDU Cores, TLN, Die-to-Die, DDR, HBM, PCIe and P2P blocks]

Core Architecture
- 1040 Distributed Memory and Compute Units
[Figure: array of PMU, PCU, S and AGCU blocks]
- PCU: Pattern Compute Unit / PMU: Pattern Memory Unit
- S: Mesh Network Switches / AGCU: Address Generation and Coalescing Unit

Pattern Compute Unit (PCU)
- HEADER: Consume and organize dataflow packets from PMU and Switch
- TAIL/Output: Special elementwise operations; Export packets to PMU and Switch
- BODY: Systolic Array; SIMD Core; Cross Lane reduction tree; 8 bit/16 bit/32 bit operations
[Figure labels: Scalar In FIFO, Vector In FIFO, HEADER, SIMD, regs, TAIL, Control Block, Counters, Broadcast Buffer, Scalar Out FIFO, Control Inputs, Control Outputs]

Pattern Memory Unit (PMU)
- ALU Pipeline: Complex address creation; Tensor access patterns
- Data Aligner: Transposes, Permute, LUT, Layout Conversion, Format Conversion
- Scratchpad: Programmer managed memory; Concurrent read and writes
[Figure labels: Vector In FIFO, Scalar In FIFO, HDR, Fragmentable Scalar ALU Pipeline, Vector Out FIFO, Scalar Out FIFO, Write Data Align, Read Data Align, Address Predication, Scratchpad Banks, Scalar ALU, Counters, Control Block, Scalar SRAM, Control Outputs, Control Inputs, R, W]

Switch Network and AGCU
Mesh based Switch Interconnect
- Vector network: carries data, packet switched
- Scalar network: carries addresses, other metadata, packet switched
- Control network: carries flow control, synchronization and graph orchestration tokens
- Hardware 2-D dimension order routing (DOR) with software override
Address Generation and Coalescing Unit (AGCU)
- Memory address generation to access DDR, HBM and Host
- Peer-to-Peer communication between RDUs for collective communications
- Graph control interface to load and orchestrate graph execution without host involvement
- Segment Lookaside Buffer for virtual-physical address translation and memory access management
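
Dimension order routing, named on the Switch Network slide, resolves one mesh axis completely before moving along the other. A minimal sketch on a 2-D grid follows; the coordinate scheme, the `first` parameter and the `override` argument (standing in for the slide's "software override") are assumptions for illustration, not SambaNova's implementation.

```python
def dor_route(src, dst, first="x"):
    """Return the list of (x, y) hops from src to dst, one axis at a time."""
    x, y = src
    path = [(x, y)]
    axes = ("x", "y") if first == "x" else ("y", "x")
    for axis in axes:
        if axis == "x":
            while x != dst[0]:
                x += 1 if dst[0] > x else -1  # step one switch along X
                path.append((x, y))
        else:
            while y != dst[1]:
                y += 1 if dst[1] > y else -1  # step one switch along Y
                path.append((x, y))
    return path


def route(src, dst, override=None, first="x"):
    """Use an explicit software-provided path if given, else fall back to DOR."""
    return list(override) if override is not None else dor_route(src, dst, first)


xy_path = dor_route((0, 0), (2, 1), first="x")
yx_path = dor_route((0, 0), (2, 1), first="y")
print(xy_path)
print(yx_path)
```

Because every packet turns at most once and always in the same axis order, the route is deterministic and the mesh cannot form cyclic channel dependencies, which is the usual reason for choosing DOR in hardware.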

Top Level Network (TLN)
- Interconnects RDU Cores with DDR, HBM, PCIe and Die-to-Die interfaces
- Four independent networks operating in parallel: Request, Data, Response, and Credit
- Hybrid Mesh/Ring Packet Switched interconnect
- Y-X DOR packet routing
- End-to-end credits to avoid deadlocks
[Figure labels: RDU Cores, DDR, HBM, PCIE, D2D]

SN40L XRDU PCB and SN40L-16 Rack
- XRDU with 2 SN40L Chips
- SN40L-16 Rack: Standard 19in 42RU Form Factor
- Node: 16 SN40L chips
[Figure labels: PDUs, Host Server, XRDUs, Switches; SN40L, RDIMM, FAN, Board, PSU, BMC, Switch, NIC, P2P; Front, Rear]
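
The end-to-end credit scheme listed for the Top Level Network can be sketched as a sender that injects a packet only while it holds a credit for a buffer slot at the destination. This is a generic illustration of credit-based flow control; the class name `CreditedLink` and the buffer size are hypothetical and do not describe the TLN's actual protocol.

```python
from collections import deque


class CreditedLink:
    """Sender/receiver pair where each credit stands for one free receive slot."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # credits held by the sender
        self.rx_buffer = deque()      # packets waiting at the destination

    def try_send(self, packet):
        """Inject a packet only if a destination slot is guaranteed."""
        if self.credits == 0:
            return False              # hold the packet at the source
        self.credits -= 1
        self.rx_buffer.append(packet)
        return True

    def drain_one(self):
        """Destination consumes a packet and returns one credit to the sender."""
        packet = self.rx_buffer.popleft()
        self.credits += 1
        return packet


link = CreditedLink(buffer_slots=2)
sent = [link.try_send(p) for p in ("a", "b", "c")]  # third send is refused
first_out = link.drain_one()                        # frees one slot
retry = link.try_send("c")
print(sent, first_out, retry)
```

Since a packet never enters the network without reserved space at its destination, it cannot stall mid-route holding links that other packets need, which is how end-to-end credits help avoid deadlock.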

Unlocking Operator Fusion Potential
- GPU based: 1 kernel = 1-5 operators
- RDU based: 1 kernel = 100s-1000s of operators
- SRAM Capacity: RDU/GPU 10x
- SRAM Bandwidth: RDU/GPU 10x
- Flexible Programming: Tensor access patterns/Manipulations

Executing Transformer on RDU
Example: Llama3.1 8B
[Figure: execution timeline across Embedding, Decoder 0, Decoder 1, Decoder 2, ..., Decoder 31, Classifier, Sampling; segments: Launch Overheads (microseconds), Sync, Weight Load (microseconds), Compute (microseconds)]
Two Key Optimizations:
- Spatial Fusion: Captures data locality
- Kernel looping: Overlaps compute and communication, eliminates overheads

Spatial Fusion: 1 decoder = 1 kernel on SN40L RDU
Example: Llama3.1 8B
- High operator fusion: one kernel call for all decoders, zero kernel launch overheads
- High data locality
[Figure: decoder dataflow graph from x_{d-1} to x_d; operators: RMSNorm, QGEMM, KGEMM, VGEMM, transpose, QKmatmul, Maskfill, Softmax, PVmatmul, OGEMM, AllReduce, Add, GateGEMM, UpGEMM, SiLU, Mul, DownGEMM; weights: Wq, Wk, Wv, Wo, Wgate, Wup, Wdown; pipeline Embedding, Decoder 0 ... Decoder 31, Classifier, Sampling labeled section_0(), section_1() (repeated), section_2(), section_3()]

Executing Transformer on RDU
Example: Llama3.1 8B
- Baseline: 100 tokens/s
- + Single Kernel Decoder: 500 tokens/s
- + Kernel Looping: 1115 tokens/s
[Figure labels: sec_0(), sec_1(), sec_3() over decoders D0, D1, D2, ..., D31, shown for both configurations]
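
The three throughput figures quoted for Llama3.1 8B imply the following speedups. The sketch only restates the slide's numbers as ratios; the helper name `speedups` is illustrative.

```python
# Tokens/s per optimization stage, as quoted on the slide.
stages = [
    ("Baseline", 100),
    ("+ Single Kernel Decoder", 500),
    ("+ Kernel Looping", 1115),
]


def speedups(stages):
    """Return (name, speedup vs. baseline, speedup vs. previous stage) rows."""
    rows = []
    base = stages[0][1]
    prev = base
    for name, tps in stages:
        rows.append((name, tps / base, tps / prev))
        prev = tps
    return rows


rows = speedups(stages)
for name, vs_base, vs_prev in rows:
    print(f"{name}: {vs_base:.2f}x vs. baseline, {vs_prev:.2f}x vs. previous")
```

Fusing each decoder into a single kernel accounts for a 5x gain on its own, and kernel looping adds a further 2.23x on top, for 11.15x overall.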

Large Language Model Family Performance
Model                 Tokens/second
1B Llama 3.2          2500
3B Llama 3.2          1500
8B Llama 3.1          1115
32B Qwen 2.5          317
32B QwQ               311
70B Llama 3.1         580
72B Qwen 2.5          226
405B Llama 3.1        200
671B DeepSeek-R1      198
- Highest tokens/sec in a single rack with 16 SN40L
- Close to ideal throughput scaling with batch size on Llama 3.1 70B

Agentic AI
- Multiple AI models work collaboratively to accomplish complex tasks
- Most advanced LLM systems constitute agentic workflows
  - Composition of Experts
  - Speculative Decoding
- Architecture for agentic AI needs to excel in:
  - RAW SPEED
  - MODEL VARIETY
  - MODEL SWAPPING

Agentic AI Advantage on SN40L
- Host and Serve Trillions of Parameters at a fraction of the cost of GPUs
- 19x Smaller Machine Footprint
- 3.7x Faster vs. DGX H100
- 15x Faster Model Switching

SN40L-16 Energy Efficiency and Power Management
Dataflow Architecture
- Fused operation minimizes data movement
- Eliminates core-based Instruction Set Architecture (ISA) overhead
3-Tier Memory Architecture
- Consolidates 10s-100s of models with high-speed model switching in a single system
Advanced Power Management
- Multiple control loops to effectively manage thermal, electrical and L·di/dt challenges
- Sub-usec power monitoring and control to handle highly oscillating workloads
- Continuous monitoring of temperature and voltage to ensure reliable operation
Full TFlop utilization in air-cooled system! 65°C at 22°C ambient
[Figure label: Graph Sections]

High-Speed Power Estimation Unit (PEU)
- Monitoring activity and data toggling of 3500 compute, memory, network component powers in sub-usec range
- Monitoring of op-code, clock enable, zero data and data toggling with real time voltage and frequency synchronization for accurate power estimation
[Figure labels: Data Monitor, Activity Monitor, Power Calculator, xPE, Power Accumulator, Activator, PEU; SPE with Switch, MPE with PMU, CPE with PCU, CPA]
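
An activity-based estimator of the kind described on the PEU slide can be sketched with the textbook dynamic-power relation P = C_eff · V² · f, summed over monitored components. Everything specific here is an assumption for illustration: the per-component capacitances, the toggle rates, the field names and the use of this particular model are not taken from the talk.

```python
def estimate_power_w(components, voltage_v, freq_hz):
    """Sum a simple dynamic-power estimate, C_eff * V^2 * f, over components."""
    total_w = 0.0
    for comp in components:
        if not comp["clock_enabled"]:
            continue  # a clock-gated unit is treated as drawing no dynamic power
        # Effective switched capacitance scales with observed data toggling.
        effective_cap_f = comp["cap_f"] * comp["toggle_rate"]
        total_w += effective_cap_f * voltage_v ** 2 * freq_hz
    return total_w


# Hypothetical monitored units: one active compute unit, one gated memory unit.
units = [
    {"name": "pcu0", "clock_enabled": True, "cap_f": 2e-9, "toggle_rate": 0.5},
    {"name": "pmu0", "clock_enabled": False, "cap_f": 2e-9, "toggle_rate": 0.9},
]

full = estimate_power_w(units, voltage_v=1.0, freq_hz=1e9)
scaled = estimate_power_w(units, voltage_v=0.5, freq_hz=1e9)
print(full, scaled)
```

The quadratic voltage term is why the estimate has to track the live voltage and frequency: the same activity counts at half the voltage correspond to a quarter of the power.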

Accuracy Correlation of PEU
[Figure panels: Fast time scale correlation (usec); Slow time scale correlation (msec)]
- PEU accurately correlates with VRM power in msec timescale and creates perfect mirror image of voltage behavior in usec timescale

Batched Inference Performance for Low-power Datacenters
- Llama3.1 405B batched inference performance and power comparison between unconstrained power and power-constrained rack scenarios
- Token generation performance degradation is minimized with PEU for power-constrained datacenters
[Figure labels: 27%, 3%]

Performance and Energy Efficiency Comparison
- VIT = Vision Transformer
- COE 150 8B = Composition of 150 Llama 3.1 8B experts

Summary
- SN40L presents a completely reimagined chip architecture targeting the generative, agentic AI era.
- Unique dataflow features enable optimizations that are beyond the horizon for traditional accelerators, delivering SOTA performance and energy efficiency.
- Most accurate and highly responsive power and thermal monitoring systems enable the best power/performance tradeoffs in datacenters across the world.