Blackhole & TT-Metalium: The Standalone AI Computer and its Programming Model
August 2024
Jasmina Vasiljevic, Senior Fellow
Davor Capalija, Senior Fellow

Agenda: Architecture, Micro-architecture, Scale-out, Software

AI Silicon Roadmap
- Grayskull (2021, Gen 1), Networked AI Processor: 120 Tensix cores, 12 nm, 276 TOPS (FP8), 100 GB/s LPDDR4, PCIe Gen4 x16
- Wormhole (2022, Gen 1): 80 Tensix+ cores, 12 nm, 328 TOPS (FP8), 336 GB/s GDDR6, PCIe Gen4 x16, 16x 100 Gbps Ethernet
- Blackhole (2023, Gen 2), Standalone AI Computer: 140 Tensix+ cores, 6 nm, 745 TOPS (FP8), 512 GB/s GDDR6, PCIe Gen5 x16, 10x 400 Gbps Ethernet, 16 RISC-V CPU cores
The roadmap progresses from a high-performance AI ASIC, to scalability, to heterogeneity.

Blackhole - A Standalone AI Computer
The die combines Tensix cores, DRAM cores, Ethernet cores, a PCIe core, an ARC core, and RISC-V CPU clusters.

Feature | Spec
Tensix | 745 TFLOPs (8-bit), 372 TFLOPs (16-bit)
SRAM | 241 MB
Ethernet | 10x 400 Gbps
DRAM | 512 GB/s BW, 32 GB capacity
Baby RISC-Vs | 752
Big RISC-Vs | 16
PCIe | Gen5 x16, 64 GB/s
NoC | 2 NoCs, 2D torus, 256 B per core

Big RISC-V & Baby RISC-V

Big RISC-V
Feature | Spec
RISC-V CPUs | x16 (4 clusters of 4)
Compute | 64-bit, dual-issue, in-order
L3 cache | 2 MB/CPU
L2 cache | 128 KB/CPU
L1 I-cache | 32 KB/CPU (2-way associative)
L1 D-cache | 32 KB/CPU (4-way associative)
The Big RISC-Vs run Linux and act as the on-device host for the AI accelerator.

Baby RISC-V
Feature | Spec
Total Baby RISC-Vs | 752
Compute | 32-bit; integer multiplier/divider; floating point (FP32/BFLOAT16); 128-bit vector (1 per Tensix)
I-cache | 4 KB
D-scratch | 8 KB

In total: 752 Baby RISC-Vs and 16 Big RISC-Vs.
Micro-Architecture: All RISC-V Programmable
Baby RISC-Vs run user kernels on every engine of the chip:
- Compute: tile math engine and vector math engine
- Data movement: NoC routers
- Storage: DRAM bank controllers and Ethernet controllers

All RISC-V Programmable - Within the Tensix Core
Each Tensix core contains 5 baby RISC-Vs (32-bit RISC-V ISA), two NoC routers, the compute engines, and L1 memory. Three user C++ kernels program a single Tensix core: 1 compute kernel and 2 data movement kernels.
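To make the three-kernel model concrete, the sketch below shows a host program binding two data movement kernels and one compute kernel to a single Tensix core. It follows the style of the open-source TT-Metalium (tt-metal) host API, but the header path, kernel source paths, runtime-argument values, and exact signatures here are assumptions and may differ from a given release.

```cpp
// Host-side sketch: bind 2 data movement kernels + 1 compute kernel to one Tensix core.
// Names follow the open-source tt-metal host API; paths and details are assumptions.
#include "tt_metal/host_api.hpp"

using namespace tt::tt_metal;

int main() {
    Device *device = CreateDevice(/*device_id=*/0);
    CommandQueue &cq = device->command_queue();
    Program program = CreateProgram();

    CoreCoord core = {0, 0};  // one Tensix core, addressed by logical (x, y)

    // Two data movement kernels, one per NoC RISC-V.
    KernelHandle reader = CreateKernel(
        program, "kernels/dataflow/reader.cpp", core,  // hypothetical kernel source path
        DataMovementConfig{.processor = DataMovementProcessor::RISCV_1, .noc = NOC::RISCV_1_default});
    KernelHandle writer = CreateKernel(
        program, "kernels/dataflow/writer.cpp", core,  // hypothetical kernel source path
        DataMovementConfig{.processor = DataMovementProcessor::RISCV_0, .noc = NOC::RISCV_0_default});

    // One compute kernel; the toolchain expands it across the compute RISC-Vs.
    KernelHandle compute = CreateKernel(
        program, "kernels/compute/eltwise.cpp", core, ComputeConfig{});

    // Runtime arguments for the reader (placeholder values).
    SetRuntimeArgs(program, reader, core, {/*src_addr=*/0, /*src_noc_x=*/0, /*src_noc_y=*/0, /*num_tiles=*/8});

    EnqueueProgram(cq, program, /*blocking=*/true);
    CloseDevice(device);
    (void)writer; (void)compute;
    return 0;
}
```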
Tensix Core - Data Movement
The two data movement kernels issue instructions to the NoC instruction dispatch units for NoC 0 and NoC 1. They provide:
- Asynchronous reads & writes
- Access to all SRAM & DRAM banks
- Memory barriers
- Atomic semaphores
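As an illustration of these primitives, here is a sketch of a reader-style data movement kernel that streams tiles from a remote bank into local L1 using asynchronous NoC reads, a read barrier, and circular-buffer flow control. The calls mirror the open-source TT-Metalium kernel API, but the runtime-argument layout and constants used here are assumptions, not a verified kernel.

```cpp
// Data movement (reader) kernel: runs on one of the two NoC RISC-Vs in a Tensix core.
#include <cstdint>
#include "dataflow_api.h"

void kernel_main() {
    // Runtime arguments supplied by the host (layout is an assumption).
    uint32_t src_addr  = get_arg_val<uint32_t>(0);  // base address at the source
    uint32_t src_noc_x = get_arg_val<uint32_t>(1);  // NoC X of the source core/bank
    uint32_t src_noc_y = get_arg_val<uint32_t>(2);  // NoC Y of the source core/bank
    uint32_t num_tiles = get_arg_val<uint32_t>(3);

    constexpr uint32_t cb_id     = 0;      // input circular buffer
    constexpr uint32_t tile_size = 2048;   // bytes per 32x32 BFLOAT16 tile

    for (uint32_t i = 0; i < num_tiles; ++i) {
        // Reserve a CB slot, i.e. block until the compute kernel has freed space.
        cb_reserve_back(cb_id, 1);
        uint32_t l1_write_addr = get_write_ptr(cb_id);

        // Asynchronous NoC read from the remote address into local L1.
        uint64_t src_noc_addr = get_noc_addr(src_noc_x, src_noc_y, src_addr + i * tile_size);
        noc_async_read(src_noc_addr, l1_write_addr, tile_size);

        // Memory barrier: wait until the read has landed before publishing the tile.
        noc_async_read_barrier();
        cb_push_back(cb_id, 1);
    }
}
```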
Tensix Core - Data Movement: NoC
Feature | Spec
Independent NoCs | 2
NoC type | 2-dimensional torus
NoC link width | 64 bytes
NoC link BW | 83 GB/s
Tensix-NoC I/O BW | 665 GB/s
SRAM-NoC BW | 333 GB/s
SRAM-NoC aggregate BW | 47 TB/s

Blackhole: Built for AI Data Movement Patterns
Memory & I/O | Data movement pattern | Bandwidth
SRAM | Local / sharded | 94 TB/s
SRAM | Neighbor (halo) | 47 TB/s
SRAM | Row/column/mesh multicast | 24 TB/s
SRAM | Gather/scatter (3 hops) | 16 TB/s
SRAM | Gather/scatter (10 hops) | 5 TB/s
DRAM | Row | 512 GB/s
Ethernet | Column | 1 TB/s

Data access patterns in matmuls, convolutions, and sharded data layouts are regular, so they map naturally onto the mesh architecture.
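The "local / sharded" pattern is where most of that bandwidth comes from: tensors are split into shards that live in the L1 of the cores that compute on them, so most accesses stay local. The snippet below is a purely illustrative model of a height-sharded layout; the grid size, shard rule, and names are hypothetical, not a TT-Metalium API.

```cpp
#include <cstdio>

// Illustrative model of a height-sharded layout: rows of a tensor are split
// evenly across a grid of cores so each core keeps its shard in local SRAM.
struct GridCoord { int x, y; };

constexpr int GRID_X = 8;                        // cores per grid row (hypothetical)
constexpr int GRID_Y = 8;                        // grid rows           (hypothetical)
constexpr int NUM_CORES = GRID_X * GRID_Y;

// Which core owns a given tensor row, for a tensor of `tensor_rows` rows
// height-sharded across NUM_CORES cores in row-major core order.
GridCoord core_for_row(int row, int tensor_rows) {
    int rows_per_shard = (tensor_rows + NUM_CORES - 1) / NUM_CORES;  // ceiling division
    int core_id = row / rows_per_shard;
    return GridCoord{core_id % GRID_X, core_id / GRID_X};
}

int main() {
    int tensor_rows = 2048;
    const int sample_rows[] = {0, 1023, 2047};
    for (int row : sample_rows) {
        GridCoord c = core_for_row(row, tensor_rows);
        std::printf("row %4d -> core (%d, %d)\n", row, c.x, c.y);
    }
    return 0;
}
```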
Tensix Core - Compute
Each Tensix core pairs a tile math engine with a vector math engine, both fed from L1 memory.

Tile math engine: a powerful tile-based matrix engine.
Engine | Data format | Accumulator | TFLOPs
Matrix | Block FP2 | FP32 | 745
Matrix | Block FP4 | FP32 | 745
Matrix | Block FP8 | FP32 | 745
Matrix | FP8 | FP32 | 745
Matrix | BFLOAT16 | FP32 | 373
Matrix | TF32 | FP32 | 186
Matrix | INT8 | INT32 | 186
Rich matrix ISA operating on 32x32 tiles: matmul, dot product, elementwise, transpose.

Vector math engine: general-purpose vector ISA: elementwise, sort, re-shuffle, LUT.
Engine | Data format | Accumulator | TFLOPs
Vector | FP32 | FP32 | 12
Vector | INT16 | INT32 | 6
Vector | INT32 | INT32 | 6

Library of Low-Level APIs
- Open-source library of low-level kernels (LLKs)
- 1 API per math function; 100s of tile & vector math LLKs
- 1 user compute kernel, automatically compiled to 3 RISC-V threads
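As a reference for what one tile-math operation computes, the following plain C++ models a single 32x32 tile matmul-accumulate with an FP32 accumulator. It is a behavioral model of the math only, not Tensix code; the tile shape and accumulator width are taken from the slides.

```cpp
#include <array>
#include <cstdio>

// Behavioral model of one tile-math op: C += A * B on 32x32 tiles,
// accumulating in FP32 as the matrix engine does.
constexpr int TILE = 32;
using Tile = std::array<float, TILE * TILE>;  // row-major 32x32 tile

void matmul_tile_acc(const Tile &a, const Tile &b, Tile &c) {
    for (int i = 0; i < TILE; ++i)
        for (int k = 0; k < TILE; ++k) {
            float aik = a[i * TILE + k];
            for (int j = 0; j < TILE; ++j)
                c[i * TILE + j] += aik * b[k * TILE + j];
        }
}

int main() {
    Tile a{}, b{}, c{};
    a.fill(1.0f);
    b.fill(2.0f);
    matmul_tile_acc(a, b, c);              // every element of C becomes 32 * 1 * 2 = 64
    std::printf("c[0] = %g\n", c[0]);
    return 0;
}
```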
16、ction100s of tile&vector math LLKsuser computekernelCompute InstructionDispatch1 user compute kernelAutomatically compiled to 3 RISC-V threadsKernel Synchronization19ComputeRISC-V2RISC-V3RISC-V4RISC-V5RISC-V1Router1Router0L1 MemoryCBsCBsNoC 0NoC 1datamovementkerneldatamovementkernelcomputekernelCirc
17、ular Buffer(CB)SRAM memory object with hardware-enabled flow controlAll RISC-V ProgrammableWithin the DRAM CoresWithin the DRAM Cores20RISC-V1Router0L1 MemoryRISC-V1Router0L1 MemoryRISC-V1Router0L1 MemoryxbarDRAMBank controlleroff-chip DRAMKernels for asynchronous pre-load/spill to DRAMAll RISC-V Pr
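The slides describe the CB's behavior but not its code, so here is a small self-contained C++ model of that flow control: the producer (a data movement kernel) blocks when the buffer is full and the consumer (the compute kernel) blocks when it is empty. The class and threads are purely illustrative; on silicon this back-pressure is provided by hardware, not mutexes.

```cpp
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>
#include <thread>

// Software model of a circular buffer with flow control: a bounded queue
// where push blocks when full and pop blocks when empty.
class CircularBufferModel {
public:
    explicit CircularBufferModel(size_t capacity) : capacity_(capacity) {}

    void push(int tile) {                       // producer side (data movement kernel)
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return q_.size() < capacity_; });
        q_.push_back(tile);
        not_empty_.notify_one();
    }

    int pop() {                                 // consumer side (compute kernel)
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !q_.empty(); });
        int tile = q_.front();
        q_.pop_front();
        not_full_.notify_one();
        return tile;
    }

private:
    size_t capacity_;
    std::deque<int> q_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
};

int main() {
    CircularBufferModel cb(4);                  // 4-slot CB

    std::thread reader([&] {                    // stands in for a data movement kernel
        for (int t = 0; t < 16; ++t) cb.push(t);
    });
    std::thread compute([&] {                   // stands in for the compute kernel
        for (int t = 0; t < 16; ++t) std::printf("consumed tile %d\n", cb.pop());
    });

    reader.join();
    compute.join();
    return 0;
}
```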
All RISC-V Programmable - Within the DRAM Cores
Each DRAM core combines baby RISC-Vs, NoC routers, and L1 memory with a crossbar and DRAM bank controller connected to off-chip DRAM. Kernels run here for asynchronous pre-load/spill to DRAM.

All RISC-V Programmable - Within the Ethernet Core
Each Ethernet core combines baby RISC-Vs, NoC routers, and L1 memory with an Ethernet controller connected to off-chip Ethernet.

Scale-out

Blackhole: Ethernet-Based Scale-out
- 1 TB/s of Blackhole Ethernet
- Can be connected into any topology
- Mesh topology is great for AI: locality and regularity of data movement, sharded data
- 200 GB/s in each of the N/S/W/E/Z directions
- 2D/3D torus
Blackhole Galaxy: 32 Chips in a 4x8 Mesh
- Z dim I/O: 32 x 200 GB/s
- Y dim I/O: 16 x 200 GB/s
- X dim I/O: 8 x 200 GB/s
- 11.2 TB/s Galaxy I/O

Blackhole Galaxy: the Scale-out "Lego Building Block"
- 11.2 TB/s I/O: the "AI switch"
- 24 PFLOPs: the "AI compute"
- 16 TB/s BW, 1 TB capacity: the "AI memory"
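As a quick check, the 11.2 TB/s Galaxy I/O figure follows from the per-dimension link counts above: (32 + 16 + 8) x 200 GB/s = 56 x 200 GB/s = 11.2 TB/s.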
Software

TT-Metalium: Built for AI and Scale-out - Native Multi-Device Kernels & OPs

TT-Metalium:
- Kernels are plain C++ with APIs
- Dedicated data movement & compute kernels
- Optimize data movement and compute overlap directly
- Any core can read/write/sync to any core or chip directly
- Full control of data layout and persistence in SRAM and DRAM
- Different cores can run different kernels and flow data directly between them
- Native multi-device kernels
- Fused and overlapped compute and inter-chip communication within the kernels
- Layers: TT-Metalium C++ Host API, TT-Metalium C++ Kernel API, TT-NN C++ Host API

vs. GPU programming:
- A GPU kernel language with separate DNN and CCL libraries
- Deep Learning OPs and Collective Comms OPs live in separate layers

Tenstorrent Open Source Software
Any AI model, open source, optimized ML results, custom OPs, build anything.
- TT-Forge: compiler integrated into various frameworks for native model ingest
- TT-MLIR: new MLIR-based compiler
- TT-NN: a library of optimized operators, with ATen coverage and a PyTorch-like API
- TT-Metalium: the low-level programming model and entry point

Software Ecosystem & Integrations
- Model sources: PyTorch, Jax, ONNX, TF, more models, 3rd-party training compilers
- Ingest paths: the TT-Forge and TT-MLIR compilers, the PyTorch 2.0 backend, and vLLM for LLM inference models
- Manually optimized models target TT-NN directly
- Everything lowers to TT-Metalium and TT-LLK (low-level kernels). https:/

Thank you