Blackhole & TT-Metalium: The Standalone AI Computer and its Programming Model
August 2024
Jasmina Vasiljevic, Senior Fellow
Davor Capalija, Senior Fellow

Agenda: Architecture, Micro-architecture, Scale-out, Software

AI Silicon Roadmap
- Grayskull (2021, Gen 1), Networked AI Processor: 120 Tensix cores, 12 nm, 276 TOPS (FP8), 100 GB/s LPDDR4, PCIe Gen4 x16
- Wormhole (2022, Gen 1): 80 Tensix+ cores, 12 nm, 328 TOPS (FP8), 336 GB/s GDDR6, PCIe Gen4 x16, 16x 100 Gbps Ethernet
- Blackhole (2023, Gen 2), Standalone AI Computer: 140 Tensix+ cores, 6 nm, 745 TOPS (FP8), 512 GB/s GDDR6, PCIe Gen5 x16, 10x 400 Gbps Ethernet, 16 RISC-V CPU cores
The roadmap progresses from a high-performance AI ASIC, to scalability, to heterogeneity.

Blackhole - A Standalone AI Computer
The die combines Tensix cores, DRAM cores, Ethernet cores, a PCIe core, an ARC core, and RISC-V CPU clusters.

Feature | Spec
Tensix | 745 TFLOPs (8-bit), 372 TFLOPs (16-bit)
SRAM | 241 MB
Ethernet | 10x 400 Gbps
DRAM | 512 GB/s BW, 32 GB capacity
Baby RISC-Vs | 752
Big RISC-Vs | 16
PCIe | Gen5 x16, 64 GB/s
NoC | 2 NoCs, 2D torus, 256 B per core

Big RISC-V & Baby RISC-V

Big RISC-V
Feature | Spec
RISC-V CPUs | x16 (4 clusters of 4)
Compute | 64-bit, dual-issue, in-order
L3 cache | 2 MB/CPU
L2 cache | 128 KB/CPU
L1 I-cache | 32 KB/CPU (2-way associative)
L1 D-cache | 32 KB/CPU (4-way associative)
The Big RISC-Vs run Linux and act as the on-device host for the AI accelerator.

Baby RISC-V
Feature | Spec
Total Baby RISC-Vs | 752
Compute | 32-bit; integer multiplier/divider; floating point (FP32/BFLOAT16); 128-bit vector (1 per Tensix)
I-cache | 4 KB
D-scratch | 8 KB

In total: 752 Baby RISC-Vs and 16 Big RISC-Vs.
Micro-Architecture: All RISC-V Programmable
Baby RISC-Vs run user kernels on every engine of the chip:
- Compute: tile math engine and vector math engine
- Data movement: NoC routers
- Storage: DRAM bank controllers and Ethernet controllers

All RISC-V Programmable - Within the Tensix Core
Each Tensix core contains 5 baby RISC-Vs (32-bit RISC-V ISA), two NoC routers, the compute engines, and L1 memory. Three user C++ kernels program a single Tensix core: 1 compute kernel and 2 data movement kernels.
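To make the three-kernel model concrete, the sketch below shows a host program binding two data movement kernels and one compute kernel to a single Tensix core. It follows the style of the open-source TT-Metalium (tt-metal) host API, but the header path, kernel source paths, runtime-argument values, and exact signatures here are assumptions and may differ from a given release.

```cpp
// Host-side sketch: bind 2 data movement kernels + 1 compute kernel to one Tensix core.
// Names follow the open-source tt-metal host API; paths and details are assumptions.
#include "tt_metal/host_api.hpp"

using namespace tt::tt_metal;

int main() {
    Device *device = CreateDevice(/*device_id=*/0);
    CommandQueue &cq = device->command_queue();
    Program program = CreateProgram();

    CoreCoord core = {0, 0};  // one Tensix core, addressed by logical (x, y)

    // Two data movement kernels, one per NoC RISC-V.
    KernelHandle reader = CreateKernel(
        program, "kernels/dataflow/reader.cpp", core,  // hypothetical kernel source path
        DataMovementConfig{.processor = DataMovementProcessor::RISCV_1, .noc = NOC::RISCV_1_default});
    KernelHandle writer = CreateKernel(
        program, "kernels/dataflow/writer.cpp", core,  // hypothetical kernel source path
        DataMovementConfig{.processor = DataMovementProcessor::RISCV_0, .noc = NOC::RISCV_0_default});

    // One compute kernel; the toolchain expands it across the compute RISC-Vs.
    KernelHandle compute = CreateKernel(
        program, "kernels/compute/eltwise.cpp", core, ComputeConfig{});

    // Runtime arguments for the reader (placeholder values).
    SetRuntimeArgs(program, reader, core, {/*src_addr=*/0, /*src_noc_x=*/0, /*src_noc_y=*/0, /*num_tiles=*/8});

    EnqueueProgram(cq, program, /*blocking=*/true);
    CloseDevice(device);
    (void)writer; (void)compute;
    return 0;
}
```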
Tensix Core - Data Movement
The two data movement kernels issue instructions to the NoC instruction dispatch units for NoC 0 and NoC 1. They provide:
- Asynchronous reads & writes
- Access to all SRAM & DRAM banks
- Memory barriers
- Atomic semaphores
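As an illustration of these primitives, here is a sketch of a reader-style data movement kernel that streams tiles from a remote bank into local L1 using asynchronous NoC reads, a read barrier, and circular-buffer flow control. The calls mirror the open-source TT-Metalium kernel API, but the runtime-argument layout and constants used here are assumptions, not a verified kernel.

```cpp
// Data movement (reader) kernel: runs on one of the two NoC RISC-Vs in a Tensix core.
#include <cstdint>
#include "dataflow_api.h"

void kernel_main() {
    // Runtime arguments supplied by the host (layout is an assumption).
    uint32_t src_addr  = get_arg_val<uint32_t>(0);  // base address at the source
    uint32_t src_noc_x = get_arg_val<uint32_t>(1);  // NoC X of the source core/bank
    uint32_t src_noc_y = get_arg_val<uint32_t>(2);  // NoC Y of the source core/bank
    uint32_t num_tiles = get_arg_val<uint32_t>(3);

    constexpr uint32_t cb_id     = 0;      // input circular buffer
    constexpr uint32_t tile_size = 2048;   // bytes per 32x32 BFLOAT16 tile

    for (uint32_t i = 0; i < num_tiles; ++i) {
        // Reserve a CB slot, i.e. block until the compute kernel has freed space.
        cb_reserve_back(cb_id, 1);
        uint32_t l1_write_addr = get_write_ptr(cb_id);

        // Asynchronous NoC read from the remote address into local L1.
        uint64_t src_noc_addr = get_noc_addr(src_noc_x, src_noc_y, src_addr + i * tile_size);
        noc_async_read(src_noc_addr, l1_write_addr, tile_size);

        // Memory barrier: wait until the read has landed before publishing the tile.
        noc_async_read_barrier();
        cb_push_back(cb_id, 1);
    }
}
```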
Tensix Core - Data Movement: NoC
Feature | Spec
Independent NoCs | 2
NoC type | 2-dimensional torus
NoC link width | 64 bytes
NoC link BW | 83 GB/s
Tensix-NoC I/O BW | 665 GB/s
SRAM-NoC BW | 333 GB/s
SRAM-NoC aggregate BW | 47 TB/s

Blackhole: Built for AI Data Movement Patterns
Memory & I/O | Data movement pattern | Bandwidth
SRAM | Local / sharded | 94 TB/s
SRAM | Neighbor (halo) | 47 TB/s
SRAM | Row/column/mesh multicast | 24 TB/s
SRAM | Gather/scatter (3 hops) | 16 TB/s
SRAM | Gather/scatter (10 hops) | 5 TB/s
DRAM | Row | 512 GB/s
Ethernet | Column | 1 TB/s

Data access patterns in matmuls, convolutions, and sharded data layouts are regular, so they map naturally onto the mesh architecture.
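The "local / sharded" pattern is where most of that bandwidth comes from: tensors are split into shards that live in the L1 of the cores that compute on them, so most accesses stay local. The snippet below is a purely illustrative model of a height-sharded layout; the grid size, shard rule, and names are hypothetical, not a TT-Metalium API.

```cpp
#include <cstdio>

// Illustrative model of a height-sharded layout: rows of a tensor are split
// evenly across a grid of cores so each core keeps its shard in local SRAM.
struct GridCoord { int x, y; };

constexpr int GRID_X = 8;                        // cores per grid row (hypothetical)
constexpr int GRID_Y = 8;                        // grid rows           (hypothetical)
constexpr int NUM_CORES = GRID_X * GRID_Y;

// Which core owns a given tensor row, for a tensor of `tensor_rows` rows
// height-sharded across NUM_CORES cores in row-major core order.
GridCoord core_for_row(int row, int tensor_rows) {
    int rows_per_shard = (tensor_rows + NUM_CORES - 1) / NUM_CORES;  // ceiling division
    int core_id = row / rows_per_shard;
    return GridCoord{core_id % GRID_X, core_id / GRID_X};
}

int main() {
    int tensor_rows = 2048;
    const int sample_rows[] = {0, 1023, 2047};
    for (int row : sample_rows) {
        GridCoord c = core_for_row(row, tensor_rows);
        std::printf("row %4d -> core (%d, %d)\n", row, c.x, c.y);
    }
    return 0;
}
```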
Tensix Core - Compute
Each Tensix core pairs a tile math engine with a vector math engine, both fed from L1 memory.

Tile math engine: a powerful tile-based matrix engine.
Engine | Data format | Accumulator | TFLOPs
Matrix | Block FP2 | FP32 | 745
Matrix | Block FP4 | FP32 | 745
Matrix | Block FP8 | FP32 | 745
Matrix | FP8 | FP32 | 745
Matrix | BFLOAT16 | FP32 | 373
Matrix | TF32 | FP32 | 186
Matrix | INT8 | INT32 | 186
Rich matrix ISA operating on 32x32 tiles: matmul, dot product, elementwise, transpose.

Vector math engine: general-purpose vector ISA: elementwise, sort, re-shuffle, LUT.
Engine | Data format | Accumulator | TFLOPs
Vector | FP32 | FP32 | 12
Vector | INT16 | INT32 | 6
Vector | INT32 | INT32 | 6

Library of Low-Level APIs
- Open-source library of low-level kernels (LLKs)
- 1 API per math function; 100s of tile & vector math LLKs
- 1 user compute kernel, automatically compiled to 3 RISC-V threads
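As a reference for what one tile-math operation computes, the following plain C++ models a single 32x32 tile matmul-accumulate with an FP32 accumulator. It is a behavioral model of the math only, not Tensix code; the tile shape and accumulator width are taken from the slides.

```cpp
#include <array>
#include <cstdio>

// Behavioral model of one tile-math op: C += A * B on 32x32 tiles,
// accumulating in FP32 as the matrix engine does.
constexpr int TILE = 32;
using Tile = std::array<float, TILE * TILE>;  // row-major 32x32 tile

void matmul_tile_acc(const Tile &a, const Tile &b, Tile &c) {
    for (int i = 0; i < TILE; ++i)
        for (int k = 0; k < TILE; ++k) {
            float aik = a[i * TILE + k];
            for (int j = 0; j < TILE; ++j)
                c[i * TILE + j] += aik * b[k * TILE + j];
        }
}

int main() {
    Tile a{}, b{}, c{};
    a.fill(1.0f);
    b.fill(2.0f);
    matmul_tile_acc(a, b, c);              // every element of C becomes 32 * 1 * 2 = 64
    std::printf("c[0] = %g\n", c[0]);
    return 0;
}
```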
16、ction100s of tile&vector math LLKsuser computekernelCompute InstructionDispatch1 user compute kernelAutomatically compiled to 3 RISC-V threadsKernel Synchronization19ComputeRISC-V2RISC-V3RISC-V4RISC-V5RISC-V1Router1Router0L1 MemoryCBsCBsNoC 0NoC 1datamovementkerneldatamovementkernelcomputekernelCirc
17、ular Buffer(CB)SRAM memory object with hardware-enabled flow controlAll RISC-V ProgrammableWithin the DRAM CoresWithin the DRAM Cores20RISC-V1Router0L1 MemoryRISC-V1Router0L1 MemoryRISC-V1Router0L1 MemoryxbarDRAMBank controlleroff-chip DRAMKernels for asynchronous pre-load/spill to DRAMAll RISC-V Pr
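The slides describe the CB's behavior but not its code, so here is a small self-contained C++ model of that flow control: the producer (a data movement kernel) blocks when the buffer is full and the consumer (the compute kernel) blocks when it is empty. The class and threads are purely illustrative; on silicon this back-pressure is provided by hardware, not mutexes.

```cpp
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>
#include <thread>

// Software model of a circular buffer with flow control: a bounded queue
// where push blocks when full and pop blocks when empty.
class CircularBufferModel {
public:
    explicit CircularBufferModel(size_t capacity) : capacity_(capacity) {}

    void push(int tile) {                       // producer side (data movement kernel)
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return q_.size() < capacity_; });
        q_.push_back(tile);
        not_empty_.notify_one();
    }

    int pop() {                                 // consumer side (compute kernel)
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !q_.empty(); });
        int tile = q_.front();
        q_.pop_front();
        not_full_.notify_one();
        return tile;
    }

private:
    size_t capacity_;
    std::deque<int> q_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
};

int main() {
    CircularBufferModel cb(4);                  // 4-slot CB

    std::thread reader([&] {                    // stands in for a data movement kernel
        for (int t = 0; t < 16; ++t) cb.push(t);
    });
    std::thread compute([&] {                   // stands in for the compute kernel
        for (int t = 0; t < 16; ++t) std::printf("consumed tile %d\n", cb.pop());
    });

    reader.join();
    compute.join();
    return 0;
}
```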
All RISC-V Programmable - Within the DRAM Cores
Each DRAM core combines baby RISC-Vs, NoC routers, and L1 memory with a crossbar and DRAM bank controller connected to off-chip DRAM. Kernels run here for asynchronous pre-load/spill to DRAM.

All RISC-V Programmable - Within the Ethernet Core
Each Ethernet core combines baby RISC-Vs, NoC routers, and L1 memory with an Ethernet controller connected to off-chip Ethernet.

Scale-out

Blackhole: Ethernet-Based Scale-out
- 1 TB/s of Blackhole Ethernet
- Can be connected into any topology
- Mesh topology is great for AI: locality and regularity of data movement, sharded data
- 200 GB/s in each of the N/S/W/E/Z directions
- 2D/3D torus
Blackhole Galaxy: 32 Chips in a 4x8 Mesh
- Z dim I/O: 32 x 200 GB/s
- Y dim I/O: 16 x 200 GB/s
- X dim I/O: 8 x 200 GB/s
- 11.2 TB/s Galaxy I/O

Blackhole Galaxy: the Scale-out "Lego Building Block"
- 11.2 TB/s I/O: the "AI switch"
- 24 PFLOPs: the "AI compute"
- 16 TB/s BW, 1 TB capacity: the "AI memory"
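As a quick check, the 11.2 TB/s Galaxy I/O figure follows from the per-dimension link counts above: (32 + 16 + 8) x 200 GB/s = 56 x 200 GB/s = 11.2 TB/s.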
Software

TT-Metalium: Built for AI and Scale-out - Native Multi-Device Kernels & OPs

TT-Metalium:
- Kernels are plain C++ with APIs
- Dedicated data movement & compute kernels
- Optimize data movement and compute overlap directly
- Any core can read/write/sync to any core or chip directly
- Full control of data layout and persistence in SRAM and DRAM
- Different cores can run different kernels and flow data directly between them
- Native multi-device kernels
- Fused and overlapped compute and inter-chip communication within the kernels
- Layers: TT-Metalium C++ Host API, TT-Metalium C++ Kernel API, TT-NN C++ Host API

vs. GPU programming:
- A GPU kernel language with separate DNN and CCL libraries
- Deep Learning OPs and Collective Comms OPs live in separate layers

Tenstorrent Open Source Software
Any AI model, open source, optimized ML results, custom OPs, build anything.
- TT-Forge: compiler integrated into various frameworks for native model ingest
- TT-MLIR: new MLIR-based compiler
- TT-NN: a library of optimized operators, with ATen coverage and a PyTorch-like API
- TT-Metalium: the low-level programming model and entry point

Software Ecosystem & Integrations
- Model sources: PyTorch, Jax, ONNX, TF, more models, 3rd-party training compilers
- Ingest paths: the TT-Forge and TT-MLIR compilers, the PyTorch 2.0 backend, and vLLM for LLM inference models
- Manually optimized models target TT-NN directly
- Everything lowers to TT-Metalium and TT-LLK (low-level kernels). https:/

Thank you