AI Hardware & Systems | aiandsystems

Beyond the Edge: Revolutionizing AI Workloads with Axelera AI's Digital In-Memory Computing and RISC-V Technology

About us

Co-founded in July 2021 by Fabrizio Del Maffeo and Evangelos Eleftheriou, with 16 founding team members from IBM, ETH Zurich, IMEC, Bitfury AI, Google and Qualcomm. Our team has grown to 180+ people, including 60+ PhDs, and is present in 16 countries.

Our first product, Metis, is the most powerful AI processing unit for computer vision, with the best performance/price and efficiency on the market, and in 2025 we will expand our product line to generative AI applications.

We have been delivering to customers since September 2023. We now have 25+ customers and are moving into mass production in Q4 2024. We have raised USD 120M from leading deep-tech investors, institutions and European sovereign funds.

[Figure: logo panels "Our investors include" and "Our team come from"]

Opportunity

Computer vision at the edge is generating real value across a range of industries today.

- Retail: customer flow analysis, inventory management, cashier-less checkouts
- Agriculture: crop health monitoring, automated pest control, agricultural robotics
- Industrial: quality control automation, worker safety monitoring, automated material handling
- Security: traffic control systems, intelligent surveillance, access control systems
- Healthcare: remote patient monitoring, real-time diagnostics tools, surgical tools and equipment
- Automotive: driver assistance systems, autonomous driving systems, pedestrian safety systems

Opportunity

But our technology can scale up to address the trillion-dollar industries of tomorrow and work across different computing environments.

- Enterprise server: general purpose systems used to run business applications and services. E.g. real-time analysis of medical imaging data, like X-rays, MRIs and CT scans.
- HPC: designed for complex, computation-intensive tasks, optimized for parallel processing. E.g. simulations, scientific research, weather modeling, large-scale data analysis.
- Datacenter: scalable and flexible resources designed to host and manage vast amounts of data and apps. E.g. inference on image and text generation models (e.g. ChatGPT).

Solution

- AI Processing Unit (AIPU): the most powerful and efficient AI accelerator, with our Digital In-Memory Computing technology and RISC-V ISA.
- AI Accelerators, Cards & Systems: edge-native hardware powered by an Axelera AIPU to enable instant field installation and faster time-to-market.
- Voyager SDK: integrated AI software stack designed to simplify application development, optimization and deployment.

[Figure labels: Modules, Cards, Boards, Systems + Metis, First Gen (2024)]

Metis AIPU

- Host: interacts with the outside world and runs the "application".
- AI Core, digital in-memory computing (D-IMC): we merge the memory and compute elements to reduce data movement and physical size, and to increase throughput for matrix multiply operations.

[Figure: Metis AIPU block diagram. Labels: RISC-V System Controller, PCIe 3.0 (x4), Security, L2, LPDDR4x, AI Core (x4), L1, RISC-V, D-IMC, DPU, DWPU, DECOMP]

SRAM-based D-IMC

Commentary
- Interleaved weight-storage and compute units in an extremely dense fashion
- INT8 activations/weights, with INT32 accumulation to maintain full precision
- Immune to the noise and memory non-idealities that affect analog IMC precision
- Technology commensurate with CMOS scaling to low lithography nodes

[Figure label: 4 weight sets]

SRAM-based D-IMC

Commentary
- Stores multiple weight sets in computational memory
  - Enhances IMC storage density
  - Allows accumulation up to 16k inputs
  - Enables simultaneous processing and weight reloading
- Activity gating and clock gating
  - Maintains high energy efficiency at low utilization
- Ensures full-precision accumulation
  - Negligible accuracy loss compared to FP32
  - Use of post-training quantization; no need for retraining

[Figure label: Proprietary SRAM storage cells]

AIPU SoC Architecture

Commentary
- AI Core: self-sufficient compute engine for concurrent network execution
- RISC-V system controller: boots the chip, interfaces with peripherals, manages the AI cores with a real-time OS
- Security module: secure boot and weight/data encryption
- 32 MiByte L2 SRAM
  - 52 MiByte on-chip memory in total
- Interconnected through a Network-on-Chip (NoC)
  - 1 Tbit/s bandwidth to shared memories
  - Ensures AI cores will not stall

[Figure labels: NoC, 4 AI cores, 32 MiByte L2]

AI Core | Key Components

Commentary
- Matrix-Vector Multiplier: D-IMC based
- Data Processing Unit
  - Element-wise vector operations
  - Apply activation functions
- Depth-Wise Processing Unit
  - Depth-wise convolution
  - Pooling and up-sampling
- Weight Decompression Unit
- 4 MiByte L1 SRAM
- RISC-V control core

AI Core | Operational Model

Commentary
- Dataflow engine: RISC-V controlled
- Dual high-throughput streaming data paths
  - One for MVM
  - One for DWPU
  - Can operate fully in parallel
- Background weight loading
  - Write weights for the next operation in parallel with the current operation
  - Enabled by multiple weight sets
- On-the-fly weight decompression

Clock & Activity Gating

Commentary
- Bank gating
  - If all 64 outputs of an IMC bank pair are unused, the entire bank is clock-gated
- Block gating
  - If a block is not used, the block is clock-gated and its inputs are silenced
- Energy efficiency stays high, even at low utilization

[Figure label: At 0.68V]

Throughput & Energy Efficiency

- 15 TOPS/W for random uniform activations and weights (no sparsity)
- 82 TOPS/W under high sparsity conditions at reduced throughput
- Peak throughput: 57.3 TOPS at 0.7V, 875MHz

[Figure: chart over weight sparsity (0%, 50%, 90%) for 0% and 50% input sparsity, annotated with 15 TOPS/W and 82 TOPS/W]
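The SRAM-based D-IMC slides pair INT8 activations/weights with INT32 accumulation over up to 16k inputs, and mention post-training quantization. The sketch below checks that combination in plain Python. It is only an illustration under stated assumptions: "16k" is read as 16,384, the symmetric per-tensor quantization scheme is a common textbook choice and not something the slides describe, and every function name is hypothetical.

```python
import random

INT8_MIN, INT8_MAX = -128, 127
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
MAX_INPUTS = 16 * 1024  # assumption: "16k inputs" read as 16,384


def worst_case_accumulator(n_inputs):
    """Largest magnitude a sum of n_inputs INT8 x INT8 products can reach."""
    return n_inputs * INT8_MIN * INT8_MIN


def quantize(values, scale):
    """Symmetric quantization of floats to INT8 (assumed scheme, not from the slides)."""
    return [max(INT8_MIN, min(INT8_MAX, round(v / scale))) for v in values]


def int8_mvm(weights_q, activations_q):
    """Matrix-vector multiply on INT8 operands with an exact integer accumulator."""
    outputs = []
    for row in weights_q:
        acc = sum(w * a for w, a in zip(row, activations_q))
        if not INT32_MIN <= acc <= INT32_MAX:
            raise OverflowError("accumulator would not fit in INT32")
        outputs.append(acc)
    return outputs


def max_abs_error(n_inputs=256, n_outputs=8, seed=0):
    """Compare a float MVM against the quantized MVM on random data in [-1, 1]."""
    rng = random.Random(seed)
    scale = 1.0 / INT8_MAX  # one shared scale for weights and activations
    acts = [rng.uniform(-1, 1) for _ in range(n_inputs)]
    weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_outputs)]
    reference = [sum(w * a for w, a in zip(row, acts)) for row in weights]
    acts_q = quantize(acts, scale)
    weights_q = [quantize(row, scale) for row in weights]
    # Dequantize: each integer product carries a factor of scale * scale.
    dequantized = [acc * scale * scale for acc in int8_mvm(weights_q, acts_q)]
    return max(abs(r - d) for r, d in zip(reference, dequantized))


if __name__ == "__main__":
    print("worst-case |sum| for 16k inputs:", worst_case_accumulator(MAX_INPUTS))
    print("fits in INT32:", worst_case_accumulator(MAX_INPUTS) <= INT32_MAX)
    print("max abs error vs float reference:", max_abs_error())
```

The worst case, 16,384 products of (-128) x (-128), sums to 2^28, well inside the INT32 range, so the integer accumulation itself never rounds or overflows. Any accuracy loss in this sketch comes from quantizing the operands, not from the accumulation.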
Metis AIPU Spec

Peak performance       210 TOPs INT8 (0.8 GHz)
# of AI cores          4 x AIPU (Int4), Int8; 16MB L1 SRAM
Internal memory        32MB L2 SRAM, 200GB/s aggregate BW
IMC efficiency         15 TOPs/W INT8
External memory        LPDDR4x, 34GB/s
Communication bus      PCIE 4x Gen3
Security module        Silex Security IP
Video decoder          -
Pre/post-processing    -
System controller      RISC-V
Node geometry          TSMC N12

[Figure labels: Metis, First Gen (2024); Tools Suite]

Performance

[Figure: Edge AI inference performance, TOPS* against board cost. Devices shown: Nvidia Jetson TX2, Intel Myriad X, Google Edge TPU, Kinara Ara-1, Hailo Hailo-8, Nvidia Jetson AGX Xavier, SiMa MLSoC, Nvidia Jetson NX Orin 16, Nvidia Jetson AGX Orin 32, Axelera AI Metis (2024), Nvidia L4. TOPS labels in extraction order: 1, 2, 4, 9, 26, 32, 50, 100, 200, 214, 248. Board cost labels: $1,200, $249, $249, $2,000.]

*Tera Operations Per Second (TOPS) is a measure of computational performance: the number of trillion operations (such as additions or multiplications) that a processing unit can perform in one second. The TOPS presented here are those reported in official datasheets. Nvidia performance is reported as Sparse TOPS (2x Dense TOPS).

Performance

Metis AIPU performance: Benchmarks

Network          Resolution   Metis AIPU FPS   Energy efficiency   Accuracy INT8 (deviation from FP32)
ResNet-50        224x224      3155 fps         394 fps/W           80.69%* (-0.16)
MobileNet-SSD1   300x300      5395 fps         771 fps/W           mAP 25.52+ (-0.21)
YoLoV5m          640x640      369 fps          46 fps/W            mAP 44.04+ (-1.09)

*Measured on ImageNet-1000 validation
+Measured on COCO detection validation

[Figure label: YOLOV5s on Metis]

Scaling Up | Integrating RISC-V into the datapath

Commentary
- CVA6: open-source 64-bit RISC-V core with support for an application-class profile
  - Single-issue, 6-stage, in-order CPU
- Vector Processing Unit (VPU): our proprietary vector extension
  - Builds on the existing RISC-V ISA
  - Vector ISA is agnostic to vector size
  - Enables general compute kernels
  - Flexible and future-proof

Scaling Up | Design considerations

Commentary
Generation Two: supports multiple high-speed video streams and medium-sized LLMs, e.g., LLAMA-2/3 7/8B or LLAMA-2 13B, to be deployed on edge servers.
- Massively parallel compute
- Energy-efficient in-memory computing
- Large on-chip SRAM
- Massive on-chip bandwidth
- High-capacity external memory
- High-speed external bandwidth
- Support for multi-device pipeline parallelism

[Figure: Generation Two block diagram. Labels: AI Core, Application Processing Unit (APU), PCIe, L2 Cache Memory, Security, DDR, Video Codec, RISC-V Vector Cluster 1, RISC-V Vector Cluster 2, AI-CORE (x4) ... AI-CORE]

Scaling Up | Performance

AIPU evolution: From Metis to Generation Two

Scaling initiatives
Our second-generation chip can achieve up to 4x improvement in performance over Metis based on:
1. Moving to a smaller node geometry
2. Improving the design of the AI core
3. Improving the memory hierarchy
4. Doubling the number of AI cores

Network           Speedup vs Axelera Metis   Note
MobileNetv3       2.9x
ResNet-50         3.9x
SSD-MobileNetV1   2.7x
SSD-ResNet34      3.7x
YoloV5s           3.8x
YoloV5m           3.8x
YoloV8s           5.8x
PHI3              6.3x                       500 output tokens/s, small batch
LLAMA3-8B         6.2x                       3000 output tokens/s, large batch
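The Metis benchmark figures and the Generation Two speedups can be combined with simple arithmetic. The sketch below derives an implied power draw (FPS divided by FPS/W) and an energy per frame for each benchmark, then multiplies Metis FPS by the quoted speedup for the two networks that appear under the same name in both tables. Treat it as a back-of-the-envelope aid, not a published result: it assumes FPS and FPS/W were measured at the same operating point and that the speedups apply to the same benchmark configuration. The slides state neither, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    network: str
    fps: float           # frames per second on Metis, from the benchmark table
    fps_per_watt: float  # energy efficiency, from the benchmark table

    @property
    def implied_power_w(self):
        """Power implied by the two columns: (frames/s) / (frames/s/W) = W."""
        return self.fps / self.fps_per_watt

    @property
    def energy_per_frame_mj(self):
        """Energy per frame in millijoules: 1 / (frames/J), scaled to mJ."""
        return 1000.0 / self.fps_per_watt


# Values copied from the "Metis AIPU performance: Benchmarks" table.
METIS_BENCHMARKS = [
    Benchmark("ResNet-50", 3155, 394),
    Benchmark("MobileNet-SSD1", 5395, 771),
    Benchmark("YoLoV5m", 369, 46),
]

# Speedups copied from the "Speedup vs Axelera Metis" table, keyed by the
# benchmark-table spelling. Only networks listed in both tables are included.
GEN2_SPEEDUP = {"ResNet-50": 3.9, "YoLoV5m": 3.8}


def projected_gen2_fps(benchmark):
    """Naive projection: Metis FPS times quoted speedup, or None if not listed."""
    speedup = GEN2_SPEEDUP.get(benchmark.network)
    return None if speedup is None else benchmark.fps * speedup


if __name__ == "__main__":
    for b in METIS_BENCHMARKS:
        projected = projected_gen2_fps(b)
        suffix = "" if projected is None else f", projected Gen 2: {projected:.0f} fps"
        print(f"{b.network}: {b.implied_power_w:.2f} W implied, "
              f"{b.energy_per_frame_mj:.2f} mJ/frame{suffix}")
```

Under these assumptions, all three benchmarks imply a power draw of roughly 7 to 8 W.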
The products we sell

M.2 AI Acceleration Cards
- Form factor: M.2 2280 M-key
- AIPU: 1x Metis AIPU
- Peak performance: 100 TOPS
- RAM: 2GB of LPDDR4x
- Connection: PCIe 3.0 x4
- Power spec: Max 15W (typical 7W)

PCIe AI Acceleration Card
- Form factor: PCIe CEM (half or full)
- AIPU: 1-4x Metis AIPUs*
- Peak performance: 214-856 TOPS
- RAM: 4GB of LPDDR4x
- Connection: PCIe 3.0 x4-16*
- Power spec: Max 50-200W*

All-in-one AI Systems
- Peak performance: 214 TOPS
- Host device types: x86 and ARM
- Systems: Various

Interested in learning more?

How to reach us
1. Meet us here at AIHW Summit
2. Reach out by email directly: evangelos.eleftheriou@axelera.ai
3. Visit our website at axelera.ai and contact our sales team for more information on our products