2024 IEEE International Solid-State Circuits Conference

Domain-Specific Accelerators: From Hardware to System
Sophia Shao (ysshao@berkeley.edu)
ISSCC 2024 Tutorial 9
February 2, 2024

Growing Demand in Computing
[Figure: training compute (petaFLOP, log scale from 1,000 to 10 billion) versus publication date (axis ticks from Sep 30, 2012 to Jun 4, 2021) for AlexNet, VGG16, GoogLeNet/InceptionV1, ResNet-152 (ImageNet), Transformer, AlphaZero, YOLOv3, GPT, BERT-Large, GPT-2, AlphaFold, GPT-3 175B (davinci), DALL-E, PaLM (540B), LLaMA (65B), GPT-4, and PaLM 2. Legend, task domain: Drawing, Driving, Games, Language, Multimodal, Other, Speech, Vision. Overlay: Moore's Law scaling.]
Source: Epoch (2023), OurWorldInData.org/artificial-intelligence, CC BY

Slowing Supply in Computing
- Diminishing efficiency returns: achieving efficiency gains becomes more challenging as technology advances to smaller nodes.
- Diminishing cost benefits: the traditional cost benefits of packing more transistors into a chip are also diminishing at advanced nodes.
[Figure labeled: Marvell]
Slowing Supply in Computing / Growing Demand in Computing / Domain-Specific Accelerators

Domain-Specific Accelerators
- Customized hardware designed for a domain of applications.
- Goal: improve performance and efficiency.
- Example: Apple M1 chip (*AnandTech), annotated with CPU, CPU, GPU, Neural Engine, and domain-specific accelerators.

From Domain-Specific Hardware to System
"The Computer as a Datacenter":
- Many distributed cores and accelerators
- Connected via on-chip networks
- Running concurrent and large applications
Full-stack innovations are needed for:
- Accelerator Design for machine learning and beyond;
- Accelerator Integration into the core computing stack;
- Accelerator Evaluation of end-to-end applications.

Outline
- Overview
- Accelerator Design
- Accelerator Integration
- Accelerator Evaluation
- From Hardware to System
- Conclusion

Accelerators from Edge to Cloud
- Edge TPU
- Cloud TPU
- Tesla FSD

Why Accelerators? Inefficiency in CPUs
- Functional units account for only a fraction of the total energy consumption.
- Most of the energy is spent on instruction fetching and decoding.
- This is consistent across many applications, from H.264 encoding to machine learning.
[Figure: energy breakdown of a 5-stage pipelined processor running an H.264 encoder.]
Understanding Sources of Inefficiency in General-Purpose Chips, ISCA 2010

Core Optimizations in Accelerators
- Instruction decoding: coarse-grained instructions (e.g., the TPU ISA)
- Custom datapath:
  - Parallel operations: SIMD/VLIW
  - Operation fusion of multiple ops
  - "Magic" instructions for custom ops
- Custom memory:
  - Dedicated storage for operands
  - Dedicated data delivery network
Understanding Sources of Inefficiency in General-Purpose Chips, ISCA 2010

Example: IBM Power10 MMA
- Uses coarse-grained instructions and a custom datapath for acceleration.
- Matrix-Multiply Assist (MMA) for linear algebra on small matrices.
- Integrated as a special functional unit.
- Benefits:
  - Reduced number of instructions; each instruction is more complex (i.e., higher CPI)
  - Reduced latency: lower number of cycles
  - Improved performance
Energy Efficiency Boost in the AI-Infused POWER10 Processor, ISCA 2021

Example: PIM Accelerator in HBM
- Custom memory to reduce data movement.
- Samsung's processing-in-memory (PIM) in HBM2 for DNNs.
- First HBM-based PIM design fabricated by a major DRAM manufacturer and 2.5D-integrated with an unmodified commercial processor.
- The PIM unit performs GEMV for DNNs.
Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology, ISCA 2021

From Hardware to System
Accelerators are more than just hardware.
- SoC interface: how the accelerator is integrated into the entire system
- Software interface: how hardware operations are exposed to the software stack

Full-System Visibility: SoC
- Performance impacts: resource contention, etc.
Full-System Visibility: Memory Hierarchy
- Performance impacts: cache coherence, miss rates/latencies, etc.

Example: DIANA w/ shared memory
SoC with digital and analog NN accelerators:
- RISC-V host processor
- Digital NN core with flexible dataflow
- Analog in-memory compute core with SIMD digital post-processing
- Shared memory system supporting efficient layer-fused execution, controlled by the RISC-V core
DIANA: An End-to-End Energy-Efficient DIgital and ANAlog Hybrid Neural Network SoC, ISSCC 2022

Gemmini: Full-Stack, Full-System DL Accelerators
https:/
Best Paper Award

Case Study: Memory Partitioning
- Single core: private scratchpad more helpful; much better for convs
- Dual core: shared L2 more helpful; much better for residual additions

Full-System Visibility: Host CPUs
- Performance impacts: unaccelerated kernels, etc.
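The "unaccelerated kernels" impact can be made concrete with Amdahl's law. The sketch below is illustrative only: the function name and the 80% offloadable fraction are assumptions, not figures from the tutorial. It shows how the portion of runtime left on the host CPU bounds the end-to-end speedup no matter how fast the accelerated kernel becomes.

```python
def end_to_end_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: overall speedup when only part of the runtime is accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    host_time = 1.0 - accelerated_fraction              # work that stays on the host CPU
    accel_time = accelerated_fraction / kernel_speedup  # offloaded work, now faster
    return 1.0 / (host_time + accel_time)


if __name__ == "__main__":
    # Hypothetical workload: 80% of the original runtime can be offloaded.
    for speedup in (10.0, 100.0, 1000.0):
        overall = end_to_end_speedup(0.8, speedup)
        print(f"kernel speedup {speedup:>6.0f}x -> end-to-end {overall:.2f}x")
```

With 20% of the runtime left on the host, the end-to-end speedup stays below 5x in this hypothetical case, which is why host CPU performance remains visible at the system level.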
Example: Centaur SoC
Deep-learning coprocessor with server-class x86 CPU:
- Ncore: DL coprocessor
- 8x 64-bit x86 cores w/ distributed L3
- 77% of the code is still running on the cores
- Performance scales with an increasing number of cores
High-Performance Deep-Learning Coprocessor Integrated into x86 SoC with Server-Class CPUs, ISCA 2020

Full-System Visibility: Virtual Addresses
- Performance impacts: page faults, TLB hits, etc.

Case Study: Virtual Memory for DNNs
TLB: Translation Lookaside Buffer
- Two-level TLB hierarchy: shared L2 TLB, private accelerator TLB, private CPU TLB
[Figure: TLB misses for ResNet50.]
- A small private TLB is much more impactful.
- Low-cost optimization: a single-entry L0 TLB filters out consecutive TLB requests to the same page.
[Figure panels: without L0 TLB; with L0 TLB.]
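The single-entry L0 TLB idea from the case study above can be sketched with a toy model. This is a simplified illustration, not the evaluated hardware: the 4 KiB page size, the function name, and the access stream are assumptions. The L0 entry is modeled only as "the last page seen", and the code counts how many requests still reach the private TLB.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


def count_private_tlb_lookups(addresses, use_l0=True, page_size=PAGE_SIZE):
    """Count the requests that reach the private TLB for a stream of virtual addresses.

    With use_l0=True, a single-entry L0 TLB remembers the last page seen and
    absorbs any immediately following request to that same page.
    """
    last_page = None
    lookups = 0
    for addr in addresses:
        page = addr // page_size
        if use_l0 and page == last_page:
            continue  # L0 hit: filtered before reaching the private TLB
        lookups += 1
        last_page = page
    return lookups


if __name__ == "__main__":
    # Hypothetical access stream: sequential 64-byte reads across four pages.
    stream = range(0, 4 * PAGE_SIZE, 64)
    print("without L0:", count_private_tlb_lookups(stream, use_l0=False))
    print("with L0:   ", count_private_tlb_lookups(stream, use_l0=True))
```

Sequential streams benefit the most from this filter; a stream that alternates between two pages on every access gets no filtering from a single entry.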
Full-System Visibility: Operating System
- Linux
- Performance impacts: interrupts, context switches, etc.

Full-Stack Visibility
- High
- Medium
- Low: direct hardware configuration, low-level ISA

Example: Software stack for PIM-HBM
- Any accelerator needs a software stack so that applications can use it effectively.
- Different accelerators require different libraries and low-level interfaces to be compatible with the rest of the software stack.
Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology, ISCA 2021
Accelerators Don't Exist in Isolation
[Figure: die-photo analysis; Apple M1 chip annotated with CPU, CPU, GPU, Neural Engine, and domain-specific accelerators.]
http://vlsiarch.eecs.harvard.edu/research/accelerators/die-photo-analysis/
Maltiel consulting estimates
Shao et al., IEEE Micro 2015

Example: Autonomous Driving SoC
Heterogeneous SoC:
- Memory interface: LPDDR4
- 1MB memory
- Control domain: dual-core lockstep (DCLS) Cortex-R52
- Hardware security module
- 2MB memory
- Flash interface
- Application domain: GPU, 8 Cortex-A76
- Computer vision domain: 22 heterogeneous accelerators
- 8MB memory
A 12-nm Autonomous Driving Processor With 60.4 TOPS, 13.8 TOPS/W CNN Executed by Task-Separated ASIL D Control, JSSC 2022

Example: Image-Recognition SoC
For advanced driver-assistance systems (ADAS):
- Processing Island (PI) for image recognition: 2x Quad ARM Cortex-A53, 4x DSPs for image processing, 8x accelerators, 16Mb on-chip SRAM
- Safe Island (SI) for control processes: 2x dual-core lock-step (DCLS) ARM Cortex-R4, hardware secure module (HSM), 8Mb on-chip SRAM
A 20.5TOPS and 217.3GOPS/mm² Multicore SoC with DNN Accelerator and Image Signal Processor Complying with ISO26262 for Automotive Applications, ISSCC 2019

Mobile SoC Use Case
- Mainstream architecture has long focused on general-purpose CPUs and GPUs.
- In an SoC, multiple IP blocks are active at the same time and communicate frequently with each other.
- Example: recording a 4K video
  - Camera-ISP
  - "Preview stream" for display
  - "Video stream" for storage
  - DRAM for data sharing
Two Billion Devices and Counting: An Industry Perspective on the State of Mobile Computer Architecture, IEEE Micro 2018
Mobile SoC Use Cases
Multiple accelerators run concurrently for different use cases.
[Table: use cases (rows) versus accelerators/IPs (columns). Columns: CPUs (AP), Display, Media Scaler, GPU, Image Signal Proc., JPEG, Pixel Visual Core, Video Decoder, Video Encoder, dozens more. Rows, with the number of IPs marked for each: Photo Enhancing (6), Video Capture (5), Video Capture HDR (5), Video Playback (5), Image Recognition (4).]
Mark Hill and Vijay Janapa Reddi, Gables: A Roofline Model for Mobile SoCs, HPCA 2019

Accelerator Integration Methodology
- Tightly coupled (e.g., RoCC)
  - Appears as custom ISA extensions to host software
  - Simplifies software stacks: accelerators appear as custom instructions
- Loosely coupled (e.g., MMIO)
  - Physically disaggregated over the interconnect
  - SW/HW features for managing these devices: interrupts, IOMMU, mmap

Rocket Custom Coprocessor Interface (RoCC)
- An interface that facilitates decoupled communication between the core and the attached coprocessors.
- The RoCC interface accepts coprocessor commands generated by the Rocket core.

RoCC Instruction Format
- A command includes the instruction word and the values of up to two integer registers; a command may write an integer register in response.
- The xd, xs1, and xs2 bits act as valid bits for the register specifiers, indicating whether the core is using those registers.
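To make the instruction format concrete, here is a small encoder/decoder sketch. The bit positions (funct7 in bits 31:25, rs2 in 24:20, rs1 in 19:15, xd/xs1/xs2 in bits 14/13/12, rd in 11:7, opcode in 6:0) and the custom-0 opcode value are not given on the slide; they follow the layout commonly documented for Rocket Chip and should be treated as an assumption to check against its sources. The function names and the example field values are hypothetical.

```python
CUSTOM0 = 0b0001011  # RISC-V custom-0 major opcode (assumed target for this sketch)


def encode_rocc(funct7, rd, rs1, rs2, xd, xs1, xs2, opcode=CUSTOM0):
    """Pack RoCC instruction fields into a 32-bit instruction word."""
    fields = (("funct7", funct7, 7), ("rs2", rs2, 5), ("rs1", rs1, 5),
              ("rd", rd, 5), ("opcode", opcode, 7))
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
    return (funct7 << 25 | rs2 << 20 | rs1 << 15
            | int(bool(xd)) << 14 | int(bool(xs1)) << 13 | int(bool(xs2)) << 12
            | rd << 7 | opcode)


def decode_rocc(word):
    """Unpack a 32-bit instruction word into RoCC fields."""
    return {
        "funct7": (word >> 25) & 0x7F,
        "rs2": (word >> 20) & 0x1F,
        "rs1": (word >> 15) & 0x1F,
        "xd": bool((word >> 14) & 1),   # core expects a result written to rd
        "xs1": bool((word >> 13) & 1),  # core supplies the value of rs1
        "xs2": bool((word >> 12) & 1),  # core supplies the value of rs2
        "rd": (word >> 7) & 0x1F,
        "opcode": word & 0x7F,
    }


if __name__ == "__main__":
    # Hypothetical command: funct7=2, passes two source registers, writes no result.
    word = encode_rocc(funct7=2, rd=0, rs1=10, rs2=11, xd=False, xs1=True, xs2=True)
    print(hex(word), decode_rocc(word))
```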
Memory-Mapped IO Accelerators
- Loosely coupled accelerators communicate with the core through memory-mapped registers instead of being invoked directly through RoCC instructions.
- A commonly used way to connect loosely coupled accelerators on an SoC.
- Access shared data in the LLC and/or DRAM; can be coherent or not.
- ARM's AXI, RISC-V's TileLink

Example: Speech Recognition SoC
Heterogeneous SoC with:
- Always-on Cortex-M0 MCU
- Dual-core ARM Cortex-A53 CPU cluster with 2MB L2
- MSSE: Markov Source Separation Engine for speech denoising
- FlexASR: reconfigurable accelerator for sequence-to-sequence speech recognition
- Connected via AHB and AXI buses
A 25mm² SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET, ISSCC 2021

Example: Motion-Control SoC
Motion-control SoC for autonomous mobile robots:
- Microcontroller (Cortex-M3)
- Trajectory optimization accelerator
- Processing elements
- Trajectory unit (TU)
- Trajectory pruner (TP)
- Gaussian random number generator (GRNG)
- Post-processing unit
- Shared memory
- Connected via AHB bus
A 28nm 142mW Motion-Control SoC for Autonomous Mobile Robots, ISSCC 2023

TileLink
- A chip-scale interconnect standard providing coherent memory-mapped access to memory and other devices.
- Designed for use in a system-on-chip (SoC) to connect general-purpose multiprocessors, co-processors, accelerators, and DMA engines.
- Free and open source
- RISC-V-based systems

TileLink Protocol Levels
- TileLink Uncached Lightweight (TL-UL): only simple memory read/write (Get/Put) operations on single words (similar to AXI-Lite)
- TileLink Uncached Heavyweight (TL-UH): adds various hints, atomics, and burst accesses, but without coherence (similar to AXI4)
- TileLink Cached (TL-C): the complete protocol, which supports the use of coherent caches (similar to ACE)

TileLink Channels
[Figure: five channels, three labeled Manager to Subordinate and two labeled Subordinate to Manager.]
* Using AXI agent names here: a TileLink Client corresponds to an AXI Manager, and a TileLink Manager corresponds to an AXI Subordinate.

TileLink: The Foundation of SiFive's FU500
https://riscv.org/wp-content/uploads/2017/12/Wed-1154-TileLink-Wesley-Terpstra.pdf
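The three protocol levels listed earlier (TL-UL, TL-UH, TL-C) nest: each level adds capabilities on top of the previous one. The sketch below encodes only what the slide states; the feature labels and the helper name are hypothetical and are not part of any TileLink library.

```python
from enum import IntEnum


class TileLinkLevel(IntEnum):
    TL_UL = 1  # Uncached Lightweight: single-word Get/Put
    TL_UH = 2  # Uncached Heavyweight: adds hints, atomics, bursts
    TL_C = 3   # Cached: adds support for coherent caches


# Lowest level that provides each capability (labels are illustrative).
REQUIRED_LEVEL = {
    "get": TileLinkLevel.TL_UL,
    "put": TileLinkLevel.TL_UL,
    "hint": TileLinkLevel.TL_UH,
    "atomic": TileLinkLevel.TL_UH,
    "burst": TileLinkLevel.TL_UH,
    "coherent_cache": TileLinkLevel.TL_C,
}


def minimum_level(features):
    """Return the lowest protocol level that covers every requested feature."""
    return max((REQUIRED_LEVEL[f] for f in features), default=TileLinkLevel.TL_UL)


if __name__ == "__main__":
    print(minimum_level(["get", "put"]).name)           # simple register-style device
    print(minimum_level(["get", "burst"]).name)         # DMA-style bulk transfers
    print(minimum_level(["get", "coherent_cache"]).name)
```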
TileLink Examples
- RoCC accelerators: SHA3 https:/
- accelerator https:/
- accelerator https:/
- vector accelerator https:/
- network interface controller for FireSim https:/

Example: Using RoCC + TileLink w/ Gemmini
How Gemmini, a DNN accelerator, uses RoCC and TileLink: how does Gemmini read data from main memory into Gemmini's scratchpad?
1. The host CPU encounters an unknown RISC-V instruction.
2. The host CPU dispatches the unknown instruction to the RoCC accelerator,
   1) along with the contents of rs1 and rs2 (128 extra bits).
3. Gemmini decodes the instruction.
   1) It's a load instruction!
4. Gemmini asks the CPU's page table walker (PTW) to translate the addresses in rs1 and rs2.
   1) The PTW is only available through the RoCC interface.
5. Gemmini sends TileLink requests to read data from main memory.
   1) Often, multiple TileLink requests must be sent, due to TileLink's alignment and length limitations.
[Figure: the CPU sends the instruction fields (unknown opcode, rs1, rs2) plus the rs1 and rs2 contents to Gemmini, which decodes them as (load data, main memory address, scratchpad address).]

Managing Many-Accelerator SoC
- High performance variability due to contention.
- Sounds like datacenters? Yet today's SoCs lack abstractions to manage accelerators.
[Figure: autonomous-driving pipeline. Blocks: cameras, LiDARs, GPS, radars, steering, acceleration & braking, object detection, lane detection, traffic light detection, localization, traffic sign detection, current state, future states, trajectory, route planning, local path planning, collision avoidance. Stages: Sensing, Perception, Prediction, Planning, Control. Hardware: CPU, CPU, GPU, Neural Engine, domain-specific accelerators.]
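Step 5 of the Gemmini walkthrough above notes that one load often turns into several TileLink requests because of alignment and length limits. The sketch below illustrates that kind of splitting under stated assumptions: every request is a power-of-two number of bytes, is aligned to its own size, and is capped at a hypothetical `max_bytes` of 64. It is not Gemmini's actual DMA logic, and the function name is made up for illustration.

```python
def split_into_aligned_requests(addr, length, max_bytes=64):
    """Split [addr, addr + length) into power-of-two sized, naturally aligned requests.

    Returns a list of (address, size_in_bytes) tuples.
    """
    if addr < 0 or length < 0:
        raise ValueError("addr and length must be non-negative")
    if max_bytes <= 0 or max_bytes & (max_bytes - 1):
        raise ValueError("max_bytes must be a power of two")
    requests = []
    while length > 0:
        size = max_bytes
        # Shrink until the request is aligned to its own size and fits the remainder.
        while addr % size != 0 or size > length:
            size //= 2
        requests.append((addr, size))
        addr += size
        length -= size
    return requests


if __name__ == "__main__":
    # Hypothetical transfer: 200 bytes starting at an address that is only 8-byte aligned.
    for req_addr, req_size in split_into_aligned_requests(0x1038, 200):
        print(f"addr=0x{req_addr:04x} size={req_size}")
```

A single unaligned 200-byte read becomes one short leading request followed by full-size ones, which is the behavior the walkthrough alludes to.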
Virtualization for Performance and Programmability
[Figure: CPUs running workloads W1 and W2; each workload addresses virtual accelerators (V0, V1 or V0 to V3), and a map relates them to physical accelerators P0 to P5.]

AuRORA: Virtualized and Disaggregated Accelerator Integration
- Virtualized and disaggregated accelerator interface
- Full-stack approach for many-accelerator integration
- Dynamic accelerator runtime for adaptive control
https:/

AuRORA Microarchitecture
- AuRORA Client: attaches to CPUs via RoCC; forwards accelerator instructions to acquired Managers
- AuRORA Manager: shadow registers track thread architectural state; eliminates the need to interact with a supervisor-managed IOMMU
https:/
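The virtual-to-physical accelerator map in the virtualization figure can be illustrated with a toy allocator. This is a generic sketch under simple assumptions (first-free assignment, no preemption, no priorities); the class and method names are hypothetical, and it does not describe AuRORA's actual Client/Manager protocol.

```python
class AcceleratorPool:
    """Toy map from (workload, virtual accelerator) to a physical accelerator."""

    def __init__(self, num_physical):
        self.free = list(range(num_physical))  # physical IDs P0..P(n-1)
        self.mapping = {}                      # (workload, virtual_id) -> physical_id

    def acquire(self, workload, virtual_id):
        """Bind a virtual accelerator to a free physical one; None if all are busy."""
        key = (workload, virtual_id)
        if key in self.mapping:
            return self.mapping[key]
        if not self.free:
            return None
        physical = self.free.pop(0)
        self.mapping[key] = physical
        return physical

    def release(self, workload, virtual_id):
        """Unbind a virtual accelerator and return its physical one to the pool."""
        physical = self.mapping.pop((workload, virtual_id), None)
        if physical is not None:
            self.free.append(physical)


if __name__ == "__main__":
    pool = AcceleratorPool(num_physical=3)
    print(pool.acquire("W1", 0), pool.acquire("W2", 0), pool.acquire("W1", 1))
    print(pool.acquire("W2", 1))  # no physical accelerator left
    pool.release("W1", 0)
    print(pool.acquire("W2", 1))  # reuses the accelerator W1 gave back
```

The point of the indirection is that each workload is written against its own virtual IDs, while the binding to physical accelerators can change at run time.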
From PPA to Quality-of-Experience
- Hardware evaluations have focused on the isolated performance, power, and area (PPA) of individual kernels.
- However, today's computing is an integrated component of the sensing-to-action pipeline, where it is critical to evaluate the end-to-end quality-of-experience of the entire system.

Example: ILLIXR: Full-System XR Testbed
- Open-source infrastructure for extended reality (XR): virtual, augmented, and mixed reality
- Involves many concurrent and dependent applications that work collectively on an end-to-end task
Exploring Extended Reality with ILLIXR: A New Playground for Architecture Research, IISWC 2021

Example: MAVBench for Micro Aerial Vehicles
- Hardware-in-the-loop evaluation for unmanned aerial vehicles (UAVs)
- Closed-loop evaluation is needed to understand hardware and software bottlenecks in end-to-end tasks.
MAVBench: Micro Aerial Vehicle Benchmarking, MICRO 2018

Domain-Specific SoCs for Robotics
- Challenging constraints
- Diverse architectures

Challenge 1: Closed-Loop Feedback
[Figure labels: desired trajectory; delay.]

Challenge 2: Full-SoC Evaluation
SoC-level evaluation:
- Algorithms and accelerators are often profiled in isolation.
- Need to capture co-location and contention on real systems.

Challenge 3: Pre-Silicon Design Space Exploration
- Existing hardware-in-the-loop evaluation focuses on post-silicon parameters, e.g., the number of active cores and the voltage/frequency of each core.
- There is a lack of tools to support pre-silicon design space evaluation, where we can change the types of hardware and the SoC architecture.
HW-SW Co-Simulation w/ RoS
https:/
Artifact Award

RoS: Pre-Silicon Robotics SoC Evaluator

RoS in Action
- AirSim
- FPGA

Robotics Workloads

Experimental Setup
- Environments
- SoC configurations: Out-of-Order Core + Accelerator; In-Order Core + Accelerator; Out-of-Order Core Only

Impact of SoC on Flight Performance
[Figure: flight performance for Out-of-Order Core Only, In-Order Core + Accelerator, and Out-of-Order Core + Accelerator.]

Full-Stack Optimization for Domain-Specific Systems
- Accelerator Design
- Accelerator Integration
- Accelerator Evaluation
Thank You!
Contact
- Email: ysshao@berkeley.edu
- Website: https://people.eecs.berkeley.edu/~ysshao/

Papers to See This Year
- Forum 1: Efficient Chiplets and Die-to-Die Communications
- Forum 2: Energy-Efficient AI-Computing Systems for Large-Language Models
- Paper 2.4: ATOMUS: A 5nm 32TFLOPS/128TOPS ML System-on-Chip for Latency Critical Applications
- Paper 2.5: A 28nm Physical-Based Ray-Tracing Rendering Processor for Photorealistic Augmented Reality with Inverse Rendering and Background Clustering for Mobile Devices
- Paper 11.3: Metis AIPU: A 12nm 15TOPS/W 209.6TOPS SoC for Cost- and Energy-Efficient Inference at the Edge
- Paper 20.3: A 23.9TOPS/W @ 0.8V, 130TOPS AI Accelerator with 16× Performance-Accelerable Pruning in 14nm Heterogeneous Embedded MPU for Real-Time Robot Applications
- Paper 20.6: LSPU: A Fully Integrated Real-Time LiDAR-SLAM SoC with Point-Neural-Network Segmentation and Multi-Level kNN Acceleration
- Paper 20.7: NeuGPU: A 18.5mJ/Iter Neural-Graphics Processing Unit for Instant-Modeling and Real-Time Rendering with Segmented-Hashing Architecture
- Paper 20.8: Space-Mate: A 303.5mW Real-Time Sparse Mixture-of-Experts-Based NeRF-SLAM Processor for Mobile Spatial Computing
Key References
- Understanding Sources of Inefficiency in General-Purpose Chips, ISCA 2010
- Energy Efficiency Boost in the AI-Infused POWER10 Processor, ISCA 2021
- Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology, ISCA 2021
- DIANA: An End-to-End Energy-Efficient DIgital and ANAlog Hybrid Neural Network SoC, ISSCC 2022
- Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration, DAC 2021
- High-Performance Deep-Learning Coprocessor Integrated into x86 SoC with Server-Class CPUs, ISCA 2020
- A 12-nm Autonomous Driving Processor With 60.4 TOPS, 13.8 TOPS/W CNN Executed by Task-Separated ASIL D Control, JSSC 2022
- A 20.5TOPS and 217.3GOPS/mm² Multicore SoC with DNN Accelerator and Image Signal Processor Complying with ISO26262 for Automotive Applications, ISSCC 2019
- A 25mm² SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET, ISSCC 2021
- AuRORA: Virtualized Accelerator Orchestration for Multi-Tenant Workloads, MICRO 2023
- Exploring Extended Reality with ILLIXR: A New Playground for Architecture Research, IISWC 2021
- MAVBench: Micro Aerial Vehicle Benchmarking, MICRO 2018
- RoS: A Hardware-Software Co-Simulation Infrastructure Enabling Pre-Silicon Full-Stack Robotics SoC Evaluation, ISCA 2023