T9 - Domain-Specific Accelerators_from hardware to system.pdf


Slide 1: Domain-Specific Accelerators: From Hardware to System
- Sophia Shao (ysshao@berkeley.edu)
- 2024 IEEE International Solid-State Circuits Conference (ISSCC 2024), Tutorial 9
- February 2, 2024

Slide 2: Growing Demand in Computing
- [Figure: training compute (petaFLOP; axis ticks from 1,000 to 10 billion in 10x steps) versus publication date (axis ticks at Sep 30, 2012; Dec 13, 2015; Sep 8, 2018; Jun 4, 2021), colored by task domain (Drawing, Driving, Games, Language, Multimodal, Other, Speech, Vision), with a "Moore's Law Scaling" reference.]
- Labeled models: AlexNet, VGG16, GoogLeNet/InceptionV1, ResNet-152 (ImageNet), Transformer, AlphaZero, YOLOv3, GPT, BERT-Large, GPT-2, AlphaFold, GPT-3 175B (davinci), DALL-E, PaLM (540B), LLaMA (65B), GPT-4, PaLM 2
- Source: Epoch (2023), OurWorldInData.org/artificial-intelligence, CC BY

Slide 3: Slowing Supply in Computing
- Diminishing efficiency returns: achieving efficiency gains becomes more challenging as technology advances to smaller nodes.
- Diminishing cost benefits: the traditional cost benefits of packing more transistors into a chip are also diminishing at advanced nodes.
- [Figure credit: Marvell]

Slides 4-6: Slowing Supply in Computing vs. Growing Demand in Computing
- Domain-Specific Accelerators

Slide 7: Domain-Specific Accelerators
- Customized hardware designed for a domain of applications.
- Goal: improve performance and efficiency.
- [Figure: Apple M1 chip (source: AnandTech), annotated with CPU, GPU, Neural Engine, and domain-specific accelerators.]

Slide 8: From Domain-Specific Hardware to System
- "The Computer as a Datacenter":
  - Many distributed cores and accelerators
  - Connected via on-chip networks
  - Running concurrent and large applications
- Full-stack innovations are needed for:
  - Accelerator design for machine learning and beyond
  - Accelerator integration into the core computing stack
  - Accelerator evaluation of end-to-end applications

Slide 9: Outline
- Overview
- Accelerator Design
- Accelerator Integration
- Accelerator Evaluation
- From Hardware to System
- Conclusion

Slide 10: Accelerators from Edge to Cloud
- Edge TPU
- Cloud TPU
- Tesla FSD

Slide 11: Why Accelerators? Inefficiency in CPUs
- Functional units account for only a fraction of the total energy consumption.
- Most of the energy is spent on instruction fetch and decode.
- This is consistent across many applications, from H.264 encoding to machine learning.
- [Figure: energy breakdown of a 5-stage pipelined processor running an H.264 encoder.]
- Reference: Understanding Sources of Inefficiency in General-Purpose Chips, ISCA 2010

Slide 12: Core Optimizations in Accelerators
- Instruction decoding: coarse-grained instructions (e.g., the TPU ISA)
- Custom datapath:
  - Parallel operations: SIMD/VLIW
  - Operation fusion of multiple ops
  - "Magic" instructions for custom ops
- Custom memory:
  - Dedicated storage for operands
  - Dedicated data delivery network
- Reference: Understanding Sources of Inefficiency in General-Purpose Chips, ISCA 2010

Slides 13-14: Example: IBM Power10 MMA
- Uses coarse-grained instructions and a custom datapath for acceleration.
- Matrix-Multiply Assist (MMA) for linear algebra on small matrices.
- Integrated as a special functional unit.
- Benefits:
  - Reduced number of instructions; each instruction is more complex (i.e., higher CPI)
  - Reduced latency; lower number of cycles
  - Improved performance
- Reference: Energy Efficiency Boost in the AI-Infused POWER10 Processor, ISCA 2021

Slide 15: Example: PIM Accelerator in HBM
- Custom memory to reduce data movement.
- Samsung's processing-in-memory (PIM) in HBM2 for DNNs.
- First HBM-based PIM design fabricated by a major DRAM manufacturer and 2.5D-integrated with an unmodified commercial processor.
- The PIM units perform GEMV for DNNs.
- Reference: Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology, ISCA 2021

Slide 16: From Hardware to System
- Accelerators are more than just hardware.
- SoC interface: how the accelerator is integrated into the entire system.
- Software interface: how hardware operations are exposed to the software stack.

Slide 17: Full-System Visibility: SoC
- Performance impacts: resource contention, etc.
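
The instruction-count benefit of coarse-grained instructions (slides 12 to 14) can be illustrated with a back-of-the-envelope count. This is a minimal sketch under stated assumptions: it imagines a hypothetical accelerator instruction that applies one rank-1 (outer-product) update to a tile x tile block of the output matrix, and it counts only arithmetic instructions, ignoring loads and stores. The tile size and function names are illustrative and are not taken from the Power10 MMA ISA.

```python
def scalar_mac_instructions(m: int, n: int, k: int) -> int:
    """Scalar baseline: one multiply-accumulate instruction per (i, j, p) triple."""
    return m * n * k


def tiled_instructions(m: int, n: int, k: int, tile: int = 4) -> int:
    """Hypothetical coarse-grained ISA: one instruction performs a rank-1
    update of a tile x tile block of C, so each block needs k instructions."""
    tiles_m = -(-m // tile)  # ceiling division
    tiles_n = -(-n // tile)
    return tiles_m * tiles_n * k


if __name__ == "__main__":
    m = n = k = 64
    scalar = scalar_mac_instructions(m, n, k)
    coarse = tiled_instructions(m, n, k, tile=4)
    print(f"scalar MACs: {scalar}, coarse-grained instructions: {coarse}, "
          f"reduction: {scalar / coarse:.0f}x")
```

Under these assumptions the arithmetic instruction count drops by a factor of tile squared, which is the kind of fetch and decode saving the slides attribute to coarse-grained instructions.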

Slide 18: Full-System Visibility: Memory Hierarchy
- Performance impacts: cache coherence, miss rates/latencies, etc.

Slide 19: Example: DIANA with Shared Memory
- SoC with digital and analog NN accelerators:
  - RISC-V host processor
  - Digital NN core with flexible dataflow
  - Analog in-memory compute core with SIMD digital post-processing
  - Shared memory system supporting efficient layer-fused execution, controlled by the RISC-V core
- Reference: DIANA: An End-to-End Energy-Efficient DIgital and ANAlog Hybrid Neural Network SoC, ISSCC 2022

Slide 20: Gemmini: Full-Stack, Full-System DL Accelerators
- https:/
- Best Paper Award

Slides 21-22: Case Study: Memory Partitioning
- Single core: a private scratchpad is more helpful; much better for convs.
- Dual core: a shared L2 is more helpful; much better for residual additions.

Slide 23: Full-System Visibility: Host CPUs
- Performance impacts: unaccelerated kernels, etc.
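
The impact of unaccelerated kernels on end-to-end performance (slide 23) is commonly reasoned about with Amdahl's law. The sketch below is an illustration only: the fractions and speedups are hypothetical inputs, not measurements from the tutorial, and the function name is made up for this example.

```python
def overall_speedup(accelerated_fraction: float, accelerator_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only part of the runtime is accelerated.

    accelerated_fraction: share of the original runtime spent in kernels that
        the accelerator can run (between 0 and 1).
    accelerator_speedup: speedup of those kernels on the accelerator (> 0).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if accelerator_speedup <= 0.0:
        raise ValueError("accelerator_speedup must be positive")
    remaining = 1.0 - accelerated_fraction  # stays on the host CPU
    return 1.0 / (remaining + accelerated_fraction / accelerator_speedup)


if __name__ == "__main__":
    # Hypothetical sweep: even a very fast accelerator is capped by the
    # time left on the host CPU.
    for fraction in (0.5, 0.9, 0.99):
        print(fraction, round(overall_speedup(fraction, 100.0), 2))
```

With half of the runtime left on the host, the end-to-end speedup cannot exceed 2x regardless of how fast the accelerator is, which is why host CPU performance still matters in an accelerated SoC.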

Slide 24: Example: Centaur SoC
- Deep-learning coprocessor with server-class x86 CPU
- Ncore: DL coprocessor
- 8x 64-bit x86 cores with distributed L3
- 77% of the code is still running on the cores
- Performance scales with an increasing number of cores
- Reference: High-Performance Deep-Learning Coprocessor Integrated into x86 SoC with Server-Class CPUs, ISCA 2020

Slide 25: Full-System Visibility: Virtual Addresses
- Performance impacts: page faults, TLB hits, etc.

Slide 26: Case Study: Virtual Memory for DNNs
- Two-level TLB (translation lookaside buffer) hierarchy:
  - Shared L2 TLB
  - Private accelerator TLB
  - Private CPU TLB
- [Figure: TLB misses for ResNet50.]

Slide 27: Case Study: Virtual Memory for DNNs
- A small private TLB is much more impactful.
- Low-cost optimization: a single-entry L0 TLB filters out consecutive TLB requests to the same page.
- [Figure: comparison without and with the L0 TLB.]

Slide 28: Full-System Visibility: Operating System
- Linux
- Performance impacts: interrupts, context switches, etc.
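
The single-entry L0 TLB from the virtual-memory case study (slide 27) can be sketched as a filter that remembers only the last translated page. This is a toy model under stated assumptions: a 4 KiB page size, a made-up access stream, and a function name invented for the example. It does not model Gemmini's actual TLB hardware.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def count_tlb_lookups(addresses, use_l0: bool = True) -> int:
    """Count the requests that reach the private TLB.

    With use_l0=True, a single-entry L0 filter absorbs any request whose
    virtual page matches the immediately preceding request.
    """
    last_page = None
    lookups = 0
    for addr in addresses:
        page = addr // PAGE_SIZE
        if use_l0 and page == last_page:
            continue  # L0 hit: translation reused, no TLB lookup
        lookups += 1
        last_page = page
    return lookups


if __name__ == "__main__":
    # Hypothetical streaming access: 64-byte requests over four pages.
    stream = list(range(0, 4 * PAGE_SIZE, 64))
    print("without L0:", count_tlb_lookups(stream, use_l0=False))
    print("with L0:   ", count_tlb_lookups(stream, use_l0=True))
```

For a streaming pattern like this, nearly all requests fall on the same page as the previous one, so a one-entry filter removes most TLB traffic at very low hardware cost.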

Slide 29: Full-Stack Visibility
- [Figure: software stack with levels labeled High, Medium, and Low; the Low level covers direct hardware configuration and the low-level ISA.]

Slide 30: Example: Software Stack for PIM-HBM
- Any accelerator needs a software stack so that applications can use it effectively.
- Different accelerators require different libraries and low-level interfaces to be compatible with the rest of the software stack.
- Reference: Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology, ISCA 2021

Slide 32: Accelerators Don't Exist in Isolation
- [Figure: Apple M1 chip annotated with CPU, GPU, Neural Engine, and domain-specific accelerators.]
- Sources: http://vlsiarch.eecs.harvard.edu/research/accelerators/die-photo-analysis/; Maltiel Consulting estimates; Shao et al., IEEE Micro 2015

Slide 33: Example: Autonomous Driving SoC
- Heterogeneous SoC; block-diagram labels in source order:
  - Memory interface: LPDDR4
  - 1MB memory
  - Control domain: dual-core lockstep (DCLS) Cortex-R52
  - Hardware security module
  - 2MB memory
  - Flash interface
  - Application domain: GPU, 8 Cortex-A76
  - Computer vision domain: 22 heterogeneous accelerators, 8MB memory
- Reference: A 12-nm Autonomous Driving Processor With 60.4 TOPS, 13.8 TOPS/W CNN Executed by Task-Separated ASIL D Control, JSSC 2022

Slide 34: Example: Image-Recognition SoC
- For advanced driver-assistance systems (ADAS)
- Processing Island (PI) for image recognition:
  - 2x quad ARM Cortex-A53
  - 4x DSPs for image processing
  - 8x accelerators
  - 16Mb on-chip SRAM
- Safe Island (SI) for control processes:
  - 2x dual-core lock-step (DCLS) ARM Cortex-R4
  - Hardware secure module (HSM)
  - 8Mb on-chip SRAM
- Reference: A 20.5TOPS and 217.3GOPS/mm² Multicore SoC with DNN Accelerator and Image Signal Processor Complying with ISO26262 for Automotive Applications, ISSCC 2019

Slide 35: Mobile SoC Use Case
- Mainstream architecture has long focused on general-purpose CPUs and GPUs.
- In an SoC, multiple IP blocks are active at the same time and communicate frequently with each other.
- Example: recording a 4K video
  - Camera-ISP
  - "Preview stream" for display
  - "Video stream" for storage
  - DRAM for data sharing
- Reference: Two Billion Devices and Counting: An Industry Perspective on the State of Mobile Computer Architecture, IEEE Micro 2018

Slide 36: Mobile SoC Use Cases
- Multiple accelerators run concurrently for different use cases.
- Accelerators (IPs, table columns): CPUs (AP), Display, Media Scaler, GPU, Image Signal Processor, JPEG, Pixel Visual Core, Video Decoder, Video Encoder, dozens more
- Use cases (table rows), with the number of IPs marked per row (the column positions of the marks are not recoverable here):
  - Photo Enhancing: 6
  - Video Capture: 5
  - Video Capture HDR: 5
  - Video Playback: 5
  - Image Recognition: 4
- Reference: Mark Hill and Vijay Janapa Reddi, Gables: A Roofline Model for Mobile SoCs, HPCA 2019

Slide 37: Accelerator Integration Methodology
- Tightly coupled (e.g., RoCC):
  - Appears as custom ISA extensions to host software
  - Simplifies software stacks; accelerators appear as custom instructions
- Loosely coupled (e.g., MMIO):
  - Physically disaggregated over the interconnect
  - SW/HW features for managing these devices: interrupts, IOMMU, mmap

Slide 38: Rocket Custom Coprocessor Interface (RoCC)
- An interface that facilitates decoupled communication between the core and the attached coprocessors.
- The RoCC interface accepts coprocessor commands generated by the Rocket core.

Slide 39: RoCC Instruction Format
- A command includes the instruction word and the values of up to two integer registers; a command may write an integer register in response.
- xd, xs1, and xs2 are valid bits for the register specifiers: they indicate whether the core is using the corresponding registers.
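
The RoCC instruction word described on slide 39 can be sketched as a bit-field encoder. The field layout used here (funct7, rs2, rs1, xd, xs1, xs2, rd, opcode, from most to least significant bit) and the custom-0 opcode value are assumptions based on the commonly documented Rocket Chip convention; the slide text itself does not give bit positions, so check them against the Rocket Chip sources before relying on them. The function names are made up for this example.

```python
CUSTOM0_OPCODE = 0b0001011  # assumed RISC-V custom-0 major opcode


def encode_rocc(funct7: int, rd: int, rs1: int, rs2: int,
                xd: bool, xs1: bool, xs2: bool,
                opcode: int = CUSTOM0_OPCODE) -> int:
    """Pack a 32-bit RoCC-style instruction word (assumed layout)."""
    for name, value, bits in (("funct7", funct7, 7), ("rd", rd, 5),
                              ("rs1", rs1, 5), ("rs2", rs2, 5),
                              ("opcode", opcode, 7)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
            (int(xd) << 14) | (int(xs1) << 13) | (int(xs2) << 12) |
            (rd << 7) | opcode)


def decode_rocc(word: int) -> dict:
    """Unpack the same fields from a 32-bit instruction word."""
    return {
        "funct7": (word >> 25) & 0x7F,
        "rs2": (word >> 20) & 0x1F,
        "rs1": (word >> 15) & 0x1F,
        "xd": bool((word >> 14) & 1),   # core expects a result in rd
        "xs1": bool((word >> 13) & 1),  # rs1 value is sent with the command
        "xs2": bool((word >> 12) & 1),  # rs2 value is sent with the command
        "rd": (word >> 7) & 0x1F,
        "opcode": word & 0x7F,
    }


if __name__ == "__main__":
    # Hypothetical command: funct7=2, reads x10 and x11, writes nothing back.
    word = encode_rocc(funct7=2, rd=0, rs1=10, rs2=11,
                       xd=False, xs1=True, xs2=True)
    print(hex(word), decode_rocc(word))
```

Setting xs1 and xs2 tells the core to forward both source register values with the command, while xd clear means the core does not wait for a response to write back.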

Slide 40: Memory-Mapped IO Accelerators
- Loosely coupled accelerators.
- They communicate with the core through memory-mapped registers instead of being invoked directly through RoCC instructions.
- A commonly used way to connect loosely coupled accelerators on an SoC.
- They access shared data in the LLC and/or DRAM.
- They can be coherent or not.
- Interconnect examples: ARM's AXI, RISC-V's TileLink

Slide 41: Example: Speech Recognition SoC
- Heterogeneous SoC with:
  - Always-on Cortex-M0 MCU
  - Dual-core ARM Cortex-A53 CPU cluster with 2MB L2
  - MSSE: Markov Source Separation Engine for speech denoising
  - FlexASR: reconfigurable accelerator for sequence-to-sequence speech recognition
- Connected via AHB and AXI buses
- Reference: A 25mm² SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET, ISSCC 2021

Slide 42: Example: Motion-Control SoC
- Motion-control SoC for autonomous mobile robots:
  - Microcontroller (Cortex-M3)
  - Trajectory optimization accelerator
  - Processing elements
  - Trajectory unit (TU)
  - Trajectory pruner (TP)
  - Gaussian random number generator (GRNG)
  - Post-processing unit
  - Shared memory
- Connected via AHB bus
- Reference: A 28nm 142mW Motion-Control SoC for Autonomous Mobile Robots, ISSCC 2023

Slide 43: TileLink
- A chip-scale interconnect standard providing coherent memory-mapped access to memory and other devices.
- Designed for use in a system-on-chip (SoC) to connect general-purpose multiprocessors, co-processors, accelerators, and DMA engines.
- Free and open source
- RISC-V-based systems

Slide 44: TileLink Protocol Levels
- TileLink Uncached Lightweight (TL-UL): only simple memory read/write (Get/Put) operations on single words (similar to AXI-Lite).
- TileLink Uncached Heavyweight (TL-UH): adds various hints, atomics, and burst accesses, but without coherence (similar to AXI4).
- TileLink Cached (TL-C): the complete protocol, which supports the use of coherent caches (similar to ACE).

Slide 45: TileLink Channels
- [Figure: five channels; three are labeled "Manager to Subordinate" and two "Subordinate to Manager".]
- The figure uses AXI agent names: a TileLink client corresponds to an AXI manager, and a TileLink manager corresponds to an AXI subordinate.

Slide 46: TileLink: The Foundation of SiFive's FU500
- https://riscv.org/wp-content/uploads/2017/12/Wed-1154-TileLink-Wesley-Terpstra.pdf
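
The loosely coupled, memory-mapped style of integration (slide 40) can be sketched with a small software model: the host writes operands into device registers, sets a start bit, and polls a status register, while the device works on data in shared memory. Everything here is hypothetical: the register offsets, bit definitions, and the copy operation are invented for illustration and do not describe any accelerator from the tutorial.

```python
# Hypothetical register map (byte offsets) for an illustrative device.
REG_SRC, REG_DST, REG_LEN, REG_CTRL, REG_STATUS = 0x00, 0x08, 0x10, 0x18, 0x20
CTRL_START = 0x1
STATUS_DONE = 0x1


class ToyMmioAccelerator:
    """Software model of a device that copies LEN bytes from SRC to DST."""

    def __init__(self, memory: bytearray):
        self.memory = memory  # stands in for shared LLC/DRAM
        self.regs = {REG_SRC: 0, REG_DST: 0, REG_LEN: 0,
                     REG_CTRL: 0, REG_STATUS: 0}

    def write_reg(self, offset: int, value: int) -> None:
        if offset not in self.regs:
            raise ValueError(f"no register at offset {offset:#x}")
        self.regs[offset] = value
        if offset == REG_CTRL and value & CTRL_START:
            self._run()

    def read_reg(self, offset: int) -> int:
        return self.regs[offset]

    def _run(self) -> None:
        src, dst, n = (self.regs[REG_SRC], self.regs[REG_DST],
                       self.regs[REG_LEN])
        if src + n > len(self.memory) or dst + n > len(self.memory):
            raise ValueError("transfer exceeds shared memory")
        self.regs[REG_STATUS] = 0
        self.memory[dst:dst + n] = self.memory[src:src + n]
        self.regs[REG_STATUS] = STATUS_DONE


def offload_copy(dev: ToyMmioAccelerator, src: int, dst: int, length: int) -> None:
    """Host-side driver: program the registers, start, then poll for done."""
    dev.write_reg(REG_SRC, src)
    dev.write_reg(REG_DST, dst)
    dev.write_reg(REG_LEN, length)
    dev.write_reg(REG_CTRL, CTRL_START)
    while not (dev.read_reg(REG_STATUS) & STATUS_DONE):
        pass  # a real driver would sleep or wait for an interrupt


if __name__ == "__main__":
    mem = bytearray(b"hello world" + bytes(16))
    offload_copy(ToyMmioAccelerator(mem), src=0, dst=16, length=5)
    print(bytes(mem[16:21]))
```

In contrast to the RoCC path, nothing here requires new instructions: the host only needs ordinary loads and stores to the device's address range, which is why this style relies on OS features such as mmap, interrupts, and an IOMMU.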

Slide 47: TileLink Examples
- RoCC accelerators: SHA3 (https:/)
- [...] accelerator (https:/)
- [...] accelerator (https:/)
- [...] vector accelerator (https:/)
- [...] network interface controller for FireSim (https:/)

Slides 48-51: Example: Using RoCC + TileLink with Gemmini
- How Gemmini, a DNN accelerator, uses RoCC and TileLink.
- Question: how does Gemmini read data from main memory into Gemmini's scratchpad?
1. The host CPU encounters an unknown RISC-V instruction (fields: unknown opcode, rs1, rs2).
2. The host CPU dispatches the unknown instruction to the RoCC accelerator, together with the contents of rs1 and rs2 (128 extra bits).
3. Gemmini decodes the instruction: it is a load instruction. The opcode decodes to "load data", and the two register values carry the main memory address and the scratchpad address.
4. Gemmini asks the CPU's page table walker (PTW) to translate the addresses in rs1 and rs2. The PTW is only available through the RoCC interface.
5. Gemmini sends TileLink requests to read data from main memory. Often, multiple TileLink requests must be sent because of TileLink's alignment and length limitations.

Slide 52: Managing Many-Accelerator SoC
- High performance variability due to contention.
- Sounds like datacenters? However, today's SoCs lack abstractions for managing accelerators.
- [Figure: autonomous-driving pipeline with stages Sensing, Perception, Prediction, Planning, and Control. Component labels: cameras, LiDARs, GPS, radars; steering, acceleration & braking; object detection, lane detection, traffic light detection, localization, traffic sign detection; current state, future states, trajectory; route planning, local path planning, collision avoidance. Hardware labels: CPU, GPU, Neural Engine, domain-specific accelerators.]
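
Step 5 of the Gemmini walkthrough (slide 51) notes that one load often becomes several TileLink requests because of alignment and length limits. A minimal sketch of that splitting follows, assuming each request must be a power-of-two number of bytes, aligned to its own size, and no larger than a maximum transfer size. The 64-byte default and the function name are assumptions for illustration; the real limits depend on the bus configuration, and this is not Gemmini's actual DMA logic.

```python
def split_aligned(addr: int, length: int, max_bytes: int = 64):
    """Split [addr, addr + length) into naturally aligned power-of-two chunks.

    Returns a list of (address, size) pairs, one per bus request.
    """
    if addr < 0 or length < 0:
        raise ValueError("addr and length must be non-negative")
    if max_bytes <= 0 or max_bytes & (max_bytes - 1):
        raise ValueError("max_bytes must be a power of two")
    chunks = []
    while length > 0:
        size = max_bytes
        # Shrink until the chunk is aligned to its own size and fits.
        while addr % size != 0 or size > length:
            size //= 2
        chunks.append((addr, size))
        addr += size
        length -= size
    return chunks


if __name__ == "__main__":
    # Hypothetical unaligned 24-byte read with a 16-byte maximum request.
    for chunk_addr, chunk_size in split_aligned(0x1004, 24, max_bytes=16):
        print(hex(chunk_addr), chunk_size)
```

An unaligned start or end forces smaller requests at the edges of the transfer, so the number of requests depends on the address as well as the length.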

Slide 53: Virtualization for Performance and Programmability
- [Figure: workloads W1 and W2 run on CPUs and share the physical accelerators P0 to P5. Each workload sees its own virtual accelerators (V0, V1, ...), and a map assigns virtual accelerators to physical ones.]

Slide 54: AuRORA: Virtualized and Disaggregated Accelerator Integration
- Virtualized and disaggregated accelerator interface
- Full-stack approach for many-accelerator integration
- Dynamic accelerator runtime for adaptive control
- https:/

Slide 55: AuRORA Microarchitecture
- AuRORA Client:
  - Attaches to CPUs via RoCC
  - Forwards accelerator instructions to acquired Managers
- AuRORA Manager:
  - Shadow registers track thread architectural state
  - Eliminates the need to interact with the supervisor-managed IOMMU
- https:/
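
The virtual-to-physical accelerator map shown on slide 53 can be sketched as a small allocation table. This is an illustration under stated assumptions: the first-come-first-served policy, the class name, and the method names are hypothetical and are not AuRORA's actual runtime or hardware mechanism.

```python
class AcceleratorPool:
    """Maps (workload, virtual accelerator id) pairs onto physical accelerators."""

    def __init__(self, num_physical: int):
        self.free = list(range(num_physical))  # available physical ids
        self.mapping = {}                      # (workload, vid) -> physical id

    def acquire(self, workload: str, virtual_id: int):
        """Bind a virtual accelerator to a free physical one.

        Returns the physical id, or None if every accelerator is busy.
        """
        key = (workload, virtual_id)
        if key in self.mapping:
            return self.mapping[key]
        if not self.free:
            return None
        physical = self.free.pop(0)
        self.mapping[key] = physical
        return physical

    def release(self, workload: str, virtual_id: int) -> None:
        """Unbind a virtual accelerator and return its physical one to the pool."""
        physical = self.mapping.pop((workload, virtual_id), None)
        if physical is not None:
            self.free.append(physical)


if __name__ == "__main__":
    pool = AcceleratorPool(num_physical=2)
    print(pool.acquire("W1", 0))  # W1's V0
    print(pool.acquire("W2", 0))  # W2's V0
    print(pool.acquire("W2", 1))  # no physical accelerator left
    pool.release("W1", 0)
    print(pool.acquire("W2", 1))  # reuses the accelerator W1 gave up
```

Because workloads name only virtual accelerators, the binding can change over time as other workloads acquire and release hardware, without changes to application code.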

Slide 57: From PPA to Quality-of-Experience
- Hardware evaluations have focused on the isolated performance, power, and area (PPA) of individual kernels.
- However, today's computing is an integrated component of the sensing-to-action pipeline, where it is critical to evaluate the end-to-end quality of experience of the entire system.

Slide 58: Example: ILLIXR: Full-System XR Testbed
- Open-source infrastructure for extended reality (XR): virtual, augmented, and mixed reality.
- Involves many concurrent and dependent applications that work collectively on an end-to-end task.
- Reference: Exploring Extended Reality with ILLIXR: A New Playground for Architecture Research, IISWC 2021

Slide 59: Example: MAVBench for Micro Aerial Vehicles
- Hardware-in-the-loop evaluation for unmanned aerial vehicles (UAVs).
- Closed-loop evaluation is needed to understand hardware and software bottlenecks in end-to-end tasks.
- Reference: MAVBench: Micro Aerial Vehicle Benchmarking, MICRO 2018

Slide 60: Domain-Specific SoCs for Robotics
- Challenging constraints
- Diverse architectures

Slides 61-62: Challenge 1: Closed-Loop Feedback
- [Figure labels: desired trajectory, delay.]

Slide 63: Challenge 2: Full-SoC Evaluation
- SoC-level evaluation
- Algorithms and accelerators are often profiled in isolation.
- Evaluation needs to capture co-location and contention on real systems.

Slide 64: Challenge 3: Pre-Silicon Design Space Exploration
- Existing hardware-in-the-loop evaluation focuses on post-silicon parameters, e.g., the number of active cores and the voltage/frequency of each core.
- Tools are lacking for pre-silicon design space evaluation, where the types of hardware and the SoC architecture can be changed.

Slide 65: HW-SW Co-Simulation with RoS
- https:/
- Artifact Award

Slide 66: RoS: Pre-Silicon Robotics SoC Evaluator

Slide 67: RoS in Action
- [Figure labels: AirSim, FPGA.]

Slide 68: Robotics Workloads

Slide 69: Experimental Setup
- Environments
- SoC configurations:
  - Out-of-order core + accelerator
  - In-order core + accelerator
  - Out-of-order core only

Slide 70: Impact of SoC on Flight Performance
- [Figure: results for three configurations: out-of-order core only, in-order core + accelerator, and out-of-order core + accelerator.]

Slide 71: Full-Stack Optimization for Domain-Specific Systems
- Accelerator design
- Accelerator integration
- Accelerator evaluation

Slide 72: Thank You!
- Contact email: ysshao@berkeley.edu
- Website: https://people.eecs.berkeley.edu/~ysshao/

Papers to See This Year
- Forum 1: Efficient Chiplets and Die-to-Die Communications
- Forum 2: Energy-Efficient AI-Computing Systems for Large-Language Models
- Paper 2.4: ATOMUS: A 5nm 32TFLOPS/128TOPS ML System-on-Chip for Latency Critical Applications
- Paper 2.5: A 28nm Physical-Based Ray-Tracing Rendering Processor for Photorealistic Augmented Reality with Inverse Rendering and Background Clustering for Mobile Devices
- Paper 11.3: Metis AIPU: A 12nm 15TOPS/W 209.6TOPS SoC for Cost- and Energy-Efficient Inference at the Edge
- Paper 20.3: A 23.9TOPS/W 0.8V, 130TOPS AI Accelerator with 16 Performance-Accelerable Pruning in 14nm Heterogeneous Embedded MPU for Real-Time Robot Applications
- Paper 20.6: LSPU: A Fully Integrated Real-Time LiDAR-SLAM SoC with Point-Neural-Network Segmentation and Multi-Level kNN Acceleration
- Paper 20.7: NeuGPU: A 18.5mJ/Iter Neural-Graphics Processing Unit for Instant-Modeling and Real-Time Rendering with Segmented-Hashing Architecture
- Paper 20.8: Space-Mate: A 303.5mW Real-Time Sparse Mixture-of-Experts-Based NeRF-SLAM Processor for Mobile Spatial Computing

Key References
- Understanding Sources of Inefficiency in General-Purpose Chips, ISCA 2010
- Energy Efficiency Boost in the AI-Infused POWER10 Processor, ISCA 2021
- Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology, ISCA 2021
- DIANA: An End-to-End Energy-Efficient DIgital and ANAlog Hybrid Neural Network SoC, ISSCC 2022
- Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration, DAC 2021
- High-Performance Deep-Learning Coprocessor Integrated into x86 SoC with Server-Class CPUs, ISCA 2020
- A 12-nm Autonomous Driving Processor With 60.4 TOPS, 13.8 TOPS/W CNN Executed by Task-Separated ASIL D Control, JSSC 2022
- A 20.5TOPS and 217.3GOPS/mm² Multicore SoC with DNN Accelerator and Image Signal Processor Complying with ISO26262 for Automotive Applications, ISSCC 2019
- A 25mm² SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET, ISSCC 2021
- AuRORA: Virtualized Accelerator Orchestration for Multi-Tenant Workloads, MICRO 2023
- Exploring Extended Reality with ILLIXR: A New Playground for Architecture Research, IISWC 2021
- MAVBench: Micro Aerial Vehicle Benchmarking, MICRO 2018
- RoS: A Hardware-Software Co-Simulation Infrastructure Enabling Pre-Silicon Full-Stack Robotics SoC Evaluation, ISCA 2023
