Forum 3 Its all About Data Building Blocks, Compute, Movement and Integration.pdf

編號:620801 PDF 555頁 42.16MB 下載積分:VIP專享
下載報告請您先登錄!

Forum 3 Its all About Data Building Blocks, Compute, Movement and Integration.pdf

1、ISSCC2025Forum3ItsallAboutData:BuildingBlocks,Compute,MovementandIntegration 2025 IEEE International Solid-State Circuits ConferenceForum F3:Its all about Data:Building blocks,Compute,Movement and IntegrationInternational Solid State Circuit ConferenceFebruary 20th,2025Start of presentations at 8:15

2、amISSCC 20251 of 9 2025 IEEE International Solid-State Circuits ConferenceHosted by Digital Circuits,Memory,Digital Architectures&Systems&Security Subcommittees Organizers:Yvain Thonnart,CEA-List,Grenoble,FranceViolante Moschiano,OpenChip Technologies,Barcelona,SpainJie Gu,Northwestern University,Ev

3、anston,ILSanu Mathew,Intel,Hillsboro,ORChampions:Fatih Hamzaoglu,Intel,Hillsboro,ORTanay Karnik,Intel,Hillsboro,OROrganizing CommitteeISSCC 20252 of 9 2025 IEEE International Solid-State Circuits ConferenceFacing the memory wallISSCC 20253 of 9 2025 IEEE International Solid-State Circuits Conference

4、Abundant data:AI model sizeISSCC 20254 of 9A.Gholami,IEEE Micro,44,3 2025 IEEE International Solid-State Circuits ConferenceLagging behind throughput:data latencyISSCC 20255 of 9J.McCalpin,Supercomputing 2016 2025 IEEE International Solid-State Circuits ConferenceEnergy cost of data movementISSCC 20

5、256 of 9M.Horowitz,ISSCC 20141.3nJDRAM Access70pJ 2025 IEEE International Solid-State Circuits ConferenceData&compute,from applications to architecturesHigh performance computing,AI/ML,GPUsData&storage,from architectures to circuitsHBM,dense storage,compute in memory,memory-centric architecturesNew

6、data-centric architectures,new challengesChipletization,high-bandwidth interfaces,securityForum outlineISSCC 20257 of 9 2025 IEEE International Solid-State Circuits ConferenceAgendaISSCC 20258 of 8StartTitleSpeakerAffiliation8:15IntroductionYvain ThonnartCEA-List8:25Dataflow:Is it all About Algorith

7、ms and Application Targets?Osman Sabri UnsalBSC9:15Dataflow Optimization and Data Sparsity Management for ML AcceleratorsMarian VerhelstKU Leuven10:05Break10:20Explicit Decoupled Data Orchestration:A Fundamental Approach to AccelerationMichael PellauerNVIDIA11:10Memory Solutions for AI Era:High-Band

8、width Memories and Dense-Data StorageKyomin SohnSamsung12:00Lunch13:20Computation-in-Memory Circuit Design for Computing/Storage-Intensive AI ApplicationsXin SiSoutheast University14:10AIs Promethean Moment:Silicon Scaling-Laws and On-Package Memory Are Not EnoughIgor ArsovskiGroq15:00Break 15:15Opt

9、imizing Communication Between Chiplets for Future System-in-PackagesKemal AygnIntel16:05Complex Systems,Complex Threats:Security Strategies for Heterogeneous System DesignsTodd M.AustinUniversity of Michigan16:55Closing remarksViolante MoschianoOpenChip Technologies 2025 IEEE International Solid-Sta

10、te Circuits Conference2 coffee breaks and one lunch breakElectronic copies for Forums are available for downloadPlease switch your mobile devices to muteRemember to fill out speaker evaluation using the ISSCC appNo panel session at the end of the ForumGeneral Information ISSCC 20259 of 9 2025 IEEE I

11、nternational Solid-State Circuits ConferenceISSCC 2025 ForumsDataflow:Is it all About Algorithms and Application Targets?NOOsman UnsalBarcelona Supercomputing CenterISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?1 of 44 2025 IEEE International Solid-State Circuits C

12、onferenceOutlineWhy Dataflow?Because it is the ultimate orderDataflow in microarchitectureDataflow in VLSIDataflow in programming models2 of 44ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?2025 IEEE International Solid-State Circuits ConferenceThere is ample instru

13、ction level parallelismProvided write after write and write after read dependencies are recastThrough renaming mechanismProviding physical registersTo eliminate false dependenciesResulting in out of order processorsInstruction commit is still in-orderHowever,instructions execute in dataflow(not prog

14、ram)order Subject to true dependencies(read after write)Dataflow in microarchitectureISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?3 of 44 2025 IEEE International Solid-State Circuits ConferenceDataflow in VLSI The input to output combinatorial logic is in overall

15、dataflow order Simulators to checkers operate in dataflow fashion Asyncronous systems without a clock depend on dataflow Wave-pipeliningTatapudi et al,Symposium on VLSI 2005 ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?4 of 44 2025 IEEE International Solid-State C

16、ircuits ConferenceTask-based dataflow Programing ModelsTask dependency clauses in OpenMP4.0 Flow Graph Intel TBB V4.0OmpSs5 of 44ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?2025 IEEE International Solid-State Circuits ConferenceOmpSs programming modelTask-based p

17、arallel programming modelProgrammer defines tasks by using pragmasTask is an executable unit,multiple tasks can run in parallelTask has in/out parameters that describe the data dependenciesISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?6 of 44 2025 IEEE Internationa

18、l Solid-State Circuits ConferenceOmpSs dataflow runtimeRuntime support to OmpSs applicationa dataflow(task dependency)grapha pool of threadsa ready task queueA thread picks up a ready task and executes it on an available resourceISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Applicatio

19、n Targets?7 of 44 2025 IEEE International Solid-State Circuits ConferenceOpenMP3.0 Tasking stylevoid Cholesky(float*A)int i,j,k;for(k=0;kNT;k+)spotrf(Ak*NT+k);for(i=k+1;iNT;i+)#pragma omp taskstrsm(Ak*NT+k,Ak*NT+i);#pragma omp taskwaitfor(i=k+1;iNT;i+)for(j=k+1;ji;j+)#pragma omp task sgemm(Ak*NT+i,A

20、k*NT+j,Aj*NT+i);#pragma omp task ssyrk(Ak*NT+i,Ai*NT+i);#pragma omp taskwaitvoid Cholesky(float*A)int i,j,k;for(k=0;kNT;k+)spotrf(Ak*NT+k);#pragma omp parallel forfor(i=k+1;iNT;i+)strsm(Ak*NT+k,Ak*NT+i);for(i=k+1;iNT;i+)#pragma omp parallel forfor(j=k+1;ji;j+)sgemm(Ak*NT+i,Ak*NT+j,Aj*NT+i);ssyrk(Ak*

21、NT+i,Ai*NT+i);barriersTSTSNBNBTSTSISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?8 of 44 2025 IEEE International Solid-State Circuits Conferencevoid Cholesky(float*A)int i,j,k;for(k=0;kNT;k+)spotrf(Ak*NT+k);for(i=k+1;iNT;i+)strsm(Ak*NT+k,Ak*NT+i);/update trailing su

22、bmatrixfor(i=k+1;iNT;i+)for(j=k+1;j CRC The asynchronous execution characteristics:Coding calculation could be overlapped with regular execution of other tasksCompared to hardware ECCs,memory overheads of CRC-based software mechanisms is orders of magnitude less:Awareness of the minimal set of data

23、that needs to be protectedWith hardware ECCs all the memory has to be protectedRuntime enables flexibility for the user in terms of what type protection to deployISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?16 of 44 2025 IEEE International Solid-State Circuits Con

24、ferenceCRC Scheme(CRC)ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?17 of 44 2025 IEEE International Solid-State Circuits ConferenceHardware Optimization Utilizing Intel CRC32 instruction belongs to the SSE4.2 family iSCSI polynomial(0 x11EDC6F41)Two versions imple

25、mented:Naive version:Going over byte by byte of task outputs Advanced CRC calculation:Divide the output in three and calculate in parallel and independently CRCs and then merge three CRCs.Significant reduction of CRC calculation overheadISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Ap

26、plication Targets?18 of 44 2025 IEEE International Solid-State Circuits ConferenceResults:Performance OverheadsBenchmarksCRC(Runtime)CRC(Hardware-Nai.)CRC(Hardware-Adv.)Sparse LU6.70%2.99%0.41%Perlin4.37%1.92%0.52%CG8.27%4.20%2.70%FFT37.20%17.39%8.09%Cholesky31.53%12.56%2.08%Knapsack1.21%0.97%0.85%P

27、ingpong0.28%0.12%0.03%Nbody1.96%1.27%0.18%Matmul2.03%1.35%0.69%Average10.39%4.75%1.73%ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?19 of 44 2025 IEEE International Solid-State Circuits ConferenceTask Redundancy(Fault Detection):ImplementationProgramstatements.prag

28、ma.statementsTask instanceInputs,inouts,outputsstatements.Regular version-Satisfy dependences and wait for resources availability-Run*Instance of task is run in parallel within the rest of the task instances ExecutionFD version-Satisfy dependences and wait for resources availability-Checkpoint input

29、s,inoutsusing the concurrent backup handler-Run*and Run parallel duplicated-run;-while(different results)Restore checkpoint using the handlerRerun one instanceTask instancestatements.Inputs,inouts,outputsCheckpoint inputs,inoutsTask instancestatements.Inputs,inouts,outputsCheckpoint inputs,inoutsExe

30、cutionExecutionbackupRestoreExecutionISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?20 of 44 2025 IEEE International Solid-State Circuits ConferenceResults:Task ReplicationMulti-Node Scalability0%1%2%3%4%5%6%7%8%9%Overheads of ReplicationOverhead is defined w.r.t.to

31、tal application execution timeISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?21 of 44 2025 IEEE International Solid-State Circuits ConferenceSelective Task ReplicationIncrease applications reliability and thus its chance for successful completion by selectively repl

32、icating tasks.Set a FIT threshold to be adhered during the application execution.Per application Failures in time(FIT)Heuristic Assuming we have only%x spare resources for replication,maximize reliability by selectively replicating%x of the tasks.Replication Threshold HeuristicISSCC 2025-Forum 3.1:D

33、ataflow:Is it all About Algorithms and Application Targets?22 of 44 2025 IEEE International Solid-State Circuits ConferenceSelective Task ReplicationReliability modelHeuristicDecision if a task is to be checkpointedEstimates the possible reliability improvement if a task is replicated.Accounts for t

34、he costs of replicationISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?23 of 44 2025 IEEE International Solid-State Circuits ConferenceTask reliability modelTakes into account various task&application characteristics#of inputs/outputsSize of input in MBTask dependenc

35、iesWhen input was generated as outputExpected execution timeOthersTask reliability in FIT(Faults in Time 1 Billion hours)Other task propertiesInputs,outputsDependenciesTask reliability modelSome tasks are more important with respect to reliability than others.ISSCC 2025-Forum 3.1:Dataflow:Is it all

36、About Algorithms and Application Targets?24 of 44 2025 IEEE International Solid-State Circuits ConferenceHeuristic for selective replicationDesign principleSystem may have spare idle resources(e.g.10%of the resources are idle)Selective replication to that uses the idle resources to improve reliabili

37、tyThe heuristic(Target_Rep)Inspects the queue of ready tasks for executionBy applying the reliability model estimate how much the overall reliability of the application will increase if a particular task is replicatedReplicate those that would increase reliability most within the envelop of availabl

38、e idle resourcesLightweight and online operation without requiring prior offline profiling and analysis.ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?25 of 44 2025 IEEE International Solid-State Circuits ConferenceResults:Selective task replicationDifference from t

39、he optimal solutionx-axis:The%of tasks replicatedy-axis:The difference in%between the optimal solution and the FIT rates that the model estimates for the given percentage of task replication.The optimal task selection would improve the FIT by that%0%5%10%15%20%25%30%10%20%30%40%50%60%70%80%90%Averag

40、e Difference from the Optimal SolutionISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?26 of 44 2025 IEEE International Solid-State Circuits ConferenceAtomic Dataflow Motivation Parallelizing complex programs:Large tasks and irregular parallelism diminishing results.C

41、an we express coordination between tasks explicitly and reduce conflicts caused by this synchronization?Dataflow?Expressive,deterministic and inherently parallel.No data sharing,no implicit state.Not a general solution.Not a good fit for certain kind of problems which are naturally solved using the

42、concept of state.ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?27 of 44 2025 IEEE International Solid-State Circuits ConferenceIntroduction Atomic Dataflow(ADF)parallel programming model.Combines dataflow and shared memory programming.Increases programming expressi

43、veness.Emphasis is on problems that exhibit irregular parallelism.Key Point:Dataflow scheduling between atomic tasksStraightforward patterns Explicit data dependencies.Programmers explicitly define input data dependencies.Obscure patterns Atomic execution.Parallelization and the atomic execution are

44、 implicit.ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?28 of 44 2025 IEEE International Solid-State Circuits ConferenceMotivating exampleMax function with the global maximum:a)dataflow implementation,b)shared memory implementation and c)ADF implementation.MAXxyGlo

45、balmaxmax(x,y)wait(x);wait(y);z=max(x,y);atomic if(zg_max)g_max=z;MAXxymax(x,y)Globalmaxa)b)c)ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?29 of 44 2025 IEEE International Solid-State Circuits ConferenceReal World Example Game Engine Highly irregular application d

46、ifficult to parallelize.The server processes clients requests.Updates the game world.Game world represented with the areanode tree.State of the world encoded in the game world buffer.The buffer is sent to clients to keep the game consistent.Areanode treeregion lockinggame worldISSCC 2025-Forum 3.1:D

47、ataflow:Is it all About Algorithms and Application Targets?30 of 44 2025 IEEE International Solid-State Circuits ConferenceGame engine parallelizationPthreadsRegion based locking of the areanode tree.Coarse grained approachNegative effects on load balancing.Transactional MemoryMore fine grained sync

48、hronization.Sub-optimal performance due to the TM overhead.Gajinov,V.,Zyulkyarov,F.,Unsal,O.S.,Cristal,A.,Ayguade,E.,Harris,T.,and Valero,M.QuakeTM:parallelizing a complex sequential application using transactional memory.In Proceedings of the 23rd international conference on Supercomputing(New York

49、,NY,USA 2009),ACM,126-135.ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?31 of 44 2025 IEEE International Solid-State Circuits ConferenceGame Engine ADF ApproachWe partition the game world in a number of game areas.The areas correspond to the leaves of the areanode

50、treeWe define a dedicated dataflow task for each game area.We transform the areanode tree synchronization into a dataflow execution.We restrict the processing to a single area at the time.If the action crosses the area boundaries,we generate a new token for the neighboring task.The tasks are atomic.

51、The game world state is kept consistent.source areadestination areaobjects trajectorygenerate tokentrigger taskcontinuemovetask partitioned game worldISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?32 of 44 2025 IEEE International Solid-State Circuits ConferenceAtomi

52、c Dataflow Model Basic unit of work the ADF task.Task executes automatically implicit transactionThe task blocks if input dependencies are not satisfied.Data dependencies defined using the trigger set extension.For each data from the trigger set implicit buffer.Syntax:#pragma adf_task trigger_set()u

53、ntil(exit_condition)instances()relaxed ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?33 of 44 2025 IEEE International Solid-State Circuits ConferenceThe ADF APIprogrammer APIvoid adf_init(int num_threads);Initializes the ADF runtime.void adf_start();Unblocks worker

54、 threads.void adf_taskwait();Called when a thread needs to wait for the end of a dataflow execution.void adf_terminate();Destroys the ADF runtime environment.internal APIvoid adf_create_task(int num_instances,int num_tokens,void*tokens,std:function fn);Creates a dataflow task.void adf_pass_token(voi

55、d*addr,void*token,size_t token_size);Called when a task needs to pass newly generated output tokens.void adf_task_stop();Called by a task when its until condition is set.ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?34 of 44 2025 IEEE International Solid-State Circ

56、uits ConferenceThe ADF Execution ModelThe ADF program consists of a number of task regions that are executed in a program order.The main program thread:initializes the ADF runtime system,initiates the execution of all task regions in a program order,stops the ADF runtime and terminates the program.I

57、SSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?35 of 44 2025 IEEE International Solid-State Circuits ConferenceTask Region ExecutionWhen the main thread encounters a new task region,it:1.creates all ADF tasks that belong to that task region,2.generates initial tokens

58、 which are necessary to start the new dataflow execution3.starts the dataflow execution by unblocking the worker threads4.waits for the end of the dataflow execution of the current task region after which it 5.destroys all constituent tasks,and 6.moves to the next task region.ISSCC 2025-Forum 3.1:Da

59、taflow:Is it all About Algorithms and Application Targets?36 of 44 2025 IEEE International Solid-State Circuits ConferenceTask ExecutionADF task macro dataflow actor Processes one set of input data,Produces output data,andWaits for the new set of input dataStops when the exit condition is set.Consum

60、eStart_TxAbortCommitProduce TokensEnd_TaskYesYesNoNoISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?37 of 44 2025 IEEE International Solid-State Circuits ConferenceMultiple Producers,Multiple Consumers ExampleP1 produces a token for C1P2 produces a token for C2P1 sto

61、res a token x1 into the bufferP2 stores a token x2into the bufferC1 consumes token x1C2 consumes token x2P1P2C1C2xTOKEN BUFFERP1P2C1C2xx1TOKEN BUFFERP1P2C1C2xx1x2TOKEN BUFFERP1P2C1C2xx2TOKEN BUFFERP1P2C1C2xx1x2TOKEN BUFFERP1P2C1C2xTOKEN BUFFERISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms

62、and Application Targets?38 of 44 2025 IEEE International Solid-State Circuits ConferenceEvaluationADF implemented in C/C+.g+compiler version 4.5 TinySTM 11 for STM support.The machine is PowerEdge 6850 four dual-core 64-bit Intel Xeon processors at 3.2 GHz.Each processor unit has 16MB L3 cache memor

63、y.The machine is running SUSE LINUX 10.1 We compare the implementation with implicit task transaction(default)and the implementation with manually placed transactions(relaxed).Test applications:Bounded bufferGame world updateISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Ta

64、rgets?39 of 44 2025 IEEE International Solid-State Circuits ConferenceResults Game(1)ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?40 of 44Task throughput of the Game application with atomic ADF tasks(implicit synchronization).Task throughput of the Game applicatio

65、n with non-atomic ADF tasks(manual synchronization).2025 IEEE International Solid-State Circuits ConferenceResults Game(2)Effects of increasing the number of objects per game domain.Effects of increasing the number of instances per task(implicit synchronization).Effects of increasing the number of i

66、nstances per task(manual synchronization).ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?41 of 44 2025 IEEE International Solid-State Circuits ConferenceResults Bounded BufferTask throughput of the Game application with atomic ADF tasks(implicit synchronization).Tas

67、k throughput of the Game application with non-atomic ADF tasks(manual synchronization).ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?42 of 44 2025 IEEE International Solid-State Circuits ConferenceAtomic Dataflow ModelProposal for the Atomic Dataflow model:Combines

68、 shared memory and dataflow parallel programmingEnhances the programmability of irregular applications by exposing the expressiveness of the dataflow directly to a programmer.Improves the performance of transactional memory execution by eliminating unnecessary conflicts.The model is a good fit for P

69、roducer-Consumer type of problems and especially for problems that exhibit irregular parallelism.ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?43 of 44 2025 IEEE International Solid-State Circuits ConferenceThank youcontact:osman.unsalbsc.esISSCC 2025-Forum 3.1:Dat

70、aflow:Is it all About Algorithms and Application Targets?44 of 44 2025 IEEE International Solid-State Circuits ConferenceDataflow Optimization andManagement for ML AcceleratorsMarian Verhelst,KU Leuven MICAS&imecmarian.verhelstkuleuven.beFebruary 2025Marian VerhelstISSCC 2025 Forum F3.21 of 60 2025

71、IEEE International Solid-State Circuits ConferenceML accelerator performance:a tale of 2 rooflines.From CPU.to NPUShifting right on the roofline,a matter of data reuseExploiting spatial and temporal reuseImpact of array size,precision and sparsityApproaching the roofline,a matter of utilizationThe u

72、tilization problemImpact of scheduling and data layoutDecoupled AI accelerators:HW&SW decoupling of spatial dataflow temporal schedule memory layoutOutlineMarian VerhelstISSCC 2025 Forum F3.23 of 60 2025 IEEE International Solid-State Circuits ConferenceML accelerator performance:a tale of 2 rooflin

73、es.From CPU.to NPUShifting right on the roofline,a matter of data reuseExploiting spatial and temporal reuseImpact of array size,precision and sparsityApproaching the roofline,a matter of utilizationThe utilization problemImpact of scheduling and data layoutDecoupled AI accelerators:HW&SW decoupling

74、 of spatial dataflow temporal schedule memory layoutOutlineMarian VerhelstISSCC 2025 Forum F3.24 of 60 2025 IEEE International Solid-State Circuits ConferenceProcessor performance in function of“arithmetic intensity(AI)”AI=operations/bytes of memory accessAssumes simultaneous data fetching and compu

75、teThe performance roofline for CPUMarian Verhelstcompute boundmemory bounddrawing for Nop=8,Bmem=8bytes/cycleNopISSCC 2025 Forum F3.25 of 60 2025 IEEE International Solid-State Circuits ConferenceProcessor performance in function of“arithmetic intensity(AI)”AI=operations/bytes of memory accessAssume

76、s simultaneous data fetching and computeThe performance roofline for CPUMarian Verhelstcompute boundmemory boundAttainable performance OPs/cycle=MIN(Mem BW bytes/cycle*AI OPs/byte,NopOPs/cycle)drawing for Nop=8,Bdram=8bytes/cycleNopISSCC 2025 Forum F3.26 of 60 2025 IEEE International Solid-State Cir

77、cuits ConferenceProcessor performance in function of“arithmetic intensity(AI)”AI=operations/bytes of memory accessAssumes simultaneous data fetching and computeThe performance roofline for CPUMarian Verhelstcompute boundmemory boundAttainable performance OPs/cycle=MIN(Mem BW bytes/cycle*AI OPs/byte,

78、NopOPs/cycle)drawing for Nop=8,Bdram=8bytes/cycleNopISSCC 2025 Forum F3.26 of 60 2025 IEEE International Solid-State Circuits ConferenceProcessor performance in function of“arithmetic intensity(AI)”AI=operations/bytes of memory accessAssumes simultaneous data fetching and computeThe performance roof

79、line for CPUMarian Verhelstcompute boundmemory boundAttainable performance OPs/cycle=MIN(Mem BW bytes/cycle*AI OPs/byte,NopOPs/cycle)drawing for Nop=8,Bdram=8bytes/cycleNopISSCC 2025 Forum F3.26 of 60 2025 IEEE International Solid-State Circuits ConferenceProcessor performance in function of“arithme

80、tic intensity(AI)”AI=operations/bytes of memory accessAssumes simultaneous data fetching and computeThe performance roofline for CPUMarian Verhelstcompute boundmemory boundAttainable performance OPs/cycle=MIN(Mem BW bytes/cycle*AI OPs/byte,NopOPs/cycle)memory bound compute boundAI=Nop/Bdrawing for N

81、op=8,Bdram=8bytes/cycleNopISSCC 2025 Forum F3.26 of 60 2025 IEEE International Solid-State Circuits ConferenceExample:M-Add:multiply addData fetch:input1,input2,partial sumData store:sum 4 words*4bytes/word=16bytesCompute:2 operations AI=2 OPs/16bytes=0.125 OPs/byteThe performance roofline for CPUMa

82、rian Verhelstcompute boundmemory boundX+NopISSCC 2025 Forum F3.27 of 60 2025 IEEE International Solid-State Circuits ConferenceProcessor energy efficiency in function of“arithmetic intensity(AI)”Energy is SUM of memory and operational energyThe energy roofline for CPUMarian Verhelstcompute boundmemo

83、ry bound1/Eopdrawing for Ecomp=0.5pJ/op,Edram=50pJ/byteISSCC 2025 Forum F3.28 of 60 2025 IEEE International Solid-State Circuits ConferenceProcessor energy efficiency in function of“arithmetic intensity(AI)”Energy is SUM of memory and operational energyThe energy roofline for CPUMarian Verhelstcompu

84、te boundmemory boundAttainable efficiency OPs/Joule=1(EcompJoules/OP+EmemJoule/byte/AI OPs/byte)1/Eopdrawing for Ecomp=0.5pJ/op,Edram=50pJ/byteISSCC 2025 Forum F3.28 of 60 2025 IEEE International Solid-State Circuits ConferenceProcessor energy efficiency in function of“arithmetic intensity(AI)”Energ

85、y is SUM of memory and operational energyThe energy roofline for CPUMarian Verhelstcompute boundmemory boundAttainable efficiency OPs/Joule=1(EcompJoules/OP+EmemJoule/byte/AI OPs/byte)1/Eopmemcompenergydrawing for Ecomp=0.5pJ/op,Edram=50pJ/byteISSCC 2025 Forum F3.28 of 60 2025 IEEE International Sol

86、id-State Circuits ConferenceProcessor energy efficiency in function of“arithmetic intensity(AI)”Energy is SUM of memory and operational energyThe energy roofline for CPUMarian Verhelstcompute boundmemory boundAttainable efficiency OPs/Joule=1(EcompJoules/OP+EmemJoule/byte/AI OPs/byte)1/Eopmemcompene

87、rgymemcompdrawing for Ecomp=0.5pJ/op,Edram=50pJ/byteISSCC 2025 Forum F3.28 of 60 2025 IEEE International Solid-State Circuits ConferenceProcessor energy efficiency in function of“arithmetic intensity(AI)”Energy is SUM of memory and operational energyThe energy roofline for CPUMarian Verhelstcompute

88、boundmemory boundAttainable efficiency OPs/Joule=1(EcompJoules/OP+EmemJoule/byte/AI OPs/byte)memory bound compute bound1/EopmemcompenergymemcompmemcompAI=Emem/Ecompdrawing for Ecomp=0.5pJ/op,Edram=50pJ/byteISSCC 2025 Forum F3.28 of 60 2025 IEEE International Solid-State Circuits ConferenceThe roofli

89、neS for CPUMarian Verhelstcompute boundmemory boundAI=Nop/BNopcompute boundmemory bound1/EopAI=Emem/EcompISSCC 2025 Forum F3.29 of 60 2025 IEEE International Solid-State Circuits ConferenceOptimal place in roofline(s)?Can be memory bound in one and compute bound in other one.Is this the same for NPU

90、s?The rooflineS for CPUMarian Verhelstcompute boundmemory boundAI=Nop/BNopcompute boundmemory bound1/EopAI=Emem/EcompISSCC 2025 Forum F3.29 of 60 2025 IEEE International Solid-State Circuits ConferenceWeightMain DRAMStorageWeights WeightOff-chipDRAM storageOn-chip SRAMDatapathWeights Layer inputsLay

91、er outputsProcessing element(PE)regX+regregSwitching to NPUsMarian Verhelstfor(m=0 to M-1);for each rowfor(n=0 to N-1);for each output columnfor(k=0 to K-1);for each input channelomn+=imk*wkn;ISSCC 2025 Forum F3.2MKMNKN10 of 60 2025 IEEE International Solid-State Circuits ConferenceWeightMain DRAMSt

92、orageWeights WeightOff-chipDRAM storageOn-chip SRAMDatapathWeights Layer inputsLayer outputsProcessing element(PE)regX+regregSwitching to NPUsMarian Verhelstfor(m=0 to M-1);for each rowfor(n=0 to N-1);for each output columnfor(k=0 to K-1);for each input channelomn+=imk*wkn;Compared to CPU:Many more

93、PEs(e.g.100 x to 10,000 x)Lower precision data elements(e.g.4x less precision)ISSCC 2025 Forum F3.2MKMNKN10 of 60 2025 IEEE International Solid-State Circuits ConferenceThe rooflines for NPUMarian Verhelstcompute boundmemory boundAI=Nop/BNopcompute boundmemory bound1/EopAI=Emem/EcompISSCC 2025 Forum

94、 F3.211 of 60 2025 IEEE International Solid-State Circuits ConferenceMore operators:performance more memory boundedThe rooflines for NPUMarian Verhelstcompute boundmemory boundAI=Nop/BNopcompute boundmemory bound1/EopAI=Emem/EcompAI*100!ISSCC 2025 Forum F3.211 of 60 2025 IEEE International Solid-Sta

95、te Circuits ConferenceMore operators:performance more memory boundedLess improvement in Eop:cut off points closer together.The rooflines for NPUMarian Verhelstcompute boundmemory boundAI=Nop/BNopcompute boundmemory bound1/EopAI=Emem/EcompAI*100!ISSCC 2025 Forum F3.211 of 60 2025 IEEE International S

96、olid-State Circuits ConferenceMore operators:performance more memory boundedLess improvement in Eop:cut off points closer together.The rooflines for NPUMarian Verhelstcompute boundmemory boundAI=Nop/BNopcompute boundmemory bound1/EopAI=Emem/EcompAI*100!AI*4ISSCC 2025 Forum F3.211 of 60 2025 IEEE Int

97、ernational Solid-State Circuits ConferenceMore operators:performance more memory boundedLess improvement in Eop:cut off points closer together.Performance is terrible at AI1.The rooflines for NPUMarian Verhelstcompute boundmemory boundAI=Nop/BNopcompute boundmemory bound1/EopAI=Emem/EcompAI*100!AI*4

98、ISSCC 2025 Forum F3.211 of 60 2025 IEEE International Solid-State Circuits ConferenceMore operators:performance more memory boundedLess improvement in Eop:cut off points closer together.Performance is terrible at AI1.Increase Bmem$Shift right,but needs AI 100 to 1000?The rooflines for NPUMarian Verh

99、elstcompute boundmemory boundAI=Nop/BNopcompute boundmemory bound1/EopAI=Emem/EcompAI*100!AI*4ISSCC 2025 Forum F3.211 of 60 2025 IEEE International Solid-State Circuits ConferenceML accelerator performance:a tale of 2 rooflines.From CPU.to NPUShifting right on the roofline,a matter of data reuseExpl

100、oiting spatial and temporal reuseImpact of array size,precision and sparsityApproaching the Roofline,a matter of utilizationThe utilization problemImpact of scheduling and data layoutDecoupled AI accelerators:Decouple&orthogonalize spatial dataflow temporal schedule memory layoutOutlineMarian Verhel

101、stISSCC 2025 Forum F3.212 of 60 2025 IEEE International Solid-State Circuits ConferenceWeightMain DRAMStorageWeights WeightOff-chipDRAM storageOn-chip SRAMDatapathWeights Layer inputsLayer outputsProcessing element(PE)regX+regregAI within an NPU:MAC levelMarian Verhelstfor(m=0 to M-1);for each rowfo

102、r(n=0 to N-1);for each output columnfor(k=0 to K-1);for each input channelomn+=imk*wkn;AI of 8bit MAC:4 words of 1 byte;2 ops AIreg=2/4=0.5 OPs/byteISSCC 2025 Forum F3.213 of 60 2025 IEEE International Solid-State Circuits ConferenceWeightMain DRAMStorageWeights WeightOff-chipDRAM storageOn-chip SRA

103、MDatapathWeights Layer inputsLayer outputsProcessing element(PE)regX+regregAI within an NPU:spatial unrollingMarian VerhelstAI of 8bit MAC:4 words of 1 byte;2 ops AIreg=2/4=0.5 OPs/bytespatial unrollingfor(m=0 to M-1);for each rowfor(n=0 to N-1);for each output columnfor(k=0 to K-1);for each input c

104、hannelomn+=imk*wkn;for(k2=0 to K/4-1);for(m2=0 to M/4-1);for(n=0 to N-1);parfor(m1=0 to 3);parfor(k1=0 to 3);omn+=imk*wkn;ISSCC 2025 Forum F3.214 of 60 2025 IEEE International Solid-State Circuits ConferenceDifferent types of spatial reuse can(should!)be combined to get decent AIIntermezzo:Types of

105、spatial unrolling for spatial data reuseMarian Verhelstxxxxweightinputs+outputsfor(k=0 to K-1);for(m2=0 to M/4-1);for(n=0 to N-1);parfor(m1=0 to 3);omn+=imk*wkn;Weight reuseInput reuseOutput reuseISSCC 2025 Forum F3.215 of 60 2025 IEEE International Solid-State Circuits ConferenceDifferent types of

106、spatial reuse can(should!)be combined to get decent AIIntermezzo:Types of spatial unrolling for spatial data reuseMarian Verhelstxxxxweightsinput+outputsxxxxweightinputs+outputsfor(k=0 to K-1);for(m2=0 to M/4-1);for(n=0 to N-1);parfor(m1=0 to 3);omn+=imk*wkn;Weight reuseInput reuseOutput reusefor(k=

107、0 to K-1);for(m=0 to M-1);for(n2=0 to N/4-1);parfor(n1=0 to 3);omn+=imk*wkn;ISSCC 2025 Forum F3.215 of 60 2025 IEEE International Solid-State Circuits ConferenceDifferent types of spatial reuse can(should!)be combined to get decent AIIntermezzo:Types of spatial unrolling for spatial data reuseMarian

108、 Verhelstinputsxxxx+weightsxxxxweightsinput+outputsxxxxweightinputs+outputsfor(k=0 to K-1);for(m2=0 to M/4-1);for(n=0 to N-1);parfor(m1=0 to 3);omn+=imk*wkn;Weight reuseInput reuseOutput reusefor(k=0 to K-1);for(m=0 to M-1);for(n2=0 to N/4-1);parfor(n1=0 to 3);omn+=imk*wkn;for(k2=0 to K/4-1);for(m=0

109、 to M-1);for(n=0 to N-1);parfor(k1=0 to 3);omn+=imk*wkn;ISSCC 2025 Forum F3.215 of 60 2025 IEEE International Solid-State Circuits ConferenceSpatial unrolling impact on AI:2DMarian Verhelst2D MAC array with weight&input reuseArithmetic intensity?for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=i

110、mk *wkn;WI.32 x 32 MACs .SpatialISSCC 2025 Forum F3.2E.g.Tesla NPU,Google TPU,.16 of 60 2025 IEEE International Solid-State Circuits ConferenceSpatial unrolling impact on AI:2DMarian Verhelst2D MAC array with weight&input reuseArithmetic intensity?Read I=32 word/ccRead W=32 word/ccRead O-1=32*32 wor

111、d/ccWrite O=32*32 word/ccAI(spatial)=2048 ops ./(2*32+2*1024)words=1for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=imk *wkn;WI.32 x 32 MACs .SpatialISSCC 2025 Forum F3.2E.g.Tesla NPU,Google TPU,.16 of 60 2025 IEEE International Solid-State Circuits ConferenceSpatial unrolling impact on AI:3DMa

112、rian Verhelst3D MAC array with weight&input&output reuseArithmetic intensity?for(k2=0 to.)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)omn+=imk *wkn;WI.8 x 8 x 16 MACs .SpatialISSCC 2025 Forum F3.2E.g.Tensor cores,DaVinci core,.17 of 60 2025 IEEE International Solid-State Circuits ConferenceS

113、patial unrolling impact on AI:3DMarian Verhelst3D MAC array with weight&input&output reuseArithmetic intensity?Read I=8*16 word/ccRead W=8*16 word/ccRead O-1=8*8 word/ccWrite O=8*8 word/ccAI(spatial)=2048 ops ./(2*128+2*64)words=5.4for(k2=0 to.)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)omn

114、+=imk *wkn;WI.8 x 8 x 16 MACs .3D better!But still not that good.SpatialISSCC 2025 Forum F3.2E.g.Tensor cores,DaVinci core,.17 of 60 2025 IEEE International Solid-State Circuits ConferenceWeightMain DRAMStorageWeights WeightOff-chipDRAM storageOn-chip SRAMDatapathWeights Layer inputsLayer outputsPro

115、cessing element(PE)regX+regregAI within an NPU:temporal unrollingMarian VerhelstAI of 8bit MAC:4 words of 1 byte;2 ops AIreg=2/4=0.5 OPs/byteAI of MAC array:AIarray 5 OPs/bytespatial unrollingfor(m=0 to M-1);for(n=0 to N-1);for(k=0 to K-1);o+=i*w;for(k2=0 to K/4-1);for(m2=0 to M/4-1);for(n=0 to N-1)

116、;parfor(m1=0 to 3);parfor(k1=0 to 3);omn+=imk*wkn;ISSCC 2025 Forum F3.218 of 60 2025 IEEE International Solid-State Circuits ConferenceWeightMain DRAMStorageWeights WeightOff-chipDRAM storageOn-chip SRAMDatapathWeights Layer inputsLayer outputsProcessing element(PE)regX+regregAI within an NPU:tempor

117、al unrollingMarian VerhelstAI of 8bit MAC:4 words of 1 byte;2 ops AIreg=2/4=0.5 OPs/byteAI of MAC array:AIarray 5 OPs/bytespatial unrollingfor(m=0 to M-1);for(n=0 to N-1);for(k=0 to K-1);o+=i*w;for(k2=0 to K/4-1);for(m2=0 to M/4-1);for(n=0 to N-1);parfor(m1=0 to 3);parfor(k1=0 to 3);omn+=imk*wkn;tem

118、poral unrollingfor(k2=0 to K/4-1);for(m2=0 to M/4-1);for(n=0 to N-1);parfor(m1=0 to 3);parfor(k1=0 to 3);omn+=imk*wkn;ISSCC 2025 Forum F3.218 of 60 2025 IEEE International Solid-State Circuits ConferenceDifferent types of temporal reuse can NOT be combined(at 1 memory level)Intermezzo:Types of tempo

119、ral unrolling for temporal data reuseMarian VerhelstWeight stationary Input stationary Output stationary(with output reuse)weightsinputxxxx+for(k2=0 to K/4-1);for(n=0 to N-1);for(m=0 to M-1);parfor(k1=0 to 3);omn+=imk*wkn;ISSCC 2025 Forum F3.219 of 60 2025 IEEE International Solid-State Circuits Con

120、ferenceDifferent types of temporal reuse can NOT be combined(at 1 memory level)Intermezzo:Types of temporal unrolling for temporal data reuseMarian VerhelstWeight stationary Input stationary Output stationary(with output reuse)weightsxxxx+weightsinputxxxx+for(k2=0 to K/4-1);for(m=0 to M-1);for(n=0 t

121、o N-1);parfor(k1=0 to 3);omn+=imk*wkn;for(k2=0 to K/4-1);for(n=0 to N-1);for(m=0 to M-1);parfor(k1=0 to 3);omn+=imk*wkn;ISSCC 2025 Forum F3.219 of 60 2025 IEEE International Solid-State Circuits ConferenceDifferent types of temporal reuse can NOT be combined(at 1 memory level)Intermezzo:Types of tem

122、poral unrolling for temporal data reuseMarian VerhelstWeight stationary Input stationary Output stationary(with output reuse)for(n=0 to N-1);for(m=0 to M-1);for(k2=0 to K/4-1);parfor(k1=0 to 3);omn+=imk*wkn;inputsxxxx+weightweightsxxxx+weightsinputxxxx+for(k2=0 to K/4-1);for(m=0 to M-1);for(n=0 to N

123、-1);parfor(k1=0 to 3);omn+=imk*wkn;for(k2=0 to K/4-1);for(n=0 to N-1);for(m=0 to M-1);parfor(k1=0 to 3);omn+=imk*wkn;ISSCC 2025 Forum F3.219 of 60 2025 IEEE International Solid-State Circuits ConferenceTemporal unrolling impact on AI:2DMarian Verhelst2D MAC array with weight&input reuseArithmetic in

124、tensity?Read I=32 word/ccRead W=32 word/ccRead O-1=32*32 word/ccWrite O=32*32 word/ccAI(spatial)=2048 ops ./(2*32+2*1024)words=1for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=imk *wkn;WI.32 x 32 MACs .SpatialISSCC 2025 Forum F3.220 of 60 2025 IEEE International Solid-State Circuits ConferenceT

125、emporal unrolling impact on AI:2DMarian Verhelst2D MAC array with weight&input reuseArithmetic intensity?Read I=32 word/ccRead W=32 word/ccRead O-1=32*32 word/ccWrite O=32*32 word/ccAI(spatial)=2048 ops ./(2*32+2*1024)words=1for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=imk *wkn;WI.32 x 32 MA

126、Cs .SpatialTemporalISSCC 2025 Forum F3.220 of 60 2025 IEEE International Solid-State Circuits ConferenceTemporal unrolling impact on AI:2DMarian Verhelst2D MAC array with weight&input reuseArithmetic intensity?Read I=32 word/ccRead W=32 word/ccRead O-1=32*32 word/ccWrite O=32*32 word/ccAI(spatial)=2

127、048 ops ./(2*32+2*1024)words=1AI(temporal)=2048 ops ./(2*32)words=32!for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=imk *wkn;WI.32 x 32 MACs .SpatialTemporalISSCC 2025 Forum F3.220 of 60 2025 IEEE International Solid-State Circuits ConferenceTemporal unrolling impact on AI:3DMarian Verhelst3D

128、MAC array with weight&input&output reuseArithmetic intensity?Read I=8*16 word/ccRead W=8*16 word/ccRead O-1=8*8 word/ccWrite O=8*8 word/ccAI(spatial)=2048 ops ./(2*128+2*64)words=5.4for(k2=0 to.)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)omn+=imk *wkn;WI.8 x 8 x 16 MACs .2D now better than

129、3D!SpatialISSCC 2025 Forum F3.221 of 60 2025 IEEE International Solid-State Circuits ConferenceTemporal unrolling impact on AI:3DMarian Verhelst3D MAC array with weight&input&output reuseArithmetic intensity?Read I=8*16 word/ccRead W=8*16 word/ccRead O-1=8*8 word/ccWrite O=8*8 word/ccAI(spatial)=204

130、8 ops ./(2*128+2*64)words=5.4for(k2=0 to.)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)omn+=imk *wkn;WI.8 x 8 x 16 MACs .2D now better than 3D!SpatialTemporalISSCC 2025 Forum F3.221 of 60 2025 IEEE International Solid-State Circuits ConferenceTemporal unrolling impact on AI:3DMarian Verhelst3

131、D MAC array with weight&input&output reuseArithmetic intensity?Read I=8*16 word/ccRead W=8*16 word/ccRead O-1=8*8 word/ccWrite O=8*8 word/ccAI(spatial)=2048 ops ./(2*128+2*64)words=5.4AI(temporal)=2048 ops ./(2*128)words=8for(k2=0 to.)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)omn+=imk *wkn

132、;WI.8 x 8 x 16 MACs .2D now better than 3D!SpatialTemporalISSCC 2025 Forum F3.221 of 60 2025 IEEE International Solid-State Circuits ConferenceWeightMain DRAMStorageWeights WeightOff-chipDRAM storageOn-chip SRAMDatapathWeights Layer inputsLayer outputsProcessing element(PE)regX+regregAI within an NP

133、U:multi-level stationarityMarian VerhelstAI of 8bit MAC:4 words of 1 byte;2 ops AIreg=2/4=0.5 OPs/byteAImem 10 OPs/bytesp&tempunrollingfor(m=0 to M-1);for(n=0 to N-1);for(k=0 to K-1);o+=i*w;for(k2=0 to K/4-1);for(m2=0 to M/4-1);for(n=0 to N-1);parfor(m1=0 to 3);parfor(k1=0 to 3);omn+=imk*wkn;ISSCC 2

134、025 Forum F3.222 of 60 2025 IEEE International Solid-State Circuits ConferenceWeightMain DRAMStorageWeights WeightOff-chipDRAM storageOn-chip SRAMDatapathWeights Layer inputsLayer outputsProcessing element(PE)regX+regregAI within an NPU:multi-level stationarityMarian VerhelstAI of 8bit MAC:4 words o

135、f 1 byte;2 ops AIreg=2/4=0.5 OPs/byteAImem 10 OPs/bytesp&tempunrollingfor(m=0 to M-1);for(n=0 to N-1);for(k=0 to K-1);o+=i*w;for(k2=0 to K/4-1);for(m2=0 to M/4-1);for(n=0 to N-1);parfor(m1=0 to 3);parfor(k1=0 to 3);omn+=imk*wkn;temporal unrollingfor(k2=0 to K/4-1);for(m2=0 to M/4-1);for(n=0 to N-1);

136、parfor(m1=0 to 3);parfor(k1=0 to 3);omn+=imk*wkn;AImem 50 OPs/byteDRAMSRAMRegistersEach memory level(and operand)can individually exploit temporal reuse also DRAM and regs!ISSCC 2025 Forum F3.222 of 60 2025 IEEE International Solid-State Circuits ConferenceAIsram/dramshifts to the right:higher achie

137、vable performanceyet,dependent on spatial/temporal unrolling choicesThe performance roofline for NPUMarian VerhelstNopNopcompute boundmemory boundAIdramISSCC 2025 Forum F3.2AIsramAIrf23 of 60 2025 IEEE International Solid-State Circuits ConferenceThe performance roofline for NPUMarian VerhelstNopNop

138、compute boundmemory boundAImem1/Eopcompute dominatedmemory dominatedAImemISSCC 2025 Forum F3.224 of 60 2025 IEEE International Solid-State Circuits ConferenceImpact of quantization?Impact of sparsity?The performance roofline for NPUMarian VerhelstNopNopcompute boundmemory boundAImem1/Eopcompute domi

139、natedmemory dominatedAImemISSCC 2025 Forum F3.224 of 60 2025 IEEE International Solid-State Circuits ConferenceReduced precision=Lower EopThe impact of quantizationMarian Verhelstx+IWO-1O+O-1OxxxxIW32bit8bitISSCC 2025 Forum F3.225 of 60 2025 IEEE International Solid-State Circuits ConferenceReduced

140、precision=Lower EopLarger NopMore spatial unrollingThe impact of quantizationMarian Verhelstx+IWO-1O+O-1OxxxxIW32bit8bitISSCC 2025 Forum F3.225 of 60 2025 IEEE International Solid-State Circuits Conference.8 x 8 x 16 64 MACs .Reduced precision=Lower EopLarger NopMore spatial unrollingThe impact of q

141、uantizationMarian Verhelstx+IWO-1O+O-1OxxxxIW32bit8bitISSCC 2025 Forum F3.2Free to choose to which spatial dimension to give the 4x units to:from 8 x 8 x 16 to:8 x 8 x 6432 x 8 x 1616 x 16 x 1625 of 60 2025 IEEE International Solid-State Circuits Conference.8 x 8 x 16 64 MACs .Reduced precision=Lowe

142、r EopLarger NopMore spatial unrollingLarger AI(bytesops/byte)The impact of quantizationMarian Verhelstx+IWO-1O+O-1OxxxxIW32bit8bitISSCC 2025 Forum F3.2Free to choose to which spatial dimension to give the 4x units to:from 8 x 8 x 16 to:8 x 8 x 6432 x 8 x 1616 x 16 x 16One that maximizes AI,while min

143、imizing HW flexibility cost&underutilization25 of 60 2025 IEEE International Solid-State Circuits ConferenceReduced precision=Lower EopLarger NopMore spatial unrollingLarger AI(bytesops/byte)The performance roofline for NPU:quantizationMarian VerhelstNopNopAIdram1/EopAIdramx+IWO-1O+O-1OxxxxIW32bit8b

144、itISSCC 2025 Forum F3.226 of 60 2025 IEEE International Solid-State Circuits ConferenceReduced precision=Lower EopLarger NopMore spatial unrollingLarger AI(bytesops/byte)The performance roofline for NPU:quantizationMarian VerhelstNopNopAIdram1/EopAIdramx+IWO-1O+O-1OxxxxIW32bit8bitISSCC 2025 Forum F3

145、.226 of 60 2025 IEEE International Solid-State Circuits ConferenceThe impact of(structured)sparsityMarian VerhelstVirtual Nop,Bmem,.=including zero ops,values,.+O-1OxxxxIW+O-1OxxIWISSCC 2025 Forum F3.2Structured sparsity(e.g.2:4)=Either make everything virtual,or everything physical:27 of 60 2025 IE

146、EE International Solid-State Circuits ConferenceThe impact of(structured)sparsityMarian VerhelstVirtual Nop,Bmem,.=including zero ops,values,.+O-1OxxxxIW+O-1OxxIWISSCC 2025 Forum F3.2N(virt.)op=Nnon-zero-op+Nzero_opE(virt.)op=Eop-tot/N(virt.)opStructured sparsity(e.g.2:4)=Larger virtual NopSmaller v

147、irtual EopEither make everything virtual,or everything physical:27 of 60 2025 IEEE International Solid-State Circuits ConferenceThe impact of(structured)sparsityMarian VerhelstVirtual Nop,Bmem,.=including zero ops,values,.+O-1OxxxxIW+O-1OxxIWISSCC 2025 Forum F3.2N(virt.)op=Nnon-zero-op+Nzero_opE(vir

148、t.)op=Eop-tot/N(virt.)opB(virt.)mem=Bnon-zero-data+Bzero-dataE(virt.)mem/byte=Emem-tot/BmemStructured sparsity(e.g.2:4)=Larger virtual NopSmaller virtual EopSlightly larger virtual BmemSlightly smaller virtual EmemSame virtual AIEither make everything virtual,or everything physical:27 of 60 2025 IEE

149、E International Solid-State Circuits ConferenceStructured sparsity(e.g.2:4)=Larger virtual NopSmaller virtual EopSlightly larger virtual BmemSlightly smaller virtual EmemSame virtual AIThe impact of(structured)sparsityMarian Verhelst+O-1OxxxxIW+O-1OxxIWISSCC 2025 Forum F3.2.8 x 8 x 16 64 MACs .28 of

150、 60 2025 IEEE International Solid-State Circuits ConferenceStructured sparsity(e.g.2:4)=Larger virtual NopSmaller virtual EopSlightly larger virtual BmemSlightly smaller virtual EmemSame virtual AIThe impact of(structured)sparsityMarian Verhelst+O-1OxxxxIW+O-1OxxIWISSCC 2025 Forum F3.2.8 x 8 x 16 64

151、 MACs .Which spatial dimension to give the 4x units to?:from 8 x 8 x 16 to:8 x 8 x 328 x 16 x 1616 x 8 x 1628 of 60 2025 IEEE International Solid-State Circuits ConferenceStructured sparsity(e.g.2:4)=Larger virtual NopSmaller virtual EopSlightly larger virtual BmemSlightly smaller virtual EmemSame v

152、irtual AIThe impact of(structured)sparsityMarian Verhelst+O-1OxxxxIW+O-1OxxIWISSCC 2025 Forum F3.2.8 x 8 x 16 64 MACs .Which spatial dimension to give the 4x units to?:from 8 x 8 x 16 to:8 x 8 x 328 x 16 x 1616 x 8 x 16The one that follows sparsity structure,while minimizing HW flexibility cost&unde

153、rutilization28 of 60 2025 IEEE International Solid-State Circuits ConferenceStructured sparsity(e.g.2:4)=Larger(virtual)NopSlightly larger(virtual)BmemSmaller(virtual)EopSlightly smaller(virtual)EmemThe performance roofline for NPU:sparsityMarian VerhelstNopNopAIdram1/EopAIdram+O-1OxxxxIW+O-1OxxIWIS

154、SCC 2025 Forum F3.229 of 60 2025 IEEE International Solid-State Circuits ConferenceStructured sparsity(e.g.2:4)=Larger(virtual)NopSlightly larger(virtual)BmemSmaller(virtual)EopSlightly smaller(virtual)EmemThe performance roofline for NPU:sparsityMarian VerhelstNopNopAIdram1/EopAIdram+O-1OxxxxIW+O-1

155、OxxIWISSCC 2025 Forum F3.229 of 60 2025 IEEE International Solid-State Circuits ConferenceLarge arrays,precision,sparsity=Higher rooflinesLarge AI only good news?The performance roofline for NPU:all combinedMarian VerhelstNopNopAIdram1/EopAIdramISSCC 2025 Forum F3.230 of 60 2025 IEEE International S

156、olid-State Circuits ConferenceLarge arrays,precision,sparsity=Higher rooflinesLarge AI only good news?But.can we always reach this“attainable performance”?The performance roofline for NPU:all combinedMarian VerhelstNopNopAIdram1/EopAIdramISSCC 2025 Forum F3.230 of 60 2025 IEEE International Solid-St

157、ate Circuits ConferenceML accelerator performance:a tale of 2 rooflines.From CPU.to NPUShifting right on the roofline,a matter of data reuseExploiting spatial and temporal reuseImpact of array size,precision and sparsityApproaching the Roofline,a matter of utilizationThe utilization problemImpact of

158、 scheduling and data layoutDecoupled AI accelerators:Decouple&orthogonalize spatial dataflow temporal schedule memory layoutOutlineMarian VerhelstISSCC 2025 Forum F3.231 of 60 2025 IEEE International Solid-State Circuits ConferenceUtilization=average fraction of the time each MAC unit is usefully bu

159、sy=US*UTUtilizationMarian VerhelstUS=Spatial utilization=useful ops/Cctotal ops/CcISSCC 2025 Forum F3.232 of 60 2025 IEEE International Solid-State Circuits ConferenceUtilization=average fraction of the time each MAC unit is usefully busy=US*UTUtilizationMarian VerhelstUS=Spatial utilization=useful

160、ops/Cctotal ops/Ccfor(k=0 to 8)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)8omn+=imk *wkn;.8 x 8 x 16 MACs .9E.g.depthwise layer:ISSCC 2025 Forum F3.232 of 60 2025 IEEE International Solid-State Circuits ConferenceUtilization=average fraction of the time each MAC unit is usefully busy=US*UTU

161、tilizationMarian VerhelstUS=Spatial utilization=useful ops/Cctotal ops/Ccfor(k=0 to 8)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)8omn+=imk *wkn;.8 x 8 x 16 MACs .9E.g.depthwise layer:US=9/16=56%ISSCC 2025 Forum F3.232 of 60 2025 IEEE International Solid-State Circuits ConferenceUtilization=

162、average fraction of the time each MAC unit is usefully busy=US*UTUtilizationMarian VerhelstUS=Spatial utilization=useful ops/Cctotal ops/CcUT=Temp.utilization=active comp.Cctotal Cccomputeoffload datacomputeoffload data50%utilization(50%loss)for(k=0 to 8)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0

163、 to 15)8omn+=imk *wkn;(off)load dataUS=9/16=56%UT=1/2=50%WI.8 x 8 x 16 MACs .ISSCC 2025 Forum F3.233 of 60 2025 IEEE International Solid-State Circuits ConferenceUtilization=average fraction of the time each MAC unit is usefully busy=US*UTUtilizationMarian VerhelstUS=Spatial utilization=useful ops/C

164、ctotal ops/CcUT=Temp.utilization=active comp.Cctotal Cccomputeoffload datacomputeoffload data50%utilization(50%loss)for(k=0 to 8)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)8omn+=imk *wkn;(off)load dataUS=9/16=56%UT=1/2=50%U=28%WI.8 x 8 x 16 MACs .ISSCC 2025 Forum F3.233 of 60 2025 IEEE Inte

165、rnational Solid-State Circuits ConferenceCycles and energy is wasted on idle&stalling compute units.1.2.3.Roofline impact of utilizationMarian VerhelstNopNopAIdram1/EopAIdram1-Us*UT1-Us*UTISSCC 2025 Forum F3.234 of 60 2025 IEEE International Solid-State Circuits ConferenceCycles and energy is wasted

166、 on idle&stalling compute units.How to limit underutilization?1.Spatial optimization2.Temporal optimization3.Data layout optimizationRoofline impact of utilizationMarian VerhelstNopNopAIdram1/EopAIdram1-Us*UT1-Us*UTISSCC 2025 Forum F3.234 of 60 2025 IEEE International Solid-State Circuits Conference

167、Marian Verhelst9ISSCC 2025 Forum F3.2How to limit underutilization?1.Spatial dataflow optimization2D array3D arraydepthwise layer:64*9*6435 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian Verhelst9for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=imk *wkn;.32 x 32 MACs .US=100

168、%ISSCC 2025 Forum F3.2How to limit underutilization?1.Spatial dataflow optimization2D array3D arraydepthwise layer:64*9*6435 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian Verhelstfor(k=0 to 8)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)8omn+=imk *wkn;.8 x 8 x 16 MACs .9

169、US=9/16=56%for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=imk *wkn;.32 x 32 MACs .US=100%ISSCC 2025 Forum F3.2How to limit underutilization?1.Spatial dataflow optimization2D array3D arraydepthwise layer:64*9*6435 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian Verhelstfor(k

170、=0 to 8)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)omn+=imk *wkn;.8 x 8 x 16 MACs .for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=imk *wkn;.32 x 32 MACs .8-7ISSCC 2025 Forum F3.2How to limit underutilization?1.Spatial dataflow optimizationFB layer,B=8:8*64*642D array3D arraydepthwise

171、layer:64*9*64US=9/16=56%US=100%36 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian Verhelstfor(k=0 to 8)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)omn+=imk *wkn;.8 x 8 x 16 MACs .for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=imk *wkn;.32 x 32 MACs .US=100%8-7ISSCC

172、2025 Forum F3.2How to limit underutilization?1.Spatial dataflow optimizationFB layer,B=8:8*64*642D array3D arraydepthwise layer:64*9*64US=9/16=56%US=100%36 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian Verhelstfor(k=0 to 8)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)omn

173、+=imk *wkn;.8 x 8 x 16 MACs .for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=imk *wkn;.32 x 32 MACs .US=100%US=1/32=3%8-7Need for flexibility,heterogeneity!ISSCC 2025 Forum F3.2How to limit underutilization?1.Spatial dataflow optimizationFB layer,B=8:8*64*642D array3D arraydepthwise layer:64*9*

174、64US=9/16=56%US=100%36 of 60 2025 IEEE International Solid-State Circuits ConferenceExample:DianaMarian VerhelstISSCC 2025 Forum F3.237 of 60 2025 IEEE International Solid-State Circuits ConferenceExample:DianaMarian VerhelstLarge,analog in-memory compute core:1152*512 PEs6bit*1.5bitinput parallel,o

175、utput parallelOptimized for efficiency,if massive spatial unrolling possibleExcellent peak efficiency(200Tops/Watt)Yet,utilization can be badISSCC 2025 Forum F3.237 of 60 2025 IEEE International Solid-State Circuits ConferenceExample:DianaMarian VerhelstLarge,analog in-memory compute core:1152*512 P

176、Es6bit*1.5bitinput parallel,output parallelOptimized for efficiency,if massive spatial unrolling possibleExcellent peak efficiency(200Tops/Watt)Yet,utilization can be badISSCC 2025 Forum F3.238 of 60 2025 IEEE International Solid-State Circuits ConferenceExample:DianaMarian VerhelstLarge,analog in-m

177、emory compute core:1152*512 PEs6bit*1.5bitinput parallel,output parallelOptimized for efficiency,if massive spatial unrolling possibleExcellent peak efficiency(200Tops/Watt)Yet,utilization can be badISSCC 2025 Forum F3.238 of 60 2025 IEEE International Solid-State Circuits ConferenceExample:DianaMar

178、ian VerhelstLarge,analog in-memory compute core:1152*512 PEs6bit*1.5bitinput parallel,output parallelOptimized for efficiency,if massive spatial unrolling possibleSmall,flexible digital AI core:16*16 8bit PEs2x for 4bit,4x for 2bit different spatial flowsOptimized for flexibility,yet,less efficientE

179、xcellent peak efficiency(200Tops/Watt)Yet,utilization can be badNear perfect utilization,yet worse efficiency (4,8,16Tops/W 8,4,2bit)ISSCC 2025 Forum F3.238 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian VerhelstAfter good spatial unrolling data transfers quickly dominate!Need f

180、or bandwidth-&memory-aware scheduling to avoid stallsISSCC 2025 Forum F3.2How to limit underutilization?2.Temporal scheduling optimization39 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian VerhelstAfter good spatial unrolling data transfers quickly dominate!Need for bandwidth-&me

181、mory-aware scheduling to avoid stallsPipelining only123ISSCC 2025 Forum F3.2How to limit underutilization?2.Temporal scheduling optimization39 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian VerhelstAfter good spatial unrolling data transfers quickly dominate!Need for bandwidth-&

182、memory-aware scheduling to avoid stallsPipelining only123980195020040060080010001200Memory(kB)-80%L2L1Memory RequirementsISSCC 2025 Forum F3.2How to limit underutilization?2.Temporal scheduling optimization39 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian VerhelstAfter good spat

183、ial unrolling data transfers quickly dominate!Need for bandwidth-&memory-aware scheduling to avoid stallsPipelining onlyTiling&fusion1231231a1b 1c2a2c2b3a3c3b980195020040060080010001200Memory(kB)-80%L2L1Memory RequirementsISSCC 2025 Forum F3.2How to limit underutilization?2.Temporal scheduling optim

184、ization39 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian VerhelstAfter good spatial unrolling data transfers quickly dominate!Need for bandwidth-&memory-aware scheduling to avoid stallsPipelining onlyTiling&fusion1231231a1b 1c2a2c2b3a3c3b980195020040060080010001200Memory(kB)-80%

185、L2L1Memory RequirementsISSCC 2025 Forum F3.2How to limit underutilization?2.Temporal scheduling optimization39 of 60 2025 IEEE International Solid-State Circuits ConferenceTemporal schedules:single coreMarian VerhelstISSCC 2025 Forum F3.240 of 60 2025 IEEE International Solid-State Circuits Conferen

186、ceTemporal schedules:multi-coreMarian VerhelstISSCC 2025 Forum F3.241 of 60 2025 IEEE International Solid-State Circuits ConferenceExample:scheduling on DianaMarian VerhelstISSCC 2025 Forum F3.2Heavily inter-leaving layer execution42 of 60 2025 IEEE International Solid-State Circuits ConferenceExamp

187、le:scheduling on DianaMarian VerhelstISSCC 2025 Forum F3.2Heavily inter-leaving layer executionKeeping memory needs within L1 bounds42 of 60 2025 IEEE International Solid-State Circuits ConferenceExample:scheduling on DianaMarian VerhelstISSCC 2025 Forum F3.2Heavily inter-leaving layer executionKeep

188、ing memory needs within L1 boundsBut so many options.see further!42 of 60 2025 IEEE International Solid-State Circuits ConferenceEven the best spatial and temporal unrolling,can still suffer from memory memory stalls.Data layout mattersTo maximally utilize memory bandwidths&minimize the need for dat

189、a reshufflingExample:Tiled GeMM operationMarian VerhelstISSCC 2025 Forum F3.2How to limit underutilization?3.Data Layout optimization43 of 60 2025 IEEE International Solid-State Circuits ConferenceEven the best spatial and temporal unrolling,can still suffer from memory memory stalls.Data layout mat

190、tersTo maximally utilize memory bandwidths&minimize the need for data reshufflingExample:Tiled GeMM operationMarian VerhelstISSCC 2025 Forum F3.2How to limit underutilization?3.Data Layout optimizationData layout 1Data layout 243 of 60 2025 IEEE International Solid-State Circuits ConferenceEven the

191、best spatial and temporal unrolling,can still suffer from memory memory stalls.Data layout mattersTo maximally utilize memory bandwidths&minimize the need for data reshufflingExample:Tiled GeMM operationMarian VerhelstISSCC 2025 Forum F3.2How to limit underutilization?3.Data Layout optimizationData

192、layout 1Data layout 2Contention,but compactNo contention,but larger43 of 60 2025 IEEE International Solid-State Circuits ConferenceML accelerator performance:a tale of 2 rooflines.From CPU.to NPUShifting right on the roofline,a matter of data reuseExploiting spatial and temporal reuseImpact of array

193、 size,precision and sparsityApproaching the Roofline,a matter of utilizationThe utilization problemImpact of scheduling and data layoutDecoupled AI accelerators:Decouple&orthogonalize spatial dataflow temporal schedule memory layoutOutlineMarian VerhelstISSCC 2025 Forum F3.244 of 60 2025 IEEE Intern

194、ational Solid-State Circuits ConferenceTo orthogonally optimize dataflow,schedule and data layout,one needs:1.Flexible,yet decoupled accelerators OpenGeMM2.Scheduling space exploration ZigZag/Stream3.Customizable compilation MLIRThree utilization enablersMarian VerhelstISSCC 2025 Forum F3.245 of 60

195、2025 IEEE International Solid-State Circuits ConferenceFlexible Accelerator:OpenGeMM DatapathMixture of design time and run time flexibility1More details:Xiaoling Yi,ASPDAC25Marian VerhelstISSCC 2025 Forum F3.246 of 60 2025 IEEE International Solid-State Circuits ConferenceFlexible Accelerator:OpenG

196、eMM DatapathMixture of design time and run time flexibility Tightly coupled memory systemDesign time flexible1More details:Xiaoling Yi,ASPDAC25Marian VerhelstISSCC 2025 Forum F3.246 of 60 2025 IEEE International Solid-State Circuits ConferenceFlexible Accelerator:OpenGeMM DatapathMixture of design t

197、ime and run time flexibility Tightly coupled memory systemDesign time flexible Buffered data streamersFlexible for any(afine)temporal schedule and data layout1More details:Xiaoling Yi,ASPDAC25Marian VerhelstISSCC 2025 Forum F3.246 of 60 2025 IEEE International Solid-State Circuits ConferenceISSCC 20

198、25 Forum F3.2Decoupled Accelerator:OpenGeMM1More details:Xiaoling Yi,ASPDAC25Decoupled from temporal execution&data layoutAccelerator unaware of temporal pattern(schedulers problem)Streamers unaware of data layout(compilers problem)Self-scheduled and bufferedMarian Verhelst47 of 60 2025 IEEE Interna

199、tional Solid-State Circuits ConferenceTo orthogonally optimize dataflow,schedule and data layout,one needs:1.Flexible,yet decoupled accelerators OpenGeMM2.Scheduling space exploration ZigZag/Stream3.Customizable compilation MLIRThree utilization enablersMarian VerhelstISSCC 2025 Forum F3.248 of 60 2

200、025 IEEE International Solid-State Circuits ConferenceScheduling space explorationMarian Verhelst2ISSCC 2025 Forum F3.249 of 60 2025 IEEE International Solid-State Circuits Conference50 of 60Marian VerhelstISSCC 2025 Forum F3.2ZigZag(single core)&Stream(multi-core)ZIGZAG/STREAMEnergy(MAC energy,memo

201、ry load/store)Latency(array utilization,memory stalls)AreaCost ModelMore details:Mei,TransComp2021GITHUB:https:/ stream2 2025 IEEE International Solid-State Circuits Conference50 of 60Marian VerhelstISSCC 2025 Forum F3.2ZigZag(single core)&Stream(multi-core)ZIGZAG/STREAMEnergy(MAC energy,memory load

202、/store)Latency(array utilization,memory stalls)AreaNN workloadCost ModelMore details:Mei,TransComp2021GITHUB:https:/ stream2 2025 IEEE International Solid-State Circuits Conference50 of 60Marian VerhelstISSCC 2025 Forum F3.2ZigZag(single core)&Stream(multi-core)ZIGZAG/STREAMEnergy(MAC energy,memory

203、load/store)Latency(array utilization,memory stalls)AreaTechnology characteristics(cell size,read cost,wireload,)NN workloadCost ModelHardware architecture&constraints(mem.hierarchy,compute array,interconnect,IO BW,)More details:Mei,TransComp2021GITHUB:https:/ stream2 2025 IEEE International Solid-St

204、ate Circuits Conference50 of 60Marian VerhelstISSCC 2025 Forum F3.2ZigZag(single core)&Stream(multi-core)ZIGZAG/STREAMEnergy(MAC energy,memory load/store)Latency(array utilization,memory stalls)AreaTechnology characteristics(cell size,read cost,wireload,)NN workloadCost ModelHardware architecture&co

205、nstraints(mem.hierarchy,compute array,interconnect,IO BW,)Mapping(spatial&temporal unrolling)More details:Mei,TransComp2021GITHUB:https:/ stream2 2025 IEEE International Solid-State Circuits Conference50 of 60Marian VerhelstISSCC 2025 Forum F3.2ZigZag(single core)&Stream(multi-core)ZIGZAG/STREAMEner

206、gy(MAC energy,memory load/store)Latency(array utilization,memory stalls)AreaTechnology characteristics(cell size,read cost,wireload,)NN workloadCost ModelHardware architecture&constraints(mem.hierarchy,compute array,interconnect,IO BW,)Mapping(spatial&temporal unrolling)Energy/inferencetime/inferenc

207、eaccuracyMore details:Mei,TransComp2021GITHUB:https:/ stream2 2025 IEEE International Solid-State Circuits ConferenceFrom workload.to schedule.Stream tiling and scheduling results2ISSCC 2025 Forum F3.2Marian Verhelst51 of 60 2025 IEEE International Solid-State Circuits ConferenceFrom workload.to sch

208、edule.Stream tiling and scheduling results2ISSCC 2025 Forum F3.2Marian Verhelst51 of 60 2025 IEEE International Solid-State Circuits ConferenceFrom workload.to schedule.with temporal unrolling.Stream tiling and scheduling results2ISSCC 2025 Forum F3.2Marian Verhelst51 of 60 2025 IEEE International S

209、olid-State Circuits ConferenceFrom workload.to schedule.with temporal unrolling.and spatial unrolling.Stream tiling and scheduling results2ISSCC 2025 Forum F3.2Marian Verhelst51 of 60 2025 IEEE International Solid-State Circuits ConferenceTo orthogonally optimize dataflow,schedule and data layout,on

210、e needs:1.Flexible,yet decoupled accelerators OpenGeMM2.Scheduling space exploration ZigZag/Stream3.Customizable compilation MLIRThree utilization enablersMarian VerhelstISSCC 2025 Forum F3.252 of 60 2025 IEEE International Solid-State Circuits ConferenceCompiler customization Use optimal schedule f

211、rom ZigZag/Stream3.ZIGZAG/STREAMMarian VerhelstISSCC 2025 Forum F3.253 of 60 2025 IEEE International Solid-State Circuits ConferenceCompiler customization Use optimal schedule from ZigZag/Stream Use layout with minimal memory stalls,expressed in new MLIR“tsl”dialect3ZIGZAG/STREAMMarian VerhelstISSCC

212、 2025 Forum F3.254 of 60 2025 IEEE International Solid-State Circuits ConferenceCompiler customization Use optimal schedule from ZigZag/Stream Use layout with minimal memory stalls,expressed in new MLIR dialect Insert DMA and synchronization calls for parallel operation3ZIGZAG/STREAMMarian VerhelstI

213、SSCC 2025 Forum F3.255 of 60 2025 IEEE International Solid-State Circuits ConferenceCompiler customization Use optimal schedule from ZigZag/Stream Use layout with minimal memory stalls,expressed in new MLIR dialect Insert DMA and synchronization calls for parallel operation Generate accelerator code

214、 and runtime using accelerator template3Marian VerhelstISSCC 2025 Forum F3.256 of 60 2025 IEEE International Solid-State Circuits ConferenceCombining it all together.Adding flexible accelerators at design timeScheduling their overlapping executionGenerating codeExample network:Marian VerhelstISSCC 2

215、025 Forum F3.257 of 60 2025 IEEE International Solid-State Circuits ConferenceCombining it all together.Adding flexible accelerators at design timeScheduling their overlapping executionGenerating codeExample network:Marian VerhelstISSCC 2025 Forum F3.257 of 60 2025 IEEE International Solid-State Cir

216、cuits ConferenceCombining it all together.Allowing to achieve very-close-to roofline performance for real workloads!(here:multi-layer GeMM)ArithmeticMarian VerhelstISSCC 2025 Forum F3.258 of 60 2025 IEEE International Solid-State Circuits ConferenceConclusion AI accelerator perform seldomly accordin

217、g to their theoretical rooflines Dataflow,scheduling and data layout optimizations are key to minimize spatial and temporal utilization losses The exploration/design/compilation space is too large handle by hand.Need for automatic:Generation of flexible dataflow and memory interfacing HW Scheduling

218、space exploration Compiler customization frameworks Efficient decoupled ML acceleratorsMarian VerhelstISSCC 2025 Forum F3.259 of 60 2025 IEEE International Solid-State Circuits ConferenceChoi13 Choi,Jee Whan,et al.A roofline model of energy.2013 IEEE 27th International Symposium on Parallel and Dist

219、ributed Processing.IEEE,2013.Williams09 Williams,Samuel,Andrew Waterman,and David Patterson.Roofline:an insightful visual performance model for multicore architectures.Communications of the ACM 52.4(2009):65-76.Mei21 Mei,L.,Houshmand,P.,Jain,V.,Giraldo,S.,&Verhelst,M.(2021).ZigZag:Enlarging joint ar

220、chitecture-mapping design space exploration for DNN accelerators.IEEE Transactions on Computers,70(8),1160-1174.Symons23 Symons,A.,Mei,L.,Colleman,S.,Houshmand,P.,Karl,S.,&Verhelst,M.(2023,April).Stream:A Modeling Framework for Fine-grained Layer Fusion on Multi-core DNN Accelerators.In 2023 IEEE In

221、ternational Symposium on Performance Analysis of Systems and Software(ISPASS)(pp.355-357).IEEE.Ueyoshi22 Ueyoshi,K.,Papistas,I.A.,Houshmand,P.,Sarda,G.M.,Jain,V.,Shi,M.,.&Verhelst,M.(2022,February).DIANA:An end-to-end energy-efficient digital and ANAlog hybrid neural network SoC.In 2022 IEEE Interna

222、tional Solid-State Circuits Conference(ISSCC)(Vol.65,pp.1-3).IEEE.Yi24 Yi,X.,Antonio,R.,Dumoulin,J.,Sun,J.,Van Delm,J.,Paim,G.,&Verhelst,M.(2024).OpenGeMM:A High-Utilization GeMM Accelerator Generator with Lightweight RISC-V Control and Tight Memory Coupling.arXiv preprint arXiv:2411.09543.Key Refer

223、encesMarian VerhelstISSCC 2025 Forum F3.260 of 60 2025 IEEE International Solid-State Circuits ConferenceExplicit Decoupled Data Orchestration:A Fundamental Approach to AccelerationMichael PellauerNVIDIA,Architecture Research GroupDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orche

224、stration1 of 60 2025 IEEE International Solid-State Circuits ConferenceModern world has come to rely on regular cadenceComputational Efficiency Improvements28 nm process,7.1 billion transistors12 nm process,21.1 billion transistors7 nm process,54.2 billion transistorsEnabling:Enabling:GK210,2014GV10

225、0,2014GA100,2020Dr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration2 of 60 2025 IEEE International Solid-State Circuits Conference“Oil Well”has had easy stuff extracted alreadySlowing Transistor ScalingSource:OLaughlin,The Rising Tide of Semiconductor Cost,data from Interna

226、tional Business Strategies,Inc.and MarvellSource:Shalf,The Future of Computing Beyond Moores Law,projecting from data points by Olukotun,Hammond,Sutter,and HorowitzDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration3 of 60 2025 IEEE International Solid-State Circuits Confer

227、enceAlternative“oil well”to increase efficiencyHardware SpecializationGeneraldatapathsBranchmis-predictsCachepenaltiesControl overheadsThe real workSource:Chung et al.,Single-Chip Heterogeneous Computing:Does the Future Include Custom Logic,FPGAs,and GPGPUs?2010Dr.Michael PellauerISSCC 2025-Forum 3.

228、3:Explicit Decoupled Data Orchestration4 of 60 2025 IEEE International Solid-State Circuits ConferencePapers focus on customized compute,but data movement is more expensiveCommunity Research Focus is wrong64-bit datapath26 pJ256 pJ1000 pJ500 pJEfficientoff-chip link256-bit buses16,000 pJDRAMRd/Wr256

229、-bit access8 kB SRAM50 pJ20mmSource:Bill Dally,20 nm20 pJDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration5 of 60 2025 IEEE International Solid-State Circuits ConferencePercentage of area devoted to on-chip buffersBuffer Hierarchies in ML AcceleratorsEach buffering system

230、 is custom-designed,in-extractible from its acceleratorStatement like“Reuse TPU buffer hierarchy for genome sequencing”is nonsensicalDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration6 of 60 2025 IEEE International Solid-State Circuits ConferenceFeeding data to a functiona

231、l unit exactly when it wants itWhat is Data OrchestrationML ASICs use workload knowledge to optimize orchestration at design-timeWhen data is moved over a transfer substrateWhere data is placed in available staging buffersHow data is accessed,including removal/evictionWho the“actors”are that touch d

232、ata and synchronizeDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration7 of 60 2025 IEEE International Solid-State Circuits ConferenceGuiding Principles for EfficiencyLocal reuse staged physically closeCross-unit use amortize data accessBandwidth efficiency Maximize delivery

233、 ratePrecise synchronization Only wait for exactly data you need,respond quicklySimple structures-Minimize hardware area/powerDelivery/use overlap Next tile should be available when current is doneDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration8 of 60 2025 IEEE Internat

234、ional Solid-State Circuits ConferenceData Orchestration in DL ASICsISSCC 2016,ISCA 2016Dr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration9 of 60 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestra

235、tion10 of 60Conv1D:WeightsKey concept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=for q in range(Q):for s in range(S):outq+=inq+s*ws 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10

236、 of 60Conv1D:WeightsKey concept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=for q in range(Q):for s in range(S):outq+=inq+s*ws 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60

237、Conv1D:WeightsKey concept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=for q in range(Q):for s in range(S):outq+=inq+s*ws 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D

238、:WeightsKey concept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=for q in range(Q):for s in range(S):outq+=inq+s*ws 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:Weigh

239、tsKey concept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=for q in range(Q):for s in range(S):outq+=inq+s*ws 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:WeightsKey

240、concept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=for q in range(Q):for s in range(S):outq+=inq+s*ws 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:WeightsKey concep

241、t introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=for q in range(Q):for s in range(S):outq+=inq+s*ws 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:WeightsKey concept intr

242、oduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=for q in range(Q):for s in range(S):outq+=inq+s*ws 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:WeightsKey concept introduced

243、 by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=for q in range(Q):for s in range(S):outq+=inq+s*ws 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:WeightsKey concept introduced by Ey

244、erissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=Weights(Tile Size S)Inputs(Tile Size S)Outputs(Tile Size 1)AGenAGenAGenfor q in range(Q):for s in range(S):outq+=inq+s*wsL1 buffers divided into separate physical RAM macros at design time 2025 IEEE International Solid-State Circuits Confe

245、renceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:WeightsKey concept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=Weights(Tile Size S)Inputs(Tile Size S)Outputs(Tile Size 1)AGenAGenAGenfor q in range(Q):for s in range(S):

246、outq+=inq+s*wsL1 buffers divided into separate physical RAM macros at design timeLoop in“global”tensor address space determines staging pattern of individual RAMs 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of

247、 60Conv1D:WeightsKey concept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=Weights(Tile Size S)Inputs(Tile Size S)Outputs(Tile Size 1)AGenAGenAGenfor q in range(Q):for s in range(S):outq+=inq+s*wsL1 buffers divided into separate physical RAM macros at design timeLoop i

248、n“global”tensor address space determines staging pattern of individual RAMs“Output Stationary”that tensor changes most slowly 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:WeightsKey concept introduc

249、ed by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=Weights(Tile Size S)Inputs(Tile Size S)Outputs(Tile Size 1)AGenAGenAGenfor q in range(Q):for s in range(S):outq+=inq+s*ws“Input Streaming”with halo from convolutional mathL1 buffers divided into separate physical RAM macros at desi

250、gn timeLoop in“global”tensor address space determines staging pattern of individual RAMs“Output Stationary”that tensor changes most slowly 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:WeightsKey con

251、cept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=Weights(Tile Size S)Inputs(Tile Size S)Outputs(Tile Size 1)AGenAGenAGenfor q in range(Q):for s in range(S):outq+=inq+s*ws“Weight Tile-Stationary”that tensor stays resident with fixed reuse distance“Input Streaming”with

252、 halo from convolutional mathL1 buffers divided into separate physical RAM macros at design timeLoop in“global”tensor address space determines staging pattern of individual RAMs“Output Stationary”that tensor changes most slowly 2025 IEEE International Solid-State Circuits ConferenceDr.Michael Pellau

253、erISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:WeightsKey concept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=Weights(Tile Size S)Inputs(Tile Size S)Outputs(Tile Size 1)AGenAGenAGenfor q in range(Q):for s in range(S):outq+=inq+s*ws#Actual

254、read FSM pattern is projection of the global loop into local addressesfor q in range(Q):for s in range(S):input_buffer.readq+s%S“Weight Tile-Stationary”that tensor stays resident with fixed reuse distance“Input Streaming”with halo from convolutional mathL1 buffers divided into separate physical RAM

255、macros at design timeLoop in“global”tensor address space determines staging pattern of individual RAMs“Output Stationary”that tensor changes most slowly 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration11 of 60Fills t

256、o local space like“waveform”with amplitude(tile size)and period(residence length)Scratchpad Interface and Exploiting ReuseL1 image scratchpad(size S=3)L1 AddrGen012St.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data)2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum

257、 3.3:Explicit Decoupled Data Orchestration11 of 60Fills to local space like“waveform”with amplitude(tile size)and period(residence length)Scratchpad Interface and Exploiting ReuseL1 image scratchpad(size S=3)L1 AddrGen012L2 image scratch(size Q)L2 AddrGenSt.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data

258、)2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration11 of 60Fills to local space like“waveform”with amplitude(tile size)and period(residence length)Scratchpad Interface and Exploiting ReuseL1 image scratchpad(size S=3)L

259、1 AddrGen012L2 image scratch(size Q)L2 AddrGenASt.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data)2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration11 of 60Fills to local space like“waveform”with amplitude(tile size)and per

260、iod(residence length)Scratchpad Interface and Exploiting ReuseL1 image scratchpad(size S=3)L1 AddrGen012L2 image scratch(size Q)L2 AddrGenASt.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data)B 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled

261、 Data Orchestration11 of 60Fills to local space like“waveform”with amplitude(tile size)and period(residence length)Scratchpad Interface and Exploiting ReuseL1 image scratchpad(size S=3)L1 AddrGen012L2 image scratch(size Q)L2 AddrGenASt.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data)BC 2025 IEEE Internat

262、ional Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration11 of 60Fills to local space like“waveform”with amplitude(tile size)and period(residence length)Scratchpad Interface and Exploiting ReuseL1 image scratchpad(size S=3)L1 AddrGen012L2 imag

263、e scratch(size Q)L2 AddrGenASt.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data)BCD 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration11 of 60Fills to local space like“waveform”with amplitude(tile size)and period(residence l

264、ength)Scratchpad Interface and Exploiting ReuseL1 image scratchpad(size S=3)L1 AddrGen012Initial Delay-(Ld.S(0)-Ld.S(1)-.-Ld.S(Q-1+S-1%S)Q-1L2 image scratch(size Q)L2 AddrGenSt.S(0,L2image0)-St.S(1,L2image1).St.S(Q-1%S,L2imageQ-1)Ainterface usage“regexp”St.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data)

265、BCD 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration11 of 60Fills to local space like“waveform”with amplitude(tile size)and period(residence length)Scratchpad Interface and Exploiting ReuseL1 image scratchpad(size S=

266、3)L1 AddrGen012Initial Delay-(Ld.S(0)-Ld.S(1)-.-Ld.S(Q-1+S-1%S)Q-1L2 image scratch(size Q)L2 AddrGenSt.S(0,L2image0)-St.S(1,L2image1).St.S(Q-1%S,L2imageQ-1)Ainterface usage“regexp”St.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data)BCDWhy is separate“Push”filler not a race condition?(Both RAW and WAR)2025

267、 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration11 of 60Fills to local space like“waveform”with amplitude(tile size)and period(residence length)Scratchpad Interface and Exploiting ReuseIn custom DL accelerator,all sizing

268、s can be done at design time based on known micro-architectural timingsOverheads for synchronization and hazard checking approach zeroL1 image scratchpad(size S=3)L1 AddrGen012Initial Delay-(Ld.S(0)-Ld.S(1)-.-Ld.S(Q-1+S-1%S)Q-1L2 image scratch(size Q)L2 AddrGenSt.S(0,L2image0)-St.S(1,L2image1).St.S(

269、Q-1%S,L2imageQ-1)Ainterface usage“regexp”St.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data)BCDWhy is separate“Push”filler not a race condition?(Both RAW and WAR)2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration11 of 60Fil

270、ls to local space like“waveform”with amplitude(tile size)and period(residence length)Scratchpad Interface and Exploiting ReuseIn custom DL accelerator,all sizings can be done at design time based on known micro-architectural timingsOverheads for synchronization and hazard checking approach zeroL1 im

271、age scratchpad(size S=3)L1 AddrGen012Initial Delay-(Ld.S(0)-Ld.S(1)-.-Ld.S(Q-1+S-1%S)Q-1L2 image scratch(size Q)L2 AddrGenAll RAM slots used for Reuse ExploitationSt.S(0,L2image0)-St.S(1,L2image1).St.S(Q-1%S,L2imageQ-1)Ainterface usage“regexp”St.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data)BCDWhy is s

272、eparate“Push”filler not a race condition?(Both RAW and WAR)2025 IEEE International Solid-State Circuits ConferenceFrom Eyeriss to General Reusable IdiomASPLOS 2019On-chipbuffetOff-chipDRAMread addrread datacreditsread datavalid?update dataAddress gen(DMA engine)read idxAddr genfill dataCombine best

273、aspects of FIFOs and Scratchpads in modular encapsulationDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration12 of 60 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration13 of 60Key concept intro

274、duced by this workData orchestration TaxonomyOff-chipDRAME.g.,Cacheread addrread dataread addrread dataOff-chipDRAMOn-chip scratchpadread addrread dataread addrread dataDpathDpath Makes decisions about staging,replacement,etc.in workload-agnostic fashion Size not exposed(officially)All staging and r

275、eplacement decisions made by workload Size exposed,latency hiding up to user(u-arch latency not exposed)ImplicitExplicit 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration14 of 60Key concept introduced by this workData

276、 orchestration TaxonomycachescratchpadCoupledDecoupledcacheAddress gen(DMA engine)ImplicitExplicitAddress gen(DMA engine)?See Decoupled Access-Execute archs Also,LD.G+ST.S,TMA(more on this later)Read-after-fill synchronization covered by load-to-use stalling Read-after-fill synchronization covered b

277、y load-to-use stallingFIFO/buffet External sync.needed(i.e.,arrive-wait)Encapsulate read-after-fill sync.via head/tail pointer checks Encapsulate modular arith.due to latency hiding inside buffer 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit

278、Decoupled Data Orchestration15 of 60FIFOs:Existing buffer idiom for E.D.D.O.Too restrictive for consumer No repeated reads past head No in-place updates No bulk removalBulk transfer Great MLP No remote address with fillEncapsulates synchronization Composes hierarchically with fewer“landing zones”Off

279、-chipDRAMread addrread datacreditsdatare-circulateXilinx.FIFO Generator v13.1,Intel,FIFO:FPGA IP User GuideAddress gen(DMA engine)fill datavalid?On-chipFIFO 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration16 of 60Unl

280、ike implicit structures like caches,requires active involvement to ensure performanceGeneralized Latency HidingL1 image buffet(size S=3+K=2)L1 AddrGen012L2 image scratch(size Q)L2 AddrGenReadOrShrink(offset)Read(data)Fill(data)34Credit(sp_amount)Internal Scoreboard 2025 IEEE International Solid-Stat

281、e Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration16 of 60Unlike implicit structures like caches,requires active involvement to ensure performanceGeneralized Latency HidingL1 image buffet(size S=3+K=2)L1 AddrGen012L2 image scratch(size Q)L2 AddrGenRead

282、OrShrink(offset)Read(data)Fill(data)34Credit(sp_amount)5Internal Scoreboard 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration16 of 60Unlike implicit structures like caches,requires active involvement to ensure perform

283、anceGeneralized Latency HidingL1 image buffet(size S=3+K=2)L1 AddrGen012L2 image scratch(size Q)L2 AddrGenWait+Fill(L2image0:4)-Wait+Fill(L2image5).Wait+Fill(L2imageQ-1)Ainterface usage“regexp”ReadOrShrink(offset)Read(data)Fill(data)34BCDECredit(sp_amount)5Internal Scoreboard 2025 IEEE International

284、 Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration16 of 60Unlike implicit structures like caches,requires active involvement to ensure performanceGeneralized Latency HidingL1 image buffet(size S=3+K=2)L1 AddrGen012(Rd(0)-Rd(1)-Rd(2)+Shrink(1

285、)Q-1L2 image scratch(size Q)L2 AddrGenWait+Fill(L2image0:4)-Wait+Fill(L2image5).Wait+Fill(L2imageQ-1)Ainterface usage“regexp”ReadOrShrink(offset)Read(data)Fill(data)34BCDECredit(sp_amount)5Internal Scoreboard 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum

286、3.3:Explicit Decoupled Data Orchestration16 of 60Unlike implicit structures like caches,requires active involvement to ensure performanceGeneralized Latency HidingL1 image buffet(size S=3+K=2)L1 AddrGen012(Rd(0)-Rd(1)-Rd(2)+Shrink(1)Q-1L2 image scratch(size Q)L2 AddrGenSome RAM slots used for Latenc

287、y HidingImportant:can be fractions of tiles,no need for whole quanta(i.e.,double-buffering)and no ROI for too muchWait+Fill(L2image0:4)-Wait+Fill(L2image5).Wait+Fill(L2imageQ-1)Ainterface usage“regexp”ReadOrShrink(offset)Read(data)Fill(data)34BCDECredit(sp_amount)5Most RAM slots used for Reuse Explo

288、itationInternal Scoreboard 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration16 of 60Unlike implicit structures like caches,requires active involvement to ensure performanceGeneralized Latency HidingL1 image buffet(siz

289、e S=3+K=2)L1 AddrGen012(Rd(0)-Rd(1)-Rd(2)+Shrink(1)Q-1L2 image scratch(size Q)L2 AddrGenSome RAM slots used for Latency HidingImportant:can be fractions of tiles,no need for whole quanta(i.e.,double-buffering)and no ROI for too muchWait+Fill(L2image0:4)-Wait+Fill(L2image5).Wait+Fill(L2imageQ-1)Ainte

290、rface usage“regexp”ReadOrShrink(offset)Read(data)Fill(data)34BCDECredit(sp_amount)5Most RAM slots used for Reuse ExploitationInternal Scoreboard 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration16 of 60Unlike implicit

291、 structures like caches,requires active involvement to ensure performanceGeneralized Latency HidingL1 image buffet(size S=3+K=2)L1 AddrGen012(Rd(0)-Rd(1)-Rd(2)+Shrink(1)Q-1L2 image scratch(size Q)L2 AddrGenSome RAM slots used for Latency HidingImportant:can be fractions of tiles,no need for whole qu

292、anta(i.e.,double-buffering)and no ROI for too muchWait+Fill(L2image0:4)-Wait+Fill(L2image5).Wait+Fill(L2imageQ-1)AAfter a shrink,some RAM slots are Landing Zonesinterface usage“regexp”ReadOrShrink(offset)Read(data)Fill(data)34BCDECredit(sp_amount)5Most RAM slots used for Reuse ExploitationInternal S

293、coreboard 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration16 of 60Unlike implicit structures like caches,requires active involvement to ensure performanceGeneralized Latency HidingL1 image buffet(size S=3+K=2)L1 Addr

294、Gen012(Rd(0)-Rd(1)-Rd(2)+Shrink(1)Q-1L2 image scratch(size Q)L2 AddrGenSome RAM slots used for Latency HidingImportant:can be fractions of tiles,no need for whole quanta(i.e.,double-buffering)and no ROI for too muchWait+Fill(L2image0:4)-Wait+Fill(L2image5).Wait+Fill(L2imageQ-1)AAfter a shrink,some R

295、AM slots are Landing Zonesinterface usage“regexp”ReadOrShrink(offset)Read(data)Fill(data)34Properly sizing the latency-hiding ensures the next datum is staged just-in-timeAlso minimizes the percentage of inactive buffet entries(landing zones)Can use Littles Law to size the Reuse Exploitation tile si

296、ze to maximize compute bandwidthBCDECredit(sp_amount)5Most RAM slots used for Reuse ExploitationInternal Scoreboard 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration17 of 60Littles Law and ScratchpadsLittles Law:Origi

297、nally used by banks to size roped waiting areas in front of tellersOccupancyDemand rateTime spent 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration17 of 60Littles Law and ScratchpadsLittles Law:Originally used by bank

298、s to size roped waiting areas in front of tellersOccupancyDemand rateTime spentAnalogy:Minifridge(size 6)in office break roomLets assume someone wants 1 cold soda every 10 minutes.and it takes 3 hours for a can to become sufficiently chilled.that means workers only get cold soda out of the fridge 1

299、hour out of 3!Rule:When you take a cold can,replace with a warm one from cupboard 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration17 of 60Littles Law and Scratchpadsgreen:cold soda deliveredgray:no cold soda availabl

300、etime,in ten-minute intervals3 Hours:can sitting in“fridge slot 1”becomes coldLittles Law:Originally used by banks to size roped waiting areas in front of tellersOccupancyDemand rateTime spent18 slots needed for SOLAnalogy:Minifridge(size 6)in office break roomLets assume someone wants 1 cold soda e

301、very 10 minutes.and it takes 3 hours for a can to become sufficiently chilled.that means workers only get cold soda out of the fridge 1 hour out of 3!Rule:When you take a cold can,replace with a warm one from cupboardJob of a buffer idiom:Help the program exploit reuse and hide latency while minimiz

302、ing locked landing zonesAnd do it across multiple operand tensors and buffers with differing access patterns and fill rates!2025 IEEE International Solid-State Circuits ConferenceOn-chipbuffetBuffetsOff-chipDRAMread addrread datacreditsread datavalid?update dataAddress gen(DMA engine)Operations:Read

303、(idx)Update(idx,data)Shrink(Num)read idxAddr genfill dataLoosen FIFO access restrictions while maintaining fill benefitsEncapsulation of synchronization and RAM size has interfacing benefitsDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration18 of 60 2025 IEEE International

304、Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration19 of 60read(sp_offset)read(data)update(data)Why“Buffet”?Typical interface usage is directly analogous to restaurantsBuffetfill(data)shrink(sp_offset)credit(sp_amount)2025 IEEE International S

305、olid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration19 of 60read(sp_offset)read(data)update(data)Why“Buffet”?Typical interface usage is directly analogous to restaurantsBuffetGet any available dish,go back as many times as you wantfill(data)shri

306、nk(sp_offset)credit(sp_amount)2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration19 of 60read(sp_offset)read(data)update(data)Why“Buffet”?Typical interface usage is directly analogous to restaurantsBuffetGet any availab

307、le dish,go back as many times as you wantfill(data)shrink(sp_offset)credit(sp_amount)Shrink=course change 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration19 of 60read(sp_offset)read(data)update(data)Why“Buffet”?Typic

308、al interface usage is directly analogous to restaurantsBuffetIn restaurant no in-place update!Get any available dish,go back as many times as you wantfill(data)shrink(sp_offset)credit(sp_amount)Shrink=course change 2025 IEEE International Solid-State Circuits ConferenceBuffet Local Synchronization(S

309、ee Paper)Buffetfillread idxread dataupdateHardware scoreboard and stall logicNo hardware needed 2025 IEEE International Solid-State Circuits ConferenceBuffet Local Synchronization(See Paper)Buffetfillread idxread dataupdateRead data may stall until Fill(RAW)Hardware scoreboard and stall logicNo hard

310、ware needed 2025 IEEE International Solid-State Circuits ConferenceBuffet Local Synchronization(See Paper)Buffetfillread idxread dataupdatewill_update?Read data may stall until Fill(RAW)Hardware scoreboard and stall logicNo hardware needed 2025 IEEE International Solid-State Circuits ConferenceBuffe

311、t Local Synchronization(See Paper)Buffetfillread idxread dataupdatewill_update?Read data may stall until Fill(RAW)Read data may stall until Update completes(RAW)Hardware scoreboard and stall logicNo hardware needed 2025 IEEE International Solid-State Circuits ConferenceBuffet Local Synchronization(S

312、ee Paper)Buffetfillread idxread dataupdatewill_update?Read data may stall until Fill(RAW)Read data may stall until Update completes(RAW)Transitively syncd via preceding Read()Hardware scoreboard and stall logicNo hardware needed 2025 IEEE International Solid-State Circuits ConferenceBuffet Local Syn

313、chronization(See Paper)Higher-level interfacing semantics beyond load/store=cheaper hardwareBuffetfillread idxread dataupdatewill_update?Read data may stall until Fill(RAW)Read data may stall until Update completes(RAW)Transitively syncd via preceding Read()Hardware scoreboard and stall logicNo hard

314、ware needed 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerI

315、SSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchest

316、ration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=2025 IEEE International Solid-State

317、Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orch

318、estration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Output

319、s*=2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025

320、-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D

321、 ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=Weightsbuffet(S)Inputsbuffet(S)Outputsbuffet(1)AGenAGenAGen 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-cei

322、l(S/2)Outputs*=Weightsbuffet(S)Inputsbuffet(S)Outputsbuffet(1)AGenAGenAGenfor q=0.Q)for s=0.S)Read(s)Shrink(S)2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInput

323、sQ=W-ceil(S/2)Outputs*=Weightsbuffet(S)Inputsbuffet(S)Outputsbuffet(1)AGenAGenAGenReads past head non-destructivelyEncapsulates synchronization:no concern for fillsfor q=0.Q)for s=0.S)Read(s)Shrink(S)2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Expli

324、cit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=Weightsbuffet(S)Inputsbuffet(S)Outputsbuffet(1)AGenAGenAGenReads past head non-destructivelyEncapsulates synchronization:no concern for fillsfor q=0.Q)for s=0.S)Read(s)Shrink(S)for q=0.Q)

325、for s=0.S)Read(s)Shrink(1)Shrink(S-1)2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=Weightsbuffet(S)Inputsbuffet(S)Outputsbuffet(1)AGe

326、nAGenAGenReads past head non-destructivelyEncapsulates synchronization:no concern for fillsfor q=0.Q)for s=0.S)Read(s)Shrink(S)for q=0.Q)for s=0.S)Read(s)Shrink(1)Shrink(S-1)Encapsulates modular arithmetic:shrink tile size=sliding window 2025 IEEE International Solid-State Circuits ConferenceDr.Mich

327、ael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=Weightsbuffet(S)Inputsbuffet(S)Outputsbuffet(1)AGenAGenAGenReads past head non-destructivelyEncapsulates synchronization:no concern for fillsfor q=0.

328、Q)for s=0.S)Read(s)Shrink(S)for q=0.Q)for s=0.S)Read(s)Shrink(1)Shrink(S-1)for q=0.Q)for s=0.S-1)ReadAndUpdate(0)Read(0)Shrink(1)Encapsulates modular arithmetic:shrink tile size=sliding window 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Dec

329、oupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=Weightsbuffet(S)Inputsbuffet(S)Outputsbuffet(1)AGenAGenAGenReads past head non-destructivelyEncapsulates synchronization:no concern for fillsHandles RAW hazard:correct behavior even if MACC has

330、latencyfor q=0.Q)for s=0.S)Read(s)Shrink(S)for q=0.Q)for s=0.S)Read(s)Shrink(1)Shrink(S-1)for q=0.Q)for s=0.S-1)ReadAndUpdate(0)Read(0)Shrink(1)Encapsulates modular arithmetic:shrink K traversal order(i.e.,B-stationary dataflow)Einsum:Z=,Fixed-Function Sparse AcceleratorsSee:OuterSpace,ExTensor,GAMM

331、A,etc.Without these mechanisms,how does a general-purpose accelerator do on this domain?Dr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration42 of 60 2025 IEEE International Solid-State Circuits ConferenceReasons for not meeting roofline:Moving data too far(e.g.,low-reuse dat

332、a emplaced in DRAM-L2-L1)Moving unnecessary data(e.g.,whole cache lines,replacement policies)Round-trip data accesses(e.g.,slow indirections)Inefficient ISA implementations of certain key operations(e.g.,intersection)Unstructured Sparsity*on GPUsA100 FP32 roofline(SIMD only,excluding tensor cores)*N

333、ot the 2:4 structured sparsity supported natively in tensor coresSymphony:Use HHP to deploy specialized data orchestration blocks throughout memory hierarchy of GPU-like architectureDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration43 of 60 2025 IEEE International Solid-State Circuits ConferenceSymphony ApproachChallenge:How to expose dispersed specialized blocks across

友情提示

1、下載報告失敗解決辦法
2、PDF文件下載后,可能會被瀏覽器默認打開,此種情況可以點擊瀏覽器菜單,保存網頁到桌面,就可以正常下載了。
3、本站不支持迅雷下載,請使用電腦自帶的IE瀏覽器,或者360瀏覽器、谷歌瀏覽器下載即可。
4、本站報告下載后的文檔和圖紙-無水印,預覽文檔經過壓縮,下載后原文更清晰。

本文(Forum 3 Its all About Data Building Blocks, Compute, Movement and Integration.pdf)為本站 (張5G) 主動上傳,三個皮匠報告文庫僅提供信息存儲空間,僅對用戶上傳內容的表現方式做保護處理,對上載內容本身不做任何修改或編輯。 若此文所含內容侵犯了您的版權或隱私,請立即通知三個皮匠報告文庫(點擊聯系客服),我們立即給予刪除!

溫馨提示:如果因為網速或其他原因下載失敗請重新下載,重復下載不扣分。
客服
商務合作
小程序
服務號
折疊
午夜网日韩中文字幕,日韩Av中文字幕久久,亚洲中文字幕在线一区二区,最新中文字幕在线视频网站