1、ISSCC2025Forum3ItsallAboutData:BuildingBlocks,Compute,MovementandIntegration 2025 IEEE International Solid-State Circuits ConferenceForum F3:Its all about Data:Building blocks,Compute,Movement and IntegrationInternational Solid State Circuit ConferenceFebruary 20th,2025Start of presentations at 8:15
2、amISSCC 20251 of 9 2025 IEEE International Solid-State Circuits ConferenceHosted by Digital Circuits,Memory,Digital Architectures&Systems&Security Subcommittees Organizers:Yvain Thonnart,CEA-List,Grenoble,FranceViolante Moschiano,OpenChip Technologies,Barcelona,SpainJie Gu,Northwestern University,Ev
3、anston,ILSanu Mathew,Intel,Hillsboro,ORChampions:Fatih Hamzaoglu,Intel,Hillsboro,ORTanay Karnik,Intel,Hillsboro,OROrganizing CommitteeISSCC 20252 of 9 2025 IEEE International Solid-State Circuits ConferenceFacing the memory wallISSCC 20253 of 9 2025 IEEE International Solid-State Circuits Conference
4、Abundant data:AI model sizeISSCC 20254 of 9A.Gholami,IEEE Micro,44,3 2025 IEEE International Solid-State Circuits ConferenceLagging behind throughput:data latencyISSCC 20255 of 9J.McCalpin,Supercomputing 2016 2025 IEEE International Solid-State Circuits ConferenceEnergy cost of data movementISSCC 20
5、256 of 9M.Horowitz,ISSCC 20141.3nJDRAM Access70pJ 2025 IEEE International Solid-State Circuits ConferenceData&compute,from applications to architecturesHigh performance computing,AI/ML,GPUsData&storage,from architectures to circuitsHBM,dense storage,compute in memory,memory-centric architecturesNew
6、data-centric architectures,new challengesChipletization,high-bandwidth interfaces,securityForum outlineISSCC 20257 of 9 2025 IEEE International Solid-State Circuits ConferenceAgendaISSCC 20258 of 8StartTitleSpeakerAffiliation8:15IntroductionYvain ThonnartCEA-List8:25Dataflow:Is it all About Algorith
7、ms and Application Targets?Osman Sabri UnsalBSC9:15Dataflow Optimization and Data Sparsity Management for ML AcceleratorsMarian VerhelstKU Leuven10:05Break10:20Explicit Decoupled Data Orchestration:A Fundamental Approach to AccelerationMichael PellauerNVIDIA11:10Memory Solutions for AI Era:High-Band
8、width Memories and Dense-Data StorageKyomin SohnSamsung12:00Lunch13:20Computation-in-Memory Circuit Design for Computing/Storage-Intensive AI ApplicationsXin SiSoutheast University14:10AIs Promethean Moment:Silicon Scaling-Laws and On-Package Memory Are Not EnoughIgor ArsovskiGroq15:00Break 15:15Opt
9、imizing Communication Between Chiplets for Future System-in-PackagesKemal AygnIntel16:05Complex Systems,Complex Threats:Security Strategies for Heterogeneous System DesignsTodd M.AustinUniversity of Michigan16:55Closing remarksViolante MoschianoOpenChip Technologies 2025 IEEE International Solid-Sta
10、te Circuits Conference2 coffee breaks and one lunch breakElectronic copies for Forums are available for downloadPlease switch your mobile devices to muteRemember to fill out speaker evaluation using the ISSCC appNo panel session at the end of the ForumGeneral Information ISSCC 20259 of 9 2025 IEEE I
11、nternational Solid-State Circuits ConferenceISSCC 2025 ForumsDataflow:Is it all About Algorithms and Application Targets?NOOsman UnsalBarcelona Supercomputing CenterISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?1 of 44 2025 IEEE International Solid-State Circuits C
12、onferenceOutlineWhy Dataflow?Because it is the ultimate orderDataflow in microarchitectureDataflow in VLSIDataflow in programming models2 of 44ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?2025 IEEE International Solid-State Circuits ConferenceThere is ample instru
13、ction level parallelismProvided write after write and write after read dependencies are recastThrough renaming mechanismProviding physical registersTo eliminate false dependenciesResulting in out of order processorsInstruction commit is still in-orderHowever,instructions execute in dataflow(not prog
14、ram)order Subject to true dependencies(read after write)Dataflow in microarchitectureISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?3 of 44 2025 IEEE International Solid-State Circuits ConferenceDataflow in VLSI The input to output combinatorial logic is in overall
15、dataflow order Simulators to checkers operate in dataflow fashion Asyncronous systems without a clock depend on dataflow Wave-pipeliningTatapudi et al,Symposium on VLSI 2005 ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?4 of 44 2025 IEEE International Solid-State C
16、ircuits ConferenceTask-based dataflow Programing ModelsTask dependency clauses in OpenMP4.0 Flow Graph Intel TBB V4.0OmpSs5 of 44ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?2025 IEEE International Solid-State Circuits ConferenceOmpSs programming modelTask-based p
17、arallel programming modelProgrammer defines tasks by using pragmasTask is an executable unit,multiple tasks can run in parallelTask has in/out parameters that describe the data dependenciesISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?6 of 44 2025 IEEE Internationa
18、l Solid-State Circuits ConferenceOmpSs dataflow runtimeRuntime support to OmpSs applicationa dataflow(task dependency)grapha pool of threadsa ready task queueA thread picks up a ready task and executes it on an available resourceISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Applicatio
19、n Targets?7 of 44 2025 IEEE International Solid-State Circuits ConferenceOpenMP3.0 Tasking stylevoid Cholesky(float*A)int i,j,k;for(k=0;kNT;k+)spotrf(Ak*NT+k);for(i=k+1;iNT;i+)#pragma omp taskstrsm(Ak*NT+k,Ak*NT+i);#pragma omp taskwaitfor(i=k+1;iNT;i+)for(j=k+1;ji;j+)#pragma omp task sgemm(Ak*NT+i,A
20、k*NT+j,Aj*NT+i);#pragma omp task ssyrk(Ak*NT+i,Ai*NT+i);#pragma omp taskwaitvoid Cholesky(float*A)int i,j,k;for(k=0;kNT;k+)spotrf(Ak*NT+k);#pragma omp parallel forfor(i=k+1;iNT;i+)strsm(Ak*NT+k,Ak*NT+i);for(i=k+1;iNT;i+)#pragma omp parallel forfor(j=k+1;ji;j+)sgemm(Ak*NT+i,Ak*NT+j,Aj*NT+i);ssyrk(Ak*
21、NT+i,Ai*NT+i);barriersTSTSNBNBTSTSISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?8 of 44 2025 IEEE International Solid-State Circuits Conferencevoid Cholesky(float*A)int i,j,k;for(k=0;kNT;k+)spotrf(Ak*NT+k);for(i=k+1;iNT;i+)strsm(Ak*NT+k,Ak*NT+i);/update trailing su
22、bmatrixfor(i=k+1;iNT;i+)for(j=k+1;j CRC The asynchronous execution characteristics:Coding calculation could be overlapped with regular execution of other tasksCompared to hardware ECCs,memory overheads of CRC-based software mechanisms is orders of magnitude less:Awareness of the minimal set of data
23、that needs to be protectedWith hardware ECCs all the memory has to be protectedRuntime enables flexibility for the user in terms of what type protection to deployISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?16 of 44 2025 IEEE International Solid-State Circuits Con
24、ferenceCRC Scheme(CRC)ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?17 of 44 2025 IEEE International Solid-State Circuits ConferenceHardware Optimization Utilizing Intel CRC32 instruction belongs to the SSE4.2 family iSCSI polynomial(0 x11EDC6F41)Two versions imple
25、mented:Naive version:Going over byte by byte of task outputs Advanced CRC calculation:Divide the output in three and calculate in parallel and independently CRCs and then merge three CRCs.Significant reduction of CRC calculation overheadISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Ap
26、plication Targets?18 of 44 2025 IEEE International Solid-State Circuits ConferenceResults:Performance OverheadsBenchmarksCRC(Runtime)CRC(Hardware-Nai.)CRC(Hardware-Adv.)Sparse LU6.70%2.99%0.41%Perlin4.37%1.92%0.52%CG8.27%4.20%2.70%FFT37.20%17.39%8.09%Cholesky31.53%12.56%2.08%Knapsack1.21%0.97%0.85%P
27、ingpong0.28%0.12%0.03%Nbody1.96%1.27%0.18%Matmul2.03%1.35%0.69%Average10.39%4.75%1.73%ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?19 of 44 2025 IEEE International Solid-State Circuits ConferenceTask Redundancy(Fault Detection):ImplementationProgramstatements.prag
28、ma.statementsTask instanceInputs,inouts,outputsstatements.Regular version-Satisfy dependences and wait for resources availability-Run*Instance of task is run in parallel within the rest of the task instances ExecutionFD version-Satisfy dependences and wait for resources availability-Checkpoint input
29、s,inoutsusing the concurrent backup handler-Run*and Run parallel duplicated-run;-while(different results)Restore checkpoint using the handlerRerun one instanceTask instancestatements.Inputs,inouts,outputsCheckpoint inputs,inoutsTask instancestatements.Inputs,inouts,outputsCheckpoint inputs,inoutsExe
30、cutionExecutionbackupRestoreExecutionISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?20 of 44 2025 IEEE International Solid-State Circuits ConferenceResults:Task ReplicationMulti-Node Scalability0%1%2%3%4%5%6%7%8%9%Overheads of ReplicationOverhead is defined w.r.t.to
31、tal application execution timeISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?21 of 44 2025 IEEE International Solid-State Circuits ConferenceSelective Task ReplicationIncrease applications reliability and thus its chance for successful completion by selectively repl
32、icating tasks.Set a FIT threshold to be adhered during the application execution.Per application Failures in time(FIT)Heuristic Assuming we have only%x spare resources for replication,maximize reliability by selectively replicating%x of the tasks.Replication Threshold HeuristicISSCC 2025-Forum 3.1:D
33、ataflow:Is it all About Algorithms and Application Targets?22 of 44 2025 IEEE International Solid-State Circuits ConferenceSelective Task ReplicationReliability modelHeuristicDecision if a task is to be checkpointedEstimates the possible reliability improvement if a task is replicated.Accounts for t
34、he costs of replicationISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?23 of 44 2025 IEEE International Solid-State Circuits ConferenceTask reliability modelTakes into account various task&application characteristics#of inputs/outputsSize of input in MBTask dependenc
35、iesWhen input was generated as outputExpected execution timeOthersTask reliability in FIT(Faults in Time 1 Billion hours)Other task propertiesInputs,outputsDependenciesTask reliability modelSome tasks are more important with respect to reliability than others.ISSCC 2025-Forum 3.1:Dataflow:Is it all
36、About Algorithms and Application Targets?24 of 44 2025 IEEE International Solid-State Circuits ConferenceHeuristic for selective replicationDesign principleSystem may have spare idle resources(e.g.10%of the resources are idle)Selective replication to that uses the idle resources to improve reliabili
37、tyThe heuristic(Target_Rep)Inspects the queue of ready tasks for executionBy applying the reliability model estimate how much the overall reliability of the application will increase if a particular task is replicatedReplicate those that would increase reliability most within the envelop of availabl
38、e idle resourcesLightweight and online operation without requiring prior offline profiling and analysis.ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?25 of 44 2025 IEEE International Solid-State Circuits ConferenceResults:Selective task replicationDifference from t
39、he optimal solutionx-axis:The%of tasks replicatedy-axis:The difference in%between the optimal solution and the FIT rates that the model estimates for the given percentage of task replication.The optimal task selection would improve the FIT by that%0%5%10%15%20%25%30%10%20%30%40%50%60%70%80%90%Averag
40、e Difference from the Optimal SolutionISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?26 of 44 2025 IEEE International Solid-State Circuits ConferenceAtomic Dataflow Motivation Parallelizing complex programs:Large tasks and irregular parallelism diminishing results.C
41、an we express coordination between tasks explicitly and reduce conflicts caused by this synchronization?Dataflow?Expressive,deterministic and inherently parallel.No data sharing,no implicit state.Not a general solution.Not a good fit for certain kind of problems which are naturally solved using the
42、concept of state.ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?27 of 44 2025 IEEE International Solid-State Circuits ConferenceIntroduction Atomic Dataflow(ADF)parallel programming model.Combines dataflow and shared memory programming.Increases programming expressi
43、veness.Emphasis is on problems that exhibit irregular parallelism.Key Point:Dataflow scheduling between atomic tasksStraightforward patterns Explicit data dependencies.Programmers explicitly define input data dependencies.Obscure patterns Atomic execution.Parallelization and the atomic execution are
44、 implicit.ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?28 of 44 2025 IEEE International Solid-State Circuits ConferenceMotivating exampleMax function with the global maximum:a)dataflow implementation,b)shared memory implementation and c)ADF implementation.MAXxyGlo
45、balmaxmax(x,y)wait(x);wait(y);z=max(x,y);atomic if(zg_max)g_max=z;MAXxymax(x,y)Globalmaxa)b)c)ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?29 of 44 2025 IEEE International Solid-State Circuits ConferenceReal World Example Game Engine Highly irregular application d
46、ifficult to parallelize.The server processes clients requests.Updates the game world.Game world represented with the areanode tree.State of the world encoded in the game world buffer.The buffer is sent to clients to keep the game consistent.Areanode treeregion lockinggame worldISSCC 2025-Forum 3.1:D
47、ataflow:Is it all About Algorithms and Application Targets?30 of 44 2025 IEEE International Solid-State Circuits ConferenceGame engine parallelizationPthreadsRegion based locking of the areanode tree.Coarse grained approachNegative effects on load balancing.Transactional MemoryMore fine grained sync
48、hronization.Sub-optimal performance due to the TM overhead.Gajinov,V.,Zyulkyarov,F.,Unsal,O.S.,Cristal,A.,Ayguade,E.,Harris,T.,and Valero,M.QuakeTM:parallelizing a complex sequential application using transactional memory.In Proceedings of the 23rd international conference on Supercomputing(New York
49、,NY,USA 2009),ACM,126-135.ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?31 of 44 2025 IEEE International Solid-State Circuits ConferenceGame Engine ADF ApproachWe partition the game world in a number of game areas.The areas correspond to the leaves of the areanode
50、treeWe define a dedicated dataflow task for each game area.We transform the areanode tree synchronization into a dataflow execution.We restrict the processing to a single area at the time.If the action crosses the area boundaries,we generate a new token for the neighboring task.The tasks are atomic.
51、The game world state is kept consistent.source areadestination areaobjects trajectorygenerate tokentrigger taskcontinuemovetask partitioned game worldISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?32 of 44 2025 IEEE International Solid-State Circuits ConferenceAtomi
52、c Dataflow Model Basic unit of work the ADF task.Task executes automatically implicit transactionThe task blocks if input dependencies are not satisfied.Data dependencies defined using the trigger set extension.For each data from the trigger set implicit buffer.Syntax:#pragma adf_task trigger_set()u
53、ntil(exit_condition)instances()relaxed ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?33 of 44 2025 IEEE International Solid-State Circuits ConferenceThe ADF APIprogrammer APIvoid adf_init(int num_threads);Initializes the ADF runtime.void adf_start();Unblocks worker
54、 threads.void adf_taskwait();Called when a thread needs to wait for the end of a dataflow execution.void adf_terminate();Destroys the ADF runtime environment.internal APIvoid adf_create_task(int num_instances,int num_tokens,void*tokens,std:function fn);Creates a dataflow task.void adf_pass_token(voi
55、d*addr,void*token,size_t token_size);Called when a task needs to pass newly generated output tokens.void adf_task_stop();Called by a task when its until condition is set.ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?34 of 44 2025 IEEE International Solid-State Circ
56、uits ConferenceThe ADF Execution ModelThe ADF program consists of a number of task regions that are executed in a program order.The main program thread:initializes the ADF runtime system,initiates the execution of all task regions in a program order,stops the ADF runtime and terminates the program.I
57、SSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?35 of 44 2025 IEEE International Solid-State Circuits ConferenceTask Region ExecutionWhen the main thread encounters a new task region,it:1.creates all ADF tasks that belong to that task region,2.generates initial tokens
58、 which are necessary to start the new dataflow execution3.starts the dataflow execution by unblocking the worker threads4.waits for the end of the dataflow execution of the current task region after which it 5.destroys all constituent tasks,and 6.moves to the next task region.ISSCC 2025-Forum 3.1:Da
59、taflow:Is it all About Algorithms and Application Targets?36 of 44 2025 IEEE International Solid-State Circuits ConferenceTask ExecutionADF task macro dataflow actor Processes one set of input data,Produces output data,andWaits for the new set of input dataStops when the exit condition is set.Consum
60、eStart_TxAbortCommitProduce TokensEnd_TaskYesYesNoNoISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?37 of 44 2025 IEEE International Solid-State Circuits ConferenceMultiple Producers,Multiple Consumers ExampleP1 produces a token for C1P2 produces a token for C2P1 sto
61、res a token x1 into the bufferP2 stores a token x2into the bufferC1 consumes token x1C2 consumes token x2P1P2C1C2xTOKEN BUFFERP1P2C1C2xx1TOKEN BUFFERP1P2C1C2xx1x2TOKEN BUFFERP1P2C1C2xx2TOKEN BUFFERP1P2C1C2xx1x2TOKEN BUFFERP1P2C1C2xTOKEN BUFFERISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms
62、and Application Targets?38 of 44 2025 IEEE International Solid-State Circuits ConferenceEvaluationADF implemented in C/C+.g+compiler version 4.5 TinySTM 11 for STM support.The machine is PowerEdge 6850 four dual-core 64-bit Intel Xeon processors at 3.2 GHz.Each processor unit has 16MB L3 cache memor
63、y.The machine is running SUSE LINUX 10.1 We compare the implementation with implicit task transaction(default)and the implementation with manually placed transactions(relaxed).Test applications:Bounded bufferGame world updateISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Ta
64、rgets?39 of 44 2025 IEEE International Solid-State Circuits ConferenceResults Game(1)ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?40 of 44Task throughput of the Game application with atomic ADF tasks(implicit synchronization).Task throughput of the Game applicatio
65、n with non-atomic ADF tasks(manual synchronization).2025 IEEE International Solid-State Circuits ConferenceResults Game(2)Effects of increasing the number of objects per game domain.Effects of increasing the number of instances per task(implicit synchronization).Effects of increasing the number of i
66、nstances per task(manual synchronization).ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?41 of 44 2025 IEEE International Solid-State Circuits ConferenceResults Bounded BufferTask throughput of the Game application with atomic ADF tasks(implicit synchronization).Tas
67、k throughput of the Game application with non-atomic ADF tasks(manual synchronization).ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?42 of 44 2025 IEEE International Solid-State Circuits ConferenceAtomic Dataflow ModelProposal for the Atomic Dataflow model:Combines
68、 shared memory and dataflow parallel programmingEnhances the programmability of irregular applications by exposing the expressiveness of the dataflow directly to a programmer.Improves the performance of transactional memory execution by eliminating unnecessary conflicts.The model is a good fit for P
69、roducer-Consumer type of problems and especially for problems that exhibit irregular parallelism.ISSCC 2025-Forum 3.1:Dataflow:Is it all About Algorithms and Application Targets?43 of 44 2025 IEEE International Solid-State Circuits ConferenceThank youcontact:osman.unsalbsc.esISSCC 2025-Forum 3.1:Dat
70、aflow:Is it all About Algorithms and Application Targets?44 of 44 2025 IEEE International Solid-State Circuits ConferenceDataflow Optimization andManagement for ML AcceleratorsMarian Verhelst,KU Leuven MICAS&imecmarian.verhelstkuleuven.beFebruary 2025Marian VerhelstISSCC 2025 Forum F3.21 of 60 2025
71、IEEE International Solid-State Circuits ConferenceML accelerator performance:a tale of 2 rooflines.From CPU.to NPUShifting right on the roofline,a matter of data reuseExploiting spatial and temporal reuseImpact of array size,precision and sparsityApproaching the roofline,a matter of utilizationThe u
72、tilization problemImpact of scheduling and data layoutDecoupled AI accelerators:HW&SW decoupling of spatial dataflow temporal schedule memory layoutOutlineMarian VerhelstISSCC 2025 Forum F3.23 of 60 2025 IEEE International Solid-State Circuits ConferenceML accelerator performance:a tale of 2 rooflin
73、es.From CPU.to NPUShifting right on the roofline,a matter of data reuseExploiting spatial and temporal reuseImpact of array size,precision and sparsityApproaching the roofline,a matter of utilizationThe utilization problemImpact of scheduling and data layoutDecoupled AI accelerators:HW&SW decoupling
74、 of spatial dataflow temporal schedule memory layoutOutlineMarian VerhelstISSCC 2025 Forum F3.24 of 60 2025 IEEE International Solid-State Circuits ConferenceProcessor performance in function of“arithmetic intensity(AI)”AI=operations/bytes of memory accessAssumes simultaneous data fetching and compu
75、teThe performance roofline for CPUMarian Verhelstcompute boundmemory bounddrawing for Nop=8,Bmem=8bytes/cycleNopISSCC 2025 Forum F3.25 of 60 2025 IEEE International Solid-State Circuits ConferenceProcessor performance in function of“arithmetic intensity(AI)”AI=operations/bytes of memory accessAssume
76、s simultaneous data fetching and computeThe performance roofline for CPUMarian Verhelstcompute boundmemory boundAttainable performance OPs/cycle=MIN(Mem BW bytes/cycle*AI OPs/byte,NopOPs/cycle)drawing for Nop=8,Bdram=8bytes/cycleNopISSCC 2025 Forum F3.26 of 60 2025 IEEE International Solid-State Cir
77、cuits ConferenceProcessor performance in function of“arithmetic intensity(AI)”AI=operations/bytes of memory accessAssumes simultaneous data fetching and computeThe performance roofline for CPUMarian Verhelstcompute boundmemory boundAttainable performance OPs/cycle=MIN(Mem BW bytes/cycle*AI OPs/byte,
78、NopOPs/cycle)drawing for Nop=8,Bdram=8bytes/cycleNopISSCC 2025 Forum F3.26 of 60 2025 IEEE International Solid-State Circuits ConferenceProcessor performance in function of“arithmetic intensity(AI)”AI=operations/bytes of memory accessAssumes simultaneous data fetching and computeThe performance roof
79、line for CPUMarian Verhelstcompute boundmemory boundAttainable performance OPs/cycle=MIN(Mem BW bytes/cycle*AI OPs/byte,NopOPs/cycle)drawing for Nop=8,Bdram=8bytes/cycleNopISSCC 2025 Forum F3.26 of 60 2025 IEEE International Solid-State Circuits ConferenceProcessor performance in function of“arithme
80、tic intensity(AI)”AI=operations/bytes of memory accessAssumes simultaneous data fetching and computeThe performance roofline for CPUMarian Verhelstcompute boundmemory boundAttainable performance OPs/cycle=MIN(Mem BW bytes/cycle*AI OPs/byte,NopOPs/cycle)memory bound compute boundAI=Nop/Bdrawing for N
81、op=8,Bdram=8bytes/cycleNopISSCC 2025 Forum F3.26 of 60 2025 IEEE International Solid-State Circuits ConferenceExample:M-Add:multiply addData fetch:input1,input2,partial sumData store:sum 4 words*4bytes/word=16bytesCompute:2 operations AI=2 OPs/16bytes=0.125 OPs/byteThe performance roofline for CPUMa
82、rian Verhelstcompute boundmemory boundX+NopISSCC 2025 Forum F3.27 of 60 2025 IEEE International Solid-State Circuits ConferenceProcessor energy efficiency in function of“arithmetic intensity(AI)”Energy is SUM of memory and operational energyThe energy roofline for CPUMarian Verhelstcompute boundmemo
83、ry bound1/Eopdrawing for Ecomp=0.5pJ/op,Edram=50pJ/byteISSCC 2025 Forum F3.28 of 60 2025 IEEE International Solid-State Circuits ConferenceProcessor energy efficiency in function of“arithmetic intensity(AI)”Energy is SUM of memory and operational energyThe energy roofline for CPUMarian Verhelstcompu
84、te boundmemory boundAttainable efficiency OPs/Joule=1(EcompJoules/OP+EmemJoule/byte/AI OPs/byte)1/Eopdrawing for Ecomp=0.5pJ/op,Edram=50pJ/byteISSCC 2025 Forum F3.28 of 60 2025 IEEE International Solid-State Circuits ConferenceProcessor energy efficiency in function of“arithmetic intensity(AI)”Energ
85、y is SUM of memory and operational energyThe energy roofline for CPUMarian Verhelstcompute boundmemory boundAttainable efficiency OPs/Joule=1(EcompJoules/OP+EmemJoule/byte/AI OPs/byte)1/Eopmemcompenergydrawing for Ecomp=0.5pJ/op,Edram=50pJ/byteISSCC 2025 Forum F3.28 of 60 2025 IEEE International Sol
86、id-State Circuits ConferenceProcessor energy efficiency in function of“arithmetic intensity(AI)”Energy is SUM of memory and operational energyThe energy roofline for CPUMarian Verhelstcompute boundmemory boundAttainable efficiency OPs/Joule=1(EcompJoules/OP+EmemJoule/byte/AI OPs/byte)1/Eopmemcompene
87、rgymemcompdrawing for Ecomp=0.5pJ/op,Edram=50pJ/byteISSCC 2025 Forum F3.28 of 60 2025 IEEE International Solid-State Circuits ConferenceProcessor energy efficiency in function of“arithmetic intensity(AI)”Energy is SUM of memory and operational energyThe energy roofline for CPUMarian Verhelstcompute
88、boundmemory boundAttainable efficiency OPs/Joule=1(EcompJoules/OP+EmemJoule/byte/AI OPs/byte)memory bound compute bound1/EopmemcompenergymemcompmemcompAI=Emem/Ecompdrawing for Ecomp=0.5pJ/op,Edram=50pJ/byteISSCC 2025 Forum F3.28 of 60 2025 IEEE International Solid-State Circuits ConferenceThe roofli
89、neS for CPUMarian Verhelstcompute boundmemory boundAI=Nop/BNopcompute boundmemory bound1/EopAI=Emem/EcompISSCC 2025 Forum F3.29 of 60 2025 IEEE International Solid-State Circuits ConferenceOptimal place in roofline(s)?Can be memory bound in one and compute bound in other one.Is this the same for NPU
90、s?The rooflineS for CPUMarian Verhelstcompute boundmemory boundAI=Nop/BNopcompute boundmemory bound1/EopAI=Emem/EcompISSCC 2025 Forum F3.29 of 60 2025 IEEE International Solid-State Circuits ConferenceWeightMain DRAMStorageWeights WeightOff-chipDRAM storageOn-chip SRAMDatapathWeights Layer inputsLay
91、er outputsProcessing element(PE)regX+regregSwitching to NPUsMarian Verhelstfor(m=0 to M-1);for each rowfor(n=0 to N-1);for each output columnfor(k=0 to K-1);for each input channelomn+=imk*wkn;ISSCC 2025 Forum F3.2MKMNKN10 of 60 2025 IEEE International Solid-State Circuits ConferenceWeightMain DRAMSt
92、orageWeights WeightOff-chipDRAM storageOn-chip SRAMDatapathWeights Layer inputsLayer outputsProcessing element(PE)regX+regregSwitching to NPUsMarian Verhelstfor(m=0 to M-1);for each rowfor(n=0 to N-1);for each output columnfor(k=0 to K-1);for each input channelomn+=imk*wkn;Compared to CPU:Many more
93、PEs(e.g.100 x to 10,000 x)Lower precision data elements(e.g.4x less precision)ISSCC 2025 Forum F3.2MKMNKN10 of 60 2025 IEEE International Solid-State Circuits ConferenceThe rooflines for NPUMarian Verhelstcompute boundmemory boundAI=Nop/BNopcompute boundmemory bound1/EopAI=Emem/EcompISSCC 2025 Forum
94、 F3.211 of 60 2025 IEEE International Solid-State Circuits ConferenceMore operators:performance more memory boundedThe rooflines for NPUMarian Verhelstcompute boundmemory boundAI=Nop/BNopcompute boundmemory bound1/EopAI=Emem/EcompAI*100!ISSCC 2025 Forum F3.211 of 60 2025 IEEE International Solid-Sta
95、te Circuits ConferenceMore operators:performance more memory boundedLess improvement in Eop:cut off points closer together.The rooflines for NPUMarian Verhelstcompute boundmemory boundAI=Nop/BNopcompute boundmemory bound1/EopAI=Emem/EcompAI*100!ISSCC 2025 Forum F3.211 of 60 2025 IEEE International S
96、olid-State Circuits ConferenceMore operators:performance more memory boundedLess improvement in Eop:cut off points closer together.The rooflines for NPUMarian Verhelstcompute boundmemory boundAI=Nop/BNopcompute boundmemory bound1/EopAI=Emem/EcompAI*100!AI*4ISSCC 2025 Forum F3.211 of 60 2025 IEEE Int
97、ernational Solid-State Circuits ConferenceMore operators:performance more memory boundedLess improvement in Eop:cut off points closer together.Performance is terrible at AI1.The rooflines for NPUMarian Verhelstcompute boundmemory boundAI=Nop/BNopcompute boundmemory bound1/EopAI=Emem/EcompAI*100!AI*4
98、ISSCC 2025 Forum F3.211 of 60 2025 IEEE International Solid-State Circuits ConferenceMore operators:performance more memory boundedLess improvement in Eop:cut off points closer together.Performance is terrible at AI1.Increase Bmem$Shift right,but needs AI 100 to 1000?The rooflines for NPUMarian Verh
99、elstcompute boundmemory boundAI=Nop/BNopcompute boundmemory bound1/EopAI=Emem/EcompAI*100!AI*4ISSCC 2025 Forum F3.211 of 60 2025 IEEE International Solid-State Circuits ConferenceML accelerator performance:a tale of 2 rooflines.From CPU.to NPUShifting right on the roofline,a matter of data reuseExpl
100、oiting spatial and temporal reuseImpact of array size,precision and sparsityApproaching the Roofline,a matter of utilizationThe utilization problemImpact of scheduling and data layoutDecoupled AI accelerators:Decouple&orthogonalize spatial dataflow temporal schedule memory layoutOutlineMarian Verhel
101、stISSCC 2025 Forum F3.212 of 60 2025 IEEE International Solid-State Circuits ConferenceWeightMain DRAMStorageWeights WeightOff-chipDRAM storageOn-chip SRAMDatapathWeights Layer inputsLayer outputsProcessing element(PE)regX+regregAI within an NPU:MAC levelMarian Verhelstfor(m=0 to M-1);for each rowfo
102、r(n=0 to N-1);for each output columnfor(k=0 to K-1);for each input channelomn+=imk*wkn;AI of 8bit MAC:4 words of 1 byte;2 ops AIreg=2/4=0.5 OPs/byteISSCC 2025 Forum F3.213 of 60 2025 IEEE International Solid-State Circuits ConferenceWeightMain DRAMStorageWeights WeightOff-chipDRAM storageOn-chip SRA
103、MDatapathWeights Layer inputsLayer outputsProcessing element(PE)regX+regregAI within an NPU:spatial unrollingMarian VerhelstAI of 8bit MAC:4 words of 1 byte;2 ops AIreg=2/4=0.5 OPs/bytespatial unrollingfor(m=0 to M-1);for each rowfor(n=0 to N-1);for each output columnfor(k=0 to K-1);for each input c
104、hannelomn+=imk*wkn;for(k2=0 to K/4-1);for(m2=0 to M/4-1);for(n=0 to N-1);parfor(m1=0 to 3);parfor(k1=0 to 3);omn+=imk*wkn;ISSCC 2025 Forum F3.214 of 60 2025 IEEE International Solid-State Circuits ConferenceDifferent types of spatial reuse can(should!)be combined to get decent AIIntermezzo:Types of
105、spatial unrolling for spatial data reuseMarian Verhelstxxxxweightinputs+outputsfor(k=0 to K-1);for(m2=0 to M/4-1);for(n=0 to N-1);parfor(m1=0 to 3);omn+=imk*wkn;Weight reuseInput reuseOutput reuseISSCC 2025 Forum F3.215 of 60 2025 IEEE International Solid-State Circuits ConferenceDifferent types of
106、spatial reuse can(should!)be combined to get decent AIIntermezzo:Types of spatial unrolling for spatial data reuseMarian Verhelstxxxxweightsinput+outputsxxxxweightinputs+outputsfor(k=0 to K-1);for(m2=0 to M/4-1);for(n=0 to N-1);parfor(m1=0 to 3);omn+=imk*wkn;Weight reuseInput reuseOutput reusefor(k=
107、0 to K-1);for(m=0 to M-1);for(n2=0 to N/4-1);parfor(n1=0 to 3);omn+=imk*wkn;ISSCC 2025 Forum F3.215 of 60 2025 IEEE International Solid-State Circuits ConferenceDifferent types of spatial reuse can(should!)be combined to get decent AIIntermezzo:Types of spatial unrolling for spatial data reuseMarian
108、 Verhelstinputsxxxx+weightsxxxxweightsinput+outputsxxxxweightinputs+outputsfor(k=0 to K-1);for(m2=0 to M/4-1);for(n=0 to N-1);parfor(m1=0 to 3);omn+=imk*wkn;Weight reuseInput reuseOutput reusefor(k=0 to K-1);for(m=0 to M-1);for(n2=0 to N/4-1);parfor(n1=0 to 3);omn+=imk*wkn;for(k2=0 to K/4-1);for(m=0
109、 to M-1);for(n=0 to N-1);parfor(k1=0 to 3);omn+=imk*wkn;ISSCC 2025 Forum F3.215 of 60 2025 IEEE International Solid-State Circuits ConferenceSpatial unrolling impact on AI:2DMarian Verhelst2D MAC array with weight&input reuseArithmetic intensity?for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=i
110、mk *wkn;WI.32 x 32 MACs .SpatialISSCC 2025 Forum F3.2E.g.Tesla NPU,Google TPU,.16 of 60 2025 IEEE International Solid-State Circuits ConferenceSpatial unrolling impact on AI:2DMarian Verhelst2D MAC array with weight&input reuseArithmetic intensity?Read I=32 word/ccRead W=32 word/ccRead O-1=32*32 wor
111、d/ccWrite O=32*32 word/ccAI(spatial)=2048 ops ./(2*32+2*1024)words=1for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=imk *wkn;WI.32 x 32 MACs .SpatialISSCC 2025 Forum F3.2E.g.Tesla NPU,Google TPU,.16 of 60 2025 IEEE International Solid-State Circuits ConferenceSpatial unrolling impact on AI:3DMa
112、rian Verhelst3D MAC array with weight&input&output reuseArithmetic intensity?for(k2=0 to.)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)omn+=imk *wkn;WI.8 x 8 x 16 MACs .SpatialISSCC 2025 Forum F3.2E.g.Tensor cores,DaVinci core,.17 of 60 2025 IEEE International Solid-State Circuits ConferenceS
113、patial unrolling impact on AI:3DMarian Verhelst3D MAC array with weight&input&output reuseArithmetic intensity?Read I=8*16 word/ccRead W=8*16 word/ccRead O-1=8*8 word/ccWrite O=8*8 word/ccAI(spatial)=2048 ops ./(2*128+2*64)words=5.4for(k2=0 to.)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)omn
114、+=imk *wkn;WI.8 x 8 x 16 MACs .3D better!But still not that good.SpatialISSCC 2025 Forum F3.2E.g.Tensor cores,DaVinci core,.17 of 60 2025 IEEE International Solid-State Circuits ConferenceWeightMain DRAMStorageWeights WeightOff-chipDRAM storageOn-chip SRAMDatapathWeights Layer inputsLayer outputsPro
115、cessing element(PE)regX+regregAI within an NPU:temporal unrollingMarian VerhelstAI of 8bit MAC:4 words of 1 byte;2 ops AIreg=2/4=0.5 OPs/byteAI of MAC array:AIarray 5 OPs/bytespatial unrollingfor(m=0 to M-1);for(n=0 to N-1);for(k=0 to K-1);o+=i*w;for(k2=0 to K/4-1);for(m2=0 to M/4-1);for(n=0 to N-1)
116、;parfor(m1=0 to 3);parfor(k1=0 to 3);omn+=imk*wkn;ISSCC 2025 Forum F3.218 of 60 2025 IEEE International Solid-State Circuits ConferenceWeightMain DRAMStorageWeights WeightOff-chipDRAM storageOn-chip SRAMDatapathWeights Layer inputsLayer outputsProcessing element(PE)regX+regregAI within an NPU:tempor
117、al unrollingMarian VerhelstAI of 8bit MAC:4 words of 1 byte;2 ops AIreg=2/4=0.5 OPs/byteAI of MAC array:AIarray 5 OPs/bytespatial unrollingfor(m=0 to M-1);for(n=0 to N-1);for(k=0 to K-1);o+=i*w;for(k2=0 to K/4-1);for(m2=0 to M/4-1);for(n=0 to N-1);parfor(m1=0 to 3);parfor(k1=0 to 3);omn+=imk*wkn;tem
118、poral unrollingfor(k2=0 to K/4-1);for(m2=0 to M/4-1);for(n=0 to N-1);parfor(m1=0 to 3);parfor(k1=0 to 3);omn+=imk*wkn;ISSCC 2025 Forum F3.218 of 60 2025 IEEE International Solid-State Circuits ConferenceDifferent types of temporal reuse can NOT be combined(at 1 memory level)Intermezzo:Types of tempo
119、ral unrolling for temporal data reuseMarian VerhelstWeight stationary Input stationary Output stationary(with output reuse)weightsinputxxxx+for(k2=0 to K/4-1);for(n=0 to N-1);for(m=0 to M-1);parfor(k1=0 to 3);omn+=imk*wkn;ISSCC 2025 Forum F3.219 of 60 2025 IEEE International Solid-State Circuits Con
120、ferenceDifferent types of temporal reuse can NOT be combined(at 1 memory level)Intermezzo:Types of temporal unrolling for temporal data reuseMarian VerhelstWeight stationary Input stationary Output stationary(with output reuse)weightsxxxx+weightsinputxxxx+for(k2=0 to K/4-1);for(m=0 to M-1);for(n=0 t
121、o N-1);parfor(k1=0 to 3);omn+=imk*wkn;for(k2=0 to K/4-1);for(n=0 to N-1);for(m=0 to M-1);parfor(k1=0 to 3);omn+=imk*wkn;ISSCC 2025 Forum F3.219 of 60 2025 IEEE International Solid-State Circuits ConferenceDifferent types of temporal reuse can NOT be combined(at 1 memory level)Intermezzo:Types of tem
122、poral unrolling for temporal data reuseMarian VerhelstWeight stationary Input stationary Output stationary(with output reuse)for(n=0 to N-1);for(m=0 to M-1);for(k2=0 to K/4-1);parfor(k1=0 to 3);omn+=imk*wkn;inputsxxxx+weightweightsxxxx+weightsinputxxxx+for(k2=0 to K/4-1);for(m=0 to M-1);for(n=0 to N
123、-1);parfor(k1=0 to 3);omn+=imk*wkn;for(k2=0 to K/4-1);for(n=0 to N-1);for(m=0 to M-1);parfor(k1=0 to 3);omn+=imk*wkn;ISSCC 2025 Forum F3.219 of 60 2025 IEEE International Solid-State Circuits ConferenceTemporal unrolling impact on AI:2DMarian Verhelst2D MAC array with weight&input reuseArithmetic in
124、tensity?Read I=32 word/ccRead W=32 word/ccRead O-1=32*32 word/ccWrite O=32*32 word/ccAI(spatial)=2048 ops ./(2*32+2*1024)words=1for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=imk *wkn;WI.32 x 32 MACs .SpatialISSCC 2025 Forum F3.220 of 60 2025 IEEE International Solid-State Circuits ConferenceT
125、emporal unrolling impact on AI:2DMarian Verhelst2D MAC array with weight&input reuseArithmetic intensity?Read I=32 word/ccRead W=32 word/ccRead O-1=32*32 word/ccWrite O=32*32 word/ccAI(spatial)=2048 ops ./(2*32+2*1024)words=1for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=imk *wkn;WI.32 x 32 MA
126、Cs .SpatialTemporalISSCC 2025 Forum F3.220 of 60 2025 IEEE International Solid-State Circuits ConferenceTemporal unrolling impact on AI:2DMarian Verhelst2D MAC array with weight&input reuseArithmetic intensity?Read I=32 word/ccRead W=32 word/ccRead O-1=32*32 word/ccWrite O=32*32 word/ccAI(spatial)=2
127、048 ops ./(2*32+2*1024)words=1AI(temporal)=2048 ops ./(2*32)words=32!for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=imk *wkn;WI.32 x 32 MACs .SpatialTemporalISSCC 2025 Forum F3.220 of 60 2025 IEEE International Solid-State Circuits ConferenceTemporal unrolling impact on AI:3DMarian Verhelst3D
128、MAC array with weight&input&output reuseArithmetic intensity?Read I=8*16 word/ccRead W=8*16 word/ccRead O-1=8*8 word/ccWrite O=8*8 word/ccAI(spatial)=2048 ops ./(2*128+2*64)words=5.4for(k2=0 to.)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)omn+=imk *wkn;WI.8 x 8 x 16 MACs .2D now better than
129、3D!SpatialISSCC 2025 Forum F3.221 of 60 2025 IEEE International Solid-State Circuits ConferenceTemporal unrolling impact on AI:3DMarian Verhelst3D MAC array with weight&input&output reuseArithmetic intensity?Read I=8*16 word/ccRead W=8*16 word/ccRead O-1=8*8 word/ccWrite O=8*8 word/ccAI(spatial)=204
130、8 ops ./(2*128+2*64)words=5.4for(k2=0 to.)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)omn+=imk *wkn;WI.8 x 8 x 16 MACs .2D now better than 3D!SpatialTemporalISSCC 2025 Forum F3.221 of 60 2025 IEEE International Solid-State Circuits ConferenceTemporal unrolling impact on AI:3DMarian Verhelst3
131、D MAC array with weight&input&output reuseArithmetic intensity?Read I=8*16 word/ccRead W=8*16 word/ccRead O-1=8*8 word/ccWrite O=8*8 word/ccAI(spatial)=2048 ops ./(2*128+2*64)words=5.4AI(temporal)=2048 ops ./(2*128)words=8for(k2=0 to.)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)omn+=imk *wkn
132、;WI.8 x 8 x 16 MACs .2D now better than 3D!SpatialTemporalISSCC 2025 Forum F3.221 of 60 2025 IEEE International Solid-State Circuits ConferenceWeightMain DRAMStorageWeights WeightOff-chipDRAM storageOn-chip SRAMDatapathWeights Layer inputsLayer outputsProcessing element(PE)regX+regregAI within an NP
133、U:multi-level stationarityMarian VerhelstAI of 8bit MAC:4 words of 1 byte;2 ops AIreg=2/4=0.5 OPs/byteAImem 10 OPs/bytesp&tempunrollingfor(m=0 to M-1);for(n=0 to N-1);for(k=0 to K-1);o+=i*w;for(k2=0 to K/4-1);for(m2=0 to M/4-1);for(n=0 to N-1);parfor(m1=0 to 3);parfor(k1=0 to 3);omn+=imk*wkn;ISSCC 2
134、025 Forum F3.222 of 60 2025 IEEE International Solid-State Circuits ConferenceWeightMain DRAMStorageWeights WeightOff-chipDRAM storageOn-chip SRAMDatapathWeights Layer inputsLayer outputsProcessing element(PE)regX+regregAI within an NPU:multi-level stationarityMarian VerhelstAI of 8bit MAC:4 words o
135、f 1 byte;2 ops AIreg=2/4=0.5 OPs/byteAImem 10 OPs/bytesp&tempunrollingfor(m=0 to M-1);for(n=0 to N-1);for(k=0 to K-1);o+=i*w;for(k2=0 to K/4-1);for(m2=0 to M/4-1);for(n=0 to N-1);parfor(m1=0 to 3);parfor(k1=0 to 3);omn+=imk*wkn;temporal unrollingfor(k2=0 to K/4-1);for(m2=0 to M/4-1);for(n=0 to N-1);
136、parfor(m1=0 to 3);parfor(k1=0 to 3);omn+=imk*wkn;AImem 50 OPs/byteDRAMSRAMRegistersEach memory level(and operand)can individually exploit temporal reuse also DRAM and regs!ISSCC 2025 Forum F3.222 of 60 2025 IEEE International Solid-State Circuits ConferenceAIsram/dramshifts to the right:higher achie
137、vable performanceyet,dependent on spatial/temporal unrolling choicesThe performance roofline for NPUMarian VerhelstNopNopcompute boundmemory boundAIdramISSCC 2025 Forum F3.2AIsramAIrf23 of 60 2025 IEEE International Solid-State Circuits ConferenceThe performance roofline for NPUMarian VerhelstNopNop
138、compute boundmemory boundAImem1/Eopcompute dominatedmemory dominatedAImemISSCC 2025 Forum F3.224 of 60 2025 IEEE International Solid-State Circuits ConferenceImpact of quantization?Impact of sparsity?The performance roofline for NPUMarian VerhelstNopNopcompute boundmemory boundAImem1/Eopcompute domi
139、natedmemory dominatedAImemISSCC 2025 Forum F3.224 of 60 2025 IEEE International Solid-State Circuits ConferenceReduced precision=Lower EopThe impact of quantizationMarian Verhelstx+IWO-1O+O-1OxxxxIW32bit8bitISSCC 2025 Forum F3.225 of 60 2025 IEEE International Solid-State Circuits ConferenceReduced
140、precision=Lower EopLarger NopMore spatial unrollingThe impact of quantizationMarian Verhelstx+IWO-1O+O-1OxxxxIW32bit8bitISSCC 2025 Forum F3.225 of 60 2025 IEEE International Solid-State Circuits Conference.8 x 8 x 16 64 MACs .Reduced precision=Lower EopLarger NopMore spatial unrollingThe impact of q
141、uantizationMarian Verhelstx+IWO-1O+O-1OxxxxIW32bit8bitISSCC 2025 Forum F3.2Free to choose to which spatial dimension to give the 4x units to:from 8 x 8 x 16 to:8 x 8 x 6432 x 8 x 1616 x 16 x 1625 of 60 2025 IEEE International Solid-State Circuits Conference.8 x 8 x 16 64 MACs .Reduced precision=Lowe
142、r EopLarger NopMore spatial unrollingLarger AI(bytesops/byte)The impact of quantizationMarian Verhelstx+IWO-1O+O-1OxxxxIW32bit8bitISSCC 2025 Forum F3.2Free to choose to which spatial dimension to give the 4x units to:from 8 x 8 x 16 to:8 x 8 x 6432 x 8 x 1616 x 16 x 16One that maximizes AI,while min
143、imizing HW flexibility cost&underutilization25 of 60 2025 IEEE International Solid-State Circuits ConferenceReduced precision=Lower EopLarger NopMore spatial unrollingLarger AI(bytesops/byte)The performance roofline for NPU:quantizationMarian VerhelstNopNopAIdram1/EopAIdramx+IWO-1O+O-1OxxxxIW32bit8b
144、itISSCC 2025 Forum F3.226 of 60 2025 IEEE International Solid-State Circuits ConferenceReduced precision=Lower EopLarger NopMore spatial unrollingLarger AI(bytesops/byte)The performance roofline for NPU:quantizationMarian VerhelstNopNopAIdram1/EopAIdramx+IWO-1O+O-1OxxxxIW32bit8bitISSCC 2025 Forum F3
145、.226 of 60 2025 IEEE International Solid-State Circuits ConferenceThe impact of(structured)sparsityMarian VerhelstVirtual Nop,Bmem,.=including zero ops,values,.+O-1OxxxxIW+O-1OxxIWISSCC 2025 Forum F3.2Structured sparsity(e.g.2:4)=Either make everything virtual,or everything physical:27 of 60 2025 IE
146、EE International Solid-State Circuits ConferenceThe impact of(structured)sparsityMarian VerhelstVirtual Nop,Bmem,.=including zero ops,values,.+O-1OxxxxIW+O-1OxxIWISSCC 2025 Forum F3.2N(virt.)op=Nnon-zero-op+Nzero_opE(virt.)op=Eop-tot/N(virt.)opStructured sparsity(e.g.2:4)=Larger virtual NopSmaller v
147、irtual EopEither make everything virtual,or everything physical:27 of 60 2025 IEEE International Solid-State Circuits ConferenceThe impact of(structured)sparsityMarian VerhelstVirtual Nop,Bmem,.=including zero ops,values,.+O-1OxxxxIW+O-1OxxIWISSCC 2025 Forum F3.2N(virt.)op=Nnon-zero-op+Nzero_opE(vir
148、t.)op=Eop-tot/N(virt.)opB(virt.)mem=Bnon-zero-data+Bzero-dataE(virt.)mem/byte=Emem-tot/BmemStructured sparsity(e.g.2:4)=Larger virtual NopSmaller virtual EopSlightly larger virtual BmemSlightly smaller virtual EmemSame virtual AIEither make everything virtual,or everything physical:27 of 60 2025 IEE
149、E International Solid-State Circuits ConferenceStructured sparsity(e.g.2:4)=Larger virtual NopSmaller virtual EopSlightly larger virtual BmemSlightly smaller virtual EmemSame virtual AIThe impact of(structured)sparsityMarian Verhelst+O-1OxxxxIW+O-1OxxIWISSCC 2025 Forum F3.2.8 x 8 x 16 64 MACs .28 of
150、 60 2025 IEEE International Solid-State Circuits ConferenceStructured sparsity(e.g.2:4)=Larger virtual NopSmaller virtual EopSlightly larger virtual BmemSlightly smaller virtual EmemSame virtual AIThe impact of(structured)sparsityMarian Verhelst+O-1OxxxxIW+O-1OxxIWISSCC 2025 Forum F3.2.8 x 8 x 16 64
151、 MACs .Which spatial dimension to give the 4x units to?:from 8 x 8 x 16 to:8 x 8 x 328 x 16 x 1616 x 8 x 1628 of 60 2025 IEEE International Solid-State Circuits ConferenceStructured sparsity(e.g.2:4)=Larger virtual NopSmaller virtual EopSlightly larger virtual BmemSlightly smaller virtual EmemSame v
152、irtual AIThe impact of(structured)sparsityMarian Verhelst+O-1OxxxxIW+O-1OxxIWISSCC 2025 Forum F3.2.8 x 8 x 16 64 MACs .Which spatial dimension to give the 4x units to?:from 8 x 8 x 16 to:8 x 8 x 328 x 16 x 1616 x 8 x 16The one that follows sparsity structure,while minimizing HW flexibility cost&unde
153、rutilization28 of 60 2025 IEEE International Solid-State Circuits ConferenceStructured sparsity(e.g.2:4)=Larger(virtual)NopSlightly larger(virtual)BmemSmaller(virtual)EopSlightly smaller(virtual)EmemThe performance roofline for NPU:sparsityMarian VerhelstNopNopAIdram1/EopAIdram+O-1OxxxxIW+O-1OxxIWIS
154、SCC 2025 Forum F3.229 of 60 2025 IEEE International Solid-State Circuits ConferenceStructured sparsity(e.g.2:4)=Larger(virtual)NopSlightly larger(virtual)BmemSmaller(virtual)EopSlightly smaller(virtual)EmemThe performance roofline for NPU:sparsityMarian VerhelstNopNopAIdram1/EopAIdram+O-1OxxxxIW+O-1
155、OxxIWISSCC 2025 Forum F3.229 of 60 2025 IEEE International Solid-State Circuits ConferenceLarge arrays,precision,sparsity=Higher rooflinesLarge AI only good news?The performance roofline for NPU:all combinedMarian VerhelstNopNopAIdram1/EopAIdramISSCC 2025 Forum F3.230 of 60 2025 IEEE International S
156、olid-State Circuits ConferenceLarge arrays,precision,sparsity=Higher rooflinesLarge AI only good news?But.can we always reach this“attainable performance”?The performance roofline for NPU:all combinedMarian VerhelstNopNopAIdram1/EopAIdramISSCC 2025 Forum F3.230 of 60 2025 IEEE International Solid-St
157、ate Circuits ConferenceML accelerator performance:a tale of 2 rooflines.From CPU.to NPUShifting right on the roofline,a matter of data reuseExploiting spatial and temporal reuseImpact of array size,precision and sparsityApproaching the Roofline,a matter of utilizationThe utilization problemImpact of
158、 scheduling and data layoutDecoupled AI accelerators:Decouple&orthogonalize spatial dataflow temporal schedule memory layoutOutlineMarian VerhelstISSCC 2025 Forum F3.231 of 60 2025 IEEE International Solid-State Circuits ConferenceUtilization=average fraction of the time each MAC unit is usefully bu
159、sy=US*UTUtilizationMarian VerhelstUS=Spatial utilization=useful ops/Cctotal ops/CcISSCC 2025 Forum F3.232 of 60 2025 IEEE International Solid-State Circuits ConferenceUtilization=average fraction of the time each MAC unit is usefully busy=US*UTUtilizationMarian VerhelstUS=Spatial utilization=useful
160、ops/Cctotal ops/Ccfor(k=0 to 8)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)8omn+=imk *wkn;.8 x 8 x 16 MACs .9E.g.depthwise layer:ISSCC 2025 Forum F3.232 of 60 2025 IEEE International Solid-State Circuits ConferenceUtilization=average fraction of the time each MAC unit is usefully busy=US*UTU
161、tilizationMarian VerhelstUS=Spatial utilization=useful ops/Cctotal ops/Ccfor(k=0 to 8)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)8omn+=imk *wkn;.8 x 8 x 16 MACs .9E.g.depthwise layer:US=9/16=56%ISSCC 2025 Forum F3.232 of 60 2025 IEEE International Solid-State Circuits ConferenceUtilization=
162、average fraction of the time each MAC unit is usefully busy=US*UTUtilizationMarian VerhelstUS=Spatial utilization=useful ops/Cctotal ops/CcUT=Temp.utilization=active comp.Cctotal Cccomputeoffload datacomputeoffload data50%utilization(50%loss)for(k=0 to 8)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0
163、 to 15)8omn+=imk *wkn;(off)load dataUS=9/16=56%UT=1/2=50%WI.8 x 8 x 16 MACs .ISSCC 2025 Forum F3.233 of 60 2025 IEEE International Solid-State Circuits ConferenceUtilization=average fraction of the time each MAC unit is usefully busy=US*UTUtilizationMarian VerhelstUS=Spatial utilization=useful ops/C
164、ctotal ops/CcUT=Temp.utilization=active comp.Cctotal Cccomputeoffload datacomputeoffload data50%utilization(50%loss)for(k=0 to 8)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)8omn+=imk *wkn;(off)load dataUS=9/16=56%UT=1/2=50%U=28%WI.8 x 8 x 16 MACs .ISSCC 2025 Forum F3.233 of 60 2025 IEEE Inte
165、rnational Solid-State Circuits ConferenceCycles and energy is wasted on idle&stalling compute units.1.2.3.Roofline impact of utilizationMarian VerhelstNopNopAIdram1/EopAIdram1-Us*UT1-Us*UTISSCC 2025 Forum F3.234 of 60 2025 IEEE International Solid-State Circuits ConferenceCycles and energy is wasted
166、 on idle&stalling compute units.How to limit underutilization?1.Spatial optimization2.Temporal optimization3.Data layout optimizationRoofline impact of utilizationMarian VerhelstNopNopAIdram1/EopAIdram1-Us*UT1-Us*UTISSCC 2025 Forum F3.234 of 60 2025 IEEE International Solid-State Circuits Conference
167、Marian Verhelst9ISSCC 2025 Forum F3.2How to limit underutilization?1.Spatial dataflow optimization2D array3D arraydepthwise layer:64*9*6435 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian Verhelst9for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=imk *wkn;.32 x 32 MACs .US=100
168、%ISSCC 2025 Forum F3.2How to limit underutilization?1.Spatial dataflow optimization2D array3D arraydepthwise layer:64*9*6435 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian Verhelstfor(k=0 to 8)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)8omn+=imk *wkn;.8 x 8 x 16 MACs .9
169、US=9/16=56%for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=imk *wkn;.32 x 32 MACs .US=100%ISSCC 2025 Forum F3.2How to limit underutilization?1.Spatial dataflow optimization2D array3D arraydepthwise layer:64*9*6435 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian Verhelstfor(k
170、=0 to 8)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)omn+=imk *wkn;.8 x 8 x 16 MACs .for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=imk *wkn;.32 x 32 MACs .8-7ISSCC 2025 Forum F3.2How to limit underutilization?1.Spatial dataflow optimizationFB layer,B=8:8*64*642D array3D arraydepthwise
171、layer:64*9*64US=9/16=56%US=100%36 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian Verhelstfor(k=0 to 8)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)omn+=imk *wkn;.8 x 8 x 16 MACs .for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=imk *wkn;.32 x 32 MACs .US=100%8-7ISSCC
172、2025 Forum F3.2How to limit underutilization?1.Spatial dataflow optimizationFB layer,B=8:8*64*642D array3D arraydepthwise layer:64*9*64US=9/16=56%US=100%36 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian Verhelstfor(k=0 to 8)parfor(m1=0 to 7)parfor(n1=0 to 7)parfor(k1=0 to 15)omn
173、+=imk *wkn;.8 x 8 x 16 MACs .for(k=0 to.)parfor(m1=0 to 31)parfor(n1=0 to 31)omn+=imk *wkn;.32 x 32 MACs .US=100%US=1/32=3%8-7Need for flexibility,heterogeneity!ISSCC 2025 Forum F3.2How to limit underutilization?1.Spatial dataflow optimizationFB layer,B=8:8*64*642D array3D arraydepthwise layer:64*9*
174、64US=9/16=56%US=100%36 of 60 2025 IEEE International Solid-State Circuits ConferenceExample:DianaMarian VerhelstISSCC 2025 Forum F3.237 of 60 2025 IEEE International Solid-State Circuits ConferenceExample:DianaMarian VerhelstLarge,analog in-memory compute core:1152*512 PEs6bit*1.5bitinput parallel,o
175、utput parallelOptimized for efficiency,if massive spatial unrolling possibleExcellent peak efficiency(200Tops/Watt)Yet,utilization can be badISSCC 2025 Forum F3.237 of 60 2025 IEEE International Solid-State Circuits ConferenceExample:DianaMarian VerhelstLarge,analog in-memory compute core:1152*512 P
176、Es6bit*1.5bitinput parallel,output parallelOptimized for efficiency,if massive spatial unrolling possibleExcellent peak efficiency(200Tops/Watt)Yet,utilization can be badISSCC 2025 Forum F3.238 of 60 2025 IEEE International Solid-State Circuits ConferenceExample:DianaMarian VerhelstLarge,analog in-m
177、emory compute core:1152*512 PEs6bit*1.5bitinput parallel,output parallelOptimized for efficiency,if massive spatial unrolling possibleExcellent peak efficiency(200Tops/Watt)Yet,utilization can be badISSCC 2025 Forum F3.238 of 60 2025 IEEE International Solid-State Circuits ConferenceExample:DianaMar
178、ian VerhelstLarge,analog in-memory compute core:1152*512 PEs6bit*1.5bitinput parallel,output parallelOptimized for efficiency,if massive spatial unrolling possibleSmall,flexible digital AI core:16*16 8bit PEs2x for 4bit,4x for 2bit different spatial flowsOptimized for flexibility,yet,less efficientE
179、xcellent peak efficiency(200Tops/Watt)Yet,utilization can be badNear perfect utilization,yet worse efficiency (4,8,16Tops/W 8,4,2bit)ISSCC 2025 Forum F3.238 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian VerhelstAfter good spatial unrolling data transfers quickly dominate!Need f
180、or bandwidth-&memory-aware scheduling to avoid stallsISSCC 2025 Forum F3.2How to limit underutilization?2.Temporal scheduling optimization39 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian VerhelstAfter good spatial unrolling data transfers quickly dominate!Need for bandwidth-&me
181、mory-aware scheduling to avoid stallsPipelining only123ISSCC 2025 Forum F3.2How to limit underutilization?2.Temporal scheduling optimization39 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian VerhelstAfter good spatial unrolling data transfers quickly dominate!Need for bandwidth-&
182、memory-aware scheduling to avoid stallsPipelining only123980195020040060080010001200Memory(kB)-80%L2L1Memory RequirementsISSCC 2025 Forum F3.2How to limit underutilization?2.Temporal scheduling optimization39 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian VerhelstAfter good spat
183、ial unrolling data transfers quickly dominate!Need for bandwidth-&memory-aware scheduling to avoid stallsPipelining onlyTiling&fusion1231231a1b 1c2a2c2b3a3c3b980195020040060080010001200Memory(kB)-80%L2L1Memory RequirementsISSCC 2025 Forum F3.2How to limit underutilization?2.Temporal scheduling optim
184、ization39 of 60 2025 IEEE International Solid-State Circuits ConferenceMarian VerhelstAfter good spatial unrolling data transfers quickly dominate!Need for bandwidth-&memory-aware scheduling to avoid stallsPipelining onlyTiling&fusion1231231a1b 1c2a2c2b3a3c3b980195020040060080010001200Memory(kB)-80%
185、L2L1Memory RequirementsISSCC 2025 Forum F3.2How to limit underutilization?2.Temporal scheduling optimization39 of 60 2025 IEEE International Solid-State Circuits ConferenceTemporal schedules:single coreMarian VerhelstISSCC 2025 Forum F3.240 of 60 2025 IEEE International Solid-State Circuits Conferen
186、ceTemporal schedules:multi-coreMarian VerhelstISSCC 2025 Forum F3.241 of 60 2025 IEEE International Solid-State Circuits ConferenceExample:scheduling on DianaMarian VerhelstISSCC 2025 Forum F3.2Heavily inter-leaving layer execution42 of 60 2025 IEEE International Solid-State Circuits ConferenceExamp
187、le:scheduling on DianaMarian VerhelstISSCC 2025 Forum F3.2Heavily inter-leaving layer executionKeeping memory needs within L1 bounds42 of 60 2025 IEEE International Solid-State Circuits ConferenceExample:scheduling on DianaMarian VerhelstISSCC 2025 Forum F3.2Heavily inter-leaving layer executionKeep
188、ing memory needs within L1 boundsBut so many options.see further!42 of 60 2025 IEEE International Solid-State Circuits ConferenceEven the best spatial and temporal unrolling,can still suffer from memory memory stalls.Data layout mattersTo maximally utilize memory bandwidths&minimize the need for dat
189、a reshufflingExample:Tiled GeMM operationMarian VerhelstISSCC 2025 Forum F3.2How to limit underutilization?3.Data Layout optimization43 of 60 2025 IEEE International Solid-State Circuits ConferenceEven the best spatial and temporal unrolling,can still suffer from memory memory stalls.Data layout mat
190、tersTo maximally utilize memory bandwidths&minimize the need for data reshufflingExample:Tiled GeMM operationMarian VerhelstISSCC 2025 Forum F3.2How to limit underutilization?3.Data Layout optimizationData layout 1Data layout 243 of 60 2025 IEEE International Solid-State Circuits ConferenceEven the
191、best spatial and temporal unrolling,can still suffer from memory memory stalls.Data layout mattersTo maximally utilize memory bandwidths&minimize the need for data reshufflingExample:Tiled GeMM operationMarian VerhelstISSCC 2025 Forum F3.2How to limit underutilization?3.Data Layout optimizationData
192、layout 1Data layout 2Contention,but compactNo contention,but larger43 of 60 2025 IEEE International Solid-State Circuits ConferenceML accelerator performance:a tale of 2 rooflines.From CPU.to NPUShifting right on the roofline,a matter of data reuseExploiting spatial and temporal reuseImpact of array
193、 size,precision and sparsityApproaching the Roofline,a matter of utilizationThe utilization problemImpact of scheduling and data layoutDecoupled AI accelerators:Decouple&orthogonalize spatial dataflow temporal schedule memory layoutOutlineMarian VerhelstISSCC 2025 Forum F3.244 of 60 2025 IEEE Intern
194、ational Solid-State Circuits ConferenceTo orthogonally optimize dataflow,schedule and data layout,one needs:1.Flexible,yet decoupled accelerators OpenGeMM2.Scheduling space exploration ZigZag/Stream3.Customizable compilation MLIRThree utilization enablersMarian VerhelstISSCC 2025 Forum F3.245 of 60
195、2025 IEEE International Solid-State Circuits ConferenceFlexible Accelerator:OpenGeMM DatapathMixture of design time and run time flexibility1More details:Xiaoling Yi,ASPDAC25Marian VerhelstISSCC 2025 Forum F3.246 of 60 2025 IEEE International Solid-State Circuits ConferenceFlexible Accelerator:OpenG
196、eMM DatapathMixture of design time and run time flexibility Tightly coupled memory systemDesign time flexible1More details:Xiaoling Yi,ASPDAC25Marian VerhelstISSCC 2025 Forum F3.246 of 60 2025 IEEE International Solid-State Circuits ConferenceFlexible Accelerator:OpenGeMM DatapathMixture of design t
197、ime and run time flexibility Tightly coupled memory systemDesign time flexible Buffered data streamersFlexible for any(afine)temporal schedule and data layout1More details:Xiaoling Yi,ASPDAC25Marian VerhelstISSCC 2025 Forum F3.246 of 60 2025 IEEE International Solid-State Circuits ConferenceISSCC 20
198、25 Forum F3.2Decoupled Accelerator:OpenGeMM1More details:Xiaoling Yi,ASPDAC25Decoupled from temporal execution&data layoutAccelerator unaware of temporal pattern(schedulers problem)Streamers unaware of data layout(compilers problem)Self-scheduled and bufferedMarian Verhelst47 of 60 2025 IEEE Interna
199、tional Solid-State Circuits ConferenceTo orthogonally optimize dataflow,schedule and data layout,one needs:1.Flexible,yet decoupled accelerators OpenGeMM2.Scheduling space exploration ZigZag/Stream3.Customizable compilation MLIRThree utilization enablersMarian VerhelstISSCC 2025 Forum F3.248 of 60 2
200、025 IEEE International Solid-State Circuits ConferenceScheduling space explorationMarian Verhelst2ISSCC 2025 Forum F3.249 of 60 2025 IEEE International Solid-State Circuits Conference50 of 60Marian VerhelstISSCC 2025 Forum F3.2ZigZag(single core)&Stream(multi-core)ZIGZAG/STREAMEnergy(MAC energy,memo
201、ry load/store)Latency(array utilization,memory stalls)AreaCost ModelMore details:Mei,TransComp2021GITHUB:https:/ stream2 2025 IEEE International Solid-State Circuits Conference50 of 60Marian VerhelstISSCC 2025 Forum F3.2ZigZag(single core)&Stream(multi-core)ZIGZAG/STREAMEnergy(MAC energy,memory load
202、/store)Latency(array utilization,memory stalls)AreaNN workloadCost ModelMore details:Mei,TransComp2021GITHUB:https:/ stream2 2025 IEEE International Solid-State Circuits Conference50 of 60Marian VerhelstISSCC 2025 Forum F3.2ZigZag(single core)&Stream(multi-core)ZIGZAG/STREAMEnergy(MAC energy,memory
203、load/store)Latency(array utilization,memory stalls)AreaTechnology characteristics(cell size,read cost,wireload,)NN workloadCost ModelHardware architecture&constraints(mem.hierarchy,compute array,interconnect,IO BW,)More details:Mei,TransComp2021GITHUB:https:/ stream2 2025 IEEE International Solid-St
204、ate Circuits Conference50 of 60Marian VerhelstISSCC 2025 Forum F3.2ZigZag(single core)&Stream(multi-core)ZIGZAG/STREAMEnergy(MAC energy,memory load/store)Latency(array utilization,memory stalls)AreaTechnology characteristics(cell size,read cost,wireload,)NN workloadCost ModelHardware architecture&co
205、nstraints(mem.hierarchy,compute array,interconnect,IO BW,)Mapping(spatial&temporal unrolling)More details:Mei,TransComp2021GITHUB:https:/ stream2 2025 IEEE International Solid-State Circuits Conference50 of 60Marian VerhelstISSCC 2025 Forum F3.2ZigZag(single core)&Stream(multi-core)ZIGZAG/STREAMEner
206、gy(MAC energy,memory load/store)Latency(array utilization,memory stalls)AreaTechnology characteristics(cell size,read cost,wireload,)NN workloadCost ModelHardware architecture&constraints(mem.hierarchy,compute array,interconnect,IO BW,)Mapping(spatial&temporal unrolling)Energy/inferencetime/inferenc
207、eaccuracyMore details:Mei,TransComp2021GITHUB:https:/ stream2 2025 IEEE International Solid-State Circuits ConferenceFrom workload.to schedule.Stream tiling and scheduling results2ISSCC 2025 Forum F3.2Marian Verhelst51 of 60 2025 IEEE International Solid-State Circuits ConferenceFrom workload.to sch
208、edule.Stream tiling and scheduling results2ISSCC 2025 Forum F3.2Marian Verhelst51 of 60 2025 IEEE International Solid-State Circuits ConferenceFrom workload.to schedule.with temporal unrolling.Stream tiling and scheduling results2ISSCC 2025 Forum F3.2Marian Verhelst51 of 60 2025 IEEE International S
209、olid-State Circuits ConferenceFrom workload.to schedule.with temporal unrolling.and spatial unrolling.Stream tiling and scheduling results2ISSCC 2025 Forum F3.2Marian Verhelst51 of 60 2025 IEEE International Solid-State Circuits ConferenceTo orthogonally optimize dataflow,schedule and data layout,on
210、e needs:1.Flexible,yet decoupled accelerators OpenGeMM2.Scheduling space exploration ZigZag/Stream3.Customizable compilation MLIRThree utilization enablersMarian VerhelstISSCC 2025 Forum F3.252 of 60 2025 IEEE International Solid-State Circuits ConferenceCompiler customization Use optimal schedule f
211、rom ZigZag/Stream3.ZIGZAG/STREAMMarian VerhelstISSCC 2025 Forum F3.253 of 60 2025 IEEE International Solid-State Circuits ConferenceCompiler customization Use optimal schedule from ZigZag/Stream Use layout with minimal memory stalls,expressed in new MLIR“tsl”dialect3ZIGZAG/STREAMMarian VerhelstISSCC
212、 2025 Forum F3.254 of 60 2025 IEEE International Solid-State Circuits ConferenceCompiler customization Use optimal schedule from ZigZag/Stream Use layout with minimal memory stalls,expressed in new MLIR dialect Insert DMA and synchronization calls for parallel operation3ZIGZAG/STREAMMarian VerhelstI
213、SSCC 2025 Forum F3.255 of 60 2025 IEEE International Solid-State Circuits ConferenceCompiler customization Use optimal schedule from ZigZag/Stream Use layout with minimal memory stalls,expressed in new MLIR dialect Insert DMA and synchronization calls for parallel operation Generate accelerator code
214、 and runtime using accelerator template3Marian VerhelstISSCC 2025 Forum F3.256 of 60 2025 IEEE International Solid-State Circuits ConferenceCombining it all together.Adding flexible accelerators at design timeScheduling their overlapping executionGenerating codeExample network:Marian VerhelstISSCC 2
215、025 Forum F3.257 of 60 2025 IEEE International Solid-State Circuits ConferenceCombining it all together.Adding flexible accelerators at design timeScheduling their overlapping executionGenerating codeExample network:Marian VerhelstISSCC 2025 Forum F3.257 of 60 2025 IEEE International Solid-State Cir
216、cuits ConferenceCombining it all together.Allowing to achieve very-close-to roofline performance for real workloads!(here:multi-layer GeMM)ArithmeticMarian VerhelstISSCC 2025 Forum F3.258 of 60 2025 IEEE International Solid-State Circuits ConferenceConclusion AI accelerator perform seldomly accordin
217、g to their theoretical rooflines Dataflow,scheduling and data layout optimizations are key to minimize spatial and temporal utilization losses The exploration/design/compilation space is too large handle by hand.Need for automatic:Generation of flexible dataflow and memory interfacing HW Scheduling
218、space exploration Compiler customization frameworks Efficient decoupled ML acceleratorsMarian VerhelstISSCC 2025 Forum F3.259 of 60 2025 IEEE International Solid-State Circuits ConferenceChoi13 Choi,Jee Whan,et al.A roofline model of energy.2013 IEEE 27th International Symposium on Parallel and Dist
219、ributed Processing.IEEE,2013.Williams09 Williams,Samuel,Andrew Waterman,and David Patterson.Roofline:an insightful visual performance model for multicore architectures.Communications of the ACM 52.4(2009):65-76.Mei21 Mei,L.,Houshmand,P.,Jain,V.,Giraldo,S.,&Verhelst,M.(2021).ZigZag:Enlarging joint ar
220、chitecture-mapping design space exploration for DNN accelerators.IEEE Transactions on Computers,70(8),1160-1174.Symons23 Symons,A.,Mei,L.,Colleman,S.,Houshmand,P.,Karl,S.,&Verhelst,M.(2023,April).Stream:A Modeling Framework for Fine-grained Layer Fusion on Multi-core DNN Accelerators.In 2023 IEEE In
221、ternational Symposium on Performance Analysis of Systems and Software(ISPASS)(pp.355-357).IEEE.Ueyoshi22 Ueyoshi,K.,Papistas,I.A.,Houshmand,P.,Sarda,G.M.,Jain,V.,Shi,M.,.&Verhelst,M.(2022,February).DIANA:An end-to-end energy-efficient digital and ANAlog hybrid neural network SoC.In 2022 IEEE Interna
222、tional Solid-State Circuits Conference(ISSCC)(Vol.65,pp.1-3).IEEE.Yi24 Yi,X.,Antonio,R.,Dumoulin,J.,Sun,J.,Van Delm,J.,Paim,G.,&Verhelst,M.(2024).OpenGeMM:A High-Utilization GeMM Accelerator Generator with Lightweight RISC-V Control and Tight Memory Coupling.arXiv preprint arXiv:2411.09543.Key Refer
223、encesMarian VerhelstISSCC 2025 Forum F3.260 of 60 2025 IEEE International Solid-State Circuits ConferenceExplicit Decoupled Data Orchestration:A Fundamental Approach to AccelerationMichael PellauerNVIDIA,Architecture Research GroupDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orche
224、stration1 of 60 2025 IEEE International Solid-State Circuits ConferenceModern world has come to rely on regular cadenceComputational Efficiency Improvements28 nm process,7.1 billion transistors12 nm process,21.1 billion transistors7 nm process,54.2 billion transistorsEnabling:Enabling:GK210,2014GV10
225、0,2014GA100,2020Dr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration2 of 60 2025 IEEE International Solid-State Circuits Conference“Oil Well”has had easy stuff extracted alreadySlowing Transistor ScalingSource:OLaughlin,The Rising Tide of Semiconductor Cost,data from Interna
226、tional Business Strategies,Inc.and MarvellSource:Shalf,The Future of Computing Beyond Moores Law,projecting from data points by Olukotun,Hammond,Sutter,and HorowitzDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration3 of 60 2025 IEEE International Solid-State Circuits Confer
227、enceAlternative“oil well”to increase efficiencyHardware SpecializationGeneraldatapathsBranchmis-predictsCachepenaltiesControl overheadsThe real workSource:Chung et al.,Single-Chip Heterogeneous Computing:Does the Future Include Custom Logic,FPGAs,and GPGPUs?2010Dr.Michael PellauerISSCC 2025-Forum 3.
228、3:Explicit Decoupled Data Orchestration4 of 60 2025 IEEE International Solid-State Circuits ConferencePapers focus on customized compute,but data movement is more expensiveCommunity Research Focus is wrong64-bit datapath26 pJ256 pJ1000 pJ500 pJEfficientoff-chip link256-bit buses16,000 pJDRAMRd/Wr256
229、-bit access8 kB SRAM50 pJ20mmSource:Bill Dally,20 nm20 pJDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration5 of 60 2025 IEEE International Solid-State Circuits ConferencePercentage of area devoted to on-chip buffersBuffer Hierarchies in ML AcceleratorsEach buffering system
230、 is custom-designed,in-extractible from its acceleratorStatement like“Reuse TPU buffer hierarchy for genome sequencing”is nonsensicalDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration6 of 60 2025 IEEE International Solid-State Circuits ConferenceFeeding data to a functiona
231、l unit exactly when it wants itWhat is Data OrchestrationML ASICs use workload knowledge to optimize orchestration at design-timeWhen data is moved over a transfer substrateWhere data is placed in available staging buffersHow data is accessed,including removal/evictionWho the“actors”are that touch d
232、ata and synchronizeDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration7 of 60 2025 IEEE International Solid-State Circuits ConferenceGuiding Principles for EfficiencyLocal reuse staged physically closeCross-unit use amortize data accessBandwidth efficiency Maximize delivery
233、 ratePrecise synchronization Only wait for exactly data you need,respond quicklySimple structures-Minimize hardware area/powerDelivery/use overlap Next tile should be available when current is doneDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration8 of 60 2025 IEEE Internat
234、ional Solid-State Circuits ConferenceData Orchestration in DL ASICsISSCC 2016,ISCA 2016Dr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration9 of 60 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestra
235、tion10 of 60Conv1D:WeightsKey concept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=for q in range(Q):for s in range(S):outq+=inq+s*ws 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10
236、 of 60Conv1D:WeightsKey concept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=for q in range(Q):for s in range(S):outq+=inq+s*ws 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60
237、Conv1D:WeightsKey concept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=for q in range(Q):for s in range(S):outq+=inq+s*ws 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D
238、:WeightsKey concept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=for q in range(Q):for s in range(S):outq+=inq+s*ws 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:Weigh
239、tsKey concept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=for q in range(Q):for s in range(S):outq+=inq+s*ws 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:WeightsKey
240、concept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=for q in range(Q):for s in range(S):outq+=inq+s*ws 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:WeightsKey concep
241、t introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=for q in range(Q):for s in range(S):outq+=inq+s*ws 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:WeightsKey concept intr
242、oduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=for q in range(Q):for s in range(S):outq+=inq+s*ws 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:WeightsKey concept introduced
243、 by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=for q in range(Q):for s in range(S):outq+=inq+s*ws 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:WeightsKey concept introduced by Ey
244、erissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=Weights(Tile Size S)Inputs(Tile Size S)Outputs(Tile Size 1)AGenAGenAGenfor q in range(Q):for s in range(S):outq+=inq+s*wsL1 buffers divided into separate physical RAM macros at design time 2025 IEEE International Solid-State Circuits Confe
245、renceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:WeightsKey concept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=Weights(Tile Size S)Inputs(Tile Size S)Outputs(Tile Size 1)AGenAGenAGenfor q in range(Q):for s in range(S):
246、outq+=inq+s*wsL1 buffers divided into separate physical RAM macros at design timeLoop in“global”tensor address space determines staging pattern of individual RAMs 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of
247、 60Conv1D:WeightsKey concept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=Weights(Tile Size S)Inputs(Tile Size S)Outputs(Tile Size 1)AGenAGenAGenfor q in range(Q):for s in range(S):outq+=inq+s*wsL1 buffers divided into separate physical RAM macros at design timeLoop i
248、n“global”tensor address space determines staging pattern of individual RAMs“Output Stationary”that tensor changes most slowly 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:WeightsKey concept introduc
249、ed by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=Weights(Tile Size S)Inputs(Tile Size S)Outputs(Tile Size 1)AGenAGenAGenfor q in range(Q):for s in range(S):outq+=inq+s*ws“Input Streaming”with halo from convolutional mathL1 buffers divided into separate physical RAM macros at desi
250、gn timeLoop in“global”tensor address space determines staging pattern of individual RAMs“Output Stationary”that tensor changes most slowly 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:WeightsKey con
251、cept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=Weights(Tile Size S)Inputs(Tile Size S)Outputs(Tile Size 1)AGenAGenAGenfor q in range(Q):for s in range(S):outq+=inq+s*ws“Weight Tile-Stationary”that tensor stays resident with fixed reuse distance“Input Streaming”with
252、 halo from convolutional mathL1 buffers divided into separate physical RAM macros at design timeLoop in“global”tensor address space determines staging pattern of individual RAMs“Output Stationary”that tensor changes most slowly 2025 IEEE International Solid-State Circuits ConferenceDr.Michael Pellau
253、erISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration10 of 60Conv1D:WeightsKey concept introduced by EyerissDataflows and StationaritySWInputsQ=W-ceil(S/2)Outputs*=Weights(Tile Size S)Inputs(Tile Size S)Outputs(Tile Size 1)AGenAGenAGenfor q in range(Q):for s in range(S):outq+=inq+s*ws#Actual
254、read FSM pattern is projection of the global loop into local addressesfor q in range(Q):for s in range(S):input_buffer.readq+s%S“Weight Tile-Stationary”that tensor stays resident with fixed reuse distance“Input Streaming”with halo from convolutional mathL1 buffers divided into separate physical RAM
255、macros at design timeLoop in“global”tensor address space determines staging pattern of individual RAMs“Output Stationary”that tensor changes most slowly 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration11 of 60Fills t
256、o local space like“waveform”with amplitude(tile size)and period(residence length)Scratchpad Interface and Exploiting ReuseL1 image scratchpad(size S=3)L1 AddrGen012St.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data)2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum
257、 3.3:Explicit Decoupled Data Orchestration11 of 60Fills to local space like“waveform”with amplitude(tile size)and period(residence length)Scratchpad Interface and Exploiting ReuseL1 image scratchpad(size S=3)L1 AddrGen012L2 image scratch(size Q)L2 AddrGenSt.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data
258、)2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration11 of 60Fills to local space like“waveform”with amplitude(tile size)and period(residence length)Scratchpad Interface and Exploiting ReuseL1 image scratchpad(size S=3)L
259、1 AddrGen012L2 image scratch(size Q)L2 AddrGenASt.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data)2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration11 of 60Fills to local space like“waveform”with amplitude(tile size)and per
260、iod(residence length)Scratchpad Interface and Exploiting ReuseL1 image scratchpad(size S=3)L1 AddrGen012L2 image scratch(size Q)L2 AddrGenASt.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data)B 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled
261、 Data Orchestration11 of 60Fills to local space like“waveform”with amplitude(tile size)and period(residence length)Scratchpad Interface and Exploiting ReuseL1 image scratchpad(size S=3)L1 AddrGen012L2 image scratch(size Q)L2 AddrGenASt.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data)BC 2025 IEEE Internat
262、ional Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration11 of 60Fills to local space like“waveform”with amplitude(tile size)and period(residence length)Scratchpad Interface and Exploiting ReuseL1 image scratchpad(size S=3)L1 AddrGen012L2 imag
263、e scratch(size Q)L2 AddrGenASt.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data)BCD 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration11 of 60Fills to local space like“waveform”with amplitude(tile size)and period(residence l
264、ength)Scratchpad Interface and Exploiting ReuseL1 image scratchpad(size S=3)L1 AddrGen012Initial Delay-(Ld.S(0)-Ld.S(1)-.-Ld.S(Q-1+S-1%S)Q-1L2 image scratch(size Q)L2 AddrGenSt.S(0,L2image0)-St.S(1,L2image1).St.S(Q-1%S,L2imageQ-1)Ainterface usage“regexp”St.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data)
265、BCD 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration11 of 60Fills to local space like“waveform”with amplitude(tile size)and period(residence length)Scratchpad Interface and Exploiting ReuseL1 image scratchpad(size S=
266、3)L1 AddrGen012Initial Delay-(Ld.S(0)-Ld.S(1)-.-Ld.S(Q-1+S-1%S)Q-1L2 image scratch(size Q)L2 AddrGenSt.S(0,L2image0)-St.S(1,L2image1).St.S(Q-1%S,L2imageQ-1)Ainterface usage“regexp”St.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data)BCDWhy is separate“Push”filler not a race condition?(Both RAW and WAR)2025
267、 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration11 of 60Fills to local space like“waveform”with amplitude(tile size)and period(residence length)Scratchpad Interface and Exploiting ReuseIn custom DL accelerator,all sizing
268、s can be done at design time based on known micro-architectural timingsOverheads for synchronization and hazard checking approach zeroL1 image scratchpad(size S=3)L1 AddrGen012Initial Delay-(Ld.S(0)-Ld.S(1)-.-Ld.S(Q-1+S-1%S)Q-1L2 image scratch(size Q)L2 AddrGenSt.S(0,L2image0)-St.S(1,L2image1).St.S(
269、Q-1%S,L2imageQ-1)Ainterface usage“regexp”St.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data)BCDWhy is separate“Push”filler not a race condition?(Both RAW and WAR)2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration11 of 60Fil
270、ls to local space like“waveform”with amplitude(tile size)and period(residence length)Scratchpad Interface and Exploiting ReuseIn custom DL accelerator,all sizings can be done at design time based on known micro-architectural timingsOverheads for synchronization and hazard checking approach zeroL1 im
271、age scratchpad(size S=3)L1 AddrGen012Initial Delay-(Ld.S(0)-Ld.S(1)-.-Ld.S(Q-1+S-1%S)Q-1L2 image scratch(size Q)L2 AddrGenAll RAM slots used for Reuse ExploitationSt.S(0,L2image0)-St.S(1,L2image1).St.S(Q-1%S,L2imageQ-1)Ainterface usage“regexp”St.S(sp_addr)Ld.S(sp_addr)Ld.S(data)St.S(data)BCDWhy is s
272、eparate“Push”filler not a race condition?(Both RAW and WAR)2025 IEEE International Solid-State Circuits ConferenceFrom Eyeriss to General Reusable IdiomASPLOS 2019On-chipbuffetOff-chipDRAMread addrread datacreditsread datavalid?update dataAddress gen(DMA engine)read idxAddr genfill dataCombine best
273、aspects of FIFOs and Scratchpads in modular encapsulationDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration12 of 60 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration13 of 60Key concept intro
274、duced by this workData orchestration TaxonomyOff-chipDRAME.g.,Cacheread addrread dataread addrread dataOff-chipDRAMOn-chip scratchpadread addrread dataread addrread dataDpathDpath Makes decisions about staging,replacement,etc.in workload-agnostic fashion Size not exposed(officially)All staging and r
275、eplacement decisions made by workload Size exposed,latency hiding up to user(u-arch latency not exposed)ImplicitExplicit 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration14 of 60Key concept introduced by this workData
276、 orchestration TaxonomycachescratchpadCoupledDecoupledcacheAddress gen(DMA engine)ImplicitExplicitAddress gen(DMA engine)?See Decoupled Access-Execute archs Also,LD.G+ST.S,TMA(more on this later)Read-after-fill synchronization covered by load-to-use stalling Read-after-fill synchronization covered b
277、y load-to-use stallingFIFO/buffet External sync.needed(i.e.,arrive-wait)Encapsulate read-after-fill sync.via head/tail pointer checks Encapsulate modular arith.due to latency hiding inside buffer 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit
278、Decoupled Data Orchestration15 of 60FIFOs:Existing buffer idiom for E.D.D.O.Too restrictive for consumer No repeated reads past head No in-place updates No bulk removalBulk transfer Great MLP No remote address with fillEncapsulates synchronization Composes hierarchically with fewer“landing zones”Off
279、-chipDRAMread addrread datacreditsdatare-circulateXilinx.FIFO Generator v13.1,Intel,FIFO:FPGA IP User GuideAddress gen(DMA engine)fill datavalid?On-chipFIFO 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration16 of 60Unl
280、ike implicit structures like caches,requires active involvement to ensure performanceGeneralized Latency HidingL1 image buffet(size S=3+K=2)L1 AddrGen012L2 image scratch(size Q)L2 AddrGenReadOrShrink(offset)Read(data)Fill(data)34Credit(sp_amount)Internal Scoreboard 2025 IEEE International Solid-Stat
281、e Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration16 of 60Unlike implicit structures like caches,requires active involvement to ensure performanceGeneralized Latency HidingL1 image buffet(size S=3+K=2)L1 AddrGen012L2 image scratch(size Q)L2 AddrGenRead
282、OrShrink(offset)Read(data)Fill(data)34Credit(sp_amount)5Internal Scoreboard 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration16 of 60Unlike implicit structures like caches,requires active involvement to ensure perform
283、anceGeneralized Latency HidingL1 image buffet(size S=3+K=2)L1 AddrGen012L2 image scratch(size Q)L2 AddrGenWait+Fill(L2image0:4)-Wait+Fill(L2image5).Wait+Fill(L2imageQ-1)Ainterface usage“regexp”ReadOrShrink(offset)Read(data)Fill(data)34BCDECredit(sp_amount)5Internal Scoreboard 2025 IEEE International
284、 Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration16 of 60Unlike implicit structures like caches,requires active involvement to ensure performanceGeneralized Latency HidingL1 image buffet(size S=3+K=2)L1 AddrGen012(Rd(0)-Rd(1)-Rd(2)+Shrink(1
285、)Q-1L2 image scratch(size Q)L2 AddrGenWait+Fill(L2image0:4)-Wait+Fill(L2image5).Wait+Fill(L2imageQ-1)Ainterface usage“regexp”ReadOrShrink(offset)Read(data)Fill(data)34BCDECredit(sp_amount)5Internal Scoreboard 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum
286、3.3:Explicit Decoupled Data Orchestration16 of 60Unlike implicit structures like caches,requires active involvement to ensure performanceGeneralized Latency HidingL1 image buffet(size S=3+K=2)L1 AddrGen012(Rd(0)-Rd(1)-Rd(2)+Shrink(1)Q-1L2 image scratch(size Q)L2 AddrGenSome RAM slots used for Latenc
287、y HidingImportant:can be fractions of tiles,no need for whole quanta(i.e.,double-buffering)and no ROI for too muchWait+Fill(L2image0:4)-Wait+Fill(L2image5).Wait+Fill(L2imageQ-1)Ainterface usage“regexp”ReadOrShrink(offset)Read(data)Fill(data)34BCDECredit(sp_amount)5Most RAM slots used for Reuse Explo
288、itationInternal Scoreboard 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration16 of 60Unlike implicit structures like caches,requires active involvement to ensure performanceGeneralized Latency HidingL1 image buffet(siz
289、e S=3+K=2)L1 AddrGen012(Rd(0)-Rd(1)-Rd(2)+Shrink(1)Q-1L2 image scratch(size Q)L2 AddrGenSome RAM slots used for Latency HidingImportant:can be fractions of tiles,no need for whole quanta(i.e.,double-buffering)and no ROI for too muchWait+Fill(L2image0:4)-Wait+Fill(L2image5).Wait+Fill(L2imageQ-1)Ainte
290、rface usage“regexp”ReadOrShrink(offset)Read(data)Fill(data)34BCDECredit(sp_amount)5Most RAM slots used for Reuse ExploitationInternal Scoreboard 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration16 of 60Unlike implicit
291、 structures like caches,requires active involvement to ensure performanceGeneralized Latency HidingL1 image buffet(size S=3+K=2)L1 AddrGen012(Rd(0)-Rd(1)-Rd(2)+Shrink(1)Q-1L2 image scratch(size Q)L2 AddrGenSome RAM slots used for Latency HidingImportant:can be fractions of tiles,no need for whole qu
292、anta(i.e.,double-buffering)and no ROI for too muchWait+Fill(L2image0:4)-Wait+Fill(L2image5).Wait+Fill(L2imageQ-1)AAfter a shrink,some RAM slots are Landing Zonesinterface usage“regexp”ReadOrShrink(offset)Read(data)Fill(data)34BCDECredit(sp_amount)5Most RAM slots used for Reuse ExploitationInternal S
293、coreboard 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration16 of 60Unlike implicit structures like caches,requires active involvement to ensure performanceGeneralized Latency HidingL1 image buffet(size S=3+K=2)L1 Addr
294、Gen012(Rd(0)-Rd(1)-Rd(2)+Shrink(1)Q-1L2 image scratch(size Q)L2 AddrGenSome RAM slots used for Latency HidingImportant:can be fractions of tiles,no need for whole quanta(i.e.,double-buffering)and no ROI for too muchWait+Fill(L2image0:4)-Wait+Fill(L2image5).Wait+Fill(L2imageQ-1)AAfter a shrink,some R
295、AM slots are Landing Zonesinterface usage“regexp”ReadOrShrink(offset)Read(data)Fill(data)34Properly sizing the latency-hiding ensures the next datum is staged just-in-timeAlso minimizes the percentage of inactive buffet entries(landing zones)Can use Littles Law to size the Reuse Exploitation tile si
296、ze to maximize compute bandwidthBCDECredit(sp_amount)5Most RAM slots used for Reuse ExploitationInternal Scoreboard 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration17 of 60Littles Law and ScratchpadsLittles Law:Origi
297、nally used by banks to size roped waiting areas in front of tellersOccupancyDemand rateTime spent 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration17 of 60Littles Law and ScratchpadsLittles Law:Originally used by bank
298、s to size roped waiting areas in front of tellersOccupancyDemand rateTime spentAnalogy:Minifridge(size 6)in office break roomLets assume someone wants 1 cold soda every 10 minutes.and it takes 3 hours for a can to become sufficiently chilled.that means workers only get cold soda out of the fridge 1
299、hour out of 3!Rule:When you take a cold can,replace with a warm one from cupboard 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration17 of 60Littles Law and Scratchpadsgreen:cold soda deliveredgray:no cold soda availabl
300、etime,in ten-minute intervals3 Hours:can sitting in“fridge slot 1”becomes coldLittles Law:Originally used by banks to size roped waiting areas in front of tellersOccupancyDemand rateTime spent18 slots needed for SOLAnalogy:Minifridge(size 6)in office break roomLets assume someone wants 1 cold soda e
301、very 10 minutes.and it takes 3 hours for a can to become sufficiently chilled.that means workers only get cold soda out of the fridge 1 hour out of 3!Rule:When you take a cold can,replace with a warm one from cupboardJob of a buffer idiom:Help the program exploit reuse and hide latency while minimiz
302、ing locked landing zonesAnd do it across multiple operand tensors and buffers with differing access patterns and fill rates!2025 IEEE International Solid-State Circuits ConferenceOn-chipbuffetBuffetsOff-chipDRAMread addrread datacreditsread datavalid?update dataAddress gen(DMA engine)Operations:Read
303、(idx)Update(idx,data)Shrink(Num)read idxAddr genfill dataLoosen FIFO access restrictions while maintaining fill benefitsEncapsulation of synchronization and RAM size has interfacing benefitsDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration18 of 60 2025 IEEE International
304、Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration19 of 60read(sp_offset)read(data)update(data)Why“Buffet”?Typical interface usage is directly analogous to restaurantsBuffetfill(data)shrink(sp_offset)credit(sp_amount)2025 IEEE International S
305、olid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration19 of 60read(sp_offset)read(data)update(data)Why“Buffet”?Typical interface usage is directly analogous to restaurantsBuffetGet any available dish,go back as many times as you wantfill(data)shri
306、nk(sp_offset)credit(sp_amount)2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration19 of 60read(sp_offset)read(data)update(data)Why“Buffet”?Typical interface usage is directly analogous to restaurantsBuffetGet any availab
307、le dish,go back as many times as you wantfill(data)shrink(sp_offset)credit(sp_amount)Shrink=course change 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration19 of 60read(sp_offset)read(data)update(data)Why“Buffet”?Typic
308、al interface usage is directly analogous to restaurantsBuffetIn restaurant no in-place update!Get any available dish,go back as many times as you wantfill(data)shrink(sp_offset)credit(sp_amount)Shrink=course change 2025 IEEE International Solid-State Circuits ConferenceBuffet Local Synchronization(S
309、ee Paper)Buffetfillread idxread dataupdateHardware scoreboard and stall logicNo hardware needed 2025 IEEE International Solid-State Circuits ConferenceBuffet Local Synchronization(See Paper)Buffetfillread idxread dataupdateRead data may stall until Fill(RAW)Hardware scoreboard and stall logicNo hard
310、ware needed 2025 IEEE International Solid-State Circuits ConferenceBuffet Local Synchronization(See Paper)Buffetfillread idxread dataupdatewill_update?Read data may stall until Fill(RAW)Hardware scoreboard and stall logicNo hardware needed 2025 IEEE International Solid-State Circuits ConferenceBuffe
311、t Local Synchronization(See Paper)Buffetfillread idxread dataupdatewill_update?Read data may stall until Fill(RAW)Read data may stall until Update completes(RAW)Hardware scoreboard and stall logicNo hardware needed 2025 IEEE International Solid-State Circuits ConferenceBuffet Local Synchronization(S
312、ee Paper)Buffetfillread idxread dataupdatewill_update?Read data may stall until Fill(RAW)Read data may stall until Update completes(RAW)Transitively syncd via preceding Read()Hardware scoreboard and stall logicNo hardware needed 2025 IEEE International Solid-State Circuits ConferenceBuffet Local Syn
313、chronization(See Paper)Higher-level interfacing semantics beyond load/store=cheaper hardwareBuffetfillread idxread dataupdatewill_update?Read data may stall until Fill(RAW)Read data may stall until Update completes(RAW)Transitively syncd via preceding Read()Hardware scoreboard and stall logicNo hard
314、ware needed 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerI
315、SSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchest
316、ration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=2025 IEEE International Solid-State
317、Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orch
318、estration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Output
319、s*=2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025
320、-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D
321、 ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=Weightsbuffet(S)Inputsbuffet(S)Outputsbuffet(1)AGenAGenAGen 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-cei
322、l(S/2)Outputs*=Weightsbuffet(S)Inputsbuffet(S)Outputsbuffet(1)AGenAGenAGenfor q=0.Q)for s=0.S)Read(s)Shrink(S)2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInput
323、sQ=W-ceil(S/2)Outputs*=Weightsbuffet(S)Inputsbuffet(S)Outputsbuffet(1)AGenAGenAGenReads past head non-destructivelyEncapsulates synchronization:no concern for fillsfor q=0.Q)for s=0.S)Read(s)Shrink(S)2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Expli
324、cit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=Weightsbuffet(S)Inputsbuffet(S)Outputsbuffet(1)AGenAGenAGenReads past head non-destructivelyEncapsulates synchronization:no concern for fillsfor q=0.Q)for s=0.S)Read(s)Shrink(S)for q=0.Q)
325、for s=0.S)Read(s)Shrink(1)Shrink(S-1)2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=Weightsbuffet(S)Inputsbuffet(S)Outputsbuffet(1)AGe
326、nAGenAGenReads past head non-destructivelyEncapsulates synchronization:no concern for fillsfor q=0.Q)for s=0.S)Read(s)Shrink(S)for q=0.Q)for s=0.S)Read(s)Shrink(1)Shrink(S-1)Encapsulates modular arithmetic:shrink tile size=sliding window 2025 IEEE International Solid-State Circuits ConferenceDr.Mich
327、ael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=Weightsbuffet(S)Inputsbuffet(S)Outputsbuffet(1)AGenAGenAGenReads past head non-destructivelyEncapsulates synchronization:no concern for fillsfor q=0.
328、Q)for s=0.S)Read(s)Shrink(S)for q=0.Q)for s=0.S)Read(s)Shrink(1)Shrink(S-1)for q=0.Q)for s=0.S-1)ReadAndUpdate(0)Read(0)Shrink(1)Encapsulates modular arithmetic:shrink tile size=sliding window 2025 IEEE International Solid-State Circuits ConferenceDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Dec
329、oupled Data Orchestration21 of 60WeightsExample Orchestration:1D ConvolutionSWInputsQ=W-ceil(S/2)Outputs*=Weightsbuffet(S)Inputsbuffet(S)Outputsbuffet(1)AGenAGenAGenReads past head non-destructivelyEncapsulates synchronization:no concern for fillsHandles RAW hazard:correct behavior even if MACC has
330、latencyfor q=0.Q)for s=0.S)Read(s)Shrink(S)for q=0.Q)for s=0.S)Read(s)Shrink(1)Shrink(S-1)for q=0.Q)for s=0.S-1)ReadAndUpdate(0)Read(0)Shrink(1)Encapsulates modular arithmetic:shrink K traversal order(i.e.,B-stationary dataflow)Einsum:Z=,Fixed-Function Sparse AcceleratorsSee:OuterSpace,ExTensor,GAMM
331、A,etc.Without these mechanisms,how does a general-purpose accelerator do on this domain?Dr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration42 of 60 2025 IEEE International Solid-State Circuits ConferenceReasons for not meeting roofline:Moving data too far(e.g.,low-reuse dat
332、a emplaced in DRAM-L2-L1)Moving unnecessary data(e.g.,whole cache lines,replacement policies)Round-trip data accesses(e.g.,slow indirections)Inefficient ISA implementations of certain key operations(e.g.,intersection)Unstructured Sparsity*on GPUsA100 FP32 roofline(SIMD only,excluding tensor cores)*N
333、ot the 2:4 structured sparsity supported natively in tensor coresSymphony:Use HHP to deploy specialized data orchestration blocks throughout memory hierarchy of GPU-like architectureDr.Michael PellauerISSCC 2025-Forum 3.3:Explicit Decoupled Data Orchestration43 of 60 2025 IEEE International Solid-State Circuits ConferenceSymphony ApproachChallenge:How to expose dispersed specialized blocks across