Harsh Menon (nod.ai)
Code Generation in MLIR

Overview
- Motivation
- Code generation in LLVM/MLIR
- Dialects (Linalg, Vector, GPU, etc.)
- Walkthrough of code generation using IREE/SHARK
- CPU and GPU code generation
- Targeting custom accelerators
- Auto-tuning
- Conclusion
- Acknowledgements
- References
Motivation
- Deep learning has become extremely pervasive, spanning domains ranging from autonomous cars and natural language processing to medical imaging.
- Large models are achieving state-of-the-art results (such as transformers in NLP). Unfortunately, these large models take many months and millions of dollars to train on existing hardware.
- On edge-based systems, inference dominates, with latency as a key metric.
- New hardware vendors have risen to the occasion with custom accelerators that address some of these concerns.
- But as the number of model and hardware combinations explodes, we need a strong compiler infrastructure that provides performance gains and is easily re-targetable to new hardware.

Motivation
- LLVM provides a strong compiler infrastructure that already scales to various hardware targets and can be used to fill the gap.
- But LLVM IR is too low-level, and many opportunities for optimization are missed if we start at that level of abstraction.
- MLIR provides compiler infrastructure for handling varying levels of abstraction and a way to progressively lower to LLVM IR, leveraging the best of both worlds.
- In machine learning (ML), neural networks are defined in Python-based frameworks such as TensorFlow, PyTorch and JAX; MLIR helps progressively lower these computation graphs from their Pythonic representation to LLVM IR.

Code generation in LLVM
- Code generation is the final phase in the compiler pipeline (front end, code optimizer, code generator).
- It must make effective use of the available resources while preserving program semantics.
- Deals with problems such as instruction selection, register allocation and assignment, and instruction ordering.
- LLVM has several backends (X86, NVPTX, RISC-V, etc.) which contain the specific hardware definition: hardware instructions, registers and so on, expressed in TableGen.
- During the code generation pipeline, LLVM IR is lowered to SelectionDAG, then MachineInstr, then MCInst, from which the final machine code is emitted.
- MLIR leverages LLVM for this part of code generation. Code generation in MLIR operates at a higher level of abstraction and attempts to provide the missing infrastructure between high-level ML programs and LLVM IR.
Code generation in MLIR
- The figure on the left shows the dialect ecosystem in MLIR.
- Starting from the dialects at the top (which closely follow the native ML frameworks), there are many paths to the LLVM dialect.
- Each dialect progressively lowers the abstraction, going from tensors (immutable SSA values) to memrefs.
- High-level optimizations are performed at the upper dialects; optimizations become more hardware-specific as we approach the LLVM dialect.
- Once we arrive at the LLVM dialect, we can translate to LLVM IR using mlir-translate and use LLVM to generate the final binary.

Linalg Dialect
- The Linalg dialect is used to represent perfectly nested loop computations, making it easy to perform transformations like fusion, tiling and loop interchange.
- Operates on both tensors and memrefs.
- Can be lowered to loops or affine expressions with the computation in the loop body.
- Incorporates learnings from Halide, TVM, Tensor Comprehensions, XLA, etc.
- Linalg defines a small set of core named ops (such as matmul, conv and pooling) that the front-end dialects can lower to. This maps a large set of operations onto a smaller set that the compiler can focus on optimizing.
- The core workhorse of this dialect is the linalg.generic op.
Linalg GenericOp
- Indexing maps: capture the data access pattern for each operand. The map's domain is a point in the iteration space; its range is a point in the operand's data space.
- Iterator types: specify the data dependences between iterations of the loop nest.
- Inputs and outputs: output and result operands are tied together.
- Compute payload: the computation performed at each point in the iteration space. The yielded value is the result of the computation; the block arguments are obtained from the operands via the indexing maps.

Example (a 4-D reduction into a 2-D result; types abridged):

    %6 = linalg.generic {
           indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>,
                            affine_map<(d0, d1, d2, d3) -> (d0, d1)>],
           iterator_types = ["parallel", "parallel", "reduction", "reduction"]}
           ins(%1 : tensor<...>) outs(%5 : tensor<...>) {
    ^bb0(%arg1: f32, %arg2: f32):
      %17 = arith.addf %arg2, %arg1 : f32
      linalg.yield %17 : f32
    } -> tensor<...>

Equivalent loop nest:

    for (int i = 0; i < M; i++)
      for (int j = 0; j < N; j++)
        for (int k = 0; k < K; k++)
          for (int l = 0; l < P; l++)
            c[i][j] = c[i][j] + a[i][j][k][l];
Vector Dialect
- Provides generic, retargetable n-D vector abstractions.
- Operations in the vector dialect can progressively decompose to lower-rank variants.
- Can lower to LLVM instructions or directly target hardware intrinsics (e.g. mma_compute).
- Examples:
  - vector.transfer_read / vector.transfer_write: bridge the gap between memory and vectors and contain enough information to encode read/write patterns such as broadcasted, permuted and masked accesses.
  - vector.outerproduct: outer-product operation (typically obtained from lowering a matmul) that can be further lowered to LLVM fused multiply-add (FMA) instructions.

GPU Dialect
- Provides abstractions for a retargetable GPU model.
- Contains operations common to SIMT platforms.
- Examples:
  - Communication: gpu.all_reduce (reduction across a local workgroup)
  - Synchronization: gpu.barrier
  - Compute: gpu.subgroup_mma_compute
- Can be obtained by lowering from the vector dialect.
- Operates only on memrefs.
Accelerator-Specific Dialects
- NVIDIA GPU-specific dialects: NVVM dialect, NVGPU dialect
- AMD GPU-specific dialects: ROCDL dialect
- ARM CPU-specific dialects: ARM Neon dialect, ARM SVE dialect
- Intel CPU-specific dialects: x86Vector dialect
- Each of these dialects exposes hardware functionality specific to those devices. For example, the NVVM dialect has nvvm.wmma.mma for matrix multiplication using tensor cores, which is not present in the other dialects.

Code generation using IREE
- IREE is an open-source MLIR-based end-to-end compiler and runtime that lowers ML models for datacenter and edge workloads.
- Supports X86, NVIDIA, AMD, RISC-V, Vulkan and ARM.
- Supports TensorFlow, JAX, TFLite and PyTorch.
- We will use IREE to demonstrate code generation using MLIR, focusing specifically on the compilation pipeline.

Code generation using SHARK
- Builds on top of the IREE pipeline.
- Adds performance optimizations for CUDA: a caching allocator and async memory prefetch.
- Adds auto-tuning capabilities.
- Adds backends for custom accelerators.
- Contains a fully validated set of hundreds of models.
- Easy to deploy (integrated with Triton server).
Perceptron Code Generation
- A perceptron is defined mathematically as z = max(0, XW).
- In ML frameworks, it is a matrix multiplication followed by a ReLU non-linearity.
- The code below shows how to define such a network as C++ loops and in TensorFlow.
- By not having to start from a loop-based definition, we avoid any raising and can perform optimizations at a higher level of abstraction than is possible with languages such as C++. This results in reduced complexity during fusion, tiling and vectorization.

C++ loop nest:

    for (int i = 0; i < 64; i++) {
      for (int j = 0; j < 1024; j++) {
        z[i][j] = 0.0f;
        for (int k = 0; k < 1024; k++)
          z[i][j] += x[i][k] * w[k][j];
        z[i][j] = std::max(z[i][j], 0.0f);
      }
    }

TensorFlow definition:

    class MLP(tf.Module):
        def __init__(self) -> None:
            super(MLP, self).__init__()
            self.W = tf.Variable(tf.ones([INPUT_SIZE, INPUT_SIZE]))

        @tf.function(input_signature=input_shapes)
        def predict(self, x: tf.Tensor) -> tf.Tensor:
            return tf.nn.relu(tf.matmul(x, self.W))
Lowering to Linalg on Tensors
- TF is lowered to MHLO and then to the Linalg dialect; both the matrix multiplication and the ReLU become linalg.generic / named ops.
- It is important to note that linalg.matmul accumulates into its output: C[i,j] = C[i,j] + A[i,k] * B[k,j]. Hence, we need a linalg.fill before the linalg.matmul (see the sketch after the IR excerpt).

IR excerpt:

    #map = affine_map<(d0, d1) -> (d0, d1)>
    module {
      iree_input.global private @W = dense<...> : tensor<1024x1024xf32>
      func.func private @_inference_predict_140(%arg0: tensor<64x1024xf32>) -> tensor<64x1024xf32> {
        %cst = arith.constant 0.000000e+00 : f32
        %1 = ...   // the W global, tensor<1024x1024xf32>
        %2 = linalg.init_tensor [64, 1024] : tensor<64x1024xf32>
        %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<64x1024xf32>) -> tensor<64x1024xf32>
        %4 = linalg.matmul ins(%arg0, %1 : tensor<64x1024xf32>, tensor<1024x1024xf32>)
                           outs(%3 : tensor<64x1024xf32>) -> tensor<64x1024xf32>
        %5 = linalg.init_tensor [64, 1024] : tensor<64x1024xf32>
        %6 = linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel", "parallel"]}
               ins(%4 : tensor<64x1024xf32>) outs(%5 : tensor<64x1024xf32>) {
        ^bb0(%arg1: f32, %arg2: f32):
          %7 = arith.maxf %arg1, %cst : f32
          linalg.yield %7 : f32
        } -> tensor<64x1024xf32>
        return %6 : tensor<64x1024xf32>
      }
    }
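Because linalg.matmul accumulates, the zero-initialization matters. Below is a minimal C++ sketch of what the fill / matmul / ReLU sequence computes for this 64x1024 example; the array and function names are illustrative, not taken from the generated IR:

    #include <algorithm>

    void fill_matmul_relu(const float A[64][1024], const float B[1024][1024], float C[64][1024]) {
      for (int i = 0; i < 64; i++)
        for (int j = 0; j < 1024; j++)
          C[i][j] = 0.0f;                     // linalg.fill: zero the accumulator
      for (int i = 0; i < 64; i++)
        for (int k = 0; k < 1024; k++)
          for (int j = 0; j < 1024; j++)
            C[i][j] += A[i][k] * B[k][j];     // linalg.matmul: accumulate into C
      for (int i = 0; i < 64; i++)
        for (int j = 0; j < 1024; j++)
          C[i][j] = std::max(C[i][j], 0.0f);  // linalg.generic: ReLU
    }

Without the first loop (the fill), the matmul would add on top of whatever values were already in C.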
Overview of CPU code generation pipeline
- Split the computation graph into smaller sub-graphs (dispatch regions).
- Tiling and vectorization are key pieces of the CPU pipeline.
- The CPU pipeline performs bufferization late, preferring in-place bufferization while taking care to avoid read-after-write (RaW) conflicts.
- Finally, operations are lowered to make mapping to hardware instructions efficient (for example, lowering the matrix multiplication to outer products so that it maps easily onto FMA instructions).
- Most of this pipeline is shared with the GPU pipeline.

Dispatch Region Formation
- A dispatch region contains computation that has to be executed on the device in an atomic fashion; large neural networks are partitioned into a finite number of dispatch regions.
- Each dispatch region contains a root op (any Linalg named op, or a generic op with a reduction iterator type).
- Root ops are then fused with consumers if all uses of the producer are dominated by them; elementwise fusion is also performed to fuse linalg.generic ops.

Body of the dispatch region (excerpt):

    %1 = flow.dispatch.workgroups[%c64, %c1024](%0) : (tensor<64x1024xf32>) -> tensor<64x1024xf32> =
        (%arg1: !flow.dispatch.tensor<...>, %arg2: !flow.dispatch.tensor<...>) {
      %3 = flow.dispatch.tensor.load %arg1, offsets = [0, 0], sizes = [64, 1024], strides = [1, 1] : ...
      %4 = linalg.init_tensor [64, 1024] : tensor<64x1024xf32>
      %5 = linalg.fill ins(%cst : f32) outs(%4 : tensor<64x1024xf32>) -> tensor<64x1024xf32>
      %6 = linalg.matmul ins(%3, %cst_0 : ...) outs(%5 : tensor<64x1024xf32>) -> tensor<64x1024xf32>
      %7 = linalg.generic {...} ins(%6 : ...) outs(%4 : ...) { ... arith.maxf ... } -> tensor<64x1024xf32>
      flow.dispatch.tensor.store %7, %arg2, offsets = [0, 0], sizes = [64, 1024], strides = [1, 1] : ...
      flow.return
    } count(%arg1: index, %arg2: index) -> (index, index, index) {
      %x, %y, %z = flow.dispatch.default_workgroup_count %arg1, %arg2
      flow.return %x, %y, %z : index, index, index
    }
Tile and Distribute to Workgroups
- Work is partitioned along a 3-D grid of virtual processors (workgroups) that can be mapped to multi-core CPUs or GPUs.
- A block-cyclic distribution is used to distribute the tiles: each workgroup performs a tile of the computation, and the work it does is determined by its rank (workgroup_id_x, workgroup_id_y) and the number of processors (a C++ sketch of the distribution follows the IR excerpt).
- Only the parallel dimensions are tiled.

IR excerpt:

    func.func @predict_dispatch_0() {
      %workgroup_id_x = hal.interface.workgroup.id[0] : index
      %workgroup_count_x = hal.interface.workgroup.count[0] : index
      %workgroup_id_y = hal.interface.workgroup.id[1] : index
      %workgroup_count_y = hal.interface.workgroup.count[1] : index
      %2 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_id_y]
      %3 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_count_y]
      scf.for %arg0 = %2 to %c64 step %3 {
        %4 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_id_x]
        %5 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_count_x]
        scf.for %arg1 = %4 to %c1024 step %5 {
          ...
          %8 = linalg.fill ins(%cst_0 : f32) outs(%6 : tensor<...>) -> tensor<...>
          %9 = linalg.matmul ins(%7, %cst : ...) outs(%8 : ...) -> tensor<...>
          %10 = linalg.generic {...} outs(%9 : ...) { ... arith.maxf ... } -> tensor<...>
        }
      }
    }
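A minimal C++ sketch of the block-cyclic distribution above. The 32x32 tile sizes mirror the excerpt, and process_tile() is a hypothetical placeholder for the fill + matmul + ReLU on one tile; only the loop structure corresponds to the generated scf.for loops:

    constexpr int M = 64, N = 1024;
    constexpr int TILE_M = 32, TILE_N = 32;   // illustrative workgroup tile sizes

    // Placeholder: fill + matmul + ReLU on the tile starting at (i0, j0).
    void process_tile(int i0, int j0, int tile_m, int tile_n) {}

    // Each (virtual) workgroup starts at its own tile and strides by the total
    // number of workgroups in that dimension: a block-cyclic distribution.
    void workgroup_body(int workgroup_id_x, int workgroup_count_x,
                        int workgroup_id_y, int workgroup_count_y) {
      for (int i = workgroup_id_y * TILE_M; i < M; i += workgroup_count_y * TILE_M)
        for (int j = workgroup_id_x * TILE_N; j < N; j += workgroup_count_x * TILE_N)
          process_tile(i, j, TILE_M, TILE_N);
    }

With more workgroups than tiles each workgroup executes its loop body at most once; with fewer, the stride makes it wrap around and pick up additional tiles.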
Tile and Fuse
- Additional tiling along the parallel dimensions.
- Introduces subset operations (tensor.extract_slice, tensor.insert_slice) to access the tiled data.
- Depending on the chosen tile sizes, there may not exist a single static tensor type that is valid for every iteration; the sub-tensor may then be relaxed to a dynamic tensor, and subsequent canonicalizations can refine any shapes that are determined to be static.
- Padding (tensor.pad operations with appropriate sizes) can be used to handle these cases.

IR excerpt:

    scf.for %arg0 = %2 to %c64 step %3 {
      scf.for %arg1 = %4 to %c1024 step %5 {
        %8 = scf.for %arg2 = %c0 to %c32 step %c8 iter_args(%arg3 = %7) -> (tensor<...>) {
          %9 = tensor.extract_slice %6[%arg2, 0] [8, 1024] [1, 1] : tensor<...> to tensor<...>
          %10 = scf.for %arg4 = %c0 to %c128 step %c32 iter_args(%arg5 = %arg3) -> (tensor<...>) {
            %11 = tensor.extract_slice %arg5[%arg2, %arg4] [8, 32] [1, 1] : tensor<...> to tensor<...>
            %12 = linalg.fill ins(%cst_0 : f32) outs(%11 : ...) -> tensor<...>
            %13 = linalg.matmul ins(%9, %cst : ...) outs(%12 : ...) -> tensor<...>
            %14 = linalg.generic {...} outs(%13 : ...) { ... arith.maxf ... } -> tensor<...>
            %15 = tensor.insert_slice %14 into %arg5[%arg2, %arg4] [8, 32] [1, 1] : tensor<...> into tensor<...>
            scf.yield %15 : tensor<...>
          }
          scf.yield %10 : tensor<...>
        }
        ...
      }
    }
Single Tiling Expert
- There are multiple ways of tiling the matmul: DoubleTilingExpert, TripleTilingExpert, DoubleTilingPadExpert, etc.
- Using SingleTilingExpert we get a single tiling of the reduction dimension: instead of walking the full K = 1024 extent at once, the K loop now steps in tiles of 16 (a C++ sketch of the resulting loop nest follows the excerpt).

IR excerpt:

    scf.for %arg2 = %c0 to %c32 step %c8 iter_args(%arg3 = %7) -> (tensor<...>) {
      %9 = scf.for %arg4 = %c0 to %c128 step %c32 iter_args(%arg5 = %arg3) -> (tensor<...>) {
        %10 = tensor.extract_slice %arg5[%arg2, %arg4] [8, 32] [1, 1] : ...
        %11 = linalg.fill ins(%cst_0 : f32) outs(%10 : ...) -> tensor<...>
        %12 = scf.for %arg6 = %c0 to %c1024 step %c16 iter_args(%arg7 = %11) -> (tensor<...>) {
          %15 = tensor.extract_slice %6[%arg2, %arg6] [8, 16] [1, 1] : ...
          %16 = linalg.matmul ins(%15, %cst : ...) outs(%arg7 : ...) -> tensor<...>
          scf.yield %16 : tensor<...>
        }
        %13 = linalg.generic {...}   // ReLU
        %14 = tensor.insert_slice %13 into %arg5[%arg2, %arg4] [8, 32] [1, 1] : ...
        scf.yield %14 : tensor<...>
      }
      scf.yield %9 : tensor<...>
    }
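The loop structure after tiling corresponds to the following C++ sketch of a tiled matmul with an 8x32 output tile and a K tile of 16 (tile sizes taken from the excerpt above; array names and the 32x128 workgroup tile extents are illustrative):

    constexpr int TM = 8, TN = 32, TK = 16;   // tile sizes from the IR above

    // One 32x128 workgroup tile of C, further tiled into 8x32 sub-tiles, with the
    // reduction dimension stepped in chunks of 16.
    void tiled_matmul_relu(const float A[64][1024], const float B[1024][1024],
                           float C[64][1024], int i0, int j0) {
      for (int i = i0; i < i0 + 32; i += TM)
        for (int j = j0; j < j0 + 128; j += TN) {
          for (int ii = 0; ii < TM; ii++)                 // linalg.fill on the sub-tile
            for (int jj = 0; jj < TN; jj++)
              C[i + ii][j + jj] = 0.0f;
          for (int k = 0; k < 1024; k += TK)              // single tiling of K
            for (int ii = 0; ii < TM; ii++)
              for (int jj = 0; jj < TN; jj++)
                for (int kk = 0; kk < TK; kk++)
                  C[i + ii][j + jj] += A[i + ii][k + kk] * B[k + kk][j + jj];
          for (int ii = 0; ii < TM; ii++)                 // fused ReLU on the sub-tile
            for (int jj = 0; jj < TN; jj++)
              C[i + ii][j + jj] = C[i + ii][j + jj] > 0.0f ? C[i + ii][j + jj] : 0.0f;
        }
    }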
Vectorize
- Emits vector.transfer_read / vector.transfer_write operations for each operand.
- Elementwise operations are rewritten as their pointwise vector variants; reductions are rewritten as vector.contract or multi-reduction ops.
- Broadcasting of lower-dimensional operands is done by vector.broadcast; permutations are handled by vector.transpose.
- The n-D vectors are later lowered to the 1-D vectors supported by LLVM.

IR excerpt:

    %11 = scf.for %arg6 = %c0 to %c1024 step %c16 iter_args(%arg7 = %cst) -> (vector<8x32xf32>) {
      %14 = vector.transfer_read %7[%arg2, %arg6], %cst_1 {in_bounds = [true, true]} : tensor<...>, vector<8x16xf32>
      %15 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>,
                                              affine_map<(d0, d1, d2) -> (d2, d1)>,
                                              affine_map<(d0, d1, d2) -> (d0, d1)>],
                             iterator_types = ["parallel", "parallel", "reduction"],
                             kind = #vector.kind<add>}
              %14, %6, %arg7 : ... into vector<8x32xf32>
      scf.yield %15 : vector<8x32xf32>
    }
    %12 = arith.maxf %11, %cst : vector<8x32xf32>
    %13 = vector.transfer_write %12, %arg5[%arg2, %arg4] {in_bounds = [true, true]} : vector<8x32xf32>, tensor<...>
Bufferize
- Allocate and copy as little memory as possible; always prefer reusing buffers in place.
- Destination-passing style is used as a heuristic for in-place bufferization: the output tensor is tied to the result tensor and acts as a bufferization constraint.
- The analysis looks at future uses of the operands and checks whether a RaW conflict would be introduced; if not, it bufferizes in place.

IR excerpt:

    func.func @predict_dispatch_0() {
      %7 = vector.transfer_read %0[%c0, %c0], %cst {in_bounds = [true, true]} : memref<...>, vector<...>
      scf.for %arg0 = %3 to %c64 step %4 {
        %8 = memref.subview %1[%arg0, 0] [32, 1024] [1, 1] : memref<...> to memref<32x1024xf32, affine_map<...>>
        scf.for %arg1 = %5 to %c1024 step %6 {
          %9 = memref.subview %2[%arg0, %arg1] [32, 128] [1, 1] : memref<...> to memref<32x128xf32, affine_map<...>>
          scf.for %arg2 = %c0 to %c32 step %c8 {
            scf.for %arg3 = %c0 to %c128 step %c32 {
              %10 = scf.for %arg4 = %c0 to %c1024 step %c16 iter_args(%arg5 = %cst_0) -> (vector<8x32xf32>) {
                %12 = vector.transfer_read %8[%arg2, %arg4], %cst {in_bounds = [true, true]} : memref<32x1024xf32, affine_map<...>>, vector<8x16xf32>
                %13 = vector.contract {...} %12, %7, %arg5 : ... into vector<8x32xf32>
                scf.yield %13 : vector<8x32xf32>
              }
              %11 = arith.maxf %10, %cst_0 : vector<8x32xf32>
              vector.transfer_write %11, %9[%arg2, %arg3] {in_bounds = [true, true]} : vector<8x32xf32>, memref<32x128xf32, affine_map<...>>
            }
          }
        }
      }
    }
Lowering closer to hardware
- Apply vector unrolling: break the vectors down to sizes that are well supported by the target; non-power-of-2 sizes are handled pre-emptively to avoid suboptimal code generation.
- vector.contract is lowered to outer products (vector.outerproduct) to enable mapping to SIMD FMA instructions (a scalar sketch of this follows the excerpt).
- The result is further lowered to the LLVM dialect and translated to LLVM IR.

IR excerpt:

    %8 = vector.extract %5[0] : vector<16x32xf32>
    %9 = vector.extract %5[1] : vector<16x32xf32>
    ...
    %22 = vector.extract %5[14] : vector<16x32xf32>
    scf.for %arg0 = %c0 to %c32 step %c8 {
      scf.for %arg1 = %c0 to %c128 step %c32 {
        %24 = scf.for %arg2 = %c0 to %c1024 step %c16 iter_args(%arg3 = %cst_0) -> (vector<8x32xf32>) {
          %26 = vector.transfer_read %6[%arg0, %arg2], %cst {in_bounds = [true, true]} : memref<32x1024xf32, ...>, vector<8x16xf32>
          %27 = vector.transpose %26, [1, 0] : vector<8x16xf32> to vector<16x8xf32>
          %28 = vector.extract %27[0] : vector<16x8xf32>
          %29 = vector.outerproduct %28, %8, %arg3 {kind = #vector.kind<add>} : vector<8xf32>, vector<32xf32>
          ...
          %58 = vector.extract %27[15] : vector<16x8xf32>
          %59 = vector.outerproduct %58, %23, %57 {kind = #vector.kind<add>} : vector<8xf32>, vector<32xf32>
          scf.yield %59 : vector<8x32xf32>
        }
        %25 = arith.maxf %24, %cst_0 : vector<8x32xf32>
        vector.transfer_write %25, %7[%arg0, %arg1] {in_bounds = [true, true]} : vector<8x32xf32>, memref<32x128xf32, ...>
      }
    }
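As a scalar illustration of this lowering, here is a C++ sketch of an 8x32 matmul tile computed as a sum of 16 outer products; the sizes follow the excerpt above, the names are illustrative, and the inner loop is what maps onto vector FMA instructions:

    // acc[8][32] += A_tile[8][16] * B_tile[16][32], expressed as 16 outer products.
    void tile_as_outer_products(const float A_tile[8][16], const float B_tile[16][32],
                                float acc[8][32]) {
      for (int k = 0; k < 16; k++)          // one outer product per k
        for (int i = 0; i < 8; i++) {
          float a = A_tile[i][k];           // broadcast lane
          for (int j = 0; j < 32; j++)      // this inner loop becomes a vector FMA
            acc[i][j] += a * B_tile[k][j];  // fused multiply-add
        }
    }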
Overview of GPU code generation pipeline
- Tiling and vectorization are key pieces of the GPU pipeline.
- The GPU pipeline bufferizes early and focuses on optimizing shared memory copies and reducing bank conflicts (the plan is to move bufferization to after vectorization).
- GPU vectorization focuses on using tensor cores efficiently and emitting the appropriate NVVM intrinsics.

Tile and Distribute to Warps
- After bufferization, tile the reduction dimension.
- Copy subviews of the input memrefs to shared memory (workgroup memory) on the GPU prior to the computation.
- Start introducing SIMD characteristics by introducing memref.copy (the mapping to shared memory is done later, further downstream), and insert barriers after copying to workgroup memory.
- Do an additional level of tiling to distribute the work to warps.

IR excerpt:

    scf.for %arg0 = %c0 to %c1024 step %c16 {
      ...
      memref.copy %19, %1 : memref<32x16xf32, affine_map<...>> to memref<32x16xf32, 3>
      memref.copy %20, %0 : memref<16x32xf32, affine_map<...>> to memref<16x32xf32, 3>
      gpu.barrier
      %21 = gpu.thread_id x
      %22 = gpu.thread_id y
      %23 = affine.apply affine_map<()[s0] -> (s0 * 16)>()[%22]
      %24 = affine.apply affine_map<(d0) -> ((d0 floordiv 32) * 16)>(%21)
      %25 = memref.subview %1[%23, 0] [16, 16] [1, 1] : ... to memref<16x16xf32, affine_map<...>, 3>
      %26 = memref.subview %0[0, %24] [16, 16] [1, 1] : ... to memref<16x16xf32, affine_map<...>, 3>
      linalg.matmul ins(%25, %26 : ...) outs(%27 : memref<16x16xf32, affine_map<...>>)
      ...
    }
Multi-Buffering
- To hide latency, double/multi-buffering is used to break the dependency between consecutive loop iterations that reuse the same temporary buffer (a C++ sketch of the idea follows below).
- This is required for pipelining; the number of buffer copies is determined by the desired number of pipeline stages.

Before and after multi-buffering (abridged):

    // before: one shared-memory buffer reused by every iteration
    %0 = memref.alloc() : memref<16x32xf32, 3>
    ...
    memref.copy %20, %0 : memref<16x32xf32, affine_map<...>> to memref<16x32xf32, 3>

    // after: the allocation gains a leading dimension, one slice per in-flight stage
    %0 = memref.alloc() : memref<4x16x32xf32, 3>
    ...
    %22 = memref.subview %0[%21, 0, 0] [1, 16, 32] [1, 1, 1] : memref<4x16x32xf32, 3> to memref<16x32xf32, affine_map<...>, 3>
    memref.copy %23, %22 : memref<16x32xf32, affine_map<...>> to memref<16x32xf32, affine_map<...>, 3>
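A minimal C++ sketch of the double-buffering idea, using hypothetical load_tile_to_shared() and compute_on_tile() placeholders in place of the real async copies and mma computation; with two buffers the copy for iteration k+1 no longer has to wait for the compute of iteration k to release the buffer:

    constexpr int K_TILES = 1024 / 16;   // number of reduction-dimension tiles

    // Placeholders standing in for the shared-memory copy and the tile compute.
    void load_tile_to_shared(float* dst, int k_tile) {}
    void compute_on_tile(const float* shared_tile) {}

    void double_buffered_loop(float* shared0, float* shared1) {
      float* buffers[2] = {shared0, shared1};
      load_tile_to_shared(buffers[0], 0);                    // prologue: prefetch first tile
      for (int k = 0; k < K_TILES; k++) {
        if (k + 1 < K_TILES)
          load_tile_to_shared(buffers[(k + 1) % 2], k + 1);  // prefetch the next tile
        compute_on_tile(buffers[k % 2]);                     // compute on the current tile
      }
    }

In the real GPU code the prefetch is an asynchronous copy and barriers separate copy from compute; the sketch only shows how multiple buffers break the reuse dependency between iterations.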
Vectorize Shared Memory Copies
- Vectorize the shared memory copies (convert them to vector.transfer_read / vector.transfer_write).
- For optimal performance we always want each copy to move 128 bits; this can be used to determine the copy tile size (see the sketch after the excerpt).
- Unroll the vector transfer reads and writes.

IR excerpt:

    %3 = memref.alloc() : memref<...>
    %4 = memref.alloc() : memref<...>
    ...
    linalg.fill ins(%cst : f32) outs(%14 : memref<16x16xf32, affine_map<...>>)
    scf.for %arg0 = %c0 to %c1024 step %c16 {
      ...
      %22 = vector.transfer_read %18[%20, %21], %cst {in_bounds = [true, true]} : memref<32x16xf32, ...>, vector<...>
      vector.transfer_write %22, %16[%20, %21] {in_bounds = [true, true]} : vector<...>, memref<32x16xf32, ..., 3>
      ...
      %25 = vector.transfer_read %19[%23, %24], %cst {in_bounds = [true, true]} : memref<16x32xf32, ...>, vector<...>
      vector.transfer_write %25, %17[%23, %24] {in_bounds = [true, true]} : vector<...>, memref<16x32xf32, ..., 3>
      gpu.barrier
      linalg.matmul ins(%27, %28 : ...) outs(%29 : memref<...>)
      linalg.generic {...} outs(%14 : memref<16x16xf32, ...>)   // ReLU
    }
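A C++ sketch of the 128-bit-per-copy idea: moving four consecutive f32 values as one 16-byte unit. The struct and loop are illustrative stand-ins for the 4-wide vector transfers the compiler emits:

    #include <cstring>
    #include <cstddef>

    struct alignas(16) Float4 { float v[4]; };   // one 128-bit (16-byte) unit

    // Each copy moves 128 bits at a time instead of four separate 32-bit loads/stores.
    void copy_128bit_chunks(const float* src, float* dst, std::size_t num_floats) {
      for (std::size_t i = 0; i + 4 <= num_floats; i += 4) {
        Float4 tmp;
        std::memcpy(&tmp, src + i, sizeof(Float4));   // 16-byte load
        std::memcpy(dst + i, &tmp, sizeof(Float4));   // 16-byte store
      }
    }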
Reduce Bank Conflicts
- Shared memory is arranged in banks (32 banks, each 4 bytes wide); each thread in a warp can access shared memory in parallel.
- When two or more threads in a warp access 4-byte words in the same bank, the accesses are serialized, reducing the overall bandwidth.
- Pad the inner dimension of the shared-memory allocations (the allocOp) by 16 bytes to reduce the chance of bank conflicts (a C++ sketch of the idea follows below); the plan is to switch to shared-memory swizzling for better efficiency.

Before and after padding (abridged; the subviews keep the original sizes):

    // before
    %3 = memref.alloc() : memref<4x16x32xf32, 3>
    %4 = memref.subview %3[0, 0, 0] [4, 16, 32] [1, 1, 1] : ...
    %5 = memref.alloc() : memref<4x32x16xf32, 3>
    %6 = memref.subview %5[0, 0, 0] [4, 32, 16] [1, 1, 1] : ...

    // after: inner dimensions padded by 4 f32 elements (16 bytes)
    %3 = memref.alloc() : memref<4x16x36xf32, 3>
    %4 = memref.subview %3[0, 0, 0] [4, 16, 32] [1, 1, 1] : ...
    %5 = memref.alloc() : memref<4x32x20xf32, 3>
    %6 = memref.subview %5[0, 0, 0] [4, 32, 16] [1, 1, 1] : ...
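The padding trick in plain C++ terms, assuming 32 banks and a column-wise access pattern; the extra padding elements change which bank each row of a column lands in (a sketch of the idea, not CUDA code):

    constexpr int BANKS = 32;

    // Without padding: element (r, c) maps to bank ((r * 32 + c) % 32) == (c % 32),
    // so 32 threads reading one column all hit the same bank and serialize.
    float tile_unpadded[16][32];

    // With the inner dimension padded to 36 floats (16 extra bytes per row),
    // element (r, c) maps to bank ((r * 36 + c) % 32), spreading a column across banks.
    float tile_padded[16][36];

    int bank_of_unpadded(int r, int c) { return (r * 32 + c) % BANKS; }
    int bank_of_padded(int r, int c)   { return (r * 36 + c) % BANKS; }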
Tensor Core Vectorization
- Replace all Linalg ops with their vector equivalents: linalg.matmul is lowered to vector.contract (with vector.transfer_read/write), and linalg.generic is lowered to arith ops on vectors.
- Optimize the vector transfers by removing redundant ops.
- The vector transfer and contract sizes are determined by the sizes the tensor cores support.

IR excerpt:

    %16 = scf.for %arg0 = %c0 to %c1024 step %c16 iter_args(%arg1 = %cst) -> (vector<16x16xf32>) {
      gpu.barrier
      %25 = vector.transfer_read %6[%24, %23], %cst_0 : memref<...>, vector<...>
      vector.transfer_write %25, %4[%26, %12, %13] : vector<...>, memref<...>      // copy to shared memory
      %28 = vector.transfer_read %5[%27, %15], %cst_0 : memref<...>, vector<...>
      vector.transfer_write %28, %3[%29, %14, %15] : vector<...>, memref<...>
      gpu.barrier
      ...
      %32 = vector.transfer_read %4[%31, %30, %c0], %cst_0 {in_bounds = [true, true]} : memref<...>, vector<16x8xf32>
      %41 = vector.transfer_read %3[%40, %c8, %39], %cst_0 {in_bounds = [true, true]} : memref<...>, vector<8x16xf32>
      %42 = vector.contract {...} %32, %38, %arg1 : vector<16x8xf32>, vector<8x16xf32> into vector<16x16xf32>
      %43 = vector.contract {...} %35, %41, %42 : vector<16x8xf32>, vector<8x16xf32> into vector<16x16xf32>
      scf.yield %43 : vector<16x16xf32>
    }
    %17 = arith.maxf %16, %cst : vector<16x16xf32>
Convert to GPU Dialect
- Convert the shared memory copies to async copies (nvgpu.device_async_copy, nvgpu.device_async_create_group, nvgpu.device_async_wait).
- Adds gpu.subgroup_mma_load_matrix, which loads a matrix using all threads in a subgroup.
- Lowers vector.contract to gpu.subgroup_mma_compute, which performs a matrix-multiply-accumulate using all threads in the subgroup.
- Lowers arith elementwise ops to gpu.subgroup_mma_elementwise.

IR excerpt:

    %13 = scf.for %arg0 = %c0 to %c1024 step %c16 iter_args(%arg1 = %0) -> (!gpu.mma_matrix<...>) {
      gpu.barrier
      %20 = nvgpu.device_async_copy %7[%18, %17], %5[%19, %9, %10], 4 : memref<...> to memref<..., 3>
      %22 = nvgpu.device_async_copy %6[%21, %12], %4[%19, %11, %12], 4 : memref<...> to memref<..., 3>
      %23 = nvgpu.device_async_create_group %20, %22
      nvgpu.device_async_wait %23
      gpu.barrier
      %25 = gpu.subgroup_mma_load_matrix %5[%19, %24, %c0] {leadDimension = 20 : index} : memref<..., 3> -> !gpu.mma_matrix<...>
      %28 = gpu.subgroup_mma_load_matrix %4[%19, %c0, %27] {leadDimension = 36 : index} : memref<..., 3> -> !gpu.mma_matrix<...>
      %31 = gpu.subgroup_mma_compute %26, %29, %30 : !gpu.mma_matrix<...>, !gpu.mma_matrix<...> -> !gpu.mma_matrix<...>
      scf.yield %31 : !gpu.mma_matrix<...>
    }
    %14 = gpu.subgroup_mma_elementwise maxf %13, %0 : (!gpu.mma_matrix<...>, !gpu.mma_matrix<...>) -> !gpu.mma_matrix<...>
    gpu.subgroup_mma_store_matrix %14, %8[%15, %16] {leadDimension = 1024 : index} : !gpu.mma_matrix<...>, memref<...>
GPU Pipelining
- Implements software pipelining using modulo scheduling (a scalar C++ sketch of the idea follows the IR excerpt below).
- The operations of the original loop body are overlapped so that there is a fixed initiation interval (II) between the starts of consecutive loop iterations.
- The schedule can be constrained by the available hardware resources or by loop-carried dependences.
- A prologue, a kernel and an epilogue are emitted.
- Running example: for (int i = 0; i < N; i++) B[i] = (A[i] + A[i+1]) / 4;
- Prologue: contains the async copies to shared memory.
- Kernel: contains the mma load-matrix and compute, and ends with the async copies for a later iteration.
- Epilogue: contains the remaining computation.

IR excerpt (prologue, kernel, epilogue):

    // prologue: prefetch the first stages into shared memory
    %14 = nvgpu.device_async_copy %7[%13, %10], %5[%c0, %9, %10], 4 {__pipelining_global_load__} : ...
    %15 = nvgpu.device_async_copy %6[%11, %12], %4[%c0, %11, %12], 4 {__pipelining_global_load__} : ...
    %16 = nvgpu.device_async_create_group %14, %15 {__pipelining_global_load__}
    ...
    %31 = nvgpu.device_async_create_group %28, %30 {__pipelining_global_load__}

    // kernel
    %32:9 = scf.for %arg0 = %c0 to %c960 step %c16
        iter_args(%arg1 = %0, %arg2 = %16, %arg3 = %21, %arg4 = %26, %arg5 = %31,
                  %arg6 = %c0, %arg7 = %c1, %arg8 = %c2, %arg9 = %c3) -> (...) {
      nvgpu.device_async_wait %arg2 {numGroups = 3 : i32}
      gpu.barrier
      %63 = gpu.subgroup_mma_load_matrix %5[%arg6, %62, %c0] {leadDimension = 20 : index} : ...
      %66 = gpu.subgroup_mma_load_matrix %4[%arg6, %c0, %65] {leadDimension = 36 : index} : ...
      %68 = gpu.subgroup_mma_compute %63, %66, %arg1 : ...
      gpu.barrier
      %73 = nvgpu.device_async_copy %7[%13, %71], %5[%72, %9, %10], 4 {__pipelining_global_load__} : ...
      %75 = nvgpu.device_async_copy %6[%74, %12], %4[%72, %11, %12], 4 {__pipelining_global_load__} : ...
      %76 = nvgpu.device_async_create_group %73, %75 {__pipelining_global_load__}
      scf.yield %69, %arg3, %arg4, %arg5, %76, %arg7, %arg8, %arg9, %72 : ...
    }

    // epilogue: drain the in-flight copies and finish the remaining tiles
    nvgpu.device_async_wait %32#1 {numGroups = 3 : i32}
    gpu.barrier
    %34 = gpu.subgroup_mma_load_matrix %5[%32#5, %33, %c0] {leadDimension = 20 : index} : ...
    %37 = gpu.subgroup_mma_load_matrix %4[%32#5, %c0, %36] {leadDimension = 36 : index} : ...
    %39 = gpu.subgroup_mma_compute %34, %37, %32#0 : ...
    nvgpu.device_async_wait %32#2 {numGroups = 2 : i32}
    ...
    %59 = gpu.subgroup_mma_elementwise maxf %58, %0 : ...
    gpu.subgroup_mma_store_matrix %59, %8[%60, %61] {leadDimension = 1024 : index} : ...
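A scalar C++ sketch of the pipelining idea applied to the running example B[i] = (A[i] + A[i+1]) / 4, split into two stages (load, then compute/store). The staging variables are hypothetical stand-ins for the shared-memory buffers and async copies of the real kernel:

    // Requires A to have N + 1 elements, as the original loop already does.
    void pipelined(const float* A, float* B, int N) {
      if (N <= 0) return;
      // prologue: issue the loads for iteration 0
      float a0 = A[0], a1 = A[1];
      // kernel: compute iteration i while loading the inputs of iteration i + 1
      for (int i = 0; i < N - 1; i++) {
        float next0 = A[i + 1], next1 = A[i + 2];   // loads for the next iteration
        B[i] = (a0 + a1) / 4.0f;                    // compute for the current one
        a0 = next0; a1 = next1;
      }
      // epilogue: finish the last iteration (no more loads to issue)
      B[N - 1] = (a0 + a1) / 4.0f;
    }

The generated GPU kernel does the same thing with more stages: the prologue issues several async copy groups, the kernel overlaps mma computation with copies for later iterations, and the epilogue drains the in-flight copies.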
GPU Lowering
- Next, LLVM is used to lower to PTX.
- JIT compilation converts the PTX to native GPU machine code; alternatively, the PTX assembler can be used to produce SASS (and then a CUBIN).

PTX excerpt (tensor core wmma loads and mma, async-copy waits and barriers):

    wmma.load.a.sync.aligned.row.m16n16k8.shared.tf32 {%r43, %r44, %r45, %r46}, [%rd107+11776], %r6;
    wmma.load.a.sync.aligned.row.m16n16k8.shared.tf32 {%r47, %r48, %r49, %r50}, [%rd107+11808], %r6;
    wmma.load.b.sync.aligned.row.m16n16k8.shared.tf32 {%r51, %r52, %r53, %r54}, [%rd109+2304], %r15;
    wmma.load.b.sync.aligned.row.m16n16k8.shared.tf32 {%r55, %r56, %r57, %r58}, [%rd109+3456], %r15;
    wmma.mma.sync.aligned.row.row.m16n16k8.f32.tf32.tf32.f32
        {%f49, ..., %f56}, {%r43, %r44, %r45, %r46}, {%r51, %r52, %r53, %r54}, {%f41, ..., %f48};
    wmma.mma.sync.aligned.row.row.m16n16k8.f32.tf32.tf32.f32
        {%f57, ..., %f64}, {%r47, %r48, %r49, %r50}, {%r55, %r56, %r57, %r58}, {%f49, ..., %f56};
    cp.async.wait_group 1;
    bar.sync 0;
    wmma.load.a.sync.aligned.row.m16n16k8.shared.tf32 {%r59, %r60, %r61, %r62}, [%rd107+14336], %r6;
    wmma.load.a.sync.aligned.row.m16n16k8.shared.tf32 {%r63, %r64, %r65, %r66}, [%rd107+14368], %r6;
    wmma.load.b.sync.aligned.row.m16n16k8.shared.tf32 {%r67, %r68, %r69, %r70}, [%rd109+4608], %r15;
    wmma.load.b.sync.aligned.row.m16n16k8.shared.tf32 {%r71, %r72, %r73, %r74}, [%rd109+5760], %r15;
    wmma.mma.sync.aligned.row.row.m16n16k8.f32.tf32.tf32.f32
        {%f65, ..., %f72}, {%r59, %r60, %r61, %r62}, {%r67, %r68, %r69, %r70}, {%f57, ..., %f64};
    wmma.mma.sync.aligned.row.row.m16n16k8.f32.tf32.tf32.f32
        {%f73, ..., %f80}, {%r63, %r64, %r65, %r66}, {%r71, %r72, %r73, %r74}, {%f65, ..., %f72};
    cp.async.wait_group 0;
    bar.sync 0;
Additional Targets
- The existing pipeline can be reused to target other GPUs that have tensor-core-equivalent functionality.
- Adjust the lowering from the Linalg to the vector dialect.
- Add an appropriate dialect and lowering for the custom accelerator.

Targeting custom accelerators
- New RISC-V based many-core architectures (such as HammerBlade).
- Create a new dialect to model the multiple processing elements (PEs) and the memory hierarchy.
- Can leverage the vector dialect or the experimental RVV dialect for vector code generation.
- Need to develop a cost model to determine how to place kernels on tiles of PEs.
- Such devices could also have special function units (tensor core equivalents) that can be leveraged during code generation for better performance.
Auto-tuning
- How do we determine optimal tile sizes? What about loop interchange? How much should we unroll each loop?
- How does this extend from a single operator to an entire neural network, and to distributed heterogeneous resources?
- Many hyperparameters and a large search space make it difficult to generate good code, and it is unclear how much performance is left on the table for a given set of hyperparameters.
- Formulate it as a search problem, imposing constraints to reduce the search space (a sketch of a simple tile-size search follows below).
- A variety of search algorithms can be used, ranging from reinforcement learning (RL) to genetic algorithms.

Auto-tuning
- Compiler Gym is an RL framework that can be used for compiler optimization tasks; gradient-free methods (Nevergrad) have also been used to tune performance.
- The current set of tuning knobs covers the parameters above: tile sizes, loop interchange and unroll factors.
- Beyond the operator level, we can also search for how to best partition tensors for distributed computation (for both inference and training). Approaches such as Alpa provide a framework for partitioning tensors across clusters of heterogeneous resources.
- Training adds further variables, such as checkpointing.
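As an illustration of the search formulation, here is a C++ sketch of a brute-force tile-size search; benchmark_matmul() is a hypothetical stand-in for compiling and timing a candidate configuration, and real tuners such as Compiler Gym or Nevergrad replace the exhaustive loops with a smarter search:

    #include <vector>
    #include <limits>
    #include <cstdio>

    // Placeholder: compile the kernel with the given tile sizes, run it, return the time in ms.
    double benchmark_matmul(int tile_m, int tile_n, int tile_k) { return 0.0; }

    struct Config { int m, n, k; double ms; };

    Config search_tile_sizes() {
      const std::vector<int> candidates = {4, 8, 16, 32, 64};   // constrained search space
      Config best{0, 0, 0, std::numeric_limits<double>::infinity()};
      for (int m : candidates)
        for (int n : candidates)
          for (int k : candidates) {
            double ms = benchmark_matmul(m, n, k);
            if (ms < best.ms) best = {m, n, k, ms};
          }
      std::printf("best tiles: %dx%dx%d (%.3f ms)\n", best.m, best.n, best.k, best.ms);
      return best;
    }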
Conclusion
- MLIR code generation focuses on taking high-level tensor computation primitives and lowering them to LLVM IR with the appropriate intrinsics.
- It attempts to take the guesswork out of the backend, reducing the dependence on black-box optimizers such as the LLVM auto-vectorizer, while still leveraging LLVM for traditional code generation.
- Many abstractions are shared between the CPU and GPU compilation pipelines (and potentially other new accelerators); tiling and vectorization are key components of both.
- Additional work is required to manage shared memory and target tensor cores on the GPU.
- Auto-tuning is essential to obtaining good performance from the generated kernels.
- The approach can be extended to handle sparse tensors and non-structured ops (linalg.ext).

Acknowledgements
- Nod.ai Team (Discord: https://discord.gg/RUqY2h2s9u)
- Google IREE Team (Discord: https://discord.gg/26P4xW4)
References
- Vasilache, N. et al. (2022). Composable and Modular Code Generation in MLIR. https://arxiv.org/pdf/2202.03293.pdf
- Hsin-I, C. L. et al. (2022). TinyIREE: An ML Execution Environment for Embedded Systems from Compilation to Deployment. https://arxiv.org/pdf/2205.14479.pdf
- Bradbury, A. LLVM backend development by example (RISC-V). https://youtu.be/AFaIP-dF-RA
- Codegen Dialect Overview. https://discourse.llvm.org/t/codegen-dialect-overview/2723
- Dawkins, Q. Updated MLIR Dialect Overview Diagram. https://discourse.llvm.org/t/rfc-updated-mlir-dialect-overview-diagram/64266
- Linalg Dialect Rationale: The Case for Compiler-Friendly Custom Operations. https://mlir.llvm.org/docs/Rationale/RationaleLinalgDialect/
- Anatomy of linalg.generic. https://youtu.be/A805W2KSCxQ
- Loop double-buffering/multi-buffering. https://discourse.llvm.org/t/loop-double-buffering-multi-buffering/59979
- CUTLASS: CUDA Template Library for dense linear algebra at all levels and scales. https://on-
- A Novel Hilbert Curve for Cache-locality Preserving Loops. https://eprints.cs.univie.ac.at/5726/1/loops.pdf
- Jordans, R. et al. (2015). High-level software pipelining in LLVM.
- HammerBlade Manycore Technical Reference Manual.
- RFC: Add RISC-V Vector Extension (RVV) Dialect. https://discourse.llvm.org/t/rfc-add-risc-v-vector-extension-rvv-dialect/4146
- Cummins, C. et al. (2021). CompilerGym: Robust, Performant Compiler Optimization Environments for AI Research. https://arxiv.org/pdf/2109.08267.pdf
- Zheng, L. et al. (2022). Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. https://arxiv.org/pdf/2201.12023.pdf
- Images on the Motivation slide taken from the OpenAI, NVIDIA, AMD, Sambanova and Cerebras websites.