Harsh Menon (nod.ai)
Code Generation in MLIR

Overview
- Motivation
- Code generation in LLVM/MLIR
- Dialects (Linalg, Vector, GPU, etc.)
- Walkthrough of code generation using IREE/SHARK
- CPU and GPU code generation
- Targeting custom accelerators
- Auto-tuning
- Conclusion
- Acknowledgements
- References
Motivation
- Deep learning has become extremely pervasive, spanning domains ranging from autonomous cars and natural language processing to medical imaging.
- Large models are achieving state-of-the-art results (such as transformers in NLP). Unfortunately, these large models take many months and millions of dollars to train on existing hardware.
- On edge-based systems, inference dominates, with latency as a key metric.
- New hardware vendors have risen to the occasion with custom accelerators that address some of these concerns.
- But as the number of model and hardware combinations explodes, we need a strong compiler infrastructure that provides performance gains and is easily re-targetable to new hardware.

Motivation
- LLVM provides a strong compiler infrastructure that already scales to various hardware targets and can be used to fill the gap.
- But LLVM IR is too low-level, and many opportunities for optimization are missed if we start at that level of abstraction.
- MLIR provides compiler infrastructure for handling varying levels of abstraction and a way to progressively lower to LLVM IR, leveraging the best of both worlds.
- In machine learning (ML), neural networks are defined in Python-based frameworks such as TensorFlow, PyTorch and JAX; MLIR helps progressively lower these computation graphs from their Pythonic representation to LLVM IR.

Code generation in LLVM
- Code generation is the final phase in the compiler pipeline (front end, code optimizer, code generator).
- It must make effective use of the available resources while preserving program semantics.
- Deals with problems such as instruction selection, register allocation and assignment, and instruction ordering.
- LLVM has several backends (X86, NVPTX, RISC-V, etc.) which contain the specific hardware definition: hardware instructions, registers and so on, expressed in TableGen.
- During the code generation pipeline, LLVM IR is lowered to SelectionDAG, then MachineInstr, then MCInst, from which the final machine code is emitted.
- MLIR leverages LLVM for this part of code generation. Code generation in MLIR operates at a higher level of abstraction and attempts to provide the missing infrastructure between high-level ML programs and LLVM IR.
Code generation in MLIR
- The figure on the left shows the dialect ecosystem in MLIR.
- Starting from the dialects at the top (which closely follow the native ML frameworks), there are many paths to the LLVM dialect.
- Each dialect progressively lowers the abstraction, going from tensors (immutable SSA values) to memrefs.
- High-level optimizations are performed at the upper dialects; optimizations become more hardware-specific as we approach the LLVM dialect.
- Once we arrive at the LLVM dialect, we can translate to LLVM IR using mlir-translate and use LLVM to generate the final binary.

Linalg Dialect
- The Linalg dialect is used to represent perfectly nested loop computations, making it easy to perform transformations like fusion, tiling and loop interchange.
- Operates on both tensors and memrefs.
- Can be lowered to loops or affine expressions with the computation in the loop body.
- Incorporates learnings from Halide, TVM, Tensor Comprehensions, XLA, etc.
- Linalg defines a small set of core named ops (such as matmul, conv and pooling) that the front-end dialects can lower to. This maps a large set of operations onto a smaller set that the compiler can focus on optimizing.
- The core workhorse of this dialect is the linalg.generic op.
Linalg GenericOp
- Indexing maps: capture the data access pattern for each operand. The map's domain is a point in the iteration space; its range is a point in the operand's data space.
- Iterator types: specify the data dependences between iterations of the loop nest.
- Inputs and outputs: output and result operands are tied together.
- Compute payload: the computation performed at each point in the iteration space. The yielded value is the result of the computation; the block arguments are obtained from the operands via the indexing maps.

Example (a 4-D reduction into a 2-D result; types abridged):

    %6 = linalg.generic {
           indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>,
                            affine_map<(d0, d1, d2, d3) -> (d0, d1)>],
           iterator_types = ["parallel", "parallel", "reduction", "reduction"]}
           ins(%1 : tensor<...>) outs(%5 : tensor<...>) {
    ^bb0(%arg1: f32, %arg2: f32):
      %17 = arith.addf %arg2, %arg1 : f32
      linalg.yield %17 : f32
    } -> tensor<...>

Equivalent loop nest:

    for (int i = 0; i < M; i++)
      for (int j = 0; j < N; j++)
        for (int k = 0; k < K; k++)
          for (int l = 0; l < P; l++)
            c[i][j] = c[i][j] + a[i][j][k][l];
Vector Dialect
- Provides generic, retargetable n-D vector abstractions.
- Operations in the vector dialect can progressively decompose to lower-rank variants.
- Can lower to LLVM instructions or directly target hardware intrinsics (e.g. mma_compute).
- Examples:
  - vector.transfer_read / vector.transfer_write: bridge the gap between memory and vectors and contain enough information to encode read/write patterns such as broadcasted, permuted and masked accesses.
  - vector.outerproduct: outer-product operation (typically obtained from lowering a matmul) that can be further lowered to LLVM fused multiply-add (FMA) instructions.

GPU Dialect
- Provides abstractions for a retargetable GPU model.
- Contains operations common to SIMT platforms.
- Examples:
  - Communication: gpu.all_reduce (reduction across a local workgroup)
  - Synchronization: gpu.barrier
  - Compute: gpu.subgroup_mma_compute
- Can be obtained by lowering from the vector dialect.
- Operates only on memrefs.
Accelerator-Specific Dialects
- NVIDIA GPU-specific dialects: NVVM dialect, NVGPU dialect
- AMD GPU-specific dialects: ROCDL dialect
- ARM CPU-specific dialects: ARM Neon dialect, ARM SVE dialect
- Intel CPU-specific dialects: x86Vector dialect
- Each of these dialects exposes hardware functionality specific to those devices. For example, the NVVM dialect has nvvm.wmma.mma for matrix multiplication using tensor cores, which is not present in the other dialects.

Code generation using IREE
- IREE is an open-source MLIR-based end-to-end compiler and runtime that lowers ML models for datacenter and edge workloads.
- Supports X86, NVIDIA, AMD, RISC-V, Vulkan and ARM.
- Supports TensorFlow, JAX, TFLite and PyTorch.
- We will use IREE to demonstrate code generation using MLIR, focusing specifically on the compilation pipeline.

Code generation using SHARK
- Builds on top of the IREE pipeline.
- Adds performance optimizations for CUDA: a caching allocator and async memory prefetch.
- Adds auto-tuning capabilities.
- Adds backends for custom accelerators.
- Contains a fully validated set of hundreds of models.
- Easy to deploy (integrated with Triton server).
Perceptron Code Generation
- A perceptron is defined mathematically as z = max(0, XW).
- In ML frameworks, it is a matrix multiplication followed by a ReLU non-linearity.
- The code below shows how to define such a network as C++ loops and in TensorFlow.
- By not having to start from a loop-based definition, we avoid any raising and can perform optimizations at a higher level of abstraction than is possible with languages such as C++. This results in reduced complexity during fusion, tiling and vectorization.

C++ loop nest:

    for (int i = 0; i < 64; i++) {
      for (int j = 0; j < 1024; j++) {
        z[i][j] = 0.0f;
        for (int k = 0; k < 1024; k++)
          z[i][j] += x[i][k] * w[k][j];
        z[i][j] = std::max(z[i][j], 0.0f);
      }
    }

TensorFlow definition:

    class MLP(tf.Module):
        def __init__(self) -> None:
            super(MLP, self).__init__()
            self.W = tf.Variable(tf.ones([INPUT_SIZE, INPUT_SIZE]))

        @tf.function(input_signature=input_shapes)
        def predict(self, x: tf.Tensor) -> tf.Tensor:
            return tf.nn.relu(tf.matmul(x, self.W))
Lowering to Linalg on Tensors
- TF is lowered to MHLO and then to the Linalg dialect; both the matrix multiplication and the ReLU become linalg.generic / named ops.
- It is important to note that linalg.matmul accumulates into its output: C[i,j] = C[i,j] + A[i,k] * B[k,j]. Hence, we need a linalg.fill before the linalg.matmul (see the sketch after the IR excerpt).

IR excerpt:

    #map = affine_map<(d0, d1) -> (d0, d1)>
    module {
      iree_input.global private @W = dense<...> : tensor<1024x1024xf32>
      func.func private @_inference_predict_140(%arg0: tensor<64x1024xf32>) -> tensor<64x1024xf32> {
        %cst = arith.constant 0.000000e+00 : f32
        %1 = ...   // the W global, tensor<1024x1024xf32>
        %2 = linalg.init_tensor [64, 1024] : tensor<64x1024xf32>
        %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<64x1024xf32>) -> tensor<64x1024xf32>
        %4 = linalg.matmul ins(%arg0, %1 : tensor<64x1024xf32>, tensor<1024x1024xf32>)
                           outs(%3 : tensor<64x1024xf32>) -> tensor<64x1024xf32>
        %5 = linalg.init_tensor [64, 1024] : tensor<64x1024xf32>
        %6 = linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel", "parallel"]}
               ins(%4 : tensor<64x1024xf32>) outs(%5 : tensor<64x1024xf32>) {
        ^bb0(%arg1: f32, %arg2: f32):
          %7 = arith.maxf %arg1, %cst : f32
          linalg.yield %7 : f32
        } -> tensor<64x1024xf32>
        return %6 : tensor<64x1024xf32>
      }
    }
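Because linalg.matmul accumulates, the zero-initialization matters. Below is a minimal C++ sketch of what the fill / matmul / ReLU sequence computes for this 64x1024 example; the array and function names are illustrative, not taken from the generated IR:

    #include <algorithm>

    void fill_matmul_relu(const float A[64][1024], const float B[1024][1024], float C[64][1024]) {
      for (int i = 0; i < 64; i++)
        for (int j = 0; j < 1024; j++)
          C[i][j] = 0.0f;                     // linalg.fill: zero the accumulator
      for (int i = 0; i < 64; i++)
        for (int k = 0; k < 1024; k++)
          for (int j = 0; j < 1024; j++)
            C[i][j] += A[i][k] * B[k][j];     // linalg.matmul: accumulate into C
      for (int i = 0; i < 64; i++)
        for (int j = 0; j < 1024; j++)
          C[i][j] = std::max(C[i][j], 0.0f);  // linalg.generic: ReLU
    }

Without the first loop (the fill), the matmul would add on top of whatever values were already in C.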
Overview of CPU code generation pipeline
- Split the computation graph into smaller sub-graphs (dispatch regions).
- Tiling and vectorization are key pieces of the CPU pipeline.
- The CPU pipeline performs bufferization late, preferring in-place bufferization while taking care to avoid read-after-write (RaW) conflicts.
- Finally, operations are lowered to make mapping to hardware instructions efficient (for example, lowering the matrix multiplication to outer products so that it maps easily onto FMA instructions).
- Most of this pipeline is shared with the GPU pipeline.

Dispatch Region Formation
- A dispatch region contains computation that has to be executed on the device in an atomic fashion; large neural networks are partitioned into a finite number of dispatch regions.
- Each dispatch region contains a root op (any Linalg named op, or a generic op with a reduction iterator type).
- Root ops are then fused with consumers if all uses of the producer are dominated by them; elementwise fusion is also performed to fuse linalg.generic ops.

Body of the dispatch region (excerpt):

    %1 = flow.dispatch.workgroups[%c64, %c1024](%0) : (tensor<64x1024xf32>) -> tensor<64x1024xf32> =
        (%arg1: !flow.dispatch.tensor<...>, %arg2: !flow.dispatch.tensor<...>) {
      %3 = flow.dispatch.tensor.load %arg1, offsets = [0, 0], sizes = [64, 1024], strides = [1, 1] : ...
      %4 = linalg.init_tensor [64, 1024] : tensor<64x1024xf32>
      %5 = linalg.fill ins(%cst : f32) outs(%4 : tensor<64x1024xf32>) -> tensor<64x1024xf32>
      %6 = linalg.matmul ins(%3, %cst_0 : ...) outs(%5 : tensor<64x1024xf32>) -> tensor<64x1024xf32>
      %7 = linalg.generic {...} ins(%6 : ...) outs(%4 : ...) { ... arith.maxf ... } -> tensor<64x1024xf32>
      flow.dispatch.tensor.store %7, %arg2, offsets = [0, 0], sizes = [64, 1024], strides = [1, 1] : ...
      flow.return
    } count(%arg1: index, %arg2: index) -> (index, index, index) {
      %x, %y, %z = flow.dispatch.default_workgroup_count %arg1, %arg2
      flow.return %x, %y, %z : index, index, index
    }
Tile and Distribute to Workgroups
- Work is partitioned along a 3-D grid of virtual processors (workgroups) that can be mapped to multi-core CPUs or GPUs.
- A block-cyclic distribution is used to distribute the tiles: each workgroup performs a tile of the computation, and the work it does is determined by its rank (workgroup_id_x, workgroup_id_y) and the number of processors (a C++ sketch of the distribution follows the IR excerpt).
- Only the parallel dimensions are tiled.

IR excerpt:

    func.func @predict_dispatch_0() {
      %workgroup_id_x = hal.interface.workgroup.id[0] : index
      %workgroup_count_x = hal.interface.workgroup.count[0] : index
      %workgroup_id_y = hal.interface.workgroup.id[1] : index
      %workgroup_count_y = hal.interface.workgroup.count[1] : index
      %2 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_id_y]
      %3 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_count_y]
      scf.for %arg0 = %2 to %c64 step %3 {
        %4 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_id_x]
        %5 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_count_x]
        scf.for %arg1 = %4 to %c1024 step %5 {
          ...
          %8 = linalg.fill ins(%cst_0 : f32) outs(%6 : tensor<...>) -> tensor<...>
          %9 = linalg.matmul ins(%7, %cst : ...) outs(%8 : ...) -> tensor<...>
          %10 = linalg.generic {...} outs(%9 : ...) { ... arith.maxf ... } -> tensor<...>
        }
      }
    }
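A minimal C++ sketch of the block-cyclic distribution above. The 32x32 tile sizes mirror the excerpt, and process_tile() is a hypothetical placeholder for the fill + matmul + ReLU on one tile; only the loop structure corresponds to the generated scf.for loops:

    constexpr int M = 64, N = 1024;
    constexpr int TILE_M = 32, TILE_N = 32;   // illustrative workgroup tile sizes

    // Placeholder: fill + matmul + ReLU on the tile starting at (i0, j0).
    void process_tile(int i0, int j0, int tile_m, int tile_n) {}

    // Each (virtual) workgroup starts at its own tile and strides by the total
    // number of workgroups in that dimension: a block-cyclic distribution.
    void workgroup_body(int workgroup_id_x, int workgroup_count_x,
                        int workgroup_id_y, int workgroup_count_y) {
      for (int i = workgroup_id_y * TILE_M; i < M; i += workgroup_count_y * TILE_M)
        for (int j = workgroup_id_x * TILE_N; j < N; j += workgroup_count_x * TILE_N)
          process_tile(i, j, TILE_M, TILE_N);
    }

With more workgroups than tiles each workgroup executes its loop body at most once; with fewer, the stride makes it wrap around and pick up additional tiles.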
Tile and Fuse
- Additional tiling along the parallel dimensions.
- Introduces subset operations (tensor.extract_slice, tensor.insert_slice) to access the tiled data.
- Depending on the chosen tile sizes, there may not exist a single static tensor type that is valid for every iteration; the sub-tensor may then be relaxed to a dynamic tensor, and subsequent canonicalizations can refine any shapes that are determined to be static.
- Padding (tensor.pad operations with appropriate sizes) can be used to handle these cases.

IR excerpt:

    scf.for %arg0 = %2 to %c64 step %3 {
      scf.for %arg1 = %4 to %c1024 step %5 {
        %8 = scf.for %arg2 = %c0 to %c32 step %c8 iter_args(%arg3 = %7) -> (tensor<...>) {
          %9 = tensor.extract_slice %6[%arg2, 0] [8, 1024] [1, 1] : tensor<...> to tensor<...>
          %10 = scf.for %arg4 = %c0 to %c128 step %c32 iter_args(%arg5 = %arg3) -> (tensor<...>) {
            %11 = tensor.extract_slice %arg5[%arg2, %arg4] [8, 32] [1, 1] : tensor<...> to tensor<...>
            %12 = linalg.fill ins(%cst_0 : f32) outs(%11 : ...) -> tensor<...>
            %13 = linalg.matmul ins(%9, %cst : ...) outs(%12 : ...) -> tensor<...>
            %14 = linalg.generic {...} outs(%13 : ...) { ... arith.maxf ... } -> tensor<...>
            %15 = tensor.insert_slice %14 into %arg5[%arg2, %arg4] [8, 32] [1, 1] : tensor<...> into tensor<...>
            scf.yield %15 : tensor<...>
          }
          scf.yield %10 : tensor<...>
        }
        ...
      }
    }
Single Tiling Expert
- There are multiple ways of tiling the matmul: DoubleTilingExpert, TripleTilingExpert, DoubleTilingPadExpert, etc.
- Using SingleTilingExpert we get a single tiling of the reduction dimension: instead of walking the full K = 1024 extent at once, the K loop now steps in tiles of 16 (a C++ sketch of the resulting loop nest follows the excerpt).

IR excerpt:

    scf.for %arg2 = %c0 to %c32 step %c8 iter_args(%arg3 = %7) -> (tensor<...>) {
      %9 = scf.for %arg4 = %c0 to %c128 step %c32 iter_args(%arg5 = %arg3) -> (tensor<...>) {
        %10 = tensor.extract_slice %arg5[%arg2, %arg4] [8, 32] [1, 1] : ...
        %11 = linalg.fill ins(%cst_0 : f32) outs(%10 : ...) -> tensor<...>
        %12 = scf.for %arg6 = %c0 to %c1024 step %c16 iter_args(%arg7 = %11) -> (tensor<...>) {
          %15 = tensor.extract_slice %6[%arg2, %arg6] [8, 16] [1, 1] : ...
          %16 = linalg.matmul ins(%15, %cst : ...) outs(%arg7 : ...) -> tensor<...>
          scf.yield %16 : tensor<...>
        }
        %13 = linalg.generic {...}   // ReLU
        %14 = tensor.insert_slice %13 into %arg5[%arg2, %arg4] [8, 32] [1, 1] : ...
        scf.yield %14 : tensor<...>
      }
      scf.yield %9 : tensor<...>
    }
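The loop structure after tiling corresponds to the following C++ sketch of a tiled matmul with an 8x32 output tile and a K tile of 16 (tile sizes taken from the excerpt above; array names and the 32x128 workgroup tile extents are illustrative):

    constexpr int TM = 8, TN = 32, TK = 16;   // tile sizes from the IR above

    // One 32x128 workgroup tile of C, further tiled into 8x32 sub-tiles, with the
    // reduction dimension stepped in chunks of 16.
    void tiled_matmul_relu(const float A[64][1024], const float B[1024][1024],
                           float C[64][1024], int i0, int j0) {
      for (int i = i0; i < i0 + 32; i += TM)
        for (int j = j0; j < j0 + 128; j += TN) {
          for (int ii = 0; ii < TM; ii++)                 // linalg.fill on the sub-tile
            for (int jj = 0; jj < TN; jj++)
              C[i + ii][j + jj] = 0.0f;
          for (int k = 0; k < 1024; k += TK)              // single tiling of K
            for (int ii = 0; ii < TM; ii++)
              for (int jj = 0; jj < TN; jj++)
                for (int kk = 0; kk < TK; kk++)
                  C[i + ii][j + jj] += A[i + ii][k + kk] * B[k + kk][j + jj];
          for (int ii = 0; ii < TM; ii++)                 // fused ReLU on the sub-tile
            for (int jj = 0; jj < TN; jj++)
              C[i + ii][j + jj] = C[i + ii][j + jj] > 0.0f ? C[i + ii][j + jj] : 0.0f;
        }
    }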
Vectorize
- Emits vector.transfer_read / vector.transfer_write operations for each operand.
- Elementwise operations are rewritten as their pointwise vector variants; reductions are rewritten as vector.contract or multi-reduction ops.
- Broadcasting of lower-dimensional operands is done by vector.broadcast; permutations are handled by vector.transpose.
- The n-D vectors are later lowered to the 1-D vectors supported by LLVM.

IR excerpt:

    %11 = scf.for %arg6 = %c0 to %c1024 step %c16 iter_args(%arg7 = %cst) -> (vector<8x32xf32>) {
      %14 = vector.transfer_read %7[%arg2, %arg6], %cst_1 {in_bounds = [true, true]} : tensor<...>, vector<8x16xf32>
      %15 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>,
                                              affine_map<(d0, d1, d2) -> (d2, d1)>,
                                              affine_map<(d0, d1, d2) -> (d0, d1)>],
                             iterator_types = ["parallel", "parallel", "reduction"],
                             kind = #vector.kind<add>}
              %14, %6, %arg7 : ... into vector<8x32xf32>
      scf.yield %15 : vector<8x32xf32>
    }
    %12 = arith.maxf %11, %cst : vector<8x32xf32>
    %13 = vector.transfer_write %12, %arg5[%arg2, %arg4] {in_bounds = [true, true]} : vector<8x32xf32>, tensor<...>
Bufferize
- Allocate and copy as little memory as possible; always prefer reusing buffers in place.
- Destination-passing style is used as a heuristic for in-place bufferization: the output tensor is tied to the result tensor and acts as a bufferization constraint.
- The analysis looks at future uses of the operands and checks whether a RaW conflict would be introduced; if not, it bufferizes in place.

IR excerpt:

    func.func @predict_dispatch_0() {
      %7 = vector.transfer_read %0[%c0, %c0], %cst {in_bounds = [true, true]} : memref<...>, vector<...>
      scf.for %arg0 = %3 to %c64 step %4 {
        %8 = memref.subview %1[%arg0, 0] [32, 1024] [1, 1] : memref<...> to memref<32x1024xf32, affine_map<...>>
        scf.for %arg1 = %5 to %c1024 step %6 {
          %9 = memref.subview %2[%arg0, %arg1] [32, 128] [1, 1] : memref<...> to memref<32x128xf32, affine_map<...>>
          scf.for %arg2 = %c0 to %c32 step %c8 {
            scf.for %arg3 = %c0 to %c128 step %c32 {
              %10 = scf.for %arg4 = %c0 to %c1024 step %c16 iter_args(%arg5 = %cst_0) -> (vector<8x32xf32>) {
                %12 = vector.transfer_read %8[%arg2, %arg4], %cst {in_bounds = [true, true]} : memref<32x1024xf32, affine_map<...>>, vector<8x16xf32>
                %13 = vector.contract {...} %12, %7, %arg5 : ... into vector<8x32xf32>
                scf.yield %13 : vector<8x32xf32>
              }
              %11 = arith.maxf %10, %cst_0 : vector<8x32xf32>
              vector.transfer_write %11, %9[%arg2, %arg3] {in_bounds = [true, true]} : vector<8x32xf32>, memref<32x128xf32, affine_map<...>>
            }
          }
        }
      }
    }
Lowering closer to hardware
- Apply vector unrolling: break the vectors down to sizes that are well supported by the target; non-power-of-2 sizes are handled pre-emptively to avoid suboptimal code generation.
- vector.contract is lowered to outer products (vector.outerproduct) to enable mapping to SIMD FMA instructions (a scalar sketch of this follows the excerpt).
- The result is further lowered to the LLVM dialect and translated to LLVM IR.

IR excerpt:

    %8 = vector.extract %5[0] : vector<16x32xf32>
    %9 = vector.extract %5[1] : vector<16x32xf32>
    ...
    %22 = vector.extract %5[14] : vector<16x32xf32>
    scf.for %arg0 = %c0 to %c32 step %c8 {
      scf.for %arg1 = %c0 to %c128 step %c32 {
        %24 = scf.for %arg2 = %c0 to %c1024 step %c16 iter_args(%arg3 = %cst_0) -> (vector<8x32xf32>) {
          %26 = vector.transfer_read %6[%arg0, %arg2], %cst {in_bounds = [true, true]} : memref<32x1024xf32, ...>, vector<8x16xf32>
          %27 = vector.transpose %26, [1, 0] : vector<8x16xf32> to vector<16x8xf32>
          %28 = vector.extract %27[0] : vector<16x8xf32>
          %29 = vector.outerproduct %28, %8, %arg3 {kind = #vector.kind<add>} : vector<8xf32>, vector<32xf32>
          ...
          %58 = vector.extract %27[15] : vector<16x8xf32>
          %59 = vector.outerproduct %58, %23, %57 {kind = #vector.kind<add>} : vector<8xf32>, vector<32xf32>
          scf.yield %59 : vector<8x32xf32>
        }
        %25 = arith.maxf %24, %cst_0 : vector<8x32xf32>
        vector.transfer_write %25, %7[%arg0, %arg1] {in_bounds = [true, true]} : vector<8x32xf32>, memref<32x128xf32, ...>
      }
    }
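As a scalar illustration of this lowering, here is a C++ sketch of an 8x32 matmul tile computed as a sum of 16 outer products; the sizes follow the excerpt above, the names are illustrative, and the inner loop is what maps onto vector FMA instructions:

    // acc[8][32] += A_tile[8][16] * B_tile[16][32], expressed as 16 outer products.
    void tile_as_outer_products(const float A_tile[8][16], const float B_tile[16][32],
                                float acc[8][32]) {
      for (int k = 0; k < 16; k++)          // one outer product per k
        for (int i = 0; i < 8; i++) {
          float a = A_tile[i][k];           // broadcast lane
          for (int j = 0; j < 32; j++)      // this inner loop becomes a vector FMA
            acc[i][j] += a * B_tile[k][j];  // fused multiply-add
        }
    }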
Overview of GPU code generation pipeline
- Tiling and vectorization are key pieces of the GPU pipeline.
- The GPU pipeline bufferizes early and focuses on optimizing shared memory copies and reducing bank conflicts (the plan is to move bufferization to after vectorization).
- GPU vectorization focuses on using tensor cores efficiently and emitting the appropriate NVVM intrinsics.

Tile and Distribute to Warps
- After bufferization, tile the reduction dimension.
- Copy subviews of the input memrefs to shared memory (workgroup memory) on the GPU prior to the computation.
- Start introducing SIMD characteristics by introducing memref.copy (the mapping to shared memory is done later, further downstream), and insert barriers after copying to workgroup memory.
- Do an additional level of tiling to distribute the work to warps.

IR excerpt:

    scf.for %arg0 = %c0 to %c1024 step %c16 {
      ...
      memref.copy %19, %1 : memref<32x16xf32, affine_map<...>> to memref<32x16xf32, 3>
      memref.copy %20, %0 : memref<16x32xf32, affine_map<...>> to memref<16x32xf32, 3>
      gpu.barrier
      %21 = gpu.thread_id x
      %22 = gpu.thread_id y
      %23 = affine.apply affine_map<()[s0] -> (s0 * 16)>()[%22]
      %24 = affine.apply affine_map<(d0) -> ((d0 floordiv 32) * 16)>(%21)
      %25 = memref.subview %1[%23, 0] [16, 16] [1, 1] : ... to memref<16x16xf32, affine_map<...>, 3>
      %26 = memref.subview %0[0, %24] [16, 16] [1, 1] : ... to memref<16x16xf32, affine_map<...>, 3>
      linalg.matmul ins(%25, %26 : ...) outs(%27 : memref<16x16xf32, affine_map<...>>)
      ...
    }
Multi-Buffering
- To hide latency, double/multi-buffering is used to break the dependency between consecutive loop iterations that reuse the same temporary buffer (a C++ sketch of the idea follows below).
- This is required for pipelining; the number of buffer copies is determined by the desired number of pipeline stages.

Before and after multi-buffering (abridged):

    // before: one shared-memory buffer reused by every iteration
    %0 = memref.alloc() : memref<16x32xf32, 3>
    ...
    memref.copy %20, %0 : memref<16x32xf32, affine_map<...>> to memref<16x32xf32, 3>

    // after: the allocation gains a leading dimension, one slice per in-flight stage
    %0 = memref.alloc() : memref<4x16x32xf32, 3>
    ...
    %22 = memref.subview %0[%21, 0, 0] [1, 16, 32] [1, 1, 1] : memref<4x16x32xf32, 3> to memref<16x32xf32, affine_map<...>, 3>
    memref.copy %23, %22 : memref<16x32xf32, affine_map<...>> to memref<16x32xf32, affine_map<...>, 3>
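A minimal C++ sketch of the double-buffering idea, using hypothetical load_tile_to_shared() and compute_on_tile() placeholders in place of the real async copies and mma computation; with two buffers the copy for iteration k+1 no longer has to wait for the compute of iteration k to release the buffer:

    constexpr int K_TILES = 1024 / 16;   // number of reduction-dimension tiles

    // Placeholders standing in for the shared-memory copy and the tile compute.
    void load_tile_to_shared(float* dst, int k_tile) {}
    void compute_on_tile(const float* shared_tile) {}

    void double_buffered_loop(float* shared0, float* shared1) {
      float* buffers[2] = {shared0, shared1};
      load_tile_to_shared(buffers[0], 0);                    // prologue: prefetch first tile
      for (int k = 0; k < K_TILES; k++) {
        if (k + 1 < K_TILES)
          load_tile_to_shared(buffers[(k + 1) % 2], k + 1);  // prefetch the next tile
        compute_on_tile(buffers[k % 2]);                     // compute on the current tile
      }
    }

In the real GPU code the prefetch is an asynchronous copy and barriers separate copy from compute; the sketch only shows how multiple buffers break the reuse dependency between iterations.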
Vectorize Shared Memory Copies
- Vectorize the shared memory copies (convert them to vector.transfer_read / vector.transfer_write).
- For optimal performance we always want each copy to move 128 bits; this can be used to determine the copy tile size (see the sketch after the excerpt).
- Unroll the vector transfer reads and writes.

IR excerpt:

    %3 = memref.alloc() : memref<...>
    %4 = memref.alloc() : memref<...>
    ...
    linalg.fill ins(%cst : f32) outs(%14 : memref<16x16xf32, affine_map<...>>)
    scf.for %arg0 = %c0 to %c1024 step %c16 {
      ...
      %22 = vector.transfer_read %18[%20, %21], %cst {in_bounds = [true, true]} : memref<32x16xf32, ...>, vector<...>
      vector.transfer_write %22, %16[%20, %21] {in_bounds = [true, true]} : vector<...>, memref<32x16xf32, ..., 3>
      ...
      %25 = vector.transfer_read %19[%23, %24], %cst {in_bounds = [true, true]} : memref<16x32xf32, ...>, vector<...>
      vector.transfer_write %25, %17[%23, %24] {in_bounds = [true, true]} : vector<...>, memref<16x32xf32, ..., 3>
      gpu.barrier
      linalg.matmul ins(%27, %28 : ...) outs(%29 : memref<...>)
      linalg.generic {...} outs(%14 : memref<16x16xf32, ...>)   // ReLU
    }
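A C++ sketch of the 128-bit-per-copy idea: moving four consecutive f32 values as one 16-byte unit. The struct and loop are illustrative stand-ins for the 4-wide vector transfers the compiler emits:

    #include <cstring>
    #include <cstddef>

    struct alignas(16) Float4 { float v[4]; };   // one 128-bit (16-byte) unit

    // Each copy moves 128 bits at a time instead of four separate 32-bit loads/stores.
    void copy_128bit_chunks(const float* src, float* dst, std::size_t num_floats) {
      for (std::size_t i = 0; i + 4 <= num_floats; i += 4) {
        Float4 tmp;
        std::memcpy(&tmp, src + i, sizeof(Float4));   // 16-byte load
        std::memcpy(dst + i, &tmp, sizeof(Float4));   // 16-byte store
      }
    }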
Reduce Bank Conflicts
- Shared memory is arranged in banks (32 banks, each 4 bytes wide); each thread in a warp can access shared memory in parallel.
- When two or more threads in a warp access 4-byte words in the same bank, the accesses are serialized, reducing the overall bandwidth.
- Pad the inner dimension of the shared-memory allocations (the allocOp) by 16 bytes to reduce the chance of bank conflicts (a C++ sketch of the idea follows below); the plan is to switch to shared-memory swizzling for better efficiency.

Before and after padding (abridged; the subviews keep the original sizes):

    // before
    %3 = memref.alloc() : memref<4x16x32xf32, 3>
    %4 = memref.subview %3[0, 0, 0] [4, 16, 32] [1, 1, 1] : ...
    %5 = memref.alloc() : memref<4x32x16xf32, 3>
    %6 = memref.subview %5[0, 0, 0] [4, 32, 16] [1, 1, 1] : ...

    // after: inner dimensions padded by 4 f32 elements (16 bytes)
    %3 = memref.alloc() : memref<4x16x36xf32, 3>
    %4 = memref.subview %3[0, 0, 0] [4, 16, 32] [1, 1, 1] : ...
    %5 = memref.alloc() : memref<4x32x20xf32, 3>
    %6 = memref.subview %5[0, 0, 0] [4, 32, 16] [1, 1, 1] : ...
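The padding trick in plain C++ terms, assuming 32 banks and a column-wise access pattern; the extra padding elements change which bank each row of a column lands in (a sketch of the idea, not CUDA code):

    constexpr int BANKS = 32;

    // Without padding: element (r, c) maps to bank ((r * 32 + c) % 32) == (c % 32),
    // so 32 threads reading one column all hit the same bank and serialize.
    float tile_unpadded[16][32];

    // With the inner dimension padded to 36 floats (16 extra bytes per row),
    // element (r, c) maps to bank ((r * 36 + c) % 32), spreading a column across banks.
    float tile_padded[16][36];

    int bank_of_unpadded(int r, int c) { return (r * 32 + c) % BANKS; }
    int bank_of_padded(int r, int c)   { return (r * 36 + c) % BANKS; }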
Tensor Core Vectorization
- Replace all Linalg ops with their vector equivalents: linalg.matmul is lowered to vector.contract (with vector.transfer_read/write), and linalg.generic is lowered to arith ops on vectors.
- Optimize the vector transfers by removing redundant ops.
- The vector transfer and contract sizes are determined by the sizes the tensor cores support.

IR excerpt:

    %16 = scf.for %arg0 = %c0 to %c1024 step %c16 iter_args(%arg1 = %cst) -> (vector<16x16xf32>) {
      gpu.barrier
      %25 = vector.transfer_read %6[%24, %23], %cst_0 : memref<...>, vector<...>
      vector.transfer_write %25, %4[%26, %12, %13] : vector<...>, memref<...>      // copy to shared memory
      %28 = vector.transfer_read %5[%27, %15], %cst_0 : memref<...>, vector<...>
      vector.transfer_write %28, %3[%29, %14, %15] : vector<...>, memref<...>
      gpu.barrier
      ...
      %32 = vector.transfer_read %4[%31, %30, %c0], %cst_0 {in_bounds = [true, true]} : memref<...>, vector<16x8xf32>
      %41 = vector.transfer_read %3[%40, %c8, %39], %cst_0 {in_bounds = [true, true]} : memref<...>, vector<8x16xf32>
      %42 = vector.contract {...} %32, %38, %arg1 : vector<16x8xf32>, vector<8x16xf32> into vector<16x16xf32>
      %43 = vector.contract {...} %35, %41, %42 : vector<16x8xf32>, vector<8x16xf32> into vector<16x16xf32>
      scf.yield %43 : vector<16x16xf32>
    }
    %17 = arith.maxf %16, %cst : vector<16x16xf32>
Convert to GPU Dialect
- Convert the shared memory copies to async copies (nvgpu.device_async_copy, nvgpu.device_async_create_group, nvgpu.device_async_wait).
- Adds gpu.subgroup_mma_load_matrix, which loads a matrix using all threads in a subgroup.
- Lowers vector.contract to gpu.subgroup_mma_compute, which performs a matrix-multiply-accumulate using all threads in the subgroup.
- Lowers arith elementwise ops to gpu.subgroup_mma_elementwise.

IR excerpt:

    %13 = scf.for %arg0 = %c0 to %c1024 step %c16 iter_args(%arg1 = %0) -> (!gpu.mma_matrix<...>) {
      gpu.barrier
      %20 = nvgpu.device_async_copy %7[%18, %17], %5[%19, %9, %10], 4 : memref<...> to memref<..., 3>
      %22 = nvgpu.device_async_copy %6[%21, %12], %4[%19, %11, %12], 4 : memref<...> to memref<..., 3>
      %23 = nvgpu.device_async_create_group %20, %22
      nvgpu.device_async_wait %23
      gpu.barrier
      %25 = gpu.subgroup_mma_load_matrix %5[%19, %24, %c0] {leadDimension = 20 : index} : memref<..., 3> -> !gpu.mma_matrix<...>
      %28 = gpu.subgroup_mma_load_matrix %4[%19, %c0, %27] {leadDimension = 36 : index} : memref<..., 3> -> !gpu.mma_matrix<...>
      %31 = gpu.subgroup_mma_compute %26, %29, %30 : !gpu.mma_matrix<...>, !gpu.mma_matrix<...> -> !gpu.mma_matrix<...>
      scf.yield %31 : !gpu.mma_matrix<...>
    }
    %14 = gpu.subgroup_mma_elementwise maxf %13, %0 : (!gpu.mma_matrix<...>, !gpu.mma_matrix<...>) -> !gpu.mma_matrix<...>
    gpu.subgroup_mma_store_matrix %14, %8[%15, %16] {leadDimension = 1024 : index} : !gpu.mma_matrix<...>, memref<...>
GPU Pipelining
- Implements software pipelining using modulo scheduling (a scalar C++ sketch of the idea follows the IR excerpt below).
- The operations of the original loop body are overlapped so that there is a fixed initiation interval (II) between the starts of consecutive loop iterations.
- The schedule can be constrained by the available hardware resources or by loop-carried dependences.
- A prologue, a kernel and an epilogue are emitted.
- Running example: for (int i = 0; i < N; i++) B[i] = (A[i] + A[i+1]) / 4;
- Prologue: contains the async copies to shared memory.
- Kernel: contains the mma load-matrix and compute, and ends with the async copies for a later iteration.
- Epilogue: contains the remaining computation.

IR excerpt (prologue, kernel, epilogue):

    // prologue: prefetch the first stages into shared memory
    %14 = nvgpu.device_async_copy %7[%13, %10], %5[%c0, %9, %10], 4 {__pipelining_global_load__} : ...
    %15 = nvgpu.device_async_copy %6[%11, %12], %4[%c0, %11, %12], 4 {__pipelining_global_load__} : ...
    %16 = nvgpu.device_async_create_group %14, %15 {__pipelining_global_load__}
    ...
    %31 = nvgpu.device_async_create_group %28, %30 {__pipelining_global_load__}

    // kernel
    %32:9 = scf.for %arg0 = %c0 to %c960 step %c16
        iter_args(%arg1 = %0, %arg2 = %16, %arg3 = %21, %arg4 = %26, %arg5 = %31,
                  %arg6 = %c0, %arg7 = %c1, %arg8 = %c2, %arg9 = %c3) -> (...) {
      nvgpu.device_async_wait %arg2 {numGroups = 3 : i32}
      gpu.barrier
      %63 = gpu.subgroup_mma_load_matrix %5[%arg6, %62, %c0] {leadDimension = 20 : index} : ...
      %66 = gpu.subgroup_mma_load_matrix %4[%arg6, %c0, %65] {leadDimension = 36 : index} : ...
      %68 = gpu.subgroup_mma_compute %63, %66, %arg1 : ...
      gpu.barrier
      %73 = nvgpu.device_async_copy %7[%13, %71], %5[%72, %9, %10], 4 {__pipelining_global_load__} : ...
      %75 = nvgpu.device_async_copy %6[%74, %12], %4[%72, %11, %12], 4 {__pipelining_global_load__} : ...
      %76 = nvgpu.device_async_create_group %73, %75 {__pipelining_global_load__}
      scf.yield %69, %arg3, %arg4, %arg5, %76, %arg7, %arg8, %arg9, %72 : ...
    }

    // epilogue: drain the in-flight copies and finish the remaining tiles
    nvgpu.device_async_wait %32#1 {numGroups = 3 : i32}
    gpu.barrier
    %34 = gpu.subgroup_mma_load_matrix %5[%32#5, %33, %c0] {leadDimension = 20 : index} : ...
    %37 = gpu.subgroup_mma_load_matrix %4[%32#5, %c0, %36] {leadDimension = 36 : index} : ...
    %39 = gpu.subgroup_mma_compute %34, %37, %32#0 : ...
    nvgpu.device_async_wait %32#2 {numGroups = 2 : i32}
    ...
    %59 = gpu.subgroup_mma_elementwise maxf %58, %0 : ...
    gpu.subgroup_mma_store_matrix %59, %8[%60, %61] {leadDimension = 1024 : index} : ...
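A scalar C++ sketch of the pipelining idea applied to the running example B[i] = (A[i] + A[i+1]) / 4, split into two stages (load, then compute/store). The staging variables are hypothetical stand-ins for the shared-memory buffers and async copies of the real kernel:

    // Requires A to have N + 1 elements, as the original loop already does.
    void pipelined(const float* A, float* B, int N) {
      if (N <= 0) return;
      // prologue: issue the loads for iteration 0
      float a0 = A[0], a1 = A[1];
      // kernel: compute iteration i while loading the inputs of iteration i + 1
      for (int i = 0; i < N - 1; i++) {
        float next0 = A[i + 1], next1 = A[i + 2];   // loads for the next iteration
        B[i] = (a0 + a1) / 4.0f;                    // compute for the current one
        a0 = next0; a1 = next1;
      }
      // epilogue: finish the last iteration (no more loads to issue)
      B[N - 1] = (a0 + a1) / 4.0f;
    }

The generated GPU kernel does the same thing with more stages: the prologue issues several async copy groups, the kernel overlaps mma computation with copies for later iterations, and the epilogue drains the in-flight copies.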
GPU Lowering
- Next, LLVM is used to lower to PTX.
- JIT compilation converts the PTX to native GPU machine code; alternatively, the PTX assembler can be used to produce SASS (and then a CUBIN).

PTX excerpt (tensor core wmma loads and mma, async-copy waits and barriers):

    wmma.load.a.sync.aligned.row.m16n16k8.shared.tf32 {%r43, %r44, %r45, %r46}, [%rd107+11776], %r6;
    wmma.load.a.sync.aligned.row.m16n16k8.shared.tf32 {%r47, %r48, %r49, %r50}, [%rd107+11808], %r6;
    wmma.load.b.sync.aligned.row.m16n16k8.shared.tf32 {%r51, %r52, %r53, %r54}, [%rd109+2304], %r15;
    wmma.load.b.sync.aligned.row.m16n16k8.shared.tf32 {%r55, %r56, %r57, %r58}, [%rd109+3456], %r15;
    wmma.mma.sync.aligned.row.row.m16n16k8.f32.tf32.tf32.f32
        {%f49, ..., %f56}, {%r43, %r44, %r45, %r46}, {%r51, %r52, %r53, %r54}, {%f41, ..., %f48};
    wmma.mma.sync.aligned.row.row.m16n16k8.f32.tf32.tf32.f32
        {%f57, ..., %f64}, {%r47, %r48, %r49, %r50}, {%r55, %r56, %r57, %r58}, {%f49, ..., %f56};
    cp.async.wait_group 1;
    bar.sync 0;
    wmma.load.a.sync.aligned.row.m16n16k8.shared.tf32 {%r59, %r60, %r61, %r62}, [%rd107+14336], %r6;
    wmma.load.a.sync.aligned.row.m16n16k8.shared.tf32 {%r63, %r64, %r65, %r66}, [%rd107+14368], %r6;
    wmma.load.b.sync.aligned.row.m16n16k8.shared.tf32 {%r67, %r68, %r69, %r70}, [%rd109+4608], %r15;
    wmma.load.b.sync.aligned.row.m16n16k8.shared.tf32 {%r71, %r72, %r73, %r74}, [%rd109+5760], %r15;
    wmma.mma.sync.aligned.row.row.m16n16k8.f32.tf32.tf32.f32
        {%f65, ..., %f72}, {%r59, %r60, %r61, %r62}, {%r67, %r68, %r69, %r70}, {%f57, ..., %f64};
    wmma.mma.sync.aligned.row.row.m16n16k8.f32.tf32.tf32.f32
        {%f73, ..., %f80}, {%r63, %r64, %r65, %r66}, {%r71, %r72, %r73, %r74}, {%f65, ..., %f72};
    cp.async.wait_group 0;
    bar.sync 0;
Additional Targets
- The existing pipeline can be reused to target other GPUs that have tensor-core-equivalent functionality.
- Adjust the lowering from the Linalg to the vector dialect.
- Add an appropriate dialect and lowering for the custom accelerator.

Targeting custom accelerators
- New RISC-V based many-core architectures (such as HammerBlade).
- Create a new dialect to model the multiple processing elements (PEs) and the memory hierarchy.
- Can leverage the vector dialect or the experimental RVV dialect for vector code generation.
- Need to develop a cost model to determine how to place kernels on tiles of PEs.
- Such devices could also have special function units (tensor core equivalents) that can be leveraged during code generation for better performance.
Auto-tuning
- How do we determine optimal tile sizes? What about loop interchange? How much should we unroll each loop?
- How does this extend from a single operator to an entire neural network, and to distributed heterogeneous resources?
- Many hyperparameters and a large search space make it difficult to generate good code, and it is unclear how much performance is left on the table for a given set of hyperparameters.
- Formulate it as a search problem, imposing constraints to reduce the search space (a sketch of a simple tile-size search follows below).
- A variety of search algorithms can be used, ranging from reinforcement learning (RL) to genetic algorithms.

Auto-tuning
- Compiler Gym is an RL framework that can be used for compiler optimization tasks; gradient-free methods (Nevergrad) have also been used to tune performance.
- The current set of tuning knobs covers the parameters above: tile sizes, loop interchange and unroll factors.
- Beyond the operator level, we can also search for how to best partition tensors for distributed computation (for both inference and training). Approaches such as Alpa provide a framework for partitioning tensors across clusters of heterogeneous resources.
- Training adds further variables, such as checkpointing.
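As an illustration of the search formulation, here is a C++ sketch of a brute-force tile-size search; benchmark_matmul() is a hypothetical stand-in for compiling and timing a candidate configuration, and real tuners such as Compiler Gym or Nevergrad replace the exhaustive loops with a smarter search:

    #include <vector>
    #include <limits>
    #include <cstdio>

    // Placeholder: compile the kernel with the given tile sizes, run it, return the time in ms.
    double benchmark_matmul(int tile_m, int tile_n, int tile_k) { return 0.0; }

    struct Config { int m, n, k; double ms; };

    Config search_tile_sizes() {
      const std::vector<int> candidates = {4, 8, 16, 32, 64};   // constrained search space
      Config best{0, 0, 0, std::numeric_limits<double>::infinity()};
      for (int m : candidates)
        for (int n : candidates)
          for (int k : candidates) {
            double ms = benchmark_matmul(m, n, k);
            if (ms < best.ms) best = {m, n, k, ms};
          }
      std::printf("best tiles: %dx%dx%d (%.3f ms)\n", best.m, best.n, best.k, best.ms);
      return best;
    }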
Conclusion
- MLIR code generation focuses on taking high-level tensor computation primitives and lowering them to LLVM IR with the appropriate intrinsics.
- It attempts to take the guesswork out of the backend, reducing the dependence on black-box optimizers such as the LLVM auto-vectorizer, while still leveraging LLVM for traditional code generation.
- Many abstractions are shared between the CPU and GPU compilation pipelines (and potentially other new accelerators); tiling and vectorization are key components of both.
- Additional work is required to manage shared memory and target tensor cores on the GPU.
- Auto-tuning is essential to obtaining good performance from the generated kernels.
- The approach can be extended to handle sparse tensors and non-structured ops (linalg.ext).

Acknowledgements
- Nod.ai Team (Discord: https://discord.gg/RUqY2h2s9u)
- Google IREE Team (Discord: https://discord.gg/26P4xW4)
References
- Vasilache, N. et al. (2022). Composable and Modular Code Generation in MLIR. https://arxiv.org/pdf/2202.03293.pdf
- Hsin-I, C. L. et al. (2022). TinyIREE: An ML Execution Environment for Embedded Systems from Compilation to Deployment. https://arxiv.org/pdf/2205.14479.pdf
- Bradbury, A. LLVM backend development by example (RISC-V). https://youtu.be/AFaIP-dF-RA
- Codegen Dialect Overview. https://discourse.llvm.org/t/codegen-dialect-overview/2723
- Dawkins, Q. Updated MLIR Dialect Overview Diagram. https://discourse.llvm.org/t/rfc-updated-mlir-dialect-overview-diagram/64266
- Linalg Dialect Rationale: The Case for Compiler-Friendly Custom Operations. https://mlir.llvm.org/docs/Rationale/RationaleLinalgDialect/
- Anatomy of linalg.generic. https://youtu.be/A805W2KSCxQ
- Loop double-buffering/multi-buffering. https://discourse.llvm.org/t/loop-double-buffering-multi-buffering/59979
- CUTLASS: CUDA Template Library for dense linear algebra at all levels and scales. https://on-
- A Novel Hilbert Curve for Cache-locality Preserving Loops. https://eprints.cs.univie.ac.at/5726/1/loops.pdf
- Jordans, R. et al. (2015). High-level software pipelining in LLVM.
- HammerBlade Manycore Technical Reference Manual.
- RFC: Add RISC-V Vector Extension (RVV) Dialect. https://discourse.llvm.org/t/rfc-add-risc-v-vector-extension-rvv-dialect/4146
- Cummins, C. et al. (2021). CompilerGym: Robust, Performant Compiler Optimization Environments for AI Research. https://arxiv.org/pdf/2109.08267.pdf
- Zheng, L. et al. (2022). Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. https://arxiv.org/pdf/2201.12023.pdf
- Images on the Motivation slide taken from the OpenAI, NVIDIA, AMD, Sambanova and Cerebras websites.