NVIDIA DEEP DIVE INTO NSIGHT SYSTEMS & NSIGHT COMPUTE
Bing Liu, 2020/12

AGENDA
- Overview of Profilers
- Nsight Systems
- Nsight Compute
- Case Studies
- Summary

NSIGHT PRODUCT FAMILY
- Start here: Nsight Systems, for comprehensive workload-level performance
- Dive into graphics frames: Nsight Graphics
- Dive into CUDA kernels: Nsight Compute, for detailed CUDA kernel performance
- Recheck the overall workload after each optimization; finished if performance is satisfactory

NSIGHT SYSTEMS: Overview
System-wide application algorithm tuning:
- Focus on the application's algorithm: a unique perspective
- Locate optimization opportunities: see gaps of unused CPU and GPU time
- Balance your workload across multiple CPUs and GPUs
- CPU algorithms, utilization, and thread state
- GPU streams, kernels, memory transfers, etc.
Support for Linux & Windows, x86-64 & Tegra; host only for Mac.

NSIGHT SYSTEMS: Key Features
- Compute: CUDA API, kernel launch and execution correlation; libraries: cuBLAS, cuDNN, TensorRT, OpenACC
- Graphics: Vulkan, OpenGL, DX11, DX12, DXR, V-sync
- OS: thread state and CPU utilization, pthread, file I/O, etc.
- User annotations API (NVTX)

[Timeline screenshot: processes and threads, thread/core migration, thread state, CUDA and OpenGL API trace, cuDNN and cuBLAS trace, kernel and memory-transfer activities, multi-GPU]

CPU THREADS: Thread Activities
Get an overview of each thread's activities:
- Which core the thread is running on and its utilization
- CPU state and transitions
- OS runtime library usage: pthread, file I/O, etc.
- API usage: CUDA, cuDNN, cuBLAS, TensorRT
[Screenshot: OS runtime, CUDA API, cuDNN, and cuBLAS rows, e.g. cuEventSynchronize calls]

OS RUNTIME LIBRARIES
- Identify time periods where threads are blocked, and the reason
- Locate potentially redundant synchronizations
[Screenshot: sem_wait, fgetc, and a pthread_cond_wait call of 1.747 ms on the OS runtime rows]

OS RUNTIME LIBRARIES
- Backtrace for time-consuming calls to OS runtime libraries
[Screenshot: call stack for a blocking cudaStreamSynchronize inside the NVTX range MPD_SYNCHRONIZE]

CUDA API
Trace CUDA API calls on OS threads:
- See when kernels are dispatched
- See when memory operations are initiated
- Locate the corresponding CUDA workload on the GPU
[Screenshot: pthread_mutex_lock, sem_wait, cuEventSynchronize, cudaMalloc, and the resulting implicit convolution kernel]

GPU WORKLOAD
- See CUDA workload execution times
- Locate idle GPU time
[Screenshot: Tesla V100-SXM2 stream with cuDNN kernels]

GPU WORKLOAD
- See a trace of GPU activity
- % chart for the average CUDA kernel coverage (not SM occupancy)
- % chart for the average number of memory operations
- Locate idle GPU time
[Screenshot: Tesla V100-SXM2-16GB utilization charts]

NVTX INSTRUMENTATION
- NVIDIA Tools Extension (NVTX) annotates the timeline with the application's logic
- Helps understand the profiler's output in the app's algorithmic context
[Screenshot: NVTX ranges such as "Batch 3" and "Forward pass" above the CUDA API row]

NVTX INSTRUMENTATION: Usage
- Include the header "nvToolsExt.h"
- Call the API functions from your source
- Link the NVTX library on the compiler command line with -lnvToolsExt
- Also supports Python:
  CuPy: https://docs.cupy.dev/en/v9.0.0a1/reference/cuda.html
  TF profiler (the URL is truncated in the source)

NVTX INSTRUMENTATION: Example

    #include "nvToolsExt.h"

    void myfunction(int n, double *x)
    {
        nvtxRangePushA("init_host_data");
        // initialize x on host
        init_host_data(n, x, x_d, y_d);
        nvtxRangePop();
    }

[Screenshot: the "init host data" range appears on the timeline, 10.998 ms]
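Building and profiling such an annotated app looks roughly like this (file and report names are placeholders, not from the deck):

    nvcc -o app app.cu -lnvToolsExt
    nsys profile -o report ./app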
NSIGHT COMPUTE: Key Features
- API Stream
- GPU SOL section
- Memory workload analysis section
[Overview screenshot]

KEY FEATURES: API Stream
Interactive profiling with API Stream:
- Run to the next (CUDA) kernel
- Run to the next (CUDA) API call
- Run to the next range start
- Run to the next range stop
- Next Trigger: a filter of API calls and kernels; "foo" runs to the next kernel launch/API call matching the regular expression "foo"
[Screenshot: API Stream window with cuInit and cuDriverGetVersion returning CUDA_SUCCESS]

KEY FEATURES: Sections
- An event is a countable activity, action, or occurrence on a device
- A metric is a characteristic of an application that is calculated from one or more event values, e.g. gld128*16 + gld64*8 + gld32*4 + gld16*2 + ...
- A section is a group of related metrics that helps developers find optimization opportunities quickly

KEY FEATURES: Metrics
Metrics are much different from nvprof's and are more closely tied to the hardware. (The reference URL is truncated in the source.)

SOL SECTION: Sections
High-level overview of the utilization of the compute and memory resources of the GPU. For each unit, the Speed Of Light (SOL) reports the achieved percentage of utilization with respect to the theoretical maximum. The deck walks through three cases:
- Case 1: Compute bound - SM% is high (about 90% in the example) while Memory% is lower
- Case 2: Memory bound - Memory% is high while SM% is lower
- Case 3: Latency bound - both SM% and Memory% are low
[Screenshots: "GPU Utilization" bar charts of SM% and Memory% (Speed of Light %) for each case]
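These sections can also be collected from the command line; for example (the section identifiers are taken from recent Nsight Compute releases, and the app name is a placeholder):

    ncu --section SpeedOfLight --section MemoryWorkloadAnalysis ./app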
SOL SECTION: Sections
[Screenshot: SOL SM and SOL Memory breakdown charts, throughput % per sub-metric]

SOL SECTION: Unit details
- Device: (main) memory, where the GPU's global and local memory resides
- FBPA: the FrameBuffer Partition is a memory controller which sits between the level 2 cache (LTC) and the DRAM; the number of FBPAs varies across GPUs
- GPC: the General Processing Cluster contains SM, Texture and L1 in the form of TPC(s); it is replicated several times across a chip
- GPU: the entire Graphics Processing Unit
- GR: the Graphics Engine is responsible for all 2D and 3D graphics work, compute work, and synchronous graphics copying work
- IDC: the InDexed Constant cache is a subunit of the SM responsible for caching constants that are indexed with a register
- L1TEX: the Level 1 (L1)/Texture Cache is located within the GPC; it can be used as directed-mapped shared memory and/or store global, local and texture data in its cache portion
- LTS: a Level 2 (L2) Cache Slice is a sub-partition of the Level 2 cache
- SMSP: a sub-partition of the SM
- SYS: logical grouping of several units
COMPUTE WORKLOAD ANALYSIS: Compute Pipelines Details
[Screenshot: per-pipeline utilization, e.g. the FMA Heavy and FMA Lite physical pipelines]

COMPUTE WORKLOAD ANALYSIS: Sections
[Screenshot]

SCHEDULER STATISTICS: Sections
Scheduler Statistics (case 2):
[Screenshot: Active Warps Per Scheduler [warp], Eligible Warps Per Scheduler [warp], Issued Warp Per Scheduler, Issue Slot Utilization, No or One or More Eligible %, plus recommendations]

WARP STATE STATISTICS: Sections
Warp State Statistics (case 2). The major reasons that cause stalls:
- an instruction fetch
- a memory dependency (result of a memory instruction)
- an execution dependency (result of a previous instruction)
- a busy pipeline
- a synchronization barrier

WARP SCHEDULER: Volta Architecture
- 4 warp schedulers per SM
- Each scheduler manages a pool of warps (Register File: 16,384 x 32-bit)
- Volta: 16 warp slots; Turing: 8 warp slots
- Each scheduler can issue 1 warp per cycle
- Offers a simplified mental model for profiling and SM metrics
[Diagram: Volta SM partition with scheduler, register file, and SFU units]
WARP SCHEDULER: Mental Model for Profiling
Warp states per warp slot: Unused, or Active, where an active warp is either Stalled or Eligible, and one eligible warp per cycle becomes Selected. The deck animates one scheduler over four cycles:
- Each cycle: out of all eligible warps, the scheduler selects one to issue on that cycle
- Cycle N+1: the warp selected in cycle N is not eligible in N+1, e.g. for instructions with longer instruction latencies
- Cycle N+2: a warp exits; with no eligible warp, the issue slot goes unused
- Cycle N+3: new warps are scheduled into the free slots
Aggregated over the four cycles, the example yields:
- cycles_active = 4, warps_active = 20
- warps_active / cycles_active = 5; achieved_occupancy = 62.5%
- warps_stalled = 15, warps_eligible = 5
- warps_issued = 3; warps_issued / cycles_active = 0.75; issue_slot_utilization = 75%
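The two percentages follow from the warp-slot count per scheduler (8 slots, the Turing figure quoted above; treating the example as a Turing-style scheduler is an assumption):

    achieved_occupancy     = (warps_active / cycles_active) / warp_slots = (20 / 4) / 8 = 62.5%
    issue_slot_utilization = warps_issued / cycles_active = 3 / 4 = 75%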
WARP STATE STATISTICS: Sections
Warp State Statistics (case 2): analysis of the states in which all warps spent their cycles during kernel execution.
[Screenshot: Warp Cycles Per Issued Instruction 111.19; Avg. Active Threads Per Warp; Not Predicated Off Threads Per Warp; "Warp State (All Cycles)" breakdown listing Stall Drain, Stall IMC Miss, Stall Long Scoreboard, Stall Barrier, Stall Membar, Stall Short Scoreboard, Stall Sleeping, Stall Wait, Stall Math Pipe Throttle, Stall Tex Throttle, ...]

MEMORY WORKLOAD ANALYSIS: Sections
Detailed analysis of the memory resources of the GPU. Memory can become a limiting factor for the overall kernel performance when fully utilizing the involved hardware units (Mem Busy), exhausting the available communication bandwidth between those units (Max Bandwidth), or reaching the maximum throughput of issuing memory instructions (Mem Pipes Busy). Depending on the limiting factor, the memory chart and tables allow one to identify the exact bottleneck in the memory system.
[Screenshot: Memory Throughput in bytes/second; L1 Hit Rate 34.75%; Mem Pipes Busy 2.32%]

MEMORY WORKLOAD ANALYSIS: Memory Hierarchy
[Memory chart: request counts, bytes transferred, and instruction counts between hierarchy levels; for Ampere, also LDGSTS and L2 Compression]

MEMORY WORKLOAD ANALYSIS: Memory Report
[Screenshot: detailed memory tables]
38、 the exact bottleneck in the memory system.nalysis5.4usysaoryThroughputbyte/secodL1 Hit Rate s34.75 amPipesusy2.32ZnVDIA#page#MEMORY WORKLOAD ANALYSISMemory Hierarchyrequest numberbytestransferedinstructions numberFor Ampere LDGSTS and L2 Compressior#page#MEMORY WORKLOAD ANALYSISMemory Report304#pag
39、e#page#SOURCE CONTERS STATISTICS店Indicates inefficient memory access instructions,including shared memory and global memory access.#page#SOURCEPAGEUnderstand the code generated by NVCCAdd -lineinfo to NVCC FlagCUDAC/C+PTXSASSCUDAC/C+PTXSASS#page#UNDERSTAND STALL REASONSectionsWarp State Statistics (
40、case 2)major reasons cause stall:an instruction fetch,a memory dependency (result of memory instruction)an execution dependency (result of previous instruction),a pipeline is busya synchronization barriernVIOL#page#UNDERSTAND STALL REASONTypical Stall ReasonCommon Stall Reasons:個 Long ScoreboardL1Te
41、x (Global, Local, Suface, Tex) result dependency.個Short ScoreboardShared memory result dependency or frequent MUFU or Dynamicbranching個LG ThrottleWaiting for the L1 instruction queue for local and global (LG)memory operations to be not full. Stall occurs only when executing local or global memoryins
42、tructions extremely frequently.2MIO ThrottleWas stalled waiting for the MIO (memory input/output) instructionqueue to be not full. Stall occurs when executing LDS,MUFU or Dynamic Branching extremelyfrequently2Math Pipe ThrottleWas stalled waiting for the execution pipe to be available.nVDIA#page#UND
43、ERSTAND STALL REASONLong ScoreboardLong score board - global memory latencyneidlaneid;ramptrlane(1sDelaneidint idxB1 Too many & frequent global accessint tid-threadfdx.xsintoffset=tid*Q39Tfor(inti=i Too many & frequent global accessUncoalesced global memoryaccess caused a large numberof requests to
44、block the queue.#page#UNDERSTAND STALL REASONLG ThrottleCoalesced global memory access reduce a largenumber of requests.globalvoid stall_reason_lg_worse(int8_t*dramptr,int8tdramptr2)int tid =threadrdx.xsint total_thread =1024/blockDim.xfor(inti=0;2000;1+)dramptr2itotal_thread + tida dramptri * total
45、_thread + tid#page#UNDERSTAND STALL REASONLG Throttle#page#UNDERSTAND STALL REASONLG ThrottlethreadIdx.xntintinrol40+dramptr2itotal_threadttid-threadIdx.xintinttotal_thread =1024/blockDim.xiintptr=(int*)dramptr;intptr2=(int*)dramptr2;unrol1for(int i=0;1500;i+)ptr2iglobalvoidstall_reason_lg_best(ints
46、_t*dramptr,int8_t*dramptr2)int tid = threadrdx.xiint4*ptr-(int4*)dramptr;int4*ptr2=(int4*)dramptr2unrol1for(int ii shared memory latencyBanConflictsnHIUCHOTharedload1280.0001.280,00049.56sharedint smm320.05haredstore1.2801.280int tid-threadIdx.xsaredomcintlaneid=tid%32;1281280128128049.61Totalsmmfla
47、neid= laneid;_syncthreads();int idx =laneid;unrol1tpragmafor(inti=e;iInstructions s total threads / WARP SZ * instructions num1+idx=smmidx;dramptridx=idx;void stall_reason_ssb_launcher(int*dramptr)tstall_reason_ssb(dramptr);#page#UNDERSTAND STALL REASONShort ScoreboardShort scoreboard - shared memor
48、y latency#page#UNDERSTAND STALL REASONMIO ThrottleMIO Throttle - extreme utilization of the MIO pipelines. Egs: shared memory, special mathsharedint smm3232sharedint smm3232;5mm23232sharedint5mm23232int tid-threadIdx.xiint tid =threadidx.xintlaneid=tid%32;laneid=tid%32;unrol1unrol1for(inti=0;1_32;i+
49、)for(int1.=;i.32;i+sfprauetluas= pauetzuus.prauetlJums = prauetllzums_syncthreads()_syncthreads()void stall reason mio launcher(int* dramptr)fstall_reason_mio_bad(dramptr)stallmio_good(dramptr);#page#UNDERSTAND STALL REASONMIO ThrottleMIO Bad Func13172路16.6131872012.76067Requests = 32(bank conflict)
50、 * instructionsRVICIA#page#UNDERSTAND STALL REASONMIO ThrottleMIO Good Func1.9231.92Requests = 1(no bank confliict) * instructions.#page#CASE STUDY 1:NSYS USAGE#page#FASTER TRANSFORMERndnoTake the encoder as an exampleProbabilitiesaddbias to0/K/Vtranspose4-Dtensor0/K/VO/K/VinputXweighto/k/ybatch str
51、ide GEMMOXK(GEMMO,1.2)(GEMM3)FeedForwaroOKmulti-headattentionattention=softmaydMulti-HeadFeedAttentionGEMM5batch strideGEMMatentionXVFonwardN(GEMM4)Nxaddbias.layernormalizationAdd8NorMaskedtranspose4-DtensorMulti-HeadMulti-HeaoAttentionAttantionGEMM6PositionalPositionaC40EncodingEncodingaddbias.acti
52、vation(GeLU)OutputInputGEMM7sndnoInputs(shiftedright)Transformer architecture 1addbias.layernormalizationnVIOIA#page#OPTIMIZE FASTER TRANSFORMERWhere is the bottleneck for EncoderzGPU is idle in many timeReason: kernels are too small - kernel Launch bound113.WIXM3.M.MCUDA(TeslaV100-PCIE-32GB¥99.6%St
53、ream2095.5%Kernels323%volt.5gemm32229.6%EigenMetakemeTensorFlow encoder no XLA- one transformer layer- 50 kernelsbatch size1,12 heads,size per head64,FP32#page#OPTIMIZE FASTER TRANSFORMERWhere is the bottleneck for Encoder?GPU is idle in many timeReason: kernels are too smallE.g., using 11 kernels (
54、mean, add,) to compute the LayerNormNVTXCUDAAPNTXMean:laCUDATeslaV100-PCIE-32GB99.6%Stream2095.5%Kernelsm32x32.s福29.6%ETensorFlow LayerNorm no XLA-11 kernelsnVDIAbatch size1,12heads,sizeperhead64,FPP32#page#OPTIMIZE FASTER TRANSFORMERWhere is the bottleneck for Encoder?A simple solution: Using Tenso
55、rFlow XLA to fuse kernel automaticallyBecome better, but still many idle fraction+837.7m10sNVTXCUDATeslaV100-PCIE-32GB99.6%Stream20949%Kernels41.9%woagmm.3232128x64ATensorFlow encoder with XLA-one transformer layer-24 kernelsbatch size1,12heads,size perhead64,FP32#page#OPTIMIZE FASTER TRANSFORMERFus
56、ed EncoderTimeline after fusion7477.55m+477.65msWTX3edt3CUDATeslav100-PCIE-32GB98.4%Stream20su99963.4%MemoryNVTXBertTransforAsyncRangesFasterTransformer encoder op- one transformer layer-14 kernelsbatch size1,12heads,sizeperhead64,FP3#page#OPTIMIZE FASTER TRANSFORMERFused EncoderTimeline after fusio
57、nThe pure cpp APIcan provide better performanceOSruntielibwarisCUDAAPIadd_QkvbiasIvoita_sgemm3.cudaMemsetAsyncad.volta_sgemm32x32.v.volta.sgemm305ProfleroverheadtheshiddenCUDA(TeslaV100-PCIE-32GB99.9%stream1499.4%Kernels64x32 slica.volta_s64x325.olta_s0.6%MemoryFasterTransformer encoder cpp- one tra
58、nsformer layer-14 kernelsbatch size1,12heads,sizeperhead64,FP32#page#CASE STUDY 2:MATRIX TRANSPOSITION#page#MATRIX TRANSPOSITIONm= 8192 n = 4096.Some theoretical metricsaW8ZL=98ZLZ+EL=*960*Z618=pealsakq1eototal byteswrite=8192*4096*4=134,217,728B=128MBtotalreadtransactions(32B)=134,217,728/32=4,194,
59、304totalwritetransactions(32B)=134,217,728/32=4,194,304#page#MATRIX TRANSPOSITIONNaive ImplementationY/ m the number of rows of input matrix1!n the number of cols of input matrixglobalvoid transposeNative(float *input, float *output, int m, int n)Fint colID input = threadTdx.x + blockDim.x*blockIdx.
60、x;int rowID input = threadIdxy + blockDimy*blockIdxy;if(rowID inputm & ColIDinputL2 Requests(32B)L2-Tex Returns(32B)globalload4,194,394global store33.554,432time(us)189033,554,432/4,194,304=8,Utilization12.5%JnvDL#page#OPTIMIZATION WITH SHARED MEMORYLoad Data to Shared MemoryB(0,0)B(0,1)B(0,0)B(0,1)
61、B(1,0)B(1,1)B(1,0)B(1,1)B(2,0)B(2,1)B(2,0)B(2,1)nVDI#page#OPTIMIZATION WITH SHARED MEMORYLocal Transposition in Shared MemoryShared MemoryShared MemoryB(0,1)B(0,0)BB(1,0)B(1,1)B0Bd61)B(2,0)B(2,1)BnVIOL#page#OPTIMIZATIONWITH SHAREDMEMORYBlock Transposition When Writing to Global MemoryShared MemoryGl
62、obal Memory60BB10B0dst col= threadldx.x + blockDim-y*blockldxyidst_row = threadldx.y + blockDim.x*blockldx.x85AnVDIA#page#MATRIX TRANSPOSITIONOptimized Implementationglobalvoid transposeOptimized(float *input,float *output,int m, int n)tint colID input = threadTdx.x +blockDim.x*blockIdx.x;int rowID
63、input = threadIdx.y+ blockDimy*blockIdx.yfloat sdatat321331;sharedif(rowID input m & colID input L2 Requests(32B)L2-Tex Returns(32B)globalload4,194,3941global store4,194,3940.00128.00M8525time(us)128.00ME128.29ME#page#SUMMARYNsight Systems is a system-level profilerNsight Compute is for kernel profi
64、ling toolBasic knowledge of CUDA programming and GPU architecture is needed for proflingEncourage developers to use Nsight Systems 8 Nsight Compute instead of NVVP 8 nvprofUse profiler tools whenever possible to locate the optimization opportunities to avoidpremature optimizationUse top-down approachs no need to jumpp directly into SASS code#page#NVIDIA#page#