NVIDIA DEEP DIVE INTO NSIGHT SYSTEMS & NSIGHT COMPUTE
Bing Liu, 2020/12

AGENDA
- Overview of Profilers
- Nsight Systems
- Nsight Compute
- Case Studies
- Summary

NSIGHT PRODUCT FAMILY
- Start here: Nsight Systems, for comprehensive workload-level performance
- Dive into graphics frames: Nsight Graphics
- Dive into CUDA kernels: Nsight Compute, for detailed CUDA kernel performance
- Recheck the overall workload after each optimization; finished if performance is satisfactory

NSIGHT SYSTEMS: Overview
System-wide application algorithm tuning:
- Focus on the application's algorithm: a unique perspective
- Locate optimization opportunities: see gaps of unused CPU and GPU time
- Balance your workload across multiple CPUs and GPUs
- CPU algorithms, utilization, and thread state
- GPU streams, kernels, memory transfers, etc.
Support for Linux & Windows, x86-64 & Tegra; host only for Mac.

NSIGHT SYSTEMS: Key Features
- Compute: CUDA API, kernel launch and execution correlation; libraries: cuBLAS, cuDNN, TensorRT, OpenACC
- Graphics: Vulkan, OpenGL, DX11, DX12, DXR, V-sync
- OS: thread state and CPU utilization, pthread, file I/O, etc.
- User annotations API (NVTX)

[Timeline screenshot: processes and threads, thread/core migration, thread state, CUDA and OpenGL API trace, cuDNN and cuBLAS trace, kernel and memory-transfer activities, multi-GPU]

CPU THREADS: Thread Activities
Get an overview of each thread's activities:
- Which core the thread is running on and its utilization
- CPU state and transitions
- OS runtime library usage: pthread, file I/O, etc.
- API usage: CUDA, cuDNN, cuBLAS, TensorRT
[Screenshot: OS runtime, CUDA API, cuDNN, and cuBLAS rows, e.g. cuEventSynchronize calls]

OS RUNTIME LIBRARIES
- Identify time periods where threads are blocked, and the reason
- Locate potentially redundant synchronizations
[Screenshot: sem_wait, fgetc, and a pthread_cond_wait call of 1.747 ms on the OS runtime rows]

OS RUNTIME LIBRARIES
- Backtrace for time-consuming calls to OS runtime libraries
[Screenshot: call stack for a blocking cudaStreamSynchronize inside the NVTX range MPD_SYNCHRONIZE]

CUDA API
Trace CUDA API calls on OS threads:
- See when kernels are dispatched
- See when memory operations are initiated
- Locate the corresponding CUDA workload on the GPU
[Screenshot: pthread_mutex_lock, sem_wait, cuEventSynchronize, cudaMalloc, and the resulting implicit convolution kernel]

GPU WORKLOAD
- See CUDA workload execution times
- Locate idle GPU time
[Screenshot: Tesla V100-SXM2 stream with cuDNN kernels]

GPU WORKLOAD
- See a trace of GPU activity
- % chart for the average CUDA kernel coverage (not SM occupancy)
- % chart for the average number of memory operations
- Locate idle GPU time
[Screenshot: Tesla V100-SXM2-16GB utilization charts]

NVTX INSTRUMENTATION
- NVIDIA Tools Extension (NVTX) annotates the timeline with the application's logic
- Helps understand the profiler's output in the app's algorithmic context
[Screenshot: NVTX ranges such as "Batch 3" and "Forward pass" above the CUDA API row]

NVTX INSTRUMENTATION: Usage
- Include the header "nvToolsExt.h"
- Call the API functions from your source
- Link the NVTX library on the compiler command line with -lnvToolsExt
- Also supports Python:
  CuPy: https://docs.cupy.dev/en/v9.0.0a1/reference/cuda.html
  TF profiler (the URL is truncated in the source)

NVTX INSTRUMENTATION: Example

    #include "nvToolsExt.h"

    void myfunction(int n, double *x)
    {
        nvtxRangePushA("init_host_data");
        // initialize x on host
        init_host_data(n, x, x_d, y_d);
        nvtxRangePop();
    }

[Screenshot: the "init host data" range appears on the timeline, 10.998 ms]
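Building and profiling such an annotated app looks roughly like this (file and report names are placeholders, not from the deck):

    nvcc -o app app.cu -lnvToolsExt
    nsys profile -o report ./app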
NSIGHT COMPUTE: Key Features
- API Stream
- GPU SOL section
- Memory workload analysis section
[Overview screenshot]

KEY FEATURES: API Stream
Interactive profiling with API Stream:
- Run to the next (CUDA) kernel
- Run to the next (CUDA) API call
- Run to the next range start
- Run to the next range stop
- Next Trigger: a filter of API calls and kernels; "foo" runs to the next kernel launch/API call matching the regular expression "foo"
[Screenshot: API Stream window with cuInit and cuDriverGetVersion returning CUDA_SUCCESS]

KEY FEATURES: Sections
- An event is a countable activity, action, or occurrence on a device
- A metric is a characteristic of an application that is calculated from one or more event values, e.g. gld128*16 + gld64*8 + gld32*4 + gld16*2 + ...
- A section is a group of related metrics that helps developers find optimization opportunities quickly

KEY FEATURES: Metrics
Metrics are much different from nvprof's and are more closely tied to the hardware. (The reference URL is truncated in the source.)

SOL SECTION: Sections
High-level overview of the utilization of the compute and memory resources of the GPU. For each unit, the Speed Of Light (SOL) reports the achieved percentage of utilization with respect to the theoretical maximum. The deck walks through three cases:
- Case 1: Compute bound - SM% is high (about 90% in the example) while Memory% is lower
- Case 2: Memory bound - Memory% is high while SM% is lower
- Case 3: Latency bound - both SM% and Memory% are low
[Screenshots: "GPU Utilization" bar charts of SM% and Memory% (Speed of Light %) for each case]
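These sections can also be collected from the command line; for example (the section identifiers are taken from recent Nsight Compute releases, and the app name is a placeholder):

    ncu --section SpeedOfLight --section MemoryWorkloadAnalysis ./app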
SOL SECTION: Sections
[Screenshot: SOL SM and SOL Memory breakdown charts, throughput % per sub-metric]

SOL SECTION: Unit details
- Device: (main) memory, where the GPU's global and local memory resides
- FBPA: the FrameBuffer Partition is a memory controller which sits between the level 2 cache (LTC) and the DRAM; the number of FBPAs varies across GPUs
- GPC: the General Processing Cluster contains SM, Texture and L1 in the form of TPC(s); it is replicated several times across a chip
- GPU: the entire Graphics Processing Unit
- GR: the Graphics Engine is responsible for all 2D and 3D graphics work, compute work, and synchronous graphics copying work
- IDC: the InDexed Constant cache is a subunit of the SM responsible for caching constants that are indexed with a register
- L1TEX: the Level 1 (L1)/Texture Cache is located within the GPC; it can be used as directed-mapped shared memory and/or store global, local and texture data in its cache portion
- LTS: a Level 2 (L2) Cache Slice is a sub-partition of the Level 2 cache
- SMSP: a sub-partition of the SM
- SYS: logical grouping of several units
COMPUTE WORKLOAD ANALYSIS: Compute Pipelines Details
[Screenshot: per-pipeline utilization, e.g. the FMA Heavy and FMA Lite physical pipelines]

COMPUTE WORKLOAD ANALYSIS: Sections
[Screenshot]

SCHEDULER STATISTICS: Sections
Scheduler Statistics (case 2):
[Screenshot: Active Warps Per Scheduler [warp], Eligible Warps Per Scheduler [warp], Issued Warp Per Scheduler, Issue Slot Utilization, No or One or More Eligible %, plus recommendations]

WARP STATE STATISTICS: Sections
Warp State Statistics (case 2). The major reasons that cause stalls:
- an instruction fetch
- a memory dependency (result of a memory instruction)
- an execution dependency (result of a previous instruction)
- a busy pipeline
- a synchronization barrier

WARP SCHEDULER: Volta Architecture
- 4 warp schedulers per SM
- Each scheduler manages a pool of warps (Register File: 16,384 x 32-bit)
- Volta: 16 warp slots; Turing: 8 warp slots
- Each scheduler can issue 1 warp per cycle
- Offers a simplified mental model for profiling and SM metrics
[Diagram: Volta SM partition with scheduler, register file, and SFU units]
WARP SCHEDULER: Mental Model for Profiling
Warp states per warp slot: Unused, or Active, where an active warp is either Stalled or Eligible, and one eligible warp per cycle becomes Selected. The deck animates one scheduler over four cycles:
- Each cycle: out of all eligible warps, the scheduler selects one to issue on that cycle
- Cycle N+1: the warp selected in cycle N is not eligible in N+1, e.g. for instructions with longer instruction latencies
- Cycle N+2: a warp exits; with no eligible warp, the issue slot goes unused
- Cycle N+3: new warps are scheduled into the free slots
Aggregated over the four cycles, the example yields:
- cycles_active = 4, warps_active = 20
- warps_active / cycles_active = 5; achieved_occupancy = 62.5%
- warps_stalled = 15, warps_eligible = 5
- warps_issued = 3; warps_issued / cycles_active = 0.75; issue_slot_utilization = 75%
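The two percentages follow from the warp-slot count per scheduler (8 slots, the Turing figure quoted above; treating the example as a Turing-style scheduler is an assumption):

    achieved_occupancy     = (warps_active / cycles_active) / warp_slots = (20 / 4) / 8 = 62.5%
    issue_slot_utilization = warps_issued / cycles_active = 3 / 4 = 75%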
WARP STATE STATISTICS: Sections
Warp State Statistics (case 2): analysis of the states in which all warps spent their cycles during kernel execution.
[Screenshot: Warp Cycles Per Issued Instruction 111.19; Avg. Active Threads Per Warp; Not Predicated Off Threads Per Warp; "Warp State (All Cycles)" breakdown listing Stall Drain, Stall IMC Miss, Stall Long Scoreboard, Stall Barrier, Stall Membar, Stall Short Scoreboard, Stall Sleeping, Stall Wait, Stall Math Pipe Throttle, Stall Tex Throttle, ...]

MEMORY WORKLOAD ANALYSIS: Sections
Detailed analysis of the memory resources of the GPU. Memory can become a limiting factor for the overall kernel performance when fully utilizing the involved hardware units (Mem Busy), exhausting the available communication bandwidth between those units (Max Bandwidth), or reaching the maximum throughput of issuing memory instructions (Mem Pipes Busy). Depending on the limiting factor, the memory chart and tables allow one to identify the exact bottleneck in the memory system.
[Screenshot: Memory Throughput in bytes/second; L1 Hit Rate 34.75%; Mem Pipes Busy 2.32%]

MEMORY WORKLOAD ANALYSIS: Memory Hierarchy
[Memory chart: request counts, bytes transferred, and instruction counts between hierarchy levels; for Ampere, also LDGSTS and L2 Compression]

MEMORY WORKLOAD ANALYSIS: Memory Report
[Screenshot: detailed memory tables]
38、 the exact bottleneck in the memory system.nalysis5.4usysaoryThroughputbyte/secodL1 Hit Rate s34.75 amPipesusy2.32ZnVDIA#page#MEMORY WORKLOAD ANALYSISMemory Hierarchyrequest numberbytestransferedinstructions numberFor Ampere LDGSTS and L2 Compressior#page#MEMORY WORKLOAD ANALYSISMemory Report304#pag
39、e#page#SOURCE CONTERS STATISTICS店Indicates inefficient memory access instructions,including shared memory and global memory access.#page#SOURCEPAGEUnderstand the code generated by NVCCAdd -lineinfo to NVCC FlagCUDAC/C+PTXSASSCUDAC/C+PTXSASS#page#UNDERSTAND STALL REASONSectionsWarp State Statistics (
40、case 2)major reasons cause stall:an instruction fetch,a memory dependency (result of memory instruction)an execution dependency (result of previous instruction),a pipeline is busya synchronization barriernVIOL#page#UNDERSTAND STALL REASONTypical Stall ReasonCommon Stall Reasons:個 Long ScoreboardL1Te
41、x (Global, Local, Suface, Tex) result dependency.個Short ScoreboardShared memory result dependency or frequent MUFU or Dynamicbranching個LG ThrottleWaiting for the L1 instruction queue for local and global (LG)memory operations to be not full. Stall occurs only when executing local or global memoryins
42、tructions extremely frequently.2MIO ThrottleWas stalled waiting for the MIO (memory input/output) instructionqueue to be not full. Stall occurs when executing LDS,MUFU or Dynamic Branching extremelyfrequently2Math Pipe ThrottleWas stalled waiting for the execution pipe to be available.nVDIA#page#UND
43、ERSTAND STALL REASONLong ScoreboardLong score board - global memory latencyneidlaneid;ramptrlane(1sDelaneidint idxB1 Too many & frequent global accessint tid-threadfdx.xsintoffset=tid*Q39Tfor(inti=i Too many & frequent global accessUncoalesced global memoryaccess caused a large numberof requests to
44、block the queue.#page#UNDERSTAND STALL REASONLG ThrottleCoalesced global memory access reduce a largenumber of requests.globalvoid stall_reason_lg_worse(int8_t*dramptr,int8tdramptr2)int tid =threadrdx.xsint total_thread =1024/blockDim.xfor(inti=0;2000;1+)dramptr2itotal_thread + tida dramptri * total
45、_thread + tid#page#UNDERSTAND STALL REASONLG Throttle#page#UNDERSTAND STALL REASONLG ThrottlethreadIdx.xntintinrol40+dramptr2itotal_threadttid-threadIdx.xintinttotal_thread =1024/blockDim.xiintptr=(int*)dramptr;intptr2=(int*)dramptr2;unrol1for(int i=0;1500;i+)ptr2iglobalvoidstall_reason_lg_best(ints
46、_t*dramptr,int8_t*dramptr2)int tid = threadrdx.xiint4*ptr-(int4*)dramptr;int4*ptr2=(int4*)dramptr2unrol1for(int ii shared memory latencyBanConflictsnHIUCHOTharedload1280.0001.280,00049.56sharedint smm320.05haredstore1.2801.280int tid-threadIdx.xsaredomcintlaneid=tid%32;1281280128128049.61Totalsmmfla
47、neid= laneid;_syncthreads();int idx =laneid;unrol1tpragmafor(inti=e;iInstructions s total threads / WARP SZ * instructions num1+idx=smmidx;dramptridx=idx;void stall_reason_ssb_launcher(int*dramptr)tstall_reason_ssb(dramptr);#page#UNDERSTAND STALL REASONShort ScoreboardShort scoreboard - shared memor
48、y latency#page#UNDERSTAND STALL REASONMIO ThrottleMIO Throttle - extreme utilization of the MIO pipelines. Egs: shared memory, special mathsharedint smm3232sharedint smm3232;5mm23232sharedint5mm23232int tid-threadIdx.xiint tid =threadidx.xintlaneid=tid%32;laneid=tid%32;unrol1unrol1for(inti=0;1_32;i+
49、)for(int1.=;i.32;i+sfprauetluas= pauetzuus.prauetlJums = prauetllzums_syncthreads()_syncthreads()void stall reason mio launcher(int* dramptr)fstall_reason_mio_bad(dramptr)stallmio_good(dramptr);#page#UNDERSTAND STALL REASONMIO ThrottleMIO Bad Func13172路16.6131872012.76067Requests = 32(bank conflict)
50、 * instructionsRVICIA#page#UNDERSTAND STALL REASONMIO ThrottleMIO Good Func1.9231.92Requests = 1(no bank confliict) * instructions.#page#CASE STUDY 1:NSYS USAGE#page#FASTER TRANSFORMERndnoTake the encoder as an exampleProbabilitiesaddbias to0/K/Vtranspose4-Dtensor0/K/VO/K/VinputXweighto/k/ybatch str
51、ide GEMMOXK(GEMMO,1.2)(GEMM3)FeedForwaroOKmulti-headattentionattention=softmaydMulti-HeadFeedAttentionGEMM5batch strideGEMMatentionXVFonwardN(GEMM4)Nxaddbias.layernormalizationAdd8NorMaskedtranspose4-DtensorMulti-HeadMulti-HeaoAttentionAttantionGEMM6PositionalPositionaC40EncodingEncodingaddbias.acti
52、vation(GeLU)OutputInputGEMM7sndnoInputs(shiftedright)Transformer architecture 1addbias.layernormalizationnVIOIA#page#OPTIMIZE FASTER TRANSFORMERWhere is the bottleneck for EncoderzGPU is idle in many timeReason: kernels are too small - kernel Launch bound113.WIXM3.M.MCUDA(TeslaV100-PCIE-32GB¥99.6%St
53、ream2095.5%Kernels323%volt.5gemm32229.6%EigenMetakemeTensorFlow encoder no XLA- one transformer layer- 50 kernelsbatch size1,12 heads,size per head64,FP32#page#OPTIMIZE FASTER TRANSFORMERWhere is the bottleneck for Encoder?GPU is idle in many timeReason: kernels are too smallE.g., using 11 kernels (
54、mean, add,) to compute the LayerNormNVTXCUDAAPNTXMean:laCUDATeslaV100-PCIE-32GB99.6%Stream2095.5%Kernelsm32x32.s福29.6%ETensorFlow LayerNorm no XLA-11 kernelsnVDIAbatch size1,12heads,sizeperhead64,FPP32#page#OPTIMIZE FASTER TRANSFORMERWhere is the bottleneck for Encoder?A simple solution: Using Tenso
55、rFlow XLA to fuse kernel automaticallyBecome better, but still many idle fraction+837.7m10sNVTXCUDATeslaV100-PCIE-32GB99.6%Stream20949%Kernels41.9%woagmm.3232128x64ATensorFlow encoder with XLA-one transformer layer-24 kernelsbatch size1,12heads,size perhead64,FP32#page#OPTIMIZE FASTER TRANSFORMERFus
56、ed EncoderTimeline after fusion7477.55m+477.65msWTX3edt3CUDATeslav100-PCIE-32GB98.4%Stream20su99963.4%MemoryNVTXBertTransforAsyncRangesFasterTransformer encoder op- one transformer layer-14 kernelsbatch size1,12heads,sizeperhead64,FP3#page#OPTIMIZE FASTER TRANSFORMERFused EncoderTimeline after fusio
57、nThe pure cpp APIcan provide better performanceOSruntielibwarisCUDAAPIadd_QkvbiasIvoita_sgemm3.cudaMemsetAsyncad.volta_sgemm32x32.v.volta.sgemm305ProfleroverheadtheshiddenCUDA(TeslaV100-PCIE-32GB99.9%stream1499.4%Kernels64x32 slica.volta_s64x325.olta_s0.6%MemoryFasterTransformer encoder cpp- one tra
58、nsformer layer-14 kernelsbatch size1,12heads,sizeperhead64,FP32#page#CASE STUDY 2:MATRIX TRANSPOSITION#page#MATRIX TRANSPOSITIONm= 8192 n = 4096.Some theoretical metricsaW8ZL=98ZLZ+EL=*960*Z618=pealsakq1eototal byteswrite=8192*4096*4=134,217,728B=128MBtotalreadtransactions(32B)=134,217,728/32=4,194,
59、304totalwritetransactions(32B)=134,217,728/32=4,194,304#page#MATRIX TRANSPOSITIONNaive ImplementationY/ m the number of rows of input matrix1!n the number of cols of input matrixglobalvoid transposeNative(float *input, float *output, int m, int n)Fint colID input = threadTdx.x + blockDim.x*blockIdx.
60、x;int rowID input = threadIdxy + blockDimy*blockIdxy;if(rowID inputm & ColIDinputL2 Requests(32B)L2-Tex Returns(32B)globalload4,194,394global store33.554,432time(us)189033,554,432/4,194,304=8,Utilization12.5%JnvDL#page#OPTIMIZATION WITH SHARED MEMORYLoad Data to Shared MemoryB(0,0)B(0,1)B(0,0)B(0,1)
61、B(1,0)B(1,1)B(1,0)B(1,1)B(2,0)B(2,1)B(2,0)B(2,1)nVDI#page#OPTIMIZATION WITH SHARED MEMORYLocal Transposition in Shared MemoryShared MemoryShared MemoryB(0,1)B(0,0)BB(1,0)B(1,1)B0Bd61)B(2,0)B(2,1)BnVIOL#page#OPTIMIZATIONWITH SHAREDMEMORYBlock Transposition When Writing to Global MemoryShared MemoryGl
62、obal Memory60BB10B0dst col= threadldx.x + blockDim-y*blockldxyidst_row = threadldx.y + blockDim.x*blockldx.x85AnVDIA#page#MATRIX TRANSPOSITIONOptimized Implementationglobalvoid transposeOptimized(float *input,float *output,int m, int n)tint colID input = threadTdx.x +blockDim.x*blockIdx.x;int rowID
63、input = threadIdx.y+ blockDimy*blockIdx.yfloat sdatat321331;sharedif(rowID input m & colID input L2 Requests(32B)L2-Tex Returns(32B)globalload4,194,3941global store4,194,3940.00128.00M8525time(us)128.00ME128.29ME#page#SUMMARYNsight Systems is a system-level profilerNsight Compute is for kernel profi
64、ling toolBasic knowledge of CUDA programming and GPU architecture is needed for proflingEncourage developers to use Nsight Systems 8 Nsight Compute instead of NVVP 8 nvprofUse profiler tools whenever possible to locate the optimization opportunities to avoidpremature optimizationUse top-down approachs no need to jumpp directly into SASS code#page#NVIDIA#page#