逆水寒 (Justice): DX12 and RT in Practice (96-page slide deck)
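逆水寒: Lessons from the DX12 Upgrade
俞云康, 網易雷火 (NetEase Thunder Fire)
#page#
Background
- The 逆水寒 RTX demo shown by NVIDIA in 2018
#page#
Background
- RTX ON in the 2020 anniversary version
#page#
The 11-to-12 upgrade path
- Replace DX11 at the API level
- Fill in the missing features and upgrade third-party libraries to DX12
- Optimize
#page#
DX12
- Explicit memory management
- Thinner driver layer
- Multithreading
- Console-level API
- Advanced features: Variable Shading Rate, Mesh Shader
#page#
Draw
  DX11                                  DX12
  Draw / DrawInstanced                  DrawInstanced
  DrawIndexed / DrawIndexedInstanced    DrawIndexedInstanced
  DrawAuto                              (no DrawAuto)
- In DX12 every draw is an instanced draw
#page#
PSO (Pipeline State)
- Shaders (VS, PS, DS, HS, GS)
- Blend State
- Rasterizer State
- Depth Stencil State
- Input Layout
- Primitive Topology Type
- RTV and DSV format(s)
- Stream Output
#page#
Root Signature
- DX11 resource binding:
  - CB: PSSetConstantBuffers
  - SRV: PSSetShaderResources
  - Sampler: PSSetSamplers
  - UAV: OMSetRenderTargetsAndUnorderedAccessViews
- DX12 resource binding: everything goes through the Root Signature and Descriptor Heaps
#page#
Root Signature
- A contract describing how resources are bound
[Diagram: a root signature's descriptor table with descriptor ranges CBV(3), SRV(4), UAV(2) indexes into a descriptor heap of CBV/SRV/UAV descriptors, which reference resources (ConstantBuffer, StructuredBuffer, Texture2D, RWTexture2D, Texture3D, RWStructuredBuffer) in the resource heap(s). Image source as printed on the slide: 5m/learning-directx-12-2/]
#page#
Root Signature
- Creating a PSO requires an RS (it is validated against the shaders), but at use time you still have to set the RS explicitly with SetGraphicsRootSignature / SetComputeRootSignature
- Switching the RS means rebinding all resources
- When designing an RS, balance "as small as possible" against "as general as possible"
#page#
Descriptor Heap
- A container for descriptors
- Effectively an array of views
[Diagram: same root signature / descriptor heap / resource heap figure as above.]
#page#
Descriptor Heap
- A descriptor is a view (RTV/DSV/CBV/SRV/UAV/Sampler)
- A descriptor heap is an array of views; heaps are either CPU-visible or GPU-visible
- A command list can have only one CBV-SRV-UAV descriptor heap and one sampler descriptor heap bound at a time
- Before use, descriptors must be copied into the currently bound GPU-visible heap
- A sampler descriptor heap holds at most 2048 descriptors
#page#
Memory
- DX11:
  - CreateBuffer, CreateTexture1D/2D/3D
  - D3D11_USAGE_DEFAULT, D3D11_USAGE_IMMUTABLE, D3D11_USAGE_DYNAMIC, D3D11_USAGE_STAGING
- DX12:
  - CreateCommittedResource, CreatePlacedResource (sub allocation), CreateReservedResource
  - Default Heap (GPU only), Upload Heap (CPU write, GPU read), Readback Heap
#page#
Heap and Resources
[Diagram: four panels showing how a Committed Resource, a Placed Resource, an Aliasing Placed Resource and Reserved Resources map onto heaps in GPU memory. Images from https:/.com/learning-directx]
#page#
Committed vs Placed
- NVIDIA DX12 Do's and Don'ts:
  - "Use committed resources where possible to give the driver more knowledge. This allows the driver to better manage GPU memory."
  - "A good use case for placed resources are resource heaps that are e.g. used during streaming and do hold different sets of read-only textures over their life time."
- The contents of a placed resource may not be compressed properly
- Paging of committed resources is handled by the driver; paging of placed resources is your own responsibility
#page#
Committed vs Placed
- CreateCommittedResource is slow; call it on a background thread where possible
- Resources for which paging is a serious concern, e.g. render targets (RT), should be committed
- Static textures should be committed where possible
- Placed resources suit temporary resources
- Avoid overusing placed resources; they suffer from fragmentation
#page#
Alignment
- Heap:
  - D3D12_DEFAULT_RESOURCE_PLACEMENT_ALIGNMENT (64KB)
  - D3D12_DEFAULT_MSAA_RESOURCE_PLACEMENT_ALIGNMENT (4MB)
- Resource:
  - Buffer and Texture: 64KB
  - MSAA Texture: 4MB
- Small Resource (query with ID3D12Device::GetResourceAllocationInfo):
  - Texture: D3D12_SMALL_RESOURCE_PLACEMENT_ALIGNMENT (4KB)
  - MSAA Texture: D3D12_SMALL_MSAA_RESOURCE_PLACEMENT_ALIGNMENT (64KB)
#page#
Memory
- CreateCommittedResource can be very slow
- Use placed resources (sub allocation) where it makes sense
- Different resource types have different heap-type requirements
- You need to implement something like DX11's buffer renaming yourself
- Watch out for DELETED_WHILE_STILL_IN_USE errors
#page#
Other new concepts
#page#
New concepts
- Command List & Command Queue
- Resource Barrier
- Fence
#page#
Command List
- Used to record rendering commands
- Covers the vast majority of the rendering API; the commands themselves are stored in the CommandAllocator
- Must not be used from several threads at the same time
- Can, however, be submitted for execution from a different thread
- Do not submit very small command lists frequently
- ExecuteCommandLists (ECL) has a fixed overhead, so batch submissions
- Recommended: 15-30 command lists and 5-10 ECL calls per frame
#page#
Command Queue
- The vehicle that executes command lists
- Direct / Compute / Copy
- Queues can wait on one another
- Async Compute
[Diagram: timelines for the 3D engine, Copy engine and Compute engine, with Draw, Dispatch and CopyResource work separated by Signal and Wait operations.]
#page#
Fence
- The synchronization primitive
- Signal
- GetCompletedValue
- SetEventOnCompletion
- Wait
[Diagram: same 3D / Copy / Compute engine timeline as on the Command Queue slide.]
#page#
Resource Barrier
- In DX11 the driver took care of this
- Three kinds:
  - Transition barrier
  - UAV barrier
  - Aliasing barrier
- Common state promotion and decay
- Multithreading
#page#
Barrier
[Figure: barrier illustration involving a UAV resource; the remaining text is not recoverable.]
#page#
Filling in the missing features
#page#
Filling in the missing features
- GenerateMips
- Stream Output
- Query
- UAV with counter
#page#
Stream Output
- DX11:
  - CreateGeometryShaderWithStreamOutput
  - DrawAuto
- DX12:
  - Stream output is part of the PSO
  - There is no DrawAuto; it has to be emulated with ExecuteIndirect
  - Note that BufferFilledSizeLocation receives the number of bytes written by SO, not the vertex count
  - Continuing to use SO is not recommended; emulate it with a compute shader instead
#page#
Query
- Event query (can be replaced with a fence)
- Timestamp query
  - "With stable power two timestamp queries issued from different command lists are reliably comparable. Two timestamps within the same command list are always reliably comparable."
#page#
UAV with counter
- In DX11 the driver created the counter for you
- In DX12 you manage it yourself and specify it when creating the UAV
#page#
Optimization
#page#
Optimization
- Frame rate
- Stutter
- System memory / VRAM
#page#
Frame rate optimization
- Raise your theoretical knowledge: study talks on DX12 porting and optimization
- Find bottlenecks by profiling
- Refactor the code
[Profiler screenshot of cRender12::ResourceBind(): per-line sample counts for DepthStencilStateBind(), ShaderBind() and SRVBind(); legible values include 132 (0.66%), 2649 (13.11%) and 3994 (15.36%).]
#page#
Frame rate optimization
- A job-based multithreading framework
- Move more API calls onto worker threads while ensuring thread safety
#page#
Multithreaded rendering
[Figure: text not recoverable.]
#page#
Multithreaded rendering
- In theory, 30%-40% of the rendering work is spread across threads
- Measured in-game frame rate improved by more than 30%
- Enables further multithreading-based optimizations (GPU Skinning, RTX Reflection, ...)
#page#
Stutter optimization
- CreateGraphicsPipelineState / CreateComputePipelineState
- CreateCommittedResource
#page#
CreateGraphicsPipelineState
- Fixed overhead of roughly 600 us
- Compile asynchronously
#page#
CreateCommittedResource
- Can take a very long time
  - "Unpredictable and sometimes very high duration up to hundreds of milliseconds"
- Use placed resources instead where appropriate
- Releasing resources also has a cost
- Create and release on a background thread where possible
- Reuse as much as possible
#page#
System memory / VRAM optimization: the problems
- The DX12 build used over 20% more system memory and VRAM than the DX11 build
- System memory and VRAM leaks
- Rendering resources were created and managed in the simplest possible way
#page#
Memory optimization, example 1
- Use PIX to measure VRAM and system memory usage
- Descriptor heap usage was clearly outside the normal range
- The cause was traced to an initialization value in a third-party library that was set too high
[PIX screenshot, Descriptor Heaps view (Total = 334): heaps named FrameHeap (e.g. descHeap#322, descHeap#326) of type D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV with Descriptors: 1000000 and a total size of 32,047,104 bytes each; other heaps (e.g. descHeap#327, descHeap#328) are 1,048,576 bytes.]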
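To make the descriptor-heap finding concrete, here is a minimal sketch of the arithmetic behind it. It is not the project's code: the helper names, the 32-byte descriptor increment and the 20,000-descriptor peak are assumptions for illustration. Real code would query the increment with ID3D12Device::GetDescriptorHandleIncrementSize, since the value is device-dependent. With the assumed 32-byte increment, 1,000,000 descriptors come to 32,000,000 bytes, which is close to the 32,047,104 bytes PIX reported per heap.

```cpp
#include <cassert>
#include <cstdint>

// Estimated memory taken by one descriptor heap.
// `incrementSize` is a stand-in (assumption) for the value a renderer would
// query via ID3D12Device::GetDescriptorHandleIncrementSize.
constexpr std::uint64_t EstimateDescriptorHeapBytes(std::uint64_t numDescriptors,
                                                    std::uint64_t incrementSize) {
    return numDescriptors * incrementSize;
}

// Hypothetical sizing helper: take the peak number of descriptors actually
// used in a frame and add a safety margin expressed in percent.
inline std::uint64_t RightSizedDescriptorCount(std::uint64_t peakPerFrame,
                                               std::uint64_t headroomPercent) {
    assert(headroomPercent <= 1000);  // guard against obviously wrong input
    return peakPerFrame + peakPerFrame * headroomPercent / 100;
}

// Compile-time check of the oversized configuration seen in the capture,
// under the assumed 32-byte increment.
static_assert(EstimateDescriptorHeapBytes(1000000, 32) == 32000000,
              "1,000,000 descriptors at 32 bytes each");
```

Under these assumptions, a heap sized for a measured peak of 20,000 descriptors with 50% headroom would hold 30,000 descriptors, or 960,000 bytes, instead of roughly 32 MB per heap. The point of the sketch is only that the count passed at heap creation drives the footprint linearly, so a default inherited from a third-party library is worth checking against measured usage.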
#page#
Constant Buffer optimization
- Use PIX, Microsoft's official profiling tool
[PIX screenshot, capture options panel: Start Timing Capture (Legacy), Mode (Immediate / Continuous), Capture CPU samples, Capture callstacks on context switches, Capture GPU timings, Capture GPU memory usage.]
#page#
Constant Buffer optimization
- PIX lists every rendering resource created through the D3D12 API
[PIX screenshot, Heaps view listing each heap with its size, type (Upload Heap / Default Heap), residency and paging events. The largest entries are 268,435,456-byte upload and default heaps, followed by heaps of 134,217,728, 104,857,600, 67,108,864, 33,554,432, 25,165,824 and 16,777,216 bytes.]
#page#
Constant Buffer optimization
- The approach:
  - No constant buffer used inside the engine creates an actual GPU resource; they live entirely in system memory
  - When a draw or a compute dispatch needs to bind a CB, its contents are copied into one contiguous buffer
  - This can be optimized further later (copying from the upload heap to a default heap)
#page#
Optimization summary
- Combine several analysis tools to locate problems
- Understand the DX12 design philosophy and the official recommendations thoroughly
- Tailor optimizations to the actual project
#page#
Tools
#page#
Tools
- RenderDoc
- Microsoft PIX
- NVIDIA Nsight
#page#
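RenderDoc
- The most convenient frame-capture analysis tool
- Supports DX12 shader debugging since v1.8
- Does not support RTX
#page#
PIX
- Microsoft's official tool
- CPU & GPU Timeline
- Memory Usage
- Context Switch
- Supports RTX
#page#
NVIDIA Aftermath
- GPU mini dump
- Supports DX11, DX12, DXR, Vulkan
- Easy to integrate
#page#
Debug
#page#
Debug
- Auto QA
- Debug Layer
- Device Remove
#page#
Device Remove
- The callstack in the CPU dump is meaningless
- Incorrect API usage (caught by the Debug Layer)
  - e.g. DELETED_WHILE_STILL_IN_USE
- TDR
  - Hung
  - Page Fault
#page#
WriteBufferImmediate
- Breadcrumbs / WriteBufferImmediate
    WriteMarker(TopOfPipe, 1)
    Draw(x)                        <- TDR happens here
    WriteMarker(BottomOfPipe, 2)
    WriteMarker(TopOfPipe, 3)
    Draw(y)
    WriteMarker(BottomOfPipe, 4)
- The crash is reported afterwards; the markers tell us what caused the TDR
- Source: GDC19, DirectX 12 Optimization Techniques in Capcom's RE ENGINE
#page#
DRED
- Based on WriteBufferImmediate
- Includes Page Fault Reporting
- Windows 10 version 1903
- 2-5% performance cost
#page#
NVIDIA Aftermath
[Screenshot: Aftermath crash info; text not legible.]
#page#
NVIDIA Aftermath
[Screenshot: Aftermath crash info with the active pipeline state; text not legible.]
#page#
Further reading
- Learning DirectX 12, by Jeremiah
- DX12 Do's and Don'ts
- AMD GPUOpen
- [GDC] DirectX 12 Case Studies
- [GDC] DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
- [GDC] Rendering Hitman with DirectX 12
- [GDC] Memory management in Vulkan and DX12
#page#
RT Reflections in 逆水寒 in Practice
詹定根、錢海勇、陳明輝、曹威杰、曾晗琦、林盛、程翔
網易雷火 (NetEase Thunder Fire)
#page#
RTX OFF
[Screenshot.]
#page#
RTX ON
[Screenshot.]
#page#
Hybrid rasterization rendering pipeline
- Stages, in the order listed on the slide: GPU Skinning, Build AS, GBuffer, Raytracing, Light Pass
#page#
- Material shaders
- Optimization
- Problems in practice
- Particle system
#page#
Material shaders
#page#
How do we make sure RT reflections match the rasterized result?
- 逆水寒 has nearly 200 material shaders
- About 20 developers write materials
#page#
Shader roles
- Generation Shader
  1. Calls TraceRay
  2. Computes the final color
- Closest Hit Shader (Material A, Material B, ..., Material N)
  - The payload returns the material color and blend function
- Miss Shader
  - The payload returns the termination condition
#page#
Converting a rasterization shader into a CHS
    [shader("closesthit")]
    void xxxx_ClosestHitShader(inout RayPayload payload, in MyAttributes attr)
    {
        ... // vertex index

        // Parse the vertex buffers.
        Vertex vertex0, vertex1, vertex2;
        LoadVertex(index.x, vertices, (uint)PackFlag, vertex0);
        LoadVertex(index.y, vertices, (uint)PackFlag, vertex1);
        LoadVertex(index.z, vertices, (uint)PackFlag, vertex2);
        // basic vertex buffer: (p0, p1, p2); (normal0, normal1, normal2); (uv0, uv1, uv2)
        ... // other vertex buffer

        float3 bary = float3(1.0 - attr.barycentrics.x - attr.barycentrics.y,
                             attr.barycentrics.x, attr.barycentrics.y);

        // Assemble the VS input structure.
        TA_COMMON_INPUT i;
        i.Position = Interpolate3(p0, p1, p2, bary);
        i.Tangent = Interpolate4(tangent0, tangent1, tangent2, bary);
        i.Tex = Interpolate2(uv0, uv1, uv2, bary);

        // Reuse the rasterization shaders.
        VS_OUTPUT_COMMON vsOut = VS_Mesh_Common(i);
        bool vFace = (HitKind() == HIT_KIND_TRIANGLE_FRONT_FACE);
        float4 color;
        color = PS_Mesh_Transparent_Common(vsOut, vFace);
        ... // payload
    }
#page#
CHS Shader
- Flow: bind the vertex buffer, index buffer and other vertex-attribute buffers as SRVs; parse the vertex buffers and assemble the VS input structure; pass it to the VS function; assemble the VS output structure; pass it to the PS function; write the payload
#page#
Optimization
#page#
RT resource updates
- Job-based: updates run on multiple threads
- Lower the cost of constant buffer updates
  - Update on demand
  - Update the world matrix separately. In about 40% of the cases where a constant buffer changes, only the world matrix has changed.
  - Reduce the number of memcpy calls. Following the constant buffer's memory layout, cache the whole block of default parameter values; parameters controlled by higher-level code are then updated one by one according to their dirty flags.
#page#
BLAS Update
- Static objects are not updated
- Dynamic objects are updated across several frames, according to LOD
#page#
BLAS Build
- BLAS merging
  - Put the meshes of one logical object into a single bottom-level AS; this lowers CPU and GPU cost and improves TraceRay performance
  - GPU cost went from 7 ms to 2 ms
- Compaction of static-object BLASes
  - Build the AS with compaction enabled; once the build finishes, read back the compacted AS size
  - Create a new AS, run a compacting copy from the old AS, then release the old one; VRAM usage drops by 50%
#page#
Bindless for reflections
- Keep the bind information and update it on demand
- CPU cost drops from 5 ms to 0.5-1 ms
#page#
Resource barrier optimization
- Move the transition and UAV barriers that sat between draw calls to the start and the end of the draw-call batch
- GPU skinning cost drops from 3.5 ms to 0.2 ms (GPU time)
#page#
Shader optimization pitfalls
- Unroll loops judiciously, especially loops that call TraceRay
- Keep TraceRay out of conditional branches where possible
- Keep the payload as small as possible
- Keep the attributes as small as possible
#page#
Problems in practice
#page#
Emulating VS and GS with CS
- Vertex transformation (VS)
  - Cases: skeletal animation, vertex animation
  - Use a compute shader to emulate the VS vertex transformation, store the result in a new buffer, and feed that buffer to the RT shaders and the acceleration structure
- Adding or destroying primitives (GS)
  - Case: fur shader
  - Use a compute shader to emulate the GS flow and compute a new set of vertex and index buffers
#page#
Tracing by category
- Multi-pass rendering
  - Case: hair rendering
  - Hair is rendered in two passes, a translucent part and an opaque part, with overlapping geometry; unless they are traced separately, only one of the two parts can be hit
- Decal-like effects
  - Case: fake caustics
  - Compute the hit point of the reflection ray against the scene mesh, use that hit point as the trace origin and the direction from the camera to the hit point as the trace direction; the trace then matches the rasterized fake caustics
#page#
Handling RT shaders
- Use the position of the reflection point in place of the eye position that rasterization shaders commonly use
- The view matrix has to be transformed into reflection space
- Add a new register space (SM 5.1 and above) for the resource views that RT needs; this keeps the original shader's register assignments unchanged and avoids register overlap
- Register the lighting code; forward lighting
#page#
Particle system
#page#
World space
[Diagram: triangle intersection against meshes only; a primary ray hits a mirror at the hit point and spawns a secondary ray.]
#page#
Billboard: procedural primitive
[Diagram: procedural intersection; a primary ray from the view frustum hits the mirror and the secondary ray is tested against billboards seen from a virtual view frustum. One label is not legible.]
#page#
Billboard: inserted quads
[Diagram: primary ray from the view, mirror and secondary ray; one label appears to refer to matching the instance mask, the rest is not legible.]
#page#
Particle system
#page#
Global illumination system
#page#
RTXGI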