Improve GPU Utilization from the System Level (從系統層面提升 GPU 利用率)

IMPROVE GPU UTILIZATION FROM SYSTEM LEVEL
Click Cheng, NVIDIA Solution Architect
GTC China 2020

WHAT'S ABOUT THE TALK
Welcome!
It's:
- From a system-level NVIDIA perspective, propose several ways to improve GPU utilization;
- Discuss several GPU monitoring metrics that reflect real GPU utilization;
- Introduce each solution's mechanism and usage, and discuss the benefit in some test cases;
- Summarize the positioning of the different solutions, compare them, etc.
It's not:
- Improving GPU utilization from the scheduler level;
- Optimizing GPU utilization from the coding level.

OUTLINE
- Overview: What's About The Talk; GPU Utilization Discussion
- Multi-Process Service: MPS intro, usage, test cases
- Multi-Instance GPU: MIG intro, usage, test cases
- Triton and vGPU brief: intro, test cases
- Quick Summary

OVERVIEW

BACKGROUND
Why Is This Important
- GPUs are more and more powerful, and more precious. Many applications benefit from that growing power.
- Some lower-utilization applications still cannot fully use the GPU's computing capability, for example development scenarios and inference scenarios.
- This is especially true for inference cases with a critical latency limit, where batching is not allowed.
- How do we share and isolate one GPU among processes or users?

GPU UTILIZATION
Metrics and Tools
- GPU utilization reflects how busy the different resources on a GPU are. Metrics cover the GPU cores (CUDA cores, integer, FP32, Tensor Cores), the frame buffer (capacity, bandwidth), PCIe RX and TX, NVLink RX and TX, the encoder and decoder, etc.
- Generally, when we talk about GPU utilization, we are mostly talking about the utilization of the CUDA cores.
- GPU utilization has some impact on delivered application performance, but not necessarily a direct one.
- Monitor tools:
  - nvidia-smi or NVML, installed with the GPU driver;
  - DCGM (Data Center GPU Manager), a standalone package that uses NVML plus advanced data-center profiling metrics.

GPU UTILIZATION METRIC
From nvidia-smi or NVML
- "GPU Utilization" from nvidia-smi or NVML is a rough metric of how busy the GPU cores are: "percent of time over the past sample period during which one or more kernels was executing on the GPU" (from the NVML API guide).
- Extreme case: the metric reads 100% even if only one thread was launched to run a kernel on the GPU during the past sample period.
- (nvidia-smi output: a V100-SXM2 16 GB board showing high GPU-Util with a single python process using about 15693 MiB of GPU memory.)
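The same counter that nvidia-smi prints can be read programmatically. Below is a minimal sketch in C against the NVML API shipped with the driver (link with -lnvidia-ml); the 100 ms polling interval and device index 0 are illustrative choices, not values from the slides.

    #include <stdio.h>
    #include <unistd.h>
    #include <nvml.h>

    int main(void) {
        nvmlDevice_t dev;
        nvmlUtilization_t util;

        if (nvmlInit() != NVML_SUCCESS) return 1;   /* load and initialize NVML */
        nvmlDeviceGetHandleByIndex(0, &dev);        /* GPU 0 */

        for (int i = 0; i < 10; ++i) {              /* take ten samples */
            nvmlDeviceGetUtilizationRates(dev, &util);
            printf("gpu core: %u%%  memory: %u%%\n", util.gpu, util.memory);
            usleep(100 * 1000);                     /* 100 ms between samples */
        }
        nvmlShutdown();
        return 0;
    }

As the slide notes, util.gpu only says that some kernel was resident during the sample window; the DCGM DCP metrics on the following pages are needed to see how much of the SMs that kernel actually used.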

GPU UTILIZATION METRIC
From DCGM
- DCGM provides the dcgmi CLI and APIs for C and Python.
- DCGM DCP (Data Center Profiling) provides lower-level profiling metrics, including several utilization metrics that are more accurate.
- These metrics better reflect, to some extent, how well GPU resources are utilized.
- One GPU has many different resources (computing, memory, IO), so it is highly recommended to capture several metrics to understand GPU utilization, not just one or two.
- (Figure: DCGM deployed across a management node and compute nodes, on top of the Tesla enterprise driver.)

GPU UTILIZATION METRIC
DCGM DCP Metrics
- Graphics Engine Activity (DCGM_FI_PROF_GR_ENGINE_ACTIVE): ratio of time the graphics engine is active. The graphics engine is active if a graphics/compute context is bound and the graphics pipe or compute pipe is busy.
- SM Activity (DCGM_FI_PROF_SM_ACTIVE): the ratio of cycles an SM has at least 1 warp assigned (computed from the number of cycles and elapsed cycles).
- SM Occupancy (DCGM_FI_PROF_SM_OCCUPANCY): the ratio of the number of warps resident on an SM (number of resident warps as a percentage of the theoretical maximum number of warps per elapsed cycle).
- Tensor Utilization (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE): the ratio of cycles the tensor (HMMA) pipe is active over sustained elapsed cycles.
- Memory BW Utilization (DCGM_FI_PROF_DRAM_ACTIVE): the ratio of cycles the device memory interface is active sending or receiving data.
- FLOP Counts (DCGM_FI_PROF_PIPE_FPXY_ACTIVE): ratio of cycles the fp64/fp32/fp16/HMMA/IMMA pipes are active.
- NVLink Utilization (DCGM_FI_DEV_NVLINK_BANDWIDTH_L0): the number of bytes of active NVLink rx or tx data, including both header and payload.
- PCIe Utilization (DCGM_FI_PROF_PCIE_TX/RX_BYTES, pcie_bytes_{rx,tx}): the number of bytes of active PCIe rx or tx data, including both header and payload.

GPU UTILIZATION METRIC
Using dcgmi
- Recommended monitor command with dcgmi:
  $ dcgmi dmon -e 1001,1002,1004,1005,1009,1010,1011,1012,150,155,110,111

MULTI-PROCESS SERVICE

HYPER-Q
Example: $CUDA_PATH/samples/6_Advanced/simpleHyperQ (excerpt, reformatted)

    for (int i = 0; i < nstreams; ++i)
    {
        kernel_A<<<1, 1, 0, streams[i]>>>(&d_a[2 * i], time_clocks);
        total_clocks += time_clocks;
        kernel_B<<<1, 1, 0, streams[i]>>>(&d_a[2 * i + 1], time_clocks);
        total_clocks += time_clocks;
    }

(Figure: kernel execution timeline on a device without Hyper-Q vs. a device with Hyper-Q.)
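For readers who want to reproduce the Hyper-Q behaviour outside the SDK sample, here is a minimal, self-contained sketch (not the simpleHyperQ source itself): it launches a clock64()-based spin kernel on several streams, so on devices with Hyper-Q the launches can overlap while on older devices they serialize. The stream count and spin length are arbitrary choices.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Spin for roughly `spin_cycles` GPU clock cycles so overlap is easy to see.
    __global__ void spin_kernel(long long spin_cycles) {
        long long start = clock64();
        while (clock64() - start < spin_cycles) { }
    }

    int main() {
        const int nstreams = 8;
        cudaStream_t streams[nstreams];
        for (int i = 0; i < nstreams; ++i) cudaStreamCreate(&streams[i]);

        const long long spin_cycles = 100000000;   // duration is device dependent
        for (int i = 0; i < nstreams; ++i)
            spin_kernel<<<1, 1, 0, streams[i]>>>(spin_cycles);   // one launch per stream

        cudaDeviceSynchronize();
        for (int i = 0; i < nstreams; ++i) cudaStreamDestroy(streams[i]);
        printf("done\n");
        return 0;
    }

Viewed in a profiler, the eight launches occupy separate hardware queues when Hyper-Q is available, which is the same capability MPS builds on for separate processes.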

MULTI-PROCESS SERVICE
What's MPS
- An alternative, binary-compatible implementation of the CUDA Application Programming Interface (API).
- Based on the GPU's Hyper-Q capability:
  - enables multiple CPU processes to share one GPU context;
  - allows kernels and memcpy operations from different processes to execute simultaneously on the same GPU, to utilize the GPU better.
- MPS includes:
  - Control Daemon Process - the control daemon is responsible for starting and stopping the server, as well as coordinating connections between clients and servers.
  - Server Process - the server is the clients' shared connection to the GPU and provides concurrency between clients.
  - Client Runtime - the MPS client runtime is built into the CUDA driver library and may be used transparently by any CUDA application.

MULTI-PROCESS SERVICE
Without MPS vs. With MPS
(Figure: timelines of MPI processes running kernels A and B on the GPU, without MPS and with MPS.)

MULTI-PROCESS SERVICE
MPS Architecture
System-wide provisioning with multiple users:
- Client A from User 1 requests; the daemon creates an MPS server for User 1 and Client A runs;
- Client B from User 1 requests, is assigned to the same MPS server, and runs;
- Client C from User 2 requests and is left pending;
- only when all clients from User 1 finish and User 1's MPS server exits does the daemon create an MPS server for User 2, and Client C begins to run.

MULTI-PROCESS SERVICE
MPS Benefits
- GPU utilization: a single process may not utilize all the compute and memory-bandwidth capacity available on the GPU. MPS allows kernel and memcopy operations from different processes to overlap on the GPU, achieving higher utilization and shorter running times.
- Reduced on-GPU context storage: the MPS server allocates one copy of GPU storage and scheduling resources shared by all its clients, which reduces resource storage.
- Reduced on-GPU context switching: the MPS server shares one set of scheduling resources between all of its clients, eliminating the overhead of swapping when the GPU schedules between those clients.

MULTI-PROCESS SERVICE
Potential Applications for MPS
- The application process does not generate enough work to saturate the GPU. Applications like this are identified by having a small number of blocks per grid.
- The application shows low GPU occupancy because of a small number of threads per grid.
- In strong-scaling cases, some MPI processes may underutilize the available compute capacity.
- Especially AI inference with a critical latency limit, where batching is not allowed.

MULTI-PROCESS SERVICE
Volta MPS
Volta MPS provides a few key improvements compared with pre-Volta MPS:
- Volta MPS clients submit work directly to the GPU without passing through the MPS server;
- each Volta MPS client owns its own GPU address space instead of sharing GPU address space with all other MPS clients;
- Volta MPS supports limited execution resource provisioning for Quality of Service (QoS); see the sketch below.
(Figure: MPS data paths on Pascal GP100 vs. Volta GV100.)
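A minimal sketch of the execution resource provisioning mentioned above, assuming the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable documented for Volta MPS; the 25% figure is an arbitrary illustration. The variable has to take effect before the client process creates its CUDA context, and it can just as well be exported in the shell that launches the process.

    #include <cstdlib>
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        // Ask the MPS server to provision roughly 25% of the SMs for this client.
        // Must be set before the first CUDA call that creates the context.
        setenv("CUDA_MPS_ACTIVE_THREAD_PERCENTAGE", "25", 1);

        cudaFree(0);   // force CUDA context creation under the limit set above

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("MPS client context created on %s\n", prop.name);
        return 0;
    }

This is what the comparison table at the end of the talk refers to as SM performance isolation "by percentage, not partitioning".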

MULTI-PROCESS SERVICE
MPS Usage
- Start the MPS daemon process:
  nvidia-cuda-mps-control -d
- Check the MPS processes:
  ps -ef | grep mps
- Recommended: set the compute mode to exclusive:
  sudo nvidia-smi -c EXCLUSIVE_PROCESS
- Quit the MPS daemon:
  echo quit | nvidia-cuda-mps-control

MULTI-PROCESS SERVICE
MPS Usage
(Figure: nvidia-smi output when running eight trtexec processes with MPS; all eight clients share one V100.)

MPS TEST CASE 1
Simple Kernel with One Thread Running
Simple kernel code (ignore the computing content):

    __global__ void testMaxFlopsKernel(float *pData, int nRepeats, float v1, float v2)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float s = pData[tid], s2 = 10.0f - s, s3 = 9.0f - s, s4 = 9.0f - s2;
        for (int i = 0; i < nRepeats; ++i)
        {
            /* arithmetic on s, s2, s3, s4 elided on the slide */
        }
        ...
    }
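The point of this test case is that even a single-thread kernel keeps the coarse "GPU Utilization" counter at 100%. Below is a self-contained sketch in the same spirit (not the original test harness): a one-thread FMA loop launched with <<<1, 1>>>. Running several copies of it as separate processes, with and without MPS, mirrors the kind of setup these test cases describe; the buffer size and repeat count are arbitrary.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void one_thread_fma(float *pData, int nRepeats)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float s = pData[tid], s2 = 10.0f - s;
        for (int i = 0; i < nRepeats; ++i) {
            s  = s * s2 + s2;     // dependent FMA chain keeps one thread busy
            s2 = s2 * s + s;
        }
        pData[tid] = s + s2;      // write back so the loop is not optimized away
    }

    int main()
    {
        float *d_data;
        cudaMalloc(&d_data, sizeof(float));
        cudaMemset(d_data, 0, sizeof(float));

        one_thread_fma<<<1, 1>>>(d_data, 1 << 24);   // one thread on the whole GPU
        cudaDeviceSynchronize();

        cudaFree(d_data);
        printf("done\n");
        return 0;
    }

While this runs, nvidia-smi reports GPU-Util near 100%, whereas the DCGM SM Activity and SM Occupancy metrics stay close to zero, which is exactly the gap the earlier metric discussion warns about.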

MPS TEST CASE 2
GPU Utilization Metrics - MPS OFF
(Chart: GPU Util, SM Activity, SM Occupancy and Tensor Util, without batching and without MPS, for NP = 1, 2, 4, 8, 16 processes.)

MPS TEST CASE 2
GPU Utilization Metrics - MPS ON
(Chart: the same metrics without batching but with MPS, for NP = 1, 2, 4, 8, 12 processes.)

MPS TEST CASE 2
Profiling Analysis
(Figure: profiler timelines for BS=1, NP=8 with MPS OFF vs. MPS ON.)

MPS TEST CASE 3
JPEG Resize
- JPEG-to-JPEG resizing is an essential workload for many internet services, including training and inference for image classification, object detection, etc.
- Some service providers, to cut storage expense, may store just one image instead of several dozen copies at different resolutions.
- Fastvideo, an NVIDIA Preferred Partner, developed an image processing SDK with CUDA acceleration (one of their customers was Flickr). Since there are multiple phases in the whole JPEG resize pipeline - copy from storage to CPU memory, copy to GPU memory, JPEG decoding, resizing, sharpening, JPEG encoding, copy back to CPU memory, etc. - they have done many optimizations across the whole pipeline, and one technique they adopted is NVIDIA MPS, to optimize the throughput of the GPU system.
- We use the Fastvideo SDK to perform this testing.

MPS TEST CASE 3
Test Results
- Resize JPEG from 1920x1080 to 480x270: up to 3.5x throughput improvement when MPS is enabled.
- (Table: FPS with MPS off vs. MPS on and the resulting speedup, for 1 to 18 processes; the speedup grows with the process count, reaching about 3.5x.)

MPS TEST CASE 3
Test Results
- Resize JPEG from 1280x720 to 320x180: up to roughly 4.4x throughput improvement when MPS is enabled.
- (Table: FPS with MPS off vs. MPS on and the resulting speedup, for 2 to 18 processes; the largest speedup, about 4.45x, is reached at 18 processes.)

MULTI-INSTANCE GPU

GPU ARCHITECTURE AND CUDA
- 2016, Pascal, CUDA 8.0: HBM, NVLink, FP16
- 2017, Volta, CUDA 9.0: HBM, NVLink, Tensor Cores, MPS
- 2018, Turing, CUDA 10.0: Tensor Cores, RT Cores
- 2020, Ampere, CUDA 11.0: HBM, NVLink, Tensor Cores, Partitioning

A100 GPU
Highest Performance, Efficiency and Utilization
New technology and benefit over Volta:
- Faster Tensor Cores for AI, supporting FP16 and bfloat16: 3x Tensor Core FLOPS; dramatically reduced time-to-solution (about 2x V100 on ResNet-50 and Transformer training).
- New Tensor Cores for HPC: 2.5x FP64 FLOPS; accelerate core HPC kernels.
- Wider and faster memory: 1.7x memory bandwidth, up to 40 GB per GPU; larger models and datasets.
- New NVLink 3 + PCIe Gen4 + SR-IOV: 2x NVLink bandwidth (12 NVLink 3 x4 links), 2x PCIe bandwidth (PCIe Gen4 x16).
- New Multi-Instance GPU with fault and performance isolation: up to 7 concurrent GPU instances; higher utilization; substantially lower entry cost.
- New hardware engines: JPEG hardware decoder, 5 video NVDEC engines, optical flow accelerator.

NEW MULTI-INSTANCE GPU (MIG)
Optimize GPU Utilization, Expand Access to More Users with Guaranteed Quality of Service
- Up to 7 GPU instances in a single A100: dedicated SMs, memory, L2 cache and bandwidth for hardware QoS and isolation.
- Simultaneous workload execution with guaranteed quality of service: all MIG instances run in parallel with predictable throughput and latency.
- Right-sized GPU allocation: different-sized MIG instances based on target workloads.
- Flexibility to run any type of workload on a MIG instance.
- Diverse deployment environments: supported with bare metal, Docker, Kubernetes and virtualized environments.

MIG ISOLATION
- Computational isolation: SMs are not shared between MIG instances. This provides high QoS for each MIG user.
- DRAM bandwidth isolation: slices of the L2 cache are physically associated with particular DRAM channels and memory. Isolating MIG instances to non-overlapping sets of L2 cache slices does two things: it isolates bandwidth and it allocates DRAM memory between the MIG instances.
- Configuration isolation: creating GPU instances or compute instances does not disturb work running on existing instances.
- Error isolation: resources within the chip are separately resettable.

GPU INSTANCE PROFILES
For A100-SXM4-40GB
- 1g.5gb: 7 instances available, 5 GB each. Target use-cases: BERT fine-tuning (e.g. SQuAD), multiple chatbots, Jupyter notebooks, multiple inference services (e.g. Triton).
- 2g.10gb: 3 instances available, 10 GB each. Target use-cases: ResNet-50, BERT, W&D networks.
- 3g.20gb: 2 instances available, 20 GB each. Target use-cases: training on ResNet-50, BERT, W&D networks.
- 4g.20gb: 1 instance available, 20 GB.
- 7g.40gb: 1 instance available, 40 GB.

FLEXIBLE MIG CONFIGURATIONS FOR DIFFERENT SCENARIOS
- 18 possible configurations of the seven slices.
- NVML or nvidia-smi is used to create and retire instances.
- The configuration can be dynamically updated when the GPU slices involved are idle.
(Figure: the valid combinations of 1-, 2-, 3-, 4- and 7-slice instances across slices #1 to #7.)

EXAMPLE: TWO-LEVEL PARTITIONING
GPU Instances and Compute Instances
(Figure: one A100 partitioned into 4g.20gb, 2g.10gb and 1g.5gb GPU instances; the 4g.20gb instance is subdivided into 1c.4g.20gb compute instances running four parallel CUDA processes/containers under MPS, while the other instances run a single container and a debugger.)

ENABLEMENT ACROSS SOFTWARE STACK
- Management and monitoring: NVML, DCGM, Slurm; system software: CUDA; systems: A100 in Tesla GPU servers, the NVIDIA DGX family, NVIDIA HGX.
- Support for bare-metal and containerized environments.
- Interaction directly via NVML/nvidia-smi.
- Kubernetes (device enumeration, resource type), Slurm.
- Docker CLI.
- Monitoring and management (including device metrics association to MIG).

USER WORKFLOW: MIG MANAGEMENT
List/Create/Update/Destroy Instances via NVML and nvidia-smi
- A GPU reset is required to enable/disable MIG mode (a one-time operation).
- Use NVML/nvidia-smi (even through containers) to manage MIG.
- Example: create a new instance with nvidia-smi, then list the instances:
  # nvidia-smi mig --list-gpu-instances
  (Output: a table of GPU instances with profile name, profile ID, instance ID and placement start:size, showing several 1g.5gb and 2g.10gb instances.)
- Workflow: enable MIG mode, check availability, allocate a partition, remove a partition.
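Since management goes through NVML as well as nvidia-smi, here is a small C sketch of enumerating MIG devices from a monitoring agent. It assumes the MIG-related NVML entry points introduced alongside A100 (nvmlDeviceGetMigMode, nvmlDeviceGetMaxMigDeviceCount, nvmlDeviceGetMigDeviceHandleByIndex); consult the NVML reference for the exact signatures in your driver version.

    #include <stdio.h>
    #include <nvml.h>

    int main(void) {
        if (nvmlInit() != NVML_SUCCESS) return 1;

        nvmlDevice_t gpu;
        nvmlDeviceGetHandleByIndex(0, &gpu);        /* parent A100, index 0 */

        unsigned int cur = 0, pending = 0;
        if (nvmlDeviceGetMigMode(gpu, &cur, &pending) == NVML_SUCCESS &&
            cur == NVML_DEVICE_MIG_ENABLE) {
            unsigned int maxMig = 0;
            nvmlDeviceGetMaxMigDeviceCount(gpu, &maxMig);
            for (unsigned int i = 0; i < maxMig; ++i) {
                nvmlDevice_t mig;
                if (nvmlDeviceGetMigDeviceHandleByIndex(gpu, i, &mig) != NVML_SUCCESS)
                    continue;                        /* slot not populated */
                char uuid[96];
                nvmlDeviceGetUUID(mig, uuid, sizeof(uuid));
                printf("MIG device %u: %s\n", i, uuid);
            }
        } else {
            printf("MIG mode is not enabled on GPU 0\n");
        }
        nvmlShutdown();
        return 0;
    }

The UUIDs printed here are the same MIG device identifiers used on the following Docker and Kubernetes pages.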

MIG: RUNNING DOCKER CONTAINERS
User Workflow
- Run GPU containers with MIG using the --gpus option in Docker 19.03; primarily for single-node development and testing.
- Enabled via the NVIDIA Container Toolkit (previously known as nvidia-docker2).
- Users configure MIG partitions using NVML/nvidia-smi.
- Launching the container requires specifying the GPU instances to expose to the container, either by index or by MIG device UUID:

  $ docker run --gpus device=0:1 nvidia/cuda:11.0-base nvidia-smi -L
  GPU 0: A100-SXM4-40GB (UUID: GPU-2ceff3df-31b3-caf2-eace-a494b4b7926b)
    MIG 3g.20gb Device 0: (UUID: MIG-GPU-2ceff3df-31b3-caf2-eace-a494b4b7926b/1/0)
    MIG Device 1: (UUID: MIG-GPU-2ceff3df-31b3-caf2-eace-a494b4b7926b/2/0)

  $ docker run --gpus device=MIG-GPU-2ceff3df-31b3-caf2-eace-a494b4b7926b/1/0 nvidia/cuda:11.0-base nvidia-smi -L
  GPU 0: A100-SXM4-40GB (UUID: GPU-2ceff3df-31b3-caf2-eace-a494b4b7926b)
    MIG 3g.20gb Device 0: (UUID: MIG-GPU-2ceff3df-31b3-caf2-eace-a494b4b7926b/1/0)

MIG: RUNNING CONTAINERS USING K8S
User Workflow
- MIG is configured on the node ahead of time; expected to be transparent to the end user.
- Simple exposure model for homogeneous nodes.
- Other exposure options are still in discussion and not settled; jobs will be able to execute only on compute instances.
- Example pod spec (as shown on the slide): apiVersion: v1, kind: Pod, metadata naming the pod gpu-example, a single container using the nvidia/cuda:11.0-base image, and a resources.limits entry requesting one MIG device.

MIG TEST CASE 1 - BERT LARGE INFERENCE
Test Results
- BERT Large inference, BS=1, INT8.
- Performance across the 7 MIG 1g.5gb slices is very stable and consistent; MIG provides great performance isolation and QoS.
- About 2.1x throughput when MIG is enabled for this case and configuration.
- (Chart: throughput - No MIG, whole GPU: 672; MIG, 1 x 1g.5gb: 208; MIG, 7 x 1g.5gb: 1457.)

MIG TEST CASE 1 - BERT LARGE INFERENCE
GPU Utilization Metrics
(Chart: GPU device-level SM Activity, Tensor Util and Memory Activity for MIG 1 x 1g.5gb, No MIG whole GPU, and MIG 7 x 1g.5gb.)

MIG TEST CASE 2 - JASPER INFERENCE
Test Results
- Jasper inference, BS=1, FP16.
- Throughput: the amount of audio seconds processed by the GPU in one second.
- With MIG enabled, throughput improves by up to 3.4x.
- (Chart: No MIG, whole GPU: 1160; MIG, 1 x 1g.5gb: 584; MIG, 7 x 1g.5gb: 4037.)

TRITON AND VGPU BRIEF

INEFFICIENCY LIMITS INNOVATION
Difficulties with Deploying Data Center Inference
- Single model only: some systems are overused while others are underutilized.
- Single framework only: solutions can only support models from one framework.
- Custom development: developers need to reinvent the plumbing for every application.
(Figure: a mix of frameworks - Chainer, MXNet, PyTorch, Theano - and workloads - recommender, ASR, NLP - spread across dedicated servers.)

NVIDIA TRITON INFERENCE SERVER
Production Data Center Inference Server
- Maximize real-time inference performance of GPUs.
- Quickly deploy and manage multiple models per GPU per node.
- Easily scale to heterogeneous GPUs and multi-GPU nodes.
- Integrates with orchestration systems and auto-scalers via latency and health metrics.
- Now open source for thorough customization and integration.
(Figure: Triton serving models across a mixed pool of T4, V100 and P4 GPUs.)

DYNAMIC BATCHING
2.5x Faster Inferences/Second at a 50 ms End-to-End Server Latency Threshold
- Triton Inference Server groups inference requests based on customer-defined metrics for optimal performance.
- The customer defines 1) the batch size (required) and 2) latency requirements (optional).
- Example: no dynamic batching (batch sizes 1 and 8) vs. dynamic batching.
(Chart: static vs. dynamic batching for ResNet-50 with TensorRT, FP16, one model instance on V100.)

VGPU FOR GRAPHICS AND COMPUTING
(Figure: virtual PCs and virtual workstations using the NVIDIA GRID/Quadro drivers, and virtual compute servers using the NVIDIA compute driver, each on its own vGPU; the NVIDIA GRID vGPU manager runs in the hypervisor on top of the GPU server.)

VGPU FOR COMPUTING
VCS
- The hypervisor provides the best security and isolation guarantees.
- VCS (Virtual Compute Server) is a good option for cost-sensitive customers, newcomers to GPU computing, and low-utilization GPU scenarios.
- Flexible scheduler strategy: best effort, fixed share, equal share.
- Flexible scheduler time slice (controllable from 1 to 20 ms).
- Performance is guaranteed even though SM resources are shared round-robin in time.

QUICK SUMMARY

CUDA CONCURRENCY MECHANISMS
Triton, MPS, vGPU and MIG
- Triton (CUDA streams): parallel work: yes; SM performance isolation: no; memory performance isolation: no; address space isolation: no; error isolation: no.
- MPS: parallel work: yes; SM performance isolation: yes (by percentage, not partitioning); memory performance isolation: no; address space isolation: yes; error isolation: no.
- vGPU: parallel work: yes; SM performance isolation: yes (time-slicing); memory performance isolation: yes; address space isolation: yes; error isolation: yes (with hypervisor).
- MIG: parallel work: yes; SM performance isolation: yes; memory performance isolation: yes; address space isolation: yes; error isolation: yes.

COMPARISON
Part 1: Simple Comparison Among MPS, vGPU, Triton and MIG
- MPS: intro link: MPS whitepaper; open source: no; free: yes. Main positioning: improve GPU utilization for applications that do not fully utilize the GPU, by scheduling multiple processes with limited execution resources. Target applications: applications that do not fully utilize the GPU - HPC MPI applications, training, inference with small matrix sizes.
- vGPU: intro link: official documentation; open source: no; free: no. Main positioning: offer a consistent user experience for every virtual workflow and improve GPU utilization in some scenarios, by splitting the GPU into multiple vGPUs as equal partitions, integrated with the hypervisor (virtual machine technology). Target applications: 3D rendering, vGaming, training, inference.
- Triton: intro link: GitHub; open source: yes; free: yes. Main positioning: provide a cloud inferencing solution optimized for NVIDIA GPUs, exposing an inference service via an HTTP or gRPC endpoint. Target applications: inference.
- MIG: intro link: MIG whitepaper (NDA); open source: no; free: yes. Main positioning: improve GPU utilization and serve more users, with physical resource isolation (SMs, memory size) and QoS guarantees. Target applications: training, inference, HPC.

COMPARISON
Part 2: Simple Comparison Among MPS, vGPU, Triton and MIG
- Supported GPUs: MPS: GPUs since Kepler; vGPU: P100, P40, P4, P6, V100, T4, RTX 8000, RTX 6000, M10, M60; Triton: all GPUs; MIG: A100.
- Supported OS: MPS: Linux; vGPU: Linux, Windows; Triton: Linux; MIG: Linux.
- Extra software needed: MPS: no; vGPU: a hypervisor (KVM, Citrix, VMware, etc.); Triton: no; MIG: no.
- Benefits: MPS: improve GPU utilization, improve throughput; vGPU: improve GPU utilization via time-sharing, improve user experience; Triton: improve GPU utilization, improve throughput; MIG: improve GPU utilization, serve more users, provide QoS and fault isolation.
- GPU resource isolation: MPS: context-level isolation, with memory and SMs shared; vGPU: GPU memory isolation, SM sharing by rotation; Triton: executes model (application) instances as CPU threads and GPU streams, SM sharing via multiple streams; MIG: GPU memory isolation, SM isolation, and isolation of other engines (copy engines, NVDEC).
