Optimizing Large-Scale Deployment of Deep Learning Inference with Triton (使用 Triton 優化深度學習推理的大規模部署.pdf)

Uploaded by: li | ID: 29551 | 2021-02-07 | PDF | 68 pages | 3.47 MB

Part of the collection: 2020 GTC China Online Conference speaker presentation slides

NVIDIA
Optimizing Large-Scale Deployment of Deep Learning Inference with Triton
徐添豪, 張雪萌, 黃孟迪

AGENDA
- Triton Overview
- Inference Server Pipeline
- A100 Multi-Instance GPU (MIG)
- Deployment on Kubernetes
- Integration with KFServing
- Metrics for Monitoring and Autoscaling
- Performance Analyzer: Optimization Guidance
- Customer Case Studies

Triton Overview

Inefficiency Limits Innovation
Difficulties with Deploying Data Center Inference
- Single Framework Only: solutions can only support models from one framework
- Single Model Only: some systems are overused while others are underutilized
- Custom Development: developers need to reinvent the plumbing for every application
[Figure: framework logos (Chainer, PyTorch, TensorFlow, Theano) and workload labels (RecSys, NLP, ASR)]

NVIDIA Triton Inference Server
Production Inference Server on GPU and CPU
- Maximize real-time inference performance of CPUs and GPUs
- Quickly deploy and manage multiple models per GPU per node
- Easily scale to heterogeneous GPUs and multi-GPU nodes
- Integrates with orchestration systems and auto scalers via latency and health metrics
- Open source for seamless customization and integration
[Figure: NVIDIA T4, Tesla V100, NVIDIA A100, CPU]

Triton Inference Server Architecture
Previously "TensorRT Inference Server"
- Support for multiple frameworks
- Concurrent model execution
- CPU and multi-GPU support
- Dynamic batching
- Sequence batching for stateful models
- HTTP/REST, gRPC, shared library
- Health and status metrics (Prometheus) reporting
- Model ensembling and pipelining
- Shared-memory API (system and CUDA)
- GCS and S3 support
- Open source: monthly releases on NGC and GitHub

Features
(Slide columns: Utilization, Usability, Customization, Performance)
- Concurrent Model Execution: multiple models (or multiple instances of the same model) may execute on GPU simultaneously.
- Dynamic Batching: inference requests can be batched up by the inference server to 1) the model-allowed maximum or 2) the user-defined latency SLA.
- CPU Model Inference Execution: framework native models can execute inference requests on the CPU.
- Multiple Model Format Support: TensorRT; PyTorch JIT (.pt); TensorFlow 1.x GraphDef/SavedModel; TensorFlow+TensorRT 1.x GraphDef; TensorFlow 2.x SavedModel; TensorFlow+TensorRT 2.x SavedModel; ONNX graph (ONNX Runtime).
- Metrics: utilization, count, memory, and latency.
- Model Control API: explicitly load/unload models into and out of Triton based on changes made in the model-control configuration.
- Model Ensemble: pipeline of one or more models and the connection of input and output tensors between those models (can be used with custom backend).
- Custom Backend for C++ and Python: a custom backend allows the user more flexibility by providing their own implementation of an execution engine through the use of a shared library.
- Streaming API: built-in support for audio streaming input. Accommodates stateful/sequence models that have a sequence of inputs to keep track of (speech, translation, etc.).
- System/CUDA Shared Memory: inputs/outputs that need to be passed to/from Triton are stored in system/CUDA shared memory. Reduces HTTP/gRPC overhead.
- Library Version: link against libtrtserver.so so that you can include all the inference server functionality directly in your application.

TRITON 2.5
What Is New
- KFServing's new community standard gRPC and HTTP/REST data plane v2 protocol: easily deploy serverless inferencing with Triton in Kubernetes.
- Data Loading Library (DALI) backend: allows for accelerated pre-processing and augmentation pipelines within Triton for images, videos, and speech.
- Decoupled inference serving: engages a model once sufficient, but not all, inputs are received, e.g. speech recognition and synthesis.
- Python custom backend: allows Python code execution inside Triton (e.g. pre- and post-processing).
- Triton Model Analyzer: tools to characterize model performance and memory footprint for efficient serving.
- Support for A100, MIG: higher performance inference serving. Triton on MIG with performance and fault isolation.
- DeepStream 5.0 Integration: native integration in DeepStream 5.0 for multi-framework, multi-sensor streaming analytics.
- Latest framework backends: TensorRT 7.1, TensorFlow 2.2, PyTorch 1.6, ONNX RT 1.5.3.
- Azure Machine Learning Integration: AzureML integrated Triton as the platform's inference server to deploy models at scale.
- Google CAIP Integration: Triton is now available on GCP CAIP as a custom container.
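The v2 data plane protocol listed under Triton 2.5 carries inference requests as JSON over HTTP/REST. The following is a minimal sketch of how such a request body could be assembled, not code from the slides: the helper name build_v2_infer_request and the 4-token input are illustrative assumptions, while the tensor names and the model name are taken from the BERT example used later in the deck.

```python
import json


def build_v2_infer_request(input_name, shape, datatype, data, output_names):
    """Build the JSON body for POST /v2/models/<model_name>/infer (hypothetical helper)."""
    # The flattened data must contain exactly as many elements as the shape implies.
    expected = 1
    for dim in shape:
        expected *= dim
    if len(data) != expected:
        raise ValueError("data has %d elements, shape needs %d" % (len(data), expected))
    return {
        "inputs": [
            {"name": input_name, "shape": list(shape), "datatype": datatype, "data": list(data)}
        ],
        "outputs": [{"name": name} for name in output_names],
    }


# Illustrative values only: a 4-token input instead of the 128 tokens the BERT model expects.
body = build_v2_infer_request(
    "input_ids", [1, 4], "INT32", [101, 2054, 2003, 102], ["start_logits", "end_logits"]
)
path = "/v2/models/{}/infer".format("bert_tf_v2_large_fp16_128_v2")
print(path)
print(json.dumps(body, indent=2))
```

A client would POST this body to the printed path on the server's HTTP port; the tritonclient library shown later in the deck builds the same kind of request for you.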

DEVELOPERS CAN FOCUS ON MODELS AND APPLICATIONS
Triton Takes Care of Plumbing To Deploy Models for Inference
- Multiple frameworks: all major framework backends for flexibility & consistency (TensorFlow, ONNX Runtime, PyTorch, custom)
- Inferencing on GPU and CPU: inference serving on GPU & CPU across cloud | data center | edge, bare metal | virtualization
- Different types of queries: support for different types of inference queries for different use cases (batch, real time, stream, ensemble)
- Dynamic batching: dynamic batching maximizes throughput under latency constraint
- Concurrent model execution for high throughput & utilization
- Standard HTTP/gRPC communication

Designed for DevOps/MLOps
Triton Integrates Easily In Organizations' Workflow For All AI Use Cases
- Kubernetes integration: scalable microservice in Kubernetes; Helm chart for fast deployment; KFServing integration
- MLOps: live model updates, dynamic model loading; Triton Model Analyzer; Google CAIP, Azure ML integration
- Open source & customizable: completely open source (inspect, customize & extend); customizable container; modular backends for low memory footprint
- Metrics: GPU utilization, memory, inference load & latency

Inference Server Pipeline

Inference Pipeline
Typical Pipeline
[Figure: client and Triton server connected over the network; stages labelled request serialization, network, receive, request de-serialization, queue, compute, serialization, request completed]

Running Triton
Triton Docker Container Available on NGC
Prerequisite: Docker and nvidia-docker installed

    $ docker pull nvcr.io/nvidia/tritonserver:20.11-py3
    $ docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 \
        -v/path/to/model/repository:/models \
        nvcr.io/nvidia/tritonserver:20.11-py3 \
        tritonserver --model-repository=/models

Model configuration shown on the slide:

    name: "bert_tf_v2_large_fp16_128_v2"
    platform: "..."    # value not legible in the source
    max_batch_size: 1
    input [
      { name: "unique_ids"   data_type: TYPE_INT32  dims: 1  reshape: { shape: [ ] } },
      { name: "segment_ids"  data_type: TYPE_INT32  dims: 128 },
      { name: "input_ids"    data_type: TYPE_INT32  dims: 128 },
      { name: "input_mask"   data_type: TYPE_INT32  dims: 128 }
    ]
    output [
      { name: "end_logits"    data_type: TYPE_FP32  dims: 128 },
      { name: "start_logits"  data_type: TYPE_FP32  dims: 128 }
    ]
    instance_group [ { count: 1  kind: KIND_GPU  gpus: [ 0 ] } ]

Running Inference
Available at NVIDIA Deep Learning Examples GitHub

    # QA - Tokenization (client setup)
    import tritonclient.http as httpclient

    model_name = "bert_tf_v2_large_fp16_128_v2"
    model_version = -1
    batch_size = 1
    url = "x.x.x.x:8000"

    # Health
    triton_client = httpclient.InferenceServerClient(url=url)
    if not triton_client.is_server_live():
        print("FAILED : is_server_live")
    if not triton_client.is_server_ready():
        print("FAILED : is_server_ready")
    if not triton_client.is_model_ready(model_name):
        print("FAILED : is_model_ready")

    # Metadata
    metadata = triton_client.get_server_metadata()
    if metadata["name"] == "triton":
        print(metadata)
    metadata = triton_client.get_model_metadata(model_name)
    print(metadata)

    # Create the inference input/output for the model.
    # Tensor names follow the model configuration above.
    inputs = []
    outputs = []
    inputs.append(httpclient.InferInput("unique_ids", [1, 1], "INT32"))
    inputs.append(httpclient.InferInput("segment_ids", [1, 128], "INT32"))
    inputs.append(httpclient.InferInput("input_ids", [1, 128], "INT32"))
    inputs.append(httpclient.InferInput("input_mask", [1, 128], "INT32"))
    outputs.append(httpclient.InferRequestedOutput("end_logits", binary_data=False))
    outputs.append(httpclient.InferRequestedOutput("start_logits", binary_data=False))

    # The tokenized data must be attached to each input with
    # set_data_from_numpy() before sending (not shown on the slide).
    # Send inference request to the inference server.
    # Get results for both output tensors.
    result = triton_client.infer(model_name, inputs, outputs=outputs)

    # We expect there to be 2 results (each with batch size 1).
    end_logits = result.as_numpy("end_logits")[0]
    start_logits = result.as_numpy("start_logits")[0]

A100 Multi-Instance GPU (MIG)

A100 MIG Support
Optimize GPU Utilization, Expand Access to More Users with Guaranteed Quality of Service
- Up to 7 GPU instances in a single A100: dedicated SM, memory, L2 cache, and bandwidth for hardware QoS & isolation
- Simultaneous workload execution with guaranteed quality of service: all MIG instances run in parallel with predictable throughput & latency
- Right-sized GPU allocation: different sized MIG instances based on target workloads
- Diverse deployment environments: supported with bare metal, Docker, Kubernetes, virtualized environments
[Figure: users USER0 to USER6, each mapped to one GPU instance of the A100]
CNS20428: Multi-Instance GPU (MIG) deep learning best-practice examples

Inference with Triton
7 ResNet Models on 7 MIG Instances in Parallel
[Figure: gRPC client, load balancer, and seven triton-trt servers running ResNet50, one per MIG instance (MIG1 to MIG7) on a single A100]

Inference with Triton
Measure Performance Using the Perf Analyzer
Perf Analyzer (formerly "perf_client") measures latency and throughput (inf/s) under varying client loads. It can be used to measure performance at the lowest possible load on the model, by sending one inference request to Triton and waiting for the response. Use the --concurrency-range option to send multiple requests at the same time.

    perf_analyzer -m flower -u 127.0.0.1:50058 -i http --concurrency-range 1:100 -f results.csv

Triton on A100 with MIG
4.5X Throughput Speedup Using 7 MIG Instances on ResNet50
One A100 can be partitioned into up to seven GPU instances to maximize GPU utilization and provide dynamic scalability. Each MIG instance brings a consistent increase in throughput. Using only two MIG instances on A100 already provides an improvement over both V100 16GB and T4.
[Chart: Throughput (HTTP, batch size = 1) versus concurrency; series A100-MIG7x1 through A100-MIG7x7 and V100]

Triton on A100 with MIG
4X Latency Speedup Using 7 MIG Instances on ResNet50
Typically, when model concurrency increases, latency tends to suffer. This chart shows how increasing the number of MIG instances can keep latency low at higher concurrency values. The latency decreases monotonically as more MIG instances are added. Using only two MIG instances on A100 already represents an improvement over both V100 16GB and T4. At seven MIG instances, we obtain a significant latency speedup.
[Chart: Latency (p90, HTTP, batch size = 1) versus concurrency; series A100-MIG7x1 through A100-MIG7x7 and V100]

[Figure: client, NVIDIA Triton with a model repository (ResNet50, TensorRT), 1 NVIDIA T4 GPU]

MIG for Optimized Inferencing
Scaling out microservices with Triton and MIG
- Horizontally scale out your containers or VMs 7x by using MIG GPU compute instances instead of GPU devices
- No updates needed for application code; deployment code must be updated to use MIG rather than GPU resources
- Continue using microservice best practice, one server per app, or allow Triton to manage all MIG devices
- Ideal for batch-size 1 inferencing
- 56 inference jobs on a DGX A100: 8 * 7 * MIG 1g.5gb

Deployment on Kubernetes

Deployment on Kubernetes
Helm Chart
(Two-column YAML on the slide; values that are not legible in the source are shown as "…".)

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: {{ template "triton-inference-server.fullname" . }}
      namespace: {{ .Release.Namespace }}
      labels:
        app: {{ template "triton-inference-server.name" . }}
    spec:
      replicas: {{ .Values.replicaCount }}
      selector:
        matchLabels:
          release: {{ .Release.Name }}
      template:
        metadata:
          labels:
            app: {{ template "triton-inference-server.name" . }}
            release: {{ .Release.Name }}
        spec:
          containers:
            - name: {{ .Chart.Name }}
              image: …
              imagePullPolicy: …
              resources:
                limits: …
              args: ["tritonserver", "--model-store=…RepositoryPath…"]
              ports:
                - containerPort: 8000
                  name: http
                - containerPort: 8001
                  name: grpc
                - containerPort: 8002
                  name: metrics
              livenessProbe:
                httpGet: …

https:/ server

Triton on Azure Kubernetes Service
[Figure: a BERT chatbot client (Triton inference client Docker container on an Azure VM) sends HTTP requests on port 8000 to a K8s service in front of a Triton pod (Triton Inference Server Docker container) on AKS; the BERT model comes from a model repository in storage; metrics are exposed on port 8002]

Triton Inference Server on AKS
Autoscaling
[Figure: AKS cluster with a K8s service, Triton pods (Triton Inference Server Docker containers), metrics, and the HPA (Horizontal Pod Autoscaler)]

Questions and Answers with BERT
[Screenshot: demo page. Context: "Following their loss in the divisional round of the previous season's playoffs, the Denver Broncos underwent numerous coaching changes, including a … coach John Fox (who had won …", followed by question (Q:) and answer (A:) fields; the remainder is truncated in the source]

Kubernetes Support for MIG
Both strategies use the same pod skeleton; the resource line under "limits" is not legible in the source.

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-example
    spec:
      containers:
        - name: gpu-example
          image: nvidia/cuda:11.0-base
          resources:
            limits: …

"Single" strategy
- A100-SXM4-40GB exposed using traditional resource GPU types
- Good for large clusters (homogeneous nodes)
- Users don't need to learn new type syntax

"Mixed" strategy
- Exposed using new resource GPU types
- Good for smaller clusters (heterogeneous nodes)
- Users need to learn new type syntax

Integration with KFServing

Triton Integration with Kubeflow
What is Kubeflow?
- Open-source project to make ML/DL workflows on Kubernetes simple, portable, and scalable
- Customizable scripts and configuration files to deploy containers on their chosen environment

Problems it solves
- Easily set up an ML/DL stack or pipeline that can fit into the majority of enterprise datacenter and multi-cloud environments
How it helps Triton Inference Server
- Triton Inference Server is deployed as a component inside of a production workflow to:
  - Optimize GPU performance
  - Enable auto-scaling, traffic load balancing, and redundancy/failover via metrics

Kubeflow Serving (KFServing)
Overview
KFServing enables serverless inferencing on Kubernetes and provides performant, high-abstraction interfaces for common machine learning (ML) frameworks to solve production model serving use cases.
You can use KFServing to:
- Provide a Kubernetes Custom Resource Definition for serving ML models on arbitrary frameworks.
- Encapsulate the complexity of autoscaling, networking, health checking, and server configuration to bring cutting-edge serving features like GPU autoscaling, scale to zero, and canary rollouts to your ML deployments.
- Enable a simple, pluggable, and complete story for your production ML inference server by providing prediction, pre-processing, post-processing and explainability out of the box.

INFERENCE SERVER ARCHITECTURE: KUBEFLOW
[Figure: your training/pruning/validation flow dumps models into a model repository (persistent volume); multiple workloads (REC, IMG, ASR) reach a load balancer; a containerized inference service runs pre-processing, inference on GPU with NVIDIA Triton, and post-processing (CPU/GPU); a metrics service feeds an auto scaler. The legend distinguishes already existing components from those new from NVIDIA. Model formats: TensorRT, TensorFlow, C2/ONNX]

Architecture Overview
Inference Service Data Plane
[Figure: default endpoint with Transformer, Predictor, and Explainer components and their :predict and :explain routes]
Triton Inference Server First To Adopt KFServing V2 Protocol

KFServing: Interface
Inference Service

    apiVersion: "serving.kubeflow.org/v1alpha2"
    kind: "InferenceService"
    metadata:
      name: "triton-simple-string"
    spec:
      default:
        predictor:
          triton:
            storageUri: "gs://kfserving-samples/models/tensorrt"

Apply the CRD:

    $ kubectl apply -f triton.yaml

Expected output:

    inferenceservice.serving.kubeflow.org/triton-simple-string created

KFServing
Run a prediction
Uses the client at: https:/ example.html
1. Determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT

    SERVICE_HOSTNAME=$(kubectl get inferenceservice triton-simple-string -o jsonpath='{.status.url}' | cut -d "/" -f 3)

2. Check server status

    curl -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/api/status

3. Edit /etc/hosts to map the CLUSTER IP to the triton-simple-… hostname
4. Run the client

    docker run -e SERVICE_HOSTNAME=$SERVICE_HOSTNAME -it --rm --net=host nvcr.io/nvidia/tritonserver:20.11-py3-clientsdk
    ./build/simple_string_client -u $SERVICE_HOSTNAME

   Sample output:

    root@trantor:/workspace# ./build/simple_string_client -u triton-simple-…
    0 + 1 = 1

BERT Example
Extend KFServing and implement pre/post processing and prediction
- The preprocess method converts the paragraph and the question to BERT input with the help of the tokenizer.
- The predict method calls the Triton inference server Python API to communicate with the inference server over HTTP.
- The postprocess method converts the raw prediction to the answer with the probability.

    from typing import Dict

    import kfserving
    # tokenization and ProtocolType come from the BERT sample code and the
    # Triton Python client; their imports are not shown on the slide.

    class BertTransformer(kfserving.KFModel):
        def __init__(self, name: str, predictor_host: str):
            super().__init__(name)
            self.short_paragraph_text = "The Apollo ..."
            self.predictor_host = predictor_host
            self.tokenizer = tokenization.FullTokenizer(
                vocab_file="/mnt/models/vocab.txt", do_lower_case=True)
            self.model_name = "bert_tf_v2_large_fp16_128_v2"
            self.model_version = -1
            self.protocol = ProtocolType.from_str("http")

        def preprocess(self, inputs: Dict) -> Dict:
            ...  # body elided on the slide
            return self.features

        def predict(self, features: Dict) -> Dict:
            ...  # body elided on the slide
            return result

        def postprocess(self, result: Dict) -> Dict:
            ...  # body elided on the slide; the "prob" value is not legible
            return {"predictions": prediction, "prob": ...}

BERT Example
Create the Inference Service

    apiVersion: "serving.kubeflow.org/v1alpha2"
    kind: "InferenceService"
    metadata:
      name: "bert-large"
    spec:
      default:
        transformer:
          custom:
            container:
              name: kfserving-container
              image: gcr.io/kubeflow-ci/kfserving/bert-transformer:latest
              resources:
                limits:
                  cpu: "1"
                  memory: 1Gi
                requests:
                  cpu: "1"
                  memory: 1Gi
              command:
                - "python"
                - "bert_transformer"
              env:
                - name: STORAGE_URI
                  value: "gs://kfserving-samples/models/triton/bert-transformer"
        predictor:
          triton:
            resources:
              limits:
                cpu: "1"
                memory: 16Gi
              requests:
                cpu: "1"
                memory: 16Gi
            storageUri: "gs://kfserving-samples/models/triton/bert"

    kubectl apply -f bert.yaml

BERT Example
Run Inference
Contents of input.json:

    {"instances": ["What President is credited with the original notion of putting Americans in space?"]}

    MODEL_NAME=bert-large
    INPUT_PATH=@./input.json
    SERVICE_HOSTNAME=$(kubectl get inferenceservices bert-large -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    curl -v -H "Host: ${SERVICE_HOSTNAME}" -d $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict

Expected output:

    {"predictions": "John F. Kennedy", "prob": 77.91848979818604}

Metrics for Monitoring and Autoscaling

Triton Inference Server Metrics For Autoscaling
Before Triton Inference Server: 800 FPS
- One model per GPU
- Requests are steady across all models
- Utilization is low on all GPUs
Before Triton Inference Server: 5,000 FPS
- Spike in requests for blue model
- GPUs running blue model are being fully utilized
- Other GPUs remain underutilized

Triton Inference Server Metrics For Autoscaling
After Triton Inference Server: 5,000 FPS
- Load multiple models on every GPU
- Load is evenly distributed between all GPUs
After Triton Inference Server: 15,000 FPS
- Spike in requests for blue model
- Each GPU can run the blue model concurrently
Metrics to indicate time to scale up:
- GPU utilization
- Power usage
- Inference count
- Queue time
- Number of requests/sec

AVAILABLE METRICS

Category | Name | Use case | Granularity | Frequency
GPU utilization | Power usage | Proxy for load on the GPU | Per GPU | Per second
GPU utilization | Power limit | Maximum GPU power limit | Per GPU | Per second
GPU utilization | GPU utilization | GPU utilization rate (0.0-1.0) | Per GPU | Per second
GPU memory | GPU total memory | Total GPU memory, in bytes | Per GPU | Per second
GPU memory | GPU used memory | Used GPU memory, in bytes | Per GPU | Per second
Count (GPU & CPU) | Request count | Number of inference requests | Per model | Per request
Count (GPU & CPU) | Execution count | Number of model inference executions | Per model | Per request
Count (GPU & CPU) | Inference count | Number of inferences performed (one request counts as "batch size" inferences) | Per model | Per request
Latency (GPU & CPU) | Latency: request time | End-to-end inference request handling time | Per model | Per request
Latency (GPU & CPU) | Latency: compute time | Time a request spends executing the inference model (in the appropriate framework) | Per model | Per request
Latency (GPU & CPU) | Latency: queue time | Time a request spends waiting in the queue before being executed | Per model | Per request

Triton Metrics: Autoscaling
Horizontal Pod Autoscaler
The HPA controller operates on the ratio between the desired metric value and the current metric value. The following equation returns the number of desired replicas:

    R = ceil(CR * CV / DV)

where R is the number of replicas that Kubernetes needs to have, CR is the current number of replicas, CV is the current metric value, and DV is the desired metric value. When R is different from CR, the HPA increases or decreases the number of replicas by acting on the deployment (in our case the TRTIS deployment).
[Figure: Horizontal Pod Autoscaler scaling a Deployment from Pod 1 to Pod n]

Triton Metrics: Autoscaling
Custom Metrics
Here is the summary of what we need to deploy:
1. Prometheus operator and Prometheus
2. Horizontal Pod Autoscaler
3. Service Monitor
4. Custom Metrics
We want the HPA to perform autoscaling based on the following metric: the average time spent by each request in the queue,

    QT = (total queue time) / Req

where Req is the total number of requests. We need to express this equation using PromQL, the Prometheus query language, with the actual names of the metrics exposed by TRTIS:

    QT = delta(nv_inference_queue_duration_us[30s]) / delta(nv_inference_request_success[30s])

Triton Metrics: Autoscaling
Custom Metrics
(Adapter configuration; fields that are not legible in the source are shown as "…".)

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: …
      namespace: …
    data:
      config: |
        rules:
          - seriesQuery: …
            resources:
              overrides:
                namespace: {resource: "namespace"}
                …
            name:
              matches: …
              as: "avg_time_queue_ms"
            metricsQuery: avg(delta(nv_inference_…{…LabelMatchers…}[30s]) … / 1000) by (…)
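To make the two formulas above concrete, here is a small worked sketch in Python. It is an illustration only: the function names and the sample counter values are assumptions, and in a real cluster the averaging is done by the PromQL query and the replica count by the HPA controller itself.

```python
import math


def average_queue_time_ms(queue_us_start, queue_us_end, success_start, success_end):
    """Average queue time per request over a window, in milliseconds.

    Mirrors delta(queue_duration_us) / delta(request_success) / 1000, using two
    samples of each cumulative counter taken at the start and end of the window.
    """
    requests = success_end - success_start
    if requests <= 0:
        return 0.0  # no completed requests in the window
    return (queue_us_end - queue_us_start) / requests / 1000.0


def desired_replicas(current_replicas, current_value, desired_value):
    """R = ceil(CR * CV / DV), the HPA scaling rule quoted above."""
    return math.ceil(current_replicas * current_value / desired_value)


# Hypothetical counter samples taken 30 s apart.
cv = average_queue_time_ms(1_000_000, 4_000_000, 10_000, 11_000)
print(cv)                            # average queue time in ms
print(desired_replicas(2, cv, 2.0))  # replicas needed for a 2 ms target
```

With these sample values the queue time averages 3 ms per request, so two replicas with a 2 ms target scale out to three.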

Performance Analyzer: Optimization Guidance

Measuring Inference Performance
Triton includes a performance measurement tool called the perf analyzer (formerly "perf client"):
- Measures throughput and latency under varying client loads
- Real or synthetic input tensor data and client loads
- HTTP/REST or gRPC APIs
- Complete feature coverage: shared memory, stateless and stateful models, batching, etc.
- Command-line tool with spreadsheet template
- Generates charts to help visualize the throughput vs latency tradeoffs

Perf_Analyzer
On the GitHub repository and in the NGC Triton Client Container
- The perf_analyzer helps you determine the ideal model configuration which maximizes performance based on specific constraints.
- Throughput and latency are measured over a time window, and the measurements are repeated until they yield stable values.
- By default the perf_analyzer uses average latency to determine stability, but you can use a percentile latency to stabilize results based on that confidence level.

    $ perf_analyzer -m flower -u 127.0.0.1:50050 -i http --concurrency-range 1:50 -p 5000

    Settings
      Batch size: 1
      Measurement window: 5000 msec
      Latency limit: 0 msec
      Concurrency limit: 100 concurrent requests
      Stabilizing using average latency

    Request concurrency: 6
      Client:
        Request count: 8387
        Throughput: 1677.4 infer/sec
        Avg latency: 3575 usec (standard deviation 248 usec)
        p50 latency: 3578 usec
        p90 latency: 3626 usec
        p95 latency: 4161 usec
        p99 latency: 4218 usec
        Avg HTTP time: 3569 usec (send/recv 146 usec + response wait 3423 usec)
      Server:
        Inference count: 10366
        Execution count: 16866
        Successful request count: 10866
        Avg request latency: 2223 usec (… + compute infer 549 usec + compute output 12 usec)

Basic Optimization: Inference Schedulers
Default model scheduler:
- Processes 1 inference request at a time for each model
- If there are multiple models, each can have 1 inference request executing at any given time
- An inference request can be a batch (if the client creates the batch)
- Minimizes latency
[Figure: batch-1 request and batch-4 request pass through the default scheduler to the model backend and framework runtime]

Dynamic Batching Optimization
- Grouping requests into a single "batch" increases overall GPU throughput
- Processes multiple inference requests at the same time for a model
- Individual requests are batched and executed together
[Figure: Triton Inference Server with a dynamic batcher feeding the model backend and its runtime contexts]

Dynamic Batchers
- Each model's scheduler is configured independently
- Dynamic batcher controls:
  - Preferred batch sizes
  - Maximum delay to hold an inference request to form a larger batch
  - Prioritization and timeout

    dynamic_batching { preferred_batch_size: [ 4, 8 ] }

Dynamic Batcher Results
Deep-recommender TensorRT model
[Chart: dynamic batch size 1-32; y-axis ticks from 5000 to 15000, x-axis p95 latency (ms)]

Dynamic Batching
2.5X Faster Inferences/Second at a 50ms End-to-End Server Latency Threshold
- Triton Inference Server groups inference requests based on customer-defined metrics for optimal performance
- The customer defines 1) batch size (required) and 2) latency requirements (optional)
- Example: no dynamic batching (batch size 1 & 8) vs dynamic batching
[Chart: Static vs Dynamic Batching (T4, TRT ResNet50 FP16, Instance 1)]
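The batching controls listed above (preferred batch sizes and a maximum delay) can be illustrated with a toy model. The sketch below is not Triton's scheduler: form_batches is a hypothetical helper that dispatches a batch either when it reaches the largest preferred size or when a new request arrives after the oldest pending one has already waited longer than the maximum delay. It ignores smaller preferred sizes and model instance availability.

```python
from typing import List, Sequence


def form_batches(arrivals_us: Sequence[int],
                 preferred_sizes: Sequence[int],
                 max_delay_us: int) -> List[List[int]]:
    """Group request indices into batches (toy model of a dynamic batcher)."""
    target = max(preferred_sizes)       # simplification: only the largest size is used
    batches: List[List[int]] = []
    pending: List[int] = []
    for idx, t in enumerate(arrivals_us):
        # The oldest pending request has exceeded the delay budget: flush first.
        if pending and t - arrivals_us[pending[0]] > max_delay_us:
            batches.append(pending)
            pending = []
        pending.append(idx)
        # The batch reached the target size: dispatch immediately.
        if len(pending) == target:
            batches.append(pending)
            pending = []
    if pending:                         # flush whatever is left at the end
        batches.append(pending)
    return batches


# Hypothetical arrival times in microseconds: a burst, a short pause, a straggler.
arrivals = [0, 10, 20, 30, 500, 510, 2000]
batches = form_batches(arrivals, preferred_sizes=[4, 8], max_delay_us=100)
print(batches)
```

The burst of four requests is grouped into one batch because the next arrival comes after the delay budget expired, which is the throughput-versus-latency trade that the maximum delay setting controls.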

Basic Optimization: Concurrent Model Executions
- By default Triton creates 1 copy of each model (on each available GPU)
- Each copy is known as an instance of the model
- Inference requests are scheduled independently to each model instance
- Using more than 1 instance increases throughput and reduces latency if:
  - The GPU has sufficient memory to hold multiple copies of the model
  - The GPU has sufficient compute to execute multiple inferences simultaneously
  - The GPU / PCIe has sufficient bandwidth for simultaneous inferences
- Prioritization and timeout

    instance_group [ { count: 2 } ]

Concurrent Model Execution: ResNet50
4x Better Performance and Improved GPU Utilization Through Multiple Model Concurrency
Common Scenario 1: one API using multiple copies of the same model on a GPU.
Example: 12 instances of TRT FP16 ResNet50 (each model takes 1.33GB of GPU memory) are loaded onto the GPU and can run concurrently on a 16GB V100 GPU. 14 concurrent inference requests happen: each model instance fulfils one request simultaneously, and 2 are queued in the per-model scheduler queues in Triton Inference Server to execute after the 12 requests finish. With this configuration, 2832 inferences per second at 33.94 ms with batch size 8 on each inference server instance is achieved.
[Figure: timeline of Triton Inference Server on a V100 16GB GPU; ResNet50 inference requests are dispatched to 12 CUDA streams, RN50 Instance 1 to Instance 12]
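The numbers in this example follow from simple arithmetic, sketched below in Python. The helper names are illustrative assumptions, and the memory estimate deliberately counts only the per-instance model footprint quoted on the slide, ignoring any other GPU memory overhead.

```python
import math


def max_instances(gpu_memory_gb: float, model_memory_gb: float) -> int:
    """Upper bound on model copies that fit in GPU memory (footprint only)."""
    return math.floor(gpu_memory_gb / model_memory_gb)


def split_requests(concurrent_requests: int, instances: int):
    """Return (running, queued): one request per instance, the rest wait in the queue."""
    running = min(concurrent_requests, instances)
    return running, concurrent_requests - running


instances = max_instances(16, 1.33)        # 16GB V100, 1.33GB per ResNet50 instance
running, queued = split_requests(14, instances)
print(instances, running, queued)
```

This reproduces the slide's figures: 12 instances fit, 12 of the 14 concurrent requests run at once, and 2 wait in the scheduler queue.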

Multiple Model Instances Increases Throughput
Deep-recommender TensorRT model
[Chart: dynamic batch size 1-32, 1 instance vs 2 instances; y-axis ticks from 5000 to 20000, x-axis p95 latency (ms)]

Customer Case Studies

KINGSOFT CLOUD ADOPTS NVIDIA TRITON INFERENCE SERVER TO MAXIMIZE PERFORMANCE AT SCALE
- 15+ online AI computer vision services using Triton for inference serving
- 50% higher QPS per GPU with Triton
- 4-5x higher QPS with Triton + TensorRT/TVM
"Besides the increase of QPS and latency, we can now smoothly shift our service from the offline mode (which cares more about throughput) to online mode (which cares more about latency) thanks to Triton's scheduling + batching and TensorRT's dynamic batch support."
Kingsoft Cloud

TENCENT YOUTU INTEGRATES TNN BACKEND INTO NVIDIA TRITON INFERENCE SERVER TO STANDARDIZE INFERENCE AT SCALE
"In order to standardize inferencing across Tencent, Tencent Youtu developed a new open source high performing framework called TNN. Next, they chose Triton Inference Server for inference serving because of its product maturity and dynamic batching & concurrent model execution capabilities. Tencent easily customized Triton by integrating TNN as a Triton custom backend. TNN and Triton together help achieve standardized high performance inferencing for all developers building AI applications."
Tencent YouTu Lab

UNIVERSAL INFERENCE ENGINE
Naver is the #1 search engine and internet services company in South Korea. They use deep learning, built with multiple frameworks, to enrich and diversify query results.

Naver uses Triton Inference Server to accelerate the deployment of models in production. The platform supports multiple frameworks, batch and real-time inferencing, and inferencing on GPUs and CPUs. It helps Naver roll out new AI-based services faster and with lower development costs.
NAVER

ANNOUNCING
MICROSOFT ADOPTS NVIDIA AI TO CREATE SMART EXPERIENCES IN MICROSOFT OFFICE
Correct Grammar | Q&A | Predict Text
- 200 ms inference response for SOTA AI
- 1/3 lower cost than CPU
- Azure ML & ONNX Runtime with Triton for inference serving
- Half a trillion queries per year for grammar
Link to Announcement | Microsoft Blog

TRITON CUSTOMER ADOPTION

Tencent Youtu
- Use case: computer vision use cases, facial recognition, and biometrics
- Workflow: integrated their highly performant TNN framework into Triton as a custom backend
- Outcome: standardized high performance inferencing for all AI applications

Kingsoft Cloud
- Use case: facial recognition and human attributes recognition
- Workflow: real-time online inferencing on Triton in the cloud
- Outcome: 50% higher QPS per GPU with Triton; 4-5x higher QPS with Triton + TensorRT/TVM

Microsoft Azure Machine Learning
- Use case: any ML/DL workflow
- Workflow: real-time inference serving on Triton with end-to-end Python server
- Outcome: increased throughput by ~7x compared to Flask

Microsoft Office
- Use case: NLP models (BERT, Turing-NLG, etc.)
- Workflow: real-time grammar checker running …
- Outcome: slashed costs by ~70% and achieved a throughput of 450 queries per second on a single V100 GPU, with more than half a trillion queries a year

American Express
- Use case: fraud detection on 8B credit card transactions
- Workflow: used Triton to deploy a TensorRT-optimized Gated Recurrent Unit model to analyze tens of millions of daily transactions
- Outcome: can operate within a 2 ms latency budget (a 50x improvement compared to CPUs that could not meet the latency requirement)

Naver
- Use case: search recommendations and image classification
- Workflow: models in multiple frameworks (TensorFlow, PyTorch, Caffe, and TensorRT) on CPU and GPU
- Outcome: a single inference platform that allowed for faster rollout of new DL models from multiple frameworks

SPIL
- Use case: defect detection on 30,000 wafer images per day
- Workflow: used Triton to deploy and manage DenseNet, Autoencoder, and UNET models in TensorRT
- Outcome: can detect 100% of defects with 10% fals… (… 7x improvement), and scale to 100 diff… …thout changes to serving infrastruct… (truncated in the source)

Try Triton Inference Server
Download from GitHub or from Docker Registry
- Triton Inference Server GitHub: https:/
- NVIDIA GPU Cloud (NGC), Docker container: https:/ /
- Documentation: https:/ /

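This document describes how NVIDIA Triton Inference Server is used for deep learning inference deployment. Triton Inference Server is an open-source inference server that supports multiple deep learning frameworks and can be deployed on GPUs and CPUs to maximize real-time inference performance on both. The document covers Triton's architecture, features, deployment options, and integration with Kubernetes, and explains how to use Triton for performance analysis and model optimization. It also presents several customer case studies, including Tencent Youtu, Kingsoft Cloud, Naver, and Microsoft, showing Triton's results in practice.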

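How can Triton be used to optimize large-scale deployment of deep learning inference? How does Triton support multiple frameworks and concurrent execution of multiple models? How does Triton help enterprises standardize and efficiently deploy AI applications?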
