NVIDIA: LLM Inference and Serving Deployment - Technical Summary and Future Outlook
Lu Xianglong (盧翔龍), Senior Solutions Architect, NVIDIA

Contents
01. Large model technology trends
02. TensorRT-LLM
03. FP8
04. Triton Inference Server for LLM

Large Model Technology Trends

Production Language Apps
- Increasing need for deep learning in language applications: chat, translation, summarization, search, generation, etc.
- Models are important for correct results: model accuracy directly correlates to helpfulness for users.
- "Online" deployments require ensuring a great experience with applications.
- Multi-functional, accurate models are large, which makes them slow during inference; deploying massive models for real-time applications makes cost-effective deployments challenging.

Large Language Model Ecosystem
- Llama, Falcon, StarCoder, ChatGLM, MPT, and more.
- 70-200 billion parameters or more.
- Rapid evolution makes optimization challenging.
- [Figure: LLM family tree (LLaMA, GPT, Falcon, StarCoder, ChatGLM, MPT). Image from Mooler0410/LLMsPracticalGuide. Yang, J., Jin, H., Tang, R., Han, X., Feng, Q., Jiang, H., Hu, X. (2023). Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond. arXiv cs.CL. http://arxiv.org/abs/2304.13712]
- We need a performant, robust, and extensible solution for cost-effective, real-time LLM deployments.
TensorRT-LLM

TensorRT-LLM: Optimizing LLM Inference
SoTA performance for large language models in production deployments.
- Challenges: LLM performance is crucial for real-time, cost-effective production deployments. Rapid evolution in the LLM ecosystem, with new models and techniques released regularly, requires a performant, flexible solution for optimizing models.
- TensorRT-LLM is an open-source library for optimizing inference performance of the latest large language models on NVIDIA GPUs. It is built on TensorRT, with a simple Python API for defining, optimizing, and executing LLMs for inference in production.
- Add new operators or models in Python to quickly support new LLMs with optimized performance.
- Leverage TensorRT compilation and hand-tuned kernels developed by GPU experts.
- Maximize throughput and GPU utilization through new scheduling techniques for LLMs.

    # define a new activation
    def silu(input: Tensor) -> Tensor:
        return input * sigmoid(input)

    # implement models like in DL frameworks
    class LlamaModel(Module):
        def __init__(self):
            self.layers = ModuleList()

        def forward(self):
            hidden = self.embedding()
            for layer in self.layers:
                hidden = layer(hidden)
            return hidden

[Charts: H100 with TensorRT-LLM vs. A100 (roughly 4.6x performance, 3x better TCO); static vs. inflight batching (average latency and cost). For technical discussion only.]

TensorRT-LLM Available Now!
- TensorRT-LLM is live: all source provided under Apache 2.0, with model examples, quantization toolkits, and more.
- Shipped in the Triton 23.10+ NGC container; coming soon to NeMo Framework Inference.
- Get it on GitHub and NGC; coming soon to NeMo.
[Chart: GitHub clones, views, and stars over the first month of release.]

TensorRT-LLM Usage
Use pre-built models, or optimize new ones.
- FT kernels: e.g. fMHA, layerNorm, activations, etc.
- Built on top of TensorRT primitives; pre-built models, e.g. GPT, LLaMa, BERT, etc.
- Leverages NCCL plugins for multi-device communication.
- Pre-segmented graphs in pre-built models; users can manually segment custom models; the future may allow automatic segmentation across GPUs.
- Pre/post processing includes tokenization and sampling (e.g. beam search).
- [Architecture diagram: LLM frameworks (NeMo, PyTorch, JAX) -> pre-built models (GPT, LLaMa, transformer) or custom models (trt_llm.layers.*: attention, mlp) -> TensorRT-LLM backend (TensorRT primitives, FT kernels, NCCL comm., pre/post processing) -> TensorRT-LLM runtime (TensorRT runtime, C++/Python runtime).]

TensorRT-LLM Usage: Create, Build, Execute
Load pre-built models or define them via the TensorRT-LLM Python APIs, compile them to optimized implementations via TensorRT, save the result as a serialized engine, and execute it in Python, C++, or Triton. A sketch of this flow follows below.
0. Trained model in a framework: NeMo, Hugging Face, or other DL frameworks.
1. Model initialization: load an example model, or create one via the Python APIs.
2. Engine building: the model is optimized via TensorRT and custom kernels into a TensorRT-LLM engine (TRT engine plus plugins).
3. Execution: load and execute engines in Python, C++, or Triton.
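As an illustration of the create/build/execute flow, here is a minimal sketch using the high-level Python API that recent TensorRT-LLM releases provide. The exact entry points, argument names, and the Llama checkpoint path are assumptions that vary by version (earlier releases instead used per-model build and run scripts); treat this as a sketch, not a reference.

    # Sketch: create (load a trained HF checkpoint), build (TensorRT engine is
    # compiled under the hood), execute (run generation from Python).
    from tensorrt_llm import LLM, SamplingParams

    # 0./1./2. Convert the checkpoint and build a serialized TensorRT engine.
    llm = LLM(model="meta-llama/Llama-2-7b-hf")

    # 3. Execute the engine in Python.
    params = SamplingParams(max_tokens=64, temperature=0.8)
    for output in llm.generate(["Explain paged KV caching in one sentence."], params):
        print(output.outputs[0].text)

The same serialized engine can also be loaded from the C++ runtime or served through Triton, which is the path the later slides cover.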
Multi-GPU, Multi-Node
- Sharding models across GPUs and across nodes: multi-GPU, multi-node.
- Parallelism modes: no parallelism, tensor parallel, pipeline parallel (a conceptual sketch of tensor parallelism follows below).
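A NumPy-only illustration of the tensor-parallel idea (Megatron-style column sharding of one linear layer); this is conceptual and not TensorRT-LLM code, and the all-gather is mimicked by a concatenate.

    # Conceptual tensor parallelism: split a weight matrix column-wise across two
    # "GPUs", compute each shard independently, then gather the partial outputs.
    import numpy as np

    x = np.random.randn(4, 1024)          # activations: [batch, hidden]
    w = np.random.randn(1024, 4096)       # full weight:  [hidden, ffn]

    w0, w1 = np.split(w, 2, axis=1)       # shard columns across "GPU 0" and "GPU 1"
    y0, y1 = x @ w0, x @ w1               # each shard is computed independently
    y = np.concatenate([y0, y1], axis=1)  # all-gather of the partial outputs

    assert np.allclose(y, x @ w)          # identical to the single-GPU result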
Optimized Attention
- Custom implementations for attention, similar to FlashAttention v2.
- Supports multi-head, grouped-query, and multi-query attention over queries, keys, and values; a shape-level sketch follows below.
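A shape-level illustration of how multi-head (MHA), grouped-query (GQA), and multi-query (MQA) attention differ: queries keep the full head count while keys/values use fewer heads that are shared by groups of query heads, which is what shrinks the KV cache. The head counts below are illustrative.

    # MHA vs. GQA vs. MQA: only the number of KV heads changes.
    import numpy as np

    batch, seq, head_dim = 1, 8, 64
    n_q_heads = 32

    for n_kv_heads in (32, 8, 1):          # 32 = MHA, 8 = GQA, 1 = MQA
        q = np.zeros((batch, n_q_heads, seq, head_dim))
        k = np.zeros((batch, n_kv_heads, seq, head_dim))
        # Each KV head serves n_q_heads // n_kv_heads query heads; repeating it
        # restores the per-query-head layout used by the attention math.
        k_expanded = np.repeat(k, n_q_heads // n_kv_heads, axis=1)
        assert k_expanded.shape == q.shape
        print(f"kv_heads={n_kv_heads:2d}  KV-cache size relative to MHA: {n_kv_heads / 32:.3f}")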
KV Cache Optimizations: Paged and Quantized KV Cache
- Paged KV cache stores keys and values in non-contiguous memory space, allocates memory on demand, and reduces the memory consumption of the KV cache.
- Quantized KV cache reduces KV cache elements from 16 bits to 8 bits (or less), reducing memory traffic and improving performance; INT8 and FP8 KV caches are supported.
- [Diagram: traditional KV caching pre-allocates contiguous blocks 0-3, while paged KV caching allocates blocks on demand and leaves the rest free. Example KV cache contents: "TensorRT-LLM optimizes inference on NVIDIA GPUs". A toy block allocator follows below.]
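A minimal sketch of the paged-KV-cache idea: the cache is carved into fixed-size blocks, each sequence holds a block table, and a new block is taken from the free pool only when the previous one fills up. This is an illustration of the concept, not TensorRT-LLM's actual implementation; the block size and class names are made up.

    BLOCK_TOKENS = 4                                     # tokens per block (toy value)

    class PagedKVCache:
        def __init__(self, num_blocks: int):
            self.free_blocks = list(range(num_blocks))   # physical block ids
            self.block_tables = {}                       # seq_id -> [block ids]
            self.lengths = {}                            # seq_id -> tokens written

        def append_token(self, seq_id: int) -> None:
            table = self.block_tables.setdefault(seq_id, [])
            used = self.lengths.get(seq_id, 0)
            if used % BLOCK_TOKENS == 0:                 # current block full: allocate on demand
                table.append(self.free_blocks.pop())
            self.lengths[seq_id] = used + 1

        def release(self, seq_id: int) -> None:
            # When a request finishes (EoS), its blocks return to the free pool.
            self.free_blocks.extend(self.block_tables.pop(seq_id, []))
            self.lengths.pop(seq_id, None)

    cache = PagedKVCache(num_blocks=8)
    for _ in range(6):                                   # 6 tokens occupy ceil(6/4) = 2 blocks
        cache.append_token(seq_id=0)
    print(cache.block_tables[0], len(cache.free_blocks))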
Inflight Batching
Maximizing GPU utilization during LLM serving.
- Evicts requests after EoS and inserts new requests in their place, instead of waiting for the whole static batch to finish.
- [Diagram: static batching vs. inflight batching over iterations 1-9 for requests R1-R9, showing context and generation phases, EoS, and no-op slots. A toy scheduling loop follows below.]
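A toy scheduling loop showing the inflight (continuous) batching idea from the diagram: the batch is refilled every iteration as soon as a request emits EoS. Request completion is randomized here for illustration; this is not the real scheduler.

    # Inflight batching sketch: evict on EoS, refill from the waiting pool each step.
    import random
    random.seed(0)

    waiting = [f"R{i}" for i in range(1, 10)]   # request waiting pool
    active, max_batch = [], 4

    for step in range(1, 12):
        # Fill every free slot from the waiting pool ("insert a new request").
        while len(active) < max_batch and waiting:
            active.append(waiting.pop(0))
        # One generation step per active request; some requests emit EoS.
        finished = [r for r in active if random.random() < 0.3]
        active = [r for r in active if r not in finished]   # evict after EoS
        print(f"iter {step:2d}  active={active}  finished={finished}")
        if not active and not waiting:
            break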
Quantization
Supported precisions and methods: FP8, INT8/INT4 weight-only, INT8 SmoothQuant, AWQ, GPTQ.
- Support varies by model.
- Improves performance and allows larger models per GPU; a weight-only quantization sketch follows below.
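A simplified sketch of per-channel INT8 weight-only quantization (the "INT8 weight only" option above): weights are stored as int8 plus one floating-point scale per output channel and dequantized at GEMM time. Real kernels fuse the dequantization into the GEMM; this is only the arithmetic.

    import numpy as np

    w = np.random.randn(1024, 4096).astype(np.float32)          # full-precision weight [in, out]

    scale = np.abs(w).max(axis=0) / 127.0                        # one scale per output column
    w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

    x = np.random.randn(2, 1024).astype(np.float32)
    y_ref = x @ w                                                # full-precision reference
    y_q = (x @ w_int8.astype(np.float32)) * scale                # dequantize after the GEMM

    rel_err = np.abs(y_q - y_ref).mean() / np.abs(y_ref).mean()
    print(f"int8 weight-only relative error ~ {rel_err:.4f}")    # small error, ~4x smaller weights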
TensorRT-LLM Performance
Advanced techniques can further improve TensorRT-LLM performance and memory consumption.
[Chart: relative throughput and memory improvement over FP16 for KV-cache quantization, INT8, INT8 + KV quant, INT8 SmoothQuant, INT8 SmoothQuant + KV quant, INT4, and INT4 + KV quant; values range from roughly 0.3x to 1.6x.]
Ongoing & Future Plans (TensorRT-LLM)
- Feature enhancement; community contributions are highly welcome.
- MoE.
- Multi-modal: BLIPv2 and more.
- C++ runtime.

Triton Inference Server for LLM

Triton Inference Server
Open-source inference serving software for fast, scalable, simplified inference serving.
- Optimized for real-time and batch requests; audio and video streaming; model ensembles.
- Multiple DL/ML frameworks, e.g. TensorFlow, PyTorch, TensorRT, XGBoost, ONNX, Python, and more.
- x86 CPU | Arm CPU | NVIDIA GPUs | MIG; Linux | Windows | virtualization; public cloud, data center, and edge/embedded (Jetson).
- Optimized for high GPU/CPU utilization, throughput, and low latency.
- Multi-GPU, multi-node inference for large language models.
- Microservice in Kubernetes and KServe; available across all major cloud AI platforms; integration with major MLOps solutions; large-scale model orchestration.
LLM Serving
Serving packs models into an inference service.
- Clients: multiple client applications use the Python/C++ client libraries to send queries and receive results over standard HTTP/gRPC, or use the in-process API to integrate Triton directly into the client application via the C or Java API.
- Kubernetes and Prometheus integration; utilization, throughput, and latency metrics.
- Model repository with many active models; model analyzer; model orchestration; CPU and GPU execution.
- Multiple GPU and CPU backends, plus custom backends.
- Per-model scheduler queues; flexible model loading (all or selective); dynamic batching (real time, batch, stream).
A client-side sketch follows below.
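For concreteness, here is what a client call against a Triton server running the TensorRT-LLM backend can look like over HTTP. The tensor names ("text_input", "max_tokens", "text_output") and the "ensemble" model name follow the tensorrtllm_backend examples but are assumptions for your deployment; check the model's config.pbtxt. Requires `pip install tritonclient[http]`.

    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Prompt as a BYTES tensor of shape [1, 1].
    text = httpclient.InferInput("text_input", [1, 1], "BYTES")
    text.set_data_from_numpy(np.array([["What is TensorRT-LLM?"]], dtype=object))

    # Generation length as an INT32 tensor.
    max_tokens = httpclient.InferInput("max_tokens", [1, 1], "INT32")
    max_tokens.set_data_from_numpy(np.array([[64]], dtype=np.int32))

    result = client.infer(model_name="ensemble", inputs=[text, max_tokens])
    print(result.as_numpy("text_output"))

The same request can also be sent over gRPC (tritonclient.grpc) or issued through the in-process API mentioned above.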
Triton TensorRT-LLM Backend
- The Triton TensorRT-LLM backend is now GA. Its goal is to let you serve TensorRT-LLM models with Triton Inference Server.
- The directory contains the C++ implementation of the backend, supporting inflight batching, paged attention, and more.
- Communication between GPUs and nodes uses MPI and NCCL plugins.
- [Diagram: Node 0 and Node 1 with working threads; requests are broadcast from a request waiting pool into the Triton tensorrtllm backend and the TensorRT-LLM runtime, and output tensors are returned.]

Ongoing & Future Plans (Triton backend)
- Batch size and sequence length after IFB; TP size, beam width, etc. (GEMM peak FLOPS).
- vLLM backend and more: bring in more flexibility, work with any backend.
FP8

[Slides: FP8 vs. INT8 accuracy comparison; FP8 LLM pre-training convergence test, from the GTC announcement.]

Recap and Forecast
- MoE: does MoE beat dense LLMs to save FLOPS?
- Quantization: w4a-fp8 (4-bit weights with FP8 activations), structured sparsity, etc.
- Token-generation "stuck" issue: can we do better in scheduling?
- Super long context: can we handle infinite-length context input well?
- What is the next Llama? We are ready to accelerate it.

NVIDIA doesn't just provide the shovels.

THANKS