盧翔龍 (NVIDIA): LLM Inference and Serving Deployment, a Technology Summary and Future Outlook (掘金)


NVIDIA: LLM Inference and Serving Deployment, a Technology Summary and Future Outlook
盧翔龍 (Lu Xianglong), Senior Solutions Architect, NVIDIA

Contents
01. Large model technology trends
02. TensorRT-LLM
03. FP8
04. Triton Inference Server for LLM

Production Language Apps
- Increasing need for deep learning in language applications: chat, translation, summarization, search, generation, etc.
- Accurate models are important for correct results; model accuracy directly correlates with helpfulness for users.
- "Online" deployments require ensuring a great experience with applications.
- Multi-functional, accurate models are large, which makes them slow during inference, and deploying massive models for real-time applications makes cost-effective deployment challenging.

Large Language Model Ecosystem
- Llama, Falcon, Starcoder, ChatGLM, MPT, and more, at 70-200 billion parameters or larger.
- Rapid evolution makes optimization challenging.
- [Figure: evolution of the LLaMa, GPT, Falcon, Starcoder, ChatGLM, and MPT families; image from Mooler0410/LLMsPracticalGuide.]
- Yang, J., Jin, H., Tang, R., Han, X., Feng, Q., Jiang, H., Hu, X. (2023). Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond. arXiv cs.CL. http://arxiv.org/abs/2304.13712
- We need a performant, robust, and extensible solution for cost-effective, real-time LLM deployments.

TensorRT-LLM: Optimizing LLM Inference
- SoTA performance for large language models in production deployments.
- Challenges: LLM performance is crucial for real-time, cost-effective production deployments, and rapid evolution in the LLM ecosystem, with new models and techniques released regularly, requires a performant, flexible solution for optimizing models.
- TensorRT-LLM is an open-source library to optimize inference performance of the latest large language models on NVIDIA GPUs. It is built on TensorRT, with a simple Python API for defining, optimizing, and executing LLMs for inference in production.
- Add new operators or models in Python to quickly support new LLMs with optimized performance.
- Leverage TensorRT compilation and hand-tuned kernels developed by GPU experts.
- Maximize throughput and GPU utilization through new scheduling techniques for LLMs.

    # define a new activation
    def silu(input: Tensor) -> Tensor:
        return input * sigmoid(input)

    # implement models like in DL frameworks
    class LlamaModel(Module):
        def __init__(self):
            self.layers = ModuleList()

        def forward(self, input_ids):
            hidden = self.embedding(input_ids)
            for layer in self.layers:
                hidden = layer(hidden)
            return hidden

- [Charts: A100 vs. H100 with TRT-LLM, 4.6x performance and 3x TCO improvement; static vs. inflight batching, average latency and cost. For technical discussion only.]

TensorRT-LLM Available Now!
- TensorRT-LLM is live; all source is provided under Apache 2.0.
- Model examples, quantization toolkits, and more.
- In the Triton 23.10+ NGC container; coming soon to NeMo Framework Inference.
- Get it on GitHub, NGC, and soon in NeMo.
- [Chart: clones, views, and stars during the first month of release.]

TensorRT-LLM Usage
- Use pre-built models, or optimize new ones.
- Built on top of TensorRT primitives and FT kernels (e.g. fMHA, layerNorm, activations).
- Pre-built models (e.g. GPT, LLaMa, BERT) plus custom models defined with trt_llm.layers.* (attention, mlp, ...).
- Leverages NCCL plugins for multi-device communication.
- Graphs are pre-segmented in pre-built models; users can manually segment custom models, and the future may allow automatic segmentation across GPUs.
- Pre/post-processing includes tokenization and sampling (e.g. beam search).
- [Stack diagram: LLM frameworks (NeMo, PyTorch, JAX) feed pre-built models (GPT, LLaMa, transformer) or custom models into the TensorRT-LLM backend (TensorRT primitives, FT kernels, NCCL communication, pre/post-processing), executed by the TensorRT-LLM runtime on top of the TensorRT runtime with C++/Python runtimes.]

TensorRT-LLM Usage: Create, Build, Execute
- Load pre-built models or define models via the TensorRT-LLM Python APIs, compile them to optimized implementations via TensorRT, save the result as a serialized engine, and execute in Python, C++, or Triton.
- 0. Trained model in a framework: NeMo, HuggingFace, or other DL frameworks.
- 1. Model initialization: load an example model, or create one via the Python APIs.
- 2. Engine building: the model is optimized via TensorRT and custom kernels into a TensorRT-LLM engine (TRT engine plus plugins).
- 3. Execution: load and execute engines in Python, C++, or Triton.
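To make the create/build/execute flow concrete, here is a minimal execution sketch in Python. It assumes an engine has already been built from a checkpoint with the example build scripts (steps 0-2), and that the runtime exposes a ModelRunner.from_dir()/generate() API as in the public TensorRT-LLM examples; the checkpoint and engine paths are placeholders, and the exact names vary by release.

    # Step 3 (execution): load a serialized TensorRT-LLM engine and generate tokens.
    # ModelRunner.from_dir/generate and the paths below are assumptions drawn from
    # the public examples and may differ between TensorRT-LLM versions.
    from transformers import AutoTokenizer
    from tensorrt_llm.runtime import ModelRunner

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed source checkpoint
    runner = ModelRunner.from_dir(engine_dir="./llama_engine")             # assumed engine directory

    input_ids = tokenizer("TensorRT-LLM optimizes inference", return_tensors="pt").input_ids
    output_ids = runner.generate(input_ids,
                                 max_new_tokens=32,
                                 end_id=tokenizer.eos_token_id,
                                 pad_id=tokenizer.eos_token_id)
    # output_ids is [batch, beams, sequence]; decode the first beam of the first request
    print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))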

Multi-GPU, Multi-Node
- Sharding models across GPUs and nodes: no parallelism, tensor parallelism, or pipeline parallelism.

Optimized Attention
- Custom implementations for attention, similar to FlashAttention v2.
- Supports multi-head, grouped-query, and multi-query attention over queries, keys, and values.
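To make the difference between multi-head, grouped-query, and multi-query attention concrete, below is a small NumPy sketch (illustrative only, not TensorRT-LLM's kernels) in which several query heads share one key/value head; the head counts and shapes are made up for the example.

    import numpy as np

    def grouped_query_attention(q, k, v):
        # q: [num_q_heads, seq, head_dim]; k, v: [num_kv_heads, seq, head_dim]
        num_q_heads, seq_len, head_dim = q.shape
        group = num_q_heads // k.shape[0]            # query heads per KV head
        out = np.empty_like(q)
        for h in range(num_q_heads):
            kv = h // group                          # query head h reads shared KV head kv
            scores = q[h] @ k[kv].T / np.sqrt(head_dim)
            weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
            weights /= weights.sum(axis=-1, keepdims=True)
            out[h] = weights @ v[kv]
        return out

    # 8 query heads sharing 2 KV heads is grouped-query attention; 8 KV heads would be
    # classic multi-head attention, and 1 KV head would be multi-query attention.
    q = np.random.randn(8, 16, 64)
    k = np.random.randn(2, 16, 64)
    v = np.random.randn(2, 16, 64)
    print(grouped_query_attention(q, k, v).shape)    # (8, 16, 64)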

KV Cache Optimizations: Paged and Quantized KV Cache
- Stores keys and values in non-contiguous memory space, allocating memory on demand, which reduces the memory consumption of the KV cache.
- Quantization reduces KV cache elements from 16 bits to 8 bits (or less), reducing memory transfers and improving performance; INT8/FP8 KV caches are supported.
- [Diagram: traditional KV caching pre-allocates contiguous blocks per sequence, while the paged KV cache maps the example contents "TensorRT-LLM optimizes inference on NVIDIA GPUs" onto blocks allocated on demand and leaves the remaining blocks free.]
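The paging idea itself can be mimicked in a few lines of plain Python. The toy allocator below (not the actual TensorRT-LLM implementation) grows each sequence's KV storage on demand in fixed-size, non-contiguous blocks drawn from a shared pool, as described above; the block and pool sizes are arbitrary.

    # Toy paged KV cache: tokens are appended per sequence, and a new fixed-size
    # block is taken from a shared free list only when the previous block fills up.
    BLOCK_TOKENS = 4          # tokens per block (real systems use larger blocks)

    class PagedKVCache:
        def __init__(self, num_blocks):
            self.free_blocks = list(range(num_blocks))   # shared pool of physical blocks
            self.block_table = {}                        # seq_id -> list of block ids
            self.num_tokens = {}                         # seq_id -> tokens stored

        def append_token(self, seq_id):
            blocks = self.block_table.setdefault(seq_id, [])
            used = self.num_tokens.get(seq_id, 0)
            if used % BLOCK_TOKENS == 0:                 # current block full (or first token)
                blocks.append(self.free_blocks.pop())    # allocate on demand
            self.num_tokens[seq_id] = used + 1

        def release(self, seq_id):
            # return all blocks of a finished sequence to the pool
            self.free_blocks.extend(self.block_table.pop(seq_id, []))
            self.num_tokens.pop(seq_id, None)

    cache = PagedKVCache(num_blocks=8)
    for _ in range(6):
        cache.append_token("request-0")                  # 6 tokens -> 2 blocks
    print(cache.block_table)                             # e.g. {'request-0': [7, 6]}
    cache.release("request-0")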

Inflight Batching
- Maximizes GPU utilization during LLM serving.
- Evicts requests after EoS and inserts a new request into the freed batch slot.
- [Diagram: with static batching, requests R1-R4 occupy the batch elements and finished slots sit in no-op iterations until the whole batch completes before R5-R8 start; with inflight batching, each slot switches to the next request (through R9) at the iteration after the previous one reaches EoS, interleaving context and generation phases.]
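A toy scheduler, again plain Python rather than the TensorRT-LLM runtime, illustrates why inflight batching keeps the GPU busy: as soon as a request reaches EoS, its batch slot is refilled from the waiting pool at the next iteration instead of idling until the whole static batch finishes. The request lengths and slot count are invented for the example.

    # Toy inflight batching: 4 batch slots, requests of different lengths.
    # A finished request is evicted and a waiting request inserted each iteration.
    from collections import deque

    waiting = deque([("R%d" % i, length) for i, length in
                     enumerate([3, 5, 2, 6, 4, 3, 2, 5], start=1)])
    slots = [None] * 4                 # each slot holds (request_id, tokens_remaining) or None
    iteration = 0

    while any(slots) or waiting:
        # fill empty slots from the waiting pool (the "insert" half of inflight batching)
        for i, slot in enumerate(slots):
            if slot is None and waiting:
                slots[i] = waiting.popleft()
        iteration += 1
        for i, slot in enumerate(slots):
            if slot is None:
                continue
            req, remaining = slot
            remaining -= 1             # one generation step for every active request
            slots[i] = None if remaining == 0 else (req, remaining)   # evict at EoS
        print("iteration", iteration, "active:", [s[0] for s in slots if s])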

Inflight Batching Architecture
- [Architecture diagram.]

Quantization
- Supported precisions and models: FP8, INT8/INT4 weight-only, INT8 SmoothQuant, AWQ, GPTQ; support varies by model.
- Improves performance and allows for larger models per GPU.
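As a tensor-level illustration of what weight-only INT8 quantization does, the NumPy sketch below applies symmetric per-channel quantization and dequantizes inside the matmul; it is a conceptual example, not the calibration flow or fused kernels that TensorRT-LLM uses for INT8/INT4, SmoothQuant, AWQ, or GPTQ.

    # Toy weight-only INT8 quantization: store weights as int8 plus a per-output-channel
    # scale, and dequantize back to floating point inside the matmul.
    import numpy as np

    def quantize_int8(w):
        # w: [out_features, in_features]; symmetric per-channel scales
        scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
        q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
        return q, scale

    def int8_linear(x, q, scale):
        # dequantize on the fly; real kernels fuse this with the GEMM
        return x @ (q.astype(np.float32) * scale).T

    w = np.random.randn(256, 512).astype(np.float32)
    x = np.random.randn(4, 512).astype(np.float32)
    q, scale = quantize_int8(w)
    err = np.abs(int8_linear(x, q, scale) - x @ w.T).max()
    print("int8 storage:", q.nbytes, "bytes vs fp32:", w.nbytes, "bytes; max abs error:", err)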

TensorRT-LLM Performance
- Advanced techniques can further improve TensorRT-LLM performance and memory consumption.
- [Chart: relative throughput and memory improvement (roughly 0.3x to 1.6x on a 0.0x-2.0x scale) for FP16, KV-cache quantization, INT8, INT8 + KV quant, INT8 SmoothQuant, INT8 SmoothQuant + KV quant, INT4, and INT4 + KV quant.]

Ongoing & Future Plans
- Feature enhancements: MoE, multi-modal (BLIP-2 and more), C++ runtime.
- Community contributions are highly welcome.

Triton Inference Server
- Open-source inference serving software for fast, scalable, simplified inference serving.
- Optimized for real-time and batch requests, audio and video streaming, and model ensembles.
- Supports multiple DL/ML frameworks, e.g. TensorFlow, PyTorch, TensorRT, XGBoost, ONNX, Python, and more.
- Runs on x86 CPU, Arm CPU, NVIDIA GPUs, and MIG; on Linux, Windows, and virtualized environments; in the public cloud, data center, and edge/embedded (Jetson).
- Optimized for high GPU/CPU utilization, throughput, and low latency.
- Multi-GPU, multi-node inference for large language models.
- Runs as a microservice in Kubernetes and KServe, is available across all major cloud AI platforms, and integrates with major MLOps solutions.
- Large-scale model orchestration.

LLM Serving
- Serving packages models into an inference service.
- Standard HTTP/gRPC, or an in-process API that integrates directly into the client application via the C or Java API; Python/C++ client libraries let multiple client applications send queries and receive results.
- Kubernetes and Prometheus metrics: utilization, throughput, and latency.
- Model repository with many active models, model analyzer, and model orchestration across CPUs and GPUs.
- Communication between GPUs and nodes with MPI and NCCL plugins.
- Multiple GPU and CPU backends, plus custom backends.
- Per-model scheduler queues; flexible model loading (all or selective); dynamic batching (real-time, batch, stream).
- [Diagram: requests enter a waiting pool, are broadcast to working threads on Node 0 and Node 1, and output tensors are returned.]
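As a sketch of what a client call against a Triton-served LLM can look like over HTTP, the snippet below uses the tritonclient Python package. The model name ("ensemble") and the tensor names ("text_input", "max_tokens", "text_output") follow the tensorrtllm_backend examples but are deployment-specific assumptions; adjust them to your own model repository.

    # Minimal Triton HTTP client call for a text-generation model.
    # pip install tritonclient[http]
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    prompt = np.array([["What is TensorRT-LLM?"]], dtype=object)
    max_tokens = np.array([[64]], dtype=np.int32)

    inputs = [
        httpclient.InferInput("text_input", list(prompt.shape), "BYTES"),
        httpclient.InferInput("max_tokens", list(max_tokens.shape), "INT32"),
    ]
    inputs[0].set_data_from_numpy(prompt)
    inputs[1].set_data_from_numpy(max_tokens)

    result = client.infer(
        model_name="ensemble",   # assumed ensemble of pre/post-processing + the TRT-LLM model
        inputs=inputs,
        outputs=[httpclient.InferRequestedOutput("text_output")],
    )
    print(result.as_numpy("text_output"))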

Triton TensorRT-LLM Backend
- The Triton TensorRT-LLM backend is now GA.
- Its goal is to let you serve TensorRT-LLM models with Triton Inference Server.
- The repository contains the C++ implementation of the backend, supporting inflight batching, paged attention, and more; requests flow from a request waiting pool through the Triton tensorrtllm backend into the TensorRT-LLM runtime.
- Supported and ongoing & future plans: batch size and sequence length after inflight batching, TP size, beam width, etc. (relative to GEMM peak FLOPS); a vLLM backend and more; bringing in more flexibility to work with any backend.

FP8
- FP8 vs. INT8 in accuracy.
- FP8 LLM pre-training convergence test (from the GTC announcement).

Recap and Forecast
- MoE: does MoE beat dense LLMs at saving FLOPS?
- Quantization: w4afp8, structured sparsity, etc.
- Token generation "stuck" issue: can we do better in scheduling?
- Super-long context: can we handle infinite-length context input well?
- What is the next Llama? We are ready to accelerate it.
- NVIDIA provides more than just shovels.

THANKS
