NVIDIA: LLM Inference and Serving Deployment - Technical Summary and Future Outlook
Lu Xianglong (盧翔龍), Senior Solutions Architect, NVIDIA

Contents
01. Large model technology trends
02. TensorRT-LLM
03. FP8
04. Triton Inference Server for LLM

Large Model Technology Trends

Production Language Apps
- Increasing need for deep learning in language applications: chat, translation, summarization, search, generation, etc.
- Models are important for correct results: model accuracy directly correlates to helpfulness for users.
- "Online" deployments require ensuring a great experience with applications.
- Multi-functional, accurate models are large, which makes them slow during inference; deploying massive models for real-time applications makes cost-effective deployments challenging.

Large Language Model Ecosystem
- Llama, Falcon, StarCoder, ChatGLM, MPT, and more.
- 70-200 billion parameters or more.
- Rapid evolution makes optimization challenging.
- [Figure: LLM family tree (LLaMA, GPT, Falcon, StarCoder, ChatGLM, MPT). Image from Mooler0410/LLMsPracticalGuide. Yang, J., Jin, H., Tang, R., Han, X., Feng, Q., Jiang, H., Hu, X. (2023). Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond. arXiv cs.CL. http://arxiv.org/abs/2304.13712]
- We need a performant, robust, and extensible solution for cost-effective, real-time LLM deployments.
TensorRT-LLM

TensorRT-LLM: Optimizing LLM Inference
SoTA performance for large language models in production deployments.
- Challenges: LLM performance is crucial for real-time, cost-effective production deployments. Rapid evolution in the LLM ecosystem, with new models and techniques released regularly, requires a performant, flexible solution for optimizing models.
- TensorRT-LLM is an open-source library for optimizing inference performance of the latest large language models on NVIDIA GPUs. It is built on TensorRT, with a simple Python API for defining, optimizing, and executing LLMs for inference in production.
- Add new operators or models in Python to quickly support new LLMs with optimized performance.
- Leverage TensorRT compilation and hand-tuned kernels developed by GPU experts.
- Maximize throughput and GPU utilization through new scheduling techniques for LLMs.

    # define a new activation
    def silu(input: Tensor) -> Tensor:
        return input * sigmoid(input)

    # implement models like in DL frameworks
    class LlamaModel(Module):
        def __init__(self):
            self.layers = ModuleList()

        def forward(self):
            hidden = self.embedding()
            for layer in self.layers:
                hidden = layer(hidden)
            return hidden

[Charts: H100 with TensorRT-LLM vs. A100 (roughly 4.6x performance, 3x better TCO); static vs. inflight batching (average latency and cost). For technical discussion only.]

TensorRT-LLM Available Now!
- TensorRT-LLM is live: all source provided under Apache 2.0, with model examples, quantization toolkits, and more.
- Shipped in the Triton 23.10+ NGC container; coming soon to NeMo Framework Inference.
- Get it on GitHub and NGC; coming soon to NeMo.
[Chart: GitHub clones, views, and stars over the first month of release.]

TensorRT-LLM Usage
Use pre-built models, or optimize new ones.
- FT kernels: e.g. fMHA, layerNorm, activations, etc.
- Built on top of TensorRT primitives; pre-built models, e.g. GPT, LLaMa, BERT, etc.
- Leverages NCCL plugins for multi-device communication.
- Pre-segmented graphs in pre-built models; users can manually segment custom models; the future may allow automatic segmentation across GPUs.
- Pre/post processing includes tokenization and sampling (e.g. beam search).
- [Architecture diagram: LLM frameworks (NeMo, PyTorch, JAX) -> pre-built models (GPT, LLaMa, transformer) or custom models (trt_llm.layers.*: attention, mlp) -> TensorRT-LLM backend (TensorRT primitives, FT kernels, NCCL comm., pre/post processing) -> TensorRT-LLM runtime (TensorRT runtime, C++/Python runtime).]

TensorRT-LLM Usage: Create, Build, Execute
Load pre-built models or define them via the TensorRT-LLM Python APIs, compile them to optimized implementations via TensorRT, save the result as a serialized engine, and execute it in Python, C++, or Triton. A sketch of this flow follows below.
0. Trained model in a framework: NeMo, Hugging Face, or other DL frameworks.
1. Model initialization: load an example model, or create one via the Python APIs.
2. Engine building: the model is optimized via TensorRT and custom kernels into a TensorRT-LLM engine (TRT engine plus plugins).
3. Execution: load and execute engines in Python, C++, or Triton.
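As an illustration of the create/build/execute flow, here is a minimal sketch using the high-level Python API that recent TensorRT-LLM releases provide. The exact entry points, argument names, and the Llama checkpoint path are assumptions that vary by version (earlier releases instead used per-model build and run scripts); treat this as a sketch, not a reference.

    # Sketch: create (load a trained HF checkpoint), build (TensorRT engine is
    # compiled under the hood), execute (run generation from Python).
    from tensorrt_llm import LLM, SamplingParams

    # 0./1./2. Convert the checkpoint and build a serialized TensorRT engine.
    llm = LLM(model="meta-llama/Llama-2-7b-hf")

    # 3. Execute the engine in Python.
    params = SamplingParams(max_tokens=64, temperature=0.8)
    for output in llm.generate(["Explain paged KV caching in one sentence."], params):
        print(output.outputs[0].text)

The same serialized engine can also be loaded from the C++ runtime or served through Triton, which is the path the later slides cover.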
Multi-GPU, Multi-Node
- Sharding models across GPUs and across nodes: multi-GPU, multi-node.
- Parallelism modes: no parallelism, tensor parallel, pipeline parallel (a conceptual sketch of tensor parallelism follows below).
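A NumPy-only illustration of the tensor-parallel idea (Megatron-style column sharding of one linear layer); this is conceptual and not TensorRT-LLM code, and the all-gather is mimicked by a concatenate.

    # Conceptual tensor parallelism: split a weight matrix column-wise across two
    # "GPUs", compute each shard independently, then gather the partial outputs.
    import numpy as np

    x = np.random.randn(4, 1024)          # activations: [batch, hidden]
    w = np.random.randn(1024, 4096)       # full weight:  [hidden, ffn]

    w0, w1 = np.split(w, 2, axis=1)       # shard columns across "GPU 0" and "GPU 1"
    y0, y1 = x @ w0, x @ w1               # each shard is computed independently
    y = np.concatenate([y0, y1], axis=1)  # all-gather of the partial outputs

    assert np.allclose(y, x @ w)          # identical to the single-GPU result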
Optimized Attention
- Custom implementations for attention, similar to FlashAttention v2.
- Supports multi-head, grouped-query, and multi-query attention over queries, keys, and values; a shape-level sketch follows below.
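A shape-level illustration of how multi-head (MHA), grouped-query (GQA), and multi-query (MQA) attention differ: queries keep the full head count while keys/values use fewer heads that are shared by groups of query heads, which is what shrinks the KV cache. The head counts below are illustrative.

    # MHA vs. GQA vs. MQA: only the number of KV heads changes.
    import numpy as np

    batch, seq, head_dim = 1, 8, 64
    n_q_heads = 32

    for n_kv_heads in (32, 8, 1):          # 32 = MHA, 8 = GQA, 1 = MQA
        q = np.zeros((batch, n_q_heads, seq, head_dim))
        k = np.zeros((batch, n_kv_heads, seq, head_dim))
        # Each KV head serves n_q_heads // n_kv_heads query heads; repeating it
        # restores the per-query-head layout used by the attention math.
        k_expanded = np.repeat(k, n_q_heads // n_kv_heads, axis=1)
        assert k_expanded.shape == q.shape
        print(f"kv_heads={n_kv_heads:2d}  KV-cache size relative to MHA: {n_kv_heads / 32:.3f}")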
KV Cache Optimizations: Paged and Quantized KV Cache
- Paged KV cache stores keys and values in non-contiguous memory space, allocates memory on demand, and reduces the memory consumption of the KV cache.
- Quantized KV cache reduces KV cache elements from 16 bits to 8 bits (or less), reducing memory traffic and improving performance; INT8 and FP8 KV caches are supported.
- [Diagram: traditional KV caching pre-allocates contiguous blocks 0-3, while paged KV caching allocates blocks on demand and leaves the rest free. Example KV cache contents: "TensorRT-LLM optimizes inference on NVIDIA GPUs". A toy block allocator follows below.]
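A minimal sketch of the paged-KV-cache idea: the cache is carved into fixed-size blocks, each sequence holds a block table, and a new block is taken from the free pool only when the previous one fills up. This is an illustration of the concept, not TensorRT-LLM's actual implementation; the block size and class names are made up.

    BLOCK_TOKENS = 4                                     # tokens per block (toy value)

    class PagedKVCache:
        def __init__(self, num_blocks: int):
            self.free_blocks = list(range(num_blocks))   # physical block ids
            self.block_tables = {}                       # seq_id -> [block ids]
            self.lengths = {}                            # seq_id -> tokens written

        def append_token(self, seq_id: int) -> None:
            table = self.block_tables.setdefault(seq_id, [])
            used = self.lengths.get(seq_id, 0)
            if used % BLOCK_TOKENS == 0:                 # current block full: allocate on demand
                table.append(self.free_blocks.pop())
            self.lengths[seq_id] = used + 1

        def release(self, seq_id: int) -> None:
            # When a request finishes (EoS), its blocks return to the free pool.
            self.free_blocks.extend(self.block_tables.pop(seq_id, []))
            self.lengths.pop(seq_id, None)

    cache = PagedKVCache(num_blocks=8)
    for _ in range(6):                                   # 6 tokens occupy ceil(6/4) = 2 blocks
        cache.append_token(seq_id=0)
    print(cache.block_tables[0], len(cache.free_blocks))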
Inflight Batching
Maximizing GPU utilization during LLM serving.
- Evicts requests after EoS and inserts new requests in their place, instead of waiting for the whole static batch to finish.
- [Diagram: static batching vs. inflight batching over iterations 1-9 for requests R1-R9, showing context and generation phases, EoS, and no-op slots. A toy scheduling loop follows below.]
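A toy scheduling loop showing the inflight (continuous) batching idea from the diagram: the batch is refilled every iteration as soon as a request emits EoS. Request completion is randomized here for illustration; this is not the real scheduler.

    # Inflight batching sketch: evict on EoS, refill from the waiting pool each step.
    import random
    random.seed(0)

    waiting = [f"R{i}" for i in range(1, 10)]   # request waiting pool
    active, max_batch = [], 4

    for step in range(1, 12):
        # Fill every free slot from the waiting pool ("insert a new request").
        while len(active) < max_batch and waiting:
            active.append(waiting.pop(0))
        # One generation step per active request; some requests emit EoS.
        finished = [r for r in active if random.random() < 0.3]
        active = [r for r in active if r not in finished]   # evict after EoS
        print(f"iter {step:2d}  active={active}  finished={finished}")
        if not active and not waiting:
            break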
Quantization
Supported precisions and methods: FP8, INT8/INT4 weight-only, INT8 SmoothQuant, AWQ, GPTQ.
- Support varies by model.
- Improves performance and allows larger models per GPU; a weight-only quantization sketch follows below.
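A simplified sketch of per-channel INT8 weight-only quantization (the "INT8 weight only" option above): weights are stored as int8 plus one floating-point scale per output channel and dequantized at GEMM time. Real kernels fuse the dequantization into the GEMM; this is only the arithmetic.

    import numpy as np

    w = np.random.randn(1024, 4096).astype(np.float32)          # full-precision weight [in, out]

    scale = np.abs(w).max(axis=0) / 127.0                        # one scale per output column
    w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

    x = np.random.randn(2, 1024).astype(np.float32)
    y_ref = x @ w                                                # full-precision reference
    y_q = (x @ w_int8.astype(np.float32)) * scale                # dequantize after the GEMM

    rel_err = np.abs(y_q - y_ref).mean() / np.abs(y_ref).mean()
    print(f"int8 weight-only relative error ~ {rel_err:.4f}")    # small error, ~4x smaller weights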
TensorRT-LLM Performance
Advanced techniques can further improve TensorRT-LLM performance and memory consumption.
[Chart: relative throughput and memory improvement over FP16 for KV-cache quantization, INT8, INT8 + KV quant, INT8 SmoothQuant, INT8 SmoothQuant + KV quant, INT4, and INT4 + KV quant; values range from roughly 0.3x to 1.6x.]
Ongoing & Future Plans (TensorRT-LLM)
- Feature enhancement; community contributions are highly welcome.
- MoE.
- Multi-modal: BLIPv2 and more.
- C++ runtime.

Triton Inference Server for LLM

Triton Inference Server
Open-source inference serving software for fast, scalable, simplified inference serving.
- Optimized for real-time and batch requests; audio and video streaming; model ensembles.
- Multiple DL/ML frameworks, e.g. TensorFlow, PyTorch, TensorRT, XGBoost, ONNX, Python, and more.
- x86 CPU | Arm CPU | NVIDIA GPUs | MIG; Linux | Windows | virtualization; public cloud, data center, and edge/embedded (Jetson).
- Optimized for high GPU/CPU utilization, throughput, and low latency.
- Multi-GPU, multi-node inference for large language models.
- Microservice in Kubernetes and KServe; available across all major cloud AI platforms; integration with major MLOps solutions; large-scale model orchestration.
LLM Serving
Serving packs models into an inference service.
- Clients: multiple client applications use the Python/C++ client libraries to send queries and receive results over standard HTTP/gRPC, or use the in-process API to integrate Triton directly into the client application via the C or Java API.
- Kubernetes and Prometheus integration; utilization, throughput, and latency metrics.
- Model repository with many active models; model analyzer; model orchestration; CPU and GPU execution.
- Multiple GPU and CPU backends, plus custom backends.
- Per-model scheduler queues; flexible model loading (all or selective); dynamic batching (real time, batch, stream).
A client-side sketch follows below.
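For concreteness, here is what a client call against a Triton server running the TensorRT-LLM backend can look like over HTTP. The tensor names ("text_input", "max_tokens", "text_output") and the "ensemble" model name follow the tensorrtllm_backend examples but are assumptions for your deployment; check the model's config.pbtxt. Requires `pip install tritonclient[http]`.

    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Prompt as a BYTES tensor of shape [1, 1].
    text = httpclient.InferInput("text_input", [1, 1], "BYTES")
    text.set_data_from_numpy(np.array([["What is TensorRT-LLM?"]], dtype=object))

    # Generation length as an INT32 tensor.
    max_tokens = httpclient.InferInput("max_tokens", [1, 1], "INT32")
    max_tokens.set_data_from_numpy(np.array([[64]], dtype=np.int32))

    result = client.infer(model_name="ensemble", inputs=[text, max_tokens])
    print(result.as_numpy("text_output"))

The same request can also be sent over gRPC (tritonclient.grpc) or issued through the in-process API mentioned above.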
Triton TensorRT-LLM Backend
- The Triton TensorRT-LLM backend is now GA. Its goal is to let you serve TensorRT-LLM models with Triton Inference Server.
- The directory contains the C++ implementation of the backend, supporting inflight batching, paged attention, and more.
- Communication between GPUs and nodes uses MPI and NCCL plugins.
- [Diagram: Node 0 and Node 1 with working threads; requests are broadcast from a request waiting pool into the Triton tensorrtllm backend and the TensorRT-LLM runtime, and output tensors are returned.]

Ongoing & Future Plans (Triton backend)
- Batch size and sequence length after IFB; TP size, beam width, etc. (GEMM peak FLOPS).
- vLLM backend and more: bring in more flexibility, work with any backend.
FP8

[Slides: FP8 vs. INT8 accuracy comparison; FP8 LLM pre-training convergence test, from the GTC announcement.]

Recap and Forecast
- MoE: does MoE beat dense LLMs to save FLOPS?
- Quantization: w4a-fp8 (4-bit weights with FP8 activations), structured sparsity, etc.
- Token-generation "stuck" issue: can we do better in scheduling?
- Super long context: can we handle infinite-length context input well?
- What is the next Llama? We are ready to accelerate it.

NVIDIA doesn't just provide the shovels.

THANKS