PAI-Megatron-Patch: Training LLMs on Lingjun Clusters

Contents
01 PAI: a one-stop intelligent-computing platform for the full LLM pipeline
02 Data: large-scale text deduplication
03 Training: TorchAccelerator and topology-aware scheduling
04 Inference: high-performance serving with BladeLLM
05 Stability: fault handling and EasyCKPT

01 Platform
PAI has evolved from XLAB and XPS through PAI-TensorFlow, PAI-PyTorch, and PAI-Studio to today's DLC (training), DSW (development), and EAS (serving). It supports NLP and CV workloads as well as hundred-billion-parameter models such as M6, OFA, and Swin-Transformer, with an SLA that covers data, training, inference, and stability: a one-stop intelligent-computing platform for the full LLM pipeline.

02 Data deduplication
- Data Deduplication from Google (2022/03)
- Text Deduplication from BigCode (2023/05)
- The RefinedWeb dataset for Falcon LLM (2023/06)
High-quality text input yields a better large language model. The dedup pipeline tokenizes each document with jieba, computes a MinHash signature, finds candidate duplicate pairs with MinHashLSH, and groups the pairs into duplicate clusters with a graph connected-components algorithm. Because cluster sizes follow a power law, the components are computed with a distributed union-find rather than repeated joins. A minimal sketch follows.
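The slides name the building blocks (jieba, MinHash, MinHashLSH, union-find) but show no code. Below is a minimal single-machine sketch of that pipeline, assuming the open-source datasketch and jieba packages; NUM_PERM, the similarity threshold, and the UnionFind helper are illustrative choices, not PAI's distributed implementation.

```python
from datasketch import MinHash, MinHashLSH  # pip install datasketch jieba
import jieba

NUM_PERM = 128  # number of MinHash permutations (illustrative; trades accuracy for cost)

def signature(text: str) -> MinHash:
    """Tokenize with jieba and fold the tokens into a MinHash signature."""
    m = MinHash(num_perm=NUM_PERM)
    for tok in jieba.cut(text):
        m.update(tok.encode("utf-8"))
    return m

class UnionFind:
    """Duplicate clusters are the connected components of the candidate-pair graph."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def dedup_clusters(docs: dict, threshold: float = 0.8) -> dict:
    """Map each cluster representative to the keys of its near-duplicate documents."""
    lsh = MinHashLSH(threshold=threshold, num_perm=NUM_PERM)
    uf = UnionFind()
    for key, text in docs.items():
        sig = signature(text)
        for cand in lsh.query(sig):  # previously indexed documents that collide with this one
            uf.union(key, cand)
        lsh.insert(key, sig)
    clusters = {}
    for key in docs:
        clusters.setdefault(uf.find(key), []).append(key)
    return clusters  # keep one representative per cluster, drop the rest
```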
Benchmark of the deduplication run at a 50% duplicate rate (scores in %):

Implementation    Samples    Time       Precision    Recall    F1
PAI               0.5B       1h 34min   87           99        93
Other             0.5B       4h 10min   85           92        90
PAI               1.0B       3h 0min    82           99        90
Other             1.0B       6h 54min   80           90        85

03 Training: TorchAccelerator and topology-aware scheduling
TorchAccelerator is:
- a general framework that dispatches operators to new backends (AICompiler) while providing a new tensor expression that can be swapped in under eager mode;
- an AI compiler that applies advanced optimization techniques for high-performance code generation;
- support for FSDP, TP, and other distributed strategies (a minimal sketch follows at the end of this section).
Scheduling is built on the Kube Scheduler Framework: placement choices that match the network architecture (the ASW/DSW/PSW switch tiers) more fully release the potential of the high-performance network.
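TorchAccelerator's own API is not shown in the slides. As one concrete reading of "FSDP, TP and other distributed strategies" combined with compiler-backed codegen, here is a sketch using the stock PyTorch equivalents, FullyShardedDataParallel and torch.compile; the model and function are placeholders.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def build_model(local_rank: int) -> torch.nn.Module:
    """Shard a model with FSDP and hand its graph to a compiler backend.
    Run with one process per GPU, e.g. via torchrun."""
    dist.init_process_group("nccl")
    torch.cuda.set_device(local_rank)
    model = torch.nn.Sequential(  # stand-in for a transformer block stack
        torch.nn.Linear(4096, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 4096),
    ).cuda()
    model = FSDP(model)           # shard parameters, gradients, and optimizer state
    return torch.compile(model)   # compiler-optimized codegen over the eager module
```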
04 Inference: high-performance serving with BladeLLM
EAS provides high-performance implementations of mainstream models with full open-source compatibility (OPT, GPT, Bloom, GLM, and more) on both NVIDIA and AMD GPUs. The optimizations fall into three groups:
- model compression: weight quantization, activation quantization, KV-cache quantization (a quantization sketch follows this section);
- system optimization: compiler optimization and a high-performance operator library;
- distributed execution: tensor parallelism and pipeline parallelism.
[Figure: number of GPUs needed to serve OPT-66B at fp16/int8/int4 on A100 (80 GB), V100 (32 GB), and A10 (24 GB)]
[Figure: OPT-66B perplexity on wikitext2, ptb, and c4 at fp16/int8/int4]
Serving throughput improves by 1.7x to 3.8x and first-token latency drops by 8.7x to 13.8x. In the BladeLLM flow, the user supplies model weights and config, and the platform runs compression, compiling, and serving.
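Weight quantization is the first compression step listed above. Here is a minimal sketch of symmetric per-channel int8 weight quantization, a common scheme behind int8 serving; it is an illustration only, not BladeLLM's implementation, which also covers activation and KV-cache quantization.

```python
import torch

def quantize_weight_int8(w: torch.Tensor):
    """Per-output-channel symmetric quantization: scale = max|w| / 127, q = round(w / scale)."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate fp16 weight for (or fused into) the matmul."""
    return q.to(torch.float16) * scale.to(torch.float16)

w = torch.randn(4096, 4096)           # weight matrix of one Linear layer
q, scale = quantize_weight_int8(w)    # 4x smaller than fp32, 2x smaller than fp16
err = (w - dequantize_int8(q, scale).float()).abs().max()
print(f"max quantization error: {err:.4f}")
```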
05 Stability
The high-performance Lingjun cluster brings demanding stability challenges: ECC errors, NCCL timeouts, NCCL hangs, PCIe link downgrades, and NVLink errors. AIMaster detects hangs and other faults and recovers training from a checkpoint. EasyCKPT combines multi-level storage with asynchronous, parallel persistence, checkpointing in as little as seconds and greatly reducing wasted computation; it runs on Serverless PAI. A minimal sketch of the asynchronous pattern follows.
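EasyCKPT's internals are not shown in the slides. This is a minimal sketch, assuming plain PyTorch, of the multi-level asynchronous pattern described: a fast synchronous snapshot to host memory, then persistence to disk in a background thread so training resumes immediately. The function name, step interval, and path are hypothetical.

```python
import threading
import torch

def async_checkpoint(model: torch.nn.Module, path: str) -> threading.Thread:
    # Level 1: synchronous snapshot to host memory; blocks training only briefly.
    cpu_state = {k: v.detach().to("cpu", copy=True) for k, v in model.state_dict().items()}
    # Level 2: persist to disk (or remote storage) in the background.
    t = threading.Thread(target=torch.save, args=(cpu_state, path), daemon=True)
    t.start()
    return t  # join() this before overwriting the same checkpoint file

# usage inside a training loop (hypothetical step counter and path):
# if step % 1000 == 0:
#     async_checkpoint(model, f"/mnt/ckpt/step{step}.pt")
```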