APSARA Conference (云棲大會): Cloud-Native Core Technologies and Best Practices Guide 1 (619 pages)
Contents (only slide numbering survives for most sessions; the titles were lost in extraction). Surviving session titles:
Best practices for choosing an Alibaba Cloud Container service
Cloud-Native AI Suite Boosts Efficiency in Large Model Engineering
Contents: 01 02 03 (titles lost)
[Figure: node scale-out. Labels: 30%; Deployment; Pod; A: ecs.c7.xlarge, ecs.c7.2xlarge; cluster-autoscaler; B: ecs.c8.3xlarge, ecs.c8.4xlarge; Pending Pods; Node ecs.c8.3xlarge/ecs.c8.4xlarge; ecs.c8.3xlarge; ecs.c7.xlarge.]
[Figure: inputs listed from Kubernetes. Labels: All Nodes; All Pods; All DaemonSet; All PVC; All StorageClass; List K8s.]
[Figure: ACK Scaler. Labels: List ACK; A: ecs.c7.xlarge, ecs.c7.2xlarge; Estimator; Provisioner; Scaling Plan; ACK Scaler; B: ecs.c8.*; Watch Pending Pods; Pod.]
[Figure: two ACK cluster panels, each with a NodePool containing Nodes and Pods; one panel is labeled cluster-autoscaler.]
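The figures above name an Estimator, a Provisioner, and a Scaling Plan, and list candidate instance types per pool, but the algorithm itself did not survive extraction. The following is only a minimal sketch of how an estimator could turn a set of pending Pods into a plan. It assumes that A denotes a node pool, that only CPU requests matter, that ecs.c7.xlarge and ecs.c7.2xlarge expose 4000 and 8000 allocatable millicores, and that first-fit-decreasing bin packing is an acceptable heuristic. The names `InstanceType`, `nodes_needed`, and `scaling_plan` are hypothetical and are not ACK APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpu_millicores: int  # assumed allocatable CPU per node (illustrative value)


def nodes_needed(pod_requests, capacity):
    """First-fit-decreasing bin packing: nodes of one size needed for all pods."""
    free = []  # remaining capacity of each node opened so far
    for req in sorted(pod_requests, reverse=True):
        if req > capacity:
            return None  # this pod can never fit on this instance type
        for i, room in enumerate(free):
            if req <= room:
                free[i] = room - req
                break
        else:
            # no open node has room, so open a new one
            free.append(capacity - req)
    return len(free)


def scaling_plan(pod_requests, instance_types):
    """Return (type name, node count, total CPU) for the plan that provisions the least CPU."""
    best = None
    for it in instance_types:
        n = nodes_needed(pod_requests, it.cpu_millicores)
        if n is None:
            continue
        total = n * it.cpu_millicores
        if best is None or total < best[2]:
            best = (it.name, n, total)
    return best


# Hypothetical pending Pod CPU requests, in millicores.
pending = [3000, 3000, 1500, 1500, 500]
pool_a = [InstanceType("ecs.c7.xlarge", 4000), InstanceType("ecs.c7.2xlarge", 8000)]
print(scaling_plan(pending, pool_a))
```

A real estimator would also have to account for memory, DaemonSet overhead, PVC and StorageClass constraints, and scheduling predicates, which is presumably why the figures list those objects as inputs.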
[Figure: scale-out timing. Labels: cluster-autoscaler; 15s; ACK cluster; NodePool; 3s; ACK cluster; NodePool.]
[Figure: batched scale-out. Labels: Batch1; Node1; Node2; Batch2; Node3; Node2; Node3; Node1; four Pods per group.]
[Figure: troubleshooting. Labels: Apiserver; cluster-autoscaler logs; Pod Events; Nodes; NodePool; cluster-autoscaler.]
[Figure: scale-in. Labels: NodePool; NodeA; NodeB; cluster-autoscaler; -1s; 99%.]
03 (title lost)
[Figure: workload autoscaling. Labels: Prometheus; ACK Metrics Adapter; HPA; Workloads/scale subresource; Pods; Autoscaler; A; B; Node; DaemonSet Pod; custom metrics.]
THANKS
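The workload-autoscaling figure shows HPA consuming custom metrics through Prometheus and the ACK Metrics Adapter. As a hedged illustration, the sketch below applies the replica formula from the upstream Kubernetes HPA documentation, desired = ceil(current * currentMetric / targetMetric), with the default 10% tolerance. The function name, the min/max bounds, and the example metric values are assumptions for illustration; nothing here is taken from the slides beyond the use of HPA with custom metrics.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count per the Kubernetes HPA formula, clamped to [min, max]."""
    ratio = current_metric / target_metric
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical custom metric: requests per second per Pod, target 100.
print(desired_replicas(4, 150, 100))   # load above target: scale out
print(desired_replicas(4, 105, 100))   # within tolerance: no change
print(desired_replicas(4, 20, 100))    # load well below target: scale in
```

The clamp to `max_replicas` matters in practice: a metric spike can otherwise request far more Pods than the node pools can supply, leaving the remainder Pending until node scale-out catches up.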
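[Labels: 25%; 80%.]
With the [...] platform, the cloud-side [...] platform and the edge-side services were integrated, extending cloud-native technology to the edge. This improved service stability and markedly improved the ticket-checking experience for users, cutting the average ticket-checking time per person by [...]. Across [...] sessions at large events such as the Beijing Winter Olympics and the Hangzhou Asian Games, nearly [...] x 10,000 tickets were checked.
[Labels: AZ; IO; 20GB+/s; NEW; 12; 50%; 30%.]
[Figure 1: Serverless Argo. Labels: A, B, C; ECI Pod (x2).]
[Figure 2: Labels: A, B, C; ECI Pod (x6).]
ACS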
Alibaba Cloud Container Compute Service (ACS)
Contents: 01 02 03 (titles lost)
[Figure: ACS positioning. Labels: ACS; K8S API; PAI; MaxCompute; Web; Spark; SLO; Saving Plan; Clusterless; K8S; Nodeless; Serverless; ACS Cluster (x3).]
Contents: 01 02 03 (titles lost)
[Figure: Labels: ACS; K8S API; K8S; SLO.]
[Figure: Labels: ACS Cluster (x3); Latency; Redis; ACS; Pod.]
[Chart: only axis ticks survive (0 to 1000 in steps of 100; 0 to 24 in steps of 4); axis names and series labels are lost.]
[Figure: isolation and access control. Labels: Container; Guest Kernel; Pod; vSwitch; VPC; vRouter; Pod Volume; NAS; EBS; KMS; RAM; RBAC; RRSA.]
Contents: 01 02 03 (titles lost)
1. [...] ACS
2. [...]
3. [...] RMKV-Runner
4. [...] ChatGPT-Next-Web
5. [...]
6. [...] https:/ EAS [...] 5. [...] https:/

E2E Confidential Container for Data Privacy (Intel TDX, New)
[Labels: LLM/AIGC; IoT; KMS; PCCS; Pod; NEW.]
New Future on Cloud
Building Trustworthy AI Applications Based on Confidential Containers
[Labels: AI; Intel AMX; Intel PyTorch; 32; TDX; 3%; ACK; (OSS); (KMS); (ACR); AI Pod; ECS 8; Intel TDX; AMX; AIGC.]
THANKS

Future on Cloud
Step 2: Inference service startup
Step 3.a: Cache scale-in
Step 3.b: Cache replica maintenance
Step 3.c: Business-aware scale-out
[Labels: Dataset A; Dataset B.]
Data operations supported by Fluid: (list lost)
[Figure: Labels: Kube-api-server; ECI Pod (x4); OSS.]
[Figure: Labels: Fluid; Kube-api-server; ECI Pod (x4); Fluid Cache Layer; OSS.]
THANKS
[Labels: 87%; 90%; 15%; 90%; 78%; New; OSS/EBS; KMS.]
THANKS

Service Mesh: Cloud Native Networking, Zero Trust Security, and Observability Platform
Contents: 01 02 03 04 (titles lost)
[Figure: DNS resolution path, shown twice. Labels: Pilot API Server; CoreDNS; K8s; Pod; Proxy; Service; NodeLocal DNS Cache.]
Contents: 01 02 03 04 (titles lost)
[Figure: Labels: TCP; App A; App B; App C.]
[Figure: ambient-mode data path. Labels: ztunnel pod, veth 192.168.126.1; Geneve tunnel; Destination Pod 192.168.126.2; ztunnel pod, eth0; requester Pod IP; veth 192.168.127.1; mark 0x100; Ambient Pod IPSet; Geneve tunnel.]
[Figure residue from the same ambient-mode diagram, printed twice with overlapping text. Recoverable labels: request is sent to the destination; app pod 192.168.127.2; Node A; Node B; ztunnel pod; Geneve tunnel; mark 0x100; Ambient Pod IPSet.]
Contents: 01 02 03 04 (titles lost)
[Figure: Sidecarless and Sidecar modes. Labels: ASM; API; L7; Serverless; ztunnel; App A; App B; App C; App D; Waypoint; ECI; Waypoint Proxy.]
Contents: 01 02 03 04 (titles lost)
[Labels: ASM; cluster; 100; 50%; 40%.]
[Figure: model serving behind ASM, repeated several times with overprinted text. Labels: Pytorch; Tensorflow; ASM; Pod 1; Pod 2; IDaaS; SSO; Waypoint Proxy; Model Repository; CLB; Model Proxy; Model Server; OAuth2-Proxy; Tensorflow-deploy; Pytorch-deploy; ACK.]
THANKS

New Future on Cloud
Contents: 01 02 03 (titles lost)
[Figure: cloud-native AI stack. Labels: AI; PAI; CPU (x86/Arm); OSS/CPFS; VPC/RDMA; Prompt; GPU/NPU; Qwen; Baichuan; ChatGLM; Llama; Bloom; Falcon; StableDiffusion.]
[Figure: Labels: Kubelet; gpu0; gpu1.]
[Figure: hierarchical GPU quota. Tree nodes: root, root.a, root.b, root.c, root.b.1, root.b.2, root.c.1. First quota: Min GPU 100, Max GPU 100. Remaining quotas in extraction order: Min 20 / Max 40; Min 50 / Max 80; Min 30 / Max 50; Min 30 / Max 50; Min 20 / Max 40; Min 30 / Max 50. Namespace1 to Namespace6 appear below the tree.]
[Figure: model loading. Labels: Dataset; HuggingFace TGI Server / Stable Diffusion / Model Serving Program on GPU; Model; ShardFile (x5).]
[Figure, continued. Labels: Distributed Cache; Shard; Fluid SDK; PageCache.]
Models (listed twice): Llama, Qwen, Baichuan, ChatGLM, Bloom, OPT, Falcon, StableDiffusion, GPT, Bert
Training example:
arena submit pytorchjob --label [...] --label [...] --annotation [...] --name=chatglm-ptuning --gpus=1 --image=xxx-chatglm-finetune:chatglm2 --data=oss-data:/mymodels "cd /ChatGLM-6B/ptuning && bash train.sh /models/thudm-chatglm2-6b"
Frameworks (listed twice): Tensorflow, Pytorch, Deepspeed, Deepspeed-chat, TGI, KServe, Triton, SD-WebUI, DJL, vLLM, FasterTransformer
Inference example:
arena serve custom --name=bloom-tgi-inference --gpus=2 --version=alpha --replicas=1 --restful-port=8080 --image=xxx-text-generation-inference:0.8 "text-generation-launcher --disable-custom-kernels --model-id bigscience/bloom-560m --num-shard 2 -p 8080"
[Figure: Spark on ACK. Labels: ACK Kubernetes Cluster; (ECI); ACK; Arm; Remote Shuffle Service; Spark Executor Pod; Spark Driver Pod; OSS; Fluid; JindoFSx; Spark Application Operator; K8s; EMR; Spark Executor Pods.]
Contents: 01 02 03 (titles lost)
[Labels: ACK; Pytorch; Tensorflow; Arena CLI/SDK; AI; 20%; 30%; 80%.]
New Future on Cloud
THANKS

Contents
01 Challenges of K8s cluster stability and large-scale scenarios
02 ACK cluster stability governance and optimization policy
[Labels: poseidon; ECI; networkpolicy.]
03 ACK stability product capabilities and best practices
THANKS

ACK FinOps
Contents: 01 [...]; 02 FinOps [...]; 03 FinOps [...]
[Slide residue: numbered points about IT cost, text lost. Surviving labels: IT; CPU/GPU; request/limit.]
[Figure: Labels: Cost-API; Cost-Dashboard; Cost-Exporter; Prometheus; Cost-Analysis.]
[Cost formulas, terms lost in extraction: Pod [...] = Pod [...]; Pod [...] = [...] Pod [...]; A [...] = RDS [...] + Kafka [...] x [...] + (pod x pod [...]]
[Labels: 25%; 20; 50%; 20%; FinOps; 20%-50%; IT; 10+; IDC; 25%.]
THANKS

Contents: 01 02 03 (titles lost)
Flexible resource scheduling
Efficient AI development
Operable and maintainable AI services
New Future on Cloud
Convergence of IT Infrastructure
Online Presence of Core Technologies
Data and Intelligence Capabilities of Business Applications
[Labels: 10,000; 10000+; 5%; Tmax; Stable Diffusion; llama; AIGC; TMax AI.]
THANKS

[Labels: 78%; 89%.]
Alibaba Cloud Container Compute Service
Contents: 01 02 03 (titles lost)
[Labels: NEW; SLO; Shenlong (神龍) ECS; 50%; K8s.]
[Labels: pod; 0.25c/0.25GB; 1:1; 1:8; Pod.]
Contents: 01 02 03 (titles lost)
[Figure: ACS architecture. Labels: Extension Webhook; Extension Controller; CPU; GPU/NVswitch; NPU; VPC/RDMA; CPFS/PoVNAS/OSS; Optimization for Microservice, Web Apps, AI, Big Data Workloads; Resource Scheduling; Workloads; Resource Management; Java; AI; Pod.]
[Figure: isolation and access control. Labels: Container; Guest Kernel; Pod; K8S Master; RAM/RBAC/RRSA; runD; Kata 3.0; KMS.]
Contents: 01 02 03 (titles lost)
[Labels: ACS; Serverless; PAI; ECS; https:/]
Available for Invitational Preview
THANKS

ACK
Empower Digital Innovations for Everyone with Alibaba Cloud Container Services
[Product map. Labels: ACK - Kubernetes Service; ACS - Container Compute Service; ACK One; ACK Edge; ACK Distro; IaaS; ASM; ACR.]
Alibaba Cloud Named Leader for Container Platform
[Labels: Forrester, Q4/22; Gartner; Gartner 2023; Forrester 2022 Q4.]
Contents: 01 02 03 04 05 (titles lost)
Unified scheduling
Comprehensive Computing Power
Improved Scalability - Maximize Elastic Compute Resources
Consistent Capabilities
[Labels: ACK; Enhanced; ECI; GPU; x86; AI; 15000 ECS; 50000 ECI; New; Tensorflow; PyTorch; Argo; Kubeflow/Arena/KServe; ECS; x86; GPU.]
Better Cost-effectiveness - Yitian 710
[Labels: ACR; Alibaba Cloud Linux; OS; TAG; x86; Arm; ACK; G7; Web; 50%; 80%; Spark; 28%; Arm V9; G7; Web; 22%; Spark TPC-DS; 15%; ACR; AI; KeenTune; 30%; New; Enhanced.]
Higher Elasticity - Just-In-Time Cluster Auto Scaler
[Labels: A; Estimator; Scaling Plan; ecs.c7.xlarge/ecs.c7.2xlarge; Provisioner; Scaling Plan; ACK Scaler; B; ecs.c8.*; Pending Pods; Pod; NEW.]
Simplified Operation - Intelligent Node Pool Management and ContainerOS
[Labels: ContainerOS; Node (x5); CVE.]
[Labels, continued: ContainerOS; Node; P90 55s; 50%; 98%; 90%; Enhanced.]
ECI - Serverless Container
Improving Efficiency and Reducing Costs with Elastic Container Instance
[Labels: ACK; ECI; 200; Spark; Spot; 50%; AI for Science; Region; AI; 30%; 40%; 15%; 7000 Pod/min; GPU; 60%; Arm; AMD; Windows; NEW.]
Contents: 01 02 03 04 05 (titles lost)
ACK Lingjun - Stable and Efficient Cloud-Native AI Infrastructure
[Labels: PAI; AIGC; ACK; GPU/RDMA; Fluid; GPU; ACK Kubernetes; 170X; 3X; 82%; 70%; NEW.]
Cloud-Native AI Suite Boosts Efficiency in Large Model Engineering
[Labels: AI; 20%; 30%; 80%; Prompt; ACK; GPU; Gang; Capacity; Kube-queue; Fluid; ElasticTrainingJob; PAI; Serverless; Kserve; CPU/GPU/NPU; OSS/CPFS; VPC/RDMA; AIACC; Dataset; Process; TGI; FasterTransformer; Job; DeepspeedJob; Deepspeed-Chat; SeaArt; Soul; AIGC; AI PaaS; 2-5; LLM; Enhanced.]
Optimized Scheduling for AI, Big Data and Other Workloads
[Labels: Kubernetes; CPU; GPU; NPU; VPC/RDMA; NAS/CPFS; Kubeflow; KubeDL; Kube-queue; OSS; Pod; AI; Yarn; CNCF; intel; 360; Enhanced.]
Contents: 01 02 03 04 05 (titles lost)
AIOps for Kubernetes Cluster: Fault Prevention and Problem Determination
[Labels: NEW; AIOps; LLM; 200+; ChatOps; 85+%; Demo.]
FinOps for Kubernetes Cluster: Digitalized Financial Governance
[Labels: Enhanced; IDC; ACK; API; 20%; 10; IT; FinOps; CPU; GPU; AI; Serverless.]
Contents: 01 02 03 04 05 (titles lost)
[Figure: supply-chain policy. Labels: ACR EE; OPA Policy; Gatekeeper; Binary AuthZ Policy; Kritis; policy-controller; KMS sign; DevSecOps; 45%; Gartner.]
[Labels, continued: 90%; Sysdig; 87%; DevSecOps.]
DevSecOps Security Insight
[Labels: ACK; Admission Webhook; Node; Enhanced.]
ASM Sidecarless Service Mesh - Zero Trust Application Network
[Labels: Ambient Mesh; Sidecarless; Sidecar; ASM; L4 Proxy; TLS; Istio; OPA; 60%; 50%; 40%; NEW; Sidecar Proxy; L7 Proxy; L4 Proxy.]
E2E Confidential Container for Data Privacy (Intel TDX, New)
[Labels: LLM/AIGC; IoT; KMS; PCCS; Pod; NEW.]
Building Trustworthy AI Applications Based on Confidential Containers
[Labels: AI; Intel AMX; PyTorch; 32; TDX; 3%; ACK; (OSS); (KMS); (ACR); AI Pod; ECS 8; Intel TDX; AMX; AIGC.]
Contents: 01 02 03 04 05 (titles lost)
[Labels: ACK One; K8s; 25%.]
[Labels, continued: 80%; 5.]
ACK One Fleet Management for Distributed Cloud
[Labels: ACK One Fleet; Open Cluster Management; ACK One-Fleet; 3rd K8s connector; Enhanced; AZ; IO; 20GB+/s; NAS/OSS; NAS; CPFS; OSS.]
Managed Argo Workflow Crossing Multiple Regions
[Figure 1: Serverless Argo. Labels: A, B, C; Argo; 30%; 15; 10; 12; 50%; 30%; ECI Pod (x2).]
[Figure 2: NEW. Labels: A, B, C; ECI Pod (x6).]
ACK - Cloud Native Infrastructure for AI Era
Managed Argo Workflow - Scenarios
[Figure: event-driven workflows. Labels: Argo; ECI; OSS; MNS; Argo SDK; Git; push commit; workflow api; argo cli/ui; upload file; git event; oss event; Kubernetes; EventBridge; ARMS; SLS; RAM; NAS; Argo Event; Argo Workflow.]
[Timeline. Labels: ECI; 2022.4; 2022.6; 2022.8; 2023.4; 2023.8; Join Koordinator.]
Expand technical influence
Mutual empowerment across many industries
Improve engineering R&D efficiency
[Figure: storage path. Labels: Serverless; Hot Cache; DFS; RDMA; NAS; Check Point; Index; Log; Segment; Buffer Batch Write; Append; Direct IO Read; Cold Cache; PageCache; Async.]

New Future on Cloud
Contents: 01 02 03 (titles lost)
01 [Labels: Serverless; Serverless Devs + Serverless Work Flow.]
02 [Labels: GPU; GB.]
03 [Labels: Serverless; AI; Serverless Devs; Serverless AI Framework; FC; RDS/PostgreSQL; OSS/NAS/OTS; MNS/RocketMQ; BaaS.]
THANKS

New Future on Cloud
Contents: 01 02 03 (titles lost)
01 [Labels: Small, Slow (Days); Big, Fast (ms); 1 2 3 4; UDP; VDP; EDP; QDP; SDP; MDP; DDP; FDP; HDP; RDP; XDP; TSPUDEPUCDSSAPMES; APP; OSM; TEA.]
02 (content lost)
03 (content lost)
THANKS
Contents: 01 02 03 (repeated three times; titles lost)