何濤-Vineyard: An Open-Source Distributed In-Memory Data Management Engine (GOTC Shenzhen venue) (23 pages).pdf

Uploaded by: 懶人. ID: 82839. 2022-07-14. PDF, 23 pages, 5.01 MB

This report belongs to the collection: 2021 GOTC (Global Open-source Technology Conference) slide collection


General Technology Track
何濤 (He Tao), Alibaba
August 1, 2021
Vineyard: An Open-Source Distributed Data Management Engine

Why bother
1. Sharing data efficiently (with "0-copy") between libraries is easy within a single Python process.
2. It is not as easy to do so across processes/runtimes on a single machine.
   Possible with plasma from Apache Arrow, a local object store using shared memory.
3. What about processing big data that cannot fit into a single machine, and that involves different workloads?
   Use vineyard + K8s!
PyData is the de-facto standard for data analysis. There are lots of libraries for different workloads.
(image credit: https://coiled.io/blog/pydata-dask/)

Big data analytical pipelines
[Figure: an anti-fraud pipeline. Labels: Load data, ETL; Label Propagation; Fraud detection with DNN; Post ETL processing; SQL; ML/TensorFlow; Graph Computation (LPA)/GRAPE; SQL]

Big data analytical pipelines
Observation:
- A typical big data application involves various kinds of workloads, and thus involves multiple dedicated systems, one for each workload.
- These dedicated systems typically share intermediate data through external file systems.
- The workflow is often organized as a chain/DAG, and each individual task only gets invoked after its prerequisite tasks are completed.
[Figure: an anti-fraud pipeline. Labels: Data Extraction and Preprocessing; Data Warehouse (Hive), Distributed FS (HDFS), Object Store (S3); Postprocessing and ETL; ETL (SQL); SQL; Graph Analytical Algorithms; GRAPE; Train/Infer with a NN model; TensorFlow/PyTorch]

Big data analytical pipelines
Problem:
- Production-ready systems (Hive, TensorFlow, ...) are hard to develop.
- Sharing data through external file systems has a huge I/O cost.
- Applying cross-task optimization (pipelining) to tasks is challenging.
[Figure: the anti-fraud pipeline, same labels as on the previous slide]

Big data analytical pipelines: Hardness in developing production-ready systems
Problem: Many dedicated systems (e.g., for graph computing) have been developed in recent years, but only a few are production-ready.
Huge efforts are required just to implement:
- I/O adaptors
- Data partition/chunking strategies
- Fault-tolerance mechanisms
- Scale in/out
- Data sink/source

Big data analytical pipelines: Huge I/O cost in workflows
Problem: Data could be polymorphic
- Non-relational data, such as tensors, dataframes and graphs/networks, are becoming increasingly prevalent.
- Tables and SQL may not be the best way to store, exchange or process them.
- Having the data transformed from/to tables back and forth between different systems could be a huge overhead.
- Saving/loading the data to/from external storage requires lots of memory copies and I/O costs.

Big data analytical pipelines: Hardness of cross-job optimization
Problem: Tasks in workflows have no information about other tasks
- The intermediate data cannot be placed in an optimized fashion for the dependent tasks.
- The data transfer from one task to another is a barrier.
- It usually requires transformation of format and schema.
- It is hard to do cross-task pipelining.
[Figure: tasks in a typical workflow. Labels: Run this first; Join; Branch 1; Follow Branch 1]

Vineyard: Motivation
- Big data systems at production-ready quality are hard to develop
  - Vineyard has an extensible design that supports pluggable routines for I/O, data partitioning, scaling and fault tolerance.
- I/O cost in workflows is usually high
  - Vineyard enables sharing in-memory immutable data in a zero-copy fashion.
  - I/O between tasks in a workflow doesn't require extra copies, and data can be accessed as an in-memory data object.
- Cross-task optimization is challenging
  - Data in memory can be directly shared between different systems.
  - Vineyard supports streams in shared memory, which provides opportunities for pipelining between dedicated systems.

Vineyard: What is Vineyard
- Distributed in-memory object store for immutable data
- Zero-copy in-memory data sharing between different systems
- Out-of-the-box high-level abstractions for developing big data applications
- Local data access as native objects
- Drivers for data partitioning, I/O, checkpointing, migration, ...

Vineyard: Architecture
- A vineyard object consists of data payload and metadata.
- Data payload is stored in shared memory.
- Metadata is synced across the cluster with etcd.
- Vineyard daemon instances are accessed via IPC/RPC connections.
- Data payload can only be accessed via IPC connections.
- Pluggable drivers can provide certain functionalities for certain data formats.

Vineyard: Efficient Object Sharing across Engines
- Object = Metadata + Blob
  - Decouple the payload and semantics
- Share by memory mapping
  - Zero-copy
- Share with the data structure abstractions
  - Shares the data structure directly, e.g., Tensors, DataFrames, Graphs
- Builders + Resolver
  - Interpret the vineyard objects as the engine's native value types

Vineyard: Distributed Objects Sharing
- Vineyard supports distributed objects.
- A global object consists of a set of chunks.
- A client can access the payload of local chunks and the metadata (only) of remote chunks.
- Metadata is synced using etcd.
- Performance: only the metadata of objects that are referred to by a global object is synced to other instances.

Vineyard: Pipelining between tasks in a workflow
- Zero-copy sharing unlocks new opportunities
  - The intermediate data sharing is not a barrier anymore.
- Stream in Vineyard
  - Stream over chunks of data structures, e.g., tensor stream, dataframe stream
- Tasks can be pipelined using vineyard streams!

Vineyard: Pluggable drivers
- Engines are usually hard to connect to production systems
  - Integration with internal I/O
  - Integration with other internal engines
- Vineyard serves as a bridge
  - I/O is delegated to vineyard.
  - Engines consume data structures in vineyard directly.
  - Engines talk to other engines via shared intermediate objects in vineyard.

Vineyard on Kubernetes
- The end-to-end big data task is deployed on Kubernetes.
- Intermediate data is abstracted as a Kubernetes resource (CRDs), and is shared with vineyard through memory mapping.
- "Data" lives in memory, and the scheduler optimizes the data flow among cluster nodes.
- Vision: a new cloud-native paradigm for big data tasks
[Figure: the anti-fraud pipeline on Kubernetes. Labels: Data Extraction and Preprocessing; Vineyard DaemonSet (on Kubernetes); Postprocessing and ETL; ETL (SQL); SQL; Graph Analytical Algorithms; GRAPE; Train/Infer with a NN model; TensorFlow/PyTorch]

Vineyard on Kubernetes: Memory Sharing on Kubernetes
- Vineyard requires IPC communication between vineyard server pods and application pods for memory sharing.
- The domain socket of the vineyard server could be mounted on a hostPath or a PersistentVolumeClaim.
- When users bundle vineyard and the workload into the same pod, the domain socket could be shared using an emptyDir.

Vineyard on Kubernetes: Vineyard objects as Kubernetes resources (CRDs)
- Vineyard objects are abstracted as Kubernetes resources (i.e., CRDs).
- Each CRD contains the metadata of the represented vineyard object.
- Location specs that describe which node an object is located on are added to the CRDs of local objects.

Vineyard on Kubernetes: Scheduling on Kubernetes
- A job and its required data cannot always be aligned
  - The cluster environment is dynamic and constrained.
  - The requirements of different workloads are different.
- The location information can be used to guide the scheduling process: a vineyard scheduler plugin!
- It can still be unaligned
  - Auto migration in an initContainer

Vineyard on Kubernetes: Deploying Vineyard on Kubernetes
- Deployment
  - Vineyard is deployed as a DaemonSet in the Kubernetes cluster.
- Deploy using Helm
  - Vineyard can be easily deployed in a Kubernetes cluster using Helm:
    helm repo add vineyard https:/ install-namespace vineyard-name vineyard stable/vineyard

Roadmap: Ongoing
- Connecting to machine learning frameworks
  - Integration with TensorFlow/PyTorch to share objects in vineyard with machine learning frameworks
- SDK in more languages
  - Python
  - Java
  - Rust
  - Go
- Integration with workflow engines
  - Integration with Airflow: brings a better intermediate data sharing solution for workflows orchestrated by Airflow

Roadmap: Further ahead
- Vineyard Operator for Kubernetes
  - Better cluster management and monitoring on Kubernetes clusters
  - Better data-aware scheduling policy within the scheduler plugin
- Application-aware far memory
  - Vineyard supports global object abstractions, e.g., GlobalDataFrame.
  - Support for application-aware far memory will enable single-machine applications to leverage remote memory resources.
  - Better performance than raw RPC
- Storage hierarchy
  - In-memory objects can be swapped out in certain cases.
  - Snapshotting objects and restoring them back to memory later benefits the end-to-end performance.

Vineyard Community
- Vineyard is open source under the Apache-2.0 License.
- Any contributions from the community are welcome:
  - Issues about bugs and feature requests
  - Pull requests for bugfixes, enhancements, feature implementations and extensions
  - Discussion about the installation, deployment and usage of vineyard
- We have comprehensive documentation for the underlying design and how to build applications on vineyard: https://v6d.io/
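
The "Object = Metadata + Blob" idea from the architecture slides can be illustrated with a minimal sketch. It uses only Python's standard-library `multiprocessing.shared_memory` as a stand-in for the vineyard daemon, so it is not vineyard's API: the names `build_tensor` and `resolve_tensor` and the metadata fields are hypothetical, chosen only to mirror the builder/resolver roles. The payload is written into shared memory once, and the resolver attaches to it by name and reinterprets the bytes in place.

```python
import json
import struct
from multiprocessing import shared_memory


def build_tensor(values):
    """Builder (hypothetical): write float64 values into a shared-memory blob."""
    nbytes = len(values) * struct.calcsize("d")
    blob = shared_memory.SharedMemory(create=True, size=nbytes)
    struct.pack_into(f"{len(values)}d", blob.buf, 0, *values)
    # The metadata carries the semantics (type, element format, length) plus
    # the blob's name; it is small and is the only part that needs syncing.
    meta = {"typename": "tensor", "format": "d",
            "length": len(values), "blob": blob.name}
    return blob, json.dumps(meta)


def resolve_tensor(meta_json):
    """Resolver (hypothetical): attach to the named blob and view it in place."""
    meta = json.loads(meta_json)
    blob = shared_memory.SharedMemory(name=meta["blob"])
    nbytes = meta["length"] * struct.calcsize(meta["format"])
    # Slice to the exact size (the OS may round the segment up), reinterpret
    # the bytes as typed elements, and mark the view read-only. No copy is made.
    view = blob.buf[:nbytes].cast(meta["format"]).toreadonly()
    return blob, view


def demo():
    owner, meta_json = build_tensor([1.0, 2.0, 3.0])
    reader, view = resolve_tensor(meta_json)  # could run in another process
    try:
        return list(view), view.readonly
    finally:
        view.release()   # views must be released before the blob is closed
        reader.close()
        owner.close()
        owner.unlink()   # free the segment once nobody needs it


print(demo())  # prints ([1.0, 2.0, 3.0], True)
```

Only the JSON metadata string would have to travel between processes here; the payload itself stays where it was written, which is the property the slides call zero-copy sharing.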
