The 3rd China Rust Developer Conference

Apache Ballista Introduction
鐘陽紅 (John Zhong), Software Engineer, eBay
nju_yaho@apache.org

Agenda
- Overview
- Cluster Setup
- SQL Execution
- Data Cache
- Future

Overview
Apache Ballista is a distributed SQL query engine powered by the Rust implementation of Apache Arrow and DataFusion. It is mainly intended for low-latency interactive queries.
- Supports DAG-based execution and fault tolerance
- Supports data exchange
- Supports different kinds of object stores, such as HDFS, S3, and Azure
- Supports data caching and cache-aware task scheduling

Cluster Setup
The cluster consists of one scheduler and a number of executors. Both the scheduler and the executors can be deployed on K8S. Executors can be added to the cluster flexibly by registering with the cluster scheduler.

SQL Execution
- SQL -> DAG (Directed Acyclic Graph)
- DAG State Machine
- Task Assignment
- Event Loop based Processing

SQL Execution: DAG Generation
SQL -> Logical Plan -> Single Machine Execution Plan -> Distributed Execution Plan -> DAG

SQL Execution: DAG State Machine
[Diagram: Normal Stage State Machine]

SQL Execution: Fault Tolerance
[Diagram: Stage State Machine for Executor Lost]

SQL Execution: Task Assignment
- Task: each execution stage covers a number of data partitions, with one task per data partition.
- Executor slot: each executor has a number of slots for task execution.
One round of task assignment binds as many pending tasks as possible to available executor slots.
Two assignment policies:

Policy       | Result of One Round
Round-robin  | Job_a: 1 slot from executor_3, 1 slot from executor_2
             | Job_b: 3 slots from executor_3, 2 slots from executor_2, 2 slots from executor_1
Bias         | Job_a: 2 slots from executor_3
             | Job_b: 5 slots from executor_3, 2 slots from executor_2

SQL Execution: Event Loop based Processing
Advantages:
- Decoupled
- Efficient processing for batch events

Data Cache
Data caching is a very common feature of cloud data warehouses for accelerating access to the data source.
- Snowflake Multi-Cluster Shared Data Architecture
- Vertica Eon Architecture

- Consistent hashing-based assignment (Snowflake)
- LRU-based retirement
- Cache-aware scheduling
- Consistent hashing with tolerance-based work stealing
- Currently it is file-level

Three rounds of cache-aware task scheduling:
1. Assign non-map stage tasks (which do not scan files) in a round-robin way.
2. Assign map stage tasks (which scan files) based on the consistent hashing policy, using the hash value of the file name and the executor topology.
3. Assign tasks that scan files based on the consistent hashing policy, using the hash value of the file name and the executor topology, with N tolerance.

Future
- Scheduler HA
- Shuffle Improvement
  - Self-adjustable shuffle partition number
  - Sort-based shuffle writer for pull-based shuffling
  - Push-based shuffling

Reference
- Eon Mode: Bringing the Vertica Columnar Database to the Cloud. https:/
- Snowflake Elastic Data Warehouse. https://event.cwi.nl/lsde/papers/p215-dageville-snowflake.pdf
- Apache Arrow. https://arrow.apache.org/
- Apache Arrow DataFusion. https:/
- Arrow Ballista. https:/

Thank you!
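
Appendix: a sketch of the two task assignment policies

The Task Assignment slide compares the round-robin and bias policies only through their results for one round. The following is a minimal Rust sketch of how such a round might work, not Ballista's actual scheduler code. The names `Executor`, `Policy`, and `assign_one_round` are hypothetical. The free slot counts (7, 3, and 2) and the executor ordering used in the usage note are assumptions chosen so that the sketch reproduces the slide's example; the slides do not state them.

```rust
use std::collections::BTreeMap;

/// Hypothetical model of an executor: a name plus its currently free task slots.
#[derive(Debug, Clone)]
pub struct Executor {
    pub name: String,
    pub free_slots: usize,
}

/// Hypothetical enum for the two policies named on the slide.
#[derive(Debug, Clone, Copy)]
pub enum Policy {
    /// Take one slot from each executor in turn.
    RoundRobin,
    /// Drain the first executor that still has free slots before moving on.
    Bias,
}

/// Result of one round: job name -> executor name -> number of slots bound.
pub type Assignment = BTreeMap<String, BTreeMap<String, usize>>;

/// Bind pending tasks to free executor slots until tasks or slots run out.
/// `jobs` lists (job name, number of pending tasks) in scheduling order.
pub fn assign_one_round(
    executors: &mut [Executor],
    jobs: &[(&str, usize)],
    policy: Policy,
) -> Assignment {
    let mut result = Assignment::new();
    let mut cursor = 0usize; // next executor to try under round-robin
    for &(job, pending) in jobs {
        for _ in 0..pending {
            let picked = match policy {
                Policy::Bias => executors.iter().position(|e| e.free_slots > 0),
                Policy::RoundRobin => {
                    let n = executors.len();
                    // Scan at most one full cycle, starting at the cursor.
                    let found = (0..n)
                        .map(|k| (cursor + k) % n)
                        .find(|&i| executors[i].free_slots > 0);
                    if let Some(i) = found {
                        cursor = (i + 1) % n;
                    }
                    found
                }
            };
            let i = match picked {
                Some(i) => i,
                None => return result, // no free slot left in this round
            };
            executors[i].free_slots -= 1;
            *result
                .entry(job.to_string())
                .or_default()
                .entry(executors[i].name.clone())
                .or_default() += 1;
        }
    }
    result
}
```

Under these assumptions, calling `assign_one_round` with executors ordered executor_3 (7 free slots), executor_2 (3), executor_1 (2) and jobs `[("Job_a", 2), ("Job_b", 7)]` yields the same split as the table: round-robin spreads the nine tasks across all three executors, while bias fills executor_3 completely before touching executor_2 and leaves executor_1 idle.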