Optimizing Batch and Streaming Aggregations.pdf

Uploaded by: 2*** | ID: 139024 | 2023-06-04 | PDF | 28 pages | 648.58 KB

Part of the collection: Data+AI Summit 2023 presentation slides

Optimizing Batch and Streaming Aggregations
Jacek Laskowski / jaceklaskowski
Data+AI Summit 2023

About the Speaker
- Jacek Laskowski is a Freelance IT Consultant
- Specializing in Apache Spark, Delta Lake, Databricks, Apache Kafka (incl. Kafka Streams and ksqlDB)
- Best known by The Internals Of online books
- Contact me at jacek@japila.pl
- Follow me at @JacekLaskowski
- Connect on LinkedIn

Table of Contents
1. The Intro to The Internals of Structured Queries
2. The Internals of Aggregate Queries
3. Scala UDAFs and Aggregators
4. Streaming Aggregates
5. Streaming Aggregates Performance Tuning Gig
6. Things to Watch Out For (Recap)

1. The Intro to The Internals of Structured Queries

Structured Queries
- Apache Spark is a general-purpose distributed compute platform
- Spark SQL is a module of Apache Spark to describe batch queries over structured and semi-structured datasets (of any size)
- Spark Structured Streaming is a module of Apache Spark for streaming queries over unbounded data
- Queries are described using High-Level Query Operators
  - DataFrame API
  - SQL
- In most cases, optimizing streaming queries is to optimize corresponding batch queries
  - No need to focus on streaming features (less to worry about)
  - Caveat: streaming issues may really be related to how streaming queries work

High-Level Query Language - DataFrame API

High-Level Query Language - SQL

QueryExecution
- QueryExecution is the execution pipeline (workflow) of a structured query
- Made up of execution phases

Logical and Physical Operators
- Logical Operators are building blocks of logical query plans in Spark SQL
  - Aggregate
  - Join
  - LocalRelation
  - LogicalRDD
  - MergeIntoTable
  - Project
  - Sort
- Physical Operators are executable nodes of physical query plans in Spark SQL
  - AdaptiveSparkPlanExec
  - BroadcastHashJoinExec
  - HashAggregateExec
  - ObjectHashAggregateExec
  - ProjectExec
  - SortAggregateExec

2. The Internals of Aggregate Queries

Aggregate Queries
- Calculate single value for a set of rows
- Can be broken down to the following sections
  - Grouping (using GROUP BY clause in SQL or Dataset.groupBy operator) that arranges rows into groups (possibly guarded by HAVING SQL clause)
  - Aggregation (using Aggregate Functions) to apply to a set of rows and calculate single values per groups
- Structured queries with Aggregate logical operator
  - GROUP BY (incl. GROUPING SETS, WITH CUBE, WITH ROLLUP)
  - Dataset.groupBy
  - Dataset.groupByKey
- Basic Aggregation
- Multi-Dimensional Aggregation
- Window Aggregation (not covered)

Aggregate Query Example

Logical Query Plan - Aggregate Operator

Physical Query Plan - BaseAggregateExec Operators
- 3 Possible Aggregate Physical Operators (from the fastest to the slowest)
  - HashAggregateExec
  - ObjectHashAggregateExec
  - SortAggregateExec
- Aggregation Execution Planning Strategy

HashAggregateExec Physical Operator
- Hash-Based Aggregation
- Uses UnsafeRow keys and values
- Uses TungstenAggregationIterator
- Uses UnsafeFixedWidthAggregationMap until it runs out of memory (and starts spilling to disk) and eventually switches to Sort-Based Aggregation
- number of sort fallback tasks Performance Metric
- Supports Whole-Stage Java Code Generation

ObjectHashAggregateExec Physical Operator
- Object Aggregation
- Uses Scala custom object keys and values
- Uses ObjectAggregationIterator
- Uses ObjectAggregationMap until it crosses spark.sql.objectHashAggregate.sortBased.fallbackThreshold and switches to Sort-Based Aggregation
- number of sort fallback tasks Performance Metric
- Very similar to HashAggregateExec
- No Support for Whole-Stage Java Code Generation

SortAggregateExec Physical Operator
- Sort-Based Aggregation
- Uses SortBasedAggregationIterator
- The least performant out of 3 aggregate operators available
- The last resort when the others could not be used
- Supports Whole-Stage Java Code Generation

number of sort fallback tasks Metric
- Hash- vs Sort-Based Aggregations
- HashAggregateExec
  - The underlying UnsafeFixedWidthAggregationMap ran out of memory to hold key-value pairs
- ObjectHashAggregateExec
  - spark.sql.objectHashAggregate.sortBased.fallbackThreshold
  - The number of entries in an in-memory hash map (to store aggregation buffers per grouping keys) before ObjectHashAggregateExec falls back to sort-based aggregation
  - Default: 128 (key-value pairs)

3. Scala UDAFs and Aggregators

Scala UDAFs and Aggregators
- ScalaUDAF is an ImperativeAggregate expression for User-Defined Aggregate Functions
- Aggregator should now be registered as a UDF via the functions.udaf(agg) method
- Only 2 Possible Aggregate Physical Operators (out of HashAggregateExec, ObjectHashAggregateExec, SortAggregateExec)
- UDFs (incl. UDAFs) are black boxes to the Catalyst Optimizer
  - Whatever happens in UDFs, stays in UDFs
  - Don't Use Them Unless You've Got No Choice
  - Use Column Expressions (not Scala code) instead

4. Streaming Aggregates

Streaming Aggregates
- The very same high-level operators yet with Streaming Data Sources (Connectors)
  - Dataset.groupBy
  - GROUP BY
- Extra streaming-related issues
  - Output Modes
  - Late Data, Watermarking and Event Time
  - Limitations related to aggregate queries
  - Trigger.ProcessingTime
- Enough to think about ("scare you") while optimizing streaming queries
- Among Streaming Connectors is Kafka Connector

Kafka Connector
- Support for Apache Kafka
- Loading records from topics (Kafka Consumer)
- Writing out records to topics (Kafka Producer)
- The number of partitions of a streaming DataFrame is the number of partitions of input topic(s)

5. Streaming Aggregates Performance Tuning Gig

I had a client
- Scala Devs
- Streaming Aggregates
- UDAFs
- Kafka Connector
- ObjectHashAggregateExec
- The number of partitions is the number of topic partitions
- spark.sql.objectHashAggregate.sortBased.fallbackThreshold
- number of sort fallback tasks Metric

My Recommendation
- Rewrite UDAFs to Columns (Catalyst Expressions)
- Repartition
- Shorten Trigger.ProcessingTime

6. Things to Watch Out For (Recap)

Recap
1. There are 3 possible Aggregate Physical Operators
   a. HashAggregateExec
   b. ObjectHashAggregateExec
   c. SortAggregateExec
2. Use Column Expressions (no Scala/Python/Java code)
3. Whole-Stage Java Code Generation
   a. HashAggregateExec
   b. SortAggregateExec
4. Don't use UDAFs
   a. Unless you've got no choice
5. Use Mutable DataTypes for grouping keys
   a. Basic Primitives (e.g. Boolean, Byte, Dates, Numbers)
   b. Promote HashAggregateExec
6. Observe number of sort fallback tasks Metric
7. Consider Repartitioning
8. Learn aggregate-related Configuration Properties

That's all folks! Thank you! Any questions?
Jacek Laskowski / jaceklaskowski / jacek@japila.pl
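
The recap points about the number of sort fallback tasks metric and repartitioning can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions, not Spark's implementation: `aggregate_partition`, `count_sort_fallback_tasks` and `FALLBACK_THRESHOLD` are hypothetical names, a dict stands in for the in-memory hash map, integer keys are assumed, and the threshold borrows the default of 128 quoted for spark.sql.objectHashAggregate.sortBased.fallbackThreshold. Each partition is treated as one task, and a task "falls back" when it meets a new grouping key while the map is already full.

```python
from itertools import groupby
from operator import itemgetter

# Assumption: mirrors the default of 128 entries quoted in the slides.
FALLBACK_THRESHOLD = 128


def aggregate_partition(rows, threshold=FALLBACK_THRESHOLD):
    """Sum values per key for one partition (one task) of (key, value) rows.

    Returns (totals, fell_back). Toy model only, not Spark's code.
    """
    buffers = {}
    rows = iter(rows)
    for key, value in rows:
        if key not in buffers and len(buffers) >= threshold:
            # The hash map is full and a new key arrived: fall back to
            # sort-based aggregation. The partial buffers, the current row
            # and the rest of the input are sorted by key and merged.
            pending = list(buffers.items()) + [(key, value)] + list(rows)
            pending.sort(key=itemgetter(0))
            totals = {
                k: sum(v for _, v in group)
                for k, group in groupby(pending, key=itemgetter(0))
            }
            return totals, True
        buffers[key] = buffers.get(key, 0) + value
    return buffers, False


def count_sort_fallback_tasks(rows, num_partitions, threshold=FALLBACK_THRESHOLD):
    """Partition rows by integer key and count the tasks that fell back."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[key % num_partitions].append((key, value))
    return sum(aggregate_partition(p, threshold)[1] for p in partitions)
```

In this model, 1,000 distinct keys spread over 2 partitions leave 500 keys per task, so both tasks fall back, while 8 partitions leave 125 keys per task and none does. That is the intuition behind pairing the metric with the Repartition recommendation.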

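This talk covers the internals of structured queries in Apache Spark and how to optimize batch and streaming aggregations. The speaker, Jacek Laskowski, is a freelance IT consultant specializing in Apache Spark, Delta Lake, Databricks and Apache Kafka. The talk opens with an overview of Spark SQL, the module for describing batch queries over structured and semi-structured datasets, and Spark Structured Streaming, the module for streaming queries over unbounded data. It then walks through the internals of aggregate queries, including logical and physical operators and the use of aggregate functions. It also discusses performance tuning of streaming aggregations and the issues to watch out for along the way. It closes with a list of recommendations, such as avoiding Scala UDAFs, using basic primitive types (e.g. Boolean, Byte, dates, numbers) for grouping keys, and observing the number of sort fallback tasks metric.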
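
The advice to avoid UDAFs rests on the point that user-defined functions are black boxes to the optimizer. A minimal sketch of that idea in plain Python follows. It is a toy expression tree with a single constant-folding rule, not Catalyst, and every name in it (`Lit`, `Col`, `Add`, `Opaque`, `optimize`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Lit:
    """A literal value the optimizer can inspect."""
    value: int


@dataclass(frozen=True)
class Col:
    """A reference to a column by name."""
    name: str


@dataclass(frozen=True)
class Add:
    """Addition of two sub-expressions, visible to the optimizer."""
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Opaque:
    """Stand-in for a UDF: arbitrary code the optimizer cannot look into."""
    fn: Callable[[dict], int]


Expr = Union[Lit, Col, Add, Opaque]


def optimize(expr):
    """Fold additions of two literals; leave everything else untouched."""
    if isinstance(expr, Add):
        left, right = optimize(expr.left), optimize(expr.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    # Col, Lit and Opaque nodes are returned as they are. For Opaque the
    # optimizer has no other option, since it only sees a callable.
    return expr
```

Here `optimize(Add(Col("x"), Add(Lit(1), Lit(2))))` simplifies to `Add(Col("x"), Lit(3))`, whereas `Opaque(lambda row: row["x"] + 1 + 2)` computes the same value but comes back unchanged, because the toy optimizer has nothing to inspect.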
