Migrating and Optimizing Large-Scale Streaming Applications with Databricks (duplicate).pdf


MIGRATING AND OPTIMIZING LARGE-SCALE STREAMING APPLICATIONS WITH DATABRICKS
Sharif Doghmi, Donghui Li
June 2024
2024 Databricks Inc. All rights reserved

WHO WE ARE
Sharif Doghmi and Donghui Li
Lead Software Engineer, FreeWheel; Software Engineer, FreeWheel

SESSION OVERVIEW
- FreeWheel, Beeswax, and Programmatic Advertising
- Beeswax Data Platform
- Redesign and Migration of Streaming Application
- Monitoring and Observability
- Performance Optimization
- Databricks Development Experience
- Future Databricks Work

FREEWHEEL, BEESWAX, AND PROGRAMMATIC ADVERTISING

- Comprehensive ad platforms for publishers, advertisers, and media buyers
- We make it easier for buyers and sellers to transact
- Products for demand and supply sides of advertising
- We serve many of the largest media companies in the U.S. and Europe
- We power advertising for many of the biggest sporting events

PROGRAMMATIC ADVERTISING 101
Real Time Bidding

FreeWheel's Programmatic Buying Platform
Enables demand-side customers to:
- Plan and execute ad campaigns
- Bid on ad inventory using RTB
- Monitor, analyze, and optimize their ad campaigns
Built on the principles of transparency, customizability, and control
Key offerings:
- DSP: highly customizable, out of the box
- BaaS: fully customizable own bidder, single-tenant architecture

BEESWAX DATA PLATFORM

BEESWAX HIGH-LEVEL ARCHITECTURE

ABOUT OUR DATA
Main ad event types received by the data platform
- Bid Request
- Bid Response
- Impression
- Click
- Activity
- Conversion
- Much less frequent: Bot Clicks, VAST Error, SKAd, and more

DATA PLATFORM
Overview
- Consumes ad events and ad campaign data
- Filters, deduplicates, anonymizes, joins, enriches, and aggregates data
- Acts as the single source of truth for data
- Enables customers to monitor, analyze, and forecast campaign performance
Provides:
- Real-time visual analytics: bid/win/spend stats and charts
- High-performance interactive analytics: auction, bid, and win multi-dimensional analysis
- Raw event-level data: streaming and batch delivery, Snowflake shared tables
- Aggregate reports and dashboards: Looker-based UI and API, batch file delivery

DATA PLATFORM ARCHITECTURE

DATA SCALE
Waggle Streaming Application
- Every second: 3 million input ad events, 5 GB of input data
- Every day: 250 billion input ad events, 4 TB of input data
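As a rough consistency check on the event counts above, a sustained per-second rate can be scaled to a day. The sketch below assumes the 3 million events per second figure is a steady average rather than a peak, which the slide does not state; `DataScaleCheck` and `eventsPerDay` are illustrative names.

```scala
object DataScaleCheck {
  // Number of seconds in one day: 24 h * 60 min * 60 s = 86,400.
  val SecondsPerDay: Long = 24L * 60L * 60L

  /** Events per day implied by a sustained events-per-second rate. */
  def eventsPerDay(eventsPerSecond: Long): Long =
    eventsPerSecond * SecondsPerDay
}
```

`DataScaleCheck.eventsPerDay(3000000L)` evaluates to 259,200,000,000, so 3 million events per second is in line with the roughly 250 billion events per day quoted on the slide.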

REDESIGN AND MIGRATION OF STREAMING APPLICATION

WAGGLE
Streaming application at the frontline of the Data Platform
- Waggle: name of the dance honeybees use to communicate with each other
- Reads data in real time from a Kinesis stream written to by upstream data sources
- Massive application with over 30K lines of code
- Java KCL application migrated to Scala/Java Spark Structured Streaming
- Transforms, filters, anonymizes, and joins ad events
- Joins using an Aerospike database; Waggle itself remains stateless
- Routes transformed ad events to S3, Kinesis, and HTTP endpoints
- Delivers streaming ad logs and writes data to our warehouse

DATA PLATFORM ARCHITECTURE

REDESIGN AND MIGRATION OBJECTIVES
Target KPIs
- System stability
- Cost inefficiencies
- Avoidable maintenance time
- Data quality dimensions:
  - Accuracy
  - Completeness
  - Timeliness
  - Delivery semantics

REDESIGN AND MIGRATION STRATEGY
To improve target KPIs
- Rearchitect system using top-down design
- Migrate from Kinesis Client Library (KCL) to Spark Structured Streaming on Databricks:
  - Eliminate memory leaks
  - Significantly reduce instance failures
  - Eliminate concurrency issues
  - Managed infrastructure
  - Built-in micro-batching
- Use Scala for new code and convert some of the Java code to Scala

RESULTS
Target KPIs
- System stability: improved
- Cost inefficiencies: mostly eliminated
- Avoidable maintenance: greatly reduced
- Data quality dimensions:
  - Accuracy: improved
  - Completeness: improved
  - Timeliness: improved
  - Delivery semantics: closer to exactly once

APPLICATION STREAMING LOGIC
High Level Outline (Scala)

    import org.apache.spark.sql.{DataFrame, Dataset}
    import org.apache.spark.sql.streaming.Trigger

    // kinesisReader, appConfig, decodeAndTransform, and writeRecordsToS3 are
    // application code that the outline does not show.
    val streamingDataFrame = kinesisReader.getStreamingData(appConfig)
    val dataStreamWriter = streamingDataFrame.writeStream.trigger(Trigger.ProcessingTime("60 seconds"))

    dataStreamWriter.foreachBatch { (df: DataFrame, batchId: Long) =>
      df.persist() // the micro-batch is reused by several outputs below
      val waggleMessageDS: Dataset[WaggleMessage] = df.as[KinesisRecord].map(decodeAndTransform)
      val s3MessageDS: Dataset[S3Message] = waggleMessageDS.flatMap(_.s3Message)

      // Within each partition, group records by destination and write each group once.
      s3MessageDS.foreachPartition { msgs: Iterator[S3Message] =>
        msgs.toVector.groupBy(_.destination).foreach { case (destination, records) =>
          writeRecordsToS3(destination, records)
        }
      }

      // output to other endpoints
      df.unpersist()
    }
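The per-partition step in the outline (collect the partition, group by destination, write each group once) can be tried without a Spark cluster. The following is a minimal sketch under stated assumptions: `S3Message` here is a hypothetical two-field stand-in for the application's type, and `write` stands in for `writeRecordsToS3`, whose real signature the deck does not show.

```scala
object PartitionRouting {
  // Hypothetical stand-in for the application's S3Message type.
  final case class S3Message(destination: String, payload: String)

  /** Groups one partition's messages by destination and calls `write` once per group. */
  def routePartition(msgs: Iterator[S3Message])(
      write: (String, Vector[S3Message]) => Unit): Unit =
    // toVector materializes the whole partition in memory before grouping.
    msgs.toVector.groupBy(_.destination).foreach { case (destination, records) =>
      write(destination, records)
    }
}
```

Grouping before writing means each destination receives one larger write per partition instead of one write per record. The trade-off is that the whole partition is held in memory, so partition sizes need to stay bounded.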

MONITORING AND OBSERVABILITY
- Monitoring is crucial for ensuring system reliability and performance
- Challenges:
  - Scalability
  - Flexibility
  - Reliability
  - Compatibility
- Traditional monitoring solutions often struggle to address these challenges effectively

MONITORING AND OBSERVABILITY
Prometheus monitoring
- A widely used open-source monitoring and alerting tool
- Designed for reliability, scalability, and easy integration with other tools
- Key features:
  - Multi-dimensional data model
  - PromQL
  - Pull-based metrics collection
  - Service discovery
  - Grafana visualization and alerting management

MONITORING AND OBSERVABILITY
Architecture Diagram

MONITORING AND OBSERVABILITY
Implementation
- Use the secondary IP address to scrape for Prometheus metrics
- Prometheus EC2 service discovery only gets the primary IP address
- Our workaround:
  - Have an EC2 tag for the secondary IP address
  - Relabel IP address to replace the primary with the secondary IP address
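The relabeling workaround can be pictured as a rewrite of the target's address label. In Prometheus this would be a relabeling rule in the scrape configuration; the Scala below only models its effect on a label set. `__address__` and the `__meta_ec2_tag_<key>` prefix are standard Prometheus label names, while the tag key `secondary_ip` and the object and function names are assumptions made for illustration.

```scala
object RelabelSketch {
  // Label that Prometheus uses as the scrape address (host:port).
  val AddressLabel = "__address__"
  // EC2 tags surface as __meta_ec2_tag_<key>; the key "secondary_ip" is hypothetical.
  val SecondaryIpTagLabel = "__meta_ec2_tag_secondary_ip"

  /** Replaces the host part of the address with the tagged secondary IP, keeping the port. */
  def relabel(labels: Map[String, String]): Map[String, String] =
    (labels.get(AddressLabel), labels.get(SecondaryIpTagLabel)) match {
      case (Some(address), Some(secondaryIp)) if secondaryIp.nonEmpty =>
        val port = address.lastIndexOf(':') match {
          case -1  => ""                     // address carried no port
          case idx => address.substring(idx) // includes the leading ':'
        }
        labels.updated(AddressLabel, secondaryIp + port)
      case _ =>
        labels // no tag present: leave the target unchanged
    }
}
```

Targets without the tag pass through unchanged, so instances that have no secondary IP would still be scraped on their primary address.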

MONITORING AND OBSERVABILITY
Implementation
- Incorporated Spark Dropwizard metrics to monitor the underlying Spark platform
- Created another Grafana dashboard as a central place for Databricks logs
- Configured Databricks to send logs to AWS S3
- Ingested logs from S3 into Grafana Cloud
- Grafana dashboard provides searching and filtering capabilities

PERFORMANCE OPTIMIZATION

PERFORMANCE OPTIMIZATION
- Used Kryo serialization
  - Kryo is faster than Java serialization
  - Spark uses Java serialization by default
- Memory optimization
  - Used lookup table for reusable Java objects
  - Kept only necessary information for output in DataFrame/Dataset
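One way to read the "lookup table for reusable Java objects" item is as a shared-instance cache: build an object once per key and hand the same instance to every record that needs it. The deck does not show the actual implementation, so the sketch below is only an assumption about its shape; `LookupTable` and `getOrCreate` are illustrative names.

```scala
import scala.collection.concurrent.TrieMap

/** Thread-safe lookup table that keeps a single shared instance per key. */
final class LookupTable[K, V] {
  private val table = TrieMap.empty[K, V]

  /** Returns the stored instance for `key`, building it only if none exists yet. */
  def getOrCreate(key: K)(build: => V): V =
    table.getOrElseUpdate(key, build)

  /** Number of distinct instances currently held. */
  def size: Int = table.size
}
```

This only pays off when the shared objects are immutable or treated as read-only, because every caller sees the same instance.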

PERFORMANCE OPTIMIZATION
Kinesis parameters, for micro-batch interval = 60 seconds

Parameter | Meaning | Default | What we use
maxFetchRate | Maximum prefetch data rate | 1.0 | 2.0
maxFetchDuration | Buffer time of prefetched data before data is made available for processing | 10s | 30s
shardsPerTask | Number of Kinesis shards each Spark task handles | 5 | 15
coalesceThresholdBlockSize | The threshold at which automatic coalesce occurs | 10MB | 50MB
coalesceBinSize | Approximate block size after coalescing | 128MB | 128MB
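The tuned values in the table can be collected as a plain options map that a stream reader would receive. Option names and values come from the table; the string encodings (a duration suffix such as `30s`, and sizes written as decimal byte counts) are assumptions and should be checked against the connector documentation. `KinesisTuning` is an illustrative name.

```scala
object KinesisTuning {
  // Sizes are given in MB on the slide; decimal megabytes are assumed here.
  private val BytesPerMB: Long = 1000L * 1000L

  /** Reader options holding the "What we use" column of the table. */
  val options: Map[String, String] = Map(
    "maxFetchRate"               -> "2.0",
    "maxFetchDuration"           -> "30s",
    "shardsPerTask"              -> "15",
    "coalesceThresholdBlockSize" -> (50L * BytesPerMB).toString,
    "coalesceBinSize"            -> (128L * BytesPerMB).toString
  )
}
```

With a `SparkSession` in scope, a map like this could be passed through `spark.readStream.format("kinesis").options(KinesisTuning.options)`, alongside the stream name and region settings that the deck does not list.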

PERFORMANCE OPTIMIZATION
- Parallel processing
  - Driver: simultaneous DataFrame processing
  - Executor: simultaneous S3 writes within custom Java code
- Caching
  - DataFrame
  - Reusable AWS resources (S3 clients, Kinesis producers, etc.)
- Future: stream pipelining, serverless

PERFORMANCE OPTIMIZATION

    2024-05-23T15:42:30.641Z S3 batch 73720/78706 - BEGIN
    2024-05-23T15:42:30.849Z S3 batch 73720/78706 - 51 log config identifiers.
    2024-05-23T15:42:32.212Z S3 batch 73720/78706 - took 0.310 seconds, fileSize=249.4 kB
    2024-05-23T15:42:32.239Z S3 batch 73720/78706 - took 0.293 seconds, fileSize=217.6 kB
    2024-05-23T15:42:32.266Z S3 batch 73720/78706 - took 0.392 seconds, fileSize=414.7 kB
    2024-05-23T15:42:32.324Z S3 batch 73720/78706 - took 0.261 seconds, fileSize=231.0 kB
    2024-05-23T15:42:32.336Z S3 batch 73720/78706 - took 0.298 seconds, fileSize=95.6 kB
    2024-05-23T15:42:32.372Z S3 batch 73720/78706 - took 0.354 seconds, fileSize=399.2 kB
    2024-05-23T15:42:32.408Z S3 batch 73720/78706 - took 0.541 seconds, fileSize=706.9 kB
    2024-05-23T15:42:32.447Z S3 batch 73720/78706 - took 0.502 seconds, fileSize=592.1 kB
    2024-05-23T15:42:32.457Z S3 batch 73720/78706 - took 0.590 seconds, fileSize=646.8 kB
    2024-05-23T15:42:32.504Z S3 batch 73720/78706 - took 0.637 seconds, fileSize=636.1 kB
    2024-05-23T15:42:32.623Z S3 batch 73720/78706 - took 0.640 seconds, fileSize=692.5 kB
    2024-05-23T15:42:32.758Z S3 batch 73720/78706 - took 0.663 seconds, fileSize=908.6 kB
    2024-05-23T15:42:32.964Z S3 batch 73720/78706 - took 0.939 seconds, fileSize=1.5 MB
    2024-05-23T15:42:32.971Z S3 batch 73720/78706 - took 1.146 seconds, fileSize=1.9 MB
    2024-05-23T15:42:33.005Z S3 batch 73720/78706 - took 1.051 seconds, fileSize=1.6 MB
    2024-05-23T15:42:33.123Z S3 batch 73720/78706 - took 1.203 seconds, fileSize=3.1 MB
    2024-05-23T15:42:33.300Z S3 batch 73720/78706 - took 1.385 seconds, fileSize=3.7 MB
    2024-05-23T15:42:33.318Z S3 batch 73720/78706 - took 1.433 seconds, fileSize=3.5 MB
    2024-05-23T15:42:33.318Z S3 batch 73720/78706 - END 2.677 seconds
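In the S3 batch log, the 18 write lines shown add up to about 12.6 seconds of write time, while the batch reports 2.677 seconds from BEGIN to END, which is what overlapping writes would produce. A minimal sketch of simultaneous writes on one executor follows, using Scala futures on a fixed thread pool. The deck says the real code is custom Java, so `writeAll`, the pool size, and the 5-minute timeout are all assumptions.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ParallelWrites {
  /** Runs `write` for every item concurrently and returns the results in input order. */
  def writeAll[A, B](items: Seq[A], parallelism: Int)(write: A => B): Seq[B] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      // Start every write up front so they overlap, then wait for all of them.
      val pending = items.map(item => Future(write(item)))
      Await.result(Future.sequence(pending), 5.minutes)
    } finally {
      pool.shutdown() // release the threads even if a write failed
    }
  }
}
```

In a long-running streaming job the pool would more likely be created once per executor and reused across batches, in the same spirit as the cached S3 clients and Kinesis producers listed under Caching.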

DATABRICKS DEVELOPMENT EXPERIENCE

DATABRICKS DEVELOPMENT EXPERIENCE
Useful Features
- Intuitive APIs (for application code)
- Databricks Terraform Provider (for IaC)
- Databricks Asset Bundles (for CI/CD)
- DatabricksIQ-powered features:
  - Databricks Assistant
  - AI-Generated Comments

DATABRICKS ASSISTANT
- Helps quickly diagnose problems
- Provides useful code examples

AI-GENERATED COMMENTS
- Accurate
- Saves time
- Encourages good habits

FUTURE DATABRICKS WORK
Trying out these upcoming Databricks features:
- Serverless compute for streaming workflows
- Autoscaling for serverless streaming workflows
More Databricks Compute and Unity Catalog adoption

THANK YOU!
