Data Ingestion, Fast and Slow
Dillon Bostwick, Databricks, 2023
© 2023 Databricks Inc. All rights reserved. Confidential and Proprietary.

The World's Data Is Real-Time
The World Needs Real-Time Analytics (And Not Just for Digital Transformation)

The Conventional "Batch vs. Streaming" Dichotomy Is Limiting

A Data Engineer's Life
From: Retailer CEO
To: Dillon Bostwick
Subject: Need an analysis ASAP!
"Great report! Can you update it every day? ... every minute?"
(Attachment: Weekly Sales Forecast)

Process Data in Real Time -> Process Data at the Right Time

Think Right-Time: Adjust Data Freshness to Business Need, at Any Speed
(Chart: cost rises as latency falls, across weeks, days, hours, minutes, seconds.)
1. Reduce the risk of decisions made on stale data
2. Reduce the risk of inaccurate models (drift)
3. Improve agility in the transition to real-time
4. Improve cost predictability

Step 1: Scoping Right-Time Use Cases
(Slide: a matrix of use cases and data sources across four verticals: Financial Services, Retail/Manufacturing, Media, Horizontal/Digital.)
Use cases: Fraud Detection, Recommenders, Concept Drift, Real-Time Bidding, Composable CDP, Regulatory Reporting, Intraday Valuation, Loan Approval, Inventory Rebalancing, Spoilage Reduction, On-Shelf Availability.
Data sources: stock ticker, trading, Point of Sale, ERP (inventory), OT (digital twin, IoT), DSP*, CRM, clickstream, user preferences, LTV.
*Demand Side Platform

Focus on data source characteristics and availability when prioritizing use cases: best-case freshness cannot exceed source availability.

Example: Fraud Detection
High Costs to Reduce False Negatives and Positives
(Source: https:/)

Why Fraud Detection Exemplifies the Challenges of Real-Time ML
- Adversarial nature of fraud
- Concept drift directly impacts revenue at risk
- Models may need to be retrained very quickly in response to drift
- Scoring latency directly impacts CX and revenue
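The drift point above can be made concrete: monitor live score distributions against the training-time baseline and trigger retraining when they diverge. This toy mean-shift check stands in for real drift tests (e.g., PSI or Kolmogorov-Smirnov) and is not from the deck; the threshold and data are invented for illustration:

```python
def drift_detected(baseline, recent, tolerance=0.1):
    """Flag drift when the mean of recent scores moves more than
    `tolerance` away from the training-time baseline mean."""
    baseline_mean = sum(baseline) / len(baseline)
    recent_mean = sum(recent) / len(recent)
    return abs(recent_mean - baseline_mean) > tolerance

# Fraud scores drifting upward as attackers adapt:
training_scores = [0.02, 0.03, 0.02, 0.04]
live_scores = [0.20, 0.25, 0.22, 0.30]
if drift_detected(training_scores, live_scores):
    print("retrain model")  # adversarial drift: schedule retraining
```

The faster such a check runs after data arrives, the smaller the window of revenue at risk, which is why scoring and monitoring latency matter together.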
Step 2: When to Use Streaming
- Stream processing is unbounded.
- Non-stream processing is one-off and bounded.
(Diagram: source-to-target flow for each.)
Don't get trapped in semantics!

Streaming Is Not Only for Low Latency
Stream processing can be applied to use cases of any latency; "batch" is a special case of streaming.

AvailableNow + ProcessingTime Triggers Enable the Full Latency Spectrum
Run a stream as a one-shot, batch-style job:

spark.readStream
  .format("delta")
  .option("maxFilesPerTrigger", 1)
  .load(inputDir)
  .writeStream
  .trigger(Trigger.AvailableNow())
  .option("checkpointLocation", checkpointDir)
  .start()

Or run the same pipeline continuously in fixed-interval micro-batches by changing only the trigger:

spark.readStream
  .format("delta")
  .option("maxFilesPerTrigger", 1)
  .load(inputDir)
  .writeStream
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .option("checkpointLocation", checkpointDir)
  .start()

Step 3: Use Spark & Delta to Unify Data Flow
(Diagram: the Lambda Architecture.)
- Spark: a single API unifies batch and streaming.
- Delta: ACID transactions enable concurrent batch and stream reads/writes.

Step 4: Use DLT to Simplify
(Alongside Unity Catalog, Repos, and Databricks Workflows.)
From query to production pipeline just by adding LIVE:

CREATE LIVE TABLE raw_data AS SELECT * FROM json.
CREATE LIVE TABLE clean_data AS SELECT ... FROM LIVE.raw_data
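The LIVE.raw_data reference above is how DLT infers execution order: it parses LIVE.* references into a dependency graph and runs tables in topological order. A toy illustration of that dependency resolution in plain Python (the report table is a hypothetical third stage, and this is not DLT's actual internals):

```python
from graphlib import TopologicalSorter

# Each LIVE table lists the LIVE tables it reads from:
pipeline = {
    "raw_data": set(),              # reads external json files
    "clean_data": {"raw_data"},     # SELECT ... FROM LIVE.raw_data
    "report": {"clean_data"},       # hypothetical downstream table
}

# Resolve a valid execution order from the declared dependencies
run_order = list(TopologicalSorter(pipeline).static_order())
print(run_order)  # ['raw_data', 'clean_data', 'report']
```

Declaring dependencies rather than scheduling them by hand is what lets the framework, not the engineer, own the orchestration.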
Delta Live Tables then handles dependency management, incremental computation, checkpointing and retries, expectations, and full refresh for you.

The same pipeline written without DLT:

CREATE TABLE raw_data AS SELECT * FROM json.
CREATE TABLE clean_data AS SELECT ... FROM raw_data

leaves you to hand-build dependency management, daily partition computation, checkpointing and retries, quality checks, governance, data discovery, backfill handling, version control, and deployment infrastructure.
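Among the hand-built concerns above, quality checks are what DLT expectations replace. Conceptually, an expectation is a row-level predicate plus a violation policy; a rough pure-Python approximation (DLT's real syntax is CONSTRAINT ... EXPECT in SQL or @dlt.expect in Python, and this helper is purely illustrative):

```python
def apply_expectation(rows, predicate, action="drop"):
    """Approximate a DLT expectation: keep rows passing `predicate`.
    'drop' discards violating rows; 'fail' aborts the whole update."""
    kept, violations = [], []
    for row in rows:
        (kept if predicate(row) else violations).append(row)
    if action == "fail" and violations:
        raise ValueError(f"{len(violations)} rows violated expectation")
    return kept, len(violations)

raw = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]
clean, dropped = apply_expectation(raw, lambda r: r["amount"] is not None)
print(clean, dropped)  # [{'id': 1, 'amount': 10.0}] 1
```

The point of declaring the predicate alongside the table is that the pipeline, not downstream consumers, records how many rows failed and why.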
Step 4: Use DLT to Simplify (table types and pipeline modes)
Triggered mode:
- STREAMING TABLE: incrementally add data at the speed of the business (Spark equivalent: AvailableNow)
- LIVE TABLE (MV): update batch data without fault tolerance (Spark equivalent: batch refresh)
Continuous mode:
- STREAMING TABLE: incrementally add data in real time* (Spark equivalent: fixed-interval micro-batch)
- LIVE TABLE (MV): intelligently materialize the data (Spark equivalent: batch refresh; equivalent with Enzyme: intelligent recompute)
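The matrix above condenses to a small lookup; the same content in quick-reference form (the key and value strings are paraphrased from the slide):

```python
# (table type, pipeline mode) -> closest Spark Structured Streaming analogue
DLT_SPARK_EQUIVALENTS = {
    ("streaming table", "triggered"): "Trigger.AvailableNow",
    ("streaming table", "continuous"): "fixed-interval micro-batch",
    ("live table (mv)", "triggered"): "batch refresh",
    ("live table (mv)", "continuous"): "batch refresh (intelligent recompute with Enzyme)",
}

print(DLT_SPARK_EQUIVALENTS[("streaming table", "triggered")])  # Trigger.AvailableNow
```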
Show the agility of changing pipeline modes in your demos! Start with the DLT demo here.
*Best-case latency is determined by the trigger interval, etc.

Project Enzyme (private preview)
Achieving right-time data freshness for streaming pipelines with large data volumes requires significant infrastructure spend or complex hand-coding. Enzyme automatically optimizes incremental updates.
(Diagram: Delta tracked changes feed query plan analysis and a cost model, which selects the optimal update technique: monotonic append, partition recompute, MERGE updates, or full recompute.)
Available in Private Preview on AWS, Azure, and GCP; GA coming soon.

Questions?
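Enzyme's selection step can be pictured as a cost-based choice among the update techniques named above. A purely illustrative sketch: the technique names come from the slide, but the decision rules, thresholds, and the change_profile fields are invented for illustration and are not DLT's actual cost model:

```python
def pick_update_technique(change_profile):
    """Toy stand-in for Enzyme's cost model: inspect Delta-tracked
    changes and choose the cheapest correct update technique."""
    if change_profile.get("appends_only"):
        return "monotonic append"       # new rows only: just append
    if change_profile.get("touched_partitions", 0) <= 2:
        return "partition recompute"    # changes localized to few partitions
    if change_profile.get("keyed_updates"):
        return "MERGE updates"          # scattered updates with stable keys
    return "full recompute"             # fall back to rebuilding the table

print(pick_update_technique({"appends_only": True}))     # monotonic append
print(pick_update_technique({"touched_partitions": 1}))  # partition recompute
```

The value of such a chooser is that pipeline authors keep writing declarative queries while the engine decides, per update, how much work is actually necessary.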