Data Ingestion, Fast and Slow
Dillon Bostwick, Databricks, 2023
© 2023 Databricks Inc. All rights reserved. Confidential and Proprietary.

The World's Data Is Real-Time
The World Needs Real-Time Analytics (And Not Just for Digital Transformation)

The Conventional "Batch vs. Streaming" Dichotomy Is Limiting

A Data Engineer's Life
From: Retailer CEO
To: Dillon Bostwick
Subject: Need an analysis ASAP!
"Great report! Can you update it every day? ... every minute?"
(Attachment: Weekly Sales Forecast)

Process Data in Real Time -> Process Data at the Right Time

Think Right-Time: Adjust Data Freshness to Business Need, at Any Speed
(Chart: cost rises as latency falls, across weeks, days, hours, minutes, seconds.)
1. Reduce the risk of decisions made on stale data
2. Reduce the risk of inaccurate models (drift)
3. Improve agility in the transition to real-time
4. Improve cost predictability

Step 1: Scoping Right-Time Use Cases
(Slide: a matrix of use cases and data sources across four verticals: Financial Services, Retail/Manufacturing, Media, Horizontal/Digital.)
Use cases: Fraud Detection, Recommenders, Concept Drift, Real-Time Bidding, Composable CDP, Regulatory Reporting, Intraday Valuation, Loan Approval, Inventory Rebalancing, Spoilage Reduction, On-Shelf Availability.
Data sources: stock ticker, trading, Point of Sale, ERP (inventory), OT (digital twin, IoT), DSP*, CRM, clickstream, user preferences, LTV.
*Demand Side Platform

Focus on data source characteristics and availability when prioritizing use cases: best-case freshness cannot exceed source availability.

Example: Fraud Detection
High Costs to Reduce False Negatives and Positives
(Source: https:/)

Why Fraud Detection Exemplifies the Challenges of Real-Time ML
- Adversarial nature of fraud
- Concept drift directly impacts revenue at risk
- Models may need to be retrained very quickly in response to drift
- Scoring latency directly impacts CX and revenue
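The drift point above can be made concrete: monitor live score distributions against the training-time baseline and trigger retraining when they diverge. This toy mean-shift check stands in for real drift tests (e.g., PSI or Kolmogorov-Smirnov) and is not from the deck; the threshold and data are invented for illustration:

```python
def drift_detected(baseline, recent, tolerance=0.1):
    """Flag drift when the mean of recent scores moves more than
    `tolerance` away from the training-time baseline mean."""
    baseline_mean = sum(baseline) / len(baseline)
    recent_mean = sum(recent) / len(recent)
    return abs(recent_mean - baseline_mean) > tolerance

# Fraud scores drifting upward as attackers adapt:
training_scores = [0.02, 0.03, 0.02, 0.04]
live_scores = [0.20, 0.25, 0.22, 0.30]
if drift_detected(training_scores, live_scores):
    print("retrain model")  # adversarial drift: schedule retraining
```

The faster such a check runs after data arrives, the smaller the window of revenue at risk, which is why scoring and monitoring latency matter together.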
Step 2: When to Use Streaming
- Stream processing is unbounded.
- Non-stream processing is one-off and bounded.
(Diagram: source-to-target flow for each.)
Don't get trapped in semantics!

Streaming Is Not Only for Low Latency
Stream processing can be applied to use cases of any latency; "batch" is a special case of streaming.

AvailableNow + ProcessingTime Triggers Enable the Full Latency Spectrum
Run a stream as a one-shot, batch-style job:

spark.readStream
  .format("delta")
  .option("maxFilesPerTrigger", 1)
  .load(inputDir)
  .writeStream
  .trigger(Trigger.AvailableNow())
  .option("checkpointLocation", checkpointDir)
  .start()

Or run the same pipeline continuously in fixed-interval micro-batches by changing only the trigger:

spark.readStream
  .format("delta")
  .option("maxFilesPerTrigger", 1)
  .load(inputDir)
  .writeStream
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .option("checkpointLocation", checkpointDir)
  .start()

Step 3: Use Spark & Delta to Unify Data Flow
(Diagram: the Lambda Architecture.)
- Spark: a single API unifies batch and streaming.
- Delta: ACID transactions enable concurrent batch and stream reads/writes.

Step 4: Use DLT to Simplify
(Alongside Unity Catalog, Repos, and Databricks Workflows.)
From query to production pipeline just by adding LIVE:

CREATE LIVE TABLE raw_data AS SELECT * FROM json.
CREATE LIVE TABLE clean_data AS SELECT ... FROM LIVE.raw_data
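The LIVE.raw_data reference above is how DLT infers execution order: it parses LIVE.* references into a dependency graph and runs tables in topological order. A toy illustration of that dependency resolution in plain Python (the report table is a hypothetical third stage, and this is not DLT's actual internals):

```python
from graphlib import TopologicalSorter

# Each LIVE table lists the LIVE tables it reads from:
pipeline = {
    "raw_data": set(),              # reads external json files
    "clean_data": {"raw_data"},     # SELECT ... FROM LIVE.raw_data
    "report": {"clean_data"},       # hypothetical downstream table
}

# Resolve a valid execution order from the declared dependencies
run_order = list(TopologicalSorter(pipeline).static_order())
print(run_order)  # ['raw_data', 'clean_data', 'report']
```

Declaring dependencies rather than scheduling them by hand is what lets the framework, not the engineer, own the orchestration.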
Delta Live Tables then handles dependency management, incremental computation, checkpointing and retries, expectations, and full refresh for you.

The same pipeline written without DLT:

CREATE TABLE raw_data AS SELECT * FROM json.
CREATE TABLE clean_data AS SELECT ... FROM raw_data

leaves you to hand-build dependency management, daily partition computation, checkpointing and retries, quality checks, governance, data discovery, backfill handling, version control, and deployment infrastructure.
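Among the hand-built concerns above, quality checks are what DLT expectations replace. Conceptually, an expectation is a row-level predicate plus a violation policy; a rough pure-Python approximation (DLT's real syntax is CONSTRAINT ... EXPECT in SQL or @dlt.expect in Python, and this helper is purely illustrative):

```python
def apply_expectation(rows, predicate, action="drop"):
    """Approximate a DLT expectation: keep rows passing `predicate`.
    'drop' discards violating rows; 'fail' aborts the whole update."""
    kept, violations = [], []
    for row in rows:
        (kept if predicate(row) else violations).append(row)
    if action == "fail" and violations:
        raise ValueError(f"{len(violations)} rows violated expectation")
    return kept, len(violations)

raw = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]
clean, dropped = apply_expectation(raw, lambda r: r["amount"] is not None)
print(clean, dropped)  # [{'id': 1, 'amount': 10.0}] 1
```

The point of declaring the predicate alongside the table is that the pipeline, not downstream consumers, records how many rows failed and why.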
Step 4: Use DLT to Simplify (table types and pipeline modes)
Triggered mode:
- STREAMING TABLE: incrementally add data at the speed of the business (Spark equivalent: AvailableNow)
- LIVE TABLE (MV): update batch data without fault tolerance (Spark equivalent: batch refresh)
Continuous mode:
- STREAMING TABLE: incrementally add data in real time* (Spark equivalent: fixed-interval micro-batch)
- LIVE TABLE (MV): intelligently materialize the data (Spark equivalent: batch refresh; equivalent with Enzyme: intelligent recompute)
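The matrix above condenses to a small lookup; the same content in quick-reference form (the key and value strings are paraphrased from the slide):

```python
# (table type, pipeline mode) -> closest Spark Structured Streaming analogue
DLT_SPARK_EQUIVALENTS = {
    ("streaming table", "triggered"): "Trigger.AvailableNow",
    ("streaming table", "continuous"): "fixed-interval micro-batch",
    ("live table (mv)", "triggered"): "batch refresh",
    ("live table (mv)", "continuous"): "batch refresh (intelligent recompute with Enzyme)",
}

print(DLT_SPARK_EQUIVALENTS[("streaming table", "triggered")])  # Trigger.AvailableNow
```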
Show the agility of changing pipeline modes in your demos! Start with the DLT demo here.
*Best-case latency is determined by the trigger interval, etc.

Project Enzyme (private preview)
Achieving right-time data freshness for streaming pipelines with large data volumes requires significant infrastructure spend or complex hand-coding. Enzyme automatically optimizes incremental updates.
(Diagram: Delta tracked changes feed query plan analysis and a cost model, which selects the optimal update technique: monotonic append, partition recompute, MERGE updates, or full recompute.)
Available in Private Preview on AWS, Azure, and GCP; GA coming soon.

Questions?
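Enzyme's selection step can be pictured as a cost-based choice among the update techniques named above. A purely illustrative sketch: the technique names come from the slide, but the decision rules, thresholds, and the change_profile fields are invented for illustration and are not DLT's actual cost model:

```python
def pick_update_technique(change_profile):
    """Toy stand-in for Enzyme's cost model: inspect Delta-tracked
    changes and choose the cheapest correct update technique."""
    if change_profile.get("appends_only"):
        return "monotonic append"       # new rows only: just append
    if change_profile.get("touched_partitions", 0) <= 2:
        return "partition recompute"    # changes localized to few partitions
    if change_profile.get("keyed_updates"):
        return "MERGE updates"          # scattered updates with stable keys
    return "full recompute"             # fall back to rebuilding the table

print(pick_update_technique({"appends_only": True}))     # monotonic append
print(pick_update_technique({"touched_partitions": 1}))  # partition recompute
```

The value of such a chooser is that pipeline authors keep writing declarative queries while the engine decides, per update, how much work is actually necessary.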