STREAMLINE REAL TIME INGESTION AND SERVING WITH DLT AND ONLINE TABLES
2024 Databricks Inc. All rights reserved

Magnus Johannesson, Solutions Architect and Data Engineer
- 30 years in the IT industry
- Developer, application architect, integration architect, enterprise architect
- 13 years in analytics
- 5 years working with Databricks
- Self-employed consultant since 2017

Sanjeev Kumar, Specialist Solution Architect
- 12 years in the IT industry
- Data Engineer
- With Databricks for 2 years

Agenda
- Västtrafik: Our business
- Use case and problem description
- Architecture
- Pipeline Implementation
- Future improvements
- Online Store - Serving Demo
- Key takeaways

VÄSTTRAFIK

Some of our data
- Planned traffic data
- Ticket data
  - Sales
  - Validations
  - Ticket inspections
- Real-time data
- Automatic passenger data

USE CASE

Use Case
Real-time position of Västtrafik's vehicles
- Trams & buses: GPS position, vehicle ID, line, stops, etc.
- Trains: Only GPS position and internal vehicle ID
Proof of Concept (PoC)
- Enrich using existing data
- Test streaming data in Databricks
- Target latency 10 sec

Project Goals

ARCHITECTURE

Architecture
Streaming with Delta Live Tables
[Diagram: Consideration 1. Inputs: Stream Vehicle position; Stream Trafikverket data (File drop). Processing: Data Intelligence Platform with Unity Catalog and Delta Live Tables (DLT), Streaming. Output: Stream data out over MQTT to the travel app's cache (backend) and Västtrafik's travel app. Enhanced positional data: Vehicle position, Train Nr, line, next stop, ETA etc.]

WHY THIS ARCHITECTURE?

Databricks Data Intelligence Platform
- ETL & Real-time Analytics: Delta Live Tables
- Orchestration: Workflows
- Data Warehousing: Databricks SQL
- Data Science & AI: Mosaic AI
- Unified security, governance, and cataloging: Unity Catalog
- Unified data storage for reliability and sharing: Delta Lake

DatabricksIQ: Use generative AI to understand the semantics of your data
- Unity Catalog: Securely get insights in natural language
- Delta Lake UniForm: Data layout is automatically optimized based on usage patterns
- Databricks SQL: Text-to-SQL, Text-to-Viz
- Workflows: Job cost optimized based on past runs
- Delta Live Tables: Automated data quality
- Mosaic AI: Create, tune, and serve custom LLMs
Open Data Lake: All Raw Data (Logs, Texts, Audio, Video, Images)

Delta Live Tables
The declarative way to do ETL on the lakehouse

    CREATE STREAMING TABLE raw_data
    AS SELECT * FROM cloud_files("/raw_data", "json")

    CREATE MATERIALIZED VIEW clean_data
    AS SELECT ... FROM LIVE.raw_data

- Accelerate ETL development: Declare SQL or Python and DLT automatically orchestrates the DAG, handles retries, changing data
- Automatically manage your infrastructure: Automates complex tedious activities like recovery, auto-scaling, and performance optimization
- Ensure high data quality: Deliver reliable data with built-in quality controls, testing, monitoring, and enforcement
- Unify batch and streaming: Get the simplicity of SQL with freshness of streaming with one unified API

PIPELINE IMPLEMENTATION

Data Metrics

DLT Streaming Pipeline 1

DLT Streaming Pipeline 2

Implementation
DLT Pipeline 1, DLT Pipeline 2
- Train Arrival and Departure data is streamed as files.
- Train Positions data is streamed from Event hub, joined with batch data (i.e. Vehicle and Planned Traffic), and written to the final Train Positions table.

OPTIMIZATIONS

Top Optimization
Optimize for latency
- We started with 2 minutes of latency.
- Final outcome was 10 seconds latency.

Event hub Optimizations
- Event hub: Kafka interface has better read speeds. Use maxOffsetsPerTrigger for rate limiting.
- Kafka fan-out: Set minPartitions to increase the read throughput if cluster has more compute.
- Partitioning
  - Too many: Increase the number of CPUs to match the number of partitions
  - Too few: You may not be able to handle more load due to TU (throughput unit) limits

Future Improvements

Proposed Architecture
Introducing Online Tables
[Diagram: Consideration 1. Stream data in; Data Intelligence Platform with Unity Catalog and Delta Live Tables (DLT); Auto-sync to Online store; Serving Endpoint; Stream data out over MQTT to the travel app's cache (backend) and Västtrafik's travel app. Enhanced positional data.]

Databricks Online Tables
Data Serving from Lakehouse
- Complexity
  - Online store infrastructure management
  - Data pipeline management
- Performance
  - Lookup latency
  - Data freshness
  - Dynamic scaling
- Cost
  - Cost of publishing data to online store
2023 Databricks Inc. Confidential & Subject to Change

Databricks Online Tables
- Simplicity
  - Fully managed serverless online store
  - Fully managed data synchronization pipeline
  - Unified governance with Unity Catalog
  - Zero-configuration Serving
- Speed
  - Lookup latency (10 milliseconds)
[Diagram labels: Analytics Platform, Delta Table, auto-sync, Online Tables, REST API; Online Store Infrastructure, Data Pipeline, API Gateway]

DEMO
Serving with Online Table
Pre-recorded demo

Cost
- Compute cost affected due to multiple DLT pipelines and latency needs
  - 12x performance
  - 6x cost
- Storage cost affected due to continuous reads and writes
  - Depends on chosen redundancy
- Tune the cluster
  - Cheaper node type
  - Max worker nodes

Key takeaways

QUESTIONS?
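
The Event hub Optimizations slide recommends reading Event Hubs through its Kafka interface and setting maxOffsetsPerTrigger and minPartitions. The deck does not show the reader configuration, so the following is only a minimal sketch of what such a configuration might look like. The function names, the namespace and hub names, and the numeric values are hypothetical. The `kafkashaded.` class prefix assumes the shaded Kafka client shipped with Databricks runtimes.

```python
from typing import Dict


def eventhub_kafka_options(
    namespace: str,
    event_hub: str,
    connection_string: str,
    max_offsets_per_trigger: int,
    min_partitions: int,
) -> Dict[str, str]:
    """Build Spark Kafka-source options for an Event Hubs namespace.

    All argument values are supplied by the caller; nothing here comes
    from the talk except the two tuning option names.
    """
    if max_offsets_per_trigger <= 0 or min_partitions <= 0:
        raise ValueError("rate limit and partition count must be positive")

    # Event Hubs authenticates Kafka clients with SASL PLAIN; the literal
    # user name "$ConnectionString" is part of that scheme.
    # Assumption: the "kafkashaded." prefix applies on Databricks runtimes.
    jaas = (
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule "
        f'required username="$ConnectionString" password="{connection_string}";'
    )
    return {
        "kafka.bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": jaas,
        "subscribe": event_hub,
        # Caps how many records one micro-batch may pull (rate limiting).
        "maxOffsetsPerTrigger": str(max_offsets_per_trigger),
        # Asks Spark to split the topic into at least this many read tasks.
        "minPartitions": str(min_partitions),
    }


def read_positions(spark, options: Dict[str, str]):
    """Open the stream; `spark` is an existing SparkSession."""
    return spark.readStream.format("kafka").options(**options).load()
```

In a notebook this might be called as `read_positions(spark, eventhub_kafka_options("my-namespace", "train-positions", conn_str, 5000, 16))`, where every argument is a placeholder to be replaced with real values.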
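
The partitioning advice on the Event hub Optimizations slide (fan out with minPartitions when the cluster has spare compute; add CPUs when there are more partitions than cores) can be read as a small decision rule. The sketch below is one possible interpretation, not the speakers' method. `ReadPlan` and `plan_read_parallelism` are hypothetical names, and using the core count as the minPartitions value is an assumed heuristic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReadPlan:
    # Suggested minPartitions value; None means leave the option unset.
    min_partitions: Optional[int]
    # Cores to add so that every topic partition gets its own core.
    extra_cores_needed: int


def plan_read_parallelism(topic_partitions: int, cluster_cores: int) -> ReadPlan:
    """Apply the slide's rule of thumb to one topic and one cluster."""
    if topic_partitions <= 0 or cluster_cores <= 0:
        raise ValueError("partition and core counts must be positive")

    if cluster_cores > topic_partitions:
        # Spare compute: fan out so idle cores also read from the topic.
        # Assumption: one read task per core is a reasonable target.
        return ReadPlan(min_partitions=cluster_cores, extra_cores_needed=0)

    # Cores <= partitions: fan-out would not help; report the core shortfall.
    return ReadPlan(
        min_partitions=None,
        extra_cores_needed=topic_partitions - cluster_cores,
    )
```

For example, `plan_read_parallelism(8, 32)` suggests `minPartitions=32`, while `plan_read_parallelism(32, 8)` leaves the option unset and reports a shortfall of 24 cores. The "too few partitions" case on the slide is a limit on the Event Hubs side (throughput units) and is not modeled here.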