ACCELERATING ADOPTION OF DATALAKE FOR STREAMING/ML USE CASES
June 2024
2024 Databricks Inc. All rights reserved

Mission
Our mission is to grow and empower local economies

Doordash Marketplace
- Restaurant Delivery & Pickup
- DashPass Membership
- Convenience & Grocery
- B2B Fulfillment

Agenda
- Data warehouse to Datalake Journey
- Challenges with Adoption
- Accelerator strategy
- Accelerator Tools Deep-dive
- AI Accelerator Deep-dive
- Conclusion

Journey: Data Warehouse to Datalake
Common industry trend is to scale data-based decision making while reducing costs

Data Warehouse
- Comparatively very easy to use
- Primarily used in Analytics/Reporting/Experimentation
- Fully managed solution end to end
- Great interactive query performance

Data Lake
- Exposes various underlying technologies
- Evolution started with ML/DS and real-time use cases
- Decoupled storage and compute
- Great data processing capabilities on large amounts of data

Journey: Current state
- Data Warehouse: Analytics, Reporting and Experimentation
- Data Lake: Data Science, Advanced ML, Real Time
- Data is copied back and forth for common Analytics and DS/ML sources
- Time consuming and cost prohibitive

Journey next stop: Data Lakehouse
Not Either-Or, But Both-And.
- Common Storage layer
- Data Warehouse Consumption layer
- Data Lake Ingestion, Processing and serving layer
- Analytics, Reporting and Experimentation
- Data Science, Advanced ML, Real Time

Stages of Datalake Evolution
Stages: Inception, Adoption, Expansion, Omnipresent
- Exploration phase
- Make right platform choices
- Define right architecture
- Slow and manual process
- Discover and address gaps
- User enablement and gain familiarity with tech choices
- Clear paved paths and guidelines
- Right platform for right use case
- Platforms are mature and cater to all business data needs
- Make-or-break stage

Challenges in Adoption Phase
- Data readiness for real-time event streams and ML/DS data
- Time consuming and resource intensive
- Leads to ETLs, ML models and storage migrations in some cases
- Downstream impact scope creep
Change is hard at first, messy in the middle and gorgeous at the end

Accelerator Strategy
Do More with Less
- Goal: Reduce time to adopt for new use cases or migrate relevant use cases by 3X
- Automation of manual tasks through a series of tools
- AI assistance in acceleration
- Data readiness tools
- Self-serve

Role of AI in Accelerator Strategy
Inhouse AskDataAI Platform capabilities

ML/Data Model Changes
- Not a smart search and replace
- Understand the semantics and auto apply model level changes
- Auto generate GIT PRs with changes

Data Discovery and Exploration
- Make the existing data catalog available to conversational AI agent
- Explore data insights and trends using results sourced from the right platform

SQL Co-pilot
- Generate SQL in the appropriate SQL dialect
- Auto-optimize the SQL patterns
- Queries are generated only on blessed and certified datasets

Accelerator Strategy: Tools
- Migration of existing use cases (if applicable): Similar to car manufacturing, a series of automations on an assembly line working together to migrate from one platform to another.
- Adoption for new use cases: A series of automations that can help data readiness for upstream dependencies and translation + impact assessment for downstream artifacts like ML models, data models, reports, etc.

Accelerators
- Inventory Tool: Migration diagnostics tool
- Transaxle: SQL translator. Supports multiple dialects like Trino, Spark SQL and Snowflake SQL
- Assembly Tool: Airflow DAG generator
- Inspection Tool: Data validation
- AskDataAI: AI assistance in acceleration

DATALAKE ACCELERATION TOOLS DEEP-DIVE

INVENTORY TOOL: Pipeline analysis at scale for Datalake adoption problem
Effort related to pipeline analysis:
- Manual SQL review
- Dependency analysis for each DAG/task/table
- Downstream impact analysis
- Datalake table search
- Parity inspection

INVENTORY TOOL: Assistance UI and automated pipeline analysis
- Dependency and downstream parsing from SQL logs
- Downstream impact analysis
- SQL dialect translation
- Real-time validation/inspection
- DWH/DL mirrors search
- Code generation

INVENTORY TOOL: Real-time health inspection capabilities
- Low-latency querying across different platforms using Trino
- Schema comparison
- Data volume comparison (row count)
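
The schema and row-count comparisons named on the "Real-time health inspection capabilities" slide can be pictured with a minimal sketch. It assumes the column metadata and row counts have already been fetched from each platform (for example through Trino, as the slide suggests). The `TableSnapshot` type, the `compare_tables` function, the `row_tolerance` parameter and the sample tables are all hypothetical and are not taken from the Inventory Tool itself.

```python
from dataclasses import dataclass


@dataclass
class TableSnapshot:
    """Metadata for one table as seen on one platform (hypothetical shape)."""
    columns: dict[str, str]  # column name -> type name
    row_count: int


def compare_tables(dwh: TableSnapshot, lake: TableSnapshot,
                   row_tolerance: float = 0.0) -> list[str]:
    """Return human-readable parity issues; an empty list means parity."""
    issues = []
    # Compare names and types case-insensitively, since platforms differ in casing.
    dwh_cols = {name.lower(): typ.lower() for name, typ in dwh.columns.items()}
    lake_cols = {name.lower(): typ.lower() for name, typ in lake.columns.items()}

    for name in sorted(dwh_cols.keys() - lake_cols.keys()):
        issues.append(f"missing in data lake: {name}")
    for name in sorted(lake_cols.keys() - dwh_cols.keys()):
        issues.append(f"missing in warehouse: {name}")
    for name in sorted(dwh_cols.keys() & lake_cols.keys()):
        if dwh_cols[name] != lake_cols[name]:
            issues.append(
                f"type mismatch on {name}: {dwh_cols[name]} vs {lake_cols[name]}"
            )

    # Row counts may differ by at most row_tolerance (a fraction of the DWH count).
    allowed = row_tolerance * max(dwh.row_count, 1)
    if abs(dwh.row_count - lake.row_count) > allowed:
        issues.append(f"row count differs: {dwh.row_count} vs {lake.row_count}")
    return issues


# Made-up snapshots for illustration only.
dwh = TableSnapshot({"ORDER_ID": "bigint", "AMOUNT": "double"}, row_count=1000)
lake = TableSnapshot({"order_id": "bigint", "amount": "decimal"}, row_count=998)
print(compare_tables(dwh, lake))
```

A tolerance is included because a mirror that is still catching up may legitimately lag by a few rows; whether the real tool allows that is not stated in the deck.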

INVENTORY TOOL: Implementation details - Data sources
- Sources: Airflow DB, Source code, Task logs
- Extracted from them: DAG list, Task list, Task operators, Log links, Operator arguments, Source SQL templates, Executed SQLs
- DL mirror service: Datalake table mapping, Table origin, Additional metadata

INVENTORY TOOL: Implementation details - Evolving airflow operators (PYTHON)

    # DdDWHTargetLoadOperatorV2, TableType, COLUMN_DEFINITIONS and SQL_QUERIES
    # are internal names shown on the slide.
    db_campaign_test_orders = DdDWHTargetLoadOperatorV2(
        dag=dag,
        task_id="xx.db.campaign_test_orders",
        table_fq_name="foobar.campaign_test_orders",
        column_definitions=COLUMN_DEFINITIONS,
        table_type=TableType.TRANSACTIONAL.name,
        query=SQL_QUERIES["campaign_test_orders"],
        load_type="INSERT",
        pre_delete_sql="DELETE FROM TargetDatabase.TargetSchema.campaign_test_orders",
        sla=datetime.timedelta(minutes=10),
    )

There are 50 different operators that generate SQLs.

INVENTORY TOOL: Solution - Airflow logs parsing with regex
Potentially, a CRF ML model for sequence labeling can be trained.

Airflow log (lines are cut off as on the slide):

    2023-12-04T23:20:50.977+0000 subprocess.py:93 INFO - 2023-12-04 23:20:50,977,977 INFO snow.py:142 Query time(s): 7.61
    2023-12-04T23:20:50.977+0000 subprocess.py:93 INFO - 2023-12-04 23:20:50,977,977 INFO s3_raw_base.py:391 Created compacted clone of R
    2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - 2023-12-04 23:20:50,977,977 INFO snow.py:114 Executing SQL: MERGE INTO INCOME_P
    2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - (SELECT ROW_ID, DATA,
    2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - WHEN MATCHED AND t.UPDATED_AT f.UPDATED_AT THEN
    2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - UPDATE SET f.ROW_ID=t.ROW_ID, f.DATA=t.DATA, f.CHANGE_REF
    2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - WHEN NOT MATCHED THEN
    2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - INSERT(ROW_ID, DATA, CHANGE_REF
    2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - 2023-12-04 23:20:50,977,977 INFO cursor.py:738 query: MERGE INTO INCOME_PLATFOR
    2023-12-04T23:20:54.961+0000 subprocess.py:93 INFO - 2023-12-04 23:20:54,961,961 INFO cursor.py:751 query execution done
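
As a rough illustration of the regex-based log parsing, the sketch below pulls executed SQL out of task-log lines shaped like the Airflow log sample. It assumes each line carries an outer `<timestamp> <file>:<line> <LEVEL> - <message>` prefix, that a statement starts at an inner record containing `Executing SQL:`, and that it continues until the next inner record. The patterns, the function name `extract_executed_sql` and the sample lines are assumptions made for this example; the deck does not show the actual expressions used.

```python
import re

# Outer Airflow prefix: "<timestamp> <file>:<line> <LEVEL> - <message>".
# Optional brackets/braces are tolerated in case the real logs carry them.
LINE_RE = re.compile(
    r"^\[?(?P<ts>\d{4}-\d{2}-\d{2}T[\d:.]+\+\d{4})\]?\s+"
    r"\{?(?P<src>[\w.]+:\d+)\}?\s+(?P<level>[A-Z]+)\s*-\s*(?P<msg>.*)$"
)
# Inner application prefix: "<date> <time>,<ms>,<ms> <LEVEL> <file>:<line> <text>".
INNER_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} [\d:,]+ [A-Z]+ (?P<src>[\w.]+:\d+) (?P<text>.*)$"
)
SQL_START = "Executing SQL:"


def extract_executed_sql(log_lines):
    """Collect each 'Executing SQL:' statement plus its continuation lines."""
    statements, current = [], None
    for raw in log_lines:
        outer = LINE_RE.match(raw.strip())
        if not outer:
            continue  # not a task-log line in the expected shape
        msg = outer.group("msg")
        inner = INNER_RE.match(msg)
        if inner:
            # A new application log record ends any statement being collected.
            if current is not None:
                statements.append(" ".join(current))
                current = None
            text = inner.group("text")
            if text.startswith(SQL_START):
                current = [text[len(SQL_START):].strip()]
        elif current is not None:
            current.append(msg.strip())  # continuation line of the SQL text
    if current is not None:
        statements.append(" ".join(current))
    return statements


# Made-up, untruncated lines in the style of the slide's log sample.
sample = [
    "2023-12-04T23:20:50.977+0000 subprocess.py:93 INFO - 2023-12-04 23:20:50,977,977 INFO snow.py:142 Query time(s): 7.61",
    "2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - 2023-12-04 23:20:50,977,977 INFO snow.py:114 Executing SQL: MERGE INTO demo.target f",
    "2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - USING demo.staging t ON f.ROW_ID = t.ROW_ID",
    "2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - WHEN NOT MATCHED THEN INSERT (ROW_ID) VALUES (t.ROW_ID)",
    "2023-12-04T23:20:54.961+0000 subprocess.py:93 INFO - 2023-12-04 23:20:54,961,961 INFO cursor.py:751 query execution done",
]
for statement in extract_executed_sql(sample):
    print(statement)
```

Rules like these are brittle when operators log SQL in different shapes, which is presumably why the slide mentions a trained sequence-labeling model as a possible alternative.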

INVENTORY TOOL: SQL decomposition into downstream and dependencies with the open source SQLGlot lib (PYTHON)

    from sqlglot import parse_one, exp

    # sql: one executed statement; source_system: its SQL dialect name.
    ast = parse_one(sql, read=source_system)
    # Keep only tables qualified with a database and normalize their names.
    tables = ast.find_all(exp.Table)
    tables = filter(lambda x: x.db, tables)
    tables = list(map(lambda x: f"{x.catalog}.{x.db}.{x.name}".lower(), tables))
    downstream = ""
    # Statements other than SELECT/UNION write to a table: the downstream.
    if ast.key != "select" and ast.key != "union":
        if len(tables) > 0:
            if ast.key == "altertable" and len(tables) == 2 and "rename" in sql.lower():
                downstream = tables.pop(1)  # a rename targets the second table
            else:
                downstream = tables.pop(0)
    # Tables left in `tables` are the dependencies.

TRANSAXLE - SQL TRANSLATION TOOL: SQL migration problem
End-to-end migration to the DataLake requires making SQL translations. This can take a lot of time and manual effort.

TRANSAXLE - SQL TRANSLATION TOOL: SQL migration solution
- Central hub to serve SQL translation needs
- Reduce query translation and syntax validation time from days to minutes
- Provides vendor agnostic SQL
- Table name mapping
- Query validation with live Spark cluster
- Integrates with code generation tools

TRANSAXLE - SQL TRANSLATION TOOL: Leverages the Databricks Labs Remorph library
SQL code converter and data reconciliation tool for accelerating data onboarding to Databricks from EDW, CDW and other ETL sources.

ASSEMBLY TOOL - DAG GENERATION: Code generation problem
Migrating DAG code to its Datalake version involves multiple steps that take a few days from code analysis to testing.

ASSEMBLY TOOL - DAG GENERATION: Code generation solution
The Assembly tool automates the majority of steps and in some cases is capable of generating the final end-to-end solution.
Generation steps:
1. SQL extraction and translation
2. Operators replacement with parity
3. Pull-request assembly
4. Integration and unit testing
5. Migration report generation
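
To make the generation steps more concrete, here is a minimal sketch of steps 1, 2 and 5 only (pull-request assembly and testing are left out). Everything in it is hypothetical: the `Task` and `MigrationReport` types, the operator names, the table mapping and the regex-based `translate_sql`, which merely swaps table names and stands in for a real SQL translator such as Transaxle.

```python
import re
from dataclasses import dataclass, field

# Hypothetical lookup tables; real mappings would come from inventory metadata.
TABLE_MAP = {"edw.core.orders": "lake.core.orders"}
OPERATOR_MAP = {"WarehouseLoadOperator": "LakeLoadOperator"}


@dataclass
class Task:
    task_id: str
    operator: str
    sql: str


@dataclass
class MigrationReport:
    migrated: list = field(default_factory=list)
    needs_review: list = field(default_factory=list)


def translate_sql(sql: str) -> str:
    """Step 1 stand-in: rewrites three-part table names, no dialect translation."""
    def swap(match):
        return TABLE_MAP.get(match.group(0).lower(), match.group(0))
    return re.sub(r"\b\w+\.\w+\.\w+\b", swap, sql)


def migrate_task(task: Task, report: MigrationReport) -> Task:
    """Steps 1, 2 and 5: translate SQL, swap the operator, record the outcome."""
    new_operator = OPERATOR_MAP.get(task.operator)
    if new_operator is None:
        report.needs_review.append(task.task_id)  # no parity operator known
        return task
    report.migrated.append(task.task_id)
    return Task(task.task_id, new_operator, translate_sql(task.sql))


report = MigrationReport()
tasks = [
    Task("load_orders", "WarehouseLoadOperator", "SELECT * FROM EDW.CORE.ORDERS"),
    Task("export_csv", "CustomExportOperator", "SELECT 1"),
]
migrated = [migrate_task(t, report) for t in tasks]
print(migrated[0].operator, "|", migrated[0].sql)
print(report)
```

Tasks without a known replacement operator are returned unchanged and listed for manual review, which mirrors the slide's remark that only some cases reach a final end-to-end solution automatically.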

AI Accelerator - Problem Statement: Agent deployment problem for adoption and data exploration
- Data exploration across different platforms
- Semantic search
- Data interpretation
- SQL query generation
- Accessing complex DWH structures like metric cubes
- Searching internal documentation in Google Drive/Confluence

AskDataAI - Solution: AskDataAI as the internal "GPTs" platform for data exploration solution
AskDataAI Platform:
- User interface
- VectorDB and semantic search engine
- Integration with Slack and other communication channels
- API endpoint
- Loader/Worker templates (Google Drive, etc.)
Client AskData-X Apps:
- Custom prompt
- VectorDB collections
- Data loaders
- Functions/tools
- API endpoints for internal services
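
One way to picture a client app built from a custom prompt plus functions/tools is the sketch below. It is an assumption-laden illustration, not the AskDataAI API: the `AskDataApp` class, its `tool` decorator and `run` loop are hypothetical, the tool body and catalog entry are invented, and `scripted_planner` is a deterministic stand-in for the LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AskDataApp:
    """Hypothetical client app: a custom prompt plus a registry of tools."""
    prompt: str
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)

    def tool(self, name: str):
        """Decorator that registers a function under a tool name."""
        def register(fn):
            self.tools[name] = fn
            return fn
        return register

    def run(self, question: str, planner, max_steps: int = 5) -> str:
        """Ask the planner for the next action until it returns a final answer.

        planner(prompt, question, history) stands in for the LLM call and
        returns ("final", answer) or (tool_name, tool_input).
        """
        history = []
        for _ in range(max_steps):
            action, arg = planner(self.prompt, question, history)
            if action == "final":
                return arg
            if action not in self.tools:
                observation = f"unknown tool: {action}"
            else:
                observation = self.tools[action](arg)
            history.append((action, arg, observation))
        return "no answer within step budget"


app = AskDataApp(prompt="You answer questions about metrics.")


@app.tool("MetricsSearch")
def metrics_search(text: str) -> str:
    catalog = {"orders": "daily_order_count"}  # made-up catalog entry
    return catalog.get(text, "no metric found")


def scripted_planner(prompt, question, history):
    """Deterministic stand-in for an LLM: search once, then answer."""
    if not history:
        return ("MetricsSearch", "orders")
    return ("final", f"Use metric {history[-1][2]}")


print(app.run("Which metric tracks orders?", scripted_planner))
# prints: Use metric daily_order_count
```

Keeping the planner as a plain callable makes the loop testable without a model, since a scripted planner can replay a fixed sequence of actions.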

AskDataAI - Solution: Data exploration with LLM, VectorDB, and AI Agents
Question: Do we have any metrics to track protective equipment masks?
AI Agent actions:
1. MetricsSearch(protective equipment)
2. TablesSearch(protective equipment)
3. GetTableDescription(edw.core.finance_metrics)
4. FinalResponse: (SQL)

AskDataAI - Architecture: Implementing an agent with LangChain
- LangChain is used to define tools
- The LLM decides what to do next in the agentic workflow (generates thoughts, actions, and the final response)
- Tool/function examples: GoogleDriveSearch, ShowCubeMetrics, DescribeTable, ShowTablesLikeInHubble, ShowTablesLike, CallAPI

ASKDATAAI - Data Accuracy: How to ensure reliability with AI-graded tests
- Test-driven development
- Regression testing
- Tools/Semantic search testability
- evaluate(query, expected_answer, actual_answer) -> explanation, score
Image source: https:/

AskDataAI - Semantic Search: Semantic search with FAISS Library and generating documents
- Applying LLM to generate table description based on column names and available documentation
- LLM to generate business question and SQL solution using table/metric
- Document = metric metadata + LLM description + SQL example + describe table results
- Evaluation using AI generated business questions and cross-validation
- Data loaders and workers
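
The document recipe (metric metadata + LLM description + SQL example + describe table results) and the retrieval step can be sketched as follows. This is a toy stand-in: it uses a bag-of-words vector and cosine similarity from the standard library in place of an embedding model and a FAISS index, and the function names and sample documents are invented for the example.

```python
import math
import re
from collections import Counter


def build_document(metric_metadata: str, llm_description: str,
                   sql_example: str, describe_table: str) -> str:
    """Concatenate the four ingredients named on the slide into one document."""
    return "\n".join([metric_metadata, llm_description, sql_example, describe_table])


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real setup would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def search(query: str, documents: dict[str, str], k: int = 1) -> list[str]:
    """Return the names of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents,
                    key=lambda name: cosine(q, embed(documents[name])),
                    reverse=True)
    return ranked[:k]


# Made-up documents for illustration only.
docs = {
    "delivery_time": build_document(
        "metric: avg_delivery_minutes",
        "Average minutes between order placement and delivery.",
        "SELECT AVG(delivery_minutes) FROM demo.deliveries",
        "delivery_minutes DOUBLE, order_id BIGINT",
    ),
    "refund_rate": build_document(
        "metric: refund_rate",
        "Share of orders that were refunded.",
        "SELECT AVG(is_refunded) FROM demo.orders",
        "is_refunded INT, order_id BIGINT",
    ),
}
print(search("how long does delivery take", docs))
# prints: ['delivery_time']
```

Word overlap only matches exact tokens, so it misses paraphrases that a learned embedding would catch; the sketch is meant to show the document-building and ranking flow, not retrieval quality.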

ASKDATAAI - Acceleration Example: Schema migration acceleration use case with agentic workflow

Conclusion
- Change in the data space is a constant. Adapt fast.
- Define the right data architecture. There is no one-size-fits-all solution.
- Proactively identify adoption/migration bottlenecks very early in the game.
- Tools/frameworks play an important role in technology adoption. Invest in them.
- AI-based solutions have high potential in data applications beyond mainstream use cases.

QUESTIONS?