
Accelerating Adoption of Datalake for Streaming/ML Use Cases.pdf

Uploaded by: Zhang**, ID: 167613, 2024-06-15, PDF, 36 pages, 2.36 MB

Part of the collection: 2024 Data + AI Summit (data+ai summit2024) presentation slides


2024 Databricks Inc. All rights reserved

Slide 1: ACCELERATING ADOPTION OF DATALAKE FOR STREAMING/ML USE CASES
June 2024

Slide 3: Mission
Our mission is to grow and empower local economies

Slide 4: DoorDash Marketplace
- Restaurant Delivery & Pickup
- DashPass Membership
- Convenience & Grocery
- B2B Fulfillment

Slide 5: Agenda
- Data warehouse to Datalake journey
- Challenges with adoption
- Accelerator strategy
- Accelerator tools deep-dive
- AI accelerator deep-dive
- Conclusion

Slide 6: Journey: Data Warehouse to Datalake
A common industry trend is to scale data-based decision making while reducing costs.
Data Warehouse:
- Comparatively very easy to use
- Primarily used in Analytics/Reporting/Experimentation
- Fully managed solution end to end
- Great interactive query performance
Data Lake:
- Exposes various underlying technologies
- Evolution started with ML/DS and real-time use cases
- Decoupled storage and compute
- Great data processing capabilities on large amounts of data

Slide 7: Journey: Current state
- Data Warehouse: Analytics, Reporting and Experimentation
- Data Lake: Data Science, Advanced ML, Real Time
- Data is copied back and forth for common Analytics and DS/ML sources
- Time consuming and cost prohibitive

Slide 8: Journey next stop: Data Lakehouse
Not Either-Or, But Both-And.
- Common storage layer
- Data Warehouse: consumption layer
- Data Lake: ingestion, processing and serving layer
- Analytics, Reporting and Experimentation
- Data Science, Advanced ML, Real Time

Slide 9: Stages of Datalake Evolution
Stages: Inception, Adoption, Expansion, Omnipresent
- Exploration phase
- Make the right platform choices
- Define the right architecture
- Slow and manual process
- Discover and address gaps
- User enablement and gaining familiarity with tech choices
- Clear paved paths and guidelines
- Right platform for the right use case
- Platforms are mature and cater to all business data needs
- Make-or-break stage

Slide 10: Challenges in Adoption Phase
Change is hard at first, messy in the middle and gorgeous at the end.
- Data readiness for real-time event streams and ML/DS data
- Time consuming and resource intensive
- Leads to ETL, ML model and storage migrations in some cases
- Downstream impact scope creep

Slide 11: Accelerator Strategy
Do More with Less
- Goal: reduce the time to adopt new use cases or migrate relevant use cases by 3X
- Automation of manual tasks through a series of tools
- AI assistance in acceleration
- Data readiness tools
- Self-serve

Slide 12: Role of AI in Accelerator Strategy
In-house AskDataAI Platform capabilities
ML/Data Model Changes:
- Not a smart search and replace
- Understand the semantics and auto-apply model-level changes
- Auto-generate Git PRs with the changes
Data Discovery and Exploration:
- Make the existing data catalog available to a conversational AI agent
- Explore data insights and trends using results sourced from the right platform
SQL Co-pilot:
- Generate SQL in the appropriate SQL dialect
- Auto-optimize the SQL patterns
- Queries are generated only on blessed and certified datasets

Slide 13: Accelerator Strategy - Tools
- Migration of existing use cases (if applicable): similar to car manufacturing, a series of automations on an assembly line working together to migrate from one platform to another.
- Adoption for new use cases: a series of automations that help with data readiness for upstream dependencies, plus translation and impact assessment for downstream artifacts such as ML models, data models, reports, etc.

Slide 14: Accelerators
- Inventory Tool: migration diagnostics tool
- Transaxle: SQL translator, supports multiple dialects like Trino, Spark SQL and Snowflake SQL
- Assembly Tool: Airflow DAG generator
- Inspection Tool: data validation
- AskDataAI: AI assistance in acceleration

Slide 15: DATALAKE ACCELERATION TOOLS DEEPDIVE

Slide 16: INVENTORY TOOL
Pipeline analysis at scale for Datalake adoption: problem
Effort related to pipeline analysis:
- Manual SQL review
- Dependency analysis for each DAG/task/table
- Downstream impact analysis
- Datalake table search
- Parity inspection

Slide 17: INVENTORY TOOL
Assistance UI and automated pipeline analysis
- Dependency and downstream parsing from SQL logs
- Downstream impact analysis
- SQL dialect translation
- Real-time validation/inspection
- DWH/DL mirrors search
- Code generation

Slide 18: INVENTORY TOOL
Real-time health inspection capabilities
- Low-latency querying across different platforms using Trino
- Schema comparison
- Data volume comparison (row count)
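The schema comparison and row-count comparison listed under the inventory tool's real-time health inspection capabilities could look roughly like the sketch below. It is an illustration only: the slides say the inputs are fetched with Trino, whereas here they are passed in as plain dictionaries and integers, and the names `ParityReport` and `inspect_parity` are hypothetical, not taken from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class ParityReport:
    """Result of comparing a warehouse table with its Datalake mirror (hypothetical type)."""
    missing_in_datalake: list = field(default_factory=list)
    extra_in_datalake: list = field(default_factory=list)
    type_mismatches: dict = field(default_factory=dict)
    row_count_diff: int = 0

    @property
    def ok(self) -> bool:
        # Parity holds only if schemas match and row counts are equal.
        return not (self.missing_in_datalake or self.extra_in_datalake
                    or self.type_mismatches or self.row_count_diff)


def inspect_parity(dwh_schema, dl_schema, dwh_rows, dl_rows):
    """Compare column names, column types (case-insensitively) and row counts.

    Assumption: each schema is a dict of column name -> type string, and the
    row counts were obtained elsewhere (for example with COUNT(*) queries).
    """
    report = ParityReport()
    report.missing_in_datalake = sorted(set(dwh_schema) - set(dl_schema))
    report.extra_in_datalake = sorted(set(dl_schema) - set(dwh_schema))
    for col in sorted(set(dwh_schema) & set(dl_schema)):
        if dwh_schema[col].lower() != dl_schema[col].lower():
            report.type_mismatches[col] = (dwh_schema[col], dl_schema[col])
    report.row_count_diff = dl_rows - dwh_rows
    return report


# Illustrative inputs, not real tables.
dwh = {"order_id": "BIGINT", "amount": "DECIMAL(10,2)", "created_at": "TIMESTAMP"}
dl = {"order_id": "bigint", "amount": "DOUBLE", "updated_at": "timestamp"}
report = inspect_parity(dwh, dl, dwh_rows=1000, dl_rows=998)
print(report)
```

Returning a structured report rather than a single boolean keeps the individual discrepancies available for a UI or a migration report.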

Slide 19: INVENTORY TOOL
Implementation details - Data sources
Sources: Airflow DB, source code, task logs
- DAG list
- Task list
- Task operators
- Log links
- Operator arguments
- Source SQL templates
- Executed SQLs
DL mirror service:
- Datalake table mapping
- Table origin
- Additional metadata

Slide 20: INVENTORY TOOL
Implementation details - Evolving Airflow operators
There are 50 different operators that generate SQLs.
PYTHON

    import datetime

    # DdDWHTargetLoadOperatorV2, TableType, COLUMN_DEFINITIONS, SQL_QUERIES and dag
    # come from the presenters' in-house Airflow code and are not defined here.
    db_campaign_test_orders = DdDWHTargetLoadOperatorV2(
        dag=dag,
        task_id="xx.db.campaign_test_orders",
        table_fq_name="foobar.campaign_test_orders",
        column_definitions=COLUMN_DEFINITIONS,
        table_type=TableType.TRANSACTIONAL.name,
        query=SQL_QUERIES["campaign_test_orders"],
        load_type="INSERT",
        pre_delete_sql="DELETE FROM TargetDatabase.TargetSchema.campaign_test_orders",
        sla=datetime.timedelta(minutes=10),
    )

Slide 21: INVENTORY TOOL
Solution - Airflow logs parsing with regex
Potentially a CRF ML model for sequence labeling can be trained.
Airflow log (lines are truncated on the slide):

    2023-12-04T23:20:50.977+0000 subprocess.py:93 INFO - 2023-12-04 23:20:50,977,977 INFO snow.py:142 Query time(s): 7.61
    2023-12-04T23:20:50.977+0000 subprocess.py:93 INFO - 2023-12-04 23:20:50,977,977 INFO s3_raw_base.py:391 Created compacted clone of R
    2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - 2023-12-04 23:20:50,977,977 INFO snow.py:114 Executing SQL: MERGE INTO INCOME_P
    2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - (SELECT ROW_ID,DATA,
    2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - WHEN MATCHED AND t.UPDATED_AT f.UPDATED_AT THEN
    2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - UPDATE SET f.ROW_ID=t.ROW_ID,f.DATA=t.DATA,f.CHANGE_REF
    2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - WHEN NOT MATCHED THEN
    2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - INSERT(ROW_ID,DATA,CHANGE_REF
    2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - 2023-12-04 23:20:50,977,977 INFO cursor.py:738 query: MERGE INTO INCOME_PLATFOR
    2023-12-04T23:20:54.961+0000 subprocess.py:93 INFO - 2023-12-04 23:20:54,961,961 INFO cursor.py:751 query execution done

Slide 22: INVENTORY TOOL
SQL decomposition into downstream and dependencies with the open source SQLGlot lib
PYTHON

    from sqlglot import parse_one, exp

    sql = "INSERT INTO db.schema.target SELECT * FROM db.schema.source"  # illustrative input
    source_system = "snowflake"  # dialect of the source system (illustrative)

    ast = parse_one(sql, read=source_system)
    # Keep schema-qualified tables only, as lowercase catalog.db.name strings
    tables = [t for t in ast.find_all(exp.Table) if t.db]
    tables = [f"{t.catalog}.{t.db}.{t.name}".lower() for t in tables]

    downstream = ""
    if ast.key not in ("select", "union") and tables:
        # ALTER TABLE ... RENAME: the second table is the new (downstream) name
        if ast.key == "altertable" and len(tables) == 2 and "rename" in sql.lower():
            downstream = tables.pop(1)
        else:
            downstream = tables.pop(0)
    # Whatever remains in tables is treated as the upstream dependencies

Slide 23: TRANSAXLE - SQL TRANSLATION TOOL
SQL migration problem
End-to-end migration to the DataLake requires making SQL translations. This can take a lot of time and manual effort.

Slide 24: TRANSAXLE - SQL TRANSLATION TOOL
SQL migration solution
- Central hub to serve SQL translation needs
- Reduces query translation and syntax validation time from days to minutes
- Provides vendor-agnostic SQL
- Table name mapping
- Query validation with a live Spark cluster
- Integrates with code generation tools

Slide 25: TRANSAXLE - SQL TRANSLATION TOOL
Uses the Databricks Labs Remorph library
SQL code converter and data reconciliation tool for accelerating data onboarding to Databricks from EDW, CDW and other ETL sources.

Slide 26: ASSEMBLY TOOL - DAG GENERATION
Code generation problem
Migrating DAG code to its Datalake version involves multiple steps that take a few days, from code analysis to testing.

Slide 27: ASSEMBLY TOOL - DAG GENERATION
Code generation solution
The Assembly tool automates the majority of the steps and in some cases is capable of generating the final end-to-end solution.
Generation steps:
1. SQL extraction and translation
2. Operator replacement with parity
3. Pull-request assembly
4. Integration and unit testing
5. Migration report generation
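Returning to the inventory tool's "Airflow logs parsing with regex" step, the sketch below shows one way executed SQL could be recovered from task logs. It is not the presenters' implementation: the patterns assume the bracket-less line layout that appears in this transcript (real Airflow logs may wrap the timestamp and source in brackets or braces, which would need adjusted patterns), and the function name `extract_sql` and the sample log content are made up for illustration.

```python
import re

# Outer prefix added by the task runner: timestamp, source file:line, level, " - ".
OUTER = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[+-]\d{4}\s+"
    r"[\w.]+:\d+\s+[A-Z]+\s*-\s*(?P<msg>.*)$"
)
# Inner prefix added by the job's own logger: timestamp, level, source file:line.
INNER = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[,\d]*\s+[A-Z]+\s+[\w.]+:\d+\s+(?P<msg>.*)$"
)
SQL_START = re.compile(r"^Executing SQL:\s*(?P<sql>.*)$")


def extract_sql(log_text):
    """Return the SQL statements announced by 'Executing SQL:' records.

    Lines that carry only the outer prefix are treated as continuation lines
    of the statement currently being collected.
    """
    statements, current = [], None
    for raw in log_text.splitlines():
        outer = OUTER.match(raw.strip())
        if not outer:
            continue
        msg = outer.group("msg")
        inner = INNER.match(msg)
        if inner:
            # A new inner log record closes any statement in progress.
            if current is not None:
                statements.append(" ".join(current))
                current = None
            start = SQL_START.match(inner.group("msg"))
            if start:
                current = [start.group("sql").strip()]
        elif current is not None:
            current.append(msg.strip())
    if current is not None:
        statements.append(" ".join(current))
    return statements


# Made-up sample in the same layout as the slide's log excerpt.
SAMPLE = """\
2023-12-04T23:20:50.977+0000 subprocess.py:93 INFO - 2023-12-04 23:20:50,977,977 INFO snow.py:142 Query time(s): 7.61
2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - 2023-12-04 23:20:50,977,977 INFO snow.py:114 Executing SQL: MERGE INTO demo.target f
2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - USING demo.staging t ON f.ROW_ID = t.ROW_ID
2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - WHEN NOT MATCHED THEN
2023-12-04T23:20:50.978+0000 subprocess.py:93 INFO - INSERT (ROW_ID, DATA) VALUES (t.ROW_ID, t.DATA)
2023-12-04T23:20:54.961+0000 subprocess.py:93 INFO - 2023-12-04 23:20:54,961,961 INFO cursor.py:751 query execution done
"""

for statement in extract_sql(SAMPLE):
    print(statement)
```

The recovered statements could then be handed to a SQL parser for the downstream and dependency decomposition. A regex approach like this depends on a stable log layout, which may be why the slides mention a trained sequence-labeling model as a possible alternative.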

Slide 28: AI Accelerator - Problem Statement
Agent deployment problem for adoption and data exploration
- Data exploration across different platforms
- Semantic search
- Data interpretation
- SQL query generation
- Accessing complex DWH structures like metric cubes
- Searching internal documentation in Google Drive/Confluence

Slide 29: AskDataAI - Solution
AskDataAI as the internal "GPTs" platform for data exploration
AskDataAI Platform:
- User interface
- VectorDB and semantic search engine
- Integration with Slack and other communication channels
- API endpoint
- Loader/Worker templates (Google Drive, etc.)
Client AskData-X Apps:
- Custom prompt
- VectorDB collections
- Data loaders
- Functions/tools
- API endpoints for internal services

Slide 30: AskDataAI - Solution
Data exploration with LLM, VectorDB, and AI Agents
Question: Do we have any metrics to track protective equipment masks?
AI Agent actions:
1. MetricsSearch(protective equipment)
2. TablesSearch(protective equipment)
3. GetTableDescription(edw.core.finance_metrics)
4. FinalResponse: (SQL)

Slide 31: AskDataAI - Architecture
Implementing an agent with LangChain
- LangChain is used to define tools
- The LLM decides what to do next in the agentic workflow (generates thoughts, actions, and the final response)
- Tool/function examples: GoogleDriveSearch, ShowCubeMetrics, DescribeTable, ShowTablesLikeInHubble, ShowTablesLike, CallAPI

Slide 32: ASKDATAAI - Data Accuracy
How to ensure reliability with AI-graded tests
- Test-driven development
- Regression testing
- Tools/Semantic search testability
- evaluate(query, expected_answer, actual_answer) -> explanation, score
Image source: https:/

Slide 33: AskDataAI - Semantic Search
Semantic search with the FAISS library and generating documents
- Applying an LLM to generate table descriptions based on column names and available documentation
- LLM to generate a business question and SQL solution using the table/metric
- Document = metric metadata + LLM description + SQL example + describe table results
- Evaluation using AI-generated business questions and cross-validation
- Data loaders and workers

Slide 34: ASKDATAAI - Acceleration Example
Schema migration acceleration use case with agentic workflow

Slide 35: Conclusion
- Change in the data space is a constant. Adapt fast.
- Define the right data architecture. There is no one-size-fits-all solution.
- Proactively identify adoption/migration bottlenecks very early in the game.
- Tools/frameworks play an important role in technology adoption. Invest in them.
- AI-based solutions have high potential in data applications beyond mainstream use cases.

Slide 36: QUESTIONS?
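As a supplement to the AskDataAI data accuracy slide, the signature `evaluate(query, expected_answer, actual_answer) -> explanation, score` could be wired into a regression suite along these lines. This is a minimal sketch under stated assumptions: the talk describes AI-graded tests, so the grader would presumably be an LLM call, while `keyword_grader` below is a deterministic stand-in so the example runs offline. `Case`, `run_regression` and the threshold are hypothetical names and values.

```python
from dataclasses import dataclass


def keyword_grader(query, expected_answer, actual_answer):
    """Stand-in grader (assumption): fraction of expected terms found in the answer.

    A real AI-graded test would replace this with a model call that returns
    the same (explanation, score) pair.
    """
    expected = set(expected_answer.lower().split())
    actual = set(actual_answer.lower().split())
    if not expected:
        return "no expected answer provided", 0.0
    missing = sorted(expected - actual)
    score = (len(expected) - len(missing)) / len(expected)
    if missing:
        return "missing terms: " + ", ".join(missing), score
    return "all expected terms present", score


def evaluate(query, expected_answer, actual_answer, grader=keyword_grader):
    """Same shape as the slide's signature: returns (explanation, score)."""
    return grader(query, expected_answer, actual_answer)


@dataclass
class Case:
    query: str
    expected_answer: str


def run_regression(cases, answer_fn, threshold=0.8, grader=keyword_grader):
    """Run every case through answer_fn and collect those scoring below threshold."""
    failures = []
    for case in cases:
        actual = answer_fn(case.query)
        explanation, score = evaluate(case.query, case.expected_answer, actual, grader)
        if score < threshold:
            failures.append((case.query, score, explanation))
    return failures


# Illustrative run with a fake "agent" that always returns the same answer.
cases = [Case("q1", "a b"), Case("q2", "x y")]
print(run_regression(cases, lambda q: "a b"))
```

Keeping the grader pluggable means the same regression loop can run with a cheap deterministic check in CI and with a model-based grader when answer quality needs closer scrutiny.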


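Summary: This deck discusses how to accelerate data lake adoption for streaming and machine learning use cases. It describes the journey from the data warehouse to the data lake and stresses the role of automation tools in speeding up that transition. Key points:
1. The data warehouse is comparatively easy to use, fully managed end to end, and offers strong interactive query performance; it is used primarily for Analytics/Reporting/Experimentation. The data lake serves Data Science, Advanced ML and Real Time workloads.
2. The data lake's evolution started with ML/DS and real-time use cases; it decouples storage from compute and handles large amounts of data well.
3. Adoption is challenging: change is hard at first and messy in the middle, but the end result is worthwhile. The goal is to cut the time to adopt new use cases, or to migrate relevant ones, by 3X through automation.
4. The accelerator strategy relies on a tool chain: the Inventory tool (migration diagnostics), Transaxle (SQL translation), the Assembly tool (Airflow DAG generation) and the Inspection tool (data validation).
5. AI plays an important role in the accelerator strategy, for example auto-generating Git PRs, optimizing SQL patterns and supporting a conversational AI agent.
6. The acceleration tools address problems ranging from data exploration to pipeline analysis and make data lake adoption more efficient.
The deck closes by stressing the importance of adapting to change, defining the right data architecture, and identifying adoption/migration bottlenecks early, and notes that AI-based solutions have high potential in data applications.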
