SKYSCANNER'S JOURNEY OF ENABLING PRACTICAL DATA & AI GOVERNANCE
Michael Ewins, Director of Engineering & JM Laplante, Principal Software Engineer
2024 Databricks Inc. All rights reserved

About Skyscanner
A global leader in travel and technology connecting millions of people every day with trusted travel providers.
- We reach more than 110 million users every month
- 5,000 search requests per second

Data fuels Skyscanner
- Analytical data is used to optimize the business and user experience.
- We use Machine Learning models to rank and recommend travel options.
- AI Search answers traveller questions to inspire new experiences.

Data journey: where we started

Firefighting to keep data flowing
We were fixing symptoms not underlying issues

Complex Data Pipelines
Late arriving data, inconsistent data quality, unhappy data customers

Multiple Data Technology Stacks
Complexity, duplicate systems, expensive to operate

Data Tribe
Our mission.
Skyscanner needs reliable & trustworthy data so that we enable smart decisions quickly using data

Direction of travel

The need for simplification
This XKCD comic sums up where we were

Skyscanner data journey
Strategic staircase
- Simplification
- Data maturity assessment
- Enable trust in our data
- Scale data governance
- Governed Data & AI enabled use cases

Simplification: from ETL to ELT
Investing in improvements around Medallion architecture
- All data ingestion lands in Bronze
- Transform data using PySpark/Databricks
- Consume data from Medallion

Skyscanner data journey
Strategic staircase
- Simplification: all data lands in Bronze
- Data maturity assessment
- Enable trust in our data
- Scale data governance
- Governed Data & AI enabled use cases

Data Maturity Assessment
Learning from others

Enable trust in our data

Skyscanner data journey
Strategic staircase
- Simplification: all data lands in Bronze
- Data maturity assessment: plan of action
- Enable trust in our data
- Scale data governance
- Governed Data & AI enabled use cases

Introducing Data Governance
What is our business critical data?
- Hive metastore contained 30,000 tables
- Not all data is equal
- Define our business-critical use cases
- We identified 350 business critical data sets

Measurable Data Governance
Ownership
- Who owns this data?
- Ownership by teams

Measurable Data Governance
Lineage & access control
- Where does my data come from?
- Who uses my data?

Measurable Data Governance
Table level data quality checks
- Is my data complete?
- Is my data on-time?

Measurable Data Governance
Field level data quality checks
- Not all data is equal
- Extra checks for important data
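
The table-level questions ("Is my data complete?", "Is my data on-time?") and the extra field-level checks can be sketched in plain Python. This is only an illustration: the `TableSnapshot` type, its fields, and the thresholds are hypothetical and do not come from the talk or from any Databricks or Monte Carlo API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TableSnapshot:
    """Hypothetical summary of one table load (illustrative type only)."""
    name: str
    row_count: int
    expected_min_rows: int
    last_updated: datetime
    max_staleness: timedelta
    null_fraction_by_field: dict[str, float]


def table_level_checks(snapshot: TableSnapshot, now: datetime) -> dict[str, bool]:
    """Answer the two table-level questions: complete? on time?"""
    return {
        "complete": snapshot.row_count >= snapshot.expected_min_rows,
        "on_time": now - snapshot.last_updated <= snapshot.max_staleness,
    }


def field_level_checks(
    snapshot: TableSnapshot, max_null_fraction: dict[str, float]
) -> dict[str, bool]:
    """Extra checks for important fields: null fraction within its limit.

    A field with no recorded measurement is treated as failing.
    """
    return {
        field: snapshot.null_fraction_by_field.get(field, 1.0) <= limit
        for field, limit in max_null_fraction.items()
    }


# Usage with made-up numbers.
example = TableSnapshot(
    name="calendar_cell_quote",
    row_count=900,
    expected_min_rows=1000,
    last_updated=datetime(2024, 6, 1, 6, 0),
    max_staleness=timedelta(hours=12),
    null_fraction_by_field={"price": 0.0, "currency": 0.2},
)
print(table_level_checks(example, now=datetime(2024, 6, 1, 12, 0)))
print(field_level_checks(example, {"price": 0.01, "currency": 0.05}))
```

Keeping field-level limits in a separate mapping reflects the "not all data is equal" point: only the fields that matter get a limit, and everything else is left unchecked.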

Measurable Data Governance
Coming up: deeper dive on implementation decisions

Skyscanner data journey
Strategic staircase
- Simplification: all data lands in Bronze
- Data maturity assessment: plan of action
- Enable trust in our data
- Scale data governance
- Governed Data & AI enabled use cases

Tracking data set health enablement
Initial focus on 350 business critical datasets
Calculate progress:
- 25% ownership
- 25% enabled in UC
- 25% table quality checks
- 25% field quality checks

Unity Catalog benefits now visible
Discoverability, data domains, lineage, access control, etc.
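
The "Calculate progress" rule for data set health (four criteria worth 25% each) is simple arithmetic, sketched here in Python. The criterion keys and function names are assumptions made for illustration; only the four criteria and the equal 25% weighting come from the slide.

```python
# The four criteria from the slide, each worth 25 percent.
CRITERIA = (
    "ownership",
    "enabled_in_uc",
    "table_quality_checks",
    "field_quality_checks",
)
WEIGHT_PERCENT = 25


def dataset_health(status: dict[str, bool]) -> int:
    """Health score (0-100) for one data set; missing criteria count as not met."""
    return sum(WEIGHT_PERCENT for criterion in CRITERIA if status.get(criterion, False))


def portfolio_progress(datasets: dict[str, dict[str, bool]]) -> float:
    """Average health score across all tracked data sets."""
    if not datasets:
        return 0.0
    return sum(dataset_health(s) for s in datasets.values()) / len(datasets)


# Usage with two made-up data sets.
tracked = {
    "dataset_a": {"ownership": True, "enabled_in_uc": True},
    "dataset_b": {c: True for c in CRITERIA},
}
print(portfolio_progress(tracked))
```

Averaging per-data-set scores gives one number to track across the 350 business critical data sets, while the per-data-set score shows which step is still missing.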

But how could we go faster?
Inspiration came from talks at Data & AI Summit 2023
- YipitData: "how can we get Unity Catalog without teams modifying their code?"
- System Tables: see reads & writes for tables outside of Unity Catalog
- We learned we had 1,500 active tables, not 30,000

Tracking Metadata adoption
Automating data governance
- Data owner
- Data Classification
- Data Category
- Data retention period
- Business criticality
- Table and Column descriptions

Evolving Data Governance
New criteria + clearer asks

Deeper Dive: Enable trust in our data

Enabling Unity Catalog
Re-platform & improve data architecture
Architecture
- Review Databricks Workspace topology

Enabling Unity Catalog
Key decisions
Architecture
- Review Databricks Workspace topology
- UC Metastore structure aligned with Medallion & data domains
- Catalog visibility across workspaces
- Access control model
Technology
- Storage layer infrastructure
- Creation of UC catalogs and enabling them alongside Hive
- Self-serve implementation

Unity Catalog Architecture
2 workspaces connected to UC, single region/metastore

Unity Catalog Architecture
3 Medallion-aligned catalogs per workspace

Unity Catalog Architecture
One S3 bucket per catalog

Unity Catalog Architecture
Catalog visibility - Development workspace

Unity Catalog Architecture
Catalog visibility - Production workspace

Access Control
Write access via service principals & Read access locked by default
- Technical vs. Business ownership
- Only service principals can be table owners
- Business owners get read access by default
- Teams and SPs can request read-only access to other domains
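
The access control rules above can be expressed as a small Python model. This is a sketch under stated assumptions, not how Unity Catalog enforces grants: the `Principal` and `Table` types, the function names, and all example names are hypothetical, and "write access via service principals" is modelled here as "only the owning service principal can write".

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Principal:
    """Hypothetical identity: a team, a person, or a service principal (SP)."""
    name: str
    is_service_principal: bool


@dataclass
class Table:
    """Hypothetical table record with a technical and a business owner."""
    name: str
    technical_owner: Principal          # rule: must be a service principal
    business_owner: str                 # team name; gets read access by default
    granted_readers: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        if not self.technical_owner.is_service_principal:
            raise ValueError("only service principals can be table owners")


def can_write(table: Table, principal: Principal) -> bool:
    """Assumption: writes go only through the owning service principal."""
    return principal == table.technical_owner


def can_read(table: Table, principal_name: str) -> bool:
    """Read access is locked by default; owners and approved readers pass."""
    allowed = {table.technical_owner.name, table.business_owner} | table.granted_readers
    return principal_name in allowed


# Usage with made-up names.
sp = Principal("sp-example-domain", is_service_principal=True)
table = Table("example_table", technical_owner=sp, business_owner="example-team")
table.granted_readers.add("other-team")  # an approved read-only request
print(can_write(table, sp), can_read(table, "other-team"), can_read(table, "stranger"))
```

Separating the technical owner (an SP) from the business owner (a team) keeps write paths automated while still giving the accountable team visibility into its own data.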

Access Control
Example

Self serve domains
Set of templates based on Databricks Terraform provider
Terraform code for domain creation

    module "create-prod-price-accuracy-domain" {
      name        = "price_accuracy" # domain name
      owner       = "peregrine"      # this will become owner schema tag
      description = "This domain contains data for prices accuracy"

      catalogs = [
        {
          catalog_name = "silver"
        }
      ]

      # Other teams and SPs, which want to read tables in this domain
      table_readers = {
        silver = {
          calendar_cell_quote      = ["prod-price-coverage-domain", "Weathervane squad"],
          inaccurate_calendar_cell = ["prod-price-coverage-domain"]
        }
      }
    }

Unity Catalog enabled
Unlocks level 2 of our data governance framework
- Essential metadata
  o Ownership
  o Data Category and classification
  o Descriptions
  o SLOs
  o Lineage & access control
  o Table level quality checks
  o Field level quality checks
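
Checking whether the essential metadata is filled in lends itself to automation. The sketch below assumes each table's metadata is available as a plain dictionary. The key names and both functions are hypothetical, chosen to mirror the essential metadata items (ownership, category and classification, descriptions, SLOs) rather than any real Unity Catalog tag names.

```python
# Hypothetical metadata keys mirroring the essential metadata items.
REQUIRED_FIELDS = (
    "owner",
    "data_category",
    "data_classification",
    "description",
    "slo",
)


def missing_metadata(table_metadata: dict[str, str]) -> list[str]:
    """Required fields that are absent or blank for one table."""
    return [
        name
        for name in REQUIRED_FIELDS
        if not (table_metadata.get(name) or "").strip()
    ]


def adoption_rate(tables: dict[str, dict[str, str]]) -> float:
    """Fraction of tables with every required field filled in."""
    if not tables:
        return 0.0
    complete = sum(1 for metadata in tables.values() if not missing_metadata(metadata))
    return complete / len(tables)


# Usage with made-up metadata.
catalog = {
    "table_a": {
        "owner": "example-team",
        "data_category": "pricing",
        "data_classification": "internal",
        "description": "Example description",
        "slo": "daily by 06:00",
    },
    "table_b": {"owner": "example-team", "description": " "},
}
print(missing_metadata(catalog["table_b"]))
print(adoption_rate(catalog))
```

Reporting the missing fields per table, rather than only a pass/fail flag, turns the result into a concrete ask for the owning team.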

Introducing Monte Carlo
Data Observability Solution
- Enhanced data lineage
- ML-based table level monitors
- Field-level quality metrics
- Basic anomaly detection
- Alerting and incident management
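
To give a feel for what "basic anomaly detection" on a table metric can mean, here is a minimal z-score check on daily row counts. It is a generic textbook technique shown for illustration only; it is not Monte Carlo's algorithm, and the function name, threshold, and sample numbers are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history: list[int], latest: int, threshold: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `threshold` standard deviations
    from the mean of `history` (a simple z-score test)."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # perfectly stable history: any change stands out
    return abs(latest - mu) / sigma > threshold


# Usage with made-up daily row counts.
daily_rows = [1000, 1010, 990, 1005, 995]
print(is_anomalous(daily_rows, 1003))  # within normal variation
print(is_anomalous(daily_rows, 400))   # sharp drop
```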

Measurable data set health
Automation: are we adhering to the data health framework?
- Is owner valid?
- Is metadata filled?
- Are required validations added?
- Is monitoring in place?
- How many incidents were there and were they actioned?

Deeper Dive: Scale data governance

Step 1: Stop the growth of hive
- Production standards
- Provide new UC catalog for ad-hoc & test data with easy access
- Block creation of new databases in hive
- Block creation of new tables in hive
- Be careful with different modes of using and accessing data

Step 2: Define migration scope
- Databricks System tables
- hms_to_uc_migration schema provides access information for S3 locations and tables in Hive
- Scope reduction from 30k tables to 1.5k active tables
- Unlocking benefits of UC vs. Medallion compliance

Step 3: Migration
Challenges
- Decentralized physical storage
  o Multiple S3 buckets
  o Different owners
  o Permissions
- Data not always stored in Delta format
- DBR versions incompatible with Unity Catalog
  o Must be >= 11.3
- Code-free migration for data consumers

Step 3: Migration
Strategy
- DBR version upgrade across the organization
- Per table:
  o Update UC IAM role with read/write permissions
  o Create external location in UC
  o Create external table in UC
  o Setup permissions for access via Unity Catalog
- Update default catalog for everyone

Make our progress visible
Customized hms_to_uc_migration dashboard

Make our progress visible
Databricks Runtime Upgrade progress
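
The scope reduction in Step 2 and the progress tracking shown on the dashboards boil down to two small calculations, sketched here in Python. The inputs are plain dictionaries and sets rather than real system-table rows, and the 90-day activity window is an assumption: the talk does not say how "active" was defined or what columns the hms_to_uc_migration schema exposes.

```python
from datetime import date, timedelta


def active_tables(
    last_access: dict[str, date], today: date, window_days: int = 90
) -> set[str]:
    """Tables read or written within the window (the window is an assumption)."""
    cutoff = today - timedelta(days=window_days)
    return {table for table, accessed in last_access.items() if accessed >= cutoff}


def migration_progress(in_scope: set[str], migrated: set[str]) -> float:
    """Percentage of in-scope tables already available in Unity Catalog."""
    if not in_scope:
        return 100.0
    return 100.0 * len(in_scope & migrated) / len(in_scope)


# Usage with made-up access dates.
accesses = {
    "hive.db.fresh": date(2024, 5, 1),
    "hive.db.stale": date(2023, 1, 1),
    "hive.db.recent": date(2024, 4, 15),
}
scope = active_tables(accesses, today=date(2024, 6, 1))
print(sorted(scope))
print(migration_progress(scope, migrated={"hive.db.fresh"}))
```

Measuring progress against the active set instead of the whole metastore is what makes the target achievable: tables nobody touches never enter the denominator.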

Are we there yet?

Skyscanner data journey
Change takes time so we need to find ways to accelerate
- Simplification: all data lands in Bronze
- Data maturity assessment: plan of action
- Enable trust in our data
- Scale data governance
- Governed Data & AI enabled use cases

The Data and AI flywheel
Data is powering Analytical and Operational use cases

What did we find useful?
Practical data governance at scale
- Learn from others. What does good look like?
- Not all data is equal. What is business critical?
- Unity Catalog v Medallion
- Measure and make progress visible
- Be clear with priorities and asks of non-data teams

Thank you