Centralized Data Governance for Comcast with Unity Catalog (PDF, 61 pages)
Empowering Centralized Data Governance at Comcast with Unity Catalog
2024 Databricks Inc. All rights reserved.

Problem Statement
- Fragmented data governance and access control across 150+ workspaces
- Inefficient data management and utilization
- Increased security risks and potential data breaches
- Difficulty in collaboration and data sharing
- Wasted storage costs due to data redundancy
- Increased administrative overhead for teams

Databricks Unity Catalog
Unified governance for data, analytics and AI

Unity Catalog Architecture
[Diagram labels: User; Databricks Workspace; Account Level; Unity Catalog; User/Grp Mgmt; Metastore; Access Permissions; Auditing; Lineage Explorer; Data Explorer; Delta Sharing; Storage Credentials; External Locations; Container; Cloud Storage (ADLS Gen 2)]

Access Management & Collaboration
Centralized Metadata, Identity, and Access Management
Without Unity Catalog:
- Databricks Workspace 1: Hive Metastore 1, User/Group Management 1, Access Controls 1, clusters, SQL warehouses
- Databricks Workspace 2: Hive Metastore 2, User/Group Management 2, Access Controls 2, clusters, SQL warehouses
With Unity Catalog:
- Unity Catalog (account level): UC Metastore*, User/Group Management, Access Controls
- Identity Provider
- Databricks Workspace 1: clusters, SQL warehouses
- Databricks Workspace 2: clusters, SQL warehouses
* Important: only one UC metastore per region!

Identity Federation Enrollment before UC
Group administration for workspaces before UC:
- Users: created in the workspace
- Groups: created in the workspace with the Create Group button; these are now referred to as "workspace-local" groups
- Challenge: duplicate administration work across workspaces
Identity Federation Enrollment with UC
Group administration for workspaces connected to UC:
- Users: created in the workspace or at the account level; users created in the workspace are synced to the account
- Groups: created exclusively in the Account Console and added to the workspace (Add Group button)
- New "account users" group
- New Service principals menu
- Result: simplified administration

Hive vs UC metadata
Without Unity Catalog:
- Databricks Workspace 1: Hive Metastore 1, Hive Metadata 1; Managed Table Data 1 at dbfs:/user/hive/warehouse/, abfss:/contyourstorage
- Databricks Workspace 2: Hive Metastore 2, Hive Metadata 2; Managed Table Data 2 at dbfs:/user/hive/warehouse/, abfss:/contyourstorage
With Unity Catalog:
- Unity Catalog (account level): UC Metastore, Unity Catalog metadata, User/Group Management, Access Controls
- Default managed table data: abfss:/uc-metastore-rootyourstorage
- Catalog- and schema-level managed table data: abfss:/containeryourstorage
[Diagram also labels: Databricks Managed Service]

Access Control Simplified
Centrally grant and manage access permissions across all workloads.
- Using ANSI standard SQL: GRANT SELECT ON iot.events TO engineers
  (choose permission level: SELECT; choose resource: iot.events; sync groups from identity provider: engineers)
- Using the UI

Data Isolation with UC at Comcast

Storage based isolation
[Diagram labels: Users; SPs]
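The GRANT statement on the Access Control Simplified slide has three parts: a permission level, a resource, and a group. A minimal Python sketch of composing such a statement follows. The `Grant` class and `quote_ident` helper are hypothetical names introduced here, not part of any Databricks library, and the sketch assumes identifier parts do not themselves contain dots.

```python
from dataclasses import dataclass


def quote_ident(name: str) -> str:
    """Backtick-quote one identifier part, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


@dataclass(frozen=True)
class Grant:
    """Hypothetical holder for the three choices shown on the slide."""
    privilege: str       # permission level, e.g. "SELECT"
    securable_type: str  # kind of resource, e.g. "TABLE"
    securable: str       # resource name, e.g. "iot.events"
    principal: str       # group synced from the identity provider

    def to_sql(self) -> str:
        # Assumption: name parts are separated by dots and contain none themselves.
        target = ".".join(quote_ident(p) for p in self.securable.split("."))
        return (
            f"GRANT {self.privilege} ON {self.securable_type} {target} "
            f"TO {quote_ident(self.principal)}"
        )


grant = Grant("SELECT", "TABLE", "iot.events", "engineers")
print(grant.to_sql())
```

The generated string would still have to be run through a SQL warehouse or a notebook; this sketch only builds the text.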
Workspace based isolation
Optionally bind catalogs to specific workspace(s).
[Diagram labels: Databricks Account; Unity Catalog; Prod, Test and Dev workspaces; Prod, Test and Dev catalogs; Prod, Test and Dev schemas (databases); external table; volume; model]
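The binding rule on the workspace based isolation slide can be modeled in a few lines. This is a sketch under stated assumptions: the catalog and workspace names are placeholders, `workspace_can_use` is a hypothetical helper, and a catalog with no binding entry is treated as reachable from every workspace.

```python
from typing import Dict, Optional, Set

# Hypothetical bindings: catalog name -> workspaces allowed to use it.
# Assumption: a catalog missing from this mapping is unbound.
BINDINGS: Dict[str, Set[str]] = {
    "prod_catalog": {"prod_workspace"},
    "test_catalog": {"test_workspace"},
    "dev_catalog": {"dev_workspace"},
}


def workspace_can_use(
    catalog: str,
    workspace: str,
    bindings: Optional[Dict[str, Set[str]]] = None,
) -> bool:
    """Return True if the workspace may reach the catalog under the bindings."""
    table = BINDINGS if bindings is None else bindings
    bound = table.get(catalog)
    # Unbound catalogs are open to all workspaces; bound ones only to listed workspaces.
    return bound is None or workspace in bound


print(workspace_can_use("prod_catalog", "dev_workspace"))
```

Binding narrows where a catalog is visible; user, group, and service principal grants still decide what each principal may do inside it.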
User based isolation
Assign group, user, and service principal permissions as needed.

High Level Roadmap to Unity Catalog
Steps to consider:
1. UC Design: catalogs, workspaces, account groups
2. One Time Setup: create metastore, identity federation, join workspaces
3. Create UC Objects: storage credentials, external locations, catalogs, set owners
4. Migrate Legacy Metadata: SYNC external tables/schemas; migrate managed tables, files
5. Grant Access: catalogs, schemas, tables, files

Step 2, One Time Setup: create metastore and join workspaces

Step 3, Create UC Objects: UC object model

Step 3, Create UC Objects: storage credential, external locations

Managed Location for Catalogs: storage based data isolation
- Must choose from previously configured external locations
- Optionally include a subfolder/path

Catalog-Workspace binding: environment based data isolation
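The managed-location rule above (pick a previously configured external location, optionally add a subfolder) can be sketched as a small check before a catalog is created. The location names and storage URLs below are placeholders, and `managed_location` and `create_catalog_sql` are hypothetical helpers. The emitted statement follows the Databricks SQL `CREATE CATALOG ... MANAGED LOCATION` form.

```python
# Hypothetical external locations (name -> URL); placeholders, not real storage.
EXTERNAL_LOCATIONS = {
    "dev_landing": "abfss://dev@examplestorage.dfs.core.windows.net/",
    "prod_landing": "abfss://prod@examplestorage.dfs.core.windows.net/",
}


def managed_location(external_location: str, subpath: str = "") -> str:
    """Resolve a catalog's managed location from a configured external location."""
    if external_location not in EXTERNAL_LOCATIONS:
        raise ValueError(f"external location not configured: {external_location}")
    base = EXTERNAL_LOCATIONS[external_location].rstrip("/")
    sub = subpath.strip("/")
    return f"{base}/{sub}" if sub else base


def create_catalog_sql(catalog: str, external_location: str, subpath: str = "") -> str:
    """Build the CREATE CATALOG statement for the resolved managed location."""
    url = managed_location(external_location, subpath)
    return f"CREATE CATALOG IF NOT EXISTS `{catalog}` MANAGED LOCATION '{url}'"


print(create_catalog_sql("dev", "dev_landing", "catalogs/dev"))
```

Rejecting unknown locations up front mirrors the slide's constraint that the managed location must come from the configured set.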
Assign group, user, and service principal permissions: user based isolation

Unity Catalog Onboarding Automation Flow

Step 4, Migrate Legacy Metadata
- SYNC external tables/schemas
- Migrate managed tables, files

Data Governance & Security

Hive metastore table access control (legacy)
Main challenges:
- Metastores are tied to clusters, not users.
- Clusters need read access to the storage credentials.
- Mounts are difficult to manage and tend to overprovision access.
[Diagram labels: Workspace; User/Workflow; Cluster; Enable Table Access Control; Credentials; Secret Management; Hive Metastore (HMS); Schema/Database with LOCATION = ext. loc.; Schema/Database (default location) at /user/hive/warehouse; Cloud Storage; External container/bucket; Workspace root container/bucket (DBFS) at /user/hive/warehouse]

Legacy data access vs three-level namespace
- Without UC: SELECT * FROM hive_metastore.default.customers;
  hive_metastore (legacy), default (database), customers (table). The local Hive metastore is displayed for convenience.
- With UC: SELECT * FROM dev.iot.sensor_bronze;
[Diagram labels: Unity Catalog; Catalog 1; Schema A; Tables; External; Views; Volumes; Models/Functions; Foreign* Catalog; Foreign Schema; Foreign Table]

[Diagram: legacy query path]
1. The user sends a query (SQL, Python, R, Scala, ...) to the cluster or SQL endpoint.
2. Check grants (table ACLs).
3. Look up the location in the Hive metastore.
4. Return the path to the table (s3://, abfss://, gs://).
5. Read data from cloud storage (S3, ADLS) with a service principal, passthrough, or instance profile.
6. Return data.
7. Filter unauthorized data.
8. Send the result.

IAM role credential passthrough (legacy)
1. Create and manage instance profiles
2. Connection to S3 is not short-lived
3. Every cluster needs setup with configs and an instance profile
4. Create mounts for cloud locations
5. Risk of overprovisioning
Difficult to administer; overprovisioned access: spark.conf properties/instance profiles must be configured on clusters and table ACLs on each workspace, and mounts are inflexible.
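The contrast between legacy names and the three-level namespace above can be illustrated with a small name-qualification helper. This is a sketch only: `qualify` and `is_legacy` are hypothetical functions, and defaulting unqualified names to `hive_metastore` and the `default` schema is an assumption made for illustration (the default catalog of a real workspace may differ).

```python
LEGACY_CATALOG = "hive_metastore"


def qualify(
    name: str,
    default_catalog: str = LEGACY_CATALOG,
    default_schema: str = "default",
) -> str:
    """Expand a 1-, 2- or 3-part table name to catalog.schema.table."""
    parts = name.split(".")
    if not 1 <= len(parts) <= 3 or any(not p for p in parts):
        raise ValueError(f"not a valid table name: {name!r}")
    if len(parts) == 1:
        parts = [default_schema] + parts
    if len(parts) == 2:
        parts = [default_catalog] + parts
    return ".".join(parts)


def is_legacy(name: str) -> bool:
    """True when the fully qualified name lives in the legacy Hive metastore."""
    return qualify(name).split(".")[0] == LEGACY_CATALOG


print(qualify("default.customers"))
print(is_legacy("dev.iot.sensor_bronze"))
```

A helper like this could be used to inventory which table references in existing notebooks still point at the legacy metastore before migration.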
Life of a query with Unity Catalog
1. The user sends a query (SQL, Python, R, Scala, ...) to the cluster or SQL warehouse.
2. Unity Catalog checks metadata and grants and writes the audit log.
3. Access tokens are created using the IAM role/managed identity.
4. Unity Catalog returns the list of paths/data files and scoped-down temporary tokens.
5. The cluster or SQL warehouse uses the token to read/write data from the paths/data files in cloud storage (S3, ADLS).
6. Data is returned.
7. Policies are enforced.
8. The result is sent to the user.
[Diagram also labels: Admin; Audit log]
With Unity Catalog:
1. No longer any need for instance profiles, mounts, secret scopes, or spark.conf settings
2. No cluster-level configs required
3. Storage credentials and external locations are configured only once and can be used from any workspace
4. Data access for authorized users/service principals is via down-scoped, short-lived OAuth tokens

Secure access to Data Lake & On-prem Data

Pre-UC: multiple access patterns
[Diagram labels: Notebook code; SAT-NG; Client ID/Scope; SAT JWT token; access key + secret key; custom broker; session token; S3 policy; Privacera; S3; OnPrem]
Pattern 1: access to cloud storage (no access to on-prem), Databricks non-UC cluster with Privacera plugin (plugin installed on the cluster)
1. Involves the SAT portal that generates the JWT token
2. Authentication via JWT token; no user-level identity
3. Any user with the token can access the data
4. Privacera needs to be installed on the cluster
5. Users can't access MinIO and non-DX S3 buckets
Pattern 2: on-prem storage + cloud storage, Databricks non-UC cluster without Privacera plugin (JAR installed on the cluster)
1. Doesn't integrate with Privacera
2. Involves the SAT portal that generates the JWT token
3. Authentication via JWT token; no user-level identity
4. Any user with the token can access the data
5. Custom broker JAR needs to be installed on the cluster
6. No fine-grained access

Post UC: single access pattern (Privacera + Unity Catalog)
Access to S3 buckets + on-prem storage from a Databricks UC cluster without the Privacera plugin; no installations needed on the cluster.
[Diagram labels: Notebook code; OnPrem; session token; token; Unity Catalog]
Highlights:
- The user's own identity is used for data access, so governance can be managed without JWT tokens and client IDs
- Integrates with Privacera and at the same time lets users access on-prem and non-DX S3 buckets
- No installations needed on the cluster

Managed Data Sources & External Locations
Simplify data access management across clouds.
[Diagram labels: User; Cluster or SQL warehouse; Unity Catalog; Access Control; External Locations & Storage Credentials; Audit log; Managed data sources (managed tables); External locations (external tables, files in cloud storage); Cloud Storage (S3); Managed container/bucket; External container/bucket; dbutils.fs, spark.read, ...]
Legacy mechanisms listed on the slide:
- AWS: instance profiles, mounts
- Azure: secret scopes, key vaults, key rotation, spark.conf settings, mounts, credential passthrough

Comparison of Managed and External UC Tables
Consider the benefits of managed tables.
Characteristic | Managed | Unmanaged (a.k.a. "External")
Table type property value | "MANAGED" | "EXTERNAL"
DROP TABLE behavior | Deletes the associated data; generally what business analysts (SQL users) expect | Only deletes metadata; does not delete the underlying S3 data; may be helpful in certain use cases
Create table syntax | CREATE TABLE ... (...); | CREATE TABLE ... (...) LOCATION abfss://...;
Data file location | Whichever is found first: location specified for the database (if specified), location specified for the catalog (if specified), metastore default managed storage location | The path specified by the LOCATION keyword in the create table statement
Performance optimizations | Auto Tune (in preview) | Manually managed by the customer
Data format support | DELTA | DELTA, CSV, JSON, AVRO, PARQUET, ORC, TEXT

Data Management & Observability

Unity Catalog Governance Portal

AI for Governance

Lakehouse Monitoring Dashboard
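Two rows of the managed vs external comparison above lend themselves to a short illustration: the "whichever is found first" rule for managed data file locations, and the DROP TABLE behavior. The sketch below is not Databricks code. `managed_table_root` and `drop_deletes_data` are hypothetical helpers, and the URLs in the usage line are placeholders.

```python
from typing import Optional


def managed_table_root(
    schema_location: Optional[str],
    catalog_location: Optional[str],
    metastore_location: Optional[str],
) -> str:
    """Pick the managed storage root: schema first, then catalog, then metastore."""
    for candidate in (schema_location, catalog_location, metastore_location):
        if candidate:
            return candidate
    raise ValueError("no managed storage location configured")


def drop_deletes_data(table_type: str) -> bool:
    """DROP TABLE removes data for MANAGED tables, only metadata for EXTERNAL ones."""
    kind = table_type.upper()
    if kind == "MANAGED":
        return True
    if kind == "EXTERNAL":
        return False
    raise ValueError(f"unknown table type: {table_type}")


# Placeholder URLs: no schema-level location, so the catalog-level one wins.
print(managed_table_root(None, "abfss://cat@example/", "abfss://root@example/"))
print(drop_deletes_data("external"))
```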
Data Explorer Dashboard

Data Lineage Dashboard

Data Audit: Systems Catalog

Audit and Billing Monitoring Using System Catalog

Billing Usage Sample View, Queries

Cost Governance Dashboard

Delta Sharing Protocol
Open data sharing between teams and collaboration: no replication, easy to manage, secure.
- Securely share live data across cloud regions and data platforms, and eliminate vendor lock-in
- Reduce the operational overhead of maintaining multiple copies of data for different recipients
- Centrally manage, monitor usage, and audit access to shared data with ease

Types of Delta Sharing

Questions

External Location Patterns
Comcast governance requirements drive the pattern.
- Patterns: Simple; Prod/Non-Prod; As Necessary
[Diagram labels: External Location; Storage Account; Storage Credential]

Permissions Example: Prod/Dev Catalogs
- Privileges shown (read/write): USE CATALOG, CREATE SCHEMA, USE SCHEMA, MODIFY SCHEMA, SELECT, MODIFY
- Privileges shown (read): USE CATALOG, USE SCHEMA, SELECT
- Developers need Unity Catalog permissions on external locations to read non-UC files and create external tables: READ FILES, CREATE EXTERNAL TABLE
- Storage Blob Data Contributor role

Data Access Model with Unity Catalog
Data access is via short-lived, down-scoped tokens. Users don't need permissions on cloud storage.
[Diagram labels: Workspace; User/Workflow; Cluster or SQL endpoint; Unity Catalog; UC Metastore; Catalog (e.g. env, etc.); Schema/Database; Managed tables; External tables; Volume; Files (Cloud Storage); External Locations; Storage Credentials; token; Cloud Storage; Metastore: managed container/bucket; Container: managed container/bucket; Schema: managed container/bucket; External container/bucket; Hive Metastore (HMS); Schema/Database with LOCATION = ext. loc.; Schema/Database (default location) at /user/hive/warehouse]

Data Access Model combining both approaches (e.g. for migrations)
[Diagram labels: as in the Unity Catalog data access model, plus Credentials; Secret Management; Workspace root container/bucket (DBFS) at /user/hive/warehouse]
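The Prod/Dev permissions example above can be turned into a small generator of GRANT statements. This is a sketch under assumptions: the slide does not spell out which privilege set applies to which catalog, so the read/write set is mapped to the Dev catalog and the read set to the Prod catalog here. The group, catalog, and location names are placeholders, `grant_sql` and `developer_grants` are hypothetical helpers, and only privileges known from the Unity Catalog privilege model are emitted.

```python
from typing import List, Sequence

# Privilege sets taken from the slide; the Dev/Prod mapping below is an assumption.
READ_ONLY = ("USE CATALOG", "USE SCHEMA", "SELECT")
READ_WRITE = ("USE CATALOG", "CREATE SCHEMA", "USE SCHEMA", "SELECT", "MODIFY")
EXTERNAL_LOCATION = ("READ FILES", "CREATE EXTERNAL TABLE")


def grant_sql(privileges: Sequence[str], securable_type: str, name: str, principal: str) -> str:
    """Build one GRANT statement for a comma-separated privilege list."""
    return f"GRANT {', '.join(privileges)} ON {securable_type} `{name}` TO `{principal}`"


def developer_grants(principal: str, dev_catalog: str, prod_catalog: str, location: str) -> List[str]:
    """Read/write on Dev, read on Prod, file access on the external location."""
    return [
        grant_sql(READ_WRITE, "CATALOG", dev_catalog, principal),
        grant_sql(READ_ONLY, "CATALOG", prod_catalog, principal),
        grant_sql(EXTERNAL_LOCATION, "EXTERNAL LOCATION", location, principal),
    ]


for statement in developer_grants("developers", "dev", "prod", "dev_landing"):
    print(statement)
```

Granting at the catalog level relies on privilege inheritance to schemas and tables; a stricter setup would grant per schema instead.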
Advanced Use Cases

UC Machine Learning Model for Customer Churn

Data observability and quality with Lineage
End-to-end visibility into how data flows in the organization, with automated data lineage down to column level.
- View lineage across tables, columns, notebooks, workflows, dashboards
- Captured in real time across all workloads in SQL, Python, Scala, and R
- Integration with enterprise catalogs

Billing Usage Sample View, Queries

Map, secure and audit data across clouds
- Catalog all data, analytics and AI assets and create a unified view of the team's entire data estate
- Centrally manage access permissions and audit controls for files and tables across all workspaces and workloads, using a familiar interface based on ANSI SQL

Software Development Lifecycle setup with UC
- A user group works in DEV, TEST and Prod; users and service principals are assigned to the DEV, TEST and PRD workspaces.
- DEV workspace workloads use the DEV catalog: full access to tables on DEV data, read/write via token.
- TEST workspace workloads use the TEST catalog: full access to tables on TEST data, read/write via token.
- PRD workspace workloads use the PRD catalog: read-only access to tables on PRD data, read via token.
- System accounts: DEV system account for UC, TEST system account for UC, PROD system account for UC, and a system account to execute jobs in PRD.
[Diagram labels: Unity Catalog Metastore; metadata access; Data Lake; DEV data; TEST data; PRD data]

Lakehouse Platform
- Workloads: Data Warehousing, Data Engineering, Data Streaming, Data Science and ML
- Unity Catalog: fine-grained governance for data and AI
- Delta Lake: data reliability and performance
- Cloud Data Lake: all structured and unstructured data

Databricks Unity Catalog
Unified governance for data, analytics and AI
- Centralized governance for data and AI
- Centralized identity and access management
- Enhanced auditing and data sharing
- Built-in data search and discovery
- Automated lineage for all workloads
- Performance and scale
- Simplified administration
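The access rules from the Software Development Lifecycle slide (full access on DEV and TEST data, read-only on PRD data for the user group) can be written down as a small lookup table. This is a sketch only: the principal labels and the `is_allowed` helper are hypothetical, and giving the PRD job-execution system account read/write access to PRD data is an assumption, since the slide only says that this account executes jobs in PRD.

```python
from typing import Dict, FrozenSet

READ: FrozenSet[str] = frozenset({"read"})
FULL: FrozenSet[str] = frozenset({"read", "write"})

# Principal -> environment -> allowed actions.
ACCESS: Dict[str, Dict[str, FrozenSet[str]]] = {
    # From the slide: full access on DEV and TEST, read-only on PRD.
    "user_group": {"DEV": FULL, "TEST": FULL, "PRD": READ},
    # Assumption: the system account that runs PRD jobs can read and write PRD data.
    "prd_system_account": {"PRD": FULL},
}


def is_allowed(principal: str, env: str, action: str) -> bool:
    """Return True if the principal may perform the action in the environment."""
    return action in ACCESS.get(principal, {}).get(env, frozenset())


print(is_allowed("user_group", "PRD", "write"))
```

Keeping the matrix as data makes it easy to review against the grants that actually exist in each catalog.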