《變體數據類型 - 使半結構化數據變得快速而簡單.pdf》由會員分享,可在線閱讀,更多相關《變體數據類型 - 使半結構化數據變得快速而簡單.pdf(50頁珍藏版)》請在三個皮匠報告上搜索。
1、2024 Databricks Inc.All rights reservedVariant Data TypeVariant Data TypeMaking SemiMaking Semi-Structured Structured Data Fast and SimpleData Fast and SimpleGene Pang,Chenhao LiGene Pang,Chenhao Li20242024-0606-131312024 Databricks Inc.All rights reserved Motivation Variant Data Type Overview Using
2、 Variant Deep Dive:Variant Binary Format Performance2OUTLINEOUTLINE2024 Databricks Inc.All rights reserved Semi-structured data is partially structuredDoesnt fully adhere to relational table modelSchema may be unknown,or incompatible,or evolving JSON is very popular semi-structured data formatFlexib
3、le,and supported in most programming languagesHow do we store and process semi-structured data in the lakehouse?3SemiSemi-Structured Data in the LakehouseStructured Data in the Lakehouse2024 Databricks Inc.All rights reserved On ingestion,read data and infer schema(structs,arrays,scalars,etc.)Read q
4、ueries use the relational schema Performance same as structured/relational data4Schema InferenceSchema InferenceOption 1Option 12024 Databricks Inc.All rights reserved Inference must determine a schema that works with all the dataIf data is diverse,can produce huge,but sparse schemas Schema enforcem
5、ent is strictIncoming data must be compatible with schemaAccessing missing field may produce exceptions5Challenges with Schema InferenceChallenges with Schema InferenceTOO STRICTTOO STRICT2024 Databricks Inc.All rights reserved On ingestion,data is stored as stringNo schema enforcement on ingestion
6、Read queries parse the string during execution Maximum flexibility for any data6Treat Data as StringTreat Data as StringOption 2Option 22024 Databricks Inc.All rights reserved Parsing String in queries is slowTypically,data is read more than it is written,so expensive parsing is repeated for every q
7、uery7Challenges with using String typeChallenges with using String typeTOO SLOWTOO SLOW2024 Databricks Inc.All rights reserved8Your ChoicesYour ChoicesTOO STRICTTOO STRICTTOO SLOWTOO SLOW2024 Databricks Inc.All rights reserved2024 Databricks Inc.All rights reserved9VARIANT:VARIANT:OPEN,FLEXIBLE,OPEN
8、,FLEXIBLE,PERFORMANT TYPE FOR PERFORMANT TYPE FOR SEMISEMI-STRUCTURED STRUCTURED DATADATA2024 Databricks Inc.All rights reservedOpenFlexiblePerformantOpenFlexiblePerformant10Variant Data TypeVariant Data TypeSpark&Delta data typeSpark Variant expressionsOpen-source library for Variant binary encodin
9、gNo schema on ingestionSchema-on-read accessOffset-based binary encoding speeds up navigationOpen,Flexible,Performant Data Type for SemiOpen,Flexible,Performant Data Type for Semi-Structured DataStructured Data2024 Databricks Inc.All rights reserved Code merged for Spark 4.0 and Delta Lake 4.0Releas
10、ed in Spark 4.0 PREVIEW and Delta Lake 4.0 PREVIEW Open source library for encoding and decoding Variant binary formatMake it easier for other projects to support Variant Future Variant support for other engines and table formats11Open SourceOpen Source2024 Databricks Inc.All rights reservedparse_js
11、on:Constructs a VARIANT from a JSON stringto_json:Converts a VARIANT to a JSON stringvariant_get:Extracts the path of a specified type,from the VARIANTcast:Cast to and from VARIANTschema_of_variant:Returns the schema string of a VARIANTvariant_explode:Table function for un-nesting a VARIANT12Variant
12、 ExpressionsVariant ExpressionsNew/Updated ExpressionsNew/Updated Expressions2024 Databricks Inc.All rights reserved13Variant Usage ExamplesVariant Usage Examples-Create a table with a Variant columnCREATE TABLE T(variant_col Variant)-Use PARSE_JSON()to convert JSON string to VariantSELECT PARSE_JSO
13、N(json_str_col)variant_col FROM T-Variant path navigationSELECT variant_col:a.b.c:int,variant_col:arr1.field:double FROM T-Un-nest Variant objectsSELECT key,value FROM T,LATERAL VARIANT_EXPLODE(T.variant_col:obj)2024 Databricks Inc.All rights reserved2024 Databricks Inc.All rights reserved14VARIANTV
14、ARIANTBINARYBINARYFORMATFORMATDEEPDEEP-DIVEDIVE2024 Databricks Inc.All rights reserved Binary encoding(instead of plain text)to represent semi-structured data Uses offsets to enable skipping for faster navigation 2 binary blobs used to encodeMETADATA holds dictionary of keysVALUE holds Variant data
15、and structure,referring to dictionary in METADATA On-disk and in-memory binary formats are identical Typically smaller size than String representation15Variant Binary FormatVariant Binary Format2024 Databricks Inc.All rights reserved16Variant Binary FormatVariant Binary FormatSimplified Example of a
16、 Variant ObjectSimplified Example of a Variant Objectheaderdictionary size(n)METADATA2024 Databricks Inc.All rights reserved17Variant Binary FormatVariant Binary FormatSimplified Example of a Variant ObjectSimplified Example of a Variant Objectheaderdictionary size(n)key offset 1.key offset nkey 1.k
17、ey nMETADATA2024 Databricks Inc.All rights reserved18Variant Binary FormatVariant Binary FormatSimplified Example of a Variant ObjectSimplified Example of a Variant Objectheaderdictionary size(n)key offset 1.key offset nkey 1.key nheadernum fields(k)METADATAVALUE2024 Databricks Inc.All rights reserv
18、ed19Variant Binary FormatVariant Binary FormatSimplified Example of a Variant ObjectSimplified Example of a Variant Objectheaderdictionary size(n)key offset 1.key offset nkey 1.key nheadernum fields(k)METADATAVALUEfield id 1.field id kfield id values refer to dictionary keys2024 Databricks Inc.All r
19、ights reserved20Variant Binary FormatVariant Binary FormatSimplified Example of a Variant ObjectSimplified Example of a Variant Objectheaderdictionary size(n)key offset 1.key offset nkey 1.key nheadernum fields(k)METADATAVALUEfield id 1.field id kfield offset 1.field offset kfield value 1.field valu
20、e kfield id values refer to dictionary keys2024 Databricks Inc.All rights reserved21Variant Binary FormatVariant Binary FormatSimplified Example of a Variant ObjectSimplified Example of a Variant Objectheaderdictionary size(n)key offset 1.key offset nkey 1.key nheadernum fields(k)METADATAVALUEfield
21、id 1.field id kfield offset 1.field offset kfield value 1.field value kfield id values refer to dictionary keysfields are lexicographically ordered by dictionary key2024 Databricks Inc.All rights reserved22Variant Binary FormatVariant Binary FormatJSON StringJSON String“key1”:1,“key2”:2,“key1”:3,“ke
22、y2”:4VALUEVALUE0:1,1:2,0:3,1:4METADATAMETADATA“key1”,“key2”Example of Key DeduplicationExample of Key Deduplication2024 Databricks Inc.All rights reserved23Variant Binary NavigationVariant Binary NavigationSimplified Example of NavigationSimplified Example of Navigationfield001:value001,.,field100:v
23、alue100JSON STRING2024 Databricks Inc.All rights reserved24Variant Binary NavigationVariant Binary NavigationSimplified Example of NavigationSimplified Example of Navigationfield001:value001,.,field100:value100JSON STRING2024 Databricks Inc.All rights reserved25Variant Binary NavigationVariant Binar
24、y NavigationSimplified Example of NavigationSimplified Example of Navigationfield001:value001,.,field100:value100JSON STRINGSequential and linear processing of JSON string2024 Databricks Inc.All rights reserved26Variant Binary NavigationVariant Binary NavigationSimplified Example of NavigationSimpli
25、fied Example of Navigationfield001:value001,.,field100:value100JSON STRINGSequential and linear processing of JSON string2024 Databricks Inc.All rights reserved27Variant Binary NavigationVariant Binary NavigationSimplified Example of NavigationSimplified Example of Navigationfield001:value001,.,fiel
26、d100:value100JSON STRINGSequential and linear processing of JSON string2024 Databricks Inc.All rights reserved28Variant Binary NavigationVariant Binary NavigationSimplified Example of NavigationSimplified Example of Navigationfield001:value001,.,field100:value100JSON STRINGSequential and linear proc
27、essing of JSON string2024 Databricks Inc.All rights reserved29Variant Binary NavigationVariant Binary NavigationSimplified Example of NavigationSimplified Example of Navigationfield001:value001,.,field100:value100JSON STRINGSequential and linear processing of JSON string2024 Databricks Inc.All right
28、s reserved30Variant Binary NavigationVariant Binary NavigationSimplified Example of NavigationSimplified Example of Navigationfield001:value001,.,field100:value100JSON STRINGSequential and linear processing of JSON string2024 Databricks Inc.All rights reserved31Variant Binary NavigationVariant Binar
29、y NavigationSimplified Example of NavigationSimplified Example of Navigationfield001:value001,.,field100:value100JSON STRINGSequential and linear processing of JSON string2024 Databricks Inc.All rights reserved32Variant Binary NavigationVariant Binary NavigationSimplified Example of NavigationSimpli
30、fied Example of Navigationfield001:value001,.,field100:value100JSON STRINGSequential and linear processing of JSON string2024 Databricks Inc.All rights reserved33Variant Binary NavigationVariant Binary NavigationSimplified Example of NavigationSimplified Example of Navigationfield001:value001,.,fiel
31、d100:value100JSON STRINGSequential and linear processing of JSON string2024 Databricks Inc.All rights reserved34Variant Binary NavigationVariant Binary NavigationSimplified Example of NavigationSimplified Example of Navigationheader100(dict size)VARIANT METADATAfield001:value001,.,field100:value100J
32、SON STRING0(offset 1).792(offset 100)field001field100.header100(obj size)VARIANT VALUE0.value001value100.990.8912024 Databricks Inc.All rights reserved35Variant Binary NavigationVariant Binary NavigationSimplified Example of NavigationSimplified Example of Navigationheader100(dict size)VARIANT METAD
33、ATAfield001:value001,.,field100:value100JSON STRING0(offset 1).792(offset 100)field001field100.header100(obj size)VARIANT VALUE0.value001value100.990.891Binary search over the fields to find desired field2024 Databricks Inc.All rights reserved36Variant Binary NavigationVariant Binary NavigationExamp
34、le of Binary SearchExample of Binary Searchfield001field100.field075field050.field088.2024 Databricks Inc.All rights reserved37Variant Binary NavigationVariant Binary NavigationSimplified Example of NavigationSimplified Example of Navigationheader100(dict size)VARIANT METADATAfield001:value001,.,fie
35、ld100:value100JSON STRING0(offset 1).792(offset 100)field001field100.header100(obj size)VARIANT VALUE0.value001value100.990.891Jump to the desired fields offset value2024 Databricks Inc.All rights reserved38Variant Binary NavigationVariant Binary NavigationSimplified Example of NavigationSimplified
36、Example of Navigationheader100(dict size)VARIANT METADATAfield001:value001,.,field100:value100JSON STRING0(offset 1).792(offset 100)field001field100.header100(obj size)VARIANT VALUE0.value001value100.990.891Jump to the desired fields value data2024 Databricks Inc.All rights reserved2024 Databricks I
37、nc.All rights reserved39VARIANTVARIANTPERFORMANCEPERFORMANCE2024 Databricks Inc.All rights reserved TPCDS-JSONRows of each table is converted to flat JSON records or Variant records TPCH-NESTEDDataset is denormalized to nested JSON records or nested Variant records40Performance BenchmarksPerformance
38、 Benchmarks2024 Databricks Inc.All rights reserved41Variant vs JSON String PerformanceVariant vs JSON String Performance8x faster20 x faster2024 Databricks Inc.All rights reserved Certain paths are stored in separate columns in file Shredded paths are removed from binary representation Faster to acc
39、ess shredded pathsLess IO required to fetch pathLess CPU required to decode valuesmin/max statistics available for data skipping Performance nearly equivalent to fully structured,relational data42Sneak Peak:Variant ShreddingSneak Peak:Variant ShreddingWorkWork-inin-Progress:Performance Optimizations
40、Progress:Performance Optimizations2024 Databricks Inc.All rights reservedWithout ShreddingWithout Shredding43Variant Shredding StorageVariant Shredding StorageVariant BinaryVariant BinaryVariant Binary2024 Databricks Inc.All rights reservedWithout ShreddingWith ShreddingWithout ShreddingWith Shreddi
41、ng44Variant Shredding StorageVariant Shredding StorageVariant BinaryVariant BinaryVariant BinaryVariant BinaryVariant BinaryVariant BinaryShredded paths are removed from binary and stored in separate columns2024 Databricks Inc.All rights reserved45Querying Variant Shredded DataQuerying Variant Shred
42、ded DataVariant BinaryVariant BinaryVariant BinaryQuery wantsWithout ShreddingWithout Shredding2024 Databricks Inc.All rights reserved46Querying Variant Shredded DataQuerying Variant Shredded DataVariant BinaryVariant BinaryVariant BinaryVariant BinaryQuery wantsVariant Binaryfetch all Variant binar
43、y valuesVariant BinaryWithout ShreddingWithout Shredding2024 Databricks Inc.All rights reserved47Querying Variant Shredded DataQuerying Variant Shredded DataVariant BinaryVariant BinaryVariant BinaryVariant BinaryQuery wantsVariant Binaryfetch all Variant binary valuesVariant Binaryparse each Varian
44、t binary and extract desired pathWithout ShreddingWithout Shredding2024 Databricks Inc.All rights reserved48Querying Variant Shredded DataQuerying Variant Shredded DataVariant BinaryVariant BinaryVariant BinaryVariant BinaryVariant BinaryVariant BinaryVariant BinaryQuery wantsVariant Binaryfetch all
45、 Variant binary valuesVariant Binaryparse each Variant binary and extract desired pathWithout ShreddingWith ShreddingWithout ShreddingWith Shredding2024 Databricks Inc.All rights reserved49Querying Variant Shredded DataQuerying Variant Shredded DataVariant BinaryVariant BinaryVariant BinaryVariant B
46、inaryVariant BinaryVariant BinaryVariant BinaryQuery wantsVariant Binaryfetch all Variant binary valuesVariant Binaryparse each Variant binary and extract desired pathfetch only desiredshredded pathWithout ShreddingWith ShreddingWithout ShreddingWith Shredding2024 Databricks Inc.All rights reservedOpenFlexiblePerformantOpenFlexiblePerformant50Variant Variant Data TypeData Type