Iceberg Advanced Features and Application Practice
Speaker: 陳俊杰 (Chen Junjie), Tencent, Senior R&D Engineer

Agenda
01 Introduction to the latest Iceberg advanced features
02 New scenarios unlocked by Iceberg advanced features
03 Advanced features in application and practice at Tencent
04 Q/A

Part 01: Advanced features from the Iceberg community

Branch and Tag

New Table API:

    createBranch(String name, long snapshotId);
    createTag(String name, long snapshotId);

[Diagram: snapshot history A-B-C with labels master and tag1, an archive branch (D-E), and a test branch (F-G)]

- Create a branch/tag for table

    ALTER TABLE table CREATE TAG|BRANCH tagName
    AS OF VERSION snapshotId
    RETAIN interval DAYS|HOURS|MINUTES

- Read from a branch

    SELECT * FROM table BRANCH/TAG branch_name

    spark().read().format("iceberg").option("branch", branchName).load(table)

- Insert into a branch

    INSERT INTO table BRANCH branch_name SELECT ...

    // df is the Dataset<Row> to append
    df.write().format("iceberg").option("branch", branchName).mode(SaveMode.Append).save(table)

Puffin format

A file format designed to store information such as indexes and statistics about data managed in an Iceberg table that cannot be stored directly within the Iceberg manifest.

    public interface UpdateStatistics extends PendingUpdate<List<StatisticsFile>> {
      /**
       * Set the table's statistics file for the given snapshot, replacing the previous
       * statistics file for the snapshot if any exists.
       *
       * @return this for method chaining
       */
      UpdateStatistics setStatistics(long snapshotId, StatisticsFile statisticsFile);

      /**
       * Remove the table's statistics file for the given snapshot.
       *
       * @return this for method chaining
       */
      UpdateStatistics removeStatistics(long snapshotId);
    }

Statistics

Table statistics:
- Number of rows
- Number of distinct values in a column
- The fraction of NULL values in a column
- Min/max value in a column
- The average data size of a column

How do statistics help the CBO?

View

A view is a logical table that can be referenced by future queries. The Iceberg view definition standardizes the view metadata so that views can be shared across engines.

Part 02: New scenarios unlocked by Iceberg advanced features

Branch scenario 1: CDC ingestion into the lake

Write raw CDC events to the changes branch, and produce a change log feed from that branch.

- Create a snapshot view for users

    CREATE VIEW users AS
    SELECT user_cols.*,  -- the columns of the original table
           txId          -- the incremental transaction id, or timestamp
    FROM (
      SELECT ROW_NUMBER() OVER (PARTITION BY row.id ORDER BY txId DESC) AS row_number,
             operation,
             txId,
             row AS user_cols
      FROM users BRANCH changes
    )
    WHERE row_number = 1 AND operation != 'delete'

[Diagram: branches labeled changes, merged, optimized]

    MERGE INTO Users BRANCH optimized AS t
    USING incr_changes AS s
    ON s.id = t.id
    WHEN MATCHED AND (time cond) UPDATE
    WHEN NOT MATCHED INSERT ALL

Branch scenario 2: multi-stream stitching

Write partial inserts to one branch, then merge the increments into the merged branch.

    // step 1: define window
    WindowSpec windowSpec = Window.partitionBy(primaryKey)
        .orderBy(functions.desc(orderColumn))
        .rangeBetween(Window.unboundedPreceding(), Window.unboundedFollowing());

    // step 2: compact via window aggregations
    //   Primary key  -> col(key column)
    //   Order column -> max(order column)
    //   Data column  -> first(data column, true)

    // step 3: merge into the target branch
    merge into table branch optimized as t
    using aggDf as s
    on t.key = s.key
    when matched then update set *
    when not matched then insert *

Puffin scenario 1: asynchronous stats building

Store table stats.

Puffin scenario: index building

Store indexes.

View scenario: MV

A materialized view is a pre-computed data set derived from a query specification (the SELECT in the view definition) and stored for later use.

Part 03: New Iceberg features in application and practice at Tencent

CBO
- Build table statistics asynchronously, and update partition-level statistics incrementally via theta sketch.

Indexing
- Async indexing, with support for Bloom filter and Bitmap indexes

    CREATE INDEX index_name ON TABLE table_name
    USING BLOOMFILTER (colName1 options, ...) options

    options: OPTIONS (key1 = val1, ...)

[Diagram: a manifest file with entries File 1, File 2, File 3; a Puffin file holding index data at offset1, offset2, offset3, offset4]

Authorization
- Thousands of columns in a table
- Different departments focus on separate columns
- Use an authorized view instead of the table

A/B testing
- Async indexing upon query analysis
- Async z-order clustering upon query analysis
- Effect validation on the branch
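The snapshot view in the CDC scenario keeps, for each row id, only the event with the highest txId, and hides the row if that event is a delete. The following is a minimal in-memory sketch of that rule in plain Java, not Iceberg or Spark code. The class name, the ChangeEvent shape, and the single payload field standing in for the user columns are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CdcSnapshotView {

    /** One raw CDC event (hypothetical shape mirroring the view's columns). */
    static final class ChangeEvent {
        final long id;          // row.id, the primary key
        final long txId;        // incremental transaction id, or timestamp
        final String operation; // "insert", "update" or "delete"
        final String payload;   // stands in for the user columns

        ChangeEvent(long id, long txId, String operation, String payload) {
            this.id = id;
            this.txId = txId;
            this.operation = operation;
            this.payload = payload;
        }
    }

    /** Keeps the newest event per id, then drops ids whose newest event is a delete. */
    static Map<Long, ChangeEvent> snapshot(List<ChangeEvent> changes) {
        // Equivalent of ROW_NUMBER() OVER (PARTITION BY id ORDER BY txId DESC) = 1.
        Map<Long, ChangeEvent> latest = new HashMap<>();
        for (ChangeEvent e : changes) {
            ChangeEvent seen = latest.get(e.id);
            if (seen == null || e.txId > seen.txId) {
                latest.put(e.id, e);
            }
        }
        // Equivalent of the outer filter: operation != 'delete'.
        Map<Long, ChangeEvent> view = new TreeMap<>();
        for (ChangeEvent e : latest.values()) {
            if (!"delete".equals(e.operation)) {
                view.put(e.id, e);
            }
        }
        return view;
    }

    public static void main(String[] args) {
        List<ChangeEvent> changes = new ArrayList<>();
        changes.add(new ChangeEvent(1, 10, "insert", "alice"));
        changes.add(new ChangeEvent(2, 11, "insert", "bob"));
        changes.add(new ChangeEvent(1, 12, "update", "alice-v2"));
        changes.add(new ChangeEvent(2, 13, "delete", null));
        for (ChangeEvent e : snapshot(changes).values()) {
            System.out.println(e.id + " " + e.txId + " " + e.payload);
        }
    }
}
```

In this sketch, when two events for the same id share a txId, the first one seen wins. The SQL view leaves that tie unspecified.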
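Step 2 of the multi-stream stitching scenario collapses all partial rows for one key into a single row: max over the order column, and first(data column, true) for each data column, which is the first non-null value in descending order. The plain-Java sketch below illustrates those semantics outside Spark. The names StreamStitcher and PartialRow are hypothetical, and it assumes every row carries the same number of data columns, with null meaning "this stream did not supply the column".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class StreamStitcher {

    /** A partial row written by one stream (hypothetical shape). */
    static final class PartialRow {
        final String key;    // primary key
        final long order;    // order column, e.g. an event timestamp
        final String[] data; // data columns; null = not supplied by this stream

        PartialRow(String key, long order, String... data) {
            this.key = key;
            this.order = order;
            this.data = data;
        }
    }

    /** Merges partial rows per key; width is the number of data columns. */
    static Map<String, PartialRow> stitch(List<PartialRow> rows, int width) {
        // Partition by primary key.
        Map<String, List<PartialRow>> groups = new TreeMap<>();
        for (PartialRow r : rows) {
            groups.computeIfAbsent(r.key, k -> new ArrayList<>()).add(r);
        }
        Map<String, PartialRow> out = new TreeMap<>();
        for (Map.Entry<String, List<PartialRow>> g : groups.entrySet()) {
            // Order each partition by the order column, newest first.
            List<PartialRow> sorted = new ArrayList<>(g.getValue());
            sorted.sort(Comparator.comparingLong((PartialRow r) -> r.order).reversed());
            // first(column, ignoreNulls = true): take the newest non-null value per column.
            String[] merged = new String[width];
            for (PartialRow r : sorted) {
                for (int c = 0; c < width; c++) {
                    if (merged[c] == null && r.data[c] != null) {
                        merged[c] = r.data[c];
                    }
                }
            }
            // max(order column) is the order value of the first sorted row.
            out.put(g.getKey(), new PartialRow(g.getKey(), sorted.get(0).order, merged));
        }
        return out;
    }

    public static void main(String[] args) {
        List<PartialRow> rows = Arrays.asList(
            new PartialRow("u1", 1, "alice", null),   // stream A supplies the name
            new PartialRow("u1", 2, null, "paris"),   // stream B supplies the city
            new PartialRow("u1", 3, null, "berlin")); // a newer city overrides the older one
        PartialRow merged = stitch(rows, 2).get("u1");
        System.out.println(merged.key + " " + merged.order + " " + Arrays.toString(merged.data));
    }
}
```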
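The indexing slide names Bloom filters as one of the supported index types. To show why such an index lets a reader skip data files, here is a small self-contained Bloom filter in Java. It is a generic textbook sketch, not the format Tencent stores in Puffin files, which the slides do not describe. The class name, the FNV-1a hash, and the double-hashing probe scheme are all assumptions chosen for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Records a column value by setting numHashes bit positions. */
    void add(String value) {
        long h = fnv1a64(value);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32);
        for (int i = 0; i < numHashes; i++) {
            bits.set(Math.floorMod(h1 + i * h2, numBits));
        }
    }

    /** False means the value is definitely absent; true means it may be present. */
    boolean mightContain(String value) {
        long h = fnv1a64(value);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(Math.floorMod(h1 + i * h2, numBits))) {
                return false;
            }
        }
        return true;
    }

    /** 64-bit FNV-1a hash over the UTF-8 bytes of the value. */
    static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    public static void main(String[] args) {
        // One filter per data file, built from the values of the indexed column.
        SimpleBloomFilter file1 = new SimpleBloomFilter(1024, 3);
        file1.add("user-1");
        file1.add("user-2");
        SimpleBloomFilter file2 = new SimpleBloomFilter(1024, 3);
        file2.add("user-3");
        // A point lookup only needs to open files whose filter may contain the key.
        String lookup = "user-3";
        System.out.println("scan file1: " + file1.mightContain(lookup));
        System.out.println("scan file2: " + file2.mightContain(lookup));
    }
}
```

A Bloom filter never yields a false negative, so skipping a file when mightContain returns false is always safe. A false positive only costs an unnecessary file read.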