Navigating Data Harmony by Exploring the Power of Apache Iceberg
Zoe Steinkamp
Copyright 2023, InfluxData

Agenda
- Introduction to Apache Iceberg
- Why it was built + How it works
- Key Benefits of Apache Iceberg
- Migration + Integrations
- Use Cases
- Why InfluxDB is using Iceberg
- Resources

Introduction to Apache Iceberg

Apache Iceberg, an open-source data table format, revolutionizes data management by addressing traditional catalog inefficiencies, improving query performance, and reducing storage costs. It supports ACID transactions, time travel, and SQL-like operations, and it integrates seamlessly with frameworks like Apache Spark and Apache Flink, making it ideal for large-scale data lakes.

What Iceberg is and is not
Iceberg is:
- A table format specification
- APIs and libraries for interaction with that specification
Iceberg is not:
- A storage engine
- An execution engine (for query/compute)
- A service

When Iceberg is not the right fit
- Small datasets: Using Iceberg for a small dataset that doesn't necessitate a data lake might be excessive.
- Real-time data ingestion: Out of the box, Apache Iceberg does not support real-time data ingestion due to its reliance on batch processing.

Why it was built

Case Study - Netflix - Atlas Performance
- Hive table, with Parquet filters: 400k+ splits per day, not combined. Explain Query: 9.6 minutes (planning time)
- Iceberg table, partition data filtering: 15,218 splits, combined. 13 min (wall time) / 10 sec (planning)
- Iceberg table, partition and min/max filtering: 412 splits. 42 sec (wall time) / 25 sec (planning)

How it works
Iceberg Table Format
- Metadata is stored as files in object storage (just like data files).
- Read performance scales with low CPU cost.
- Hierarchical data statistics allow execution engines to efficiently prune metadata and data files.
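
The pruning idea in the last bullet can be sketched in a few lines of plain Python. This is a toy model, not Iceberg's API: the `DataFile` and `Manifest` classes, the single `ts` column, and the file names are all hypothetical. It assumes each data file records a min/max range for one column and that each group of files (playing the role of a manifest) exposes the combined range, so a planner can skip a whole group before looking at any of its files.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DataFile:
    """Hypothetical data file entry with min/max stats for one column (ts)."""
    path: str
    min_ts: int
    max_ts: int


@dataclass
class Manifest:
    """Hypothetical group of data files; its range summarizes its files."""
    files: List[DataFile]

    @property
    def min_ts(self) -> int:
        return min(f.min_ts for f in self.files)

    @property
    def max_ts(self) -> int:
        return max(f.max_ts for f in self.files)


def overlaps(lo: int, hi: int, q_lo: int, q_hi: int) -> bool:
    """True when the closed ranges [lo, hi] and [q_lo, q_hi] intersect."""
    return lo <= q_hi and hi >= q_lo


def plan_scan(manifests: List[Manifest], q_lo: int, q_hi: int) -> Tuple[List[str], int]:
    """Return the files that may hold rows with q_lo <= ts <= q_hi,
    plus the number of manifests that had to be opened."""
    selected: List[str] = []
    opened = 0
    for m in manifests:
        # Level 1: skip the whole manifest using its summary range.
        if not overlaps(m.min_ts, m.max_ts, q_lo, q_hi):
            continue
        opened += 1
        # Level 2: skip individual files using per-file stats.
        for f in m.files:
            if overlaps(f.min_ts, f.max_ts, q_lo, q_hi):
                selected.append(f.path)
    return selected, opened


manifests = [
    Manifest([DataFile("a.parquet", 0, 99), DataFile("b.parquet", 100, 199)]),
    Manifest([DataFile("c.parquet", 200, 299), DataFile("d.parquet", 300, 399)]),
]

files, opened = plan_scan(manifests, 120, 130)
print(files, opened)
```

With the sample data, a query for `ts` between 120 and 130 opens only the first group and selects only `b.parquet`. The Netflix numbers earlier in the deck (400k+ splits down to 412) come from this kind of filtering applied at much larger scale.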
Catalog
- The catalog is the storehouse for current metadata pointers for each table.
- Multiple catalog backends exist, including Hive Metastore, AWS Glue, and SQL/JDBC.

Metadata File and Snapshot
- The table metadata file describes a table's schemas, partition specs, and snapshots.
- Snapshots describe a table at specific points in time.
- Queries can "time travel" (neat!).

Manifest Lists and Manifest Files
- A manifest file describes partition and data stats per data file.
- A manifest list summarizes partition and data stats for multiple manifest files.

Data Files
- These allow execution planners to aggressively and accurately prune data files to be read.
- Careful attention to partitioning strategy yields better execution latency.

Key Benefits of Apache Iceberg

Design Benefits
- Efficiently make smaller updates
- Snapshot isolation for transactions
- Table evolution (Schema/Partition)
- All engines can see changes immediately
- ACID transactions
- Fast query planning and execution
- Time travel queries

Schema Evolution
- Add: add a new column to the table or to a nested struct
- Drop: remove an existing column from the table or a nested struct
- Rename: rename an existing column or field in a nested struct
- Update: widen the type of a column, struct field, map key, map value, or list element
- Reorder: change the order of columns or fields in a nested struct

Handling High Concurrency
- Snapshot Isolation: Important for data integrity
- Transactional Support: Critical for multi-user environments
- ACID: Atomicity, Consistency, Isolation, Durability

Version Rollback + Time Travel
- Time Travel: Good for audit processes and rollback
- Version Rollback: Revert changes in the case of errors

File Pruning and Filtering
- Advanced Filtering: Reduces I/O operations during queries
- Smart File Scan: Speeds up queries
- Partition Pruning: Eliminates irrelevant partitions without scanning them
- Predicate Pushdown: Reduces the amount of data read

Query Execution Optimization
- Columnar I/O: When combined with columnar storage formats like Parquet, Iceberg optimizes columnar I/O operations, allowing it to fetch only the necessary columns for a query, further reducing data access times. Avro, by contrast, is not columnar.

Computational Overhead Reduction
- Incremental data processing: Good for reducing computational load when updates are frequent
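
The snapshot, time travel, rollback, and incremental processing slides can be pictured with a small toy model. This is only a sketch under stated assumptions, not Iceberg's implementation or API: `ToyTable`, `Snapshot`, and all method names are hypothetical. It assumes a snapshot is an immutable set of data file names and that a commit succeeds only when the writer started from the current snapshot. A real writer would typically retry against the new table state. This sketch simply rejects the stale commit.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical immutable view of the table: which files belong to it."""
    snapshot_id: int
    files: FrozenSet[str]


class ToyTable:
    """Toy model of a table whose state is a pointer to the current snapshot."""

    def __init__(self) -> None:
        self.snapshots: Dict[int, Snapshot] = {0: Snapshot(0, frozenset())}
        self.current_id = 0
        self._next_id = 1

    def commit_append(self, base_id: int, new_files: Iterable[str]) -> int:
        """Create a new snapshot on top of base_id and make it current.
        Rejects the commit if another writer moved the table in the meantime."""
        if base_id != self.current_id:
            raise RuntimeError("commit conflict: table changed since base snapshot")
        files = self.snapshots[base_id].files | frozenset(new_files)
        snap = Snapshot(self._next_id, files)
        self.snapshots[snap.snapshot_id] = snap
        self.current_id = snap.snapshot_id  # single pointer swap = atomic commit
        self._next_id += 1
        return snap.snapshot_id

    def scan(self, snapshot_id: Optional[int] = None) -> List[str]:
        """Read the current snapshot, or an older one (time travel)."""
        sid = self.current_id if snapshot_id is None else snapshot_id
        return sorted(self.snapshots[sid].files)

    def rollback(self, snapshot_id: int) -> None:
        """Point the table back at an earlier snapshot."""
        if snapshot_id not in self.snapshots:
            raise KeyError(snapshot_id)
        self.current_id = snapshot_id

    def added_between(self, from_id: int, to_id: int) -> List[str]:
        """Files added after from_id up to to_id (incremental processing)."""
        return sorted(self.snapshots[to_id].files - self.snapshots[from_id].files)


table = ToyTable()
s1 = table.commit_append(table.current_id, ["f1.parquet"])
s2 = table.commit_append(s1, ["f2.parquet", "f3.parquet"])
print(table.scan())               # current state
print(table.scan(s1))             # time travel to the first commit
print(table.added_between(s1, s2))  # only the newly added files
```

Because readers always resolve one snapshot, they never observe a half-finished write, and an incremental job only needs the files returned by `added_between` rather than a full rescan.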
Reliability Features
- Atomic Commits: Prevent incomplete data from being returned
- Reliable Reads: All transactions create snapshots
- File-Level Operations: Target single records

Migration

Step 1 - Choosing your catalog

Project Nessie
- Pros: Git-like functionality; catalog-level rollback and time travel; cloud managed service (Arctic)
- Cons: Limited support from engines beyond Spark, Dremio, Flink, and Presto; if not using Arctic, you must deploy and maintain a Nessie server

Hive Metastore
- Pros: Can use an existing metastore
- Cons: You have to deploy and maintain a Hive metastore

AWS Glue
- Pros: Interop with AWS services
- Cons: Limited support outside of the AWS ecosystem and the tools that support it

Step 2 - In-place migration

Step 2 - Shadow migration

Integrations
Popular data tools that currently support Iceberg include:
- Dremio
- Apache Spark
- Apache Flink
- Project Nessie
- Apache Gobblin
- Apache Drill
- Apache Hive
- Apache Doris
- Apache Impala
- Snowflake
- Cloudera Data Engineering
- Cloudera Data Warehousing
- Trino
- Presto
- Dataproc
- AWS Glue
- Amazon Athena
- StarRocks
- Debezium
- Google BigLake/BigQuery

Use Cases

Expedia
Expedia Group is transitioning to modern data management by adopting Apache Iceberg, enhancing its data lake capabilities with features like ACID transactions and efficient data file tracking to improve performance and consistency.

Apple
Apple collaborates with the Apache Iceberg and Spark communities to refine update, delete, and merge operations in data lakes, using differential files to enhance efficiency and concurrency.

How InfluxDB is joining the Iceberg Ecosystem

[Diagram: InfluxDB, with data and metadata, feeding real-time analytics]

[Diagram: InfluxDB, with data and metadata, feeding real-time analytics, business analytics, reporting, etc.]

Future Prospects and Enhancements
As businesses increasingly rely on data-driven decision-making, the demand for robust data management solutions like Apache Iceberg will grow. Future enhancements will include integration with AI and ML platforms, improved security features, and API support in languages beyond Java, like Python, Rust, and Go.

Resources
- Apache XTable
- Apache Iceberg Summit
- Apache Iceberg 101

Try It Yourself
https:/
https:/

THANK YOU