
Navigating Data Harmony by Exploring the Power of Apache Iceberg.pptx

Uploaded by: Wang** | ID: 171089 | 2024-07-23 | PPTX | 45 pages | 13.99 MB

Part of the collection: 2024 AI/Machine Learning Summit speaker presentations (PPT collection)


Navigating Data Harmony by Exploring the Power of Apache Iceberg
Zoe Steinkamp
Copyright 2023, InfluxData

Agenda
- Introduction to Apache Iceberg
- Why it was built + How it works
- Key Benefits of Apache Iceberg
- Migration + Integrations
- Use Cases
- Why InfluxDB is using Iceberg
- Resources

Introduction to Apache Iceberg

Apache Iceberg, an open-source data table format, revolutionizes data management by addressing traditional catalog inefficiencies, improving query performance, and reducing storage costs. It supports ACID transactions, time travel, and SQL-like operations, and it integrates seamlessly with frameworks like Apache Spark and Apache Flink, making it ideal for large-scale data lakes.

What Iceberg is and is not

Iceberg is:
- A table format specification
- APIs and libraries for interaction with that specification

Iceberg is not:
- A storage engine
- An execution engine (for query/compute)
- A service

When Iceberg is not the right fit
- Small datasets: Using Iceberg for a small dataset that doesn't necessitate a data lake might be excessive.
- Real-time data ingestion: Out of the box, Apache Iceberg does not support real-time data ingestion due to its reliance on batch processing.

Why it was built

Case Study - Netflix - Atlas Performance

Hive table, with Parquet filters:
- 400k+ splits per day, not combined
- Explain query: 9.6 minutes (planning time)

Iceberg table, partition data filtering:
- 15,218 splits, combined
- 13 min (wall time) / 10 sec (planning)

Iceberg table, partition and min/max filtering:
- 412 splits
- 42 sec (wall time) / 25 sec (planning)

How it works

Iceberg Table Format
- Metadata is stored as files in object storage (just like data files).
- Read performance scales with low CPU cost.
- Hierarchical data statistics allow execution engines to efficiently prune metadata and data files.
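The pruning idea in the last bullet can be illustrated with a small, self-contained Python sketch. It is a toy model, not Iceberg's actual API or file layout: the `DataFile` and `Manifest` classes, the `plan_scan` function, and the file paths are all hypothetical names chosen for illustration. Each manifest exposes a summary range for one column, so a planner can skip a whole manifest without looking at its entries, and then skip individual data files by their own min/max bounds.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for one data file entry with column bounds."""
    path: str
    lower: int  # minimum value of the filtered column in this file
    upper: int  # maximum value of the filtered column in this file


@dataclass
class Manifest:
    """Hypothetical manifest: a non-empty group of data files.

    The summary range is computed on the fly here for brevity; a real
    system would store it so the entries need not be read at all.
    """
    files: List[DataFile] = field(default_factory=list)

    @property
    def lower(self) -> int:
        return min(f.lower for f in self.files)

    @property
    def upper(self) -> int:
        return max(f.upper for f in self.files)


def overlaps(lower: int, upper: int, lo: int, hi: int) -> bool:
    """True if the closed range [lower, upper] intersects [lo, hi]."""
    return lower <= hi and upper >= lo


def plan_scan(manifests: List[Manifest], lo: int, hi: int) -> List[str]:
    """Return paths of files that may hold rows with lo <= value <= hi."""
    selected = []
    for manifest in manifests:
        # First level: skip the whole manifest using its summary range.
        if not overlaps(manifest.lower, manifest.upper, lo, hi):
            continue
        # Second level: skip individual files using per-file bounds.
        for data_file in manifest.files:
            if overlaps(data_file.lower, data_file.upper, lo, hi):
                selected.append(data_file.path)
    return selected


manifests = [
    Manifest([DataFile("a.parquet", 0, 99), DataFile("b.parquet", 100, 199)]),
    Manifest([DataFile("c.parquet", 200, 299), DataFile("d.parquet", 300, 399)]),
]
print(plan_scan(manifests, 150, 250))  # ['b.parquet', 'c.parquet']
```

Only files whose bounds overlap the filter are returned, so the query never opens `a.parquet` or `d.parquet`. The same two-level check is why fewer, better-targeted splits show up once min/max filtering is applied.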

Catalog
- The catalog is the storehouse for current metadata pointers for each table.
- Multiple catalog backends exist, including Hive Metastore, AWS Glue, and SQL/JDBC.

Metadata File and Snapshot
- The table metadata file describes a table's schemas, partition specs, and snapshots.
- Snapshots describe a table at specific points in time.
- Queries can "time travel" - neat!

Manifest Lists and Manifest Files
- A manifest file describes partition and data stats per data file.
- A manifest list summarizes partition and data stats for multiple manifest files.

Data Files
- These allow execution planners to aggressively and accurately prune data files to be read.
- Careful attention to partitioning strategy yields better execution latency.

Key Benefits of Apache Iceberg

Design Benefits
- Efficiently make smaller updates
- Snapshot isolation for transactions
- Table evolution (schema/partition)
- All engines can see changes immediately
- ACID transactions
- Fast query planning and execution
- Time travel queries

Schema Evolution
- Add: add a new column to the table or to a nested struct
- Drop: remove an existing column from the table or a nested struct
- Rename: rename an existing column or field in a nested struct
- Update: widen the type of a column, struct field, map key, map value, or list element
- Reorder: change the order of columns or fields in a nested struct

Handling High Concurrency
- Snapshot isolation: important for data integrity
- Transactional support: critical for multi-user environments
- ACID: Atomicity, Consistency, Isolation, Durability

Version Rollback + Time Travel
- Time travel: good for audit processes and rollback
- Version rollback: revert changes in the case of errors

File Pruning and Filtering
- Advanced filtering: reduces I/O operations during queries
- Smart file scan: speeds up queries
- Partition pruning: eliminates irrelevant partitions without scanning them
- Predicate pushdown: reduces the amount of data read

Query Execution Optimization
- Columnar I/O: When combined with columnar storage formats like Parquet, Iceberg optimizes columnar I/O operations, allowing it to fetch only the columns a query needs and further reducing data access times. Avro, by contrast, is row-oriented rather than columnar.

Computational Overhead Reduction
- Incremental data processing: good for reducing computational load when updates are frequent
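The snapshot-related slides above (snapshot isolation, time travel, version rollback) can be made concrete with a toy model in Python. This is a minimal sketch under stated assumptions, not Iceberg's implementation or API: `ToyTable`, `commit`, `scan`, `rollback_to`, and `CommitConflict` are hypothetical names, snapshots are plain in-memory tuples of file names, and the check against `current_id` only stands in for the atomic metadata-pointer swap that a catalog would perform.

```python
from typing import Dict, Optional, Tuple


class CommitConflict(Exception):
    """Raised when a writer's base snapshot is no longer current."""


class ToyTable:
    """Hypothetical in-memory table that keeps every snapshot it has seen."""

    def __init__(self) -> None:
        # snapshot id -> immutable tuple of data file names
        self._snapshots: Dict[int, Tuple[str, ...]] = {0: ()}
        self.current_id = 0

    def commit(self, base_id: int, added: Tuple[str, ...]) -> int:
        """Append files as a new snapshot if base_id is still current."""
        if base_id != self.current_id:
            raise CommitConflict(f"base {base_id} != current {self.current_id}")
        new_id = max(self._snapshots) + 1
        self._snapshots[new_id] = self._snapshots[base_id] + added
        self.current_id = new_id  # one pointer swap publishes the commit
        return new_id

    def scan(self, snapshot_id: Optional[int] = None) -> Tuple[str, ...]:
        """Read the current snapshot, or an older one for time travel."""
        sid = self.current_id if snapshot_id is None else snapshot_id
        return self._snapshots[sid]

    def rollback_to(self, snapshot_id: int) -> None:
        """Point the table back at an earlier snapshot; history is kept."""
        if snapshot_id not in self._snapshots:
            raise KeyError(snapshot_id)
        self.current_id = snapshot_id


table = ToyTable()
s1 = table.commit(table.current_id, ("f1.parquet",))
s2 = table.commit(s1, ("f2.parquet",))

print(table.scan())    # current state: both files
print(table.scan(s1))  # time travel to the first commit

try:
    table.commit(s1, ("stale.parquet",))  # writer based on an old snapshot
except CommitConflict as exc:
    print("rejected:", exc)

table.rollback_to(s1)
print(table.scan())    # state after rollback: only f1.parquet
```

Because snapshots are never modified after they are written, a reader holding an older snapshot id keeps seeing a consistent view while writers publish new ones. In this sketch a conflicting writer simply fails; a real writer would more likely re-base on the new current snapshot and retry.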

Reliability Features
- Atomic commits: prevent incomplete data from being returned
- Reliable reads: all transactions create snapshots
- File-level operations: target single records

Migration

Step 1 - Choosing your catalog

Project Nessie
- Pros: Git-like functionality; catalog-level rollback and time travel; cloud managed service (Arctic); support from engines beyond Spark, Dremio, Flink, Presto
- Cons: if not using Arctic, you must deploy and maintain a Nessie server

Hive Metastore
- Pros: can use an existing metastore
- Cons: you have to deploy and maintain a Hive metastore

AWS Glue
- Pros: interop with AWS services
- Cons: support outside of the AWS ecosystem and the tools that support it

Step 2 - In-place migration

Step 2 - Shadow migration

Integrations

Many popular data tools currently support Iceberg, including:
- Dremio
- Apache Spark
- Apache Flink
- Project Nessie
- Apache Gobblin
- Apache Drill
- Apache Hive
- Apache Doris
- Apache Impala
- Snowflake
- Cloudera Data Engineering
- Cloudera Data Warehousing
- Trino
- Presto
- Dataproc
- AWS Glue
- Amazon Athena
- StarRocks
- Debezium
- Google BigLake/BigQuery

Use Cases

Expedia
Expedia Group is transitioning to modern data management by adopting Apache Iceberg, enhancing their data lake capabilities with features like ACID transactions and efficient data file tracking to improve performance and consistency.

Apple
Apple collaborates with the Apache Iceberg and Spark communities to refine update, delete, and merge operations in data lakes, using differential files to enhance efficiency and concurrency.

How InfluxDB is joining the Iceberg Ecosystem

[Diagram labels: InfluxDB, Data, Metadata, Real-time analytics; the following slide adds Business analytics, Reporting, etc.]

Future Prospects and Enhancements
As businesses increasingly rely on data-driven decision-making, the demand for robust data management solutions like Apache Iceberg will grow. Future enhancements will include integration with AI and ML platforms, improved security features, and API support in languages beyond Java, like Python, Rust, and Go.

Resources
- Apache XTable
- Apache Iceberg Summit
- Apache Iceberg 101

Try It Yourself
https:/
https:/

THANK YOU


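This deck introduces Apache Iceberg, an open-source data table format that reshapes data management by addressing the inefficiencies of traditional catalogs, improving query performance, and reducing storage costs. Iceberg supports ACID transactions, time travel, and SQL-like operations, and it integrates seamlessly with frameworks such as Apache Spark and Apache Flink, which makes it well suited to large-scale data lakes. The deck covers how Iceberg works, its key benefits, migration and integration approaches, use cases, and how InfluxDB is joining the Iceberg ecosystem. It also points to resources such as the Apache Iceberg Summit and Apache XTable.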

