Jingsong-Zhixin: Apache Paimon, a Storage Foundation for the Real-Time Lakehouse (PDF slide deck, 27 pages)
Li Jingsong, Paimon PMC Chair

About me
- Head of the open-source big data table storage team at Alibaba Cloud
- PMC Member of Apache Flink
- Committer of Apache Iceberg
- PMC Chair of Apache Paimon

Content
- PART ONE: Real-time lakehouse
- PART TWO: Apache Paimon
- PART THREE: Use cases
- PART FOUR: Frontier technology

PART ONE: Real-time lakehouse

How the timeliness of data architectures has evolved
Diagram labels: ODS, DWD, DWS, ADS layers connected by ETL steps, with Query on the layers; the steps are marked Batch or Stream depending on the architecture.
- Batch warehouse: batch compute + Hive tables + OLAP. ETL latency: days or hours. Query latency: minutes or seconds.
- Real-time warehouse: stream compute + Kafka + OLAP. ETL latency: seconds or minutes. Query latency: seconds or milliseconds.
- Real-time lakehouse: stream and batch compute + lake format + OLAP. ETL latency: minutes or days. Query latency: minutes or seconds.

The real-time lakehouse is an in-place upgrade of the batch warehouse
- Batch warehouse: T+1 freshness; writes overwrite whole partitions; reads scan whole partitions.
- Real-time lakehouse: minute-level freshness; streaming updates; batch updates; streaming reads; file filtering.
- Unlocks the complete big data
ecosystem: unified stream and batch compute, OLAP

Choosing a data architecture
- Batch warehouse, real-time lakehouse, real-time warehouse

PART TWO: Apache Paimon

Apache Iceberg
Shared Database Storage for Big Data

Apache Iceberg: a format upgrade of Hive storage
- Object-storage friendly
- ACID transactions
- INSERT & UPDATE & DELETE
- Time travel and rollback
- Schema evolution
- Tag & Branch

Apache Paimon: a real-time upgrade of Iceberg
- Streaming primary-key updates
- Changelog consumption

Apache Paimon: lake format + LSM
- Write: accumulate records in an in-memory buffer, then append them.
- Compaction: minor compaction reduces write amplification.
- Cleanup: expired snapshots are cleaned up automatically by default.
- Read: multi-way sorted merge within a single bucket.

The Past, Present and Future of Paimon
- 2022.01: Initialized in Apache Flink as a sub-project; original name Flink Table Store.
- 2023.03: Became an Apache incubating project; renamed to Apache Paimon; opened to engines other than Flink.
- 2023.12: GA version released; production ready for both streaming and batch.
- 2024.03: Became an Apache Top-Level Project, graduating from the Apache Incubator.
- 2024.10: Integrated with Flink/Spark/Presto/...; unified lake format; unified for streaming, batch and OLAP; compatible with Apache Iceberg.
Project names along the timeline: Table Store, Paimon, Paimon-1.0.

PART THREE: Use cases

Database CDC ingestion into the lake
Diagram labels: Schema Evolution, Flink CDC whole-database sync, Changelog Data, Paimon, Kafka sync, Flink CDC sync.
- Table sync (Paimon CDC)
  - Schema change sync: add column; drop column (ignored); type change; rename column (adds a new column)
  - Add computed columns
  - Define partition columns
  - Define primary keys
  - Sync of sharded databases and tables
- Whole-database sync (Paimon CDC)
  - INCLUDING/EXCLUDING
  - Table name prefix and suffix
  - Automatically skip failed tables
  - Dynamically add new tables
- One-click sync into the lake
- CDC DataStream API: write streaming jobs that synchronize schema.

Alibaba intelligent engine practice: unified database ingestion into the lake
Diagram labels: sharded MySQL tables; App 1, App 2, ... App n; Paimon Table; Scan Snapshot-1, Scan Snapshot-2,
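Scan Snapshot-3

1. Drawbacks of the original pipeline
   1. Concurrency is capped, so throughput is limited; blindly adding concurrency risks bringing the database down.
   2. Core databases can only be pulled at night.
   3. Every application has to pull the MySQL tables itself.
2. Benefits of the new pipeline
   1. No concurrency cap, which frees up throughput.
   2. Pulls can run 24 hours a day.
   3. MySQL tables are pulled only once and can be shared.
Diagram label: SELECT * FROM, stream or batch.

Fully streaming ETL on the lake
Diagram labels: CDC Ingestion, ODS, DWD (Partial-Update), DWS (Aggregation), Stream, Partial-Update Merge, Aggregation Merge, Changelog Producer, Query.
- Little state, simple, low cost, and queryable along the whole pipeline.

Ant Group practice: computing UV metrics
- Resource cost: with the Paimon-based solution, CPU usage dropped by about 60% and memory usage by about 35%.
- Checkpoint stability improved greatly; checkpoint size and checkpoint duration dropped by 90%.
- Time to backfill and reset a job dropped by 80%.
- The development model is simpler and development cost is lower.

Query acceleration on the lake
Diagram labels: Data Ingestion, ODS, DWD, DWS, Batch, Z-order, Query, Application, Doris.
- Spark x Paimon: no slower than internal tables
- Z-order sorting / clustering
- File-level indexes
- StarRocks: fast queries within 5 seconds
- Low-cost, high-performance OLAP

Ele.me practice: OLAP on the lake
Diagram labels: Data Ingestion, Paimon +
- Freshness: 1 to 5 minutes
- Query latency: within 5 seconds
- Cost: 1/10

PART FOUR: Frontier technology

Deletion Vectors: Merge On Write
- Tables without a primary key: DELETE & UPDATE generate Deletion Vectors.
- Primary-key tables: streaming writes generate Deletion Vectors.
- Updates and fast queries at the same time.

Tag & Branch
Diagram labels: snapshots S1, S2, S3, S4; tag T2; Create Tag (immutable); Create Branch at S2; Main Branch (S3, S4); Custom Branch; Merge Branch; Replace Branch.
- Tags support TTL management.
- Tag & Branch: operate on data the way Git operates on code.
- Unified stream and batch, isolated for engineering validation and testing.

Universal index support
- Built-in index mechanism: minmax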
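- Extended index mechanisms: bitmap, bloom filter, inverted index
- Indexes support nested types: Map Key
- 10x to 100x performance improvement when a query hits an index

Summary
- 01 Unified real-time and offline processing
- 02 Low latency, low cost
- 03 High development efficiency

Try the unified big data lake format on Alibaba Cloud
Enterprise-grade, high-performance, cloud Serverless service
- The slide's QR codes lead to a free Flink + Paimon trial resource pack and to the Apache Paimon technical discussion group on DingTalk.
- New users can claim the resource pack for 0 yuan as a free trial (5000 CU-hours, 3 months).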
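The index mechanisms named on the universal index slide (minmax as the built-in one, bitmap as an extension) can be pictured with a small Python sketch. This is an illustration under stated assumptions only, not Paimon's implementation or API: `DataFile`, `build_file`, `scan`, and the `(user_id, city)` row layout are hypothetical names invented for the example. It shows the general idea of file-level indexes, where a query consults per-file metadata first and opens only the files that can contain a match.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """A toy data file: its rows plus the index metadata stored with it (hypothetical)."""
    name: str
    rows: list                      # list of (user_id, city) tuples
    min_id: int = 0                 # min-max index on user_id
    max_id: int = 0
    city_bitmap: dict = field(default_factory=dict)  # city -> set of row positions


def build_file(name, rows):
    """Compute the min-max statistics and the bitmap index for one file."""
    ids = [r[0] for r in rows]
    bitmap = {}
    for pos, (_, city) in enumerate(rows):
        bitmap.setdefault(city, set()).add(pos)
    return DataFile(name, rows, min(ids), max(ids), bitmap)


def scan(files, user_id=None, city=None):
    """Return (matching rows, names of the files that had to be opened)."""
    opened, result = [], []
    for f in files:
        # Min-max index: skip the file if user_id is outside its [min, max] range.
        if user_id is not None and not (f.min_id <= user_id <= f.max_id):
            continue
        # Bitmap index: skip the file if the city never occurs in it.
        if city is not None and city not in f.city_bitmap:
            continue
        opened.append(f.name)
        # With a bitmap hit, only the listed row positions are read.
        positions = f.city_bitmap[city] if city is not None else range(len(f.rows))
        for pos in sorted(positions):
            row = f.rows[pos]
            if user_id is None or row[0] == user_id:
                result.append(row)
    return result, opened


files = [
    build_file("f1", [(1, "Hangzhou"), (5, "Beijing")]),
    build_file("f2", [(10, "Shanghai"), (20, "Hangzhou")]),
    build_file("f3", [(30, "Beijing"), (40, "Beijing")]),
]

rows, opened = scan(files, user_id=20)
print(rows, opened)
```

In this toy data set the lookup for `user_id=20` opens a single file out of three, because the min-max ranges of the other two exclude the value. The speedup quoted on the slide depends on how selective the index is for a given query; the sketch makes no claim about real-world numbers.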