
Unlocking Near Real Time Data Replication with CDC, Apache Spark™ Streaming, and Delta Lake.pdf

Uploaded by: 2*** | ID: 139075 | 2023-06-04 | PDF | 26 pages | 1.50MB

Part of the collection: Data + AI Summit 2023 presentation slides


Unlocking Near Real Time Data Replication with CDC, Apache Spark Streaming, and Delta Lake
Databricks 2023
Ivan Peng and Phani Nalluri

How many orders did DoorDash do yesterday?

Get me data from databases
    select * from table_name

Get me data from databases, fast
    select * from table_name where updated_at > $LATEST_DATE
    merge

Get me data from databases, fast and as the schema changes
    select * from information.schemas where name = table_name;
    merge
    reconcile
    page
    incompatible
    x1000

Somewhere in there is a migration from Redshift to Snowflake, and building a whole orchestration system around the tasks.

History, AKA the State of Data at DoorDash, 2020
- 90% of 1000 DB tables were dumped to Snowflake via a naive dump
- Incremental tables required:
    - Table to have an updated_at field
    - Index on that field
    - Application to update that field on every write operation
- CDC was present, but in its infancy at DoorDash

Project Pepto
Alleviating indigestion of data processing

Requirements
- Have better data freshness than 24 hours
- Own our data on a modern Lakehouse platform
- Handle schema evolution and backfills
- Enable analytical workloads that otherwise would have been run on the production databases

Design Tenets
- Lean into CDC/Kafka across all database flavors
- Build a self-serve platform to democratize onboarding of tables
- Write-once, read many
- Leverage streaming checkpointing to bypass late-arriving data
- Operational simplicity

Project Pepto: What we are not
- A coupled service with databases
- A real-time system that feeds into online services

Project Pepto: Highlighted Design Decisions
- Not-kappa architecture
- Freezing schemas with "schema registry"
- Delta Lake over other table formats

[Diagram slides: Steady State Mode, Rebuild Mode, Batch Merge Mode]

Project Pepto: Results
- Table onboarding down to <1 hour and self-serve
- 450 streams, over 1000 EC2 nodes running 24/7
- ~800 GB/day as input, ~80 TB rewritten/day
- Data freshness of ~7-30 minutes

Project Pepto: Challenges and Learnings
- Checkpointing solves a lot of problems
- Type conversions are hard! For every adapter there's 2 serializers
- Large tables are operationally challenging
- State management is tough: make everything idempotent
- Databricks's API with idempotency guarantees simplifies a lot
- Reputation is hard to gain, easy to lose

Future Work
- Ad hoc queries to migrate from online DBs to Delta Lake workloads
- Streaming PII obfuscation in medallion architecture
- Schema changes to the source

Questions?
Thank you
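The "merge" step and the "make everything idempotent" learning can be illustrated with a small, library-free sketch. It is not DoorDash's implementation: the `ChangeEvent` shape, the `seq` log position, and the in-memory `table` are all assumptions made for illustration. The idea shown is that a CDC micro-batch is reduced to the latest event per key and applied only when it is newer than what the target already holds, so replaying the same batch after a failure leaves the target unchanged.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class ChangeEvent:
    """One CDC record (hypothetical shape, not taken from the talk)."""
    key: int                              # primary key of the source row
    seq: int                              # increasing log position of the change
    op: str                               # "upsert" or "delete"
    row: Optional[Dict[str, Any]] = None  # row image for upserts


# Target table: key -> (seq of last applied change, row or None for a tombstone)
Table = Dict[int, Tuple[int, Optional[Dict[str, Any]]]]


def merge_batch(table: Table, events: Iterable[ChangeEvent]) -> Table:
    """Apply a micro-batch of change events to the table idempotently."""
    # Step 1: keep only the newest event per key within the batch.
    latest: Dict[int, ChangeEvent] = {}
    for ev in events:
        current = latest.get(ev.key)
        if current is None or ev.seq > current.seq:
            latest[ev.key] = ev

    # Step 2: apply an event only if it is newer than what was applied before.
    for key, ev in latest.items():
        applied = table.get(key)
        if applied is not None and applied[0] >= ev.seq:
            continue  # replayed or stale event: skip
        # Deletes are stored as tombstones so an older upsert cannot revive the row.
        table[key] = (ev.seq, None if ev.op == "delete" else ev.row)
    return table


def live_rows(table: Table) -> Dict[int, Dict[str, Any]]:
    """Return the rows that currently exist (tombstones filtered out)."""
    return {k: row for k, (_, row) in table.items() if row is not None}


if __name__ == "__main__":
    batch = [
        ChangeEvent(1, 10, "upsert", {"id": 1, "status": "created"}),
        ChangeEvent(1, 12, "upsert", {"id": 1, "status": "delivered"}),
        ChangeEvent(2, 11, "upsert", {"id": 2, "status": "created"}),
        ChangeEvent(2, 13, "delete"),
    ]
    table: Table = {}
    merge_batch(table, batch)
    merge_batch(table, batch)  # replaying the batch changes nothing
    print(live_rows(table))
```

In a Spark and Delta Lake setting the same two steps would typically map to a per-key deduplication of the micro-batch followed by a merge into the target table; the sketch only shows the logic, not that API.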


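DoorDash used CDC, Apache Spark Streaming, and Delta Lake to achieve near-real-time data replication. The talk covers pulling data from databases quickly and merging it as schemas change; building an orchestration system around those tasks during the migration from Redshift to Snowflake; and the state of data at DoorDash in 2020. At that time, 90% of database tables were loaded into Snowflake by a naive dump, and incremental tables required an `updated_at` field, an index on that field, and an application that updated the field on every write.

Project Pepto set out to improve data freshness, own the data on a modern Lakehouse platform, handle schema evolution and backfills, and enable analytical workloads that would otherwise run on the production databases. Its design tenets were to use CDC/Kafka across all database flavors, build a self-serve platform that democratizes table onboarding, write once and read many times, use streaming checkpointing to bypass late-arriving data, and keep operations simple. Project Pepto is not a service coupled with the databases, and it is not a real-time system that feeds online services. The highlighted design decisions were a not-kappa architecture, freezing schemas with a "schema registry", and choosing Delta Lake over other table formats. The system runs in steady state, rebuild, and batch merge modes.

Results: table onboarding dropped to under 1 hour and became self-serve; 450 streams run on over 1000 EC2 nodes, taking in about 800 GB per day and rewriting about 80 TB per day, with data freshness of about 7-30 minutes. Challenges and learnings: checkpointing solves a lot of problems, type conversions are hard, every adapter has two serializers, large tables are operationally challenging, state management is tough, and the idempotency guarantees of the Databricks API simplify a lot. Future work: migrating ad hoc queries from online databases to Delta Lake workloads, streaming PII obfuscation in a medallion architecture, and handling schema changes at the source.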
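The slides list "reconcile" and "incompatible" next to the schema lookup, and name "freezing schemas with a schema registry" as a design decision, without giving the rules. As a rough sketch under stated assumptions, a reconcile step could compare the frozen schema with the one currently reported by the source and separate additive changes from incompatible ones. The type names, the `WIDENINGS` rule set, and the `reconcile` function are hypothetical and are not taken from the talk or from any library.

```python
from typing import Dict, List, Tuple

# Assumed set of safe type widenings (old type, new type); purely illustrative.
WIDENINGS = {("int", "bigint"), ("float", "double")}


def reconcile(
    frozen: Dict[str, str], source: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Compare a frozen schema with the source schema.

    Returns the schema to use going forward and a list of incompatible
    changes that would need human attention (for example, a page).
    """
    merged = dict(frozen)
    incompatible: List[str] = []
    for column, new_type in source.items():
        old_type = frozen.get(column)
        if old_type is None:
            merged[column] = new_type      # new column: additive change
        elif old_type == new_type:
            continue                       # unchanged column
        elif (old_type, new_type) in WIDENINGS:
            merged[column] = new_type      # safe widening
        else:
            # Anything else keeps the frozen type and is reported.
            incompatible.append(f"{column}: {old_type} -> {new_type}")
    # Columns dropped at the source stay in the frozen schema in this sketch.
    return merged, incompatible


if __name__ == "__main__":
    frozen = {"id": "bigint", "total": "int"}
    source = {"id": "bigint", "total": "bigint", "tip": "int"}
    schema, problems = reconcile(frozen, source)
    print(schema, problems)
```

Keeping the frozen type when a change is incompatible means the stream can keep running on the old schema while someone decides whether the table needs a rebuild.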
