Data Platform at Airbnb
Hongbo Zeng, Airbnb

Agenda
- Cluster Evolution
- Incremental Data Replication - ReAir
- Unified Streaming and Batch Processing - AirStream

Scale of Data Infrastructure at Airbnb
- 13B events collected
- 35PB warehouse size
- 1400+ machines
- Hadoop + Presto + Spark
- 5x YoY data growth

Data Platform
- Ingestion: event logs via Kafka, MySQL dumps via Sqoop
- Gold Cluster (HDFS, Hive, Yarn) replicated by ReAir to the Silver Cluster (HDFS, Hive, Yarn)
- Spark Cluster (Spark, AirStream) and Presto Cluster, with AirPal, SuperSet, and Tableau on top
- Airflow scheduling; S3 for storage

Cluster Evolution

Original Cluster
- Single HDFS, MR, and Hive installation
- c3.8xlarge (32 cores / 60GB mem / 640GB disk) + 3TB of EBS volume
- 800 nodes
- Tested DataNodes on different AZs
- All data managed by Hive

Challenges
- Limited isolation between production and ad hoc workloads: difficult to meet SLAs, harder to plan capacity
- Disaster recovery
- Difficult rollouts

Two Clusters
- Two independent HDFS, MR, and Hive metastore installations
- d2.8xlarge with 48TB local disk; 250 instances in the final setup
- Replication of common/critical data: Silver is a superset of Gold
- Separate AZs for disaster recovery
- Gold Cluster (HDFS, Hive, Yarn) -> replication -> Silver Cluster (HDFS, Hive, Yarn)

Multi-Cluster Trade-Offs
- Advantages: failure isolation of user jobs, easy capacity planning, guaranteed SLAs, ability to test new versions, disaster recovery
- Disadvantages: data synchronization, user confusion, operational overhead

Incremental Data Replication - ReAir

Warehouse Replication Approaches
- Batch: scan HDFS and the metastore, copy relevant entries; simple and stateless, but high latency
- Incremental: record changes in the source and copy/re-run the operations on the destination; more complex, more state, but low latency (seconds)

Incremental Replication
- Record changes on the source
- Convert changes to replication primitives
- Run the primitives on the destination

Record Changes on Source
- Hive provides a hooks API that fires at specific points: pre-execute, post-execute, and failure
- Use the post-execute hook to log created objects into an audit log
- The hook runs in the critical path for queries

Example Audit Log Entry

Convert Changes to Primitive Operations
- 3 types of objects: DB, table, partition
- 3 types of operations: copy, rename, drop
- 9 different primitive operations, all idempotent

Primitive Example
- CREATE TABLE srcpart (key STRING) PARTITIONED BY (ds STRING) => copy table
- INSERT OVERWRITE TABLE srcpart PARTITION (ds=1) SELECT key FROM src => copy partition
- ALTER TABLE srcpart SET FILEFORMAT TEXTFILE => copy table
- ALTER TABLE srcpart RENAME TO srcpart_old => rename table

Copy Table Flow
- If the source does not exist, stop
- If the destination exists and is identical, done
- Otherwise: copy to a temp location, verify the copy, move tmp -> dest, add metadata, done

Unified Streaming and Batch Processing - AirStream

Batch Infrastructure
- The batch stack shown above: Kafka and Sqoop ingestion into the Gold/Silver Hive clusters, Spark, Presto, Airflow scheduling, and S3

AirStream
- Source -> Process -> Sink

Streaming at Airbnb - AirStream
- Spark Streaming jobs on the cluster, scheduled by Airflow, with state in HBase and HDFS
- Sources: Kafka, S3, HDFS
- Sinks: Datadog, Kafka, DynamoDB, Elasticsearch

Lambda Architecture
- Batch: AirStream over Hive and Spark SQL
- Streaming: AirStream over Kafka and Spark Streaming, with state storage

Sources
- Streaming:
    source:
      name: source_example,
      type: kafka,
      config:
        topic: example_topic
- Batch:
    source:
      name: source_example,
      type: hive,
      sql: select * from db.table where ds = '2017-06-05'

Computation
- Streaming/Batch:
    process:
      name = process_example,
      type = sql,
      sql = SELECT listing_id, checkin_date, context.source AS source
            FROM source_example WHERE user_id IS NOT NULL

Sinks
- Streaming:
    sink:
      name = sink_example,
      input = process_example,
      type = hbase_update,
      hbase_table_name = test_table,
      bulk_upload = false
- Batch: identical, except bulk_upload = true

Computation Flow
- Source -> Process_A -> Process_A1 -> Sink_A2, and Source -> Process_B -> Sink_B2
- The same flow runs in streaming mode and in batch mode

Unified API through AirStream
- Declarative job configuration
- Streaming source vs. static source
- Computation operators and sinks can be shared by streaming and batch jobs
- The computation flow is shared by streaming and batch
- A single driver executes the job in both streaming and batch mode

Shared State Storage

AirStream Shared Global State Store
- HBase tables read and written by many Spark Streaming and Spark batch jobs

Why HBase
- Well integrated with the Hadoop ecosystem
- Efficient API for streaming writes and bulk uploads
- Rich API for sequential scans and point lookups
- Merged view based on version

Unified Write API
- DataFrame -> re-partition by region -> Puts (streaming) or HFile BulkLoad (batch) into HBase regions 1..N

Rich Read API
- Multi-gets, prefix scans, and time-range scans over the HBase tables from Spark streaming/batch jobs

Merged Views
- A row key accumulates versions over time, e.g. R1: V01@TS01, V150@TS150, V200@TS200 from streaming writes
- A batch bulk upload can later add V100@TS100; reads merge versions by timestamp, so the newest version still wins

Our Foundations
- Unify streaming and batch processing
- Shared global state store

Database Snapshot: MySQL DB Snapshot Using Binlog Replay
- Large amount of data: multiple large MySQL DBs
- Realtime-ness: minutes of delay rather than hours
- Transactions: need to keep transactions consistent across different tables
- Schema change: table schemas evolve

Move Elephant
- Binlog replay on Spark: 20+ hr -> 4+ hr
- AirStream job: 5 mins - 1 hr (SpinalTap, seed)

Streaming and Batch Share Logic
- Shared components: binlog file reader, DDL processor, transaction processor, DML processor
- Merged by binlog position
- Idempotent: the log can be replayed multiple times
- Schema changes: the full schema change history is kept

Binlog Replay Architecture
- MySQL instance -> binlog (realtime/history) -> log parser
- XID events -> transaction processor; DML -> change processor; DDL -> schema processor; results land in HBase
- A lambda architecture: the same pipeline handles the realtime and historical binlog
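The merge-by-binlog-position and idempotency points above can be sketched with a small model of a versioned state store. This is a hypothetical stand-in, not AirStream's API: `VersionedStore` and `apply_binlog` are illustrative names, and the only assumptions are the ones stated in the deck, namely that each cell keeps multiple versions and that reads return the value at the highest version (the merged view).

```python
# Hypothetical model of an HBase-like versioned state store (illustrative,
# not AirStream's actual API).
from collections import defaultdict

class VersionedStore:
    def __init__(self):
        self.cells = defaultdict(dict)  # (row, column) -> {version: value}

    def put(self, row, column, value, version):
        # Writing the same (version, value) pair twice is a no-op,
        # so replaying the log is safe.
        self.cells[(row, column)][version] = value

    def merged_get(self, row, column):
        # Merged view: the value at the highest version wins.
        versions = self.cells[(row, column)]
        return versions[max(versions)] if versions else None

def apply_binlog(store, events):
    # Use the binlog position as the cell version, so ordering is preserved
    # regardless of when (or how often) an event is applied.
    for pos, row, column, value in events:
        store.put(row, column, value, version=pos)

events = [
    (100, "user:1", "email", "a@example.com"),
    (150, "user:1", "email", "b@example.com"),
]
store = VersionedStore()
apply_binlog(store, events)
apply_binlog(store, events)  # replay the log: idempotent
# A batch bulk upload carrying an older snapshot (lower position) does not win:
store.put("user:1", "email", "snapshot@example.com", version=120)
print(store.merged_get("user:1", "email"))  # -> b@example.com
```

Because the version is the binlog position, a replayed log and a backfilled snapshot both converge on the same merged view as the live stream.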
Realtime Ingestion and Interactive Query
- Kafka -> AirStream (Spark Streaming) -> HBase
- Query engines over HBase: Spark SQL, Hive SQL, Presto SQL, fronted by DataPortal
- Interactive query in SqlLab

Thanks

Realtime OLAP with Druid

Realtime Ingestion for Druid
- Kafka -> AirStream (Spark Streaming) -> Druid Beam -> Druid (dimensions, metrics)
- Superset powered by Druid
- Realtime indexing combined with batch indexing from Hive

Realtime Indexing into Elasticsearch
- AirStream runs Spark Streaming over Kafka events and Spark batch over tables A, B, C
- es _version = mutation id

Backup Slides: Tips

Moving Window Computation

Long Window Computation
- What if the window is weeks, months, or even years?

Distinct in a Large Window
- I don't want approximation. What should I do?

Distinct Count
- One row key per (listing, visitor), with the visit timestamp:
    Listing 1 | Visitor 01 | TS100
    Listing 1 | Visitor 02 | TS100
    Listing 1 | Visitor 03 | TS99
    Listing 1 | Visitor 04 | TS98
- A prefix scan on the listing with a time range returns the exact distinct visitors in the window

Moving Average
- One row key per listing, storing a running total:
    Listing 1 | Total Review Cnt: 01  | TS01
    Listing 1 | Total Review Cnt: 50  | TS50
    Listing 1 | Total Review Cnt: 98  | TS99
    Listing 1 | Total Review Cnt: 100 | TS100
- For any window (e.g. window 1 ending at TS99, window 2 at TS100), the moving average is the count difference divided by the time elapsed

Schema Enforcement
- Streaming events: Thrift -> DataFrame
- Thrift class + FieldMetaData -> Spark struct type; Thrift object + field values -> Row -> DataFrame

Summary
- Unify batch and streaming computation
- Global state store using HBase

Run Primitives on Destination
- Serial execution: easy to reason about operations, but very slow
- Parallel execution: fast and scalable, but ordering matters (e.g. create a table before copying a partition)
- Model the work as a DAG of primitive operations
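The serial-vs-parallel point above can be made concrete with a topological execution sketch. This is an illustrative example, not ReAir's actual scheduler: the primitive names and dependencies are invented, and it uses Python's standard `graphlib` (Python 3.9+). Dependencies encode ordering rules such as "create the table before copying its partitions"; primitives in the same ready batch could run in parallel.

```python
# Sketch: executing replication primitives as a DAG (hypothetical names).
from graphlib import TopologicalSorter

# primitive -> set of primitives it depends on
dag = {
    "copy_table:srcpart": set(),
    "copy_partition:srcpart/ds=1": {"copy_table:srcpart"},
    "copy_partition:srcpart/ds=2": {"copy_table:srcpart"},
    "rename_table:srcpart->srcpart_old": {
        "copy_partition:srcpart/ds=1",
        "copy_partition:srcpart/ds=2",
    },
}

ts = TopologicalSorter(dag)
ts.prepare()
order = []
while ts.is_active():
    ready = ts.get_ready()        # primitives whose dependencies are satisfied
    order.append(sorted(ready))   # everything in this batch can run in parallel
    ts.done(*ready)

for batch in order:
    print(batch)
# -> ['copy_table:srcpart']
# -> ['copy_partition:srcpart/ds=1', 'copy_partition:srcpart/ds=2']
# -> ['rename_table:srcpart->srcpart_old']
```

This captures the trade-off from the slide: a plain serial walk of the same DAG is trivially correct but slow, while batching the ready set recovers parallelism without violating the ordering constraints.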