Cloud-Native Open Source Data Lake Best Practices — Open Source Big Data & AI Industry Practice Forum (16-page slide deck)
Cloud Native Open Source Data Lake Best Practices
Lü Lü, Head of Big Data, Xiaohongshu (RED)

About Xiaohongshu: young users record moments of their daily lives and share lifestyles through short videos, images and text, and interact around shared interests. Over 70% of users were born in the 1990s or later; for young people the platform is an irreplaceable lifestyle platform and a consumption-decision entry point. 100 million monthly active young users.

[Figure: data-volume growth curve; y-axis ticks 100 through 500, x-axis month-ends from 07/31 through 09/30 of the following year.] Xiaohongshu's data volume grew roughly 4x over the last ~14 months.

Key design considerations: scalability at the core, separation of the warehouse from the serving side, and embracing the cloud.

Xiaohongshu big data architecture (bottom-up):
- Storage layer: HDFS, S3, OSS, Alluxio, AgentFS
- Data layer: HiveMeta, Parquet, ORC, Avro, Iceberg, Hudi, Dynamic Schema
- Compute resource layer: K8s, Yarn, Pavo
- Compute engine layer: Tez, Spark, Flink, ClickHouse, TiDB, Presto, Doris, Kylin; federated query and UDF governance; offline engines, real-time engines, ad-hoc analysis, materialization engines, engine governance
- Application layer: Waypoint, Vela, Gravity, Tableau, Guanyuan, Dor, Rugal, Poka, Galaxy, Virgo, experiment platform; dev tooling, product and operations enablement, analytics and reporting

Storage layer: object storage at the core.
- Multiple storage classes — standard, infrequent access, archive, cold archive — under unified lifecycle management
- A strongly consistent data model keeps application architecture simple
- The storage classes together cover all access patterns
Evaluation criteria for the storage layer:
- Cost: at massive data scale, storage cost must match the value of the data
- Performance: satisfies the needs of many different data scenarios
- Reliability: highly available data; security and compliance
- Ecosystem: fast ingestion from many data sources; data shared across multiple engines

Storage layer — AgentFS. AgentFS sits between the compute engines and S3/OSS, together with a MetaServer, the Hive MetaStore, Porter (data migration), a backing DB, and Prometheus. Its operations include: creating, querying, deleting and renaming file paths; looking up a table's file paths; deleting files; reading and writing files by path; data relocation ("data homing"); querying table or partition locations; adding partitions, querying partition info and updating partition locations; querying migration-table info; exporting monitoring metrics; and batch-deleting files.
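The AgentFS design hinges on one idea: an indirection layer that maps tables and partitions to storage paths, so data can be relocated (e.g., HDFS to S3/OSS) without rewriting every consumer's paths. A minimal sketch of that idea, with hypothetical names (this is not the real AgentFS/MetaServer API):

```python
# Sketch of a partition-location indirection service, as described on the
# AgentFS slide. Class and method names are assumptions for illustration.

class MetaServer:
    def __init__(self):
        self.partitions = {}  # (table, partition) -> storage path

    def add_partition(self, table, partition, path):
        # "add partition info"
        self.partitions[(table, partition)] = path

    def get_location(self, table, partition):
        # "query table or partition location" -- engines resolve paths here
        return self.partitions[(table, partition)]

    def move_partition(self, table, partition, new_path):
        # "data homing": after the migration tool copies the files,
        # flip the pointer; readers transparently pick up the new path
        self.partitions[(table, partition)] = new_path
```

Engines that always resolve paths through the MetaServer never hard-code storage locations, which is what makes the HDFS-to-object-storage migration transparent.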
Data layer — Dynamic Schema. Definition of the dw_user_profile_day table (the struct type parameters inside the angle brackets were lost in extraction and are shown as <…>):

    CREATE EXTERNAL TABLE dw_user_profile_day (
      user_token             STRING,
      gender                 STRUCT<…>,
      location_info          STRUCT<…>,
      app_info               STRUCT<…>,
      user_app_list          STRUCT<…>,
      user_channel_info_list STRUCT<…>)
    PARTITIONED BY (dtm STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS
      INPUTFORMAT  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';

(The extracted input/output format class names contained a spurious ".serde." package segment; the names above are the standard Hive classes.) Protobuf class mapping noted on the slide: user_profile_location_info → CLocationInfo; user_profile_userapplist → CUserAppList.

Example query producing the latest basic profile per user:

    INSERT OVERWRITE TABLE reddw.dw_user_profile_basic_info_day PARTITION (dtm=20210816)
    SELECT
      a.user_token,
      a.s.col2.profile_gender         AS profile_gender,
      a.s.col2.profile_latest_country AS profile_latest_country,
      a.s.col2.profile_usually_city   AS profile_usually_city
    FROM (
      SELECT
        user_token,
        MAX(STRUCT(dtm, NAMED_STRUCT(
          'profile_gender',         default.pb_deserializer(gender).enum_gendertype,
          'profile_latest_country', default.pb_deserializer(location_info).latestlocation.country,
          'profile_usually_city',   default.pb_deserializer(location_info).usuallylocation.city))) AS s
      FROM reddw.dw_user_profile_day
      WHERE dtm >= f_getdate(20210816, -2)  -- '>=' is a reconstruction; the angle bracket was likely eaten by HTML extraction
      GROUP BY user_token) a

The custom UDF pb_deserializer looks up and invokes the relevant deserialization classes through a PB (protobuf) deserializer registry.
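The MAX(STRUCT(dtm, …)) … GROUP BY user_token pattern above picks, for each user, the field values from the most recent partition: structs (like tuples) compare element-wise, so the row with the largest dtm wins. A plain-Python equivalent, with illustrative data (not Xiaohongshu's code):

```python
# Illustration of the "latest value via MAX(STRUCT(date, values))" trick
# used in the Hive query above. Rows are (user_token, dtm, fields).
from collections import defaultdict

rows = [
    ("u1", "20210814", {"profile_gender": "male",   "profile_usually_city": "Shanghai"}),
    ("u1", "20210816", {"profile_gender": "male",   "profile_usually_city": "Beijing"}),
    ("u2", "20210815", {"profile_gender": "female", "profile_usually_city": "Chengdu"}),
]

def latest_per_user(rows):
    # Python tuples compare element-wise, just like Hive structs, so
    # max() over (dtm, fields) keeps the fields of the largest dtm.
    best = defaultdict(lambda: ("", {}))
    for user, dtm, fields in rows:
        best[user] = max(best[user], (dtm, fields))
    return {user: fields for user, (_, fields) in best.items()}
```

The GROUP BY plus max-over-struct avoids a self-join against a "latest dtm per user" subquery, which matters at profile-table scale.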
Compute engine layer. Engines grouped by scenario — real-time, BI, OLAP, RED serving: Flink, Hive on Tez, Spark SQL, ClickHouse, TiDB.

Compute engine layer — ClickHouse.
- Shared-nothing layout: ClickHouse1/Shard1, ClickHouse2/Shard2, ClickHouse3/Shard3, each shard holding its MergeTree metadata and MergeTree parts on local disk.
- Disaggregated layout: a service tier (Server), a scheduling tier (MetaStore, Admin, KV Store), compute tiers (Worker clusters whose workers use local SSD storage), and a storage tier built on JuiceFS over object storage.

Compute resource layer — Pavo, component inventory from the slide: oa api, yarn api, rbac, fe api, webhook api; aws java sdk, qcloud sdk, aliyun sdk; aws cloudwatch, pushgateway, openfaas; dispatcher api, scenario metric api, scenario api, log api, cluster api, aws ig api, cluster manager api, ig expand/shrink api, cluster pool api; bvao server, cluster Discovery-Service, ovap server.
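In the disaggregated ClickHouse layout, the workers' local SSDs act as a read-through cache in front of JuiceFS/object storage, which is what makes compute scale-out cheap while keeping hot reads fast. A minimal sketch of that read path, with assumed names (not the actual ClickHouse or JuiceFS code):

```python
# Sketch of a worker's read-through SSD cache over object storage,
# as implied by the disaggregated layout on the slide.

class Worker:
    def __init__(self, object_store):
        self.object_store = object_store  # stands in for JuiceFS over S3/OSS
        self.local_ssd = {}               # stands in for the local SSD cache
        self.cache_hits = 0

    def read_part(self, part_id):
        if part_id in self.local_ssd:      # hot path: serve from local SSD
            self.cache_hits += 1
            return self.local_ssd[part_id]
        data = self.object_store[part_id]  # cold path: fetch from object storage
        self.local_ssd[part_id] = data     # populate cache for later reads
        return data
```

Because the authoritative copy lives in object storage, a worker's cache can be dropped and rebuilt at any time, so workers are effectively stateless and the cluster can expand or shrink freely.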
Compute resource layer — Pavo front end: product user/display side, core systems/product features, external services/components; Pavo front end, Airflow, Dor, Venus.

Data application — Vela reporting system. TiDB and Kafka feed Vela tasks; each task declares upstream dependencies and fires when its upstream data is ready. The build pipeline runs SQL interaction → AST parsing → lineage parsing → task construction, on top of the object-storage data lake core.

Data applications — Galaxy; experiment platform.

Applications under exploration — point lookups on Iceberg. Data is shuffled by uid, sorted, and written into an Iceberg table: a Flink cluster consumes from the Kafka cluster and writes to OSS/S3 object storage, and a Spark cluster periodically reads, sorts and rewrites the data. After rewriting, the data distribution within a single partition (layout reconstructed from the slide) is:

    date=2021-09-01 / bucket=0..99 / minute15=1..N / file0, file1, …, fileN

(The slide draws this layout twice, before and after the periodic rewrite; it is shown once here.) A point query then prunes down to the single bucket holding the id:

    SELECT * FROM ubt WHERE id = 'xx1' AND date = 'xxx'

Simplified structure of the ubt table:

    CREATE TABLE hive_prod.iceberg_db.ubt (
      id       STRING,
      content  BINARY,
      date     STRING,
      minute15 STRING)
    USING iceberg
    PARTITIONED BY (date, bucket(100, id), minute15)  -- the slide reads "last_date", but "date" is the only date column defined
    TBLPROPERTIES (
      'write.metadata.metrics.default' = 'truncate(32)',
      'write.format.default'           = 'parquet',
      'write.target-file-size-bytes'   = '536870912',  -- 512 MB
      'read.split.target-size'         = '536870912');

THANKS
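The bucket(100, id) partition transform in the ubt table above is what makes the point query cheap: the id predicate resolves to exactly one of the 100 buckets, so 99% of the files are pruned before any data is read. A sketch of the pruning logic (illustrative only: Iceberg's real bucket transform uses a 32-bit Murmur3 hash, not the CRC32 stand-in below, and pruning happens inside the engine's scan planner):

```python
# Illustration of bucket-based file pruning for point lookups,
# per the Iceberg exploration slide. Hash choice is a stand-in.
import zlib

NUM_BUCKETS = 100  # matches bucket(100, id) in the table definition

def bucket_of(id_: str) -> int:
    # deterministic stand-in for Iceberg's bucket partition transform
    return zlib.crc32(id_.encode()) % NUM_BUCKETS

def files_to_scan(manifest, id_, date):
    # manifest: {(date, bucket): [data files]}.  A query on (id, date)
    # touches only the one (date, bucket) cell, not all 100 buckets.
    return manifest.get((date, bucket_of(id_)), [])
```

Without the bucket transform, a point lookup on id would have to open every file in the date partition; with it, the scan set shrinks by roughly the bucket count.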