《2. Apache Doris 在趣丸科技實時數倉的應用實踐.pdf》由會員分享,可在線閱讀,更多相關《2. Apache Doris 在趣丸科技實時數倉的應用實踐.pdf(25頁珍藏版)》請在三個皮匠報告上搜索。
1、Apache Apache DorisDoris在趣丸科技實時數倉的應用實踐高瑞林趣丸科技 數據部數據倉庫架構師Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Do
2、ris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023自我介紹高瑞林現任趣丸科技數據倉庫架構師個人簡介:曾就職于百度,基于 Doris 擁有廣告投放,游戲社交,在線教育等領域的數據倉庫架構經驗,目前負責趣丸科技實時數倉的架構Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2
3、023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summi
4、t Asia 2023Doris Summit Asia 2023目錄2.實時數倉架構設計3.實時數倉場景化實戰4.總結收益與展望1.趣丸科技實時數倉發展歷程Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023D
5、oris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023趣丸科技實時數倉發展歷程1 1Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023
6、Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 20231.1 1.1 公司業務介紹
7、趣丸科技核心業務興趣社交電子競技技術研發唱鴨麥可TTchatTT語音人工智能聯合實驗室碼上躍見技術品牌王者榮耀分部、英雄聯盟分部英雄聯盟手游分部、和平精英分部Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023D
8、oris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023運營活動運營標簽營銷活動效果分析業務報表數據駕駛艙多維度分析報表分鐘趨勢報表算法數據預測模型訓練1.2 Doris 1.2 Doris 使用場景近百個實時任務導入數據單表數據量最大增量超百億/天數百T T存儲20222022年上半年引入DorisDo
9、ris主集群規模數十個節點Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023D
10、oris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023第一階段第二階段第三階段(現在)面向單業務單場景面向單個業務數倉面向公司全業務數倉運營和拉新活動TTTT語音業務所有實時場景公司所有業務和場景業務場景數倉發展挑戰1.3 1.3 趣丸科技實時數倉的發展歷程穩定性保障開發效率場景多樣化Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Sum
11、mit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023D
12、oris Summit Asia 20231.4 1.4 實時數倉架構的演變(V1.0)(V1.0)binlogbinlog更新維表dwddwd事實表數據源mysqlmysql遇到的主要問題問題排查難度高:體現在基于Flink將指標提前聚合好,結果存儲在Mysql/mongodb等,問題排查需要從Kafka里面拉取明細數據;數據質量問題,多流join都在hbase,經常出現數據延遲導致結果計算錯誤;dimdim維度表維度退化dwsdws聚合表FlinkSqlFlinkSqlKafkaKafka存儲層數據應用大盤指標模型預測埋點日志系統日志架構設計基于Flink+Hbase的架構mongodbm
13、ongodb數據回傳Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Dori
14、s Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 20231.5 1.5 實時數倉架構的演變(v2.0)(v2.0)更新維表dwddwd事實表dimdim維度表維度退化dwsdws聚合表FlinkSqlFlinkSqlKafkaKafka存儲計算數據應用大盤指標用戶標簽多維報表dwsdwsdwddwdFlinkSqlFlinkSql調度系統.binlogbinlog數據源埋點日志系統日志遇到的主要問題Clickhouse的本身問題:對標準SQ
15、L支持有限,語法不易理解;多維報表的查詢效率慢,高峰情況下容易超時;支持直接訂閱Kafka,但是無法保障數據不丟失,不重復實時更新刪除的能力較弱,集群運維成本高;架構設計基于Flink+Clickhouse高時效性場景通過Flink計算準實時的將明細數據寫到Clickhouse,通過分鐘級調度計算;Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia
16、2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 20231.6 1.6 應用場景面臨的問題應用場景時效性要求業務的關注點數據特點面臨的問題運營標簽根據計算復雜度復雜
17、標簽:準實時簡單標簽:實時關注用戶在特定業務場景下的標簽數據,用于運營策略的調整涉及實體:用戶和設備統計指標:人數,人次,金額,時間統計周期:近n分鐘,近n小時,近n天,歷史累計基于現有的架構開發成本高實時報表根據指標級別核心指標:實時其他指標:準實時關注廣告投放后各渠道的獲客指標,消費指標,及端內關鍵行為指標維度靈活:通過各種維度組合看數據準確性:數據準確性要求高多維度分析場景下,使用Clickhouse通常需要join維度表,查詢較慢,使用高峰期容易超時算法預測核心模型:實時其他模型:準實時通過模型預測未來數據的變化趨勢表多且數據量大:通常使用的特征數據每天都在數億級任務性能瓶徑問題,時效
18、性無法滿足業務Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris S
19、ummit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 20231.7 Doris VS Clickhouse 1.7 Doris VS Clickhouse 性能測試基礎條件:集群1FE3BE,duplicate表,數據量7億左右,集群規模華為云服務器32core256G SSD 1T查詢1查詢 2查詢3Clickhouse24.162Doris2.12.6812.76010203040506070查詢時間(單位:S)性能測試對比測試場景:查詢1:單表
20、(7億)分組聚合查詢2:大表(7億)Join 小表(400w)查詢3:大表(7億)Join 大表(7億)笛卡爾積查詢3的示例:Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia
21、 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023實時數倉的架構設計2 2Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2
22、023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 20232.1 2.1 數倉架構設計方案架構設計公共明細層:通過 Flin
23、kSQL 對各業務的原始數據做清洗,形成各業務統一標準的 dwd 層數據,供下游使用;針對通用場景,整個所有業務通用字段,便于下游使用針對時效性高場景:基于清洗后的 DWD,通過 FlinkSQL 計算后,寫入對應的存儲介質,供下游使用;針對準實時場景:基于清洗后的 DWD,通過 DorisSQL,分鐘級計算,寫入 Doris 后,供下游使用Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia
24、2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 20232.2 Doris 2.2 Doris 在實時場景的應用方案架構
25、設計DWD基于清洗后的明細數據,通過FlinkSQL寫入Doris,作為Doris的數據源,通過Doris的明細模型,盡可能保留全部字段,方便新增字段時,歷史數據的回溯;基于Doris的第一層明細數據,針對復雜的join場景做維度退化,最終字段設計盡可能輕量級;DWS復雜標簽的計算,可在Doris計算后,同步至Kafka供下游使用;基于Doris的聚合模型,對指標進行預聚合,供上層報表或分析使用;Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit
26、 Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 20232.3
27、2.3 數據監控方案監控方式數據鏈路數據上報數據清洗數據同步數倉層由于涉及鏈路長,暫時沒法對端到端做完整監控,只針對全鏈路的各個環節做保障Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summi
28、t Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 20232.4 2.4 內部平臺對Doris Doris 支撐DorisDoris元數據管理通過提前注冊Doris和Kafka的元數據,任務的開發只需要關注邏輯的實現DorisDoris任務血緣基于社區Doris Doris 的Flink connector Flink con
29、nector 的改良新增條件下推,doris開源官方的在實現flink-connector-doris的說明不實現Filter下推和Projection下推改造后:數據的FilterFilter不需要通過參數配置改造前:Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit A
30、sia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023實時數倉場景化實戰3 3Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asi
31、a 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Su
32、mmit Asia 2023場景一:實時報表維度變更問題業務場景廣告投放場景下:分省份,城市,性別維度,統計用戶數遇到的問題在doris中指定了維度列和指標列,AGGREGATE模型設定了key之后,維度發生變更會導致匯總后結果重復。用戶idid年齡省份城市更新時間A未知廣東省廣州市10:30:20B18湖北省武漢市10:30:20用戶idid年齡省份城市更新時間A25廣東省廣州市10:40:18年齡省份城市用戶數未知廣東省廣州市118湖北省武漢市1聚合后10分鐘后用戶填寫了年齡由未知變更為25結果表年齡省份城市用戶數未知廣東省廣州市125廣東省廣州市118湖北省武漢市1聚合后結果表省份城市用
33、戶數廣東省廣州市2湖北省武漢市1按省份匯總查看Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit
34、Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023場景一:實時報表維度變更問題解決方式將維度列age字段類型改成REPLACE改成REPLACE后,當維度發生變更時,之前當數據也會被替換掉,從而保障數據準確性缺點:無法根據聚合字段創建索引,會降低一定的查詢性能。用戶idid年齡省份城市更新時間A未知廣東省廣州市10:30:20B18湖北省武漢市10:30:20用戶idid年齡省份城市更新時間A25廣東省廣州市
35、10:40:18年齡省份城市用戶數未知廣東省廣州市118湖北省武漢市1聚合后10分鐘后用戶填寫了年齡由未知變更為25結果表年齡省份城市用戶數25廣東省廣州市118湖北省武漢市1聚合后結果表省份城市用戶數廣東省廣州市1湖北省武漢市1按省份匯總查看Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Do
36、ris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023場景二:基于Doris Doris 的指標和標簽的計算業務場景指標的計算場景分鐘級指標;當天分小時指標;當天累計指標;沒有Doris Doris 之前,基于Flin
37、kSql FlinkSql 的計算方式基于Doris Doris 的解決方案采用AGGREGATE模型,通過bitmap計算uvDoris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asi
38、a 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023場景二:基于Doris Doris 的指標和標簽的計算基于Doris Doris 的解決方案用戶idid次數消費金額消費時間A110.010:30:20B120.010:31:18用戶idid次數金額首次消費時間最近消費時間A110.010:30:2010:30:20B120.0
39、10:31:1810:31:18聚合后用戶idid次數消費金額消費時間A115.010:35:25B130.010:36:20用戶idid次數金額首次消費時間最近消費時間A225.010:30:2010:35:25B250.010:31:1810:36:20聚合后業務場景標簽的計算場景統計型規則(注冊的用戶當天累計消費次數大于y次);序列型規則(A用戶注冊后x秒內發生了消費)混合型規則(A用戶注冊后x秒內發生了充值金額且累計大于y元)Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit As
40、ia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris S
41、ummit Asia 2023總結收益與展望4 4Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summi
42、t Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023業務場景場景特點與訴求面臨的問題DorisDoris的優勢運營標簽涉及實體:用戶和設備統計指標:人數,人次,金額,時間統計周期:近n分鐘,近n小時,近n天,歷史累計開發成本高Aggregate模型通過預聚合的方式,提前聚合好部分標簽結果,讓標簽開發簡單高效,相比之前平均開發效率提升30%,部分標簽提升60%以上實時報表維度靈活:通過各種維度組合看數據準確性
43、:數據準確性要求高多維度分析場景下,使用Clickhouse通常需要join維度表,查詢較慢,容易超時相比Clickhouse,Doris不僅支持大寬表模型,還支持星型和雪花模型,多維度分析場景下,支持分布式Join,報表95分位查詢性能相比之前平均提升35%以上;數據寫入有事務,保證了數據寫入的“不丟不重”,因之前CK導入數據重復問題造成的數據質量問題相比之前下降40%;算法推薦數據量大:通常使用的數據每天都在數億級任務性能瓶徑問題,時效性無法滿足業務MPP的架構,能夠支持算法場景下,復雜的Sql查詢,相同情況下,對比Clickhouse查詢性能平均提升25%。4.1 4.1 總結收益Dor
44、is Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia
45、 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 20234.2 4.2 展望未來深入 Doris 在 OLAP 分析場景下的應用及 Doris 的調優;引入數據湖相關技術,探索基于 Doris 在湖倉一體方面的應用;Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia
46、 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023獲取更多社區動態與最佳實踐Doris Summit 峰會官網:
47、doris- Doris Summit 峰會回放:https:/ Doris 官網:doris.apache.orgApache Doris GitHub: Doris 官方平臺:Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023Doris Summit Asia 2023