Kyligence Cloud: Competitive Advantage Analysis of Cloud Data Lake Analytics
Ma Hongbin, Technical Partner and Chief Evangelist
Technology supports | Competitive advantage | Business problems

About Kyligence
- Founded in 2016 by the creators of Apache Kylin
- Leading open-source OLAP for big data
- 1500+ global open-source adoptions
- Successful enterprise cases with industry leaders
- 24*7 global SLA, ISO 9001, ISO 27001, SOC 2 Type 1
- Dual headquarters in San Jose, USA and Shanghai, China
- Venture-backed by top investors: Redpoint, Cisco, CBC, Shunwei, 8Roads (Fidelity arm), Coatue, SPDBI, CICC, Gopher, ASG, etc.
- Kyligence = Kylin + Intelligence
- Analytics on Data Lakes

3 Major Pains in a Self-Serving Data Lake
For multi-dimensional analytics scenarios
- Trust in data? There is a lot of data, but little trust. Does Sum(Dept.Revenue) match Total Revenue?
- Muddy lake? Everyone is creating their own data sets. Wide tables and ETLs are everywhere and ever growing. "Don't reuse that table. What if it gets changed?!"
- Cost? With 100x more users comes 100x IT cost. Cheaper is always better.

The Muddy Lake from an Internet Giant in China
When discovery, free-form, self-serving went to its extreme
[Diagram: DB, Event, Log -> ODS tables (5.7k) -> joins -> wide tables -> aggregation -> aggregated tables (1m wide tables in total) -> Data App, Report, BI/AI]
Two years of data construction since 2019 H2 ended in a muddy lake:
- Massive data expansion: 5.7k ODS tables grew into 1m tables
- Crazy lineage: table TX_ORDERS has 10k direct descendants
- Many duplicated ETL jobs and wasted computation
- No unified business semantics, no trust in data
Expectations:
- Save tens of millions of RMB every year once metrics are governed
- Improve analytics efficiency
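The trust question from the pains slide, whether Sum(Dept.Revenue) matches Total Revenue, can be made concrete with a small sketch. The orders, the department names, and the assumption that one order may be credited to more than one department are hypothetical and not taken from the talk; the sketch only shows how independently built per-team tables can disagree with a governed total.

```python
from collections import defaultdict

# Hypothetical orders: (order_id, revenue, departments credited with the order).
orders = [
    (1, 100, ["US Sales"]),
    (2, 200, ["China Sales", "China Marketing"]),  # credited to two departments
    (3, 50, ["China Marketing"]),
]


def total_revenue(rows):
    """Governed metric: every order is counted exactly once."""
    return sum(revenue for _, revenue, _ in rows)


def revenue_by_department(rows):
    """Per-team view: each department counts every order it is credited with."""
    result = defaultdict(int)
    for _, revenue, departments in rows:
        for dept in departments:
            result[dept] += revenue
    return dict(result)


by_dept = revenue_by_department(orders)
print("Sum(Dept.Revenue):", sum(by_dept.values()))
print("Total Revenue:", total_revenue(orders))
```

In this toy data the per-department sum is larger than the total because order 2 is counted twice, which is the kind of mismatch a single governed metric definition is meant to prevent.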
How Multi-Dimensional Data Model Helps
[Diagram: DB, DB, Log -> Dimensional Models & Metrics (Base & Derived Metrics) -> Data App, Report, BI/AI; Standardization; Governed Innovation]

How Multi-Dimensional Data Model Helps: Organize your data asset
[Diagram: 5.7k ODS tables (DB, DB, Log) -> Dimensional Models & Metrics (2k data models, 10k derived metrics) -> Data App, Report, BI/AI; Standardization; Governed Innovation]

Multi-Dimensional Data Model Benefits
- Saves cost by removing duplicated ETL
- Standardized single source of truth
- Yet the same level of self-serving business innovation
[Diagram: 95% precomputation, 5% MPP; metrics via SQL]

Kyligence Cloud: The Intelligent OLAP Platform
A Multi-Dimensional Database!
Kyligence Cloud (as an MD DB) aims to:
- Organize data assets
- Reduce time to insight, even for PB-scale data
- Reduce TCO and human effort
for data lake users.

Organize data asset | Multi-dimensional Data Model
- The killer feature that MPP databases do not have.
- Focus on the model, not the table.
- A more business-friendly data language for everyone, like SQL vs. Excel.
- Citizen analysts armed with the right data language will be the true source of business innovation.
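"Focus on Model, not Table" can be illustrated by describing metrics as data on top of one model instead of as separate wide tables. The following is a minimal sketch only: the class names, the model name `sales_model`, the column names, and the generated SQL are all hypothetical and do not represent Kyligence's actual API or metric syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseMetric:
    name: str
    expression: str  # aggregate expression over the model's fact table
    model: str       # hypothetical name of the multi-dimensional model


@dataclass(frozen=True)
class DerivedMetric:
    name: str
    base: BaseMetric
    dimensions: tuple = ()
    filters: tuple = ()

    def to_sql(self) -> str:
        """Render the derived metric as a query against the shared model."""
        columns = self.dimensions + (f"{self.base.expression} AS {self.name}",)
        sql = f"SELECT {', '.join(columns)} FROM {self.base.model}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        if self.dimensions:
            sql += " GROUP BY " + ", ".join(self.dimensions)
        return sql


# One base metric, reused by a derived metric (all names are illustrative).
revenue = BaseMetric("revenue", "SUM(lo_revenue)", "sales_model")
china_channel = DerivedMetric(
    name="china_channel_revenue",
    base=revenue,
    dimensions=("ch_name", "d_date"),
    filters=("c_region = 'China'",),
)
print(china_channel.to_sql())
```

Because every derived metric points back to the same base metric and model, a change to the revenue definition is made in one place rather than in each team's wide table.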
How MD DB Solves the Pains
[Diagram: DB, DB, Log -> MD DB (Base & Derived Metrics) -> Data App, Report, BI/AI; Standardization; Governed Innovation]
1. A central place where governance is enforced. Single source of truth.
2. The multi-dimensional model, a tidy box of wide tables, eliminates duplication.
3. Optimal precomputation and auto acceleration minimize cost per query.

Before: The Muddy Lake
Same star schema, multiple duplicated ETLs, "self-served" by separate teams.
Star schema: LINEORDER, CUSTOMER, CHANNEL
- US Sales Team: join LC -> Wide Table 1 (billion rows) -> AGG-RCD -> Agg. Table 1 (million rows) -> US direct sale analysis
- China Sales Team: join LCC -> Wide Table 2 (billion rows) -> AGG-RCCD -> Agg. Table 2 (million rows) -> China channel sale analysis
- China Marketing Team: join LCC -> Wide Table 3 (billion rows) -> AGG-RCCD -> Agg. Table 3 (million rows) -> 1993-01-01 China campaign analysis
- Global Finance Team: join LC -> Wide Table 4 (billion rows) -> AGG-RCD -> Agg. Table 4 (million rows) -> Global P&L analysis
Legend: R: Revenue, C: Customer, C: Channel, D: Date

After: A Tidy Multi-Dimensional Model
Star schema: LINEORDER, CUSTOMER, CHANNEL
Analyses served: US direct sale analysis, China channel sale analysis, 1993-01-01 China campaign analysis, Global P&L analysis
- Metrics perspective: 1 model, 2 derived metrics
- MD model perspective: 1 cube, 2 cuboids (join LCC; Index: RCD; Index: RCCD)
- Tables reduced: 8 -> 3
- Billion-level executions reduced: 4 -> 1

After: A Tidy Multi-Dimensional Model (continued)
In a real example, batch-creating 100 metrics resulted in just 4 models.
Star schema: LINEORDER, CUSTOMER, CHANNEL
Analyses served: US direct sale analysis, China channel sale analysis, 1993-01-01 China campaign analysis, Global P&L analysis
- Metrics perspective: 1 base metric, 3 derived metrics

Why Kylin/Kyligence
- A tidy box that auto-arranges the "wide tables" as they are created in the form of metrics, eliminating duplication.
- Optimal data preparation flow: combines duplicated ETLs and always precomputes from the cheapest upstream.
- Continuously recommends new, cheaper indexes according to metric usage. E.g., the system may suggest Index: RD because it is a hot hit from the global P&L analysis.
- Recommends removal of rarely used indexes.
- Solves the 3 major pains of using data!
- MD model perspective: 1 cube, 2 cuboids (join LCC; Index: RCD; Index: RD; Index: RCCD); tables reduced: 8 -> 3; billion-level executions reduced: 4 -> 1
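The ideas of precomputing "from the cheapest upstream" and answering queries from a matching index can be sketched as a small covering-set search. This is an illustrative model only, not Kylin's or Kyligence's actual planner: the `Index` class, the row counts, and the rule "pick the smallest index whose dimensions cover the request" are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Index:
    name: str
    dimensions: frozenset
    rows: int  # hypothetical row count, used as the cost estimate


def cheapest_cover(needed, candidates):
    """Return the smallest index whose dimensions cover `needed`, or None."""
    needed = frozenset(needed)
    covering = [ix for ix in candidates if needed <= ix.dimensions]
    return min(covering, key=lambda ix: ix.rows, default=None)


def route(query_dims, indexes):
    """Answer from a precomputed index when possible, otherwise fall back to MPP."""
    hit = cheapest_cover(query_dims, indexes)
    return hit.name if hit else "MPP"


# Indexes named after the slide; sizes are made up.
indexes = [
    Index("RCCD", frozenset({"customer", "channel", "date"}), 5_000_000),
    Index("RCD", frozenset({"customer", "date"}), 1_000_000),
]

# Building a new Index RD (revenue by date): the cheapest upstream is RCD,
# not the billion-row join.
upstream = cheapest_cover({"date"}, indexes)
print("build RD from:", upstream.name)

# Once RD exists, date-only queries are served from it.
indexes.append(Index("RD", frozenset({"date"}), 10_000))
print(route({"date"}, indexes))
print(route({"channel", "date"}, indexes))
print(route({"product"}, indexes))
```

The same function serves both purposes in this sketch: choosing the source for building a new index and choosing the index that answers a query, with the MPP path reserved for requests that no precomputed index covers.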
Reduce time to insight for even PB data | How Multi-Dimensional Data Model Helps
[Diagram: DB, DB, Log -> Dimensional Models & Metrics -> Data App, Report, BI/AI; Standardization; Governed Innovation; 95% precomputation (offline computation), 5% MPP (online computation); metrics via SQL]

Online Computation based on Apache Spark
- Optimizer
- Native Runtime
- Storage Cache

Optimizer regarding Pre-computing
- Storage is cheaper than CPU.
- Use storage to avoid duplicated computation.

Gluten Design
Reference: 1. https:/
[Diagram: native backends (Engine + Arrow Compute, Velox, FPGA/GPU/ASIC accelerators); columnar data; node tree; Substrait plans]

Gluten Fallback Processing
- Replaces the Spark physical plan with Transformers.
- Unsupported operators and expressions fall back to vanilla Spark.
- Several Transformers are combined into a WholestageTransformer to generate a whole-stage Substrait plan.
[Diagram: a plan of CondProjectTransformer, ShuffledHashJoinTransformer, UnsupportedOperator, AggregateTransformer, CondProjectTransformer, and CondProjectTransformer, grouped into WholestageTransformers. Validation fails: Spark computing, with C2R/R2C conversions. Validation succeeds: native computing.]
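The fallback behavior described above can be sketched on a toy, linear operator chain. This is not Gluten's code or API (Gluten runs inside Spark on the JVM): the `NATIVE_SUPPORTED` set, the string-based plan, and the step names are assumptions for illustration. C2R and R2C are taken to mean columnar-to-row and row-to-columnar conversions at the boundaries between native and vanilla Spark execution.

```python
from itertools import groupby

# Hypothetical set of operators that pass native validation.
NATIVE_SUPPORTED = {"CondProject", "ShuffledHashJoin", "Aggregate"}


def plan_with_fallback(operators):
    """Split a linear operator chain into native whole stages and Spark fallbacks.

    Consecutive supported operators are fused into one WholestageTransformer;
    unsupported operators run in vanilla Spark, with C2R/R2C conversions
    inserted wherever execution switches between the two.
    """
    steps = []
    previous_native = None
    for is_native, group in groupby(operators, key=lambda op: op in NATIVE_SUPPORTED):
        ops = list(group)
        if previous_native is not None:
            # Leaving native (columnar) needs C2R; re-entering native needs R2C.
            steps.append("C2R" if previous_native else "R2C")
        if is_native:
            steps.append("WholestageTransformer(" + ", ".join(ops) + ")")
        else:
            steps.extend(f"Spark({op})" for op in ops)
        previous_native = is_native
    return steps


plan = ["CondProject", "ShuffledHashJoin", "UnsupportedOperator", "Aggregate", "CondProject"]
for step in plan_with_fallback(plan):
    print(step)
```

Each fallback in this sketch costs two conversions, which hints at why broad operator coverage matters for a native runtime.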
Photon: Native Spark from Databricks

Storage Cache: RaptorX (Meta) hierarchical cache
- Metastore versioned cache: caches table metadata (schema, partition list, and partition information) to avoid time-consuming getPartitions calls to the Hive Metastore.
- File list cache: caches file lists in memory to avoid time-consuming listFile calls to remote storage.
- File desc & footer cache: caches file descriptors and footer information in memory to avoid time-consuming openFile calls to remote storage.
- Alluxio data cache (local cache): on every read, caches remote-storage data in its original format on local SSD.
- Fragment result cache: Presto workers running the leaf stage can decide to cache partial computation results on local SSD.
- Soft affinity scheduling: to maximize the workers' cache hit rate, the coordinator schedules requests for the same file to the same set of workers.
Reference: https://prestodb.io/blog/2021/02/04/raptorx
Our work: https://kylin.apache.org/blog/2021/10/21/Local-Cache-and-Soft-Affinity-Scheduling/
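Soft affinity scheduling, as summarized in the list above, can be sketched with a deterministic file-to-worker mapping plus a fallback when the preferred workers are busy. Rendezvous hashing is used here as a stand-in; the linked posts may use a different hashing scheme, and the function names, worker names, and `replicas` parameter are hypothetical.

```python
import hashlib


def preferred_workers(file_path, workers, replicas=2):
    """Rank workers for a file with rendezvous hashing and keep the top ones."""
    def score(worker):
        digest = hashlib.sha256(f"{worker}:{file_path}".encode()).hexdigest()
        return int(digest, 16)
    return sorted(workers, key=score, reverse=True)[:replicas]


def schedule(file_path, workers, busy=frozenset(), replicas=2):
    """Prefer a worker that likely has the file cached; otherwise take any free one."""
    for worker in preferred_workers(file_path, workers, replicas):
        if worker not in busy:
            return worker
    # "Soft" affinity: losing cache locality is better than waiting.
    for worker in workers:
        if worker not in busy:
            return worker
    raise RuntimeError("no free worker available")


workers = ["worker-1", "worker-2", "worker-3", "worker-4"]
path = "s3://bucket/table/part-0001.parquet"  # illustrative path
print(preferred_workers(path, workers))
print(schedule(path, workers))
```

Because the ranking depends only on the file path and the worker names, repeated requests for the same file land on the same workers, which is what keeps their local caches warm.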
Reduce TCO and human efforts
- Avoid unnecessary ETL jobs
- Automated scalability and elasticity
- Containerize & K8S scheduling

Automated Scalability and Elasticity on Cloud
- Storage/computation separation: compute on virtual machines, store in object storage
- Elastic scaling with spot instances
- Deployed in the customer VPC
- Read/write separation: the pre-computing workload (on-demand scaling) loads data and writes to object storage; the query workload (static resources) reads from object storage and serves SQL/MDX queries from analysis tools, reporting, and data products

Exploring data job scheduling on K8S
Why K8S?
- Co-located jobs raise utilization and save cost
- Scalability and operability
- Lower investment
- Storage-compute separation fits the data lake architecture

Key Takeaways
A Multi-Dimensional Database in a data lake can:
- Solve the "Muddy Lake" problem by organizing your data and ETL jobs well
- Reduce time to insight, even for PB-scale data
- Reduce TCO and human effort

Thank you for watching.