Architecting for Speed and Scale
Get consistency and realtime latency with a Data Integration Hub (DIH)

Agenda
- Introduction
- Generative AI is the least important AI.
- Concurrency matters!
- “Big Data” isn't a big deal.
- Containerization is complicated.
- Cloud is not king.
- Performance often doesn't matter.
- Real-time processing isn't what you think.
- ETL to ELT, now back to ETL again, aka Data Integration Hub
- Tips & takeaways

Introduction
28 years in data management software. O'Reilly author. IASA Architecture Fundamentals certified.
Hortonworks, Syncsort, Bloor Group, Actian, Pervasive, CSC, Data Junction, GridGain, Vertica
Can't seem to decide what I want to be when I grow up: Software Engineer, Support Tech, Tech Writer, Software Trainer, Consultant, Product and Technical Marketer, Product Manager.
Paige Roberts, Data Nerd. Actively looking for a new position!
I've spent way too many years of my life geeking out about enterprise data architecture.

Generative AI is the least important AI.
Right now.

Realms of Data Science
(Diagram labels: Data Science, Artificial Intelligence, Machine Learning)
Data Science: The use of algorithms to extract knowledge and insights from data in various forms. Some subfields: Statistics, Artificial Intelligence (AI), Computational Math.
Artificial Intelligence (AI): The simulation of intelligent human behavior for problem-solving and decision-making. Some subfields: Robotics, Natural Language Processing (NLP), Machine Learning (ML).
Machine Learning (ML): The process by which machines are taught to make calculated suggestions and/or predictions by examining large amounts of input data. Some subfields: Regression, Deep Learning, Reinforcement Learning, Clustering.

Realms of Data Science
(Diagram labels: Data Science, Artificial Intelligence, Natural Language Processing, Generative AI)
Natural Language Processing (NLP): The process by which machines are taught to translate code or commands into natural human language, or vice versa. Some subfields: Natural Language Queries, Generative AI.

Machine Learning
Companies that use ML consistently show a higher profit margin, more growth, etc. than their competitors who don't use ML.
Machine learning projects used to be notoriously hard to get into production. Now, Machine Learning Operations (MLOps), aka Model Ops, has solid best practices. Follow them.

Common ML/AI Use Cases
Industries: Energy and Utilities, Financial Services, Manufacturing and Supply Chain, Retail, Healthcare and Life Sciences, Travel and Hospitality
Use cases: power usage analytics, smart grid management, seismic data processing, energy demand and supply optimization, risk analytics and regulation, customer segmentation, credit evaluation, fraud detection, cybersecurity, aircraft scheduling, dynamic pricing, traffic congestion and accident management, predictive maintenance, demand forecasting, process optimization, telematics and AIOps, predictive and real-time inventory optimization, recommendation engines, targeted marketing, customer satisfaction, customer lifetime value optimization, real-time patient alerts and diagnostics, proactive health management, diagnostic image processing

Concurrency matters!

Actual POC
Problem with existing architecture
- Very popular application for finding jobs.
- More and more people are using it simultaneously. (Good problem to have, but still, a problem.)
- Current technology for analytics, Google BigQuery, is bogging down at high concurrency levels.
Restrictions
- Some queries are super complex. Need to handle those, too, in a reasonable response time, not put them constantly at the end of the line because they require more resources.
People don't line up to use software one at a time.
Concurrency is Hard. Test for It.

“Big Data” isn't a big deal.

Distributed Computing is Slow
And rarely needed.
https:/
- [...] fantastic low-cost storage for a lot of data.
- Serve as an excellent platform for data transformation/ETL.
- Deliver analytics for a minimal number of users (fewer than 10).
They cannot
- Provide robust advanced analytics across your business.
- Support hundreds or thousands of concurrent users.
- Achieve the performance you probably thought they could.
Photo by Vineet Nangia on Unsplash. Photo by Jakob Owens on Unsplash.

“My cluster is bigger than your cluster.”
Big clusters are the opposite of efficient.
- Higher infrastructure costs and higher TCO for you and your customers.
- More nodes = higher network bandwidth required.
- Scaling and network traffic will hit performance limits.

Spark: m5n.xlarge EC2 nodes
278 instances x 0.15 USD/hour x 730 hours in a month = $30,441.00 per month
$30,441.00 x 12 = $365,292.00 per year
VS
Vertica: r5n.xlarge EC2 nodes
9 instances x 0.188 USD/hour x 730 hours in a month = $1,235.16 per month
$1,235.16 x 12 = $14,821.92 per year

25X costs. 30X energy.

Data storage and analysis tech that scales up, not just out, is worth it.
- Aerospike
- Vertica (OpenText Analytics Database)
- Ocient
- DuckDB
- LanceDB

Disk is no longer slow. (SSD)
https:/
SSDs are no longer prohibitively expensive.
https:/

[...] is cool, but
- Great for interoperability, in many ways the holy grail of data storage.
- Can use multiple query engines; data can be used by multiple teams and technologies.
- It's distributed data storage, with all the inherent advantages and disadvantages.
- Best for long-term storage and active archive.
- NOT FOR REALTIME DATA PROCESSING!

Containerization is complicated.

Containerization Advantages
- Automatic software upgrades: Rolling upgrades mean no downtime while analytics software is upgraded.
- Deployment freedom: Move overnight from on-prem to cloud, from Amazon to Google to Alibaba. Enables multi-cloud. Portability to ANY environment that has Kubernetes: Ali cloud, boutique clouds, private clouds, clouds you never heard of. Deploy ANYWHERE.
- Disaster resiliency: Go down because your data center is underwater from a hurricane, come back up the next day exactly the same in another state, on a different environment.
- Make IT happy: Running everything containerized in a Kubernetes environment, with automated deployment, auto-scaling, etc., means a lot of flexibility without constant IT labor.
- JOB SECURITY! At least if you're a Kubernetes DevOps person.

Containerization Disadvantages
- It's hella complicated! Deploying something simple is no longer simple.
- Brittle: Every little tweak may break larger things.
- Slower: Containerization causes about a 10% performance hit.

What about autoscaling?
Is increasing compute automatically a good thing?
Auto-scaling is like fire. Powerful. And dangerous.
Source: https:/
[...] box autoscaling can be career-endingly disastrous.

Elastic autoscaling needs guardrails.
Guardrails are conditions you set ahead of time that limit autoscaling to only when you think it is necessary. Some cloud technologies offer this capability. REQUIRE IT.
- Example: Autoscale when concurrent users on this sub-cluster exceed 50.
And only as high as you can reasonably afford it to go. Limit compute to control how much you spend.
- Example: Only autoscale to a maximum of 12 compute nodes.
Autoscaling down can actually SAVE you money on the cloud.
- Example: Turn off nodes if idle for more than 10 minutes.

Cloud is not king
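
The three autoscaling guardrail examples from the slides (scale out only above 50 concurrent users, never go past 12 compute nodes, turn off nodes idle for more than 10 minutes) can be written down as a small decision function. This is a minimal sketch in Python, not any vendor's API: the names `Guardrails`, `desired_nodes`, and `max_monthly_spend`, the one-node-at-a-time step, and the one-node floor are all assumptions made for illustration. The 730 hours per month figure is the one the cost slide uses.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # same month length the cost comparison slide uses


@dataclass(frozen=True)
class Guardrails:
    """Hypothetical guardrail settings; defaults mirror the slide's examples."""
    scale_out_above_users: int = 50    # autoscale only when concurrency exceeds this
    max_nodes: int = 12                # hard ceiling on compute nodes
    min_nodes: int = 1                 # assumption: never scale to zero
    idle_minutes_before_off: int = 10  # turn a node off after this much idle time


def desired_nodes(current_nodes: int, concurrent_users: int,
                  idle_minutes: float, g: Guardrails = Guardrails()) -> int:
    """Return the node count the guardrails allow for the next interval."""
    if concurrent_users > g.scale_out_above_users:
        # Scale out one node at a time, never past the ceiling.
        return min(current_nodes + 1, g.max_nodes)
    if idle_minutes > g.idle_minutes_before_off:
        # Scale in when idle, never below the floor.
        return max(current_nodes - 1, g.min_nodes)
    return current_nodes


def max_monthly_spend(hourly_rate_usd: float, g: Guardrails = Guardrails()) -> float:
    """Worst-case monthly compute bill if the cluster sits at the ceiling all month."""
    return round(g.max_nodes * hourly_rate_usd * HOURS_PER_MONTH, 2)
```

With these defaults, `desired_nodes(4, 80, 0)` returns 5, and a cluster already at 12 nodes stays at 12 however many users arrive. `max_monthly_spend` turns the node ceiling into a budget figure you can agree on before enabling autoscaling.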
First Gen Cloud Drivers: Replace your data center
- Hardware depreciation (lease)
- Software purchase or depreciation
- Software maintenance
- Hardware maintenance
- Power consumption
- Data center floor space
- Provisioning time
- Cost of procurement
- Cost of team to maintain hardware
- Costs of provisioning for maximum capacity
- Moving data to handle new workloads
- Avoid paying for idle infrastructure
- Handle sudden or seasonal spikes in workload
- Scale rapidly for high growth businesses
- Use a single infrastructure for many small businesses, aka multi-tenancy
Cost Savings
Second Gen Cloud Drivers: Handle highly variable workloads
Scale up and down fast
(Diagram labels: Compute, Storage; Amazon EC2, Microsoft Azure, Google Cloud, Amazon S3)

Cost analysis: 3 analytics techs at large telco
- TD: $16.7M/year. On-prem, license amortized over 3 to 5 years, includes support, hardware, and storage.
- VAP: $2.5M/year. On-prem, license amortized over 3 to 5 years, includes support, hardware, and storage.
- SF: $42M/year. SaaS app on cloud.

BUT WHY?
Because cloud is cool. Everyone is going to the cloud.
FASHION IS THE WORST REASON TO ADOPT ANY TECHNOLOGY!
“For the same cost of cloud services we could buy a whole new set of servers every two months.” Yaacov Ben-Yaacov, CEO, CatchMedia

Vendor lock-in is still vendor lock-in
- Traditional data warehouse lock-in
- Cloud analytics deployment lock-in

While public cloud adoption grows, hybrid cloud is where most people are going.
78% of organizations are now using Infrastructure as a Service (IaaS) to some extent.
Top 5 areas where public cloud falls short of expectations: cost, security, purchase process, customer service, migration.
Private and public cloud.
Purchasing Decisions: A TOP CAPABILITY that has the greatest level of influence on data storage purchasing decisions is HYBRID CLOUD.
Rise of On-Prem Object Storage: A TOP REASON why hybrid cloud is practical is the rise of on-premises storage systems that are S3 compatible.

Deployment flexibility is key
Deploy Anywhere: On SSD, On Premise, In the Clouds, At the Edge, Kafka, Spark

Advantages of Cloud Computing
- Elasticity
- Fast growth
- Ease of use
- Reduced maintenance
- Multi-tenancy (lots of small companies in one data center)

Disadvantages of Cloud Computing
- Cost!
- Pricing transparency and predictability
- Lock-in, egress fees
- Regulatory compliance

Performance doesn't always matter.

Performance Doesn't Always Matter
Why are Cloud and Containerized Architectures so Popular?
Hint: It's not performance. The cloud has many advantages. Performance is not one of them.
Cloud, or Kubernetes, or any other style of containerized or virtual infrastructure always takes about a 10% hit in performance for overhead over a bare metal implementation.
Many times, this matters less than what comes out of the back end of this cow.

Nor is it cost savings due to “only using what you need,” which all the hype tells you is the reason to go to cloud. The cloud has many advantages. Cost savings on infrastructure or software are not among them.
Cloud implementations usually cost more than on prem, sometimes orders of magnitude more.

Performance is no longer a limiting factor for many purposes.
Even though containerized or virtual infrastructure takes about a 10% performance hit, and cloud implementations often cost more than on prem, companies are optimizing for labor costs and speed to work value.
- Salaries of IT experts are expensive.
- People don't want to wait to get their jobs done.
Fast enough is fast enough. Cheap enough is cheap enough.

Except when it does.
You only get one chance for realtime analytics.
Streaming data vs at-rest (Do you really need realtime analytics?)

Streaming Analytics Example: Predictive Maintenance
(Diagram labels: Relational Databases and Other Batch Data Sources (Oracle, SQL Server, MySQL); Historical Device Data in Shared Object Storage, DB that scales, or Data Lake; Data Transformation; Model Training; Exploration; Experimentation; Live Streaming Device Log Data; Alert Based on Predictive Models; Planned or Automated Maintenance; Reduced Downtime; Increased Customer Satisfaction)

“With 10 million ad auctions taking place every second, you can imagine the data volumes involved.” (over 15 PB) Ron Cormier, Principal Database Engineer

Data Lakehouse
(Diagram labels: Third-party data; Real-time bidding data; Contextual data, such as bidder data; EXTREMELY LOW LATENCY; LOW LATENCY; BATCH; Hot Fast Data; Big Stored Data; SQL; APPLICATION; ML APPLIED; ML TRAINING; BUSINESS INTELLIGENCE; REPORTING)
(Diagram labels: Primary Sub-Cluster: Data Ingest, ELT/Aggregation; Ephemeral Sub-Clusters: Ad Hoc BI Queries, Reporting, Machine Learning; Long-term Sub-Clusters: Reporting; NoSQL; Recent Data)

Hybrid-Multi-Cloud with Disaster Recovery on Prem
(Diagram labels: Primary and Secondary, each with a Primary Sub-Cluster for Data Ingest and ELT/Aggregation plus Ephemeral Sub-Clusters for Ad Hoc BI Queries, Reporting, and Machine Learning; On-Premises Disaster Recovery with a DR Sub-Cluster and the same ephemeral sub-clusters)

ETL to ELT to ETL again
AKA Data Integration Hub

Lambda and Kappa are NOT the only way to handle streaming and batch.
Data Integration Hub
https:/

Data Integration Hub advantages
- In-memory for realtime analytics: streaming data never rests.
- SSD for at-rest data still provides extreme data transformation speed.
- Streaming and batch data handled together, no need for two separate branches.
- Hardware, cloud, and deployment location independent: not dependent on hyperscaler cloud services.
- Ideal for hybrid deployments, fast transformation, long-term storage.

Data Integration Hub disadvantages
- Requires high RAM instances/nodes/hosts: a bit expensive.
- DON'T NEED IT IF YOU HAVE NO NEED OF REALTIME ANALYTICS! Don't just move to realtime because it's cool.

Final
tips & takeaways

Tips and Takeaways
Embrace disruptions, but:
o Don't jump too fast into unproven technology.
o Understand the capabilities you need before you embrace it.
o Consider open platforms and multiple technologies.
o Have a business-based use case and track the value.
Use analytics to innovate and differentiate.
Leverage disruptions to increase revenue, improve efficiency, compliance, and customer satisfaction, and reduce costs.
The future will bite you on the butt if you don't plan for change.
DOSOFU!

Q&A
I'm listening!
THANK YOU!
Paige Roberts
512-289-https:/ looking for a new position!

theTradeDesk scale: 15 PB
Massive Volume, Velocity and Variety
Ingest stats
- 600 billion rows daily (multiple tables)
- 30 TB compressed
- 150-300 TB uncompressed (estimated)
Largest reporting table
- Over 21 trillion rows
- Over 966 TB compressed
Reporting
- 40K report executions per day
- 2 x 640 node Vertica Eon clusters
- Half the cluster is shut down 18 hours per day

What are the Business Risks Around Cloud Migration?
- Compliance and Regulations
- Risk and cost of data movement
- Lock-in
- Productivity
- Bill Shock
- Existing investment
- Unpredictable pricing from “black boxes” lacking control for scalability and elasticity
- Storage and analytics options become very costly to migrate in future
- Can't move all the data to cloud due to compliance with existing and evolving regulations
- Query existing data sources, data warehouses, and data lakes (Hadoop, object storage)
- Freedom to continue to use existing tools and not be forced to switch to cloud-specific tools
- Use cases and economics
now increasingly mandate running analytics where data is born
- Cloud locked data
- Egress fees

Unified Analytics Platform aka Lakehouse
(Diagram labels: BUSINESS INTELLIGENCE; DATA SCIENCE; Model Evaluation and Management; STREAM PROCESSING; LOW LATENCY; BATCH; SQL. STREAMING DATA: application data, web clicks, logs, sensors, operational metrics, user tracking, geo-location. CONTEXTUAL DATA: files, weather, geo. TRANSACTIONAL DATA: application data, OLTP/ODS. Batch ETL or EL with T. ON-PREMISES, HYBRID, CLOUD OR MULTI-CLOUD: object storage, HDFS, GCP. COMPLETED IN WAREHOUSE. Isolated workloads: ingestion, ELT, data prep, enrichment, machine learning, reporting, ad hoc queries.)

Vertica Machine Learning Customer Success: EOITek
EOITek provides intelligent big data analysis to help customers realize the value of operational and maintenance data for operational optimization.
EOITek does AIOps (Artificially Intelligent IT optimization) in China, especially for financial services like large banks, credit cards like China Union Pay, and China East airline. Chinese financial regulations require on-prem data processing.

Without Vertica
- Hadoop, Spark, and Elasticsearch were bogging down from many of EOITek's financial customers analyzing over 15 TB per day of log and metric data.
- Wouldn't scale to many concurrent users.
- Couldn't keep up with customer demand for speed, for concurrency, and for more kinds of analysis.
With Vertica
- As an MPP database, dramatically improved the speed of data queries and analysis over the old system.
- More than 10:1 compression ratio saved data storage costs and accelerated query speed.
- Vertica's high speed Kafka connector provided fast parallel data ingestion.
- Vertica's ability to query existing Hadoop data in place meant no need to move or transform data before analysis; great for archived data.
- In-database machine learning is much faster to model than Apache Spark; it saves 80-90% in data reading time alone.
Action
- Combine events, business, performance metrics, configurations, application logs, and capacity data.
- Give insights on data relationships to help with event noise reduction, root cause analysis, anomaly detection, capacity prediction.
- Bring the analytics to life through visualizations: a knowledge graph to see data relationships, dashboards and reports, and a complete portrait of IT systems.
Result
- Accelerate response times 10X.
- Shorten client troubleshooting time from hours to minutes with AI and Vertica.
- Cut client hardware requirement from 50 servers to 10.

Problem: Spark analysis and machine learning performance and concurrency holding them back. Couldn't keep up with customer demand.
https:/
“Previously, we used Apache Spark. Using Apache Spark cost 80% of the time from data I/O, only 20% of the time on machine learning. Now, we switched to using Vertica in-database machine learning and saved 80-90% data reading time. So, it's had huge benefits in saving time and costs for model training.” Xiaobo Ge, CTO, EOITek

(Diagram labels: SQL; DATA SCIENCE; BUSINESS INTELLIGENCE; BATCH; LOW LATENCY; Distributed Analytics Database (Data Warehouse). Sources: EON TV service on/off and channel change data, contact center, geo/mapping, customer CX, operational, financial, service quality, TV schedule, CRM, ERP, billing. Extract, Load; Ingest, Transform, Data Prep; Transformation pushed down. Uses: real-time ad targeting, customer profile analysis for real-time apps, financial reporting, data-driven apps, reporting, content analytics, machine learning, ad hoc queries.)

(Diagram labels: SQL; DATA SCIENCE; BUSINESS INTELLIGENCE; BATCH; LOW LATENCY; Mass Storage (Data Lake); Flink; Amazon RDS; separate cluster for data ingest, ETL; Machine Learning. Sources: planting and harvest equipment; weather stations, probes, satellite imagery; application data, clickstreams; Bayer research trials; Climate research farms (CRF); Climate research partners (CRP); FieldView data; environmental data; platform partner data; sales data; marketing campaigns.)

Predictive Maintenance
Simplify business operations and create customer value
- Reactive Maintenance: System Problem, Customer Call, Dispatch, Onsite Troubleshooting, Parts Delivery, Repair or Replace, System Functional
- Predictive Maintenance: Remote Monitoring Predicts Potential Failure, Service Scheduled, Problem Avoided
(Diagram labels: STREAM PROCESSING; BUSINESS INTELLIGENCE; DATA SCIENCE; R&D Access; Remote Monitoring; Remote Service; LOW LATENCY; BATCH; SQL; SQL Server; Philips Remote Service Network; Mass Storage (Data Lake); Teradata, Salesforce, SAP data; CRM data; repair shop data; factory data)

Addressable Advertising requires unified analytics and AI & ML
The challenge: Telecommunications operators have a wealth of customer data and are in prime position to monetize that data with ad revenue, but will need to combine data from multiple sources and operationalize AI & ML to be successful.
United Group's VIDA Solution: United Group combines 50+ customer and device profile attributes to feed a model that fills ad blocks in real-time via their IPTV platform.

Simpli.fi
Vertica supports rapid data and company growth with flexible license structure and advanced analytics for faster decision making.
Challenge: Manage explosive data growth and expected company acceleration by expanding the Vertica footprint to increase transaction flexibility
Products and Services: Vertica
Result
- High performance columnar data processing
- 85% data storage reduction thanks to column compression
- Data granularity and transparency gives customers crucial insight
- Key data metrics enable faster decision making
(Diagram labels: LOW LATENCY; EXTREMELY LOW LATENCY; BATCH; SQL; Mass Storage (Data Lake); HDFS; third-party data; real-time bidding data; contextual data; PostgreSQL; DATA SCIENCE; BUSINESS INTELLIGENCE; Distributed Analytics Database (Data Warehouse))

(Diagram labels: SQL; Data Warehouse; transactional databases (MySQL, PostgreSQL); Cassandra key value database; Mass Storage (Data Lake); HDFS; DATA SCIENCE; BUSINESS INTELLIGENCE; BATCH; LOW LATENCY; applications data; clickstreams; location data; schema enforced; flattened, modeled tables; Data Manager/Manifest; Database Proxy. Ad hoc analytics: CityOps, data scientists, QueryBuilder (Uber created), DashBuilder (Uber created). Applications: ETL/modeling, CityOps, machine learning, experiments. Duplicate of most used data; infrequently used data; recent data.)

The Climate Corporation
Vertica supports highly sustainable and innovative farming with seamless integration and analysis for data-driven decision-making.
Challenge: Optimize resources and maximize yield with a database that performs at scale with sophisticated queries on large volumes of data
Products and Services: Vertica Analytics Platform
Result
- Fully data-driven decision-making with 100% automated processes
- Minimal maintenance in a highly available redundant cloud deployment
- Sophisticated use of machine learning and geospatial capabilities
(Diagram: the planting and harvest data diagram listed earlier, repeated with a Distributed Analytics Database (Data Warehouse) added.)

EXPEDITORS DATA ARCHITECTURE
Property of Expeditors International of Washington, Inc. and its subsidiaries. Reproduction by written authorization only.
(Diagram labels: ETMS; Ops; Real-time Store; Data Warehouse; BRDB; Operational Reporting; Real-time Loaders; ETL Batch Processing; Market Data; EDI; User entered data; Java/Smalltalk; IBM DB2; Vertica; Informatica)

EXPEDITORS DATA ARCHITECTURE IN PROGRESS
(Diagram labels: Ops; Real-time Processing; EDI; User entered data; Big Data Hub; Real-time Store; Tracking; Status Reporting; ETL Batch Processing; Data Warehouse; Data Store; Analytical Reporting; Reporting)

(Diagram labels: BUSINESS INTELLIGENCE; DATA SCIENCE; LOW LATENCY; BATCH; SQL; Mass Storage (Data Lake); HDFS; Cold Data; Hot Data; machine data; Anritsu internal; customer data; Call Detail Records; ML Application; Pre-Packaged Dashboards; Auto-Sync)
Vertica Unified Analytics Warehouse
(Diagram labels: Unified Analytics; DATA LAKE; DATA WAREHOUSE. Reporting/Business Intelligence: High Performance, High Concurrency, Reliability, Security, Governance. Machine Learning/Data Science: Unlimited Scale, Streaming Data, Semi-Structured Data (JSON, AVRO, ...), Complex Data Types (Maps, Structs, Arrays), Schema on Read. SQL, Python, Jupyter, R.)

Unified Analytics Warehouse
(Diagram labels: BUSINESS INTELLIGENCE; DATA SCIENCE; Model Evaluation and Management; STREAM PROCESSING; LOW LATENCY; BATCH. STREAMING DATA: application data, web clicks, logs, sensors, operational metrics, user tracking, geo-location. Files, weather, geo.)
(Diagram labels, continued: CONTEXTUAL DATA. TRANSACTIONAL DATA: application data, OLTP/ODS. Batch ETL or EL with T. ON-PREMISES, HYBRID, CLOUD OR MULTI-CLOUD: object storage, HDFS, GCP. SQL. COMPLETED IN WAREHOUSE. Isolated workloads: ingestion, ELT, data prep, machine learning, reporting, ad hoc queries.)

OEM software vendors need an embedded analytics software platform that can:
- Allow user-defined functions (UDx)
- Require minimal administration
- Manage huge data volumes
- Deliver fast analytics
- Embed machine learning
- Support data scientists
- Handle high user concurrency

Smarter analytics for any data anywhere with integration of data warehouse and data lake
- Understand your whole business by analyzing ALL data
- Go beyond BI. Predict the future with Machine Learning/AI
- Find data you need fast and keep up with data changes
- Get ML projects into production in minutes, not months
- Control costs, stay green with efficient infrastructure usage

Private Cloud: Flash-Optimized Data Lakes
- REAL-TIME RESPONSE: Massively-parallel architecture delivers concurrency and real-time access to all your data.
- BEST IN CLASS PERFORMANCE: Database queries run 3x faster when powered by Flash data.
- BUILT FOR SCALE: Elastically scale storage as the database grows from terabytes to multiple petabytes and beyond.
- RADICAL SAVINGS & EFFICIENCY: 2x better data center density and all-flash performance at 1/2 the cost.
(Diagram labels: Data Loading; High Concurrency Reporting and Analytics; Marketing Interface)

Ad Exchange Growing from Batch to Realtime

Growth at an Ad Exchange
Old architecture: data lake, batch
(Diagram labels: Regional Data Centers; Ingestion Clusters; Data Stores; Hadoop Cluster; Kubernetes Cluster; Client Logs Cluster; Client Logs API; Reporting; AWS)

Growth at an Ad Exchange
Cooperative architecture (lake house), faster batch
(Diagram labels: Regional Data Centers; Ingestion Clusters; Data Stores; Hadoop Cluster; Kubernetes Cluster; Client Logs Cluster; Client Logs API; Reporting; AWS; Data Warehouse; SQL; JDBC; API; UI; Delivery)

Ad Exchange Growing From Batch to Realtime
That's enough to store 12,000 4K movies!

Growth at an Ad Exchange
“We want to control our own data again.”
Lakehouse, streaming with Kafka, on prem
(Diagram labels: Regional Data Centers; Ingestion Clusters; Data Stores; Kafka Cluster; Processing Cluster; Client Logs API; Reporting; SQL; JDBC; API; UI; Data Warehouse; Kubernetes Cluster; Storage Cluster; BI Tool)
https:/

Simpler Diagram of a DIH
https:/
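
Since the DIH diagrams did not survive extraction, here is a toy sketch of the idea the slides describe: streaming and batch data go through one shared transformation path, recent data stays in memory for realtime questions, and everything lands in an at-rest tier. It is written in Python and is purely illustrative. The class `MiniHub`, its method names, the bounded in-memory deque, and the choice to also persist streamed rows are assumptions of this sketch, not how any particular DIH product works. A plain list stands in for SSD-backed storage.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List

Record = Dict[str, float]


class MiniHub:
    """Toy stand-in for a DIH: one transform path, a hot tier, an at-rest tier."""

    def __init__(self, transform: Callable[[Record], Record], hot_capacity: int = 1000):
        self.transform = transform
        # Hot tier: newest streamed rows, held in memory, oldest evicted first.
        self.hot: Deque[Record] = deque(maxlen=hot_capacity)
        # At-rest tier: a list standing in for SSD-backed storage.
        self.at_rest: List[Record] = []

    def ingest_stream(self, event: Record) -> Record:
        """Transform one streaming event and keep it hot for realtime queries."""
        row = self.transform(event)
        self.hot.append(row)
        self.at_rest.append(row)  # assumption: streamed rows are also persisted
        return row

    def ingest_batch(self, records: Iterable[Record]) -> int:
        """Run a batch through the same transform; rows go straight to at-rest."""
        rows = [self.transform(r) for r in records]
        self.at_rest.extend(rows)
        return len(rows)

    def recent_average(self, field: str) -> float:
        """Realtime-style query answered from the in-memory tier only."""
        values = [row[field] for row in self.hot if field in row]
        return sum(values) / len(values) if values else 0.0
```

The point of the sketch is the single `transform` shared by `ingest_stream` and `ingest_batch`: there is no second code branch for batch, which is the contrast the talk draws with Lambda-style designs. For example, a hub built with a Fahrenheit-to-Celsius transform applies the same conversion to a nightly file load and to each live sensor event.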