Survive in Cloud
The Zen of High Availability at Massive Scale in Cloud

Mobvista
No.1, 950M, 320M, 200+, TOP 10
Mintegral SDK DAU, China, Countries/Regions, world-wide, DMPs DAU
60B daily ad requests

All in Cloud
Publisher, RDS, Offer management, Online DMP, Kinesis, EMR, Redshift, *Big Data & ML, S3, CloudWatch, ES, Metrics & Alarm
SDK, API, Manual, Kinesis, S3, Lambda function, DynamoDB
Tracking Service: instances, Spot Fleet, Auto Scaling
ElastiCache, SQS
Volume Processing Service: instances, Spot Fleet, Auto Scaling
RTB, Advertiser

Cloud Computing
Quick Scaling, Low Cost, Highly Reliable
On-Demand, Rapid Elasticity, Pay per Use, Uncertain Downtime
Cloud Characteristics, Service Goals
Highly Available

Fault Oriented
"Once you accept that failures will happen, you have the ability to design your system's reaction to specific failures."

Isolated Design
Micro Kernel, plug-ins, Extension Points

Isolated Deployment
Ordering Service, Cart Service, Checkout Service, Payment Service, Fulfillment Service

Reused vs. Isolated
Reused logic structure vs. isolated physical structure
Critical Data Collector, Log Data Collector, Data Transform Service

Redundancy
Online Service, Standby Service, Load Balancer
Online Redundancy

Common Failure Modes

Propagated Failure
Load Balancer, QPS 1500, Max QPS 1000
Rate Limit

Cascading Failure
Client, Service, ServiceA, ServiceB, ServiceC, ServiceD, ServiceE
Circuit Breaker
Fallback

Slow Response
A quick rejection is better than a slow response.
Pooled resources are exhausted!

No Unlimited Waiting
Any blocking operation needs a time limit!

Recovery Oriented
"A priori prediction of all failure modes is not possible."

Health Check
Zombie process, pooled resources exhausted, deadlock

Recoverable
Say "NO" to monolithic systems
Stateless
Survive when dependent services crash
Quick restart

Let it Crash!
try catch(Throwable t)

Negotiate With Client
Server: "I am busy, please slow down."
Client: "Get back to me after one minute."

Chaos Engineering
"If something hurts, do it more often!"
Chaos under control

Chaos Engineering
You learn how to fix the things that often break.
You don't learn how to fix the things that rarely break.
Terminate host, Inject latency, Inject failure

Chaos Engineering
Set expected SLA, Inject failures, Measure services, Meet SLA? YES, Improve system

Chaos Engineering Principles
Build a Hypothesis around Steady State Behavior
Vary Real-world Events
Run Experiments in Production
Automate Experiments to Run Continuously
Minimize Blast Radius
http://principlesofchaos.org

Higher Resilience, Lower Cost
Cost, Scale
Spot Instance: microservice, stateless, quick restart, fault tolerance, chaos engineering
Reserved Instance, Spot Fleet, Auto Scaling
Fault and Recovery Oriented Architecture
Spot Instance

Multi-Clouds Ecosystem
Multi-Clouds Foundation, Cloud Connection
Mobvista Cloud Solution
Mobvista Cloud Platform: Spot Instance Mgr, Logging, Monitoring, CI/CD Pipeline, Auto Scaling, High Reliability, AWS API, Ali API, AWS CLI, Ali CLI, Alarm, Cost Optimization, Smart Load Balance, DevOps
Public Cloud Platform: Ali Cloud, AWS Cloud
Mobvista AI Platform: Big Data Platform, Machine Learning Platform, Service Decorator
https:/
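
The Propagated Failure slide pairs an offered load of QPS 1500 with a Max QPS of 1000 and names Rate Limit as the remedy. The slides do not say which algorithm is used, so the following is only a minimal sketch assuming a fixed one-second window; the class name `FixedWindowLimiter` and its parameters are hypothetical and chosen for illustration.

```python
import time


class FixedWindowLimiter:
    """Admit at most `max_per_window` requests per window; reject the rest.

    Hypothetical illustration: the slides only name "Rate Limit",
    not a specific algorithm.
    """

    def __init__(self, max_per_window, window_seconds=1.0, clock=time.monotonic):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.clock = clock              # injectable so the limiter can be tested
        self.window_start = clock()
        self.count = 0

    def allow(self):
        now = self.clock()
        # Start a fresh window once the current one has elapsed.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.count = 0
        if self.count < self.max_per_window:
            self.count += 1
            return True
        # Over the limit: reject immediately instead of queueing the request.
        return False
```

With a limit of 1000, offering 1500 requests inside one window admits the first 1000 and rejects the remaining 500, so the overload stops at the edge rather than propagating to the services behind the load balancer. A fixed window allows bursts at window boundaries; a token bucket or sliding window is a common alternative when that matters.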
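
The Cascading Failure slide places a Circuit Breaker between services and adds a Fallback. As a rough sketch of that idea (not the speaker's implementation), the hypothetical `CircuitBreaker` class below opens after a number of consecutive failures, answers from the fallback while open, and lets a single trial call through once an assumed `reset_timeout` has passed.

```python
import time


class CircuitBreaker:
    """Hypothetical sketch: fail fast to a fallback while a dependency is down."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before a trial call
        self.clock = clock                          # injectable for tests
        self.failures = 0
        self.opened_at = None                       # None means the breaker is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: do not touch the failing dependency at all.
                return fallback()
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the breaker.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, lambda: cached_profile)`, where both function names are placeholders. The point of the design is that a failing downstream service costs its callers one cheap fallback instead of a blocked thread.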
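
The Slow Response and No Unlimited Waiting slides argue that pooled resources get exhausted and that every blocking operation needs a time limit. One way to illustrate this, assuming a simple permit-based pool (the `BoundedPool` name and the default timeout are hypothetical), is to bound the wait for a permit and reject the request when the wait expires.

```python
import threading


class BoundedPool:
    """Hypothetical sketch: at most `size` concurrent users, bounded wait for a permit."""

    def __init__(self, size):
        self._permits = threading.BoundedSemaphore(size)

    def run(self, func, timeout=0.05):
        # Wait at most `timeout` seconds for a free permit, never forever.
        if not self._permits.acquire(timeout=timeout):
            raise TimeoutError("pool exhausted; rejecting instead of waiting")
        try:
            return func()
        finally:
            # Always return the permit, even if func raises.
            self._permits.release()
```

When all permits are taken, a caller gets a `TimeoutError` after the bounded wait and can turn it into a quick rejection, which matches the slide's claim that a quick rejection is better than a slow response. The same rule applies to socket reads, lock acquisition, and queue operations: pass an explicit timeout wherever the API offers one.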