Beyond the Basics: Towards making Thanos production-ready
Benjamin Huo, Manager of the Observability Team
Junhao Zhang, Sr. SE of the Observability Team
KubeSphere of QingCloud Technologies

Agenda
- Introducing Thanos
- Introducing Whizard
- The production adoption of Whizard in KubeSphere
- Roadmap

Introducing Thanos

How do you customize & deploy Thanos?
Can I use kube-thanos?
- Maintained by Thanos maintainers
- Customize Thanos setup with jsonnet
- Deploy with yaml manifests
Cons:
- Jsonnet is developer friendly, not user friendly
- The sample deployment is not for production:
  - Only one instance each of query, router, ingester, and store, with one replica each
  - No compactor or queryFrontend
  - The hashring contains only the default soft tenant
- Too much effort is needed to make it production-ready

How do you customize & deploy Thanos?
What about the Thanos Helm Chart?
- Maintained by bitnami
- Customize Thanos setup with helm values
- Deploy with helm
Cons:
- Too many values to customize
- Thanos stateful components are scaled by CPU/Memory (HPA)
- No tenant-related settings
- Hashring is configured manually

What else do you need for production?
What's still missing?
- Create and maintain Thanos components and configs with ease, for example through CRDs
- Tenant configs can be simpler without configuring the hashring manually
- Support deploying Thanos to multiple K8s clusters
- Evaluate recording rules for each tenant and remote-write the resulting metrics back on a per-tenant basis

What else do you need for production?
What if you have tens or even hundreds of clusters?
- Using a single ingester to receive all the data from all the clusters is not feasible:
  - The Ingester isn't more powerful than a single instance of Prometheus
  - The Ingester has to be scalable to handle 10+ or 100+ clusters
- The Compactor, Store, and Ruler have to be scalable to handle 10+ or 100+ clusters too
- Automatically configure Thanos whenever a cluster is added or removed

Introducing Whizard

[Architecture diagram. Labels shown: VM/BM/Member or Edge Clusters; Edge Nodes; Agent (Prometheus); FluentBit or Vector; node & app metrics; Whizard Proxy; remote write (HTTP); Host Cluster; Whizard Service; Gateway; write metrics; Router; Ingester (tenant a,b,c); Ingester (tenant d,e,f); query metrics; Query; QueryFrontend; Store; Compactor (tenant a,b,c); Compactor (tenant d,e,f); Ruler (tenant a); Ruler (tenant b); query rules/alerts; push alerts; Alertmanager; Notification manager; object storage; Buckets; GRPC; Whizard Controller Manager; Tenant Controller; reconcile and autoscale; CRDs: Tenants, Service, Query, Stores, Router, Ingesters, Compactors, Gateway, QueryFrontend, Storages (storage for tenant), Rulers]

- All Thanos components are defined as CRDs
- Introduced new Whizard CRDs: Service/Gateway/Storage/Tenant
- Introduced Tenant & tenant-based autoscaling
- Introduced RuleGroup-based sharding for Ruler
- Introduced time-based sharding for Store
- Gateway & Agent Proxy mechanism for tenant read/write control
- Query optimization
- Security enhancement:
  - Basic Auth to access Thanos WebUI
  - TLS config for all Thanos components
- Service & components 2-tier config management

CRD definition of all components
Introduce CRDs to define all components:
- CRDs for Thanos:
  - Router
  - Ingester
  - Ruler
  - Query/QueryFrontend
  - Store
  - Compactor
- CRDs for Whizard:
  - Service
  - Tenant
  - Gateway
  - Storage

Service
- Default/global settings for all components:
  - TenantHeader/DefaultTenantId/TenantLabelName
  - Gateway
  - Router
  - Ingester
  - Query/QueryFrontend
  - Ruler
  - Compactor
  - Store
  - Storage
  - RemoteWrites/RemoteQuery

Tenant
- Defines a tenant for Thanos
- Tenant.status indicates the relevant resources for this tenant, including:
  - Ingester
  - Ruler
  - Compactor

Gateway
- Gateway for all read/write requests
- Supports TLS config
- Supports Basic Auth config
- Supports exposing Thanos WebUI with basic auth
- Supports exposing Thanos WebUI with OAuth2-Proxy

Storage
- Defines the object storage settings for Thanos:
  - Bucket
  - Endpoint
  - AccessKey
  - SecretKey

Router
- Defines the settings for Thanos Receive Router:
  - ReplicationFactor
  - Replicas

Ingester
- Defines the settings for Thanos Receive Ingester:
  - DefaultTenantsPerIngester
  - Tenants
  - LocalTsdbRetention
  - Replicas
  - DefaultIngesterRetentionPeriod

Ruler
- Defines the settings for a global or tenant Ruler:
  - Tenant: a Ruler can be global for all tenants or dedicated to one specific tenant
  - Shards: number of shards used to evaluate rules. Each shard is one ruler StatefulSet
  - EvaluationInterval
  - RulerQueryProxy: the Ruler queries the Gateway to fetch the data it evaluates
  - RulerWriteProxy: the Ruler writes the evaluated recording rules back to the Ingester with the tenant label

Compactor
- Defines the settings for the Compactor:
  - DefaultTenantsPerCompactor: number of tenants whose blocks can be compacted by this compactor
  - Tenants: specifies all the tenants whose blocks will be compacted by this compactor
  - Retention: how long to keep the blocks in S3
    - Retention1h
    - Retention5m
    - RetentionRaw
  - DisableDownsampling: whether to disable downsampling

Store
- Defines the settings for Store:
  - TimeRanges: defines the time range for each store shard
  - The example on the slide creates 2 store shards (StatefulSets):
    - One for data older than now-30d
    - Another for data between now-30d and now-36h

Query
- Defines the settings for Query:
  - PromqlEngine: which PromQL engine to use; defaults to the thanos engine, the fastest one
  - Stores: defines all the external query sources
- Query the Ingester for data from the most recent 36h

Tenant-based components auto-scaling
- Each Ingester can hold data for 3 tenants by default
- Each Compactor can compact blocks for 10 tenants by default
- There is one dedicated Ruler for each tenant to evaluate recording rules for that tenant

Use RuleGroup instead of PrometheusRule
- A PrometheusRule contains rules from multiple rule groups
- A RuleGroup only contains rules from one rule group:
  - Easier to manage and visualize
  - Easier to edit concurrently
- Avoid creating too many CRs:
  - Up to 40 rules in one RuleGroup
- Keep the rule evaluation concurrency:
  - Each RuleGroup is evaluated by one single goroutine
- RuleGroup for namespace rules
- ClusterRuleGroup for cluster rules
- GlobalRuleGroup for global, multi-cluster/tenant rules

What are we still expecting from Thanos?
- Better multi-tenant support and improved scalability of Ruler:
  - Issue: Implement multi-tenant Ruler: multitsdb and multiagent
  - Proposal: Enable Receiver to extract Tenant from a label present in incoming timeseries
  - PRs: receive/handler: implement tenant label splitting
  - More PRs to come and more tests are needed
- Support adding external labels to tenants:
  - Issue: Receive: Allow specifying tenant-specific external labels
  - Proposal: Allow statically specifying tenant-specific external labels in Receivers
  - What's still missing: dynamic external labels, i.e. specifying different external labels for every tenant
- Extract router and ingestor as separate components:
  - Issue: Receiver: Logically split router and ingestor mode
  - https:/

The production adoption of Whizard in KubeSphere

Roadmap
KubeSphere Whizard Observability Center = WhizardTelemetry Observability Platform:
- More observability signals will be added: Logging, Tracing, Events, Auditing, Notification
- More observability features will be supported: Events/Logs alerting, cost management, etc.
- OpenTelemetry support
- Observability powered by eBPF
- The combination of OpenTelemetry and eBPF
- LLM applications observability
- AI Infra observability

Whizard has now been open-sourced at KubeCon HK 2024!
https:/
Join the community on WeChat: [QR code not reproduced]
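
Sketch: tenant-based scaling arithmetic
The tenant-based auto-scaling slide gives two defaults: 3 tenants per Ingester and 10 tenants per Compactor. The slides do not describe how Whizard actually places tenants, so the following is only a minimal sketch that assumes tenants are packed in order into fixed-size groups. The function names `replicas_needed` and `pack_tenants` are hypothetical and are not part of Whizard.

```python
import math
from typing import List, Sequence


def replicas_needed(num_tenants: int, tenants_per_instance: int) -> int:
    """Number of instances required if each holds at most tenants_per_instance tenants."""
    if tenants_per_instance <= 0:
        raise ValueError("tenants_per_instance must be positive")
    if num_tenants < 0:
        raise ValueError("num_tenants must not be negative")
    return math.ceil(num_tenants / tenants_per_instance)


def pack_tenants(tenants: Sequence[str], tenants_per_instance: int) -> List[List[str]]:
    """Assumed placement policy: fill one instance completely before starting the next."""
    if tenants_per_instance <= 0:
        raise ValueError("tenants_per_instance must be positive")
    return [
        list(tenants[i:i + tenants_per_instance])
        for i in range(0, len(tenants), tenants_per_instance)
    ]


if __name__ == "__main__":
    tenants = ["a", "b", "c", "d", "e", "f"]
    # With the default of 3 tenants per Ingester, six tenants need two Ingesters,
    # matching the "Ingester (tenant a,b,c)" / "Ingester (tenant d,e,f)" boxes in the diagram.
    print(replicas_needed(len(tenants), 3), pack_tenants(tenants, 3))
    # With the default of 10 tenants per Compactor, 25 tenants need three Compactors.
    print(replicas_needed(25, 10))
```

Under this assumption, adding a seventh tenant would create a third Ingester, which is the kind of reconcile-and-autoscale step the slides attribute to the Whizard Controller Manager.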
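
Sketch: time-based sharding for Store
The Store and Query slides describe three time windows: the Ingester serves roughly the most recent 36h, one Store shard serves data between 36h and 30d old, and another Store shard serves data older than 30d. The sketch below only illustrates that partitioning. It assumes the comparison operators lost from the slide text, treats the boundaries as inclusive on the newer side, and uses hypothetical shard names (`ingester`, `store-recent`, `store-archive`). A real Thanos query can fan out to several sources and deduplicate, which this sketch does not model.

```python
from datetime import datetime, timedelta, timezone

# Window sizes taken from the slides; boundary handling is an assumption.
INGESTER_WINDOW = timedelta(hours=36)
RECENT_STORE_WINDOW = timedelta(days=30)


def component_for(sample_time: datetime, now: datetime) -> str:
    """Return the (hypothetical) component name whose time range covers sample_time."""
    age = now - sample_time
    if age < timedelta(0):
        raise ValueError("sample_time is in the future")
    if age <= INGESTER_WINDOW:
        return "ingester"
    if age <= RECENT_STORE_WINDOW:
        return "store-recent"
    return "store-archive"


if __name__ == "__main__":
    now = datetime(2024, 8, 21, tzinfo=timezone.utc)
    for delta in (timedelta(hours=1), timedelta(days=10), timedelta(days=60)):
        print(delta, "->", component_for(now - delta, now))
```

Splitting Store by time range keeps each shard's index footprint bounded, so shards holding older, rarely queried blocks can be sized differently from the shard holding recent blocks.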
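
Sketch: RuleGroup-based sharding for Ruler
The Ruler slide says that `Shards` sets the number of ruler StatefulSets and that each RuleGroup is evaluated by a single goroutine. The slides do not say how RuleGroups are assigned to shards. The sketch below swaps in a common technique, a stable hash of the RuleGroup name modulo the shard count, purely as an illustration. The function names `shard_for` and `partition_rule_groups` are hypothetical and are not Whizard's API.

```python
import zlib
from typing import List, Sequence


def shard_for(rule_group: str, shards: int) -> int:
    """Map a RuleGroup name to a shard index using a stable CRC32 hash (assumed policy)."""
    if shards <= 0:
        raise ValueError("shards must be positive")
    return zlib.crc32(rule_group.encode("utf-8")) % shards


def partition_rule_groups(rule_groups: Sequence[str], shards: int) -> List[List[str]]:
    """Group RuleGroup names by the shard that would evaluate them."""
    buckets: List[List[str]] = [[] for _ in range(shards)]
    for name in rule_groups:
        buckets[shard_for(name, shards)].append(name)
    return buckets


if __name__ == "__main__":
    groups = ["node-cpu", "node-memory", "pod-restarts", "apiserver-latency"]
    for index, bucket in enumerate(partition_rule_groups(groups, 2)):
        print(f"ruler shard {index}: {bucket}")
```

A stable hash keeps each RuleGroup on the same shard across reconciles as long as the shard count does not change. Changing `shards` reshuffles most groups, which a production implementation might want to avoid.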