SkyWalking BanyanDB: The Query Engine and Stream Processing of a Time-Series Database

Lu Jiajing (陸家靖)
Head of the Framework and Tooling Team, 收錢吧 (Shouqianba)
PhD in nuclear physics, Fudan University
Works on the observability platform, API gateway, and service governance platform
Apache SkyWalking PMC member


01 Observability and Time-Series Databases

The three pillars of observability: metrics, traces, and logs
[Figure labels: Metrics (aggregatable); Tracing (request-scoped); Logging (events); low volume to high volume; request-scoped metrics; request-scoped events; aggregatable events, e.g. rollups; request-scoped, aggregatable events]
* The three pillars of observability. Image source: Metrics, tracing, and logging, P. Bourgon. https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
Tracing & Logging: workflow-centric, distributed; causal
Metrics: statistical, aggregatable (rollups); temporal: fixed interval, compression

Time-series data structures: tags and fields
* How ClickHouse inspired us to build a high performance time series database. Aliaksandr Valialkin (valyala). https:/
Index: SeriesID (UInt64)
* Frame of Reference and Roaring Bitmaps, A. Grand. https://www.elastic.co/blog/frame-of-reference-and-roaring-bitmaps
hash(all Tags): InfluxDB, VictoriaMetrics, etc.
hash(partial TagValues): BanyanDB

Time-series data structures: the high-cardinality problem
How quickly does cardinality grow?
* What is High Cardinality, R. Skillington. https://chronosphere.io/learn/what-is-high-cardinality/
via Prometheus Recording Rules [1]
TimescaleDB: tiered B-tree, chunks [2]
VictoriaMetrics/VictoriaLogs: MergeSet [3], high-cardinality TSDB benchmarks [4]
InfluxDB IOx: columnar, built on Apache Arrow and Parquet [5]
BanyanDB (tailored for SkyWalking): partial tags for seriesID; compact seriesID (xxhash)
[1] to [5]: https:/ https:/ https:/ https:/ https:/

Time-series data structures: read/write patterns
Vertical write
Horizontal read
Insertions lookups
Old data is less likely to be …

Time-series storage: the RUM Conjecture
"We cannot design an access method for a storage system that is optimal in all the following three aspects - Reads, Updates, and, Memory."
* Designing Access Methods: The RUM Conjecture. M. Athanassoulis et al. Proc. 19th International Conference on Extending Database Technology (EDBT), March 15-18,

Time-series storage: LSM-tree + WiscKey (Badger)
Log, memtable, Sorted String Tables (SSTables)
Periodic compaction: write amplification 3x to 10x, read amplification 10x to 300x
WiscKey: key-value separation
    LSM-tree: (key, pointer)
    value-log
    online, lightweight garbage collection
* WiscKey: Separating Keys from Values in SSD-conscious Storage, L. Lu et al. FAST '16


02 Introduction to BanyanDB

BanyanDB data model
[Figure labels: Group; IndexRule A, B, C, D; Measure; Stream; TopNAggregation; Property; IndexRuleBinding]
Schemaless S…

BanyanDB storage structure
[Figure labels: Group 1 to Group 6; superdataset; Raft-based metadata (etcd); Storage Node; Series Metadata; Shard; Segment (2023-05-29, 2023-05-30); Block (2023-05-29 21:00, 2023-05-29 22:00); Global Index (e.g. Trace ID); Index block; Data block; 1:N]

BanyanDB data compression
Facebook Gorilla: (TS, Value)
Timestamp: fixed interval, hence a derived timestamp
Value: XOR
Compress big chunk (1M)
* Gorilla: a fast, scalable, in-memory time series database, T. Pelkonen et al. Proceedings of the VLDB Endowment, Volume 8, Issue 12, pp. 1816-1827
[Chart: compression ratio of observability data, encoded vs. raw, for trace, log, and metric; axis 0 to 16M. Labels: 30% (16 bytes to 5 bytes per data point), 13%, 10%]

BanyanDB query subsystem flow
[Figure labels: gRPC endpoint; liaison; Queue; steps 1 to 4; Measure/Stream Query; Metadata; Query Analyzer; SeriesID (Partition); index; Storage; KV E…]

BanyanDB Iterator model
Example query:
    SELECT service_id, service_instance_id, latency FROM service_instance_cpm WHERE zone = "Shanghai" ORDER BY latency DESC OFFSET 5 LIMIT 10
Query plan, from the root down:
    Limit 10
    Offset 5
    OrderBy latency DESC
    Projection(service_id, service_instance_id, latency)
    IndexScan(Shard 1) SortBy Timestamp
    IndexScan(Shard 2) SortBy Timestamp
    IndexScan(Shard 3) SortBy Timestamp
Iterator (interface, modular): Open(), Next() -> Item, Close()

BanyanDB query optimizer
Before optimizing: the plan shown for the Iterator model (OrderBy latency DESC over three IndexScans sorted by timestamp)
After optimizing:
    Limit 10
    Offset 5
    Projection(service_id, service_instance_id, latency)
    Merge Sort
    IndexScan(Shard 1) SortBy Index
    IndexScan(Shard 2) SortBy Index
    IndexScan(Shard 3) SortBy I…


03 TopN Stream Processing

TopN in SkyWalking
SkyWalking OAP: Endpoint CPM, Success Rate, L…

TopN in SkyWalking: the Elasticsearch implementation
* https://www.elastic.co/blog/found-elasticsearch-

Common TopN algorithms: Space-Saving
Given an error rate $\epsilon$, keep $1/\epsilon$ counters
Suppose $N$ incoming elements are processed from the stream $S$
Guarantee: all elements with frequency greater than $\epsilon N$ are guaranteed to be reported
Applications: Apache Kylin, citusdata/postgresql-topn
* Efficient Computation of Frequent and Top-k Elements in Data Streams, A. Metwally et al. Part of the Lecture Notes in Computer Science book series (LNCS, volume 3363)

Common TopN algorithms: Count-Min Sketch
* https:/
Given an error $\epsilon$ and a probability $\delta$, set $w = \lceil b/\epsilon \rceil$ and $d = \lceil \log_b(1/\delta) \rceil$, where $b$ is a constant
Hash collisions: $d$ pairwise-independent hash functions
Guarantee: with a probability of at least $1 - \delta$, the error is at most $\epsilon N$, where $N$ is the sum of all counts

Stream processing: which time?
Event time
Processing time ≠ event time
BanyanDB: use the EventTime of the source measure
* https://jet-start.sh/docs/4.3.1/concepts/event-time
* https:/
Use a tumbling window with the same interval as the source measure, e.g. 1 min, 1 hr
Keep N TopN entries for each group in a window
Keep M windows in …
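
The windowing scheme described above (event-time tumbling windows, N TopN entries per group per window, M windows retained) can be illustrated with a minimal Python sketch. The class name `TumblingTopN`, its method names, and the policy of dropping data older than every retained window are assumptions made for illustration, not BanyanDB's actual implementation. The sketch also assumes that each entity reports at most one value per window, which holds when the window interval matches the interval of the source measure.

```python
import heapq
from collections import defaultdict


class TumblingTopN:
    """Hypothetical pre-aggregator: top-N values per group per event-time window."""

    def __init__(self, interval_s, n, max_windows):
        self.interval_s = interval_s    # window length, same as the source measure interval
        self.n = n                      # N: entries kept per group in a window
        self.max_windows = max_windows  # M: number of windows retained
        self.windows = {}               # window start -> {group -> min-heap of (value, entity)}

    def window_start(self, event_ts):
        # Tumbling windows are aligned, non-overlapping buckets of event time.
        return event_ts - event_ts % self.interval_s

    def add(self, event_ts, group, entity, value):
        """Route one data point by its event time; return False if it is dropped."""
        start = self.window_start(event_ts)
        if start not in self.windows:
            full = len(self.windows) >= self.max_windows
            if full and start < min(self.windows):
                return False  # older than every retained window
            self.windows[start] = defaultdict(list)
            while len(self.windows) > self.max_windows:
                del self.windows[min(self.windows)]  # evict the oldest window
        heap = self.windows[start][group]
        if len(heap) < self.n:
            heapq.heappush(heap, (value, entity))
        elif (value, entity) > heap[0]:
            # The heap root is the smallest kept entry; replace it with a larger one.
            heapq.heapreplace(heap, (value, entity))
        return True

    def top(self, window_start, group):
        """Return the kept entries of one group in one window, largest first."""
        heap = self.windows.get(window_start, {}).get(group, [])
        return sorted(heap, reverse=True)


agg = TumblingTopN(interval_s=60, n=2, max_windows=2)
agg.add(5, "Shanghai", "svc-1", 10)
agg.add(20, "Shanghai", "svc-2", 30)
agg.add(59, "Shanghai", "svc-3", 20)
print(agg.top(0, "Shanghai"))  # [(30, 'svc-2'), (20, 'svc-3')]
```

A min-heap of size N keeps each insertion at O(log N), and a TopN query only has to read the retained entries of the matching windows instead of scanning the raw measure data.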

Stream processing: out-of-order events and watermarks
* https://jet-start.sh/docs/4.3.1/concepts/event-time
* https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/time/
Use the measure timestamp as the watermark: strictly monotonic
Keep M windows in memory to accept late measures within the allowed lateness
Flush at 40% of the …

Stream processing: final design
[Figure labels: gRPC endpoint; liaison; queue; Storage Node; Measure; Data Block; Index; steps 1, 2; Filter; Mapper; GroupBy; TumblingWindow; TopN Op.; TopN P…]

Stream processing: performance comparison
[Chart: TopN performance comparison, query time in seconds (axis 0 to 45) at cardinalities 5K, 10K, 25K, 50K, and 100K; FullScan vs. PreAggregation]
Memorize 1,000 TopN entries per bucket
Write measures with cardinality (5K, 10K, 25K, 50K, 100K) per minute
Query Top 10 with both FullScan and PreA…

BanyanDB query subsystem: roadmap
Merge Query Plan/Executor for TopN and Measure query (OSPP)
Add self-observability
Cluster-mode S…

Thanks