Session 10.5: How TiDB Achieves High-Frequency Updates and Fast Queries on Columnar Storage in an HTAP Architecture, by Wei Wan (韋萬), PingCAP. (32 slides)
Dive Deep Into TiDB's Columnar Storage Engine
Wei Wan, PingCAP

TiDB Introduction
- TiDB is an open-source NewSQL database that supports HTAP workloads.
- It is MySQL compatible and features horizontal scalability, strong consistency, and high availability.
- There is also a serverless free TiDB available for every developer. Get ready in 20 seconds!

Agenda
- How does TiDB handle HTAP workloads?
- Dive deep into Delta Tree, the columnar storage engine of TiDB
- Cloud-native evolution

How does TiDB handle HTAP workloads?
- TiKV: the row storage engine, powered by RocksDB, is for OLTP.
- TiFlash: the columnar storage engine, named Delta Tree, is for OLAP.
- TiKV synchronizes data updates to TiFlash in real time via the Raft protocol.
- Reads on TiKV and TiFlash are strongly consistent, with no delay; read consistency is guaranteed by the learner-read mechanism.
- The two storage engines work together to empower the HTAP ability: the optimizer utilizes both the column and the row storage.

Dive deep into Delta Tree, the columnar storage engine of TiDB
- Parquet is handy, isn't it enough? No!
  - We need real-time updates with high OPS.
  - We need MVCC, to support transactional snapshot isolation.
- (Figure: a typical write throughput of a TiFlash node.)

Columnar Storage Engine: MVCC
- select * from T with read_ts = 65
- Transform updates and deletes into upserts. MVCC ability: get!

The basic ideas of the Delta Tree storage engine
- Split the data by PK range into many Segments.
- Each Segment is a small LSM-tree with only 2 layers:
  - Delta layer, plus a write cache (i.e. a memtable)
  - Stable layer
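The upsert-based MVCC described above can be sketched in a few lines. This is an illustrative Python model, not TiFlash code; the function name `mvcc_read` and the record layout are assumptions. The point is that updates and deletes all become versioned upserts, and a read at `read_ts` keeps, for each key, the newest version at or below that timestamp, dropping keys whose newest visible version is a delete.

```python
def mvcc_read(rows, read_ts):
    """rows: iterable of (pk, version_ts, is_delete, value) upsert records.
    Returns the snapshot visible at read_ts as a {pk: value} dict."""
    latest = {}
    for pk, ts, is_delete, value in rows:
        if ts <= read_ts:  # versions after read_ts are invisible
            prev = latest.get(pk)
            if prev is None or ts > prev[0]:
                latest[pk] = (ts, is_delete, value)
    # Keys whose newest visible version is a delete are filtered out.
    return {pk: v for pk, (ts, d, v) in latest.items() if not d}

rows = [
    (1, 50, False, "a"),   # insert
    (1, 60, False, "a2"),  # update, stored as an upsert with a newer version
    (2, 55, False, "b"),
    (2, 64, True, None),   # delete, stored as an upsert with a delete flag
    (3, 70, False, "c"),   # invisible to a read at read_ts = 65
]
```

Running `mvcc_read(rows, 65)` yields only `{1: "a2"}`: key 2 was deleted at timestamp 64 and key 3 was written after the snapshot.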
The basic ideas of the Delta Tree storage engine (cont.)
- Segments are read in parallel, naturally.
- Fewer layers bring faster reads: there are fewer layers to sort-merge.
- Segments are compacted in parallel in separate ranges, which brings smaller write amplification.
- (Figure: a typical write amplification.)

The basic ideas of the Delta Tree storage engine (cont.)
- Column files in the delta layer: 64 KiB to 16 MiB, millions per node.
- Column files in the stable layer: 128 MiB to 1 GiB, thousands per node.
- Many metadata objects need to be persisted to disk: millions per node.
- Drawback: many small fragment column files in the delta layer. How do we store those column files?
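The two-layer read path described above can be sketched as a single two-way sort-merge, here in illustrative Python (the name `merge_two_layers` and the list-of-tuples layout are invented for the sketch, not taken from TiFlash): the small, newer delta layer is merged over the large, sorted stable layer, and on a PK collision the delta entry shadows the stable one. With only two layers per Segment, a read never needs a deeper multi-way merge.

```python
def merge_two_layers(stable, delta):
    """stable, delta: lists of (pk, value) sorted by pk.
    Delta entries are newer and shadow stable entries with the same pk."""
    out = []
    i = j = 0
    while i < len(stable) and j < len(delta):
        if stable[i][0] < delta[j][0]:
            out.append(stable[i]); i += 1
        elif stable[i][0] > delta[j][0]:
            out.append(delta[j]); j += 1
        else:
            out.append(delta[j]); i += 1; j += 1  # delta shadows stable
    out.extend(stable[i:])
    out.extend(delta[j:])
    return out
```

Because Segments cover disjoint PK ranges, many such merges can run in parallel, one per Segment.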
Storage structure of Delta Tree
- We persist everything in Delta Tree into PageStorage.
- Segments sit on top: all storage engine logic is implemented in that layer, with meta & cache in memory.
- PageStorage on local disk: all IO logic is implemented in this layer; all data is serialized into Pages and stored in PageStorage on disk.
- PageStorage is a local object storage.

Key features of PageStorage
1. A key-value storage that can store a large number of various-sized pages: millions of pages. The PageId (i.e. the key) is an int64 (binary keys will be supported later), and the value is a binary object (i.e. the Page).
2. Write batches: group several writes into one atomic write.
3. Snapshots: Delta Tree heavily depends on them to support snapshot reads.
4. Reference pages: just like hard links in a file system; used by Segment split and others.
5. External pages: the real page content is stored somewhere else (e.g. in a regular file) but is managed by PageStorage.
6. Low read/write latency, high read/write throughput.

Basic ideas of PageStorage
- Store the page meta in memory: key, file id, data offset, size, checksum.
  - Fast reads: at most one IO to read a page.
  - The memory consumption is limited.
- The WAL file only stores the page meta.
  - Easy to support multiple writable data files.
  - Fast to do compaction.
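The one-IO read path above can be sketched as a toy model, in illustrative Python (the class `TinyPageStore` and its layout are assumptions for the sketch, not the real PageStorage): page meta (offset, size, checksum) lives in an in-memory map, so reading a page costs one in-memory lookup plus a single seek-and-read on the data file.

```python
import os
import tempfile
import zlib

class TinyPageStore:
    """Toy model of PageStorage: page meta in memory, page data
    appended to a single on-disk data file."""
    def __init__(self, data_path):
        self.data_path = data_path
        self.meta = {}  # page_id -> (offset, size, checksum), held in memory
        open(self.data_path, "wb").close()

    def write(self, page_id, data):
        with open(self.data_path, "ab") as f:
            offset = f.tell()
            f.write(data)
        self.meta[page_id] = (offset, len(data), zlib.crc32(data))

    def read(self, page_id):
        offset, size, checksum = self.meta[page_id]  # memory lookup, no IO
        with open(self.data_path, "rb") as f:        # the single IO
            f.seek(offset)
            data = f.read(size)
        assert zlib.crc32(data) == checksum          # verify integrity
        return data

store = TinyPageStore(os.path.join(tempfile.mkdtemp(), "blob_0.dat"))
store.write(1, b"hello")
store.write(2, b"columnar")
```

A real implementation would persist the meta in a WAL before acknowledging the write, which is exactly what the next slide describes.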
- The WAL file stays small, and data files rarely need to be cleaned up.
- Write order: first write the page data to the data file, then commit the meta to the WAL file.
- On-disk layout: WAL file(s) plus page data file(s).

Basic ideas of PageStorage: Blob Store
- The Blob Store is where page data is actually stored.
- Blob Files store the pages' data; multiple disks are supported.
- A Space Map (a red-black tree) records the free space in the Blob Files.

PageStorage write process
1. Select a Blob File to write.
2. Find a suitable free space in the Space Map.
3. Write into the selected Blob File.
4. Write the metadata into the WAL and commit it to disk.
5. Update the metadata in the MVCC PageDirectory and commit it to memory.

PageStorage read process
- The PageDirectory stores all the info required to access pages.
1. Get a snapshot of the PageDirectory.
2. Collect the required info from the snapshot, including each page's file_id, offset, etc.
3. Do the reading on the Blob Files.

PageStorage: PageDirectory & GC
- The PageDirectory is an in-memory sorted map: the key is the PageId, and the value is the page's edit entry list.
- Update operations: 1. Put, 2. Delete, 3. Reference, 4. Put External.
- A version is attached to each update operation: a sequence (operation counter) and an epoch (GC counter).
- GC periodically removes useless entries: those with a smaller sequence than the smallest snapshot sequence.
- GC also moves page data out of Blob Files with a low use ratio into another Blob File; the epoch increases by 1 after each move.
- (Figure: keys k_a and k_b receive put/del edits; cur_max_sequence is 102, each write gets a new sequence, and the live read snapshots hold sequences 100 and 102. The next GC round removes only the entries older than 100, the smallest snapshot sequence.)
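The GC rule illustrated by that figure can be sketched as follows, in illustrative Python (the function `gc_entries` and the tuple layout are assumptions, not the real data structures): for one page's edit entry list, the newest entry at or below the oldest live snapshot sequence must be kept, because that snapshot can still read it; everything older is unreachable and can be dropped.

```python
def gc_entries(edit_list, snapshot_sequences):
    """edit_list: list of (sequence, op) sorted by sequence, for one PageId.
    snapshot_sequences: sequences held by live read snapshots.
    Returns the entries that must be kept after a GC round."""
    if not snapshot_sequences:
        return edit_list[-1:]  # no readers: only the newest entry matters
    oldest = min(snapshot_sequences)
    # Find the newest entry still visible to the oldest snapshot;
    # all entries before it can never be read again.
    keep_from = 0
    for i, (seq, _) in enumerate(edit_list):
        if seq <= oldest:
            keep_from = i
    return edit_list[keep_from:]

# Mirroring the slide's figure: snapshots at sequences 100 and 102.
edits = [(90, "put"), (95, "put"), (101, "del"), (102, "put")]
```

With snapshots `{100, 102}`, only the entry at sequence 90 is removed: the put at 95 is the version the snapshot at 100 still sees.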
Some tricks to improve the query performance of Delta Tree
- Use DeltaIndex to accelerate the scan speed: the scan speed of Delta Tree is 3x that of ClickHouse with SELECT FINAL.
- Data Sharing.

How does the combination of TP and AP work? TiDB HTAP performance in 6.x
- The average latency of AP queries in typical HTAP workloads decreased by 30% to 50% compared with 5.x.
- Better isolation between the OLTP and OLAP workloads (HATtrick Bench).
- Better scalability.

The next move: cloud-native evolution
What is the benefit of cloud-native?
- Scale fast, real fast.
- Higher availability.
- Pay as you go.

Current architecture vs. cloud-native architecture
- Current: TiFlash nodes with local disk / EBS.
- Cloud-native: Write Nodes with local caches and Read Nodes with local caches, on top of remote object storage (e.g. S3).

The evolution to cloud-native (coming soon)
- Segments are still persisted through PageStorage; we only need to change the storage location of PageStorage: local -> remote (S3).
- Scale fast, real fast: Read Nodes are stateless and scale instantly; Write Nodes only host a small amount of data and can be recovered fast.
- Higher availability: S3 provides better HA, and scaling fast also leads to better HA.
- Pay as you go: a compute pool (Read Nodes), plus storage nodes (Write Nodes) shared by multiple clusters.
- Cloud native != cloud only: both TiDB Cloud and on-premises clusters benefit from the cloud-native design.
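The claim that "we only need to change the storage location of PageStorage" can be sketched as a small backend interface, in illustrative Python (the interface `PageBackend` and both implementations are invented for this sketch; TiFlash's real abstraction differs, and the remote backend here only simulates an S3-like object store in memory): if all page IO goes through one narrow put/get interface, a local-disk backend and a remote object-store backend become interchangeable.

```python
from abc import ABC, abstractmethod

class PageBackend(ABC):
    """Narrow interface for page IO; the engine above it is unchanged."""
    @abstractmethod
    def put(self, page_id: int, data: bytes) -> None: ...
    @abstractmethod
    def get(self, page_id: int) -> bytes: ...

class LocalBackend(PageBackend):
    """Stands in for PageStorage on local disk (kept in memory here)."""
    def __init__(self):
        self._pages = {}
    def put(self, page_id, data):
        self._pages[page_id] = data
    def get(self, page_id):
        return self._pages[page_id]

class RemoteBackend(PageBackend):
    """Stands in for an S3-like object store: same interface,
    pages become objects under a per-page key."""
    def __init__(self, bucket):
        self.bucket = bucket   # illustrative bucket name only
        self._objects = {}     # simulated remote object store
    def put(self, page_id, data):
        self._objects[f"{self.bucket}/page/{page_id}"] = data
    def get(self, page_id):
        return self._objects[f"{self.bucket}/page/{page_id}"]
```

Under this framing, "local -> remote (S3)" is a configuration choice of which `PageBackend` to construct, while Segments and the rest of Delta Tree are untouched.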