Breaking Boundaries: Expanding Ceph's Capabilities with NVMe-oF
© 2023 SNIA. All Rights Reserved.
Orit Wasserman, Distinguished Engineer, ODF Lead Architect, Ceph NVMe-oF Architect, IBM
Kyle Bader, STSM, Principal Portfolio Architect, Ceph offerings, IBM

What Is Ceph?
The buzzwords:
- "Software defined storage"
- "Unified storage system"
- "Scalable distributed storage"
- "The future of storage"
- "The Linux of storage"
The substance:
- Ceph is open source software
- Runs on commodity hardware: commodity servers, IP networks, HDDs, SSDs, NVMe, NV-DIMMs, ...
- A single cluster can serve object, block, and file workloads

Ceph is Free and Open Source
- Freedom to use (free as in beer)
- Freedom to introspect, modify, and share (free as in speech)
- Freedom from vendor lock-in
- Freedom to innovate

Ceph is Reliable
- Reliable storage service out of unreliable components
- No single point of failure
- Data durability via replication or erasure coding
- No interruption of service from rolling upgrades, online expansion, etc.
- Favors consistency and correctness over performance

Ceph is Scalable
- Ceph is elastic storage infrastructure
- The storage cluster may grow or shrink
- Add or remove hardware while the system is online and under load
- Scale up with bigger, faster hardware
- Scale out within a single cluster for capacity and performance
- Federate multiple clusters across sites with asynchronous replication and disaster recovery capabilities

Ceph is a Unified Storage System
- RGW: S3 and Swift object storage (OBJECT)
- RBD: virtual block device (BLOCK)
- CEPHFS: distributed network file system (FILE)
- LIBRADOS: low-level storage API
- RADOS: reliable, elastic, distributed storage layer with replication and erasure coding

RADOS
- RADOS: Reliable Autonomic Distributed Object Storage
- Common storage layer underpinning the object, block, and file services
- Provides a low-level data object storage service
- Reliable and highly available
- Scalable (on day 1 and on day 1000)
- Manages all replication and/or erasure coding, data placement, rebalancing, repair, etc.
- Strong consistency: CP, not AP
- Simplifies the design and implementation of the higher layers (file, block, object)
LIBRADOS API
- Efficient key/value storage inside an object (OMAP)
- Atomic single-object transactions: update data, attrs, and keys together; atomic compare-and-swap
- Object-granularity snapshot infrastructure
- Partial overwrite of existing data
- Single-object compound atomic operations
- RADOS classes (stored procedures)
- Watch/Notify on an object
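To make a few of these primitives concrete, here is a minimal sketch using the Python librados bindings (the rados module). It builds one compound write operation that updates object data and OMAP keys atomically, sets an xattr, and reads the OMAP back. It assumes a reachable cluster, /etc/ceph/ceph.conf, and a pool named "rbd"; the pool, object, and key names are illustrative, not taken from the slides.

# Sketch: compound atomic operation and OMAP access via the Python librados bindings.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')            # IO context for one pool
    try:
        # One WriteOp is applied atomically: the object data and the OMAP
        # key/value either both become visible or neither does.
        with rados.WriteOpCtx() as op:
            op.write_full(b'hello rados')                      # object data
            ioctx.set_omap(op, ('owner',), (b'demo',))         # OMAP key/value
            ioctx.operate_write_op(op, 'demo-object')

        # Attributes are set with a separate call in this sketch.
        ioctx.set_xattr('demo-object', 'version', b'1')

        # Read the OMAP keys back.
        with rados.ReadOpCtx() as op:
            it, _ = ioctx.get_omap_vals(op, "", "", 10)
            ioctx.operate_read_op(op, 'demo-object')
            print(dict(it))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()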
RADOS Components
Monitor (ceph-mon)
- Central authority for authentication, data placement, policy
- Coordination point for all other cluster components
- Protects critical cluster state with Paxos
- 3, 5, or 7 per cluster
Manager (ceph-mgr)
- Aggregates real-time metrics (throughput, disk usage, etc.)
- Host for pluggable management functions
- 1 active, 1+ standby per cluster
OSD (Object Storage Daemon, ceph-osd)
- Stores data on an HDD or SSD
- Services client IO requests
- Cooperatively peers, replicates, and rebalances data
- 10s-1000s per cluster

Client/Cluster Architecture
- Application -> LIBRADOS -> RADOS cluster (monitors + OSDs)
- Smart request routing
- Flexible network addressing
- Same simple application API

RBD: Block Storage

RBD: RADOS Block Device
- Virtual block device
- Stores disk images in RADOS
- Stripes data across many objects in a pool
- Storage decoupled from host and hypervisor
- Analogous to AWS EBS
- Client implemented in KVM and Linux
- Integrated with Libvirt, OpenStack (Cinder, Nova, Glance), Kubernetes, Proxmox, CloudStack, Nebula
- Ceph iSCSI gateway: LIO stack + userspace tools to manage the gateway configuration
(Diagram: a VM on KVM/QEMU accesses an image through librbd and virtio-blk, while a Linux host uses krbd with XFS/EXT4; both are backed by an RBD pool in the RADOS cluster.)
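As a small illustration of the RBD layer described above, the sketch below uses the Python rbd bindings to create an image in a pool and write to it through librbd. It assumes /etc/ceph/ceph.conf and an existing pool named "rbd"; the image name is an illustrative assumption.

# Sketch: create and write an RBD image with the Python librbd bindings.
import rados
import rbd

with rados.Rados(conffile='/etc/ceph/ceph.conf') as cluster:
    with cluster.open_ioctx('rbd') as ioctx:
        rbd_inst = rbd.RBD()
        size_bytes = 4 * 1024**3                 # 4 GiB virtual block device
        rbd_inst.create(ioctx, 'vm-disk-1', size_bytes)

        # librbd stripes the image across many RADOS objects (4 MiB by default).
        with rbd.Image(ioctx, 'vm-disk-1') as image:
            image.write(b'\x00' * 4096, 0)       # write one 4 KiB block at offset 0
            info = image.stat()
            print(info['size'], info['num_objs'])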
NVMe over Fabrics
- Non-Volatile Memory Express (NVMe): fast PCIe-attached storage, local storage
- Extends NVMe efficiency and performance over network fabrics
- Eliminates unnecessary protocol translations
- Enables low-latency, high-IOPS remote NVMe storage
- TCP transport: well known and common, no special networking infrastructure requirements or constraints, ratified November 2018

Association Model
- A controller association maps each NVMe queue 1:1 to a TCP connection
- No controller-wide sequencing
- No controller-wide reassembly constraints
- No shared state across NVMe queues and TCP connections
- Connection binding is performed at NVMe-oF connect time (binding a queue to a controller)

Why not Ceph iSCSI?
- Legacy
- Performance: 30-70% higher throughput and IOPS, 30-40% reduced latency, 30-40% reduced CPU usage
- Scalability
Why NVMe-over-Fabrics?
RADOS Block Device (RBD):
- RADOS protocol: a distributed n-to-m protocol
- Reliable object access to sharded and replicated/erasure-coded storage
- So why do we need another protocol to access block storage in Ceph?
NVMe-over-Fabrics (NVMe-oF):
- Open, widely adopted industry standard
- Enables use cases where NVMe-oF is already part of the ecosystem
- Takes advantage of NVMe-oF offloading in DPUs

Ceph NVMe-oF Gateway

Overview
- The Ceph NVMe-oF gateway sits between NVMe-oF initiators (A, B, C) and the Ceph cluster, speaking NVMe-oF on the front end and RADOS on the back end
- Fabrics: TCP, RDMA, FC
(Diagram: Ceph cluster with MON/MGR and OSD nodes holding OSDs W-Z; the gateway connects initiators A, B, and C to RADOS.)
NVMe-oF and Ceph
- A namespace is mapped to an RBD image
- Subsystems are logical groupings of namespaces
- Each initiator gets a controller
(Diagram: the gateway's NVMe-oF target exposes subsystems M and N; controllers A, B, and C serve initiators A, B, and C; subsystem M's namespace 1 maps to RBD image P and namespace 2 to image Q, while subsystem N's namespace 1 maps to image S.)
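The mapping above (RBD image per namespace, namespaces grouped into subsystems, one controller per connected initiator) can be pictured with a small data model. This is purely an illustrative sketch that mirrors the diagram; it is not gateway code, and the NQN and host names are assumptions.

# Illustrative model of the NVMe-oF <-> Ceph mapping from the diagram.
from dataclasses import dataclass, field


@dataclass
class Namespace:
    nsid: int
    rbd_pool: str
    rbd_image: str          # the RBD image backing this namespace


@dataclass
class Subsystem:
    nqn: str
    namespaces: list = field(default_factory=list)
    controllers: dict = field(default_factory=dict)   # host NQN -> controller id

    def connect(self, host_nqn):
        """Each initiator that connects is handed its own controller."""
        return self.controllers.setdefault(host_nqn, len(self.controllers) + 1)


# Subsystem M from the diagram: namespace 1 -> image P, namespace 2 -> image Q.
subsys_m = Subsystem(nqn='nqn.2016-06.io.spdk:m')          # NQN is an assumption
subsys_m.namespaces += [Namespace(1, 'rbd', 'P'), Namespace(2, 'rbd', 'Q')]
for host in ('initiator-a', 'initiator-b', 'initiator-c'):
    print(host, '-> controller', subsys_m.connect(host))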
Gateway Control Plane
- A control daemon runs alongside the data path in the gateway
- An RPC client (e.g. the CLI) manages the gateway over gRPC, secured with mTLS
(Diagram: NVMe-oF target with subsystem M, controllers A and B, and namespaces 1 and 2 backed by RBD images P and Q; the control daemon configures this data path on behalf of the RPC client/CLI.)

Gateway Configuration
(Diagram: the same control and data path, with the gateway's configuration (Config) held alongside the control daemon.)

SPDK
- Storage Performance Development Kit (SPDK), https://spdk.io/
- Provides tools and libraries for writing high-performance, scalable, user-mode storage applications
- Userspace NVMe over Fabrics target
- Support for Ceph RBD with bdev_rbd
- Open source (BSD)

Data Path: SPDK
- The gateway's data path is the SPDK NVMe-oF target with an RBD bdev per namespace
- A listener (IP/port) accepts initiator connections; controllers A and B serve initiators A and B
- The control daemon drives SPDK through its RPC interface and is itself managed over gRPC/mTLS from the CLI
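To show what this SPDK-based data path looks like when configured by hand, the sketch below talks to SPDK's JSON-RPC socket directly from Python to create an RBD bdev and export it through the NVMe-oF/TCP target, roughly the kind of work the control daemon performs. The RPC method names (bdev_rbd_create, nvmf_create_transport, nvmf_create_subsystem, nvmf_subsystem_add_ns, nvmf_subsystem_add_listener) are standard SPDK RPCs; the socket path, NQN, pool/image names, and listener address are illustrative assumptions.

# Sketch: export an RBD image over NVMe-oF/TCP via SPDK's JSON-RPC socket,
# assuming a running SPDK nvmf_tgt with its RPC socket at /var/tmp/spdk.sock.
import json
import socket
from itertools import count

_ids = count(1)

def spdk_rpc(sock, method, params=None):
    """Send one JSON-RPC request and read the JSON response."""
    req = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        req["params"] = params
    sock.sendall(json.dumps(req).encode())
    buf = b""
    while True:                      # read until the response parses as JSON
        buf += sock.recv(4096)
        try:
            resp = json.loads(buf)
            break
        except json.JSONDecodeError:
            continue
    if "error" in resp:
        raise RuntimeError(resp["error"])
    return resp["result"]

with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
    s.connect("/var/tmp/spdk.sock")

    # 1. RBD bdev on top of librbd (pool "rbd", image "vm-disk-1" assumed).
    bdev = spdk_rpc(s, "bdev_rbd_create",
                    {"name": "Ceph0", "pool_name": "rbd",
                     "rbd_name": "vm-disk-1", "block_size": 4096})

    # 2. NVMe-oF/TCP transport and a subsystem with one namespace -> the bdev.
    spdk_rpc(s, "nvmf_create_transport", {"trtype": "TCP"})
    nqn = "nqn.2016-06.io.spdk:cnode1"
    spdk_rpc(s, "nvmf_create_subsystem",
             {"nqn": nqn, "allow_any_host": True, "serial_number": "SPDK0001"})
    spdk_rpc(s, "nvmf_subsystem_add_ns",
             {"nqn": nqn, "namespace": {"bdev_name": bdev}})

    # 3. Listener that the initiators will connect to.
    spdk_rpc(s, "nvmf_subsystem_add_listener",
             {"nqn": nqn,
              "listen_address": {"trtype": "TCP", "adrfam": "IPv4",
                                 "traddr": "192.168.0.10", "trsvcid": "4420"}})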
NVMe to RBD Commands Mapping
The RBD backend in SPDK maps NVMe operations to the RBD API:
- Natively supported: Read, Write, Unmap, Flush, Write zeroes, Compare and write
- Emulated: Compare, Copy, Abort*
(Diagram: a command CMD X from initiator B travels through subsystem M's controller and namespace to the RBD bdev, which translates it into RBD operations.)

Gateway Groups & Multi-pathing
- Gateways are deployed in a gateway group; both gateways expose the same subsystem M and namespace 1, backed by the same RBD image P
- The initiator sees Controller A (ANA optimized) on gateway 1 and Controller B (ANA non-optimized) on gateway 2, giving it an optimized path and a failover path

Gateway Groups: Load Balancing/Scaling
- Different initiators (A, B, C) can be directed to different gateways in the group, spreading load across the gateways while all of them serve the same RBD image

Planned/Future Features
Discovery
- Ceph/NVMe-oF discovery services run alongside the gateway group and share its configuration
- An initiator queries a discovery service and learns both paths to the subsystem: Path A via gateway 1 and Path B via gateway 2
(Diagram: gateway group with gateways 1 and 2 exposing subsystem M, Controller A (ANA optimized) and Controller B (ANA non-optimized) for namespace 1 / RBD image P; discovery services 1 and 2 advertise paths A and B.)

Authentication/Encryption: IPSec
- NVMe-oF traffic between the initiator and the gateway can be protected with an IPSec tunnel
- The listener is defined by IP/port/credentials
- Future: NVMe in-band authentication
Quality of Service (QoS)
Volume QoS limits using SPDK/bdev:
- I/Os per second
- MB/s
- Read MB/s
- Write MB/s
* Global QoS across the gateways in a group is not planned
(Diagram: gateway group with gateways 1 and 2, subsystem M, Controller A (ANA optimized) and Controller B (ANA non-optimized), namespace 1 backed by RBD image P.)
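The per-volume limits listed above correspond to SPDK's bdev_set_qos_limit RPC. A minimal sketch of the parameters, reusing the spdk_rpc() helper from the earlier data-path sketch; the bdev name and the numbers are illustrative assumptions.

# Sketch: per-bdev QoS limits matching the four knobs on the slide
# (IOPS, total MB/s, read MB/s, write MB/s). A value of 0 means "no limit".
qos_params = {
    "name": "Ceph0",              # the RBD-backed bdev created earlier
    "rw_ios_per_sec": 20000,      # I/Os per second
    "rw_mbytes_per_sec": 400,     # total MB/s
    "r_mbytes_per_sec": 0,        # read MB/s (0 = unlimited)
    "w_mbytes_per_sec": 200,      # write MB/s
}
# spdk_rpc(s, "bdev_set_qos_limit", qos_params)   # using the earlier helper/socket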
VMware vSphere / VAAI Support
- Use shared volumes to create a single storage pool
- VMware vSphere Storage APIs - Array Integration (VAAI): a set of storage primitives that enable storage offloading
- Atomic Test & Set (ATS): supported using the NVMe compare & write fused operation; compare & write is limited to the RBD object size or stripe size (alignment*) => only 4K / 1 block is required
- XCOPY (extended copy): the Copy NVMe command is supported in SPDK, but QoS is difficult
- Write same (zero): the Write zeroes NVMe command maps directly to an RBD operation
- Unmap (delete): supported as the dataset management command => discard in RBD
Adaptive Distributed NVMe-oF Namespaces (ADNN)
- Move the gateway group into the OSD nodes
- Hosts route each NVMe IO to the correct node with a hash function
- Dedicated gateways (and the extra hop) are eliminated
- Host overhead is much lower than librbd (more offload friendly)
(Diagram: a Ceph/NVMe-oF gateway runs on each OSD node; an NVMe-oF initiator with ADNN holds controllers A-D for namespace 1, one per node, all backed by RBD image P.)
For each NVMe IO:
- Convert the NVMe LBA to the image's object name
- Hash and mod the object name
- Look up the controller in the hash table
- Each NVMe IO command goes to the controller on the node whose OSD contains those LBAs
Hash hint for namespace 1:
- ns_bytes: 1073741824
- object_bytes: 4194304
- object_name_format: rbd_data.2254598e5d45f.%016x
- hash_fn: ceph_rjenkins
- hash_table: [0, 1]
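The per-IO routing that the hash hint enables can be sketched as follows: convert the NVMe LBA to the backing RADOS object name using object_name_format and object_bytes, hash the name, and take it modulo the hash table to pick the controller. The hint fields mirror the slide; the hash function here is a stand-in (Python's built-in hash), not the real ceph_rjenkins, and the block size is an assumption.

# Sketch of ADNN-style IO routing using the hash hint fields from the slide.
hint = {
    "ns_bytes": 1073741824,                    # 1 GiB namespace
    "object_bytes": 4194304,                   # 4 MiB RADOS objects
    "object_name_format": "rbd_data.2254598e5d45f.%016x",
    "hash_fn": "ceph_rjenkins",
    "hash_table": [0, 1],                      # controller ids, one per gateway/OSD node
}

BLOCK_SIZE = 4096                              # assumed namespace block size

def controller_for_lba(lba):
    byte_offset = lba * BLOCK_SIZE
    object_index = byte_offset // hint["object_bytes"]
    object_name = hint["object_name_format"] % object_index
    h = hash(object_name)                      # placeholder for ceph_rjenkins
    return hint["hash_table"][h % len(hint["hash_table"])]

# All IOs within the same 4 MiB object land on the same controller,
# i.e. on the gateway co-located with the OSD holding that object.
for lba in (0, 1024, 2048, 100000):
    print(lba, "-> controller", controller_for_lba(lba))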
Performance
Goal: as close as possible to non-gateway performance.
Reference: https://ci.spdk.io/download/2022-virtual-forum-prc/D2_4_Yue_A_Performance_Study_for_Ceph_NVMeoF_Gateway.pdf
Compared: a host running fio-librbd directly against the Ceph cluster vs. a host connecting over TCP to the Ceph/NVMe-oF gateway (SPDK NVMe-oF target with RBD bdev).

Performance: Setup
- Nodes: 2x Intel(R) Xeon(R) Gold 6258R CPU 2.70GHz (28 cores), 100 Gbit/s Mellanox ConnectX-5, Samsung PM1725a
- 3-node Ceph cluster: Pacific & Quincy with rbd_cache=FALSE

Performance: Improvements
- IO size = 16 KiB, total QD = 256
- NVMe-backed OSDs

Performance: Volume Scaling
- Use multiple Ceph client instances to improve performance in SPDK
- IO size = 16 KiB, total QD = 1024, SPDK core mask = 16 cores
- RAMDisk-backed OSDs
- TCP max throughput: 85 Gbit/s vs. 74 Gbit/s

Performance: 128 Volumes
- Use multiple Ceph client instances to improve performance in SPDK (see the sketch after this list)
- Check how core count affects performance: more cores, but lower IOPS
- IO size = 16 KiB, total QD = 256
- RAMDisk-backed OSDs
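The "multiple Ceph client instances" improvement corresponds to registering more than one RADOS cluster context in SPDK and spreading the RBD bdevs across them, so a single librados client connection does not become the bottleneck. A hedged sketch using the SPDK RPCs bdev_rbd_register_cluster and the cluster_name argument of bdev_rbd_create; the socket path, user, cluster count, and image names are illustrative assumptions.

# Sketch: spread RBD bdevs over several registered RADOS cluster contexts so
# that SPDK uses multiple Ceph client instances.
import json
import socket
from itertools import count

_ids = count(1)

def spdk_rpc(sock, method, params):
    """Send one JSON-RPC request to SPDK and read the JSON response."""
    sock.sendall(json.dumps({"jsonrpc": "2.0", "id": next(_ids),
                             "method": method, "params": params}).encode())
    buf = b""
    while True:                              # read until a full JSON response arrives
        buf += sock.recv(4096)
        try:
            return json.loads(buf)
        except json.JSONDecodeError:
            continue

with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
    s.connect("/var/tmp/spdk.sock")
    num_clusters = 4
    for i in range(num_clusters):            # one librados/librbd client instance each
        spdk_rpc(s, "bdev_rbd_register_cluster",
                 {"name": f"ceph_cluster_{i}", "user_id": "admin",
                  "config_file": "/etc/ceph/ceph.conf"})
    for i in range(8):                       # round-robin 8 volumes over the clusters
        spdk_rpc(s, "bdev_rbd_create",
                 {"name": f"Ceph{i}", "pool_name": "rbd",
                  "rbd_name": f"vm-disk-{i}", "block_size": 4096,
                  "cluster_name": f"ceph_cluster_{i % num_clusters}"})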
Acknowledgements
Jonas Pfefferle, Danny Harnik, Scott Peterson, Yue Zhu, Ernesto Puerta, Bharti Wadhwa, Ilya Dryomov, Josh Durgin, Sandy Kaur, Rebecca Cloe, Sanjeev Gupta, Brett Niver, Guifeng Tang, Mykola Golub, Congmin Yin, TJ Harris, Adam King, Redouane Kachach, Rahul Lepakshi, Aviv Caro, Alexander Indenbaum, Leonid Chernin, Gil Bergman, Barak Davidov, Roy Sahar.

IBM, the IBM logo, and other IBM trademarks listed on the IBM Trademarks List are trademarks or registered trademarks of IBM Corp., in the U.S. and/or other countries. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Red Hat, JBoss, OpenShift, Fedora, Hibernate, Ansible, CloudForms, RHCA, RHCE, RHCSA, Ceph, and Gluster are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the United States and other countries.

Thank You!

Join the Community
- https:/
- Slack channel: #nvmeof
- Weekly meeting: every Tuesday at 7am PT
- https://meet.jit.si/ceph-nvmeof

© 2023 SNIA. All Rights Reserved. Please take a moment to rate this session. Your feedback is important to us.