Ceph Q&A

In a little over a month, more than 1,500 people have viewed the SNIA Cloud Storage Technologies Initiative (CSTI) live webinar, “Ceph: The Linux of Storage Today,” with SNIA experts Vincent Hsu and Tushar Gohad. If you missed it, you can watch it on-demand at the SNIA Educational Library. The live audience was extremely engaged with our presenters, asking several interesting questions. As promised, Vincent and Tushar have answered them here.

Given the high level of interest in this topic, the CSTI is planning additional sessions on Ceph. Please follow us @SNIACloud or at SNIA LinkedIn for dates.

Q: How many snapshots can Ceph support per cluster? Q: Does Ceph provide deduplication? If so, is it across object, file, and block storage?

A: There is no per-cluster limit. In the Ceph filesystem (cephfs) it is possible to create snapshots on a per-path basis, and currently the configurable default limit is 100 snapshots per path. The Ceph block storage (rbd) does not impose limits on the number of snapshots. However, when using the native Linux kernel rbd client there is a limit of 510 snapshots per image.
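
To make the snapshot mechanics concrete, here is a minimal sketch (not from the webinar) using the Python rbd bindings; the pool name "rbd" and image name "vm-disk-1" are placeholders for an existing pool and image.

    # Take and list RBD snapshots with the Python "rbd" bindings.
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')
    image = rbd.Image(ioctx, 'vm-disk-1')
    try:
        image.create_snap('before-upgrade')      # take a point-in-time snapshot
        for snap in image.list_snaps():          # enumerate existing snapshots
            print(snap['name'], snap['size'])
    finally:
        image.close()
        ioctx.close()
        cluster.shutdown()

    # For CephFS, a snapshot is created by making a directory under ".snap",
    # e.g. mkdir /mnt/cephfs/data/.snap/daily-2024-06-01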

There is a Ceph project to support data deduplication, though it is not available yet.

Q: How easy is the installation setup? I heard Ceph is hard to set up.

A: Ceph used to be difficult to install; however, the Ceph deployment process has undergone many changes and improvements. In recent years, the experience has become very streamlined. The cephadm system was created to bootstrap and manage the Ceph cluster, and Ceph can now also be deployed and managed via a dashboard.
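
As a rough illustration of how streamlined the flow has become, the sketch below drives cephadm and the orchestrator from Python; the monitor IP and host names are placeholders.

    import subprocess

    def run(cmd):
        print('+', ' '.join(cmd))
        subprocess.run(cmd, check=True)

    # Bootstrap the first node; this also deploys the management dashboard.
    run(['cephadm', 'bootstrap', '--mon-ip', '10.0.0.10'])

    # Add more hosts and let the orchestrator create OSDs on free devices.
    run(['ceph', 'orch', 'host', 'add', 'node2', '10.0.0.11'])
    run(['ceph', 'orch', 'apply', 'osd', '--all-available-devices'])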

Q: Does Ceph provide a good user interface to monitor usage, performance, and other details when it is used as object-as-a-service across multiple tenants?

A: Currently the Ceph dashboard allows monitoring usage and performance at the cluster level and on a per-pool basis. This question falls under consumability, an area where many people contribute to the community. You will start seeing more of these management capabilities being added, giving a better view of utilization efficiency, multi-tenancy, and quality of service.

The more that Ceph becomes the substrate for cloud-native on-premises storage, the more these technologies will show up in the community. The Ceph dashboard has come a long way.
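
For scripted monitoring alongside the dashboard, a small sketch like the following pulls the same cluster- and per-pool usage figures from the CLI; the JSON field names are assumptions and can vary by Ceph release.

    import json
    import subprocess

    raw = subprocess.run(['ceph', 'df', 'detail', '--format', 'json'],
                         capture_output=True, check=True, text=True).stdout
    report = json.loads(raw)

    print('cluster bytes used:', report['stats']['total_used_bytes'])
    for pool in report['pools']:
        stats = pool['stats']
        print(pool['name'], stats.get('stored'), stats.get('percent_used'))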

Q: A slide mentioned support for tiered storage. Is tiered meant in the sense of caching (automatically managing performance/locality) or for storing data with explicitly different lifetimes/access patterns?

A: The slide mentioned the future support in Crimson for device tiering. That feature, for example, will allow storing data with different access patterns (and indeed lifetimes) on different devices. Access the full webinar presentation here.

 Q: Can you discuss any performance benchmarks or case studies demonstrating the benefits of using Ceph as the underlying storage infrastructure for AI workloads?   

A: AI workloads have multiple requirements for which Ceph is well suited:

  • Performance: Ceph can meet the high performance demands that AI workloads require. As a software-defined storage (SDS) solution, it can be deployed on different hardware to provide the necessary performance characteristics. It can scale out and provide more parallelism to adapt to increases in performance demand. A recent post by a Ceph community member showed a Ceph cluster performing at 1 TiB/s.
  • Scale-out: Ceph was built from the bottom up as a scale-out solution. As training and inferencing data grows, it is possible to grow the cluster to provide more capacity and more performance. Ceph can scale to thousands of nodes.
  • Durability: Training data sets can become very large, and it is important that the storage system itself takes care of data durability, as transferring the data in and out of the storage system can be prohibitive. Ceph employs different techniques such as data replication and erasure coding, as well as automatic healing and data redistribution, to ensure data durability.
  • Reliability: It is important that the storage system operates continuously, even as failures happen during training and inference processing. In a large system with thousands of storage devices, failures are the norm. Ceph was built from the ground up to avoid a single point of failure, and it can continue to operate and automatically recover when failures happen.

  • Object, block, and file support: Different AI applications require different types of storage. Ceph provides object, block, and file access (see the S3 sketch below).
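
As one hedged illustration of the object path, the sketch below shows an AI pipeline pulling a training shard from Ceph Object Storage (RGW) through its S3-compatible API with boto3; the endpoint, credentials, bucket, and object key are placeholders.

    import boto3

    s3 = boto3.client(
        's3',
        endpoint_url='http://rgw.example.com:8080',   # RGW endpoint (placeholder)
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
    )

    # Download one training shard from an RGW bucket to local scratch space.
    s3.download_file('training-data', 'shards/shard-0001.tar', '/tmp/shard-0001.tar')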

Q: Is it possible to geo-replicate a Ceph datastore? Having a few exabytes in a single data center seems a bit scary.

A: We know you don’t want all your eggs in one basket. Ceph can perform synchronous or asynchronous replication. Synchronous replication is used especially in a stretch-cluster context, where data can be spread across multiple data centers. Since Ceph is strongly consistent, stretch clusters are limited to deployments where the latency between the data centers is relatively low. In general, stretch clusters span shorter distances, i.e., not beyond 100-200 km; otherwise, the turnaround time would be too long.

For longer distances, people typically perform asynchronous replication between different Ceph clusters, and Ceph supports several geo-replication schemes. Ceph Object Storage (RGW) provides the ability to access data in multiple geographical regions and allows data to be synchronized between them. Ceph RBD provides asynchronous mirroring that enables replication of RBD images between different Ceph clusters. The Ceph filesystem provides similar capabilities, and improvements to this feature are being developed.
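
As a rough sketch of the RBD path, the commands below (wrapped in Python for illustration) enable snapshot-based mirroring for one pool and one image; the pool, image, and site names are placeholders, and the token emitted by the bootstrap step must still be imported on the peer cluster with "rbd mirror pool peer bootstrap import".

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    # On the primary cluster: enable per-image mirroring on the pool and
    # create a bootstrap token for the peer site.
    run(['rbd', 'mirror', 'pool', 'enable', 'mypool', 'image'])
    run(['rbd', 'mirror', 'pool', 'peer', 'bootstrap', 'create',
         '--site-name', 'site-a', 'mypool'])

    # Per image: enable snapshot-based mirroring.
    run(['rbd', 'mirror', 'image', 'enable', 'mypool/vm-disk-1', 'snapshot'])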

 Q: Is there any NVMe over HDD percentage capacity that has the best throughput?  For example, for 1PB of HDD, how much NVMe capacity is recommended? Also, can you please include a link to the Communities Vincent referenced?

A: Since NVMe provides superior performance to HDD, the more NVMe devices being used, the better the expected throughput. However, when factoring in cost and trying to get better cost/performance ratio, there are a few ways that Ceph can be configured to minimize the HDD performance penalties.

The Ceph documentation recommends that in a mixed spinning and solid drive setup, the OSD metadata should be put on the solid state drive, and it should be at least in the range of 1-4% of the size of the HDD.
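
As a back-of-the-envelope sketch of that guidance, the snippet below sizes a block.db partition at the upper end of the 1-4% range; the drive size and device paths are placeholders.

    hdd_bytes = 18 * 10**12                  # e.g. an 18 TB HDD
    db_bytes = int(hdd_bytes * 0.04)         # upper end of the 1-4% guidance
    print(f'suggested block.db size: ~{db_bytes / 10**9:.0f} GB')

    # Corresponding OSD creation on the host (data on HDD, metadata on NVMe):
    #   ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p1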

Ceph also allows you to create different storage pools built from different media types in order to accommodate different application needs. For example, applications that need higher IOPS and/or higher data throughput can be set to use a more expensive NVMe-based data pool.
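
A hedged sketch of that approach, using CRUSH device classes to pin a latency-sensitive pool to SSD/NVMe OSDs; the rule and pool names are placeholders.

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    # Replicated rule that only selects OSDs whose device class is "ssd"
    # (Ceph assigns classes such as hdd and ssd automatically).
    run(['ceph', 'osd', 'crush', 'rule', 'create-replicated',
         'fast-rule', 'default', 'host', 'ssd'])

    # Create a pool for the demanding application and bind it to that rule.
    run(['ceph', 'osd', 'pool', 'create', 'fast-pool', '64'])
    run(['ceph', 'osd', 'pool', 'set', 'fast-pool', 'crush_rule', 'fast-rule'])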

There is no hard rule. It depends on factors like what CPU you have. What we see today is that users tend to implement all-flash NVMe rather than hybrid configurations. They’ll implement all flash, even for object and block storage, to get consistent performance. Another scenario is using HDD for high-capacity object storage as a data repository.

The community and the Ceph documentation have best practices, known principles, and architecture guidelines for CPU-to-hard-drive and CPU-to-NVMe ratios.

The Ceph community is launching a user council to gather best practices from users; it covers two topics: performance and consumability.

If you are a user of Ceph, we strongly recommend you join the community and participate in user council discussions. https://ceph.io/en/community/

Q: Hardware RAID controllers made sense on systems with few CPU cores. Can any small RAID controller compete with the massive core densities and large memory banks of modern systems?

A: Ceph provides its own durability, so in most cases there is no need to also use a RAID controller. Ceph can provide durability leveraging data replication and/or erasure coding schemes.
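
As a sketch of the durability schemes that take the place of a RAID controller, the commands below (wrapped in Python for illustration) create a 3-way replicated pool and a 4+2 erasure-coded pool; the pool and profile names are placeholders.

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    # 3-way replicated pool: tolerates two failures at 3x raw capacity.
    run(['ceph', 'osd', 'pool', 'create', 'rep-pool', '64'])
    run(['ceph', 'osd', 'pool', 'set', 'rep-pool', 'size', '3'])

    # 4+2 erasure-coded pool: tolerates two host failures at 1.5x raw capacity.
    run(['ceph', 'osd', 'erasure-code-profile', 'set', 'ec42',
         'k=4', 'm=2', 'crush-failure-domain=host'])
    run(['ceph', 'osd', 'pool', 'create', 'ec-pool', '64', '64', 'erasure', 'ec42'])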

Q: I would like to know if there is a Docker version for Ceph? What is the simple usage of Ceph?

A: A full-fledged Ceph system requires multiple daemons to be managed, so a single container image is not the best fit. Ceph can be deployed on Kubernetes via Rook.

There have been different experimental upstream projects to allow running a simplified version of Ceph. These are not currently supported by the Ceph community.

Q: Does Ceph support Redfish/Swordfish APIs for management? Q: Was SPDK considered for low-level locking?

A: Yes, Ceph supports both Redfish and Swordfish APIs for management. Here are example technical user guide references:

https://docs.ceph.com/en/latest/hardware-monitoring/

https://www.snia.org/sites/default/files/technical_work/Swordfish/Swordfish_v1.0.6_UserGuide.pdf

To answer the second part of your question, SeaStar, which follows similar design principles to SPDK, is used as the asynchronous programming library, given that it is already in C++ and allows us to use a pluggable network and storage stack: a standard kernel/libc-based network stack or DPDK, io_uring or SPDK, etc. We are in discussions with the SeaStar community to see how SPDK can be natively enabled for storage access.

Q: Are there scale limitations on the ratio of MONs to OSDs? Wouldn’t there be issues with OSDs reporting back to MONs (epochs, maps, etc.) as the number of OSDs grows?

A: The issue of scaling the number of OSDs has been tested and addressed. In 2017, CERN reported successfully testing a Ceph cluster with over 10,000 OSDs. Nowadays, public Ceph telemetry regularly shows many active clusters in the range of 1,000-4,000 OSDs.

Q: I saw you have support for NVMe/TCP.  Are there any plans for adding NVMe/FC support?

A: There are no current plans to support NVMe/FC.

Q: What about fault tolerance? If we have one out of 24 nodes offline, how likely is data loss? How can the cluster avoid requests to down nodes?

A: There are two aspects to this question:

Data loss: Ceph has a reputation in the market for its very conservative approach to protecting data. Once failures approach a critical mass, Ceph will stop writes to the system.

Availability: This depends on how you configured it. For example, some users spread six copies of data across three data centers. If you lose a whole site, or multiple drives, the data is still available. It really depends on your protection design.

Data can be set to be replicated into different failure domains, in which case it can be guaranteed that, unless there are multiple failures across multiple domains, there is no data loss. The cluster marks and tracks nodes that are down and makes sure that all requests go to nodes that are available. Ceph replicates the data, and different schemes can be used to provide data durability. It depends on your configuration, but the design principle of Ceph is to make sure you don’t lose data. Let’s say you have 3-way replication: if you start to lose critical mass, Ceph will go into read-only mode and stop write operations so that the current state is not updated until you recover it.
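
That "stop writes" behavior is governed by per-pool settings; here is a minimal sketch, assuming a replicated pool named "mypool" (placeholder). With size=3 and min_size=2, a placement group keeps serving I/O with one copy missing but stops accepting writes once only a single copy remains.

    import subprocess

    subprocess.run(['ceph', 'osd', 'pool', 'set', 'mypool', 'size', '3'], check=True)      # copies to keep
    subprocess.run(['ceph', 'osd', 'pool', 'set', 'mypool', 'min_size', '2'], check=True)  # copies required to serve I/O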

Q: Can you comment on Ceph versus Vector Database?

A: Ceph is a unified storage system that can provide file, block, and object access. It does not provide the same capabilities that a vector database needs to provide. There are cases where a vector database can use Ceph as its underlying storage system.

 Q: Is there any kind of support for parallel I/O on Ceph?

A: Ceph natively performs parallel I/O. By default, it dispatches all operations directly to the OSDs, in parallel.
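
A minimal client-side sketch of that behavior, using librados’ asynchronous writes from the Python bindings; the pool name is a placeholder. CRUSH maps each object to its own placement group and OSDs, so the queued writes are serviced by many OSDs concurrently.

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('mypool')

    # Queue the writes without waiting; librados dispatches them in parallel.
    completions = [ioctx.aio_write_full(f'object-{i}', b'x' * 4096)
                   for i in range(64)]
    for c in completions:
        c.wait_for_complete()        # block until each write is acknowledged

    ioctx.close()
    cluster.shutdown()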

Q: Can you use Ceph with two AD domains? Let’s say we have a path /FS/share1/. Can you create two SMB shares for this path, one per domain, each with a different set of permissions?

A: Partial AD support has been recently added to upstream Ceph and will be available in future versions. Support for multiple ADs is being developed.

Q: Does Ceph provide shared storage similar to Gluster or something like EFS?  Also, does Ceph work best with many small files or large files?

A: Yes, Ceph provides shared file storage similar to EFS. There is no concrete answer as to whether many small files are better than large files; Ceph can handle either. In terms of “what is best,” most file storage today is not optimized for very tiny files. In general, many small files will use more metadata storage and gain less from certain prefetching optimizations. Ceph can comfortably handle large files, though this is not a binary answer. Over time, Ceph will continue to improve its granularity of file support.

Q: What type of storage is sitting behind the OSD design?  VMware SAN?

A: The OSD can use any raw block device, e.g., a JBOD drive. The assumption here is that every OSD, traditionally, is mapped to one disk. It could be a virtual disk, but it’s typically a physical disk. Think of a bunch of NVMe disks in a physical server, with one OSD handling one disk. But we can have namespaces, for example ZNS-type drives, that allow us to do physical partitioning based on the type of media and expose the disk as partitions; we could have one OSD per partition. Ceph provides functionality equivalent to a vSAN. Each Ceph OSD manages a physical drive or a subset of a drive.
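
To see that one-OSD-per-device mapping on a live cluster, a small sketch like the following reads the OSD metadata the cluster already tracks; the JSON field names are assumptions and can vary by release.

    import json
    import subprocess

    raw = subprocess.run(['ceph', 'osd', 'metadata', '--format', 'json'],
                         capture_output=True, check=True, text=True).stdout
    for osd in json.loads(raw):
        print(f"osd.{osd['id']}", osd.get('devices'), osd.get('bluestore_bdev_type'))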

Q: How can hardware RAID coexist with Ceph?

A: Ceph can use hardware RAID for its underlying storage; however, as Ceph manages its own durability, there is not necessarily an additional benefit to adding RAID in most cases. Doing so would duplicate the durability functions at the block level, reducing capacity and impacting performance. A lower-latency drive could perform better. Most people use 3-way replication or erasure coding. Another consideration is that you can run on any server instead of hard-coding for particular RAID adapters.

