Kubernetes Assimilation and the Need for Persistent Storage

Persistent storage for Kubernetes is readily available, but persistent storage that performs like local NVMe® flash? Not so much.


• Why stateful applications are gaining ground in the container space
• The types of stateful applications that benefit from Kubernetes orchestration
• How you can deploy high performance, low latency block storage in Kubernetes environments and get great performance, increased reliability and lower cost
• Containers are moving from a stateless microservice-oriented compute infrastructure to a stateful one – requiring persistent storage
• Why applications that are “shared nothing” and protect data themselves can still benefit from shared storage

Why stateful applications are gaining ground in the container space

Kubernetes seems to be assimilating everything. It started out as a platform for microservices, especially stateless ones, but it’s really taking over all sorts of applications. What’s beneficial about Kubernetes is its ability to scale services up, to scale them down, and to assure portability. So, to riff on the Borg from Star Trek, “persistence is futile!”

Kubernetes was designed for stateless applications, which don’t need to save their state when they go down. Stateful applications do need to save their state on a regular basis, and when they go down, they use that state to start up again. This applies to modern cloud-native applications, such as databases (DBs). Functionally speaking, what we want from databases is scalability, portability, and the ability to abstract the DB from the underlying hardware, so DBs align very well with the Kubernetes philosophy. And the notion of pods, of deploying services together, is powerful, especially with today’s cloud-native applications, which share the same functional goals: scaling by adding instances, hardware portability, and the flexibility to move applications. So, it seems only natural to deploy stateful applications under Kubernetes and get the same benefits that you have for stateless applications.

The types of stateful applications that benefit from Kubernetes orchestration

Applications that can benefit from Kubernetes orchestration are modern cloud-native applications that are made to scale and support web-based applications or service-oriented architectures (Figure 1). These can be traditional SQL databases such as MySQL and Postgres. These are essentially single-instance databases; they can replicate, but they don’t necessarily scale out like some NoSQL or clustered databases. Other types are NoSQL databases such as Cassandra and MongoDB, a document-based database; in-memory databases such as Redis and Apache Ignite; and processing frameworks such as Apache Spark and Apache Kafka.

And for all these applications, when deployed on bare metal, the recommendation is to have local flash at minimum for best performance, and NVMe® is what you really want for the lowest latency and highest bandwidth. These applications all benefit from low latency, and occasionally they need high bandwidth for either reads or writes. Why does an in-memory database need low-latency storage? Because not everything always fits in memory. Any memory miss goes to storage, and when it does, you want that access to be as fast as possible. Also, at startup these applications read in lots of data, and at shutdown they save their state by dumping the contents of RAM to a storage device. The recommendation is that this storage be local.

So, what do you do in a Kubernetes environment? There is local persistent volume functionality, introduced in Kubernetes 1.14, which means you can have persistent storage in Kubernetes using drives local to the Kubernetes server. You could install SSDs in each of the Kubernetes application servers running your various workloads, as in Figure 2.

The persistent storage is in the physical server, which delivers great performance; pods and applications can shut down and start up, and performance stays great. But this architecture breaks the philosophy of Kubernetes and of the applications. When you use local persistent volume functionality, Kubernetes will only schedule the pod that uses that persistent storage onto that one physical server. So you’ve lost application portability: you can’t “move” off that physical server. And you’ve introduced the problem of flash allocation, especially if you don’t put drives in every single server; you must start tracking which servers have flash and which don’t.
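As a sketch, this is what a local PersistentVolume looks like in Kubernetes; the nodeAffinity clause is exactly what ties the volume, and therefore any pod claiming it, to one physical server. The node name, device path, and class name below are illustrative examples, not required values:

```yaml
# Illustrative local PersistentVolume; node name and path are examples
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-nvme-pv
spec:
  capacity:
    storage: 800Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-nvme
  local:
    path: /mnt/disks/nvme0n1
  nodeAffinity:            # required for local volumes: pins the PV to one node
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - k8s-worker-03
```

Any pod bound to this volume can only ever be scheduled on k8s-worker-03, which is the portability loss described above.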

The architecture also results in poor flash utilization. You could add an 8TB drive to every single Kubernetes server, but you may only need it part of the time. If you add local flash to every application server or Kubernetes server, 50 to 85% of that flash ends up wasted; capacity utilization often runs around only 25%, and can be as low as 15% (Figure 3). It’s also very common for a single application server to need only 150,000 – 300,000 IOPS, while a single NVMe® drive can supply 600,000 – 800,000 IOPS, underutilizing the performance of the drive.
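A back-of-the-envelope sketch makes the stranding concrete. The numbers below are the illustrative ranges cited above, not measurements:

```python
# Rough utilization math for one local NVMe drive per Kubernetes server,
# using the figures cited above (illustrative, not measured).
drive_iops = 700_000          # a single NVMe drive: ~600K-800K IOPS
app_iops_needed = 300_000     # a busy application server: ~150K-300K IOPS
perf_utilization = app_iops_needed / drive_iops

drive_capacity_tb = 8.0       # one 8TB drive per server
capacity_utilization = 0.25   # often ~25%, sometimes as low as 15%
stranded_tb = drive_capacity_tb * (1 - capacity_utilization)

print(f"IOPS utilization: {perf_utilization:.0%}")          # ~43% even at peak demand
print(f"Stranded capacity per server: {stranded_tb:.1f} TB")
```

Even a server at the top of the demand range leaves more than half the drive’s performance, and three quarters of its capacity, idle.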

When you’re underutilizing capacity and performance across the fleet, the tendency is to want to share. It’s very common in Kubernetes to use a persistent storage solution that comes over the network. Lots of companies have CSI drivers readily available, and there are lots of persistent storage solutions. You may want to match your scalable Kubernetes orchestration environment with a scalable storage environment like Ceph and deploy it inside Kubernetes containers. iSCSI is also popular and works fine. There are various file-system-based products for Kubernetes environments, and there are open source and proprietary block-based products, but it’s extremely rare for them to perform at the level of local flash, and the ones that do tend to offer no data services or data protection. If they do offer them, they might require proprietary hardware, or worse, proprietary drivers. Proprietary block drivers are often tied to the kernel, or they require RDMA networking, which means you may have RDMA NICs in some hosts and not in others. That really starts to break the simplicity and portability philosophy of Kubernetes clusters.

Red Hat performed tests over a year ago using Ceph in a Kubernetes environment on some very powerful servers with Intel Optane and NAND flash, and the best read latency achieved was 3,000 microseconds. To put that in perspective, for a local flash drive you want read latency under 200 microseconds. Ceph in a Kubernetes environment can be an order of magnitude slower than local flash (Figure 4). Remember, the cloud-native applications mentioned in this blog recommend local flash for best performance.

Containers are moving from a stateless microservice-oriented compute infrastructure to a stateful one – requiring persistent storage.

So why is the title of this blog “persistence is futile”? Because persistence by itself is not enough to serve a Kubernetes environment well. With all the solutions available today, why is persistence futile? Because choosing persistence with the existing solutions means making severe compromises, and what you really want is to not have to compromise on your storage solution. When using local persistent volumes, you get the performance, but you lose a lot of the benefits of Kubernetes orchestration, specifically hardware portability and the flexibility to move applications. Using a network persistent storage solution is also not recommended because:

• They are too slow. Performance is in the millisecond range, and very few are below 1,000 microseconds.
• They break the Kubernetes philosophy.
• They might have proprietary block drivers, which makes upgrades very difficult.
• They might require special NICs, meaning some of your hardware fleet is different from the rest.
• You may have to make special changes to your switches and to the way you do network management.
• Some have no data services, so they require the application to do all of its own data protection. And, as mentioned, you may choose single-instance databases such as MySQL or Postgres; maybe you don’t want to use replication because it really slows things down, so you’d like the storage to provide the protection for you.

There are solutions where you don’t have to choose between centralized storage with slow performance and local persistent volumes that are really fast but lose a lot of the benefits of Kubernetes.

How you can deploy high performance, low latency block storage in Kubernetes environments and get great performance, increased reliability and lower cost

Lightbits Labs LightOS delivers persistent storage for Kubernetes without compromise. LightOS is all about pushing the boundaries of persistent storage (Figure 5). It is very flexible and lowers TCO, but gives you very high performance at the same time. And you can stretch the limits of all three, depending on how you deploy LightOS, because it gives you the power of choice.

LightOS is high-performance software-defined storage for Kubernetes. LightOS resides on the target servers, depicted on the right side of the diagram [Figure 6], and the NVMe® SSDs are inside the target servers. This all runs on standard commodity hardware of your choosing. Run LightOS on top of your favorite base OS, such as Red Hat, CentOS, Ubuntu, SUSE, or openSUSE, and you get a very high-performance, redundant, scalable solution for your Kubernetes systems. It runs with standard NICs: 1Gbit, 10Gbit, 25Gbit, or 100Gbit, whatever you want. And because LightOS is standardized on TCP, you don’t have to do any strange configuration, use a special NIC, or change your switch settings.

With LightOS, your system looks much like it does today, just with the addition of dedicated storage servers. Using LightOS, the entire stack would look like Figure 7. At the top, you have a Kubernetes cluster that’s managing various pods, and you’re running whatever applications you want in those pods. In the middle layer, there’s the standard NVMe®/TCP block driver, which is included in many distributions today.* You have the LightOS CSI driver, which interfaces natively with Kubernetes. And it’s Ethernet networking with multipath access over any NIC. You have the power of multipath on your Kubernetes servers, so if you have network failures, switch failures, or even a target server failure, your volumes stay up. It uses standard port bonding on TCP, so you don’t have to come up with some odd scheme; if you’re already using port bonding, it’ll work fine with LightOS. And your network switches stay the same: no special settings, no ECN, Global Pause, PFC, or VLANs required. If you want to use them, that’s fine, but this implementation doesn’t require them. If you can do iSCSI in your environment today, you can do NVMe/TCP and get much better performance: up to 6X the IOPS and about 1/4 the latency and response time. It’s the same exact infrastructure and the same servers, but you can get over 4X better performance just by moving to NVMe/TCP instead of iSCSI.
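From the application side, consuming this storage is ordinary Kubernetes: a pod claims a volume through a PersistentVolumeClaim that names the CSI driver’s storage class. A minimal sketch, where the class name "lightos-sc" is an assumed example rather than the driver’s actual default:

```yaml
# Illustrative PVC; "lightos-sc" is a hypothetical StorageClass name
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: lightos-sc
  resources:
    requests:
      storage: 100Gi
```

Because the volume is reached over NVMe/TCP rather than a local bus, there is no nodeAffinity pinning it anywhere; the scheduler stays free to place, or re-place, the pod on any node.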

* NVMe®/TCP has been in the kernel since kernel 5.0, and it has been backported and is available on Red Hat, CentOS, Ubuntu, SUSE, Debian, and Fedora. So it is likely already in the release you’re using, and if your release doesn’t include the NVMe®/TCP driver, it’s open source.

In addition to persistent storage for Kubernetes, LightOS offers rich data services. You get logical persistent volumes for your Kubernetes servers with the data services normally associated with an all-flash array:

    • Thin provisioning
    • Line-rate compression
    • Data protection set per volume. Volumes can be 2x or 3x replicated, or, for a completely ephemeral load, unprotected single-copy volumes; any volume can be expanded.
    • Very low tail latency, for very consistent performance and response time without outliers
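Per-volume protection levels like these would typically be selected through StorageClass parameters. A sketch of the idea follows; the provisioner name and the "replica-count" parameter are assumed for illustration, not taken from the product’s documentation:

```yaml
# Hypothetical StorageClasses mapping to the protection levels above;
# provisioner name and "replica-count" parameter are assumed examples.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: replicated-3x          # for data that must survive target failures
provisioner: csi.lightbitslabs.com
parameters:
  replica-count: "3"
allowVolumeExpansion: true     # any volume can be expanded
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ephemeral-1x           # single-copy volumes for ephemeral loads
provisioner: csi.lightbitslabs.com
parameters:
  replica-count: "1"
allowVolumeExpansion: true
```

Each PVC then picks its protection level simply by naming the appropriate class.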

In addition to high performance and rich data services, LightOS reduces TCO by enabling better utilization of QLC media (Figure 8). QLC drives are inexpensive on a dollar-per-gigabyte basis. LightOS improves the endurance of QLC drives by writing to them sequentially and in large blocks, the way they like to be written. So with LightOS you get up to 5X more endurance from inexpensive QLC drives.

When you’re running Kubernetes across your fleet, you don’t know what every individual application’s write pattern will be, which makes deploying QLC flash across your servers a challenge: it might wear out prematurely. With LightOS, however, you can use inexpensive QLC with confidence on the target servers, because LightOS smooths out the write patterns, writing sequentially and in large blocks regardless of the incoming pattern. LightOS not only disaggregates flash and gets better NAND utilization, enabling the use of a lower grade of flash; it also preserves performance by aggregating the performance of all the drives, while the sequential, large-block writes provide better endurance. So you can use QLC where you thought you couldn’t use it before.

Looking at Kubernetes’ different failure domains, LightOS v2.0 is a clustered, scalable product, and it does understand fault domains. If you have a cluster spread over three racks, with two LightOS targets per rack, and you have specified 2x or 3x replication, the system automatically makes sure that the replicas land in different racks: servers tagged with domains A, B, and C (Figure 9). Anytime there is a replica, LightOS splits the replicas over domains, so an entire rack can fail and your volumes stay up. You can implement other kinds of replica splits as well; LightOS isn’t limited to the rack level, and domains can be based on a row or a power zone. It can span rows because LightOS leverages NVMe®/TCP, which is fully routable and can go anywhere across the data center; you’re not limited to working within the rack.

Why applications that are “shared nothing” and protect data themselves can still benefit from shared storage

LightOS with its CSI functionality is philosophically compatible with Kubernetes: scalability, flexibility, high performance, and portability, while adhering to the applications’ recommendation of local storage for best performance. LightOS performs like local storage while maintaining application portability, so you’re not tying applications to certain physical servers. The persistent storage performs like local NVMe®, with reads under 200 microseconds and writes at about 300 microseconds, and it can “move” to any physical server. So if the pods move, LightOS storage is still available to any physical server on your regular TCP/IP network at local speeds. With LightOS, your pods and containers can move around, and your storage moves with them.

And while many of these applications provide their own protection, there are a couple of reasons you might want LightOS to do it instead. First, LightOS guards against drive failures. LightOS with elastic RAID protects against drive failures seamlessly: when a drive fails in a LightOS target, no application server ever sees it, the drive is rebuilt inside the target, and the rebuild traffic never hits the network. Whereas if the applications had local drives and did their own data protection, a drive failure would mean resynchronizing or rebuilding over the network, leaving them in a degraded mode for quite some time depending on network speeds. And you’d be impacting the network, which can impact other applications: the noisy neighbor problem. So LightOS eliminates issues with failed drives.

Second, for applications that do their own protection, any kind of failure results in a rebuild, and LightOS can avoid that even during a server failure. Imagine a server failure in Kubernetes that moves a database pod to an alternate Kubernetes server. Rather than coming up on an empty drive and requiring a full synchronization of, say, an 8TB drive, LightOS persistent storage rebuilds only the amount of data that was lost while the pod was down, which should take only a minute or two. Only that brief resynchronization has to happen, it doesn’t impact the network as much, and you’re not in a degraded mode. So there are actually lots of reasons to use LightOS, or shared storage, even with applications that do their own protection.

Lastly, a replicated product that protects itself sometimes requires you to deploy a minimum of three replicas. But say you have a MongoDB database that you want to be highly available, and you simply don’t need that much performance: one server or one container is enough. Rather than deploying three different containers, three different pods, each with its own storage, and replicating over the network, wouldn’t it be great if you could deploy just one instance, even of a scalable database, when a single instance is all you need? With LightOS you can do that, and it will protect the data for you and make it highly available. If that instance goes away, there is a temporary disruption while it’s down, but when it restarts elsewhere, all its data will be preserved.

Additional Resources

Container Storage Interface (CSI) Plugins: Overview and Implementation whitepaper
Cloud Native Databases Meet Cloud Native Storage webinar
Persistent Storage For Your Containerized Workloads With LightOS
Going Up? Persistent Storage For Kubernetes Applications Is On The Rise

About the Writer: