How to Address the Blast Radius Problem: Mitigating the Impact of SSD or Server Failures
NVMe SSDs have, from inception, provided more performance per drive than a typical single server can consume. Today, NVMe drives are readily available in 16TB capacities, with 30TB drives expected any day now and 64TB and larger capacities on the horizon. This excess of both performance and capacity screams "share me!" Sharing these drives via disaggregation is straightforward: partition the drive and share it over NVMe over Fabrics (NVMe-oF). In doing so, however, a single shared SSD failure can impact multiple servers, making the storage unavailable to every server connected to that SSD. This phenomenon is known as the "blast radius," and mitigating it is a critical element in deploying distributed storage solutions.
I recently participated in a G2M Research webinar on this topic, along with colleagues from Intel and KIOXIA. Here’s my take.
Enterprises that want to emulate the efficiencies enjoyed by hyperscalers such as Google and Facebook fill racks with off-the-shelf servers and deploy local NVMe SSDs in each one for speed and low latency. Using this local flash helps their applications perform at the highest levels.
In these rack scenarios, the applications themselves are typically tasked with data replication or data protection, because the drive itself is direct-attached storage (DAS) and is not otherwise protected. Sure, you could configure software RAID, but that makes each server a uniquely configured entity and simply does not scale well. It also ties the application physically to that server, which is antithetical to today's accepted best data center practices. This can be problematic if the application was not originally developed to handle replication and data protection. Even with a modern, distributed, self-replicating application such as MongoDB, CockroachDB, or Cassandra, a drive or server failure necessitates a full rebuild. That rebuild occurs over the network, and with 8-16TB drives commonplace, it can take many hours even over 25Gb application server interfaces, depending on database and network load. All the while, the cluster suffers degraded performance and may miss its SLAs.
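To see why such rebuilds take hours, here is a back-of-the-envelope sketch. The link speed matches the 25Gb interfaces mentioned above; the fraction of the link actually available to the rebuild is an illustrative assumption, not a measurement.

```python
# Rough rebuild-time estimate for re-replicating a full drive over the
# network. The usable_fraction figure is an assumption: most of the link
# is presumed busy serving live application traffic.

def rebuild_hours(capacity_tb: float, link_gbps: float,
                  usable_fraction: float) -> float:
    """Hours to copy capacity_tb over a link_gbps link when only
    usable_fraction of the link is free for rebuild traffic."""
    bits = capacity_tb * 1e12 * 8                  # drive capacity in bits
    effective_bps = link_gbps * 1e9 * usable_fraction
    return bits / effective_bps / 3600

# A 16TB drive over a 25Gb/s NIC with ~25% of the link free for rebuild:
print(round(rebuild_hours(16, 25, 0.25), 1))  # → 5.7 (hours)
```

Even at full line rate (`usable_fraction=1.0`) the copy takes well over an hour, and on a loaded network it stretches to many hours, consistent with the experience described above.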
Capacity is not the only reason to share drives: each NVMe drive can often supply more IOPS than a single application host needs, so it is both possible and desirable to share IOPS as well. In most companies, a great deal of paid-for capacity sits completely unused, a wasted expense, and the wasted IOPS are discussed even less often. To address both issues, companies look to disaggregation, separating storage from compute. Letting application servers use drives over the network means fewer drives are needed, less performance and capacity are wasted, and real money is saved.
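The savings from pooling can be sketched with simple arithmetic. The server count, per-server usage, and headroom factor below are illustrative assumptions.

```python
import math

# With local DAS, every server owns a whole drive no matter how little
# of it is used. A shared pool only needs to cover aggregate demand
# plus some headroom. All figures here are assumptions for illustration.

def drives_needed_local(servers: int) -> int:
    return servers  # one drive per server, regardless of utilization

def drives_needed_pooled(servers: int, used_tb_per_server: float,
                         drive_tb: float, headroom: float = 1.25) -> int:
    total_tb = servers * used_tb_per_server * headroom
    return math.ceil(total_tb / drive_tb)

# 40 app servers, each actually using ~3TB of a 16TB drive:
print(drives_needed_local(40))          # → 40 drives
print(drives_needed_pooled(40, 3, 16))  # → 10 drives
```

In this sketch, pooling cuts the drive count by 4x, and because fewer drives serve the same aggregate load, their IOPS are consumed rather than stranded.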
Sharing individual drives, however, also increases the blast radius.
Even solutions that share multiple drives in a server, with some kind of RAID to protect against drive failures but no target host redundancy, greatly increase the blast radius. When a company had a single drive in a single system and the drive failed, only one server was impacted. With physical disaggregation via NVMe-oF, for instance an open-source target making many drives available over the network from a single server, a failure of that server at the top of the rack expands the blast radius to effectively the entire rack of application servers. Even a single drive failure (without RAID) may, depending on how the drives are laid out, take down multiple servers rather than one.
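The scenarios above can be put into a rough model. The topology numbers (drives per target, clients per drive) are assumptions chosen only to show how the impacted-server count grows.

```python
# Rough "blast radius" model: how many application servers lose storage
# under each failure scenario. Topology figures are illustrative
# assumptions, not a reference architecture.

DRIVES_PER_TARGET = 12   # drives exported by one NVMe-oF target server
CLIENTS_PER_DRIVE = 3    # app servers holding a partition on each drive

def blast_radius(event: str) -> int:
    if event == "das_drive":      # local drive fails in one app server
        return 1
    if event == "shared_drive":   # one shared drive fails, no RAID
        return CLIENTS_PER_DRIVE
    if event == "target_server":  # whole target server fails
        # Worst case: every client of every exported drive is impacted.
        return DRIVES_PER_TARGET * CLIENTS_PER_DRIVE
    raise ValueError(f"unknown event: {event}")

for event in ("das_drive", "shared_drive", "target_server"):
    print(event, blast_radius(event))  # → 1, 3, 36 servers impacted
```

The pattern, not the exact numbers, is the point: disaggregation without target redundancy turns one failed box into a rack-wide outage.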
Lightbits Labs LightOS™ alleviates the problem of server and drive blast radius.
Although it essentially sits at the top of the rack, in the case of a drive failure LightOS recognizes and rebuilds only what has actually been allocated, significantly reducing the time to recover from a drive failure. The rebuild takes place inside the system rather than over the network. In the case of a server failure, standards-based ANA multipathing allows application servers to seamlessly fail over to a different target server. In short, if a drive or server fails, work is not disrupted, or is disrupted only minimally: the blast radius is greatly reduced, and recovery takes just seconds or minutes.
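The difference between a full network rebuild and an allocation-aware in-box rebuild can be sketched numerically. The allocation fraction and internal rebuild bandwidth below are assumptions for illustration, not LightOS specifications.

```python
# Compare two rebuild strategies for the same 16TB drive:
#  - full drive copied over a loaded 25Gb/s network link, vs.
#  - only the allocated data rebuilt at in-box bandwidth.
# Allocation fraction and internal bandwidth are assumed figures.

def rebuild_minutes(data_tb: float, effective_gbps: float) -> float:
    return data_tb * 1e12 * 8 / (effective_gbps * 1e9) / 60

drive_tb = 16
allocated_fraction = 0.2   # assume only 20% of the drive is allocated
internal_gbps = 50         # assumed in-box rebuild bandwidth

full_net = rebuild_minutes(drive_tb, 25 * 0.25)  # loaded 25Gb/s NIC
in_box = rebuild_minutes(drive_tb * allocated_fraction, internal_gbps)

print(round(full_net), round(in_box))  # → 341 9 (minutes)
```

Under these assumptions the rebuild drops from hours to minutes, which is the "seconds or minutes" recovery the text describes: less data moved, over a much faster internal path.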