NVMe/TCP: A New Standard is Born

Two years after its inception, the NVMe/TCP transport binding specification has been officially ratified and is now available for public download. At the same time, NVMe/TCP software drivers were contributed to the Linux kernel and SPDK open-source projects. This new specification is a major step forward for cloud storage infrastructure.

As a member of the Lightbits team that took part in creating NVMe/TCP, and as an author of the standard, I found the standardization process both intriguing and challenging. It required moving in many directions at once as we received input and requirements from vendors and adopters, ranging from capabilities that did not previously exist, such as in-transit encryption, to the packing of headers and payloads in the protocol and the protection of their integrity.

In the very early stages of our work, Lightbits identified that NVMe-oF would be the primary interface to our technology. When talking to customers to understand their pain points and their wish lists, we recognized how minimal the adoption of NVMe-oF based solutions in the cloud was. After being evaluated, NVMe-oF was simply deemed insufficient.

There are many reasons for the slow adoption rate, but at least one was crystal clear and common to all: friction. The cloud uses commodity infrastructure at a massive scale, with very little tolerance for infrastructure revolutions and, more importantly, for any impact on running services. Any storage technology that introduces friction is usually rejected or pushed far down a seemingly endless priority list.

This adoption hurdle led Lightbits to identify an opportunity to eliminate that friction, albeit at the cost of a significant technical challenge. Transporting storage traffic over TCP/IP reliably and with consistent latencies is a hard problem that many have tried and failed to solve. But we believed we could do it.

We investigated the NVMe/TCP concept with a few select partners, and things took off from there. Together with our customers and partners, we knew the only way this concept could succeed was to define an official NVMe standard. The solution could not be a proprietary implementation that would “lock the cloud” to a storage vendor. Instead, we all committed to an open, interoperable standard that would genuinely enable disaggregation and expedite NVMe-oF adoption in the cloud.

In April 2017, all of us who were working together in the space introduced the “TCP Transport Binding” Technical Proposal Authorization Request (TPAR) to the NVMe Technical Working Group.

As with any new concept, it took some persuasion and reasoning with the working group to define the scope of the TCP Transport Binding TPAR: what problems it was designed to solve and what issues it was not intended to address. Some immediately understood the value proposition: the ease of adoption, the ability to use existing data center infrastructure, and how a new NVMe transport could meet the challenges faced when deploying at scale. Others did not understand why NVMe needed to grow another transport when iWARP (RDMA technology over TCP/IP) existed, or even why TCP/IP was a good idea for NVMe.

Eventually, the Technical Working Group became convinced of the need for NVMe/TCP, and the new transport was approved. This kickstarted our technical proposal efforts, along with the software prototyping and development, to which Lightbits contributed as part of the NVMe.org Linux task force.

The design discussions often revolved around inherent conflicts, such as being friendly to hardware offloads versus elegant layering for software implementations. But throughout the process, our guiding concepts were clear and served as a compass for our design. We needed to stay true to the NVMe principles of “keeping it simple and efficient,” which helped the team converge on many open issues.

Allow me to stop for a moment at this point and thank the NVMe TWG members for their great help in making the NVMe/TCP transport ratification happen.

Fast forward to today: NVMe/TCP is real and showing excellent promise, as we always expected it would. It is attracting interest from customers, being adopted by vendors, being demonstrated at exhibitions, and drawing attention in the media.

In November, we participated in the NVMe over Fabrics™ (NVMe-oF) Plugfest at the University of New Hampshire’s InterOperability Laboratory (UNH-IOL), the compliance certification lab for NVMe technology. The event brought together leading vendors to test against NVMe standards. As an NVMe/TCP storage solution vendor at this event, Lightbits successfully showcased interoperability with multiple NIC vendors and demonstrated NVMe/TCP’s ease of deployment.

All of this affirms Lightbits’ initiative. While NVMe/TCP cannot solve every problem in the data center, it is a powerful new tool that gives cloud providers flexibility and operational efficiency in meeting their infrastructure needs.

Lightbits’ mission is to make cloud providers’ lives more manageable and their infrastructure more efficient. Getting the NVMe/TCP standard first ratified, supported and maintained, then widely adopted is one of the ways we are vigorously pursuing this mission. Soon, NVMe/TCP will be powering some of your favorite clouds, and it all began with a little standard called NVMe/TCP.

About the Writer:

Sagi Grimberg (@sagigrim) is a co-founder and CTO at Lightbits Labs, a storage company developing next-gen hyperscale storage solutions. He has more than 10 years of experience in storage, networking, Remote Direct Memory Access (RDMA) technologies, and distributed systems. Sagi is a co-maintainer of the Linux NVMe subsystem and the lead author of the NVMe/TCP standard. Prior to Lightbits Labs, Sagi worked at Mellanox Technologies (now owned by NVIDIA), where he served as the Storage Software manager. Sagi has written various technical papers and has presented at conferences and meetups on the innovative Lightbits NVMe/TCP capabilities.