Scaling 4K Random Read on NVMe/TCP: Why Outstanding I/O and -i Matter More Than You Think (Part 1)

Tom Dau
Senior Solutions Engineer
January 22, 2026

In Part 1 of this series, I examine how client-side tuning parameters impact NVMe over TCP (NVMe/TCP) performance with Lightbits, using a 4K 100% random read workload as the baseline.

In environments where the backend storage can sustain high NVMe/TCP performance — as is the case with Lightbits — the primary limiter often shifts away from the storage layer itself and onto how much work the client is driving and how effectively the Linux NVMe/TCP stack is parallelized.

This post walks through what I measured and how to tune nvme connect -i correctly.

Test environment

  • Client: Single server, 100Gb NIC
    • CPU governor set to performance to eliminate frequency scaling variability during testing
  • Target: 3-node Lightbits cluster
  • Volumes: 2 namespaces, RF=3 (replication factor)
  • FIO Workload (a sketch of the full invocation appears after this list):

bs=4k
rw=randread
ioengine=libaio
direct=1

  • What I varied:
    • Total outstanding I/O = numjobs × iodepth
    • NVMe/TCP I/O queues = nvme connect -i <queues>
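
Here is a minimal sketch of how one point in that sweep can be launched. The device path, runtime, and the numjobs/iodepth split are illustrative placeholders, not the exact job files behind the published numbers:

# Pin the CPU frequency governor to performance (requires the cpupower tool)
cpupower frequency-set -g performance

# One example point in the sweep: 8 jobs × 32 queue depth = 256 outstanding I/O
fio --name=randread4k --filename=/dev/nvme0n1 \
    --bs=4k --rw=randread --ioengine=libaio --direct=1 \
    --numjobs=8 --iodepth=32 \
    --time_based --runtime=120 --group_reporting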

Outstanding I/O is the primary scaling knob

Before talking about queues, we need to look at how many I/Os the client is actually driving in parallel.
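
The same total can be reached through different numjobs/iodepth splits; the splits below are illustrative, not the exact job files used here:

numjobs=4 × iodepth=32 = 128 outstanding I/O
numjobs=8 × iodepth=32 = 256 outstanding I/O
numjobs=8 × iodepth=64 = 512 outstanding I/O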

Figure 1: Throughput & Latency vs Outstanding I/O

This chart shows the best-case throughput and latency as I increased the total outstanding I/O from 128 to 512.

Two things jump out immediately:

  • Throughput scales nearly linearly across this range.
  • Latency increases gradually but remains under 400µs, even as IOPS approach 2.7M.

This data shows that performance scales cleanly with increasing outstanding I/O across the 128–512 range, with throughput rising from ~1.06M to ~2.74M IOPS while latency remains under 400µs.

At 512 outstanding I/O, the system is approaching the effective packet-rate limits of a 100GbE link for 4K traffic, demonstrating that NVMe/TCP can efficiently consume available network bandwidth in this configuration.
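
A quick back-of-the-envelope check makes the point: 2.74M IOPS × 4 KiB ≈ 11.2 GB/s of payload, or roughly 90 Gb/s on the wire before NVMe/TCP, TCP/IP, and Ethernet framing overhead is added.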

For this workload, the efficient operating window spans 128 to 512 outstanding I/O — where concurrency drives predictable scaling while maintaining tightly controlled latency.

What does -i actually do?

The -i flag in:

nvme connect -t tcp … -i <queues>

controls how many NVMe/TCP I/O queues the client opens per target (the long form of the flag is --nr-io-queues).

Each queue is its own TCP socket, submission path, and completion context. More queues mean more CPU parallelism when handling I/O completions.
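
As a concrete sketch (the address and NQN below are placeholders, not values from this test), a connect requesting 64 I/O queues looks like this, and the kernel log shows how many queues were actually created:

nvme connect -t tcp -a 192.168.10.10 -s 4420 \
    -n nqn.2016-01.com.example:subsys1 -i 64

# The effective queue count is also capped by the number of online CPUs and by
# what the controller allows; check the kernel log to confirm (the exact wording
# of the message can vary by kernel version):
dmesg | grep -i 'i/o queues'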

Why latency drops when you increase -i

Here’s where things get interesting. Even at a moderate workload depth of 128 outstanding I/O, latency drops significantly as the number of NVMe/TCP I/O queues increases.

In this test, increasing -i from 8 to 64 reduced latency from approximately 415µs down to ~240µs, while throughput increased from ~617K to ~1.06M IOPS. The storage backend did not change — only the level of client-side queue parallelism did.

This behavior is driven by software queue contention in the Linux NVMe/TCP stack, not by storage limitations. With a small number of queues, in-flight I/Os are forced through a limited set of completion paths, creating serialization pressure in the networking and block layers. This artificially inflates latency even when the storage platform is capable of higher performance. You can observe this directly on the client using the ss utility to monitor the Recv-Q and Send-Q values for active NVMe/TCP connections on port 4420.
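
For example, a loop like the one below (4420 is the default NVMe/TCP port; adjust the filter if your target listens elsewhere) makes the backlog visible; Recv-Q and Send-Q values that stay elevated point at the sockets, not the storage, as the bottleneck:

watch -n 1 "ss -tn 'dport = :4420'"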

As the number of queues increases, I/O completion processing spreads across more CPUs and TCP sockets. That parallelism reduces software contention and collapses the excess latency, allowing the client to more efficiently consume the storage bandwidth available to it.

Figure 2: Throughput & latency as NVMe/TCP queues per target (-i) increase at 128 outstanding I/O

In this chart, increasing -i from 8 to 96 reduced latency by more than 42% while also increasing throughput — without changing the storage backend at all.

How to tune -i with Lightbits

Lightbits does not artificially cap the number of NVMe/TCP queues a client can open. By default, the client can use all available CPU cores for I/O processing.

But the -i flag lets you control how much CPU parallelism the storage stack consumes.

Here’s the guideline:

Workload behavior                               Recommended -i
Low outstanding I/O (light workloads)           8 – 32
High outstanding I/O (latency sensitive)        32 – 64 for typical high concurrency;
                                                64+ for very high core-count systems
CPU-constrained client                          Keep -i smaller
High-core-count server running heavy I/O        Increase -i aggressively
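
As a starting point on a high-core-count client (a rule of thumb, not a Lightbits requirement), you can tie -i to the number of online CPUs and back it off if the application needs those cores:

# Hypothetical starting point: one I/O queue per online CPU
nvme connect -t tcp -a <target-ip> -s 4420 -n <subsystem-nqn> -i "$(nproc)"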

Additional data across outstanding I/O levels

To avoid overwhelming the main story, the remaining charts are shown below.

  • 256 outstanding I/O
Figure 3: Throughput & latency vs NVMe/TCP queues per target (-i) at 256 outstanding I/O
  • 512 outstanding I/O
Figure 4: Throughput & latency vs NVMe/TCP queues per target (-i) at 512 outstanding I/O

Across all these graphs, the pattern is consistent:
When the workload is deep enough, adding NVMe/TCP queues reduces software-induced latency and improves throughput.

Final takeaway

One of the strengths of Lightbits as an NVMe/TCP platform is that it does not artificially restrict client-side parallelism. Because different workloads require very different levels of concurrency, that control is left fully in the customer's hands.

In practice, overall performance is governed by two simple levers:

  • Total outstanding I/O determines how much work the system is driving at any moment.
  • -i (NVMe/TCP I/O queues) determines how effectively the Linux NVMe/TCP stack can process that work in parallel.

When these two are aligned, NVMe/TCP delivers excellent throughput and competitive latency. The flexibility to tune queue parallelism is not a liability — it’s an advantage, and it allows Lightbits to adapt cleanly to everything from light, CPU-constrained applications to large, high-core-count servers running the heaviest I/O workloads.

What’s next

This post focused exclusively on a baseline 4K 100% random read workload to establish how outstanding I/O and NVMe/TCP queue parallelism interact on the client side.

In Part 2, I’ll apply the same methodology to 4K random writes and examine how the tuning guidance changes when write amplification, acknowledgements, and durability semantics are taken into account.

In Part 3, I’ll expand the analysis to larger block sizes to show when the workload transitions from CPU/packet-rate bound to bandwidth bound — and how that changes the optimal -i settings.

That should give a more complete, workload-driven tuning framework for NVMe/TCP with Lightbits.

About the writer
Tom Dau
Senior Solutions Engineer