Lightbits Intelligent Cluster Management: Scale NVMe/TCP Storage

Lightbits’ core platform, our next-generation Software Defined Storage solution, delivers far more than high throughput and low latency. Built on NVMe over TCP, it provides modern, disaggregated block storage that achieves local flash performance while reducing TCO, shrinking datacenter footprint, and avoiding vendor lock-in through an open, industry-standard architecture. Organizations choose Lightbits not just for raw speed, but for its ability to modernize aging infrastructure with flexibility, sustainability, and freedom of hardware choice.

When storage scales from a few clusters into an entire fleet, the bottleneck shifts from performance to operations. Even the most efficient, cost-optimized individual clusters are no longer sufficient on their own: teams need consistent, automated, policy-driven provisioning and management across all environments, regardless of cluster size, hardware mix, or growth patterns.

This is where Intelligent Cluster Management, or ICM, becomes essential. As a unified orchestration layer, ICM is designed to enable fleet-level visibility, policy-driven operations, and lifecycle automation that go beyond the limits of managing isolated clusters. It ensures that the same value Lightbits provides within a cluster, including efficiency, flexibility, sustainability, and freedom of hardware choice, carries over across many clusters.

With this Tech Preview, we introduce the foundational architectural shift required to bring this unified fleet-scale vision to life and enable organizations to operate and grow their Lightbits deployments far beyond the boundaries of single-cluster management.

The Problem Today: Decentralized Operations Create Imbalance

In the traditional multi-cluster environment, provisioning storage relies on direct interaction between the consumer control plane and individual storage clusters.

As management consumers such as human operators using the CLI, hypervisors, Kubernetes CSI plugins, OpenShift, or OpenStack Cinder drivers create volumes, they send management and provisioning requests directly to the API endpoint of a specific Lightbits cluster. Because clusters operate as isolated systems, operators or orchestrators must maintain manual awareness of each cluster’s capacity, health, and feature set.

Animated image illustrating decentralized data infrastructure without Lightbits cluster federation

This decentralized model can lead to poor resource allocation across the fleet. Without an overarching intelligence directing traffic, provisioning becomes unbalanced. For example, if a human operator or an automated script favors Cluster C because it is newer or easier to access, Cluster C can quickly become highly utilized even when other clusters still have significant available capacity. Over time, this creates a skew across the fleet, making growth harder to manage.

The result is faster capacity exhaustion in the wrong places, more firefighting, and slower onboarding of new workloads.
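To make the skew concrete, here is a minimal Python sketch. It is a toy model, not Lightbits code: the three equally sized clusters, the 100 GiB volume size, and all function names are assumptions chosen purely for illustration. It compares a consumer that always targets Cluster C with one that always picks the cluster with the most free space.

```python
# Hypothetical fleet: three clusters with equal capacity (in GiB).
CAPACITY_GIB = {"A": 1000, "B": 1000, "C": 1000}


def provision(choose, volumes):
    """Place each volume on the cluster returned by choose(used); return usage."""
    used = {name: 0 for name in CAPACITY_GIB}
    for size in volumes:
        used[choose(used)] += size
    return used


def utilization_skew(used):
    """Gap between the most and least utilized cluster, as a fraction."""
    ratios = [used[name] / CAPACITY_GIB[name] for name in CAPACITY_GIB]
    return max(ratios) - min(ratios)


def favor_c(used):
    # An operator or script that always targets the same cluster.
    return "C"


def most_free(used):
    # Pick the cluster with the most free capacity (ties go to the first name).
    return max(CAPACITY_GIB, key=lambda name: CAPACITY_GIB[name] - used[name])


volumes = [100] * 9  # nine hypothetical 100 GiB volumes

skewed = provision(favor_c, volumes)
balanced = provision(most_free, volumes)

print("always C: ", skewed, "skew =", utilization_skew(skewed))
print("most free:", balanced, "skew =", utilization_skew(balanced))
```

In this toy run, the first strategy leaves Cluster C at 90% utilization while A and B sit empty; the second spreads the same nine volumes evenly across all three clusters.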

Animated image illustrating data pipeline without cluster federation from Lightbits Labs.

The Tech Preview: Centralized Provisioning With Capacity-Based Placement Only

To eliminate the need to operate on a per-cluster basis, we introduce the Intelligent Cluster Management service. ICM functions as a standalone multi-cluster management control plane and a provisioning broker across all attached Lightbits clusters.

With ICM deployed, management requests for provisioning, including Create, Get, Update, and Delete, are directed to a centralized ICM endpoint. The ICM layer sits between the consumer control planes and the underlying clusters, unifying the fleet under a single management interface.

Animated image illustrating the cluster federation data pipeline

The centralized flow through ICM looks like this:

Request reception: Storage consumers send provisioning requests to the ICM service using familiar API formats with minimal or no changes.
Capacity-based placement: For this Tech Preview, ICM selects the target cluster based only on available free capacity, routing new provisioning requests to the cluster with the most available free space.
Request execution: After selecting the target cluster, ICM submits the provisioning request to that cluster and ensures it completes successfully, including handling unexpected failures and coordinating retries when required.
Data-plane integrity: ICM operates solely as a control plane. The high-performance NVMe over TCP data path remains direct between the consumer and the target cluster, so ICM adds no overhead to read and write operations.
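The flow above can be sketched in a few lines of Python. This is an illustrative model only, not the ICM implementation or its API: the `Cluster` and `Broker` classes, the `failures_to_inject` test knob, and the choice to retry against the same cluster are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for one cluster's management endpoint."""
    name: str
    capacity_gib: int
    used_gib: int = 0
    failures_to_inject: int = 0  # demo knob: simulate transient errors

    @property
    def free_gib(self) -> int:
        return self.capacity_gib - self.used_gib

    def create_volume(self, volume: str, size_gib: int) -> None:
        if self.failures_to_inject > 0:
            self.failures_to_inject -= 1
            raise ConnectionError(f"{self.name}: transient failure")
        if size_gib > self.free_gib:
            raise ValueError(f"{self.name}: not enough free capacity")
        self.used_gib += size_gib


@dataclass
class Broker:
    """Control-plane-only broker: it places volumes and never carries I/O."""
    clusters: list
    max_attempts: int = 3
    placements: dict = field(default_factory=dict)  # volume -> cluster name

    def create_volume(self, volume: str, size_gib: int) -> str:
        # Capacity-based placement: the cluster with the most free space wins.
        target = max(self.clusters, key=lambda c: c.free_gib)
        for attempt in range(1, self.max_attempts + 1):
            try:
                target.create_volume(volume, size_gib)
            except ConnectionError:
                if attempt == self.max_attempts:
                    raise
                continue  # assumption: retry the same cluster
            # Remember the placement so later Get/Update/Delete can be routed.
            self.placements[volume] = target.name
            return target.name
        raise RuntimeError("max_attempts must be at least 1")


fleet = [
    Cluster("A", 1000, used_gib=700),
    Cluster("B", 1000, used_gib=200, failures_to_inject=1),
    Cluster("C", 1000, used_gib=900),
]
broker = Broker(fleet)
placed = broker.create_volume("vol-1", 100)
print(placed, broker.placements)
```

Here Cluster B has the most free space, so the broker selects it, absorbs one simulated transient failure, and records the placement. The broker only handles management calls; in this model, as in the flow described above, the consumer would still connect to the chosen cluster directly for I/O.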

In this Tech Preview, placement decisions are based solely on capacity; health- and performance-aware policies are planned for future releases. The initial value is federation-wide, capacity-based provisioning that reduces imbalance across a fleet by placing new volumes where the most free space is available. This creates a strong operational foundation for scaling beyond single-cluster boundaries.

Download the installation script: [link]
Installation and Getting Started manual: [link]
White Paper: Cluster Federation Overview

What’s Coming Next: Richer Placement and a Platform for Fleet Operations

Capacity balancing is the first step. The longer-term direction for ICM is to become the policy-driven control plane for operating storage fleets at scale.

Enhanced placement policies will bring in additional dimensions beyond capacity, such as:

Performance balance, to help distribute workloads and reduce hotspots
Cluster capabilities, to match workloads to hardware characteristics and supported features
Regional preferences and constraints, to honor locality, compliance, and deployment requirements
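One common way to combine dimensions like these is to filter on hard constraints first and then score the remaining candidates. The Python sketch below illustrates that general pattern only; it does not describe how ICM will implement these policies. Every name, field, and weight in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterInfo:
    name: str
    region: str
    free_fraction: float   # share of capacity still free, 0.0 to 1.0
    load_fraction: float   # share of performance headroom in use, 0.0 to 1.0
    features: frozenset    # capabilities the cluster supports


@dataclass(frozen=True)
class Request:
    allowed_regions: frozenset
    required_features: frozenset


def place(request, clusters, capacity_weight=0.5, load_weight=0.5):
    """Return the name of the best eligible cluster, or None if none qualifies."""
    # Hard constraints: region and capabilities must match.
    eligible = [
        c for c in clusters
        if c.region in request.allowed_regions
        and request.required_features <= c.features
    ]
    if not eligible:
        return None

    # Soft preferences: favor free capacity and low current load.
    def score(c):
        return (capacity_weight * c.free_fraction
                + load_weight * (1.0 - c.load_fraction))

    return max(eligible, key=score).name


clusters = [
    ClusterInfo("eu-1", "eu", 0.8, 0.9, frozenset({"encryption"})),
    ClusterInfo("eu-2", "eu", 0.5, 0.2, frozenset({"encryption"})),
    ClusterInfo("us-1", "us", 0.9, 0.1, frozenset({"encryption"})),
]
request = Request(frozenset({"eu"}), frozenset({"encryption"}))
choice = place(request, clusters)
print(choice)
```

In this example, `us-1` has the most free space but is excluded by the regional constraint, and `eu-2` is preferred over `eu-1` because it is far less loaded, even though `eu-1` has more free capacity.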

In parallel, we will continue validating and strengthening federation behavior across the ecosystem layers that customers rely on. This includes ensuring consistent end-to-end workflows and compatibility with common operators and integrations such as Kubernetes via the CSI plugin, OpenStack via Cinder, OpenShift, and additional platform and automation environments.

Beyond placement, ICM is designed to become the single point of oversight for the entire fleet. As it evolves, this centralized layer can serve as a platform for:

Observing and managing dark or remotely deployed clusters (clusters with limited direct operator access or visibility)
Call-home capabilities that proactively surface cluster issues
Data mobility operations across clusters
Integration of AI-backed monitoring, diagnostics, and self-healing capabilities

This is the path toward Intelligent Cluster Management as a true fleet operations layer: a control plane that not only provisions resources intelligently, but also improves reliability, visibility, and operational efficiency as deployments scale.
