Following up on my Forbes Technology Council post about the crucial role of storage in improving AI inference efficiency, this blog focuses on the key metric everyone is striving to optimize: tokens per GPU. I aim to convince you that a much better tokens/GPU ratio is achievable, but it requires defining a new, highly efficient storage architecture. Now, let’s take a closer look at the details.
Inference Efficiency = Tokens/GPU
GPUs are super expensive resources, both to buy (CapEx) and to operate (OpEx). That’s why service providers usually charge users based on how long they use the GPU ($/GPU-hour) or how many tokens they consume ($/token). Enterprises size their hardware based on the load (tokens) they expect to process. The simple approach to scaling is to add more hardware. However, true efficiency gains (tokens/GPU) typically come only from upgrading to a newer generation of GPUs that can handle more requests at a similar or lower cost, considering both the capital cost of the GPU and its energy efficiency, often measured in tokens/watt. Furthermore, scaling is frequently limited by power and space constraints, making the simple addition of more hardware infeasible.
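To make the economics concrete, here is a minimal back-of-the-envelope sketch of how tokens/GPU translates into cost per token. The GPU price and throughput numbers are hypothetical placeholders, not benchmarks of any specific GPU or service:

```python
# Back-of-the-envelope cost model; all numbers are purely illustrative,
# not measurements of any specific GPU or service.

def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Cost to produce one million tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# Same GPU price, higher throughput (e.g., thanks to reusing prior computations)
# translates directly into a lower cost per token.
baseline = cost_per_million_tokens(gpu_cost_per_hour=4.0, tokens_per_second=1_000)
improved = cost_per_million_tokens(gpu_cost_per_hour=4.0, tokens_per_second=2_500)
print(f"baseline: ${baseline:.2f}/M tokens, improved: ${improved:.2f}/M tokens")
```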
Any solution that increases the number of tokens a single GPU can process can dramatically improve inference efficiency and reduce the cost per token. The most common way to improve GPU efficiency is to store previous computations so they can be reused later, saving precious GPU cycles that can instead be allocated to other processing. That’s the role of the KV cache I mentioned in my Forbes Technology Council post, and that’s why efficient storage management is critical. Even if an individual request takes longer to complete due to interactions with storage systems, overall system efficiency improves drastically if the GPU continuously processes more tokens rather than wasting cycles on recomputation or waiting for data. As long as this added latency remains within acceptable service SLAs, improving tokens per GPU reduces the cost per token and maximizes the return on investment in expensive AI infrastructure.
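To make the reuse idea concrete, here is a minimal sketch of prefix-based KV-cache reuse. The names and structure are hypothetical and not tied to any particular inference framework; real engines key cache entries by token-block hashes and manage GPU memory directly, but the principle is the same: a prompt whose prefix was seen before only pays GPU cycles for the new suffix.

```python
# Minimal sketch of prefix-based KV-cache reuse (illustrative only).
from typing import Dict, List, Tuple

# Maps a prompt prefix (as a tuple of token ids) to its serialized attention vectors.
kv_cache: Dict[Tuple[int, ...], bytes] = {}

def longest_cached_prefix(tokens: List[int]) -> int:
    """Length of the longest prompt prefix whose KV entries are already cached."""
    for n in range(len(tokens), 0, -1):
        if tuple(tokens[:n]) in kv_cache:
            return n
    return 0

def prefill(tokens: List[int]) -> bytes:
    """Reuse cached attention vectors where possible; recompute only the suffix."""
    hit = longest_cached_prefix(tokens)
    cached = kv_cache.get(tuple(tokens[:hit]), b"")
    # Only the uncached suffix costs GPU cycles (stand-in for the real attention kernel).
    recomputed = bytes(len(tokens) - hit)
    kv = cached + recomputed
    kv_cache[tuple(tokens)] = kv  # future requests sharing this prefix skip the work
    return kv
```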
Why do we need to rethink storage architecture?
Let’s focus on what “efficient” really means here and take a look at the bullets below. They highlight the core capabilities of traditional storage systems, why they were designed that way, and how AI inference fundamentally changes the requirements:
Traditional Storage Systems
- Access Interface: Standard interface (File, Object, Block) to support a wide range of workloads with different requirements.
- Data Protection: Assume any data loss is catastrophic, so prioritize durability and availability above all.
- Capacity Efficiency: Optimize capacity costs ($/GB), even at the expense of additional data processing.
- Integration: Loosely couple to maximize flexibility and interoperability.
Requirements for AI Inference
- Access Interface: A specialized interface optimized for high-throughput, low-latency access to attention vectors. No need for generic interfaces that can negatively impact performance and create semantic gaps.
- Data Protection: Lost data can often be recovered quickly and efficiently by recomputing it.
- Capacity Efficiency: Minimize data processing, and duplicate data across multiple locations when it directly enhances performance.
- Integration: Tight integration with inference frameworks to schedule inference requests close to the data or place data near where those requests are processed.
As you can see from this quick analysis, the storage systems we rely on today simply weren’t built for AI inference. They are unaware that they are storing attention vectors, nor do they fully understand how or when these vectors are actually utilized. By exposing a general object or file interface rather than a vector-native one, they introduce a significant semantic gap and incur unnecessary performance overhead. They are meticulously engineered to guarantee extreme durability and maximize availability, operating under the assumption that any data loss is catastrophic. In the world of AI inference, lost data can often be recovered quickly and efficiently by recomputing it. They prioritize capacity efficiency, which may require more processing and degrade performance, rather than duplicating data to improve performance. Last, they lack tight integration with inference frameworks, making it difficult to implement additional optimizations.
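As an illustration of that semantic gap, here is a hedged sketch of what a vector-native storage interface could look like, in contrast to a generic file or object API. The class and method names are hypothetical, not an existing product’s API; the key point is that a cache miss is not a failure, because the caller can always recompute.

```python
# Hypothetical vector-native KV-cache store interface (sketch only, not a real product API).
from abc import ABC, abstractmethod
from typing import Optional

class KVCacheStore(ABC):
    """Speaks in attention vectors keyed by prompt prefix, not in files, blocks, or objects."""

    @abstractmethod
    def put(self, prefix_hash: bytes, kv_block: bytes, gpu_hint: Optional[int] = None) -> None:
        """Store KV vectors. The placement hint lets the store keep data near the GPU
        expected to consume it; durability is best-effort, since loss only means recompute."""

    @abstractmethod
    def get(self, prefix_hash: bytes) -> Optional[bytes]:
        """Return cached KV vectors, or None on a miss; the caller simply recomputes."""
```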
In short, storage systems predate modern AI inference, leaving significant untapped opportunities for improvements. We need to develop storage systems built specifically for the AI era—systems that are inherently aware they store and manage AI vectors. Not blocks. Not files. Not objects. These systems must seamlessly span all storage layers, from the HBM on the GPUs and the DDR memory in the host servers, to local and remote NVMe drives, all connected over high-speed networks. They must possess the intelligence to anticipate precisely when and where data will be consumed, and the ability to move that data with ultra-low latency and high throughput.
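To sketch what “spanning all storage layers” could mean in practice, here is an illustrative placement policy across hypothetical tiers. The latency figures and thresholds are rough assumptions for the sake of the example, not measurements:

```python
# Illustrative tiering policy; latencies and thresholds are assumptions, not benchmarks.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    approx_latency_us: float  # assumed order-of-magnitude access latency

TIERS = [
    Tier("GPU HBM", 1),
    Tier("Host DDR", 10),
    Tier("Local NVMe", 100),
    Tier("Remote NVMe over a high-speed network", 300),
]

def choose_tier(expected_reuse_ms: float) -> Tier:
    """Place a KV block in the fastest tier its expected reuse time justifies.
    An intelligent store would derive this estimate from the inference scheduler."""
    if expected_reuse_ms < 10:
        return TIERS[0]     # about to be consumed: keep it in HBM
    if expected_reuse_ms < 100:
        return TIERS[1]     # near-term reuse: host memory
    if expected_reuse_ms < 1_000:
        return TIERS[2]     # later reuse: local NVMe
    return TIERS[3]         # cold, but still cheaper to fetch than to recompute

print(choose_tier(50).name)  # -> Host DDR
```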
Today, existing storage systems are being pushed to their limits. As AI models continue to increase in size, the context window (input tokens) grows larger, and video and image inference and generation become more common alongside today’s popular textual LLMs, the demand for new storage solutions designed and optimized for AI inference will rise significantly.
Finally, inference demand is bounded by the load we—slow humans—generate. But that won’t last. As agent-to-agent inference accelerates, the volume of inference requests won’t just grow; it will explode, driving an urgent need for storage systems purpose-built for AI inference.
Conclusion
Improving inference efficiency isn’t about tweaking existing solutions. It’s about fundamental changes. To truly squeeze more tokens out of every expensive GPU, we need a storage architecture that natively understands the language of AI: tokens and vectors. We need a solution where the storage system is so smart it knows exactly what the GPU needs before it asks, and can move data seamlessly between HBM and local or remote NVMe storage in the blink of an eye.
The race to improve tokens per GPU is underway, and a new generation of smart storage architectures will be the core building block for success.