llm-d Joins CNCF, Ushering in a New Era for Kubernetes and AI Inference
The convergence of Kubernetes and Artificial Intelligence has taken a significant step forward with llm-d, an open-source Kubernetes framework designed to streamline the deployment of inference stacks for any model, on any accelerator, and in any cloud environment. This initiative aims to address the challenges of scaling and managing large language models (LLMs) in production.
IBM, Red Hat, and Google Donate llm-d to the CNCF
On March 24, 2026, at KubeCon Europe in Amsterdam, IBM Research, Red Hat, and Google Cloud jointly announced the donation of llm-d to the Cloud Native Computing Foundation (CNCF) as a Sandbox project. This contribution establishes llm-d as a community-governed blueprint for scalable, vendor-neutral LLM inference. The project benefits from the support of founding collaborators including NVIDIA and CoreWeave, as well as AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI.
Addressing the Limitations of Traditional Kubernetes for LLM Inference
Kubernetes, while the industry standard for orchestration, wasn’t initially designed to handle the stateful and dynamic demands of LLM inference, and traditional routing and autoscaling methods often fall short when managing these workloads. llm-d was created to fill this gap as a Kubernetes-native distributed inference framework, transforming LLM serving from an improvised, model-by-model effort into a repeatable, production-grade Kubernetes system.
How llm-d Works: Key Features
llm-d introduces several key features to optimize LLM inference within a Kubernetes environment:
- Distributed System for LLM Serving: llm-d splits the inference process into prefill and decode phases (disaggregation), running each on separate pods. This allows for independent scaling and tuning of each phase.
- LLM-Aware Routing and Scheduling: A gateway extension routes requests based on KV-cache state, pod load, and hardware characteristics, improving latency and throughput.
- Modular Stack: llm-d builds on vLLM as the inference engine, alongside an inference gateway and related components, providing a reusable blueprint compatible with various models, accelerators, and cloud platforms.
While vLLM functions as the fast inference engine, llm-d provides the operating layer that manages its execution across clusters of GPUs/TPUs, incorporating intelligent scheduling, cache-aware routing, and autoscaling specifically tuned for LLM traffic.
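To make the routing idea concrete, here is a minimal sketch, in Python, of how a KV-cache-aware scheduler might score candidate pods by weighing prefix-cache reuse against current load. The `PodState` structure, the scoring weights, and the `pick_pod` helper are illustrative assumptions for this article, not llm-d’s actual interfaces.

```python
# Illustrative sketch of KV-cache-aware request routing (not llm-d's real API).
# Each candidate pod is scored by how much of the request's token prefix it
# already holds in KV cache, discounted by its current queue depth.

from dataclasses import dataclass


@dataclass
class PodState:
    name: str
    cached_prefix_tokens: int   # tokens of this prompt already in the pod's KV cache
    queued_requests: int        # rough proxy for current load


def score(pod: PodState, prompt_tokens: int,
          cache_weight: float = 1.0, load_weight: float = 0.2) -> float:
    """Higher is better: reward cache reuse, penalize busy pods."""
    cache_hit_ratio = pod.cached_prefix_tokens / max(prompt_tokens, 1)
    return cache_weight * cache_hit_ratio - load_weight * pod.queued_requests


def pick_pod(pods: list[PodState], prompt_tokens: int) -> PodState:
    return max(pods, key=lambda p: score(p, prompt_tokens))


if __name__ == "__main__":
    pods = [
        PodState("decode-0", cached_prefix_tokens=900, queued_requests=4),
        PodState("decode-1", cached_prefix_tokens=0, queued_requests=1),
    ]
    chosen = pick_pod(pods, prompt_tokens=1000)
    print(f"route to {chosen.name}")  # cache reuse outweighs the longer queue here
```

In the real gateway extension more signals come into play (hardware class, prefill versus decode role, service-level targets), but the trade-off is the same: prefer pods that already hold the prompt’s prefix in cache without piling work onto pods that are already busy.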
Benefits of llm-d: Faster and More Cost-Effective Inference
Early testing by Google Cloud demonstrated that llm-d can deliver up to a 2x improvement in time-to-first-token for use cases like code completion, resulting in more responsive applications. This performance gain is attributed to llm-d’s ability to efficiently manage stateful inference workloads, leveraging KV cache management and optimized orchestration of prefill/decode phases.
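For readers who want to sanity-check time-to-first-token on their own deployment, a minimal probe can time the first streamed chunk from an OpenAI-compatible completions endpoint, which vLLM-based stacks commonly expose; the endpoint URL and model name below are placeholders, not values defined by llm-d.

```python
# Minimal time-to-first-token (TTFT) probe against an OpenAI-compatible
# streaming endpoint. The URL and model name are placeholders for your
# own deployment.

import time

import requests

ENDPOINT = "http://my-inference-gateway/v1/completions"  # placeholder
PAYLOAD = {
    "model": "my-model",  # placeholder
    "prompt": "def quicksort(arr):",
    "max_tokens": 64,
    "stream": True,
}

start = time.perf_counter()
with requests.post(ENDPOINT, json=PAYLOAD, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # Server-sent events arrive as "data: {...}" lines; the first one
        # marks the arrival of the first generated token.
        if line and line.startswith(b"data: ") and line != b"data: [DONE]":
            ttft = time.perf_counter() - start
            print(f"time to first token: {ttft * 1000:.1f} ms")
            break
```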
llm-d also introduces prefix-cache-aware routing and prefill/decode disaggregation, enabling independent scaling of inference phases. It supports hierarchical cache offloading across GPU, CPU, and storage tiers, allowing larger context windows without overwhelming accelerator memory. Its traffic- and hardware-aware autoscaler adapts dynamically to workload patterns rather than scaling on basic utilization metrics alone.
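As a rough mental model of hierarchical cache offloading, the sketch below chains lookups from a small, fast tier to larger, slower ones and cascades evictions downward. The tier names, capacities, and eviction policy are illustrative assumptions; llm-d implements offloading inside the serving stack rather than in application code.

```python
# Toy illustration of hierarchical KV-cache offloading across tiers
# (GPU HBM -> host CPU memory -> local storage). Tier sizes and the
# LRU eviction policy are illustrative assumptions.

from collections import OrderedDict
from typing import Optional


class Tier:
    def __init__(self, name: str, capacity_blocks: int):
        self.name = name
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # key -> cached block bytes, in LRU order

    def get(self, key: str) -> Optional[bytes]:
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as recently used
            return self.blocks[key]
        return None

    def put(self, key: str, block: bytes) -> Optional[tuple]:
        """Insert a block; return an evicted (key, block) pair if the tier is full."""
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            return self.blocks.popitem(last=False)  # evict least recently used
        return None


class TieredKVCache:
    def __init__(self) -> None:
        # Small/fast to large/slow: accelerator memory, host memory, storage.
        self.tiers = [Tier("gpu", 4), Tier("cpu", 16), Tier("storage", 256)]

    def lookup(self, key: str) -> Optional[bytes]:
        for tier in self.tiers:
            block = tier.get(key)
            if block is not None:
                return block  # a real system would promote the block back up here
        return None

    def insert(self, key: str, block: bytes) -> None:
        evicted = self.tiers[0].put(key, block)
        # Cascade evictions downward instead of dropping cached prefixes.
        for lower in self.tiers[1:]:
            if evicted is None:
                break
            evicted = lower.put(*evicted)
```

The practical effect is that prompt prefixes pushed out of accelerator memory can still be reused from host memory or storage, which is what allows longer context windows without recomputing prompts from scratch.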
Integration with Emerging Kubernetes APIs
llm-d is designed to function seamlessly with emerging Kubernetes APIs, including the Gateway API Inference Extension (GAIE) and LeaderWorkerSet (LWS), further solidifying distributed inference as a first-class Kubernetes workload.
A “Well-Lit Path” to Production
Contributors describe llm-d as a “well-lit path” for organizations transitioning from experimentation to production. The framework offers reproducible benchmarks, validated deployment patterns, and compatibility across major accelerator families, from NVIDIA GPUs to Google TPUs and AMD/Intel hardware.
Future Development and the Vision for Cloud-Native AI
Looking ahead, development efforts will focus on expanding llm-d’s capabilities to support multi-modal workloads, multi-LoRA optimization with Hugging Face, and deeper integration with vLLM. Mistral AI is already contributing code to advance open standards around disaggregated serving.
IBM Research envisions llm-d as a pivotal component in standardizing the deployment and management of distributed inference, positioning the CNCF as the central hub for AI infrastructure. The goal is to create a common foundation stack that allows the ecosystem to focus on advancing AI rather than rebuilding basic infrastructure.