Red Hat announced the launch of llm-d, a new open source project that answers the most crucial need of generative AI's (gen AI) future: inference at scale. Building on breakthrough inference technologies for gen AI at scale, llm-d combines a native Kubernetes architecture, vLLM-based distributed inference and intelligent, AI-aware network routing, enabling robust large language model inference clouds to meet the most demanding production service-level objectives. While training remains vital, the true impact of gen AI hinges on more efficient and scalable inference, the engine that transforms AI models into actionable insights and user experiences.
KV (key-value) cache offloading, based on LMCache, shifts the memory burden of the KV cache from GPU memory to more cost-efficient and abundant standard storage, such as CPU memory or network storage. Kubernetes-powered clusters and controllers schedule compute and storage resources more efficiently as workload demands fluctuate, maintaining performance and lowering latency. AI-aware network routing directs incoming requests to the servers and accelerators most likely to hold hot caches of past inference calculations.
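To make the routing idea concrete, here is a deliberately simplified sketch (illustrative Python only, not llm-d's actual scheduler): each candidate server is scored by how much of the incoming prompt's token prefix it has recently served, and therefore how much KV cache it can likely reuse, with current load as a tie-breaker.

```python
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(request_tokens: list[int],
          recent_prompts_by_server: dict[str, list[list[int]]],
          load_by_server: dict[str, int]) -> str:
    """Pick the server whose recently served prompts overlap most with this
    request's prefix (its KV cache is most likely to be 'hot'), breaking
    ties in favor of the least-loaded server."""
    def score(server: str) -> tuple[int, int]:
        best_overlap = max(
            (shared_prefix_len(request_tokens, prompt)
             for prompt in recent_prompts_by_server[server]),
            default=0,
        )
        return (best_overlap, -load_by_server.get(server, 0))

    return max(recent_prompts_by_server, key=score)


# Server "a" recently served a prompt sharing a long prefix with the new
# request, so it wins despite equal load on both servers.
recent = {"a": [[1, 2, 3, 4, 5]], "b": [[9, 8, 7]]}
load = {"a": 3, "b": 3}
assert route([1, 2, 3, 4, 99], recent, load) == "a"
```

A production scheduler weighs many more signals than this, but the intuition is the same: send each request where the relevant KV cache is already warm.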
High-performance communication APIs enable faster and more efficient data transfer between servers, with support for the NVIDIA Inference Xfer Library (NIXL).

The new open source project has already garnered the support of a formidable coalition of leading gen AI model providers, AI accelerator pioneers, and premier AI cloud platforms. CoreWeave, Google Cloud, IBM Research and NVIDIA are founding contributors, with AMD, Cisco, Hugging Face, Intel, Lambda and Mistral AI as partners, underscoring the industry's deep collaboration to architect the future of large-scale LLM serving. The llm-d community is further joined by founding supporters at the Sky Computing Lab at the University of California, originators of vLLM, and the LMCache Lab at the University of Chicago, originators of LMCache.
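The KV cache offloading described above can be pictured with a similarly minimal sketch, assuming a small accelerator-resident tier backed by a larger CPU tier (a hypothetical illustration, not LMCache's actual interface): blocks evicted from the fast tier spill to the cheaper tier instead of being discarded, so later requests that share a prefix still avoid recomputation.

```python
from __future__ import annotations

from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier KV cache: a small, fast 'GPU' tier that spills evicted
    blocks to a larger 'CPU' tier rather than dropping them."""

    def __init__(self, gpu_capacity_blocks: int):
        self.gpu_capacity = gpu_capacity_blocks
        self.gpu: OrderedDict[str, bytes] = OrderedDict()  # block_id -> KV data
        self.cpu: dict[str, bytes] = {}                    # offload tier

    def put(self, block_id: str, kv_data: bytes) -> None:
        self.gpu[block_id] = kv_data
        self.gpu.move_to_end(block_id)
        # Offload least-recently-used blocks instead of discarding them.
        while len(self.gpu) > self.gpu_capacity:
            evicted_id, evicted_data = self.gpu.popitem(last=False)
            self.cpu[evicted_id] = evicted_data

    def get(self, block_id: str) -> bytes | None:
        if block_id in self.gpu:            # hot hit on the accelerator
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.cpu:            # warm hit: promote from the offload tier
            data = self.cpu.pop(block_id)
            self.put(block_id, data)
            return data
        return None                         # miss: the KV must be recomputed


cache = TieredKVCache(gpu_capacity_blocks=2)
for i in range(4):
    cache.put(f"block-{i}", f"kv-{i}".encode())
assert cache.get("block-0") == b"kv-0"  # served from the CPU tier, not recomputed
```

In practice the offload tier can also be network-attached storage shared across servers, which is what makes the cached KV state reusable beyond a single machine.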