NVIDIA Dynamo Planner Brings SLO-Driven Automation to Multi-Node LLM Inference

Microsoft and NVIDIA have released Part 2 of their collaboration on running NVIDIA Dynamo for large language model inference on Azure Kubernetes Service (AKS). The first announcement focused on raw throughput, targeting 1.2 million tokens per second on distributed GPU systems; this latest release shifts the emphasis to developer velocity and operational efficiency through automated resource planning and dynamic scaling.
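To give a rough sense of what SLO-driven scaling means in practice, consider the minimal sketch below. It is not the actual Dynamo Planner API; all class names, metric fields, and thresholds are hypothetical. The idea it illustrates is standard in disaggregated serving: a time-to-first-token (TTFT) violation points at the prefill pool, while an inter-token-latency (ITL) violation points at the decode pool.

```python
# Minimal sketch of an SLO-driven scaling decision.
# Hypothetical names and thresholds; not the actual Dynamo Planner API.
from dataclasses import dataclass


@dataclass
class SloTargets:
    ttft_ms: float  # target time-to-first-token (dominated by prefill)
    itl_ms: float   # target inter-token latency (dominated by decode)


@dataclass
class ObservedMetrics:
    p95_ttft_ms: float
    p95_itl_ms: float


def scale_decision(slo: SloTargets, obs: ObservedMetrics) -> str:
    """Decide which worker pool to grow when an SLO is being violated."""
    if obs.p95_ttft_ms > slo.ttft_ms:
        return "scale up prefill workers"  # prompt processing is the bottleneck
    if obs.p95_itl_ms > slo.itl_ms:
        return "scale up decode workers"   # token generation is the bottleneck
    return "hold"                          # both SLOs met; no change needed


print(scale_decision(SloTargets(ttft_ms=300, itl_ms=50),
                     ObservedMetrics(p95_ttft_ms=420, p95_itl_ms=35)))
# -> scale up prefill workers
```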

The new capabilities center on two integrated components: the Dynamo Planner Profiler and the SLO-based Dynamo Planner. Together they address the “rate matching” challenge in disaggregated serving, the teams' term for splitting inference workloads into separate pools: prefill operations, which process the input prompt and are compute-bound, and decode operations, which generate output tokens one at a time and are bound by memory bandwidth. Because the two phases scale differently, the pools must be provisioned so that neither stage outruns or starves the other.
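A back-of-the-envelope calculation makes the rate-matching idea concrete. Every figure below is invented for illustration (they are not measured Dynamo numbers): given per-replica throughputs for each phase and an expected workload shape, each pool must absorb the token rate the workload generates for it.

```python
import math

# Back-of-the-envelope rate matching for disaggregated serving.
# All figures are illustrative assumptions, not measured Dynamo numbers.
request_rate = 200.0     # incoming requests per second
prompt_tokens = 1024     # average input length (prefill work per request)
output_tokens = 256      # average output length (decode work per request)

prefill_tput = 60_000.0  # prefill tokens/sec one prefill replica sustains
decode_tput = 12_000.0   # decode tokens/sec one decode replica sustains

# Each pool must keep up with the token rate directed at its phase.
prefill_replicas = math.ceil(request_rate * prompt_tokens / prefill_tput)
decode_replicas = math.ceil(request_rate * output_tokens / decode_tput)

print(f"prefill replicas: {prefill_replicas}")  # ceil(204800/60000) = 4
print(f"decode replicas:  {decode_replicas}")   # ceil(51200/12000)  = 5
```

Note that the two pool sizes come out different, and they shift whenever traffic or prompt/output lengths shift. Scaling both pools by the same factor, or scaling only one, would either leave decode workers idle or let prefill queues grow; keeping these rates matched as the workload changes is the problem the Planner is described as automating.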
