Anyscale on Azure: Scaling AI Workloads with Kubernetes and Enhanced Security
Microsoft and Anyscale are deepening their partnership to bring the power of Ray, an open-source distributed compute framework, to Azure customers. The collaboration, formalized with the launch of Anyscale on Azure, aims to simplify the building, deployment, and scaling of AI-native workloads within the secure Azure infrastructure. This integration addresses key challenges in modern AI development, including GPU capacity limitations, data management complexities, and credential security.
Addressing the Challenges of AI at Scale
As organizations increasingly adopt an AI-first approach, legacy infrastructure often struggles to meet the demands of modern AI workloads: massive model sizes, diverse data modalities, and the intensive computational requirements of training and inference. Ray addresses these demands with distributed compute, but it requires a robust platform for efficient management and productionization. Anyscale on Azure is designed to fill this gap.
Key Features and Benefits of Anyscale on Azure
- Integrated Azure Portal Experience: Provisioning and management of Anyscale services are directly integrated into the Azure Portal, with billing leveraging existing Azure commitments.
- Scalability and Performance: Leveraging Ray’s capabilities, Anyscale on Azure enables scaling AI workloads across thousands of nodes.
- Enhanced Security: Integration with Azure’s security frameworks, including Microsoft Entra ID for authentication, ensures a secure environment for AI development and deployment.
- Kubernetes-Based Infrastructure: Anyscale-managed Ray workloads run on Azure Kubernetes Service (AKS), providing a flexible and scalable foundation.
Overcoming Operational Hurdles
The Anyscale on Azure integration addresses several key operational challenges:
GPU Capacity Limits
The scarcity of GPUs is a significant bottleneck in large-scale machine learning. To mitigate this, Microsoft recommends a multi-cluster, multi-region setup. By distributing Ray clusters across multiple AKS instances in different Azure regions, teams can aggregate GPU quota beyond regional limits, automatically reroute workloads during outages, and even extend compute to on-premises systems or other cloud providers using Azure Arc with AKS. The Anyscale console provides a unified view of these registered clusters.
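The routing idea behind this setup can be sketched in a few lines: try each registered cluster in order and fall back when a region is out of quota or unhealthy. This is a hypothetical illustration only; the cluster names, regions, and quota numbers are made up, and in practice the Anyscale console and Azure tooling manage this for registered clusters.

```python
# Hypothetical sketch of multi-cluster GPU failover across regions.
# All cluster names and quota numbers below are invented for illustration.
from dataclasses import dataclass

@dataclass
class RayCluster:
    name: str
    region: str
    free_gpus: int
    healthy: bool = True

def pick_cluster(clusters, gpus_needed):
    """Return the first healthy cluster with enough free GPUs, or None."""
    for c in clusters:
        if c.healthy and c.free_gpus >= gpus_needed:
            return c
    return None

clusters = [
    RayCluster("aks-eastus", "eastus", free_gpus=0),                    # quota exhausted
    RayCluster("aks-westus3", "westus3", free_gpus=16, healthy=False),  # regional outage
    RayCluster("aks-westeurope", "westeurope", free_gpus=8),
]

target = pick_cluster(clusters, gpus_needed=4)
print(target.name)  # -> aks-westeurope
```

The same selection loop also covers the outage case described above: an unhealthy cluster is skipped exactly like one with no free quota, so workloads reroute without special-casing failures.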
Data Management
Efficient data transfer is crucial for ML operations. Anyscale on Azure utilizes Azure BlobFuse2, mounting Azure Blob Storage into Ray worker pods as a POSIX-compatible filesystem. This allows Ray tasks and actors to read and write data using standard file I/O, with local caching to prevent GPU stalls and data decoupling for scalable cluster operations.
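Because BlobFuse2 presents Blob Storage as a POSIX filesystem, a Ray task needs no storage SDK; ordinary `open()` calls suffice. The sketch below, a minimal illustration rather than Anyscale's actual code, uses a temporary directory to stand in for the mount point (the real path depends on your mount configuration), and notes where `@ray.remote` would apply on a live cluster.

```python
# Sketch: standard file I/O against a BlobFuse2-style mount. A temporary
# directory stands in for the mount point so the example runs anywhere.
import os
import tempfile

MOUNT_POINT = tempfile.mkdtemp()  # stand-in for the BlobFuse2 mount path

# On a real cluster this function would be decorated with @ray.remote so
# Ray schedules it on a worker pod where the blob mount is present.
def preprocess_shard(shard_name: str) -> str:
    in_path = os.path.join(MOUNT_POINT, shard_name)
    out_path = os.path.join(MOUNT_POINT, shard_name + ".upper")
    with open(in_path) as f:           # standard POSIX read
        data = f.read()
    with open(out_path, "w") as f:     # standard POSIX write
        f.write(data.upper())
    return out_path

# Simulate a data shard landing in blob storage, then process it.
with open(os.path.join(MOUNT_POINT, "shard-000.txt"), "w") as f:
    f.write("hello ray")

result_path = preprocess_shard("shard-000.txt")
print(open(result_path).read())  # -> HELLO RAY
```

The point of the POSIX abstraction is exactly this: the task body contains no cloud-specific calls, so the same code runs against local disk in development and against Blob Storage in production.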
Credential Management
Previous integration methods relied on CLI tokens or API keys with 30-day expiration periods, requiring manual rotation. The new approach leverages Microsoft Entra service principals and AKS workload identity, issuing short-lived tokens automatically. This eliminates the need for long-lived credentials and manual rotation, particularly beneficial in multi-cluster environments. This also provides fine-grained role-based access control (RBAC) and audit trails through Azure Activity Logs.
Industry-Wide Adoption of Kubernetes and Ray
Microsoft is not alone in embracing Ray. Amazon Web Services (AWS) also announced a partnership with Anyscale at Ray Summit 2024, connecting EKS clusters to the RayTurbo runtime. Google Cloud has contributed to open-source Ray development, including label-based scheduling and resource allocation for TPU setups. This widespread adoption indicates a growing industry preference for Kubernetes-plus-Ray as the foundation for AI workloads.
Availability
Anyscale on Azure is currently in private preview. Interested teams can request access through their Microsoft account team or by filing a request on the AKS GitHub repository, providing details about their Ray workloads and target regions. Example setups and workloads, including fine-tuning with DeepSpeed and LLaMA-Factory, are available in the Azure-Samples/aks-anyscale repository on GitHub.
The post AKS & Anyscale Ray: Scaling AI/ML with GPU, Storage & Authentication Solutions appeared first on Archynewsy.