Enabling lane detection at scale with NetApp, Run:AI, and Microsoft Azure
Today's automotive leaders are investing heavily in data-driven software applications to advance the most important innovations in autonomous and connected vehicles, mobility, and manufacturing. To run distributed training of deep learning models on GPUs, these applications need an orchestration solution and a shared file system for their massive datasets. Training AI models in the automotive industry involves very large numbers of 2D color images, each represented as a 3D matrix (height, width, and color channel). These images are analyzed at the pixel and color (RGB) level to detect objects such as pedestrians, other cars, and traffic lights.
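To make the image representation concrete, here is a minimal Python sketch of a 2D color image stored as a 3D structure (rows, pixels, RGB values). The tiny 2x3 image and the helper names shape and channel_means are made up for illustration; a real pipeline would typically use an array library rather than nested lists.

```python
# A tiny 2x3 RGB image: rows -> pixels -> (R, G, B) values in 0..255.
# The pixel values are invented purely for illustration.
image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(255, 255, 255), (0, 0, 0), (128, 128, 128)],
]


def shape(img):
    """Return (height, width, channels) of a nested-list image."""
    return (len(img), len(img[0]), len(img[0][0]))


def channel_means(img):
    """Average each color channel (R, G, B) over all pixels."""
    pixels = [px for row in img for px in row]
    return tuple(sum(px[c] for px in pixels) / len(pixels) for c in range(3))


print(shape(image))          # height, width, channels
print(channel_means(image))  # mean R, mean G, mean B
```

The third dimension is what lets a model look at each pixel's red, green, and blue values separately.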
GPUs need to be kept at high utilization to reduce training times, permit fast experimentation, and minimize the cost of usage. In addition, a high-performance, easy-to-use file system that prevents GPUs from waiting for data (a condition known as "GPU starvation") is imperative for accelerating model training in the cloud and optimizing cost.
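The link between data wait time, utilization, and cost can be sketched with simple arithmetic. The model below is a deliberate simplification: it assumes each training step is compute time plus data wait time with no overlap (no prefetching), and all step times and the hourly rate are hypothetical numbers, not measurements from this project.

```python
def gpu_utilization(compute_s, data_wait_s):
    """Fraction of each step the GPU spends computing rather than waiting for data."""
    total = compute_s + data_wait_s
    if total <= 0:
        raise ValueError("step time must be positive")
    return compute_s / total


def epoch_cost(steps, compute_s, data_wait_s, gpu_hourly_rate):
    """Cost of one epoch on one GPU when each step takes compute + wait time."""
    hours = steps * (compute_s + data_wait_s) / 3600
    return hours * gpu_hourly_rate


# Hypothetical numbers: 0.19 s of compute per step, billed at 3.0 per GPU-hour.
fast_storage = gpu_utilization(compute_s=0.19, data_wait_s=0.01)
slow_storage = gpu_utilization(compute_s=0.19, data_wait_s=0.19)
print(f"fast storage utilization: {fast_storage:.0%}")
print(f"slow storage utilization: {slow_storage:.0%}")
print(f"epoch cost, fast: {epoch_cost(10_000, 0.19, 0.01, 3.0):.2f}")
print(f"epoch cost, slow: {epoch_cost(10_000, 0.19, 0.19, 3.0):.2f}")
```

Under these assumptions, every second a GPU spends waiting on storage is billed at the same rate as a second of useful compute, which is why starvation shows up directly in the cost.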
Run:AI, Microsoft, and NetApp have teamed up to address a lane-detection use case by building a distributed deep learning training solution that runs at scale in the Azure cloud. This solution enables data scientists to fully embrace the Azure cloud scaling capabilities and cost benefits for automotive use cases.
How we set up our deep learning model training
Here are the tools we used, and how we used them:
Azure NetApp Files provided high-performance, low-latency, scalable storage, with data management through NetApp® Snapshot™ copies, cloning, and replication.
Azure Kubernetes Service (AKS) simplified deploying and orchestrating a managed Kubernetes cluster in Azure.
Azure compute SKUs with GPUs provided specialized VMs with single or multiple GPUs.
Run:AI enabled pooling of GPUs into two logical environments: one for build and one for training workloads. A scheduler manages the compute requests that come from data scientists, enabling elastic scaling from fractions of GPU to multiple GPUs and multiple GPU nodes. The Run:AI platform is built on top of Kubernetes, enabling simple integration with existing IT and data science workflows.
NetApp Trident integrates natively with AKS and its Persistent Volume framework and was used to seamlessly provision and manage volumes backed by Azure NetApp Files.
Finally, we did machine learning (ML) versioning by using Azure NetApp Files Snapshot technology combined with Run:AI. This combination preserved data lineage and allowed data scientists and data engineers to collaborate and share data with their colleagues.
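As an illustration of how a volume request flows through the Kubernetes Persistent Volume framework mentioned above, the following Python sketch builds a PersistentVolumeClaim manifest and prints it as JSON, which kubectl accepts alongside YAML. The claim name, the storage class name "azure-netapp-files", and the size are hypothetical placeholders, not values from this project; the actual storage class depends on how Trident is configured in a given cluster.

```python
import json


def dataset_pvc(name, storage_class, size):
    """Build a PersistentVolumeClaim manifest for a shared training dataset."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets pods on many GPU nodes mount the same volume.
            "accessModes": ["ReadWriteMany"],
            # Hypothetical storage class; assumed to be served by Trident.
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# All three arguments are placeholder values chosen for illustration.
manifest = dataset_pvc("lane-dataset", "azure-netapp-files", "100Gi")
print(json.dumps(manifest, indent=2))
```

Requesting the volume through a claim like this keeps the storage details out of the training job definition, so the same job can run against a different dataset by pointing at a different claim.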
What we found
By working with Run:AI, Azure, and NetApp technology, we enabled distributed computations in the cloud, creating a high-performing distributed training system. The system worked with tens of GPUs that communicated simultaneously in a mesh-like architecture. To optimize cost, we were able to keep them fully occupied at about 95% to 100% utilization.
We were able to saturate GPU utilization and keep the GPU cycles as short as possible. (GPUs are one of the highest-cost components in the architecture.) Azure NetApp Files provides various performance tiers that guarantee sustained throughput at submillisecond latency. We started our distributed training job on a small GPU cluster. Later, we added GPUs to the cluster on demand without interrupting the training, using the dynamic service level change capabilities of Run:AI software to provide optimal GPU utilization.
Different data science and data engineering teams were able to use the same dataset for different projects. One team was able to work on lane detection, while another team worked on a different object detection task using the same dataset. Researchers and engineers were able to allocate volumes on demand.
We had full visibility of the AI infrastructure. Run:AI's platform showed all pooled GPUs at the job, project, cluster, and node levels.
Looking to get started?
In this use case, lane detection for autonomous vehicles, we were able to use NetApp, Run:AI, and Azure to create a single, unified experience for accelerating model training in the cloud, reducing costs while improving training times and simplifying processes for data scientists and engineers. Details are available in this technical report and apply to model training across industries and verticals.