In partnership with Global Wildlife Conservation

Methodology and Results

Model Training and Experiment with Kubeflow

Kubeflow and Kubernetes have become standard toolkits in industry, allowing data scientists to train, package, and deploy machine learning models in a portable, scalable, and efficient way. These tools are powerful and flexible enough to accommodate the complexities of applying models to geospatial aerial imagery. With them, models can be deployed to any cloud computing environment, including Microsoft Azure and Google Cloud Platform (GCP). Kubeflow was originally developed by Google, and we found its documentation for GCP to be more up to date than its documentation for Azure. As a result, deploying Kubeflow and TFJob to Azure requires more workarounds than deploying them to GCP. As follow-up work, we will make these same workflows available on Azure Machine Learning as well.

Figure 7: Model training and experimentation are conceptualized in the diagram shown above. Model scripts were containerized and registered on Azure (or GCP). We then deployed Kubeflow to the cloud environment. Once Kubeflow was running on AKS (or GKE), TFJob model experiments were deployed to start model training on GPU machines. In general, we used K80, P100, and T4 GPUs. Model evaluation on the validation set selects the best-performing trained model from multiple experiments.
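The selection step at the end of this workflow can be illustrated with a minimal Python sketch. It assumes each experiment reports a single validation score where higher is better. The `ExperimentResult` type, the experiment names, and the scores are hypothetical placeholders for illustration, not results from this project.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ExperimentResult:
    name: str         # hypothetical experiment label
    gpu_type: str     # GPU used for the run, e.g. "K80", "P100", "T4"
    val_score: float  # assumed validation metric, higher is better


def select_best(results: Sequence[ExperimentResult]) -> ExperimentResult:
    """Return the experiment with the highest validation score."""
    if not results:
        raise ValueError("no experiment results to compare")
    return max(results, key=lambda r: r.val_score)


# Placeholder values for illustration only; not measurements from this project.
example_results = [
    ExperimentResult("exp-a", "K80", 0.71),
    ExperimentResult("exp-b", "P100", 0.78),
    ExperimentResult("exp-c", "T4", 0.75),
]

best = select_best(example_results)
print(f"best experiment: {best.name} ({best.gpu_type})")
```

In practice the comparison metric would be whatever the validation step reports for the classifier or detector being trained.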

Training models on Microsoft Azure requires a few steps:

  • Install and set up the Kubernetes CLI, kubectl¹, on your local machine.
  • Install the Azure CLI and log in with your credentials. The Microsoft AI for Earth program provided Azure cloud credits for this project.
  • Create a resource group and an Azure Kubernetes Service (AKS) cluster on Azure. Our Kubeflow model training and experiments were deployed to AKS. AKS provides continuous integration and continuous delivery (CI/CD), as well as enterprise-grade security and governance on Azure². A GPU node pool can be added to the AKS cluster for training both the AIAIA Classifier and the AIAIA Detectors.
  • Set up Azure Container Registry (ACR). Model training scripts can be containerized and pushed to ACR, where AKS can pull them later when the model is deployed and ready to be trained on the AKS GPU node pool.
  • Set up and deploy Kubeflow.
Figure 8: Kubeflow deployment to Kubernetes.
  • Store and host the training dataset, pretrained model weights, and model configuration files on Azure Blob Storage.
  • Set up the TFJob YAML file for model deployment.
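The Azure-side steps in the list above can be sketched as a dry-run plan of Azure CLI commands. This is a minimal sketch, not the project's actual setup script: the resource group, region, and cluster names are hypothetical, the registry name is taken from the image reference used in this report, and the GPU VM size is an assumption that should be replaced with a size available in your subscription. The Kubeflow installation and the Blob Storage upload are left out because the report does not specify how they were performed.

```python
import shlex
from typing import List


def azure_setup_commands(
    resource_group: str,
    location: str,
    cluster: str,
    registry: str,
    gpu_pool: str = "gpupool",           # hypothetical node pool name
    gpu_vm_size: str = "Standard_NC6",   # assumption: a GPU VM size in your subscription
) -> List[List[str]]:
    """Return the Azure CLI commands for the setup steps, in order, as argument lists."""
    rg = ["--resource-group", resource_group]
    return [
        ["az", "login"],
        ["az", "group", "create", "--name", resource_group, "--location", location],
        ["az", "aks", "create", *rg, "--name", cluster,
         "--node-count", "1", "--generate-ssh-keys"],
        ["az", "aks", "nodepool", "add", *rg, "--cluster-name", cluster,
         "--name", gpu_pool, "--node-count", "1", "--node-vm-size", gpu_vm_size],
        ["az", "acr", "create", *rg, "--name", registry, "--sku", "Basic"],
        # Let the cluster pull images from the registry.
        ["az", "aks", "update", *rg, "--name", cluster, "--attach-acr", registry],
        # Point the local kubectl at the new cluster.
        ["az", "aks", "get-credentials", *rg, "--name", cluster],
    ]


# Hypothetical names; only the registry name appears in this report.
plan = azure_setup_commands("aiaia-rg", "eastus", "aiaia-aks", "geoyiacr")
for cmd in plan:
    print(shlex.join(cmd))  # dry run: print each command instead of executing it
```

To execute the plan rather than print it, each argument list could be passed to `subprocess.run(cmd, check=True)` on a machine where the Azure CLI is installed.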
Figure 9: The TFJob YAML file is structured as above. The container, named “tensorflow”, is deployed to “kubeflow” and runs a containerized model training pipeline, “geoyiacr.azurecr.io//aiaia:v1.1-tf1.15-gpu”.
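Because the figure itself is not reproduced here, the following Python sketch builds a manifest with the general shape of a TFJob and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. It is an illustration under stated assumptions, not the project's actual file: the `kubeflow.org/v1` API version, the job name, the single worker replica, the restart policy, and the one-GPU limit are all assumptions. The container name, the target namespace, and the image reference are copied from the caption.

```python
import json
from typing import Any, Dict


def build_tfjob(
    name: str,
    image: str,
    gpus: int = 1,                 # assumption: one GPU per worker
    workers: int = 1,              # assumption: a single worker replica
    namespace: str = "kubeflow",   # namespace named in the caption
) -> Dict[str, Any]:
    """Build a TFJob manifest as a plain dictionary."""
    container = {
        "name": "tensorflow",  # container name given in the caption
        "image": image,
        "resources": {"limits": {"nvidia.com/gpu": gpus}},
    }
    return {
        "apiVersion": "kubeflow.org/v1",  # assumption: TFJob v1 API
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "OnFailure",  # assumption
                    "template": {"spec": {"containers": [container]}},
                }
            }
        },
    }


# Image reference copied verbatim from the caption; the job name is hypothetical.
IMAGE = "geoyiacr.azurecr.io//aiaia:v1.1-tf1.15-gpu"
manifest = build_tfjob("aiaia-training", IMAGE)
print(json.dumps(manifest, indent=2))
```

Saving the printed output to a file such as `tfjob.json` and running `kubectl apply -f tfjob.json` would submit the job to a cluster where Kubeflow's TFJob operator is installed.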

__________________________________

1 "Install and Set Up kubectl | Kubernetes." 27 Nov. 2020, https://kubernetes.io/docs/tasks/tools/install-kubectl/. Accessed 27 Jan. 2021.
2 "Azure Kubernetes Service (AKS) | Microsoft Azure." https://azure.microsoft.com/en-us/services/kubernetes-service/. Accessed 27 Jan. 2021.