
Methodology and Results

AIAIA Classifier

Model performance

Figure 12: Model evaluation metrics for the AIAIA Classifier, including model loss and the F1, recall, and precision scores. The F1, recall, and precision scores are all above 0.84.

The AIAIA Classifier was trained for 6,000 steps, which took about 15 hours and 15 minutes on a single NVIDIA K80 GPU. Model performance stabilized after 4,000 steps (10 hours), when the model's F1-beta, recall, and precision scores reached 0.84 on the validation dataset while the loss converged down to 0.35 (Figure 12; see also the online TensorBoard).

The binary classification model training and experiments can be tracked through our online TensorBoard here. TensorBoard9 is a visualization toolkit for TensorFlow models that tracks and visualizes model evaluation metrics such as loss, accuracy, and other customized metrics. Instead of the default metrics provided by TensorFlow and Keras, we tracked F1-beta, recall, and precision scores, together with a Sigmoid Focal Loss, as sketched below.
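A minimal sketch of how such custom metrics can be wired into a Keras model, using TensorFlow Addons for the focal loss and F-beta score. The model body and chip size here are illustrative assumptions, not the actual AIAIA architecture; only the loss and metric choices come from the text above:

```python
import tensorflow as tf
import tensorflow_addons as tfa

# Hypothetical stand-in for the AIAIA Classifier body; only the loss
# and metric choices below are taken from the report.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(400, 400, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer="adam",
    # Sigmoid Focal Loss down-weights easy examples, which helps when
    # "Object" chips are rare relative to "Not-object" chips.
    loss=tfa.losses.SigmoidFocalCrossEntropy(),
    metrics=[
        tf.keras.metrics.Precision(name="precision"),
        tf.keras.metrics.Recall(name="recall"),
        tfa.metrics.FBetaScore(num_classes=1, average="micro",
                               beta=1.0, threshold=0.5),
    ],
)

# The TensorBoard callback logs these metrics for the dashboard.
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir="logs")
```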

Figure 13: Model evaluation for the AIAIA Classifier, a binary image classification model that distinguishes whether an image chip contains objects of interest.

Each aerial image (6016 x 4000 pixels) was gridded into 150 chips (see the chipping sketch after Table 2). The overall classifier model performance can be summarized as:


Table 2: The overall model performance metrics for the AIAIA Classifier.
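The gridding mentioned above is straightforward: a 6016 x 4000 frame cut into 400 x 400 chips yields a 15 x 10 grid, i.e. 150 chips. Below is a minimal sketch; it assumes the 16-pixel remainder on the long edge is simply cropped, which the report does not specify:

```python
import numpy as np

def grid_chips(image: np.ndarray, chip_size: int = 400) -> list:
    """Cut an aerial frame (H x W x C) into chip_size x chip_size chips.

    A 4000 x 6016 frame yields a 10 x 15 grid = 150 chips; the 16-px
    remainder on the right edge is cropped here (our assumption).
    """
    h, w = image.shape[:2]
    chips = []
    for y in range(0, h - chip_size + 1, chip_size):
        for x in range(0, w - chip_size + 1, chip_size):
            chips.append(image[y:y + chip_size, x:x + chip_size])
    return chips

# One survey frame -> 150 chips
frame = np.zeros((4000, 6016, 3), dtype=np.uint8)
assert len(grid_chips(frame)) == 150
```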

A confidence score threshold of 0.5 was used to determine whether a prediction counts as a true positive or a false positive. From the overall model performance metrics (Table 2) and the model evaluation (Figure 13), we can draw a few conclusions about the AIAIA Classifier:

  • When the confidence score threshold for the “Object” class is set higher, the model better distinguishes true positives from true negatives (Figure 13. A);
  • When the confidence score threshold for the “Object” class is set to 1.0, the false positive rate drops to only 0.7% (Figure 13. B);
  • The recall for the “Object” class is much lower than for the “Not-object” class, while the precision scores are reversed. This implies the model produced many more false negatives (ground-truth objects that were never detected), which usually means the training data quality for the “Object” class is lower. The sketch below illustrates this threshold trade-off.
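To make the threshold trade-off concrete, the toy helper below computes the false positive rate and recall of the “Object” class at different confidence thresholds. The function name and sample scores are ours, not project code:

```python
import numpy as np

def rates_at_threshold(y_true, scores, threshold):
    """False positive rate and recall for the "Object" class
    when predictions with score >= threshold count as "Object"."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    tn = np.sum(~pred & (y_true == 0))
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return fpr, recall

# Raising the threshold lowers the false positive rate at the
# cost of recall, as observed in Figure 13.
y_true = np.array([1, 0, 1, 0, 0, 1])
scores = np.array([0.90, 0.20, 0.60, 0.55, 0.10, 0.97])
for t in (0.5, 0.9, 1.0):
    print(t, rates_at_threshold(y_true, scores, t))
```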

Model Inference

Figure 14: chip-n-scale-queue-arranger helps you run containerized machine learning models over images at scale. It is a collection of AWS CloudFormation templates deployed by kes, Lambda functions, and utility scripts for monitoring and managing the project.

During model inference, we scanned 5,506,337 image chips at an average speed of 12,000 image chips per minute, finishing the inference in 7.6 hours. The inference was run with Chip n Scale (Figure 14)10, Development Seed’s open-sourced model inference tool. You can find the Lambda function here. At the end of the inference, 150,141 image chips were filtered out of the 5.5 million as containing “objects of interest”, only 2.7% of the original volume; a minimal sketch of this filtering step follows.
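The sketch below assumes the per-chip predictions were exported as (chip_id, object_score) rows; the actual storage backend and schema used by Chip n Scale are not shown here:

```python
import csv

THRESHOLD = 0.5  # "objects of interest" cutoff (our assumption)

# predictions.csv is a hypothetical export with columns:
# chip_id, object_score
with open("predictions.csv") as src, \
     open("object_chips.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow(["chip_id", "object_score"])
    kept = 0
    for row in reader:
        if float(row["object_score"]) >= THRESHOLD:
            writer.writerow([row["chip_id"], row["object_score"]])
            kept += 1

# In the survey above, 150,141 of 5,506,337 chips (~2.7%) pass.
print(kept)
```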

Discussion and Conclusion

In the past, a survey in Tsavo National Park in Kenya collected 81,000 images (12.15 million image chips at 400 x 400 pixels). With a team of 8 human annotators, it took 7 months to count and validate all the objects of interest. A total of 3,600 hours were spent, which means the human annotators validated and counted objects at a speed of roughly 56 image chips per minute. The AIAIA Classifier, by contrast, runs inference at 12,000 image chips per minute11. AI-assisted scanning can therefore filter the image chips that contain objects of interest quickly and cost-efficiently, which will be cost-saving in the long run; the quick calculation below checks these numbers.
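All figures in this back-of-the-envelope check come from the survey numbers cited above:

```python
# Back-of-the-envelope check of the throughput comparison.
chips = 12_150_000                        # Tsavo survey image chips
human_hours = 3600                        # 8 annotators over 7 months
human_rate = chips / (human_hours * 60)   # chips per minute
model_rate = 12_000                       # chips per minute (measured)

print(human_rate)                   # ~56.25
print(model_rate / human_rate)      # ~213x; the footnote's 214x rounds to 56 chips/min
print(4 * model_rate / human_rate)  # ~853x assuming 6h human vs 24h compute days
```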

The AIAIA Classifier is not perfect, because of the training data quality issues mentioned above. The classifier’s performance is strongly affected by missing labels, which can mark an image chip as “Not-object” when it actually contains an object. With improvements to training label quality, the AIAIA Classifier will be even more promising as a tool to quickly scan images after an aerial survey.

__________________________________

9 "TensorBoard | TensorFlow." https://www.tensorflow.org/tensorboard. Accessed 28 Jan. 2021.
10 "developmentseed/chip-n-scale-queue-arranger: Chip 'n ... - GitHub." https://github.com/developmentseed/chip-n-scale-queue-arranger. Accessed 28 Jan. 2021.
11 The classifier is thus about 214x faster. However, given that humans in lab settings typically do around 6 hours of concentrated work per day while computers can work 24, the effective rate is 4 x 214 = 856x faster for larger datasets.