In partnership with Global Wildlife Conservation

Data

Training Data Creation

Figure 5 shows a simplified version of the training dataset creation workflow using CVAT. A group of volunteers in Tanzania imported the full-size aerial images into CVAT, each 6016 x 4000 pixels with a spatial resolution of 2-4 cm per pixel. The annotators used CVAT to draw bounding boxes around objects of interest and added a label class to each box. Once a task was finished, we exported the labeled bounding boxes as XML files for visual inspection and training data quality analysis, and converted them into machine-learning-ready datasets in TFRecords format. TFRecords is a data format that stores a sequence of binary records so that TensorFlow can read image and label data efficiently during model training (see TFRecord and tf.train.Example).
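
The export-and-convert step can be sketched with the standard library. The element and attribute names (`image`, `box`, `xtl`, `ytl`, `xbr`, `ybr`) follow CVAT's "CVAT for images" XML export, while the image name and boxes in the sample are hypothetical:

```python
import xml.etree.ElementTree as ET

# A tiny CVAT-style "CVAT for images" export, inlined here for illustration;
# the real exports were full XML files, one per annotation task.
CVAT_XML = """
<annotations>
  <image id="0" name="survey_0001.jpg" width="6016" height="4000">
    <box label="elephant" xtl="120.5" ytl="300.0" xbr="180.5" ybr="360.0"/>
    <box label="cow" xtl="900.0" ytl="1200.0" xbr="940.0" ybr="1250.0"/>
  </image>
</annotations>
"""

def parse_cvat_boxes(xml_text):
    """Return {image_name: [(label, xmin, ymin, xmax, ymax), ...]}."""
    root = ET.fromstring(xml_text)
    boxes = {}
    for image in root.iter("image"):
        rows = []
        for box in image.iter("box"):
            rows.append((
                box.get("label"),
                float(box.get("xtl")), float(box.get("ytl")),
                float(box.get("xbr")), float(box.get("ybr")),
            ))
        boxes[image.get("name")] = rows
    return boxes

boxes = parse_cvat_boxes(CVAT_XML)
print(boxes["survey_0001.jpg"][0][0])  # -> elephant
```

A parsed structure like this can be visually inspected, summarized for quality analysis, and then serialized into TFRecords.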

Figure 5: Training dataset annotation workflow using CVAT. Aerial imagery was annotated by a group of volunteers in Tanzania. The annotators created 30 classes of labels covering wildlife, livestock, and human activities. The final labeled data were tiled/chipped and converted into TFRecords format as machine-learning-ready data for the subsequent model training.
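
The tiling/chipping step mentioned in the caption can be sketched as follows. The 400 px chip size matches the chips described later in this section; the rule of assigning each box to the chip containing its center is an illustrative assumption, not necessarily the exact rule used in the pipeline:

```python
CHIP = 400  # chip size in pixels, matching the 400 x 400 training chips

def chip_index(x, y, chip=CHIP):
    """Map a pixel coordinate to the (column, row) of the chip containing it."""
    return int(x) // chip, int(y) // chip

def assign_boxes_to_chips(boxes, chip=CHIP):
    """Assign each full-image box to a chip by its center point and convert
    its coordinates to chip-relative pixels.
    Returns {(col, row): [(label, xmin, ymin, xmax, ymax), ...]}."""
    chips = {}
    for label, xmin, ymin, xmax, ymax in boxes:
        cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
        col, row = chip_index(cx, cy, chip)
        ox, oy = col * chip, row * chip  # chip origin in full-image pixels
        chips.setdefault((col, row), []).append(
            (label, xmin - ox, ymin - oy, xmax - ox, ymax - oy))
    return chips

# A 6016 x 4000 survey image yields 10 rows of 400 px chips and 15 full
# columns plus a 16-px remainder (padded or dropped in practice).
boxes = [("elephant", 120.5, 300.0, 180.5, 360.0)]
print(assign_boxes_to_chips(boxes))  # this box lands in chip (0, 0)
```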

Two iterations of training datasets were created during the summer and fall of 2020 by the TZCRC Annotation Lab. The first iteration of the training dataset (created in the summer) was used to train a single 30-class object detection model. Because that training dataset was not of high quality, and the number of labels was insufficient for some rare classes, the first iteration of the object detection model could only detect the 9 classes that had the most training data and were easiest to visually distinguish from the background. For the second training dataset, the volunteer annotators in the lab were able to create a higher-quality training dataset, including:

  • Fewer missing labels
  • Fewer mis-labeled objects
  • Fewer label duplications
  • Fewer bounding boxes drawn around groups of objects, with boxes instead drawn around individuals
  • Bounding boxes drawn with more accurate boundaries around the objects instead of including extra background
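
One of these quality checks, catching duplicated labels, can be automated by flagging same-class box pairs with high overlap. This is a minimal sketch; the 0.9 IoU threshold is an illustrative choice, not the lab's actual QA rule:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def find_duplicates(labels, threshold=0.9):
    """Flag index pairs of same-class boxes whose IoU exceeds the threshold."""
    dupes = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if labels[i][0] == labels[j][0] and \
                    iou(labels[i][1:], labels[j][1:]) > threshold:
                dupes.append((i, j))
    return dupes

labels = [
    ("cow", 10, 10, 50, 50),
    ("cow", 11, 10, 50, 51),   # near-exact duplicate of the first box
    ("cow", 200, 200, 240, 240),
]
print(find_duplicates(labels))  # -> [(0, 1)]
```

Similar IoU-based checks can surface mis-labeled pairs (same box, different classes) for human review.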

Some training dataset quality issues are still present; we describe the current label quality issues and how to improve them in future iterations in the next section, “Training Data Quality”.

During the second iteration, 30 classes of labels were still created. From the lessons learned during the first iteration, we discarded labels from the model training process if their sample counts from the aerial surveys were below 100, e.g. crane, ostrich, stork, and lion. In the wildlife category, we grouped the species based on their body sizes and skin colors (a ‘visual guild’) to improve their representation during model training, as follows:

  • A “light colored large” class includes eland, hartebeest, kudu, roan, and oryx.
  • A “dark colored large” class includes wildebeest, topi, waterbuck and sable.
  • “Smaller ungulates” include gazelle, impala and antelope.

Figure 6: Class labels per category, from wildlife and livestock to human activities, labeled for the machine learning models.
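
The guild grouping and rare-class filtering described above can be expressed as a simple label remapping. The mapping follows the groupings in the text, while the sample counts in the example are made up for illustration:

```python
from collections import Counter

# Species-to-guild mapping following the groupings described in the text.
GUILDS = {
    "eland": "light colored large", "hartebeest": "light colored large",
    "kudu": "light colored large", "roan": "light colored large",
    "oryx": "light colored large",
    "wildebeest": "dark colored large", "topi": "dark colored large",
    "waterbuck": "dark colored large", "sable": "dark colored large",
    "gazelle": "smaller ungulates", "impala": "smaller ungulates",
    "antelope": "smaller ungulates",
}
MIN_SAMPLES = 100  # classes rarer than this were dropped from training

def remap_labels(labels):
    """Merge species into visual guilds, then drop classes that remain too
    rare to train on (e.g. crane, ostrich, stork, lion)."""
    merged = [GUILDS.get(lbl, lbl) for lbl in labels]
    counts = Counter(merged)
    return [lbl for lbl in merged if counts[lbl] >= MIN_SAMPLES]

# Made-up counts: two rare species merge into one trainable guild class,
# while the 4 lion labels fall below the threshold and are dropped.
labels = ["gazelle"] * 60 + ["impala"] * 70 + ["lion"] * 4
print(Counter(remap_labels(labels)))  # -> Counter({'smaller ungulates': 130})
```

Note how merging helps rare species: neither gazelle nor impala alone clears the threshold, but their shared guild does.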

At the end of the second iteration of training dataset creation, we had three categories of training data for the subsequent AIAIA Classifier and Detector model training: wildlife, livestock, and human activities (Figure 6).

Training Labels for Wildlife

We had nine classes under the wildlife category, and the top three classes by number of bounding box labels were:

  • Elephant, 3,937 labels (bounding boxes).
  • Buffalo, 2,022 labels.
  • Smaller ungulates, 1,812 labels.

Training Labels for Livestock

The major three classes of livestock present in the aerial surveys used for training dataset labeling were:

  • Cow, 14,825 labels.
  • Shoats, 7,201 labels.
  • Donkey, 219 labels.

Training Labels for Human Activities

Of the five classes of human activities present in the aerial surveys used for training dataset labeling, the top three were:

  • Building, 4,139 labels.
  • Human, 2,230 labels.
  • Boma, 1,276 labels.

In total, we had 45,155 labels (bounding boxes) across the three categories (wildlife, livestock, and human activities), belonging to 12,500 unique image chips (each 400 x 400 pixels). For the image classification model, we randomly selected 7,000 image chips that contained objects/bounding boxes and labeled them 1, and added 7,000 background chips drawn randomly from the pool of chips without any objects, labeled 0. The resulting 14,000 image chips were split 70, 20, and 10 percent into train, validation, and test TFRecords respectively. We generated separate TFRecords for the three separate AIAIA detectors for wildlife, livestock, and human activities; the TFRecords for object detection were created from the labels/bounding boxes presented in Figure 6.
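
The balanced sampling and 70/20/10 split described above can be sketched as follows. The chip IDs here are toy stand-ins for real chip files, and the random seed is arbitrary:

```python
import random

def split_chips(positive_ids, background_ids, n_per_class=7000, seed=42):
    """Sample equal numbers of object (label 1) and background (label 0)
    chips, shuffle, and split 70/20/10 into train/val/test sets."""
    rng = random.Random(seed)
    pos = [(i, 1) for i in rng.sample(positive_ids, n_per_class)]
    neg = [(i, 0) for i in rng.sample(background_ids, n_per_class)]
    pool = pos + neg
    rng.shuffle(pool)
    n = len(pool)
    n_train, n_val = n * 7 // 10, n * 2 // 10  # integer 70/20/10 split
    return pool[:n_train], pool[n_train:n_train + n_val], pool[n_train + n_val:]

# Toy stand-ins for the 12,500 object chips and a larger background pool.
train, val, test = split_chips(list(range(12500)), list(range(20000, 40000)))
print(len(train), len(val), len(test))  # -> 9800 2800 1400
```

Each of the three splits would then be serialized to its own TFRecords file, once per detector category.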