Downloading large datasets
Perhaps the biggest challenge in this project was handling the tens of millions of images for each country. HV towers only become recognizable at about zoom 18 imagery (~0.5 meter/pixel resolution), which is a relatively high spatial resolution for commercially available satellite imagery. The cost associated with this high spatial resolution is that we were forced to handle a very large volume of tiles. Just the act of downloading the imagery from the Digital Globe Maps API was extremely computationally intensive. We ran tens of thousands of networking coroutines -- essentially lightweight threads each downloading a single image at a time -- in parallel. This was the only method we found to obtain the country-wide image sets in a reasonable amount of time. It also required us to keep track of which files had been downloaded to avoid wasting time and resources; at nearly a hundred million image tiles, tracking this process is no longer trivial. Standard operating systems will throw an error if you simply try to list or delete a folder with this many files.
For the next iteration, we will likely use AWS's Simple Queue Service (SQS) to tabulate all tiles that need downloading. Each entry will contain tile indices, which our downloading script can pull from asynchronously. Then, each entry in the queue is only removed if our downloading process sends back confirmation that it worked correctly. This should provide a fault tolerant and more efficient solution that could also scale if we needed to process more countries. It will also allow us to more easily download a portion of the total images. In many cases, the ML algorithm does not need to predict the entire country's images for the strings of HV towers to become visible to a human in an ML-generated map overlay. In the first iteration, the download script randomly skipped over images with a probability dependent on the desired download proportion (i.e., if we only wanted half of a country's images, we would set this skip probability to 0.5). Using SQS, we can randomly shuffle the queue once and then download the desired number of images directly instead of wasting computation in this skip procedure.
Optimizing download and prediction speed
We also decided to download all imagery directly to AWS S3 for storage. This choice was useful in that we never had to request any imagery twice from the Digital Globe Maps API, and we had a full copy of all the data we intended to process. We also believed it was possible to rapidly transfer images from S3 to EC2 instances for the inference stage, which was dependent on GPU instances that are charged by the length of time that they're reserved.
However, the storing and accessing data on S3 was both costly and slow. There are set financial costs associated with the raw volume of stored data as well as costs for each file uploaded or downloaded to and from S3. But accessing stored data was more expensive than expected simply because of the sheer number of files that we were required to process. Additionally, the speed at which we could access stored data was much slower than expected. This likely occurred for two reasons: First, since S3 accesses files using a key-based system, any request for a subset of imagery (using asterisk wild cards for example) required the AWS servers to iterate over all keys to find the relevant subset; again, because of the large volume of files, this was surprisingly slow — in some cases, an S3 copy request could take 1-2 hours before the download even started. Second, the effective download speed (in Mbps) was also quite slow. We found there is a fixed overhead computational cost to initiating the download of a single file. As an practical example, this means that downloading one-thousand 1kb files is much slower than downloading one 1 Mb file. We found that transferring data from S3 to EC2 instances (for inference) was about 100x slower than AWS is capable of had the same data been stored as a single file.
Future efforts should attempt to avoid S3 and focus on methods of downloading imagery directly to EC2 instances immediately before prediction. The challenge here is to match the speed that images are downloaded with the speed of prediction. Currently, the download step is 2-5x slower depending on the GPU instance used for prediction. Download speed could be increased by finding a better method of requesting images (currently done via HTTP). It may also be possible to somehow construct super-images each made up of 25 or 100 individual tiles (in 5x5 or 10x10 squares, respectively) prior to downloading them. This would address some of those constant overhead and possibly reduce the effective per-image download time. Even something as simple as combining large groups of images into single zip files might help reduce the slowdowns we experienced.