Training Data

Overview

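The ramp model needs geographically tuned training data for effective deployment over a localized area of interest. There are generally three routes to acquiring the training data needed for the ramp model:

Creating your own training data
Utilizing existing training data
Sourcing training data production from a third party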

These three workstreams will be outlined in detail in the following subsections. All three require GIS software to work with the training data. Quantum GIS (QGIS) is a powerful open source GIS package that can be quickly downloaded and installed. QGIS is what the ramp team used to create and edit training data for the ramp baseline model, and will satisfy most of your training data needs. Proprietary software such as Esri’s ArcGIS will work as well, but can come with a cost.

Scenario:

Assume, for the sake of explanation, that there has been a recent flooding event, and your team is tasked with training the ramp model for deployment over the affected region.

Creating Your Own Training Data

The first of the three options is to create your own training data. To reach the expected accuracy, the ramp model needs 12,000 to 15,000 locally tuned training data tiles, which are 256 pixel by 256 pixel satellite imagery tiles with building outline vector labels overlaying the imagery.

Identifying the Proper Imagery

Your task is to acquire satellite imagery and create the building outline vector labels on the imagery, which will be used together to train the ramp model. In creating your own training data, you have the most control over the quality of the data and can tailor the data exactly to your needs.

Acquiring imagery is the first step in the training data process. There are multiple sources of satellite imagery (for example, Maxar SecureWatch and Planet), and selecting your source of imagery depends on various factors including budget and timeframe. For more detail on imagery sources, see the previous Imagery section.

When selecting your imagery for training, review it for overall quality, considering cloud cover, image tilt or off-nadir angle (which should be less than 30 degrees), spatial resolution (the size of the pixel, which should be between 30 and 50 cm for the ramp model), and temporal resolution (when was the imagery collected, and does its age affect usability for the use case?).
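The review criteria above can be written down as a simple checklist. The sketch below is illustrative only: the function name and its parameters are hypothetical, the off-nadir and pixel size thresholds are the ones quoted in this section, and the cloud cover limit is an assumed placeholder rather than a ramp requirement.

```python
def screen_image(off_nadir_deg, pixel_size_cm, cloud_cover_pct, max_cloud_pct=10.0):
    """Return a list of reasons an image fails screening (empty if it passes).

    max_cloud_pct is an assumed placeholder; pick a limit that suits your use case.
    """
    problems = []
    if off_nadir_deg >= 30:
        problems.append("off-nadir angle is 30 degrees or more")
    if not 30 <= pixel_size_cm <= 50:
        problems.append("pixel size is outside the 30-50 cm range")
    if cloud_cover_pct > max_cloud_pct:
        problems.append("cloud cover exceeds the chosen limit")
    return problems


# Hypothetical metadata values, for illustration only.
print(screen_image(off_nadir_deg=18.5, pixel_size_cm=32, cloud_cover_pct=3))  # []
print(screen_image(off_nadir_deg=35.0, pixel_size_cm=60, cloud_cover_pct=3))
```

An empty list means the image passes these three checks; temporal resolution still needs a human judgement about whether the image age suits the use case.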

Most importantly, you should select imagery that is specific to the use case and deployment of the model. If you are training the model over Bangladesh, the localized training data should be over Bangladesh for the best model results. In some cases, Maxar may release pre- and post-event imagery specific to your use case, but do not plan for this. Maxar’s Open Data Program (ODP) does not have imagery available over every country, so sometimes the best option is to choose an area of interest that has geographic similarities as well as similar building architecture and design. It is best to have a strong understanding of the local geography where you plan to deploy the model so you can best inform your training data image selection.

The Maxar Open Data Program (ODP) is an excellent resource for satellite imagery that can be utilized for training the ramp model, and will be used as the example for this section of the documentation.

Preparing Imagery for Labeling

Scenario continued:

Your team has been activated as unprecedented flooding is displacing thousands of people in central Bangladesh, near Dhaka. The first step is to acquire imagery to create training data for a regionally tuned ramp model.

Since we will be using Maxar ODP imagery as the example for imagery selection and processing, we need to first head to Maxar’s Open Data Program website.

Maxar ODP has imagery dating back to 2010, and you can select any year up to the present and scroll through the releases that Maxar has published. The ideal option would be to find an image release over central Bangladesh. You are in luck: Maxar released imagery over Bangladesh in 2020 in response to the COVID-19 pandemic that you can use. Although two years old, this imagery will be perfect for training the ramp model for a deployment over your area of interest.

Each release will have multiple image files in the dataset that come in .tif format. Use the preview selection to explore the data and select the imagery file or files that will work for your use case. In this case you are looking to deploy the model over a peri-urban / agricultural region of Bangladesh, so that is what should be represented in the training data. Once you find an image that will work for your needs, download the imagery.

Once the image file has been downloaded, open it in QGIS. Open the image properties to check the image resolution, imagery bands, and any additional information collected in the image metadata. Image resolution is represented by pixel size. In this image, for example, the pixel size is 32 cm, which is considered high resolution (remember that ramp needs imagery in the range of 30-50 cm pixels).

Also make sure that the imagery at minimum has three bands (representing the red, green, blue wavelengths of the electromagnetic spectrum). Some satellite imagery may have more bands covering different wavelengths on the spectrum such as infrared, but we do not need these for training the ramp model at this stage.

Now that you have the imagery loaded in QGIS and have checked the metadata, you can select the area of the imagery that you want to begin to label. The entire strip represents a large geographic extent, and does not need to be labeled in its entirety. Doing this would take a very long time and produce more training data than required for localizing the ramp model. Move around the image and find a region that is closely representative to where you will be deploying the model.

Scenario continued:

For the flooding use case, let’s find a river and focus our training data around that, capturing a healthy diversity of city/urban scenes, suburban labels, and rural labels (demonstrated below).

Once you have located a region of interest, you can use QGIS’ “clip raster by extent” tool to trim imagery to your area of interest. Here is a tutorial on how to do this.

IMPORTANT NOTE

Clip raster by extent can sometimes distort the pixel size of the new imagery layer. Pay attention to the pixel size of the new image and compare it to the original imagery to confirm that it is the same.
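The comparison described in this note can be done numerically once both pixel sizes have been read from the layer properties. This is a minimal sketch: the function name and the tolerance are assumptions, and the values are typed in by hand rather than read from QGIS.

```python
import math


def pixel_size_preserved(original, clipped, rel_tol=1e-6):
    """Compare (x, y) pixel sizes of the original and clipped rasters.

    Absolute values are used because the y pixel size is often reported as negative.
    rel_tol is an assumed tolerance; tighten or loosen it as needed.
    """
    return all(
        math.isclose(abs(o), abs(c), rel_tol=rel_tol)
        for o, c in zip(original, clipped)
    )


# Pixel sizes as noted from each layer's properties (illustrative values).
original = (2.915674750738365e-06, -2.915674750738365e-06)
clipped = (2.915674750738365e-06, -2.915674750738365e-06)
print(pixel_size_preserved(original, clipped))  # True
```

If the check returns False, re-run the clip and review the output resolution settings before creating the grid.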

Once you have your AOI, you need to create a grid over the imagery that corresponds to the tile size requirements of the ramp model (256 x 256 pixels). Use the following step-by-step guidance to create a grid over your image AOI.

Navigate using the top ribbon in QGIS to Vector > Research Tools > Create Grid

This will open up a dialogue box with different parameters for your grid:

Grid Type: This should be set to “Rectangle (Polygon)”

Grid Extent: This should be set to the new area of interest layer that you just created from the original Satellite Image. Click on the button to the right of the field, then “Calculate from Layer” and select your image.

Horizontal Spacing and Vertical Spacing:

The number entered here depends on the pixel size of the imagery, and will take some simple calculation. Open the properties of your new clipped image by right-clicking the layer, and note down the pixel size number listed under the “Information” tab.

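You can then calculate the size of the horizontal and vertical spacing using the following equation:

Grid spacing = pixel size x 256

In our case this formula will be:

2.915674750738364954e-06 x 256 ≈ 0.000746412736189

The result is in the units of the layer’s coordinate reference system (degrees in this example).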

We can now add this value in the Create Grid dialogue box for both the horizontal and vertical spacing fields and click “Run”. This will generate our grid of 256×256 pixel tiles, each representing a new training sample.
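The spacing calculation is a single multiplication, so it is easy to script and reuse for other images. A minimal sketch follows; the function name is hypothetical, and the pixel size is the one used in this example.

```python
import math

TILE_SIZE_PX = 256  # tile size required by the ramp model


def grid_spacing(pixel_size, tile_size_px=TILE_SIZE_PX):
    """Grid spacing in the same units as pixel_size (degrees for a geographic CRS)."""
    return pixel_size * tile_size_px


spacing = grid_spacing(2.915674750738364954e-06)
print(f"{spacing:.15f}")  # 0.000746412736189
```

Enter the printed value in both the horizontal and vertical spacing fields. If the x and y pixel sizes of your image differ, compute each spacing separately.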

With your grid created, the imagery is now ready for you to begin labeling. Open up the attribute table for the grid layer to see how many tiles span across your area of interest. Keep in mind that you need around 12,000 tiles to fine tune the ramp model.
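Before labeling, it can help to estimate how many tiles an AOI will yield and how much ground 12,000 tiles cover. The sketch below is a rough estimate under stated assumptions: the function names and the raster dimensions are hypothetical, and edge cells of the grid may be only partially covered by imagery.

```python
import math


def tiles_in_raster(width_px, height_px, tile_px=256):
    """Number of grid cells needed to cover a raster (edge cells may be partial)."""
    return math.ceil(width_px / tile_px) * math.ceil(height_px / tile_px)


def area_for_tiles_km2(n_tiles, pixel_size_m, tile_px=256):
    """Ground area in square kilometres covered by n_tiles square tiles."""
    side_m = tile_px * pixel_size_m
    return n_tiles * side_m * side_m / 1e6


# Hypothetical clipped image of 30,000 x 28,000 pixels.
print(tiles_in_raster(30000, 28000))              # 12980
# Area covered by 12,000 tiles at 32 cm pixels.
print(round(area_for_tiles_km2(12000, 0.32), 1))  # 80.5
```

At 32 cm pixels, 12,000 tiles correspond to roughly 80 square kilometres of imagery, which gives a sense of how large the clipped AOI needs to be.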

Creating Labels

The high resolution satellite imagery serves as the foundation of the training data, but is useless without building labels (vector polygon labels) overlaying the imagery. When creating your own training data, you are responsible for creating the building labels in each image tile. Label creation can become tedious and time consuming, but it is a critical step in the process and will have a large impact on model accuracy. Consistency is key: all labels should be created in the same format and to the same specifications. This is especially important when a team of labelers is working together, as each person will have their own style. The more closely the labeling process can be replicated from one labeler to another, the better.

The ramp team has put together a label guidance document that was used to train the baseline ramp model and a few of its early fine-tuned models. The more closely a labeler follows these guidelines the better, as the baseline ramp model has learned to recognize the patterns from this specific label style and will perform better when fine-tuned with the same approach.

The ramp Data Labeling Specifications document outlines the proper approach for collecting labels for training the ramp model. It provides general labeling guidance, guidance for complex scenes and difficult imagery, as well as highlights common errors. Note that the guide is not a comprehensive overview for training data issues over all geographies. There will be instances where you have to make the judgement call, but you can lean on this guidance to help inform that decision.

Labels should be created in GeoJSON file format over the satellite imagery that you processed in the previous steps. Most GIS software can create a new vector label layer; this section of the documentation will show you how to create vector labels in QGIS.

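With your imagery and grid layers loaded, create a new shapefile layer. The finished layer can later be exported to GeoJSON from the layer’s Export > Save Features As option.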

With your new shapefile layer now created, make sure it is the active layer by clicking it in the layers window, and click the pencil icon in the top ribbon to begin editing.

The tools highlighted above are the basic tools that you will use to create your vector labels. The green bean shaped icon allows you to create a new label, and the tools icon to the right will allow you to edit and adjust existing labels.

Now that you have your label layer created and are ready to begin outlining rooftops, decide on how you are going to work through the imagery. We recommend keeping the grid layer activated and using it as a guide to methodically work your way through the imagery, chip by chip. This protects against missing buildings, which is easy to do if you take a patchwork approach to the labeling.

As you begin labeling, keep in mind the label specifications outlined in the label guidance document and aim for consistency and accuracy. For a chip to be usable, every building inside that chip should be captured. Buildings and other ground features that are mislabeled will confuse the model and affect the accuracy of model outputs.

Once you have all of your labels generated, you will need to trim the labels and the imagery into individual tiles using the grid that you created. This process will be outlined in a later section of the documentation: Preparing the Data for the Model.

Note: You can tile out the imagery before you create labels, but we advise that you tile out the imagery after the labels have been created. This helps discern buildings that fall off the edge of a tile. At a minimum you should have the ability to view adjacent tiles for context as you are labeling.
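To illustrate why edge buildings need attention, the sketch below finds which 256 x 256 tiles a building's bounding box overlaps. It is a simplified illustration, not part of the ramp tooling: the function name is hypothetical, and the bounding boxes are assumed to be in pixel coordinates measured from the top-left corner of the AOI image.

```python
import math


def tiles_touched(bbox_px, tile_px=256):
    """Return the set of (col, row) tile indices that a bounding box overlaps.

    bbox_px is (min_x, min_y, max_x, max_y) in pixels and must have positive extent.
    """
    min_x, min_y, max_x, max_y = bbox_px
    cols = range(math.floor(min_x / tile_px), math.ceil(max_x / tile_px))
    rows = range(math.floor(min_y / tile_px), math.ceil(max_y / tile_px))
    return {(col, row) for col in cols for row in rows}


# Hypothetical building footprints, for illustration only.
inside = (10, 10, 50, 50)       # sits entirely within one tile
straddling = (200, 10, 300, 40)  # crosses the boundary at x = 256
print(tiles_touched(inside))              # {(0, 0)}
print(sorted(tiles_touched(straddling)))  # [(0, 0), (1, 0)]
```

A building that touches more than one tile will be split when the labels are trimmed, so each of those tiles needs its part of the outline to be complete.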

Utilizing Existing Training Data

Another feasible way to acquire the training data needed for fine-tune training the ramp model is to find training data that has already been created and has a favorable open source license. Of the three options highlighted in this documentation, this is the cheapest and potentially least time-consuming path to acquiring the training data needed. That being said, you have no control over the way the dataset was created which can pose issues when the labeling style or image resolution conflicts with the specifications that have been followed to train the ramp baseline model. It can also become time consuming when reviewing and editing the labels to match ramp specifications. The review and editing process will be highlighted in a subsequent section. Utilizing existing datasets also limits the geographic extent of the data, which is an important consideration when fine tuning the ramp model.

What follows is a brief overview of a few datasets that we used in training the baseline model and that could prove beneficial in fine-tuning as well. Note that re-exposing a fine-tuned model to data that it has already ingested in the baseline model will not tune the model any further. The imagery behind these sources is also covered in more detail in the Imagery section.

OpenCities AI Challenge

The OpenCities AI Challenge data is based on drone imagery over urban areas across the African continent. Because the data is drone imagery, the resolution is much higher (pixel sizes can be as small as 2 cm). The ramp model is accustomed to imagery within the range of 30-50 cm, so this imagery must be resampled to 30 cm for use.
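To get a feel for what resampling to 30 cm means, the sketch below computes the pixel dimensions an image would have after resampling. It only does the arithmetic; the resampling itself would be done with a GIS tool. The function name and the image dimensions are hypothetical.

```python
def resampled_shape(width_px, height_px, src_res_cm, dst_res_cm=30.0):
    """Pixel dimensions of an image after resampling to dst_res_cm per pixel."""
    new_width = max(1, round(width_px * src_res_cm / dst_res_cm))
    new_height = max(1, round(height_px * src_res_cm / dst_res_cm))
    return new_width, new_height


# Hypothetical 15,360 x 15,360 pixel drone image at 2 cm per pixel.
print(resampled_shape(15360, 15360, src_res_cm=2.0))  # (1024, 1024)
```

In this illustration a 2 cm image shrinks by a factor of 15 in each direction, so a large drone mosaic yields far fewer 256 x 256 tiles than its original pixel count suggests.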

These datasets would be helpful for fine tuning the ramp model over similar African urban geographies but would not be ideal otherwise. Consult the ramp data dictionary to identify the datasets that have already been used in the baseline model (avoid these for fine tuning). Guide here:

The data is split between two tiers: Tier 1 and Tier 2

Tier 1 datasets represent urban extents of Ghana, Uganda, Republic of Congo, and Tanzania and are higher in quality than Tier 2. This data was still found to have occasional unlabeled buildings, label shifting, and patches of no data that make the labels unusable until edited.

Tier 2 datasets represent urban extents of Cameroon, Seychelles and Tanzania. Label gaps and shift issues are more persistent, and the time to fix the data would be considerable.

These data are also not tiled out into 256 x 256 chips, and would need to be processed accordingly before being ingested into the ramp model for training.

You can access the dataset here: https://mlhub.earth/data/open_cities_ai_challenge

You can access the data that the ramp team has edited and cleaned for the ramp model here: (link to ramp hosted data)

SpaceNet2 Challenge

SpaceNet2 data over Shanghai, China and Paris, France. The SpaceNet Dataset by SpaceNet Partners is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The SpaceNet Challenges have a repository of open source training data that was used to train the baseline ramp model and may be of use to fine-tune the ramp model, but it is geographically restricted to the following urban areas:

Las Vegas, Nevada USA
Shanghai, China
Khartoum, Sudan
Paris, France

The quality varies between the datasets, but generally the labels have been collected in a format that seeks to outline the actual building footprint rather than the rooftop, which is what the ramp model is trained to target. This causes most of the labels to be shifted, as seen in the center example above, to compensate for the off-nadir imagery. This must be corrected for the data to be useful for the ramp model.

The data is pre-tiled, but to the size of 512 x 512 pixels. It must be re-tiled to 256 x 256 pixels for use with the ramp model.
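Re-tiling a 512 x 512 tile means cutting it into four 256 x 256 windows. The sketch below only generates the window offsets; reading and writing the pixels, and clipping the labels to match, would be done with a raster library or GIS tool. The function name is hypothetical.

```python
def subtile_windows(size_px=512, tile_px=256):
    """Yield (col_off, row_off, width, height) windows that split a square tile."""
    for row_off in range(0, size_px, tile_px):
        for col_off in range(0, size_px, tile_px):
            yield (col_off, row_off, tile_px, tile_px)


print(list(subtile_windows()))
# [(0, 0, 256, 256), (256, 0, 256, 256), (0, 256, 256, 256), (256, 256, 256, 256)]
```

Each window needs its own output file name so that the image chip and its label file still share the same base name after re-tiling.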

You can access the full dataset here: https://mlhub.earth/data/spacenet2

DevGlobal Label Repository

DevGlobal hosts all of the baseline and fine tune training data used to create the first iterations of the ramp model online for free download and use. All of the internally created DevGlobal labels were created over Maxar Open Data Program imagery.

It is important to note that the baseline datasets used to train the ramp baseline model should not be used again as fine tuning data. The model has already seen these training datasets, and re-exposing the model to them will not tune the model towards that specific geography any further. These datasets are shared below for your reference, to show which geographies the model has already been exposed to.

These baseline datasets include:

Manila, Oman, India, Haiti, St. Vincent, Sierra Leone, Malawi, Ghana, South Sudan, Myanmar, Chad

You may be able to utilize some of the datasets that were compiled to train the prototype ramp fine tuned models. These datasets have been curated to the ramp label specifications and have gone through the quality control process.

Fine tune datasets include:

Bangladesh

Sourcing Training Data Production

The final way to acquire training data covered in this documentation is to source your training data from a third party. There are many companies that can be contracted to produce training data for machine learning. Control over the process can vary between organizations, but you have the ability to inform the specifications and quality standards. This option can be expensive, but often offers a fast turnaround, with large teams working on putting the training dataset together.

Training Data Quality Control

All of the workstreams outlined above benefit greatly from a robust training data review and quality control step before the data is ingested by the model. In training the baseline model, our team found it very beneficial to review, and edit where we saw fit, even the chips that we had created internally. Labelers will make mistakes, and there is no such thing as a perfect training dataset, but there are steps that you can take to ensure your dataset is high quality and won’t be a detriment to the model when training. It is up to you to balance time and resources to review your training dataset and hold it to the highest standard you can. The model will perform in direct correlation to the quality of the data it is trained on.

This label quality decision tree is a helpful tool in assessing the quality of a chip as you are working through your review. Ideally, all of the individual tiles in your dataset fall under the “acceptable tile” category.

Chippy Checker Editor (CCE) - QGIS Plugin

CCE is a tool that has been developed to support the reviewing and editing of training data for the ramp model. CCE allows a user to quickly review and edit large training datasets of imagery and corresponding labeled data. Chippy Checker runs as a QGIS plugin, efficiently loading image .tif and label .geojson files so a user can review for accuracy and edit, accept, or reject training data tiles as they see fit.

Downloading and Installing CCE

The CCE plugin is available for download on the ramp GitHub. Alongside the downloadable zip file, you can find guidance on how to install the plugin into QGIS.

To Install CCE

Upon starting an instance of QGIS, select “Plugins” from the top menu options
Select “Manage and Install Plugins”
Select “Install from ZIP” from the menu on the left of the dialogue box
Use the three dots menu to the right of the zip file box to browse to the location on your computer where you saved the downloaded CCE zip file.
Select “Install Plugin”. Note that the file must be zipped to install correctly.
Chippy Checker Editor should now be installed and accessible from Plugins on the top menu.

Running CCE

Before running CCE, you should prepare the dataset that you will be reviewing. For the tool to work, you will need three folders within a master “records directory” folder.

Imagery : A folder containing the tiled out imagery

Labels: A folder containing the tiled out label files that correspond to each of the imagery tiles

Edits: An empty folder that will be where the reviewed/edited labels will be saved

Files in the Imagery and Labels folder must be matching. For the tool to load the correct image and label counterpart, they need to have the same base name. For example:
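The matching rule can be checked before launching the tool. The sketch below is a minimal illustration, assuming .tif imagery and .geojson labels as described for CCE; the function name and the file names are hypothetical.

```python
from pathlib import Path


def unmatched_tiles(image_names, label_names):
    """Return (images without labels, labels without images), compared by base name."""
    images = {Path(n).stem for n in image_names if n.lower().endswith(".tif")}
    labels = {Path(n).stem for n in label_names if n.lower().endswith(".geojson")}
    return sorted(images - labels), sorted(labels - images)


# Hypothetical file names, for illustration only.
images = ["tile_0001.tif", "tile_0002.tif", "tile_0003.tif"]
labels = ["tile_0001.geojson", "tile_0003.geojson", "tile_0004.geojson"]
print(unmatched_tiles(images, labels))
# (['tile_0002'], ['tile_0004'])
```

In practice the two lists could come from listing the Imagery and Labels folders, for example with os.listdir. Here tile_0001 and tile_0003 pair correctly, while tile_0002 has no label file and tile_0004 has no image.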

After your directory structure and file names are in place, you can launch QGIS and start CCE. Once the tool is launched, the chippy checker editor toolbox pane will be loaded into your instance of QGIS. It should look similar to this:

This is where you will configure the tool so it can find the data in the directory that you have pre-configured in the previous step. Use the three dots menus to browse to the location of the following 4 items on your computer.

Records Directory: This is the master folder that contains the following three folders

Chips Directory: This is the folder that contains the imagery

Input Label Directory: This is the folder that contains the corresponding label files

Output Label Directory: This is the now empty folder that will soon contain the reviewed label files

Once you have configured the directories, select “Load Task” and the tool should boot up with your first chip to review/edit.

A user will now be able to systematically work their way through thousands of chips in a training dataset. The main functionality of the tool allows users two options, to “Accept” or “Reject” a chip. In the simplest review, a reviewer may move through the dataset and simply accept chips that meet quality standards and reject the chips that don’t. Note that when “rejecting” a chip, it is not outputted into the output label directory. Only labels that are accepted will be re-written into this new directory.

A “move backwards” function was built in to allow a user to return to a chip that they had previously accepted. This is helpful when conducting fast “accept” or “reject” reviews, where it can be easy to accidentally accept a chip that should be rejected. Upon returning to the previous chip by clicking the “Backward” button, a user can change the status of the chip from accepted to rejected as well as make any adjustments or edits to the labels.

CCE takes advantage of all of the native functionality of QGIS, including its suite of vector editing tools that allow a user to edit labels before accepting a chip. Instructions on how to use these tools will not be included here, but a quick Google search will lead to an abundance of QGIS tutorials.

These tools, found in the “digitizing toolbar”, will become your best friend when editing and improving labels in the tool:

A user can also access “Stats” for the dataset that they are reviewing, which includes the number of chips in the dataset, the number that you have reviewed so far so you can track your progress, and the number of chips that are missing labels (if any).

In addition to writing all of the edited and/or accepted chips into the output label directory, the tool creates a CSV file that tracks the chip ID, whether it was accepted or rejected, and any comments made corresponding to a chip.
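The review CSV can be summarized with a few lines of code to see how much of the dataset survived review. This is a sketch under assumptions: the column names (chip_id, status, comment) and the status values are hypothetical, so check the header of the file that CCE actually writes and adjust them.

```python
import csv
import io
from collections import Counter


def summarize_review(csv_file, status_column="status"):
    """Count chips per review status; status_column is an assumed column name."""
    reader = csv.DictReader(csv_file)
    return Counter(row[status_column].strip().lower() for row in reader)


# Hypothetical CSV content, for illustration only.
sample = io.StringIO(
    "chip_id,status,comment\n"
    "tile_0001,accepted,\n"
    "tile_0002,rejected,cloud cover\n"
    "tile_0003,accepted,fixed shifted labels\n"
)
counts = summarize_review(sample)
print(counts["accepted"], counts["rejected"])  # 2 1
```

To summarize a real review, pass an open file handle instead of the in-memory sample, for example one returned by open(path, newline="").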

Once the data has been reviewed and you are pleased with the quality of the training dataset, you are ready to begin training your fine tuned ramp model.