On this page, you can learn more about how the Ramp model performs on different types of satellite images, and factors that tend to result in optimal or suboptimal model performance.
The model outlined below is a semantic segmentation model which detects buildings from satellite imagery and delineates the footprints. The architecture and approach were inspired by the Eff-UNet model outlined in this CVPR 2020 Paper.
The Ramp project requires only free/open-source data science and geospatial analysis tools. The Ramp code is open source and written in Python. The training codebase uses the TensorFlow deep learning libraries and tools. Ramp was initially written and tested on the Linux platform; we are interested in porting to other platforms in common use if demand arises.
The Ramp model is designed to use an image segmentation process that assigns a label to pixel clusters that represent a building. To train the model for building segmentation, we created a large training data set of vector building labels. The process required a team of labelers to create a polygon layer by digitally tracing all the buildings in pre-selected areas of interest (AOIs). To simplify model ingest, large swaths of imagery were divided into image chips, which are 256 x 256 pixel subsets.
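The chipping step can be sketched as follows. This is a minimal illustration, not the Ramp implementation: the function name chip_windows is hypothetical, and skipping partial chips at the right and bottom edges is an assumption, since the page does not say how edges are handled.

```python
from typing import Iterator, Tuple

CHIP_SIZE = 256  # chip edge length in pixels, as stated in the text


def chip_windows(
    width: int, height: int, chip_size: int = CHIP_SIZE
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (col_off, row_off, chip_width, chip_height) for each full chip.

    Assumption: partial chips at the right and bottom edges are skipped.
    """
    for row_off in range(0, height - chip_size + 1, chip_size):
        for col_off in range(0, width - chip_size + 1, chip_size):
            yield (col_off, row_off, chip_size, chip_size)


# A 1024 x 600 pixel image yields 4 columns x 2 rows of full chips.
windows = list(chip_windows(1024, 600))
print(len(windows))  # 8
```

Each window could then be passed to a raster reader to cut out the matching pixels and clip the label polygons to the same extent.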
The process to create vector building labels required the labeling team to digitize tens of thousands of buildings with a high level of accuracy. A subset of the labels were created “in-house” by the DevGlobal team to supplement labels that were created by TaQadam and B.O.T. (Bridge. Outsource. Transform), two organizations that specialize in providing image annotation products for semantic segmentation models.
The Ramp project has:
Ramp has identified model limitations to help improve labeling and output quality, mitigate risks, and train for diversity across geographies and societies.
Technical performance indicators were calculated by comparing model inference output to original labeled data. This comparison quantifies the ability of the model to replicate the methods and visual acuity of a human analyst.
Used to help calculate F1 Score and evaluate geographic alignment of outputs.
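The name of the indicator described above is not preserved on this page. Assuming it is an overlap measure such as Intersection over Union (IoU), a pixel-level version can be sketched as below. The function name mask_iou and the nested-list mask format are illustrative choices, not part of the Ramp codebase.

```python
from typing import Sequence


def mask_iou(pred: Sequence[Sequence[int]], truth: Sequence[Sequence[int]]) -> float:
    """IoU of two equally sized binary masks (1 = building, 0 = background)."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Convention chosen for this sketch: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


pred = [[1, 1, 0],
        [1, 1, 0],
        [0, 0, 0]]
truth = [[0, 1, 1],
         [0, 1, 1],
         [0, 0, 0]]
print(mask_iou(pred, truth))  # 2 shared pixels out of 6 covered pixels
```

A predicted footprint is commonly counted as a true positive when its IoU with a truth footprint exceeds a chosen threshold; the threshold Ramp uses is not stated here.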
Below you can find examples of baseline model results. Red polygons are predicted polygons by the baseline model (prediction data) and green are truth polygons collected by human analysts (truth data).
This screenshot features baseline results over a residential area in Manjama, Sierra Leone.
Sierra Leone baseline results: Precision .835, Recall .8315, F1 .833.
This screenshot features baseline results over a residential area in Mesopotamia, St. Vincent.
St. Vincent baseline results: Precision .853, Recall .826, F1 .8397.
The Dhaka training dataset consists of over 11,000 matching image chips and building label polygon files. It is highly segmented data, over one of the densest built-up areas in the world, and extremely challenging for machine learning models.
First, the Dhaka training data were separated into several Areas of Interest (AOIs). North and West were combined to make a single dataset for training. Then, transferability to the East AOI was tested.
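The AOI-based split described above can be sketched as follows. The record layout (a dict with "aoi" and "chip" keys) and the file names are hypothetical and used only for illustration; the actual Ramp data layout may differ.

```python
from typing import Dict, List, Tuple

# AOI names follow the text: North and West for training, East held out.
TRAIN_AOIS = {"north", "west"}
TEST_AOIS = {"east"}


def split_by_aoi(chips: List[Dict[str, str]]) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Split chip records into training and held-out test sets by AOI."""
    train = [c for c in chips if c["aoi"] in TRAIN_AOIS]
    test = [c for c in chips if c["aoi"] in TEST_AOIS]
    return train, test


# Hypothetical chip records, one per image chip / label file pair.
chips = [
    {"aoi": "north", "chip": "n_0001.tif"},
    {"aoi": "west", "chip": "w_0001.tif"},
    {"aoi": "east", "chip": "e_0001.tif"},
]
train_set, test_set = split_by_aoi(chips)
print(len(train_set), len(test_set))  # 2 1
```

Splitting by geography rather than at random keeps neighboring chips out of both sets, so the East results measure transferability to an unseen area.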
Below you can see results from the Dhaka East Localized Model: an example of fine-tuned model testing. Red polygons are predicted polygons by the baseline model (prediction data) and green are truth polygons collected by human analysts (truth data).
This screenshot features an industrial area in East Dhaka.
This screenshot features a residential area in East Dhaka.
East Dhaka results: Precision .612, Recall .622, F1 .617.
This section documents the trade-offs across infrastructure choices, source data, and more.
Mosaics are wide-area images that have been atmospherically corrected and color balanced to appear as though they are one contiguous image, even though they are a patchwork of individual strips. Mosaics require time-intensive processing including curating images with low/no cloud cover, minimal haze, low off-nadir angles, and consistent pixel size, making them optimal sources for large-area extractions.
Strips are processed quickly by imagery providers and are available soon after collection from the sensor. They are pre-processed and vary in pixel size, but can have quality issues related to clouds, haze, collection angles, etc.
DECISION: Train on strips so the model ‘learns’ to process varying resolutions, but run on mosaics for large-area extractions.
While cloud-based processing can be faster than training locally, its cost is often out of reach for our core users and our focus geographies in LMICs. Local training requires an up-front investment, but that is a fixed cost which won’t balloon project costs and derail an organization’s budget.
Demonstrating the steps and skills required to deploy locally targets our core user groups, and these processes can eventually be expanded to be optimized for cloud deployment.
DECISION: Train locally to support our target users.
Generally, there are two approaches when training a model: supervised and unsupervised learning.
A supervised approach relies on human judgment to create data that the model will learn from. The supervised approach requires thousands of hours of digitization for a training dataset of our size, though the resulting accuracy of the training set is significantly higher.
An unsupervised approach asks the model to develop its own training data based on statistical analysis of the imagery. Unsupervised classification is far less time-intensive but also less accurate.
DECISION: Ramp requires a supervised approach for building footprint extraction.
Precision means the percentage of generated footprints which are true positives, i.e. the % of our detections that are actually buildings. Higher precision means fewer false positives.
Recall means the percentage of ground truth correctly detected/delineated by the model, i.e. the % of actual buildings in an image that we positively identify as buildings. Higher recall means fewer missed buildings (false negatives).
DECISION: Prioritize recall over precision if necessary to minimize missed buildings.
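As a worked illustration of these definitions, the sketch below computes precision, recall, and F1 from counts of true positives, false positives, and false negatives. The function names and example counts are illustrative, not taken from the Ramp codebase. F1 is the harmonic mean of precision and recall.

```python
from typing import Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute (precision, recall, F1) from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of detections that are buildings
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of true buildings that were detected
    return precision, recall, f1_score(precision, recall)


# Illustrative counts: 80 correct detections, 20 false alarms, 10 missed buildings.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.889 0.842

# The reported precision and recall figures reproduce the reported F1 values.
print(round(f1_score(0.835, 0.8315), 3))  # 0.833 (Sierra Leone)
print(round(f1_score(0.612, 0.622), 3))   # 0.617 (East Dhaka)
```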
A building detection model may result in better performance, and the outputs would support desired population and population dispersal analytics.
Footprints allow for more accurate population estimation, can be integrated meaningfully with road and utility datasets, and can be used to assess the economic state within regions of interest. They can also be used to evaluate structure damage following natural disasters or degradation of structures and urban development over time.
DECISION: While it may be possible to get higher metrics-based performance from a building detection model, after consulting with our advisors and end-user groups we determined that building footprint delineation using an image segmentation model would better provide the data fidelity necessary for our target use cases. Considering the breadth of applications relative to a building detection model, image segmentation proved the best choice for our end-users and desired outcomes.
Our initial focus was the creation of a baseline model that can accurately extract buildings over broad geographies, and then a fine-tuned model that will perform exceptionally well over Bangladesh. Because of this, we were less concerned with temporal accuracy, as the training data forms the basis of the model, and recent imagery can be used for production runs which will yield up-to-date footprints.
DECISION: Optimize for high-quality imagery knowing future deployment will be made against recent imagery.
What does Ramp achieve with robust ethics?
AI can be a driver of growth, development, and democratization. By addressing current barriers, the Global South can not only catch up to those countries that have already taken steps to advance AI, but surpass them—especially in innovating for local contexts and communities. This can result in:
The team, in concert with stakeholders, has made the determination that Ramp’s benefits outweigh potential costs.
How Ramp is working to adopt ethical practices:
Ramp aims to adhere to the Locus Charter, a set of common principles developed from international dialogues with geospatial professionals and organizations exploring what it means to use location data responsibly in different contexts. Below you can find steps Ramp is taking to thoughtfully mitigate risks and promote ethical use of location data: