Ramp Model Card v0

Ramp Model Overview

The Replicable AI for Microplanning (Ramp) open-source deep learning model analyzed in this card accurately digitizes buildings in low- and middle-income countries (LMICs) using satellite imagery and enables in-country users to build their own deep learning models for their regions of interest. The model was trained on many different types of satellite images and learned to contextualize and find meaning in what are initially abstract pixel values.

The Ramp Model Card is inspired by Google's vision for model cards. The purpose of the model card is to "organize the essential facts of machine learning models in a structured way" and "leverage a model's capabilities and steer clear of its limitations".

Ramp builds on previous mapping efforts and data releases that have shaped the ecosystem of humanitarian mapping. There have been large releases of building footprint data across huge geographic areas in the past, such as Google's Open Buildings dataset, and there are also widely used platforms that make building footprints available to the world at no cost.

Ramp will amplify these efforts aimed at mapping the "missing millions" by improving knowledge of how to develop machine learning models that extract information for global health use cases and emergency response activities.

Ramp's distinct focus is to make every aspect of the project openly available to the world.

On this page, you can learn more about how the Ramp model performs on different types of satellite images, and factors that tend to result in optimal or suboptimal model performance. 

Model Description

The model outlined below is a semantic segmentation model that detects buildings in satellite imagery and delineates their footprints. The architecture and approach were inspired by the Eff-UNet model outlined in this CVPR 2020 paper.

The Ramp project requires only free/open-source data science and geospatial analysis tools. The Ramp code is open source and written in Python. The training codebase utilizes the TensorFlow deep learning libraries and tools. Ramp was initially written and tested on the Linux platform; we are interested in porting to other platforms in common use if demand arises.

Methodology

The Ramp model is designed to use an image segmentation process that assigns a label to pixel clusters that represent a building. To train the model for building segmentation, we created a large training data set of vector building labels. The process required a team of labelers to create a polygon layer by digitally tracing all the buildings in pre-selected areas of interest (AOIs). To simplify model ingest, large swaths of imagery were divided into image chips: 256 x 256 pixel subsets.
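The chipping step above can be sketched in a few lines. This is a minimal illustration of dividing a scene into non-overlapping 256 x 256 tiles, not the Ramp codebase's own tiling routine (which also handles georeferencing and edge padding):

```python
import numpy as np

def chip_image(image: np.ndarray, chip_size: int = 256):
    """Split a (height, width, bands) image array into non-overlapping chips.

    Edge regions smaller than chip_size are dropped here for simplicity;
    production pipelines typically pad or overlap instead.
    """
    height, width = image.shape[:2]
    chips = []
    for row in range(0, height - chip_size + 1, chip_size):
        for col in range(0, width - chip_size + 1, chip_size):
            chips.append(image[row:row + chip_size, col:col + chip_size])
    return chips

# A 1024 x 768 pixel scene yields a 4 x 3 grid of 256 x 256 chips.
scene = np.zeros((768, 1024, 3), dtype=np.uint8)
chips = chip_image(scene)
print(len(chips))  # 12
```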

The process to create vector building labels required the labeling team to digitize tens of thousands of buildings with a high level of accuracy. A subset of the labels were created “in-house” by the DevGlobal team to supplement labels that were created by TaQadam and B.O.T. (Bridge. Outsource. Transform), two organizations that specialize in providing image annotation products for semantic segmentation models. 

The Ramp project has:

  • Produced a baseline model and a fine-tuned semantic segmentation model
  • Trained the model for use in LMICs
  • Used training data from previous open releases and open ML/AI competitions including SpaceNet, the Open Cities AI Challenge, Mapbox with OSM-labels, and the Maxar Open Data Program:
    • 110,220 image chips labeled and counted
    • 59,342 image chips for baseline model
    • 50,878 image chips for finetuned model
  • Labeled and counted over 1,200,000 individual buildings across 22 individual AOIs
  • Used a human-centered design approach to develop a model that addresses underlying human needs and factors


Considerations

Ramp has identified model limitations to help improve labeling and output quality, mitigate against risks, and train for diversity across geographies and societies.

Regional Diversity

Ramp is designed to provide accurate building footprint extraction up to a national scale. The model must be trained on geographically relevant imagery to learn a region's architectural, cultural, and geographical nuances. For example, a model trained on labels from Bangladesh may not deploy well in Kenya without further training and tuning. Training data should also capture local geographic variance, such as the differences between rural and urban landscapes within a country.

Resolution

The model is designed to work on satellite imagery strips and mosaics with resolutions of 50cm or better, and aerial imagery resampled to 30cm. We recommend using imagery with 50cm or better resolution; coarser imagery may not carry the necessary amount of detail. Low temporal resolution can also limit monitoring of a changing environment.

Fine-Tuning Data

To deploy Ramp over a new geographic region, the model must be fine-tuned with a new set of training data that is representative of the desired deployment geography. Labeling can be time-consuming but is required for a high-accuracy deployment of the model. We recommend creating 2,000-4,000 new training chips over the target geography to develop a fine-tuned model for the region.

Informal Settlements and Partially Constructed Buildings

Structures that are connected to one another but represent individual entities should be collected as a series of separate but touching polygons. If one roof can be differentiated from another, the individual roofs should be captured. This configuration often appears in refugee settlements. Mapping building footprints in informal settlement dwellings can help determine the populations living in those camps and the level of resources available or needed. Changes over time can help measure change in displaced populations.

Image Quality

Variation in image quality can include atmospheric distortion, cloud cover, haze, smoke, and sun glare. Varying canopy cover can limit accurate segmentation of buildings in forested landscapes. High off-nadir imagery in urban regions can create occlusion issues, meaning shorter buildings in the scene are blocked by taller ones. Off-nadir imagery exposes the facades of buildings, which should not be captured in the labels. Building polygon labels should not be modified to accommodate image-dependent features.

Shadows

Where shadows obscure a structure, the edges should be inferred. Shadows originating from a structure should not be included in that structure’s footprint. Shadows overlapping a structure should not be cut out of the structure’s representative polygon. When multiple image options are available, users should select those with the best combination of minimal cloud cover, low off-nadir angle, and the least obstructive shadows.

Structure Size

Small features that are not dwellings, such as sheds, garages, and outhouses, are easy to mislabel and can cause the model to extract unwanted structures in deployment runs.

Partially Constructed Buildings

Users should review imagery prior to training and deployment to ensure rooftops are clearly visible. Some areas may have partially constructed buildings, often with no roofs, where people are actively residing. These structures can be difficult to distinguish and can be easily overlooked.

Data Gaps

Data gaps in the imagery, such as no-data pixels, must be accounted for so they do not affect the model output. Other data issues, such as duplication, invalid data types, and spelling errors in file names, must also be considered.
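A simple pre-screening pass can flag chips affected by no-data pixels before they reach training. The sketch below is illustrative; the threshold and the assumption that no-data is encoded as 0 across all bands are hypothetical and should be adapted to the actual imagery source:

```python
import numpy as np

def has_data_gap(chip: np.ndarray, nodata_value: int = 0,
                 max_nodata_frac: float = 0.0) -> bool:
    """Flag chips whose fraction of no-data pixels exceeds a threshold.

    A pixel counts as no-data only when every band equals nodata_value.
    """
    nodata_frac = np.mean(np.all(chip == nodata_value, axis=-1))
    return bool(nodata_frac > max_nodata_frac)

clean = np.ones((256, 256, 3), dtype=np.uint8)
gappy = clean.copy()
gappy[:64, :64] = 0  # simulate a no-data corner (1/16 of the chip)
print(has_data_gap(clean), has_data_gap(gappy))  # False True
```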

Computing Resources

We recommend running Ramp on a server with a recent installation of Linux and access to at least one NVIDIA CUDA-enabled GPU; two or more GPUs will speed up training. While cloud-based compute options are available, Ramp was designed to work well on a relatively small, local machine learning server.
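A quick preflight check for CUDA-enabled GPUs can save a failed training run. This sketch uses the standard `nvidia-smi` tool rather than TensorFlow so it works before any deep learning libraries are installed:

```python
import shutil
import subprocess

def cuda_gpu_count() -> int:
    """Return the number of NVIDIA GPUs visible to nvidia-smi,
    or 0 if the tool is absent (e.g. no NVIDIA driver installed)."""
    if shutil.which("nvidia-smi") is None:
        return 0
    result = subprocess.run(
        ["nvidia-smi", "--list-gpus"],
        capture_output=True, text=True, check=False,
    )
    # nvidia-smi prints one line per GPU, e.g. "GPU 0: ..."
    return len([line for line in result.stdout.splitlines() if line.strip()])

print(f"CUDA-enabled GPUs detected: {cuda_gpu_count()}")
```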


Performance

Technical performance indicators were calculated by comparing model inference output to original labeled data. This comparison quantifies the ability of the model to replicate the methods and visual acuity of a human analyst.

Intersection Over Union (IoU, Jaccard Index)

Used to help calculate F1 Score and evaluate geographic alignment of outputs.

Dice Coefficient (F1 Score)

  • F1 Score is the primary indicator of model performance
  • Targeting F1 Score of >=90%
  • May trade higher F1 score for improved recall if necessary
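Both metrics above can be computed directly on binary building masks. This is a minimal sketch using NumPy, not the evaluation code from the Ramp repository:

```python
import numpy as np

def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection over Union (Jaccard index) of two binary masks."""
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return float(inter / union) if union else 1.0

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice coefficient (pixel-wise F1) of two binary masks."""
    inter = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    return float(2 * inter / total) if total else 1.0

# Toy masks: 8 predicted pixels, 8 truth pixels, 4 overlapping.
pred = np.zeros((4, 4), dtype=bool); pred[:2, :] = True
truth = np.zeros((4, 4), dtype=bool); truth[1:3, :] = True
print(iou(pred, truth), dice(pred, truth))  # 0.3333... 0.5
```

Note the fixed relationship Dice = 2 * IoU / (1 + IoU), which is why IoU helps calculate the F1 Score.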

Below you can find examples of baseline model results. Red polygons are predictions from the baseline model (prediction data) and green are truth polygons collected by human analysts (truth data).


This screenshot features baseline results over residential area in Manjama, Sierra Leone.
Sierra Leone baseline results: Precision .835, Recall .8315, F1 .833.
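The reported F1 score is the harmonic mean of precision and recall, so the headline numbers can be sanity-checked against each other:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Sierra Leone baseline: precision .835, recall .8315
print(round(f1(0.835, 0.8315), 3))  # 0.833
```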


This screenshot features baseline results over a residential area in Mesopotamia, St. Vincent.
St. Vincent baseline results: Precision .853, Recall .826, F1 .8397.

The Dhaka training dataset consists of over 11,000 matching image chips and building label polygon files. It is highly segmented data, over one of the densest built-up areas in the world, and extremely challenging for machine learning models.

First, the Dhaka training data were separated into several Areas of Interest (AOIs). North and West were combined to make a single dataset for training. Then, transferability to the East AOI was tested.

Below you can see results from the Dhaka East Localized Model: an example of fine-tuned model testing. Red polygons are predictions from the localized model (prediction data) and green are truth polygons collected by human analysts (truth data).


This screenshot features an industrial area in East Dhaka. 


This screenshot features a residential area in East Dhaka.

East Dhaka results: Precision .612, Recall .622, F1 .617.


Trade-offs

This section documents the trade-offs across infrastructure choices, source data, and more.

Mosaics are wide-area images that have been atmospherically corrected and color balanced to appear as though they are one contiguous image, even though they are a patchwork of individual strips. Mosaics require time-intensive processing, including curating images with low/no cloud cover, minimal haze, low off-nadir angles, and consistent pixel size; that curation makes them optimal sources for large-area extractions.

Strips are processed quickly by imagery providers and are available soon after collection from the sensor. They have varying pixel size and, though pre-processed, can have quality issues related to clouds, haze, collection angles, and so on.

DECISION: Train on strips so the model ‘learns’ to process varying resolutions, but run on mosaics for large-area extractions.

While cloud-based processing can prove faster than local training, the costs associated with cloud processing can often be out of reach for our core users and focus geographies in LMICs. Locally trained datasets require an initial up-front investment, but that is a fixed cost which won't balloon project costs and derail an organization's budget.

Demonstrating the steps and skills required to deploy locally targets our core user groups, and these processes can eventually be expanded to be optimized for cloud deployment.

DECISION: Train locally to support our target users.

Generally, there are two approaches when training a model: supervised and unsupervised learning. 

A supervised approach relies on human judgment to create data that the model will learn from. The supervised approach requires thousands of hours of digitization for a training dataset of our size, though the resulting accuracy of the training set is significantly higher.

An unsupervised approach asks the model to develop its own training data based on statistical analysis of the imagery. Unsupervised classification is far less time-intensive but also less accurate. 

DECISION: RAMP requires a supervised approach for building footprint extraction.

Precision means the percentage of generated footprints which are true positives, i.e. the % of our detections that are actual buildings. Higher precision means fewer false positives.

Recall means the percentage of ground truth correctly detected/delineated by the model, i.e. the % of actual buildings in an image that we positively identify as buildings. Higher recall means fewer missed buildings and a higher % of total relevant results correctly classified by the algorithm.

DECISION: Prioritize recall over precision if necessary to minimize missed buildings.
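Both quantities reduce to simple ratios of true positives (TP), false positives (FP), and false negatives (FN). A minimal sketch with hypothetical counts:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical run: 80 correct detections, 10 false detections,
# 20 buildings missed entirely.
p, r = precision_recall(tp=80, fp=10, fn=20)
print(round(p, 3), round(r, 3))  # 0.889 0.8
```

Prioritizing recall, as the decision above states, means accepting a few more false detections (lower precision) in exchange for fewer missed buildings.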

A building detection model may achieve better raw performance metrics, and its outputs would still support the desired population and population-dispersal analytics.

Footprints allow for more accurate population estimation, can be integrated meaningfully with road and utility datasets, and can be used to assess the economic state within regions of interest. They can also be used to evaluate structure damage following natural disasters or degradation of structures and urban development over time.  

DECISION: While it may be possible to get higher metrics-based performance from a building detection model, after consulting with our advisors and end-user groups we determined that building footprint delineation using an image segmentation model would better provide the data fidelity necessary for our target use cases. Considering the breadth of applications relative to a building detection model, footprint segmentation proved the best choice for our end-users and desired outcomes.

Our initial focus was the creation of a baseline model that can accurately extract buildings over broad geographies, and then a fine-tuned model that will perform exceptionally well over Bangladesh. Because of this, we were less concerned with temporal accuracy, as the training data forms the basis of the model, and recent imagery can be used for production runs which will yield up-to-date footprints.

DECISION: Optimize for high-quality imagery knowing future deployment will be made against recent imagery.


Ethics

What does Ramp achieve with robust ethics?

AI can be a driver of growth, development, and democratization. By addressing current barriers, the Global South can not only catch up to those countries that have already taken steps to advance AI, but surpass them—especially in innovating for local contexts and communities. This can result in:

  • Connecting users with tools and applications to upskill and be at the forefront of economic transformation
  • Building trusted relationships with AI tools and services through thoughtful human-machine collaborations
  • Accelerating equitable health outcomes by intentionally integrating diversity of communities being mapped across LMICs
  • Reducing reliance of governments and NGOs on foreign tech providers
  • Demonstrating an “AI for good” application that addresses ethics and privacy concerns


The team, in concert with stakeholders, has made the determination that Ramp’s benefits outweigh potential costs.

How Ramp is working to adopt ethical practices:

Ramp aims to adhere to the Locus Charter, a set of common principles developed from international dialogues with geospatial professionals and organizations exploring what it means to use location data responsibly in different contexts. Below you can find steps Ramp is taking to thoughtfully mitigate risks and promote ethical use of location data:

  • Ramp will promote equity and fairness
  • Ramp will be locally led and implemented
  • Ramp will follow the lead and best practices of communities already mapping these features
  • Ramp is creating equitable partnerships, addressing expertise gaps, and training for diversity across geographies and societies
  • Ramp is building the model to be efficient and require minimal processing power. Users can run the model locally and do not need to deploy in the cloud
  • Ramp users will define what successful adoption and implementation looks like and how it should be measured over the project lifespan
  • Ramp will incorporate user voice into our model
  • Ramp is releasing the entire collection of training data used to build the model, which provides a valuable data repository across many different geographies, as well as mechanisms to replicate the process of producing additional training data via the model
  • Detailed documentation is being designed and written for specific user personas, especially users based in-country. This documentation, along with associated tools, will help improve user experience and build user confidence in their ability to use the model and data
  • Ramp is pressure testing our model against Tamil Nadu's DEEP-MAX Scorecard, a transparent rating system for AI systems covering: Diversity, Equity, Ethics, Privacy and Data Protection, Misuse Protection, Audit and Transparency, Cross-Geography and Cross-Society Applicability
  • Ramp is evaluating the implications of Demographically Identifiable Information (DII), and integrating data minimization principles into our standards
  • Ramp is taking into account who has access to data and models, and incorporating features and mechanisms that discourage and protect against misuse of data by bad actor networks
  • Ramp will promote informed consent, transparent, responsible and respectful use of data

We would love your feedback

We want to hear from you about the project, your use cases, and any feedback you may have.