Training the Model

Ramp Training Notebook

This notebook is intended to run you through the ramp training process, explain what’s going on at every stage, and help you understand what’s in a training configuration file.

We’ll start by importing python dependencies, including functions needed from the ramp codebase.

In [1]:

				
					import os, sys
from pathlib import Path
import numpy as np
import argparse
import tensorflow as tf
from tensorflow import keras
import datetime
import random
import json

# Note: this suppresses warning and other less urgent messages,
# and only allows errors to be printed.
# Comment this out if you are having mysterious problems, so you can see all messages.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# this variable must be defined. It is the parent of the 'ramp-code' directory.
RAMP_HOME = os.environ["RAMP_HOME"]

# import ramp dependencies.
from ramp.training.augmentation_constructors import get_augmentation_fn
from ramp.training import callback_constructors
from ramp.training import model_constructors
from ramp.training import optimizer_constructors
from ramp.training import metric_constructors
from ramp.training import loss_constructors

from ramp.data_mgmt.data_generator import training_batches_from_gtiff_dirs, test_batches_from ramp.utils.misc_ramp_utils import log_experiment_to_file, get_num_files
from ramp.models.effunet_1 import get_effunet
from ramp.utils.model_utils import get_best_model_value_and_epoch
import ramp.utils.log_fields as lf

import segmentation_models as sm
sm.set_framework('tf.keras')

Out [1]:

Segmentation Models: using `keras` framework.

A Note on Callback Functions

Segmentation model training makes use of multiple callback functions, which create and track events at regular points during the training process.

Some callbacks should be used in every training run.

Tensorboard callback (get_tb_callback_fn): Used for tracking model training metrics using Tensorboard (i.e., loss and accuracy). Always essential for training.
Prediction logging (get_pred_logging_callback_fn): Used for displaying changing building prediction results, using Tensorboard. Essential.
Model checkpoint logging callback (get_model_checkpt_callback_fn): Used for saving copies of the best-performing models during training. Essential, since saving good models is the point of the exercise.

Some callbacks are used frequently for training.

Early stopping callback (get_early_stopping_callback_fn): Stops the training process before all epochs are completed if the training metrics stop improving. Very useful.

In addition, other callback-based functionality was tested, and those were left in the codebase for others to try.

get_clr_callback_fn: Used for cyclic learning schedules, which change the learning rate as training proceeds. Sometimes provides an accuracy boost if training is prone to getting stuck in non-optimal solutions.

In [2]:

				
					print("\n".join([f for f in dir(callback_constructors) if f.startswith("get") and f.endswith("callback_fn")]))

Out [2]:

get_clr_callback_fn
get_early_stopping_callback_fn
get_model_checkpt_callback_fn
get_pred_logging_callback_fn
get_tb_callback_fn

The Training Configuration File

Training runs are defined using json configuration files, which specify training datasets, training options, and hyperparameter choices. Sample config files are in the ‘experiments’ subdirectory of the codebase, organized in folders according to the training datasets specified.

At the top level, the config file contains the following blocks:

experiment_name: simple, descriptive name for your own use
discard_experiment: set to ‘true’ if this is a test run that you don’t want to keep any record of
logging: whether to log the experiment to a csv file, and what fields to log
datasets: directories where training and validation datasets are stored
num_classes: 2 for binary masks, 4 for multichannel masks
num_epochs: number of cycles for the training run. Overridden if using early stopping.
batch_size: number of samples per ‘batch’. Smaller batches are needed for smaller gpus. Larger batches stabilize the training metrics, and result in fewer iterations and shorter training times.
input_img_shape: (H,W) of the input images.
output_img_shape: (H,W) of output masks.
loss: the loss function to use in training, and any parameters to use in its construction.
metrics: the accuracy functions to track during training, and parameters to use in their construction.
optimizer: choice of optimizer to use in training, and optimizer parameters.
model: choice of model to use in training (currently just EfficientUnet), and parameters to use in its construction, including the variety of EfficientNet to use for the encoder.
saved_model: whether to resume training from a saved model, and if so, the location of the saved model.
augmentation: whether to use image augmentation (e.g., random rotations, random changes in color) to regularize training. If so, which augmentations to apply, and parameters needed to construct them.
early_stopping: whether to use early stopping, and parameters needed to define early stopping rules.
cyclic_learning_scheduler: whether to use a cyclic learning scheduler (generally no unless you’re a power user), and parameters needed to construct one.
tensorboard: whether to log training metrics in tensorboard (always yes), with what frequency, and where to store logs.
prediction_logging: whether to log prediction samples in tensorboard (yes).
model_checkpts: whether to save ‘best models’ during training (always yes), and to where.
random_seed: an integer to use for the random seed, for reproducibility of results.

The contents of a sample configuration file are shown below.

Read in the configuration file

This code block reads in a sample file. Change this to your own configuration file when you’re doing a training run.

In [3]:

				
					config_file = "sample-data/ramp_training/sample_config.json"

with open(config_file) as jf:
        cfg = json.load(jf)

Experiment_name
A simple, descriptive name for your own use.

In [4]:

				
					cfg["experiment_name"]

Out [4]:

‘Sample Efficient-Unet binary model baseline’

Discard_experiment
Set to ‘true’ if this is a test run that you don’t need or want to keep any record of.

In [5]:

				
					cfg["discard_experiment"]

Out [5]:

False

Logging
Whether to log the experiment to a csv file. If so, what file to log to, and what information to log.

In [6]:

				
					cfg["logging"]

Out [6]:

{‘log_experiment’: True,
‘experiment_log_path’: ‘ramp-data/TRAIN/all_ramp_experiments.csv’,
‘experiment_notes’: ‘baseline model w/ earlystop, batchsize 16 on india-malawi-sierraleone-oman2-haiti data’,
‘fields_to_log’: [‘experiment_name’,
‘experiment_notes’,
‘timestamp’,
‘num_epochs’,
‘batch_size’,
‘output_img_shape’,
‘input_img_shape’,
‘get_loss_fn_name’,
‘use_saved_model’,
‘use_aug’,
‘use_early_stopping’,
‘use_clr’,
‘random_seed’,
‘num_classes’,
‘get_optimizer_fn_name’,
‘tb_logs_dir’,
‘get_model_fn_name’,
‘backbone’,
‘train_img_dir’,
‘train_mask_dir’,
‘val_img_dir’,
‘val_mask_dir’]}

datasets
Paths to directories containing the training images, training masks, validation images, and validation masks.

IMPORTANT NOTE

All directory and file paths defined in a ramp config file are assumed to be relative paths, defined relative to the RAMP_HOME environment variable, which must be defined in any environment used to run ramp code.In the ramp docker image, RAMP_HOME is defined to be '/tf' (since the ramp docker image is based on a tensorflow docker image).Therefore, the 'train_img_dir' variable below actually points to the path: RAMP_HOME/ramp-data/TRAIN/tq_baseline/chips.

In [7]:

				
					cfg["datasets"]

Out [7]:

{‘train_img_dir’: ‘ramp-data/TRAIN/tq_baseline/chips’,
‘train_mask_dir’: ‘ramp-data/TRAIN/tq_baseline/binmasks’,
‘val_img_dir’: ‘ramp-data/TRAIN/tq_baseline/valchips’,
‘val_mask_dir’: ‘ramp-data/TRAIN/tq_baseline/val-binmasks’}

num_classes
This training run is for a binary mask model. If it were a multichannel mask model, num_classes would be 4.

In [8]:

				
					cfg["num_classes"]

Out [8]:

num_epochs
The number of times the training process will pass through the entire training dataset.

I usually set this value very large, and employ early stopping during training. As noted below, early stopping will stop training if the metrics show that the model is no longer learning, even if the full number of epochs has not been reached.

In [9]:

				
					print(cfg["num_epochs"])
cfg["num_epochs"] = 2

Out [9]:

batch_size
The number of samples to use per ‘batch’ (iteration of training). Smaller batches (say, size 4) are needed for smaller/older gpus. Larger batches stabilize the training metrics (i.e., they ‘jump around’ less), and result in fewer training iterations and shorter training times.

In [10]:

				
					cfg["batch_size"]

Out [10]:

input_img_shape
(H,W) of the input images in pixels.

In [11]:

				
					cfg["input_img_shape"]

Out [11]:

[256, 256]

output_img_shape
(H,W) of the output masks. Generally but not always the same as the input.

In [12]:

				
					cfg["input_img_shape"]

Out [12]:

[256, 256]

loss
Defines the loss function, and parameters needed to use in its construction.

Training a segmentation neural network means finding a solution that minimizes the difference between a truth and a predicted segmentation mask, for all the samples in the training data set. “Differences” between masks are measured, during training, by loss functions.

Tweaking the definition of the loss function used during training can have a huge effect on the results. For example, if the loss function is defined so that any mistakes made on pixels at the boundaries of buildings increase the loss function more than mistakes made on the interior pixels of buildings, the trained model will try harder to get correct results on the boundaries.

Multiple loss functions were tested during ramp development, and the options were left in the code for others to try.

get_loss_fn_name: the name of the function in the loss_constructors module that will construct and return the loss function (using any parameters you define).

In this case, get_sparse_categorical_crossentropy_fn() will construct the sparse_categorical_crossentropy loss function and return it.

loss_fn_parms: Since there are no additional parameters needed for constructing
sparse_categorical_crossentropy(), this field is blank.

In [13]:

				
					cfg["loss"]

Out [13]:

{‘get_loss_fn_name’: ‘get_sparse_categorical_crossentropy_fn’,
‘loss_fn_parms’: {}}

The list below shows all the loss functions available in the ramp code.

In [14]:

				
					print("\n".join([f for f in dir(loss_constructors) if f.startswith("get")and f.endswith("fn")]))

Out [14]:

get_SCCE_labelsmoothing_loss_fn
get_class_tversky_fn
get_custom_loss_fn
get_sparse_categorical_crossentropy_fn
get_weighted_SCCE_loss_fn

metrics
Defines accuracy metrics to track during training, and parameters needed for their construction.

Accuracy tracking allows us to monitor the accuracy of our model during training, on both the training and validation data. In general, the value of accuracy on the validation data will improve during the early phase of training, and then start to worsen when the model starts to overfit the training data. The value of accuracy on the training data will continue to improve the whole time a model is training.

A user may track as many accuracy metrics as desired during training.

The accuracy metric tracked during ramp training was sparse categorical accuracy, which reports the percentage of pixels in the truth mask that are correctly labeled by the model. It is very computationally efficient to compute, which makes it a good accuracy metric to use during training time. This accuracy metric is generally very high during training, around 97% on the validation data, but pixelwise metrics do not necessarily give a good feeling for the accuracy of building extractions.

The metric we use to directly measure the accuracy of building extractions after training is the Spacenet building extraction metric, F1 based on IoU@0.5. This metric is too expensive to compute during training runs.

use_metrics: should always be true.

get_metrics_fn_names: this is a list of functions from the metric_constructors module that are used to construct the accuracy functions (it is a list because more than one metric can be tracked per training run).

metrics_fn_parms: a list of parameter sets, of the same length as the get_metrics_fn_names list, to pass to the metric constructors. In this case, there is only one parameter set, and it is empty because sparse_categorical_accuracy does not need any parameters for construction.

In [15]:

				
					cfg["metrics"]

Out [15]:

{‘use_metrics’: True,
‘get_metrics_fn_names’: [‘get_sparse_categorical_accuracy_fn’],
‘metrics_fn_parms’: [{}]}

The list below shows all ramp’s current options for training-time accuracy metrics.

In [16]:

				
					print("\n".join([f for f in dir(metric_constructors) if f.startswith("get") and f.endswith("fn")]))

Out [16]:

get_sparse_categorical_accuracy_fn

optimizer
Defines the optimization function to use in training, and any parameters needed for its construction.

get_optimizer_fn_name: the name of the function in the optimizer_constructors module that will construct the optimizer for the training run. In this case, get_adam_optimizer will construct an Adam optimizer.

optimizer_fn_parms: any parameters needed by the construction function. In this case, we want to set the learning rate for the Adam optimizer to 3E=04.

In [17]:

				
					cfg["optimizer"]

Out [17]:

{‘get_optimizer_fn_name’: ‘get_adam_optimizer’,
‘optimizer_fn_parms’: {‘learning_rate’: 0.0003}}

model
Defines the choice of model to use in training, and parameters to use in its construction.

The segmentation model used by ramp is a Unet with one of several EfficientNet encoder options, which may be specified in the model_fn_parms list.

get_model_fn_name: the name of the function in the model_constructors module that will construct the model for the training run. Only ‘get_effunet_model’ is currently defined in ramp, which returns a Unet segmentation model with a configured EfficientNet encoder.

model_fn_parms: parameters to pass to ‘get_effunet_model’. Currently there are two: ‘backbone’, which defines the EfficientNet type to use for the Unet encoder, and ‘classes’, which is a list of the class names used for the segmentation problem.

‘backbone’ is set to ‘efficientnetb0’, which is the smallest of the EfficientNet options. Options available are ‘efficientnetb0’ through ‘efficientnetb7’; documentation is here.

‘classes’ is a list containing 2 classes, for the binary segmentation problem.

In [18]:

				
					cfg["model"]

Out [18]:

{‘get_model_fn_name’: ‘get_effunet_model’,
‘model_fn_parms’: {‘backbone’: ‘efficientnetb0’,
‘classes’: [‘background’, ‘buildings’]}}

The list below shows all current options for segmentation models available in ramp.

In [19]:

				
					print("\n".join([f for f in dir(model_constructors) if f.startswith("get")and f.endswith("model")]))

Out [19]:

get_effunet_model

saved_model
Ramp training includes an option to resume training from an existing model.

In this case, use_saved_model is false, so we do not use a saved model. If this field were true, the training code would load the model in the “saved_model_path” field, and ignore the information in the ‘model’ section of the configuration.

If a saved model is used, and the “save_optimizer_state” field is false, then the model will be recompiled. This means that all information about the training state of the saved model will be lost. Otherwise, training will resume from the model state as it was saved.

In [20]:

				
					cfg["saved_model"]

Out [20]:

{‘use_saved_model’: False,
‘saved_model_path’: ‘ramp-data/TRAIN/tq_baseline/model-checkpts/sample_saved_model’,
‘save_optimizer_state’: False}

augmentation
Whether to use image augmentations during training, in order to regularize training. If so, then this block also specifies augmentation choices, and the parameters needed for their construction.

aug_list: the list of names of augmentation functions to use during training;

aug_parms a list, of the same length as ‘aug_list’, of sets of parameters to be passed to the constructors of each augmentation function.

In the code below, there are two augmentation functions being constructed. One is a random rotation, and the other is a random (slight) change of colors. Each has its own set of parameters which are specified for its construction.

This link points to the complete list of available augmentations, and the parameters required to construct them. Note that many augmentations (such as, for example, InvertImg) are not appropriate for use with remote sensing imagery.

In [21]:

				
					cfg["augmentation"]

Out [21]:

{‘use_aug’: True,
‘get_augmentation_fn_name’: ‘get_augmentation_fn’,
‘aug_list’: [‘Rotate’, ‘ColorJitter’],
‘aug_parms’: [{‘border_mode’: ‘BORDER_CONSTANT’,
‘interpolation’: ‘INTER_NEAREST’,
‘value’: [0.0, 0.0, 0.0],
‘mask_value’: 0,
‘p’: 0.7},
{‘p’: 0.7}]}

early_stopping
Whether to use ‘early stopping’ during training. Early stopping stops training when a watched accuracy metric fails to improve over a sufficiently long time.

For example, if the time series of validation loss values has reached its minimum value, and has failed to reach a new minimum after some (user-specified) number of new epochs, early stopping will stop the training even if the full number of epochs has not been run.

early_stopping_parms: parameters to pass to the early stopping construction function, including which metric to monitor for the early stopping decision (monitor), and how many epochs to wait before stopping a training run that is not improving (patience).

In [22]:

				
					cfg["early_stopping"]

Out [22]:

{‘use_early_stopping’: True,
‘early_stopping_parms’: {‘monitor’: ‘val_loss’,
‘min_delta’: 0.005,
‘patience’: 50,
‘verbose’: 0,
‘mode’: ‘auto’,
‘restore_best_weights’: False}}

cyclic_learning_scheduler
Whether to use a cyclic learning scheduler, which adjusts the value of the learning rate parameter during training in order to help the model escape local minima during training, and reach a more optimal final solution.

In general, you won’t need to use CLR schedulers during training unless you suspect that you are getting stuck with a non-optimal solution to your training problem.

This article provides a good introduction to cyclical learning rate schedulers.

(Also note: If tuning a neural network’s learning rate is a completely new idea, I recommend starting with this article, which is the first in a 3-part series, of which the cyclic learning rate article linked above is the second.)

In [23]:

				
					cfg["cyclic_learning_scheduler"]

Out [23]:

{‘use_clr’: False,
‘get_clr_callback_fn_name’: ‘get_clr_callback_fn’,
‘clr_callback_parms’: {‘mode’: ‘triangular2’,
‘stepsize’: 8,
‘max_lr’: 0.0001,
‘base_lr’: 3.25e-06},
‘clr_plot_dir’: ‘ramp-data/TRAIN/Shanghai-Paris-Oman/plots’}

tensorboard
Whether, and where, to log training metrics using the tensorboard utility. This option should always be turned on!

The tensorboard utility allows you to monitor the status of training, and training loss and accuracy, during the training process. Since training takes a long time, having a window into the process is critical.

tb_logs_dir: the parent directory of all tensorboard logging directories. Note that ramp code gives every training run a unique name using its time stamp, so every training run will log to a unique subdirectory of the ‘tb_logs_dir’ directory.

tb_callback_parms: a list of parameters to be passed to the Tensorboard callback function. These mostly control the frequency of update of the Tensorboard output.

In [24]:

				
					cfg["tensorboard"]

Out [24]:

{‘use_tb’: True,
‘tb_logs_dir’: ‘ramp-data/TRAIN/tq_baseline/logs’,
‘get_tb_callback_fn_name’: ‘get_tb_callback_fn’,
‘tb_callback_parms’: {‘histogram_freq’: 1, ‘update_freq’: ‘batch’}}

prediction_logging
Whether to add the display of predicted segmentation images to the Tensorboard display. This should always be true, since it lends enormous insight into the results of the training process. Prediction images should improve visibly during training, and changes should stabilize (i.e., not change wildly from epoch to epoch) as training continues.

In [25]:

				
					cfg["tensorboard"]

Out [25]:

{‘use_prediction_logging’: True,
‘get_prediction_logging_fn_name’: ‘get_pred_logging_callback_fn’}

model_checkpts
Whether and where to save ‘best models’ during training. This should always be true.

Currently, the ‘best_models’ are defined as those that have the current maximum value of the validation accuracy metric.

model_checkpt_dir: the parent directory of all model checkpoint logging directories. Note that ramp code gives every training run a unique name using its time stamp, so every training run will store its ‘best models’ to a unique subdirectory of the ‘model_checkpt_dir’ directory.

model_checkpt_callback_parms: List of parameters to pass to the model checkpoint callback function. The model checkpoint function will always monitor the validation dataset’s value of the first metric in the list of accuracy metrics. Set save_best_only to ‘false’ if you would like to save the model after every epoch.

In [26]:

				
					cfg["model_checkpts"]

Out [26]:

{‘use_model_checkpts’: True,
‘model_checkpts_dir’: ‘ramp-data/TRAIN/tq_baseline/model-checkpts’,
‘get_model_checkpt_callback_fn_name’: ‘get_model_checkpt_callback_fn’,
‘model_checkpt_callback_parms’: {‘mode’: ‘max’, ‘save_best_only’: True}}

random_seed
An integer to use for a random seed, in order to enable the reproduction of training runs.

In [27]:

				
					cfg["random_seed"]

Out [27]:

20220523

Running the RAMP Training Code

Now that we know what’s in the configuration file, we’ll walk through the steps of setting up the training and running it.

Step 1. Check your GPU setup and access.

Before you begin training, you’ll want to check whether Tensorflow has access to your GPU to run training. The code below helps you diagnose problems.

Below, we directly check Tensorflow access to the GPUs. You should see as many
PhysicalDevice listings as you have GPUs.

In [28]:

				
					tf.config.list_physical_devices('GPU')

Out [28]:

[PhysicalDevice(name=’/physical_device:GPU:0′, device_type=’GPU’),
PhysicalDevice(name=’/physical_device:GPU:1′, device_type=’GPU’)]

Set up, or disable, logging to Tensorboard and the experiment log.

Set the timestamp for the current experiment, and add it to the training run configuration.

In [29]:

				
					discard_experiment = False
if "discard_experiment" in cfg:
    discard_experiment = cfg["discard_experiment"]
    
cfg["timestamp"] = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

Step 2. Construct the loss function for the training run.

The user must specify a single loss function for the training run.

To do this, the user specifies a function from the loss_constructors module that will be used to construct the loss function. This is necessary because constructing the loss function will frequently require additional parameters to be passed in; for example, a weighted loss function (for example, one that penalizes incorrect boundary pixels more heavily than incorrect background pixels) will need the class weights to be passed in at the time of its construction. These weights must also be defined in the configuration file.

This method of constructing a function dynamically is used repeatedly in the ramp code, as you’ll see.

In [30]:

				
					# specify a function that will construct the loss function
get_loss_fn_name = cfg["loss"]["get_loss_fn_name"]
get_loss_fn = getattr(loss_constructors, get_loss_fn_name)
print(f"Loss function constructor: {get_loss_fn.__name__}")

# Construct the loss function 
loss_fn = get_loss_fn(cfg)
print(f"Loss function: {loss_fn.__name__}")

Out [30]:

Loss function constructor: get_sparse_categorical_crossentropy_fn
Loss function: sparse_categorical_crossentropy

Step 3. Construct the accuracy metrics for the training run.

While the neural network model uses only one loss function in training, you can specify more than one accuracy metric to track. You should always specify at least one accuracy measure to track.

As with the loss function, the user must specify the constructor functions that will create the desired accuracy functions.

In [31]:

				
					the_metrics = []
if cfg["metrics"]["use_metrics"]:

    get_metrics_fn_names = cfg["metrics"]["get_metrics_fn_names"]
    get_metrics_fn_parms = cfg["metrics"]["metrics_fn_parms"]
    
    for get_mf_name, mf_parms in zip(get_metrics_fn_names, get_metrics_fn_parms):
        get_metric_fn = getattr(metric_constructors, get_mf_name)
        print (f"Metric constructor function: {get_metric_fn.__name__}")
        metric_fn = get_metric_fn(mf_parms)
        the_metrics.append(metric_fn)

# Print the list of accuracy metrics
print(f"Accuracy metrics: {[fn.name for fn in the_metrics]}")

Out [31]:

Metric constructor function: get_sparse_categorical_accuracy_fn
Accuracy metrics: [‘sparse_categorical_accuracy’]

Step 4. Construct the optimizer for the training run

Model training proceeds by making incremental adjustments to the model parameters that attempt to minimize (or optimize) the value of the user’s chosen loss function. The choice of size and direction for those incremental adjustments is made by an optimization algorithm.

The user must specify the construction function that will construct the optimizer algorithm. In this example, the configuration file also contains a parameter, the ‘learning_rate’, which is used in the construction of the optimizer.

In [32]:

				
					#### construct optimizer ####

get_optimizer_fn_name = cfg["optimizer"]["get_optimizer_fn_name"]
get_optimizer_fn = getattr(optimizer_constructors, get_optimizer_fn_name)
print (f"Optimizer constructor: {get_optimizer_fn.__name__}")


optimizer = get_optimizer_fn(cfg)
print(optimizer)
print(float(optimizer.learning_rate))

Out [32]:

Optimizer constructor: get_adam_optimizer
<keras.optimizer_v2.adam.Adam object at 0x7f16d81bef70>
0.0003000000142492354

Step 5. Construct the model

The code below optionally constructs a new model, or uses a saved model from a previous training run.

Use a saved model
If a saved model is used, the model is loaded from a location in the configuration file. Note that all directories in the configuration file are given relative to the RAMP_HOME environment variable, which must be defined in every environment that runs ramp code.

If you set save_optimizer_state to True in the configuration file, ramp training will proceed using the configuration that was used to train the saved model. This will cause the above specification of loss function, accuracy metrics, and optimizer to be bypassed.

In [33]:

				
					the_model = None

if cfg["saved_model"]["use_saved_model"]:
    
    # load (construct) the model
    model_path = Path(RAMP_HOME) / cfg["saved_model"]["saved_model_path"]
    print(f"Model: importing saved model {str(model_path)}")
    the_model = tf.keras.models.load_model(model_path)
    assert the_model is not None, f"the saved model was not constructed: {model_path}"

    if not cfg["saved_model"]["save_optimizer_state"]:
        # If you don't want to save the original state of training, recompile the model.
        the_model.compile(optimizer = optimizer, 
            loss=loss_fn,
            metrics = the_metrics)

Construct a new model
If a new model is created, the user specifies the constructor for the model. The code below constructs the model, and compiles it (i.e., sets an optimizer, loss function, and metrics).

In [34]:

				
					if not cfg["saved_model"]["use_saved_model"]:
    get_model_fn_name = cfg["model"]["get_model_fn_name"]
    get_model_fn = getattr(model_constructors, get_model_fn_name)
    print(f"Model constructor: {get_model_fn.__name__}")
    the_model = get_model_fn(cfg)

    assert the_model is not None, f"the model was not constructed: {model_path}"
    the_model.compile(optimizer = optimizer, 
        loss=loss_fn,
        metrics = the_metrics)

print(the_model)

Step 6. Specify directories to use for training and validation data

The training process uses a training data set, from which it learns the model, and a validation data set, which is used to track how well the trained model performs on data it hasn’t been trained with. Both are critical.

Each data set contains sample image chips and matching truth datasets, stored in geojson files. The base filenames of each matching image and polygon file must match uniquely, e.g., ‘130bfe-210.tif’ and ‘130bfe-210.geojson’.

The paths to these datasets are defined in the configuration file, relative to the RAMP_HOME environment variable.

In [35]:

				
					#### define data directories ####
train_img_dir = Path(RAMP_HOME) / cfg["datasets"]["train_img_dir"]
train_mask_dir = Path(RAMP_HOME) / cfg["datasets"]["train_mask_dir"]
val_img_dir = Path(RAMP_HOME) / cfg["datasets"]["val_img_dir"]
val_mask_dir = Path(RAMP_HOME) / cfg["datasets"]["val_mask_dir"]

Step 7. Set up the augmentations (i.e., image transformations) that will be applied to the training data.

The get_augmentation_fn() constructs a sequence of image transformation functions that will be applied to the training data only (not the validation data). This increases the variability in the data that the model sees during training.

Note that each image transformation function in the list below requires its internal parameters to be set during construction.

In [36]:

				
					#### get the augmentation transform ####
aug = None
if cfg["augmentation"]["use_aug"]:
    aug = get_augmentation_fn(cfg)
    print(aug)

Out [36]:

Compose([
ColorJitter(always_apply=False, p=0.7, brightness=[0.8, 1.2], contrast=[0.8, 1.2], saturation=[0.8, 1.2], hue=[-0.2, 0.2]),
], p=1.0, bbox_params=None, keypoint_params=None, additional_targets={})

Step 8. Set runtime parameters, and add them to the configuration data.

The number of training iterations per epoch depends on the total quantity of training data chips, and on the batch size.

The data generator (which ‘feeds’ data to the model during training) must know the sizes of its input and output images.

In [37]:

				
					batch_size = cfg["batch_size"]
input_img_shape = cfg["input_img_shape"]
output_img_shape = cfg["output_img_shape"]

n_training = get_num_files(train_img_dir, "*.tif")
n_val = get_num_files(val_img_dir, "*.tif")
steps_per_epoch = n_training // batch_size
validation_steps = n_val // batch_size

# add these back to the config 
# in case they are needed by callbacks
cfg["runtime"] = {}
cfg["runtime"]["n_training"] = n_training
cfg["runtime"]["n_val"] = n_val
cfg["runtime"]["steps_per_epoch"] = steps_per_epoch
cfg["runtime"]["validation_steps"] = validation_steps

Step 9. Set up training and validation data 'feeds'.

Notice the call to construct training batches is different when augmentation is used.

In [38]:

				
					train_batches = None

if aug is not None:
    train_batches = training_batches_from_gtiff_dirs(train_img_dir, 
                                                train_mask_dir, 
                                                batch_size, 
                                                input_img_shape, 
                                                output_img_shape, 
                                                transforms = aug)
else:
    train_batches = training_batches_from_gtiff_dirs(train_img_dir, 
                                                train_mask_dir, 
                                                batch_size, 
                                                input_img_shape, 
                                                output_img_shape)

assert train_batches is not None, "training batches were not constructed"

In [39]:

				
					val_batches = test_batches_from_gtiff_dirs(val_img_dir, 
                                                val_mask_dir, 
                                                batch_size, 
                                                input_img_shape, 
                                                output_img_shape)

assert val_batches is not None, "validation batches were not constructed"

Step 10. Define all experiment logging and model checkpoint callbacks.

A list of callback functions will be passed to the training process.

These callbacks are essential and used in every training run (unless it is a throwaway).

In [40]:

				
					callbacks_list = []

if not discard_experiment:

    # get model checkpoint callback
    if cfg["model_checkpts"]["use_model_checkpts"]:
        get_model_checkpt_callback_fn_name = cfg["model_checkpts"]["get_model_checkpt_callback_fn_name"]
        get_model_checkpt_callback_fn = getattr(callback_constructors, get_model_checkpt_callback_fn_name)
        callbacks_list.append(get_model_checkpt_callback_fn(cfg))
        print(f"model checkpoint callback constructor:{get_model_checkpt_callback_fn.__name__}")

    # get tensorboard callback
    if cfg["tensorboard"]["use_tb"]:
        get_tb_callback_fn_name = cfg["tensorboard"]["get_tb_callback_fn_name"]
        get_tb_callback_fn = getattr(callback_constructors, get_tb_callback_fn_name)
        callbacks_list.append(get_tb_callback_fn(cfg))
        print(f"tensorboard callback constructor: {get_tb_callback_fn.__name__}")

    # get tensorboard model prediction logging callback 
    if cfg["prediction_logging"]["use_prediction_logging"]:
        assert cfg["tensorboard"]["use_tb"], 'Tensorboard logging must be turned on to enable prediction logging'
        get_prediction_logging_fn_name = cfg["prediction_logging"]["get_prediction_logging_fn_name"]
        get_prediction_logging_fn = getattr(callback_constructors, get_prediction_logging_fn_name)
        callbacks_list.append(get_prediction_logging_fn(the_model, cfg))
        print(f"prediction logging callback constructor: {get_prediction_logging_fn.__name__}")

# free up RAM
keras.backend.clear_session()

Out [40]:

model checkpoint callback constructor:get_model_checkpt_callback_fn
tensorboard callback constructor: get_tb_callback_fn
prediction logging callback constructor: get_pred_logging_callback_fn

Step 11. Define the early stopping callback if you're using it.

You often will be.

In [41]:

				
					if cfg["early_stopping"]["use_early_stopping"]:
    print("Using early stopping")
    callbacks_list.append(callback_constructors.get_early_stopping_callback_fn(cfg))

Out [41]:

Using early stopping

Step 12. Define the Cyclic Learning Rate Scheduler callback if you're using it.

You probably won’t need to do this, but it’s included here for completeness.
You’ll get an error if you try to use a cyclic learning rate scheduler together with early stopping.

In [42]:

				
					    # get cyclic learning scheduler callback
    if cfg["cyclic_learning_scheduler"]["use_clr"]:
        assert not cfg["early_stopping"]["use_early_stopping"], "cannot use early_stopping with cycling_learning_scheduler"
        get_clr_callback_fn_name = cfg["cyclic_learning_scheduler"]["get_clr_callback_fn_name"]
        get_clr_callback_fn = getattr(callback_constructors, get_clr_callback_fn_name)
        callbacks_list.append(get_clr_callback_fn(cfg))
        print(f"CLR callback constructor: {get_clr_callback_fn.__name__}")

Step 13. Train the model

The ‘history’ return value is a data structure containing training-time information, such as the values of loss and accuracy metrics after every epoch. Most of what it contains is viewable in the Tensorboard application.

The user must specify the number of epochs to run. If early stopping is used, the full number of epochs may not be run.

In [43]:

				
					n_epochs = cfg["num_epochs"]
print(n_epochs)

Out [43]:

In [44]:

				
					n_epochs = cfg["num_epochs"]
print(n_epochs)

Out [44]:

Epoch 1/2
878/878 [==============================] – ETA: 0s – loss: 0.1402 – sparse_ca
tegorical_accuracy: 0.9450INFO:tensorflow:Assets written to: /tf/ramp-data/TR
AIN/tq_baseline/model-checkpts/20220725-235719/model_20220725-235719_001_0.96
1.tf/assets
878/878 [==============================] – 194s 190ms/step – loss: 0.1402 – s
parse_categorical_accuracy: 0.9450 – val_loss: 0.0973 – val_sparse_categorica
l_accuracy: 0.9607
Epoch 2/2
878/878 [==============================] – ETA: 0s – loss: 0.0995 – sparse_ca
tegorical_accuracy: 0.9603INFO:tensorflow:Assets written to: /tf/ramp-data/TR
AIN/tq_baseline/model-checkpts/20220725-235719/model_20220725-235719_002_0.96
3.tf/assets
878/878 [==============================] – 164s 185ms/step – loss: 0.0995 – s
parse_categorical_accuracy: 0.9603 – val_loss: 0.0931 – val_sparse_categorica
l_accuracy: 0.9631

Step 14. Log experiment information to a central location.

Ramp has a simple mechanism for storing experiment information to a central CSV file, the path to which is specified (like everything else) in the configuration file. Fields to be logged are specified in the ‘logging’ configuration block, under ‘fields_to_log’.

Fields to be logged should be uniquely named (or you might not get the records you wanted!).

In [45]:

				
					if not discard_experiment and cfg["logging"]["log_experiment"]:
    exp_log_path = str(Path(RAMP_HOME)/cfg["logging"]["experiment_log_path"])
    print(f"logging experiment to: {exp_log_path}")

    # log fields chosen in the configuration file
    # the fields must be uniquely named
    fields_to_log = cfg["logging"]["fields_to_log"]
    exp_log = dict()
    for field_key in fields_to_log:
        field_val = lf.locate_field(cfg, field_key)
        if not isinstance(field_val, dict):
            exp_log[field_key] = field_val
    the_argmax, best_val_acc = get_best_model_value_and_epoch(history)
    exp_log[lf.BEST_MODEL_VALUE] = best_val_acc
    exp_log[lf.BEST_MODEL_EPOCH] = the_argmax
    log_experiment_to_file(exp_log, exp_log_path)