Lessons learned from the TGS Salt Identification competition

This summer, I wanted to take part in a Kaggle competition. I looked at the latest featured challenges and chose a segmentation task. I had no prior experience with segmentation before this point, so I knew that I would learn a lot. This post is about the lessons I learned during this three-month adventure, which ended with a rank in the top 10%.

It all starts with geologists who want to find vast deposits of salt below the Earth's surface. To do so, they use seismic reflection, a method of exploring the Earth's crust by generating an ultrasound-like image of the subsurface.

Dataset

The dataset for this competition consists of 101x101-pixel images in which each pixel is labeled as salt or not. The goal of the contest is to segment the regions that contain salt.

Seismic images. The salt regions are in green

That said, I looked at my images: some of them contain just a tiny portion of salt that is easy to miss at a quick glance, some have a salt region following a curve in the image, and some are entirely salt or contain no salt at all. I also got 4,000 images for my training set and had to predict 18,000 images. That raised another challenge: my training set is much smaller than my test set.

Metric

Now, how are the participants ranked? According to the description, the score is based on the intersection over union (IoU) metric. It takes a set of pixels (the ones I predict as salt) and compares it to another set of pixels (the ones that really are salt).

IoU for bounding boxes showing a poor, good, and excellent prediction. Source: Adrian Rosebrock

It is a meaningful metric for assessing performance on a segmentation task, and it would be useful to compute a loss from it. Nevertheless, this metric is non-differentiable, and our optimization method needs a differentiable function to learn from.

Losses

In this competition, I studied and tested several loss functions to tell my model where it was wrong. I started with the log loss (or binary cross-entropy).

Negative log loss

A classic, but I wanted to emphasize the mispredicted pixels a bit more. Indeed, if my model already predicts the correct class with a probability of 0.7, it doesn't need to push that probability to 0.99. Thus I considered the focal loss, which allows controlling the slope of the log loss.

Focal loss with different values of gamma. Source: "Focal Loss for Dense Object Detection"
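
To make this concrete, here is a minimal sketch of a binary focal loss written with the Keras backend (gamma = 2 is a common default from the paper, not necessarily the value I ended up using):

```python
import keras.backend as K

def focal_loss(gamma=2.0):
    """Binary focal loss: down-weights pixels that are already well
    classified so the training focuses on the hard ones. gamma controls
    the slope; gamma = 0 falls back to plain binary cross-entropy."""
    def loss(y_true, y_pred):
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
        # probability assigned to the true class for each pixel
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        return K.mean(-K.pow(1.0 - p_t, gamma) * K.log(p_t))
    return loss
```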

These losses work well pixel-wise but are poor indicators of the overall quality of the segmentation. Thus I looked for objective functions closer to the IoU.

With the ground-truth labels (Mask) and the predicted labels (Prediction), the IoU is defined as

IoU(Mask, Prediction) = |Mask ∩ Prediction| / |Mask ∪ Prediction|

IoU produces a value between 0 and 1. For this equation to work, the mask set and the prediction set need to contain only 1's and 0's. Mask already holds these values, but Prediction holds probabilities between 0 and 1. To convert them into binary values, we need an argmax function or a threshold, and neither of these is differentiable.
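
As an illustration, here is how the (non-differentiable) IoU can be computed with NumPy once a threshold has been chosen (the 0.5 threshold is only an example):

```python
import numpy as np

def iou(mask, prediction, threshold=0.5):
    """IoU between a binary ground-truth mask and a predicted probability
    map. The probabilities have to be binarized first (here with a simple
    threshold), which is exactly the non-differentiable step."""
    pred_binary = prediction > threshold
    intersection = np.logical_and(mask, pred_binary).sum()
    union = np.logical_or(mask, pred_binary).sum()
    if union == 0:  # empty mask and empty prediction: perfect match
        return 1.0
    return intersection / union
```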

So I got interested in the Dice coefficient, which is similar to the IoU but can be relaxed into a differentiable form. The Dice loss is defined by

Dice loss = 1 - (2 |Mask ∩ Prediction|) / (|Mask| + |Prediction|)

It was used in medical applications to deal with unbalanced classes. You can get more information by reading "V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation".
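
A common soft-Dice formulation in Keras looks like the sketch below (the smoothing term is a standard trick, not necessarily the exact variant from the V-Net paper):

```python
import keras.backend as K

def dice_loss(y_true, y_pred, smooth=1.0):
    """Soft Dice loss: it works directly on the predicted probabilities,
    so no thresholding is needed and the loss stays differentiable.
    The smoothing term avoids division by zero on empty masks."""
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    dice = (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)
    return 1.0 - dice
```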

Finally, some discussions during the competition mentioned the Lovász hinge loss. Even after reading the referenced paper, "The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks", I didn't fully understand how it works, mostly because I lack background in convex functions and convex geometry. However, I used it, and in practice I got better results.

During this challenge, I mostly combined losses. For a while, I used the focal loss together with the Dice loss to pair a pixel-wise loss with a segmentation-aware loss. Then, when I discovered the Lovász hinge, I combined the log loss with the Lovász hinge as suggested in the paper.
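
Combining two losses is just a weighted sum. Reusing the focal and Dice sketches from above, it could look like this (the 0.5 weighting is illustrative, not the one I tuned):

```python
def combined_loss(y_true, y_pred, alpha=0.5):
    """Weighted sum of a pixel-wise term and a segmentation-aware term,
    built from the focal and Dice sketches above."""
    return alpha * focal_loss()(y_true, y_pred) + (1.0 - alpha) * dice_loss(y_true, y_pred)
```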

Data augmentation

Before diving into the architecture of the neural net, I will briefly mention the data augmentation I used. I tried several ways to augment my 3,600 training images (the 400 remaining images were kept for validation). In the end, simpler was better, so I used a horizontal flip and a translation with reflection padding to triple my training set.
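
A minimal NumPy sketch of these two augmentations (the 2D grayscale layout and the shift value are assumptions):

```python
import numpy as np

def augment(image, mask, shift=10):
    """Returns the two extra (image, mask) pairs: a horizontal flip and a
    small right translation whose border is filled by reflection."""
    flipped = (np.fliplr(image), np.fliplr(mask))
    shifted_img = np.pad(image, ((0, 0), (shift, 0)), mode='reflect')[:, :-shift]
    shifted_msk = np.pad(mask, ((0, 0), (shift, 0)), mode='reflect')[:, :-shift]
    return [flipped, (shifted_img, shifted_msk)]
```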

Deep learning architecture

Concerning the architecture of the neural net, it is based on the U-Net architecture introduced in "U-Net: Convolutional Networks for Biomedical Image Segmentation". It is a fully convolutional network composed of two main parts: a downsampling path and an upsampling path. The particularity of U-Net is that it is symmetric and creates 'skip connections' between the downsampling path and the upsampling path. The goal of these skip connections is to provide local information to the upsampling path.

Classic U-Net architecture
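
To make the skip connection idea concrete, here is a minimal one-level sketch in Keras (the 128x128 input size, depth, and filter counts are illustrative, not my actual network, whose inputs stayed at 101x101):

```python
from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D, Concatenate

inputs = Input((128, 128, 1))
down = Conv2D(16, 3, padding='same', activation='relu')(inputs)     # encoder block
pooled = MaxPooling2D()(down)                                       # downsampling
bottom = Conv2D(32, 3, padding='same', activation='relu')(pooled)   # bottleneck
up = UpSampling2D()(bottom)                                         # upsampling
up = Concatenate()([down, up])          # skip connection from the encoder
outputs = Conv2D(1, 1, activation='sigmoid')(up)                    # pixel-wise prediction
model = Model(inputs, outputs)
```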

I quickly customized the block of layers (purple) at each level to improve my model.

A ResNet block

I used residual blocks in my model architecture. As a reminder, when a network gets deeper, we run into vanishing gradient and degradation problems. The residual block alleviates these problems by creating a shortcut that allows

  • the residual branch to be pushed towards 0 if the input is already optimal;

  • better gradient values to flow back to the input.

A colleague told me about another idea named Squeeze and Excitation. The main idea is to produce channel-wise weights for a convolutional layer. For example, in an RGB image, if we are looking for something yellow, we will weight the red and green channels more than the blue channel. The same idea is applied to the output of a convolutional layer.

That said, I combined the two concepts and got the following block.

Scheme on how I combined Residual block with Squeeze + Excitation concepts

This block is used before reducing the image shape by max pooling in the encoder part and before upscaling the image in the decoder part.
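
For reference, a Keras sketch of such a block could look like this (the exact layer order and the reduction ratio are assumptions on my part, not a definitive implementation):

```python
from keras.layers import (Conv2D, BatchNormalization, Activation, Add,
                          GlobalAveragePooling2D, Dense, Reshape, Multiply)

def se_res_block(x, filters, ratio=8):
    """Residual block followed by a squeeze-and-excitation gate.
    Assumes `x` already has `filters` channels so the Add() works;
    otherwise a 1x1 convolution is needed on the shortcut."""
    shortcut = x
    y = Conv2D(filters, 3, padding='same')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, 3, padding='same')(y)
    y = BatchNormalization()(y)
    # Squeeze: one descriptor per channel. Excitation: channel-wise weights.
    se = GlobalAveragePooling2D()(y)
    se = Dense(filters // ratio, activation='relu')(se)
    se = Dense(filters, activation='sigmoid')(se)
    se = Reshape((1, 1, filters))(se)
    y = Multiply()([y, se])
    y = Add()([shortcut, y])
    return Activation('relu')(y)
```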

In my architecture, I also initialized the weights with He normal initialization, which worked better than Xavier uniform initialization. After some reading, I found that He normal was designed with ReLU activations in mind, whereas Xavier was intended for tanh and sigmoid activation functions.
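
In Keras this is only a matter of setting the kernel_initializer argument, for example (the filter count is illustrative):

```python
from keras.layers import Conv2D

# He normal initialization pairs well with ReLU activations.
conv = Conv2D(64, 3, padding='same', activation='relu',
              kernel_initializer='he_normal')
```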

I would also like to note that I modified the U-Net so that it keeps the original 101x101 px size.

Optimization

I wanted to monitor the progress of my model with the IoU score, so at the end of each epoch the score was computed on the whole validation set (and not progressively on each batch of data). For a while, my routine was to train a model from scratch with the Adam optimizer, reduce the learning rate to LR * 0.7 when the metric stopped improving on the validation set, change one thing, and start the process again.
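
A sketch of how this can be wired in Keras with a custom callback (the class is hypothetical and reuses the iou() function sketched earlier):

```python
import numpy as np
from keras.callbacks import Callback

class ValidationIoU(Callback):
    """Computes the IoU on the whole validation set at the end of each
    epoch, instead of Keras' batch-wise running average."""
    def __init__(self, x_val, y_val):
        super(ValidationIoU, self).__init__()
        self.x_val, self.y_val = x_val, y_val

    def on_epoch_end(self, epoch, logs=None):
        preds = self.model.predict(self.x_val)
        scores = [iou(m, p) for m, p in zip(self.y_val, preds)]
        if logs is not None:
            logs['val_iou'] = np.mean(scores)

# Placed before ReduceLROnPlateau(monitor='val_iou', mode='max', factor=0.7)
# in the callbacks list, this lets the learning-rate schedule follow the
# validation IoU rather than the loss.
```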

Ensemble methods have already proven their efficiency on machine learning problems. However, training takes a long time, so how can I get a better model more quickly? I got inspired by this paper: "Snapshot Ensembles: Train 1, get M for free".

Illustration of classic SGD vs Snapshot Ensembling

So I trained the same model again and again: once it reached a local minimum with no further improvement, I saved the model, reinitialized the learning rate, and started over from the weights of the last model. It is slightly different from the algorithm described in the paper, which saves a model every N epochs and reduces the learning rate with a cyclic schedule. This method didn't work with the Adam optimizer, and this blog post explains why.
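
In rough Keras terms, my snapshot-style loop looked like the sketch below (it assumes `model`, `x_train`, `y_train`, `x_val`, and `y_val` already exist; file names and hyper-parameters are illustrative):

```python
from keras.optimizers import SGD
from keras.callbacks import EarlyStopping

n_snapshots, initial_lr = 5, 0.01
for cycle in range(n_snapshots):
    # Re-compiling resets the optimizer state and restores the learning rate.
    model.compile(optimizer=SGD(lr=initial_lr, momentum=0.9),
                  loss='binary_crossentropy')
    # Train until the validation metric plateaus, then save a snapshot.
    model.fit(x_train, y_train,
              validation_data=(x_val, y_val),
              epochs=50,
              callbacks=[EarlyStopping(monitor='val_loss', patience=8)])
    model.save_weights('snapshot_%d.h5' % cycle)
```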

Post-processing

Before this competition, I just retrieved the predictions of my model and submitted them to the Kaggle platform. In the meantime, I learned that I could do some simple post-processing at almost no cost: test-time augmentation. The principle is simple: I take my image, flip it, feed it to my model, and then flip the resulting prediction back.

Test Time Augmentation used for the TGS Salt Identification challenge

Now, from a single picture, I can get several predictions and average this small ensemble to improve my final score.
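
A minimal sketch of this flip-based TTA (the array layout is an assumption):

```python
import numpy as np

def predict_tta(model, images):
    """Predict on the original batch and on the horizontally flipped batch,
    flip the second prediction back, then average the two probability maps.
    Assumes arrays shaped (batch, height, width, channels)."""
    preds = model.predict(images)
    preds_flipped = model.predict(np.flip(images, axis=2))  # flip along width
    return 0.5 * (preds + np.flip(preds_flipped, axis=2))
```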

Meta Ensembling

I went further: with snapshot ensembling, I could average the predictions of my models. Nevertheless, some models are better than others, so how can I get the most out of them? By letting a linear model learn the weights of the first-level models for me.

Stacking models: technique that combines multiple models via a meta-learner
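
A sketch of that stacking step (using logistic regression as the linear meta-learner is an assumption, not necessarily what I used):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_meta_learner(base_predictions, y_val):
    """Each base model predicts probabilities on the held-out validation
    set, and a simple linear meta-learner learns how to weight them
    per pixel."""
    # base_predictions: list of arrays (n_images, H, W); y_val: (n_images, H, W)
    features = np.stack([p.ravel() for p in base_predictions], axis=1)
    meta = LogisticRegression()
    meta.fit(features, y_val.ravel())
    return meta
```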

Overview of the pipeline

After all the elements I introduced, here is a summary diagram of the entire pipeline.

Data pipeline for the TGS Salt Identification challenge

Lessons I learned

At the end of this competition, I have several remarks to make.

My first comment is about the Adam optimizer. An adaptive learning rate should give it an edge over SGD, but it doesn't work well with snapshot ensembling, and judging by the other competitors, it seems that in practice SGD with momentum works better.

My second remark concerns Keras. In the end, I got the impression that it adds a black box around my model training. When I use the fit method, I cannot monitor what is going on with a specific variable without adding callback functions. Moreover, recent implementations from researchers use the low-level TensorFlow API or PyTorch. I think that next time I will use the low-level TensorFlow API to get better control over the training and my variables.

My next point concerns a random seed problem. I wanted my model to be deterministic in order to see the influence of a single parameter. So I set the NumPy seed with np.random.seed(1) and the TensorFlow seed with tf.set_random_seed(2). Nevertheless, it didn't work. Before my next competition, I must find a way to get a deterministic model so I can reproduce results and also add unit tests.
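
For reference, the usual checklist goes a bit further than the two seeds I set (a sketch, assuming TensorFlow 1.x as used at the time):

```python
import os
import random
import numpy as np
import tensorflow as tf

# Everything I know of that needs seeding with TensorFlow 1.x / Keras.
# Even with all of this, some GPU operations (e.g. cuDNN convolutions)
# remain non-deterministic, which is likely why my runs still differed.
os.environ['PYTHONHASHSEED'] = '0'
random.seed(1)
np.random.seed(1)
tf.set_random_seed(2)
```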

My last notes are on my workflow.

  • I am using Git to keep a history of my source code, but I didn't do the same for my models, which can be too big and/or too numerous to push to GitHub.

  • My experiments take time, and sometimes I forgot what I had already done. I should take the time to document and track my work more closely.

  • Crashes happen, and in the long run they can waste a lot of time. I should find a way to unit test my code to improve my trust and confidence in it.

  • My visualizations weren't enough. For my next machine learning challenge, I should think about what I need to visualize to understand what is going on.

The ideal workflow would involve taking a snapshot before and after each experiment.

Ideal data workflow

People who use online GPU platforms get all these features, so why can't I have them on my machine? As you can see, there is a lot to improve.

Ranking

At the end of the challenge, the final ranking was calculated on 66% of the test data. I finished in 316th place out of 3,234 competitors.

Final ranking for the TGS Salt Identification Challenge

References and external links