This summer, I wanted to take part in a Kaggle competition. I looked at the latest featured challenges and chose a segmentation task. I had no prior experience with segmentation, so I knew I would learn a lot. This post is about the lessons I learned during this three-month adventure to rank in the top 10%.
It all starts with geologists who want to find vast deposits of salt below the Earth's surface. To do so, they use seismic reflection, a method of exploring the Earth's crust that produces an image of the subsurface, similar to an ultrasound.
The dataset for this competition consists of 101x101 pixel images in which each pixel is labeled as salt or not salt. The goal of the contest is therefore to segment the regions that contain salt.
With that in mind, I looked at my images: some contain only a tiny patch of salt that is easy to miss at a glance, some have a salt region that follows a curve across the image, and some contain no salt at all or, on the contrary, nothing but salt. I also had 4,000 images for my training set and had to predict 18,000 images. That is another challenge: my training set is much smaller than my test set.
Now, how are the participants ranked? According to the description, the score is based on the intersection over union (IoU) metric. It takes a set of pixels (the ones I predict as salt) and compares it to another set of pixels (the ones that really are salt).
It is an important metric for assessing performance in segmentation tasks, and it would be useful to compute a loss from it. However, this metric is not differentiable, and gradient-based optimization needs a differentiable objective to learn from.
In this competition, I studied and tested several loss functions to tell my model where it was wrong. I started with the log loss (also called binary cross-entropy).
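As a rough illustration, the log loss over a mask can be sketched in NumPy as follows. The function name and the `eps` clipping value are assumptions made for this example, not the code used in the competition.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean log loss over all pixels; y_prob holds salt probabilities in [0, 1]."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) on fully confident predictions (eps is an assumed value).
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    per_pixel = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(per_pixel.mean())
```

Every pixel contributes independently, which is why this loss says little about the overall shape of the predicted region.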
It is a classic choice, but I wanted to put more emphasis on mispredicted pixels. Indeed, if my model already predicts the correct class with a probability of 0.7, there is little to gain from pushing that probability to 0.99. So I considered the focal loss, which down-weights well-classified pixels and lets me control the slope of the log loss.
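A minimal sketch of the focal loss from "Focal Loss for Dense Object Detection", without the optional class-balancing factor alpha, could look like this. The default `gamma=2.0` is the value reported in that paper, not necessarily the one used here.

```python
import numpy as np

def focal_loss(y_true, y_prob, gamma=2.0, eps=1e-7):
    """Log loss scaled by (1 - p_t)^gamma, where p_t is the probability of the true class."""
    y_true = np.asarray(y_true)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    # p_t: probability the model assigns to the correct class of each pixel.
    p_t = np.where(y_true == 1, y_prob, 1.0 - y_prob)
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))
```

With `gamma=0` this reduces to the plain log loss; larger values shrink the contribution of pixels that are already well classified.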
These losses work well pixel-wise but are poor indicators of the quality of the segmentation. So I looked for objective functions closer to the IoU.
With the truth labels (Mask) and the predicted labels (Prediction), the IoU is defined as IoU = |Mask ∩ Prediction| / |Mask ∪ Prediction|.
The IoU produces a value between 0 and 1. For this equation to work, the mask set and the prediction set must contain only 1s and 0s. Mask contains these values, but Prediction holds probabilities between 0 and 1. To convert them into binary values, we need the argmax function or a threshold, and neither of these is differentiable.
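To make the thresholding step concrete, here is a small NumPy sketch of the IoU between a binary mask and a probability map. The 0.5 threshold and the convention of returning 1.0 when both sets are empty are assumptions for this example.

```python
import numpy as np

def iou_score(mask, prob, threshold=0.5):
    """IoU between a binary mask and a thresholded probability map."""
    mask = np.asarray(mask).astype(bool)
    pred = np.asarray(prob) > threshold  # hard decision: this step has no useful gradient
    union = np.logical_or(mask, pred).sum()
    if union == 0:
        return 1.0  # assumed convention: no salt in the mask and none predicted
    return float(np.logical_and(mask, pred).sum() / union)
```

For example, a mask `[[1, 1], [0, 0]]` against probabilities `[[0.9, 0.2], [0.8, 0.1]]` shares one salt pixel out of three in the union, so the score is 1/3.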
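So I got interested in the Dice coefficient, which is similar to the IoU but can be differentiated. In its common form, the Dice loss is defined by Dice loss = 1 - 2 |Mask ∩ Prediction| / (|Mask| + |Prediction|), where the intersection is computed as the sum of the element-wise product of Mask and Prediction, so Prediction can stay a probability map.
It has been used in medical applications to deal with unbalanced classes. You can find more details in "V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation".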
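A soft Dice loss can be sketched as below. This is one common variant, with plain sums in the denominator and an assumed `smooth` term to avoid division by zero. The V-Net paper itself uses squared terms in the denominator, so treat this as an illustration rather than the exact loss used here.

```python
import numpy as np

def dice_loss(mask, prob, smooth=1.0):
    """1 - soft Dice coefficient, computed directly on probabilities (no threshold)."""
    mask = np.asarray(mask, dtype=float)
    prob = np.asarray(prob, dtype=float)
    intersection = np.sum(mask * prob)  # soft overlap between mask and prediction
    dice = (2.0 * intersection + smooth) / (mask.sum() + prob.sum() + smooth)
    return float(1.0 - dice)
```

Because only sums and products are involved, the loss stays differentiable with respect to the predicted probabilities.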
Finally, some discussions during the competition mentioned the Lovász hinge loss. After reading the referenced paper, "The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks", I still didn't fully understand how it works, mostly because I think I lack knowledge about the mathematical properties of convex functions and geometry. However, I used it, and in practice I got better results.
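For readers who want to see the mechanics, here is a NumPy port of the binary Lovász hinge as I understand it from the paper and its reference implementation. It is a sketch for a single flattened image and works on raw logits (before the sigmoid), not on probabilities.

```python
import numpy as np

def lovasz_grad(gt_sorted):
    """Gradient of the Lovász extension of the Jaccard loss w.r.t. sorted errors."""
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    # Turn cumulative Jaccard losses into per-pixel increments.
    jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def lovasz_hinge(logits, labels):
    """Binary Lovász hinge loss for one image."""
    logits = np.asarray(logits, dtype=float).ravel()
    labels = np.asarray(labels, dtype=float).ravel()
    signs = 2.0 * labels - 1.0          # map {0, 1} labels to {-1, +1}
    errors = 1.0 - logits * signs       # hinge margins, large when the pixel is wrong
    order = np.argsort(-errors)         # sort pixels by decreasing error
    grad = lovasz_grad(labels[order])
    return float(np.dot(np.maximum(errors[order], 0.0), grad))
```

The idea is that the worst-predicted pixels are weighted by how much each of them changes the Jaccard (IoU) loss, which ties the surrogate to the metric itself.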
During this challenge, I mostly combined losses. For a while, I used the focal loss together with the Dice loss to pair a pixel-wise loss with a segmentation loss. Then, when I discovered the Lovász hinge, I combined the log loss with it, as suggested in the paper.
Before going into the architecture of the neural net, I will briefly mention the data augmentation I used. I tried several ways to augment my 3,600 training images (the remaining 400 images were kept for validation). In the end, simpler was better, so I used a horizontal flip and a translation with reflection to triple the size of my training set.
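The two augmentations can be sketched with NumPy as follows. The shift direction and amount are assumptions, since the exact values are not stated above, and `translate_reflect` is a hypothetical helper name.

```python
import numpy as np

def horizontal_flip(image):
    """Mirror the image left to right."""
    return image[:, ::-1]

def translate_reflect(image, shift):
    """Shift the image right by `shift` pixels (left if negative).

    The columns uncovered by the shift are filled by reflecting the image.
    """
    pad = abs(shift)
    padded = np.pad(image, ((0, 0), (pad, pad)), mode="reflect")
    start = pad - shift
    return padded[:, start:start + image.shape[1]]
```

Whatever transform is applied to an image must also be applied to its mask, otherwise the labels no longer line up with the pixels.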
Deep learning architecture
Concerning the architecture of the neural net, it is based on the U-Net architecture introduced in "U-Net: Convolutional Networks for Biomedical Image Segmentation". It is a fully convolutional network composed of two main parts: a downsampling part and an upsampling part. [IMAGE] Its particularity is that U-Net is symmetric and creates 'skip connections' between the downsampling path and the upsampling path. The goal of these skip connections is to provide local information while upsampling.
I quickly customized the block of layers (purple) at each level to improve my model.
I used residual blocks in my model architecture. As a reminder, when a network gets deeper, we run into vanishing gradient and degradation problems. The residual block addresses these problems by creating a shortcut that allows the network
to push the residual branch to 0 if the input is already optimal
to carry a better gradient back to the input.
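The shortcut idea can be shown with a toy NumPy sketch. The branch here is two 1x1 convolutions (plain channel mixing) with a ReLU in between, which is an assumption chosen to keep the example short, not the layers used in the actual model.

```python
import numpy as np

def make_branch(w1, w2):
    """Residual branch F: two 1x1 convolutions with a ReLU, on (H, W, C) features."""
    def branch(x):
        hidden = np.maximum(x @ w1, 0.0)  # 1x1 conv is a matmul over the channel axis
        return hidden @ w2
    return branch

def residual_block(x, branch):
    """Output is the input plus the residual branch: x + F(x)."""
    return x + branch(x)
```

If the weights of the last layer in the branch are driven to zero, the block becomes the identity, which is the first property in the list above.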
A colleague told me about another idea named Squeeze-and-Excitation. The main idea is to produce channel-wise weights for the output of a convolution layer. For example, in an RGB image, if we are looking for something yellow, we will weight the red and green channels more than the blue channel. The same idea is applied to the output of a convolutional layer.
That said, I combined the two concepts and got the following block.
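A minimal NumPy sketch of the squeeze-and-excitation step is given below. Biases are left out for brevity, and the weight matrices `w_reduce` and `w_expand` are assumed to be learned elsewhere.

```python
import numpy as np

def squeeze_excite(features, w_reduce, w_expand):
    """Rescale each channel of an (H, W, C) feature map by a learned weight in (0, 1).

    w_reduce has shape (C, C // r) and w_expand has shape (C // r, C).
    """
    squeezed = features.mean(axis=(0, 1))                  # squeeze: one value per channel
    hidden = np.maximum(squeezed @ w_reduce, 0.0)          # bottleneck with ReLU
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w_expand)))   # excitation: sigmoid gate
    return features * weights                              # broadcast over H and W
```

The gate depends on the whole feature map, so the network can learn to amplify the channels that matter for a given image and dampen the others.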
This block is used before reducing the image shape by max pooling in the encoder part and before upscaling the image in the decoder part.
In my architecture, I also initialized the weights with He normal initialization, which worked better than Xavier uniform initialization. After some reading, I found that He normal was designed with ReLU activations in mind, whereas Xavier was designed for tanh and sigmoid activation functions.
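The difference between the two schemes comes down to how the spread of the weights is chosen. The sketch below uses the standard formulas for a dense layer; note that it draws from a plain normal distribution, whereas framework implementations may use a truncated one.

```python
import numpy as np

def he_normal(fan_in, fan_out, rng):
    """He initialization: normal with std = sqrt(2 / fan_in)."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier (Glorot) initialization: uniform in [-limit, limit]."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))
```

The factor 2 in the He variance compensates for ReLU zeroing out roughly half of the activations.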
I also would like to note that I modified the U-net to keep the 101x101 px original size.
I wanted to monitor my model's progress with the IoU score, so at the end of each epoch the score was computed on the whole validation set (and not progressively on each batch of data). For a while, I trained a model from scratch with an Adam optimizer, multiplied the learning rate by 0.7 whenever the metric stopped improving on the validation set, changed one thing, and started the process again.
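The learning rate rule can be sketched in plain Python as below. The `patience` value (how many epochs without improvement to tolerate) is an assumption, as it is not stated above, and the class name is hypothetical.

```python
class ReduceOnPlateau:
    """Multiply the learning rate by `factor` when validation IoU stops improving."""

    def __init__(self, lr, factor=0.7, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience  # assumed value
        self.best = float("-inf")
        self.wait = 0

    def step(self, val_iou):
        """Call once per epoch with the validation IoU; returns the learning rate to use."""
        if val_iou > self.best:
            self.best = val_iou
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.wait = 0
        return self.lr
```

Keras provides a `ReduceLROnPlateau` callback with `factor` and `patience` arguments that implements the same idea.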
Ensemble methods have already proved their efficiency in machine learning problems. However, training takes so long; how can I get a better model more quickly? I was inspired by this paper: "Snapshot Ensembles: Train 1, get M for free".
So I trained a single model again and again: once it reached a local minimum with no further improvement, I saved the model, reset the learning rate, and started over from the weights of the last model. This is slightly different from the algorithm described in the paper, which saves a model every N epochs and lowers the learning rate with a cyclic schedule. This method didn't work with the Adam optimizer, and this blog post explains why.
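For reference, the cyclic schedule from the Snapshot Ensembles paper (not the restart-on-plateau variant described above) is a cosine that is reset at the start of each cycle. A sketch, with iterations counted from 1 as in the paper:

```python
import math

def snapshot_lr(iteration, total_iterations, n_cycles, lr_max):
    """Cosine-annealed learning rate that restarts at lr_max every cycle."""
    cycle_len = math.ceil(total_iterations / n_cycles)
    position = (iteration - 1) % cycle_len  # progress inside the current cycle
    return lr_max / 2.0 * (math.cos(math.pi * position / cycle_len) + 1.0)
```

A snapshot of the model is saved at the end of each cycle, when the learning rate is at its lowest, and the saved models are then used together as an ensemble.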
Before this competition, I simply took my model's predictions and submitted them to the Kaggle platform. In the meantime, I learned that I could do some simple post-processing at no extra training cost: test-time augmentation. The principle is simple: I take my image, flip it, feed it to my model, and then flip the result back.
Now, from a single picture, I can get several predictions and average them to improve my final score.
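Test-time augmentation with a horizontal flip fits in a few lines. In this sketch, `predict` is a hypothetical function that maps an (H, W) image to an (H, W) probability map.

```python
import numpy as np

def predict_with_tta(predict, image):
    """Average the prediction on the image with the un-flipped prediction on its mirror."""
    plain = predict(image)
    # Flip the input, predict, then flip the output back so the pixels line up again.
    flipped = predict(image[:, ::-1])[:, ::-1]
    return (plain + flipped) / 2.0
```

The averaging happens on probabilities, before any threshold is applied to produce the final mask.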
I went further: with snapshot ensembling, I could average the predictions of my models. However, some models are better than others, so how can I get the most out of them? By letting a linear model weight the individual models for me.
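One way to let a linear model choose the weights is an ordinary least-squares fit on held-out predictions. This is a sketch under that assumption; the post does not say which linear model or which data were used, and both function names are hypothetical.

```python
import numpy as np

def fit_ensemble_weights(model_probs, mask):
    """Least-squares weights that best reproduce the mask from each model's probabilities."""
    features = np.stack([np.asarray(p, dtype=float).ravel() for p in model_probs], axis=1)
    target = np.asarray(mask, dtype=float).ravel()
    weights, *_ = np.linalg.lstsq(features, target, rcond=None)
    return weights

def blend(model_probs, weights):
    """Weighted sum of the models' probability maps, clipped back to [0, 1]."""
    stacked = np.stack(model_probs, axis=-1)
    return np.clip(stacked @ weights, 0.0, 1.0)
```

The weights should be fitted on data the models were not trained on, such as the validation set, so that stronger models earn larger weights for the right reason.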
Overview of the pipeline
After everything I have introduced, I will summarize with a diagram of the entire pipeline.
Lessons I learned
At the end of this competition, I have several remarks to make.
My first comment is about the Adam optimizer. An adaptive learning rate should give an edge over SGD, but it doesn't work with snapshot ensembling, and judging by what other competitors did, SGD with momentum seems to work better in practice.
My second remark is about Keras. In the end, I had the impression of adding a black box around my model training. When I use the fit method, I cannot monitor what is going on with a specific variable without adding callback functions. Moreover, recent implementations by researchers use the low-level API of TensorFlow or PyTorch. I think next time I will use the low-level API of TensorFlow to get better control over the training and my variables.
My next point concerns a random seed problem. I wanted my model to be deterministic so I could see the influence of a single parameter. So I set the NumPy random seed with np.random.seed(1) and the TensorFlow seed with tf.set_random_seed(2). Nevertheless, it didn't work. Before my next competition, I must find a way to get a deterministic model so that I can reproduce results and add unit tests.
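As a starting point, the sources of randomness outside the deep learning framework can be fixed with a small helper like the hypothetical `seed_everything` below. It only covers Python's `random` module and NumPy. The TensorFlow seed still has to be set separately, and some GPU operations may remain nondeterministic even with every seed fixed, which is a possible (unconfirmed) reason the seeds alone were not enough.

```python
import random

import numpy as np

def seed_everything(seed):
    """Seed Python's and NumPy's global random generators (framework seeds not included)."""
    random.seed(seed)
    np.random.seed(seed)
```

Calling it at the very top of the training script, before any data shuffling or augmentation, makes at least the data pipeline repeatable from run to run.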
My last notes are on my workflow.
I use Git to keep a history of my source code, but I didn't do the same for my models, which can be too big and/or too numerous to push to GitHub.
My experiments take time, and sometimes I forgot what I had already done. I should take the time to write better documentation and track my work more closely.
Crashes happen, and in the long run they can waste a lot of time. I should find a way to unit test my code to improve my trust and confidence in it.
The visualization wasn't sufficient. For my next machine learning challenge, I should think about what I need to visualize to understand what is going on.
The ideal workflow would involve taking a snapshot before and after each experiment.
People who use online GPUs have all these features, so why can't I have them on my machine? As you can see, there is a lot to improve.
At the end of this challenge, the final ranking was calculated with 66% of the test data. I finished in 316th place out of 3,234 in this competition.
References and external links
"The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks", https://arxiv.org/abs/1705.08790v2
"U-Net: Convolutional Networks for Biomedical Image Segmentation", https://arxiv.org/abs/1505.04597
"Focal Loss for Dense Object Detection", https://arxiv.org/abs/1708.02002
"V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation", https://arxiv.org/abs/1606.04797
"Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", https://arxiv.org/abs/1502.01852
"Snapshot Ensembles: Train 1, get M for free", https://arxiv.org/abs/1704.00109