Which neural network model for segmentation?

We have been asked a number of times which segmentation neural network model to use with our Facial/Head segmentation dataset. So we thought we would compare a number of state-of-the-art models and see how they fare against our own internal model. For this we sampled nine random facial images that are not part of our training dataset and checked the visual quality of the returned labelled images.

For this study we compare the following models:

  • A Fully Convolutional Network (FCN) with a ResNet50 "encoder" backend and no post-processing. The backend was pre-trained on the ImageNet dataset.
  • The same FCN with a ResNet101 "encoder" backend. This time the backend and the FCN layers were trained together from scratch.
  • The Pyramid Scene Parsing Network (PSPNet) by Zhao et al., using the same ResNet50 backend as the FCN.
  • PSPNet with a ResNet101 backend, again trained from scratch.
  • DeepLabV3, which uses Atrous Spatial Pyramid Pooling, again using the same ResNet50 and ResNet101 backends as PSPNet and the FCN.
  • Our own model, a ResNet-33-type encoder with some tweaks and a multi-stage decoder network, trained end-to-end.
  • Finally, a simple encoder/decoder network with an additional adversarial loss network, which we will call PIXGAN.
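
As a rough sketch of the setup (assuming a recent torchvision, version 0.13 or later, and a placeholder class count; this is not our original training code), the FCN and DeepLabV3 baselines can be pulled straight from torchvision, while PSPNet, PIXGAN and our own network are custom models:

```python
# Minimal sketch (not our original training code) of how the off-the-shelf
# baselines can be instantiated with torchvision >= 0.13. PSPNet, PIXGAN and
# our internal network are custom models and are not shown here.
import torchvision.models.segmentation as seg
from torchvision.models import ResNet50_Weights

NUM_CLASSES = 11  # placeholder: the number of labels in the face/head dataset

models = {
    # ResNet50 backend pre-trained on ImageNet, segmentation head trained on our data
    "fcn_resnet50": seg.fcn_resnet50(
        weights=None, weights_backbone=ResNet50_Weights.IMAGENET1K_V1, num_classes=NUM_CLASSES),
    # ResNet101 backend, backbone and head trained together from scratch
    "fcn_resnet101": seg.fcn_resnet101(
        weights=None, weights_backbone=None, num_classes=NUM_CLASSES),
    "deeplabv3_resnet50": seg.deeplabv3_resnet50(
        weights=None, weights_backbone=ResNet50_Weights.IMAGENET1K_V1, num_classes=NUM_CLASSES),
    "deeplabv3_resnet101": seg.deeplabv3_resnet101(
        weights=None, weights_backbone=None, num_classes=NUM_CLASSES),
}
```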

For all networks except the last one (PIXGAN), a softmax cross-entropy loss was used.
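
For illustration, a per-pixel softmax cross-entropy loss set up in PyTorch looks like the following; the class count of 11 is a placeholder and this is not our original training code:

```python
import torch
import torch.nn as nn

# Per-pixel softmax cross-entropy, as used for all networks except PIXGAN.
# logits: (N, C, H, W) raw class scores, target: (N, H, W) integer label map.
criterion = nn.CrossEntropyLoss()  # applies log-softmax over the class dimension internally

logits = torch.randn(2, 11, 256, 256, requires_grad=True)  # 11: placeholder class count
target = torch.randint(0, 11, (2, 256, 256))
loss = criterion(logits, target)
loss.backward()
```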

First batch of images

Here are the results for the first three test images: three different ethnicities with relatively simple backgrounds and composition. Note that wearable items should be marked as background, as they are in the training set.

[Image grid: source images 1-3 and the corresponding segmentation results from FCN ResNet50, FCN ResNet101, PSP ResNet50, PSP ResNet101, DeepLabV3 ResNet50, DeepLabV3 ResNet101, PIXGAN, and our own network.]

From the first three test images we can see that neither FCN, PSP nor DeepLabV3 labels the smaller/finer details such as eyebrows, eye shapes, mouth and teeth correctly, even with deeper backbone networks. This is not really surprising, as the simple upscaling scheme is simply not able to restore these details from the bottom layer. The increased receptive field of the pyramid pooling scheme in both models does not really help; the gain of having a better decoder at the cost of a smaller encoder network is far more significant. With these three examples it is even arguable whether the pyramid scheme helps at all, as the results are not significantly better than those of the Fully Convolutional Network on this dataset. PIXGAN produces results that somewhat resemble a face, but the quality is extremely poor compared to the others, so it is not advisable as a solution for this kind of dataset.
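
To make the upscaling point concrete, here is a small sketch (assuming torchvision's FCN implementation, which may differ from the exact models we trained): the per-class logits live on a grid roughly eight times coarser than the input and are simply upsampled bilinearly, which is why eyebrow- and teeth-sized structures get smeared.

```python
import torch
from torchvision.models.segmentation import fcn_resnet50

# The dilated ResNet backbone produces features at 1/8 of the input resolution;
# the final prediction is just a bilinear upsample of that coarse grid.
model = fcn_resnet50(weights=None, num_classes=11).eval()  # 11: placeholder class count
x = torch.randn(1, 3, 512, 512)
with torch.no_grad():
    feats = model.backbone(x)["out"]    # coarse backbone features
    logits = model.classifier(feats)    # per-class scores on the coarse grid
    full = model(x)["out"]              # upsampled back to the input size
print(feats.shape)   # torch.Size([1, 2048, 64, 64])
print(logits.shape)  # torch.Size([1, 11, 64, 64])
print(full.shape)    # torch.Size([1, 11, 512, 512])
```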

Second batch of images

The second batch of examples starts off with a really tough case: a woman whose hair covers most of her face and therefore occludes many facial parts. The other two are a frontal pose and a side portrait with different amounts of detail visible.

[Image grid: source images 4-6 and the corresponding segmentation results from FCN ResNet50, FCN ResNet101, PSP ResNet50, PSP ResNet101, DeepLabV3 ResNet50, DeepLabV3 ResNet101, PIXGAN, and our own network.]

None of the segmentation networks is really able to recover a good result on the first image, which probably indicates that there are not enough similar cases in the training set. PIXGAN produces a rather funny result. Again, a deeper backend seems to help a bit for FCN and PSP, but the fine details are missing. Our own network is the only one that can actually recover some of them, and it also produces rather sharp detail edges, especially on the third image in this batch.

Third batch of images

Another diverse batch, with varying difficulty, different headwear, head poses and lighting conditions.

[Image grid: source images 7-9 and the corresponding segmentation results from FCN ResNet50, FCN ResNet101, PSP ResNet50, PSP ResNet101, DeepLabV3 ResNet50, DeepLabV3 ResNet101, PIXGAN, and our own network.]

This batch shows more or less the same characteristics as the first one. The differences between the Fully Convolutional Network and the more complex PSP and DeepLabV3 are not that major; sometimes one looks better, sometimes the other. A deeper backbone network seems to help. Our network is the only one able to recover the finer detail in the labels. PIXGAN, even though it produces something close to a plausible face in this case, delivers results that are pretty much unusable.

Summary

  • For this kind of dataset, the extra effort of the pyramid schemes in PSP and DeepLabV3 does not really pay off compared to the standard Fully Convolutional Network.
  • PIXGAN, at least in this simplistic network form, is not worth pursuing as a solution.
  • Deeper backbone networks as encoders seem to help.
  • If you do not want to go deeper, spend more time on the decoder part; the gain seems larger, at least in terms of recovering detail (see the sketch below).
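
The sketch below is a hypothetical illustration of what "spending time on the decoder" can mean in practice (it is not our actual decoder): instead of one big bilinear upsample, each decoder stage upsamples, fuses a skip connection from the matching encoder resolution and refines, which is what lets fine structures survive.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    """One illustrative decoder stage: upsample, fuse an encoder skip, refine."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        # Upsample the deep features to the skip's resolution, then merge and refine.
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([x, skip], dim=1))

# Hypothetical shapes: deep features at 1/16 resolution, an encoder skip at 1/8.
deep = torch.randn(1, 256, 32, 32)
skip = torch.randn(1, 64, 64, 64)
stage = DecoderStage(in_ch=256, skip_ch=64, out_ch=128)
print(stage(deep, skip).shape)  # torch.Size([1, 128, 64, 64])
```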

Go to part 2 to see if we can improve the results of PSP and DeepLabV3.

If you are interested in any of the trained models, get in contact with us: simply write an email to