Image Segmentation

Meena Vyas
Jan 12, 2019


In computer vision, image segmentation is the process of partitioning an image into multiple segments and associating every pixel in an input image with a class label.

Semantic segmentation algorithms are used in self-driving cars.

I got intrigued by this post by Lex Fridman on driving scene segmentation and wanted to see whether it works on difficult, very different Indian terrain.

So I created a short video of Tawang, in Arunachal Pradesh, India. The video is 16 seconds long and contains around 325 frames.

Refer to the IPython notebook I used. It is based on tutorial_driving_scene_segmentation.ipynb and downloads the ‘mobilenetv2_coco_cityscapes_trainfine’ model from TensorFlow.

The different models available in TensorFlow DeepLab are listed at https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/model_zoo.md

The MobileNet-v2 based model has been pre-trained on the MS-COCO dataset (and fine-tuned on Cityscapes) and does not employ the ASPP and decoder modules, for faster computation. You can choose any pre-trained TensorFlow model that suits your needs.
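For reference, here is a minimal sketch of how the frozen model can be downloaded and loaded, in the style of the official DeepLab demo notebook. It is written against the TensorFlow 1.x API that the notebook used; the tarball filename, input size, and tensor names are assumptions taken from the demo and model zoo, so verify them against the current model zoo page.

```python
import os
import tarfile
import urllib.request

import numpy as np
import tensorflow as tf
from PIL import Image

# Assumed download URL for 'mobilenetv2_coco_cityscapes_trainfine';
# check the model zoo page if the filename has changed.
_DOWNLOAD_URL = ('http://download.tensorflow.org/models/'
                 'deeplabv3_mnv2_cityscapes_train_2018_02_05.tar.gz')
_TARBALL = 'deeplabv3_mnv2_cityscapes_train.tar.gz'


class DeepLabModel(object):
    """Loads a frozen DeepLab graph and runs inference on PIL images."""
    INPUT_TENSOR_NAME = 'ImageTensor:0'
    OUTPUT_TENSOR_NAME = 'SemanticPredictions:0'
    INPUT_SIZE = 513

    def __init__(self, tarball_path):
        self.graph = tf.Graph()
        graph_def = None
        # The frozen inference graph sits inside the downloaded tarball.
        with tarfile.open(tarball_path) as tar:
            for member in tar.getmembers():
                if 'frozen_inference_graph' in os.path.basename(member.name):
                    graph_def = tf.GraphDef.FromString(
                        tar.extractfile(member).read())
                    break
        with self.graph.as_default():
            tf.import_graph_def(graph_def, name='')
        self.sess = tf.Session(graph=self.graph)

    def run(self, image):
        """Returns (resized input image, 2-D array of per-pixel class ids)."""
        width, height = image.size
        ratio = self.INPUT_SIZE / max(width, height)
        target = (int(ratio * width), int(ratio * height))
        resized = image.convert('RGB').resize(target, Image.ANTIALIAS)
        seg_map = self.sess.run(
            self.OUTPUT_TENSOR_NAME,
            feed_dict={self.INPUT_TENSOR_NAME: [np.asarray(resized)]})[0]
        return resized, seg_map


# Download once, then reuse the local tarball.
if not os.path.exists(_TARBALL):
    urllib.request.urlretrieve(_DOWNLOAD_URL, _TARBALL)
model = DeepLabModel(_TARBALL)
```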

This model classifies pixels into the following classes:

'road', 'sidewalk', 'building', 'wall', 'fence', 'pole', 'traffic light', 'traffic sign', 'vegetation', 'terrain', 'sky', 'person', 'rider', 'car', 'truck', 'bus', 'train', 'motorcycle', 'bicycle'

The color coding of the classes is shown below.
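As a rough sketch, the per-pixel class ids returned by the model can be mapped to colors using the standard Cityscapes palette; the RGB values below follow the usual Cityscapes convention, which is what DeepLab's visualization utilities use.

```python
import numpy as np

LABEL_NAMES = np.asarray([
    'road', 'sidewalk', 'building', 'wall', 'fence', 'pole', 'traffic light',
    'traffic sign', 'vegetation', 'terrain', 'sky', 'person', 'rider', 'car',
    'truck', 'bus', 'train', 'motorcycle', 'bicycle'])

# Standard Cityscapes palette: one RGB triple per class, same order as above.
CITYSCAPES_COLORMAP = np.asarray([
    [128,  64, 128], [244,  35, 232], [ 70,  70,  70], [102, 102, 156],
    [190, 153, 153], [153, 153, 153], [250, 170,  30], [220, 220,   0],
    [107, 142,  35], [152, 251, 152], [ 70, 130, 180], [220,  20,  60],
    [255,   0,   0], [  0,   0, 142], [  0,   0,  70], [  0,  60, 100],
    [  0,  80, 100], [  0,   0, 230], [119,  11,  32]], dtype=np.uint8)


def label_to_color_image(seg_map):
    """Maps a 2-D array of class ids to an RGB image using the palette."""
    return CITYSCAPES_COLORMAP[seg_map]
```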

The same color coding is used for the same classes in all the pictures. I ran the model on two images first; here is the output.

Then I ran the model on the whole video. Here is the link to the output video.
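For the video, the processing is simply the single-image pipeline applied frame by frame. Below is a sketch of how this might look with OpenCV; `model.run` and `label_to_color_image` refer to the sketches above, and the input/output file names are placeholders.

```python
import cv2
import numpy as np
from PIL import Image


def segment_video(model, in_path='tawang.mp4', out_path='tawang_segmented.mp4'):
    """Runs the segmentation model on every frame and writes a color-coded video."""
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    writer = None
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        # OpenCV yields BGR frames; the model expects an RGB PIL image.
        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        resized, seg_map = model.run(Image.fromarray(frame_rgb))
        overlay = label_to_color_image(seg_map)
        # Blend the class colors with the (resized) input frame for visualization.
        blended = (0.5 * np.asarray(resized) + 0.5 * overlay).astype(np.uint8)
        blended_bgr = cv2.cvtColor(blended, cv2.COLOR_RGB2BGR)
        if writer is None:
            h, w = blended_bgr.shape[:2]
            writer = cv2.VideoWriter(out_path,
                                     cv2.VideoWriter_fourcc(*'mp4v'),
                                     fps, (w, h))
        writer.write(blended_bgr)
    cap.release()
    if writer is not None:
        writer.release()
```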

Comparisons of some of the images and their outputs are shown in the table below.

Thoughts

  • The model is trained on city images, but the input I gave it was from a mountainous region. We may get better results on city videos.
  • We are using a 2D segmentation model. A 3D segmentation model, which can also estimate depth, would likely give better results.
  • We run the model on each frame of the video independently, so we are not taking advantage of the correlation between consecutive frames (see the sketch after this list).
  • The model wasn't trained on mountains or lakes, as they are not among the training set's output classes.
  • The input video was not taken from a car dashboard, as the training footage was. We may get better results if we do so.
  • If objects are far away, prediction accuracy may not be good.
  • Some roads are not cemented or tarred but are mud roads, and in these places there is no road/pavement distinction. The model was presumably trained on proper roads, so it would probably do better on those.
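On the point about consecutive frames: purely as an illustration, and not something the notebook does, one simple way to exploit temporal correlation is a per-pixel majority vote over a short sliding window of segmentation maps.

```python
from collections import deque

import numpy as np


def temporally_smoothed_maps(seg_maps, window=3, num_classes=19):
    """Yields per-pixel majority-vote class ids over a sliding window of frames.

    seg_maps: iterable of 2-D arrays of class ids, one per video frame,
    all with the same resolution (as produced by the per-frame model above).
    """
    recent = deque(maxlen=window)
    for seg_map in seg_maps:
        recent.append(seg_map)
        # Count how often each class was predicted at each pixel within the window.
        counts = np.zeros((num_classes,) + seg_map.shape, dtype=np.int32)
        for past_map in recent:
            for c in range(num_classes):
                counts[c] += (past_map == c)
        # Majority vote; ties resolve to the lowest class id.
        yield counts.argmax(axis=0)
```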


Originally published at meenavyas.wordpress.com
