shulavkarki/YOLO-v2-with-Pytorch

YOLO-v1-with-Pytorch

Overall Structure of YOLO v1 Network:

image

  1. Use a convolution network for feature extraction.
  2. Use fully connected layers to predict output probabilities and coordinates.

Let's break down each of them.

1. Convolution Network

In the YOLO paper, the authors pretrain the first 20 convolution layers on the ImageNet dataset.
Here the convolution layers are used for feature extraction.

Obviously, I'm not going to pretrain on ImageNet, because it would take a lot of time to train (maybe weeks). Nonetheless, one can use any pretrained CNN such as ResNet, or the others mentioned in the paper.

However, here I've used a custom architecture (Archi.config in the repo) to train an image classifier on a custom dataset (pizza vs. sandwich).
If you look at the architecture, I've used a convolution with a higher number of channels followed by a convolution with a lower number of channels, because this reduces the amount of computation and adds non-linearity to the model.
Result of the image classifier:

| Optimizer | Epochs | Learning Rate | Training Accuracy | Testing Accuracy |
|---|---|---|---|---|
| SGD (stochastic gradient descent) | 50 | 0.0001 | 93.50% | 91.41% |

The trained model is in the ./Saved Models/ folder. You can pretrain the CNN on your own dataset.
The training and testing code is in the classifer.ipynb file within the repo.
Also, you can check out this repo.

2. FC Layers for Prediction

YOLOv1 frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
The last convolution layer outputs a tensor of shape (7, 7, 1024). This tensor is flattened and passed through 2 FC layers, which act as a form of linear regression. The output parameters are then reshaped into (7, 7, 30).
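The shape bookkeeping above can be checked with a few lines of plain Python. This is only a sketch of the arithmetic, not the repo's code: the hidden width of 4096 is the value used in the YOLOv1 paper, and `reshape_to_grid` is a hypothetical helper written for illustration.

```python
# Shape bookkeeping for the YOLOv1 prediction head (standard library only).
S, B, C = 7, 2, 20          # grid size, boxes per cell, classes (paper values)
FEATURE_CHANNELS = 1024     # channels of the last convolution layer
HIDDEN_UNITS = 4096         # width of the first FC layer in the paper (assumption for this repo)

flat_in = S * S * FEATURE_CHANNELS   # inputs to the first FC layer
flat_out = S * S * (5 * B + C)       # outputs of the second FC layer


def reshape_to_grid(flat, s, depth):
    """Reshape a flat list of s*s*depth numbers into nested [row][col][depth] lists."""
    if len(flat) != s * s * depth:
        raise ValueError("flat vector has the wrong length")
    return [
        [flat[(r * s + c) * depth:(r * s + c + 1) * depth] for c in range(s)]
        for r in range(s)
    ]


# Stand-in for the network output: just the indices 0..flat_out-1.
grid = reshape_to_grid(list(range(flat_out)), S, 5 * B + C)
print(flat_in, HIDDEN_UNITS, flat_out)
print(len(grid), len(grid[0]), len(grid[0][0]))
```

So the two FC layers map 7*7*1024 features to 4096 hidden units and then to 7*7*30 outputs, which are viewed as a (7, 7, 30) volume.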

Now let's look at the workflow or how the model gets trained.

  1. In the paper, the image is (virtually) divided into an S*S grid. The authors use S = 7.
image

  2. The output is S*S*(5B+C).
Since S = 7, the image is divided into a 7*7 grid.
So, for each grid cell, the size of the output is 5B+C.
Terms:
B = number of bounding boxes per grid cell
C = number of class probabilities
If we consider B=1 and C = n classes, then 1 bounding box is predicted for each grid cell. It looks something like this:
image
If we consider B=2, then:
image
This means each grid cell predicts 2 bounding boxes, each defined by (x, y, w, h), which are the center, width and height of the bounding box, plus a confidence score. That makes 5 values per box.
Therefore, the output is a flattened vector of size S*S*(5B+C). The 30 in the fully connected layer is (5B+C), where the authors use B=2 and C=20 classes (it can predict up to 20 classes).
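To make the 5B+C layout concrete, here is a minimal sketch that splits one grid cell's vector into its boxes and class probabilities. The ordering inside the vector (all boxes first as x, y, w, h, confidence, then the class scores) is an assumption for illustration; implementations order these values differently. `split_cell` is a hypothetical helper, not a function from the repo.

```python
def split_cell(vec, num_boxes, num_classes):
    """Split one cell's 5B+C values into B boxes and C class probabilities.

    Assumed layout: [x, y, w, h, conf] repeated B times, then C class scores.
    """
    expected = 5 * num_boxes + num_classes
    if len(vec) != expected:
        raise ValueError(f"expected {expected} values, got {len(vec)}")
    boxes = []
    for b in range(num_boxes):
        x, y, w, h, conf = vec[5 * b:5 * b + 5]
        boxes.append({"x": x, "y": y, "w": w, "h": h, "conf": conf})
    class_probs = list(vec[5 * num_boxes:])
    return boxes, class_probs


# One cell with B=2 boxes and C=20 classes (made-up numbers).
cell = [0.5, 0.5, 0.2, 0.3, 0.9,    # box 1
        0.4, 0.6, 0.1, 0.1, 0.2]    # box 2
cell += [0.0] * 20                   # 20 class scores
cell[10 + 3] = 1.0                   # mark class index 3

boxes, probs = split_cell(cell, num_boxes=2, num_classes=20)
print(len(cell), len(boxes), len(probs))
```

With S=7 there are 49 such cells, which gives the 7*7*30 = 1470 outputs of the last FC layer.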

Loss Function

source

Implementation

However, in this repo I've used YOLOv3 (but without the pipeline for three image scales). Unlike YOLOv1, where the final layer is a regressor, here a convolution layer is used as the final layer.

Considerations:

  • The classifier is used for feature extraction.
  • An extra layer is added to the classifier to produce the convolutional output.
  • The output should have dimension S*S*C, where S is the grid size and C is the number of channels.
  • Here S=13, meaning the image is divided into a 13 by 13 grid and the output has a height and width of 13*13 and C channels.
  • Here C=7, meaning the output has 7 channels. [1st channel: confidence score; 2nd to 5th channels: x, y, w, h; 6th to 7th channels: probability that the given object falls in each class.]
  • Here (x, y) is the center of the bounding box and (w, h) are the dimensions of the bounding box.
  • Since I've considered only two classes, there are only two channels after x, y, w and h.
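The channel layout in the list above can be sketched as a target volume builder. This is an illustration under stated assumptions, not the repo's data pipeline: boxes are given as (cx, cy, w, h, class_id) normalized to 0..1, and x, y are stored as offsets inside the grid cell (the YOLOv1 paper's convention). `encode_target` is a hypothetical name.

```python
S = 13          # grid size used in this repo
CHANNELS = 7    # [confidence, x, y, w, h, p(class 0), p(class 1)]


def encode_target(boxes, s=S, channels=CHANNELS):
    """Build an s x s x channels target volume from normalized boxes.

    Each box is (cx, cy, w, h, class_id) with values relative to the whole
    image. Storing x, y as offsets inside the cell is an assumption.
    """
    target = [[[0.0] * channels for _ in range(s)] for _ in range(s)]
    for cx, cy, w, h, class_id in boxes:
        col = min(int(cx * s), s - 1)   # clamp so cx == 1.0 stays in the grid
        row = min(int(cy * s), s - 1)
        cell = target[row][col]
        cell[0] = 1.0                   # channel 1: an object is present
        cell[1] = cx * s - col          # channel 2: x offset inside the cell
        cell[2] = cy * s - row          # channel 3: y offset inside the cell
        cell[3], cell[4] = w, h         # channels 4-5: box width and height
        cell[5:] = [0.0] * (channels - 5)
        cell[5 + class_id] = 1.0        # channels 6-7: one-hot class
    return target


# One object of class 1 centered in the image.
target = encode_target([(0.5, 0.5, 0.2, 0.4, 1)])
print(target[6][6])
```

Because there is a single box slot per cell, two objects whose centers fall in the same cell overwrite each other in this sketch.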

Architecture:

  1. Classifier Network
    image

  2. Object Detection Network
    image

Actual Volume Interpretation:
image

Loss Function

The Yolov1 loss function is used in this implementation.

Yolo Loss Function looks something like this:

image
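Assuming the figure shows the loss from the YOLOv1 paper, it can be written out as below. The weights given in the paper are \lambda_{coord} = 5 and \lambda_{noobj} = 0.5; the values used in this repo may differ. Each hatted and unhatted symbol pairs a prediction with its target, and since every term is a squared difference, the value does not depend on which of the two carries the hat.

```latex
\[
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
         + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
    \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
\]
```

The first two lines are the coordinate loss, the next two are the confidence loss (for cells with and without an object), and the last line is the classification loss.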

Let's break down each term.

  1. Bounding Box Coordinate Loss / Regression Loss

image

1obj (the object indicator): equals 1 when a bounding box predictor in a grid cell is responsible for an object, and 0 otherwise.

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object.
We assign one predictor to be "responsible" for predicting an object based on which prediction has the highest current IOU with the ground truth.
However, I've only used one bounding box per grid cell.
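As an illustration of the "highest IOU" rule, here is a minimal IoU sketch in plain Python. It assumes boxes are given as (cx, cy, w, h) in the same coordinate system; `iou` and `responsible_index` are hypothetical helper names, not functions from the repo.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Overlap is zero when the boxes do not intersect along an axis.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def responsible_index(predictions, truth):
    """Index of the predicted box with the highest IoU against the ground truth."""
    return max(range(len(predictions)), key=lambda j: iou(predictions[j], truth))


truth = (1.0, 1.0, 2.0, 2.0)
candidates = [(2.0, 1.0, 2.0, 2.0), (1.1, 1.0, 2.0, 2.0)]
print(responsible_index(candidates, truth))
```

With one box per grid cell, as in this repo, the selection step is trivial and only the IoU itself matters.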

(xi, yi, wi, hi): Network prediction of the center, width and height of the responsible bounding box in the i-th grid cell
(x^i, y^i, w^i, h^i): Ground truth
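A minimal sketch of this term, following the form in the YOLOv1 paper: squared error on (x, y) and on the square roots of (w, h), summed over cells that contain an object. The weight 5.0 is the paper's value and may differ from the repo's; `coord_loss` is a hypothetical name.

```python
import math

LAMBDA_COORD = 5.0   # weight from the YOLOv1 paper (assumption for this repo)


def coord_loss(preds, truths, obj_mask, lambda_coord=LAMBDA_COORD):
    """Coordinate loss over grid cells.

    preds, truths: lists of (x, y, w, h), one entry per grid cell.
    obj_mask: list of booleans, True where the cell contains an object.
    Widths and heights must be non-negative because of the square roots.
    """
    total = 0.0
    for (x, y, w, h), (tx, ty, tw, th), has_obj in zip(preds, truths, obj_mask):
        if not has_obj:
            continue   # cells without an object do not contribute
        total += (x - tx) ** 2 + (y - ty) ** 2
        total += (math.sqrt(w) - math.sqrt(tw)) ** 2
        total += (math.sqrt(h) - math.sqrt(th)) ** 2
    return lambda_coord * total


preds = [(0.5, 0.5, 0.25, 0.25), (0.1, 0.1, 0.1, 0.1)]
truths = [(0.5, 0.5, 1.0, 0.25), (0.9, 0.9, 0.9, 0.9)]
print(coord_loss(preds, truths, obj_mask=[True, False]))
```

The square roots make an error of a given size count more for small boxes than for large ones.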

  2. Confidence Loss

image

Ci: True objectness score.
Ci^: Prediction from the network * IoU between the ground truth and the prediction.
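A minimal sketch of the confidence term in the paper's form: cells with an object contribute their squared error fully, while cells without an object are down-weighted. The weight 0.5 is the paper's value and may differ from the repo's. How the two confidence values are produced (for example, scaling by IoU as described above) is left to the caller; `confidence_loss` is a hypothetical name.

```python
LAMBDA_NOOBJ = 0.5   # weight from the YOLOv1 paper (assumption for this repo)


def confidence_loss(pred_conf, target_conf, obj_mask, lambda_noobj=LAMBDA_NOOBJ):
    """Confidence loss over grid cells, one confidence value per cell."""
    obj_term = 0.0
    noobj_term = 0.0
    for c, t, has_obj in zip(pred_conf, target_conf, obj_mask):
        err = (c - t) ** 2
        if has_obj:
            obj_term += err
        else:
            noobj_term += err   # most cells are empty, so these are down-weighted
    return obj_term + lambda_noobj * noobj_term


print(confidence_loss([0.8, 0.2], [1.0, 0.0], obj_mask=[True, False]))
```

Without the down-weighting, the many empty cells would dominate the gradient and push all confidences toward zero.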

  3. Classification Loss

image
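A minimal sketch of the classification term in the paper's form: a sum of squared errors over the class probabilities, counted only for cells that contain an object. `class_loss` is a hypothetical name, and the two-class vectors mirror the two classes used in this repo.

```python
def class_loss(pred_probs, true_probs, obj_mask):
    """Classification loss over grid cells.

    pred_probs, true_probs: one list of class probabilities per grid cell.
    obj_mask: list of booleans, True where the cell contains an object.
    """
    total = 0.0
    for p, t, has_obj in zip(pred_probs, true_probs, obj_mask):
        if has_obj:
            total += sum((pi - ti) ** 2 for pi, ti in zip(p, t))
    return total


pred = [[0.7, 0.3], [0.5, 0.5]]
true = [[1.0, 0.0], [0.0, 1.0]]
print(class_loss(pred, true, obj_mask=[True, False]))
```

Cells without an object are skipped entirely, so the class scores there are not penalized.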

Limitations of Yolo

  • Comparatively low recall and more localization error than Faster R-CNN.
  • Struggles to detect objects that are close together, because each grid cell can propose only 2 bounding boxes.
  • Struggles to detect small objects.

About

Object Detection using YOLO v2 with PyTorch.
