Classifying an image is useful, but how do you detect where in the image an object lies?
And not just one object: what about multiple objects?

YOLO is a famous algorithm that does this job and does it very well indeed.

Here is a high-level look at how its object detection works ↓
YOLO stands for 𝘠𝘖𝘜 𝘖𝘕𝘓𝘠 𝘓𝘖𝘖𝘒 𝘖𝘕𝘊𝘌

Unlike the sliding window technique, it looks at the image just one time, justifying the name.

Because it sees the whole image, it implicitly encodes contextual information about the classes (names of objects) and their appearance.

1/10
YOLO divides the image into a grid of cells.

If the center of an object lies in a grid cell, that cell becomes responsible for detecting that object.
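
To make the "responsible cell" idea concrete, here is a minimal sketch. The function name responsible_cell is hypothetical, and the 7 x 7 grid and 448 x 448 image size are assumptions borrowed from the original YOLO setup, not something this thread specifies.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell containing the object center (cx, cy).

    cx, cy are pixel coordinates; S is the assumed number of cells per side.
    """
    # Scale the center into grid units, then truncate to get the cell index.
    # min(...) keeps a center lying exactly on the right/bottom edge in range.
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


# An object centered at pixel (224, 100) in a 448 x 448 image
print(responsible_cell(224, 100, 448, 448))  # (1, 3)
```

Only that one cell is asked to detect the object, even if the object itself spans many cells.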

2/10
Each grid cell is responsible for predicting a fixed number of bounding boxes, plus a confidence score for each box that shows how sure the model is that the box contains an object.

The score doesn't indicate what kind of object it is, only whether the box contains some object.

No object -> zero score

3/10
Each bounding box consists of 5 predictions: x, y, w, h, and a confidence score. The (x, y) coordinates give the center of the box relative to the bounds of the grid cell.

The width and the height are predicted relative to the whole image.
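
As a rough sketch of that encoding, the snippet below turns a pixel-space box into the cell-relative center and image-relative size described above. The helper name encode_box, the 7 x 7 grid, and the example numbers are all assumptions for illustration.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Encode a pixel-space box (center cx, cy and size w, h) for training.

    Returns (row, col, (x, y, w_rel, h_rel)) where x, y are offsets inside
    the responsible cell (0..1) and w_rel, h_rel are fractions of the image.
    """
    gx = cx / img_w * S          # center in grid units along x
    gy = cy / img_h * S          # center in grid units along y
    col = min(int(gx), S - 1)    # responsible cell column
    row = min(int(gy), S - 1)    # responsible cell row
    x = gx - col                 # offset from the cell's left edge
    y = gy - row                 # offset from the cell's top edge
    return row, col, (x, y, w / img_w, h / img_h)


row, col, target = encode_box(224, 100, 112, 224, 448, 448)
print(row, col, target)
```

Here the center offsets are measured within cell (1, 3), while the width and height come out as 0.25 and 0.5 of the whole image.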

4/10
When visualized, this gives a bunch of bounding boxes around each object, with the thickness of each box showing the confidence for that object.

Each grid cell also predicts class probabilities. These are conditional probabilities: given that the cell contains an object, how likely is it to belong to each class?

5/10
It predicts only one set of class probabilities per grid cell.
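
Putting the pieces together, the size of the prediction for one image follows from simple arithmetic. This is a small sketch; the defaults S=7, B=2, C=20 are assumptions taken from the original YOLO configuration on PASCAL VOC, and output_shape is a hypothetical helper name.

```python
def output_shape(S=7, B=2, C=20):
    """Shape of the prediction tensor for one image.

    S x S grid cells; each cell predicts B boxes with 5 values each
    (x, y, w, h, confidence) plus ONE shared set of C class probabilities.
    """
    return (S, S, B * 5 + C)


print(output_shape())  # (7, 7, 30)
```

Note that C is added once per cell, not once per box, which is exactly the "one set of class probabilities per grid cell" point.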

So if a grid cell predicts Dog, that doesn't mean the cell contains a dog; rather, if that cell contains an object, it is most probably a dog.

6/10
Then at test time, it multiplies the conditional class probabilities by the individual box confidence predictions.

The output scores not only encode the probability of the class fitting the box but also how well the box fits the object.
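
A minimal sketch of that multiplication for a single grid cell follows. The function name class_specific_scores and all the numbers are made up for illustration.

```python
def class_specific_scores(box_confidences, class_probs):
    """Multiply each box confidence by the cell's conditional class probabilities.

    box_confidences: one confidence per box predicted by the cell
    class_probs:     one conditional probability per class (shared by the cell)
    Returns a list with one row of per-class scores for each box.
    """
    return [[conf * p for p in class_probs] for conf in box_confidences]


classes = ["dog", "cat"]
cell_class_probs = [0.9, 0.1]        # "if there is an object, it is probably a dog"
cell_box_confidences = [0.8, 0.05]   # first box fits an object well, second does not

scores = class_specific_scores(cell_box_confidences, cell_class_probs)
for box_index, row in enumerate(scores):
    print(box_index, dict(zip(classes, row)))
```

The second box gets a low score for every class, even for "dog", because its own confidence is low.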

7/10
We then have a lot of predictions.

They can include multiple predictions for the same object, made by different grid cells with different confidence scores.

To resolve that, we use a method called non-max suppression (NMS).

8/10
NMS, in a nutshell, first suppresses or discards bounding boxes with a confidence score below a selected threshold value (a chosen minimum cut-off).

Then, among the remaining boxes that overlap the same object, it keeps the one with the highest score and discards the rest, giving us our final predictions.
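
Here is a minimal sketch of greedy NMS in plain Python, not the code any particular YOLO release uses. It assumes boxes are given as (x1, y1, x2, y2) corners and measures overlap with intersection over union (IoU); the thresholds 0.25 and 0.5 are arbitrary example values.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thresh=0.25, iou_thresh=0.5):
    """Return indices of the boxes that survive non-max suppression."""
    # Step 1: drop boxes below the score cut-off, sort the rest best-first.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    # Step 2: keep a box only if it does not overlap an already kept box too much.
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
print(nms(boxes, scores))  # [0, 2]
```

Box 1 is dropped because it overlaps the higher-scoring box 0, and box 3 never makes it past the score cut-off. With several classes, this is typically run once per class.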

9/10
YOLO trains on full images and directly optimizes detection performance.

It is widely used and known for its fast results.

The latest version at the time of writing is YOLOv3, which can be found here:
https://pjreddie.com/darknet/yolo/ 

10/10

Hope this thread was helpful! 👍
You can follow @capeandcode.