A Detailed Interpretation of the YOLO Paper
The article "You Only Look Once: Unified, Real-Time Object Detection" puts forward the method hereinafter referred to as YOLO.
At present, deep-learning-based object detection algorithms can be roughly divided into two camps:
1. Two-stage algorithms: first generate candidate regions, then classify them with a CNN (the R-CNN family).
2. One-stage algorithms: apply a single network directly to the input image and output the classes together with their locations (the YOLO family).
The earlier R-CNN family achieves high accuracy, but even Faster R-CNN only runs at about 7 fps, as shown in the figure below (the original text says 5 fps). YOLO was proposed to make detection usable in real-time scenarios.
YOLO's approach to detection differs from the R-CNN family: it treats object detection as a single regression task.
Let's look at the overall structure of YOLO:
As shown in the two figures above, the network is adapted from GoogLeNet. The input image size is 448*448 and the output is a 7*7*30 tensor. Writing the output dimension this way may look strange at first, so let's see how the output is defined.
The image is divided into an S*S grid of cells (S = 7 in the paper), and the output is organized per cell:
1. If the center of an object falls on a cell, the cell is responsible for predicting the object.
2. Each cell predicts B bounding boxes (each bbox consists of center coordinates plus width and height; B = 2 in the paper) and a confidence score for each bbox, i.e. B * (4 + 1) values per cell.
3. Each cell also predicts C conditional class probabilities (C is the number of object classes; C = 20 in the paper, determined by the PASCAL VOC dataset used).
So the final output dimension of the network is S * S * (B * 5 + C) = 7 * 7 * 30. Although each cell is responsible for predicting only one object (which is also a weakness of this paper: crowded small objects can cause problems), each cell can predict multiple bounding boxes. You can think of these as boxes of several different shapes that help locate the object more accurately, as shown in the figure below.
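As a concrete illustration of this layout, here is a minimal NumPy sketch of the 7*7*30 output tensor; the ordering of the box and class entries inside a cell is an assumption made only for illustration, since what matters is the total size per cell.

```python
import numpy as np

# Paper settings: S = 7 grid cells per side, B = 2 boxes per cell, C = 20 classes (PASCAL VOC).
S, B, C = 7, 2, 20

# Each cell predicts B boxes * (x, y, w, h, confidence) plus C class probabilities.
per_cell = B * 5 + C                  # 2 * 5 + 20 = 30
output = np.zeros((S, S, per_cell))   # the 7x7x30 output tensor
print(output.shape)                   # (7, 7, 30)

# One possible way to slice a single cell's prediction (the ordering is illustrative only):
cell = output[3, 4]
boxes = cell[:B * 5].reshape(B, 5)    # B rows of (x, y, w, h, confidence)
class_probs = cell[B * 5:]            # C conditional class probabilities
```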
Since the problem is solved as a regression task, all outputs, including the coordinates and the width and height, are best normalized to lie between 0 and 1. A more detailed illustration found online is shown below.
Let's look at the meaning of each parameter in the B (x, y, w, h, confidence) vectors and the C conditional probabilities predicted by each cell (assume the image width is $w_i$ and the height is $h_i$, and that the image is divided into an $S \times S$ grid):
1. (x, y) is the offset of the center of the bbox relative to the cell
For the cell marked by the blue box in the figure below (grid coordinates $(x_{col}, y_{row})$), suppose its predicted output is the bbox marked by the red box, whose center coordinates are $(x_c, y_c)$. The final predicted (x, y) is normalized and represents the offset of the center relative to the cell. The calculation is:
$$x = \frac{x_c}{w_i} S - x_{col}, \qquad y = \frac{y_c}{h_i} S - y_{row}$$
2. (w, h) is the size of the bbox relative to the whole image
The predicted width and height $(w_b, h_b)$ of the bbox are expressed as fractions (w, h) of the whole image. The calculation is:
$$w = \frac{w_b}{w_i}, \qquad h = \frac{h_b}{h_i}$$
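Putting the two definitions above together, here is a small sketch of how a ground-truth box could be encoded into these normalized targets. The function name and the example numbers are made up purely for illustration; the formulas are the ones given above.

```python
def encode_box(x_c, y_c, w_b, h_b, w_i, h_i, S=7):
    """Turn an absolute-pixel ground-truth box into YOLO regression targets.

    (x_c, y_c): box center in pixels, (w_b, h_b): box size in pixels,
    (w_i, h_i): image size. Returns the responsible cell (row, col) and
    the normalized (x, y, w, h), all in [0, 1].
    """
    col = int(x_c / w_i * S)     # grid column the center falls into
    row = int(y_c / h_i * S)     # grid row the center falls into
    x = x_c / w_i * S - col      # center offset inside the cell
    y = y_c / h_i * S - row
    w = w_b / w_i                # size as a fraction of the whole image
    h = h_b / h_i
    return row, col, x, y, w, h

# Example: a 100x150 box centered at (250, 300) in a 448x448 image.
print(encode_box(250, 300, 100, 150, 448, 448))
# -> (4, 3, ~0.91, ~0.69, ~0.22, ~0.33)
```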
3. confidence
The confidence combines two things: whether the cell contains an object, and how accurate the predicted bbox is. It is defined as $\text{confidence} = \Pr(\text{Object}) \cdot \text{IOU}^{truth}_{pred}$.
If there is an object in the cell, $\Pr(\text{Object}) = 1$ and the confidence equals the IoU between the predicted box and the ground truth; if there is no object, the confidence is 0.
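The IoU above is the usual intersection-over-union between the predicted box and the ground-truth box. A minimal implementation for boxes given as corner coordinates might look like the sketch below (the corner format is just a convenient assumption here):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2) corners."""
    ix1 = max(box_a[0], box_b[0])            # intersection rectangle
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 = 0.142...
```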
4. Conditional class probability C
The conditional probability is defined as $\Pr(\text{Class}_i \mid \text{Object})$: the probability that, given the cell contains an object, that object belongs to class $i$.
At test time, the class-specific probability output for each predicted box is defined as $\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IOU}^{truth}_{pred} = \Pr(\text{Class}_i) \cdot \text{IOU}^{truth}_{pred}$, as shown in the following two figures (the two figures differ because each of the B boxes of a cell outputs its own column of class scores).
Finally, the resulting score columns (7 * 7 * 2 = 98 of them) are fed into NMS, and the final output boxes are obtained.
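A rough sketch of this test-time step, reusing the iou helper from the sketch above; the array shapes follow the S = 7, B = 2, C = 20 setting, while the 0.5 NMS threshold is an illustrative assumption rather than a value fixed by the paper.

```python
import numpy as np

def class_scores(confidences, class_probs):
    """Class-specific confidence scores for every predicted box.

    confidences: shape (S*S*B,)  -- per-box Pr(Object) * IOU values
    class_probs: shape (S*S, C)  -- per-cell conditional class probabilities,
                                    shared by the B boxes of that cell
    Returns an (S*S*B, C) matrix: for S=7, B=2, C=20 these are the 98 columns
    of 20 scores that are fed into NMS.
    """
    B = confidences.shape[0] // class_probs.shape[0]
    probs = np.repeat(class_probs, B, axis=0)   # copy each cell's probs for its B boxes
    return confidences[:, None] * probs

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression for one class; boxes are (x1, y1, x2, y2)."""
    order = np.argsort(scores)[::-1]            # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = np.array([iou(boxes[b], boxes[best]) for b in rest])
        order = rest[overlaps < iou_thresh]     # drop boxes overlapping the kept one too much
    return keep
```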
Finally, let's look at the definition of the loss function used to train YOLO (I wanted to typeset it in LaTeX myself, but one symbol would not come out, so I use the figure from another blog, shown below).
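Transcribed into LaTeX, the loss in that figure reads as follows. Here the subscript $i$ indexes the $S^2$ grid cells and $j$ the $B$ box predictors of a cell (so $w_i, h_i$ now denote a predicted box's width and height, following the paper's notation); $\mathbb{1}_i^{obj}$ is 1 when an object appears in cell $i$, and $\mathbb{1}_{ij}^{obj}$ when the $j$-th predictor of cell $i$ is responsible for that object ($\mathbb{1}_{ij}^{noobj}$ is its complement):

$$
\begin{aligned}
\mathcal{L} ={} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
  & + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
  & + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2
    + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
  & + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$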
Two points are emphasized here:
1. Not every cell of an image contains an object. When there is no object, the confidence target becomes zero, and the gradient from these many empty cells can overwhelm the gradient from the cells that do contain objects, making training unstable. To balance this, two weights are introduced in the loss function: $\lambda_{coord} = 5$ scales the loss on the predicted bbox positions, and $\lambda_{noobj} = 0.5$ scales the confidence loss of cells that contain no object.
2. The same coordinate deviation matters much more for a small object than for a large one. To reduce this effect, the loss uses the square roots of the bbox width and height instead of the raw values.
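A quick numeric check of point 2 (the numbers are made up purely for illustration): with the square root, the same error in normalized width is penalized far more for a small box than for a large one, whereas the raw squared error cannot tell the two apart.

```python
import math

# Purely illustrative: the same +0.05 error in normalized width, once on a
# large box (w = 0.80) and once on a small box (w = 0.10).
err = 0.05
for w in (0.80, 0.10):
    raw_loss = ((w + err) - w) ** 2                       # identical for both boxes: 0.0025
    sqrt_loss = (math.sqrt(w + err) - math.sqrt(w)) ** 2  # ~0.0008 (large) vs ~0.0051 (small)
    print(f"w = {w:.2f}: raw = {raw_loss:.4f}, sqrt = {sqrt_loss:.4f}")
```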
Reference resources
1. https://docs.google.com/presentation/d/1aeRvtKG21KHdD5lg6Hgyhx5rPq_ZOsGjG5rJ1HP7BA/pub?Start=false&loop=false&delayms=3000&slide=id.g137784ab86_4_1822
2. https://blog.csdn.net/u011974639/article/details/78208773
Author: Michael Liu_dev
Link: https://www.jianshu.com/p/13ec2aa50c12
Source: Jianshu
Copyright belongs to the author. For any form of reprinting, please contact the author for authorization and indicate the source.