Where It All Began: YOLOv1

Jun 2, 2021

YOLO: you only look once.

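YOLO is the pioneer of real-time object detection. Its original author, Joseph Redmon, has stopped maintaining it, citing concerns about privacy and about the algorithm being used for military purposes, yet YOLO still holds its place among object detection algorithms. At the time of writing, YOLO has been updated to YOLOv5, and this series of articles will go from YOLOv1 all the way to YOLOv5. YOLOv1 and v2 have long been overtaken by newer techniques, but the principles behind them are still worth studying.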

Readers can refer to Sik-Ho Tsang's article, which explains the architecture of the algorithm very well. For the finer details of the model, read the paper directly.

[2016 CVPR] You Only Look Once: Unified, Real-Time Object Detection [paper link]

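The main text starts here.

I have split the article into a few short sections so that you can jump straight to the part you want to read.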

Unified Detection
Model Architecture
Loss Function
Results

1. Unified Detection

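Following the roadmap of object detection milestones, we can see that two-stage algorithms such as R-CNN and Fast R-CNN are YOLO's predecessors. Region proposals were the dominant approach at the time, but two-stage algorithms have obvious drawbacks. R-CNN, for example, uses Selective Search to find around 2,000 region proposals, warps each proposal to the same size, feeds it into a CNN to extract features, classifies it with SVMs, and then applies linear regression to refine the bounding box. This pipeline is very slow to run, and the method is quite sensitive to the choice of box size.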

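This is the background against which YOLOv1, the ancestor of one-stage detectors, appeared in 2016. YOLOv1 was developed by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi (University of Washington, Allen Institute for AI, and Facebook AI Research). The authors neatly recast detection as a regression problem: instead of selecting region proposals in a separate stage, that job is pushed into the model, which predicts the bounding boxes itself.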

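YOLOv1 is indeed very fast: it runs at 45 fps, and the smaller Fast YOLO reaches 155 fps. YOLOv1 does not search for region proposals. Instead it proposes Unified Detection, in which a single network looks at the whole image once, and that is where the name You Only Look Once comes from.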

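YOLO resizes the input image to 448×448 and divides it into an S×S grid (S=7). If the center of an object falls into a grid cell, that cell is responsible for detecting the object.
Each grid cell predicts B bounding boxes (B=2) and a confidence score for each box, defined as Pr(Object) × IOU between the predicted box and the ground truth. Each cell also predicts C conditional class probabilities Pr(Class_i|Object), shared by its B boxes. At test time the two are multiplied to give a class-specific confidence score for each box: Pr(Class_i|Object) × Pr(Object) × IOU = Pr(Class_i) × IOU. With 20 object classes as an example, the output tensor for each image has size 7×7×(5×2+20) = 7×7×30.
Each bounding box has five values: x, y, w, h, and a confidence score. (x, y) is the center of the box relative to the bounds of its grid cell, w is the width and h is the height, both relative to the whole image. The confidence score represents the IOU (Intersection Over Union) between the predicted box and the ground truth box.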
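To make the IOU and the class-specific confidence concrete, here is a minimal sketch in plain Python. It assumes boxes are given as (x_center, y_center, width, height) in one shared coordinate system. The function names `iou` and `class_specific_confidence` are made up for this illustration and do not come from the paper or from Darknet.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_center, y_center, width, height)."""
    # Convert center format to corner coordinates.
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def class_specific_confidence(class_probs, box_confidence):
    """Pr(Class_i | Object) * (Pr(Object) * IOU) for every class of one box."""
    return [p * box_confidence for p in class_probs]
```

For instance, `iou((0, 0, 2, 2), (1, 0, 2, 2))` compares two 2×2 boxes whose centers are one unit apart horizontally; the overlap is 2 and the union is 6.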

The final output size becomes: 7×7×(2×5+20)=1470
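The arithmetic above can be checked with a short sketch that slices one grid cell out of the flat output vector. The per-cell layout used here (B boxes of [x, y, w, h, confidence] followed by C class probabilities) is an assumption made for illustration; an actual implementation may order the values differently. `split_cell` is a hypothetical helper name.

```python
S, B, C = 7, 2, 20           # grid size, boxes per cell, number of classes
CELL_SIZE = B * 5 + C        # 30 values per grid cell
OUTPUT_SIZE = S * S * CELL_SIZE


def split_cell(flat, row, col):
    """Return (boxes, class_probs) for one grid cell of a flat prediction.

    Assumed layout per cell: B * [x, y, w, h, confidence], then C class probs.
    """
    start = (row * S + col) * CELL_SIZE
    cell = flat[start:start + CELL_SIZE]
    boxes = [cell[b * 5:(b + 1) * 5] for b in range(B)]
    class_probs = cell[B * 5:]
    return boxes, class_probs


flat = [0.0] * OUTPUT_SIZE   # stand-in for a network output
boxes, class_probs = split_cell(flat, row=3, col=4)
print(OUTPUT_SIZE, len(boxes), len(boxes[0]), len(class_probs))
```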

2. Model Architecture

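The network architecture is inspired by GoogLeNet. In place of GoogLeNet's inception modules, YOLOv1 uses 1×1 reduction layers (to reduce the number of parameters) followed by 3×3 convolutional layers, and the network ends with two fully connected layers.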

Activation function: Leaky ReLU
Activation function of the last layer: Linear
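
As a small illustration of the two activations, here is a sketch in plain Python. The negative slope of 0.1 is the value given in the YOLOv1 paper; the function names are my own.

```python
def leaky_relu(x, slope=0.1):
    """Leaky ReLU: pass positive inputs through, scale negative ones by `slope`."""
    return x if x > 0 else slope * x


def linear(x):
    """Linear (identity) activation, used on the final layer."""
    return x
```

Unlike a plain ReLU, the leaky variant keeps a small gradient for negative inputs, so units are less likely to stop learning.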

[Figure: the final model pipeline]

A quick word on NMS (Non-Maximal Suppression). The paper itself only says "However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections.", but in practice I have never heard of anyone who does not use it.

NMS

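Put simply, NMS picks the right bounding box out of several overlapping candidates. Here is how it is carried out.

Step 0: Set a confidence score threshold and discard the bounding boxes that fall below it before running NMS. This cuts down on wasted computation.

Step 1: Set an IoU threshold (0.5 is typical; a value that is too high leaves duplicate detections of the same object). It is used to remove one of any two boxes whose IoU is too high.

Step 2: Sort the bounding boxes by confidence score, then compute the IoU between the top-ranked box and every other box. If the IoU exceeds the IoU threshold, set that box's confidence to zero.

Step 3: Repeat Step 2 with the next highest-scoring box that has not been suppressed, until every remaining box has been either kept or suppressed. The boxes that are left are the result.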
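
The steps above can be sketched in plain Python as follows. This is a minimal greedy NMS for a single class, assuming boxes are given as (x1, y1, x2, y2) corners. The default `score_threshold` of 0.2 is an arbitrary value chosen for the example, and instead of zeroing the confidence of a suppressed box the sketch simply drops it from the candidate list, which has the same effect.

```python
def iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_threshold=0.2, iou_threshold=0.5):
    """Greedy non-maximal suppression; returns indices of the kept boxes."""
    # Step 0: drop low-confidence boxes before doing any IoU work.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    # Step 2: rank the remaining boxes by confidence, highest first.
    candidates.sort(key=lambda i: scores[i], reverse=True)

    keep = []
    while candidates:
        best = candidates.pop(0)
        keep.append(best)
        # Suppress every box that overlaps the current best box too much.
        candidates = [
            i for i in candidates
            if iou_xyxy(boxes[best], boxes[i]) <= iou_threshold
        ]
    return keep
```

With several classes, this routine would typically be run once per class on the class-specific confidence scores.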

3. Loss Function

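Sum-squared error is the main building block of YOLOv1's loss function. Compared with other loss functions it is not the best choice, but it is easy to optimize. Because it weights an error in a small box the same as an error in a large box, it is also one of the reasons YOLOv1 has trouble detecting small objects.

λnoobj = 0.5: many grid cells contain no object, and they all push their confidence scores toward zero. This makes the model less stable and harder to converge, so the influence of this term has to be reduced, and it is given a smaller weight.

λcoord = 5: for the same reason, the influence of the grid cells that do contain objects is increased.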
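
For reference, the full loss as I read it in the YOLOv1 paper can be written out as below. Here \mathbb{1}_{ij}^{obj} is 1 when the j-th box predictor in cell i is responsible for an object, \mathbb{1}_{ij}^{noobj} is its complement, \mathbb{1}_{i}^{obj} is 1 when an object appears in cell i, C is the confidence score, and p_i(c) is the class probability. A hat marks the counterpart of each quantity (prediction versus target); since every term is a squared difference, the orientation does not change the value.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The square roots on w and h are the paper's partial remedy for the box-size problem: the same absolute deviation changes the square root of a small value more than that of a large one, so errors in small boxes are penalized somewhat more.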

4. Results

Picasso Dataset precision-recall curves.

Limitation

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict => overlapping objects are hard to predict
Our model struggles with small objects that appear in groups, such as flocks of birds. => small objects in groups are hard to predict
Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. => oddly shaped objects are hard to predict
Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image. => the downsampling layers lose fine detail
While we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations. => the loss function is not well suited to detecting small objects
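
The last point is easy to verify numerically. The sketch below shifts a square box sideways by the same number of pixels and measures the IOU with its original position, once for a large box and once for a small one. The helper `shifted_iou` and the box sizes are made up for this illustration.

```python
def shifted_iou(size, shift):
    """IoU between a size x size box and a copy shifted right by `shift`."""
    overlap_w = max(0.0, size - shift)   # remaining horizontal overlap
    inter = overlap_w * size
    union = 2 * size * size - inter
    return inter / union


large = shifted_iou(size=100, shift=5)   # 5-pixel error on a 100x100 box
small = shifted_iou(size=10, shift=5)    # same 5-pixel error on a 10x10 box
print(f"large box IoU: {large:.3f}, small box IoU: {small:.3f}")
```

The same 5-pixel error leaves the large box with an IOU above 0.9 but drops the small box to about one third, even though a plain sum-squared error on the coordinates would penalize both cases equally.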