Z. Zou, Z. Shi, Y. Guo, and J. Ye, ‘‘Object detection in 20 years: A survey,’’ 2019, arXiv:1905.05055v2. [Online].
Available: https://arxiv.org/abs/1905.05055v2
A note before we start: if you want to get into machine vision but don't know where to begin, or which algorithm to choose, this survey collects everything you need to know up to 2019 — the algorithms, the public datasets, and all the small details of object detection are organized in this one paper. You can treat it as an expert's study notes; I strongly recommend reading the whole paper slowly. There is also a full Simplified Chinese translation of it on Zhihu.
The paper is very long, so I have split it into two major parts: background and technical details. This post covers the background; the detailed techniques will be analyzed one by one in later posts.
Introduction
What is object detection? The paper gives a very concise answer: What objects are where?
Object detection can basically be divided into two major categories:
- general object detection: aims to explore methods for detecting different types of objects under a unified framework, simulating human vision and cognition
- detection application: specific application scenarios, such as pedestrian detection, face detection, text detection
Key points of the paper
- The paper distills the essence of 400+ papers (1990s–2019), giving readers an overview of all the major object detection algorithms
- The key technologies of state-of-the-art object detection systems, such as "multi-scale detection", "hard negative mining", "bounding box regression", etc.
- A comprehensive analysis of detection speed-up techniques.
— “detection pipeline” (e.g., cascaded detection, feature map shared computation)
— “detection backbone” (e.g., network compression, lightweight network design)
— “numerical computation” (e.g., integral image, vector quantization).
OBJECT DETECTION IN 20 YEARS
Milestone: Traditional detectors
- Viola-Jones Detectors (VJ detector)
1. Integral image: the VJ detector uses Haar wavelets as the feature representation of an image. The integral image makes the computational cost of each window in the VJ detector independent of its window size. [What is an integral image?]
2. Feature selection: the Adaboost algorithm selects, from a huge pool of random features (about 180k-dimensional), the small set of features that are most helpful for face detection. [Intro to Adaboost]
3. Detection cascades: a multi-stage detection paradigm that rejects most background windows with cheap early stages and spends more computation only on promising face candidates.
- Histogram of Oriented Gradients (HOG Detector)
The HOG descriptor is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization (on "blocks") to improve accuracy.
- Deformable Part-based Model (DPM)
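As a concrete illustration of point 1 above, here is a minimal integral-image sketch in Python; the function names are mine, not the paper's:

```python
def integral_image(img):
    """Build cumulative sums so any rectangle sum costs O(1),
    independent of the window size (the key to VJ's speed)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]  # padded with a zero row/col
    for y in range(h):
        row = 0
        for x in range(w):
            row += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row
    return ii

def window_sum(ii, x0, y0, x1, y1):
    """Sum of img[y0:y1][x0:x1] using only 4 table lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
```

Haar features are differences of such rectangle sums, so each feature costs a constant number of lookups no matter how large the detection window is.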
Milestones: CNN based Two-stage Detectors
- RCNN: The idea behind RCNN is simple: It starts with the extraction of a set of object proposals (object candidate boxes) by selective search. Then each proposal is rescaled to a fixed size image and fed into a CNN model trained on ImageNet (say, AlexNet) to extract features. Finally, linear SVM classifiers are used to predict the presence of an object within each region and to recognize object categories.
Drawbacks: the redundant feature computation on a large number of overlapping proposals (over 2000 boxes per image) leads to extremely slow detection (14 s per image even with a GPU).
- Spatial Pyramid Pooling Networks (SPPNet):
With SPP, we no longer need to crop the image to a fixed size (as AlexNet requires) before feeding it into the CNN; images of any size can be used as input. [Review: SPPNet]
Drawbacks: first, training is still multi-stage; second, SPPNet only fine-tunes its fully connected layers and ignores all previous layers. The following year, Fast RCNN was proposed and solved these problems.
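A minimal sketch of the SPP idea, assuming a single-channel feature map stored as nested lists (the real SPPNet pools each channel of the CNN's last conv feature map this way):

```python
def spp(feature_map, levels=(1, 2, 4)):
    """Max-pool an H x W map over pyramid grids of n x n bins.
    Output length = sum(n*n for n in levels), regardless of H and W,
    so arbitrary input sizes yield a fixed-length vector."""
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for n in levels:
        for gy in range(n):
            for gx in range(n):
                y0 = gy * h // n
                y1 = max((gy + 1) * h // n, y0 + 1)  # ensure non-empty bin
                x0 = gx * w // n
                x1 = max((gx + 1) * w // n, x0 + 1)
                out.append(max(feature_map[y][x]
                               for y in range(y0, y1) for x in range(x0, x1)))
    return out
```

Because the output length depends only on `levels`, the fully connected layers that follow always see the same input dimensionality.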
- Fast-RCNN
Fast-RCNN successfully integrates the advantages of R-CNN and SPPNet, but its detection speed is still limited by proposal detection (see Section 2.3.2 of the survey for more details). A question then naturally arises: "can we generate object proposals with a CNN model?" Later, Faster R-CNN [19] answered this question.
- Faster RCNN
The main contribution of Faster-RCNN is the introduction of the Region Proposal Network (RPN) that enables nearly cost-free region proposals.
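The RPN regresses proposals relative to a fixed set of reference boxes ("anchors") at every feature-map location. A sketch of anchor-shape generation (the 3 scales x 3 aspect ratios grid matches the Faster R-CNN paper; the helper name and exact values are illustrative):

```python
def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Centered (w, h) anchor shapes, one per scale/aspect-ratio pair.
    Faster R-CNN uses 3 scales x 3 ratios = 9 anchors per feature-map cell.
    Each anchor keeps area ~ scale**2 while h / w equals the ratio."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / r ** 0.5   # shrink width as the ratio (h/w) grows
            h = s * r ** 0.5
            anchors.append((w, h))
    return anchors
```

At inference, each anchor at each location gets an objectness score plus box offsets, which is what makes region proposals "nearly cost-free": they reuse the shared backbone features.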
- Feature Pyramid Networks (FPN)
A top-down architecture with lateral connections is developed in FPN for building high-level semantics at all scales. Since a CNN naturally forms a feature pyramid through its forward propagation, the FPN shows great advances for detecting objects with a wide variety of scales. [Review: FPN]
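One merge step of that top-down pathway can be sketched as: upsample the coarser map and add the laterally connected finer map (nearest-neighbor 2x upsampling on plain lists here; the real FPN first applies a 1x1 conv to the lateral map and uses tensors):

```python
def upsample2x(m):
    """Nearest-neighbor 2x upsampling of a 2-D list."""
    return [[v for v in row for _ in range(2)] for row in m for _ in range(2)]

def top_down_merge(coarse, lateral):
    """One FPN merge: upsampled top-down (semantically strong) map
    added element-wise to the finer lateral (spatially precise) map."""
    up = upsample2x(coarse)
    return [[up[y][x] + lateral[y][x] for x in range(len(lateral[0]))]
            for y in range(len(lateral))]
```

Repeating this merge down the pyramid gives every scale both high-level semantics and fine spatial resolution, which is why detection heads can run on all levels.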
Milestones: CNN based One-stage Detectors
- You Only Look Once (YOLO)
It was the first one-stage detector in the deep learning era. The authors have completely abandoned the previous detection paradigm of “proposal detection + verification”. Instead, it follows a totally different philosophy: to apply a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region simultaneously.
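Concretely, YOLOv1 divides the image into an S x S grid (S = 7 in the paper) and makes the cell containing an object's center responsible for predicting it; a minimal sketch of that assignment:

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return the (col, row) of the S x S grid cell that contains the
    object's center (cx, cy), i.e. the cell responsible for predicting it."""
    col = min(int(cx / img_w * S), S - 1)  # clamp centers on the right edge
    row = min(int(cy / img_h * S), S - 1)
    return col, row
```

Each cell then predicts its bounding boxes and class probabilities in one forward pass, with no separate proposal stage.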
Drawbacks: despite its great improvement in detection speed, YOLO suffers a drop in localization accuracy compared with two-stage detectors, especially for small objects. [YOLOv1]
- Single Shot MultiBox Detector (SSD)
The main contribution of SSD is the introduction of multi-reference and multi-resolution detection techniques, which significantly improve the detection accuracy of a one-stage detector, especially for small objects. The main difference between SSD and previous detectors is that SSD detects objects of different scales on different layers of the network, while earlier detectors only run detection on their top layers.
- RetinaNet
A new loss function named "focal loss" was introduced in RetinaNet, reshaping the standard cross-entropy loss so that the detector puts more focus on hard, misclassified examples during training. Focal loss enables one-stage detectors to achieve accuracy comparable to two-stage detectors while maintaining a very high detection speed.
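The focal loss itself, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), can be sketched for the binary case (the defaults gamma = 2, alpha = 0.25 are the ones reported in the RetinaNet paper):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).
    p is the predicted foreground probability, y the label in {0, 1}.
    With gamma = 0 this reduces to (alpha-weighted) cross entropy; larger
    gamma shrinks the loss of well-classified (easy) examples."""
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

The `(1 - p_t)**gamma` factor is the whole trick: confident, easy negatives (the overwhelming majority of anchors) contribute almost nothing, so training is dominated by hard examples without any explicit sampling.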
Object Detection Datasets and Metrics
In object detection, a number of well-known datasets and benchmarks have been released in the past 10 years, including the datasets of PASCAL VOC Challenges (e.g., VOC2007, VOC2012), ImageNet Large Scale Visual Recognition Challenge (e.g., ILSVRC2014), MS-COCO Detection Challenge, etc.
Metrics
- AP
In recent years, the most frequently used evaluation metric for object detection is "Average Precision (AP)". AP is defined as the average detection precision under different recalls and is usually evaluated in a category-specific manner. To compare performance over all object categories, the mean AP (mAP) averaged over all categories is usually used as the final performance metric.
- IoU
To measure object localization accuracy, the Intersection over Union (IoU) between the predicted box and the ground-truth box is checked against a predefined threshold, say 0.5. If the IoU exceeds the threshold, the object is counted as "successfully detected"; otherwise it is counted as "missed". The 0.5-IoU-based mAP has since become the de facto metric for object detection problems.
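The two metrics above can be sketched as follows. `average_precision` here is the simplified, uninterpolated form (PASCAL VOC historically used 11-point interpolation), and it assumes each detection has already been matched against ground truth via the IoU check:

```python
def iou(a, b):
    """Intersection over Union of axis-aligned boxes given as (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def average_precision(scores, correct, n_gt):
    """Uninterpolated AP for one category: sum the precision at each
    true-positive rank (detections sorted by confidence), divided by the
    total number of ground-truth objects n_gt."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if correct[i]:
            tp += 1
            ap += tp / rank
    return ap / n_gt if n_gt else 0.0
```

mAP is then just the mean of `average_precision` over all categories, with `correct[i]` set by whether detection i matches an unclaimed ground-truth box at IoU >= 0.5.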