Object detection is a classic task in the field of computer vision. It is the basic premise for advanced visual tasks such as Video Content Recognition (VCR) and Image Understand- ing (IU) [3]. Some traditional object detection methods based on manual features have been gradually eliminated with the advent of neural networks because of their low accuracy
and poor environmental adaptability [4]–[9]. Because of the growth of computing power and the dedicated deep learning chips and the availability of large-scale labelled samples (e.g., ImageNet [10], [11], PASCAL VOC [12] and
MS-COCO [13]), Convolutional Neural Network(CNN) has been extensively studied due to its fast, scalable and end to-end learning framework.