We first pre-train weights with a traditional CNN for general feature learning. The convolutional neural network takes a video frame as its input and produces a feature map of the whole image. The convolutional weights are learned on the 1000-class ImageNet dataset, so that the network acquires a generalized understanding of almost arbitrary visual objects. During pre-training, the output of the first fully connected layer is a feature vector of size 4096, a dense representation of the mid-level visual features. In principle, this feature vector can be fed into any classifier (such as an SVM or a CNN) to achieve good classification results with proper training.

Once the pre-trained weights are able to generate visual features, we adopt the YOLO architecture as the detection module. On top of the convolutional layers, YOLO uses fully connected layers to regress the feature representation into region predictions. These predictions are encoded as an $S \times S \times (B \times 5 + C)$ tensor: the image is divided into an $S \times S$ grid, and each grid cell predicts $B$ bounding boxes, each represented by 5 parameters, namely its location $(x, y, w, h)$ and its confidence $c$. A one-hot vector of length $C$ is also predicted, indicating the class label of each bounding box. In our framework, we follow the YOLO architecture and set $S = 7$, $B = 2$, $C = 20$. Each bounding box originally carries 6 predictions: $x$, $y$, $w$, $h$, class label, and confidence, but we nullify the class label and confidence for visual tracking, since the evaluation involves locations only.

Here, $(x, y)$ denotes the coordinates of the bounding box center, relative to the width and the height of the image, respectively. The width and height of the bounding box are likewise relative to those of the image. Consequently, $(x, y, w, h) \in [0, 1]^4$, which makes regression easier when these values are concatenated with the 4096-dimensional visual features and fed into the tracking module.
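To make the pre-training output concrete, the following PyTorch sketch shows how a 4096-dimensional feature vector can be read off the first fully connected layer of a convolutional backbone. This is a minimal toy model, not the paper's actual architecture: the layer sizes, the `FeatureExtractor` name, and the use of a 448×448 input (YOLO's input resolution) are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch, not the paper's exact backbone: a toy convolutional stack
# followed by a first fully connected layer producing the 4096-d feature
# vector described above. Layer sizes here are illustrative assumptions.
class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((7, 7)),   # collapse to a fixed 7x7 map
        )
        self.fc1 = nn.Linear(16 * 7 * 7, 4096)  # first fully connected layer

    def forward(self, x):
        x = self.conv(x)
        x = torch.flatten(x, 1)
        return self.fc1(x)                  # 4096-d visual feature vector

frame = torch.randn(1, 3, 448, 448)         # one 448x448 RGB video frame
features = FeatureExtractor()(frame)
print(features.shape)                        # torch.Size([1, 4096])
```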
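The $S \times S \times (B \times 5 + C)$ encoding can likewise be illustrated in a few lines of NumPy. With $S = 7$, $B = 2$, $C = 20$, the output tensor has shape $7 \times 7 \times 30$. Note that the exact channel ordering (boxes first versus class scores first) varies across YOLO implementations; the layout below is an assumption for illustration, with random values standing in for real network output.

```python
import numpy as np

S, B, C = 7, 2, 20                          # grid size, boxes per cell, classes
pred = np.random.rand(S, S, B * 5 + C)      # stand-in for the network output
print(pred.shape)                            # (7, 7, 30)

cell = pred[3, 4]                            # predictions of one grid cell
boxes = cell[:B * 5].reshape(B, 5)           # B boxes: x, y, w, h, confidence
class_scores = cell[B * 5:]                  # length-C class label vector
```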
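Finally, the input to the tracking module can be sketched as follows: the pixel-space box is normalized by the image dimensions so that $(x, y, w, h) \in [0, 1]^4$, then appended to the 4096-dimensional visual feature vector, giving a 4100-dimensional input. The pixel values here are arbitrary placeholders, and the feature vector is random rather than an actual CNN output.

```python
import numpy as np

img_w, img_h = 640, 480                      # example frame size in pixels
x_px, y_px, w_px, h_px = 320, 240, 128, 96   # example detected box in pixels

# Normalize the box center and size by the image width/height so that
# each value falls in [0, 1], as described above.
location = np.array([x_px / img_w, y_px / img_h,
                     w_px / img_w, h_px / img_h])

visual_features = np.random.rand(4096)       # stand-in for the CNN features
tracking_input = np.concatenate([visual_features, location])
print(tracking_input.shape)                   # (4100,)
```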