Extensive empirical evaluation has been conducted, comparing the performance of ROLO with 10 distinct trackers on a suite of 30 challenging and publicly available video sequences. Specifically, we compare our results with the 9 trackers that achieved the best performance in the benchmark evaluation [26]: STRUCK [9], CXT [3], OAB [7], CSK [10], VTD [16], VTS [17], LSK [18], TLD [15], and RS [2]. In addition, CNN-SVM [13], a tracking algorithm based on CNN representations, serves as a baseline for trackers that adopt deep learning. We also use a modified version of SORT [1] to evaluate the tracking performance of YOLO combined with a Kalman filter. As a generic object detector, YOLO can be trained to recognize arbitrary objects. Since the performance of ROLO depends on the YOLO component, we choose the default YOLO model for a fair comparison. The model is pre-trained on the ImageNet dataset and fine-tuned on the VOC dataset, and is capable of detecting objects of 20 classes. We pick a subset of 30 videos from the benchmark in which the targets belong to these classes. The video sequences considered in this evaluation are summarized in Table 1. According to the experimental results of the benchmark methods, OTB-30 is on average more difficult than the full benchmark.
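The SORT-style baseline pairs per-frame YOLO detections with a Kalman filter. The following is a minimal illustrative sketch of that idea, not SORT's actual implementation: it runs a constant-velocity Kalman filter over the box center (cx, cy) only, with arbitrary noise parameters `q` and `r`, whereas SORT's state additionally models box scale and aspect ratio and handles data association.

```python
import numpy as np

class SimpleKalmanBoxTracker:
    """Constant-velocity Kalman filter over a box center (cx, cy).

    Illustrative sketch only; SORT's actual tracker also models
    scale/aspect ratio and associates detections across frames.
    """

    def __init__(self, cx, cy, q=1e-2, r=1e-1):
        # State: [cx, cy, vx, vy]; covariance initialized to identity.
        self.x = np.array([cx, cy, 0.0, 0.0])
        self.P = np.eye(4)
        # Constant-velocity transition: position += velocity each frame.
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = 1.0
        # We observe only the center coordinates.
        self.H = np.zeros((2, 4))
        self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = q * np.eye(4)  # process noise (assumed value)
        self.R = r * np.eye(2)  # measurement noise (assumed value)

    def predict(self):
        # Propagate state and covariance one frame ahead.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, cx, cy):
        # Correct the prediction with a detected center.
        z = np.array([cx, cy])
        y = z - self.H @ self.x                      # innovation
        S = self.H @ self.P @ self.H.T + self.R      # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```

In use, `predict()` is called once per frame; when a YOLO detection of the target class is available, its box center is fed to `update()`, and when the detector misses a frame the predicted center stands in for the measurement.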