Qualitative results in Fig. 5 show that ROLO successfully tracks the object under occlusion in unseen frames. Note that during frames 776-783, ROLO continues tracking the vehicle even though the detection module fails.

To analyze the behavior of the LSTM under occlusion, we also train an alternative ROLO model that uses a heatmap instead of location coordinates. The model is trained offline on 1/3 of the frames from OTB-30 and tested on unseen videos. Fig. 6 shows that ROLO tracks the object through near-complete occlusion. Even though two similar targets appear simultaneously in this video, ROLO tracks the correct one, because the detection module inherently feeds the LSTM unit with a spatial constraint. Note that between frames 47 and 60, YOLO fails to detect the target but ROLO does not lose the track. The heatmap contains minor noise when no detection is available, because the similar target is still in sight. Nevertheless, ROLO places more confidence on the real target even when it is fully occluded, because it exploits the history of locations as well as the visual features.

We attribute the effectiveness of ROLO to several factors: (1) the representation power of the high-level visual features from the convNets; (2) the feature interpretation power of the LSTM, and therefore its ability to detect visual objects, which is spatially supervised by a location or heatmap vector; and (3) the capability of regressing effectively with spatio-temporal information.
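As an illustration of the heatmap variant discussed above, the following is a minimal sketch of how a bounding box might be encoded as a heatmap vector and decoded back. It is not the implementation used in our experiments: the 32 x 32 grid size, the normalized (cx, cy, w, h) box format, the binary fill, and the function names `box_to_heatmap` and `heatmap_to_box` are all assumptions made for this example.

```python
import numpy as np

GRID = 32  # assumed heatmap resolution (hypothetical; chosen for illustration)


def box_to_heatmap(box, grid=GRID):
    """Rasterize a normalized (cx, cy, w, h) box into a flat binary heatmap."""
    cx, cy, w, h = box
    # Convert center/size to corner indices in grid cells, clipped to the grid.
    x0 = int(np.clip(np.floor((cx - w / 2) * grid), 0, grid - 1))
    y0 = int(np.clip(np.floor((cy - h / 2) * grid), 0, grid - 1))
    # Guarantee that the box covers at least one cell in each dimension.
    x1 = int(np.clip(np.ceil((cx + w / 2) * grid), x0 + 1, grid))
    y1 = int(np.clip(np.ceil((cy + h / 2) * grid), y0 + 1, grid))
    heat = np.zeros((grid, grid), dtype=np.float32)
    heat[y0:y1, x0:x1] = 1.0
    return heat.ravel()


def heatmap_to_box(heat, grid=GRID, threshold=0.5):
    """Recover a normalized (cx, cy, w, h) box from a flat heatmap.

    Returns None when no cell reaches the threshold.
    """
    mask = np.asarray(heat).reshape(grid, grid) >= threshold
    if not mask.any():
        return None
    ys, xs = np.nonzero(mask)
    # Tight bounding rectangle around all active cells.
    x0, x1 = xs.min(), xs.max() + 1
    y0, y1 = ys.min(), ys.max() + 1
    return (
        (x0 + x1) / (2 * grid),
        (y0 + y1) / (2 * grid),
        (x1 - x0) / grid,
        (y1 - y0) / grid,
    )


if __name__ == "__main__":
    target = box_to_heatmap((0.5, 0.5, 0.25, 0.25))
    print(target.shape, heatmap_to_box(target))
```

Under these assumptions, a predicted heatmap can be inspected cell by cell, which makes it possible to see where the model places its confidence when the target is occluded or a similar object is in view; a four-value coordinate output does not expose this.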