CNN-SVM is a CNN-based tracking method with robust features, but it lacks the temporal information needed to handle severe occlusion. YOLO with a Kalman filter takes into account the temporal evolution of locations but is ignorant of the actual visual environment. Because fast motion and occlusion occasionally lead to poor detections, YOLO with a Kalman filter performs worse, as it lacks knowledge of the visual context. In contrast, ROLO uses an LSTM to synthesize, over sequences, the robust image features together with their soft spatial supervision. ROLO is spatially deep: it is capable of interpreting the visual features and detecting objects on its own, and it can be spatially supervised by concatenating locations or heatmaps to the visual features. It is also temporally deep, as it explores temporal features as well as their possible locations.

Step size denotes the number of previous frames the LSTM considers for each prediction. In the previous experiments, we used a step size of 6. To shed light on how the LSTM step size affects overall performance and running time, we repeat the second experiment with various step sizes and illustrate the results in Fig. 9. In our experiments, we also tried dropout on the visual features, random offsets of the detection boxes during training (intended to make tracking more robust), and an auxiliary cost in the objective function to emphasize detection over visual features, but these results are inferior to those shown.
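To make the roles of spatial supervision and step size concrete, the following is a minimal sketch, not the authors' implementation. It concatenates each frame's visual feature vector with its detected box and then groups the frames into overlapping windows of `step_size` frames, one window per LSTM prediction. The function name `build_lstm_inputs`, the toy dimensions, and the choice to let each window end at the frame being predicted are assumptions made for illustration only.

```python
from typing import List, Sequence


def build_lstm_inputs(
    features: Sequence[Sequence[float]],
    boxes: Sequence[Sequence[float]],
    step_size: int = 6,
) -> List[List[List[float]]]:
    """Build sliding-window LSTM inputs from per-frame features and boxes.

    Hypothetical helper: each frame's input is its visual feature vector
    concatenated with its detection box (x, y, w, h), and each window
    holds `step_size` consecutive frames ending at the predicted frame.
    """
    if len(features) != len(boxes):
        raise ValueError("features and boxes must cover the same frames")
    if step_size < 1:
        raise ValueError("step_size must be at least 1")

    # Spatial supervision by concatenation: [visual features | box location].
    per_frame = [list(f) + list(b) for f, b in zip(features, boxes)]

    # One window per frame that has enough history behind it.
    return [
        per_frame[end - step_size + 1 : end + 1]
        for end in range(step_size - 1, len(per_frame))
    ]


# Toy data (assumed sizes): 10 frames, 4-dim features, 4-dim boxes.
features = [[float(t)] * 4 for t in range(10)]
boxes = [[0.1 * t, 0.2, 0.3, 0.4] for t in range(10)]

windows = build_lstm_inputs(features, boxes, step_size=6)
print(len(windows), len(windows[0]), len(windows[0][0]))  # 5 6 8
```

Under these assumptions, a larger step size gives the LSTM more history per prediction but yields fewer windows per sequence and more computation per window, which is the trade-off between performance and running time examined in Fig. 9.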