An overview of the tracking procedure is illustrated in Fig. 1. We choose YOLO to collect rich and robust visual features, as well as preliminary location inferences, and we use LSTM in the next stage as it is temporally deep and appropriate for sequence processing. The proposed model is a deep neural network that takes raw video frames as input and returns the coordinates of a bounding box of the object being tracked in each frame. Mathematically, the proposed model factorizes the full tracking probability into

\[
p(B_1, B_2, \dots, B_T \mid X_1, X_2, \dots, X_T) = \prod_{t=1}^{T} p(B_t \mid B_{<t}, X_{\leq t}),
\]

where B_t and X_t are the location of the object and the input frame, respectively, at time t, and B_{<t} and X_{\leq t} denote the history of locations and of input frames up to time t.
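To make the architecture concrete, the sketch below shows one plausible way to wire per-frame YOLO features and preliminary box estimates into an LSTM that emits a bounding box for every frame. This is an illustrative assumption, not the authors' implementation: the names `RecurrentTracker`, `feature_dim`, and `hidden_dim`, and the choice of PyTorch, are placeholders introduced here.

```python
# Minimal sketch (assumed, not the paper's code) of a YOLO-feature + LSTM tracker.
import torch
import torch.nn as nn


class RecurrentTracker(nn.Module):
    """Predicts a bounding box B_t for each frame X_t in a sequence."""

    def __init__(self, feature_dim=4096, hidden_dim=512):
        super().__init__()
        # We assume YOLO features and a preliminary box estimate are
        # pre-extracted per frame and concatenated into one vector.
        self.lstm = nn.LSTM(input_size=feature_dim + 4,  # features + prior box
                            hidden_size=hidden_dim,
                            batch_first=True)
        self.box_head = nn.Linear(hidden_dim, 4)  # (x, y, w, h) per frame

    def forward(self, yolo_features, yolo_boxes):
        # yolo_features: (batch, T, feature_dim) visual features per frame
        # yolo_boxes:    (batch, T, 4) preliminary location inferences
        x = torch.cat([yolo_features, yolo_boxes], dim=-1)
        h, _ = self.lstm(x)        # hidden state at t depends on inputs up to t
        return self.box_head(h)    # predicted box B_t for every frame


# Usage on a dummy sequence of T = 10 frames
model = RecurrentTracker()
feats = torch.randn(1, 10, 4096)
boxes = torch.randn(1, 10, 4)
pred_boxes = model(feats, boxes)   # shape: (1, 10, 4)
```

The recurrence mirrors the factorization above: the LSTM's hidden state summarizes the input history up to time t, so each predicted box is conditioned on past frames and past location estimates rather than on the current frame alone.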