Inspired by the recent success of regression-based object detectors, we propose a new system of neural networks that (1) processes spatiotemporal information and (2) infers region locations effectively. Our method extends the YOLO deep convolutional neural network into the spatiotemporal domain using recurrent neural networks; we therefore refer to it as ROLO (recurrent YOLO). The architecture of the proposed ROLO is shown in Fig. 2. Specifically, (1) we use YOLO to collect rich and robust visual features as well as preliminary location inferences, and we use an LSTM in the next stage because it is spatially deep and appropriate for sequence processing. (2) Inspired by YOLO’s location inference by regression, we study the regression capability of LSTM in this paper and propose to concatenate the high-level visual features produced by convolutional networks with region information. The end-to-end training of the ROLO model has three phases: pre-training of the convolutional layers for feature learning, the traditional YOLO training phase for object proposal, and the LSTM training phase for object tracking.
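To make the data flow concrete, the following is a minimal sketch of the concatenate-then-regress idea described above; it is not the ROLO implementation. It assumes NumPy in place of a deep learning framework, a standard LSTM cell with a linear output head, random untrained weights, and random stand-ins for the detector outputs. All dimensions, the (x, y, w, h) box encoding, and the function names are hypothetical choices made for illustration.

```python
import numpy as np

FEAT_DIM = 16    # hypothetical size of the per-frame visual feature vector
BOX_DIM = 4      # hypothetical box encoding: (x, y, w, h)
HIDDEN_DIM = 8   # hypothetical LSTM state size


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(rng, input_dim, hidden_dim, out_dim):
    """Random, untrained weights; a real model would learn these."""
    scale = 0.1
    return {
        "W": rng.normal(0.0, scale, (4 * hidden_dim, input_dim + hidden_dim)),
        "b": np.zeros(4 * hidden_dim),
        "W_out": rng.normal(0.0, scale, (out_dim, hidden_dim)),
        "b_out": np.zeros(out_dim),
    }


def lstm_step(params, x, h, c):
    """One standard LSTM cell update for a single input vector x."""
    z = params["W"] @ np.concatenate([x, h]) + params["b"]
    i, f, o, g = np.split(z, 4)  # input, forget, output gates and candidate
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new


def track(params, features, boxes):
    """Regress one box per frame from concatenated [features, box] inputs."""
    hidden_dim = params["W_out"].shape[1]
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    outputs = []
    for feat, box in zip(features, boxes):
        x = np.concatenate([feat, box])  # visual features + region information
        h, c = lstm_step(params, x, h, c)
        outputs.append(params["W_out"] @ h + params["b_out"])  # linear regression head
    return np.stack(outputs)


rng = np.random.default_rng(0)
num_frames = 5
features = rng.normal(size=(num_frames, FEAT_DIM))  # stand-in for detector features
boxes = rng.uniform(size=(num_frames, BOX_DIM))     # stand-in for preliminary detections
params = init_params(rng, FEAT_DIM + BOX_DIM, HIDDEN_DIM, BOX_DIM)
predicted = track(params, features, boxes)
print(predicted.shape)
```

Because the hidden and cell states are carried from frame to frame, each regressed box depends on the whole sequence seen so far rather than on the current detection alone. With untrained weights the predicted values are meaningless; the sketch only shows how the inputs are assembled and how one location vector is produced per frame.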