Regressing coordinates directly is highly non-linear, which makes the mapping difficult to interpret. To understand what really happens inside the LSTM during tracking, especially under occlusion, we alternatively convert the ROLO prediction location into a feature vector of length 1024, which can be reshaped into a 32-by-32 heatmap, and we concatenate it with the 4096 visual features before feeding them into the LSTM. The advantage of the heatmap is that it allows the model to express confidence at multiple spatial locations, and it lets us visualize the intermediate results. The heatmap not only acts as an input feature but can also warp predicted positions in the image. During training, we transfer the region information from the detection box into the heatmap by assigning the value 1 to the region covered by the box and 0 elsewhere. Specifically, the detection box is converted to coordinates relative to the 32-by-32 heatmap, which is then flattened and concatenated with the 4096 visual features as the LSTM input. Let $H_{target}$ denote the heatmap vector of the ground truth and $H_{pred}$ denote the heatmap predicted at the LSTM output. The objective function is defined as: