Motivated by this, we use semantics as an intermediate representation and train a neural autoregressive model that generalizes (zero-shot) to datasets it has never been trained on. Specifically, we train a model that predicts future trajectories of traffic participants on the KITTI [1] dataset and test it on different datasets [2], [3], showing that the network performs well across these datasets, which differ in scene layout and weather conditions. The model also generalizes well across sensing modalities: models trained using a combination of stereo and LiDAR data perform well even when either of those modalities is absent at test time. In addition to transferring well, our approach outperforms the current best future prediction model [6] on the KITTI [1] dataset by a significant margin when predicting deep into the future (3-4 sec). Furthermore, we conduct a thorough ablation study of our intermediate representations to answer the question "What kind of semantic information is crucial to accurate future prediction?" We then showcase one important application of future prediction—multi-object tracking—and present results on select sequences from the KITTI [1] and Cityscapes [2] datasets.