Our approach, dubbed INFER (INtermediate representations for distant FuturE pRediction), involves training an autoregressive model that takes in an intermediate representation of past states of the world and predicts a multimodal distribution over plausible future states. The input is an intermediate representation of the scene semantics: intermediate because it is neither too primitive (e.g., raw pixel intensities) nor too abstract (e.g., velocities, steering angles). Using these intermediate representations, we predict the plausible future locations of the Vehicles of Interest (VoI).
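
To make the data flow concrete, the following is a minimal sketch of an autoregressive rollout over an intermediate representation. It is not the INFER implementation: the grid encoding, the channel names `voi` and `road`, and the hand-written `toy_predictor` (which stands in for the learned model) are all assumptions introduced purely for illustration. The sketch shows only how a per-step score map can be normalized into a distribution over grid cells, and how the most likely cell can be fed back as input for the next step.

```python
import math
from typing import Callable, Dict, List, Tuple

Grid = List[List[float]]      # H x W grid of values
Frame = Dict[str, Grid]       # hypothetical encoding: channel name -> grid


def softmax2d(scores: Grid) -> Grid:
    """Normalize a grid of scores into a probability distribution over cells."""
    peak = max(max(row) for row in scores)
    exps = [[math.exp(v - peak) for v in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]


def argmax2d(grid: Grid) -> Tuple[int, int]:
    """Return (row, col) of the first maximal cell in row-major order."""
    best = (0, 0)
    for r, row in enumerate(grid):
        for c, v in enumerate(row):
            if v > grid[best[0]][best[1]]:
                best = (r, c)
    return best


def toy_predictor(frames: List[Frame]) -> Grid:
    """Stand-in for a learned model (assumption): favor road cells one
    column ahead of, and at most one row away from, the last VoI cell."""
    last = frames[-1]
    r0, c0 = argmax2d(last["voi"])
    road = last["road"]
    return [
        [0.0 if (c == c0 + 1 and abs(r - r0) <= 1 and road[r][c] > 0) else -30.0
         for c in range(len(road[0]))]
        for r in range(len(road))
    ]


def rollout(predict: Callable[[List[Frame]], Grid],
            history: List[Frame], steps: int) -> List[Grid]:
    """Autoregressive rollout: each predicted heatmap is collapsed to its
    most likely cell, which becomes the VoI channel of the next input frame."""
    frames = list(history)
    heatmaps = []
    for _ in range(steps):
        heat = softmax2d(predict(frames))
        heatmaps.append(heat)
        r, c = argmax2d(heat)
        voi = [[1.0 if (i, j) == (r, c) else 0.0 for j in range(len(heat[0]))]
               for i in range(len(heat))]
        # Static scene channels (here only the road) are carried over unchanged.
        frames.append({"voi": voi, "road": frames[-1]["road"]})
    return heatmaps


# A road that forks after the second column (1.0 = road, 0.0 = not road).
road = [[0.0, 0.0, 1.0, 1.0],
        [1.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 1.0]]
frame0 = {"voi": [[0.0] * 4, [1.0, 0.0, 0.0, 0.0], [0.0] * 4], "road": road}
heatmaps = rollout(toy_predictor, [frame0], steps=3)
```

In this toy scene the second heatmap places roughly equal mass on the two branches of the fork, which is the kind of multimodality a single regressed trajectory cannot express. Collapsing each heatmap to its argmax before feeding it back is a simplification made here for brevity; sampling from the heatmap or passing the full heatmap forward are equally plausible choices.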