We then fine-tune on a much smaller dataset consisting of only 20 contrast-enhanced CT images from the Visceral Challenge dataset (Jimenez-del Toro et al., 2016), but with substantially more anatomical structures labeled in each image (20 in total). This fine-tuning process across different datasets is illustrated in Fig. 9, together with some ground-truth label examples used for pre-training and fine-tuning. For fine-tuning, we use a learning rate that is 10 times smaller than in pre-training. We furthermore test our models on a completely unseen collection of 10 torso CT images with 8 labels, including organs that were not labeled in the original abdominal dataset, e.g. the kidneys and lungs. A probabilistic output for the kidney (an organ not present in the pre-training dataset) from our model is shown in Fig. 10.