In our experiments, the input to the network is fixed to a given size Nx × Ny × Nz, chosen mainly based on the available GPU memory. During training, sub-volumes of that size are randomly sampled from the candidate regions within the training CT images, as described below. To increase the field of view presented to the CNN and to reduce redundant information among neighboring voxels, each image is downsampled by a factor of 2. The resulting prediction maps are then resampled back to the original resolution using nearest-neighbor interpolation (or linear interpolation in the case of the probability maps).

1st Stage. In the first stage, we apply simple thresholding in combination with morphological operations (hole filling and largest-component selection) to obtain a mask of the patient's body. This mask serves as the candidate region C1: it reduces the number of voxels over which the network's loss function is computed, and it restricts the 3D input regions shown to the CNN during training to about 40% of each volume.

2nd Stage. After training the first-stage FCN, it is applied to each image to generate candidate regions C2 for training the second-stage FCN (see Fig. 1). In the testing phase, the predicted organ labels are defined as the argmax of the class probability maps. All foreground labels are then dilated in 3D with a voxel radius r, yielding the binary candidate map C2. Comparing the recall and false-positive rates of the first-stage FCN as a function of r on both the training and validation sets, r = 3 gives a good trade-off between high recall (> 99%) and a low false-positive rate (∼10%) for each organ (see Fig. 6).
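To make the resampling step concrete, the following is a minimal sketch using NumPy and SciPy. The function names and the use of linear interpolation when downsampling the input images are illustrative assumptions; the nearest-neighbor versus linear distinction for label and probability maps follows the text above.

```python
import numpy as np
from scipy.ndimage import zoom

def downsample(image, factor=2):
    """Downsample a 3D CT volume by an integer factor.
    Linear interpolation (order=1) is an assumed choice; the paper
    does not state which interpolation is used on the inputs."""
    return zoom(image, 1.0 / factor, order=1)

def upsample_labels(label_map, target_shape):
    """Resample a predicted label map back to the original resolution
    using nearest-neighbor interpolation (order=0) to keep labels discrete."""
    factors = [t / s for t, s in zip(target_shape, label_map.shape)]
    return zoom(label_map, factors, order=0)

def upsample_probabilities(prob_map, target_shape):
    """Resample a class probability map using linear interpolation (order=1)."""
    factors = [t / s for t, s in zip(target_shape, prob_map.shape)]
    return zoom(prob_map, factors, order=1)
```

Note that zoom rounds output dimensions, so the resampled volume may need a final crop or pad to match the original shape exactly.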
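The first-stage candidate region C1 can be computed with standard morphological tools. The sketch below assumes SciPy; the threshold of -500 HU is a hypothetical value, as the paper does not specify one.

```python
import numpy as np
from scipy.ndimage import binary_fill_holes, label

def body_mask(ct_volume, threshold_hu=-500):
    """Candidate region C1: threshold the CT volume, fill holes,
    and keep the largest connected component (the patient's body).
    threshold_hu = -500 is an assumed value, not from the paper."""
    mask = ct_volume > threshold_hu
    mask = binary_fill_holes(mask)
    labeled, num = label(mask)
    if num == 0:
        return mask
    sizes = np.bincount(labeled.ravel())
    sizes[0] = 0  # ignore the background label
    return labeled == sizes.argmax()
```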
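Similarly, a minimal sketch of the second-stage candidate map C2 described above, again assuming SciPy. The array layout (class axis first) is an illustrative assumption, and dilating by r iterations of the 6-connected structuring element approximates a Euclidean ball of radius r.

```python
import numpy as np
from scipy.ndimage import binary_dilation, generate_binary_structure

def candidate_region_c2(prob_maps, r=3):
    """Candidate region C2: take the argmax over the class probability
    maps, then dilate all foreground labels in 3D by a voxel radius r
    (r = 3 gave the reported recall/false-positive trade-off).
    prob_maps has shape (num_classes, X, Y, Z), class 0 = background."""
    labels = prob_maps.argmax(axis=0)
    foreground = labels > 0
    struct = generate_binary_structure(3, 1)  # 6-connected neighborhood
    return binary_dilation(foreground, structure=struct, iterations=r)
```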