9 Addressing the Memory Bottleneck in AI Model Training for Healthcare
We overcame this memory bottleneck on our development server by reducing the training batch size from 16 down to 2 and by downsampling the images to smaller dimensions instead of using the full-scale image feature map (240x240x144). Of course, both changes affect model accuracy and convergence time. Next, we upgraded the server’s system memory to its maximum supported capacity (384 GB), increased the image size to roughly one-half of full scale, and halved the batch size. In this scenario, the training job completed successfully. In the next section, we go over the details of the training infrastructure with a “memory-rich” server using the full-scale BraTS images.
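To make the workaround concrete, the sketch below shows one way to downsample full-scale BraTS volumes before training. It assumes a PyTorch pipeline; the tensor shapes, scale factor, and batch size are illustrative values taken from the discussion above, not our actual training code.

```python
import torch
import torch.nn.functional as F

def downsample_volume(volume: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    """Trilinearly downsample a (C, D, H, W) volume so training fits in memory."""
    # F.interpolate expects a batch dimension, so add one and remove it after.
    return F.interpolate(volume.unsqueeze(0), scale_factor=scale,
                         mode="trilinear", align_corners=False).squeeze(0)

# Full-scale BraTS feature map from the text: 240x240x144 voxels.
# The 4 channels (one per MRI modality) are an assumption for illustration.
full = torch.randn(4, 240, 240, 144)

# Halving each spatial dimension shrinks per-sample activation memory
# roughly eightfold; a smaller batch size reduces the peak further.
half = downsample_volume(full, scale=0.5)
print(full.shape, "->", half.shape)   # (4, 240, 240, 144) -> (4, 120, 120, 72)

batch_size = 1   # halved again from 2, as described above
```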
Figure 3. Benchmarking the memory usage of 3D U-Net model training over various input tensor sizes on an Intel Xeon Scalable processor-based server with 1.5 TB of system memory.
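As a rough illustration of how such a memory sweep can be scripted, the sketch below records peak resident memory while pushing progressively larger 3D inputs through a small stand-in for 3D U-Net. The model, the input sizes, and the measurement via resource.getrusage (Linux semantics, where ru_maxrss is in kilobytes) are all assumptions for illustration, not the tooling used to produce Figure 3.

```python
import resource
import torch
import torch.nn as nn

def peak_rss_gb() -> float:
    """Peak resident set size of this process in GB (Linux reports kilobytes)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024**2

# A tiny stand-in for 3D U-Net: two conv layers are enough to show how
# activation memory grows with the input tensor size.
model = nn.Sequential(nn.Conv3d(4, 32, 3, padding=1), nn.ReLU(),
                      nn.Conv3d(32, 32, 3, padding=1))
loss_fn = nn.MSELoss()

for side in (64, 96, 128):  # hypothetical input sizes to sweep
    x = torch.randn(1, 4, side, side, side)
    out = model(x)
    loss_fn(out, torch.zeros_like(out)).backward()
    # Peak RSS only grows, so sweep sizes in increasing order.
    print(f"input {side}^3: peak RSS ~ {peak_rss_gb():.2f} GB")
```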