9 Addressing the Memory Bottleneck in AI Model Training for Healthcare
We overcame this memory bottleneck on our development server by reducing the training batch size from 16 down to 2 and by downsampling the images to smaller dimensions instead of using the full-scale image feature map (240x240x144). Of course, both changes affect model accuracy and convergence time. Next, we upgraded the server’s system memory to its maximum supported capacity (384 GB), increased the image size to roughly one-half of full scale, and halved the batch size. In this scenario, the training job completed successfully. In the next section, we go over the details of the training infrastructure with a “memory-rich” server using the full-scale BraTS images.
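To make the workaround concrete, the sketch below shows one way to downsample full-scale BraTS volumes before training. It assumes a PyTorch pipeline; the tensor shapes, scale factor, and batch size are illustrative values taken from the discussion above, not our actual training code.

```python
import torch
import torch.nn.functional as F

def downsample_volume(volume: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    """Trilinearly downsample a (C, D, H, W) volume so training fits in memory."""
    # F.interpolate expects a batch dimension, so add one and remove it after.
    return F.interpolate(volume.unsqueeze(0), scale_factor=scale,
                         mode="trilinear", align_corners=False).squeeze(0)

# Full-scale BraTS feature map from the text: 240x240x144 voxels.
# The 4 channels (one per MRI modality) are an assumption for illustration.
full = torch.randn(4, 240, 240, 144)

# Halving each spatial dimension shrinks per-sample activation memory
# roughly eightfold; a smaller batch size reduces the peak further.
half = downsample_volume(full, scale=0.5)
print(full.shape, "->", half.shape)   # (4, 240, 240, 144) -> (4, 120, 120, 72)

batch_size = 1   # halved again from 2, as described above
```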
Figure 3. Benchmarking the memory usage of 3D U-Net model training over various input tensor sizes on an Intel Xeon Scalable processor-based server with 1.5 TB of system memory.
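As a rough illustration of how such a memory sweep can be scripted, the sketch below records peak resident memory while pushing progressively larger 3D inputs through a small stand-in for 3D U-Net. The model, the input sizes, and the measurement via resource.getrusage (Linux semantics, where ru_maxrss is in kilobytes) are all assumptions for illustration, not the tooling used to produce Figure 3.

```python
import resource
import torch
import torch.nn as nn

def peak_rss_gb() -> float:
    """Peak resident set size of this process in GB (Linux reports kilobytes)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024**2

# A tiny stand-in for 3D U-Net: two conv layers are enough to show how
# activation memory grows with the input tensor size.
model = nn.Sequential(nn.Conv3d(4, 32, 3, padding=1), nn.ReLU(),
                      nn.Conv3d(32, 32, 3, padding=1))
loss_fn = nn.MSELoss()

for side in (64, 96, 128):  # hypothetical input sizes to sweep
    x = torch.randn(1, 4, side, side, side)
    out = model(x)
    loss_fn(out, torch.zeros_like(out)).backward()
    # Peak RSS only grows, so sweep sizes in increasing order.
    print(f"input {side}^3: peak RSS ~ {peak_rss_gb():.2f} GB")
```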