Addressing the Memory Bottleneck in AI Model Training for Healthcare
Training 3D U-Net on a Large-Memory System
A single-node server with large memory has the potential to reduce an organization’s total cost
of ownership (TCO) while addressing the memory bottleneck involved in training large models
on complex datasets. Using a 4-socket 2nd Generation Intel Xeon Scalable Processor system
on a Dell EMC PowerEdge server equipped with 1.5 TB of system memory (Figure 4), we trained
the 3D U-Net model with the BraTS dataset (using only the “FLAIR” channel) without the need
to scale down the data or tile images to fit in memory. We used Intel-optimized TensorFlow,
available as an Anaconda library [9], with Conda as the Python virtual execution environment.
The Intel-optimized TensorFlow distribution incorporates the Deep Neural Network Library
(DNNL, formerly MKL-DNN) [10], allowing us to leverage the processors’ underlying hardware
features, including a high CPU core count (80 cores), AVX-512 for floating-point operations, and
integrated memory controllers supporting 1 TB of system memory per socket, to speed up the
training process.
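As a rough illustration of this setup, the following is a minimal sketch of how such an environment might be prepared and tuned, assuming a Conda environment and the OpenMP/TensorFlow threading knobs that DNNL-based builds typically expose. The channel name, thread counts, and environment-variable values are illustrative assumptions, not settings reported in this guide.

```python
# Minimal sketch (illustrative, not the configuration used in this guide):
# preparing Intel-optimized TensorFlow and tuning CPU threading.
#
# Environment setup, run once in a shell (channel name is an assumption):
#   conda create -n tf-intel python=3.7
#   conda activate tf-intel
#   conda install -c intel tensorflow   # Intel-optimized build with DNNL
import os

# OpenMP settings commonly suggested for DNNL (formerly MKL-DNN) workloads;
# they must be set before TensorFlow is imported. Values are assumptions.
os.environ["OMP_NUM_THREADS"] = "80"   # e.g., one thread per physical core
os.environ["KMP_BLOCKTIME"] = "1"      # release idle threads quickly
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

import tensorflow as tf

# intra-op threads parallelize a single op (e.g., one 3D convolution);
# inter-op threads run independent ops concurrently.
tf.config.threading.set_intra_op_parallelism_threads(80)
tf.config.threading.set_inter_op_parallelism_threads(4)
```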
Using this system configuration, we achieved close to state-of-the-art performance within 25
training iterations (epochs): 0.997 accuracy, 0.125 loss, and a 0.82 Dice coefficient. We also
profiled the memory footprint of the training task, comparing the results (Figure 5) with our
theoretical calculations from Table 1, and found our estimates to be accurate for our chosen
hyperparameters (batch, feature-map, and image sizes). Meanwhile, the training speed (TS) for
a single step (involving the forward and backward pass of a single 3D scan) per training epoch
Figure 4. Training infrastructure for the 3D U-Net model: a 4-socket 2nd Generation Intel Xeon
Scalable Processor system on a 2U Dell EMC PowerEdge R840 server.
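To make the kind of per-layer estimate behind Table 1 concrete, here is a minimal sketch that computes the memory held by a single fp32 feature map from batch, channel, and image sizes. The batch size and filter count are illustrative assumptions (the 240×240×155-voxel volume is the standard BraTS resolution); the actual figures from Table 1 are not reproduced here.

```python
# Hedged sketch: back-of-envelope activation-memory estimate of the kind
# compared against the profiled footprint. All sizes below are assumptions.

def feature_map_bytes(batch, depth, height, width, channels, dtype_bytes=4):
    """Bytes held by one fp32 feature map: N * D * H * W * C * 4."""
    return batch * depth * height * width * channels * dtype_bytes

# One full-resolution BraTS volume (240 x 240 x 155 voxels) after a
# hypothetical first encoder level with 32 filters, at batch size 1:
activation = feature_map_bytes(batch=1, depth=155, height=240, width=240,
                               channels=32)
print(f"{activation / 2**30:.2f} GiB")  # ~1.06 GiB for this one activation
```

Summing such terms over every layer of the network, and roughly doubling to account for the gradients retained during the backward pass, yields a first-order estimate of why full-size 3D volumes overwhelm typical accelerator memory yet fit comfortably in 1.5 TB of system memory.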