Addressing the Memory Bottleneck in AI Model Training for Healthcare
Training 3D U-Net on a Large-Memory System
A single-node server with large memory has the potential to reduce an organization’s total cost
of ownership (TCO) while addressing the memory bottleneck involved in training large models
on complex datasets. Using a 4-socket 2nd Generation Intel Xeon Scalable Processor system
on a Dell EMC PowerEdge server equipped with 1.5 TB of system memory (Figure 4), we trained
the 3D U-Net model with the BraTS dataset (using only the “FLAIR” channel) without the need
to scale down the data or tile images to fit in memory. We used Intel-optimized TensorFlow,
available as an Anaconda library [9], with Conda as the Python virtual execution environment.
The Intel-optimized TensorFlow distribution incorporates the Deep Neural Network Library
(DNNL, formerly MKL-DNN) [10], allowing us to leverage the processors’ underlying hardware
features, including a high CPU core count (80 cores), AVX-512 for floating-point operations, and
integrated memory controllers supporting 1 TB of system memory per socket, to speed up the
training process.
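As a rough illustration of this setup, the following is a minimal sketch of how such an environment might be prepared and tuned, assuming a Conda environment and the OpenMP/TensorFlow threading knobs that DNNL-based builds typically expose. The channel name, thread counts, and environment-variable values are illustrative assumptions, not settings reported in this guide.

```python
# Minimal sketch (illustrative, not the configuration used in this guide):
# preparing Intel-optimized TensorFlow and tuning CPU threading.
#
# Environment setup, run once in a shell (channel name is an assumption):
#   conda create -n tf-intel python=3.7
#   conda activate tf-intel
#   conda install -c intel tensorflow   # Intel-optimized build with DNNL
import os

# OpenMP settings commonly suggested for DNNL (formerly MKL-DNN) workloads;
# they must be set before TensorFlow is imported. Values are assumptions.
os.environ["OMP_NUM_THREADS"] = "80"   # e.g., one thread per physical core
os.environ["KMP_BLOCKTIME"] = "1"      # release idle threads quickly
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

import tensorflow as tf

# intra-op threads parallelize a single op (e.g., one 3D convolution);
# inter-op threads run independent ops concurrently.
tf.config.threading.set_intra_op_parallelism_threads(80)
tf.config.threading.set_inter_op_parallelism_threads(4)
```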
Using this system configuration, we achieved close to state-of-the-art performance within 25
training iterations (epochs): 0.997 accuracy, 0.125 loss, and a 0.82 Dice coefficient. We also
profiled the memory footprint of the training task, comparing the results (Figure 5) with our
theoretical calculations from Table 1, and found our estimates to be accurate for our chosen
hyperparameters (batch, feature-map, and image sizes). Meanwhile, the training speed (TS) for
a single step (involving the forward and backward pass of a single 3D scan) per training epoch
Figure 4. Training infrastructure for the 3D U-Net model: a 4-socket 2nd Generation Intel Xeon
Scalable Processor system on a 2U Dell EMC PowerEdge R840 server.
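To make the kind of per-layer estimate behind Table 1 concrete, here is a minimal sketch that computes the memory held by a single fp32 feature map from batch, channel, and image sizes. The batch size and filter count are illustrative assumptions (the 240×240×155-voxel volume is the standard BraTS resolution); the actual figures from Table 1 are not reproduced here.

```python
# Hedged sketch: back-of-envelope activation-memory estimate of the kind
# compared against the profiled footprint. All sizes below are assumptions.

def feature_map_bytes(batch, depth, height, width, channels, dtype_bytes=4):
    """Bytes held by one fp32 feature map: N * D * H * W * C * 4."""
    return batch * depth * height * width * channels * dtype_bytes

# One full-resolution BraTS volume (240 x 240 x 155 voxels) after a
# hypothetical first encoder level with 32 filters, at batch size 1:
activation = feature_map_bytes(batch=1, depth=155, height=240, width=240,
                               channels=32)
print(f"{activation / 2**30:.2f} GiB")  # ~1.06 GiB for this one activation
```

Summing such terms over every layer of the network, and roughly doubling to account for the gradients retained during the backward pass, yields a first-order estimate of why full-size 3D volumes overwhelm typical accelerator memory yet fit comfortably in 1.5 TB of system memory.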