7.4.1 Hyper-parameter tuning
This section lists the commands, with the hyper-parameter settings, used to maximize throughput performance in the single-node and distributed server implementations.
Figure 41 shows the significant impact of hyper-parameter tuning on throughput performance:
Single Node TensorFlow:
python3 tf_cnn_benchmarks.py --variable_update=replicated \
  --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet \
  --model=resnet50 --batch_size=128 --device=gpu --num_gpus=4 \
  --num_epochs=90 --print_training_accuracy=true --summary_verbosity=0 \
  --momentum=0.9 --piecewise_learning_rate_schedule='0.4;10;0.04;60;0.004' \
  --weight_decay=0.0001 --optimizer=momentum --use_fp16=True \
  --local_parameter_device=gpu --all_reduce_spec=nccl --display_every=1000
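The piecewise learning-rate schedule string alternates learning-rate values with the epoch boundaries at which they change, so '0.4;10;0.04;60;0.004' means a rate of 0.4 until epoch 10, 0.04 until epoch 60, and 0.004 thereafter. The following minimal sketch illustrates that interpretation (the helper piecewise_lr is hypothetical, written here only to show the format):

# Hypothetical helper illustrating how the
# --piecewise_learning_rate_schedule string is interpreted:
# values alternate between learning rates and epoch boundaries.
def piecewise_lr(schedule, epoch):
    """Return the learning rate active at `epoch` for a schedule
    string such as '0.4;10;0.04;60;0.004'."""
    parts = schedule.split(';')
    rates = [float(p) for p in parts[0::2]]      # 0.4, 0.04, 0.004
    boundaries = [int(p) for p in parts[1::2]]   # 10, 60
    for boundary, rate in zip(boundaries, rates):
        if epoch < boundary:
            return rate
    return rates[-1]                             # final rate after last boundary

assert piecewise_lr('0.4;10;0.04;60;0.004', 5) == 0.4
assert piecewise_lr('0.4;10;0.04;60;0.004', 30) == 0.04
assert piecewise_lr('0.4;10;0.04;60;0.004', 75) == 0.004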
Distributed Horovod TensorFlow:
mpirun -np 8 -H 192.168.11.1:4,192.168.11.2:4 \
  -x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=1 \
  -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_DEBUG=INFO \
  --bind-to none --map-by slot --mca plm_rsh_args "-p 50000" \
  python tf_cnn_benchmarks.py --variable_update=horovod \
  --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet \
  --model=resnet50 --batch_size=128 --num_epochs=90 --display_every=1000 \
  --device=gpu --print_training_accuracy=true --summary_verbosity=0 \
  --momentum=0.9 --piecewise_learning_rate_schedule='0.4;10;0.04;60;0.004' \
  --weight_decay=0.0001 --optimizer=momentum --use_fp16=True \
  --local_parameter_device=gpu --horovod_device=gpu \
  --datasets_num_private_threads=4
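In the mpirun line, -x exports each NCCL environment variable to every rank so the allreduce traffic runs over the InfiniBand interface ib0, and --mca plm_rsh_args "-p 50000" tells Open MPI's ssh launcher to connect on port 50000. Under --variable_update=horovod, tf_cnn_benchmarks delegates gradient averaging to Horovod. The following is a minimal standalone sketch of the same pattern in TensorFlow 1.x; the base learning rate of 0.1 is an illustrative value, not taken from the benchmark above:

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod; mpirun launches one process per GPU.
hvd.init()

# Pin each process to a single GPU based on its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the base learning rate by the number of workers
# (the linear-scaling rule commonly used with large batches).
opt = tf.train.MomentumOptimizer(learning_rate=0.1 * hvd.size(),
                                 momentum=0.9)

# Wrap the optimizer so gradients are averaged across all ranks
# (with NCCL allreduce when Horovod is built with NCCL support)
# before each weight update.
opt = hvd.DistributedOptimizer(opt)

# Broadcast the initial variables from rank 0 so every worker
# starts training from identical weights.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]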