7.4.1 Hyper-parameter tuning
This section lists the commands, with the hyper-parameter settings, used to maximize throughput performance in the single-node and distributed server implementations.
Figure 41 shows the significant impact of hyper-parameter tuning on throughput performance:
Single Node TensorFlow:
python3 tf_cnn_benchmarks.py --variable_update=replicated \
  --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet \
  --model=resnet50 --batch_size=128 --device=gpu --num_gpus=4 \
  --num_epochs=90 --print_training_accuracy=true --summary_verbosity=0 \
  --momentum=0.9 --piecewise_learning_rate_schedule='0.4;10;0.04;60;0.004' \
  --weight_decay=0.0001 --optimizer=momentum --use_fp16=True \
  --local_parameter_device=gpu --all_reduce_spec=nccl --display_every=1000
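The piecewise learning-rate schedule string alternates learning-rate values with the epoch boundaries at which they change, so '0.4;10;0.04;60;0.004' means a rate of 0.4 until epoch 10, 0.04 until epoch 60, and 0.004 thereafter. The following minimal sketch illustrates that interpretation (the helper piecewise_lr is hypothetical, written here only to show the format):

# Hypothetical helper illustrating how the
# --piecewise_learning_rate_schedule string is interpreted:
# values alternate between learning rates and epoch boundaries.
def piecewise_lr(schedule, epoch):
    """Return the learning rate active at `epoch` for a schedule
    string such as '0.4;10;0.04;60;0.004'."""
    parts = schedule.split(';')
    rates = [float(p) for p in parts[0::2]]      # 0.4, 0.04, 0.004
    boundaries = [int(p) for p in parts[1::2]]   # 10, 60
    for boundary, rate in zip(boundaries, rates):
        if epoch < boundary:
            return rate
    return rates[-1]                             # final rate after last boundary

assert piecewise_lr('0.4;10;0.04;60;0.004', 5) == 0.4
assert piecewise_lr('0.4;10;0.04;60;0.004', 30) == 0.04
assert piecewise_lr('0.4;10;0.04;60;0.004', 75) == 0.004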
Distributed Horovod TensorFlow:
mpirun -np 8 -H 192.168.11.1:4,192.168.11.2:4 \
  -x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=1 \
  -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_DEBUG=INFO \
  --bind-to none --map-by slot --mca plm_rsh_args "-p 50000" \
  python tf_cnn_benchmarks.py --variable_update=horovod \
  --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet \
  --model=resnet50 --batch_size=128 --num_epochs=90 --display_every=1000 \
  --device=gpu --print_training_accuracy=true --summary_verbosity=0 \
  --momentum=0.9 --piecewise_learning_rate_schedule='0.4;10;0.04;60;0.004' \
  --weight_decay=0.0001 --optimizer=momentum --use_fp16=True \
  --local_parameter_device=gpu --horovod_device=gpu \
  --datasets_num_private_threads=4
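In the mpirun line, -x exports each NCCL environment variable to every rank so the allreduce traffic runs over the InfiniBand interface ib0, and --mca plm_rsh_args "-p 50000" tells Open MPI's ssh launcher to connect on port 50000. Under --variable_update=horovod, tf_cnn_benchmarks delegates gradient averaging to Horovod. The following is a minimal standalone sketch of the same pattern in TensorFlow 1.x; the base learning rate of 0.1 is an illustrative value, not taken from the benchmark above:

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod; mpirun launches one process per GPU.
hvd.init()

# Pin each process to a single GPU based on its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the base learning rate by the number of workers
# (the linear-scaling rule commonly used with large batches).
opt = tf.train.MomentumOptimizer(learning_rate=0.1 * hvd.size(),
                                 momentum=0.9)

# Wrap the optimizer so gradients are averaged across all ranks
# (with NCCL allreduce when Horovod is built with NCCL support)
# before each weight update.
opt = hvd.DistributedOptimizer(opt)

# Broadcast the initial variables from rank 0 so every worker
# starts training from identical weights.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]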