White Papers

45 CheXNet Inference with Nvidia T4 on Dell EMC PowerEdge R7425
A Troubleshooting
In this section we describe the main issues we faced while implementing the custom model CheXNet with
Nvidia TensorRT™ and how we solved them:
TensorRT™ installation. For TF-TRT integration, we recommend working with the docker image
nvcr.io/nvidia/tensorflow:<tag version>-py3. For native TensorRT™, we recommend working with the
docker image nvcr.io/nvidia/tensorrt:<tag version>-py3.
Python path to TF models. If using a TensorFlow official model as the base model, and working
within the docker environment, make sure to include the python path to the official models once
inside the docker: export PYTHONPATH="$PYTHONPATH:/home/models/".
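The same path can also be registered from inside a Python script instead of via the environment variable; a minimal sketch, assuming the /home/models location used above:

```python
# Register the official-models directory on the module search path at runtime,
# as an alternative to exporting PYTHONPATH before launching the script.
import sys

MODELS_DIR = "/home/models"  # location assumed in this document's setup

if MODELS_DIR not in sys.path:
    sys.path.append(MODELS_DIR)

print(MODELS_DIR in sys.path)  # True once the path is registered
```

This is convenient when the script is launched by a scheduler or notebook kernel that does not inherit the shell's environment.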
ImageNet TFRecords. If using a TensorFlow official model as the base model, make sure that
there are no missing TFRecord files in the dataset. If some shards are missing, update the file
/home/models/official/resnet/imagenet_main.py accordingly.
Non-supported Layer Error. Before building the custom model, double-check that the operations
used by the selected framework are supported by TensorRT™; otherwise, the network subgraph
conversion will fail. In our case, we started with the Keras framework on the TensorFlow
backend, and the TensorRT™ script failed to convert most of the nodes. We then switched the
model to the native TensorFlow framework version, which resolved the issues. See Supported
operations for TF-TRT Integration [13].
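A pre-conversion check of this kind can be sketched in plain Python: walk the graph's node ops and flag anything outside a supported-op whitelist. The whitelist and node list below are illustrative stand-ins, not the real TF-TRT support matrix; consult the supported-operations list [13] for the authoritative set:

```python
# Sketch: flag graph ops that the converter does not support, before attempting
# subgraph conversion. SUPPORTED_OPS is an illustrative subset only.
SUPPORTED_OPS = {"Conv2D", "BiasAdd", "Relu", "MaxPool", "MatMul", "Softmax"}

def unsupported_ops(graph_nodes):
    """graph_nodes: iterable of (node_name, op_type) pairs.
    Returns the nodes whose op type is outside the whitelist."""
    return [(name, op) for name, op in graph_nodes if op not in SUPPORTED_OPS]

# Example: a custom op lowered from a Keras Lambda layer is caught early,
# instead of failing midway through conversion.
nodes = [
    ("conv1", "Conv2D"),
    ("act1", "Relu"),
    ("custom", "PyFunc"),  # hypothetical unsupported node
]
print(unsupported_ops(nodes))  # [('custom', 'PyFunc')]
```

Checking the graph this way turns a late conversion failure into an early, actionable report.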
Unimplemented: Not supported constant type at Const_1/Const_5 Error. This error is related to
the same issue above. At the time the tests were conducted, some Keras layers appeared not to
be supported by the TF-TRT integration.
No conversion function registered for layer IteratorGetNext Error. This error was thrown
by the system because the input function was not configured in the model. When building the
custom model, make sure to define the input function properly, and when exporting the model
with export_savedmodel make sure to configure the input_receiver_fn for serving as
input_receiver_fn=export.build_tensor_serving_input_receiver_fn(shape,
batch_size=FLAGS.batch_size)
Cuda Error in allocate:2. Subgraph conversion error for subgraph_index 1 due to:
"Internal: Engine building failure SKIPPING (437 nodes)". Sometimes this error is related
to the GPU memory capacity; try running the tests with a lower batch size and one precision
mode at a time.
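The lower-batch-size workaround can be automated with a simple backoff loop. The run_inference callable below is a hypothetical stand-in for whatever launches the engine build or inference run; it is assumed to raise MemoryError (or the framework's own OOM exception) when the batch does not fit:

```python
# Sketch: halve the batch size until the run fits in GPU memory.
# run_inference is a hypothetical stand-in, not a TensorRT API.
def find_workable_batch_size(run_inference, batch_size=128, min_batch=1):
    while batch_size >= min_batch:
        try:
            run_inference(batch_size)
            return batch_size
        except MemoryError:
            batch_size //= 2  # back off and retry with half the batch
    raise RuntimeError("no workable batch size found")

# Simulated run: pretend anything above 32 exhausts GPU memory.
def fake_run(batch):
    if batch > 32:
        raise MemoryError

print(find_workable_batch_size(fake_run))  # 32
```

Pairing this with one precision mode per run keeps peak memory predictable while sweeping configurations.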
Tensor batch_normalization/beta is not found in resnet_v2_imagenet_checkpoint error.
In our case we built the custom model CheXNet using transfer learning and the TensorFlow
official pre-trained ResnetV2_50 checkpoints downloaded from its website. This error was
produced because, at the time the model was trained, we didn't place our variables in the same