TensorRT Summary

Rumor has it that NVIDIA TensorRT, a high-performance deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications, achieves the fastest speeds on CUDA platforms. To evaluate the availability, robustness, reliability and flexibility of TensorRT, I have delved into it for almost a month, dealing with troubles every day, and wrote this document as a summary.

From MXNet to TensorRT: What, Why and How

Typically, our training code is written in MXNet, with symbols and parameters saved separately. To deliver faster inference, however, we should use TensorRT engines, whose network definitions are rather different. Thus, converting MXNet models into TensorRT engines is necessary, and ONNX is the indispensable bridge in that conversion.

ONNX is an open format for representing deep learning models. With ONNX, AI developers can more easily move models between state-of-the-art tools and choose the combination that is best for them. Recently, an ONNX importer and exporter have been merged into the MXNet master branch. Before using them, you should download the latest MXNet sources from the master branch and build them following the official instructions. Then, transforming an MXNet model into an ONNX model is as easy as one line:

from mxnet.contrib.onnx import export_model

filename = "model.onnx"
export_model(sym, params, input_shape, onnx_file_path=filename)

where sym refers to the model’s symbol and params refers to the model’s parameters; each can be either a path (str) or an object (a Symbol or a dict, respectively). You also need to know the input shapes and wrap the tuples in a list. onnx_file_path is optional and defaults to model.onnx.
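
For a concrete illustration (the checkpoint file names here are placeholders, and the 299x299 input shape assumes the Inception-V3 model used later in this document), the call might look like this:

import numpy as np
from mxnet.contrib.onnx import export_model

# Placeholder checkpoint file names; adjust to your own model files
sym = "./Inception-7-symbol.json"
params = "./Inception-7-0000.params"

# One (batch, channel, height, width) tuple per input, wrapped in a list
input_shape = [(1, 3, 299, 299)]

export_model(sym, params, input_shape, np.float32, onnx_file_path="model.onnx")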

Then, onnx-tensorrt, an open source project belonging to the ONNX community, can convert ONNX models into serialized TensorRT engines with a command like:

onnx2trt model.onnx -o engine.trt

onnx-tensorrt also provides a TensorRT backend, which, in my experience, is not easy to use. For more usage details, consult the official documentation.
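
For reference, a minimal sketch of the backend usage, based on my reading of the onnx-tensorrt examples (the device string and the input shape are assumptions), looks roughly like this:

import numpy as np
import onnx
import onnx_tensorrt.backend as backend

# Load the ONNX model and let the backend build a TensorRT engine for it
model = onnx.load("model.onnx")
engine = backend.prepare(model, device="CUDA:0")

# Run inference on a dummy input (shape assumed for Inception-V3)
input_data = np.random.random(size=(1, 3, 299, 299)).astype(np.float32)
output_data = engine.run(input_data)[0]
print(output_data.shape)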

Install the Prerequisites

In this subsection, I’ll explain how to install the prerequisites: protobuf, TensorRT, onnx and onnx-tensorrt. Fortunately, there is a script that installs protobuf and TensorRT automatically; you just need to run

chmod +x tensorrt.sh
./tensorrt.sh

You can install onnx either from PyPI

pip install onnx

or build locally from source.

git clone --recursive https://github.com/onnx/onnx.git
cd onnx
python setup.py install
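
To confirm that the installation succeeded, a quick check of the installed version should work:

import onnx
print(onnx.__version__)  # prints the installed ONNX version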

Finally, install onnx-tensorrt. To begin with, find the location of your TensorRT installation, which is /usr/src/tensorrt in my case. Then type the following commands:

git clone --recursive https://github.com/onnx/onnx-tensorrt.git
cd onnx-tensorrt
mkdir build
cd build
cmake .. -DTENSORRT_ROOT=/usr/src/tensorrt
make -j8
sudo make install

Please make sure that the -DTENSORRT_ROOT argument points to your actual TensorRT location.

Inference with TensorRT: Code Examples and Test Results

Suppose you have a serialized TensorRT engine named Inception-7.trt, converted from the Inception-V3 network in the mxnet-model-gallery. To use TensorRT for inference acceleration, load the engine first:

import tensorrt as trt

G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.INFO)
engine = trt.utils.load_engine(G_LOGGER, 'model/Inception-7.trt')

Then you should create an execution context:

context = engine.create_execution_context()

Next, allocate buffers for the inputs and outputs. Note that cuda.mem_alloc expects sizes in bytes, so compute them from the input and output shapes.

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit

# mem_alloc expects a size in bytes, so size the device buffers
# from the host arrays rather than from the raw shapes
data = np.zeros(input_shape, dtype=np.float32)    # your preprocessed input
output = np.empty(output_shape, dtype=np.float32)
d_input = cuda.mem_alloc(data.nbytes)
d_output = cuda.mem_alloc(output.nbytes)
bindings = [int(d_input), int(d_output)]

Then feed the input data and run inference asynchronously:

stream = cuda.Stream()
cuda.memcpy_htod_async(d_input, data, stream)
context.enqueue(batch_size, bindings, stream.handle, None)
cuda.memcpy_dtoh_async(output, d_output, stream)
stream.synchronize()

If you’d like synchronous inference:

cuda.memcpy_htod(d_input, data)
context.execute(batch_size, bindings)
cuda.memcpy_dtoh(output, d_output)

In my experiments, TensorRT is not as fast as the original MXNet model; it seems that TensorRT even slows down inference, as shown in the following figure. The deeper reasons are still unclear at present.

TensorRT vs MXNet
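
For anyone who wants to reproduce such a comparison, here is a minimal wall-clock timing sketch that reuses the synchronous path and the buffers created above; the warm-up and iteration counts are arbitrary choices of mine:

import time

def average_latency_ms(fn, warmup=10, iters=100):
    # Simple wall-clock timing helper (not part of TensorRT)
    for _ in range(warmup):
        fn()
    start = time.time()
    for _ in range(iters):
        fn()
    return (time.time() - start) / iters * 1000.0

def trt_infer():
    # Synchronous TensorRT inference, reusing the context, bindings
    # and buffers created in the snippets above
    cuda.memcpy_htod(d_input, data)
    context.execute(batch_size, bindings)
    cuda.memcpy_dtoh(output, d_output)

print("TensorRT latency: %.2f ms" % average_latency_ms(trt_infer))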

TensorRT Integration: The Future Direction

Nowadays, NVIDIA developers are actively working to get the TensorRT runtime integration merged into the MXNet master branch. If you are interested, you can watch this PR to follow up. I have managed to make a bare-metal build succeed, but met several unexpected bugs during the unit tests. TensorRT integration will certainly be the future direction, but it is not mature yet.

written on 26-07-2018 by Faldict