TensorRT Summary
Rumor has it that NVIDIA TensorRT, a high-performance deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications, achieves the fastest speeds on CUDA platforms. To evaluate the availability, robustness, reliability and flexibility of TensorRT, I have delved into it for almost a month, dealing with troubles every day, and have written this document as a summary.
From MXNet to TensorRT: What, Why and How
Typically, our training code is written in MXNet, with symbols and parameters saved separately. To achieve faster inference, however, we should use TensorRT engines, whose network definitions are rather different. Converting MXNet modules into TensorRT engines is therefore necessary, and ONNX is the indispensable bridge.
ONNX is an open format for representing deep learning models. With ONNX, AI developers can more easily move models between state-of-the-art tools and choose the combination that works best for them. Recently, the ONNX importer and exporter were merged into the MXNet master branch. Before using them, you should download the latest MXNet sources from the master branch and build them following the official instructions. Then converting an MXNet model into an ONNX model is as simple as a single call:
from mxnet.contrib.onnx import export_model
filename = "model.onnx"
export_model(sym, params, input_shape, onnx_file_path=filename)
where sym refers to the model's symbol and params refers to the model's parameters; both can be passed either as a path (str) or as an object (a Symbol or a dict). You also need to know the input shapes and wrap the shape tuples in a list. onnx_file_path is optional and defaults to model.onnx.
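As a concrete illustration, here is a minimal sketch for the Inception-V3 checkpoint used later in this document; the file names, epoch number and input shape are assumptions you should replace with your own:
from mxnet.contrib.onnx import export_model
# Hypothetical checkpoint files; substitute your own prefix and epoch.
sym = 'Inception-7-symbol.json'
params = 'Inception-7-0000.params'
# One shape tuple per network input, wrapped in a list.
input_shape = [(1, 3, 299, 299)]
export_model(sym, params, input_shape, onnx_file_path='Inception-7.onnx')
You can then sanity-check the exported file with onnx.checker.check_model(onnx.load('Inception-7.onnx')) before converting it further.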
Then onnx-tensorrt, an open source project from the ONNX community, can convert ONNX models into serialized TensorRT engines with a command like this:
onnx2trt model.onnx -o engine.trt
onnx-tensorrt also provides a TensorRT backend which, in my experience, is not easy to use.
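For reference, the backend interface looks roughly like this (a sketch following the project's README; the model path, device string and input shape are placeholders):
import onnx
import onnx_tensorrt.backend as backend
import numpy as np
# Load the ONNX model and build a TensorRT engine behind the backend API.
model = onnx.load("model.onnx")
engine = backend.prepare(model, device='CUDA:0')
# Dummy input; replace the shape with your model's actual input shape.
input_data = np.random.random(size=(1, 3, 224, 224)).astype(np.float32)
output_data = engine.run(input_data)[0]
print(output_data.shape)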
For more usage details, you should consult the official documentation.
Install the prerequisites
In this subsection, I'll describe how to install the prerequisites: protobuf, tensorrt, onnx and onnx-tensorrt. Fortunately, there is a script that installs protobuf and tensorrt automatically, and you just need to run
chmod +x tensorrt.sh
./tensorrt.sh
You can install onnx either from PyPI
pip install onnx
or build it locally from source:
git clone --recursive https://github.com/onnx/onnx.git
cd onnx
python setup.py install
Finally, install onnx-tensorrt. To begin with, find the location of the TensorRT library, which is /usr/src/tensorrt in my case. Then run the following commands:
git clone --recursive https://github.com/onnx/onnx-tensorrt.git
cd onnx-tensorrt
mkdir build
cd build
cmake .. -DTENSORRT_ROOT=/usr/src/tensorrt
make -j8
sudo make install
Please make sure that the argument -DTENSORRT_ROOT is correct.
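As a quick sanity check after installation (assuming the TensorRT Python bindings were installed alongside TensorRT), the Python packages should import cleanly:
# Verify that the Python packages are importable.
import onnx
import tensorrt
print("onnx", onnx.__version__)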
Inference with TensorRT: Code Examples and Test Results
Suppose you have a serialized TensorRT engine named Inception-7.trt, converted from the Inception-V3 network in the mxnet-model-gallery. To use TensorRT for inference acceleration, load the engine first:
import tensorrt as trt
G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.INFO)
engine = trt.utils.load_engine(G_LOGGER, 'model/Inception-7.trt')
Then you should create an execution context:
context = engine.create_execution_context()
Next, allocate buffers for the inputs and outputs. Be careful to match the input and output shapes.
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # initializes the CUDA context
# mem_alloc expects a size in bytes, so allocate device memory from concrete host arrays
data = np.zeros(input_shape, dtype=np.float32)     # host input, e.g. (batch_size, 3, 299, 299)
output = np.empty(output_shape, dtype=np.float32)  # host output buffer
d_input = cuda.mem_alloc(data.nbytes)
d_output = cuda.mem_alloc(output.nbytes)
bindings = [int(d_input), int(d_output)]
Then feed the input data and execute the asynchronous inference:
stream = cuda.Stream()
cuda.memcpy_htod_async(d_input, data, stream)
context.enqueue(batch_size, bindings, stream.handle, None)
cuda.memcpy_dtoh_async(output, d_output, stream)
stream.synchronize()
If you’d like synchronous inference:
cuda.memcpy_htod(d_input, data)
context.execute(batch_size, bindings)
cuda.memcpy_dtoh(output, d_output)
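To get a rough latency number for comparison with MXNet, you can time repeated runs. This is only a simple wall-clock sketch (no warm-up or profiler), not the exact benchmark behind the figure below:
import time
n_runs = 100
start = time.time()
for _ in range(n_runs):
    cuda.memcpy_htod_async(d_input, data, stream)
    context.enqueue(batch_size, bindings, stream.handle, None)
    cuda.memcpy_dtoh_async(output, d_output, stream)
    stream.synchronize()
elapsed_ms = (time.time() - start) / n_runs * 1000
print("average latency: %.3f ms" % elapsed_ms)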
In my experiments, TensorRT was not as fast as the original MXNet model; it even seemed to slow inference down, as shown in the following figure. The deeper reasons are not yet clear.
TensorRT-Integration: the Future Direction
Nowadays, NVIDIA developers are actively working to get TensorRT runtime integration merged into the MXNet master branch. If you are interested, you can watch this PR to follow the progress. I have managed to build it successfully, but ran into several unexpected bugs during the unit tests. TensorRT integration will certainly be the future direction, but it is not yet mature.
written on 26-07-2018 by Faldict