TF Lite optimization - deep-learning

I tried to convert my PyTorch models to TensorFlow Lite via ONNX, but inference with TensorFlow Lite is twice as slow as with TensorFlow and PyTorch. I run the TensorFlow Lite model in Google Colab, and this is my first time using TensorFlow Lite.
Here is my code to convert from Tensorflow to TensorFlow Lite:
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
model_lite = converter.convert()
with open('model.tflite', 'wb') as f:
    f.write(model_lite)
Any suggestions will help me a lot.

TensorFlow Lite models are meant to run fast on mobile and embedded devices, so you have to measure the inference time on the target device, for example an Android phone. A Colab notebook will not give you a representative time.
You can also use the TensorFlow Lite benchmark tool to measure steady-state inference time.
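To illustrate what "steady state" means here, the following is a minimal sketch of a timing harness that discards warm-up runs before measuring. The names benchmark, warmup and runs are hypothetical, and fake_invoke is only a stand-in for whatever call runs your model (for a TensorFlow Lite interpreter that would be its invoke method). It is not a replacement for the benchmark tool.

```python
import statistics
import time


def benchmark(invoke, warmup=10, runs=50):
    """Time a zero-argument callable, ignoring the first `warmup` calls."""
    # Warm-up calls absorb one-time costs (allocation, caches, lazy init).
    for _ in range(warmup):
        invoke()

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        invoke()
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "runs": len(timings_ms),
    }


def fake_invoke():
    # Stand-in workload; replace with the real inference call.
    sum(i * i for i in range(1000))


print(benchmark(fake_invoke))
```

Reporting the median alongside the mean helps when a few runs are disturbed by other processes on the machine.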

If you would like to run inference on a PC or Google Colab, I'd recommend OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It improves inference performance through techniques such as graph pruning and fusing some operations. Here are the performance benchmarks for PyTorch models, among others.
You can find a full tutorial on how to convert the PyTorch model here. Some snippets below.
Install OpenVINO
The easiest way to do it is with pip. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[pytorch,onnx]
Save your model to ONNX
OpenVINO cannot convert the PyTorch model directly for now but it can do it with the ONNX model. This sample code assumes the model is for computer vision.
dummy_input = torch.randn(1, 3, IMAGE_HEIGHT, IMAGE_WIDTH)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
Use Model Optimizer to convert ONNX model
The Model Optimizer is a command-line tool that ships with the OpenVINO Development Package, so be sure you have installed it. It converts the ONNX model to the OV format (aka IR), which is the default format for OpenVINO. It also changes the precision to FP16 (to further increase performance). Run in a command line:
mo --input_model "model.onnx" --input_shape "[1, 3, 224, 224]" --mean_values="[123.675, 116.28 , 103.53]" --scale_values="[58.395, 57.12 , 57.375]" --data_type FP16 --output_dir "model_ir"
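The --mean_values and --scale_values numbers above look like the commonly used ImageNet per-channel mean and standard deviation rescaled from the 0-1 range to 0-255. Assuming that is where they come from, and assuming the usual (pixel - mean) / scale normalization, the arithmetic can be checked with a short sketch. The function names are hypothetical, and your own model may need different values.

```python
# Commonly used ImageNet statistics for inputs scaled to [0, 1] (RGB order).
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def to_255_range(values):
    """Rescale per-channel statistics from the [0, 1] range to [0, 255]."""
    return [round(v * 255, 3) for v in values]


def normalize_pixel(pixel, mean, scale):
    """Apply (value - mean) / scale to each channel of one RGB pixel."""
    return [(p - m) / s for p, m, s in zip(pixel, mean, scale)]


mean_values = to_255_range(IMAGENET_MEAN)
scale_values = to_255_range(IMAGENET_STD)
print(mean_values)
print(scale_values)
# A pixel equal to the mean normalizes to zero in every channel.
print(normalize_pixel(mean_values, mean_values, scale_values))
```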
Run the inference on the CPU
The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (integrated into your CPU, like Intel HD Graphics). If you are not sure which device is the best choice for you, just use AUTO.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
It's worth mentioning that the Runtime can process the ONNX model directly. In that case, just skip the conversion (Model Optimizer) step and pass the ONNX file path to the read_model function.
Disclaimer: I work on OpenVINO.

Related

AssertionError: If capturable=False, state_steps should not be CUDA tensors

I get this error while loading the model weights of a previous epoch on Google Colab. I'm using PyTorch version 1.12.0. I can't downgrade to a lower version because external libraries that I'm using require PyTorch 1.12.0.
Thanks!
It seems related to a newly introduced parameter (capturable) for the Adam and AdamW optimizers. There are currently two workarounds:
1. Force capturable = True after loading the checkpoint: optim.param_groups[0]['capturable'] = True. This seems to slow down model training by approx. 10% (YMMV depending on the setup).
2. Revert PyTorch to a previous version (could be 1.11.0).
Source: https://github.com/pytorch/pytorch/issues/80809#issuecomment-1173481031
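The first workaround only touches param_groups[0]. As a sketch, assuming the optimizer has several parameter groups and that each group is a plain dict (as optimizer.param_groups is in PyTorch), the flag could be set on all of them. The helper name is hypothetical, and the example uses plain dicts in place of a real optimizer so that it runs without PyTorch.

```python
def force_capturable(param_groups):
    """Set 'capturable' to True on every parameter group, in place."""
    for group in param_groups:
        group["capturable"] = True
    return param_groups


# Stand-in for optimizer.param_groups after optimizer.load_state_dict(...).
param_groups = [
    {"lr": 1e-3, "capturable": False},
    {"lr": 1e-4, "capturable": False},
]
force_capturable(param_groups)
print([g["capturable"] for g in param_groups])
```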
If you are using PyTorch 1.12.0 with the CUDA 11.6/11.7 binaries, run the following in your shell or command prompt:
pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116
The Adam optimizer regression was fixed in this updated torch version.
Edit
There is a newer torch version; use this to install it:
pip install torch==1.13.0+cu117 torchvision torchaudio torchtext --extra-index-url https://download.pytorch.org/whl/cu117
Can you tell me which optimizer you are using? I have encountered this with the AdamW optimizer. You can avoid it by loading the optimizer with load_state_dict and then explicitly mapping it to the CPU using the .cpu() function.

Number of parameters and FLOPS in ONNX and TensorRT model

Do the number of parameters and FLOPS (floating-point operations per second) change when converting a model from PyTorch to ONNX or TensorRT format?
I don't think Anvar's post answered OP's question thoroughly, so I did a little bit of research. Here is some general information before the answers to the questions, as I believe OP hasn't fully understood which TensorRT and ONNX optimizations happen during conversion from the PyTorch format.
Both conversions, PyTorch to ONNX and ONNX to TensorRT, increase the performance of the model by applying several different optimizations. The tools print information about what they do if you enable their verbose flag.
The preferred way to convert a PyTorch model to TensorRT is to use Torch-TensorRT, as explained here.
TensorRT fuses layers and tensors in the model graph, then uses a large kernel library to select the implementations that perform best on the target GPU.
ONNX Runtime offers mostly graph optimizations, such as graph simplifications and node fusions, to improve performance.
1. Does the number of parameters change when converting a PyTorch model to ONNX or TensorRT?
No: even though the layers are fused, the number of parameters does not decrease unless there are some redundant branches in the model.
I tested this by downloading the yolov5s.onnx model here. The original model has 7.2M parameters according to the repository authors. Then I used this tool to count the number of parameters in the yolov5s.onnx model and got 7225917 as a result. Thus, ONNX conversion did not reduce the number of parameters.
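Counting parameters comes down to multiplying out each weight tensor's shape and summing the results. Here is a minimal sketch with made-up shapes. The function name is hypothetical, and the comment about where the shapes could come from in an ONNX file is an assumption about how such a count might be done, not a description of the tool used above.

```python
import math


def count_parameters(shapes):
    """Sum the element counts of tensors described by their shapes."""
    return sum(math.prod(shape) for shape in shapes)


# Hypothetical weight shapes: a 3x3 conv (16 out, 3 in), its bias,
# and a fully connected layer with its bias.
shapes = [(16, 3, 3, 3), (16,), (10, 16), (10,)]
print(count_parameters(shapes))  # 432 + 16 + 160 + 10 = 618

# Assumption: with the onnx package, the shapes could be collected as
# [tuple(init.dims) for init in onnx.load(path).graph.initializer].
```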
I was not able to get equally detailed information for the TensorRT model, but you can get layer information using trtexec. There is a recent question about this, but there are no answers yet.
2. Does the number of FLOPS change when converting a PyTorch model to ONNX or TensorRT?
According to this post, no.
I know that in some newer versions of PyTorch (I used 1.8 and it worked for me), batch norm layers and convolutions are sometimes fused while saving the model. I'm not sure about ONNX, but TensorRT actively uses horizontal and vertical fusion of different layers, so the final model would be computationally cheaper than the model you initialized.
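As an illustration of why this kind of fusion makes a model cheaper without changing its output, here is a minimal sketch of folding a batch norm layer into the preceding layer, reduced to a single scalar channel for readability. All names and numbers are made up; real frameworks do this per output channel on tensors, and this is not the code any of the tools above actually run.

```python
import math


def conv_then_bn(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Unfused path: a scalar 'convolution' followed by batch norm."""
    y = w * x + b
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch norm statistics into the weight and bias."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta


w, b = 0.8, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.3, 4.0

w_fused, b_fused = fold_bn(w, b, gamma, beta, mean, var)
x = 2.0
print(conv_then_bn(x, w, b, gamma, beta, mean, var))
print(w_fused * x + b_fused)  # same value up to rounding, fewer operations
```

The fused layer still has one weight and one bias, which fits the point above that fusion reduces work at inference time rather than the number of convolution parameters.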

How to improve model latency for deployment

Question:
How to improve model latency for web deployment without retraining the models? What is the checklist that I should mark to improve the model speed?
Context:
I have multiple models that process a video sequentially on one machine with one K80 GPU; each model takes around 5 mins to process a video that is 1 min long. What ideas and suggestions should I try to improve each model latency without changing the model architecture? How should I structure my thinking about this problem?
Sampling frames is the easiest technique if it fits your use case. Picking every 5th frame for inference will cut your inference time by ~5x (theoretically). The caveat is that for tasks like object tracking you will get reduced accuracy.
Converting from fp32 to fp16 might increase your inference speed.
Batch inference usually lowers inference time by a decent bit. Ref: https://github.com/ultralytics/yolov5/issues/1806#issuecomment-752834571
Multiprocess concurrent inference means spinning up more than one instance of the same model in separate processes and running inference in parallel. torch has a multiprocessing module, torch.multiprocessing. I haven't ever used this, but I assume the setup would be fairly significant and complex.
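The frame-sampling and batching ideas above can be sketched as follows. This is only an illustration on a list of integers standing in for decoded frames; the function names, the stride of 5 and the batch size of 4 are assumptions, and the model call itself is left out.

```python
def sample_frames(frames, stride=5):
    """Keep every `stride`-th frame, starting with the first one."""
    return frames[::stride]


def make_batches(items, batch_size=4):
    """Split a list into consecutive batches of at most `batch_size` items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


frames = list(range(60))          # stand-in for 60 decoded video frames
sampled = sample_frames(frames)   # 12 frames left to run inference on
batches = make_batches(sampled)   # 3 batches of 4 frames each
print(len(sampled), [len(b) for b in batches])
```

Each batch would then be stacked into one input tensor and passed to the model in a single call, instead of calling the model once per frame.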
The NVIDIA Tesla K80 is quite an old GPU (2014), so that's probably the reason why the processing time is so long. If your machine has a modern Intel CPU and/or iGPU, you could try OpenVINO. It's a heavily optimized toolkit for inference. Here are some performance benchmarks.
You can find a full tutorial on how to convert the PyTorch model here.
Some snippets below.
Install OpenVINO
The easiest way to do it is with pip. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[pytorch,onnx]
Save your model to ONNX
OpenVINO cannot convert a PyTorch model directly for now, but it can do it with an ONNX model. This sample code assumes the model is for computer vision.
dummy_input = torch.randn(1, 3, IMAGE_HEIGHT, IMAGE_WIDTH)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
Use Model Optimizer to convert ONNX model
The Model Optimizer is a command-line tool that ships with the OpenVINO Development Package, so be sure you have installed it. It converts the ONNX model to IR, which is the default format for OpenVINO. It also changes the precision to FP16 for even better performance (there shouldn't be much of an accuracy drop). Run in a command line:
mo --input_model "model.onnx" --input_shape "[1,3, 224, 224]" --mean_values="[123.675, 116.28 , 103.53]" --scale_values="[58.395, 57.12 , 57.375]" --data_type FP16 --output_dir "model_ir"
Run the inference on the CPU
The converted model can be loaded by the runtime and compiled for a specific device e.g. CPU. Use AUTO if you want to deploy on the best device you have.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

Is it possible to create and execute a deep learning model and do its prediction in C++?

Let's suppose that my chip doesn't support any API like Keras, TensorFlow or sklearn; however, I need to implement a deep learning model in Python.
Is it possible to train and test my model in Python and then call the best resulting model for prediction from C++?
Where must I save the resulting best model so that it can be called in the next steps? Must I save it on the chip? Do I need to install TensorFlow and Keras on my chip in this case?
TERMINOLOGY
You seem to be confused about terminology. Here's a somewhat simplified overview.
Your chip is the hardware (CPU or GPU), and will include circuitry to support its instruction set (move data to/from local memory, perform math and logic operations, etc.). A CPU/GPU chip that cannot support your ML software is hard to visualize, and would not support Python or C++, either. The chip comes on a board, which includes a lot of peripheral connections, secondary memory, etc.
Then your operating system (basic software) is installed on the hardware. This OS manages resources: jobs, processes, memory allocation, etc. If there's a failure in support, it would be here, not in the chip. Finally, you install your desired applications (software tools, programs, etc.) as additions to the OS.
C++ and Python are two high-level languages, and both are popular applications. These languages support TensorFlow and Keras (machine learning frameworks) and scikit-learn (a scientific / statistical package; sklearn is the package name you import).
DIRECT ANSWER
Yes, you can write your NN in Python. Yes, you can call it from C++. Python depends on C/C++ libraries; there is a viable interface between the two.
There is no particular method you must use to save your model and call it later: if you're writing your own model in Python, you get to decide the storage format and location. All you need is to have your Python and C++ programs "agree" on the format. Since you're writing them both, then you can choose whatever works for you.
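As one possible illustration of such an agreed format, here is a minimal sketch that writes a flat list of weights as a little-endian count followed by 32-bit floats, a layout a C++ program could read back with plain file I/O. The layout and the function names are made up for this example; they are not a standard.

```python
import os
import struct
import tempfile


def save_weights(path, weights):
    """Write a uint32 count followed by that many little-endian float32s."""
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(weights)))
        f.write(struct.pack(f"<{len(weights)}f", *weights))


def load_weights(path):
    """Read back the layout written by save_weights."""
    with open(path, "rb") as f:
        (count,) = struct.unpack("<I", f.read(4))
        return list(struct.unpack(f"<{count}f", f.read(4 * count)))


path = os.path.join(tempfile.mkdtemp(), "weights.bin")
save_weights(path, [0.5, -1.25, 3.0])
print(load_weights(path))
```

On the C++ side, the same file would be read as one 4-byte unsigned integer followed by that many 4-byte floats, assuming a little-endian target.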
RECOMMENDATION
Don't write these yourself, unless you really want the exercise. Instead, install a framework (TensorFlow, Caffe, Neon, Torch, MXNet, Keras, ...). Then, simply follow the given tutorials to learn how to build, save, and restore your model.

Any code to convert tensorflow model to caffe model?

My end goal is to use TensorRT to optimize my model for deployment. I am doing my experiments in TensorFlow, but TensorRT requires the input model in caffemodel format.
As of TensorRT 2.1, there is no support for converting TensorFlow models. This might be supported in the future.
You might want to explore options to convert the TensorFlow model to Caffe and then convert that into an inference engine using TensorRT.
The newer TensorRT 3 added support for TensorFlow.