How to serve a deep learning model without a GPU - deep-learning

To save costs, I'm running a deep learning model on a regular CPU. It takes 10 seconds to finish a request, and the code is written in Python.
I'm thinking about improving performance by using Java, C++, or Rust. Is there any existing Rust framework for serving a deep learning model?

Is there any existing Rust framework for serving a deep learning model?
I am not familiar with Rust frameworks, but if you are running your model on an Intel CPU, I would suggest exporting the model to ONNX and running it with MXNet using the Intel MKL-DNN backend. This should give you the best performance, as it uses the Intel MKL-DNN and Intel MKL libraries. You can use C++ or Python.
Install MXNet with MKL-DNN:
https://mxnet.apache.org/versions/1.6/api/python/docs/tutorials/performance/backend/mkldnn/mkldnn_readme.html
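For illustration, here is a minimal sketch of that route once the model has been exported to ONNX; the file name, input name ("input_0"), and input shape are placeholders, not details from the original question, and the input name must match your ONNX graph:
import mxnet as mx
import numpy as np
from mxnet.contrib import onnx as onnx_mxnet

# Import the exported ONNX model into MXNet
sym, arg_params, aux_params = onnx_mxnet.import_model("model.onnx")

# Bind on CPU; MKL-DNN-optimized operators are used automatically on supported builds
mod = mx.mod.Module(symbol=sym, data_names=["input_0"], label_names=None, context=mx.cpu())
mod.bind(for_training=False, data_shapes=[("input_0", (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params, allow_missing=True)

# Run one forward pass on a dummy input
batch = mx.io.DataBatch([mx.nd.array(np.random.rand(1, 3, 224, 224).astype("float32"))])
mod.forward(batch)
print(mod.get_outputs()[0].shape)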

TensorFlow's performance-critical parts are written in C++, so using another language won't cause a drastic performance difference. You may quantize your network or apply network pruning to speed it up.
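As an illustration of the quantization route, here is a minimal sketch of post-training dynamic-range quantization with TensorFlow Lite; "saved_model_dir" is a placeholder for your own exported SavedModel:
import tensorflow as tf

# Convert a SavedModel to TensorFlow Lite with dynamic-range quantization
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Write the quantized model to disk; serve it with the TFLite interpreter
with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)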

Related

How to improve model latency for deployment

Question:
How can I improve model latency for web deployment without retraining the models? What checklist should I go through to improve model speed?
Context:
I have multiple models that process a video sequentially on one machine with one K80 GPU; each model takes around 5 minutes to process a video that is 1 minute long. What ideas and suggestions should I try to improve each model's latency without changing the model architecture? How should I structure my thinking about this problem?
Sampling frames is the easiest technique if it fits your use case. Picking every 5th frame for inference will cut your inference time by roughly 5x (theoretically). The caveat is that on tasks like object tracking you will lose some accuracy.
Converting the model from FP32 to FP16 might increase your inference speed (a sketch combining this with frame sampling follows this list).
Batch inference usually lowers inference time by a decent amount. Ref: https://github.com/ultralytics/yolov5/issues/1806#issuecomment-752834571
Multiprocess concurrent inference basically means spinning up more than one instance of the same model in separate processes and running inference in parallel. PyTorch has a multiprocessing module, torch.multiprocessing. I haven't used it myself, but I assume the setup would be somewhat involved.
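A minimal sketch of the first two ideas combined, assuming a PyTorch vision model on a CUDA device; resnet18 and the random "video" tensor are stand-ins for your actual models and frames:
import torch
from torchvision.models import resnet18

model = resnet18(pretrained=True).eval().half().cuda()   # FP32 -> FP16 weights on the GPU

frames = torch.rand(300, 3, 224, 224)      # e.g. a 10 s clip at 30 fps
sampled = frames[::5]                      # keep every 5th frame -> 60 frames

with torch.no_grad():
    outputs = model(sampled.half().cuda()) # one batched FP16 forward pass
print(outputs.shape)                       # torch.Size([60, 1000])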
Nvidia Tesla K80 is quite an old GPU (2014), so that's probably the reason why the processing time is so long. If your machine has a modern Intel CPU and/or iGPU, you could try OpenVINO. It's a heavily optimized toolkit for inference. Here are some performance benchmarks.
You can find a full tutorial on how to convert the PyTorch model here.
Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[pytorch,onnx]
Save your model to ONNX
OpenVINO cannot convert a PyTorch model directly for now, but it can convert an ONNX model. The sample code below assumes the model is for computer vision.
import torch

dummy_input = torch.randn(1, 3, IMAGE_HEIGHT, IMAGE_WIDTH)              # dummy input with the model's expected shape
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)   # export the trained model to ONNX
Use Model Optimizer to convert ONNX model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package, so be sure you have installed it. It converts the ONNX model to IR, the default format for OpenVINO. It also changes the precision to FP16 for even better performance (there shouldn't be much of an accuracy drop). Run in the command line:
mo --input_model "model.onnx" --input_shape "[1,3, 224, 224]" --mean_values="[123.675, 116.28 , 103.53]" --scale_values="[58.395, 57.12 , 57.375]" --data_type FP16 --output_dir "model_ir"
Run the inference on the CPU
The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU. Use AUTO if you want to deploy on the best device you have.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
# Compile for a specific device ("CPU") or let "AUTO" pick the best one available
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

How to make Intel GPU available for processing through pytorch?

I'm using a laptop which has Intel Corporation HD Graphics 520.
Does anyone know how to set it up for deep learning, specifically PyTorch? I have seen that with an NVIDIA GPU I can install CUDA, but what do I do when I have an Intel GPU?
PyTorch doesn't support anything other than NVIDIA CUDA and, lately, AMD ROCm.
Intel's support for PyTorch mentioned in the other answers is exclusive to the Xeon line of processors, and it doesn't scale to GPUs either.
Intel's oneAPI (whose oneDNN library was formerly known as MKL-DNN), however, supports a wide range of hardware, including Intel's integrated graphics, but full support is not yet implemented in PyTorch as of 10/29/2020 (PyTorch 1.7).
But you still have other options; for inference there are a couple.
DirectML is one of them: you convert your model to ONNX and then use the DirectML execution provider to run it on the GPU (which in our case uses DirectX 12 and works only on Windows for now). A sketch of this route follows the list of options below.
Your other options are OpenVINO and TVM, both of which support multiple platforms, including Linux, Windows, and macOS.
OpenVINO and TVM consume ONNX models, so you first need to convert your model to the ONNX format and then use them.
Lately (as of 2023), IREE (Intermediate Representation Execution Environment), via torch-mlir in this case, can be used as well.
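A minimal sketch of the DirectML route mentioned above, assuming the model has already been exported to "model.onnx" and that the onnxruntime-directml package is installed; the input shape is a placeholder:
import numpy as np
import onnxruntime as ort

# Create a session that runs on the Intel GPU through DirectML (Windows only)
session = ort.InferenceSession("model.onnx", providers=["DmlExecutionProvider"])

input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)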
Intel provides optimized libraries for deep learning and machine learning if you are using one of their later processors. A starting point would be this post, which is about getting started with Intel optimization of PyTorch. They provide more information about this in their AI workshops.

Is it possible to TRAIN a neural network model with TensorFlow Lite or any other framework on smartphones?

Is it possible to TRAIN a neural network model with TensorFlow Lite or any other framework on smartphones?
Specifically in the context of federated learning?
You could check out Deeplearning4j, which supports Android integration and also on-device training. For a federated learning setup, you could implement the Federated Averaging algorithm yourself on a Java-based server, using the same DL4J framework as on the mobile clients.
That said, support for on-device training has been on the TensorFlow Lite roadmap for some time, so it is a matter of time until TFLite provides its own solution.
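For reference, a minimal, framework-agnostic sketch of the Federated Averaging step mentioned above: the server averages the clients' weights, weighted by each client's number of local samples. The nested-list weight layout is an assumption, not tied to DL4J or TFLite:
import numpy as np

def federated_average(client_weights, client_sizes):
    """Weighted average of per-layer weights from several clients."""
    total = sum(client_sizes)
    averaged = []
    for layer_idx in range(len(client_weights[0])):
        layer = sum(w[layer_idx] * (n / total)
                    for w, n in zip(client_weights, client_sizes))
        averaged.append(layer)
    return averaged

# Example: two clients with different amounts of local data
client_1 = [np.ones((2, 2)), np.zeros(2)]
client_2 = [np.full((2, 2), 3.0), np.ones(2)]
new_global = federated_average([client_1, client_2], client_sizes=[100, 300])
print(new_global[0])   # closer to client 2's weights, since it holds more data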
TensorFlow Federated is a framework for machine learning and other computations on decentralized data (i.e. federated learning). TFF currently supports running research simulations for learning algorithms involving fleets of mobile devices, but does not currently provide a platform necessary to deploy such on-device training.

Is it possible to create and execute a deep learning model and do its prediction in C++?

Let's suppose that my chip doesn't support any API like Keras, TensorFlow, or sklearn; however, I need to implement a deep learning model in Python.
Is it possible to train and test my model in Python and then call the best model for prediction from C++?
Where must I save the resulting best model so that it can be called in the next steps? Must I save it on the chip? Do I need to install TensorFlow and Keras on the chip in this case?
TERMINOLOGY
You seem to be confused about terminology. Here's a somewhat simplified overview.
Your chip is the hardware (CPU or GPU), and will include circuitry to support its instruction set (move data to/from local memory, perform math and logic operations, etc.). A CPU/GPU chip that cannot support your ML software is hard to visualize, and would not support Python or C++, either. The chip comes on a board, which includes a lot of peripheral connections, secondary memory, etc.
Then your operating system (basic software) is installed on the hardware. This OS manages resources: jobs, processes, memory allocation, etc. If there's a failure in support, it would be here, not in the chip. Finally, you install your desired applications (software tools, programs, etc.) as additions to the OS.
C++ and Python are two high-level languages, themselves popular applications. These languages support TensorFlow and Keras (machine learning frameworks) and scikit-learn (a scientific/statistical package; sklearn is the package name you import).
DIRECT ANSWER
Yes, you can write your NN in Python. Yes, you can call it from C++. Python depends on C/C++ libraries; there is a viable interface between the two.
There is no particular method you must use to save your model and call it later: if you're writing your own model in Python, you get to decide the storage format and location. All you need is to have your Python and C++ programs "agree" on the format. Since you're writing them both, then you can choose whatever works for you.
RECOMMENDATION
Don't write these yourself, unless you really want the exercise. Instead, install a framework (TensorFlow, Caffe, Neon, Torch, MXNet, Keras, ...). Then, simply follow the given tutorials to learn how to build, save, and restore your model.
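For instance, here is a minimal sketch of the "build, save, restore" workflow with Keras, assuming TensorFlow 2.x; the tiny model and random data are placeholders, and the SavedModel directory it writes can also be loaded from TensorFlow's C++ API if you go that route:
import numpy as np
import tensorflow as tf

# Build and train a tiny model on random data
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(np.random.rand(64, 8), np.random.rand(64, 1), epochs=2, verbose=0)

# Save in the SavedModel format, then restore and predict
model.save("best_model")
restored = tf.keras.models.load_model("best_model")
print(restored.predict(np.random.rand(1, 8)))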

Performance differences between different CUDA SDK's?

If I want to rewrite my application so that it leverages the power of NVIDIA's CUDA SDK, are there any differences at all in runtime performance between the different SDK offerings: C++, Java, Python?
Is there any difference at all between these three SDKs, besides the obvious language being used?
There will be a measurable performance impact on the CPU-bound portions of your processing. For instance, if your CUDA data requires pre-processing before reaching the GPU, writing that numerical routine in Python would be suboptimal.
If your CUDA routines dominate the computation time (the CPU remains relatively idle), any of the bindings are a good choice.
It may be best to first prototype in a language such as Python and, if you identify a performance bottleneck, move that code to C++.
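As a small illustration of the GPU-dominated case, here is a hedged sketch using CuPy (one of several Python CUDA bindings, chosen here only as an example): the heavy matrix multiplication runs as a CUDA kernel, so the host language adds little overhead.
import cupy as cp

# The heavy lifting (a large matmul) runs entirely on the GPU;
# Python only launches the kernels and copies the result back.
a = cp.random.rand(4096, 4096, dtype=cp.float32)
b = cp.random.rand(4096, 4096, dtype=cp.float32)
c = a @ b
print(float(c.sum()))   # forces synchronization and copies a scalar to the host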