How can I specify a particular dataflow for inference using TFLite and an Edge TPU?

I have a TFLite model deployed to a Raspberry Pi. I'm using a Coral USB Accelerator to speed up inference, which contains an Edge TPU. I'm interested in experimenting with the impact of using different dataflows on the energy efficiency of this deployment.
Does anyone know how I could specify a particular dataflow, such as row-stationary or output-stationary, when accelerating a TFLite model using an Edge TPU?
For reference: https://people.csail.mit.edu/emer/papers/2017.05.ieee_micro.dnn_dataflow.pdf

According to the Coral support team:
"As far as we know, there is not a specific way available to run the model with a particular data-flow.
It is recommended to visit the model compatibility page at https://coral.ai/docs/edgetpu/models-intro/ to understand how a TFLite model is mapped to the EdgeTPU."

Related

Simulation of a deep self-learning neural network on a robot

Do you know of any ready-made platforms/software/code that I can use to run a simulation of a self-learning deep neural network on a robotic platform, while being able to tweak the code to change specific parameters?
I have found scientific documentation and platforms like Webots, but for my project I need a ready solution in which I can change some parameters and measure their effect on the number of generations the algorithm needs to train the robot.

How do I train a deep learning model across multiple virtual machines using PyTorch?

I'm trying to train a model by purchasing multiple virtual machines from a cloud service provider (probably 3-4 instances on AWS). I would like to load the model onto each VM, run the training process, then update the models on each VM. The problem is that once each model has made its forward pass, I don't know how to accumulate the gradients of each model, and if I could, I'm not sure whether I should sum the gradients or average them. I've been using DataParallel on single multi-GPU VMs, so I haven't had to keep track of multiple gradients before this point. I'm unsure if there is a PyTorch package that could help with GPU data parallelism across multiple VMs. I've seen PyTorch Lightning, but it isn't clear which modules to use to communicate between VMs. I'm very new to machine learning, and this is my first training process beyond a single machine. Any advice and tips would be appreciated, including packages or architectural ideas.
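For reference, a minimal sketch (not from the original post) of the usual approach with torch.distributed and DistributedDataParallel: DDP averages gradients across processes during backward(), so every VM applies the same update. MyModel, loader, and the env:// rendezvous are placeholders/assumptions, not part of the question.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are set in the environment on each VM.
dist.init_process_group(backend="nccl", init_method="env://")
model = MyModel().cuda()            # MyModel is a placeholder for your network
ddp_model = DDP(model)              # gradients are all-reduced (averaged) during backward()
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for inputs, targets in loader:      # loader is a placeholder DataLoader (ideally with a DistributedSampler)
    optimizer.zero_grad()
    loss = loss_fn(ddp_model(inputs.cuda()), targets.cuda())
    loss.backward()                 # each VM ends up with the averaged gradient
    optimizer.step()                # all replicas apply the same update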

How to improve model latency for deployment

Question:
How can I improve model latency for web deployment without retraining the models? What checklist should I go through to improve model speed?
Context:
I have multiple models that process a video sequentially on one machine with one K80 GPU; each model takes around 5 minutes to process a video that is 1 minute long. What ideas and suggestions should I try to improve each model's latency without changing the model architecture? How should I structure my thinking about this problem?
Sampling frames is the easiest technique if it fits your use case. Picking every 5th frame for inference will cut your inference time by roughly 5x (theoretically). The caveat is that for tasks like object tracking this will reduce accuracy. A short OpenCV sketch of this idea follows these suggestions.
Converting from FP32 to FP16 might increase your inference speed.
Batched inference usually lowers per-frame inference time noticeably. Ref: https://github.com/ultralytics/yolov5/issues/1806#issuecomment-752834571
Multiprocess concurrent inference means spinning up more than one instance of the same model in separate processes and running inference in parallel. PyTorch has a multiprocessing module, torch.multiprocessing. I haven't used it myself, but I assume the setup would be somewhat involved and complex.
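A minimal sketch of the frame-sampling idea with OpenCV; run_model and the video path are placeholders, not part of the original answer.

import cv2

SAMPLE_EVERY = 5                          # run the model on every 5th frame only
cap = cv2.VideoCapture("input_video.mp4")
frame_idx = 0
results = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % SAMPLE_EVERY == 0:
        results.append(run_model(frame))  # run_model stands in for your own inference call
    frame_idx += 1
cap.release()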
The NVIDIA Tesla K80 is quite an old GPU (2014), so that's probably why the processing time is so long. If your machine has a modern Intel CPU and/or iGPU, you could try OpenVINO, a heavily optimized toolkit for inference; performance benchmarks are available on the OpenVINO site.
A full tutorial on how to convert a PyTorch model is available in the OpenVINO documentation.
Some snippets below.
Install OpenVINO
The easiest way is to use pip. Alternatively, OpenVINO's installation selector tool can help you find the best option for your case.
pip install openvino-dev[pytorch,onnx]
Save your model to ONNX
OpenVINO cannot convert a PyTorch model directly for now, but it can convert an ONNX model. This sample code assumes the model is for computer vision.
import torch  # assumes "model", IMAGE_HEIGHT and IMAGE_WIDTH are already defined
dummy_input = torch.randn(1, 3, IMAGE_HEIGHT, IMAGE_WIDTH)  # dummy input used to trace the model
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
Use Model Optimizer to convert ONNX model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package, so be sure you have installed it. It converts the ONNX model to IR, the default format for OpenVINO. It also changes the precision to FP16 for even better performance (there shouldn't be much of an accuracy drop). Run in the command line:
mo --input_model "model.onnx" --input_shape "[1,3,224,224]" --mean_values="[123.675,116.28,103.53]" --scale_values="[58.395,57.12,57.375]" --data_type FP16 --output_dir "model_ir"
Run the inference on the CPU
The converted model can be loaded by the runtime and compiled for a specific device, e.g. the CPU. Use AUTO if you want to deploy on the best device available.
from openvino.runtime import Core  # OpenVINO Runtime API

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO")
# Get the output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image (a preprocessed NumPy array)
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

PyTorch static quantization: different training (calibration) and inference backends

Can we use a different CPU architecture (and backend) for training (calibration) and inference of a quantized PyTorch model?
The only post on this subject that I've found states:
"static quantization must be performed on a machine with the same architecture as your deployment target. If you are using FBGEMM, you must perform the calibration pass on an x86 CPU; if you are using QNNPACK, calibration needs to happen on an ARM CPU"
But there is nothing about this in the official documentation.
The information in the link you posted is correct. You should use the same backend in both cases. This is also mentioned in the official documentation:
"When preparing a quantized model, it is necessary to ensure that qconfig and the engine used for quantized computations match the backend on which the model will be executed."
You can find it here:
https://pytorch.org/docs/stable/quantization.html
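As an illustration (not from the linked answer), a minimal static-quantization sketch where the qconfig and the quantized engine are kept consistent with the deployment target; MyModel and calibration_loader are placeholders.

import torch

backend = "fbgemm"                    # 'fbgemm' for x86 deployment, 'qnnpack' for ARM
torch.backends.quantized.engine = backend
model_fp32 = MyModel().eval()         # MyModel is a placeholder for your float model
model_fp32.qconfig = torch.quantization.get_default_qconfig(backend)
model_prepared = torch.quantization.prepare(model_fp32)

# Calibration pass: run representative data through the prepared model.
with torch.no_grad():
    for images, _ in calibration_loader:
        model_prepared(images)

model_int8 = torch.quantization.convert(model_prepared)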

SageMaker model evaluation

The Amazon documentation lists several approaches to evaluate a model (e.g. cross-validation), but these methods do not seem to be available in the SageMaker Java SDK.
Currently, if we want to do 5-fold cross-validation, it seems the only option is to create 5 models (and also deploy 5 endpoints), one model for each subset of the data, and manually compute the performance metrics (recall, precision, etc.).
This approach is not very efficient and can also be expensive, since it requires deploying k endpoints, one per fold of the k-fold validation.
Is there another way to test the performance of a model?
Amazon SageMaker is a set of multiple components that you can choose which ones to use.
The built-in algorithms are designed for (infinite) scale, which means that you can have huge datasets and be able to build a model with them quickly and with low cost. Once you have large datasets you usually don't need to use techniques such as cross-validation, and the recommendation is to have a clear split between training data and validation data. Each of these parts will be defined with an input channel when you are submitting a training job.
If you have a small amount of data and want to train on all of it, using cross-validation to make the most of it, you can use a different part of the service (the interactive notebook instances). You can bring your own algorithm or even container image to be used in development, training or hosting. You can have any Python code based on any machine learning library or framework, including scikit-learn, R, TensorFlow, MXNet etc. In your code, you can define cross-validation based on the training data that you copy from S3 to the worker instances.
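For illustration, a minimal cross-validation sketch of the kind you could run in a notebook instance with scikit-learn; the random data and LogisticRegression model stand in for the training set copied from S3 and your own algorithm.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)  # placeholder data
precisions, recalls = [], []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    precisions.append(precision_score(y[test_idx], preds))
    recalls.append(recall_score(y[test_idx], preds))
print(f"precision: {np.mean(precisions):.3f}, recall: {np.mean(recalls):.3f}")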