How to speed up YOLOv5 TFLite inference time in a phone app?

I am using a YOLOv5 model for custom object recognition, and when I export it to a TFLite model for inclusion in the mobile app, the resulting inference time is 5201.2 ms per image. How can I reduce the inference time for faster recognition? The dataset I use for training has 2200 images, and I train with the yolov5x model. Thanks for your help!

You have several options:
1. Train a smaller YOLO model (yolov5m instead of yolov5x, for example).
2. Resize the input images (from 640x640 to, for example, 320x320; note that the dimensions need to be a multiple of the maximum stride, which is 32).
3. Quantize the model to FP16 or INT8.
4. Use the NNAPI delegate (it only provides a speedup if the device has a hardware accelerator besides the CPU: a GPU, DSP or dedicated NN engine).
None of these options exclude each other; all can be used at the same time for maximum inference speed. Options 1, 2 and 3 sacrifice some detection accuracy for inference speed.
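For options 2 and 3, assuming you have already exported a TensorFlow SavedModel from YOLOv5 (the path below is a placeholder), a minimal post-training FP16 quantization sketch with the TFLite converter could look like this:

import tensorflow as tf

# Hypothetical path to a SavedModel exported from YOLOv5 (e.g. via its export script).
converter = tf.lite.TFLiteConverter.from_saved_model("yolov5_saved_model")

# Post-training FP16 quantization: roughly halves the model size and is usually
# faster on mobile GPUs; full INT8 quantization can give further gains on CPU/DSP.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]

tflite_model = converter.convert()
with open("yolov5-fp16.tflite", "wb") as f:
    f.write(tflite_model)

YOLOv5's own export script also offers TFLite export with quantization options, which may be more convenient than converting by hand; the NNAPI delegate is then enabled on the Android side when constructing the TFLite Interpreter.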

Related

How can you increase the accuracy of ResNet50?

I'm using the ResNet50 model to classify images into two classes: normal cells and cancer cells.
I want to increase the accuracy, but I don't know what to modify.
# We are using ResNet50 for transfer learning here, so we import it
from tensorflow.keras.applications import resnet50
from tensorflow.keras import layers, Model
from tensorflow.keras.optimizers import Adamax

img_shape = (224, 224, 3)  # example input shape; defined elsewhere in the original notebook
class_count = 2            # normal cells vs. cancer cells

# Initialize the base model with weights='imagenet', i.e. keep its pretrained weights
model_name = 'resnet50'
base_model = resnet50.ResNet50(include_top=False, weights="imagenet", input_shape=img_shape, pooling='max')
last_layer = base_model.output  # take the last layer of the base model

# Add a flatten layer: we extend the network by flattening the base model output
flatten = layers.Flatten()(last_layer)
# Add a dense layer
dense1 = layers.Dense(100, activation='relu')(flatten)
# Add the final output layer on top of the dense layer
output_layer = layers.Dense(class_count, activation='softmax')(dense1)
# Create the model from the input and output layers
model = Model(inputs=base_model.inputs, outputs=output_layer)
model.compile(Adamax(learning_rate=.001), loss='categorical_crossentropy', metrics=['accuracy'])
There were 48 errors in 534 test cases; model accuracy = 91.01%.
Also, what do you think about the results shown in the graph?
This is the classification report.
I got good results, but is there a way to increase the accuracy beyond that?
This is a broad question, as there are many ways one can attempt to improve the network's accuracy. Some of them are:
- Increase the dimension of the layers that are learned in transfer learning (make sure not to overfit)
- Use transfer learning with convolutional layers and not just an MLP head
- Let the optimization algorithm choose the learning rate on its own
- Play with additional augmentations to the dataset
and the list goes on.
Also, if possible, I would suggest comparing your results to other publicly available benchmarks; by doing so you might better understand the upper bound of the achievable accuracy.
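To make the augmentation suggestion concrete, here is a minimal Keras sketch (the specific layers and ranges are illustrative assumptions, not tuned for your data) that applies random augmentations to a batch of images before they reach the ResNet50 base model:

import tensorflow as tf
from tensorflow.keras import layers

# Illustrative augmentation pipeline (these layers are available in recent TF versions).
augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.RandomContrast(0.1),
])

# Applied to a dummy batch of images; in practice, place it in front of the base model.
images = tf.random.uniform((8, 224, 224, 3))
augmented = augmentation(images, training=True)
print(augmented.shape)  # (8, 224, 224, 3)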

1D Sequence Classification

I am working on a classification task with very long sequences (~60,000 timesteps) and a continuous input domain. The input has shape (B, L, C), where B is the batch size, L is the sequence length (i.e. the number of timesteps) and C is the number of features, each of which is continuous (i.e. values like 0.6, 0.2, 0.5, 1.3, etc.).
Since the sequence is very long, I can't directly apply an RNN or a Transformer encoder layer without exceeding memory limits. Some proposed methods use several CNN layers to "downsample" the sequence length before feeding it into an RNN model. A successful example of this is the CNN-LSTM model. By introducing several consecutive convolutional blocks followed by max pooling, it is possible to "downsample" the sequence length by a given factor. The downsampled sequence would then have a length of, for instance, 60 timesteps, which is much more manageable for an LSTM model.
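For concreteness, here is a minimal PyTorch sketch of the kind of CNN downsampling front-end I have in mind, feeding an LSTM (the channel sizes, kernel sizes and pooling factors are illustrative assumptions):

import torch
import torch.nn as nn

class CNNDownsampler(nn.Module):
    """Stack of Conv1d + MaxPool1d blocks that shrinks the sequence length."""
    def __init__(self, in_channels, hidden=64, n_blocks=5):
        super().__init__()
        blocks = []
        channels = in_channels
        for _ in range(n_blocks):
            blocks += [
                nn.Conv1d(channels, hidden, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.MaxPool1d(kernel_size=4),  # each block divides L by 4
            ]
            channels = hidden
        self.net = nn.Sequential(*blocks)

    def forward(self, x):            # x: (B, L, C)
        x = x.transpose(1, 2)        # Conv1d expects (B, C, L)
        x = self.net(x)
        return x.transpose(1, 2)     # back to (B, L', hidden)

# 60,000 timesteps with 8 features -> roughly 60,000 / 4**5, i.e. 58 downsampled steps
x = torch.randn(2, 60_000, 8)
downsampler = CNNDownsampler(in_channels=8)
lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
out, _ = lstm(downsampler(x))
print(out.shape)  # torch.Size([2, 58, 128])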
Does it make sense to directly substitute the LSTM model with a Transformer encoder? I have read that the transformer attention mechanism can complement the LSTM layers and be used in succession.
There also exist many variants of Transformers and other architectures designed for handling long sequences. Latest examples include Performer, Linformer, Reformer, Nyströmformer, BigBird, FNet, S4, CDIL-CNN. Does there exist a library similar to torchvision for implementing these models in pytorch without copy-pasting large amounts of code from the respective repositories?

Anomaly detection by pattern matching in a set of continuous data

I have a series of sensors (around 4k), and each sensor measures the amplitude at each point. Suppose I train a neural network with a sufficient set of 4k values (shape N * 4k). The network should find a pattern in the series of values; if the values stray away from the pattern (that is, an anomaly), it should detect the point and be able to say that the anomaly is in the X-th sensor. Is this possible? If so, what kind of neural network should I use?
Since you have time-series inputs, you can use sequential models like an RNN, LSTM or GRU, with a softmax layer at the end that outputs (normal/anomaly).
You can apply the same model (weights) 4k times to find which sensor is at fault.
Alternatively, the same sequential network can be trained with a multi-dimensional softmax (anomaly1/normal1 ... anomaly4k/normal4k).
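As a rough sketch of the first, binary variant (the window length, layer sizes and the use of Keras here are my own assumptions, not part of the approach above):

import tensorflow as tf
from tensorflow.keras import layers, models

# Hypothetical shapes: windows of T timesteps, each with 4000 sensor readings.
T, n_sensors = 128, 4000

model = models.Sequential([
    layers.Input(shape=(T, n_sensors)),
    layers.LSTM(64),                        # summarizes the whole window
    layers.Dense(2, activation='softmax'),  # normal vs. anomaly
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()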
But such networks won't work well when the data is imbalanced (anomalies are rare).
You can also try RPCA (robust PCA) for anomaly detection.

What is the purpose of the ROI layer in a Fast R-CNN?

In this tutorial about object detection, Fast R-CNN is mentioned. The ROI (region of interest) layer is also mentioned.
What is happening, mathematically, when region proposals get resized according to final convolution layer activation functions (in each cell)?
Region-of-Interest (RoI) Pooling:
It is a type of pooling layer which performs max pooling on inputs (here, convnet feature maps) of non-uniform sizes and produces a small feature map of fixed size (say 7x7). The choice of this fixed size is a network hyper-parameter and is predefined.
The main purpose of doing such a pooling is to speed up the training and test time and also to train the whole system from end-to-end (in a joint manner).
It is because of this pooling layer that the training and test time are faster compared to the original (vanilla) R-CNN architecture, hence the name Fast R-CNN.
Simple example (from Region of interest pooling explained by deepsense.io):
The ROI (region of interest) layer was introduced in Fast R-CNN and is a special case of the spatial pyramid pooling layer introduced in Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. The main function of the ROI layer is to reshape inputs of arbitrary size into a fixed-length output, because of the size constraint of the fully connected layers.
How the ROI layer works is shown below:
In this image, an input of arbitrary size is fed into this layer, which has 3 different windows: 4x4 (blue), 2x2 (green) and 1x1 (gray), producing outputs with fixed sizes of 16 x F, 4 x F and 1 x F, respectively, where F is the number of filters. Those outputs are then concatenated into a vector to be fed to the fully connected layer.
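As a hands-on illustration (not part of the original answer), torchvision ships an RoI pooling op; the feature map and proposal boxes below are made up:

import torch
from torchvision.ops import roi_pool

# Fake convnet feature map: batch of 1, 256 channels, 50x50 spatial grid.
features = torch.randn(1, 256, 50, 50)

# One region proposal per row: (batch_index, x1, y1, x2, y2) in feature-map coordinates.
rois = torch.tensor([[0, 10.0, 10.0, 40.0, 30.0],
                     [0,  0.0,  0.0, 25.0, 25.0]])

# Every proposal, whatever its size, is max-pooled to a fixed 7x7 grid.
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])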

Hard to understand Caffe MNIST example

After going through the Caffe tutorial here: http://caffe.berkeleyvision.org/gathered/examples/mnist.html
I am really confused about the different (and efficient) model used in this tutorial, which is defined here: https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_train_test.prototxt
As I understand it, a convolutional layer in Caffe simply calculates Wx + b for each input, without applying any activation function. If we would like to add an activation function, we should add another layer immediately after that convolutional layer, such as a Sigmoid, TanH, or ReLU layer. Every paper/tutorial I have read on the internet applies the activation function to the neuron units.
This leaves me with a big question mark, as we can only see convolutional layers and pooling layers interleaved in the model. I hope someone can give me an explanation.
As a side note, another doubt of mine is the max_iter in this solver:
https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_solver.prototxt
We have 60,000 images for training and 10,000 images for testing. So why is max_iter here only 10,000 (and the model can still get a > 99% accuracy rate)? What does Caffe do in each iteration?
Actually, I'm not so sure whether the accuracy rate is total correct predictions / test set size.
I'm very amazed by this example, as I haven't found any other example or framework that can achieve such a high accuracy rate in such a short time (only 5 minutes to get > 99% accuracy). Hence, I suspect there is something I have misunderstood.
Thanks.
Caffe uses batch processing. max_iter is 10,000 because the batch_size is 64. Number of epochs = (batch_size x max_iter) / number of training samples, so the number of epochs is nearly 10, as shown below. The accuracy is calculated on the test data. And yes, the accuracy of the model is indeed > 99%, as the dataset is not very complicated.
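A quick sanity check of that epoch count (the numbers are the ones quoted above):

# Rough epoch count implied by the solver settings.
batch_size = 64          # training batch size quoted above
max_iter = 10_000        # from lenet_solver.prototxt
train_samples = 60_000   # MNIST training set size

epochs = batch_size * max_iter / train_samples
print(f"{epochs:.2f} epochs")  # ~10.67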
For your question about the missing activation layers, you are correct. The model in the tutorial is missing activation layers. This seems to be an oversight of the tutorial. For the real LeNet-5 model, there should be activation functions following the convolution layers. For MNIST, the model still works surprisingly well without the additional activation layers.
For reference, LeCun's 1998 LeNet-5 paper states:
As in classical neural networks, units in layers up to F6 compute a dot product between their input vector and their weight vector, to which a bias is added. This weighted sum, denoted a_i, for unit i, is then passed through a sigmoid squashing function to produce the state of unit i ...
F6 is the "blob" between the two fully connected layers. Hence the first fully connected layer should have an activation function applied (the tutorial uses a ReLU activation instead of a sigmoid).
MNIST is the "hello world" example for neural networks. It is very simple by today's standards. A single fully connected layer can solve the problem with an accuracy of about 92%. LeNet-5 is a big improvement over that simple baseline.