Questions about the processing of CRNN - deep-learning

I'm studying the CRNN (Convolutional Recurrent Neural Network) and plan to use it for image classification or gesture recognition. I have some questions about the model.
Suppose there is a video consisting of 20 frames, each a 7x7 image. What is the difference between tiling the 20 frames into a single 7x140 image and training a CNN on it, and training a CRNN on the 20 separate 7x7 frames?
Many studies use the CRNN model for real-time gesture recognition. Assume that one gesture is learned from 20 images. When using the model in real time, should I collect 20 images and feed them in all at once, or should I feed in each image as soon as it arrives?
Much appreciated for your time
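To make the two options concrete, here is a minimal sketch of the input layouts being compared, using plain Python lists. The clip contents are made up (frame t is filled with the value t), and the rolling buffer with its `push_frame` helper is only one possible way to feed frames in real time. It is an assumption for illustration, not something the question or any particular CRNN paper specifies.

```python
from collections import deque

T, H, W = 20, 7, 7  # 20 frames of 7x7, as in the question

# Hypothetical clip: frame t is filled with the value t so layouts are easy to inspect.
clip = [[[t] * W for _ in range(H)] for t in range(T)]

# Layout A (plain CNN): tile the frames side by side into one 7x140 image.
# Time becomes just another spatial axis of a single image.
tiled = [[clip[t][r][c] for t in range(T) for c in range(W)] for r in range(H)]

# Layout B (CRNN): keep the clip as an ordered sequence of 20 separate 7x7 frames.
# A CNN would encode each frame and an RNN would consume the encodings in order.
sequence = clip

# Real-time use (assumed design): a rolling buffer of the most recent T frames.
window = deque(maxlen=T)

def push_frame(frame):
    """Add one new frame; return a full T-frame window, or None while still filling."""
    window.append(frame)
    return list(window) if len(window) == T else None
```

With this buffer, each new frame can be pushed as soon as it arrives, and a full 20-frame window becomes available at every step once the first 20 frames have been seen.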

Related

Image Classification on heavy occluded and background camouflage

I am working on an image classification project to classify various species of bamboo.
The problems on Kaggle mostly come with well-labeled pictures, each showing a single, clearly framed subject.
But the issue with bamboo is that it appears in clusters in most images, sometimes with more than one species. Heavy occlusion and background camouflage are also prevalent.
Besides, there is not much training data available for this problem.
So I have been building my own dataset by collecting data from the internet and taking pictures with my DSLR.
My first approach was to use a weighted Mask RCNN for instance segmentation and then classify the segmented instances using VGGNet and GoogleNet.
My next approach is to test Attention UNet, YOLO v3 and a new paper BCNet from ICLR 2021.
I would then classify with ResNext, GoogleNet and SENet and compare the results.
Any tips or better approaches are much appreciated.

Pretrained model or training from scratch for object detection?

I have a dataset composed of 10k-15k pictures for supervised object detection which is very different from Imagenet or Coco (pictures are much darker and represent completely different things, industrial related).
The model currently used is a FasterRCNN which extracts features with a Resnet used as a backbone.
Would it be beneficial to train the backbone of the model from scratch in one stage and then train the whole network in a second stage, instead of loading the network pretrained on Coco and then retraining all the layers of the whole network in a single stage?
From my experience, here are some important points:
your training set is not big enough to train the detector from scratch (though this depends on the network configuration; fasterrcnn+resnet18 can work). It is better to use a network pre-trained on ImageNet;
the domain the network was pre-trained on is not really that important. The network, especially a big one, needs to learn all those arcs, circles, and other primitive shapes in order to use that knowledge for detecting more complex objects;
the brightness of your training images can be important, but it is not something that should stop you from using a pre-trained network;
training from scratch requires many more epochs and much more data. The longer the training, the more sophisticated your LR control algorithm should be. At a minimum, the LR should not be constant; it should change based on the cumulative loss. The initial settings depend on multiple factors, such as network size, augmentations, and the number of epochs;
I experimented a lot with fasterrcnn+resnet (with various numbers of layers) and with other networks. I recommend using maskrcnn instead of fasterrcnn. Just configure it not to use the masks and not to do the segmentation. I don't know why, but it gives much better results;
don't spend your time on mobilenet; with your training set size you will not be able to train it to a reasonable AP and AR. Start with maskrcnn with a resnet18 backbone.
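As a rough illustration of the point about non-constant LR, here is a minimal sketch of a "reduce on plateau" rule in plain Python. The class name `PlateauLR`, its default values, and the loss sequence are all made up for illustration, and it reacts to the per-epoch loss, which is only one reading of "based on the cumulative loss". In PyTorch, `torch.optim.lr_scheduler.ReduceLROnPlateau` provides similar behavior.

```python
class PlateauLR:
    """Hypothetical helper: cut the LR when the monitored loss stops improving."""

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor      # multiply the LR by this on a plateau
        self.patience = patience  # non-improving epochs tolerated before a cut
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record one epoch's loss and return the LR to use next."""
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


# Made-up loss curve: improves, stalls for three epochs, then improves again.
sched = PlateauLR(lr=0.01)
losses = [1.0, 0.8, 0.79, 0.80, 0.81, 0.80, 0.70]
history = [sched.step(loss) for loss in losses]
```

Here the LR stays at 0.01 while the loss improves and is halved once the loss has failed to improve for more than `patience` epochs in a row.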

CNN for Brain Tumour Detection using 3D MRI images

I am looking to implement a CNN to detect Alzheimer's disease using a 3D MRI dataset containing healthy and diseased samples. I am getting very poor results using LeNet and VGG16. My models aren't learning, and I have around 300 images in total. I am trying 3D as well as 2D convolution.
The main problem is handling the 3D images. Any help would be appreciated.

How to perform polynomial landmark detection with deep learning

I am trying to build a system to segment vehicles using a deep convolutional neural network. I am familiar with predicting a fixed number of points (e.g. ending a neural architecture with a Dense layer of 4 neurons to predict the (x, y) coordinates of 2 points). However, vehicles come in many different shapes and sizes, and one vehicle may require more segmentation points than another. How can I create a neural network that can output a variable number of values? I imagine I could use an RNN of some sort, but I would like a little guidance. Thank you.
For example, in the following image the two vehicles have a different number of labeled keypoints.

Training model to recognize one specific object (or scene)

I am trying to train a learning model to recognize one specific scene. For example, say I would like to train it to recognize pictures taken at an amusement park and I already have 10 thousand pictures taken at an amusement park. I would like to train this model with those pictures so that it would be able to give a score for other pictures of the probability that they were taken at an amusement park. How do I do that?
Considering this is an image recognition problem, I would probably use a convolutional neural network, but I am not quite sure how to train it in this case.
Thanks!
There are several possible ways. The most trivial one is to collect a large number of negative examples (images from other places) and train a two-class model.
The second approach would be to train a network to extract meaningful low-dimensional representations (embeddings) from an input image. Here you can use siamese training to explicitly train the network to learn similarities between images. Such an approach is employed for face recognition, for instance (see FaceNet). Once you have such embeddings, you can use a well-established method for outlier detection, for instance a 1-class SVM, or any other classifier. In this case you also need negative examples.
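To illustrate how embeddings can be turned into a score, here is a minimal sketch in plain Python. It assumes the embeddings have already been produced by some trained network (the 3-dimensional vectors below are made up), and it uses cosine similarity to the centroid of the positive examples as a simple stand-in for the 1-class SVM mentioned above. It is not the FaceNet method.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def score(embedding, center):
    """Cosine similarity between an embedding and the positive-class centroid."""
    a, b = normalize(embedding), normalize(center)
    return sum(x * y for x, y in zip(a, b))

# Made-up embeddings of amusement park pictures (hypothetical network output).
park_embeddings = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1]]
center = centroid(park_embeddings)

similar = score([0.85, 0.15, 0.05], center)   # close to the park cluster
different = score([0.0, 0.1, 1.0], center)    # far from the park cluster
```

A higher score means the picture's embedding lies closer to the cluster of known amusement park pictures. Turning the score into a calibrated probability would still require a threshold or a classifier fitted on held-out data.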
I would heavily augment your data using image cropping - it is the most obvious way to increase the amount of training data in your case.
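A minimal sketch of random-crop augmentation, assuming an image is represented as a list of pixel rows. The `random_crop` helper and the toy 6x8 image are hypothetical. A real pipeline would more likely use an image library's crop transform.

```python
import random

def random_crop(image, crop_h, crop_w, rng=random):
    """Return a crop_h x crop_w window taken at a random position in the image."""
    h, w = len(image), len(image[0])
    if crop_h > h or crop_w > w:
        raise ValueError("crop size exceeds image size")
    top = rng.randint(0, h - crop_h)    # randint is inclusive at both ends
    left = rng.randint(0, w - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

# Toy 6x8 "image" whose pixel value encodes its position (row * 10 + column).
image = [[r * 10 + c for c in range(8)] for r in range(6)]

rng = random.Random(0)  # seeded generator so runs are repeatable
crops = [random_crop(image, 4, 4, rng) for _ in range(5)]
```

Each call yields a different sub-window of the same picture, so one source image can contribute several training samples.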
In general, your success in this task strongly depends on the task statement (are you restricted to parks only, or to any kind of place?) and on having the proper data.