One box object detection - deep-learning

I am using a faster rcnn model to predict one object in an image. There can only be one object in each image.
Is it possible to force Faster Rcnn to train and predict as if it should only find one object per image?

Yes, that mostly depends on the data you train on: if every training image contains exactly one object, the model will learn that bias.
That said, I don't think Faster R-CNN is the best solution for this case; it is a "brute force" choice when there is only one object per image. If the data is complex enough that such a big object detection model is worth it, try modern convolution-based architectures like YOLO or SSD.
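If you additionally want a hard guarantee of exactly one prediction per image, you can simply keep the highest-scoring detection at inference time. A minimal sketch, assuming torchvision's Faster R-CNN implementation (the weights string and image tensor are placeholders):
import torch
import torchvision

# Load a Faster R-CNN model; "DEFAULT" weights assume a recent torchvision version.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)  # placeholder: one RGB image tensor with values in [0, 1]

with torch.no_grad():
    output = model([image])[0]   # dict with 'boxes', 'labels', 'scores'

# Enforce the one-object constraint: keep only the highest-scoring detection.
if len(output["boxes"]) > 0:
    best = output["scores"].argmax()
    box, label, score = output["boxes"][best], output["labels"][best], output["scores"][best]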

Related

Object detection from synthetic to real-life data with YOLOv5

Currently trying YOLOv5 with custom synthetic data. The dataset we've created consists of 8 different objects. Each object has a minimum of 1500 pictures/labels, where the pictures are split 500/500/500 between normal/fog/distractors around the object. Sample images from the dataset are in the first imgur link. The model is not trained from scratch, but from the standard YOLOv5 .pt weights.
So far we've tried:
Adding more data (from 300 images per object to 4500)
Creating more complex data (distractors on/around objects)
Running multiple training runs
Training with network sizes small, medium, large and xlarge
Varying the batch size between 4 and 32 (depending on model size)
Everything so far has resulted in good/great detection on synthetic data, but it is completely off when used on real-life data.
Examples: it thinks that whole pictures of unrelated objects are a paper box, that walls are pallets, etc. Quick sample images are in the last imgur link.
Anyone got clues for how to improve the training or data to be better suited for real-life detection? Or how to better interpret the results? I don't understand how the model draws the conclusion that a whole picture, with unrelated objects, is a box/pallet.
Results from training uploaded to imgur:
https://imgur.com/a/P0TQeBl
Example on real life data:
https://imgur.com/a/SGY7w8w
There are a couple of things you can do to improve results.
After training your model with synthetic data, fine-tune it on real training data with a smaller learning rate (1/10th, maybe). This reduces the gap between synthetic and real-life images. In some cases, rather than fine-tuning, training the model on mixed (synthetic + real) data produces better results (a rough sketch follows at the end of this answer).
Generate images structurally similar to real life examples. For example, put humans inside forklifts, or pallets or barrels on forks, etc. Models learn from it.
Randomize the texture on items that you want to detect. Models tend to focus on textures for detection. By randomizing textures, with lots of variability including non-natural occurrences, you force the model to identify objects by something other than their texture. Although an object's texture is sometimes a good identifier, synthetic data suffers from not replicating it well enough, hence the domain gap, so you reduce its impact on the model's decision.
I am not sure whether the screenshot accurately represents your data generation distribution; if so, you have to randomize the angles, sizes and occlusion amounts of the objects more.
As distractors, use objects that you don't want to detect but that will appear in the images you will run inference on, rather than simple shapes like spheres.
Randomize lighting more: intensity, color, angles, etc.
Increase background and ground randomization. Use HDRIs; there are lots of free HDRIs available.
Balance your dataset
https://imgur.com/a/LdCa8aO
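A rough sketch of the mixed-data / reduced-learning-rate idea from the first point above, assuming a plain PyTorch training loop (if you train through the YOLOv5 CLI instead, the initial learning rate lives in its hyperparameter YAML); synthetic_ds and real_ds are hypothetical dataset objects:
import torch
from torch.utils.data import ConcatDataset, DataLoader

# synthetic_ds and real_ds are hypothetical detection datasets returning (image, target) pairs.
def build_finetune_loader(synthetic_ds, real_ds, batch_size=8):
    # Mix synthetic and real images in one loader for the fine-tuning phase.
    mixed = ConcatDataset([synthetic_ds, real_ds])
    return DataLoader(mixed, batch_size=batch_size, shuffle=True,
                      collate_fn=lambda batch: tuple(zip(*batch)))

def build_finetune_optimizer(model, pretrain_lr=1e-2):
    # Use roughly 1/10th of the learning rate used during the synthetic-only training.
    return torch.optim.SGD(model.parameters(), lr=pretrain_lr / 10, momentum=0.9)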
Checking your results, the answer is that your synthetic data is way too dissimilar to the real-life data you want it to work on. Generating synthetic scenes that are closer to their real-life counterparts and training again would clearly improve your results. That includes more realistic backgrounds and scene compositions. I don't know if your training set resembles the validation images you shared here, but if it does, try to have more objects per image, closer to the camera, and add variation to their relative positions. Having just one random 3D object in the middle of an image is not going to give good results. By the way, you are already overfitting your models, so more training images wouldn't help at this point.

Will object detection/segmentation improve classification accuracy?

If I want to do a classification task, will the classification score improve if I have bounding-box or pixel-wise information?
Or will a classification network outperform an object detection/segmentation network at classification?
Good question.
The answer is the following: it depends on the case.
For instance, say you train YOLOv3 on 10 types of dogs. The output of YOLOv3 is then a bounding box with the corresponding class.
Having two neural networks, however, one for detecting dogs (object detection) and the other for classification (which type of dog), may yield a better score.
The same goes for a semantic/instance segmentation network.
Although there is no recipe for saying which approach works better, I am personally inclined to believe that the two-step pipeline would yield slightly better results (these observations are based on my experience, since I have implemented both approaches at some point).
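To illustrate the two-step pipeline, here is a rough sketch, assuming torchvision's Faster R-CNN as the detector; breed_classifier and classify_transform are hypothetical placeholders you would define and train yourself:
import torch
import torchvision
from torchvision.transforms.functional import to_pil_image

# Step 1: a generic detector finds the dogs. Step 2: a separate classifier decides the breed.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_then_classify(image_tensor, breed_classifier, classify_transform, score_thresh=0.5):
    # image_tensor: (3, H, W) float tensor with values in [0, 1]
    with torch.no_grad():
        det = detector([image_tensor])[0]
        results = []
        for box, score in zip(det["boxes"], det["scores"]):
            if score < score_thresh:
                continue
            x1, y1, x2, y2 = box.int().tolist()
            crop = image_tensor[:, y1:y2, x1:x2]           # crop the detected region
            crop = classify_transform(to_pil_image(crop))  # resize/normalise for the classifier
            logits = breed_classifier(crop.unsqueeze(0))
            results.append((box, logits.argmax(dim=1).item()))
    return results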

How to input audio data into a deep learning algorithm?

I'm very new to deep learning, and I'm aiming to use a GAN (Generative Adversarial Network) to recognize emotional speech. I've only seen images used as inputs to most deep learning algorithms, such as GANs, but I'm curious how audio data can be used as input, besides using images of spectrograms. Also, I'd appreciate it if you could explain it in layman's terms.
Audio data can be represented as numpy arrays, but before moving to that you must understand what audio really is. If you think about what audio looks like, it is nothing but a wave-like signal, where the amplitude changes with respect to time.
Assuming the audio is represented in the time domain, we take amplitude values at fixed intervals; the number of values taken per second is called the sampling rate.
Converting the data into the frequency domain can reduce the amount of computation required, as the data rate needed is lower.
Now, let's load the data. We'll use a library called librosa, which can be installed using pip.
data, sampling_rate = librosa.load('audio.wav')
Now, you have both the data and the sampling rate. We can plot the waveform now.
librosa.display.waveplot(data, sr=sampling_rate)
Now you have the audio data in the form of a numpy array. You can study the features of the data and extract the ones you find interesting to train your models.
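Putting the snippets above together into a self-contained example (assuming audio.wav exists and librosa/matplotlib are installed; note that waveplot was renamed waveshow in recent librosa versions):
import librosa
import librosa.display
import matplotlib.pyplot as plt

data, sampling_rate = librosa.load('audio.wav')   # data: float numpy array; sr defaults to 22050 Hz

plt.figure(figsize=(10, 3))
librosa.display.waveshow(data, sr=sampling_rate)  # called waveplot in older librosa versions
plt.title('Waveform')
plt.show()

# Example feature extraction: 13 MFCCs per frame, a common starting point for speech tasks.
mfccs = librosa.feature.mfcc(y=data, sr=sampling_rate, n_mfcc=13)
print(mfccs.shape)  # (13, number_of_frames)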
Further to Ayush's discussion, for information on the challenges and workarounds of dealing with large amounts of data at different time scales in audio, I suggest this post on WaveNet: https://deepmind.com/blog/article/wavenet-generative-model-raw-audio
After that, it sounds like you want to do classification. In that case a GAN on its own is not suitable. If you have plenty of data you could use a straight LSTM (or another type of RNN), which is designed to model time series, or you can take fixed-size chunks of input and use a 1-D CNN (similar to WaveNet). If you have lots of unlabelled data from the same or a similar domain and limited training data, you could use a GAN to learn to generate new samples, then use the discriminator from the GAN as pre-trained weights for a CNN classifier.
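As a rough illustration of the fixed-size-chunk + 1-D CNN idea (the layer sizes and the 1-second/16 kHz chunk length are arbitrary, not tuned):
import torch
import torch.nn as nn

# Each training example is a raw-audio chunk; the network predicts one of n_classes emotions.
class Chunk1DCNN(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=32, stride=2), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=16, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                    # x: (batch, 1, chunk_len)
        h = self.features(x).squeeze(-1)     # (batch, 64)
        return self.classifier(h)

model = Chunk1DCNN()
dummy_chunk = torch.randn(8, 1, 16000)       # e.g. 1-second chunks at 16 kHz
print(model(dummy_chunk).shape)              # torch.Size([8, 4])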
Since you are trying to perform Speech Emotion Recognition (SER) using deep learning, you can go for a recurrent architecture (LSTM or GRU) or a combination of CNN and recurrent network (CRNN) instead of a GAN, since GANs are complicated and difficult to train.
In a CRNN, the CNN layers will extract features of varying details and complexity, whereas the recurrent layers will take care of the temporal dependencies. You can then finally use a fully connected layer for regression or classification output, depending on whether your output label is discrete (for categorical emotions like angry, sad, neutral etc) or continuous (arousal and valence space).
Regarding the choice of input, you can use either a spectrogram (2D) or the raw speech signal (1D). For a spectrogram input you have to use a 2D CNN, whereas for a raw speech signal you can use a 1D CNN. Mel-scale spectrograms are usually preferred over linear spectrograms since our ears perceive frequency on a log scale, not linearly.
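A minimal sketch of such a CRNN on log-mel spectrogram input (all layer sizes here are illustrative, not the ones from the paper below):
import torch
import torch.nn as nn

# CNN blocks extract local time-frequency features, a GRU models temporal structure,
# and a final linear layer gives class scores (or a single output for regression).
class CRNN(nn.Module):
    def __init__(self, n_mels=64, n_out=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.gru = nn.GRU(input_size=32 * (n_mels // 4), hidden_size=64, batch_first=True)
        self.out = nn.Linear(64, n_out)

    def forward(self, x):                     # x: (batch, 1, n_mels, n_frames)
        h = self.cnn(x)                       # (batch, 32, n_mels/4, n_frames/4)
        h = h.permute(0, 3, 1, 2).flatten(2)  # (batch, time, channels * freq)
        _, last = self.gru(h)                 # last hidden state: (1, batch, 64)
        return self.out(last.squeeze(0))

spec = torch.randn(2, 1, 64, 200)             # e.g. 64 mel bands x 200 frames
print(CRNN()(spec).shape)                     # torch.Size([2, 4])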
I have used a CRNN architecture to estimate the level of verbal conflict arising from conversational speech. Even though it is not SER, it is a very similar task.
You can find more details in the paper
http://www.eecs.qmul.ac.uk/~andrea/papers/2019_SPL_ConflictNET_Rajan_Brutti_Cavallaro.pdf
Also, check my github code for the same paper
https://github.com/smartcameras/ConflictNET
and a SER paper whose code I reproduced in Python
https://github.com/vandana-rajan/1D-Speech-Emotion-Recognition
And finally as Ayush mentioned, Librosa is one of the best Python libraries for audio processing. You have functions to create spectrograms in Librosa.

Computer vision detection on small objects: why the bad results?

Currently, for detection (localisation + recognition tasks) we mainly use deep learning algorithms in computer vision. Two types of detectors exist:
one stage: SSD, YOLO, RetinaNet, ...
two stage: R-CNN, Fast R-CNN and Faster R-CNN, for example
Using these detectors on very small objects (10 pixels, for example) is a very challenging task, and it seems the one-stage algorithms are worse than the two-stage ones. But I do not really understand why it works better with Faster R-CNN, for example. In fact, both one-stage and two-stage detectors use the anchor concept, and most of them use the same backbone, like VGG16 or ResNet50/ResNet101, which means the receptive field is the same. For example, I tried to detect very small objects with RetinaNet and with Faster R-CNN. With RetinaNet, small objects are not detected, contrary to Faster R-CNN. I do not understand why. What is the theoretical explanation? (same backbone: ResNet50)
I think networks like RetinaNet are in general trying to bridge the gap you mention. In one-stage networks we usually have anchor boxes of varying scales on the feature maps produced by the backbone net. These feature maps are produced by heavily downsampling the input image, and a lot of information about small objects might be lost in that operation. While this is the case for one-stage detectors, in two-stage detectors the flexibility of the RPN means it may still propose regions which are small, and this may help it perform slightly better than its one-stage counterparts.
I don't think you should be surprised that both of them use the same backbone; after the conv features are extracted, the two networks use different methods to perform detection.
Hope this helps. Let me know if I wasn't clear enough or if you have questions.
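To make the downsampling point concrete, here is a quick back-of-the-envelope calculation (the strides are typical backbone/FPN values, not specific to your models):
# A 10-pixel object occupies less than one cell on a stride-16 feature map, so a single
# anchor location has to represent it almost entirely on its own.
object_size_px = 10
for stride in (4, 8, 16, 32):                    # typical feature-map strides
    cells = object_size_px / stride
    print(f"stride {stride:>2}: object covers {cells:.2f} feature-map cells")
# stride  4: object covers 2.50 feature-map cells
# stride  8: object covers 1.25 feature-map cells
# stride 16: object covers 0.62 feature-map cells
# stride 32: object covers 0.31 feature-map cells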

Single-class or multi-class object detection for a specific object class?

One thing that I have been wondering about for a long time is the performance of a CNN-based object detector in single-class versus multi-class settings.
For example, suppose I want to design a pedestrian detector using the famous Faster R-CNN (VGG-16). The official version can detect pedestrians with 76.7 AP (PASCAL 07 test) if the training data is PASCAL VOC07+12 trainval.
I am quite satisfied with these detection results, but what if I revise the framework into a single-class pedestrian detector, so the training data contains only pedestrians and both training and testing data are smaller?
I know the computational cost will be lower than for the original 20-class detector, but I am curious about the detection performance.
Has anybody compared single-class and multi-class detectors on the same class?
Yes, but the results vary quite a bit according to model and application. I've done this with several SVM applications and one CNN. As expected, the single-class version consumed fewer resources in every case.
However, the results were quite different. One SVM actually did better with single-class training; two were significantly worse, and the other 3-4 were about the same (within the expected error range).
The CNN didn't fare so well; it needed some tweaks to the topology.
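For what it's worth, if you want to run the single-class experiment yourself with torchvision's Faster R-CNN, restricting the detector to one foreground class is just a matter of swapping the box-prediction head; a minimal sketch (the weights string is a placeholder and the fine-tuning loop is omitted):
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
# num_classes counts the background, so one foreground class (pedestrian) means num_classes=2.
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)
# The model can then be fine-tuned on a pedestrian-only dataset and compared with the multi-class version.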