How to reject false alarm in object detection using SSD? - deep-learning

I used SSD for object detection, but I am getting false detections from some other objects in the image.
This happens consistently with the same objects. Is there a way to reject those objects during training?

For YOLO, I can do the following.
Just add images containing these non-labeled objects to the training dataset and retrain; the network will learn not to detect such objects.
It is also desirable to add negative samples to your training dataset: https://github.com/AlexeyAB/darknet
It is desirable that the training dataset include images with non-labeled objects that we do not want to detect, i.e. negative samples without bounding boxes (empty .txt files). (Credit to alexbe)
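As a minimal sketch (with a hypothetical folder layout), creating the empty annotation files for a directory of negative images could look like this:
from pathlib import Path

# Hypothetical layout: negative images live in negatives/ and the empty label
# file goes next to each image, telling the trainer "nothing to detect here".
for img in Path('negatives').glob('*.jpg'):
    img.with_suffix('.txt').touch()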
In general, what we can do is: hard negative mining, inspecting the confusion matrix, and data augmentation.
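For hard negative mining in an SSD-style loss, a common trick is to keep only the highest-loss negative anchors at a fixed ratio to the positives. Below is a minimal PyTorch-style sketch, assuming per-anchor classification losses and a positive-anchor mask are already computed (all names are illustrative, not SSD's actual implementation):
import torch

def hard_negative_mining(conf_loss, pos_mask, neg_pos_ratio=3):
    # conf_loss: (batch, n_anchors) per-anchor classification loss
    # pos_mask:  (batch, n_anchors) boolean mask of anchors matched to a ground-truth box
    conf_loss = conf_loss.clone()
    conf_loss[pos_mask] = 0.0                      # positives are always kept, so ignore them when ranking
    _, idx = conf_loss.sort(dim=1, descending=True)
    _, rank = idx.sort(dim=1)                      # rank of each anchor by its loss
    n_neg = neg_pos_ratio * pos_mask.sum(dim=1, keepdim=True)
    neg_mask = rank < n_neg                        # keep only the hardest negatives
    return pos_mask | neg_mask                     # anchors that contribute to the classification loss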

Related

Is it possible to do transfer learning on different observation and action space for Actor-Critic?

I have been experimenting with actor-critic networks such as SAC and TD3 on continuous control tasks, and I am trying to transfer the trained network to another task with a smaller observation and action space.
Would it be possible to do so if I saved the weights in a dictionary and then loaded them in the new environment? The actor-critic network takes as input a state with different dimensions and also outputs an action with different dimensions.
I have some experience fine-tuning transformer models by adding another classifier head and fine-tuning it, but how would I do this with actor-critic networks if the initial and final layers do not match those of the learned agent?
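One common workaround (sketched below in PyTorch, with hypothetical network shapes and names) is to copy only the parameters whose names and shapes match, and let the mismatched input and output layers keep a fresh initialization before fine-tuning:
import torch
import torch.nn as nn

def make_actor(obs_dim, act_dim, hidden=256):
    # Hypothetical actor: only the first and last layers depend on the task dimensions.
    return nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, act_dim))

old_actor = make_actor(obs_dim=17, act_dim=6)   # source task (larger spaces)
new_actor = make_actor(obs_dim=11, act_dim=3)   # target task (smaller spaces)

old_state = old_actor.state_dict()
new_state = new_actor.state_dict()

# Copy only tensors whose names and shapes match; the mismatched input and output
# layers are skipped and keep their fresh initialization, ready for fine-tuning.
transferable = {k: v for k, v in old_state.items()
                if k in new_state and v.shape == new_state[k].shape}
new_state.update(transferable)
new_actor.load_state_dict(new_state)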

6D pose estimation of a known 3D CAD object with limited model training for a new object

I'm working on a project where I need to estimate the 6DOF pose of a known 3D CAD object in a single RGB image - i.e. this task: https://paperswithcode.com/task/6d-pose-estimation. There are several constraints on the problem:
Usable commercially (licensed under BSD, MIT, BOOST, etc.), not GPL.
The CAD object is known and we do NOT aim for generality (i.e. recognizing the class of all chairs).
The CAD object can be uploaded by a user, so it may have symmetries and a range of textures.
Inference step will be run on a smartphone, and should be able to run at >30fps.
The inference step can either be a) find the pose of the object once and then I can write code to continue to track it or b) find the pose of the object continuously. I.e. the model doesn't need to have any continuous refinement steps after the initial pose estimate is found.
Can be anywhere on the scale from a single instance of a single object to multiple instances of multiple objects (MIMO). MIMO is preferred, but not required.
If a deep learning approach is used, the training time required for a new CAD object should be on the order of hours, not days.
Can either 1) just find the initial pose of an object and not have any refinement steps after or 2) find the initial pose of the object and also have refinement steps after.
I am open to traditional approaches (i.e. 2D->3D correspondences then solving with PnP), but it seems like deep learning approaches outperform them (classical are too slow - Real time 6D pose estimation of known 3D CAD objects from a single 2D image or point clouds from RGBD Camera when objects are one on top of the other?). Looking at deep learning approaches (poseCNN, HybridPose, Pix2Pose, CosyPose), it seems most of them match these constraints, except that they require model training time. Though perhaps I can use a single pre-trained model and then specialize it for each new CAD object with a shorter training step. But I am not sure of this, and I think success probably relies on the specific model chosen. For example, this project says it requires 3 hours of training time: https://github.com/DLR-RM/AugmentedAutoencoder.
So, my question: would somebody know what the state of the art, commercially usable implementation that doesn't require extensive training time for a new CAD object is?
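For reference, the classical route mentioned above (2D->3D correspondences followed by PnP) boils down to a call like the following OpenCV sketch; the correspondences and camera intrinsics here are made up purely for illustration:
import numpy as np
import cv2

# Hypothetical 2D-3D correspondences between detected image keypoints (pixels)
# and the matching points on the CAD model (model units).
object_points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
                          [0.0, 0.0, 0.1], [0.1, 0.1, 0.0], [0.1, 0.0, 0.1]])
image_points = np.array([[320., 240.], [400., 238.], [322., 160.],
                         [318., 300.], [402., 158.], [398., 302.]])
camera_matrix = np.array([[800., 0., 320.],
                          [0., 800., 240.],
                          [0., 0., 1.]])
dist_coeffs = np.zeros(5)

# RANSAC-robust PnP: recovers the object rotation (Rodrigues vector) and translation.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_points, image_points,
                                             camera_matrix, dist_coeffs)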

Object Detection from Image Classification

Can I use a model trained for image classification to do object detection? I have already spent a lot of time collecting the images and sorting each class into its own folder.
You can use your classification model as an initialized backbone for a detection model (e.g. Faster-RCNN), but it might not help that much compared to training your detector from scratch.
You will need to add detection layers (e.g. ROI pooling) to your backbone to perform detection.
While you can try unsupervised object detection, usually you will need extra labels such as object bounding-boxes to train your object detector.
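As an illustration of the "classification backbone + detection layers" idea, here is a minimal torchvision-style sketch that wraps a pretrained classifier's feature extractor in a Faster-RCNN; mobilenet_v2 is just a stand-in for your own classification network, and the number of classes is a placeholder:
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

# Stand-in backbone: a pretrained classifier with its classification head removed.
backbone = torchvision.models.mobilenet_v2(pretrained=True).features
backbone.out_channels = 1280   # FasterRCNN needs to know the feature-map depth

anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0'],
                                                output_size=7,
                                                sampling_ratio=2)

model = FasterRCNN(backbone,
                   num_classes=2,               # your classes + background
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)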

How to input audio data into deep learning algorithm?

I'm very new to deep learning, and I'm aiming to use a GAN (Generative Adversarial Network) to recognize emotional speech. I've only seen images used as inputs to most deep learning algorithms such as GANs, but I'm curious how audio data can be used as input, besides using images of spectrograms. I'd also appreciate it if you could explain it in layman's terms.
Audio data can be represented as NumPy arrays, but before moving to that you must understand what audio really is. Audio is nothing but a wave-like signal, where the amplitude changes with respect to time.
In the time domain, the audio is stored as a sequence of amplitude values measured at regular intervals; the number of values captured per second is the sampling rate.
Converting the data into the frequency domain (e.g. a spectrogram) can reduce the amount of computation required, since it is a more compact representation than the raw samples.
Now, let's load the data. We'll use a library called librosa, which can be installed using pip.
import librosa
import librosa.display
data, sampling_rate = librosa.load('audio.wav')
Now you have both the data and the sampling rate. We can plot the waveform (in newer librosa versions this function is called librosa.display.waveshow):
librosa.display.waveplot(data, sr=sampling_rate)
Now you have the audio data as a NumPy array. You can study the features of the data and extract the ones you find interesting to train your models.
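For example, a common feature to extract is a log-scaled mel spectrogram, which librosa can compute directly (a short sketch, with 'audio.wav' as a placeholder file name):
import numpy as np
import librosa

data, sr = librosa.load('audio.wav')
mel = librosa.feature.melspectrogram(y=data, sr=sr, n_mels=64)
mel_db = librosa.power_to_db(mel, ref=np.max)   # log-scaled, shape (n_mels, time_frames)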
Further to Ayush's discussion, for information on the challenges and workarounds of dealing with large amounts of data at different time scales in audio, I suggest this post on WaveNet: https://deepmind.com/blog/article/wavenet-generative-model-raw-audio
After that, it sounds like you want to do classification. In that case a GAN on its own is not suitable. If you have plenty of data, you could use a straight LSTM (or another type of RNN), which is designed to model time series, or you can take fixed-size chunks of input and use a 1D CNN (similar to WaveNet). If you have lots of unlabelled data from the same or a similar domain and limited training data, you could use a GAN to learn to generate new samples, then use the discriminator from the GAN as pre-trained weights for a CNN classifier.
Since you are trying to perform Speech Emotion Recognition (SER) using deep learning, you could go for a recurrent architecture (LSTM or GRU) or a combination of CNN and recurrent layers (CRNN) instead of GANs, since GANs are complicated and difficult to train.
In a CRNN, the CNN layers extract features of varying detail and complexity, whereas the recurrent layers take care of the temporal dependencies. You can then use a final fully connected layer for regression or classification, depending on whether your output label is discrete (categorical emotions like angry, sad, neutral, etc.) or continuous (the arousal-valence space).
Regarding the choice of input, you can use either a spectrogram (2D) or the raw speech signal (1D) as input. For spectrogram input you have to use a 2D CNN, whereas for a raw speech signal you can use a 1D CNN. Mel-scale spectrograms are usually preferred over linear spectrograms since our ears perceive frequency on a log scale, not linearly.
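To make the CRNN idea concrete, here is a minimal PyTorch sketch for mel-spectrogram input (layer sizes and the number of emotion classes are arbitrary placeholders, not the architecture from the papers below):
import torch
import torch.nn as nn

class CRNN(nn.Module):
    # CNN over (mel, time) patches, GRU over the time axis, FC classification head.
    def __init__(self, n_mels=64, n_classes=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
        )
        self.gru = nn.GRU(input_size=32 * (n_mels // 4), hidden_size=64,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 64, n_classes)

    def forward(self, x):           # x: (batch, 1, n_mels, time)
        f = self.cnn(x)             # (batch, 32, n_mels//4, time//4)
        f = f.permute(0, 3, 1, 2)   # (batch, time//4, 32, n_mels//4)
        f = f.flatten(2)            # (batch, time//4, 32 * n_mels//4)
        out, _ = self.gru(f)
        return self.fc(out[:, -1])  # use the last time step for classification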
I have used a CRNN architecture to estimate the level of verbal conflict arising from conversational speech. Even though it is not SER, it is a very similar task.
You can find more details in the paper
http://www.eecs.qmul.ac.uk/~andrea/papers/2019_SPL_ConflictNET_Rajan_Brutti_Cavallaro.pdf
Also, check my github code for the same paper
https://github.com/smartcameras/ConflictNET
and a SER paper whose code I reproduced in Python
https://github.com/vandana-rajan/1D-Speech-Emotion-Recognition
And finally as Ayush mentioned, Librosa is one of the best Python libraries for audio processing. You have functions to create spectrograms in Librosa.

MXnet - ImageRecordIter and data augmentation for ROI-Pooling enabled CNN

How can I perform data augmentation when I use ROI-Pooling in a CNN network that I developed using MXNet?
For example, suppose I have a ResNet-50 architecture that uses an ROI-pooling layer, and I want to use random-crop data augmentation in the ImageRecordIter.
Is there an automatic way for the ROI coordinates passed to the ROI-pooling layer to be transformed so that they match the images generated by the data-augmentation process of the ImageRecordIter?
You should be able to repurpose the ImageDetRecordIter for this. It is intended for use with object detection data containing bounding boxes, but you could define the bounding boxes as your ROIs. Then, when you apply augmentation operations (such as flips and rotations), the coordinates of the bounding boxes will be adjusted in line with the images.
Otherwise you can easily write your own transform function using Gluon, and can make use of any OpenCV augmentation to apply to both your image and ROIs. Just write a function that takes data and label, and returns the augmented data and label.
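As a minimal sketch of that second option (assuming an HWC image and one ROI per row in the hypothetical layout [class, xmin, ymin, xmax, ymax] with coordinates normalized to [0, 1]), a Gluon transform for a random horizontal flip could look like this:
import random
import mxnet as mx

def flip_transform(data, label):
    # data: HWC image as an mx.nd.NDArray; label: numpy array of ROIs,
    # one row per ROI in the assumed [class, xmin, ymin, xmax, ymax] format.
    if random.random() < 0.5:
        data = mx.nd.flip(data, axis=1)   # flip along the width axis
        xmin = label[:, 1].copy()
        xmax = label[:, 3].copy()
        label[:, 1] = 1.0 - xmax          # mirror the x coordinates
        label[:, 3] = 1.0 - xmin
    return data, label

# dataset = dataset.transform(flip_transform)   # usage with a Gluon Dataset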