In rapidminer what is the meaning of:
- training cycles
- learning rate
- momentum
I'm trying to work with rapidminer but i cant understand this notions. Any help would be apreciated, as newbly as possible.
This is terminology used in neural nets (and the libraries RapidMiner is using, resp.) , not RapidMiner in particular.
Here is a really short explanation. For more details you have to study neural networks.
A Multi-Layer Perceptron with Backpropagation learns the model (i.e. adjusts the weights/backpropagates the error) in several iteration which are called training cycle. In each iteration the error will be reduced.
The learning rate specifies how much the old weight contributes to the learning in each cycle. A too large rate leads to a bad flexibility of the model and it can yield a local minima. A too small value means that the model will be learnt very slowly. Use a small value and increase it carefully if your learning is to slow.
The momentum is another parameter which determines if you get stuck in a local minima (small value) or not (large value, but with a unstable learning).
Related
I am a student and currently studying deep learning by myself. Here I would like to ask for clarification regarding the transfer learning.
For example MobileNetv2 (https://keras.io/api/applications/mobilenet/#mobilenetv2-function), if the weights parameter is set to None, then I am not doing transfer learning as the weights are random initialized. If I would like to do transfer learning, then I should set the weights parameter to imagenet. Is this concept correct?
Clarification and explanation regarding deep learning
Yes, when you initialize the weights with random values, you are just using the architecture and training the model from scratch. The goal of transfer learning is to use the previously gained knowledge by another trained model to get better results or to use less computational resources.
There are different ways to use transfer learning:
You can freeze the learned weights of the base model and replace the last layer of the model base on your problem and just train the last layer
You can start with the learned weights and fine-tune them (let them change in the learning process). Many people do that because sometimes it makes the training faster and gives better results because the weights already contain so much information.
You can use the first layers to extract basic features like colors, edges, circles... and add your desired layers after them. In this way, you can use your resources to learn high-level features.
There are more cases, but I hope it could give you an idea.
I have a dataset composed of 10k-15k pictures for supervised object detection which is very different from Imagenet or Coco (pictures are much darker and represent completely different things, industrial related).
The model currently used is a FasterRCNN which extracts features with a Resnet used as a backbone.
Could train the backbone of the model from scratch in one stage and then train the whole network in another stage be beneficial for the task, instead of loading the network pretrained on Coco and then retraining all the layers of the whole network in a single stage?
From my experience, here are some important points:
your train set is not big enough to train the detector from scratch (though depends on network configuration, fasterrcnn+resnet18 can work). Better to use a pre-trained network on the imagenet;
the domain the network was pre-trained on is not really that important. The network, especially the big one, need to learn all those arches, circles, and other primitive figures in order to use the knowledge for detecting more complex objects;
the brightness of your train images can be important but is not something to stop you from using a pre-trained network;
training from scratch requires much more epochs and much more data. The longer the training is the more complex should be your LR control algorithm. At a minimum, it should not be constant and change the LR based on the cumulative loss. and the initial settings depend on multiple factors, such as network size, augmentations, and the number of epochs;
I played a lot with fasterrcnn+resnet (various number of layers) and the other networks. I recommend you to use maskcnn instead of fasterrcnn. Just command it not to use the masks and not to do the segmentation. I don't know why but it gives much better results.
don't spend your time on mobilenet, with your train set size you will not be able to train it with some reasonable AP and AR. Start with maskrcnn+resnet18 backbone.
When using the reinforcement learning model ddpg, the input data are sequence data, high-dimensional (21 dimensional) state and low dimensional (1-dimensional) action. Does this have any negative impact on the training of the model? How to solve it
In general in any machine learning scenario, dimensionality per se is not a problem, it is mostly a matter of how much variability there is the input data. Of course, higher dimensional data can have much higher variability than lower dimensional one.
Even considering this, the problem can "easily" be solved by feeding more data to the ML algorithm and increasing the complexity that it is allowed to represent (i.e. more nodes and/or layers in a neural network).
In RL, this is even less of a problem because you don't really have a restriction on how much data you actually have. You can always run your agent some more on the environment to get more sample trajectories to train on. The only issue you might find here is that your computing time grows a lot (depending on how much more you need to train on the environment for this problem).
I'm very new in deep learning, and I'm targeting to use GAN (Generative Adversarial Network) to recognize emotional speech. I've only known images being as inputs to most deep learning algorithms, such as GAN. but I'm curious as to how audio data can be an input into it, besides of using images of the spectrograms as the input. also, i'd appreciate it if you can explain it in laymen terms.
Audio data can be be represented in form of numpy arrays but before moving to that you must understand what audio really is. If you give a thought on what an audio looks like, it is nothing but a wave like format of data, where the amplitude of audio change with respect to time.
Assuming that our audio is represented in time domain, we can extract the values at every half-second(arbitrary). This is called sampling rate.
Converting the data into frequency domain can reduce the amount of computation requires as the sampling rate is less.
Now, let's load the data. We'll use a library called librosa , which can be installed using pip.
data, sampling_rate = librosa.load('audio.wav')
Now, you have both the data and the sampling rate. We can plot the waveform now.
librosa.display.waveplot(data, sr=sampling_rate)
Now, you have the audio data in form of numpy array. You can now study the features of the data and extract the ones you find interesting to train your models.
Further to Ayush’s discussion, for information on the challenges and work arounds of dealing with large amounts of data at different time scales in audio data I suggest this post on WaveNet: https://deepmind.com/blog/article/wavenet-generative-model-raw-audio
After that it sounds like you want to do classification. In that case a GAN on it’s own is not suitable. If you have plenty of data you could use a straight LSTM (or another type of RNN) which is designed to model time series, or you can take set sized chunks of input and use a 1-d CNN (similar to WaveNet). If you have lots of unlabelled data from the same or similar domain and limited training data you could use a GAN to learn to generate new samples, then use the discriminator from the GAN as pre-trained weights for a CNN classifier.
Since you are trying to perform Speech Emotion Recognition (SER) using deep learning, you can go for a recurrent architecture (LSTM or GRU) or a combination of CNN and recurrent network architecture (CRNN) instead of GANs since GANs are complicated and difficult to train.
In a CRNN, the CNN layers will extract features of varying details and complexity, whereas the recurrent layers will take care of the temporal dependencies. You can then finally use a fully connected layer for regression or classification output, depending on whether your output label is discrete (for categorical emotions like angry, sad, neutral etc) or continuous (arousal and valence space).
Regarding the choice of input, you can use either a spectrogram input (2D) or raw speech signal (1D) as input. For spectrogram input, you have to use a 2D CNN whereas for a raw speech signal you can use a 1D CNN. Mel scale spectrograms are usually preferred over linear spectrograms since our ears hear frequencies in log scale and not linearly.
I have used a CRNN architecture to estimate the level of verbal conflict arising from conversational speech. Even though it is not SER, it is a very similar task.
You can find more details in the paper
http://www.eecs.qmul.ac.uk/~andrea/papers/2019_SPL_ConflictNET_Rajan_Brutti_Cavallaro.pdf
Also, check my github code for the same paper
https://github.com/smartcameras/ConflictNET
and a SER paper whose code I reproduced in Python
https://github.com/vandana-rajan/1D-Speech-Emotion-Recognition
And finally as Ayush mentioned, Librosa is one of the best Python libraries for audio processing. You have functions to create spectrograms in Librosa.
I have a small dataset collect from imagenet(7 classes each class with 1000 training data). I try to train it with alexnet model. But somehow the accuracy just cant go any higher(about 68% maximum). I remove conv4 and conv5 layer to prevent model overfitting also decrease the number of neuron in each layer(conv and fc). here is my setup.
Did i do anything wrong so that the accuracy is so low?
I want to sort out a few terms:
(1) A perceptron is an individual cell in a neural net.
(2) In a CNN, we generally focus on the kernel (filter) as a unit; this is the square matrix of perceptrons that forms a psuedo-visual unit.
(3) The only place it usually makes sense to focus on an individual perceptron is in the FC layers. When you talk about removing some of the perceptrons, I think you mean kernels.
The most important part of training a model is to make sure that your model is properly fitted to the problem at hand. AlexNet (and CaffeNet, the BVLC implementation) is fitted to the full ImageNet data set. Alex Krizhevsky and his colleagues spent a lot of research effort in tuning their network to the problem. You are not going to get similar accuracy -- on a severely reduced data set -- by simply removing layers and kernels at random.
I suggested that you start from CONVNET (the CIFAR-10 net) because it's much better tuned to this scale of problem. Most of all, I strongly recommend that you make constant use of your visualization tools, so that you can detect when the various kernel layers begin to learn their patterns, and to see the effects of small changes in the topology.
You need to run some experiments to tune and understand your topology. Record the kernel visualizations at chosen times during the training -- perhaps at intervals of 10% of expected convergence -- and compare the visual acuity as you remove a few kernels, or delete an entire layer, or whatever else you choose.
For instance, I expect that if you do this with your current amputated CaffeNet, you'll find that the severe losses in depth and breadth greatly change the feature recognition it's learning. The current depth of building blocks is not enough to recognize edges, then shapes, then full body parts. However, I could be wrong -- you do have three remaining layers. That's why I asked you to post the visualizations you got, to compare with published AlexNet features.
edit: CIFAR VISUALIZATION
CIFAR is much better differentiated between classes than is ILSVRC-2012. Thus, the training requires less detail per layer and fewer layers. Training is faster, and the filters are not nearly as interesting to the human eye. This is not a problem with the Gabor (not Garbor) filter; it's just that the model doesn't have to learn so many details.
For instance, for CONVNET to discriminate between a jonquil and a jet, we just need a smudge of yellow inside a smudge of white (the flower). For AlexNet to tell a jonquil from a cymbidium orchid, the network needs to learn about petal count or shape.