I was reading about PCA data augmentation in AlexNet. I cannot understand the meaning behind the process. Why use the principal components, i.e. the variance of the RGB values along the eigenvectors, for augmentation?
Also, just to be clear: in this method we are subtracting more from channels with higher variance and less from channels with lower variance. Is this right?
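For context, the augmentation described in the AlexNet paper (often called "fancy PCA") is usually implemented roughly as below. This is a minimal sketch for illustration only, not the paper's code: the paper computes the RGB covariance over the entire training set rather than per image, and the function name and defaults here are made up.

```python
import numpy as np

def pca_color_augment(image, sigma=0.1):
    """Sketch of AlexNet-style PCA color augmentation.

    image: float array of shape (H, W, 3) with values in [0, 1].
    sigma: std dev of the random per-component scaling (the paper uses 0.1).
    """
    # 3x3 covariance of the RGB channels (here over one image; the paper
    # uses the whole ImageNet training set instead).
    pixels = image.reshape(-1, 3)
    cov = np.cov(pixels, rowvar=False)

    # eigvecs[:, i] is the i-th principal direction in RGB space,
    # eigvals[i] the variance along it.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Add sum_i alpha_i * lambda_i * p_i to every pixel, with alpha_i ~ N(0, sigma^2).
    alphas = np.random.normal(0.0, sigma, size=3)
    shift = eigvecs @ (alphas * eigvals)
    return np.clip(image + shift, 0.0, 1.0)
```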
I understand that the Fourier transform of a convolution of two signals is the pointwise product of their Fourier transforms (the convolution theorem). What I wonder is whether there are known cases where a convolution can be meaningfully applied to a Fourier-transformed signal (e.g. a time series or an image) in the frequency domain to act as a filter, instead of the multiplication by a square matrix. Also, are there known applications of filters that increase the size of the time domain, i.e. where the matrix in the frequency domain is rectangular and an inverse FT is then applied to get back to the time domain? In particular, I'm interested in known examples of such methods for deep learning.
As you say, the Fourier transform of a convolution of two signals is the pointwise product of their Fourier transforms. This is true in both directions: the convolution of two Fourier-transformed signals equals the Fourier transform of the pointwise product of the two time series (up to a normalization constant that depends on the transform convention).
You do have to define "convolution" suitably - for discrete Fourier transforms, the convolution is a circular convolution.
There are definitely meaningful uses for doing a pointwise block multiply in the time domain (for example, applying a data window to a signal before converting to the frequency domain, or modulating a carrier), so you can say that it is meaningful to do the convolution in the frequency domain. But it is unlikely to be efficient, compared to just doing the operation in the time domain.
Note that a LOT of effort has been spent over the years on optimizing Fourier transforms, precisely because it is more efficient to do a block multiply in the frequency domain (which is O(n)) than to do a convolution in the time domain (which is O(n^2)). Since the Fourier transform is O(n log(n)), the combination forwardTransform-blockMultiply-inverseTransform is usually faster than doing a convolution. This holds in the other direction too, so if you start with frequency data, an inverseTransform-blockMultiply-forwardTransform will usually be faster than doing a convolution in the frequency domain. And, of course, usually you already have the original time data somewhere, so the block multiply in the time domain would then be even faster.
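To make the forwardTransform-blockMultiply-inverseTransform pattern concrete, here is a small numpy sketch (my own illustration, assuming equal-length signals so that the result is a circular convolution):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(256)   # signal
h = rng.standard_normal(256)   # filter (same length, so the result is circular)
n = len(x)

# Direct circular convolution, O(n^2): y[i] = sum_k x[k] * h[(i - k) mod n]
y_direct = np.array([sum(x[k] * h[(i - k) % n] for k in range(n)) for i in range(n)])

# FFT route, O(n log n): forward transform, pointwise block multiply, inverse transform
y_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

print(np.allclose(y_direct, y_fft))  # True, up to floating-point error
```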
Unfortunately, I don't know of applications that increase the size of the time domain off the top of my head. And I don't know anything about deep learning, so I can't help you there.
I'm trying to understand the technical part of Latent Dirichlet Allocation (LDA), but I have a few questions on my mind:
First: Why do we need to add alpha and gamma every time we sample using the equation below? What if we deleted alpha and gamma from the equation? Would it still be possible to get the result?
Second: In LDA, we randomly assign a topic to every word in the document. Then, we try to optimize the topic by observing the data. Where is the part which is related to posterior inference in the equation above?
If you look at the inference derivation on the Wiki, alpha and beta are introduced simply because theta and phi are each drawn from a Dirichlet distribution uniquely determined by one of them. The reason for choosing the Dirichlet distribution as the prior (e.g. P(phi|beta)) is mainly to make the math tractable by exploiting the nice form of the conjugate prior (here, the Dirichlet and categorical distributions; the categorical distribution is a special case of the multinomial distribution where n is set to one, i.e. only one trial). The Dirichlet distribution also lets us "inject" our belief that the doc-topic and topic-word distributions are concentrated on a few topics and words per document or topic (if we set low hyperparameters). If you remove alpha and beta, I am not sure how it would work.
Posterior inference is replaced with joint-probability inference, at least in Gibbs sampling: you need the joint probability while picking one dimension to "transform the state", as the Metropolis-Hastings paradigm does. The formula you put here is essentially derived from the joint probability P(w, z). I would refer you to the book Monte Carlo Statistical Methods (by Robert) to fully understand why the inference works.
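Since the exact equation from the question isn't shown here, the role of alpha and beta can be illustrated with the standard collapsed Gibbs update for LDA (a sketch with assumed count matrices and names, not necessarily the formula the question refers to):

```python
import numpy as np

def sample_topic(w, d, n_dk, n_kw, n_k, alpha, beta, V):
    """One collapsed Gibbs step for word w in document d.

    The count arrays are assumed to already exclude the word's current
    topic assignment:
      n_dk[d, k]: words in document d assigned to topic k
      n_kw[k, w]: times word w is assigned to topic k across the corpus
      n_k[k]:     total words assigned to topic k
      V:          vocabulary size
    """
    # alpha smooths the doc-topic counts and beta the topic-word counts;
    # without them, a topic whose count is zero for this document or word
    # would get probability exactly zero and could never be sampled again.
    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
    p /= p.sum()
    return np.random.choice(len(p), p=p)
```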
I am training a CNN model for a regression task on normally distributed data.
Most of the data points take values between 0.4 and 0.6. Will the network learn the features of data points with values below 0.4 or above 0.6, which are less well represented?
I also don't want to make the distribution uniform as I want the network to learn the distribution of training data.
The model learns whatever you train it with. If you train more with some particular type of data point, the model will learn it well. In your case, the model is likely to perform well on data points lying in the range 0.4 to 0.6. It is much less likely that the model will perform well on data points lying at the tails of the normal curve.
To make the model learn the points at the tails of the normal curve well, you need to augment the data set to balance it. Another thing you can do is use a weighted loss for the data points in the tail region.
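One way to implement the weighted-loss idea in Keras is via per-sample weights (a rough sketch; the binning scheme is illustrative, and `model`, `x_train`, and `y_train` are assumed to exist):

```python
import numpy as np

# Weight each sample inversely to how common its target value is, so the
# tails of the distribution contribute more to the loss.
counts, edges = np.histogram(y_train, bins=20)
bin_idx = np.clip(np.digitize(y_train, edges[1:-1]), 0, len(counts) - 1)
sample_weight = 1.0 / (counts[bin_idx] + 1e-6)
sample_weight *= len(y_train) / sample_weight.sum()   # normalize to mean weight 1

# Keras supports per-sample loss weights directly in fit()
model.fit(x_train, y_train, sample_weight=sample_weight, epochs=10)
```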
The following is how I understand the point of parameter sharing in RNNs:
In regular feed-forward neural networks, every input unit is assigned its own parameters, which means that the number of input units (features) determines the number of parameters to learn. In processing e.g. image data, the number of input units is the same across all training examples (usually a constant width * height * number of RGB channels).
However, sequential input data like sentences can come in highly varying lengths, which means that the number of parameters will not be the same depending on which example sentence is processed. That is why parameter sharing is necessary for efficiently processing sequential data: it makes sure that the model always has the same input size regardless of the sequence length, as it is specified in terms of transition from one state to another. It is thus possible to use the same transition function with the same weights (input to hidden weights, hidden to output weights, hidden to hidden weights) at every time step. The big advantage is that it allows generalization to sequence lengths that did not appear in the training set.
My questions are:
Is my understanding of RNNs, as summarized above, correct?
In the actual Keras code example I looked at for LSTMs, they padded the sentences to equal length beforehand. Doesn't this wash away the whole purpose of parameter sharing in RNNs?
Parameter Sharing
Being able to efficiently process sequences of varying length is not the only advantage of parameter sharing. As you said, you can achieve that with padding. The main purpose of parameter sharing is a reduction in the number of parameters that the model has to learn. This is the whole purpose of using an RNN.
If you learned a different network for each time step and fed the output of the first model to the second, etc., you would end up with a regular feed-forward network. For 20 time steps, you would have 20 models to learn. In convolutional nets, parameters are shared by the convolutional filters because we can assume that similar interesting patterns occur in different regions of the picture (for example, a simple edge). This drastically reduces the number of parameters we have to learn. Analogously, in sequence learning we can often assume that similar patterns occur at different time steps. Compare 'Yesterday I ate an apple' and 'I ate an apple yesterday'. These two sentences mean the same thing, but the 'I ate an apple' part occurs at different time steps. By sharing parameters, you only have to learn what that part means once. Otherwise, you'd have to learn it for every time step at which it could occur in your model.
There is a drawback to sharing the parameters. Because our model applies the same transformation to the input at every time step, it now has to learn a transformation that makes sense for all time steps. So it has to remember which word came at which time step, i.e. 'chocolate milk' should not lead to the same hidden and memory state as 'milk chocolate'. But this drawback is small compared to using a large feed-forward network.
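To make "the same transition function with the same weights at every time step" concrete, here is a bare-bones numpy sketch of unrolling a vanilla RNN (shapes and names are illustrative, not from any particular library):

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """Unroll a vanilla RNN over a sequence of arbitrary length.

    x_seq: array of shape (T, input_dim); T may differ between examples.
    The same three weight matrices are reused at every time step, so the
    number of parameters does not depend on T.
    """
    h = np.zeros(W_hh.shape[0])
    outputs = []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # shared transition weights
        outputs.append(W_hy @ h + b_y)             # shared readout weights
    return np.array(outputs), h
```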
Padding
As for padding the sequences: the main purpose is not directly to let the model predict sequences of varying length. Like you said, this can be done by using parameter sharing. Padding is used for efficient training - specifically, to keep the number of computational graphs needed during training small. Without padding, we have two options for training:
We unroll the model for each training sample. So, when we have a sequence of length 7, we unroll the model to 7 time steps, feed the sequence, do back-propagation through the 7 time steps and update the parameters. This seems intuitive in theory. But in practice it is inefficient, because TensorFlow's computational graphs don't allow recurrence - they are feed-forward - so the model would have to be re-unrolled into a new graph for every training sample.
The other option is to create the computational graphs before starting training. We let them share the same weights and create one computational graph for every sequence length in our training data. But when our dataset has 30 different sequence lengths this means 30 different graphs during training, so for large models, this is not feasible.
This is why we need padding. We pad all sequences to the same length and then only need to construct one computational graph before starting training. When you have both very short and very long sequence lengths (5 and 100, for example), you can use bucketing and padding. This means you pad the sequences to different bucket lengths, for example [5, 20, 50, 100], and then create a computational graph for each bucket. The advantage is that you don't have to pad a sequence of length 5 to 100, which would waste a lot of time "learning" the 95 padding tokens in there.
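In Keras, the padding step typically looks like the sketch below (my own minimal example; the token ids are made up, and the masking flag tells downstream layers to skip the padded positions):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sequences of varying length (lists of token ids, made up for illustration)
sequences = [[5, 20, 9], [7, 3], [11, 2, 8, 6, 4]]

# Pad everything to the length of the longest sequence so that a single
# computational graph can process the whole batch.
padded = pad_sequences(sequences, padding='post')   # shape (3, 5), zeros appended

model = models.Sequential([
    layers.Embedding(input_dim=1000, output_dim=16, mask_zero=True),  # 0 = padding id
    layers.LSTM(32),        # the same LSTM weights are applied at every time step
    layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')
```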
In the implementation of the DenseNet model in the CheXNet paper, section 3.1 mentions:
Before inputting the images into the network, we downscale the images to 224x224 and normalize based on the mean and standard deviation of images in the ImageNet training set.
Why would we want to normalize a new set of images with the mean and std of a different dataset?
How do we get the mean and std of ImageNet dataset? Is it provided somewhere?
Subtracting the mean centers the input to 0, and dividing by the standard deviation makes any scaled feature value the number of standard deviations away from the mean.
Consider how a neural network learns its weights. C(NN)s learn by continually adding gradient error vectors (multiplied by a learning rate) computed from backpropagation to various weight matrices throughout the network as training examples are passed through.
The thing to notice here is the "multiplied by a learning rate".
If we didn't scale our input training vectors, the ranges of our distributions of feature values would likely be different for each feature, and thus the learning rate would cause corrections in each dimension that differ (proportionally speaking) from one another. We might be overcompensating the correction in one weight dimension while undercompensating in another.
This is non-ideal, as we might find ourselves in an oscillating state (unable to settle into a better minimum in cost(weights) space) or a slow-moving state (traveling too slowly to reach a better minimum).
Original Post: https://stats.stackexchange.com/questions/185853/why-do-we-need-to-normalize-the-images-before-we-put-them-into-cnn
They used mean and std dev of the ImageNet training set because the weights of their model were pretrained on ImageNet (see Model Architecture and Training section of the paper).
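For the second part of the question: the per-channel ImageNet statistics are published alongside the common pretrained models; for example, the values documented by torchvision (for RGB images scaled to [0, 1]) are the ones sketched below, and the normalization itself is just a per-channel standardization:

```python
import numpy as np

# RGB mean and std of the ImageNet training set, as published by torchvision
# for images scaled to [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize(image):
    """image: float array of shape (H, W, 3) with values in [0, 1]."""
    return (image - IMAGENET_MEAN) / IMAGENET_STD
```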