I am confused about how to choose out_channels in torch.nn.Conv2d and in_features / out_features in torch.nn.Linear.
For example, I have a grayscale 28x28 image as input,
with in_channels = 1, kernel_size = 5, padding = 2. How can I figure out the out_channels?
After the convolutional layer, I want to add a linear layer.
How do I figure out the values of in_features and out_features?
The choice of out_channels is up to you: it is the number of filters you want your convolutional layer to compute. The higher this number, the heavier the layer will be, but on the other hand the more features it will (theoretically) be able to learn.
After going through the convolution (assuming out_channels = C, and noting that kernel_size=5 with padding=2 preserves the 28x28 spatial size), your data will have shape (C, 28, 28). In other words, one sample contains 28*28*C numbers / dimensions. It is this number that you need to pass as in_features to the following linear layer. Then again, out_features is up to you.
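For concreteness, here is a minimal PyTorch sketch of that wiring; out_channels = 16 and out_features = 10 are arbitrary illustrative choices, not values from your setup:

import torch
import torch.nn as nn

C = 16  # out_channels: a free design choice

# kernel_size=5 with padding=2 keeps the spatial size at 28x28
conv = nn.Conv2d(in_channels=1, out_channels=C, kernel_size=5, padding=2)
fc = nn.Linear(in_features=C * 28 * 28, out_features=10)  # out_features: also your choice

x = torch.randn(8, 1, 28, 28)   # a batch of 8 grayscale 28x28 images
h = conv(x)                     # shape: (8, C, 28, 28)
h = h.flatten(start_dim=1)      # shape: (8, C*28*28)
out = fc(h)                     # shape: (8, 10)
print(out.shape)                # torch.Size([8, 10])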
I strongly suggest that you read and learn about neural networks (and their typical convolutional and linear layers) before attempting to use them to make magic happen. Without the required knowledge, you will at best be able to produce results that you don't really understand, and at worst run into issues that you don't know how to fix. It takes time to learn, but it really is worth it.
I have trained a CNN model whose forward pass looks like:
*Part1*: a learnable preprocess
*Part2*: Mixup, which does not need to calculate gradients
*Part3*: a CNN backbone and classifier head
Both Part1 and Part3 need gradients and weight updates during back-propagation, but Part2 is just a simple mixup that doesn't need gradients, so I wrapped this Mixup in torch.no_grad() to save computational resources and speed up training. It did indeed speed up my training a lot, but the model's prediction accuracy dropped a lot.
I'm wondering: if Mixup does not need to calculate gradients, why does wrapping it in torch.no_grad() hurt the model's ability so much? Is it due to loss of the learned weights of Part1, or something like breaking the chain between Part1 and Part2?
Edit:
Thanks @Ivan for your reply; it sounds reasonable. I had the same thought but didn't know how to prove it.
In my experiment, when I apply torch.no_grad() to Part2, GPU memory consumption drops a lot and training is much faster, so I guess Part2 still needs gradients even though it has no learnable parameters.
So can we conclude that torch.no_grad() should not be applied between two or more learnable blocks, since otherwise it would destroy the learning ability of the blocks before the no_grad() part?
but Part2 is just a simple mixup and doesn't need gradients
It actually does! In order for the gradients to flow and backpropagate successfully to Part1 of your model (which is learnable, according to you), you need to compute the gradients on Part2 as well, even though there are no learnable parameters in Part2.
What I assume happened when you applied torch.no_grad() to Part2 is that only Part3 of your model was able to learn, while Part1 stayed untouched.
Edit
So can we conclude that torch.no_grad() should not be applied between two or more learnable blocks, since otherwise it would destroy the learning ability of the blocks before the no_grad() part?
The reasoning is simple: to compute the gradient on Part1 you need to compute the gradients on the intermediate results, irrespective of the fact that you won't use those gradients to update the tensors in Part2. So indeed, you are correct.
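Here is a minimal PyTorch sketch of this effect; the three parts are hypothetical stand-ins, with a parameter-free scaling playing the role of the mixup:

import torch
import torch.nn as nn

part1 = nn.Linear(4, 4)   # learnable "preprocess"
part3 = nn.Linear(4, 1)   # learnable "head"

x = torch.randn(2, 4)
h = part1(x)
with torch.no_grad():
    h = h * 0.5 + 0.1     # parameter-free middle stage; the graph is cut here
out = part3(h)
out.sum().backward()

print(part1.weight.grad)  # None: no gradient ever reaches part1
print(part3.weight.grad)  # a tensor: part3 still learns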
I'm a beginner in CNN deep learning. I know the basic concept: we use some filters to generate a set of feature maps from an image, and we activate them using a non-linear function like ReLU before downsampling. We keep doing this until the feature maps become very small. Then we flatten them and use a fully connected network to compute the category, and we use back-propagation to learn all the parameters. One thing I don't understand is that when we do Conv2D, we create many filters (channels) from an image, as in this sample code:
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)))
I understand this is to generate as many features as possible. But how are these filters trained to detect different features from one image? If all of them are initialized with the same value (like 0), shouldn't they end up detecting the same feature? Are we giving them random values during initialization so that each can find its own local minimum of the loss using gradient descent?
If you initialize all filters with the same value, then you are right: they will learn the same thing. That's why we never initialize with the same value. We initialize each kernel with random values (usually zero mean and some small variance).
There are many methods for finding a good initialization for your network. One of the most famous and widely used is Xavier initialization.
Adding to what has been discussed: the weights in a conv layer learn the same way weights in an FC layer do, through backpropagation with some optimization algorithm (GD, Adam, RMSprop, etc.). Ending up in a bad local optimum is very unlikely in big networks, since a point being a local optimum for all the weights at once becomes ever less likely as the number of weights increases. If the weights are initialized with zeros, the gradients in each update become identical and the hidden units in a layer stay identical, hence they learn the same features. So we use random initialization with zero mean and variance inversely proportional to the number of units in the previous layer (e.g. Xavier). A small sketch of both points follows.
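A minimal numpy sketch: the filter shape matches the Conv2D(32, (3, 3)) layer above (3 input channels), and the bound follows the usual Glorot/Xavier uniform formula; both are illustrative assumptions:

import numpy as np

fan_in = 3 * 3 * 3        # kernel height * width * input channels
fan_out = 3 * 3 * 32      # kernel height * width * output filters
limit = np.sqrt(6.0 / (fan_in + fan_out))   # Glorot/Xavier uniform bound

zero_filters = np.zeros((32, 3, 3, 3))
random_filters = np.random.uniform(-limit, limit, size=(32, 3, 3, 3))

# Identical filters receive identical gradients and can never diverge:
print(np.allclose(zero_filters[0], zero_filters[1]))      # True
print(np.allclose(random_filters[0], random_filters[1]))  # False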
I want to predict the trajectory of a falling ball. That trajectory is parabolic. I know that an LSTM may be overkill for this (i.e. a simpler method could suffice).
I thought we could do this with 2 LSTM layers and a Dense layer at the end.
The end result I want is to give the model 3 heights h0, h1, h2 and let it predict h3. Then, I want to give it h1, h2, and the h3 it output previously to predict h4, and so on, until I can predict the whole trajectory.
Firstly, what would the input shape be for the first LSTM layer?
Would it be input_shape = (3, 1)?
Secondly, would the LSTM be able to predict a parabolic path?
I am getting almost a flat line, not a parabola, and I want to rule out the possibility that I am misunderstanding how to feed and shape the input.
Thank you
The input shape is in the form (samples, timeSteps, features).
Your only feature is "height", so features = 1.
And since you're going to input sequences with different lengths, you can use timeSteps = None.
So, your input_shape could be (None, 1).
Since we're going to use a stateful=True layer below, we can use batch_input_shape=(1, None, 1); the first dimension is the number of "samples" per batch, and you can choose the amount you want (here, 1).
Your model can indeed predict the trajectory, but it may need more than one layer. (The exact answer about how many layers and cells depends on understanding how the math inside an LSTM works.)
Training:
Now, first you need to train your network (only then will it be able to start predicting good things).
For training, suppose you have a sequence [h1, h2, h3, h4, h5, h6, ...] of true values in the correct order. (I suggest you actually use many sequences (samples), so your model learns better.)
For this sequence, you want an output that predicts the next step, so your target would be [h2, h3, h4, h5, h6, h7, ...].
So, suppose you have a data array with shape (manySequences, steps, 1); then you make:
x_train = data[:,:-1,:]
y_train = data[:,1:,:]
Now, your layers should use return_sequences=True (every input step produces an output step), and you train the model with this data.
At this point, whether you're using stateful=True or stateful=False is not very relevant. (But if True, you always need model.reset_states() before every single epoch and sequence.)
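A minimal sketch of this training setup (32 cells per LSTM layer is an arbitrary choice, and the data array is just a placeholder with the right shape):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    LSTM(32, return_sequences=True, stateful=True, batch_input_shape=(1, None, 1)),
    LSTM(32, return_sequences=True, stateful=True),
    Dense(1),
])
model.compile(optimizer='adam', loss='mse')

data = np.random.rand(5, 50, 1)   # placeholder: (manySequences, steps, 1)
x_train = data[:, :-1, :]
y_train = data[:, 1:, :]

for epoch in range(10):
    for i in range(len(x_train)):
        model.reset_states()      # start each sequence with "speed" = 0
        model.train_on_batch(x_train[i:i + 1], y_train[i:i + 1])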
Predicting:
For predicting, you can use stateful=True in the model. This means that when you input h1, it will produce h2; and when you input h2, it will remember the "current speed" (the state of the model) to predict the correct h3.
(In the training phase this is not important, because you're inputting entire sequences at once, so the speed is picked up between steps of the long sequences.)
You can think of the method reset_states() as set_current_speed_to(0). You use it whenever the step you're about to input is the first step of a sequence.
Then you can do loops like this:
model.reset_states()                    # make speed = 0
nextH = someValueWithShape((1, 1, 1))   # the first height, shaped (1, 1, 1)
predictions = [nextH]
for i in range(steps):
    nextH = model.predict(nextH)        # each prediction becomes the next input
    predictions.append(nextH)
There is an example here, but it uses two features. The difference is that I use two models, one for training and one for predicting, but you can use only one with return_sequences=True and stateful=True (don't forget to call reset_states() at the beginning of every epoch during training).
Hello Guys,
I am currently working on an autoencoder reducing some simple 2D data to 1D. The architecture is 2 - 10 - 1 - 10 - 2 neurons per layer. As activation function I use the sigmoid in every layer but the output layer, where I use the identity.
I am using the Accord.NET Framework to build it.
I am pre-training the autoencoder with RBMs and the CD algorithm, where I can change the initial weights, the learning rate, the momentum, and the weight decay.
The fine-tuning is done by backpropagation, where I can configure the learning rate and the momentum.
The data is an artificially created shape, marked green in the picture:
[Image: data + reconstruction]
The reconstruction produced by the autoencoder is the yellow line, which leads to my problem: somehow the encoder is not able to produce a non-linear shape as output.
Although I tested around a lot and changed the values a dozen times, I am not getting better results. Maybe someone here has an idea how I could find the problem.
Thanks!
Look, in general a neural network builds a linear representation of your features against the output, so what the net is actually doing (considering two features) is [w1*x1 + w2*x2 = output].
What you need to do to achieve a non-linear representation is to use extra feature(s) that are non-linear transformations of the old feature(s). For example, use x1^2 as an extra feature, or x2^2, or both. The net will then compute the global equation [w1*x1 + w2*x2 + w3*x1^2 = output], which is non-linear in nature, and then you can have a non-linear representation.
The choice of the extra-feature equation depends mainly on your data. I used a quadratic equation in my example, but it is not always the correct thing to do. Referring to your data, I think you need a cos(x) or sin(x) representation.
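A minimal numpy sketch of this feature-augmentation idea; the sin/cos choice follows the suggestion above, and the data shapes are purely illustrative:

import numpy as np

x = np.random.randn(100, 2)   # original features x1, x2
extra = np.column_stack([np.sin(x[:, 0]), np.cos(x[:, 1])])  # non-linear transforms
x_augmented = np.hstack([x, extra])   # now 4 inputs per sample
print(x_augmented.shape)              # (100, 4)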
I was wondering which machine learning technique is best to approximate a function that takes a 32-bit number and returns another 32-bit number, given a set of observations.
Thanks!
Multilayer perceptron neural networks would be worth taking a look at. Note, though, that you'll need to scale the inputs to floating point numbers between 0 and 1, and then map the outputs back to the original range.
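A minimal sketch of that scaling step, assuming unsigned 32-bit values (the function names here are hypothetical):

MAX32 = 2**32 - 1

def to_unit(v):
    # map a 32-bit unsigned int into [0, 1] for the network input
    return v / MAX32

def from_unit(u):
    # map a network output in [0, 1] back to a 32-bit unsigned int
    return int(round(u * MAX32)) & 0xFFFFFFFF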
There are several possible solutions to your problem:
1.) Fitting a linear hypothesis with the least-squares method
In this case, you approximate the hypothesis y = ax + b with the least-squares method. This one is really easy to implement, but sometimes a linear model is not good enough to fit your data. Still, I would give this one a try first.
The good thing is that there is a closed form, so you can directly calculate the parameters a and b from your data.
See Least Squares
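A minimal numpy sketch of that closed-form fit (the data values are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

A = np.column_stack([x, np.ones_like(x)])       # design matrix for y = a*x + b
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares solution
print(a, b)                                     # fitted slope and intercept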
2.) Fitting a non-linear model
Once you have seen that your linear model does not describe your function very well, you can try to fit higher-order polynomial models to your data.
Your hypothesis might then look like
y = ax² + bx + c
y = ax³ + bx² + cx + d
etc.
You can also use the least-squares method to fit your data, or optimization techniques such as gradient descent or simulated annealing. See also this thread: Fitting polynomials to data. A small sketch follows.
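As a minimal sketch, numpy's polyfit performs exactly this kind of least-squares polynomial fit (the data here is a made-up noisy quadratic):

import numpy as np

x = np.linspace(0.0, 10.0, 50)
y = 0.5 * x**2 - 2.0 * x + 3.0 + np.random.randn(50) * 0.1  # noisy quadratic

coeffs = np.polyfit(x, y, deg=2)   # least-squares fit of y = a*x^2 + b*x + c
print(coeffs)                      # approximately [0.5, -2.0, 3.0]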
Or, as in the other answer, try fitting a neural network. The good thing is that it will learn the hypothesis automatically, but it is not so easy to explain what the relation between input and output is. In the end, though, a neural network is also just a linear combination of nonlinear functions (like sigmoid or tanh).