pytorch layer input, output shape calculation - deep-learning

Can anyone help me understand: when I use Conv1d and then a Linear layer, what will be the inputs of the Linear layer? How do I calculate how many input features I have to pass in PyTorch?

In PyTorch, Linear layers operate on only the last dimension of the input tensor: [*, features_in] -> [*, features_out].
However, Conv1d layers consider the last two dimensions of the input tensor: [batches, channels_in, length_in] -> [batches, channels_out, length_out].
Therefore, if no pre-processing is used, a Linear layer will only act on the signal of each channel independently, i.e., [batches, channels_in, features_in] -> [batches, channels_in, features_out]. This behavior is rarely desired, so people usually flatten tensors before passing them to a Linear layer, for example with x.view(n_batches, -1).
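Here is a rough sketch of that shape bookkeeping (the layer sizes below are arbitrary assumptions, not taken from the question):

```python
import torch
import torch.nn as nn

# Rough sketch of the shape bookkeeping; all sizes here are made up for illustration.
batches, channels_in, length_in = 8, 4, 100
conv = nn.Conv1d(in_channels=channels_in, out_channels=16, kernel_size=5)  # no padding, stride 1
linear = nn.Linear(in_features=16 * 96, out_features=10)  # length_out = 100 - 5 + 1 = 96

x = torch.randn(batches, channels_in, length_in)   # [8, 4, 100]
y = conv(x)                                        # [8, 16, 96]
flat = y.view(y.size(0), -1)                       # [8, 16 * 96] = [8, 1536]
out = linear(flat)                                 # [8, 10]
print(y.shape, flat.shape, out.shape)
```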
The behavior you need depends on the details of your application. Good luck,
Sources:
https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html (Conv1d)
https://pytorch.org/docs/stable/generated/torch.nn.Linear.html (Linear)

Related

Cost of back-propagation for subset of DNN parameters

I am using PyTorch to evaluate gradients of a feed-forward network, but only for a subset of parameters, related to the first two layers.
Since backpropagation is carried out backwards layer by layer, I wonder: why is it computationally faster than evaluating the gradients of the whole network?
PyTorch builds a computation graph for backward propagation that contains only the minimum nodes and edges needed to compute the accumulated gradients of the leaves that require a gradient. Even though the first two layers require gradients, there are many tensors (intermediate tensors or frozen parameter tensors) that are unused and are therefore cut from the backward graph. In addition, the built-in function AccumulatedGradient, which stores the gradients in the .grad attribute, is called fewer times, which also reduces the total computation time.
Consider, for example, an AddBackward node where A is an intermediate tensor computed from the first two layers and B comes from the 3rd (frozen) layer: the branch through B can be ignored.
Another example: a matrix-matrix product (MmBackward node) that uses an intermediate tensor that does not depend on the first two layers. In this case the tensor itself is required to compute the backward pass, but the "previous" tensors that were used to compute it can be ignored in the graph.
To visualize the sub-graph that is actually computed (and compare when the model is unfrozen), you can use torchviz.
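A minimal sketch of that setup, assuming a small Sequential model invented here for illustration: only the first two Linear layers keep requires_grad=True, and after backward() only their parameters end up with a .grad.

```python
import torch
import torch.nn as nn

# Minimal sketch: only the first two Linear layers require gradients, so autograd
# prunes the frozen branches from the backward graph and accumulates fewer .grad tensors.
model = nn.Sequential(
    nn.Linear(10, 20), nn.ReLU(),
    nn.Linear(20, 20), nn.ReLU(),
    nn.Linear(20, 5),
)

# Freeze everything except the first two Linear layers (indices 0 and 2).
for i, layer in enumerate(model):
    for p in layer.parameters():
        p.requires_grad = i in (0, 2)

x = torch.randn(4, 10)
loss = model(x).sum()
loss.backward()

# .grad is populated only for the unfrozen parameters.
print([name for name, p in model.named_parameters() if p.grad is not None])
```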

Can a neural network having non-linear activation function (say ReLU) be used for linear classification task?

I think the answer would be yes, but I'm unable to reason out a good explanation on this.
The mathematical argument lies in the network's ability to represent linearity; we can use the following three lemmas to show that:
Lemma 1
With affine transformations (linear layers) we can map the input hypercube [0,1]^d into an arbitrarily small box [a,b]^k. The proof is quite simple: just make all the biases equal to a, and scale the weights by (b-a).
Lemma 2
At a sufficiently small scale, many non-linearities are approximately linear; this is essentially the definition of the derivative, or a first-order Taylor expansion. In particular, take relu(x): for x > 0 it is in fact linear! What about the sigmoid? If we look at a tiny region [-eps, eps], you can see that it approaches a linear function as eps -> 0.
Lemma 3
Composition of affine functions is affine. In other words, a neural network made of multiple linear layers is equivalent to one with just a single linear layer. This comes from the rules of matrix composition:
W2(W1 x + b1) + b2 = W2 W1 x + W2 b1 + b2 = (W2 W1) x + (W2 b1 + b2), where W2 W1 is the new weight matrix and W2 b1 + b2 is the new bias.
Combining the above
Composing the three lemmas above, we see that with a non-linear layer there always exists an arbitrarily good approximation of a linear function. We simply use the first layer to map the entire input space into the tiny part of the pre-activation space where the non-linearity is approximately linear, and then we "map it back" in the following layer.
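A small numerical sketch of that argument, with a made-up target linear map f(x) = x W^T, an arbitrarily chosen epsilon, and weights set by hand rather than learned:

```python
import torch
import torch.nn as nn

# Sketch of the lemmas: squeeze the input box into the region where ReLU is
# exactly linear (x > 0), then undo the squeeze in the second layer.
target_W = torch.tensor([[2.0, -1.0]])     # made-up target linear map f(x) = x @ target_W.T
eps = 1e-2                                 # arbitrary small scale

layer1 = nn.Linear(2, 2)
layer2 = nn.Linear(2, 1)
with torch.no_grad():
    layer1.weight.copy_(eps * torch.eye(2))            # Lemma 1: shrink the input box...
    layer1.bias.fill_(1.0)                             # ...and shift it into (0, +inf)
    layer2.weight.copy_(target_W / eps)                # Lemma 2/3: ReLU is the identity there,
    layer2.bias.copy_(-(target_W / eps) @ layer1.bias) # so invert the shrink/shift exactly

x = torch.rand(5, 2) * 2 - 1               # random points in [-1, 1]^2
net_out = layer2(torch.relu(layer1(x)))
print(torch.allclose(net_out, x @ target_W.T, atol=1e-3))  # True
```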
General case
This is a very simple proof; more generally, you can use the Universal Approximation Theorem to show that a sufficiently large non-linear neural network (sigmoid, ReLU, many others) can approximate any smooth target function, which includes linear ones. That proof (originally given by Cybenko) is, however, much more complex and relies on showing that specific classes of functions are dense in the space of continuous functions.
Technically, yes.
The reason you could use a non-linear activation function for this task is that you can manually alter the results. Let's say the activation function's output range is 0.0-1.0; then you can round up or down to get a binary 0/1. Just to be clear, rounding up or down isn't a linear activation, but for this specific question the purpose of the network is classification, where some kind of rounding has to be applied anyway.
The reason you shouldn't is the same reason that you shouldn't attach an industrial heater to a fan and call it a hair-drier, it's unnecessarily powerful and it could potentially waste resources and time.
I hope this answer helped, have a good day!

values in torch.nn.conv2d and torch.nn.Linear

I am confused about how to choose the out_channels in torch.nn.Conv2d and the in_features, out_features in torch.nn.Linear.
For example, I have a grayscale 28*28 image input, with in_channels = 1, kernel_size = 5, padding = 2. How can I figure out the out_channels?
After the convolution, I want to add a linear layer.
How do I figure out the values of in_features and out_features?
The choice of out_channels is up to you: it's the number of filters you want your convolutional layer to compute. The higher this number is, the heavier the layer will be, but on the other hand the more features it will (theoretically) be able to learn.
After going through the convolution (assuming out_channels = C), your data will have shape (C, 28, 28). In other words, one sample contains 28*28*C numbers/dimensions. It is this number that you need to pass as in_features to the following linear layer. Then again, out_features is up to you.
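A minimal sketch of that calculation, where out_channels = 8 and the 10 output classes are arbitrary choices, not given in the question:

```python
import torch
import torch.nn as nn

# Sketch of the shape calculation above; C = 8 and the 10 output classes are arbitrary choices.
C = 8
conv = nn.Conv2d(in_channels=1, out_channels=C, kernel_size=5, padding=2)
fc = nn.Linear(in_features=C * 28 * 28, out_features=10)

x = torch.randn(16, 1, 28, 28)        # a batch of 16 grayscale 28x28 images
h = conv(x)                           # [16, C, 28, 28] (padding=2 keeps the 28x28 size)
out = fc(h.view(h.size(0), -1))       # flatten to [16, C*28*28], then map to [16, 10]
print(h.shape, out.shape)
```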
I strongly suggest that you read and learn about neural networks (and their typical convolutional and linear layers) before attempting to use them to make magic happen. Without the required knowledge about them, you will at best be able to produce results that you don't really understand, and at worst issues that you don't know how to fix. It takes time to learn, but it really is worth it.

Question about dimensions when processing lists with a multi layer perceptron

I'm quite new to PyTorch and I'm trying to build a net that is composed only of linear layers that will get a list of objects as input and output some score (which is a scalar) for each object. I'm wondering if my input tensor's dimensions should be (batch_size, list_size, object_size) or should I flatten each list and get (batch_size, list_size*object_size)? According to my understanding, in the first option I will have an output dimension of (batch_size, list_size, 1) and in the second (batch_size, list_size), does it matter? I read the documentation but it still wasn't very clear to me.
If you want to do the classification for each object in your input, you should keep the objects separate from each other; i.e., your input should be in the shape of (batch_size, list_size, object_size). Then considering the number of classes you got (let's say m classes), the linear layer would transform the input to the shape of (batch_size, list_size, m). In this case, you will have m scores for each object which can be utilized to predict the class label.
But a question arises now: why do we flatten in neural networks at all? The answer is simple: because you want to couple the whole information within a single sample (in your specific case, the information pieces are the objects) to see if the pieces somehow affect each other, and if that's the case, to examine whether your network is able to learn these features/patterns. In practice, considering the nature of your problem and the data you are working with, if different objects really relate to each other, then your network will be able to learn those relationships.
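A short sketch contrasting the two layouts, with made-up sizes:

```python
import torch
import torch.nn as nn

# Contrast of the two layouts discussed above; all sizes are made up.
batch_size, list_size, object_size = 4, 6, 10
x = torch.randn(batch_size, list_size, object_size)

# Option 1: score each object independently (Linear acts on the last dimension).
per_object = nn.Linear(object_size, 1)
print(per_object(x).shape)                          # [4, 6, 1] -> one score per object

# Option 2: flatten, so every score can depend on the whole list.
coupled = nn.Linear(list_size * object_size, list_size)
print(coupled(x.view(batch_size, -1)).shape)        # [4, 6]
```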

what data sets can rectifier linear unit classify?

I am trying to look at the possible activation functions for a deep network for speaker recognition. I will have an input and a label (0s and 1s) as an output. I was wondering if the rectified linear unit (ReLU) can be used with any type of output or just a specific one? Thank you
Activation functions are normally only used in the hidden layers. Output units almost always have a linear activation function (i.e., the identity, or no activation function). Rectified units are used in the hidden layers because their gradient is much simpler than that of their sigmoidal counterparts, which allows for better training with many layers.
You mentioned your output has labels that are either 0s or 1s. Is this an output vector with N outputs that are each either 0 or 1, or do you mean that you have only 2 classes (0 or 1)? If you want to do classification (getting the network to output either class 0 or class 1), you would use a Softmax activation on the output layer. Softmax scales the outputs into a probability distribution over the network's predicted classes.
Let me know more information and I will see if I can help more.
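For concreteness, a minimal sketch of that setup (ReLU in the hidden layers, a two-class softmax at the output); the input and hidden sizes are assumptions, not from the question:

```python
import torch
import torch.nn as nn

# Minimal sketch: ReLU in the hidden layers, softmax over the two classes (0/1) at the output.
# The input size of 40 and the hidden sizes are arbitrary.
model = nn.Sequential(
    nn.Linear(40, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),                 # raw scores (logits) for the two classes
)

x = torch.randn(8, 40)                # a batch of 8 feature vectors
probs = torch.softmax(model(x), dim=1)
pred = probs.argmax(dim=1)            # predicted class label: 0 or 1
print(probs.shape, pred)
```

For training, one would typically feed the raw logits to nn.CrossEntropyLoss, which applies the softmax internally.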