I need some clarity on how to correctly connect an embedding layer and an LSTM in PyTorch. For example, if I have only one feature, I send the embedding layer a tensor of integer indices of shape (batch_size, sequence_length), and after the embedding the LSTM receives (batch_size, sequence_length, embedding_size), right? (Assuming, of course, that we set batch_first=True.)
What I can't work out is what shape of tensor I need to send to the LSTM when I have several features and one embedding layer per feature.
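One common pattern (a minimal sketch with made-up vocabulary sizes and dimensions) is to give each feature its own nn.Embedding, concatenate the embedded outputs along the last dimension, and set the LSTM's input_size to the sum of the embedding dimensions. Note that nn.Embedding takes integer indices of shape (batch_size, sequence_length) directly, with no trailing feature dimension:

```python
import torch
import torch.nn as nn

# Hypothetical setup: two categorical features, each with its own embedding.
emb_a = nn.Embedding(num_embeddings=100, embedding_dim=8)
emb_b = nn.Embedding(num_embeddings=50, embedding_dim=4)
lstm = nn.LSTM(input_size=8 + 4, hidden_size=16, batch_first=True)

# Integer indices for each feature: (batch_size, sequence_length)
x_a = torch.randint(0, 100, (32, 10))
x_b = torch.randint(0, 50, (32, 10))

# Embed each feature, then concatenate along the feature dimension:
# (32, 10, 8) cat (32, 10, 4) -> (32, 10, 12)
x = torch.cat([emb_a(x_a), emb_b(x_b)], dim=-1)
output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([32, 10, 16])
```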
I would like to use the transformer architecture for anomaly detection on time series. I am wondering:
Can we slightly modify the architecture to create a bottleneck in the transformer network (similar to a fully connected autoencoder, or an AE with LSTMs)?
Does it actually make sense to try to do that?
I would like the transformer to learn to reconstruct the input sequence at its output, through some intermediate latent space of lower dimensionality (the bottleneck).
My idea was to reduce d_model (the number of variables in the time series, or the embedding dimension in NLP), but according to `torch.nn.Transformer` it must be the same size as the input series (see here).
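One possibility (a minimal sketch, not a confirmed approach): keep d_model fixed, as torch.nn.Transformer requires, and insert a per-timestep linear projection to a smaller latent dimension between an encoder stack and a decoder stack. The class name TransformerAE and all sizes below are made up, and the "decoder" here is a second encoder-only stack, since an autoencoder has no separate target sequence for cross-attention. Note this compresses each time step's representation rather than the sequence length:

```python
import torch
import torch.nn as nn

class TransformerAE(nn.Module):
    # d_model stays fixed throughout; the bottleneck is a per-timestep
    # linear projection down to latent_dim and back up to d_model.
    def __init__(self, n_features, d_model=64, latent_dim=8, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.down = nn.Linear(d_model, latent_dim)  # the bottleneck
        self.up = nn.Linear(latent_dim, d_model)
        dec_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers)
        self.head = nn.Linear(d_model, n_features)

    def forward(self, x):  # x: (batch, seq_len, n_features)
        z = self.down(self.encoder(self.embed(x)))  # (batch, seq_len, latent_dim)
        return self.head(self.decoder(self.up(z)))  # reconstructed sequence

model = TransformerAE(n_features=5)
x = torch.randn(16, 30, 5)
print(model(x).shape)  # torch.Size([16, 30, 5])
```

Training this with a reconstruction loss (e.g., MSE against the input) and flagging high-error windows as anomalies would mirror the usual LSTM-AE recipe.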
According to the documentation on pre-trained computer vision models for transfer learning (e.g., here), input images should come in "mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224".
However, when running transfer learning experiments on 3-channel images with height and width smaller than expected (e.g., smaller than 224), the networks generally run smoothly and often achieve decent performance.
Hence, it seems to me that the "minimum height and width" is somehow a convention and not a critical parameter. Am I missing something here?
There is a limit on your input size that corresponds to the receptive field of the last convolutional layer of the network. Intuitively, you can observe the spatial dimensionality decreasing as you progress through the network; at least, this is the case for feature-extractor CNNs, which aim to extract feature embeddings from the input image. That is, most pre-trained models, such as vanilla VGG and ResNet networks, do not retain spatial dimensionality. If the input to a convolutional layer is smaller than the kernel size (even after padding), you simply won't be able to perform the operation.
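As a quick illustration of that last point (sizes made up), a convolution cannot run when its padded input is smaller than its kernel:

```python
import torch
import torch.nn as nn

# A 7x7 convolution with no padding cannot be applied to a 5x5 input:
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=7)
try:
    conv(torch.randn(1, 3, 5, 5))
except RuntimeError as e:
    print(e)  # kernel size can't be greater than actual input size
```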
TL;DR: the adaptive pooling layer.
For example, the standard resnet50 model accepts inputs only in the range 193-225, and this is due to the architecture and its downscaling layers (see the links below).
The only reason the default PyTorch model works anyway is that it uses an adaptive pooling layer, which does not restrict the input size. So it's going to work, but you should be ready for performance decay and other fun things :)
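A quick shape check illustrating this (pretrained weights are not needed just to inspect shapes, so none are loaded): thanks to the adaptive average pooling layer before the final fully connected layer, resnet50 produces a fixed-size output across a range of input sizes.

```python
import torch
from torchvision import models

model = models.resnet50()  # random weights are fine for a shape check
model.eval()
with torch.no_grad():
    for size in (160, 224, 320):
        out = model(torch.randn(1, 3, size, size))
        print(size, out.shape)  # torch.Size([1, 1000]) for every input size
```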
Hope you will find these useful:
https://discuss.pytorch.org/t/how-can-torchvison-models-deal-with-image-whose-size-is-not-224-224/51077/3
What is adaptive average pooling, and how does it work?
https://pytorch.org/docs/stable/generated/torch.nn.AdaptiveAvgPool2d.html
https://github.com/pytorch/vision/blob/c187c2b12d86c3909e59a40dbe49555d85b98703/torchvision/models/resnet.py#L118
https://github.com/pytorch/vision/blob/c187c2b12d86c3909e59a40dbe49555d85b98703/torchvision/models/resnet.py#L151
https://developpaper.com/pytorch-implementation-examples-of-resnet50-resnet101-and-resnet152/
I want to get a feature vector out of an image by passing it through a pre-trained VGG-16. I used a pretrained ResNet50 to get a feature vector, and that worked perfectly. But when I use the same method to get a feature vector from the VGG-16 network, I don't get the 4096-d vector I assume I should get. I pieced the code together from a variety of sources, and it is as follows:
```python
import torch
import torch.nn as nn
from numpy import moveaxis
from torchvision import models

vgg16_model = models.vgg16(pretrained=True)
modules = list(vgg16_model.children())[:-1]  # drop the last top-level child
vgg16_model = nn.Sequential(*modules)

# "data" is the (300, 400, 3) image array described below
data = moveaxis(data, 2, 0)                  # (300, 400, 3) -> (3, 300, 400)
img_var = torch.from_numpy(data).unsqueeze(0).float()  # add a batch dimension
features_var = vgg16_model(img_var)
features = features_var.detach().numpy()
print(features.shape)
```
The variable `data` is a NumPy image array of dimensions (300, 400, 3).
Hence I use moveaxis to reorder the axes so that I have 3 channels, not 300.
The output (features.shape) I get is (1, 512, 7, 7).
I want the 4096-d vector that VGG-16 produces just before the softmax layer.
I even tried the `list(vgg16_model.classifier.children())[:-1]` approach, but that did not go too well either. There are a lot of discussions about this, but none of them worked for me. Let me know where I might be going wrong. Thank you!
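For reference, one common way to do this with a recent torchvision VGG (whose top-level children are features, avgpool, and classifier): slicing children()[:-1] removes the whole classifier, which is why the (1, 512, 7, 7) convolutional output comes back. A sketch, keeping everything up to the second fully connected layer:

```python
import torch
import torch.nn as nn
from torchvision import models

vgg16 = models.vgg16(pretrained=True)
vgg16.eval()

# Keep the classifier but drop only its final Linear(4096, 1000) layer.
feature_head = nn.Sequential(*list(vgg16.classifier.children())[:-1])

img = torch.randn(1, 3, 300, 400)  # stand-in for the preprocessed image tensor
with torch.no_grad():
    x = vgg16.features(img)   # (1, 512, 9, 12)
    x = vgg16.avgpool(x)      # (1, 512, 7, 7)
    x = torch.flatten(x, 1)   # (1, 25088)
    feats = feature_head(x)   # (1, 4096)
print(feats.shape)
```

The key detail is the flatten between the convolutional stack and the fully connected head, which nn.Sequential alone does not insert.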
Recently I have been learning about the idea of the embedding layer in neural networks. The best explanation I have found so far is here. It addresses well the core concepts of why to use an embedding layer and how it works.
It also mentions that the embedding maps similar words to a similar region, so the quality of an embedding representation is measured by how close a group of similar items from the original space ends up being in the embedding space. But I really have no idea how to achieve this.
My question is: how do I design the weight matrix so as to get a better embedding representation, customised for a specific dataset?
Any hint would be really helpful to me!
Thank you all!
Assuming you know some concepts of neural networks and Word2Vec, I will try to explain things briefly.
1. The weight matrix in the embedding layer is usually randomly initialized, just like the weights in other types of neural network layers.
2. The weight matrix in the embedding layer transforms the sparse input into a dense vector, as explained in the post you mentioned.
3. The weight matrix in the embedding layer is updated during training on your dataset via backpropagation.
Therefore, after training, the learned weight matrix should give you better representations of your specific data. Just as with word embeddings, more data usually yields better representations in the embedding layer. Another factor is the number of dimensions (generally speaking, the higher the dimensionality, the more degrees of freedom the model has to learn representations of the features).
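A minimal sketch of point 3, with made-up sizes and a toy classification head, showing that the embedding weight matrix receives gradients and is updated like any other parameter:

```python
import torch
import torch.nn as nn

# Toy setup: vocabulary of 1000 tokens, 32-d embeddings; all sizes illustrative.
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=32)  # random init
classifier = nn.Linear(32, 2)
optimizer = torch.optim.Adam(
    list(embedding.parameters()) + list(classifier.parameters()), lr=1e-3
)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, 1000, (64, 10))  # (batch, seq_len) integer indices
labels = torch.randint(0, 2, (64,))

logits = classifier(embedding(tokens).mean(dim=1))  # average the token vectors
loss = loss_fn(logits, labels)
loss.backward()    # gradients flow into embedding.weight
optimizer.step()   # the embedding matrix is updated like any other weight
```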
I intend to make a classifier using the feature map obtained from a CNN. Can someone suggest how I can do this?
Would it work if I first train the CNN on positive and negative samples (thereby obtaining the weights), and then, every time I need to classify an image, apply the conv and pooling layers to obtain its feature map? The problem I see with this is that the image I want to classify may not have a similar feature map, so I wouldn't be able to compute the distance correctly, as the order of the features in the layer may differ.
You can use the same CNN for classification if you trained it with (for example) the cross-entropy loss (also known as softmax with loss). In that case, take the argmax of the last layer (the node with the highest score), and that is the class predicted by the network. However, all architectures used in machine learning expect, at test time, inputs similar to those seen during training.
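A minimal sketch of that setup (the architecture and sizes are purely illustrative): the CNN is trained end-to-end with cross-entropy on positive/negative samples, and at test time the predicted class is simply the argmax over the output logits, so no feature-map distances are needed.

```python
import torch
import torch.nn as nn

# Tiny illustrative CNN for binary (positive / negative) classification.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(32, 2),  # two logits: negative / positive
)
loss_fn = nn.CrossEntropyLoss()  # softmax + negative log-likelihood
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One training step on a dummy batch:
images, labels = torch.randn(8, 3, 64, 64), torch.randint(0, 2, (8,))
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()

# At test time, the predicted class is the argmax of the logits:
pred = model(torch.randn(1, 3, 64, 64)).argmax(dim=1)
print(pred)
```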