How to implement low-dimensional embedding layer in pytorch - deep-learning

I recently read a paper about embeddings.
In Eq. (3), f is a 4096x1 vector. The author compresses the vector into theta (a 20x1 vector) by using an embedding matrix E.
The equation is simply theta = E*f.
I was wondering whether this can be done in PyTorch, so that E is learned automatically during training.
How do I finish the rest? Thanks so much.
The demo code so far:
import torch
from torch import nn
f = torch.randn(4096,1)

Assuming your input vectors are one-hot (which is where "embedding layers" are used), you can directly use the embedding layer from torch, which does the above as well as some more things. nn.Embedding takes the non-zero index of the one-hot vector as input, as a long tensor. For example, if the feature vectors are
f = [[0,0,1], [1,0,0]]
then the input to nn.Embedding will be
input = [2, 0]
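For example, a minimal sketch of that usage (the embedding dimension of 20 is taken from the question, and the vocabulary size of 3 just matches the toy one-hot vectors above):
import torch
from torch import nn

embedding = nn.Embedding(num_embeddings=3, embedding_dim=20)  # 3 possible indices -> 20-dim vectors
input = torch.tensor([2, 0], dtype=torch.long)                # non-zero positions of the one-hot rows
theta = embedding(input)                                      # shape: (2, 20)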
However, what the OP asked about in the question is getting embeddings by matrix multiplication, and below I will address that. You can define a module to do that as below. Since param is an instance of nn.Parameter, it will be registered as a parameter and will be optimized when you call Adam or any other optimizer.
class Embedding(nn.Module):
    def __init__(self, input_dim, embedding_dim):
        super().__init__()
        self.param = torch.nn.Parameter(torch.randn(input_dim, embedding_dim))

    def forward(self, x):
        return torch.mm(x, self.param)
If you look carefully, this is the same as a linear layer with no bias and a slightly different initialization. Therefore, you can achieve the same thing by using a linear layer, as below.
self.embedding = nn.Linear(4096, 20, bias=False)
# change the initial weights to N(0, 1) or whatever is required
self.embedding.weight.data = torch.randn_like(self.embedding.weight)
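Putting it together for the question's dimensions, a minimal sketch (note that f in the question is a 4096x1 column, so it is transposed to a 1x4096 row before going through the layer):
import torch
from torch import nn

f = torch.randn(4096, 1)                      # the 4096x1 feature vector from the question

embedding = nn.Linear(4096, 20, bias=False)   # the weight matrix plays the role of E
embedding.weight.data = torch.randn_like(embedding.weight)  # N(0, 1) init, if desired

theta = embedding(f.t())                      # (1, 4096) -> (1, 20)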

Related

Can neural network converge to completely random function

I am trying to train a DNN to converge to a random function (i.e., one whose values are drawn from a normal distribution), but for now the network doesn't learn anything and the loss is stuck. Is it even possible, or am I just wasting my time?
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Dense
import numpy as np
import matplotlib.pyplot as plt
n_hidden_units = 25
num_lay = 10
learning_rate = 0.01
batch_size = 1000
epochs = 1000
save_freq_epoches = 500000
num_of_xs = 2
inputs_train = np.random.randn(batch_size*10,num_of_xs)*1
outputs_train = np.random.randn(batch_size*10,1)#np.sum(inputs_train,axis=1)#
inputs_train = tf.convert_to_tensor(inputs_train)
outputs_train = tf.convert_to_tensor((outputs_train-outputs_train.min())/(outputs_train.max()-outputs_train.min()))
kernel_init = keras.initializers.RandomUniform(-0.25, 0.25)
inputs = Input(num_of_xs)
x = Dense(n_hidden_units, kernel_initializer=kernel_init, activation='relu')(inputs)
for _ in range(num_lay):
    x = Dense(n_hidden_units, kernel_initializer=kernel_init, activation='relu')(x)
outputs = Dense(1, kernel_initializer=kernel_init, activation='linear')(x)
model = Model(inputs=inputs, outputs=outputs)
optimizer1 = keras.optimizers.Adam(beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0,
                                   amsgrad=True, learning_rate=learning_rate)
model.compile(loss='mse', optimizer=optimizer1, metrics=None)
model.fit(inputs_train, outputs_train, batch_size=batch_size, epochs=epochs,
          shuffle=False, verbose=2)
plt.plot(outputs_train,'ob')
plt.plot(model(inputs_train),'*r')
plt.show()
For now I am getting the worst predictions (in red) relative to the target labels (blue)
If you are using a validation split, you can't. Otherwise you can, but it will be hard, since good pipelines include regularization techniques that try to prevent exactly this from happening.
Your target distribution is given by
np.random.randn(batch_size*10,1)
Then normalized to:
(outputs_train-outputs_train.min())/(outputs_train.max()-outputs_train.min())
As you can see, your targets are completely independent of your variable x! So, if you have to predict the value (y) for a previously unseen value (x), there is literally nothing you can do better than simply predicting the mean value of y.
In other words, your target distribution is a flat line y = avg + noise.
Your question is then: can the network predict this extra noise? Well, no, that's why we call it noise, because it is the random deviations from the pattern that are completely unrelated to the input info that we feed the network.
BUT.
If you do NOT use validation (that is, you are interested in the prediction error with respect to the {x, y} pairs that you see during training) then the network will learn noise, up to its full prediction capacity (the more complex the network, the more it can adapt to complex noise). This is precisely what we call overfitting, and it is a BAD thing!
Normally we want models to predict something like "y = x * 2 + 3", whereas learning noise is more like learning a dictionary of unrelated predictions: "{x1: 2.93432, x2: -0.00324, ...}"
Because overfitting is bad (it is bad because it makes predictions for unseen validation data worse, which means our models are worse in new data), pipelines have built-in techniques to fight the natural tendency of neural networks to do this. Such techniques include data augmentation (common in images), early stopping, dropout, and so on.
If you REALLY need to overfit to your data, you will need to deactivate any such techniques, and train for as long as you can (which is normally not something we want to do!).
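To make the "best you can do is predict the mean" point concrete, here is a small sketch (pure NumPy, with made-up data following the question's setup) showing that the mean predictor's MSE on unseen data is essentially the variance of the normalized targets:
import numpy as np

rng = np.random.default_rng(0)

# same setup as the question: targets are independent of the inputs
y = rng.standard_normal((10000, 1))
y = (y - y.min()) / (y.max() - y.min())        # min-max normalization as in the question

y_train, y_test = y[:8000], y[8000:]

# the "mean predictor": always output the training mean
baseline = y_train.mean()
mse_baseline = np.mean((y_test - baseline) ** 2)

print(f"variance of test targets : {y_test.var():.5f}")
print(f"MSE of the mean predictor: {mse_baseline:.5f}")
# the two numbers are essentially equal: no model can beat this on unseen data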

How to get logits as neural network output

Simple and short question. I have a network (a U-Net) which performs image segmentation. I want logits as the output to feed into the cross entropy loss (using pytorch). Currently my final layer looks like this:
class Logits(nn.Sequential):
    def __init__(self, in_channels, n_class):
        super(Logits, self).__init__()
        # fully connected layer outputting the prediction layers for each of my classes
        self.conv = self.add_module('conv_out',
                                    nn.Conv2d(in_channels,
                                              n_class,
                                              kernel_size=1))
        self.activ = self.add_module('sigmoid_out',
                                     nn.Sigmoid())
Is it correct to use the sigmoid activation function here? Does this give me logits?
When people talk about "logits" they usually refer to the "raw" n_class-dimensional output vector. For multi-class classification (n_class > 2) you want to convert the n_class-dimensional vector of raw "logits" into an n_class-dim probability vector.
That is, you want prob = f(logits) with prob_i >= 0 for all n_class entries and sum(prob) = 1.
The most straightforward way of doing that differentiably is to use the Softmax function:
prob_i = softmax(logits) = exp(logits_i) / sum_j exp(logits_j)
It is easy to see that the output of softmax is indeed an n_class-dim probability vector (I leave it to you as a short exercise).
BTW, this is why the raw predictions are called "logits": they can be viewed as a kind of log of the predicted output probabilities.
Now, it is customary not to explicitly compute the softmax on top of a classification network and defer its computation to the loss function, e.g. nn.CrossEntropyLoss that internally computes the softmax and requires the raw logits as inputs, rather than the normalized probabilities. This is done mainly for numerical stability.
Therefore, if you are training a multi-class classification network with nn.CrossEntropyLoss you do not need to worry at all about the final activation and simply output the raw logits from your final conv/linear layer.
Most importantly, do not use nn.Sigmoid() activation as it tends to have saturated gradients and will mess up your training.
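For the multi-class case, a minimal sketch (with arbitrary shapes standing in for a segmentation head) would therefore look like this:
import torch
from torch import nn

n_class, in_channels = 5, 16
batch, H, W = 2, 32, 32

# final layer: raw logits, no activation
logits_layer = nn.Conv2d(in_channels, n_class, kernel_size=1)

features = torch.randn(batch, in_channels, H, W)
logits = logits_layer(features)                     # (batch, n_class, H, W)

target = torch.randint(0, n_class, (batch, H, W))   # one class index per pixel
loss = nn.CrossEntropyLoss()(logits, target)        # softmax is computed internally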
As far as I understand, you are working on a multi-label classification task where a single input can have several labels, hence your usage of nn.Sigmoid (vs. nn.Softmax for multi-class classification).
There is a loss function which combines nn.Sigmoid and nn.BCELoss: nn.BCEWithLogitsLoss. Its input would be a vector of logits whose length is the number of classes, and the target would have the same shape: a multi-hot encoding, with 1s for the active classes.
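A minimal sketch of that multi-label variant (again with arbitrary shapes) would be:
import torch
from torch import nn

n_class = 5
logits = torch.randn(2, n_class)                   # raw outputs, no sigmoid applied

# multi-hot targets: several classes can be active at once
target = torch.tensor([[1., 0., 1., 0., 0.],
                       [0., 1., 0., 0., 1.]])

loss = nn.BCEWithLogitsLoss()(logits, target)      # applies the sigmoid internally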

Proper way to extract embedding weights for CBOW model?

I'm currently trying to implement the CBOW model and managed to get the training and testing working, but am facing some confusion as to the "proper" way to finally extract the weights from the model to use as our word embeddings.
Model
class CBOW(nn.Module):
    def __init__(self, config, vocab):
        super().__init__()  # needed so that the submodules below are registered
        self.config = config  # Basic config file to hold arguments.
        self.vocab = vocab
        self.vocab_size = len(self.vocab.token2idx)
        self.window_size = self.config.window_size
        self.embed = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.config.embed_dim)
        self.linear = nn.Linear(in_features=self.config.embed_dim, out_features=self.vocab_size)

    def forward(self, x):
        x = self.embed(x)
        x = torch.mean(x, dim=0)  # Average out the embedding values.
        x = self.linear(x)
        return x
Main process
After I run my model through a Solver with the training and testing data, I basically told the train and test functions to also return the model that's used. Then I assigned the embedding weights to a separate variable and used those as the word embeddings.
Training and testing was conducted using cross entropy loss, and each training and testing sample is of the form ([context words], target word).
def run(solver, config, vocabulary):
    for epoch in range(config.num_epochs):
        loss_train, model_train = solver.train()
        loss_test, model_test = solver.test()
    embeddings = model_train.embed.weight
I'm not sure if this is the correct way of going about extracting and using the embeddings. Is there usually another way to do this? Thanks in advance.
Yes, model_train.embed.weight will give you a torch tensor that stores the embedding weights. Note, however, that this is an nn.Parameter, so it also carries autograd information (and, after a backward pass, the latest gradients in its .grad attribute). If you don't want/need that, model_train.embed.weight.data will give you the weights only.
A more generic option is to call model_train.embed.parameters(). This will give you a generator of all the weight tensors of the layer. In general, there are multiple weight tensors in a layer and weight will give you only one of them. Embedding happens to have only one, so here it doesn't matter which option you use.
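A small sketch of the options mentioned above (hypothetical sizes, using a bare nn.Embedding in place of the full CBOW model):
import torch
from torch import nn

embed = nn.Embedding(num_embeddings=1000, embedding_dim=100)

w      = embed.weight              # nn.Parameter, still attached to autograd
w_data = embed.weight.data         # raw tensor, no autograd bookkeeping
params = list(embed.parameters())  # generic: all weight tensors of the layer

# a common pattern: detach and move to NumPy for downstream use
embeddings = embed.weight.detach().cpu().numpy()   # shape (1000, 100)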

Training with BatchNorm in pytorch

I'm wondering if I need to do anything special when training with BatchNorm in pytorch. From my understanding the gamma and beta parameters are updated with gradients as would normally be done by an optimizer. However, the mean and variance of the batches are updated slowly using momentum.
So do we need to specify to the optimizer when the mean and variance parameters are updated, or does pytorch automatically take care of this?
Is there a way to access the mean and variance of the BN layer so that I can make sure it was changing while I trained the model?
If needed here is my model and training procedure:
def bn_drop_lin(n_in: int, n_out: int, bn: bool = True, p: float = 0.):
    "Sequence of batchnorm (if `bn`) and dropout (with `p`), followed by a linear layer (`n_in` -> `n_out`)."
    layers = [nn.BatchNorm1d(n_in)] if bn else []
    if p != 0:
        layers.append(nn.Dropout(p))
    layers.append(nn.Linear(n_in, n_out))
    return nn.Sequential(*layers)
class Model(nn.Module):
    def __init__(self, i, o, h=()):
        super().__init__()
        nodes = (i,) + h + (o,)
        self.layers = nn.ModuleList([bn_drop_lin(i, o, p=0.5)
                                     for i, o in zip(nodes[:-1], nodes[1:])])

    def forward(self, x):
        x = x.view(x.shape[0], -1)
        for layer in self.layers[:-1]:
            x = F.relu(layer(x))
        return self.layers[-1](x)
Training:
for i, data in enumerate(trainloader):
    # get the inputs; data is a list of [inputs, labels]
    inputs, labels = data

    # zero the parameter gradients
    optimizer.zero_grad()

    # forward + backward + optimize
    outputs = net(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
Batchnorm layers behave differently depending on whether the model is in train or eval mode.
When net is in train mode (i.e. after calling net.train()) the batch norm layers contained in net will use batch statistics along with gamma and beta parameters to scale and translate each mini-batch. The running mean and variance will also be adjusted while in train mode. These updates to running mean and variance occur during the forward pass (when net(inputs) is called). The gamma and beta parameters are like any other pytorch parameter and are updated only once optimizer.step() is called.
When net is in eval mode (net.eval()) batch norm uses the historical running mean and running variance computed during training to scale and translate samples.
You can check a batch norm layer's running mean and variance by displaying its running_mean and running_var members, to ensure batch norm is updating them as expected. The learnable gamma and beta parameters can be accessed by displaying the weight and bias members of a batch norm layer, respectively.
Edit
Below is a simple demonstration showing that running_mean is updated during the forward pass. Observe that it is not updated by the optimizer.
>>> import torch
>>> import torch.nn as nn
>>> layer = nn.BatchNorm1d(5)
>>> layer.train()
>>> layer.running_mean
tensor([0., 0., 0., 0., 0.])
>>> result = layer(torch.randn(5,5))
>>> layer.running_mean
tensor([ 0.0271, 0.0152, -0.0403, -0.0703, -0.0056])
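For completeness, continuing the same session shows that running_mean is left untouched in eval mode, and that the learnable parameters are exposed as weight and bias:
>>> layer.eval()
>>> before = layer.running_mean.clone()
>>> _ = layer(torch.randn(5, 5))
>>> torch.equal(before, layer.running_mean)
True
>>> layer.weight.shape, layer.bias.shape
(torch.Size([5]), torch.Size([5]))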

Tune input features using backprop in keras

I am trying to implement discriminant condition codes in Keras as proposed in
Xue, Shaofei, et al., "Fast adaptation of deep neural network based
on discriminant codes for speech recognition."
The main idea is that you encode each condition as an input parameter and let the network learn the dependency between the condition and the feature-label mapping. On a new dataset, instead of adapting the entire network, you just tune these weights using backprop. For example, say my network looks like this:
X ----> |-----|
        | DNN |----> Y
Z ----> |-----|
X: features, Y: labels, Z: condition codes
Now given a pretrained DNN, and X',Y' on a new dataset I am trying to estimate the Z' using backprop that will minimize prediction error on Y'. The math seems straightforward except I am not sure how to implement this in keras without having access to the backprop itself.
For instance, can I add an Input() layer with trainable=True, with all other layers set to trainable=False? Can backprop in keras update more than just layer weights? Or is there a way to hack keras layers to do this?
Any suggestions welcome.
Thanks!
I figured out how to do this (exactly) in Keras by looking at fchollet's post here
Using the keras backend I was able to compute the gradient of my loss w.r.t to Z directly and used it to drive the update.
Code below:
import keras.backend as K
import numpy as np

model.summary()  # pretrained model; X, Y, Y_out and Z are nodes in its graph

loss = K.categorical_crossentropy(Y, Y_out)
grads = K.gradients(loss, Z)
grads /= (K.sqrt(K.mean(K.square(grads))) + 1e-5)  # normalize the gradient
iterate = K.function([X, Z], [loss, grads])

step = 0.1
Z_adapt = Z_in.copy()
for i in range(100):
    loss_val, grads_val = iterate([X_in, Z_adapt])
    Z_adapt -= grads_val[0] * step
    print("iter:", i, np.mean(loss_val))

print("Before:")
print(model.evaluate([X_in, Z_in], Y_out))
print("After:")
print(model.evaluate([X_in, Z_adapt], Y_out))
X, Y, and Z are nodes in the model graph. Z_in is an initial value for Z'; I set it to an average value from the train set. Z_adapt is the result after 100 iterations of gradient descent and should give you a better result.
Assume that the size of Z is m x n. You can first define an input layer of size m*n x 1, whose input will be an m*n x 1 vector of ones. Then define a dense layer containing m*n neurons and set trainable=True for it. The response of this layer will give you a flattened version of Z. Reshape it appropriately and feed it as input to the rest of the network, which can be appended on top of this (a rough sketch follows below).
Keep in mind that if the size of Z is too large, the network may not be able to learn a dense layer with that many neurons. In that case you may need to add additional constraints or look into convolutional layers; however, convolutional layers will put some constraints on Z.
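A rough sketch of that construction (hypothetical sizes m = 4, n = 5; the frozen pretrained network is only indicated by a comment):
from keras.layers import Input, Dense, Reshape
import numpy as np

m, n = 4, 5                                    # size of Z

ones_in = Input(shape=(m * n,))                # fed with a constant vector of ones
z_flat = Dense(m * n, use_bias=False, trainable=True, name='z_codes')(ones_in)
z = Reshape((m, n))(z_flat)                    # learned, reshaped version of Z

x_in = Input(shape=(10,))                      # placeholder for the feature input X
# ... append the (frozen) pretrained network on top of [x_in, z] here ...

# during training, the "input" that drives the codes is simply a batch of ones:
ones_batch = np.ones((32, m * n))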