Faster way to do multiple embeddings in PyTorch? - deep-learning

I'm working on a torch-based library for building autoencoders with tabular datasets.
One big feature is learning embeddings for categorical features.
In practice, however, training many embedding layers simultaneously is creating some slowdowns. I am using for-loops to do this and running the for-loop on each iteration is (I think) what's causing the slowdowns.
When building the model, I associate embedding layers with each categorical feature in the user's dataset:
for ft in self.categorical_fts:
feature = self.categorical_fts[ft]
n_cats = len(feature['cats']) + 1
embed_dim = compute_embedding_size(n_cats)
embed_layer = torch.nn.Embedding(n_cats, embed_dim)
feature['embedding'] = embed_layer
Then, with a call to .forward():
embeddings = []
for i, ft in enumerate(self.categorical_fts):
feature = self.categorical_fts[ft]
emb = feature['embedding'](codes[i])
embeddings.append(emb)
#num and bin are numeric and binary features
x = torch.cat(num + bin + embeddings, dim=1)
Then x goes into dense layers.
This gets the job done but running this for loop during each forward pass really slows down training, especially when a dataset has tens or hundreds of categorical columns.
Does anybody know of a way of vectorizing something like this? Thanks!
UPDATE:
For more clarity, I made this sketch of how I'm feeding categorical features into the network. You can see that each categorical column has its own embedding matrix, while numeric features are concatenated directly to their output before being passed into the feed-forward network.
Can we do this without iterating through each embedding matrix?

just use simple indexing [, though i'm not sure whether it is fast enough
Here is a simplified version for all feature have same vocab_size and embedding dim, but it should apply to cases of heterogeneous category features
xdim = 240
embed_dim = 8
vocab_size = 64
embedding_table = torch.randn(size=(xdim, vocab_size, embed_dim))
batch_size = 32
x = torch.randint(vocab_size, size=(batch_size, xdim))
out = embedding_table[torch.arange(xdim), x]
out.shape # (bz, xdim, embed_dim)
# unit test
i = np.random.randint(batch_size)
j = np.random.randint(xdim)
x_index = x[i][j]
w = embedding_table[j]
torch.allclose(w[x_index], out[i, j])

Related

Pytorch : different behaviours in GAN training with different, but conceptually equivalent, code

I'm trying to implement a simple GAN in Pytorch. The following training code works:
for epoch in range(max_epochs): # loop over the dataset multiple times
print(f'epoch: {epoch}')
running_loss = 0.0
for batch_idx,(data,_) in enumerate(data_gen_fn):
# data preparation
real_data = data
input_shape = real_data.shape
inputs_generator = torch.randn(*input_shape).detach()
# generator forward
fake_data = generator(inputs_generator).detach()
# discriminator forward
optimizer_generator.zero_grad()
optimizer_discriminator.zero_grad()
#################### ALERT CODE #######################
predictions_on_real = discriminator(real_data)
predictions_on_fake = discriminator(fake_data)
predictions = torch.cat((predictions_on_real,
predictions_on_fake), dim=0)
#########################################################
# loss discriminator
labels_real_fake = torch.tensor([1]*batch_size + [0]*batch_size)
loss_discriminator_batch = criterion_discriminator(predictions,
labels_real_fake)
# update discriminator
loss_discriminator_batch.backward()
optimizer_discriminator.step()
# generator
# zero the parameter gradients
optimizer_discriminator.zero_grad()
optimizer_generator.zero_grad()
fake_data = generator(inputs_generator) # make again fake data but without detaching
predictions_on_fake = discriminator(fake_data) # D(G(encoding))
# loss generator
labels_fake = torch.tensor([1]*batch_size)
loss_generator_batch = criterion_generator(predictions_on_fake,
labels_fake)
loss_generator_batch.backward() # dL(D(G(encoding)))/dW_{G,D}
optimizer_generator.step()
If I plot the generated images for each iteration, I see that the generated images look like the real ones, so the training procedure seems to work well.
However, if I try to change the code in the ALERT CODE part , i.e., instead of:
#################### ALERT CODE #######################
predictions_on_real = discriminator(real_data)
predictions_on_fake = discriminator(fake_data)
predictions = torch.cat((predictions_on_real,
predictions_on_fake), dim=0)
#########################################################
I use the following:
#################### ALERT CODE #######################
predictions = discriminator(torch.cat( (real_data, fake_data), dim=0))
#######################################################
That is conceptually the same (in a nutshell, instead of doing two different forward on the discriminator, the former on the real, the latter on the fake data, and finally concatenate the results, with the new code I first concatenate real and fake data, and finally I make just one forward pass on the concatenated data.
However, this code version does not work, that is the generated images seems to be always random noise.
Any explanation to this behavior?
Why do we different results?
Supplying inputs in either the same batch, or separate batches, can make a difference if the model includes dependencies between different elements of the batch. By far the most common source in current deep learning models is batch normalization. As you mentioned, the discriminator does include batchnorm, so this is likely the reason for different behaviors. Here is an example. Using single numbers and a batch size of 4:
features = [1., 2., 5., 6.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
>>>mean 3.5, std 2.0615528128088303
>>>normalized features [-1.21267813 -0.72760688 0.72760688 1.21267813]
Now we split the batch into two parts. First part:
features = [1., 2.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
>>>mean 1.5, std 0.5
>>>normalized features [-1. 1.]
Second part:
features = [5., 6.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
>>>mean 5.5, std 0.5
>>>normalized features [-1. 1.]
As we can see, in the split-batch version, the two batches are normalized to the exact same numbers, even though the inputs are very different. In the joint-batch version, on the other hand, the larger numbers are still larger than the smaller ones as they are normalized using the same statistics.
Why does this matter?
With deep learning, it's always hard to say, and especially with GANs and their complex training dynamics. A possible explanation is that, as we can see in the example above, the separate batches result in more similar features after normalization even if the original inputs are quite different. This may help early in training, as the generator tends to output "garbage" which has very different statistics from real data.
With a joint batch, these differing statistics make it easy for the discriminator to tell the real and generated data apart, and we end up in a situation where the discriminator "overpowers" the generator.
By using separate batches, however, the different normalizations result in the generated and real data to look more similar, which makes the task less trivial for the discriminator and allows the generator to learn.

Incremental Neural Network Training

I am working with a very large dataset with hundreds of long videos to be used as training and I'm using Google Colab to perform some tests. The whole code I wrote is quite simple and uses PyTorch.
When I try to perform the training, if I use more than 200 videos at a time, the RAM fullfills during the training and the Colab crashes. I noticed that this does not happens if I train with lower number of training videos.
For that reason I thought that my model may be trained incrementally creaing a structure as follows:
model = torch.nn.Sequential( # create a model
...
nn.Softmax(dim=1)
)
MAX_VIDEOS_PER_BATCH = 100
for current_batch in range (0, TOTAL_BATCHES): # Perform TOTAL_BATCHES trainings
videos = []
labels = []
for index, video_file_name in enumerate(os.listdir(VIDEOS_DIR)): # Read 100 videos as training set
if index < MAX_VIDEOS_PER_BATCH * current_batch:
continue
... # read the video and add it to videos
... # add the considered labels to videos list
video_training = torch.tensor(np.asarray(videos)).float() # (batch x frames x channels x height x width)
learning_rate = 1e-4
for t in range(ITERATIONS): # Train the model, if I already trained it the model is not resetted
y_pred = model(torch.FloatTensor(np.asarray(video_training )))
loss = loss_fn(y_pred, torch.tensor(labels))
print("#" + str(t), " loss:" + str(loss.item()))
model.zero_grad()
loss.backward()
with torch.no_grad():
for param in model.parameters():
param -= learning_rate * param.grad
My question is, is this method correct? I am training the network in a correct manner or this batches approach will create some damages or biases to the model?
When I go from batch 1 to batch 2, the model won't lose the previous trained knowledge, is it correct?
This is correct but the best way is to do the reverse.

Wasserstein GAN implemtation in pytorch. How to implement the loss?

I'm currently working on a project in pytorch on Wasserstein GAN (https://arxiv.org/pdf/1701.07875.pdf).
In Wasserstain GAN a new objective function is defined using the wasserstein distance as :
Which leads to the following algorithms for training the GAN:
My question is :
When implementing line 5 and 6 of the algorithm in pytorch should I be multiplying my loss -1 ? As in my code (I use RMSprop as my optimizer for both the generator and critic):
############################
# (1) Update D network: maximize (D(x)) + (D(G(x)))
###########################
for n in range(n_critic):
D.zero_grad()
real_cpu = data[0].to(device)
b_size = real_cpu.size(0)
output = D(real_cpu)
#errD_real = -criterion(output, label) #DCGAN
errD_real = torch.mean(output)
# Calculate gradients for D in backward pass
errD_real.backward()
D_x = output.mean().item()
## Train with all-fake batch
# Generate batch of latent vectors
noise = torch.randn(b_size, 100, device=device) #Careful here we changed shape of input (original : torch.randn(4, 100, 1, 1, device=device))
# Generate fake image batch with G
fake = G(noise)
# Classify all fake batch with D
output = D(fake.detach())
# Calculate D's loss on the all-fake batch
errD_fake = torch.mean(output)
# Calculate the gradients for this batch
errD_fake.backward()
D_G_z1 = output.mean().item()
# Add the gradients from the all-real and all-fake batches
errD = -(errD_real - errD_fake)
# Update D
optimizerD.step()
#Clipping weights
for p in D.parameters():
p.data.clamp_(-0.01, 0.01)
As you can see, I do the operation errD = -(errD_real - errD_fake), with errD_real and errD_fake being respectively the mean of the predictions of the critic on real and fake samples.
To my understanding RMSprop should optimize the weights of the critic the following way :
w <- w - alpha*gradient(w)
(alpha being the learning rate divided by the square root of the weighted moving average of the squared gradient)
Since the optimization problem requires to "go" in the same direction as the gradient it should be required to multiply gradient(w) by -1 before optimizing the weights.
Do you think that my reasoning is right ?
The program runs but my results are quiet poor.
I follow the same logic for the generator's weights but this time in order to go in the opposite direction of the gradient:
############################
# (2) Update G network: minimize -D(G(x))
###########################
G.zero_grad()
noise = torch.randn(b_size, 100, device=device)
fake = G(noise)
#label.fill_(fake_label) # fake labels are real for generator cost
# Since we just updated D, perform another forward pass of all-fake batch through D
output = D(fake).view(-1)
# Calculate G's loss based on this output
#errG = criterion(output, label) #DCGAN
errG = -torch.mean(output)
# Calculate gradients for G
errG.backward()
D_G_z2 = output.mean().item()
# Update G
optimizerG.step()
Sorry for the long question, I tried to explain my doubt as clear as possible. Thank you everyone.
I noticed some errors in the implementation of your discriminator training protocol. You call your backward functions twice with both the real and fake values loss being backpropagated at different time steps.
Technically an implementation using this scheme is possible but highly unreadable. There was a mistake with your errD_real in which your output is going to be positive instead of negative as an optimal D(G(z))>0 and so you penalize it for being correct. Overall your model converges simply by predicting D(x)<0 for all inputs.
To fix this do not call your errD_readl.backward() or your errD_fake.backward(). Simply using an errD.backward() after you define errD would work perfectly fine. Otherwise, your generator seems to be correct.

Understanding when to call zero_grad() in pytorch, when training with multiple losses

I am going through an open-source implementation of a domain-adversarial model (GAN-like). The implementation uses pytorch and I am not sure they use zero_grad() correctly. They call zero_grad() for the encoder optimizer (aka the generator) before updating the discriminator loss. However zero_grad() is hardly documented, and I couldn't find information about it.
Here is a psuedo code comparing a standard GAN training (option 1), with their implementation (option 2). I think the second option is wrong, because it may accumulate the D_loss gradients with the E_opt. Can someone tell if these two pieces of code are equivalent?
Option 1 (a standard GAN implementation):
X, y = get_D_batch()
D_opt.zero_grad()
pred = model(X)
D_loss = loss(pred, y)
D_opt.step()
X, y = get_E_batch()
E_opt.zero_grad()
pred = model(X)
E_loss = loss(pred, y)
E_opt.step()
Option 2 (calling zero_grad() for both optimizers at the beginning):
E_opt.zero_grad()
D_opt.zero_grad()
X, y = get_D_batch()
pred = model(X)
D_loss = loss(pred, y)
D_opt.step()
X, y = get_E_batch()
pred = model(X)
E_loss = loss(pred, y)
E_opt.step()
It depends on params argument of torch.optim.Optimizer subclasses (e.g. torch.optim.SGD) and exact structure of the model.
Assuming E_opt and D_opt have different set of parameters (model.encoder and model.decoder do not share weights), something like this:
E_opt = torch.optim.Adam(model.encoder.parameters())
D_opt = torch.optim.Adam(model.decoder.parameters())
both options MIGHT indeed be equivalent (see commentary for your source code, additionally I have added backward() which is really important here and also changed model to discriminator and generator appropriately as I assume that's the case):
# Starting with zero gradient
E_opt.zero_grad()
D_opt.zero_grad()
# See comment below for possible cases
X, y = get_D_batch()
pred = discriminator(x)
D_loss = loss(pred, y)
# This will accumulate gradients in discriminator only
# OR in discriminator and generator, depends on other parts of code
# See below for commentary
D_loss.backward()
# Correct weights of discriminator
D_opt.step()
# This only relies on random noise input so discriminator
# Is not part of this equation
X, y = get_E_batch()
pred = generator(x)
E_loss = loss(pred, y)
E_loss.backward()
# So only parameters of generator are updated always
E_opt.step()
Now it's all about get_D_Batch feeding data to discriminator.
Case 1 - real samples
This is not a problem as it does not involve generator, you pass real samples and only discriminator takes part in this operation.
Case 2 - generated samples
Naive case
Here indeed gradient accumulation may occur. It would occur if get_D_batch would simply call X = generator(noise) and passed this data to discriminator.
In such case both discriminator and generator have their gradients accumulated during backward() as both are used.
Correct case
We should take generator out of the equation. Taken from PyTorch DCGan example there is a little line like this:
# Generate fake image batch with G
fake = generator(noise)
label.fill_(fake_label)
# DETACH HERE
output = discriminator(fake.detach()).view(-1)
What detach does is it "stops" the gradient by detaching it from the computational graph. So gradients will not be backpropagated along this variable. This effectively does not impact gradients of generator so it has no more gradients so no accumulation happens.
Another way (IMO better) would be to use with.torch.no_grad(): block like this:
# Generate fake image batch with G
with torch.no_grad():
fake = generator(noise)
label.fill_(fake_label)
# NO DETACH NEEDED
output = discriminator(fake).view(-1)
This way generator operations will not build part of the graph so we get better performance (it would in the first case but would be detached afterwards).
Finally
Yeah, all in all first option is better for standard GANs as one doesn't have to think about such stuff (people implementing it should, but readers should not). Though there are also other approaches like single optimizer for both generator and discriminator (one cannot zero_grad() only for subset of parameters (e.g. encoder) in this case), weight sharing and others which further clutter the picture.
with torch.no_grad() should alleviate the problem in all/most cases as far as I can tell and imagine ATM.

Proper way to extract embedding weights for CBOW model?

I'm currently trying to implement the CBOW model on managed to get the training and testing, but am facing some confusion as to the "proper" way to finally extract the weights from the model to use as our word embeddings.
Model
class CBOW(nn.Module):
def __init__(self, config, vocab):
self.config = config # Basic config file to hold arguments.
self.vocab = vocab
self.vocab_size = len(self.vocab.token2idx)
self.window_size = self.config.window_size
self.embed = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.config.embed_dim)
self.linear = nn.Linear(in_features=self.config.embed_dim, out_features=self.vocab_size)
def forward(self, x):
x = self.embed(x)
x = torch.mean(x, dim=0) # Average out the embedding values.
x = self.linear(x)
return x
Main process
After I run my model through a Solver with the training and testing data, I basically told the train and test functions to also return the model that's used. Then I assigned the embedding weights to a separate variable and used those as the word embeddings.
Training and testing was conducted using cross entropy loss, and each training and testing sample is of the form ([context words], target word).
def run(solver, config, vocabulary):
for epoch in range(config.num_epochs):
loss_train, model_train = solver.train()
loss_test, model_test = solver.test()
embeddings = model_train.embed.weight
I'm not sure if this is the correct way of going about extracting and using the embeddings. Is there usually another way to do this? Thanks in advance.
Yes, model_train.embed.weight will give you a torch tensor that stores the embedding weights. Note however, that this tensor also contains the latest gradients. If you don't want/need them, model_train.embed.weight.data will give you the weights only.
A more generic option is to call model_train.embed.parameters(). This will give you a generator of all the weight tensors of the layer. In general, there are multiple weight tensors in a layer and weight will give you only one of them. Embedding happens to have only one, so here it doesn't matter which option you use.