Simple LSTM in PyTorch with Sequential module - deep-learning

In PyTorch, we can define architectures in multiple ways. Here, I'd like to create a simple LSTM network using the Sequential module.
In Lua's torch I would usually go with:
model = nn.Sequential()
model:add(nn.SplitTable(1,2))
model:add(nn.Sequencer(nn.LSTM(inputSize, hiddenSize)))
model:add(nn.SelectTable(-1)) -- last step of output sequence
model:add(nn.Linear(hiddenSize, classes_n))
However, in PyTorch, I don't find the equivalent of SelectTable to get the last output.
nn.Sequential(
nn.LSTM(inputSize, hiddenSize, 1, batch_first=True),
# what to put here to retrieve last output of LSTM ?,
nn.Linear(hiddenSize, classe_n))

Define a class to extract the last cell output:
# LSTM() returns tuple of (tensor, (recurrent state))
class extract_tensor(nn.Module):
def forward(self,x):
# Output shape (batch, features, hidden)
tensor, _ = x
# Reshape shape (batch, hidden)
return tensor[:, -1, :]
nn.Sequential(
nn.LSTM(inputSize, hiddenSize, 1, batch_first=True),
extract_tensor(),
nn.Linear(hiddenSize, classe_n)
)

According to the LSTM cell documentation the outputs parameter has a shape of (seq_len, batch, hidden_size * num_directions) so you can easily take the last element of the sequence in this way:
rnn = nn.LSTM(10, 20, 2)
input = Variable(torch.randn(5, 3, 10))
h0 = Variable(torch.randn(2, 3, 20))
c0 = Variable(torch.randn(2, 3, 20))
output, hn = rnn(input, (h0, c0))
print(output[-1]) # last element
Tensor manipulation and Neural networks design in PyTorch is incredibly easier than in Torch so you rarely have to use containers. In fact, as stated in the tutorial PyTorch for former Torch users PyTorch is built around Autograd so you don't need anymore to worry about containers. However, if you want to use your old Lua Torch code you can have a look to the Legacy package.

As far as I'm concerned there's nothing like a SplitTable or a SelectTable in PyTorch. That said, you are allowed to concatenate an arbitrary number of modules or blocks within a single architecture, and you can use this property to retrieve the output of a certain layer. Let's make it more clear with a simple example.
Suppose I want to build a simple two-layer MLP and retrieve the output of each layer. I can build a custom class inheriting from nn.Module:
class MyMLP(nn.Module):
def __init__(self, in_channels, out_channels_1, out_channels_2):
# first of all, calling base class constructor
super().__init__()
# now I can build my modular network
self.block1 = nn.Linear(in_channels, out_channels_1)
self.block2 = nn.Linear(out_channels_1, out_channels_2)
# you MUST implement a forward(input) method whenever inheriting from nn.Module
def forward(x):
# first_out will now be your output of the first block
first_out = self.block1(x)
x = self.block2(first_out)
# by returning both x and first_out, you can now access the first layer's output
return x, first_out
In your main file you can now declare the custom architecture and use it:
from myFile import MyMLP
import numpy as np
in_ch = out_ch_1 = out_ch_2 = 64
# some fake input instance
x = np.random.rand(in_ch)
my_mlp = MyMLP(in_ch, out_ch_1, out_ch_2)
# get your outputs
final_out, first_layer_out = my_mlp(x)
Moreover, you could concatenate two MyMLP in a more complex model definition and retrieve the output of each one in a similar way.
I hope this is enough to clarify, but in case you have more questions, please feel free to ask, since I may have omitted something.

Related

Do linear layer after GRU saved the sequence output order?

I'm dealing with the following senario:
My input has the shape of: [batch_size, input_sequence_length, input_features]
where:
input_sequence_length = 10
input_features = 3
My output has the shape of: [batch_size, output_sequence_length]
where:
output_sequence_length = 5
i.e: for each time slot of 10 units (each slot with 3 features) I need to predict the next 5 slots values.
I built the following model:
import torch
import torch.nn as nn
import torchinfo
class MyModel(nn.Module):
def __init__(self):
super(MyModel, self).__init__()
self.GRU = nn.GRU(input_size=3, hidden_size=32, num_layers=2, batch_first=True)
self.fc = nn.Linear(32, 5)
def forward(self, input_series):
output, h = self.GRU(input_series)
output = output[:, -1, :] # get last state
output = self.fc(output)
output = output.view(-1, 5, 1) # reorginize output
return output
torchinfo.summary(MyModel(), (512, 10, 3))
==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
MyModel [512, 5, 1] --
├─GRU: 1-1 [512, 10, 32] 9,888
├─Linear: 1-2 [512, 5] 165
==========================================================================================
I'm getting good results (very small MSE loss, and the predictions looks good),
but I'm not sure if the model output (5 sequence values) are really ordered by the model ?
i.e the second output based on the first output and the third output based on the second output ...
I know that the GRU output based on the learned sequence history.
But I'm also used linear layer, so is the output (after the linear layer) still sorted by time ?
UPDATE
This answer isn't quite right, see this follow-up question. The best way is to write the math and show that the 5 scalar outputs aren't functions of each other.
Old Answer
I'm not sure if the model output (5 sequence values) are really ordered by the model ? i.e the second output based on the first output and the third output based on the second output
No, they aren't. You can check that the gradients of, say, the last output w.r.t to the previous outputs are zeroes, which basically means that the last output isn't a function of the previous outputs.
model = MyModel()
x = torch.rand([2, 10, 3])
y = model(x)
y.retain_grad() # allows accessing y.grad although y is a non-leaf Tensor
y[:, -1].sum().backward() # computes gradients of last output
assert torch.allclose(y.grad[:, :-1], torch.tensor(0.)) # gradients w.r.t previous outputs are zeroes
A popular model to capture dependencies among output labels is conditional random fields. But since you're already happy with the predictions of the current model, perhaps modelling the output dependencies isn't that important.

Implementation of multitask "nested" neural network

I am trying to implement a multitask neural network used by a paper but am quite unsure how I should code the multitask network because the authors did not provide code for that part.
The network architecture looks like (paper):
To make it simpler, the network architecture could be generalized as (For demo I changed their more complicated operation for the pair of individual embeddings to concatenation):
The authors are summing the loss from the individual tasks and the pairwise tasks, and using the total loss to optimize the parameters for the three networks (encoder, MLP-1, MLP-2) in each batch, but I am kind of at sea as to how different types of data are combined in a single batch to feed into two different networks that share an initial encoder. I tried to search for other networks with similar structure but did not find any sources. Would appreciate any thoughts!
This is actually a common pattern. It would be solved by code like the following.
class Network(nn.Module):
def __init__(self, ...):
self.encoder = DrugTargetInteractiongNetwork()
self.mlp1 = ClassificationMLP()
self.mlp2 = PairwiseMLP()
def forward(self, data_a, data_b):
a_encoded = self.encoder(data_a)
b_encoded = self.encoder(data_b)
a_classified = self.mlp1(a_encoded)
b_classified = self.mlp1(b_encoded)
# let me assume data_a and data_b are of shape
# [batch_size, n_molecules, n_features].
# and that those n_molecules are not necessarily
# equal.
# This can be generalized to more dimensions.
a_broadcast, b_broadcast = torch.broadcast_tensors(
a_encoded[:, None, :, :],
b_encoded[:, :, None, :],
)
# this will work if your mlp2 accepts an arbitrary number of
# learding dimensions and just broadcasts over them. That's true
# for example if it uses just Linear and pointwise
# operations, but may fail if it makes some specific assumptions
# about the number of dimensions of the inputs
pairwise_classified = self.mlp2(a_broadcast, b_broadcast)
# if that is a problem, you have to reshape it such that it
# works. Most torch models accept at least a leading batch dimension
# for vectorization, so we can "fold" the pairwise dimension
# into the batch dimension, presenting it as
# [batch*n_mol_1*n_mol_2, n_features]
# to mlp2 and then recover it back
B, N1, N_feat = a_broadcast.shape
_B, N2, _N_feat = b_broadcast.shape
a_batched = a_broadcast.reshape(B*N1*N2, N_feat)
b_batched = b_broadcast.reshape(B*N1*N2, N_feat)
# above, -1 would suffice instead of B*N1*N2, just being explicit
batch_output = self.mlp2(a_batched, b_batched)
# this should be exactly the same as `pairwise_classified`
alternative_classified = batch_output.reshape(B, N1, N2, -1)
return a_classified, b_classified, pairwise_classified

How is Siamese network realized with Pytorch if it is single input during inference?

I am trying to train one CNN model with Pytorch, so that the output behaves differently for different types of inputs. (i.e. If the input images are human-beings, it outputs pattern A, but if the input is some other animals, it outputs pattern B).
After some online search, it seems Siamese network is related to this. So I have the following 2 questions:
(1) Is Siamese network really a good way to train such a model?
(2) From the implementation point of view, how should I implement the code in pytorch?
class SiameseNetwork(nn.Module):
def __init__(self):
super(SiameseNetwork, self).__init__()
self.cnn1 = nn.Sequential(
nn.ReflectionPad2d(1),
nn.Conv2d(1, 4, kernel_size=3),
nn.ReLU(inplace=True),
nn.BatchNorm2d(4),
nn.ReflectionPad2d(1),
nn.Conv2d(4, 8, kernel_size=3),
nn.ReLU(inplace=True),
nn.BatchNorm2d(8),
nn.ReflectionPad2d(1),
nn.Conv2d(8, 8, kernel_size=3),
nn.ReLU(inplace=True),
nn.BatchNorm2d(8),
)
self.fc1 = nn.Sequential(
nn.Linear(8*100*100, 500),
nn.ReLU(inplace=True),
nn.Linear(500, 500),
nn.ReLU(inplace=True),
nn.Linear(500, 5))
def forward_once(self, x):
output = self.cnn1(x)
output = output.view(output.size()[0], -1)
output = self.fc1(output)
return output
def forward(self, input1, input2):
output1 = self.forward_once(input1)
output2 = self.forward_once(input2)
return output1, output2
Currently, I am trying some existing implementation I found online like the above class definition. It works, but there will always be two inputs and two outputs for this model. I agree that it is convenient for training, but ideally, it should be only one input and one (two is also fine) output during inference.
Could someone provide some guidance on how to modify the code to make it single input?
You can call forward_once during inference: this takes a single input and returns a single output. Note that explicitly calling forward_once will not invoke any hooks you might have on forward/backward calls of your module.
Alternatively, you can make forward_once your module's forward function, and make your training function do the double calling of your model (which makes more sense: Siamese networks is a training method, and not part of a network's architecture).

Can I share weights between keras layers but have other parameters differ?

In keras, is it possible to share weights between two layers, but to have other parameters differ? Consider the following (admittedly a bit contrived) example:
conv1 = Conv2D(64, 3, input_shape=input_shape, padding='same')
conv2 = Conv2D(64, 3, input_shape=input_shape, padding='valid')
Notice that the layers are identical except for the padding. Can I get keras to use the same weights for both? (i.e. also train the network accordingly?)
I've looked at the keras doc, and the section on shared layers seems to imply that sharing works only if the layers are completely identical.
To my knowledge, this cannot be done by the common "API level" of Keras usage.
However, if you dig a bit deeper, there are some (ugly) ways to share the weights.
First of all, the weights of the Conv2D layers are created inside the build() function, by calling add_weight():
self.kernel = self.add_weight(shape=kernel_shape,
initializer=self.kernel_initializer,
name='kernel',
regularizer=self.kernel_regularizer,
constraint=self.kernel_constraint)
For your provided usage (i.e., default trainable/constraint/regularizer/initializer), add_weight() does nothing special but appending the weight variables to _trainable_weights:
weight = K.variable(initializer(shape), dtype=dtype, name=name)
...
self._trainable_weights.append(weight)
Finally, since build() is only called inside __call__() if the layer hasn't been built, shared weights between layers can be created by:
Call conv1.build() to initialize the conv1.kernel and conv1.bias variables to be shared.
Call conv2.build() to initialize the layer.
Replace conv2.kernel and conv2.bias by conv1.kernel and conv1.bias.
Remove conv2.kernel and conv2.bias from conv2._trainable_weights.
Append conv1.kernel and conv1.bias to conv2._trainable_weights.
Finish model definition. Here conv2.__call__() will be called; however, since conv2 has already been built, the weights are not going to be re-initialized.
The following code snippet may be helpful:
def create_shared_weights(conv1, conv2, input_shape):
with K.name_scope(conv1.name):
conv1.build(input_shape)
with K.name_scope(conv2.name):
conv2.build(input_shape)
conv2.kernel = conv1.kernel
conv2.bias = conv1.bias
conv2._trainable_weights = []
conv2._trainable_weights.append(conv2.kernel)
conv2._trainable_weights.append(conv2.bias)
# check if weights are successfully shared
input_img = Input(shape=(299, 299, 3))
conv1 = Conv2D(64, 3, padding='same')
conv2 = Conv2D(64, 3, padding='valid')
create_shared_weights(conv1, conv2, input_img._keras_shape)
print(conv2.weights == conv1.weights) # True
# check if weights are equal after model fitting
left = conv1(input_img)
right = conv2(input_img)
left = GlobalAveragePooling2D()(left)
right = GlobalAveragePooling2D()(right)
merged = concatenate([left, right])
output = Dense(1)(merged)
model = Model(input_img, output)
model.compile(loss='binary_crossentropy', optimizer='adam')
X = np.random.rand(5, 299, 299, 3)
Y = np.random.randint(2, size=5)
model.fit(X, Y)
print([np.all(w1 == w2) for w1, w2 in zip(conv1.get_weights(), conv2.get_weights())]) # [True, True]
One drawback of this hacky weight-sharing is that the weights will not remain shared after model saving/loading. This will not affect prediction, but it may be problematic if you want to load the trained model for further fine-tuning.

CTC implementation in Keras error

I am working on image OCR with my own dataset, I have 1000 images of variable length and I want to feed in images in form of patches of 46X1. I have generated patches of my images and my label values are in Urdu text, so I have encoded them as utf-8. I want to implement CTC in the output layer. I have tried to implement CTC following the image_ocr example at github. But I get the following error in my CTC implementation.
'numpy.ndarray' object has no attribute 'get_shape'
Could anyone guide me about my mistakes? Kindly suggest the solution for it.
My code is:
X_train, X_test, Y_train, Y_test =train_test_split(imageList, labelList, test_size=0.3)
X_train_patches = np.array([image.extract_patches_2d(X_train[i], (46, 1))for i in range (700)]).reshape(700,1,1) #(Samples, timesteps,dimensions)
X_test_patches = np.array([image.extract_patches_2d(X_test[i], (46, 1))for i in range (300)]).reshape(300,1,1)
Y_train=np.array([i.encode("utf-8") for i in str(Y_train)])
Label_length=1
input_length=1
####################Loss Function########
def ctc_lambda_func(args):
y_pred, labels, input_length, label_length = args
# the 2 is critical here since the first couple outputs of the RNN
# tend to be garbage:
y_pred = y_pred[:, 2:, :]
return K.ctc_batch_cost(labels, y_pred, input_length, label_length)
#Building Model
model =Sequential()
model.add(LSTM(20, input_shape=(None, X_train_patches.shape[2]), return_sequences=True))
model.add(Activation('relu'))
model.add(TimeDistributed(Dense(12)))
model.add(Activation('tanh'))
model.add(LSTM(60, return_sequences=True))
model.add(Activation('relu'))
model.add(TimeDistributed(Dense(40)))
model.add(Activation('tanh'))
model.add(LSTM(100, return_sequences=True))
model.add(Activation('relu'))
loss_out = Lambda(ctc_lambda_func, name='ctc')([X_train_patches, Y_train, input_length, Label_length])
The way CTC is modelled currently in Keras is that you need to implement the loss function as a layer, you did that already (loss_out). Your problem is that the inputs you give that layer are not tensors from Theano/TensorFlow but numpy arrays.
To change that one option is to model these values as inputs to your model. This is exactly what the implementation does that you copied the code from:
labels = Input(name='the_labels', shape=[img_gen.absolute_max_string_len], dtype='float32')
input_length = Input(name='input_length', shape=[1], dtype='int64')
label_length = Input(name='label_length', shape=[1], dtype='int64')
# Keras doesn't currently support loss funcs with extra parameters
# so CTC loss is implemented in a lambda layer
loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')([y_pred, labels, input_length, label_length])
To make this work you need to ditch the Sequential model and use the functional model API, exactly as done in the code linked above.