Why does the unsupervised model need nn.Diag? - deep-learning

I am trying to learn deep learning.
The Torch tutorials define the following two models:
https://github.com/torch/tutorials/blob/master/2_supervised/2_model.lua
https://github.com/torch/tutorials/blob/master/3_unsupervised/2_models.lua
Supervised model
-- Simple 2-layer neural network, with tanh hidden units
model = nn.Sequential()
model:add(nn.Reshape(ninputs))
model:add(nn.Linear(ninputs,nhiddens))
model:add(nn.Tanh())
model:add(nn.Linear(nhiddens,noutputs))
Unsupervised model
-- encoder
encoder = nn.Sequential()
encoder:add(nn.Linear(inputSize,outputSize))
encoder:add(nn.Tanh())
encoder:add(nn.Diag(outputSize))
-- decoder
decoder = nn.Sequential()
decoder:add(nn.Linear(outputSize,inputSize))
-- complete model
module = unsup.AutoEncoder(encoder, decoder, params.beta)
Why does the unsupervised model need nn.Diag?
Thanks in advance.

nn.Diag scales the encoder output by a learnable vector, which is equivalent to multiplying by a diagonal matrix whose diagonal is that vector. This is described in section 3.1 of the paper Learning Fast Approximations of Sparse Coding. The scaling is applied to the tanh output, and the two together form the non-linearity.
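To make the scaling concrete, here is a minimal PyTorch sketch of such a layer. The tutorial itself uses Lua Torch, so the class name `DiagScale`, the layer sizes, and the initialization to ones are assumptions made for illustration, not the library's implementation. The sketch checks that an element-wise learnable gain gives the same result as multiplying by an explicit diagonal matrix.

```python
import torch
import torch.nn as nn


class DiagScale(nn.Module):
    """Element-wise scaling by a learnable vector (hypothetical stand-in for nn.Diag)."""

    def __init__(self, size):
        super().__init__()
        # One learnable gain per unit; initializing to ones is an assumption.
        self.weight = nn.Parameter(torch.ones(size))

    def forward(self, x):
        # Broadcasting multiplies every row of x by the same gain vector.
        return x * self.weight


input_size, output_size = 8, 4  # illustrative sizes
encoder = nn.Sequential(
    nn.Linear(input_size, output_size),
    nn.Tanh(),
    DiagScale(output_size),
)

x = torch.randn(3, input_size)
code = encoder(x)

# Same result as multiplying the tanh output by an explicit diagonal matrix.
diag_matrix = torch.diag(encoder[2].weight)
reference = torch.tanh(encoder[0](x)) @ diag_matrix
```

Since tanh is bounded in (-1, 1), a per-unit gain like this is one way to let the code take values outside that range while adding only `output_size` extra parameters.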

Related

Probabilistic CNN using pre-trained models

I am currently working on estimating uncertainty in deep learning models. I came across this tutorial, where a probabilistic CNN is implemented for classifying the MNIST dataset. However, the model is a custom deep model, as shown below:
import tensorflow_probability as tfp
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

tfd = tfp.distributions
tfpl = tfp.layers

def get_probabilistic_model(input_shape, loss, optimizer, metrics):
    """
    This function should return the probabilistic model according to the
    above specification.
    The function takes input_shape, loss, optimizer and metrics as arguments, which should be
    used to define and compile the model.
    Your function should return the compiled model.
    """
    model = Sequential([
        Conv2D(kernel_size=(5, 5), filters=8, activation='relu', padding='VALID', input_shape=input_shape),
        MaxPooling2D(pool_size=(6, 6)),
        Flatten(),
        Dense(tfpl.OneHotCategorical.params_size(10)),
        tfpl.OneHotCategorical(10, convert_to_tensor_fn=tfd.Distribution.mode)
    ])
    model.compile(loss=loss, optimizer=optimizer, metrics=metrics)
    return model
However, I want to use a pre-trained deep learning model such as EfficientNet or MobileNet in place of the custom model above, and estimate the uncertainty of its predictions for my problem. How do I go about doing that?

How to use pytorch multi-head attention for classification task?

I have a dataset where x has shape (10000, 102, 300), i.e. (samples, feature length, dimension), and y has shape (10000,) and holds my binary labels. I want to use multi-head attention in PyTorch. I found the PyTorch documentation here, but it does not explain how to use it. How can I use multi-head attention to classify my dataset?
Here is a simple sketch for classification. The implementation is the same as the encoder layer of a Transformer, except that at the end you need a global average pooling layer and a dense layer for classification.
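import torch, torch.nn as nn
num_of_heads, num_of_classes = 6, 2  # 300 % num_of_heads must be 0
x = torch.randn(8, 102, 300)  # (batch, feature length, dimension)
attention_layer = nn.MultiheadAttention(300, num_of_heads, dropout=0.1, batch_first=True)
point_wise_neural_network = nn.Sequential(nn.Linear(300, 300), nn.ReLU(), nn.Linear(300, 300))
attention_output, _ = attention_layer(x, x, x)  # self-attention: query = key = value
neural_net_output = point_wise_neural_network(attention_output)
normalize = nn.LayerNorm(300)(x + neural_net_output)
global_average_pooling = normalize.mean(dim=1)  # average over the 102 positions
logits = nn.Linear(300, num_of_classes)(global_average_pooling)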
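The same idea can be packaged as a trainable module. The following is a sketch under stated assumptions, not a reference implementation: the class name `AttentionClassifier`, the choice of 6 heads, the feed-forward width of 512, and the use of a single output logit with `nn.BCEWithLogitsLoss` for the binary label are all illustrative choices. It runs one optimization step on random data shaped like the dataset in the question.

```python
import torch
import torch.nn as nn


class AttentionClassifier(nn.Module):
    """Self-attention block, mean pooling over the sequence, linear head."""

    def __init__(self, embed_dim=300, num_heads=6, ff_dim=512, dropout=0.1):
        super().__init__()
        # embed_dim must be divisible by num_heads (300 / 6 = 50 per head).
        self.attention = nn.MultiheadAttention(
            embed_dim, num_heads, dropout=dropout, batch_first=True
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, embed_dim)
        )
        self.norm2 = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, 1)  # one logit for a binary label

    def forward(self, x):
        # x: (batch, seq_len, embed_dim); query = key = value for self-attention.
        attended, _ = self.attention(x, x, x)
        x = self.norm1(x + attended)
        x = self.norm2(x + self.feed_forward(x))
        pooled = x.mean(dim=1)  # global average pooling over the sequence
        return self.head(pooled).squeeze(-1)  # shape: (batch,)


model = AttentionClassifier()
x = torch.randn(16, 102, 300)           # a mini-batch shaped like the dataset
y = torch.randint(0, 2, (16,)).float()  # random binary labels for the demo

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()        # applies the sigmoid internally

logits = model(x)
loss = loss_fn(logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

At inference time, `torch.sigmoid(model(x))` gives the probability of the positive class.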

BertModel or BertForPreTraining

I want to use Bert only for embedding and use the Bert output as an input for a classification net that I will build from scratch.
I am not sure if I want to do finetuning for the model.
I think the relevant classes are BertModel or BertForPreTraining.
The BertForPreTraining head contains two parts:
self.predictions is the MLM (Masked Language Modeling) head, which is what gives BERT the power to fix grammar errors, and self.seq_relationship is the NSP (Next Sentence Prediction) head, usually referred to as the classification head.
class BertPreTrainingHeads(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.predictions = BertLMPredictionHead(config)
        self.seq_relationship = nn.Linear(config.hidden_size, 2)
I think NSP isn't relevant for my task, so I can "override" it.
What does the MLM head do, and is it relevant for my goal, or should I use BertModel?
You should be using BertModel instead of BertForPreTraining.
BertForPreTraining is used to train BERT on the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks. Neither head is meant for classification.
BertModel simply returns the output of the BERT encoder; you can then fine-tune it along with the classifier that you build on top. If the classifier is just a single layer on top of BERT, you can use BertForSequenceClassification directly.
In any case, if you just want to take the output of BERT and train your classifier without fine-tuning BERT, you can freeze the BERT weights using:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
# model.bert is the underlying BertModel; the classification head stays trainable
for param in model.bert.parameters():
    param.requires_grad = False
The above code is borrowed from here
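For the case in the question, where BERT only supplies embeddings to a classifier built from scratch, the same freezing pattern applies to BertModel directly. The sketch below assumes the Hugging Face `transformers` package; to run without downloading weights it builds a tiny randomly initialized BERT from a `BertConfig`, whereas in practice you would load `BertModel.from_pretrained('bert-base-uncased')`. The classifier layout, `num_classes`, and the choice of the first token's hidden state as the sentence embedding are assumptions for illustration.

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

# Tiny randomly initialized BERT so the sketch runs without a download.
# In practice: bert = BertModel.from_pretrained('bert-base-uncased')
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
bert = BertModel(config)

# Freeze the encoder so that only the classifier is trained.
for param in bert.parameters():
    param.requires_grad = False
bert.eval()  # also disables dropout inside the frozen encoder

num_classes = 3  # hypothetical number of target classes
classifier = nn.Sequential(
    nn.Linear(config.hidden_size, 16),
    nn.ReLU(),
    nn.Linear(16, num_classes),
)

input_ids = torch.randint(0, config.vocab_size, (4, 10))  # fake token ids
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    outputs = bert(input_ids=input_ids, attention_mask=attention_mask)
embedding = outputs.last_hidden_state[:, 0]  # hidden state at the first ([CLS]) position

logits = classifier(embedding)
trainable = [
    p for p in list(bert.parameters()) + list(classifier.parameters())
    if p.requires_grad
]
```

Only the classifier's parameters remain in `trainable`, so an optimizer built from that list leaves BERT untouched. Dropping the freezing loop and the `torch.no_grad()` context turns the same code into a fine-tuning setup.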

How to get logits as neural network output

Simple and short question. I have a network (U-Net) which performs image segmentation. I want the logits as the output to feed into the cross-entropy loss (using PyTorch). Currently my final layer looks like this:
class Logits(nn.Sequential):
    def __init__(self,
                 in_channels,
                 n_class
                 ):
        super(Logits, self).__init__()
        # fully connected layer outputting the prediction layers for each of my classes
        self.conv = self.add_module('conv_out',
                                    nn.Conv2d(in_channels,
                                              n_class,
                                              kernel_size = 1
                                              )
                                    )
        self.activ = self.add_module('sigmoid_out',
                                     nn.Sigmoid()
                                     )
Is it correct to use the sigmoid activation function here? Does this give me logits?
When people talk about "logits" they usually refer to the "raw" n_class-dimensional output vector. For multi-class classification (n_class > 2) you want to convert the n_class-dimensional vector of raw "logits" into an n_class-dimensional probability vector.
That is, you want prob = f(logits) with prob_i >= 0 for all n_class entries, and sum(prob) = 1.
The most straightforward way of doing that in a differentiable way is to use the softmax function:
prob_i = softmax(logits)_i = exp(logits_i) / sum_j exp(logits_j)
It is easy to see that the output of softmax is indeed an n_class-dimensional probability vector (I leave it to you as a short exercise).
BTW, this is why the raw predictions are called "logits": up to an additive constant, they are the "log" of the predicted output probabilities.
Now, it is customary not to explicitly compute the softmax on top of a classification network and defer its computation to the loss function, e.g. nn.CrossEntropyLoss that internally computes the softmax and requires the raw logits as inputs, rather than the normalized probabilities. This is done mainly for numerical stability.
Therefore, if you are training a multi-class classification network with nn.CrossEntropyLoss you do not need to worry at all about the final activation and simply output the raw logits from your final conv/linear layer.
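Most importantly, do not put an nn.Sigmoid() activation in front of nn.CrossEntropyLoss: the loss would apply its softmax to values already squashed into (0, 1), and the sigmoid tends to have saturated gradients, which will mess up your training.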
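As a small illustration of this answer, the sketch below feeds the raw output of a final 1x1 convolution into nn.CrossEntropyLoss for a segmentation-shaped batch, and checks that the loss equals a log-softmax followed by the negative log-likelihood. The tensor sizes and the name `final_conv` are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_class, in_channels = 3, 8
final_conv = nn.Conv2d(in_channels, n_class, kernel_size=1)  # no activation after it

features = torch.randn(2, in_channels, 4, 4)    # (batch, channels, H, W)
target = torch.randint(0, n_class, (2, 4, 4))   # one class index per pixel

logits = final_conv(features)                   # raw scores, (batch, n_class, H, W)
loss = nn.CrossEntropyLoss()(logits, target)

# The same value computed by hand: log-softmax over the class dimension, then NLL.
manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

# Probabilities are only needed at inference time.
probs = logits.softmax(dim=1)
pred = probs.argmax(dim=1)                      # (batch, H, W)
```

Note that the target holds integer class indices with shape (batch, H, W), not one-hot vectors, and that the softmax is taken over dimension 1, the class dimension.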
As far as I understand, you are working on a multi-label classification task, where a single input can have several labels, hence your use of nn.Sigmoid (vs. nn.Softmax for multi-class classification).
There is a loss function that combines nn.Sigmoid and nn.BCELoss: nn.BCEWithLogitsLoss. Its input is a vector of logits whose length is the number of classes, and the target has the same shape: a multi-hot encoding, with 1s for the active classes.
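A short sketch of the multi-label case, with made-up sizes: the network outputs raw logits, nn.BCEWithLogitsLoss applies the sigmoid internally, and the result matches nn.BCELoss on explicitly computed sigmoid probabilities. The 0.5 decision threshold at the end is an assumption, not part of the loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_class = 4
logits = torch.randn(5, n_class)                     # raw outputs, no sigmoid layer
target = torch.randint(0, 2, (5, n_class)).float()   # multi-hot labels, same shape

loss = nn.BCEWithLogitsLoss()(logits, target)
reference = nn.BCELoss()(torch.sigmoid(logits), target)

# At inference time, apply the sigmoid yourself and threshold each class independently.
probs = torch.sigmoid(logits)
predicted = probs > 0.5  # threshold chosen for illustration
```

For segmentation the same loss accepts logits of shape (batch, n_class, H, W) with a float target of identical shape.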

Wrong input shape for lstm keras

I came across a tutorial where the author uses an LSTM network for time series prediction like this:
trainX = numpy.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
testX = numpy.reshape(testX, (testX.shape[0], 1, testX.shape[1]))
model = Sequential()
model.add(LSTM(4, input_shape=(1, look_back)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)
Do we agree that the LSTM here acts like a regular feed-forward network (and is useless?), since it sees only one time step and stateful is not set to True? Am I right?
Generally speaking, you are correct. The input shape should be (window length, number of features).
However, there has been some success in shaping the input the way you describe above. Below is a paper whose authors beat many top-performing algorithms by doing so; they used 1D convolutional layers, fed through a separate input, to handle the time series pattern.
LSTM Fully Convolutional Networks for Time Series Classification
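The two input layouts discussed in this thread differ only in how the same windows are reshaped. The NumPy sketch below uses a toy series and a hypothetical `look_back` of 3 to show both: the tutorial's layout with a single time step, and the (window length, number of features) layout mentioned in the answer.

```python
import numpy as np

look_back = 3
series = np.arange(10, dtype=np.float32)  # toy series 0..9

# Sliding windows: each row holds look_back consecutive values,
# and the target is the value that follows the window.
windows = np.stack([series[i:i + look_back] for i in range(len(series) - look_back)])
targets = series[look_back:]

# Tutorial layout: one time step with look_back features,
# matching input_shape=(1, look_back).
one_step = windows.reshape(windows.shape[0], 1, look_back)

# Alternative layout: look_back time steps with one feature each,
# matching input_shape=(look_back, 1).
multi_step = windows.reshape(windows.shape[0], look_back, 1)
```

With the second layout the LSTM processes the window one value at a time and carries its state across the `look_back` steps within each sample.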