I have a dataset where x has shape (10000, 102, 300), i.e. (samples, sequence length, embedding dimension), and y has shape (10000,), which holds my binary labels. I want to use multi-head attention in PyTorch. I looked at the PyTorch documentation, but it does not explain how to use the layer. How can I use multi-head attention for classification on my dataset?
I will write some simple, tidy pseudocode for classification that should work fine. If you need the implementation details: this part is the same as the encoder layer in a Transformer, except that at the end you need a global average pooling layer and a dense layer for classification (a fuller runnable sketch follows the pseudocode below).
attention_layer = nn.MultiheadAttention(embed_dim=300, num_heads=num_of_heads, dropout=0.1, batch_first=True)  # 300 % num_of_heads must be 0
attention_output, _ = attention_layer(x, x, x)            # self-attention: query = key = value = input
ffn_output = point_wise_feed_forward(attention_output)    # position-wise feed-forward network
normalized = nn.LayerNorm(300)(x + ffn_output)            # residual connection + layer normalization
pooled = normalized.mean(dim=1)                           # global average pooling over the sequence dimension
logits = nn.Linear(300, num_of_classes)(pooled)
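For concreteness, here is a minimal self-contained sketch of that idea in PyTorch (the class name AttentionClassifier, num_heads=6, and the feed-forward width are my own choices, not anything prescribed by the docs; batch_first=True needs PyTorch 1.9 or newer):

import torch
import torch.nn as nn

class AttentionClassifier(nn.Module):
    """Encoder-style block: self-attention -> feed-forward, each with residual + LayerNorm,
    then mean-pool over the sequence and classify."""
    def __init__(self, embed_dim=300, num_heads=6, num_classes=2, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.ReLU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                    # x: (batch, seq_len=102, embed_dim=300)
        attn_out, _ = self.attn(x, x, x)     # self-attention: query = key = value
        x = self.norm1(x + attn_out)         # residual + LayerNorm
        x = self.norm2(x + self.ffn(x))      # position-wise feed-forward + residual
        x = x.mean(dim=1)                    # global average pooling over the sequence
        return self.classifier(x)            # logits, to be used with nn.CrossEntropyLoss

model = AttentionClassifier()
logits = model(torch.randn(32, 102, 300))    # -> shape (32, 2)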
I am a beginner in deep learning.
I am using this dataset and I want my network to detect keypoints of a hand.
How can I make my output layer's nodes lie in the range [-1, 1] (the range of normalized 2D points)?
Another problem is that when I train for more than one epoch, the loss takes negative values.
criterion: torch.nn.MultiLabelSoftMarginLoss() and optimizer: torch.optim.SGD()
Here you can find my repo.
net = nnModel.Net()
net = net.to(device)
criterion = nn.MultiLabelSoftMarginLoss()
optimizer = optim.SGD(net.parameters(), lr=learning_rate)
lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer=optimizer, gamma=decay_rate)
You can use the Tanh activation function, since the image of the function lies in [-1, 1].
The problem of predicting key-points in an image is more of a regression problem than a classification problem (especially if you're making your model outputs + targets fall within a continuous interval). Therefore, I suggest you use the L2 Loss.
In fact, it could be a good exercise for you to determine, using cross-validation, which of the loss functions appropriate for regression problems gives the lowest expected generalization error. There are several such functions available in PyTorch.
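As a rough illustration of that exercise (a sketch only: KFold comes from scikit-learn, while dataset, nnModel.Net, device, and the train/evaluate helpers stand in for your own data and your existing training/evaluation routines):

import torch.nn as nn
from sklearn.model_selection import KFold

candidate_losses = {
    'MSE (L2)': nn.MSELoss(),
    'L1': nn.L1Loss(),
    'SmoothL1': nn.SmoothL1Loss(),
}

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for name, criterion in candidate_losses.items():
    fold_errors = []
    for train_idx, val_idx in kfold.split(range(len(dataset))):
        model = nnModel.Net().to(device)              # fresh model for every fold
        train(model, criterion, train_idx)            # your existing training loop
        fold_errors.append(evaluate(model, val_idx))  # e.g. mean keypoint error on the held-out fold
    print(name, sum(fold_errors) / len(fold_errors))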
One way I can think of is to use torch.nn.Sigmoid, which produces outputs in the [0, 1] range, and then scale the outputs to [-1, 1] using the transformation 2*x - 1.
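Both suggestions look like this in code (a minimal sketch; the 512-dimensional feature size and the 42 outputs for 21 (x, y) keypoints are assumptions about your network, not taken from the repo):

import torch
import torch.nn as nn

backbone_features = 512                      # assumed size of the last hidden layer

# Option 1: Tanh squashes the outputs into [-1, 1].
head_tanh = nn.Sequential(
    nn.Linear(backbone_features, 42),
    nn.Tanh(),
)

# Option 2: sigmoid produces (0, 1); 2*x - 1 rescales it to (-1, 1).
class ScaledSigmoidHead(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features)

    def forward(self, x):
        return 2 * torch.sigmoid(self.fc(x)) - 1

criterion = nn.MSELoss()                     # L2 loss for coordinate regression; it is never negative
pred = head_tanh(torch.randn(4, backbone_features))
target = torch.rand(4, 42) * 2 - 1           # dummy normalized keypoints in [-1, 1]
loss = criterion(pred, target)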
I will explain my problem:
I have around 50,000 samples, each one described by a list of codes representing "events".
The number of unique codes is around 800.
The maximum number of codes that a sample can have is around 600.
I want to represent each sample using one-hot encoding. If we pad the samples that have fewer codes, the representation should be an 800x600 matrix.
Feeding this new representation to a network means flattening each matrix into a vector of size 800x600 (480,000 values).
In the end, the dataset would consist of 50,000 vectors of size 480,000.
Now, I have two considerations:
How is it possible to handle a dataset of that size? (I tried data generators to build the representation on the fly, but they are really slow.)
Having a vector of size 480,000 as input for each sample means that the complexity of my model (the number of parameters to learn) is extremely high (around 15,000,000 in my case), so I would need a huge dataset to train the model properly. Wouldn't I?
Why don't you use the conventional approach used in NLP?
These events can be represented, as you say, by an embedding matrix.
Then you can model the chains of events using an LSTM (or GRU, plain RNN, or bidirectional LSTM); the difference between an LSTM and a conventional feed-forward network is that the same module is applied repeatedly, N times.
So your input is not really 480,000 values: internally, an event A indirectly helps you learn about an event B, because the LSTM has a module that repeats itself for each event in the chain.
You have an example here:
https://www.kaggle.com/ngyptr/lstm-sentiment-analysis-keras
Broadly speaking, here is what I would do (in Keras pseudo-code):
Determine the total number of events and generate a list of the unique ones:
unique_events = list(set([event_0, ..., event_n]))
You can perform the translation of a sequence with:
seq_events_idx = list(map(unique_events.index, seq_events))
Add the necessary padding to each sequence:
sequences_pad = pad_sequences(sequences, maxlen=max_seq)
Then you can use an Embedding layer directly to map each event to an associated vector of whatever dimension you consider appropriate.
input_ = Input(shape=(max_seq,), dtype='int32')
embedding = Embedding(len(unique_events),
                      dimensions,
                      input_length=max_seq,
                      trainable=True)(input_)
Then you define the architecture of your LSTM (for example):
lstm = LSTM(128, dropout=0.2, recurrent_dropout=0.2)(embedding)  # return_sequences=False (the default), so only the final state is passed on
Finally, add the dense layer with the output you want:
out = Dense(10, activation='softmax')(lstm)
I think that this type of model can help you and give better results.
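Tying the snippets above together, a minimal end-to-end sketch could look like this (it reuses unique_events, sequences_pad, max_seq, and dimensions from above; the 10-class softmax, the embedding size, and the labels array are illustrative assumptions):

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense

input_ = Input(shape=(max_seq,), dtype='int32')
embedding = Embedding(len(unique_events), dimensions, input_length=max_seq)(input_)
lstm = LSTM(128, dropout=0.2, recurrent_dropout=0.2)(embedding)
out = Dense(10, activation='softmax')(lstm)

model = Model(inputs=input_, outputs=out)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(sequences_pad, labels, epochs=10, batch_size=64, validation_split=0.1)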
In my neural network model, I represent an 8-word sentence with an 8x256 embedding matrix. I want to feed it to an LSTM, where the LSTM takes a single word embedding at a time as input and processes it. According to the PyTorch documentation, the input should have the shape (seq_len, batch, input_size). What is the correct way to convert my input to the desired shape? I don't want to mix up the numbers by mistake. I am quite new to PyTorch and row-major calculations, so I wanted to ask here. I do it as follows; is it correct?
x = torch.rand(8, 256)
lstm_input = torch.reshape(x, (8, 1, 256))
Your solution is correct: you added a singleton dimension for the "batch" dimension, leaving x with temporal dimension 8 and input dimension 256.
Since you are new to PyTorch, here are a few equivalent ways of doing the same thing:
x = x[:, None, :]
Putting None at dim=1 tells PyTorch to add a singleton dimension.
Another way is to use view:
x = x.view(8, 1, 256)
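For completeness, here are two more equivalent options, plus a note on batch_first (a small sketch; the hidden_size of 128 is arbitrary):

import torch

x = torch.rand(8, 256)            # (seq_len, input_size)

lstm_input = x.unsqueeze(1)       # insert a batch dimension at position 1 -> (8, 1, 256)
print(lstm_input.shape)           # torch.Size([8, 1, 256])

lstm = torch.nn.LSTM(input_size=256, hidden_size=128)
output, (h_n, c_n) = lstm(lstm_input)   # output: (8, 1, 128)

# If you construct the LSTM with batch_first=True instead, it expects
# (batch, seq_len, input_size), i.e. x.unsqueeze(0) with shape (1, 8, 256).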
I came across a tutorial where the author uses an LSTM network for a time series prediction like this:
trainX = numpy.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))  # -> (samples, 1 time step, look_back features)
testX = numpy.reshape(testX, (testX.shape[0], 1, testX.shape[1]))
model = Sequential()
model.add(LSTM(4, input_shape=(1, look_back)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)
We agree that the LSTM here acts like a normal NN (and is useless?), since the LSTM gets only one time step and stateful=True is not set. Am I right?
Generally speaking, you are correct. The input shape should be (window length, n features).
However, there has been some success with transforming the input in the way you describe above. Below is a paper where the authors were able to beat many top-performing algorithms by doing so; they used 1D convolutional layers to handle the time series pattern through a separate input.
LSTM Fully Convolutional Networks for Time Series Classification
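If you want the LSTM's recurrence to actually unroll over the window, a minimal sketch of the alternative reshaping would be (reusing trainX, testX, trainY, and look_back from the question):

import numpy
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Treat each window as look_back time steps with 1 feature per step.
trainX = numpy.reshape(trainX, (trainX.shape[0], trainX.shape[1], 1))   # (samples, look_back, 1)
testX = numpy.reshape(testX, (testX.shape[0], testX.shape[1], 1))

model = Sequential()
model.add(LSTM(4, input_shape=(look_back, 1)))   # the recurrence now runs over look_back steps
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)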
I am trying to implement, in Keras, a model that is composed of two layers to segment object candidates.
So basically this model has the following architecture:
Image (channels, width, height) -> multiple convolution and pooling layers -> output (n feature maps, height, width)
Now this single output is used by two layers, which are as follows:
1) 1x1 convolution -> dense layer with m units (output = n * 1 * 1) -> pixel classifier using fully connected layers of dimension h*w -> upsampling to (H, W) -> output
2) convolution -> max pooling -> dense layer -> score
The cost function uses the outputs of both of these layers; it is the sum of a binary logistic regression loss on each output.
Now I have two questions:
1) How do I implement the dense connection over the convolutional output in layer 1 to produce the h*w pixel classifier mentioned above?
2) How do I merge the two layers to compute the single cost function and then train both layers jointly using back-propagation?
Can anyone tell me how to create a model for the architecture described above? I am new to deep learning, so if there is something I have misunderstood, I would appreciate it if anyone could explain the errors in my understanding.
Thanks
It's easier when you share the code you already have.
For the transition from convolution to dense, you have to use model.add(Flatten()), as in the examples here.
Unfortunately, I don't know the answer to the second question for sure, but according to what I just read about Keras models, you have to use the graph-style model with multiple outputs.
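For what it's worth, here is a minimal sketch of the two-branch idea with the Keras functional API (all layer sizes, layer names, and the 224x224 input are illustrative assumptions, not the exact architecture); the joint training asked about in question 2 falls out naturally because the total loss is the weighted sum of the per-output losses:

from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Reshape, UpSampling2D

h, w = 56, 56                                            # assumed spatial size of the shared feature maps

image = Input(shape=(224, 224, 3))
features = Conv2D(64, (3, 3), activation='relu', padding='same')(image)
features = MaxPooling2D((4, 4))(features)                # shared trunk -> (56, 56, 64)

# Branch 1: 1x1 conv -> dense layers -> per-pixel mask, upsampled back to the image size
mask = Conv2D(32, (1, 1), activation='relu')(features)
mask = Flatten()(mask)
mask = Dense(512, activation='relu')(mask)
mask = Dense(h * w, activation='sigmoid')(mask)          # one score per pixel
mask = Reshape((h, w, 1))(mask)
mask = UpSampling2D((4, 4), name='mask_out')(mask)       # back to (224, 224, 1)

# Branch 2: conv -> max pooling -> dense -> single objectness score
score = Conv2D(32, (3, 3), activation='relu')(features)
score = MaxPooling2D((2, 2))(score)
score = Flatten()(score)
score = Dense(1, activation='sigmoid', name='score_out')(score)

model = Model(inputs=image, outputs=[mask, score])
# Both branches are trained jointly: the total loss is the (weighted) sum of the two losses.
model.compile(optimizer='adam',
              loss={'mask_out': 'binary_crossentropy', 'score_out': 'binary_crossentropy'},
              loss_weights={'mask_out': 1.0, 'score_out': 1.0})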