Positional Embedding in Transformers - Time Series Data - deep-learning

I'm adding Multi-Headed attention at the input of my CNN to improve interpretability and explainability of my model. The data is formed as time-series 3D input of shape (125, 5, 6) where 5x6 part represents the data in a single sample and 125 represents 125 ordered samples (basically 1 second long recording of 5x6 time-series data).
While reading about transformers for time-series and video data I've come across positional embedding. From my understanding, positional embedding is used to connect the data to its order so that learning is not completely random which would be wrong in the case of data where order is crucial. Now, one thing I don't understand is if my model needs positional embedding considering I input it in order as (125, 5, 6) and separate inputs don't have any connection, but the order of these 125 samples must be preserved.
Essentially, I'm not sure if the addition to my model should be:
Multi-headed attention -> Dropout -> Normalization -> the developed CNN
or
Positional-embedding -> Multi-headed attention -> Dropout -> Normalization -> the developed CNN
or
Multi-headed attention -> the developed CNN
or
Positional-embedding -> Multi-headed attention -> the developed CNN
or something completely different.
Another question is: Do I need more than a single layer of attention considering that the addition of it is solely with purpose of improving interpretability and explainability?
Ps. I'm aware the data needs to be reshaped into (125, 5, 6, 1) to be able to use Transformers.

Related

Anomaly Detection with Autoencoder using unlabelled Dataset (How to construct the input data)

I am new in deep learning field, i would like to ask about unlabeled dataset for Anomaly Detection using Autoencoder. my confusing part start at a few questions below:
1) some post are saying separated anomaly and non-anomaly (assume is labelled) from the original dataset, and train AE with the only non-anomaly dataset (usually amount of non-anomaly will be more dominant). So, the question is how am I gonna separate my dataset if it is unlabeled?
2) if I train using the original unlabeled dataset, how to detect the anomaly data?
Label of data doesn't go into autoencoder.
Auto Encoder consists of two parts
Encoder and Decoder
Encoder: It encodes the input data, say a sample with 784 features to 50 features
Decoder: from those 50 features it converts it back to original feature i.e 784 features.
Now to detect anomaly,
if you pass an unknown sample, it should be converted back to its original sample without much loss.
But if there is a lot of error in converting it back. then it could be an anomaly.
Picture Credit: towardsdatascience.com
I think you answered the question already yourself in part: The definition of an anomaly is that it should be considered "a rare event". So even if you don't know the labels, your training data will contain only very few such samples and predominantly learn on what the data usually looks like. So both during training as well as at prediction time, your error will be large for an anomaly. But since such examples should come up only very seldom, this will not influence your embedding much.
In the end, if you can really justify that the anomaly you are checking for is rare, you might not need much pre-processing or labelling. If it occurs more often (a threshold is hard to give for that, but I'd say it should be <<1%), your AE might pick up on that signal and you would really have to get the labels in order to split the data... . But then again: This would not be an anomaly any more, right? Then you could go ahead and train a (balanced) classifier with this data.

One Hot Encoding dimension - Model Compexity

I will explain my problem:
I have around 50.000 samples, each of one described by a list of codes representing "events"
The number of unique codes are around 800.
The max number of codes that a sample could have is around 600.
I want to represent each sample using one-hot encoding. The representation should be, if we consider the operation of padding for those samples that has fewer codes, a 800x600 matrix.
Giving this new representation as input of a network, means to flatten each matrix to a vector of size 800x600 (460.000 values).
At the end the dataset should consist in 50.000 vectors of size 460.000 .
Now, I have two considerations:
How is it possible to handle a dataset of that size?(I tried data generator to obtain the representation on-the-fly but they are really slow).
Having a vector of size 460.000 as input for each sample, means that the complexity of my model( number of parameters to learn ) is extremely high ( around 15.000.000 in my case ) and, so, I need an huge dataset to train the model properly. Doesn't it?
Why do not you use the conventional model used in NLP?
These events can be translated as you say by embedding matrix.
Then you can represent the chains of events using LSTM (or GRU or RNN o Bilateral LSTM), the difference of using LSTM instead of a conventional network is that you use the same module repeated by N times.
So your input really is not 460,000, but internally an event A indirectly helps you learn about an event B. That's because the LSTM has a module that repeats itself for each event in the chain.
You have an example here:
https://www.kaggle.com/ngyptr/lstm-sentiment-analysis-keras
Broadly speaking what I would do would be the following (in Keras pseudo-code):
Detect the number of total events. I generate a unique list.
unique_events = list (set ([event_0, ..., event_n]))
You can perform the translation of a sequence with:
seq_events_idx = map (unique_events.index, seq_events)
Add the necessary pad to each sequence:
sequences_pad = pad_sequences (sequences, max_seq)
Then you can directly use an embedding to carry out the transfer of the event to an associated vector of the dimension that you consider.
input_ = Input (shape = (max_seq,), dtype = 'int32')
embedding = Embedding (len(unique_events),
                    dimensions,
                    input_length = max_seq,
                    trainable = True) (input_)
Then you define the architecture of your LSTM (For example):
lstm = LSTM (128, input_shape = (max_seq, dimensions), dropout = 0.2, recurrent_dropout = 0.2, return_sequences = True) (embedding)
Add the dense and the result you want:
out = Dense (10, activation = 'softmax') (lstm)
I think that this type of model can help you and give better results.

Does any H2O algorithm support multi-label classification?

Is deep learning model supports multi-label classification problem or any other algorithms in H2O?
Orginal Response Variable -Tags:
apps, email, mail
finance,freelancers,contractors,zen99
genomes
gogovan
brazil,china,cloudflare
hauling,service,moving
ferguson,crowdfunding,beacon
cms,naytev
y,combinator
in,store,
conversion,logic,ad,attribution
After mapping them on the keys of the dictionary:
Then
Response variable look like this:
[74]
[156, 89]
[153, 13, 133, 40]
[150]
[474, 277, 113]
[181, 117]
[15, 87, 8, 11]
Thanks
No, H2O only contains algorithms that learn to predict a single response variable at a time. You could turn each unique combination into a single class and train a multi-class model that way, or predict each class with a separate model.
Any algorithm that creates a model that gives you "finance,freelancers,contractors,zen99" for one set of inputs, and "cms,naytev" for another set of inputs is horribly over-fitted. You need to take a step back and think about what your actual question is.
But in lieu of that, here is one idea: train some word embeddings (or use some pre-trained ones) on your answer words. You could then average the vectors for each set of values, and hope this gives you a good numeric representation of the "topic". You then need to turn your, say, 100 dimensional averaged word vector into a single number (PCA comes to mind). And now you have a single number that you can give to a machine learning algorithm, and that it can predict.
You still have a problem: having predicted a number, how do you turn that number into a 100-dim vector, and from there in to a topic, and from there into topic words? Tricky, but maybe not impossible.
(As an aside, if you turn the above "single number" into a factor, and have the machine learning model do a categorization, to predicting the most similar topic to those it has seen before... you've basically gone full circle and will get a model identical to the one you started with that has too many classes.)

Expanding CNN with GAN methodology in caffe

I have a network which estimates the depth_map from an input image. In a nutshell, I have an input_image and its corresponding ground_truth (depth map). Let us call this network a generator network. So far so good. Now I have heard of ´Generative Adversarial Networksand I thought I could improve my network with adding aDiscriminator`-Network as follows:
input -> neural network -> estimated_depth_image -> discriminator -> output: true or false depending on real or synthesised
^
|
ground_truth_depth_image
But then, how would I switch between feeding the estimated_depth_image with label 0 and feeding the ground_truth_image with label 1 into the discriminator. Is that even possible? If yes, how would you approach it?
My problem is, how do I feed my new dataset which is consists of all the groud_truth depth images with label 1 and all the estimated_deth_images with label 0 into the discriminator network at the same time using caffe?
As far as I know this is not easy to do in Caffe, because you would need two different optimizers (one for the generator and one for the discriminator). You also need to do the forward pass only in G, and then in D using the output of G and some real depth data. I would recommend to use Tensorflow, torch or pytorch to do that.

Can I use autoencoder for clustering?

In the below code, they use autoencoder as supervised clustering or classification because they have data labels.
http://amunategui.github.io/anomaly-detection-h2o/
But, can I use autoencoder to cluster data if I did not have its labels.?
Regards
The deep-learning autoencoder is always unsupervised learning. The "supervised" part of the article you link to is to evaluate how well it did.
The following example (taken from ch.7 of my book, Practical Machine Learning with H2O, where I try all the H2O unsupervised algorithms on the same data set - please excuse the plug) takes 563 features, and tries to encode them into just two hidden nodes.
m <- h2o.deeplearning(
2:564, training_frame = tfidf,
hidden = c(2), auto-encoder = T, activation = "Tanh"
)
f <- h2o.deepfeatures(m, tfidf, layer = 1)
The second command there extracts the hidden node weights. f is a data frame, with two numeric columns, and one row for every row in the tfidf source data. I chose just two hidden nodes so that I could plot the clusters:
Results will change on each run. You can (maybe) get better results with stacked auto-encoders, or using more hidden nodes (but then you cannot plot them). Here I felt the results were limited by the data.
BTW, I made the above plot with this code:
d <- as.matrix(f[1:30,]) #Just first 30, to avoid over-cluttering
labels <- as.vector(tfidf[1:30, 1])
plot(d, pch = 17) #Triangle
text(d, labels, pos = 3) #pos=3 means above
(P.S. The original data came from Brandon Rose's excellent article on using NLTK. )
In some aspects encoding data and clustering data share some overlapping theory. As a result, you can use Autoencoders to cluster(encode) data.
A simple example to visualize is if you have a set of training data that you suspect has two primary classes. Such as voter history data for republicans and democrats. If you take an Autoencoder and encode it to two dimensions then plot it on a scatter plot, this clustering becomes more clear. Below is a sample result from one of my models. You can see a noticeable split between the two classes as well as a bit of expected overlap.
The code can be found here
This method does not require only two binary classes, you could also train on as many different classes as you wish. Two polarized classes is just easier to visualize.
This method is not limited to two output dimensions, that was just for plotting convenience. In fact, you may find it difficult to meaningfully map certain, large dimension spaces to such a small space.
In cases where the encoded (clustered) layer is larger in dimension it is not as clear to "visualize" feature clusters. This is where it gets a bit more difficult, as you'll have to use some form of supervised learning to map the encoded(clustered) features to your training labels.
A couple ways to determine what class features belong to is to pump the data into knn-clustering algorithm. Or, what I prefer to do is to take the encoded vectors and pass them to a standard back-error propagation neural network. Note that depending on your data you may find that just pumping the data straight into your back-propagation neural network is sufficient.