Anomaly Detection with Autoencoder using unlabelled Dataset (How to construct the input data) - deep-learning

I am new in deep learning field, i would like to ask about unlabeled dataset for Anomaly Detection using Autoencoder. my confusing part start at a few questions below:
1) some post are saying separated anomaly and non-anomaly (assume is labelled) from the original dataset, and train AE with the only non-anomaly dataset (usually amount of non-anomaly will be more dominant). So, the question is how am I gonna separate my dataset if it is unlabeled?
2) if I train using the original unlabeled dataset, how to detect the anomaly data?

Label of data doesn't go into autoencoder.
Auto Encoder consists of two parts
Encoder and Decoder
Encoder: It encodes the input data, say a sample with 784 features to 50 features
Decoder: from those 50 features it converts it back to original feature i.e 784 features.
Now to detect anomaly,
if you pass an unknown sample, it should be converted back to its original sample without much loss.
But if there is a lot of error in converting it back. then it could be an anomaly.
Picture Credit: towardsdatascience.com

I think you answered the question already yourself in part: The definition of an anomaly is that it should be considered "a rare event". So even if you don't know the labels, your training data will contain only very few such samples and predominantly learn on what the data usually looks like. So both during training as well as at prediction time, your error will be large for an anomaly. But since such examples should come up only very seldom, this will not influence your embedding much.
In the end, if you can really justify that the anomaly you are checking for is rare, you might not need much pre-processing or labelling. If it occurs more often (a threshold is hard to give for that, but I'd say it should be <<1%), your AE might pick up on that signal and you would really have to get the labels in order to split the data... . But then again: This would not be an anomaly any more, right? Then you could go ahead and train a (balanced) classifier with this data.

Related

Can variational autoencoders be used on non-image data?

I have a question about variational autoencoders (VAE),
I need to generate new data from my dataset which contains just numerical data, so i want to use VAE for that task, but all the available tutorials and articles use images as input data for the variatioanl autoencoder.
My question is: can i use VAE for generating new data from my datasets eventhough my data is not images ??
Thank you.
Short answer is yes. You should read up a bit on the basics of neural nets if this wasn't obvious already - an image is just a Channel X Height X Width dimensional vector. You might use different kinds of layers in your network to suit the kind of data that you have to give a better inductive bias, but otherwise nothing changes. Follow those tutorials!

Model suggestion: Keyword spotting

I want to predict the occurrences of the word "repeat" in a speech as well as the word's approximate duration. For this task, I'm planning to build a Deep Learning model. I've around 50 positive as well as 50 negative utterances (I couldn't collect more).
Initially I've searched for any pretrained models for keyword spotting, but I couldn't get a good one.
Then I tried Speech Recognition models (Deep Speech), but it couldn't predict the exact repeat words as my data follows Indian accent. Also, I've thought that going for ASR models for this task would be a over-killing one.
Now, I've split the entire audio into chunk of 1 secs with 50% overlapping and tried a binary audio classification in each chunk that is whether the chunk has the word "repeat" or not. For building the classification model, I calculated the MFCC features and build a sequence model on the top of it. Nothing seems to work for me.
If anyone already worked with this kind of task, please provide me with a correct method/resources to build a DL model for this task. Thanks in advance!

LSTM Evolution Forecast

I have a confusion about the way the LSTM networks work when forecasting with an horizon that is not finite but I'm rather searching for a prediction in whatever time in future. In physical terms I would call it the evolution of the system.
Suppose I have a time series $y(t)$ (output) I want to forecast, and some external inputs $u_1(t), u_2(t),\cdots u_N(t)$ on which the series $y(t)$ depends.
It's common to use the lagged value of the output $y(t)$ as input for the network, such that I schematically have something like (let's consider for simplicity just lag 1 for the output and no lag for the external input):
[y(t-1), u_1(t), u_2(t),\cdots u_N(t)] \to y(t)
In this way of thinking the network, when one wants to do recursive forecast it is forced to use the predicted value at the previous step as input for the next step. In this way we have an effect of propagation of error that makes the long term forecast badly behaving.
Now, my confusion is, I'm thinking as a RNN as a kind of an (simple version) implementation of a state space model where I have the inputs, my output and one or more state variable responsible for the memory of the system. These variables are hidden and not observed.
So now the question, if there is this kind of variable taking already into account previous states of the system why would I need to use the lagged output value as input of my network/model ?
Getting rid of this does my long term forecast would be better, since I'm not expecting anymore the propagation of the error of the forecasted output. (I guess there will be anyway an error in the internal state propagating)
Thanks !
Please see DeepAR - a LSTM forecaster more than one step into the future.
The main contributions of the paper are twofold: (1) we propose an RNN
architecture for probabilistic forecasting, incorporating a negative
Binomial likelihood for count data as well as special treatment for
the case when the magnitudes of the time series vary widely; (2) we
demonstrate empirically on several real-world data sets that this
model produces accurate probabilistic forecasts across a range of
input characteristics, thus showing that modern deep learning-based
approaches can effective address the probabilistic forecasting
problem, which is in contrast to common belief in the field and the
mixed results
In this paper, they forecast multiple steps into the future, to negate exactly what you state here which is the error propagation.
Skipping several steps allows to get more accurate predictions, further into the future.
One more thing done in this paper is predicting percentiles, and interpolating, rather than predicting the value directly. This adds stability, and an error assessment.
Disclaimer - I read an older version of this paper.

Does any H2O algorithm support multi-label classification?

Is deep learning model supports multi-label classification problem or any other algorithms in H2O?
Orginal Response Variable -Tags:
apps, email, mail
finance,freelancers,contractors,zen99
genomes
gogovan
brazil,china,cloudflare
hauling,service,moving
ferguson,crowdfunding,beacon
cms,naytev
y,combinator
in,store,
conversion,logic,ad,attribution
After mapping them on the keys of the dictionary:
Then
Response variable look like this:
[74]
[156, 89]
[153, 13, 133, 40]
[150]
[474, 277, 113]
[181, 117]
[15, 87, 8, 11]
Thanks
No, H2O only contains algorithms that learn to predict a single response variable at a time. You could turn each unique combination into a single class and train a multi-class model that way, or predict each class with a separate model.
Any algorithm that creates a model that gives you "finance,freelancers,contractors,zen99" for one set of inputs, and "cms,naytev" for another set of inputs is horribly over-fitted. You need to take a step back and think about what your actual question is.
But in lieu of that, here is one idea: train some word embeddings (or use some pre-trained ones) on your answer words. You could then average the vectors for each set of values, and hope this gives you a good numeric representation of the "topic". You then need to turn your, say, 100 dimensional averaged word vector into a single number (PCA comes to mind). And now you have a single number that you can give to a machine learning algorithm, and that it can predict.
You still have a problem: having predicted a number, how do you turn that number into a 100-dim vector, and from there in to a topic, and from there into topic words? Tricky, but maybe not impossible.
(As an aside, if you turn the above "single number" into a factor, and have the machine learning model do a categorization, to predicting the most similar topic to those it has seen before... you've basically gone full circle and will get a model identical to the one you started with that has too many classes.)

Can I use autoencoder for clustering?

In the below code, they use autoencoder as supervised clustering or classification because they have data labels.
http://amunategui.github.io/anomaly-detection-h2o/
But, can I use autoencoder to cluster data if I did not have its labels.?
Regards
The deep-learning autoencoder is always unsupervised learning. The "supervised" part of the article you link to is to evaluate how well it did.
The following example (taken from ch.7 of my book, Practical Machine Learning with H2O, where I try all the H2O unsupervised algorithms on the same data set - please excuse the plug) takes 563 features, and tries to encode them into just two hidden nodes.
m <- h2o.deeplearning(
2:564, training_frame = tfidf,
hidden = c(2), auto-encoder = T, activation = "Tanh"
)
f <- h2o.deepfeatures(m, tfidf, layer = 1)
The second command there extracts the hidden node weights. f is a data frame, with two numeric columns, and one row for every row in the tfidf source data. I chose just two hidden nodes so that I could plot the clusters:
Results will change on each run. You can (maybe) get better results with stacked auto-encoders, or using more hidden nodes (but then you cannot plot them). Here I felt the results were limited by the data.
BTW, I made the above plot with this code:
d <- as.matrix(f[1:30,]) #Just first 30, to avoid over-cluttering
labels <- as.vector(tfidf[1:30, 1])
plot(d, pch = 17) #Triangle
text(d, labels, pos = 3) #pos=3 means above
(P.S. The original data came from Brandon Rose's excellent article on using NLTK. )
In some aspects encoding data and clustering data share some overlapping theory. As a result, you can use Autoencoders to cluster(encode) data.
A simple example to visualize is if you have a set of training data that you suspect has two primary classes. Such as voter history data for republicans and democrats. If you take an Autoencoder and encode it to two dimensions then plot it on a scatter plot, this clustering becomes more clear. Below is a sample result from one of my models. You can see a noticeable split between the two classes as well as a bit of expected overlap.
The code can be found here
This method does not require only two binary classes, you could also train on as many different classes as you wish. Two polarized classes is just easier to visualize.
This method is not limited to two output dimensions, that was just for plotting convenience. In fact, you may find it difficult to meaningfully map certain, large dimension spaces to such a small space.
In cases where the encoded (clustered) layer is larger in dimension it is not as clear to "visualize" feature clusters. This is where it gets a bit more difficult, as you'll have to use some form of supervised learning to map the encoded(clustered) features to your training labels.
A couple ways to determine what class features belong to is to pump the data into knn-clustering algorithm. Or, what I prefer to do is to take the encoded vectors and pass them to a standard back-error propagation neural network. Note that depending on your data you may find that just pumping the data straight into your back-propagation neural network is sufficient.