Case 1:
I am feeding variable-length time-series windows to a GRU model. Sometimes a window contains 900 samples, and sometimes only 16. I chose an RNN model (GRU) because I learned that RNN methods work better on long sequences. I use one GRU layer and return the hidden states for all timestamps in order to keep as much information as possible. I then apply average pooling to the GRU output to obtain a fixed-length representation. The intuition for using average pooling instead of max pooling is that it may summarize the information from all timestamps. Here is the code of the model:
input_layer = tf.keras.Input(shape=input_shape, name="time_series_activity")
input_mask = tf.keras.layers.Masking(mask_value=0.00000)(input_layer)
gru_l5 = tf.keras.layers.GRU(
    64, activation='tanh', recurrent_activation='sigmoid',
    recurrent_initializer=tf.keras.initializers.Orthogonal(),
    dropout=0.5, recurrent_dropout=0.5, return_sequences=True,
)(input_mask)
AP = tf.keras.layers.GlobalAveragePooling1D()(gru_l5)
gru_fm = tf.keras.layers.Dropout(0.3)(AP)
output_layer = tf.keras.layers.Dense(total_classes, activation="softmax")(gru_fm)
return tf.keras.models.Model(inputs=input_layer, outputs=output_layer)
With this model I obtain better performance on the validation set, while performance on the training data goes to 100% (heading toward overfitting). However, the major issue is that the validation loss is "nan". This issue is currently being discussed on GitHub and Stack Overflow.
I tried nearly all of the options suggested here, here and here, but I was unable to resolve this validation_loss = nan issue.
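For reference, the masked average pooling step in Case 1 can be sketched in plain Python. This is only an illustration of the arithmetic, not the Keras implementation: `masked_average_pool`, the toy hidden states, and the explicit `mask` list are hypothetical. The guard for a window with no valid timestep is an assumption about one way a mean can turn into "nan" (dividing by zero valid steps); I have not confirmed that this is what happens in my data.

```python
def masked_average_pool(hidden_states, mask):
    """Average hidden-state vectors over the timesteps whose mask entry is True."""
    # Keep only the timesteps that are real data (not padding).
    valid = [h for h, keep in zip(hidden_states, mask) if keep]
    if not valid:
        # Assumption: a fully padded window would otherwise divide by zero,
        # so return a zero vector instead.
        return [0.0] * len(hidden_states[0])
    n_features = len(valid[0])
    return [sum(h[i] for h in valid) / len(valid) for i in range(n_features)]


# Hypothetical GRU outputs: two real timesteps followed by one padded timestep.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [True, True, False]
pooled = masked_average_pool(hidden_states, mask)
print(pooled)
```

The output has one value per hidden unit regardless of how many timesteps the window contains, which is what makes the representation fixed-length.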
Case 2:
Then I decided not to take all of the GRU's hidden states but to retrieve only the last hidden state, which provides a fixed-length representation and eliminates the need for pooling. Here, the "nan" validation loss problem is fixed, but performance on the test data is drastically reduced. Here is this model's source code:
input_layer = tf.keras.Input(shape=input_shape, name="time_series_activity")
input_mask = tf.keras.layers.Masking(mask_value=0.00000)(input_layer)
gru_l5 = tf.keras.layers.GRU(
    64, activation='tanh', recurrent_activation='sigmoid',
    recurrent_initializer=tf.keras.initializers.Orthogonal(),
    dropout=0.5, recurrent_dropout=0.5,
)(input_mask)
gru_fm = tf.keras.layers.Dropout(0.3)(gru_l5)
output_layer = tf.keras.layers.Dense(total_classes, activation="softmax")(gru_fm)
return tf.keras.models.Model(inputs=input_layer, outputs=output_layer)
We can observe the results of both Cases. In Case 1, I have the feeling that the vanishing gradient problem occurs with longer sequences. Any thoughts or discussions on resolving this "nan" issue and achieving high performance would be much appreciated.
I am using T5-Large by HuggingFace for inference. Given a premise and a hypothesis, I need to determine whether they are related or not. So, if I feed a string "mnli premise: This game will NOT open unless you agree to them sharing your information to advertisers. hypothesis: Personal data disclosure is discussed." the model is supposed to return either entailment, neutral, or contradiction.
Though I am able to determine the result, I am unable to determine the probability of the generated sequence. For instance, suppose the model generates entailment for the example given above. I also want to know the probability of entailment. So far, I have been using the following code:
from transformers import T5Tokenizer, T5ForConditionalGeneration
def is_entailment(premise, hypothesis):
    entailment_premise = premise
    entailment_hypothesis = hypothesis
    token_output = tokenizer("mnli premise: " + entailment_premise + " hypothesis: " + entailment_hypothesis,
                             return_tensors="pt", return_length=True)
    input_ids = token_output.input_ids
    output = model.generate(input_ids, output_scores=True, return_dict_in_generate=True, max_new_tokens=15)
    entailment_ids = output["sequences"]
    entailment = tokenizer.decode(entailment_ids[0], skip_special_tokens=True)
    return entailment
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)
premise = "This game will NOT open unless you agree to them sharing your information to advertisers."
hypothesis = "Personal data disclosure is discussed."
print(is_entailment(premise, hypothesis))
I have tried using the scores we get as output, but I am not sure how to calculate the probability from them. The same goes for the last hidden states that can be fetched as output from generate(). I saw another question on Stack Overflow that suggested applying a softmax function to the last hidden states, but I am unsure how to do it.
How can I calculate the probability of the sequence being generated? That is, if I get entailment for a pair of hypothesis and premise, what would be the P(entailment)?
What you get as the scores are the output token distributions before the softmax, the so-called logits. You can get the probabilities of the generated tokens by normalizing the logits with a softmax and picking the entries for the respective token ids. You can get the token ids from the sequences field of what the generate method returns.
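As a rough sketch of that normalization, here is the arithmetic in plain Python on made-up numbers. The names `step_logits` and `token_ids` and the three-entry "vocabulary" are hypothetical stand-ins for the per-step scores and the generated ids; with a real model the same steps would be applied to the returned tensors.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to one."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_probabilities(step_logits, token_ids):
    """Probability of each generated token under its own step's distribution."""
    return [softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids)]


# Made-up logits over a toy vocabulary of 3 tokens, for 2 generation steps.
step_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
# Made-up ids of the tokens that were actually generated at each step.
token_ids = [0, 2]

probs = token_probabilities(step_logits, token_ids)
print(probs)
```

One thing to watch for, as far as I know: for encoder-decoder models such as T5 the returned sequences start with the decoder start token, so the first score corresponds to the second id in the sequence.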
These are, however, not the probabilities you are looking for because T5 segments your output words into smaller units (e.g., "entailment" gets segmented to ['▁', 'en', 'tail', 'ment'] using the t5-small tokenizer). This is even trickier because different answers get split into a different number of tokens. You can get an approximate score by averaging the token probabilities (this is typically used during beam search). Such scores do not sum up to one.
If you want a normalized score, the only way is to feed all three possible answers to the decoder, get their scores, and normalize them to sum to one.
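A minimal sketch of that last step, assuming you have already scored each of the three answers with the decoder and collected the log-probability of every token. The numbers below are invented for illustration and `normalize_candidates` is a hypothetical helper, not part of any library.

```python
import math


def normalize_candidates(candidate_token_logprobs):
    """Sum token log-probs per answer, then renormalize across the answers."""
    # Log-probability of a whole answer = sum of its token log-probabilities.
    seq_logprob = {label: sum(lps) for label, lps in candidate_token_logprobs.items()}
    # Softmax over the candidates (shifted by the maximum for stability).
    m = max(seq_logprob.values())
    exps = {label: math.exp(lp - m) for label, lp in seq_logprob.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Invented token log-probabilities; note the answers have different lengths.
candidate_token_logprobs = {
    "entailment": [-0.1, -0.2, -0.1, -0.1],
    "neutral": [-1.5, -0.9],
    "contradiction": [-2.0, -1.0, -0.5],
}

label_probs = normalize_candidates(candidate_token_logprobs)
print(label_probs)
```

Because the three values are renormalized over the candidate set only, they sum to one by construction, which is what makes them usable as P(entailment), P(neutral) and P(contradiction).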
I fitted several glmer.nb models with different combinations of random intercepts, and for one of the models (nested random intercepts, with the lowest AICc), I consistently get "iteration limit reached", without the usual "Warning message:
In theta.ml(Y, mu, weights = object@resp$weights, limit = limit, :..."
Here's what I know:
it is a warning (from the color) but not labeled as such
you can also have that warning with GLMs and LMERs
Here's what I don't know:
does it mean the model is invalid?
what causes that issue?
what could I do to resolve that issue?
Here's what I searched:
https://stats.stackexchange.com/questions/67287/very-large-theta-values-using-glm-nb-in-r-alternative-approaches (no explanation as to the why and how)
GLMM FAQ: no mention
I am not the only one regularly running into this or similar problems: Using glmer.nb(), the error message:(maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate is returned
https://stats.stackexchange.com/questions/40647/lme-error-iteration-limit-reached/40664
Here's what would be highly appreciated:
A more informative warning message: did the model converge? what caused this? What can one do to fix it? Can we read more about this (link to GLMM FAQ - brms-style)?
This is a general question. I did not provide reproducible code because an answer that is generalisable would be most useful.
library(lme4)
dd <- data.frame(f = factor(rep(1:20, each = 20)))
dd$y <- simulate(~ 1 + (1|f), family = "poisson",
newdata = dd,
newparam = list(beta = 1, theta = 1),
seed = 101)[[1]]
m1 <- glmer.nb(y ~ 1 + (1|f), data = dd)
Warning message:
In theta.ml(Y, mu, weights = object@resp$weights, limit = limit, :
iteration limit reached
It's a bit hard to tell, but this warning occurs in MASS::theta.ml(), which is called to get an initial estimate of the dispersion parameter. (If you set options(error = recover, warn = 2), warnings will be converted to errors and errors will dump you into a debugger, where you can see the sequence of calls that were active when the warning/error occurred).
This generally occurs when the data (specifically, the conditional distribution of the data) is actually equidispersed (variance == mean) or underdispersed (i.e. variance < mean), which can't be achieved by a negative binomial distribution. If you run getME(m1, "glmer.nb.theta") you'll generally get a very large value (in this case it's 62376), which indicates where the optimizer gave up while it was trying to send the dispersion parameter to infinity.
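To see why the dispersion parameter runs off to infinity, recall that the negative binomial variance is mu + mu^2/theta, so the variance only approaches the mean as theta grows without bound. The sketch below uses a simple method-of-moments estimate, theta = mu^2 / (var - mu), on made-up counts. This is a deliberately simpler technique than the iterative maximum-likelihood fit in theta.ml(), and `moment_theta` is a hypothetical helper, so treat it as an illustration only.

```python
import math
import statistics


def moment_theta(counts):
    """Method-of-moments estimate of the negative binomial theta."""
    mu = statistics.mean(counts)
    var = statistics.variance(counts)  # sample variance
    if var <= mu:
        # Equi- or underdispersed data: no finite theta can match it.
        return math.inf
    return mu ** 2 / (var - mu)


# Made-up counts: the first set is overdispersed, the second underdispersed.
overdispersed = [0, 0, 1, 2, 10, 15]
underdispersed = [3, 4, 3, 4, 3, 4]

print(moment_theta(overdispersed))
print(moment_theta(underdispersed))
```

An iterative estimator facing the second kind of data keeps increasing theta until it hits its iteration limit, which matches the very large value reported by getME().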
You can:
ignore the warning (the negative binomial isn't a good choice, but the model is effectively converging to a Poisson solution anyway).
revert to a Poisson model (the CV question you link to does say "a Poisson model might be a better choice")
People often worry less about underdispersion than overdispersion (because underdispersion makes results of a Poisson model conservative), but if you want to take underdispersion into account you can fit your model with a conditional distribution that allows underdispersion as well as overdispersion (not directly possible within lme4, but see here)
PS the "iteration limit reached without convergence" warning in one of your linked answers, from nlminb within lme, is a completely different issue (except that both situations involve some form of iterative solution scheme with a set maximum number of iterations ...)
So, I'm doing a 4 label x-ray images classification on around 12600 images:
Class1:4000
Class2:3616
Class3:1345
Class4:4000
I'm using the VGG-16 architecture pretrained on the ImageNet dataset, with cross-entropy loss and SGD, a batch size of 32, and a learning rate of 1e-3, running on PyTorch.
[[749., 6., 50., 2.],
[ 5., 707., 9., 1.],
[ 56., 8., 752., 0.],
[ 4., 1., 0., 243.]]
I know that since the training loss and accuracy are close to 0 and 1, respectively, the model is overfitting, though I'm surprised that the validation accuracy is still around 0.9!
How should I interpret this properly, what is causing it, and how can I prevent it?
I know it has something to do with accuracy being the argmax of the softmax: the actual predicted probabilities get lower and lower, but the argmax stays the same. Still, I'm really confused about it! I even let it train for more than 64 epochs, with the same results: flat accuracy while the loss increases gradually!
PS. I have seen other questions with answers and didn't really get an explanation
I think your question already describes what is going on. Your model is overfitting, as you have also figured out. As you train longer, your model slowly becomes more specialized to the training set and gradually loses the ability to generalize. So the softmax probabilities on the validation set become flatter and flatter. It still shows more or less the same validation accuracy because, for now, the correct class still has at least slightly more probability than the others. In my opinion, there are a few possible reasons for this:
Your train set and validation set may not be from the same distribution.
Your validation set doesn't cover all the cases that need to be evaluated; it probably contains similar types of images that do not differ much from one another. So, when the model can identify one, it can identify many of them in the validation set. If you add more heterogeneous images to the validation set, you will no longer see such high validation accuracy.
Similarly, your training set may contain heterogeneous images, i.e., images with a lot of variation, while the validation set covers only a few varieties. As training goes on, those few varieties get less priority because the model still has many things to learn and generalize. This can happen if you augment your training set: the model initially finds the validation set relatively easy (until overfitting), but as training goes on it loses its way while learning the many augmented varieties in the training set. In this case, don't make the augmentation too wild. Ask whether the augmented images are still realistic. Augment images only as long as they remain realistic and each type of variation has enough representative examples in the training set. Don't include situations that will never occur in reality, as these unrealistic examples will only burden the model rather than help it.
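The effect described at the start of this answer (flat accuracy while the loss rises) can be reproduced with a few lines of plain Python. The probabilities below are made up for two validation images and four classes; `accuracy_and_loss` is a hypothetical helper, not PyTorch code.

```python
import math


def accuracy_and_loss(batch_probs, labels):
    """Return (accuracy, mean cross-entropy) for softmax outputs and true labels."""
    # A prediction is correct when the argmax equals the true label.
    correct = sum(1 for p, y in zip(batch_probs, labels) if p.index(max(p)) == y)
    # Cross-entropy is the negative log of the probability of the true class.
    loss = sum(-math.log(p[y]) for p, y in zip(batch_probs, labels)) / len(labels)
    return correct / len(labels), loss


labels = [0, 1]
# Made-up softmax outputs early in training: confident and correct.
early = [[0.90, 0.05, 0.03, 0.02], [0.05, 0.90, 0.03, 0.02]]
# Made-up outputs later in training: flatter, but the argmax is unchanged.
late = [[0.40, 0.30, 0.20, 0.10], [0.30, 0.40, 0.20, 0.10]]

early_acc, early_loss = accuracy_and_loss(early, labels)
late_acc, late_loss = accuracy_and_loss(late, labels)
print(early_acc, early_loss)
print(late_acc, late_loss)
```

Accuracy only looks at which class wins, while cross-entropy looks at how much probability the correct class receives, so the two can move apart. The same gap can also open when a few wrong predictions become very confident, since cross-entropy penalizes those heavily while accuracy does not change.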
In the code linked below, they use an autoencoder for supervised clustering or classification because they have data labels.
http://amunategui.github.io/anomaly-detection-h2o/
But can I use an autoencoder to cluster data if I do not have its labels?
Regards
The deep-learning autoencoder is always unsupervised learning. The "supervised" part of the article you link to is to evaluate how well it did.
The following example (taken from ch.7 of my book, Practical Machine Learning with H2O, where I try all the H2O unsupervised algorithms on the same data set - please excuse the plug) takes 563 features, and tries to encode them into just two hidden nodes.
m <- h2o.deeplearning(
  2:564, training_frame = tfidf,
  hidden = c(2), autoencoder = TRUE, activation = "Tanh"
)
f <- h2o.deepfeatures(m, tfidf, layer = 1)
The second command there extracts the hidden-layer features, i.e. the values of the two hidden nodes for each input row. f is a data frame with two numeric columns and one row for every row in the tfidf source data. I chose just two hidden nodes so that I could plot the clusters:
Results will change on each run. You can (maybe) get better results with stacked auto-encoders, or using more hidden nodes (but then you cannot plot them). Here I felt the results were limited by the data.
BTW, I made the above plot with this code:
d <- as.matrix(f[1:30,]) #Just first 30, to avoid over-cluttering
labels <- as.vector(tfidf[1:30, 1])
plot(d, pch = 17) #Triangle
text(d, labels, pos = 3) #pos=3 means above
(P.S. The original data came from Brandon Rose's excellent article on using NLTK. )
In some respects, encoding data and clustering data share overlapping theory. As a result, you can use autoencoders to cluster (encode) data.
A simple example to visualize is a set of training data that you suspect has two primary classes, such as voter history data for Republicans and Democrats. If you take an autoencoder, encode the data to two dimensions, and plot it on a scatter plot, this clustering becomes clearer. Below is a sample result from one of my models. You can see a noticeable split between the two classes as well as a bit of expected overlap.
The code can be found here
This method is not restricted to two binary classes; you can train on as many different classes as you wish. Two polarized classes are just easier to visualize.
This method is not limited to two output dimensions, that was just for plotting convenience. In fact, you may find it difficult to meaningfully map certain, large dimension spaces to such a small space.
In cases where the encoded (clustered) layer is larger in dimension it is not as clear to "visualize" feature clusters. This is where it gets a bit more difficult, as you'll have to use some form of supervised learning to map the encoded(clustered) features to your training labels.
A couple of ways to determine which class the features belong to: pump the data into a knn-clustering algorithm, or, as I prefer, take the encoded vectors and pass them to a standard back-propagation neural network. Note that, depending on your data, you may find that just pumping the data straight into your back-propagation neural network is sufficient.
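As a small illustration of the first option, here is a plain-Python k-nearest-neighbours vote over encoded vectors. Note that k-nearest neighbours is a classifier rather than a clustering method, so it needs labels for the reference points. The encoded points, labels, and the `knn_predict` helper are all made up for this sketch.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest neighbours."""
    # Sort the labelled points by Euclidean distance to the query.
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up 2-D encodings (as if taken from the autoencoder's middle layer).
encoded = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.9, 0.8), (0.8, 0.9), (0.85, 0.75)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(encoded, labels, (0.2, 0.2)))
print(knn_predict(encoded, labels, (0.8, 0.8)))
```

The same function works unchanged for encodings with more than two dimensions, which is the case where a scatter plot no longer helps.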
I am working with the Dynamic Topic Models package that was developed by Blei. I am new to LDA, but I understand it.
I would like to know what the output named lda-seq/topic-000-var-obs.dat stores.
I know that lda-seq/topic-001-var-e-log-prob.dat stores the log of the variational posterior and by applying the exponential over it, I get the probability of the word within Topic 001.
Thanks
topic-000-var-e-log-prob.dat stores the log of the variational posterior of topic 1.
topic-001-var-e-log-prob.dat stores the log of the variational posterior of topic 2.
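A minimal sketch of turning such log values back into probabilities, assuming the file is a flat list of numbers laid out with one row per word and one column per time slice. That layout is an assumption on my part, and the values below are invented rather than read from a real file; `to_word_by_time` and `exp_matrix` are hypothetical helpers.

```python
import math


def to_word_by_time(flat_values, n_times):
    """Reshape a flat list into rows of n_times values (one row per word)."""
    if len(flat_values) % n_times != 0:
        raise ValueError("number of values is not a multiple of n_times")
    return [flat_values[i:i + n_times] for i in range(0, len(flat_values), n_times)]


def exp_matrix(rows):
    """Exponentiate every entry, turning log-probabilities into probabilities."""
    return [[math.exp(v) for v in row] for row in rows]


# Invented log-probabilities for 2 words over 2 time slices (assumed layout).
flat_log_probs = [math.log(0.7), math.log(0.6), math.log(0.3), math.log(0.4)]

log_prob_matrix = to_word_by_time(flat_log_probs, n_times=2)
prob_matrix = exp_matrix(log_prob_matrix)
print(prob_matrix)
```

If the layout assumption holds, each column of the result is the word distribution of the topic at one time slice, so each column should sum to roughly one.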
I have failed to find a concrete answer anywhere. However, since the documentation's sample.sh states
The code creates at least the following files:
- topic-???-var-e-log-prob.dat: the e-betas (word distributions) for topic ??? for all times.
...
- gam.dat
without mentioning the topic-000-var-obs.dat file, it is probably not essential for most analyses.
Speculation
"obs" suggests observations. After digging around in the example/model_run results, I plotted the sum across epochs for each word/token using:
temp = scan("dtm/example/model_run/lda-seq/topic-000-var-obs.dat")
temp.matrix = matrix(temp, ncol = 10, byrow = TRUE)
plot(rowSums(temp.matrix))
and the result is something like:
The general trend of the non-negative values is decreasing, and many values are floored (in this case to -11.00972 ≈ log(1.65e-05)), suggesting that these values are weightings or some other measure of influence on the model. The model removes some tokens, and the influence/importance of the others tapers off over the index. The latter trend may be caused by preprocessing, such as sorting tokens by tf-idf when creating the dictionary.
Interestingly, the row-sum values vary for both the floored tokens and the set with more positive values:
temp = scan("~/Documents/Python/inference/project/dtm/example/model_run/lda-seq/topic-009-var-obs.dat")
temp.matrix = matrix(temp, ncol = 10, byrow = TRUE)
plot(rowSums(temp.matrix))