Dynamic Topic Model output - Blei format - LDA

I am working with the Dynamic Topic Models package developed by Blei. I am new to LDA, but I understand the basics.
I would like to know what the output file
lda-seq/topic-000-var-obs.dat stores.
I know that lda-seq/topic-001-var-e-log-prob.dat stores the log of the variational posterior, and by applying the exponential to it I get the probability of each word within Topic 001.
Thanks

topic-000-var-e-log-prob.dat stores the log of the variational posterior of topic 1.
topic-001-var-e-log-prob.dat stores the log of the variational posterior of topic 2.
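For completeness, a minimal R sketch of that exponentiation step, assuming the bundled example run with 10 time slices (rows are vocabulary terms, columns are time slices):
logp = scan("dtm/example/model_run/lda-seq/topic-001-var-e-log-prob.dat")
logp.matrix = matrix(logp, ncol = 10, byrow = TRUE)  # words x time slices
prob.matrix = exp(logp.matrix)                       # word probabilities within topic 001
colSums(prob.matrix)                                 # each time slice should sum to ~1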

I have failed to find a concrete answer anywhere. However, the documentation's sample.sh states:
The code creates at least the following files:
- topic-???-var-e-log-prob.dat: the e-betas (word distributions) for topic ??? for all times.
...
- gam.dat
and it does not mention the topic-000-var-obs.dat file, which suggests that the file is not essential for most analyses.
Speculation
obs suggests observations. After a little digging around in the example/model_run results, I plotted the sum across epochs for each word/token using:
temp = scan("dtm/example/model_run/lda-seq/topic-000-var-obs.dat")
temp.matrix = matrix(temp, ncol = 10, byrow = TRUE)
plot(rowSums(temp.matrix))
and the result is something like:
The general trend of the non-negative values is decreasing, and many values are floored (in this case to -11.00972 = log(1.67e-05)), suggesting that these values are weightings or some other measure of each token's influence on the model. The model removes some tokens, and the influence/importance of the others tapers off over the index. The latter trend may be caused by preprocessing, such as sorting tokens by tf-idf when creating the dictionary.
Interestingly, the row-sum values vary both for the floored tokens and for the set with more positive values:
temp = scan("~/Documents/Python/inference/project/dtm/example/model_run/lda-seq/topic-009-var-obs.dat")
temp.matrix = matrix(temp, ncol = 10, byrow = TRUE)
plot(rowSums(temp.matrix))


Determining the probability of a sequence generated by T5 model by HuggingFace

I am using T5-Large by HuggingFace for inference. Given a premise and a hypothesis, I need to determine whether they are related or not. So, if I feed a string "mnli premise: This game will NOT open unless you agree to them sharing your information to advertisers. hypothesis: Personal data disclosure is discussed." the model is supposed to return either entailment, neutral, or contradiction.
Though I am able to determine the result, I am unable to determine the probability of the generated sequence. For instance, suppose the model generates entailment for the example given above. I also want to know the probability of entailment. So far, I have been using the following code:
from transformers import T5Tokenizer, T5ForConditionalGeneration

def is_entailment(premise, hypothesis):
    entailment_premise = premise
    entailment_hypothesis = hypothesis

    token_output = tokenizer("mnli premise: " + entailment_premise +
                             " hypothesis: " + entailment_hypothesis,
                             return_tensors="pt", return_length=True)
    input_ids = token_output.input_ids

    output = model.generate(input_ids, output_scores=True,
                            return_dict_in_generate=True, max_new_tokens=15)
    entailment_ids = output["sequences"]
    entailment = tokenizer.decode(entailment_ids[0], skip_special_tokens=True)
    return entailment

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)

premise = "This game will NOT open unless you agree to them sharing your information to advertisers."
hypothesis = "Personal data disclosure is discussed."

print(is_entailment(premise, hypothesis))
I have tried using the scores we get as output, but I am not sure how to calculate the probability from them. The same goes for the last hidden states that can be fetched as output from generate(). I saw another question on Stack Overflow that suggested applying a softmax to the last hidden states, but I am unsure how to do it.
How can I calculate the probability of the sequence being generated? That is, if I get entailment for a pair of hypothesis and premise, what would be the P(entailment)?
What you get as the scores are the output token distributions before the softmax, so-called logits. You can get the probabilities of the generated tokens by normalizing the logits with a softmax and picking out the entries for the respective token ids, which you can read from the sequences field of what the generate method returns.
These are, however, not the probabilities you are looking for because T5 segments your output words into smaller units (e.g., "entailment" gets segmented to ['▁', 'en', 'tail', 'ment'] using the t5-small tokenizer). This is even trickier because different answers get split into a different number of tokens. You can get an approximate score by averaging the token probabilities (this is typically used during beam search). Such scores do not sum up to one.
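A minimal sketch of that token-level calculation, reusing the output returned by generate() inside is_entailment() (e.g. by also returning it); greedy decoding and the variable names here are my own assumptions:
import torch

# output["scores"] holds one logits tensor per generated step;
# output["sequences"] starts with the decoder start token, so skip position 0.
gen_ids = output["sequences"][0, 1:]
token_probs = []
for step_logits, tok_id in zip(output["scores"], gen_ids):
    probs = torch.softmax(step_logits[0], dim=-1)   # normalize the logits
    token_probs.append(probs[tok_id].item())        # probability of the chosen token

avg_prob = sum(token_probs) / len(token_probs)      # approximate sequence score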
If you want a normalized score, the only way is to feed all three possible answers to the decoder, get their scores, and normalize them to sum to one.
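A hedged sketch of that normalization with the same t5-small model as above (the helper name and the explicit label list are mine, not part of the original code):
import torch

def label_log_prob(text, label):
    # Log-probability of a fixed decoder output (the label) under the model.
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    label_ids = tokenizer(label, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids=input_ids, labels=label_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    # Sum the log-probabilities of the label's own tokens.
    return log_probs.gather(-1, label_ids.unsqueeze(-1)).sum().item()

text = "mnli premise: " + premise + " hypothesis: " + hypothesis
labels = ["entailment", "neutral", "contradiction"]
scores = torch.tensor([label_log_prob(text, lab) for lab in labels])
probs = torch.softmax(scores, dim=0)   # normalize so the three answers sum to one
print(dict(zip(labels, probs.tolist())))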

Validation loss is "nan" in time-series classification

Case 1:
I am feeding variable-length input time-series windows to a GRU model. Sometimes there may be 900 samples in a window, and sometimes only 16. I chose an RNN model (GRU) since I learned that RNN methods work better on long sequences. I use one GRU layer and return the hidden sequences across all time steps in order to get the maximum information from all of them. Then I apply average pooling on the GRU output to bring the representation to a fixed length. The intuition for using average pooling instead of max pooling is that it may capture summarized information from all the time steps. Here is the code of the model:
input_layer = tf.keras.Input(shape=input_shape, name="time_series_activity")
input_mask = tf.keras.layers.Masking(mask_value=0.00000)(input_layer)
gru_l5 = tf.keras.layers.GRU(64, activation='tanh', recurrent_activation='sigmoid',
                             recurrent_initializer=tf.keras.initializers.Orthogonal(),
                             dropout=0.5, recurrent_dropout=0.5,
                             return_sequences=True)(input_mask)
AP = tf.keras.layers.GlobalAveragePooling1D()(gru_l5)
gru_fm = tf.keras.layers.Dropout(0.3)(AP)
output_layer = tf.keras.layers.Dense(total_classes, activation="softmax")(gru_fm)
return tf.keras.models.Model(inputs=input_layer, outputs=output_layer)
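For reference, the variable-length windows are zero-padded to a common length before they reach the Masking layer, roughly along these lines (toy shapes for illustration only):
import numpy as np
import tensorflow as tf

# Two windows of different lengths, each with 3 features per time step.
w1 = np.random.rand(900, 3).astype("float32")
w2 = np.random.rand(16, 3).astype("float32")

# Zero-pad to a common length; Masking(mask_value=0.0) then skips the padded steps.
batch = tf.keras.preprocessing.sequence.pad_sequences(
    [w1, w2], padding="post", dtype="float32", value=0.0)
print(batch.shape)  # (2, 900, 3)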
With this model I obtain better performance on the validation set, while on the training data performance climbs to 100% (i.e. it overfits); the major issue, however, is that the validation loss is "nan". This issue is being discussed on GitHub and Stack Overflow.
I tried nearly all of the options provided here, here and here, but was unable to resolve this validation loss = nan issue.
Case 2:
Then I decided not to take all of the GRU's hidden states, but rather to retrieve only the last hidden state, which gives a fixed-length representation and eliminates the need for pooling. Here the validation loss "nan" problem is fixed, but the test-data performance is drastically reduced. Here is this model's source code:
input_layer = tf.keras.Input(shape=input_shape, name="time_series_activity")
input_mask = tf.keras.layers.Masking(mask_value=0.00000)(input_layer)
gru_l5 = tf.keras.layers.GRU(64, activation='tanh', recurrent_activation='sigmoid',
                             recurrent_initializer=tf.keras.initializers.Orthogonal(),
                             dropout=0.5, recurrent_dropout=0.5)(input_mask)
gru_fm = tf.keras.layers.Dropout(0.3)(gru_l5)
output_layer = tf.keras.layers.Dense(total_classes, activation="softmax")(gru_fm)
return tf.keras.models.Model(inputs=input_layer, outputs=output_layer)
Comparing the results of the two cases, I have the feeling that in Case 1 the vanishing-gradient problem occurs with the longer sequences. Any thoughts or discussion on resolving this "nan" issue and achieving high performance would be much appreciated.

"iteration limit reached" in lme4 GLMM - what does it mean?

I constructed several glmer.nb models with different combinations of random intercepts, and for one of the models (nested random intercepts, with the lowest AICc) I consistently get "iteration limit reached", without the usual "Warning message:
In theta.ml(Y, mu, weights = object@resp$weights, limit = limit, : ..."
Here's what I know:
it is a warning (from the color) but not labeled as such
you can also have that warning with GLMs and LMERs
Here's what I don't know:
does it mean the model is invalid?
what causes that issue?
what could I do to resolve that issue?
Here's what I searched:
https://stats.stackexchange.com/questions/67287/very-large-theta-values-using-glm-nb-in-r-alternative-approaches (no explanation as to the why and how)
GLMM FAQ: no mention
I am not the only one regularly running into this or similar problems: Using glmer.nb(), the error message: (maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate is returned
https://stats.stackexchange.com/questions/40647/lme-error-iteration-limit-reached/40664
Here's what would be highly appreciated:
A more informative warning message: did the model converge? What caused this? What can one do to fix it? Can we read more about this somewhere (a link to the GLMM FAQ, brms-style)?
This is a general question. I did not provide reproducible code because an answer that is generalisable would be most useful.
library(lme4)
dd <- data.frame(f = factor(rep(1:20, each = 20)))
dd$y <- simulate(~ 1 + (1 | f), family = "poisson",
                 newdata = dd,
                 newparams = list(beta = 1, theta = 1),
                 seed = 101)[[1]]
m1 <- glmer.nb(y ~ 1 + (1 | f), data = dd)
Warning message:
In theta.ml(Y, mu, weights = object@resp$weights, limit = limit, :
  iteration limit reached
It's a bit hard to tell, but this warning occurs in MASS::theta.ml(), which is called to get an initial estimate of the dispersion parameter. (If you set options(error = recover, warn = 2), warnings will be converted to errors and errors will dump you into a debugger, where you can see the sequence of calls that were active when the warning/error occurred).
This generally occurs when the data (specifically, the conditional distribution of the data) is actually equidispersed (variance == mean) or underdispersed (i.e. variance < mean), which can't be achieved by a negative binomial distribution. If you run getME(m1, "glmer.nb.theta") you'll generally get a very large value (in this case it's 62376), which indicates where the optimizer gave up while it was trying to send the dispersion parameter to infinity.
You can:
ignore the warning (the negative binomial isn't a good choice, but the model is effectively converging to a Poisson solution anyway).
revert to a Poisson model (the CV question you link to does say "a Poisson model might be a better choice")
People often worry less about underdispersion than overdispersion (because underdispersion makes results of a Poisson model conservative), but if you want to take underdispersion into account you can fit your model with a conditional distribution that allows underdispersion as well as overdispersion (not directly possible within lme4, but see here)
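A small check along those lines, reusing m1 and dd from the simulated example above (the follow-up Poisson fit is just an illustration of the second option):
## where did the optimizer leave the dispersion parameter?
getME(m1, "glmer.nb.theta")   # a huge value: theta is heading to infinity, i.e. ~Poisson

## compare against the plain Poisson fit the data were simulated from
m0 <- glmer(y ~ 1 + (1 | f), data = dd, family = poisson)
AIC(m0, m1)                   # the Poisson model should do essentially as well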
PS the "iteration limit reached without convergence" warning in one of your linked answers, from nlminb within lme, is a completely different issue (except that both situations involve some form of iterative solution scheme with a set maximum number of iterations ...)

Riding the wave: numerical schemes for hyperbolic PDEs (Lorena Barba lessons) - assistance needed

I am a beginner Python user trying to get a feel for computer science. I've been learning by studying concepts/subjects I'm already familiar with, such as Computational Fluid Dynamics (CFD) and Finite Element Analysis (FEA). I got my degree in mechanical engineering, so I don't have much of a CS background.
I'm studying a series by Lorena Barba on Jupyter notebook viewer, Practical Numerical Methods, and I'm looking for some help, hopefully from someone familiar with CFD and FEA in general.
If you click on the link below and go to the following output cell, you'll find the block of code I've pasted below. I'm really confused by this block of code, which sits inside the function that is defined there.
Anyway, if there is anyone out there with any suggestions on how to tackle learning Python, help would be much appreciated.
In [9]:
rho_hist = [rho0.copy()]
rho = rho0.copy()  # <-- I'm confused by the role of this variable here
for n in range(nt):
    # Compute the flux.
    F = flux(rho, *args)
    # Advance in time using Lax-Friedrichs scheme.
    rho[1:-1] = (0.5 * (rho[:-2] + rho[2:]) -
                 dt / (2.0 * dx) * (F[2:] - F[:-2]))
    # Set the value at the first location.
    rho[0] = bc_values[0]
    # Set the value at the last location.
    rho[-1] = bc_values[1]
    # Record the time-step solution.
    rho_hist.append(rho.copy())
return rho_hist
http://nbviewer.jupyter.org/github/numerical-mooc/numerical-mooc/blob/master/lessons/03_wave/03_02_convectionSchemes.ipynb
The intent of the first two lines is to preserve rho0 and to provide copies of it: one for the history (a copy, so that later changes do not reflect back here) and one as the initial value for the "working" variable rho that is used and modified during the computation.
The background is that Python list and array variables are always references to the object in question. Assigning one variable to another copies the reference (the address of the object), not the object itself; both variables then refer to the same memory area. Thus, not using .copy() would mean that modifying rho also changes rho0.
a = [1, 2, 3]
b = a          # b is another reference to the same list object
b[2] = 5
print(a)
#>>> [1, 2, 5]
Composite objects that themselves contain structured data objects will need a deepcopy to copy the data on all levels.
Numpy array values changed without being asked?
how to pass a list as value and not as reference?
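A small illustration of the shallow-versus-deep copy difference, using the standard-library copy module:
import copy

nested = [[1, 2], [3, 4]]
shallow = list(nested)          # new outer list, but the inner lists are still shared
deep = copy.deepcopy(nested)    # copies the data on all levels

nested[0][0] = 99
print(shallow[0][0])   # 99 -- the inner list is shared
print(deep[0][0])      # 1  -- fully independent copy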

How does rstan store posterior samples for separate chains?

I would like to understand how the output of extract in rstan orders the posterior samples. I understand that I can view the posterior samples from each chain by using as.array,
stanfit <- sampling(
  model,
  data = stan.data)

fitarray <- as.array(stanfit)
For example, fitarray[, 2, 1] will give me the samples for the second chain of the first parameter. One way to store the posterior samples in the output of extract would be just to concatenate them. When I do,
fit <- extract(stanfit)
mean(fitarray[,2,1]) == mean(fit$ss[1001:2000])
for several chains and parameters I always get TRUE (ss is the first parameter). This makes it seem like the posterior samples are being concatenated in fit. However, when I do,
fitarray[,2,1] == fit$ss[1001:2000]
I get FALSE (and I confirmed that it is not just a precision difference). It appears that fitarray and fit store the iterations differently. How do I view the iterations (in order) of each posterior sample chain separately?
As can be seen from rstan:::as.array.stanfit, the as.array method is essentially defined as
extract(x, permuted = FALSE, inc_warmup = FALSE)
Your default use of extract drops the warmup but randomly permutes the post-warmup draws (merging the chains), which is why the indices do not line up with the as.array output.
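So, to see the iterations of each chain in order, call extract with those arguments yourself; a minimal sketch (assuming ss is a scalar parameter, as in the question):
draws <- extract(stanfit, permuted = FALSE, inc_warmup = FALSE)   # iterations x chains x parameters
chain2_ss <- draws[, 2, "ss"]                                     # chain 2 draws of ss, in original order
all.equal(unname(chain2_ss), unname(fitarray[, 2, 1]))            # should now be TRUE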