Organizing ranef.mer in ascending or descending order - lme4

I'm trying to figure out how to organize the ranef.mer list of random effects from a simple lmer model with a random intercept and a random slope for one variable (Sex).
fit.b <- lmer(Math ~ 1 + Sex + (1+Sex|SchoolID), data=pisa_com, REML=FALSE)
I've plotted the random effects using qqmath, but I need either to label each random effect with its cluster identifier (in this case, the school) or to sort the ranef.mer output.

Solved this last night. The ranef.mer object can be coerced into a data frame.
I fit the model:
fit.b <- lmer(Math ~ 1 + Sex + (1+Sex|SchoolID), data=pisa_com, REML=FALSE)
Then coerced it into a data frame by extracting the component for the grouping factor (SchoolID):
random.effects <- as.data.frame(ranef(fit.b)$SchoolID)
Then wrote it to a .csv file for sorting in Excel:
write.csv(random.effects, file="~/folder/file.name.csv")

Determining the probability of a sequence generated by T5 model by HuggingFace

I am using T5-Large by HuggingFace for inference. Given a premise and a hypothesis, I need to determine whether they are related or not. So, if I feed a string "mnli premise: This game will NOT open unless you agree to them sharing your information to advertisers. hypothesis: Personal data disclosure is discussed." the model is supposed to return either entailment, neutral, or contradiction.
Though I am able to determine the result, I am unable to determine the probability of the generated sequence. For instance, suppose the model generates entailment for the example given above; I also want to know the probability of entailment. So far, I have been using the following code:
from transformers import T5Tokenizer, T5ForConditionalGeneration

def is_entailment(premise, hypothesis):
    entailment_premise = premise
    entailment_hypothesis = hypothesis
    token_output = tokenizer("mnli premise: " + entailment_premise + " hypothesis: " + entailment_hypothesis,
                             return_tensors="pt", return_length=True)
    input_ids = token_output.input_ids
    output = model.generate(input_ids, output_scores=True, return_dict_in_generate=True, max_new_tokens=15)
    entailment_ids = output["sequences"]
    entailment = tokenizer.decode(entailment_ids[0], skip_special_tokens=True)
    return entailment

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)

premise = "This game will NOT open unless you agree to them sharing your information to advertisers."
hypothesis = "Personal data disclosure is discussed."

print(is_entailment(premise, hypothesis))
I have tried using the scores returned in the output, but I'm not sure how to calculate probabilities from them. The same goes for the last hidden states that can be fetched from generate(). Another question on Stack Overflow suggested applying a softmax to the last hidden states, but I am unsure how to do that.
How can I calculate the probability of the generated sequence? That is, if I get entailment for a premise-hypothesis pair, what would P(entailment) be?
What you get as scores are the output token distributions before the softmax, i.e., the logits. You can get the probabilities of the generated tokens by normalizing the logits with a softmax and picking out the entries for the generated token ids, which are in the sequences field of what the generate method returns.
These are, however, not the probabilities you are looking for because T5 segments your output words into smaller units (e.g., "entailment" gets segmented to ['▁', 'en', 'tail', 'ment'] using the t5-small tokenizer). This is even trickier because different answers get split into a different number of tokens. You can get an approximate score by averaging the token probabilities (this is typically used during beam search). Such scores do not sum up to one.
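For concreteness, here is a minimal sketch of that first approach, getting the per-token probabilities and their average, using t5-small and the premise/hypothesis from the question (the variable names are mine):
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = ("mnli premise: This game will NOT open unless you agree to them sharing "
        "your information to advertisers. hypothesis: Personal data disclosure is discussed.")
input_ids = tokenizer(text, return_tensors="pt").input_ids

out = model.generate(input_ids, output_scores=True, return_dict_in_generate=True, max_new_tokens=15)

# out.scores holds one logit tensor per generated step;
# out.sequences starts with the decoder start token, hence the +1 offset.
token_probs = []
for step, logits in enumerate(out.scores):
    probs = torch.softmax(logits[0], dim=-1)      # normalize the logits
    token_id = out.sequences[0, step + 1]         # id of the token generated at this step
    token_probs.append(probs[token_id].item())

print(tokenizer.convert_ids_to_tokens(out.sequences[0].tolist()))
print(token_probs)                                # one probability per sub-word token
print(sum(token_probs) / len(token_probs))        # approximate average score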
If you want a normalized score, the only way is to feed all three possible answers to the decoder, get their scores, and normalize them to sum to one.
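And a minimal sketch of that normalized score, obtained by scoring each of the three candidate answers with a teacher-forced forward pass and normalizing across them (again, the setup lines and variable names are my own):
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = ("mnli premise: This game will NOT open unless you agree to them sharing "
        "your information to advertisers. hypothesis: Personal data disclosure is discussed.")
input_ids = tokenizer(text, return_tensors="pt").input_ids

candidates = ["entailment", "neutral", "contradiction"]
scores = []
with torch.no_grad():
    for answer in candidates:
        labels = tokenizer(answer, return_tensors="pt").input_ids
        # Teacher-forced forward pass: logits has one row per target token.
        logits = model(input_ids=input_ids, labels=labels).logits[0]
        log_probs = torch.log_softmax(logits, dim=-1)
        # Sum the log-probabilities of the target tokens (includes the final </s>).
        scores.append(log_probs.gather(1, labels[0].unsqueeze(1)).sum())

probs = torch.softmax(torch.stack(scores), dim=0)  # normalize across the three answers
for answer, p in zip(candidates, probs):
    print(f"P({answer}) = {p.item():.3f}")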

Why is transformer decoder always generating output of same length as gold labels?

I am generating summaries with a fine-tuned BART model, and I've noticed something strange: if I feed the labels to the model, it always generates output of the same length as the label, whereas if I do not pass the labels, it generates output of length 1024 (BART's maximum sequence length). This is unexpected, so I'm trying to understand whether there is a problem or bug in the reproducible example below.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('facebook/bart-large-cnn')
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn')

sentence_to_summarize = ['This is a text to summarise. I just went for a walk in the park and saw very large crowds gathering to watch an impromptu football match']
encoded_dict = tokenizer.batch_encode_plus(sentence_to_summarize, return_tensors='pt', max_length=1024, padding='max_length')
input_ids = encoded_dict['input_ids']
attention_mask = encoded_dict['attention_mask']
label = tokenizer.encode('I went to the park', return_tensors='pt')
Notice the following two cases.
Case 1:
output = model(input_ids=input_ids, attention_mask=attention_mask)
print(output['logits'].shape)
shape printed is torch.Size([1, 1024, 50264])
Case 2:
output = model(input_ids=input_ids, attention_mask=attention_mask, labels=label)
print(output['logits'].shape)
shape printed is torch.Size([1, 7, 50264]) where 7 is the length of the label 'I went to the park' (including start and end tokens).
Ideally the summarization model would learn when to generate the EOS token, but this should not always lead to summaries of exactly the same length as the gold output (i.e. the label). Why is the label length influencing the model output in this way?
I would expect the only difference between cases 1 and 2 to be that in the second case the output also contains the loss value, but I wouldn't expect this to influence the logits in any way.
The original example does not use the labels parameter:
https://huggingface.co/docs/transformers/v4.22.1/en/model_doc/bart#transformers.BartForConditionalGeneration.forward.example
The labels parameter is optional and, as far as I can tell, is not meant for summarizing: it is used for computing the training loss with teacher forcing. When you pass labels (and no decoder_input_ids), the model builds the decoder inputs by shifting the labels, so the decoder runs for exactly as many positions as the label has tokens, which is why the logits in case 2 have length 7. Without labels, the decoder inputs default to the shifted input_ids, hence the 1024 positions in case 1. For inference, use generate() rather than a plain forward pass.
https://huggingface.co/docs/transformers/v4.22.1/en/model_doc/bart#transformers.BartForConditionalGeneration.forward.labels
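For generating summaries at inference time, a minimal sketch might look like the following, relying on the generation defaults stored in the bart-large-cnn config (beam search, min/max length):
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

text = ("This is a text to summarise. I just went for a walk in the park and saw "
        "very large crowds gathering to watch an impromptu football match")
inputs = tokenizer(text, return_tensors="pt")

# The decoder runs autoregressively and stops at the EOS token (or max_length),
# so the output length does not depend on any label.
summary_ids = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"])
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))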

include random slope in binomial mixed model

I am using a binomial GLMM to examine how the presence of individuals (# hours/day) at a site changes over time. Since presence is measured daily for several individuals, I've included a random intercept for individual ID.
e.g.,
presence <- cbind(hours, 24-hours)
glmer(presence ~ time + (1 | ID), family = binomial)
I'd like to also look at using ID as a random slope, but I don't know how to add this to my model. I've tried the two different approaches below, but I'm not sure which is correct.
glmer(presence ~ time + (1 + ID), family = binomial)
Error: No random effects terms specified in formula
glmer(presence ~ time + (1 + ID | ID), family = binomial)
Error: number of observations (=1639) < number of random effects (=5476) for term (1 + ID | ID); the random-effects parameters are probably unidentifiable
You cannot have a random slope for ID and have ID as a (level-two) grouping variable (see this documentation for more detail: https://cran.r-project.org/web/packages/lme4/lme4.pdf).
The grouping variable, which is ID in the models below, is the variable for which random effects are specified. model_1 gives random intercepts for ID only. model_2 gives both random intercepts and random slopes for the time variable. In other words, model_1 allows the intercept of the relationship between presence and time to vary with ID (the slope stays the same), whereas model_2 allows both the intercept and the slope to vary with ID, so that the relationship between presence and time (i.e., the slope) can differ for each individual (ID).
model_1 = glmer(presence ~ time + (1 | ID), family = binomial)
model_2 = glmer(presence ~ time + (1 + time | ID), family = binomial)
I would also recommend:
Snijders, T. A. B., & Bosker, R. J. (2012). Multilevel analysis: An introduction to basic and advanced multilevel modeling (2nd ed.). Sage.

How does rstan store posterior samples for separate chains?

I would like to understand how the output of extract in rstan orders the posterior samples. I understand that I can view the posterior samples from each chain by using as.array,
stanfit <- sampling(
  model,
  data = stan.data)

fitarray <- as.array(stanfit)
For example, fitarray[, 2, 1] will give me the samples for the second chain of the first parameter. One way to store the posterior samples in the output of extract would be just to concatenate them. When I do,
fit <- extract(stanfit)
mean(fitarray[,2,1]) == mean(fit$ss[1001:2000])
for several chains and parameters I always get TRUE (ss is the first parameter). This makes it seem like the posterior samples are being concatenated in fit. However, when I do,
fitarray[,2,1] == fit$ss[1001:2000]
I get FALSE (confirmed that there's not just precision difference). It appears that fitarray and fit are storing the iterations differently. How do I view the iterations (in order) of each posterior sample chain separately?
As can be seen from rstan:::as.array.stanfit, the as.array method is essentially defined as
extract(x, permuted = FALSE, inc_warmup = FALSE)
Your default use of extract drops the warmup but randomly permutes the post-warmup draws, which is why the indices do not line up with the as.array output. To view the iterations of each chain in order, call extract with permuted = FALSE (it returns an iterations x chains x parameters array, just like as.array).

p values for random effects in lmer

I am working on a mixed model using the lmer function. I want to obtain p-values for all the fixed and random effects. I can obtain p-values for the fixed effects using different methods, but I haven't found anything for the random effects. Every method I have found online involves fitting a null model and then getting p-values by comparison. Is there a method that doesn't require fitting another model?
My model looks like:
mod1 = lmer(Out ~ Var1 + (1 + Var2 | Var3), data = dataset)
You must do this through model comparison, as far as I know. The lmerTest package has a function called step, which will reduce your model to just the significant parameters (fixed and random) based on a number of different tests. The documentation isn't entirely clear on how everything is done, so I much prefer to use model comparison to get at specific tests.
For your model, you could test the random slope by specifying:
mod0 <- lmer(Out ~ Var1 + (1 + Var2 | Var3), data = dataset, REML=TRUE)
mod1 <- lmer(Out ~ Var1 + (1 | Var3), data = dataset, REML=TRUE)
anova(mod0, mod1, refit=FALSE)
This will show you the likelihood ratio test and its test statistic (chi-square distributed). But you are testing two parameters here: the random slope of Var2 and the covariance between the random slopes and random intercepts. So you need a p-value adjustment, based on a 50:50 mixture of chi-square distributions with 1 and 2 degrees of freedom:
1 - (.5 * pchisq(anova(mod0, mod1, refit=FALSE)$Chisq[[2]], df=2) +
     .5 * pchisq(anova(mod0, mod1, refit=FALSE)$Chisq[[2]], df=1))
More on those tests here or here.