How does rstan store posterior samples for separate chains? - output

I would like to understand how the output of extract in rstan orders the posterior samples. I understand that I can view the posterior samples from each chain by using as.array,
stanfit <- sampling(
  model,
  data = stan.data)

fitarray <- as.array(stanfit)
For example, fitarray[, 2, 1] will give me the samples for the second chain of the first parameter. One way to store the posterior samples in the output of extract would be just to concatenate them. When I do,
fit <- extract(stanfit)
mean(fitarray[,2,1]) == mean(fit$ss[1001:2000])
for several chains and parameters I always get TRUE (ss is the first parameter). This makes it seem like the posterior samples are being concatenated in fit. However, when I do,
fitarray[,2,1] == fit$ss[1001:2000]
I get FALSE (I confirmed it is not just a precision difference). It appears that fitarray and fit are storing the iterations differently. How do I view the iterations (in order) of each posterior sample chain separately?

As can be seen from rstan:::as.array.stanfit, the as.array method is essentially defined as
extract(x, permuted = FALSE, inc_warmup = FALSE)
Your default call extract(stanfit) drops the warmup draws but randomly permutes the post-warmup draws, which is why the indices do not line up with the as.array output.
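For example, a minimal sketch (assuming, as in your question, that ss is a parameter in the model and at least two chains were run):
# Post-warmup draws kept in iteration order: an iterations x chains x parameters array
draws <- extract(stanfit, permuted = FALSE, inc_warmup = FALSE)

# Chain 2 of parameter "ss", in the order the iterations were sampled
chain2_ss <- draws[, 2, "ss"]

# This now matches the as.array view element-wise
all(chain2_ss == fitarray[, 2, "ss"])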

Related

Determining the probability of a sequence generated by T5 model by HuggingFace

I am using T5-Large by HuggingFace for inference. Given a premise and a hypothesis, I need to determine whether they are related or not. So, if I feed a string "mnli premise: This game will NOT open unless you agree to them sharing your information to advertisers. hypothesis: Personal data disclosure is discussed." the model is supposed to return either entailment, neutral, or contradiction.
Though I am able to determine the result, I am unable to determine the probability of the sequence generated. For instance, suppose the model generates entailment for the example given above. I also want to know the probability of that entailment. So far, I have been using the following code,
from transformers import T5Tokenizer, T5ForConditionalGeneration

def is_entailment(premise, hypothesis):
    entailment_premise = premise
    entailment_hypothesis = hypothesis
    token_output = tokenizer("mnli premise: " + entailment_premise + " hypothesis: " + entailment_hypothesis,
                             return_tensors="pt", return_length=True)
    input_ids = token_output.input_ids
    output = model.generate(input_ids, output_scores=True, return_dict_in_generate=True, max_new_tokens=15)
    entailment_ids = output["sequences"]
    entailment = tokenizer.decode(entailment_ids[0], skip_special_tokens=True)
    return entailment

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)
premise = "This game will NOT open unless you agree to them sharing your information to advertisers."
hypothesis = "Personal data disclosure is discussed."
print(is_entailment(premise, hypothesis))
I have tried using the scores we get as output, but I am not sure how to calculate the probability from them. The same goes for the last hidden states that can be fetched as output from generate(). I saw another question on Stack Overflow that suggested applying a softmax function to the last hidden states, but I am unsure how to do it.
How can I calculate the probability of the sequence being generated? That is, if I get entailment for a pair of hypothesis and premise, what would be the P(entailment)?
What you get as the scores are the output token distributions before the softmax, the so-called logits. You can get the probabilities of the generated tokens by normalizing the logits (applying a softmax over the vocabulary) and picking out the respective token ids, which you can take from the sequences field of what the generate method returns.
These are, however, not the probabilities you are looking for because T5 segments your output words into smaller units (e.g., "entailment" gets segmented to ['▁', 'en', 'tail', 'ment'] using the t5-small tokenizer). This is even trickier because different answers get split into a different number of tokens. You can get an approximate score by averaging the token probabilities (this is typically used during beam search). Such scores do not sum up to one.
If you want a normalized score, the only way is to feed all three possible answers to the decoder, get their scores, and normalize them to sum to one.
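A minimal sketch of that last approach, assuming the three MNLI labels and the t5-small checkpoint from your question (label_probabilities is a hypothetical helper, not part of the transformers API): score each candidate answer with the decoder via a forward pass and softmax-normalize the sequence log-probabilities so they sum to one.
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.eval()

def label_probabilities(premise, hypothesis,
                        labels=("entailment", "neutral", "contradiction")):
    source = "mnli premise: " + premise + " hypothesis: " + hypothesis
    input_ids = tokenizer(source, return_tensors="pt").input_ids
    log_scores = []
    for label in labels:
        # Use the candidate answer as the decoder target
        label_ids = tokenizer(label, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(input_ids=input_ids, labels=label_ids)
        # out.loss is the mean negative log-likelihood per target token;
        # multiplying by the number of target tokens gives the sequence log-probability
        log_scores.append(-out.loss.item() * label_ids.shape[1])
    # Normalize across the three candidates so the scores sum to one
    probs = torch.softmax(torch.tensor(log_scores), dim=0)
    return dict(zip(labels, probs.tolist()))

print(label_probabilities("This game will NOT open unless you agree to them sharing your information to advertisers.",
                          "Personal data disclosure is discussed."))
Using the average per-token log-probability instead of the sum would give you the approximate, beam-search-style score mentioned above.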

How can I implement a function to load data into a design matrix and an output vector in Octave

I have a .txt file with dimensions 100x4, but I want to generalise and build an initial matrix of dimension m x (n+1), since the code should work with any data file. Here m is the number of training examples, n is the number of training features, and the last column is the output vector.
function [X, y] = loadData(filename)
  data = load(filename);
  X = load(filename);
  y = load(filename);
  m = rows(filename);
  n = size(filename);
end
The expected values of the elements in the matrix do not match the values found.
What is the mistake?
First of all, you are loading the same file three times, so in the end data, X, and y contain exactly the same thing.
Then you are passing filename (which is a string) to rows() and size(), so do not expect to get the sizes of any arrays: these functions do not open any file, they just operate on the string in this case. In Octave a string is treated as a 1 x l matrix, l being the length of the string.
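A corrected sketch along those lines (assuming the file is purely numeric and the output really is the last column):
function [X, y] = loadData(filename)
  % Load the numeric contents of the file once
  data = load(filename);

  m = rows(data);           % number of training examples
  n = columns(data) - 1;    % number of features (last column is the output)

  X = data(:, 1:n);         % m x n design matrix
  y = data(:, n + 1);       % m x 1 output vector
end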

Scilab incorrect indices returned when using find command

I'm relatively new to Scilab and I would like to find the indices of a number in my matrix.
I have defined my number as maximal deformation (MaxEYY) and on displaying it, it is correct (I can double check in my *.csv file). However, when I want to know exactly where this number lies in my matrix using the find command, only (1,1) is returned, but I know for a fact that this number is located at (4,8).
My matrix is not huge (4x18) and I know that this number only occurs once. On opening the *.csv file, I removed the headers so there is no text.
Can anyone help me with this?
N=csvRead("file.csv",",",".",[],[],[],[],1)
EYY=N(:,8);
MaxEYY=max(EYY);
MinEYY=min(EYY);
[a,b]=find(MaxEYY);
disp([a,b]);
First, you need to understand how find() works: it looks for the true entries of a boolean matrix. So if you want to find a certain value in a matrix, you should do find(value == matrix).
Then, notice that in your code, you are applying find() to MaxEYY, which is a single value, that is, a scalar, a 1-by-1 matrix. When you do that, you can only get (1,1) or [] as results.
Now, combining these two points, this is what you should have done:
[a, b] = find(EYY == MaxEYY);
Also, there is a quicker way to get these indices: max() can also return the index of the maximum value by doing
[MaxEYY, inds] = max(EYY);
And the same goes for min().

Possible to call subfunction in S-function level-2

I have been trying to convert my level-1 S-function to level-2, but I got stuck at calling another subfunction from function Output(block). I have tried looking for other threads, but to no avail; do you mind providing related links?
My output depends on a lot of processing of the inputs, which is why I need to call the subfunction to calculate and then return the output values. All the examples I can find calculate their outputs directly in "function Output(block)"; in my case I thought that was not possible.
I then tried to use the Interpreted MATLAB Function block but failed because the output dimension is not the same as the input dimension, and it also does not support returning more than one output.
Dear Sir/Madam,
I read in the S-function documentation that "S-function level-1 supports vector inputs and outputs. DOES NOT support multiple input and output ports".
Does the second sentence mean the input and output dimensions must be the same?
I have been using S-function level-1 to do the following:
[a1, b1] = choose_cells(c, d);
where a1 and b1 are outputs, and c and d are inputs. All the variables hold a single value, except d, which is an array with 6 values.
Referring to the attached image, we all know that in an S-function block the input dimension must be the same as the output dimension, or we get an error. In this case the input dimension is 7 while the output dimension is 2, so I have to include the "Terminator" blocks in the diagram for it to work; otherwise I get an error.
My problem is that when the system gets bigger, the array d could contain hundreds of variables. With this method I would have to add hundreds of "Terminator" blocks to get this to work, which definitely does not sound practical.
Could you please suggest me a wise way to implement this?
Thanks in advance.
http://imgur.com/ib6BTTp
http://imageshack.us/content_round.php?page=done&id=4tHclZ2klaGtl66S36zY2KfO5co

Dynamic Topic model output - Blei format

I am working with the Dynamic Topic Models package that was developed by Blei. I am new to LDA, but I understand it.
I would like to know what the output file by the name of
lda-seq/topic-000-var-obs.dat stores.
I know that lda-seq/topic-001-var-e-log-prob.dat stores the log of the variational posterior, and by applying the exponential to it I get the probability of each word within Topic 001.
Thanks
topic-000-var-e-log-prob.dat stores the log of the variational posterior of topic 1.
topic-001-var-e-log-prob.dat stores the log of the variational posterior of topic 2.
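For instance, a small sketch of turning one of those files back into probabilities (assuming the same words x times layout the var-obs code below uses, with one row per word and one column per time slice; the bundled example model_run has 10 time slices):
# Read the log word distributions for topic 1 (topic-000) and exponentiate them
n_times <- 10                                              # time slices in the example model_run
logp <- scan("lda-seq/topic-000-var-e-log-prob.dat")
probs <- matrix(exp(logp), ncol = n_times, byrow = TRUE)   # words x times
colSums(probs)                                             # each column should sum to roughly 1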
I have failed to find a concrete answer anywhere. However, the documentation's sample.sh states
The code creates at least the following files:
- topic-???-var-e-log-prob.dat: the e-betas (word distributions) for topic ??? for all times.
...
- gam.dat
without mentioning the topic-000-var-obs.dat file, which suggests that it is not imperative for most analyses.
Speculation
obs suggests observations. After a little digging around in the example/model_run results, I plotted the sum across epochs for each word/token using:
temp = scan("dtm/example/model_run/lda-seq/topic-000-var-obs.dat")
temp.matrix = matrix(temp, ncol = 10, byrow = TRUE)
plot(rowSums(temp.matrix))
and the result is something like:
The general trend of the non-negative values is decreasing, and many values are floored (in this case to -11.00972 = log(1.67e-05)), suggesting that these values are weightings or some other measure of influence on the model. The model removes some tokens, and the influence/importance of the others tapers off over the index. The latter trend may be caused by preprocessing such as sorting tokens by tf-idf when creating the dictionary.
Interestingly, the row-sum values vary for both the floored tokens and the set with more positive values:
temp = scan("~/Documents/Python/inference/project/dtm/example/model_run/lda-seq/topic-009-var-obs.dat")
temp.matrix = matrix(temp, ncol = 10, byrow = TRUE)
plot(rowSums(temp.matrix))