Rules to set hyper-parameters alpha and theta in the LDA model

I would like to know whether there are any rules for setting the hyper-parameters alpha and theta in the LDA model. I ran an LDA model using the gensim library:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=30, id2word = dictionary, passes=50, minimum_probability=0)
But I have my doubts about the specification of the hyper-parameters. From what I read in the library documentation, both hyper-parameters are set to 1/number of topics. Given that my model has 30 topics, both hyper-parameters are set to the common value 1/30. I am running the model on news articles that describe economic activity. For this reason, I expect the document-topic distribution (theta) to be high (documents sharing similar topics), and the topic-word distribution (alpha) to be high as well (topics sharing many words in common, i.e. words not being exclusive to each topic). For this reason, and assuming my understanding of the hyper-parameters is correct, is 1/30 a correct specification value?

I'll assume you expect theta and phi (the document-topic and topic-word proportions) to be closer to equiprobable distributions rather than sparse ones with exclusive topics/words.
Since alpha and beta are the parameters of symmetric Dirichlet priors, they have a direct influence on what you want. A Dirichlet distribution outputs probability distributions. When the parameter is 1, all possible distributions are equally likely outcomes (for K=2, [0.5, 0.5] and [0.99, 0.01] have the same chances). When the parameter is greater than 1, it behaves as a pseudo-count, a prior belief: for a high value, equiprobable outputs are preferred (P([0.5, 0.5]) > P([0.99, 0.01])). A parameter below 1 has the opposite behaviour. For big vocabularies you don't expect topics to spread probability over all words, which is why beta tends to be below 1 (and the same goes for alpha).
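To make this concrete, here is a small illustrative sketch (my addition, assuming numpy and scipy are available) that evaluates a symmetric Dirichlet density at the even and sparse K=2 distributions used above:
import numpy as np
from scipy.stats import dirichlet

even = np.array([0.5, 0.5])
sparse = np.array([0.99, 0.01])

for a in (0.1, 1.0, 10.0):        # concentration parameter of the symmetric prior
    alpha = np.array([a, a])
    print(f"alpha={a}: p(even)={dirichlet.pdf(even, alpha):.3f}, "
          f"p(sparse)={dirichlet.pdf(sparse, alpha):.3f}")
# alpha=1 gives equal density to both; alpha>1 favours the even distribution;
# alpha<1 favours the sparse one.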
However, since you're using Gensim, you can let the model learn the alpha and beta values for you, which also allows learning asymmetric vectors (see here), where the documentation states:
alpha can be set to an explicit array = prior of your choice. It also
support special values of ‘asymmetric’ and ‘auto’: the former uses a
fixed normalized asymmetric 1.0/topicno prior, the latter learns an
asymmetric prior directly from your data.
The same applies to eta (which I called beta above).
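For example, a minimal sketch of what the quoted documentation describes, reusing the corpus and dictionary variables from the question (the exact keyword arguments may vary slightly across gensim versions):
import gensim

ldamodel = gensim.models.ldamodel.LdaModel(
    corpus,
    num_topics=30,
    id2word=dictionary,
    passes=50,
    minimum_probability=0,
    alpha='auto',   # learn an asymmetric document-topic prior from the data
    eta='auto',     # learn the topic-word prior (the "beta" above) as well
)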


Would this be a valid implementation of an ordinal cross-entropy?

Would this be a valid implementation of a cross-entropy loss that takes the ordinal structure of the ground-truth y into consideration? y_hat is the prediction from a neural network.
import torch
import torch.nn.functional as F

ce_loss = F.cross_entropy(y_hat, y, reduction="none")
distance_weight = torch.abs(y_hat.argmax(1) - y) + 1
ordinal_ce_loss = torch.mean(distance_weight * ce_loss)
I'll attempt to answer this question by first fully defining the task, since the question is a bit sparse on details.
I have a set of ordinal classes (e.g. first, second, third, fourth,
etc.) and I would like to predict the class of each data example from
among this set. I would like to define an entropy-based loss-function
for this problem. I would like this loss function to weight the loss
between a predicted class torch.argmax(y_hat) and the true class y
according to the ordinal distance between the two classes. Does the
given loss expression accomplish this?
Short answer: sure, it is "valid". You've roughly implemented L1-norm ordinal class weighting. I'd question whether this is truly the correct weighting strategy for this problem.
For instance, consider that for a true label n, the bin n response is weighted by 1, but the bin n+1 and n-1 responses are weighted by 2. This means that a lot more emphasis will be placed on NOT predicting false positives than on correctly predicting true positives, which may imbue your model with some strange bias.
It also means that examples at the edges will result in a larger total sum of weights, meaning that you'll be weighting examples whose true label is, say, "first" or "last" more highly than the intermediate classes. (Say you have 5 classes: 1, 2, 3, 4, 5. A true label of 1 gives distance_weights of [1, 2, 3, 4, 5] across the possible predictions, which sum to 15. A true label of 3 gives distance_weights of [3, 2, 1, 2, 3], which sum to 11.)
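A tiny sketch (my addition, with 0-indexed classes) that reproduces these numbers from the distance_weight expression in the question:
import torch

num_classes = 5
classes = torch.arange(num_classes)          # candidate predicted bins 0..4
for true_label in (0, 2):                    # "first" vs. an intermediate class
    distance_weight = torch.abs(classes - true_label) + 1
    print(true_label, distance_weight.tolist(), distance_weight.sum().item())
# true label 0 -> weights [1, 2, 3, 4, 5], sum 15
# true label 2 -> weights [3, 2, 1, 2, 3], sum 11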
In general, classification problems and entropy-based losses are underpinned by the assumption that no set of classes or categories is any more or less related than any other set. In essence, the input data is embedded into an orthogonal feature space where each class represents one vector in the basis. This is quite plainly a bad assumption in your case, meaning that this embedding space is probably not particularly elegant: thus, you have to correct for it with a somewhat hacky weight fix. And in general, this assumption of class non-correlation is probably not true in a great many classification problems (consider e.g. the classic ImageNet classification problem, in which the class pairs [bus, car] and [bus, zebra] are treated as equally dissimilar; but this is probably a digression into the inherent lack of usefulness of strict ontological structuring of information, which is outside the scope of this answer...).
Long Answer: I'd highly suggest moving into a space where the ordinal value you care about is instead expressed continuously. (In the first/second/third example, you might for instance output a continuous value over the range [1, max_place].) This allows you to benefit from loss functions that already capture well the notion that predictions closer in an ordered space are better than predictions farther away in an ordered space (e.g. MSE, Smooth-L1, etc.).
Let's consider one more time the case of the [first,second,third,etc.] ordinal class example, and say that we are trying to predict the places of a set of runners in a race. Consider two races, one in which the first place runner wins by 30% relative to the second place runner, and the second in which the first place runner wins by only 1%. This nuance is entirely discarded by the ordinal discrete classification. In essence, the selection of an ordinal set of classes truncates the amount of information conveyed in the prediction, which means not only that the final prediction is less useful, but also that the loss function encodes this strange truncation and binarization, which is then reflected (perhaps harmfully) in the learned model. This problem could likely be much more elegantly solved by regressing the finishing position, or perhaps instead by regressing the finishing time, of each athlete, and then performing the final ordinal classification into places OUTSIDE of the network training.
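As a rough sketch of that regression alternative (all names here are hypothetical, not from the question), one could regress a continuous finishing time with a Smooth-L1 loss and only convert predictions into places after training:
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
criterion = nn.SmoothL1Loss()           # respects "closer is better" by construction

features = torch.randn(8, 16)           # dummy batch of 8 athletes
finish_time = torch.rand(8, 1) * 100    # dummy continuous targets

pred = model(features)
loss = criterion(pred, finish_time)
loss.backward()

# Ordinal classification into places happens OUTSIDE the training loop:
places = torch.argsort(torch.argsort(pred.squeeze(1))) + 1   # 1 = predicted winner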
In conclusion, you might expect a well-trained ordinal classifier to produce essentially a normal distribution of responses across the class bins, with the distribution peak on the true value: a binned discretization of a space that almost certainly could, and likely should, be treated as a continuous space.

what's the meaning of 'parameterize' in deep learning?

What's the meaning of 'parameterize' in deep learning? As shown in the photo, does it mean that the matrix 'A' can be changed by the optimizer during training?
Yes, when something can be parameterized it means that gradients can be computed for it.
This means that dE/dw, the derivative of the error with respect to the weight, can be calculated (i.e. the operation must be differentiable) and subtracted from the model weights, scaled by a learning rate and whatever other terms the chosen optimizer adds.
What the paper is saying is that if you make a binary matrix a weight, find the gradient dE/dw of that weight with respect to a loss, and then update the binary matrix through backpropagation, there is no activation function (which by requirement must be differentiable) that can keep the values discrete (like 0 and 1); instead you will end up with continuous values (arbitrary decimals).
Therefore it is saying that, because it is difficult to have binary weights that remain binary after a backpropagation update through the weights plus activation function, another solution is used instead: for example, a Bernoulli distribution is used to initialize the parameters of the model.
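A small illustration of that point (my addition, not from the paper): a single SGD step on a binary-initialized weight matrix already leaves the binary set.
import torch

A = torch.bernoulli(torch.full((3, 3), 0.5)).requires_grad_()  # binary 0/1 init
x = torch.randn(3)

loss = (A @ x).sum()       # any differentiable loss involving A
loss.backward()

with torch.no_grad():
    A -= 0.1 * A.grad      # plain SGD update
print(A)                   # entries are now continuous, no longer just 0 and 1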
Hope this helps,

Speeding up Berlekamp Welch algorithm using FFT for Shamir Secret Share

I believe the Berlekamp Welch algorithm can be used to correctly reconstruct the secret in Shamir Secret Sharing as long as $t<n/3$. How can we speed up the BW algorithm implementation using the Fast Fourier transform?
Berlekamp Welch is used to correct errors for the original encoding scheme for Reed Solomon code, where there is a fixed set of data points known to encoder and decoder, and a polynomial based on the message to be transmitted, unknown to the decoder. This approach was mostly replaced by switching to a BCH type code where a fixed polynomial known to both encoder and decoder is used instead.
Berlekamp Welch inverts a matrix with time complexity O(n^3). Gao improved on this, reducing the time complexity to O(n^2) based on the extended Euclidean algorithm. Note that the R[-1] product series is pre-computed from the fixed set of data points in order to achieve the O(n^2) time complexity. Link to the Wiki section on "original view" decoders:
https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction#Reed_Solomon_original_view_decoders
The discrete Fourier transform is essentially the same as the encoding process, except that there is a constraint on the fixed data points used for encoding (they need to be successive powers of the field's primitive element) in order for the inverse transform to work. The inverse transform only works if the received data is error free. Lagrange interpolation doesn't have this constraint on the data points and doesn't require the received data to be error free. Wiki has a section on this as well:
https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction#Discrete_Fourier_transform_and_its_inverse
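As a toy sketch of that correspondence (my addition; a small prime field chosen purely for illustration), evaluating the sharing polynomial at successive powers of an n-th root of unity in GF(p) is exactly a DFT, and the inverse DFT recovers the coefficients only when the shares are error free:
p = 257                      # prime field GF(257); 16 divides p - 1
n = 16
w = pow(3, (p - 1) // n, p)  # 3 is a primitive root mod 257, so w has order n

coeffs = [123, 45, 67, 89] + [0] * (n - 4)   # secret = 123, degree-3 polynomial

def dft(c, root):
    # evaluate the polynomial with coefficients c at root^0, root^1, ..., root^(n-1)
    return [sum(cj * pow(root, i * j, p) for j, cj in enumerate(c)) % p
            for i in range(n)]

shares = dft(coeffs, w)                       # the encoding / share values
n_inv = pow(n, p - 2, p)                      # n^(-1) mod p
recovered = [(n_inv * y) % p for y in dft(shares, pow(w, p - 2, p))]  # inverse DFT
assert recovered == coeffs                    # holds only for error-free shares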
In coding theory, the Welch-Berlekamp key equation is an interpolation problem, i.e. $w(x)s(x) = n(x)$ for $x = x_1, x_2, \ldots, x_m$, where $s(x)$ is known. Its solution is a polynomial pair $(w(x), n(x))$ satisfying $\deg(n(x)) < \deg(w(x)) \le m/2$ (here $m$ is even).
The Welch-Berlekamp algorithm solves this in $O(m^2)$. On the other hand, D.B. Blake et al. described the solution set as a module of rank 2 and gave another algorithm (called the modular approach), also with $O(m^2)$ complexity. You can see the paper (DOI: 10.1109/18.391235).
Over binary fields, FFTs are tricky because the size of the multiplicative group cannot be a power of 2. However, Lin et al. give a new polynomial basis under which FFT transforms over binary fields have complexity $O(n \log n)$. Furthermore, this method has been used for decoding Reed-Solomon (RS) codes via a modular approach. The modular approach takes advantage of the FFT so that its complexity is $O(n \log^2 n)$, which is the best complexity to date. The details are in (DOI: 10.1109/TCOMM.2022.3215998) and in (https://arxiv.org/abs/2207.11079, open access).
To sum up, there exists a fast modular approach which uses the FFT and is capable of solving the interpolation problem in RS decoding. Note that this method requires the evaluation set to be a subspace $v$ or a coset $v + a$. Maybe the above information is helpful.

Is it possible to implement a loss function that prioritizes the correct answer being in the top k probabilities?

I am working on a multi-class image recognition problem. The task is to have the correct answer among the top 3 output probabilities. So I was thinking that maybe there exists a clever cost function that prioritizes the correct answer being in the top K and doesn't penalize much within these top K.
This can be achieved with a class-weighted cross-entropy loss, which essentially assigns a weight to the errors associated with each class. This kind of loss is used in research, e.g. see the paper "Multi-task Learning and Weighted Cross-entropy for DNN-based Keyword Spotting" by S. Panchapagesan et al. Before computing the cross-entropy, you can check whether the predicted distribution satisfies your condition (e.g., the ground-truth class is in the top-k predicted classes) and, if it does, assign zero (or near-zero) weights accordingly.
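A rough sketch of that idea (my addition; topk_weighted_ce is a hypothetical helper, not from the cited paper): compute the per-example cross-entropy and zero out its weight whenever the ground truth is already in the top-k predictions.
import torch
import torch.nn.functional as F

def topk_weighted_ce(logits, targets, k=3):
    ce = F.cross_entropy(logits, targets, reduction="none")   # per-example loss
    topk = logits.topk(k, dim=1).indices                      # predicted top-k classes
    in_topk = (topk == targets.unsqueeze(1)).any(dim=1)       # True if GT is in top-k
    weights = torch.where(in_topk, torch.zeros_like(ce), torch.ones_like(ce))
    return (weights * ce).mean()

logits = torch.randn(8, 10)             # dummy batch, 10 classes
targets = torch.randint(0, 10, (8,))
loss = topk_weighted_ce(logits, targets)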
There are open questions though: when the correct class is in the top-k, should you penalize the k-1 incorrectly predicted classes? What if, for example, the prediction is (0.9, 0.05, 0.01, ...), the third class is correct and it is in the top-3 -- is this prediction good enough or not? Should you care which exactly the k-1 incorrect classes are?
All these questions arise because this kind of loss doesn't have a probabilistic interpretation, unlike the standard cross-entropy. That's why I wouldn't recommend using it in practice, but would reformulate the goal instead.
E.g., if the original problem is that for some inputs several classes are equally good, the best way to deal with it is to use soft labels, e.g. (0.33, 0.33, 0.33, 0, 0, 0, ...) instead of one-hot labels (note that this totally agrees with the probabilistic interpretation). It will force the network to learn features associated with all three good classes, and generally leads to the same goal, but with better control over the target classes.
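A minimal sketch of the soft-label approach (my addition): recent PyTorch versions of F.cross_entropy accept probabilistic targets directly; otherwise the manual form below is equivalent.
import torch
import torch.nn.functional as F

num_classes = 10
logits = torch.randn(4, num_classes)

soft_targets = torch.zeros(4, num_classes)
soft_targets[:, :3] = 1.0 / 3           # three equally good classes per example

loss = F.cross_entropy(logits, soft_targets)
# Equivalent formulation that also works on older versions:
loss_manual = -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()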

Using and interpreting output from gvlma

I want to test whether all assumptions for my linear regression model hold. I did this manually and it seems to be fine. However, I want to double check with the function gvlma. The output I get is:
gvlma(x = m_lag)
                     Value  p-value                    Decision
Global Stat         82.475  0.00000  Assumptions NOT satisfied!
Skewness            72.378  0.00000  Assumptions NOT satisfied!
Kurtosis             1.040  0.30778     Assumptions acceptable.
Link Function        6.029  0.01407  Assumptions NOT satisfied!
Heteroscedasticity   3.027  0.08187     Assumptions acceptable.
My questions are:
How do I interpret the Global Stat?
Since the assumption is violated, what can I do about it now? (The same applies to the other two assumptions which were not accepted.)
Global Stat - Are the relationships between your X predictors and Y roughly linear? Rejection of the null (p < .05) indicates a non-linear relationship between one or more of your X's and Y.
Skewness - Is your distribution skewed positively or negatively, necessitating a transformation to meet the assumption of normality? Rejection of the null (p < .05) indicates that you should likely transform your data.
Kurtosis- Is your distribution kurtotic (highly peaked or very shallowly peaked), necessitating a transformation to meet the assumption of normality? Rejection of the null (p < .05) indicates that you should likely transform your data.
Link Function- Is your dependent variable truly continuous, or categorical? Rejection of the null (p < .05) indicates that you should use an alternative form of the generalized linear model (e.g. logistic or binomial regression).
Heteroscedasticity - Is the variance of your model residuals constant across the range of X (assumption of homoscedasticity)? Rejection of the null (p < .05) indicates that your residuals are heteroscedastic, and thus non-constant across the range of X. Your model is better/worse at predicting for certain ranges of your X scales.
I know the question was written a long time ago, but the only answer is not really accurate.
Based on Pena and Slate (2006), the four assumptions in linear regression are normality, homoscedasticity, linearity, and what the authors refer to as uncorrelatedness.
For the assumption of 'uncorrelatedness', I usually call it independence. The authors refer to independence as a measurement that is validated by an assessment of uncorrelatedness and normality combined. The authors also cite other scholars who indicate this is the independence of the residuals (left column, p. 342).
Global Stat <- the overall metric; this states whether the model, as a whole, passes or fails
Skewness <- measuring the distribution
Kurtosis <- measuring the distribution, outliers, influential data, etc.
Link function <- misspecified model, how you linked the elements in the model assignment
Heteroscedasticity <- looking for equal variance in the residuals
The measurements are not literally skew, kurtosis, etc., if you look closely at the math behind the measures; these metrics are mathematical derivations built from multiple statistical analysis methods. In their research, the authors found that combining these four measurements not only accurately assessed the four assumptions of linear regression, but also captured the interaction of the assumptions on the residuals.
In order to determine what to do first to correct the issues, it would be necessary to know what data you are using, the sample size, and the model you have established. The high value in skew could come from the distribution, the variance, etc. There are things to look for, based on the original work by Pena and Slate, but it seems that a particularly large or small sample size could drastically change where you start. I have not worked through all of the conclusions in the article to know for sure.
Pena, E. A., & Slate, E. H. (2006). Global validation of linear model assumptions. Journal of the American Statistical Association, 101(473), 341-354. https://doi.org/10.1198/016214505000000637