I want to test whether all the assumptions of my linear regression model hold. I checked them manually and they seem to be fine, but I want to double-check with the function gvlma. The output I get is:
gvlma(x = m_lag)

                      Value  p-value                    Decision
Global Stat          82.475  0.00000  Assumptions NOT satisfied!
Skewness             72.378  0.00000  Assumptions NOT satisfied!
Kurtosis              1.040  0.30778     Assumptions acceptable.
Link Function         6.029  0.01407  Assumptions NOT satisfied!
Heteroscedasticity    3.027  0.08187     Assumptions acceptable.
My questions are:
How do I interpret Global Stat?
Since that assumption is violated, what can I do about it now? (The same goes for the other two assumptions that were not satisfied.)
Global Stat - Are the relationships between your X predictors and Y roughly linear? Rejection of the null (p < .05) indicates a non-linear relationship between one or more of your X's and Y.
Skewness - Is your distribution skewed positively or negatively, necessitating a transformation to meet the assumption of normality? Rejection of the null (p < .05) indicates that you should likely transform your data.
Kurtosis - Is your distribution kurtotic (highly peaked or very flat), necessitating a transformation to meet the assumption of normality? Rejection of the null (p < .05) indicates that you should likely transform your data. (A quick way to run the skewness and kurtosis checks yourself is sketched after this list.)
Link Function - Is your dependent variable truly continuous, or categorical? Rejection of the null (p < .05) indicates that you should use an alternative form of the generalized linear model (e.g. logistic or binomial regression).
Heteroscedasticity - Is the variance of your model residuals constant across the range of X (the assumption of homoscedasticity)? Rejection of the null (p < .05) indicates that your residuals are heteroscedastic, i.e. non-constant across the range of X, so your model predicts better for some ranges of your X scales than for others.
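If you want to sanity-check the skewness and kurtosis components yourself, analogous classical tests are easy to run on the residuals. Here is a hedged Python/SciPy sketch (residuals stands in for your model's residual vector; these are the D'Agostino-type tests, not gvlma's exact directional statistics):

import numpy as np
from scipy import stats

# `residuals` stands in for your fitted model's residuals (illustrative data here).
residuals = np.random.default_rng(0).normal(size=200)

stat_skew, p_skew = stats.skewtest(residuals)       # H0: skewness matches a normal
stat_kurt, p_kurt = stats.kurtosistest(residuals)   # H0: kurtosis matches a normal
print(p_skew, p_kurt)                               # p < .05 -> consider a transformation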
I know the question was written a long time ago, but the only answer is not really accurate.
Based on Pena and Slate (2006), the four assumptions in linear regression are normality, homoscedasticity, linearity, and what the authors refer to as uncorrelatedness.
For the assumption 'uncorrelatedness', I usually call it independence. The authors treat independence as something validated by an assessment of uncorrelatedness and normality combined, and they cite other scholars who indicate this is the independence of the residuals (left column, p. 342).
Global Stat
This is the overall metric; it states whether the model, as a whole, passes or fails.
Skewness <- measures the skew of the residual distribution
Kurtosis <- measures the peakedness of the distribution, outliers, influential data, etc.
Link function <- checks for a misspecified model, i.e. how you linked the elements in the model specification
Heteroscedasticity <- looks for equal variance in the residuals
The measurements are not literally skew, kurtosis, etc.; if you look closely at the math behind them, these metrics are derived from multiple statistical analysis methods. In their research, the authors found that combining these four measurements not only accurately assessed the four assumptions of linear regression, but also captured the interaction of the assumptions on the residuals.
To determine what to do first to correct the issues, it would be necessary to know what data you are using, the sample size, and the model you have specified. The high skewness value could come from the distribution, the variance, etc. There are things to look for, based on the original work by Pena and Slate, but it seems that a large or small sample size could drastically change where you start. I have not worked through all of the conclusions in the article, so I cannot say for sure.
Pena, E. A., & Slate, E. H. (2006). Global validation of linear model assumptions. Journal of the American Statistical Association, 101(473), 341-354. https://doi.org/10.1198/016214505000000637
I believe the Berlekamp-Welch algorithm can be used to correctly reconstruct the secret in Shamir's Secret Sharing as long as $t < n/3$. How can we speed up the BW algorithm implementation using the Fast Fourier Transform?
Berlekamp-Welch is used to correct errors for the original encoding scheme for Reed-Solomon codes, where there is a fixed set of data points known to encoder and decoder, and a polynomial based on the message to be transmitted, unknown to the decoder. This approach was mostly replaced by switching to a BCH-type code, where a fixed polynomial known to both encoder and decoder is used instead.
Berlekamp-Welch inverts a matrix, with time complexity $O(n^3)$. Gao improved on this, reducing the time complexity to $O(n^2)$ based on the extended Euclidean algorithm. Note that the $R_{-1}$ product series is pre-computed from the fixed set of data points in order to achieve the $O(n^2)$ time complexity. Here is a link to the Wiki section on "original view" decoders:
https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction#Reed_Solomon_original_view_decoders
The discrete Fourier transform is essentially the same as the encoding process, except there is a constraint on the fixed data points used for encoding (they need to be successive powers of a primitive element of the field) in order for the inverse transform to work. The inverse transform only works if the received data is error free. Lagrange interpolation doesn't have the constraint on the data points and doesn't require the received data to be error free. Wiki has a section on this also:
https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction#Discrete_Fourier_transform_and_its_inverse
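To see the contrast concretely, here is a hedged from-scratch Python sketch over the toy field GF(7), where 3 is a primitive element (all names are illustrative, nothing here is library code): encoding is just evaluating the message polynomial at the successive powers of the primitive element (a DFT), the inverse transform recovers the coefficients only from an error-free codeword, and Lagrange interpolation works from any $k$ error-free points with no constraint on where they sit.

P = 7                     # toy prime field GF(7)
ALPHA = 3                 # 3 is a primitive element mod 7
N = P - 1                 # transform length = order of the multiplicative group

def poly_eval(coeffs, x):
    """Evaluate a polynomial (lowest-order coefficient first) at x, mod P."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc

def dft(coeffs):
    """'Original view' encoding: evaluate at successive powers of ALPHA."""
    return [poly_eval(coeffs, pow(ALPHA, i, P)) for i in range(N)]

def inverse_dft(values):
    """Recover the coefficients; only valid if `values` is error free."""
    inv_n, inv_a = pow(N, -1, P), pow(ALPHA, -1, P)
    return [inv_n * sum(v * pow(inv_a, i * j, P)
                        for i, v in enumerate(values)) % P for j in range(N)]

def lagrange_eval(points, x0):
    """Evaluate the interpolant through `points` at x0; any distinct x's work."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x0 - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

msg = [2, 5, 1]                               # k = 3 message coefficients
code = dft(msg + [0] * (N - len(msg)))        # n = 6 codeword symbols
assert inverse_dft(code)[:3] == msg           # round trip, error free only

# Lagrange interpolation needs any k error-free points, with no constraint
# that the x's be successive powers of ALPHA:
pts = [(pow(ALPHA, i, P), code[i]) for i in (0, 2, 4)]
assert lagrange_eval(pts, 5) == poly_eval(msg, 5)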
In coding theory, the Welch-Berlekamp key equation is an interpolation problem, i.e. $w(x)s(x) = n(x)$ for $x = x_1, x_2, \ldots, x_m$, where $s(x)$ is known. Its solution is a polynomial pair $(w(x), n(x))$ satisfying $\deg(n(x)) < \deg(w(x)) \le m/2$ (here $m$ is even).
The Welch-Berlekamp algorithm solves this in $O(m^2)$. On the other hand, D.B. Blake et al. described the solution set as a module of rank 2 and gave another algorithm (called the modular approach), also in $O(m^2)$; see the paper (DOI: 10.1109/18.391235).
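To make the key equation concrete, here is a hedged from-scratch Python sketch of the plain matrix approach mentioned above (the $O(n^3)$ baseline): build the linear system from the constraints $Q(x_i) = y_i E(x_i)$, with $E$ playing the role of $w$ and $Q$ the role of $n$, solve it by Gaussian elimination, and divide. Everything here is illustrative, over a toy prime field rather than the binary extension fields used in practice.

P = 13  # toy prime field

def solve_mod_p(A, b):
    """Solve A z = b over GF(P) by Gaussian elimination (assumes A invertible)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]            # augmented matrix
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col])   # nonzero pivot
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], -1, P)
        M[col] = [v * inv % P for v in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                factor = M[r][col]
                M[r] = [(v - factor * w) % P for v, w in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def poly_div(num, den):
    """Exact polynomial division mod P (lowest-order coefficient first)."""
    num, out = num[:], [0] * (len(num) - len(den) + 1)
    for i in range(len(out) - 1, -1, -1):
        out[i] = num[i + len(den) - 1] * pow(den[-1], -1, P) % P
        for j, d in enumerate(den):
            num[i + j] = (num[i + j] - out[i] * d) % P
    return out

def welch_berlekamp(xs, ys, k):
    """Recover the degree-<k message polynomial from n noisy evaluations."""
    e = (len(xs) - k) // 2                # number of correctable errors
    # Unknowns: q_0..q_{e+k-1} and e_0..e_{e-1}, with E(x) monic of degree e.
    # Row i encodes Q(x_i) - y_i * E(x_i) = 0, i.e.
    #   sum_j q_j x_i^j - y_i sum_j e_j x_i^j = y_i x_i^e  (mod P).
    A = [[pow(x, j, P) for j in range(e + k)] +
         [-y * pow(x, j, P) % P for j in range(e)] for x, y in zip(xs, ys)]
    b = [y * pow(x, e, P) % P for x, y in zip(xs, ys)]
    z = solve_mod_p(A, b)
    Q, E = z[:e + k], z[e + k:] + [1]     # append the monic leading coefficient
    return poly_div(Q, E)                 # f = Q / E

xs = [1, 2, 3, 4, 5, 6]                           # n = 6 sample points
ys = [(3 + 2 * x) % P for x in xs]                # f(x) = 3 + 2x, so k = 2
ys[0], ys[3] = (ys[0] + 5) % P, (ys[3] + 1) % P   # inject e = 2 errors
print(welch_berlekamp(xs, ys, k=2))               # -> [3, 2]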
Over binary fields, FFTs are tricky since the size of the multiplicative group cannot be a power of 2. However, Lin et al. give a new polynomial basis such that FFTs over binary fields can be computed with complexity $O(n \log n)$. Furthermore, this method has been used for decoding Reed-Solomon (RS) codes with a modular approach. This modular approach takes advantage of the FFT, so its complexity is $O(n \log^2 n)$, the best complexity to date. The details are in (DOI: 10.1109/TCOMM.2022.3215998) and in (https://arxiv.org/abs/2207.11079, open access).
To sum up, there exists a fast modular approach which uses the FFT and is capable of solving the interpolation problem in RS decoding. I should mention that this method requires the evaluation set to be a subspace $V$ or a coset $V + a$. Maybe the above information is helpful.
I'm trying to do a Bayesian gamma regression with Stan.
I know the canonical link function is the inverse, but if I don't use a log link the linear predictor can go negative and feed a negative value into the gamma distribution, which obviously isn't possible.
How can I deal with this?
data {
  int<lower=1> N;                      // number of observations
  int<lower=1> K;                      // number of predictors
  matrix[N, K] X;                      // design matrix
  vector<lower=0>[N] y;                // positive response
}
parameters {
  vector[K] betas;                     // regression coefficients
  real beta0;                          // intercept
  real<lower=0.001, upper=100> shape;  // gamma shape parameter
}
transformed parameters {
  vector<lower=0>[N] eta;              // linear predictor, inverse (canonical) link
  eta = beta0 + X * betas;             // mean is 1/eta, so eta must stay positive
}
model {
  beta0 ~ normal(0, 2^2);
  betas ~ normal(0, 2^2);
  y ~ gamma(shape, shape * eta);       // gamma(shape, rate), so E[y] = 1/eta
}
I was struggling with this a couple of weeks ago, and while I don't consider this a definitive answer, hopefully it is still helpful. For what it's worth, McCullagh and Nelder directly acknowledge this inappropriate support of the canonical link function. They advise that one must constrain the betas to properly match the support. Here's the relevant passage:
The canonical link function yields sufficient statistics which are linear functions of the data and it is given by η = 1/μ. Unlike the canonical links for the Poisson and binomial distributions, the reciprocal transformation, which is often interpretable as the rate of a process, does not map the range of μ onto the whole real line. Thus the requirement that η > 0 implies restrictions on the βs in any linear model. Suitable precautions must be taken in computing β̂ so that negative values of μ̂ are avoided.
-- McCullagh and Nelder (1989). Generalized Linear Models, p. 291
It depends on your X values, but as far as I can tell (please correct me someone!) in an MCMC-based Bayesian case, you can achieve this by either using a truncated prior on the betas or a strong enough prior on your intercept to make the inappropriate regions numerically impossible to reach.
In my case, I ultimately used an identity link with a strong positive prior intercept and that was sufficient and yielded reasonable results.
Also, the choice of link really depends on your X. As the passage above implies, use of the canonical link assumes that your linear model lives in rate space. Log and identity links also appear to be very common; ultimately it's about providing a space that offers a sufficient span for the linear function to capture the response.
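To make the log-link fix concrete numerically, here is a small hedged NumPy check (not Stan; names are illustrative): with mu = exp(eta) the mean is positive for any real eta, and a gamma with shape a and rate a/mu (i.e. scale mu/a) has mean mu.

import numpy as np

rng = np.random.default_rng(0)
shape = 2.5
eta = np.array([-3.0, 0.0, 1.7])        # the linear predictor can be any real value
mu = np.exp(eta)                        # the log link keeps the mean positive
draws = rng.gamma(shape, scale=mu / shape, size=(100_000, 3))
print(mu)                               # target means
print(draws.mean(axis=0))               # empirical means match mu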
I am working on a multi-class image recognition problem. The task is to have the correct answer among the top 3 output probabilities. So I was thinking that maybe there exists a clever cost function that prioritizes getting the correct answer into the top K and doesn't penalize much within those top K.
This can be achieved by a class-weighted cross-entropy loss, which essentially assigns a weight to the error associated with each class. This kind of loss is used in research; e.g. see the paper "Multi-task learning and Weighted Cross-entropy for DNN-based Keyword Spotting" by S. Panchapagesan et al. Before computing the cross-entropy, you can check whether the predicted distribution satisfies your condition (e.g., the ground-truth class is in the top-k predicted classes) and, if it does, assign zero (or near-zero) weight accordingly, as sketched below.
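A hedged NumPy sketch of that idea (all names are illustrative; probs stands for the softmax outputs, labels for the ground-truth indices): standard cross-entropy, but with a per-example weight zeroed whenever the true class already sits in the predicted top-k.

import numpy as np

def topk_masked_cross_entropy(probs, labels, k=3):
    """Mean CE over examples whose true class is NOT in the top-k predictions."""
    topk = np.argsort(probs, axis=1)[:, -k:]            # top-k class indices
    in_topk = (topk == labels[:, None]).any(axis=1)     # per-example hit flag
    weights = np.where(in_topk, 0.0, 1.0)               # zero weight on hits
    ce = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return (weights * ce).sum() / max(weights.sum(), 1.0)

probs = np.array([[0.1, 0.6, 0.2, 0.1],     # true class (1) is in the top 3
                  [0.5, 0.2, 0.2, 0.1]])    # true class (3) is not
labels = np.array([1, 3])
print(topk_masked_cross_entropy(probs, labels, k=3))    # only row 2 contributes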
There are open questions, though: when the correct class is in the top-k, should you penalize the k-1 incorrectly predicted classes? What if, for example, the prediction is (0.9, 0.05, 0.01, ...), the third class is correct and it is in the top 3 -- is this prediction good enough or not? Should you care which k-1 incorrect classes they are?
All these questions arise because this kind of loss has no probabilistic interpretation, unlike standard cross-entropy. That's why I wouldn't recommend using it in practice, but would instead reformulate the goal.
E.g., if the original problem is that for some inputs several classes are equally good, the best way to deal with it is to use soft labels, e.g. (0.33, 0.33, 0.33, 0, 0, 0, ...) instead of one-hot labels (note that this agrees fully with the probabilistic interpretation). This will force the network to learn features associated with all three good classes, and generally leads to the same goal, but with better control over the target classes.
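A matching NumPy sketch of the soft-label alternative (again, names are illustrative): ordinary cross-entropy, but against a target that spreads its mass over the equally good classes.

import numpy as np

def soft_label_cross_entropy(probs, targets):
    """Mean CE between predicted probabilities and soft target distributions."""
    return -(targets * np.log(probs + 1e-12)).sum(axis=1).mean()

targets = np.array([[1/3, 1/3, 1/3, 0.0]])    # three equally good classes
probs = np.array([[0.30, 0.40, 0.25, 0.05]])
print(soft_label_cross_entropy(probs, targets))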
I would like to know whether there are any rules for setting the hyper-parameters alpha and eta in the LDA model. I ran an LDA model using the library gensim:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=30, id2word = dictionary, passes=50, minimum_probability=0)
But I have my doubts about the specification of the hyper-parameters. From what I read in the library documentation, both hyper-parameters are set by default to 1/number-of-topics. Given that my model has 30 topics, both hyper-parameters are set to the common value 1/30. I am running the model on news articles that describe economic activity. For this reason, I expect the document-topic distributions (theta) to be spread out (documents sharing similar topics), and the topic-word distributions (phi) to be spread out as well (topics sharing many words in common, or words not being very exclusive to each topic). For this reason, and assuming my understanding of the hyper-parameters is correct, is 1/30 a correct value to specify?
I'll assume you expect theta and phi (the document-topic and topic-word proportions) to be closer to equiprobable distributions than to sparse ones with exclusive topics/words.
Since alpha and beta are the parameters of symmetric Dirichlet priors, they have a direct influence on what you want. A Dirichlet distribution outputs probability distributions. When the parameter is 1, all possible distributions are equally likely (for K=2, [0.5, 0.5] and [0.99, 0.01] have the same density). When the parameter is >1, it behaves as a pseudo-count, a prior belief: for a high value, near-equiprobable outputs are preferred (P([0.5, 0.5]) > P([0.99, 0.01])). A parameter <1 has the opposite behaviour. For big vocabularies you don't expect topics to put probability on all words, which is why beta tends to be below 1 (and the same goes for alpha). A quick simulation makes this visible:
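A hedged NumPy illustration (nothing LDA-specific, just draws from symmetric Dirichlets of dimension K = 5):

import numpy as np

rng = np.random.default_rng(0)
K = 5
for a in (0.1, 1.0, 10.0):
    draws = rng.dirichlet([a] * K, size=3)
    print(f"alpha={a}:")
    print(draws.round(3))   # a<1: mass piles onto few components; a>1: near 1/K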
However, since you're using Gensim, you can let the model learn alpha and beta for you, allowing it to learn asymmetric vectors (see here), where the documentation states:
alpha can be set to an explicit array = prior of your choice. It also supports special values of 'asymmetric' and 'auto': the former uses a fixed normalized asymmetric 1.0/topicno prior, the latter learns an asymmetric prior directly from your data.
The same for eta (which I call beta).
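For instance, a sketch reusing the corpus and dictionary from the question (only the alpha/eta arguments differ from the call you posted):

from gensim.models import LdaModel

# 'auto' asks Gensim to learn asymmetric priors directly from the data.
ldamodel = LdaModel(corpus, num_topics=30, id2word=dictionary, passes=50,
                    minimum_probability=0, alpha='auto', eta='auto')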
I am doing an analysis in Stata of the determinants of census tract unemployment rates. Some of the previous literature on my topic has used straight OLS regression, and I started with this type of analysis, but after my own further reading it seems to me that a generalized linear model is better. This is mainly because I am interested in presenting predicted values for the census tracts' unemployment rates based on my regression, and I would like these to be appropriately bounded (between 0% and 100% inclusive). My unemployment rates include 0s for some census tracts, so I need to take this into account.
My questions are:
whether Stata's fracreg logit is equivalent to its glm with a logit link and binomial family? (I have read about using the glm version in a few places, including here, but see that fracreg is a newish command which seems to serve the same purpose.) Can I specify an equivalent of the robust option when using fracreg logit?
if using fracreg, on what basis should I decide between a fractional probit (fracreg probit) and a fractional logit (fracreg logit) regression?
a simple (probably ignorant) question of interpretation: I see that the fracreg and glm regressions mentioned above don't report an R-squared value. Is there an equivalent measure I can calculate for these regressions? My OLS R-squared values have been reasonably high and this has been a point of reassurance for me, so I'd like to see how these models compare (though I know R-squared isn't everything!).
if using these models, are there any additional restrictions or assumptions (beyond the BLUE assumptions of OLS) that I should keep in mind? With my OLS regressions I have taken the natural log of unemployment rates (it makes my residuals more normal, raises my R-squared, and gives a convenient interpretation). Could I do the same with the fracreg or glm regressions above?
It's been a while since I formally studied limited dependent variables, so please excuse my ignorance on these issues.
I have cross-posted this question at Statalist here.
This isn't Stata-specific, but check out Paolino's 2001 "Maximum Likelihood Estimation of Models with Beta-Distributed Dependent Variables"; at a minimum it will point you to a literature review on why OLS yields biased estimators.
Hey, follow-up: someone did produce a Stata solution; check out Buckley, Jack. 2003. "Estimation of Models with Beta-Distributed Dependent Variables: A Replication and Extension of Paolino's Study." Political Analysis 11(2): 204-205.