What is the difference between RSE and MSE? - regression

I am going through An Introduction to Statistical Learning with Applications in R by James, Witten, Hastie, and Tibshirani. I came across two concepts: RSE and MSE. My understanding is like this:
RSE = sqrt(RSS / (n - 2))
MSE = RSS / n
Now I am building 3 models for a problem and need to compare them. While MSE comes intuitively to me, I was wondering whether calculating RSS/(n - 2), which according to the above is RSE^2, would be of any use.
I am not sure which to use where.

RSE is an estimate of the standard deviation of the residuals, and therefore also of the observations around the regression line, which is why it equals sqrt(RSS/df). In your case of a simple linear model, df = n - 2, since two parameters (the intercept and the slope) are estimated.
MSE is the mean squared error of your model, and it is usually calculated on a test set to compare the predictive accuracy of your fitted models. Since we are concerned with the mean error of the model, we divide by n.

I think of RSE as closely related to MSE (RSE is essentially a degrees-of-freedom-corrected version of it).
In general, MSE = RSS / degrees of freedom.
The MSE of a single sample (X1, X2, ..., Xn) would be RSS over n,
or, more accurately, RSS/(n - 1)
(since one degree of freedom is used up in estimating the mean).
But in a simple linear regression of Y on X with two fitted terms (intercept and slope), the degrees of freedom are reduced by both, giving n - 2.
Thus your MSE = RSS/(n - 2),
and its square root is what is called the RSE.
In a model with more parameters, i.e. a collection of many βs (more than two terms, e.g. y ~ β0 + β1*X1 + β2*X2 + ...), one can penalize the fit by reducing the denominator by the number of fitted parameters:
MSE = RSS/(n - p), where p is the number of fitted parameters.
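As a minimal sketch of how these quantities relate (in Python with numpy; the toy data and the simple linear fit are made up purely for illustration):
import numpy as np

# toy data (made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=50)

# fit a simple linear regression y = b0 + b1*x
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

rss = np.sum(residuals ** 2)    # residual sum of squares
n = len(y)

mse = rss / n                   # "plain" MSE: divide by n
mse_df = rss / (n - 2)          # residual mean square: divide by n - 2 (two fitted parameters)
rse = np.sqrt(mse_df)           # RSE is its square root

print(mse, mse_df, rse)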

Is it possible that the number of basis functions is more than the number of observations in spline regression?

I want to run a regression spline with B-spline basis functions. The data is structured in such a way that the number of observations is less than the number of basis functions, and I get a good result.
But I'm not sure whether this is valid.
Do I have to have more rows than columns, as in linear regression?
Thank you.
When the number of observations, N, is small, it's easy to fit a model with basis functions and get a low squared error. If you have more basis functions than observations, then you can get zero residuals (a perfect fit to the data). But that is not to be trusted, because it may not be representative of new data points. So yes, you want to have more observations than you have columns. Mathematically, you cannot properly estimate more than N columns because of collinearity. As a rule of thumb, 15-20 observations are usually needed for each additional variable / spline term.
But this isn't always the case, such as in genetics, where we have hundreds of thousands of potential variables and a small sample size. In that case, we turn to tools that help with small samples, such as cross-validation and the bootstrap.
Bootstrap (i.e. resample with replacement) your data points and refit the splines many times (100 refits will probably do). Then average the splines and use the result as the final spline function. Or you could do cross-validation, where you train on a subset of the data (say 70%) and then test on the remaining data.
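A minimal sketch of the bootstrap-and-average idea (in Python with numpy/scipy; the toy data and the smoothing parameter are assumptions chosen only for illustration):
import numpy as np
from scipy.interpolate import UnivariateSpline

# toy data (made up for illustration)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=40))
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

x_grid = np.linspace(0, 10, 200)
n_boot = 100
curves = np.empty((n_boot, x_grid.size))

for b in range(n_boot):
    # resample the data points with replacement
    idx = rng.integers(0, x.size, size=x.size)
    order = np.argsort(x[idx])
    xb, yb = x[idx][order], y[idx][order]
    # refit a smoothing spline to the bootstrap sample (s > 0 allows tied x values)
    spl = UnivariateSpline(xb, yb, s=1.0)
    curves[b] = spl(x_grid)

# average the bootstrapped spline fits to get the final curve
final_curve = curves.mean(axis=0)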
In the functional data analysis framework, there are packages in R that construct and fit spline bases (such as cubic, B, etc). These packages include refund, fda, and fda.usc.
For example,
B <- smooth.construct.cc.smooth.spec(
  object = list(term = "day.t", bs.dim = 12, fixed = FALSE, dim = 1, p.order = NA, by = NA),
  data = list(day.t = 200:320),
  knots = list()
)
constructs a cyclic cubic spline basis of dimension 12 (over time, day.t), but you can also use these packages to help choose the basis dimension.

Using and interpreting output from gvlma

I want to test whether all assumptions for my linear regression model hold. I did this manually and it seems to be fine. However, I want to double check with the function gvlma. The output I get is:
gvlma(x = m_lag)
                     Value  p-value                    Decision
Global Stat         82.475  0.00000  Assumptions NOT satisfied!
Skewness            72.378  0.00000  Assumptions NOT satisfied!
Kurtosis             1.040  0.30778     Assumptions acceptable.
Link Function        6.029  0.01407  Assumptions NOT satisfied!
Heteroscedasticity   3.027  0.08187     Assumptions acceptable.
My questions are:
How do I interpret the Global Stat?
Since the assumption is violated, what can I do about it now? (The same goes for the other two assumptions that were not satisfied.)
Global Stat - Are the relationships between your X predictors and Y roughly linear? Rejection of the null (p < .05) indicates a non-linear relationship between one or more of your X's and Y.
Skewness - Is your distribution skewed positively or negatively, necessitating a transformation to meet the assumption of normality? Rejection of the null (p < .05) indicates that you should likely transform your data (a brief sketch of this remedy follows the list below).
Kurtosis- Is your distribution kurtotic (highly peaked or very shallowly peaked), necessitating a transformation to meet the assumption of normality? Rejection of the null (p < .05) indicates that you should likely transform your data.
Link Function- Is your dependent variable truly continuous, or categorical? Rejection of the null (p < .05) indicates that you should use an alternative form of the generalized linear model (e.g. logistic or binomial regression).
Heteroscedasticity - Is the variance of your model residuals constant across the range of X (assumption of homoscedasticity)? Rejection of the null (p < .05) indicates that your residuals are heteroscedastic, and thus non-constant across the range of X. Your model is better/worse at predicting for certain ranges of your X scales.
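As a rough sketch of the "transform your data" remedy (shown here in Python with numpy/statsmodels rather than R; the toy data and the choice of a log transform are assumptions for illustration, not a general prescription):
import numpy as np
import statsmodels.api as sm

# toy right-skewed response (made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)
y = np.exp(0.3 * x + rng.normal(scale=0.4, size=200))

X = sm.add_constant(x)

# original fit: the residuals will be strongly right-skewed
fit_raw = sm.OLS(y, X).fit()

# refit on the log scale to reduce the skewness
fit_log = sm.OLS(np.log(y), X).fit()

print(fit_raw.rsquared, fit_log.rsquared)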
I know the question was written a long time ago, but the only answer is not really accurate.
Based on Pena and Slate (2006), the four assumptions in linear regression are normality, homoscedasticity, linearity, and what the authors refer to as uncorrelatedness.
For the assumption of 'uncorrelatedness', I usually call it independence. The authors treat independence as something that is validated by an assessment of uncorrelatedness and normality combined. They also refer to other scholars who indicate that this is the independence of the residuals (left column, p. 342).
Global Stat
This is the overall metric; this states whether the model, as a whole, passes or fails.
Skewness <- measuring the distribution
Kurtosis <- measuring the distribution, outliers, influential data, etc
Link function <- misspecified model, how you linked the elements in the model assignment
Heteroscedasticity <- looking for equal variance in the residuals
The measurements are not literally skewness, kurtosis, etc., if you look closely at the math behind them; these metrics are mathematical derivations drawn from multiple statistical methods. In their research, the authors found that when they combined these four measurements, they not only accurately assessed the four assumptions of linear regression, but also captured the interaction of the assumptions on the residuals.
In order to determine what to do first to correct the issues, it would be necessary to know what data you are using, the sample size, and the model you have established. The high value in skewness could come from the distribution, the variance, etc. There are things to look for, based on the original work by Pena and Slate, but whether you have a large or small sample size could drastically change where you start. I have not worked through all of the conclusions in the article to know for sure.
Pena, E. A., & Slate, E. H. (2006). Global validation of linear model assumptions. Journal of the American Statistical Association, 101(473), 341-354. https://doi.org/10.1198/016214505000000637

Rules to set hyper-parameters alpha and theta in LDA model

I would like to know whether there are any rules for setting the hyper-parameters alpha and theta in the LDA model. I run an LDA model using the library gensim:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=30, id2word = dictionary, passes=50, minimum_probability=0)
But I have my doubts about the specification of the hyper-parameters. From what I read in the library documentation, both hyper-parameters are set to 1/(number of topics) by default. Given that my model has 30 topics, both hyper-parameters are set to the common value 1/30. I am running the model on news articles that describe economic activity. For this reason, I expect the document-topic distribution (theta) to be high (documents sharing similar topics), while the topic-word distribution (alpha) would be high as well (topics sharing many words in common, or words not being so exclusive to each topic). Given this, and assuming my understanding of the hyper-parameters is correct, is 1/30 a correct specification value?
I'll assume you expect theta and phi (the document-topic proportions and topic-word proportions) to be closer to equiprobable distributions than to sparse ones with exclusive topics/words.
Since alpha and beta are the parameters of symmetric Dirichlet priors, they have a direct influence on what you want. A Dirichlet distribution outputs probability distributions. When the parameter is 1, all possible distributions are equally likely to be drawn (for K = 2, [0.5, 0.5] and [0.99, 0.01] have the same chance). When the parameter is > 1, it behaves as a pseudo-count, i.e. a prior belief: for a high value, near-equiprobable outputs are preferred (P([0.5, 0.5]) > P([0.99, 0.01])). A parameter < 1 has the opposite behaviour. For big vocabularies you don't expect topics that put probability on all words, which is why beta tends to be set below 1 (and the same for alpha).
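A quick numerical sketch of this behaviour (in Python with numpy; the dimension and parameter values are chosen only for illustration):
import numpy as np

rng = np.random.default_rng(0)
K = 5  # dimension of the distributions being drawn

for concentration in (0.1, 1.0, 10.0):
    # draw probability vectors from a symmetric Dirichlet with this parameter
    samples = rng.dirichlet(np.full(K, concentration), size=5)
    # with the parameter < 1 the draws are sparse (mass on a few entries);
    # with the parameter > 1 they are close to uniform
    print(concentration)
    print(np.round(samples, 3))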
However, since you're using Gensim, you can let the model learn the alpha and beta values for you, allowing it to learn asymmetric vectors (see here), where the documentation states:
alpha can be set to an explicit array = prior of your choice. It also
support special values of ‘asymmetric’ and ‘auto’: the former uses a
fixed normalized asymmetric 1.0/topicno prior, the latter learns an
asymmetric prior directly from your data.
The same for eta (which I call beta).
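For example, starting from your original call, a sketch of letting gensim learn both priors (keeping the other arguments as above; the 'auto' option requires the single-core LdaModel and a reasonably recent gensim version):
ldamodel = gensim.models.ldamodel.LdaModel(
    corpus,
    num_topics=30,
    id2word=dictionary,
    passes=50,
    minimum_probability=0,
    alpha='auto',  # learn an asymmetric document-topic prior from the data
    eta='auto',    # learn the topic-word prior from the data
)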

Make a prediction using Octave plsregress

I have a good (or at least a self-consistent) calibration set and have applied PCA and, more recently, PLS regression to n.i.r. spectra of known mixtures of water and additive to predict the percentage of additive by volume. So far I have done self-calibration and now want to predict the concentration from an n.i.r. spectrum blindly. Octave returns XLOADINGS, YLOADINGS, XSCORES, YSCORES, COEFFICIENTS, and FITTED with the plsregress command. FITTED is the estimate of concentration. Octave uses the SIMPLS approach.
How do I use these returned variables to predict the concentration given a new sample's spectrum?
Scores are usually denoted by T and loadings by P, with X = TP' + E, where E is the residual. I am stuck.
Note that T and P are X scores and loadings, respectively. Unlike PCA, PLS has scores and loadings for Y as well (usually denoted U and Q).
While the documentation of plsregress is sketchy at best, the paper it refers to (Sijmen de Jong: "SIMPLS: an alternative approach to partial least squares regression", Chemom Intell Lab Syst, 1993, 18, 251-263, DOI: 10.1016/0169-7439(93)85002-X)
discusses prediction with equations (36) and (37), which give:
Yhat0 = X0 B
Note that this uses centered data X0 to predict centered y-values. B are the COEFFICIENTS.
I recommend that as a first step you predict your training spectra and make sure you get the correct results (FITTED).
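A minimal numerical sketch of that prediction step (written here in Python/numpy rather than Octave; the toy matrices are stand-ins, since in practice Xtrain and ytrain are your calibration spectra and concentrations, and B is the COEFFICIENTS matrix returned by plsregress):
import numpy as np

# toy stand-ins (made up for illustration)
rng = np.random.default_rng(0)
Xtrain = rng.normal(size=(20, 50))    # 20 calibration spectra, 50 wavelengths
ytrain = rng.uniform(0, 10, size=20)  # known concentrations
B = rng.normal(size=50)               # stand-in for COEFFICIENTS
Xnew = rng.normal(size=(3, 50))       # new spectra to predict

x_mean = Xtrain.mean(axis=0)
y_mean = ytrain.mean()

# centered prediction Yhat0 = X0 B (eqs. 36-37), then add the training y mean back
yhat = (Xnew - x_mean) @ B + y_mean

# sanity check: the same steps applied to the training spectra should reproduce FITTED
fitted_check = (Xtrain - x_mean) @ B + y_mean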

Best technique to approximate a 32-bit function using machine learning?

I was wondering which is the best machine learning technique to approximate a function that takes a 32-bit number and returns another 32-bit number, from a set of observations.
Thanks!
Multilayer perceptron neural networks would be worth taking a look at, though you'll need to scale the inputs to floating-point numbers between 0 and 1 and then map the outputs back to the original range.
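A minimal sketch of that idea (in Python with scikit-learn; the stand-in target function, network size, and scaling are assumptions for illustration only):
import numpy as np
from sklearn.neural_network import MLPRegressor

SCALE = 2**32 - 1  # map 32-bit integers into [0, 1]

# toy observations of an unknown 32-bit -> 32-bit function (made up)
rng = np.random.default_rng(0)
x = rng.integers(0, 2**32, size=5000, dtype=np.uint64)
y = (x // 3) % (2**32)  # stand-in for the unknown mapping

X_scaled = (x / SCALE).reshape(-1, 1)
y_scaled = y / SCALE

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
model.fit(X_scaled, y_scaled)

# map the predictions back to the original 32-bit range
y_pred = np.clip(model.predict(X_scaled) * SCALE, 0, 2**32 - 1).round().astype(np.uint64)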
There are several possible solutions to your problem:
1.) Fitting a linear hypothesis with least-squares method
In this case, you approximate the hypothesis y = ax + b with the least-squares method. It is really easy to implement, but sometimes a linear model is not good enough to fit your data. Still, I would give this one a try first.
The good thing is that there is a closed-form solution, so you can calculate the parameters a and b directly from your data, as in the sketch below.
See Least Squares
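A short sketch of the closed-form estimates for the linear case (in Python with numpy; the toy data are made up for illustration):
import numpy as np

# toy observations (made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=200)
y = 3.0 * x + 7.0 + rng.normal(scale=5.0, size=200)

# closed-form least-squares estimates for y = a*x + b
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()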
2.) Fitting a non-linear model
Once you see that a linear model does not describe your function very well, you can try fitting higher-order polynomial models to your data.
Your hypothesis then might look like
y = ax² + bx + c
y = ax³ + bx² + cx + d
etc.
You can again use the least-squares method to fit your data, or iterative optimization techniques (gradient descent, simulated annealing, ...). See also this thread: Fitting polynomials to data
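A short sketch of the polynomial case (in Python with numpy; the degree and toy data are assumptions for illustration):
import numpy as np

# toy observations (made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=200)
y = 0.5 * x**3 - 2.0 * x + 1.0 + rng.normal(scale=1.0, size=200)

# least-squares fit of a cubic polynomial y = a*x^3 + b*x^2 + c*x + d
coeffs = np.polyfit(x, y, deg=3)
y_hat = np.polyval(coeffs, x)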
Or, as in the other answer, try fitting a neural network. The good thing is that it will automatically learn the hypothesis, but it is not so easy to explain what the relation between input and output is. In the end, though, a neural network is essentially a combination of linear maps and nonlinear functions (like sigmoid or tanh).