Why is the R2 of one prediction lower than another's while its RMSE is also lower? - regression

I have a regression problem with two objective variables, or outputs (named E & r). I built a separate model for each objective, using Gaussian process regression.
I obtained predictions for both objectives as below (error bars show the standard deviation):
The plot titles show the R2 & RMSE of the predictions. There is a categorical variable in the dataset (Mixer) with two values (50L, 2400L), shown in different colors on the plot.
Next, I calculated R2 & RMSE separately for each Mixer (shown in the legend):
As you can see, for objective E the RMSE of Mixer 2400L (blue) is lower than the RMSE of Mixer 50L (orange), but its R2 is very low, which surprises me: I expected that when RMSE is lower, R2 should be higher.
For objective r, the RMSEs of both Mixers are almost the same, but the R2 of Mixer 2400L is much lower.
My only guess about this phenomenon is that the low R2 is caused by the smaller number of samples for Mixer 2400L.
No. of Observations:
Total : 119
Mixer 50L : 106
Mixer 2400L : 13
If you have any idea, please let me know.

This is because RMSE measures how far the predictions fall from the response values in absolute terms, while R2 measures the error relative to the spread of the response in the group being evaluated: R2 = 1 - SSE/SST, where SST is the total sum of squares of the response around its own mean within that group.
It is all about scale. If the responses for Mixer 2400L cluster tightly around their mean, SST is small, and even small absolute errors yield a low (possibly negative) R2, whereas RMSE is just the raw error distance and stays small.
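A small sketch (with made-up numbers, not the data from the question) shows how one subgroup can have the lower RMSE and still the lower, even negative, R2 when its responses have little spread:

```python
import numpy as np

def r2(y, yhat):
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

# Subgroup A: wide spread in y, moderate errors -> high R2 despite larger RMSE
y_a = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
yhat_a = y_a + np.array([0.5, -0.5, 0.4, -0.4, 0.3])

# Subgroup B: y nearly constant, small errors -> lower RMSE but negative R2
y_b = np.array([5.0, 5.1, 4.9, 5.05, 4.95])
yhat_b = np.array([5.2, 4.9, 5.1, 4.85, 5.15])

print(rmse(y_a, yhat_a), r2(y_a, yhat_a))  # RMSE ~0.43, R2 ~0.98
print(rmse(y_b, yhat_b), r2(y_b, yhat_b))  # RMSE 0.2, R2 is negative
```

Subgroup B beats A on RMSE yet loses badly on R2, exactly the pattern in the question: with only 13 tightly clustered samples for Mixer 2400L, SST is small and R2 collapses.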

Related

Is it possible that the number of basic functions is more than the number of observations in spline regression?

I want to fit a regression spline with a B-spline basis. The data are structured in such a way that the number of observations is less than the number of basis functions, and I get a good result.
But I'm not sure whether this is a valid setup.
Do I have to have more rows than columns, as in linear regression?
Thank you.
When the number of observations, N, is small, it's easy to fit a model with basis functions and obtain a low squared error. If you have more basis functions than observations, you can even achieve zero residuals (a perfect fit to the data). But that fit is not to be trusted, because it may not be representative of further data points. So yes, you want to have more observations than columns. Mathematically, you cannot properly estimate more than N coefficients because of collinearity. As a rule of thumb, 15-20 observations are usually needed for each additional variable / spline.
But, this isn't always the case, such as in genetics when we have hundreds of thousands of potential variables and small sample size. In that case, we turn to tools that help with a small sample size, such as cross validation and bootstrap.
Bootstrap (i.e., resample with replacement) your data points and refit the splines many times (100 will probably do). Then average the splines and use the result as the final spline function. Alternatively, use cross-validation, where you train on a smaller dataset (say 70%) and then test on the remaining data.
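The bootstrap-averaging idea can be sketched as follows (a toy example with simulated data; a cubic polynomial stands in for the spline fit, since the resample-refit-average mechanics are the same):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 40 noisy observations of a smooth curve
x = np.linspace(0.0, 2.0 * np.pi, 40)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

grid = np.linspace(0.0, 2.0 * np.pi, 100)
n_boot = 100
fits = np.empty((n_boot, grid.size))

for b in range(n_boot):
    # Resample (x, y) PAIRS with replacement
    idx = rng.integers(0, x.size, size=x.size)
    # Refit on the bootstrap sample (cubic polynomial as spline stand-in)
    coefs = np.polyfit(x[idx], y[idx], deg=3)
    fits[b] = np.polyval(coefs, grid)

# Pointwise average over the bootstrap fits is the final smoother
avg_fit = fits.mean(axis=0)
```

The spread of `fits` across rows also gives a cheap pointwise variability band for the fitted curve, which is useful when N is small.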
In the functional data analysis framework, there are packages in R that construct and fit spline bases (such as cubic, B, etc). These packages include refund, fda, and fda.usc.
For example,
B <- smooth.construct.cc.smooth.spec(object = list(term = "day.t", bs.dim = 12, fixed = FALSE, dim = 1, p.order = NA, by = NA),data = list(day.t = 200:320), knots = list())
constructs a cyclic cubic regression spline basis of dimension 12 (over time, day.t); note that smooth.construct comes from the mgcv package. You can also use these packages to help choose a basis dimension.

Regression coefficients for centered predictor variables: One unit increase *as well as* decrease?

I have a question on how to interpret coefficients in a regression analysis:
I'm doing a (logistic) regression analysis with a mean-centered (continuous) predictor (where the value 0 has no intrinsic meaning). With non-centered predictors, I would interpret the coefficient as the change in the outcome variable for a one-unit increase in the predictor variable; adding one unit is the only direction that makes sense, since the baseline for the variable is set at 0.
However, when I have centered predictors, can the coefficient be interpreted as the change in the outcome for a one-unit increase as well as a one-unit decrease, that is, in both directions away from the mean? Obviously, half of my data consists of observations with lower-than-average values on the predictor variable, and I'm interested in having meaningful coefficients for these as well. (I can't find any answer to this, neither on the YouTube channels I use for statistics learning, nor in my (regrettably only 5) statistics books.)
(See the attached screenshots for an example: OLS regressions on the mtcars dataset in R, with mpg (miles per gallon) as outcome and wt (weight in 1000 lbs) as predictor, mean-centered in the bottom screenshot.)
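A quick numerical check (a sketch with simulated data in the spirit of the mtcars example, not taken from the post) confirms that centering leaves the slope unchanged, so the coefficient applies symmetrically to unit increases and decreases around the mean:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (hypothetical wt/mpg-like linear relationship)
x = rng.uniform(1.5, 5.5, size=50)                        # predictor, e.g. weight
y = 37.0 - 5.3 * x + rng.normal(scale=2.0, size=x.size)   # outcome, e.g. mpg

# OLS on the raw predictor and on the mean-centered predictor
slope_raw, intercept_raw = np.polyfit(x, y, deg=1)
xc = x - x.mean()
slope_c, intercept_c = np.polyfit(xc, y, deg=1)

# The slope is identical: +1 unit adds the coefficient, -1 unit subtracts it,
# in both directions away from the mean. Only the intercept changes: after
# centering it becomes the predicted outcome at the mean of x.
print(np.isclose(slope_raw, slope_c))     # True
print(np.isclose(intercept_c, y.mean()))  # True for simple OLS
```

So centering changes only the point the intercept refers to; the coefficient itself describes the same symmetric one-unit change either way.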

Obtaining multiple output in regression using deep learning

Given an RGB image of a hand and the 3D positions of the hand's keypoints as a dataset, I want to treat this as a regression problem in deep learning. The input will be the RGB image, and the output should be the estimated 3D positions of the keypoints.
I have seen some material on regression, but most of it estimates a single value. Is it possible to estimate multiple values (outputs) all at once?
For now I have referred to this code, in which the author estimates the age of the person in an image.
The output vector of a neural net can represent anything, as long as you define the loss function well. Say you want to detect the (x, y, z) coordinates of 10 keypoints: just use a 30-element output vector, say (x1, y1, z1, x2, y2, z2, ..., x10, y10, z10), where (xi, yi, zi) are the coordinates of the i-th keypoint; you can use any ordering you find convenient. Just be careful with your loss function. Say you want an RMSE loss: you would have to extract the triples correctly and then calculate the RMSE for each keypoint, or, if you are familiar with linear algebra, just reshape the output into a 3x10 matrix, have your targets in the same 3x10 shape, and use
loss = tf.sqrt(tf.reduce_mean(tf.squared_difference(Y1, Y2)))
But once you have formulated your net this way, you will have to stick to that layout.
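The reshape-then-measure idea can be sketched in plain NumPy (the batch size and 10-keypoint layout are illustrative; the triples are stacked here as a (batch, 10, 3) array, which is equivalent to the 3x10 arrangement above):

```python
import numpy as np

n_keypoints = 10
batch = 4

# Hypothetical true and predicted outputs: flat 30-element vectors per
# sample, laid out as (x1, y1, z1, ..., x10, y10, z10)
y_true = np.random.default_rng(2).normal(size=(batch, 3 * n_keypoints))
y_pred = y_true + 0.1  # pretend every coordinate is off by 0.1

# Reshape so the last axis holds one keypoint's (x, y, z) triple
t = y_true.reshape(batch, n_keypoints, 3)
p = y_pred.reshape(batch, n_keypoints, 3)

# Per-keypoint Euclidean error, plus a single scalar RMSE loss
per_keypoint = np.sqrt(np.sum((t - p) ** 2, axis=-1))  # shape (batch, 10)
loss = np.sqrt(np.mean((t - p) ** 2))                  # scalar RMSE
print(per_keypoint.shape)
```

The scalar `loss` is what you would minimize (it matches the tf.sqrt/tf.reduce_mean line above), while `per_keypoint` is handy for diagnosing which keypoints the net gets wrong.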

Make a prediction using Octave plsregress

I have a good (or at least self-consistent) calibration set and have applied PCA, and recently PLS regression, to the NIR spectra of known mixtures of water and additive, to predict the percentage of additive by volume. So far I have done self-calibration, and now I want to predict the concentration blindly from a new NIR spectrum. Octave's plsregress command returns XLOADINGS, YLOADINGS, XSCORES, YSCORES, COEFFICIENTS, and FITTED. FITTED is the estimate of the concentration. Octave uses the SIMPLS approach.
How do I use these returned variables to predict the concentration given a new sample's spectrum?
Scores are usually denoted by T and loadings by P, with X = TP' + E, where E is the residual. I am stuck.
Note that T and P are X scores and loadings, respectively. Unlike PCA, PLS has scores and loadings for Y as well (usually denoted U and Q).
While the documentation of plsregress is sketchy at best, the paper it refers to (Sijmen de Jong: "SIMPLS: an alternative approach to partial least squares regression", Chemom Intell Lab Syst, 1993, 18, 251-263, DOI: 10.1016/0169-7439(93)85002-X) discusses prediction; equations (36) and (37) give:
Yhat0 = X0 B
Note that this uses centered data X0 to predict centered y-values: center the new spectrum with the means of the training data, and add the training mean of y back onto the prediction. B is the COEFFICIENTS matrix.
I recommend that as a first step you predict your training spectra and make sure you get the correct results (FITTED).
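The mechanics look like this in a sketch (NumPy rather than Octave, with made-up data; B here is just a least-squares stand-in for the COEFFICIENTS that plsregress would return, since only the centering arithmetic is being illustrated):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical training set: 20 spectra x 50 wavelengths, one response
X = rng.normal(size=(20, 50))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=20)

# Center with the TRAINING means (plsregress works on centered data)
x_mean = X.mean(axis=0)
y_mean = y.mean()
X0 = X - x_mean
y0 = y - y_mean

# Stand-in for COEFFICIENTS: any regression coefficients on centered data
B, *_ = np.linalg.lstsq(X0, y0, rcond=None)

# Step 1 (sanity check): predicting the training spectra reproduces FITTED
fitted = X0 @ B + y_mean

# Step 2: for a NEW spectrum, center with the training means, apply B,
# then add the training mean of y back
x_new = rng.normal(size=50)
y_hat = float((x_new - x_mean) @ B + y_mean)
```

The key point is that `x_mean` and `y_mean` come from the calibration set, never from the new sample; with those saved alongside COEFFICIENTS, prediction is a single matrix product.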

Extract 3D coordinates from R PCA

I am trying to find a way to make a 3D PCA visualization from R more portable.
I have run a PCA on a 2D matrix using prcomp().
How do I export the 3D coordinates of the data points, along with the labels and colors (RGB) associated with each?
What's the practical difference between princomp() and prcomp()?
Any ideas on how best to view the 3D PCA plot using HTML5 and canvas?
Thanks!
Here is an example to work from:
pc <- prcomp(~ . - Species, data = iris, scale = TRUE)
The axis scores are stored in component x; you can write them out (you don't say how you want them exported), for example as CSV, using:
write.csv(pc$x[, 1:3], "my_pc_scores.csv")
If you want to attach information to these scores (the colours and labels are not part of the PCA output but something you assign yourself), add it to the matrix of scores and then export. In the example above there are three species with 50 observations each. If we want that information exported alongside the scores, something like this will work:
scrs <- data.frame(pc$x[, 1:3], Species = iris$Species,
Colour = rep(c("red","green","black"), each = 50))
write.csv(scrs, "my_pc_scores2.csv")
scrs looks like this:
> head(scrs)
PC1 PC2 PC3 Species Colour
1 -2.257141 -0.4784238 0.12727962 setosa red
2 -2.074013 0.6718827 0.23382552 setosa red
3 -2.356335 0.3407664 -0.04405390 setosa red
4 -2.291707 0.5953999 -0.09098530 setosa red
5 -2.381863 -0.6446757 -0.01568565 setosa red
6 -2.068701 -1.4842053 -0.02687825 setosa red
Update: I missed the point about RGB. See ?rgb for ways of specifying colours in R, but if all you want are the RGB strings, change the above to use something like
Colour = rep(c("#FF0000","#00FF00","#000000"), each = 50)
instead, where you specify the RGB strings you want.
The essential difference between princomp() and prcomp() is the algorithm used to compute the PCA. princomp() uses an eigendecomposition of the covariance or correlation matrix, whilst prcomp() uses the singular value decomposition (SVD) of the (centred) raw data matrix. princomp() only handles data sets with at least as many samples (rows) as variables (columns); prcomp() handles those and also data sets with more columns than rows. In addition, and perhaps of greater importance depending on your intended uses, the SVD is preferred over the eigendecomposition for its better numerical accuracy.
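The equivalence of the two routes can be sketched numerically (in NumPy rather than R, with random data; both operate on centred data and both divide by n - 1 here so the variances line up):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
Xc = X - X.mean(axis=0)   # both methods start from centred data
n = X.shape[0]

# prcomp-style: SVD of the centred data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var_svd = s ** 2 / (n - 1)        # variance explained per PC

# princomp-style: eigendecomposition of the covariance matrix
C = Xc.T @ Xc / (n - 1)
evals, evecs = np.linalg.eigh(C)
evals = evals[::-1]               # eigh returns ascending order; flip it

# Same per-component variances; the loadings (rows of Vt vs columns of
# evecs) also agree up to sign flips
print(np.allclose(var_svd, evals))  # True
```

The numerical-accuracy point is that the SVD route never forms Xc'Xc explicitly, which squares the condition number of the problem; that is why prcomp is preferred for ill-conditioned data.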
I have tagged the Q with html5 and canvas in the hope specialists in those can help. If you don't get any responses, delete point 3 from your Q and start a new one specifically on the topic of displaying the PCs using canvas, referencing this one for detail.
You can find out about any R object by doing str(object_name). In this case:
m <- matrix(rnorm(50), nrow = 10)
res <- prcomp(m)
str(m)
If you look at the help page for prcomp by doing ?prcomp, you can discover that the scores are in res$x and the loadings are in res$rotation. These are already labeled by PC. There are no colors, unless you assign some in the course of a plot. See the respective help pages for a comparison of princomp and prcomp; basically, the difference between them is the method used behind the scenes. I can't help you with your last question.
You state that you perform PCA on a 2D matrix. If this is your data matrix, there is no way to get 3D PCA scores. Of course, it might be that your 2D matrix is a covariance matrix of the data; in that case you need to use princomp (not prcomp!) and explicitly pass the covariance matrix m like this:
princomp(covmat = m)
Passing the covariance matrix like:
princomp(m)
does not yield the correct result.