Is it possible for the number of basis functions to be greater than the number of observations in spline regression?

I want to fit a regression spline with a B-spline basis. The data are structured in such a way that the number of observations is less than the number of basis functions, and I get a good result.
But I'm not sure whether this is a valid setup.
Do I need more rows than columns, as in linear regression?
Thank you.

When the number of observations, N, is small, it's easy to fit a model built from basis functions with low squared error. If you have more basis functions than observations, you can even get zero residuals (a perfect fit to the data). But that fit is not to be trusted, because it may not be representative of new data points. So yes, you want to have more observations than columns. Mathematically, you cannot properly estimate more than N coefficients, because the design matrix becomes rank-deficient (perfectly collinear). As a rule of thumb, 15-20 observations are usually needed for each additional variable / spline term.
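To see the perfect-fit problem concretely, here is a minimal sketch in R (simulated data; the sizes are arbitrary, and bs() from the splines package stands in for whatever basis you use):
library(splines)
set.seed(1)
n <- 10
x <- seq(0, 1, length.out = n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.1)
B <- bs(x, df = 15)        # 15 B-spline basis columns, but only 10 observations
fit <- lm(y ~ B)           # rank-deficient: lm drops columns, some coefficients are NA
sum(residuals(fit)^2)      # essentially zero: a "perfect" but untrustworthy fit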
But this isn't always possible, such as in genetics, where we have hundreds of thousands of potential variables and a small sample size. In that case, we turn to tools that help with small samples, such as cross-validation and the bootstrap.
Bootstrap (i.e. resample with replacement) your data points and refit the splines many times (100 replicates will probably do), then average the fitted splines and use that average as the final spline function. Or you could use cross-validation, where you train on a smaller subset (say 70%) and test on the remaining data.
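A minimal sketch of the bootstrap-and-average idea, assuming simulated data and using smooth.spline() as a stand-in for whatever spline fit you use (all names here are made up):
set.seed(123)
n <- 40
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)
grid <- seq(0, 1, length.out = 200)
B <- 100                                   # number of bootstrap resamples
curves <- replicate(B, {
  idx <- sample(n, replace = TRUE)         # resample rows with replacement
  fit <- smooth.spline(x[idx], y[idx])     # refit the spline on the resample
  predict(fit, grid)$y
})
final_curve <- rowMeans(curves)            # average of the bootstrapped splines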
In the functional data analysis framework, there are R packages that construct and fit spline bases (cubic, B-spline, etc.), such as refund, fda, and fda.usc; the constructor used below comes from mgcv, which refund builds on.
For example,
B <- smooth.construct.cc.smooth.spec(object = list(term = "day.t", bs.dim = 12, fixed = FALSE, dim = 1, p.order = NA, by = NA), data = list(day.t = 200:320), knots = list())
constructs a cyclic cubic spline basis of dimension 12 over time (day.t); analogous constructors exist for other bases, including B-splines, and these packages can also help you choose the basis dimension.
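If you prefer to fit the model directly rather than build the basis by hand, a hedged sketch with mgcv (the response y and the choice k = 12 are made up; bs = "ps" requests a penalized B-spline basis):
library(mgcv)
day.t <- 200:320
y <- sin(2 * pi * day.t / 120) + rnorm(length(day.t), sd = 0.2)
fit <- gam(y ~ s(day.t, bs = "ps", k = 12))   # P-spline: penalized B-spline basis of dimension 12
gam.check(fit)                                # diagnostics, including a rough check that k is large enough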

Related

perMANOVA for small sample size

I have data from 6 groups with sample sizes of n = 2, 10, 2, 9, 3, 1, and I want to perform a permutational multivariate analysis of variance (PERMANOVA) on these data.
My question is: is it correct to run PERMANOVA on data with such small sample sizes? The results look strange to me, because the group with n = 1 showed no significant difference from the other groups, although the graphical representation of the groups clearly shows a difference.
Thank you
I would not trust any result involving the group with n = 1, because there is no within-group variation with which to define differences among groups.
I have also received some answers on other platforms. I put them here for information:
The sample size is simply too small to yield a stable solution via MANOVA. Note that the n = 1 cell contributes a constant value for that cell's mean, no matter what you do by way of permutations.
Finally, note that for one-way designs with unequal cell sizes, the effective per-cell sample size tracks the harmonic mean of the n's. For your data set as it stands, that means an "effective" per-cell n of about 2.4. Unless differences on the DV set are gigantic, no procedure (parametric or exact/permutation) will have the statistical power to detect differences with samples that size.
MANOVA emphasizes the scattering of attributes in the study groups, and the logic of this analysis is based on the scattering of scores. It is not recommended to perform parametric tests such as MANOVA on small groups (fewer than about 20 people each). In my opinion, use non-parametric tests to examine small groups.
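For reference, the harmonic-mean figure quoted above is easy to check in R (group sizes taken from the question):
n <- c(2, 10, 2, 9, 3, 1)
length(n) / sum(1 / n)    # about 2.36, the "effective" per-cell n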

How to determine the number of factors in parallel analysis

I conducted an Exploratory Factor Analysis (Principal Axis Factoring) on my data and wanted to determine the number of factors to extract via Horn's parallel analysis.
However, I have two problems:
The parallel analysis suggests extracting 1 factor, yet the plot shows more than one intersection of my "FA Actual Data" and "FA Simulated Data" lines. I do not understand why it is just one factor (the first intersection) then. This plot does not look like typical parallel analysis plots.
Why does the number of factors to extract change with the number of observations (n.obs) I state? I changed the number of observations from 50 to 500 (which is not my true sample size), and the parallel analysis then suggested 5 factors to extract instead of 9. I do not understand why.
Thank you so much for any helpful tips.
Valerie
library(psych)   # fa.parallel() comes from the psych package
fa.parallel(cor(My_Data), n.obs = 50, fa = "fa", fm = "pa")
Parallel analysis suggests that the number of factors = 1 and the number of components = NA
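For what it's worth, the reference ("simulated data") eigenvalues in parallel analysis depend on the sample size you declare, which is why changing n.obs changes the suggestion. A sketch of the comparison, with made-up data standing in for My_Data:
library(psych)
set.seed(1)
sim <- matrix(rnorm(50 * 10), nrow = 50)            # stand-in for My_Data: 50 observations, 10 variables
R <- cor(sim)
fa.parallel(R, n.obs = 50,  fa = "fa", fm = "pa")   # reference eigenvalues for n = 50
fa.parallel(R, n.obs = 500, fa = "fa", fm = "pa")   # declaring n = 500 changes the reference eigenvalues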

What is the difference between RSE and MSE?

I am going through An Introduction to Statistical Learning with Applications in R by James, Witten, Hastie, and Tibshirani. I came across two concepts, RSE and MSE. My understanding is this:
RSE = sqrt(RSS / (N - 2))
MSE = RSS / N
Now I am building 3 models for a problem and need to compare them. While MSE comes to me intuitively, I was also wondering whether calculating RSS/(N - 2), which by the above is RSE^2, would be of any use.
I am not sure which one to use where.
RSE is an estimate of the standard deviation of the residuals, and hence of the error term in the observations. That is why it equals sqrt(RSS/df); in your case, a simple linear model, df = N - 2, because two parameters (intercept and slope) are estimated.
MSE is the mean squared error of your model, and it's usually calculated on a test set to compare the predictive accuracy of your fitted models. Since we're concerned with the mean error of the model, we divide by N.
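A small sketch of the two quantities in practice, using simulated data and an arbitrary train/test split (all names are made up):
set.seed(42)
x <- runif(100); y <- 2 + 3 * x + rnorm(100)
train <- 1:70; test <- 71:100
fit <- lm(y ~ x, data = data.frame(x, y), subset = train)
rss <- sum(residuals(fit)^2)
rse <- sqrt(rss / (length(train) - 2))           # residual standard error on the training data
all.equal(rse, sigma(fit))                       # lm reports the same quantity
pred <- predict(fit, newdata = data.frame(x = x[test]))
mean((y[test] - pred)^2)                         # test-set MSE, used to compare models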
I think RSE and MSE are closely related: both divide the RSS by a degrees-of-freedom term.
In general, MSE = RSS / degrees of freedom.
For a single set of data (X1, X2, ..., Xn), that would be RSS over N,
or more accurately RSS/(N - 1)
(since one degree of freedom is used up in estimating the mean).
But in a linear regression of Y on X with an intercept, two parameters are estimated, so the degrees of freedom are N - 2,
thus MSE = RSS/(N - 2),
and its square root is what is called the RSE.
In a model with more parameters (a collection of many ßs, more than 2 terms: y ~ ß0 + ß1*X1 + ß2*X2 + ...), the denominator is reduced further by the number of fitted parameters:
MSE = RSS/(N - p), where p is the number of fitted parameters.
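A quick check of the degrees-of-freedom version with simulated data: lm's sigma() reports exactly sqrt(RSS/(N - p)).
set.seed(1)
n <- 100
x <- runif(n)
y <- 1 + 2 * x - 0.5 * x^2 + rnorm(n)
fit <- lm(y ~ x + I(x^2))                 # p = 3 fitted parameters
rss <- sum(residuals(fit)^2)
all.equal(sigma(fit)^2, rss / (n - 3))    # TRUE: lm divides RSS by N - p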

Temperature Scale in SA

First, this is not a question about temperature iteration counts or automatically optimized scheduling. It's about how the magnitude of the data relates to the scaling inside the exponential.
I'm using the classic formula:
if(delta < 0 || exp(-delta/tK) > random()) { // new state }
The input to the exp function is negative because delta/tK is positive, so the exp result is always less than 1. The random function also returns a value in the 0 to 1 range.
My test data is in the range 1 to 20, and the delta values are below 20. I pick a start temperature equal to the initial computed temperature of the system and linearly ramp down to 1.
In order to get SA to work, I have to scale tK. The working version uses:
exp(-delta/(tK * .001)) > random()
So how does the magnitude of tK relate to the magnitude of delta? I found the scaling factor by trial and error, and I don't understand why it's needed. To my understanding, as long as delta > tK and the step size and number of iterations are reasonable, it should work. In my test case, if I leave out the extra scale the temperature of the system does not decrease.
The various online sources I've looked at say nothing about working with real data. Sometimes they include the Boltzmann constant as a scale, but since I'm not simulating a physical particle system that doesn't help. Examples (typically with pseudocode) use values like 100 or 1000000.
So what am I missing? Is scaling another value that I must set by trial and error? It's bugging me because I don't just want to get this test case running, I want to understand the algorithm, and magic constants mean I don't know what's going on.
Classical SA has two parameters: startingTemperature and cooldownSchedule (what you call scaling).
Configuring 2+ parameters is annoying, so in OptaPlanner's implementation, I automatically calculate the cooldownSchedule based on the timeGradient (a double that goes from 0.0 to 1.0 over the solver's running time). This works well. As a guideline for the startingTemperature, I use the maximum score difference of a single move. For more information, see the docs.
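To see why the magnitude of tK relative to delta matters, here is a small numeric sketch of the acceptance probability (R; the values are chosen only for illustration): when the temperature is orders of magnitude larger than a typical delta, almost every uphill move is accepted and the search never settles, which is what the extra 0.001 factor compensates for.
accept_prob <- function(delta, tK) exp(-delta / tK)   # probability of accepting an uphill move
accept_prob(20, 1000)   # ~0.98: tK far above delta, nearly everything is accepted
accept_prob(20, 20)     # ~0.37
accept_prob(20, 1)      # ~2e-9: tK far below delta, almost nothing is accepted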

Determining edge weights given a list of walks in a graph

These questions concern a data set of lists of tasks performed in succession, together with the total time required to complete each list. I've been wondering whether it's possible to determine useful things about the individual tasks' lengths, either from the data as they are or with some initial guesstimation based on appropriate domain knowledge. I've come to think graph theory is the way to approach this problem in the abstract, and I have a decent basic grasp of it, but I can't tell for certain whether I'm on the right track. Besides, I think it's a pretty interesting question to crack. So here we go:
Is it possible to determine the weights of edges in a directed weighted graph, given a list of walks in that graph with the lengths (summed weights) of said walks? I recognize the amount and quality of permutations on the routes taken by the walks will dictate the quality of any possible answer, but let's assume all possible walks and their lengths are given. If a definite answer isn't possible, what kind of things can be concluded about the graph? How would you arrive at those conclusions?
What if there were several similar walks with possibly differing lengths given? Can you calculate a decent average (or other illustrative measure) for each edge, given enough permutations on different routes to take? How will discounting some permutations from the available data set affect the calculation's accuracy?
Finally, what if you had a set of initial guesses as to the weights and had to refine those using the walks given? Would that improve upon your guesstimation ability, and how could you apply the extra information?
EDIT: Clarification on the difficulties of a plain linear algebraic approach. Consider the following set of walks:
a = 5
b = 4
b + c = 5
a + b + c = 8
A matrix equation with these values is unsolvable, but we'd still like to estimate the terms. There might be some helpful initial data available, as in scenario 3, and in any case we can apply knowledge of the real world, such as the fact that the length of a task can't be negative. I'd like to know whether you have ideas on how to ensure we get reasonable estimates and also know what we don't know, e.g. when there isn't enough data to tell a from b.
Seems like an application of linear algebra.
You have a set of linear equations which you need to solve. The variables being the lengths of the tasks (or edge weights).
For instance if the tasks lengths were t1, t2, t3 for 3 tasks.
And you are given
t1 + t2 = 2 (task 1 and 2 take 2 hours)
t1 + t2 + t3 = 7 (all 3 tasks take 7 hours)
t2 + t3 = 6 (tasks 2 and 3 take 6 hours)
Solving gives t1 = 1, t2 = 1, t3 = 5.
You can use any linear algebra technique (e.g. Gaussian elimination, http://en.wikipedia.org/wiki/Gaussian_elimination) to solve these, which will tell you whether there is a unique solution, no solution, or an infinite number of solutions (those are the only possibilities).
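For the three-task example above, a minimal sketch in base R (the matrix rows just encode which tasks appear in which walk):
A <- rbind(c(1, 1, 0),    # t1 + t2      = 2
           c(1, 1, 1),    # t1 + t2 + t3 = 7
           c(0, 1, 1))    # t2 + t3      = 6
len <- c(2, 7, 6)
solve(A, len)             # t1 = 1, t2 = 1, t3 = 5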
If you find that the linear equations do not have a solution, you can try adding a very small random number to some of the task weights/coefficients of the matrix and solving again (I believe this falls under perturbation theory). Matrices are notorious for radically changing behavior with small changes in their values, so this will likely give you an approximate answer reasonably quickly.
Or you could try introducing a 'slack' task in each walk (i.e. add more variables) and pick the solution to the new equations in which the slack tasks satisfy some linear constraints (e.g. 0 < s_i < 0.0001, minimizing the sum of the s_i), using linear programming techniques.
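For the inconsistent system in the question's EDIT, an ordinary least-squares fit is one concrete way to "estimate the terms"; a sketch in base R, where qr.solve() returns the least-squares solution for an overdetermined system:
A <- rbind(c(1, 0, 0),    # a         = 5
           c(0, 1, 0),    # b         = 4
           c(0, 1, 1),    # b + c     = 5
           c(1, 1, 1))    # a + b + c = 8
len <- c(5, 4, 5, 8)
qr.solve(A, len)          # least-squares estimates: a ~ 4.33, b = 4, c ~ 0.33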
Assume you have an unlimited supply of arbitrary symbols to represent the edges (a, b, c, d, etc.).
Let w be the list of all walks, each in the form 0,a,b,c,d,e,... together with its length (the leading 0 will be explained later).
i = 1
if #w[i] ~= 1 then
    replace the first edge symbol of w[i] with the LENGTH of w[i] minus all the other symbols in that walk, and substitute it into every other walk in w.
repeat forever.
Example:
0,a,b,c,d,e 50
0,a,c,b,e 20
0,c,e 10
So: a comes first. Replace every instance of "a" with 50 - b - c - d - e.
New data:
50, 50
50,-d, 20
0,c,e 10
Repeat until only one value is left in each walk, and you're done! Alternatively, the leading number can simply be subtracted from the length of each walk.
I'd forget about graphs and treat the lists of tasks as vectors: every task is represented as a component whose value equals its cost (time to complete, in this case).
If tasks appear in different orders initially, that's where to use domain knowledge to bring them to a canonical form, and to assign multipliers if domain knowledge tells you that the ratio of costs is substantially influenced by ordering / timing. Timing is implicit in the initial ordering, but you may have to make it a function of time just for the adjustment factors (say, driving at lunch time vs. driving at midnight); that function might be tabular/discrete. In general it's always much easier to evaluate ratios and relative biases (the hardness of doing something). You may want a functional language to do repeated rewrites of your vectors until there's nothing more that domain knowledge and rules can change.
With canonical vectors, consider just the presence or absence of each task (just 0|1 for this iteration) and look for minimal diffs - single-task diffs first - since those provide estimates involving a small number of variables. Keep doing this recursively, be ready to backtrack, and have a heuristic rule for the goodness or quality of the estimates so far. Keep track of good "rounds" that you backtracked from.
When you reach a minimal, irreducible state - you can't make any more diffs and all vectors have the same remaining tasks - you can do some basic statistics (variance, mean, median), look for big outliers, and look for ways to improve the initial domain-knowledge-based estimates that led to the canonical form. If you find many outliers and can infer new rules, take them in and start the whole process over from the beginning.
Yes, this can cost a lot :-)