I use RapidMiner and I have a data set which contains 40 lines, each with 14 columns.
The lines are different kinds of metrics of Android applications, and at the end of each line there is the Google Play ranking (the first line is the header, which contains the names of the metrics).
(So the goal is to predict the Google Play ranking from the metrics.)
The data set: http://pastebin.com/Cw1BR4K6
columns 1-13: different kinds of metrics
column 14: Google Play ranking
lines 2-40: metrics of Android projects
I used PolynomialRegression in RapidMiner and I got this result:
- 6.723 * lloc ^ 1.000
+ 1.187 * nid ^ 2.000
- 47.730 * nle ^ 1.000
- 36.433 * nel ^ 1.000
- 1.466 * nip ^ 2.000
- 97.187 * activites ^ 1.000
- 50.080 * inside-permissions ^ 1.000
- 60.291 * outside-permissions ^ 1.000
- 52.472 * all-permissions ^ 4.000
- 2.309 * jtlloc ^ 1.000
+ 36.058 * jtnm ^ 1.000
+ 9.924 * jtna ^ 1.000
+ 40.504 * jtncl ^ 1.000
+ 9.455
My question:
How can I check that this result is correct?
How can I apply this result to an already available line?
For example, I would like to apply this result to line 25: 25,8,5,10,0,1,0,0,0,239,10,14,4,3.8
My other question:
Which methods can I use to make predictions from this data set?
And which method is the best? If possible, I would like to ask you to explain it to me.
Thanks in advance, Peter
The result of the polynomial regression is a trained model. If you want to apply the model to a data set and see the results, use the Apply Model operator. It takes two inputs: the model and the data. The output of this operator is the data set with one more attribute: the regression result.
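To see what applying the model means for a single row, the formula from the question can also be evaluated by hand. The following is only a sketch: it assumes that the terms in the RapidMiner output are listed in the same order as columns 1-13 of the data set, which the question does not confirm, and the names `TERMS`, `INTERCEPT` and `predict` are made up for the illustration.

```python
# Coefficients and exponents copied from the PolynomialRegression output above.
# ASSUMPTION: the terms appear in the same order as data columns 1-13.
TERMS = [
    ("lloc", -6.723, 1),
    ("nid", 1.187, 2),
    ("nle", -47.730, 1),
    ("nel", -36.433, 1),
    ("nip", -1.466, 2),
    ("activites", -97.187, 1),
    ("inside-permissions", -50.080, 1),
    ("outside-permissions", -60.291, 1),
    ("all-permissions", -52.472, 4),
    ("jtlloc", -2.309, 1),
    ("jtnm", 36.058, 1),
    ("jtna", 9.924, 1),
    ("jtncl", 40.504, 1),
]
INTERCEPT = 9.455


def predict(metrics):
    """Evaluate the fitted polynomial for one row of 13 metric values."""
    if len(metrics) != len(TERMS):
        raise ValueError("expected 13 metric values")
    return INTERCEPT + sum(c * x ** p for (_, c, p), x in zip(TERMS, metrics))


# Line 25 from the question: 13 metrics followed by the recorded ranking.
row_25 = [25, 8, 5, 10, 0, 1, 0, 0, 0, 239, 10, 14, 4, 3.8]
prediction = predict(row_25[:13])
actual = row_25[13]
print(f"predicted {prediction:.3f}, actual {actual}")
```

Under that column-order assumption the formula evaluates to about -673.1 for this row, far from the recorded 3.8. Either the assumption is wrong or the fit is poor; comparing predictions with known values like this is exactly what Apply Model lets you do for the whole data set.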
But evaluating the performance of a model on the same data it was trained on is a very bad idea (overfitting). To evaluate the model's performance correctly, split the data into a training set (used for training the model) and a testing set (used to evaluate performance). Or use cross-validation, which is in fact the same thing, but done multiple times and averaged. (In RapidMiner: Edit -> New Building Block -> Numerical X-Validation)
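Outside RapidMiner, the idea behind cross-validation can be sketched in a few lines. The example below is an illustration only: it uses made-up data with a single hypothetical metric and a plain least-squares line instead of the polynomial model, and the helper names `fit_line` and `cross_validated_rmse` are invented here.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def cross_validated_rmse(xs, ys, k=5, seed=0):
    """Average root-mean-square error on the held-out fold over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    rmses = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        # Fit only on the training rows, then score only on the held-out rows.
        a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        sq = [(ys[i] - (a + b * xs[i])) ** 2 for i in test_idx]
        rmses.append((sum(sq) / len(sq)) ** 0.5)
    return sum(rmses) / k


# Hypothetical data: 39 rows, one made-up metric, noisy linear target.
rng = random.Random(42)
xs = [float(i) for i in range(39)]
ys = [3.0 + 0.02 * x + rng.gauss(0, 0.3) for x in xs]
cv_rmse = cross_validated_rmse(xs, ys, k=5)
print(f"5-fold RMSE: {cv_rmse:.3f}")
```

Every row is used for testing exactly once and never for fitting the model that scores it, which is what keeps the error estimate honest.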
Which regression method to choose is a difficult problem and depends on your specific needs. Is your only criterion the regression error? Do you need human readable output?
You will surely need to experiment with multiple methods. And I'm not sure you will get any conclusive results with a data set this small.
I constructed several glmer.nb models with different combinations of random intercepts, and for one of the models (nested random intercepts, with the lowest AICc), I consistently get: "iteration limit reached", without the usual "Warning message:
In theta.ml(Y, mu, weights = object@resp$weights, limit = limit, :..."
Here's what I know:
it is a warning (from the color) but not labeled as such
you can also have that warning with GLMs and LMERs
Here's what I don't know:
does it mean the model is invalid?
what causes that issue?
what could I do to resolve that issue?
Here's what I searched:
https://stats.stackexchange.com/questions/67287/very-large-theta-values-using-glm-nb-in-r-alternative-approaches (no explanation as to the why and how)
GLMM FAQ: no mention
I am not the only one regularly running into that or similar problems: Using glmer.nb(), the error message:(maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate is returned
https://stats.stackexchange.com/questions/40647/lme-error-iteration-limit-reached/40664
Here's what would be highly appreciated:
A more informative warning message: did the model converge? what caused this? What can one do to fix it? Can we read more about this (link to GLMM FAQ - brms-style)?
This is a general question. I did not provide reproducible code because an answer that is generalisable would be most useful.
library(lme4)
dd <- data.frame(f = factor(rep(1:20, each = 20)))
dd$y <- simulate(~ 1 + (1|f), family = "poisson",
                 newdata = dd,
                 newparam = list(beta = 1, theta = 1),
                 seed = 101)[[1]]
m1 <- glmer.nb(y ~ 1 + (1|f), data = dd)
Warning message:
In theta.ml(Y, mu, weights = object@resp$weights, limit = limit, :
iteration limit reached
It's a bit hard to tell, but this warning occurs in MASS::theta.ml(), which is called to get an initial estimate of the dispersion parameter. (If you set options(error = recover, warn = 2), warnings will be converted to errors and errors will dump you into a debugger, where you can see the sequence of calls that were active when the warning/error occurred).
This generally occurs when the data (specifically, the conditional distribution of the data) is actually equidispersed (variance == mean) or underdispersed (i.e. variance < mean), which can't be achieved by a negative binomial distribution. If you run getME(m1, "glmer.nb.theta") you'll generally get a very large value (in this case it's 62376), which indicates where the optimizer gave up while it was trying to send the dispersion parameter to infinity.
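Why the dispersion parameter runs off to infinity can be seen from the negative binomial variance, mu + mu^2/theta: it is always larger than the mean, and it only approaches the mean as theta grows without bound. The sketch below uses a simple method-of-moments estimate on made-up counts. It is not the maximum-likelihood iteration that MASS::theta.ml() performs, and it ignores random effects; the function name `moment_theta` is invented for the illustration.

```python
import math


def moment_theta(counts):
    """Method-of-moments theta from Var = mu + mu**2 / theta.

    Returns math.inf when the sample variance does not exceed the mean,
    i.e. when no finite theta can reproduce the observed dispersion.
    """
    n = len(counts)
    mu = sum(counts) / n
    var = sum((c - mu) ** 2 for c in counts) / (n - 1)
    if var <= mu:
        return math.inf
    return mu ** 2 / (var - mu)


overdispersed = [0, 0, 1, 0, 12, 3, 0, 25, 1, 8]   # made-up counts, variance > mean
underdispersed = [3, 4, 3, 4, 3, 4, 3, 4, 3, 4]    # made-up counts, variance < mean
print(moment_theta(overdispersed))    # finite value
print(moment_theta(underdispersed))   # inf
```

An iterative estimator facing the second kind of data keeps increasing theta until it hits its iteration limit, which matches the very large value reported by getME().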
You can:
ignore the warning (the negative binomial isn't a good choice, but the model is effectively converging to a Poisson solution anyway).
revert to a Poisson model (the CV question you link to does say "a Poisson model might be a better choice")
People often worry less about underdispersion than overdispersion (because underdispersion makes results of a Poisson model conservative), but if you want to take underdispersion into account you can fit your model with a conditional distribution that allows underdispersion as well as overdispersion (not directly possible within lme4, but see here)
PS the "iteration limit reached without convergence" warning in one of your linked answers, from nlminb within lme, is a completely different issue (except that both situations involve some form of iterative solution scheme with a set maximum number of iterations ...)
The expression (exp(t) - 1)/t converges to 1 as t tends to 0. However, when computed numerically, we get a different story:
In [19]: (exp(10**(-12)) - 1) * (10**12)
Out[19]: 1.000088900582341
In [20]: (exp(10**(-13)) - 1) * (10**13)
Out[20]: 0.9992007221626409
In [21]: (exp(10**(-14)) - 1) * (10**14)
Out[21]: 0.9992007221626409
In [22]: (exp(10**(-15)) - 1) * (10**15)
Out[22]: 1.1102230246251565
In [23]: (exp(10**(-16)) - 1) * (10**16)
Out[23]: 0.0
Is there some way I can compute this expression without encountering these problems? I've thought of using a power series but I'm wary of implementing this myself as I'm not sure of implementation details like how many terms to use.
If it's relevant, I'm using Python with scipy and numpy.
The discussion in the comments about tiny values seems pointless. If t is so tiny that it causes underflow, then the expression has long since been exactly 1. Indeed the Taylor expansion is
1 + t/2 + t²/6 + t³/24...
and as soon as t < 1 ulp, the floating-point representation is exactly 1.
Above that, expm1(t)/t will do a good job.
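A minimal sketch of that suggestion, using math.expm1 from the standard library (numpy.expm1 does the same for arrays). The wrapper name `expm1_over_t` and the explicit handling of t == 0 are additions made for this example.

```python
import math


def expm1_over_t(t):
    """(exp(t) - 1) / t, with the limiting value 1 at t == 0."""
    if t == 0.0:
        return 1.0
    # expm1 computes exp(t) - 1 without the cancellation that ruins exp(t) - 1.
    return math.expm1(t) / t


# Same inputs as in the question: 10**-12 down to 10**-16.
for k in (12, 13, 14, 15, 16):
    t = 10.0 ** -k
    print(k, expm1_over_t(t))
```

Each printed value agrees with 1 to within rounding, unlike the results in the question.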
EFA first-timer here!
I ran an Exploratory Factor Analysis (EFA) on a data set ("df1" = 1320 observations) with 50 variables by creating a subset with relevant variables only that have no missing values ("df2" = 301 observations).
I was able to filter 4 factors (19 variables in total).
Now I would like to take those 4 factors and regress them on control variables.
For instance: Factor 1 (df2$fa1) describes job satisfaction.
I would like to control for age and marital status.
Fa1Regression <- lm(df2$fa1 ~ df1$age + df1$marital)
However I receive the error message:
Error in model.frame.default(formula = df2$fa1 ~ df1$age + :
variable lengths differ (found for 'df1$age')
What can I do to run the regression correctly? Can I delete observations from df1 that are nonexistent in df2 so that the variable lengths are the same?
It's having a problem using lm to regress a latent factor on other coefficients. Instead, use the lavaan package, where your model statement would be myModel <- 'df2$fa1 ~ x1 + x2 + x3'
So, I have a vector that corresponds to a given feature (same dimensionality). Is there a package in Julia that would provide a mathematical function that fits these data points, in relation to the original feature? In other words, I have x and y (both vectors) and need to find a decent mapping between the two, even if it's a highly complex one. The output of this process should be a symbolic formula that connects x and y, e.g. (:x)^3 + log(:x) - 4.2454. It's fine if it's just a polynomial approximation.
I imagine this is a walk in the park if you employ Genetic Programming, but I'd rather opt for a simpler (and faster) approach, if it's available. Thanks
Turns out the Polynomials.jl package includes the function polyfit which does Lagrange interpolation. A usage example would go:
using Polynomials # install with Pkg.add("Polynomials")
x = [1,2,3] # demo x
y = [10,12,4] # demo y
polyfit(x,y)
The last line returns:
Poly(-2.0 + 17.0x - 5.0x^2)
which evaluates to the correct values.
The polyfit function accepts a maximal degree for the output polynomial, but defaults to using the length of the input vectors x and y minus 1. This is the same degree as the polynomial from the Lagrange formula, and since polynomials of such degree agree on the inputs only if they are identical (this is a basic theorem), we can be certain this is the same Lagrange polynomial and in fact the only one of such a degree to have this property.
Thanks to the developers of Polynomials.jl for leaving me just to google my way to an Answer.
Take a look at MARS regression (multivariate adaptive regression splines).
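The uniqueness argument can be checked independently of Julia. The sketch below, written in Python purely for illustration, evaluates the Lagrange formula directly on the demo points and compares it with the polynomial that polyfit printed; `lagrange_eval` and `reported_poly` are names invented for this example.

```python
from fractions import Fraction


def lagrange_eval(xs, ys, t):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at t."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = Fraction(yi)
        for j, xj in enumerate(xs):
            if j != i:
                # Basis factor (t - xj) / (xi - xj), kept exact with fractions.
                term *= Fraction(t - xj, xi - xj)
        total += term
    return total


def reported_poly(t):
    """The polynomial printed by polyfit in the example: -2 + 17x - 5x^2."""
    return -2 + 17 * t - 5 * t * t


xs, ys = [1, 2, 3], [10, 12, 4]
# Two polynomials of degree 2 that agree at more than 3 points are identical.
for t in range(-3, 7):
    assert lagrange_eval(xs, ys, t) == reported_poly(t)
print("Lagrange form matches -2 + 17x - 5x^2 at every test point")
```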
My Calculus teacher gave us a program to calculate definite integrals over a given interval using the trapezoidal rule. I know that programmed functions take an input and produce an output as arithmetic functions would, but I don't know how to do the inverse: find the input given the output.
The problem states:
"Use the trapezoidal rule with varying numbers, n, of increments to estimate the distance traveled from t=0 to t=9. Find a number D for which the trapezoidal sum is within 0.01 unit of this limit (468) when n > D."
I've estimated the limit through "plug and chug" with the calculator and I know that with a regular algebraic function, I could easily do:
limit (468) = algebraic expression with variable x
(then solve for x)
However, how would I do this for a programmed function? How would I determine the input of a programmed function given output?
I am calculating the definite integral for the polynomial, (x^2+11x+28)/(x+4), over the interval from 0 to 9. The trapezoidal rule function in my calculator calculates the definite integral over that interval using a given number of trapezoids, n.
Overall, I want to know how to do this:
Solve for n:
468 = trapezoidal_rule(a = 0, b = 9, n);
The code for trapezoidal_rule(a, b, n) on my TI-83:
Prompt A
Prompt B
Prompt N
(B-A)/N->D
0->S
A->X
Y1/2->S
For(K,1,N-1,1)
X+D->X
Y1+S->S
End
B->X
Y1/2+S->S
SD->I
Disp "INTEGRAL"
Disp I
Because I'm familiar with neither this syntax nor computer algorithms, I was hoping someone could help me turn this code into an algebraic equation or point me in the direction to do so.
Edit: This is not part of my homework—just intellectual curiosity
the polynomial, (x^2+11x+28)/(x+4)
This is equal to x+7. The trapezoidal rule should give exactly correct results for this function! I'm guessing that this isn't actually the function you're working with...
There is no general way to determine, given the output of a function, what its input was. (For one thing, many functions can map multiple different inputs to the same output.)
So, there is a formula for the error when you apply the trapezoidal rule with a given number of steps to a given function, and you could use that here to work out the value of n you need ... but (1) it's not terribly beautiful, and (2) it doesn't seem like a very reasonable thing to expect you to do when you're just starting to look at the trapezoidal rule. I'd guess that your teacher actually just wanted you to "plug and chug".
I don't know (see above) what function you're actually integrating, but let's pretend it's just x^2+11x+28. I'll call this f(x) below. The integral of this from 0 to 9 is actually 940.5. Suppose you divide the interval [0,9] into n pieces. Then the trapezoidal rule gives you: [f(0)/2 + f(1*9/n) + f(2*9/n) + ... + f((n-1)*9/n) + f(9)/2] * 9/n.
Let's separate this out into the contributions from x^2, from 11x, and from 28. It turns out that the trapezoidal approximation gives exactly the right result for the latter two. (Exercise: Work out why.) So the error you get from the trapezoidal rule is exactly the same as the error you'd have got from f(x) = x^2.
The actual integral of x^2 from 0 to 9 is (9^3-0^3)/3 = 243. The trapezoidal approximation is [0/2 + 1^2+2^2+...+(n-1)^2 + n^2/2] * (9/n)^2 * (9/n). (Exercise: work out why.) There's a standard formula for sums of consecutive squares: 1^2 + ... + n^2 = n(n+1/2)(n+1)/3. So our trapezoidal approximation to the integral of x^2 is (9/n)^3 times [(n-1)(n-1/2)n/3 + n^2/2] = (9/n)^3 times [n^3/3 + n/6] = 243 + (9^3/6) / n^2.
In other words, the error in this case is exactly (9^3/6) / n^2 = (243/2) / n^2.
So, for instance, the error will be less than 0.01 when (243/2) / n^2 < 0.01, which is the same as n^2 > 100*243/2 = 12150, which is true when n >= 111.
[EDITED to add: I haven't checked any of my algebra or arithmetic carefully; there may be small errors. I take it what you're interested in is the ideas rather than the specific numbers.]
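Since the algebra is offered unchecked, a brute-force check is a reasonable complement. The sketch below re-implements the steps of the TI-83 program in Python and simply tries n = 1, 2, 3, ... until the error drops below the tolerance. It uses the stand-in integrand x^2+11x+28 from this answer, not necessarily the teacher's function, and the helper name `first_n_within` is made up for the example.

```python
def trapezoidal_rule(f, a, b, n):
    """Composite trapezoidal rule with n equal subintervals (same steps as the TI-83 program)."""
    d = (b - a) / n
    s = f(a) / 2              # half weight for the left endpoint
    for k in range(1, n):
        s += f(a + k * d)     # full weight for interior points
    s += f(b) / 2             # half weight for the right endpoint
    return s * d


def f(x):
    # Stand-in integrand used in the answer above (an assumption, see text).
    return x * x + 11 * x + 28


EXACT = 940.5  # integral of f from 0 to 9


def first_n_within(tol, n_max=1000):
    """Smallest n whose trapezoidal sum is within tol of EXACT, or None."""
    for n in range(1, n_max + 1):
        if abs(trapezoidal_rule(f, 0, 9, n) - EXACT) < tol:
            return n
    return None


for n in (10, 20, 40):
    print(n, trapezoidal_rule(f, 0, 9, n) - EXACT)
print("first n within 0.01:", first_n_within(0.01))
```

For this integrand the printed errors shrink by about a factor of four each time n doubles, and because they decrease steadily, every n at or beyond the first one found also stays within the tolerance.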