Best-fitting model is a poor predictor - regression

For fun, I was trying to build a predictor for how long it will take for George R. R. Martin's The Winds of Winter to be released. My "best" model is the one with the lowest combined AIC and BIC score (the two summed together). I tried polynomials of degree 0 up to around 50. The best of this sort had degree 3 or 4, plotted with the days since A Game of Thrones was released on the y-axis and the number of books released in the A Song of Ice and Fire series on the x-axis. Despite having the best (lowest) combined AIC and BIC score, it is a poor predictor, predicting in a way that makes no temporal sense. This showed me that my notion that "the best predictor is the one with the lowest combined AIC and BIC score" is flawed. Where have I gone wrong in my thinking, and what kind of scoring criteria would be more appropriate?
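As a hedged illustration of where in-sample scores mislead (a minimal R sketch with rough, made-up release gaps, not the actual data): AIC+BIC keeps rewarding the highest-degree polynomial, while a held-out criterion such as leave-one-out error exposes how badly it predicts. When the goal is prediction, a cross-validated or out-of-sample score is the more appropriate criterion.

# Rough, made-up "days since A Game of Thrones" values (illustration only)
days <- c(0, 700, 1450, 3250, 5450)
book <- seq_along(days)
dat  <- data.frame(book = book, days = days)

fit_poly <- function(d, data) {
  if (d == 0) lm(days ~ 1, data = data)
  else lm(days ~ poly(book, d, raw = TRUE), data = data)
}

for (d in 0:3) {
  fit <- fit_poly(d, dat)
  # Leave-one-out: refit without each point, then predict it back
  loo <- mean(abs(sapply(seq_len(nrow(dat)), function(i) {
    predict(fit_poly(d, dat[-i, ]), dat[i, , drop = FALSE]) - dat$days[i]
  })))
  cat(sprintf("degree %d: AIC+BIC = %8.1f  LOO MAE = %8.0f  book 6 -> %8.0f days\n",
              d, AIC(fit) + BIC(fit), loo, predict(fit, data.frame(book = 6))))
}

On data like these, the cubic wins on AIC+BIC yet has by far the worst leave-one-out error, mirroring the behaviour described above.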

Related

Predict scores of a product

I want to predict scores for products. Each product is scored out of 10 on 5 different features: robustness, style, nuance, modern, and quality, each of which can be between 0 and 10. I have tried XGBRegressor with sklearn's MultiOutputRegressor to handle the multiple labels and got a mean absolute error of 1.19 with a standard deviation of 0.036 after 10-fold CV. I am not sure whether this is a good score, or whether this is a good way of predicting these scores.
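One way to judge whether an MAE of 1.19 on a 0-10 scale is good is to compare it against a naive baseline that always predicts each score's mean; the model only adds value if it clearly beats that. A minimal sketch in R (synthetic data; the column values are invented stand-ins for the real targets):

set.seed(1)
# Synthetic stand-in: five 0-10 scores per product
scores <- data.frame(robustness = runif(500, 0, 10),
                     style      = runif(500, 0, 10),
                     nuance     = runif(500, 0, 10),
                     modern     = runif(500, 0, 10),
                     quality    = runif(500, 0, 10))

# MAE of a constant mean predictor, averaged over the five targets
baseline_mae <- mean(sapply(scores, function(y) mean(abs(y - mean(y)))))
baseline_mae  # run the same comparison on your real data against the CV MAE of 1.19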

Training a network on two feature vectors?

I want to train an MLP that takes in two NBA teams, A and B, and classifies one as the winner and the other as the loser (probably binary classification: 0 for loser, 1 for winner), as well as a predictor that assigns a probability that team A beats team B. I'm having trouble figuring out how the feature vector should look, though, and would like some advice. My ideas are:
Take the difference between the two teams' feature vectors.
Concatenate the features, i.e. one training example would be [A_1, B_1, A_2, B_2, ..., A_n, B_n], where n is the number of features.
Use one feature vector for each team? (Don't know if that works.)
Could anyone give some suggestions?
While I agree with Scott Hunter that you should try everything, here are a few thoughts:
Taking the difference: this depends very much on what the teams' features represent. If each feature is a team statistic (such as win rate), then taking the difference might work. If it is something more abstract, it may not be a good idea. But you can try.
Concatenating the features: I think this would be a good choice, at least when starting out. It is certainly the most obvious option, and I think it would work to a good extent.
Another approach: you could build an encoder that takes a team's feature vector and outputs a condensed representation. Then you can do something with this condensed representation (feed it to a simpler model, or to another MLP).
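To make the first two options concrete, here is a minimal sketch (the per-team statistics and their values are hypothetical, purely for illustration):

# Hypothetical per-team feature vectors (e.g. win rate, avg points, avg rebounds)
team_A <- c(win_rate = 0.62, pts = 112.3, reb = 44.1)
team_B <- c(win_rate = 0.55, pts = 108.9, reb = 46.7)

x_diff   <- team_A - team_B            # option 1: difference of features
x_concat <- c(A = team_A, B = team_B)  # option 2: concatenation [A_1..A_n, B_1..B_n]

If you concatenate, one common trick is to add each game to the training set twice, with the team order swapped and the label flipped, so the network cannot learn an order bias.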

What is wrong with fitting a second seasonal term in ARIMA

Let's say you are fitting a seasonal ARIMA model with regressors to some daily data, and your best model candidate is ARIMA(1,0,0)(2,0,0)[7].
By 'best' I mean a model that simultaneously deals with autocorrelation in the residuals and forecasts reasonably.
Now, according to this source, this should not be a good model:
Rule 13: If the autocorrelation at the seasonal period is positive,
consider adding an SAR term to the model. If the autocorrelation at
the seasonal period is negative, consider adding an SMA term to the
model. Do not mix SAR and SMA terms in the same model, and avoid using
more than one of either kind.
Does having a second seasonal AR or MA term really invalidate your model? Which assumptions are violated?
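Rather than taking the rule on faith, one way to probe it empirically is to fit both the one- and two-SAR-term candidates and compare residual diagnostics and information criteria. A minimal sketch with R's forecast package (assuming a daily series y and a regressor matrix xreg already exist; the names are placeholders):

library(forecast)

fit1 <- Arima(y, order = c(1, 0, 0),
              seasonal = list(order = c(1, 0, 0), period = 7), xreg = xreg)
fit2 <- Arima(y, order = c(1, 0, 0),
              seasonal = list(order = c(2, 0, 0), period = 7), xreg = xreg)

checkresiduals(fit2)                      # Ljung-Box test and residual ACF
c(AICc_1 = fit1$aicc, AICc_2 = fit2$aicc) # does the extra SAR term earn its keep?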

Analyse data with degree of affection

Hello everyone! I'm a newbie studying Data Analysis.
If you'd like to see how A, B, and C affect an outcome, you may use several models such as KNN, SVM, or logistic regression (as far as I know).
But all of these predict categories, rather than a degree of effect.
Let's say I'd like to show how fonts and colors contribute to the degree of attraction (as shown).
What models can I use?
A thousand thanks!
If your input is only categorical variables (each taking only a few values), then there are only finitely many possible input combinations, so the model can produce only finitely many distinct outputs. Just a warning.
If you use, say, KNN or random forest in regression mode, you can use the L2 norm as your error metric. It will capture that 1 is closer to 2 than to 5 (please don't forget to normalize).
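As a minimal sketch of the regression route (all data and factor levels here are invented for illustration), a random forest can map categorical design choices to a continuous attraction score, evaluated with an L2-type error:

library(randomForest)

set.seed(1)
# Invented data: categorical design choices, continuous attraction score (0-10)
d <- data.frame(font       = factor(sample(c("serif", "sans", "mono"), 100, TRUE)),
                color      = factor(sample(c("red", "blue", "green"), 100, TRUE)),
                attraction = runif(100, 0, 10))

fit  <- randomForest(attraction ~ font + color, data = d)
pred <- predict(fit, d)
sqrt(mean((pred - d$attraction)^2))  # RMSE: penalizes being far off more heavily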

Extract GLM coefficients for one factor only

I am running a series of GLMs for a number of species of the form:
glm.sp <- glm(number ~ site + as.factor(year) + offset(log(visits)), family = poisson, data = data.sp)
Note that the year term is deliberately a factor, as it's not reasonable to assume a linear relationship. The yearly coefficients produced by this model are a measure of the number of each species per year, accounting for the amount of effort (visits). I then want to extract, exponentiate, and index (relative to the last year) the year coefficients and run a GAM on them.
Currently I do this by eyeballing the coefficients and calling them directly:
data.sp.coef$coef <- exp(glm.sp$coefficients[60:77])
However, as the number of sites and the number of years recorded differ between species, this means I need to eyeball each species. For example, a different species might have its year coefficients at positions 51:64. I'd rather not do that, and feel there must be a better way of extracting the coefficients for the years.
I've tried the below (which doesn't work!):
> coef(glm.sp)["year"]
<NA>
NA
And I also tried saving all the coefficients as a data frame and using a fuzzy search to extract all the values whose names contain "year" (the coefficients are automatically saved in the format yearXXXX-YY).
I'm certain I'm missing something simple, so I would very much appreciate being prodded in the right direction!
Thanks
Matt
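For what it's worth, a minimal sketch of the name-based extraction described above: select the year coefficients by name with grep() rather than by position, so the code works regardless of how many site coefficients precede them. (Check names(coef(glm.sp)) first; with the formula above the names typically print like "as.factor(year)2001", so adjust the pattern to whatever actually appears.)

cf  <- coef(glm.sp)
yrs <- grep("year", names(cf), fixed = TRUE)  # matches e.g. "as.factor(year)2001"
data.sp.coef$coef <- exp(cf[yrs])

# Index relative to the last year, as described in the post
data.sp.coef$index <- data.sp.coef$coef / tail(data.sp.coef$coef, 1)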