Extract GLM coefficients for one factor only

I am running a series of GLMs for a number of species of the form:
glm.sp<-glm(number~site+as.factor(year)+offset(log(visits)),family=poisson,data=data.sp)
Note that the year term is deliberately a factor, as it's not reasonable to assume a linear relationship. The yearly coefficients produced by this model are a measure of the number of each species per year, taking account of the amount of effort (visits). I then want to extract, exponentiate and index (relative to the last year) the year coefficients and run a GAM on them.
Currently I do this by eyeballing the coefficients and calling them directly:
data.sp.coef$coef<-exp(glm.sp$coefficients[60:77])
However, as the number of sites and the number of years recorded differ between species, this means I need to eyeball each species. For example, a different species might have the year coefficients at 51:64. I'd rather not do that, and feel there must be a better way of calling out the coefficients for the years.
I've tried the below (which doesn't work!):
> coef(glm.sp)["year"]
<NA>
NA
And I also tried saving all the coefficients as a dataframe and using a fuzzy search to extract all the values that contained "year" (the coefficients are automatically saved in the format yearXXXX-YY).
I'm certain I'm missing something simple, so would very much appreciate being prodded in the right direction!
Thanks
Matt
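One way to avoid hard-coded index ranges is to select the coefficients by name rather than by position. Note that coef(glm.sp)["year"] returns NA because indexing a named vector by a string needs an exact name match, and the names here have the form yearXXXX-YY. In R, a pattern match on the names, for example coef(glm.sp)[grep("year", names(coef(glm.sp)))], should pick out the year terms. The sketch below shows the same idea (select by name, exponentiate, index relative to the last year) in Python. The coefficient names and values are hypothetical and only illustrate the pattern.

```python
import math

def year_index(coefs, pattern="year"):
    """Select coefficients whose name contains `pattern`, exponentiate them,
    and index them relative to the last matching coefficient."""
    # Dicts keep insertion order, so the last match is the last year.
    selected = {name: math.exp(b) for name, b in coefs.items() if pattern in name}
    if not selected:
        raise ValueError(f"no coefficient name contains {pattern!r}")
    last = list(selected.values())[-1]
    return {name: value / last for name, value in selected.items()}

# Hypothetical coefficient names and values, for illustration only.
coefs = {
    "(Intercept)": 1.2,
    "siteB": 0.3,
    "as.factor(year)2001-02": 0.10,
    "as.factor(year)2002-03": 0.25,
    "as.factor(year)2003-04": 0.40,
}
index = year_index(coefs)
```

Because the selection is by name, the same call works whether the year terms sit at positions 60:77 or 51:64.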

Related

Best regression model where some fields may be intentionally blank for some samples

I'm looking to build a regression model where I have time based variables that may or may not exist for each data sample.
For instance, let's say we wanted to build a regression model where we could predict how long a new car will last. One of the values is when the car gets its first servicing. However, there are some samples where the car never gets serviced at all. In these situations, how can I account for this when building the model? Can I even use a linear regression model or will I have to choose a different regression model?
When I think about it, this is basically the equivalent of having 2 fields: one for whether the car was serviced and if that is true, a second field for when. But I'm not sure how to build a regression that has data that is intentionally missing.
Apply regression without using time-series. To try to capture seasonality in the data, encode the date/time columns into binary columns (to represent year, day of year, day of the month and day of the week etc.).
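As a rough illustration of the encoding suggested above, the sketch below turns a date into binary (0/1) columns for day of the month and day of the week. The column names and the helper encode_date are made up for this example; year and day of year could be encoded the same way.

```python
from datetime import date

def encode_date(d):
    """One-hot encode day of month and day of week as 0/1 columns."""
    row = {}
    for day in range(1, 32):
        row[f"dom_{day}"] = int(d.day == day)
    for wd in range(7):  # Monday = 0 ... Sunday = 6
        row[f"weekday_{wd}"] = int(d.weekday() == wd)
    return row

row = encode_date(date(2021, 3, 15))  # 15 March 2021 was a Monday
```

Each row then has exactly one 1 per encoded field, which an ordinary regression can use without treating the data as a time series.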

How to deal with different subsampling time in economic datasets for deep learning?

I am building a deep learning model for macro-economic prediction. However, different indicators vary widely in their sampling intervals, ranging from minutes to years.
Dataframe example
The picture contains the 'Treasury Rates (DGS1-20)' which is sampled daily and 'Inflation Rate(CPALT...)' which is sampled monthly. These features are essential for the model to train and dropping out the NaN rows would result in too little data.
I've read some books and articles about dealing with missing data, including down-sampling to monthly time frames, replacing the NaNs with -1, and filling gaps with the average of the previous and next values. But the methods I've read about mostly deal with datasets where about 10% of the values are missing, whereas in my case the monthly sampled 'Inflation(CPI)' is 90+% missing once I combine it with the 'Treasury Rate' dataset.
I was wondering if there was any workaround to handle missing values, particularly for economic data where the sampling time gap ranges so widely. Thank you
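For reference, the down-sampling option mentioned above can be sketched as follows: average the daily series within each calendar month, then pair it with the monthly series so that no row contains a gap. This is only a minimal sketch with made-up numbers and hypothetical helper names. It trades away the daily detail, which may or may not be acceptable for the model.

```python
from collections import defaultdict
from datetime import date

def monthly_mean(daily):
    """Average daily (date, value) observations within each (year, month)."""
    buckets = defaultdict(list)
    for d, v in daily:
        buckets[(d.year, d.month)].append(v)
    return {ym: sum(vs) / len(vs) for ym, vs in buckets.items()}

def join_monthly(daily, monthly):
    """Pair each monthly value with the mean of the daily series for that month."""
    means = monthly_mean(daily)
    return {ym: (means[ym], v) for ym, v in monthly.items() if ym in means}

# Made-up numbers, for illustration only.
daily_rate = [(date(2020, 1, 2), 1.5), (date(2020, 1, 3), 1.7), (date(2020, 2, 3), 1.4)]
monthly_cpi = {(2020, 1): 2.5, (2020, 2): 2.3}
rows = join_monthly(daily_rate, monthly_cpi)
```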

Analyse data with degree of affection

Hello everyone! I'm a newbie studying Data Analysis.
If you'd like to see how A, B and C affect an outcome, you can use several models such as KNN, SVM or logistic regression (as far as I know).
But all of them are kind of categorical, rather than giving a degree of effect.
Let's say I'd like to show how fonts and colors contribute to the degree of attraction (as shown).
What models can I use?
Many thanks!
If your input is only categorical variables (each with a few values), then there are finitely many possible samples. The model will therefore have finitely many inputs and, consequently, only a few distinct outputs. Just a warning.
If you use, say, KNN or a random forest, you can use the L2 norm as your accuracy metric. It will emphasize that 1 is closer to 2 than to 5 (pls don't forget to normalize).
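To make the point about the L2 norm concrete, here is a small sketch assuming the attraction score is a rating on a 1 to 5 scale (an assumption, the question does not state the scale). With a squared-error metric on normalized scores, predicting 2 for a true score of 1 is penalized far less than predicting 5, which a plain right/wrong accuracy would not capture.

```python
def normalize(scores, lo=1, hi=5):
    """Rescale scores from [lo, hi] to [0, 1]; the 1-5 scale is an assumption."""
    return [(s - lo) / (hi - lo) for s in scores]

def mean_squared_error(y_true, y_pred):
    """Mean of squared differences (the squared L2 distance per sample)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

truth = normalize([1])
near = mean_squared_error(truth, normalize([2]))  # predicted 2, true 1
far = mean_squared_error(truth, normalize([5]))   # predicted 5, true 1
```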

Can I find price floors and ceilings with cuda

Background
I'm trying to convert an algorithm from sequential to parallel, but I am stuck.
Point and Figure Charts
I am creating point and figure charts.
Decreasing
While the stock is going down, add an O every time it breaks through the floor.
Increasing
While the stock is going up, add an X every time it breaks through the ceiling.
Reversal
If the stock reverses direction, but the change is less than a reversal threshold (3 units) do nothing. If the change is greater than the reversal threshold, start a new column (X or O)
Sequential vs Parallel
Sequentially, this is pretty straightforward. I keep a variable for the floor and ceiling. If the current price breaks through the floor or ceiling, or changes more than the reversal threshold, I can take the appropriate action.
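For concreteness, the sequential logic described above might look roughly like the sketch below. It is an interpretation, not the original code: it assumes the initial direction is set by the first price change, and it tracks a single running extreme (the ceiling while rising, the floor while falling). The function name find_reversals is made up for this example.

```python
def find_reversals(prices, threshold=3.0):
    """Return the indices at which a new column (X or O) would start."""
    reversals = []
    direction = 0        # +1 rising, -1 falling, 0 not yet established
    extreme = prices[0]  # ceiling while rising, floor while falling
    for i in range(1, len(prices)):
        p = prices[i]
        if direction == 0:
            if p != extreme:
                direction = 1 if p > extreme else -1
                extreme = p
        elif direction == 1:
            if p > extreme:
                extreme = p                  # broke through the ceiling
            elif extreme - p > threshold:
                reversals.append(i)          # fell by more than the threshold
                direction, extreme = -1, p
        else:
            if p < extreme:
                extreme = p                  # broke through the floor
            elif p - extreme > threshold:
                reversals.append(i)          # rose by more than the threshold
                direction, extreme = 1, p
    return reversals

reversal_points = find_reversals([10, 11, 12, 11, 13, 9, 8, 10, 12, 13])
```

Each step depends on the direction and extreme carried over from the previous step, which is what makes a direct parallel version awkward.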
My question is, is there a way to find these reversal points in parallel? I'm fairly new to thinking in parallel, so I'm sorry if this is trivial. I am trying to do this in CUDA, but I have been stuck for weeks. I have tried using the finite difference algorithms from NVIDIA. These produce local max / min but not the reversal points. Small fluctuations produce numerous relative max / min, but most of them are trivial because the change is not greater than the reversal size.
My question is, is there a way to find these reversal points in parallel?
One possible approach:
1. Use thrust::unique to remove periods where the price is numerically constant.
2. Use thrust::adjacent_difference to produce 1st difference data.
3. Use thrust::adjacent_difference on the 1st difference data to get the 2nd difference data, i.e. the points where there is a change in the sign of the slope.
4. Use these points of change in sign of slope to identify separate regions of data, and build a key vector from these (e.g. with a prefix sum). This key vector segments the price data into "runs" where the price change is in a particular direction.
5. Use thrust::exclusive_scan_by_key on the 1st difference data, to produce the net change of the run.
6. Wherever the net change of the run exceeds a threshold, flag it as a "reversal".
Your description of what constitutes a reversal may also be slightly unclear. The above method would not flag a reversal on certain data patterns that you might classify as a reversal. I suspect you are looking beyond a single run as I have defined it here. If that is the case, there may be a method to address that as well - with more steps.
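The data flow of the steps above can be emulated sequentially to check the logic before writing any Thrust code. The sketch below is plain Python, not CUDA. Each stage stands in for the corresponding primitive, and the per-run total in step 5 is computed as a simple segmented sum (closer to a reduce-by-key than to an exclusive scan), which is a substitution made here for brevity. The function name flag_runs and the sample prices are hypothetical.

```python
from itertools import accumulate, groupby

def flag_runs(prices, threshold=3.0):
    """Return (run_id, net_change) for runs whose net change exceeds threshold."""
    # Step 1: drop consecutive duplicates (the role of thrust::unique).
    p = [k for k, _ in groupby(prices)]
    # Step 2: first differences between neighbouring prices.
    d = [b - a for a, b in zip(p, p[1:])]
    # Step 3: mark positions where the sign of the slope changes.
    change = [0] + [int((x > 0) != (y > 0)) for x, y in zip(d, d[1:])]
    # Step 4: a prefix sum over the marks gives one key (run id) per difference.
    keys = list(accumulate(change))
    # Step 5: net change of each run, as a segmented sum over the keys.
    net = {}
    for k, x in zip(keys, d):
        net[k] = net.get(k, 0) + x
    # Step 6: flag runs whose net change exceeds the threshold.
    return [(k, net[k]) for k in sorted(net) if abs(net[k]) > threshold]

flags = flag_runs([10, 11, 12, 12, 11, 13, 9, 8, 10, 12, 13])
```

In this sample the early rise from 10 to 13 is interrupted by a one-unit dip, so it is split into separate short runs and not flagged. That is the limitation noted above: a move spread over several runs is not detected by this method as it stands.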

How does SPSS assign factor scores for cases where underlying variables were pairwise deleted?

Here's a simplified example of what I'm trying to figure out from a report. All analyses are being run in SPSS, which I don't have and don't use (my experience is with SAS and R).
They were running a regression to predict overall meal satisfaction from food type ordered, self-reported food flavor, and self-reported food texture.
But food flavor and texture are highly correlated, so they conducted a factor analysis, found food flavor and texture load on one factor, and used the factor scores in the regression.
However, about 40% of respondents don't have responses on self-reported food texture, so they used pairwise deletion while making the factors.
My question is when SPSS calculates the factor scores and outputs them as new variables in the data set, what does it do with people who had an input for a factor that was pairwise deleted?
How does it calculate (if it calculates it at all) factor scores for those people who had a response pairwise deleted during the creation of the factors and who therefore have missing data for one of the variables?
Factor scores are a linear combination of their scaled inputs. That is, given normalized variables X1, ..., Xn, we have the following (where LaTeX formatting isn't supported and the L's indicate the loadings)
f = \sum_{i=1}^n L_i X_i
In your case n = 2. Now suppose one of the X_i is missing. How do you take a sum involving a missing value? Clearly, you cannot... unless you impute it.
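The arithmetic can be sketched as below for n = 2. The loadings and inputs are hypothetical numbers, and the choice to return None for a case with a missing input only illustrates why the sum is undefined without imputation. It makes no claim about what SPSS actually does.

```python
def factor_score(loadings, values):
    """Return sum(L_i * X_i), or None if any X_i is missing (None)."""
    if any(x is None for x in values):
        return None  # the sum is undefined unless the missing X_i is imputed
    return sum(l * x for l, x in zip(loadings, values))

loadings = [0.8, 0.7]                          # hypothetical loadings L_1, L_2
complete = factor_score(loadings, [1.0, -0.5])  # both inputs present
missing = factor_score(loadings, [1.0, None])   # texture response missing
```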
I don't know what the analyst who created your report did. I suggest you ask them.