How to fit autocorrelation structures in a glmer model of diving data vs nlme and glmmTMB models? (lme4)

I want to compare diving behaviour between penguin populations breeding at different locations using generalized linear mixed effects models (glmer). The data looks like this:
# A tibble: 12 × 7
# Groups:   ID [1]
   ID               begdesc              bout maxdep island phase diurnal
   <chr>            <dttm>              <dbl>  <dbl> <chr>  <chr> <chr>
 1 Nelson_Bro1_HP01 2019-01-01 19:53:11     1  52.0  Nelson Chick day
 2 Nelson_Bro1_HP01 2019-01-01 20:08:18     2   5.34 Nelson Chick day
 3 Nelson_Bro1_HP01 2019-01-01 20:14:39     2  52.0  Nelson Chick day
 4 Nelson_Bro1_HP01 2019-01-01 20:24:46     3  64.1  Nelson Chick day
 5 Nelson_Bro1_HP01 2019-01-01 20:28:44     3  75.5  Nelson Chick day
 6 Nelson_Bro1_HP01 2019-01-01 20:39:44     4  68.5  Nelson Chick day
 7 Nelson_Bro1_HP01 2019-01-01 20:46:58     4  62.8  Nelson Chick day
 8 Nelson_Bro1_HP01 2019-01-01 20:52:19     4  62.0  Nelson Chick day
 9 Nelson_Bro1_HP01 2019-01-01 20:56:43     4  60.7  Nelson Chick day
10 Nelson_Bro1_HP01 2019-01-01 20:59:56     4  62.1  Nelson Chick day
11 Nelson_Bro1_HP01 2019-01-01 21:04:05     4  62.8  Nelson Chick day
12 Nelson_Bro1_HP01 2019-01-01 21:10:05     4  51.5  Nelson Chick day
ID = individual penguin, begdesc = date and time of dive, bout = bout number, maxdep = maximum dive depth, island = breeding island, phase = breeding phase (incubation or chick-rearing), and diurnal = day/night.
The response variable is maximum dive depth (maxdep) and the fixed effects are island ('Kop' or 'Nel') and diurnal (day or night). I include the ID of each penguin as a random effect to account for the repeated measures. The diving depths have a Gamma error distribution and each row represents a single dive. The times of dives are irregularly spaced. I used the following syntax for the generalized linear mixed model:
library(lme4)
glmer(maxdep ~ island + diurnal + (1|ID), data = dive.stats, family = Gamma(link = "log"))
I realized that dives closer together in time are highly autocorrelated.
[ACF plot, without a time variable in the model]
So now I also have a variable 'bout' that groups dives that are close together in time into the same bout, and assigns dives separated by more than 5 minutes to different bouts, so that the timing of dives is accounted for. I account for repeated measures of bouts within individuals as a nested random effect:
glmer(maxdep ~ island + diurnal + (1|ID/bout), data = dive.stats, family = Gamma(link = "log"))
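(For reference, a sketch of how such a bout index could be derived from begdesc, assuming dplyr and the 5-minute threshold; the gap.min helper column is made up:)

library(dplyr)

# a new bout starts whenever the gap since the previous dive exceeds 5 minutes
dive.stats <- dive.stats %>%
  group_by(ID) %>%
  arrange(begdesc, .by_group = TRUE) %>%
  mutate(gap.min = as.numeric(difftime(begdesc, lag(begdesc), units = "mins")),
         bout    = cumsum(is.na(gap.min) | gap.min > 5)) %>%
  ungroup()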
Once I account for bouts of dives within individuals, the ACF plot improved a lot.
[ACF plot, with a variable that accounts for time]
But I still want to include an autocorrelation structure in the glmer model to improve the model estimates. I have looked online for how to include autocorrelation structures in a generalized linear mixed effects model, but I can't find any good answers. Has anyone been able to do this before?
Another option I explored is using 'lme' instead, where I can add correlation structures easily, but that also means I can't specify a Gamma distribution. I would need to log-transform the dive depth variable so that the errors are approximately normal, which has been discouraged because interpretation is not easy with transformed data. The ACF plots look even better when I use 'lme' compared to the glmer models.
[ACF plot, using lme]
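For reference, a minimal sketch of that lme approach, assuming a numeric time covariate (time.h, hours since the first dive, a made-up name) derived from begdesc; corCAR1 is the continuous-time AR(1), which handles irregularly spaced observations:

library(nlme)

# log-transform the response so the errors are approximately normal
dive.stats$log.maxdep <- log(dive.stats$maxdep)
# hours since the first dive, as a continuous time covariate
dive.stats$time.h <- as.numeric(difftime(dive.stats$begdesc,
                                         min(dive.stats$begdesc),
                                         units = "hours"))

m.lme <- lme(log.maxdep ~ island + diurnal,
             random = ~ 1 | ID/bout,
             correlation = corCAR1(form = ~ time.h),  # applied within the innermost group (bout)
             data = dive.stats)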
I've also tried 'glmmTMB', where I can specify the Gamma (link = log) error distribution, but I am struggling to figure out how to include the AR1 correlation structure. Can I specify 'bout' as the time factor needed for the ar1 correlation structure, since dives within 5 min of each other are grouped in the same bout? Or should I try to include another time variable, e.g. the hour when the dive took place?
According to this vignette, if I have irregularly spaced dives I should rather use the ou (Ornstein-Uhlenbeck) correlation structure, but then I need to include the locations of dives?
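For what it's worth, a sketch of what the glmmTMB call could look like, assuming the same made-up time.h covariate as above. As I read the covariance-structures vignette, ar1() expects an equally spaced factor time index, while ou() takes the irregular numeric times encoded with numFactor(), so the "locations" are positions in time, not space:

library(glmmTMB)

# encode the irregular dive times as coordinates for the OU structure
# (times must be unique within each ID)
dive.stats$time.h <- as.numeric(difftime(dive.stats$begdesc,
                                         min(dive.stats$begdesc),
                                         units = "hours"))
dive.stats$pos <- numFactor(dive.stats$time.h)

# Ornstein-Uhlenbeck: a continuous-time AR(1), suited to irregular spacing
m.ou <- glmmTMB(maxdep ~ island + diurnal + ou(pos + 0 | ID),
                family = Gamma(link = "log"),
                data = dive.stats)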
I would appreciate any suggestions you might have!

Related

Multinomial Logistic Regression Predictors Set Up

I would like to use a multinomial logistic regression to get win probabilities for each of the 5 horses that participate in any given race, using each horse's previous average speed.
RACE_ID H1_SPEED H2_SPEED H3_SPEED H4_SPEED H5_SPEED WINNING_HORSE
1 40.482081 44.199627 42.034929 39.004813 43.830139 5
2 39.482081 42.199627 41.034929 41.004813 40.830139 4
I am stuck on how to handle the independent variables for each horse, given that any of the 5 horses' average speeds can be placed in any of H1_SPEED through H5_SPEED. For each race I can put any of the 5 horses under H1_SPEED, so there is no real relationship between H1_SPEED from RACE_ID 1 and H1_SPEED from RACE_ID 2 other than the arbitrary position I selected.
Would there be any difference if the dataset looked like this -
For RACE_ID 1 I swapped H3_SPEED and H5_SPEED and changed WINNING_HORSE from 5 to 3
For RACE_ID 2 I swapped H4_SPEED and H1_SPEED and changed WINNING_HORSE from 4 to 1
RACE_ID H1_SPEED H2_SPEED H3_SPEED H4_SPEED H5_SPEED WINNING_HORSE
1 40.482081 44.199627 43.830139 39.004813 42.034929 3
2 41.004813 42.199627 41.034929 39.482081 40.830139 1
Is this an issue, if so how should this be handled? What if I wanted to add more independent features per horse?
You cannot rearrange your dataset that way, because each feature (column) has a meaning, and the target probably depends on the values of the other features. You can picture each row as a point in a six-dimensional space: if you change the value of a feature, the point moves; it does not stay put.
If you deem a feature useless for solving your problem (i.e. it is independent of the target), you can drop it or avoid using it during the training phase of your model.
Edit
To solve your specific problem, you could add a feature for each speed column that records which specific horse ran at that speed. It is a sort of feature engineering, adding more problem-related features to your model.
RACE_ID H1_SPEED H1_HORSE H2_SPEED H2_HORSE ... WINNING_HORSE
1 40.482081 1 44.199627 2 ... 5
2 39.482081 3 42.199627 5 ... 4
I've invented the numbers associated with each horse, but it seems this information is present in your dataset.

Using JOIN (?) to intentionally return more results than rows

Sorry for the length of detail required to ask the question.
There are four tables (related to research, not really having anything to do with a sporting facility). They're as follows:
1) Let's say the first table is a list of tennis courts, and let's say there are hundreds of possibilities (not just indoor and outdoor).
-------------
TENNIS_COURTS
ID Type
-------------
1 Indoor
2 Outdoor
…
2) We want to note which day of the year they're available for rental. To prevent redundant rows, we can list individual days (e.g., just the 2nd day of the year, entered as "From:2", "To:2") or blocks (e.g., from the 24th day through the 25th day, entered as "From:24", "To:25"). In this example, the indoor court is the most available while the outdoor court has only two date ranges (obviously unrealistic for winter).
---------------------------
DAYS_AVAILABLE
ID ProductID From To
---------------------------
1 1 2 2 《 Indoor
2 2 24 25 《 Outdoor
3 2 140 170 《 Outdoor
4 1 280 300 《 Indoor
5 1 340 345 《 Indoor
…
3) We also want to add a list of attributes which will grow quite long over time. So rather than incorporating these in a field rule, there's an Attributes table.
-----------------------
ATTRIBUTES
ID Attribute
-----------------------
1 Age of Player
2 Time of Day
3 Outside Temperature
…
4) Lastly, we want to add a list of Considerations (or factors) to consider when renting a court. In this example, risk of injury applies to both indoor and outdoor courts, but visibility and temperature only applies to outdoor.
--------------------------------------------------
CONSIDERATIONS
ID ProductID AttributeID Effect Link
--------------------------------------------------
1 1 1 Risk of injury www… 《 Indoor
2 2 1 Risk of injury www… 《 Outdoor
3 2 2 Hard to see www… 《 Outdoor
4 2 3 Gets cold www… 《 Outdoor
…
Utilizing the individual tables above, we'd like to create a consolidated saved view that contains at least one row for each date in the range, starting from the first day of the year (in which a court is available) through the last day of the year (for which a court is available). We also want to repeat the applicable considerations for each day listed.
Based on the data shown above, it would look like this:
----------------------------------------
CONSOLIDATED VIEW
Day Court Consideration Link
----------------------------------------
2 Indoor 《 from DAYS_AVAILABLE
2 Indoor Risk of injury www… 《 from CONSIDERATIONS
24 Outdoor 《 from DAYS_AVAILABLE
24 Outdoor Risk of injury www… 《 from CONSIDERATIONS
24 Outdoor Hard to see www… 《 from CONSIDERATIONS
24 Outdoor Gets cold www… 《 from CONSIDERATIONS
25 Outdoor 《 from DAYS_AVAILABLE
25 Outdoor Risk of injury www… 《 from CONSIDERATIONS
25 Outdoor Hard to see www… 《 from CONSIDERATIONS
25 Outdoor Gets cold www… 《 from CONSIDERATIONS
…
We can then query the consolidated view (e.g., "SELECT * FROM CONSOLIDATED_VIEW where Day = 24") to produce a simple output like:
Court: Outdoor
Available: 24th day
Note: Risk of injury (www…)
Hard to see (www…)
Gets cold (www…)
We want to produce the above shown example from a consolidated view because once the data is stored, it won't be changing frequently, and we very likely will not be querying single days at a time anyhow. It's more likely that a web client will fetch all of the rows into a large array (TBD based on determining the total size), and will then present it to users without further server interaction.
Can we produce the CONSOLIDATED_VIEW solely with an SQL query, or do we need to perform some other coding (e.g., PHP or NodeJS)?
The real question here is: how can I get a list of the available days, so I can join my other tables and produce my output, right? I mean, once you have a list of days, all you need to do is JOIN the other tables.
As you have a limited list (days of the year), I'd suggest creating a table with a single column containing the 365 (or 366) days (1, 2, 3, ...) and JOINing it with your other tables. The query would be something similar to:
SELECT ...  -- the fields you want
FROM YOUR_NEW_TABLE n
JOIN DAYS_AVAILABLE d ON (n.Day BETWEEN d.From AND d.To)
JOIN ...    -- other tables you need info from
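To make that concrete, here is a sketch assuming MySQL syntax and the table names above; CALENDAR is the proposed new day table, and From/To need quoting because they are reserved words. Note this returns one row per day per consideration; to also get the bare availability rows shown in the example view, you could UNION them in:

-- the proposed single-column day table
CREATE TABLE CALENDAR (Day INT PRIMARY KEY);
-- ...populate CALENDAR with the values 1 through 366...

CREATE VIEW CONSOLIDATED_VIEW AS
SELECT c.Day,
       t.Type   AS Court,
       k.Effect AS Consideration,
       k.Link
FROM CALENDAR c
JOIN DAYS_AVAILABLE d ON c.Day BETWEEN d.`From` AND d.`To`
JOIN TENNIS_COURTS t  ON t.ID = d.ProductID
LEFT JOIN CONSIDERATIONS k ON k.ProductID = d.ProductID;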

Panel data or time-series data and xt regression

I need help with a simple regression as well as an xt regression for panel data.
The dataset consists of 16 participants in which daily observations were made.
I would like to observe the difference between pre-test (from the first date on which observations were taken) and post-test (the last date on which observations were made) across different variables.
I was also advised to run xtreg, re.
What is this re, and what is its significance?
If the goal is to fit some xt model at the end, you will need the data in long form. I would use:
bysort id (year): keep if inlist(_n,1,_N)
For each id, this puts the data in ascending chronological order, and keeps the first and last observation for each id.
The RE part of your question is off-topic here. Try Statalist or CV SE site, but do augment your questions with details of the data and what you hope to accomplish. These may also reveal that getting rid of the intermediate data is a bad idea.
Edit:
Add this after the part above:
bysort id (year): gen t = _n
reshape wide x year, i(id) j(t)
order id x1 x2 year1 year2
Perhaps this sample code will set you in the direction you seek.
clear
input id year x
1 2001 11
1 2002 12
1 2003 13
1 2004 14
2 2001 21
2 2002 22
2 2003 23
3 2005 35
end
xtset id year
bysort id (year): generate firstx = x[1]
bysort id (year): generate lastx = x[_N]
list, sepby(id)
With regard to xtreg, re: that fits a random-effects model. See help xtreg for more details, as well as the documentation for xtreg in the Stata Longitudinal-Data/Panel-Data Reference Manual included in your Stata documentation.
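A minimal usage sketch, with a hypothetical outcome y and regressor x (your variables will differ):

* declare the panel, then fit a random-effects model
xtset id year
xtreg y x, re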

Isolating unique observations and calculating the average in Stata

Currently I have a dataset that appears as follows:
mnbr firm contribution
1591 2 1
9246 6 1
812 6 1
674 6 1
And so on. The idea is that mnbr is the member number of employees who work at firm # whatever. If contribution is 1 (and I have dropped all the 0s for this purpose), said employee has contributed to a certain fund.
I additionally used codebook to determine the number of unique firms that exist. The goal is to determine the average number of contributions per firm, i.e. there was 1 contribution for firm 2, 3 contributions for firm 6, and so on. The problem I run into is accessing the number of unique values from codebook.
I read some documentation online for
inspect varlist
display r(N_unique)
which suggests to me that r(N_unique) would store that value, yet unfortunately this method did not work for me. So that is part 1.
Part 2: I'd also like to create a variable that shows the share of contributions in each firm, i.e.
mnbr firm contribution average
1591 2 1 1
9246 6 . 2/3
812 6 1 2/3
674 6 1 2/3
to show that for firm 6, 2 out of the 3 employees contributed to this fund.
Thanks in advance for the help.
To answer your comment, this works for me:
clear
set more off
input ///
mnbr firm cont
1591 2 1
9246 6 .
812 6 1
674 6 1
end
list
// problem 1
inspect firm
display r(N_unique)
// problem 2
bysort firm: egen totc = total(cont)
by firm: gen share = totc / _N
list
You have to use r(N_unique) before running another Stata command, or it will be lost. You can also save that result to a local macro or scalar.
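For example (a minimal sketch; the macro name is arbitrary):

inspect firm
local n_firms = r(N_unique)  // save immediately, before another command overwrites r()
display `n_firms'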
Problem 2 is also addressed.

MySQL: Querying time-related rows grouped by century?

I am working on a project which features one database table looking like this (structurally, but not datawise):
year | event | category
------------------------------------------------------------
1970 | Someone ate a cheeseburger | food
2010 | Justin bieber was discovered | other
1500 | Columbus makes 3rd trip to America | notable
------------------------------------------------------------
How would I query this table, so that my result is grouped in a per-century way?
2000-century:
2010 - Justin bieber was discovered
1900-century:
1970 - Someone ate a cheeseburger
1500-century:
1500 - Columbus makes 3rd trip to America
Sorry for the cheesy pseudodata :)
MySQL doesn't have a century function, but it does have YEAR(), and since your table already stores a plain integer year you can basically do:
SELECT whatever
FROM yourtable
WHERE ...
GROUP BY FLOOR(year / 100)
Of course, this doesn't take into account that centuries officially start on their year "1", and not on the year "0" (e.g. 1900 is still in the 19th century, not the 20th). But if you're not a stickler for precision, the simple "divide by 100" method is the quickest/easiest.
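If you do want the strict definition, a small adjustment handles it (a sketch using the question's year column):

-- strict centuries: e.g. the years 1901-2000 form the 20th century
SELECT FLOOR((year - 1) / 100) + 1 AS century, event
FROM yourtable
ORDER BY year;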
SELECT FLOOR(year / 100) AS century, someotherfields
FROM yourtable
WHERE ...
ORDER BY year
is a better approach to the question, I think.
(pseudo code borrowed from Marc B)