Use R or MySQL to calculate time period returns?

I'm trying to calculate various time period returns (monthly, quarterly, yearly etc.) for each unique member (identified by Code in the example below) of a data set. The data set will contain monthly pricing information for a 20 year period for approximately 500 stocks. An example of the data is below:
Date Code Price Dividend
1 2005-01-31 xyz 1000.00 20.0
2 2005-01-31 abc 1.00 0.1
3 2005-02-28 xyz 1030.00 20.0
4 2005-02-28 abc 1.01 0.1
5 2005-03-31 xyz 1071.20 20.0
6 2005-03-31 abc 1.03 0.1
7 2005-04-30 xyz 1124.76 20.0
I am fairly new to R, but thought that there would be a more efficient solution than looping through each Code and then each Date as shown here:
uniqueDates <- unique(data$Date)
uniqueCodes <- unique(data$Code)
for (date in as.list(uniqueDates)) { # as.list() keeps the Date class inside the loop
for (code in uniqueCodes) {
nextDate <- seq.Date(from=date, by="3 months", length.out=2)[2]
# match on Code as well as Date so prices of different stocks are not mixed
curPrice <- data$Price[data$Date == date & data$Code == code]
futPrice <- data$Price[data$Date == nextDate & data$Code == code]
data$ret[(data$Date == date) & (data$Code == code)] <- (futPrice/curPrice)-1
}
}
This method in itself has an issue in that seq.Date does not always return the final day in the month.
Unfortunately the data is not uniform (the number of companies/codes varies over time) so using a simple row offset won't work. The calculation must match the Code and Date with the desired date offset.
I had initially tried selecting the future dates by using the seq.Date function
data$ret = (data[(data$Date == (seq.Date(from = data$Date, by="3 month", length.out=2)[2])), "Price"] / data$Price) - 1
But this generated an error as seq.Date requires a single entry.
> Error in seq.Date(from = stock_data$Date, by = "3 month", length.out =
> 2) : 'from' must be of length 1
I thought that R would be well suited to this type of calculation, but perhaps not. Since all the data is in a MySQL database, I am now thinking that it might be faster/easier to do this calculation directly in the database.
Any suggestions would be greatly appreciated.

Load data:
tc='
Date Code Price Dividend
2005-01-31 xyz 1000.00 20.0
2005-01-31 abc 1.00 0.1
2005-02-28 xyz 1030.00 20.0
2005-02-28 abc 1.01 0.1
2005-03-31 xyz 1071.20 20.0
2005-03-31 abc 1.03 0.1
2005-04-30 xyz 1124.76 20.0'
df = read.table(text=tc,header=T)
df$Date=as.Date(df$Date,"%Y-%m-%d")
First I would organize the data by date:
# reshape() is part of base R (the stats package), so no extra package is needed
pp1=reshape(df,timevar='Code',idvar='Date',direction='wide')
Then you would like to obtain monthly, quarterly, yearly, etc returns.
For that there are several options, one could be:
Convert the data to a zoo or xts object, e.g.:
library(xts)
pp1[2:ncol(pp1)] = as.xts(pp1[2:ncol(pp1)],order.by=pp1$Date)
#let's create a function for calculating returns.
rets<-function(x,lag=1){
return(diff(log(x),lag))
}
Since this database is monthly, the lags for the returns will be:
monthly = 1, quarterly = 3, yearly = 12. For instance, let's calculate the monthly return
for xyz.
lagged=1 #for monthly
This calculates Monthly returns for xyz
pp1$returns_xyz= c(NA,rets(pp1$Price.xyz,lagged))
To get all the returns:
#create matrix of returns
pricelist= ls(pp1)[grep('Price',ls(pp1))]
returnsmatrix = data.frame(matrix(rep(0,(nrow(pp1)-1)*length(pricelist)),ncol=length(pricelist)))
j=1
for(i in pricelist){
n = which(names(pp1) == i)
returnsmatrix[,j] = rets(pp1[,n],1)
j=j+1
}
#column names
codename= gsub("Price.", "", pricelist, fixed = TRUE)
names(returnsmatrix)=paste('ret',codename,sep='.')
returnsmatrix

You can do this very easily with the quantmod and xts packages. Using the data in AndresT's answer:
library(quantmod) # loads xts too
pp1 <- reshape(df,timevar='Code',idvar='Date',direction='wide')
# create an xts object
x <- xts(pp1[,-1], pp1[,1])
# only get the "Price.*" columns
p <- getPrice(x)
# run the periodReturn function on each column
r <- lapply(p, periodReturn, period="monthly", type="log") # lapply keeps each column as an xts object
# merge prior result into a multi-column object
r <- do.call(merge, r)
# rename columns
names(r) <- paste("monthly.return",
sapply(strsplit(names(p),"\\."), "[", 2), sep=".")
Which leaves you with an r xts object containing:
monthly.return.xyz monthly.return.abc
2005-01-31 0.00000000 0.000000000
2005-02-28 0.02955880 0.009950331
2005-03-31 0.03922071 0.019608471
2005-04-30 0.04879016 NA
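If you would rather stay in the original long format and match on Code plus a month offset, as the question asks, here is a minimal base-R sketch. It assumes one row per Code per month and computes trailing simple returns (price now divided by the price k months earlier), not log returns. The helper name period_return and the month index ym are my own, not from either answer. Matching on a year-month index instead of the exact date avoids the end-of-month problem with seq.Date.

```r
# Sample data from the question (long format, one row per Code per month)
prices <- data.frame(
  Date  = as.Date(c("2005-01-31", "2005-01-31", "2005-02-28", "2005-02-28",
                    "2005-03-31", "2005-03-31", "2005-04-30")),
  Code  = c("xyz", "abc", "xyz", "abc", "xyz", "abc", "xyz"),
  Price = c(1000.00, 1.00, 1030.00, 1.01, 1071.20, 1.03, 1124.76)
)

# Consecutive month index, so that "k months earlier" is simply ym - k
lt <- as.POSIXlt(prices$Date)
prices$ym <- (lt$year + 1900) * 12 + lt$mon

# Hypothetical helper: trailing k-month simple return, matched on Code and month
period_return <- function(d, k) {
  # shift each price forward by k months so it lines up with the later row
  past <- data.frame(Code = d$Code, ym = d$ym + k, PastPrice = d$Price)
  m <- merge(d, past, by = c("Code", "ym"), all.x = TRUE)
  m$ret <- m$Price / m$PastPrice - 1   # NA where no price exists k months back
  m[order(m$Code, m$ym), c("Date", "Code", "Price", "ret")]
}

monthly   <- period_return(prices, 1)
quarterly <- period_return(prices, 3)
```

Calling period_return(prices, 12) would give yearly returns on the full data set. Codes that appear or disappear over time simply get NA where there is no matching earlier price.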

Related

How to display time of day on a ggplot axis after using SQL UNIX_TIMESTAMP()?

I am working with data returned by a query similar to this:
SELECT UNIX_TIMESTAMP(timestamp) DIV 300 AS period, COUNT(*) as count from tbl
GROUP BY UNIX_TIMESTAMP(timestamp) DIV 300
which is grouping the counts into 5 minute intervals and is then imported into R and looks like this:
set.seed(1)
mydata <- data.frame(period = seq(5391360, 5391647), count = rpois(288, 4))
head(mydata)
## period count
## 1 5391360 3
## 2 5391361 3
## 3 5391362 4
## 4 5391363 7
## 5 5391364 2
## 6 5391365 7
I then plot them like this:
I would now like to plot this with ggplot, where the x axis shows the actual time starting in hourly intervals, 01:00, 02:00 03:00 etc. I have been doing this by piping the data into:
ggplot(aes(y = count, x = period)) + geom_bar(stat = "identity") +
ggtitle("5 min counts") +
theme(plot.title = element_text(lineheight=.8, face="bold", hjust = 0.5),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
which produces this:
However, as mentioned above I would like the x axis to have hourly labels: 01:00, 02:00 etc
In this solution, first I create a vector of datetime values. The vector mydata$period is multiplied by 300 and coerced to class "POSIXct". Then only the hours and minutes are kept.
period <- as.POSIXct(mydata$period*300, origin = "1970-01-01")
period <- format(period, "%H:%M")
library(ggplot2)
ggplot(data = data.frame(period, count = mydata$count),
mapping = aes(period, count)) +
geom_col(position = position_dodge())
To have a plot by hour, instead of keeping the hours and minutes, use format to keep the hours only. But then aggregate the counts by hour.
set.seed(1)
mydata <- data.frame(period = seq(5391360, 5391647), count = rpois(288, 4))
mydata$hour <- as.POSIXct(mydata$period*300, origin = "1970-01-01")
mydata$hour <- format(mydata$hour, "%H")
agg <- aggregate(count ~ hour, mydata, sum)
library(ggplot2)
ggplot(data = agg, aes(hour, count)) +
geom_col(position = position_dodge())
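Another option, if you want to keep the 5 minute bars and only label the axis by the hour, is to leave the x variable as POSIXct and let the scale do the formatting. This is a sketch, not tested against your real data: it assumes the periods should be read as UTC (adjust tz and timezone to your server's time zone), and the column name time is my own.

```r
set.seed(1)
mydata <- data.frame(period = seq(5391360, 5391647), count = rpois(288, 4))

# period counts 300-second blocks since the Unix epoch, so multiply back
mydata$time <- as.POSIXct(mydata$period * 300, origin = "1970-01-01", tz = "UTC")

# Only build the plot when ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mydata, aes(x = time, y = count)) +
    geom_col() +
    # one tick per hour, labelled as HH:MM
    scale_x_datetime(date_breaks = "1 hour", date_labels = "%H:%M",
                     timezone = "UTC") +
    ggtitle("5 min counts") +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
  # print(p) draws the plot
}
```

Because the x axis is continuous here, the bars stay at their true 5 minute positions and no aggregation is needed.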

problem with bootMer CI: upper and lower limits are identical

I'm having the hardest time generating confidence intervals for my glmer poisson model. After following several very helpful tutorials (such as https://drewtyre.rbind.io/classes/nres803/week_12/lab_12/) as well as stackoverflow posts, I keep getting very strange results, i.e. the upper and lower limits of the CI are identical.
Here is a reproducible example containing a response variable called "production," a fixed effect called "Treatment_Num" and a random effect called "Genotype":
df1 <- data.frame(production=c(15,12,10,9,6,8,9,5,3,3,2,1,0,0,0,0), Treatment_Num=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4), Genotype=c(1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2))
library(lme4) # provides glmer() and bootMer()
#run the glmer model
df1_glmer <- glmer(production ~ Treatment_Num +(1|Genotype),
data = df1, family = poisson(link = "log"))
#make an empty data set to predict from, that contains the explanatory variables but no response
require(magrittr)
df_empty <- df1 %>%
tidyr::expand(Treatment_Num, Genotype)
#create new column containing predictions
df_empty$PopPred <- predict(df1_glmer, newdata = df_empty, type="response",re.form = ~0)
#function for bootMer
myFunc_df1_glmer <- function(mm) {
predict(df1_glmer, newdata = df_empty, type="response",re.form=~0)
}
#run bootMer
require(lme4)
merBoot_df1_glmer <- bootMer(df1_glmer, myFunc_df1_glmer, nsim = 10)
#get confidence intervals out of it
predCL <- t(apply(merBoot_df1_glmer$t, MARGIN = 2, FUN = quantile, probs = c(0.025, 0.975)))
#enter lower and upper limits of confidence interval into df_empty
df_empty$lci <- predCL[, 1]
df_empty$uci <- predCL[, 2]
#when viewing df_empty the problem becomes clear: the lci and uci are identical!
df_empty
Any insights you can give me will be much appreciated!
The issue is with the function you created to pass to bootMer(). You wrote:
myFunc_df1_glmer <- function(mm) {
predict(df1_glmer, newdata = df_empty, type="response",re.form=~0)
}
The argument mm should be a fitted model object derived from the bootstrapped data.
However, you don't pass this object to predict(), but rather the original model
object. If you change the function to:
myFunc_df1_glmer <- function(mm) {
predict(mm, newdata = df_empty, type="response",re.form=~0)
#^^ pass in the object created by bootMer
}
then it works:
> df_empty
# A tibble: 8 x 5
Treatment_Num Genotype PopPred lci uci
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 12.9 9.63 15.7
2 1 2 12.9 9.63 15.7
3 2 1 5.09 3.87 5.89
4 2 2 5.09 3.87 5.89
5 3 1 2.01 1.20 2.46
6 3 2 2.01 1.20 2.46
7 4 1 0.796 0.361 1.14
8 4 2 0.796 0.361 1.14
As an aside -- how many genotypes are in your actual data? With fewer than 5-7 you might
do better using a straight-up glm() with genotype as a factor and sum-to-zero
contrasts.
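To see why ignoring the function argument collapses the interval, here is a toy illustration in base R. It is only an analogy: it uses a plain nonparametric bootstrap of a mean rather than bootMer's refitting of the model, and the names boot_ci, stat_wrong and stat_right are made up for this sketch.

```r
set.seed(42)
# production values from the question, used here as a plain numeric sample
x <- c(15, 12, 10, 9, 6, 8, 9, 5, 3, 3, 2, 1, 0, 0, 0, 0)

# Wrong: ignores the resampled data, like calling predict() on the original model
stat_wrong <- function(resampled) mean(x)

# Right: uses the object it is handed, like predict(mm, ...)
stat_right <- function(resampled) mean(resampled)

# Hypothetical helper: percentile interval from nsim bootstrap replicates
boot_ci <- function(stat, data, nsim = 200) {
  reps <- replicate(nsim, stat(sample(data, replace = TRUE)))
  quantile(reps, probs = c(0.025, 0.975))
}

ci_wrong <- boot_ci(stat_wrong, x)  # both limits equal mean(x)
ci_right <- boot_ci(stat_right, x)  # limits differ
```

Every replicate of stat_wrong returns the same number, so any two quantiles of those replicates are identical, which is what happened with lci and uci.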

Join and plot data with different times in 10 minute interval

I have 3 tables in an Access database with the same column names (TempDate and Temp), but different time stamps. The data was collected in 10 minute intervals, but each of the recording devices had different start times. I want to merge these into one table with a single TempDate and one Temp column for each of the tables (temp1, temp2, temp3).
I need help on how to do this in either Access or R. I've started using R with MySQL code but I'm still very new at it. Thanks in advance. Ultimately I want to join this data to another dataframe with a datetime stamp from the same period of dates. I think I can manage that if someone can show me how to tell it to group by an interval. Then finally plot using ggplot
Data
temp1<-data.frame(TempDate=c("2020/08/11 07:13:01","2020/08/11 07:23:01","2020/08/11 07:33:01","2020/08/11 07:43:01"),Temperature=c(1.610,-1.905,-1.905,-0.901))
temp2<-data.frame(TempDate=c("2020/08/11 07:10:01","2020/08/11 07:20:01","2020/08/11 07:30:01","2020/08/11 07:40:01"),Temperature=c(15.641,15.641,15.641,15.641))
temp3<-data.frame(TempDate=c("2020/08/11 07:19:01","2020/08/11 07:29:01","2020/08/11 07:39:01","2020/08/11 07:49:01"),Temperature=c(2.062,3.573,4.076,4.579))
> temp3 #as example
TempDate Temperature
1 2020/08/11 07:19:01 2.062
2 2020/08/11 07:29:01 3.573
3 2020/08/11 07:39:01 4.076
4 2020/08/11 07:49:01 4.579
#what I want: row 1 holds the temps recorded from 07:10:00-07:19:59, etc
TempDate Temp1 Temp2 Temp3
1 2020/08/11 07:10:00 1.610 15.641 2.062
2 2020/08/11 07:20:00 -1.905 15.641 3.573
3 2020/08/11 07:30:00 -1.905 15.641 4.076
4 2020/08/11 07:40:00 -0.901 15.641 4.579
UPDATE:
Thanks to Ben for the great answer to get me started solving this problem. In asking another question, floor_date was suggested. This code worked better for my data than the cut function suggested by @Ben. When using cut I would get times ending in 9 (12:19) instead of 0 (12:10). I also tried TempDate+60 within the cut function, but then some dates would get a time in the next 10 minute interval. The code below was more accurate.
library(dplyr)
library(lubridate)
tempdata<-bind_rows(burrow=burrow,shade=shade,sun=sun,.id='Series') %>%
mutate(TempDate = as.POSIXct(TempDate, tz="UTC"),
TimeStamp = floor_date(TempDate, unit='10 mins'),
TimeStamp = as.POSIXct(TimeStamp, tz="UTC")) %>%
filter(TimeStamp > as.POSIXct("2020-08-12 13:29:00", tz="UTC")) %>%
select(Series, Temperature,TimeStamp) %>%
arrange(TimeStamp)
In R you could do the following, using tidyverse approach.
First, you can use bind_rows to put all your data frames together, and add a source column with the name of data frame those temperatures came from, or destination column in final result.
Then, make sure your TempDate is POSIXct. You can use cut to put your datetimes into 10 minute intervals.
At this point, I would consider leaving the result as is for plotting with ggplot2. It's often preferable to leave in "long" format instead of "wide". However, if you want it in "wide" format, then you can use pivot_wider from tidyr.
library(dplyr)
library(tidyr)
bind_rows(temp1 = temp1, temp2 = temp2, temp3 = temp3, .id = 'source') %>%
mutate(TempDate = as.POSIXct(TempDate),
NewTempDate = cut(TempDate, breaks = "10 min")) %>%
pivot_wider(id_cols = NewTempDate, names_from = source, values_from = Temperature)
Output
NewTempDate temp1 temp2 temp3
<fct> <dbl> <dbl> <dbl>
1 2020-08-11 07:10:00 1.61 15.6 2.06
2 2020-08-11 07:20:00 -1.90 15.6 3.57
3 2020-08-11 07:30:00 -1.90 15.6 4.08
4 2020-08-11 07:40:00 -0.901 15.6 4.58
In Access (VBA), you can round the times down like this:
texttime = "2020/08/11 07:19:01"
truetime = DateValue(texttime) + TimeSerial(Hour(CDate(texttime)), (Minute(CDate(texttime)) \ 10) * 10, 0)
' Result:
' 2020-08-11 07:10:00
However, how to implement this in R, I don't know.
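For what it's worth, the same rounding can be sketched in base R by working on seconds since the epoch: divide by 600, floor, multiply back. This assumes the timestamps can be treated as UTC. The helpers floor_10min and prep are names I made up for the example, and the merge step is just one way to get the wide layout from the question.

```r
temp1 <- data.frame(TempDate = c("2020/08/11 07:13:01", "2020/08/11 07:23:01",
                                 "2020/08/11 07:33:01", "2020/08/11 07:43:01"),
                    Temperature = c(1.610, -1.905, -1.905, -0.901))
temp2 <- data.frame(TempDate = c("2020/08/11 07:10:01", "2020/08/11 07:20:01",
                                 "2020/08/11 07:30:01", "2020/08/11 07:40:01"),
                    Temperature = c(15.641, 15.641, 15.641, 15.641))
temp3 <- data.frame(TempDate = c("2020/08/11 07:19:01", "2020/08/11 07:29:01",
                                 "2020/08/11 07:39:01", "2020/08/11 07:49:01"),
                    Temperature = c(2.062, 3.573, 4.076, 4.579))

# Round a character timestamp down to the start of its 10 minute interval
floor_10min <- function(x) {
  t <- as.POSIXct(x, format = "%Y/%m/%d %H:%M:%S", tz = "UTC")
  as.POSIXct(floor(as.numeric(t) / 600) * 600, origin = "1970-01-01", tz = "UTC")
}

# Floor the timestamps and name the value column after its source table
prep <- function(d, name) {
  out <- data.frame(TempDate = floor_10min(d$TempDate), value = d$Temperature)
  names(out)[2] <- name
  out
}

# Full outer join of the three tables on the floored timestamp
wide <- Reduce(function(a, b) merge(a, b, by = "TempDate", all = TRUE),
               list(prep(temp1, "Temp1"), prep(temp2, "Temp2"), prep(temp3, "Temp3")))
```

With all = TRUE an interval that is missing from one logger shows up as NA in that column instead of dropping the row.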

R: iterate through list of start and end dates and insert into an API request

To save time I would like to iterate through a vector of month start and month end dates and make an API request each time and store the output from each request.
Say we start with a dataframe called dateTable holding the first and last day of the month for the date range:
firstDOM lastDOM
2016-05-01 2016-05-31
2016-06-01 2016-06-30
2016-07-01 2016-07-31
2016-08-01 2016-08-31
2016-09-01 2016-09-30
2016-10-01 2016-10-31
2016-11-01 2016-11-30
2016-12-01 2016-12-31
2017-01-01 2017-01-31
2017-02-01 2017-02-28
2017-03-01 2017-03-31
2017-04-01 2017-04-30
2017-05-01 2017-05-31
2017-06-01 2017-06-30
2017-07-01 2017-07-31
2017-08-01 2017-08-31
I would like to iterate through each row and paste the startDate and endDate into the following REST API request. However, I keep getting the error below when running this piece of code, and I am not sure what's causing it:
for (i in 1:nrow(dateTable)) {
startDate <- dateTable$firstDOM
endDate <- dateTable$lastDOM
#Obtain the Volume of Mentions by Day using declared specs from above
qryMen <- GET(paste("https://newapi.brandwatch.com/projects/", projId, dataSpec
, "?queryId=", queryId, "&startDate=", startDate, "&endDate=", endDate
, '&pageSize=', pageSize, "&access_token=", accessToken$access_token, sep = ""))
}
#Error
Error: length(url) == 1 is not TRUE
Any help would be greatly appreciated!
Currently you are passing the entire vector in your for loop with each iteration and not indexing by the loop variable, i:
for (i in 1:nrow(dateTable)) {
startDate <- dateTable$firstDOM[[i]]
endDate <- dateTable$lastDOM[[i]]
...
}
Nonetheless, consider Map (or the equivalent mapply(..., SIMPLIFY=FALSE)) to iterate elementwise through the two columns. With this approach you can save a large list of objects (whatever your query returns) with the number of elements equal to the number of rows of dateTable. You can then use this list for further operations.
api_fct <- function(startDate, endDate) {
qryMen <- GET(paste0("https://newapi.brandwatch.com/projects/", projId, dataSpec
, "?queryId=", queryId, "&startDate=", startDate, "&endDate=", endDate
, '&pageSize=', pageSize, "&access_token=", accessToken$access_token))
}
api_list <- Map(api_fct, dateTable$firstDOM, dateTable$lastDOM)
# api_list <- mapply(api_fct, dateTable$firstDOM, dateTable$lastDOM, SIMPLIFY=FALSE)
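To check the elementwise behaviour without hitting the API, you can build just the URLs first. The sketch below is offline and illustrative only: the base URL https://api.example.com/mentions and the helper build_url are placeholders, not the real Brandwatch endpoint, and the dates are stored as character strings for simplicity.

```r
dateTable <- data.frame(
  firstDOM = c("2016-05-01", "2016-06-01", "2016-07-01"),
  lastDOM  = c("2016-05-31", "2016-06-30", "2016-07-31"),
  stringsAsFactors = FALSE
)

# Hypothetical URL builder; the endpoint is a placeholder
build_url <- function(startDate, endDate) {
  paste0("https://api.example.com/mentions",
         "?startDate=", startDate, "&endDate=", endDate)
}

# Passing whole columns gives one URL per row, all at once.
# A vector like this is what triggers "length(url) == 1 is not TRUE".
all_at_once <- build_url(dateTable$firstDOM, dateTable$lastDOM)

# Map calls build_url once per row, so each result is a single URL
urls <- Map(build_url, dateTable$firstDOM, dateTable$lastDOM)
```

Once each element of urls is a length-one string, swapping build_url for a function that wraps the request in GET() should return one response per row.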
A couple of things: your for loop isn't actually doing anything. You say for i in ... but you never reference i again. And there's no reason to put the startDate and endDate in the loop. Also, it'd help if you posted some sample data so that we can attempt to recreate what you are doing.
Anyway, the error is telling you what is wrong: you can't pass a vector of URLs to GET. Take everything you passed to GET() and just paste it into the console. You'll get back n URLs, n being the number of rows in your dateTable.
I'm assuming your R objects that you pass to GET (other than startDate and endDate) don't change? If that's the case, and you want to use a loop, you can preallocate a vector of the same length as the data you expect to return, then loop through your startDate and endDate, passing them into GET() and slotting them into your qryMen object.
library(httr) # provides GET()
startDate <- dateTable$firstDOM
endDate <- dateTable$lastDOM
qryMen <- vector(mode = "list", length = nrow(dateTable))
for (i in seq_len(nrow(dateTable))) {
# use [[i]] so the whole response object is stored in slot i
qryMen[[i]] <- GET(paste("https://newapi.brandwatch.com/projects/", projId,
dataSpec, "?queryId=", queryId,
"&startDate=", startDate[i],
"&endDate=", endDate[i],
"&pageSize=", pageSize,
"&access_token=", accessToken$access_token, sep = ""))
}

Regression by year and companyID to save coefficients

I am trying to run regressions by companyID and year, and save the coefficients for each firm-year model as new variables in a new column right beside the other columns. There is an additional wrinkle: I have panel data for 1990-2010 and want to run each regression using t to t-4 only (i.e., for 2001, use only 1998-2001 years of data, and for 1990, only the data of 1990, and so on). I am new to using foreach loops and I found some prior code on the web. I have tried to adapt it to my situation, but there are two issues:
the output is staying blank
I have not figured out how to use the rolling four year data periods.
Here is the code I tried. Any suggestions would be much appreciated.
use paneldata.dta // the dataset I am working in
generate coeff . //empty variable for coefficient
foreach x of local levels {
forval z = 1990/2010
{
capture reg excess_returns excess_market
replace coeff = _b[fyear] & _b[CompanyID] if e(sample) }
}
So below is a short snapshot of what the data looks like;
CompanyID Re_Rf Rm_Rf Year
10 2 2 1990
10 3 2 1991
15 3 2 1991
15 4 2 1992
15 5 2 1993
21 4 2 1990
21 4 2 1991
34 3 1 1990
34 3 1 1991
34 4 1 1992
34 2 1 1993
34 3 1 1994
34 4 1 1995
34 2 1 1996

Re_Rf = excess_returns
Rm_Rf = excess_market
I want to run the following regression:
reg excess_returns excess_market
There is a good discussion on Statalist, but I think this answer may be helpful for learning about loops and how Stata syntax works.
The code I would use is as follows:
generate coeff = . //empty variable for coefficient
// put the values of CompanyID into a local macro called levels
qui levelsof CompanyID, local(levels)
foreach co of local levels {
forval yr = 1994/2010 {
// run the regression with the condition that year is between yr
// and yr-3 (which is what you write in your example)
// and the CompanyID is the same as in the regression
qui reg Re_Rf Rm_Rf if fyear <= `yr' & fyear >= `yr'-3 & CompanyID== `co'
// now replace coeff with the coefficient on Rm_Rf under the same
// conditions as above, but only for year yr
replace coeff = _b[Rm_Rf] if fyear == `yr' & CompanyID == `co'
}
}
This is a potentially dangerous thing to do if you do not have a balanced panel. If you are worried about this, there may be a way to deal with it using capture or changing the fyear loop to include something like:
levelsof fyear if CompanyID == `co', local(yr_level)
foreach yr of local yr_level { ...
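If it helps to see the windowing logic outside Stata, here is the same idea sketched in R. It is not a translation of your do-file, just an illustration on a made-up panel: the data values are invented, the window is yr-3 to yr as in the loop above, and looping only over the years each company actually has is my way of handling an unbalanced panel.

```r
# Made-up panel: two companies, five years each
panel <- data.frame(
  CompanyID = rep(c(10, 34), each = 5),
  fyear     = rep(1990:1994, times = 2),
  Rm_Rf     = c(1, 2, 4, 3, 5, 2, 1, 3, 5, 4)
)
# Invented returns with a known slope per company (2 and 0.5)
panel$Re_Rf <- ifelse(panel$CompanyID == 10,
                      1 + 2 * panel$Rm_Rf,
                      0.5 * panel$Rm_Rf)

panel$coeff <- NA_real_
for (co in unique(panel$CompanyID)) {
  # only the years this company actually has
  for (yr in panel$fyear[panel$CompanyID == co]) {
    win <- panel[panel$CompanyID == co &
                 panel$fyear >= yr - 3 & panel$fyear <= yr, ]
    # a slope needs at least two observations in the window
    if (nrow(win) >= 2) {
      fit <- lm(Re_Rf ~ Rm_Rf, data = win)
      panel$coeff[panel$CompanyID == co & panel$fyear == yr] <- coef(fit)[["Rm_Rf"]]
    }
  }
}
```

Firm-years with too few observations in the window keep a missing coefficient, which is roughly what capture would give you in the Stata loop.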