Regression by year and companyID to save coefficients - regression

I am trying to run regressions by companyID and year, and save the coefficients for each firm-year model as new variables in a new column right beside the other columns. There is an additional wrinkle: I have panel data for 1990-2010 and want to run each regression using t to t-4 only (i.e., for 2001, use only 1998-2001 years of data, and for 1990 only the data of 1990, and so on). I am new to using foreach loops and I found some prior coding on the web. I have tried to adapt it to my situation, but I have two issues:
the output is staying blank
I have not figured out how to use the rolling four year data periods.
Here is the code I tried. Any suggestions would be much appreciated.
use paneldata.dta // the dataset I am working in
generate coeff . //empty variable for coefficient
foreach x of local levels {
forval z = 1990/2010
{
capture reg excess_returns excess_market
replace coeff = _b[fyear] & _b[CompanyID] if e(sample) }
}
Below is a short snapshot of what the data looks like:
CompanyID Re_Rf Rm-Rf Year
10 2 2 1990 
10 3 2 1991 
15 3 2 1991 
15 4 2 1992
15 5 2 1993 
21 4 2 1990 
21 4 2 1991 
34 3 1 1990 
34 3 1 1991
34 4 1 1992
34 2 1 1993  
34 3 1 1994
34 4 1 1995
34 2 1 1996   
 
Re_Rf = excess_returns 
Rm_Rf = excess_market 
I want to run the following regression:
reg excess_returns excess_market

There is a good discussion on Statalist, but I think this answer may be helpful for your learning about loops and how Stata syntax works.
The code I would use is as follows:
generate coeff = . //empty variable for coefficient
// put the values of CompanyID into a local macro called levels
qui levelsof CompanyID, local(levels)
foreach co of local levels {
forval yr = 1994/2010 {
// run the regression with the condition that year is between yr-3
// and yr (which is what you write in your example)
// and that CompanyID matches the firm in the current iteration
qui reg Re_Rf Rm_Rf if fyear <= `yr' & fyear >= `yr'-3 & CompanyID== `co'
// now replace coeff with the coefficient on Rm_Rf, using the same
// firm condition as above, but only for year yr
replace coeff = _b[Rm_Rf] if fyear == `yr' & CompanyID == `co'
}
}
This is a potentially dangerous thing to do if you do not have a balanced panel, because reg stops with an error for any firm-year window that has no observations. If you are worried about this, there may be a way to deal with it using capture or changing the fyear loop to include something like:
levelsof fyear if CompanyID == `co', local(yr_level)
foreach yr of local yr_level { ...
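For readers more at home in R, the same rolling-window logic can be sketched as follows. This is only an illustration under stated assumptions: the toy data values are made up, the column names mirror the Stata code above, and the window runs from yr-3 to yr as in that code. Looping over the rows that actually exist sidesteps the unbalanced-panel problem, and windows with fewer than two observations are left as NA.

```r
# Toy panel (made-up values) with the same column names as the Stata code
panel <- data.frame(
  CompanyID = rep(c(10, 34), each = 6),
  fyear     = rep(1990:1995, times = 2),
  Rm_Rf     = c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6),
  Re_Rf     = c(2, 4, 6, 8, 10, 12, 1, 3, 2, 5, 4, 6)
)
panel$coeff <- NA_real_

# Visit only the firm-years that exist in the data
for (i in seq_len(nrow(panel))) {
  co <- panel$CompanyID[i]
  yr <- panel$fyear[i]
  # Observations of this firm in the window yr-3 .. yr
  win <- panel[panel$CompanyID == co &
               panel$fyear >= yr - 3 & panel$fyear <= yr, ]
  # A slope needs at least two observations; otherwise coeff stays NA
  if (nrow(win) >= 2) {
    fit <- lm(Re_Rf ~ Rm_Rf, data = win)
    panel$coeff[i] <- unname(coef(fit)["Rm_Rf"])
  }
}
print(panel)
```

In the toy data, firm 10 has Re_Rf equal to exactly twice Rm_Rf, so every window with at least two observations gives a slope of 2, which makes the result easy to check by eye.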

Related

Join and plot data with different times in 10 minute interval

I have 3 tables in an Access database with the same column names (TempDate and Temp), but different time stamps. The data was collected in 10 minute intervals, but each of the recording devices had different start times. I want to merge these into one table with a single TempDate and one Temp column for each of the tables (temp1, temp2, temp3).
I need help on how to do this in either Access or R. I've started using R with MySQL code but I'm still very new at it. Thanks in advance. Ultimately I want to join this data to another dataframe with a datetime stamp from the same period of dates. I think I can manage that if someone can show me how to tell it to group by an interval. Finally, I want to plot the result using ggplot.
Data
temp1<-data.frame(TempDate=c("2020/08/11 07:13:01","2020/08/11 07:23:01","2020/08/11 07:33:01","2020/08/11 07:43:01"),Temperature=c(1.610,-1.905,-1.905,-0.901))
temp2<-data.frame(TempDate=c("2020/08/11 07:10:01","2020/08/11 07:20:01","2020/08/11 07:30:01","2020/08/11 07:40:01"),Temperature=c(15.641,15.641,15.641,15.641))
temp3<-data.frame(TempDate=c("2020/08/11 07:19:01","2020/08/11 07:29:01","2020/08/11 07:39:01","2020/08/11 07:49:01"),Temperature=c(2.062,3.573,4.076,4.579))
> temp3 #as example
TempDate Temperature
1 2020/08/11 07:19:01 2.062
2 2020/08/11 07:29:01 3.573
3 2020/08/11 07:39:01 4.076
4 2020/08/11 07:49:01 4.579
# what I want: row 1 holds the temps recorded from 07:10:00-07:19:59, etc.
>
TempDate Temp1 Temp2 Temp3
1 2020/08/11 07:10:00 1.610 15.641 2.062
2 2020/08/11 07:20:00 -1.905 15.641 3.573
3 2020/08/11 07:30:00 -1.905 15.641 4.076
4 2020/08/11 07:40:00 -0.901 15.641 4.579
UPDATE:
Thanks to Ben for the great answer to get me started solving this problem. In asking another question, floor_date was suggested. This code worked better for my data than the cut function suggested by @Ben. When using cut I would get times ending in 9 (12:19) instead of 0 (12:10). I also tried TempDate+60 within the cut function, but then some dates would get a time in the next 10 minute interval. The code below was more accurate.
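library(dplyr)      # bind_rows, mutate, filter, select, arrange, %>%
library(lubridate)  # floor_date
# burrow, shade and sun are my three temperature data frames
tempdata <- bind_rows(burrow = burrow, shade = shade, sun = sun, .id = 'Series') %>%
  mutate(TempDate = as.POSIXct(TempDate, tz = "UTC"),
         # round each timestamp down to the start of its 10-minute interval
         TimeStamp = floor_date(TempDate, unit = '10 mins'),
         TimeStamp = as.POSIXct(TimeStamp, tz = "UTC")) %>%
  filter(TimeStamp > as.POSIXct("2020-08-12 13:29:00", tz = "UTC")) %>%
  select(Series, Temperature, TimeStamp) %>%
  arrange(TimeStamp)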
In R you could do the following, using a tidyverse approach.
First, you can use bind_rows to put all your data frames together and add a source column holding the name of the data frame each temperature came from; these names become the column names in the final result.
Then, make sure your TempDate is POSIXct. You can use cut to put your datetimes into 10 minute intervals.
At this point, I would consider leaving the result as is for plotting with ggplot2. It's often preferable to leave in "long" format instead of "wide". However, if you want it in "wide" format, then you can use pivot_wider from tidyr.
library(dplyr)
library(tidyr)
bind_rows(temp1 = temp1, temp2 = temp2, temp3 = temp3, .id = 'source') %>%
mutate(TempDate = as.POSIXct(TempDate),
NewTempDate = cut(TempDate, breaks = "10 min")) %>%
pivot_wider(id_cols = NewTempDate, names_from = source, values_from = Temperature)
Output
NewTempDate temp1 temp2 temp3
<fct> <dbl> <dbl> <dbl>
1 2020-08-11 07:10:00 1.61 15.6 2.06
2 2020-08-11 07:20:00 -1.90 15.6 3.57
3 2020-08-11 07:30:00 -1.90 15.6 4.08
4 2020-08-11 07:40:00 -0.901 15.6 4.58
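One caveat with cut(): with breaks = "10 min" the bins start at the earliest timestamp truncated to the whole minute, not at a round 10-minute mark. The output above lines up on :10, :20 and so on only because the earliest reading (07:10:01, in temp2) happens to fall just after a round mark. A small sketch using temp3's timestamps on their own; the tz = "UTC" setting is an assumption made so the result does not depend on the local time zone:

```r
# temp3's timestamps on their own (tz = "UTC" is an assumption)
x <- as.POSIXct(c("2020/08/11 07:19:01", "2020/08/11 07:29:01",
                  "2020/08/11 07:39:01", "2020/08/11 07:49:01"),
                tz = "UTC")

# Bin into 10-minute intervals, as in the answer above
bins <- cut(x, breaks = "10 min")

# Start of the first bin: the minute of the earliest reading (07:19),
# not the round mark 07:10
first_bin <- as.POSIXct(levels(bins)[1], tz = "UTC")
print(first_bin)
```

If the bins need to sit on round marks regardless of the first reading, rounding each timestamp down individually avoids this dependence on the data.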
In Access (VBA), you can round the times down like this:
texttime = "2020/08/11 07:19:01"
truetime = DateValue(texttime) + TimeSerial(Hour(CDate(texttime)), (Minute(CDate(texttime)) \ 10) * 10, 0)
' Result:
' 2020-08-11 07:10:00
However, how to implement this in R, I don't know.
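For comparison, the same rounding can be sketched in base R by working on the number of seconds since the epoch: divide by 600 (10 minutes), round down, and multiply back. The helper name floor_10min and the tz = "UTC" setting are assumptions made for this illustration.

```r
# Hypothetical helper: round a POSIXct down to the start of its
# 10-minute interval (assumes the times are handled in UTC)
floor_10min <- function(t) {
  secs <- as.numeric(t)  # seconds since 1970-01-01 00:00:00 UTC
  as.POSIXct(floor(secs / 600) * 600, origin = "1970-01-01", tz = "UTC")
}

texttime <- "2020/08/11 07:19:01"
truetime <- floor_10min(as.POSIXct(texttime, tz = "UTC"))
print(format(truetime, "%Y-%m-%d %H:%M:%S"))
```

Because each timestamp is rounded on its own, the result does not depend on which reading happens to come first in the data.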

Undefined columns selected using panelvar package

Has anyone used panelvar in R?
I'm currently using the R package panelvar, and I'm getting this error:
Error in `[.data.frame`(data, , c(colnames(data)[panel_identifier], required_vars)) :
undefined columns selected
And my syntax currently is:
model1<-pvargmm(
dependent_vars = c("Change.."),
lags = 2,
exog_vars = c("Price"),
transformation = "fd",
data = base1,
panel_identifier = c("id", "t"),
steps = c("twostep"),
system_instruments = FALSE,
max_instr_dependent_vars = 99,
min_instr_dependent_vars = 2L,
collapse = FALSE)
I don't know why my panel_identifier is not working. It is pretty similar to the example given by the panelvar package, yet it doesn't work. I want to point out that base1 is in data.frame format. Any ideas? Also, my data is structured like this:
head(base1)
id t country DDMMYY month month_text day Date_txt year Price Open
1 1 1296 China 1-4-2020 4 Apr 1 Apr 01 2020 12588.24 12614.82
2 1 1295 China 31-3-2020 3 Mar 31 Mar 31 2020 12614.82 12597.61
High Low Vol. Change..
1 12775.83 12570.32 NA -0.0021
2 12737.28 12583.05 NA 0.0014
thanks in advance !
Check the documentation of the package and the SSRN paper. For me it helped to ensure that the formats of my input match those in the package's example (you can check this with the str(base1) command). For example, they write:
library(panelvar)
data("Dahlberg")
ex1_dahlberg_data <-
pvargmm(dependent_vars = .......
When I look at it I get
>str(Dahlberg)
'data.frame': 2385 obs. of 5 variables:
$ id : Factor w/ 265 levels "114","115","120",..: 1 1 1 1 1 1 1 1 1 2 ...
$ year : Factor w/ 9 levels "1979","1980",..: 1 2 3 4 5 6 7 8 9 1 ...
$ expenditures: num 0.023 0.0266 0.0273 0.0289 0.0226 ...
$ revenues : num 0.0182 0.0209 0.0211 0.0234 0.018 ...
$ grants : num 0.00544 0.00573 0.00566 0.00589 0.00559 ...
For example, the input data must be a plain data.frame (in my case it had additional classes such as tibble or data.table). I resolved it by calling as.data.frame() on it.
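Since the error message says "undefined columns selected", another quick check is whether every column name passed to pvargmm actually exists in the data once it is a plain data.frame. A minimal sketch, using a made-up two-row stand-in for base1 (the real data has more columns); it is a diagnostic only and does not call panelvar:

```r
# Made-up stand-in for base1 with only the columns used in the model
base1 <- data.frame(id = c(1, 1),
                    t = c(1296, 1295),
                    Price = c(12588.24, 12614.82),
                    `Change..` = c(-0.0021, 0.0014),
                    check.names = FALSE)

# Drop any extra classes (tibble, data.table) as suggested above
base1 <- as.data.frame(base1)

# Column names used in the pvargmm call from the question
wanted <- c("id", "t", "Change..", "Price")

# Any name listed here is missing from the data
missing_cols <- setdiff(wanted, names(base1))
print(missing_cols)

# Column types, for comparison with str(Dahlberg)
print(sapply(base1[wanted], class))
```

An empty result for missing_cols rules out a simple naming mismatch, which leaves the column types and the class of the object as the things to compare against the package example.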

Scraping embeded html table in R

I am fairly new to scraping/parsing HTML in R. I am trying to get data from the Career Receiving Statistics and Career Rushing Statistics tables from http://totalfootballstats.com/PlayerWR.asp?id=1218565.
I know about the readHTMLTable function, but both of these tables are embedded in so much junk that I can't seem to get past the children nodes of the root.
EDIT: the above problem has been solved. However for the website http://www.sports-reference.com/cfb/players/a-index.html I am trying to loop through all players and access their data. I'm running into trouble in accessing their respective url links. I have tried:
fb=htmlParse("http://www.sports-reference.com/cfb/players/a-index.html")
p1=getNodeSet(fb,'//pre')
con = textConnection(xmlValue(p1[[100]]))
players100 = read.table(con)
But this results in the error "Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 3 did not have 5 elements"
The other thing I tried is:
links <- xpathSApply(fb, "//a/@href")
But I feel like there should be a better way to do this?
Well here's the same player from a different website, much much cleaner. The data doesn't match though, so someone got it wrong. My money's on totalfootballstats.com. Choose your resources wisely!
readHTMLTable(
"http://www.sports-reference.com/cfb/players/doyle-aaron-1.html"
)
# $receiving
# Year School Conf Class Pos G Rec Yds Avg TD Att Yds Avg TD Plays Yds Avg TD
# 1 1988 Miami (FL) Ind WR 11 1 12 12.0 0 1 34 34.0 0 2 46 23.0 0
# 2 1989 Miami (FL) Ind WR 11 8 93 11.6 1 8 93 11.6 1
# $kick_ret
# Year School Conf Class Pos G Ret Yds Avg TD Ret Yds Avg TD
# 1 1988 Miami (FL) Ind WR 11 1 8 8.0 0
# 2 1989 Miami (FL) Ind WR 11
For specific requests, it looks like you can construct a valid URL like this, which will also construct the paths for multiple players at once.
## base URI
u <- "http://www.sports-reference.com"
## player first and last names
first <- "bill"
last <- "adams"
## use sprintf() to make all the paths at once
fullPath <- sprintf("%s/cfb/players/%s-%s-1.html", u, first, last)
## read the table - I think you'll need to loop readHTMLTable() though
readHTMLTable(fullPath)
# $receiving
# Year School Conf Class Pos G Rec Yds Avg TD Att Yds Avg TD Plays Yds Avg TD
# 1 1969 Dayton Ind WR 10 1 3 3.0 1 1 3 3.0 1
# 2 1970 Dayton Ind WR 10 4 42 10.5 1 4 42 10.5 1
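Because sprintf() is vectorised, passing vectors of names yields one URL per player. A short sketch following the URL pattern used above; the second player name is made up, and whether these pages exist on the site is not checked here. The readHTMLTable() loop is left as a comment because it needs the XML package and network access.

```r
# Base URI, as above
u <- "http://www.sports-reference.com"

# Player first and last names (the second pair is hypothetical)
first <- c("bill", "john")
last  <- c("adams", "smith")

# sprintf() recycles over the vectors and returns one path per player
fullPaths <- sprintf("%s/cfb/players/%s-%s-1.html", u, first, last)
print(fullPaths)

# Not run here: read the tables for each player in turn
# tables <- lapply(fullPaths, XML::readHTMLTable)
```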

octave: using find() on cell array {} subscript and assigning it to another cell array

This is an example in Section 6.3.1 Comma Separated Lists Generated from Cell Arrays of the Octave documentation (I browsed it through the doc command on the Octave prompt) which I don't quite understand.
in{1} = [10, 20, 30, 40, 50, 60, 70, 80, 90];
in{2} = inf;
in{3} = "last";
in{4} = "first";
out = cell(4, 1);
[out{1:3}] = find(in{1 : 3}); % line which I do not understand
So at the end of this section, we have in looking like:
in =
{
[1,1] =
10 20 30 40 50 60 70 80 90
[1,2] = Inf
[1,3] = last
[1,4] = first
}
and out looking like:
out =
{
[1,1] =
1 1 1 1 1 1 1 1 1
[2,1] =
1 2 3 4 5 6 7 8 9
[3,1] =
10 20 30 40 50 60 70 80 90
[4,1] = [](0x0)
}
Here, find is called with 3 output parameters (forgive me if I'm wrong on calling them output parameters, I am pretty new to Octave) from [out{1:3}], which represents the first 3 empty cells of the cell array out.
When I run find(in{1 : 3}) with 3 output parameters, as in:
[i,j,k] = find(in{1 : 3})
I get:
i = 1 1 1 1 1 1 1 1 1
j = 1 2 3 4 5 6 7 8 9
k = 10 20 30 40 50 60 70 80 90
which kind of explains why out looks like it does, but when I execute in{1:3}, I get:
ans = 10 20 30 40 50 60 70 80 90
ans = Inf
ans = last
which are the 1st to 3rd elements of the in cell array.
My question is: Why does find(in{1 : 3}) drop off the 2nd and 3rd entries in the comma separated list for in{1 : 3}?
Thank you.
The documentation for find should help you answer your question:
When called with 3 output arguments, find returns the row and column indices of non-zero elements (that's your i and j) and a vector containing the non-zero values (that's your k). That explains the 3 output arguments, but not why it only considers in{1}. To answer that you need to look at what happens when you pass 3 input arguments to find as in find (x, n, direction):
If three inputs are given, direction should be one of "first" or
"last", requesting only the first or last n indices, respectively.
However, the indices are always returned in ascending order.
So in{1} is your x (your data, if you like), in{2} is how many indices find should return (all of them in your case, since in{2} = Inf) and in{3} is whether find should return the first or last indices of the vector in{1} (last in your case).

Use R or mysql to calculate time period returns?

I'm trying to calculate various time period returns (monthly, quarterly, yearly etc.) for each unique member (identified by Code in the example below) of a data set. The data set will contain monthly pricing information for a 20 year period for approximately 500 stocks. An example of the data is below:
Date Code Price Dividend
1 2005-01-31 xyz 1000.00 20.0
2 2005-01-31 abc 1.00 0.1
3 2005-02-28 xyz 1030.00 20.0
4 2005-02-28 abc 1.01 0.1
5 2005-03-31 xyz 1071.20 20.0
6 2005-03-31 abc 1.03 0.1
7 2005-04-30 xyz 1124.76 20.0
I am fairly new to R, but thought that there would be a more efficient solution than looping through each Code and then each Date as shown here:
uniqueDates <- unique(data$Date)
uniqueCodes <- unique(data$Code)
for (date in uniqueDates) {
for (code in uniqueCodes) {
nextDate <- seq.Date(from=stock_data$Date[i], by="3 months",length.out=2)[2]
curPrice <- data$Price[data$Date == date]
futPrice <- data$Price[data$Date == nextDate]
data$ret[(data$Date == date) & (data$Code == code)] <- (futPrice/curPrice)-1
}
}
This method in itself has an issue in that seq.Date does not always return the final day in the month.
Unfortunately the data is not uniform (the number of companies/codes varies over time) so using a simple row offset won't work. The calculation must match the Code and Date with the desired date offset.
I had initially tried selecting the future dates by using the seq.Date function
data$ret = (data[(data$Date == (seq.Date(from = data$Date, by="3 month", length.out=2)[2])), "Price"] / data$Price) - 1
But this generated an error as seq.Date requires a single entry.
> Error in seq.Date(from = stock_data$Date, by = "3 month", length.out =
> 2) : 'from' must be of length 1
I thought that R would be well suited to this type of calculation but perhaps not. Since all the data is in a mysql database I am now thinking that it might be faster/easier to do this calc directly in the database.
Any suggestions would be greatly appreciated.
Load data:
tc='
Date Code Price Dividend
2005-01-31 xyz 1000.00 20.0
2005-01-31 abc 1.00 0.1
2005-02-28 xyz 1030.00 20.0
2005-02-28 abc 1.01 0.1
2005-03-31 xyz 1071.20 20.0
2005-03-31 abc 1.03 0.1
2005-04-30 xyz 1124.76 20.0'
df = read.table(text=tc,header=T)
df$Date=as.Date(df$Date,"%Y-%m-%d")
First I would organize the data by date:
library(plyr)
pp1=reshape(df,timevar='Code',idvar='Date',direction='wide')
Then you would like to obtain monthly, quarterly, yearly, etc returns.
For that there are several options. One is to convert the data to zoo or xts class, e.g.:
library(xts)
pp1[2:ncol(pp1)] = as.xts(pp1[2:ncol(pp1)],order.by=pp1$Date)
#let's create a function for calculating returns.
rets<-function(x,lag=1){
return(diff(log(x),lag))
}
Since this database is monthly, the lags for the returns will be:
monthly = 1, quarterly = 3, yearly = 12. For instance, let's calculate the monthly return
for xyz.
lagged=1 #for monthly
This calculates Monthly returns for xyz
pp1$returns_xyz= c(NA,rets(pp1$Price.xyz,lagged))
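As a quick check of how the lag argument maps to the return horizon, here is a self-contained sketch using the four xyz prices from the question. rets is the function defined above, repeated so the block runs on its own; the variable names are chosen for this illustration.

```r
# Log return over `lag` periods, as defined above
rets <- function(x, lag = 1) {
  diff(log(x), lag)
}

# The four monthly xyz prices from the question
price_xyz <- c(1000.00, 1030.00, 1071.20, 1124.76)

monthly   <- rets(price_xyz, 1)  # three one-month log returns
quarterly <- rets(price_xyz, 3)  # one three-month log return

# Log returns add up across periods, so the quarterly return
# equals the sum of the three monthly returns
print(all.equal(quarterly, sum(monthly)))

# Convert log returns back to simple returns (futPrice / curPrice - 1)
simple_monthly <- exp(monthly) - 1
print(simple_monthly)
```

The last step ties this back to the question, which defined the return as (futPrice/curPrice) - 1: log returns are convenient for summing across periods, and exp(r) - 1 recovers the simple return.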
To get all the returns:
#create matrix of returns
pricelist= ls(pp1)[grep('Price',ls(pp1))]
returnsmatrix = data.frame(matrix(rep(0,(nrow(pp1)-1)*length(pricelist)),ncol=length(pricelist)))
j=1
for(i in pricelist){
n = which(names(pp1) == i)
returnsmatrix[,j] = rets(pp1[,n],1)
j=j+1
}
#column names
codename= gsub("Price.", "", pricelist, fixed = TRUE)
names(returnsmatrix)=paste('ret',codename,sep='.')
returnsmatrix
You can do this very easily with the quantmod and xts packages. Using the data in AndresT's answer:
library(quantmod) # loads xts too
pp1 <- reshape(df,timevar='Code',idvar='Date',direction='wide')
# create an xts object
x <- xts(pp1[,-1], pp1[,1])
# only get the "Price.*" columns
p <- getPrice(x)
# run the periodReturn function on each column
r <- apply(p, 2, periodReturn, period="monthly", type="log")
# merge prior result into a multi-column object
r <- do.call(merge, r)
# rename columns
names(r) <- paste("monthly.return",
sapply(strsplit(names(p),"\\."), "[", 2), sep=".")
Which leaves you with an r xts object containing:
monthly.return.xyz monthly.return.abc
2005-01-31 0.00000000 0.000000000
2005-02-28 0.02955880 0.009950331
2005-03-31 0.03922071 0.019608471
2005-04-30 0.04879016 NA