Linear Regression of Subset depending on specific date - regression

I'm currently doing a regression analysis for every company on the Norwegian stock market, where I regress the stock returns for each company against a benchmark. The period is 2009-2018. I've managed to do the regression for each company over the whole period, but I also want to do a regression for each month for every company.
The original dataset consists of 26,000 observations, which I've converted into subsets by company, for a total of 390 elements (companies).
What I've done so far is shown below:
# one data frame per company
data_subset <- split(data, data$Name)
# regress each company's return on the benchmark return
data_lm <- lapply(data_subset, function(d) lm(CompanyReturn ~ DJReturn, data = d))
data_coef <- lapply(data_lm, coef)
# collect the intercept and slope for every company in one table
data_tabell <- matrix(0, length(data_subset), 2)
for (i in seq_along(data_subset)) {
  data_tabell[i, ] <- coef(data_lm[[i]])
}
colnames(data_tabell) <- c("Intercept", "Coefficient")
rownames(data_tabell) <- names(data_subset)
Does anyone know how I can specify that I want to run the regression for a company over a specific period only, for example for each year or each month for every company?
Thank you in advance for the help!
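One way to sketch this, assuming the data also contain a Date column (the column name Date and its format are assumptions; Name, CompanyReturn and DJReturn are taken from the code above), is to build a year-month label and split on both company and month before fitting, so every list element is one company-month:

# the Date column and its format are assumptions, not from the original code
data$Date  <- as.Date(data$Date)               # e.g. "2009-01-30"
data$Month <- format(data$Date, "%Y-%m")       # year-month label, e.g. "2009-01"

# one data frame per company-month combination
data_subset_m <- split(data, list(data$Name, data$Month), drop = TRUE)

# run the benchmark regression within each company-month
data_lm_m <- lapply(data_subset_m, function(d) lm(CompanyReturn ~ DJReturn, data = d))

# collect intercept and slope per company-month
data_tabell_m <- t(sapply(data_lm_m, coef))
colnames(data_tabell_m) <- c("Intercept", "Coefficient")

For yearly regressions, use format(data$Date, "%Y") as the grouping label instead.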

Related

Create linear predictions with Stata

I want to run a regression of profits against time and see whether there is a change in the last 6 months. To do so, I want to estimate the regression on the data before July and then get predictions for the whole year, assessing whether they differ from the actual values.
I use:
areg profits date if date<td(30,6,2021), a(company) vce(cluster date)
predict profits_reg
However, the predictions are only generated for the first 6 months; I don't get predictions for the days from July 1st onwards, those cells are just empty (.).
How can I ensure that I get predictions for all the data in my data set?
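As a side note, the general workflow described here (estimate on the pre-July subsample, then predict for every row) can be sketched as follows. This is plain R rather than Stata, and the data frame and column names are invented purely for illustration:

# invented example data: daily profits for two companies over 2021
set.seed(1)
profits_data <- data.frame(
  date    = seq(as.Date("2021-01-01"), as.Date("2021-12-31"), by = "day"),
  company = rep(c("A", "B"), length.out = 365)
)
profits_data$profits <- rnorm(365, mean = 100, sd = 5)

# fit only on observations before 30 June 2021
fit <- lm(profits ~ date + company,
          data = subset(profits_data, date < as.Date("2021-06-30")))

# predicting with newdata = the full data set fills in the later days as well
profits_data$profits_reg <- predict(fit, newdata = profits_data)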

Is there a way to combine these variables in a way that makes sense?

Hello Stack Overflow community!
I am a sociology student working on a thesis project comparing home value appreciation and neighborhood racial composition over time.
I'm currently using two separate data sources and trying to combine them in a way that makes sense without aggregating anything.
The first data source is GIS data which has information on home sales in each year by home. The second is census data which has yearly estimates of racial composition by census tract. Both are in .csv format.
My goal is to create a set of variables for each home row in the GIS data which represent the racial composition of the tract the home is in during the year it was sold (e.g. home 1 | 2010 | $500,000 | Census tract 10 | 10% white).
I began doing this by going into Stata and using the following strategy:
For example, if I'm looking at a home sold in 2010 in Census tract 10 and I find that this tract was 10% white in 2010, I use something like:
replace percentwhite = 10 if censustract==10 & year==2010
However, this seemed incredibly time-consuming, as I'm using data that go back decades and cover a couple dozen Census tracts.
Does anyone have any suggestions on how I might do this smarter, not harder? My first thought was to aggregate the data by census tract and year, but I was hoping to avoid that if possible. Thank you so much in advance for your help, and have a terrific day and start to the new year!
It sounds like you can simply merge census data onto your GIS data. That will be much less painful than using -replace-. Here's an example:
*GIS data: information on home sales in each year by home
clear
input censustract house_id year house_value_k
10 100 2010 200
11 101 2020 500
11 102 1980 100
end
tempfile GIS_data
sa `GIS_data'
*census data: yearly estimates of racial composition by census tract
clear
input censustract year percentwhite
10 2010 20
10 2000 10
11 2010 25
11 2000 5
end
tempfile census_data
sa `census_data'
*easy method: merge the census data onto your GIS data
use `GIS_data', clear
mer m:1 censustract year using `census_data'
drop if _merge==2
list
*hard method: use -replace-
use `GIS_data', clear
gen percentwhite=.
replace percentwhite=20 if censustract==10 & year==2010
replace percentwhite=10 if censustract==10 & year==2000
replace percentwhite=25 if censustract==11 & year==2010
replace percentwhite=5 if censustract==11 & year==2000
list
Both methods "work", but using -merge- is much easier and less prone to errors.
Note: I intentionally created the data sets so that the merge wouldn't be perfect. You will likely want to drop some of the observations in that case; in the code above I dropped the observations where _merge==2.

Small sample size for a regression marketing model

I have sales, advertising spend, and price data for 10 brands in the same industry from 2013-2018. I want to develop an equation to predict 2019 sales.
The variables I have (price & ad spend by type) are: PricePerUnit, Magazine, News, Outdoor, Broadcasting, Print.
My confusion is that I am not sure whether to run the regression using only 2018 data, with 2018 sales as the target variable and an additional variable like Past_2Years_Sales (2016-17) alongside the price & ad spend variables above (for clarity, refer to the image of the data). With this type of data I will have a sample size of only 10, as there are only 10 brands, which I think is too low for linear regression to give reliable results.
The second option (which would increase the sample size) I can see is to treat brand+year as an observation instead of just brand, which would increase my sample size to 60: e.g. Brand A has 6 observations (A-2013, A-2014, ..., A-2018), Brand B has B-2013, B-2014, ..., B-2018, and so on for the 10 brands (refer to the image for the data layout).
Is the second option a valid way to run the regression? What is the right way to run a regression in such small-sample situations?
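A rough sketch of the second option in R, with invented numbers standing in for the real data (the variable names and values below are assumptions, not taken from the poster's file): each row is one brand-year, and a brand factor absorbs brand-level differences.

# invented brand-year panel: 10 brands x 6 years = 60 rows
set.seed(42)
panel <- expand.grid(brand = LETTERS[1:10], year = 2013:2018)
panel$price    <- runif(nrow(panel), 5, 15)
panel$magazine <- runif(nrow(panel), 0, 100)
panel$outdoor  <- runif(nrow(panel), 0, 100)
panel$sales    <- 50 + 2 * panel$price + 0.3 * panel$magazine + rnorm(nrow(panel), 0, 5)

# pooled regression across all brand-year rows, with brand fixed effects
fit <- lm(sales ~ price + magazine + outdoor + factor(brand), data = panel)
summary(fit)

One caveat with this layout: the 60 rows are repeated observations of the same 10 brands, so they are not independent, and plain lm() standard errors will tend to be optimistic.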

How to get multiple coefficients from a panel data set

I have a data set with the following variables: ID of an individual, current year, year of graduation, degree, income, and a 0/1 variable indicating treatment. The income is for the same year as the year variable.
What I want is to regress current income on treatment for every possible combination of year, year of graduation, and degree.
That means running multiple different regressions, which will give me multiple coefficients.
I have no clue how to do this. I would normally just use:
reg income treatment
But this will not give me multiple coefficients.
Try something like this (shown with the built-in auto data set; substitute your own variables in by() and in the regression):
sysuse auto
statsby, by(foreign rep78): regress mpg weight

Cumulative Frequency Tables and Chart Output

I'm working with some rather large time-series data sets related to futures prices and am in the process of converting some calculations, which I previously did in Excel, to R. The conversion has been relatively straightforward so far, but I'm having a bit of trouble replicating my histograms with their cumulative frequency distributions in R as I had them in Excel. If you're familiar with Excel, the Histogram function in the Data Analysis ToolPak automatically creates a cumulative frequency distribution table with the cumulative percentage of each, in this case, price level, next to the histogram.
I've had some success creating basic histograms using ggplot; here is a snippet of that code:
ggplot(data = CrudeRaw, aes(x = X7_1_F)) +
  geom_histogram(breaks = seq(X7_F_M_L, X7_F_M_H, by = 0.01),
                 col = "blue",
                 fill = "white",
                 alpha = 0.2) +
  labs(title = "X7 1 Month Price Distribution",
       x = "Price Levels",
       y = "Frequency") +
  xlim(c(X7_F_M_L, X7_F_M_H)) +
  ylim(c(0, 100))
Several questions regarding formatting and usage follow; a rough sketch touching on several of them is included after the list.
a) CrudeRaw is a data frame which contains roughly 276 rows and no fewer than 50 columns. For the purposes of this project I've chopped the data into 20-period, 60-period, 120-period, 180-period, and 240-period subsets. The data are in chronological order by date.
Question(s): ggplot() expects a data frame rather than a bare numeric vector, so I can only feed it the entire data frame even though I am interested in creating distributions for the aforementioned subsets. Is there a way that I can still do this?
b) How do I get every bin (price) to show up on the x-axis, rather than a label marking every 5 bins (-15, -10, -5, 0, 5, ..., 15)?
c) I've successfully created a cumulative frequency table using the following code,
round(cbind(cumsum(table(X7_F)))/NROW(X7_F),2)
But I'd like a way to either output each of these tables (of which there are many) to a CSV file or, ideally, create a "report" of sorts with R which can be saved to a PDF, or perhaps even place the table inside the histogram it is associated with.
d) I've done some searching on how to output data to a CSV file, but it wasn't clear from the examples I went over how I could output multiple arrays to the same sheet or workbook en masse. That is, I would like to output my 20-, 60-, 120-, 180-, and 240-period arrays of prices to the same workbook. I'm thinking that by creating another data frame I could then pass these subsets of the data to the ggplot function, as I mentioned I was having trouble doing in part a).
e) Lastly (for now), how do I overlay the CFD onto my histograms?
Please advise if you require any additional information or colour in order to help me, and many thanks in advance for your responses!
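A rough sketch of how some of these pieces might fit together in R, using invented stand-in data (the column name X7_1_F and the 0.01 bin width come from the post; the example prices, the output file name, and everything else are assumptions):

library(ggplot2)

# stand-in for CrudeRaw: 276 chronologically ordered prices (invented data)
set.seed(1)
CrudeRaw <- data.frame(Date   = seq(as.Date("2017-01-02"), by = "day", length.out = 276),
                       X7_1_F = cumsum(rnorm(276, 0, 0.05)))

# (a) a subset of rows is still a data frame, so ggplot accepts it directly,
#     e.g. the most recent 60 periods:
Crude60 <- tail(CrudeRaw, 60)

# common bin edges over the full price range
binwidth <- 0.01
breaks   <- seq(floor(min(CrudeRaw$X7_1_F) / binwidth) * binwidth,
                ceiling(max(CrudeRaw$X7_1_F) / binwidth) * binwidth,
                by = binwidth)

# (b) scale_x_continuous(breaks = breaks) labels every bin edge;
# (e) stat_ecdf() rescaled to the count axis overlays the cumulative distribution
p <- ggplot(Crude60, aes(x = X7_1_F)) +
  geom_histogram(breaks = breaks, col = "blue", fill = "white", alpha = 0.2) +
  stat_ecdf(aes(y = after_stat(y) * nrow(Crude60)), colour = "red") +
  scale_x_continuous(breaks = breaks) +
  labs(title = "X7 1 Month Price Distribution (last 60 periods)",
       x = "Price Levels", y = "Frequency")
print(p)  # draw the plot

# (c)/(d) cumulative frequency tables for several window lengths,
#         written side by side into a single CSV file
windows    <- c(20, 60, 120, 180, 240)
cum_tables <- lapply(windows, function(n) {
  x <- tail(CrudeRaw$X7_1_F, n)
  round(cumsum(table(cut(x, breaks))) / length(x), 2)
})
names(cum_tables) <- paste0("last_", windows)
write.csv(do.call(cbind, cum_tables), "cumulative_frequencies.csv")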