I want to run a regression of profits against time and see whether there is a change in the last 6 months. To do so, I want to fit the regression on the data before July and then generate estimates for the whole year, assessing whether they differ from the actual values.
I use:
areg profits date if date<td(30,6,2021), a(company) vce(cluster date)
predict profits_reg
However, the predictions are only generated for the first 6 months; I don't get predictions for the days from July 1st onwards: those cells are just empty (.).
How can I ensure that I get predictions for all data in my data set?
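One common workaround (a sketch, not necessarily the only fix): include the company dummies explicitly with regress instead of absorbing them with areg, so the full linear prediction is recoverable out of sample. This assumes company is numeric (encode it first if it is a string):
* fit on the pre-July sample only, with explicit company dummies
regress profits date i.company if date < td(30,6,2021), vce(cluster date)
* predict is not limited to the estimation sample; this fills in
* fitted values for every observation with nonmissing regressors
predict profits_reg_full, xb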
Basically, I have a time series dataset and want to predict the price 14 days from now. There are dozens of features, and everything is being trained with TemporalFusionTransformer (PyTorch version).
Originally, I trained the model to predict the price for every day of the next 14 days; however, this gave fairly poor results. So I want to switch to predicting ONLY the price 14 days from now. For a conventional dataset I would just lag the price variable 14 days (dataset['price_lagged'] = dataset['price'].shift(-13), with max_prediction_length set to 1 to predict the 14th day) and set this as the target variable (see the sketch below).
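A minimal sketch of that lagging step (assuming dataset is the pandas DataFrame from the question; the shift is done per group so prices are not leaked across series):
# sort so the shift moves values along the time axis within each series
dataset = dataset.sort_values(["group_id", "day"])
# target for day t becomes the price 13 rows ahead; with
# max_prediction_length=1 the model then forecasts the 14th day
dataset["price_lagged"] = dataset.groupby("group_id")["price"].shift(-13)
# tail rows of each series have no future price to point at
dataset = dataset.dropna(subset=["price_lagged"])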
Current code that still predicts well (and it shouldn't):
training = TimeSeriesDataSet(
    dataset[lambda x: x.day <= training_cutoff],
    time_idx="day",
    target="price_lagged",
    group_ids=["group_id"],
    min_encoder_length=max_encoder_length,
    max_encoder_length=max_encoder_length,
    max_prediction_length=max_prediction_length,
    static_categoricals=[],
    static_reals=[],
    time_varying_known_categoricals=[],
    time_varying_unknown_reals=[
        # I removed EVERYTHING from here
    ],
    time_varying_unknown_categoricals=[],
    time_varying_known_reals=[],
    target_normalizer=EncoderNormalizer(),
    # lags=lags,
    add_relative_time_idx=True,
    allow_missing_timesteps=False,
    add_target_scales=True,
)
However, the model seems to use the target as an input feature as well. I can see this because even when I literally remove every other feature, I still get a fairly good prediction. How can I explicitly make the model NOT use the target as an (obvious) feature? Can TFT work like this?
I am trying to present time series from multiple sensors on a single SSRS (v14) line chart.
I need to plot N series, each scaled independently to the space provided by the chart (i.e., each with its own independent vertical axis).
More about the data
There can be anywhere from ~1-10 series
The challenge is that they are different orders of magnitude.
One might be degrees F (~0-212)
One might be Carbon ppm (~1-16)
One might be Ftlbs Thrust (~10k-100k)
The point is, they have no relation to each other and can be very different.
The exact value is not important; I can hide the vertical axis.
More about what I am trying to do
The idea is to show the multiple time series plotted together against time for the 4 hours before and after 'an event'. It's not necessarily the exact value that is important; the subject matter expert would be looking for something odd (temperature falls, thrust spikes, etc.).
Things I have tried
If there were just 2 series, I could easily use the 2nd axis available in the SSRS chart. That's exactly the idea I am chasing, but in this case I want each of the N series to plot using its own axis.
I have tried stacking N transparent charts on top of each other. This would be a really ugly solution, but SSRS won't even let you do it: it unstacks them for you.
I have experimented with the Allow Scale Breaks property on the vertical axis. This would solve the problem, but we don't like the 'double jagged line'.
Turning on the Logarithmic scale is a possibility, and it does a better job of displaying all the data, but it's not really what we want: it changes the shape of data that ranges over a couple of orders of magnitude.
I tried the sparkline component and am having the same problem.
This approach is essentially the same as Greg's answer above. I've had to do this same process in the past, comparing trends even though the units were dissimilar.
I took a very simple approach: adding an additional column to the query that shows each value as a proportion of the maximum value in its series.
As an example (just 2 series here for clarity), I started with data like this in myTable:
Series  Month  myValue
A       Jan    4
A       Feb    8
A       Mar    16
B       Jan    200
B       Feb    300
B       Mar    400
My dataset query would be something like:
SELECT *, myValue * 1.0 / MAX(myValue) OVER(PARTITION BY Series) AS myPlotValue FROM myTable
This gives us a final dataset which looks like this:
Series  Month  myValue  myPlotValue
A       Jan    4        0.25
A       Feb    8        0.5
A       Mar    16       1
B       Jan    200      0.5
B       Feb    300      0.75
B       Mar    400      1
As you can see, all plot values are now between 0 and 1.
I created the charts using the myPlotValue field and had the option of using the original values from the myValue field as data point labels.
After talking to some math people, I learned that this is a standard problem, solved by a process called normalization of the data.
Essentially, you are rescaling all the series to fit in a given range (usually 0-1).
You can scale and add an offset if that makes sense for your problem domain somehow.
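For the full 0-1 (min-max) normalization, a query in the same style as the one above would look something like this (a sketch against the same hypothetical myTable; the NULLIF guards against a series whose values are all equal):
-- min-max normalization per series: 0 at the series minimum, 1 at its maximum
SELECT *,
       (myValue - MIN(myValue) OVER(PARTITION BY Series)) * 1.0
       / NULLIF(MAX(myValue) OVER(PARTITION BY Series)
                - MIN(myValue) OVER(PARTITION BY Series), 0) AS myNormValue
FROM myTable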
https://www.statisticshowto.datasciencecentral.com/normalized/
I have sales, advertising spend, and price data for 10 brands in the same industry from 2013-2018. I want to develop an equation to predict 2019 sales.
The variables I have (price & ad spend by type) are: PricePerUnit, Magazine, News, Outdoor, Broadcasting, Print.
My confusion is that I am not sure whether to run the regression using only 2018 data, with 2018 sales as the target variable, adding an additional variable like Past_2Years_Sales (2016-17) to the price & ad spend variables above (for clarity, refer to the image of the data). With this type of data I will have a sample size of only 10, as there are only 10 brands, which I think is too low for linear regression to give correct results.
The second option (which would increase the sample size) is that, instead of treating each brand as one observation, I take brand+year as an observation, which increases my sample size to 60: e.g. Brand A has 6 observations (A-2013, A-2014, ..., A-2018), B has B-2013, B-2014, ..., B-2018, and so on for the 10 brands (refer to the image for the data).
Is the second option a valid way to run the regression? What is the right way to run a regression in such small-sample situations?
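For reference, the pooled (second-option) specification would look something like the following, with $b$ indexing brands and $t$ years; the lagged-sales term is one illustrative way to carry past-sales information, and the exact variable set is a placeholder:
$$\text{Sales}_{b,t} = \beta_0 + \beta_1\,\text{PricePerUnit}_{b,t} + \sum_{j} \gamma_j\,\text{AdSpend}_{j,b,t} + \delta\,\text{Sales}_{b,t-1} + \varepsilon_{b,t}$$
where $j$ ranges over the ad types (Magazine, News, Outdoor, Broadcasting, Print).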
I'm currently doing a regression analysis for every company on the Norwegian stock market, where I regress the stock returns of each company against a benchmark. The period is 2009-2018. I've managed to do the regression for each company over the whole period, but I also want to do a regression for each month for every company.
The original dataset consists of 26,000 observations, which I've converted into subsets for a total of 390 elements (companies).
What I've done so far is shown below:
# split the data into one subset per company
data_subset <- by(data, data$Name, subset)
# fit one regression per company
data_lm <- lapply(data_subset, function(d) lm(CompanyReturn ~ DJReturn, data = d))
data_coef <- lapply(data_lm, coef)
# collect the intercept and slope of each company into a matrix
data_tabell <- matrix(0, length(data_subset), 2)
for (i in 1:length(data_subset)) {
  data_tabell[i, ] <- coef(data_lm[[i]])
}
colnames(data_tabell) <- c("Intercept", "Coefficient")
rownames(data_tabell) <- names(data_subset)
Does anyone know how I can specify that I want to run a regression for each company over only a specific period, for example each year or each month for every company?
Thank you in advance for the help!
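A minimal sketch of one way to get that per-period split, assuming the data frame has a Date column named Date (adjust to your actual column; company-month groups with too few observations may need to be dropped first):
# derive a year-month key, then split by company and month together
data$Month <- format(data$Date, "%Y-%m")
data_lm_monthly <- by(data, list(data$Name, data$Month),
                      function(d) lm(CompanyReturn ~ DJReturn, data = d))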
I have a data set with the following variables: ID of an individual, current year, year of graduation, degree, income, and a 0/1 variable indicating treatment. Income is measured in the same year as the year variable.
What I want is to regress current income on treatment for every possible combination of year, year of graduation, and degree.
That means running multiple different regressions, which will give me multiple coefficients.
I have no clue how to do so. I would normally just use:
reg income treatment
But this will not give me multiple coefficients.
Try something like this:
sysuse auto
statsby, by(foreign rep78): regress mpg weight
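Applied to the setup in the question, the same pattern would look something like this (year, gradyear, and degree are hypothetical placeholders for the actual grouping variables):
statsby, by(year gradyear degree): regress income treatment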