Small sample size for a regression marketing model - regression

I have sales, advertising spend and price data for 10 brands in the same industry from 2013 to 2018. I want to develop an equation to predict 2019 sales.
The variables I have are price and ad spend by type: PricePerUnit, Magazine, News, Outdoor, Broadcasting, Print.
My confusion is that I am not sure whether to run the regression using only 2018 data, with 2018 sales as the target variable and an additional variable such as Past_2Yeas_Sales (2016-17) added to the price and ad spend variables above (for clarity, refer to the image of the data). With this type of data I will have a sample size of only 10, as there are only 10 brands. I think this is too low for linear regression to give reliable results.
The second option, which would increase the sample size, is to take brand+year as an observation instead of a brand. This increases my sample size to 60: for example, Brand A has 6 observations (A-2013, A-2014, A-2015, ..., A-2018), Brand B has B-2013, B-2014, ..., B-2018, and so on for the 10 brands (refer to the image for the data).
Is the second option a valid way to run the regression? What is the right way to run a regression in such small-sample situations?
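As a rough illustration of the second option, the sketch below reshapes per-brand yearly figures into one row per brand+year. The numbers and the helper name to_brand_year_rows are made up for illustration and are not taken from the question's data.

```python
# Toy, made-up sales figures used only to show the reshaping step.
wide = {
    "A": {2016: 100.0, 2017: 110.0, 2018: 125.0},
    "B": {2016: 80.0, 2017: 78.0, 2018: 90.0},
}

def to_brand_year_rows(sales_by_brand):
    """Turn {brand: {year: sales}} into a list of brand+year observations."""
    rows = []
    for brand in sorted(sales_by_brand):
        for year in sorted(sales_by_brand[brand]):
            rows.append({"brand": brand, "year": year,
                         "sales": sales_by_brand[brand][year]})
    return rows

rows = to_brand_year_rows(wide)
print(len(rows))  # 2 brands x 3 years = 6 observations
```

With 10 brands and 6 years the same reshaping gives 60 rows. One caveat to keep in mind: the rows belonging to one brand are repeated measurements of the same unit, so they are not independent in the way 60 separate brands would be.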

Related

How to analyze panel data, by decade and firm id

So here's the background.
I have panel data covering 1972-2020.
I have 1 dependent variable (inflation) and 5 independent variables: imp, exp, tra, gro, pro.
I have 10 countries, 5 small and 5 big ones, numbered 1-5 and 6-10.
I want to show the effect of the independent variables on the dependent variable, sorted by decade (70, 80, 90, 00, 10) for each country, so that the output tells us how the regression coefficients differ for each country in each decade.

Is it possible that the number of basis functions is more than the number of observations in spline regression?

I want to fit a regression spline with a B-spline basis. The data is structured in such a way that the number of observations is less than the number of basis functions, and I get a good result.
But I'm not sure whether this is correct.
Do I need to have more rows than columns, as in linear regression?
Thank you.
When the number of observations, N, is small, it's easy to fit a model with basis functions that has a low squared error. If you have at least as many basis functions as observations, you can get zero residuals (a perfect fit to the data). But that fit is not to be trusted, because it may not be representative of new data points. So yes, you want to have more observations than columns. Mathematically, you cannot uniquely estimate more than N coefficients by least squares: the design matrix cannot have rank above N, so the extra columns are collinear with the others. As a rule of thumb, 15-20 observations are usually needed for each additional variable or spline term.
But, this isn't always the case, such as in genetics when we have hundreds of thousands of potential variables and small sample size. In that case, we turn to tools that help with a small sample size, such as cross validation and bootstrap.
Bootstrap (i.e. resample with replacement) your data points and refit the splines many times (100 will probably do). Then average the fitted splines and use the averages as the final spline functions. Or you could do cross-validation, where you train on a smaller part of the dataset (e.g. 70%) and then test on the remaining part.
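The resample, refit, average loop described above can be sketched as follows. To stay dependency-free, this sketch fits a straight line as a stand-in for the spline fit; the function names fit_line and bootstrap_average_fit are hypothetical, and in practice the fitting step would be replaced by the spline fit.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def bootstrap_average_fit(xs, ys, n_boot=100, seed=0):
    """Refit on n_boot resamples (with replacement) and average the coefficients."""
    rng = random.Random(seed)
    n = len(xs)
    fits = []
    while len(fits) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # degenerate resample: a line cannot be fit to one x value
        fits.append(fit_line(bx, by))
    a = sum(f[0] for f in fits) / len(fits)
    b = sum(f[1] for f in fits) / len(fits)
    return a, b

# Toy data lying exactly on y = 2 + 3x, so every refit recovers the same line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 + 3.0 * x for x in xs]
a, b = bootstrap_average_fit(xs, ys)
```

The spread of the individual bootstrap fits around their average also gives a rough sense of how unstable the fit is at the given sample size.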
In the functional data analysis framework, there are packages in R that construct and fit spline bases (such as cubic, B, etc). These packages include refund, fda, and fda.usc.
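For example,
library(mgcv)  # provides smooth.construct.cc.smooth.spec
B <- smooth.construct.cc.smooth.spec(object = list(term = "day.t", bs.dim = 12, fixed = FALSE, dim = 1, p.order = NA, by = NA), data = list(day.t = 200:320), knots = list())
constructs a spline basis with bs.dim = 12 (over time, day.t). Note that the cc constructor in mgcv builds a cyclic cubic regression spline basis rather than a B-spline basis; the corresponding B-spline constructor is smooth.construct.bs.smooth.spec. You can also use these packages to help choose a basis dimension.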

Negative binomial regression SPSS - Quantity vs Distance

I have quite a simple dataset of quantities of litter found in a national park located on an island. For each data point I have corresponding GPS coordinates, and I've derived the distance of each point to the shore. My aim: observe if the quantities of litter increase or decrease with the distance to shore. I'm assuming that quantities of litter will increase with a decrease in distance, as litter is commonly found on beaches etc.
Quantities of litter are counts, i.e. not normally distributed. Additionally, I've tested whether the data follows a Poisson model and it does not (p-value < 0.05), and the variance is larger than the mean for each variable (quantity and distance), so the data appears overdispersed. Therefore, I went on to use a negative binomial regression, with the following output:
Omnibus test is highly significant (p=0.000). I was just slightly puzzled on the parameter estimates, and generally hoping that this approach makes sense. Any input much appreciated.
Interpreting the parameter estimates requires knowing the link function specified, which would be a log link if you specified your model as a negative binomial with log link on the Type of Model tab, but could be something else if you specified a custom model using a negative binomial distribution with another link (which could be identity, negative binomial, or power, instead).
If it's a log link, then for a distance of 0 (at the shore), you predict exp(2.636) for the count, or about 13.957. For a given distance from the shore, multiply the distance by -0.042, add that to the 2.636 value, then take the exponential of the result. So for every unit you move away from the shore, the log of the prediction decreases by 0.042, and the prediction is multiplied by about 0.959. One unit away, you predict about 13.383 for the count; two units away, about 12.833; and so on. So the results are in general accord with your hypothesis. Different calculations would be required if you used a different link function.
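The arithmetic above can be checked with a short sketch. It assumes a log link and uses the two estimates quoted in the answer (intercept 2.636, distance coefficient -0.042); the names INTERCEPT, SLOPE and predicted_count are chosen here for illustration.

```python
import math

INTERCEPT = 2.636   # estimate quoted in the answer
SLOPE = -0.042      # distance coefficient quoted in the answer

def predicted_count(distance):
    """Predicted litter count under a log link: exp(intercept + slope * distance)."""
    return math.exp(INTERCEPT + SLOPE * distance)

# Multiplicative change in the predicted count per unit of distance.
rate_ratio = math.exp(SLOPE)

for d in (0, 1, 2):
    print(d, round(predicted_count(d), 3))
print("ratio per unit:", round(rate_ratio, 3))
```

Each extra unit of distance multiplies the prediction by the same factor, which is why the decline is proportional rather than a fixed number of items per unit.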

Linear Regression of Subset depending on specific date

I'm currently doing a regression analysis for every company on the Norwegian stock market, where I regress the stock returns of each company against a benchmark. The period is 2009-2018. I've managed to run the regression for each company over the whole period, but I also want to run a regression for each month for every company.
The original dataset consists of 26000 observations, which I've converted into subsets with a total of 390 elements (companies).
What I've done so far is shown below:
# One data frame per company
data_subset <- by(data, data$Name, subset)
# Fit CompanyReturn ~ DJReturn separately for each company
data_lm <- lapply(data_subset, function(d) lm(CompanyReturn ~ DJReturn, data = d))
data_coef <- lapply(data_lm, coef)
# Collect intercept and slope of each fit into a matrix
data_tabell <- matrix(0, length(data_subset), 2)
for (i in seq_along(data_subset)) {
  data_tabell[i, ] <- coef(data_lm[[i]])
}
colnames(data_tabell) <- c("Intercept", "Coefficient")
rownames(data_tabell) <- names(data_subset)
Does anyone know how I can specify that I only want to run the regression for a company over a specific period, for example each year or each month, for every company?
Thank you in advance for the help!
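One possible approach, sketched here in Python rather than the R used in the question, is to group the rows by company and by period before fitting. The sketch assumes each row carries a Date string in YYYY-MM-DD form, which the question does not show; the helper names ols and regress_by_company_and_period are hypothetical.

```python
from collections import defaultdict

def ols(xs, ys):
    """Least-squares (intercept, slope) for y = a + b*x, or None if not estimable."""
    if len(set(xs)) < 2:
        return None  # a line needs at least two distinct x values
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def regress_by_company_and_period(rows, period_len=7):
    """Group by (Name, Date prefix): period_len=7 gives 'YYYY-MM', 4 gives 'YYYY'."""
    groups = defaultdict(list)
    for r in rows:
        key = (r["Name"], r["Date"][:period_len])
        groups[key].append((r["DJReturn"], r["CompanyReturn"]))
    results = {}
    for key, pairs in groups.items():
        fit = ols([p[0] for p in pairs], [p[1] for p in pairs])
        if fit is not None:
            results[key] = fit
    return results

# Toy rows (made-up numbers) where CompanyReturn = 0.01 + 2 * DJReturn.
toy = [
    {"Name": "X", "Date": f"2018-01-{day:02d}", "DJReturn": dj,
     "CompanyReturn": 0.01 + 2 * dj}
    for day, dj in [(2, 0.00), (3, 0.01), (4, -0.02), (5, 0.03)]
]
monthly = regress_by_company_and_period(toy)           # keys like ("X", "2018-01")
yearly = regress_by_company_and_period(toy, period_len=4)
```

Groups with fewer than two distinct benchmark returns are skipped, since no slope can be estimated from them; this matters when a company has very few observations in a given month.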

How can we define an RNN - LSTM neural network with multiple output for the input at time "t"?

I am trying to construct an RNN to predict the probability of a player playing the match, along with the runs scored and wickets taken by the player. I would use an LSTM so that performance in the current match influences the player's future selection.
Architecture summary:
Input features: Match details - Venue, teams involved, team batting first
Input samples: Player roster of both teams.
Output:
Discrete: Binary: Did the player play.
Discrete: Wickets taken.
Continuous: Runs scored.
Continuous: Balls bowled.
Question:
Most often an RNN uses "Softmax" or "MSE" in the final layers to process "a" from the LSTM, providing only a single variable "Y" as output. But here there are four dependent variables (2 discrete and 2 continuous). Is it possible to stitch together all four as output variables?
If yes, how do we handle mix of continuous and discrete outputs with loss function?
(Though the output from LSTM "a" has multiple features and carries the information to the next time-slot, we need multiple features at output for training based on the ground-truth)
You just do it. Without more detail on the software (if any) in use, it is hard to give more detail.
The output of the LSTM unit at every time step is one of the hidden layers of your network.
You can then feed it into 4 output layers:
1 sigmoid
2 I'd mess around with this a bit. Maybe 4x sigmoid (4 wickets to an innings, right?) or relu4
3,4 linear (squaring it is also an option, or relu)
For training purposes, your loss function is the sum of your 4 individual losses.
If they were all MSE, you could concatenate your 4 outputs before calculating the loss.
But since the first is cross-entropy (for a decision sigmoid), you'd calculate the losses separately and sum them.
You can still concatenate the outputs afterwards to have an output vector.
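The summed loss described above can be sketched without any particular framework. This is only an illustration for a single player at a single time step: it assumes cross-entropy for the played/not-played output and squared error for the other three, and the dictionary keys and function names are hypothetical.

```python
import math

def binary_cross_entropy(p, y, eps=1e-12):
    """Cross-entropy for a sigmoid output p in (0, 1) and a 0/1 target y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def squared_error(pred, target):
    return (pred - target) ** 2

def combined_loss(pred, target):
    """Sum of the four per-output losses, computed separately."""
    return (binary_cross_entropy(pred["played"], target["played"])
            + squared_error(pred["wickets"], target["wickets"])
            + squared_error(pred["runs"], target["runs"])
            + squared_error(pred["balls"], target["balls"]))

pred = {"played": 0.5, "wickets": 1.0, "runs": 30.0, "balls": 12.0}
target = {"played": 1, "wickets": 2.0, "runs": 30.0, "balls": 12.0}
loss = combined_loss(pred, target)  # ln(2) from the sigmoid output + 1.0 from wickets
```

Because runs and balls are on a larger scale than the 0/1 outcome, an unweighted sum can be dominated by those terms; weighting each term is a common adjustment, though the answer above does not discuss it.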