Best regression model where some fields may be intentionally blank for some samples - regression

I'm looking to build a regression model where I have time based variables that may or may not exist for each data sample.
For instance, let's say we wanted to build a regression model where we could predict how long a new car will last. One of the values is when the car gets its first servicing. However, there are some samples where the car never gets serviced at all. In these situations, how can I account for this when building the model? Can I even use a linear regression model or will I have to choose a different regression model?
When I think about it, this is basically the equivalent of having 2 fields: one for whether the car was serviced and if that is true, a second field for when. But I'm not sure how to build a regression that has data that is intentionally missing.

Apply regression without using time-series. To try to capture seasonality in the data, encode the date/time columns into binary columns (to represent year, day of year, day of the month and day of the week etc.).

Related

Anylogic: How to create an objective function using values of two dataset (for optimization experiment)?

In my Anylogic model I have a population of agents (4 terminals) were trucks arrive at, are being served and depart from. The terminals have two parameters (numberOfGates and servicetime) which influence the departures per hour of trucks leaving the terminals. Now I want to tune these two parameters, so that the amount of departures per hour is closest to reality (I know the actual departures per hour). I already have two datasets within each terminal agent, one with de amount of departures per hour that I simulate, and one with the observedDepartures from the data.
I already compare these two datasets in plots for every terminal:
Now I want to create an optimization experiment to tune the numberOfGates and servicetime of the terminals so that the departure dataset is the closest to the observedDepartures dataset. Does anyone know how to do create a(n) (objective) function for this optimization experiment the easiest way?
When I add a variable diff that is updated every hour by abs( departures - observedDepartures) and put root.diff in the optimization experiment, it gives me the eq(null) is not allowed. Use isNull() instead error, in a line that reads the database for the observedDepartures (see last picture), but it works when I run the simulation normally, it only gives this error when running the optimization experiment (I don't know why).
You can use the absolute value of the sum of the differences for each replication. That is, create a variable that logs the | difference | for each hour, call it diff. Then in the optimization experiment, minimize the value of the sum of that variable. In fact this is close to a typical regression model's objectives. There they use a more complex objective function, by minimizing the sum of the square of the differences.
A Calibration experiment already does (in a more mathematically correct way) what you are trying to do, using the in-built difference function to calculate the 'area between two curves' (which is what the optimisation is trying to minimise). You don't need to calculate differences or anything yourself. (There are two variants of the function to compare either two Data Sets (your case) or a Data Set and a Table Function (useful if your empirical data is not at the same time points as your synthetic simulated data).)
In your case it (the objective function) will need to be a sum of the differences between the empirical and simulated datasets for the 4 terminals (or possibly a weighted sum if the fit for some terminals is considered more important than for others).
So your objective is something like
difference(root.terminals(0).departures, root.terminals(0).observedDepartures)
+ difference(root.terminals(1).departures, root.terminals(1).observedDepartures)
+ difference(root.terminals(2).departures, root.terminals(2).observedDepartures)
+ difference(root.terminals(3).departures, root.terminals(2).observedDepartures)
(It would be better to calculate this for an arbitrary population of terminals in a function but this is the 'raw shape' of the code.)
A Calibration experiment is actually just a wizard which creates an Optimization experiment set up in a particular way (with a UI and all settings/code already created for you), so you can just use that objective in your existing Optimization experiment (but it won't have a built-in useful UI like a Calibration experiment). This also means you can still set this up in the Personal Learning Edition too (which doesn't have the Calibration experiment).

How to deal with different subsampling time in economic datasets for deep learning?

I am building a deep learning model for macro-economic prediction. However, different indicators varies widely when it comes to its subsampling time, ranging from minutes to annually.
Dataframe example
The picture contains the 'Treasury Rates (DGS1-20)' which is sampled daily and 'Inflation Rate(CPALT...)' which is sampled monthly. These features are essential for the model to train and dropping out the NaN rows would result in too little data.
I've read some books and articles about how to deal with missing data that includes down sampling to monthly time frames, swapping the NaNs with -1, filling it with averages between the last and next value etc. But the methods that I read mostly deals with data sets that has a missing value of about 10% of the whole dataset while in this case of mine, the monthly sampled 'Inflation(CPI)' is missing at 90+% if I combine it with the 'Treasury Rate' dataset.
I was wondering if there was any workaround to handle missing values, particularly for economic data where the sampling time gap ranges so widely. Thank you

Making smaller models from pre-existing redundant models

Sorry for the vague title.
I will just start with an example. Say that I have a pre-existing model that can classify dogs, cats and humans. However, all I need is a model that can classify between dogs and cats (Humans are not needed). The pre-existing model is heavy and redundant, so I want to make a smaller, faster model that can just do the job needed.
What approaches exist?
I thought of utilizing knowledge distillation (using the previous model as a teacher and the new model as the student) and training a whole new model.
First, prune the teacher model to have a smaller version to be used as a student in distillation. A simple regime such as magnitude-based pruning will suffice.
For distillation, as your output vectors will not match anymore (the student is 2 dimensional and teacher is 3 dimensional, you will have to take this into account and only calculate this distillation loss based on the overlapping dimensions. An alternative is layer-wise distillation in which the output vectors are irrelevant and the distillation loss is calculated based on the difference between intermediate layers of the teacher and student. In both cases, total loss may include a difference between student output and label, in addition to student output and teacher output.
It is possible for a simple task like this that just basic transfer learning would suffice after pruning - that is just to replace the 3d output vector with a 2d output vector and continue training.

Forecasting using LSTM-RNN Python

I am using LSTM - RNN for Sales forecasting.
I have already trained and validated my model in the ratio of 70-30. I have around 144 rows of data which includes 2 columns (Date (YYYY/MM) and Sales).
I am only using the sales column to predict values.
I need to know whether it is possible to train 100% of your dataset. And If so, how do I predict the immediate future values (Out of Sample Data). What should be given as input for the model? Since I am kinda new to Deep Learning, this might be something easy that I am missing. Any help would be appreciated.

Extract GLM coefficients for one factor only

I am running a series of GLMs for a number of species of the form:
glm.sp<-glm(number~site+as.factor(year)+offset(log(visits)),family=poisson,data=data.sp)
Note that the year term is deliberately a factor as it's not reasonable to assume a linear relationship. The yearly coefficients produced from this model are a measure of the number of each species per year, taking account of the ammount of effort (visits). I then want to extract, exponentiate and index (relative to the last year) the year coefficients and run a GAM on them.
Currently I do this by eyeballing the coefficients and calling them directly:
data.sp.coef$coef<-exp(glm.sp$coefficients[60:77])
However as the number of sites and the number of years recorded for each species are different this means I need to eyeball each species. For example, a different species might have the year coefficients at 51:64. I'd rather not do that, and feel there must be a better way of calling out the coefficients for the years.
I've tried, the below (which doesn't work!)
> coef(glm.sp)["year"]
<NA>
NA
And I also tried saving all the coefficients as a dataframe and using a fuzzy search to extract all the values that contained "year" (the coefficients are automatically saved in the format yearXXXX-YY).
I'm certain I'm missing something simple, so would very much appreciate being proded in the right direction!
Thanks
Matt