Scoring Wizard: Cox regression survival model

I have estimated a Cox regression model. The model predicts the duration of a certain car ownership and depends on income, household size and age.
I would like to run the model on a different database, to see the effect of changing social factors on car ownership. I therefore saved the model outcome in an XML document.
Then I use this document as the input for the Scoring Wizard and click Next.
The next step isn't completely clear to me: it shows that all the variables, dependent and independent, are predictors. But shouldn't the duration variable be a target variable? Now the model will take the censoring variable as the target value (the censoring variable shows whether the change has taken place).
I would like to estimate the new duration. However, I think that the predicted value of this model is the probability of surviving.
My question is thus: how can I make the duration variable the target value, and how should I otherwise interpret the predicted value?
Kind regards,
Marlon
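
For what it's worth, the interpretation question can be illustrated outside SPSS. Below is a minimal sketch using the Python lifelines package (not SPSS; the data and column names are invented) showing that a fitted Cox model is not scored against the duration as a target: its native predictions are survival probabilities, from which a summary such as the median survival time can be read off.

    import pandas as pd
    from lifelines import CoxPHFitter

    # Hypothetical ownership data: 'duration' is the time variable,
    # 'event' is the censoring indicator (1 = car replaced, 0 = censored).
    df = pd.DataFrame({
        "duration": [2.0, 5.5, 3.2, 8.1, 1.4, 6.3],
        "event":    [1,   0,   1,   1,   0,   1],
        "income":   [30,  55,  42,  61,  28,  50],   # in thousands
        "hh_size":  [1,   3,   2,   4,   1,   3],
        "age":      [25,  40,  33,  52,  22,  45],
    })

    # The time and event columns are declared as such; only the remaining
    # columns act as predictors (covariates).
    cph = CoxPHFitter(penalizer=0.1)   # small penalty for this tiny sample
    cph.fit(df, duration_col="duration", event_col="event")

    new_households = df[["income", "hh_size", "age"]]

    # The model's native output is a survival curve, P(T > t), per subject...
    surv = cph.predict_survival_function(new_households)

    # ...and a single "predicted duration" is usually reported as the median
    # survival time: the t at which the survival probability drops below 0.5.
    print(cph.predict_median(new_households))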

Related

Best regression model where some fields may be intentionally blank for some samples

I'm looking to build a regression model where I have time-based variables that may or may not exist for each data sample.
For instance, let's say we wanted to build a regression model where we could predict how long a new car will last. One of the values is when the car gets its first servicing. However, there are some samples where the car never gets serviced at all. In these situations, how can I account for this when building the model? Can I even use a linear regression model or will I have to choose a different regression model?
When I think about it, this is basically the equivalent of having 2 fields: one for whether the car was serviced and, if that is true, a second field for when. But I'm not sure how to build a regression on data that is intentionally missing.
Apply regression without using time series methods. To capture seasonality in the data, encode the date/time columns into binary columns (to represent the year, day of year, day of month, day of week, etc.).
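
A minimal pandas sketch of both ideas (the column names are invented): an explicit indicator for whether the car was serviced at all, plus the date expanded into binary/one-hot columns to capture seasonality.

    import pandas as pd

    df = pd.DataFrame({
        "first_service": pd.to_datetime(
            ["2019-03-02", None, "2020-11-15", None]),
    })

    # Explicit missing-indicator: one field for *whether* it happened...
    df["was_serviced"] = df["first_service"].notna().astype(int)

    # ...and the date parts as features, with a neutral fill for the
    # never-serviced rows.
    df["svc_month"] = df["first_service"].dt.month.fillna(0).astype(int)
    df["svc_dayofweek"] = df["first_service"].dt.dayofweek.fillna(0).astype(int)

    # One-hot (binary) encoding of the month captures seasonality without
    # imposing a linear order on January..December.
    df = pd.get_dummies(df, columns=["svc_month"], prefix="m")
    print(df)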

Questions About Deep Q-Learning

I have read several materials about deep Q-learning and I'm not sure I understand it completely. From what I learned, Deep Q-learning computes Q-values with a neural network rather than storing them in a table: the NN performs a regression, a loss is calculated, and the error is backpropagated to update the weights. Then, in a testing scenario, the NN takes a state and returns a Q-value for each action possible in that state. The action with the highest Q-value is then chosen for that state.
My only question is how the weights are updated. According to this site, the weights are updated as follows:
w <- w + alpha * [R + gamma * max_a' Q(s', a', w) - Q(s, a, w)] * grad_w Q(s, a, w)
I understand that the weights are initialized randomly, R is returned by the environment, and gamma and alpha are set manually, but I don't understand how Q(s', a, w) and Q(s, a, w) are initialized and calculated. Should we build a table of Q-values and update them as in Q-learning, or are they calculated automatically at each NN training epoch? What am I not understanding here? Can somebody explain this equation to me?
In Q-Learning, we are concerned with learning the Q(s, a) function, which maps a state to a value for each action. Say you have an arbitrary state space and an action space of 3 actions; each state then maps to three different values, one per action. In tabular Q-Learning, this is done with a physical table. Consider the following case:
Here, we have a Q table for each state in the game (upper left). And after each time step, the Q value for that specific action is updated according to some reward signal. The reward signal can be discounted by some value between 0 and 1.
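As a minimal sketch of that tabular update in Python (states, actions and reward are invented):

    from collections import defaultdict

    # Q table: maps (state, action) to a value, implicitly 0 at the start.
    Q = defaultdict(float)
    alpha, gamma = 0.1, 0.9    # step size and discount factor

    def q_update(s, a, r, s_next, actions):
        """One tabular Q-learning step for the transition (s, a, r, s_next)."""
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    q_update(s="start", a="right", r=1.0, s_next="goal",
             actions=["left", "right", "up"])
    print(Q[("start", "right")])   # 0.1 after this single update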
In Deep Q-Learning, we disregard the use of tables and create a parametrized "table" such as this:
Here, all of the weights form combinations, given the input, that should approximately match the values seen in the tabular case (this is still actively researched).
The equation you presented is the Q-learning update rule cast as a gradient update rule, where:
alpha is the step size
R is the reward
gamma is the discount factor
You run inference on the network to retrieve the value of the "discounted future state" and subtract the value of the "current" state from it. If this is unclear, I recommend looking up bootstrapping, which is basically what is happening here.
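To make this concrete, here is a minimal sketch of one such gradient step, using a linear Q-function as a stand-in for the full neural network (features and numbers are invented). Note that Q(s', a', w) and Q(s, a, w) are not initialized anywhere: both are simply forward passes through the current weights.

    import numpy as np

    rng = np.random.default_rng(0)
    n_features, n_actions = 4, 3
    W = rng.normal(size=(n_actions, n_features))  # randomly initialized weights
    alpha, gamma = 0.01, 0.99

    def q_values(state, W):
        """Q(s, ., w): one forward pass, returning a value per action."""
        return W @ state

    def dqn_step(s, a, r, s_next, W):
        # Both Q(s, a, w) and Q(s', ., w) come from inference with the
        # *current* weights; there is no table to look them up in.
        target = r + gamma * q_values(s_next, W).max()   # bootstrapped target
        td_error = target - q_values(s, W)[a]
        # For this linear model, the gradient of Q(s, a, w) with respect to
        # the a-th row of W is just the state vector s.
        W[a] += alpha * td_error * s
        return W

    s, s_next = rng.normal(size=n_features), rng.normal(size=n_features)
    W = dqn_step(s, a=1, r=1.0, s_next=s_next, W=W)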

Recurrent NNs: what's the point of parameter sharing? Doesn't padding do the trick anyway?

The following is how I understand the point of parameter sharing in RNNs:
In regular feed-forward neural networks, every input unit is assigned an individual parameter, which means that the number of input units (features) corresponds to the number of parameters to learn. In processing e.g. image data, the number of input units is the same over all training examples (usually a constant: pixel width * pixel height * RGB channels).
However, sequential input data like sentences can come in highly varying lengths, which means that the number of parameters will not be the same depending on which example sentence is processed. That is why parameter sharing is necessary for efficiently processing sequential data: it makes sure that the model always has the same input size regardless of the sequence length, as it is specified in terms of transition from one state to another. It is thus possible to use the same transition function with the same weights (input to hidden weights, hidden to output weights, hidden to hidden weights) at every time step. The big advantage is that it allows generalization to sequence lengths that did not appear in the training set.
My questions are:
Is my understanding of RNNs, as summarized above, correct?
In the actual code example in Keras I looked at for LSTMs, they padded the sentences to equal lengths beforehand. Doesn't this wash away the whole purpose of parameter sharing in RNNs?
Parameter Sharing
Being able to efficiently process sequences of varying length is not the only advantage of parameter sharing. As you said, you can achieve that with padding. The main purpose of parameter sharing is a reduction in the number of parameters that the model has to learn. This is the whole purpose of using an RNN.
If you were to learn a different network for each time step and feed the output of the first model to the second etc., you would end up with a regular feed-forward network. For 20 time steps, you would have 20 models to learn. In convolutional nets, parameters are shared by the convolutional filters because we can assume that similar interesting patterns occur in different regions of the picture (for example a simple edge). This drastically reduces the number of parameters we have to learn. Analogously, in sequence learning we can often assume that there are similar patterns at different time steps. Compare 'Yesterday I ate an apple' and 'I ate an apple yesterday'. These two sentences mean the same, but the 'I ate an apple' part occurs at different time steps. By sharing parameters, you only have to learn what that part means once. Otherwise, you'd have to learn it for every time step at which it could occur in your model.
There is a drawback to sharing the parameters. Because our model applies the same transformation to the input at every time step, it now has to learn a transformation that makes sense for all time steps. So it has to remember which word came at which time step, i.e. 'chocolate milk' should not lead to the same hidden and memory state as 'milk chocolate'. But this drawback is small compared to using a large feed-forward network.
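A minimal NumPy sketch of this sharing (the sizes are arbitrary): the same weight matrices are reused at every time step, so the parameter count does not grow with sequence length.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_h = 5, 8
    W_xh = rng.normal(size=(d_h, d_in))   # input-to-hidden weights
    W_hh = rng.normal(size=(d_h, d_h))    # hidden-to-hidden weights
    b = np.zeros(d_h)

    def rnn_forward(xs):
        """Apply the same transition at every step; any sequence length works."""
        h = np.zeros(d_h)
        for x in xs:                      # 3 steps or 40 steps: same parameters
            h = np.tanh(W_xh @ x + W_hh @ h + b)
        return h

    short_seq = rng.normal(size=(3, d_in))
    long_seq = rng.normal(size=(40, d_in))
    print(rnn_forward(short_seq).shape, rnn_forward(long_seq).shape)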
Padding
As for padding the sequences: the main purpose is not directly to let the model predict sequences of varying length. Like you said, that can be achieved with parameter sharing. Padding is used for efficient training, specifically to keep the number of computational graphs built during training small. Without padding, we have two options for training:
We unroll the model for each training sample. So, when we have a sequence of length 7, we unroll the model to 7 time steps, feed the sequence, do back-propagation through the 7 time steps and update the parameters. This seems intuitive in theory, but in practice it is inefficient, because TensorFlow's computational graphs don't allow recurrence; they are feed-forward, so a fresh graph has to be unrolled for every sample.
The other option is to create the computational graphs before starting training. We let them share the same weights and create one computational graph for every sequence length in our training data. But when our dataset has 30 different sequence lengths, this means 30 different graphs during training; for large models, this is not feasible.
This is why we need padding. We pad all sequences to the same length and then only need to construct one computational graph before training starts. When you have both very short and very long sequence lengths (5 and 100, for example), you can use bucketing and padding: pad the sequences to different bucket lengths, for example [5, 20, 50, 100], and create a computational graph for each bucket. The advantage is that you don't have to pad a sequence of length 5 to length 100, which would waste a lot of time on "learning" the 95 padding tokens in there.
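Here is a short sketch of that bucketing-plus-padding idea using Keras's pad_sequences (the sequences are invented; the bucket boundaries are the [5, 20, 50, 100] from above):

    from collections import defaultdict
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    sequences = [[3, 7, 2], [4, 1, 9, 9, 2, 5, 8], [6] * 60]
    buckets = [5, 20, 50, 100]

    def bucket_length(seq):
        """Smallest bucket that the sequence fits into."""
        return next(b for b in buckets if len(seq) <= b)

    # Group sequences by bucket, then pad each group only to its bucket
    # length, so a length-3 sequence is padded to 5, not to 100.
    grouped = defaultdict(list)
    for seq in sequences:
        grouped[bucket_length(seq)].append(seq)

    batches = {b: pad_sequences(seqs, maxlen=b, padding="post")
               for b, seqs in grouped.items()}
    for b, batch in sorted(batches.items()):
        print(b, batch.shape)   # 5 (1, 5), 20 (1, 20), 100 (1, 100)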

Computation consideration with different Caffe's network topology (difference in number of output)

I would like to use one of Caffe's reference models, i.e. bvlc_reference_caffenet. I found that my target class, i.e. person, is one of the classes included in the ILSVRC dataset that the model has been trained on. As my goal is to classify whether a test image contains a person or not, I may achieve this in one of the following ways:
1. Use inference directly with the 1000 outputs. This doesn't require training/learning.
2. Change the network topology a little bit, with the final FC layer's number of outputs (num_output) set to 2 instead of 1000, and retrain it as a binary classification problem.
My concern is about the computational effort at the deployment/prediction phase (testing). The former looks computationally more expensive than the latter, because during the prediction phase it needs to compute those 1000 output scores to find the one with the highest score. What I'm not sure about is whether there is a heuristic (which I'm not aware of) that simplifies the computation.
Can somebody please help cross-check my understanding of this?
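For a rough sense of scale, here is a back-of-the-envelope sketch in Python (assuming the standard CaffeNet layer sizes, where the last FC layer takes a 4096-dimensional input; the total-cost figure is an order-of-magnitude estimate, not a measured number):

    # Multiply-accumulate operations (MACs) in the final FC layer.
    fc7_dim = 4096                  # input dimension of the last FC layer
    macs_1000 = fc7_dim * 1000      # option 1: ~4.1M MACs
    macs_2 = fc7_dim * 2            # option 2: ~8K MACs

    # A full CaffeNet/AlexNet forward pass costs on the order of 7e8 MACs,
    # so the final layer is a small fraction of the total either way.
    total_macs = 7e8
    print(f"1000-way share of forward pass: {macs_1000 / total_macs:.4%}")
    print(f"2-way share of forward pass:    {macs_2 / total_macs:.4%}")

If that estimate is roughly right, the 1000-way output adds well under 1% to the prediction cost, so the deployment-time saving from retraining with two outputs would be negligible.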

How does SPSS assign factor scores for cases where underlying variables were pairwise deleted?

Here's a simplified example of what I'm trying to figure out from a report. All analyses are being run in SPSS, which I don't have and don't use (my experience is with SAS and R).
They were running a regression to predict overall meal satisfaction from food type ordered, self-reported food flavor, and self-reported food texture.
But food flavor and texture are highly correlated, so they conducted a factor analysis, found food flavor and texture load on one factor, and used the factor scores in the regression.
However, about 40% of respondents don't have responses on self-reported food texture, so they used pairwise deletion while making the factors.
My question is: when SPSS calculates the factor scores and outputs them as new variables in the data set, what does it do with people who are missing one of the variables underlying a factor?
How does it calculate (if it calculates them at all) factor scores for people who had a response pairwise-deleted during the creation of the factors and who therefore have missing data for one of the variables?
Factor scores are a linear combination of their scaled inputs. That is, given normalized variables X_1, ..., X_n and loadings L_i, the factor score is
f = \sum_{i=1}^n L_i X_i
In your case n = 2. Now suppose one of the X_i is missing. How do you take a sum involving a missing value? Clearly, you cannot... unless you impute it.
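A tiny numerical sketch of that point in Python (the loadings and values are invented):

    import numpy as np

    L = np.array([0.8, 0.6])            # loadings for flavor, texture
    x_complete = np.array([1.2, -0.5])  # standardized responses
    x_missing = np.array([1.2, np.nan]) # texture response missing

    print(L @ x_complete)   # a valid factor score: 0.66
    print(L @ x_missing)    # nan: the sum is undefined with a missing input

    # Only after imputing (here: a crude zero/mean fill on the standardized
    # scale) does the linear combination produce a number at all.
    x_imputed = np.where(np.isnan(x_missing), 0.0, x_missing)
    print(L @ x_imputed)    # 0.96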
I don't know what the analyst who created your report did. I suggest you ask them.