I'm trying to do a Bayesian gamma regression with Stan.
I know the canonical link function is the inverse link,
but if I don't use a log link the linear predictor can be negative, and a negative value would then enter the gamma distribution, which obviously isn't possible.
How can I deal with this?
data {
  int<lower=1> N;                       // number of observations
  int<lower=1> K;                       // number of predictors
  matrix[N, K] X;                       // design matrix
  vector<lower=0>[N] y;                 // response
}
parameters {
  vector[K] betas;                      // the regression parameters
  real beta0;                           // the intercept
  real<lower=0.001, upper=100> shape;   // the shape parameter
}
transformed parameters {
  vector<lower=0>[N] eta;               // the expected values (linear predictor)
  vector[N] alpha;                      // shape parameter for the gamma distribution
  vector[N] beta;                       // rate parameter for the gamma distribution
  eta <- beta0 + X * betas;             // using the log link
}
model {
  beta0 ~ normal(0, 2^2);
  for (i in 2:K)
    betas[i] ~ normal(0, 2^2);
  y ~ gamma(shape, shape * eta);
}
I was struggling with this a couple of weeks ago, and while I don't consider this a definitive answer, hopefully it is still helpful. For what it's worth, McCullagh and Nelder directly acknowledge that the canonical link does not respect the support of the mean. They advise that one must constrain the betas to properly match the support. Here's the relevant passage:
The canonical link function yields sufficient statistics which are linear functions of the data and it is given by η = 1/μ. Unlike the canonical links for the Poisson and binomial distributions, the reciprocal transformation, which is often interpretable as the rate of a process, does not map the range of μ onto the whole real line. Thus the requirement that η > 0 implies restrictions on the βs in any linear model. Suitable precautions must be taken in computing β_hat so that negative values of μ_hat are avoided.
-- McCullagh and Nelder (1989). Generalized Linear Models. p. 291
It depends on your X values, but as far as I can tell (please correct me someone!) in an MCMC-based Bayesian case, you can achieve this by either using a truncated prior on the betas or a strong enough prior on your intercept to make the inappropriate regions numerically impossible to reach.
In my case, I ultimately used an identity link with a strong positive prior intercept and that was sufficient and yielded reasonable results.
Also, the choice of link really depends on your X. As the passage above implies, the use of the canonical link assumes that your linear model is in rate space. Log or identity link functions also appear to be very common; ultimately it's about providing a space that offers a sufficient span for the linear function to capture the response.
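To illustrate the link-function point numerically, here is a minimal Python/NumPy/SciPy sketch (made-up design matrix and coefficients, not the Stan model above): under a log link the mean mu = exp(eta) is positive for any real linear predictor, so the gamma shape and rate always stay valid.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical design matrix and coefficients (illustrative values only)
X = rng.normal(size=(100, 2))
beta0, betas = -1.5, np.array([0.8, -2.0])

eta = beta0 + X @ betas    # linear predictor, can be negative
mu = np.exp(eta)           # log link: mu > 0 for any real eta
shape = 2.0                # gamma shape parameter

# Valid gamma parameters everywhere: mean mu, rate = shape / mu > 0
y_sim = stats.gamma.rvs(a=shape, scale=mu / shape)
print(mu.min() > 0, np.all(y_sim > 0))   # True True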
What's the meaning of 'parameterize' in deep learning? As shown in the photo, does it mean that the matrix 'A' can be changed by the optimizer during training?
Yes. When something is parameterized, it means it is treated as a learnable parameter for which gradients can be calculated.
This means that dE/dw, the derivative of the error with respect to the weight, can be calculated (i.e., the operation must be differentiable) and subtracted from the model weights, scaled by a learning_rate and combined with other terms depending on the optimizer.
What the paper is saying is that if you make a binary matrix a weight, compute the gradient dE/dw of the loss with respect to that weight, and update the binary matrix through backpropagation, there is no activation function (which by requirement must be differentiable) that keeps the values discrete (like 0 and 1); instead you end up with continuous values (like decimal values).
Therefore, since having binary weights that remain binary after a backpropagation update through a differentiable activation is difficult, another solution, such as sampling from a Bernoulli distribution, is used instead to initialize the parameters of the model.
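As a quick, hedged illustration (assuming PyTorch is available; this is not code from the paper), the sketch below shows why a hard binarization blocks learning: the gradient of round() is zero almost everywhere, so nothing useful flows back to the underlying real-valued weights.

import torch

# Treat a small real-valued vector as a learnable parameter
w = torch.randn(4, requires_grad=True)

# Hard binarization to 0/1; round() has zero gradient almost everywhere
w_bin = torch.round(torch.sigmoid(w))

# Any loss built on the binarized weights...
loss = (w_bin - torch.ones(4)).pow(2).sum()
loss.backward()

# ...sends no gradient signal back to the real-valued weights
print(w.grad)   # tensor([0., 0., 0., 0.])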
Hope this helps,
I think the answer would be yes, but I'm unable to reason out a good explanation for it.
The mathematical argument lies in the power to represent linearity; we can use the following three lemmas to show it:
Lemma 1
With affine transformations (a linear layer) we can map the input hypercube [0,1]^d into an arbitrarily small box [a,b]^k. The proof is quite simple: make all the biases equal to a, and scale the weights by (b-a).
Lemma 2
For a sufficiently small scale, many non-linearities are approximately linear. This is essentially the definition of a derivative, or of a Taylor expansion. In particular, take relu(x): for x > 0 it is in fact linear! What about the sigmoid? If we look at a tiny region [-eps, eps], it approaches a linear function as eps -> 0.
Lemma 3
Composition of affine functions is affine. In other words, if I were to make a neural network with multiple consecutive linear layers, it would be equivalent to having just one. This comes from the rules of matrix composition:
W2(W1x + b1) + b2 = W2W1x + W2b1 + b2 = (W2W1)x + (W2b1 + b2)
where W2W1 are the new weights and W2b1 + b2 is the new bias.
Combining the above
Composing the three lemmas above, we see that with a non-linear layer there always exists an arbitrarily good approximation of a linear function! We simply use the first layer to map the entire input space into the tiny part of the pre-activation space where your non-linearity is approximately linear, and then we "map it back" in the following layer.
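As a numerical sanity check of the three lemmas (a hypothetical NumPy sketch; the target f(x) = 2x + 1 is chosen arbitrarily): squeeze the input into a tiny region where the sigmoid is almost linear, then rescale it back in the next layer.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def approx_linear(x, a=2.0, b=1.0, eps=1e-3):
    # Lemma 1: squash the input into the tiny region [-eps, eps]
    z = eps * x
    # Lemma 2: near 0, sigmoid(z) ~ 1/2 + z/4, i.e. approximately linear
    h = sigmoid(z)
    # Lemma 3 / "map it back": an affine output layer undoes the squashing
    return (4.0 / eps) * a * (h - 0.5) + b

x = np.linspace(0.0, 1.0, 5)
target = 2.0 * x + 1.0
for eps in (1e-1, 1e-2, 1e-3):
    err = np.max(np.abs(approx_linear(x, eps=eps) - target))
    print(f"eps={eps:g}  max abs error={err:.1e}")  # error shrinks as eps -> 0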
General case
This is a very simple proof; in general you can use the Universal Approximation Theorem to show that a sufficiently large non-linear neural network (sigmoid, ReLU, many others) can approximate any smooth target function, which includes linear ones. This proof (originally given by Cybenko) is, however, much more complex and relies on showing that specific classes of functions are dense in the space of continuous functions.
Technically, yes.
The reason you could use a non-linear activation function for this task is that you can manually alter the results. Let's say the activation function outputs values between 0.0 and 1.0; then you can round up or down to get a binary 0/1. Just to be clear, rounding isn't a linear activation, but for this specific question the purpose of the network was classification, where some kind of rounding has to be applied.
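For concreteness, a tiny sketch of that rounding step (plain NumPy, purely illustrative values):

import numpy as np

logits = np.array([-2.0, 0.3, 1.5])
probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid outputs in (0, 1)
labels = (probs >= 0.5).astype(int)     # round to a hard 0/1 decision
print(probs.round(2), labels)           # [0.12 0.57 0.82] [0 1 1]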
The reason you shouldn't is the same reason you shouldn't attach an industrial heater to a fan and call it a hair dryer: it's unnecessarily powerful, and it could potentially waste resources and time.
I hope this answer helped, have a good day!
I am just getting started with deep reinforcement learning and I am trying to grasp this concept.
I have this deterministic Bellman equation:
When I introduce stochasticity from the MDP, I get 2.6a.
My question is: is this assumption correct? I have seen 2.6a implemented without a policy sign on the state-value function. But to me this does not make sense, because I am using the probabilities of the different next states I could end up in, which is the same as specifying a policy, I think. And if 2.6a is correct, can I then assume the rest (2.6b and 2.6c)? Because then I would like to write the action-value function like this:
The reason I am doing it like this is that I would like to work my way from a deterministic point of view to a non-deterministic point of view.
I hope someone out there can help on this one!
Best regards Søren Koch
No, the value function V(s_t) does not depend on the policy. You see in the equation that it is defined in terms of an action a_t that maximizes a quantity, so it is not defined in terms of actions as selected by any policy.
In the nondeterministic / stochastic case, you will have that sum over probabilities multiplied by state-values, but this is still independent from any policy. The sum only sums over different possible future states, but every multiplication involves exactly the same (policy-independent) action a_t. The only reason why you have these probabilities is because in the nondeterministic case a specific action in a specific state can lead to one of multiple different possible states. This is not due to policies, but due to stochasticity in the environment itself.
There does also exist such a thing as a value function for policies, and when talking about that, a symbol for the policy should be included. But this is typically not what is meant by just "value function", and it also does not match the equation you have shown us. A policy-dependent function would replace the max_{a_t} with a sum over all actions a, weighted inside the sum by the probability pi(s_t, a) of the policy pi selecting action a in state s_t.
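To make that concrete, here is a small hypothetical NumPy sketch (made-up numbers) of a single Bellman optimality backup for one state: the sum runs over possible next states weighted by transition probabilities, while the action is picked by a max, with no policy anywhere.

import numpy as np

gamma = 0.9
V = np.array([0.0, 1.0, 2.0])      # current value estimates for 3 states

# P[a, s']: probability of landing in state s' after taking action a in s_t
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.3, 0.6]])
R = np.array([1.0, 0.0])           # expected immediate reward of each action

# V(s_t) = max_a [ R(a) + gamma * sum_{s'} P(s'|s_t, a) * V(s') ]
q = R + gamma * P @ V
print(q, "->", q.max())            # the max replaces any policy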
Yes, your assumption is completely right. In the reinforcement learning field, a value function is the return obtained by starting from a particular state and following a policy π. So yes, strictly speaking, it should be accompanied by the policy sign π.
The Bellman equation basically represents value functions recursively. However, it should be noted that there are two kinds of Bellman equations:
Bellman optimality equation, which characterizes optimal value functions. In this case, the value function is implicitly associated with the optimal policy. This equation has the non-linear max operator and is the one you have posted. The (optimal) policy dependency is sometimes represented with an asterisk as follows:
Maybe some short texts or papers omit this dependency, assuming it is obvious, but I think any RL textbook should initially include it. See, for example, the Sutton & Barto or Busoniu et al. books.
Bellman equation, which characterizes a value function, in this case associated with any policy π:
In your case, your equation 2.6 is based on the Bellman equation; therefore it should drop the max operator and include the sum over all actions and possible next states. From Sutton & Barto (sorry for the notation change with respect to your question, but I think it's understandable):
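For contrast with the optimality equation above, a matching hypothetical NumPy sketch (made-up policy and numbers) of the Bellman expectation equation for a fixed policy π: the max is replaced by an expectation over the policy's action probabilities.

import numpy as np

gamma = 0.9
V = np.array([0.0, 1.0, 2.0])      # value estimates for 3 states
pi = np.array([0.4, 0.6])          # pi(a | s_t) for 2 actions (made-up policy)

P = np.array([[0.8, 0.2, 0.0],     # P[a, s'] transition probabilities
              [0.1, 0.3, 0.6]])
R = np.array([1.0, 0.0])           # expected immediate reward of each action

# V_pi(s_t) = sum_a pi(a|s_t) [ R(a) + gamma * sum_{s'} P(s'|s_t, a) V(s') ]
v_pi = np.sum(pi * (R + gamma * P @ V))
print(v_pi)                        # an expectation over actions, no max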
I am working on a multi-class image recognition problem. The task is to have the correct answer among the top 3 output probabilities. So I was thinking that maybe there exists a clever cost function that prioritizes the correct answer being in the top K and doesn't penalize much within those top K.
This can be achieved by a class-weighted cross-entropy loss, which essentially assigns a weight to the errors associated with each class. This loss is used in research; see, e.g., the paper "Multi-task Learning and Weighted Cross-entropy for DNN-based Keyword Spotting" by S. Panchapagesan et al. Before computing the cross-entropy, you can check whether the predicted distribution satisfies your condition (e.g., the ground-truth class is in the top-k of the predicted classes) and, if it does, assign zero (or near-zero) weights accordingly.
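A hedged NumPy sketch of that idea (the helper name and numbers are mine, not from the cited paper): compute per-sample cross-entropy and zero out the weight whenever the ground-truth class already sits in the predicted top-k.

import numpy as np

def topk_weighted_ce(probs, targets, k=3, in_topk_weight=0.0):
    # Cross-entropy that down-weights samples whose true class is in the top-k
    n = probs.shape[0]
    ce = -np.log(probs[np.arange(n), targets] + 1e-12)
    topk = np.argsort(probs, axis=1)[:, -k:]            # k most probable classes
    in_topk = np.any(topk == targets[:, None], axis=1)
    weights = np.where(in_topk, in_topk_weight, 1.0)
    return np.mean(weights * ce)

probs = np.array([[0.70, 0.15, 0.10, 0.05],   # true class 2 is in the top-3
                  [0.10, 0.20, 0.30, 0.40]])  # true class 0 is not
targets = np.array([2, 0])
print(topk_weighted_ce(probs, targets, k=3))  # only the second sample is penalized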
There are open questions though: when the correct class is in top-k, should you penalize the k-1 incorrectly predicted classes? What if, for example, the prediction is (0.9, 0.05, 0.01, ...), the third class is correct and it is in top-3 -- is this prediction good enough or not? Should you care what exactly k-1 incorrect classes are?
All these questions arise because this kind of loss doesn't have a probabilistic interpretation, unlike standard cross-entropy. That's why I wouldn't recommend using it in practice, but would reformulate the goal instead.
E.g., if the original problem is that for some inputs several classes are equally good, the best way to deal with it is to use soft labels, e.g. (0.33, 0.33, 0.33, 0, 0, 0, ...) instead of a one-hot vector (note that this totally agrees with the probabilistic interpretation). It will force the network to learn features associated with all three good classes and generally lead to the same goal, but with better control over the target classes.
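And for the soft-label alternative, a one-line illustration (made-up distributions):

import numpy as np

probs = np.array([0.50, 0.30, 0.10, 0.05, 0.05])     # predicted distribution
soft_target = np.array([1/3, 1/3, 1/3, 0.0, 0.0])    # three equally good classes
ce = -np.sum(soft_target * np.log(probs + 1e-12))    # ordinary cross-entropy
print(ce)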
I want to test whether all assumptions for my linear regression model hold. I did this manually and it seems to be fine. However, I want to double check with the function gvlma. The output I get is:
gvlma(x = m_lag)
Value p-value Decision
Global Stat 82.475 0.00000 Assumptions NOT satisfied!
Skewness 72.378 0.00000 Assumptions NOT satisfied!
Kurtosis 1.040 0.30778 Assumptions acceptable.
Link Function 6.029 0.01407 Assumptions NOT satisfied!
Heteroscedasticity 3.027 0.08187 Assumptions acceptable.
My questions are:
How do I interpret Global Stat?
Since the assumption is violated, what can I do about it now? (The same goes for the other two assumptions that were not satisfied.)
Global Stat - Are the relationships between your X predictors and Y roughly linear? Rejection of the null (p < .05) indicates a non-linear relationship between one or more of your X's and Y.
Skewness - Is your distribution skewed positively or negatively, necessitating a transformation to meet the assumption of normality? Rejection of the null (p < .05) indicates that you should likely transform your data.
Kurtosis- Is your distribution kurtotic (highly peaked or very shallowly peaked), necessitating a transformation to meet the assumption of normality? Rejection of the null (p < .05) indicates that you should likely transform your data.
Link Function- Is your dependent variable truly continuous, or categorical? Rejection of the null (p < .05) indicates that you should use an alternative form of the generalized linear model (e.g. logistic or binomial regression).
Heteroscedasticity - Is the variance of your model residuals constant across the range of X (assumption of homoscedasticity)? Rejection of the null (p < .05) indicates that your residuals are heteroscedastic, and thus non-constant across the range of X. Your model is better/worse at predicting for certain ranges of your X scales.
I know the question was written a long time ago, but the only answer is not really accurate.
Based on Pena and Slate (2006), the four assumptions in linear regression are normality, homoscedasticity, linearity, and what the authors refer to as uncorrelatedness.
For the assumption of 'uncorrelatedness', I usually call it independence. The authors refer to independence as something that is validated by an assessment of uncorrelatedness and normality combined. They also refer to other scholars who indicate this is the independence of the residuals (on the left side of p. 342).
Global Stat
This is the overall metric; this states whether the model, as a whole, passes or fails.
Skewness <- measuring the distribution
Kurtosis <- measuring the distribution, outliers, influential data, etc
Link function <- misspecified model, how you linked the elements in the model assignment
Heteroscedasticity <- looking for equal variance in the residuals
The measurements are not literally skewness, kurtosis, etc.; if you look closely at the math behind them, these metrics are derived from multiple statistical analysis methods. In their research, the authors found that combining these four measurements not only accurately assessed the four assumptions of linear regression, but also captured the interaction of the assumptions on the residuals.
In order to determine what to do first to correct the issues, it would be necessary to know what data you are using, the sample size, and the model you have established. The high value in skewness could come from the distribution, the variance, etc. There are things to look for, based on the original work by Pena and Slate, but it seems that a large or small sample size could drastically change where you start. I have not worked through all of the conclusions in the article to know for sure.
Pena, E. A., & Slate, E. H. (2006). Global validation of linear model assumptions. Journal of the American Statistical Association, 101(473), 341-354. https://doi.org/10.1198/016214505000000637