SSVS and spike-slab prior with JAGS - regression

I'm very new to this topic/posting on a discussion board, so I apologize in advance if something is unclear.
I'm interested in performing a stochastic search variable seleciton (SSVS) in JAGS. I've seen codes online of people performing SSVS (e.g. http://www4.stat.ncsu.edu/~reich/ABA/code/SSVS which I've copied the code below) but my understanding is that to perform this method, I need to use a spike-slab prior in JAGS. The spike can be either a point mass or a distribution with a very narrow variance. Looking at most people's codes, there's only one distribution being defined (in the one above, they define a distribution on gamma, with beta = gamma * delta) and I believe they're assuming a point mass on the spike.
So my questions are:
1) Can someone explain why the code below is using the SSVS method? For example, how do we know this isn't using GVS, which is also another method that uses a Gibbs sampler?
2) Is this a point mass on the spike?
3) If I wanted to use simulated data to test whether the Gibbs sampler is correctly drawing from the spike/slab, how would I go about doing that? Would I code for a spike and a slab and what would I be looking for in the posterior to see that it is drawing correctly?
model_string <- "model{
# Likelihood
for(i in 1:n){
Y[i] ~ dpois(lambda[I])
log(lambda[i]) <- log(N[i]) + alpha + inprod(beta[],X[i,])
}
#Priors
for(j in 1:p){
gamma[j] ~ dnorm(0,tau)
delta[j] ~ dbern(prob)
beta[j] <- gamma[j]*delta[j]
}
prob ~ dunif(0,1)
tau ~ dgamma(.1,.1)
alpha ~ dnorm(0,0.1)
}"
I've also asked on the JAGS help page too: https://sourceforge.net/p/mcmc-jags/discussion/610036/thread/a44343e0/#ab47

I am also (trying to) work on some Bayesian varibale selection stuff in JAGs. I am by no means an expert on this topic, but maybe if we chat about this more we can learn together. Here is my interpretation of the varibale selection within this code:
model_string <- "model{
Likelihood
for(i in 1:n){
Y[i] ~ dpois(lambda[I])
log(lambda[i]) <- log(N[i]) + alpha + inprod(beta[],X[i,])
}
Priors
for(j in 1:p){
gamma[j] ~ dnorm(0,tau)
delta[j] ~ dbern(prob) # delta has a Bernoulli distributed prior (so it can only be 1:included or 0:notincluded)
beta[j] <- gamma[j]*delta[j] # delta is the inclusion probability
}
prob ~ dunif(0,1) # This is then setting an uninformative prior around the probability of a varible being included into the model
tau ~ dgamma(.1,.1)
alpha ~ dnorm(0,0.1)
}"
I have tried to comment out the variable selection sections of the model. The code above looks really similar to the Kuo & Mallick method of Bayesian variable selection. I am currently having trouble tuning the spike and slab method so the estimates mix properly instead of "getting stuck" on either 0 or 1.
So my priors are set up more like:
beta~ dnorm(0,tau)
tau <-(100*(1-gamma))+(0.001*(gamma)) # tau is the inclusion probability
gamma~dbern(0.5)
I have found this paper helps to explain the differences between different variable selection methods (It gets into GVS vs SSVS):
O’Hara, R.B. & Sillanpää, M.J. (2009). A review of Bayesian variable selection
methods: What, how and which. Bayesian Anal., 4, 85–118
Or this blog post: https://darrenjw.wordpress.com/2012/11/20/getting-started-with-bayesian-variable-selection-using-jags-and-rjags/
If there was no SSVS on the beta prior, the prior would look more like this:
Priors
for(j in 1:p){
beta[j] <- ~ dnorm(0,0.01) # just setting a normally (or whatever drstribution you're working in) distributed prior around beta.
}
tau ~ dgamma(.1,.1)
alpha ~ dnorm(0,0.1)
}"

Related

Bayesian Gamma regression, what is the correct link function?

I'm trying to do a bayesian gamma regression with stan.
I know the correct link function is the inverse canonical link,
but if i dont use a log link parameters can be negative, and enter in a gamma distribution with a negative value, that obviously can't be possible.
how can i deal with it?
parameters {
vector[K] betas; //the regression parameters
real beta0;
real<lower=0.001, upper=100 > shape; //the variance parameter
}
transformed parameters {
vector<lower=0>[N] eta; //the expected values (linear predictor)
vector[N] alpha; //shape parameter for the gamma distribution
vector[N] beta; //rate parameter for the gamma distribution
eta <- beta0 + X*betas; //using the log link
}
model {
beta0 ~ normal( 0 , 2^2 );
for(i in 2:K)
betas[i] ~ normal( 0 , 2^2 );
y ~ gamma(shape,shape * eta);
}
I was struggling with this a couple weeks ago, and while I don't consider this a definitive answer, hopefully it is still helpful. For what it's worth, McCullagh and Nelder directly acknowledge this inappropriate support of the canonical link function. They advise that one must constrain the betas to properly match the support. Here's the relevant passage
The canonical link function yields sufficient statistics which are linear functions of the data and it is given by η = 1/μ. Unlike the canonical links for the Poisson and binomial distributions, the reciprocal transformation, which is often interpretable as the rate of a process, does not map the range of μ onto the whole real line. Thus the requirement that η > 0 implies restrictions on the βs in any linear model. Suitable precautions must be taken in computing β_hat so that negative values of μ_hat are avoided.
-- McCullagh and Nelder (1989). Generalized Linear Models. p. 291
It depends on your X values, but as far as I can tell (please correct me someone!) in an MCMC-based Bayesian case, you can achieve this by either using a truncated prior on the betas or a strong enough prior on your intercept to make the inappropriate regions numerically impossible to reach.
In my case, I ultimately used an identity link with a strong positive prior intercept and that was sufficient and yielded reasonable results.
Also, the choice of link really depends on your X. As the passage above implies, the use of the canonical link assumes that your linear model is in rate space. Using log or identity link functions appear to be also very common, and ultimately it's about providing a space that offers a sufficient span for the linear function to capture the response.

Backpropagation on Two Layered Networks

i have been following cs231n lectures of Stanford and trying to complete assignments on my own and sharing these solutions both on github and my blog. But i'm having a hard time on understanding how to modelize backpropagation. I mean i can code modular forward and backward passes but what bothers me is that if i have the model below : Two Layered Neural Network
Lets assume that our loss function here is a softmax loss function. In my modular softmax_loss() function i am calculating loss and gradient with respect to scores (dSoft = dL/dY). After that, when i'am following backwards lets say for b2, db2 would be equal to dSoft*1 or dW2 would be equal to dSoft*dX2(outputs of relu gate). What's the chain rule here ? Why isnt dSoft equal to 1 ? Because dL/dL would be 1 ?
The softmax function is outputs a number given an input x.
What dSoft means is that you're computing the derivative of the function softmax(x) with respect to the input x. Then to calculate the derivative with respect to W of the last layer you use the chain rule i.e. dL/dW = dsoftmax/dx * dx/dW. Note that x = W*x_prev + b where x_prev is the input to the last node. Therefore dx/dW is just x and dx/db is just 1, which means that dL/dW or simply dW is dsoftmax/dx * x_prev and dL/db or simply db is dsoftmax/dx * 1. Note that here dsoftmax/dx is dSoft we defined earlier.

How to find a function that fits a given set of data points in Julia?

So, I have a vector that corresponds to a given feature (same dimensionality). Is there a package in Julia that would provide a mathematical function that fits these data points, in relation to the original feature? In other words, I have x and y (both vectors) and need to find a decent mapping between the two, even if it's a highly complex one. The output of this process should be a symbolic formula that connects x and y, e.g. (:x)^3 + log(:x) - 4.2454. It's fine if it's just a polynomial approximation.
I imagine this is a walk in the park if you employ Genetic Programming, but I'd rather opt for a simpler (and faster) approach, if it's available. Thanks
Turns out the Polynomials.jl package includes the function polyfit which does Lagrange interpolation. A usage example would go:
using Polynomials # install with Pkg.add("Polynomials")
x = [1,2,3] # demo x
y = [10,12,4] # demo y
polyfit(x,y)
The last line returns:
Poly(-2.0 + 17.0x - 5.0x^2)`
which evaluates to the correct values.
The polyfit function accepts a maximal degree for the output polynomial, but defaults to using the length of the input vectors x and y minus 1. This is the same degree as the polynomial from the Lagrange formula, and since polynomials of such degree agree on the inputs only if they are identical (this is a basic theorem) - it can be certain this is the same Lagrange polynomial and in fact the only one of such a degree to have this property.
Thanks to the developers of Polynomial.jl for leaving me just to google my way to an Answer.
Take a look to MARS regression. Multi adaptive regression splines.

Simulation from a nested glmer model

I'm having a problem generating simulations from a 3 level glmer model when conditioning on the random effects (I'm actually using predict via bootMer but the problem is the same).
This works:
library(lme4)
fit1 = glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial)
simulate(fit1, re.form=NULL)
This fails:
cbpp$bigherd = rep(1:7, 8)
fit2 = glmer(cbind(incidence, size - incidence) ~ period + (1 | bigherd / herd),
data = cbpp, family = binomial)
simulate(fit2, re.form=NULL)
Error: No random effects terms specified in formula
Many thanks for any ideas.
Update
Ben, many thanks for your help below, really appreciate it. I wonder if I can impose on you again.
What I want to do is simulate predictions on the response scale and I'm not sure if I can use your work around? Or if there is an alternative to what I'm doing. Thank you!
This works as expected, but is not conditional on random effects:
FUN = function(.){
predict(., type="response")
}
bootMer(fit2, FUN, nsim=3)$t
This doesn't work, as would be expected given above problem:
bootMer(fit2, FUN, nsim=3, use.u=TRUE)$t
As far as I can see, I can't pass re.form to bootMer.
Does the alternative below result in simulated predictions conditional on random effects without passing use.u to bootMer?
FUN = function(.){
predict(., type="response", re.form=~(1|herd:bigherd) + (1|bigherd))
}
bootMer(fit2, FUN, nsim=10)$t
I'm not sure what's going on yet, but here are two workarounds that do work:
simulate(fit2, re.form=lme4:::reOnly(formula(fit2)))
simulate(fit2, re.form=~(1|herd:bigherd) + (1|bigherd))
There must be something going wrong with the expansion of the "slash" term, because this doesn't work:
simulate(fit2, re.form=~(1|bigherd/herd))
I've posted this as an lme4 issue
These workarounds don't work for bootMer (which only takes the use.u argument, not re.form) in the current CRAN release (1.1-9).
It is fixed in the development version on Github (1.1-10): devtools::install_github("lme4/lme4") will install it, if you have compilation tools installed.
In the meantime you could just go ahead and implement your own parametric bootstrap (for parametric bootstrapping, bootMer is actually a very thin wrapper around simulate()/[refit()orupdate()]/FUN`). Much of the complication has to do with parallel computation (you'd have to add some of it back in if you want parallel computation in your own PB implementation).
This is the outline of a hand-rolled parametric bootstrap:
nboot <- 10
nresp <- length(FUN(orig_fit))
res <- matrix(NA,nboot,nresp)
for (i in 1:nboot) {
res[i,] <- FUN(update(orig_fit,data=simulate(orig_fit,...)))
## or use refit() for LMMs
## ... are options to simulate()
}
t(apply(res,2,quantile,c(0.025,0.975)))

Indirect Kalman Filter for Inertial Navigation System

I'm trying to implement an Inertial Navigation System using an Indirect Kalman Filter. I've found many publications and thesis on this topic, but not too much code as example. For my implementation I'm using the Master Thesis available at the following link:
https://fenix.tecnico.ulisboa.pt/downloadFile/395137332405/dissertacao.pdf
As reported at page 47, the measured values from inertial sensors equal the true values plus a series of other terms (bias, scale factors, ...).
For my question, let's consider only bias.
So:
Wmeas = Wtrue + BiasW (Gyro meas)
Ameas = Atrue + BiasA. (Accelerometer meas)
Therefore,
when I propagate the Mechanization equations (equations 3-29, 3-37 and 3-41)
I should use the "true" values, or better:
Wmeas - BiasW
Ameas - BiasA
where BiasW and BiasA are the last available estimation of the bias. Right?
Concerning the update phase of the EKF,
if the measurement equation is
dzV = VelGPS_est - VelGPS_meas
the H matrix should have an identity matrix in corrispondence of the velocity error state variables dx(VEL) and 0 elsewhere. Right?
Said that I'm not sure how I have to propagate the state variable after update phase.
The propagation of the state variable should be (in my opinion):
POSk|k = POSk|k-1 + dx(POS);
VELk|k = VELk|k-1 + dx(VEL);
...
But this didn't work. Therefore I've tried:
POSk|k = POSk|k-1 - dx(POS);
VELk|k = VELk|k-1 - dx(VEL);
that didn't work too... I tried both solutions, even if in my opinion the "+" should be used. But since both don't work (I have some other error elsewhere)
I would ask you if you have any suggestions.
You can see a snippet of code at the following link: http://pastebin.com/aGhKh2ck.
Thanks.
The difficulty you're running into is the difference between the theory and the practice. Taking your code from the snippet instead of the symbolic version in the question:
% Apply corrections
Pned = Pned + dx(1:3);
Vned = Vned + dx(4:6);
In theory when you use the Indirect form you are freely integrating the IMU (that process called the Mechanization in that paper) and occasionally running the IKF to update its correction. In theory the unchecked double integration of the accelerometer produces large (or for cheap MEMS IMUs, enormous) error values in Pned and Vned. That, in turn, causes the IKF to produce correspondingly large values of dx(1:6) as time evolves and the unchecked IMU integration runs farther and farther away from the truth. In theory you then sample your position at any time as Pned +/- dx(1:3) (the sign isn't important -- you can set that up either way). The important part here is that you are not modifying Pned from the IKF because both are running independent from each other and you add them together when you need the answer.
In practice you do not want to take the difference between two enourmous double values because you will lose precision (because many of the bits of the significand were needed to represent the enormous part instead of the precision you want). You have grasped that in practice you want to recursively update Pned on each update. However, when you diverge from the theory this way, you have to take the corresponding (and somewhat unobvious) step of zeroing out your correction value from the IKF state vector. In other words, after you do Pned = Pned + dx(1:3) you have "used" the correction, and you need to balance the equation with dx(1:3) = dx(1:3) - dx(1:3) (simplified: dx(1:3) = 0) so that you don't inadvertently integrate the correction over time.
Why does this work? Why doesn't it mess up the rest of the filter? As it turns out, the KF process covariance P does not actually depend on the state x. It depends on the update function and the process noise Q and so on. So the filter doesn't care what the data is. (Now that's a simplification, because often Q and R include rotation terms, and R might vary based on other state variables, etc, but in those cases you are actually using state from outside the filter (the cumulative position and orientation) not the raw correction values, which have no meaning by themselves).