Low power and singular fit in mixed model despite high number of observations and low number of factors

Although I made 276 independent observations across 5 sites (lowest number of observations per site: 23), I get the singularity warning and low power when fitting a model with one categorical factor (two levels) and one random factor (site). Can anybody tell me why this happens?
Here is a reproducible example:
dataset:
https://www.dropbox.com/s/azpnhnemgyo0i72/example.xlsx?dl=0
require(readxl)
require(lme4)
df <- data.frame((read_excel("Example.xlsx", sheet=1)))
m1 = lmer(ds.lac ~ sand.avail + (1|site), data = df)
summary(m1)
simr::powerSim(m1, nsim=100)
# results in:
# boundary (singular) fit: see ?isSingular
# Simulating: |============================|
# Power for predictor 'sand.avail', (95% confidence interval):
#       9.00% ( 4.20, 16.40)
# Test: Kenward Roger (package pbkrtest)

As suggested by @Jose, this really belongs on CrossValidated.
library(ggplot2)
# plot the response at each site, coloured by sand availability
pd <- position_dodge(width=0.75)
gg1 <- ggplot(df,
              aes(x = site, y = ds.lac, colour = sand.avail)) +
  stat_sum(position = pd, alpha = 0.8)
ggsave("powerSim.png", plot = gg1)
Also as suggested by Jose, there are a number of reasons this data set doesn't have as much information as you would like:
most (75%) of the observations are 0 (mean(df$ds.lac==0) counts the proportion)
you only have all three sand.avail types available at one site, two of the sites only have one type available (thus there will be a lot of overlap between the random effect of site and the effect of sand.avail)
for modern mixed models, 5 levels of the grouping variable is generally considered a minimum
Furthermore, I have a concern about what powerSim is doing here; this seems to be a post hoc power analysis (i.e., estimating the power of a data set you have already gathered), which is widely criticized by statisticians. It would be perfectly reasonable to say "why isn't the effect of sand.avail significant for this model, fitted to these data?" (e.g. car::Anova(m1) gives p=0.66), but asking about the power for this data set is conceptually problematic.

What do BatchNorm2d's running_mean / running_var mean in PyTorch?

I'd like to know what exactly the running_mean and running_var attributes that I can access on nn.BatchNorm2d are.
Here is example code, where bn is an nn.BatchNorm2d:
vector = torch.cat([
    torch.mean(self.conv3.bn.running_mean).view(1), torch.std(self.conv3.bn.running_mean).view(1),
    torch.mean(self.conv3.bn.running_var).view(1), torch.std(self.conv3.bn.running_var).view(1),
    torch.mean(self.conv5.bn.running_mean).view(1), torch.std(self.conv5.bn.running_mean).view(1),
    torch.mean(self.conv5.bn.running_var).view(1), torch.std(self.conv5.bn.running_var).view(1)
])
I couldn't figure out what running_mean and running_var mean from the official PyTorch documentation or the user community.
What do nn.BatchNorm2d.running_mean and nn.BatchNorm2d.running_var mean?

From the original Batchnorm paper:
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Sergey Ioffe and Christian Szegedy, ICML 2015
Algorithm 1 in the paper shows how to measure the statistics of a given batch.
However, what is kept in memory across batches is the running statistics, i.e. statistics that are updated iteratively at each forward pass over a batch in training mode. The computation of the running mean and running variance is explained quite well in the documentation page of nn.BatchNorm2d:
running_stat_new = (1 - momentum) * running_stat + momentum * batch_stat
By default, the momentum coefficient is set to 0.1; it regulates how much of the current batch statistics will affect the running statistics:
closer to 1 means the new running stat is closer to the current batch statistics, whereas
closer to 0 means the current batch stats will not contribute much to updating the new running stats.
It's worth pointing out that BatchNorm2d computes its statistics across the spatial dimensions in addition to the batch dimension. Given a batch of shape (b, c, h, w), it will compute the statistics across (b, h, w). This means the running statistics are shaped (c,), i.e. there are as many statistics components as there are input channels (for both mean and variance).
Here is a minimal example:
>>> import torch
>>> from torch import nn
>>> bn = nn.BatchNorm2d(10)
>>> x = torch.rand(2,10,2,2)
Since track_running_stats is set to True by default on BatchNorm2d, it will track the running stats during forward passes in training mode.
The running mean and variance are initialized to zeros and ones, respectively.
>>> running_mean, running_var = torch.zeros(x.size(1)),torch.ones(x.size(1))
Let's run a forward pass through bn in training mode and check its running stats:
>>> out = bn(x)
>>> bn.running_mean, bn.running_var
(tensor([0.0650, 0.0432, 0.0373, 0.0534, 0.0476,
0.0622, 0.0651, 0.0660, 0.0406, 0.0446]),
tensor([0.9027, 0.9170, 0.9162, 0.9082, 0.9087,
0.9026, 0.9136, 0.9043, 0.9126, 0.9122]))
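Now let's compute those stats by hand:
>>> momentum = 0.1
>>> xmean = x.mean(dim=(0, 2, 3))  # assumed definition: per-channel mean over (b, h, w)
>>> xvar = x.var(dim=(0, 2, 3))    # assumed definition: per-channel unbiased variance
>>> (1-momentum)*running_mean + momentum*xmean
tensor([0.0650, 0.0432, 0.0373, 0.0534, 0.0476,
0.0622, 0.0651, 0.0660, 0.0406, 0.0446])
>>> (1-momentum)*running_var + momentum*xvar
tensor([0.9027, 0.9170, 0.9162, 0.9082, 0.9087,
0.9026, 0.9136, 0.9043, 0.9126, 0.9122])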
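To get a feel for what the momentum update does over many batches, here is a rough NumPy sketch (not PyTorch's actual implementation). The function name update_running_stats and the synthetic data stream (true mean 3, variance 4 in every channel) are assumptions made for illustration. It applies the same update rule as above, using the unbiased batch variance as the PyTorch documentation describes for running_var.

```python
import numpy as np

def update_running_stats(x, running_mean, running_var, momentum=0.1):
    """Apply one running-statistics update for a batch x of shape (b, c, h, w)."""
    # Per-channel statistics are taken over the batch and spatial axes.
    batch_mean = x.mean(axis=(0, 2, 3))
    batch_var = x.var(axis=(0, 2, 3), ddof=1)  # ddof=1 gives the unbiased estimate
    new_mean = (1 - momentum) * running_mean + momentum * batch_mean
    new_var = (1 - momentum) * running_var + momentum * batch_var
    return new_mean, new_var

rng = np.random.default_rng(0)
channels = 4
# Same initial values as BatchNorm2d: zeros for the mean, ones for the variance.
running_mean, running_var = np.zeros(channels), np.ones(channels)

# Hypothetical data stream: every channel has true mean 3 and variance 4.
for _ in range(200):
    batch = rng.normal(loc=3.0, scale=2.0, size=(32, channels, 8, 8))
    running_mean, running_var = update_running_stats(batch, running_mean, running_var)

print(running_mean)  # one entry per channel
print(running_var)
```

After enough batches the initial zeros and ones are forgotten and the running statistics settle near the statistics of the data, which is what eval mode then uses for normalization.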

PyTorch: different behaviours in GAN training with different, but conceptually equivalent, code

I'm trying to implement a simple GAN in PyTorch. The following training code works:
for epoch in range(max_epochs):  # loop over the dataset multiple times
    print(f'epoch: {epoch}')
    running_loss = 0.0
    for batch_idx, (data, _) in enumerate(data_gen_fn):
        # data preparation
        real_data = data
        input_shape = real_data.shape
        inputs_generator = torch.randn(*input_shape).detach()
        # generator forward
        fake_data = generator(inputs_generator).detach()
        # discriminator forward
        optimizer_generator.zero_grad()
        optimizer_discriminator.zero_grad()
        #################### ALERT CODE #######################
        predictions_on_real = discriminator(real_data)
        predictions_on_fake = discriminator(fake_data)
        predictions = torch.cat((predictions_on_real,
                                 predictions_on_fake), dim=0)
        #########################################################
        # loss discriminator
        labels_real_fake = torch.tensor([1]*batch_size + [0]*batch_size)
        loss_discriminator_batch = criterion_discriminator(predictions,
                                                           labels_real_fake)
        # update discriminator
        loss_discriminator_batch.backward()
        optimizer_discriminator.step()
        # generator
        # zero the parameter gradients
        optimizer_discriminator.zero_grad()
        optimizer_generator.zero_grad()
        fake_data = generator(inputs_generator)  # make again fake data but without detaching
        predictions_on_fake = discriminator(fake_data)  # D(G(encoding))
        # loss generator
        labels_fake = torch.tensor([1]*batch_size)
        loss_generator_batch = criterion_generator(predictions_on_fake,
                                                   labels_fake)
        loss_generator_batch.backward()  # dL(D(G(encoding)))/dW_{G,D}
        optimizer_generator.step()
If I plot the generated images for each iteration, I see that the generated images look like the real ones, so the training procedure seems to work well.
However, if I change the code in the ALERT CODE part, i.e., instead of:
#################### ALERT CODE #######################
predictions_on_real = discriminator(real_data)
predictions_on_fake = discriminator(fake_data)
predictions = torch.cat((predictions_on_real,
predictions_on_fake), dim=0)
#########################################################
I use the following:
#################### ALERT CODE #######################
predictions = discriminator(torch.cat( (real_data, fake_data), dim=0))
#######################################################
This is conceptually the same: instead of doing two separate forward passes on the discriminator (the first on the real data, the second on the fake data) and then concatenating the results, the new code first concatenates the real and fake data and then makes just one forward pass on the concatenated data.
However, this version of the code does not work: the generated images always seem to be random noise.
Any explanation to this behavior?

Why do we get different results?
Supplying inputs in either the same batch, or separate batches, can make a difference if the model includes dependencies between different elements of the batch. By far the most common source in current deep learning models is batch normalization. As you mentioned, the discriminator does include batchnorm, so this is likely the reason for different behaviors. Here is an example. Using single numbers and a batch size of 4:
import numpy as np
features = np.array([1., 2., 5., 6.])
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
# mean 3.5, std 2.0615528128088303
# normalized features [-1.21267813 -0.72760688 0.72760688 1.21267813]
Now we split the batch into two parts. First part:
features = np.array([1., 2.])
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
# mean 1.5, std 0.5
# normalized features [-1. 1.]
Second part:
features = np.array([5., 6.])
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
# mean 5.5, std 0.5
# normalized features [-1. 1.]
As we can see, in the split-batch version, the two batches are normalized to the exact same numbers, even though the inputs are very different. In the joint-batch version, on the other hand, the larger numbers are still larger than the smaller ones as they are normalized using the same statistics.
Why does this matter?
With deep learning, it's always hard to say, and especially with GANs and their complex training dynamics. A possible explanation is that, as we can see in the example above, the separate batches result in more similar features after normalization even if the original inputs are quite different. This may help early in training, as the generator tends to output "garbage" which has very different statistics from real data.
With a joint batch, these differing statistics make it easy for the discriminator to tell the real and generated data apart, and we end up in a situation where the discriminator "overpowers" the generator.
By using separate batches, however, the different normalizations make the generated and real data look more similar, which makes the task less trivial for the discriminator and allows the generator to learn.
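To make the argument above a bit more concrete, here is a small NumPy sketch. It is only an illustration, not the asker's discriminator: the function batch_normalize and the synthetic 1-D "real" and "fake" features are assumptions. It measures how far apart the real and fake features remain after a joint normalization versus two separate ones.

```python
import numpy as np

def batch_normalize(features, eps=1e-5):
    """Normalize one batch with its own mean and (biased) variance."""
    return (features - features.mean()) / np.sqrt(features.var() + eps)

rng = np.random.default_rng(0)
# Hypothetical 1-D features: real data, and early generator output whose
# statistics are far from the real ones.
real = rng.normal(loc=0.0, scale=1.0, size=64)
fake = rng.normal(loc=5.0, scale=0.2, size=64)

# Joint batch: one set of statistics shared by real and fake samples.
joint = batch_normalize(np.concatenate([real, fake]))
joint_gap = abs(joint[:64].mean() - joint[64:].mean())

# Separate batches: each batch is normalized with its own statistics.
split_gap = abs(batch_normalize(real).mean() - batch_normalize(fake).mean())

print(f"mean gap after joint normalization:    {joint_gap:.3f}")
print(f"mean gap after separate normalization: {split_gap:.3f}")
```

With the joint batch the two groups stay clearly separated after normalization, while the separate batches both end up centred at zero, so this particular cue is no longer available to the discriminator.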

Is it possible that the number of basis functions is more than the number of observations in spline regression?

I want to fit a regression spline with a B-spline basis. The data is structured in such a way that the number of observations is less than the number of basis functions, and I get a good result.
But I'm not sure whether this is valid.
Do I need more rows than columns, as in linear regression?
Thank you.

When the number of observations, N, is small, it's easy to fit a model with basis functions with low squared error. If you have more basis functions than observations, then you could have 0 residuals (a perfect fit to the data). But that is not to be trusted because it may not be representative of more data points. So yes, you want to have more observations than columns. Mathematically, you cannot uniquely estimate more than N coefficients, because a design matrix with N rows has rank at most N, so the extra columns are collinear with the rest. As a rule of thumb, 15 - 20 observations are usually needed for each additional variable / spline.
But this isn't always the case, such as in genetics, where we have hundreds of thousands of potential variables and a small sample size. In that case, we turn to tools that help with a small sample size, such as cross-validation and the bootstrap.
Bootstrap (i.e. resample with replacement) your data points and refit the splines many times (100 will probably do). Then average the fitted splines and use the average as the final spline function. Or you could do cross-validation, where you train on a subset of the data (70%) and then test on the remaining 30%.
In the functional data analysis framework, there are packages in R that construct and fit spline bases (such as cubic, B, etc). These packages include refund, fda, and fda.usc.
For example,
library(mgcv)  # smooth.construct.cc.smooth.spec is the mgcv constructor for bs = "cc"
B <- smooth.construct.cc.smooth.spec(object = list(term = "day.t", bs.dim = 12, fixed = FALSE, dim = 1, p.order = NA, by = NA),data = list(day.t = 200:320), knots = list())
constructs a spline basis of dimension 12 (over time, day.t; the "cc" constructor builds a cyclic cubic regression spline basis), but you can also use these packages to help choose a basis dimension.
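The zero-residual behaviour described at the start of this answer can be checked numerically. The following is a minimal sketch in Python/NumPy rather than R. It assumes degree-1 B-splines (hat functions) on uniform knots to keep the code short, and the helper hat_basis and the simulated data are made up for illustration. With 6 observations and 10 basis functions, least squares reproduces the noisy data exactly.

```python
import numpy as np

def hat_basis(x, n_basis):
    """Design matrix of degree-1 B-splines (hat functions) on uniform knots in [0, 1]."""
    knots = np.linspace(0.0, 1.0, n_basis)
    spacing = knots[1] - knots[0]
    return np.maximum(0.0, 1.0 - np.abs(x[:, None] - knots[None, :]) / spacing)

rng = np.random.default_rng(1)
x = np.linspace(0.05, 0.95, 6)                               # 6 observations
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.5, size=6)    # signal plus noise

B = hat_basis(x, n_basis=10)                                 # 6 rows, 10 columns
# lstsq returns the minimum-norm solution when there are more columns than rows.
coef, _, rank, _ = np.linalg.lstsq(B, y, rcond=None)
max_residual = np.max(np.abs(B @ coef - y))

print("design matrix shape:", B.shape)
print("rank:", rank)
print("largest residual:", max_residual)
```

The rank equals the number of observations, not the number of columns, and the fit passes through every point including its noise. That is why a "good result" in this setting says little about how the spline would behave on new data.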

How to define the number of factors in parallel analysis

I conducted an Exploratory Factor Analysis (Principal Axis Factoring) on my data and wanted to determine the number of factors to extract via Horn's Parallel Analysis.
However I have two problems:
The parallel analysis suggests extracting 1 factor; however, the plot shows more than one intersection of my "FA Actual Data" and "FA Simulated Data" lines. I do not understand why it is just one factor (the first intersection). This plot does not look like typical parallel analysis plots.
Why does the number of factors to extract change with the number of observations (n.obs) I state? I changed the number of observations from 50 to 500 (which is not the true value), and parallel analysis then suggested extracting 5 factors instead of 9. I do not understand why.
Thank you so much for any helpful tips.
Valerie
fa.parallel(cor(My_Data), n.obs = 50, fa="fa", fm="pa")
Parallel analysis suggests that the number of factors = 1 and the number of components = NA

Determining edge weights given a list of walks in a graph

These questions regard a set of data with lists of tasks performed in succession and the total time required to complete them. I've been wondering whether it would be possible to determine useful things about the tasks' lengths, either as they are or with some initial guesstimation based on appropriate domain knowledge. I've come to think graph theory would be the way to approach this problem in the abstract, and have a decent basic grasp of the stuff, but I'm unable to know for certain whether I'm on the right track. Furthermore, I think it's a pretty interesting question to crack. So here we go:
Is it possible to determine the weights of edges in a directed weighted graph, given a list of walks in that graph with the lengths (summed weights) of said walks? I recognize the amount and quality of permutations on the routes taken by the walks will dictate the quality of any possible answer, but let's assume all possible walks and their lengths are given. If a definite answer isn't possible, what kind of things can be concluded about the graph? How would you arrive at those conclusions?
What if there were several similar walks with possibly differing lengths given? Can you calculate a decent average (or other illustrative measure) for each edge, given enough permutations on different routes to take? How will discounting some permutations from the available data set affect the calculation's accuracy?
Finally, what if you had a set of initial guesses as to the weights and had to refine those using the walks given? Would that improve upon your guesstimation ability, and how could you apply the extra information?
EDIT: Clarification on the difficulties of a plain linear algebraic approach. Consider the following set of walks:
a = 5
b = 4
b + c = 5
a + b + c = 8
A matrix equation with these values is unsolvable, but we'd still like to estimate the terms. There might be some helpful initial data available, such as in scenario 3, and in any case we can apply knowledge of the real world, such as that the length of a task can't be negative. I'd like to know if you have ideas on how to ensure we get reasonable estimates and that we also know what we don't know, e.g. when there's not enough data to tell a from b.

Seems like an application of linear algebra.
You have a set of linear equations to solve, where the variables are the lengths of the tasks (or edge weights).
For instance, suppose the task lengths are t1, t2, t3 for 3 tasks,
and you are given
t1 + t2 = 2 (task 1 and 2 take 2 hours)
t1 + t2 + t3 = 7 (all 3 tasks take 7 hours)
t2 + t3 = 6 (tasks 2 and 3 take 6 hours)
Solving gives t1 = 1, t2 = 1, t3 = 5.
You can use any linear algebra technique (e.g. Gaussian elimination: http://en.wikipedia.org/wiki/Gaussian_elimination) to solve these, which will tell you whether there is a unique solution, no solution, or an infinite number of solutions (those are the only possibilities).
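As a rough sketch of that check (using NumPy's rank and solver routines instead of hand-written Gaussian elimination; the function name classify_system is made up for this example), you can compare the rank of the coefficient matrix with the rank of the augmented matrix:

```python
import numpy as np

def classify_system(A, b):
    """Return 'unique', 'none' or 'infinite' for the linear system A x = b."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    rank_A = np.linalg.matrix_rank(A)
    rank_aug = np.linalg.matrix_rank(np.column_stack([A, b]))
    if rank_aug > rank_A:
        return "none"      # the equations contradict each other
    if rank_A == A.shape[1]:
        return "unique"    # every task length is determined
    return "infinite"      # some task lengths cannot be told apart

# Rows are walks, columns are tasks t1, t2, t3 (1 = the task is in the walk).
A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
b = [2, 7, 6]

kind = classify_system(A, b)
solution = np.linalg.solve(np.array(A, dtype=float), np.array(b, dtype=float))
print(kind, solution)

# The walks from the question's edit: a = 5, b = 4, b + c = 5, a + b + c = 8.
question_kind = classify_system([[1, 0, 0], [0, 1, 0], [0, 1, 1], [1, 1, 1]],
                                [5, 4, 5, 8])
print(question_kind)
```

The first system is the worked example above; the second is the inconsistent set of walks from the question, which is the case the following suggestions try to deal with.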
If you find that the linear equations do not have a solution, you can try adding a very small random number to some of the task weights/coefficients of the matrix and try solving it again (I believe this falls under perturbation theory). Matrices are notorious for radically changing behavior with small changes in the values, so this will likely give you an approximate answer reasonably quickly.
Or maybe you can try introducing a 'slack' task in each walk (i.e. add more variables) and pick the solution to the new equations where the slack tasks satisfy some linear constraints (like 0 < s_i < 0.0001 and minimize the sum of s_i), using linear programming techniques.
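A closely related and simpler option is ordinary least squares, which minimizes the sum of squared slacks instead of solving a linear program. The sketch below is only an illustration of that swapped-in technique, applied to the inconsistent walks from the question. It does not enforce non-negative task lengths (a non-negative solver such as scipy.optimize.nnls could be tried for that), and the variable names are made up.

```python
import numpy as np

# Walks from the question: a = 5, b = 4, b + c = 5, a + b + c = 8.
# Columns are the unknown task lengths a, b, c.
A = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 1, 1],
              [1, 1, 1]], dtype=float)
lengths = np.array([5, 4, 5, 8], dtype=float)

# Least squares picks the task lengths that minimize the sum of squared slacks.
estimate, _, rank, _ = np.linalg.lstsq(A, lengths, rcond=None)
slack = lengths - A @ estimate

# A rank below the number of tasks would mean that some tasks
# cannot be separated from each other with the walks available.
identifiable = rank == A.shape[1]

print("estimated a, b, c:", estimate)
print("slack per walk:", slack)
print("all tasks identifiable:", identifiable)
```

The slack per walk shows which observations disagree with the estimate, and the rank check is one way to "know what we don't know" when two tasks only ever appear together.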

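Assume you have an unlimited supply of symbols to represent the edges (a, b, c, d, etc.).
w is a list of all the walks, each in the form 0,a,b,c,d,e followed by the walk's length (the 0 will be explained later).
i = 1
if walk w[i] still contains more than one term then
replace the first symbol of w[i], everywhere it occurs in w, with the LENGTH of w[i] minus all the other terms of w[i].
repeat with the next walk until nothing more can be substituted.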
Example:
0,a,b,c,d,e 50
0,a,c,b,e 20
0,c,e 10
So:
a is the first symbol, so a = 50 - b - c - d - e. Replace all instances of "a" with 50,-b,-c,-d,-e and cancel matching terms.
New data:
50 50
50,-d 20
0,c,e 10
Repeat until each walk has one value left, and you are finished. Alternatively, the first number can simply be subtracted from the length of each walk.

I'd forget about graphs and treat lists of tasks as vectors, with every task represented as a component whose value equals its cost (time to complete, in this case).
If tasks are in different orders initially, that's where to use domain knowledge to bring them to a canonical form and assign multipliers if domain knowledge tells you that the ratio of costs will be substantially influenced by ordering / timing. Timing is implicit in the initial ordering, but you may have to make a function of time just for adjustment factors (say, driving at lunch time vs driving at midnight). The function might be tabular/discrete. In general it's always much easier to evaluate ratios and relative biases (hardness of doing something). You may need a functional language to do repeated rewrites of your vectors until there's nothing more that domain knowledge and rules can change.
With canonical vectors, consider just the presence and absence of each task (just 0|1 for this iteration) and look for minimal diffs, single-task diffs first; these will provide estimates with a small number of variables. Keep doing this recursively, be ready to backtrack, and have a heuristic rule for the goodness or quality of the estimates so far. Keep track of good "rounds" that you backtracked from.
When you reach a minimal irreducible state (you can't make any more diffs, and all vectors have the same remaining tasks), you can do some basic statistics like variance, mean, and median, and look for big outliers and ways to improve the initial domain-knowledge-based estimates that led to the canonical form. If you find a lot of them and can infer new rules, take them in and start the whole process from the start.
Yes, this can cost a lot :-)
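Here is a minimal sketch of the "single-task diffs first" step, assuming the walks are already in canonical form so that task order can be ignored. The walks are the ones from the question's edit; the variable names are made up, and this covers only one round, without the backtracking or rule rewriting described above.

```python
from collections import defaultdict
from statistics import mean

# Walks from the question, as (set of tasks, total time); order is ignored.
walks = [
    (frozenset("a"), 5.0),
    (frozenset("b"), 4.0),
    (frozenset("bc"), 5.0),
    (frozenset("abc"), 8.0),
]

estimates = defaultdict(list)
for tasks, total in walks:
    if len(tasks) == 1:                      # a one-task walk measures that task directly
        (task,) = tasks
        estimates[task].append(total)
for small, t_small in walks:
    for big, t_big in walks:
        extra = big - small
        if small < big and len(extra) == 1:  # walks that differ by exactly one task
            (task,) = extra
            estimates[task].append(t_big - t_small)

# Mean, minimum and maximum of the estimates collected for each task.
summary = {task: (mean(vals), min(vals), max(vals))
           for task, vals in sorted(estimates.items())}
for task, (avg, low, high) in summary.items():
    print(task, avg, low, high)
```

A wide spread between the minimum and maximum for a task (as for a here, which is measured once directly and once as a difference) flags exactly the kind of outlier the last step suggests investigating.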
