I am testing some non-linear models with nlmer from the lme4 package, but this function assumes normality, while my data clearly follow a Gamma distribution.
fit <- lme4::nlmer(
  y ~ nfun(x, b0, b1, b2) ~ (b0 | id),
  data = df,
  start = start.df
)
summary(fit)
Is there a way of adding a family argument, as for lme4's glmer, or any other tips for testing groups within non-linear models when the data are not Gaussian?
The options for fitting non-linear generalized mixed models in (pure) R are slim to nonexistent.
The brms package offers some scope for nonlinear fitting.
You can build a nonlinear model in TMB (but you have to do your own programming in an extension of C++).
Other front-end packages such as rethinking, greta, and rjags also allow you to build nonlinear mixed-effects models, but all (?) in a Bayesian/MCMC framework.
On the other hand, Gamma-based models can often be adequately substituted by log-Normal-based models (i.e. log(y) ~ nfun(x, b0, b1, b2)); while the tail shapes of these distributions are very different, the variance-mean relationship for the standard parameterizations is the same, and the results of a (log-link) Gamma GLMM and the corresponding log-Gaussian model are often very similar.
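For concreteness, here is a minimal sketch of that log-Normal workaround with nlmer, assuming nfun, df, and start.df are as in the question (and that nfun returns the gradient attribute nlmer requires):

## Model log(y) as Gaussian: a log-Normal stand-in for the Gamma model
df$log_y <- log(df$y)

fit_lognorm <- lme4::nlmer(
  log_y ~ nfun(x, b0, b1, b2) ~ (b0 | id),
  data = df,
  start = start.df
)
summary(fit_lognorm)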
Or you could use a fixed effect for the groups (if you have enough data) and use bbmle::mle2 with the parameters argument, as sketched below.
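A sketch of what that could look like, parameterizing the Gamma by its mean (mean = shape/rate); the starting values are placeholders, and df is assumed to have columns y, x, and id:

library(bbmle)

fit_fixed <- mle2(
  y ~ dgamma(shape = shape, rate = shape / nfun(x, b0, b1, b2)),
  start = list(b0 = 1, b1 = 1, b2 = 1, shape = 1),
  data = df,
  parameters = list(b0 ~ id)  # a separate fixed-effect b0 per group
)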
Related
I am building a GAM with a data set whose distribution resembles Poisson-distributed data. However, my data are continuous, i.e., they contain information on tree volumes in cubic meters. So, when writing the GAM code in R (with the mgcv package), can I use poisson as the family? Or should I choose something else, since the data are not count data? I did find some threads discussing similar issues, but they didn't provide an answer.
My simplified example code with only one explanatory variable:
library(mgcv)
gam_volumes <- gam(volumes_m3 ~ s(age, k = 10), data = training, family = poisson)
I would use a Gamma distribution with a log link for this; this distribution will look like a Poisson (right-skewed), but it is a continuous distribution. You can't have 0s in the Gamma, but that's OK, as a zero-volume tree isn't an observable tree.
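That is a one-argument change to the model in the question (a sketch, using the same hypothetical training data):

library(mgcv)
## Gamma with log link: continuous, strictly positive, right-skewed response
gam_volumes <- gam(volumes_m3 ~ s(age, k = 10), data = training,
                   family = Gamma(link = "log"))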
Loosely speaking, I would write an unconditional model as y = f(z), with y in R^n (or R^(n,m), or R^(n,m,d)) and z in R^q, where z is drawn from some probability distribution. For Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs) I was able to find examples of how to design a conditional variant of these types of models. The conditional model can then be written as y = f(u, z), with y and z as above and u in R^p. Remark: I am interested in continuous conditions. But how can such a conditional model be designed for diffusion-based, normalizing-flow-based, or score-based models? Take diffusion models, for example: would we gradually add Gaussian noise to the condition u at each step, and in this way "forget" the condition more and more, or would we keep the full condition?
I'm using a 1D CNN on temporal data. Let's say that I have two features, A and B. The ratio between A and B (i.e. A/B) is important; let's call this feature C. I'm wondering whether I need to explicitly calculate and include feature C, or whether the CNN can theoretically infer feature C from the given features A and B.
I understand that in deep learning it's best to exclude highly correlated features (such as feature C), but I don't understand why.
The short answer is NO: standard DNN layers will not automatically capture this A/B relationship, because layers such as Conv/Dense only perform affine (matrix-multiplication plus bias) operations.
To simplify the discussion, let us assume that your input feature is two-dimensional, where the first dimension is A and the second is B. Applying a Conv layer to this feature simply learns a weight vector w and a bias b:

y = w * [f_A, f_B] + b = w_A * f_A + w_B * f_B + b
As you can see, there is no way for this representation to mimic or even approximate the ratio operation between A and B.
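A quick numeric illustration (an R sketch with made-up uniform data): the best affine map from (A, B) to A/B still fits poorly, and an affine map is all a single Conv/Dense layer can express.

set.seed(42)
A <- runif(1000, 1, 10)
B <- runif(1000, 1, 10)
C <- A / B

## A single affine layer is equivalent to this linear regression:
fit_lin <- lm(C ~ A + B)
summary(fit_lin)$r.squared  # well below 1: the ratio is not linearly representable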
You don't have to use feature C in the same way as features A and B. Instead, it may be a better idea to keep feature C as an individual input, because its dynamic range may be very different from those of A and B. This means you can have a multiple-input network, where each input has its own feature-extraction layers and the resulting features from both inputs are concatenated to predict your target, as in the sketch below.
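A minimal sketch of such a two-input network with the keras R package; the input shapes, the 100-step window, and the layer sizes are made-up assumptions:

library(keras)

input_ab <- layer_input(shape = c(100, 2), name = "ab")  # 100 time steps of A and B
input_c  <- layer_input(shape = c(100, 1), name = "c")   # precomputed ratio C = A/B

## Each input gets its own feature-extraction branch
branch_ab <- input_ab %>%
  layer_conv_1d(filters = 16, kernel_size = 3, activation = "relu") %>%
  layer_global_average_pooling_1d()

branch_c <- input_c %>%
  layer_conv_1d(filters = 4, kernel_size = 3, activation = "relu") %>%
  layer_global_average_pooling_1d()

## Concatenate the branch features and predict the target
output <- layer_concatenate(list(branch_ab, branch_c)) %>%
  layer_dense(units = 1)

model <- keras_model(inputs = list(input_ab, input_c), outputs = output)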
In the code below, they use an autoencoder for supervised clustering or classification, because they have data labels.
http://amunategui.github.io/anomaly-detection-h2o/
But can I use an autoencoder to cluster data if I do not have labels?
The deep-learning autoencoder is always unsupervised learning. The "supervised" part of the article you link to is there to evaluate how well it did.
The following example (taken from ch.7 of my book, Practical Machine Learning with H2O, where I try all the H2O unsupervised algorithms on the same data set - please excuse the plug) takes 563 features, and tries to encode them into just two hidden nodes.
m <- h2o.deeplearning(
  x = 2:564, training_frame = tfidf,
  hidden = c(2), autoencoder = TRUE,
  activation = "Tanh"
)
f <- h2o.deepfeatures(m, tfidf, layer = 1)
The second command there extracts the hidden-node activations. f is a data frame with two numeric columns, and one row for every row in the tfidf source data. I chose just two hidden nodes so that I could plot the clusters.
Results will change on each run. You can (maybe) get better results with stacked auto-encoders, or using more hidden nodes (but then you cannot plot them). Here I felt the results were limited by the data.
BTW, I made the above plot with this code:
d <- as.matrix(f[1:30,]) #Just first 30, to avoid over-cluttering
labels <- as.vector(tfidf[1:30, 1])
plot(d, pch = 17) #Triangle
text(d, labels, pos = 3) #pos=3 means above
(P.S. The original data came from Brandon Rose's excellent article on using NLTK.)
In some respects, encoding data and clustering data share overlapping theory. As a result, you can use autoencoders to cluster (encode) data.
A simple example to visualize is a set of training data that you suspect has two primary classes, such as voter-history data for Republicans and Democrats. If you take an autoencoder, encode the data to two dimensions, and then plot it on a scatter plot, the clustering becomes clearer. Below is a sample result from one of my models; you can see a noticeable split between the two classes, as well as a bit of expected overlap.
The code can be found here
This method is not restricted to two binary classes; you could also train on as many different classes as you wish. Two polarized classes are just easier to visualize.
Nor is the method limited to two output dimensions; that was just for plotting convenience. In fact, you may find it difficult to meaningfully map certain large-dimensional spaces to such a small space.
In cases where the encoded (clustered) layer has a larger dimension, it is not as easy to "visualize" the feature clusters. This is where it gets a bit more difficult, as you'll have to use some form of supervised learning to map the encoded (clustered) features to your training labels.
A couple of ways to determine which class the features belong to are to feed the data into a k-nearest-neighbors (kNN) algorithm (sketched below), or, what I prefer, to take the encoded vectors and pass them to a standard backpropagation neural network. Note that, depending on your data, you may find that feeding the data straight into your backpropagation neural network is sufficient.
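A minimal sketch of the kNN option with the class package; enc (the encoded vectors) and labels (the known classes for the labelled rows) are hypothetical objects:

library(class)  # knn()

train_idx <- sample(nrow(enc), 0.8 * nrow(enc))
pred <- knn(train = enc[train_idx, ],
            test  = enc[-train_idx, ],
            cl    = labels[train_idx],
            k     = 5)
mean(pred == labels[-train_idx])  # holdout accuracy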
I would like to know whether there are any rules for setting the hyper-parameters alpha and theta in the LDA model. I ran an LDA model with the gensim library:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=30, id2word = dictionary, passes=50, minimum_probability=0)
But I have my doubts about the specification of the hyper-parameters. From what I read in the library documentation, both hyper-parameters are set to 1/number of topics. Given that my model has 30 topics, both hyper-parameters are set to the common value 1/30. I am running the model on news articles that describe economic activity. For this reason, I expect the document-topic distribution (theta) to be high (similar topics across documents), and the topic-word distribution (alpha) to be high as well (topics sharing many words in common, or words not being exclusive to each topic). For this reason, and assuming my understanding of the hyper-parameters is correct, is 1/30 a correct specification value?
I'll assume you expect theta and phi (the document-topic proportions and topic-word proportions) to be closer to equiprobable distributions than to sparse ones with exclusive topics/words.
Since alpha and beta are parameters of a symmetric Dirichlet prior, they have a direct influence on what you want. A Dirichlet distribution outputs probability distributions. When the parameter is 1, all possible distributions are equally likely outcomes (for K=2, [0.5, 0.5] and [0.99, 0.01] have the same chance). When the parameter is greater than 1, it behaves as a pseudo-count, a prior belief: for a high value, near-equiprobable outputs are preferred (P([0.5, 0.5]) > P([0.99, 0.01])). A parameter below 1 has the opposite behaviour. For big vocabularies you don't expect topics that put probability on all words, which is why beta tends to be below 1 (and the same for alpha).
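You can see this behaviour in a few simulated draws; here is a small base-R sketch (a symmetric Dirichlet draw is just a vector of independent Gamma variates, normalized to sum to 1):

rdirichlet_sym <- function(n, k, alpha) {
  g <- matrix(rgamma(n * k, shape = alpha), nrow = n)
  g / rowSums(g)  # each row is one draw from Dirichlet(alpha, ..., alpha)
}

set.seed(1)
round(rdirichlet_sym(3, k = 5, alpha = 10),  2)  # alpha > 1: near-uniform rows
round(rdirichlet_sym(3, k = 5, alpha = 0.1), 2)  # alpha < 1: sparse rows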
However, since you're using Gensim, you can let the model learn the alpha and beta values for you, allowing it to learn asymmetric priors (see here), where the documentation states:

alpha can be set to an explicit array = prior of your choice. It also supports the special values 'asymmetric' and 'auto': the former uses a fixed normalized asymmetric 1.0/topicno prior, the latter learns an asymmetric prior directly from your data.
The same for eta (which I call beta).