Can I use a Poisson distribution as the family in a Generalized Additive Model (GAM) for continuous, non-negative data?

I am building a GAM on a data set whose distribution resembles Poisson-distributed data. However, my data are continuous: they are tree volumes in cubic meters. So, when fitting the GAM in R (with the mgcv library), can I use poisson as the family? Or should I choose something else, since these are not count data? I found some threads discussing similar issues, but they didn't provide an answer.
My simplified example code with only one explanatory variable:
gam_volumes <- gam(volumes_m3 ~ s(age, k=10), data=training, family=poisson)

I would use a Gamma distribution with log link for this; this distribution will look like a Poisson (right skewed) but it is a continuous distribution. You can't have 0s in the Gamma but that's OK as a 0 volume tree isn't an observable tree.
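In mgcv this would presumably mean passing family = Gamma(link = "log") to gam(). To show what a Gamma model with a log link does with continuous, strictly positive data, here is a minimal Python sketch. It is not a GAM: a single linear term in age stands in for the smooth s(age), the data are synthetic, and the function name fit_gamma_log_glm and all numbers are my own assumptions.

```python
import numpy as np

def fit_gamma_log_glm(X, y, n_iter=25):
    """Fit E[y] = exp(X @ beta) for Gamma-distributed y by IRLS.

    With a log link and Gamma variance (proportional to mu^2) the IRLS
    weights are constant, so each step is an ordinary least-squares fit
    to the working response z = eta + (y - mu) / mu.
    """
    if np.any(y <= 0):
        raise ValueError("Gamma responses must be strictly positive")
    eta = np.log(y)  # start from the data on the link scale
    for _ in range(n_iter):
        mu = np.exp(eta)
        z = eta + (y - mu) / mu
        beta, *_ = np.linalg.lstsq(X, z, rcond=None)
        eta = X @ beta
    return beta

# Synthetic stand-in for tree volumes (hypothetical values, not real data)
rng = np.random.default_rng(0)
age = rng.uniform(10, 100, size=2000)
x = (age - 10) / 90                      # rescale age to [0, 1]
X = np.column_stack([np.ones_like(x), x])
true_beta = np.array([-1.0, 2.5])
mu_true = np.exp(X @ true_beta)
shape = 5.0
volumes = rng.gamma(shape, mu_true / shape)  # continuous, right-skewed, > 0

beta_hat = fit_gamma_log_glm(X, volumes)
fitted = np.exp(X @ beta_hat)            # fitted means are always positive
```

The log link keeps every fitted mean positive, and the check on y <= 0 reflects the point above that the Gamma cannot accommodate zeros.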


LSTM Evolution Forecast

I am confused about how LSTM networks work when the forecast horizon is not fixed, that is, when I want a prediction at an arbitrary time in the future. In physical terms I would call it the evolution of the system.
Suppose I have a time series $y(t)$ (the output) that I want to forecast, and some external inputs $u_1(t), u_2(t), \cdots, u_N(t)$ on which $y(t)$ depends.
It's common to use lagged values of the output $y(t)$ as inputs to the network, so that schematically I have something like the following (for simplicity, consider just lag 1 for the output and no lag for the external inputs):
$[y(t-1), u_1(t), u_2(t), \cdots, u_N(t)] \to y(t)$
With this setup, a recursive forecast is forced to use the value predicted at the previous step as the input for the next step. Errors therefore propagate from step to step, and the long-term forecast behaves badly.
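To make the error propagation concrete, here is a toy sketch in plain Python. Everything in it is assumed for illustration: the "true" system, the slightly misestimated one-step model (standing in for a trained network), and the constant input.

```python
def true_system(y_prev, u_t):
    """Assumed ground truth: y(t) = 0.9 * y(t-1) + u(t)."""
    return 0.9 * y_prev + u_t

def one_step_model(y_prev, u_t, a=0.8):
    """Hypothetical learned model with a slightly wrong coefficient."""
    return a * y_prev + u_t

def recursive_forecast(y0, inputs):
    """Feed each prediction back in as the lagged output for the next step."""
    preds, y_prev = [], y0
    for u_t in inputs:
        y_prev = one_step_model(y_prev, u_t)
        preds.append(y_prev)
    return preds

def one_step_forecast(observed, y0, inputs):
    """Use the observed lagged output at every step (no feedback)."""
    lagged = [y0] + observed[:-1]
    return [one_step_model(y_prev, u_t) for y_prev, u_t in zip(lagged, inputs)]

inputs = [1.0] * 10
y0 = 10.0

# Simulate the true trajectory
observed, y = [], y0
for u_t in inputs:
    y = true_system(y, u_t)
    observed.append(y)

recursive_errors = [abs(o - p) for o, p in
                    zip(observed, recursive_forecast(y0, inputs))]
one_step_errors = [abs(o - p) for o, p in
                   zip(observed, one_step_forecast(observed, y0, inputs))]
```

In this toy setup the one-step error stays constant, while the recursive error grows at every step because each prediction inherits the error of the previous one.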
Now, my confusion is this: I think of an RNN as a (simple) implementation of a state space model, where I have the inputs, my output, and one or more state variables responsible for the memory of the system. These variables are hidden and not observed.
So the question is: if such state variables already take previous states of the system into account, why would I need to use the lagged output value as an input to my network/model?
If I get rid of the lagged output, would my long-term forecast be better, since the error of the forecasted output would no longer propagate? (I guess an error in the internal state would still propagate.)
Thanks !
Please see DeepAR, an LSTM-based forecaster that predicts more than one step into the future.
The main contributions of the paper are twofold: (1) we propose an RNN
architecture for probabilistic forecasting, incorporating a negative
Binomial likelihood for count data as well as special treatment for
the case when the magnitudes of the time series vary widely; (2) we
demonstrate empirically on several real-world data sets that this
model produces accurate probabilistic forecasts across a range of
input characteristics, thus showing that modern deep learning-based
approaches can effective address the probabilistic forecasting
problem, which is in contrast to common belief in the field and the
mixed results
In this paper, they forecast multiple steps into the future, precisely to counter the error propagation you describe.
Skipping several steps makes it possible to get more accurate predictions further into the future.
One more thing done in this paper is predicting percentiles and interpolating, rather than predicting the value directly. This adds stability and gives an error assessment.
Disclaimer - I read an older version of this paper.

Why is the W_q matrix in torch.nn.MultiheadAttention square?

I am trying to implement nn.MultiheadAttention in my network. According to the docs,
embed_dim  – total dimension of the model.
However, according to the source file,
embed_dim must be divisible by num_heads
and
self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim))
If I understand properly, this means each head takes only a part of the features of each query, since the matrix is square. Is this a bug in the implementation, or is my understanding wrong?
Each head uses a different part of the projected query vector. You can imagine it as if the query gets split into num_heads vectors that are independently used to compute the scaled dot-product attention. So, each head operates on a different linear combination of the features in queries (and keys and values, too). This linear projection is done using the self.q_proj_weight matrix and the projected queries are passed to F.multi_head_attention_forward function.
In F.multi_head_attention_forward, it is implemented by reshaping and transposing the query vector, so that the independent attentions for individual heads can be computed efficiently by matrix multiplication.
The attention head sizes are a design decision of PyTorch. In theory, you could have a different head size, so the projection matrix would have a shape of embed_dim × (num_heads * head_dim). Some implementations of transformers (such as the C++-based Marian for machine translation, or Huggingface's Transformers) allow that.
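The project-then-split idea described above can be sketched in NumPy. This is not PyTorch's code: it leaves out biases, masking, dropout and the output projection, and the helper names are my own.

```python
import numpy as np

def split_heads(x, num_heads):
    """Reshape (seq_len, embed_dim) into (num_heads, seq_len, head_dim)."""
    seq_len, embed_dim = x.shape
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    head_dim = embed_dim // num_heads
    return x.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, num_heads):
    # Each projection uses a square (embed_dim, embed_dim) matrix, so every
    # projected feature is a linear combination of ALL input features.
    q = split_heads(x @ w_q.T, num_heads)
    k = split_heads(x @ w_k.T, num_heads)
    v = split_heads(x @ w_v.T, num_heads)
    head_dim = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    out = softmax(scores) @ v                # (num_heads, seq_len, head_dim)
    # Concatenate the heads back into (seq_len, embed_dim)
    return out.transpose(1, 0, 2).reshape(x.shape[0], -1)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                  # 5 tokens, embed_dim = 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
q_heads = split_heads(x @ w_q.T, num_heads=2)
attended = multi_head_self_attention(x, w_q, w_k, w_v, num_heads=2)
```

Head h ends up using rows h*head_dim to (h+1)*head_dim of w_q, each of which spans all 8 input features, so no head is restricted to a slice of the raw query.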

How to predict query topics using the word-topic matrix?

I'm implementing LDA in Java. I know how the algorithm works. At the end of training (after the given number of iterations) I get two matrices (topic-word and document-topic) that represent the set of input documents.
My problem is that when I input a new document (a query), I want to use these matrices (or any other approach) to get the document-topic vector of that query. How would I do that?
Are you using Variational Inference or Gibbs Sampling?
For Gibbs Sampling, a typical approach is to add the new document(s) to the inference and update only their own counters, keeping the counters of the documents used to learn the model constant.
This is specified in equations 84 and 85 of Parameter Estimation for Text Analysis.
I guess there has to be a similar approach for VI LDA.
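A rough Python sketch of that idea follows. It is a simplification under my own assumptions: the topic-word distribution is held completely fixed (the cited equations may treat the new document's word counts differently), alpha is a symmetric prior, and the function name infer_doc_topics is made up for the example.

```python
import numpy as np

def infer_doc_topics(word_ids, topic_word, alpha=0.1, n_iter=200, seed=0):
    """Estimate the topic mixture of an unseen document by Gibbs sampling.

    topic_word: (n_topics, vocab_size) array of p(word | topic) from
                training; it is never updated here.
    word_ids:   word indices of the new document.
    """
    rng = np.random.default_rng(seed)
    n_topics = topic_word.shape[0]
    z = rng.integers(n_topics, size=len(word_ids))   # random initial topics
    doc_topic_counts = np.bincount(z, minlength=n_topics).astype(float)
    for _ in range(n_iter):
        for i, w in enumerate(word_ids):
            doc_topic_counts[z[i]] -= 1              # remove current assignment
            p = topic_word[:, w] * (doc_topic_counts + alpha)
            z[i] = rng.choice(n_topics, p=p / p.sum())
            doc_topic_counts[z[i]] += 1              # only the query's counters change
    # Point estimate from the final sample
    return (doc_topic_counts + alpha) / (len(word_ids) + n_topics * alpha)

# Toy model: topic 0 prefers words 0 and 1, topic 1 prefers words 2 and 3
topic_word = np.array([[0.45, 0.45, 0.05, 0.05],
                       [0.05, 0.05, 0.45, 0.45]])
theta = infer_doc_topics([0, 1] * 10, topic_word)
```

The returned vector is the document-topic row for the query. Averaging over several samples instead of using only the last one would probably give a smoother estimate.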

Can I use autoencoder for clustering?

In the code below, they use an autoencoder for supervised clustering or classification, because they have data labels.
http://amunategui.github.io/anomaly-detection-h2o/
But can I use an autoencoder to cluster data if I do not have labels?
Regards
The deep-learning autoencoder is always unsupervised learning. The "supervised" part of the article you link to is to evaluate how well it did.
The following example (taken from ch.7 of my book, Practical Machine Learning with H2O, where I try all the H2O unsupervised algorithms on the same data set - please excuse the plug) takes 563 features, and tries to encode them into just two hidden nodes.
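library(h2o)  # assumes h2o.init() has been run and tfidf is an H2OFrame
m <- h2o.deeplearning(
  2:564, training_frame = tfidf,
  hidden = c(2), autoencoder = TRUE, activation = "Tanh"
)
f <- h2o.deepfeatures(m, tfidf, layer = 1)  # hidden-layer output for each row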
The second command there extracts the hidden node outputs (the deep features). f is a data frame with two numeric columns and one row for every row in the tfidf source data. I chose just two hidden nodes so that I could plot the clusters:
Results will change on each run. You can (maybe) get better results with stacked auto-encoders, or using more hidden nodes (but then you cannot plot them). Here I felt the results were limited by the data.
BTW, I made the above plot with this code:
d <- as.matrix(f[1:30,]) #Just first 30, to avoid over-cluttering
labels <- as.vector(tfidf[1:30, 1])
plot(d, pch = 17) #Triangle
text(d, labels, pos = 3) #pos=3 means above
(P.S. The original data came from Brandon Rose's excellent article on using NLTK. )
In some respects, encoding data and clustering data share overlapping theory. As a result, you can use autoencoders to cluster (encode) data.
A simple example to visualize: suppose you have a set of training data that you suspect has two primary classes, such as voter history data for Republicans and Democrats. If you take an autoencoder, encode the data to two dimensions, and plot it on a scatter plot, the clustering becomes clearer. Below is a sample result from one of my models. You can see a noticeable split between the two classes as well as a bit of expected overlap.
The code can be found here
This method is not restricted to two classes; you could train on as many different classes as you wish. Two polarized classes are just easier to visualize.
Nor is it limited to two output dimensions; that was just for plotting convenience. In fact, you may find it difficult to meaningfully map certain high-dimensional spaces to such a small space.
When the encoded (clustered) layer has more dimensions, it is harder to "visualize" the feature clusters. This is where it gets a bit more difficult, as you'll have to use some form of supervised learning to map the encoded (clustered) features to your training labels.
One way to determine which class the features belong to is to feed the encoded data into a k-nearest-neighbors (kNN) algorithm. What I prefer to do instead is take the encoded vectors and pass them to a standard backpropagation neural network. Note that, depending on your data, you may find that feeding the raw data straight into your backpropagation neural network is sufficient.
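For the fully unlabeled case in the question, one option is to cluster the encoded vectors directly. The NumPy sketch below trains a tiny tanh autoencoder with two hidden nodes and then runs k-means on the codes. Note that k-means is my own choice here and is not something the answers above prescribe; the data are synthetic, and the labels are used only at the end to check the result.

```python
import numpy as np

def train_autoencoder(X, n_hidden=2, lr=0.01, n_epochs=2000, seed=0):
    """One-hidden-layer tanh autoencoder, full-batch gradient descent on MSE."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, d)); b2 = np.zeros(d)
    for _ in range(n_epochs):
        H = np.tanh(X @ W1 + b1)              # encoder output (the codes)
        err = (H @ W2 + b2 - X) / n           # d(loss)/d(reconstruction)
        dH = (err @ W2.T) * (1.0 - H ** 2)    # backprop through tanh
        W2 -= lr * (H.T @ err); b2 -= lr * err.sum(axis=0)
        W1 -= lr * (X.T @ dH);  b1 -= lr * dH.sum(axis=0)
    return W1, b1

def encode(X, W1, b1):
    return np.tanh(X @ W1 + b1)

def kmeans(Z, k=2, n_iter=20):
    """Plain Lloyd's k-means with a deterministic farthest-point start."""
    centers = [Z[0]]
    for _ in range(k - 1):
        d = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):           # leave empty clusters in place
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

# Synthetic data: two groups in 10 dimensions (labels kept only for checking)
rng = np.random.default_rng(1)
labels_true = np.repeat([0, 1], 100)
offsets = np.where(labels_true[:, None] == 0, -2.0, 2.0)
X = offsets + rng.normal(scale=0.5, size=(200, 10))
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize features

W1, b1 = train_autoencoder(X)
codes = encode(X, W1, b1)                     # shape (200, 2)
clusters = kmeans(codes, k=2)
agreement = np.mean(clusters == labels_true)
purity = max(agreement, 1 - agreement)        # cluster ids are arbitrary
```

On data this cleanly separated the clusters should line up with the two groups; real data will usually be messier, and the number of clusters has to be chosen by other means.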

Apply PCA on very large sparse matrix

I am doing a text classification task in R, and I have a document-term matrix of size 22,490 by 120,000 (only 4 million non-zero entries, fewer than 1% of all entries). Now I want to reduce the dimensionality using PCA (Principal Component Analysis). Unfortunately, R cannot handle this huge matrix, so I stored the sparse matrix in a file in the "Matrix Market Format", hoping to use some other technique to do PCA.
Could anyone give me some hints on useful libraries (in whatever programming language) that can do PCA on a matrix of this scale with ease? Alternatively, how could I do a longhand PCA myself, in other words, first calculate the covariance matrix and then calculate its eigenvalues and eigenvectors?
What I want is to calculate all 120,000 PCs and keep only the top N PCs, which together account for 90% of the variance. Obviously, in this case I have to set a threshold a priori that zeroes out very small values in the covariance matrix. Otherwise the covariance matrix will not be sparse, and at 120,000 by 120,000 it would be impossible to handle on a single machine. Also, the loadings (eigenvectors) will be extremely large and should be stored in a sparse format.
Thanks very much for any help !
Note: I am using a machine with 24GB RAM and 8 cpu cores.
The Python toolkit scikit-learn has a few PCA variants, of which RandomizedPCA can handle sparse matrices in any of the formats supported by scipy.sparse. scipy.io.mmread should be able to parse the Matrix Market format (I never tried it, though).
Disclaimer: I'm on the scikit-learn development team.
EDIT: the sparse matrix support from RandomizedPCA has been deprecated in scikit-learn 0.14. TruncatedSVD should be used in its stead. See the documentation for details.
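A minimal sketch of the TruncatedSVD route, assuming scikit-learn and SciPy are installed. The random matrix is a small stand-in for the real document-term matrix, which could presumably be loaded with scipy.io.mmread("dtm.mtx").tocsr() (the file name is hypothetical). Keep in mind that TruncatedSVD does not center the data, so it is not exactly PCA.

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# Small random sparse matrix standing in for the document-term matrix
X = sp.random(200, 500, density=0.01, format="csr", random_state=0)

# Keep only the leading components; the input stays sparse throughout
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)              # dense array, shape (200, 10)

# Fraction of the total variance captured by the kept components
explained = svd.explained_variance_ratio_.sum()
```

To approach a target such as 90% of the variance, one could increase n_components until explained passes the threshold, rather than computing every component.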
Instead of running PCA, you could try Latent Dirichlet Allocation (LDA), which decomposes the document-word matrix into a document-topic and a topic-word matrix. Here is a link to an R implementation: http://cran.r-project.org/web/packages/lda/ (there are quite a few other implementations out there if you search).
With LDA you need to specify a fixed number of topics (similar to principal components) in advance. A potentially better alternative is HDP-LDA (http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/npbayes-r21.tgz), which learns the number of topics that form a good representation of your corpus.
If you can fit your dataset in memory (which it seems like you can), then you also shouldn't have a problem running the LDA code.
As a number of people on the scicomp forum pointed out, there should be no need to compute all of the 120k principal components. Algorithms like power iteration (http://en.wikipedia.org/wiki/Power_iteration) calculate the largest eigenvalues of a matrix, and LDA algorithms will converge to a minimum-description-length representation of the data given the number of topics specified.
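As an illustration of the power-iteration idea, here is a NumPy sketch that finds the leading principal component without ever forming the covariance matrix. The function name and tolerances are my own choices; it should work for dense arrays and for SciPy sparse matrices, since it only needs matrix-vector products.

```python
import numpy as np

def top_principal_component(X, n_iter=500, tol=1e-10, seed=0):
    """Leading eigenvector and eigenvalue of the covariance of X.

    Each step applies C v = Xc.T @ (Xc @ v) / (n - 1), where Xc is the
    column-centered X. The centering is done implicitly, so a sparse X
    is never densified.
    """
    n, d = X.shape
    mean = np.asarray(X.mean(axis=0)).ravel()
    rng = np.random.default_rng(seed)
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    eigval = 0.0
    for _ in range(n_iter):
        Xv = X @ v - mean @ v                          # Xc @ v
        w = (X.T @ Xv - mean * Xv.sum()) / (n - 1)     # Xc.T @ (Xc @ v) / (n - 1)
        new_eigval = np.linalg.norm(w)
        v = w / new_eigval
        if abs(new_eigval - eigval) < tol * new_eigval:
            break
        eigval = new_eigval
    return v, new_eigval

# Small dense example with one dominant direction
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6)) * np.array([5.0, 2.0, 1.0, 1.0, 1.0, 1.0])
v1, lam1 = top_principal_component(X)
```

Further components could be obtained by deflation, or more practically by a Lanczos-type solver, but this shows why the full 120k × 120k eigendecomposition is not needed.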
In R, big.PCA from the bigpca package (http://cran.r-project.org/web/packages/bigpca/bigpca.pdf) does the job.
text classification task
I solved almost the same problem using a technique for PCA on sparse matrices.
This technique can handle very large sparse matrices.
The results show that such a simple PCA outperforms word2vec.
This suggests that simple PCA would also outperform LDA.
I suppose you won't be able to compute all the principal components, but you can still obtain a reduced-dimension version of your dataset matrix. I've implemented a simple routine in MATLAB, which can easily be replicated in Python.
Compute the (unnormalized) covariance matrix of your input dataset and convert it to a dense matrix. Assuming S is your 120,000 × 22,490 sparse input matrix, this would look like:
n=size(S,1);          % number of rows (120000 here)
Smul=full(S.'*S);     % dense 22490-by-22490 matrix of cross-products
Sm=full(mean(S));     % row vector of column means
Sm2=n*Sm.'*Sm;        % correction term for the means
Scov=Smul-Sm2;        % scatter matrix of the centered data
Apply the eigs function to the covariance matrix to obtain the first N dominant eigenvectors:
[V,D] = eigs(Scov,N);
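Then obtain the PCs by projecting the zero-centered matrix onto the eigenvectors. Projecting S and the mean separately avoids building the dense centered matrix:
Sr=bsxfun(@minus, S*V, Sm*V);   % same result as (S-Sm)*V, without densifying S
Sr is the reduced-dimension version of S.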
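A possible Python translation of the routine above, assuming SciPy is available. The function name sparse_pca is my own, and eigsh plays the role of MATLAB's eigs for a symmetric matrix. As in the MATLAB version, the feature-by-feature matrix is made dense, so this only works when the number of columns is manageable.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def sparse_pca(S, n_components):
    """PCA of a sparse (n_samples, n_features) matrix via its scatter matrix."""
    n = S.shape[0]
    mean = np.asarray(S.mean(axis=0)).ravel()        # column means
    gram = (S.T @ S).toarray()                       # dense cross-products
    scatter = gram - n * np.outer(mean, mean)        # centered scatter matrix
    # Largest eigenpairs of the symmetric scatter matrix
    eigvals, eigvecs = eigsh(scatter, k=n_components, which="LA")
    order = np.argsort(eigvals)[::-1]                # sort descending
    V = eigvecs[:, order]
    # Project without densifying the centered data: (S - mean) @ V
    scores = S @ V - mean @ V
    return scores, V, eigvals[order] / (n - 1)       # variances per component

# Small random stand-in for the real matrix
S = sp.random(60, 15, density=0.2, format="csr", random_state=0)
scores, V, variances = sparse_pca(S, n_components=3)
```

Dividing the eigenvalues by n - 1 turns them into the usual sample variances, which is convenient if you want to track the fraction of variance retained.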