Which distance matrix to compute on my absence/presence dataset with many zeros? - binary

I am doing analyses on a dataset of parasite species present or absent in bee samples.
I have a matrix with 0/1 for each sample and each parasite species.
Since a lot of samples do not have any parasite species, a lot of the rows are all zeroes.
I want to make a distance matrix from this, to compute Mantel correlations with matrices of environmental data, but I'm not sure which distance measure to choose.
I was using the vegdist function in R for this:
pardist = vegdist(parasites, method = "euclidean", binary=TRUE),
but now I have discovered there is also a "binomial" method. If I use this one, my results are very different.
I cannot find much information on the difference between the two.
Can someone help to explain?

Related

Is it possible that the number of basis functions is more than the number of observations in spline regression?

I want to run a regression spline with a B-spline basis. The data are structured in such a way that the number of observations is less than the number of basis functions, and I get a good result.
But I'm not sure whether this is valid.
Do I need more rows than columns, as in linear regression?
Thank you.
When the number of observations, N, is small, it's easy to fit a model with basis functions that has low squared error. If you have more basis functions than observations, then you could have 0 residuals (a perfect fit to the data). But that fit is not to be trusted, because it may not be representative of new data points. So yes, you want more observations than columns. Mathematically, you cannot uniquely estimate more than N coefficients, because the design matrix has rank at most N and the extra columns are collinear with the rest. As a rule of thumb, 15 - 20 observations are usually needed for each additional variable / spline.
But this isn't always possible, such as in genetics, where we have hundreds of thousands of potential variables and a small sample size. In that case, we turn to tools that help with a small sample size, such as cross-validation and the bootstrap.
Bootstrap (i.e. resample with replacement) your data points and refit the splines many times (100 will probably do). Then average the fitted splines and use the average as the final spline function. Or you could do cross-validation, where you train on part of the dataset (say 70%) and then test on the remainder.
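To make the point about a perfect but untrustworthy fit concrete, here is a minimal sketch in Python with NumPy. Everything in it is an assumption made for illustration: the truncated-power cubic basis (used instead of a true B-spline basis to keep the code short), the simulated sine data, and the names basis, x_train, x_test. It fits 12 basis functions to 8 observations and then checks the fit on held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)

def basis(x, knots):
    """Truncated-power cubic spline basis: 1, x, x^2, x^3, (x - k)_+^3 per knot."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

# 8 observations but 12 basis functions (4 polynomial terms + 8 knots).
x_train = np.linspace(0.05, 0.95, 8)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
knots = np.linspace(0.1, 0.9, 8)

B_train = basis(x_train, knots)
# lstsq returns the minimum-norm solution when the system is underdetermined.
coef, *_ = np.linalg.lstsq(B_train, y_train, rcond=None)

train_rss = float(np.sum((B_train @ coef - y_train) ** 2))
rank = int(np.linalg.matrix_rank(B_train))  # cannot exceed the number of rows

# Held-out points from the same simulated process.
x_test = rng.uniform(0.05, 0.95, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, x_test.size)
test_mse = float(np.mean((basis(x_test, knots) @ coef - y_test) ** 2))

print(f"rank={rank}, train RSS={train_rss:.2e}, test MSE={test_mse:.3f}")
```

The training residuals are zero up to rounding error, while the held-out error is not, which is why the training fit alone says little when there are more columns than rows.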
In the functional data analysis framework, there are packages in R that construct and fit spline bases (such as cubic splines, B-splines, etc.). These packages include refund, fda, and fda.usc.
For example,
B <- smooth.construct.cc.smooth.spec(object = list(term = "day.t", bs.dim = 12, fixed = FALSE, dim = 1, p.order = NA, by = NA),data = list(day.t = 200:320), knots = list())
constructs a spline basis of dimension 12 (over time, day.t). Note that the "cc" constructor comes from mgcv and builds a cyclic cubic regression spline basis rather than a B-spline basis. You can also use these packages to help choose a basis dimension.

Negative binomial regression SPSS - Quantity vs Distance

I have quite a simple dataset of quantities of litter found in a national park located on an island. For each data point I have corresponding GPS coordinates, and I've derived the distance of each point to the shore. My aim: observe if the quantities of litter increase or decrease with the distance to shore. I'm assuming that quantities of litter will increase with a decrease in distance, as litter is commonly found on beaches etc.
Quantities of litter are counts, i.e. non-parametric. Additionally, I've tested the data to see if they follow a Poisson model and they do not (p-value <0.05), and the variance is larger than the mean for each variable (quantity and distance), so the data seem overdispersed. Therefore, I went on to use a negative binomial regression, with output as follows:
Omnibus test is highly significant (p=0.000). I was just slightly puzzled on the parameter estimates, and generally hoping that this approach makes sense. Any input much appreciated.
Interpreting the parameter estimates requires knowing the link function specified, which would be a log link if you specified your model as a negative binomial with log link on the Type of Model tab, but could be something else if you specified a custom model using a negative binomial distribution with another link (which could be identity, negative binomial, or power, instead).
If it's a log link, then for a distance of 0 (at the shore), you predict exp(2.636) for the count, or about 13.957. For a given distance from the shore, multiply the distance by -0.042, add that to the 2.636 value, and then exponentiate the result. So for every unit you move away from the shore, the log of the prediction decreases by 0.042, and the prediction is multiplied by about 0.959. One unit away, you predict a count of about 13.383; two units away, about 12.833; and so on. So the results are in general accord with your hypothesis. Different calculations would be required if you used a different link function.
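The arithmetic above can be checked with a few lines of Python. This is only a sketch under the stated assumption of a log link; the two coefficients are the values quoted in this answer (written with decimal points), and the names intercept, slope and predicted_count are made up for the example.

```python
import math

intercept = 2.636   # intercept estimate quoted above
slope = -0.042      # estimate for distance to shore quoted above

def predicted_count(distance):
    """Predicted litter count under a log link: exp(intercept + slope * distance)."""
    return math.exp(intercept + slope * distance)

# Multiplicative change in the predicted count per unit of distance.
rate_ratio = math.exp(slope)

for d in (0, 1, 2):
    print(d, round(predicted_count(d), 3))
print("ratio per unit:", round(rate_ratio, 3))
```

Each extra unit of distance multiplies the prediction by the same factor, so the predicted counts fall geometrically rather than linearly with distance.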

How to deal with ordinal labels in keras?

I have data with integer target class in the range 1-5 where one is the lowest and five the highest. In this case, should I consider it as regression problem and have one node in the output layer?
My way of handling it is:
1- first I convert the labels to binary class matrix
labels = to_categorical(np.asarray(labels))
2- in the output layer, I have five nodes
main_output = Dense(5, activation='sigmoid', name='main_output')(x)
3- I use 'categorical_crossentropy' with 'mean_squared_error' when compiling
model.compile(optimizer='rmsprop',loss='categorical_crossentropy',metrics=['mean_squared_error'],loss_weights=[0.2])
Also, can anyone tell me what the difference is between using 'categorical_accuracy' and 'mean_squared_error' in this case?
Regression and classification are very different things. If you recast this as a regression task, then predicting 2 when the ground truth is 4 is penalized more than predicting 3 instead of 4. If you have classes like car, animal, person, you do not care about any ranking between them: predicting car is just as wrong as predicting animal if the image shows a person.
Metrics do not impact your learning at all. A metric is just something that is computed in addition to the loss to show the performance of the model. Here accuracy makes sense, because it is usually the metric we care about. Mean squared error does not tell you directly how well your model performs. A mean squared error of something like 0.0015 sounds good, but it is hard to picture what that means in practice. In contrast, achieving 95% accuracy, for example, is immediately meaningful.
One last thing: you should use softmax instead of sigmoid as your final activation to get a probability distribution in your final layer. Softmax outputs a probability for every class, and these sum to 1. Cross-entropy then measures the difference between the probability distribution output by your network and the ground truth.
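The difference between the two views can be illustrated without Keras. The following NumPy sketch is an illustration only: one_hot, softmax and categorical_crossentropy are small helper functions written for this example, not Keras calls, and the logits are made up. It also shifts the labels from 1..5 to 0..4 before one-hot encoding, because Keras' to_categorical infers the number of classes as the largest label plus one, so unshifted labels 1..5 would produce six columns for a five-node output layer.

```python
import numpy as np

def one_hot(labels, num_classes):
    """One-hot encode integer labels in the range 0..num_classes-1."""
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

def softmax(logits):
    """Row-wise softmax; each row of the result sums to 1."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob):
    """Per-sample cross-entropy between one-hot targets and predicted probabilities."""
    return -np.sum(y_true * np.log(y_prob + 1e-12), axis=1)

labels = np.array([4, 4])            # ground truth is class 4 for both samples
y_true = one_hot(labels - 1, 5)      # shift 1..5 to 0..4, giving exactly 5 columns

# Two equally confident wrong predictions: class 3 and class 2.
logits = np.array([[0, 0, 5, 0, 0],
                   [0, 5, 0, 0, 0]], dtype=float)
probs = softmax(logits)

ce = categorical_crossentropy(y_true, probs)   # ignores how far off the class is
pred_class = probs.argmax(axis=1) + 1          # back to the 1..5 scale
sq_err = (pred_class - labels) ** 2            # grows with the distance between classes

print("cross-entropy:", ce)
print("squared error:", sq_err)
```

Cross-entropy gives both mistakes the same loss, whereas the squared error treats predicting 2 as worse than predicting 3, which is the ordinal behaviour a regression formulation would reward.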

Make a prediction using Octave plsregress

I have a good (or at least a self-consistent) calibration set and have applied PCA and, more recently, PLS regression to n.i.r. spectra of known mixtures of water and additive to predict the percentage of additive by volume. So far I have only done self-calibration, and I now want to predict the concentration blindly from an n.i.r. spectrum. Octave returns XLOADINGS, YLOADINGS, XSCORES, YSCORES, COEFFICIENTS, and FITTED with the plsregress command. FITTED is the estimate of concentration. Octave uses the SIMPLS approach.
How do I use these returned variables to predict concentration given a new sample's spectrum?
Scores are usually denoted by T and loadings by P and X=TP'+E where E is the residual. I am stuck.
Note that T and P are X scores and loadings, respectively. Unlike PCA, PLS has scores and loadings for Y as well (usually denoted U and Q).
While the documentation of plsregress is sketchy at best, the paper it refers to (Sijmen de Jong: SIMPLS: an alternative approach to partial least squares regression, Chemom Intell Lab Syst, 1993, 18, 251-263, DOI: 10.1016/0169-7439(93)85002-X) discusses prediction in equations (36) and (37), which give:
Yhat0 = X0 B
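Note that this uses centered data X0 to predict centered y-values, so you subtract the calibration column means from the new spectrum before multiplying by B and add the calibration mean of y back to the result. B are the COEFFICIENTS.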
I recommend that as a first step you predict your training spectra and make sure you get the correct results (FITTED).
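The centering logic and the recommended check on the training spectra can be sketched in Python with NumPy. This is not the Octave code: the data are simulated, and the coefficient vector B is obtained here by ordinary least squares on centered data as a stand-in for the COEFFICIENTS that plsregress would return. Whether a particular plsregress version returns coefficients for centered data or already includes an intercept row is something to verify against FITTED, as suggested above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-in for calibration spectra (20 samples, 5 wavelengths).
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 10.0 + rng.normal(0, 0.01, 20)

# Means of the calibration data, needed again at prediction time.
x_mean = X.mean(axis=0)
y_mean = y.mean()

# Stand-in for COEFFICIENTS: least squares on the centered data.
B, *_ = np.linalg.lstsq(X - x_mean, y - y_mean, rcond=None)

def predict(X_new):
    """Yhat = (X_new - mean(X)) B + mean(y), i.e. Yhat0 = X0 B plus un-centering."""
    return (X_new - x_mean) @ B + y_mean

# First check: predicting the training spectra should reproduce the fitted values.
fitted = predict(X)
print(np.max(np.abs(fitted - y)))
```

If predicting the calibration spectra this way does not reproduce FITTED, the returned coefficients probably follow a different convention (for example an intercept in the first row), and the centering step should be adjusted accordingly.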

Apply PCA on very large sparse matrix

I am doing a text classification task with R, and I obtain a document-term matrix of size 22490 by 120,000 (only 4 million non-zero entries, less than 1% of the entries). Now I want to reduce the dimensionality using PCA (Principal Component Analysis). Unfortunately, R cannot handle this huge matrix, so I stored the sparse matrix in a file in the "Matrix Market Format", hoping to use some other technique to do PCA.
So could anyone give me some hints for useful libraries (whatever the programming language) that could do PCA on this large-scale matrix with ease, or for doing a longhand PCA myself, in other words, first calculating the covariance matrix and then calculating its eigenvalues and eigenvectors?
What I want is to calculate all PCs (120,000) and keep only the top N PCs, which account for 90% of the variance. Obviously, in this case, I have to set a threshold a priori to set some very tiny variance values to 0 (in the covariance matrix); otherwise, the covariance matrix will not be sparse and its size would be 120,000 by 120,000, which is impossible to handle on a single machine. Also, the loadings (eigenvectors) will be extremely large and should be stored in a sparse format.
Thanks very much for any help !
Note: I am using a machine with 24GB RAM and 8 cpu cores.
The Python toolkit scikit-learn has a few PCA variants, of which RandomizedPCA can handle sparse matrices in any of the formats supported by scipy.sparse. scipy.io.mmread should be able to parse the Matrix Market format (I never tried it, though).
Disclaimer: I'm on the scikit-learn development team.
EDIT: the sparse matrix support from RandomizedPCA has been deprecated in scikit-learn 0.14. TruncatedSVD should be used in its stead. See the documentation for details.
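A minimal sketch of the TruncatedSVD route, assuming scikit-learn and SciPy are installed. The random sparse matrix stands in for the real document-term matrix, and the file name in the comment is hypothetical. Note that TruncatedSVD does not center the data, so the result is a truncated SVD (as in latent semantic analysis) rather than PCA in the strict sense.

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# Stand-in for the real data; with a Matrix Market file one would instead use
# something like: X = scipy.io.mmread("matrix.mtx").tocsr()  (hypothetical file name)
X = sp.random(200, 1000, density=0.01, format="csr", random_state=0)

# Keep only the leading components instead of computing all of them.
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)          # dense array of shape (200, 10)

# Share of the variance captured by the retained components.
explained = float(svd.explained_variance_ratio_.sum())
print(X_reduced.shape, round(explained, 3))
```

Increasing n_components until explained reaches the desired share (90% in the question) avoids ever forming the full set of components.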
Instead of running PCA, you could try Latent Dirichlet Allocation (LDA), which decomposes the document-word matrix into a document-topic matrix and a topic-word matrix. Here is a link to an R implementation: http://cran.r-project.org/web/packages/lda/ - though there are quite a few other implementations out there if you google.
With LDA you need to specify a fixed number of topics (similar to principal components) in advance. A potentially better alternative is HDP-LDA (http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/npbayes-r21.tgz), which learns the number of topics that form a good representation of your corpus.
If you can fit your dataset in memory (which it seems like you can), then you also shouldn't have a problem running the LDA code.
As a number of people on the scicomp forum pointed out, there should be no need to compute all of the 120k principal components. Algorithms like http://en.wikipedia.org/wiki/Power_iteration calculate the largest eigenvalues of a matrix, and LDA algorithms will converge to a minimum-description-length representation of the data given the number of topics specified.
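For reference, power iteration itself takes only a few lines. The sketch below uses NumPy on a small symmetric matrix chosen for the example; the function name power_iteration and its parameters are assumptions of this illustration, and only the single dominant eigenpair is computed.

```python
import numpy as np

def power_iteration(A, num_iters=1000, tol=1e-12, seed=0):
    """Approximate the dominant eigenvalue and eigenvector of a symmetric matrix A."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=A.shape[0])
    v /= np.linalg.norm(v)
    eigval = 0.0
    for _ in range(num_iters):
        w = A @ v                          # only matrix-vector products are needed
        new_eigval = float(v @ w)          # Rayleigh quotient for the current unit vector
        v = w / np.linalg.norm(w)
        if abs(new_eigval - eigval) < tol:
            break
        eigval = new_eigval
    return new_eigval, v

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.0],
              [0.0, 0.0, 1.0]])
eigval, eigvec = power_iteration(A)
print(eigval)
```

Because the method only needs products of the matrix with a vector, it also works when A is stored in a sparse format and never formed densely.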
In R big.PCA of bigpca package http://cran.r-project.org/web/packages/bigpca/bigpca.pdf does the job.
text classification task
I solved almost the same problem using a technique for PCA on a sparse matrix.
This technique can handle very large sparse matrices.
The results show that such a simple PCA outperforms word2vec.
This suggests that the simple PCA would outperform LDA as well.
I suppose you wouldn't be able to compute all principal components, but you can still obtain a reduced-dimension version of your dataset matrix. I've implemented a simple routine in MATLAB, which can easily be replicated in Python.
Compute the covariance matrix of your input dataset and convert it to a dense matrix. Assuming S is your input 120,000 * 22490 sparse matrix, this would be like:
Smul=full(S.'*S);      % 22490 x 22490 matrix of raw cross-products
Sm=full(mean(S));      % 1 x 22490 row vector of column means
Sm2=120000*Sm.'*Sm;    % mean correction, n*(mean'*mean) with n = 120000 rows
Scov=Smul-Sm2;         % centered scatter matrix (the covariance up to a constant factor)
Apply the eigs function to the covariance matrix to obtain the first N dominant eigenvectors,
[V,D] = eigs(Scov,N);
and obtain the PCs by projecting the zero-centered matrix onto the eigenvectors,
Sr=(S-Sm)*V;           % S-Sm is dense; S*V-Sm*V gives the same result without densifying S
Sr is the reduced-dimension version of S.
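A possible Python translation of the routine above, using NumPy and SciPy, is sketched here. The function name sparse_pca and the small random test matrix are assumptions for the example. Like the MATLAB version, it forms a dense p x p matrix (p being the number of columns), so it is only practical when p squared doubles fit in memory.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def sparse_pca(S, n_components):
    """PCA of a sparse n x p matrix S via the dense p x p centered scatter matrix."""
    n = S.shape[0]
    mean = np.asarray(S.mean(axis=0)).ravel()        # column means, shape (p,)
    gram = (S.T @ S).toarray()                       # raw cross-products, dense p x p
    scatter = gram - n * np.outer(mean, mean)        # centered scatter matrix
    vals, vecs = eigsh(scatter, k=n_components, which="LA")  # largest eigenvalues
    order = np.argsort(vals)[::-1]                   # sort in descending order
    V = vecs[:, order]
    scores = S @ V - mean @ V                        # (S - mean) V without densifying S
    return scores, V, vals[order] / (n - 1)          # eigenvalues as variances

# Small random stand-in for the real matrix.
S = sp.random(60, 15, density=0.2, format="csr", random_state=0)
scores, V, variances = sparse_pca(S, n_components=3)
print(scores.shape, variances)
```

Centering is applied after the projection (S @ V minus mean @ V), so the sparse matrix itself is never converted to a dense one.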