Computing Multivariate Cumulative Normal Distribution - numerical-methods

I am trying to compute the CDF of a Multivariate Cumulative Normal Distribution in the 1000th dimension (I have a 1000 vectors and their covariance matrix). I haven't been able to find a fast way to compute this, numpy and scipy all take take too long.
Are there any libraries (platform and language doesn't matter) that can compute this in a reasonable amount of time? If not, are there any numerical methods that can achieve this?

Related

Efficient loss on softmax output in Keras?

When the number of classes is large (e.g., number of words in language models), the softmax cross-entropy loss computation is very expensive. There seems to be no efficient options in the built-in losses. Is there a way to compute this efficiently in Keras, or should I self-implement it and how to?

Implementing large linear regression models using CUDA

For analysis of 10^6 genetic factors and their GeneXGene interactions (~5x10^11), I have numerous and independent linear regression problems which are probably suitable for analysis on GPUs.
The objective is to exhaustively search for GeneXGene interaction effects in modulating an outcome variable (a brain phenotype) using linear regression with the interaction term included.
As far as I know, the Householder QR factorization could be the solution for fitting regression models, however, given that each regression matrix in this particular work could easily approach the size of ~ 10'000x10, even each single regression matrix does not seem to fit in GPU on-chip memory (shared, registers etc.).
Should I accept this as a problem which is inherently bandwidth-limited and keep the matrices in GPU global memory during regression analysis, or are other strategies possible?
EDIT
Here are more details about the problem:
There will be approximately 10'000 subjects, each with ~1M genetic parameters (genetic matrix:10'000x10^6). The algorithm in each iteration should select two columns of this genetic matrix (10'000x2) and also around 6 other variables unrelated to genetic data (age, gender etc) so the final regression model will be dealing with a matrix like the size of 10'000x[2(genetic factors)+6(covariates)+2(intercept&interaction term)] and an outcome variable vector (10'000x1). This same process will be repeated ~5e11 times each time with a given pair of genetic factors. Those models passing a predefined statistical threshold should be saved as output.
The specific problem is that although there are ~5e11 separate regression models, even a single one does not seem to fit in on-chip memory.
I also guess that sticking with CUDA libraries may not be the solution here as this mandates most of the data manipulation to take place on the CPU side and only sending each QR decomposition to GPU?
You whole data matrix (1e4 x 1e6) may be too large to fit in the global memory, while each of your least squares solving (1e4 x 10) may be too small to fully utilize the GPU.
For each least squares problem, you could use cuSolver for QR factorization and triangular solving.
http://docs.nvidia.com/cuda/cusolver/index.html#cuds-lt-t-gt-geqrf
If the problem size is too small to fully utilize the GPU, you could use concurrent kernel execution to solve multiple equations at the same time.
https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
For the whole data matrix, if it can not fit into the global memory, you could work on only part of it at a time. For example you could divide the matrix into ten (1e4 x 1e5) blocks, each time you load two of the blocks through PCIe, select all possible two-column combinations from the two blocks respectively, solve the equation and then load another two blocks. Maximize the block size will help you minimize the PCIe data transfer. With proper design, I'm sure the time for PCIe data transfer will be much smaller than solving 1e12 equations. Furthermore you could overlap the data transfer with solver kernel executions.
https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/

Adjusting Binary Logistic Formula in SPSS

I am running a binary logistic regression in SPSS, to test the effect of e.g. TV advertisements on the probability of a consumer to buy a product. My problem is that with the formula of binary logistic regression:
P=1/(1+e^(-(a+b*Adv)) )
the maximum probability will be equal to 100%. However,even if I increase the number of advertisements by 1000, it is not sensible to assume that the probability to purchase will be 100%. So if I draw the graph of the logistic regression with the coefficients from the Binary Logistic Regression, at some point the probability reaches 100%, which is never the case in a real life setting. How can I control for that?
Is there a way to change the SPSS binary logistic regression to have a maximum probability of e.g. 20%?
Thank you!
The maximum hypothetical probability is 100%, but if you use real-world data, your model will fit the data in such a way that the predicted y-value for any given value of x will be no higher than the real-world y-value (+/- your model's error term). I wouldn't worry too much about the hypothetical maximum probability as long as my model fit the data reasonably well. One of the key reasons for using logistic regressions instead of OLS linear regressions is to avoid impossible predicted values.

Writing a Discrete Fourier Transform program

I would like to write a DFT program using FFT.
This is actually used for very large matrix-vector multiplication (10^8 * 10^8), which is simplified to a vector-to-vector convolution, and further reduced to a Discrete Fourier Transform.
May I ask whether DFT is accurate? Because the matrix has all discrete binary elements, and the multiplication process would not tolerate any non-zero error probability in result. However from the things I currently learnt about DFT it seems to be an approximation algorithm?
Also, may I ask roughly how long would the code be? i.e. would this be something I could start from scratch and compose in C++ in perhaps one or two hundred lines? Cause actually this is for a paper...and all I need is that the complexity analyis is O(nlogn) and the coefficient in front of it doesn't really matter :) So the simplest implementation would be best. (Although I did see some packages like kissfft and FFTW, but they are very lengthy and probably an overkill for my purpose...)
A canonical radix-2 FFT can be written in less than 200 lines of C++. The average numerical error is roughly proportional to O(log N), so you will need to use a large enough numeric type and data scale factor to account for this.
You can compute numerically stable convolutions using the Number Theoretic transform. It uses unique integer sequences to compute the discrete Fourier transform over integer fields/rings. The only caveat is that the signal needs to be integer valued.
It is implementation is roughly the same size as the FFT, but a little faster. You can find my implementation of it at finitetransform.sourceforge.net as the NTTW sub-library. The APFloat library might be more relevant to your needs as they do multiplication of large numbers using convolutions.

Should I use CUDA here?

I have to multiply a very small sized matrix ( size - 10x10 ) with a vector several times 50000 to 100000 times ( could even be more than that). This happens for 1000 different matrices (could be much more). Would there be any significant performance gain by doing this operation on CUDA.
Yes, it's an ideal task for the GPU.
If you want to multiply a single matrix with a vector 50K times and each multiplication is prerequisite to the previous then don't use CUDA. It's a serial problem, best suites for CPU. However if each multiplication is independent you can multiply them simultaneously on CUDA.
The only case where your program will give tremendous speedup is when each vector multiplication iteration is independent to the data of other iterations. This way you'll be able to launch 50K or more iterations simultaneously by launching equal number of threads.
Depending on what exactly you're doing, then yes, this could be done very quickly on a GPU, but you might have to run your own kernel to get some good performance from it.
Without knowing more about your problem, I can't give you too much advice. But I could speculate on a solution:
If you're taking one vector and multiplying it by the same matrix several thousand times, you would be much better of finding the closed form of the matrix to an arbitrary power. You can do this using the Cayley–Hamilton theorem or the Jordan canonical form.
I can't seem to find an implementation of this from a quick googling, but considering I did this in first year linear algebra, it's not too bad. Some info on the Jordan normal form, and raising it to powers can be found at http://en.wikipedia.org/wiki/Jordan_normal_form#Powers and the transformation matrices of it are just a matrix of eigenvectors, and the inverse of that matrix.
Say you have a matrix A, and you find the Jordan normal form J, and the transformation matrices P, P^-1, you find
A^n = P J^n P^-1
I can't seem to find a good link to an implementation of this, but computing the closed form of a 10x10 matrix would be significantly less time consuming than 50,000 matrix multiplications. And an implementation of this would probably run much quicker on a CPU.
If this is your problem, you should look into this.