I have to perform the same computation (for example, get the eigenvalues of A1, A2, ...) on many (>10^15) matrices, so I want to use as many threads as possible.
But I couldn't find any cuBLAS or cuSOLVER code that specifies the number of threads. Does cuSOLVER automatically distribute resources and parallelize the computations if I write a for loop around the cuSOLVER functions?
Or is there a cuSOLVER or cuBLAS API with which I can control the number of threads and parallelize the functions?
Does cuSOLVER automatically distribute resources and parallelize the computations if I write a for loop around the cuSOLVER functions?
No.
Or is there a cuSOLVER or cuBLAS API with which I can control the number of threads and parallelize the functions?
No.
But if you care to read the cuSOLVER documentation, you will see that there is a batched sparse QR factorization routine, which can be used to solve eigenvalue problems.
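As a rough outline, the host-side call sequence for that batched QR routine looks like this (a sketch only, not a definitive implementation; names such as m, nnzA, batchSize, and the d_* device pointers are placeholders, and error checking is omitted). Note that all matrices in the batch must share the same sparsity pattern.

// Sketch of the cuSOLVER batched sparse QR call sequence (error checks omitted)
cusolverSpHandle_t handle;
cusolverSpCreate(&handle);
cusparseMatDescr_t descrA;
cusparseCreateMatDescr(&descrA);
csrqrInfo_t info;
cusolverSpCreateCsrqrInfo(&info);

// Symbolic analysis of the sparsity pattern shared by the whole batch
cusolverSpXcsrqrAnalysisBatched(handle, m, m, nnzA, descrA,
                                d_csrRowPtrA, d_csrColIndA, info);

// Query and allocate the working buffer
size_t internalBytes = 0, workspaceBytes = 0;
cusolverSpDcsrqrBufferInfoBatched(handle, m, m, nnzA, descrA,
                                  d_csrValA, d_csrRowPtrA, d_csrColIndA,
                                  batchSize, info, &internalBytes, &workspaceBytes);
void *d_buffer;
cudaMalloc(&d_buffer, workspaceBytes);

// Factor and solve A_i * x_i = b_i for the whole batch in one call
cusolverSpDcsrqrsvBatched(handle, m, m, nnzA, descrA,
                          d_csrValA, d_csrRowPtrA, d_csrColIndA,
                          d_b, d_x, batchSize, info, d_buffer);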
In a Numba CUDA kernel, I know that we can define local and shared arrays. Also, all the variable assignments in a kernel go to registers for a particular thread. Is it possible to declare a register array using Numba CUDA, something similar to the following, which would be used in a CUDA C kernel?
register float accumulators[32];
It is not possible.
The register keyword is only a hint to the compiler, and it has essentially no effect in CUDA C/C++. The device-code compiler decides what to put in registers based on its own heuristics for generating fast code, not on hints from the programmer.
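By way of illustration, here is a hypothetical CUDA C kernel (not from the question) using a plain local array; if all the indexing is resolvable at compile time, the compiler may keep the array in registers on its own, and it may spill to local memory otherwise:

__global__ void accumulate(const float *in, float *out, int n)
{
    // An ordinary local array; no "register" keyword needed.
    // With the loops unrolled, every index is a compile-time constant,
    // so the compiler is free to place this array in registers.
    float accumulators[32];

    #pragma unroll
    for (int i = 0; i < 32; ++i)
        accumulators[i] = 0.0f;

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        #pragma unroll
        for (int i = 0; i < 32; ++i)
            accumulators[i] += in[tid] * i;
        out[tid] = accumulators[31];
    }
}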
How do I use the CUDA cuSOLVER to find the eigenvalues and eigenvectors of a dense, (double precision) complex, non-symmetric matrix?
Looking at the documentation, I see there are CUDA routines and example code for solving a dense symmetric matrix, using 'syevd'. I've come across another GPU-enabled package, MAGMA, which has the relevant function (magma_zgeev).
Is it possible to find these eigenvalues/vectors using plain CUDA (SDK v8), or do I need an alternate library like MAGMA?
As of the CUDA 11 release, cuSolver continues to offer only routines for obtaining the eigenvalues of symmetric matrices. There are no non-symmetric eigensolvers in cuSolver.
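For the symmetric/Hermitian case that cuSolver does cover, the dense eigensolver call sequence looks roughly like this (a sketch using cusolverDnZheevd for a complex Hermitian matrix; n, d_A, and d_W are placeholders, and error checking is omitted):

// Sketch: eigenvalues/eigenvectors of a dense complex Hermitian matrix.
// The non-symmetric case has no cuSolver equivalent (hence MAGMA's zgeev).
cusolverDnHandle_t handle;
cusolverDnCreate(&handle);

int lwork = 0;
cusolverDnZheevd_bufferSize(handle, CUSOLVER_EIG_MODE_VECTOR,
                            CUBLAS_FILL_MODE_LOWER, n, d_A, n, d_W, &lwork);

cuDoubleComplex *d_work;
cudaMalloc(&d_work, sizeof(cuDoubleComplex) * lwork);
int *d_info;
cudaMalloc(&d_info, sizeof(int));

// On exit, d_A holds the eigenvectors and d_W the (real) eigenvalues
cusolverDnZheevd(handle, CUSOLVER_EIG_MODE_VECTOR, CUBLAS_FILL_MODE_LOWER,
                 n, d_A, n, d_W, d_work, lwork, d_info);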
So I'm trying to see if I can get a significant speedup from using a GPU to solve a small overdetermined system of equations by solving a bunch of them at the same time. My current algorithm involves using an LU decomposition function from the CULA Dense library, and it has to switch back and forth between the GPU and the CPU to initialize and run the CULA functions. I would like to be able to call the CULA functions from my CUDA kernels so that I don't have to jump back to the CPU and copy the data back. This would also allow me to create multiple threads that work on different data sets, solving multiple systems concurrently. My question is: can I call CULA functions from device functions? I know it's possible with CUBLAS and some of the other CUDA libraries.
Thanks!
The short answer is no. The CULA library routines are designed to be called from host code, not device code.
Note that CULA has its own support forums, which you may be interested in.
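As an aside: if the underlying goal is solving many small systems concurrently, the batched LU routines in cuBLAS do that entirely from host code. A minimal sketch (d_Aarray and d_Barray are placeholder device arrays of device pointers, n and batchSize placeholders, error checks omitted):

// Factor and solve batchSize n-by-n systems in two host-side calls
cublasHandle_t handle;
cublasCreate(&handle);

int *d_pivots, *d_infoArray;
cudaMalloc(&d_pivots, sizeof(int) * n * batchSize);
cudaMalloc(&d_infoArray, sizeof(int) * batchSize);

// LU-factor all matrices in one call
cublasDgetrfBatched(handle, n, d_Aarray, n, d_pivots, d_infoArray, batchSize);

// Back-substitute for all right-hand sides; "info" is a host-side status
int info = 0;
cublasDgetrsBatched(handle, CUBLAS_OP_N, n, 1,
                    (const double *const *)d_Aarray, n, d_pivots,
                    d_Barray, n, &info, batchSize);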
I am about to create a GPU-enabled program using CUDA technology. It will be either C# with Emgu or C++ with the CUDA toolkit (not yet decided).
I need to use all the GPU's power (I have a card with 16 GPU cores). How do I run 16 tasks in parallel?
First off: 16 GPU cores is, on pre-600-series cards, equal to 16*8 = 128 cores; on the 600 series it is 16*32 = 512 cores. That does not mean you should limit yourself to 128/512 tasks.
Second: Emgu seems to be an OpenCV wrapper for .NET and is related to image processing; it generally has nothing to do with GPU programming. Some of its algorithms may be GPU-accelerated, but I don't know anything about that. The alternative to CUDA here is OpenCL, not OpenCV. If you will be using CUDA technology as you say, you have no alternative to CUDA, as only CUDA is CUDA.
When it comes to starting tasks, you only tell the GPU how many threads you wish to run. Actually, you tell the GPU how many blocks, and how many threads per block, you wish to run. This is done when you launch the kernel itself. You don't want to limit yourself to 128/512 threads either; experiment.
I don't know your level of GPGPU programming experience, but remember that you cannot run tasks the way you do on the CPU. You cannot run 128 different tasks; all threads have to run the exact same instructions (except when branching, which should generally be avoided).
Generally speaking, you want enough threads to fill all the streaming multiprocessors. At a minimum that is 0.25 * MULTIPROCESSORS * MAX_THREADS_PER_MULTIPROCESSOR.
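Both factors can be queried at run time; a short sketch:

// Query the properties of device 0 and apply the rule of thumb above
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int minThreads = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor / 4;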
Specifically, in CUDA, suppose you have some kernel __global__ void square_array(float *a, int N)...
When you launch the kernel, you specify the number of blocks and the number of threads per block:
square_array <<< n_blocks, n_threads_per_block >>> (a, N);
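Filled out, a minimal version might look like this (the kernel body is an assumption based on its name):

__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)                     // guard threads past the end of the array
        a[idx] = a[idx] * a[idx];
}

// Host side: round the block count up so all N elements are covered
int n_threads_per_block = 256;
int n_blocks = (N + n_threads_per_block - 1) / n_threads_per_block;
square_array<<<n_blocks, n_threads_per_block>>>(a, N);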
Note: you need to get more familiar with the CUDA parallel programming model, as you are not approaching it in a manner that will use all your GPU's power. Consider reading Programming Massively Parallel Processors: A Hands-on Approach.
I'm using Thrust in my current project so I don't have to write a device_vector abstraction or (segmented) scan kernels myself.
So far I have done all of my work using Thrust abstractions, but for simple kernels, or kernels that don't translate easily to the for_each or transform abstractions, I'd prefer at some point to write my own kernels instead.
So my question is: can I, through Thrust (or perhaps CUDA), ask which device is currently being used and what properties it has (max block size, max shared memory, all that stuff)?
If I can't get the current device, is there some way to get Thrust to calculate the kernel dimensions if I provide the kernel's register and shared-memory requirements?
You can query the current device with CUDA. See the CUDA documentation on device management. Look for cudaGetDevice(), cudaSetDevice(), cudaGetDeviceProperties(), etc.
Thrust currently has no notion of device management. I'm not sure what you mean by "get Thrust to calculate the kernel dimensions", but if you are looking to determine grid dimensions for launching your custom kernel, you need to do that on your own. It can help to query the properties of the kernel with cudaFuncGetAttributes(), which is what Thrust uses.
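Put together, the queries might look like this (myKernel is a hypothetical __global__ function, used only for illustration):

// Query the current device and its properties
int device = 0;
cudaGetDevice(&device);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, device);
printf("max threads/block: %d, shared mem/block: %zu\n",
       prop.maxThreadsPerBlock, prop.sharedMemPerBlock);

// Query per-kernel resource usage (registers, static shared memory, ...)
cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, myKernel);
printf("regs/thread: %d, static smem: %zu, max threads/block: %d\n",
       attr.numRegs, attr.sharedSizeBytes, attr.maxThreadsPerBlock);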