I am searching for some special functions (CUDA) that dedicate to typical dense matrix multiplications, e.g. A*B, where the size of A is 6*n, the size of B is n*6 and n is very large (n=2^24). I have utilized CUBLAS and some other libraries to test this example, In CUBLAS, for this example, we use 6*6=36 threads, which is far from the total parallelism of GPU, so I split A and B into submatrices(vectors) and then implement dot product function for each of them and the performance has been quite well improved. The problem is, in this case, we need to launch 36 CUDA kernels and in between them there are a lot of same data footprints (same data has been accessed for several times from the global memory of GPU). So I am asking whether there exists any solution to this kind of problem.
I have recently written such a matrix multiplication routine for a client of mine. The trick is to extract more parallelism by splitting the long inner summation into several smaller ones. Then use a separate kernel launch to calculate the full sum from the partial ones.
Related
This issue reminds some typical many-body problem, but with some extra calculations.
I am working on the generalized Metropolis Monte-Carlo algorithm for the modeling of large number of arbitrary quantum systems (magnetic ions for example) interacting classically with each other. But it actually doesn't matter for the question.
There is more than 100000 interacting objects, each one can be described by a coordinate and a set of parameters describing its current state r_i, s_i.
Can be translated to the C++CUDA as float4 and float4 vectors
To update the system following Monte-Carlo method for such systems, we need to randomly sample 1 object from the whole set; calculate the interaction function for it f(r_j - r_i, s_j); substitute to some matrix and find eigenvectors of it, from which one a new state will be calculated.
The interaction is additive as usual, i.e. the total interaction will be the sum between all pairs.
Formally this can be decomposed into steps
Generate random number i
Calculate the interaction function for all possible pairs f(r_j - r_i, s_j)
Sum it. The result will be a vector F
Multiply it by some tensor and add another one h = h + dot(F,t). Some basic linear algebra stuff.
Find the eigenvectors and eigenvalues, based on some simple algorithm, choose one vector V_k and write in back to the array s_j of all objects's states.
There is a big question, which parts of this can be computed on CUDA kernels.
I am quite new to CUDA programming. So far I ended up with the following algorithm
//a good random generator
std::uniform_int_distribution<std::mt19937::result_type> random_sampler(0, N-1);
for(int i=0; i\<a_lot; ++i) {
//sample a number of object
nextObject = random_sampler(rng);
//call kernel to calculate the interaction and sum it up by threads. also to write down a new state back to the d_s array
CUDACalcAndReduce<THREADS><<<blocksPerGrid, THREADS>>>(d_r, d_s, d_sum, newState, nextObject, previousObject, N);
//copy the sum
cudaMemcpy(buf, d_sum, sizeof(float)*4*blocksPerGrid, cudaMemcpyDeviceToHost);
//manually reduce the rest of the sum
total = buf[0];
for (int i=1; i<blocksPerGrid; ++i) {
total += buf[i];
}
//find eigenvalues and etc. and determine a new state of the object
//just linear algebra with complex numbers
newState = calcNewState(total);
//a new state will be written by CUDA function on the next iteration
//remember the previous number of the object
previousObject = nextObject;
}
The problem is continuous transferring data between CPU and GPU, and the actual number of bytes is blocksPerGrid*4*sizeof(float) which sometimes is just a few bytes. I optimized CUDA code following the guide from NVIDIA and now it limited by the bus speed between CPU and GPU. I guess switching to pinned memory type will not make any sense since the number of transferred bytes is low.
I used Nvidia Visual Profiler and it shows the following
the most time was waisted by the transferring the data to CPU. The speed as one can see by the inset is 57.143 MB/s and the size is only 64B!
The question is is it worth to move the logic of eigenvalues algorithm to CUDA kernel?
Therefore there will be no data transfer between CPU and GPU. The problem with this algorithm, you can update only one object per iteration. It means that I can run the eigensolver only on one CUDA core. ;( Will it be that slow compared to my CPU, that will eliminate the advantage of keeping data inside the GPU ram?
The matrix size for the eigensolver algorithm does not exceed 10x10 complex numbers. I've heard that cuBLAS can be run fully on CUDA kernels without calling the CPU functions, but not sure how it is implemented.
UPD-1
As it was mentioned in the comment section.
For the each iteration we need to diagonalize only one 10x10 complex Hermitian matrix, which depends on the total calculated interaction function f. Then, we in general it is not allowed to a compute a new sum of f, before we update the state of the sampled object based on eigenvectors and eigenvalues of 10x10 matrix.
Due to the stochastic nature of Monte-Carlo approach we need all 10 eigenvectors to pick up a new state for the sampled object.
However, the suggested idea of double-buffering (in the comments) can work out in a way if we calculate the total sum of f for the next j-th iteration without the contribution of i-th sampled object and, then, add it later. I need to test it carefully in action...
UPD-2
The specs are
CPU 4-cores Intel(R) Core(TM) i5-6500 CPU # 3.20GHz
GPU GTX960
quite outdated, but I might find an access to the better system. However, switching to GTX1660 SUPER did not affect the performance, which means that a PCI bus is a bottleneck ;)
The question is is it worth to move the logic of eigenvalues algorithm
to CUDA kernel?
Depends on the system. Old cpu + new gpu? Both new? Both old?
Generally single cuda thread is a lot slower than single cpu thread. Because cuda compiler does not vectorize its loops but host c++ compiler vectorizes. So, you need to use 10-100 cuda threads to make the comparison fair.
For the optimizations:
According to the image, currently it loses 1 microsecond as a serial part of overall algorithm. 1 microsecond is not much compared to the usual kernel-launch latency from CPU but is big when it is GPU launching the kernel (dynamic parallelism) itself.
CUDA-graph feature enables the overall algorithm re-launch every step(kernel) automatically and complete quicker if steps are not CPU-dependent. But it is intended for "graph"-like workloads where some kernel leads to multiple kernels and they later join in another kernel, etc.
CUDA-dynamic-parallelism feature lets a kernel's cuda threads launch new kernels. This has much better timings than launching from CPU due to not waiting for the synchronizations between driver and host.
Sampling part's copying could be made in chunks like 100-1000 elements at once and consumed by CUDA part at once for 100-1000 steps if all parts are in CUDA.
If I were to write it, I would do it like this:
launch a loop kernel (only 1 CUDA thread) that is parent
start loop in the kernel
do real (child) kernel-launching within the loop
since every iteration needs serial, it should sync before continuing next iteration.
end the parent after 100-1000 sized chunk is complete and get new random data from CPU
when parent kernel ends, it shows in profiler as a single kernel launch that takes a lot of time and it doesn't have any CPU-based inefficiencies.
On top of the time saved from not synching a lot, there would be consistency of performance between 10x10 matrix part and the other kernel part because they are always in same hardware, not some different CPU and GPU.
Since random-num generation is always an input for the system, at least it can be double-buffered to hide cpu-to-gpu data copying latency behind the computation. Iirc, random number generation is much cheaper than sending data over pcie bridge. So this would hide mostly the data transmission slowness.
If it is a massively parallel experiment like running the executable N times, you can still launch like 10 executable instances at once and let them keep gpu busy with good efficiency. Not practical if too much memory is required per instance. Many gpus except ancient ones can run tens of kernels in parallel if each of them can not fully occupy all resources of gpu.
I've been conducting research on streaming datasets larger than the memory available on the GPU to the device for basic computations. One of the main limitations is the fact that the PCIe bus is generally limited around 8GB/s, and kernel fusion can help reuse data that can be reused and that it can exploit shared memory and locality within the GPU. Most research papers I have found are very difficult to understand and most of them implement fusion in complex applications such as https://ieeexplore.ieee.org/document/6270615 . I've read many papers and they ALL FAIL TO EXPLAIN some simple steps to fuse two kernels together.
My question is how does fusion actually work?. What are the steps one would go through to change a normal kernel to a fused kernel? Also, is it necessary to have more than one kernel in order to fuse it, as fusing is just a fancy term for eliminating some memory bound issues, and exploiting locality and shared memory.
I need to understand how kernel fusion is used for a basic CUDA program, like matrix multiplication, or addition and subtraction kernels. A really simple example (The code is not correct but should give an idea) like:
int *device_A;
int *device_B;
int *device_C;
cudaMalloc(device_A,sizeof(int)*N);
cudaMemcpyAsync(device_A,host_A, N*sizeof(int),HostToDevice,stream);
KernelAdd<<<block,thread,stream>>>(device_A,device_B); //put result in C
KernelSubtract<<<block,thread,stream>>>(device_C);
cudaMemcpyAsync(host_C,device_C, N*sizeof(int),DeviceToHost,stream); //send final result through the PCIe to the CPU
The basic idea behind kernel fusion is that 2 or more kernels will be converted into 1 kernel. The operations are combined. Initially it may not be obvious what the benefit is. But it can provide two related kinds of benefits:
by reusing the data that a kernel may have populated either in registers or shared memory
by reducing (i.e. eliminating) "redundant" loads and stores
Let's use an example like yours, where we have an Add kernel and a multiply kernel, and assume each kernel works on a vector, and each thread does the following:
Load my element of vector A from global memory
Add a constant to, or multiply by a constant, my vector element
Store my element back out to vector A (in global memory)
This operation requires one read per thread and one write per thread. If we did both of them back-to-back, the sequence of operations would look like:
Add kernel:
Load my element of vector A from global memory
Add a value to my vector element
Store my element back out to vector A (in global memory)
Multiply kernel:
Load my element of vector A from global memory
Multiply my vector element by a value
Store my element back out to vector A (in global memory)
We can see that step 3 in the first kernel and step 1 in the second kernel are doing things that aren't really necessary to achieve the final result, but they are necessary due to the design of these (independent) kernels. There is no way for one kernel to pass results to another kernel except via global memory.
But if we combine the two kernels together, we could write a kernel like this:
Load my element of vector A from global memory
Add a value to my vector element
Multiply my vector element by a value
Store my element back out to vector A (in global memory)
This fused kernel does both operations, produces the same result, but instead of 2 global memory load operations and 2 global memory store operations, it only requires 1 of each.
This savings can be very significant for memory-bound operations (like these) on the GPU. By reducing the number of loads and stores required, the overall performance is improved, usually proportional to the reduction in number of load/store operations.
Here is a trivial code example.
For analysis of 10^6 genetic factors and their GeneXGene interactions (~5x10^11), I have numerous and independent linear regression problems which are probably suitable for analysis on GPUs.
The objective is to exhaustively search for GeneXGene interaction effects in modulating an outcome variable (a brain phenotype) using linear regression with the interaction term included.
As far as I know, the Householder QR factorization could be the solution for fitting regression models, however, given that each regression matrix in this particular work could easily approach the size of ~ 10'000x10, even each single regression matrix does not seem to fit in GPU on-chip memory (shared, registers etc.).
Should I accept this as a problem which is inherently bandwidth-limited and keep the matrices in GPU global memory during regression analysis, or are other strategies possible?
EDIT
Here are more details about the problem:
There will be approximately 10'000 subjects, each with ~1M genetic parameters (genetic matrix:10'000x10^6). The algorithm in each iteration should select two columns of this genetic matrix (10'000x2) and also around 6 other variables unrelated to genetic data (age, gender etc) so the final regression model will be dealing with a matrix like the size of 10'000x[2(genetic factors)+6(covariates)+2(intercept&interaction term)] and an outcome variable vector (10'000x1). This same process will be repeated ~5e11 times each time with a given pair of genetic factors. Those models passing a predefined statistical threshold should be saved as output.
The specific problem is that although there are ~5e11 separate regression models, even a single one does not seem to fit in on-chip memory.
I also guess that sticking with CUDA libraries may not be the solution here as this mandates most of the data manipulation to take place on the CPU side and only sending each QR decomposition to GPU?
You whole data matrix (1e4 x 1e6) may be too large to fit in the global memory, while each of your least squares solving (1e4 x 10) may be too small to fully utilize the GPU.
For each least squares problem, you could use cuSolver for QR factorization and triangular solving.
http://docs.nvidia.com/cuda/cusolver/index.html#cuds-lt-t-gt-geqrf
If the problem size is too small to fully utilize the GPU, you could use concurrent kernel execution to solve multiple equations at the same time.
https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
For the whole data matrix, if it can not fit into the global memory, you could work on only part of it at a time. For example you could divide the matrix into ten (1e4 x 1e5) blocks, each time you load two of the blocks through PCIe, select all possible two-column combinations from the two blocks respectively, solve the equation and then load another two blocks. Maximize the block size will help you minimize the PCIe data transfer. With proper design, I'm sure the time for PCIe data transfer will be much smaller than solving 1e12 equations. Furthermore you could overlap the data transfer with solver kernel executions.
https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/
I am working on big datasets that are image cubes (450x450x1500). I have a kernel that works on individual data elements. Each data element produces 6 intermediate results (floats). My block consists of 1024 threads. The 6 intermediate results are stored in shared memory by each thread (6 float arrays). However, now I need to add each of the intermediate result to produce a sum (6 sum values). I do not have enough global memory to save these 6 float arrays to global memory and then run a reduction from thrust or any other library from the host code.
Are there any reduction routines that can be called from inside a kernel function on arrays in shared memory?
What will be the best way to solve this problem? I am a newbie to CUDA programming and would welcome any suggestions.
This seems unlikely:
I do not have enough global memory to save these 6 float arrays to global memory and then run a reduction from thrust or any other library from the host code.
I can't imagine how you have enough space to store your data in shared memory but not in global memory.
Anyway, CUB provides reduction routines that can be called from within a threadblock, and that can operate on data stored in shared memory.
Or you can write your own sum-reduction code. It's not terribly hard to do, there are many questions on SO about it, such as this one.
Or you could adapt the cuda sample code.
Update
After seeing all the comments, I understand that instead of doing 1 or a few times of reduction, you need to do the reductions for 450x450x6 times.
In this case there's simpler solution.
You don't need to implement relatively complex parallel reduction for each 1500-D vector。 Since you already have 450x450x6 vectors to reduce, you could reduce all these vectors in parallel using traditional serial reduction method.
You could use a block with 16x16 threads to process a particular region of the image, and a grid with 29x29 blocks to cover the whole 450x450 image.
In each thread, you could iterate over the 1500 frames. In each iterration, you coulde first compute the 6 intermediate results, then add them to the sums. When yo finish all the iterations, you could write the 6 sums to global mem.
That finishes the kernel design. And no shared mem is needed.
You wil find that the performance is very good. Since it is a memory bound operation,it won't be much longer than simply access all the image cube data once.
In case you don't have enough global mem for the whole cube, you could split it into 4 sub-cubes of [1500][225][225], and call the kernel routine on each sub-cube. The only thing you need to change is the grid size.
Have a look at this that explains parallel reduction in CUDA thoroughly.
If I understand it correctly, each thread should sum up "only" 6 floats.
I'm not sure if it is worth doing that by a parallel reduction in general, in the sense that you will experience performance gains.
If you are targeting a Kepler, you may try to use shuffle operations if you properly set the block size so that your intermediate results fit the Streaming Multiprocessor's registers in some way.
As also pointed out by Robert Crovella, your statement about the possibility of storing the intermediate results seems strange as the amount of global memory is certainly larger than the amount of shared memory.
say I want to multiply two matrices together, 50 by 50. I have 2 ways to arrange threads and blocks.
a) one thread to calculate each element of the result matrix. So I have a loop in thread multiplies one row and one column.
b) one thread to do each multiplication. Each element of the result matrix requires 50 threads. After multiplications are done, I can use binary reduction to sum the results.
I wasn't sure which way to take, so I took b. It wasn't ideal. In fact it was slow. Any idea why? My guess would be there are just too many threads and they are waiting for resource most of time, is that true?
As with so many things in high performance computing, the key to understanding performance here is understanding the use of memory.
If you are using one thread do to do one multiplication, then for that thread you have to pull two pieces of data from memory, multiply them, then do some logarthmic number of adds. That's three memory accesses for a mult and an add and a bit - the arithmatic intensity is very low. The good news is that there are many many threads worth of tasks this way, each of which only needs a tiny bit of memory/registers, which is good for occupancy; but the memory access to work ratio is poor.
The simple one thread doing one dot product approach has the same sort of problem - each multiplication requires two memory accesses to load. The good news is that there's only one store to global memory for the whole dot product, and you avoid the binary reduction which doesn't scale as well and requires a lot of synchronization; the down side is there's way less threads now, which at least your (b) approach had working for you.
Now you know that there should be some way of doing more operations per memory access than this; for square NxN matricies, there's N^3 work to do the multiplication, but only 3xN^2 elements - so you should be able to find a way to do way more than 1 computation per 2ish memory accesses.
The approach taken in the CUDA SDK is the best way - the matricies are broken into tiles, and your (b) approach - one thread per output element - is used. But the key is in how the threads are arranged. By pulling in entire little sub-matricies from slow global memory into shared memory, and doing calculations from there, it's possible to do many multiplications and adds on each number you've read in from memory. This approach is the most successful approach in lots of applications, because getting data - whether it's over a network, or from main memory for a CPU, or off-chip access for a GPU - often takes much longer than processing the data.
There's documents in NVidia's CUDA pages (esp http://developer.nvidia.com/object/cuda_training.html ) which describe their SDK example very nicely.
Have you looked at the CUDA documentation: Cuda Programming Model
Also, sample source code: Matrix Multiplication
Did you look at
$SDK/nvidia-gpu-sdk-3.1/C/src/matrixMul
i.e. the matrix multiplication example in the SDK?
If you don't need to implement this yourself, just use a library -- CUBLAS, MAGMA, etc., provide tuned matrix multiplication implementations.