Given a collection of thousands of points in 3D, I need to get the list of neighbours for each particle that fall inside some cutoff value (in terms of Euclidean distance), and if possible, sorted from nearest to farthest.
Which is the fastest GPU algorithm for this purpose in the CUDA or OpenCL languages?
One of the fastest GPU MD codes I'm aware of, HALMD, uses a (highly tuned) version of the same sort of approach that is used in the CUDA SDK example "Particles". Both the HALMD paper and the Particles whitepaper are very clearly written. The underlying algorithm is to assign particles to cutoff-radius-sized bins, do a radix sort based on the bin index, and then look at particles in the neighbouring bins.
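To make the binning idea concrete, here is a minimal host-side C++ sketch of it, not the HALMD or SDK code. It assumes a cubic, non-periodic box with all coordinates in [0, box), uses `std::sort` where a GPU version would use a radix sort, and the names `Point` and `neighbour_lists` are made up for illustration. The outer loop over `i` is the part that would become one GPU thread per particle.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Point { float x, y, z; };

// For each point, returns the indices of all other points within `cutoff`,
// ordered from nearest to farthest. Assumes all coordinates lie in [0, box).
std::vector<std::vector<int>> neighbour_lists(const std::vector<Point>& pts,
                                              float box, float cutoff) {
    assert(box > 0.0f && cutoff > 0.0f);
    const int n = static_cast<int>(pts.size());
    // Bins are at least one cutoff wide, so every neighbour of a particle
    // lies in its own bin or in one of the 26 adjacent bins.
    const int nb = std::max(1, static_cast<int>(box / cutoff));
    const float w = box / nb;
    auto coord = [&](float v) { return std::min(nb - 1, static_cast<int>(v / w)); };
    auto bin_of = [&](const Point& p) {
        return (coord(p.z) * nb + coord(p.y)) * nb + coord(p.x);
    };

    // Sort particle indices by bin index (stand-in for the GPU radix sort).
    std::vector<int> order(n);
    for (int i = 0; i < n; ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return bin_of(pts[a]) < bin_of(pts[b]); });

    // order[start[b]] .. order[start[b + 1] - 1] are the particles in bin b.
    std::vector<int> start(nb * nb * nb + 1, 0);
    for (int i = 0; i < n; ++i) ++start[bin_of(pts[i]) + 1];
    for (std::size_t b = 1; b < start.size(); ++b) start[b] += start[b - 1];

    std::vector<std::vector<int>> result(n);
    for (int i = 0; i < n; ++i) {  // would be one GPU thread per particle
        const Point& p = pts[i];
        std::vector<std::pair<float, int>> found;  // (squared distance, index)
        for (int dz = -1; dz <= 1; ++dz)
        for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            const int bx = coord(p.x) + dx, by = coord(p.y) + dy, bz = coord(p.z) + dz;
            if (bx < 0 || by < 0 || bz < 0 || bx >= nb || by >= nb || bz >= nb) continue;
            const int b = (bz * nb + by) * nb + bx;
            for (int s = start[b]; s < start[b + 1]; ++s) {
                const int j = order[s];
                if (j == i) continue;
                const float ex = pts[j].x - p.x, ey = pts[j].y - p.y, ez = pts[j].z - p.z;
                const float d2 = ex * ex + ey * ey + ez * ez;
                if (d2 <= cutoff * cutoff) found.emplace_back(d2, j);
            }
        }
        std::sort(found.begin(), found.end());  // nearest first
        for (const auto& f : found) result[i].push_back(f.second);
    }
    return result;
}
```

Sorting the per-particle candidates at the end is what gives the nearest-to-farthest order asked for; if you only need the set of neighbours, that final sort can be dropped.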
Fast k Nearest Neighbor Search using GPU
I haven't tested or used it. I just googled and posted the first link I found.
Related
I am new to CUDA. While writing a fast program that sums a 3D array along the 3rd dimension, some questions came to mind:
The most natural way is to assign one thread to each matrix entry and have each thread loop over the 3rd dimension. Under this scenario, is the memory access considered coalesced? Neighboring threads access neighboring elements; the only stride is on the loop variable.
For improved performance, a reduction on the 3rd dimension certainly helps.
Are there any libraries to use? For 2D summation, using cuBLAS is considered a good choice. I am thinking of a forced type conversion that tricks the compiler into treating the piece of memory as a 2D array, and then using a matrix-vector multiplication from cuBLAS.
That's a coalesced read.
You can use cuBLAS in the same way. Just tell GEMV that the first (uncontracted) dimension is nx*ny.
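To spell out why both points hold, here is a small host-side C++ sketch. It assumes the array is stored with x varying fastest, i.e. element (i, j, k) sits at `data[(k * ny + j) * nx + i]`; the function name `sum_over_z` is just for illustration. The loop over `t` stands in for the GPU threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums a 3D array over its third (slowest-varying) dimension.
// Assumed layout: data[(k * ny + j) * nx + i], i.e. x varies fastest.
std::vector<float> sum_over_z(const std::vector<float>& data,
                              std::size_t nx, std::size_t ny, std::size_t nz) {
    assert(data.size() == nx * ny * nz);
    const std::size_t plane = nx * ny;        // the "uncontracted" dimension
    std::vector<float> out(plane, 0.0f);
    for (std::size_t t = 0; t < plane; ++t)   // would be one GPU thread per t
        for (std::size_t k = 0; k < nz; ++k)  // each thread loops over z
            out[t] += data[k * plane + t];    // threads t and t+1 read adjacent floats
    return out;
}
```

For a fixed `k`, consecutive values of `t` touch consecutive addresses, which is what makes the read coalesced. The same index expression is a column-major matrix with nx*ny rows and nz columns, so the sum is that matrix times a vector of nz ones, which is how a GEMV call would be set up under this layout assumption.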
I have a large rectangular matrix NxM in GPU memory, stored as a 1-dimensional array in row-by-row representation. Let us say that this matrix is actually composed of submatrices of size nxm. For simplicity, assume that N is a multiple of n, and likewise M of m. Let us say the data type of the array is float or double.
What is an efficient method to find the index of the extremum in each sub-matrix? For example, how can I find the 1-dimensional index of the maximum element of each submatrix and write those indices to some array?
I can hardly imagine being so self-confident (or arrogant?) as to say that one particular solution is the "most efficient way" to do something.
However, some thoughts (without the claim to cover the "most efficient" solution) :
I think that there are basically two "orthogonal" ways of approaching this:
For all sub-matrices in parallel: Find the extremum sequentially
For all sub-matrices sequentially: Find the extremum in parallel
The question of which one is more appropriate probably depends on the sizes of the matrices. You mentioned that "N is a multiple of n" (similarly for M and m). Let's say the matrix of size M x N is composed of a*b sub-matrices of size m x n.
For the first approach, one could simply let each thread take care of one sub-matrix, with a trivial loop like
for (all elements of my sub-matrix) max = element > max ? element : max;
The prerequisite here is that a*b is "reasonably large". That is, when you can launch this kernel for, let's say, 10000 sub-matrices, then this could already bring a good speedup.
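As a rough sketch of this first approach (written as host-side C++ rather than a kernel), the function below tracks the index of the maximum rather than only its value, since the question asks for indices. It assumes a row-major matrix; the name `argmax_per_block` and its parameters are hypothetical. The outer loop over `s` is what would become one GPU thread per sub-matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns, for each sub-matrix, the 1D index (into `mat`) of its maximum.
// `mat` is row-major with `rows` x `cols` entries, tiled by sub-matrices of
// `sub_rows` x `sub_cols`; sub-matrices are numbered row by row.
std::vector<std::size_t> argmax_per_block(const std::vector<float>& mat,
                                          std::size_t rows, std::size_t cols,
                                          std::size_t sub_rows, std::size_t sub_cols) {
    assert(sub_rows > 0 && sub_cols > 0 && mat.size() == rows * cols);
    assert(rows % sub_rows == 0 && cols % sub_cols == 0);
    const std::size_t a = rows / sub_rows, b = cols / sub_cols;
    std::vector<std::size_t> best(a * b);
    for (std::size_t s = 0; s < a * b; ++s) {  // would be one GPU thread per sub-matrix
        const std::size_t r0 = (s / b) * sub_rows;  // top-left corner of sub-matrix s
        const std::size_t c0 = (s % b) * sub_cols;
        std::size_t arg = r0 * cols + c0;
        for (std::size_t r = 0; r < sub_rows; ++r)
            for (std::size_t c = 0; c < sub_cols; ++c) {
                const std::size_t idx = (r0 + r) * cols + (c0 + c);
                if (mat[idx] > mat[arg]) arg = idx;  // keeps the first maximum on ties
            }
        best[s] = arg;
    }
    return best;
}
```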
In contrast to that, in the second approach, each kernel launch (with all its threads) would take care of one sub-matrix. In this case, the kernel could be a standard "reduction" kernel. (The reduction is often presented as an example for "computing the sum/product of the elements of an array", but it works for any binary associative operation, so instead of computing the sum or product, one can basically use the same kernel for computing the minimum or maximum.) So the kernel would be launched once for each sub-matrix, and this would only make sense when the sub-matrix is "reasonably large".
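To illustrate how the reduction pattern carries over from sums to "index of the maximum", here is a host-side C++ sketch of the pairwise (tree) scheme. It is only a model of the access pattern, not a tuned kernel; `argmax_tree` is a made-up name, and `idx` is assumed to hold the 1D indices of the elements of one sub-matrix. The comparisons inside each pass are independent of each other, so on the GPU each would be done by a separate thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the element of `idx` whose entry in `values` is largest, found by
// repeatedly halving the number of active candidates.
std::size_t argmax_tree(const std::vector<float>& values, std::vector<std::size_t> idx) {
    assert(!idx.empty());
    std::size_t active = idx.size();
    while (active > 1) {
        const std::size_t half = (active + 1) / 2;
        // Slot t is compared with slot t + half; all slots t are independent.
        for (std::size_t t = 0; t + half < active; ++t)
            if (values[idx[t + half]] > values[idx[t]]) idx[t] = idx[t + half];
        active = half;
    }
    return idx[0];
}
```

Note that with equal maxima this scheme does not necessarily return the smallest index; if that matters, the comparison needs a tie-break on the index.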
However, in both cases, one has to consider the general performance guidelines. Particularly, since in this case, the operation is obviously memory-bound (and not compute-bound), one has to make sure that the accesses to global memory (that is, to the matrix itself) are coalesced, and that the occupancy that is created by the kernel is as high as possible.
EDIT: Of course, one could consider combining these approaches somehow, but I think that they at least show the most important directions in the space of available options.
I am trying to implement Dijkstra's algorithm using CUDA. I have code that does the same using MapReduce; this is the link: http://famousphil.com/blog/2011/06/a-hadoop-mapreduce-solution-to-dijkstra%E2%80%99s-algorithm/ But I want to implement something similar to what is given in the link using CUDA, with shared and global memory. Please tell me how to proceed, as I am new to CUDA. I don't know whether it is necessary to provide the input on both host and device in the form of a matrix, or what operation I should perform in the kernel function.
What about something like this? (Disclaimer: this is not a map-reduce solution.)
Let's say you have a graph G with N nodes and an adjacency matrix A with entries A[i,j] for the cost of going from node i to node j in the graph.
This version of Dijkstra's algorithm keeps a vector 'V' denoting a front, where V[i] is the current minimum distance from the origin to node i. In the standard Dijkstra's algorithm this information would be stored in a heap and popped off the top of the heap on every loop.
Running the algorithm now starts to look a lot like matrix algebra, in that one simply takes the vector and applies the adjacency matrix to it using the following update:
V[i] <- min{V[j] + A[j,i] | j in Nodes}
for all values of i in V. This is run as long as there are updates to V (this can be checked on the device, no need to load V back and forth to check!). Also store the transposed version of the adjacency matrix to allow sequential reads.
At most, the running time corresponds to the number of edges on the longest non-looping path through the graph, with one pass over the matrix per edge.
The interesting question now becomes how to distribute this across compute blocks, but it seems obvious to shard based on row indexes.
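Here is a small host-side C++ sketch of that update loop, to show the shape of the computation rather than a finished CUDA program. It assumes non-negative edge costs, represents a missing edge as infinity, and takes the transposed matrix `At` (so that `At[i * N + j]` is the cost of going from j to i); the function name `shortest_distances` is made up. The loop over `i` is what would be spread across threads, with each row of `At` read sequentially.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Repeats V[i] <- min{ V[j] + A[j,i] } until nothing changes.
// `At` is the transposed adjacency matrix, row-major: At[i * N + j] = A[j,i].
std::vector<double> shortest_distances(const std::vector<double>& At,
                                       std::size_t N, std::size_t source) {
    assert(At.size() == N * N && source < N);
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<double> V(N, INF);
    V[source] = 0.0;
    bool changed = true;
    while (changed) {  // on the GPU this flag would live in device memory
        changed = false;
        std::vector<double> next(V);
        for (std::size_t i = 0; i < N; ++i) {  // would be one GPU thread per node i
            for (std::size_t j = 0; j < N; ++j)
                next[i] = std::min(next[i], V[j] + At[i * N + j]);
            if (next[i] < V[i]) changed = true;
        }
        V.swap(next);
    }
    return V;
}
```

Writing into a separate `next` vector means every thread reads only the previous front, so the threads do not need to synchronise within one pass.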
I suggest you study these two prominent papers on efficient graph processing on the GPU. The first can be found here. It's rather straightforward and basically assigns one warp to process a vertex and its neighbors. You can find the second one here; it is more complicated. It efficiently produces the queue of next-level vertices in parallel, hence diminishing load imbalance.
After studying the above articles, you'll have a better understanding of why graph processing is challenging and where the pitfalls are. Then you can write your own CUDA program.
I am trying to access the coordinates at the previous and next indices inside the kernel.
ex: int idx = blockIdx.x * blockDim.x + threadIdx.x;
Then pos[idx].x, pos[idx].y, pos[idx].z would give the current coordinates of a point, but I cannot access the other two. I am trying to calculate the normals of a changing triangle on the GPU using CUDA.
How easily normals can be computed on the GPU depends on the mesh topology.
It is easy to compute normals for a mesh with triangle-list topology: Use one GPU thread per triangle. This results in highly regular reads and writes and will work for any valid configuration of blocks and threads in CUDA. Unfortunately, triangle-list topology isn't very useful (for starters, it will be flat-shaded unless some additional processing is employed).
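For the triangle-list case, the per-triangle work is small enough to sketch. The host-side C++ below assumes that vertices 3t, 3t+1 and 3t+2 of `pos` form triangle t (so there is no index buffer); `Vec3` and `face_normals` are illustrative names. Each iteration of the loop is what one GPU thread would do.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// One unit normal per triangle of a triangle list.
// Degenerate (zero-area) triangles get a zero normal.
std::vector<Vec3> face_normals(const std::vector<Vec3>& pos) {
    assert(pos.size() % 3 == 0);
    std::vector<Vec3> normals(pos.size() / 3);
    for (std::size_t t = 0; t < normals.size(); ++t) {  // would be one GPU thread per triangle
        const Vec3& a = pos[3 * t];
        const Vec3& b = pos[3 * t + 1];
        const Vec3& c = pos[3 * t + 2];
        const Vec3 e1{b.x - a.x, b.y - a.y, b.z - a.z};
        const Vec3 e2{c.x - a.x, c.y - a.y, c.z - a.z};
        Vec3 n{e1.y * e2.z - e1.z * e2.y,   // cross product e1 x e2
               e1.z * e2.x - e1.x * e2.z,
               e1.x * e2.y - e1.y * e2.x};
        const float len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
        if (len > 0.0f) { n.x /= len; n.y /= len; n.z /= len; }
        normals[t] = n;
    }
    return normals;
}
```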
It is [much] harder to compute normals for a mesh with triangle-strip topology (which is commonly used). The problem is that vertices are used in multiple triangles, and therefore you must accumulate a [weighted] average for each vertex-normal by combining multiple triangle-normals. Using one GPU thread per triangle means that multiple vert-norms will be affected by multiple GPU threads "simultaneously". Alternatively, using one GPU thread per vertex means that a list of the triangles that reference that vertex is needed, and then the triangles need to be read (pairs of additional verts) so that the vert-norm can be computed... which is difficult, but not impossible.
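For one simple special case the one-thread-per-vertex variant can be sketched directly: a single, unindexed triangle strip without restarts or degenerate triangles, where triangle t uses vertices t, t+1, t+2, so vertex v can only belong to triangles v-2, v-1 and v. Under that assumption (and with made-up names `Vec3`, `sub`, `cross`, `strip_vertex_normals`), a host-side C++ version looks like this. Each vertex gathers its own triangles, so no two threads write to the same normal.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

static Vec3 sub(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Area-weighted vertex normals for a single triangle strip.
std::vector<Vec3> strip_vertex_normals(const std::vector<Vec3>& pos) {
    assert(pos.size() >= 3);
    const std::size_t nv = pos.size(), nt = nv - 2;  // nt triangles in the strip
    std::vector<Vec3> normals(nv);
    for (std::size_t v = 0; v < nv; ++v) {  // would be one GPU thread per vertex
        Vec3 sum{0.0f, 0.0f, 0.0f};
        const std::size_t first = v >= 2 ? v - 2 : 0;
        const std::size_t last = std::min(v, nt - 1);
        for (std::size_t t = first; t <= last; ++t) {
            // Unnormalised cross product: its length is twice the triangle area,
            // which gives the area weighting for free.
            Vec3 n = cross(sub(pos[t + 1], pos[t]), sub(pos[t + 2], pos[t]));
            if (t % 2 == 1) n = {-n.x, -n.y, -n.z};  // odd triangles have flipped winding
            sum = {sum.x + n.x, sum.y + n.y, sum.z + n.z};
        }
        const float len = std::sqrt(sum.x * sum.x + sum.y * sum.y + sum.z * sum.z);
        if (len > 0.0f) sum = {sum.x / len, sum.y / len, sum.z / len};
        normals[v] = sum;
    }
    return normals;
}
```

For indexed meshes or strips with restarts, the set of triangles around a vertex is no longer implicit, and you are back to needing an explicit adjacency list or atomic accumulation from a per-triangle kernel.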
Finally, if your model uses indexed vertices, this will impose an additional [semi-random] look-up which may cause problems. This problem can be addressed with spatial partitioning.
You can still index with idx+1 and idx+2: every thread can read the whole array in global memory, not only its own element. Just make sure the indices stay within bounds.
For best efficiency you have to be a little careful about how you divide the job up into blocks and threads, so that the data for nearby points is handled on the same core.
I have to multiply a very small matrix (size 10x10) with a vector many times, 50000 to 100000 times (it could even be more than that). This happens for 1000 different matrices (it could be many more). Would there be any significant performance gain from doing this operation in CUDA?
Yes, it's an ideal task for the GPU.
If you want to multiply a single matrix with a vector 50K times and each multiplication depends on the result of the previous one, then don't use CUDA. It's a serial problem, best suited to the CPU. However, if each multiplication is independent, you can do them simultaneously in CUDA.
The only case where your program will get a tremendous speedup is when each vector multiplication iteration is independent of the data of the other iterations. This way you'll be able to launch 50K or more iterations simultaneously by launching an equal number of threads.
Depending on what exactly you're doing, then yes, this could be done very quickly on a GPU, but you might have to write your own kernel to get good performance from it.
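To show what "independent" means here, below is a host-side C++ sketch of the independent case, assuming one vector per matrix and row-major 10x10 matrices packed back to back; `DIM` and `batched_matvec` are names chosen for the example. No output element depends on any other, so every (matrix, row) pair could be handled by its own GPU thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t DIM = 10;  // matrix size from the question

// Multiplies each DIM x DIM matrix in `mats` with the matching vector in `vecs`.
// `mats` holds the matrices back to back (row-major), `vecs` the vectors.
std::vector<float> batched_matvec(const std::vector<float>& mats,
                                  const std::vector<float>& vecs) {
    assert(mats.size() % (DIM * DIM) == 0);
    const std::size_t batch = mats.size() / (DIM * DIM);
    assert(vecs.size() == batch * DIM);
    std::vector<float> out(batch * DIM, 0.0f);
    for (std::size_t b = 0; b < batch; ++b)    // these two loops are fully
        for (std::size_t r = 0; r < DIM; ++r) {  // independent: one thread per (b, r)
            float acc = 0.0f;
            for (std::size_t c = 0; c < DIM; ++c)
                acc += mats[(b * DIM + r) * DIM + c] * vecs[b * DIM + c];
            out[b * DIM + r] = acc;
        }
    return out;
}
```

If instead the output of one multiplication feeds the next, only the loop over the 1000 matrices stays parallel; the 50K steps for a given matrix have to run one after another.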
Without knowing more about your problem, I can't give you too much advice. But I could speculate on a solution:
If you're taking one vector and multiplying it by the same matrix several thousand times, you would be much better off finding the closed form of the matrix raised to an arbitrary power. You can do this using the Cayley–Hamilton theorem or the Jordan canonical form.
I can't seem to find an implementation of this from a quick googling, but considering I did this in first-year linear algebra, it's not too bad. Some info on the Jordan normal form and raising it to powers can be found at http://en.wikipedia.org/wiki/Jordan_normal_form#Powers. Its transformation matrices are just a matrix of (generalized) eigenvectors and the inverse of that matrix.
Say you have a matrix A, and you find the Jordan normal form J, and the transformation matrices P, P^-1, you find
A^n = P J^n P^-1
I can't seem to find a good link to an implementation of this, but computing the closed form of a 10x10 matrix would be significantly less time consuming than 50,000 matrix multiplications. And an implementation of this would probably run much quicker on a CPU.
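As a tiny worked illustration (not a general Jordan-form implementation), take the diagonalizable special case where J is just a diagonal matrix of eigenvalues. The example matrix A = [[2, 1], [1, 2]] is my own choice: it has eigenvalues 3 and 1 with eigenvectors (1, 1) and (1, -1), so P J^n P^-1 works out to 1/2 [[3^n + 1, 3^n - 1], [3^n - 1, 3^n + 1]]. The C++ sketch below compares that closed form with plain repeated multiplication; all names in it are made up for the example.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat2 = std::array<double, 4>;  // row-major 2x2 matrix

Mat2 multiply(const Mat2& a, const Mat2& b) {
    return {a[0] * b[0] + a[1] * b[2], a[0] * b[1] + a[1] * b[3],
            a[2] * b[0] + a[3] * b[2], a[2] * b[1] + a[3] * b[3]};
}

// Closed form of A^n for A = [[2, 1], [1, 2]]:
// A^n = P diag(3^n, 1^n) P^-1 with P = [[1, 1], [1, -1]].
Mat2 power_closed_form(unsigned n) {
    const double p = std::pow(3.0, static_cast<double>(n));
    return {(p + 1.0) / 2.0, (p - 1.0) / 2.0,
            (p - 1.0) / 2.0, (p + 1.0) / 2.0};
}

// Reference: n successive multiplications by A.
Mat2 power_by_repeated_multiplication(unsigned n) {
    const Mat2 A = {2.0, 1.0, 1.0, 2.0};
    Mat2 result = {1.0, 0.0, 0.0, 1.0};  // identity
    for (unsigned i = 0; i < n; ++i) result = multiply(result, A);
    return result;
}
```

The closed form costs the same for n = 10 as for n = 50,000, which is the point of the suggestion; a non-diagonalizable matrix would need the full Jordan blocks rather than plain eigenvalue powers.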
If this is your problem, you should look into this.