Calculate conditional mean - CUDA

I'm new to CUDA programming and am interested in implementing an algorithm that, when coded serially, calculates two or more means from a vector in one pass. What would be an efficient scheme for doing something like this in CUDA?
There are two vectors of length N: the element values, and indicator values identifying which subset each element belongs to.
Is there an efficient way to do this in one pass, or should it be done in M passes, where M is the number of means to be calculated, using a vector of index keys for the element values of each subset?

You can achieve this in one pass over the data with a single call to thrust::reduce_by_key. In particular, look at the "summary statistics" example, which computes several statistical properties of a single vector at once. You can generalize that method with reduce_by_key, which computes reductions over many sub-vectors in parallel. Your "indicator values" would be the "keys" that reduce_by_key uses to determine which sub-vector each element belongs to.
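A minimal sketch of the reduce_by_key idea for plain means (not the summary-statistics code itself; the example data is made up, and since reduce_by_key reduces runs of equal keys, the indicators are sorted first in case they aren't already grouped):

    // Sketch: per-subset means in one reduction pass with Thrust (hypothetical example data).
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <thrust/iterator/constant_iterator.h>
    #include <iostream>

    int main()
    {
        // Element values and the subset ("indicator") each element belongs to.
        float h_vals[] = { 1.f, 2.f, 3.f, 10.f, 20.f, 5.f };
        int   h_keys[] = { 0,   1,   0,   1,    1,    0   };
        const int N = 6;

        thrust::device_vector<float> vals(h_vals, h_vals + N);
        thrust::device_vector<int>   keys(h_keys, h_keys + N);

        // Group equal indicators together so reduce_by_key sees one run per subset.
        thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

        thrust::device_vector<int>   out_keys(N);
        thrust::device_vector<float> sums(N);
        thrust::device_vector<int>   counts(N);

        // Per-subset sums ...
        auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                          out_keys.begin(), sums.begin());
        int M = ends.first - out_keys.begin();   // number of distinct subsets

        // ... and per-subset counts (reduce a stream of 1s with the same keys).
        thrust::reduce_by_key(keys.begin(), keys.end(),
                              thrust::constant_iterator<int>(1),
                              out_keys.begin(), counts.begin());

        for (int i = 0; i < M; ++i) {
            int   key = out_keys[i];
            float sum = sums[i];
            int   cnt = counts[i];
            std::cout << "mean of subset " << key << " = " << sum / cnt << "\n";
        }
        return 0;
    }

If a strict single reduction pass is required, the sums and counts can be fused into one reduce_by_key over a zip_iterator of (value, 1) pairs; the two-call version above is simply easier to read.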

Partition each vector into smaller vectors and use threads to sum the required elements of each sub-vector. Then combine the sums and generate the global means. I would try to generate the M means at the same time rather than do M passes.
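A sketch of that scheme (hypothetical kernel and names, not taken from the answer): each block accumulates per-subset partial sums and counts in shared memory for its share of the elements, then folds them into global totals, so all M means come out of a single pass over the data.

    // Sketch only: per-block partial sums/counts, combined into global totals.
    // Assumes the number of subsets is small (<= MAX_SUBSETS), blockDim.x >= MAX_SUBSETS,
    // and that sums/counts are zero-initialised before the launch.
    #define MAX_SUBSETS 32

    __global__ void partial_means(const float* values, const int* indicator, int n,
                                  float* sums, int* counts)
    {
        __shared__ float sSum[MAX_SUBSETS];
        __shared__ int   sCnt[MAX_SUBSETS];
        if (threadIdx.x < MAX_SUBSETS) { sSum[threadIdx.x] = 0.f; sCnt[threadIdx.x] = 0; }
        __syncthreads();

        // Grid-stride loop: each block sums its share of the elements into shared memory.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x) {
            int s = indicator[i];
            atomicAdd(&sSum[s], values[i]);
            atomicAdd(&sCnt[s], 1);
        }
        __syncthreads();

        // Combine this block's partial results into the global sums/counts.
        if (threadIdx.x < MAX_SUBSETS) {
            atomicAdd(&sums[threadIdx.x], sSum[threadIdx.x]);
            atomicAdd(&counts[threadIdx.x], sCnt[threadIdx.x]);
        }
    }
    // On the host, after copying sums/counts back: mean[s] = sums[s] / counts[s].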

Related

How to access the elements of a sparse matrix efficiently in Eigen library?

I have filled in a sparse matrix A using the Eigen library. I then need to access the non-zero elements of the sparse matrix; if I access them as A(rowindex, colindex), it is very slow.
I have also tried using std::unordered_map to solve this problem, but it is also very slow.
Is there any efficient way to solve this problem?
Compressed sparse matrices are stored in CSR or CSC format. Considering how a CSR matrix stores entries internally, there is an array storing the x nonzero values, a corresponding array of length x storing their respective column locations, and an array (usually much smaller than the other two) "pointing" to where rows change in those arrays.
There is no way to know where each nonzero element is, or whether a given row-column pair exists, without searching for it in the two ordered arrays of rows (outer index) and columns (inner index). This isn't very efficient for accessing elements in a random fashion.
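If the goal is to visit all stored nonzeros rather than to probe arbitrary (row, column) pairs, the efficient route in Eigen is to walk the compressed storage directly with InnerIterator, which is linear in the number of nonzeros. A minimal sketch:

    #include <Eigen/Sparse>
    #include <iostream>

    int main()
    {
        Eigen::SparseMatrix<double> A(4, 4);   // column-major (CSC) by default
        A.insert(0, 1) = 3.0;
        A.insert(2, 3) = 7.0;
        A.makeCompressed();

        // Visit every stored nonzero in O(nnz): outer loop over columns,
        // inner loop over the entries stored in each column.
        for (int k = 0; k < A.outerSize(); ++k)
            for (Eigen::SparseMatrix<double>::InnerIterator it(A, k); it; ++it)
                std::cout << "A(" << it.row() << "," << it.col() << ") = "
                          << it.value() << "\n";
        return 0;
    }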

Dividing a vector of points into two spaces

I have a memory-mapped file of many millions of 3D points as an STL vector, using CGAL. Given an arbitrary plane that divides the dataset into approximately equal parts, I would like to sort the dataset such that all inside points are contiguous in the vector, and likewise the outside points. This process then needs to be repeated to an arbitrary depth, creating a non-axis-aligned BSP tree.
Due to the size of the dataset I would like to do this in place if possible. I have a predicate functor that I use to create a filtered_iterator, but of course that doesn't sort the points, it just skips non-matching ones. So I could create a second vector and copy the sorted points into that, and then re-use the original vector round-robin style, but I would like to avoid that if possible, if only to keep the iterators that mark the start and end of each space.
Of course, by invoking the question gods, I received direct communication from them almost as soon as I posted!
I had simply been blind to the STL algorithm std::partition, which does exactly what I need.
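For the record, this is roughly what that looks like (a minimal sketch with a hypothetical Point type standing in for the CGAL point; std::partition reorders the vector in place and returns the iterator separating the two spaces):

    #include <algorithm>
    #include <vector>

    struct Point { double x, y, z; };                 // stand-in for the CGAL point type

    // Plane with normal (nx, ny, nz) and offset d; "inside" means nx*x + ny*y + nz*z + d < 0.
    struct InsidePlane {
        double nx, ny, nz, d;
        bool operator()(const Point& p) const {
            return nx * p.x + ny * p.y + nz * p.z + d < 0.0;
        }
    };

    // Reorders the points in place so that all "inside" points come first and
    // returns the iterator separating the two spaces (the start of the "outside" half).
    std::vector<Point>::iterator split(std::vector<Point>& pts, const InsidePlane& inside)
    {
        return std::partition(pts.begin(), pts.end(), inside);
    }

The same call can then be applied recursively to each half to build the deeper BSP levels; std::stable_partition is the alternative if the relative order within each half needs to be preserved, at the cost of extra memory or time.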

CUDA: How to find index of extrema in sub matrices?

I have a large rectangular matrix NxM in GPU memory, stored as a 1-dimensional array in row-major order. Let us say that this matrix is actually composed of sub-matrices of size nxm. For simplicity, assume that N is a multiple of n and likewise M of m. Let us say the data type of the array is float or double.
What is an efficient method to find the index of the extremum in each sub-matrix? For example, how do I find the 1-dimensional index of the maximum element of each sub-matrix and write those indices to some array?
I can hardly imagine being so self-confident (or arrogant?) as to say that one particular solution is the "most efficient way" to do something.
However, some thoughts (without any claim of covering the "most efficient" solution):
I think that there are basically two "orthogonal" ways of approaching this:
For all sub-matrices in parallel: Find the extremum sequentially
For all sub-matrices sequentially: Find the extremum in parallel
The question of which one is more appropriate probably depends on the sizes of the matrices. You mentioned that "N is a multiple of n" (similarly for M and m). Let's say the matrix of size M x N is composed of a*b sub-matrices of size m x n.
For the first approach, one could simply let each thread take care of one sub-matrix, with a trivial loop like
for (all elements of my sub-matrix) max = element > max ? element : max;
The prerequisite here is that a*b is "reasonably large". That is, if you can launch this kernel for, say, 10000 sub-matrices, then this could already bring a good speedup.
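A minimal concrete version of that "trivial loop" (hypothetical kernel, using the question's N x M matrix and n x m sub-matrices in row-major order, one thread per sub-matrix):

    // Sketch (first approach): one thread scans one n x m sub-matrix sequentially.
    // data is the N x M matrix in row-major order; maxIdx has one slot per sub-matrix.
    __global__ void argmaxPerSubmatrix(const float* data, int N, int M, int n, int m,
                                       int* maxIdx)
    {
        int subRows = N / n;                            // sub-matrices per column
        int subCols = M / m;                            // sub-matrices per row
        int s = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's sub-matrix
        if (s >= subRows * subCols) return;

        int r0 = (s / subCols) * n;                     // top-left corner of sub-matrix s
        int c0 = (s % subCols) * m;

        int   best    = r0 * M + c0;
        float bestVal = data[best];
        for (int r = r0; r < r0 + n; ++r)
            for (int c = c0; c < c0 + m; ++c) {
                int idx = r * M + c;                    // 1-D index into the full matrix
                if (data[idx] > bestVal) { bestVal = data[idx]; best = idx; }
            }
        maxIdx[s] = best;                               // 1-D index of this sub-matrix's maximum
    }
    // Launch with one thread per sub-matrix, e.g.:
    // argmaxPerSubmatrix<<<(subCount + 255) / 256, 256>>>(data, N, M, n, m, maxIdx);
    // where subCount = (N / n) * (M / m).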
In contrast to that, in the second approach, each kernel (with all its threads) would take care of one sub-matrix. In this case, the kernel could be a standard "reduction" kernel. (The reduction is usually presented as an example of "computing the sum/product of the elements of an array", but it works for any binary associative operation, so instead of computing the sum or product, one can basically use the same kernel to compute the minimum or maximum.) So the kernel would be launched once for each sub-matrix, and this only makes sense when each sub-matrix is "reasonably large".
However, in both cases, one has to consider the general performance guidelines. In particular, since the operation here is obviously memory-bound (and not compute-bound), one has to make sure that the accesses to global memory (that is, to the matrix itself) are coalesced, and that the occupancy achieved by the kernel is as high as possible.
EDIT: Of course, one could consider somehow combining these approaches, but I think they at least show the most important directions in the space of available options.
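As a sketch of one such combination (hypothetical kernel: one block per sub-matrix launched in a single grid, with a standard shared-memory reduction inside each block that carries the 1-D index along with the value):

    #include <cfloat>   // FLT_MAX

    // Sketch (combined approach): blockIdx.x selects the sub-matrix, the block's threads
    // cooperatively reduce it. Assumes blockDim.x == 256 (a power of two).
    __global__ void argmaxBlockPerSubmatrix(const float* data, int N, int M, int n, int m,
                                            int* maxIdx)
    {
        __shared__ float sVal[256];
        __shared__ int   sIdx[256];

        int subCols = M / m;
        int s  = blockIdx.x;                 // this block's sub-matrix
        int r0 = (s / subCols) * n;
        int c0 = (s % subCols) * m;

        // Each thread scans a strided subset of the sub-matrix's n*m elements.
        float bestVal = -FLT_MAX;
        int   best    = r0 * M + c0;
        for (int e = threadIdx.x; e < n * m; e += blockDim.x) {
            int idx = (r0 + e / m) * M + (c0 + e % m);
            if (data[idx] > bestVal) { bestVal = data[idx]; best = idx; }
        }
        sVal[threadIdx.x] = bestVal;
        sIdx[threadIdx.x] = best;
        __syncthreads();

        // Standard tree reduction over the block, with "max" in place of "+".
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride && sVal[threadIdx.x + stride] > sVal[threadIdx.x]) {
                sVal[threadIdx.x] = sVal[threadIdx.x + stride];
                sIdx[threadIdx.x] = sIdx[threadIdx.x + stride];
            }
            __syncthreads();
        }
        if (threadIdx.x == 0) maxIdx[s] = sIdx[0];   // winner for this sub-matrix
    }
    // Launch with one block per sub-matrix:
    // argmaxBlockPerSubmatrix<<<(N / n) * (M / m), 256>>>(data, N, M, n, m, maxIdx);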

Alternatives to the Big-O notation?

Good afternoon all,
We say that a hashtable has O(1) lookup (provided that we have the key), whereas a linked list has O(1) lookup for the next node (provided that we have a reference to the current node).
However, due to how the Big-O notation works, it is not very useful for expressing (or differentiating) the cost of an algorithm that takes x steps versus one that takes x + m steps.
For example, even though we label both the hashtable's lookup and the linked list's lookup as O(1), these two O(1)s boil down to a very different number of steps indeed.
The linked list's lookup is fixed at x steps. The hashtable's lookup, however, is variable: its cost depends on the cost of the hashing function, so the number of steps required for the hashtable's lookup is x + m,
where x is a fixed number
and m is an unknown variable value
In other words, even though we call both operations O(1), the cost of the hashtable's lookup can be of an entirely different magnitude than the cost of the linked list's lookup.
The Big-O notation is specifically about the size of the input data collection. This has its advantages, but it has its disadvantages as well, as can be seen when we collapse and normalize all non-n variables into a constant: we can no longer see the m variable (the cost of the hashing function) inside it.
Besides the Big-O notation, is there another (established) notation we can use to express the fixed-cost O(1), which means x operations, versus the variable-cost O(1), which means x + m operations (m being the cost of the hashing function)?
"literal O(1) which means exactly 1 operation"
Except it doesn't. The Big-O notation concerns a relative comparison of complexity in relation to an input. If the algorithm takes a constant number of steps, completely independent of the size of your input, then the exact number of steps doesn't matter.
Take a look at the (informal) definition of big O:

    f(n) is in O(g(n)) if there is a constant k > 0 such that f(n) <= k * g(n) for all (sufficiently large) n.

It means: there is a certain k so that for every n, the function f is no larger than k times the function g.
In the case above, the hashtable lookup and the linked-list lookup would each be f, and g would be g(n) = 1. For each of them, you are able to find a k such that f(n) <= g(n) * k.
Now, this k doesn't need to be fixed; it can vary depending on platform, implementation, and specific hardware. The only interesting point is that it exists. That's why both the hashtable lookup and the linked-list node lookup are O(1): both have constant complexity, regardless of the input. And when evaluating algorithms, that's what is interesting, not the physical steps.
Specifically concerning the hashtable lookup
Yes, the hash function does take a variable number of operations (depending on the implementation). However, it doesn't take a variable number of operations depending on the size of the input. Big-O notation is specifically about the size of the input data collection. A hash function takes a single element. For the evaluation of an algorithm it doesn't matter whether a certain function takes 10, 20, 50 or 100 operations; if the number of operations doesn't increase with the input size, it is O(1). There is no way to distinguish this in Big-O notation, because this isn't what Big-O notation is about.
"~" includes the constant factor - see the family of bachmann functions
The issue is that the "number of operations" is highly context-dependent. In fact, that's why Big-O notation was invented -- it seems to work rather well in modelling a broad range of computers.
Besides, what a programmer thinks the number of "ops" is doesn't determine how much time it actually takes (e.g., is the data already in cache?), or how many steps the hardware actually takes (what does your processor do, exactly? Does it have micro-ops?), or even how many operations are dictated to the processor (what is your compiler doing for you?). And those are all real concerns, even when you try to define a precise concept that's abstract enough to be useful.
So. For now, it's Big-O vs. "operations" -- whatever "operations" means to you and your colleagues at the time.

Randomly sorting an array

Does there exist an algorithm which, given an ordered list of symbols {a1, a2, a3, ..., ak}, produces in O(n) time a new list of the same symbols in a random order without bias? "Without bias" means the probability that any symbol s will end up in some position p in the list is 1/k.
Assume it is possible to generate a non-biased integer from 1-k inclusive in O(1) time. Also assume that O(1) element access/mutation is possible, and that it is possible to create a new list of size k in O(k) time.
In particular, I would be interested in a 'generative' algorithm. That is, I would be interested in an algorithm that has O(1) initial overhead, and then produces a new element for each slot in the list, taking O(1) time per slot.
If no solution exists to the problem as described, I would still like to know about solutions that do not meet my constraints in one or more of the following ways (and/or in other ways if necessary):
the time complexity is worse than O(n).
the algorithm is biased with regards to the final positions of the symbols.
the algorithm is not generative.
I should add that this problem appears to be the same as the problem of randomly ordering the integers from 1 to k, since we can randomly order the list of integers from 1 to k and then, for each integer i in the new list, produce the symbol ai.
Yes - the Knuth Shuffle.
The Fisher-Yates Shuffle (Knuth Shuffle) is what you are looking for.
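For completeness, a minimal sketch of it (modern C++, with std::mt19937 standing in for the O(1) unbiased integer source the question assumes):

    #include <random>
    #include <utility>   // std::swap
    #include <vector>

    // In-place Fisher-Yates (Knuth) shuffle: O(k) time for k elements, unbiased as long
    // as the underlying integer draws are unbiased.
    template <typename T>
    void knuth_shuffle(std::vector<T>& a, std::mt19937& rng)
    {
        for (std::size_t i = a.size(); i > 1; --i) {
            std::uniform_int_distribution<std::size_t> pick(0, i - 1);
            std::swap(a[i - 1], a[pick(rng)]);   // move a uniformly chosen remaining element into slot i-1
        }
    }

The "inside-out" variant of the same shuffle builds the output one slot at a time with O(1) work per slot, which comes close to the "generative" behaviour asked for; and std::shuffle in <algorithm> already provides the same uniform shuffling if rolling your own isn't required.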