Is there an alternative to std::memcmp in CUDA?

Is there an alternative to std::memcmp in CUDA?
I want to compare whole rows of a matrix. On the CPU, simply calling std::memcmp is fine; is there an efficient way to do this on the GPU?
The operation is like the one in Sorting arrays in NumPy by column.

While not functionally identical to std::memcmp, the thrust template library includes a comparison reduction operation thrust::equal which will return true or false when the elements of two iterator ranges compare identically.
If you actually require the sign of the comparison, you would need to write your own implementation.
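For illustration, a minimal sketch of that idea, assuming the two rows to compare are already resident in device memory (the vector names and sizes here are made up):

#include <thrust/device_vector.h>
#include <thrust/equal.h>
#include <thrust/execution_policy.h>
#include <iostream>

int main()
{
    const int cols = 8;

    // Two hypothetical rows of a row-major matrix, already on the device.
    thrust::device_vector<int> d_row_a(cols, 1);
    thrust::device_vector<int> d_row_b(cols, 1);
    d_row_b[3] = 2;   // introduce a difference

    // thrust::equal returns true only if every pair of elements compares equal.
    bool same = thrust::equal(thrust::device,
                              d_row_a.begin(), d_row_a.end(),
                              d_row_b.begin());

    std::cout << (same ? "rows match" : "rows differ") << std::endl;
    return 0;
}

In the matrix case you would point the iterator ranges at the start of each row (for example, raw device pointers wrapped with thrust::device_pointer_cast) rather than at whole vectors.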

Related

Constructor/Function overload signature lookup time complexity?

I was reading up on the std::string class in C++ and noticed there are quite a few different constructors available giving us a wide set of initialization features. This got me wondering how a compiler picks which constructor to choose when given parameters, or in the case of overloads, how a compiler matches a function signature with a given set of parameters.
If we have the following functions declared in pseudo-code:
function f1(int numberHere) {
//....do something
}
function f1(int numberHere, string stringHere) {
//....do something
}
And I decide to call f1(4), there are obviously two options to choose from, but what if there are 10000 options/signatures? Would it take proportionally longer? If so, what takes longer? Does the compiler have some sneaky O(n) way to index overloads such that it can call the right one in O(1) time once the program is running or would it compile in O(1) no matter how many overloads exist but take longer to run the finished result because of on-the-fly signature matching?
Can this question even be answered effectively?
Thanks!
Matching function signatures is actually no different from any other search or lookup problem. There are three basic ways to do it, depending on the data structure you store the available function signatures in:
Use an unsorted list or array and get O(n) time complexity.
Use a sorted array or a tree-like structure and get O(log(n)). (You can sort by type of 1st argument, then 2nd and so on, assuming that each type has an integer id assigned to it.)
Use a hash map and get O(1).
But I doubt that time complexity has any practical relevance in this case. It describes the asymptotic behaviour of algorithms for large values of n. Even for n=100, an unsorted array search might be faster than a hash map lookup because it has less overhead.
And from a usability point of view it is a very bad idea to design an API having functions with 10 or even 100 overloads.
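Purely as an illustration of the sorted/tree-based variant from the list above (the integer type ids and the signature table are invented for this sketch; real overload resolution is considerably more involved, since implicit conversions also have to be considered):

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical integer ids for types, as suggested above.
enum TypeId { INT = 0, STRING = 1 };

int main()
{
    // Signature table: argument-type list -> the overload it selects.
    // std::map is tree-based, so lookup is O(log n) in the number of overloads.
    std::map<std::vector<int>, std::string> overloads = {
        { { INT },         "f1(int)"         },
        { { INT, STRING }, "f1(int, string)" },
    };

    // "Resolving" the call f1(4): look up the argument-type list {INT}.
    auto it = overloads.find({ INT });
    if (it != overloads.end())
        std::cout << "selected overload: " << it->second << std::endl;
    return 0;
}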

How to perform Hadamard product with CUBLAS on complex numbers?

I need to compute the element-wise multiplication of two vectors (Hadamard product) of complex numbers with NVIDIA CUBLAS. Unfortunately, there is no HAD operation in CUBLAS. Apparently, you can do this with the SBMV operation, but it is not implemented for complex numbers in CUBLAS. I cannot believe there is no way to achieve this with CUBLAS. Is there any other way to achieve that with CUBLAS, for complex numbers?
I cannot write my own kernel, I have to use CUBLAS (or another standard NVIDIA library if it is really not possible with CUBLAS).
CUBLAS is based on the reference BLAS, and the reference BLAS has never contained a Hadamard product (complex or real). Hence CUBLAS doesn't have one either. Intel have added v?Mul to MKL for doing this, but it is non-standard and not in most BLAS implementations. It is the kind of operation that an old school fortran programmer would just write a loop for, so I presume it really didn't warrant a dedicated routine in BLAS.
There is no "standard" CUDA library I am aware of which implements a Hadamard product. There would be the possibility of using CUBLAS GEMM or SYMM to do this and extracting the diagonal of the resulting matrix, but that would be horribly inefficient, both from a computation and storage stand point.
The Thrust template library can do this trivially using thrust::transform, for example:
thrust::multiplies<thrust::complex<float> > op;
thrust::transform(thrust::device, x, x + n, y, z, op);
would iterate over each pair of inputs from the device pointers x and y and calculate z[i] = x[i] * y[i] (there are probably a couple of casts you need to make to compile that, but you get the idea). But that effectively requires compilation of CUDA code within your project, and apparently you don't want that.
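A self-contained version of that idea might look like the following sketch; it sidesteps the casts by holding the data in thrust::complex<float> device vectors (if your data lives in raw cuComplex device arrays, you would reinterpret_cast the pointers instead):

#include <thrust/complex.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/transform.h>
#include <iostream>

int main()
{
    const int n = 4;

    // x and y are the inputs, z receives the Hadamard product z[i] = x[i] * y[i].
    thrust::device_vector<thrust::complex<float> > x(n, thrust::complex<float>(1.f, 2.f));
    thrust::device_vector<thrust::complex<float> > y(n, thrust::complex<float>(3.f, -1.f));
    thrust::device_vector<thrust::complex<float> > z(n);

    thrust::multiplies<thrust::complex<float> > op;
    thrust::transform(thrust::device, x.begin(), x.end(), y.begin(), z.begin(), op);

    thrust::complex<float> z0 = z[0];
    std::cout << z0 << std::endl;   // (1+2i)(3-i) = 5+5i
    return 0;
}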

Summing the elements with even or odd indices by CUDA Thrust

If I use
float sum = thrust::transform_reduce(d_a.begin(), d_a.end(), conditional_operator(), 0.f, thrust::plus<float>());
I get the sum of all elements meeting a condition provided by conditional_operator(), as in Conditional reduction in CUDA.
But how can I sum only the elements d_a[0], d_a[2], d_a[4], d_a[6], ..... ?
I thought of changing the conditional operator, but it works only on the elements of the array, without any reference to their indices.
What can I do for that?
There are two approaches I can think of for solving this sort of problem:
Use the thrust zip operator to combine a counting iterator with the input data and modify your existing functor to accept tuples of (index, data). You can have the functor return the data when the index matches your criteria, and zero otherwise. This will work correctly with scan and reduction algorithms.
Use a thrust permutation iterator to gather the data which you want to sum and pass it to the standard reduce algorithm. The thrust developers have an example strided iterator which you can use to solve the problem of only processing every nth entry in an input iterator.
It might be worth implementing both and benchmarking them to see which approach is faster.
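A minimal sketch of the first approach (the functor name and the example data are mine):

#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform_reduce.h>
#include <thrust/tuple.h>
#include <iostream>

// Returns the value when its index is even, and zero otherwise.
struct even_index_value
{
    __host__ __device__
    float operator()(const thrust::tuple<int, float>& t) const
    {
        return (thrust::get<0>(t) % 2 == 0) ? thrust::get<1>(t) : 0.f;
    }
};

int main()
{
    thrust::device_vector<float> d_a(8, 1.f);   // example input data

    // Zip a counting iterator (the index) together with the data.
    auto first = thrust::make_zip_iterator(
        thrust::make_tuple(thrust::make_counting_iterator(0), d_a.begin()));
    auto last = first + d_a.size();

    float sum = thrust::transform_reduce(thrust::device, first, last,
                                         even_index_value(), 0.f,
                                         thrust::plus<float>());

    std::cout << "sum of even-indexed elements: " << sum << std::endl;   // 4
    return 0;
}

Summing the odd-indexed elements only needs the test flipped to % 2 == 1.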

How to convert a sparse histogram into dense histogram in CUDA?

I am implementing an algorithm using raw CUDA kernels, in which every threadblock needs the dense histogram of the data available to that threadblock. Now the question is: do I have to calculate the dense histogram from scratch? (Is it worth calculating the dense histogram at all, provided that I already have the sparse histogram, which is implemented using shared memory?)
I have come up with an idea for the conversion, which I will try to elaborate with an example (temp and hist are both in shared memory):
0,1,2,3,4,5,6... //array indexes
4,3,0,2,1,0,5... //contents of hist[]
0,0,2,0,0,5,0... //contents of temp[] if(hist[x]>0)temp[x]=x;
for_every_element //this is sequential part :(
if(temp[x]>0)
shift elements from index x to 256
4,3,2,1,0,5... //pass 1 of the for loop
4,3,2,1,5... //pass 2 of the for loop
//this goes on until all the 0s are compacted
Now I know the above is sequential in nature, but the shifting can be done in constant time (and in parallel) because threads_per_block is already set to 256, so shifting is not the main issue; the main issue is how to improve this (any other suggestion is welcome).
Edit: I am thinking of another idea, which is as follows.
Assuming threads_per_block=256, if I can count which of the histogram bins are non-zero (this operation is parallel because each thread is assigned to one bin; I can atomicAdd the values generated by each thread), then I can start a new shared index variable sindex=0, and each time a thread wants to store a value into d_hist[] it can take the latest value from sindex and store its value as d_hist[sindex]=hist[threadIdx.x]; after that I can atomicAdd sindex.
Now there is only one problem: there is going to be a race condition in getting the value of sindex, so I may have to set up a flag which can be locked or unlocked when a thread is adding a value to d_hist (but I think there can be a deadlock situation here).
Will this technique work? and is there any other technique better than that?
Converting a sparse histogram to a dense histogram is a scatter operation. If the sparse histogram is composed of s_index[S_N] and s_hist[S_N], then first we create a dense histogram d_hist[N] composed of all zeroes (you can do this from host code, perhaps). Then we populate the dense histogram with d_hist[s_index[i]] = s_hist[i]; This can be done in parallel and uses as many threads as there are valid indices in your sparse histogram (i < S_N). Assuming your histogram is sorted, you'll get whatever coalescing benefit may be possible based on the distribution of your sparse histogram indices.
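A minimal sketch of that scatter written as a raw kernel, one thread per valid sparse entry (S_N, s_index, s_hist and d_hist are placeholders for whatever layout your code actually uses, and d_hist is assumed to have been zeroed beforehand, e.g. with cudaMemset):

__global__ void sparse_to_dense(const int* s_index, const int* s_hist,
                                int* d_hist, int S_N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < S_N)
        d_hist[s_index[i]] = s_hist[i];   // scatter the count to its dense bin
}

// launch example: sparse_to_dense<<<(S_N + 255) / 256, 256>>>(s_index, s_hist, d_hist, S_N);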
It may not make sense for your case where each threadblock is doing a separate histogram, but you may also be interested in thrust scatter.
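For the device-wide case, the same operation with thrust::scatter might look like this sketch (the values are taken from the question's example; the variable names are mine):

#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/scatter.h>
#include <iostream>

int main()
{
    const int N = 8;                        // number of dense bins (made up for the sketch)
    const int S_N = 5;                      // number of valid sparse entries

    // Sparse histogram: indices of the non-empty bins and their counts.
    int h_index[] = { 0, 1, 3, 4, 6 };
    int h_hist[]  = { 4, 3, 2, 1, 5 };
    thrust::device_vector<int> s_index(h_index, h_index + S_N);
    thrust::device_vector<int> s_hist(h_hist, h_hist + S_N);

    // Dense histogram initialised to zero, then d_hist[s_index[i]] = s_hist[i].
    thrust::device_vector<int> d_hist(N, 0);
    thrust::scatter(thrust::device,
                    s_hist.begin(), s_hist.end(),
                    s_index.begin(),
                    d_hist.begin());

    for (int i = 0; i < N; ++i)
        std::cout << d_hist[i] << " ";      // 4 3 0 2 1 0 5 0
    std::cout << std::endl;
    return 0;
}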
Well, I guess the simplest method is to find out which bins are > 0, then run a prefix scan over those flags to calculate the target indices (let's say sum_array[]); after that, for all bins > 0, move s_hist[threadIdx.x] to d_hist[sum_array[threadIdx.x]-1].
0,1,2,3,4,5,6... //s_indexes[]
4,3,0,2,1,0,5... //contents of s_hist[]
1,1,0,1,1,0,1... //flags marking the bins which are > 0
1,2,2,3,4,4,5... //inclusive scan of the flags = sum_array[]
//after the moving part
0,1,3,4,6... //s_indexes[]
4,3,2,1,5... //d_hist[]
0,1,2,3,4... //d_indexes[]
The reason why I am inclined to use this pattern is that the sum_array can be calculated in log2(256) steps, and other than that the moving and checking parts are just constant-time operations. If anyone has a different idea, please share.
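The same flag/scan/move pattern written device-wide with Thrust (rather than per-block in shared memory) might look like this sketch; the array contents follow the example above, and the names flags, map and n_nonzero are mine:

#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/scan.h>
#include <thrust/scatter.h>
#include <thrust/transform.h>
#include <iostream>

int main()
{
    using namespace thrust::placeholders;

    // Sparse histogram from the example above.
    int h_hist[] = { 4, 3, 0, 2, 1, 0, 5 };
    const int n = sizeof(h_hist) / sizeof(h_hist[0]);
    thrust::device_vector<int> s_hist(h_hist, h_hist + n);

    // 1) Flag the non-empty bins: 1,1,0,1,1,0,1
    thrust::device_vector<int> flags(n);
    thrust::transform(thrust::device, s_hist.begin(), s_hist.end(), flags.begin(), _1 > 0);

    // 2) Inclusive scan of the flags: each non-empty bin's target index plus one.
    thrust::device_vector<int> sum_array(n);
    thrust::inclusive_scan(thrust::device, flags.begin(), flags.end(), sum_array.begin());

    // 3) Move every non-empty bin's count to position sum_array[i] - 1.
    int n_nonzero = sum_array.back();
    thrust::device_vector<int> map(n);
    thrust::transform(thrust::device, sum_array.begin(), sum_array.end(), map.begin(), _1 - 1);
    thrust::device_vector<int> d_hist(n_nonzero);
    thrust::scatter_if(thrust::device, s_hist.begin(), s_hist.end(),
                       map.begin(), flags.begin(), d_hist.begin());

    for (int i = 0; i < n_nonzero; ++i)
        std::cout << d_hist[i] << " ";      // 4 3 2 1 5
    std::cout << std::endl;
    return 0;
}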

Sorting > 10 integer sequences by key with Thrust

I want to perform a sort_by_key where I have a single key-sequence
and multiple value sequences.
One usually performs this with
sort_by_key(key, key + N,
            make_zip_iterator(
                make_tuple(x1, x2, ...)))
However, I want to perform a sort with more than 10 sequences, each of length N. Thrust does not support tuples with more than 10 elements. So is there a way around this?
Of course one can keep a separate copy of the key vector and perform
sorts on bunches of 10 sequences. But I would like to do everything in a single call.
thrust::tuple is hardcoded to always have 10 elements, so there isn't a direct way to form a zip_iterator from more than ten individual iterators, and therefore no way of sorting more than 10 distinct iterators by key in a single fused operation (and implicitly no way of passing more than 10 iterators into a user functor as well).
If you really can't think of a useful way to combine some of the individual vectors into a single iterator (for example, forming a vector of tuple values), then one alternative might be to use permutation iterators. Create an array from a counting iterator and sort that, something like:
device_vector<int> indices(N);
copy(make_counting_iterator(0), make_counting_iterator(N), indices.begin());
sort_by_key(key, key+N, indices);
indices now holds ordered indices into the vectors you would otherwise have sorted. You can then create a permutation iterator which can be used to "gather" the input data by your key as part of subsequent algorithm calls. You can make as many permutation iterators as needed, and they can be permutations of zip iterators to provide different "views" of the 12 input iterators as you need them in subsequent code.
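A minimal sketch of that scheme for a single value sequence (the data and the names are invented; in practice you would build as many permutation iterators, or permutations of zip iterators, as you need):

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/sort.h>
#include <iostream>

int main()
{
    const int N = 4;
    int h_key[] = { 3, 1, 2, 0 };
    int h_x1[]  = { 30, 10, 20, 0 };       // one of the many value sequences

    thrust::device_vector<int> key(h_key, h_key + N);
    thrust::device_vector<int> x1(h_x1, h_x1 + N);

    // Sort a vector of indices by the key instead of sorting the data itself.
    thrust::device_vector<int> indices(N);
    thrust::copy(thrust::make_counting_iterator(0),
                 thrust::make_counting_iterator(N),
                 indices.begin());
    thrust::sort_by_key(key.begin(), key.end(), indices.begin());

    // A permutation iterator presents x1 in key order without moving the data;
    // here it just feeds a copy, but it could feed any subsequent algorithm.
    thrust::device_vector<int> x1_sorted(N);
    thrust::copy(thrust::make_permutation_iterator(x1.begin(), indices.begin()),
                 thrust::make_permutation_iterator(x1.begin(), indices.end()),
                 x1_sorted.begin());

    for (int i = 0; i < N; ++i)
        std::cout << x1_sorted[i] << " ";   // 0 10 20 30
    std::cout << std::endl;
    return 0;
}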
Actually you may use a simple gather operation. Perform only one thrust::sort_by_key operation on a vector of indices, then for each data vector apply a thrust::gather operation. The values will be pulled from their original locations into sorted order.
thrust::sequence(indices.begin(), indices.end());
thrust::sort_by_key(keyvals.begin(), keyvals.end(), indices.begin());
//now indices keeps the original locations of the sorted key values
//for each data vector:
thrust::gather(indices.begin(), indices.end(), data.begin(), sorteddata.begin());
//sorteddata[i] = data[indices[i]], i.e. the data reordered by key
Gather and scatter operations are quite powerful and open many opportunities.
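A self-contained version of that pattern for two value sequences (the data and names are made up; with more sequences you simply repeat the gather):

#include <thrust/device_vector.h>
#include <thrust/gather.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <iostream>

int main()
{
    const int N = 4;
    int h_key[] = { 3, 1, 2, 0 };
    int h_x1[]  = { 30, 10, 20, 0 };
    int h_x2[]  = { 33, 11, 22, 0 };       // second value sequence; in practice there are many more

    thrust::device_vector<int> keyvals(h_key, h_key + N);
    thrust::device_vector<int> x1(h_x1, h_x1 + N);
    thrust::device_vector<int> x2(h_x2, h_x2 + N);
    thrust::device_vector<int> indices(N), sorted(N);

    // One sort produces the permutation; each data vector is then reordered separately.
    thrust::sequence(indices.begin(), indices.end());
    thrust::sort_by_key(keyvals.begin(), keyvals.end(), indices.begin());

    thrust::gather(indices.begin(), indices.end(), x1.begin(), sorted.begin());
    for (int i = 0; i < N; ++i) std::cout << sorted[i] << " ";   // 0 10 20 30
    std::cout << std::endl;

    thrust::gather(indices.begin(), indices.end(), x2.begin(), sorted.begin());
    for (int i = 0; i < N; ++i) std::cout << sorted[i] << " ";   // 0 11 22 33
    std::cout << std::endl;
    return 0;
}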