Sorting more than 10 integer sequences by key with Thrust/CUDA

I want to perform a sort_by_key where I have a single key-sequence
and multiple value sequences.
One usually performs this with
sort_by_key(
key,
key + N,
make_zip_iterator(
make_tuple(x1 , x2 , ...)
)
)
However, I want to perform a sort with more than 10 sequences, each of length N. Thrust does not support tuples with more than 10 elements. So is there a way around this?
Of course one can keep a separate copy of the key vector and perform several sorts, each on a bunch of up to 10 sequences, but I would like to do everything in a single call.

thrust::tuple is hardcoded to always have 10 elements, so there isn't a direct way to form a zip_iterator from more than ten individual iterators, and therefore no way of sorting more than 10 distinct iterators by key in a single fused operation (and implicitly no way of passing more than 10 iterators into a user functor as well).
If you really can't think of a useful way to combine some of the individual vectors into a single iterator (for example, form a vector of tuple values), then one alternative might be to use permutation iterators. Create an index array from a counting iterator and sort it by key, something like:
device_vector<int> indices(N);
copy(make_counting_iterator(0), make_counting_iterator(N), indices.begin());
sort_by_key(key, key + N, indices.begin());
indices now holds ordered indices into the vectors you would otherwise have sorted. You can then create a permutation iterator which can be used to "gather" the input data by your key as part of subsequent algorithm calls. You can make as many permutation iterators as needed, and they can be permutations of zip iterators to provide different "views" of the input iterators as you need them in subsequent code.
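For instance, a minimal sketch of the idea, assuming integer keys and a couple of float value vectors (the names key, x1, x2 are placeholders):

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

void sorted_views(thrust::device_vector<int>& key,
                  thrust::device_vector<float>& x1,
                  thrust::device_vector<float>& x2)
{
    const int N = key.size();

    // sort an index sequence by the keys
    thrust::device_vector<int> indices(N);
    thrust::sequence(indices.begin(), indices.end());
    thrust::sort_by_key(key.begin(), key.end(), indices.begin());

    // a key-sorted "view" of x1 without moving any data
    auto x1_sorted = thrust::make_permutation_iterator(x1.begin(), indices.begin());

    // the same idea over a zip of several vectors (any group of up to 10)
    auto x12_sorted = thrust::make_permutation_iterator(
        thrust::make_zip_iterator(thrust::make_tuple(x1.begin(), x2.begin())),
        indices.begin());

    // x1_sorted and x12_sorted can now be passed to subsequent Thrust algorithms,
    // and you can build as many such views as you have groups of vectors.
}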

Actually you may use a simple gather/scatter approach. Perform only one "thrust::sort_by_key" operation on an index sequence, then for each data vector apply one "thrust::gather" operation (a "thrust::scatter" with the inverse permutation would do the same job). The values will be rearranged into key-sorted order.
thrust::device_vector<int> indices(keyvals.size());
thrust::sequence(indices.begin(), indices.end());
thrust::sort_by_key(keyvals.begin(), keyvals.end(), indices.begin());
// indices[i] now holds the original position of the i-th smallest key
foreach ( data vector ) {
    // sorteddata[i] = data[indices[i]], i.e. data rearranged into key-sorted order
    thrust::gather(indices.begin(), indices.end(), data.begin(), sorteddata.begin());
}
Gather and scatter operations are quite powerful and open up many opportunities.
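Put together, a self-contained sketch of the sort-once, gather-many pattern might look like this (two value vectors shown; the function name is mine):

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/gather.h>

// Sort once by key, then rearrange every value vector with the resulting permutation.
void sort_many_by_key(thrust::device_vector<int>&   keyvals,
                      thrust::device_vector<float>& a,
                      thrust::device_vector<float>& b)
{
    const int N = keyvals.size();
    thrust::device_vector<int> indices(N);
    thrust::sequence(indices.begin(), indices.end());
    thrust::sort_by_key(keyvals.begin(), keyvals.end(), indices.begin());

    thrust::device_vector<float> tmp(N);
    thrust::gather(indices.begin(), indices.end(), a.begin(), tmp.begin()); // tmp[i] = a[indices[i]]
    a.swap(tmp);
    thrust::gather(indices.begin(), indices.end(), b.begin(), tmp.begin());
    b.swap(tmp);
    // repeat the gather/swap pair for every additional value vector
}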


Why did the designer make vectors, maps, and sets functions in Clojure?

Rich made vectors, maps, and sets functions, while lists and sequences are not functions.
Why can't all of these collections be functions, for consistency?
Further, why not make every composite datum a function that maps a position to its internal data?
If we made all composite data functions, there would be only functions and atomic data in Clojure. That would minimize the number of fundamental elements in the language, right?
I believe a minimal set of fundamental elements, ideally only two, would make the language simpler, more expressive, and more flexible. Is this correct?
Vectors, maps, and sets are all associative data structures. Maps are the most obvious; they simply associate arbitrary keys with arbitrary values. A vector can be thought of as a map whose key set must be the set of all nonnegative integers less than the vector's size. Finally, sets can be thought of as maps that map keys to themselves.
It's important to understand that the sequential nature of a vector and the associative nature of a vector are two orthogonal things. It's a data structure that's designed to be good at supporting both abstractions (to some extent; for instance, you can't efficiently insert at the beginning of a vector).
Lists are simpler than vectors; they are finite sequential data structures, nothing more. A list can't efficiently return the element at a particular index, so it doesn't expose that functionality as part of its core interface. Of course, you can get an element of a list by index using nth, but in that case, you're explicitly treating it as a sequence, not as an associative structure.
So to answer your question, the IFn implementations for vectors, maps, and sets are there because of the extremely close relationship between the idea of an associative data structure and the idea of a pure function. Lists and other sequences are not inherently associative, so for consistency, they do not implement IFn.
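For example, maps, vectors, and sets can all be applied directly as a lookup function of a key, an index, or a candidate member:
user=> ({:a 1, :b 2} :a)
1
user=> ([10 20 30] 1)
20
user=> (#{1 2 3} 2)
2
user=> (#{1 2 3} 5)
nil
Applying a list the same way, e.g. ('(10 20 30) 1), throws a ClassCastException, because lists do not implement IFn.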
Elogent's answer is excellent. There is one more reason that it wouldn't make sense for lists to be functions:
Literal lists already have a different, very important role, so they can't also be treated as functions in the way that vectors are.
Let's start with a vector containing two functions, partial and +, and a number, 5. We can treat the vector as a function, as you know, to return the value indexed by its argument:
user=> ([partial + 5] 2)
5
So far, so good. Suppose we want to use a list (partial + 5) in place of the vector, as you suggested, to return the value 5. Will we get an error message? No! But we won't get 5 as the result, either:
user=> ((partial + 5) 2)
7
What happened? (partial + 5) returned a function--the function that adds 5 to its single argument--and then this function was applied to the argument 2.
When a list is evaluated, its first element is evaluated, and should return a function. If the first element is a symbol, it's evaluated, and then the function that's its value is applied to the arguments, which are the other elements of the list. If the first element of a list is itself a list, then it is evaluated in the same way that it would be evaluated if it were at the top level. The entire expression in that inner list should return a function, which will then be applied to the other elements of the outer list.
Since an inner list that's the first element of a list that's being evaluated already has this role, it can't also play the kind of role that vectors play when they are first elements.

Summing the elements with even or odd indices by CUDA Thrust

If I use
float sum = thrust::transform_reduce(d_a.begin(), d_a.end(), conditional_operator(), 0.f, thrust::plus<float>());
I get the sum of all elements meeting a condition provided by conditional_operator(), as in Conditional reduction in CUDA.
But how can I sum only the elements d_a[0], d_a[2], d_a[4], d_a[6], ... ?
I thought of changing the conditional operator, but it works on the elements of the array without any reference to their indices.
What can I do about that?
There are two approaches I can think of for solving this sort of problem:
Use the thrust zip operator to combine a counting iterator with the input data and modify your existing functor to accept tuples of (index, data). You can have the functor return the data when the index matches your criteria, and zero otherwise. This will work correctly with scan and reduction algorithms.
Use a thrust permutation iterator to gather the data which you want to sum and pass it to the standard reduce algorithm. The thrust developers have an example strided iterator which you can use to solve the problem of only processing every nth entry in an input iterator.
It might be worth implementing both and benchmarking them to see which approach is faster.
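A minimal sketch of the first approach, summing the even-indexed elements (the functor name is illustrative):

#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform_reduce.h>
#include <thrust/tuple.h>

struct even_index_value
{
    __host__ __device__
    float operator()(thrust::tuple<int, float> t) const
    {
        // keep the value when the index is even, contribute zero otherwise
        return (thrust::get<0>(t) % 2 == 0) ? thrust::get<1>(t) : 0.f;
    }
};

float sum_even_indices(const thrust::device_vector<float>& d_a)
{
    int n = d_a.size();
    auto first = thrust::make_zip_iterator(
        thrust::make_tuple(thrust::make_counting_iterator(0), d_a.begin()));
    auto last = thrust::make_zip_iterator(
        thrust::make_tuple(thrust::make_counting_iterator(n), d_a.end()));
    return thrust::transform_reduce(first, last, even_index_value(), 0.f, thrust::plus<float>());
}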

Thrust::sort and transform_iterator

I want to sort a list of integer values, but before sorting them I should divide them by a number N. So I will have some duplicate keys, and I will use this duplication for a stable_sort of the list.
My question is: which is better, dividing all the values and storing the divided values in a list and then performing the sort, or using a transform_iterator? Does using a transform_iterator change the sort algorithm from radix_sort to merge_sort? They have a huge difference in running time.
For example:
//already sorted according to another parameter
thrust::device_vector<int> myvalues...
//we want to group them..
thrust::transform(myvalues.begin(), myvalues.end(), groups.begin(), divide_by_n(N));
thrust::stable_sort_by_key(groups.begin(), groups.end(), myvalues.begin());
or
first = thrust::make_transform_iterator(myvalues.begin(), divide_by_n(N));
last = thrust::make_transform_iterator(myvalues.end(), divide_by_n(N));
thrust::stable_sort_by_key(first, last, myvalues.begin());
Thanks
According to this post (comments from @JaredHoberock), the second one does not work:
How to sort with less precision on keys with Thrust library
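So the first variant is the one to use. A sketch of it, with an illustrative divide_by_n functor:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/transform.h>

struct divide_by_n
{
    int n;
    __host__ __device__ divide_by_n(int n_) : n(n_) {}
    __host__ __device__ int operator()(int x) const { return x / n; } // equal quotients form one group
};

void group_and_sort(thrust::device_vector<int>& myvalues, int N)
{
    // materialize the keys so the sort can read and write them,
    // keeping the fast key-sort path available
    thrust::device_vector<int> groups(myvalues.size());
    thrust::transform(myvalues.begin(), myvalues.end(), groups.begin(), divide_by_n(N));
    thrust::stable_sort_by_key(groups.begin(), groups.end(), myvalues.begin());
}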

Stream filter in CUDA

I have an array of values and a linked list of indexes. Now, I only want to keep those values from the array that correspond to the indexes in the linked list. Is there a standard algorithm to do this? Please give an example if possible.
So, suppose I have the array 1, 2, 5, 6, 7, 9
and I have the linked list 2->3.
So, I want to keep the values at indexes 2 and 3, that is, keep 5 and 6.
Thus my function should return 5 and 6.
In general, a linked list is inherently serial. Having a parallel machine will not speed up the traversal of your list, hence the number of steps cannot go below O(n), where n is the size of the list.
However, if you have some additional way to access the list you can do something with it.
For example, all elements of the list could be stored in a fixed-size array (although not necessarily in a consecutive way). A list member could then be represented in the array using the following struct:
struct ListNode {
    bool isValid; // is this cell occupied by a live list member?
    T data;
    int next;     // index of the next node within the array
};
The isValid flag indicates whether a given cell in the array is occupied by a valid list member or is just an empty cell.
Now, a parallel algorithm can read all cells at once, check whether each one represents valid data, and if so, do something with it.
Second part: each thread holding a valid index idx into your input array A would mark A[idx] as an element to keep. Once we know which elements of A should be kept and which removed, a parallel compaction algorithm can be applied.
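A hedged sketch of that second part with Thrust, assuming the valid indexes have already been extracted into a device array (names are illustrative):

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/scatter.h>

struct is_marked
{
    __host__ __device__ bool operator()(int x) const { return x != 0; }
};

// keep only the elements of 'values' whose positions appear in 'indexes'
thrust::device_vector<int> stream_filter(const thrust::device_vector<int>& values,
                                         const thrust::device_vector<int>& indexes)
{
    // mark the surviving positions: keep[indexes[i]] = 1
    thrust::device_vector<int> keep(values.size(), 0);
    thrust::scatter(thrust::make_constant_iterator(1),
                    thrust::make_constant_iterator(1) + indexes.size(),
                    indexes.begin(), keep.begin());

    // parallel compaction: copy an element when its flag is set
    thrust::device_vector<int> out(values.size());
    auto out_end = thrust::copy_if(values.begin(), values.end(),
                                   keep.begin(), out.begin(), is_marked());
    out.resize(out_end - out.begin());
    return out;
}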

Efficient set intersection - decide whether the intersection is larger than k

I am faced with a problem where I have to calculate intersections between all pairs in a collection of sets. None of the sets are smaller than a small constant k, and I'm only interested in whether two sets have an intersection larger than k-1 elements or not. I do not need the actual intersections nor the exact size, only whether it's larger than k-1 or not. Is there some clever pre-processing trick or a neat set intersection algorithm that I could use to speed things up?
More info that can be useful to answer the question:
The sets represent maximal cliques in a large, undirected, sparse graph. The number of sets can be in the order of tens of thousands or more, but most of the sets are likely to be small.
The sets are already sorted: the members of each set are in increasing order. Effectively they are sorted lists - I receive them this way from an underlying library for maximal clique search.
Nothing is known about the distribution of elements in the sets (i.e. whether they are in tight clumps or not).
Most of the set intersections are likely to be empty, so the ideal solution would be a clever data structure that helps me cut down the number of set intersections I have to make.
Consider a mapping with all sets of size k as the keys and corresponding values of lists of all sets from your collection that contain the key as a subset. Given this mapping, you don't need to perform any intersection tests: for each key, all pairs of sets from the list will have an intersection of size at least k. This approach can produce the same pair of sets more than once, so that will need to be checked.
The mapping is easy enough to calculate. For each set in the collection, calculate all the size-k subsets and append the original set to the list for that key set. But is this actually faster? In general, no. The performance of this approach will depend on the distribution of the sizes of the sets in the collection and the value of k. With d distinct elements in the sets, you could have as many as d choose k keys, which can be very large.
However, the basic idea is usable to reduce the number of intersections. Instead of using sets of size k, use smaller ones of fixed size q as the keys. The values are again lists of all sets that have the key as a subset. Now, test each pair of sets from the list for intersection. Thus, with q=1 you only test those pairs of sets that have at least one element in common, with q=2 you only test those pairs of sets that have at least two elements in common, and so on. The optimal value for q will depend on the distribution of sizes of the sets, I think.
For the sets in question, a good choice might be q=2. The keys are then just the edges of the graph, giving a predictable size to the mapping. Since most sets are expected to be disjoint, q=2 should eliminate a lot of comparisons without much additional overhead.
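As a rough illustration of the q=2 bucketing (plain C++; containers and names are mine, not from the question):

#include <map>
#include <set>
#include <utility>
#include <vector>

// sets[i] is a sorted vector of vertex ids (one maximal clique).
// Returns the pairs (i, j) that share at least one 2-element subset (an edge);
// only these pairs still need a full intersection test.
std::set<std::pair<int, int>>
candidate_pairs(const std::vector<std::vector<int>>& sets)
{
    std::map<std::pair<int, int>, std::vector<int>> buckets; // edge -> sets containing it
    for (int id = 0; id < (int)sets.size(); ++id) {
        const std::vector<int>& s = sets[id];
        for (size_t a = 0; a < s.size(); ++a)
            for (size_t b = a + 1; b < s.size(); ++b)
                buckets[std::make_pair(s[a], s[b])].push_back(id);
    }

    std::set<std::pair<int, int>> pairs; // a set deduplicates pairs produced by several buckets
    for (const auto& kv : buckets)
        for (size_t a = 0; a < kv.second.size(); ++a)
            for (size_t b = a + 1; b < kv.second.size(); ++b)
                pairs.insert(std::make_pair(kv.second[a], kv.second[b]));
    return pairs;
}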
One possible optimization, which is more effective the smaller the range of values contained in each set:
Create a list of all the sets, sorted by their kth-greatest element (this is easy to find, since you already have each set with its elements in order). Call this list L.
For any two sets A and B, their intersection cannot have as many as k elements in it if the kth-greatest element in A is less than the least element in B.
So, for each set in turn, calculate its intersection only with the sets in the relevant part of L.
You can use the same fact to exit early from computing the intersection of any two sets - if there are only n-1 elements left to compare in one of the sets, and the intersection so far contains at most k-n elements, then stop. The above procedure is simply this rule applied to all the sets in L at once, with n=k, at the point where we're looking at the least element of set B and the kth-greatest element of A.
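The early-exit rule from the last point could look like this (a sketch; the function name is mine):

#include <vector>

// Returns true iff the sorted ranges a and b share at least k elements,
// abandoning the scan as soon as that becomes impossible.
bool intersection_at_least(const std::vector<int>& a, const std::vector<int>& b, int k)
{
    size_t i = 0, j = 0;
    int common = 0;
    while (i < a.size() && j < b.size()) {
        // not enough elements left in one of the sets to reach k
        if (common + (int)(a.size() - i) < k || common + (int)(b.size() - j) < k)
            return false;
        if (a[i] < b[j])
            ++i;
        else if (b[j] < a[i])
            ++j;
        else {
            ++common;
            if (common >= k) return true;
            ++i; ++j;
        }
    }
    return common >= k;
}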
The following strategy should be quite efficient. I've used variations of this for intersecting ascending sequences on a number of occasions.
First I assume that you have some sort of priority queue available (if not, rolling your own heap is pretty easy). And a fast key/value lookup (btree, hash, whatever).
With that said, here is pseudocode for an algorithm that should do what you want quite efficiently.
# Initial setup
sets = array of all sets
intersection_count = key/value lookup with keys = (set_pos, set_pos) and values = counts
p_queue = priority queue containing (sets[set_pos][0], 0, set_pos) for every set_pos,
          ordered by the first component

# helper function
def process_intersections(current_sets):
    for all pairs of current_sets:
        if pair in intersection_count:
            intersection_count[pair] += 1
        else:
            intersection_count[pair] = 1

# Find all intersections
current_sets = []
last_element = first element of first thing in p_queue
while p_queue is not empty:
    (element, ind, set_pos) = get top element from p_queue
    if element != last_element:
        process_intersections(current_sets)
        last_element = element
        current_sets = []
    current_sets.append(set_pos)
    ind += 1
    if ind < len(sets[set_pos]):
        add (sets[set_pos][ind], ind, set_pos) to p_queue

# Don't forget the last one!
process_intersections(current_sets)

final_answer = []
for (pair, count) in intersection_count.iteritems():
    if k - 1 < count:
        final_answer.append(pair)
The running time will be O(sum of the set sizes * log(number of sets) + the number of times an element appears in both sets of a pair). In particular, note that if two sets have no intersection, you never try to intersect them.
What if you used a predictive subset as a prequalifier? Pre-sort, but use a subset intersection as a threshold condition: if the subset intersection is greater than n%, complete the full intersection, otherwise abandon it. n then becomes the inverse of your comfort level with the prospect of a false positive.
You could also sort by the subset intersections (m) calculated earlier and run the full intersections ordered by m descending. Presumably the majority of your highest-m intersections would cross your k threshold on the full set, and the probability of hitting the k threshold would then decrease steadily.
This really starts to treat the problem as NP-Complete.