Which libraries can count seconds correctly and for which dates? - language-agnostic

Compute the number of SI seconds between “2021-01-01 12:56:23.423 UTC” and “2001-01-01 00:00:00.000 UTC” as an example.

C++20 can do it with the following syntax:
#include <chrono>
#include <iostream>

int
main()
{
    using namespace std;
    using namespace std::chrono;
    auto t0 = sys_days{2001y/1/1};
    auto t1 = sys_days{2021y/1/1} + 12h + 56min + 23s + 423ms;
    auto u0 = clock_cast<utc_clock>(t0);
    auto u1 = clock_cast<utc_clock>(t1);
    cout << u1 - u0 << '\n';  // with leap seconds
    cout << t1 - t0 << '\n';  // without leap seconds
}
This outputs:
631198588423ms
631198583423ms
The first number includes leap seconds, and is 5s greater than the second number which does not.
This C++20 chrono preview library can do it in C++11/14/17. One just has to #include "date/tz.h", change the y suffix to _y in two places, and add using namespace date;.
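As a minimal sketch of that adaptation (assuming a recent version of Howard Hinnant's date library, which provides date::utc_clock and date::clock_cast in "date/tz.h", built with its timezone/leap-second database available; shown for C++14/17 where the chrono literals exist, under C++11 one would write hours{12} and so on):

#include "date/tz.h"
#include <iostream>

int
main()
{
    using namespace date;
    using namespace std;
    using namespace std::chrono;
    auto t0 = sys_days{2001_y/1/1};
    auto t1 = sys_days{2021_y/1/1} + 12h + 56min + 23s + 423ms;
    auto u0 = clock_cast<utc_clock>(t0);
    auto u1 = clock_cast<utc_clock>(t1);
    cout << u1 - u0 << '\n';  // with leap seconds
    cout << t1 - t0 << '\n';  // without leap seconds
}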

My library Time4J can handle this task in Java. Example using your input:
ChronoFormatter<Moment> f =
    ChronoFormatter.ofMomentPattern(
        "uuuu-MM-dd HH:mm:ss.SSS z",
        PatternType.CLDR,
        Locale.ROOT,
        ZonalOffset.UTC);
Moment m1 = f.parse("2001-01-01 00:00:00.000 UTC");
Moment m2 = f.parse("2021-01-01 12:56:23.423 UTC");

MachineTime<TimeUnit> durationPOSIX = MachineTime.ON_POSIX_SCALE.between(m1, m2);
MachineTime<SI> durationSI = MachineTime.ON_UTC_SCALE.between(m1, m2);

System.out.println("POSIX: " + durationPOSIX); // without leap seconds
// output: 631198583.423000000s [POSIX]

System.out.println("SI: " + durationSI); // including 5 leap seconds
// output: 631198588.423000000s [UTC]

// if you only want to count full seconds...
System.out.println(SI.SECONDS.between(m1, m2));
// output: 631198588

// first leap second after 2001
System.out.println(m1.with(Moment.nextLeapSecond()));
// output: 2005-12-31T23:59:60Z
The 5-second difference agrees with the officially listed leap seconds between 2001 and 2021 (inserted at the end of 2005-12-31, 2008-12-31, 2012-06-30, 2015-06-30, and 2016-12-31).
In general: This library supports leap seconds only since 1972.

Related

CUDA: method to calculate all partial sums during a sum reduction

I run into this issue over and over in CUDA. I have, for a set of elements, done some GPU calculation. This results in some value that has linear meaning (for instance, in terms of memory):
element_sizes = [ 10, 100, 23, 45 ]
And now, for the next stage of GPU calculation, I need the following values:
memory_size = sum(element_sizes)
memory_offsets = [ 0, 10, 110, 133 ]
I can calculate memory_size at 80 gbps on my GPU using the reduction code available from NVIDIA. However, I can't use that code as-is, because its branching technique does not also produce the memory offsets array. I have tried many things, but what I have found is that simply copying element_sizes to the host and calculating the offsets with a simple for loop is the simplest, fastest way to go:
// in pseudo code
host_element_sizes = copy_to_host(element_sizes);
host_offsets = (... *) malloc(...);

int total_size = 0;
for(int i = 0; i < ...; ...){
    host_offsets[i] = total_size;
    total_size += host_element_sizes[i];
}

device_offsets = (... *) device_malloc(...);
device_offsets = copy_to_device(host_offsets,...);
However, I have done this many times now, and it is starting to become a bottleneck. This seems like a typical problem, but I have found no work-around.
What is the expected way for a CUDA programmer to solve this problem?
I think the algorithm you are looking for is a prefix sum. A prefix sum on a vector produces another vector which contains the cumulative sum values of the input vector. A prefix sum exists in at least two variants - an exclusive scan or an inclusive scan. Conceptually these are similar.
If your element_sizes vector has been deposited in GPU global memory (it appears to be the case based on your pseudocode), then there exist library functions that run on the GPU that you could call at that point, to produce the memory_offsets data (vector), and the memory_size value could be trivially obtained from the last value in the vector, with a slight variation based on whether you are doing an inclusive scan or exclusive scan.
Here's a trivial worked example using thrust:
$ cat t319.cu
#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <iostream>
int main(){
const int element_sizes[] = { 10, 100, 23, 45 };
const int ds = sizeof(element_sizes)/sizeof(element_sizes[0]);
thrust::device_vector<int> dv_es(element_sizes, element_sizes+ds);
thrust::device_vector<int> dv_mo(ds);
thrust::exclusive_scan(dv_es.begin(), dv_es.end(), dv_mo.begin());
std::cout << "element_sizes:" << std::endl;
thrust::copy_n(dv_es.begin(), ds, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl << "memory_offsets:" << std::endl;
thrust::copy_n(dv_mo.begin(), ds, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl << "memory_size:" << std::endl << dv_es[ds-1] + dv_mo[ds-1] << std::endl;
}
$ nvcc -o t319 t319.cu
$ ./t319
element_sizes:
10,100,23,45,
memory_offsets:
0,10,110,133,
memory_size:
178
$
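As a small variation (my own sketch, not part of the original answer): with thrust::inclusive_scan the last element of the scan result is memory_size directly, and the memory_offsets are the same values shifted right by one position, with a leading 0.

#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <iostream>

int main(){
    const int element_sizes[] = { 10, 100, 23, 45 };
    const int ds = sizeof(element_sizes)/sizeof(element_sizes[0]);
    thrust::device_vector<int> dv_es(element_sizes, element_sizes+ds);
    thrust::device_vector<int> dv_cs(ds);
    thrust::inclusive_scan(dv_es.begin(), dv_es.end(), dv_cs.begin());
    // dv_cs now holds {10, 110, 133, 178}
    int memory_size = dv_cs[ds-1];   // total size: 178
    // memory_offsets would be {0, 10, 110, 133}: a leading 0 followed by dv_cs[0..ds-2]
    std::cout << "memory_size: " << memory_size << std::endl;
}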

Using thrust::reverse with make_zip_iterator+make_tuple

I'm experiencing odd behavior while using the thrust::reverse function on a zip_iterator constructed with a thrust::make_zip_iterator( thrust::make_tuple( )) type syntax (see the answer from JackOLantern here for a good example of that combination).
I wish to reverse some arbitrarily-indicated section of multiple device vectors, as in the example code below. When I do the reversing in one go by tupling and zipping them together, unexpected behavior ensues: the first half of the range is correctly changed to an inversion of the second half of the range; however, the second half of the range is left unchanged.
I've been using other thrust functions in a similar fashion (sort_by_key, unique_by_key, adjacent_difference, etc.) without issue. Am I just executing this incorrectly, or is there some reason this will not work on a fundamental level? A thought I had is that perhaps the zip_iterator is not bidirectional as required by reverse. Is this true? I couldn't find documentation indicating so.
A workaround is just to reverse the vectors individually, which works as shown below. However, I suspect this will be less efficient. Note that in my actual use case I have vectors with sizes on the order of 10,000, and I'm zipping up anywhere from 3 to 7 vectors for the operations.
#include <iostream>
#include <ostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/tuple.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sequence.h>
#include <thrust/reverse.h>

int main(){

    // initial host vectors
    const int N=10;
    thrust::host_vector<int> h1(N);
    thrust::host_vector<float> h2(N);

    // fill them
    thrust::sequence( h1.begin(), h1.end(), 0);
    thrust::sequence( h2.begin(), h2.end(), 10., 0.5);

    // print initial contents
    for (size_t i=0; i<N; i++){
        std::cout << h1[i] << " " << h2[i] << std::endl;
    }

    // transfer to device
    thrust::device_vector<int> d1 = h1;
    thrust::device_vector<float> d2 = h2;

    // what chunk to invert
    int iStart = 3; int iEnd = 8;

    // attempt to reverse middle via zip_iterators
    thrust::reverse(
        thrust::make_zip_iterator( thrust::make_tuple( d1.begin()+iStart, d2.begin()+iStart)),
        thrust::make_zip_iterator( thrust::make_tuple( d1.begin()+iEnd, d2.begin()+iEnd))
    );

    // pull back and write out unexpected ordering
    thrust::host_vector<int> temp1 = d1;
    thrust::host_vector<float> temp2 = d2;
    std::cout << "<==========>" << std::endl;
    for (size_t i=0; i<N; i++){
        std::cout << temp1[i] << " " << temp2[i] << std::endl;
    }

    // reset device variables
    d1 = h1;
    d2 = h2;

    // reverse individually
    thrust::reverse( d1.begin()+iStart, d1.begin()+iEnd);
    thrust::reverse( d2.begin()+iStart, d2.begin()+iEnd);

    // pull back and write out the desired ordering
    temp1 = d1;
    temp2 = d2;
    std::cout << "<==========>" << std::endl;
    for (size_t i=0; i<N; i++){
        std::cout << temp1[i] << " " << temp2[i] << std::endl;
    }

    return 0;
}
Output
0 10
1 10.5
2 11
3 11.5
4 12
5 12.5
6 13
7 13.5
8 14
9 14.5
<==========>
0 10
1 10.5
2 11
7 13.5
6 13
5 12.5
6 13
7 13.5
8 14
9 14.5
<==========>
0 10
1 10.5
2 11
7 13.5
6 13
5 12.5
4 12
3 11.5
8 14
9 14.5
The information from Robert Crovella in the comments, combined with the workaround already given in the initial post, appears to answer the question, so I will combine them here so the question can be marked as "answered." If others wish to post other solutions, I'm more than willing to look at them and move the "official answer" check mark. That being said...
The solution to the question has two parts:
If using an older version of CUDA and upgrading is an option: upgrade to the newest CUDA version and the operation should work (tested to work on CUDA 9.2.148 - thanks Robert!)
If unable to upgrade to a newer version of CUDA: apply reverse to the vectors individually to achieve the same result as given in the initial post. The code with only the working solution is copied below for completeness.
#include <iostream>
#include <ostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/tuple.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sequence.h>
#include <thrust/reverse.h>

int main(){

    // initial host vectors
    const int N=10;
    thrust::host_vector<int> h1(N);
    thrust::host_vector<float> h2(N);

    // fill them
    thrust::sequence( h1.begin(), h1.end(), 0);
    thrust::sequence( h2.begin(), h2.end(), 10., 0.5);

    // print initial contents
    for (size_t i=0; i<N; i++){
        std::cout << h1[i] << " " << h2[i] << std::endl;
    }

    // transfer to device
    thrust::device_vector<int> d1 = h1;
    thrust::device_vector<float> d2 = h2;

    // what chunk to invert
    int iStart = 3; int iEnd = 8;

    // reverse individually
    thrust::reverse( d1.begin()+iStart, d1.begin()+iEnd);
    thrust::reverse( d2.begin()+iStart, d2.begin()+iEnd);

    // pull back and write out the desired ordering
    // (host copies declared here, since the zip_iterator attempt was removed)
    thrust::host_vector<int> temp1 = d1;
    thrust::host_vector<float> temp2 = d2;
    std::cout << "<==========>" << std::endl;
    for (size_t i=0; i<N; i++){
        std::cout << temp1[i] << " " << temp2[i] << std::endl;
    }

    return 0;
}

Use thrust to find element in groups

I have two int vectors for keys and values; their size is about 500K.
The key vector is already sorted, and there are approximately 10K groups.
Each value is either non-negative (stands for useful) or -2 (stands for no use); in each group there should be one or zero non-negative values, and the rest are -2.

key:    0  0  0  0  1  2  2  3  3  3  3
value: -2 -2  1 -2  3 -2 -2 -2 -2 -2  0

The third pair of group 0, [0 1], is useful. For group 1 we get the pair [1 3]. The values of group 2 are all -2, so we get nothing. And for group 3, the result is [3 0].
So, the question is: how can I do this with Thrust or CUDA?
Here are two ideas.
First one:
Get the size of each group with a histogram algorithm, so the boundaries of each group can be computed.
Run thrust::find_if on each group to get the useful element.
Second one:
Use thrust::transform to add 2 to every value, so all the values become non-negative and zero stands for useless.
Use thrust::reduce_by_key to get the per-group sum, then subtract 2 from every output value (a sketch of this idea follows below).
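Here is a minimal sketch of that second idea (my own illustration using the sample data above, not code from the question). reduce_by_key with the default plus reduction works because, after the shift, each group contains at most one non-zero value.

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/iterator/constant_iterator.h>
#include <iostream>

int main(){
    int keys[] = { 0, 0, 0, 0, 1, 2, 2, 3, 3, 3, 3};
    int vals[] = {-2,-2, 1,-2, 3,-2,-2,-2,-2,-2, 0};
    size_t n = sizeof(keys)/sizeof(keys[0]);
    thrust::device_vector<int> d_keys(keys, keys+n);
    thrust::device_vector<int> d_vals(vals, vals+n);

    // step 1: add 2 to every value, so -2 becomes 0 and a useful value v becomes v+2
    thrust::transform(d_vals.begin(), d_vals.end(),
                      thrust::make_constant_iterator(2),
                      d_vals.begin(), thrust::plus<int>());

    // step 2: per-group sum; each group holds at most one non-zero value,
    // so the sum is either 0 (no useful value) or v+2
    thrust::device_vector<int> out_keys(n), out_vals(n);
    size_t groups = thrust::reduce_by_key(d_keys.begin(), d_keys.end(),
                                          d_vals.begin(),
                                          out_keys.begin(), out_vals.begin()
                                         ).first - out_keys.begin();

    // step 3: subtract 2 again; -2 now marks groups without a useful element
    thrust::transform(out_vals.begin(), out_vals.begin()+groups,
                      thrust::make_constant_iterator(2),
                      out_vals.begin(), thrust::minus<int>());

    for (size_t i = 0; i < groups; i++)
        std::cout << out_keys[i] << ":" << out_vals[i] << std::endl;  // prints 0:1 1:3 2:-2 3:0
    return 0;
}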
I think there must be other methods that achieve much better performance than the above two.
Performance of the methods:
I have tested the second method above and the method given by @Robert Crovella, i.e., the reduce_by_key and remove_if methods.
The size of the vectors is 2691028, and they consist of 100001 groups. Here are their average times:
reduce_by_key: 1204ms
remove_if: 192ms
From the results above, we can see that the remove_if method is much faster. It is also easy to implement and consumes much less GPU memory.
In short, @Robert Crovella's method is very good.
I would use a thrust::zip_iterator to zip the key and value pairs together, and then do a thrust::remove_if operation on the zipped values, which requires a functor definition that indicates to remove every pair for which the value is negative (or whatever test you wish).
Here's a worked example:
$ cat t1009.cu
#include <thrust/remove.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/copy.h>
#include <iostream>

struct remove_func
{
    template <typename T>
    __host__ __device__
    bool operator()(T &t){
        return (thrust::get<1>(t) < 0); // could change to other kinds of tests
    }
};

int main(){

    int keys[] = {0,0,0,0,1,2,2,3,3,3,3};
    int vals[] = {-2,-2,1,-2,3,-2,-2,-2,-2,-2,0};
    size_t dsize = sizeof(keys)/sizeof(int);
    thrust::device_vector<int> dkeys(keys, keys+dsize);
    thrust::device_vector<int> dvals(vals, vals+dsize);
    auto zr = thrust::make_zip_iterator(thrust::make_tuple(dkeys.begin(), dvals.begin()));
    size_t rsize = thrust::remove_if(zr, zr+dsize, remove_func()) - zr;
    thrust::copy_n(dkeys.begin(), rsize, std::ostream_iterator<int>(std::cout, ","));
    std::cout << std::endl;
    thrust::copy_n(dvals.begin(), rsize, std::ostream_iterator<int>(std::cout, ","));
    std::cout << std::endl;
    return 0;
}
$ nvcc -std=c++11 -o t1009 t1009.cu
$ ./t1009
0,1,3,
1,3,0,
$

Thrust Histogram with weights

I want to compute the density of particles over a grid. Therefore, I have a vector that contains the cellID of each particle, as well as a vector with each particle's mass, which does not have to be uniform.
I have taken the non-sparse example from Thrust to compute a histogram of my particles.
However, to compute the density, I need to include the weight of each particle instead of simply summing the number of particles per cell, i.e. I'm interested in rho[i] = sum W[j] for all j that satisfy cellID[j]=i (probably unnecessary to explain, since everybody knows that).
Implementing this with Thrust has not worked for me. I also tried to use a CUDA kernel and thrust::raw_pointer_cast, but I did not succeed with that either.
EDIT:
Here is a minimal working example which should compile via nvcc file.cu under CUDA 6.5 and with Thrust installed.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/remove.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>

// Predicate
struct is_out_of_bounds {
    __host__ __device__ bool operator()(int i) {
        return (i < 0); // out of bounds elements have negative id;
    }
};

// cf.: https://code.google.com/p/thrust/source/browse/examples/histogram.cu, but modified
template<typename T1, typename T2>
void computeHistogram(const T1& input, T2& histogram) {
    typedef typename T1::value_type ValueType; // input value type
    typedef typename T2::value_type IndexType; // histogram index type

    // copy input data (could be skipped if input is allowed to be modified)
    thrust::device_vector<ValueType> data(input);

    // sort data to bring equal elements together
    thrust::sort(data.begin(), data.end());

    // there are elements that we don't want to count, those have ID -1;
    data.erase(thrust::remove_if(data.begin(), data.end(), is_out_of_bounds()), data.end());

    // number of histogram bins is equal to the maximum value plus one
    IndexType num_bins = histogram.size();

    // find the end of each bin of values
    thrust::counting_iterator<IndexType> search_begin(0);
    thrust::upper_bound(data.begin(), data.end(), search_begin,
                        search_begin + num_bins, histogram.begin());

    // compute the histogram by taking differences of the cumulative histogram
    thrust::adjacent_difference(histogram.begin(), histogram.end(),
                                histogram.begin());
}

int main(void) {
    thrust::device_vector<int> cellID(5);
    cellID[0] = -1; cellID[1] = 1; cellID[2] = 0; cellID[3] = 2; cellID[4] = 1;

    thrust::device_vector<float> mass(5);
    mass[0] = .5; mass[1] = 1.0; mass[2] = 2.0; mass[3] = 3.0; mass[4] = 4.0;

    thrust::device_vector<int> histogram(3);
    thrust::device_vector<float> density(3);

    computeHistogram(cellID, histogram);

    std::cout << "\nHistogram:\n";
    thrust::copy(histogram.begin(), histogram.end(),
                 std::ostream_iterator<int>(std::cout, " "));
    std::cout << std::endl;
    // this will print: " Histogram 1 2 1 "
    // meaning one element with ID 0, two elements with ID 1
    // and one element with ID 2

    /* here is what I am unable to implement:
     *
     * computeDensity(cellID, mass, density);
     *
     * print(density): 2.0 5.0 3.0
     *
     */
}
I hope the comment at the end of the file also makes clear what I mean by computing the density. If there is any question open, please feel free to ask. Thanks!
There still seems to be some confusion about my problem, which I am sorry for! Therefore I added some pictures.
Consider the first picture. To my understanding, a histogram would simply be the count of particles per grid cell. In this case a histogram would be an array of size 36, since there are 36 cells. Also, there would be a lot of zero entries in the vector, since, for example, in the upper left corner almost no cell contains a particle. This is what I already have in my code.
Now consider the slightly more complicated case. Here each particle has a different mass, indicated by the different size in the plot. To compute the density I can't just add the number of particles per cell, but I have to add the mass of all particles per cell. This is what I'm unable to implement.
What you described in your example does not look like a histogram but rather like a segmented reduction.
The following example code uses thrust::reduce_by_key to sum up the masses of particles within the same cell:
density.cu
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/scatter.h>
#include <iostream>

#define PRINTER(name) print(#name, (name))
template <template <typename...> class V, typename T, typename ...Args>
void print(const char* name, const V<T,Args...> & v)
{
    std::cout << name << ":\t\t";
    thrust::copy(v.begin(), v.end(), std::ostream_iterator<T>(std::cout, "\t"));
    std::cout << std::endl << std::endl;
}

int main()
{
    const int particle_count = 5;
    const int cell_count = 10;

    thrust::device_vector<int> cellID(particle_count);
    cellID[0] = -1; cellID[1] = 1; cellID[2] = 0; cellID[3] = 2; cellID[4] = 1;

    thrust::device_vector<float> mass(particle_count);
    mass[0] = .5; mass[1] = 1.0; mass[2] = 2.0; mass[3] = 3.0; mass[4] = 4.0;

    std::cout << "input data" << std::endl;
    PRINTER(cellID);
    PRINTER(mass);

    thrust::sort_by_key(cellID.begin(), cellID.end(), mass.begin());

    std::cout << "after sort_by_key" << std::endl;
    PRINTER(cellID);
    PRINTER(mass);

    thrust::device_vector<int> reduced_cellID(particle_count);
    thrust::device_vector<float> density(particle_count);

    int new_size = thrust::reduce_by_key(cellID.begin(), cellID.end(),
                                         mass.begin(),
                                         reduced_cellID.begin(),
                                         density.begin()
                                        ).second - density.begin();

    if (reduced_cellID[0] == -1)
    {
        density.erase(density.begin());
        reduced_cellID.erase(reduced_cellID.begin());
        new_size--;
    }

    density.resize(new_size);
    reduced_cellID.resize(new_size);

    std::cout << "after reduce_by_key" << std::endl;
    PRINTER(density);
    PRINTER(reduced_cellID);

    thrust::device_vector<float> final_density(cell_count);
    thrust::scatter(density.begin(), density.end(), reduced_cellID.begin(), final_density.begin());

    PRINTER(final_density);
}
compile using
nvcc -std=c++11 density.cu -o density
output
input data
cellID: -1 1 0 2 1
mass: 0.5 1 2 3 4
after sort_by_key
cellID: -1 0 1 1 2
mass: 0.5 2 1 4 3
after reduce_by_key
density: 2 5 3
reduced_cellID: 0 1 2
final_density: 2 5 3 0 0 0 0 0 0 0

Counting occurrences of numbers in a CUDA array

I have an array of unsigned integers stored on the GPU with CUDA (typically 1000000 elements). I would like to count the occurrences of every number in the array. There are only a few distinct numbers (about 10), but these numbers can span from 1 to 1000000. About 9/10ths of the numbers are 0, and I don't need the count of those. The result looks something like this:
58458 -> 1000 occurrences
15 -> 412 occurrences
I have an implementation using atomicAdds, but it is too slow (a lot of threads write to the same address). Does someone know of a fast/efficient method?
You can implement a histogram by first sorting the numbers, and then doing a keyed reduction.
The most straightforward method would be to use thrust::sort and then thrust::reduce_by_key. It's also often much faster than ad hoc binning based on atomics. Here's an example.
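A minimal sketch of that approach (my own illustration, not the originally linked example): sort the data, then reduce_by_key with a constant_iterator of ones as the values, which yields each distinct number next to its count. The count for 0 shows up as well; it can simply be ignored or stripped from the output afterwards.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>
#include <iostream>

int main(){
    // toy input standing in for the ~1000000-element array from the question
    unsigned int h_data[] = {58458, 0, 15, 58458, 0, 15, 58458, 0};
    const int n = sizeof(h_data)/sizeof(h_data[0]);
    thrust::device_vector<unsigned int> d_data(h_data, h_data+n);

    // bring equal numbers together so reduce_by_key sees contiguous runs
    thrust::sort(d_data.begin(), d_data.end());

    thrust::device_vector<unsigned int> d_keys(n);   // distinct numbers
    thrust::device_vector<int> d_counts(n);          // their occurrence counts
    int num_distinct = thrust::reduce_by_key(d_data.begin(), d_data.end(),
                                             thrust::make_constant_iterator(1),
                                             d_keys.begin(), d_counts.begin()
                                            ).first - d_keys.begin();

    for (int i = 0; i < num_distinct; i++)
        std::cout << d_keys[i] << " -> " << d_counts[i] << " occurrences" << std::endl;
    return 0;
}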
I suppose you can find help in the CUDA examples, specifically the histogram examples. They are part of the GPU computing SDK.
You can find them here: http://developer.nvidia.com/cuda-cc-sdk-code-samples#histogram. They even have a whitepaper explaining the algorithms.
I'm comparing two approaches suggested at the duplicate question thrust count occurrence, namely,
Using thrust::counting_iterator and thrust::upper_bound, following the histogram Thrust example;
Using thrust::unique_copy and thrust::upper_bound.
Below, please find a fully worked example.
#include <time.h>       // --- time
#include <stdlib.h>     // --- srand, rand
#include <iostream>

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/unique.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>

#include "Utilities.cuh"
#include "TimingGPU.cuh"

//#define VERBOSE
#define NO_HISTOGRAM

/********/
/* MAIN */
/********/
int main() {

    const int N = 1048576;
    //const int N = 20;
    //const int N = 128;

    TimingGPU timerGPU;

    // --- Initialize random seed
    srand(time(NULL));

    thrust::host_vector<int> h_code(N);

    for (int k = 0; k < N; k++) {
        // --- Generate random numbers between 0 and 9
        h_code[k] = (rand() % 10);
    }

    thrust::device_vector<int> d_code(h_code);
    //thrust::device_vector<unsigned int> d_counting(N);

    thrust::sort(d_code.begin(), d_code.end());

    h_code = d_code;

    timerGPU.StartCounter();

#ifdef NO_HISTOGRAM
    // --- The number of d_cumsum bins is equal to the maximum value plus one
    int num_bins = d_code.back() + 1;

    thrust::device_vector<int> d_code_unique(num_bins);
    thrust::unique_copy(d_code.begin(), d_code.end(), d_code_unique.begin());

    thrust::device_vector<int> d_counting(num_bins);
    thrust::upper_bound(d_code.begin(), d_code.end(), d_code_unique.begin(), d_code_unique.end(), d_counting.begin());
#else
    thrust::device_vector<int> d_cumsum;

    // --- The number of d_cumsum bins is equal to the maximum value plus one
    int num_bins = d_code.back() + 1;

    // --- Resize d_cumsum storage
    d_cumsum.resize(num_bins);

    // --- Find the end of each bin of values - Cumulative d_cumsum
    thrust::counting_iterator<int> search_begin(0);
    thrust::upper_bound(d_code.begin(), d_code.end(), search_begin, search_begin + num_bins, d_cumsum.begin());

    // --- Compute the histogram by taking differences of the cumulative d_cumsum
    //thrust::device_vector<int> d_counting(num_bins);
    //thrust::adjacent_difference(d_cumsum.begin(), d_cumsum.end(), d_counting.begin());
#endif

    printf("Timing GPU = %f\n", timerGPU.GetCounter());

#ifdef VERBOSE
    thrust::host_vector<int> h_counting(d_counting);
    printf("After\n");
    for (int k = 0; k < N; k++) printf("code = %i\n", h_code[k]);
#ifndef NO_HISTOGRAM
    thrust::host_vector<int> h_cumsum(d_cumsum);
    printf("\nCounting\n");
    for (int k = 0; k < num_bins; k++) printf("element = %i; counting = %i; cumsum = %i\n", k, h_counting[k], h_cumsum[k]);
#else
    thrust::host_vector<int> h_code_unique(d_code_unique);
    printf("\nCounting\n");
    for (int k = 0; k < N; k++) printf("element = %i; counting = %i\n", h_code_unique[k], h_counting[k]);
#endif
#endif
}
The first approach proved to be the fastest. On an NVIDIA GTX 960 card, I measured the following timings for N = 1048576 array elements:
First approach: 2.35ms
First approach without thrust::adjacent_difference: 1.52ms
Second approach: 4.67ms
Please note that there is no strict need to calculate the adjacent difference explicitly, since this operation can be done manually within a kernel, if needed (a hypothetical sketch follows).
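For instance, a hypothetical kernel (my own sketch, not part of the benchmark above) can consume the cumulative counts produced by thrust::upper_bound directly, deriving each bin's count on the fly as the difference of adjacent cumulative values:

#include <cstdio>
#include <cuda_runtime.h>

// hypothetical kernel: one thread per bin; the bin count is the difference
// of adjacent cumulative values, so no separate adjacent_difference pass is needed
__global__ void use_counts(const int* __restrict__ d_cumsum, int num_bins)
{
    int bin = blockIdx.x * blockDim.x + threadIdx.x;
    if (bin < num_bins) {
        int count = d_cumsum[bin] - (bin > 0 ? d_cumsum[bin - 1] : 0);
        printf("bin %d: %d elements\n", bin, count);  // stand-in for the real per-bin work
    }
}

int main()
{
    // example cumulative counts as produced by the upper_bound step (4 bins)
    int h_cumsum[] = {3, 7, 7, 10};
    const int num_bins = 4;
    int *d_cumsum;
    cudaMalloc(&d_cumsum, num_bins * sizeof(int));
    cudaMemcpy(d_cumsum, h_cumsum, num_bins * sizeof(int), cudaMemcpyHostToDevice);
    use_counts<<<1, 32>>>(d_cumsum, num_bins);
    cudaDeviceSynchronize();
    cudaFree(d_cumsum);
    return 0;
}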
As others have said, you can use the sort & reduce_by_key approach to count frequencies. In my case, I needed to get the mode of an array (the element with maximum frequency/occurrence), so here is my solution:
1 - First, we create two new arrays, one containing a copy of the input data and another filled with ones, to later reduce (sum) them:
// Input:            [1 3 3 3 2 2 3]
// *(Temp) dev_keys: [1 3 3 3 2 2 3]
// *(Temp) dev_ones: [1 1 1 1 1 1 1]

// Copy input data (myptr points to the input data, size is its length)
thrust::device_vector<int> dev_keys(myptr, myptr + size);

// Fill an array with ones
thrust::device_vector<int> dev_ones(size);
thrust::fill(dev_ones.begin(), dev_ones.end(), 1);
2 - Then, we sort the keys since the reduce_by_key function needs the array to be sorted.
// Sort keys (see below why)
thrust::sort(dev_keys.begin(), dev_keys.end());
3 - Later, we create two output vectors, for the (unique) keys and their frequencies:
// (N here is the number of input elements, i.e. the same as size above)
thrust::device_vector<int> output_keys(N);
thrust::device_vector<int> output_freqs(N);
4 - Finally, we perform the reduction by key:
// Reduce contiguous keys: [1 3 3 3 2 2 3] => [1 3 2 1] Vs. [1 3 3 3 3 2 2] => [1 4 2]
thrust::pair<thrust::device_vector<int>::iterator, thrust::device_vector<int>::iterator> new_end;
new_end = thrust::reduce_by_key(dev_keys.begin(), dev_keys.end(), dev_ones.begin(), output_keys.begin(), output_freqs.begin());
5 - ...and if we want, we can get the most frequent element
// Get most frequent element
// Get index of the maximum frequency
int num_keys = new_end.first - output_keys.begin();
thrust::device_vector<int>::iterator iter = thrust::max_element(output_freqs.begin(), output_freqs.begin() + num_keys);
unsigned int index = iter - output_freqs.begin();
int most_frequent_key = output_keys[index];
int most_frequent_val = output_freqs[index]; // Frequencies