CUDA parallelization

CUDA parallelization - cuda

I'm having trouble doing the parallelization on an array of numbers with CUDA.
So, for example if we have an array M containing numbers ( 1 , 2 , 3 , 4 , 5)
And If I were to remove the number 2 in the array and shift everything to the left,
the resulting array would be ( 1 , 3 , 4 , 5 , 5 )
where M[1] = M[2], M[2] = M[3] , M[3] = M[4]
And my question is how can we do this in parallel in cuda? Because when we parallel this
there might be a race condition where the number 2 (M[1]) might not be the first one to
act first, if M[2] were the first one to shift, the resulting array would become
( 1 , 4 , 4 , 5 , 5). Is there any method to handle this? I'm fairly new to cuda so I'm
not sure what to do...
My current code is as follows:
__global__ void gpu_shiftSeam(int *MCEnergyMat, int *seam, int width, int height, int currRow)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
int index = i+width*j;
if(i < width && j <height)
{
//shift values of -1 to the side of the image
if(MCEnergyMat[i+width*j] == -1)
{
if(i+1 != width)
MCEnergyMat[index] = MCEnergyMat[index+1];
}
if(seam[j] < i)
{
if(i+1 != width)
MCEnergyMat[index] = MCEnergyMat[index+1];
}
}
}
Where seam[i] contains the index I would like to remove in the array. and MCEnergyMat is just a 1D array converted from a 2d array... However, my code does not work... and I believe race condition is the problem.
Thanks!

As talonmies notes in his comment, this sort of thing is called "stream compaction". Here's how you would do it with Thrust:
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/remove.h>
#include <iostream>
int main()
{
int data[5] = {1,2,3,4,5};
thrust::device_vector<int> d_vec(data, data + 5);
// new_end points to the end of the sequence after 2 has been thrown out
thrust::device_vector<int>::iterator new_end =
thrust::remove(d_vec.begin(), d_vec.end(), 2);
// erase everything after the new end
d_vec.erase(new_end, d_vec.end());
// prove that it worked
thrust::host_vector<int> h_vec = d_vec;
std::cout << "result: ";
thrust::copy(h_vec.begin(), h_vec.end(), std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
return 0;
}
Here's the result:
$ nvcc test.cu -run
result: 1 3 4 5

Related

how to avoid thread divergence in this CUDA kernel?

for the CUDA kernel function, get branching divergence shown below, how to optimize it?
int gx = threadIdx.x + blockDim.x * blockIdx.x;
val = g_data[gx];
if (gx % 4 == 0)
val = op1(val);
else if (gx % 4 == 1)
val = op2(val);
else if (gx % 4 == 2)
val = op3(val);
else if (gx % 4 == 3)
val = op4(val);
g_data[gx] = val;

If I were programming in CUDA, I certainly wouldn't do any of this. However to answer your question:
how to avoid thread divergence in this CUDA kernel?
You could do something like this:
int gx = threadIdx.x + blockDim.x * blockIdx.x;
val = g_data[gx];
int gx_bit_0 = gx & 1;
int gx_bit_1 = (gx & 2) >> 1;
val = (1-gx_bit_1)*(1-gx_bit_0)*op1(val) + (1-gx_bit_1)*(gx_bit_0)*op2(val) + (gx_bit_1)*(1-gx_bit_0)*op3(val) + (gx_bit_1)*(gx_bit_0)*op4(val);
g_data[gx] = val;
Here is a full test case:
$ cat t1914.cu
#include <iostream>
__device__ float op1(float val) { return val + 1.0f;}
__device__ float op2(float val) { return val + 2.0f;}
__device__ float op3(float val) { return val + 3.0f;}
__device__ float op4(float val) { return val + 4.0f;}
__global__ void k(float *g_data){
int gx = threadIdx.x + blockDim.x * blockIdx.x;
float val = g_data[gx];
int gx_bit_0 = gx & 1;
int gx_bit_1 = (gx & 2) >> 1;
val = (1-gx_bit_1)*(1-gx_bit_0)*op1(val) + (1-gx_bit_1)*(gx_bit_0)*op2(val) + (gx_bit_1)*(1-gx_bit_0)*op3(val) + (gx_bit_1)*(gx_bit_0)*op4(val);
g_data[gx] = val;
}
const int N = 32;
int main(){
float *data;
cudaMallocManaged(&data, N*sizeof(float));
for (int i = 0; i < N; i++) data[i] = 1.0f;
k<<<1,N>>>(data);
cudaDeviceSynchronize();
for (int i = 0; i < N; i++) std::cout << data[i] << std::endl;
}
$ nvcc -o t1914 t1914.cu
$ compute-sanitizer ./t1914
========= COMPUTE-SANITIZER
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
========= ERROR SUMMARY: 0 errors
$

Solution by changing the work per thread
The best solution with the existing data layout is to let every thread compute 4 consecutive values. It's better to have fewer threads that can work properly than have more that can't.
float* g_data;
int gx = threadIdx.x + blockDim.x * blockIdx.x;
g_data[4 * gx] = op1(g_data[4 * gx]);
g_data[4 * gx + 1] = op2(g_data[4 * gx + 1]);
g_data[4 * gx + 2] = op3(g_data[4 * gx + 2]);
g_data[4 * gx + 3] = op4(g_data[4 * gx + 3]);
If the size of g_data is not a multiple of 4, put an if around the index operations. If it is always a multiple of 4 and properly aligned, load and store 4 values as a float4 for better performance.
Solution by reordering the work
As all my talk about float4 may have suggested, your input data appears to be some form of 2D structure where one every four elements share a similar function. Maybe it is an array of structs or an array of vectors -- in other words, a matrix.
For the purpose of explaining what I mean, I consider it a Nx4 matrix. If you transpose this into a 4xN matrix and apply a kernel to this, most of your problems disappear. Because then entries for which the same operation has to be done are placed next to each other in memory and that makes writing an efficient kernel easier. Something like this:
float* g_data;
int rows_in_g;
int gx = threadIdx.x + blockDim.x * blockIdx.x;
int gy = threadIdx.y;
float& own_g = g_data[gx + rows_in_g * gy];
switch(gy) {
case 0: own_g = op1(own_g); break;
case 1: own_g = op2(own_g); break;
case 2: own_g = op3(own_g); break;
case 3: own_g = op4(own_g); break;
default: break;
}
Start this as a 2D kernel with blocksize x=32, y=4 and gridsize x=N/32, y=1.
Now your kernel is still divergent, but all threads within a warp will execute the same case and access consecutive floats in memory. That's the best you can achieve. Of course this all depends on whether you can change the data layout.

Issues accessing an array based on an offset in CUDA

This question more than likely has a simple solution.
Each of the threads I spawn are to be initialized to a starting value. For example, if I have a character set, char charSet[27] = "abcdefghijklmnopqrstuvwxyz", I spawn 26 threads. So threadIdx.0 corresponds to charSet[0] = a. Simple enough.
I thought I figured out a way to do this, until I examined what my threads were doing...
Here's an example program that I wrote:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
__global__ void example(int offset, int reqThreads){
//Declarations
unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
if(idx < reqThreads){
unsigned int tid = (offset * threadIdx.x) + blockIdx.x * blockDim.x; //Used to initialize array <-----Problem is here
printf("%d ", tid);
}
}
int main(void){
//Declarations
int minLength = 1;
int maxLength = 2;
int offset;
int totalThreads;
int reqThreads;
int base = 26;
int maxThreads = 512;
int blocks;
int i,j;
for(i = minLength; i<=maxLength; i++){
offset = i;
//Calculate parameters
reqThreads = (int) pow((double) base, (double) offset); //Casting I would never do, but works here
totalThreads = reqThreads;
for(j = 1;(totalThreads % maxThreads) != 0; j++) totalThreads += 1; //Create a multiple of 512
blocks = totalThreads/maxThreads;
//Call the kernel
example<<<blocks, totalThreads>>>(offset, reqThreads);
cudaThreadSynchronize();
printf("\n\n");
}
system("pause");
return 0;
}
My reasoning was that this statement
unsigned int tid = (offset * threadIdx.x) + blockIdx.x * blockDim.x;
would allow me to introduce an offset. If offset were 2, threadIdx.0 * offset = 0, threadIdx.1 * offset = 2, threadIdx.2 * offset = 4, and so forth. That definitely does not happen. The output of the above program works when offset == 1:
0 1 2 3 4 5...26.
But when offset == 2:
1344 1346 1348 1350...
In fact, those values are way outside the bounds of my array. So I'm not sure what is going wrong.
Any constructive input is appreciated.

I believe your kernel call should look like this:
example<<<blocks, maxThreads>>>(offset, reqThreads);
Your intent is to launch thread blocks of 512 threads, so that number (maxThreads) should be your second kernel config parameter, which is the number of threads per block.
Also, this is deprecated:
cudaThreadSynchronize();
Use this instead:
cudaDeviceSynchronize();
And if you use printf from the kernel for a large amount of output, you can lose some of the output if you exceed the buffer.
Finally, I'm not sure your reasoning is correct about the range of indices that will be printed.
When offset = 2 (the second pass through the loop), then 26^2 = 676, and you will then end up with 1024 threads, (in 2 thread blocks, if you make the above fixes). The second threadblock will have
tid = (2*threadIdx.x) + blockDim.x*blockIdx.x;
(0..164) (512) (1)
So the second threadblock should print out indices of 512 (minimum) up to (2*164) + 512 = 900
(164 = 675 - 511)
The first threadblock should print out indices of:
tid = (2*threadIdx.x) + blockDim.x * blockIdx.x
(0..511) (512) (0)
i.e. 0 to 1022

Summing the rows of a matrix (stored in either row-major or column-major order) in CUDA

I'm working on the problem summing the rows of a matrix in CUDA. I'm giving the following example.
Suppose to have the following 20 * 4 array:
1 2 3 4
4 1 2 3
3 4 1 2
.
1 2 3 4
.
.
.
.
.
.
.
.
2 1 3 4
After flattened the 2d array to a 1d array (either in row-major or column-major order), I need to assign each thread to a different row and calculate the cost for that row.
For example
- thread 1 should calculate the cost for 1 2 3 4
- thread 2 should calculate the cost for 4 1 2 3
How can I that in CUDA?
Thank you all for the reply

#include <stdio.h>
#include <stdlib.h>
#define MROWS 20
#define NCOLS 4
#define nTPB 256
__global__ void mykernel(int *costdata, int rows, int cols, int *results){
int tidx = threadIdx.x + blockDim.x*blockIdx.x;
if (tidx < rows){
int mycost = 0;
for (int i = 0; i < cols; i++)
mycost += costdata[(tidx*cols)+i];
results[tidx] = mycost;
}
}
int main(){
//define and initialize host and device storage for cost and results
int *d_costdata, *h_costdata, *d_results, *h_results;
h_results = (int *)malloc(MROWS*sizeof(int));
h_costdata = (int *)malloc(MROWS*NCOLS*sizeof(int));
for (int i=0; i<(MROWS*NCOLS); i++)
h_costdata[i] = rand()%4;
cudaMalloc((void **)&d_results, MROWS*sizeof(int));
cudaMalloc((void **)&d_costdata, MROWS*NCOLS*sizeof(int));
//copy cost data from host to device
cudaMemcpy(d_costdata, h_costdata, MROWS*NCOLS*sizeof(int), cudaMemcpyHostToDevice);
mykernel<<<(MROWS + nTPB - 1)/nTPB, nTPB>>>(d_costdata, MROWS, NCOLS, d_results);
// copy results back from device to host
cudaMemcpy(h_results, d_results, MROWS*sizeof(int), cudaMemcpyDeviceToHost);
for (int i=0; i<MROWS; i++){
int loc_cost = 0;
for (int j=0; j<NCOLS; j++) loc_cost += h_costdata[(i*NCOLS)+j];
printf("cost[%d]: host= %d, device = %d\n", i, loc_cost, h_results[i]);
}
}
This assumes "cost" of each row is just the sum of the elements in each row. If you have a different "cost" function, you can modify the activity in the kernel for-loop accordingly. This also assumes C-style row-major data storage (1 2 3 4 4 1 2 3 3 4 1 2 etc.)
If you instead use column-major storage (1 4 3 etc.), you can slightly improve the performance, since the data reads can be fully coalesced. Then your kernel code could look like this:
for (int i = 0; i < cols; i++)
mycost += costdata[(i*rows)+tidx];
You should also use proper cuda error checking on all CUDA API calls and kernel calls.
EDIT: As discussed in the comments below, for the row-major storage case, in some situations it might give an increase in memory efficiency by electing to load 16-byte quantities rather than the base type. Following is a modified version that implements this idea for arbitrary dimensions and (more or less) arbitrary base types:
#include <iostream>
#include <typeinfo>
#include <cstdlib>
#include <vector_types.h>
#define MROWS 1742
#define NCOLS 801
#define nTPB 256
typedef double mytype;
__host__ int sizetype(){
int size = 0;
if ((typeid(mytype) == typeid(float)) || (typeid(mytype) == typeid(int)) || (typeid(mytype) == typeid(unsigned int)))
size = 4;
else if (typeid(mytype) == typeid(double))
size = 8;
else if ((typeid(mytype) == typeid(unsigned char)) || (typeid(mytype) == typeid(char)))
size = 1;
return size;
}
template<typename T>
__global__ void mykernel(const T *costdata, int rows, int cols, T *results, int size, size_t pitch){
int chunk = 16/size; // assumes size is a factor of 16
int tidx = threadIdx.x + blockDim.x*blockIdx.x;
if (tidx < rows){
T *myrowptr = (T *)(((unsigned char *)costdata) + tidx*pitch);
T mycost = (T)0;
int count = 0;
while (count < cols){
if ((cols-count)>=chunk){
// read 16 bytes
int4 temp = *((int4 *)(myrowptr + count));
int bcount = 16;
int j = 0;
while (bcount > 0){
mycost += *(((T *)(&temp)) + j++);
bcount -= size;
count++;}
}
else {
// read one quantity at a time
for (; count < cols; count++)
mycost += myrowptr[count];
}
results[tidx] = mycost;
}
}
}
int main(){
int typesize = sizetype();
if (typesize == 0) {std::cout << "invalid type selected" << std::endl; return 1;}
//define and initialize host and device storage for cost and results
mytype *d_costdata, *h_costdata, *d_results, *h_results;
h_results = (mytype *)malloc(MROWS*sizeof(mytype));
h_costdata = (mytype *)malloc(MROWS*NCOLS*sizeof(mytype));
for (int i=0; i<(MROWS*NCOLS); i++)
h_costdata[i] = (mytype)(rand()%4);
size_t pitch = 0;
cudaMalloc((void **)&d_results, MROWS*sizeof(mytype));
cudaMallocPitch((void **)&d_costdata, &pitch, NCOLS*sizeof(mytype), MROWS);
//copy cost data from host to device
cudaMemcpy2D(d_costdata, pitch, h_costdata, NCOLS*sizeof(mytype), NCOLS*sizeof(mytype), MROWS, cudaMemcpyHostToDevice);
mykernel<<<(MROWS + nTPB - 1)/nTPB, nTPB>>>(d_costdata, MROWS, NCOLS, d_results, typesize, pitch);
// copy results back from device to host
cudaMemcpy(h_results, d_results, MROWS*sizeof(mytype), cudaMemcpyDeviceToHost);
for (int i=0; i<MROWS; i++){
mytype loc_cost = (mytype)0;
for (int j=0; j<NCOLS; j++) loc_cost += h_costdata[(i*NCOLS)+j];
if ((i < 10) && (typesize > 1))
std::cout <<"cost[" << i << "]: host= " << loc_cost << ", device = " << h_results[i] << std::endl;
if (loc_cost != h_results[i]){ std::cout << "mismatch at index" << i << "should be:" << loc_cost << "was:" << h_results[i] << std::endl; return 1; }
}
std::cout << "Results are correct!" << std::endl;
}

Get statistics for a list of numbers using GPU

I have several lists of numbers on a file . For example,
.333, .324, .123 , .543, .00054
.2243, .333, .53343 , .4434
Now, I want to get the number of times each number occurs using the GPU. I believe this will be faster to do on the GPU than the CPU because each thread can process one list. What data structure should I use on the GPU to easily get the above counts. For example , for the above, the answer will look as follows:
.333 = 2 times in entire file
.324 = 1 time
etc..
I looking for a general solution. Not one that works only on devices with specific compute capability
Just writing kernel suggested by Pavan to see if I have implemented it efficiently:
int uniqueEle = newend.valiter – d_A;
int* count;
cudaMalloc((void**)&count, uniqueEle * sizeof(int)); // stores the count of each unique element
int TPB = 256;
int blocks = uniqueEle + TPB -1 / TPB;
//Cast d_I to raw pointer called d_rawI
launch<<<blocks,TPB>>>(d_rawI,count,uniqueEle);
__global__ void launch(int *i, int* count, int n){
int id = blockDim.x * blockIdx.x + threadIdx.x;
__shared__ int indexes[256];
if(id < n ){
indexes[threadIdx.x] = i[id];
//as occurs between two blocks
if(id % 255 == 0){
count[indexes] = i[id+1] - i[id];
}
}
__syncthreads();
if(id < ele - 1){
if(threadIdx.x < 255)
count[id] = indexes[threadIdx.x+1] – indexes[threadIdx.x];
}
}
Question: how to modify this kernel so that it handles arrays of arbitrary size. I.e , handle the condition when the total number of threads < number of elements

Here is how I would do the code in matlab
A = [333, .324, .123 , .543, .00054 .2243, .333, .53343 , .4434];
[values, locations] = unique(A); % Find unique values and their locations
counts = diff([0, locations]); % Find the count based on their locations
There is no easy way to do this in plain cuda, but you can use existing libraries to do this.
1) Thrust
It is also being shipped with CUDA toolkit from CUDA 4.0.
The matlab code can be roughly translated into thrust by using the following functions. I am not too proficient with thrust, but I am just trying to give you an idea on what routines to look at.
float _A[] = {.333, .324, .123 , .543, .00054 .2243, .333, .53343 , .4434};
int _I[] = {0, 1, 2, 3, 4, 5, 6, 7, 8};
float *A, *I;
// Allocate memory on device and cudaMempCpy values from _A to A and _I to I
int num = 9;
// Values vector
thrust::device_vector<float>d_A(A, A+num);
// Need to sort to get same values together
thrust::stable_sort(d_A, d_A+num);
// Vector containing 0 to num-1
thrust::device_vector<int>d_I(I, I+num);
// Find unique values and elements
thrust::device_vector<float>d_Values(num), d_Locations(num), d_counts(num);
// Find unique elements
thrust::device_vector<float>::iterator valiter;
thrust::device_vector<int>::iterator idxiter;
thrust::pair<valiter, idxiter> new_end;
new_end = thrust::unique_by_key(d_A, d_A+num, d_I, d_Values, d_Locations);
You now have the locations of the first instance of each unique value. You can now launch a kernel to find the differences between adjacent elements from 0 to new_end in d_Locations. Subtract the final value from num to get the count for final location.
EDIT (Adding code that was provided over chat)
Here is how the difference code needs to be done
#define MAX_BLOCKS 65535
#define roundup(A, B) = (((A) + (B) - 1) / (B))
int uniqueEle = newend.valiter – d_A;
int* count;
cudaMalloc((void**)&count, uniqueEle * sizeof(int));
int TPB = 256;
int num_blocks = roundup(uniqueEle, TPB);
int blocks_y = roundup(num_blocks, MAX_BLOCKS);
int blocks_x = roundup(num_blocks, blocks_y);
dim3 blocks(blocks_x, blocks_y);
kernel<<<blocks,TPB>>>(d_rawI, count, uniqueEle);
__global__ void kernel(float *i, int* count, int n)
{
int tx = threadIdx.x;
int bid = blockIdx.y * gridDim.x + blockIdx.x;
int id = blockDim.x * bid + tx;
__shared__ int indexes[256];
if (id < n) indexes[tx] = i[id];
__syncthreads();
if (id < n - 1) {
if (tx < 255) count[id] = indexes[tx + 1] - indexes[tx];
else count[id] = i[id + 1] - indexes[tx];
}
if (id == n - 1) count[id] = n - indexes[tx];
return;
}
2) ArrayFire
This is an easy to use, free array based library.
You can do the following in ArrayFire.
using namespace af;
float h_A[] = {.333, .324, .123 , .543, .00054 .2243, .333, .53343 , .4434};
int num = 9;
// Transfer data to device
array A(9, 1, h_A);
array values, locations, original;
// Find the unique values and locations
setunique(values, locations, original, A);
// Locations are 0 based, add 1.
// Add *num* at the end to find count of last value.
array counts = diff1(join(locations + 1, num));
Disclosure: I work for AccelerEyes, that develops this software.

To answer the latest addenum to this question - the diff kernel which would complete the thrust method proposed by Pavan could look something like this:
template<int blcksz>
__global__ void diffkernel(const int *i, int* count, const int n) {
int id = blockDim.x * blockIdx.x + threadIdx.x;
int strd = blockDim.x * gridDim.x;
int nmax = blcksz * ((n/blcksz) + ((n%blcksz>0) ? 1 : 0));
__shared__ int indices[blcksz+1];
for(; id<nmax; id+=strd) {
// Data load
indices[threadIdx.x] = (id < n) ? i[id] : n;
if (threadIdx.x == (blcksz-1))
indices[blcksz] = ((id+1) < n) ? i[id+1] : n;
__syncthreads();
// Differencing calculation
int diff = indices[threadIdx.x+1] - indices[threadIdx.x];
// Store
if (id < n) count[id] = diff;
__syncthreads();
}
}

here is a solution:
__global__ void counter(float* a, int* b, int N)
{
int idx = blockIdx.x*blockDim.x+threadIdx.x;
if(idx < N)
{
float my = a[idx];
int count = 0;
for(int i=0; i < N; i++)
{
if(my == a[i])
count++;
}
b[idx]=count;
}
}
int main()
{
int threads = 9;
int blocks = 1;
int N = blocks*threads;
float* h_a;
int* h_b;
float* d_a;
int* d_b;
h_a = (float*)malloc(N*sizeof(float));
h_b = (int*)malloc(N*sizeof(int));
cudaMalloc((void**)&d_a,N*sizeof(float));
cudaMalloc((void**)&d_b,N*sizeof(int));
h_a[0]= .333f;
h_a[1]= .324f;
h_a[2]= .123f;
h_a[3]= .543f;
h_a[4]= .00054f;
h_a[5]= .2243f;
h_a[6]= .333f;
h_a[7]= .53343f;
h_a[8]= .4434f;
cudaMemcpy(d_a,h_a,N*sizeof(float),cudaMemcpyHostToDevice);
counter<<<blocks,threads>>>(d_a,d_b,N);
cudaMemcpy(h_b,d_b,N*sizeof(int),cudaMemcpyDeviceToHost);
for(int i=0; i < N; i++)
{
printf("%f = %d times\n",h_a[i],h_b[i]);
}
cudaFree(d_a);
cudaFree(d_b);
free(h_a);
free(h_b);
getchar();
return 0;
}

removing entries in sorted list: efficiently in gpu

I am trying to code the following problem in cuda/thrust. I am given a list of key and three values associated with each keys. I have managed to sort them in lexicographic order. The input now needs to be reduced if inputs with same key have each value-wise relation. In example below, V1(a)<=V1(c) and V2(a)<=V2(c) and V3(a)<=V3(c), implies that Input a < Input c, and hence, Input c is removed from output.
Example Input:
Key V1 V2 V3
a. 1 2 5 3
b. 1 2 6 2
c. 1 2 7 4
d. 1 3 6 5
e. 2 8 8 8
f. 3 1 2 4
Example Output:
Key V1 V2 V3
a. 1 2 5 3
b. 1 2 6 2
e. 2 8 8 8
f. 3 1 2 4
Input a < Input c ==> c removed
Input a < Input d ==> d removed
I’ve been able to solve the above problem using for-loops, and if-statements. I am currently trying to solve this using gpu based cuda/thrust. Could this be done on the gpu (preferably thrust) or an individual kernel has to be written in cuda ?
I have not been to formulate this problem using unique as discussed in Thrust: Removing duplicates in key-value arrays
Edited to include program "stl/c++" program to generate above scenario: section "Reducing myMap" is my implementation using for-loops and if-statements.
#include <iostream>
#include <tr1/array>
#include <vector>
#include <algorithm>
struct mapItem {
mapItem(int k, int v1, int v2, int v3){
key=k;
std::tr1::array<int,3> v = {v1, v2, v3};
values=v;
};
int key;
std::tr1::array<int,3> values;
};
struct sortLexiObj{
bool operator()(const mapItem& lhs, const mapItem& rhs){
return lhs.values < rhs.values;
}
};
struct sortKey{
bool operator()(const mapItem& lhs, const mapItem& rhs){
return lhs.key < rhs.key;
}
};
int main(int argc, char** argv){
std::vector<mapItem> myMap;
// Set up initial matrix:
myMap.push_back(mapItem(3, 1, 2, 4));
myMap.push_back(mapItem(1, 2, 6, 2));
myMap.push_back(mapItem(1, 2, 5, 3));
myMap.push_back(mapItem(1, 3, 6, 5));
myMap.push_back(mapItem(2, 8, 8, 8));
myMap.push_back(mapItem(1, 2, 7, 4));
std::sort(myMap.begin(), myMap.end(), sortLexiObj());
std::stable_sort(myMap.begin(), myMap.end(), sortKey());
std::cout << "\r\nOriginal sorted Map" << std::endl;
for(std::vector<mapItem>::iterator mt=myMap.begin(); mt!=myMap.end(); ++mt){
std::cout << mt->key << "\t";
for(std::tr1::array<int,3>::iterator it=(mt->values).begin(); it!=(mt->values).end(); ++it){
std::cout << *it << " ";
}
std::cout << std::endl;
}
/////////////////////////
// Reducing myMap
for(std::vector<mapItem>::iterator it=myMap.begin(); it!=myMap.end(); ++it){
std::vector<mapItem>::iterator jt=it; ++jt;
for (; jt != myMap.end();) {
if ( (it->key == jt->key)){
if ( it->values.at(0) <= jt->values.at(0) &&
it->values.at(1) <= jt->values.at(1) &&
it->values.at(2) <= jt->values.at(2) ) {
jt = myMap.erase(jt);
}
else ++jt;
}
else break;
}
}
std::cout << "\r\nReduced Map" << std::endl;
for(std::vector<mapItem>::iterator mt=myMap.begin(); mt!=myMap.end(); ++mt){
std::cout << mt->key << "\t";
for(std::tr1::array<int,3>::iterator it=(mt->values).begin(); it!=(mt->values).end(); ++it){
std::cout << *it << " ";
}
std::cout << std::endl;
}
return 0;
}

I think that you can use thrust::unique with a predicate as it's shown in Thrust: Removing duplicates in key-value arrays.
Actually, we can do it because of the following characteristic of unique:
For each group of consecutive elements in the range [first, last) with the same value, unique removes all but the first element of the group.
So, you should define a predicate to test for pseudo-equality that will return true for tuples that have the same key and all values are smaller in the first tuple:
typedef thrust::tuple<int, int, int, int> tuple_t;
// a functor which defines your *uniqueness* condition
struct tupleEqual
{
__host__ __device__
bool operator()(tuple_t x, tuple_t y)
{
return ( (x.get<0>() == y.get<0>()) // same key
&& (x.get<1>() <= y.get<1>()) // all values are smaller
&& (x.get<2>() <= y.get<2>())
&& (x.get<3>() <= y.get<3>()));
}
};
And you have to apply it to a sorted collection. In this way, only the first tuple (the smallest) will not be removed.
A tuple with the same key and a bigger value in V1, V2 or V3 will yield false so it won't be removed.
typedef thrust::device_vector< int > IntVector;
typedef IntVector::iterator IntIterator;
typedef thrust::tuple< IntIterator, IntIterator, IntIterator, IntIterator > IntIteratorTuple;
typedef thrust::zip_iterator< IntIteratorTuple > ZipIterator;
IntVector keyVector;
IntVector valVector1, valVector2, valVector3;
tupleEqual predicate;
ZipIterator newEnd = thrust::unique(
thrust::make_zip_iterator(
thrust::make_tuple(
keyVector.begin(),
valVector1.begin(),
valVector2.begin(),
valVector3.begin() ) ),
thrust::make_zip_iterator(
thrust::make_tuple(
keyVector.end(),
valVector1.end(),
valVector2.end(),
valVector3.end() ) ),
predicate );
IntIteratorTuple endTuple = newEnd.get_iterator_tuple();
keyVector.erase( thrust::get<0>( endTuple ), keyVector.end() );
valVector1.erase( thrust::get<1>( endTuple ), valVector1.end() );
valVector2.erase( thrust::get<2>( endTuple ), valVector2.end() );
valVector3.erase( thrust::get<3>( endTuple ), valVector3.end() );

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

CUDA parallelization - cuda

Related

how to avoid thread divergence in this CUDA kernel?

Issues accessing an array based on an offset in CUDA

Summing the rows of a matrix (stored in either row-major or column-major order) in CUDA

Get statistics for a list of numbers using GPU

removing entries in sorted list: efficiently in gpu

Categories

Resources