I have a requirement where I want to parallelize the following using CUDA thrust.
std::vector<float> a, b, c; // size of each is (size.x * size.y * size.z), kind of a 3D array.
What I am trying to do is this
a[i] = 0 if b[i] < 0
a[i] = c[i] otherwise (i.e. b[i] >= 0)
This is the host code.
for (int i = 0; i < size.x; i++)
for (int j = 0; j < size.y; j++)
for (int z = 0; z < size.z; z++) {
a.data[get_idx(i, j, z)] = (b.data[get_idx(i, j, z)] < 0) ?
(0) : (1 * c.data[get_idx(i, j, z)]);
}
get_idx() just converts the loop indices to array indices.
What I want is an equivalent thrust::api that does this.
I have the thrust::device_vectors ready, with the values of the corresponding a, b, c copied over to the device.
thrust::device_vector<float> dev_a, dev_b, dev_c;
What I have tried is to use thrust::for_each but I am unable to find a way to assign dev_c[i] to dev_a[i].
I would love a nudge in the right direction, maybe which thrust:api is the most suitable. Thanks in advance.
After doing some more digging around, I found the correct thrust api.
thrust::replace_copy_if
There is an overload of replace_copy_if which takes a 'stencil' as input; the predicate is applied to the stencil to decide whether the value is copied or replaced.
In my case, 'b' is the stencil.
The following code works now.
struct is_less_than_zero
{
    __host__ __device__ bool operator()(float x) const
    {
        return x < 0;
    }
};

is_less_than_zero pred{};
thrust::replace_copy_if(thrust::device, dev_c.begin(), dev_c.end(),
                        dev_b.begin(), dev_a.begin(), pred, 0.0f);
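For completeness, here is a self-contained sketch of the same call (the vector sizes and values are only illustrative):
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/replace.h>
#include <thrust/execution_policy.h>

struct is_less_than_zero
{
    __host__ __device__ bool operator()(float x) const { return x < 0; }
};

int main()
{
    thrust::host_vector<float> h_b(4), h_c(4);
    h_b[0] = -1.0f; h_b[1] = 2.0f; h_b[2] = -3.0f; h_b[3] = 4.0f;
    h_c[0] = 10.0f; h_c[1] = 20.0f; h_c[2] = 30.0f; h_c[3] = 40.0f;

    // copy the host data to the device
    thrust::device_vector<float> dev_b = h_b;
    thrust::device_vector<float> dev_c = h_c;
    thrust::device_vector<float> dev_a(dev_c.size());

    // wherever the stencil (dev_b) is < 0, write 0; otherwise copy dev_c
    thrust::replace_copy_if(thrust::device,
                            dev_c.begin(), dev_c.end(),  // input
                            dev_b.begin(),               // stencil
                            dev_a.begin(),               // output
                            is_less_than_zero{},         // predicate applied to the stencil
                            0.0f);                       // replacement value
    // dev_a now holds {0, 20, 0, 40}
    return 0;
}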
I am trying to pass a row of pointers of a two-dimensional array of pointers in CUDA. See my code below; the array of pointers is noLocal. Because I am doing an atomicAdd I am expecting a number different from zero at the line printf("Holaa %d\n", local[0][0]);, but the value I get is 0. Could you help me pass a row in CUDA by reference, please?
__global__ void myadd(int *data[8])
{
unsigned int x = blockIdx.x;
unsigned int y = threadIdx.x;
unsigned int z = threadIdx.y;
int tid = blockDim.x * blockIdx.x + threadIdx.x;
//printf("Ola sou a td %d\n", tid);
for (int i; i<8; i++)
atomicAdd(&(*data)[i],10);
}
int main(void)
{
int local[20][8] = { 0 };
int *noLocal[20][8];
for (int d = 0; d< 20;d++) {
for (int dd = 0; dd< 8; dd++) {
cudaMalloc(&(noLocal[d][dd]), sizeof(int));
cudaMemcpy(noLocal[d][dd], &(local[d][dd]), sizeof(int), cudaMemcpyHostToDevice);
}
myadd<<<20, dim3(10, 20)>>>(noLocal[d]);
}
for (int d = 0; d< 20;d++)
for (int dd = 0; dd < 8; dd++)
cudaMemcpy(&(local[d][dd]), noLocal[d][dd], sizeof(int), cudaMemcpyDeviceToHost);
printf("Holaa %d\n", local[0][0]);
for (int d = 0; d < 20; d++)
for (int dd = 0; dd < 8; dd++)
cudaFree(noLocal[d][dd]);
}
I believe you received good advice in the other answer. I don't recommend this coding pattern. For general reference material on creating 2D arrays in CUDA, see this answer.
When I compile the code you have shown, I get warnings of the form "i is used before its value is set". This kind of warning should not be ignored. It arises from this statement which doesn't make sense to me:
for (int i; i<8; i++)
that should be:
for (int i = 0; i<8; i++)
It's not clear you understand the C++ concepts of pointers and arrays. This:
int local[20][8] = { 0 };
represents an array of 20x8 = 160 integers. If you want to imagine it as an array of pointers, you could pretend that it includes 20 pointers of the form local[0], local[1]..local[19]. Each of those "pointers" points to an array of 8 integers. But there is no sensible comparison to suggest that it has 160 pointers in it. Furthermore the usage pattern you indicate in your kernel does not suggest that you expect 160 pointers to individual integers. But that is exactly what you are creating here:
int *noLocal[20][8]; //this is declaring a 2D array of 160 *pointers*
for (int d = 0; d< 20;d++) { // the combination of these loops means
for (int dd = 0; dd< 8; dd++) { // you will create 160 *pointers*
cudaMalloc(&(noLocal[d][dd]), sizeof(int));
To mimic your host array (local) you want to create 20 pointers each of which is pointing to an allocation of 8 int quantities. The usage in your kernel code here:
&(*data)[i]
means that you intend to take a single pointer, and offset it by i values ranging from 0 to 7. It does not mean that you expect to receive 8 individual pointers. Again, this is C++ behavior, not unique or specific to CUDA.
In order to make your code "sensible" there were a variety of changes I had to make. Here's a "fixed" version:
$ cat t1858.cu
#include <cstdio>
__global__ void myadd(int data[8])
{
// unsigned int x = blockIdx.x;
// unsigned int y = threadIdx.x;
// unsigned int z = threadIdx.y;
// int tid = blockDim.x * blockIdx.x + threadIdx.x;
//printf("Ola sou a td %d\n", tid);
for (int i = 0; i<8; i++)
atomicAdd(data+i,10);
}
int main(void)
{
int local[20][8] = { 0 };
int *noLocal[20];
for (int d = 0; d< 20;d++) {
cudaMalloc(&(noLocal[d]), 8*sizeof(int));
cudaMemcpy(noLocal[d], local[d], 8*sizeof(int), cudaMemcpyHostToDevice);
myadd<<<20, dim3(10, 20)>>>(noLocal[d]);
}
for (int d = 0; d< 20;d++)
cudaMemcpy(local[d], noLocal[d], 8*sizeof(int), cudaMemcpyDeviceToHost);
printf("Holaa %d\n", local[0][0]);
for (int d = 0; d < 20; d++)
cudaFree(noLocal[d]);
}
$ nvcc -o t1858 t1858.cu
$ cuda-memcheck ./t1858
========= CUDA-MEMCHECK
Holaa 40000
========= ERROR SUMMARY: 0 errors
$
The number 40000 is correct. It comes about because every thread is doing an atomic add of 10, and you have 20x200 threads that are doing that. 10x20x200 = 40000.
You should simply not be doing anything like that. You are wasting time and memory with these excessive allocations, and your kernel would be pretty slow as well. I am 100% certain this is not what you were asked to do, nor what you wanted to do.
Instead, you should:
Allocate a single large buffer on the device to fit the data you need (see the sketch after this list).
Avoid using pointers on the device side, except to that buffer, unless absolutely necessary.
If you somehow have to use a 2D pointer array - add relevant offsets to your buffer's base pointer to get different pointers into it.
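A minimal sketch of that single-buffer approach for the 20x8 case above (this is not the original code; the launch configuration is simplified to one block per row, so no atomics are needed):
#include <cstdio>

__global__ void myadd(int *data, int rowLen)
{
    // one block per row; the single thread of each block adds 10 to every element of its row
    int *row = data + blockIdx.x * rowLen;   // offset into the single buffer
    for (int i = 0; i < rowLen; i++)
        row[i] += 10;
}

int main()
{
    int local[20][8] = { 0 };
    int *d_buf;
    cudaMalloc(&d_buf, 20 * 8 * sizeof(int));                              // one allocation
    cudaMemcpy(d_buf, local, 20 * 8 * sizeof(int), cudaMemcpyHostToDevice);
    myadd<<<20, 1>>>(d_buf, 8);                                            // one launch
    cudaMemcpy(local, d_buf, 20 * 8 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", local[0][0]);                                           // prints 10
    cudaFree(d_buf);
    return 0;
}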
I have p.ntp test particles and every i-th particle has Cartesian coordinates tp.rh[i].x, tp.rh[i].y, tp.rh[i].z. Within this set I need to find CLUSTERS, i.e. I am looking for particles whose distance from the i-th particle is less than hill2 (tp.D_rel < hill2). The number of such members is stored in N_conv.
I use the cycle for (int i = 0; i < p.ntp; i++), which goes through the data set. For each i-th particle I calculate the squared distances tp.D_rel[idx] relative to the other members of the set. Then I use the first thread (idx == 0) to count the cases that satisfy my condition. At the end, if there is more than one positive case (N_conv > 1), I need to write out all particles forming a possible cluster together (triplets, ...).
My code works well only in cases where i < blockDim.x. Why? Is there a general way to find clusters in a set of data, but write out only triplets and larger?
Note: I know that some cases will be found twice.
__global__ void check_conv_system(double t, struct s_tp tp, struct s_mp mp, struct s_param p, double *time_step)
{
const uint bid = blockIdx.y * gridDim.x + blockIdx.x;
const uint tid = threadIdx.x;
const uint idx = bid * blockDim.x + tid;
double hill2 = 1.0e+6;
__shared__ double D[200];
__shared__ int ID1[200];
__shared__ int ID2[200];
if (idx >= p.ntp) return;
int N_conv;
for (int i = 0; i < p.ntp; i++)
{
tp.D_rel[idx] = (double)((tp.rh[i].x - tp.rh[idx].x)*(tp.rh[i].x - tp.rh[idx].x) +
(tp.rh[i].y - tp.rh[idx].y)*(tp.rh[i].y - tp.rh[idx].y) +
(tp.rh[i].z - tp.rh[idx].z)*(tp.rh[i].z - tp.rh[idx].z));
__syncthreads();
N_conv = 0;
if (idx == 0)
{
for (int n = 0; n < p.ntp; n++) {
if ((tp.D_rel[n] < hill2) && (i != n)) {
N_conv = N_conv + 1;
D[N_conv] = tp.D_rel[n];
ID1[N_conv] = i;
ID2[N_conv] = n;
}
}
if (N_conv > 0) {
for(int k = 1; k < N_conv; k++) {
printf("%lf %lf %d %d \n",t/365.2422, D[k], ID1[k], ID2[k]);
}
}
} //end idx == 0
} //end for cycle for i
}
As RobertCrovella mentioned, without an MCVE it is hard to tell.
However, the tp.D_rel array seems to be written to with the idx index, and read back after a __syncthreads() with full-range indexing n. Note that the call to __syncthreads() only performs synchronization within a block, not across the whole device. As a result, some threads/blocks will access data that has not been calculated yet, hence the failure.
You want to review your code so that values computed by one block do not depend on values computed by another.
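One way to avoid the cross-block dependency (a sketch only, with made-up names standing in for your s_tp fields) is to split the distance computation and the scan into two kernels; successive kernel launches on the same stream are ordered, so the second kernel sees all values written by the first:
#include <cstdio>

__global__ void compute_distances(const double3 *rh, double *D_rel, int ntp, int i)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= ntp) return;
    double dx = rh[i].x - rh[idx].x;
    double dy = rh[i].y - rh[idx].y;
    double dz = rh[i].z - rh[idx].z;
    D_rel[idx] = dx*dx + dy*dy + dz*dz;   // squared distance to particle i
}

__global__ void report_cluster(const double *D_rel, int ntp, int i, double hill2, double t)
{
    if (blockIdx.x != 0 || threadIdx.x != 0) return;   // single-thread scan, as in the original
    int N_conv = 0;
    for (int n = 0; n < ntp; n++)
        if (D_rel[n] < hill2 && n != i) N_conv++;
    if (N_conv > 1)
        for (int n = 0; n < ntp; n++)
            if (D_rel[n] < hill2 && n != i)
                printf("%lf %lf %d %d\n", t / 365.2422, D_rel[n], i, n);
}

// Host side: the loop over i moves to the host; each launch is ordered after the previous one.
// for (int i = 0; i < ntp; i++) {
//     compute_distances<<<(ntp + 255) / 256, 256>>>(d_rh, d_D_rel, ntp, i);
//     report_cluster<<<1, 1>>>(d_D_rel, ntp, i, hill2, t);
// }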
I tried the following simple program using cublasXt to multiply two matrices, but I get all-zero output. Can someone let me know why? My computer can use other CUDA libraries normally, and I have two GPUs. My machine is 64-bit, as required by cublasXt.
Btw, I've checked that none of the function calls in the program returns an error.
#include <stdio.h>
#include "cublasXt.h"
#include <curand.h>
void fill(double* &x, long m, long n, double val) {
x = new double[m * n];
for (long i = 0; i < m; ++i) {
for (long j = 0; j < n; ++j) {
x[i * n + j] = val;
}
}
}
int main() {
cublasXtHandle_t xt_;
cublasXtCreate(&xt_);
double *A, *B, *C;
long m = 10, n = 10, k = 20;
fill(A, m, k, 0.2);
fill(B, k, n, 0.3);
fill(C, m, n, 0.0);
double alpha = 1.0;
double beta = 0.0;
cublasXtDgemm(xt_, CUBLAS_OP_N, CUBLAS_OP_N,
m, n, k, &alpha, A, m, B, k, &beta, C, m
);
cudaDeviceSynchronize();
for (int i = 0; i < m; ++i) {
for (int j = 0; j < n; ++j) {
printf ("%lf ", C[i *n + j]);
}
printf ("\n");
}
cublasXtDestroy(xt_);
return 0;
}
The first issue with your code is that you have no call to cublasXtDeviceSelect. This is a necessary part of a cublasXt code, to tell the CUBLAS runtime how many devices to use and which devices to use.
As a simple proof point, try adding the following immediately after your handle creation call:
if(cublasXtCreate(&xt_) != CUBLAS_STATUS_SUCCESS) {printf("handle create fail\n"); return 1;}
int devices[1] = { 0 }; // add this line
if(cublasXtDeviceSelect(xt_, 1, devices) != CUBLAS_STATUS_SUCCESS) {printf("set devices fail\n"); return 1;} // add this line
This should cause your output to change from all zeros to all 1.2 (although only using 1 GPU).
However, you will probably want to read the section of the documentation I linked above (for example, if you want to use 2 GPUs and they are of the correct type). The cublasXt functionality included in the toolkit is, at this time, limited to 2 devices for multi-GPU usage (but note my comments below), and those 2 GPUs must be on a dual-GPU board, such as a Tesla K10 or GeForce GTX 690 (I think a Titan Z or Tesla K80 should also work, just to pick other examples).
Additional details of licensing are here. You can get an evaluation version of the "Premier" package that has fewer restrictions on GPUs.
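If you do have two qualifying GPUs, selecting both would look like this (a sketch only; the device IDs 0 and 1 are assumed):
int devices[2] = { 0, 1 }; // assumed device IDs; query them with cudaGetDeviceCount/cudaGetDeviceProperties
if(cublasXtDeviceSelect(xt_, 2, devices) != CUBLAS_STATUS_SUCCESS) {printf("set devices fail\n"); return 1;}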
I am trying to find the minimum distance between n points in CUDA. I wrote the code below. It works fine for numbers of points from 1 to 1024, i.e. 1 block, but if num_points is greater than 1024 I get a wrong value for the minimum distance. I am checking the GPU min value against the value I find on the CPU using a brute-force algorithm.
The min value is stored in temp1[0] at the end of the kernel function.
I don't know what is wrong in this. Please help me out.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>
#define MAX_POINTS 50000
__global__ void minimum_distance(float * X, float * Y, float * D, int n) {
__shared__ float temp[1024];
float temp1[1024];
int tid = threadIdx.x;
int bid = blockIdx.x;
int ref = tid+bid*blockDim.x;
temp[ref] = 1E+37F;
temp1[bid] = 1E+37F;
float dx,dy;
float Dij;
int i;
//each thread will take a point and find min dist to all
// points greater than its unique id(ref)
for (i = ref + 1; i < n; i++)
{
dx = X[ref]-X[i];
dy = Y[ref]-Y[i];
Dij = sqrtf(dx*dx+dy*dy);
if (temp[tid] > Dij)
{
temp[tid] = Dij;
}
}
__syncthreads();
//In each block the min value is stored in temp[0]
if(tid == 0)
{
if( bid == (n-1)/1024 ) {
int end = n - (bid) * 1024;
for (i = 1; i < end; i++ )
{
if (temp[i] < temp[tid])
temp[tid] = temp[i];
}
temp1[bid] = temp[tid];
}
else {
for (i = 1; i < 1024; i++ )
{
if (temp[i] < temp[tid])
temp[tid] = temp[i];
}
temp1[bid] = temp[tid];
}
}
__syncthreads();
//Here the min value is stored in temp1[0]
if (ref == 0)
{
for (i = 1; i <= (n-1)/1024; i++)
if( temp1[bid] > temp1[i])
temp1[bid] = temp1[i];
*D=temp1[bid];
}
}
//part of Main function
//kernel function invocation
// Invoking kernel of 1D grid and block sizes
// Vx and Vy are arrays of x-coordinates and y-coordinates respectively
int main(int argc, char* argv[]) {
.
.
blocks = (num_points-1)/1024 + 1;
minimum_distance<<<blocks,1024>>>(Vx,Vy,dmin_dist,num_points);
.
.
I'd say what's wrong is your choice of algorithm. You can certainly do better than O(n^2) - even if yours is pretty straightforward. Sure, on 5,000 points it might not seem terrible, but try 50,000 points and you'll feel the pain...
I'd think about parallelizing the construction of a Voronoi Diagram, or maybe some kind of BSP-like structure which might be easier to query with less code divergence.
I am looking for an optimal algorithm to find all the remaining possible permutations
of a given binary number.
For example:
The binary number is ........1. The algorithm should return the remaining 2^7 binary numbers, like 00000001, 00000011, etc.
Thanks,
sathish
The example given is not a permutation!
A permutation is a reordering of the input.
So if the input is 00000001, 00100000 and 00000010 are permutations, but 00000011 is not.
If this is only for small numbers (probably up to 16 bits), then just iterate over all of them and ignore the mismatches:
int fixed = 0x01; // this is the fixed part
int mask  = 0x01; // these are the bits of the fixed part which matter
for (int i = 0; i < 256; i++) {
    if ((i & mask) == fixed) {   // parentheses are needed: == binds tighter than &
        printf("%d\n", i);
    }
}
To find them all you aren't going to do better than looping over all the numbers, e.g. if you want to loop over all 8-bit numbers:
for (int i =0; i < (1<<8) ; ++i)
{
//do stuff with i
}
If you need to output in binary, then look at the string formatting options you have in whatever language you are using.
e.g.
printf("%b",i); //not standard in C/C++
For the calculation itself the base should be irrelevant in most languages.
I read your question as: "given some binary number with some bits always set, create the remaining possible binary numbers".
For example, given 1xx1: you want: 1001, 1011, 1101, 1111.
An O(N) algorithm is as follows.
Suppose the bits are defined in mask m. You also have a hash h.
To generate the numbers, for x from 0 to n-1, in pseudocode:
counter = 0
for x in 0..n-1:
    x' = x | ~m
    if h[x'] is not set:
        h[x'] = counter
        counter += 1
The idea in the code is to walk through all numbers from 0 to n-1, and set the pre-defined bits to 1. Then memoize the resulting number (iff not already memoized) by mapping the resulting number to the value of a running counter.
The keys of h will be the permutations. As a bonus, h[p] will contain a unique index number for the permutation p; you did not ask for it in your original question, but it can be useful.
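A rough C++ rendering of that pseudocode (the names are mine, not from the answer; here m marks the free bits, so ~m holds the bits forced to 1):
#include <cstdio>
#include <cstdint>
#include <unordered_map>

int main()
{
    const uint8_t m = 0xFE;                 // bit 0 is fixed to 1; bits 1..7 are free
    const int n = 256;                      // all 8-bit numbers
    std::unordered_map<uint8_t, int> h;
    int counter = 0;
    for (int x = 0; x < n; ++x) {
        uint8_t xp = static_cast<uint8_t>(x) | static_cast<uint8_t>(~m);  // force the fixed bits to 1
        if (h.find(xp) == h.end()) {
            h[xp] = counter;                // memoize: pattern -> unique index
            counter += 1;
        }
    }
    // the keys of h are the 2^7 = 128 valid patterns
    for (const auto &kv : h)
        printf("pattern 0x%02x -> index %d\n", (unsigned)kv.first, kv.second);
    return 0;
}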
Why are you making it complicated?
It is as simple as the following:
// permutation of i on a length k
// Example: decimal i=10 is rotated over length k=7
// [10] 0001010 -> [5] 0000101 -> [66] 1000010 and then 33, 80, 40, 20, 10
#include <stdio.h>
int main(void)
{
    int i = 10, j, k = 7; j = i;
    do {
        i = ((i & 1) << (k - 1)) | (i >> 1);  // rotate right by one bit within k bits
        printf("%d\n", i);
    } while (i != j);
    return 0;
}
There are many permutation generating algorithms you can use, such as this one:
#include <stdio.h>
void print(const int *v, const int size)
{
if (v != 0) {
for (int i = 0; i < size; i++) {
printf("%4d", v[i] );
}
printf("\n");
}
} // print
void visit(int *Value, int N, int k)
{
static int level = -1;
level = level+1; Value[k] = level;
if (level == N)
print(Value, N);
else
for (int i = 0; i < N; i++)
if (Value[i] == 0)
visit(Value, N, i);
level = level-1; Value[k] = 0;
}
int main()
{
const int N = 4;
int Value[N];
for (int i = 0; i < N; i++) {
Value[i] = 0;
}
visit(Value, N, 0);
}
source: http://www.bearcave.com/random_hacks/permute.html
Make sure you adapt the relevant constants to your needs (binary number, 7 bits, etc...)
If you are really looking for permutations then the following code should do.
To find all possible permutations of a given binary string (pattern), for example:
The permutations of 1000 are 1000, 0100, 0010, 0001:
#include <iostream>
#include <string>
using namespace std;

void permutation(int no_ones, int no_zeroes, string accum){
if(no_ones == 0){
for(int i=0;i<no_zeroes;i++){
accum += "0";
}
cout << accum << endl;
return;
}
else if(no_zeroes == 0){
for(int j=0;j<no_ones;j++){
accum += "1";
}
cout << accum << endl;
return;
}
permutation (no_ones - 1, no_zeroes, accum + "1");
permutation (no_ones , no_zeroes - 1, accum + "0");
}
int main(){
string append = "";
//finding permutation of 11000
permutation(2, 3, append); //the permutations are
//11000
//10100
//10010
//10001
//01100
//01010
cin.get();
}
If you intend to generate all the string combinations for n bits, then the problem can be solved using backtracking.
Here you go:
//Generating all string of n bits assuming A[0..n-1] is array of size n
public class Backtracking {
int[] A;
void Binary(int n){
if(n<1){
for(int i : A)
System.out.print(i);
System.out.println();
}else{
A[n-1] = 0;
Binary(n-1);
A[n-1] = 1;
Binary(n-1);
}
}
public static void main(String[] args) {
// n is number of bits
int n = 8;
Backtracking backtracking = new Backtracking();
backtracking.A = new int[n];
backtracking.Binary(n);
}
}