Share pointer between two classes in CUDA

I would like to create a vertex-and-edge (graph) structure with CUDA.
I have two classes.
class Connection {
public:
float value;
Connection()
{
this->value = 0;
}
// needed by the Connection(10) style calls below
Connection(float value)
{
this->value = value;
}
};
class Node
{
public:
Connection *incoming;
Connection *outgoing;
int lenIncoming;
int lenOutgoing;
Node(Connection *incoming, Connection *outgoing, int lenIncoming, int lenOutgoing)
{
this->incoming = incoming;
this->outgoing = outgoing;
this->lenIncoming = lenIncoming;
this->lenOutgoing = lenOutgoing;
}
};
When I "connect" the nodes, I do the following:
Connection XA = Connection(10);
Connection AB = Connection(2);
Connection XB = Connection(10);
Connection BX = Connection(2);
Connection* incomingA;
Connection* outgoingA;
Connection* incomingB;
Connection* outgoingB;
cudaMallocManaged(&incomingA, 1 * sizeof(Connection));
cudaMallocManaged(&outgoingA, 1 * sizeof(Connection));
cudaMallocManaged(&incomingB, 2 * sizeof(Connection));
cudaMallocManaged(&outgoingB, 1 * sizeof(Connection));
incomingA[0] = XA;
outgoingA[0] = AB;
incomingB[0] = XB;
incomingB[1] = AB;
outgoingB[0] = BX;
Node nodeA = Node(incomingA, outgoingA, 1, 1);
Node nodeB = Node(incomingB, outgoingB, 2, 1);
What I would like to happen: when I change nodeA.outgoing[0].value from within a method of Node, the change should also show up in nodeB.incoming[1].value. However, that is not the case.
When I change the value from within nodeA, nodeB still holds the starting value. I thought that, since I passed a copy of the pointer to the object, the original object would be updated. It seems I am mistaken, or I have made an error along the way.
Any suggestions on how this should be done will be greatly appreciated.
(BTW: the reason I use a class Connection instead of plain floats is that it will hold more data in the future.)
The classes are created on host.
Node has a method called run, which is running on the device.
__device__ __host__
void run()
{
for(int i=0; i<this->lenIncoming; i++)
{
this->incoming[i].value += 1;
}
for(int i=0; i< this->lenOutgoing; i++)
{
this->outgoing[i].value += 2;
}
}
Which in turn is called from a kernel
__global__
void kernel_run(Node *nodes)
{
nodes[0].run();
nodes[1].run();
}
The kernel is launched by running
kernel_run<<<1, 1>>>(nodes);
I can see that the value changes locally within nodeA, when debugging with Nsight.

As you have already mentioned, the problem is that the objects AB, XB, BX, etc. are being assigned by value rather than by reference, so copies are made of each object each time it is used (i.e. each time it is assigned to an incoming or outgoing connection), and the update to AB from one operation does not affect any other instance of AB.
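The difference between storing copies and storing pointers can be seen without any GPU involved. The following is a minimal host-only C++ sketch; the function names updateThroughCopies and updateThroughPointers are made up for this illustration, and std::vector stands in for the managed arrays used in the question.

```cpp
#include <cassert>
#include <vector>

struct Connection {
    float value;
};

// By-value storage: each array element is an independent copy of AB.
float updateThroughCopies() {
    Connection AB{2.0f};
    std::vector<Connection> outgoingA{AB};  // copy #1 of AB
    std::vector<Connection> incomingB{AB};  // copy #2 of AB
    outgoingA[0].value += 2.0f;             // modifies copy #1 only
    return incomingB[0].value;              // copy #2 is untouched
}

// By-pointer storage: both arrays refer to the same Connection object.
float updateThroughPointers() {
    Connection AB{2.0f};
    std::vector<Connection*> outgoingA{&AB};
    std::vector<Connection*> incomingB{&AB};
    assert(outgoingA[0] == incomingB[0]);   // same address, one object
    outgoingA[0]->value += 2.0f;            // modifies the shared object
    return incomingB[0]->value;             // the update is visible here
}
```

With the by-value version the second array never sees the update, which matches the behavior described in the question. The pointer version is the host-side analogue of storing Connection* entries in the incoming and outgoing arrays.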
One possible solution is to make all of your objects "singletons" and refer to them by reference. To make this work on both host and device we will allocate for these objects using cudaMallocManaged. Here's an example:
$ cat t1494.cu
#include <iostream>
class Connection {
public:
float value;
Connection()
{
this->value = 0;
}
Connection(float val)
{
this->value = val;
}
};
class Node
{
public:
Connection **incoming;
Connection **outgoing;
int lenIncoming;
int lenOutgoing;
Node(Connection **incoming, Connection **outgoing, int lenIncoming, int lenOutgoing)
{
this->incoming = incoming;
this->outgoing = outgoing;
this->lenIncoming = lenIncoming;
this->lenOutgoing = lenOutgoing;
}
__device__ __host__
void run()
{
for(int i=0; i<this->lenIncoming; i++)
{
this->incoming[i]->value += 1;
}
for(int i=0; i< this->lenOutgoing; i++)
{
this->outgoing[i]->value += 2;
}
}
};
__global__
void kernel_run(Node *nodes)
{
nodes[0].run();
nodes[1].run();
};
int main(){
Connection *XA;
cudaMallocManaged(&XA, sizeof(Connection));
*XA = Connection(10);
Connection *AB;
cudaMallocManaged(&AB, sizeof(Connection));
*AB = Connection(2);
Connection *XB;
cudaMallocManaged(&XB, sizeof(Connection));
*XB = Connection(10);
Connection *BX;
cudaMallocManaged(&BX, sizeof(Connection));
*BX = Connection(2);
Connection ** incomingA;
Connection ** outgoingA;
Connection ** incomingB;
Connection ** outgoingB;
cudaMallocManaged(&incomingA, 1 * sizeof(Connection*));
cudaMallocManaged(&outgoingA, 1 * sizeof(Connection*));
cudaMallocManaged(&incomingB, 2 * sizeof(Connection*));
cudaMallocManaged(&outgoingB, 1 * sizeof(Connection*));
incomingA[0] = XA;
outgoingA[0] = AB;
incomingB[0] = XB;
incomingB[1] = AB;
outgoingB[0]= BX;
Node *nodes;
cudaMallocManaged(&nodes, 2 * sizeof(Node));
nodes[0] = Node(incomingA, outgoingA, 1, 1);
nodes[1] = Node(incomingB, outgoingB, 2, 1);
std::cout << nodes[0].incoming[0]->value << std::endl;
std::cout << nodes[0].outgoing[0]->value << std::endl;
std::cout << nodes[1].incoming[0]->value << std::endl;
std::cout << nodes[1].incoming[1]->value << std::endl;
std::cout << nodes[1].outgoing[0]->value << std::endl;
kernel_run<<<1, 1>>>(nodes);
cudaDeviceSynchronize();
std::cout << nodes[0].incoming[0]->value << std::endl;
std::cout << nodes[0].outgoing[0]->value << std::endl;
std::cout << nodes[1].incoming[0]->value << std::endl;
std::cout << nodes[1].incoming[1]->value << std::endl;
std::cout << nodes[1].outgoing[0]->value << std::endl;
}
$ nvcc -o t1494 t1494.cu
$ cuda-memcheck ./t1494
========= CUDA-MEMCHECK
10
2
10
2
2
11
5
11
5
4
========= ERROR SUMMARY: 0 errors
$
Note that this system works fine for updating these objects from a single thread. It is not guaranteed to work correctly if you update an object from separate CUDA threads. CUDA does not automatically sort out that kind of multi-thread concurrent access for you. It may be possible to use atomics or some other method, however.
Note that my objective has been to address the original design presented and identify a relatively minor design modification that would meet the stated request. I'm not intending to make any statements about the relative performance merits of this approach, or the suitability of this or any other approach for graph traversal algorithms.

Related

Search Minimum/Maximum from n Arrays parallel in CUDA (Reduction Problem)

Is there a performant way in CUDA to find the maximum/minimum of each of several arrays (which live in different structures) in parallel? The structures follow the Structure of Arrays (SoA) layout.
A simple idea would be to assign each array to a thread block, which is used to calculate the maximum/minimum using the parallel reduction approach. The problem here is the size of the shared memory, which is why I regard this approach as critical.
Another approach is to calculate the minimum/maximum separately for each array. I think this is too slow.
struct Cube {
int* x;
int* y;
int* z;
int size;
};
int main() {
Cube* c1 = new Cube(); //c1 includes 100 Cubes (because of SOA)
c1-> x = new int[100];
c1-> y = new int[100];
c1 -> z = new int[100];
Cube* c2 = new Cube();
c2-> x = new int[1047];
c2-> y = new int[1047];
c2 -> z = new int[1047];
Cube* c3 = new Cube();
c3-> x = new int[5000];
c3-> y = new int[5000];
c3 -> z = new int[5000];
//My goal now is to find the smallest/largest x dimension of all cubes in c1, c2, ..., and cn,
//with one Kernel launch.
//So the smallest/largest x in c1, the smallest/largest x in c2 etc..
}
Does anyone know an efficient approach? Thanks.
You wrote: "A simple idea would be to assign each array to a thread block, which is used to calculate the maximum/minimum using the parallel reduction approach. The problem here is the size of the shared memory, which is why I regard this approach as critical."
There is no problem with shared memory size. You may wish to review Mark Harris' canonical parallel reduction tutorial and look at the later methods to understand how we can use a loop to populate shared memory, reducing values into shared memory as we go. Once the input loop is completed, then we begin the block-sweep phase of the reduction. This doesn't impose any special requirements on the shared memory per block.
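To make the two phases concrete, here is a host-side C++ emulation of what a single thread block does for one segment. It is only a sketch under assumptions: the name blockMinMax and its parameters are invented for this illustration, the "threads" run one after another in ordinary loops, and two std::vector scratch arrays stand in for the __shared__ arrays.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <utility>
#include <vector>

// Emulates one block reducing the segment [segStart, segStart + segSize).
// `threads` plays the role of blockDim.x and must be a power of two.
// Returns {min, max}; an empty segment yields {INT_MAX, INT_MIN}.
std::pair<int, int> blockMinMax(const std::vector<int>& in,
                                std::size_t segStart, std::size_t segSize,
                                std::size_t threads = 512) {
    assert(threads > 0 && (threads & (threads - 1)) == 0);
    assert(segStart + segSize <= in.size());
    std::vector<int> smin(threads, INT_MAX);  // stands in for __shared__ smin[]
    std::vector<int> smax(threads, INT_MIN);  // stands in for __shared__ smax[]

    // Phase 1: each "thread" strides over the segment and folds the values
    // it visits into its own scratch slot.
    for (std::size_t tid = 0; tid < threads; ++tid) {
        for (std::size_t idx = segStart + tid; idx < segStart + segSize; idx += threads) {
            smin[tid] = std::min(smin[tid], in[idx]);
            smax[tid] = std::max(smax[tid], in[idx]);
        }
    }

    // Phase 2: tree sweep over the fixed-size scratch arrays.
    for (std::size_t s = threads / 2; s > 0; s /= 2) {
        for (std::size_t tid = 0; tid < s; ++tid) {
            smin[tid] = std::min(smin[tid], smin[tid + s]);
            smax[tid] = std::max(smax[tid], smax[tid + s]);
        }
    }
    return {smin[0], smax[0]};
}
```

The scratch arrays always hold exactly `threads` elements, no matter how long the segment is. That is why the shared memory footprint per block does not grow with the array size.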
Here's a worked example demonstrating both a thrust::reduce_by_key method (single call) and a CUDA block-segmented method (single kernel call):
$ cat t1535.cu
#include <iostream>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/functional.h>
#include <thrust/scan.h>
#include <thrust/scatter.h>
#include <iterator>
#include <cstdlib>
// fully parenthesized so the macros expand safely inside larger expressions
#define IMAX(x,y) (((x)>(y))?(x):(y))
#define IMIN(x,y) (((x)<(y))?(x):(y))
typedef int dtype;
const int ncubes = 3;
struct Cube {
dtype* x;
dtype* y;
dtype* z;
int size;
};
struct my_f
{
template <typename T1, typename T2>
__host__ __device__
thrust::tuple<dtype,dtype> operator()(T1 t1, T2 t2){
thrust::tuple<dtype,dtype> r;
thrust::get<0>(r) = IMAX(thrust::get<0>(t1),thrust::get<0>(t2));
thrust::get<1>(r) = IMIN(thrust::get<1>(t1),thrust::get<1>(t2));
return r;
}
};
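// Starting values for the max and min reductions. MIN = -1 is only safe here
// because the test data comes from rand(), which never returns a negative value.
const int MIN = -1;
const int MAX = 0x7FFFFFFF;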
const int BS = 512;
template <typename T>
__global__ void block_segmented_minmax_reduce(const T * __restrict__ in, T * __restrict__ max, T * __restrict__ min, const size_t * __restrict__ slen){
__shared__ T smax[BS];
__shared__ T smin[BS];
size_t my_seg_start = slen[blockIdx.x];
size_t my_seg_size = slen[blockIdx.x+1] - my_seg_start;
smax[threadIdx.x] = MIN;
smin[threadIdx.x] = MAX;
// stride over this block's segment; the bound is the segment end, not its length
for (size_t idx = my_seg_start+threadIdx.x; idx < my_seg_start+my_seg_size; idx += BS){
T my_val = in[idx];
smax[threadIdx.x] = IMAX(my_val, smax[threadIdx.x]);
smin[threadIdx.x] = IMIN(my_val, smin[threadIdx.x]);}
for (int s = BS>>1; s > 0; s>>=1){
__syncthreads();
if (threadIdx.x < s){
smax[threadIdx.x] = IMAX(smax[threadIdx.x], smax[threadIdx.x+s]);
smin[threadIdx.x] = IMIN(smin[threadIdx.x], smin[threadIdx.x+s]);}
}
if (!threadIdx.x){
max[blockIdx.x] = smax[0];
min[blockIdx.x] = smin[0];}
}
int main() {
// data setup
Cube *c = new Cube[ncubes];
thrust::host_vector<size_t> csize(ncubes+1);
csize[0] = 100;
csize[1] = 1047;
csize[2] = 5000;
csize[3] = 0;
c[0].x = new dtype[csize[0]];
c[1].x = new dtype[csize[1]];
c[2].x = new dtype[csize[2]];
size_t ctot = 0;
for (int i = 0; i < ncubes; i++) ctot+=csize[i];
// method 1: thrust
// concatenate
thrust::host_vector<dtype> h_d(ctot);
size_t start = 0;
for (int i = 0; i < ncubes; i++) {thrust::copy_n(c[i].x, csize[i], h_d.begin()+start); start += csize[i];}
for (size_t i = 0; i < ctot; i++) h_d[i] = rand();
thrust::device_vector<dtype> d_d = h_d;
// build flag vector
thrust::device_vector<int> d_f(d_d.size());
thrust::host_vector<size_t> coff(csize.size());
thrust::exclusive_scan(csize.begin(), csize.end(), coff.begin());
thrust::device_vector<size_t> d_coff = coff;
thrust::scatter(thrust::constant_iterator<int>(1), thrust::constant_iterator<int>(1)+ncubes, d_coff.begin(), d_f.begin());
thrust::inclusive_scan(d_f.begin(), d_f.end(), d_f.begin());
// min/max reduction
thrust::device_vector<dtype> d_max(ncubes);
thrust::device_vector<dtype> d_min(ncubes);
thrust::reduce_by_key(d_f.begin(), d_f.end(), thrust::make_zip_iterator(thrust::make_tuple(d_d.begin(), d_d.begin())), thrust::make_discard_iterator(), thrust::make_zip_iterator(thrust::make_tuple(d_max.begin(), d_min.begin())), thrust::equal_to<int>(), my_f());
thrust::host_vector<dtype> h_max = d_max;
thrust::host_vector<dtype> h_min = d_min;
std::cout << "Thrust Maxima: " <<std::endl;
thrust::copy_n(h_max.begin(), ncubes, std::ostream_iterator<dtype>(std::cout, ","));
std::cout << std::endl << "Thrust Minima: " << std::endl;
thrust::copy_n(h_min.begin(), ncubes, std::ostream_iterator<dtype>(std::cout, ","));
std::cout << std::endl;
// method 2: CUDA kernel (block reduce)
block_segmented_minmax_reduce<<<ncubes, BS>>>(thrust::raw_pointer_cast(d_d.data()), thrust::raw_pointer_cast(d_max.data()), thrust::raw_pointer_cast(d_min.data()), thrust::raw_pointer_cast(d_coff.data()));
thrust::copy_n(d_max.begin(), ncubes, h_max.begin());
thrust::copy_n(d_min.begin(), ncubes, h_min.begin());
std::cout << "CUDA Maxima: " <<std::endl;
thrust::copy_n(h_max.begin(), ncubes, std::ostream_iterator<dtype>(std::cout, ","));
std::cout << std::endl << "CUDA Minima: " << std::endl;
thrust::copy_n(h_min.begin(), ncubes, std::ostream_iterator<dtype>(std::cout, ","));
std::cout << std::endl;
return 0;
}
$ nvcc -o t1535 t1535.cu
$ ./t1535
Thrust Maxima:
2145174067,2147469841,2146753918,
Thrust Minima:
35005211,2416949,100669,
CUDA Maxima:
2145174067,2147469841,2146753918,
CUDA Minima:
35005211,2416949,100669,
$
For a small number of Cube objects, the thrust method is likely to be faster. It will tend to make better use of medium to large GPUs than the block method will. For a large number of Cube objects, the block method should also be fairly efficient.

In cuda, is it possible to write dense array from sparse array with expected sequence?

There is an array1 holding 0 or 1 values (one such array per thread block):
bool array1[]: [1, 1, 0, 0, 1, 1]
Each thread in the thread block accesses array1 using threadIdx.x.
And I need to build a dense shared array2, in which each value is the ID of a thread whose array1 value is 1:
__shared__ int array2[] (thread ID) : [0, 1, 4, 5]
It seems that I need at least an atomicAdd() operation to index array2.
Even with atomicAdd(), I think it is hard to produce array2 in the order shown above (0, 1, 4, 5).
Is it possible to make array2 from array1 in cuda (for each thread block)?
You can use coalesced groups.
Suppose the Boolean read from array1 is threadIsIn:
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
uint32_t tid = threadIdx.x;
const uint32_t warpLength = 32;
uint32_t warpIdx = tid / warpLength;
if (threadIsIn){
auto active = cg::coalesced_threads();
uint32_t idx = active.thread_rank() + warpIdx * warpLength;
array2[idx] = tid;
}
Edit
A solution for multiple warps in a block:
The first warp of the block prepares the shared array for all the other warps in the block, so the other warps have to wait until the first warp finishes.
// blockSize, arr1, arr1Idx and threadIsIn are assumed to be defined by the surrounding kernel
cg::thread_block block = cg::this_thread_block();
uint32_t tid = threadIdx.x;
const uint32_t warpLength = 32;
uint32_t warpIdx = tid / warpLength;
uint32_t startIdx = 0;
uint32_t tidToWrite = tid;
uint32_t maxItr = blockSize / warpLength;
uint32_t itr = 0;
while (warpIdx == 0 && itr < maxItr){
auto warp = cg::coalesced_threads();
auto warpMask = warp.ballot(threadIsIn); // the tid'th bit is set to 1 if threadIsIn is true for tid
uint32_t trueThreadsSize = __popc(warpMask); // counts the number of bits that are set to 1
if(threadIsIn){
auto active = cg::coalesced_threads();
// active.size() has the same value as trueThreadsSize
array2[startIdx + active.thread_rank()] = tidToWrite;
}
startIdx += trueThreadsSize;
tidToWrite += warpLength;
++itr;
arr1Idx += warpLength;
threadIsIn = arr1[arr1Idx];
}
block.sync();
This is in a general category of problems called stream compaction. The canonical approach is to perform a prefix sum (scan operation) on a processed version of your data (converting the kept values to 1, the discarded values to 0), then use that prefix sum as the index to write to, in the output array.
CUB provides a convenient block-level scan operation, so we don't have to write our own. Thereafter, the indexed copy is trivial:
$ cat t1465.cu
#include <cub/cub.cuh>
#include <iostream>
#include <cstdlib>
const int nTPB = 1024;
const int ds = nTPB;
__global__ void BlockCompactKernel(bool *data, int *result, int *data_size)
{
// Specialize BlockScan for a 1D block of nTPB threads on type int
typedef cub::BlockScan<int, nTPB> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage temp_storage;
// Obtain a segment of consecutive items that are blocked across threads
int scan_data[1];
// load data
bool tmp = data[threadIdx.x];
// process data
scan_data[0] = (tmp)?1:0;
// scan data
// Collectively compute the block-wide exclusive prefix sum
BlockScan(temp_storage).ExclusiveSum(scan_data, scan_data);
// indexed copy
if (tmp) result[scan_data[0]] = threadIdx.x;
// optional: return result size
if (threadIdx.x == nTPB-1) *data_size = scan_data[0] + ((tmp)?1:0);
}
int main(){
bool *d_data, *data = new bool[ds];
int data_size, *d_data_size, *d_result, *result = new int[ds];
cudaMalloc(&d_data_size, sizeof(d_data_size[0]));
cudaMalloc(&d_result, ds*sizeof(d_result[0]));
for (int i = 0; i < ds; i++) data[i] = (rand() > (RAND_MAX/2))?true:false;
std::cout << "Original data:" << std::endl;
for (int i=0; i < ds; i++) std::cout << (int)data[i] << ",";
cudaMalloc(&d_data, ds*sizeof(d_data[0]));
cudaMemcpy(d_data, data, ds*sizeof(d_data[0]), cudaMemcpyHostToDevice);
BlockCompactKernel<<<1,nTPB>>>(d_data, d_result, d_data_size);
cudaMemcpy(&data_size, d_data_size, sizeof(d_data_size[0]), cudaMemcpyDeviceToHost);
cudaMemcpy(result, d_result, data_size*sizeof(d_result[0]), cudaMemcpyDeviceToHost);
std::cout << std::endl << "Compacted data:" << std::endl;
for (int i=0; i < data_size; i++) std::cout << result[i] << ",";
std::cout << std::endl;
}
$ nvcc -o t1465 t1465.cu
$ cuda-memcheck ./t1465
========= CUDA-MEMCHECK
Original data:
1,0,1,1,1,0,0,1,0,1,0,1,0,1,1,1,1,1,0,1,0,0,0,1,0,0,0,0,1,0,1,1,1,0,1,1,0,1,0,1,1,1,0,1,0,0,1,1,0,1,1,0,0,1,1,0,0,0,0,0,0,1,1,1,0,1,0,1,1,1,1,0,0,1,1,1,0,1,1,0,1,0,0,1,1,0,0,1,0,1,1,1,1,1,0,1,0,1,1,1,0,0,1,1,0,1,1,0,1,0,1,0,0,0,0,0,1,0,0,1,0,0,0,1,1,1,1,0,1,0,0,0,1,0,0,1,0,1,0,1,0,1,0,1,0,0,1,1,1,1,1,1,0,0,1,0,0,0,0,1,1,1,0,0,1,0,1,0,1,0,1,0,0,1,1,0,0,0,1,1,0,1,1,0,1,0,1,1,0,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,0,0,1,1,0,1,0,1,0,1,0,0,0,0,1,0,1,1,1,0,1,1,0,0,1,1,0,0,1,0,0,0,1,1,0,1,0,0,0,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,0,1,1,1,0,1,0,0,1,1,0,0,0,1,1,1,0,1,0,0,0,1,0,1,0,0,0,1,1,1,1,0,1,0,1,1,1,1,0,1,1,0,1,1,1,0,1,0,0,1,0,0,1,0,0,0,1,1,0,1,0,1,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,1,0,1,1,0,0,1,1,1,0,1,0,1,0,0,0,1,1,1,0,1,0,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,0,0,0,0,1,1,0,0,0,0,1,1,0,0,1,1,1,1,0,0,1,0,0,1,1,1,1,0,1,0,0,1,0,0,0,1,0,0,1,0,1,1,0,0,1,1,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,0,1,0,1,1,1,1,1,0,0,1,1,1,0,0,0,1,0,1,1,1,0,0,0,0,1,1,0,0,1,1,1,0,1,0,0,1,1,1,1,0,0,1,1,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,0,1,0,0,1,1,1,1,0,1,1,1,1,0,1,0,1,1,1,1,0,0,1,0,1,0,1,0,0,0,0,0,0,1,1,1,1,0,0,1,0,0,1,1,0,0,1,1,1,0,0,1,1,1,1,1,0,1,0,0,1,0,1,1,0,0,1,1,1,0,1,1,0,0,1,1,1,0,0,0,1,1,0,1,1,0,0,0,1,1,1,1,0,1,1,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1,0,0,1,0,1,1,0,0,1,0,1,1,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,1,0,0,0,0,0,0,1,1,0,0,0,1,1,0,1,0,1,0,1,1,0,1,1,0,0,1,1,1,1,1,0,1,0,1,0,1,0,0,0,0,0,1,0,1,1,0,1,0,0,1,0,1,0,1,1,1,1,1,0,0,1,1,0,1,0,0,1,0,0,1,1,0,0,1,0,0,1,0,1,0,1,1,1,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,1,0,0,0,1,1,0,1,1,0,0,1,1,0,1,1,1,1,0,1,1,0,0,0,1,0,1,0,0,1,1,0,1,1,0,1,0,0,1,0,1,0,0,0,0,1,0,1,1,0,1,1,0,1,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,1,1,1,1,0,0,1,0,1,0,0,1,1,1,1,0,1,1,1,0,1,0,0,0,1,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1,0,0,0,1,1,1,0,1,0,0,1,0,0,0,0,0,1,1,0,1,1,1,0,0,1,1,1,0,1,1,1,1,1,0,1,1,1,1,0,0,1,0,0,0,0,0,1,0,0,1,1,0,1,1,0,0,0,0,1,0,1,0,1,1,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,1,0,1,1,1,0,0,1,1,0,1,0,1,1,1,0,1,1,1,0,0,1,0,1,0,0,1,
0,1,1,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,0,1,1,1,1,0,
Compacted data:
0,2,3,4,7,9,11,13,14,15,16,17,19,23,28,30,31,32,34,35,37,39,40,41,43,46,47,49,50,53,54,61,62,63,65,67,68,69,70,73,74,75,77,78,80,83,84,87,89,90,91,92,93,95,97,98,99,102,103,105,106,108,110,116,119,123,124,125,126,128,132,135,137,139,141,143,146,147,148,149,150,151,154,159,160,161,164,166,168,170,173,174,178,179,181,182,184,186,187,189,190,191,192,193,195,196,197,198,199,200,201,202,203,204,207,208,210,212,214,219,221,222,223,225,226,229,230,233,237,238,240,244,246,249,250,251,254,255,256,258,260,261,262,264,267,268,272,273,274,276,280,282,286,287,288,289,291,293,294,295,296,298,299,301,302,303,305,308,311,315,316,318,320,321,329,330,331,332,333,337,338,343,349,350,352,353,356,357,358,360,362,366,367,368,370,374,375,378,379,382,383,386,391,392,397,398,401,402,403,404,407,410,411,412,413,415,418,422,425,427,428,431,432,433,437,439,440,441,448,450,455,457,458,459,460,461,462,464,466,467,468,469,470,473,474,475,479,481,482,483,488,489,492,493,494,496,499,500,501,502,505,506,507,508,509,511,512,513,515,516,517,518,519,520,521,522,524,525,526,527,528,529,531,534,535,536,537,539,540,541,542,544,546,547,548,549,552,554,556,563,564,565,566,569,572,573,576,577,578,581,582,583,584,585,587,590,592,593,596,597,598,600,601,604,605,606,610,611,613,614,618,619,620,621,623,624,629,630,631,632,633,637,638,639,642,644,645,648,650,651,652,653,658,662,667,668,670,677,678,682,683,685,687,689,690,692,693,696,697,698,699,700,702,704,706,712,714,715,717,720,722,724,725,726,727,728,731,732,734,737,740,741,744,747,749,751,752,753,755,756,757,761,762,763,764,765,766,767,775,776,777,782,786,787,789,790,793,794,796,797,798,799,801,802,806,808,811,812,814,815,817,820,822,827,829,830,832,833,835,836,839,847,851,852,853,854,855,858,860,863,864,865,866,868,869,870,872,876,878,879,880,881,882,883,884,885,886,887,888,890,891,895,896,897,899,902,908,909,911,912,913,916,917,918,920,921,922,923,924,926,927,928,929,932,938,941,942,944,945,950,952,954,955,961,964,968,973,975,976,977,980,981,983,985,986,987
,989,990,991,994,996,999,1001,1002,1004,1008,1011,1014,1019,1020,1021,1022,
========= ERROR SUMMARY: 0 errors
$
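The scan-then-scatter idea behind the CUB example can also be written as a short sequential C++ sketch. The function name compactIndices is made up for this illustration, and the two loops do serially what the block-wide scan and the indexed copy do in parallel in the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns, in ascending order, the indices whose flag is set.
std::vector<int> compactIndices(const std::vector<bool>& flags) {
    // Step 1: exclusive prefix sum of the 0/1 flags. offset[i] is the
    // number of set flags strictly before position i.
    std::vector<int> offset(flags.size());
    int running = 0;
    for (std::size_t i = 0; i < flags.size(); ++i) {
        offset[i] = running;
        running += flags[i] ? 1 : 0;
    }

    // Step 2: indexed copy. Every kept element writes its own index to the
    // output slot given by its prefix sum, so no two writes collide.
    std::vector<int> out(running);
    for (std::size_t i = 0; i < flags.size(); ++i) {
        if (flags[i]) {
            assert(offset[i] < running);
            out[offset[i]] = static_cast<int>(i);
        }
    }
    return out;
}
```

For the input from the question, [1, 1, 0, 0, 1, 1], the prefix sums are [0, 1, 2, 2, 2, 3] and the result is [0, 1, 4, 5]. Because each output slot is fixed by the scan, the output order is deterministic and no atomics are needed.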

Pass array of pointers to multiple devices to Cuda C Kernel

I have a one-dimensional array that I need to process, but it is too large for a single GPU. Therefore, I'm passing the array to multiple GPUs to store in memory, the number of which will change depending on the problem size. If I pass an array of pointers to the arrays in the different GPUs, I cannot access the other arrays from my Cuda C Kernel.
I've tried passing a simple array of device pointers to each device with a kernel call, but the code seems to break when I try to access the arrays. Even the device that is running the Kernel cannot access the array in its own memory.
Data structures:
typedef struct ComplexArray
{
double *real;
} ComplexArray;
typedef struct ComplexArrayArray
{
ComplexArray* Arr;
} ComplexArrayArray;
Malloc:
ComplexArrayArray stateVector;
stateVector.Arr = (ComplexArray*)malloc(sizeof(ComplexArray) * numberOfGPU);
for (int dev = 0; dev < numberOfGPU; dev++)
{
...
cudaMalloc(&(stateVector.Arr[dev].real), numberOfElements * sizeof(*(stateVector.Arr[dev].real)) / numberOfGPU);
...
}
Kernel:
__global__ void kernel(..., ComplexArrayArray stateVector, ...)
{
// Calculate necessary device
int device_number = ...;
int index = ...;
double val = stateVector.Arr[device_number].real[index];
...
}
When I try to access the arrays in this manner, the kernel seems to "break". There is no error message, but it's obvious that the data has not been read. Furthermore, none of the printf statements after the data access are reached.
Any idea on the best way to pass an array of pointers to device memory to a Cuda C Kernel?
Your attempt to use a struct with a pointer to an array of structs, each of which has an embedded pointer, makes for a very complex realization with cudaMalloc. It may be a bit simpler if you use cudaMallocManaged, but it is still unnecessarily complex. The complexities arise because cudaMalloc allocates space on a particular device, and that data is not (by default) accessible to any other device, and also because your embedded pointers create the need for various "deep copies". In the code you have shown, stateVector.Arr itself points to memory allocated with host malloc, so the kernel ends up dereferencing a host pointer, which ordinary device code cannot do. Here's a worked example:
$ cat t1492.cu
#include <iostream>
#include <stdio.h>
typedef struct ComplexArray
{
double *real;
} ComplexArray;
typedef struct ComplexArrayArray
{
ComplexArray* Arr;
} ComplexArrayArray;
__global__ void kernel(ComplexArrayArray stateVector, int dev, int ds)
{
// Calculate necessary device
int device_number = dev;
int index = blockIdx.x*blockDim.x+threadIdx.x;
if (index < ds){
double val = stateVector.Arr[device_number].real[index] + dev;
stateVector.Arr[device_number].real[index] = val;
}
}
const int nTPB = 256;
int main(){
int numberOfGPU;
cudaGetDeviceCount(&numberOfGPU);
std::cout << "GPU count: " << numberOfGPU << std::endl;
ComplexArrayArray *stateVector = new ComplexArrayArray[numberOfGPU];
const int ds = 32;
double *hdata = new double[ds]();
ComplexArray *ddata = new ComplexArray[numberOfGPU];
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
cudaMalloc(&(stateVector[i].Arr), sizeof(ComplexArray) * numberOfGPU);
cudaMalloc(&(ddata[i].real), (ds/numberOfGPU)*sizeof(double));
cudaMemcpy(ddata[i].real, hdata + i*(ds/numberOfGPU), (ds/numberOfGPU)*sizeof(double), cudaMemcpyHostToDevice);}
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
cudaMemcpy(stateVector[i].Arr, ddata, sizeof(ComplexArray)*numberOfGPU, cudaMemcpyHostToDevice);}
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
kernel<<<((ds/numberOfGPU)+nTPB-1)/nTPB,nTPB>>>(stateVector[i], i, (ds/numberOfGPU));}
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
cudaMemcpy(hdata + i*(ds/numberOfGPU), ddata[i].real, (ds/numberOfGPU)*sizeof(double), cudaMemcpyDeviceToHost);}
for (int i = 0; i < ds; i++)
std::cout << hdata[i] << " ";
std::cout << std::endl;
}
$ nvcc -o t1492 t1492.cu
$ cuda-memcheck ./t1492
========= CUDA-MEMCHECK
GPU count: 4
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
========= ERROR SUMMARY: 0 errors
$
However, if you want to take a host array and partition into one chunk per GPU, you don't need that level of complexity. Here is a simpler example:
$ cat t1493.cu
#include <iostream>
#include <stdio.h>
typedef struct ComplexArray
{
double *real;
} ComplexArray;
typedef struct ComplexArrayArray
{
ComplexArray* Arr;
} ComplexArrayArray;
__global__ void kernel(ComplexArray stateVector, int dev, int ds)
{
int index = blockIdx.x*blockDim.x+threadIdx.x;
if (index < ds){
double val = stateVector.real[index] + dev;
stateVector.real[index] = val;
}
}
const int nTPB = 256;
int main(){
int numberOfGPU;
cudaGetDeviceCount(&numberOfGPU);
std::cout << "GPU count: " << numberOfGPU << std::endl;
ComplexArray *stateVector = new ComplexArray[numberOfGPU];
const int ds = 32;
double *hdata = new double[ds]();
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
cudaMalloc(&(stateVector[i].real), (ds/numberOfGPU)*sizeof(double));
cudaMemcpy(stateVector[i].real, hdata + i*(ds/numberOfGPU), (ds/numberOfGPU)*sizeof(double), cudaMemcpyHostToDevice);}
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
kernel<<<((ds/numberOfGPU)+nTPB-1)/nTPB,nTPB>>>(stateVector[i], i, (ds/numberOfGPU));}
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
cudaMemcpy(hdata + i*(ds/numberOfGPU), stateVector[i].real, (ds/numberOfGPU)*sizeof(double), cudaMemcpyDeviceToHost);}
for (int i = 0; i < ds; i++)
std::cout << hdata[i] << " ";
std::cout << std::endl;
}
$ nvcc -o t1493 t1493.cu
$ cuda-memcheck ./t1493
========= CUDA-MEMCHECK
GPU count: 4
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
========= ERROR SUMMARY: 0 errors
$
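The examples here size every chunk as ds/numberOfGPU, which covers the whole array only when ds is a multiple of the GPU count (32 elements and 4 GPUs in the runs shown). If that is not guaranteed, the offset and length of each chunk can be computed so that the remainder is spread over the first few devices. This is a small host-side sketch; Chunk and chunkFor are hypothetical names introduced for this illustration, not part of CUDA.

```cpp
#include <cassert>
#include <cstddef>

struct Chunk {
    std::size_t offset;  // index of the chunk's first element in the host array
    std::size_t count;   // number of elements in the chunk
};

// Splits `total` elements across `parts` devices. The first (total % parts)
// chunks receive one extra element, so every element is covered exactly once.
Chunk chunkFor(std::size_t total, std::size_t parts, std::size_t dev) {
    assert(parts > 0 && dev < parts);
    const std::size_t base = total / parts;
    const std::size_t extra = total % parts;
    Chunk c;
    c.count = base + (dev < extra ? 1 : 0);
    c.offset = dev * base + (dev < extra ? dev : extra);
    return c;
}
```

In the loops above, the allocation for device i would then use chunkFor(ds, numberOfGPU, i).count elements and the copies would start at hdata + chunkFor(ds, numberOfGPU, i).offset, with the same count passed to the kernel as its size argument.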
Note that your question appears to make reference to the idea that you will break the data up into chunks, and each kernel will potentially have access to all the chunks. That will require either managed memory usage or knowledge that the system can support P2P access between the GPUs. That adds more complexity and is beyond the scope of what I have answered here, which is focused on your question about the kernel not being able to access "its own" data.
Since we should be able to upper-bound the number of GPUs that can participate (let's set it to a maximum of 8), we can avoid the deep copy of the first approach while still allowing all GPUs to have all pointers. Here is a modified example:
$ cat t1495.cu
#include <iostream>
#include <stdio.h>
const int maxGPU=8;
typedef struct ComplexArray
{
double *real[maxGPU];
} ComplexArray;
__global__ void kernel(ComplexArray stateVector, int dev, int ds)
{
int index = blockIdx.x*blockDim.x+threadIdx.x;
if (index < ds){
double val = stateVector.real[dev][index] + dev;
stateVector.real[dev][index] = val;
}
}
const int nTPB = 256;
int main(){
int numberOfGPU;
cudaGetDeviceCount(&numberOfGPU);
std::cout << "GPU count: " << numberOfGPU << std::endl;
ComplexArray stateVector;
const int ds = 32;
double *hdata = new double[ds]();
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
cudaMalloc(&(stateVector.real[i]), (ds/numberOfGPU)*sizeof(double));
cudaMemcpy(stateVector.real[i], hdata + i*(ds/numberOfGPU), (ds/numberOfGPU)*sizeof(double), cudaMemcpyHostToDevice);}
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
kernel<<<((ds/numberOfGPU)+nTPB-1)/nTPB,nTPB>>>(stateVector, i, (ds/numberOfGPU));}
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
cudaMemcpy(hdata + i*(ds/numberOfGPU), stateVector.real[i], (ds/numberOfGPU)*sizeof(double), cudaMemcpyDeviceToHost);}
for (int i = 0; i < ds; i++)
std::cout << hdata[i] << " ";
std::cout << std::endl;
}
$ nvcc -o t1495 t1495.cu
$ cuda-memcheck ./t1495
========= CUDA-MEMCHECK
GPU count: 4
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
========= ERROR SUMMARY: 0 errors
$

Thrust: synchronize: launch_closure_by_value: unknown error

I am experimenting with the Thrust example monte_carlo.cu from here:
https://github.com/thrust/thrust/blob/master/examples/monte_carlo.cu .
The problem appears in this piece of code:
float estimate = thrust::transform_reduce(thrust::counting_iterator<int>(0),
thrust::counting_iterator<int>(M),
estimate_pi(),
0.0f,
thrust::plus<float>());
When I increase the length of the input sequence for transform_reduce beyond M=87000, I get an error:
"synchronize: launch_closure_by_value: unknown error"
Just before the error, the screen goes black for several seconds, then a system tray message says "The video driver NVidia stopped responding and was successfully restored" (my back-translation). After that I reboot my computer because its behavior is unstable.
When I run under cuda-memcheck the situation changes: I get the same error already at M=30000, although without cuda-memcheck the .exe finishes successfully for this length.
Here are several lines from the cuda-memcheck output:
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to cudaThreadSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\Windows\system32\nvcuda.dll (cuProfilerStop + 0xc2d92) [0xe06b2]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\bin\cudart64_65.dll (cudaThreadSynchronize + 0xf5) [0x19585]
========= Host Frame:C:\test\Monte_carlo.exe (thrust::system::cuda::detail::synchronize + 0x47) [0x11117]
...
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to cudaFree.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\Windows\system32\nvcuda.dll (cuProfilerStop + 0xc2d92) [0xe06b2]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\bin\cudart64_65.dll (cudaFree + 0xfd) [0x1d28d]
========= Host Frame:C:\test\Monte_carlo.exe (thrust::system::cuda::detail::free > + 0x50) [0x5fa0]
Below is the full code of the program. I made only 2 changes to the original: a try-catch around transform_reduce and reading M from the console.
How can I find out the reason for this error?
#include <thrust/random.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>
#include <iostream>
#include <iomanip>
#include <cmath>
// we could vary M & N to find the perf sweet spot
__host__ __device__
unsigned int hash(unsigned int a)
{
a = (a+0x7ed55d16) + (a<<12);
a = (a^0xc761c23c) ^ (a>>19);
a = (a+0x165667b1) + (a<<5);
a = (a+0xd3a2646c) ^ (a<<9);
a = (a+0xfd7046c5) + (a<<3);
a = (a^0xb55a4f09) ^ (a>>16);
return a;
}
struct estimate_pi : public thrust::unary_function<unsigned int,float>
{
__host__ __device__
float operator()(unsigned int thread_id)
{
float sum = 0;
unsigned int N = 10000; // samples per thread
unsigned int seed = hash(thread_id);
// seed a random number generator
thrust::default_random_engine rng(seed);
// create a mapping from random numbers to [0,1)
thrust::uniform_real_distribution<float> u01(0,1);
// take N samples in a quarter circle
for(unsigned int i = 0; i < N; ++i)
{
// draw a sample from the unit square
float x = u01(rng);
float y = u01(rng);
// measure distance from the origin
float dist = sqrtf(x*x + y*y);
// add 1.0f if (x,y) is inside the quarter circle
if(dist <= 1.0f)
sum += 1.0f;
}
// multiply by 4 to get the area of the whole circle
sum *= 4.0f;
// divide by N
return sum / N;
}
};
int main(void)
{
// number of independent seeds (M), read from the console
int M;
std::cout << "M: ";
std::cin >> M;
try
{
float estimate = thrust::transform_reduce(thrust::counting_iterator<int>(0),
thrust::counting_iterator<int>(M),
estimate_pi(),
0.0f,
thrust::plus<float>());
estimate /= M;
std::cout << "M = " << std::setw(6) << M << " " << std::endl;
std::cout << std::setprecision(6);
std::cout << "pi is approximately " << estimate << std::endl;
}
catch (thrust::system_error &e)
{
// output an error message and exit
std::cerr << "Error: " << e.what() << std::endl;
exit(-1);
}
return 0;
}

Why does this CUDA code for calculating a Mandelbrot set fail when setting the maximum iteration count higher than 5,500,000?

I'm writing a code synthesizer which converts high-level models into CUDA C code. As a test model, I'm using a Mandelbrot generator application which computes the iteration count for each X-Y coordinate in parallel on a GPGPU. The image is 70x70 pixels, and the X-Y coordinates range from (-1, -1) to (1, 1). For simplicity, the application expects a large float array, where each group of 3 elements contains the X and Y coordinates, followed by the maximum iteration count. Each thread on the GPGPU receives a pointer to the beginning of its 3-element group and calculates the iteration count.
The synthesized CUDA code works perfectly when the maximum iteration count is less than 5,500,000, but when it goes higher than that, the output becomes completely bogus. To illustrate, see the examples below:
Normal output when max_it is set to 5,000,000:
output[0]: 3
output[1]: 3
output[2]: 3
output[3]: 3
output[4]: 3
output[5]: 3
output[6]: 3
output[7]: 3
output[8]: 3
output[9]: 4
output[10]: 4
output[11]: 4
output[12]: 4
output[13]: 4
output[14]: 4
output[15]: 5
output[16]: 5
output[17]: 5
output[18]: 5
output[19]: 5
output[20]: 6
output[21]: 7
output[22]: 9
output[23]: 11
output[24]: 19
output[25]: 5000000
output[26]: 5000000
output[27]: 5000000
...
output[4878]: 2
output[4879]: 2
output[4880]: 2
output[4881]: 2
output[4882]: 2
output[4883]: 2
output[4884]: 2
output[4885]: 2
output[4886]: 2
output[4887]: 2
output[4888]: 2
output[4889]: 2
output[4890]: 2
output[4891]: 2
output[4892]: 2
output[4893]: 2
output[4894]: 2
output[4895]: 2
output[4896]: 2
output[4897]: 2
output[4898]: 2
output[4899]: 2
Bogus output when max_it is set to 6,000,000:
output[0]: 0
output[1]: 0
output[2]: 0
output[3]: 0
output[4]: 0
output[5]: 0
output[6]: 0
output[7]: 0
output[8]: 0
output[9]: 0
output[10]: 0
output[11]: 0
output[12]: 0
output[13]: 0
output[14]: 0
output[15]: 0
output[16]: 0
output[17]: 0
output[18]: 0
output[19]: 0
output[20]: 0
output[21]: 0
output[22]: 0
output[23]: 0
output[24]: 0
output[25]: 0
output[26]: 0
output[27]: 0
...
output[4877]: 0
output[4878]: -1161699328
output[4879]: 32649
output[4880]: -1698402160
output[4881]: 32767
output[4882]: -1177507963
output[4883]: 32649
output[4884]: 6431616
output[4885]: 0
output[4886]: -1174325376
output[4887]: 32649
output[4888]: -1698402384
output[4889]: 32767
output[4890]: 4199904
output[4891]: 0
output[4892]: -1698402160
output[4893]: 32767
output[4894]: -1177511704
output[4895]: 32649
output[4896]: -1174325376
output[4897]: 32649
output[4898]: -1177559142
output[4899]: 32649
And here follows the code:
mandelbrot.cpp (main file)
#include "mandelbrot.h"
#include <iostream>
#include <cstdlib>
using namespace std;
int main(int argc, char** argv) {
const int kNumPixelsRow = 70;
const int kNumPixelsCol = 70;
if (argc != 6) {
cout << "Must provide 5 arguments: " << endl
<< " #1: Lower left corner X coordinate (x0)" << endl
<< " #2: Lower left corner Y coordinate (y0)" << endl
<< " #3: Upper right corner X coordinate (x1)" << endl
<< " #4: Upper right corner Y coordinate (y1)" << endl
<< " #5: Maximum number of iterations" << endl;
return 0;
}
float x0 = (float) atof(argv[1]);
if (x0 < -2.5) {
cout << "x0 is too small, must be larger than -2.5" << endl;
return 0;
}
float y0 = (float) atof(argv[2]);
if (y0 < -1) {
cout << "y0 is too small, must be larger than -1" << endl;
return 0;
}
float x1 = (float) atof(argv[3]);
if (x1 > 1) {
cout << "x1 is too large, must be smaller than 1" << endl;
return 0;
}
float y1 = (float) atof(argv[4]);
if (y1 > 1) {
cout << "y1 is too large, must be smaller than 1" << endl;
return 0;
}
int max_it = atoi(argv[5]);
if (max_it <= 0) {
cout << "max_it is too small, must be larger than 0" << endl;
return 0;
}
cout << "Generating input data..." << endl;
float input_array[kNumPixelsRow][kNumPixelsCol][3];
float delta_x = (x1 - x0) / kNumPixelsRow;
float delta_y = (y1 - y0) / kNumPixelsCol;
for (int x = 0; x < kNumPixelsCol; ++x) {
for (int y = 0; y < kNumPixelsRow; ++y) {
if (x == 0) {
input_array[x][y][0] = x0;
}
else {
input_array[x][y][0] = input_array[x - 1][y][0] + delta_x;
}
if (y == 0) {
input_array[x][y][1] = y0;
}
else {
input_array[x][y][1] = input_array[x][y - 1][1] + delta_y;
}
input_array[x][y][2] = (float) max_it;
}
}
cout << "Executing..." << endl;
struct ModelOutput output = executeModel((float*) input_array);
cout << "Done." << endl;
for (int i = 0; i < kNumPixelsRow * kNumPixelsCol; ++i) {
cout << "output[" << i << "]: " << output.value1[i] << endl;
}
return 0;
}
mandelbrot.h (header file)
////////////////////////////////////////////////////////////
// AUTO-GENERATED BY f2cc 0.1
////////////////////////////////////////////////////////////
/**
* C struct for retrieving the output values from the model.
* This is needed since C functions can only return a single
* value.
*/
struct ModelOutput {
/**
* Output from process "parallelmapSY_1".
*/
int value1[4900];
};
/**
* Executes the model.
*
* @param input1
* Input to process "parallelmapSY_1".
* Expects an array of size 14700.
* @returns A struct containing the model outputs.
*/
struct ModelOutput executeModel(const float* input1);
mandelbrot.cu (CUDA file)
////////////////////////////////////////////////////////////
// AUTO-GENERATED BY f2cc 0.1
////////////////////////////////////////////////////////////
#include "mandelbrot.h"
__device__
int parallelmapSY_1_func1(const float* args) {
float x0 = args[0];
float y0 = args[1];
int max_it = (int) args[2];
float x = 0;
float y = 0;
int i = 0;
while (x*x + y*y < (2*2) && i < max_it) {
float x_temp = x*x - y*y + x0;
y = 2*x*y + y0;
x = x_temp;
++i;
}
return i;
}
__global__
void parallelmapSY_1__kernel(const float* input, int* output) {
unsigned int index = (blockIdx.x * blockDim.x + threadIdx.x);
if (index < 4900) {
output[index] = parallelmapSY_1_func1(&input[index * 3]);
}
}
void parallelmapSY_1__kernel_wrapper(const float* input, int* output) {
float* device_input;
int* device_output;
struct cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int max_block_size = prop.maxThreadsPerBlock;
int num_blocks = (4900 + max_block_size - 1) / max_block_size;
cudaMalloc((void**) &device_input, 14700 * sizeof(float));
cudaMalloc((void**) &device_output, 4900 * sizeof(int));
cudaMemcpy((void*) device_input, (void*) input, 14700 * sizeof(float), cudaMemcpyHostToDevice);
dim3 grid(num_blocks, 1);
dim3 blocks(max_block_size, 1);
parallelmapSY_1__kernel<<<grid, blocks>>>(device_input, device_output);
cudaMemcpy((void*) output, (void*) device_output, 4900 * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree((void*) device_input);
cudaFree((void*) device_output);
}
struct ModelOutput executeModel(const float* input1) {
// Declare signal variables
// Signals part of DelaySY processes are also initiated with delay value
float model_input_to_parallelmapSY_1_in[14700];
int parallelmapSY_1_out_to_model_output[4900];
// Copy model inputs to signal variables
for (int i = 0; i < 14700; ++i) {
model_input_to_parallelmapSY_1_in[i] = input1[i];
}
// Execute processes
parallelmapSY_1__kernel_wrapper(model_input_to_parallelmapSY_1_in, parallelmapSY_1_out_to_model_output);
// Copy model output values to return container
struct ModelOutput outputs;
for (int i = 0; i < 4900; ++i) {
outputs.value1[i] = parallelmapSY_1_out_to_model_output[i];
}
return outputs;
}
The interesting file is mandelbrot.cu as that contains the computational code; mandelbrot.cpp is just a driver to get user input and generate input data, and mandelbrot.h is just a header file so that mandelbrot.cpp can easily use mandelbrot.cu.
The function executeModel() is a wrapper function which takes care of propagating data between the processes in the model. In this case there is only one process so executeModel() is rather pointless.
parallelmapSY_1__kernel_wrapper() prepares the parallel execution by allocating memory on the device, transfers the input data, invokes the kernel, and transfers the result back to the host.
parallelmapSY_1__kernel() is the kernel function, which simply calls parallelmapSY_1_func1() with the appropriate input data. It also prevents execution when too many threads have been spawned.
So the real area of interest is parallelmapSY_1_func1(). As I said, it works perfectly when the maximum iteration count is less than 5,500,000, but when I go higher it just doesn't seem to work as it's supposed to (see output log above). Some may ask "Why are you setting the iteration count so high? That's not necessary!". True, but since the pure C equivalent works perfectly with higher maximum iteration counts, why shouldn't the CUDA version? Since I'm designing a general tool, I need to know why it doesn't work in this example.
So does anyone have any idea why the code appears to fail when the maximum iteration count exceeds 5,500,000?
It may be a time-out problem: when the GPU also drives the display, the OS watchdog aborts a CUDA kernel that runs for too long. See e.g. the question "CUDA apps time out & fail after several seconds - how to work around this?"
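If the time-out is indeed the cause, the bogus values would be consistent with the host output array never being written: the wrapper does not check the return value of cudaMemcpy, so a failed copy goes unnoticed. Checking that return value (or calling cudaGetLastError after the launch) should confirm it. One possible workaround, sketched below under assumptions, is to make the iteration resumable so that each kernel launch performs only a bounded number of steps. The names MandelState, mandel_step, mandel_chunk_kernel and iterate_in_chunks are hypothetical and not part of the generated code; the host function only imitates the launch loop so the idea can be tried without a GPU.

```cpp
// Sketch only: all names here are hypothetical, not produced by f2cc.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Per-pixel state that survives between kernel launches.
struct MandelState {
    float x0, y0;  // pixel coordinate
    float x, y;    // current iterate
    int i;         // iterations performed so far
    int max_it;    // iteration limit
};

// Advance one pixel by at most `chunk` iterations.
// Returns true once the pixel is finished (escaped or reached max_it).
__host__ __device__
inline bool mandel_step(MandelState& s, int chunk) {
    int done = 0;
    while (s.x * s.x + s.y * s.y < 4.0f && s.i < s.max_it && done < chunk) {
        float x_temp = s.x * s.x - s.y * s.y + s.x0;
        s.y = 2 * s.x * s.y + s.y0;
        s.x = x_temp;
        ++s.i;
        ++done;
    }
    return !(s.x * s.x + s.y * s.y < 4.0f && s.i < s.max_it);
}

#ifdef __CUDACC__
// One short launch: every thread advances its pixel by at most `chunk` steps.
// `states` is assumed to live in device memory between launches.
__global__ void mandel_chunk_kernel(MandelState* states, int n, int chunk) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        mandel_step(states[index], chunk);
    }
}
#endif

// Host-side imitation of the launch loop for a single pixel:
// each call to mandel_step stands in for one short kernel launch.
inline int iterate_in_chunks(float x0, float y0, int max_it, int chunk) {
    if (chunk < 1) {
        chunk = 1;  // guard against a loop that never makes progress
    }
    MandelState s = {x0, y0, 0.0f, 0.0f, 0, max_it};
    while (!mandel_step(s, chunk)) {
        // on the device, the next launch would start here
    }
    return s.i;
}
```

On the device, launching mandel_chunk_kernel ceil(max_it / chunk) times is enough, since finished pixels do no further work. The chunk size would have to be tuned so that a single launch stays well below the watchdog limit of the system in question.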