Pass an array of pointers to multiple devices to a CUDA C kernel
I have a one-dimensional array that I need to process, but it is too large to fit on a single GPU. I am therefore distributing the array across multiple GPUs, the number of which changes with the problem size. However, if I pass an array of pointers to the arrays on the different GPUs, I cannot access those arrays from my CUDA C kernel.
I've tried passing a simple array of device pointers (one per device) with a kernel call, but the code seems to break when I try to access the arrays. Even the device that is running the kernel cannot access the array in its own memory.
Data structures:
typedef struct ComplexArray
{
double *real;
} ComplexArray;
typedef struct ComplexArrayArray
{
ComplexArray* Arr;
} ComplexArrayArray;
Malloc:
ComplexArrayArray stateVector;
stateVector.Arr = (ComplexArray*)malloc(sizeof(ComplexArray) * numberOfGPU);
for (int dev = 0; dev < numberOfGPU; dev++)
{
...
cudaMalloc(&(stateVector.Arr[dev].real), numberOfElements * sizeof(*(stateVector.Arr[dev].real)) / numberOfGPU);
...
}
Kernel:
__global__ void kernel(..., ComplexArrayArray stateVector, ...)
{
// Calculate necessary device
int device_number = ...;
int index = ...;
double val = stateVector.Arr[device_number].real[index];
...
}
When I try to access the arrays in this manner, the kernel seems to "break". There is no error message, but it's obvious that the data has not been read, and I never reach any printf statements placed after the data access.
Any idea of the best way to pass an array of pointers to device memory to a CUDA C kernel?
Your attempt to use a struct with a pointer to an array of structs, each of which has an embedded pointer, will make for a very complex realization with cudaMalloc. It may be a bit simpler if you use cudaMallocManaged, but it is still unnecessarily complex. The complexities arise because cudaMalloc allocates space on a particular device, that data is not (by default) accessible to any other device, and your embedded pointers create the need for various "deep copies". Here's a worked example:
$ cat t1492.cu
#include <iostream>
#include <stdio.h>
typedef struct ComplexArray
{
double *real;
} ComplexArray;
typedef struct ComplexArrayArray
{
ComplexArray* Arr;
} ComplexArrayArray;
__global__ void kernel(ComplexArrayArray stateVector, int dev, int ds)
{
// Calculate necessary device
int device_number = dev;
int index = blockIdx.x*blockDim.x+threadIdx.x;
if (index < ds){
double val = stateVector.Arr[device_number].real[index] + dev;
stateVector.Arr[device_number].real[index] = val;
}
}
const int nTPB = 256;
int main(){
int numberOfGPU;
cudaGetDeviceCount(&numberOfGPU);
std::cout << "GPU count: " << numberOfGPU << std::endl;
ComplexArrayArray *stateVector = new ComplexArrayArray[numberOfGPU];
const int ds = 32;
double *hdata = new double[ds]();
ComplexArray *ddata = new ComplexArray[numberOfGPU];
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
cudaMalloc(&(stateVector[i].Arr), sizeof(ComplexArray) * numberOfGPU);
cudaMalloc(&(ddata[i].real), (ds/numberOfGPU)*sizeof(double));
cudaMemcpy(ddata[i].real, hdata + i*(ds/numberOfGPU), (ds/numberOfGPU)*sizeof(double), cudaMemcpyHostToDevice);}
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
cudaMemcpy(stateVector[i].Arr, ddata, sizeof(ComplexArray)*numberOfGPU, cudaMemcpyHostToDevice);}
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
kernel<<<((ds/numberOfGPU)+nTPB-1)/nTPB,nTPB>>>(stateVector[i], i, (ds/numberOfGPU));}
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
cudaMemcpy(hdata + i*(ds/numberOfGPU), ddata[i].real, (ds/numberOfGPU)*sizeof(double), cudaMemcpyDeviceToHost);}
for (int i = 0; i < ds; i++)
std::cout << hdata[i] << " ";
std::cout << std::endl;
}
$ nvcc -o t1492 t1492.cu
$ cuda-memcheck ./t1492
========= CUDA-MEMCHECK
GPU count: 4
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
========= ERROR SUMMARY: 0 errors
$
However, if you want to take a host array and partition it into one chunk per GPU, you don't need that level of complexity. Here is a simpler example:
$ cat t1493.cu
#include <iostream>
#include <stdio.h>
typedef struct ComplexArray
{
double *real;
} ComplexArray;
typedef struct ComplexArrayArray
{
ComplexArray* Arr;
} ComplexArrayArray;
__global__ void kernel(ComplexArray stateVector, int dev, int ds)
{
int index = blockIdx.x*blockDim.x+threadIdx.x;
if (index < ds){
double val = stateVector.real[index] + dev;
stateVector.real[index] = val;
}
}
const int nTPB = 256;
int main(){
int numberOfGPU;
cudaGetDeviceCount(&numberOfGPU);
std::cout << "GPU count: " << numberOfGPU << std::endl;
ComplexArray *stateVector = new ComplexArray[numberOfGPU];
const int ds = 32;
double *hdata = new double[ds]();
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
cudaMalloc(&(stateVector[i].real), (ds/numberOfGPU)*sizeof(double));
cudaMemcpy(stateVector[i].real, hdata + i*(ds/numberOfGPU), (ds/numberOfGPU)*sizeof(double), cudaMemcpyHostToDevice);}
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
kernel<<<((ds/numberOfGPU)+nTPB-1)/nTPB,nTPB>>>(stateVector[i], i, (ds/numberOfGPU));}
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
cudaMemcpy(hdata + i*(ds/numberOfGPU), stateVector[i].real, (ds/numberOfGPU)*sizeof(double), cudaMemcpyDeviceToHost);}
for (int i = 0; i < ds; i++)
std::cout << hdata[i] << " ";
std::cout << std::endl;
}
$ nvcc -o t1493 t1493.cu
$ cuda-memcheck ./t1493
========= CUDA-MEMCHECK
GPU count: 4
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
========= ERROR SUMMARY: 0 errors
$
Note that your question appears to refer to the idea that you will break the data up into chunks, and that each kernel will potentially have access to all the chunks. That requires either managed memory or a system that supports P2P access between the GPUs. It adds more complexity and is beyond the scope of what I have answered here, which is focused on your question about the kernel not being able to access "its own" data.
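For orientation only, here is a minimal, untested sketch of what the managed-memory route could look like. It is an illustration rather than a worked example like the ones above: it assumes numberOfGPU <= 8 and a system where managed memory is accessible from every GPU (e.g. concurrentManagedAccess on Linux with Pascal or newer devices). With managed allocations, any device and the host can dereference any of the chunk pointers.
// sketch only: managed-memory variant, not one of the tested examples above
#include <iostream>
const int maxGPU = 8;
typedef struct ComplexArray
{
    double *real[maxGPU];   // one chunk pointer per GPU, as in t1495.cu
} ComplexArray;
__global__ void kernel(ComplexArray stateVector, int dev, int chunk)
{
    int index = blockIdx.x*blockDim.x+threadIdx.x;
    if (index < chunk)
        // with managed memory, any chunk is addressable here, not just real[dev]
        stateVector.real[dev][index] += dev;
}
int main(){
    int numberOfGPU;
    cudaGetDeviceCount(&numberOfGPU);
    const int ds = 32;
    const int chunk = ds/numberOfGPU;   // assumes numberOfGPU <= maxGPU and divides ds
    ComplexArray stateVector;
    for (int i = 0; i < numberOfGPU; i++){
        // managed allocations are visible to the host and to every device
        cudaMallocManaged(&(stateVector.real[i]), chunk*sizeof(double));
        for (int j = 0; j < chunk; j++) stateVector.real[i][j] = 0.0;
    }
    for (int i = 0; i < numberOfGPU; i++){
        cudaSetDevice(i);
        kernel<<<(chunk+255)/256,256>>>(stateVector, i, chunk);
    }
    for (int i = 0; i < numberOfGPU; i++){
        cudaSetDevice(i);
        cudaDeviceSynchronize();   // make managed data coherent before host access
    }
    for (int i = 0; i < numberOfGPU; i++)
        for (int j = 0; j < chunk; j++)
            std::cout << stateVector.real[i][j] << " ";
    std::cout << std::endl;
}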
Since we should be able to put an upper bound on the number of GPUs that can participate (let's set the maximum to 8), we can avoid the deep copy of the first approach while still allowing all GPUs to have all the pointers. Here is a modified example:
$ cat t1495.cu
#include <iostream>
#include <stdio.h>
const int maxGPU=8;
typedef struct ComplexArray
{
double *real[maxGPU];
} ComplexArray;
__global__ void kernel(ComplexArray stateVector, int dev, int ds)
{
int index = blockIdx.x*blockDim.x+threadIdx.x;
if (index < ds){
double val = stateVector.real[dev][index] + dev;
stateVector.real[dev][index] = val;
}
}
const int nTPB = 256;
int main(){
int numberOfGPU;
cudaGetDeviceCount(&numberOfGPU);
std::cout << "GPU count: " << numberOfGPU << std::endl;
ComplexArray stateVector;
const int ds = 32;
double *hdata = new double[ds]();
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
cudaMalloc(&(stateVector.real[i]), (ds/numberOfGPU)*sizeof(double));
cudaMemcpy(stateVector.real[i], hdata + i*(ds/numberOfGPU), (ds/numberOfGPU)*sizeof(double), cudaMemcpyHostToDevice);}
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
kernel<<<((ds/numberOfGPU)+nTPB-1)/nTPB,nTPB>>>(stateVector, i, (ds/numberOfGPU));}
for (int i = 0; i < numberOfGPU; i++){
cudaSetDevice(i);
cudaMemcpy(hdata + i*(ds/numberOfGPU), stateVector.real[i], (ds/numberOfGPU)*sizeof(double), cudaMemcpyDeviceToHost);}
for (int i = 0; i < ds; i++)
std::cout << hdata[i] << " ";
std::cout << std::endl;
}
$ nvcc -o t1495 t1495.cu
$ cuda-memcheck ./t1495
========= CUDA-MEMCHECK
GPU count: 4
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
========= ERROR SUMMARY: 0 errors
$
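One further note: for brevity, none of the examples above check CUDA API return codes (cuda-memcheck is doing that job in the transcripts). In real code you would normally wrap each runtime call and each kernel launch; a typical helper, shown here only as a sketch, looks like this:
#include <cstdio>
#include <cstdlib>
// minimal error-checking helper; any CUDA runtime call can be wrapped with it
#define cudaCheck(call) do { \
    cudaError_t err_ = (call); \
    if (err_ != cudaSuccess) { \
        fprintf(stderr, "CUDA error %s at %s:%d\n", cudaGetErrorString(err_), __FILE__, __LINE__); \
        exit(1); \
    } } while (0)
// usage:
//   cudaCheck(cudaMalloc(&(stateVector.real[i]), (ds/numberOfGPU)*sizeof(double)));
//   kernel<<<grid,block>>>(...);
//   cudaCheck(cudaGetLastError());        // catches launch configuration errors
//   cudaCheck(cudaDeviceSynchronize());   // catches asynchronous execution errors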
Related
Search Minimum/Maximum from n Arrays parallel in CUDA (Reduction Problem)
Is there a performant way in CUDA to find the maximum/minimum over multiple arrays (which live in different structures) in parallel? The structures are laid out in Structure of Arrays (SoA) format. A simple idea would be to assign each array to a thread block, which computes the maximum/minimum using the parallel reduction approach. The problem here is the size of the shared memory, which is why I regard this approach as critical. Another approach is to calculate every minimum/maximum separately for each array, but I think that is too slow.
struct Cube {
    int* x;
    int* y;
    int* z;
    int size;
};
int main() {
    Cube* c1 = new Cube(); // c1 includes 100 Cubes (because of SoA)
    c1->x = new int[100];
    c1->y = new int[100];
    c1->z = new int[100];
    Cube* c2 = new Cube();
    c2->x = new int[1047];
    c2->y = new int[1047];
    c2->z = new int[1047];
    Cube* c3 = new Cube();
    c3->x = new int[5000];
    c3->y = new int[5000];
    c3->z = new int[5000];
    // My goal now is to find the smallest/largest x dimension of all cubes in c1, c2, ..., and cn,
    // with one kernel launch.
    // So the smallest/largest x in c1, the smallest/largest x in c2, etc.
}
Does anyone know an efficient approach? Thanks.
A simple idea would be to assign each array to a thread block, which is used to calculate the maximum/minimum using the parallel reduction approach. The problem here is the size of the shared memory, which is why I regard this approach as critical. There is no problem with shared memory size. You may wish to review Mark Harris' canonical parallel reduction tutorial and look at the later methods to understand how we can use a loop to populate shared memory, reducing values into shared memory as we go. Once the input loop is completed, then we begin the block-sweep phase of the reduction. This doesn't impose any special requirements on the shared memory per block. Here's a worked example demonstrating both a thrust::reduce_by_key method (single call) and a CUDA block-segmented method (single kernel call): $ cat t1535.cu #include <iostream> #include <thrust/reduce.h> #include <thrust/copy.h> #include <thrust/device_vector.h> #include <thrust/host_vector.h> #include <thrust/iterator/constant_iterator.h> #include <thrust/iterator/discard_iterator.h> #include <thrust/iterator/zip_iterator.h> #include <thrust/functional.h> #include <cstdlib> #define IMAX(x,y) (x>y)?x:y #define IMIN(x,y) (x<y)?x:y typedef int dtype; const int ncubes = 3; struct Cube { dtype* x; dtype* y; dtype* z; int size; }; struct my_f { template <typename T1, typename T2> __host__ __device__ thrust::tuple<dtype,dtype> operator()(T1 t1, T2 t2){ thrust::tuple<dtype,dtype> r; thrust::get<0>(r) = IMAX(thrust::get<0>(t1),thrust::get<0>(t2)); thrust::get<1>(r) = IMIN(thrust::get<1>(t1),thrust::get<1>(t2)); return r; } }; const int MIN = -1; const int MAX = 0x7FFFFFFF; const int BS = 512; template <typename T> __global__ void block_segmented_minmax_reduce(const T * __restrict__ in, T * __restrict__ max, T * __restrict__ min, const size_t * __restrict__ slen){ __shared__ T smax[BS]; __shared__ T smin[BS]; size_t my_seg_start = slen[blockIdx.x]; size_t my_seg_size = slen[blockIdx.x+1] - my_seg_start; smax[threadIdx.x] = MIN; smin[threadIdx.x] = MAX; for (size_t idx = my_seg_start+threadIdx.x; idx < my_seg_size; idx += BS){ T my_val = in[idx]; smax[threadIdx.x] = IMAX(my_val, smax[threadIdx.x]); smin[threadIdx.x] = IMIN(my_val, smin[threadIdx.x]);} for (int s = BS>>1; s > 0; s>>=1){ __syncthreads(); if (threadIdx.x < s){ smax[threadIdx.x] = IMAX(smax[threadIdx.x], smax[threadIdx.x+s]); smin[threadIdx.x] = IMIN(smin[threadIdx.x], smin[threadIdx.x+s]);} } if (!threadIdx.x){ max[blockIdx.x] = smax[0]; min[blockIdx.x] = smin[0];} } int main() { // data setup Cube *c = new Cube[ncubes]; thrust::host_vector<size_t> csize(ncubes+1); csize[0] = 100; csize[1] = 1047; csize[2] = 5000; csize[3] = 0; c[0].x = new dtype[csize[0]]; c[1].x = new dtype[csize[1]]; c[2].x = new dtype[csize[2]]; size_t ctot = 0; for (int i = 0; i < ncubes; i++) ctot+=csize[i]; // method 1: thrust // concatenate thrust::host_vector<dtype> h_d(ctot); size_t start = 0; for (int i = 0; i < ncubes; i++) {thrust::copy_n(c[i].x, csize[i], h_d.begin()+start); start += csize[i];} for (size_t i = 0; i < ctot; i++) h_d[i] = rand(); thrust::device_vector<dtype> d_d = h_d; // build flag vector thrust::device_vector<int> d_f(d_d.size()); thrust::host_vector<size_t> coff(csize.size()); thrust::exclusive_scan(csize.begin(), csize.end(), coff.begin()); thrust::device_vector<size_t> d_coff = coff; thrust::scatter(thrust::constant_iterator<int>(1), thrust::constant_iterator<int>(1)+ncubes, d_coff.begin(), d_f.begin()); thrust::inclusive_scan(d_f.begin(), d_f.end(), d_f.begin()); // min/max 
reduction thrust::device_vector<dtype> d_max(ncubes); thrust::device_vector<dtype> d_min(ncubes); thrust::reduce_by_key(d_f.begin(), d_f.end(), thrust::make_zip_iterator(thrust::make_tuple(d_d.begin(), d_d.begin())), thrust::make_discard_iterator(), thrust::make_zip_iterator(thrust::make_tuple(d_max.begin(), d_min.begin())), thrust::equal_to<int>(), my_f()); thrust::host_vector<dtype> h_max = d_max; thrust::host_vector<dtype> h_min = d_min; std::cout << "Thrust Maxima: " <<std::endl; thrust::copy_n(h_max.begin(), ncubes, std::ostream_iterator<dtype>(std::cout, ",")); std::cout << std::endl << "Thrust Minima: " << std::endl; thrust::copy_n(h_min.begin(), ncubes, std::ostream_iterator<dtype>(std::cout, ",")); std::cout << std::endl; // method 2: CUDA kernel (block reduce) block_segmented_minmax_reduce<<<ncubes, BS>>>(thrust::raw_pointer_cast(d_d.data()), thrust::raw_pointer_cast(d_max.data()), thrust::raw_pointer_cast(d_min.data()), thrust::raw_pointer_cast(d_coff.data())); thrust::copy_n(d_max.begin(), ncubes, h_max.begin()); thrust::copy_n(d_min.begin(), ncubes, h_min.begin()); std::cout << "CUDA Maxima: " <<std::endl; thrust::copy_n(h_max.begin(), ncubes, std::ostream_iterator<dtype>(std::cout, ",")); std::cout << std::endl << "CUDA Minima: " << std::endl; thrust::copy_n(h_min.begin(), ncubes, std::ostream_iterator<dtype>(std::cout, ",")); std::cout << std::endl; return 0; } $ nvcc -o t1535 t1535.cu $ ./t1535 Thrust Maxima: 2145174067,2147469841,2146753918, Thrust Minima: 35005211,2416949,100669, CUDA Maxima: 2145174067,2147469841,2146753918, CUDA Minima: 35005211,2416949,100669, $ For a small number of Cube objects, the thrust method is likely to be faster. It will tend to make better use of medium to large GPUs than the block method will. For a large number of Cube objects, the block method should also be fairly efficient.
In cuda, is it possible to write dense array from sparse array with expected sequence?
There is an array1 that holds 0 or 1 for each thread in a thread block:
bool array1[]: [1, 1, 0, 0, 1, 1]
Each thread in the thread block accesses array1 using threadIdx.x. I need to build a shared, dense array2 in which each value is the ID of a thread whose array1 entry is '1':
__shared__ bool array2[] (thread IDs): [0, 1, 4, 5]
It seems that, at minimum, I need an atomicAdd() operation to index array2, and even with atomicAdd() I think it is hard to produce array2 in the ordered sequence shown above (0, 1, 4, 5). Is it possible to build array2 from array1 in CUDA (per thread block)?
You can use coalesced groups. Suppose the Boolean value that was read is threadIsIn:
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

uint32_t tid = threadIdx.x;
const uint32_t warpLength = 32;
uint32_t warpIdx = tid / warpLength;
if (threadIsIn){
    auto active = cg::coalesced_threads();
    uint32_t idx = active.thread_rank() + warpIdx * warpLength;
    array2[idx] = tid;
}
Edit: a solution with multiple warps in a block. The first warp of the block prepares the shared array for the rest of the warps in the block, which means the other warps have to wait for the first warp to finish.
cg::thread_block block = cg::this_thread_block();
uint32_t tid = threadIdx.x;
const uint32_t warpLength = 32;
uint32_t warpIdx = tid / warpLength;
uint32_t startIdx = 0;
uint32_t tidToWrite = tid;
uint32_t maxItr = blockSize / warpLength;
uint32_t itr = 0;
while (warpIdx == 0 && itr < maxItr){
    auto warp = cg::coalesced_threads();
    auto warpMask = warp.ballot(threadIsIn);      // the tid'th bit is set to 1 if threadIsIn is true for tid
    uint32_t trueThreadsSize = __popc(warpMask);  // counts the number of bits that are set to 1
    if (threadIsIn){
        auto active = cg::coalesced_threads();
        // active.size() has the same value as trueThreadsSize
        array2[startIdx + active.thread_rank()] = tidToWrite;
    }
    startIdx += trueThreadsSize;
    tidToWrite += warpLength;
    ++itr;
    arr1Idx += warpLength;
    threadIsIn = arr1[arr1Idx];
}
block.sync();
This is in a general category of problems called stream compaction. The canonical approach is to perform a prefix sum (scan operation) on a processed version of your data (converting the kept values to 1, the discarded values to 0), then use that prefix sum as the index to write to, in the output array. CUB provides a convenient block-level scan operation, so we don't have to write our own. Thereafter, the indexed copy is trivial: $ cat t1465.cu #include <cub/cub.cuh> #include <iostream> #include <cstdlib> const int nTPB = 1024; const int ds = nTPB; __global__ void BlockCompactKernel(bool *data, int *result, int *data_size) { // Specialize BlockScan for a 1D block of nTPB threads on type int typedef cub::BlockScan<int, nTPB> BlockScan; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Obtain a segment of consecutive items that are blocked across threads int scan_data[1]; // load data bool tmp = data[threadIdx.x]; // process data scan_data[0] = (tmp)?1:0; // scan data // Collectively compute the block-wide exclusive prefix sum BlockScan(temp_storage).ExclusiveSum(scan_data, scan_data); // indexed copy if (tmp) result[scan_data[0]] = threadIdx.x; // optional: return result size if (threadIdx.x == nTPB-1) *data_size = scan_data[0] + ((tmp)?1:0); } int main(){ bool *d_data, *data = new bool[ds]; int data_size, *d_data_size, *d_result, *result = new int[ds]; cudaMalloc(&d_data_size, sizeof(d_data_size[0])); cudaMalloc(&d_result, ds*sizeof(d_result[0])); for (int i = 0; i < ds; i++) data[i] = (rand() > (RAND_MAX/2))?true:false; std::cout << "Original data:" << std::endl; for (int i=0; i < ds; i++) std::cout << (int)data[i] << ","; cudaMalloc(&d_data, ds*sizeof(d_data[0])); cudaMemcpy(d_data, data, ds*sizeof(d_data[0]), cudaMemcpyHostToDevice); BlockCompactKernel<<<1,nTPB>>>(d_data, d_result, d_data_size); cudaMemcpy(&data_size, d_data_size, sizeof(d_data_size[0]), cudaMemcpyDeviceToHost); cudaMemcpy(result, d_result, data_size*sizeof(d_result[0]), cudaMemcpyDeviceToHost); std::cout << std::endl << "Compacted data:" << std::endl; for (int i=0; i < data_size; i++) std::cout << result[i] << ","; std::cout << std::endl; } $ nvcc -o t1465 t1465.cu $ cuda-memcheck ./t1465 ========= CUDA-MEMCHECK Original data: 
1,0,1,1,1,0,0,1,0,1,0,1,0,1,1,1,1,1,0,1,0,0,0,1,0,0,0,0,1,0,1,1,1,0,1,1,0,1,0,1,1,1,0,1,0,0,1,1,0,1,1,0,0,1,1,0,0,0,0,0,0,1,1,1,0,1,0,1,1,1,1,0,0,1,1,1,0,1,1,0,1,0,0,1,1,0,0,1,0,1,1,1,1,1,0,1,0,1,1,1,0,0,1,1,0,1,1,0,1,0,1,0,0,0,0,0,1,0,0,1,0,0,0,1,1,1,1,0,1,0,0,0,1,0,0,1,0,1,0,1,0,1,0,1,0,0,1,1,1,1,1,1,0,0,1,0,0,0,0,1,1,1,0,0,1,0,1,0,1,0,1,0,0,1,1,0,0,0,1,1,0,1,1,0,1,0,1,1,0,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,0,0,1,1,0,1,0,1,0,1,0,0,0,0,1,0,1,1,1,0,1,1,0,0,1,1,0,0,1,0,0,0,1,1,0,1,0,0,0,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,0,1,1,1,0,1,0,0,1,1,0,0,0,1,1,1,0,1,0,0,0,1,0,1,0,0,0,1,1,1,1,0,1,0,1,1,1,1,0,1,1,0,1,1,1,0,1,0,0,1,0,0,1,0,0,0,1,1,0,1,0,1,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,1,0,1,1,0,0,1,1,1,0,1,0,1,0,0,0,1,1,1,0,1,0,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,0,0,0,0,1,1,0,0,0,0,1,1,0,0,1,1,1,1,0,0,1,0,0,1,1,1,1,0,1,0,0,1,0,0,0,1,0,0,1,0,1,1,0,0,1,1,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,0,1,0,1,1,1,1,1,0,0,1,1,1,0,0,0,1,0,1,1,1,0,0,0,0,1,1,0,0,1,1,1,0,1,0,0,1,1,1,1,0,0,1,1,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,0,1,0,0,1,1,1,1,0,1,1,1,1,0,1,0,1,1,1,1,0,0,1,0,1,0,1,0,0,0,0,0,0,1,1,1,1,0,0,1,0,0,1,1,0,0,1,1,1,0,0,1,1,1,1,1,0,1,0,0,1,0,1,1,0,0,1,1,1,0,1,1,0,0,1,1,1,0,0,0,1,1,0,1,1,0,0,0,1,1,1,1,0,1,1,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1,0,0,1,0,1,1,0,0,1,0,1,1,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,1,0,0,0,0,0,0,1,1,0,0,0,1,1,0,1,0,1,0,1,1,0,1,1,0,0,1,1,1,1,1,0,1,0,1,0,1,0,0,0,0,0,1,0,1,1,0,1,0,0,1,0,1,0,1,1,1,1,1,0,0,1,1,0,1,0,0,1,0,0,1,1,0,0,1,0,0,1,0,1,0,1,1,1,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,1,0,0,0,1,1,0,1,1,0,0,1,1,0,1,1,1,1,0,1,1,0,0,0,1,0,1,0,0,1,1,0,1,1,0,1,0,0,1,0,1,0,0,0,0,1,0,1,1,0,1,1,0,1,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,1,1,1,1,0,0,1,0,1,0,0,1,1,1,1,0,1,1,1,0,1,0,0,0,1,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1,0,0,0,1,1,1,0,1,0,0,1,0,0,0,0,0,1,1,0,1,1,1,0,0,1,1,1,0,1,1,1,1,1,0,1,1,1,1,0,0,1,0,0,0,0,0,1,0,0,1,1,0,1,1,0,0,0,0,1,0,1,0,1,1,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,1,0,1,1,1,0,0,1,1,0,1,0,1,1,1,0,1,1,1,0,0,1,0,1,0,0,1,0,1,1,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,0,1,1,1,1,0, Compacted data: 
0,2,3,4,7,9,11,13,14,15,16,17,19,23,28,30,31,32,34,35,37,39,40,41,43,46,47,49,50,53,54,61,62,63,65,67,68,69,70,73,74,75,77,78,80,83,84,87,89,90,91,92,93,95,97,98,99,102,103,105,106,108,110,116,119,123,124,125,126,128,132,135,137,139,141,143,146,147,148,149,150,151,154,159,160,161,164,166,168,170,173,174,178,179,181,182,184,186,187,189,190,191,192,193,195,196,197,198,199,200,201,202,203,204,207,208,210,212,214,219,221,222,223,225,226,229,230,233,237,238,240,244,246,249,250,251,254,255,256,258,260,261,262,264,267,268,272,273,274,276,280,282,286,287,288,289,291,293,294,295,296,298,299,301,302,303,305,308,311,315,316,318,320,321,329,330,331,332,333,337,338,343,349,350,352,353,356,357,358,360,362,366,367,368,370,374,375,378,379,382,383,386,391,392,397,398,401,402,403,404,407,410,411,412,413,415,418,422,425,427,428,431,432,433,437,439,440,441,448,450,455,457,458,459,460,461,462,464,466,467,468,469,470,473,474,475,479,481,482,483,488,489,492,493,494,496,499,500,501,502,505,506,507,508,509,511,512,513,515,516,517,518,519,520,521,522,524,525,526,527,528,529,531,534,535,536,537,539,540,541,542,544,546,547,548,549,552,554,556,563,564,565,566,569,572,573,576,577,578,581,582,583,584,585,587,590,592,593,596,597,598,600,601,604,605,606,610,611,613,614,618,619,620,621,623,624,629,630,631,632,633,637,638,639,642,644,645,648,650,651,652,653,658,662,667,668,670,677,678,682,683,685,687,689,690,692,693,696,697,698,699,700,702,704,706,712,714,715,717,720,722,724,725,726,727,728,731,732,734,737,740,741,744,747,749,751,752,753,755,756,757,761,762,763,764,765,766,767,775,776,777,782,786,787,789,790,793,794,796,797,798,799,801,802,806,808,811,812,814,815,817,820,822,827,829,830,832,833,835,836,839,847,851,852,853,854,855,858,860,863,864,865,866,868,869,870,872,876,878,879,880,881,882,883,884,885,886,887,888,890,891,895,896,897,899,902,908,909,911,912,913,916,917,918,920,921,922,923,924,926,927,928,929,932,938,941,942,944,945,950,952,954,955,961,964,968,973,975,976,977,980,981,983,985,986,987,989,990,991,994,996,999,1001,1002,1004,1008,1011,1014,1019,1020,1021,1022, ========= ERROR SUMMARY: 0 errors $
Is Concurrent cudaMemcpyAsync possible?
I'm writing some test code to get familiar with the concurrent attributes of cudaMemcpyAsync. When I was trying to do concurrent cudaMemcpyAsync in a single context, the copy operations are queuing up and get executed one by one with throughput 12.4 GB/s, which is consistent with the answer here: But when I tried to do concurrent cudaMemcpyAsync in different contexts (by separating them into 4 processes), it seems that the first and the last one are running concurrently: The first 2 sequential cudaMemcpyAsync are running with a throughput 12.4 GB/s while the last 2 concurrent ones are running with a throughput 5.3 GB/s. How can I do concurrent cudaMemcpyAsync within single context? I'm using CUDA9.0 on TITAN Xp, which has 2 copy engines. EDIT: Code for scenario 1: #include <stdio.h> #include <pthread.h> #include <stdlib.h> #include <assert.h> #include <time.h> inline cudaError_t checkCuda(cudaError_t result) { if (result != cudaSuccess) { fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result)); assert(result == cudaSuccess); } return result; } const int nStreams = 8; const int N = 100000000; const int bytes = N * sizeof(int); int* arr_H; int* arr_D[nStreams]; cudaStream_t stream[nStreams]; int args[nStreams]; pthread_t threads[nStreams]; void* worker(void *arg) { int i = *((int *)arg); checkCuda(cudaMemcpyAsync(arr_D[i], arr_H, bytes, cudaMemcpyHostToDevice, stream[i])); return NULL; } int main() { for(int i = 0; i < nStreams; i++) checkCuda(cudaStreamCreate(&stream[i])); checkCuda(cudaMallocHost((void**)&arr_H, bytes)); for (int i = 0; i < N; i++) arr_H[i] = random(); for (int i = 0; i < nStreams; i++) checkCuda(cudaMalloc((void**)&arr_D[i], bytes)); for (int i = 0; i < nStreams; i++) { args[i] = i; pthread_create(&threads[i], NULL, worker, &args[i]); } for (int i = 0; i < nStreams; i++) pthread_join(threads[i], NULL); cudaFreeHost(arr_H); for (int i = 0; i < nStreams; i++) { checkCuda(cudaStreamDestroy(stream[i])); cudaFree(arr_D[i]); } return 0; Code for scenario 2: #include <stdio.h> #include <stdlib.h> #include <assert.h> #include <time.h> inline cudaError_t checkCuda(cudaError_t result) { if (result != cudaSuccess) { fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result)); assert(result == cudaSuccess); } return result; } int main() { const int nStreams = 1; const int N = 100000000; const int bytes = N * sizeof(int); int* arr_H; int* arr_D[nStreams]; cudaStream_t stream[nStreams]; for(int i = 0; i < nStreams; i++) checkCuda(cudaStreamCreate(&stream[i])); checkCuda(cudaMallocHost((void**)&arr_H, bytes)); for (int i = 0; i < N; i++) arr_H[i] = random(); for (int i = 0; i < nStreams; i++) checkCuda(cudaMalloc((void**)&arr_D[i], bytes)); for (int i = 0; i < nStreams; i++) checkCuda(cudaMemcpyAsync(arr_D[i], arr_H, bytes, cudaMemcpyHostToDevice, stream[i])); cudaFreeHost(arr_H); for (int i = 0; i < nStreams; i++) { checkCuda(cudaStreamDestroy(stream[i])); cudaFree(arr_D[i]); } return 0; } Code 2 is basically copied from Code 1. I used a python script to run multiple processes concurrently: #!/usr/bin/env python3 import subprocess N = 4 processes = [subprocess.Popen('./a.out', shell=True) for _ in range(N)] for process in processes: process.wait()
cuda addvectors memory intuitive explanation
I have the following code and #include <iostream> #include <cuda.h> #include <cuda_runtime.h> #include <ctime> #include <vector> #include <numeric> float random_float(void) { return static_cast<float>(rand()) / RAND_MAX; } std::vector<float> add(float alpha, std::vector<float>& v1, std::vector<float>& v2 ) { /*Do quick size check on vectors before proceeding*/ std::vector<float> result(v1.size()); for (unsigned int i = 0; i < result.size(); ++i) { result[i]=alpha*v1[i]+v2[i]; } return result; } __global__ void Addloop( int N, float alpha, float* x, float* y ) { int i; int i0 = blockIdx.x*blockDim.x + threadIdx.x; for( i = i0; i < N; i += blockDim.x*gridDim.x ) y[i] = alpha*x[i] + y[i]; /* if ( i0 < N ) y[i0] = alpha*x[i0] + y[i0]; */ } int main( int argc, char** argv ) { float alpha = 0.3; // create array of 256k elements int num_elements = 10;//1<<18; // generate random input on the host std::vector<float> h1_input(num_elements); std::vector<float> h2_input(num_elements); for(int i = 0; i < num_elements; ++i) { h1_input[i] = random_float(); h2_input[i] = random_float(); } for (std::vector<float>::iterator it = h1_input.begin() ; it != h1_input.end(); ++it) std::cout << ' ' << *it; std::cout << '\n'; for (std::vector<float>::iterator it = h2_input.begin() ; it != h2_input.end(); ++it) std::cout << ' ' << *it; std::cout << '\n'; std::vector<float> host_result;//(std::vector<float> h1_input, std::vector<float> h2_input ); host_result = add( alpha, h1_input, h2_input ); for (std::vector<float>::iterator it = host_result.begin() ; it != host_result.end(); ++it) std::cout << ' ' << *it; std::cout << '\n'; // move input to device memory float *d1_input = 0; cudaMalloc((void**)&d1_input, sizeof(float) * num_elements); cudaMemcpy(d1_input, &h1_input[0], sizeof(float) * num_elements, cudaMemcpyHostToDevice); float *d2_input = 0; cudaMalloc((void**)&d2_input, sizeof(float) * num_elements); cudaMemcpy(d2_input, &h2_input[0], sizeof(float) * num_elements, cudaMemcpyHostToDevice); Addloop<<<1,3>>>( num_elements, alpha, d1_input, d2_input ); // copy the result back to the host std::vector<float> device_result(num_elements); cudaMemcpy(&device_result[0], d2_input, sizeof(float) * num_elements, cudaMemcpyDeviceToHost); for (std::vector<float>::iterator it = device_result.begin() ; it != device_result.end(); ++it) std::cout << ' ' << *it; std::cout << '\n'; cudaFree(d1_input); cudaFree(d2_input); h1_input.clear(); h2_input.clear(); device_result.clear(); std::cout << "DONE! \n"; getchar(); return 0; } I am trying to understand the gpu memory access. The kernel, for reasons of simplicity, is launched as Addloop<<<1,3>>>. I am trying to understand how this code is working by imagining the for loops working on the gpu as instances. More specifically, I imagine the following instances but they do not help. Instance 1: for( i = 0; i < N; i += 3*1 ) // ( i += 0*1 --> i += 3*1 after Eric's comment) y[i] = alpha*x[i] + y[i]; Instance 2: for( i = 1; i < N; i += 3*1 ) y[i] = alpha*x[i] + y[i]; Instance 3: for( i = 3; i < N; i += 3*1 ) y[i] = alpha*x[i] + y[i]; Looking inside of every loop it does not make any sense in the logic of adding two vectors. Can some one help? The reason I am adopting this logic of instances is because it is working well in the case of the code inside the kernel which is in comments. If these thoughts are correct what would be the instances in case we have multiple blocks inside the grid? In other words what would be the i values and the update rates (+=updaterate) in some examples? 
PS: The kernel code is borrowed from here. UPDATE: After Eric's answer I think the execution for N = 15, i.e. the number of elements, goes like this (correct me if I am wrong): for instance 1 above, i = 0, 3, 6, 9, 12, which computes the corresponding y[i] values; for instance 2 above, i = 1, 4, 7, 10, 13, which computes the corresponding remaining y[i] values; and for instance 3 above, i = 2, 5, 8, 11, 14, which computes the rest of the y[i] values.
Your blockDim.x is 3 and gridDim.x is 1 according to your launch configuration <<<1,3>>>, so in each thread (what you call an instance) the update should be i += 3*1. With the for loop you can compute 15 elements using only 3 threads. Generally, you can use a limited number of threads to do an "infinite" amount of work, and more work per thread can improve performance by reducing launch overhead and hiding instruction stalls. Another advantage is that you can use a fixed number of threads/blocks for work of various sizes, which requires less tuning. With multiple blocks in the grid, the per-thread stride simply becomes blockDim.x*gridDim.x, as in the sketch below.
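As a concrete sketch (this restates the kernel from the question with the stride spelled out; the <<<2,3>>> launch and N = 15 below are just an illustrative choice):
__global__ void Addloop( int N, float alpha, float* x, float* y )
{
    // global thread id, 0 .. blockDim.x*gridDim.x - 1
    int i0 = blockIdx.x*blockDim.x + threadIdx.x;
    // each thread strides by the total number of threads in the grid
    for (int i = i0; i < N; i += blockDim.x*gridDim.x)
        y[i] = alpha*x[i] + y[i];
}
// Launched as Addloop<<<2,3>>>(15, alpha, d_x, d_y), there are 6 threads with
// starting indices i0 = 0..5 and a stride of 6, so thread 0 handles i = 0, 6, 12,
// thread 1 handles i = 1, 7, 13, ..., and thread 5 handles i = 5, 11 -- every
// element 0..14 is covered exactly once.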
Why does this CUDA code for calculating a Mandelbrot set fail when setting the maximum iteration count higher than 5,500,000?
I'm writing a code synthesizer which converts high-level models into CUDA C code. As test model, I'm using a Mandelbrot generator application which executes the iteration count for each X-Y coordinate in parallel on a GPGPU. The image is 70x70 pixels, and the X-Y coordinates range from (-1, -1) to (1, 1). For simplicity, the application expects a large float array, where each group of 3 elements contains the X and Y coordinates, followed by the maximum iteration count. Each thread on the GPGPU receives a pointer to the beginning of each 3-group set and calculates the iteration count. The synthesized CUDA code works perfectly when maximum iteration counts is less than 5,500,000, but when it goes higher than that then the output becomes completely bogus. To illustrate, see the examples below: Normal output when max_it is set to 5,000,000: output[0]: 3 output[1]: 3 output[2]: 3 output[3]: 3 output[4]: 3 output[5]: 3 output[6]: 3 output[7]: 3 output[8]: 3 output[9]: 4 output[10]: 4 output[11]: 4 output[12]: 4 output[13]: 4 output[14]: 4 output[15]: 5 output[16]: 5 output[17]: 5 output[18]: 5 output[19]: 5 output[20]: 6 output[21]: 7 output[22]: 9 output[23]: 11 output[24]: 19 output[25]: 5000000 output[26]: 5000000 output[27]: 5000000 ... output[4878]: 2 output[4879]: 2 output[4880]: 2 output[4881]: 2 output[4882]: 2 output[4883]: 2 output[4884]: 2 output[4885]: 2 output[4886]: 2 output[4887]: 2 output[4888]: 2 output[4889]: 2 output[4890]: 2 output[4891]: 2 output[4892]: 2 output[4893]: 2 output[4894]: 2 output[4895]: 2 output[4896]: 2 output[4897]: 2 output[4898]: 2 output[4899]: 2 Bogus output when max_it is set to 6,000,000: output[0]: 0 output[1]: 0 output[2]: 0 output[3]: 0 output[4]: 0 output[5]: 0 output[6]: 0 output[7]: 0 output[8]: 0 output[9]: 0 output[10]: 0 output[11]: 0 output[12]: 0 output[13]: 0 output[14]: 0 output[15]: 0 output[16]: 0 output[17]: 0 output[18]: 0 output[19]: 0 output[20]: 0 output[21]: 0 output[22]: 0 output[23]: 0 output[24]: 0 output[25]: 0 output[26]: 0 output[27]: 0 ... 
output[4877]: 0 output[4878]: -1161699328 output[4879]: 32649 output[4880]: -1698402160 output[4881]: 32767 output[4882]: -1177507963 output[4883]: 32649 output[4884]: 6431616 output[4885]: 0 output[4886]: -1174325376 output[4887]: 32649 output[4888]: -1698402384 output[4889]: 32767 output[4890]: 4199904 output[4891]: 0 output[4892]: -1698402160 output[4893]: 32767 output[4894]: -1177511704 output[4895]: 32649 output[4896]: -1174325376 output[4897]: 32649 output[4898]: -1177559142 output[4899]: 32649 And here follows the code: mandelbrot.cpp (main file) #include "mandelbrot.h" #include <iostream> #include <cstdlib> using namespace std; int main(int argc, char** argv) { const int kNumPixelsRow = 70; const int kNumPixelsCol = 70; if (argc != 6) { cout << "Must provide 5 arguments: " << endl << " #1: Lower left corner X coordinate (x0)" << endl << " #2: Lower left corner Y coordinate (y0)" << endl << " #3: Upper right corner X coordinate (x1)" << endl << " #4: Upper right corner Y coordinate (y1)" << endl << " #5: Maximum number of iterations" << endl; return 0; } float x0 = (float) atof(argv[1]); if (x0 < -2.5) { cout << "x0 is too small, must be larger than -2.5" << endl; return 0; } float y0 = (float) atof(argv[2]); if (y0 < -1) { cout << "y0 is too small, must be larger than -1" << endl; return 0; } float x1 = (float) atof(argv[3]); if (x1 > 1) { cout << "x1 is too large, must be smaller than 1" << endl; return 0; } float y1 = (float) atof(argv[4]); if (y1 > 1) { cout << "x0 is too large, must be smaller than 1" << endl; return 0; } int max_it = atoi(argv[5]); if (max_it <= 0) { cout << "max_it is too small, must be larger than 0" << endl; return 0; } cout << "Generating input data..." << endl; float input_array[kNumPixelsRow][kNumPixelsCol][3]; float delta_x = (x1 - x0) / kNumPixelsRow; float delta_y = (y1 - y0) / kNumPixelsCol; for (int x = 0; x < kNumPixelsCol; ++x) { for (int y = 0; y < kNumPixelsRow; ++y) { if (x == 0) { input_array[x][y][0] = x0; } else { input_array[x][y][0] = input_array[x - 1][y][0] + delta_x; } if (y == 0) { input_array[x][y][1] = y0; } else { input_array[x][y][1] = input_array[x][y - 1][1] + delta_y; } input_array[x][y][2] = (float) max_it; } } cout << "Executing..." << endl; struct ModelOutput output = executeModel((float*) input_array); cout << "Done." << endl; for (int i = 0; i < kNumPixelsRow * kNumPixelsCol; ++i) { cout << "output[" << i << "]: " << output.value1[i] << endl; } return 0; } mandelbrot.h (header file) //////////////////////////////////////////////////////////// // AUTO-GENERATED BY f2cc 0.1 //////////////////////////////////////////////////////////// /** * C struct for retrieving the output values from the model. * This is needed since C functions can only return a single * value. */ struct ModelOutput { /** * Output from process "parallelmapSY_1". */ int value1[4900]; }; /** * Executes the model. * * #param input1 * Input to process "parallelmapSY_1". * Expects an array of size 14700. * #returns A struct containing the model outputs. 
*/ struct ModelOutput executeModel(const float* input1); mandelbrot.cu (CUDA file) //////////////////////////////////////////////////////////// // AUTO-GENERATED BY f2cc 0.1 //////////////////////////////////////////////////////////// #include "mandelbrot.h" __device__ int parallelmapSY_1_func1(const float* args) { float x0 = args[0]; float y0 = args[1]; int max_it = (int) args[2]; float x = 0; float y = 0; int i = 0; while (x*x + y*y < (2*2) && i < max_it) { float x_temp = x*x - y*y + x0; y = 2*x*y + y0; x = x_temp; ++i; } return i; } __global__ void parallelmapSY_1__kernel(const float* input, int* output) { unsigned int index = (blockIdx.x * blockDim.x + threadIdx.x); if (index < 4900) { output[index] = parallelmapSY_1_func1(&input[index * 3]); } } void parallelmapSY_1__kernel_wrapper(const float* input, int* output) { float* device_input; int* device_output; struct cudaDeviceProp prop; cudaGetDeviceProperties(&prop, 0); int max_block_size = prop.maxThreadsPerBlock; int num_blocks = (4900 + max_block_size - 1) / max_block_size; cudaMalloc((void**) &device_input, 14700 * sizeof(float)); cudaMalloc((void**) &device_output, 4900 * sizeof(int)); cudaMemcpy((void*) device_input, (void*) input, 14700 * sizeof(float), cudaMemcpyHostToDevice); dim3 grid(num_blocks, 1); dim3 blocks(max_block_size, 1); parallelmapSY_1__kernel<<<grid, blocks>>>(device_input, device_output); cudaMemcpy((void*) output, (void*) device_output, 4900 * sizeof(int), cudaMemcpyDeviceToHost); cudaFree((void*) device_input); cudaFree(((void*) device_output); } struct ModelOutput executeModel(const float* input1) { // Declare signal variables // Signals part of DelaySY processes are also initiated with delay value float model_input_to_parallelmapSY_1_in[14700]; int parallelmapSY_1_out_to_model_output[4900]; // Copy model inputs to signal variables for (int i = 0; i < 14700; ++i) { model_input_to_parallelmapSY_1_in[i] = input1[i]; } // Execute processes parallelmapSY_1__kernel_wrapper(model_input_to_parallelmapSY_1_in, parallelmapSY_1_out_to_model_output); // Copy model output values to return container struct ModelOutput outputs; for (int i = 0; i < 4900; ++i) { outputs.value1[i] = parallelmapSY_1_out_to_model_output[i]; } return outputs; } The interesting file is mandelbrot.cu as that contains the computational code; mandelbrot.cpp is just a driver to get user input and generate input data, and mandelbrot.h is just a header file so that mandelbrot.cpp can easily use mandelbrot.cu. The function executeModel() is a wrapper function which takes care of propagating data between the processes in the model. In this case there is only one process so executeModel() is rather pointless. parallelmapSY_1__kernel_wrapper() prepares the parallel execution by allocating memory on the device, transfers the input data, invokes the kernel, and transfers the result back to the host. parallelmapSY_1__kernel() is the kernel function, which simply calls parallelmapSY_1_func1() with the appropriate input data. It also prevents execution when too many threads have been spawned. So the real area of interest is parallelmapSY_1_func1(). As I said, it works perfectly when the maximum iteration count is less than 5,500,000, but when I go higher it just doesn't seem to work as it's supposed to (see output log above). Some may ask "Why are you setting the iteration count so high? That's not necessary!". True, but since the pure C equivalent works perfectly with higher maximum iteration counts, why shouldn't the CUDA version? 
Since I'm designing a general tool, I need to know why it doesn't work in this example. So does anyone have any idea what the code appears to fail when the maximum iteration count fails when exceeding 5,500,000?
It may be a time-out problem: on a GPU that also drives a display, the OS watchdog will abort a CUDA kernel that runs for more than a few seconds, which matches the symptom of the output turning to garbage only above a certain iteration count. See e.g. CUDA apps time out & fail after several seconds - how to work around this?
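If you want to confirm whether the run-time limit applies to your device, you can query the corresponding device property (a small sketch; kernelExecTimeoutEnabled is non-zero when the OS/driver watchdog is active on that GPU):
#include <cstdio>
int main(){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0
    // non-zero means the watchdog can abort long-running kernels on this device
    printf("kernelExecTimeoutEnabled: %d\n", prop.kernelExecTimeoutEnabled);
    return 0;
}
Running on a GPU that does not drive a display (or disabling the watchdog as described in the linked question) removes the limit, and the kernel can then run for as long as the computation needs.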