Reduction example using CUDA and CUB

I'm trying to get my head around CUB, and having a bit of trouble following the (rather incomplete) worked examples. CUB looks like it is a fantastic tool, I just can't make sense of the example code.
I've built a simple proto-warp reduce example:
#include <cub/cub.cuh>
#include <cuda.h>
#include <vector>
using std::vector;
#include <iostream>
using std::cout;
using std::endl;
const int N = 128;
__global__ void sum(float *indata, float *outdata) {
    typedef cub::WarpReduce<float,4> WarpReduce;
    __shared__ typename WarpReduce::TempStorage temp_storage;
    int id = blockIdx.x*blockDim.x + threadIdx.x;
    if( id < 128 ) {
        outdata[id] = WarpReduce(temp_storage).Sum(indata[id]);
    }
}
int main() {
    vector<float> y(N), sol(N);
    float *dev_y, *dev_sol;
    cudaMalloc((void**)&dev_y, N*sizeof(float));
    cudaMalloc((void**)&dev_sol, N*sizeof(float));
    for( int i = 0; i < N; i++ ) {
        y[i] = (float)i;
    }
    cout << "input: ";
    for( int i = 0; i < N; i++ ) cout << y[i] << " ";
    cout << endl;
    cudaMemcpy(&y[0], dev_y, N*sizeof(float), cudaMemcpyHostToDevice);
    sum<<<1,32>>>(dev_y, dev_sol);
    cudaMemcpy(dev_sol, &sol[0], N*sizeof(float), cudaMemcpyDeviceToHost);
    cout << "output: ";
    for( int i = 0; i < N; i++ ) cout << sol[i] << " ";
    cout << endl;
    cudaFree(dev_y);
    cudaFree(dev_sol);
    return 0;
}
which returns all zeros.
I'm aware that this code would return a reduction that was banded with every 32nd element being the sum of a warp and the other elements being undefined - I just want to get a feel for how CUB works. Can someone point out what I'm doing wrong?
(also, does CUB deserve its own tag yet?)

Your cudaMemcpy arguments are back to front; the destination comes first (to be consistent with memcpy).
cudaError_t cudaMemcpy ( void* dst, const void* src, size_t count, cudaMemcpyKind kind )
See the API reference for more info.
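For reference, with the argument order fixed, the two copies in the question would read (a minimal correction; everything else is unchanged):

cudaMemcpy(dev_y, &y[0], N*sizeof(float), cudaMemcpyHostToDevice);     // destination (device) first
sum<<<1,32>>>(dev_y, dev_sol);
cudaMemcpy(&sol[0], dev_sol, N*sizeof(float), cudaMemcpyDeviceToHost); // destination (host) first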


Search Minimum/Maximum from n Arrays parallel in CUDA (Reduction Problem)

Is there a performant way in CUDA to find the maximum/minimum of multiple arrays (which live in different structures) in parallel? The structures are laid out according to the Structure of Arrays (SoA) format.
A simple idea would be to assign each array to a thread block, which computes the maximum/minimum using the parallel reduction approach. The problem I see here is the size of the shared memory, which is why I regard this approach as critical.
Another approach would be to calculate each minimum/maximum separately for each array, but I think that would be too slow.
struct Cube {
    int* x;
    int* y;
    int* z;
    int size;
};

int main() {
    Cube* c1 = new Cube(); // c1 includes 100 Cubes (because of SoA)
    c1->x = new int[100];
    c1->y = new int[100];
    c1->z = new int[100];
    Cube* c2 = new Cube();
    c2->x = new int[1047];
    c2->y = new int[1047];
    c2->z = new int[1047];
    Cube* c3 = new Cube();
    c3->x = new int[5000];
    c3->y = new int[5000];
    c3->z = new int[5000];
    // My goal now is to find the smallest/largest x dimension of all cubes in c1, c2, ..., and cn,
    // with one kernel launch.
    // So the smallest/largest x in c1, the smallest/largest x in c2, etc.
}
Does anyone know an efficient approach? Thanks.
A simple idea would be to assign each array to a thread block, which computes the maximum/minimum using the parallel reduction approach. The problem I see here is the size of the shared memory, which is why I regard this approach as critical.
There is no problem with shared memory size. You may wish to review Mark Harris' canonical parallel reduction tutorial and look at the later methods to understand how we can use a loop to populate shared memory, reducing values into shared memory as we go. Once the input loop is completed, then we begin the block-sweep phase of the reduction. This doesn't impose any special requirements on the shared memory per block.
Here's a worked example demonstrating both a thrust::reduce_by_key method (single call) and a CUDA block-segmented method (single kernel call):
$ cat t1535.cu
#include <iostream>
#include <iterator>          // std::ostream_iterator
#include <thrust/reduce.h>
#include <thrust/scan.h>     // exclusive_scan/inclusive_scan
#include <thrust/scatter.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/functional.h>
#include <cstdlib>
#define IMAX(x,y) (((x)>(y))?(x):(y))
#define IMIN(x,y) (((x)<(y))?(x):(y))
typedef int dtype;
const int ncubes = 3;
struct Cube {
    dtype* x;
    dtype* y;
    dtype* z;
    int size;
};
struct my_f
{
    template <typename T1, typename T2>
    __host__ __device__
    thrust::tuple<dtype,dtype> operator()(T1 t1, T2 t2){
        thrust::tuple<dtype,dtype> r;
        thrust::get<0>(r) = IMAX(thrust::get<0>(t1), thrust::get<0>(t2));
        thrust::get<1>(r) = IMIN(thrust::get<1>(t1), thrust::get<1>(t2));
        return r;
    }
};
const int MIN = -1;
const int MAX = 0x7FFFFFFF;
const int BS = 512;
template <typename T>
__global__ void block_segmented_minmax_reduce(const T * __restrict__ in, T * __restrict__ max, T * __restrict__ min, const size_t * __restrict__ slen){
    __shared__ T smax[BS];
    __shared__ T smin[BS];
    size_t my_seg_start = slen[blockIdx.x];
    size_t my_seg_size = slen[blockIdx.x+1] - my_seg_start;
    smax[threadIdx.x] = MIN;
    smin[threadIdx.x] = MAX;
    // grid-stride-style input loop over this block's segment
    for (size_t idx = my_seg_start+threadIdx.x; idx < my_seg_start+my_seg_size; idx += BS){
        T my_val = in[idx];
        smax[threadIdx.x] = IMAX(my_val, smax[threadIdx.x]);
        smin[threadIdx.x] = IMIN(my_val, smin[threadIdx.x]);}
    // block-sweep phase
    for (int s = BS>>1; s > 0; s>>=1){
        __syncthreads();
        if (threadIdx.x < s){
            smax[threadIdx.x] = IMAX(smax[threadIdx.x], smax[threadIdx.x+s]);
            smin[threadIdx.x] = IMIN(smin[threadIdx.x], smin[threadIdx.x+s]);}
    }
    if (!threadIdx.x){
        max[blockIdx.x] = smax[0];
        min[blockIdx.x] = smin[0];}
}
int main() {
    // data setup
    Cube *c = new Cube[ncubes];
    thrust::host_vector<size_t> csize(ncubes+1);
    csize[0] = 100;
    csize[1] = 1047;
    csize[2] = 5000;
    csize[3] = 0;
    c[0].x = new dtype[csize[0]];
    c[1].x = new dtype[csize[1]];
    c[2].x = new dtype[csize[2]];
    size_t ctot = 0;
    for (int i = 0; i < ncubes; i++) ctot += csize[i];
    // method 1: thrust
    // concatenate
    thrust::host_vector<dtype> h_d(ctot);
    size_t start = 0;
    for (int i = 0; i < ncubes; i++) { thrust::copy_n(c[i].x, csize[i], h_d.begin()+start); start += csize[i]; }
    for (size_t i = 0; i < ctot; i++) h_d[i] = rand();
    thrust::device_vector<dtype> d_d = h_d;
    // build flag vector
    thrust::device_vector<int> d_f(d_d.size());
    thrust::host_vector<size_t> coff(csize.size());
    thrust::exclusive_scan(csize.begin(), csize.end(), coff.begin());
    thrust::device_vector<size_t> d_coff = coff;
    thrust::scatter(thrust::constant_iterator<int>(1), thrust::constant_iterator<int>(1)+ncubes, d_coff.begin(), d_f.begin());
    thrust::inclusive_scan(d_f.begin(), d_f.end(), d_f.begin());
    // min/max reduction
    thrust::device_vector<dtype> d_max(ncubes);
    thrust::device_vector<dtype> d_min(ncubes);
    thrust::reduce_by_key(d_f.begin(), d_f.end(), thrust::make_zip_iterator(thrust::make_tuple(d_d.begin(), d_d.begin())), thrust::make_discard_iterator(), thrust::make_zip_iterator(thrust::make_tuple(d_max.begin(), d_min.begin())), thrust::equal_to<int>(), my_f());
    thrust::host_vector<dtype> h_max = d_max;
    thrust::host_vector<dtype> h_min = d_min;
    std::cout << "Thrust Maxima: " << std::endl;
    thrust::copy_n(h_max.begin(), ncubes, std::ostream_iterator<dtype>(std::cout, ","));
    std::cout << std::endl << "Thrust Minima: " << std::endl;
    thrust::copy_n(h_min.begin(), ncubes, std::ostream_iterator<dtype>(std::cout, ","));
    std::cout << std::endl;
    // method 2: CUDA kernel (block reduce)
    block_segmented_minmax_reduce<<<ncubes, BS>>>(thrust::raw_pointer_cast(d_d.data()), thrust::raw_pointer_cast(d_max.data()), thrust::raw_pointer_cast(d_min.data()), thrust::raw_pointer_cast(d_coff.data()));
    thrust::copy_n(d_max.begin(), ncubes, h_max.begin());
    thrust::copy_n(d_min.begin(), ncubes, h_min.begin());
    std::cout << "CUDA Maxima: " << std::endl;
    thrust::copy_n(h_max.begin(), ncubes, std::ostream_iterator<dtype>(std::cout, ","));
    std::cout << std::endl << "CUDA Minima: " << std::endl;
    thrust::copy_n(h_min.begin(), ncubes, std::ostream_iterator<dtype>(std::cout, ","));
    std::cout << std::endl;
    return 0;
}
$ nvcc -o t1535 t1535.cu
$ ./t1535
Thrust Maxima:
2145174067,2147469841,2146753918,
Thrust Minima:
35005211,2416949,100669,
CUDA Maxima:
2145174067,2147469841,2146753918,
CUDA Minima:
35005211,2416949,100669,
$
For a small number of Cube objects, the thrust method is likely to be faster. It will tend to make better use of medium to large GPUs than the block method will. For a large number of Cube objects, the block method should also be fairly efficient.

How to cope with "cudaErrorMissingConfiguration" from "cudaMallocPitch" function of CUDA?

I'm making a Mandelbrot set program with CUDA. However, I can't get any further until the cudaErrorMissingConfiguration returned by the cudaMallocPitch() function is resolved. Could you tell me something about it?
My GPU is a GeForce RTX 2060 SUPER.
I build with the command line below:
> nvcc MandelbrotCUDA.cu -o MandelbrotCUDA -O3
I tried cudaDeviceSetLimit( cudaLimitMallocHeapSize, 7*1024*1024*1024 ) to resize the heap.
cudaDeviceSetLimit succeeded. However, I cannot get one step further: the program never prints "CUDA malloc done!".
#include <iostream>
#include <thrust/complex.h>
#include <fstream>
#include <string>
#include <stdlib.h>
using namespace std;
#define D 0.0000025 // Tick
#define LIMIT_N 255
#define INF_NUM 2
#define PLOT_METHOD 2 // dat file : 0, ppm file : 1, ppm file with C : 2
__global__
void calculation(const int indexTotalX, const int indexTotalY, int ***n, thrust::complex<double> ***c){ // n, c are the pointers of dN, dC.
    for(int i = 0; i < indexTotalY; i++){
        for(int j = 0; j < indexTotalX; j++){
            thrust::complex<double> z(0.0f, 0.0f);
            n[i][j] = 0;
            for(int ctr=1; ctr <= LIMIT_N; ctr++){
                z = z*z + (*(c[i][j]));
                n[i][j] = n[i][j] + (abs(z) < INF_NUM);
            }
        }
    }
}
int main(){
    // Data Path
    string filePath = "Y:\\Documents\\Programming\\mandelbrot\\";
    string fileName = "mandelbrot4.ppm";
    string filename = filePath+fileName;
    //complex<double> c[N][M];
    double xRange[2] = {-0.76, -0.74};
    double yRange[2] = {0.05, 0.1};
    const int indexTotalX = (xRange[1]-xRange[0])/D;
    const int indexTotalY = (yRange[1]-yRange[0])/D;
    thrust::complex<double> **c;
    //c = new complex<double> [N];
    cout << "debug_n" << endl;
    int **n;
    n = new int* [indexTotalY];
    c = new thrust::complex<double> * [indexTotalY];
    for(int i=0; i<indexTotalY; i++){
        n[i] = new int [indexTotalX];
        c[i] = new thrust::complex<double> [indexTotalX];
    }
    cout << "debug_n_end" << endl;
    for(int i = 0; i < indexTotalY; i++){
        for(int j = 0; j < indexTotalX; j++){
            thrust::complex<double> tmp( xRange[0]+j*D, yRange[0]+i*D );
            c[i][j] = tmp;
            //n[i*sqrt(N)+j] = 0;
        }
    }
    // CUDA malloc
    cout << "CUDA malloc initializing..." << endl;
    int **dN;
    thrust::complex<double> **dC;
    cudaError_t error;
    error = cudaDeviceSetLimit(cudaLimitMallocHeapSize, 7*1024*1024*1024);
    if(error != cudaSuccess){
        cout << "cudaDeviceSetLimit's ERROR CODE = " << error << endl;
        return 0;
    }
    size_t tmpPitch;
    error = cudaMallocPitch((void **)dN, &tmpPitch, (size_t)(indexTotalY*sizeof(int)), (size_t)(indexTotalX*sizeof(int)));
    if(error != cudaSuccess){
        cout << "CUDA ERROR CODE = " << error << endl;
        cout << "indexTotalX = " << indexTotalX << endl;
        cout << "indexTotalY = " << indexTotalY << endl;
        return 0;
    }
    cout << "CUDA malloc done!" << endl;
These are the console messages:
debug_n
debug_n_end
CUDA malloc initializing...
CUDA ERROR CODE = 1
indexTotalX = 8000
indexTotalY = 20000
There are several problems here:
int **dN;
...
error = cudaMallocPitch((void **)dN, &tmpPitch,(size_t)(indexTotalY*sizeof(int)), (size_t)(indexTotalX*sizeof(int)));
The correct type of pointer to use in CUDA allocations is a single pointer:
int *dN;
not a double pointer:
int **dN;
(so your kernel, where you are trying to pass triple pointers:
void calculation(const int indexTotalX, const int indexTotalY, int ***n, thrust::complex<double> ***c){ // n, c are the pointers of dN, dC.
is almost certainly not going to work, and should not be designed that way, but that is not the question you are asking.)
The pointer is passed to the allocating function by its address:
error = cudaMallocPitch((void **)&dN,
For cudaMallocPitch, only the horizontal requested dimension is scaled by the size of the data element. The allocation height is not scaled this way. Also, I will assume X corresponds to your allocation width, and Y corresponds to your allocation height, so you also have those parameters reversed:
error = cudaMallocPitch((void **)&dN, &tmpPitch,(size_t)(indexTotalX*sizeof(int)), (size_t)(indexTotalY));
It should not be necessary to set cudaLimitMallocHeapSize to make any of this work; that limit applies only to in-kernel allocations (device-side malloc/new). Reserving 7GB on an 8GB card may also cause problems. Until you are sure you need it (it's not needed for what you have shown), I would simply remove that.
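As an aside, once the allocation succeeds, elements of a pitched allocation are addressed via the returned pitch, which is in bytes. Here is a minimal sketch of the standard access pattern (my addition; the kernel name and launch shape are illustrative, not taken from the original code):

// Illustrative kernel: writes zero to element (i, j) of the pitched dN allocation.
// 'pitch' is the byte pitch returned by cudaMallocPitch (tmpPitch above).
__global__ void zero_pitched(int *dN, size_t pitch, int width, int height) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column index (x)
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row index (y)
    if (i < height && j < width) {
        int *row = (int *)((char *)dN + i * pitch);  // advance i rows of 'pitch' bytes each
        row[j] = 0;
    }
}
// possible launch, matching the allocation above:
// zero_pitched<<<dim3((indexTotalX+15)/16, (indexTotalY+15)/16), dim3(16,16)>>>(dN, tmpPitch, indexTotalX, indexTotalY);

With the items above addressed, the question's code runs cleanly: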
$ cat t1488.cu
#include <iostream>
#include <thrust/complex.h>
#include <fstream>
#include <string>
#include <stdlib.h>
using namespace std;
#define D 0.0000025 // Tick
#define LIMIT_N 255
#define INF_NUM 2
#define PLOT_METHOD 2 // dat file : 0, ppm file : 1, ppm file with C : 2
__global__
void calculation(const int indexTotalX, const int indexTotalY, int ***n, thrust::complex<double> ***c){ // n, c are the pointers of dN, dC.
    for(int i = 0; i < indexTotalY; i++){
        for(int j = 0; j < indexTotalX; j++){
            thrust::complex<double> z(0.0f, 0.0f);
            n[i][j] = 0;
            for(int ctr=1; ctr <= LIMIT_N; ctr++){
                z = z*z + (*(c[i][j]));
                n[i][j] = n[i][j] + (abs(z) < INF_NUM);
            }
        }
    }
}
int main(){
    // Data Path
    string filePath = "Y:\\Documents\\Programming\\mandelbrot\\";
    string fileName = "mandelbrot4.ppm";
    string filename = filePath+fileName;
    //complex<double> c[N][M];
    double xRange[2] = {-0.76, -0.74};
    double yRange[2] = {0.05, 0.1};
    const int indexTotalX = (xRange[1]-xRange[0])/D;
    const int indexTotalY = (yRange[1]-yRange[0])/D;
    thrust::complex<double> **c;
    //c = new complex<double> [N];
    cout << "debug_n" << endl;
    int **n;
    n = new int* [indexTotalY];
    c = new thrust::complex<double> * [indexTotalY];
    for(int i=0; i<indexTotalY; i++){
        n[i] = new int [indexTotalX];
        c[i] = new thrust::complex<double> [indexTotalX];
    }
    cout << "debug_n_end" << endl;
    for(int i = 0; i < indexTotalY; i++){
        for(int j = 0; j < indexTotalX; j++){
            thrust::complex<double> tmp( xRange[0]+j*D, yRange[0]+i*D );
            c[i][j] = tmp;
            //n[i*sqrt(N)+j] = 0;
        }
    }
    // CUDA malloc
    cout << "CUDA malloc initializing..." << endl;
    int *dN;
    thrust::complex<double> **dC;
    cudaError_t error;
    size_t tmpPitch;
    error = cudaMallocPitch((void **)&dN, &tmpPitch, (size_t)(indexTotalX*sizeof(int)), (size_t)(indexTotalY));
    if(error != cudaSuccess){
        cout << "CUDA ERROR CODE = " << error << endl;
        cout << "indexTotalX = " << indexTotalX << endl;
        cout << "indexTotalY = " << indexTotalY << endl;
        return 0;
    }
    cout << "CUDA malloc done!" << endl;
}
$ nvcc -o t1488 t1488.cu
t1488.cu(68): warning: variable "dC" was declared but never referenced
$ cuda-memcheck ./t1488
========= CUDA-MEMCHECK
debug_n
debug_n_end
CUDA malloc initializing...
CUDA malloc done!
========= ERROR SUMMARY: 0 errors
$

cufftSetStream causes garbage output. Am I doing something wrong?

According to the docs, the cufftSetStream() function
Associates a CUDA stream with a cuFFT plan. All kernel launches made during plan execution are now done through the associated stream [...until...] the stream is changed with another call to cufftSetStream().
Unfortunately, the results turn into garbage. Here is an example that demonstrates this by performing a bunch of transforms two ways: once with each stream having its own dedicated plan, and once with a single plan being reused as the documentation above indicates. The former behaves as expected; the reused/cufftSetStream approach has errors in most of the transforms. This was observed on the two cards I've tried (GTX 750 Ti, Titan X) on CentOS 7 Linux, with CUDA compilation tools release 7.0, V7.0.27 and release 7.5, V7.5.17.
EDIT: see the "FIX" comments below for one way to fix things.
#include <cufft.h>
#include <stdexcept>
#include <iostream>
#include <numeric>
#include <vector>
#include <cstring>  // memset

#define ck(cmd) if ( cmd) { std::cerr << "error at line " << __LINE__ << std::endl; exit(1); }

__global__
void fill_input(cufftComplex * buf, int batch, int nbins, int stride, int seed)
{
    for (int i = blockDim.y * blockIdx.y + threadIdx.y; i < batch; i += gridDim.y*blockDim.y)
        for (int j = blockDim.x * blockIdx.x + threadIdx.x; j < nbins; j += gridDim.x*blockDim.x)
            buf[i*stride + j] = make_cuFloatComplex( (i+seed)%101 - 50, (j+seed)%41 - 20);
}

__global__
void check_output(const float * buf1, const float * buf2, int batch, int nfft, int stride, int * errors)
{
    for (int i = blockDim.y * blockIdx.y + threadIdx.y; i < batch; i += gridDim.y*blockDim.y) {
        for (int j = blockDim.x * blockIdx.x + threadIdx.x; j < nfft; j += gridDim.x*blockDim.x) {
            float e = buf1[i*stride+j] - buf2[i*stride+j];
            if (e*e > 1) // gross error
                atomicAdd(errors, 1);
        }
    }
}

void demo(bool reuse_plan)
{
    if (reuse_plan)
        std::cout << "Reusing the same fft plan with multiple streams via cufftSetStream ... ";
    else
        std::cout << "Giving each stream its own dedicated fft plan ... ";
    int nfft = 1024;
    int batch = 1024;
    int nstreams = 8;
    int nbins = nfft/2 + 1;
    int nit = 100;
    size_t inpitch, outpitch;
    std::vector<cufftComplex*> inbufs(nstreams);
    std::vector<float*> outbufs(nstreams);
    std::vector<float*> checkbufs(nstreams);
    std::vector<cudaStream_t> streams(nstreams);
    std::vector<cufftHandle> plans(nstreams);
    for (int i=0; i<nstreams; ++i) {
        ck( cudaStreamCreate(&streams[i]) );
        ck( cudaMallocPitch((void**)&inbufs[i], &inpitch, nbins*sizeof(cufftComplex), batch) );
        ck( cudaMallocPitch((void**)&outbufs[i], &outpitch, nfft*sizeof(float), batch) );
        ck( cudaMallocPitch((void**)&checkbufs[i], &outpitch, nfft*sizeof(float), batch) );
        if (i==0 || reuse_plan==false)
            ck( cufftPlanMany(&plans[i], 1, &nfft, &nbins, 1, inpitch/sizeof(cufftComplex), &nfft, 1, outpitch/sizeof(float), CUFFT_C2R, batch) );
    }
    // fill the input buffers and FFT them to get a baseline for comparison
    for (int i=0; i<nstreams; ++i) {
        fill_input<<<20,dim3(32,32)>>>(inbufs[i], batch, nbins, inpitch/sizeof(cufftComplex), i);
        ck( cudaGetLastError() );
        if (reuse_plan) {
            ck( cufftExecC2R(plans[0], inbufs[i], checkbufs[i]) );
        } else {
            ck( cufftExecC2R(plans[i], inbufs[i], checkbufs[i]) );
            ck( cufftSetStream(plans[i], streams[i]) ); // only need to set the stream once
        }
        ck( cudaDeviceSynchronize() );
    }
    // allocate a buffer for the error count
    int * errors;
    cudaMallocHost((void**)&errors, sizeof(int)*nit);
    memset(errors, 0, sizeof(int)*nit);
    /* FIX: an event can protect the plan internal buffers
       by serializing access to the plan
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    */
    // perform the FFTs and check the outputs on streams
    for (int it=0; it<nit; ++it) {
        int k = it % nstreams;
        ck( cudaStreamSynchronize(streams[k]) ); // make sure any prior kernels have completed
        if (reuse_plan) {
            // FIX: ck( cudaStreamWaitEvent(streams[k], ev, 0) );
            ck( cufftSetStream(plans[0], streams[k]) );
            ck( cufftExecC2R(plans[0], inbufs[k], outbufs[k]) );
            // FIX: ck( cudaEventRecord(ev, streams[k]) );
        } else {
            ck( cufftExecC2R(plans[k], inbufs[k], outbufs[k]) );
        }
        check_output<<<100,dim3(32,32),0,streams[k]>>>(outbufs[k], checkbufs[k], batch, nfft, outpitch/sizeof(float), &errors[it]);
        ck( cudaGetLastError() );
    }
    ck( cudaDeviceSynchronize() );
    // report number of errors
    int errcount = 0;
    for (int it=0; it<nit; ++it)
        if (errors[it])
            ++errcount;
    std::cout << errcount << " of " << nit << " transforms had errors\n";
    for (int i=0; i<nstreams; ++i) {
        cudaFree(inbufs[i]);
        cudaFree(outbufs[i]);
        cudaStreamDestroy(streams[i]);
        if (i==0 || reuse_plan==false)
            cufftDestroy(plans[i]);
    }
}

int main(int argc, char ** argv)
{
    demo(false);
    demo(true);
    return 0;
}
Typical output:
Giving each stream its own dedicated fft plan ... 0 of 100 transforms had errors
Reusing the same fft plan with multiple streams via cufftSetStream ... 87 of 100 transforms had errors
In order to reuse plans the way you want, you need to manage the cuFFT work area manually.
Each plan has space for intermediate calculation results. If you want to use the same plan handle for two or more different plan executions at the same time, you need to provide a temporary buffer for each concurrent cufftExec* call.
You can do this by using cufftSetWorkArea - please have a look at section 3.7 in the cuFFT documentation. Section 2.2 will also help you understand how it works.
Here's a worked example showing the changes to your code for this:
$ cat t1241.cu
#include <cufft.h>
#include <stdexcept>
#include <iostream>
#include <numeric>
#include <vector>
#include <cstring>  // memset

#define ck(cmd) if ( cmd) { std::cerr << "error at line " << __LINE__ << std::endl; exit(1); }

__global__
void fill_input(cufftComplex * buf, int batch, int nbins, int stride, int seed)
{
    for (int i = blockDim.y * blockIdx.y + threadIdx.y; i < batch; i += gridDim.y*blockDim.y)
        for (int j = blockDim.x * blockIdx.x + threadIdx.x; j < nbins; j += gridDim.x*blockDim.x)
            buf[i*stride + j] = make_cuFloatComplex( (i+seed)%101 - 50, (j+seed)%41 - 20);
}

__global__
void check_output(const float * buf1, const float * buf2, int batch, int nfft, int stride, int * errors)
{
    for (int i = blockDim.y * blockIdx.y + threadIdx.y; i < batch; i += gridDim.y*blockDim.y) {
        for (int j = blockDim.x * blockIdx.x + threadIdx.x; j < nfft; j += gridDim.x*blockDim.x) {
            float e = buf1[i*stride+j] - buf2[i*stride+j];
            if (e*e > 1) // gross error
                atomicAdd(errors, 1);
        }
    }
}

void demo(bool reuse_plan)
{
    if (reuse_plan)
        std::cout << "Reusing the same fft plan with multiple streams via cufftSetStream ... ";
    else
        std::cout << "Giving each stream its own dedicated fft plan ... ";
    int nfft = 1024;
    int batch = 1024;
    int nstreams = 8;
    int nbins = nfft/2 + 1;
    int nit = 100;
    size_t inpitch, outpitch;
    std::vector<cufftComplex*> inbufs(nstreams);
    std::vector<float*> outbufs(nstreams);
    std::vector<float*> checkbufs(nstreams);
    std::vector<cudaStream_t> streams(nstreams);
    std::vector<cufftHandle> plans(nstreams);
    // if plan reuse, set up independent work areas
    std::vector<char *> wk_areas(nstreams);
    for (int i=0; i<nstreams; ++i) {
        ck( cudaStreamCreate(&streams[i]) );
        ck( cudaMallocPitch((void**)&inbufs[i], &inpitch, nbins*sizeof(cufftComplex), batch) );
        ck( cudaMallocPitch((void**)&outbufs[i], &outpitch, nfft*sizeof(float), batch) );
        ck( cudaMallocPitch((void**)&checkbufs[i], &outpitch, nfft*sizeof(float), batch) );
        if (i==0 || reuse_plan==false)
            ck( cufftPlanMany(&plans[i], 1, &nfft, &nbins, 1, inpitch/sizeof(cufftComplex), &nfft, 1, outpitch/sizeof(float), CUFFT_C2R, batch) );
    }
    if (reuse_plan){
        size_t ws;
        ck( cufftGetSize(plans[0], &ws) );
        for (int i = 0; i < nstreams; i++)
            ck( cudaMalloc(&(wk_areas[i]), ws) );
        ck( cufftSetAutoAllocation(plans[0], 0) );
        ck( cufftSetWorkArea(plans[0], wk_areas[0]) );
    }
    // fill the input buffers and FFT them to get a baseline for comparison
    for (int i=0; i<nstreams; ++i) {
        fill_input<<<20,dim3(32,32)>>>(inbufs[i], batch, nbins, inpitch/sizeof(cufftComplex), i);
        ck( cudaGetLastError() );
        if (reuse_plan) {
            ck( cufftExecC2R(plans[0], inbufs[i], checkbufs[i]) );
        } else {
            ck( cufftExecC2R(plans[i], inbufs[i], checkbufs[i]) );
            ck( cufftSetStream(plans[i], streams[i]) ); // only need to set the stream once
        }
        ck( cudaDeviceSynchronize() );
    }
    // allocate a buffer for the error count
    int * errors;
    cudaMallocHost((void**)&errors, sizeof(int)*nit);
    memset(errors, 0, sizeof(int)*nit);
    // perform the FFTs and check the outputs on streams
    for (int it=0; it<nit; ++it) {
        int k = it % nstreams;
        ck( cudaStreamSynchronize(streams[k]) ); // make sure any prior kernels have completed
        if (reuse_plan) {
            ck( cufftSetStream(plans[0], streams[k]) );
            ck( cufftSetWorkArea(plans[0], wk_areas[k]) ); // update work area pointer in plan
            ck( cufftExecC2R(plans[0], inbufs[k], outbufs[k]) );
        } else {
            ck( cufftExecC2R(plans[k], inbufs[k], outbufs[k]) );
        }
        check_output<<<100,dim3(32,32),0,streams[k]>>>(outbufs[k], checkbufs[k], batch, nfft, outpitch/sizeof(float), &errors[it]);
        ck( cudaGetLastError() );
    }
    ck( cudaDeviceSynchronize() );
    // report number of errors
    int errcount = 0;
    for (int it=0; it<nit; ++it)
        if (errors[it])
            ++errcount;
    std::cout << errcount << " of " << nit << " transforms had errors\n";
    for (int i=0; i<nstreams; ++i) {
        cudaFree(inbufs[i]);
        cudaFree(outbufs[i]);
        cudaFree(wk_areas[i]);
        cudaStreamDestroy(streams[i]);
        if (i==0 || reuse_plan==false)
            cufftDestroy(plans[i]);
    }
}

int main(int argc, char ** argv)
{
    demo(false);
    demo(true);
    return 0;
}
$ nvcc -o t1241 t1241.cu -lcufft
$ ./t1241
Giving each stream its own dedicated fft plan ... 0 of 100 transforms had errors
Reusing the same fft plan with multiple streams via cufftSetStream ... 0 of 100 transforms had errors
$

How to use CUB and Thrust in one CUDA code

I'm trying to introduce some CUB into my "old" Thrust code, and so have started with a small example to compare thrust::reduce_by_key with cub::DeviceReduce::ReduceByKey, both applied to thrust::device_vectors.
The thrust part of the code is fine, but the CUB part, which naively uses raw pointers obtained via thrust::raw_pointer_cast, crashes after the CUB calls. I put in a cudaDeviceSynchronize() to try to solve this problem, but it didn't help. The CUB part of the code was cribbed from the CUB web pages.
On OSX the runtime error is:
libc++abi.dylib: terminate called throwing an exception
Abort trap: 6
On Linux the runtime error is:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): an illegal memory access was encountered
The first few lines of the cuda-memcheck output are:
========= CUDA-MEMCHECK
========= Invalid __global__ write of size 4
========= at 0x00127010 in /home/sdettrick/codes/MCthrust/tests/../cub-1.3.2/cub/device/dispatch/../../block_range/block_range_reduce_by_key.cuh:1017:void cub::ReduceByKeyRegionKernel<cub::DeviceReduceByKeyDispatch<unsigned int*, unsigned int*, float*, float*, int*, cub::Equality, CustomSum, int>::PtxReduceByKeyPolicy, unsigned int*, unsigned int*, float*, float*, int*, cub::ReduceByKeyScanTileState<float, int, bool=1>, cub::Equality, CustomSum, int>(unsigned int*, float*, float*, int*, cub::Equality, CustomSum, int, cub::DeviceReduceByKeyDispatch<unsigned int*, unsigned int*, float*, float*, int*, cub::Equality, CustomSum, int>::PtxReduceByKeyPolicy, unsigned int*, int, cub::GridQueue<int>)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x7fff7dbb3e88 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
Unfortunately I'm not too sure what to do about that.
Any help would be greatly appreciated. I tried this on the NVIDIA developer zone but didn't get any responses. The complete example code is below. It should compile with CUDA 6.5 and cub 1.3.2:
#include <iostream>
#include <thrust/sort.h>
#include <thrust/gather.h>
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>
//========================================
// for CUB:
struct CustomSum
{
    template <typename T>
    CUB_RUNTIME_FUNCTION __host__ __device__ __forceinline__
    //__host__ __device__ __forceinline__
    T operator()(const T &a, const T &b) const {
        return b+a;
    }
};
//========================================
int main()
{
    const int Nkey=20;
    int Nseg=9;
    int ikey[Nkey] = {0, 0, 0, 6, 8, 0, 2, 4, 6, 8, 1, 3, 5, 7, 8, 1, 3, 5, 7, 8};
    thrust::device_vector<unsigned int> key(ikey, ikey+Nkey);
    thrust::device_vector<unsigned int> keysout(Nkey);
    // Let's reduce x, by key:
    float xval[Nkey];
    for (int i=0; i<Nkey; i++) xval[i] = ikey[i]+0.1f;
    thrust::device_vector<float> x(xval, xval+Nkey);
    // First, sort x by key:
    thrust::sort_by_key(key.begin(), key.end(), x.begin());
    //---------------------------------------------------------------------
    std::cout<<"=================================================================="<<std::endl
             <<" THRUST reduce_by_key:"<<std::endl
             <<"=================================================================="<<std::endl;
    thrust::device_vector<float> output(Nseg, 0.0f);
    thrust::reduce_by_key(key.begin(),
                          key.end(),
                          x.begin(),
                          keysout.begin(),
                          output.begin());
    for (int i=0; i<Nkey; i++) std::cout << x[i] << " "; std::cout<<std::endl;
    for (int i=0; i<Nkey; i++) std::cout << key[i] << " "; std::cout<<std::endl;
    for (int i=0; i<Nseg; i++) std::cout << output[i] << " "; std::cout<<std::endl;
    float ototal = thrust::reduce(output.begin(), output.end());
    float xtotal = thrust::reduce(x.begin(), x.end());
    std::cout << "total=" << ototal << ", should be " << xtotal << std::endl;
    //---------------------------------------------------------------------
    std::cout<<"=================================================================="<<std::endl
             <<" CUB ReduceByKey:"<<std::endl
             <<"=================================================================="<<std::endl;
    unsigned int *d_keys_in    = thrust::raw_pointer_cast(&key[0]);
    float        *d_values_in  = thrust::raw_pointer_cast(&x[0]);
    unsigned int *d_keys_out   = thrust::raw_pointer_cast(&keysout[0]);
    float        *d_values_out = thrust::raw_pointer_cast(&output[0]);
    int *d_num_segments=&Nseg;
    CustomSum reduction_op;
    std::cout << "CUB input" << std::endl;
    for (int i=0; i<Nkey; ++i) std::cout << key[i] << " "; std::cout<<std::endl;
    for (int i=0; i<Nkey; ++i) std::cout << x[i] << " "; std::cout<<std::endl;
    for (int i=0; i<Nkey; ++i) std::cout << keysout[i] << " "; std::cout<<std::endl;
    for (int i=0; i<Nseg; ++i) std::cout << output[i] << " "; std::cout<<std::endl;
    // Determine temporary device storage requirements
    void *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;
    cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, d_num_segments, reduction_op, Nkey);
    // Allocate temporary storage
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    std::cout << "temp_storage_bytes = " << temp_storage_bytes << std::endl;
    // Run reduce-by-key
    cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, d_num_segments, reduction_op, Nkey);
    cudaDeviceSynchronize();
    std::cout << "CUB output" << std::endl;
    std::cout << Nkey << " " << Nseg << std::endl;
    std::cout << key.size() << " " << x.size() << " " << keysout.size() << " " << output.size() << std::endl;
    // At this point onward it dies:
    //libc++abi.dylib: terminate called throwing an exception
    //Abort trap: 6
    // If the next line is uncommented, it crashes the Mac!
    for (int i=0; i<Nkey; ++i) std::cout << key[i] << " "; std::cout<<std::endl;
    // for (int i=0; i<Nkey; ++i) std::cout << x[i] << " "; std::cout<<std::endl;
    // for (int i=0; i<Nkey; ++i) std::cout << keysout[i] << " "; std::cout<<std::endl;
    // for (int i=0; i<Nseg; ++i) std::cout << output[i] << " "; std::cout<<std::endl;
    cudaFree(d_temp_storage);
    ototal = thrust::reduce(output.begin(), output.end());
    xtotal = thrust::reduce(x.begin(), x.end());
    std::cout << "total=" << ototal << ", should be " << xtotal << std::endl;
    return 1;
}
This is not appropriate:
int *d_num_segments=&Nseg;
You cannot take the address of a host variable and use it as a device pointer.
Instead do this:
int *d_num_segments;
cudaMalloc(&d_num_segments, sizeof(int));
This allocates space on the device for the data (a single integer that CUB will write to), and assigns the address of that allocation to your d_num_segments variable. This then becomes a valid device pointer.
In (ordinary, non-UM) CUDA, it is illegal to dereference a host address in device code, or a device address in host code.
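As a follow-up sketch (my addition, not part of the original answer): because d_num_segments now points to device memory, the host reads the segment count back with an ordinary copy after the second ReduceByKey call:

int num_segments = 0; // host-side copy of the value CUB wrote to device memory
cudaMemcpy(&num_segments, d_num_segments, sizeof(int), cudaMemcpyDeviceToHost);
std::cout << "num segments = " << num_segments << std::endl;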

CUDA addvectors memory intuitive explanation

I have the following code:
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
#include <ctime>
#include <cstdio>   // getchar
#include <cstdlib>  // rand, RAND_MAX
#include <vector>
#include <numeric>
float random_float(void)
{
    return static_cast<float>(rand()) / RAND_MAX;
}

std::vector<float> add(float alpha, std::vector<float>& v1, std::vector<float>& v2)
{   /* Do quick size check on vectors before proceeding */
    std::vector<float> result(v1.size());
    for (unsigned int i = 0; i < result.size(); ++i)
    {
        result[i] = alpha*v1[i] + v2[i];
    }
    return result;
}

__global__ void Addloop( int N, float alpha, float* x, float* y ) {
    int i;
    int i0 = blockIdx.x*blockDim.x + threadIdx.x;
    for( i = i0; i < N; i += blockDim.x*gridDim.x )
        y[i] = alpha*x[i] + y[i];
    /*
    if ( i0 < N )
        y[i0] = alpha*x[i0] + y[i0];
    */
}

int main( int argc, char** argv ) {
    float alpha = 0.3;
    // create array of 256k elements
    int num_elements = 10;//1<<18;
    // generate random input on the host
    std::vector<float> h1_input(num_elements);
    std::vector<float> h2_input(num_elements);
    for(int i = 0; i < num_elements; ++i)
    {
        h1_input[i] = random_float();
        h2_input[i] = random_float();
    }
    for (std::vector<float>::iterator it = h1_input.begin(); it != h1_input.end(); ++it)
        std::cout << ' ' << *it;
    std::cout << '\n';
    for (std::vector<float>::iterator it = h2_input.begin(); it != h2_input.end(); ++it)
        std::cout << ' ' << *it;
    std::cout << '\n';
    std::vector<float> host_result;//(std::vector<float> h1_input, std::vector<float> h2_input );
    host_result = add( alpha, h1_input, h2_input );
    for (std::vector<float>::iterator it = host_result.begin(); it != host_result.end(); ++it)
        std::cout << ' ' << *it;
    std::cout << '\n';
    // move input to device memory
    float *d1_input = 0;
    cudaMalloc((void**)&d1_input, sizeof(float) * num_elements);
    cudaMemcpy(d1_input, &h1_input[0], sizeof(float) * num_elements, cudaMemcpyHostToDevice);
    float *d2_input = 0;
    cudaMalloc((void**)&d2_input, sizeof(float) * num_elements);
    cudaMemcpy(d2_input, &h2_input[0], sizeof(float) * num_elements, cudaMemcpyHostToDevice);
    Addloop<<<1,3>>>( num_elements, alpha, d1_input, d2_input );
    // copy the result back to the host
    std::vector<float> device_result(num_elements);
    cudaMemcpy(&device_result[0], d2_input, sizeof(float) * num_elements, cudaMemcpyDeviceToHost);
    for (std::vector<float>::iterator it = device_result.begin(); it != device_result.end(); ++it)
        std::cout << ' ' << *it;
    std::cout << '\n';
    cudaFree(d1_input);
    cudaFree(d2_input);
    h1_input.clear();
    h2_input.clear();
    device_result.clear();
    std::cout << "DONE! \n";
    getchar();
    return 0;
}
I am trying to understand how the GPU accesses memory here. The kernel, for reasons of simplicity, is launched as Addloop<<<1,3>>>. I am trying to understand how this code works by imagining the for loops on the GPU as instances. More specifically, I imagine the following instances, but they do not help.
Instance 1:
for( i = 0; i < N; i += 3*1 ) // ( i += 0*1 --> i += 3*1 after Eric's comment)
    y[i] = alpha*x[i] + y[i];
Instance 2:
for( i = 1; i < N; i += 3*1 )
    y[i] = alpha*x[i] + y[i];
Instance 3:
for( i = 2; i < N; i += 3*1 )
    y[i] = alpha*x[i] + y[i];
Looking inside each loop, it does not make sense to me in the logic of adding two vectors. Can someone help?
The reason I am adopting this logic of instances is that it works well for the commented-out code inside the kernel.
If these thoughts are correct, what would the instances be in case we have multiple blocks inside the grid? In other words, what would the i values and the update rates (+= updaterate) be in some examples?
PS: The kernel code was borrowed from here.
UPDATE:
After Eric's answer I think the execution for N = 15, i.e. the number of elements, goes like this (correct me if I am wrong):
For instance 1 above, i = 0, 3, 6, 9, 12, which computes the corresponding y[i] values.
For instance 2 above, i = 1, 4, 7, 10, 13, which computes the corresponding remaining y[i] values.
For instance 3 above, i = 2, 5, 8, 11, 14, which computes the rest of the y[i] values.
Your blockDim.x is 3 and gridDim.x is 1 according to your launch configuration <<<1,3>>>. So in each thread (you call it an instance) it should be i += 3*1.
UPDATE:
With the for loop you can compute 15 elements using only 3 threads. Generally, you can use a limited number of threads to do an "infinite" amount of work. More work per thread can improve performance by reducing the launch overhead and hiding instruction stalls.
Another advantage is that you can use a fixed number of threads/blocks to do work of various sizes, which requires less tuning.
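To make the multiple-block case from the question concrete, here is a small host-side sketch (my own illustration, using a hypothetical <<<2,3>>> launch) that prints the i values each thread would visit for N = 15. The stride is blockDim.x*gridDim.x = 6, and thread t of block b starts at i0 = b*3 + t:

#include <iostream>
int main() {
    const int N = 15, gridDim_x = 2, blockDim_x = 3;
    for (int b = 0; b < gridDim_x; ++b)
        for (int t = 0; t < blockDim_x; ++t) {
            std::cout << "block " << b << ", thread " << t << ": i =";
            // same index pattern as the Addloop kernel's grid-stride loop
            for (int i = b*blockDim_x + t; i < N; i += blockDim_x*gridDim_x)
                std::cout << " " << i;   // e.g. block 1, thread 0 visits 3, 9
            std::cout << "\n";
        }
    return 0; // together the 6 threads cover indices 0..14 exactly once
}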