Using thrust::max_element in a CUDA C project

Using thrust::max_element in a CUDA C project - cuda

In a CUDA C project, I would like to try and use the Thrust library in order to find the maximum element inside an array of floats. It seems like the Thrust function thrust::max_element() is what I need. The array on which I want to use this function is the result of a cuda kernel (which seems to work fine) and so it is already present in device memory when calling thrust::max_element().
I am not very familiar with the Thrust library but after looking at the documentation for thrust::max_element() and reading the answers to similar questions on this site, I thought I had grasped the working principles of this process. Unfortunately I get wrong results and it seems that I am not using the library functions correctly. Can somebody please tell me what is wrong in my code?
float* deviceArray;
float* max;
int length = 1025;
*max = 0.0f;
size = (int) length*sizeof(float);
cudaMalloc(&deviceArray, size);
cudaMemset(deviceArray, 0.0f, size);
// here I launch a cuda kernel which modifies deviceArray
thrust::device_ptr<float> d_ptr = thrust::device_pointer_cast(deviceArray);
*max = *(thrust::max_element(d_ptr, d_ptr + length));
I use the following headers:
#include <thrust/extrema.h>
#include <thrust/device_ptr.h>
I keep getting zero values for *max even though I am sure that deviceArray contains non-zero values after running the kernel.
I am using nvcc as a compiler (CUDA 7.0) and I am running the code on a device with compute capability 3.5.
Any help would be much appreciated. Thanks.

This is not proper C code:
float* max;
int length = 1025;
*max = 0.0f;
You're not allowed to store data using a pointer (max) until you properly provide an allocation for that pointer (and set the pointer equal to the address of that allocation).
Apart from that, the rest of your code seems to work for me:
$ cat t990.cu
#include <thrust/extrema.h>
#include <thrust/device_ptr.h>
#include <iostream>
int main(){
float* deviceArray;
float max, test;
int length = 1025;
max = 0.0f;
test = 2.5f;
int size = (int) length*sizeof(float);
cudaMalloc(&deviceArray, size);
cudaMemset(deviceArray, 0.0f, size);
cudaMemcpy(deviceArray, &test, sizeof(float),cudaMemcpyHostToDevice);
thrust::device_ptr<float> d_ptr = thrust::device_pointer_cast(deviceArray);
max = *(thrust::max_element(d_ptr, d_ptr + length));
std::cout << max << std::endl;
}
$ nvcc -o t990 t990.cu
$ ./t990
2.5
$

Related

CUDA/C - Using malloc in kernel functions gives strange results

I'm new to CUDA/C and new to stack overflow. This is my first question.
I'm trying to allocate memory dynamically in a kernel function, but the results are unexpected.
I read using malloc() in a kernel can lower performance a lot, but I need it anyway so I first tried with a simple int ** array just to test the possibility, then I'll actually need to allocate more complex structs.
In my main I used cudaMalloc() to allocate the space for the array of int *, and then I used malloc() for every thread in the kernel function to allocate the array for every index of the outer array. I then used another thread to check the result, but it doesn't always work.
Here's main code:
#define N_CELLE 1024*2
#define L_CELLE 512
extern "C" {
int main(int argc, char **argv) {
int *result = (int *)malloc(sizeof(int));
int *d_result;
int size_numbers = N_CELLE * sizeof(int *);
int **d_numbers;
cudaMalloc((void **)&d_numbers, size_numbers);
cudaMalloc((void **)&d_result, sizeof(int *));
kernel_one<<<2, 1024>>>(d_numbers);
cudaDeviceSynchronize();
kernel_two<<<1, 1>>>(d_numbers, d_result);
cudaMemcpy(result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
printf("%d\n", *result);
cudaFree(d_numbers);
cudaFree(d_result);
free(result);
}
}
I used extern "C"because I could't compile while importing my header, which is not used in this example code. I pasted it since I don't know if this may be relevant or not.
This is kernel_one code:
__global__ void kernel_one(int **d_numbers) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
d_numbers[i] = (int *)malloc(L_CELLE*sizeof(int));
for(int j=0; j<L_CELLE;j++)
d_numbers[i][j] = 1;
}
And this is kernel_two code:
__global__ void kernel_two(int **d_numbers, int *d_result) {
int temp = 0;
for(int i=0; i<N_CELLE; i++) {
for(int j=0; j<L_CELLE;j++)
temp += d_numbers[i][j];
}
*d_result = temp;
}
Everything works fine (aka the count is correct) until I use less than 1024*2*512 total blocks in device memory. For example, if I #define N_CELLE 1024*4 the program starts giving "random" results, such as negative numbers.
Any idea of what the problem could be?
Thanks anyone!

In-kernel memory allocation draws memory from a statically allocated runtime heap. At larger sizes, you are exceeding the size of that heap and then your two kernels are attempting to read and write from uninitialised memory. This produces a runtime error on the device and renders the results invalid. You would already know this if you either added correct API error checking on the host side, or ran your code with the cuda-memcheck utility.
The solution is to ensure that the heap size is set to something appropriate before trying to run a kernel. Adding something like this:
size_t heapsize = sizeof(int) * size_t(N_CELLE) * size_t(2*L_CELLE);
cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize);
to your host code before any other API calls, should solve the problem.

I don't know anything about CUDA but these are severe bugs:
You cannot convert from int** to void**. They are not compatible types. Casting doesn't solve the problem, but hides it.
&d_numbers gives the address of a pointer to pointer which is wrong. It is of type int***.
Both of the above bugs result in undefined behavior. If your program somehow seems to works in some condition, that's just by pure (bad) luck only.

External call of a class method in a kernel

I have a class FPlan that has a number of methods such as permute and packing.
__host__ __device__ void Perturb_action(FPlan *dfp){
dfp->perturb();
dfp->packing();
}
__global__ void Vector_Perturb(FPlan **dfp, int n){
int i=threadIx.x;
if(i<n) Perturb_action(dfp[i]);
}
in main:
FPlan **fp_vec;
fp_vec=(FPlan**)malloc(VEC_SIZE*sizeof(FPlan*));
//initialize the vec
for(int i=0; i<VEC_SIZE;i++)
fp_vec[i]=&fp;
//fp of type FPlan that is initialized
int v_sz=sizeof(fp_vec);
double test=fp_vec[0]->getCost();
printf("the cost before perturb %f\n"test);
FPlan **value;
cudaMalloc(&value,v_sz);
cudaMemcpy(value,&fp_vec,v_sz,cudaMemcpyHostToDevice);
//call kernel
dim3 threadsPerBlock(VEC_SIZE);
dim3 numBlocks(1);
Vector_Perturb<<<numBlocks,threadsPerBlock>>> (value,VEC_SIZE);
cudaMemcpy(fp_vec,value,v_sz,cudaMemcpyDeviceToHost);
test=fp_vec[0]->getCost();
printf("the cost after perturb %f\n"test);
test=fp_vec[1]->getCost();
printf("the cost after perturb %f\n"test);
I am getting before permute for fp_vec[0] printf the cost 0.8.
After permute for fp_vec[0] the value inf and for fp_vec[1] the value 0.8.
The expected output after the permutation should be something like fp_vec[0] = 0.7 and fp_vec[1] = 0.9. I want to apply these permutations to an array of type FPlan.
What am I missing? Is calling an external function supported in CUDA?

This seems to be a common problem these days:
Consider the following code:
#include <stdio.h>
#include <stdlib.h>
int main() {
int* arr = (int*) malloc(100);
printf("sizeof(arr) = %i", sizeof(arr));
return 0;
}
what is the expected ouptut? 100? no its 4 (at least on a 32 bit machine). sizeof() returns the size of the type of a variable not the allocated size of an array.
int v_sz=sizeof(fp_vec);
double test=fp_vec[0]->getCost();
printf("the cost before perturb %f\n"test);
FPlan **value;
cudaMalloc(&value,v_sz);
cudaMemcpy(value,&fp_vec,v_sz,cudaMemcpyHostToDevice);
You are allocating 4 (or 8) bytes on the device and copy 4 (or 8) bytes. The result is undefined (and maybe every time garbage).
Besides that, you shold do proper error checking of your CUDA calls.
Have a look: What is the canonical way to check for errors using the CUDA runtime API?

Particular Allocating device memory for _global_ function in cuda

want to do this programm on cuda.
1.in "main.cpp"
struct Center{
double * Data;
int dimension;
};
typedef struct Center Center;
//I allow a pointer on N Center elements by the CUDAMALLOC like follow
....
#include "kernel.cu"
....
center *V_dev;
int M =100, n=4;
cudaStatus = cudaMalloc((void**)&V_dev,M*sizeof(Center));
Init<<<1,M>>>(V_dev, M, N); //I always know the dimension of N before calling
My "kernel.cu" file is something like this
#include "cuda_runtime.h"
#include"device_launch_parameters.h"
... //other include headers to allow my .cu file to know the Center type definition
__global__ void Init(Center *V, int N, int dimension){
V[threadIdx.x].dimension = dimension;
V[threadIdx.x].Data = (double*)malloc(dimension*sizeof(double));
for(int i=0; i<dimension; i++)
V[threadIdx.x].Data[i] = 0; //For the value, it can be any kind of operation returning a float that i want to be able put here
}
I'm on visual studio 2008 and CUDA 5.0. When I Build my project, I've got these errors:
error: calling a _host_ function("malloc") from a _global_ function("Init") is not allowed.
I want to know please how can I perform this? (I know that 'malloc' and other cpu memory allocation are not allowed for device memory.

malloc is allowed in device code but you have to be compiling for a cc2.0 or greater target GPU.
Adjust your VS project settings to remove any GPU device settings like compute_10,sm_10 and replace it with compute_20,sm_20 or higher to match your GPU. (And, to run that code, your GPU needs to be cc2.0 or higher.)

You need the compiler parameter -arch=sm_20 and a GPU which supports it.

how to get cufftcomplex magnitude and phase fast

i have a cufftcomplex data block which is the result from cuda fft(R2C). i know the data is save as a structure with a real number followed by image number. now i want to get the amplitude=sqrt(R*R+I*I), and phase=arctan(I/R) of each complex element by a fast way(not for loop). Is there any good way to do that? or any library could do that?

Since cufftExecR2C operates on data that is on the GPU, the results are already on the GPU, (before you copy them back to the host, if you are doing that.)
It should be straightforward to write your own cuda kernel to accomplish this. The amplitude you're describing is the value returned by cuCabs or cuCabsf in cuComplex.h header file. By looking at the functions in that header file, you should be able to figure out how to write your own that computes the phase angle. You'll note that cufftComplex is just a typedef of cuComplex.
let's say your cufftExecR2C call left some results of type cufftComplex in array data of size sz. Your kernel might look like this:
#include <math.h>
#include <cuComplex.h>
#include <cufft.h>
#define nTPB 256 // threads per block for kernel
#define sz 100000 // or whatever your output data size is from the FFT
...
__host__ __device__ float carg(const cuComplex& z) {return atan2(cuCimagf(z), cuCrealf(z));} // polar angle
__global__ void magphase(cufftComplex *data, float *mag, float *phase, int dsz){
int idx = threadIdx.x + blockDim.x*blockIdx.x;
if (idx < dsz){
mag[idx] = cuCabsf(data[idx]);
phase[idx] = carg(data[idx]);
}
}
...
int main(){
...
/* Use the CUFFT plan to transform the signal in place. */
/* Your code might be something like this already: */
if (cufftExecR2C(plan, (cufftReal*)data, data) != CUFFT_SUCCESS){
fprintf(stderr, "CUFFT error: ExecR2C Forward failed");
return;
}
/* then you might add: */
float *h_mag, *h_phase, *d_mag, *d_phase;
// malloc your h_ arrays using host malloc first, then...
cudaMalloc((void **)&d_mag, sz*sizeof(float));
cudaMalloc((void **)&d_phase, sz*sizeof(float));
magphase<<<(sz+nTPB-1)/nTPB, nTPB>>>(data, d_mag, d_phase, sz);
cudaMemcpy(h_mag, d_mag, sz*sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(h_phase, d_phase, sz*sizeof(float), cudaMemcpyDeviceToHost);
You can also do this using thrust by creating functors for the magnitude and phase functions, and passing these functors along with data, mag and phase to thrust::transform.
I'm sure you can probably do it with CUBLAS as well, using a combination of vector add and vector multiply operations.
This question/answer may be of interest as well. I lifted my phase function carg from there.

Retaining dot product on GPGPU using CUBLAS routine

I am writing a code to compute dot product of two vectors using CUBLAS routine of dot product but it returns the value in host memory. I want to use the dot product for further computation on GPGPU only. How can I make the value reside on GPGPU only and use it for further computations without making an explicit copy from CPU to GPGPU?

You can do this in CUBLAS as long as you use the "V2" API. The newer API includes a function cublasSetPointerMode which you can use to set the API to assume that all routines which return a scalar value will be passed a device pointer rather than a host pointer. This is discussed in Section 2.4 of the latest CUBLAS documentation. For example:
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>
int main(void)
{
const int nvals = 10;
const size_t sz = sizeof(double) * (size_t)nvals;
double x[nvals], y[nvals];
double *x_, *y_, *result_;
double result=0., resulth=0.;
for(int i=0; i<nvals; i++) {
x[i] = y[i] = (double)(i)/(double)(nvals);
resulth += x[i] * y[i];
}
cublasHandle_t h;
cublasCreate(&h);
cublasSetPointerMode(h, CUBLAS_POINTER_MODE_DEVICE);
cudaMalloc( (void **)(&x_), sz);
cudaMalloc( (void **)(&y_), sz);
cudaMalloc( (void **)(&result_), sizeof(double) );
cudaMemcpy(x_, x, sz, cudaMemcpyHostToDevice);
cudaMemcpy(y_, y, sz, cudaMemcpyHostToDevice);
cublasDdot(h, nvals, x_, 1, y_, 1, result_);
cudaMemcpy(&result, result_, sizeof(double), cudaMemcpyDeviceToHost);
printf("%f %f\n", resulth, result);
cublasDestroy(h);
return 0;
}
Using CUBLAS_POINTER_MODE_DEVICE makes cublasDdot assume that result_ is a device pointer, and there is no attempt made to copy the result back to the host. Note that this makes routines like dot asynchronous, so you might need to keep on eye on synchronization between device and host.

You can't, exactly, using CUBLAS. As per talonmies' answer, starting with the CUBLAS V2 api (CUDA 4.0) the return value can be a device pointer. Refer to his answer. But if you are using the V1 API it's a single value, so it's pretty trivial to pass it as an argument to a kernel that uses it—you don't need an explicit cudaMemcpy (but there is one implied in order to return a host value).
Starting with the Tesla K20 GPU and CUDA 5, you will be able to call CUBLAS routines from device kernels using CUDA Dynamic Parallelism. This means you would be able to call cublasSdot (for example) from inside a __global__ kernel function, and your result would therefore be returned on the GPU.

Set pointer mode to device using cublasSetPointerMode().
From cuBLAS docs:
cublasSetPointerMode()
This function sets the pointer mode used by the cuBLAS library. The default is for the values to be passed by reference on the host.
Example:
cublasHandle_t handle;
cublasCreate(&handle);
cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE); // Make the values be passed by reference on the device.
Warning: cublasSetPointerMode also affects pointers used as input parameters (e.g., alpha for cublasSgemm). You will need to store the parameters on the device or set the pointer mode back to host mode.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Using thrust::max_element in a CUDA C project - cuda

Related

CUDA/C - Using malloc in kernel functions gives strange results

External call of a class method in a kernel

Particular Allocating device memory for _global_ function in cuda

how to get cufftcomplex magnitude and phase fast

Retaining dot product on GPGPU using CUBLAS routine

Categories

Resources