Retaining dot product on GPGPU using CUBLAS routine - cuda

I am writing code to compute the dot product of two vectors using the CUBLAS dot product routine, but it returns the value in host memory. I want to use the dot product for further computation on the GPGPU only. How can I make the value reside in GPGPU memory and use it for further computations without making an explicit copy from the CPU to the GPGPU?

You can do this in CUBLAS as long as you use the "V2" API. The newer API includes a function cublasSetPointerMode which you can use to tell the library that all routines which return a scalar value should be passed a device pointer rather than a host pointer. This is discussed in Section 2.4 of the latest CUBLAS documentation. For example:
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>

int main(void)
{
    const int nvals = 10;
    const size_t sz = sizeof(double) * (size_t)nvals;
    double x[nvals], y[nvals];
    double *x_, *y_, *result_;
    double result = 0., resulth = 0.;

    for (int i = 0; i < nvals; i++) {
        x[i] = y[i] = (double)(i) / (double)(nvals);
        resulth += x[i] * y[i];
    }

    cublasHandle_t h;
    cublasCreate(&h);
    cublasSetPointerMode(h, CUBLAS_POINTER_MODE_DEVICE);

    cudaMalloc((void **)(&x_), sz);
    cudaMalloc((void **)(&y_), sz);
    cudaMalloc((void **)(&result_), sizeof(double));
    cudaMemcpy(x_, x, sz, cudaMemcpyHostToDevice);
    cudaMemcpy(y_, y, sz, cudaMemcpyHostToDevice);

    cublasDdot(h, nvals, x_, 1, y_, 1, result_);

    cudaMemcpy(&result, result_, sizeof(double), cudaMemcpyDeviceToHost);
    printf("%f %f\n", resulth, result);

    cublasDestroy(h);
    return 0;
}
Using CUBLAS_POINTER_MODE_DEVICE makes cublasDdot assume that result_ is a device pointer, and no attempt is made to copy the result back to the host. Note that this makes routines like dot asynchronous, so you might need to keep an eye on synchronization between the device and the host.
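For instance, because result_ stays in device memory, a follow-up kernel can consume it directly with no host round trip. A minimal sketch (the scaling kernel below is purely illustrative, not part of the original example):
__global__ void scale_by_dot(double *v, const double *dot, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] *= *dot;   // read the dot product straight from device memory
}
// ... then, after the cublasDdot call in the example above:
scale_by_dot<<<(nvals + 255) / 256, 256>>>(x_, result_, nvals);  // no cudaMemcpy needed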

You can't, exactly, using CUBLAS. As per talonmies' answer, starting with the CUBLAS V2 API (CUDA 4.0) the return value can be a device pointer; refer to his answer. But if you are using the V1 API, it's a single value, so it's pretty trivial to pass it as an argument to a kernel that uses it; you don't need an explicit cudaMemcpy (but there is one implied in order to return a host value).
Starting with the Tesla K20 GPU and CUDA 5, you will be able to call CUBLAS routines from device kernels using CUDA Dynamic Parallelism. This means you would be able to call cublasSdot (for example) from inside a __global__ kernel function, and your result would therefore be returned on the GPU.
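A minimal sketch of that pattern (it assumes a compute capability 3.5 device, compilation with relocatable device code, and linking against the device-side CUBLAS library; the kernel name is illustrative):
__global__ void dot_on_gpu(const float *x, const float *y, int n, float *result)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        cublasHandle_t h;
        cublasCreate(&h);                      // handle created in device code
        cublasSdot(h, n, x, 1, y, 1, result);  // result written directly to device memory
        cublasDestroy(h);
    }
}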

Set pointer mode to device using cublasSetPointerMode().
From cuBLAS docs:
cublasSetPointerMode()
This function sets the pointer mode used by the cuBLAS library. The default is for the values to be passed by reference on the host.
Example:
cublasHandle_t handle;
cublasCreate(&handle);
cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE); // scalar values are now passed by reference on the device
Warning: cublasSetPointerMode also affects pointers used as input parameters (e.g., alpha for cublasSgemm). You will need to store the parameters on the device or set the pointer mode back to host mode.
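For example, with device pointer mode the scalars alpha and beta for cublasSgemm must themselves live in device memory. A sketch (the handle and the matrix arguments d_A, d_B, d_C and sizes m, n, k are assumed to be set up elsewhere):
float h_alpha = 1.0f, h_beta = 0.0f;
float *d_alpha, *d_beta;
cudaMalloc((void**)&d_alpha, sizeof(float));
cudaMalloc((void**)&d_beta, sizeof(float));
cudaMemcpy(d_alpha, &h_alpha, sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_beta, &h_beta, sizeof(float), cudaMemcpyHostToDevice);
// alpha and beta are now passed as device pointers, as CUBLAS_POINTER_MODE_DEVICE requires
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, d_alpha, d_A, m, d_B, k, d_beta, d_C, m);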

Related

Perform arithmetic on a pointer returned using cudaMalloc() in host code

I am reading the CUDA by Example book and I came across this sentence:
However, it is the responsibility of the programmer not to dereference the pointer
returned by cudaMalloc() from code that executes on the host. Host code may
pass this pointer around, perform arithmetic on it, or even cast it to a different
type. But you cannot use it to read or write from memory.
Specifically, how would 'performing arithmetic on a pointer returned by cudaMalloc()' be done?
I tried running the following addition code with 2 additional lines before and after the kernel call, but it had no effect on the output (which is 12 with or without those lines).
#include <iostream>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

__global__
void add(int a, int b, int *c)
{
    *c += a + b;
}

int main()
{
    int *c, d;
    cudaMalloc((void**)&c, sizeof(int));
    *c = 10;
    add<<<1, 1>>>(5, 7, c);
    *c += 5;
    cudaMemcpy(&d, c, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << d << std::endl;
    return 0;
}
I am a beginner and would appreciate your help.
Pointer arithmetic is a concept associated with C and C++; it is not unique or specific to CUDA.
This is not an example of pointer arithmetic:
*c = 10;
nor is this:
*c += 5;
These are both modifications of what the pointer is pointing to, not the pointer itself. Pointer arithmetic involves adjustments to the pointer value itself. (And by the way, the code you have shown is illegal in CUDA: it is not legal to dereference ordinary device pointers in host code. *c is an operation that dereferences the pointer c; it is not the same as pointer arithmetic.)
Suppose I had a device memory allocation of 1024 int quantities:
int *data;
cudaMalloc(&data, 1024 * sizeof(int));
Now suppose I wanted to cause the first invocation of a CUDA kernel to start working on the beginning of the array, and a second invocation of a CUDA kernel to start working at the midpoint of the array, but otherwise perform the same work.
I might do something like this, and the second kernel invocation has an argument that involves pointer arithmetic:
kernel<<<...>>>(data, 512);
kernel<<<...>>>(data+512, 512);
The data+512 argument involves pointer arithmetic. This will pass a pointer to the kernel that points to the midpoint of the data array, rather than the beginning of the array. If I wanted to carry this pointer around in host code, I could do:
int *datahalf = data+512;
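For example, host code can combine this kind of pointer arithmetic with the memory-copy API without ever dereferencing the device pointer. A short sketch (the host buffer is illustrative):
int h_half[512];
// data + 512 is legal pointer arithmetic on a device pointer in host code;
// only cudaMemcpy actually touches the memory it points to
cudaMemcpy(h_half, data + 512, 512 * sizeof(int), cudaMemcpyDeviceToHost);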

How do I pass a shared pointer to a cublas function?

I'm trying to run a cublas function from within a kernel in the following way:
__device__ void doLinear(const float *W, const float *input, unsigned i, float *out, unsigned o) {
    unsigned idx = blockIdx.x*blockDim.x + threadIdx.x;
    const float alpha = 1.0f;
    const float beta = 0.0f;
    if (idx == 0) {
        cublasHandle_t cnpHandle;
        cublasStatus_t status = cublasCreate(&cnpHandle);
        cublasSgemv(cnpHandle, CUBLAS_OP_N, o, i, &alpha, W, 1, input, 1, &beta, out, 1);
    }
    __syncthreads();
}
This function works perfectly well if the input pointer is allocated using cudaMalloc.
My issue is that if the input pointer actually points to some shared memory that contains data generated from within the kernel, I get the error:
CUDA_EXCEPTION_14 - Warp Illegal address.
Is it not possible to pass pointers to shared memory to a cublas function being called from a kernel?
What is the correct way to allocate my memory here? (At the moment I'm just doing another cudaMalloc and using that as my 'shared' memory, but it's making me feel a bit dirty)
You can't pass shared memory to a CUBLAS device API routine, because it violates the CUDA dynamic parallelism memory model on which device-side CUBLAS is based. The best you can do is use malloc() or new to allocate thread-local memory on the runtime heap for the CUBLAS routine to use, or use a portion of a buffer allocated a priori with one of the host-side APIs (as you are presently doing).
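For example, the kernel could stage the shared-memory input into a buffer on the device heap before handing it to CUBLAS. A sketch only (it assumes the device-side CUBLAS library and relocatable device code are in use, and it assumes W is column-major with leading dimension o, which the question does not state):
__device__ void doLinear(const float *W, const float *input_smem, unsigned i, float *out, unsigned o) {
    unsigned idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx == 0) {
        // copy the shared-memory input into a global-memory (device heap) buffer
        float *input_g = (float *)malloc(i * sizeof(float));
        for (unsigned k = 0; k < i; ++k)
            input_g[k] = input_smem[k];
        const float alpha = 1.0f;
        const float beta = 0.0f;
        cublasHandle_t cnpHandle;
        cublasCreate(&cnpHandle);
        cublasSgemv(cnpHandle, CUBLAS_OP_N, o, i, &alpha, W, o, input_g, 1, &beta, out, 1);
        cudaDeviceSynchronize();   // wait for the child CUBLAS kernels before freeing the buffer
        cublasDestroy(cnpHandle);
        free(input_g);
    }
    __syncthreads();
}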

How to get cufftComplex magnitude and phase fast

I have a cufftComplex data block which is the result of a CUDA FFT (R2C). I know the data is saved as a structure with a real number followed by an imaginary number. Now I want to get the amplitude = sqrt(R*R + I*I) and phase = arctan(I/R) of each complex element in a fast way (not a for loop). Is there any good way to do that, or any library that could do that?
Since cufftExecR2C operates on data that is on the GPU, the results are already on the GPU (before you copy them back to the host, if you are doing that).
It should be straightforward to write your own CUDA kernel to accomplish this. The amplitude you're describing is the value returned by cuCabs or cuCabsf in the cuComplex.h header file. By looking at the functions in that header file, you should be able to figure out how to write your own that computes the phase angle. You'll note that cufftComplex is just a typedef of cuComplex.
Let's say your cufftExecR2C call left some results of type cufftComplex in the array data of size sz. Your kernel might look like this:
#include <math.h>
#include <cuComplex.h>
#include <cufft.h>

#define nTPB 256    // threads per block for kernel
#define sz 100000   // or whatever your output data size is from the FFT
...
__host__ __device__ float carg(const cuComplex& z) { return atan2(cuCimagf(z), cuCrealf(z)); } // polar angle

__global__ void magphase(cufftComplex *data, float *mag, float *phase, int dsz){
    int idx = threadIdx.x + blockDim.x*blockIdx.x;
    if (idx < dsz){
        mag[idx]   = cuCabsf(data[idx]);
        phase[idx] = carg(data[idx]);
    }
}
...
int main(){
    ...
    /* Use the CUFFT plan to transform the signal in place. */
    /* Your code might be something like this already: */
    if (cufftExecR2C(plan, (cufftReal*)data, data) != CUFFT_SUCCESS){
        fprintf(stderr, "CUFFT error: ExecR2C Forward failed");
        return 1;
    }
    /* then you might add: */
    float *h_mag, *h_phase, *d_mag, *d_phase;
    // malloc your h_ arrays using host malloc first, then...
    cudaMalloc((void **)&d_mag, sz*sizeof(float));
    cudaMalloc((void **)&d_phase, sz*sizeof(float));
    magphase<<<(sz+nTPB-1)/nTPB, nTPB>>>(data, d_mag, d_phase, sz);
    cudaMemcpy(h_mag, d_mag, sz*sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_phase, d_phase, sz*sizeof(float), cudaMemcpyDeviceToHost);
You can also do this using thrust by creating functors for the magnitude and phase functions, and passing these functors along with data, mag and phase to thrust::transform.
I'm sure you can probably do it with CUBLAS as well, using a combination of vector add and vector multiply operations.
This question/answer may be of interest as well. I lifted my phase function carg from there.
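A sketch of the thrust approach mentioned above (the functor names are illustrative; it assumes data, d_mag and d_phase are the device pointers from the kernel example):
#include <thrust/device_ptr.h>
#include <thrust/transform.h>

struct magnitude_op {
    __host__ __device__ float operator()(const cufftComplex &z) const { return cuCabsf(z); }
};
struct phase_op {
    __host__ __device__ float operator()(const cufftComplex &z) const { return atan2f(cuCimagf(z), cuCrealf(z)); }
};

// wrap the raw device pointers and transform entirely on the GPU
thrust::device_ptr<cufftComplex> t_data(data);
thrust::device_ptr<float> t_mag(d_mag), t_phase(d_phase);
thrust::transform(t_data, t_data + sz, t_mag, magnitude_op());
thrust::transform(t_data, t_data + sz, t_phase, phase_op());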

Device memory flush cuda

I'm running a C program where I call a CUDA host function twice. I want to clean up the device memory between these 2 calls. Is there a way I can flush GPU device memory? I'm on a Tesla M2050 with compute capability 2.0.
If you only want to zero the memory, then cudaMemset is probably the simplest way to do this. For example:
const int n = 10000000;
const int sz = sizeof(float) * n;
float *devicemem;
cudaMalloc((void **)&devicemem, sz);
kernel<<<...>>>(devicemem,....);
cudaMemset(devicemem, 0, sz); // zeros all the bytes in devicemem
kernel<<<...>>>(devicemem,....);
Note that the value cudaMemset takes is a byte value, and all bytes in the specified range are set to that value, just like the standard C memset. If you have a specific word value, then you will need to write your own memset kernel to assign the values.
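Such a kernel is only a few lines. A sketch (the fill value and launch configuration are illustrative):
__global__ void fill_kernel(float *p, float value, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i] = value;
}
// set every float in devicemem to 1.0f between the two kernel calls
fill_kernel<<<(n + 255) / 256, 256>>>(devicemem, 1.0f, n);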
If you are using Thrust vectors, then you can call thrust::fill() on the vector you want to reset with the reset value you want.
thrust::device_vector< FooType > fooVec( FooSize );
kernelCall1<<< x, y >>>( /* Pass fooVec here */ );
// Reset memory of fooVec
thrust::fill( fooVec.begin(), fooVec.end(), FooDefaultValue );
kernelCall2<<< x, y >>>( /* Pass fooVec here */ );

CUDA: Allocating 2D array on GPU

I have already read the following thread, but I couldn't get my code to work.
I am trying to allocate a 2D array on GPU, fill it with values, and copy it back to the CPU. My code is as follows:
__global__ void Kernel(char **result, int N)
{
    // do something like result[0][0]='a';
}

int N = 20;
int Count = 5;
char **result_h = (char**)malloc(sizeof(char*) * Count);
char **result_d;
cudaMalloc(&result_d, sizeof(char*) * Count);
for (int i = 0; i < Count; i++)
{
    result_h[i] = (char*)malloc(sizeof(char) * N);
    cudaMalloc(&result_d[i], sizeof(char) * N); // get exception here
}
// call kernel
// copy values from result_d to result_h
printf("%c", result_h[0][0]); // should print 'a'
How can I achieve this?
You can't manipulate device pointers in host code, which is why the cudaMalloc call inside the loop fails. You should probably just allocate a single contiguous block of memory and then treat that as a flattened 2D array.
For doing the simplest 2D operations on a GPU, I'd recommend you just treat it as a 1D array. cudaMalloc a block of size w*h*sizeof(char). You can access the element (i,j) through index j*w+i.
Alternatively, you could use cudaMallocArray to get a 2D array. This has a better sense of locality than linear mapped 2D memory. You can easily bind this to a texture, for example.
Now in terms of your example, the reason why it doesn't work is that cudaMalloc manipulates a host pointer to point at a block of device memory. Your example allocated the pointer structure for result_d on the device. If you just change the cudaMalloc call for result_d to a regular malloc, it should work as you originally intended.
That said, perhaps one of the two options I outlined above might work better from an ease of code maintenance perspective.
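A minimal sketch of the flattened approach described above, reusing the names from the question (Count rows of N chars each; the kernel body is only illustrative):
__global__ void Kernel(char *result, int N)
{
    // element (row, col) of the flattened 2D array lives at result[row * N + col]
    result[0 * N + 0] = 'a';
}

int N = 20;
int Count = 5;
char *result_h = (char*)malloc(Count * N * sizeof(char));
char *result_d;
cudaMalloc((void**)&result_d, Count * N * sizeof(char));
Kernel<<<1, 1>>>(result_d, N);
cudaMemcpy(result_h, result_d, Count * N * sizeof(char), cudaMemcpyDeviceToHost);
printf("%c\n", result_h[0]); // prints 'a'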
When allocating in that way you are allocating addresses that are valid in CPU memory.
The address values are transferred as plain numbers without a problem, but once in device memory a host char* address has no meaning.
Create an array of N * max text length chars, and another array of length N that tells how long each word is.
This is a bit more advanced, but if you are processing a set of known text (passwords, for example), I would suggest grouping it by text length and creating a specialized kernel for each length:
template<int text_width>
__global__ void Kernel(char *result, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        // each thread fills one fixed-width entry of the flattened array
        for (int i = 0; i < text_width; i++)
            result[idx * text_width + i] = 'a';
    }
}
and in the kernel invocation code you specify:
switch (text_length) {
case 16:
    Kernel<16><<<grid, block>>>(result_d, N);
    break;
// ... other supported lengths
}
The following code sample allocates a width×height 2D array of floating-point values and shows how to loop over the array elements in device code[1]:
// host code
float* devPtr;
size_t pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
myKernel<<<100, 192>>>(devPtr, pitch, width, height);
// device code
__global__ void myKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}
The following code sample allocates a width×height CUDA array of one 32-bit
floating-point component[1]
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
cudaArray* cuArray;
cudaMallocArray(&cuArray, &channelDesc, width, height);
The following code sample copies the 2D array to the CUDA array allocated in the
previous code samples[1]:
cudaMemcpy2DToArray(cuArray, 0, 0, devPtr, pitch, width * sizeof(float), height,
cudaMemcpyDeviceToDevice);
The following code sample copies some host memory array to device memory[1]:
float data[256];
int size = sizeof(data);
float* devPtr;
cudaMalloc((void**)&devPtr, size);
cudaMemcpy(devPtr, data, size, cudaMemcpyHostToDevice);
You can study these examples and adapt them to your purpose.
[1] NVIDIA CUDA Compute Unified Device Architecture Programming Guide