Device memory flush - CUDA

I'm running a C program where I call a CUDA host function twice. I want to clean up the device memory between these two calls. Is there a way I can flush GPU device memory? I'm on a Tesla M2050 with compute capability 2.0.

If you only want to zero the memory, then cudaMemset is probably the simplest way to do this. For example:
const int n = 10000000;
const int sz = sizeof(float) * n;
float *devicemem;
cudaMalloc((void **)&devicemem, sz);
kernel<<<...>>>(devicemem,....);
cudaMemset(devicemem, 0, sz); // zeros all the bytes in devicemem
kernel<<<...>>>(devicemem,....);
Note that the value cudaMemset takes is a byte value, and all bytes in the specified range are set to that value, just like the standard C memset. If you need to set the memory to a specific word value (for example, a non-zero float), you will need to write your own fill kernel to assign the values.
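For illustration, a minimal fill kernel could look like the sketch below; the kernel name and launch configuration are placeholders, not part of the original answer.
__global__ void fill_kernel(float *data, float value, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = value;   // every element gets the same word value
    }
}
// e.g. set every float in devicemem to 1.0f between the two kernel calls
fill_kernel<<<(n + 255) / 256, 256>>>(devicemem, 1.0f, n);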

If you are using Thrust vectors, then you can call thrust::fill() on the vector you want to reset with the reset value you want.
thrust::device_vector< FooType > fooVec( FooSize );
kernelCall1<<< x, y >>>( /* Pass fooVec here */ );
// Reset memory of fooVec
thrust::fill( fooVec.begin(), fooVec.end(), FooDefaultValue );
kernelCall2<<< x, y >>>( /* Pass fooVec here */ );
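As a side note (not part of the original answer), a device_vector cannot be passed to a plain kernel directly; the usual approach is to pass the underlying raw pointer, assuming kernelCall1 accepts a pointer and a size:
FooType *rawPtr = thrust::raw_pointer_cast( fooVec.data() );
kernelCall1<<< x, y >>>( rawPtr, fooVec.size() );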

Related

CUDA streams performance

I am currently learning CUDA streams through the computation of a dot product between two vectors. The ingredients are a kernel function that takes in vectors x and y and returns a vector result of size equal to the number of blocks, where each block contributes its own reduced sum.
I also have a host function dot_gpu that calls the kernel and reduces the vector result to the final dot product value.
The synchronous version does just this:
// copy to device
copy_to_device<double>(x_h, x_d, n);
copy_to_device<double>(y_h, y_d, n);
// kernel
double result = dot_gpu(x_d, y_d, n, blockNum, blockSize);
while the async one goes like:
double result[numChunks];
for (int i = 0; i < numChunks; i++) {
    int offset = i * chunkSize;
    // copy to device
    copy_to_device_async<double>(x_h+offset, x_d+offset, chunkSize, stream[i]);
    copy_to_device_async<double>(y_h+offset, y_d+offset, chunkSize, stream[i]);
    // kernel
    result[i] = dot_gpu(x_d+offset, y_d+offset, chunkSize, blockNum, blockSize, stream[i]);
}
for (int i = 0; i < numChunks; i++) {
    finalResult += result[i];
    cudaStreamDestroy(stream[i]);
}
I am getting worse performance when using streams and have been trying to investigate the reasons. I tried to pipeline the downloads, kernel calls and uploads, but without success.
// accumulate the result of each block into a single value
double dot_gpu(const double *x, const double* y, int n, int blockNum, int blockSize, cudaStream_t stream=NULL)
{
    double* result = malloc_device<double>(blockNum);
    dot_gpu_kernel<<<blockNum, blockSize, blockSize * sizeof(double), stream>>>(x, y, result, n);
#if ASYNC
    double* r = malloc_host_pinned<double>(blockNum);
    copy_to_host_async<double>(result, r, blockNum, stream);
    CudaEvent copyResult;
    copyResult.record(stream);
    copyResult.wait();
#else
    double* r = malloc_host<double>(blockNum);
    copy_to_host<double>(result, r, blockNum);
#endif
    double dotProduct = 0.0;
    for (int i = 0; i < blockNum; i++) {
        dotProduct += r[i];
    }
    cudaFree(result);
#if ASYNC
    cudaFreeHost(r);
#else
    free(r);
#endif
    return dotProduct;
}
My guess is that the problem is inside the dot_gpu() function, which does more than just call the kernel. Tell me if I understand the following stream execution correctly:
foreach stream {
    cudaMemcpyAsync( device[stream], host[stream], ... stream );
    LaunchKernel<<<...stream>>>( ... );
    cudaMemcpyAsync( host[stream], device[stream], ... stream );
}
The host executes all three instructions without being blocked, since cudaMemcpyAsync and the kernel launch return immediately (on the GPU, however, they will execute sequentially as they are assigned to the same stream). So the host goes on to the next stream (even if who knows what stage stream1 is at, but who cares: it's doing its job on the GPU, right?) and executes the three instructions again without being blocked, and so on and so forth. However, my code blocks the host before it can process the next stream, somewhere inside the dot_gpu() function. Is it because I am allocating and freeing stuff, as well as reducing the array returned by the kernel to a single value?
Assuming your objectified CUDA interface does what the function and method names suggest, there are three reasons why work from subsequent calls to dot_gpu() might not overlap:
1. Your code explicitly blocks by recording an event and waiting for it.
2. If it weren't blocking for 1. already, your code would block on the pinned host-side allocation and deallocation, as you suspected.
3. If your code weren't blocking for 2. already, work from subsequent calls to dot_gpu() might still not overlap, depending on compute capability. Devices of compute capability 3.0 or lower do not reorder operations even if they are enqueued to different streams.
Even for devices of compute capability 3.5 and higher, the number of streams whose operations can be reordered is limited by the CUDA_DEVICE_MAX_CONNECTIONS environment variable, which defaults to 8 and can be set to values as large as 32.
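To make the three points concrete, here is one possible restructuring, a sketch only, using the same helper functions as the question: pinned buffers are allocated once outside the loop, no event is waited on per call, and the final reduction is deferred until all streams have been launched. partial_d and partial_h are hypothetical preallocated device and pinned host buffers of numChunks * blockNum doubles each.
// launch work for every chunk first, without any host-side blocking...
for (int i = 0; i < numChunks; i++) {
    int offset = i * chunkSize;
    copy_to_device_async<double>(x_h+offset, x_d+offset, chunkSize, stream[i]);
    copy_to_device_async<double>(y_h+offset, y_d+offset, chunkSize, stream[i]);
    // kernel writes its per-block partial sums into a preallocated slice of partial_d
    dot_gpu_kernel<<<blockNum, blockSize, blockSize * sizeof(double), stream[i]>>>(
        x_d+offset, y_d+offset, partial_d + i*blockNum, chunkSize);
    // copy partial sums into the pinned buffer allocated once, before the loop
    copy_to_host_async<double>(partial_d + i*blockNum, partial_h + i*blockNum, blockNum, stream[i]);
}
// ...then synchronize once and reduce on the host
cudaDeviceSynchronize();
double finalResult = 0.0;
for (int i = 0; i < numChunks * blockNum; i++)
    finalResult += partial_h[i];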

cudaMemcpy double array to float array

Is there any way to copy a double array from the host to a float array on the device? I am not concerned with loss of precision.
I have the following case:
double* host = new double[N];
... // Perform some calculations on host array
float* device;
cudaMalloc( (void**) &device, N * sizeof(float) );
cudaMemcpy( device, host, N * sizeof(float), cudaMemcpyHostToDevice);
When trying the code written above I was getting an "invalid argument" error.
Is there any solution to this besides changing host array to float?
float is 4 bytes and double is 8 bytes. You can't simply memcpy between incompatible types; you must first convert the doubles to floats.
Something like this (I took the liberty of replacing your raw arrays with standard library constructs):
std::vector<double> host_double(N);
// Perform some calculations on host array
// Make a copy of the host vector, converting all doubles to floats
std::vector<float> host_float(host_double.begin(), host_double.end());
// The rest is almost unchanged
float* device;
cudaMalloc((void**)&device, N * sizeof(float));
cudaMemcpy(device, host_float.data(), N * sizeof(float), cudaMemcpyHostToDevice);
However, are you sure you benefit from using double at all? The overall precision of your computation chain will be limited by float anyway.
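If you would rather avoid the host-side temporary, another option (a sketch, not part of the original answer) is to copy the double array to the device as-is and convert it there with a small kernel; device_double is an assumed extra buffer of N doubles allocated with cudaMalloc.
__global__ void double_to_float(const double* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = (float)in[i];   // explicit narrowing conversion on the device
}
// usage sketch
cudaMemcpy(device_double, host, N * sizeof(double), cudaMemcpyHostToDevice);
double_to_float<<<(N + 255) / 256, 256>>>(device_double, device, N);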

Bound CUDA texture reads zero

I am trying to read values from a texture and write them back to global memory.
I am sure the writing part works, because I can put constant values in the kernel and see them in the output:
__global__ void bartureKernel(float* g_odata, int width, int height)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        unsigned int idx = (y*width + x);
        g_odata[idx] = tex2D(texGrad, (float)x, (float)y).x;
    }
}
The texture I want to use is a 2D float texture with two channels, so I defined it as:
texture<float2, 2, cudaReadModeElementType> texGrad;
And the code which calls the kernel initializes the texture with some constant non-zero values:
float* d_data_grad = NULL;
cudaMalloc((void**) &d_data_grad, gradientSize * sizeof(float));
CHECK_CUDA_ERROR;
texGrad.addressMode[0] = cudaAddressModeClamp;
texGrad.addressMode[1] = cudaAddressModeClamp;
texGrad.filterMode = cudaFilterModeLinear;
texGrad.normalized = false;
cudaMemset(d_data_grad, 50, gradientSize * sizeof(float));
CHECK_CUDA_ERROR;
cudaBindTexture(NULL, texGrad, d_data_grad, cudaCreateChannelDesc<float2>(), gradientSize * sizeof(float));
float* d_data_barture = NULL;
cudaMalloc((void**) &d_data_barture, outputSize * sizeof(float));
CHECK_CUDA_ERROR;
dim3 dimBlock(8, 8, 1);
dim3 dimGrid( ((width-1) / dimBlock.x)+1, ((height-1) / dimBlock.y)+1, 1);
bartureKernel<<< dimGrid, dimBlock, 0 >>>( d_data_barture, width, height);
I know, setting the texture bytes to all "50" doesn't make much sense in the context of floats, but it should at least give me some non-zero values to read.
I can only read zeros though...
You are using cudaBindTexture to bind your texture to the memory allocated by cudaMalloc, but in the kernel you are using the tex2D function to read values from the texture. That is why it is reading zeros.
If you bind a texture to linear memory using cudaBindTexture, it is read using tex1Dfetch inside the kernel.
tex2D is used to read only from those textures which are bound to pitch linear memory (which is allocated by cudaMallocPitch) using the function cudaBindTexture2D, or from those textures which are bound to a cudaArray using the function cudaBindTextureToArray.
Here is the basic table; the rest you can read from the programming guide:
Memory Type          | Allocated Using     | Bound Using             | Read In The Kernel By
Linear memory        | cudaMalloc          | cudaBindTexture         | tex1Dfetch
Pitch linear memory  | cudaMallocPitch     | cudaBindTexture2D       | tex2D
cudaArray            | cudaMallocArray     | cudaBindTextureToArray  | tex1D or tex2D
3D cudaArray         | cudaMalloc3DArray   | cudaBindTextureToArray  | tex3D
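As an illustration (a sketch based on the table above, not the original answer's code), the existing cudaMalloc'd buffer could be read with tex1Dfetch if the texture reference is declared over linear memory and bound with cudaBindTexture as in the question:
texture<float2, 1, cudaReadModeElementType> texGradLinear;  // 1D texture over linear memory

__global__ void readLinearKernel(float* g_odata, int width, int height)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        unsigned int idx = y*width + x;
        g_odata[idx] = tex1Dfetch(texGradLinear, idx).x;  // integer index, no filtering
    }
}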
To add on, access using tex1Dfetch is based on integer indexing.
The others, however, are indexed with floating-point coordinates, and you have to add 0.5 to the coordinate to read the exact texel you want.
I'm curious why you allocate a float buffer and bind it to a float2 texture? It may give ambiguous results.
float2 is not a 2D float texture; it is a two-component vector type that can, for example, be used to represent complex numbers.
typedef struct {float x; float y;} float2;
I think this tutorial will help you understand how to use texture memory in CUDA:
http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-masses-part/218100902
The kernel you have shown does not benefit much from using a texture. However, if utilized properly by exploiting locality, texture memory can improve performance by quite a lot. It is also useful for interpolation.

Retaining dot product on GPGPU using CUBLAS routine

I am writing code to compute the dot product of two vectors using the CUBLAS dot-product routine, but it returns the value in host memory. I want to use the dot product for further computation on the GPU only. How can I make the value reside in GPU memory and use it for further computations without making an explicit copy from CPU to GPU?
You can do this in CUBLAS as long as you use the "V2" API. The newer API includes a function cublasSetPointerMode which you can use to set the API to assume that all routines which return a scalar value will be passed a device pointer rather than a host pointer. This is discussed in Section 2.4 of the latest CUBLAS documentation. For example:
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>

int main(void)
{
    const int nvals = 10;
    const size_t sz = sizeof(double) * (size_t)nvals;
    double x[nvals], y[nvals];
    double *x_, *y_, *result_;
    double result=0., resulth=0.;

    for(int i=0; i<nvals; i++) {
        x[i] = y[i] = (double)(i)/(double)(nvals);
        resulth += x[i] * y[i];
    }

    cublasHandle_t h;
    cublasCreate(&h);
    cublasSetPointerMode(h, CUBLAS_POINTER_MODE_DEVICE);

    cudaMalloc( (void **)(&x_), sz);
    cudaMalloc( (void **)(&y_), sz);
    cudaMalloc( (void **)(&result_), sizeof(double) );

    cudaMemcpy(x_, x, sz, cudaMemcpyHostToDevice);
    cudaMemcpy(y_, y, sz, cudaMemcpyHostToDevice);

    cublasDdot(h, nvals, x_, 1, y_, 1, result_);
    cudaMemcpy(&result, result_, sizeof(double), cudaMemcpyDeviceToHost);

    printf("%f %f\n", resulth, result);

    cublasDestroy(h);
    return 0;
}
Using CUBLAS_POINTER_MODE_DEVICE makes cublasDdot assume that result_ is a device pointer, and no attempt is made to copy the result back to the host. Note that this makes routines like dot asynchronous, so you might need to keep an eye on synchronization between device and host.
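To show what "using the result on the GPU" could look like, here is a small sketch (the kernel name and launch configuration are illustrative, not from the answer above) of a kernel that scales a vector by the device-resident dot product:
__global__ void scale_by_dot(double* v, const double* dot, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] *= *dot;   // dot points at the value cublasDdot wrote on the device
}
// e.g., inserted after the cublasDdot call in the example above:
scale_by_dot<<<(nvals + 255) / 256, 256>>>(y_, result_, nvals);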
You can't, exactly, using CUBLAS alone. As per talonmies' answer, starting with the CUBLAS V2 API (CUDA 4.0) the return value can be a device pointer; refer to his answer. But if you are using the V1 API, the result is a single value, so it's pretty trivial to pass it as an argument to a kernel that uses it; you don't need an explicit cudaMemcpy (but there is one implied in order to return a host value).
Starting with the Tesla K20 GPU and CUDA 5, you will be able to call CUBLAS routines from device kernels using CUDA Dynamic Parallelism. This means you would be able to call cublasSdot (for example) from inside a __global__ kernel function, and your result would therefore be returned on the GPU.
Set pointer mode to device using cublasSetPointerMode().
From cuBLAS docs:
cublasSetPointerMode()
This function sets the pointer mode used by the cuBLAS library. The default is for the values to be passed by reference on the host.
Example:
cublasHandle_t handle;
cublasCreate(&handle);
cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE); // Make the values be passed by reference on the device.
Warning: cublasSetPointerMode also affects pointers used as input parameters (e.g., alpha for cublasSgemm). You will need to store the parameters on the device or set the pointer mode back to host mode.
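For example (a sketch to illustrate the warning, not from the original answer), with device pointer mode an alpha scalar for cublasDaxpy must itself live in device memory; n, x_d and y_d are assumed to be an existing length and device arrays:
double alpha_h = 2.0;
double* alpha_d;
cudaMalloc((void**)&alpha_d, sizeof(double));
cudaMemcpy(alpha_d, &alpha_h, sizeof(double), cudaMemcpyHostToDevice);

// handle is assumed to be in CUBLAS_POINTER_MODE_DEVICE here
cublasDaxpy(handle, n, alpha_d, x_d, 1, y_d, 1);   // y = alpha*x + y, alpha read from the device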

CUDA: Allocating 2D array on GPU

I have already read the following thread, but I couldn't get my code to work.
I am trying to allocate a 2D array on GPU, fill it with values, and copy it back to the CPU. My code is as follows:
__global__ void Kernel(char **result, int N)
{
    // do something like result[0][0] = 'a';
}

int N = 20;
int Count = 5;
char **result_h = (char**)malloc(sizeof(char*) * Count);
char **result_d;
cudaMalloc(&result_d, sizeof(char*) * Count);
for (int i = 0; i < Count; i++)
{
    result_h[i] = (char*)malloc(sizeof(char) * N);
    cudaMalloc(&result_d[i], sizeof(char) * N); // get exception here
}
// call kernel
// copy values from result_d to result_h
printf("%c", result_h[0][0]); // should print 'a'
How can I achieve this?
You can't dereference device pointers in host code, which is why the cudaMalloc call inside the loop fails: result_d[i] reads through result_d, which points to device memory. You should probably just allocate a single contiguous block of memory and then treat that as a flattened 2D array.
For doing the simplest 2D operations on a GPU, I'd recommend you just treat it as a 1D array. cudaMalloc a block of size w*h*sizeof(char). You can access the element (i,j) through index j*w+i.
Alternatively, you could use cudaMallocArray to get a 2D array. This has a better sense of locality than linear mapped 2D memory. You can easily bind this to a texture, for example.
Now in terms of your example, the reason it doesn't work is that cudaMalloc writes through a host pointer to record the address of a block of device memory. Your example allocated the pointer array for result_d on the device, so cudaMalloc cannot write into its elements. If you change the cudaMalloc call for result_d to a regular malloc (and then copy the resulting array of device pointers into a device-side char** before the kernel launch), it should work as you originally intended.
That said, perhaps one of the two options I outlined above might work better from an ease of code maintenance perspective.
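A minimal sketch of the flattened approach for this example (kernel and variable names are illustrative; N and Count are as in the question):
__global__ void flatKernel(char *result, int N, int Count)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < Count)
        result[row * N + 0] = 'a';   // equivalent of result[row][0] = 'a'
}

char *result_h = (char*)malloc(Count * N);
char *result_d;
cudaMalloc(&result_d, Count * N);
flatKernel<<<1, Count>>>(result_d, N, Count);
cudaMemcpy(result_h, result_d, Count * N, cudaMemcpyDeviceToHost);
printf("%c", result_h[0]);   // prints 'a'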
When allocating in that way you are allocating addresses that are valid in CPU memory.
The address values themselves transfer to the device as numbers without problems, but once used in device code a host char* address has no meaning.
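For completeness (a sketch, not part of this answer), the nested-pointer layout can be made to work by building the array of device pointers on the host and copying it to the device:
char **rows_h = (char**)malloc(Count * sizeof(char*));   // host array of device pointers
for (int i = 0; i < Count; i++)
    cudaMalloc(&rows_h[i], N);                           // each row lives on the device
char **result_d;
cudaMalloc(&result_d, Count * sizeof(char*));
cudaMemcpy(result_d, rows_h, Count * sizeof(char*), cudaMemcpyHostToDevice);
Kernel<<<1, Count>>>(result_d, N);                       // kernel can now use result[i][j]
Reading the data back then requires one cudaMemcpy per row, which is part of why a flattened layout is usually preferable.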
Create an array of N * max text length, and another array of length N that tells how long each word is.
This is a bit more advanced, but if you are processing a defined set of text (passwords, for example)
I would suggest you group it by text length and create a specialized kernel for each length:
template<int text_width>
__global__ void Kernel(char *result, int N)
{
    // each thread fills its own fixed-width slot
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        for (int i = 0; i < text_width; i++)
            result[idx * text_width + i] = 'a';
}
and in the kernel invocation code you specify:
switch (text_length) {
case 16:
    Kernel<16><<< /* launch config */ >>>(result_d, N);
    break;
// ... one case per supported length
}
The following code sample allocates a width×height 2D array of floating-point values and shows how to loop over the array elements in device code [1]:
// host code
float* devPtr;
size_t pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
myKernel<<<100, 192>>>(devPtr, pitch, width, height);

// device code
__global__ void myKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}
The following code sample allocates a width×height CUDA array of one 32-bit floating-point component [1]:
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
cudaArray* cuArray;
cudaMallocArray(&cuArray, &channelDesc, width, height);
The following code sample copies the 2D array to the CUDA array allocated in the previous code samples [1]:
cudaMemcpy2DToArray(cuArray, 0, 0, devPtr, pitch, width * sizeof(float), height, cudaMemcpyDeviceToDevice);
The following code sample copies some host memory array to device memory [1]:
float data[256];
int size = sizeof(data);
float* devPtr;
cudaMalloc((void**)&devPtr, size);
cudaMemcpy(devPtr, data, size, cudaMemcpyHostToDevice);
You can study these examples and apply them to your purpose.
[1] NVIDIA CUDA Compute Unified Device Architecture