Invalid argument in cudaMemcpy3D using width in bytes? - cuda

I've made a simple texture3D test and found a strange behavior when copying data to device. The function cudaMemcpy3D return an 'invalid argument'.
I found the problem is related with cudaExtent. According to the CUDA Toolkit Reference Manual 4.0, cudaExtent Parameters are as follow:
w - Width in bytes
h - Height in elements
d - Depth in elements
So, I prepared the texture as follows:
// prepare texture
cudaChannelFormatDesc t_desc = cudaCreateChannelDesc<baseType>();
// CUDA extent parameters w - Width in bytes, h - Height in elements, d - Depth in elements
cudaExtent t_extent = make_cudaExtent(NCOLS*sizeof(baseType), NROWS, DEPTH);
// CUDA arrays are opaque memory layouts optimized for texture fetching
cudaArray *i_ArrayPtr = NULL;
// allocate 3D
status = cudaMalloc3DArray(&i_ArrayPtr, &t_desc, t_extent);
And configured the 3D parameters as follow:
// prepare input data
cudaMemcpy3DParms i_3DParms = { 0 };
i_3DParms.srcPtr = make_cudaPitchedPtr( (void*)h_idata, NCOLS*sizeof(baseType), NCOLS, NROWS);
i_3DParms.dstArray = i_ArrayPtr;
i_3DParms.extent = t_extent;
i_3DParms.kind = cudaMemcpyHostToDevice;
And finally copied the data to device memory:
// copy input data from host to device
status = cudaMemcpy3D( &i_3DParms );
The problem is solved if I only specified the number of element in the x dimension as:
cudaExtent t_extent = make_cudaExtent(NCOLS, NROWS, DEPTH);
which does not produce any error and the test work as expected.
I'm wondering if I miss something with the cudaExtent function or something else. Why the width parameter is not needed to be expressed in bytes ?

For CUDA arrays, the extent is specified with the width given in array elements. For allocating linear memory, the extent is specified with the width given in bytes. Because you are allocating an array with cudaMalloc3DArray, use the width in elements. If you were using cudaMalloc3D, the extent would have a width in bytes.

Related

(cudaBindTexture2D) How to bind a pitched-array from the middle

I am trying to bind a pitched array from the middle partly (not from the beginning of the array), like followings.
/1. allocate/
cudaMallocPitch((void**)&d_texinput, &FloatPitch, cols*sizeof(float), rows);
cudaMallocPitch((void**)&d_output, &FloatPitch, cols*sizeof(float), rows);
/2. set row-length of target region (i.e., dividing rows 10 times)/
int row_div_times = 10;
int part_rows = rows / row_div_times;
int part_offset = part_rows*FloatPitch/sizeof(float);
dim3 threads(16,16);
dim3 Part_Blocks((cols + threads.x - 1) / threads.x, (Part_rows + threads.y - 1) / threads.y);
/3. processing divided rows, iteratively/
for (int i = 0; i < row_div_times; i++)
{
size_t offsetsize= i*part_offset;
/*computing values of "d_tex_input"*/
calibration << <Part_Blocks, threads, 0, stream[i] >> >
(d_texinput + i*part_offset );
/*
//###(QUESTION point!) I want to bind the device memory "d_texinput" to texture "tex_mem" only partly like below.
cudaBindTexture2D(0, tex_mem, &d_texinput[i*part_offset], channelDesc_flt, cols, Part_rows, FloatPitch); //tentative code a;
,,, or something like,,,
cudaBindTexture2D(&offsetsize, tex_mem, &d_texinput, channelDesc_flt, cols, Part_rows, FloatPitch); //tentative code b;
*/
//final computaion with texture
final_computationwithtexture << <Part_Blocks, threads, 0, stream[i] >> >
( d_output + i*part_offset );
cudaUnbindTexture(tex_mem);
}
Please kindly allow me to ask your instruction, advice how to bind the target region of the device memory array partly by revising above( QUESTION point!)?
I tried to understand first argument of cudaBindTExture2D as "offset". but it is not value. it is address. according to the documentation.
i still could not understand the documentation.
I hope I can understand what that is by knowing adequate inputting way to the cudaBindTexture2D.
The offset parameter is not an input, it is an output. That's why it is a pointer. The function will set the offset in bytes. If you want to bind in the middle of an allocation, you set the devPtr argument (third) appropriately and then the function will give you the offset required for texture accesses.
Here is how to understand this: Textures can only be bound with a certain alignment. Memory allocations are always properly aligned. Therefore it is not an issue in most cases. However, if you provide an arbitrary memory address, CUDA has to round down to the alignment and you have to apply the proper offset later on.
Let's say you bind &float[66], the proper alignment might be &float[64], so CUDA starts its texture at that offset and you have to add an offset of 8 bytes for each access to get the desired result. I'm picking random numbers here, I don't know the alignment requirements.

PyCUDA illegal memory access of curandState*

I'm studying the spread of an invasive species and am trying to generate random numbers within a PyCUDA kernel using the XORWOW random number generator. The matrices I need to be able to use as input in the study are quite large (up to 8,000 x 8,000).
The error seems to occur inside get_random_number when indexing the curandState* of the XORWOW generator. The code executes without errors on smaller matrices and produces correct results. I'm running my code on 2 NVidia Tesla K20X GPUs.
Kernel code and setup:
kernel_code = '''
#include <curand_kernel.h>
#include <math.h>
extern "C" {
__device__ float get_random_number(curandState* global_state, int thread_id) {
curandState local_state = global_state[thread_id];
float num = curand_uniform(&local_state);
global_state[thread_id] = local_state;
return num;
}
__global__ void survival_of_the_fittest(float* grid_a, float* grid_b, curandState* global_state, int grid_size, float* survival_probabilities) {
int x = threadIdx.x + blockIdx.x * blockDim.x; // column index of cell
int y = threadIdx.y + blockIdx.y * blockDim.y; // row index of cell
// make sure this cell is within bounds of grid
if (x < grid_size && y < grid_size) {
int thread_id = y * grid_size + x; // thread index
grid_b[thread_id] = grid_a[thread_id]; // copy current cell
float num;
// ignore cell if it is not already populated
if (grid_a[thread_id] > 0.0) {
num = get_random_number(global_state, thread_id);
// agents in this cell die
if (num < survival_probabilities[thread_id]) {
grid_b[thread_id] = 0.0; // cell dies
//printf("Cell (%d,%d) died (probability of death was %f)\\n", x, y, survival_probabilities[thread_id]);
}
}
}
}
mod = SourceModule(kernel_code, no_extern_c = True)
survival = mod.get_function('survival_of_the_fittest')
Data setup:
matrix_size = 2000
block_dims = 32
grid_dims = (matrix_size + block_dims - 1) // block_dims
grid_a = gpuarray.to_gpu(np.ones((matrix_size,matrix_size)).astype(np.float32))
grid_b = gpuarray.to_gpu(np.zeros((matrix_size,matrix_size)).astype(np.float32))
generator = curandom.XORWOWRandomNumberGenerator()
grid_size = np.int32(matrix_size)
survival_probabilities = gpuarray.to_gpu(np.random.uniform(0,1,(matrix_size,matrix_size)))
Kernel call:
survival(grid_a, grid_b, generator.state, grid_size, survival_probabilities,
grid = (grid_dims, grid_dims), block = (block_dims, block_dims, 1))
I expect to be able to generate random numbers within the range (0,1] for matrices up to (8,000 x 8,000), but executing my code on large matrices leads to an illegal memory access error.
pycuda._driver.LogicError: cuMemcpyDtoH failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: an illegal memory access was encountered
Am I indexing the curandState* incorrectly in get_random_number? And if not, what else might be causing this error?
The problem here is a disconnect between this code which determines the size of the state which the PyCUDA curandom interface allocates for its internal state and this code in your post:
matrix_size = 2000
block_dims = 32
grid_dims = (matrix_size + block_dims - 1) // block_dims
You seem to be assuming that PyCUDA will magically allocate enough state for whatever block and grid dimension you select in you code. That is obviously unlikely, particularly at large grid sizes. You either need to
Modify your code to use the same block and grid sizes as the curandom module uses internally for whichever generator you choose to use, or
Allocate and manage your own state scratch space so that you have enough state allocated to service the block and grid sizes you select
I leave it as an exercise to the reader as to which one of these two approaches will work better in your application.

NPP functions with scratch buffer doesn't fill output value

Some code where im trying find maximum:
// 1)
// compute size of scratch buffer
int nBufferSize;
auto status = nppiMaxGetBufferHostSize_32f_C1R(size(img), &nBufferSize);
// status - No_Errors, nBufferSize - computed
// 2)
// device memory allocation for scratch buffer
Npp8u * pDeviceBuffer;
auto res = cudaMalloc((void **)(&pDeviceBuffer), nBufferSize);
// result - cudaSucces
//3 )
// call nnp function
// where:
// - img is npp::ImageNPP_32f_C1 from UtilNPP (npp pointer wrapper for memory management)
// - size(img) valid NppiSize value
Npp32f max_ = 13;
status = nppiMax_32f_C1R(img.data(), img.pitch(), size(img), pDeviceBuffer, &max_);
// status = No_Errors, but output value max_ not changed!
// 4)
// free device memory for scratch buffer
cudaFree(pDeviceBuffer)
All function return 0 (no errors). But output value max_ not calculated.
Im try some other statistical functions who required scratch buffer and get same result.
Im use CUDA 6.5 and my code like sample in NPP documentation about using function with scratch buffer
Someone have any ideas?
nppiMax_32f_C1R and all other such variants require input and output memory pointers to be allocated on device. So max_ should be present on device. To make the above example work, you can do the following:
Npp32f max_ = 13;
Npp32f* d_max_; //Device output
cudaMalloc(&d_max_, sizeof(Npp32f));
status = nppiMax_32f_C1R(img.data(), img.pitch(), size(img), pDeviceBuffer, d_max_);
cudaMemcpy(&max_, d_max_, sizeof(Npp32f), cudaMemcpyDeviceToHost);
cudaFree(d_max_);

cublasSetVector() vs cudaMemcpy()

I am wondering if there is a difference between:
// cumalloc.c - Create a device on the device
HOST float * cudamath_vector(const float * h_vector, const int m)
{
float *d_vector = NULL;
cudaError_t cudaStatus;
cublasStatus_t cublasStatus;
cudaStatus = cudaMalloc(&d_vector, sizeof(float) * m );
if(cudaStatus == cudaErrorMemoryAllocation) {
printf("ERROR: cumalloc.cu, cudamath_vector() : cudaErrorMemoryAllocation");
return NULL;
}
/* THIS: */ cublasSetVector(m, sizeof(*d_vector), h_vector, 1, d_vector, 1);
/* OR THAT: */ cudaMemcpy(d_vector, h_vector, sizeof(float) * m, cudaMemcpyHostToDevice);
return d_vector;
}
cublasSetVector() has two arguments incx and incy and the documentation says:
The storage spacing between consecutive elements is given by incx for
the source vector x and for the destination vector y.
In the NVIDIA forum someone said:
iona_me: "incx and incy are strides measured in floats."
So does this mean that for incx = incy = 1 all elements of a float[] will be sizeof(float)-aligned and for incx = incy = 2 there would be a sizeof(float)-padding between each element?
Except for those two parameters and the cublasHandle - does cublasSetVector() anything else what cudaMalloc() doesn't do?
Would it be save to pass a vector/matrix which was not created with their respective cublas*() function to other CUBLAS functions to manipulate them?
There is a comment in a thread of the NVIDIA Forum provided by Massimiliano Fatica confirming my statement in the above comment (or, saying it better, my comment originated by a recall of having read the post I linked to). In particular
cublasSetVector, cubblasGetVector, cublasSetMatrix, cublasGetMatrix are thin wrappers around cudaMemcpy and cudaMemcpy2D. Therefore, no significant performance differences are expected between the two sets of copy functions.
Accordingly, you can safely pass any array created by cudaMalloc as input to cublasSetVector.
Concerning the strides, perhaps there is a misprint in the guide (as of CUDA 6.0), which says that
The storage spacing between consecutive elements is given by incx for the source
vector x and for the destination vector y.
but perhaps should be read as
The storage spacing between consecutive elements is given by incx for the source
vector x and incy for the destination vector y.

CUDA: Allocating 2D array on GPU

I have already read the following thread , but I couldn't get my code to work.
I am trying to allocate a 2D array on GPU, fill it with values, and copy it back to the CPU. My code is as follows:
__global__ void Kernel(char **result,int N)
{
//do something like result[0][0]='a';
}
int N=20;
int Count=5;
char **result_h=(char**)malloc(sizeof(char*)*Count);
char **result_d;
cudaMalloc(&result_d, sizeof(char*)*Count);
for(int i=0;i<Count;i++)
{
result_h[i] = (char*)malloc(sizeof(char)*N);
cudaMalloc(&result_d[i], sizeof(char)*N); //get exception here
}
//call kernel
//copy values from result_d to result_h
printf("%c",result_h[0][0])//should print a
How can i achieve this?
You can't manipulate device pointers in host code, which is why the cudaMalloc call inside the loop fails. You should probably just allocate a single contiguous block of memory and then treat that as a flattened 2D array.
For doing the simplest 2D operations on a GPU, I'd recommend you just treat it as a 1D array. cudaMalloc a block of size w*h*sizeof(char). You can access the element (i,j) through index j*w+i.
Alternatively, you could use cudaMallocArray to get a 2D array. This has a better sense of locality than linear mapped 2D memory. You can easily bind this to a texture, for example.
Now in terms of your example, the reason why it doesn't work is that cudaMalloc manipulates a host pointer to point at a block of device memory. Your example allocated the pointer structure for results_d on the device. If you just change the cudaMalloc call for results_d to a regular malloc, it should work as you originally intended.
That said, perhaps one of the two options I outlined above might work better from an ease of code maintenance perspective.
When allocating in that way you are allocating addresses that are valid on the CPU memory.
The value of the addresses is transferred as a number without problems, but once on the device memory the char* address will not have meaning.
Create an array of N * max text length, and another array of length N that tells how long each word is.
This is a bit more advanced but if you are processing a set of defined text (passwords for example)
I would suggest you to group it by text length and create specialized kernel for each length
template<int text_width>
__global__ void Kernel(char *result,int N)
{
//pseudocode
for i in text_width:
result[idx][i] = 'a'
}
and in the kernel invocation code you specify:
switch text_length
case 16:
Kernel<16> <<<>>> ()
The following code sample allocates a width×height 2D array of floating-point values and shows how to loop over the array elements in device code[1]
// host code
float* devPtr;
int pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
myKernel<<<100, 192>>>(devPtr, pitch);
// device code
__global__ void myKernel(float* devPtr, int pitch)
{
for (int r = 0; r < height; ++r) {
float* row = (float*)((char*)devPtr + r * pitch);
for (int c = 0; c < width; ++c) {
float element = row[c]; }
}
}
The following code sample allocates a width×height CUDA array of one 32-bit
floating-point component[1]
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
cudaArray* cuArray;
cudaMallocArray(&cuArray, &channelDesc, width, height);
The following code sample copies the 2D array to the CUDA array allocated in the
previous code samples[1]:
cudaMemcpy2DToArray(cuArray, 0, 0, devPtr, pitch, width * sizeof(float), height,
cudaMemcpyDeviceToDevice);
The following code sample copies somehost memory array to device memory[1]:
float data[256];
int size = sizeof(data);
float* devPtr;
cudaMalloc((void**)&devPtr, size);
cudaMemcpy(devPtr, data, size, cudaMemcpyHostToDevice);
you can understand theses examples and apply them in your purpose.
[1] NVIDIA CUDA Compute Unified Device Architecture