NPP functions with scratch buffer don't fill output value - cuda

Here is some code where I'm trying to find the maximum:
// 1)
// compute size of scratch buffer
int nBufferSize;
auto status = nppiMaxGetBufferHostSize_32f_C1R(size(img), &nBufferSize);
// status - No_Errors, nBufferSize - computed
// 2)
// device memory allocation for scratch buffer
Npp8u * pDeviceBuffer;
auto res = cudaMalloc((void **)(&pDeviceBuffer), nBufferSize);
// result - cudaSuccess
// 3)
// call npp function
// where:
// - img is npp::ImageNPP_32f_C1 from UtilNPP (npp pointer wrapper for memory management)
// - size(img) is a valid NppiSize value
Npp32f max_ = 13;
status = nppiMax_32f_C1R(img.data(), img.pitch(), size(img), pDeviceBuffer, &max_);
// status = No_Errors, but output value max_ not changed!
// 4)
// free device memory for scratch buffer
cudaFree(pDeviceBuffer);
All functions return 0 (no errors), but the output value max_ is not calculated.
I tried some other statistical functions that require a scratch buffer and got the same result.
I'm using CUDA 6.5, and my code follows the sample in the NPP documentation on using functions with a scratch buffer.
Does anyone have any ideas?

nppiMax_32f_C1R and all other such variants require both the input and the output pointers to refer to device memory. So the result has to be written to a device allocation and then copied back to the host variable max_. To make the above example work, you can do the following:
Npp32f max_ = 13;
Npp32f* d_max_; //Device output
cudaMalloc(&d_max_, sizeof(Npp32f));
status = nppiMax_32f_C1R(img.data(), img.pitch(), size(img), pDeviceBuffer, d_max_);
cudaMemcpy(&max_, d_max_, sizeof(Npp32f), cudaMemcpyDeviceToHost);
cudaFree(d_max_);


PyCUDA illegal memory access of curandState*

I'm studying the spread of an invasive species and am trying to generate random numbers within a PyCUDA kernel using the XORWOW random number generator. The matrices I need to be able to use as input in the study are quite large (up to 8,000 x 8,000).
The error seems to occur inside get_random_number when indexing the curandState* of the XORWOW generator. The code executes without errors on smaller matrices and produces correct results. I'm running my code on 2 NVidia Tesla K20X GPUs.
Kernel code and setup:
kernel_code = '''
#include <curand_kernel.h>
#include <math.h>
extern "C" {
__device__ float get_random_number(curandState* global_state, int thread_id) {
curandState local_state = global_state[thread_id];
float num = curand_uniform(&local_state);
global_state[thread_id] = local_state;
return num;
}
__global__ void survival_of_the_fittest(float* grid_a, float* grid_b, curandState* global_state, int grid_size, float* survival_probabilities) {
int x = threadIdx.x + blockIdx.x * blockDim.x; // column index of cell
int y = threadIdx.y + blockIdx.y * blockDim.y; // row index of cell
// make sure this cell is within bounds of grid
if (x < grid_size && y < grid_size) {
int thread_id = y * grid_size + x; // thread index
grid_b[thread_id] = grid_a[thread_id]; // copy current cell
float num;
// ignore cell if it is not already populated
if (grid_a[thread_id] > 0.0) {
num = get_random_number(global_state, thread_id);
// agents in this cell die
if (num < survival_probabilities[thread_id]) {
grid_b[thread_id] = 0.0; // cell dies
//printf("Cell (%d,%d) died (probability of death was %f)\\n", x, y, survival_probabilities[thread_id]);
}
}
}
}
} // closes extern "C"
'''
mod = SourceModule(kernel_code, no_extern_c = True)
survival = mod.get_function('survival_of_the_fittest')
Data setup:
matrix_size = 2000
block_dims = 32
grid_dims = (matrix_size + block_dims - 1) // block_dims
grid_a = gpuarray.to_gpu(np.ones((matrix_size,matrix_size)).astype(np.float32))
grid_b = gpuarray.to_gpu(np.zeros((matrix_size,matrix_size)).astype(np.float32))
generator = curandom.XORWOWRandomNumberGenerator()
grid_size = np.int32(matrix_size)
survival_probabilities = gpuarray.to_gpu(np.random.uniform(0,1,(matrix_size,matrix_size)))
Kernel call:
survival(grid_a, grid_b, generator.state, grid_size, survival_probabilities,
grid = (grid_dims, grid_dims), block = (block_dims, block_dims, 1))
I expect to be able to generate random numbers within the range (0,1] for matrices up to (8,000 x 8,000), but executing my code on large matrices leads to an illegal memory access error.
pycuda._driver.LogicError: cuMemcpyDtoH failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: an illegal memory access was encountered
Am I indexing the curandState* incorrectly in get_random_number? And if not, what else might be causing this error?
The problem here is a disconnect between this code which determines the size of the state which the PyCUDA curandom interface allocates for its internal state and this code in your post:
matrix_size = 2000
block_dims = 32
grid_dims = (matrix_size + block_dims - 1) // block_dims
You seem to be assuming that PyCUDA will magically allocate enough state for whatever block and grid dimensions you select in your code. That is unlikely, particularly at large grid sizes. You either need to
Modify your code to use the same block and grid sizes as the curandom module uses internally for whichever generator you choose to use, or
Allocate and manage your own state scratch space so that you have enough state allocated to service the block and grid sizes you select
I leave it as an exercise to the reader as to which one of these two approaches will work better in your application.
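For the second option, the amount of state the kernel needs follows from how it indexes global_state. A minimal sketch of that arithmetic in C, assuming the indexing scheme from the question (thread_id = y * grid_size + x, guarded by the bounds check); the helper names are hypothetical and are not part of PyCUDA or cuRAND:

```c
#include <stddef.h>

/* Blocks per grid dimension, rounded up (same formula as grid_dims in the question). */
size_t blocks_per_dim(size_t matrix_size, size_t block_dim) {
    return (matrix_size + block_dim - 1) / block_dim;
}

/* Number of curandState entries the kernel can index.
   The bounds check keeps x and y below matrix_size, so the largest
   thread_id is (matrix_size - 1) * matrix_size + (matrix_size - 1),
   i.e. matrix_size * matrix_size - 1. */
size_t states_required(size_t matrix_size) {
    return matrix_size * matrix_size;
}
```

With matrix_size = 2000 this gives 4,000,000 states, and 64,000,000 for an 8,000 x 8,000 matrix. The state array handed to the kernel has to hold at least that many entries for the indexing in get_random_number to stay in bounds.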

CMSIS real-FFT on 8192 samples in Q15

I need to perform an FFT on a block of 8192 samples on an STM32F446 microcontroller.
For that I wanted to use the CMSIS DSP library as it's available easily and optimised for the STM32F4.
My 8192 input samples will ultimately be values from the internal 12-bit ADC (left aligned and converted to q15 by flipping the sign bit), but for testing purposes I'm feeding the FFT with test buffers.
With CMSIS's FFT functions, only the Q15 version supports lengths of 8192. Thus I am using arm_rfft_q15().
Because the FFT functions of the CMSIS libraries include about 32k of LUTs by default, to adapt to many FFT lengths, I have "rewritten" them to remove all the tables corresponding to lengths other than the one I'm interested in. I haven't touched anything except removing the useless code.
My samples are stored on an external SDRAM that I access via DMA.
When using the FFT, I have several problems:
Both my source buffer and my destination buffer get modified;
The result is not at all as expected.
To make sure the results really were wrong I did an IFFT right after the FFT, but it just confirmed that the code wasn't working.
Here is my code :
status_codes FSM::fft_state(void)
{
// Flush the SDRAM section
si_ovf_buf_clr_u16((uint16_t *)0xC0000000, 8192);
q15_t* buf = (q15_t*)(0xC0000000);
for(int i = 0; i<50; i++)
buf[i] = 0x0FFF; // Fill the buffer with test vector (50 sp gate)
// initialise FFT
// ---> Forward, 8192 samples, bitReversed
arm_rfft_instance_q15 S;
if(arm_rfft_init_q15(&S, 8192, 0, 1) != ARM_MATH_SUCCESS)
return state_error;
// perform FFT
arm_rfft_q15(&S, (q15_t*)0xC0000000, (q15_t*)0xC0400000);
// Post-shift by 12, in place (see doc)
arm_shift_q15((q15_t*)0xC0400000, 12, (q15_t*)0xC0400000, 16384);
// Init inverse FFT
if(arm_rfft_init_q15(&S, 8192, 1, 1) != ARM_MATH_SUCCESS)
return state_error;
// Perform iFFT
arm_rfft_q15(&S, (q15_t*)0xC0400000, (q15_t*)0xC0800000);
// Post shift
arm_shift_q15((q15_t*)0xC0800000, 12, (q15_t*)0xC0800000, 8192);
return state_success;
}
And here is the result (from GDB)
PS : I'm using ChibiOS - not sure if it is relevant.

C - pass array as parameter and change size and content

UPDATE: I solved my problem (scroll down).
I'm writing a small C program and I want to do the following:
The program is connected to a mysql database (that works perfectly) and I want to do something with the data from the database. I get about 20-25 rows per query and I created my own struct, which should contain the information from each row of the query.
So my struct looks like this:
typedef struct {
int timestamp;
double rate;
char* market;
char* currency;
} Rate;
I want to pass an empty array to a function, and the function should calculate the size of the array based on the number of rows the query returns. E.g. if 20 rows are returned from a single SQL query, the array should contain 20 objects of my Rate struct.
I want something like this:
int main(int argc, char **argv)
{
Rate *rates = ?; // don't know how to initialize it
(void) do_something_with_rates(&rates);
// the size here should be ~20
printf("size of rates: %d", sizeof(rates)/sizeof(Rate));
}
What does the function do_something_with_rates(Rate **rates) have to look like?
EDIT: I did as Alex said: I made my function return the size of the array as size_t and passed my array to the function as Rate **rates.
In the function you can then access and change the values with, for example, (*rates)[i].timestamp = 123.
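The approach described in the edit can be sketched as follows. This is only an illustration: load_rates and its row_count parameter are hypothetical stand-ins for the real function and for the number of rows the query returns, and the field values are placeholders.

```c
#include <stdlib.h>

typedef struct {
    int timestamp;
    double rate;
    char *market;
    char *currency;
} Rate;

/* Hypothetical stand-in for the query: allocates row_count elements,
   stores the array in *rates and returns the element count.
   On allocation failure *rates is set to NULL and 0 is returned. */
size_t load_rates(Rate **rates, size_t row_count) {
    Rate *fresh = calloc(row_count, sizeof *fresh);
    if (fresh == NULL) {
        *rates = NULL;
        return 0;
    }
    for (size_t i = 0; i < row_count; ++i) {
        fresh[i].timestamp = 123;   /* placeholder values */
        fresh[i].rate = 1.0;
    }
    *rates = fresh;                 /* the caller's pointer now refers to the array */
    return row_count;
}
```

The caller starts with Rate *rates = NULL, calls size_t n = load_rates(&rates, 20), uses rates[0] through rates[n - 1], and finally calls free(rates). The size has to be kept in n, because sizeof(rates) only reports the size of the pointer.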
In C, memory is either dynamically or statically allocated.
Something like int fifty_numbers[50] is statically allocated. The size is 50 integers no matter what, so the compiler knows how big the array is in bytes. sizeof(fifty_numbers) will give you 200 bytes here (assuming a 4-byte int).
Dynamic allocation: int *bunch_of_numbers = malloc(sizeof(int) * varying_size). As you can see, varying_size is not constant, so the compiler can't figure out how big the array is without executing the program. sizeof(bunch_of_numbers) gives you the size of the pointer: 4 bytes on a 32 bit system, or 8 bytes on a 64 bit system. The only one who knows how big the array is would be the programmer. In your case, it's whoever wrote do_something_with_rates(), but you're discarding that information by neither returning it nor taking a size parameter.
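A small sketch of the difference, reusing the names from the two paragraphs above; the wrapper functions array_bytes and pointer_bytes are hypothetical and exist only so that the two results can be compared:

```c
#include <stddef.h>
#include <stdlib.h>

/* sizeof applied to an array yields the size of the whole array. */
size_t array_bytes(void) {
    int fifty_numbers[50];
    return sizeof fifty_numbers;            /* 50 * sizeof(int) */
}

/* sizeof applied to a pointer yields the size of the pointer only,
   no matter how much memory it points to. */
size_t pointer_bytes(size_t varying_size) {
    int *bunch_of_numbers = malloc(sizeof(int) * varying_size);
    size_t bytes = sizeof bunch_of_numbers; /* sizeof(int *) */
    free(bunch_of_numbers);
    return bytes;
}
```

pointer_bytes returns the same value for every varying_size, which is why the element count of a dynamically allocated array has to be tracked separately.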
It's not clear how do_something_with_rates() was declared exactly, but something like: void do_something_with_rates(Rate **rates) won't work as the function has no idea how big rates is. I recommend something like: void do_something_with_rates(size_t array_size, Rate **rates). At any rate, going by your requirements, it's still a ways away from working. Possible solutions are below:
You need to either return the new array's size:
size_t do_something_with_rates(size_t old_array_size, Rate **rates) {
size_t n = old_array_size; // placeholder: set n to the number of rows the query returned
Rate *new_rates = malloc(sizeof(Rate) * n); // allocate n Rate objects
if (new_rates == NULL) return old_array_size; // allocation failed: leave the old array untouched
// carry out your operation on new_rates
free(*rates); // releasing the memory taken up by the old array
*rates = new_rates; // make it point to the new array
return n; // returning the new size so that the caller knows
}
int main() {
Rate *rates = malloc(sizeof(Rate) * 20);
size_t new_size = do_something_with_rates(20, &rates);
// now new_size holds the size of the new array, which may or may not be 20
return 0;
}
Or pass in a size parameter for the function to set:
void do_something_with_rates(size_t old_array_size, size_t *new_array_size, Rate **rates) {
size_t n = old_array_size; // placeholder: set n to the number of rows the query returned
Rate *new_rates = malloc(sizeof(Rate) * n); // allocate n Rate objects
if (new_rates == NULL) { *new_array_size = old_array_size; return; } // allocation failed: keep the old array
*new_array_size = n; // setting the new size so that the caller knows
// carry out your operation on new_rates
free(*rates); // releasing the memory taken up by the old array
*rates = new_rates; // make it point to the new array
}
int main() {
Rate *rates = malloc(sizeof(Rate) * 20);
size_t new_size;
do_something_with_rates(20, &new_size, &rates);
// now new_size holds the size of the new array, which may or may not be 20
return 0;
}
Why do I need to pass the old size as a parameter?
void do_something_with_rates(Rate **rates) {
// You don't know what n is. How would you
// know how many rate objects the caller wants
// you to process for any given call to this?
for (size_t i = 0; i < n; ++i)
// carry out your operation on new_rates
}
Everything changes when you have a size parameter:
void do_something_with_rates(size_t size, Rate **rates) {
for (size_t i = 0; i < size; ++i) { // Now you know when to stop
// carry out your operation on (*rates)[i]
}
}
This is a very fundamental flaw with your program.
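I also want the function to change the contents of the array:
size_t do_something_with_rates(size_t old_array_size, Rate **rates) {
size_t n = old_array_size; // placeholder: set n to the number of rows the query returned
Rate *new_rates = malloc(sizeof(Rate) * n); // allocate n Rate objects
if (new_rates == NULL) return old_array_size; // allocation failed: leave the old array untouched
// carry out some operation on new_rates
for (size_t i = 0; i < n; ++i) {
new_rates[i].timestamp = (int)time(NULL); // requires <time.h>
// you can see the pattern
}
free(*rates); // release the old array
*rates = new_rates; // hand the new array back to the caller
return n; // returning the new size so that the caller knows
}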
sizeof produces a value (or code to produce a value) of the size of a type or the type of an expression at compile time. The size of an expression can therefore not change during the execution of the program. If you want that feature, use a variable, terminal value or a different programming language. Your choice. Whatever. C's better than Java.
char foo[42];
foo has either static storage duration (which is only partially related to the static keyword) or automatic storage duration.
Objects with static storage duration exist from the start of the program to its termination. What are usually called global variables are technically variables declared at file scope, which have static storage duration (and external linkage unless they are declared static).
Objects with automatic storage duration exist from entry into their enclosing block until that block is left. These are usually on the stack, though the standard doesn't require one. They're variables declared at block scope that have automatic storage duration and no linkage.
In either case, today's compilers will encode 42 into the machine code. I suppose it'd be possible to modify the machine code, though the several thousand lines you'd put into that task would be much better invested in storing the size externally (see other answer/s), and this isn't really a C question. If you really want to look into this, the only examples I can think of that change their own machine code are viruses... How are you going to avoid that antivirus heuristic?
Another option is to encode size information into a struct, use a flexible array member and then you can carry both the array and the size around as one allocation. Sorry, this is as close as you'll get to what you want. e.g.
struct T_vector {
size_t size;
T value[];
};
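// T is a placeholder element type; realloc and free require <stdlib.h>
T *T_make(struct T_vector **v) {
size_t index = *v ? (*v)->size : 0, size = index + 1;
if ((index & size) == 0) { // index is 0, 1, 3, 7, ...: grow the capacity to 2 * size elements
void *temp = realloc(*v, sizeof **v + 2 * size * sizeof *(*v)->value);
if (!temp) {
return NULL; // *v and its size are left unchanged
}
*v = temp;
}
(*v)->size = size;
return (*v)->value + index; // pointer to the newly added element
}
#define T_size(v) ((v) == NULL ? 0 : (v)->size)
int main(void) {
struct T_vector *v = NULL; // T_size(v) == 0
{ T *x = T_make(&v); // T_size(v) == 1 on success
if (x) { /* initialise *x here */ } } // x goes out of scope and is discarded
{ T *y = T_make(&v); // T_size(v) == 2 on success; the first element may have moved
if (y) { /* initialise *y here */ } } // y is discarded too
free(v);
}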
Disclaimer: I only wrote this as an example; I don't intend to test or maintain it unless the intent of the example suffers drastically. If you want something I've thoroughly tested, use my push_back.
This may seem innocent, yet even with that disclaimer and this upcoming warning I'll likely see a comment along the lines of: Each successive call to T_make may render previously returned pointers invalid... True, and I can't think of much more I could do about that. I would advise calling T_make, modifying the value pointed at by the return value and discarding that pointer, as I've done above (rather explicitly).
Some compilers might even allow you to #define sizeof(x) T_size(x)... I'm joking; don't do this. Do it, mate; it's awesome!
Technically we aren't changing the size of an array here; we're allocating ahead of time and where necessary, reallocating and copying to a larger array. It might seem appealing to abstract allocation away this way in C at times... enjoy :)

cuda use constant memory as two-dimensional array

I'm implementing my kernel in a multithreaded host program, where every host thread calls the kernel.
I've got a problem with the use of constant memory. Some parameters are placed in constant memory, but they are different for every thread.
I built a sample in which the problem occurs, too.
This is the kernel
__global__ void Kernel( int *aiOutput, int Length )
{
int id = threadIdx.x + blockIdx.x * blockDim.x;
int iValue = 0;
// bound check
if( id < Length )
{
if( id % 3 == 0 )
iValue = c_iaCoeff[2];
else if( id % 2 == 0 )
iValue = c_iaCoeff[1];
else
iValue = c_iaCoeff[0];
aiOutput[id] = iValue;
}
__syncthreads();
}
Each pthread calls this function:
void* WrapperCopy( void* params )
{
// choose cuda device to perform on
CUDA_CHECK_RETURN( cudaSetDevice( 0 ) );
// cast of params
SParams *_params = (SParams*)params;
// copy coefficients to constant memory
CUDA_CHECK_RETURN( cudaMemcpyToSymbol( c_iaCoeff, _params->h_piCoeff, 3*sizeof(int) ) );
// loop kernel
for( int i=0; i<100; i++ )
{
// perform kernel
Kernel<<< BLOCKCOUNT, BLOCKSIZE >>>( _params->d_piArray, _params->iLength );
}
// copy data back from gpu
CUDA_CHECK_RETURN( cudaMemcpy(
_params->h_piArray, _params->d_piArray, BLOCKSIZE*BLOCKCOUNT*sizeof(int), cudaMemcpyDeviceToHost ) );
return NULL;
}
Constant memory is declared as this.
__constant__ int c_iaCoeff[ 3 ];
Every host thread has different values in h_piCoeff and copies them to constant memory.
Now I get the same results for every pthread call, because all of them see the same values in c_iaCoeff.
I think the problem is how constant memory works and how it has to be declared in a context: in the sample only one c_iaCoeff is declared for all calling pthreads, so the kernels launched by the pthreads get the values of the last cudaMemcpyToSymbol. Is that right?
I have now tried to change my constant memory into a two-dimensional array.
The second dimension holds the values as before, and the first is the index of the pthread in use.
__constant__ int c_iaCoeff2[ THREADS ][ 3 ];
In the kernel it is used like this:
iValue = c_iaCoeff2[iTId][2];
But I don't know whether it's possible to use constant memory in this way. Is it?
I also get an error when I try to copy data to the constant memory:
CUDA_CHECK_RETURN( cudaMemcpyToSymbol( c_iaCoeff[_params->iTId], _params->h_piCoeff, 3*sizeof(int) ) );
In general, is it possible to use constant memory as a two-dimensional array, and if so, where is my mistake?
Yes, you should be able to use constant memory in the way you want to, but the cudaMemcpyToSymbol copy operation you are using is incorrect. The first argument to the call is a symbol, and the API does a lookup in the runtime symbol table to get the address of the constant memory symbol you request. So an address can't be passed to the call (although your code is actually passing an initialised host value to the call, why that is I will leave as an exercise to the reader).
What you may have missed is the optional fourth argument in the call, which is an offset into the memory pointed to by the symbol you request. So you should be able to do something like:
cudaMemcpyToSymbol( c_iaCoeff2, // symbol to look up (the two-dimensional array)
_params->h_piCoeff, // source location
3*sizeof(int), // number of bytes to copy
(3*_params->iTId)*sizeof(int) // offset in bytes
);
[standard disclaimer: written in browser, untested. use at own risk]
The fourth argument is the offset in bytes from the start of the symbol. Your 2D array will be laid out in row major order, so you need to use the pitch of the rows (here 3*sizeof(int)) multiplied by the row index as the offset for each copy operation.
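The offset computation can be sketched on its own in plain C. The helper name row_offset_bytes is hypothetical; cols corresponds to the second dimension of the array (3 in the question) and row to the host thread index:

```c
#include <stddef.h>

/* Byte offset of row `row` from the start of a row-major int array
   that has `cols` ints per row. */
size_t row_offset_bytes(size_t row, size_t cols) {
    return row * cols * sizeof(int);
}
```

For a [THREADS][3] array, row_offset_bytes(iTId, 3) is the value to pass as the offset argument, so thread 0 writes at offset 0 and thread 2 writes 6 * sizeof(int) bytes into the symbol.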

Invalid argument in cudaMemcpy3D using width in bytes?

I've written a simple texture3D test and found strange behavior when copying data to the device. The function cudaMemcpy3D returns 'invalid argument'.
I found that the problem is related to cudaExtent. According to the CUDA Toolkit Reference Manual 4.0, the cudaExtent parameters are as follows:
w - Width in bytes
h - Height in elements
d - Depth in elements
So, I prepared the texture as follows:
// prepare texture
cudaChannelFormatDesc t_desc = cudaCreateChannelDesc<baseType>();
// CUDA extent parameters w - Width in bytes, h - Height in elements, d - Depth in elements
cudaExtent t_extent = make_cudaExtent(NCOLS*sizeof(baseType), NROWS, DEPTH);
// CUDA arrays are opaque memory layouts optimized for texture fetching
cudaArray *i_ArrayPtr = NULL;
// allocate 3D
status = cudaMalloc3DArray(&i_ArrayPtr, &t_desc, t_extent);
And configured the 3D parameters as follows:
// prepare input data
cudaMemcpy3DParms i_3DParms = { 0 };
i_3DParms.srcPtr = make_cudaPitchedPtr( (void*)h_idata, NCOLS*sizeof(baseType), NCOLS, NROWS);
i_3DParms.dstArray = i_ArrayPtr;
i_3DParms.extent = t_extent;
i_3DParms.kind = cudaMemcpyHostToDevice;
And finally copied the data to device memory:
// copy input data from host to device
status = cudaMemcpy3D( &i_3DParms );
The problem is solved if I specify only the number of elements in the x dimension:
cudaExtent t_extent = make_cudaExtent(NCOLS, NROWS, DEPTH);
which does not produce any error, and the test works as expected.
I'm wondering whether I'm missing something about the cudaExtent function or something else. Why does the width parameter not need to be expressed in bytes?
For CUDA arrays, the extent is specified with the width given in array elements. For allocating linear memory, the extent is specified with the width given in bytes. Because you are allocating an array with cudaMalloc3DArray, use the width in elements. If you were using cudaMalloc3D, the extent would have a width in bytes.
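The rule can be written down as a small helper in plain C. This is only a sketch: extent_width is a hypothetical function, not part of the CUDA runtime, and its result is what would be passed as the first argument of make_cudaExtent.

```c
#include <stddef.h>

/* Hypothetical helper: width to use for the first field of the extent.
   uses_cuda_array != 0: a CUDA array is involved (cudaMalloc3DArray), width in elements.
   uses_cuda_array == 0: linear memory only (cudaMalloc3D), width in bytes. */
size_t extent_width(size_t ncols, size_t elem_size, int uses_cuda_array) {
    return uses_cuda_array ? ncols : ncols * elem_size;
}
```

In the question's code a CUDA array is the destination, so the width is simply NCOLS rather than NCOLS*sizeof(baseType).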