I'm implement my kernel in a multithreaded "host"-program, where every host thread is calling the kernel.
I've got a problem with the use of constant memory. In the constant memory will be placed some parameters, but for every thread they are different.
I build a sample where the problem occurs, too.
This is the kernel
__global__ void Kernel( int *aiOutput, int Length )
{
int id = threadIdx.x + blockIdx.x * blockDim.x;
int iValue = 0;
// bound check
if( id < Length )
{
if( id % 3 == 0 )
iValue = c_iaCoeff[2];
else if( id % 2 == 0 )
iValue = c_iaCoeff[1];
else
iValue = c_iaCoeff[0];
aiOutput[id] = iValue;
}
__syncthreads();
}
And a pthread is calling this function.
void* WrapperCopy( void* params )
{
// choose cuda device to perform on
CUDA_CHECK_RETURN( cudaSetDevice( 0 ) );
// cast of params
SParams *_params = (SParams*)params;
// copy coefficients to constant memory
CUDA_CHECK_RETURN( cudaMemcpyToSymbol( c_iaCoeff, _params->h_piCoeff, 3*sizeof(int) ) );
// loop kernel
for( int i=0; i<100; i++ )
{
// perfrom kernel
Kernel<<< BLOCKCOUNT, BLOCKSIZE >>>( _params->d_piArray, _params->iLength );
}
// copy data back from gpu
CUDA_CHECK_RETURN( cudaMemcpy(
_params->h_piArray, _params->d_piArray, BLOCKSIZE*BLOCKCOUNT*sizeof(int), cudaMemcpyDeviceToHost ) );
return NULL;
}
Constant memory is declared as this.
__constant__ int c_iaCoeff[ 3 ];
For every host thread has diffrent values in h_piCoeff and will copy that to the constant memory.
Now I get for every pthread call the same results, becaus all of them got the same values in c_iaCoeff.
I think that is the problem of how constant memory works and have to be declared in a context - in the sample there will be only one c_iaCoeff declared for all pthreads calling and the kernels called by pthreads will get the values of the last cudaMemcpyToSymbol. Is that right?
Now I've tried to change my constant memory in a two-dimensional array.
The second dimension will be the values as before, but the first will be the index of the used pthread.
__constant__ int c_iaCoeff2[ THREADS ][ 3 ];
In the kernels the use of it will be in this way.
iValue = c_iaCoeff2[iTId][2];
But I don't know if it's possible to use constant memory in this way, is it?
Also I got an error when I try to copy data to the constant memory.
CUDA_CHECK_RETURN( cudaMemcpyToSymbol( c_iaCoeff[_params->iTId], _params->h_piCoeff, 3*sizeof(int) ) );
General is it possible to use constant memory as a two-dimensional array and if yes, where is my failure?
Yes, you should be able to use constant memory in the way you want to, but the cudaMemcpyToSymbol copy operation you are using is incorrect. The first argument to the call is a symbol, and the API does a lookup in the runtime symbol table to get the address of the constant memory symbol you request. So an address can't be passed to the call (although your code is actually passing an initialised host value to the call, why that is I will leave as an exercise to the reader).
What you may have missed is the optional fourth argument in the call, which is an offset into the memory pointed to by the symbol you request. So you should be able to do something like:
cudaMemcpyToSymbol( c_iaCoeff, // symbol to lookup
_params->h_piCoeff, // source location
3*sizeof(int), // number of bytes to copy
(3*_params->iTId)*sizeof(int) // Offset in bytes
);
[standard disclaimer: written in browser, unstested. use at own risk]
The last argument is the offset in bytes from the start of the symbol. Your 2D array will be laid out in row major order, so you need to use the pitch of the rows multiplied by the row index as an offset for each copy operation.
Related
I'm operating on a Linux system and a Tesla C2075 machine. I am launching a kernel that is a modified version of the reduction kernel. My aim is to find the mean and a step by step averaged version(time_avg) of a large data set (result). See code below.
Size of "result" and "time_avg" is same and equal to "nsamps". "time_avg" contains successive averaged sets of the array result. So, first half contains averages of every two non-overlapping samples, the quarter after that has averages of every four non-overlapping samples, the next eighth of 8 samples and so on.
__global__ void timeavg_mean(float *result, unsigned int *nsamps, float *time_avg, float *mean) {
__shared__ float temp[1024];
int ltid = threadIdx.x, gtid = blockIdx.x*blockDim.x + threadIdx.x, stride;
int start = 0, index;
unsigned int npts = *nsamps;
printf("here here\n");
// Store chunk of memory=2*blockDim.x (which is to be reduced) into shared memory
if ( (2*gtid) < npts ){
temp[2*ltid] = result[2*gtid];
temp[2*ltid+1] = result[2*gtid + 1];
}
for (stride=1; stride<blockDim.x; stride>>=1) {
__syncthreads();
if (ltid % (stride*2) == 0){
if ( (2*gtid) < npts ){
temp[2*ltid] += temp[2*ltid + stride];
index = (int)(start + gtid/stride);
time_avg[index] = (float)( temp[2*ltid]/(2.0*stride) );
}
}
start += npts/(2*stride);
}
__syncthreads();
if (ltid == 0)
{
atomicAdd(mean, temp[0]);
}
__syncthreads();
printf("%f\n", *mean);
}
Launch configuration is 40 blocks, 512 threads. Data set is ~40k samples.
In my main code, I call cudaGetLastError() after the kernel call and it returns no error. Memory allocations and memory copies return no errors. If I write cudaDeviceSynchronize() (or a cudaMemcpy to check for the value of mean) after the kernel call, the program hangs completely after the kernel call. If I remove it, program runs and exits. In neither case, do I get the outputs "here here" or the mean value printed. I understand that unless the kernel executes successfully, the printf's won't print.
Has this got to do with __syncthreads() in a recursion? All threads will go till the same depth so I think that checks out.
What is the problem here?
Thank you!
A kernel call is asynchronous, if the kernel starts successfully your host code will continue to run and you will see no error. Errors that happen during the kernel run appear only after you do an explicit synchronization or call a function that causes an implicit synchronization.
If your host hangs on synchronization than your kernel probably didn't finished running - it is either running some infinite loop or it is waiting on some __synchthreads() or some other synchronization primitive.
Your code seems to contain an infinite loop: for (stride=1; stride<blockDim.x; stride>>=1). You probably want to shift the stride left not right: stride<<=1.
You mentioned recursion but your code contains only one __global__ function, there are no recursive calls.
Your kernel has an infinite loop. Replace the for loop with
for (stride=1; stride<blockDim.x; stride<<=1) {
I need some help with Cuda GLOBAL memory. In my project I must declare Global array for avoid to send this array at every kernel call.
EDIT:
My application can call the kernel more than 1,000 times , and on every call I'm sending him an array with size more than [1000 X 1000], So I think it's taking more time , that's why my app works slowly. So I need declare Global array for GPU, So my questions are
1 How to declare Global array
2 How to initialize Global array from CPU before kernel call
Thanks in advance
Your edited question is confusing because you say you are sending your kernel an array of size 1000 x 1000 but you want to know how to do this using a global array. The only way I know of to send this much data to a kernel is to use a global array, so you are probably already doing this with an array in global memory.
Nevertheless, there are 2 methods, at least, to create and initialize an array in global memory:
1.statically, using __device__ and cudaMemcpyToSymbol, for example:
#define SIZE 100
__device__ int A[SIZE];
...
int main(){
int myA[SIZE];
for (int i=0; i< SIZE; i++) myA[i] = 5;
cudaMemcpyToSymbol(A, myA, SIZE*sizeof(int));
...
(kernel calls, etc.)
}
(device variable reference, cudaMemcpyToSymbol reference)
2.dynamically, using cudaMalloc and cudaMemcpy:
#define SIZE 100
...
int main(){
int myA[SIZE];
int *A;
for (int i=0; i< SIZE; i++) myA[i] = 5;
cudaMalloc((void **)&A, SIZE*sizeof(int));
cudaMemcpy(A, myA, SIZE*sizeof(int), cudaMemcpyHostToDevice);
...
(kernel calls, etc.)
}
(cudaMalloc reference, cudaMemcpy reference)
For clarity I'm omitting error checking which you should do on all cuda calls and kernel calls.
If I understand well this question, which is kind of unclear, you want to use global array and send it to the device in every kernel call. This bad practice leads to high latency because in every kernel call you need to transfer your data to the device. In my experience such practice led to negative speed-up.
An optimal way would be to use what I call flip-flop technique. The way you do it is:
Declare two array in the device. d_arr1 and d_arr2
Copy the data host -> device into one of the arrays.
Pass as kernel's parameters pointers to d_arr1 and d_arr2
Process the data into the kernel.
In consequent kernel calls you exchange the pointers you are passing as parameters
This way you avoid to transfer the data every kernel call. You transfer only at the beginning and at the end of your host loop.
int a, even =0;
for(a=0;a<1000;a++)
{
if (even % 2 ==0 )
//call to the kernel(pointer_a, pointer_b)
else
//call to the kernel(pointer_b, pointer_a)
}
A number of algorithms iterate until a certain convergence criterion is reached (e.g. stability of a particular matrix). In many cases, one CUDA kernel must be launched per iteration. My question is: how then does one efficiently and accurately determine whether a matrix has changed over the course of the last kernel call? Here are three possibilities which seem equally unsatisfying:
Writing a global flag each time the matrix is modified inside the kernel. This works, but is highly inefficient and is not technically thread safe.
Using atomic operations to do the same as above. Again, this seems inefficient since in the worst case scenario one global write per thread occurs.
Using a reduction kernel to compute some parameter of the matrix (e.g. sum, mean, variance). This might be faster in some cases, but still seems like overkill. Also, it is possible to dream up cases where a matrix has changed but the sum/mean/variance haven't (e.g. two elements are swapped).
Is there any of the three options above, or an alternative, that is considered best practice and/or is generally more efficient?
I'll also go back to the answer I would have posted in 2012 but for a browser crash.
The basic idea is that you can use warp voting instructions to perform a simple, cheap reduction and then use zero or one atomic operations per block to update a pinned, mapped flag that the host can read after each kernel launch. Using a mapped flag eliminates the need for an explicit device to host transfer after each kernel launch.
This requires one word of shared memory per warp in the kernel, which is a small overhead, and some templating tricks can allow for loop unrolling if you provide the number of warps per block as a template parameter.
A complete working examplate (with C++ host code, I don't have access to a working PyCUDA installation at the moment) looks like this:
#include <cstdlib>
#include <vector>
#include <algorithm>
#include <assert.h>
__device__ unsigned int process(int & val)
{
return (++val < 10);
}
template<int nwarps>
__global__ void kernel(int *inout, unsigned int *kchanged)
{
__shared__ int wchanged[nwarps];
unsigned int laneid = threadIdx.x % warpSize;
unsigned int warpid = threadIdx.x / warpSize;
// Do calculations then check for change/convergence
// and set tchanged to be !=0 if required
int idx = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int tchanged = process(inout[idx]);
// Simple blockwise reduction using voting primitives
// increments kchanged is any thread in the block
// returned tchanged != 0
tchanged = __any(tchanged != 0);
if (laneid == 0) {
wchanged[warpid] = tchanged;
}
__syncthreads();
if (threadIdx.x == 0) {
int bchanged = 0;
#pragma unroll
for(int i=0; i<nwarps; i++) {
bchanged |= wchanged[i];
}
if (bchanged) {
atomicAdd(kchanged, 1);
}
}
}
int main(void)
{
const int N = 2048;
const int min = 5, max = 15;
std::vector<int> data(N);
for(int i=0; i<N; i++) {
data[i] = min + (std::rand() % (int)(max - min + 1));
}
int* _data;
size_t datasz = sizeof(int) * (size_t)N;
cudaMalloc<int>(&_data, datasz);
cudaMemcpy(_data, &data[0], datasz, cudaMemcpyHostToDevice);
unsigned int *kchanged, *_kchanged;
cudaHostAlloc((void **)&kchanged, sizeof(unsigned int), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&_kchanged, kchanged, 0);
const int nwarps = 4;
dim3 blcksz(32*nwarps), grdsz(16);
// Loop while the kernel signals it needs to run again
do {
*kchanged = 0;
kernel<nwarps><<<grdsz, blcksz>>>(_data, _kchanged);
cudaDeviceSynchronize();
} while (*kchanged != 0);
cudaMemcpy(&data[0], _data, datasz, cudaMemcpyDeviceToHost);
cudaDeviceReset();
int minval = *std::min_element(data.begin(), data.end());
assert(minval == 10);
return 0;
}
Here, kchanged is the flag the kernel uses to signal it needs to run again to the host. The kernel runs until each entry in the input has been incremented to above a threshold value. At the end of each threads processing, it participates in a warp vote, after which one thread from each warp loads the vote result to shared memory. One thread reduces the warp result and then atomically updates the kchanged value. The host thread waits until the device is finished, and can then directly read the result from the mapped host variable.
You should be able to adapt this to whatever your application requires
I'll go back to my original suggestion. I've updated the related question with an answer of my own, which I believe is correct.
create a flag in global memory:
__device__ int flag;
at each iteration,
initialize the flag to zero (in host code):
int init_val = 0;
cudaMemcpyToSymbol(flag, &init_val, sizeof(int));
In your kernel device code, modify the flag to 1 if a change is made to the matrix:
__global void iter_kernel(float *matrix){
...
if (new_val[i] != matrix[i]){
matrix[i] = new_val[i];
flag = 1;}
...
}
after calling the kernel, at the end of the iteration (in host code), test for modification:
int modified = 0;
cudaMemcpyFromSymbol(&modified, flag, sizeof(int));
if (modified){
...
}
Even if multiple threads in separate blocks or even separate grids, are writing the flag value, as long as the only thing they do is write the same value (i.e. 1 in this case), there is no hazard. The write will not get "lost" and no spurious values will show up in the flag variable.
Testing float or double quantities for equality in this fashion is questionable, but that doesn't seem to be the point of your question. If you have a preferred method to declare "modification" use that instead (such as testing for equality within a tolerance, perhaps).
Some obvious enhancements to this method would be to create one (local) flag variable per thread, and have each thread update the global flag variable once per kernel, rather than on every modification. This would result in at most one global write per thread per kernel. Another approach would be to keep one flag variable per block in shared memory, and have all threads simply update that variable. At the completion of the block, one write is made to global memory (if necessary) to update the global flag. We don't need to resort to complicated reductions in this case, because there is only one boolean result for the entire kernel, and we can tolerate multiple threads writing to either a shared or global variable, as long as all threads are writing the same value.
I can't see any reason to use atomics, or how it would benefit anything.
A reduction kernel seems like overkill, at least compared to one of the optimized approaches (e.g. a shared flag per block). And it would have the drawbacks you mention, such as the fact that anything less than a CRC or similarly complicated computation might alias two different matrix results as "the same".
I wrote a code which is facing kernel launch failure due to Device Illegal Address when I run it using cuda-gdb for a particular input. I ran it using cuda-memcheck and got Invalid write of size 4 error.The code is too big so I will explain the scenario here.
I have a main kernel to which I am passing an array pointer which serves as a stack.
I have a device function which is called from the main kernel and uses the stack.
__device__ void find(int v , int* p, int* pv,int n, int* d_stackContents)
{
int d_stackTop;
d_stackTop = -1;
*pv = p[v];
if(*pv == -1){
*pv = v;
}
else{
cuPrintf("Stack top is %d\n",d_stackTop);
d_stackTop = d_stackTop + 1;
d_stackContents[d_stackTop] = v;
cuPrintf("Stack top is %d\n",d_stackTop);
while(*pv != -1){
d_stackTop = d_stackTop + 1;
d_stackContents[d_stackTop] = *pv;
cuPrintf("Stack top is %d\n",d_stackTop);
*pv = p[*pv];
}
}
The error is occurring at d_stackContents[d_stackTop] = *pv;
I am calling the device function in the main kernel as follows:
find(v[idx], p,&pv,n, d_stackContents);
where idx = threadIdx.x + blockDim.x * blockIdx.x and I have declared pv as int pv;
Also, the d_stackContents array is allocated in main using cudaMalloc and passed as an argument to the main kernel
This won't work unless you call your kernel with a single thread in a single block. Otherwise all threads scribble over each other's stack. If you then dereference a pointer that was stored on the corrupted stack, it would immediately explain why your code tries to access an illegal address.
You need to use separate stacks for each thread, or a single stack with a stack pointer in global memory that is manipulated only via atomic operations.
I am trying to speed up the CPU binary search. Unfortunately, GPU version is always much slower than CPU version. Perhaps the problem is not suitable for GPU or am I doing something wrong ?
CPU version (approx. 0.6ms):
using sorted array of length 2000 and do binary search for specific value
...
Lookup ( search[j], search_array, array_length, m );
...
int Lookup ( int search, int* arr, int length, int& m )
{
int l(0), r(length-1);
while ( l <= r )
{
m = (l+r)/2;
if ( search < arr[m] )
r = m-1;
else if ( search > arr[m] )
l = m+1;
else
{
return index[m];
}
}
if ( arr[m] >= search )
return m;
return (m+1);
}
GPU version (approx. 20ms):
using sorted array of length 2000 and do binary search for specific value
....
p_ary_search<<<16, 64>>>(search[j], array_length, dev_arr, dev_ret_val);
....
__global__ void p_ary_search(int search, int array_length, int *arr, int *ret_val )
{
const int num_threads = blockDim.x * gridDim.x;
const int thread = blockIdx.x * blockDim.x + threadIdx.x;
int set_size = array_length;
ret_val[0] = -1; // return value
ret_val[1] = 0; // offset
while(set_size != 0)
{
// Get the offset of the array, initially set to 0
int offset = ret_val[1];
// I think this is necessary in case a thread gets ahead, and resets offset before it's read
// This isn't necessary for the unit tests to pass, but I still like it here
__syncthreads();
// Get the next index to check
int index_to_check = get_index_to_check(thread, num_threads, set_size, offset);
// If the index is outside the bounds of the array then lets not check it
if (index_to_check < array_length)
{
// If the next index is outside the bounds of the array, then set it to maximum array size
int next_index_to_check = get_index_to_check(thread + 1, num_threads, set_size, offset);
if (next_index_to_check >= array_length)
{
next_index_to_check = array_length - 1;
}
// If we're at the mid section of the array reset the offset to this index
if (search > arr[index_to_check] && (search < arr[next_index_to_check]))
{
ret_val[1] = index_to_check;
}
else if (search == arr[index_to_check])
{
// Set the return var if we hit it
ret_val[0] = index_to_check;
}
}
// Since this is a p-ary search divide by our total threads to get the next set size
set_size = set_size / num_threads;
// Sync up so no threads jump ahead and get a bad offset
__syncthreads();
}
}
Even if I try bigger arrays, the time ratio is not any better.
You have way too many divergent branches in your code so you're essentially serializing the entire process on the GPU. You want to break up the work so that all the threads in the same warp take the same path in the branch. See page 47 of the CUDA Best Practices Guide.
I'm must admit I'm not entirely sure what what your kernel does, but am I right in assuming that you are looking for just one index that satisfies your search criteria? If so then have a look at the reduction sample that comes with CUDA for some pointers on how to structure and optimize such a query. (What your are doing is essentially trying to reduce the closest index to your query)
Some quick pointers though:
You are performing an awful lot of reads and writes to global memory, which is incredibly slow. Try using shared memory instead.
Secondly remember that __syncthreads() only syncs threads in the same block, so your reads/writes to global memory won't necessarily get synced across all threads (though the latency from you global memory writes may actually make it appear as if they do)