Does __device__ variable have a size limit - cuda

I want to use a global variable across several kernel functions, but when I use the following code to initialize a __device__ variable, I get an [access violation on store (global memory)] error while initializing the second buffer.
__device__ short* blockTmp[4];
//init blockTmp
template<int BS>
__global__ void InitQuarterBuf_kernel(
)
{
    int iBufSize = 2000000;
    for (int i = 0; i < 4; i++){
        m_filteredBlockTmp[i] = new short[iBufSize];
        m_filteredBlockTmp[i][iBufSize-1] = 1;
        printf("m_filteredBlockTmp[%d][%d] is %d.\n", i, iBufSize-1, m_filteredBlockTmp[i][iBufSize-1]);
    }
}
The error message:
Memory Checker detected 1 access violations.
error = access violation on store (global memory)
gridid = 94
blockIdx = {0,0,0}
threadIdx = {0,0,0}
address = 0x003d08fe
accessSize = 2
CUDA grid launch failed: CUcontext: 1014297073536 CUmodule: 1013915337344 Function: _Z21InitBuf_kernelILi8EEvii
CUDA context created : 67e557f3e0
CUDA module loaded: 67cdc7ed80
CUDA module loaded: 67cdc7e180
================================================================================
CUDA Memory Checker detected 1 threads caused an access violation:
Launch Parameters
CUcontext = 67e557f3e0
CUstream = 67cdc7f580
CUmodule = 67cdc7e180
CUfunction = 67eb64b2f0
FunctionName = _Z21InitBuf_kernelILi8EEvii
GridId = 94
gridDim = {1,1,1}
blockDim = {1,1,1}
sharedSize = 256
Parameters (raw):
0x00000780 0x00000440
GPU State:
Address Size Type Mem Block Thread blockIdx threadIdx PC Source
----------------------------------------------------------------------------------------------------------------------------------
003d08fe 2 adr st g 0 0 {0,0,0} {0,0,0} _Z21InitBuf_kernelILi8EEvii+0004c8
Summary of access violations:
xxxx_launcher.cu(481): error MemoryChecker: #misaligned=0 #invalidAddress=1
================================================================================
Memory Checker detected 1 access violations.
error = access violation on store (global memory)
gridid = 94
blockIdx = {0,0,0}
threadIdx = {0,0,0}
address = 0x003d08fe
accessSize = 2
CUDA grid launch failed: CUcontext: 446229378016 CUmodule: 445834060160 Function: _Z21InitBuf_kernelILi8EEvii
Is there some limit for __device__ variables? How can I initialize a __device__ variable?
And if I change the buffer size to 1000, it works fine.

Your posted kernel doesn't really make sense, as your __device__ variable is named blockTmp but you are initializing m_filteredBlockTmp variables in your kernel, which don't appear to be defined anywhere.
Anyway, supposing these are intended to be the same, the issue is probably not related to your usage of __device__ variables (pointers, in this case) but to your use of in-kernel new, which definitely has allocation limits.
These limits and this behavior are the same as what is described in the programming guide for in-kernel malloc. In particular, the default limit is 8MB, and if you need more (in the "device heap") you must explicitly raise the limit with a CUDA runtime API call.
A useful error check in these situations is to test whether the pointer returned by new or malloc is NULL, which would indicate an allocation failure. If you skip that check but then attempt to use the pointer anyway, you are going to run into trouble as described in your post.
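A minimal sketch of both fixes, assuming the same four 2,000,000-element short buffers as in the question (the kernel name and the 64MB heap size are illustrative choices on my part, not from the original post):

// Host side, before the first kernel launch that uses in-kernel new/malloc:
// 4 buffers x 2,000,000 shorts x 2 bytes = ~16MB, which exceeds the 8MB default heap.
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);

// Device side: verify each allocation before touching it.
__global__ void InitBuf_checked_kernel()
{
    const int iBufSize = 2000000;
    for (int i = 0; i < 4; i++){
        blockTmp[i] = new short[iBufSize];
        if (blockTmp[i] == NULL){
            printf("allocation %d failed\n", i);
            return;  // do not dereference a NULL pointer
        }
        blockTmp[i][iBufSize-1] = 1;
    }
}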

Related

Can anyone tell me why my CUDA C code is returning my array Z to be wholly zero? (again - but with different code this time) [duplicate]


Pass statically declared __constant__ variable as kernel parameter (CUDA)

I'm trying to pass statically allocated __constant__ variables as kernel parameters, but something seems to be wrong. The kernel behaves as if the variables are not initialized. Still, I can access the variables as global variables (from the same kernel). Modifying variable values from the host works, but again, only if I access the variables from the global scope. Here is an example:
#include <cuda_runtime_api.h>
#include <cuda_runtime.h>
#include <stdio.h>

__constant__ float constant_a = 1.12345;
__constant__ float constant_b;

__global__ void test_values(float a, float b) {
    printf("Device code: constant_a = %f, constant_b = %f\n", a, b);
    printf("Device code: constant_a = %f, constant_b = %f\n\n", constant_a, constant_b);
}

int main() {
    test_values<<<1, 1>>>(constant_a, constant_b);
    cudaDeviceSynchronize();

    const float h_const_a = 1;
    const float h_const_b = 2;

    cudaMemcpyToSymbol(constant_a, &h_const_a, sizeof(float));
    test_values<<<1, 1>>>(constant_a, constant_b);
    cudaDeviceSynchronize();

    cudaMemcpyToSymbol(constant_b, &h_const_b, sizeof(float));
    test_values<<<1, 1>>>(constant_b, constant_b);
    cudaDeviceSynchronize();
}
The kernel prints out this:
Device code: constant_a = 0.000000, constant_b = 0.000000
Device code: constant_a = 1.123450, constant_b = 0.000000
Device code: constant_a = 0.000000, constant_b = 0.000000
Device code: constant_a = 1.000000, constant_b = 0.000000
Device code: constant_a = 0.000000, constant_b = 0.000000
Device code: constant_a = 1.000000, constant_b = 2.000000
If anyone can explain this, I would very much appreciate it, along with a pointer to the source of this information, since I didn't find it in the Nvidia guide or in a few other books.
I'm trying to pass statically allocated __constant__ variables as kernel parameters, but something seems to be wrong
What is wrong is that you can't do that. It isn't supported.
The kernel behaves as if the variables are not initialized
Because they are not initialized. __constant__ variables cannot be accessed in host code, except by the symbol manipulation APIs. And when you use those APIs, you are only modifying values in device memory. Nothing else. The host side variables associated with statically declared device symbols exist only as tags for binding the API calls to. Changes in device memory are not reflected in the host memory. The only exception to this is __managed__ symbols (on platforms where that is supported).
Modifying variable values from the host works, but again, only if I access the variables from the global scope
That is the only valid and supported use case for __constant__ variables in host code.
since I didn't find it in the Nvidia guide or in a few other books
You didn't find the information about how to do this because it doesn't exist, and that is because it isn't supported. You can find a concise description of what is supported here.
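As a minimal sketch of the supported pattern (mirroring the question's names): write __constant__ values with cudaMemcpyToSymbol, read them at global scope in device code, and read them back on the host with cudaMemcpyFromSymbol rather than through the host-side tag variable:

#include <cuda_runtime.h>
#include <stdio.h>

__constant__ float constant_a;

__global__ void use_constant() {
    // Device code reads the __constant__ variable at global scope.
    printf("Device code: constant_a = %f\n", constant_a);
}

int main() {
    const float h_a = 1.12345f;
    // Host code may only touch the symbol through the symbol-manipulation APIs.
    cudaMemcpyToSymbol(constant_a, &h_a, sizeof(float));
    use_constant<<<1, 1>>>();
    cudaDeviceSynchronize();

    float check = 0.0f;
    // Reading back likewise goes through the symbol API, not the tag variable.
    cudaMemcpyFromSymbol(&check, constant_a, sizeof(float));
    printf("Host code: constant_a = %f\n", check);
    return 0;
}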

"invalid configuration argument " error for the call of CUDA kernel?

Here is my code:
int threadNum = BLOCKDIM/8;
dim3 dimBlock(threadNum,threadNum);
int blocks1 = nWidth/threadNum + (nWidth%threadNum == 0 ? 0 : 1);
int blocks2 = nHeight/threadNum + (nHeight%threadNum == 0 ? 0 : 1);
dim3 dimGrid;
dimGrid.x = blocks1;
dimGrid.y = blocks2;
// dim3 numThreads2(BLOCKDIM);
// dim3 numBlocks2(numPixels/BLOCKDIM + (numPixels%BLOCKDIM == 0 ? 0 : 1) );
perform_scaling<<<dimGrid,dimBlock>>>(imageDevice,imageDevice_new,min,max,nWidth, nHeight);
cudaError_t err = cudaGetLastError();
cudasafe(err,"Kernel2");
This is the launch of my second kernel, and it is fully independent in terms of data usage. BLOCKDIM is 512, nWidth and nHeight are 512 too, and cudasafe simply prints the string message corresponding to the error code. This section of the code gives a configuration error just after the kernel call.
What might give this error, any idea?
This type of error message frequently refers to the launch configuration parameters (grid/threadblock dimensions in this case; it could also be shared memory, etc. in other cases). When you see a message like this, it's a good idea to print out your actual configuration parameters before launching the kernel, to see if you've made any mistakes.
You said BLOCKDIM = 512. You have threadNum = BLOCKDIM/8, so threadNum = 64. Your threadblock configuration is:
dim3 dimBlock(threadNum,threadNum);
So you are asking to launch blocks of 64 x 64 threads, that is, 4096 threads per block. That won't work on any generation of CUDA device. All current CUDA devices are limited to a maximum of 1024 threads per block, which is the product of the 3 block dimensions.
Maximum dimensions are listed in table 14 of the CUDA programming guide, and are also available via the deviceQuery CUDA sample code.
Just to add to the previous answers: you can also query the maximum number of threads from within your code, so it can run on other devices without hard-coding the number of threads you will use:
struct cudaDeviceProp properties;
cudaGetDeviceProperties(&properties, device);
cout << "using " << properties.multiProcessorCount << " multiprocessors" << endl;
cout << "max threads per processor: " << properties.maxThreadsPerMultiProcessor << endl;

Kernel fails on launch because of kernel parameters

I made a simple CUDA kernel which fails to launch for some reason I don't understand.
Below are my global variables.
unsigned int volume[256*256*256];//contains volume data of source
unsigned int target[256*256*256];//contains volume data of target
unsigned int* d_volume=NULL;//source data on device
unsigned int* d_target=NULL;//target data on device
The next function is a kernel launcher.
void launch_kernel(){
    cudaMalloc(&d_volume, 256*256*256*sizeof(unsigned int));
    cudaMemcpy(d_volume, volume, 256*256*256*sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMalloc(&d_target, 256*256*256*sizeof(unsigned int));
    cudaMemcpy(d_target, target, 256*256*256*sizeof(unsigned int), cudaMemcpyHostToDevice);

    dim3 threads(256,1,1);
    dim3 blocks(256,256,1);
    simple_kernel<<<blocks,threads>>>(d_volume, d_target);

    cudaError_t cudaResult;
    cudaResult = cudaGetLastError();
    if (cudaResult != cudaSuccess)
    {
        cout << "kernel failed" << endl;
    }

    cudaMemcpy(volume, d_volume, 256*256*256*sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_volume);
    cudaMemcpy(target, d_target, 256*256*256*sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_target);
}
The problem seems to be with d_target, because if I launch the kernel like this:
simple_kernel<<<blocks,threads>>>(d_volume,d_volume);
it works perfectly (the values that must be passed reach the device) and no message appears. Any idea why that could happen?
Kernel declaration follows below.
__global__ void simple_kernel(unsigned int* src, unsigned int* tgt){
    // I don't think it matters what it is for.
    int x = threadIdx.x;
    int y = blockIdx.x;
    int z = blockIdx.y;
    if(x!=0 || x!=255 || y!=0 || y!=255 || z!=0 || z!=255 ){ // in bounds of allocated memory
        if( src[x*256*256+y*256+z]==tgt[x*256*256+y*256+z])
            if(tgt[(x+1)*256*256+y*256+z]==1 || tgt[(x-1)*256*256+y*256+z]==1 || tgt[(x-1)*256*256+(y+1)*256+z] || tgt[(x-1)*256*256+(y-1)*256+z])
                src[x*256*256+y*256+z]=1;
            else
                src[x*256*256+y*256+z]=0;
    }
}
CUDA can also return an error in the case of an out-of-bounds read access to global memory. You perform this out-of-bounds read access in:
if(tgt[(x+1)*256*256+y*256+z]==1 || ...)
e.g. for x = y = z = 255, which gets through your out-of-bounds checking.
In the case where you launch your kernel as
simple_kernel<<<blocks,threads>>>(d_volume,d_volume);
the out-of-bounds read access actually lands in global memory that has already been allocated for d_target, since the arrays d_volume and d_target are stored consecutively; hence, no error occurs.
You can confirm this by further error checking, or by launching your program with cuda-memcheck.
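For reference, a sketch of an interior-only bounds check (this is my reading of the intended logic; note the original || condition is always true, so it filters nothing):

__global__ void simple_kernel(unsigned int* src, unsigned int* tgt){
    int x = threadIdx.x;
    int y = blockIdx.x;
    int z = blockIdx.y;
    // Only update interior points, so the x+1 / x-1 / y+1 / y-1 neighbor
    // reads can never leave the 256^3 volume.
    if (x > 0 && x < 255 && y > 0 && y < 255 && z > 0 && z < 255){
        if (src[x*256*256 + y*256 + z] == tgt[x*256*256 + y*256 + z]){
            if (tgt[(x+1)*256*256 + y*256 + z] == 1 ||
                tgt[(x-1)*256*256 + y*256 + z] == 1 ||
                tgt[(x-1)*256*256 + (y+1)*256 + z] ||
                tgt[(x-1)*256*256 + (y-1)*256 + z])
                src[x*256*256 + y*256 + z] = 1;
            else
                src[x*256*256 + y*256 + z] = 0;
        }
    }
}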

cuda gdb: the kernel indicated is not in the code

My original problem is that I have functions with a long list of arguments that exceeded the amount of memory allowed to be passed as arguments to a CUDA kernel (I don't remember how many bytes, because it's been a while since I dealt with that). The way I bypassed this problem was to define a new structure whose members are pointers to other structures, which I can dereference from within the kernel later.
... and this is where the current problem begins: at the point where I'm trying to dereference those pointers (members of the structure I created earlier) from within the kernel, I get CUDA_EXCEPTION_5, Warp Out-of-range Address
... from cuda-gdb. On top of that, the kernel whose name and arguments cuda-gdb reports as the one with the error (the arguments are reported as 'not live at this point') is not one that I created in my code.
Now for the specifics, here are the structures involved:
typedef struct {
    int strx;
    int stry;
    int strz;
    float* el;
} manmat;

typedef struct {
    manmat *x;
    manmat *y;
    manmat *z;
} manmatvec;
Here's how I'm trying to group the kernel's arguments inside main:
int main () {
    ...
    ...
    manmat resu0;
    resu0.strx = n+2; resu0.stry = m+2; resu0.strz = l+2;
    if (cudaMalloc((void**)&resu0.el, sizeof(float) * (n+2)*(m+2)*(l+2)) != cudaSuccess)
        cout << endl << " ERROR allocating memory for manmat resu0" << endl;

    manmat resv0;
    resv0.strx = n+2; resv0.stry = m+2; resv0.strz = l+2;
    if (cudaMalloc((void**)&resv0.el, sizeof(float) * (n+2)*(m+2)*(l+2)) != cudaSuccess)
        cout << endl << " ERROR allocating memory for manmat resv0" << endl;

    manmat resw0;
    resw0.strx = n+2; resw0.stry = m+2; resw0.strz = l+2;
    if (cudaMalloc((void**)&resw0.el, sizeof(float) * (n+2)*(m+2)*(l+2)) != cudaSuccess)
        cout << endl << " ERROR allocating memory for manmat resw0" << endl;

    manmatvec residues0;
    residues0.x = &resu0;
    residues0.y = &resv0;
    residues0.z = &resw0;

    exec_res_std_2d <<<numBlocks2D, threadsPerBlock2D>>> (residues0, ......);
    .....
}
... and this is what happens in the kernel:
__global__ void exec_res_std_2d (manmatvec residues, ......) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int k = blockIdx.y * blockDim.y + threadIdx.y;

    manmat *resup;
    manmat *resvp;
    manmat *reswp;
    resup = residues.x;
    resvp = residues.y;
    reswp = residues.z;

    manmat resu, resv, resw;
    resu.strx = (*resup).strx; //LINE 1626
    resu.stry = (*resup).stry;
    resu.strz = (*resup).strz;
    resu.el = (*resup).el;
    resv = *resvp;
    resw = *reswp;
    .....
}
and finally, this is what cuda-gdb gives as output :
..................
[Launch of CUDA Kernel 1065 (exec_res_std_2d<<<(1,2,1),(32,16,1)>>>) on Device 0]
[Launch of CUDA Kernel 1066 (exec_res_bot_2d<<<(1,2,1),(32,16,1)>>>) on Device 0]
Program received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.
[Switching focus to CUDA kernel 1065, grid 1066, block (0,0,0), thread (0,2,0), device 0, sm 0, warp 2, lane 0]
0x0000000003179020 in fdivide<<<(1,2,1),(32,16,1)>>> (a=warning: Variable is not live at this point. Value is undetermined.
..., pt=warning: Variable is not live at this point. Value is undetermined.
..., cells=warning: Variable is not live at this point. Value is undetermined.
...) at ola.cu:1626
1626 ola.cu: No such file or directory.
in ola.cu
I have to note that I haven't defined ANY function, __device__ or __global__, called fdivide in my code.
Also, it might be important to say that at the beginning of the run inside the debugger, despite the fact that I compile my CUDA C files with -arch=sm_20 -g -G -gencode arch=compute_20,code=sm_20, I get:
[New Thread 0x7ffff3b69700 (LWP 12465)]
[Context Create of context 0x1292340 on Device 0]
warning: no loadable sections found in added symbol-file /tmp/cuda-dbg/12456/session1/elf.1292340.1619c10.o.LkkWns
warning: no loadable sections found in added symbol-file /tmp/cuda-dbg/12456/session1/elf.1292340.1940ad0.o.aHtC7W
warning: no loadable sections found in added symbol-file /tmp/cuda-dbg/12456/session1/elf.1292340.2745680.o.bVXEWl
warning: no loadable sections found in added symbol-file /tmp/cuda-dbg/12456/session1/elf.1292340.2c438b0.o.cgUqiP
warning: no loadable sections found in added symbol-file /tmp/cuda-dbg/12456/session1/elf.1292340.2c43980.o.4diaQ4
warning: no loadable sections found in added symbol-file /tmp/cuda-dbg/12456/session1/elf.1292340.2dc9380.o.YYJAr5
Any answers, hints, or suggestions that could help me with this issue are very welcome!
Please note that I've only recently started programming with CUDA C, and I'm not very experienced with cuda-gdb. Most of the debugging I did in C code I did 'manually', by checking the output at various points of the code.
Also, this code is running on a Tesla M2090, and is compiled for the 2.0 architecture.
This will be a problem:
manmatvec residues0;
residues0.x = &resu0;
residues0.y = &resv0;
residues0.z = &resw0;
The resu0, resv0, and resw0 variables are allocated in host memory - on the host stack. You're putting host addresses into the manmatvec structure, then passing the manmatvec into the kernel. On the receiving end, the CUDA code cannot access the host memory addresses provided in the structure.
If you're going to pass the addresses of the resu0, resv0, resw0 variables, you need to allocate them from device memory.
I don't know if this is the entire problem, but I'm pretty sure it's a top contributor.
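A minimal sketch of that fix for one member (names mirror the question; error checking is trimmed, and d_resv0/d_resw0 are assumed to be created the same way):

// resu0.el already points to device memory from the earlier cudaMalloc.
manmat resu0;
resu0.strx = n+2; resu0.stry = m+2; resu0.strz = l+2;
cudaMalloc((void**)&resu0.el, sizeof(float) * (n+2)*(m+2)*(l+2));

// Allocate a device-side copy of the struct itself and copy it over.
manmat *d_resu0;
cudaMalloc((void**)&d_resu0, sizeof(manmat));
cudaMemcpy(d_resu0, &resu0, sizeof(manmat), cudaMemcpyHostToDevice);

// Repeat for resv0/resw0, then store DEVICE pointers in the manmatvec.
manmatvec residues0;
residues0.x = d_resu0;
residues0.y = d_resv0;
residues0.z = d_resw0;

exec_res_std_2d<<<numBlocks2D, threadsPerBlock2D>>>(residues0, ......);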