Pass statically declared __constant__ variable as kernel parameter (CUDA)

I'm trying to pass statically allocated __constant__ variables as kernel parameters, but something seems to be wrong. The kernel behaves as if the variables are not initialized. Yet I can still access the variables as globals from the same kernel. Modifying the variables' values from the host works too, but again, only if I access the variables at global scope inside the kernel. Here is an example:
#include <cuda_runtime_api.h>
#include <cuda_runtime.h>
#include <stdio.h>

__constant__ float constant_a = 1.12345;
__constant__ float constant_b;

__global__ void test_values(float a, float b) {
    printf("Device code: constant_a = %f, constant_b = %f\n", a, b);
    printf("Device code: constant_a = %f, constant_b = %f\n\n", constant_a, constant_b);
}

int main() {
    test_values<<<1, 1>>>(constant_a, constant_b);
    cudaDeviceSynchronize();

    const float h_const_a = 1;
    const float h_const_b = 2;

    cudaMemcpyToSymbol(constant_a, &h_const_a, sizeof(float));
    test_values<<<1, 1>>>(constant_a, constant_b);
    cudaDeviceSynchronize();

    cudaMemcpyToSymbol(constant_b, &h_const_b, sizeof(float));
    test_values<<<1, 1>>>(constant_a, constant_b);
    cudaDeviceSynchronize();
}
The kernel prints out this:
Device code: constant_a = 0.000000, constant_b = 0.000000
Device code: constant_a = 1.123450, constant_b = 0.000000
Device code: constant_a = 0.000000, constant_b = 0.000000
Device code: constant_a = 1.000000, constant_b = 0.000000
Device code: constant_a = 0.000000, constant_b = 0.000000
Device code: constant_a = 1.000000, constant_b = 2.000000
If anyone can explain this, I would very much appreciate it, and please also point me to a source for this, since I didn't find this information in the NVIDIA Programming Guide or a few other books.

I'm trying to pass statically allocated __constant__ variables as
kernel parameters, but something seems to be wrong
What is wrong is that you can't do that. It isn't supported.
Kernel behaves as variables are not initialized
Because they are not initialized. __constant__ variables cannot be accessed in host code, except through the symbol-manipulation APIs, and when you use those APIs you are only modifying values in device memory, nothing else. The host-side variables associated with statically declared device symbols exist only as tags for the API calls to bind to. Changes in device memory are never reflected back into the host variables. The only exception to this is __managed__ symbols (on platforms where that is supported).
Modifying variables values from the host works, but again, only if I
access variables from the global scope
That is the only valid and supported use case for __constant__ variables in host code.
since I didn't find this information in the NVIDIA Programming Guide or a few other books
You didn't find information about how to do this because it doesn't exist, and that is because it isn't supported. You can find a concise description of what is supported here.
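To make the supported workflow concrete, here is a minimal sketch (the variable names are mine): write the constant via the symbol API, and when the host needs the current device value, for instance to pass it as an ordinary kernel parameter, read it back explicitly with cudaMemcpyFromSymbol.

#include <cuda_runtime.h>
#include <stdio.h>

__constant__ float scale_factor;          // device-side constant

__global__ void use_scale(float s) {
    // s arrived as an ordinary kernel parameter
    printf("parameter = %f, constant = %f\n", s, scale_factor);
}

int main() {
    float h_scale = 3.5f;
    // Write the device-side constant via the symbol API
    cudaMemcpyToSymbol(scale_factor, &h_scale, sizeof(float));

    // To pass the *current device value* as a parameter, read it back first;
    // the host shadow of scale_factor is never updated automatically.
    float h_copy = 0.0f;
    cudaMemcpyFromSymbol(&h_copy, scale_factor, sizeof(float));

    use_scale<<<1, 1>>>(h_copy);   // pass the host copy, not the symbol
    cudaDeviceSynchronize();
    return 0;
}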

Related

Static __device__ variable and kernels in separate file

I want to statically declare a global variable with the __device__ qualifier. At the same time, I want to keep the functions intended for the GPU in a separate file.
However, if I do so, the variable's value is not transferred to the GPU: there are no errors at compile or execution time, but the memcpy functions do nothing.
When I move the kernel function into the file with the host code, everything works.
I am sure it should be possible to split host and device functions into separate files in this case, but how? I have only seen examples where the kernels and the host code are in the same file.
I would also be very thankful if somebody explained why it behaves this way.
A sample code is listed below.
Thank you in advance.
Working directory:
$ ls
functionsGPU.cu functionsGPU.cuh staticGlobalMemory.cu
staticGlobalMemory.cu:
#include "functionsGPU.cuh"
#if VARIANT == 2
__global__ void checkGlobalVariable(){
printf("Old value (dev): %f\n", devData);
devData += 2.0f;
printf("New value (dev): %f\n", devData);
}
#endif
int main(int argc, char **argv){
int dev = 0;
float val = 3.2;
cudaSetDevice(dev);
printf("---------\nVARIANT %i\n---------\n", VARIANT);
printf("Old value (host): %f\n", val);
cudaMemcpyToSymbol(devData, &val, sizeof(float));
checkGlobalVariable <<<1, 1>>> ();
cudaMemcpyFromSymbol(&val, devData, sizeof(float));
printf("New value (host): %f\n", val);
cudaDeviceReset();
return 0;
}
functionsGPU.cuh:
#ifndef FUNCTIONSGPU_CUH
#define FUNCTIONSGPU_CUH
#include <cuda_runtime.h>
#include <stdio.h>
#define VARIANT 1
__device__ float devData;
#if VARIANT == 1
__global__ void checkGlobalVariable();
#endif
#endif
functionsGPU.cu:
#include "functionsGPU.cuh"
#if VARIANT == 1
__global__ void checkGlobalVariable(){
printf("Old value (dev): %f\n", devData);
devData += 2.0f;
printf("New value (dev): %f\n", devData);
}
#endif
This is compiled as
$ nvcc -arch=sm_61 staticGlobalMemory.cu functionsGPU.cu -o staticGlobalMemory
Output if the kernel and host code are in separate files (incorrect):
---------
VARIANT 1
---------
Old value (host): 3.200000
Old value (dev): 0.000000
New value (dev): 2.000000
New value (host): 3.200000
Output if the kernel and host code are in the same file (correct):
---------
VARIANT 2
---------
Old value (host): 3.200000
Old value (dev): 3.200000
New value (dev): 5.200000
New value (host): 5.200000
Your code structure, where device code in one compilation unit references device code or device entities in another compilation unit, requires CUDA relocatable device code compilation and linking.
In the case of __device__ variables such as the one you have here:

1. Add -rdc=true to your nvcc compile command line, to enable relocatable device code generation and linking.
2. Add extern in front of the declaration of devData in functionsGPU.cuh.
3. Add the definition __device__ float devData; to staticGlobalMemory.cu.

In the case of linking to a __device__ function in a separate file, you provide the prototype (typically via a header file) just as you would with any function in C++, and you likewise need -rdc=true on your nvcc command line to enable device code linking; steps 2 and 3 above are not needed.
That should fix the issue. Step 1 provides the necessary cross-module linkage, and steps 2 and 3 fix the duplicate-definition problem you would otherwise have, since you are including the same variable via a header file in separate compilation units.
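Applied to the posted files, the changes amount to this (only the affected lines are shown):

// functionsGPU.cuh -- declaration only, visible to every translation unit
extern __device__ float devData;

// staticGlobalMemory.cu -- the single definition
__device__ float devData;

compiled with device code linking enabled:

$ nvcc -arch=sm_61 -rdc=true staticGlobalMemory.cu functionsGPU.cu -o staticGlobalMemory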
For a reference of how to do the device code compilation setting in windows visual studio, see here.

How do I direct all accesses to global memory in CUDA?

I want all of my program's memory accesses to go to global memory (even if the data is found in the L1/L2 cache). To this effect I found out that the L1 cache can be skipped by passing these options to the nvcc compiler:
-Xptxas -dlcm=cg
CUDA documentation states this:
.cv Cache as volatile (consider cached system memory lines stale, fetch again).
So I am assuming that when I compile with either -dlcm=cg or -dlcm=cv, the generated PTX file should differ from the one generated normally (the loads should be suffixed with either .cg or .cv).
My sample program:
__global__ void rh_kernel(int *datainRowX, int *datainRowY) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid != 0)
        return;

    int x, y;
    x = datainRowX[1];
    y = datainRowY[2];
    datainRowX[0] = x + y;
}

int main(int argc, char **argv) {
    int *d_datainRowX;
    cudaMalloc((void **)&d_datainRowX, sizeof(int) * 268435456);

    int *d_datainRowY;
    cudaMalloc((void **)&d_datainRowY, sizeof(int) * 268435456);

    rh_kernel<<<1024, 1>>>(d_datainRowX, d_datainRowY);

    cudaFree(d_datainRowX);
    cudaFree(d_datainRowY);
    return 0;
}
I notice that whatever options I pass to the nvcc compiler ("-Xptxas -dlcm=cg", "-Xptxas -dlcm=cv", or nothing), the generated PTX is the same in all three cases. I am using the -ptx option to generate the PTX file.
What am I missing? Is there any other way to achieve what I am doing?
Thanks in advance for your time.
According to the CUDA Toolkit Documentation:
L1 caching in Kepler GPUs is reserved only for local memory accesses,
such as register spills and stack data. Global loads are cached in L2
only (or in the Read-Only Data Cache).
GK110B-based products such as the Tesla K40 GPU Accelerator, GK20A,
and GK210 retain this behavior by default
The L1 cache is not used for global memory reads on Kepler by default, so there is no difference in the generated code when you add -Xptxas -dlcm=cg. Note also that -dlcm is an option consumed by ptxas when it compiles PTX into machine code, so its effect shows up in the SASS (viewable with cuobjdump -sass), not in the PTX that nvcc -ptx emits.
Disabling the L2 cache is not possible.
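If the goal is to force individual loads to bypass cached copies regardless of compiler flags, one option (a sketch, not the only way) is inline PTX with an explicit cache operator; recent CUDA toolkits also expose per-load intrinsics such as __ldcv() (check your toolkit's documentation for availability).

__device__ int load_fresh(const int *ptr) {
    int val;
    // ld.global.cv: treat cached lines as stale and fetch from memory again
    asm volatile("ld.global.cv.u32 %0, [%1];" : "=r"(val) : "l"(ptr));
    return val;
}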

Does __device__ variable have a size limit

I want to use a global variable from several kernel methods, but when I use the following code to initialize a __device__ variable, I get an [access violation on store (global memory)] error when initializing the second buffer.
__device__ short* blockTmp[4];

// init blockTmp
template<int BS>
__global__ void InitQuarterBuf_kernel()
{
    int iBufSize = 2000000;
    for (int i = 0; i < 4; i++) {
        blockTmp[i] = new short[iBufSize];
        blockTmp[i][iBufSize - 1] = 1;
        printf("blockTmp[%d][%d] is %d.\n", i, iBufSize - 1, blockTmp[i][iBufSize - 1]);
    }
}
The error message:
Memory Checker detected 1 access violations.
error = access violation on store (global memory)
gridid = 94
blockIdx = {0,0,0}
threadIdx = {0,0,0}
address = 0x003d08fe
accessSize = 2
CUDA grid launch failed: CUcontext: 1014297073536 CUmodule: 1013915337344 Function: _Z21InitBuf_kernelILi8EEvii
CUDA context created : 67e557f3e0
CUDA module loaded: 67cdc7ed80
CUDA module loaded: 67cdc7e180
================================================================================
CUDA Memory Checker detected 1 threads caused an access violation:
Launch Parameters
CUcontext = 67e557f3e0
CUstream = 67cdc7f580
CUmodule = 67cdc7e180
CUfunction = 67eb64b2f0
FunctionName = _Z21InitBuf_kernelILi8EEvii
GridId = 94
gridDim = {1,1,1}
blockDim = {1,1,1}
sharedSize = 256
Parameters (raw):
0x00000780 0x00000440
GPU State:
Address Size Type Mem Block Thread blockIdx threadIdx PC Source
----------------------------------------------------------------------------------------------------------------------------------
003d08fe 2 adr st g 0 0 {0,0,0} {0,0,0} _Z21InitBuf_kernelILi8EEvii+0004c8
Summary of access violations:
xxxx_launcher.cu(481): error MemoryChecker: #misaligned=0 #invalidAddress=1
================================================================================
Memory Checker detected 1 access violations.
error = access violation on store (global memory)
gridid = 94
blockIdx = {0,0,0}
threadIdx = {0,0,0}
address = 0x003d08fe
accessSize = 2
CUDA grid launch failed: CUcontext: 446229378016 CUmodule: 445834060160 Function: _Z21InitBuf_kernelILi8EEvii
Is there some limit on __device__ variables? How can I initialize a __device__ variable?
Also, if I change the buffer size to 1000, it works fine.
The issue is probably not related to your usage of __device__ variables (pointers, in this case) but to your use of in-kernel new, which definitely has allocation limits.
These limits and behavior are the same as what is described in the programming guide for in-kernel malloc. In particular, the default limit is 8MB and if you need more (in the "device heap") you must explicitly raise the limit with a CUDA runtime API call.
A useful error check in these situations is to check whether the pointer returned by new or malloc is NULL, which would indicate an allocation failure. If you fail to do that check, but then attempt to use the pointer anyway, you are going to run into trouble as described in your post.
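A sketch combining both points, using the allocation sizes from the question: raise the device heap limit before any kernel runs, and check each pointer returned by new. Four buffers of 2,000,000 shorts need 16 MB, which is over the 8 MB default.

#include <cuda_runtime.h>
#include <stdio.h>

__device__ short *blockTmp[4];

__global__ void InitBuf_kernel(int iBufSize) {
    for (int i = 0; i < 4; i++) {
        blockTmp[i] = new short[iBufSize];
        if (blockTmp[i] == NULL) {           // in-kernel allocation can fail
            printf("allocation %d failed\n", i);
            return;
        }
        blockTmp[i][iBufSize - 1] = 1;
    }
}

int main() {
    // 4 buffers * 2,000,000 shorts * 2 bytes = 16 MB > 8 MB default heap,
    // so the limit must be raised before the first kernel launch.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32 * 1048576);
    InitBuf_kernel<<<1, 1>>>(2000000);
    cudaDeviceSynchronize();
    return 0;
}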

Using cuda texture memory for 1D interpolation

I'm trying to use texture memory to solve an interpolation problem, hopefully in a faster way than using global memory. Since this is the very first time I'm using texture memory, I'm oversimplifying my interpolation problem to a linear interpolation. So, I'm already aware there are smarter and faster ways to do linear interpolation than the one reported below.
Here is the file Kernels_Interpolation.cuh. The __device__ function linear_kernel_GPU is omitted for simplicity, but it is correct.
texture<cuFloatComplex, 1> data_d_texture;

__global__ void linear_interpolation_kernel_function_GPU_texture(cuComplex* result_d, float* x_in_d, float* x_out_d, int M, int N)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;
    cuComplex datum;

    if (j < N) {
        result_d[j] = make_cuComplex(0., 0.);
        for (int k = 0; k < M; k++) {
            datum = tex1Dfetch(data_d_texture, k);
            if (fabs(x_out_d[j] - x_in_d[k]) < 1.)
                result_d[j] = cuCaddf(result_d[j], cuCmulf(make_cuComplex(linear_kernel_GPU(x_out_d[j] - x_in_d[k]), 0.), datum));
        }
    }
}
Here is the Kernels_Interpolation.cu function
extern "C" void linear_interpolation_function_GPU_texture(cuComplex* result_d, cuComplex* data_d, float* x_in_d, float* x_out_d, int M, int N){
cudaBindTexture(NULL, data_d_texture, data_d, M);
dim3 dimBlock(BLOCK_SIZE,1); dim3 dimGrid(N/BLOCK_SIZE + (N%BLOCK_SIZE == 0 ? 0:1),1);
linear_interpolation_kernel_function_GPU_texture<<<dimGrid,dimBlock>>>(result_d, x_in_d, x_out_d, M, N);
}
Finally, in the main program, the data_d array is allocated and initialized as follows:
cuComplex* data_d;
cudaMalloc((void**)&data_d, sizeof(cuComplex) * M);
cudaMemcpy(data_d, data, sizeof(cuComplex) * M, cudaMemcpyHostToDevice);
The result_d array has length N.
The strange thing is that the output is computed correctly only in the first 16 locations, although N > 16; the others are 0s, e.g.
result.r[0] 0.563585 result.i[0] 0.001251
result.r[1] 0.481203 result.i[1] 0.584259
result.r[2] 0.746924 result.i[2] 0.820994
result.r[3] 0.510477 result.i[3] 0.708008
result.r[4] 0.362980 result.i[4] 0.091818
result.r[5] 0.443626 result.i[5] 0.984452
result.r[6] 0.378992 result.i[6] 0.011919
result.r[7] 0.607517 result.i[7] 0.599023
result.r[8] 0.353575 result.i[8] 0.448551
result.r[9] 0.798026 result.i[9] 0.780909
result.r[10] 0.728561 result.i[10] 0.876729
result.r[11] 0.143276 result.i[11] 0.538575
result.r[12] 0.216170 result.i[12] 0.861384
result.r[13] 0.994566 result.i[13] 0.993541
result.r[14] 0.295192 result.i[14] 0.270596
result.r[15] 0.092388 result.i[15] 0.377816
result.r[16] 0.000000 result.i[16] 0.000000
result.r[17] 0.000000 result.i[17] 0.000000
result.r[18] 0.000000 result.i[18] 0.000000
result.r[19] 0.000000 result.i[19] 0.000000
The rest of the code is correct; namely, if I replace linear_interpolation_kernel_function_GPU_texture and linear_interpolation_function_GPU_texture with functions using global memory, everything is fine.
I have verified that I can correctly access texture memory only up to a certain location (which depends on M and N), for example 64, after which it returns 0s.
I have the same problem if I change the cuComplex texture to a float one (forcing the data to be real).
Any ideas?
I can see one logical error in the following line of your program:
cudaBindTexture(NULL, data_d_texture, data_d, M);
The last argument of cudaBindTexture is the size of the data in bytes, but you are passing the number of elements. Binding only M bytes covers just M / sizeof(cuComplex) elements, which is why only the first few locations are fetched correctly.
You should try the following:
cudaBindTexture(NULL, data_d_texture, data_d, M * sizeof(cuComplex));
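As a side note, texture references like this one were later deprecated and eventually removed from CUDA; the equivalent binding with the texture-object API, sized in bytes the same way, looks roughly like this sketch:

// Describe the linear buffer backing the texture
cudaResourceDesc resDesc = {};
resDesc.resType = cudaResourceTypeLinear;
resDesc.res.linear.devPtr = data_d;
resDesc.res.linear.desc = cudaCreateChannelDesc<float2>();  // cuComplex is float2
resDesc.res.linear.sizeInBytes = M * sizeof(cuComplex);     // bytes, not elements

cudaTextureDesc texDesc = {};
texDesc.readMode = cudaReadModeElementType;

cudaTextureObject_t tex = 0;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
// in the kernel: float2 datum = tex1Dfetch<float2>(tex, k);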

cuda gdb: the kernel indicated is not in the code

My original problem is that I have functions with a long list of arguments, which exceeded the number of bytes allowed to be passed as arguments to a CUDA kernel (I don't remember how many bytes, because it's been a while since I dealt with that). The way I bypassed this problem was to define a new structure whose members are pointers to other structures, which I can then dereference from within the kernel.
... and this is where the current problem begins: at the point where I try to dereference those pointers (members of the structure I created earlier) from within the kernel, I get CUDA_EXCEPTION_5, Warp Out-of-range Address
... from cuda-gdb. And on top of that, the kernel name and arguments that cuda-gdb reports as the faulting ones (the arguments are reported as 'not live at this point') do not belong to any kernel I created in my code.
Now, for the more specifics :
here are the structures involved:
typedef struct {
    int strx;
    int stry;
    int strz;
    float* el;
} manmat;

typedef struct {
    manmat *x;
    manmat *y;
    manmat *z;
} manmatvec;
Here's how I'm trying to group the kernel's arguments inside main:
int main () {
    ...
    ...
    manmat resu0;
    resu0.strx = n+2; resu0.stry = m+2; resu0.strz = l+2;
    if (cudaMalloc((void**)&resu0.el, sizeof(float) * (n+2)*(m+2)*(l+2)) != cudaSuccess)
        cout << endl << " ERROR allocating memory for manmat resu0" << endl;

    manmat resv0;
    resv0.strx = n+2; resv0.stry = m+2; resv0.strz = l+2;
    if (cudaMalloc((void**)&resv0.el, sizeof(float) * (n+2)*(m+2)*(l+2)) != cudaSuccess)
        cout << endl << " ERROR allocating memory for manmat resv0" << endl;

    manmat resw0;
    resw0.strx = n+2; resw0.stry = m+2; resw0.strz = l+2;
    if (cudaMalloc((void**)&resw0.el, sizeof(float) * (n+2)*(m+2)*(l+2)) != cudaSuccess)
        cout << endl << " ERROR allocating memory for manmat resw0" << endl;

    manmatvec residues0;
    residues0.x = &resu0;
    residues0.y = &resv0;
    residues0.z = &resw0;

    exec_res_std_2d <<<numBlocks2D, threadsPerBlock2D>>> (residues0, ......);
    .....
}
... and this is what happens in the kernel :
__global__ void exec_res_std_2d (manmatvec residues, ......) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int k = blockIdx.y * blockDim.y + threadIdx.y;

    manmat *resup;
    manmat *resvp;
    manmat *reswp;

    resup = residues.x;
    resvp = residues.y;
    reswp = residues.z;

    manmat resu, resv, resw;

    resu.strx = (*resup).strx;    // LINE 1626
    resu.stry = (*resup).stry;
    resu.strz = (*resup).strz;
    resu.el = (*resup).el;

    resv = *resvp;
    resw = *reswp;
    .....
}
and finally, this is what cuda-gdb gives as output :
..................
[Launch of CUDA Kernel 1065 (exec_res_std_2d<<<(1,2,1),(32,16,1)>>>) on Device 0]
[Launch of CUDA Kernel 1066 (exec_res_bot_2d<<<(1,2,1),(32,16,1)>>>) on Device 0]
Program received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.
[Switching focus to CUDA kernel 1065, grid 1066, block (0,0,0), thread (0,2,0), device 0, sm 0, warp 2, lane 0]
0x0000000003179020 in fdivide<<<(1,2,1),(32,16,1)>>> (a=warning: Variable is not live at this point. Value is undetermined.
..., pt=warning: Variable is not live at this point. Value is undetermined.
..., cells=warning: Variable is not live at this point. Value is undetermined.
...) at ola.cu:1626
1626 ola.cu: No such file or directory.
in ola.cu
I have to note that I haven't defined ANY function, __device__ or __global__, named fdivide in my code.
Also, it might be important to say that at the beginning of the run inside the debugger, despite the fact that I compile my CUDA C files with -arch=sm_20 -g -G -gencode arch=compute_20,code=sm_20, I get:
[New Thread 0x7ffff3b69700 (LWP 12465)]
[Context Create of context 0x1292340 on Device 0]
warning: no loadable sections found in added symbol-file /tmp/cuda-dbg/12456/session1/elf.1292340.1619c10.o.LkkWns
warning: no loadable sections found in added symbol-file /tmp/cuda-dbg/12456/session1/elf.1292340.1940ad0.o.aHtC7W
warning: no loadable sections found in added symbol-file /tmp/cuda-dbg/12456/session1/elf.1292340.2745680.o.bVXEWl
warning: no loadable sections found in added symbol-file /tmp/cuda-dbg/12456/session1/elf.1292340.2c438b0.o.cgUqiP
warning: no loadable sections found in added symbol-file /tmp/cuda-dbg/12456/session1/elf.1292340.2c43980.o.4diaQ4
warning: no loadable sections found in added symbol-file /tmp/cuda-dbg/12456/session1/elf.1292340.2dc9380.o.YYJAr5
Any answers, hints, or suggestions that could help me with this issue are very welcome!
Please note that I've only recently started programming with CUDA C, and I'm not very experienced with cuda-gdb; most of the debugging I've done in C code was done 'manually', by checking the output at various points of the code.
Also, this code is running on a Tesla M2090 and is compiled for the 2.0 architecture.
This will be a problem:
manmatvec residues0 ;
residues0.x = &resu0;
residues0.y = &resv0;
residues0.z = &resw0;
The resu0, resv0, and resw0 variables are allocated in host memory - on the host stack. You're putting host addresses into the manmatvec structure, then passing the manmatvec into the kernel. On the receiving end, the CUDA code cannot access the host memory addresses provided in the structure.
If you're going to pass the addresses of the resu0, resv0, resw0 variables, you need to allocate them from device memory.
I don't know if this is the entire problem, but I'm pretty sure it's a top contributor.
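A sketch of one way to apply that fix, keeping the structures as posted: copy each host-side manmat into device memory and store those device addresses in the manmatvec. (Alternatively, manmatvec could hold the three manmat structs by value, so no pointers cross the host/device boundary at all.)

manmat *d_resu0, *d_resv0, *d_resw0;
cudaMalloc((void**)&d_resu0, sizeof(manmat));
cudaMalloc((void**)&d_resv0, sizeof(manmat));
cudaMalloc((void**)&d_resw0, sizeof(manmat));

// resu0.el and friends already hold device addresses, so a shallow copy is enough
cudaMemcpy(d_resu0, &resu0, sizeof(manmat), cudaMemcpyHostToDevice);
cudaMemcpy(d_resv0, &resv0, sizeof(manmat), cudaMemcpyHostToDevice);
cudaMemcpy(d_resw0, &resw0, sizeof(manmat), cudaMemcpyHostToDevice);

manmatvec residues0;
residues0.x = d_resu0;   // device addresses: safe to dereference in the kernel
residues0.y = d_resv0;
residues0.z = d_resw0;

exec_res_std_2d <<<numBlocks2D, threadsPerBlock2D>>> (residues0, ......);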