Hi, I am trying to debug CUDA code with a C++ wrapper. The C++ function is used in PyTorch code (test.py) as a custom extension. When I run
cuda-memcheck python test.py
after a few iterations I get this error:
========= Invalid __global__ read of size 4
========= at 0x00000ed8 in /tmp/pip-req-build-9fowbxbl/test_ops2/_ext-src/src/geo_query_gpu.cu:770:query_geo_point_kernel(int, int, int, float, int, float const *, float const *, int*, int*, float const *, int const *, int const *, int*, int const *, int const *, int const *, float const *, int const *, int const *, int const *, int const *, int const *, int const *, int const *, int const *, int const *, int const *, int const *, float*, float*, bool*, int*, int*)
========= by thread (210,0,0) in block (15,0,0)
========= Address 0x7fc938c00000 is out of bounds
I want to understand the exact meaning of this error. I believe it is caused by out-of-bounds addressing on line 770 of my code, but what do a few of the terms here indicate?
at 0x00000ed8 is the same in every test. What does this indicate?
The error says Address 0x7fc938c00000 is out of bounds. What does this address represent here?
I am not asking about the exact source of the error in my code in this question; I am trying to understand the meaning of this error log.
at 0x00000ed8 is the same in every test. What does this indicate?
That is the address of the instruction that caused the fault (an offset into the kernel's code), which is why it is the same on every run.
The error says Address 0x7fc938c00000 is out of bounds. What does this address represent here?
It is the (device) address that the instruction tried to read 4 bytes from in __global__ memory. However, that address is "out of bounds", meaning it does not fall within any valid memory allocation in use by your program.
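For illustration, here is a minimal sketch (not your code, just an assumed example) that triggers the same kind of report: the launch runs more threads than there are elements, so some threads perform an invalid __global__ read of size 4 (one float).

#include <cstdio>
#include <cuda_runtime.h>

__global__ void oob_read(const float *in, float *sum, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Missing "if (i < n)" guard: threads with i >= n read past the end of
    // the allocation, which cuda-memcheck reports as an invalid __global__
    // read of size 4 (sizeof(float)), along with the faulting thread/block
    // and the out-of-bounds device address.
    atomicAdd(sum, in[i]);
}

int main()
{
    const int n = 1000;
    float *in, *sum;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&sum, sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));
    cudaMemset(sum, 0, sizeof(float));
    oob_read<<<4, 256>>>(in, sum, n);   // 1024 threads for 1000 elements
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(sum);
    return 0;
}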
I am trying to add 2 char arrays in CUDA, but nothing is working.
I tried to use:
char temp[32];
strcpy(temp, my_array);
strcat(temp, my_array_2);
When I used this in the kernel, I got the error: calling a __host__ function("strcpy") from a __global__ function("Process") is not allowed
After this, I tried to use these functions on the host, not in the kernel - no error, but after the addition I get strange symbols like ĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶ.
So, how can I add two (or more) char arrays in CUDA?
So, how can I add two (or more) char arrays in CUDA?
write your own functions:
__device__ char * my_strcpy(char *dest, const char *src){
    int i = 0;
    do {
        dest[i] = src[i];
    } while (src[i++] != 0);
    return dest;
}

__device__ char * my_strcat(char *dest, const char *src){
    int i = 0;
    while (dest[i] != 0) i++;
    my_strcpy(dest+i, src);
    return dest;
}
And while we're at it, here is strcmp
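(The strcmp code itself is not reproduced above; a minimal sketch in the same style, my assumption rather than the original answer's exact code, would be:)

__device__ int my_strcmp(const char *a, const char *b){
    int i = 0;
    // Advance while both strings match and neither has ended.
    while (a[i] != 0 && a[i] == b[i]) i++;
    // Negative, zero, or positive, like the standard strcmp.
    return (unsigned char)a[i] - (unsigned char)b[i];
}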
As the error message explains, you are trying to call host functions ("CPU functions") from a __global__ kernel (a "GPU function"). Inside a kernel you only have access to __device__ functions and the functions CUDA provides for device code, which do not include the C standard library (where strcpy and strcat are defined).
You have to create your own str* functions according to what you want to do. Do you want to concatenate the char arrays in parallel, or do it serially in each thread? (A serial, per-thread sketch follows below.)
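For the serial, per-thread case, here is a minimal usage sketch (my assumption, not from the original post), relying on the my_strcpy/my_strcat helpers above; each thread builds the concatenation in its own 32-byte buffer and writes it to its own output slot:

__global__ void Process(const char *a, const char *b, char *out, int slot_len)
{
    char temp[32];                                    // assumes strlen(a) + strlen(b) + 1 <= 32
    my_strcpy(temp, a);                               // copy the first string
    my_strcat(temp, b);                               // append the second string
    my_strcpy(out + threadIdx.x * slot_len, temp);    // one output slot per thread
}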
While testing whether I know how to allocate surface objects, I wrote a dummy kernel to read a single value.
This kernel fails at compile time with
"no instance of overloaded function "surf3Dread" matches the argument list"
__global__ void test_surface(cudaSurfaceObject_t surfImg, int x, int y, int z){
    float test = surf3Dread(surfImg, (int)(x*sizeof(float)), y, z, cudaBoundaryModeZero);
    printf("%f \n", test);
}
it works when I do this instead:
__global__ void test_surface(cudaSurfaceObject_t surfImg, int x, int y, int z){
    float test;
    surf3Dread(&test, surfImg, (int)(x*sizeof(float)), y, z, cudaBoundaryModeZero);
    printf("%f \n", test);
}
This is not really a problem, but I was using the first form because the documentation of surf3Dread states that this function is defined as:
template<class T>
T surf3Dread(cudaSurfaceObject_t surfObj,
             int x, int y, int z,
             boundaryMode = cudaBoundaryModeTrap);

template<class T>
void surf3Dread(T* data,
                cudaSurfaceObject_t surfObj,
                int x, int y, int z,
                boundaryMode = cudaBoundaryModeTrap);
Maybe I am not understanding the documentation correctly, but I'd say that the first kernel here corresponds to the first documented way of calling the function and the second kernel to the second. Why does only one work? If I misunderstood the first function in the documentation, how do you call that version?
I am using CUDA 10.2
In the first instance, the compiler cannot deduce the template parameter T from the supplied function arguments, because T appears only in the return type and return types take no part in deduction. You need to specify the type explicitly to the compiler. This:
#include <cstdio>

__global__ void test_surface(cudaSurfaceObject_t surfImg, int x, int y, int z){
    float test = surf3Dread<float>(surfImg, (int)(x*sizeof(float)), y, z, cudaBoundaryModeZero);
    printf("%f \n", test);
}
will work where your version will not.
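Since the question mentions testing surface-object allocation, here is a self-contained host-side sketch (an assumption about the setup, not code from the question) that allocates a 3D cudaArray, copies known data into it, creates a surface object, and launches the fixed kernel:

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void test_surface(cudaSurfaceObject_t surfImg, int x, int y, int z){
    float test = surf3Dread<float>(surfImg, (int)(x*sizeof(float)), y, z, cudaBoundaryModeZero);
    printf("%f \n", test);
}

int main(){
    const size_t dim = 8;                                   // 8x8x8 float volume
    cudaExtent extent = make_cudaExtent(dim, dim, dim);
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

    cudaArray_t arr;
    cudaMalloc3DArray(&arr, &desc, extent, cudaArraySurfaceLoadStore);

    // Fill the array from the host so the kernel prints a known value (1.5).
    std::vector<float> host(dim * dim * dim, 1.5f);
    cudaMemcpy3DParms copy = {};
    copy.srcPtr = make_cudaPitchedPtr(host.data(), dim * sizeof(float), dim, dim);
    copy.dstArray = arr;
    copy.extent = extent;
    copy.kind = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&copy);

    // Bind a surface object to the array.
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = arr;
    cudaSurfaceObject_t surf = 0;
    cudaCreateSurfaceObject(&surf, &resDesc);

    test_surface<<<1, 1>>>(surf, 0, 0, 0);
    cudaDeviceSynchronize();

    cudaDestroySurfaceObject(surf);
    cudaFreeArray(arr);
    return 0;
}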
I have recently been studying Unix system programming. I came across this in the man page of exec,
int execle(const char *path, const char *arg,..., char * const envp[]);
How can this function prototype have a ... in the middle? This won't even compile!
Can someone explain this prototype, please?
I feel that this is just there for the users to see and there is a different internal implementation. This is just a guess and I am not sure about it.
I feel that this is just there for the users to see
Assuming you're confused by the trailing envp[], you would be correct. If we look at the POSIX documentation, we find the actual prototype should be:
int execle(const char *, const char *, ...);
And indeed, if you consult unistd.h on your system, you'll probably find something of that form:
// glibc
extern int execle (__const char *__path, __const char *__arg, ...)
    __THROW __nonnull ((1, 2));

// musl
int execle(const char *, const char *, ...);

// cygwin
int _EXFUN(execle, (const char *__path, const char *, ... ));
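In other words, the ... shown in the man page is documentation shorthand: envp is really the last variadic argument, passed after the (char *) NULL that terminates the argument list. A hypothetical call (the program and arguments here are made up for illustration):

#include <unistd.h>
#include <stddef.h>

int main(void) {
    char env_path[] = "PATH=/bin";
    char *const envp[] = { env_path, NULL };
    // argv-style arguments first, terminated by (char *) NULL,
    // then the environment array as the final variadic argument.
    execle("/bin/ls", "ls", "-l", (char *) NULL, envp);
    return 1;  // reached only if execle failed
}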
Basically what I want is a function that works like hiloint2uint64(): just join two 32-bit integers and reinterpret the result as a uint64.
I cannot find any function in CUDA that does this. Is there any PTX code that can do that kind of type cast?
You can define your own function like this:
__host__ __device__ unsigned long long int hiloint2uint64(int h, int l)
{
    int combined[] = { h, l };
    return *reinterpret_cast<unsigned long long int*>(combined);
}
Maybe a bit late by now, but probably the safest way to do this is to do it "manually" with bit-shifts and or:
uint32_t ui_h = h;   // copy into unsigned first so widening does not sign-extend
uint32_t ui_l = l;
return (uint64_t(ui_h) << 32) | uint64_t(ui_l);
Note that the solution presented in the other answer isn't safe, because the array of ints might not be 8-byte aligned (and shifting a few bits is faster than a memory read/write anyway).
Use uint2 instead of an array, but define the temporary variable as a 64-bit value (unsigned long long int) to be sure of the alignment.
Be careful about the order of l and h.
__host__ __device__ __forceinline__ unsigned long long int hiloint2uint64(unsigned int h, unsigned int l)
{
    unsigned long long int result;
    uint2& src = *reinterpret_cast<uint2*>(&result);
    src.x = l;
    src.y = h;
    return result;
}
The CUDA registers have a size of 32 bits anyway. In the best case the compiler won't need any extra code. In the worst case it has to reorder the registers by moving a 32-bit value.
Godbolt example https://godbolt.org/z/3r9WYK9e7 of how optimized it gets.
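As a quick sanity check of the ordering (hypothetical code, assuming a little-endian host such as x86 and the uint2-based helper above):

#include <cassert>
#include <cstdio>

int main()
{
    // 0xAABBCCDD should become the high word, 0x11223344 the low word.
    unsigned long long int v = hiloint2uint64(0xAABBCCDDu, 0x11223344u);
    assert(v == 0xAABBCCDD11223344ull);
    printf("0x%llx\n", v);
    return 0;
}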
I need some function to atomically get an int value, something like OSAtomicGet(), an analog of g_atomic_int_get().
Dereferencing an int from a known pointer is always atomic on architectures running Mac/iStuffs. Use OSMemoryBarrier() if you need a memory barrier.
int OSAtomicGet(volatile int* value) {
    OSMemoryBarrier();
    return *value;
}
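A hypothetical usage sketch, assuming the OSAtomicGet() helper above (OSMemoryBarrier() comes from <libkern/OSAtomic.h>): another thread updates the counter, and this thread takes an up-to-date snapshot of it.

#include <libkern/OSAtomic.h>
#include <stdio.h>

static volatile int g_counter = 0;

// Some other thread updates g_counter, e.g. with OSAtomicIncrement32(&g_counter).

void report(void) {
    int snapshot = OSAtomicGet(&g_counter);   // barrier, then an aligned (atomic) load
    printf("counter = %d\n", snapshot);
}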