I am working on a GPU project and need atomicAdd() for double. CUDA does not support it for double, so I use the code below, which NVIDIA provides.
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull =
        (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                            __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
Now I want to know why the implementation requires a loop, while (assumed != old).
Basically because the read of *address and the later write back to it are two separate steps, which can't be performed together as one atomic operation. The compare-and-swap operation atomically sets *address to
(*address == assumed) ? (assumed + val) : *address
There is no guarantee that the value at *address won't change between the cycle in which it is loaded and the cycle in which the atomicCAS call tries to store the updated value. If that happens, atomicCAS leaves *address unchanged and returns the value it found there, so assumed != old. The loop therefore repeats the two operations until the value at *address has not changed between the read and the compare-and-swap, which implies that the update took place.
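The same retry pattern can be tried out on the host. Here is a minimal sketch, assuming C++11 std::atomic as a stand-in for atomicCAS; atomic_add_double, double_to_bits and bits_to_double are hypothetical names for illustration, not CUDA functions.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Host stand-ins for __double_as_longlong / __longlong_as_double:
// copy the bit pattern without changing it.
inline std::uint64_t double_to_bits(double d)
{
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return u;
}

inline double bits_to_double(std::uint64_t u)
{
    double d;
    std::memcpy(&d, &u, sizeof d);
    return d;
}

// Hypothetical host analogue of the device atomicAdd(double*, double).
// Returns the value that was stored before the addition.
inline double atomic_add_double(std::atomic<std::uint64_t>& cell, double val)
{
    std::uint64_t assumed = cell.load();
    std::uint64_t desired;
    do {
        desired = double_to_bits(bits_to_double(assumed) + val);
        // If another thread changed the cell since we read it, the exchange
        // fails, `assumed` is refreshed with the current contents, and the
        // sum is recomputed on the next iteration.
    } while (!cell.compare_exchange_weak(assumed, desired));
    return bits_to_double(assumed);
}
```

Without the loop, a failed exchange would silently drop the addition, which is exactly the case the while condition guards against.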
I am trying to add 2 char arrays in cuda, but nothing is working.
I tried to use:
char temp[32];
strcpy(temp, my_array);
strcat(temp, my_array_2);
When I used this in a kernel, I got the error: calling a __host__ function("strcpy") from a __global__ function("Process") is not allowed
After this, I tried to use these functions on the host instead of in the kernel. There was no error, but after the concatenation I got strange symbols like ĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶĶ.
So, how can I add two (or more) char arrays in CUDA?
Write your own functions:
__device__ char * my_strcpy(char *dest, const char *src){
    int i = 0;
    do {
        dest[i] = src[i];
    } while (src[i++] != 0);
    return dest;
}
__device__ char * my_strcat(char *dest, const char *src){
    int i = 0;
    while (dest[i] != 0) i++;
    my_strcpy(dest+i, src);
    return dest;
}
And while we're at it, here is strcmp
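A minimal sketch of what a strcmp in the same style might look like, assuming NUL-terminated strings. The name my_strcmp and the HOST_DEVICE macro are hypothetical and not taken from the original answer; the macro only lets the same source build with or without nvcc.

```cpp
// Expands to the CUDA qualifiers under nvcc and to nothing elsewhere.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Returns <0, 0 or >0, like the standard strcmp.
HOST_DEVICE inline int my_strcmp(const char *a, const char *b)
{
    // Advance while both strings agree and the first has not ended.
    while (*a != '\0' && *a == *b) {
        ++a;
        ++b;
    }
    // Compare the first differing characters as unsigned char,
    // which is how the C standard library orders them.
    return static_cast<int>(static_cast<unsigned char>(*a)) -
           static_cast<int>(static_cast<unsigned char>(*b));
}
```

A kernel could then call my_strcmp(temp, other) == 0 to test two per-thread buffers for equality.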
As the error message explains, you are trying to call host functions ("CPU functions") from a __global__ kernel ("GPU function"). Device code can call only functions that are compiled for the device, and the C standard library string functions such as strcpy and strcat are host-only.
You have to create your own str* functions according to what you want to do. Do you want to concatenate an array of chars in parallel, or do it serially in each thread?
I've seen this question several times relating to PHP (here is an example). The answer was generally 'stop using magic quotes'. I am having this problem in C however. When I insert binary data into a BLOB in my MySQL database, having run it through mysql_real_escape_string(), some 5c ('\') characters appear in the blob. This disrupts the data and makes it unusable. How can I prevent / fix this?
#define CHUNK_SZ (1024*256)

void insertdb(int16_t *data, size_t size, size_t nmemb)
{
    static int16_t *buf;
    static unsigned long index;   /* number of bytes currently buffered */
    static short initialized;
    unsigned long i;
    struct tm *info;
    time_t rawtime;
    char dbuf[12];
    char tbuf[12];
    char *chunk;

    if(initialized==0){
        buf = (int16_t *) malloc(CHUNK_SZ);
        initialized = 1;
    }
    if(index + (nmemb*size) + 1 >= CHUNK_SZ || do_exit == 1){
        time(&rawtime);
        info = localtime(&rawtime);
        /* Bound each write by the real buffer size (12 bytes, not 16). */
        snprintf(dbuf, sizeof dbuf, "%d-%02d-%02d", 1900+info->tm_year, 1+info->tm_mon, info->tm_mday);
        snprintf(tbuf, sizeof tbuf, "%02d:%02d:%02d", info->tm_hour, info->tm_min, info->tm_sec);
        /* Worst case: every input byte is escaped to two bytes, plus the NUL. */
        chunk = (char *) malloc(index*2+1);
        mysql_real_escape_string(con, chunk, (char *) buf, index);
        char *st = "INSERT INTO %s (date, time, tag, data) VALUES ('%s', '%s', %d, '%s')";
        int len = strlen(st)+strlen(db_mon_table)+strlen(dbuf)+strlen(tbuf)+sizeof(tag)+index*2+1;
        char *query = (char *) malloc(len);
        int qlen = snprintf(query, len, st, our_table, dbuf, tbuf, tag, chunk);
        if(mysql_real_query(con, query, qlen)){
            fprintf(stderr, "%s\n", mysql_error(con));
            mysql_close(con);
            exit(1);
        }
        free(chunk);
        free(query);   /* the query buffer was previously leaked */
        index = 0;
    } else {
        /* index counts bytes, so do the pointer arithmetic on a char pointer. */
        memcpy((char *) buf + index, data, nmemb*size);
        index += (nmemb*size);
    }
    return;
}
EDIT: Please look here. They use the same function to escape binary data (from an image), insert it, and afterward get the same image from the database. That my binary data is somehow different from an image's binary data makes no sense to me.
If you're inserting into a BLOB column, then instead of escaping the data via mysql_real_escape_string(), you should probably express it as a hexadecimal literal (X'...' or 0x... in MySQL). You will have to figure out how to encode your int16_t data into the needed byte sequence, as at minimum you have a byte-order question to sort out (but if you're in control of both encoding and decoding then you just need to make them match).
Alternatively, if the data are genuinely textual rather than binary, then the type of the column should probably be TEXT rather than BLOB. In that case, you should continue to use an ordinary SQL string and mysql_real_escape_string().
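For the hex-string route, here is a minimal sketch of the encoding step. It assumes the samples should be stored low byte first; that byte order is a choice made for this example, and the helper names append_hex_byte and int16_to_hex_literal are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append the two uppercase hex digits of one byte to out.
inline void append_hex_byte(std::string& out, unsigned char byte)
{
    static const char digits[] = "0123456789ABCDEF";
    out.push_back(digits[byte >> 4]);
    out.push_back(digits[byte & 0x0F]);
}

// Encode 16-bit samples as an X'...' hex literal, low byte first.
inline std::string int16_to_hex_literal(const std::int16_t* samples,
                                        std::size_t count)
{
    std::string out = "X'";
    out.reserve(4 * count + 3);
    for (std::size_t i = 0; i < count; ++i) {
        // Work on the unsigned bit pattern so shifts are well defined.
        const std::uint16_t u = static_cast<std::uint16_t>(samples[i]);
        append_hex_byte(out, static_cast<unsigned char>(u & 0xFF));
        append_hex_byte(out, static_cast<unsigned char>(u >> 8));
    }
    out.push_back('\'');
    return out;
}
```

The resulting literal goes into the query as is, without surrounding quotes, for example VALUES ('2024-01-01', '12:00:00', 1, X'0201FFFF'). Since the output contains only hex digits, no escaping is needed, and the reader must decode with the same byte order.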
Basically what I want is a function that works like hiloint2uint64(): it just joins two 32-bit integers and reinterprets the outcome as a uint64.
I cannot find any function in CUDA that can do this. Is there any PTX code that can do that kind of type casting?
You can define your own function like this:
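__host__ __device__ unsigned long long int hiloint2uint64(int h, int l)
{
    // Elements are laid out in memory order, so on a little-endian target
    // combined[0] (h) becomes the low 32 bits of the result.
    int combined[] = { h, l };
    return *reinterpret_cast<unsigned long long int*>(combined);
}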
Maybe a bit late by now, but probably the safest way to do this is to do it "manually" with bit shifts and a bitwise OR:
// Convert to unsigned 32-bit first: widening a negative int straight to
// uint64_t would sign-extend into the upper 32 bits.
uint32_t ui_h = h;
uint32_t ui_l = l;
return (uint64_t(ui_h)<<32)|(uint64_t(ui_l));
Note that the solution presented in the other answer isn't safe, because the array of ints might not be 8-byte aligned (and shifting some bits is faster than a memory read/write, anyway).
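Written out as a complete function, the shift-and-OR approach might look like the sketch below. The name hilo_to_u64 and the HOST_DEVICE macro are hypothetical; the macro only makes the same source compile with or without nvcc.

```cpp
#include <cstdint>

// Expands to the CUDA qualifiers under nvcc and to nothing elsewhere.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Join two 32-bit integers into one 64-bit value: h supplies the upper
// 32 bits and l the lower 32 bits.
HOST_DEVICE inline std::uint64_t hilo_to_u64(int h, int l)
{
    // Drop the sign first so that negative inputs do not sign-extend.
    const std::uint32_t hi = static_cast<std::uint32_t>(h);
    const std::uint32_t lo = static_cast<std::uint32_t>(l);
    return (static_cast<std::uint64_t>(hi) << 32) |
           static_cast<std::uint64_t>(lo);
}
```

For example, hilo_to_u64(1, 2) gives 0x0000000100000002, and a negative low word such as hilo_to_u64(0, -1) gives 0x00000000FFFFFFFF rather than all ones.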
Use uint2 (but define the temporary variable as 64-bit value: unsigned long long int) instead of arrays to be sure of alignment.
Be careful about the order of l and h.
__host__ __device__ __forceinline__ unsigned long long int hiloint2uint64(unsigned int h, unsigned int l)
{
    unsigned long long int result;
    uint2& src = *reinterpret_cast<uint2*>(&result);
    src.x = l;
    src.y = h;
    return result;
}
The CUDA registers have a size of 32 bits anyway. In the best case the compiler won't need any extra code. In the worst case it has to reorder the registers by moving a 32-bit value.
Godbolt example https://godbolt.org/z/3r9WYK9e7 of how optimized it gets.
Let us assume that we have the following strings that we need to store in a CUDA array.
"hi there"
"this is"
"who is"
How do we declare an array on the GPU to do this? I tried using C++ strings, but that does not work.
Probably the best way to do this is to use a structure that is similar to common compressed sparse matrix formats. Store the character data packed into a single piece of linear memory, then use a separate integer array to store the starting indices, and perhaps a third array to store the string lengths. The storage overhead of the latter might be more efficient than storing a string termination byte for every entry in the data and trying to parse for the terminator inside the GPU code.
So you might have something like this:
struct gpuStringArray {
    unsigned int * pos;
    unsigned int * length; // could be a smaller type if strings are short
    char4 * data; // 32 bit data type will improve memory throughput, could be 8 bit
};
Note I used a char4 type for the string data; the vector type will give better memory throughput, but it will mean strings need to be aligned/suitably padded to 4 byte boundaries. That may or may not be a problem depending on what a typical real string looks like in your application. Also, the type of the (optional) length parameter should probably be chosen to reflect the maximum admissible string length. If you have a lot of very short strings, it might be worth using an 8 or 16 bit unsigned type for the lengths to save memory.
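To make the layout concrete, here is a minimal host-side sketch of how such arrays might be built before copying them to the device. PackedStrings and pack_strings are hypothetical names, and the sketch assumes that pos and length are counted in 4-byte words and that every string is zero-padded to a 4-byte boundary.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Host-side image of the three device arrays.
struct PackedStrings {
    std::vector<unsigned int> pos;     // start of each string, in 4-byte words
    std::vector<unsigned int> length;  // length of each string, in 4-byte words
    std::vector<char> data;            // character data, zero-padded per string
};

inline PackedStrings pack_strings(const std::vector<std::string>& strings)
{
    PackedStrings packed;
    for (const std::string& s : strings) {
        // Round the byte length up to a whole number of 4-byte words.
        const unsigned int words =
            static_cast<unsigned int>((s.size() + 3) / 4);
        packed.pos.push_back(
            static_cast<unsigned int>(packed.data.size() / 4));
        packed.length.push_back(words);
        packed.data.insert(packed.data.end(), s.begin(), s.end());
        // Pad with zero bytes up to the next 4-byte boundary.
        packed.data.resize(packed.data.size() + (words * 4 - s.size()), '\0');
    }
    return packed;
}
```

Each of the three vectors can then be copied into its own device allocation with cudaMemcpy; the data bytes can be read as char4 on the device because every string starts on a 4-byte boundary.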
A really simplistic code to compare strings stored this way in the style of strcmp might look something like this:
__device__ __host__
int cmp4(const char4 & c1, const char4 & c2)
{
    // Compare the four packed characters in order; the first difference decides.
    int result;
    result = c1.x - c2.x; if (result !=0) return result;
    result = c1.y - c2.y; if (result !=0) return result;
    result = c1.z - c2.z; if (result !=0) return result;
    result = c1.w - c2.w; if (result !=0) return result;
    return 0;
}

__device__ __host__
int strncmp4(const char4 * s1, const char4 * s2, const unsigned int nwords)
{
    for(unsigned int i=0; i<nwords; i++) {
        int result = cmp4(s1[i], s2[i]);
        if (result != 0) return result;
    }
    return 0;
}

__global__
void tkernel(const struct gpuStringArray a, const gpuStringArray b, int * result)
{
    // One thread per string pair; pos and length are used as char4 word counts.
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    char4 * s1 = a.data + a.pos[idx];
    char4 * s2 = b.data + b.pos[idx];
    unsigned int slen = min(a.length[idx], b.length[idx]);
    result[idx] = strncmp4(s1, s2, slen);
}
[disclaimer: never compiled, never tested, no warranty real or implied, use at your own risk]
There are some corner cases and assumptions in this which might catch you out depending on exactly what the real strings in your code look like, but I will leave those as an exercise to the reader to resolve. You should be able to adapt and expand this into whatever it is you are trying to do.
You have to use C-style character strings char *str. Searching for "CUDA string" on google would have given you this CUDA "Hello World" example as first hit: http://computer-graphics.se/hello-world-for-cuda.html
There you can see how to use char*-strings in CUDA. Be aware that standard C-functions like strcpy or strcmp are not available in CUDA!
If you want an array of strings, you just have to use char** (as in C/C++). As for strcmp and similar functions, it highly depends on what you want to do. CUDA is not really well suited for string operations, maybe it would help if you would provide a little more detail about what you want to do.
I need some function to atomically get int value. Something called OSAtomicGet(). Analog of g_atomic_int_get().
Dereferencing an int from a known pointer is always atomic on architectures running Mac/iStuffs. Use OSMemoryBarrier() if you need a memory barrier.
int OSAtomicGet(volatile int* value) {
    OSMemoryBarrier();
    return *value;
}
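If the code can be built as C++11 or later, a portable alternative is a std::atomic<int> load. This is a minimal sketch under that assumption; atomic_get is a hypothetical name, not an OSAtomic function, and it requires the variable itself to be declared as std::atomic<int>.

```cpp
#include <atomic>

// Atomically read the current value. The default sequentially consistent
// ordering also provides the memory barrier.
inline int atomic_get(const std::atomic<int>& value)
{
    return value.load(std::memory_order_seq_cst);
}
```

A counter declared as std::atomic<int> counter(0); can then be read from any thread with atomic_get(counter).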