primitive data type in ptx - cuda

__device__ __inline__ double ld_gbl_cg(const double *addr) {
double return_value;
asm("ld.global.cg.f64 %0, [%1];" : "=d"(return_value) : "l"(addr));
return return_value;
}
The above code is from here:
CUDA disable L1 cache only for one variable
According to the author, "d" means float, "r" means int.
I want to write a small piece of inline asm code and need to know the constraint letters for the rest of the primitive types (unsigned short, unsigned long long, 32-bit float, etc.). I cannot find them in the PTX ISA.
I use letter "l" to represent unsigned long long, is that correct?

You can find them here, but for the sake of completeness, the letters correspond to the underlying PTX register types:
"h" = .u16 reg
"r" = .u32 reg
"l" = .u64 reg
"f" = .f32 reg
"d" = .f64 reg
So an unsigned long long maps to "l" (for a 64 bit integer PTX register).

Related

type casting to unsigned long long in CUDA?

Basically what I want is a function that works like hiloint2uint64(): it joins two 32-bit integers and reinterprets the result as a uint64.
I cannot find any function in CUDA that can do this, anyhow, is there any ptx code that can do that kind of type casting?
You can define your own function like this:
__host__ __device__ unsigned long long int hiloint2uint64(int h, int l)
{
int combined[] = { h, l };
return *reinterpret_cast<unsigned long long int*>(combined);
}
Maybe a bit late by now, but probably the safest way to do this is to do it "manually" with bit-shifts and or:
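uint32_t ui_h = h;
uint32_t ui_l = l;
return (uint64_t(ui_h)<<32)|(uint64_t(ui_l));
Note that the solution in the other answer isn't safe, because the array of ints might not be 8-byte aligned (and shifting some bits is faster than a memory read/write anyway). Going through the unsigned 32-bit temporaries also matters: converting a negative int l directly to uint64_t would sign-extend it and overwrite the upper 32 bits.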
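The shift-and-or approach can be sketched as a complete host-side C++ function. The name hilo_to_u64 is made up for this illustration and is not a CUDA API; the same body should also work in device code, but that is an assumption that has not been checked here.

```cpp
#include <cstdint>

// Hypothetical helper: join two 32-bit halves into one 64-bit value.
// Each int is first converted to uint32_t so that a negative low word
// is not sign-extended into the upper 32 bits.
inline std::uint64_t hilo_to_u64(int h, int l) {
    const std::uint32_t ui_h = static_cast<std::uint32_t>(h);
    const std::uint32_t ui_l = static_cast<std::uint32_t>(l);
    return (static_cast<std::uint64_t>(ui_h) << 32) |
           static_cast<std::uint64_t>(ui_l);
}
```

For example, hilo_to_u64(0, -1) gives 0x00000000FFFFFFFF, whereas casting the negative int straight to a 64-bit type would set all 64 bits.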
Use uint2 (but define the temporary variable as 64-bit value: unsigned long long int) instead of arrays to be sure of alignment.
Be careful about the order of l and h.
__host__ __device__ __forceinline__ unsigned long long int hiloint2uint64(unsigned int h, unsigned int l)
{
unsigned long long int result;
uint2& src = *reinterpret_cast<uint2*>(&result);
src.x = l;
src.y = h;
return result;
}
The CUDA registers have a size of 32 bits anyway. In the best case the compiler won't need any extra code. In the worst case it has to reorder the registers by moving a 32-bit value.
Godbolt example https://godbolt.org/z/3r9WYK9e7 of how optimized it gets.

atomicAdd() for double on GPU

I am doing a project on the GPU and I have to use atomicAdd() for double. Because CUDA does not support it for double, I use the code below, which NVIDIA provides.
__device__ double atomicAdd(double* address, double val)
{
unsigned long long int* address_as_ull =
(unsigned long long int*)address;
unsigned long long int old = *address_as_ull, assumed;
do {
assumed = old;
old = atomicCAS(address_as_ull, assumed,
__double_as_longlong(val +
__longlong_as_double(assumed)));
} while (assumed != old);
return __longlong_as_double(old);
}
Now I want to know why the implementation requires a loop, while (assumed != old).
Basically because the implementation requires a load, which can't be performed atomically together with the add and the store. The compare-and-swap operation is an atomic version of
(*address == assumed) ? (assumed + val) : *address
There is no guarantee that the value at *address won't change between the cycle in which the value is loaded from *address and the cycle in which the atomicCAS call stores the updated value. If that happens, the value at *address won't be updated, and atomicCAS returns the value it actually found, which differs from assumed. The loop therefore repeats the two operations until the value at *address does not change between the read and the compare-and-swap, which implies that the update took place.
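The same retry pattern can be sketched on the host with std::atomic<double>. This is only an analogue of the CUDA code, not the CUDA implementation; the function name atomic_add_cas is invented for the example.

```cpp
#include <atomic>

// Hypothetical host-side analogue of the CAS loop.
// Adds val to target and returns the value that was stored before the add.
inline double atomic_add_cas(std::atomic<double>& target, double val) {
    double assumed = target.load();
    // compare_exchange_weak stores assumed + val only if target still holds
    // `assumed`. If another thread changed the value in between (or the weak
    // exchange fails spuriously), it reloads the current value into `assumed`
    // and the loop tries again with a freshly computed sum.
    while (!target.compare_exchange_weak(assumed, assumed + val)) {
        // nothing to do here: `assumed` has already been refreshed
    }
    return assumed;
}
```

When no other thread interferes, the exchange normally succeeds quickly; under contention each failed attempt simply recomputes the sum from the newer value, which is the role of while (assumed != old) in the CUDA version.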

unsigned long long to binary

I am trying to check the set bits of an unsigned long long in C++ using the algorithm below, which only checks whether each bit is set or not. But the answer I get is wrong. Please help me understand how an unsigned long long is stored in binary.
Code:
#include<stdio.h>
#include<iostream>
#define CHECK_BIT(var,pos) ((var) & (1<<(pos)))
using namespace std;
int main()
{
int pos=sizeof(unsigned long long)*8;
unsigned long long a;
cin>>a;
pos=pos-1;
while(pos>=0)
{
if(CHECK_BIT(a,pos))
cout<<"1";
else
cout<<"0";
--pos;
}
}
Input :
1000000000000000000
Output:
1010011101100100000000000000000010100111011001000000000000000000
Expected Output:
110111100000101101101011001110100111011001000000000000000000
Similarly for another input:
14141
Output :
0000000000000000001101110011110100000000000000000011011100111101
Expected Output:
11011100111101
In the second example(in fact for any small number) the binary pattern just repeats itself after 32 bits.
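I think the issue is in the bit-test macro: the literal 1 is an int, so shifting it by 32 or more positions is undefined, and in practice the shift count wraps around, which is why the pattern repeats after 32 bits. Use a 64-bit unsigned literal instead (1LL would fix the repetition too, but shifting a signed 1 into bit 63 is itself problematic):
#define CHECK_BIT(var,pos) ((var) & (1ULL<<(pos)))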
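A minimal sketch of the corrected check as functions rather than a macro, assuming a 64-bit unsigned long long. The names check_bit and to_binary are made up for this example; to_binary also drops leading zeros so that the result matches the expected outputs in the question.

```cpp
#include <string>

// Test one bit of a 64-bit value. The literal 1ULL makes the shift happen in
// unsigned long long, so positions 32..63 are well defined.
inline bool check_bit(unsigned long long var, unsigned pos) {
    return (var & (1ULL << pos)) != 0;
}

// Binary representation without leading zeros ("0" for the value zero).
inline std::string to_binary(unsigned long long value) {
    std::string bits;
    bool seen_one = false;
    for (int pos = 63; pos >= 0; --pos) {
        const bool set = check_bit(value, static_cast<unsigned>(pos));
        seen_one = seen_one || set;
        if (seen_one) bits.push_back(set ? '1' : '0');
    }
    if (bits.empty()) bits = "0";
    return bits;
}
```

With this, to_binary(14141) yields 11011100111101, the expected output quoted in the question.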

Allocating array of strings in cuda

Let us assume that we have the following strings that we need to store in a CUDA array.
"hi there"
"this is"
"who is"
How do we declare a array on the GPU to do this. I tried using C++ strings but it does not work.
Probably the best way to do this is to use a structure that is similar to common compressed sparse matrix formats. Store the character data packed into a single piece of linear memory, then use a separate integer array to store the starting indices, and perhaps a third array to store the string lengths. Storing the lengths explicitly might be more efficient than storing a string termination byte for every entry in the data and trying to parse for the terminator inside the GPU code.
So you might have something like this:
struct gpuStringArray {
unsigned int * pos;
unsigned int * length; // could be a smaller type if strings are short
char4 * data; // 32 bit data type will improve memory throughput, could be 8 bit
};
Note I used a char4 type for the string data; the vector type will give better memory throughput, but it will mean strings need to be aligned/suitably padded to 4 byte boundaries. That may or may not be a problem depending on what a typical real string looks like in your application. Also, the type of the (optional) length parameter should probably be chosen to reflect the maximum admissible string length. If you have a lot of very short strings, it might be worth using an 8 or 16 bit unsigned type for the lengths to save memory.
A really simplistic code to compare strings stored this way in the style of strcmp might look something like this:
__device__ __host__
int cmp4(const char4 & c1, const char4 & c2)
{
int result;
result = c1.x - c2.x; if (result !=0) return result;
result = c1.y - c2.y; if (result !=0) return result;
result = c1.z - c2.z; if (result !=0) return result;
result = c1.w - c2.w; if (result !=0) return result;
return 0;
}
__device__ __host__
int strncmp4(const char4 * s1, const char4 * s2, const unsigned int nwords)
{
for(unsigned int i=0; i<nwords; i++) {
int result = cmp4(s1[i], s2[i]);
if (result != 0) return result;
}
return 0;
}
__global__
void tkernel(const struct gpuStringArray a, const gpuStringArray b, int * result)
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
char4 * s1 = a.data + a.pos[idx];
char4 * s2 = b.data + b.pos[idx];
unsigned int slen = min(a.length[idx], b.length[idx]);
result[idx] = strncmp4(s1, s2, slen);
}
[disclaimer: never compiled, never tested, no warranty real or implied, use at your own risk]
There are some corner cases and assumptions in this which might catch you out depending on exactly what the real strings in your code look like, but I will leave those as an exercise to the reader to resolve. You should be able to adapt and expand this into whatever it is you are trying to do.
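The packed layout from this answer can be sketched on the host side, before anything is copied to the GPU. Everything here (PackedStrings, pack_strings, unpack) is a hypothetical illustration, not part of CUDA. Note one assumption that differs from the kernel above: the offsets are counted in bytes, so they would have to be divided by 4 before being used to index a char4 array.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical host-side mirror of the three arrays: packed characters,
// per-string start offsets (in bytes) and per-string lengths.
struct PackedStrings {
    std::vector<char> data;
    std::vector<unsigned int> pos;
    std::vector<unsigned int> length;
};

// Pack the strings back to back, padding each one with '\0' up to a multiple
// of `align` bytes (4 matches the char4 alignment discussed in the answer).
inline PackedStrings pack_strings(const std::vector<std::string>& strings,
                                  unsigned int align = 4) {
    PackedStrings out;
    for (const std::string& s : strings) {
        out.pos.push_back(static_cast<unsigned int>(out.data.size()));
        out.length.push_back(static_cast<unsigned int>(s.size()));
        out.data.insert(out.data.end(), s.begin(), s.end());
        while (out.data.size() % align != 0) out.data.push_back('\0');
    }
    return out;
}

// Recover string i from the packed form using its offset and length.
inline std::string unpack(const PackedStrings& p, std::size_t i) {
    return std::string(p.data.begin() + p.pos[i],
                       p.data.begin() + p.pos[i] + p.length[i]);
}
```

For the three strings in the question, the offsets come out as 0, 8 and 16 and the packed buffer is 24 bytes long. The three vectors could then be copied to device memory and their pointers stored in a gpuStringArray.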
You have to use C-style character strings char *str. Searching for "CUDA string" on google would have given you this CUDA "Hello World" example as first hit: http://computer-graphics.se/hello-world-for-cuda.html
There you can see how to use char*-strings in CUDA. Be aware that standard C-functions like strcpy or strcmp are not available in CUDA!
If you want an array of strings, you just have to use char** (as in C/C++). As for strcmp and similar functions, it highly depends on what you want to do. CUDA is not really well suited for string operations, maybe it would help if you would provide a little more detail about what you want to do.

What is the value of a dereferenced pointer

I realized that I had some confusion regarding the value of a dereferenced pointer, as I was reading a C text with the following code snippet:
int main()
{
int matrix[3][10]; // line 3: matrix is tentatively defined
int (* arrPtr)[10] = matrix; // line 4: arrPtr is defined and initialize
(*arrPtr)[0] = 5; // line 5: what is the value of (*arrPtr) ?
My confusion is about the value of *arrPtr in the last line. This is my understanding up to that point.
Line 3, matrix is declared (tentatively defined) to be an array of 3 elements of type array of 10 elements of type int.
Line 4, arrPtr is defined as a pointer to an array of 10 elements of type int. It is also initialized as a pointer to an array of 10 elements (i.e. the first row of matrix).
Now Line 5, arrPtr is dereferenced, yielding the actual array, so its type is array of 10 ints.
My question: why is the value of the array just the address of the array, and not in some way related to its elements?
The value of the array variable matrix is the array; however, it (easily) "decays" into a pointer to its first item, which you then assign to arrPtr.
To see this, use &matrix (has type int (*)[3][10]) or sizeof matrix (equals sizeof(int) * 3 * 10).
Additionally, there's nothing tentative about that definition.
Edit: I missed the question hiding in the code comments: *arrPtr is an object of type int[10], so when you use [0] on it, you get the first item, to which you then assign 5.
Pointers and arrays are purposefully defined to behave similarly, and this is sometimes confusing (before you learn the various quirks), but also extremely versatile and useful.
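The types involved can be checked at compile time. The sketch below uses C++ (static_assert and decltype) rather than C, purely so the compiler can confirm the claims; write_first is a made-up helper name.

```cpp
#include <type_traits>

int matrix[3][10];
int (*arrPtr)[10] = matrix;  // matrix decays to a pointer to its first row

// matrix itself is an array of 3 arrays of 10 ints, not a pointer.
static_assert(std::is_same<decltype(matrix), int[3][10]>::value, "array type");
static_assert(sizeof(matrix) == sizeof(int) * 3 * 10, "size of whole array");

// &matrix points to the whole two-dimensional array.
static_assert(std::is_same<decltype(&matrix), int (*)[3][10]>::value,
              "pointer to whole array");

// *arrPtr is an lvalue of type int[10] (decltype reports it as int(&)[10]).
static_assert(std::is_same<decltype(*arrPtr), int (&)[10]>::value, "one row");
static_assert(sizeof(*arrPtr) == sizeof(int) * 10, "size of one row");

// Write through the pointer, then read the same element through the array.
inline int write_first(int value) {
    (*arrPtr)[0] = value;
    return matrix[0][0];
}
```

Calling write_first(5) returns 5, because (*arrPtr)[0] and matrix[0][0] name the same int.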
I think you need to clarify your question. If you mean what is the value of printf("%p", (void *)arrPtr); then it will be the address of the array. If you mean printf("%i", (*arrPtr)[0]); then we've got a more meaty question.
In C, an array is not itself a pointer, although it often behaves like one: in most expressions an array is converted ("decays") to a pointer to its first element. So an int [10] decays to an int *, i.e. the location in memory of an int, and an int [3][10] decays to an int (*)[10], a pointer to a row of 10 ints, not to an int **. No pointer is stored anywhere for the array itself; the 30 ints of matrix simply sit contiguously in memory.