CUDA signed 128-bit multiplication error

I think I've discovered a problem when doing 128-bit signed multiplication in CUDA PTX.
Here is my sample code:
long long result_lo, result_hi;
asm(" mul.lo.s64 %0, 0, -1; \n\t"   // 0 * -1 = 0
    " mul.hi.s64 %1, 0, -1; \n\t"
    : "=l"(result_lo), "=l"(result_hi));
This should produce the result result_lo = 0x0, result_hi = 0x0. However, it produces result_lo = 0x0, result_hi = 0xFFFFFFFFFFFFFFFF, which taken as a signed 128-bit value is -2^64 if I'm not mistaken, and clearly not zero.
First off, I want to make sure my understanding is correct, but moreso, is there a way around this?
Update: Changing from Debug mode to Release mode fixes this issue; still wondering if this is a bug in CUDA?
Update 2
Reported this bug to NVIDIA
Used CUDA Toolkit 7.5 with Visual Studio 2013, x64 Debug, sm_52, compute_52.

TL;DR This appears to be a bug in the emulation of the PTX instruction mul.hi.s64 that is specific to sm_5x platforms, so filing a bug report with NVIDIA is the recommended course of action.
Generally, NVIDIA GPUs are 32-bit architectures, so all 64-bit integer instructions require emulation sequences. In the particular case of 64-bit integer multiplies, for sm_2x and sm_3x platforms, these are constructed from the machine code instruction IMAD.U32, which is a 32-bit integer multiply-add instruction.
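To illustrate the kind of decomposition such an emulation sequence performs (a conceptual sketch in plain C, not the actual IMAD.U32 sequence the compiler emits):
// Conceptual sketch: the low 64 bits of a 64x64-bit product assembled
// from 32-bit partial products (not the compiler's actual SASS sequence).
unsigned long long mul64_lo_from_32(unsigned long long a, unsigned long long b)
{
    unsigned int a_lo = (unsigned int)a, a_hi = (unsigned int)(a >> 32);
    unsigned int b_lo = (unsigned int)b, b_hi = (unsigned int)(b >> 32);
    unsigned long long lo  = (unsigned long long)a_lo * b_lo;
    unsigned long long mid = (unsigned long long)a_lo * b_hi
                           + (unsigned long long)a_hi * b_lo;
    return lo + (mid << 32);   // carries into the upper 64 bits are discarded here
}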
For the Maxwell architecture (that is, sm_5x), a high-throughput, but lower-width, integer multiply-add instruction XMAD was introduced, although a low-throughput legacy 32-bit integer multiply IMUL was apparently retained. Inspection of disassembled machine code generated for sm_5x by the CUDA 7.5 toolchain with cuobjdump --dumpsass shows that for ptxas optimization level -O0 (which is used for debug builds), the 64-bit multiplies are emulated with the IMUL instruction, while for optimization level -O1 and higher XMAD is used. I cannot think of a reason why two fundamentally different emulation sequences are employed.
As it turns out, the IMUL-based emulation for mul.hi.s64 for sm_5x is broken while the XMAD-based emulation works fine. Therefore, one possible workaround is to utilize an optimization level of at least -O1 for ptxas, by specifying -Xptxas -O1 on the nvcc command line. Note that release builds use -Xptxas -O3 by default, so no corrective action is necessary for release builds.
From code analysis, the emulation for mul.hi.s64 is implemented as a wrapper around the emulation for mul.hi.u64, and this latter emulation seems to work fine on all platforms including sm_5x. Thus another possible workaround is to use our own wrapper around mul.hi.u64. Coding with inline PTX is unnecessary in this case, since mul.hi.s64 and mul.hi.u64 are accessible via the device intrinsics __mul64hi() and __umul64hi(). As can be seen from the code below, the adjustments to convert a result from unsigned to signed multiplication are fairly trivial.
long long int m1, m2, result;
#if 0 // broken on sm_5x at optimization level -O0
    asm(" mul.hi.s64 %0, %1, %2; \n\t"
        : "=l"(result)
        : "l"(m1), "l"(m2));
#else
    result = __umul64hi (m1, m2);
    if (m1 < 0LL) result -= m2;
    if (m2 < 0LL) result -= m1;
#endif
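Packaged as a self-contained device function, the workaround might look like this (a minimal sketch; my_mul64hi is just an illustrative name):
// Signed 64x64->high-64 multiply built on the unsigned intrinsic
// (workaround sketch for the broken mul.hi.s64 emulation).
__device__ long long my_mul64hi(long long a, long long b)
{
    long long result = __umul64hi(a, b);   // treat operands as unsigned
    if (a < 0LL) result -= b;              // adjust for negative a
    if (b < 0LL) result -= a;              // adjust for negative b
    return result;
}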

Related

Merit of inline-ASM rounding via putting float into int variable

I have inherited a pretty interesting piece of code:
inline int round(float a)
{
    int i;
    __asm {
        fld a
        fistp i
    }
    return i;
}
My first impulse was to discard it and replace calls with (int)std::round (pre-C++11, would use std::lround if it happened today), but after a while I started to wonder if it might have some merit after all...
The use case for this function are all values in [-100, 100], so even int8_t would be wide enough to hold the result. fistp requires at least a 32 bit memory variable, however, so less than int32_t is just as wasted as more.
Now, quite obviously casting the float to int is not the fastest way to do things, as for that the rounding mode has to be switched to truncate, as per the standard, and back afterwards. C++11 offers the std::lround function, which alleviates this particular issue, but still does seem to be more wasteful, considering that the value passes float->long->int instead of directly arriving where it should.
On the other hand, with inline-ASM in the function, the compiler cannot optimise away i into a register (and even if it could, fistp expects a memory variable), so std::lround does not seem too much worse...
The most pressing question I have is however how safe it is to assume (as this function does), that the rounding mode will always be round-to-nearest, as it obviously does (no checks). As std::lround has to guarantee a certain behaviour independent of rounding mode, this assumption, as long as it holds, always seems to make the inline-ASM round the better option.
It is furthermore highly unclear to me whether the rounding mode set by std::fesetround and used by the std::lround alternative std::lrint and the rounding mode employed in the fistp ASM-instruction are guaranteed to be the same or at least synchronous.
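For what it's worth, the current rounding mode can at least be queried from standard C++ via <cfenv>; a minimal check (this snippet is mine, not part of the inherited code):
#include <cfenv>

// Returns true if the current FP rounding mode is round-to-nearest,
// which is what the fistp-based round() silently assumes.
bool rounding_is_nearest()
{
    return std::fegetround() == FE_TONEAREST;
}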
These are my considerations, aka what I do not know to make an informed decision on retaining or replacing the function.
Now to the questions:
Following a more informed view of these considerations or such which I have not thought of, does it seem advisable to use this function?
How great is the risk, if any?
Does reasoning exist for why it would not be faster than std::lround or std::lrint?
Can it be further improved without performance cost?
Does any of this reasoning change if the program were compiled for x86-64?
TL;DR: use lrintf(x) or (int)nearbyintf(x), depending on which one your compiler likes better.
Check the asm to see which one inlines when SSE4.1 is available (e.g. -march=nehalem or penryn, or later), with or without -ffast-math. You may need -fno-math-errno to get GCC to inline sometimes, but clang inlines anyway. This is 100% safe unless you actually expect lrintf or sqrtf or other math functions to set errno, and is generally recommended along with -fno-trapping-math.
Don't use inline asm when you can possibly avoid it. Compilers don't "understand" what it does, so they can't optimize through it. e.g. If that function is inlined somewhere that makes its argument a compile-time constant, it will still fld a constant and fistp it to memory, then load that back into an integer register. Pure C will let the compiler propagate the constant and just mov r32, imm32, or further propagate the constant and fold it into something else. Not to mention CSE, and hoisting the conversion out of a loop. (MSVC inline asm doesn't let you specify that an asm block is a pure function, and only needs to be run if the output value is needed, and that it doesn't depend on a global. GNU C inline asm does allow that part, but it's still a bad choice for this because it's not transparent to the compiler).
The GCC wiki even has a page on this subject, explaining the same things as my previous paragraph (and more), so inline asm should definitely be a last resort.
In this case, we can get the compiler to emit good code from pure C, so we should absolutely do that.
Float->int with the current rounding mode only takes a single machine instruction (see below), but the trick is to get a compiler to emit it (and only it). Getting math-library functions to inline can be tricky, because some of them have to set errno and/or raise an inexact exception in certain cases. (-fno-math-errno can help, if you can't use the full -ffast-math or the MSVC equivalent)
With some compilers (gcc but not clang), lrintf is good. It isn't ideal, though: float->long->int isn't the same as directly to int when they're not the same size. The x86-64 SystemV ABI (used by everything except Windows) has 64bit long.
64bit long changes the overflow semantics for lrint: instead of getting 0x80000000 (on x86 with SSE instructions), you'll get the low 32bits of the long (which will be all-zero if the value was outside the range of a long).
This lrintf won't auto-vectorize (unless maybe the compiler can prove that the floats will be in-range), because there are only scalar, not SIMD, instructions to convert floats or double to packed 64bit integers (until AVX512DQ). IDK of a C math library function to convert directly to int, but you can use (int)nearbyintf(x), which does auto-vectorize more easily in 64bit code. See the section below for how well gcc and clang do with that.
Other than defeating auto-vectorization, though, there's no direct speed penalty for cvtss2si rax, xmm0 on any modern microarchitecture (see Agner Fog's insn tables). It just costs an extra instruction byte for the REX prefix.
On AArch64 (aka ARM64), gcc4.8 compiles lround into a single fcvtas x0, s0 instruction, so I guess ARM64 provides that funky rounding mode in hardware (but x86 doesn't). Strangely, -ffast-math makes fewer functions inline, but that's with clunky old gcc4.8. For ARM (not 64), gcc4.8 doesn't inline anything, even with -mfloat-abi=hard -mhard-float -march=armv7-a. Maybe those aren't the right options; IDK ARM very well :/
If you have a lot of conversions to do, you can manually vectorize for x86 with SSE / AVX intrinsics, like _mm_cvtps_epi32 (cvtps2dq), and even pack the resulting 32bit integer elements down to 16 or 8 bit (with packssdw). However, using pure C that the compiler can auto-vectorize is a good plan, because it's portable.
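For example, a manually vectorized conversion loop might look roughly like this (a sketch, assuming the element count is a multiple of 4 and the values are in range for int32):
#include <stddef.h>
#include <immintrin.h>

// Convert n floats to int32 using the current rounding mode, 4 at a time.
// Sketch only: assumes n % 4 == 0 and in-range values.
void convert_floats(const float *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128  f = _mm_loadu_ps(in + i);          // load 4 floats
        __m128i v = _mm_cvtps_epi32(f);            // cvtps2dq: round with current mode
        _mm_storeu_si128((__m128i *)(out + i), v); // store 4 int32 results
    }
}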
lrintf
#include <math.h>
int round_to_nearest(float f) { // default mode is always nearest
return lrintf(f);
}
Compiler output from the Godbolt Compiler explorer:
########### Without -ffast-math #############
cvtss2si eax, xmm0 # gcc 6.1 (-O3 -mx32, so long is 32bit)
cvtss2si rax, xmm0 # gcc 4.4 through 6.1 (-O3). can't auto-vectorize, though.
jmp lrintf # clang 3.8 (-O3 -msse4.1), still tail-calls the function :/
###### With -ffast-math #########
jmp lrintf # clang 3.8 (-O3 -msse4.1 -ffast-math)
So clearly clang doesn't do well with it, but even ancient gcc is great, and does a good job even without -ffast-math.
Don't use roundf/lroundf: it has non-standard rounding semantics (halfway cases away from 0, instead of to even). This leads to worse x86 asm, but actually better ARM64 asm. So maybe do use it for ARM? It does have fixed rounding behaviour, though, instead of using the current rounding mode.
If you want the return value as a float, instead of converting to int, it may be better to use nearbyintf. rint has to raise the FP inexact exception when output != input. (But SSE4.1 roundss can implement either behaviour with bit 3 of its immediate control byte).
truncating nearbyint() to int directly.
#include <math.h>
int round_to_nearest(float f) {
return nearbyintf(f);
}
Compiler output from the Godbolt Compiler explorer.
######## With -ffast-math ############
cvtss2si eax, xmm0 # gcc 4.8 through 6.1 (-O3 -ffast-math)
# clang is dumb and won't fold the roundss into the cvt. Without sse4.1, it's a function call
roundss xmm0, xmm0, 12 # clang 3.5 to 3.8 (-O3 -ffast-math -msse4.1)
cvttss2si eax, xmm0
roundss xmm1, xmm0, 12 # ICC13 (-O3 -msse4.1 -ffast-math)
cvtss2si eax, xmm1
######## WITHOUT -ffast-math ############
sub rsp, 8
call nearbyintf # gcc 6.1 (-O3 -msse4.1)
add rsp, 8 # and clang without -msse4.1
cvttss2si eax, xmm0
roundss xmm0, xmm0, 12 # clang3.2 and later (-O3 -msse4.1)
cvttss2si eax, xmm0
roundss xmm1, xmm0, 12 # ICC13 (-O3 -msse4.1)
cvtss2si eax, xmm1
Gcc 4.7 and earlier: just cvttss2si without -msse4.1, but emits a roundss if SSE4.1 is available. Its nearbyint definition must be using inline asm, because the asm syntax is broken in Intel-syntax output. That's probably how the roundss gets inserted and then not optimized away when the result is converted to int.
How it works in asm
Now, quite obviously casting the float to int is not the fastest way to do things, as for that the rounding mode has to be switched to truncate, as per the standard, and back afterwards.
That's only true if you're targeting 20-year-old CPUs without SSE. (You said float, not double, so we only need SSE, not SSE2. The most recent CPUs without SSE2 were the Athlon XP line, which do have SSE.)
Modern systems do floating point in xmm registers. SSE has instructions to convert a scalar float to signed int with truncation (cvttss2si) or with the current rounding mode (cvtss2si). (Note the extra t for Truncate in the first one. The rest of the mnemonic is Convert Scalar Single-precision To Signed Integer.) There are similar instructions for double, and x86-64 allows the destination to be a 64bit integer register.
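In C terms, the two instructions map onto the two conversions under discussion (a tiny illustration; the values assume the default round-to-nearest mode):
#include <math.h>

void demo(void)
{
    float f = 2.7f;
    int  truncated = (int)f;     // 2: truncation, compiles to cvttss2si
    long rounded   = lrintf(f);  // 3: current rounding mode, compiles to cvtss2si
    (void)truncated; (void)rounded;
}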
See also the x86 tag wiki.
cvttss2si basically exists because of C's default behaviour for casting float to int, which truncates. Changing the rounding mode is slow, so Intel provided a way to do it that doesn't suck.
I think even 32bit versions of modern Windows require hardware new enough to have SSE2, in case that matters to anyone. (SSE2 is part of the AMD64 ISA, and the 64bit calling conventions even pass float / double args in xmm registers).

Counting registers/thread in a CUDA kernel

The nSight profiler tells me that the following kernel uses 52 registers per thread:
//Just the first lines of the kernel.
__global__ void voles_kernel(float *params, int *ctrl_params,
                             float dt, float currTime,
                             float *dev_voles, float *dev_weasels,
                             curandStateMtgp32 *state)
{
    __shared__ float dev_params[9];
    __shared__ int BuYeSimStep[4];
    if (threadIdx.x < 4)
    {
        BuYeSimStep[threadIdx.x] = ctrl_params[threadIdx.x];
    }
    if (threadIdx.x < 9) {
        dev_params[threadIdx.x] = params[threadIdx.x];
    }
    __syncthreads();
    float currVole = curand_uniform(&state[blockIdx.x]) + 3.0;
    float currWeas = curand_uniform(&state[blockIdx.x]) + 0.1;
    float oldVole = currVole;
    float oldWeas = currWeas;
    int jj;
    if (blockIdx.x * blockDim.x + threadIdx.x < BuYeSimStep[2])
    {
        int dayIndex = 0;
        /* Not declaring any new variable from here on, just doing arithmetics.
           ....... */
If each register has 4 bytes I don't understand how we get to 52 registers, even assuming that the arrays params[9] and ctrl_params[4] end up in registers (in which case using shared memory as I did doesn't make sense). I would like to increase occupancy, but I don't get why I'm using so many registers. Any ideas?
It's generally difficult to look at C code and predict the register usage from it. The compiler may aggressively optimize code by increasing register usage, perhaps to save an instruction here or there. You seem to be making an assumption that register usage can be predicted from your C code variable allocations, and while there is some connection between the two, you cannot assume register usage can be computed directly from C code variable allocations.
Since you haven't provided your complete code, nobody can really account for the register usage in detail. If you want to understand it better, you will need to look at the PTX code directly: compile your code using nvcc with the -ptx switch and inspect the resultant .ptx file. You may wish to refer to the PTX documentation as well as the nvcc documentation to look at the various compiler options.
Without the complete code it's not really possible to make direct suggestions, but you may be able to reduce register usage by reducing constant usage, reducing or refactoring arithmetic, switching from double to float, and so on. Register usage will also be affected if you are passing the -G switch to the compiler.
You can limit the compiler's usage of registers per thread by passing the -maxrregcount switch to nvcc with an appropriate parameter, such as -maxrregcount 20, which will instruct the compiler to limit itself to 20 registers per thread. This tactic may not give good results, and you may need to tune the parameter; with luck you will find a value that improves occupancy without sacrificing too much raw performance. If you constrain the compiler too much, it will begin to spill its needed registers to local memory, which will generally reduce performance.
You should also be aware that you can pass -Xptxas -v to nvcc which will give useful output about the compiler's register usage and other related data (spilling, etc.) at compile time.
If you want to increase occupancy, a direct way is to use the -maxrregcount compiler flag to restrict register usage, but this may cost performance because some registers will be spilled to local memory, which is very slow.
I suggest you debug your code with Eclipse Nsight.
Create a breakpoint at the first line of your kernel and step to it.
In the Debug Perspective, inside the CUDA Thread, you have the current stack trace. Right-click on the stack and click on "Instruction Stepping Mode". The "Disassembly" window will show your kernel's assembly. You can continue stepping in your kernel to track the correlation between your source code and the assembly, so you can discover what each register is used for.

OpenCL version of cudaMemcpyToSymbol & optimization

Can someone tell me the OpenCL equivalent of cudaMemcpyToSymbol for copying data to __constant memory on the device and getting it back to the host?
Or will the usual clEnqueueWriteBuffer(...) do the job?
Could not find much help in the forum. A few lines of demo code will suffice.
Also, should I expect the same kind of constant-cache optimization in OpenCL as in CUDA?
Thanks
I have seen people use cudaMemcpyToSymbol() for setting up constants in the kernel, and the compiler could take advantage of those constants when optimizing the code. If one were to set up a memory buffer in OpenCL to pass such constants to the kernel, the compiler could not use them to optimize the code.
Instead the solution I found is to replace the cudaMemcpyToSymbol() with a print to a string that defines the symbol for the compiler. The compiler can take definitions in the form of -D FOO=bar for setting the symbol FOO to the value bar.
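A minimal host-side sketch of that approach (FOO, the value, and the helper name are placeholders, not from the original code):
#include <stdio.h>
#include <CL/cl.h>

// Sketch: bake a constant into the kernel at build time via -D
// instead of copying it through a buffer. Error handling omitted.
cl_program build_with_constant(cl_context context, cl_device_id device,
                               const char *source, int foo_value)
{
    char build_opts[64];
    snprintf(build_opts, sizeof(build_opts), "-D FOO=%d", foo_value);

    cl_int err;
    cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
    err = clBuildProgram(program, 1, &device, build_opts, NULL, NULL);
    (void)err;
    // Inside the kernel source, FOO can now be used like a compile-time constant.
    return program;
}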
Not sure about OpenCL.Net, but in plain OpenCL: yes, clEnqueueWriteBuffer is enough (just remember to create the buffer with the CL_MEM_READ_ONLY flag set).
Here is a demo from Nvidia GPU Computing SDK (OpenCL/src/oclQuasirandomGenerator/oclQuasirandomGenerator.cpp):
c_Table[i] = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY,
                            QRNG_DIMENSIONS * QRNG_RESOLUTION * sizeof(unsigned int),
                            NULL, &ciErr);
ciErr |= clEnqueueWriteBuffer(cqCommandQueue[i], c_Table[i], CL_TRUE, 0,
                              QRNG_DIMENSIONS * QRNG_RESOLUTION * sizeof(unsigned int),
                              tableCPU, 0, NULL, NULL);
Constant memory in CUDA and in OpenCL is exactly the same, and provides the same type of optimization. That is, if you use an NVIDIA GPU. On ATI GPUs it should act similarly. And I doubt that constant memory would give you any benefit over global memory when run on a CPU.
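On the kernel side, the buffer written with clEnqueueWriteBuffer above is simply declared with the __constant address space qualifier (a minimal sketch; the kernel name and access pattern are illustrative):
// Reads from c_Table go through the constant cache on NVIDIA GPUs.
__kernel void use_table(__constant unsigned int *c_Table,
                        __global float *output)
{
    size_t gid = get_global_id(0);
    output[gid] = (float)c_Table[gid % 16];   // illustrative access
}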

Thrust sort_by_key issue when using zip_iterator values

I'm trying to use zip_iterator with sort_by_key() in CUDA and the values inside the zip_iterator are not getting re-ordered during the sort (the positions of the data stay the same as they were originally).
Example code:
typedef thrust::device_vector<int> IntVec;
IntVec keyVec(100);
IntVec fooVec(100);
IntVec barVec(100);
for (int z = 0; z < 100; z++)
{
    keyVec[z] = rand();
    fooVec[z] = z;
    barVec[z] = z;
}
thrust::sort_by_key(keyVec.begin(), keyVec.end(),
                    thrust::make_zip_iterator(make_tuple(fooVec.begin(), barVec.begin())));
What I expect this code to do is sort based on the value in keyVec (which it does properly) while maintaining the order of fooVec and barVec. Is this not what sort_by_key does? Does sort_by_key work with zip_iterators? Am I doing something incorrect when setting up/pulling the data from the zip_iterator? If this method is incorrect, what is the proper method to keep value ordering?
EX:
key,foo,bar (presort)
3,1,1
2,2,2
...
key,foo,bar (what I expect post sort)
2,2,2
3,1,1
...
key,foo,bar (what I actually get)
2,1,1
3,2,2
...
Using Thrust that ships with CUDA 4.1
System Details:
OS: RHEL 6.0 x86_64
CUDA Version: 4.1 (also tested with 4.1.1.5)
Thrust Version: 1.5
GPU: 4x nVidia Corporation GF100 [GeForce GTX 480] (rev a3)
nvidia driver: 290.10
nvcc version: release 4.1, V0.2.1221
compile string: nvcc testfile.cu
UPDATE:
Still cannot get sort_by_key() to work with zip_iterators but it works correctly with a standard thrust::device_vector<>.begin() iterator.
thrust::sort_by_key should be able to sort a zip_iterator in the manner of your example.
I've not been able to reproduce the behavior you describe on any of several different platforms, but it's possible there's something unique about your system which causes an issue.
You should post the contents of testfile.cu and the details of your system to Thrust's bug tracker on Google Code so the developers can take a closer look.
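For reference, a self-contained version of the test, essentially the code from the question wrapped in main() (this is the kind of testfile.cu worth attaching to the report):
// Minimal repro attempt: sort_by_key with a zip_iterator of values.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <cstdlib>
#include <iostream>

int main()
{
    typedef thrust::device_vector<int> IntVec;
    IntVec keyVec(100), fooVec(100), barVec(100);
    for (int z = 0; z < 100; z++)
    {
        keyVec[z] = rand();
        fooVec[z] = z;
        barVec[z] = z;
    }
    thrust::sort_by_key(keyVec.begin(), keyVec.end(),
                        thrust::make_zip_iterator(
                            thrust::make_tuple(fooVec.begin(), barVec.begin())));
    for (int z = 0; z < 5; z++)   // print the first few (key, foo, bar) triples
        std::cout << keyVec[z] << "," << fooVec[z] << "," << barVec[z] << std::endl;
    return 0;
}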

clock() in OpenCL

I know that there is a clock() function in CUDA that you can call in kernel code to query the GPU time. But I wonder if such a thing exists in OpenCL? Is there any way to query the GPU time in OpenCL? (I'm using NVIDIA's toolkit.)
There is no OpenCL way to query clock cycles directly. However, OpenCL does have a profiling mechanism that exposes incremental counters on compute devices. By comparing the differences between ordered events, elapsed times can be measured. See clGetEventProfilingInfo.
Just for others coming here for help: a short introduction to profiling kernel runtime with OpenCL
Enable profiling mode:
cmdQueue = clCreateCommandQueue(context, *devices, CL_QUEUE_PROFILING_ENABLE, &err);
Profiling kernel:
cl_event prof_event;
clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, &prof_event);
Read profiling data in:
cl_ulong ev_start_time=(cl_ulong)0;
cl_ulong ev_end_time=(cl_ulong)0;
clFinish(cmdQueue);
err = clWaitForEvents(1, &prof_event);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &ev_start_time, NULL);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &ev_end_time, NULL);
Calculate kernel execution time:
float run_time_gpu = (float)(ev_end_time - ev_start_time)/1000; // in usec
Profiling of individual work-items / work-groups is NOT possible yet.
You can set globalWorkSize = localWorkSize for profiling. Then you have only one workgroup.
Btw: profiling a single work-item (or a few work-items) isn't very helpful. With only a few work-items you won't be able to hide memory latencies or the overhead, so the measurements aren't meaningful.
Try this (only works with NVIDIA OpenCL, of course):
uint clock_time()
{
    uint clock_time;
    asm("mov.u32 %0, %%clock;" : "=r"(clock_time));
    return clock_time;
}
The NVIDIA OpenCL SDK has an example Using Inline PTX with OpenCL. The clock register is accessible through inline PTX as the special register %clock. %clock is described in PTX: Parallel Thread Execution ISA manual. You should be able to replace the %%laneid with %%clock.
I have never tested this with OpenCL but use it in CUDA.
Please be warned that the compiler may reorder or remove the register read.
On NVIDIA you can use the following:
typedef unsigned long uint64_t; // if you haven't done so earlier

inline uint64_t n_nv_Clock()
{
    uint64_t n_clock;
    asm volatile("mov.u64 %0, %%clock64;" : "=l" (n_clock)); // make sure the compiler will not reorder this
    return n_clock;
}
The volatile keyword tells the optimizer that you really mean it and don't want it moved / optimized away. This is a standard way of doing so both in PTX and e.g. in gcc.
Note that this returns clocks, not nanoseconds. You need to query the device clock frequency (using clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq), &freq, 0)). Also note that on older devices there are two frequencies (or three if you count the memory frequency, which is irrelevant in this case): the device clock and the shader clock. What you want is the shader clock.
With the 64-bit version of the register you don't need to worry about overflowing as it generally takes hundreds of years. On the other hand, the 32-bit version can overflow quite often (you can still recover the result - unless it overflows twice).
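A host-side sketch of the frequency query and the cycles-to-time conversion (names are illustrative; error handling omitted):
#include <CL/cl.h>

// Convert a difference of two %clock64 readings to microseconds.
// CL_DEVICE_MAX_CLOCK_FREQUENCY is reported in MHz; on older GPUs it may be
// the device clock rather than the shader clock you actually want (see above).
double clocks_to_us(cl_device_id device, cl_ulong n_clock_delta)
{
    cl_uint freq_mhz = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY,
                    sizeof(freq_mhz), &freq_mhz, NULL);
    return (double)n_clock_delta / (double)freq_mhz;   // cycles / MHz = microseconds
}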
Now, 10 years after the question was posted, I did some tests on NVIDIA hardware. I tried running the answers given by users 'Spectral' and 'the swine'. The answer given by 'Spectral' does not work; I always got the same invalid values returned by the clock_time function.
uint clock_time()
{
    uint clock_time;
    asm("mov.u32 %0, %%clock;" : "=r"(clock_time)); // this is wrong
    return clock_time;
}
After subtracting the start and end times I got zero.
So I had a look at the PTX assembly, which in PyOpenCL you can get this way:
kernel_string = """
your OpenCL code
"""
prg = cl.Program(ctx, kernel_string).build()
print(prg.binaries[0].decode())
It turned out that the clock command was optimized away! So there was no '%clock' instruction in the printed assembly.
Looking into Nvidia's PTX documentation I found the following:
'Normally any memory that is written to will be specified as an out operand, but if there is a hidden side effect on user memory (for example, indirect access of a memory location via an operand), or if you want to stop any memory optimizations around the asm() statement performed during generation of PTX, you can add a "memory" clobbers specification after a 3rd colon, e.g.:'
So the function that actually works is this:
uint clock_time()
{
    uint clock_time;
    asm volatile ("mov.u32 %0, %%clock;" : "=r"(clock_time) :: "memory");
    return clock_time;
}
The assembly contained lines like:
// inline asm
mov.u32 %r13, %clock;
// inline asm
The version given by 'the swine' also works.
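To round this off, a minimal kernel sketch showing how the working clock_time() might be used to time a section of device code (the body being timed is just a placeholder):
// OpenCL kernel sketch (NVIDIA only): per-work-item cycle count via inline PTX.
uint clock_time()
{
    uint t;
    asm volatile ("mov.u32 %0, %%clock;" : "=r"(t) :: "memory");
    return t;
}

__kernel void timed_kernel(__global float *data, __global uint *cycles)
{
    size_t gid = get_global_id(0);

    uint start = clock_time();
    data[gid] = data[gid] * 2.0f + 1.0f;   // placeholder work being timed
    uint end = clock_time();

    cycles[gid] = end - start;             // 32-bit counter, so wraps are possible
}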