Thrust sort_by_key issue when using zip_iterator values - cuda

I'm trying to use a zip_iterator as the values argument of sort_by_key() in CUDA, but the values inside the zip_iterator are not getting reordered during the sort (the data stays in its original positions).
Example code:
typedef thrust::device_vector<int> IntVec;
IntVec keyVec(100);
IntVec fooVec(100);
IntVec barVec(100);
for (int z = 0; z < 100; z++)
{
    keyVec[z] = rand();
    fooVec[z] = z;
    barVec[z] = z;
}
thrust::sort_by_key(keyVec.begin(), keyVec.end(),
                    thrust::make_zip_iterator(thrust::make_tuple(fooVec.begin(), barVec.begin())));
What I expect this code to do is sort based on the values in keyVec (which it does properly) while reordering fooVec and barVec so that each value stays paired with its key. Is this not what sort_by_key does? Does sort_by_key work with zip_iterators? Am I doing something incorrect when setting up or reading back the data from the zip_iterator? If this approach is incorrect, what is the proper way to keep the values paired with their keys?
EX:
key,foo,bar (presort)
3,1,1
2,2,2
...
key,foo,bar (what i expect post sort)
2,2,2
3,1,1
...
key,foo,bar (what i actually get)
2,1,1
3,2,2
...
Using Thrust that ships with CUDA 4.1
System Details:
OS: RHEL 6.0 x86_64
CUDA Version: 4.1 (also tested with 4.1.1.5)
Thrust Version: 1.5
GPU: 4x nVidia Corporation GF100 [GeForce GTX 480] (rev a3)
nvidia driver: 290.10
nvcc version: release 4.1, V0.2.1221
compile string: nvcc testfile.cu
UPDATE:
I still cannot get sort_by_key() to work with zip_iterators, but it works correctly with a plain thrust::device_vector iterator as the values argument.

thrust::sort_by_key should be able to sort a zip_iterator in the manner of your example.
I've not been able to reproduce the behavior you describe on any of several different platforms, but it's possible there's something unique about your system which causes an issue.
You should post the contents of testfile.cu and the details of your system to Thrust's bug tracker on Google Code so the developers can take a closer look.
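In the meantime, a minimal self-contained repro can help rule out a local toolchain problem. The sketch below is my own test harness (not the original testfile.cu); it copies the vectors back to the host and prints a few (key, foo, bar) triples so the before/after ordering can be checked directly:
// minimal repro sketch (assumed test harness, not the original testfile.cu)
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    const int N = 100;
    thrust::device_vector<int> keyVec(N), fooVec(N), barVec(N);
    for (int z = 0; z < N; z++)
    {
        keyVec[z] = rand();
        fooVec[z] = z;
        barVec[z] = z;
    }
    thrust::sort_by_key(keyVec.begin(), keyVec.end(),
                        thrust::make_zip_iterator(thrust::make_tuple(fooVec.begin(), barVec.begin())));
    // foo and bar should have been permuted together with the keys
    thrust::host_vector<int> k = keyVec, f = fooVec, b = barVec;
    for (int z = 0; z < 5; z++)
        printf("%d,%d,%d\n", k[z], f[z], b[z]);
    return 0;
}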

Related

Cuda signed 128-bit multiplication error

I think I have discovered a problem with 128-bit signed multiplication in CUDA PTX.
Here is my sample code:
long long result_lo, result_hi;
asm(" mul.lo.s64 %0, 0, -1; \n\t" // 0 * -1 = 0
    " mul.hi.s64 %1, 0, -1; \n\t"
    : "=l"(result_lo), "=l"(result_hi));
This should produce the result result_lo = 0x0, result_hi = 0x0. However, it produces result_lo = 0x0, result_hi = 0xFFFFFFFFFFFFFFFF, which (interpreted as a signed 128-bit value) is -2^64 if I'm not mistaken, and clearly not zero.
First off, I want to make sure my understanding is correct, but more importantly, is there a way around this?
Update: Changing from Debug mode to Release mode fixes this issue; I'm still wondering whether this is a bug in CUDA.
Update 2
Reported this bug to NVIDIA
Using CUDA Toolkit 7.5 with Visual Studio 2013; x64 Debug, sm_52, compute_52.
TL;DR This appears to be a bug in the emulation of the PTX instruction mul.hi.s64 that is specific to sm_5x platforms, so filing a bug report with NVIDIA is the recommended course of action.
Generally, NVIDIA GPUs are 32-bit architectures, so all 64-bit integer instructions require emulation sequences. In the particular case of 64-bit integer multiplies, for sm_2x and sm_3x platforms, these are constructed from the machine code instruction IMAD.U32, which is a 32-bit integer multiply-add instruction.
For the Maxwell architecture (that is, sm_5x), a high-throughput, but lower-width, integer multiply-add instruction XMAD was introduced, although a low-throughput legacy 32-bit integer multiply IMUL was apparently retained. Inspection of disassembled machine code generated for sm_5x by the CUDA 7.5 toolchain with cuobjdump --dump-sass shows that for ptxas optimization level -O0 (which is used for debug builds), the 64-bit multiplies are emulated with the IMUL instruction, while for optimization level -O1 and higher XMAD is used. I cannot think of a reason why two fundamentally different emulation sequences are employed.
As it turns out, the IMUL-based emulation for mul.hi.s64 for sm_5x is broken while the XMAD-based emulation works fine. Therefore, one possible workaround is to utilize an optimization level of at least -O1 for ptxas, by specifying -Xptxas -O1 on the nvcc command line. Note that release builds use -Xptxas -O3 by default, so no corrective action is necessary for release builds.
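For reference, the workaround might be applied on the command line roughly as follows (testprog is an assumed output name, testfile.cu the assumed source file, and -arch=sm_52 matches the platform from the question); the cuobjdump step is only there to confirm which emulation sequence was actually emitted:
nvcc -arch=sm_52 -Xptxas -O1 -o testprog testfile.cu
cuobjdump --dump-sass testprog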
From code analysis, the emulation for mul.hi.s64 is implemented as a wrapper around the emulation for mul.hi.u64, and this latter emulation seems to work fine on all platforms including sm_5x. Thus another possible workaround is to use our own wrapper around mul.hi.u64. Coding with inline PTX is unnecessary in this case, since mul.hi.s64 and mul.hi.u64 are accessible via the device intrinsics __mul64hi() and __umul64hi(). As can be seen from the code below, the adjustments to convert a result from unsigned to signed multiplication are fairly trivial.
long long int m1, m2, result;
#if 0 // broken on sm_5x at optimization level -O0
    asm(" mul.hi.s64 %0, %1, %2; \n\t"
        : "=l"(result)
        : "l"(m1), "l"(m2));
#else
    result = __umul64hi (m1, m2);
    if (m1 < 0LL) result -= m2;
    if (m2 < 0LL) result -= m1;
#endif
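Wrapped up as a reusable device function, the workaround might look like the sketch below (the name my_mul64hi is my own, not an official API); it relies only on the documented __umul64hi() intrinsic:
// sketch: high 64 bits of a signed 64x64-bit product, built on the unsigned intrinsic
__device__ long long int my_mul64hi (long long int a, long long int b)
{
    // mulhi_s(a,b) = mulhi_u(a,b) - (a < 0 ? b : 0) - (b < 0 ? a : 0)
    long long int result = (long long int)__umul64hi (a, b);
    if (a < 0LL) result -= b;
    if (b < 0LL) result -= a;
    return result;
}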

Multi-dimensional array copy OpenACC

I have a 2D matrix SIZE x SIZE, which I'm trying to copy to the GPU.
I allocate the matrix this way:
#define SIZE 1024
float (*a)[SIZE] = (float (*)[SIZE]) malloc(SIZE * SIZE * sizeof(float));
And I have this on my ACC region:
void mmul_acc(restrict float a[][SIZE],
              restrict float b[][SIZE],
              restrict float c[][SIZE]) {
    #pragma acc data copyin(a[0:SIZE][0:SIZE], b[0:SIZE][0:SIZE]) \
                     copyout(c[0:SIZE][0:SIZE])
    {
        ... code here ...
    }
}
When compiling with the PGI compiler, using -Minfo=acc, the compiler tells me:
Generating copyin(a[0:1024][0:])
What does a[0:1024][0:] mean? Why not a[0:1024][0:1024] ???
If instead of declaring matrices I declare arrays with size SIZE*SIZE, doing
#pragma acc copyin(a[0:SIZE*SIZE])
Generates the following compiler message
Generating copyin(a[0:16777216])
The code actually works the same way, same performance, same result.
Apparently the compiler generates the same code either way, as it should, but the message is not straightforward.
I'm using the PGI accelerator 12.8, in a Linux64 machine. I'm compiling with -Minfo=acc
Note: this question was edited and now it doesn't really make much sense, but maybe it can be useful to more people.
This issue is fixed in the latest PGI compiler, 12.9.0. The compiler now returns the following message:
Generating copyin(a[0:1024][0:1024])

How to run "host" functions on GPU with CUDA?

I want to run, for example, a strcmp function on the GPU, but I get:
error: calling a host function("strcmp") from a __device__/__global__ function("myKernel") is not allowed
It's possible that printf won't work because the GPU has no stdout, but functions like strcmp are expected to work! So, should I copy the implementation of strcmp from the library into my code with a __device__ prefix, or what?
CUDA has a standard library, documented in the CUDA programming guide. It includes printf() for devices that support it (Compute Capability 2.0 and higher), as well as assert(). It does not include a complete string or stdio library at this point, however.
Implementing your own standard library as Jason R. Mick suggests may be possible, but it is not necessarily advisable. In some cases, it may be unsafe to naively port functions from the sequential standard library to CUDA -- not least because some of these implementations are not meant to be thread safe (rand() on Windows, for example). Even if it is safe, it might not be efficient -- and it might not really be what you need.
In my opinion, you are better off avoiding standard library functions in CUDA that are not officially supported. If you need the behavior of a standard library function in your parallel code, first consider whether you really need it:
* Are you really going to do thousands of strcmp operations in parallel?
* If not, do you have strings to compare that are many thousands of characters long? If so, consider a parallel string comparison algorithm instead.
If you determine that you really do need the behavior of the standard library function in your parallel CUDA code, then consider how you might implement it (safely and efficiently) in parallel.
Hope this will help at least one person. Since the strcmp function is not available in CUDA, we have to implement it on our own:
__device__ int my_strcmp (const char * s1, const char * s2) {
    for ( ; *s1 == *s2; ++s1, ++s2)
        if (*s1 == 0)
            return 0;
    return *(unsigned char *)s1 < *(unsigned char *)s2 ? -1 : 1;
}
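As a quick usage sketch (the kernel name, MAX_LEN, and the buffer layout below are my own illustration, not from the answer), the device function can be called from a kernel like any other __device__ routine, with each thread comparing one fixed-width string against a reference:
#define MAX_LEN 32 // assumed fixed string width for illustration

__global__ void compareKernel(const char *strings, const char *reference,
                              int *results, int numStrings)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numStrings)
        results[i] = my_strcmp(strings + i * MAX_LEN, reference);
}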

clock() in opencl

I know that there is a clock() function in CUDA that you can put in kernel code to query the GPU time. But I wonder if such a thing exists in OpenCL? Is there any way to query the GPU time in OpenCL? (I'm using NVIDIA's toolkit.)
There is no OpenCL way to query clock cycles directly. However, OpenCL does have a profiling mechanism that exposes incremental counters on compute devices. By comparing the differences between ordered events, elapsed times can be measured. See clGetEventProfilingInfo.
Just for others coming here for help: a short introduction to profiling kernel runtime with OpenCL.
Enable profiling mode:
cmdQueue = clCreateCommandQueue(context, *devices, CL_QUEUE_PROFILING_ENABLE, &err);
Profiling kernel:
cl_event prof_event;
clEnqueueNDRangeKernel(cmdQueue, kernel, 1 , 0, globalWorkSize, NULL, 0, NULL, &prof_event);
Read profiling data in:
cl_ulong ev_start_time=(cl_ulong)0;
cl_ulong ev_end_time=(cl_ulong)0;
clFinish(cmdQueue);
err = clWaitForEvents(1, &prof_event);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &ev_start_time, NULL);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &ev_end_time, NULL);
Calculate kernel execution time:
float run_time_gpu = (float)(ev_end_time - ev_start_time)/1000; // in usec
Profiling of individual work-items / work-groups is NOT possible yet.
You can set globalWorkSize = localWorkSize for profiling. Then you have only one workgroup.
Btw: Profiling a single work-item (or just a few work-items) isn't very helpful. With only a few work-items you won't be able to hide memory latencies and overheads, so the measurements won't be meaningful.
Try this (only works with NVIDIA OpenCL, of course):
uint clock_time()
{
    uint clock_time;
    asm("mov.u32 %0, %%clock;" : "=r"(clock_time));
    return clock_time;
}
The NVIDIA OpenCL SDK has an example Using Inline PTX with OpenCL. The clock register is accessible through inline PTX as the special register %clock. %clock is described in PTX: Parallel Thread Execution ISA manual. You should be able to replace the %%laneid with %%clock.
I have never tested this with OpenCL but use it in CUDA.
Please be warned that the compiler may reorder or remove the register read.
On NVIDIA you can use the following:
typedef unsigned long uint64_t; // if you haven't done so earlier

inline uint64_t n_nv_Clock()
{
    uint64_t n_clock;
    asm volatile("mov.u64 %0, %%clock64;" : "=l" (n_clock)); // make sure the compiler will not reorder this
    return n_clock;
}
The volatile keyword tells the optimizer that you really mean it and don't want it moved / optimized away. This is a standard way of doing so both in PTX and e.g. in gcc.
Note that this returns clocks, not nanoseconds. You need to query the device clock frequency (using clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq), &freq, 0)). Also note that on older devices there are two frequencies (or three, if you count the memory frequency, which is irrelevant in this case): the device clock and the shader clock. What you want is the shader clock.
With the 64-bit version of the register you don't need to worry about overflowing as it generally takes hundreds of years. On the other hand, the 32-bit version can overflow quite often (you can still recover the result - unless it overflows twice).
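To turn the cycle count into a time, a minimal host-side sketch might look like the following (the variable names, and the assumption that the frequency reported by CL_DEVICE_MAX_CLOCK_FREQUENCY is the clock you actually want on your device, are mine, not from the answer):
cl_uint freq_mhz = 0; // CL_DEVICE_MAX_CLOCK_FREQUENCY is reported in MHz
clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq_mhz), &freq_mhz, 0);
// 'cycles' is assumed to be the clock difference read back from the kernel
double seconds = (double)cycles / ((double)freq_mhz * 1.0e6);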
Now, 10 years after the question was posted, I did some tests on NVIDIA hardware. I tried the answers given by the users 'Spectral' and 'the swine'. The answer given by 'Spectral' does not work; I always got the same invalid value returned by the clock_time function.
uint clock_time()
{
    uint clock_time;
    asm("mov.u32 %0, %%clock;" : "=r"(clock_time)); // this is wrong
    return clock_time;
}
After subtracting start and end time I got zero.
So I had a look at the PTX assembly, which in PyOpenCL you can get this way:
kernel_string = """
your OpenCL code
"""
prg = cl.Program(ctx, kernel_string).build()
print(prg.binaries[0].decode())
It turned out that the clock command was optimized away! So there was no '%clock' instruction in the printed assembly.
Looking into Nvidia's PTX documentation I found the following:
'Normally any memory that is written to will be specified as an out operand, but if there is a hidden side effect on user memory (for example, indirect access of a memory location via an operand), or if you want to stop any memory optimizations around the asm() statement performed during generation of PTX, you can add a "memory" clobbers specification after a 3rd colon, e.g.:'
So the function that actually works is this:
uint clock_time()
{
    uint clock_time;
    asm volatile ("mov.u32 %0, %%clock;" : "=r"(clock_time) :: "memory");
    return clock_time;
}
The assembly contained lines like:
// inline asm
mov.u32 %r13, %clock;
// inline asm
The version given by 'the swine' also works.
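For completeness, a usage sketch inside an OpenCL kernel might look like the following (the kernel name, the work being timed, and the output buffer are my own illustration; it assumes the working clock_time() function above is defined in the same program source). Note that this measures clock cycles per work-item, not wall-clock time:
__kernel void timed_kernel(__global float *data, __global uint *cycles)
{
    size_t i = get_global_id(0);
    uint start = clock_time();
    data[i] = data[i] * 2.0f + 1.0f; // the section being timed
    uint end = clock_time();
    cycles[i] = end - start;         // elapsed clock cycles for this work-item
}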