CUDA PTX code %envreg<32> special registers

I tried to run PTX assembly code generated from a .cl kernel with the CUDA driver API. The steps I took were these (standard OpenCL procedure):
1) Load the .cl kernel
2) JIT compile it
3) Get the compiled PTX code and save it.
So far so good.
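For reference, the extraction step (2-3) typically looks something like the sketch below, using the standard OpenCL program-info queries; on NVIDIA hardware the returned "binary" is PTX text, and the function and variable names here are just illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Sketch: after clBuildProgram() has succeeded for a single NVIDIA device,
   dump the "binary", which is actually PTX text. */
static void save_ptx(cl_program program, const char *path)
{
    size_t ptxSize = 0;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(ptxSize), &ptxSize, NULL);

    unsigned char *ptx = (unsigned char *)malloc(ptxSize + 1);
    unsigned char *binaries[1] = { ptx };
    clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(binaries), binaries, NULL);
    ptx[ptxSize] = '\0';

    FILE *f = fopen(path, "w");
    fwrite(ptx, 1, ptxSize, f);
    fclose(f);
    free(ptx);
}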
I noticed some special registers inside the PTX assembly, %envreg3, %envreg6, etc. The problem is that these registers are not set (according to the PTX ISA, they are set by the driver before the kernel launch) when I try to execute the code with the driver API. So the code falls into an infinite loop and fails to run correctly. But if I manually set the values (more precisely, I replace %envreg6 with the block size inside the PTX), the code executes and I get the correct results (correct compared with the CPU results).
Does anyone know how we can set values for these registers, or whether I am missing something, e.g. a flag on cuLaunchKernel that sets values for these registers?

You are trying to compile an OpenCL kernel and run it using the CUDA driver API. The NVIDIA driver/compiler interface is different between OpenCL and CUDA, so what you want to do is not supported and fundamentally cannot work.
Presumably, the only workaround would be the one you found: to patch the PTX code. But I'm afraid this might not work in the general case.
Edit:
Specifically, OpenCL supports larger grids than most NVIDIA GPUs support, so grid sizes need to be virtualized by dividing across multiple actual grid launches, and so offsets are necessary. Also in OpenCL, indices do not necessarily start from (0, 0, 0), the user can specify offsets which the driver must pass to the kernel. Therefore the registers initialized for OpenCL and CUDA C launches are different.
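For reference, a driver-API launch of module-loaded PTX looks roughly like the following sketch (cuInit/context creation omitted; the kernel name and parameter list are illustrative). Note that nothing in the cuLaunchKernel signature lets you supply values for the %envreg registers; they are filled in internally by the OpenCL driver path.

#include <cuda.h>

// Minimal sketch: load saved PTX text and launch one kernel from it.
// Assumes a CUDA context is already current on the calling thread.
void launch_ptx(const char *ptxText, CUdeviceptr d_buf, int n)
{
    CUmodule   mod;
    CUfunction fun;
    cuModuleLoadData(&mod, ptxText);
    cuModuleGetFunction(&fun, mod, "my_kernel");   // illustrative kernel name

    void *params[] = { &d_buf, &n };
    cuLaunchKernel(fun,
                   256, 1, 1,      // grid dimensions
                   128, 1, 1,      // block dimensions
                   0, NULL,        // shared memory bytes, stream
                   params, NULL);  // kernel parameters, extra options
    cuCtxSynchronize();
    cuModuleUnload(mod);
}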

Related

Trouble in knowing when the CUDA code gets compiled?

I want to know when the CUDA code gets compiled. I mean, is it possible to know, at compile time, the values of the CUDA kernel's parameters that are given as command-line arguments to the host code at run time? Is it possible to compile CUDA code during the run time of the host code?
In typical usage of the CUDA runtime API, CUDA device code gets compiled when you pass a file containing CUDA device code to nvcc, the CUDA compiler driver.
CUDA device code can also be compiled at run time, using either the driver API or the CUDA NVRTC mechanism. There is documentation for each of these approaches, CUDA sample codes for each, and various questions here on the cuda SO tag for each.
When you use the CUDA driver API, the device source code you will present for compilation at run-time is in the form of PTX, a CUDA intermediate language.
For compilation of typical CUDA C++ device code at runtime, you would use the NVRTC mechanism.
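A minimal sketch of the NVRTC path, compiling a CUDA C++ kernel string to PTX at run time (error checking omitted; the file name and architecture option are illustrative):

#include <cuda.h>
#include <nvrtc.h>
#include <string>
#include <vector>

// Compile a CUDA C++ kernel given as a source string at run time, return PTX.
std::string compile_to_ptx(const char *source)
{
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, source, "kernel.cu", 0, nullptr, nullptr);

    const char *opts[] = { "--gpu-architecture=compute_30" };  // pick the arch you target
    nvrtcCompileProgram(prog, 1, opts);

    size_t ptxSize = 0;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);
    return std::string(ptx.data());
}

The resulting PTX can then be loaded with cuModuleLoadData/cuModuleGetFunction and launched with cuLaunchKernel, just like PTX produced offline by nvcc.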

What do the %envregN special registers hold?

I've read CUDA PTX code %envreg<32> special registers. The poster there was satisfied with not trying to treat OpenCL-originating PTX as regular CUDA PTX. But their question about the %envregN registers was not properly answered.
Mark Harris wrote that
OpenCL supports larger grids than most NVIDIA GPUs support, so grid sizes need to be virtualized by dividing across multiple actual grid launches, and so offsets are necessary. Also in OpenCL, indices do not necessarily start from (0, 0, 0), the user can specify offsets which the driver must pass to the kernel. Therefore the registers initialized for OpenCL and CUDA C launches are different.
So, do the %envregN registers make up the "virtual grid index"? And what does each of these registers hold?
The extent of the answer that can be authoritatively given is what is in the PTX documentation:
A set of 32 pre-defined read-only registers used to capture execution environment of PTX program outside of PTX virtual machine. These registers are initialized by the driver prior to kernel launch and can contain cta-wide or grid-wide values.
Anything beyond that would have to be:
- discovered via reverse engineering or disclosed by someone with authoritative/unpublished knowledge
- subject to change (being undocumented)
- evidently under the control of the driver, which means that for a different driver (e.g. CUDA vs. OpenCL) the contents and/or interpretation might be different
If you think that NVIDIA documentation should be improved in any way, my suggestion would be to file a bug.

How to view CUDA library function calls in profiler?

I am using the cuFFT library. How do I modify my code to see the function calls from this library (or any other CUDA library) in the NVIDIA Visual Profiler (NVVP)? I am using Windows and Visual Studio 2013.
Below is my code. I convert my image and filter to the Fourier domain, perform point-wise complex matrix multiplication in a custom CUDA kernel I wrote, and then perform the inverse DFT on the filtered image's spectrum. The results are accurate, but I am not able to figure out how to view the cuFFT functions in the profiler.
// Execute FFT Plans
cufftExecR2C(fftPlanFwd, (cufftReal *)d_in, (cufftComplex *)d_img_Spectrum);
cufftExecR2C(fftPlanFwd, (cufftReal *)d_filter, (cufftComplex *)d_filter_Spectrum);
// Perform complex pointwise multiplication on filter spectrum and image spectrum
pointWise_complex_matrix_mult_kernel<<<grid, block>>>(d_img_Spectrum, d_filter_Spectrum, d_filtered_Spectrum, ROWS, COLS);
// Execute FFT^-1 Plan
cufftExecC2R(fftPlanInv, (cufftComplex *)d_filtered_Spectrum, (cufftReal *)d_out);
At the entry point to the library, the library call is like any other call into a C or C++ library: it is executing on the host. Within that library call, there may be calls to CUDA kernels or other CUDA API functions, for a CUDA GPU-enabled library such as CUFFT.
The profilers (at least up through CUDA 7.0 - see note about CUDA 7.5 nvprof below) don't natively support the profiling of host code. They are primarily focused on kernel calls and CUDA API calls. A call into a library like CUFFT by itself is not considered a CUDA API call.
You haven't shown a complete profiler output, but you should see the CUFFT library make CUDA kernel calls; these will show up in the profiler output. The first two CUFFT calls prior to your pointWise_complex_matrix_mult_kernel should have one or more kernel calls each that show up to the left of that kernel, and the last CUFFT call should have one or more kernel calls that show up to the right of that kernel.
One possible way to get specific sections of host code to show up in the profiler is to use the NVTX (NVIDIA Tools Extension) library to annotate your source code, which will cause those annotations to show up in the profiler output. You might want to put an NVTX range event around the library call you wish to see identified in the profiler output.
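For example, a sketch of the NVTX approach applied to the calls shown in the question (the range names are arbitrary; you need to include nvToolsExt.h and link against the nvToolsExt library):

#include <nvToolsExt.h>

// Wrap the library calls in named ranges so they appear as labelled
// intervals on the profiler timeline.
nvtxRangePushA("forward FFTs");
cufftExecR2C(fftPlanFwd, (cufftReal *)d_in, (cufftComplex *)d_img_Spectrum);
cufftExecR2C(fftPlanFwd, (cufftReal *)d_filter, (cufftComplex *)d_filter_Spectrum);
nvtxRangePop();

nvtxRangePushA("pointwise multiply");
pointWise_complex_matrix_mult_kernel<<<grid, block>>>(d_img_Spectrum, d_filter_Spectrum, d_filtered_Spectrum, ROWS, COLS);
nvtxRangePop();

nvtxRangePushA("inverse FFT");
cufftExecC2R(fftPlanInv, (cufftComplex *)d_filtered_Spectrum, (cufftReal *)d_out);
nvtxRangePop();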
Another approach would be to try out the new CPU profiling features in nvprof in CUDA 7.5. You can refer to section 3.4 of the Profiler guide that ships with CUDA 7.5RC.
Finally, ordinary host profilers should be able to profile your CUDA application, including CUFFT library calls, but they won't have any visibility into what is happening on the GPU.
EDIT: Based on discussion in the comments below, your code appears to be similar to the simpleCUFFT sample code. When I compile and profile that code on Win7 x64, VS 2013 Community, and CUDA 7, I get the following output (zoomed in to depict the interesting part of the timeline):
You can see that there are CUFFT kernels being called both before and after the complex pointwise multiply and scale kernel that appears in that code. My suggestion would be to start by doing something similar with the simpleCUFFT sample code rather than your own code, and see if you can duplicate the output above. If so, the problem lies in your code (perhaps your CUFFT calls are failing, perhaps you need to add proper error checking, etc.)
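For example, a minimal sketch of error checking around the CUFFT calls in the question; every CUFFT call returns a cufftResult, and anything other than CUFFT_SUCCESS indicates a failure (the CHECK_CUFFT macro is just an illustrative helper):

#include <cufft.h>
#include <cstdio>

// Check every CUFFT call; a failing plan or exec call would also explain
// missing kernels on the profiler timeline.
#define CHECK_CUFFT(call)                                          \
    do {                                                           \
        cufftResult _s = (call);                                   \
        if (_s != CUFFT_SUCCESS) {                                 \
            fprintf(stderr, "CUFFT error %d at %s:%d\n",           \
                    (int)_s, __FILE__, __LINE__);                  \
        }                                                          \
    } while (0)

// Usage with the calls from the question:
// CHECK_CUFFT(cufftExecR2C(fftPlanFwd, (cufftReal *)d_in, (cufftComplex *)d_img_Spectrum));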

Is there a macro to define the max number of registers that a particular kernel can use?

I wrote a CUDA library. Is there any way to put a register cap on a specific library kernel instead of putting a register cap on all kernels within the library?
At the C code level there is not. You can use the __launch_bounds__ keyword to specify the expected upper limit for threads per block, which can result in an upper register-per-thread limit during compilation. Alternatively, if you compile to PTX, you can add the .maxnreg directive to the kernel preamble.
CUDA 5 now supports separate compilation and has a device code linker, so it should also be possible to compile kernels to different device object files using different compiler arguments and then link them into your library object.
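A minimal sketch of the per-kernel approach with __launch_bounds__ (the kernel and the bound values are illustrative); the per-file alternative mentioned above would be to compile that translation unit with its own -maxrregcount setting:

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) gives the
// compiler a per-kernel occupancy target, which indirectly caps registers per
// thread for this kernel only.
__global__ void
__launch_bounds__(256, 4)
capped_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}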

Can GPU counters be read transparently to the application code?

I am trying to profile the CUDA Rodinia benchmarks executing on a GTX 650. I am using the code in /usr/local/cuda-5.0/extras/CUPTI/samples/event_sampling to read the instructions-executed counter. It seems strange that I do not see any change in the values reported by event_sampling whether I am executing the CUDA benchmarks or not. The event_sampling code also has some calculations of its own for which it measures the instructions executed. Unlike on a CPU, do I need to make changes to the source code of the application to be able to read GPU counters such as instruction_executed?
CUPTI will only give you counter updates for kernels launched in the same process. You can get some of these values, though not to the same level of precision, with the NVIDIA Visual Profiler or related environment variables, without modifying the code.
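As an illustration of the same-process point, here is a rough sketch of reading an event around a kernel launched by the same program, following the pattern of the CUPTI samples (compile with nvcc and link against cuda and cupti). The event name, collection mode, and launch configuration are assumptions that may need adjusting for your CUPTI version, so treat this as an outline rather than verified reference code.

#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cupti.h>

// Trivial kernel so the event group has something to count in this process.
__global__ void busy(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main()
{
    cudaFree(0);                       // force creation of the runtime's context
    CUcontext ctx; cuCtxGetCurrent(&ctx);
    CUdevice  dev; cuCtxGetDevice(&dev);

    // Set up an event group for "inst_executed" on the current context.
    CUpti_EventID    eventId;
    CUpti_EventGroup group;
    cuptiSetEventCollectionMode(ctx, CUPTI_EVENT_COLLECTION_MODE_KERNEL);
    cuptiEventGetIdFromName(dev, "inst_executed", &eventId);
    cuptiEventGroupCreate(ctx, &group, 0);
    cuptiEventGroupAddEvent(group, eventId);
    cuptiEventGroupEnable(group);

    // The kernel must run in this process (and this context) to be counted.
    int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    busy<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    uint64_t value = 0;
    size_t   bytes = sizeof(value);
    cuptiEventGroupReadEvent(group, CUPTI_EVENT_READ_FLAG_NONE,
                             eventId, &bytes, &value);
    printf("inst_executed = %llu\n", (unsigned long long)value);

    cuptiEventGroupDisable(group);
    cuptiEventGroupDestroy(group);
    cudaFree(d);
    return 0;
}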