Computation between two different kernels in Cuda [duplicate] - cuda

I'm writing a CUDA program but I'm getting the obnoxious warning:
Warning: Cannot tell what pointer points to, assuming global memory space
This is coming from nvcc and I can't disable it.
Is there any way to filter out warnings from third-party tools (like nvcc)?
I'm asking for a way to filter errors/warnings coming from custom build tools out of the Output window log.
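For reference, this is the kind of device code that typically triggers the warning on the old sm_1x targets (a minimal sketch of my own, not my actual code):
__device__ float load(float *p)            // pointer's memory space can't be resolved on sm_1x
{
    return *p;
}
__global__ void kernel(float *in, float *out)
{
    out[threadIdx.x] = load(in + threadIdx.x);
}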

I had the same annoying warnings and found help on this thread: link.
You can either remove the -G flag from the nvcc command line,
or
change compute_10,sm_10 to compute_20,sm_20 in the CUDA C/C++ options of your project if you're using Visual Studio.
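On the command line the difference looks roughly like this (kernel.cu is just a placeholder file name):
# emits the warning: debug device code for an sm_1x target
nvcc -G -gencode arch=compute_10,code=sm_10 -c kernel.cu
# either drop -G, or target a Fermi-class (compute_20) GPU instead
nvcc -gencode arch=compute_20,code=sm_20 -c kernel.cu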

Related

Is it possible to programmatically determine if the CUDA profiler is running?

The problem I'm trying to solve: when we run most of our command-line apps from Visual Studio, we like to force the user to hit a key to exit, so we can see the output in Visual Studio while we're debugging.
This doesn't work at all with profiling. One way to fix that would be to determine whether the profiler is running or not.
The API to the CUDA profiler is rather limited:
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__PROFILER.html
It appears to support:
Initialization: cudaProfilerInitialize
Starting: cudaProfilerStart
Stopping: cudaProfilerStop
But there seems to be no way to determine whether it's actually running.
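For context, typical usage of that API is just bracketing the region you want profiled (a minimal sketch of my own; myKernel is a placeholder):
#include <cuda_profiler_api.h>
__global__ void myKernel() { }             // placeholder kernel
int main()
{
    cudaProfilerStart();                   // begin collecting data, if a profiler is attached
    myKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    cudaProfilerStop();                    // stop collecting data
    return 0;                              // nothing here tells you whether a profiler was attached
}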
Well, an ugly and surely sub-optimal solution is just searching for nvprof among the running processes...
On Linux, you can do this with readproc():
#include <proc/readproc.h>
proc_t* readproc(PROCTAB *PT, proc_t *return_buf);
For more information on how to use the functions in readproc.h, have a look at:
How does the ps command work?
on SuperUser.com, and particularly at this answer.
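For example, something along these lines should work (an untested sketch using libprocps; link with -lprocps):
#include <proc/readproc.h>
#include <string.h>
/* Returns 1 if a process whose command name is "nvprof" is currently running. */
static int nvprof_is_running(void)
{
    PROCTAB *pt = openproc(PROC_FILLSTAT);   /* PROC_FILLSTAT fills proc_t::cmd */
    proc_t info;
    int found = 0;
    memset(&info, 0, sizeof(info));
    while (readproc(pt, &info) != NULL) {
        if (strcmp(info.cmd, "nvprof") == 0) {
            found = 1;
            break;
        }
    }
    closeproc(pt);
    return found;
}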
Note: Don't forget nvprof might be running but not profiling your process.

Can not profile a cuda code with nvprof when using CUPTI functions inside

I'm doing a simple experiment. You may know the callback_metric sample code of CUPTI (located in the CUPTI folder: /usr/local/cuda/extras/CUPTI/sample/callback_metric). It contains simple code for reading a metric while running a vectorAdd kernel. Everything works when I compile and run the code.
But when I run this code under nvprof command (nvprof ./callback_metric), I get an error message as:
Error: incompatible CUDA driver version
Both nvprof and other CUPTI-based codes work fine separately.
The profilers are not intended to be used in this way with applications that make use of CUPTI.
This is documented in the profiler documentation:
Here are a couple of reasons why Visual Profiler may fail to gather metric or event information.
More than one tool is trying to access the GPU. To fix this issue please make sure only one tool is using the GPU at any given point. Tools include the CUDA command line profiler, Parallel NSight Analysis Tools and Graphics Tools, and applications that use either CUPTI or PerfKit API (NVPM) to read event values.

let Nsight start debugging after certain kernel function is executed

My CUDA program has too many kernel functions, and if I turn on CUDA debugging mode I have to wait a whole hour before the breakpoint in a certain kernel function is triggered.
Is there any way for Nsight to start debugging only after certain kernel functions, or to debug only that specific kernel function?
I'm using Nsight with VS2012.
In theory you can follow the instructions in the Nsight help file (either the online help or the local help; at the time of writing the page is here).
In short:
In the Nsight Monitor options, CUDA » Use this Monitor for CUDA attach should be True.
Before starting your application, set an environment variable called NSIGHT_CUDA_DEBUGGER to 1.
Then in your CUDA kernel, you can add a breakpoint like this:
asm("brkpt;");
This works similarly to the __debugbreak() intrinsic or the int 3 assembly instruction in host code. When it's hit, you will get a dialog prompting you to attach the CUDA debugger.
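If you only want to stop in one particular kernel (or one particular thread), you can gate the breakpoint yourself; kernelOfInterest is just a placeholder name:
__global__ void kernelOfInterest(float *data)
{
    // only fire the breakpoint in the first thread of the first block
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        asm("brkpt;");                 // the attach dialog pops up here
    }
    // ... rest of the kernel ...
}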
In practice, at least for me it Just Doesn't Work™. Maybe you'll have more luck.

CUDA code too large to be expanded [duplicate]

I have a CUDA class, let's call it A, defined in a header file. I have written a test kernel which creates an instance of class A, which compiles fine and produces the expected result.
In addition, I have my main CUDA kernel, which also compiles fine and produces the expected result. However, when I add code to my main kernel to instantiate an instance of class A, the nvcc compiler fails with a segmentation fault.
Update:
To clarify, the segmentation fault happens during compilation, not when running the kernel. The line I am using to compile is:
`nvcc --cubin -arch compute_20 -code sm_20 -I<My include dir> --keep kernel.cu`
where <My include dir> is the path to my local directory containing some utility header files.
My question is: before spending a lot of time isolating a minimal example exhibiting the behaviour (not trivial, due to a relatively large code base), has anyone encountered a similar issue? Would it be possible for the nvcc compiler to fail and die if the kernel is either too long or uses too many registers?
If an issue such as register count can affect the compiler this way, then I will need to rethink how to implement my kernel to use fewer resources. This would also mean that trimming things down to a minimal example will likely make the problem disappear. However, if this is not even a possibility, I don't want to waste time on a dead end, but will rather try to cut things down to a minimal example and file a bug report with NVIDIA.
Update:
As per the suggestion of @njuffa, I reran the compilation with the -v flag enabled. The output ends with the following:
#$ ptxas -arch=sm_20 -m64 -v "/path/to/kernel_ptx/kernel.ptx" -o "kernel.cubin"
Segmentation fault
# --error 0x8b --
This suggests the problem is due to the ptxas program, which is failing to generate a CUDA binary from the ptx file.
This would appear to have been a genuine bug of some sort in the CUDA 5.0 ptxas assembler. It was reported to NVIDIA, and we can assume it was fixed at some point during the more than three years since the question was asked and this answer was added.
[This answer has been assembled from comments and added as a community wiki entry to get this question off the unanswered question list.]

cuda with mingw - updated

We have been developing our code on Linux, but would like to compile a Windows executable. The old non-GPU version compiles just fine with MinGW on Windows, so I was hoping I'd be able to do the same with the CUDA version.
The strategy is to compile the kernel code with nvcc in Visual Studio, and the rest with gcc in MinGW.
So far, we easily compiled the .cu file (with the kernel and kernel launches) in Visual Studio. However, we still can't compile the C code in MinGW. The C code contains CUDA API calls such as cudaMalloc and CUDA types such as cudaEvent_t, so we must include cuda.h and cuda_runtime.h. However, gcc gives warnings and errors for these headers, for example:
../include/host_defines.h:57:0: warning: "__cdecl" redefined
and
../include/vector_functions.h:127:14: error: 'short_4' has no member named 'x'
Any ideas on how we can include these headers and compile the C portion of the code?
If you are really desperate, there might be a way. nvcc is really just a frontend for a bunch of compilers. It invokes g++ a lot to strip comments, separate device and host code, handle name mangling, link stuff back together, etc. (use --verbose to get the details).
My idea is as follows: you should be able to compile the host code with MinGW while compiling the device code to a fatbin on a Linux machine (as I guess the device binary is host-machine independent). Afterwards, link both parts of the code back together with MinGW or use the driver API to load the fatbin dynamically, as in the sketch below. Disclaimer: did not test!
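For the driver API route, the host side would look roughly like this (my own untested sketch; kernel.fatbin and myKernel are placeholders for your device binary and kernel name, and error checking is omitted):
#include <cuda.h>
int main(void)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuModuleLoad(&mod, "kernel.fatbin");        /* device binary built on Linux */
    cuModuleGetFunction(&fn, mod, "myKernel");  /* kernel name as it appears in the binary */

    /* pass kernel arguments via the kernelParams array if your kernel takes any */
    cuLaunchKernel(fn, 1, 1, 1, 256, 1, 1, 0, NULL, NULL, NULL);
    cuCtxSynchronize();

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}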
As far as I know, it is impossible to use CUDA without MSVC. So you need MSVC to make nvcc work, and you can compile the CPU code with MinGW and link everything together.
According to http://forums.nvidia.com/index.php?showtopic=30743
"There are no current plans to support mingw."
You might want to take a look at how the Cycles renderer handles this; look at https://developer.blender.org/diffusion/B/browse/master/extern/cuew/ and
https://developer.blender.org/diffusion/B/browse/master/intern/cycles/device/device_cuda.cpp
I know it's not an automagic trick but it might help you get started.