launch more than 65536 blocks per grid (x dimension) in CUDA

I have CUDA code that I am launching from a MEX file in Visual Studio. I am only launching blocks in the x dimension, but I get an error if I try to launch more than 65536 blocks, despite the fact that my compute capability is 6.1 (according to the GPU devices tab under system info).
Also under system info it says MAX_GRID_DIM_X is 2147483647. Is there some setting or environment variable I need to change before I can launch this many blocks? What other things might be limiting the number of blocks I can launch?

Is there some setting or environment variable I need to change before I can launch this many blocks?
No.
What other things might be limiting the number of blocks I can launch?
Compilation settings. You must choose a target compilation architecture that supports a grid x dimension of up to 2^31-1 blocks. With CUDA 9, the default compilation architecture is compute capability 3.0, which supports extended 1D grid sizes. On older toolkits the default is 2.0 or earlier, and those do not.
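For illustration, here is a minimal sketch (the file name big_grid.cu and the kernel are made up; it assumes a device like the sm_61 one described in the question and an explicit -arch flag) that launches more than 65536 blocks in x and checks that the launch was accepted:
// compile with an explicit architecture, e.g.: nvcc -arch=sm_61 big_grid.cu -o big_grid
#include <cstdio>

__global__ void count_blocks(unsigned long long *total)
{
    // one atomic increment per block, just to touch every block of the oversized grid
    if (threadIdx.x == 0)
        atomicAdd(total, 1ULL);
}

int main()
{
    unsigned long long *total;
    cudaMallocManaged(&total, sizeof(*total));
    *total = 0;

    dim3 grid(100000);                 // more than 65536 blocks in the x dimension
    count_blocks<<<grid, 128>>>(total);
    printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaDeviceSynchronize();
    printf("blocks run: %llu\n", *total);
    return 0;
}
If the same code is compiled for a pre-3.0 architecture instead, the launch above fails with an invalid configuration error.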


What do the %envregN special registers hold?

I've read CUDA PTX code %envreg<32> special registers. The poster there was satisfied with not trying to treat OpenCL-originating PTX as regular CUDA PTX. But their question about the %envregN registers was not properly answered.
Mark Harris wrote that
OpenCL supports larger grids than most NVIDIA GPUs support, so grid sizes need to be virtualized by dividing across multiple actual grid launches, and so offsets are necessary. Also in OpenCL, indices do not necessarily start from (0, 0, 0), the user can specify offsets which the driver must pass to the kernel. Therefore the registers initialized for OpenCL and CUDA C launches are different.
So, do the %envregN registers make up the "virtual grid index"? And what does each of these registers hold?
The extent of the answer that can be authoritatively given is what is in the PTX documentation:
A set of 32 pre-defined read-only registers used to capture execution environment of PTX program outside of PTX virtual machine. These registers are initialized by the driver prior to kernel launch and can contain cta-wide or grid-wide values.
Anything beyond that would have to be:
discovered via reverse engineering or disclosed by someone with authoritative/unpublished knowledge
subject to change (being undocumented)
evidently under control of the driver, which means that for a different driver (e.g. CUDA vs. OpenCL) the contents and/or interpretation might be different.
If you think that NVIDIA documentation should be improved in any way, my suggestion would be to file a bug.
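If you do want to go the reverse-engineering route mentioned above, the registers can be read from device code with inline PTX. A minimal sketch (the kernel name is made up; the contents are undocumented, and under a plain CUDA launch they typically read back as zero):
#include <cstdio>

__global__ void dump_envregs()
{
    unsigned int r0, r1;
    // read two of the %envreg registers via inline PTX; their contents are
    // driver-defined and undocumented
    asm("mov.u32 %0, %%envreg0;" : "=r"(r0));
    asm("mov.u32 %0, %%envreg1;" : "=r"(r1));
    if (threadIdx.x == 0 && blockIdx.x == 0)
        printf("envreg0=%u envreg1=%u\n", r0, r1);
}

int main()
{
    dump_envregs<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}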

If I have multiple Nvidia GPUs in my system, how to check which GPU is currently used by CUDA compiler?

I have a Windows system with 2 Nvidia GPUs. Can someone tell me which GPU the CUDA compiler is using? Is it possible to switch GPUs, or to use both together for the same process?
http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g1bf9d625a931d657e08db2b4391170f0
Use 'cudaGetDeviceCount' to get the number of devices. If deviceCount is 2, then device index 0 and device index 1 refer to the two current devices.
And 'cudaGetDeviceProperties' can be used to get many properties of the device.
For example,
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, 1);
retrieves the properties of device 1.
And the way to switch to different GPUs is easy. After initialization, use
'cudaSetDevice(0)'
and
'cudaSetDevice(1)'
to switch to different GPUs.
The CUDA_VISIBLE_DEVICES environment variable lets you control which devices are visible to your program and in what order.
CUDA_VISIBLE_DEVICES="0,1" makes both GPU devices available to your program.
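Putting those calls together, a minimal sketch (nothing here beyond the runtime calls mentioned above) that lists the devices and then selects device 1:
#include <cstdio>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, i);
        printf("device %d: %s, %llu MB\n", i, deviceProp.name,
               (unsigned long long)(deviceProp.totalGlobalMem >> 20));
    }
    cudaSetDevice(1);   // subsequent allocations and kernel launches go to device 1
    return 0;
}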
Possible duplicate of CUDA GPU selected by position, but how to set default to be something other than device 0?

Artificially downgrade CUDA compute capabilities to simulate other hardware

I am developing software that should run on several CUDA GPUs with varying amounts of memory and compute capability. It has happened to me more than once that customers would report a reproducible problem on their GPU that I couldn't reproduce on my machine. Maybe because I have 8 GB of GPU memory and they have 4 GB, maybe because I have compute capability 3.0 and they have 2.0, things like that.
Thus the question: can I temporarily "downgrade" my GPU so that it would pretend to be a lesser model, with smaller amount of memory and/or with less advanced compute capability?
Per the comments, here is a clarification of what I'm asking.
Suppose a customer reports a problem running on a GPU with compute capability C with M gigs of GPU memory and T threads per block. I have a better GPU on my machine, with higher compute capability, more memory, and more threads per block.
Can I run my program on my GPU restricted to M gigs of GPU memory? The answer to this one seems to be "yes, just allocate (whatever memory you have) minus M at startup and never use it; that would leave only M available until your program exits."
Can I reduce the size of the blocks on my GPU to no more than T threads for the duration of runtime?
Can I reduce compute capability of my GPU for the duration of runtime, as seen by my program?
I originally wanted to make this a comment but it was getting far too big for that scope.
As @RobertCrovella mentioned, there is no native way to do what you are asking for. That said, you can take the following measures to minimize the bugs you see on other architectures.
0) Try to get the output of cudaGetDeviceProperties from the CUDA GPUs you want to target. You could crowd-source this from your users or the community.
1) To restrict memory, you can either implement a memory manager and manually keep track of the memory being used, or use cudaMemGetInfo to get a fairly close estimate. Note: this reports free memory device-wide, so it is also affected by memory used by other applications.
2) Have a wrapper macro for launching the kernel in which you explicitly check whether the number of blocks / threads fits the profile you are simulating (a sketch of such check helpers appears at the end of this answer). That is, instead of launching
kernel<float><<<blocks, threads>>>(a, b, c);
You'd do something like this:
LAUNCH_KERNEL((kernel<float>), blocks, threads, a, b, c);
where the macro could be defined like this:
#define LAUNCH_KERNEL(kernel, blocks, threads, ...)     \
    do {                                                \
        check_blocks(blocks);                           \
        check_threads(threads);                         \
        kernel<<<(blocks), (threads)>>>(__VA_ARGS__);   \
    } while (0)
3) Reducing the compute capability is not possible, but you can compile your code for various compute capabilities and make sure your kernels contain backwards-compatible code. If a certain part of your kernel errors out on an older architecture, you can do something like this:
#if !defined(TEST_FALLBACK) && __CUDA_ARCH__ >= 300 // Or any other newer compute
// Implement using new fancy feature
#else
// Implement a fallback version
#endif
You can define TEST_FALLBACK whenever you want to test your fallback code and ensure it works on older compute capabilities.
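To make point 2) concrete, here is a minimal sketch of what the check helpers used by LAUNCH_KERNEL could look like, together with the "reserve the excess memory" idea from point 1). The target_profile struct, its values, and the helper names are made up for illustration:
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// A hypothetical description of the customer's (smaller) GPU, e.g. filled in
// from the cudaGetDeviceProperties output they sent you (point 0).
struct target_profile {
    size_t total_mem;        // total device memory on the target, in bytes
    unsigned max_threads;    // max threads per block on the target
    unsigned max_grid_x;     // max grid x dimension on the target
};

static target_profile g_target = { 4ULL << 30, 1024, 65535 };

static void check_threads(dim3 threads)
{
    unsigned long long n =
        (unsigned long long)threads.x * threads.y * threads.z;
    if (n > g_target.max_threads) {
        fprintf(stderr, "launch exceeds %u threads/block on the target GPU\n",
                g_target.max_threads);
        exit(EXIT_FAILURE);
    }
}

static void check_blocks(dim3 blocks)
{
    if (blocks.x > g_target.max_grid_x) {
        fprintf(stderr, "launch exceeds the target GPU's grid x limit (%u)\n",
                g_target.max_grid_x);
        exit(EXIT_FAILURE);
    }
}

// Point 1): grab everything above the target's memory size at startup so the
// rest of the program sees roughly what the customer sees.
static void *reserve_excess_memory()
{
    size_t free_mem = 0, total_mem = 0;
    cudaMemGetInfo(&free_mem, &total_mem);
    void *reserved = NULL;
    if (free_mem > g_target.total_mem)
        cudaMalloc(&reserved, free_mem - g_target.total_mem);
    return reserved;   // release with cudaFree() at shutdown
}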

CUDA GPU selected by position, but how to set default to be something other than device 0?

I've recently installed a second GPU (Tesla K40) on my machine at home, and my searches have suggested that the GPU in the first PCI slot becomes the default GPU chosen for CUDA jobs. A great link explaining it can be found here:
Default GPU Assignment
My original GPU is a TITAN X, also CUDA-enabled, but it's really best for single precision calculations, while the Tesla is better for double precision. My question for the group is whether there is a way to always set my default CUDA programming device to be the second one. Obviously I can specify in the code each time which device to use, but I'm hoping I can configure my setup so that it will always default to using the Tesla card.
Or is the only way to open the box up and physically swap positions of the devices? Somehow that seems wrong to me....
Any advice or relevant links to follow up on would be greatly appreciated.
As you've already pointed out, the CUDA runtime has its own heuristic for ordering GPUs and assigning device indices to them.
The CUDA_VISIBLE_DEVICES environment variable will allow you to modify this ordering.
For example, suppose that in ordinary use, my display device is enumerated as device 0, and my preferred CUDA GPU is enumerated as device 1. Applications written without any usage of cudaSetDevice, for example, will default to using the device enumerated as 0. If I want to change this, under Linux I could use something like:
CUDA_VISIBLE_DEVICES="1" ./my_app
to cause the cuda runtime to enumerate the device that would ordinarily be device 1 as device 0 for this application run (and the ordinary device 0 would be "hidden" from CUDA, in this case). You can make this "permanent" for the session simply by exporting that variable (e.g., bash):
export CUDA_VISIBLE_DEVICES="1"
./my_app
If I simply wanted to reverse the default CUDA runtime ordering, but still make both GPUs available to the application, I could do something like:
CUDA_VISIBLE_DEVICES="1,0" ./deviceQuery
There are other specification options, such as using GPU UUID identifiers (instead of device indices) as provided by nvidia-smi.
Refer to the documentation or this writeup as well.
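If you would rather pick the device in code (as you noted is possible), a minimal sketch would be the following. It assumes the name reported by cudaGetDeviceProperties for the K40 contains "Tesla", which is what the runtime normally reports:
#include <cstdio>
#include <cstring>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        if (strstr(prop.name, "Tesla") != NULL) {   // prefer the Tesla K40 over the TITAN X
            cudaSetDevice(i);
            printf("using device %d: %s\n", i, prop.name);
            break;
        }
    }
    return 0;
}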

I cannot set breakpoint in CUDA kernel

I'm new to Nsight and CUDA. I tried to set a breakpoint inside my CUDA kernel code, but I can't: the breakpoint ends up at the end of my kernel rather than on the particular line I want to debug.
I'm using VS2010 (MFC project) with Nsight 2.2 and CUDA 4.2.
I'm compiling in debug mode.
I'm using CUDA in a project which is not the "StartUp project".
I'm using "Generate Host Debug Information" with "Yes (-g)"
I'm using "Generate Device Debug Information" with "Yes (-G)"
I am currently running the program through Menu->Nsight->Start CUDA debugging.
When I try to set a breakpoint in a different project (which is the "StartUp project"), I do succeed.
Any suggestions about how I can get the breakpoint to act on a particular line, versus the entire kernel?
It turned out I was launching my kernel with too many threads per block (256x256):
dim3 threads(256,256);
kernel<<<..., threads>>>(...);
256x256 is 65536 threads in a block, which exceeds the per-block limit (at most 1024 threads), so the launch fails and the breakpoint inside the kernel is never hit.
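For reference, a launch configuration that stays within the per-block limit (just a sketch of the launch, with the same kernel and arguments as above, assuming I wanted to cover a 256x256 range) would be something like:
dim3 threads(16, 16);              // 256 threads per block, well within the limit
dim3 blocks(256 / 16, 256 / 16);   // 16x16 blocks cover the same 256x256 range
kernel<<<blocks, threads>>>(...);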
It is important to note that when debugging CUDA, breakpoints set in device code will not work properly if the number of cores on your machine is greater than the number of CUDA threads being run. Additionally, if the number of CUDA threads is not evenly divisible by the number of cores, some cores will not hit device code breakpoints on the last iteration.