Why is cudaOccupancyMaxActiveBlocksPerMultiprocessor() independent of device?

Different devices may have different shared memory sizes and register counts, so why is cudaOccupancyMaxActiveBlocksPerMultiprocessor() independent of the device? It doesn't take a device as a parameter.

It isn't truly device-independent: it uses the currently active device, i.e. the one set by cudaSetDevice() (device 0 by default).
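For reference, a minimal sketch of how the call is typically used against whatever device is current; the kernel myKernel and the block size of 256 are placeholder assumptions, not anything from the question:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel used only so the occupancy query has something to inspect.
__global__ void myKernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main() {
    cudaSetDevice(0);            // occupancy is computed for this (the current) device
    int numBlocks = 0;
    int blockSize = 256;         // threads per block (assumed value)
    size_t dynamicSmem = 0;      // dynamic shared memory per block, in bytes
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel,
                                                  blockSize, dynamicSmem);
    printf("Max active blocks per SM on the current device: %d\n", numBlocks);
    return 0;
}
```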

Related

Do CUDA device IDs change when debugging?

I've noticed that, on a host with two working CUDA SM_2.x devices, the first of which is running the display, calling cudaSetDevice(1) in the debugger throws CUDA error 10 (invalid device). It works fine when executed outside of the debugger, however. I also note that the device which normally has ID 1 has device ID 0 inside the debugger.
Are my suspicions confirmed that device ID 0 is assigned only to the first available device, rather than the device installed in the first PCIe slot?
If so, is there a way of ensuring that e.g. cudaSetDevice(1) always selects the same device, irrespective of how CUDA assigns device IDs?
The really short answer is no, there is no way to do this. Having said that, hardcoding a fixed device ID is never the correct thing to do. You want to either:
Select an ID from the list of available devices which the API returns for you (there are a number of very helpful APIs to let you get the device you want; see the sketch after this list), or
Use no explicit device selection at all in your code and rely on appropriate driver compute mode settings and/or the CUDA_VISIBLE_DEVICES environment variable to have the driver automatically select a suitable valid device ID for you.
Which you choose will probably be dictated by the environment in which your code ends up being deployed.
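As an illustration of the first option, here is a hedged sketch that enumerates the available devices and picks one from its reported properties instead of hardcoding an ID; the "most multiprocessors" criterion is just an example, not a recommendation from the answer:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);

    int best = 0, bestSMs = -1;
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s (%d SMs, compute %d.%d)\n",
               d, prop.name, prop.multiProcessorCount, prop.major, prop.minor);
        if (prop.multiProcessorCount > bestSMs) {   // example criterion only
            bestSMs = prop.multiProcessorCount;
            best = d;
        }
    }
    cudaSetDevice(best);   // select whichever enumerated device matched
    return 0;
}
```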

Splitting an array on a multi-GPU system and transferring the data across the different GPUs

I'm using CUDA on a double GPU system using NVIDIA GTX 590 cards and I have an array partitioned according to the figure below.
If I'm going to use cudaSetDevice() to split the sub-arrays across the GPUs, will they share the same global memory? Can the first device access the updated data on the second device and, if so, how?
Thank you.
Each device's memory is separate, so if you call cudaSetDevice(A) and then cudaMalloc(), you are allocating memory on device A. If you subsequently access that memory from device B, you will see higher access latency since the access has to go through the external PCIe link.
An alternative strategy would be to partition the result across the GPUs and store all the input data needed on each GPU. This means you have some duplication of data but this is common practice in GPU (and indeed any parallel method such as MPI) programming - you'll often hear the term "halo" applied to the data regions that need to be transferred between updates.
Note that you can check whether one device can access another's memory using cudaDeviceCanAccessPeer(); in the case of a dual-GPU card such as the GTX 590, this is always true.
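A rough sketch of how the peer-access check and an explicit inter-device copy might look, assuming device IDs 0 and 1 and omitting error handling:

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    float *bufA = nullptr, *bufB = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&bufA, bytes);            // lives in device 0's memory

    cudaSetDevice(1);
    cudaMalloc(&bufB, bytes);            // lives in device 1's memory

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);   // can device 1 access device 0's memory?
    if (canAccess) {
        cudaDeviceEnablePeerAccess(0, 0);        // device 1 (current) may now access bufA
    }                                            // directly, given unified virtual addressing

    // An explicit copy between devices works regardless of peer mappings:
    cudaMemcpyPeer(bufB, 1, bufA, 0, bytes);

    cudaFree(bufB);
    cudaSetDevice(0);
    cudaFree(bufA);
    return 0;
}
```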

How do the nVIDIA drivers assign device indices to GPUs?

Assume that on a single node there are several devices with different compute capabilities. How does NVIDIA rank them (by rank I mean the index assigned by cudaSetDevice())?
Is there any general guideline about this? Thanks.
I believe the ordering of devices corresponding to cudaGetDevice and cudaSetDevice (i.e. the CUDA runtime enumeration order) should be either based on a heuristic that determines the fastest device and makes it first, or based on PCI enumeration order. You can confirm this using the deviceQuery sample, which prints the properties of devices (including PCI ID) in the order they get enumerated in for cudaSetDevice.
However I would recommend not to base any decisions on this. There's nothing magical about PCI enumeration order, and even things like a system BIOS upgrade can change the device enumeration order (as can swapping devices, moving to another system, etc.)
It's usually best to query devices (see the deviceQuery sample) and then make decisions based on the specific devices returned and/or their properties. You can also use cudaChooseDevice to select a device heuristically.
In CUDA 8 you can make the CUDA runtime choose either "fastest first" or PCI enumeration order by setting (or not setting) the CUDA_DEVICE_ORDER environment variable (FASTEST_FIRST is the default; PCI_BUS_ID enumerates in PCI order).
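To see how the runtime has ordered the devices on a particular system, you can print each device's PCI location next to its runtime index, in the spirit of the deviceQuery sample (a minimal sketch):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // Print the runtime index next to the PCI domain:bus:device location.
        printf("Runtime index %d: %s  PCI %04x:%02x:%02x\n",
               d, prop.name, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}
```

Running this with CUDA_DEVICE_ORDER=PCI_BUS_ID set in the environment (CUDA 8 and later) should report the devices in PCI order rather than the default fastest-first ordering.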

Memory usage limit for windows phone 8

What is the application memory usage limit for a Windows Phone 8 application? I need the memory limit for the three different device classes available (720p, WXGA, etc.).
The zen of WP8 memory caps has three aspects: default baseline (150MB+), extended memory (180MB+) and low-memory device opt-out (300MB+).
Baseline:
By default all apps (D3D, XAML and XNA) on WP8 have at least 150MB which is up from 90MB on WP7. The increase from 90MB to 150MB is done to accommodate the extra memory needed for more detailed visuals on HD displays.
Extended memory caps:
Apps can also ask for additional memory by specifying ID_FUNCCAP_EXTEND_MEM. When asking for additional memory you're guaranteed at least 180MB on all devices, and your app may actually get all the way up to 380MB on high-memory devices.
Low-memory device opt-out:
Apps can also opt out of low-memory devices (512MB RAM) by specifying ID_REQ_MEMORY_300. That guarantees your app will only run on high-memory devices (more than 1GB of RAM) and with at least 300MB of memory.
The way you should think about "high memory devices" is that it's just like having an optional sensor (Gyroscope, Compass, etc) or any other optional hardware (NFC, etc). Don't assume users have this extra memory unless you want to limit the distribution of your app considerably. Public statistics show that low-memory devices sell pretty well and you shouldn't disqualify your app from those devices unless it's an absolute must.
App memory limits for Windows Phone 8 (MSDN)

WinRT: handling more than 5 simultaneous touch inputs

How can more than five touch inputs be handled simultaneously on Windows Store Apps using C# and XAML?
Different approaches have been tried, including this from MS: http://msdn.microsoft.com/en-us/library/windows/apps/xaml/jj150606.aspx
Does anyone know an approach to handle more than five?
The five touch limit is most likely a hardware limitation. A touch screen has a dedicated processor to process the large amount of capacitance data. This processing results in a list of touches, and their respective locations, which is sent along to the operating system for handling.
In Apple-land, small iOS devices (iPhone, iPod) have a 5-touch limit, while large iOS devices (iPad) have a 10-touch limit.