How to select a GPU with CUDA? - cuda

I have a computer with 2 GPUs; I wrote a CUDA C program and I need to tell it somehow that I want to run it on just 1 out of the 2 graphic cards; what is the command I need to type and how should I use it? I believe somehow that is related to the cudaSetDevice but I can't really find out how to use it.

It should be pretty much clear from documentation of cudaSetDevice, but let me provide following code snippet.
bool IsGpuAvailable()
{
int devicesCount;
cudaGetDeviceCount(&devicesCount);
for(int deviceIndex = 0; deviceIndex < devicesCount; ++deviceIndex)
{
cudaDeviceProp deviceProperties;
cudaGetDeviceProperties(&deviceProperties, deviceIndex);
if (deviceProperties.major >= 2
&& deviceProperties.minor >= 0)
{
cudaSetDevice(deviceIndex);
return true;
}
}
return false;
}
This is how I iterated through all available GPUs (cudaGetDeviceCount) looking for the first one of Compute Capability of at least 2.0. If such device was found, then I used cudaSetDevice so all the CUDA computations were executed on that particular device. Without executing the cudaSetDevice your CUDA app would execute on the first GPU, i.e. the one with deviceIndex == 0 but which particular GPU is that depends on which GPU is in which PCIe slot.
EDIT:
After clarifying your question in comments, it seems to me that it should be suitable for you to choose the device based on its name. If you are unsure about your actual GPU names, then run this code which will print names of all your GPUs into console:
int devicesCount;
cudaGetDeviceCount(&devicesCount);
for(int deviceIndex = 0; deviceIndex < devicesCount; ++deviceIndex)
{
cudaDeviceProp deviceProperties;
cudaGetDeviceProperties(&deviceProperties, deviceIndex);
cout << deviceProperties.name << endl;
}
After that, choose the name of the GPU that you want to use for computations, lets say it is "GTX XYZ". Call the following method from your main method, thanks to it, all the CUDA kernels will be executed on the device with name "GTX XYZ". You should also check the return value - true if device with such name is found, false otherwise:
bool SetGPU()
{
int devicesCount;
cudaGetDeviceCount(&devicesCount);
string desiredDeviceName = "GTX XYZ";
for(int deviceIndex = 0; deviceIndex < devicesCount; ++deviceIndex)
{
cudaDeviceProp deviceProperties;
cudaGetDeviceProperties(&deviceProperties, deviceIndex);
if (deviceProperties.name == desiredDeviceName)
{
cudaSetDevice(deviceIndex);
return true;
}
}
return false;
}
Of course you have to change the value of desiredDeviceName variable to desired value.

Searching more carefully in the internet I found this lines of code that select the GPU with more cores among all the devices installed in the Pc.
int num_devices, device;
cudaGetDeviceCount(&num_devices);
if (num_devices > 1) {
int max_multiprocessors = 0, max_device = 0;
for (device = 0; device < num_devices; device++) {
cudaDeviceProp properties;
cudaGetDeviceProperties(&properties, device);
if (max_multiprocessors < properties.multiProcessorCount) {
max_multiprocessors = properties.multiProcessorCount;
max_device = device;
}
}
cudaSetDevice(max_device);
}

Related

The way to properly do multiple CUDA block synchronization

I like to do CUDA synchronization for multiple blocks. It is not for each block where __syncthreads() can easily handle it.
I saw there are exiting discussions on this topic, for example cuda block synchronization, and I like the simple solution brought up by #johan, https://stackoverflow.com/a/67252761/3188690, essentially it uses a 64 bits counter to track the synchronized blocks.
However, I wrote the following code trying to accomplish the similar job but meet a problem. Here I used the term environment so that the wkNumberEnvs of blocks within this environment shall be synchronized. It has a counter. I used atomicAdd() to count how many blocks have already been synchronized themselves, once the number of sync blocks == wkBlocksPerEnv, I know all blocks finished sync and it is free to go. However, it has a strange outcome that I am not sure why.
The problem comes from this while loop. Since the first threads of all blocks are doing the atomicAdd, there is a while loop to check until the condition meets. But I find that some blocks will be stuck into the endless loop, which I am not sure why the condition cannot be met eventually? And if I printf some messages either in *** I can print here 1 or *** I can print here 2, there is no endless loop and everything is perfect. I do not see something obvious.
const int wkBlocksPerEnv = 2;
__device__ int env_sync_block_count[wkNumberEnvs];
__device__ void syncthreads_for_env(){
// sync threads for each block so all threads in this block finished the previous tasks
__syncthreads();
// sync threads for wkBlocksPerEnv blocks for each environment
if(wkBlocksPerEnv > 1){
const int kThisEnvId = get_env_scope_block_id(blockIdx.x);
if (threadIdx.x == 0){
// incrementing env_sync_block_count by 1
atomicAdd(&env_sync_block_count[kThisEnvId], 1);
// *** I can print here 1
while(env_sync_block_count[kThisEnvId] != wkBlocksPerEnv){
// *** I can print here 2
}
// Do the next job ...
}
}
There are two potential issues with your code. Caching and block scheduling.
Caching can prevent you from observing an updated value during the while loop.
Block scheduling can cause a dead-lock if you wait for an update of a block which has not yet been scheduled. Since CUDA does not guarantee a specific order of scheduled blocks, the only way to prevent this dead-lock is to limit the number of blocks in the grid such that all blocks can run simultaneously.
Following code shows how you could synchronize multiple blocks while avoiding above issues. I adapted the code from the multi-grid synchronization given in the CUDA-sample conjugateGradientMultiDeviceCG https://github.com/NVIDIA/cuda-samples/blob/master/Samples/4_CUDA_Libraries/conjugateGradientMultiDeviceCG/conjugateGradientMultiDeviceCG.cu#L186
On pre-Volta devices, it uses volatile memory accesses. Volta and later uses acquire/release semantics.
Grid size is limited by querying device properties.
#include <cassert>
#include <cstdio>
constexpr int wkBlocksPerEnv = 13;
__device__
int getEnv(int blockId){
return blockId / wkBlocksPerEnv;
}
__device__
int getRankInEnv(int blockId){
return blockId % wkBlocksPerEnv;
}
__device__
unsigned char load_arrived(unsigned char *arrived) {
#if __CUDA_ARCH__ < 700
return *(volatile unsigned char *)arrived;
#else
unsigned int result;
asm volatile("ld.acquire.gpu.global.u8 %0, [%1];"
: "=r"(result)
: "l"(arrived)
: "memory");
return result;
#endif
}
__device__
void store_arrived(unsigned char *arrived,
unsigned char val) {
#if __CUDA_ARCH__ < 700
*(volatile unsigned char *)arrived = val;
#else
unsigned int reg_val = val;
asm volatile(
"st.release.gpu.global.u8 [%1], %0;" ::"r"(reg_val) "l"(arrived)
: "memory");
// Avoids compiler warnings from unused variable val.
(void)(reg_val = reg_val);
#endif
}
#if 0
//wrong implementation which does not synchronize. to check that kernel assert does trigger without proper synchronization
__device__
void syncthreads_for_env(unsigned char* temp){
}
#else
//temp must have at least size sizeof(unsigned char) * total_number_of_blocks in grid
__device__
void syncthreads_for_env(unsigned char* temp){
__syncthreads();
const int env = getEnv(blockIdx.x);
const int blockInEnv = getRankInEnv(blockIdx.x);
unsigned char* const mytemp = temp + env * wkBlocksPerEnv;
if(threadIdx.x == 0){
if(blockInEnv == 0){
// Leader block waits for others to join and then releases them.
// Other blocks in env can arrive in any order, so the leader have to wait for
// all others.
for (int i = 0; i < wkBlocksPerEnv - 1; i++) {
while (load_arrived(&mytemp[i]) == 0)
;
}
for (int i = 0; i < wkBlocksPerEnv - 1; i++) {
store_arrived(&mytemp[i], 0);
}
__threadfence();
}else{
// Other blocks in env note their arrival and wait to be released.
store_arrived(&mytemp[blockInEnv - 1], 1);
while (load_arrived(&mytemp[blockInEnv - 1]) == 1)
;
}
}
__syncthreads();
}
#endif
__global__
void kernel(unsigned char* synctemp, int* array){
const int env = getEnv(blockIdx.x);
const int blockInEnv = getRankInEnv(blockIdx.x);
if(threadIdx.x == 0){
array[blockIdx.x] = 1;
}
syncthreads_for_env(synctemp);
if(threadIdx.x == 0){
int sum = 0;
for(int i = 0; i < wkBlocksPerEnv; i++){
sum += array[env * wkBlocksPerEnv + i];
}
assert(sum == wkBlocksPerEnv);
}
}
int main(){
const int smem = 0;
const int blocksize = 128;
int deviceId = 0;
int numSMs = 0;
int maxBlocksPerSM = 0;
cudaGetDevice(&deviceId);
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, deviceId);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&maxBlocksPerSM,
kernel,
blocksize,
smem
);
int maxBlocks = maxBlocksPerSM * numSMs;
maxBlocks -= maxBlocks % wkBlocksPerEnv; //round down to nearest multiple of wkBlocksPerEnv
printf("wkBlocksPerEnv %d, maxBlocks: %d\n", wkBlocksPerEnv, maxBlocks);
int* d_array;
unsigned char* d_synctemp;
cudaMalloc(&d_array, sizeof(int) * maxBlocks);
cudaMalloc(&d_synctemp, sizeof(unsigned char) * maxBlocks);
cudaMemset(d_synctemp, 0, sizeof(unsigned char) * maxBlocks);
kernel<<<maxBlocks, blocksize>>>(d_synctemp, d_array);
cudaFree(d_synctemp);
cudaFree(d_array);
return 0;
}
Atomic value is going to global memory but in the while-loop you read it directly and it must be coming from the cache which will not automatically synchronize between threads (cache-coherence only handled by explicit synchronizations like threadfence). Thread gets its own synchronization but other threads may not see it.
Even if you use threadfence, the threads in same warp would be in dead-lock waiting forever if they were the first to check the value before any other thread updates it. But should work with newest GPUs supporting independent thread scheduling.
I like to do CUDA synchronization for multiple blocks.
You should learn to dis-like it. Synchronization is always costly, even when implemented just right, and inter-core synchronization all the more so.
if (threadIdx.x == 0){
// incrementing env_sync_block_count by 1
atomicAdd(&env_sync_block_count[kThisEnvId], 1);
while(env_sync_block_count[kThisEnvId] != wkBlocksPerEnv)
// OH NO!!
{
}
}
This is bad. With this code, the first warp of each block will perform repeated reads of env_sync_block_count[kThisEnvId]. First, and as #AbatorAbetor mentioned, you will face the problem of cache incoherence, causing your blocks to potentially read the wrong value from a local cache well after the global value has long changed.
Also, your blocks will hog up the multiprocessors. Blocks will stay resident and have at least one active warp, indefinitely. Who's to say the will be evicted from their multiprocessor to schedule additional blocks to execute? If I were the GPU, I wouldn't allow more and more active blocks to pile up. Even if you don't deadlock - you'll be wasting a lot of time.
Now, #AbatorAbetor's answer avoids the deadlock by limiting the grid size. And I guess that works. But unless you have a very good reason to write your kernels this way - the real solution is to just break up your algorithm into consecutive kernels (or better yet, figure out how to avoid the need to synchronize altogether).
a mid-way approach is to only have some blocks get past the point of synchronization. You could do that by not waiting except on some condition which holds for a very limited number of blocks (say you had a single workgroup - then only the blocks which got the last K possible counter values, wait).

CUDA cudaLaunchCooperativeKernel and grid synchronization

I am trying to understand how to synchronize a grid of threads with cudaLaunchCooperativeKernel.
https://developer.nvidia.com/blog/cooperative-groups/
I have a very simple kernel where two threads update an array, sync and both print the array:
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void kernel(float *buf){
cg::grid_group
grid = cg::this_grid();
if(grid.thread_rank()<2)
buf[grid.thread_rank()] = 10+grid.thread_rank();
assert(grid.is_valid()); // ok!
grid.sync();
if(grid.thread_rank()<2)
printf("thread=%d: %g %g\n",(int)grid.thread_rank(),buf[0],buf[1]);
}
Instead of printing values (10,11) twice, I get:
thread=0: 10 0
thread=1: 0 11
All cuda calls were fine, cuda-memcheck is happy, my cards is "GeForce RTX 2060 SUPER" and it does support cooperative kernels work, checked with:
int supportsCoopLaunch = 0;
if( cudaSuccess != cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev) )
throw std::runtime_error("Cooperative Launch is not supported on this machine configuration.");
I am confused... Why I don't see the synchronization?
This test is incorrect:
int supportsCoopLaunch = 0;
if( cudaSuccess != cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev) )
throw std::runtime_error("Cooperative Launch is not supported on this machine configuration.");
The support (or lack of) is not communicated via the cudaError_t return value of the function, instead it is communicated via the value placed in the supportsCoopLaunch variable. You would want to do something like:
int supportsCoopLaunch = 0;
cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev);
if( supportsCoopLaunch != 1)
throw std::runtime_error("Cooperative Launch is not supported on this machine configuration.");
I found the bug. The actual code was something like that:
__device__ void kernel(float *buf){/* see the function body above*/}
__global__ void parent_kernel(){
float buf[2]; // per-thread buffer!!! The kernel will not 'sync' it!
kernel(buf); // different kernels will get different buffers
}

How to create out-of-tree QEMU devices?

Two possible mechanisms come to mind:
IPC like the existing QMP and QAPI
QEMU loads a shared library plugin that contains the model
Required capabilities (of course all possible through the C API, but not necessarily IPC APIs):
inject interrupts
register callbacks for register access
modify main memory
Why I want this:
use QEMU as a submodule and leave its source untouched
additional advantages only present for IPC methods:
write the models in any language I want
use a non-GPL license for my device
I'm aware of in-tree devices as explained at: How to add a new device in QEMU source code? which are the traditional way of doing things.
What I've found so far:
interrupts: could only find NMI generation with the nmi monitor command
IO ports: IO possible with i and o monitor commands, so I'm fine there
main memory:
the ideal solution would be to map memory to host directly, but that seems hard:
http://kvm.vger.kernel.narkive.com/rto1dDqn/sharing-variables-memory-between-host-and-guest
https://www.linux-kvm.org/images/e/e8/0.11.Nahanni-CamMacdonell.pdf
http://www.fp7-save.eu/papers/SCALCOM2016.pdf
memory read is possible through the x and xp monitor commands
could not find how to write to memory with monitor commands. But I think the GDB API supports, so it should not be too hard to implement.
The closest working piece of code I could find was: https://github.com/texane/vpcie , which serializes PCI on both sides, and sends it through QEMU's TCP API. But this is more inefficient and intrusive, as it requires extra setup on both guest and host.
This create out of tree PCI device , it just display device in lspci..
It will ease faster PCI driver implementation as it will act as module,
can we extend this to to have similar functionality as edu-pci of QEMU.?
https://github.com/alokprasad/pci-hacking/blob/master/ksrc/virtual_pcinet/virtual_pci.c
/*
*/
#include <linux/init.h>
#include <linux/module.h>
#include <linux/sysfs.h>
#include <linux/fs.h>
#include <linux/kobject.h>
#include <linux/device.h>
#include <linux/proc_fs.h>
#include <linux/types.h>
#include <linux/pci.h>
#include <linux/version.h>
#include<linux/kernel.h>
#define PCI_VENDOR_ID_XTREME 0x15b3
#define PCI_DEVICE_ID_XTREME_VNIC 0x1450
static struct pci_bus *vbus;
static struct pci_sysdata *sysdata;
static DEFINE_PCI_DEVICE_TABLE( vpci_dev_table) = {
{PCI_DEVICE(PCI_VENDOR_ID_XTREME, PCI_DEVICE_ID_XTREME_VNIC)},
{0}
};
MODULE_DEVICE_TABLE(pci, vpci_dev_table);
int vpci_read(struct pci_bus *bus, unsigned int devfn, int where,
int size, u32 *val)
{
switch (where) {
case PCI_VENDOR_ID:
*val = PCI_VENDOR_ID_XTREME | PCI_DEVICE_ID_XTREME_VNIC << 16;
/* our id */
break;
case PCI_COMMAND:
*val = 0;
break;
case PCI_HEADER_TYPE:
*val = PCI_HEADER_TYPE_NORMAL;
break;
case PCI_STATUS:
*val = 0;
break;
case PCI_CLASS_REVISION:
*val = (4 << 24) | (0 << 16) | 1;
/* network class, ethernet controller, revision 1 */ /*2 or 4*/
break;
case PCI_INTERRUPT_PIN:
*val = 0;
break;
case PCI_SUBSYSTEM_VENDOR_ID:
*val = 0;
break;
case PCI_SUBSYSTEM_ID:
*val = 0;
break;
default:
*val = 0;
/* sensible default */
}
return 0;
}
int vpci_write(struct pci_bus *bus, unsigned int devfn, int where,
int size, u32 val)
{
switch (where) {
case PCI_BASE_ADDRESS_0:
case PCI_BASE_ADDRESS_1:
case PCI_BASE_ADDRESS_2:
case PCI_BASE_ADDRESS_3:
case PCI_BASE_ADDRESS_4:
case PCI_BASE_ADDRESS_5:
break;
}
return 0;
}
struct pci_ops vpci_ops = {
.read = vpci_read,
.write = vpci_write
};
void vpci_remove_vnic()
{
struct pci_dev *pcidev = NULL;
if (vbus == NULL)
return;
pci_remove_bus_device(pcidev);
pci_dev_put(pcidev);
}
EXPORT_SYMBOL( vpci_remove_vnic);
void vpci_vdev_remove(struct pci_dev *dev)
{
}
static struct pci_driver vpci_vdev_driver = {
.name = "Xtreme-Virtual-NIC1",
.id_table = vpci_dev_table,
.remove = vpci_vdev_remove
};
int vpci_bus_init(void)
{
struct pci_dev *pcidev = NULL;
sysdata = kzalloc(sizeof(void *), GFP_KERNEL);
vbus = pci_scan_bus_parented(NULL, 2, & vpci_ops, sysdata);
//vbus = pci_create_root_bus(NULL,i,& vpci_ops, sysdata,NULL);
//if (vbus != NULL)
//break;
memset(sysdata, 0, sizeof(void *));
if (vbus == NULL) {
kfree(sysdata);
return -EINVAL;
}
if (pci_register_driver(& vpci_vdev_driver) < 0) {
pci_remove_bus(vbus);
vbus = NULL;
return -EINVAL;
}
pcidev = pci_scan_single_device(vbus, 0);
if (pcidev == NULL)
return 0;
else
pci_dev_get(pcidev);
pci_bus_add_devices(vbus);
return 0;
}
void vpci_bus_remove(void)
{
if (vbus) {
pci_unregister_driver(&vpci_vdev_driver);
device_unregister(vbus->bridge);
pci_remove_bus(vbus);
kfree(sysdata);
vbus = NULL;
}
}
static int __init pci_init(void)
{
printk( "module loaded");
vpci_bus_init();
return 0;
}
static void __exit pci_exit(void)
{
printk(KERN_ALERT "unregister PCI Device\n");
pci_unregister_driver(&vpci_vdev_driver);
}
module_init(pci_init);
module_exit(pci_exit);
MODULE_LICENSE("GPL");
There is at least one fork of QEMU I'm aware of that offers shared library plugins for QEMU... but it's a fork of QEMU 4.0.
https://github.com/cromulencellc/qemu-shoggoth
It is possible to build out of tree plugins with this fork, though it's not documented.
On Nov 11 2019 Peter Maydell, a major QEMU contributor, commented on another Stack Overflow question that:
Device plugins are specifically off the menu, because upstream does not want to provide a nice easy mechanism for people to use to have out-of-tree non-GPL/closed-source devices.
So it seems that QEMU devs oppose this idea at that point in time. It is worth learning about the QEMU plugin system though which might come handy for related applications in any case: How to count the number of guest instructions QEMU executed from the beginning to the end of a run?
This is a shame. Imagine if the Linux kernel didn't have a kernel module interface! I suggest QEMU expose this interface, but just don't make it stable, so that it won't impose a developer burden, and which gives the upside that those who merge won't have as painful rebases.

Are atomic operations in CUDA guaranteed to be scheduled per warp?

Suppose I have 8 blocks of 32 threads each running on a GTX 970. Each blcok either writes all 1's or all 0's to an array of length 32 in global memory, where thread 0 in a block writes to position 0 in the array.
Now to write the actual values atomicExch is used, exchanging the current value in the array with the value that the block attempts to write. Because of SIMD, atomic operation and the fact that a warp executes in lockstep I would expect the array to, at any point in time, only contain 1's or 0's. But never a mix of the two.
However, while running code like this there are several cases where at some point in time the array contains of a mix of 0's and 1's. Which appears to point to the fact that atomic operations are not executed per warp, and instead scheduled using some other scheme.
From other sources I have not really found a conclusive write-up detailing the scheduling of atomic operations across different warps (please correct me if I'm wrong), so I was wondering if there is any information on this topic. Since I need to write many small vectors consisting of several 32 bit integers atomically to global memory, and an atomic operation that is guaranteed to write a single vector atomically is obviously very important.
For those wondering, the code I wrote was executed on a GTX 970, compiled on compute capability 5.2, using CUDA 8.0.
The atomic instructions, like all instructions, are scheduled per warp. However there is an unspecified pipeline associated with atomics, and the scheduled instruction flow through the pipeline is not guaranteed to be executed in lockstep, for every thread, for every stage through the pipeline. This gives rise to the possibility for your observations.
I believe a simple thought experiment will demonstrate that this must be true: what if 2 threads in the same warp targeted the same location? Clearly every aspect of the processing could not proceed in lockstep. We could extend this thought experiment to the case where we have multiple issue per clock within an SM and even across SMs, to as additional examples.
If the vector length were short enough (16 bytes or less) then it should be possible to accomplish this ("atomic update") simply by having a thread in a warp write an appropriate vector-type quantity, e.g. int4. As long as all threads (regardless of where they are in the grid) are attempting to update a naturally aligned location, the write should not be corrupted by other writes.
However, after discussion in the comments, it seems that OP's goal is to be able to have a warp or threadblock update a vector of some length, without interference from other warps or threadblocks. It seems to me that really what is desired is access control (so that only one warp or threadblock is updating a particular vector at a time) and OP had some code that wasn't working as desired.
This access control can be enforced using an ordinary atomic operation (atomicCAS in the example below) to permit only one "producer" to update a vector at a time.
What follows is an example producer-consumer code, where there are multiple threadblocks that are updating a range of vectors. Each vector "slot" has a "slot control" variable, which is atomically updated to indicate:
vector is empty
vector is being filled
vector is filled, ready for "consumption"
with this 3-level scheme, we can allow for ordinary access to the vector by both consumer and multiple producer workers, with a single ordinary atomic variable access mechanism. Here is an example code:
#include <assert.h>
#include <iostream>
#include <stdio.h>
const int num_slots = 256;
const int slot_length = 32;
const int max_act = 65536;
const int slot_full = 2;
const int slot_filling = 1;
const int slot_empty = 0;
const int max_sm = 64; // needs to be greater than the maximum number of SMs for any GPU that it will be run on
__device__ int slot_control[num_slots] = {0};
__device__ int slots[num_slots*slot_length];
__device__ int observations[max_sm] = {0}; // reported by consumer
__device__ int actives[max_sm] = {0}; // reported by producers
__device__ int correct = 0;
__device__ int block_id = 0;
__device__ volatile int restricted_sm = -1;
__device__ int num_act = 0;
static __device__ __inline__ int __mysmid(){
int smid;
asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
return smid;}
// this code won't work on a GPU with a single SM!
__global__ void kernel(){
__shared__ volatile int done, update, next_slot;
int my_block_id = atomicAdd(&block_id, 1);
int my_sm = __mysmid();
if (my_block_id == 0){
if (!threadIdx.x){
restricted_sm = my_sm;
__threadfence();
// I am "block 0" and process the vectors, checking for coherency
// "consumer"
next_slot = 0;
volatile int *vslot_control = slot_control;
volatile int *vslots = slots;
int scount = 0;
while(scount < max_act){
if (vslot_control[next_slot] == slot_full){
scount++;
int slot_val = vslots[next_slot*slot_length];
for (int i = 1; i < slot_length; i++) if (slot_val != vslots[next_slot*slot_length+i]) { assert(0); /* badness - incoherence */}
observations[slot_val]++;
vslot_control[next_slot] = slot_empty;
correct++;
__threadfence();
}
next_slot++;
if (next_slot >= num_slots) next_slot = 0;
}
}}
else {
// "producer"
while (restricted_sm < 0); // wait for signaling
if (my_sm == restricted_sm) return;
next_slot = 0;
done = 0;
__syncthreads();
while (!done) {
if (!threadIdx.x){
while (atomicCAS(slot_control+next_slot, slot_empty, slot_filling) > slot_empty) {
next_slot++;
if (next_slot >= num_slots) next_slot = 0;}
// we grabbed an empty slot, fill it with my_sm
if (atomicAdd(&num_act, 1) < max_act) update = 1;
else {done = 1; update = 0;}
}
__syncthreads();
if (update) slots[next_slot*slot_length+threadIdx.x] = my_sm;
__threadfence(); //enforce ordering
if ((update) && (!threadIdx.x)){
slot_control[next_slot] = 2; // mark slot full
atomicAdd(actives+my_sm, 1);}
__syncthreads();
}
}
}
int main(){
kernel<<<256, slot_length>>>();
cudaDeviceSynchronize();
cudaError_t res= cudaGetLastError();
if (res != cudaSuccess) printf("kernel failure: %d\n", (int)res);
int *h_obs = new int[max_sm];
int *h_act = new int[max_sm];
int h_correct;
cudaMemcpyFromSymbol(h_obs, observations, sizeof(int)*max_sm);
cudaMemcpyFromSymbol(h_act, actives, sizeof(int)*max_sm);
cudaMemcpyFromSymbol(&h_correct, correct, sizeof(int));
int h_total_act = 0;
int h_total_obs = 0;
for (int i = 0; i < max_sm; i++){
std::cout << h_act[i] << "," << h_obs[i] << " ";
h_total_act += h_act[i];
h_total_obs += h_obs[i];}
std::cout << std::endl << h_total_act << "," << h_total_obs << "," << h_correct << std::endl;
}
I don't claim this code to be defect free for any use case. It is advanced to demonstrate the workability of a concept, not as production-ready code. It seems to work for me on linux, on a couple different systems I tested it on. It should not be run on GPUs that have only a single SM, as one SM is reserved for the consumer, and the remaining SMs are used by the producers.

how to find the number of maximum available threads in CUDA?

Not sure what is the best way to find the number of maximum available threads of my GPU.
I have the following code:
int deviceCount, device;
int gpuDeviceCount = 0;
struct cudaDeviceProp properties;
cudaError_t cudaResultCode = cudaGetDeviceCount(&deviceCount);
if (cudaResultCode != cudaSuccess)
deviceCount = 0;
/* machines with no GPUs can still report one emulation device */
for (device = 0; device < deviceCount; ++device) {
cudaGetDeviceProperties(&properties, device);
if (properties.major != 9999) /* 9999 means emulation only */
if (device==0)
{
printf("multiProcessorCount %d\n",properties.multiProcessorCount);
printf("maxThreadsPerMultiProcessor %d\n",properties.maxThreadsPerMultiProcessor);
}
}
which returns:
multiProcessorCount 14
maxThreadsPerMultiProcessor 1536
It turns out the total number is 14*1536=21504. I got a feeling it is too small (I have a Tesla M2070).
your way of checking is correct.
you can check the NVIDIA cuda SDK samples, "Device query" sample in SDK defines it well