Different kernels for different architectures - CUDA

I am wondering whether there is an easy way to have different versions of a kernel for different architectures. Is there an easy way, or is the only possibility to define independent kernels in independent files and ask nvcc to compile each file for a different architecture?

You can do that with preprocessor directives. Something like:
__global__ void kernel(...) {
#if __CUDA_ARCH__ >= 350
    do something
#else
    do something else
#endif
}
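For example, a compilable variant of this pattern could look like the following (the use of __ldg here is just an illustrative assumption, not something from the question):
__global__ void kernel(const int *in, int *out)
{
#if __CUDA_ARCH__ >= 350
    // On sm_35 and newer, read through the read-only data cache.
    out[threadIdx.x] = __ldg(&in[threadIdx.x]) * 2;
#else
    // Fallback for older architectures.
    out[threadIdx.x] = in[threadIdx.x] * 2;
#endif
}
Note that __CUDA_ARCH__ is only defined during device-code compilation, and the branch is resolved at compile time for each architecture nvcc is asked to target.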

With a little more C++, JackOLantern's answer can be slightly modified:
template <unsigned int ARCH>
__global__ void kernel(...)
{
    switch (ARCH)
    {
    case 35:
        do something
        break;
    case 30:
        do something else
        break;
    case 20:
        do something else
        break;
    default:
        do something for all other ARCH
        break;
    }
}
EDIT: to remove the error #sgar91 pointed out:
You can call the kernel with the properties of your CUDA device, queried via cudaGetDeviceProperties:
cudaDeviceProp props;
cudaGetDeviceProperties(&props, devId);
unsigned int cc = props.major * 10 + props.minor;
switch (cc)
{
case 35:
    kernel<35><<<1, 1>>>(/* args */);
    break;
...
}
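Putting the two pieces together, a self-contained sketch of the dispatch pattern could look like this (the kernel body, launch configuration, and the helper name launchForDevice are illustrative assumptions):
#include <cuda_runtime.h>

template <unsigned int ARCH>
__global__ void kernel(int *out)
{
    // ARCH is a compile-time constant, so dead branches are removed by the compiler.
    if (ARCH >= 35)
        out[threadIdx.x] = 35;
    else
        out[threadIdx.x] = ARCH;
}

void launchForDevice(int devId, int *d_out)
{
    cudaDeviceProp props;
    cudaGetDeviceProperties(&props, devId);
    unsigned int cc = props.major * 10 + props.minor;

    switch (cc)
    {
    case 35: kernel<35><<<1, 32>>>(d_out); break;
    case 30: kernel<30><<<1, 32>>>(d_out); break;
    case 20: kernel<20><<<1, 32>>>(d_out); break;
    default: kernel< 0><<<1, 32>>>(d_out); break;
    }
}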

Related

Can a branch in CUDA be ignored if all the warps go one path? If so, is there a way I could give the compiler/runtime this information?

Suppose we have code like the following (I have not compiled this, it may be wrong)
__global__ void myKernel()
{
    int data = someArray[threadIdx.x];
    if (data == 0) {
        funcA();
    } else {
        funcB();
    }
}
Now suppose there's a 1024-thread block running, and someArray is all zeros.
Further suppose that funcB() is costly to run, but funcA() is not.
I assume the compiler has to emit both paths sequentially, e.g. doing funcA first, then funcB after. This is not ideal.
Is there a way to hint to CUDA to not do it? Or does the runtime notice "no threads are active so I will skip over all the instructions as I see them"?
Or better yet, what if the branch was something like this (again, haven't compiled this, but it illustrates what I am trying to convey)
__constant__ int constantNumber;

__global__ void myKernel()
{
    if (constantNumber == 123) {
        funcA();
    } else {
        funcB();
    }
}
and then I set constantNumber to 123 before launching the kernel. Would this still cause both paths to be taken?
This can be achieved using __builtin_assume.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#__builtin_assume
Quoting the documentation:
void __builtin_assume(bool exp)
Allows the compiler to assume that the Boolean argument is true. If the argument is not true at run time, then the behavior is undefined. The argument is not evaluated, so any side-effects will be discarded.
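Applied to the second example above, a minimal sketch could look like the following. Whether the compiler actually uses an assumption about a value loaded from constant memory to prune the branch is not guaranteed; this only shows where the hint would go, and the funcA/funcB stubs are placeholders:
__device__ void funcA() { /* cheap path */ }
__device__ void funcB() { /* expensive path */ }

__constant__ int constantNumber;

__global__ void myKernel()
{
    // Promise the compiler this holds at run time; if it does not, behavior is undefined.
    __builtin_assume(constantNumber == 123);

    if (constantNumber == 123) {
        funcA();
    } else {
        funcB();   // with the assumption above, the compiler may treat this path as dead
    }
}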

CUDA cudaLaunchCooperativeKernel and grid synchronization

I am trying to understand how to synchronize a grid of threads with cudaLaunchCooperativeKernel.
https://developer.nvidia.com/blog/cooperative-groups/
I have a very simple kernel where two threads update an array, sync and both print the array:
#include <cassert>
#include <cstdio>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void kernel(float *buf) {
    cg::grid_group grid = cg::this_grid();
    if (grid.thread_rank() < 2)
        buf[grid.thread_rank()] = 10 + grid.thread_rank();
    assert(grid.is_valid()); // ok!
    grid.sync();
    if (grid.thread_rank() < 2)
        printf("thread=%d: %g %g\n", (int)grid.thread_rank(), buf[0], buf[1]);
}
Instead of printing values (10,11) twice, I get:
thread=0: 10 0
thread=1: 0 11
All CUDA calls were fine, cuda-memcheck is happy, my card is a "GeForce RTX 2060 SUPER" and it does support cooperative kernel launches, checked with:
int supportsCoopLaunch = 0;
if( cudaSuccess != cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev) )
throw std::runtime_error("Cooperative Launch is not supported on this machine configuration.");
I am confused... Why don't I see the synchronization?
This test is incorrect:
int supportsCoopLaunch = 0;
if( cudaSuccess != cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev) )
throw std::runtime_error("Cooperative Launch is not supported on this machine configuration.");
The support (or lack thereof) is not communicated via the cudaError_t return value of the function; instead it is communicated via the value placed in the supportsCoopLaunch variable. You would want to do something like:
int supportsCoopLaunch = 0;
cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev);
if( supportsCoopLaunch != 1)
throw std::runtime_error("Cooperative Launch is not supported on this machine configuration.");
I found the bug. The actual code was something like this:
__device__ void kernel(float *buf) { /* see the function body above */ }

__global__ void parent_kernel() {
    float buf[2]; // per-thread buffer!!! The kernel will not 'sync' it!
    kernel(buf);  // different threads will get different buffers
}
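For reference, a minimal sketch of a corrected setup, with one buffer in global memory shared by the whole grid and a cooperative launch, might look like this (launch configuration and error handling are simplified assumptions; grid-wide sync also requires compiling with -rdc=true for an appropriate architecture):
#include <cstdio>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void kernel(float *buf) {
    cg::grid_group grid = cg::this_grid();
    if (grid.thread_rank() < 2)
        buf[grid.thread_rank()] = 10 + grid.thread_rank();
    grid.sync();
    if (grid.thread_rank() < 2)
        printf("thread=%d: %g %g\n", (int)grid.thread_rank(), buf[0], buf[1]);
}

int main() {
    float *buf;                      // one buffer in global memory, shared by the whole grid
    cudaMalloc(&buf, 2 * sizeof(float));

    void *args[] = { &buf };
    dim3 grid(2), block(1);          // tiny launch, just for illustration
    cudaLaunchCooperativeKernel((void *)kernel, grid, block, args, 0, 0);
    cudaDeviceSynchronize();

    cudaFree(buf);
    return 0;
}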

How to create out-of-tree QEMU devices?

Two possible mechanisms come to mind:
IPC like the existing QMP and QAPI
QEMU loads a shared library plugin that contains the model
Required capabilities (of course all possible through the C API, but not necessarily through IPC APIs):
inject interrupts
register callbacks for register access
modify main memory
Why I want this:
use QEMU as a submodule and leave its source untouched
additional advantages only present for IPC methods:
write the models in any language I want
use a non-GPL license for my device
I'm aware of in-tree devices as explained at: How to add a new device in QEMU source code? which are the traditional way of doing things.
What I've found so far:
interrupts: could only find NMI generation with the nmi monitor command
IO ports: IO possible with i and o monitor commands, so I'm fine there
main memory:
the ideal solution would be to map memory to host directly, but that seems hard:
http://kvm.vger.kernel.narkive.com/rto1dDqn/sharing-variables-memory-between-host-and-guest
https://www.linux-kvm.org/images/e/e8/0.11.Nahanni-CamMacdonell.pdf
http://www.fp7-save.eu/papers/SCALCOM2016.pdf
memory read is possible through the x and xp monitor commands
could not find how to write to memory with monitor commands. But I think the GDB API supports it, so it should not be too hard to implement.
The closest working piece of code I could find was: https://github.com/texane/vpcie , which serializes PCI on both sides, and sends it through QEMU's TCP API. But this is more inefficient and intrusive, as it requires extra setup on both guest and host.
This creates an out-of-tree PCI device; it just displays the device in lspci.
It should make PCI driver implementation faster, as it acts as a module.
Can we extend this to have similar functionality to QEMU's edu-pci?
https://github.com/alokprasad/pci-hacking/blob/master/ksrc/virtual_pcinet/virtual_pci.c
#include <linux/init.h>
#include <linux/module.h>
#include <linux/sysfs.h>
#include <linux/fs.h>
#include <linux/kobject.h>
#include <linux/device.h>
#include <linux/proc_fs.h>
#include <linux/types.h>
#include <linux/pci.h>
#include <linux/version.h>
#include <linux/kernel.h>
#define PCI_VENDOR_ID_XTREME 0x15b3
#define PCI_DEVICE_ID_XTREME_VNIC 0x1450
static struct pci_bus *vbus;
static struct pci_sysdata *sysdata;
static DEFINE_PCI_DEVICE_TABLE( vpci_dev_table) = {
{PCI_DEVICE(PCI_VENDOR_ID_XTREME, PCI_DEVICE_ID_XTREME_VNIC)},
{0}
};
MODULE_DEVICE_TABLE(pci, vpci_dev_table);
int vpci_read(struct pci_bus *bus, unsigned int devfn, int where,
              int size, u32 *val)
{
    switch (where) {
    case PCI_VENDOR_ID:
        *val = PCI_VENDOR_ID_XTREME | PCI_DEVICE_ID_XTREME_VNIC << 16;
        /* our id */
        break;
    case PCI_COMMAND:
        *val = 0;
        break;
    case PCI_HEADER_TYPE:
        *val = PCI_HEADER_TYPE_NORMAL;
        break;
    case PCI_STATUS:
        *val = 0;
        break;
    case PCI_CLASS_REVISION:
        *val = (4 << 24) | (0 << 16) | 1;
        /* network class, ethernet controller, revision 1 */ /*2 or 4*/
        break;
    case PCI_INTERRUPT_PIN:
        *val = 0;
        break;
    case PCI_SUBSYSTEM_VENDOR_ID:
        *val = 0;
        break;
    case PCI_SUBSYSTEM_ID:
        *val = 0;
        break;
    default:
        *val = 0;
        /* sensible default */
    }
    return 0;
}
int vpci_write(struct pci_bus *bus, unsigned int devfn, int where,
               int size, u32 val)
{
    switch (where) {
    case PCI_BASE_ADDRESS_0:
    case PCI_BASE_ADDRESS_1:
    case PCI_BASE_ADDRESS_2:
    case PCI_BASE_ADDRESS_3:
    case PCI_BASE_ADDRESS_4:
    case PCI_BASE_ADDRESS_5:
        break;
    }
    return 0;
}

struct pci_ops vpci_ops = {
    .read = vpci_read,
    .write = vpci_write
};
void vpci_remove_vnic()
{
    struct pci_dev *pcidev = NULL;
    if (vbus == NULL)
        return;
    pci_remove_bus_device(pcidev);
    pci_dev_put(pcidev);
}
EXPORT_SYMBOL(vpci_remove_vnic);

void vpci_vdev_remove(struct pci_dev *dev)
{
}

static struct pci_driver vpci_vdev_driver = {
    .name = "Xtreme-Virtual-NIC1",
    .id_table = vpci_dev_table,
    .remove = vpci_vdev_remove
};
int vpci_bus_init(void)
{
    struct pci_dev *pcidev = NULL;
    sysdata = kzalloc(sizeof(void *), GFP_KERNEL);
    vbus = pci_scan_bus_parented(NULL, 2, &vpci_ops, sysdata);
    //vbus = pci_create_root_bus(NULL, i, &vpci_ops, sysdata, NULL);
    //if (vbus != NULL)
    //    break;
    memset(sysdata, 0, sizeof(void *));
    if (vbus == NULL) {
        kfree(sysdata);
        return -EINVAL;
    }
    if (pci_register_driver(&vpci_vdev_driver) < 0) {
        pci_remove_bus(vbus);
        vbus = NULL;
        return -EINVAL;
    }
    pcidev = pci_scan_single_device(vbus, 0);
    if (pcidev == NULL)
        return 0;
    else
        pci_dev_get(pcidev);
    pci_bus_add_devices(vbus);
    return 0;
}

void vpci_bus_remove(void)
{
    if (vbus) {
        pci_unregister_driver(&vpci_vdev_driver);
        device_unregister(vbus->bridge);
        pci_remove_bus(vbus);
        kfree(sysdata);
        vbus = NULL;
    }
}

static int __init pci_init(void)
{
    printk("module loaded");
    vpci_bus_init();
    return 0;
}

static void __exit pci_exit(void)
{
    printk(KERN_ALERT "unregister PCI Device\n");
    pci_unregister_driver(&vpci_vdev_driver);
}

module_init(pci_init);
module_exit(pci_exit);
MODULE_LICENSE("GPL");
There is at least one fork of QEMU I'm aware of that offers shared library plugins for QEMU... but it's a fork of QEMU 4.0.
https://github.com/cromulencellc/qemu-shoggoth
It is possible to build out-of-tree plugins with this fork, though it's not documented.
On Nov 11 2019 Peter Maydell, a major QEMU contributor, commented on another Stack Overflow question that:
Device plugins are specifically off the menu, because upstream does not want to provide a nice easy mechanism for people to use to have out-of-tree non-GPL/closed-source devices.
So it seems that the QEMU devs opposed this idea as of that point in time. It is worth learning about the QEMU plugin system though, which might come in handy for related applications in any case: How to count the number of guest instructions QEMU executed from the beginning to the end of a run?
This is a shame. Imagine if the Linux kernel didn't have a kernel module interface! I suggest QEMU expose this interface but just not make it stable, so that it doesn't impose a maintenance burden on developers, with the upside that those who do merge won't have such painful rebases.

Issue with periodically discrepancies in cufft-fftw complex to real transformations

For my thesis, I have to optimize a special MPI Navier-Stokes solver program with CUDA. The original program uses FFTW for solving several PDEs. In detail, several upper-triangular matrices are Fourier transformed in two dimensions, but handled as one-dimensional arrays. For the moment, I'm struggling with parts of the original code (N is always set to 64):
Original:
//Does the complex-to-real in-place FFT and normalizes
void fftC2R(double complex *arr) {
    fftw_execute_dft_c2r(plan_c2r, (fftw_complex*)arr, (double*)arr);
    //Currently ignored: normalization
    /* for(int i=0; i<N*(N/2+1); i++)
           arr[i] /= (double complex)sqrt((double complex)(N*N)); */
}
void doTimeStepETDRK2_nonlin_original() {
    //calc velocity
    ux[0] = 0;
    uy[0] = 0;
    for(int i=1; i<N*(N/2+1); i++) {
        ux[i] = I*kvec[1][i]*qvec[i] / kvec[2][i];
        uy[i] = -I*kvec[0][i]*qvec[i] / kvec[2][i];
    }
    fftC2R(ux);
    fftC2R(uy);
    //do some stuff here...
    //...
    return;
}
Where ux and uy are allocated as double complex arrays:
ux = (double complex*)fftw_malloc(N*(N/2+1) * sizeof(double complex));
uy = (double complex*)fftw_malloc(N*(N/2+1) * sizeof(double complex));
The fft-plan is created as:
plan_c2r = fftw_plan_dft_c2r_2d(N, N,(fftw_complex*) qvec, (double*)qvec, FFTW_ESTIMATE);
Where qvec is allocated the same way as ux and uy and has data type double complex.
Here are the relevant parts of the CUDA code:
NN2_VecSetZero_and_init<<<block_size, grid_size>>>();
cudaSafeCall(cudaDeviceSynchronize());
cudaSafeCall(cudaGetLastError());

int err = (int)cufftExecZ2D(cu_plan_c2r, (cufftDoubleComplex*)sym_ux, (cufftDoubleReal*)sym_ux);
if (err != CUFFT_SUCCESS) {
    exit(EXIT_FAILURE);
    return;
}

err = (int)cufftExecZ2D(cu_plan_c2r, (cufftDoubleComplex*)sym_uy, (cufftDoubleReal*)sym_uy);
if (err != CUFFT_SUCCESS) {
    exit(EXIT_FAILURE);
    return;
}

//do some stuff here...
//...
return;
Where sym_ux and sym_uy are allocated as:
cudaMalloc((void**)&sym_ux, N*(N/2+1)*sizeof(cufftDoubleComplex));
cudaMalloc((void**)&sym_uy, N*(N/2+1)*sizeof(cufftDoubleComplex));
The initialization of the relevant cufft parts looks like
if (cufftPlan2d(&cu_plan_c2r, N, N, CUFFT_Z2D) != CUFFT_SUCCESS) {
    exit(EXIT_FAILURE);
    return -1;
}
if (cufftPlan2d(&cu_plan_r2c, N, N, CUFFT_D2Z) != CUFFT_SUCCESS) {
    exit(EXIT_FAILURE);
    return -1;
}
if (cufftSetCompatibilityMode(cu_plan_c2r, CUFFT_COMPATIBILITY_FFTW_ALL) != CUFFT_SUCCESS) {
    exit(EXIT_FAILURE);
    return -1;
}
if (cufftSetCompatibilityMode(cu_plan_r2c, CUFFT_COMPATIBILITY_FFTW_ALL) != CUFFT_SUCCESS) {
    exit(EXIT_FAILURE);
    return -1;
}
So I use full FFTW compatibility and I call every function with the FFTW calling patterns.
When I run both versions, I receive almost equal results for ux and uy (sym_ux and sym_uy). But at periodic positions in the arrays, cuFFT seems to ignore these elements, whereas FFTW sets the real part of these elements to zero and calculates the complex parts (the arrays are too large to show them here). The stride at which this occurs is N/2+1. So I believe I haven't completely understood the FFT padding conventions of cuFFT versus FFTW.
I can rule out any earlier discrepancies between these arrays before the cuFFT executions are called, so the other arrays in the code above are not relevant here.
My question is: am I too optimistic in using almost 100% of the FFTW calling style? Do I have to handle my arrays before the FFTs? The cuFFT documentation says I would have to resize the data input and output arrays. But how can I do this when I'm running in-place transformations? I really wouldn't like to depart too far from the original code, and I don't want to use any more copy instructions for each FFT call, because memory is limited and the arrays should remain and be processed on the GPU for as long as possible.
I'm thankful for every hint, critical statement or idea!
My configuration:
Compiler: gcc 4.6 (C99 Standard)
MPI-Package: mvapich2-1.5.1p1 (shouldn't play a role because of reduced single processing debug mode)
CUDA-Version: 4.2
GPU: CUDA-arch-compute_20 (NVIDIA GeForce GTX 570)
FFTW 3.3
Once, when I had to work with cuFFT, the only solution I got to work was the exclusive usage of "cu_plan_c2c" - there are easy transformations between real and complex arrays:
- fill the complex part with 0 to emulate cu_plan_r2c
- use atan2 (not atan) on the complex result to emulate cu_plan_c2r
Sorry for not pointing you to any better solution, but this is how I ended up solving this problem. I hope you don't get into any hard trouble with low memory on the CPU side...
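As an illustration of the first suggestion, a rough sketch of emulating a real-to-complex transform with a Z2Z plan by zero-filling the imaginary parts might look like this (names such as realToComplex and forwardViaZ2Z are assumptions; note this produces the full N*N spectrum rather than cuFFT's packed N*(N/2+1) layout):
#include <cufft.h>
#include <cuda_runtime.h>

__global__ void realToComplex(const double *in, cufftDoubleComplex *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i].x = in[i];  // real part from the input
        out[i].y = 0.0;    // imaginary part zero-filled
    }
}

void forwardViaZ2Z(const double *d_real, cufftDoubleComplex *d_spec, int N)
{
    cufftHandle plan;
    cufftPlan2d(&plan, N, N, CUFFT_Z2Z);

    int n = N * N;
    realToComplex<<<(n + 255) / 256, 256>>>(d_real, d_spec, n);

    // Forward transform of the zero-padded complex data; the result is the
    // full N*N spectrum, not the packed layout a D2Z plan would produce.
    cufftExecZ2Z(plan, d_spec, d_spec, CUFFT_FORWARD);

    cufftDestroy(plan);
}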

Thrust - accessing neighbors

I would like to use Thrust's stream compaction functionality (copy_if) for distilling indices of elements from a vector if the elements adhere to a number of constraints. One of these constraints depends on the values of neighboring elements (8 in 2D and 26 in 3D). My question is: how can I obtain the neighbors of an element in Thrust?
The function call operator of the functor for the 'copy_if' basically looks like:
__host__ __device__ bool operator()(float x) {
    bool mark = x < 0.0f;
    if (mark) {
        if (left neighbor of x > 1.0f) return false;
        if (right neighbor of x > 1.0f) return false;
        if (top neighbor of x > 1.0f) return false;
        //etc.
    }
    return mark;
}
Currently I use a work-around by first launching a CUDA kernel (in which it is easy to access neighbors) to appropriately mark the elements. After that, I pass the marked elements to Thrust's copy_if to distill the indices of the marked elements.
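A rough sketch of that work-around, with assumed names (markKernel, IsMarked, distillIndices), only two of the neighbor checks, and no error checking, might look like this:
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>

// Plain CUDA kernel: neighbor access is straightforward here.
__global__ void markKernel(const float *values, int *marks, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int idx = y * width + x;
    bool mark = values[idx] < 0.0f;
    // Only two of the eight 2D neighbors are checked, for brevity.
    if (mark && x > 0         && values[idx - 1] > 1.0f) mark = false;
    if (mark && x < width - 1 && values[idx + 1] > 1.0f) mark = false;
    marks[idx] = mark ? 1 : 0;
}

struct IsMarked {
    __host__ __device__ bool operator()(int m) const { return m != 0; }
};

void distillIndices(const thrust::device_vector<float> &values,
                    thrust::device_vector<int> &indicesEmptied,   // must be large enough
                    int width, int height)
{
    thrust::device_vector<int> marks(width * height);
    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    markKernel<<<grid, block>>>(thrust::raw_pointer_cast(values.data()),
                                thrust::raw_pointer_cast(marks.data()),
                                width, height);

    // copy_if with the marks as stencil: keep the index of every marked element.
    thrust::copy_if(thrust::counting_iterator<int>(0),
                    thrust::counting_iterator<int>(width * height),
                    marks.begin(),
                    indicesEmptied.begin(),
                    IsMarked());
}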
I came across counting_iterator as a sort of substitute for directly using threadIdx and blockIdx to acquire the index of the processed element. I tried the solution below, but when compiling it, it gives me a "/usr/include/cuda/thrust/detail/device/cuda/copy_if.inl(151): Error: Unaligned memory accesses not supported". As far as I know I'm not trying to access memory in an unaligned fashion. Does anybody know what's going on and/or how to fix this?
struct IsEmpty2 {
    float* xi;

    IsEmpty2(float* pXi) { xi = pXi; }

    __host__ __device__ bool operator()(thrust::tuple<float, int> t) {
        bool mark = thrust::get<0>(t) < -0.01f;
        if (mark) {
            int countindex = thrust::get<1>(t);
            if (xi[countindex] > 1.01f) return false;
            //etc.
        }
        return mark;
    }
};

thrust::copy_if(indices.begin(),
                indices.end(),
                thrust::make_zip_iterator(thrust::make_tuple(xi, thrust::counting_iterator<int>())),
                indicesEmptied.begin(),
                IsEmpty2(rawXi));
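For completeness, a raw device pointer like rawXi would typically come from the device_vector backing the data via thrust::raw_pointer_cast; the surrounding code isn't shown, so this is an assumption:
#include <thrust/device_vector.h>

void setup()
{
    // Hypothetical: xiVec is the device_vector that backs the xi data.
    thrust::device_vector<float> xiVec(1024);
    float *rawXi = thrust::raw_pointer_cast(xiVec.data());
    // rawXi can now be stored in the IsEmpty2 functor and dereferenced on the device.
    (void)rawXi;
}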
#phoad: you're right about the shared mem; it struck me right after I posted my reply, and I subsequently thought that the cache would probably help me. But you beat me with your quick response. The if-statement, however, is executed in less than 5% of all cases, so either using shared mem or relying on the cache will probably have negligible impact on performance.
Tuples only support 10 values, so that would mean I would require tuples of tuples for the 26 values in the 3D case. Working with tuples and zip_iterator was already quite cumbersome, so I'll pass on this option (also from a code readability standpoint). I tried your suggestion by directly using threadIdx.x etc. in the device function, but Thrust doesn't like that. I seem to be getting some unexplainable results and sometimes I end up with a Thrust error. The following program, for example, generates a 'thrust::system::system_error' with an 'unspecified launch failure', although it first correctly prints "Processing 10" to "Processing 41":
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>

struct printf_functor {
    __host__ __device__ void operator()(int e) {
        printf("Processing %d\n", threadIdx.x);
    }
};

int main() {
    thrust::device_vector<int> dVec(32);
    for (int i = 0; i < 32; ++i)
        dVec[i] = i + 10;
    thrust::for_each(dVec.begin(), dVec.end(), printf_functor());
    return 0;
}
The same applies to printing blockIdx.x. Printing blockDim.x, however, generates no error. I was hoping for a clean solution, but I guess I am stuck with my current work-around.