Overloading the CUDA shuffle function makes the original ones invisible - cuda

I'm trying to implement my own 64-bit shuffle function in CUDA. However, if I do it like this:
static __inline__ __device__ double __shfl_xor(double var, int laneMask, int width=warpSize)
{
int hi, lo;
asm volatile( "mov.b64 { %0, %1 }, %2;" : "=r"(lo), "=r"(hi) : "d"(var) );
hi = __shfl_xor( hi, laneMask, width );
lo = __shfl_xor( lo, laneMask, width );
return __hiloint2double( hi, lo );
}
All subsequent calls to __shfl_xor will be instantiated from this 64-bit version, no matter what the type of the argument is. For example, if I am doing
int a;
a = __shfl_xor( a, 16 );
It would still use the double version.
A work-around might be using different function names. But since I'm calling this shuffle function from a template function, using different names means that I have to make a different version for 64-bit floating points, which is not quite neat.
So how can I overload the __shfl_xor(double,...) function while on the same time still make sure the __shfl_xor(int,...) can be called appropriately?

All integral types and float can be upcasted to double. When given a choice between in-built function and your specialized double function, the compiler here might be picking yours for all types.
Have you tried creating a function with a different name and using that to create both your specialized double variant and as dummies for the other types?
For example:
static __inline__ __device__ double foo_shfl_xor(double var, int laneMask, int width=warpSize)
{
// Your double shuffle implementation
}
static __inline__ __device__ int foo_shfl_xor(int var, int laneMask, int width=warpSize)
{
// For every non-double data type you use
// Just call the original shuffle function
return __shfl_xor(var, laneMask, width);
}
// Your code that uses shuffle
double d;
int a;
foo_shfl_xor(d, ...); // Calls your custom shuffle
foo_shfl_xor(a, ...); // Calls default shuffle

Related

How to change the default code generated by SWIG for the allocation of memory for a C structure?

I am using a flexible array in the structure. So I want to change the memory allocated for that structure with some of my own code. Basically I want to change the new_structname() and structname_variable_set() functions.
typedef struct vector{
int x;
char y;
int arr[0];
} vector;
here, SWIG generated new_vector() function to allocate memory by calling calloc(1,sizeof(struct vector)) where swig will not handle these type of structure in a special manner. So we need to modify the swig generated new_vector() in order to allocate memory for the flexible array. So is there any way to handle this?
There are a few ways you can do this. What you're looking for though is %extend. That lets us define new constructors and implement them as we see fit. (It even works with a C compiler, they're only constructors from the perspective of the target language).
Using your vector as a starting point we can illustrate this:
%module test
%include <stdint.i>
%inline %{
typedef struct vector{ int x; char y; int arr[0]; }vector;
%}
%extend vector {
vector(const size_t len) {
vector *v = calloc(1, sizeof *v + len);
v->x = len;
return v;
}
}
With this SWIG synthesises a new_vector function in the generated module code as you'd hoped.
I also assumed that you want to record the length inside the struct as one of its members. If that's not the case you can simply delete the assignment I made.

Writing a simple thrust functor operating on some zipped arrays

I am trying to perform a thrust::reduce_by_key using zip and permutation iterators.
i.e. doing this on a zipped array of several 'virtual' permuted arrays.
I am having trouble in writing the syntax for the functor density_update.
But first the setup of the problem.
Here is my function call:
thrust::reduce_by_key( dflagt,
dflagtend,
thrust::make_zip_iterator(
thrust::make_tuple(
thrust::make_permutation_iterator(dmasst, dmapt),
thrust::make_permutation_iterator(dvelt, dmapt),
thrust::make_permutation_iterator(dmasst, dflagt),
thrust::make_permutation_iterator(dvelt, dflagt)
)
),
thrust::make_discard_iterator(),
danswert,
thrust::equal_to<int>(),
density_update()
)
dmapt, dflagt are of type thrust::device_ptr<int> and dvelt , dmasst and danst are of type
thrust::device_ptr<double>.
(They are thrust wrappers to my raw cuda arrays)
The arrays mapt and flagt are both index vectors from which I need to perform a gather operation from the arrays dmasst and dvelt.
After the reduction step I intend to write my data to the danswert array. Since multiple arrays are being used in the reduction, obviously I am using zip iterators.
My problem lies in writing the functor density_update which is binary operation.
struct density_update
{
typedef thrust::device_ptr<double> ElementIterator;
typedef thrust::device_ptr<int> IndexIterator;
typedef thrust::permutation_iterator<ElementIterator,IndexIterator> PIt;
typedef thrust::tuple< PIt , PIt , PIt, PIt> Tuple;
__host__ __device__
double operator()(const Tuple& x , const Tuple& y)
{
return thrust::get<0>(*x) * (thrust::get<1>(*x) - thrust::get<3>(*x)) + \
thrust::get<0>(*y) * (thrust::get<1>(*y) - thrust::get<3>(*y));
}
};
The value being returned is a double . Why the binary operation looks like the above functor is
not important. I just want to know how I would go about correcting the above syntactically.
As shown above the code is throwing a number of compilation errors. I am not sure where I have gone wrong.
I am using CUDA 4.0 on GTX 570 on Ubuntu 10.10
density_update should not receive tuples of iterators as parameters -- it needs tuples of the iterators' references.
In principle you could write density_update::operator() in terms of the particular reference type of the various iterators, but it's simpler to have the compiler infer the type of the parameters:
struct density_update
{
template<typename Tuple>
__host__ __device__
double operator()(const Tuple& x, const Tuple& y)
{
return thrust::get<0>(x) * (thrust::get<1>(x) - thrust::get<3>(x)) + \
thrust::get<0>(y) * (thrust::get<1>(y) - thrust::get<3>(y));
}
};

CUDA: How to apply __restrict__ on array of pointers to arrays?

This kernel using two __restrict__ int arrays compiles fine:
__global__ void kerFoo( int* __restrict__ arr0, int* __restrict__ arr1, int num )
{
for ( /* Iterate over array */ )
arr1[i] = arr0[i]; // Copy one to other
}
However, the same two int arrays composed into a pointer array fails compilation:
__global__ void kerFoo( int* __restrict__ arr[2], int num )
{
for ( /* Iterate over array */ )
arr[1][i] = arr[0][i]; // Copy one to other
}
The error given by the compiler is:
error: invalid use of `restrict'
I have certain structures that are composed as an array of pointers to arrays. (For example, a struct passed to the kernel that has int* arr[16].) How do I pass them to kernels and be able to apply __restrict__ on them?
The CUDA C manual only refers to the C99 definition of __restrict__, no special CUDA-specific circumstances.
Since the indicated parameter is an array containing two pointers, this use of __restrict__ looks perfectly valid to me, no reason for the compiler to complain IMHO. I would ask the compiler author to verify and possibly/probably correct the issue. I'd be interested in different opinions, though.
One remark to #talonmies:
The whole point of restrict is to tell the compiler that two or more pointer arguments will never overlap in memory.
This is not strictly true. restrict tells the compiler that the pointer in question, for the duration of its lifetime, is the only pointer through which the pointed-to object can be accessed. Be aware that the object pointed to is only assumed to be an array of int. (In truth it's only one int in this case.) Since the compiler cannot know the size of the array, it is up to the programmer to guard the array's boundaries..
Filling in the comment in your code with some arbitrary iteration, we get the following program:
__global__ void kerFoo( int* __restrict__ arr[2], int num )
{
for ( int i = 0; i < 1024; i ++)
arr[1][i] = arr[0][i]; // Copy one to other
}
and this compiles fine with CUDA 10.1 (Godbolt.org).

Cannot overload make_uint4 function

I'm trying to overload make_uint4 in the following manner:
namespace A {
namespace B {
inline __host__ __device__ uint4 make_uint4(uint2 a, uint2 b) {
return make_uint4(a.x, a.y, b.x, b.y);
}
}
}
But when I try to compile it, nvcc returns an error:
error: no suitable constructor exists to convert from "unsigned int" to "uint2"
error: no suitable constructor exists to convert from "unsigned int" to "uint2"
error: too many arguments in function call
All these errors point to the "return…" line.
I was able to get a partial repro on VS 2010 and CUDA 4.0 (the compiler built the code OK but Intellisense flagged the error you are seeing). Try the following:
#include "vector_functions.h"
inline __host__ __device__ uint4 make_uint4(uint2 a, uint2 b)
{
return ::make_uint4(a.x, a.y, b.x, b.y);
}
This fixed it for me.
I have no problem compiling it in Visual Studio+nvcc. What compiler are you using?
If that would be of any help: make_uint4 is defined in vector_functions.h, line 170 as
static __inline__ __host__ __device__ uint4 make_uint4(unsigned int x, unsigned int y, unsigned int z, unsigned int w)
{
uint4 t; t.x = x; t.y = y; t.z = z; t.w = w; return t;
}
Update:
I get similar error when I try to overload the function while being inside my custom namespace. Are you certain you are not inside one? If so, try putting :: in front of function call to refer to global scope, i.e:
return ::make_uint4(a.x, a.y, b.x, b.y);
I don't have the library code, but it seems like the compiler doesn't like overloaded device functions (as they are treated just like really fancy inline macros). What is does is shadow (hide) the old make_uint4(a,b,c,d) with your new make_uint4(va, vb) and try to call the latter with 4 uint parameters. That doesn't work because there is no conversion from uint to uint2 (as indicated by the first two error messages) and there are 4 instead of 2 arguments (the last error message).
Use a slightly different function name like make_uint4_from_uint2s and you'll be fine.

C++: Explicit DLL Loading: First-chance Exception on non "extern C" functions

I am having trouble importing my C++ functions. If I declare them as C functions I can successfully import them. When explicit loading, if any of the functions are missing the extern as C decoration I get a the following exception:
First-chance exception at 0x00000000 in cpp.exe: 0xC0000005: Access violation.
DLL.h:
extern "C" __declspec(dllimport) int addC(int a, int b);
__declspec(dllimport) int addCpp(int a, int b);
DLL.cpp:
#include "DLL.h"
int addC(int a, int b) {
return a + b;
}
int addCpp(int a, int b) {
return a + b;
}
main.cpp:
#include "..DLL/DLL.h"
#include <stdio.h>
#include <windows.h>
int main() {
int a = 2;
int b = 1;
typedef int (*PFNaddC)(int,int);
typedef int (*PFNaddCpp)(int,int);
HMODULE hDLL = LoadLibrary(TEXT("../Debug/DLL.dll"));
if (hDLL != NULL)
{
PFNaddC pfnAddC = (PFNaddC)GetProcAddress(hDLL, "addC");
PFNaddCpp pfnAddCpp = (PFNaddCpp)GetProcAddress(hDLL, "addCpp");
printf("a=%d, b=%d\n", a,b);
printf("pfnAddC: %d\n", pfnAddC(a,b));
printf("pfnAddCpp: %d\n", pfnAddCpp(a,b)); //EXCEPTION ON THIS LINE
}
getchar();
return 0;
}
How can I import c++ functions for dynamic loading? I have found that the following code works with implicit loading by referencing the *.lib, but I would like to learn about dynamic loading.
Thank you to all in advance.
Update:
bindump /exports
1 00011109 ?addCpp##YAHHH#Z = #ILT+260(?addCpp##YAHHH#Z)
2 00011136 addC = #ILT+305(_addC)
Solution:
Create a conversion struct as
found here
Take a look at the
file exports and copy explicitly the
c++ mangle naming convention.
PFNaddCpp pfnAddCpp = (PFNaddCpp)GetProcAddress(hDLL, "?addCpp##YAHHH#Z");
Inevitably, the access violation on the null pointer is because GetProcAddress() returns null on error.
The problem is that C++ names are mangled by the compiler to accommodate a variety of C++ features (namespaces, classes, and overloading, among other things). So, your function addCpp() is not really named addCpp() in the resulting library. When you declare the function with extern "C", you give up overloading and the option of putting the function in a namespace, but in return you get a function whose name is not mangled, and which you can call from C code (which doesn't know anything about name mangling.)
One option to get around this is to export the functions using a .def file to rename the exported functions. There's an article, Explicitly Linking to Classes in DLLs, that describes what is necessary to do this.
It's possible to just wrap a whole header file in extern "C" as follows. Then you don't need to worry about forgetting an extern "C" on one of your declarations.
#ifdef __cplusplus
extern "C" {
#endif
__declspec(dllimport) int addC(int a, int b);
__declspec(dllimport) int addCpp(int a, int b);
#ifdef __cplusplus
} /* extern "C" */
#endif
You can still use all of the C++ features that you're used to in the function bodies -- these functions are still C++ functions -- they just have restrictions on the prototypes to make them compatible with C code.