Enable code indexing of Cuda in Clion - cuda

I am using CLion to develop a CUDA program. Code highlighting works fine when the file extension is .h. However, when it is changed to .cuh, CLion treats the file as plain text, and I have not been able to enable code highlighting. I understand that a complete CUDA toolchain is out of the question, so I do not expect CLion to parse statements like mykernel<<<1024, 100>>>. Still, I would be more than satisfied if it could parse the file just like a normal header/cpp file.
Many thanks

First, make sure you tell CLion to treat .cu and .cuh files as C++ using the File Types settings menu.
CLion is not able to parse CUDA's language extensions, but it does provide a preprocessor macro that is defined only when CLion is parsing the code. You can use this to implement almost complete CUDA support yourself.
Much of the problem is that CLion's parser is derailed by keywords like __host__ or __device__, causing it to fail to do things it otherwise knows how to do:
CLion has failed to understand Dtype in this example, because the CUDA stuff confused its parsing.
The most minimal solution to this problem is to give CLion preprocessor macros that make it ignore the new keywords, fixing the worst of the brokenness:
#ifdef __JETBRAINS_IDE__
#define __host__
#define __device__
#define __shared__
#define __constant__
#define __global__
#endif
This fixes the above example:
However, CUDA functions like __syncthreads, __popc will still fail to index. So will CUDA builtins like threadIdx. One option is to provide endless preprocessor macros (or even struct definitions) for these, but that's ugly and sacrifices type-safety.
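To see why empty macros are enough for the parser, here is a minimal host-only sketch. It is an illustration, not part of the original answer: the guard uses `#ifndef __CUDACC__` instead of `__JETBRAINS_IDE__` so that an ordinary C++ compiler can build it, and `twice` and `scale` are hypothetical names.

```cpp
// Sketch only: when the file is NOT being compiled by nvcc (so __CUDACC__ is
// undefined), blank out the CUDA qualifiers so a plain C++ front end accepts them.
// Assumption: the answer guards on __JETBRAINS_IDE__ instead; this guard is used
// here purely so the sketch builds with a regular host compiler.
#ifndef __CUDACC__
#define __host__
#define __device__
#define __global__
#endif

// Hypothetical helper: once the qualifiers expand to nothing, this is an
// ordinary inline function that any C++ parser can index.
__host__ __device__ inline int twice(int x) { return 2 * x; }

// Hypothetical kernel: with __global__ blanked it is just a plain function,
// so it can even be called directly from host code in this sketch.
__global__ void scale(int *data, int n) {
    for (int i = 0; i < n; ++i) {
        data[i] = twice(data[i]);
    }
}
```

With the qualifiers reduced to nothing, the parser sees only standard C++, which is why type resolution and completion start working again.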
If you're using Clang's CUDA frontend, you can do better. Clang implements the implicitly-defined CUDA builtins by defining them in headers, which it then includes when compiling your code. These provide definitions of things like threadIdx. By pretending to be the CUDA compiler's preprocessor and including device_functions.h, we can get __popc and friends to work, too:
#ifdef __JETBRAINS_IDE__
#define __host__
#define __device__
#define __shared__
#define __constant__
#define __global__
// This is slightly mental, but gets it to properly index device function calls like __popc and whatever.
#define __CUDACC__
#include <device_functions.h>
// These headers are all implicitly present when you compile CUDA with clang. Clion doesn't know that, so
// we include them explicitly to make the indexer happy. Doing this when you actually build is, obviously,
// a terrible idea :D
#include <__clang_cuda_builtin_vars.h>
#include <__clang_cuda_intrinsics.h>
#include <__clang_cuda_math_forward_declares.h>
#include <__clang_cuda_complex_builtins.h>
#include <__clang_cuda_cmath.h>
#endif // __JETBRAINS_IDE__
This will get you perfect indexing of virtually all CUDA code. CLion even gracefully copes with <<<...>>> syntax. It puts a little red line under one character on each end of the launch block, but otherwise treats it as a function call - which is perfectly fine:

Right click file in project tool window -> Associate with file type -> C++
However, CLion does not officially support CUDA at the moment, so it cannot parse CUDA syntax.
UPDATE:
Since CLion 2020.1 there is official CUDA C/C++ support, so CLion can now handle these files correctly.

Thanks! I added more "fake" declarations to allow CLion to parse CUDA better:
#ifdef __JETBRAINS_IDE__
#define __CUDACC__ 1
#define __host__
#define __device__
#define __global__
#define __forceinline__
#define __shared__
inline void __syncthreads() {}
inline void __threadfence_block() {}
template<class T> inline T __clz(const T val) { return val; }
struct __cuda_fake_struct { int x; };
extern __cuda_fake_struct blockDim;
extern __cuda_fake_struct threadIdx;
extern __cuda_fake_struct blockIdx;
#endif

I've expanded upon this answer, using the method found in this answer, to provide a more comprehensive parsing macro. You can now have .x, .y and .z work properly without issue, and use gridDim. In addition, I've updated the list to include most intrinsics and values found in the CUDA 8.0 documentation guide. Note that this should have full C++ compatibility, and maybe C. It does not account for all functions: atomics, math functions (just include math.h for most), texture, surface, timing, warp vote and shuffle, assertion, launch bounds, and video functions are missing.
#ifdef __JETBRAINS_IDE__
#include "math.h"
#define __CUDACC__ 1
#define __host__
#define __device__
#define __global__
#define __noinline__
#define __forceinline__
#define __shared__
#define __constant__
#define __managed__
#define __restrict__
// CUDA Synchronization
inline void __syncthreads() {}
inline void __threadfence_block() {}
inline void __threadfence() {}
inline void __threadfence_system() {}
inline int __syncthreads_count(int predicate) { return predicate; }
inline int __syncthreads_and(int predicate) { return predicate; }
inline int __syncthreads_or(int predicate) { return predicate; }
template<class T> inline T __clz(const T val) { return val; }
template<class T> inline T __ldg(const T* address) { return *address; }
// CUDA TYPES
typedef unsigned char uchar;
typedef unsigned short ushort;
typedef unsigned int uint;
typedef unsigned long ulong;
typedef unsigned long long ulonglong;
typedef long long longlong;
typedef struct uchar1{
uchar x;
}uchar1;
typedef struct uchar2{
uchar x;
uchar y;
}uchar2;
typedef struct uchar3{
uchar x;
uchar y;
uchar z;
}uchar3;
typedef struct uchar4{
uchar x;
uchar y;
uchar z;
uchar w;
}uchar4;
typedef struct char1{
char x;
}char1;
typedef struct char2{
char x;
char y;
}char2;
typedef struct char3{
char x;
char y;
char z;
}char3;
typedef struct char4{
char x;
char y;
char z;
char w;
}char4;
typedef struct ushort1{
ushort x;
}ushort1;
typedef struct ushort2{
ushort x;
ushort y;
}ushort2;
typedef struct ushort3{
ushort x;
ushort y;
ushort z;
}ushort3;
typedef struct ushort4{
ushort x;
ushort y;
ushort z;
ushort w;
}ushort4;
typedef struct short1{
short x;
}short1;
typedef struct short2{
short x;
short y;
}short2;
typedef struct short3{
short x;
short y;
short z;
}short3;
typedef struct short4{
short x;
short y;
short z;
short w;
}short4;
typedef struct uint1{
uint x;
}uint1;
typedef struct uint2{
uint x;
uint y;
}uint2;
typedef struct uint3{
uint x;
uint y;
uint z;
}uint3;
typedef struct uint4{
uint x;
uint y;
uint z;
uint w;
}uint4;
typedef struct int1{
int x;
}int1;
typedef struct int2{
int x;
int y;
}int2;
typedef struct int3{
int x;
int y;
int z;
}int3;
typedef struct int4{
int x;
int y;
int z;
int w;
}int4;
typedef struct ulong1{
ulong x;
}ulong1;
typedef struct ulong2{
ulong x;
ulong y;
}ulong2;
typedef struct ulong3{
ulong x;
ulong y;
ulong z;
}ulong3;
typedef struct ulong4{
ulong x;
ulong y;
ulong z;
ulong w;
}ulong4;
typedef struct long1{
long x;
}long1;
typedef struct long2{
long x;
long y;
}long2;
typedef struct long3{
long x;
long y;
long z;
}long3;
typedef struct long4{
long x;
long y;
long z;
long w;
}long4;
typedef struct ulonglong1{
ulonglong x;
}ulonglong1;
typedef struct ulonglong2{
ulonglong x;
ulonglong y;
}ulonglong2;
typedef struct ulonglong3{
ulonglong x;
ulonglong y;
ulonglong z;
}ulonglong3;
typedef struct ulonglong4{
ulonglong x;
ulonglong y;
ulonglong z;
ulonglong w;
}ulonglong4;
typedef struct longlong1{
longlong x;
}longlong1;
typedef struct longlong2{
longlong x;
longlong y;
}longlong2;
typedef struct float1{
float x;
}float1;
typedef struct float2{
float x;
float y;
}float2;
typedef struct float3{
float x;
float y;
float z;
}float3;
typedef struct float4{
float x;
float y;
float z;
float w;
}float4;
typedef struct double1{
double x;
}double1;
typedef struct double2{
double x;
double y;
}double2;
typedef uint3 dim3;
extern dim3 gridDim;
extern uint3 blockIdx;
extern dim3 blockDim;
extern uint3 threadIdx;
extern int warpSize;
#endif
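As a rough illustration of what these stand-in declarations buy you, the sketch below resolves the usual global-index expression with an ordinary C++ compiler. It is not part of the answer. Two things are assumptions: the stand-ins are guarded by `#ifndef __CUDACC__` rather than `__JETBRAINS_IDE__`, and they are defined with arbitrary values (the answer only declares them `extern`) so that the example links. `globalThreadIndexX` is a hypothetical name.

```cpp
// Stand-ins for CUDA builtins, active only when nvcc is not the compiler.
#ifndef __CUDACC__
#define __device__
struct uint3 { unsigned int x, y, z; };
typedef uint3 dim3;
// Assumption: defined with made-up values so the sketch links and runs;
// for indexing alone, `extern` declarations as in the answer are enough.
dim3  blockDim  = {256, 1, 1};
uint3 blockIdx  = {2, 0, 0};
uint3 threadIdx = {5, 0, 0};
#endif

// Hypothetical device helper: the member accesses .x now resolve to real
// struct fields, so the IDE can type-check and complete them.
__device__ inline unsigned int globalThreadIndexX() {
    return blockDim.x * blockIdx.x + threadIdx.x;
}
```

Because `.x`, `.y` and `.z` are real members of a real struct, typos such as `threadIdx.X` are flagged by the IDE instead of being silently ignored.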

If you want CLion to parse all your .cu files as .cpp or any other supported file type, you can do this:
Go to File -> Settings -> Editor -> File Types
Select the file type you want them parsed as in the first column (.cpp)
Click the plus sign of the second column and enter *.cu
Press Apply, and CLion will parse all your .cu files as if they were the file type you selected in the upper column (.cpp)
you can see more documentation here

I've found that CLion seems to code-index all build targets, not just the target you've selected to build. My strategy has been to make .cpp symbolic links out of my .cu files and make a child CLion/CMake C++ build target (for indexing only) that references those .cpp links. This approach appears to be working on small CUDA/Thrust C++11 projects in CLion 2017.3.3 on Ubuntu 16.04.3.
I do this by:
register the .cu/cuh files with clion, as in the other answers
add the cuda/clion macro voodoo to my .cu files, as in the other answers (the position of the voodoo may be important, but I haven't run into any troubles yet)
make .cpp/.hpp symbolic links to your .cu/.cuh files in your project directory
make a new folder with the single file named clionShadow/CMakeLists.txt that contains:
cmake_minimum_required(VERSION 3.9)
project(cudaNoBuild)
set(CMAKE_CXX_STANDARD 11)
add_executable(cudaNoBuild ../yourcudacode.cpp ../yourcudacode.hpp)
target_include_directories(cudaNoBuild PUBLIC ${CUDA_INCLUDE_DIRS})
add a dependency to clionShadow/CMakeLists.txt at the end of your main CMakeLists.txt with a line like this:
add_subdirectory(clionShadow)
Now, clion parses and code-indexes .cu files 'through' the .cpp files.
Remember, the cudaNoBuild target is not for building - it will use the c++ toolchain which won't work. If you suddenly get compilation errors check clion's build target settings - I've noticed that it sometimes mixes and matches the current build settings between the projects. In this case go to the Edit_Configurations dialog under the Run menu and ensure that clion has not changed the target_executable to be from the cudaNoBuild target.
Edit: Gah! Upon rebuilding the CMake and ide cache after an update to clion 2017.3.3 things are not really working the way they did before. Indexing only works for .cpp files and breakpoints only work for .cu files.

This is not closely related, but this question somehow showed up in the Google results for 'Pycharm cuda highlight'. That said, use CLion for C/C++ projects!
As of PyCharm 2020.3 Community Edition for Mac, it is under File > File Types > Associate with file types.
If uncertain, search for 'file type' with the search bar under the Help menu.

Related

How to invoke a constexpr function on both device and host?

From the documentation:
I.4.20.4. Constexpr functions and function templates
By default, a constexpr function cannot be called from a function with incompatible execution space. The experimental nvcc flag --expt-relaxed-constexpr removes this restriction. When this flag is specified, host code can invoke a __device__ constexpr function and device code can invoke a __host__ constexpr function.
I read it, but I don't understand what it means - device code can invoke a host constexpr function? Here is my test:
constexpr int bar(int i)
{
#ifdef __CUDA_ARCH__
return i;
#else
return 555;
#endif
}
__global__ void kernel()
{
int tid = (blockDim.x * blockIdx.x) + threadIdx.x;
printf("%i\n", bar(tid));
}
int main(int argc, char *[])
{
static_assert(bar(5) > 0);
// static_assert(bar(argc) > 0); // compile error
cout << bar(argc) << endl;
kernel<<<2, 2>>>();
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
return 0;
}
It prints:
555
0
1
2
3
According to my understanding, the host invokes the host function, while the device invokes the device function. I.e. it behaves the same as if I declare bar with both __host__ and __device__ attributes. Adding a single attribute (__host__ or __device__) doesn't make any difference.
As a comparison, the documentation for std::initializer_list is much clearer:
I.4.20.2. std::initializer_list
By default, the CUDA compiler will implicitly consider the member functions of std::initializer_list to have __host__ __device__ execution space specifiers, and therefore they can be invoked directly from device code.
Here I don't have any questions.
What does the documentation mean exactly?
Consider the following code.
#include <algorithm> //std::max
__global__ void kernel(int *array, int n) {
array[0] = std::max(array[1], array[2]);
}
This code will not compile by default.
error: calling a constexpr __host__ function("max") from a __global__ function("kernel") is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
std::max is a standard host function without __device__ execution space specifiers and thus cannot be called from device code.
However, when the compiler flag --expt-relaxed-constexpr is specified, the code compiles nonetheless. I cannot give you any details about how this is achieved internally.
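If you would rather not rely on an experimental flag, one alternative (not taken from the answer above) is to write the small constexpr helpers you need yourself and mark them `__host__ __device__` explicitly. The sketch below assumes a hypothetical helper `maxOf`; the qualifiers are blanked out when nvcc is not the compiler so that the same file also builds as plain C++.

```cpp
// Blank the CUDA qualifiers when this is not an nvcc compilation.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical stand-in for std::max. Because it carries both execution space
// specifiers, host code and device code may call it without any extra flag.
template <typename T>
__host__ __device__ constexpr T maxOf(T a, T b) {
    return (a < b) ? b : a;
}

// Still usable in constant expressions, like any other constexpr function.
static_assert(maxOf(3, 7) == 7, "maxOf works at compile time");
```

In a kernel, `array[0] = maxOf(array[1], array[2]);` would then take the place of the `std::max` call from the example above.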

Problems in CUDA function cudaMemcpyToSymbol()

I'm copying data to a specific CUDA symbol. My CUDA version is 10.1 and the GPU is a Tesla K80. I compiled the code with VS2017, generating code for compute_35 and sm_35. When I wrote my code like this,
<.h>
#include <cuda_runtime.h>
__device__ __constant__ float scoreRatio;
<.cpp>
const float ScoreRatio;
cudaErr=cudaMemcpyToSymbol(&scoreRatio,&ScoreRatio,sizeof(ScoreRatio));
printf("%d: %s.\n",cudaErr,cudaGetErrorString(cudaErr));
it compiled fine, but I got cudaErrorInvalidSymbol when I ran the program,
13: Invalid device symbol
If I modified my code like this,
<.h>
#include <cuda_runtime.h>
__device__ __constant__ float scoreRatio;
<.cpp>
const float ScoreRatio;
cudaErr=cudaMemcpyToSymbol(scoreRatio,&ScoreRatio,sizeof(ScoreRatio));
then compilation fails because of an incompatible parameter type: the first argument is a float, while the function expects a const void*. Here is the function declaration I found in cuda_runtime_api.h,
extern __host__ cudaError_t CUDARTAPI cudaMemcpyToSymbol(const void *symbol, const void *src, size_t count, size_t offset __dv(0), enum cudaMemcpyKind kind __dv(cudaMemcpyHostToDevice));
Could anyone please give some advice, much appreciated.
This:
<.h>
#include <cuda_runtime.h>
__device__ __constant__ float scoreRatio;
<.cpp>
const float ScoreRatio;
cudaErr=cudaMemcpyToSymbol(&scoreRatio,&ScoreRatio,sizeof(ScoreRatio));
printf("%d: %s.\n",cudaErr,cudaGetErrorString(cudaErr));
is illegal/wrong in two ways. You must use nvcc to compile the code using a device code aware trajectory, and the first argument of the cudaMemcpyToSymbol call is incorrect. If you simply rename your .cpp source file to have a .cu file extension and change the contents so that it looks like this:
<.cu>
#include <.h>
....
const float ScoreRatio;
cudaErr=cudaMemcpyToSymbol(scoreRatio, &ScoreRatio, sizeof(ScoreRatio));
printf("%d: %s.\n", cudaErr, cudaGetErrorString(cudaErr));
it will both compile and run correctly. See here for an explanation of why it is necessary to change the first argument of the cudaMemcpyToSymbol call.

Why can't we split __host__ and __device__ implementations?

If we have a __host__ __device__ function in CUDA, we can use macros to choose different code paths for host-side and device-side code in its implementations, like so:
__host__ __device__ int foo(int x)
{
#ifdef __CUDA_ARCH__
return x * 2;
#else
return x;
#endif
}
but why is it that we can't write:
__host__ __device__ int foo(int x);
__device__ int foo(int x) { return x * 2; }
__host__ int foo(int x) { return x; }
instead?
The Clang implementation of CUDA C++ actually supports overloading on __host__ and __device__ because it considers the execution space qualifiers part of the function signature. Note, however, that even there, you'd have to declare the two functions separately:
__device__ int foo(int x);
__host__ int foo(int x);
__device__ int foo(int x) { return x * 2; }
__host__ int foo(int x) { return x; }
test it out here
Personally, I'm not sure how desirable/important that really is to have though. Consider that you can just define a foo(int x) in the host code outside of your CUDA source. If someone told me they need to have different implementations of the same function for host and device where the host version for some reason needs to be defined as part of the CUDA source, my initial gut feeling would be that there's likely something going in a bit of an odd direction. If the host version does something different, shouldn't it most likely have a different name? If it logically does the same thing just not using the GPU, then why does it have to be part of the CUDA source? I'd generally advocate for keeping as clean and strict a separation between host and device code as possible and keeping any host code inside the CUDA source to the bare minimum. Even if you don't care about the cleanliness of your code, doing so will at least minimize the chances of getting hurt by all the compiler magic that goes on under the hood…

Polymorphism and derived classes in CUDA / CUDA Thrust

This is my first question on Stack Overflow, and it's quite a long question. The tl;dr version is: How do I work with a thrust::device_vector<BaseClass> if I want it to store objects of different types DerivedClass1, DerivedClass2, etc, simultaneously?
I want to take advantage of polymorphism with CUDA Thrust. I'm compiling for an -arch=sm_30 GPU (GeForce GTX 670).
Let us take a look at the following problem: Suppose there are 80 families in town. 60 of them are married couples, 20 of them are single-parent households. Each family has, therefore, a different number of members. It's census time and households have to state the parents' ages and the number of children they have. Therefore, an array of Family objects is constructed by the government, namely thrust::device_vector<Family> familiesInTown(80), such that information of families familiesInTown[0] to familiesInTown[59] corresponds to married couples, the rest (familiesInTown[60] to familiesInTown[79]) being single-parent households.
Family is the base class - the number of parents in the household (1 for single parents and 2 for couples) and the number of children they have are stored here as members.
SingleParent, derived from Family, includes a new member - the single parent's age, unsigned int ageOfParent.
MarriedCouple, also derived from Family, however, introduces two new members - both parents' ages, unsigned int ageOfParent1 and unsigned int ageOfParent2.
#include <iostream>
#include <stdio.h>
#include <thrust/device_vector.h>
class Family
{
protected:
unsigned int numParents;
unsigned int numChildren;
public:
__host__ __device__ Family() {};
__host__ __device__ Family(const unsigned int& nPars, const unsigned int& nChil) : numParents(nPars), numChildren(nChil) {};
__host__ __device__ virtual ~Family() {};
__host__ __device__ unsigned int showNumOfParents() {return numParents;}
__host__ __device__ unsigned int showNumOfChildren() {return numChildren;}
};
class SingleParent : public Family
{
protected:
unsigned int ageOfParent;
public:
__host__ __device__ SingleParent() {};
__host__ __device__ SingleParent(const unsigned int& nChil, const unsigned int& age) : Family(1, nChil), ageOfParent(age) {};
__host__ __device__ unsigned int showAgeOfParent() {return ageOfParent;}
};
class MarriedCouple : public Family
{
protected:
unsigned int ageOfParent1;
unsigned int ageOfParent2;
public:
__host__ __device__ MarriedCouple() {};
__host__ __device__ MarriedCouple(const unsigned int& nChil, const unsigned int& age1, const unsigned int& age2) : Family(2, nChil), ageOfParent1(age1), ageOfParent2(age2) {};
__host__ __device__ unsigned int showAgeOfParent1() {return ageOfParent1;}
__host__ __device__ unsigned int showAgeOfParent2() {return ageOfParent2;}
};
If I were to naïvely initiate the objects in my thrust::device_vector<Family> with the following functors:
struct initSlicedCouples : public thrust::unary_function<unsigned int, MarriedCouple>
{
__device__ MarriedCouple operator()(const unsigned int& idx) const
// I use a thrust::counting_iterator to get idx
{
return MarriedCouple(idx % 3, 20 + idx, 19 + idx);
// Couple 0: Ages 20 and 19, no children
// Couple 1: Ages 21 and 20, 1 child
// Couple 2: Ages 22 and 21, 2 children
// Couple 3: Ages 23 and 22, no children
// etc
}
};
struct initSlicedSingles : public thrust::unary_function<unsigned int, SingleParent>
{
__device__ SingleParent operator()(const unsigned int& idx) const
{
return SingleParent(idx % 3, 25 + idx);
}
};
int main()
{
unsigned int Num_couples = 60;
unsigned int Num_single_parents = 20;
thrust::device_vector<Family> familiesInTown(Num_couples + Num_single_parents);
// Families [0] to [59] are couples. Families [60] to [79] are single-parent households.
thrust::transform(thrust::counting_iterator<unsigned int>(0),
thrust::counting_iterator<unsigned int>(Num_couples),
familiesInTown.begin(),
initSlicedCouples());
thrust::transform(thrust::counting_iterator<unsigned int>(Num_couples),
thrust::counting_iterator<unsigned int>(Num_couples + Num_single_parents),
familiesInTown.begin() + Num_couples,
initSlicedSingles());
return 0;
}
I would definitely be guilty of some classic object slicing...
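For readers unfamiliar with the term, the slicing effect can be reproduced on the host alone, without Thrust. The following is a minimal sketch with hypothetical `Base` and `Derived` types standing in for `Family` and `SingleParent`; it uses `std::vector` purely to keep the example host-only.

```cpp
#include <vector>

// Hypothetical base type, standing in for Family.
struct Base {
    int parents;
    explicit Base(int p = 0) : parents(p) {}
    virtual ~Base() {}
    virtual int extraFields() const { return 0; }
};

// Hypothetical derived type with one additional member.
struct Derived : Base {
    int age;
    Derived(int p, int a) : Base(p), age(a) {}
    int extraFields() const override { return 1; }
};

// Copies a Derived into a container of Base. Only the Base sub-object is
// copied, so `age` and the Derived override are lost: that is slicing.
inline int extraFieldsAfterStoring(const Derived &d) {
    std::vector<Base> v;
    v.push_back(d);             // slicing happens in this copy
    return v[0].extraFields();  // dispatches to Base::extraFields
}
```

Calling `extraFields()` on the original object reports one extra field, while the stored copy reports none, because the element type of the container is `Base`.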
So, I asked myself, what about a vector of pointers that may give me some sweet polymorphism? Smart pointers in C++ are a thing, and thrust iterators can do some really impressive things, so let's give it a shot, I figured. The following code compiles.
struct initCouples : public thrust::unary_function<unsigned int, MarriedCouple*>
{
__device__ MarriedCouple* operator()(const unsigned int& idx) const
{
return new MarriedCouple(idx % 3, 20 + idx, 19 + idx); // Memory issues?
}
};
struct initSingles : public thrust::unary_function<unsigned int, SingleParent*>
{
__device__ SingleParent* operator()(const unsigned int& idx) const
{
return new SingleParent(idx % 3, 25 + idx);
}
};
int main()
{
unsigned int Num_couples = 60;
unsigned int Num_single_parents = 20;
thrust::device_vector<Family*> familiesInTown(Num_couples + Num_single_parents);
// Families [0] to [59] are couples. Families [60] to [79] are single-parent households.
thrust::transform(thrust::counting_iterator<unsigned int>(0),
thrust::counting_iterator<unsigned int>(Num_couples),
familiesInTown.begin(),
initCouples());
thrust::transform(thrust::counting_iterator<unsigned int>(Num_couples),
thrust::counting_iterator<unsigned int>(Num_couples + Num_single_parents),
familiesInTown.begin() + Num_couples,
initSingles());
Family A = *(familiesInTown[2]); // Compiles, but object slicing takes place (in theory)
std::cout << A.showNumOfParents() << "\n"; // Segmentation fault
return 0;
}
Seems like I've hit a wall here. Am I understanding memory management correctly? (VTables, etc). Are my objects being instantiated and populated on the device? Am I leaking memory like there is no tomorrow?
For what it's worth, in order to avoid object slicing, I tried with a dynamic_cast<DerivedPointer*>(basePointer). That's why I made my Family destructor virtual.
Family *pA = familiesInTown[2];
MarriedCouple *pB = dynamic_cast<MarriedCouple*>(pA);
The following lines compile, but, unfortunately, a segfault is thrown again. CUDA-Memcheck won't tell me why.
std::cout << "Ages " << (pB -> showAgeOfParent1()) << ", " << (pB -> showAgeOfParent2()) << "\n";
and
MarriedCouple B = *pB;
std::cout << "Ages " << B.showAgeOfParent1() << ", " << B.showAgeOfParent2() << "\n";
In short, what I need is a class interface for objects that will have different properties, with different numbers of members among each other, but that I can store in one common vector (that's why I want a base class) that I can manipulate on the GPU. My intention is to work with them both in thrust transformations and in CUDA kernels via thrust::raw_pointer_casting, which has worked flawlessly for me until I've needed to branch out my classes into a base one and several derived ones. What is the standard procedure for that?
Thanks in advance!
I am not going to attempt to answer everything in this question, it is just too large. Having said that here are some observations about the code you posted which might help:
The GPU side new operator allocates memory from a private runtime heap. As of CUDA 6, that memory cannot be accessed by the host side CUDA APIs. You can access the memory from within kernels and device functions, but that memory cannot be accessed by the host. So using new inside a thrust device functor is a broken design that can never work. That is why your "vector of pointers" model fails.
Thrust is fundamentally intended to allow data parallel versions of typical STL algorithms to be applied to POD types. Building a codebase using complex polymorphic objects and trying to cram those through Thrust containers and algorithms might be made to work, but it isn't what Thrust was designed for, and I wouldn't recommend it. Don't be surprised if you break thrust in unexpected ways if you do.
CUDA supports a lot of C++ features, but the compilation and object models are much simpler than even the C++98 standard upon which they are based. CUDA lacks several key features (RTTI for example) which make complex polymorphic object designs workable in C++. My suggestion is use C++ features sparingly. Just because you can do something in CUDA doesn't mean you should. The GPU is a simple architecture and simple data structures and code are almost always more performant than functionally similar complex objects.
Having skim read the code you posted, my overall recommendation is to go back to the drawing board. If you want to look at some very elegant CUDA/C++ designs, spend some time reading the code bases of CUB and CUSP. They are both very different, but there is a lot to learn from both (and CUSP is built on top of Thrust, which makes it even more relevant to your usage case, I suspect).
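One way to read the "simple data structures" advice is to replace the class hierarchy with a single fixed-size record plus a tag. The sketch below is only one possible design under that assumption, not something proposed in the answer: `FamilyRecord`, `makeCouple`, `makeSingle`, `oldestParent` and `buildTown` are hypothetical names, and `std::vector` is used to keep the example host-only.

```cpp
#include <vector>

// One fixed-size record for every household. numParents doubles as the type
// tag (1 = single parent, 2 = married couple); unused slots stay at 0.
struct FamilyRecord {
    unsigned int numParents;
    unsigned int numChildren;
    unsigned int ages[2];  // ages[1] is unused for single parents
};

inline FamilyRecord makeCouple(unsigned int nChil, unsigned int age1, unsigned int age2) {
    FamilyRecord r = {2u, nChil, {age1, age2}};
    return r;
}

inline FamilyRecord makeSingle(unsigned int nChil, unsigned int age) {
    FamilyRecord r = {1u, nChil, {age, 0u}};
    return r;
}

// Branch on the tag instead of calling a virtual function.
inline unsigned int oldestParent(const FamilyRecord &r) {
    if (r.numParents == 2u && r.ages[1] > r.ages[0]) return r.ages[1];
    return r.ages[0];
}

// Mirrors the initialisation in the question: couples first, then singles.
inline std::vector<FamilyRecord> buildTown(unsigned int couples, unsigned int singles) {
    std::vector<FamilyRecord> town;
    town.reserve(couples + singles);
    for (unsigned int idx = 0; idx < couples; ++idx)
        town.push_back(makeCouple(idx % 3, 20 + idx, 19 + idx));
    for (unsigned int idx = couples; idx < couples + singles; ++idx)
        town.push_back(makeSingle(idx % 3, 25 + idx));
    return town;
}
```

The record has no virtual functions and no pointers, so every element has the same size and nothing is sliced. The cost is a wasted slot for single parents and an explicit branch wherever the two kinds behave differently.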
I completely agree with @talonmies's answer. (For example, I don't know that Thrust has been extensively tested with polymorphism.) Furthermore, I have not fully parsed your code. I post this answer to add additional info, in particular that I believe some level of polymorphism can be made to work with Thrust.
A key observation I would make is that it is not allowed to pass as an argument to a __global__ function an object of a class with virtual functions. This means that polymorphic objects created on the host cannot be passed to the device (via thrust, or in ordinary CUDA C++). (One basis for this limitation is the requirement for virtual function tables in the objects, which will necessarily be different between host and device, coupled with the fact that it is illegal to directly take the address of a device function in host code).
However, polymorphism can work in device code, including thrust device functions.
The following example demonstrates this idea, restricting ourselves to objects created on the device although we can certainly initialize them with host data. I have created two classes, Triangle and Rectangle, derived from a base class Polygon which includes a virtual function area. Triangle and Rectangle inherit the function set_values from the base class but replace the virtual area function.
We can then manipulate objects of those classes polymorphically as demonstrated here:
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/sequence.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/copy.h>
#define N 4
class Polygon {
protected:
int width, height;
public:
__host__ __device__ void set_values (int a, int b)
{ width=a; height=b; }
__host__ __device__ virtual int area ()
{ return 0; }
};
class Rectangle: public Polygon {
public:
__host__ __device__ int area ()
{ return width * height; }
};
class Triangle: public Polygon {
public:
__host__ __device__ int area ()
{ return (width * height / 2); }
};
struct init_f {
template <typename Tuple>
__host__ __device__ void operator()(const Tuple &arg) {
(thrust::get<0>(arg)).set_values(thrust::get<1>(arg), thrust::get<2>(arg));}
};
struct setup_f {
template <typename Tuple>
__host__ __device__ void operator()(const Tuple &arg) {
if (thrust::get<0>(arg) == 0)
thrust::get<1>(arg) = &(thrust::get<2>(arg));
else
thrust::get<1>(arg) = &(thrust::get<3>(arg));}
};
struct area_f {
template <typename Tuple>
__host__ __device__ void operator()(const Tuple &arg) {
thrust::get<1>(arg) = (thrust::get<0>(arg))->area();}
};
int main () {
thrust::device_vector<int> widths(N);
thrust::device_vector<int> heights(N);
thrust::sequence( widths.begin(), widths.end(), 2);
thrust::sequence(heights.begin(), heights.end(), 3);
thrust::device_vector<Rectangle> rects(N);
thrust::device_vector<Triangle> trgls(N);
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(rects.begin(), widths.begin(), heights.begin())), thrust::make_zip_iterator(thrust::make_tuple(rects.end(), widths.end(), heights.end())), init_f());
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(trgls.begin(), widths.begin(), heights.begin())), thrust::make_zip_iterator(thrust::make_tuple(trgls.end(), widths.end(), heights.end())), init_f());
thrust::device_vector<Polygon *> polys(N);
thrust::device_vector<int> selector(N);
for (int i = 0; i<N; i++) selector[i] = i%2;
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(selector.begin(), polys.begin(), rects.begin(), trgls.begin())), thrust::make_zip_iterator(thrust::make_tuple(selector.end(), polys.end(), rects.end(), trgls.end())), setup_f());
thrust::device_vector<int> areas(N);
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(polys.begin(), areas.begin())), thrust::make_zip_iterator(thrust::make_tuple(polys.end(), areas.end())), area_f());
thrust::copy(areas.begin(), areas.end(), std::ostream_iterator<int>(std::cout, "\n"));
return 0;
}
I suggest compiling the above code for a cc2.0 or newer architecture. I tested with CUDA 6 on RHEL 5.5.
(The polymorphic example idea, and some of the code, was taken from here.)

Applying reduction operation using Thrust subject to a boolean condition

I want to use thrust::reduce to find the max value in an array A. However, A[i] should only be chosen as the max if it also satisfies a particular boolean condition in another array B; for example, B[i] should be true. Is there a version of thrust::reduce that does this? I looked at the documentation and found only the following API:
thrust::reduce(begin,end, default value, operator)
However, I was curious whether there is a version more suitable to my problem.
EDIT: Compilation fails in last line!
typedef thrust::device_ptr<int> IntIterator;
typedef thrust::device_ptr<float> FloatIterator;
typedef thrust::tuple<IntIterator,FloatIterator> IteratorTuple;
typedef thrust::zip_iterator<IteratorTuple> myZipIterator;
thrust::device_ptr<int> deviceNBMInt(gpuNBMInt);
thrust::device_ptr<int> deviceIsActive(gpuIsActive);
thrust::device_ptr<float> deviceNBMSim(gpuNBMSim);
myZipIterator iter_begin = thrust::make_zip_iterator(thrust::make_tuple(deviceIsActive,deviceNBMSim));
myZipIterator iter_end = thrust::make_zip_iterator(thrust::make_tuple(deviceIsActive + numRow,deviceNBMSim + numRow));
myZipIterator result = thrust::max_element(iter_begin, iter_end, Predicate());
Yes, there is. I guess you should take a look at Extrema And Zip iterator
Something like this should do the trick (not sure if this code works out of the box):
typedef thrust::device_ptr<bool> BoolIterator;
typedef thrust::device_ptr<float> ValueIterator;
BoolIterator bools_begin, bools_end;
ValueIterator values_begin, values_end;
// initialize these pointers
// ...
typedef thrust::tuple<BoolIterator, ValueIterator> IteratorTuple;
typedef thrust::tuple<bool, float> DereferencedIteratorTuple;
typedef thrust::zip_iterator<IteratorTuple> ZipIterator;
ZipIterator iter_begin(thrust::make_tuple(bools_begin, values_begin));
ZipIterator iter_end(thrust::make_tuple(bools_end, values_end));
struct Predicate
{
__host__ __device__ bool operator ()
(const DereferencedIteratorTuple& lhs,
const DereferencedIteratorTuple& rhs)
{
using thrust::get;
// Both entries active: compare the values. Otherwise an inactive lhs ranks lower.
if (get<0>(lhs) && get<0>(rhs)) return get<1>(lhs) < get<1>(rhs);
else return !get<0>(lhs);
}
};
ZipIterator result = thrust::max_element(iter_begin, iter_end, Predicate());
Or you may consider a similar technique using a zip iterator with thrust::reduce, or you can try inner_product. I am not sure which will be faster.
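The comparator idea can be tried on the host before moving it to the GPU. The sketch below is an assumption-laden illustration rather than Thrust code: it zips the flags and values into `std::pair`s and uses `std::max_element`, and `FlaggedValue`, `ActiveLess` and `indexOfMaxActive` are hypothetical names. It also assumes both input vectors have the same length. The comparator is written so that two inactive entries compare as equivalent, which keeps it a valid strict weak ordering.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

typedef std::pair<bool, float> FlaggedValue;  // (is active, value)

// "Less than" where every inactive entry ranks below every active one.
struct ActiveLess {
    bool operator()(const FlaggedValue &lhs, const FlaggedValue &rhs) const {
        if (lhs.first != rhs.first) return rhs.first;  // inactive < active
        if (!lhs.first) return false;                  // two inactive entries are equivalent
        return lhs.second < rhs.second;                // both active: compare values
    }
};

// Returns the index of the largest active value, or -1 if no entry is active.
inline int indexOfMaxActive(const std::vector<bool> &active,
                            const std::vector<float> &values) {
    std::vector<FlaggedValue> zipped;
    for (std::size_t i = 0; i < values.size(); ++i)
        zipped.push_back(FlaggedValue(active[i], values[i]));
    if (zipped.empty()) return -1;
    auto best = std::max_element(zipped.begin(), zipped.end(), ActiveLess());
    if (!best->first) return -1;  // the "maximum" is inactive: nothing qualified
    return static_cast<int>(std::distance(zipped.begin(), best));
}
```

The final check matters on the GPU as well: if no element is active, a max-element search still returns some element, so the caller has to inspect its flag.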