Static Assert for NVCC and Compiler Bug - cuda

What is the best way to write a static assert for the NVCC compiler inside a struct that is used for compile-time settings?
The following mostly works, but sometimes NVCC produces nonsensical error messages and refuses to compile code that should compile:
template<int A, int B>
struct Settings{
static const int a = A;
static const int b = B;
STATIC_ASSERT(a == 15);
}
typedef Settings<15,5> set1; // Comment this out and it works....
template<int A, int B>
struct Settings2{
static const int a = A;
static const int b = B;
STATIC_ASSERT(a % b == 0);
}
typedef Settings<10,5> set2;
The static assert does not behave as expected. This looks like a CUDA compiler bug: the compile fails on STATIC_ASSERT(a == 15); even though the code above is correct and should compile. If I comment out the typedef marked above, it suddenly works.
I use the STATIC_ASSERT from Thrust, which is basically taken from Boost:
#define JOIN( X, Y ) DO_JOIN( X, Y )
#define DO_JOIN( X, Y ) DO_JOIN2(X,Y)
#define DO_JOIN2( X, Y ) X##Y
namespace staticassert {
// HP aCC cannot deal with missing names for template value parameters
template <bool x> struct STATIC_ASSERTION_FAILURE;
template <> struct STATIC_ASSERTION_FAILURE<true> { enum { value = 1 }; };
// HP aCC cannot deal with missing names for template value parameters
template<int x> struct static_assert_test{};
};
// XXX nvcc 2.3 can't handle STATIC_ASSERT
#if defined(__CUDACC__) && (CUDA_VERSION > 100)
#error your version number of cuda is not 2 digits!
#endif
#if defined(__CUDACC__) /* && (CUDA_VERSION < 30)*/
#define STATIC_ASSERT( B ) typedef staticassert::static_assert_test<sizeof(staticassert::STATIC_ASSERTION_FAILURE< (bool)( (B) ) >) > JOIN(thrust_static_assert_typedef_, __LINE__)
#define STATIC_ASSERT2(B,COMMENT) STATIC_ASSERT(B)
#else
#define STATIC_ASSERT2(B,COMMENT) \
typedef staticassert::static_assert_test< \
sizeof(staticassert::STATIC_ASSERTION_FAILURE< (bool)( (B) ) >)>\
JOIN(thrust_static_assert_typedef_, JOIN(__LINE__, COMMENT ))
#define STATIC_ASSERT( B ) \
typedef staticassert::static_assert_test<sizeof(staticassert::STATIC_ASSERTION_FAILURE< (bool)( (B) ) >) > JOIN(thrust_static_assert_typedef_, __LINE__)
#endif // NVCC 2.3
Has anybody experienced the same problem?
Thanks for any comments!

After adding the missing semicolons after each struct definition, your code compiles with no warnings or errors for me. System details:
harrism$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2010 NVIDIA Corporation
Built on Thu_Nov_11_15:26:50_PST_2010
Cuda compilation tools, release 3.2, V0.2.1221
harrism$ g++ --version
i686-apple-darwin10-g++-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5666) (dot 3)
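Where a C++11-capable toolchain is available (an assumption; the nvcc 3.2 release shown above predates C++11), the built-in static_assert keyword can stand in for the macro. The following is a minimal sketch of the two settings structs from the question with the closing semicolons added. The alias names set1 and set2 come from the question, but pointing set2 at Settings2 is an assumption about the intent.

```cpp
// Sketch: compile-time settings checked with the C++11 static_assert keyword.
template <int A, int B>
struct Settings {
    static constexpr int a = A;
    static constexpr int b = B;
    static_assert(a == 15, "Settings: A must be 15");
};  // note the semicolon closing the struct definition

template <int A, int B>
struct Settings2 {
    static constexpr int a = A;
    static constexpr int b = B;
    static_assert(a % b == 0, "Settings2: A must be divisible by B");
};

typedef Settings<15, 5> set1;
typedef Settings2<10, 5> set2;  // assumption: the question meant Settings2 here

// Explicit instantiation forces the assertions to be evaluated;
// a typedef on its own does not instantiate the template.
template struct Settings<15, 5>;
template struct Settings2<10, 5>;
```

In the question, set2 names Settings<10,5> rather than Settings2<10,5>. If that type is ever instantiated, the a == 15 check would be expected to fire, which may be worth ruling out before suspecting the compiler.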

Related

Support for std::tuple in swig?

When calling a SWIG-generated function returning std::tuple, I get a SWIG object wrapping that std::tuple.
Is there a way to use typemaps or something else to extract the values? I have tried changing a small portion of the code to std::vector, and that works (using %include <std_vector.i> and templates). But I don't want to make too many changes in the C++ part.
Edit: here is a minimal reproducible example:
foo.h
#pragma once
#include <tuple>
class foo
{
private:
double secret1;
double secret2;
public:
foo();
~foo();
std::tuple<double, double> return_thing(void);
};
foo.cpp
#include "foo.h"
#include <tuple>
foo::foo()
{
secret1 = 1;
secret2 = 2;
}
foo::~foo()
{
}
std::tuple<double, double> foo::return_thing(void) {
return {secret1, secret2};
}
foo.i
%module foo
%{
#include"foo.h"
%}
%include "foo.h"
When compiled on my Linux machine using
-:$ swig -python -c++ -o foo_wrap.cpp foo.i
-:$ g++ -c foo.cpp foo_wrap.cpp '-I/usr/include/python3.8' '-fPIC' '-std=c++17' '-I/home/simon/Desktop/test_stack_overflow_tuple'
-:$ g++ -shared foo.o foo_wrap.o -o _foo.so
I can import it in python as shown:
test_module.ipynb
import foo as f
Foo = f.foo()
return_object = Foo.return_thing()
type(return_object)
print(return_object)
The output is
SwigPyObject
<Swig Object of type 'std::tuple< double,double > *' at 0x7fb5845d8420>
Hopefully this is more helpful. Thank you for responding.
To clarify, I want to be able to use the values in Python, something like this:
main.cpp
#include "foo.h"
#include <iostream>
//------------------------------------------------------------------------------'
using namespace std;
int main()
{
foo Foo = foo();
auto [s1, s2] = Foo.return_thing();
cout << s1 << " " << s2 << endl;
}
//------------------------------------------------------------------------------
Github repo if anybody is interested
https://github.com/simon-cmyk/test_stack_overflow_tuple
Our goal is to make something like the following SWIG interface work intuitively:
%module test
%include "std_tuple.i"
%std_tuple(TupleDD, double, double);
%inline %{
std::tuple<double, double> func() {
return std::make_tuple(0.0, 1.0);
}
%}
We want to use this within Python in the following way:
import test
r=test.func()
print(r)
print(dir(r))
r[1]=1234
for x in r:
print(x)
i.e. indexing and iteration should just work.
By re-using some of the preprocessor tricks I used to wrap std::function (which were themselves originally from another answer here on SO) we can define a neat macro that "just wraps" std::tuple for us. Although this answer is Python-specific, it should in practice be fairly simple to adapt for most other languages too. I'll post my std_tuple.i file first and then annotate/explain it afterwards:
// [1]
%{
#include <tuple>
#include <utility>
%}
// [2]
#define make_getter(pos, type) const type& get##pos() const { return std::get<pos>(*$self); }
#define make_setter(pos, type) void set##pos(const type& val) { std::get<pos>(*$self) = val; }
#define make_ctorargN(pos, type) , type v##pos
#define make_ctorarg(first, ...) const first& v0 FOR_EACH(make_ctorargN, __VA_ARGS__)
// [3]
#define FE_0(...)
#define FE_1(action,a1) action(0,a1)
#define FE_2(action,a1,a2) action(0,a1) action(1,a2)
#define FE_3(action,a1,a2,a3) action(0,a1) action(1,a2) action(2,a3)
#define FE_4(action,a1,a2,a3,a4) action(0,a1) action(1,a2) action(2,a3) action(3,a4)
#define FE_5(action,a1,a2,a3,a4,a5) action(0,a1) action(1,a2) action(2,a3) action(3,a4) action(4,a5)
#define GET_MACRO(_1,_2,_3,_4,_5,NAME,...) NAME
%define FOR_EACH(action,...)
GET_MACRO(__VA_ARGS__, FE_5, FE_4, FE_3, FE_2, FE_1, FE_0)(action,__VA_ARGS__)
%enddef
// [4]
%define %std_tuple(Name, ...)
%rename(Name) std::tuple<__VA_ARGS__>;
namespace std {
struct tuple<__VA_ARGS__> {
// [5]
tuple(make_ctorarg(__VA_ARGS__));
%extend {
// [6]
FOR_EACH(make_getter, __VA_ARGS__)
FOR_EACH(make_setter, __VA_ARGS__)
size_t __len__() const { return std::tuple_size<std::decay_t<decltype(*$self)>>{}; }
%pythoncode %{
# [7]
def __getitem__(self, n):
if n >= len(self): raise IndexError()
return getattr(self, 'get%d' % n)()
def __setitem__(self, n, val):
if n >= len(self): raise IndexError()
getattr(self, 'set%d' % n)(val)
%}
}
};
}
%enddef
[1] These are just the extra includes we need for our macro to work.
[2] These apply to each of the type arguments we supply to our %std_tuple macro invocation; we need to be careful with commas here to keep the syntax correct.
[3] This is the mechanics of our FOR_EACH macro, which invokes the given action once per argument in our variadic macro argument list.
[4] Finally the definition of %std_tuple can begin. Essentially this is manually doing the work of %template for each specialisation of std::tuple we care to name inside of the std namespace.
[5] We use our FOR_EACH macro magic to declare a constructor with an argument of the correct type for each element. The actual implementation here is the default one from the C++ library, which is exactly what we need/want though.
[6] We use our FOR_EACH macro twice to make member functions get0, get1, ..., getN of the correct type for each tuple element, and the correct number of them for the template argument size. Likewise for setN. Doing it this way allows the usual SWIG typemaps for double, or whatever types your tuple contains, to be applied automatically and correctly for each call to std::get<N>. These are really just an implementation detail, not intended to be part of the public interface, but exposing them makes no real odds.
[7] Finally we need an implementation of __getitem__ and a corresponding __setitem__. These simply look up and call the right getN/setN function on the class. We take care to raise IndexError instead of the default exception if an invalid index is used, as this will stop iteration correctly when we try to iterate over the tuple.
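To make the getter/setter annotations concrete, here is a plain C++ sketch of what the generated get0/set1/__len__ members boil down to for std::tuple<double, double>. The names get_at, set_at, tuple_len and demo are hypothetical and used only for illustration; they are not part of SWIG or of the interface file above.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>

// Hypothetical helper mirroring make_getter: read element Pos.
template <std::size_t Pos, typename Tuple>
const typename std::tuple_element<Pos, Tuple>::type& get_at(const Tuple& t) {
    return std::get<Pos>(t);
}

// Hypothetical helper mirroring make_setter: overwrite element Pos.
template <std::size_t Pos, typename Tuple>
void set_at(Tuple& t, const typename std::tuple_element<Pos, Tuple>::type& val) {
    std::get<Pos>(t) = val;
}

// Mirrors __len__: the element count is known at compile time.
template <typename Tuple>
constexpr std::size_t tuple_len(const Tuple&) {
    return std::tuple_size<typename std::decay<Tuple>::type>::value;
}

// Mirrors the Python session: build (0.0, 1.0), then assign r[1] = 1234.
inline std::tuple<double, double> demo() {
    std::tuple<double, double> r = std::make_tuple(0.0, 1.0);
    assert(tuple_len(r) == 2);
    set_at<1>(r, 1234.0);
    return r;
}
```

The element index is a template argument in C++, which is why the interface file generates one named member per position and lets the Python-side __getitem__ pick the right one at run time.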
This is then sufficient that we can run our target code and get the following output:
$ swig3.0 -python -c++ -Wall test.i && g++ -shared -o _test.so test_wrap.cxx -I/usr/include/python3.7 -m32 && python3.7 run.py
<test.TupleDD; proxy of <Swig Object of type 'std::tuple< double,double > *' at 0xf766a260> >
['__class__', '__del__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__swig_destroy__', '__swig_getmethods__', '__swig_setmethods__', '__weakref__', 'get0', 'get1', 'set0', 'set1', 'this']
0.0
1234.0
Generally this should work as you'd hope in most input/output situations in Python.
There are a few improvements we could look to make:
Implement __repr__
Implement slicing so that tuple[n:m] type indexing works
Handle unpacking like Python tuples.
Maybe do some more automatic conversions for compatible types?
Avoid calling __len__ for every get/setitem call, either by caching the value in the class itself, or postponing it until the method lookup fails?

Static __device__ variable and kernels in separate file

I want to statically declare a global variable with the __device__ qualifier. At the same time, I want to keep the functions intended for the GPU in a separate file.
However, if I do so, the variable's value is not transferred to the GPU: there are no errors at compile time or at run time, but the memcpy functions do nothing.
When I move the kernel function into the file with the host code, everything works.
I am sure it should be possible to split host and device functions into separate files in this case, but how? I have only seen examples where kernels and host code are in the same file.
I would also be very thankful if somebody explained why it behaves this way.
A sample code is listed below.
Thank you in advance.
Working directory:
$ ls
functionsGPU.cu functionsGPU.cuh staticGlobalMemory.cu
staticGlobalMemory.cu:
#include "functionsGPU.cuh"
#if VARIANT == 2
__global__ void checkGlobalVariable(){
printf("Old value (dev): %f\n", devData);
devData += 2.0f;
printf("New value (dev): %f\n", devData);
}
#endif
int main(int argc, char **argv){
int dev = 0;
float val = 3.2;
cudaSetDevice(dev);
printf("---------\nVARIANT %i\n---------\n", VARIANT);
printf("Old value (host): %f\n", val);
cudaMemcpyToSymbol(devData, &val, sizeof(float));
checkGlobalVariable <<<1, 1>>> ();
cudaMemcpyFromSymbol(&val, devData, sizeof(float));
printf("New value (host): %f\n", val);
cudaDeviceReset();
return 0;
}
functionsGPU.cuh:
#ifndef FUNCTIONSGPU_CUH
#define FUNCTIONSGPU_CUH
#include <cuda_runtime.h>
#include <stdio.h>
#define VARIANT 1
__device__ float devData;
#if VARIANT == 1
__global__ void checkGlobalVariable();
#endif
#endif
functionsGPU.cu:
#include "functionsGPU.cuh"
#if VARIANT == 1
__global__ void checkGlobalVariable(){
printf("Old value (dev): %f\n", devData);
devData += 2.0f;
printf("New value (dev): %f\n", devData);
}
#endif
This is compiled as
$ nvcc -arch=sm_61 staticGlobalMemory.cu functionsGPU.cu -o staticGlobalMemory
Output if the kernel and host code are in separate files (incorrect):
---------
VARIANT 1
---------
Old value (host): 3.200000
Old value (dev): 0.000000
New value (dev): 2.000000
New value (host): 3.200000
Output if the kernel and host code are in the same file (correct):
---------
VARIANT 2
---------
Old value (host): 3.200000
Old value (dev): 3.200000
New value (dev): 5.200000
New value (host): 5.200000
Your code structure, where device code in one compilation unit references device code or device entities in another compilation unit, will require CUDA relocatable device code compilation and linking.
In the case of __device__ variables such as what you have here:
1. Add -rdc=true to your nvcc compilation command line to enable this
2. Add extern in front of the definition of devData, in functionsGPU.cuh
3. Add __device__ float devData; to staticGlobalMemory.cu
In the case of linking to a __device__ function in a separate file, along with providing the prototype typically via a header file like you would with any function in C++, you also need to add -rdc=true to your nvcc compilation command line, to enable device code linking. Steps 2 and 3 above are not needed.
That should fix the issue. Step 1 provides the necessary cross-module linkage, and steps 2 and 3 will fix the duplicate definition problem you would have, since you are including the same variable via a header file in separate compilation units.
For a reference on how to set up device code compilation in Windows Visual Studio, see here.

Loading multiple modules in JCuda is not working

In JCuda one can load CUDA files in PTX or CUBIN format and call (launch) __global__ functions (kernels) from Java.
With that in mind, I want to develop a framework with JCuda that takes a user's __device__ function in a .cu file at run time, then loads and runs it.
I have already implemented a __global__ function in which each thread finds the start point of its related data, performs some computation and initialization, and then calls the user's __device__ function.
Here is my kernel pseudocode:
extern "C" __device__ void userFunc(args);
extern "C" __global__ void kernel(){
// initialize
userFunc(args);
// rest of the kernel
}
And user's __device__ function:
extern "C" __device__ void userFunc(args){
// do something
}
On the Java side, here is the part where I load the modules (the modules are made from PTX files, which are successfully created from the CUDA files with this command: nvcc -m64 -ptx path/to/cudaFile -o cudaFile.ptx):
CUmodule kernelModule = new CUmodule(); // 1
CUmodule userFuncModule = new CUmodule(); // 2
cuModuleLoad(kernelModule, ptxKernelFileName); // 3
cuModuleLoad(userFuncModule, ptxUserFuncFileName); // 4
When I try to run it, I get an error at line 3: CUDA_ERROR_NO_BINARY_FOR_GPU. After some searching I gathered that my PTX file has some kind of syntax error. After running this suggested command:
ptxas -arch=sm_30 kernel.ptx
I got:
ptxas fatal : Unresolved extern function 'userFunc'
Even when I swap lines 3 and 4 to load userFunc before the kernel, I get this error. I am stuck at this phase. Is this the correct way to load multiple modules that need to be linked together in JCuda? Or is it even possible?
Edit:
Second part of the question is here
The really short answer is: No, you can't load multiple modules into a context in the runtime API.
You can do what you want, but it requires explicit setup and execution of a JIT linking call. I have no idea how (or even whether) that has been implemented in JCUDA, but I can show you how to do it with the standard driver API. Hold on...
If you have a device function in one file, and a kernel in another, for example:
// test_function.cu
#include <math.h>
__device__ float mathop(float &x, float &y, float &z)
{
float res = sin(x) + cos(y) + sqrt(z);
return res;
}
and
// test_kernel.cu
extern __device__ float mathop(float & x, float & y, float & z);
__global__ void kernel(float *xvals, float * yvals, float * zvals, float *res)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
res[tid] = mathop(xvals[tid], yvals[tid], zvals[tid]);
}
You can compile them to PTX as usual:
$ nvcc -arch=sm_30 -ptx test_function.cu
$ nvcc -arch=sm_30 -ptx test_kernel.cu
$ head -14 test_kernel.ptx
//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-19324607
// Cuda compilation tools, release 7.0, V7.0.27
// Based on LLVM 3.4svn
//
.version 4.2
.target sm_30
.address_size 64
// .globl _Z6kernelPfS_S_S_
.extern .func (.param .b32 func_retval0) _Z6mathopRfS_S_
At runtime, your code must create a JIT link session, add each PTX to the linker session, then finalise the linker session. This will give you a handle to a compiled cubin image which can be loaded as a module as usual. The simplest possible driver API code to put this together looks like this:
#include <cstdio>
#include <cstdlib> // for exit()
#include <cuda.h>
#define drvErrChk(ans) { drvAssert(ans, __FILE__, __LINE__); }
inline void drvAssert(CUresult code, const char *file, int line, bool abort=true)
{
if (code != CUDA_SUCCESS) {
fprintf(stderr, "Driver API Error %04d at %s %d\n", int(code), file, line);
exit(-1);
}
}
int main()
{
cuInit(0);
CUdevice device;
drvErrChk( cuDeviceGet(&device, 0) );
CUcontext context;
drvErrChk( cuCtxCreate(&context, 0, device) );
CUlinkState state;
drvErrChk( cuLinkCreate(0, 0, 0, &state) );
drvErrChk( cuLinkAddFile(state, CU_JIT_INPUT_PTX, "test_function.ptx", 0, 0, 0) );
drvErrChk( cuLinkAddFile(state, CU_JIT_INPUT_PTX, "test_kernel.ptx" , 0, 0, 0) );
size_t sz;
char * image;
drvErrChk( cuLinkComplete(state, (void **)&image, &sz) );
CUmodule module;
drvErrChk( cuModuleLoadData(&module, image) );
drvErrChk( cuLinkDestroy(state) );
CUfunction function;
drvErrChk( cuModuleGetFunction(&function, module, "_Z6kernelPfS_S_S_") );
return 0;
}
You should be able to compile and run this as posted and verify it works OK. It should serve as a template for a JCUDA implementation, if they have JIT linking support implemented.

CUDA invalid device symbol error

The code below compiles just fine. But when I try to run it, I get
GPUassert: invalid device symbol file.cu 114
When I comment out the lines marked by (!!!), the error does not show up. My question is what is causing this error, because it makes no sense to me.
I compile with nvcc file.cu -arch compute_11
#include "stdio.h"
#include <algorithm>
#include <ctime>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
#define THREADS 64
#define BLOCKS 256
#define _dif (((1ll<<32)-121)/(THREADS*BLOCKS)+1)
#define HASH_SIZE 1024
#define ROUNDS 16
#define HASH_ROW (HASH_SIZE/ROUNDS)+(HASH_SIZE%ROUNDS==0?0:1)
#define HASH_COL 1000000000/HASH_SIZE
typedef unsigned long long ull;
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
//fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
printf("GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__device__ unsigned int primes[1024];
//__device__ unsigned char primes[(1<<28)+1];
__device__ long long n = 1ll<<32;
__device__ ull dev_base;
__device__ unsigned int dev_hash;
__device__ unsigned int dev_index;
time_t curtime;
__device__ int hashh(long long x) {
return (x>>1)%1024;
}
// compute (x^e)%n
__device__ ull mulmod(ull x,ull e,ull n) {
ull ans = 1;
while(e>0) {
if(e&1) ans = (ans*x)%n;
x = (x*x)%n;
e>>=1;
}
return ans;
}
// determine whether n is strong probable prime base a or not.
// n is ODD
__device__ int is_SPRP(ull a,ull n) {
int d=0;
ull t = n-1;
while(t%2==0) {
++d;
t>>=1;
}
ull x = mulmod(a,t,n);
if(x==1) return 1;
for(int i=0;i<d;++i) {
if(x==n-1) return 1;
x=(x*x)%n;
}
return 0;
}
__device__ int prime(long long x) {
//unsigned long long b = 2;
//return is_SPRP(b,(unsigned long long)x);
return is_SPRP((unsigned long long)primes[(((long long)0xAFF7B4*x)>>7)%1024],(unsigned long long)x);
}
__global__ void find(unsigned int *out,unsigned int *c) {
unsigned int buff[HASH_ROW][256];
int local_c[HASH_ROW];
for(int i=0;i<HASH_ROW;++i) local_c[i]=0;
long long b = 121+(threadIdx.x+blockIdx.x*blockDim.x)*_dif;
long long e = b+_dif;
if(b%2==0) ++b;
for(long long i=b;i<e && i<n;i+=2) {
if(i%3==0 || i%5==0 || i%7==0) continue;
int hash_num = hashh(i)-(dev_hash*(HASH_ROW));
if(0<=hash_num && hash_num<HASH_ROW) {
if(prime(i)) continue;
buff[hash_num][local_c[hash_num]++]=(unsigned int)i;
if(local_c[hash_num]==256) {
int start = atomicAdd(c+hash_num,local_c[hash_num]);
if(start+local_c[hash_num]>=HASH_COL) return;
unsigned int *out_offset = out+hash_num*(HASH_COL)*4;
for(int i=0;i<local_c[hash_num];++i) out_offset[i+start]=buff[hash_num][i]; //(!!!)
local_c[hash_num]=0;
}
}
}
for(int i=0;i<HASH_ROW;++i) {
int start = atomicAdd(c+i,local_c[i]);
if(start+local_c[i]>=HASH_COL) return;
unsigned int *out_offset = out+i*(HASH_COL)*4;
for(int j=0;j<local_c[i];++j) out_offset[j+start]=buff[i][j]; //(!!!)
}
}
int main(void) {
printf("HASH_ROW: %d\nHASH_COL: %d\nPRODUCT: %d\n",(int)HASH_ROW,(int)HASH_COL,(int)(HASH_ROW)*(HASH_COL));
ull *base_adr;
gpuErrchk(cudaGetSymbolAddress((void**)&base_adr,dev_base));
gpuErrchk(cudaMemset(base_adr,0,7));
gpuErrchk(cudaMemset(base_adr,0x02,1));
}
A rather unusual error.
The failure is occurring because:
1. By specifying a virtual architecture only (-arch compute_11) you defer the PTX compile step until runtime (i.e. you are forcing JIT-compile)
2. The JIT-compile is failing (at runtime)
3. The failure of the JIT-compile (and link) means device symbols cannot be properly established
4. Due to the problem with device symbols, the operation cudaGetSymbolAddress on the device symbol dev_base fails, and throws an error.
Why is the JIT-compile failing? You can find out yourself by triggering the machine code compile (which runs the ptxas assembler) by specifying -arch=sm_11 instead of -arch compute_11. If you do that, you'll get this result:
ptxas error : Entry function '_Z4findPjS_' uses too much local data (0x10100 bytes, 0x4000 max)
So even though your code doesn't call the find kernel, it must compile successfully to have a sane device environment for symbols.
Why does this compile error occur? Because you are requesting too much local memory per thread. cc 1.x devices are limited to 16KB local memory per thread, and your find kernel is requesting quite a bit more than that (over 64KB).
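The figure in the ptxas message can be reproduced by hand from the kernel's two per-thread arrays. The sketch below redoes that arithmetic on the host. It assumes a 4-byte unsigned int and int, as on the usual CUDA targets, and the k-prefixed constant names are made up for this illustration.

```cpp
#include <cstddef>

// Values taken from the macros in the question.
constexpr std::size_t kHashSize = 1024;
constexpr std::size_t kRounds   = 16;
// HASH_ROW: HASH_SIZE / ROUNDS, rounded up.
constexpr std::size_t kHashRow =
    kHashSize / kRounds + (kHashSize % kRounds == 0 ? 0 : 1);

// Assumption: unsigned int and int are 4 bytes each.
constexpr std::size_t kBuffBytes   = kHashRow * 256 * 4;  // unsigned int buff[HASH_ROW][256]
constexpr std::size_t kLocalCBytes = kHashRow * 4;        // int local_c[HASH_ROW]
constexpr std::size_t kTotalBytes  = kBuffBytes + kLocalCBytes;

static_assert(kHashRow == 64, "HASH_ROW evaluates to 64");
static_assert(kTotalBytes == 0x10100, "matches the size reported by ptxas");
static_assert(kTotalBytes > 0x4000, "exceeds the 16 KB limit quoted for cc 1.x");
```

Almost all of the total comes from buff, so shrinking or relocating that array is the obvious place to start if the kernel has to fit on a cc 1.x device.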
When I initially tried it on my device, I was using a cc2.0 device which has a higher limit (512KB per thread) and so the JIT-compile step succeeded.
In general, I would recommend specifying both a virtual architecture and a machine architecture, and the shorthand way to do that is:
nvcc -arch=sm_11 ....
(for a cc1.1 device)
This question/answer may also be of interest, and the nvcc manual has more details about virtual vs. machine architecture, and how to specify the compilation phases for each.
I believe the reason the error goes away when you comment out those particular lines in the kernel, is that with those commented out, the compiler is able to optimize-out the accesses to those local memory areas, and optimize-out the instantiation of the local memory. This allows the JIT-compile step to complete successfully, and your code runs "without runtime error".
You can verify this by commenting those lines out and then specifying a full compile (nvcc -arch=sm_11 ...), where -arch is short for --gpu-architecture.
This error usually means the kernel has been compiled for the wrong architecture. You need to find out what the compute capability of your GPU is, and then compile it for that architecture. E.g. if your GPU has compute capability 1.1, compile it with -arch=sm_11. You can also build an executable for more than one architecture.

Error while defining the predicate for thrust Min_element, using zip_iterators for device_ptr

In this simple example I try to find the minimum value among the elements that have not yet been visited.
float *cost=NULL;
cudaMalloc( (void **) &cost, 5 * sizeof(float) );
bool *visited=NULL;
cudaMalloc( (void **) &visited, 5 * sizeof(bool) );
thrust::device_ptr< float > dp_cost( cost );
thrust::device_ptr< bool > dp_visited( visited );
typedef thrust::device_ptr<bool> BoolIterator;
typedef thrust::device_ptr<float> ValueIterator;
BoolIterator bools_begin = dp_visited, bools_end = dp_visited +5;
ValueIterator values_begin = dp_cost, values_end = dp_cost +5;
typedef thrust::tuple<BoolIterator, ValueIterator> IteratorTuple;
typedef thrust::tuple<bool, float> DereferencedIteratorTuple;
typedef thrust::zip_iterator<IteratorTuple> NodePropIterator;
struct nodeProp_comp : public thrust::binary_function<DereferencedIteratorTuple, DereferencedIteratorTuple, bool>
{
__host__ __device__
bool operator()( const DereferencedIteratorTuple lhs, const DereferencedIteratorTuple rhs ) const
{
if( !( thrust::get<0>( lhs ) ) && !( thrust::get<0>( rhs ) ) )
{
return ( thrust::get<1>( lhs ) < thrust::get<1>( rhs ) );
}
else
{
return !( thrust::get<0>( lhs ) );
}
}
};
NodePropIterator iter_begin (thrust::make_tuple(bools_begin, values_begin));
NodePropIterator iter_end (thrust::make_tuple(bools_end, values_end));
NodePropIterator min_el_pos = thrust::min_element( iter_begin, iter_end, nodeProp_comp() );
DereferencedIteratorTuple tmp = *min_el_pos;
But on compilation I get this error:
thrust_min.cu(99): error: no instance of overloaded function "thrust::min_element" matches the argument list
argument types are: (NodePropIterator, NodePropIterator, nodeProp_comp)
1 error detected in the compilation of "/tmp/tmpxft_00005c8e_00000000-6_thrust_min.cpp1.ii".
I compile using :
nvcc -gencode arch=compute_30,code=sm_30 -G -g thrust_min.cu -Xcompiler -rdynamic,-Wall,-Wextra -lineinfo -o thrust_min
I am using gcc version 4.6.3 20120306 (Red Hat 4.6.3-2) (GCC), CUDA 5.
I get no error if I omit the predicate in the call to min_element, which then uses the default 'less' functor, I guess.
Please help.
I asked around about this, and it seems that, in C++03, a local type (i.e., nodeProp_comp) can't be used as a template argument because it has no linkage. You may want to review this (non-thrust related) SO question/answer for additional discussion.
Thrust, being a template library, depends on this. So I think the recommendation is to put your functors that are used in thrust operations at global scope.
If you think there are other issues at play, you may want to post a new question with examples. However for the code you've posted in this question, I believe this is the reason, and I've demonstrated that reordering the code fixes the issue. Note the struct definition is really what is at issue here, not the typedefs.
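The effect of moving the functor to namespace scope can be tried without a GPU. The sketch below is a host-only analogue that substitutes std::pair and std::min_element for thrust::tuple and thrust::min_element (an assumption made so that it builds with a plain C++ compiler). The comparison logic follows the question's nodeProp_comp, and cheapest_unvisited is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

typedef std::pair<bool, float> NodeProp;  // (visited, cost)

// Defined at namespace scope, so it has linkage and is a valid template
// argument in C++03 as well as in later standards.
struct nodeProp_comp {
    bool operator()(const NodeProp& lhs, const NodeProp& rhs) const {
        if (!lhs.first && !rhs.first) {
            return lhs.second < rhs.second;  // both unvisited: cheaper one wins
        }
        return !lhs.first;  // otherwise an unvisited node orders first
    }
};

// Hypothetical helper: index of the cheapest unvisited node
// (index 0 if every node has already been visited).
inline std::size_t cheapest_unvisited(const std::vector<NodeProp>& nodes) {
    assert(!nodes.empty());
    return static_cast<std::size_t>(
        std::min_element(nodes.begin(), nodes.end(), nodeProp_comp()) -
        nodes.begin());
}
```

With the Thrust version the same arrangement applies: the struct sits outside the function body and only the call to min_element stays inside it.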