I'm trying to execute some sample code from the Thrust Quick Start Guide; it's pasted below. What is killing me is that when I run it, I get an exception, "R6010: abort() has been called", whenever I hit the find_if.
I've tried this with both the 4.1 and 4.2 runtimes. I'm building in Visual Studio 2010 Ultimate with the latest Nsight release candidate (downloaded May 4th, 2012). My graphics card is an NVIDIA NVS 3100M.
I can run the vector-addition sample generated in a new VS project (one that doesn't use Thrust) and it works fine. Adding Thrust, however, gives me this weirdness.
Any suggestions are appreciated.
mj
// greater_than_four and greater_than_ten are unary predicate functors (definitions omitted)
thrust::device_vector<int> input(4);
input[0] = 0;
input[1] = 5;
input[2] = 3;
input[3] = 7;

thrust::device_vector<int>::iterator iter;
iter = thrust::find_if(input.begin(), input.end(), greater_than_four());
iter = thrust::find_if(input.begin(), input.end(), greater_than_ten());
EDIT 1:
Another tidbit of information: digging deeper, I'm seeing that an error is caught during cudaThreadSynchronize(). The message is "launch_closure_by_value".
I figured it out: the __host__ and __device__ qualifiers were missing.
struct greater_than_four
{
    __host__ __device__
    bool operator()(int x)
    {
        return x > 4;
    }
};
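For reference, here is a complete, compilable version of the example with the corrected predicate; a minimal sketch, in which greater_than_ten is assumed to be defined analogously to greater_than_four:

#include <thrust/device_vector.h>
#include <thrust/find.h>
#include <cstdio>

struct greater_than_four
{
    __host__ __device__
    bool operator()(int x) { return x > 4; }
};

struct greater_than_ten
{
    __host__ __device__
    bool operator()(int x) { return x > 10; }
};

int main()
{
    thrust::device_vector<int> input(4);
    input[0] = 0;
    input[1] = 5;
    input[2] = 3;
    input[3] = 7;

    // Finds the 5 at index 1.
    thrust::device_vector<int>::iterator iter =
        thrust::find_if(input.begin(), input.end(), greater_than_four());
    printf("first element > 4 at index %d\n", (int)(iter - input.begin()));

    // No element is greater than ten, so iter == input.end().
    iter = thrust::find_if(input.begin(), input.end(), greater_than_ten());
    printf("element > 10 found: %s\n", iter == input.end() ? "no" : "yes");
    return 0;
}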
Related
__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    auto lamb = [](int x) { return x + 1; }; // Works.
    auto t = std::make_tuple(1, 2, 3);       // Does not work.
    c[i] = a[i] + b[i];
}
NVCC supports lambdas at least, but std::make_tuple fails to compile. Are tuples not allowed in the current version of CUDA?
I've just tried this out, and tuple metaprogramming with std:: (std::tuple, std::get, etc.) works in device code with C++14 and --expt-relaxed-constexpr (available since CUDA 8) enabled during compilation, e.g. nvcc -std=c++14 xxxx.cu -o yyyyy --expt-relaxed-constexpr. CUDA 9 is required for C++14, but basic std::tuple should work in CUDA 8 if you are limited to that. thrust::tuple works but has some drawbacks: it is limited to 10 elements and lacks some of the std::tuple helper functions (e.g. std::tuple_cat). Because tuples and their related functions are compile-time constructs, --expt-relaxed-constexpr should make your std::tuple "just work".
#include <cstdio>
#include <tuple>

__global__ void kernel()
{
    auto t = std::make_tuple(1, 2, 3);
    printf("%d\n", std::get<0>(t));
}

int main()
{
    kernel<<<1,1>>>();
    cudaDeviceSynchronize();
}
#include <thrust/tuple.h>

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    auto lamb = [](int x) { return x + 1; }; // Works.
    auto t = thrust::make_tuple(1, 2, 3);    // Compiles, unlike std::make_tuple.
    c[i] = a[i] + b[i];
}
It seems I needed to use the Thrust versions instead to make this work; the above does compile.
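For completeness, elements of a thrust::tuple are read with thrust::get, mirroring std::get; a minimal sketch, assuming a device of compute capability 2.0+ for device-side printf:

#include <cuda_runtime.h>
#include <cstdio>
#include <thrust/tuple.h>

__global__ void kernel()
{
    auto t = thrust::make_tuple(1, 2, 3);
    // thrust::get is the device-capable counterpart of std::get
    printf("%d %d %d\n", thrust::get<0>(t), thrust::get<1>(t), thrust::get<2>(t));
}

int main()
{
    kernel<<<1,1>>>();
    cudaDeviceSynchronize();
}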
Support for the standard C++ library on the device side is problematic in CUDA, as the standard library does not have the necessary __host__ or __device__ annotations.
That said, both clang and nvcc have partial support for some functionality. Usually it's limited to constexpr functions, which are considered __host__ __device__ if you pass --expt-relaxed-constexpr to nvcc (this is the default in clang). Clang also has somewhat more support for standard math functions. Neither supports anything that relies on the C++ runtime (except memory allocation, printf, and assert), as the runtime does not exist on the device side.
So, in short -- most of the standard C++ library is unusable on device side in CUDA, though things do slowly improve as more and more functions in the standard library become constexpr.
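As a concrete example of that constexpr carve-out: std::max is constexpr as of C++14, so with --expt-relaxed-constexpr it becomes callable from device code. A minimal sketch, assuming a C++14-capable nvcc (CUDA 9+):

// compile with: nvcc -std=c++14 --expt-relaxed-constexpr example.cu
#include <algorithm>
#include <cstdio>

__global__ void kernel(int a, int b)
{
    // std::max(const T&, const T&) is constexpr in C++14, so
    // --expt-relaxed-constexpr lets nvcc call it from device code.
    printf("max = %d\n", std::max(a, b));
}

int main()
{
    kernel<<<1,1>>>(3, 7);
    cudaDeviceSynchronize();
}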
Indeed, CUDA itself does not offer a device-side-capable version of std::tuple. However, I have a full tuple implementation as part of my cuda-kat library (still very much under initial development at the time of writing). Thrust's tuple class is limited in the following ways:
Limited to 10 tuple elements.
Recursively expands templated types for every tuple element.
No/partial support for rvalues (e.g. in get()).
The tuple implementation in cuda-kat is an adaptation of the EASTL tuple, which is in turn an adaptation of the LLVM project's libc++ tuple. Unlike EASTL's, however, it is C++11-compatible, so you don't have to have the absolute latest CUDA version. If you need just the tuple class, it can be extracted from the library with, I think, about 4 files.
I tried the following code with CUDA 7.0.
If I set n_repeat to 1 and remove the final cudaDeviceReset, the code runs fine.
If I set n_repeat to 1 and keep the cudaDeviceReset, the code runs, but my memory-leak detector reports a leak afterwards.
If I set n_repeat to 2 and keep the cudaDeviceReset, I get an error the second time I reach cublasCreate: the error code is CUBLAS_STATUS_NOT_INITIALIZED.
Can someone tell me what the problem is here, and whether cudaDeviceReset is intended for cleaning up between different runs of using the GPU, like what I'm trying to do here?
int device_id_ = 0;
cublasHandle_t blas_;
curandGenerator_t rand_gen_;
long alloc_size = 1000;
char* raw_;
int n_repeat = 2;

for (int i = 0; i < n_repeat; ++i) {
    CHECK_CUDA(cudaSetDevice(device_id_));
    CHECK_CUDA(cublasCreate(&blas_));
    CHECK_CUDA(curandCreateGenerator(&rand_gen_, CURAND_RNG_PSEUDO_DEFAULT));
    CHECK_CUDA(cudaMalloc((void **)&raw_, alloc_size));
    CHECK_CUDA(curandDestroyGenerator(rand_gen_));
    CHECK_CUDA(cublasDestroy(blas_));
    CHECK_CUDA(cudaFree(raw_));
    CHECK_CUDA(cudaDeviceReset());
}
I had the same problem, even with the example from Robert Crovella (CUDA 7, Ubuntu 14.04, K40c). Adding cudaDeviceSynchronize() after cudaSetDevice and before cublasCreate() made it work for me.
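Putting that together, a sketch of the loop from the question with the extra synchronization added (this reuses the declarations and the CHECK_CUDA macro from the question, which, strictly speaking, is also being applied to cuBLAS/cuRAND status codes rather than only cudaError_t):

for (int i = 0; i < n_repeat; ++i) {
    CHECK_CUDA(cudaSetDevice(device_id_));
    cudaDeviceSynchronize(); // make sure the (re)created context is ready before cublasCreate
    CHECK_CUDA(cublasCreate(&blas_));
    CHECK_CUDA(curandCreateGenerator(&rand_gen_, CURAND_RNG_PSEUDO_DEFAULT));
    CHECK_CUDA(cudaMalloc((void **)&raw_, alloc_size));

    // ... actual work with blas_, rand_gen_, and raw_ goes here ...

    CHECK_CUDA(curandDestroyGenerator(rand_gen_));
    CHECK_CUDA(cublasDestroy(blas_));
    CHECK_CUDA(cudaFree(raw_));
    CHECK_CUDA(cudaDeviceReset());
}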
I want to write this program in CUDA.
In "main.cpp":
struct Center {
    double *Data;
    int dimension;
};
typedef struct Center Center;
// I allocate space for M Center elements with cudaMalloc, as follows
....
#include "kernel.cu"
....
Center *V_dev;
int M = 100, N = 4;
cudaStatus = cudaMalloc((void**)&V_dev, M*sizeof(Center));
Init<<<1,M>>>(V_dev, M, N); // I always know the dimension N before calling
My "kernel.cu" file is something like this
#include "cuda_runtime.h"
#include"device_launch_parameters.h"
... //other include headers to allow my .cu file to know the Center type definition
__global__ void Init(Center *V, int N, int dimension){
V[threadIdx.x].dimension = dimension;
V[threadIdx.x].Data = (double*)malloc(dimension*sizeof(double));
for(int i=0; i<dimension; i++)
V[threadIdx.x].Data[i] = 0; //For the value, it can be any kind of operation returning a float that i want to be able put here
}
I'm on Visual Studio 2008 with CUDA 5.0. When I build my project, I get this error:
error: calling a __host__ function("malloc") from a __global__ function("Init") is not allowed.
How can I make this work? (I know that malloc and other CPU memory-allocation routines are not allowed for device memory.)
malloc is allowed in device code but you have to be compiling for a cc2.0 or greater target GPU.
Adjust your VS project settings to remove any GPU device settings like compute_10,sm_10 and replace it with compute_20,sm_20 or higher to match your GPU. (And, to run that code, your GPU needs to be cc2.0 or higher.)
You need the compiler parameter -arch=sm_20 and a GPU which supports it.
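As a concrete illustration, here is a minimal sketch of the kernel from the question using in-kernel malloc, compiled with -arch=sm_20 or higher. The Cleanup kernel is my own addition: memory allocated with device-side malloc lives on the device heap and can only be freed from device code. Checking the malloc return value is also advisable, since the device heap defaults to only 8 MB:

#include <cstdio>

struct Center {
    double *Data;
    int dimension;
};

__global__ void Init(Center *V, int N, int dimension)
{
    V[threadIdx.x].dimension = dimension;
    V[threadIdx.x].Data = (double*)malloc(dimension * sizeof(double));
    if (V[threadIdx.x].Data == NULL) return; // device heap exhausted
    for (int i = 0; i < dimension; i++)
        V[threadIdx.x].Data[i] = 0;
}

__global__ void Cleanup(Center *V, int N)
{
    free(V[threadIdx.x].Data); // device-side malloc requires device-side free
}

int main()
{
    const int M = 100, N = 4;
    Center *V_dev;
    cudaMalloc((void**)&V_dev, M * sizeof(Center));
    Init<<<1,M>>>(V_dev, M, N);
    Cleanup<<<1,M>>>(V_dev, M);
    cudaDeviceSynchronize();
    cudaFree(V_dev);
    return 0;
}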
I want to disable a specific compiler warning with nvcc, specifically
warning: NULL reference is not allowed
The code I am working on uses NULL references as part of SFINAE, so they can't be avoided.
An ideal solution would be a #pragma in just the source file where we want to disable the warnings, but a compiler flag would also be fine, if one exists to turn off only the warning in question.
It is actually possible to disable specific warnings on the device with NVCC. It took me ages to figure out how to do it.
You need to use the -Xcudafe flag combined with a token listed on this page. For example, to disable the "controlling expression is constant" warning, pass the following to NVCC:
-Xcudafe "--diag_suppress=boolean_controlling_expr_is_constant"
For other warnings, see the above link.
Just to add to the previous answer about -Xcudafe (not enough reputation yet to leave a comment):
MAJOR EDIT:
cudafe is apparently NVIDIA's customized version of the C++ front end from the Edison Design Group. You can find the docs for it here: http://www.edg.com/docs/edg_cpp.pdf. I am referring to page numbers from the July 2019 v5.1 manual.
@einpoklum notes that just doing push/pop as I had originally stated in the initial post doesn't work, and specifically that #pragma push generates a warning that it is ignored. I could not replicate that warning, but indeed, in the test program below, the push/pop did not actually do anything under either CUDA 10.1 or CUDA 9.2 (note that the commented lines in return3 generate no warning).
However, on page 75 of this manual they go through how to do localized diagnostic severity control without push/pop:
The following example suppresses the “pointless friend declaration”
warning on the declaration of class A:
#pragma diag_suppress 522
class A { friend class A; };
#pragma diag_default 522
class B { friend class B; };
The #pragma diag_default returns the warning to its default state. Another example:
#pragma diag_suppress = code_is_unreachable
...
#pragma diag_default = code_is_unreachable
The equals sign is optional. Testing shows that this works and truly localizes the severity control. Further, testing reveals that adding diagnostic suppressions in this way adds to previous suppressions rather than replacing them. Also of note: in CUDA 10.1, unreachable code did not generate a warning, while it did in CUDA 9.2. Finally, page 77 of the manual mentions a new push/pop syntax:
#pragma push_macro("identifier")
#pragma pop_macro("identifier")
But I couldn't get it to work in the program below.
All of the above is tested in the program below, compiled with nvcc -std=c++14 test_warning_suppression.cu -o test_warning_suppression:
#include <cuda_runtime.h>

__host__ __device__ int return1(){
    int j = -1; // warning given for both CUDA 9.2 and 10.1
    return 1;
    if(false){ return 0; } // warning given here for CUDA 9.2
}

#pragma push
#pragma diag_suppress = code_is_unreachable
#pragma diag_suppress = declared_but_not_referenced
__host__ __device__ int return2(){
    int j = -1;
    return 2;
    if(false){ return 0; }
}
#pragma pop

__host__ __device__ int return3(){
    int j = -1; // no warning given here
    return 3;
    if(false){ return 0; } // no warning here even in CUDA 9.2
}

// push/pop did not localize warning suppression, so reset to default
#pragma diag_default = declared_but_not_referenced
#pragma diag_default = code_is_unreachable
// warning suppression localized to the lines above by diag_default!

__host__ __device__ int return4(){
    int j = -1; // warning given for both CUDA 9.2 and 10.1
    return 4;
    if(false){ return 0; } // warning given here for CUDA 9.2
}

/* below does not work as of CUDA 10.1
#pragma push_macro("identifier")
#pragma diag_suppress = code_is_unreachable
__device__ int return5(){
    return 5;
    if(false){ return 0; }
}
#pragma pop_macro("identifier")
__device__ int return6(){
    return 6;
    if(false){ return 0; }
} */

int main(){ return 0; }
To augment user2333829's answer: if you know the warning name you can disable it like this:
-Xcudafe "--diag_suppress=boolean_controlling_expr_is_constant"
If you don't know the name, get warning numbers by compiling with:
-Xcudafe --display_error_number
And then with:
-Xcudafe --diag_suppress=<warning_number>
(Note: both options at the same time apparently don't work.)
You can use the -w flag to suppress all warnings:
nvcc -w
I struggled to find a matching -Xcudafe token for my warning, so here is another way.
You can pass a compiler flag to CL.exe that will disable a specific warning. For example, to disable the warnings about unchecked iterators you can pass /wd4996.
warning C4996: 'std::_Copy_impl': Function call with parameters that may be
unsafe - this call relies on the caller to check that the passed values are
correct. To disable this warning, use -D_SCL_SECURE_NO_WARNINGS. See
documentation on how to use Visual C++ 'Checked Iterators'
The tricky part is that, by default, the arguments from the host compiler settings are not passed to nvcc, so you need to add the flag via the CUDA C/C++ dialog.
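If you prefer the command line, the equivalent is to forward the flag to the host compiler yourself with nvcc's -Xcompiler option:

nvcc -Xcompiler "/wd4996" kernel.cu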
I was using nvcc with the Ubuntu g++ compilers, in my case Open MPI's mpic++. For g++'s -Wunused-result warning, the corresponding suppression flag is -Wno-unused-result, so passing it to nvcc as -Xcompiler "-Wno-unused-result" worked for me.
References for gcc/clang compiler options:
https://docs.adacore.com/live/wave/gcc-12.x/html/gcc/gcc.html#Warning-Options
https://releases.llvm.org/12.0.0/tools/clang/docs/DiagnosticsReference.html#diagnostic-flags
I have recently started learning CUDA, and I've integrated it into MS Visual Studio 2010 with Nsight. I have also acquired the book "CUDA by Example", and I'm going through all the examples and compiling them. I have come across an error, however, which I do not understand.
The program comes from chapter 4 and it's the julia_gpu example. Original code:
#include "../common/book.h"
#include "../common/cpu_bitmap.h"
#define DIM 1000
struct cuComplex {
    float r;
    float i;
    cuComplex( float a, float b ) : r(a), i(b) {}
    __device__ float magnitude2( void ) {
        return r * r + i * i;
    }
    __device__ cuComplex operator*(const cuComplex& a) {
        return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
    }
    __device__ cuComplex operator+(const cuComplex& a) {
        return cuComplex(r+a.r, i+a.i);
    }
};

__device__ int julia( int x, int y ) {
    const float scale = 1.5;
    float jx = scale * (float)(DIM/2 - x)/(DIM/2);
    float jy = scale * (float)(DIM/2 - y)/(DIM/2);

    cuComplex c(-0.8, 0.156);
    cuComplex a(jx, jy);

    int i = 0;
    for (i=0; i<200; i++) {
        a = a * a + c;
        if (a.magnitude2() > 1000)
            return 0;
    }
    return 1;
}

__global__ void kernel( unsigned char *ptr ) {
    // map from blockIdx to pixel position
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;

    // now calculate the value at that position
    int juliaValue = julia( x, y );
    ptr[offset*4 + 0] = 255 * juliaValue;
    ptr[offset*4 + 1] = 0;
    ptr[offset*4 + 2] = 0;
    ptr[offset*4 + 3] = 255;
}

// globals needed by the update routine
struct DataBlock {
    unsigned char *dev_bitmap;
};

int main( void ) {
    DataBlock data;
    CPUBitmap bitmap( DIM, DIM, &data );
    unsigned char *dev_bitmap;

    HANDLE_ERROR( cudaMalloc( (void**)&dev_bitmap, bitmap.image_size() ) );
    data.dev_bitmap = dev_bitmap;

    dim3 grid(DIM,DIM);
    kernel<<<grid,1>>>( dev_bitmap );

    HANDLE_ERROR( cudaMemcpy( bitmap.get_ptr(), dev_bitmap,
                              bitmap.image_size(),
                              cudaMemcpyDeviceToHost ) );
    HANDLE_ERROR( cudaFree( dev_bitmap ) );

    bitmap.display_and_exit();
}
My Visual Studio, however, forces me to decorate the cuComplex constructor as __device__; otherwise it won't compile (it tells me I cannot use the constructor later in the julia function), which I guess is fair enough. So I have:
__device__ cuComplex( float a, float b ) : r(a), i(b) {}
But when I run the example (having added the includes necessary to build it in VS, namely cuda_runtime.h and device_launch_parameters.h, and copied glut32.dll into the same folder as the exe), it quickly fails, killing my device driver and reporting an unknown error at line 94, which is the cudaMemcpy call in main; to be exact, the line containing cudaMemcpyDeviceToHost. To be frank, I have tried setting breakpoints line after line, and the driver dies at the kernel call.
Could someone please tell me what might be wrong? I am a CUDA noob and have no real idea why a trivial example would kill itself like that. What could I be doing wrong? Frankly, I don't even know what to investigate.
I have the CUDA 4.1 toolkit, Nsight 2.1, and a GeForce GT 445M with compute capability 2.1, on driver version 295.
I haven't had time to test this yet, but I think it may be your graphics card "timing out" as far as Windows is concerned.
Since Vista, Windows has a default behaviour of telling the graphics driver to recover after 2 seconds. If your job takes longer, you get booted. You can increase or remove this timeout through the registry. I assume you need a reboot for this, because I just made the changes and it's not working yet.
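For reference, the timeout knobs described on the page linked below are REG_DWORD values under the following registry key (this comes from Microsoft's TDR documentation; double-check the details for your Windows version):

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers
TdrDelay - seconds before recovery kicks in (default 2)
TdrLevel - 0 disables timeout detection entirely (default 3)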
See this link for detail:
http://msdn.microsoft.com/en-us/windows/hardware/gg487368.aspx
...
Timeout Detection and Recovery : Windows Vista attempts to detect these
problematic hang situations and recover a responsive desktop
dynamically. In this process, the Windows Display Driver Model (WDDM)
driver is reinitialized and the GPU is reset. No reboot is necessary,
which greatly enhances the user experience. The only visible artifact
from the hang detection to the recovery is a screen flicker, which
results from resetting some portions of the graphics stack, causing a
screen redraw. Some older Microsoft DirectX applications may render to
a black screen at the end of this recovery. The end user would have to
restart these applications. The following is a brief overview of the
TDR process: ....
Clearly this is why it's such a weird bug: it gives that memcpy error at different scales for different people, depending on how fast their graphics card is.
This is a known issue in CUDA.
You can try changing this:
const float scale = 1.5;
to something larger, like 3.5, 4.5, or 5.5. A larger scale makes the sampled points start farther from the origin, so the 200-iteration loop bails out sooner and the kernel finishes before the watchdog timeout. For example:
const float scale = 5.5;
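Alternatively, if you want to keep scale = 1.5, one workaround is to split the single DIM x DIM launch into several shorter launches so each one finishes under the watchdog limit. A hypothetical sketch against the book's code (kernel_band and band are my own names, reusing julia, DIM, and dev_bitmap from the listing above):

// Render the image in horizontal bands, each as its own short kernel launch.
__global__ void kernel_band( unsigned char *ptr, int y0 ) {
    int x = blockIdx.x;
    int y = y0 + blockIdx.y;
    int offset = x + y * DIM;
    int juliaValue = julia( x, y );
    ptr[offset*4 + 0] = 255 * juliaValue;
    ptr[offset*4 + 1] = 0;
    ptr[offset*4 + 2] = 0;
    ptr[offset*4 + 3] = 255;
}

// In main(), instead of one kernel<<<grid,1>>> launch:
const int band = 100;        // rows per launch; tune so each launch stays well under 2 s
for (int y0 = 0; y0 < DIM; y0 += band) {
    dim3 grid( DIM, band );
    kernel_band<<<grid,1>>>( dev_bitmap, y0 );
    cudaDeviceSynchronize(); // let each band complete before launching the next
}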