CURAND_STATUS_DOUBLE_PRECISION_REQUIRED is undefined - cuda

While building my CUDA project I get the following error:
cutil_inline_runtime.h(328): error: identifier "CURAND_STATUS_DOUBLE_PRECISION_REQUIRED" is undefined
So I started googling. Since I couldn't find a solution (nor the actual cause of the problem), I downloaded NVIDIA's CURAND guide PDF and started reading. In that PDF it says:
Enumerator:
...
**CURAND_STATUS_DOUBLE_PRECISION_REQUIRED** GPU does not have double precision required by MRG32k3a
...
This means I can't perform operations with the double data type... right? Well, that seems wrong because, I assure you, a couple of days ago I made a project using the double data type and it worked just fine. Does anyone have a suggestion? Thank you!
EDIT Since I was unable to find the "real" solution to the problem, I eventually commented out the lines referencing CURAND_STATUS_DOUBLE_PRECISION_REQUIRED in cutil_inline_runtime.h, and now it works. I know it is ugly, but it was the only thing that helped...

I also had this problem. The issue was that I was using different versions of the CUDA SDK and the CUDA toolkit. I was using a newer version of the SDK, which referenced the definition of CURAND_STATUS_DOUBLE_PRECISION_REQUIRED, but that definition was missing from curand.h in the older version of the CUDA toolkit I had installed.
Try installing the version of the toolkit that matches your CUDA SDK version.

Try googling "compute capability"; it's how NVIDIA classifies the various CUDA feature levels. The CUDA C Programming Guide mentions a couple of times that devices with compute capability 1.2 or lower do not support double-precision FP math: see page 140, and also check table F-1, namely the row about double-precision floating-point support.
The GeForce 8800 GT is a G92-architecture chip of compute capability 1.1, so you cannot do double-precision computations on this card. Keep in mind that much example code makes use of emulation modes and the like, which can fool you into thinking it works.
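If you want to check this programmatically rather than by table lookup, here is a minimal query sketch using the standard runtime API (nothing here is project-specific):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // Double-precision hardware first appeared at compute capability 1.3.
        printf("Device %d: %s, compute capability %d.%d, double precision: %s\n",
               i, prop.name, prop.major, prop.minor,
               (prop.major > 1 || (prop.major == 1 && prop.minor >= 3)) ? "yes" : "no");
    }
    return 0;
}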

Related

CUDA program with slightly different results every run

I'm new to CUDA and OpenCL.
I have translated a program's kernels from CUDA to OpenCL. I'm using the same seeds for the random number generation in both versions.
While the OpenCL version gets the exact same results every run, the CUDA version gives slightly different results every run.
I'm compiling the CUDA version without -use_fast_math.
My device is compute capability 1.1.
Any idea about what could be the reason?
Thanks in advance
Devices of compute capability 1.1 do not support double operations, so if you are using double, it gets demoted to float. That could possibly affect your results, although a compute capability 1.1 device cannot support double in OpenCL either, AFAIK.
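You can see the demotion directly: on toolkits of that era, compiling a kernel that touches double for an sm_1x target makes nvcc emit a "demoting to float" warning and generate single-precision code. A hypothetical toy example:

// demote.cu -- compile with: nvcc -arch=sm_11 demote.cu
// nvcc warns that double is not supported on sm_1x and demotes it to float,
// so this kernel actually computes in single precision.
__global__ void scale(double *x, double s) {
    x[threadIdx.x] *= s;
}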
My question actually is: are there any CUDA compiler options that may affect the accuracy of the results?
Yes, there are a variety of options that affect CUDA's use of floating-point math.
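For reference, these are the main nvcc switches that change floating-point behavior; availability varies by toolkit version, and the precision-related ones only apply to sm_20 and later targets:

nvcc -use_fast_math foo.cu   # fast, less accurate intrinsics; implies the options below
nvcc -ftz=true foo.cu        # flush denormals to zero (sm_1x hardware always does this)
nvcc -prec-div=false foo.cu  # faster, non-IEEE division
nvcc -prec-sqrt=false foo.cu # faster, non-IEEE square root
nvcc --fmad=true foo.cu      # allow multiply-add contraction (the default)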
I don't know why any of this would lead to variation from one run to the next, however. It's likely that you have a bug in the code.
I found the problem. In the original code, some values were updated asynchronously and had not been completely updated yet when they were read. Thanks, everybody, for the help. And sorry for the trouble.

Reading mxArray in CUSP or in cuSPARSE

I am trying to read an mxArray from MATLAB into my custom-made .cu file.
I have two sparse matrices to operate on.
How do I read them into CUSP sparse matrices, say A and B (or into cuSPARSE matrices), so that I can perform operations on them and return them to MATLAB?
One idea I came up with is to write the mxArrays to an .mtx file and then read from that. But again, are there any alternatives?
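For what it's worth, you shouldn't need to round-trip through an .mtx file: inside a MEX function you can copy the mxArray's compressed-sparse-column arrays straight into a CUSP host matrix. A rough, untested sketch (note that MATLAB stores sparse matrices in CSC order, so reading the arrays as CSR gives you the transpose):

#include "mex.h"
#include <cusp/csr_matrix.h>

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
    const mxArray *A = prhs[0];   // real code should check nrhs and mxIsSparse(A) first
    mwIndex *jc = mxGetJc(A);     // column pointers, length n+1
    mwIndex *ir = mxGetIr(A);     // row indices of the nonzeros
    double  *pr = mxGetPr(A);     // nonzero values
    int m = (int)mxGetM(A), n = (int)mxGetN(A);
    int nnz = (int)jc[n];

    // MATLAB is CSC; the same three arrays read as CSR describe A transposed.
    cusp::csr_matrix<int, double, cusp::host_memory> At(n, m, nnz);
    for (int j = 0; j <= n; ++j) At.row_offsets[j] = (int)jc[j];
    for (int k = 0; k < nnz; ++k) {
        At.column_indices[k] = (int)ir[k];
        At.values[k] = pr[k];
    }

    // Copy to the device and operate, e.g.:
    // cusp::csr_matrix<int, double, cusp::device_memory> dAt = At;
}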
Further, I am trying to understand the various CUSP mechanisms using the examples posted on its website. But every time I try to compile and run the examples, I get the following error:
terminate called after throwing an instance of
'thrust::system::detail::bad_alloc'
what(): N6thrust6system6detail9bad_allocE: CUDA driver version is
insufficient for CUDA runtime version
Abort
Here is what is installed on the machine I am using:
CUDA v4.2
Thrust v1.6
Cusp v0.3
I am using a GTX 480 with Linux x86_64 on my machine.
Strangely enough, the device query code is also returning this output:
CUDA Device Query...
There are 0 CUDA devices.
Press any key to exit...
I updated my drivers and the SDK a few days ago.
Not sure what's wrong.
I know I am asking a lot in one question, but I have been facing this problem for quite a while, and upgrading and downgrading the drivers doesn't seem to solve it.
Cheers
This error is the most revealing: "CUDA driver version is insufficient for CUDA runtime version". You definitely need to update your driver.
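If you want to confirm the mismatch, the runtime API can report both versions:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVer = 0, runtimeVer = 0;
    cudaDriverGetVersion(&driverVer);    // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtimeVer);  // version of the runtime you linked against
    printf("driver supports: %d, runtime is: %d\n", driverVer, runtimeVer);
    // If the first number is smaller, you get "driver version is insufficient".
    return 0;
}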
I use CUSPARSE/CUSP through Jacket's Sparse Linear Algebra library. It's been good, but I wish there were more sparse features available in CUSPARSE/CUSP. I hear Jacket is going to get CULA Sparse into it soon, so that'll be nice.

CUDA SDK example simpleStreams in SDK 4.1 not working

I upgraded the CUDA GPU Computing SDK and CUDA Toolkit to 4.1. I was testing the simpleStreams program, but it consistently takes more time than non-streamed execution. My device has compute capability 2.1, and I'm using VS2008 on Windows.
This sample constantly has issues. If you tweak the sample to have equal duration for the kernel and the memory copy, the overlap will improve. Normally breadth-first submission is better for concurrency; however, on a WDDM OS this sample will usually have better overlap if you issue the memory copy right after the kernel launch.
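To make the issue-order point concrete, here is a rough sketch of the two submission patterns; the work kernel is a hypothetical stand-in for the sample's workload, and h_in is assumed to be pinned host memory (cudaHostAlloc):

#include <cuda_runtime.h>

// Hypothetical kernel standing in for the sample's workload.
__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Breadth-first: issue all copies first, then all kernels.
void breadthFirst(float *d_in, float *h_in, cudaStream_t *stream, int nStreams, int chunk) {
    size_t bytes = chunk * sizeof(float);
    for (int i = 0; i < nStreams; ++i)
        cudaMemcpyAsync(d_in + i * chunk, h_in + i * chunk, bytes,
                        cudaMemcpyHostToDevice, stream[i]);
    for (int i = 0; i < nStreams; ++i)
        work<<<(chunk + 255) / 256, 256, 0, stream[i]>>>(d_in + i * chunk, chunk);
}

// Depth-first: per stream, copy then kernel back-to-back.
// On a WDDM OS this ordering sometimes overlaps better, as noted above.
void depthFirst(float *d_in, float *h_in, cudaStream_t *stream, int nStreams, int chunk) {
    size_t bytes = chunk * sizeof(float);
    for (int i = 0; i < nStreams; ++i) {
        cudaMemcpyAsync(d_in + i * chunk, h_in + i * chunk, bytes,
                        cudaMemcpyHostToDevice, stream[i]);
        work<<<(chunk + 255) / 256, 256, 0, stream[i]>>>(d_in + i * chunk, chunk);
    }
}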
I noticed this as well. I thought it was just me but I didn't notice any improvement and tried searching the forums but didn't find anyone else with the issue.
I also ran the source code from the CUDA by Example book (which is really helpful, and I recommend you pick it up if you're serious about GPU programming).
Chapter 10 examples has the progression of examples showing how streams should be used.
http://developer.nvidia.com/content/cuda-example-introduction-general-purpose-gpu-programming-0
But comparing:
1. the non-streamed version (which is basically the single-stream version),
2. the streamed version (incorrectly queued async memcpy and kernel launch), and
3. the streamed version (correctly queued async memcpy and kernel launch),
I find no benefit in using CUDA streams. It might be a Win7 issue, as I found some sources online discussing that Windows Vista didn't support CUDA streams correctly.
Let me know what you find with the example I linked. My setup is: Win7 64-bit Pro, CUDA 4.1, dual GeForce GTX 460 cards, 8 GB RAM.
I'm pretty new to CUDA so I may not be able to help, but generally it's very hard to help without you posting any code. If posting is not possible, then I suggest you take a look at NVIDIA's Visual Profiler. It's cross-platform and can show you where your bottlenecks are.

Multiple GPUs in CUDA 3.2 and issues with Cuda 4.0

I am new to multiple GPUs. I have written code for a single GPU and want to speed it up further by using multiple GPUs. I am working with two GTX 470s, MS VS2008, and CUDA Toolkit 4.0.
I am facing two problems.
The first problem is that my code somehow doesn't run correctly with the 4.0 build rules, while it works fine with the 3.2 build rules. Also, the multiGPU SDK example doesn't build on VS2008, giving the error:
error C3861: 'cudaDeviceReset': identifier not found
My second problem is that if I have to work with 3.2, then according to the documentation, a separate host thread has to be launched for each GPU and separate allocations made, etc. What is the easiest library for launching threads for multiple GPUs, and could you please give an example, for my setup, of accessing multiple GPUs?
The answer to the first question is that you are clearly linking an older version of the CUDA runtime library. cudaDeviceReset is a new addition to the API introduced in CUDA 4.0, so double-check the build rules and make sure you really are pointing the linker at the CUDA 4.0 toolkit and not an earlier version.
The second part of your question sounds like a "hai plz give me teh code" question, and that isn't really what this place is for. I will, however, give you a link to GPUWorker (code currently available here), which is a boost threads based multigpu framework that was originally part of the HOOMD molecular dynamics package. It should give you some hints on how to do a multithreaded, multigpu code, even if GPUWorker turns out to not be directly applicable to your needs.
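For the flavor of what the CUDA 3.2 approach looks like, here is a minimal sketch of one host thread per GPU; it uses std::thread for brevity, but with VS2008 you would substitute boost::thread (as GPUWorker does) or Win32 threads:

#include <thread>
#include <vector>
#include <cuda_runtime.h>

void gpuWorker(int device) {
    cudaSetDevice(device);  // bind this host thread to one GPU (pre-4.0: one context per thread)
    float *d_buf = 0;
    cudaMalloc((void **)&d_buf, 1 << 20);  // allocations and launches must happen in this thread
    // ... launch kernels for this device's share of the work ...
    cudaFree(d_buf);
}

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) workers.emplace_back(gpuWorker, i);
    for (size_t i = 0; i < workers.size(); ++i) workers[i].join();
    return 0;
}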

GPU Emulator for CUDA programming without the hardware [closed]

Question: Is there an emulator for a Geforce card that would allow me to program and test CUDA without having the actual hardware?
Info:
I'm looking to speed up a few simulations of mine with CUDA, but my problem is that I'm not always around my desktop to do this development. I would like to do some work on my netbook instead, but my netbook doesn't have a GPU. Now, as far as I know, you need a CUDA-capable GPU to run CUDA. Is there a way to get around this? It would seem the only way is a GPU emulator (which obviously would be painfully slow, but would work). But whatever way there is to do this, I would like to hear it.
I'm programming on Ubuntu 10.04 LTS.
For those who are seeking the answer in 2016 (and even 2017) ...
Disclaimer
I failed to emulate a GPU after all.
It might be possible to use gpuocelot if you satisfy its list of dependencies.
I tried to get an emulator working on BunsenLabs (Linux 3.16.0-4-686-pae #1 SMP Debian 3.16.7-ckt20-1+deb8u4 (2016-02-29) i686 GNU/Linux).
I'll tell you what I've learnt.
nvcc used to have a -deviceemu option back in CUDA Toolkit 3.0
I downloaded CUDA Toolkit 3.0, installed it, and tried to run a simple program:
#include <stdio.h>

__global__ void helloWorld() {
    printf("Hello world! I am %d (Warp %d) from %d.\n",
           threadIdx.x, threadIdx.x / warpSize, blockIdx.x);
}

int main() {
    int blocks, threads;
    scanf("%d%d", &blocks, &threads);
    helloWorld<<<blocks, threads>>>();
    cudaDeviceSynchronize();
    return 0;
}
Note that in CUDA Toolkit 3.0, nvcc was in /usr/local/cuda/bin/.
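(The compile step was presumably something along the lines of:

/usr/local/cuda/bin/nvcc -deviceemu helloworld.cu -o helloworld

with -deviceemu selecting the emulation target.)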
It turned out that I had difficulties compiling it:
NOTE: device emulation mode is deprecated in this release
and will be removed in a future release.
/usr/include/i386-linux-gnu/bits/byteswap.h(47): error: identifier "__builtin_bswap32" is undefined
/usr/include/i386-linux-gnu/bits/byteswap.h(111): error: identifier "__builtin_bswap64" is undefined
/home/user/Downloads/helloworld.cu(12): error: identifier "cudaDeviceSynchronize" is undefined
3 errors detected in the compilation of "/tmp/tmpxft_000011c2_00000000-4_helloworld.cpp1.ii".
I found claims on the Internet that the errors might disappear if I used gcc-4.2, or something similarly ancient, instead of gcc-4.9.2. I gave up.
gpuocelot
The answer by Stringer has a link to a very old gpuocelot project website, so at first I thought the project was abandoned in 2012 or so. Actually, it was abandoned a few years later.
Here are some up-to-date websites:
GitHub;
Project's website;
Installation guide.
I tried to install gpuocelot following the guide. I had several errors during installation, though, and gave up again. gpuocelot is no longer supported and depends on a set of very specific versions of libraries and software.
You might try to follow this tutorial from July 2015, but I don't guarantee it'll work; I haven't tested it.
MCUDA
The MCUDA translation framework is a linux-based tool designed to effectively compile the CUDA programming model to a CPU architecture.
It might be useful. Here is a link to the website.
CUDA Waste
It is an emulator for Windows 7 and 8, though I haven't tried it. It doesn't seem to be developed anymore (the last commit is dated Jul 4, 2013).
Here's the link to the project's website: https://code.google.com/archive/p/cuda-waste/
CU2CL
Last update: 12.03.2017
As dashesy pointed out in the comments, CU2CL seems to be an interesting project. It appears to be able to translate CUDA code to OpenCL code. So if your GPU is capable of running OpenCL code, then the CU2CL project might be of interest to you.
Links:
CU2CL homepage
CU2CL GitHub repository
This response may be too late, but it's worth noting anyway. GPU Ocelot (of which I am one of the core contributors) can be compiled without CUDA device drivers (libcuda.so) installed if you wish to use the Emulator or LLVM backends. I've demonstrated the emulator on systems without NVIDIA GPUs.
The emulator attempts to faithfully implement the PTX 1.4 and PTX 2.1 specifications which may include features older GPUs do not support. The LLVM translator strives for correct and efficient translation from PTX to x86 that will hopefully make CUDA an effective way of programming multicore CPUs as well as GPUs. -deviceemu has been a deprecated feature of CUDA for quite some time, but the LLVM translator has always been faster.
Additionally, several correctness checkers are built into the emulator to verify that memory accesses are aligned, that accesses to shared memory are properly synchronized, and that global memory dereferences hit allocated regions of memory. We have also implemented a command-line interactive debugger, inspired largely by gdb, to single-step through CUDA kernels, set breakpoints and watchpoints, etc. These tools were specifically developed to expedite the debugging of CUDA programs; you may find them useful.
Sorry about the Linux-only aspect. We've started a Windows branch (as well as a Mac OS X port) but the engineering burden is already large enough to stress our research pursuits. If anyone has any time and interest, they may wish to help us provide support for Windows!
Hope this helps.
[1]: GPU Ocelot - https://code.google.com/archive/p/gpuocelot/
[2]: Ocelot Interactive Debugger - http://forums.nvidia.com/index.php?showtopic=174820
You can also check the gpuocelot project, which is a true emulator in the sense that PTX (the intermediate bytecode that CUDA code is compiled to) is emulated.
There's also an LLVM translator; it would be interesting to test whether it's faster than -deviceemu.
The CUDA toolkit had one built in until the CUDA 3.0 release cycle. If you use one of these very old versions of CUDA, make sure to use -deviceemu when compiling with nvcc.
https://github.com/hughperkins/cuda-on-cl lets you run NVIDIA® CUDA™ programs on OpenCL 1.2 GPUs (full disclosure: I'm the author)
Be careful when you're programming using -deviceemu, as there are operations that nvcc will accept in emulation mode but not when actually running on a GPU. This is mostly found with device-host interaction.
And as you mentioned, prepare for some slow execution.
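A classic instance of that device-host pitfall, as a hypothetical snippet: under -deviceemu, "device" memory lives in the host address space, so a host-side dereference of a device pointer appears to work, while on real hardware the same line is invalid:

#include <cuda_runtime.h>

int main() {
    float *d_x = 0;
    cudaMalloc((void **)&d_x, 4 * sizeof(float));
    d_x[0] = 1.0f;  // "works" under -deviceemu (shared address space),
                    // but is invalid on a real GPU: use cudaMemcpy instead.
    cudaFree(d_x);
    return 0;
}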
GPGPU-Sim is a GPU simulator that can run CUDA programs without using a GPU.
I created a Docker image with GPGPU-Sim installed, in case that is helpful.