CUDA 5 compatible with CUDA 4

CUDA 5 was released recently, and I have been using CUDA 4 until now. So I was wondering: will the code I wrote for CUDA 4 still run if I install CUDA 5?
Is it completely or only partially compatible? Will open source projects like gpuocelot, which require CUDA 4, work with CUDA 5 too?
Thanks

There is not 100% compatibility between CUDA 4 and CUDA 5.
To choose just one example: in CUDA 5 it is no longer permissible to use a character string to identify a device symbol, which was possible with certain API functions in CUDA 4. Instead, the symbol itself must be passed.
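For illustration, here is roughly what that change looks like (a minimal sketch; the symbol name d_coeff is just an example):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__device__ float d_coeff;   // a device symbol

int main() {
    float h_coeff = 3.14f;

    // CUDA 4.x also accepted a string naming the symbol:
    //   cudaMemcpyToSymbol("d_coeff", &h_coeff, sizeof(h_coeff));
    // CUDA 5 removed that form; the symbol itself must be passed:
    cudaError_t err = cudaMemcpyToSymbol(d_coeff, &h_coeff, sizeof(h_coeff));

    printf("cudaMemcpyToSymbol: %s\n", cudaGetErrorString(err));
    return 0;
}
```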
It's also been pointed out that the structure of the sample codes has changed significantly, which may impact your code if you are using elements of the sample codes. However this isn't a true compatibility issue, in my opinion.
It's likely that the changes required to move a CUDA code from CUDA 4 to CUDA 5 will be minor, if any are needed at all.
Emulators often depend on unpublished characteristics of the CUDA runtime, and frequently only work with specific CUDA versions. Check the emulator of your choice for any statements about the runtime version it requires.

Related

Benefit of higher version of CUDA for devices with lower Compute Capability

I'm using CUDA 7.0 on a Tesla K20X (C.C. 3.5). Is there any benefit to updating to a higher version of CUDA, say 8.0? Is there any compatibility or stability risk in using a higher version of CUDA with devices of (much) lower C.C.?
(The various CUDA versions available on the NVIDIA website leave me unsure which one is actually the right choice.)
Regarding benefits, newer CUDA toolkit versions usually provide feature benefits (new features and/or enhanced performance) over previous CUDA toolkit versions. However, there are also occasional performance regressions. Specifics can't be given - it may vary based on your exact code. However, there are generally summary blog articles for each new CUDA toolkit version; for example, here is the one for CUDA 8 and here is the one for CUDA 9, describing the new features available.
Regarding compatibility, there should be no risk to moving to a higher CUDA version, regardless of the compute capability of your device, as long as your device is supported. All current CUDA versions in the range of 7-9 support your cc3.5 GPU.
Regarding stability, it is possible that a newer CUDA version may have a bug, but it is also possible that a bug in your existing CUDA version has been fixed in a newer one. Guarantees can't be made here; software almost always has bugs in it. However, it is generally recommended to use the latest CUDA version compatible with your GPU (in the absence of other considerations), as this gives you access to the latest features and at least the best chance that a historically known issue has already been addressed.
I doubt these sorts of platitudes are any different for any other software stack (e.g. compiler, tool framework, etc.) that you are using. I don't think these considerations are specific or unique to CUDA.
I'm using CUDA 7.0 on a Tesla K20X (C.C. 3.5). Is there any benefit to update to a higher version of CUDA, say 8.0 ?
Are you kidding me? There are enormous benefits. It's a world of difference! Just have a look at the CUDA 8 feature descriptions (Parallel4All blog entry). Specifically,
CUDA 8.0 lets you compile with GCC 5.x instead of 4.x
Not only does that save you a life full of pain having to build your own GCC 4.x - since modern distros often don't package it at all, and it's certainly not the system's default compiler - GCC 5.x also has lots of improvements, not the least of which is full C++14 support for host-side code.
CUDA 8 lets you use C++11 lambdas in device code
(actually, CUDA 7.5 lets you do that and this is rounded off in CUDA 8)
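For instance, with the extended-lambda support (the experimental nvcc flag --expt-extended-lambda), you can pass a device lambda straight to a Thrust algorithm from host code. A minimal sketch, assuming Thrust as shipped with the toolkit:

```cuda
// Compile with: nvcc --expt-extended-lambda lambda_demo.cu
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <cstdio>

int main() {
    thrust::device_vector<float> v(8, 1.0f);

    // A __device__ lambda written directly in host code:
    thrust::transform(v.begin(), v.end(), v.begin(),
                      [] __device__ (float x) { return 2.0f * x; });

    printf("v[0] = %f\n", (float)v[0]);
    return 0;
}
```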
NVCC internal improvements
Not that I can list these, but hopefully NVIDIA continues working on its compiler, equipping it with better optimization logic.
Much faster compilation
NVCC is markedly faster with CUDA 8. It might be up to 2x, but even if it's just 1.5x - that really improves your quality of life as a developer...
Shall I go on? ... all of the above applies regardless of your compute capability. And CC 3.5 or 3.7 is nothing to sneeze at anyway.

CUDA program with slightly different results every run

I'm new to CUDA and OpenCL.
I have translated a program's kernels from CUDA to OpenCL. I'm using the same seeds for the random number generation in both versions.
While the OpenCL version gets exactly the same results every run, the CUDA version gives slightly different results every run.
I'm compiling the CUDA version without -use_fast_math.
My device has compute capability 1.1.
Any idea about what could be the reason?
Thanks in advance
Devices of compute capability 1.1 do not support double operations. So if you are using double they are getting demoted to float. That could possibly affect your results, although a compute capability 1.1 device cannot support double in OpenCL either, AFAIK.
My question actually is: are there any CUDA compiler options that may affect the accuracy of the CUDA results?
Yes, there are a variety of options that affect CUDA's handling of floating-point math.
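To be concrete, these are the nvcc switches I would look at first (flag spellings per the nvcc documentation); the kernel below is just an illustrative example of code whose numerical results they can change:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// nvcc options that affect floating-point behaviour in device code:
//   --use_fast_math          enables fast intrinsics and implies the settings below
//   --fmad=true|false        allow/forbid contraction of mul+add into a single FMA
//   --prec-div=true|false    IEEE-compliant vs. approximate division
//   --prec-sqrt=true|false   IEEE-compliant vs. approximate square root
//   --ftz=true|false         flush denormals to zero (or not)
__global__ void axpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // may or may not be contracted to an FMA
}

int main() {
    const int n = 1024;
    float *x, *y;
    cudaMalloc((void**)&x, n * sizeof(float));
    cudaMalloc((void**)&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    axpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```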
I don't know why any of this would lead to variation from one run to the next, however. It's likely that you have a bug in the code.
I found the problem. In the original code, some values were updated asynchronously and were not yet completely updated when they were read. Thanks everybody for the help, and sorry for the trouble.

CURAND_STATUS_DOUBLE_PRECISION_REQUIRED is undefined

While building my CUDA project I get the following error:
cutil_inline_runtime.h(328): error: identifier "CURAND_STATUS_DOUBLE_PRECISION_REQUIRED" is undefined
So I started googling. Since I couldn't find the solution (nor did I find the actual problem), I downloaded the CURAND guide PDF from NVIDIA and started reading. In that PDF it says:
Enumerator:
...
**CURAND_STATUS_DOUBLE_PRECISION_REQUIRED** GPU does not have double precision required by MRG32k3a
...
This means I can't perform operations with the double data type... right? Well, that seems wrong because, I assure you, a couple of days ago I made a project using the double data type and it worked just fine. Does anyone have a suggestion? Thank you!
EDIT: Since I was unable to find the "real" solution to the problem, I eventually commented out the lines with "CURAND_STATUS_DOUBLE_PRECISION_REQUIRED" in cutil_inline_runtime.h and now it works. I know it is "ugly", but it was the only thing that helped...
I also had this problem. The issue was that I was using different versions of the CUDA SDK and the CUDA toolkit. I was using a newer version of the SDK, which referenced the definition of CURAND_STATUS_DOUBLE_PRECISION_REQUIRED, but that definition was absent from curand.h in the older version of the CUDA toolkit that I had installed.
Try installing the version of the toolkit that matches your CUDA SDK version.
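One quick way to see which toolkit you are actually compiling and linking against is to print the runtime version numbers; a small sketch (CUDART_VERSION comes from the toolkit's cuda_runtime_api.h, so if it isn't what you expect, the headers and your installed toolkit are out of sync):

```cuda
#include <cuda_runtime_api.h>
#include <cstdio>

int main() {
    // Compile-time version of the toolkit headers being used:
    printf("Compiled against CUDA runtime %d\n", CUDART_VERSION);

    // Version of the runtime library actually linked/loaded:
    int rt = 0;
    cudaRuntimeGetVersion(&rt);
    printf("Linked CUDA runtime %d\n", rt);
    return 0;
}
```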
Try Googling "Compute Capability", it's how nvidia defines the various CUDA capabilities. In the CUDA C Programming guide, it's mentioned a couple of times that devices with compute capability 1.2 or lower do not support double precision FP math: CUDA C Programming, pages 140, and also check table F-1, namely the row about dpfp support.
Gefore 8800 GT is a G80 architecture chip, which is compute capability 1.0. You cannot do double computations on this card. Keep in mind much example code makes use of emulation modes, and the like, which can fool you into thinking it works.
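As a quick sanity check, you can also query the compute capability at runtime; double precision requires compute capability 1.3 or higher. A minimal sketch:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Double precision is supported from compute capability 1.3 onwards.
    bool has_double = (prop.major > 1) || (prop.major == 1 && prop.minor >= 3);
    printf("%s: CC %d.%d, double precision %ssupported\n",
           prop.name, prop.major, prop.minor, has_double ? "" : "not ");
    return 0;
}
```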

cuda sdk example simpleStreams in SDK 4.1 not working

I upgraded the CUDA GPU Computing SDK and CUDA toolkit to 4.1. I was testing the simpleStreams program, but it consistently takes more time than the non-streamed execution. My device has compute capability 2.1, and I'm using VS2008 on Windows.
This sample consistently has issues. If you tweak the sample so that the kernel and the memory copy have equal duration, the overlap will improve. Normally, breadth-first submission is better for concurrency; however, on a WDDM OS this sample will usually overlap better if you issue the memory copy right after the kernel launch. See the sketch below for what breadth-first submission looks like.
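A minimal sketch of breadth-first submission (the kernel, chunk size, and stream count are made up for illustration; pinned host memory is required for copy/compute overlap):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int nStreams = 4, chunk = 1 << 20;
    const size_t bytes = chunk * sizeof(float);

    float *h, *d;
    cudaMallocHost((void**)&h, nStreams * bytes);   // pinned memory: required for async overlap
    cudaMalloc((void**)&d, nStreams * bytes);
    for (int i = 0; i < nStreams * chunk; ++i) h[i] = 1.0f;

    cudaStream_t stream[nStreams];
    for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&stream[i]);

    // Breadth-first: all H2D copies, then all kernels, then all D2H copies.
    for (int i = 0; i < nStreams; ++i)
        cudaMemcpyAsync(d + i * chunk, h + i * chunk, bytes,
                        cudaMemcpyHostToDevice, stream[i]);
    for (int i = 0; i < nStreams; ++i)
        scale<<<(chunk + 255) / 256, 256, 0, stream[i]>>>(d + i * chunk, chunk);
    for (int i = 0; i < nStreams; ++i)
        cudaMemcpyAsync(h + i * chunk, d + i * chunk, bytes,
                        cudaMemcpyDeviceToHost, stream[i]);

    cudaDeviceSynchronize();
    printf("h[0] = %f\n", h[0]);

    for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(stream[i]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```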
I noticed this as well. I thought it was just me: I didn't see any improvement either, and I tried searching the forums but didn't find anyone else with the issue.
I also ran the source code from the CUDA by Example book (which is really helpful, and I recommend you pick it up if you're serious about GPU programming).
Chapter 10 has a progression of examples showing how streams should be used.
http://developer.nvidia.com/content/cuda-example-introduction-general-purpose-gpu-programming-0
But comparing
1. the non-streamed version (which is basically the single-stream version),
2. the streamed version with incorrectly queued async memcpys and kernel launches, and
3. the streamed version with correctly queued async memcpys and kernel launches,
I find no benefit in using CUDA streams. It might be a Win7 issue, as I found some sources online discussing that Windows Vista didn't support CUDA streams correctly.
Let me know what you find with the example I linked. My setup is: Win7 64-bit Pro, CUDA 4.1, dual GeForce GTX 460 cards, 8GB RAM.
I'm pretty new to CUDA so I may not be able to help, but generally it's very hard to help without you posting any code. If posting is not possible, then I suggest you take a look at NVIDIA's Visual Profiler. It's cross-platform and can show you where your bottlenecks are.

Multiple GPUs in CUDA 3.2 and issues with Cuda 4.0

I am new to multiple GPUs. I have written a code for a single GPU and want to speed it up further by using multiple GPUs. I am working with two GTX 470s, MS VS 2008, and CUDA toolkit 4.0.
I am facing two problems.
The first problem is that my code somehow doesn't run fine with the 4.0 build rules but works fine with the 3.2 build rules. Also, the multiGPU SDK example doesn't build on VS2008, giving the error
error C3861: 'cudaDeviceReset': identifier not found
My second problem is that if I have to work with 3.2, then according to the documentation, threads have to be launched separately, separate allocations have to be made, etc. What is the easiest library for launching threads for multiple GPUs, and can you please give some example for my setup for accessing multiple GPUs?
The answer to the first question is that you are clearly building against an older version of the CUDA runtime (headers and library). cudaDeviceReset is a new addition to the API, introduced in CUDA 4.0. So double-check the build rules and make sure you really are pointing the compiler and linker at the CUDA 4.0 toolkit and not an earlier version.
The second part of your question sounds like a "hai plz give me teh code" question, and that isn't really what this place is for. I will, however, give you a link to GPUWorker (code currently available here), which is a Boost.Thread-based multi-GPU framework that was originally part of the HOOMD molecular dynamics package. It should give you some hints on how to write a multithreaded, multi-GPU code, even if GPUWorker turns out not to be directly applicable to your needs.
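For completeness: if you do get the 4.0 build rules working, the CUDA 4.0 runtime lets a single host thread drive both GPUs simply by switching devices with cudaSetDevice, so the per-GPU host threads of the 3.2 era are no longer mandatory. A minimal sketch (the kernel and sizes are just illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void fill(float *p, int n, float v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = v;
}

int main() {
    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    if (nDevices > 2) nDevices = 2;          // the question uses two GTX 470s

    const int n = 1 << 20;
    float *d[2] = {0, 0};

    // Issue work to each GPU from the same host thread.
    for (int dev = 0; dev < nDevices; ++dev) {
        cudaSetDevice(dev);                          // subsequent calls target this GPU
        cudaMalloc((void**)&d[dev], n * sizeof(float));
        fill<<<(n + 255) / 256, 256>>>(d[dev], n, (float)dev);
    }

    // Synchronize and clean up each device.
    for (int dev = 0; dev < nDevices; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaFree(d[dev]);
    }
    printf("used %d device(s)\n", nDevices);
    return 0;
}
```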