CUDA program with slight different results every run - cuda

I'm new at CUDA and OpenCL.
I have translated the kernels of a program from CUDA kernels to OpenCL kernels. I'm using the same seeds for the random number generation in both versions.
While the OpenCL version gets the exact same results every run, the CUDA version gives a slight different results every run.
I'm compiling the CUDA version without -use_fast_math.
My device is 1.1 capability.
Any idea about what could be the reason?
Thanks in advance

Devices of compute capability 1.1 do not support double operations. So if you are using double they are getting demoted to float. That could possibly affect your results, although a compute capability 1.1 device cannot support double in OpenCL either, AFAIK.
My question actually is is there any CUDA compiling options that may affect the accuracy of the CUDA results.
Yes, there are a variety of options that affect CUDA's usage of floating point math
I don't know why any of this would lead to variation from one run to the next, however. It's likely that you have a bug in the code.

I found the problem. In the original code, some values were updated asynchronously and was not completely updated yet. Thanks everybody for help. And sorry for the troubles.

Related

CURAND_STATUS_DOUBLE_PRECISION_REQUIRED is undefined

While building my CUDA project I get the following error:
cutil_inline_runtime.h(328): error: identifier "CURAND_STATUS_DOUBLE_PRECISION_REQUIRED" is undefined
So I started googling. Since I couldn't find the solution (nor did I find the actual problem) I downloaded from nVIDIA CURAND guide pdf and started reading. In that pdf it says:
Enumerator:
...
**CURAND_STATUS_DOUBLE_PRECISION_REQUIRED** GPU does not have double precision required by MRG32k3a
...
This means I can't perform operations with double data type... right? Well that seems wrong because, I assure you, couple a days ago I made a project using double data type and it worked just fine. Does anyone have a suggestion? Thank you!
EDIT Since I was unable to find the "real" solution for the problem, eventually I commented out the lines with "CURAND_STATUS_DOUBLE_PRECISION_REQUIRED" in cutil_inline_runtime.h and now it works. I know it is "ugly" but it was the only thing that helped...
I also had this problem. The issue was that I was using different versions of the cuda SDK and the cuda toolkit. I was using a newer version of the SDK which referenced the definition of CURAND_STATUS_DOUBLE_PRECISION_REQUIRED, but that definition was undefined in curand.h in the older version of the cuda toolkit that I had installed.
Try installing the version of the toolkit that matches your cuda SDK version.
Try Googling "Compute Capability", it's how nvidia defines the various CUDA capabilities. In the CUDA C Programming guide, it's mentioned a couple of times that devices with compute capability 1.2 or lower do not support double precision FP math: CUDA C Programming, pages 140, and also check table F-1, namely the row about dpfp support.
Gefore 8800 GT is a G80 architecture chip, which is compute capability 1.0. You cannot do double computations on this card. Keep in mind much example code makes use of emulation modes, and the like, which can fool you into thinking it works.

CUDA development on different cards?

I'm just starting to learn how to do CUDA development(using version 4) and was wondering if it was possible to develop on a different card then I plan to use? As I learn, it would be nice to know this so I can keep an eye out if differences are going to impact me.
I have a mid-2010 macbook pro with a Nvidia GeForce 320M graphic cards(its a pretty basic laptop integrated card) but I plan to run my code on EC2's NVIDIA Tesla “Fermi” M2050 GPUs. I'm wondering if its possible to develop locally on my laptop and then run it on EC2 with minimal changes(I'm doing this for a personal project and don't want to spend $2.4 for development).
A specific question is, I heard that recursions are supported in newer cards(and maybe not in my laptops), what if I run a recursion on my laptop gpu? will it kick out an error or will it run but not utilize the hardware features? (I don't need the specific answer to this, but this is kind of the what I'm getting at).
If this is going to be a problem, is there emulators for features not avail in my current card? or will the SDK emulate it for me?
Sorry if this question is too basic.
Yes, it's a pretty common practice to use different GPUs for development and production. nVidia GPU generations are backward-compatible, so if your program runs on older card (that is if 320M (CC1.3)), it would certainly run on M2070 (CC2.0)).
If you want to get maximum performance, you should, however, profile your program on same architecture you are going to use it, but usually everything works quite well without any changes when moving from 1.x to 2.0. Any emulator provide much worse view of what's going on than running on no-matter-how-old GPU.
Regarding recursion: an attempt to compile a program with obvious recursion for 1.3 architecture produces compile-time error:
nvcc rec.cu -arch=sm_13
./rec.cu(5): Error: Recursive function call is not supported yet: factorial(int)
In more complex cases the program might compile (I don't know how smart the compiler is in detecting recursions), but certainly won't work: in 1.x architecture there was no call stack, and all function calls were actually inlined, so recursion is technically impossible.
However, I would strongly recommend you to avoid recursion at any cost: it goes against GPGPU programming paradigm, and would certainly lead to very poor performance. Most algorithms are easily rewritten without the use of recursion, and it is much more preferable way to utilize them, not only on GPU, but on CPU as well.
The Cuda Version at first is not that important. More important are the compute capabilities of your card.
If you programm your kernels using cc 1.0 and they are scalable for the future you won't have any problems.
Choose yourself your minimum cc level you need for your application.
Calculate necessary parameters using properties and use ptx jit compilation:
If your kernel can handle arbitrary input sized data and your kernel launch configuration scales across thousands of threads it will scale across future versions.
In my projects all my kernels used a fixed number of threads per block which was equal to the number of resident threads per streaming multiprocessor divided by the number of resident blocks per streaming multiprocessor to reach 100% occupancy.
Some kernels need a multiple of two number of threads per block so I handled this case also since not for all cc versions the above equation guaranteed a multiple of two block size.
Some kernels used shared memory and its size was also deducted by the cc level properties.
This data was received using (cudaGetDeviceProperties) in a utility class and using ptx jit compiling my kernels worked without any changes on all devices. I programmed on a cc 1.1 device and ran tests on latest cuda cards without any changes!
All kernels were programmed to work with 64-bit length input data and utilizing all dimensions of the 3D Grid. (I am pretty sure in a year I will continue working on this project so this was necessary)
All my kernels except one did not exceeded the cc 1.0 register limit while having 100% occ. So if the used card cc was below 1.2 I added a maxregcount command to my kernel to still enforce 100% occ.
This does not guarantees best possible performance!
For possible best performance each kernel should be analyzed regarding its parameters and resources.
This maybe is not practicable for all applications and requirements
The NVidia Kepler K20 GPU available in Q4 2012 with CUDA 5 will support recursive algorithms.

cuda sdk example simpleStreams in SDK 4.1 not working

I upgraded CUDA GPU computing SDK and CUDA computing toolkit to 4.1. I was testing simpleStreams programs, but consistently it is taking more time that non-streamed execution. my device is with compute capability 2.1 and i'm using VS2008,windows OS.
This sample constantly has issues. If you tweak the sample to have equal duration for the kernel and memory copy the overlap will improve. Normally breadth first submission is better for concurrency; however, on WDDM OS this sample will usually have better overlap if you issue the memory copy right after kernel launch.
I noticed this as well. I thought it was just me but I didn't notice any improvement and tried searching the forums but didn't find anyone else with the issue.
I also ran the source code in the Cuda By Example book (which is really helpful and I recommend you pick it up if you're serious about GPU programming).
Chapter 10 examples has the progression of examples showing how streams should be used.
http://developer.nvidia.com/content/cuda-example-introduction-general-purpose-gpu-programming-0
But comparing the,
1. non-streamed version(which is basically the single stream version)
2. the streamed (incorrectly queued asyncmemcpy and kernel launch)
3. the streamed (correctly queued asyncmemcpy and kernel launch)
I find no benefit in using cuda streams. It might be a win7 issue as I found some sources online discussing that win vista didn't support the cuda streams correctly.
Let me know what you find with the example I linked. My setup is: Win7 64bit Pro, Cuda 4.1, Dual Geforce GTX460 cards, 8GB RAM.
I'm pretty new to Cuda so may not be able to help but generally its very hard to help without you posting any code. If posting is not possible then I suggest you take a look at Nvidia's visual profiler. Its cross platform and can show you were your bottlenecks are.

How to find out if kernel is memory bound or computation bound?

I think my kernel is memory bound (because most GPGPU code is memory bound), but I don't actually know for sure. How can I found it out for myself. Probably one has to use the visual profiler, as it depends on the used GPU.
If it is explained in the CUDA Programming guide or in other NVIDIA documentation, don't hesitate to just post a link with a page number, so I can read it up for myself.
Clarification
I would prefer are general "rule" how to determine the limiting factor, but in my special case you can find details about my kernel here: Using `overlap`, `kernel time` and `utilization` to optimize one's kernels
This presentation from NVIDIA talks about selectively disabling memory accesses and arithmetic in your kernel by modifying your source code, in order to determine if one of them is limiting your performance.
A nice trick without any source code modification can be used for code compiled with compute capability 2.0 and above ( based on answer here )
using the "--use_fast_math" flag one can easily increase\decrease compute pressure.
if setting this flag gives a large speed-up, this would indicate a compute bound kernel.
if setting this flag gives little to no speed-up, this would indicate a balanced\memory bound kernel.
I though I would pitch in an answer even though there is an accepted answer and this question is old.
I had a similar problem in my code, although at the time I didn't know it.
I ran the Nvidia Visual Profiler (nvvp) and analysed my program. I found that the profiler had detected my program was limited in some fashion and had some recommendations.
A great tool to use if you are unsure on where to begin.

CUDA vs Direct X 10 for parallel mathematics. any thoughs you have about it?

CUDA vs Direct X 10 for parallel mathematics. any thoughs you have about it ?
CUDA is probably a better option, if you know your target architecture is using nVidia chips. You have complete control over your data transfers, instruction paths and order of operations. You can also get by with a lot less __syncthreads calls when you're working on the lower level.
DirectX 10 will be easier to interface against, I should think, but if you really want to push your speed optimization, you have to bypass the extra layer. DirectX 10 will also not know when to use texture memory versus constant memory versus shared memory as well as you will depending on your particular algorithm.
If you have access to a Tesla C1060 or something like that, CUDA is by far the better choice hands down. You can really speed things up if you know the specifics of your GPGPU - I've seen 188x speedups in one particular algorithm on a Tesla versus my desktop.
I find CUDA awkward. It's not C, but a subset of it. It doesn't support double precision floating point natively and is emulated. For single precision it's okay though. It depends on the type of task you throw at it. You have to spend more time computing in parallel than you spend passing the data around for it to be worth using. But that issue is not unique to CUDA.
I'd wait for Apple's OpenCL which seems like it will be the industry standard for parallel computing.
Well, CUDA is portable... That's a big win if you ask me...
CUDA has nothing to do about supporting double precision floating point operations.
This is dependent on the hardware available. The 9, 100, 200 and Tesla series support double precision floating point operations tesla.
It should be easy to decide between them.
If your app can tolerate being Windows specific, you can still consider DirectX Compute. Otherwise, use CUDA or OpenCL.
If your app cannot tolerate a vendor lock on NVIDIA, you cannot use CUDA, you must use OpenCL or DirectX Compute.
If your app is doing DirectX interop, consider that CUDA/OpenCL will incur context switch overhead doing graphics API interop, and DirectX Compute will not.
Unless one or more of those criteria affect your application, use the great granddaddy of massively parallel toolchains: CUDA.