CUDA SDK example simpleStreams in SDK 4.1 not working

I upgraded the CUDA GPU Computing SDK and the CUDA Toolkit to 4.1. I was testing the simpleStreams program, but it consistently takes more time than the non-streamed execution. My device has compute capability 2.1, and I'm using VS2008 on Windows.

This sample constantly has issues. If you tweak the sample to have equal duration for the kernel and memory copy the overlap will improve. Normally breadth first submission is better for concurrency; however, on WDDM OS this sample will usually have better overlap if you issue the memory copy right after kernel launch.
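To make the two issue orders concrete, here is a minimal sketch. It is not the SDK sample itself: the kernel `scale`, the helpers `issueDepthFirst`, `issueBreadthFirst`, `timeIt` and `blocksFor`, and the stream count and chunk size are all assumptions made for illustration. Depth-first queues each stream's copy right after its kernel launch; breadth-first queues all kernels and then all copies.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

const int kStreams = 4;        // assumed stream count
const int kChunk   = 1 << 20;  // assumed elements per stream
const int kThreads = 256;      // assumed threads per block

// Hypothetical stand-in for the sample's kernel: doubles every element.
__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Number of blocks needed to cover n elements.
int blocksFor(int n, int threads) { return (n + threads - 1) / threads; }

// Depth-first: each stream gets its kernel, then immediately its copy.
void issueDepthFirst(cudaStream_t *s, float *d, float *h)
{
    for (int i = 0; i < kStreams; ++i) {
        scale<<<blocksFor(kChunk, kThreads), kThreads, 0, s[i]>>>(d + i * kChunk, kChunk);
        cudaMemcpyAsync(h + i * kChunk, d + i * kChunk, kChunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
}

// Breadth-first: all kernels first, then all copies.
void issueBreadthFirst(cudaStream_t *s, float *d, float *h)
{
    for (int i = 0; i < kStreams; ++i)
        scale<<<blocksFor(kChunk, kThreads), kThreads, 0, s[i]>>>(d + i * kChunk, kChunk);
    for (int i = 0; i < kStreams; ++i)
        cudaMemcpyAsync(h + i * kChunk, d + i * kChunk, kChunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
}

// Times one issue order with CUDA events; returns milliseconds.
float timeIt(void (*issue)(cudaStream_t *, float *, float *),
             cudaStream_t *s, float *d, float *h)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    issue(s, d, h);
    cudaDeviceSynchronize();   // wait for every stream before stopping the clock
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const size_t bytes = (size_t)kStreams * kChunk * sizeof(float);
    float *h = 0, *d = 0;
    cudaStream_t s[kStreams];

    // Pinned host memory: async copies cannot overlap with kernels without it.
    if (cudaMallocHost((void **)&h, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d, bytes) != cudaSuccess) {
        printf("allocation failed (no CUDA device?)\n");
        return 1;
    }
    for (int i = 0; i < kStreams * kChunk; ++i) h[i] = 1.0f;
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    for (int i = 0; i < kStreams; ++i) cudaStreamCreate(&s[i]);

    printf("depth-first:   %.3f ms\n", timeIt(issueDepthFirst, s, d, h));
    printf("breadth-first: %.3f ms\n", timeIt(issueBreadthFirst, s, d, h));

    for (int i = 0; i < kStreams; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```

The first timed call also pays one-time launch overhead, so for real numbers you would probably want a warm-up run and several repetitions. Which order wins depends on the driver model and the hardware, so treat the output as something to measure on your own machine rather than a fixed result.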

I noticed this as well. I thought it was just me: I didn't see any improvement, and when I searched the forums I couldn't find anyone else with the issue.
I also ran the source code from the book CUDA by Example (which is really helpful; I recommend picking it up if you're serious about GPU programming).
The Chapter 10 examples show, step by step, how streams should be used.
http://developer.nvidia.com/content/cuda-example-introduction-general-purpose-gpu-programming-0
But comparing
1. the non-streamed version (which is basically the single-stream version),
2. the streamed version with incorrectly queued cudaMemcpyAsync calls and kernel launches, and
3. the streamed version with correctly queued cudaMemcpyAsync calls and kernel launches,
I find no benefit in using CUDA streams. It might be a Windows 7 issue, as I found some sources online saying that Windows Vista didn't support CUDA streams correctly.
Let me know what you find with the example I linked. My setup is: Win7 64-bit Pro, CUDA 4.1, dual GeForce GTX 460 cards, 8 GB RAM.

I'm pretty new to CUDA so I may not be able to help, but generally it's very hard to help without you posting any code. If posting is not possible, then I suggest you take a look at NVIDIA's Visual Profiler. It's cross-platform and can show you where your bottlenecks are.

Related

CUDA program with slightly different results every run

I'm new to CUDA and OpenCL.
I have translated the kernels of a program from CUDA kernels to OpenCL kernels. I'm using the same seeds for the random number generation in both versions.
While the OpenCL version gets exactly the same results every run, the CUDA version gives slightly different results every run.
I'm compiling the CUDA version without -use_fast_math.
My device has compute capability 1.1.
Any idea about what could be the reason?
Thanks in advance
Devices of compute capability 1.1 do not support double operations. So if you are using double they are getting demoted to float. That could possibly affect your results, although a compute capability 1.1 device cannot support double in OpenCL either, AFAIK.
My question actually is: are there any CUDA compiler options that may affect the accuracy of the CUDA results?
Yes, there are a variety of nvcc options that affect CUDA's use of floating-point math, for example -ftz, -prec-div, -prec-sqrt and -use_fast_math.
I don't know why any of this would lead to variation from one run to the next, however. It's likely that you have a bug in the code.
I found the problem. In the original code, some values were updated asynchronously and had not been completely updated yet when they were read. Thanks, everybody, for the help, and sorry for the trouble.
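The thread never shows the offending code, so the following is only an illustration of one common way this kind of run-to-run variation arises, not a reconstruction of the poster's bug. In this sketch (kernel `neighbourSum` and helper `expected` are made-up names), each thread publishes a value in shared memory and then reads its neighbour's slot. The `__syncthreads()` barrier is what makes the read safe; without it, a thread may read a slot that has not been written yet.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

const int kThreads = 256;  // assumed block size for the illustration

// Host reference for element i: own doubled value plus the neighbour's doubled value.
float expected(int i) { return 2.0f * i + 2.0f * ((i + 1) % kThreads); }

__global__ void neighbourSum(const float *in, float *out)
{
    __shared__ float tile[kThreads];
    int t = threadIdx.x;
    tile[t] = in[t] * 2.0f;
    // Barrier: without it, a thread may read a neighbour's slot before it is
    // written, and the result can change from run to run.
    __syncthreads();
    out[t] = tile[t] + tile[(t + 1) % kThreads];
}

int main()
{
    float h_in[kThreads], h_out[kThreads];
    float *d_in = 0, *d_out = 0;
    for (int i = 0; i < kThreads; ++i) h_in[i] = (float)i;

    if (cudaMalloc((void **)&d_in, sizeof(h_in)) != cudaSuccess ||
        cudaMalloc((void **)&d_out, sizeof(h_out)) != cudaSuccess) {
        printf("cudaMalloc failed (no CUDA device?)\n");
        return 1;
    }
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    neighbourSum<<<1, kThreads>>>(d_in, d_out);
    // This blocking copy is queued behind the kernel, so it sees finished results.
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < kThreads; ++i)
        if (h_out[i] != expected(i)) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches == 0 ? 0 : 1;
}
```

Commenting out the barrier is a quick way to see whether a kernel depends on it, although a race does not have to show up on every run or every device.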

Dearth of CUDA 5 Dynamic Parallelism Examples

I've been googling around and have only been able to find a trivial example of the new dynamic parallelism in Compute Capability 3.0 in one of their Tech Briefs linked from here. I'm aware that the HPC-specific cards probably won't be available until this time next year (after the nat'l labs get theirs). And yes, I realize that the simple example they gave is enough to get you going, but the more the merrier.
Are there other examples I've missed?
To save you the trouble, here is the entire example given in the tech brief:
__global__ void ChildKernel(void *data) {
    // Operate on data
}
__global__ void ParentKernel(void *data) {
    ChildKernel<<<16, 1>>>(data);
}
// In host code
ParentKernel<<<256, 64>>>(data);
// Recursion is also supported
__global__ void RecursiveKernel(void *data) {
    // continueRecursion is a placeholder condition; it is not defined in this snippet
    if (continueRecursion == true)
        RecursiveKernel<<<64, 16>>>(data);
}
EDIT:
The GTC talk New Features In the CUDA Programming Model focused mostly on the new Dynamic Parallelism in CUDA 5. The link has the video and slides. Still only toy examples, but a lot more detail than the tech brief above.
Here is what you need, the Dynamic parallelism programming guide. Full of details and examples: http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
Just to confirm: dynamic parallelism is only supported on GPUs with compute capability 3.5 and up.
I have a 3.0 GPU with CUDA 5.0 installed. I tried to compile the Dynamic Parallelism examples with
nvcc -arch=sm_30 test.cu
and received the compile error below:
test.cu(10): error: calling a global function("child_launch") from a global function("parent_launch") is only allowed on the compute_35 architecture or above.
GPU info
Device 0: "GeForce GT 640"
CUDA Driver Version / Runtime Version 5.0 / 5.0
CUDA Capability Major/Minor version number: 3.0
hope this helps
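If you want to check this at run time instead of waiting for the compiler to complain, something along these lines should work. It is a sketch: `supportsDynamicParallelism` is a helper name I made up, and the 3.5 threshold is taken from the compiler message above. Code that actually launches kernels from kernels also has to be built with relocatable device code and linked against the device runtime, e.g. `nvcc -arch=sm_35 -rdc=true test.cu -lcudadevrt` on a toolkit that still supports that architecture.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if the compute capability is 3.5 or higher,
// the minimum named in the nvcc error message for device-side launches.
bool supportsDynamicParallelism(int major, int minor)
{
    return major > 3 || (major == 3 && minor >= 5);
}

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA-capable device found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        printf("Device %d: \"%s\", compute capability %d.%d, dynamic parallelism: %s\n",
               dev, prop.name, prop.major, prop.minor,
               supportsDynamicParallelism(prop.major, prop.minor) ? "yes" : "no");
    }
    return 0;
}
```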
I edited the question title to "...CUDA 5...", since Dynamic Parallelism is new in CUDA 5, not CUDA 4. We don't have any public examples available yet, because we don't have public hardware available that can run them. CUDA 5.0 will support dynamic parallelism but only on Compute Capability 3.5 and later (GK110, for example). These will be available later in the year.
We will release some examples with a CUDA 5 release candidate closer to the time the hardware is available.
I think compute capability 3.0 doesn't include dynamic parallelism. It will be included in the GK110 architecture (aka "Big Kepler"); I don't know what compute capability number it will be assigned (3.1, maybe?). Those cards won't be available until late this year (I'm waiting sooo much for those). As far as I know, 3.0 corresponds to the GK104 chips like the GTX 690 or the GT 640M for laptops.
Just wanted to check in with you all given that the CUDA 5 RC was released recently. I looked in the SDK examples and wasn't able to find any dynamic parallelism there. Someone correct me if I'm wrong. I searched for kernel launches within kernels by grepping for "<<<" and found nothing.

CUDA development on different cards?

I'm just starting to learn CUDA development (using version 4) and was wondering whether it is possible to develop on a different card than the one I plan to use. It would be nice to know this as I learn, so I can keep an eye out for differences that will impact me.
I have a mid-2010 MacBook Pro with an Nvidia GeForce 320M graphics card (a pretty basic integrated laptop card), but I plan to run my code on EC2's NVIDIA Tesla “Fermi” M2050 GPUs. I'm wondering if it's possible to develop locally on my laptop and then run on EC2 with minimal changes (I'm doing this for a personal project and don't want to spend $2.4 on development).
A specific question: I heard that recursion is supported on newer cards (and maybe not on my laptop's). What happens if I run a recursion on my laptop GPU? Will it kick out an error, or will it run but not utilize the hardware features? (I don't need the specific answer to this, but this is the kind of thing I'm getting at.)
If this is going to be a problem, are there emulators for features not available on my current card? Or will the SDK emulate them for me?
Sorry if this question is too basic.
Yes, it's pretty common practice to use different GPUs for development and production. NVIDIA GPU generations are backward-compatible, so if your program runs on an older card (that is, on the 320M (CC 1.3)), it will certainly run on the M2070 (CC 2.0).
If you want maximum performance, you should, however, profile your program on the same architecture you are going to run it on, but usually everything works quite well without any changes when moving from 1.x to 2.0. Any emulator gives a much worse view of what's going on than running on a GPU, no matter how old.
Regarding recursion: an attempt to compile a program with obvious recursion for the 1.3 architecture produces a compile-time error:
nvcc rec.cu -arch=sm_13
./rec.cu(5): Error: Recursive function call is not supported yet: factorial(int)
In more complex cases the program might compile (I don't know how smart the compiler is at detecting recursion), but it certainly won't work: the 1.x architecture had no call stack, and all function calls were actually inlined, so recursion is technically impossible.
However, I would strongly recommend avoiding recursion at any cost: it goes against the GPGPU programming paradigm and would certainly lead to very poor performance. Most algorithms are easily rewritten without recursion, and that is the preferable way to implement them, not only on the GPU but on the CPU as well.
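As an illustration of rewriting without recursion: the contents of rec.cu aren't shown, so this assumes it was the usual recursive factorial. The loop version below needs no call stack, so it should compile for any compute capability. The kernel name `factorials` and the table size are arbitrary choices for the sketch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Loop-based factorial: no function calls itself, so no call stack is required.
// Results fit in 32 bits only up to n = 12.
__host__ __device__ unsigned int factorial(unsigned int n)
{
    unsigned int result = 1;
    for (unsigned int i = 2; i <= n; ++i) result *= i;
    return result;
}

// Each thread computes the factorial of its own index.
__global__ void factorials(unsigned int *out, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = factorial(i);
}

int main()
{
    const unsigned int n = 13;  // 0! .. 12!
    unsigned int h[n];
    unsigned int *d = 0;

    if (cudaMalloc((void **)&d, sizeof(h)) != cudaSuccess) {
        printf("cudaMalloc failed (no CUDA device?)\n");
        return 1;
    }
    factorials<<<1, 32>>>(d, n);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    for (unsigned int i = 0; i < n; ++i)
        printf("%u! = %u (host check: %u)\n", i, h[i], factorial(i));
    cudaFree(d);
    return 0;
}
```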
The CUDA version is not that important at first. More important is the compute capability of your card.
If you write your kernels for cc 1.0 and they are scalable, you won't have any problems in the future.
Choose the minimum cc level your application needs.
Calculate the necessary parameters from the device properties and use PTX JIT compilation:
If your kernel can handle input data of arbitrary size and your kernel launch configuration scales across thousands of threads, it will scale across future versions.
In my projects, all my kernels used a fixed number of threads per block, equal to the number of resident threads per streaming multiprocessor divided by the number of resident blocks per streaming multiprocessor, to reach 100% occupancy.
Some kernels need a power-of-two number of threads per block, so I handled this case as well, since the equation above does not yield a power-of-two block size for every cc version.
Some kernels used shared memory, and its size was also derived from the properties of the cc level.
This data was obtained with cudaGetDeviceProperties in a utility class, and with PTX JIT compilation my kernels worked without any changes on all devices. I developed on a cc 1.1 device and ran tests on the latest CUDA cards without any changes!
All kernels were written to work with 64-bit-length input data and to use all dimensions of the 3D grid. (I am pretty sure I will continue working on this project in a year, so this was necessary.)
All my kernels except one stayed within the cc 1.0 register limit while reaching 100% occupancy. So if the card's cc was below 1.2, I applied a maxrregcount limit to that kernel to still enforce 100% occupancy.
This does not guarantee the best possible performance!
For the best possible performance, each kernel should be analyzed with respect to its parameters and resources.
This may not be practical for all applications and requirements.
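A rough sketch of the block-size rule described in this answer, plus an optional round-down to a power of two for kernels that need one. This is not the answerer's utility class: the helper names are invented, and the resident-blocks-per-SM figure is hard-coded as an assumption (8 for cc 1.x and 2.x, 16 for cc 3.x; other generations differ, so check the programming guide's table for your card).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Threads per block that fill an SM when `blocksPerSM` blocks are resident.
int fullOccupancyBlockSize(int maxThreadsPerSM, int blocksPerSM)
{
    return maxThreadsPerSM / blocksPerSM;
}

// Largest power of two that does not exceed n (n >= 1).
int floorPowerOfTwo(int n)
{
    int p = 1;
    while (p * 2 <= n) p *= 2;
    return p;
}

// ASSUMPTION: resident blocks per SM, valid only for cc 1.x/2.x (8) and 3.x (16).
int assumedResidentBlocksPerSM(int major)
{
    return major <= 2 ? 8 : 16;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no CUDA device available\n");
        return 1;
    }
    int blocks = assumedResidentBlocksPerSM(prop.major);
    int size   = fullOccupancyBlockSize(prop.maxThreadsPerMultiProcessor, blocks);
    if (size > prop.maxThreadsPerBlock) size = prop.maxThreadsPerBlock;

    printf("%s (cc %d.%d): %d resident threads per SM\n",
           prop.name, prop.major, prop.minor, prop.maxThreadsPerMultiProcessor);
    printf("block size for full occupancy: %d, power-of-two variant: %d\n",
           size, floorPowerOfTwo(size));
    return 0;
}
```

For example, 1536 resident threads and 8 resident blocks give 192 threads per block, which is not a power of two, so the rounded variant is 128. Whether a kernel actually reaches that occupancy still depends on its register and shared-memory usage.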
The NVidia Kepler K20 GPU available in Q4 2012 with CUDA 5 will support recursive algorithms.

Can't run CUDA nor OpenCL on GeForce 540M

I have a problem running the samples provided by Nvidia in their GPU Computing SDK (there's a library of compiled sample codes).
For CUDA I get the message "No CUDA-capable device is detected"; for OpenCL there's an error from the function that should find OpenCL-capable units.
I have installed all three parts from Nvidia needed to develop with OpenCL: devdriver for Win7 64-bit v.301.27, CUDA toolkit 4.2.9 and GPU Computing SDK 4.2.9.
I think this might have to do with Optimus technology, which reroutes output from the Nvidia GPU to the Intel one to render things (this notebook also has an Intel HD 3000 accelerator). However, in the Nvidia Control Panel I set the high-performance Nvidia GPU as preferred, set the power profile to prefer maximum performance, and changed PhysX from automatic selection to the Nvidia processor. Nothing has changed though; those samples won't run (not even those targeted at GF8000 cards).
I would like to play with OpenCL a bit and see what it is capable of, but without the ability to test things it's useless. I have found some info about this on forums, but it was mostly about Linux, where you need Bumblebee to access the Nvidia GPU. There's no such problem on Windows, however; the drivers are better, so you can access it without dark magic (or so I thought until I found this problem).
My laptop has a GeForce 540M as well, in an Optimus configuration since my Sandy Bridge CPU also has Intel's integrated graphics. To run CUDA codes, I have to:
1. Install the NVIDIA driver.
2. Go to the NVIDIA Control Panel.
3. Click 3D Settings -> Manage 3D Settings -> Global Settings.
4. In the Preferred graphics processor drop-down, select "High-performance NVIDIA processor".
5. Apply the settings.
Note that the instructions above apply the settings for all applications, so you don't have to worry about CUDA errors any more. But it will drain more battery.
Here is a video recap as well. Good luck!
OK, this has proven to be a totally crazy solution. I wondered whether something was hooking in between the hardware and the application, and the only thing that came to mind was AV software. I'm using Comodo with the sandbox and Defense+ on, and after turning them off I could run all those samples. What's more, only Defense+ needs to be turned off.
Now I wonder how many apps may have been blocked from accessing the GPU.
That's most likely because of the architecture of Optimus. I'd suggest you read the
NVIDIA CUDA Developer Guide for NVIDIA Optimus Platforms, especially the section "Querying for a CUDA Device", which I believe addresses this issue.
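I haven't reproduced the guide's own code here, but a small diagnostic along these lines can help narrow down a "No CUDA-capable device is detected" message. It prints the driver and runtime versions and the exact error string from the device query. The `versionMajor` and `versionMinor` helpers are names I made up; the calls themselves are standard CUDA runtime API functions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// CUDA encodes versions as 1000 * major + 10 * minor (e.g. 4020 for 4.2).
int versionMajor(int v) { return v / 1000; }
int versionMinor(int v) { return (v % 100) / 10; }

int main()
{
    int driver = 0, runtime = 0, count = 0;
    cudaDriverGetVersion(&driver);    // stays 0 if no driver is installed
    cudaRuntimeGetVersion(&runtime);
    printf("driver %d.%d, runtime %d.%d\n",
           versionMajor(driver), versionMinor(driver),
           versionMajor(runtime), versionMinor(runtime));

    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        // The error string distinguishes a missing device from a driver problem.
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) == cudaSuccess)
            printf("device %d: %s (cc %d.%d)\n", dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

If this reports a device when run normally but fails under some other software configuration (security tools, for instance), that points away from the driver installation itself.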

GPU Emulator for CUDA programming without the hardware [closed]

Question: Is there an emulator for a Geforce card that would allow me to program and test CUDA without having the actual hardware?
Info:
I'm looking to speed up a few simulations of mine in CUDA, but my problem is that I'm not always around my desktop for doing this development. I would like to do some work on my netbook instead, but my netbook doesn't have a GPU. Now as far as I know, you need a CUDA capable GPU to run CUDA. Is there a way to get around this? It would seem like the only way is a GPU emulator (which obviously would be painfully slow, but would work). But whatever way there is to do this I would like to hear.
I'm programming on Ubuntu 10.04 LTS.
For those who are seeking the answer in 2016 (and even 2017) ...
Disclaimer
I've failed to emulate GPU after all.
It might be possible to use gpuocelot if you satisfy its list of
dependencies.
I've tried to get an emulator for BunsenLabs (Linux 3.16.0-4-686-pae #1 SMP
Debian 3.16.7-ckt20-1+deb8u4 (2016-02-29) i686 GNU/Linux).
I'll tell you what I've learnt.
nvcc used to have a -deviceemu option back in CUDA Toolkit 3.0
I downloaded CUDA Toolkit 3.0, installed it and tried to run a simple
program:
#include <stdio.h>
__global__ void helloWorld() {
    printf("Hello world! I am %d (Warp %d) from %d.\n",
           threadIdx.x, threadIdx.x / warpSize, blockIdx.x);
}
int main() {
    int blocks, threads;
    scanf("%d%d", &blocks, &threads);
    helloWorld<<<blocks, threads>>>();
    cudaDeviceSynchronize();
    return 0;
}
Note that in CUDA Toolkit 3.0 nvcc was in the /usr/local/cuda/bin/.
It turned out that I had difficulties with compiling it:
NOTE: device emulation mode is deprecated in this release
and will be removed in a future release.
/usr/include/i386-linux-gnu/bits/byteswap.h(47): error: identifier "__builtin_bswap32" is undefined
/usr/include/i386-linux-gnu/bits/byteswap.h(111): error: identifier "__builtin_bswap64" is undefined
/home/user/Downloads/helloworld.cu(12): error: identifier "cudaDeviceSynchronize" is undefined
3 errors detected in the compilation of "/tmp/tmpxft_000011c2_00000000-4_helloworld.cpp1.ii".
I've found on the Internet that if I used gcc-4.2 or similarly ancient instead of gcc-4.9.2 the errors might disappear. I gave up.
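One of the three errors has a simple cause: cudaDeviceSynchronize was only introduced with CUDA 4.0, and toolkits before that used cudaThreadSynchronize (deprecated since). A sketch of a guard based on the CUDART_VERSION macro is below; `syncDevice` and the `noop` kernel are names invented for the example. It does nothing about the `__builtin_bswap` errors, which come from the mismatch between the old toolkit and the newer host compiler headers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: picks the synchronize call that exists in the
// toolkit being used (CUDART_VERSION is 4000 for CUDA 4.0).
inline cudaError_t syncDevice()
{
#if CUDART_VERSION >= 4000
    return cudaDeviceSynchronize();
#else
    return cudaThreadSynchronize();
#endif
}

// Empty kernel, just so there is something to wait for.
__global__ void noop() {}

int main()
{
    noop<<<1, 1>>>();
    cudaError_t err = syncDevice();
    printf("sync: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```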
gpuocelot
The answer by Stringer has a link to a very old gpuocelot project website, so at first I thought the project was abandoned in 2012 or so. Actually, it was abandoned a few years later.
Here are some up to date websites:
GitHub;
Project's website;
Installation guide.
I tried to install gpuocelot following the guide. I had several errors during installation though and I gave up again. gpuocelot is no longer supported and depends on a set of very specific versions of libraries and software.
You might try to follow this tutorial from July, 2015 but I don't guarantee it'll work. I've not tested it.
MCUDA
The MCUDA translation framework is a linux-based tool designed to
effectively compile the CUDA programming model to a CPU architecture.
It might be useful. Here is a link to the website.
CUDA Waste
It is an emulator to use on Windows 7 and 8. I've not tried it though. It doesn't seem to be developed anymore (the last commit is dated on Jul 4, 2013).
Here's the link to the project's website: https://code.google.com/archive/p/cuda-waste/
CU2CL
Last update: 12.03.2017
As dashesy pointed out in the comments, CU2CL seems to be an interesting project. It appears to be able to translate CUDA code to OpenCL code, so if your GPU can run OpenCL code, CU2CL might be of interest to you.
Links:
CU2CL homepage
CU2CL GitHub repository
This response may be too late, but it's worth noting anyway. GPU Ocelot (of which I am one of the core contributors) can be compiled without CUDA device drivers (libcuda.so) installed if you wish to use the Emulator or LLVM backends. I've demonstrated the emulator on systems without NVIDIA GPUs.
The emulator attempts to faithfully implement the PTX 1.4 and PTX 2.1 specifications which may include features older GPUs do not support. The LLVM translator strives for correct and efficient translation from PTX to x86 that will hopefully make CUDA an effective way of programming multicore CPUs as well as GPUs. -deviceemu has been a deprecated feature of CUDA for quite some time, but the LLVM translator has always been faster.
Additionally, several correctness checkers are built into the emulator to verify: aligned memory accesses, accesses to shared memory are properly synchronized, and global memory dereferencing accesses allocated regions of memory. We have also implemented a command-line interactive debugger inspired largely by gdb to single-step through CUDA kernels, set breakpoints and watchpoints, etc... These tools were specifically developed to expedite the debugging of CUDA programs; you may find them useful.
Sorry about the Linux-only aspect. We've started a Windows branch (as well as a Mac OS X port) but the engineering burden is already large enough to stress our research pursuits. If anyone has any time and interest, they may wish to help us provide support for Windows!
Hope this helps.
[1]: GPU Ocelot - https://code.google.com/archive/p/gpuocelot/
[2]: Ocelot Interactive Debugger - http://forums.nvidia.com/index.php?showtopic=174820
You can also check out the gpuocelot project, which is a true emulator in the sense that PTX (the bytecode that CUDA code is compiled to) is emulated.
There's also an LLVM translator; it would be interesting to test whether it's faster than using -deviceemu.
The CUDA toolkit had one built in until the CUDA 3.0 release cycle. If you use one of these very old versions of CUDA, make sure to use -deviceemu when compiling with nvcc.
https://github.com/hughperkins/cuda-on-cl lets you run NVIDIA® CUDA™ programs on OpenCL 1.2 GPUs (full disclosure: I'm the author)
Be careful when you're programming using -deviceemu as there are operations that nvcc will accept while in emulation mode but not when actually running on a GPU. This is mostly found with device-host interaction.
And as you mentioned, prepare for some slow execution.
GPGPU-Sim is a GPU simulator that can run CUDA programs without using a GPU.
I created a docker image with GPGPU-Sim installed for myself in case that is helpful.