CPU or GPU bound? profiling OpenGL application - language-agnostic

I've got an OpenGL application which I'm afraid is GPU bound.
How can I be sure that's the case?
And if it is, how can I profile the code run by the GPU?

I would also check it with AMD GPU PerfStudio.
It will analyse your GPU and CPU usage and show relative load values.

If you are using Windows, Linux or Mac, (well, a computer!) give a try to gDEBugger.

If your OpenGL thread should uses less than one core you are not CPU bound. If you're running at 60Hz you're probably limited by vsync.

gDEBugger no longer supports OSX..
For OSX users (and perhaps other OS's) the Intel Graphics Performance Analyser might be worth a look
https://software.intel.com/en-us/gpa
Info here

Related

CUDA driver version is insufficient for runtime version [duplicate]

I have a very simple Toshiba Laptop with i3 processor. Also, I do not have any expensive graphics card. In the display settings, I see Intel(HD) Graphics as display adapter. I am planning to learn some cuda programming. But, I am not sure, if I can do that on my laptop as it does not have any nvidia's cuda enabled GPU.
In fact, I doubt, if I even have a GPU o_o
So, I would appreciate if someone can tell me if I can do CUDA programming with the current configuration and if possible also let me know what does Intel(HD) Graphics mean?
At the present time, Intel graphics chips do not support CUDA. It is possible that, in the nearest future, these chips will support OpenCL (which is a standard that is very similar to CUDA), but this is not guaranteed and their current drivers do not support OpenCL either. (There is an Intel OpenCL SDK available, but, at the present time, it does not give you access to the GPU.)
Newest Intel processors (Sandy Bridge) have a GPU integrated into the CPU core. Your processor may be a previous-generation version, in which case "Intel(HD) graphics" is an independent chip.
Portland group have a commercial product called CUDA x86, it is hybrid compiler which creates CUDA C/ C++ code which can either run on GPU or use SIMD on CPU, this is done fully automated without any intervention for the developer. Hope this helps.
Link: http://www.pgroup.com/products/pgiworkstation.htm
If you're interested in learning a language which supports massive parallelism better go for OpenCL since you don't have an NVIDIA GPU. You can run OpenCL on Intel CPUs, but at best you can learn to program SIMDs.
Optimization on CPU and GPU are different. I really don't think you can use Intel card for GPGPU.
Intel HD Graphics is usually the on-CPU graphics chip in newer Core i3/i5/i7 processors.
As far as I know it doesn't support CUDA (which is a proprietary NVidia technology), but OpenCL is supported by NVidia, ATi and Intel.
in 2020 ZLUDA was created which provides CUDA API for Intel GPUs. It is not production ready yet though.

How to decide whether a GPU card is being used?

In CUDA, is there any runtime API that will tell whether a GPU device is being used or not? And whether the user is from video display or a GUGPU application? And what is the GPU occupancy?
On linux at least, you can use the program nvidia-smi to see the current memory use, and if any compute processes are running. Think though that the status about compute processes is only supported on a selected number of graphics cards, e.g. tesla.
While it doesn't show exactly what is using it, MSI Afterburner on Windows will show you the core usage, memory usage, fan speed, and temperature of GPU's in a system (NV or otherwise.)

GPU affinity (GPU core affinity)

Can any one tell me why there are no GPU affinity (I mean execution units affinity) ? I know in Opencl specification 1.2 we have something called device fission, but in the best of my knowledge this is juste available for CPU.
any one have more informations about this?
Thanks
This is currently a very CPU-related extension. I believe that some GPUS will support this soon, and there would be a couple already with the extension enabled. If you read the page below you will see some CPU features, like whenever NUMA is mentioned.
http://www.khronos.org/registry/cl/extensions/ext/cl_ext_device_fission.txt

Can't run CUDA nor OpenCL on GeForce 540M

I have problem running samples provided by Nvidia in their GPU Computing SDK (there's a library of compiled sample codes).
For cuda I get message "No CUDA-capable device is detected", for OpenCL there's error from function that should find OpenCL capable units.
I have installed all three parts from Nvidia to develop with OpenCL - devdriver for win7 64bit v.301.27, cuda toolkit 4.2.9 and gpu computing sdk 4.2.9.
I think this might have to do with Optimus technology that reroutes output from Nvidia GPU to Intel to render things (this notebook has also Intel 3000HD accelerator), but in Nvidia control pannel I set to use high performance Nvidia GPU, set power profile to prefer maximum performance and for PhysX I changed from automatic selection to Nvidia processor again. Nothing has changed though, those samples won't run (not even those targeted for GF8000 cards).
I would like to play somewhat with OpenCL and see what it is capable of but without ability to test things it's useless. I have found some info about this on forums, but it was mostly about linux users where you need Bumblebee to access Nvidia GPU. There's no such problem on Windows however, drivers are better and so you can access it without dark magic (or I thought so until I found this problem).
My laptop has a GeForce 540M as well, in an Optimus configuration since my Sandy Bridge CPU also has Intel's integrated graphics. To run CUDA codes, I have to:
Install NVIDIA Driver
Go to NVIDIA Control Panel
Click 3D Settings -> Manage 3D Settings -> Global Settings
In the Preferred Graphics processor drop down, select "High-performance NVIDIA processor"
Apply the settings
Note that the instructions above apply the settings for all applications, so you don't have to worry about CUDA errors any more. But it will drain more battery.
Here is a video recap as well. Good luck!
Ok this has proven to be totally crazy solution. I was thinking if something isn't hooking between the hardware and application and only thing that came to my mind was AV software. I'm using Comodo with sandbox and Defense+ on and after turning them off I could run all those samples. What's more, only Defense+ needs to be turned off.
Now I just think about how much apps could have been blocked from accessing that GPU..
That's most likely because of the architecture of Optimus. So I'd suggest you to read
NVIDIA CUDA Developer Guide for NVIDIA Optimus Platforms, especially the section "Querying for a CUDA Device" which addresses this issue, I believe.

cuda sdk example simpleStreams in SDK 4.1 not working

I upgraded CUDA GPU computing SDK and CUDA computing toolkit to 4.1. I was testing simpleStreams programs, but consistently it is taking more time that non-streamed execution. my device is with compute capability 2.1 and i'm using VS2008,windows OS.
This sample constantly has issues. If you tweak the sample to have equal duration for the kernel and memory copy the overlap will improve. Normally breadth first submission is better for concurrency; however, on WDDM OS this sample will usually have better overlap if you issue the memory copy right after kernel launch.
I noticed this as well. I thought it was just me but I didn't notice any improvement and tried searching the forums but didn't find anyone else with the issue.
I also ran the source code in the Cuda By Example book (which is really helpful and I recommend you pick it up if you're serious about GPU programming).
Chapter 10 examples has the progression of examples showing how streams should be used.
http://developer.nvidia.com/content/cuda-example-introduction-general-purpose-gpu-programming-0
But comparing the,
1. non-streamed version(which is basically the single stream version)
2. the streamed (incorrectly queued asyncmemcpy and kernel launch)
3. the streamed (correctly queued asyncmemcpy and kernel launch)
I find no benefit in using cuda streams. It might be a win7 issue as I found some sources online discussing that win vista didn't support the cuda streams correctly.
Let me know what you find with the example I linked. My setup is: Win7 64bit Pro, Cuda 4.1, Dual Geforce GTX460 cards, 8GB RAM.
I'm pretty new to Cuda so may not be able to help but generally its very hard to help without you posting any code. If posting is not possible then I suggest you take a look at Nvidia's visual profiler. Its cross platform and can show you were your bottlenecks are.