How to read power consumption using CUPTI? - cuda

I know that there's a way to read the power consumption of a GPU using CUPTI. Does anyone know of a method I can use, and where I can find examples?

What you are probably looking for is the CUPTI ActivityEnvironment data (the CUpti_ActivityEnvironment activity record).
As far as I know, this activity kind is new in CUDA 5.5, so you may need to make sure you are using CUDA 5.5 or later to access these parameters.
Collecting this data is done through the CUPTI Activity API.
An example of the usage of this API is given in the activity_trace_async example that is included in the CUPTI toolkit.
On a standard linux install, this sample would be located at /usr/local/cuda/extras/CUPTI/sample/activity_trace_async
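As a rough sketch of what that sample does, the outline below registers the activity buffer callbacks, enables CUPTI_ACTIVITY_KIND_ENVIRONMENT, and prints any power records it receives. The exact record fields (environmentKind, data.power.power, reported in milliwatts as far as I remember) should be checked against cupti_activity.h for your toolkit version; error checking is omitted.

```cuda
#include <cupti.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE (32 * 1024)

// CUPTI asks us for a buffer to fill with activity records.
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords) {
  *size = BUF_SIZE;
  *buffer = (uint8_t *)malloc(BUF_SIZE);
  *maxNumRecords = 0;   /* 0 = put as many records in the buffer as fit */
}

// CUPTI hands a filled buffer back; walk the records and pick out power samples.
static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size,
                                     size_t validSize) {
  CUpti_Activity *record = NULL;
  while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS) {
    if (record->kind == CUPTI_ACTIVITY_KIND_ENVIRONMENT) {
      CUpti_ActivityEnvironment *env = (CUpti_ActivityEnvironment *)record;
      if (env->environmentKind == CUPTI_ACTIVITY_ENVIRONMENT_POWER)
        printf("device %u power: %u mW\n", env->deviceId, env->data.power.power);
    }
  }
  free(buffer);
}

int main(void) {
  cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
  cuptiActivityEnable(CUPTI_ACTIVITY_KIND_ENVIRONMENT);

  /* ... run your CUDA workload here; as I understand it, environment
     records are sampled periodically while a CUDA context is active ... */

  cuptiActivityFlushAll(0);   /* deliver any buffered records */
  return 0;
}
```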

The NVIDIA Management Library (NVML) is an API that allows you to read the power consumption of the GPU, among other information.
It has C and Python bindings.
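A minimal sketch of the NVML route: nvmlDeviceGetPowerUsage() reports the current board draw in milliwatts. Link against the NVML library (typically -lnvidia-ml).

```cuda
#include <nvml.h>
#include <stdio.h>

int main(void) {
  nvmlDevice_t dev;
  unsigned int power_mW = 0;

  if (nvmlInit() != NVML_SUCCESS) return 1;

  // Device 0; NVML device indices generally follow PCI bus order.
  if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS &&
      nvmlDeviceGetPowerUsage(dev, &power_mW) == NVML_SUCCESS)
    printf("GPU 0 power draw: %.3f W\n", power_mW / 1000.0);

  nvmlShutdown();
  return 0;
}
```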

Related

cuda runtime api and dynamic kernel definition

Using the driver API precludes the usage of the runtime API in the same application ([1]). Unfortunately cublas, cufft, etc. are all based on the runtime API. If one wants dynamic kernel definition, as in cuModuleLoad, and cublas at the same time, what are the options? I have these in mind, but maybe there are more:
A. Wait for compute capability 3.5 that's rumored to support peaceful coexistence of driver and runtime apis in the same application.
B. Compile the kernels to an .so file and dlopen it. Do they get unloaded on dlclose?
C. Attempt to use cuModuleLoad from the driver api, but everything else from the runtime api. No idea if there is any hope for this.
I'm not holding my breath, because jcuda or pycuda are in pretty much the same bind and they probably would have figured it out already.
[1] CUDA Driver API vs. CUDA runtime
To summarize, you are tilting at windmills here. By relying on extremely out of date information, you seem to have concluded that runtime and driver API interoperability isn't supported in CUDA, when, in fact, it has been since the CUDA 3.0 beta was released in 2009. Quoting from the release notes of that version:
The CUDA Toolkit 3.0 Beta is now available.
Highlights for this release include:
CUDA Driver / Runtime Buffer Interoperability, which allows applications using the CUDA Driver API to also use libraries implemented using the CUDA C Runtime.
The CUDA documentation (see the "Interoperability between Runtime and Driver APIs" section of the programming guide) succinctly describes how the driver and runtime APIs interact.
To concretely answer your main question:
If one wants dynamic kernel definition as in cuModuleLoad and cublas
at the same time, what are the options?
The basic approach goes something like this:
1. Use the driver API to establish a context on the device as you would normally do.
2. Call the runtime API routine cudaSetDevice(). The runtime API will automagically bind to the existing driver API context. Note that device enumeration is identical and common between both APIs, so if you establish a context on a given device number with the driver API, the same number will select the same GPU in the runtime API.
3. You are now free to use any CUDA runtime API call or any library built on the CUDA runtime API. Behaviour is the same as if you relied on the runtime API's "lazy" context establishment.
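A minimal sketch of that recipe, assuming a hypothetical kernels.ptx module and myKernel entry point (error checking omitted for brevity):

```cuda
#include <cuda.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
  CUdevice dev;
  CUcontext ctx;
  CUmodule mod;
  CUfunction fn;

  cuInit(0);
  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx, 0, dev);                 // driver API context on device 0
  cuModuleLoad(&mod, "kernels.ptx");         // dynamic kernel definition
  cuModuleGetFunction(&fn, mod, "myKernel");

  cudaSetDevice(0);                          // runtime binds to the existing context

  cublasHandle_t handle;                     // runtime-based library now works too
  cublasCreate(&handle);

  /* ... cuLaunchKernel(fn, ...) and cublas calls can be freely mixed here ... */

  cublasDestroy(handle);
  cuModuleUnload(mod);
  cuCtxDestroy(ctx);
  return 0;
}
```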

How does CUDA profiling work "under the hood"?

Can anyone explain how the profiler works? How does it measure time, instructions, etc., given only the executable? I know how to run a profiler; I want to understand how it works behind the scenes.
I want to develop a profiler of my own, so I need to understand how existing profilers work.
I am provided with an executable and need to develop a profiler to profile it.
You can start by reading the CUPTI Documentation.
The CUDA Profiling Tools Interface (CUPTI) enables the creation of
profiling and tracing tools that target CUDA applications. CUPTI
provides four APIs: the Activity API, the Callback API, the Event API,
and the Metric API. Using these APIs, you can develop profiling tools
that give insight into the CPU and GPU behavior of CUDA applications.
CUPTI is delivered as a dynamic library on all platforms supported by
CUDA.
The CUPTI Metric API in particular is what you should read. Also, always be aware of which CUDA version you are targeting, because parts of the API differ from one version to the next.
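To give a flavour of what such a tool looks like, here is a minimal sketch using the CUPTI Callback API to timestamp every CUDA runtime API call made by the application. A real profiler typically builds this into a shared library that gets loaded into the target executable (for example via LD_PRELOAD or CUPTI's injection mechanism), but the hooking itself is roughly this:

```cuda
#include <cupti.h>
#include <stdio.h>

// Called by CUPTI on entry to and exit from every runtime API function.
static void CUPTIAPI apiCallback(void *userdata, CUpti_CallbackDomain domain,
                                 CUpti_CallbackId cbid,
                                 const CUpti_CallbackData *cbInfo) {
  uint64_t ts;
  cuptiGetTimestamp(&ts);   // nanosecond timestamp from CUPTI's clock
  if (cbInfo->callbackSite == CUPTI_API_ENTER)
    printf("enter %s at %llu\n", cbInfo->functionName, (unsigned long long)ts);
  else if (cbInfo->callbackSite == CUPTI_API_EXIT)
    printf("exit  %s at %llu\n", cbInfo->functionName, (unsigned long long)ts);
}

int main(void) {
  CUpti_SubscriberHandle subscriber;
  cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)apiCallback, NULL);
  cuptiEnableDomain(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API);

  /* ... run the CUDA code you want to observe ... */

  cuptiUnsubscribe(subscriber);
  return 0;
}
```

The Activity API adds GPU-side timestamps (kernel start/end, memcpy durations) delivered asynchronously in buffers, and the Event/Metric APIs read hardware counters; a full profiler combines all of these.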

Is there a way to emulate multiple GPUs with one?

I am designing a multi-GPU CUDA code, but I don't yet have the machinery to actually develop it. So, until I do:
Do you know if there is some way to emulate a multi-GPU environment using just one GPU?
I suppose that such a thing, if it exists, would be very limited, but it would allow me to test my ideas until I get the hardware I want.
Thanks!
Something close can be approximated using the CUDA Driver API (cuCtxCreate, cuCtxSetCurrent). See CUDA C Programming Guide Appendix G.4 Interoperability between Runtime and Driver API. Before calling any cuda* functions use cuCtxCreate to create two contexts on the device. Use cuCtxSetCurrent in place of cudaSetDevice.
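A rough sketch of that idea (error checking omitted): create two contexts on the one physical device and use cuCtxSetCurrent where multi-GPU code would call cudaSetDevice. This lets you exercise per-device bookkeeping and code paths, but both "devices" share the same memory and SMs, so it says nothing about real multi-GPU performance.

```cuda
#include <cuda.h>
#include <cuda_runtime.h>

int main(void) {
  CUdevice dev;
  CUcontext ctx[2];

  cuInit(0);
  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx[0], 0, dev);   // "GPU 0"
  cuCtxCreate(&ctx[1], 0, dev);   // "GPU 1" (same physical device)

  for (int i = 0; i < 2; ++i) {
    cuCtxSetCurrent(ctx[i]);      // stand-in for cudaSetDevice(i)
    // Allocations, streams and kernels issued here belong to context i.
    void *d_buf = NULL;
    cudaMalloc(&d_buf, 1 << 20);
    cudaFree(d_buf);
  }

  cuCtxDestroy(ctx[1]);
  cuCtxDestroy(ctx[0]);
  return 0;
}
```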

a more extensive library of math functions for CUDA kernel? particularly incomplete beta

I know about the cuda math library, and the cuda 4.1 toolkit version has good stuff like the gamma and bessel functions. I need the regularized incomplete beta function (a.k.a the cumulative distribution function for the beta probability distribution). Is this available in any open source library?
I don't know enough about statistical functions, but it looks like the function incbet() in the Cephes library may be a reasonable starting point. The Cephes library has an excellent reputation, and sources are readily available via Netlib. See http://www.netlib.org/cephes
Inside the archive cprob.tgz there is a file incbet.c which contains the source for incbet().
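For reference, the Cephes routine has the signature double incbet(double a, double b, double x) and returns the regularized incomplete beta function, i.e. the Beta(a, b) CDF evaluated at x. A quick host-side sanity check (compile incbet.c and its dependencies from the archive alongside this):

```cuda
#include <stdio.h>

extern "C" double incbet(double a, double b, double x);  // Cephes prototype

int main(void) {
  // The Beta(2, 2) distribution is symmetric, so its CDF at 0.5 is exactly 0.5.
  printf("I_0.5(2, 2) = %f\n", incbet(2.0, 2.0, 0.5));
  return 0;
}
```

Porting it to the device would mean marking incbet() and the helpers it calls as __device__ functions; how well that works on your hardware is something you would need to verify against the host result.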

CUDA - Implementing Device Hash Map?

Does anyone have any experience implementing a hash map on a CUDA Device? Specifically, I'm wondering how one might go about allocating memory on the Device and copying the result back to the Host, or whether there are any useful libraries that can facilitate this task.
It seems like I would need to know the maximum size of the hash map a priori in order to allocate Device memory. All my previous CUDA endeavors have used arrays and memcpys and therefore been fairly straightforward.
Any insight into this problem is appreciated. Thanks.
There is a GPU Hash Table implementation presented in "CUDA by example", from Jason Sanders and Edward Kandrot.
Fortunately, you can get information on this book and download the examples source code freely on this page:
http://developer.nvidia.com/object/cuda-by-example.html
In this implementation, the table is pre-allocated on CPU and safe multithreaded access is ensured by a lock function based upon the atomic function atomicCAS (Compare And Swap).
Moreover, newer hardware generations (compute capability 2.0 and up) combined with CUDA >= 4.0 are supposed to support the new/delete operators directly on the GPU ( http://developer.nvidia.com/object/cuda_4_0_RC_downloads.html?utm_source=http://forums.nvidia.com&utm_medium=http://forums.nvidia.com&utm_term=Developers&utm_content=Developers&utm_campaign=CUDA4 ), which could serve your implementation. I haven't tested these features yet.
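To illustrate the idea (this is a sketch in the spirit of the book's approach, not its actual code): the table and an entry pool are allocated up front, and each bucket is protected by a spinlock built on atomicCAS. Note that naive spinlocks can deadlock within a warp on pre-Volta hardware, which is why the book serializes lock attempts across the threads of a warp.

```cuda
#include <cuda_runtime.h>

struct Entry {
  unsigned int key;
  unsigned int value;
  Entry *next;            // chaining; nodes come from a pre-allocated pool
};

struct Table {
  Entry **buckets;        // device array of bucket heads, initialised to NULL
  int    *locks;          // one int per bucket: 0 = free, 1 = held
  size_t  numBuckets;
};

// Insert a pre-filled entry into its bucket under a per-bucket lock.
__device__ void bucketInsert(Table t, Entry *e) {
  size_t b = e->key % t.numBuckets;
  bool done = false;
  while (!done) {
    // Only the thread that flips the lock from 0 to 1 enters the critical section.
    if (atomicCAS(&t.locks[b], 0, 1) == 0) {
      e->next = t.buckets[b];        // push onto the bucket's chain
      t.buckets[b] = e;
      __threadfence();               // make the link visible before unlocking
      atomicExch(&t.locks[b], 0);    // release the lock
      done = true;
    }
  }
}
```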
cuCollections is a relatively new open-source library started by NVIDIA engineers aiming at implementing efficient containers on the GPU.
cuCollections (cuco) is an open-source, header-only library of GPU-accelerated, concurrent data structures.
Similar to how Thrust and CUB provide STL-like, GPU accelerated algorithms and primitives, cuCollections provides STL-like concurrent data structures. cuCollections is not a one-to-one, drop-in replacement for STL data structures like std::unordered_map. Instead, it provides functionally similar data structures tailored for efficient use with GPUs.
cuCollections is still under heavy development. Users should expect breaking changes and refactoring to be common.
At the moment it provides a fixed-size hash table, cuco::static_map, and one that can grow, cuco::dynamic_map.
I recall someone developed a straightforward hash map implementation on top of thrust. There is some code for it here, although whether it works with current thrust releases is something I don't know. It might at least give you some ideas.
AFAIK, the hash table given in "Cuda by Example" does not perform too well.
Currently, I believe, the fastest hash table on CUDA is given in Dan Alcantara's PhD dissertation. Look at chapter 6.
BTW, warpcore is a framework for creating high-throughput, purpose-built hashing data structures on CUDA accelerators ("hashing at the speed of light", as the project describes itself). You can find it here:
https://github.com/sleeepyjack/warpcore