CUDA bandwidthTest to get attainable peak

I want to know how good my CUDA kernels are in terms of memory bandwidth utilisation. I run them on a Tesla K40c with ECC on. Is the result given by the bandwidthTest utility a good approximation to the attainable peak? If not, how would one go about writing a similar test to find the peak bandwidth?
I mean device memory bandwidth.

The source code for bandwidthTest is included with the CUDA SDK, so you can review it directly. The bandwidthTest example measures the transfer time between the device and the host, the host and the device, and the device and the device (copying memory on the card).
This is a real execution of a memory transfer but it takes advantage of several things:
Medium to large memory transfers. If you are doing tons of tiny transfers you will pay a high penalty in overhead, and this will reduce your transfer rates.
Pinned memory. The bandwidthTest uses pinned memory so that the transfers can be as fast as possible. You may or may not have this option.
Sustained read/write of memory. As I recall, the bandwidthTest does a number of transfers that can be queued up. Any startup delays or anomalies will be smoothed out, and it has the advantage of stringing lots of transfers together in the queue. You may instead have to do transfer-work-work-transfer, so you may end up with additional delays. Improvements in memory transfers from CUDA 5 may help mitigate this.
Doing real work with a kernel while performing memory transfers will likely result in a reduction of performance. However, you can reference the bandwidth test code and use it as a guide for improving your transfers. Consider pinned memory, asynchronous transfers, or the newer shared memory methods that do not require explicit transfer of data. Also keep in mind that bandwidthTest is only counting bulk transfers around memory and is not really taking a measure of things like shared memory.
The final performance will depend greatly on the kernel and the count and size of the memory transfers you are performing.
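
Since the question is really about device memory bandwidth, one way to get a reference number is to time a simple copy kernel over large device buffers, which is roughly what the device-to-device portion of bandwidthTest does. Below is a minimal sketch (not bandwidthTest itself); the buffer size, launch configuration and repetition count are arbitrary choices. On a K40c you can compare the result against the 288 GB/s datasheet figure, keeping in mind that ECC overhead will push the attainable number below that.

```cuda
// Minimal sketch: estimate attainable device-to-device bandwidth with a copy kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copyKernel(const float *in, float *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride)          // grid-stride loop: one read and one write per element
        out[i] = in[i];
}

int main()
{
    const size_t n = 64 * 1024 * 1024;  // 256 MB per buffer (arbitrary, but large)
    const size_t bytes = n * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    copyKernel<<<1024, 256>>>(d_in, d_out, n);      // warm-up launch
    const int reps = 20;
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        copyKernel<<<1024, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each repetition reads 'bytes' and writes 'bytes', i.e. 2*bytes of traffic.
    double gbps = (2.0 * bytes * reps) / (ms * 1.0e6);
    printf("Effective device memory bandwidth: %.1f GB/s\n", gbps);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```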

Related

Why is shared memory faster than global memory?

Is that difference in speed due to the technology each is built with? (I read that shared memory is a scratchpad memory, mainly SRAM, while global memory is typically DRAM.)
And if both were made with the same technology, would there still be a performance difference simply because shared memory is on-chip and global memory is off-chip, due to extra instructions (load instructions) or extra hardware circuitry needed for global memory to bring its data into the processor?
At least two reasons are the ones you've already pointed out. There is a:
Location difference - shared memory is on-chip; global memory (at least, ordinary global memory accesses that do not hit in one of the caches) is off-chip. Memory is generally clocked at some fixed frequency, and the maximum frequency depends on how fast the circuit can be clocked. Long transmission lines, the buffers that drive signals from off-chip to on-chip or vice versa, and many other circuit effects slow down the maximum rate at which a particular circuit can be clocked. Therefore shared memory is considerably advantaged by being on-chip. The caches (L1, L2, read-only, constant cache, texture cache, etc.) all benefit from the same principle.
Technology difference. An SRAM cell (e.g. shared memory) can generally be clocked faster than a DRAM cell (e.g. off-chip global memory), and SRAM is more amenable to fast random access. DRAM has a more complicated access sequence that comes into play when a cell is accessed, and it is also burdened by mechanisms such as refresh that may get in the way of continuous fast access. However, I would suggest that the technology difference is less of an issue. Another technology-related point is that SRAM arrays can be placed at higher density on the logic processes that modern processors use, whereas the highest-density DRAM arrays use a semiconductor process that differs substantially from the one used for general logic inside a processor.
The processor instructions required wouldn't be a meaningful differentiator between shared memory and global memory access times.
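
To make the location point concrete, here is a hypothetical kernel in which each block stages a tile of global (off-chip) memory into shared (on-chip) memory once and then reuses it several times; the boundary handling is deliberately simplified for illustration.

```cuda
// Sketch: stage a tile into on-chip shared memory, then reuse it for neighbour reads.
__global__ void smoothWithShared(const float *in, float *out, int n)
{
    extern __shared__ float tile[];     // per-block, on-chip scratchpad
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x;

    // Every thread participates in the load (index clamped) so __syncthreads() is safe.
    tile[lid] = in[min(gid, n - 1)];
    __syncthreads();

    if (gid >= n) return;

    // These reads hit shared memory; edge threads simply reuse their own value.
    float left  = (lid > 0)              ? tile[lid - 1] : tile[lid];
    float right = (lid < blockDim.x - 1) ? tile[lid + 1] : tile[lid];
    out[gid] = 0.25f * left + 0.5f * tile[lid] + 0.25f * right;
}

// Launch with dynamic shared memory sized to the block, e.g.:
//   smoothWithShared<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);
```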

Memory-compute overlap affects kernel duration?

Profiling my solution, I see dependencies between memory transfer and kernel computation. For a 60 MB data transfer, I have 2 ms of overhead for each overlapped kernel computation.
I'm comparing my basic solution with the enhanced (overlapped) one to see the differences. They process the same amount of data with the same kernels (which do not depend on the data values).
So am I wrong or missing something somewhere, or does the overlap really use a "significant" part of the GPU ?
I think the overlapping process has to order the data transfers and control their issue, plus perhaps some context switching, but compared to 2 ms that seems like too much?
When you overlap data copy with compute, both operations are competing for GPU memory bandwidth. If your kernel is memory-bandwidth bound, then it's possible that overlapping the operations will cause both the compute and the memory copy to run longer than if either were running alone.
60 megabytes of data on a PCIe Gen2 link will take roughly 10 ms if there is no contention. An extra 2 ms when there is contention doesn't sound out of range to me, but it will depend to a significant degree on which GPU you are using; different GPUs have different GPU memory bandwidth numbers. It's also not clear whether the "overhead" you're referring to is an extension of the length of the transfer, the kernel compute, or the overall program.
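
For reference, here is a rough sketch of the kind of overlap under discussion, using two streams: a host-to-device copy is issued in one stream while a kernel works on already-resident data in another. The kernel, buffer names and sizes are placeholders, and the host buffer must be pinned for the copy to actually overlap.

```cuda
// Sketch of overlapping a host-to-device copy with compute using two streams.
#include <cuda_runtime.h>

__global__ void process(float *data, int n)     // stand-in for real work
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

void runOverlapped(const float *h_next, float *d_next, float *d_current, int n)
{
    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);

    // Copy the next chunk while the current chunk is processed.
    cudaMemcpyAsync(d_next, h_next, n * sizeof(float),
                    cudaMemcpyHostToDevice, copyStream);
    process<<<(n + 255) / 256, 256, 0, computeStream>>>(d_current, n);

    // Both operations share the GPU's memory system, so each may run a little
    // slower than it would alone, even though total wall-clock time drops.
    cudaStreamSynchronize(copyStream);
    cudaStreamSynchronize(computeStream);

    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(computeStream);
}
```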

Which is faster in CUDA, global memory or host memory?

I read in CUDA by Example, chapter 9.4, that when atomic operations on GPU global memory are used improperly, performance of the program may be worse than when it is executed purely on the CPU, because of memory access contention.
In the worst case, the program executed on the GPU is highly serialized and no threads execute in parallel, which is just the way a single-threaded program runs on the CPU. So the key problem is how fast the program accesses the memory.
Considering the example in the book I mentioned, it seems that CPU accesses host memory faster than GPU accesses global memory on device.
Is that so? Or are there any other factors that influence the performance of the program under the circumstance I just described?
I think you're misreading things slightly. Yes, it's saying that single-threaded code on the GPU is typically slower than on the CPU. But that's not because of raw memory bandwidth - it's because a CPU is much more powerful than a GPU when running a single thread. For example, a CPU has pipelining and sophisticated branch prediction to pre-load data from memory, while a GPU is designed to switch contexts to another thread when waiting for data. The CPU is tuned for the single-threaded case while the GPU is tuned for many threads.
If you want to know which memory is fastest, look at the technical specs for your card and motherboard, but that's not really what the book is talking about.
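
To make the book's contention point concrete, here is an illustrative sketch (not the book's code): a counter updated with one global atomic per thread, versus a per-block partial count in shared memory that issues only one global atomic per block.

```cuda
// Illustrative: heavy contention on a single global counter vs. per-block reduction.
__global__ void countNaive(const int *flags, int n, int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        atomicAdd(counter, 1);          // every thread contends on the same address
}

__global__ void countWithShared(const int *flags, int n, int *counter)
{
    __shared__ int blockCount;
    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        atomicAdd(&blockCount, 1);      // contention confined to one block, on-chip
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(counter, blockCount); // one global atomic per block
}
```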

GPU reads from CPU or CPU writes to the GPU?

I am a beginner in parallel programming. I have a query which might seem silly, but I didn't get a definitive answer when I googled it.
In GPU computing there is a device, i.e. the GPU, and the host, i.e. the CPU. I wrote a simple hello world program which allocates some memory on the GPU, passes two parameters (say src[] and dest[]) to the kernel, copies the src string, i.e. "Hello world", to the dest string, and gets the dest string back from the GPU to the host.
Is the string src read by the GPU, or does the CPU write it to the GPU? Also, when we get the string back from the GPU, is the GPU writing to the CPU or is the CPU reading from the GPU?
In transferring the data back and forth there can be four possibilities:
1. CPU to GPU
   - CPU writes to the GPU
   - GPU reads from the CPU
2. GPU to CPU
   - GPU writes to the CPU
   - CPU reads from the GPU
Can someone please explain which of these are possible and which are not?
In earlier versions of CUDA and corresponding hardware models, the GPU was more strictly a coprocessor owned by the CPU; the CPU wrote information to the GPU, and read the information back when the GPU was ready. At the lower level, this meant that really all four things were happening: the CPU wrote data to PCIe, the GPU read data from PCIe, the GPU then wrote data to PCIe, and the CPU read back the result. But transactions were initiated by the CPU.
More recently (CUDA 3? 4? maybe even beginning in 2?), some of these details are hidden from the application level, so that, effectively, GPU code can cause transfers to be initiated in much the same way as the CPU can. Consider unified virtual addressing, whereby programmers can access a unified virtual address space for CPU and GPU memory. When the GPU requests memory in the CPU space, this must initiate a transfer from the CPU, essentially reading from the CPU. The ability to put data onto the GPU from the CPU side is also retained. Basically, all ways are possible now, at the top level (at low levels, it's largely the same sort of protocol as always: both read from and write to the PCIe bus, but now, GPUs can initiate transactions as well).
Actually none of these.
Your CPU code initiates the copy of the data, but the data itself is transferred by the memory controller to the memory of the GPU over whatever bus you have on your system. Meanwhile, the CPU can process other data.
Similarly, when the GPU has finished running the kernels you launched, your CPU code initiates the copy of data, but meanwhile both GPU and CPU can handle other data or run other code.
The copies are called asynchronous or non-blocking. You can optionally do blocking copies, in which the CPU waits for the copy to be completed.
When launching asynchronous tasks, you usually register an "event", which is some kind of flag that you can check later on, to see if the task is finished or not.
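
A small sketch of that pattern in CUDA terms: record an event after an asynchronous copy, then poll it from the host while doing other work. The host buffer must be pinned for the copy to be truly non-blocking.

```cuda
// Sketch: non-blocking device-to-host copy with an event as the completion flag.
#include <cuda_runtime.h>

void copyBackNonBlocking(float *h_data, const float *d_data, size_t bytes)
{
    cudaStream_t stream;
    cudaEvent_t done;
    cudaStreamCreate(&stream);
    cudaEventCreate(&done);

    cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(done, stream);      // the "flag" marking the end of the copy

    // The CPU is free to do other work here; poll the event to see if the copy finished.
    while (cudaEventQuery(done) == cudaErrorNotReady) {
        /* do other host-side work */
    }

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
}
```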
In OpenCL the host (CPU) exclusively controls all transfers of data between the CPU and the GPU. The host transfers data to the GPU using buffers, and transfers (reads) it back from the GPU using buffers. For some systems and devices, the transfer isn't physically copying bytes, as the host and GPU use the same physical memory. This is called zero copy.
I just found out in this forum http://devgurus.amd.com/thread/129897 that using CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR in clCreateBuffer allocates memory on the host and that it won't be copied to the device.
There may be performance issues, but this is what I am looking for. Your comments please.
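
For comparison, the rough CUDA analogue of that zero-copy idea is mapped, pinned host memory that the kernel reads and writes directly over the bus, with no explicit copy. Whether it helps or hurts depends entirely on the access pattern and the bus; this is only a sketch.

```cuda
// Sketch: zero-copy in CUDA via mapped pinned host memory.
#include <cuda_runtime.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;         // each access crosses the bus (or shared DRAM)
}

int main()
{
    const int n = 1 << 20;
    float *h_data, *d_alias;

    cudaSetDeviceFlags(cudaDeviceMapHost);                           // needed on older CUDA versions
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped);  // pinned + mapped
    cudaHostGetDevicePointer(&d_alias, h_data, 0);                   // device view of the same memory

    scale<<<(n + 255) / 256, 256>>>(d_alias, n);
    cudaDeviceSynchronize();            // after this, h_data already holds the results

    cudaFreeHost(h_data);
    return 0;
}
```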

How much memory can I actually allocate on a CUDA card

I'm writing a server process that performs calculations on a GPU using CUDA. I want to queue up incoming requests until enough memory is available on the device to run the job, but I'm having a hard time figuring out how much memory I can allocate on the device. I have a pretty good estimate of how much memory a job requires (at least how much will be allocated from cudaMalloc()), but I get a device-out-of-memory error long before I've allocated the total amount of global memory available.
Is there some kind of formula to compute, from the total global memory, the amount I can allocate? I can play with it until I get an estimate that works empirically, but I'm concerned my customers will deploy different cards at some point and my jerry-rigged numbers won't work very well.
The size of your GPU's DRAM is an upper bound on the amount of memory you can allocate through cudaMalloc, but there's no guarantee that the CUDA runtime can satisfy a request for all of it in a single large allocation, or even a series of small allocations.
The constraints of memory allocation vary depending on the details of the underlying driver model of the operating system. For example, if the GPU in question is the primary display device, then it's possible that the OS has also reserved some portion of the GPU's memory for graphics. Other implicit state the runtime uses (such as the heap) also consumes memory resources. It's also possible that the memory has become fragmented and no contiguous block large enough to satisfy the request exists.
The CUDART API function cudaMemGetInfo reports the free and total amount of memory available. As far as I know, there's no similar API call which can report the size of the largest satisfiable allocation request.
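
As a small sketch, here is how the server process might use cudaMemGetInfo to gate job admission; the headroom margin and the jobBytes estimate are placeholders, and as noted above even the reported free amount may not be allocatable in one contiguous block.

```cuda
// Sketch: query free/total device memory before admitting a queued job.
#include <cstdio>
#include <cuda_runtime.h>

bool enoughMemoryFor(size_t jobBytes)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    printf("device memory: %zu MB free of %zu MB total\n",
           freeBytes >> 20, totalBytes >> 20);

    // Leave headroom: free memory may be fragmented, and the runtime or display
    // driver can consume more at any time, so a single allocation of exactly
    // 'freeBytes' is not guaranteed to succeed.
    const size_t headroom = 64u << 20;  // 64 MB, an arbitrary safety margin
    return jobBytes + headroom <= freeBytes;
}
```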