Conceptual understanding of the memory bandwidth of a GPU - CUDA

I am a little confused by the concept of the memory bandwidth of a GPU.
According to the Tesla M2090 GPU specs, the peak bandwidth is 177.6 GB/s.
When people refer to bandwidth, do they mean
the speed of one-way traffic, i.e. the number of bytes per second that can be read from device memory, or
the speed of two-way traffic, i.e. the number of bytes per second that can be read from and written to device memory?
Wherever I read this term, I don't see this clarification being made.

There is only one set of wires on the bus, so data can't be read and written at the same time. In theory the bandwidth is the same either way: total read+write == total read == total write.
In practice, though, transfers are much more efficient if you are writing large contiguous blocks of data to the device; this is the most common usage and is what the system is optimised for.
Edit: The internal memory bandwidth of a graphics card (i.e. the memory path between various components on the card) is much higher than the bandwidth to/from the host computer.
It's also much more complex: there are different types of memory connected to different processors in different ways, and the manufacturer will pick the numbers that make it sound the highest. This number is really meaningless except for comparing different models of very similar cards from the same GPU family.

The bandwidth is the amount of data that can be read or written in a given period of time.
The same bus is used for both reads and writes. In a given clock cycle, the bus can be used for either a read or a write.
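In practice, the number people compare against the spec-sheet figure is the effective bandwidth: bytes read plus bytes written, divided by elapsed time. A minimal sketch of that measurement (the copy kernel, the array size, and the launch configuration are arbitrary choices for illustration):

#include <cstdio>
#include <cuda_runtime.h>

// Each thread reads one float and writes one float,
// so the kernel moves 2 * n * sizeof(float) bytes in total.
__global__ void copyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

int main()
{
    const int n = 1 << 22;                     // ~4M floats, arbitrary size
    const size_t bytes = n * sizeof(float);

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    copyKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);   // warm-up launch

    cudaEventRecord(start);
    copyKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Effective bandwidth counts reads and writes together.
    double gbps = (2.0 * bytes / 1e9) / (ms / 1e3);
    printf("Effective bandwidth: %.1f GB/s\n", gbps);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}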


filtering an image, best practices

I have an input image (say a buffer of 1024 * 1024 pixels, with RGBA color data).
For each pixel, I want to apply a filter that depends on its neighbors, e.g. [-15, 15] in the x and y directions.
My concern is that doing this with global memory means about 31 * 31 global memory accesses for each pixel, which would be a serious performance bottleneck. I'm also not sure about the behavior of multiple threads trying to read the same memory location at the same time (maybe some of them fail to read, so -> rubbish data in -> rubbish data out).
This question applies to both CUDA and OpenCL, as the concept should be the same.
I know that shared memory (per work group) or local memory (per thread) won't solve this, since I can't read another thread's local memory or another group's shared memory (correct me if I misunderstand this concept).
Shared memory is a typical approach to this problem, although the stencil area (31*31) is quite large. Data re-use benefit can still be gained however. Since adjacent pixel computations only extend the region required by one column, in a 16KB shared memory array of 32bit RGBA pixels, you could have enough data for at least 64 threads to cooperatively compute their pixel values out of a single shared memory load.
Regarding the concern about multiple threads reading the same location, there is no possibility for garbage data reads. Certainly there is a possibility for contention leading to a performance impact, but in fact with an orderly for-loop progression in the kernel, no threads will be reading the same location at the same time anyway. With appropriate data organization there will be good opportunity for coalesced reads from global memory and no bank conflicts in shared memory.
This type of problem is well-suited for GPUs e.g. CUDA or OpenCL, and there are many examples of programs like this on SO.
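As a concrete illustration of the shared-memory approach, here is a sketch of a tiled box filter over a [-15, 15] neighborhood; the 16x16 block size, the border clamping, and the plain averaging are all assumptions standing in for a real filter:

#include <cuda_runtime.h>

#define RADIUS 15                     // [-15, 15] neighborhood from the question
#define TILE   16                     // 16x16 threads per block (an assumed choice)
#define SMEM   (TILE + 2 * RADIUS)    // 46x46 shared tile including the halo

// Illustrative box filter: the block cooperatively loads its tile plus halo
// into shared memory once, then every thread averages the (2*RADIUS+1)^2
// neighborhood of its own pixel from that tile.
__global__ void boxFilterRGBA(const uchar4 *in, uchar4 *out, int width, int height)
{
    __shared__ uchar4 tile[SMEM][SMEM];

    int baseX = (int)(blockIdx.x * TILE) - RADIUS;   // global x of the tile's left edge
    int baseY = (int)(blockIdx.y * TILE) - RADIUS;   // global y of the tile's top edge

    // Cooperative load: each thread fetches several pixels, clamped at the borders.
    for (int y = threadIdx.y; y < SMEM; y += TILE)
        for (int x = threadIdx.x; x < SMEM; x += TILE) {
            int ix = min(max(baseX + x, 0), width  - 1);
            int iy = min(max(baseY + y, 0), height - 1);
            tile[y][x] = in[iy * width + ix];
        }
    __syncthreads();

    int gx = blockIdx.x * TILE + threadIdx.x;
    int gy = blockIdx.y * TILE + threadIdx.y;
    if (gx >= width || gy >= height) return;

    float r = 0.f, g = 0.f, b = 0.f, a = 0.f;
    for (int dy = -RADIUS; dy <= RADIUS; ++dy)
        for (int dx = -RADIUS; dx <= RADIUS; ++dx) {
            uchar4 p = tile[threadIdx.y + RADIUS + dy][threadIdx.x + RADIUS + dx];
            r += p.x; g += p.y; b += p.z; a += p.w;
        }
    float norm = (2 * RADIUS + 1) * (2 * RADIUS + 1);
    out[gy * width + gx] = make_uchar4((unsigned char)(r / norm),
                                       (unsigned char)(g / norm),
                                       (unsigned char)(b / norm),
                                       (unsigned char)(a / norm));
}

Launched with dim3 block(TILE, TILE) and a grid covering the image, the 46x46 tile occupies about 8.5 KB of shared memory per block, so each halo pixel is read from global memory once per block instead of up to 31*31 times per output pixel.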

Memory coalescing and transaction

After reading about the topic, I have two questions about coalesced global memory access:
1- I read that one requirement for memory coalescing is that the words accessed by the threads must be 4, 8, or 16 bytes, but apparently this is valid only for devices with compute capability less than 1.3. Is that right? For later devices (>= 1.3), can a thread access even one or two bytes and still get a coalesced memory access?
2- Will it matter (mainly in time) if a (half) warp's global memory access generates a 128-byte instead of a 64-byte memory transaction because of word misalignment? And what about the extra data transferred: will it be discarded by the system?
Thank you
1) You can access the data any way you want on later devices, but the performance will still be poor if you request a data segment that is narrow, i.e. you will not achieve the full memory bandwidth of your GPU.
2) This again depends on the overall scheme of your code. Generally, the improvement in later versions of CUDA was that non-aligned reads/writes no longer resulted in abysmal performance, but instead resulted in, e.g., two write transactions being issued instead of one.
Think of it like putting people on a bus. If you can fill your whole crowd into a single bus with one destination, you get better efficiency than using two buses that are only half filled.
So yes, it will matter, but depending on whether you are memory or compute bound, it will matter differently.
Arranging your read/write patterns to utilize the full bandwidth has given me the last 20-30% of performance in many applications.
/Henrik
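To make the bus analogy concrete, here is a minimal sketch that times the same copy kernel with an aligned and a deliberately misaligned starting element; the array size and the one-element offset are arbitrary choices for illustration:

#include <cstdio>
#include <cuda_runtime.h>

// Copies n floats starting at element 'offset'. With offset == 0 every warp
// touches aligned 128-byte segments; with offset == 1 each warp straddles two
// segments, so extra memory transactions are issued.
__global__ void offsetCopy(const float *in, float *out, int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i + offset] = in[i + offset];
}

int main()
{
    const int n = 1 << 22;
    float *d_in, *d_out;
    cudaMalloc(&d_in, (n + 32) * sizeof(float));    // slack for the offset
    cudaMalloc(&d_out, (n + 32) * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int offset = 0; offset <= 1; ++offset) {
        cudaEventRecord(start);
        offsetCopy<<<(n + 255) / 256, 256>>>(d_in, d_out, n, offset);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        printf("offset %d: %.3f ms\n", offset, ms);
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}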

Utilizing GPU worth it?

I want to compute the trajectories of particles subject to certain potentials, a typical N-body problem. I've been researching methods for utilizing a GPU (CUDA for example), and they seem to benefit simulations with large N (20000). This makes sense since the most expensive calculation is usually finding the force.
However, my system will have "low" N (less than 20), many different potentials/factors, and many time steps. Is it worth it to port this system to a GPU?
Based on the Fast N-Body Simulation with CUDA article, it seems that it is efficient to have different kernels for different calculations (such as acceleration and force). For systems with low N, it seems that the cost of copying to/from the device is actually significant, since for each time step one would have to copy and retrieve data from the device for EACH kernel.
Any thoughts would be greatly appreciated.
If you have less than 20 entities that need to be simulated in parallel, I would just use parallel processing on an ordinary multi-core CPU and not bother about using GPU.
Using a multi-core CPU would be much easier to program and avoid the steps of translating all your operations into GPU operations.
Also, as you already suggested, the performance gain using GPU will be small (or even negative) with this small number of processes.
There is no need to copy results from the device to host and back between time steps. Just run your entire simulation on the GPU and copy results back only after several time steps have been calculated.
For how many different potentials do you need to run simulations? Enough to just use the structure from the N-body example and still load the whole GPU?
If not, and assuming the potential calculation is expensive, I'd think it would be best to use one thread for each pair of particles in order to make the problem sufficiently parallel. If you use one block per potential setting, you can then write out the forces to shared memory, __syncthreads(), and use a subset of the block's threads (one per particle) to sum the forces. __syncthreads() again, and continue for the next time step.
If the potential calculation is not expensive, it might be worth exploring first where the main cost of your simulation is.
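To illustrate the earlier point about keeping everything on the GPU between time steps, here is a sketch of a brute-force O(N^2) step (one thread per particle rather than one per pair, for simplicity) with a host loop that copies positions back only occasionally; the kernel names, the Euler integrator, and the copy interval are illustrative assumptions:

#include <cuda_runtime.h>

// Brute-force acceleration: fine for small N. pos[j].w holds the mass.
__global__ void computeAcc(const float4 *pos, float4 *acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    const float eps2 = 1e-6f;                    // softening to avoid divide-by-zero
    float4 pi = pos[i];
    float ax = 0.f, ay = 0.f, az = 0.f;

    for (int j = 0; j < n; ++j) {
        float4 pj = pos[j];
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float r2 = dx * dx + dy * dy + dz * dz + eps2;
        float invR3 = rsqrtf(r2 * r2 * r2);
        ax += dx * pj.w * invR3;
        ay += dy * pj.w * invR3;
        az += dz * pj.w * invR3;
    }
    acc[i] = make_float4(ax, ay, az, 0.f);
}

// Simple Euler update, kept in a separate kernel so positions aren't modified
// while other threads are still reading them.
__global__ void integrate(float4 *pos, float4 *vel, const float4 *acc, int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    vel[i].x += acc[i].x * dt; vel[i].y += acc[i].y * dt; vel[i].z += acc[i].z * dt;
    pos[i].x += vel[i].x * dt; pos[i].y += vel[i].y * dt; pos[i].z += vel[i].z * dt;
}

// Host loop: run many steps back to back on the device and copy positions
// to the host only every 'stepsPerCopy' steps.
void runSimulation(float4 *d_pos, float4 *d_vel, float4 *d_acc, float4 *h_pos,
                   int n, int totalSteps, int stepsPerCopy, float dt)
{
    int block = 128;
    int grid = (n + block - 1) / block;

    for (int step = 0; step < totalSteps; ++step) {
        computeAcc<<<grid, block>>>(d_pos, d_acc, n);
        integrate<<<grid, block>>>(d_pos, d_vel, d_acc, n, dt);
        if ((step + 1) % stepsPerCopy == 0) {
            cudaMemcpy(h_pos, d_pos, n * sizeof(float4), cudaMemcpyDeviceToHost);
            // ... record or visualize h_pos on the host ...
        }
    }
}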

Maximum (shared memory per block) / (threads per block) in CUDA with 100% MP load

I'm trying to process an array of big structures with CUDA 2.0 (NVIDIA 590). I'd like to use shared memory for it. I've experimented with the CUDA occupancy calculator, trying to allocate the maximum shared memory per thread so that each thread can process a whole element of the array.
However, the maximum (shared memory per block) / (threads per block) I can see in the calculator at 100% multiprocessor load is 32 bytes, which is an order of magnitude too small for a single element.
Is 32 bytes the maximum possible value for (shared memory per block) / (threads per block)?
Is it possible to say which alternative is preferable - allocating part of the array in global memory, or just running the multiprocessor underloaded? Or can that only be decided by experiment?
Yet another alternative I can see is to process the array in several passes, but that looks like a last resort.
This is the first time I'm trying something really complex with CUDA, so I could be missing some other options...
There are many hardware limitations you need to keep in mind when designing a CUDA kernel. Here are some of the constraints you need to consider:
maximum number of threads you can run in a single block
maximum number of blocks you can load on a streaming multiprocessor at once
maximum number of registers per streaming multiprocessor
maximum amount of shared memory per streaming multiprocessor
Whichever of these limits you hit first becomes a constraint that limits your occupancy (is maximum occupancy what you are referring to by "100% Multiprocessor load"?). Once you reach a certain threshold of occupancy, it becomes less important to pay attention to occupancy. For example, occupancy of 33% does not mean that you are only able to achieve 33% of the maximum theoretical performance of the GPU. Vasily Volkov gave a great talk at the 2010 GPU Technology Conference which recommends not worrying too much about occupancy, and instead trying to minimize memory transactions by using some explicit caching tricks (and other stuff) in the kernel. You can watch the talk here: http://www.gputechconf.com/gtcnew/on-demand-GTC.php?sessionTopic=25&searchByKeyword=occupancy&submit=&select=+&sessionEvent=&sessionYear=&sessionFormat=#193
The only real way to be sure that you are using a kernel design that gives best performance is to test all the possibilities. And you need to redo this performance testing for each type of device you run it on, because they all have different constraints in some way. This can obviously be tedious, especially when the different design patterns result in fundamentally different kernels. I get around this to some extent by using a templating engine to dynamically generate kernels at runtime according to the device hardware specifications, but it's still a bit of a hassle.
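When juggling these limits it can also help to query them from the runtime rather than from a spec sheet; a small sketch, assuming a CUDA runtime recent enough that cudaDeviceProp exposes the fields below:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0

    printf("Device: %s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("Max threads per block:    %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:       %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Registers per block:      %d\n", prop.regsPerBlock);
    printf("Shared memory per block:  %zu bytes\n", prop.sharedMemPerBlock);
    printf("Shared memory per SM:     %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Multiprocessor count:     %d\n", prop.multiProcessorCount);

    // Bytes of shared memory per thread if every thread slot on the SM is used.
    printf("Shared memory per thread at full occupancy: %zu bytes\n",
           prop.sharedMemPerMultiprocessor / prop.maxThreadsPerMultiProcessor);
    return 0;
}

On a Fermi-class device such as the GTX 590, for example, 48 KB of shared memory per SM divided by 1536 resident threads gives exactly the 32 bytes per thread that the occupancy calculator reports.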

What is the cost of memory access?

We like to think that a memory access is fast and constant, but on modern architectures/OSes, that's not necessarily true.
Consider the following C code:
int i = 34;
int *p = &i;
// do something that may or may not involve i and p
{...}
// 3 days later:
*p = 643;
What is the estimated cost of this last assignment in CPU instructions, if
i is in L1 cache,
i is in L2 cache,
i is in L3 cache,
i is in RAM proper,
i is paged out to an SSD disk,
i is paged out to a traditional disk?
Where else can i be?
Of course the numbers are not absolute, but I'm only interested in orders of magnitude. I tried searching the webs, but Google did not bless me this time.
Here are some hard numbers, demonstrating that exact timings vary from CPU family to family and version to version: http://www.agner.org/optimize/
These numbers are a good guide:
L1 1 ns
L2 5 ns
RAM 83 ns
Disk 13700000 ns
There is also an infographic giving the orders of magnitude (source: http://news.ycombinator.com/item?id=702713).
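If you want to reproduce rough numbers like these on your own machine, the usual trick is a pointer-chasing microbenchmark: walk a randomly shuffled cyclic chain so the prefetcher can't help, and vary the working-set size so it falls into L1, L2/L3, or RAM. A minimal sketch (the sizes and step count are arbitrary; which size lands in which cache level depends on your CPU):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Walk a randomly shuffled cyclic pointer chain so each load depends on the
// previous one and the hardware prefetcher can't hide the latency.
static double chase(size_t bytes, size_t steps)
{
    size_t n = bytes / sizeof(void *);
    void **chain  = (void **)malloc(n * sizeof(void *));
    size_t *order = (size_t *)malloc(n * sizeof(size_t));

    for (size_t i = 0; i < n; i++) order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {            // Fisher-Yates shuffle
        size_t j = rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n; i++)                  // build the cycle
        chain[order[i]] = &chain[order[(i + 1) % n]];

    void **p = &chain[order[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < steps; i++)
        p = (void **)*p;                            // dependent load chain
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    fprintf(stderr, "%p\n", (void *)p);             // keep the loop from being optimized away
    free(chain); free(order);
    return ns / steps;                              // ns per access
}

int main(void)
{
    size_t sizes[] = { 16 << 10, 256 << 10, 8 << 20, 256 << 20 };  // roughly L1 .. RAM
    for (int i = 0; i < 4; i++)
        printf("%8zu KiB: %.1f ns/access\n", sizes[i] >> 10,
               chase(sizes[i], 10 * 1000 * 1000));
    return 0;
}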
Norvig has some values from 2001. Things have changed some since then but I think the relative speeds are still roughly correct.
It could also be in a CPU register. The C/C++ keyword "register" hints to the compiler that the variable should be kept in a register, but you can't guarantee it will stay, or ever even get, there.
As long as the cache/RAM/hard disk/SSD is not busy serving other accesses (e.g. DMA requests) and the hardware is reasonably reliable, the cost is still constant (though it may be a large constant).
When you get a page fault and the variable has to be read back from the hard disk, it's just a simple disk read request, but the cost is huge: the CPU traps into the kernel, the kernel sends a request to the disk, waits for the disk to write the data to RAM, and then the data is read from RAM into the cache and into a register. However, this cost is still a (roughly) constant cost.
The actual numbers and proportions will vary depending on your hardware and on how well its components match (e.g. if your CPU runs at 2000 MHz and your RAM delivers data at 333 MHz, they don't sync very well). The only way you can figure this out is to test it in your program.
And this is not premature optimization; it is micro-optimization. Let the compiler worry about this kind of detail.
These numbers change all the time. But for rough estimates for 2010, Kathryn McKinley has nice slides on the web, which I don't feel compelled to copy here.
The search term you want is "memory hierarchy" or "memory hierarchy cost".
Where else can i be?
i and p are different things; both of them can be located in any of the locations on your list. The pointer's value might additionally still be held in a CPU register when the assignment is made, so it doesn't need to be fetched from RAM/cache/…
Regarding performance: this is highly CPU-dependent. Thinking in orders of magnitude, accessing RAM is worse than accessing cache entries, and accessing swapped-out pages is the worst. All are a bit unpredictable because they depend on other factors as well (e.g. other processors, depending on the system architecture).