CUDA and web development

It seems apparent that each core of the GPU could handle a request, rather than having one main processor (the system's CPU) handle all requests. On the surface it seems possible, perhaps with templates in the GPU plus a Redis database in GPU GDDR5?
Is it possible and worthwhile?

How would the GPU access disks, databases, etc.?
Requests are usually short, sharp processing snippets. You'd have to load each request from main memory into GPU memory, do a computation, and fire the result back again. There is an overhead when transferring data from main memory to GPU memory, so a GPU computation is only worth doing if the calculation is long enough and the problem is amenable to parallel processing on a GPU.
In essence, GPUs are good at stream processing, not at lots of small requests.

The previous answer is valid. There is also another caveat regarding GPUs: the instruction set is smaller and data is processed in matrices, i.e. the same operation is applied to every element in a set. So you will need to be very clever in designing what these replicated operations are.
I take it you are considering a GPU HTTP server.


Separate instruction and data memory

I am currently in a Computer Architecture class and this is the one thing majorly stumping me. I asked my professor why we have separate instruction and data memory (consider the single-cycle MIPS data path I'm attaching).
My thoughts:
add extra ports (not an issue of FU reuse, similar to register file implementation but with a port for instructions)
consolidate so that memory could be unified and not go unused
His:
agreed with me on last point
adding ports carries a roughly quadratic cost and has a negative impact on performance
separate allows more leeway in placement on chip
single-access memory is faster
Could anyone please elaborate on any of these points in more depth, or add anything of their own? I'm still not fully clear on this.
Yes, multi-ported DRAM is an option, but much more expensive, probably more than twice as expensive per byte. (And lower capacity per die area, so available sizes will be smaller).
In practice real CPUs just have split L1d/L1i caches, and unified L2 cache and memory, assuming it's ultimately a von Neumann type of architecture.
We call this "modified Harvard" - the performance advantages of Harvard, allowing parallel code-fetch and load/store, except for contention for access to the unified cache or memory. But it's rare to have lots of code cache misses at the same time as data misses, because if you're stalling on code fetch then you'll have bubbles in the pipeline anyway. (Out-of-order exec could hide that better than a simple single-cycle design, of course!)
It needs extra sync / pipeline flushing when we want to run machine code that we recently generated / stored, e.g. a JIT compiler, but other than that it has all the advantages of unified memory and the CPU-pipeline advantages of the Harvard split. (You need extra synchronization anyway to run recently-stored code on an ISA that allows deeply pipelined and out-of-order exec implementations, and which fetch code far ahead into buffers in the pipeline to give more room to absorb bubbles).
What does a 'split' cache mean, and how is it useful (if it is)?
L1 caches usually have a split design, but L2 and L3 caches have a unified design - why?
The first pipelined CPUs had small caches or in the case of MIPS R2000 even off-chip caches with only the controllers on-chip. But yes, MIPS R2000 had split I and D cache. Because you don't want code-fetch to conflict with the MEM stage of load or store instructions; that would introduce a structural hazard that would interfere with running 1 instruction per cycle when you don't have cache misses.
In a single-cycle design I guess your cycle would normally be long enough to access memory twice because you aren't overlapping code-fetch and load/store, so you might not even need multi-ported memory?
L1 data caches are already multi-ported on modern high-performance CPUs, allowing them to commit a store from the store buffer in the same cycle as doing 1 or 2 loads on load execution units.
Having even more ports to also allow code-fetch from it would be even more expensive in terms of power, vs. two slightly smaller caches.
If you think of the Instruction Memory and Data Memory as caches, as in being backed by a unified main memory, then you have the traditional Modified Harvard Architecture, which has some of the advantages of both the Von Neumann and the Harvard Architecture together.
One point you didn't seem to raise is that separation of the two memories (caches) allows for simultaneous access, so an instruction can be read while a data memory is read or written in the same cycle.  This would be more difficult with a unified cache/memory.  This advantage applies to single cycle and pipelined processors since in both designs there is overlap between instruction fetch (IF stage in pipelined) and memory operations (MEM stage in pipelined).
Further, as the Instruction Memory is read-only, it needs less circuitry. In the case of being caches, the IM has no dirty bits, no write-back, etc. Further, the IM and DM can have different associativity.
In the case of not being caches, it is not clear how the computer system loads the instruction memory; perhaps it is some fast ROM, or it is loaded by an external device from ROM into the IM. A number of embedded systems have Instruction Tightly Integrated Memory (and/or Data memory, ITIM/DTIM) that does not act as a cache and is not necessarily backed by main memory, instead serving as the primary memory.

What are the differences between kernel fusion and persistent threads?

In CUDA programming, I am trying to reduce the synchronization overhead between off-chip memory and on-chip memory when there is a data dependency between two kernels. What are the differences between these two techniques?
The idea behind kernel fusion is to take two (or more) discrete operations that could be realized (and might already be realized) in separate kernels, and combine them so the operations all happen in a single kernel.
The benefits of this may or may not seem obvious, so I refer you to this writeup.
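As a rough illustration (not taken from that writeup - the kernels here are made up), fusing two elementwise operations means the intermediate value never makes a round trip through global memory:

```
// Two separate kernels: the intermediate result is stored to and re-loaded
// from global memory between them.
__global__ void scaleKernel(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i];              // pass 1: load + store
}

__global__ void addKernel(float *x, const float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] + y[i];           // pass 2: load + store again
}

// Fused version: one pass over global memory, the intermediate value stays
// in a register.
__global__ void scaleAddFused(float *x, const float *y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = a * x[i];                  // intermediate lives in a register
        x[i] = t + y[i];                     // single store instead of store + reload
    }
}
```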
Persistent threads/Persistent kernel is a kernel design strategy that allows the kernel to continue execution indefinitely. Typical "ordinary" kernel design focuses on solving a particular task, and when that task is done, the kernel exits (at the closing curly-brace of your kernel code).
A persistent kernel however has a governing loop in it that only ends when signaled - otherwise it runs indefinitely. People often connect this with the producer-consumer model of application design. Something (host code) produces data, and your persistent kernel consumes that data and produces results. This producer-consumer model can run indefinitely. When there is no data to consume, the consumer (your persistent kernel) simply waits in a loop, for new data to be presented.
Persistent kernel design has a number of important considerations, which I won't try to list here but instead refer you to this longer writeup/example.
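To make the "governing loop" idea concrete, here is a bare-bones sketch (the flag protocol and buffer names are invented for illustration; it assumes a single-block launch and host-visible mapped/managed memory, and glosses over the synchronization details covered in that writeup):

```
// Minimal persistent-kernel sketch (illustrative only, launch with one block).
// work_flag: 0 = idle, 1 = new data ready, -1 = shut down. The host should wait
// for the flag to return to 0 before publishing the next batch.
__global__ void persistentConsumer(volatile int *work_flag,
                                   volatile float *data,
                                   volatile float *result)
{
    __shared__ int cmd;
    while (true) {
        if (threadIdx.x == 0) {
            while (*work_flag == 0) { }      // spin, waiting for the producer
            cmd = *work_flag;
        }
        __syncthreads();                     // broadcast the command to the block
        if (cmd == -1) break;                // host requested shutdown
        result[threadIdx.x] = data[threadIdx.x] * 2.0f;   // "consume" the batch
        __threadfence_system();              // make results visible to the host
        __syncthreads();                     // whole block finished this batch
        if (threadIdx.x == 0) *work_flag = 0;             // signal "done"
    }
}
```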
Benefits:
Kernel fusion may combine work into a single kernel so as to increase performance by reducing unnecessary loads and stores - because the data being operated on can be preserved in-place in device registers or shared memory.
Persistent kernels may have a variety of benefits. They may reduce the latency associated with processing data, because the CUDA kernel launch overhead is no longer necessary. However, another possible performance factor may be the ability to retain state (similar to kernel fusion) in device registers or shared memory.
Kernel fusion doesn't necessarily imply a persistent kernel. You may simply be combining a set of tasks into a single kernel. A persistent kernel doesn't necessarily imply fusion of separate computation tasks - there may be only 1 "task" that you are performing in a governing "consumer" loop.
But there is obviously considerable conceptual overlap between the two ideas.

Transferring data to GPU while kernel is running to save time

GPUs are really fast when it comes to parallel computation and outperform CPUs by being 15-30 (some have reported even 50) times faster; however,
GPU memory is very limited compared to CPU memory, and communication between GPU memory and the CPU is not as fast.
Let's say we have some data that won't fit into GPU RAM but we still want to use
its wonders to compute. What we can do is split that data into pieces and feed it into the GPU one piece at a time.
Sending large data to the GPU can take time, and one might think: what if we split a data piece into two, feed the first half, run the kernel, and then feed the other half while the kernel is running?
By that logic we should save some time, because the data transfer would be going on during the computation, hopefully without interrupting it, and when the kernel is finished it can just continue its job without needing to wait for new data.
I must say that I'm new to GPGPU and new to CUDA, but I have been experimenting with simple CUDA code and have noticed that the function cudaMemcpy, used to transfer data between CPU and GPU, will block if a kernel is running. It will wait until the kernel is finished and then do its job.
My question: is it possible to accomplish something like that described above, and if so, could someone show an example or provide some information source on how it could be done?
Thank you!
is it possible to accomplish something like that described above
Yes, it's possible. What you're describing is a pipelined algorithm, and CUDA has various asynchronous capabilities to enable it.
The asynchronous concurrent execution section of the programming guide covers the necessary elements in CUDA to make it work. To use your example, there exists a non-blocking version of cudaMemcpy, called cudaMemcpyAsync. You'll need to understand CUDA streams and how to use them.
I would also suggest this presentation which covers most of what is needed.
Finally, here is a worked example. That particular example happens to use CUDA stream callbacks, but those are not necessary for basic pipelining. They enable additional host-oriented processing to be asynchronously triggered at various points in the pipeline, but the basic chunking of data, and delivery of data while processing is occurring does not depend on stream callbacks. Note also the linked CUDA sample codes in that answer, which may be useful for study/learning.
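As a rough sketch of the idea in your question (this is not the linked worked example - the kernel and sizes are placeholders), splitting the data into two chunks and giving each chunk its own stream lets the copy of chunk 1 overlap the kernel working on chunk 0. Note the host buffer must be page-locked for cudaMemcpyAsync to actually overlap:

```
#include <cuda_runtime.h>

__global__ void myKernel(float *d, int n)            // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20, CHUNK = N / 2;
    float *h_data, *d_data;
    cudaMallocHost(&h_data, N * sizeof(float));      // pinned host memory (needed for async overlap)
    cudaMalloc(&d_data, N * sizeof(float));
    for (int i = 0; i < N; ++i) h_data[i] = 1.0f;

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < 2; ++c) {
        int off = c * CHUNK;
        // copy chunk c and launch its kernel in stream c; the copy of chunk 1
        // can proceed while the kernel for chunk 0 is still running
        cudaMemcpyAsync(d_data + off, h_data + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        myKernel<<<(CHUNK + 255) / 256, 256, 0, s[c]>>>(d_data + off, CHUNK);
        cudaMemcpyAsync(h_data + off, d_data + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();                          // wait for both streams to finish

    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFree(d_data); cudaFreeHost(h_data);
    return 0;
}
```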

Memory-compute overlap affects kernel duration?

Profiling my solution, I see dependencies between memory transfer and kernel computation. For a 60 MB data transfer, I have a 2 ms overhead for each overlapped kernel computation.
I'm comparing my basic solution with the enhanced (overlapped) one to see the differences. They process the same amount of data with the same kernels (which do not depend on the data values).
So am I wrong or missing something somewhere, or does the overlap really use a "significant" part of the GPU?
I understand the overlapping process has to order the data transfer and control its issue, plus perhaps some context switching, but compared to that, 2 ms seems like too much.
When you overlap data copy with compute, both operations are competing for GPU memory bandwidth. If your kernel is memory-bandwidth bound, then it's possible that overlapping the operations will cause both the compute and the memory copy to run longer than if either were running alone.
60 megabytes of data on a PCIe Gen2 link will take ~10 ms if there is no contention. An extra 2 ms when there is contention doesn't sound out of range to me, but it will depend to a significant degree on which GPU you are using. It's also not clear whether the "overhead" you're referring to is an extension of the length of the transfer, of the kernel compute, or of the overall program. Different GPUs have different GPU memory bandwidth numbers.
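One way to see where the extra time goes is to time the kernel by itself and then again while a 60 MB copy runs in a second stream, and compare the two durations. A minimal sketch (the kernel and sizes are placeholders, not your code):

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busyKernel(float *d, int n)          // placeholder compute kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 100; ++k) d[i] = d[i] * 1.0001f + 1.0f;
}

int main()
{
    const int N = 1 << 22;
    const size_t COPY = 60u << 20;                   // 60 MB transfer, as in the question
    float *d_work;        cudaMalloc(&d_work, N * sizeof(float));
    char *h_buf, *d_buf;
    cudaMallocHost(&h_buf, COPY);                    // pinned, so the copy can overlap
    cudaMalloc(&d_buf, COPY);

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute); cudaStreamCreate(&copy);
    cudaEvent_t t0, t1;   cudaEventCreate(&t0); cudaEventCreate(&t1);
    float aloneMs, overlapMs;

    // 1) kernel running alone
    cudaEventRecord(t0, compute);
    busyKernel<<<(N + 255) / 256, 256, 0, compute>>>(d_work, N);
    cudaEventRecord(t1, compute);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&aloneMs, t0, t1);

    // 2) same kernel while the copy runs in another stream
    cudaEventRecord(t0, compute);
    busyKernel<<<(N + 255) / 256, 256, 0, compute>>>(d_work, N);
    cudaEventRecord(t1, compute);
    cudaMemcpyAsync(d_buf, h_buf, COPY, cudaMemcpyHostToDevice, copy);
    cudaDeviceSynchronize();
    cudaEventElapsedTime(&overlapMs, t0, t1);

    // the difference is the slowdown attributable to bandwidth contention
    printf("kernel alone: %.3f ms, with concurrent copy: %.3f ms\n", aloneMs, overlapMs);
    return 0;
}
```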

Prefetching in Nvidia CUDA

I'm working on data prefetching in NVIDIA CUDA. I read some documents on prefetching on the device itself, i.e. prefetching from shared memory to cache.
But I'm interested in data prefetching between the CPU and the GPU. Can anyone point me to some documents or anything regarding this matter? Any help would be appreciated.
Answer based on your comment:
When we want to perform computation on large data, ideally we'll send the maximum amount of data to the GPU, perform the computation, and send it back to the CPU, i.e. SEND, COMPUTE, SEND (back to CPU). While it sends data back to the CPU, the GPU has to stall. My plan is: given a CUDA program that would run in the entire global memory, I'll compel it to run in half of the global memory so that I can use the other half for data prefetching. While computation is being performed in one half, I can simultaneously prefetch data into the other half, so there will be no stalls. Now tell me, is this feasible to do? Will performance be degraded or improved? It should improve...
CUDA streams were introduced to enable exactly this approach.
If your computation is rather intensive, then yes --- it can greatly speed up your performance. On the other hand, if data transfers take, say, 90% of your time, you will save only on the computation time - that is, 10% tops...
The details, including examples, on how to use streams are provided in the CUDA Programming Guide.
For version 4.0, that will be section "3.2.5.5 Streams", and in particular "3.2.5.5.5 Overlapping Behavior" --- there, they launch another, asynchronous memory copy, while a kernel is still running.
Perhaps you would be interested in the asynchronous host/device memory transfer capabilities of CUDA 4.0? You can overlap host/device memory transfers and kernels by using page-locked host memory. You could use this to...
Copy working set #1 & #2 from host to device.
Process #i, promote #i+1, and load #i+2 - concurrently.
So you could be streaming data in and out of the GPU and computing on it all at once (!). Please refer to the CUDA 4.0 Programming Guide and CUDA 4.0 Best Practices Guide for more detailed information. Good luck!
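A rough sketch of that "process #i while loading #i+1" pattern, using two device buffers and two streams (all names here are hypothetical, and the host array is assumed to be page-locked):

```
#include <cuda_runtime.h>

__global__ void process(float *d, int n)             // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Double-buffered pipeline: while chunk i is being processed in one stream,
// chunk i+1 is copied into the other device buffer in the second stream.
void runPipeline(float *h_data /* pinned, nChunks*chunkElems floats */,
                 int nChunks, int chunkElems)
{
    float *d_buf[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d_buf[b], chunkElems * sizeof(float));
        cudaStreamCreate(&s[b]);
    }

    for (int i = 0; i < nChunks; ++i) {
        int b = i % 2;                                // alternate between the two buffers
        float *h_chunk = h_data + (size_t)i * chunkElems;
        // issued in stream b: stream ordering keeps buffer b safe from premature
        // reuse, while the other stream's kernel can still be running
        cudaMemcpyAsync(d_buf[b], h_chunk, chunkElems * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        process<<<(chunkElems + 255) / 256, 256, 0, s[b]>>>(d_buf[b], chunkElems);
        cudaMemcpyAsync(h_chunk, d_buf[b], chunkElems * sizeof(float),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();

    for (int b = 0; b < 2; ++b) { cudaFree(d_buf[b]); cudaStreamDestroy(s[b]); }
}
```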
CUDA 6 will eliminate the need to copy explicitly, i.e. the copying will be automatic (Unified Memory).
However, you may still benefit from prefetching.
In a nutshell, you want the data for the "next" computation to be transferring while you complete the current computation. To achieve that you need to have at least two threads on the CPU, and some kind of signalling scheme (to know when to send the next data). Chunking will of course play a big role and affect performance.
The above may be easier on an APU (CPU + GPU on the same die), as the need to copy is eliminated since both processors can access the same memory.
If you want to find some papers on GPU prefetching just use google scholar.