My program runs 2 threads - Thread A (for input) and B (for processing). I also have a pair of pointers to 2 buffers, so that when Thread A has finished copying data into Buffer 1, Thread B starts processing Buffer 1 and Thread A starts copying data into Buffer 2. Then when Buffer 2 is full, Thread A copies data into Buffer 1 and Thread B processes Buffer 2, and so on.
My problem comes when I try to cudaMemcpy Buffer[] into d_Buffer (which was previously cudaMalloc'd by the main thread, i.e. before thread creation; the Buffer[] arrays were also malloc'd by the main thread). I get an "invalid argument" error, but have no idea which argument is invalid.
I've reduced my program to a single-threaded version that still uses 2 buffers. That is, the copying and processing take place one after another instead of simultaneously. The cudaMemcpy line is exactly the same as in the double-threaded version. The single-threaded program works fine.
I'm not sure where the error lies.
Thank you.
Regards,
Rayne
If you are doing this with CUDA 3.2 or earlier, the reason is that GPU contexts are tied to a specific thread. If a multi-threaded program allocates memory on the same GPU from different host threads, the allocations wind up establishing different contexts, and pointers from one context are not portable to another context. Each context has its own "virtualised" memory space to work with.
The solution is to either use the context migration API to transfer a single context from thread to thread as they do work, or try the new public CUDA 4.0rc2 release, which should support what you are trying to do without the use of context migration. The downside is that 4.0rc2 is a testing release, and it requires a particular beta release driver. That driver won't work with all hardware (laptops, for example).
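If you do try CUDA 4.0, a minimal sketch of your double-buffer setup under the new model might look like the following. This is my own hypothetical example (the names, sizes and single worker thread are invented, not taken from your code); it relies on the 4.0 behaviour that all host threads share one context per device, so a pointer cudaMalloc'd in the main thread is valid in Thread A and Thread B:

#include <cuda_runtime.h>
#include <pthread.h>
#include <stdio.h>

#define N (1 << 16)

static float *d_Buffer;            // cudaMalloc'd by the main thread
static float Buffer[2][N];         // the two host buffers

static void *copy_thread(void *arg)      // stands in for Thread A
{
    int which = *(int *)arg;
    cudaSetDevice(0);                    // same device -> same context as the main thread
    cudaError_t err = cudaMemcpy(d_Buffer, Buffer[which],
                                 N * sizeof(float), cudaMemcpyHostToDevice);
    if (err != cudaSuccess)
        printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));
    return NULL;
}

int main(void)
{
    cudaSetDevice(0);
    cudaMalloc(&d_Buffer, N * sizeof(float));   // before thread creation

    int which = 0;
    pthread_t a;
    pthread_create(&a, NULL, copy_thread, &which);
    pthread_join(a, NULL);

    cudaFree(d_Buffer);
    return 0;
}

On CUDA 3.2 the same cudaMemcpy in copy_thread would fail with "invalid argument", because d_Buffer belongs to the main thread's context.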
I would like to know how the CUDA hardware/run-time system handles the following case.
If a warp's instruction (warp1 in the following) involves an access to global memory (load/store), the run-time system schedules the next ready warp for execution.
When the new warp is executed,
Will the "memory access" of warp1 be conducted in parallel, i.e. while the new warp is running ?
Will the run time system put warp1 into a memory access waiting queue; once the memory request is completed, the warp is then moved into the runnable queue?
Will the instruction pointer related to warp1 execution be incremented automatically and in parallel to the new warp execution, to annotate that the memory request is completed?
For instance, consider the pseudo-code statement output = input + array[i], where output and input are both scalar variables mapped into registers, whereas array is stored in global memory.
To run the above statement, we need to load the value of array[i] into a (temporary) register before updating output; i.e. the statement can be translated into 2 macro assembly instructions: load reg, &array[i] followed by output_register = input_register + reg.
I would like to know how the hardware and the runtime system handle the execution of the above 2 macro assembly instructions, given that the load can't return immediately.
I am not sure I understand your questions correctly, so I'll just try to answer them as I read them:
1. Yes, while a memory transaction is in flight, further independent instructions will continue to be issued. There isn't necessarily a switch to a different warp though - while instructions from other warps will always be independent, the following instructions from the same warp might be independent as well, and the same warp may keep running (i.e. further instructions may be issued from the same warp).
2. No. As explained under 1., the warp can and will continue executing instructions until either the result of the load is needed by a dependent instruction, or a memory fence / barrier instruction requires it to wait for the effect of the store to become visible to other threads.
This can go as far as issuing further (independent) load or store instructions, so that multiple memory transactions can be in flight for the same warp at the same time. So the status of a warp after issuing a load/store doesn't change fundamentally and it is not halted until necessary.
3. The instruction pointer will always be incremented automatically (there is no situation where you ever do this manually, nor are there instructions that allow you to do so). However, as 2. implies, this doesn't necessarily indicate that the memory access has been performed - there is separate hardware to track the progress of memory accesses.
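As a rough illustration of points 1 and 2, here is a sketch in CUDA C (an invented kernel, not actual hardware behaviour or machine code): the two loads are independent, so both can be issued back to back and be in flight at the same time; the warp only stalls at the add if the data has not arrived yet.

__global__ void ilp_example(const float *array, const float *input,
                            float *output, int i, int j)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float a = array[i];        // load 1 issued, result not yet needed
    float b = array[j];        // load 2 issued while load 1 is still in flight
    output[t] = input[t] + a + b;   // first use of a and b: the warp waits here if necessary
}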
Please note that the hardware implementation is completely undocumented by Nvidia. You might find some indications of possible implementations if you search through Nvidia's patent applications.
GPUs up to the Fermi generation (compute capability 2.x) tracked outstanding memory transactions completely in hardware. While undocumented by Nvidia, the common mechanism to track (memory) transactions in flight is scoreboarding.
GPUs from newer generations starting with Kepler (compute capability 3.x) use some assistance in the form of control words embedded in the shader assembly code. While again undocumented, Scott Gray has reverse engineered these for his Maxas Maxwell assembler. He found that (amongst other things) the control words contain barrier instructions for tracking memory transactions, and was kind enough to document his findings on his Control-Codes wiki page.
I have a kernel which I want to start with the configuration "1 block x 32 threads". To increase parallelism I want to start several streams instead of running a bigger "work package" than "1 block x 32 threads". I want to use the GPU in a program where data comes from the network. I don't want to wait until a bigger "work package" is available.
The code is like:
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
- copy data GPU -> Host [cudaMemcpyAsync(.., stream i)]
}
The real code is much more complex but I want to keep it simple (15 CPU threads use the GPU).
The code works, but the streams don't run concurrently as expected. The GTX 480 has 15 SMs where each SM has 32 shader processors. I expected that if I start the kernel 15 times, all 15 streams would run in parallel, but this is not the case. I have used the Nvidia Visual Profiler and there is a maximum of 5 streams which run in parallel. Often only one stream runs. The performance is really bad.
I get the best results with a "64 blocks x 1024 threads" configuration. If I instead use a "32 blocks x 1024 threads" configuration with two streams, the streams are executed one after the other and performance drops. I am using CUDA Toolkit 5.5 and Ubuntu 12.04.
Can somebody explain why this is the case and give me some background information? Should it work better on newer GPUs? What is the best way to use the GPU in time-critical applications where you don't want to buffer data? Probably this is not possible, but I am searching for techniques which bring me closer to a solution.
News:
I did some further research. The problem is the last cudaMemcpyAsync(..) (GPU->Host copy) call. If I remove it, all streams run concurrently. I think the problem is illustrated in http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf on slide 21. They say that on Fermi there are two copy queues, but this is only true for Tesla and Quadro cards, right? I think the problem is that the GTX 480 has only one copy queue and all copy commands (host->GPU AND GPU->host) are put in this one queue. Everything is non-blocking, and the GPU->host memcopy of the first thread blocks the host->GPU memcopy calls of other threads.
Here are some observations:
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
}
-> works: streams run concurrently
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
- sleep(10)
- copy data GPU -> Host [cudaMemcpyAsync(.., stream i)]
}
-> works: streams run concurrently
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
- cudaStreamSynchronize(stream i)
- copy data GPU -> Host [cudaMemcpyAsync(.., stream i)]
}
-> doesn't work!!! Maybe cudaStreamSynchronize is put in the copy-queue?
Does someone know a solution for this problem? Something like a blocking kernel call would be cool. The last cudaMemcpyAsync() (GPU->Host) should only be called once the kernel has finished.
Edit2:
Here is an example to clarify my problem:
To keep it simple we have 2 streams:
Stream1:
------------
HostToGPU1
kernel1
GPUToHost1
Stream2:
------------
HostToGPU2
kernel2
GPUToHost2
The first stream is started. HostToGPU1 is executed, kernel1 is launched and GPUToHost1 is called. GPUToHost1 blocks because kernel1 is running. In the meantime Stream2 is started. HostToGPU2 is called; CUDA puts it in the queue, but it can't be executed because GPUToHost1 blocks until kernel1 has finished. There are no data transfers at the moment; CUDA just waits for GPUToHost1. So my idea was to call GPUToHost1 only when kernel1 is finished. This seems to be the reason why it works with sleep(..): GPUToHost1 is called after the kernel has finished. A kernel launch which automatically blocks the CPU thread would be cool.
GPUToHost1 would then not block in the queue (provided there are no other data transfers at the time; in my case the data transfers are not time-consuming anyway).
Concurrent kernel execution can be most easily witnessed on Linux.
For a good example and an easy test, refer to the concurrent kernels sample.
Good concurrency among kernels generally requires several things:
a device which supports concurrent kernels, so a cc 2.0 or newer device
kernels that are small enough in terms of number of blocks and other resource usage (registers, shared memory) so that multiple kernels can actually execute. Kernels with larger resource requirements will typically be observed to be running serially. This is expected behavior.
proper usage of streams to enable concurrency
In addition, concurrent kernels often implies copy/compute overlap. In order for copy/compute overlap to work, you must:
be using a GPU with enough copy engines. Some GPUs have one engine, some have 2. If your GPU has one engine, you can overlap one copy operation (i.e. one direction) with kernel execution. If you have 2 copy engines (your GeForce GPU has 1), you can overlap both directions of copying with kernel execution.
use pinned (host) memory for any data that will be copied to or from GPU global memory and that will be the source or target of any of the copy operations you intend to overlap
use streams properly and the necessary async versions of the relevant API calls (e.g. cudaMemcpyAsync), as in the sketch below
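For illustration, a minimal sketch of these requirements taken together (my own example; the kernel, sizes and names are invented, not the poster's code):

#include <cuda_runtime.h>

__global__ void process(float *d, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) d[idx] *= 2.0f;
}

int main(void)
{
    const int nStreams = 4, n = 1 << 16;
    size_t bytes = n * sizeof(float);
    float *h[nStreams], *d[nStreams];
    cudaStream_t s[nStreams];

    for (int i = 0; i < nStreams; ++i) {
        cudaMallocHost(&h[i], bytes);    // pinned host memory, required for overlap
        cudaMalloc(&d[i], bytes);
        cudaStreamCreate(&s[i]);
    }

    for (int i = 0; i < nStreams; ++i) {
        cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]);
        process<<<(n + 255) / 256, 256, 0, s[i]>>>(d[i], n);
        cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();
    return 0;
}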
Regarding your observation that the smaller 32x1024 kernels do not execute concurrently, this is likely a resource issue (blocks, registers, shared memory) preventing much overlap. If you have enough blocks in the first kernel to occupy the GPU execution resources, it's not sensible to expect additional kernels to begin executing until the first kernel is finished or mostly finished.
EDIT: Responding to question edits and additional comments below.
Yes, GTX480 has only one copy "queue" (I mentioned this explicitly in my answer, but I called it a copy "engine"). You will only be able to get one cudaMemcpy... operation to run at any given time; therefore only one direction (H2D or D2H) can actually be moving data at any given time, and you will only see one cudaMemcpy... operation overlap with any given kernel. And cudaStreamSynchronize causes the stream to wait until ALL CUDA operations previously issued to that stream are completed.
Note that the cudaStreamSynchronize you have in your last example should not be necessary, I think. Streams have 2 execution characteristics:
cuda operations (API calls, kernel calls, everything) issued to the same stream will always execute sequentially, regardless of your use of the Async API or any other considerations.
cuda operations issued to separate streams, assuming all the necessary requirements have been met, will execute asynchronously to each other.
Due to item 1, in your last case, your final "copy data GPU -> Host" operation will not begin until the previous kernel call issued to that stream is complete, even without the cudaStreamSynchronize call. So I think you can get rid of that call, i.e. the 2nd case you have listed should be no different than the final case, and in the 2nd case you should not need the sleep operation either. The cudaMemcpy... issued to the same stream will not begin until all previous cuda activity in that stream is finished. This is a characteristic of streams.
EDIT2: I'm not sure we're making any progress here. The issue you pointed out in the GTC presentation (slide 21) is a valid issue, but you can't work around it by inserting additional synchronization operations, nor would a "blocking kernel" help you with that, nor is it a function of having one copy engine or 2. If you issue kernels to separate streams but in sequence, with no other intervening cuda operations, then that hazard exists. The solution for this, as pointed out on the next slide, is to not issue the kernels sequentially, which is roughly comparable to your 2nd case (see the sketch after the list below). I'll state this again:
you have identified that your case 2 gives good concurrency
the sleep operation in that case is not needed for data integrity
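To make "do not issue the kernels sequentially" concrete, here is a sketch (my own example, not your code) that issues breadth-first across streams: all H2D copies first, then all kernels, then all D2H copies, so no D2H copy is queued ahead of a later stream's H2D copy:

#include <cuda_runtime.h>

__global__ void work(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main(void)
{
    const int nStreams = 4, n = 1 << 16;
    size_t bytes = n * sizeof(float);
    float *h[nStreams], *d[nStreams];
    cudaStream_t s[nStreams];

    for (int i = 0; i < nStreams; ++i) {
        cudaMallocHost(&h[i], bytes);    // pinned host memory
        cudaMalloc(&d[i], bytes);
        cudaStreamCreate(&s[i]);
    }

    // Breadth-first issue order: no stream's D2H copy sits in the single
    // copy queue ahead of another stream's H2D copy.
    for (int i = 0; i < nStreams; ++i)
        cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]);
    for (int i = 0; i < nStreams; ++i)
        work<<<(n + 255) / 256, 256, 0, s[i]>>>(d[i], n);
    for (int i = 0; i < nStreams; ++i)
        cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]);
    cudaDeviceSynchronize();
    return 0;
}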
If you want to provide a short sample code that demonstrates the issue, perhaps other discoveries can be made.
I am using a Quadro K2000M card, CUDA capability 3.0, CUDA Driver 5.5, runtime 5.0, programming with Visual Studio 2010. My GPU algorithm runs many parallel breadth-first searches (BFS) of a tree (which is constant). The threads are independent, except for reading from a constant array and the tree. In each thread there can be some malloc/free operations, following the BFS algorithm with queues (no recursion). There are N threads; the number of tree leaf nodes is also N. I use 256 threads per block and (N+256-1)/256 blocks per grid.
Now the problem is that the program works for up to N=100000 threads but fails for more than that. It also works on the CPU, or on the GPU thread by thread. When N is large (e.g. >100000), the kernel crashes and then the cudaMemcpy from device to host also fails. I tried Nsight, but it is too slow.
Now I set cudaDeviceSetLimit(cudaLimitMallocHeapSize, 268435456); I also tried larger values, up to 1G; cudaDeviceSetLimit succeeded but the problem remains.
Does anyone know a common reason for the above problem? Or any hints for further debugging? I tried to put in some printf's, but there is a ton of output. Moreover, once a thread crashes, all remaining printf's are discarded, so it is hard to identify the problem.
"CUDA Driver 5.5, runtime 5.0" -- that seems odd.
You might be running into a Windows TDR event. Based on your description, I would check that first. If, as you increase the number of threads, the kernel begins to take more than about 2 seconds to execute, you may hit the Windows timeout.
You should also add proper cuda error checking to your code, for all kernel calls and CUDA API calls. A Windows TDR event will be more easily evident based on the error codes you receive. Or the error codes may steer you in another direction.
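For reference, a minimal error-checking sketch (my own macro and a stand-in kernel name, not taken from your code); a TDR will typically show up here as a launch-timeout error string from cudaGetErrorString:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",            \
                    cudaGetErrorString(err), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

__global__ void bfsKernel(int *out) { out[threadIdx.x] = threadIdx.x; }  // stand-in kernel

int main(void)
{
    int *d_out;
    CUDA_CHECK(cudaMalloc(&d_out, 256 * sizeof(int)));

    bfsKernel<<<1, 256>>>(d_out);
    CUDA_CHECK(cudaGetLastError());        // catches launch configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());   // catches execution errors (e.g. a TDR)

    CUDA_CHECK(cudaFree(d_out));
    return 0;
}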
Finally, I would run your code with cuda-memcheck in both the passing and failing cases, looking for out-of-bounds accesses in the kernel or other issues.
I want my CPU and GPU to overlap computation; however, my GPU code contains some synchronous function calls like cudaBindTextureToArray() and cudaUnbindTexture() for which no asynchronous counterparts exist. Will these calls break GPU-CPU concurrency?
In general, the functions that may be asynchronous are listed here:
- Kernel launches;
- Memory copies between two addresses to the same device memory;
- Memory copies from host to device of a memory block of 64 KB or less;
- Memory copies performed by functions that are suffixed with Async;
- Memory set function calls.
Asynchronous functions usually have an Async suffix, and they will usually accept a stream parameter.
Functions that don't meet the above description should be assumed to be synchronous. Specific exceptions (like cudaSetDevice()) are usually evident from their description.
In the context of a single-device system, synchronous functions (with the exception of specific stream synchronizing functions like cudaStreamSynchronize and cudaStreamWaitEvent) will:
Wait to begin until all cuda activity has completed (i.e. all previous cuda API calls and kernel calls have completed)
Execute their designated activity (e.g. cudaMemcpy() will begin the designated copy operation after step 1 is complete)
Release the calling (host) thread after step 2 is complete
Therefore the calling (host) thread is blocked from the moment the cudaMemcpy() call is made until all previous cuda activity is complete and the cudaMemcpy() call is complete. I think most people would say this may "break" GPU-CPU concurrency, because for the duration of the sequence described above (steps 1-3) the CPU thread is effectively doing nothing.
Whether or not it makes much difference in your application will depend on what is happening before and after the synchronous call in question.
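A minimal sketch of that sequence (my own hypothetical example, not your code): the kernel launch is asynchronous, so host work placed before the synchronous cudaMemcpy() still overlaps, while host work placed after it does not.

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void work(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

void cpu_work(void) { /* independent host computation */ }

int main(void)
{
    const int n = 1 << 20;
    float *d_data, *h_data = (float *)malloc(n * sizeof(float));
    cudaMalloc(&d_data, n * sizeof(float));

    work<<<(n + 255) / 256, 256>>>(d_data, n);  // returns immediately
    cpu_work();                                 // overlaps with the kernel

    // Synchronous call: waits for the kernel (step 1), performs the copy
    // (step 2), then releases the host thread (step 3). Host work placed
    // after this line no longer overlaps with the kernel.
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    free(h_data);
    return 0;
}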
I am running into a very strange issue with the CUDA Runtime API. Calls to functions like cudaMallocHost(), cudaEventCreate(), cudaFree() etc. seem to be executed only when kernels finish execution on the GPU. These kernels are all launched on a stream created with the cudaStreamNonBlocking flag. What is the problem? Do I have to set some other flag somewhere?
They could be made asynchronous, but it wouldn't be surprising if they are not.
With respect to cudaMallocHost(), which requires that the host memory be mapped for the GPU: if the allocation can't be satisfied from a preallocated pool, the GPU's page tables must be edited. It would not surprise me in the least if the driver had a restriction where it could not edit the page tables of an executing kernel. (Esp. since the page table editing must be done by kernel mode driver code.)
With respect to cudaEventCreate(), that really should be asynchronous since those allocations generally can be satisfied from a preallocated pool. The main impediment there is that changing the behavior would break existing applications that rely on its current, synchronous behavior.
Freeing objects asynchronously requires the driver to track which objects are referenced in the command buffers submitted to the GPU, and defer the actual free operation until after the GPU has finished processing them. It is doable but I am not sure NVIDIA has done the work.
For cudaFree(), it is not possible to track references as you could for CUDA events (because pointers can be stored for running kernels to read and chase). So for large virtual address ranges that should be deallocated and unmapped, the free must be deferred until after all pending GPU operations have executed. Again, doable but I am not sure NVIDIA has done the work.
I think NVIDIA generally expects developers to work around the lack of asynchrony in these entry points.
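A sketch of that kind of workaround (my own example, not a documented NVIDIA recommendation): perform every potentially synchronizing call - cudaMallocHost(), cudaMalloc(), cudaEventCreate(), stream creation - up front while the GPU is idle, issue only truly asynchronous work afterwards, and defer cudaFree()/cudaFreeHost() until the GPU has drained.

#include <cuda_runtime.h>

__global__ void work(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *h_buf, *d_buf;
    cudaEvent_t done;
    cudaStream_t stream;

    // All potentially synchronizing setup happens here, while the GPU is idle.
    cudaMallocHost(&h_buf, n * sizeof(float));
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaEventCreate(&done);
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    // From here on, only truly asynchronous calls are issued.
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    work<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    cudaMemcpyAsync(h_buf, d_buf, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(done, stream);

    cudaEventSynchronize(done);

    // Teardown happens only after the GPU has finished all queued work.
    cudaStreamDestroy(stream);
    cudaEventDestroy(done);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}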