Rewriting memory allocated via cudaHostAlloc() - cuda

I have a 100MB character array (h_array) that is allocated using cudaHostAlloc() with the flag cudaHostAllocWriteCombined.
The program first copies data into h_array on the host. When h_array is full, it is copied to d_array on the device and some processing is done. When the processing is complete, h_array is reused in the sense that new data is copied into it again, starting from h_array[0]. The new data is meant to overwrite what was previously stored in h_array.
However, I'm getting a segmentation fault when the new data is copied to h_array after the processing is complete. There is no seg fault when I use regular malloc().
What is wrong? Can I not rewrite the memory when it's pinned?
Thank you!

Your CUDA context is probably getting yanked out from under you somehow.
For example, if you allocate the pinned host memory in a thread that then exits, the memory will go away.
Make sure the thread that performs the allocation sticks around!
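For reference, here is a minimal sketch of the intended pattern (buffer size and fill code are illustrative): the allocation happens once, in the thread that keeps the CUDA context alive, and the same pinned buffer is overwritten on every iteration.

#include <cuda_runtime.h>
#include <cstring>

int main() {
    const size_t N = 100 * 1024 * 1024;            // 100 MB, as in the question
    char *h_array = nullptr, *d_array = nullptr;

    // Allocate once, in the thread that keeps the CUDA context alive.
    cudaHostAlloc((void **)&h_array, N, cudaHostAllocWriteCombined);
    cudaMalloc((void **)&d_array, N);

    for (int iter = 0; iter < 3; ++iter) {
        // Overwrite the pinned buffer starting from h_array[0].
        // Write-combined memory is fine to write from the host,
        // but reading it back on the host is very slow.
        memset(h_array, iter, N);

        cudaMemcpy(d_array, h_array, N, cudaMemcpyHostToDevice);
        // ... launch processing kernels here ...
        cudaDeviceSynchronize();
    }

    cudaFree(d_array);
    cudaFreeHost(h_array);
    return 0;
}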

Related

Do I need to externally call flush if using cuda api to copy from GPU Memory to Persistent Memory?

I am using the CUDA API:
cudaMemcpyAsync ( void* dst, const void* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0 )
to copy data from GPU memory to CPU memory. When copying data from CPU memory to Persistent Memory using memcpy(), we need to explicitly call a flush operation (e.g. clflush()) to make sure the data has been flushed from the CPU caches. Do I need to call a flush operation when copying from GPU memory to Persistent Memory using cudaMemcpyAsync()?
Do I need to call the flush operation when copying from GPU Memory to Persistent Memory using cudaMemcpyAsync()?
No.
However, you are calling a potentially asynchronous API, so you may need to use one of the synchronization APIs (stream or device scope) in order to ensure data consistency between operations that can potentially overlap and need to access the same memory area.
Intel processors with the server uncore design starting with Sandy Bridge support Data Direct I/O (DDIO), which is enabled by default. With DDIO, an inbound PCIe write targeting system memory location of type WB is an allocating write transaction.
For a full write (that writes to an entire cache line), the IIO first obtains ownership of the target cache line by invalidating all copies in the coherence domain except in the L3 that exists in the same NUMA node to which the originating device is attached. If the line doesn't already exist in the target L3, an L3 entry is allocated, which may require evicting another line to make space. The write is performed in the L3 and the coherence state of the line becomes M. This means that the data is not sent to the memory controller to which its address is mapped. Partial writes are buffered in the IIO (which is in the coherence domain) until they are eventually evicted to be written into the LLC (allocate or update). In DDIO, reads are never allocating.
Even if DDIO is disabled, PCIe writes can be buffered in the IIO. When cudaMemcpyAsync or even cudaMemcpy returns, there is no guarantee that all writes have reached the persistence domain on Intel processors (unless you have Whole System Persistence). In addition, the memory copy is not guaranteed to be persistently atomic, and there is no guarantee about the order in which the bytes will move from the IIO to the target memory controllers. You need a flag to tell you whether the entire data was persisted or not.
You can use a barrier (cudaStreamSynchronize() or cudaDeviceSynchronize()) to wait on the host until the data copy operation is complete, and then flush each cache line, followed by writing a flag, in that order.
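A minimal sketch of that sequence, under the assumptions that pmem_dst points into an already-mapped persistent-memory region, that the CPU supports clflushopt (compile with -mclflushopt), and that the helper names (flush_range, copy_gpu_to_pmem, pmem_flag) are invented for the example:

#include <cuda_runtime.h>
#include <immintrin.h>   // _mm_clflushopt, _mm_sfence
#include <cstddef>
#include <cstdint>

// Flush every cache line covering [p, p + len) toward the persistence domain.
static void flush_range(const void *p, size_t len) {
    const uintptr_t line = 64;
    for (uintptr_t a = (uintptr_t)p & ~(line - 1); a < (uintptr_t)p + len; a += line)
        _mm_clflushopt((void *)a);
    _mm_sfence();                       // order the flushes before the flag store
}

void copy_gpu_to_pmem(void *pmem_dst, const void *d_src, size_t count,
                      volatile uint8_t *pmem_flag, cudaStream_t stream) {
    cudaMemcpyAsync(pmem_dst, d_src, count, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);      // wait until the copy itself has completed
    flush_range(pmem_dst, count);       // push the data out of the CPU caches
    *pmem_flag = 1;                     // publish "the data is persistent"
    flush_range((const void *)pmem_flag, 1);
}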

Would CPU continue executing the next line code when the GPU kernel is running? Would it cause an error?

I am reading the book "CUDA by Example" by Sanders, where the author writes on p. 441: For example, when we launched the kernel in our ray tracer, the GPU begins executing our code, but the CPU continues executing the next line of our program before the GPU finishes.
I am wondering whether this statement is correct. For example, what if the next instruction the CPU executes depends on variables that the GPU kernel outputs? Would that cause an error? From my experience, it does not cause an error. So what does the author really mean?
Many thanks!
Yes, the author is correct. Suppose my kernel launch looks like this:
int *h_in_data, *d_in_data, *h_out_data, *d_out_data;
// code to allocate host and device pointers, and initialize host data
...
// copy host data to device
cudaMemcpy(d_in_data, h_in_data, size_of_data, cudaMemcpyHostToDevice);
mykernel<<<grid, block>>>(d_in_data, d_out_data);
// some other host code happens here
// at this point, h_out_data does not point to valid data
...
cudaMemcpy(h_out_data, d_out_data, size_of_data, cudaMemcpyDeviceToHost);
//h_out_data now points to valid data
Immediately after the kernel launch, the CPU continues executing host code. But the data generated by the device (either d_out_data or h_out_data) is not ready yet. If the host code attempts to use whatever is pointed to by h_out_data, it will just be garbage data. This data only becomes valid after the 2nd cudaMemcpy operation.
Note that using the data (h_out_data) before the 2nd cudaMemcpy will not generate an error, if by that you mean a segmentation fault or some other run time error. But any results generated will not be correct.
Kernel launches in CUDA are asynchronous by default, i.e., control returns to the CPU right after the launch. If the next instruction on the CPU is another kernel launch in the same stream, you don't need to worry: that kernel will only start executing after the previously launched kernel has finished.
However, if the next instruction is host code that accesses the results of the kernel, you can end up reading garbage values. Therefore, care has to be taken, and device synchronization should be done as and when needed; a small sketch follows.
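To make that concrete, here is a minimal sketch (it uses managed memory for brevity, which is not part of the book's example) showing why a synchronization call is needed before the host touches data produced by a kernel:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void square(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

int main() {
    const int n = 256;
    int *data = nullptr;
    cudaMallocManaged((void **)&data, n * sizeof(int));   // visible to host and device
    for (int i = 0; i < n; ++i) data[i] = i;

    square<<<(n + 127) / 128, 128>>>(data, n);

    // Without this synchronization the host could read stale values,
    // because the kernel launch returns immediately.
    cudaDeviceSynchronize();

    printf("data[10] = %d\n", data[10]);                  // prints 100
    cudaFree(data);
    return 0;
}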

When is safe to reuse CPU buffer when calling cudaMemcpyAsync?

My project will have multiple threads, each one issuing kernel executions on different cudaStreams. Another thread will consume the results, which will be stored in a queue. Some pseudo-code here:
while (true) {
    cudaMemcpyAsync(d_mem, h_mem, size, cudaMemcpyHostToDevice, some_stream);
    kernel_launch(some_stream);
    cudaMemcpyAsync(h_queue_results[i++], d_result, size, cudaMemcpyDeviceToHost, some_stream);
}
Is it safe to reuse h_mem after the first cudaMemcpyAsync returns? Or should I use N host buffers for issuing the GPU computation?
How do I know when h_mem can be reused? Should I do some synchronization using cudaEvents?
BTW, h_mem is host-pinned. If it were pageable, could I reuse h_mem immediately? From what I have read here, it seems I could reuse it immediately after cudaMemcpyAsync returns. Am I right?
Asynchronous
For transfers from pageable host memory to device memory, host memory is copied to a staging buffer immediately (no device synchronization is performed). The function will return once the pageable buffer has been copied to the staging memory. The DMA transfer to final destination may not have completed. For transfers between pinned host memory and device memory, the function is fully asynchronous. For transfers from device memory to pageable host memory, the function will return only once the copy has completed. For all other transfers, the function is fully asynchronous. If pageable memory must first be staged to pinned memory, this will be handled asynchronously with a worker thread. For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.
MemcpyAsynchronousBehavior
Thanks!
In order to get copy/compute overlap, you must use pinned memory. The reason for this is contained in the paragraph you excerpted. Presumably the whole reason for your multi-streamed approach is for copy/compute overlap, so I don't think the correct answer is to switch to using pageable memory buffers.
Regarding your question, assuming h_mem is only used as the source buffer for the pseudo-code you've shown here (i.e. the data in it only participates in that one cudaMemcpyAsync call), then the h_mem buffer is no longer needed once the next CUDA operation in that stream begins. So if your kernel_launch were an actual kernel<<<...>>>(...), then once that kernel begins, you can be assured that the previous cudaMemcpyAsync is complete.
You could use cudaEvents with cudaEventSynchronize() or cudaStreamWaitEvent(), or you could use cudaStreamSynchronize() directly in the stream. For example, if you have a cudaStreamSynchronize() call somewhere in the stream pseudocode you have shown, and it is after the cudaMemcpyAsync call, then any code after the cudaStreamSynchronize() call is guaranteed to be executing after the cudaMemcpyAsync() call is complete. All of the calls I've referenced are documented in the usual place.
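As a sketch of the event-based option, here is one iteration of the question's loop (names follow the pseudo-code; the kernel body and sizes are placeholders, and h_queue_slot is assumed to be pinned). The host waits only on an event recorded right after the host-to-device copy, so h_mem can be refilled while the kernel and the device-to-host copy are still in flight:

#include <cuda_runtime.h>

// Names (h_mem, d_mem, d_result, mykernel) mirror the question's pseudo-code;
// the kernel body is just a placeholder.
__global__ void mykernel(const char *in, char *out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1;
}

void produce_one_item(char *h_mem, char *d_mem, char *d_result,
                      char *h_queue_slot, size_t n, cudaStream_t stream) {
    cudaEvent_t h2d_done;
    cudaEventCreateWithFlags(&h2d_done, cudaEventDisableTiming);

    cudaMemcpyAsync(d_mem, h_mem, n, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(h2d_done, stream);              // marks the end of the H2D copy

    unsigned grid = (unsigned)((n + 255) / 256);
    mykernel<<<grid, 256, 0, stream>>>(d_mem, d_result, n);
    cudaMemcpyAsync(h_queue_slot, d_result, n, cudaMemcpyDeviceToHost, stream);

    // Block the host only until the H2D copy has finished; from here on,
    // h_mem can be refilled while the kernel and the D2H copy still run.
    cudaEventSynchronize(h2d_done);
    cudaEventDestroy(h2d_done);
}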

Persistent GPU shared memory

I am new to CUDA programming, and I am mostly working with shared memory per block because of performance reasons. The way my program is structured right now, I use one kernel to load the shared memory and another kernel to read the pre-loaded shared memory. But, as I understand it, shared memory cannot persist between two different kernels.
I have two solutions in mind; I am not sure about the first one, and the second might be slow.
First Solution: Instead of using two kernels, I use one kernel. After loading the shared memory, the kernel may wait for an input from the host, perform the operation and then return the value to host. I am not sure whether a kernel can wait for a signal from the host.
Second solution: After loading the shared memory, copy the shared memory value in the global memory. When the next kernel is launched, copy the value from global memory back into the shared memory and then perform the operation.
Please comment on the feasibility of the two solutions.
I would use a variation of your proposed first solution: as you already suspected, you can't wait for host input in a kernel, but you can merge your two kernels into one and synchronise at a point inside it. Just call __syncthreads(); in your kernel after loading your data into shared memory.
I don't really understand your second solution: why would you copy data to shared memory just to copy it back to global memory in the first kernel? Or would that first kernel also compute something? In that case, I guess it will not help to store the preliminary results in shared memory first; I would rather store them directly in global memory (however, this might depend on the algorithm).
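A minimal sketch of that second solution (kernel names, tile size, and data are illustrative): the first kernel stages its shared-memory contents to global memory, and the second kernel reloads them.

#include <cuda_runtime.h>

#define TILE 256

// Kernel 1: build something in shared memory, then save it to global memory
// so it survives past the end of the kernel.
__global__ void produce(float *g_saved) {
    __shared__ float tile[TILE];
    int i = threadIdx.x;
    tile[i] = (float)i * 2.0f;                  // compute/load the shared data
    __syncthreads();
    g_saved[blockIdx.x * TILE + i] = tile[i];   // persist it to global memory
}

// Kernel 2: restore the saved data into shared memory and keep working on it.
__global__ void consume(const float *g_saved, float *g_out) {
    __shared__ float tile[TILE];
    int i = threadIdx.x;
    tile[i] = g_saved[blockIdx.x * TILE + i];   // reload from global memory
    __syncthreads();
    g_out[blockIdx.x * TILE + i] = tile[i] + 1.0f;
}

int main() {
    const int blocks = 4;
    float *g_saved, *g_out;
    cudaMalloc((void **)&g_saved, blocks * TILE * sizeof(float));
    cudaMalloc((void **)&g_out,   blocks * TILE * sizeof(float));

    produce<<<blocks, TILE>>>(g_saved);         // stage shared-memory results to global memory
    consume<<<blocks, TILE>>>(g_saved, g_out);  // reload them into shared memory
    cudaDeviceSynchronize();

    cudaFree(g_saved);
    cudaFree(g_out);
    return 0;
}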

Accessing cuda device memory when the cuda kernel is running

I have allocated memory on device using cudaMalloc and have passed it to a kernel function. Is it possible to access that memory from host before the kernel finishes its execution?
The only way I can think of to get a memcpy to kick off while the kernel is still executing is by submitting an asynchronous memcpy in a different stream than the kernel. (If you use the default APIs for either kernel launch or asynchronous memcpy, the NULL stream will force the two operations to be serialized.)
But because there is no way to synchronize a kernel's execution with a stream, that code would be subject to a race condition. i.e. the copy engine might pull from memory that hasn't yet been written by the kernel.
The person who alluded to mapped pinned memory is onto something: if the kernel writes to mapped pinned memory, it is effectively "copying" data to host memory as it finishes processing it. This idiom works nicely, provided the kernel will not be touching the data again.
It is possible, but there's no guarantee as to the contents of the memory you retrieve in such a way, since you don't know what the progress of the kernel is.
What you're trying to achieve is to overlap data transfer and execution. That is possible through the use of streams. You create multiple CUDA streams and queue a kernel execution and a device-to-host cudaMemcpy in each stream. For example, put the kernel that fills location "0" and the cudaMemcpy from that location back to the host into stream 0, the kernel that fills location "1" and the cudaMemcpy from "1" into stream 1, etc. What will happen then is that the GPU will overlap copying from "0" with executing "1".
Check the CUDA documentation; it's documented somewhere (in the Best Practices Guide, I think).
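Here is a minimal sketch of that idea, assuming four streams and an illustrative fill_chunk kernel; pinned host memory (cudaMallocHost) is required for the copies to actually overlap with the kernels.

#include <cuda_runtime.h>

__global__ void fill_chunk(float *chunk, int n, float value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] = value;
}

int main() {
    const int nStreams = 4;
    const int chunk = 1 << 20;
    float *d_buf, *h_buf;
    cudaMalloc((void **)&d_buf, nStreams * chunk * sizeof(float));
    cudaMallocHost((void **)&h_buf, nStreams * chunk * sizeof(float));   // pinned: required for overlap

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        float *d_chunk = d_buf + (size_t)s * chunk;
        fill_chunk<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_chunk, chunk, (float)s);
        // The copy of chunk s can overlap with the kernel working on chunk s + 1.
        cudaMemcpyAsync(h_buf + (size_t)s * chunk, d_chunk, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}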
You can't access GPU memory directly from the host, regardless of whether a kernel is running or not.
If you're talking about copying that memory back to the host before the kernel is finished writing to it, then the answer depends on the compute capability of your device. But all but the very oldest chips can perform data transfers while the kernel is running.
It seems unlikely that you would want to copy memory that is still being updated by a kernel though. You would get some random snapshot of partially finished data. Instead, you might want to set up something where you have two buffers on the device. You can copy one of the buffers while the GPU is working on the other.
Update:
Based on your clarification, I think the closest you can get is using mapped page-locked host memory, also called zero-copy memory. With this approach, values are copied to the host as they are written by the kernel. There is no way to query the kernel to see how much of the work it has performed, so I think you would have to repeatedly scan the memory for newly written values. See section 3.2.4.3, Mapped Memory, in the CUDA Programming Guide v4.2 for a bit more information.
I wouldn't recommend this though. Unless you have some very unusual requirements, there is likely to be a better way to accomplish your task.
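For illustration, a minimal zero-copy sketch along those lines; the kernel, the progress counter, and the buffer sizes are all made up for the example. The host polls the mapped memory and sees values while the kernel is still running.

#include <cuda_runtime.h>
#include <cstdio>

// Single-threaded toy kernel: fills `out` and publishes its progress through
// mapped (zero-copy) host memory so the host can watch it while it runs.
__global__ void produce(volatile int *out, volatile int *progress, int n) {
    for (int i = 0; i < n; ++i) {
        out[i] = i + 1;
        __threadfence_system();     // make out[i] visible to the host before the counter
        *progress = i + 1;
    }
}

int main() {
    const int n = 1 << 15;
    int *h_out, *h_progress, *d_out, *d_progress;

    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **)&h_out, n * sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc((void **)&h_progress, sizeof(int), cudaHostAllocMapped);
    *h_progress = 0;
    cudaHostGetDevicePointer((void **)&d_out, h_out, 0);
    cudaHostGetDevicePointer((void **)&d_progress, h_progress, 0);

    produce<<<1, 1>>>(d_out, d_progress, n);

    // The host sees the values as the kernel writes them; no cudaMemcpy needed.
    int next_report = n / 4, p;
    while ((p = *((volatile int *)h_progress)) < n) {
        if (p >= next_report) {
            printf("kernel has produced %d of %d values so far\n", p, n);
            next_report += n / 4;
        }
    }
    printf("kernel done, h_out[n-1] = %d\n", h_out[n - 1]);

    cudaDeviceSynchronize();
    cudaFreeHost(h_out);
    cudaFreeHost(h_progress);
    return 0;
}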
When you launch the kernel, it is an asynchronous (non-blocking) call. A subsequent cudaMemcpy will block until the kernel has finished.
If you want the result for debugging purposes, you may be able to use a CUDA debugger such as cuda-gdb, which lets you step through the kernel and inspect memory.
For small result checks you could also use printf() in the kernel code.
Or run only a thread block of size (1,1) if you are interested in that specific result.
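For example, a tiny spot-check along those lines (kernel and values are illustrative):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void inspect(const float *data, int i) {
    // Device-side printf: handy for spot-checking a single value.
    printf("data[%d] = %f\n", i, data[i]);
}

int main() {
    float h_val[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float *d_data;
    cudaMalloc((void **)&d_data, sizeof(h_val));
    cudaMemcpy(d_data, h_val, sizeof(h_val), cudaMemcpyHostToDevice);

    inspect<<<1, 1>>>(d_data, 2);      // a single block with a single thread
    cudaDeviceSynchronize();           // also flushes the device-side printf output

    cudaFree(d_data);
    return 0;
}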