does cublasSetMatrix work if used in a separate pthread?
I want to overlap tasks on the CPU with CPU->GPU data transfer. However, since the data is quite large, I'm trying to avoid allocating a large amount of pinned memory.
No it won't. GPU contexts are tied to the threads that created them. If you try running cublasSetMatrix or cudaMemcpy in another thread without doing anything else, it will create another context. Memory allocations are not portable between contexts; effectively, every context has its own virtual address space. The result is that you wind up with two GPU contexts, and the copy will fail.
The requirement for pinned memory comes from the CUDA driver. For overlapping copying and execution, the host memory involved in the copying must be in a physical address range that the GPU can access by DMA over the PCI-e bus. That is why pinning is required, otherwise the host memory could be in swap space or other virtual memory, and the DMA transaction would fail.
If you are worried about the amount of pinned host memory required for large problems, try using one or two smaller pinned buffers and executing multiple transfers, using the pinned memory as staging buffers. The performance won't be quite as good as using a single large pinned buffer and one big transfer, but you can still achieve useful kernel/copy overlap and hide a lot of PCI-e latency in the process.
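To make that concrete, here is a minimal sketch of the staging-buffer idea, assuming a hypothetical pageable source array h_src and device destination d_dst (the names, chunk size, and two-buffer scheme are illustrative, not from the original question):

#include <cuda_runtime.h>
#include <cstring>

void staged_upload(const float* h_src, float* d_dst, size_t n)   // n = number of floats
{
    const size_t CHUNK = 1 << 20;                 // 1M floats per staging buffer (arbitrary)
    float*       h_stage[2];
    cudaStream_t stream[2];

    for (int i = 0; i < 2; ++i) {
        cudaMallocHost((void**)&h_stage[i], CHUNK * sizeof(float));   // small pinned buffers
        cudaStreamCreate(&stream[i]);
    }

    for (size_t off = 0, buf = 0; off < n; off += CHUNK, buf ^= 1) {
        size_t count = (n - off < CHUNK) ? (n - off) : CHUNK;
        cudaStreamSynchronize(stream[buf]);                           // wait until this staging buffer is free again
        memcpy(h_stage[buf], h_src + off, count * sizeof(float));     // CPU copy into pinned staging buffer
        cudaMemcpyAsync(d_dst + off, h_stage[buf], count * sizeof(float),
                        cudaMemcpyHostToDevice, stream[buf]);         // async DMA from the pinned buffer
    }

    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaFreeHost(h_stage[i]);
    }
}

Each chunk costs an extra host-side memcpy into the staging buffer, which is the price of not pinning the whole array; with two buffers and two streams, the CPU-side staging copy of one chunk can overlap the DMA of the previous one.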
I am writing a simplistic raytracer. The idea is that for every pixel there is a thread that traverses a certain structure (geometry) that resides in global memory.
I invoke my kernel like so:
trace<<<gridDim, blockDim>>>(width, height, frameBuffer, scene)
Where scene is a structure that was previously allocated with cudaMalloc. Every thread has to start traversing this structure starting from the same node, and chances are that many concurrent threads will attempt to read the same nodes many times. Does that mean that when such reads take place, it cripples the degree of parallelism?
Given that geometry is large, I would assume that replicating it is not an option. I mean the whole processing still happens fairly fast, but I was wondering whether it is something that has to be dealt with, or simply left flung to the breeze.
First of all, I think you have the wrong idea when you say that concurrent reads may or may not cripple the degree of parallelism; being read concurrently by many threads is exactly what parallel execution means. What you should be asking instead is whether performance suffers from the extra memory traffic when every thread basically wants the same thing, i.e. the same node.
According to the article here, memory accesses can be coalesced only within a warp, and only when data locality is present.
That means that if the threads within a warp try to access memory locations near one another, those accesses can be coalesced. In your case, each thread tries to access the "same" node until it reaches the point where the threads branch off.
So the memory accesses will be coalesced within the warp until the threads diverge.
Efficient access to global memory from each thread depends on both your device architecture and your code. Arrays allocated in global memory are aligned to 256-byte memory segments by the CUDA driver. The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size. The device coalesces the global memory loads and stores issued by the threads of a warp into as few transactions as possible to minimize DRAM bandwidth. Misaligned data accesses on devices with compute capability less than 2.0 reduce the effective bandwidth of accessing data; this is not a serious issue on devices with compute capability 2.0 or higher. That said, pretty much regardless of your device generation, accessing global memory with large strides yields poor effective bandwidth (Reference). I would assume that random access behaves similarly.
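As a toy illustration of the stride point (the kernels and names below are made up for this answer, not taken from your ray tracer):

// Neighbouring threads read neighbouring words: one warp touches only one or
// two 128-byte segments, so the loads coalesce well.
__global__ void copy_unit_stride(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// With a large stride, each thread of a warp lands in a different memory
// segment, so the hardware needs many transactions per warp-load and the
// effective bandwidth collapses.
__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}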
If you are also modifying the structure while the threads are reading it, and I assume you are (if it's a scene you probably update it each frame?), then yes, it cripples performance and may cause undefined behaviour. This is called a race condition. You can use atomic operations to overcome this type of problem; atomic operations guarantee that such race conditions don't happen.
You can also try stuffing the 'scene' into shared memory, if it fits.
Another option is to use streams to increase concurrency; streams also give you some ordering, since kernels launched into the same stream run one after the other.
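A minimal sketch of the shared-memory suggestion, under assumptions of my own: the Node layout and the trace_from_cache() helper are hypothetical placeholders, and only the first few root-side nodes (the ones every thread hits) are staged:

struct Node { int left, right; float bbox[6]; };        // placeholder node layout

#define NODE_CACHE 64                                    // how many root-side nodes to stage

// Hypothetical traversal helper; a real ray tracer would walk the tree here.
__device__ float trace_from_cache(const Node* cache, const Node* scene, int x, int y)
{
    return cache[0].bbox[0];                             // placeholder body
}

__global__ void trace(int width, int height, float* frameBuffer, const Node* scene)
{
    __shared__ Node cache[NODE_CACHE];

    // cooperative load: the block fills the cache once, then every thread reads from it
    int t = threadIdx.y * blockDim.x + threadIdx.x;
    for (int i = t; i < NODE_CACHE; i += blockDim.x * blockDim.y)
        cache[i] = scene[i];
    __syncthreads();

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        frameBuffer[y * width + x] = trace_from_cache(cache, scene, x, y);
}

This only helps for nodes that every thread genuinely revisits and that fit in the 16-48 KB of shared memory per multiprocessor; deeper parts of the tree still come from global memory, where the L1/L2 caches already help on compute capability 2.0 and above.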
CUDA device memory can be allocated using cudaMalloc/cudaFree, sure. This is fine, but primitive.
I'm curious to know, is device memory virtualised in some way? Are there equivalent operations to mmap, and more importantly, mremap for device memory?
If device memory is virtualised, I expect these sorts of functions should exist. It seems modern GPU drivers implement paging when there is contention for limited video resources by multiple processes, which suggests it's virtualised in some way or another...
Does anyone know where I can read more about this?
Edit:
Okay, my question was a bit general. I've read the bits of the manual that talk about mapping system memory for device access. I was more interested in device-allocated memory however.
Specific questions:
- Is there any possible way to remap device memory? (ie, to grow a device allocation)
- Is it possible to map device allocated memory to system memory?
- Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?
I have cases where the memory is used by the GPU 99% of the time; so it should be device-local, but it may be convenient to map device memory to system memory for occasional structured read-back without having to implement an awkward deep-copy.
Yes, unified memory exists, however I'm happy with explicit allocation, save for the odd moment when I want a sneaky read-back.
I've found the manual fairly light on detail in general.
CUDA comes with a fine CUDA C Programming Guide as its main manual, which has sections on Mapped Memory as well as Unified Memory Programming.
Responding to your additional posted questions, and following your cue to leave UM out of the consideration:
Is there any possible way to remap device memory? (ie, to grow a device allocation)
There is no direct method. You would have to manually create a new allocation of the desired size, copy the old data to it, and then free the old allocation. If you expect to do this a lot, and don't mind the significant overhead associated with it, you could take a look at thrust device vectors, which will hide some of the manual labor and allow you to resize an allocation in a single vector-style .resize() operation. There's no magic, however: thrust is just a template library built on top of CUDA C (for the CUDA device backend), so underneath it performs a sequence of cudaMalloc, cudaMemcpy, and cudaFree operations, just as you would "manually".
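A hedged sketch of that manual "remap", with an invented helper name; it is exactly the allocate / copy / free sequence described above:

#include <cuda_runtime.h>

// Grow a device allocation from old_bytes to new_bytes, preserving its contents.
cudaError_t grow_device_buffer(void** d_buf, size_t old_bytes, size_t new_bytes)
{
    void* d_new = 0;
    cudaError_t err = cudaMalloc(&d_new, new_bytes);       // new, larger allocation
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(d_new, *d_buf, old_bytes, cudaMemcpyDeviceToDevice);
    if (err != cudaSuccess) { cudaFree(d_new); return err; }

    cudaFree(*d_buf);        // release the old allocation
    *d_buf = d_new;          // caller now holds the bigger buffer
    return cudaSuccess;
}

With thrust the same thing collapses to a call like v.resize(new_size) on a thrust::device_vector, but as noted it performs the same allocate/copy/free work underneath.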
Is it possible to map device allocated memory to system memory?
Leaving aside UM, no. Device memory cannot be mapped into host address space.
Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?
No, host-mapped data is never duplicated in device memory, and apart from L2 caching, mapped data needed by the GPU will always be fetched across the PCI-E bus.
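For reference, a minimal sketch of setting up mapped (zero-copy) pinned memory; the size N and the variable names are arbitrary choices of mine:

#include <cuda_runtime.h>

int main()
{
    const size_t N = 1 << 20;                               // arbitrary element count
    float *h_buf = 0, *d_alias = 0;

    cudaSetDeviceFlags(cudaDeviceMapHost);                  // must precede context creation
    cudaHostAlloc((void**)&h_buf, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_alias, h_buf, 0);   // device-side alias of h_buf

    // Kernels can dereference d_alias directly. No cudaMemcpy is needed, but
    // no device-side copy of the data exists either: every access the GPU
    // makes (beyond L2 caching) travels over the PCI-E bus.

    cudaFreeHost(h_buf);
    return 0;
}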
This is a follow-up question to "CUDA Global Memory, Where is it?", in reference to GSmith's response. These questions address the CC > 2.0 case.
When I look up the specs of my NVIDIA card, it lists 2GB of 'memory'. I've come to believe this is the 'Global' memory for this card. That is, this is GDDR3 memory that resides off-chip, but on the card. Would this be correct?
I don't see any case where the spec'd 'memory' is zero. Does one exist? That is, can I have a card with no off-chip memory, where all my texture, local, and constant memory actually resides in pinned & mapped host memory?
Can I extend my global memory usage by pinning more than 2GB of host memory? Can I use all my off-chip global memory (2GB) and add (1GB) more global pinned memory? Or am I to understand that this card is only capable of addressing 2GB max, i.e. that I can only access 2GB of memory total - unpinned, pinned, mapped, or any combination?
If the device is using pinned host memory (not mapped), do I need to Memcpy from dev to host? That is, the mem is physically on the host side. And it is being used by the device, so they can both see it. Why do I need to copy it to the host when it is already there? It seems to be 'mapped' by default. (What mechanism is preventing this dual access?)
How does one go about mapping shared mem to global mem? (I'm not finding any mention of this in the docs.) Is this a 'mapped' arrangement, or do I still need to copy it from global to shared and back again? (Could this save me a copy step?)
It's recommended to ask one question per question.
When I look up the specs of my NVIDIA card, it lists 2GB of 'memory'. I've come to believe this is the 'Global' memory for this card. That is, this is GDDR3 memory that resides off-chip, but on the card. Would this be correct?
Yes.
I don't see any case where the spec'd 'memory' is zero. Does one exist? That is, can I have a card with no off-chip memory, where all my texture, local, and constant memory actually resides in pinned & mapped host memory?
The closest NVIDIA came to this idea was probably in the Ion 2 chipset. But there are no CUDA-capable NVIDIA discrete graphics cards with zero on-board (off-chip) memory.
Can I extend my global memory usage by pinning more than 2GB of host memory?
You can pin more than 2GB of host memory. However this does not extend global memory. It does enable a variety of things such as improved host-device transfer rates, overlapped copy and compute, and zero-copy access of host memory from the GPU, but this is not the same as what you use global memory for. Zero-copy techniques perhaps come the closest to extending global memory onto host memory (conceptually) but zero copy is very slow from a GPU standpoint.
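If the goal is just faster transfers for memory you have already allocated, newer CUDA versions can also page-lock an existing allocation in place. A hedged fragment (buffer and size are made up; some CUDA versions want page-aligned buffers, hence the aligned allocation):

#include <cuda_runtime.h>
#include <stdlib.h>

void pin_existing(size_t bytes)
{
    float* h_buf = 0;
    posix_memalign((void**)&h_buf, 4096, bytes);              // page-aligned ordinary allocation
    cudaHostRegister(h_buf, bytes, cudaHostRegisterDefault);  // page-lock it in place

    /* cudaMemcpy / cudaMemcpyAsync to and from h_buf now behave like
       transfers from cudaMallocHost'ed memory and can overlap with kernels */

    cudaHostUnregister(h_buf);
    free(h_buf);
}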
If the device is using pinned host memory (not mapped), do I need to Memcpy from dev to host?
Yes you still need to cudaMemcpy data back and forth.
That is, the mem is physically on the host side. And it is being used by the device
I don't know where this concept is coming from. Perhaps you are referring to zero-copy, but zero-copy is relatively slow compared to accessing data that is in global memory. It should be used judiciously in cases of small data sizes, and is by no means a straightforward way to provide a bulk increase to the effective size of the global memory on the card.
How does one go about mapping shared mem to global mem?
Shared memory is not automatically mapped to global memory. The methodology is to copy the data you need back and forth between shared and global memory.
What is the fastest way to transfer a 256-byte block of data from one CUDA block to another?
And is there any way to transfer it faster than through global memory?
In theory, on devices of compute capability >= 2.0, transfers between blocks, using global memory, could be very fast because global memory transactions use the L1 and L2 caches.
However, the only way to safely transfer memory between blocks is to launch those blocks in separate kernel invocations. Then, you lose the theoretical advantage I just described, as the caches are flushed between invocations.
Within a given kernel invocation, you cannot know in which order your blocks will be launched.
Transferring data between blocks launched by separate kernel invocations is a common paradigm in CUDA and if there is enough computational work to be done, the latency of the global memory transactions can be completely hidden.
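A minimal sketch of that pattern, with invented kernel names: the first launch publishes one value per block to a global buffer, and the second launch, which cannot start until the first finishes, consumes them:

__global__ void produce(float* block_results)
{
    // ... each block computes something; thread 0 publishes the block's result ...
    if (threadIdx.x == 0)
        block_results[blockIdx.x] = (float)blockIdx.x;   // placeholder result
}

__global__ void consume(const float* block_results, int num_blocks, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_blocks)
        out[i] = 2.0f * block_results[i];                // placeholder use of another block's data
}

// Host side: back-to-back launches in the same (default) stream, so all of
// produce's global-memory writes are visible to consume:
//   produce<<<num_blocks, 256>>>(d_results);
//   consume<<<(num_blocks + 255) / 256, 256>>>(d_results, num_blocks, d_out);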
I use cudaMemcpy() one time to copy exactly 1GB of data to the device. This takes 5.9s. The other way round it takes 5.1s. Is this normal? Does the function itself have so much overhead before copying?
Theoretically, there should be a throughput of at least 4GB/s over the PCIe bus.
There is no overlapping of memory transfers, because the Tesla C870 simply does not support it. Any hints?
EDIT 2: my test program + updated timings; I hope it is not too much to read!
The cutCreateTimer() function won't compile for me: 'error: identifier "cutCreateTimer" is undefined' - this could be related to the old CUDA version (2.0) installed on the machine.
__host__ void time_int(int print)
{
    static struct timeval t1;   /* var for previous time stamp */
    static struct timeval t2;   /* var of current time stamp  */
    double time;

    if (gettimeofday(&t2, 0) == -1) return;

    if (print != 0) {           /* print elapsed time since the previous call */
        time = (double) (t2.tv_sec - t1.tv_sec)
             + ((double) (t2.tv_usec - t1.tv_usec)) / 1000000.0;
        printf(...);
    }

    t1 = t2;
}
main:

    time_int(0);

    void *x;
    cudaMallocHost(&x, 1073741824);
    void *y;
    cudaMalloc(&y, 1073741824);

    time_int(1);

    cudaMemcpy(y, x, 1073741824, cudaMemcpyHostToDevice);

    time_int(1);

    cudaMemcpy(x, y, 1073741824, cudaMemcpyDeviceToHost);

    time_int(1);
Displayed timings are:
0.86 s allocation
0.197 s first copy
5.02 s second copy
The weird thing is: although it displays 0.197 s for the first copy, it seems to take much longer when I watch the program run.
Yes, this is normal. cudaMemcpy() does a lot of checking and work if the host memory was allocated with an ordinary malloc() or mmap(): it has to make sure that every page of the data is resident in memory and move the pages, one by one, to the driver.
You can use the cudaHostAlloc() or cudaMallocHost() function to allocate the memory instead of malloc(). It allocates pinned memory, which always stays in RAM and can be accessed directly by the GPU's DMA engine (so cudaMemcpy() is faster). Quoting from the first link:
Allocates count bytes of host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy().
The only limiting factor is that the total amount of pinned memory in a system is limited (no more than the RAM size; it is better to use no more than RAM minus 1GB):
Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to allocate staging areas for data exchange between host and device.
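Incidentally, a more robust way to time the copies than gettimeofday() is to bracket them with CUDA events. A sketch you could drop into the posted main(), reusing the x and y pointers and the 1073741824-byte size from the snippet above:

cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cudaMemcpy(y, x, 1073741824, cudaMemcpyHostToDevice);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                        /* wait for the copy to finish */
cudaEventElapsedTime(&ms, start, stop);            /* elapsed time in milliseconds */
printf("H2D: %.1f ms (%.2f GB/s)\n", ms, (1073741824.0 / 1e9) / (ms / 1000.0));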
Assuming the transfers are timed accurately, 1.1 seconds for a transfer of 1 GB from pinned memory seems slow. Are you sure the PCIe slot is configured to the correct width? For full performance, you'd want an x16 configuration. Some platforms provide two slots, one of which is configured as x16, the other as x4. So if your machine has two slots, you might want to try moving the card into the other slot. Other systems have two slots where you get x16 if only one slot is occupied, but two x8 slots if both are occupied. The BIOS setup may help in figuring out how the PCIe slots are configured.
The Tesla C870 is rather old technology, but if I recall correctly, transfer rates of around 2 GB/s from pinned memory should be possible with these parts, which used a 1st-generation PCIe interface. Current Fermi-class GPUs use a PCIe gen 2 interface and can achieve 5+ GB/s for transfers from pinned memory (for throughput measurements, 1 GB/s = 10^9 bytes/s).
Note that PCIe uses a packetized transport, and the packet overhead can be significant at the packet sizes supported by common chipsets, with newer chipsets typically supporting somewhat longer packets. One is unlikely to exceed 70% of the nominal per-direction maximum (4 GB/s for PCIe 1.0 x16, 8 GB/s for PCIe 2.0 x16), even for transfers from / to pinned host memory. Here is a white paper that explains the overhead issue and has a handy graph showing the utilization achievable with various packet sizes:
http://www.plxtech.com/files/pdf/technical/expresslane/Choosing_PCIe_Packet_Payload_Size.pdf
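To put rough numbers on that: 70% of the nominal 4 GB/s of a PCIe 1.0 x16 link is about 2.8 GB/s, and 70% of the 8 GB/s of a PCIe 2.0 x16 link is about 5.6 GB/s, which lines up with the ~2 GB/s (C870) and 5+ GB/s (Fermi) pinned-memory figures quoted above.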
Other than a system that just is not configured properly, the best explanation for dreadful PCIe bandwidth is a mismatch between IOH/socket and the PCIe slot that the GPU is plugged into.
Most multi-socket Intel i7-class (Nehalem, Westmere) motherboards have one I/O hub per socket. Since the system memory is directly connected to each CPU, DMA accesses that are "local" (fetching memory from the CPU connected to the same IOH as the GPU doing the DMA access) are much faster than nonlocal ones (fetching memory from the CPU connected to the other IOH, a transaction that has to be satisfied via the QPI interconnect that links the two CPUs).
IMPORTANT NOTE: unfortunately it is common for SBIOS's to configure systems for interleaving, which causes contiguous memory allocations to be interleaved between the sockets. This mitigates performance cliffs from local/nonlocal access for the CPUs (one way to think of it: it makes all memory accesses equally bad for both sockets), but wreaks havoc with GPU access to the data since it causes every other page on a 2-socket system to be nonlocal.
Nehalem and Westmere class systems don't seem to suffer from this problem if the system only has one IOH.
(By the way, Sandy Bridge class processors take another step down this path by integrating the PCI Express support into the CPU, so with Sandy Bridge, multi-socket machines automatically have multiple IOH's.)
You can investigate this hypothesis by either running your test using a tool that pins it to a socket (numactl on Linux, if it's available) or by using platform-dependent code to steer the allocations and threads to run on a specific socket. You can learn a lot without getting fancy - just call a function with global effects at the beginning of main() to force everything onto one socket or another, and see if that has a big impact on your PCIe transfer performance.
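For example, on Linux a hedged sketch of the "global effect at the start of main()" approach using libnuma (link with -lnuma; the node number is an assumption about your topology):

#include <numa.h>        /* libnuma; link with -lnuma */

int pin_to_socket(int node)
{
    if (numa_available() < 0) return -1;   /* kernel has no NUMA support */
    numa_run_on_node(node);                /* restrict this thread to the node's CPUs */
    numa_set_preferred(node);              /* prefer that node's memory for new allocations */
    return 0;
}

Alternatively, run the whole test under numactl, e.g. numactl --cpunodebind=0 --membind=0 ./your_test, then repeat with the other socket and compare the PCIe transfer rates.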