This is a follow-up question to "CUDA Global Memory, Where is it?", in reference to GSmith's response. These questions address the compute capability (CC) > 2.0 case.
When I look up the specs of my Nvidia card, it lists 2GB of 'memory'. I've come to believe this is the 'global' memory for this card. That is, this is GDDR3 memory that resides off-chip, but on the card. Would this be correct?
I don't see any case where the listed 'memory' is zero. Does one exist? That is, can I have a card with no off-chip memory, such that all my texture, local, and constant memory actually resides in pinned and mapped host memory?
Can I extend my global memory usage by pinning more than 2GB of host memory? Can I use all my off-chip global memory (2GB) and add 1GB more of pinned global memory? Or am I to understand that this card is only capable of a 2GB address space at most, i.e. I can only access 2GB of memory, whether unpinned, pinned, mapped, or any combination?
If the device is using pinned host memory (not mapped), do I need to cudaMemcpy from device to host? That is, the memory is physically on the host side, and it is being used by the device, so they can both see it. Why do I need to copy it to the host when it is already there? It seems it would be 'mapped' by default. (What mechanism is preventing this dual access?)
How does one go about mapping shared memory to global memory? (I'm not finding any mention of this in the docs.) Is this a 'mapped' arrangement, or do I still need to copy from global to shared and back again? (Could this save me a copy step?)
It's recommended to ask one question per post.
When I look up the specs of my Nvidia card, it lists 2GB of 'memory'. I've come to believe this is the 'global' memory for this card. That is, this is GDDR3 memory that resides off-chip, but on the card. Would this be correct?
Yes.
I don't see any case where the listed 'memory' is zero. Does one exist? That is, can I have a card with no off-chip memory, such that all my texture, local, and constant memory actually resides in pinned and mapped host memory?
The closest NVIDIA came to this idea was probably the Ion 2 chipset, but there are no CUDA-capable NVIDIA discrete graphics cards with zero on-board, off-chip memory.
Can I extend my global memory usage by pinning more than 2GB of host memory?
You can pin more than 2GB of host memory. However, this does not extend global memory. It does enable a variety of things, such as improved host-device transfer rates, overlapped copy and compute, and zero-copy access of host memory from the GPU, but this is not the same as what you use global memory for. Zero-copy techniques perhaps come the closest to extending global memory onto host memory (conceptually), but zero-copy is very slow from a GPU standpoint.
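To make the zero-copy idea concrete, here is a minimal sketch (kernel and buffer names are made up, sizes are illustrative) in which a pinned, mapped host buffer is accessed directly from a kernel through a device alias pointer; every such access crosses the PCI-E bus, which is why it does not behave like extra on-board global memory:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Illustrative kernel that reads and writes host-resident data directly.
    __global__ void scale(float *data, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= s;
    }

    int main()
    {
        const int n = 1 << 20;
        float *h_buf = NULL, *d_alias = NULL;

        // Allocate pinned host memory that is also mapped into the device address space.
        cudaSetDeviceFlags(cudaDeviceMapHost);
        cudaHostAlloc(&h_buf, n * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer(&d_alias, h_buf, 0);

        for (int i = 0; i < n; ++i) h_buf[i] = 1.0f;

        // Every access the kernel makes through d_alias crosses the PCI-E bus,
        // so this does not behave like on-board global memory.
        scale<<<(n + 255) / 256, 256>>>(d_alias, n, 2.0f);
        cudaDeviceSynchronize();

        printf("h_buf[0] = %f\n", h_buf[0]);   // result visible on the host, no cudaMemcpy
        cudaFreeHost(h_buf);
        return 0;
    }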
If the device is using pinned host memory (not mapped), do I need to cudaMemcpy from device to host?
Yes, you still need to cudaMemcpy data back and forth.
That is, the memory is physically on the host side, and it is being used by the device
I don't know where this concept is coming from. Perhaps you are referring to zero-copy, but zero-copy is relatively slow compared to accessing data that is in global memory. It should be used judiciously in cases of small data sizes, and is by no means a straightforward way to provide a bulk increase to the effective size of the global memory on the card.
How does one go about mapping shared memory to global memory?
Shared memory is not automatically mapped to global memory. The methodology is to copy the data you need back and forth between shared and global memory.
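As a minimal sketch of that methodology (the kernel name, the tile size, and the assumption of 256 threads per block are just for illustration):

    // Hypothetical kernel showing the usual staging pattern:
    // copy global -> shared, compute on shared data, write results back to global.
    // Assumes a launch with 256 threads per block.
    __global__ void smooth(const float *in, float *out, int n)
    {
        __shared__ float tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // global -> shared (zero-pad last block)
        __syncthreads();                              // make the tile visible to the whole block

        if (i < n) {
            float left  = (threadIdx.x > 0)   ? tile[threadIdx.x - 1] : tile[threadIdx.x];
            float right = (threadIdx.x < 255) ? tile[threadIdx.x + 1] : tile[threadIdx.x];
            out[i] = 0.25f * left + 0.5f * tile[threadIdx.x] + 0.25f * right;   // shared -> global
        }
    }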
I have a CUDA (v5.5) application that will need to use global memory. Ideally I would prefer to use constant memory, but I have exhausted constant memory and the overflow will have to be placed in global memory. I also have some variables that will need to be written to occasionally (after some reduction operations on the GPU) and I am placing this in global memory.
For reading, I will be accessing the global memory in a simple way. My kernel is called inside a for loop, and on each call of the kernel, every thread will access the exact same global memory addresses with no offsets. For writing, after each kernel call a reduction is performed on the GPU, and I have to write the results to global memory before the next iteration of my loop. There are far more reads from than writes to global memory in my application however.
My question is whether there are any advantages to using global memory declared in global (variable) scope over using dynamically allocated global memory? The amount of global memory that I need will change depending on the application, so dynamic allocation would be preferable for that reason. I know the upper limit on my global memory use however and I am more concerned with performance, so it is also possible that I could declare memory statically using a large fixed allocation that I am sure not to overflow. With performance in mind, is there any reason to prefer one form of global memory allocation over the other? Do they exist in the same physical place on the GPU and are they cached the same way, or is the cost of reading different for the two forms?
Global memory can be allocated statically (using __device__), dynamically (using device malloc or new) and via the CUDA runtime (e.g. using cudaMalloc).
All of the above methods allocate physically the same type of memory, i.e. memory carved out of the on-board (but not on-chip) DRAM subsystem. This memory has the same access, coalescing, and caching rules regardless of how it is allocated (and therefore has the same general performance considerations).
Since dynamic allocations take some non-zero time, there may be a performance improvement for your code by doing the allocations once, at the beginning of your program, either using the static (i.e. __device__) method or via the runtime API (i.e. cudaMalloc, etc.). This avoids taking the time to dynamically allocate memory in performance-sensitive areas of your code.
Also note that the 3 methods I outline, while having similar C/C++ -like access methods from device code, have differing access methods from the host. Statically allocated memory is accessed using the runtime API functions like cudaMemcpyToSymbol and cudaMemcpyFromSymbol, runtime API allocated memory is accessed via ordinary cudaMalloc / cudaMemcpy type functions, and dynamically allocated global memory (device new and malloc) is not directly accessible from the host.
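A minimal sketch contrasting the three allocation methods and their host-side access paths (sizes and names are illustrative):

    #include <cuda_runtime.h>
    #include <cstdlib>

    // 1. Statically allocated global memory: lives in device DRAM, accessed from
    //    the host via cudaMemcpyToSymbol / cudaMemcpyFromSymbol.
    __device__ float d_static[256];

    // 3. In-kernel dynamic allocation (device malloc/new): also device DRAM,
    //    but not directly accessible from the host at all.
    __global__ void kernel_alloc()
    {
        float *scratch = (float *)malloc(16 * sizeof(float));
        if (scratch) {
            scratch[0] = 1.0f;
            free(scratch);
        }
    }

    int main()
    {
        float h_buf[256] = {0};

        // 1. Static allocation: copy by symbol.
        cudaMemcpyToSymbol(d_static, h_buf, sizeof(h_buf));

        // 2. Runtime API allocation: ordinary cudaMalloc / cudaMemcpy.
        float *d_dyn = NULL;
        cudaMalloc(&d_dyn, sizeof(h_buf));
        cudaMemcpy(d_dyn, h_buf, sizeof(h_buf), cudaMemcpyHostToDevice);

        kernel_alloc<<<1, 1>>>();
        cudaDeviceSynchronize();

        cudaFree(d_dyn);
        return 0;
    }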
First of all, you need to think about coalescing your memory accesses. You didn't mention which GPU you are using. On the latest GPUs, a coalesced memory read will give about the same performance as constant memory. So always make your memory reads and writes as coalesced as possible.
Alternatively, you can use texture memory (if the data fits into it). Texture memory has a caching mechanism; it was previously used in cases where global memory reads were non-coalesced, but the latest GPUs give almost the same performance for texture and global memory.
I don't think globally declared memory gives better performance than dynamically allocated global memory, since the coalescing issue still exists. Also, aside from __device__ variables, global memory cannot be declared at global (variable) scope. The variables that can be declared globally (in the program) are constant memory variables and textures, which do not need to be passed to kernels as arguments.
For memory optimizations, please see the memory optimization section in the CUDA C Best Practices Guide: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#memory-optimizations
CUDA device memory can be allocated using cudaMalloc/cudaFree, sure. This is fine, but primitive.
I'm curious to know, is device memory virtualised in some way? Are there equivalent operations to mmap, and more importantly, mremap for device memory?
If device memory is virtualised, I expect these sorts of functions should exist. It seems modern GPU drivers implement paging when there is contention for limited video resources by multiple processes, which suggests it's virtualised in some way or another...
Does anyone know where I can read more about this?
Edit:
Okay, my question was a bit general. I've read the bits of the manual that talk about mapping system memory for device access. I was more interested in device-allocated memory however.
Specific questions:
- Is there any possible way to remap device memory? (ie, to grow a device allocation)
- Is it possible to map device allocated memory to system memory?
- Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?
I have cases where the memory is used by the GPU 99% of the time; so it should be device-local, but it may be convenient to map device memory to system memory for occasional structured read-back without having to implement an awkward deep-copy.
Yes, unified memory exists, however I'm happy with explicit allocation, save for the odd moment when I want a sneaky read-back.
I've found the manual fairly light on detail in general.
CUDA comes with a fine CUDA C Programming Guide as its main manual, which has sections on Mapped Memory as well as Unified Memory Programming.
Responding to your additional posted questions, and following your cue to leave UM out of the consideration:
Is there any possible way to remap device memory? (ie, to grow a device allocation)
There is no direct method. You would have to manually create a new allocation of the desired size, copy the old data to it, and then free the old allocation. If you expect to do this a lot, and don't mind the significant overhead associated with it, you could take a look at thrust device vectors, which hide some of the manual labor and allow you to resize an allocation in a single vector-style .resize() operation. There's no magic, however: thrust is just a template library built on top of CUDA C (for the CUDA device backend), so it is going to do a sequence of cudaMalloc and cudaFree operations, just as you would "manually".
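A short sketch of the thrust approach, with illustrative sizes:

    #include <thrust/device_vector.h>

    int main()
    {
        // "Growing" a device allocation with thrust: resize() allocates a new
        // buffer, copies the old contents across, and frees the old buffer,
        // much as you would do by hand with cudaMalloc / cudaMemcpy / cudaFree.
        thrust::device_vector<float> d_vec(1 << 20, 1.0f);
        d_vec.resize(1 << 21);       // old elements preserved, new ones default-initialized
        float first = d_vec[0];      // still 1.0f
        (void)first;
        return 0;
    }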
Is it possible to map device allocated memory to system memory?
Leaving aside UM, no. Device memory cannot be mapped into host address space.
Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?
No, host-mapped data is never duplicated in device memory, and apart from L2 caching, mapped data needed by the GPU will always be fetched across the PCI-E bus.
Does cublasSetMatrix work if used in a separate pthread?
I want to overlap tasks on the CPU with CPU->GPU data transfer. However, since the data is quite large, I'm trying to avoid allocating a large amount of pinned memory.
No it won't. GPU contexts are tied to threads that created them. If you try running cublasSetMatrix or cudaMemcpy in another thread without doing anything else, it will make another context. Memory allocations are not portable between contexts, effectively every context has its own virtual address space. The result will be that you wind up with two GPU contexts, and the copy will fail.
The requirement for pinned memory comes from the CUDA driver. For overlapping copying and execution, the host memory involved in the copying must be in a physical address range that the GPU can access by DMA over the PCI-e bus. That is why pinning is required, otherwise the host memory could be in swap space or other virtual memory, and the DMA transaction would fail.
If you are worried about the amount of pinned host memory required for large problems, try using one or two smaller pinned buffers and executing multiple transfers, using the pinned memory as staging buffers. The performance won't be quite as good as using a single large pinned buffer and one big transfer, but you can still achieve useful kernel/copy overlap and hide a lot of PCI-e latency in the process.
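A rough sketch of that staging idea, using a single pinned buffer and an illustrative chunk size (with two staging buffers and streams you could additionally overlap the pageable-to-pinned copy with the DMA transfer and kernel execution):

    #include <cuda_runtime.h>
    #include <cstring>

    // Sketch of a staged host->device upload through a small pinned buffer.
    // h_src is ordinary pageable host memory; only the staging buffer is pinned.
    void staged_upload(float *d_dst, const float *h_src, size_t n)
    {
        const size_t chunk = 1 << 20;                 // elements per stage (illustrative)
        float *stage = NULL;
        cudaHostAlloc(&stage, chunk * sizeof(float), cudaHostAllocDefault);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        for (size_t off = 0; off < n; off += chunk) {
            size_t count = (n - off < chunk) ? (n - off) : chunk;
            memcpy(stage, h_src + off, count * sizeof(float));    // pageable -> pinned
            cudaMemcpyAsync(d_dst + off, stage, count * sizeof(float),
                            cudaMemcpyHostToDevice, stream);      // pinned -> device via DMA
            cudaStreamSynchronize(stream);   // with two staging buffers you could overlap instead
        }

        cudaStreamDestroy(stream);
        cudaFreeHost(stage);
    }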
I am new to CUDA programming, and I am mostly working with shared memory per block because of performance reasons. The way my program is structured right now, I use one kernel to load the shared memory and another kernel to read the pre-loaded shared memory. But, as I understand it, shared memory cannot persist between two different kernels.
I have two solutions in mind; I am not sure about the first one, and the second might be slow.
First Solution: Instead of using two kernels, I use one kernel. After loading the shared memory, the kernel may wait for an input from the host, perform the operation and then return the value to host. I am not sure whether a kernel can wait for a signal from the host.
Second solution: After loading the shared memory, copy the shared memory value in the global memory. When the next kernel is launched, copy the value from global memory back into the shared memory and then perform the operation.
Please comment on the feasibility of the two solutions.
I would use a variation of your proposed first solution: as you already suspected, you can't wait for host input in a kernel - but you can synchronise at a point within your kernel. Just call "__syncthreads();" in your kernel after loading your data into shared memory.
I don't really understand your second solution: why would you copy data to shared memory just to copy it back to global memory in the first kernel? Or would this first kernel also compute something? In that case I guess it will not help to store the preliminary results in shared memory first; I would rather store them directly in global memory (however, this might depend on the algorithm).
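For reference, a minimal sketch of staging results through global memory between two kernel launches might look like this (kernel names, the placeholder arithmetic, and the 256-thread block size are hypothetical):

    // Hypothetical two-kernel sketch. Shared memory does not survive a launch,
    // so the first kernel writes its results straight to a global scratch buffer,
    // and the second kernel reloads them into shared memory before continuing.
    // Assumes launches with 256 threads per block and n a multiple of 256.
    __global__ void first_pass(const float *in, float *scratch)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        scratch[i] = in[i] * in[i];                  // placeholder first-pass work
    }

    __global__ void second_pass(const float *scratch, float *out)
    {
        __shared__ float s[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        s[threadIdx.x] = scratch[i];                 // global -> shared reload
        __syncthreads();
        out[i] = s[threadIdx.x] + 1.0f;              // placeholder second-pass work
    }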
I am developing a small application using CUDA.
I have a huge 2D array (it won't fit in shared memory) which threads in all blocks will read from constantly at random places.
This 2D array is read-only.
Where should I allocate this 2D array? Global memory? Constant memory? Texture memory?
Depending on the size of your device's texture memory, you should consider placing the array there. Texture memory is based on a spatial-locality cache mechanism: memory accesses are optimized when threads with consecutive identifiers read data elements from relatively close storage locations.
Moreover, this locality is implemented for 2D accesses. So when each thread reads a data element of an array stored in texture memory, you're in the case of consecutive 2D accesses, and you take full advantage of the memory architecture.
Unfortunately, this memory is not that big, and with huge arrays you might not be able to make your data fit in it. In that case, you can't avoid using global memory.
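For completeness, here is a rough sketch of one way to put a read-only 2D array behind the texture cache using the texture object API; this assumes a CUDA version and device that support texture objects, and the kernel name and dimensions are illustrative:

    #include <cuda_runtime.h>

    // Reads the 2D array through the texture cache.
    __global__ void sample(cudaTextureObject_t tex, float *out, int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < w && y < h)
            out[y * w + x] = tex2D<float>(tex, x, y);
    }

    int main()
    {
        const int w = 1024, h = 1024;      // illustrative dimensions

        // Pitched device allocation holding the read-only 2D array.
        float *d_data = NULL;
        size_t pitch = 0;
        cudaMallocPitch(&d_data, &pitch, w * sizeof(float), h);

        // Describe that allocation as a 2D texture resource.
        cudaResourceDesc resDesc = {};
        resDesc.resType = cudaResourceTypePitch2D;
        resDesc.res.pitch2D.devPtr = d_data;
        resDesc.res.pitch2D.desc = cudaCreateChannelDesc<float>();
        resDesc.res.pitch2D.width = w;
        resDesc.res.pitch2D.height = h;
        resDesc.res.pitch2D.pitchInBytes = pitch;

        cudaTextureDesc texDesc = {};
        texDesc.addressMode[0] = cudaAddressModeClamp;
        texDesc.addressMode[1] = cudaAddressModeClamp;
        texDesc.filterMode = cudaFilterModePoint;
        texDesc.readMode = cudaReadModeElementType;
        texDesc.normalizedCoords = 0;

        cudaTextureObject_t tex = 0;
        cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

        float *d_out = NULL;
        cudaMalloc(&d_out, w * h * sizeof(float));
        dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
        sample<<<grid, block>>>(tex, d_out, w, h);
        cudaDeviceSynchronize();

        cudaDestroyTextureObject(tex);
        cudaFree(d_out);
        cudaFree(d_data);
        return 0;
    }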
I agree with jHackTheRipper: a simple solution would be to use texture memory and then profile with the Compute Visual Profiler. Here's a good set of slides from NVIDIA about the different memory types for image convolution; it shows that good shared memory usage with global reads was not much faster than using texture memory. In your case you should get some coalesced reads from the texture memory that you wouldn't usually get when accessing random values in global memory.
If it's small enough to fit in constant or texture memory, I would just try all three.
One interesting option that you don't have listed here is mapped memory on the host. You can allocate memory on the host that will be accessible from the device, without explicitly transferring it to device memory. Depending on how much of the array you need to access, it could be faster than copying to global memory and reading from there.