CUDA: cudaMalloc fails when allocating a large block of memory

I have a GTX 570 with 2 GB of memory. When I try to allocate more than about 804 MB with a single cudaMalloc call I run into trouble. Does anyone have any idea why that is? It is my first call, so I doubt it is fragmentation.
No problem:
Memory available: Free: 2336116736, Total: 2684026880
requesting 804913152 bytes
no error
Memory available: Free: 1531199488, Total: 2684026880
requesting 804913152 bytes
no error
Memory available: Free: 726286336, Total: 2684026880
Problem:
Memory available: Free: 2327601152, Total: 2684026880
requesting 805306368 bytes
out of memory
Memory available: Free: 2327597056, Total: 2684026880
requesting 805306368 bytes
out of memory
Memory available: Free: 2327597056, Total: 2684026880
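
For reference, here is a minimal sketch (not the asker's actual code) of the kind of loop that produces output like the above, using cudaMemGetInfo and one large cudaMalloc per iteration; the request size shown is the failing 805306368-byte value:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t request = 805306368;  // bytes requested in a single cudaMalloc call
    for (int i = 0; i < 2; ++i) {
        size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);
        printf("Memory available: Free: %zu, Total: %zu\n", free_bytes, total_bytes);
        printf("requesting %zu bytes\n", request);
        void *ptr = 0;
        cudaError_t err = cudaMalloc(&ptr, request);
        printf("%s\n", cudaGetErrorString(err));  // prints "no error" or "out of memory"
    }
    return 0;
}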

This is caused by restrictions imposed by the Windows WDDM subsystem. There is a hard limit imposed on how much memory can be allocated, calculated as
MIN ( ( System Memory Size in MB - 512 MB ) / 2, PAGING_BUFFER_SEGMENT_SIZE )
For desktop Windows, PAGING_BUFFER_SEGMENT_SIZE is about 2 GB, IIRC. You have two options to work around this:
Get a Tesla card and use the dedicated Windows TCC-mode driver, which takes memory management of the device away from WDDM, eliminating the restriction.
Install Linux or use a CUDA-aware live distribution for your GPU computing. The Linux driver has no restrictions on memory allocations beyond the free memory capacity of the device.

Related

CUDA bank conflict for L1 cache?

On NVIDIA's 2.x architecture, each warp has 64kb of memory that is by default partitioned into 48kb of Shared Memory and 16kb of L1 cache (servicing global and constant memory).
We all know about the bank conflicts of accessing Shared Memory - the memory is divided into 32 banks of size 32-bits to allow simultaneous independent access by all 32 threads. On the other hand, Global Memory, though much slower, does not experience bank conflicts because memory requests are coalesced across the warp.
Question: Suppose some data from global or constant memory is cached in the L1 cache for a given warp. Is access to this data subject to bank conflicts, like Shared Memory (since the L1 Cache and the Shared Memory are in fact the same hardware), or is it bank-conflict-free in the way that Global/Constant memory is?
On NVIDIA's 2.x architecture, each warp has 64kb of memory that is by
default partitioned into 48kb of Shared Memory and 16kb of L1 cache
Compute capability 2.x devices have 64 KB of SRAM per Streaming Multiprocessor (SM) that can be configured as
16 KB L1 and 48 KB shared memory, or
48 KB L1 and 16 KB shared memory.
(servicing global and constant memory).
Loads and stores to global memory, local memory, and surface memory go through the L1. Accesses to constant memory go through dedicated constant caches.
We all know about the bank conflicts of accessing Shared Memory - the
memory is divided into 32 banks of size 32-bits to allow simultaneous
independent access by all 32 threads. On the other hand, Global
Memory, though much slower, does not experience bank conflicts because
memory requests are coalesced across the warp.
Accesses through L1 to global or local memory are done per cache line (128 B). When a load request is issued to L1, the LSU needs to perform an address divergence calculation to determine which threads are accessing the same cache line. The LSU then has to perform an L1 cache tag lookup. If the line is cached, it is written back to the register file; otherwise, the request is sent to L2. If the warp has threads not serviced by the request, a replay is requested and the operation is reissued with the remaining threads.
Multiple threads in a warp can access the same bytes in the cache line without causing a conflict.
Question: Suppose some data from global or constant memory is cached
in the L1 cache for a given warp.
Constant memory is not cached in L1; it is cached in the constant caches.
Is access to this data subject to bank conflicts, like Shared Memory
(since the L1 Cache and the Shared Memory are in fact the same
hardware), or is it bank-conflict-free in the way that Global/Constant
memory is?
L1 and the constant cache access a single cache line at a time so there are no bank conflicts.
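To make the distinction concrete, here is a small illustrative kernel (not from the original answer; assume it is launched with one 32-thread warp per block). The uniform global load is served from a single 128-byte L1 cache line with no penalty, while the stride-32 shared-memory read is a textbook 32-way bank conflict on compute capability 2.x:

__global__ void cache_vs_bank_demo(const float *g_in, float *g_out)
{
    __shared__ float s[32 * 32];
    int tid = threadIdx.x;

    // All threads of the warp read the same global address: the load is
    // served from one 128-byte cache line and broadcast, no conflict.
    float uniform = g_in[0];

    // Fill shared memory (consecutive threads hit consecutive banks: conflict-free).
    for (int i = tid; i < 32 * 32; i += blockDim.x)
        s[i] = uniform + i;
    __syncthreads();

    // Stride-32 access: with 32 banks of 4 bytes, every thread of the warp
    // maps to bank 0, causing a 32-way shared-memory bank conflict.
    float conflicted = s[tid * 32];

    g_out[tid] = uniform + conflicted;
}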

Loading the same 32 bytes (ulong4) from shared memory in each warp thread

If each thread of a warp accesses shared memory at the same address, how would it load 32 bytes of data (a ulong4)? Will it be 'broadcast'? Would the access time be the same as if each thread loaded a 2-byte 'unsigned short int'?
Now, in case I need to load the same 32/64 bytes from shared memory in each warp, how could I do this?
On devices before compute capability 3.0 shared memory accesses are always 32 bit / 4 bytes wide and will be broadcast if all threads of a warp access the same address. Wider accesses will compile to multiple instructions.
On compute capability 3.0, shared memory accesses can be configured to be either 32 bits or 64 bits wide using cudaDeviceSetSharedMemConfig(). The chosen setting applies to the entire kernel, though.
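A hedged sketch of both points (the kernel and variable names are illustrative): one thread stages the ulong4 into shared memory, all threads then read the same address, and the host selects 64-bit shared-memory banks before launching. Note that ulong4 is 32 bytes only on platforms where unsigned long is 8 bytes:

#include <cuda_runtime.h>

__global__ void broadcast_ulong4(const ulong4 *g_in, ulong4 *g_out)
{
    __shared__ ulong4 value;                 // the 32 bytes every thread wants

    if (threadIdx.x == 0)
        value = g_in[blockIdx.x];            // one thread stages it in shared memory
    __syncthreads();

    // Every thread of the warp reads the same shared-memory address; the value
    // is broadcast (before cc 3.0 this compiles to multiple 32-bit accesses).
    g_out[blockIdx.x * blockDim.x + threadIdx.x] = value;
}

int main()
{
    // Select 64-bit wide shared-memory banks (only has an effect on cc 3.0).
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

    // ... allocate g_in/g_out and launch broadcast_ulong4 as usual ...
    return 0;
}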
[As I had originally missed the little word "shared" in the question, I gave a completely off-topic answer for global memory instead. Since that one should still be correct, I'll leave it in here:]
It depends:
Compute capability 1.0 and 1.1 devices don't broadcast and use 64 separate 32-byte memory transactions (two times 16 bytes, extended to the minimum 32-byte transaction size, for each thread of the warp).
Compute capability 1.2 and 1.3 devices broadcast, so two 32-byte transactions (two times 16 bytes, extended to the minimum 32-byte transaction size) suffice for all threads of the warp.
Compute capability 2.0 and higher devices just read a 128-byte cache line and satisfy all requests from there.
The compute capability 1.x devices will waste 50% of the transferred data, as a single thread can load at most 16 bytes while the minimum transaction size is 32 bytes. Additionally, 32-byte transactions are a lot slower than 128-byte transactions.
The time will be the same as if just 8 bytes were read by each thread, because of the minimum transaction size and because the data paths are wide enough to transfer either 8 or 16 bytes to each thread per transaction.
Reading 2× or 4× the data will take 2× or 4× as long on compute capability 1.x, but only minimally longer on 2.0 and higher if the data falls into the same cache line so no further memory transactions are necessary.
So on compute capability 2.0 and higher you don't need to worry. On 1.x read the data through the constant cache or a texture if it is constant, or reorder it in shared memory otherwise (assuming your kernel is memory bandwidth bound).

CUDA Programming - Shared memory configuration

Could you please explain the differences between using "16 KB shared memory + 48 KB L1 cache" and "48 KB shared memory + 16 KB L1 cache" in CUDA programming? What should I expect in terms of execution time? When could I expect a smaller GPU time?
On Fermi and Kepler NVIDIA GPUs, each SM has a 64 KB chunk of memory that can be configured as 16/48 or 48/16 shared memory/L1 cache. Which mode you use depends on how much shared memory your kernel uses. If your kernel uses a lot of shared memory, then you would probably find that configuring it as 48 KB shared memory allows higher occupancy and hence better performance.
On the other hand, if your kernel does not use shared memory at all, or only uses a very small amount per thread, then you would configure it as 48 KB L1 cache.
How much a "very small amount" is is probably best illustrated with the occupancy calculator which is a spreadsheet included with the CUDA Toolkit and also available here. This spreadsheet allows you to investigate the effect of different shared memory per block and different block sizes.

Why is the constant memory size limited in CUDA?

According to "CUDA C Programming Guide", a constant memory access benefits only if a multiprocessor constant cache is hit (Section 5.3.2.4)1. Otherwise there can be even more memory requests for a half-warp than in case of the coalesced global memory read. So why the constant memory size is limited to 64 KB?
One more question in order not to ask twice. As far as I understand, in the Fermi architecture the texture cache is combined with the L2 cache. Does texture usage still make sense or the global memory reads are cached in the same manner?
[1] Constant Memory (Section 5.3.2.4)
The constant memory space resides in device memory and is cached in the constant cache mentioned in Sections F.3.1 and F.4.1.
For devices of compute capability 1.x, a constant memory request for a warp is first split into two requests, one for each half-warp, that are issued independently.
A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.
The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
The constant memory size is 64 KB for compute capability 1.0-3.0 devices. The cache working set is only 8KB (see the CUDA Programming Guide v4.2 Table F-2).
Constant memory is used by the driver, compiler, and variables declared __device__ __constant__. The driver uses constant memory to communicate parameters, texture bindings, etc. The compiler uses constants in many of the instructions (see disassembly).
Variables placed in constant memory can be read and written using the host runtime functions cudaMemcpyToSymbol() and cudaMemcpyFromSymbol() (see the CUDA Programming Guide v4.2 section B.2.2). Constant memory is in device memory but is accessed through the constant cache.
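A minimal sketch of that usage (the variable and kernel names are illustrative); the __constant__ array counts against the 64 KB limit, and device-side reads of it go through the constant cache:

#include <cstdio>
#include <cuda_runtime.h>

__constant__ float coeffs[256];            // lives in the 64 KB constant space

__global__ void scale(float *data)
{
    // Uniform access: all threads read the same element through the constant cache.
    data[threadIdx.x] *= coeffs[0];
}

int main()
{
    float h_coeffs[256];
    for (int i = 0; i < 256; ++i) h_coeffs[i] = 2.0f;

    // Write constant memory from the host ...
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));

    // ... and read it back.
    float check[4];
    cudaMemcpyFromSymbol(check, coeffs, sizeof(check));
    printf("coeffs[0] = %f\n", check[0]);

    float *d = 0;
    cudaMalloc(&d, 256 * sizeof(float));
    scale<<<1, 256>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}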
On Fermi texture, constant, L1 and I-Cache are all level 1 caches in or around each SM. All level 1 caches access device memory through the L2 cache.
The 64 KB constant limit is per CUmodule which is a CUDA compilation unit. The concept of CUmodule is hidden under the CUDA runtime but accessible by the CUDA Driver API.

Pinned memory in Nvidia CUDA

I'm writing a matrix addition program for GPUs using streams and, obviously, pinned memory. So I allocated 3 matrices in pinned memory, but beyond particular dimensions it shows API error 2: out of memory. My RAM is 4 GB, but I'm not able to use more than about 800 MB. Is there any way to control this upper limit?
My sys config:
NVIDIA GeForce 9800 GTX
Intel Core 2 Quad
For streamed execution the code looks as follows:
for (int i = 0; i < no_of_streams; i++)
{
    // Copy this stream's slice of each matrix to the device.
    cudaMemcpyAsync(device_a + i*(n/no_of_streams), hAligned_on_host_a + i*(n/no_of_streams), nbytes/no_of_streams, cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync(device_b + i*(n/no_of_streams), hAligned_on_host_b + i*(n/no_of_streams), nbytes/no_of_streams, cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync(device_c + i*(n/no_of_streams), hAligned_on_host_c + i*(n/no_of_streams), nbytes/no_of_streams, cudaMemcpyHostToDevice, streams[i]);

    // Launch the addition on this stream's slice.
    matrixAddition<<<blocks, threads, 0, streams[i]>>>(device_a + i*(n/no_of_streams), device_b + i*(n/no_of_streams), device_c + i*(n/no_of_streams));

    // Copy the results back to pinned host memory.
    cudaMemcpyAsync(hAligned_on_host_a + i*(n/no_of_streams), device_a + i*(n/no_of_streams), nbytes/no_of_streams, cudaMemcpyDeviceToHost, streams[i]);
    cudaMemcpyAsync(hAligned_on_host_b + i*(n/no_of_streams), device_b + i*(n/no_of_streams), nbytes/no_of_streams, cudaMemcpyDeviceToHost, streams[i]);
    cudaMemcpyAsync(hAligned_on_host_c + i*(n/no_of_streams), device_c + i*(n/no_of_streams), nbytes/no_of_streams, cudaMemcpyDeviceToHost, streams[i]);
}
So, you haven't specified whether this happens after the cudaMalloc or the cudaHostAlloc function calls.
Pinned memory is a limited resource. Any memory allocated as pinned must always reside in physical RAM. As such, that leaves less room in RAM for other system applications. This means you can't have 4 GB of pinned memory if you have 4 GB of RAM, or else nothing else could run.
800 MB might be a system-imposed limit. Considering it's a quarter of your RAM, it might be a reasonable limit. It is also quite close to the size of your card's global memory. A failure on the card wouldn't translate to a failure on the host, so if it's complaining without you having to run something like cudaGetLastError, it's probably a problem on the host.
Sorry, I don't know the specifics of increasing your pinned memory limit.
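As a closing note, a hedged sketch (sizes and names are illustrative) of how to see directly which allocation is failing, by checking the return codes of the pinned host allocation and the device allocation separately:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t nbytes = 800u * 1024u * 1024u;   // ~800 MB, near the reported limit
    float *h_pinned = 0, *d_buf = 0;

    // Pinned (page-locked) host allocation: fails against host-side limits.
    cudaError_t err = cudaHostAlloc((void**)&h_pinned, nbytes, cudaHostAllocDefault);
    if (err != cudaSuccess)
        printf("cudaHostAlloc failed (pinned host memory): %s\n", cudaGetErrorString(err));

    // Device allocation: fails against the card's global memory capacity.
    err = cudaMalloc((void**)&d_buf, nbytes);
    if (err != cudaSuccess)
        printf("cudaMalloc failed (device memory): %s\n", cudaGetErrorString(err));

    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}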