How to find the amount of contiguous physical memory available on a Solaris platform - solaris-10

I think the JVM always takes a contiguous chunk of memory equal to the -Xms value. If the configured amount is not available, JVM initialization fails.
With that in mind, how can I find the amount of contiguous physical memory available on a Solaris platform?

The JVM does not require a contiguous area of physical memory. What it uses is virtual memory.
Because the heap needs an unfragmented range of addresses when the JVM starts, the limit for the -Xms option depends only on the virtual memory available to your process: up to 4 GB minus the area allocated for non-heap memory (stacks, native code, etc.) for a 32-bit JVM, and limited only by the virtual memory available on your system (see swap -s) for a 64-bit JVM.

Related

Amount of data that can be held in shared memory in CUDA

The maximum number of threads per block on my GPU is 1024. I am working on an image processing project using CUDA. If I want to use shared memory, does that mean I can only work with 1024 pixels per block and need to copy only those 1024 elements to shared memory?
Your question is quite unclear, so I will answer what is asked in the title.
The amount of data that can be held in shared memory in CUDA depends on the compute capability (CC) of your GPU.
For instance, on CC 2.x and 3.x:
On devices of compute capability 2.x and 3.x, each multiprocessor has 64KB of on-chip memory that can be partitioned between L1 cache and shared memory.
See the "Configuring the amount of shared memory" section of the NVIDIA Parallel Forall blog post "Using Shared Memory in CUDA C/C++".
The optimization to think about is avoiding bank conflicts by mapping the threads' accesses to distinct memory banks. The same blog post introduces this, and it is worth reading.
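To make the relation between threads and pixels concrete: the size of a shared memory tile is bounded by the shared memory available per block, not by the number of threads, because each thread can stage several elements. The following is a minimal sketch, assuming one-byte grayscale pixels. The kernel name invertTile, the tile size of 4096 pixels and the block size of 256 threads are made up for illustration and are not taken from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical tile size: 4096 one-byte pixels per block (more pixels than threads).
constexpr int TILE_PIXELS = 4096;

__global__ void invertTile(const unsigned char* in, unsigned char* out, int n)
{
    __shared__ unsigned char tile[TILE_PIXELS];
    int base = blockIdx.x * TILE_PIXELS;

    // Each thread stages several pixels, striding by the block size.
    for (int i = threadIdx.x; i < TILE_PIXELS; i += blockDim.x) {
        int g = base + i;
        tile[i] = (g < n) ? in[g] : 0;
    }
    __syncthreads();

    // Work on the staged tile; here simply invert every pixel.
    for (int i = threadIdx.x; i < TILE_PIXELS; i += blockDim.x) {
        int g = base + i;
        if (g < n) out[g] = static_cast<unsigned char>(255 - tile[i]);
    }
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("max threads per block:   %d\n", prop.maxThreadsPerBlock);

    const int n = 1 << 20;                        // assumed image size in pixels
    unsigned char *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, n) != cudaSuccess) { cudaFree(d_in); return 1; }
    cudaMemset(d_in, 10, n);

    int blocks = (n + TILE_PIXELS - 1) / TILE_PIXELS;
    invertTile<<<blocks, 256>>>(d_in, d_out, n);  // 256 threads handle 4096 pixels

    unsigned char first = 0;
    cudaError_t err = cudaMemcpy(&first, d_out, 1, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess) printf("first output pixel: %d\n", first);

    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

The tile size would normally be chosen from the sharedMemPerBlock value reported for the device and the per-pixel footprint, while the thread count is tuned separately.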

Understanding concurrency and GPU as a limited resource

With CPU and memory it's simple.
A process has a large virtual address space, which is partially mapped into physical memory. When the current process attempts to access a page that is not in physical memory, the OS steps in, chooses a page to evict (e.g. with round robin), writes it out to disk, then reads the required page from swap and returns control to the process. This is straightforward, because the process cannot continue without that page.
GPU kernels are a different story.
Let's consider a use case:
A high-priority [CPU] process, namely X, makes a call to a kernel (which is a blocking call). At this moment, it is reasonable for the OS to switch contexts and give the CPU to a different process, namely Z. For the sake of example, let process Z also do something heavy with the GPU.
Now, what does the GPU driver do? Does it stop the kernel that belongs to [higher prioritized] X? Does it inform OS that Z isn't prioritized enough to offload kernels of X? In general, what happens when two processes need GPU resources, but the available GPU memory is sufficient to serve only one of them at a time?
CUDA GPUs context-switch cooperatively at a coarse granularity (think "memcpy" or "kernel launch"). If there is enough memory for both contexts, the hardware is happy to cooperatively context switch between them at a slight performance cost. (But because it's cooperative, long-running kernels will interfere with other kernels' execution.)
Modern GPUs do support virtual memory (i.e. memory protection through address translation), but they do NOT support demand paging. That means every piece of memory accessible to the GPU (device memory and mapped pinned memory) must be physically present and mapped after allocation.
The Windows Display Driver Model (WDDM) introduced in Windows Vista does paging at a very coarse granularity. The driver is required to track which "memory objects" are needed to execute a given command buffer, and the OS ensures that they are present. The OS can swap them out when not needed. The wrinkle with CUDA is that since pointers can be stored, all memory objects associated with the CUDA address space must be resident in order to run a CUDA kernel. So the paging doesn't work as well for CUDA as it does for graphics applications, which WDDM was designed to run.
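One practical consequence of the lack of demand paging described above can be sketched as follows: an ordinary cudaMalloc request that does not fit into device memory is expected to fail up front instead of being paged in later. This is a hedged illustration rather than a statement about every driver. The 64 MB margin is arbitrary, and under WDDM the reported free amount and the outcome may differ.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("device memory: %zu bytes free of %zu\n", freeBytes, totalBytes);

    // Hypothetical request: somewhat more than what is currently free.
    size_t request = freeBytes + (static_cast<size_t>(64) << 20);
    void* p = nullptr;
    err = cudaMalloc(&p, request);
    if (err == cudaSuccess) {
        // The driver found room after all (the free figure is only a snapshot).
        printf("allocation of %zu bytes succeeded\n", request);
        cudaFree(p);
    } else {
        // No demand paging: the request is rejected at allocation time.
        printf("allocation of %zu bytes failed: %s\n", request, cudaGetErrorString(err));
    }
    return 0;
}
```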

Loading the same 32 bytes (ulong4) from shared memory in every thread of a warp

If every thread of a warp accesses shared memory at the same address, how is a 32-byte value (ulong4) loaded? Will it be broadcast? Would the access time be the same as if each thread loaded a 2-byte unsigned short int?
And if I need to load the same 32 or 64 bytes from shared memory in each warp, how can I do this?
On devices before compute capability 3.0 shared memory accesses are always 32 bit / 4 bytes wide and will be broadcast if all threads of a warp access the same address. Wider accesses will compile to multiple instructions.
On compute capability 3.0 shared memory accesses can be configured to be either 32 bit wide or 64 bit wide using cudaDeviceSetSharedMemConfig(). The chosen setting will apply to the entire kernel though.
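As a rough sketch of the shared memory case: the program below requests 64-bit banks with cudaDeviceSetSharedMemConfig() and then lets all 32 threads of a warp read the same ulong4 from shared memory. The kernel name broadcastRead and the stored values are made up for illustration. Note the assumptions: ulong4 is 32 bytes only where unsigned long is 64 bits wide, and the bank size request is only meaningful on hardware that supports it, so it may be a no-op or deprecated on other devices and toolkits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void broadcastRead(ulong4* out)
{
    __shared__ ulong4 value;                 // one value shared by the whole block
    if (threadIdx.x == 0)
        value = make_ulong4(1UL, 2UL, 3UL, 4UL);
    __syncthreads();

    // Every thread of the warp reads the same shared memory address.
    out[threadIdx.x] = value;
}

int main()
{
    // Request 64-bit wide shared memory banks; this is device-wide, not per access.
    cudaError_t err = cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    if (err != cudaSuccess)
        printf("bank size request not honored: %s\n", cudaGetErrorString(err));

    ulong4* d_out = nullptr;
    if (cudaMalloc(&d_out, 32 * sizeof(ulong4)) != cudaSuccess) return 1;
    broadcastRead<<<1, 32>>>(d_out);         // a single warp

    ulong4 h_out[32];
    err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        printf("sizeof(ulong4) = %zu, lane 31 read x = %lu\n",
               sizeof(ulong4), h_out[31].x);

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```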
[As I had originally missed the little word "shared" in the question, I gave a completely off-topic answer for global memory instead. Since that one should still be correct, I'll leave it in here:]
It depends:
Compute capability 1.0 and 1.1 don't broadcast and use 64 separate 32-byte memory transactions (two times 16 bytes, extended to the minimum 32-byte transaction size, for each thread of the warp).
Compute capability 1.2 and 1.3 broadcast, so two 32-byte transactions (two times 16 bytes, extended to the minimum 32-byte transaction size) suffice for all threads of the warp.
Compute capability 2.0 and higher just read a 128 byte cache line and satisfy all requests from there.
Compute capability 1.x devices waste 50% of the transferred data, as a single thread can load at most 16 bytes but the minimum transaction size is 32 bytes. Additionally, 32-byte transactions are a lot slower than 128-byte transactions.
The time will be the same as if just 8 bytes were read by each thread because of the minimum transaction size, and because data paths are sufficiently wide to transfer either 8 or 16 bytes to each thread per transaction.
Reading 2× or 4× the data will take 2× or 4× as long on compute capability 1.x, but only minimally longer on 2.0 and higher if the data falls into the same cache line so no further memory transactions are necessary.
So on compute capability 2.0 and higher you don't need to worry. On 1.x read the data through the constant cache or a texture if it is constant, or reorder it in shared memory otherwise (assuming your kernel is memory bandwidth bound).

Why is the constant memory size limited in CUDA?

According to the "CUDA C Programming Guide", a constant memory access pays off only if it hits the multiprocessor's constant cache (Section 5.3.2.4) [1]. Otherwise there can be even more memory requests for a half-warp than with a coalesced global memory read. So why is the constant memory size limited to 64 KB?
One more question, to avoid asking twice. As far as I understand, in the Fermi architecture the texture cache is combined with the L2 cache. Does using textures still make sense, or are global memory reads cached in the same manner?
[1] Constant Memory (Section 5.3.2.4)
The constant memory space resides in device memory and is cached in the constant cache mentioned in Sections F.3.1 and F.4.1.
For devices of compute capability 1.x, a constant memory request for a warp is first split into two requests, one for each half-warp, that are issued independently.
A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.
The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
The constant memory size is 64 KB for compute capability 1.0-3.0 devices. The cache working set is only 8 KB (see the CUDA Programming Guide v4.2, Table F-2).
Constant memory is used by the driver, compiler, and variables declared __device__ __constant__. The driver uses constant memory to communicate parameters, texture bindings, etc. The compiler uses constants in many of the instructions (see disassembly).
Variables placed in constant memory can be read and written using the host runtime functions cudaMemcpyToSymbol() and cudaMemcpyFromSymbol() (see the CUDA Programming Guide v4.2 section B.2.2). Constant memory is in device memory but is accessed through the constant cache.
On Fermi, the texture cache, constant cache, L1 cache and instruction cache are all level 1 caches in or around each SM. All level 1 caches access device memory through the L2 cache.
The 64 KB constant limit is per CUmodule, which is a CUDA compilation unit. The concept of a CUmodule is hidden under the CUDA runtime but is accessible through the CUDA Driver API.
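The host-side access to constant memory mentioned above can be sketched like this. It is a minimal example under stated assumptions: the array coeffs, its four values and the kernel scale are hypothetical. Only cudaMemcpyToSymbol() and cudaMemcpyFromSymbol() come from the answer. All threads read the same constant addresses, which is the access pattern the constant cache serves best.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical coefficients living in the 64 KB constant memory space.
__constant__ float coeffs[4];

__global__ void scale(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Every thread reads the same two constant addresses.
        out[i] = in[i] * coeffs[0] + coeffs[1];
    }
}

int main()
{
    const float h_coeffs[4] = {2.0f, 1.0f, 0.0f, 0.0f};
    if (cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs)) != cudaSuccess) return 1;

    // Read the symbol back to show the reverse direction.
    float readBack[4] = {0};
    if (cudaMemcpyFromSymbol(readBack, coeffs, sizeof(readBack)) != cudaSuccess) return 1;
    printf("coeffs[0] as stored on the device: %f\n", readBack[0]);

    const int n = 8;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) { cudaFree(d_in); return 1; }
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    scale<<<1, n>>>(d_in, d_out, n);

    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    if (err == cudaSuccess) printf("out[3] = %f\n", h_out[3]);

    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```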

Pinned memory in Nvidia CUDA

I'm writing a matrix addition program for GPUs using streams and, therefore, pinned memory. I allocated 3 matrices in pinned memory, but beyond certain dimensions I get "API error 2: out of memory". My machine has 4 GB of RAM, but I'm not able to use more than 800 MB. Is there any way to control this upper limit?
My system configuration:
NVIDIA GeForce 9800 GTX
Intel Core 2 Quad
The code for streamed execution looks as follows:
for (int i = 0; i < no_of_streams; i++)
{
    // Each stream handles one contiguous slice of the matrices.
    int offset = i * (n / no_of_streams);
    size_t chunk = nbytes / no_of_streams;
    cudaMemcpyAsync(device_a + offset, hAligned_on_host_a + offset, chunk, cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync(device_b + offset, hAligned_on_host_b + offset, chunk, cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync(device_c + offset, hAligned_on_host_c + offset, chunk, cudaMemcpyHostToDevice, streams[i]);
    matrixAddition<<<blocks, threads, 0, streams[i]>>>(device_a + offset, device_b + offset, device_c + offset);
    cudaMemcpyAsync(hAligned_on_host_a + offset, device_a + offset, chunk, cudaMemcpyDeviceToHost, streams[i]);
    cudaMemcpyAsync(hAligned_on_host_b + offset, device_b + offset, chunk, cudaMemcpyDeviceToHost, streams[i]);
    cudaMemcpyAsync(hAligned_on_host_c + offset, device_c + offset, chunk, cudaMemcpyDeviceToHost, streams[i]);
}
So, you haven't specified whether this happens after the cudaMalloc or the cudaHostAlloc calls.
Pinned memory is a limited resource. Any memory allocated as pinned must always stay in RAM, which leaves less room in RAM for other applications. This means you can't have 4 GB of pinned memory on a machine with 4 GB of RAM, or nothing else could run.
800 MB might be a system-imposed limit. Considering it's a quarter of your RAM, it might be a reasonable limit. It is also quite close to the size of your global memory. A failure on the card wouldn't translate to a failure on the host, so if it's complaining without your having to run something like cudaGetLastError, it's probably a problem on the host.
Sorry, I don't know the specifics of increasing your pinned memory limit.
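To find out which of the two allocations is the one that fails, each call can be checked on its own. The following is a minimal sketch, assuming a hypothetical request of 256 MB. The helper check is made up for illustration, and error 2 is what the runtime reports as "out of memory" for either call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report which call failed and with which error code.
static bool check(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s (error %d)\n",
                what, cudaGetErrorString(err), static_cast<int>(err));
        return false;
    }
    return true;
}

int main()
{
    const size_t nbytes = static_cast<size_t>(256) << 20;  // assumed size per matrix
    float* h_pinned = nullptr;
    float* d_buf = nullptr;

    // Pinned (page-locked) host allocation: limited by what the OS lets us lock.
    bool hostOk = check(cudaHostAlloc((void**)&h_pinned, nbytes, cudaHostAllocDefault),
                        "cudaHostAlloc");

    // Device allocation: limited by the global memory of the card.
    bool devOk = check(cudaMalloc((void**)&d_buf, nbytes), "cudaMalloc");

    if (devOk) cudaFree(d_buf);
    if (hostOk) cudaFreeHost(h_pinned);
    return (hostOk && devOk) ? 0 : 1;
}
```

Increasing nbytes until one of the two messages appears shows whether the limit being hit is on the host side or on the card.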