CUDA pinned memory for small data

I am running host-to-device bandwidth tests for different sizes of data, and have noticed an increased bandwidth when the host memory is pinned versus pageable. Following is my plot of bandwidth in MB/s vs. data transfer size in bytes. One can notice that for small amounts of data (<300K) pageable fares better than pinned... is it related to memory allocation by the O/S?
This bandwidth test program is from NVIDIA's SDK code samples (with slight modifications on my side), and I am testing on a Tesla C2050 with CUDA 4.0. The O/S is 64-bit Linux.

The cudaMemcpy implementation has different code paths for different devices, source and destination locations, and data sizes, in order to try to achieve the best possible throughput.
The different rates you are seeing are probably due to the implementation switching as the array size changes.
For example, Fermi GPUs have both dedicated copy engines (which can run in parallel with kernels running on the SMs), and SMs which can access host memory over PCI-e. For smaller arrays, it may be more efficient for cudaMemcpy to be implemented as a kernel running on the SMs that reads host memory directly, and stores the loaded data in device memory (or vice versa), so the driver may choose to do this. Or it may be more efficient to use the copy engine -- I'm not sure which it does in practice, but I think switching between them is the cause of the crossover you see in your graph.

It is possible that the test is cheating.
Here is one of the timed code paths:
cutilSafeCall( cudaEventRecord( start, 0 ) );
if( PINNED == memMode )
{
    for( unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++ )
    {
        cutilSafeCall( cudaMemcpyAsync( h_odata, d_idata, memSize,
                                        cudaMemcpyDeviceToHost, 0) );
    }
}
else
{
    for( unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++ )
    {
        cutilSafeCall( cudaMemcpy( h_odata, d_idata, memSize,
                                   cudaMemcpyDeviceToHost) );
    }
}
cutilSafeCall( cudaEventRecord( stop, 0 ) );
Note that the test uses different functions to do the memcpy for the different kinds of memory. I think this is cheating, because the main difference between the two modes is how the memory is allocated: with cudaHostAlloc for pinned and with malloc for pageable.
Different memcpy functions can have different paths of error checking and transfer setup.
So, try to modify the test and do the copy in both modes with cudaMemcpy(), e.g. by changing the ifs after cudaEventRecord(...) to if( 0 && (PINNED == memMode) ).
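A minimal sketch of that modification, keeping the SDK's variable names but short-circuiting the pinned branch so both memory modes go through the same synchronous cudaMemcpy() path:
cutilSafeCall( cudaEventRecord( start, 0 ) );
if( 0 && (PINNED == memMode) )   // async path disabled for this experiment
{
    for( unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++ )
    {
        cutilSafeCall( cudaMemcpyAsync( h_odata, d_idata, memSize,
                                        cudaMemcpyDeviceToHost, 0) );
    }
}
else
{
    // both PINNED and PAGEABLE now take this branch; only the allocation
    // method (cudaHostAlloc vs. malloc) differs between the two runs
    for( unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++ )
    {
        cutilSafeCall( cudaMemcpy( h_odata, d_idata, memSize,
                                   cudaMemcpyDeviceToHost) );
    }
}
cutilSafeCall( cudaEventRecord( stop, 0 ) );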

Multiple global functions in the same CUDA source file

Can I write two separate global functions, that compute different things, in the same CUDA source file? Something like this:
__global__ void Ker1(mpz_t *d, mpz_t *c, mpz_t e, mpz_t n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    mpz_powm(d[i], c[i], e, n);
}
__global__ void Ker2(mpz_t *d, mpz_t *c, mpz_t dd, mpz_t n) // exponent renamed dd to avoid clashing with parameter d
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    mpz_powm(c[i], d[i], dd, n);
}
int main()
{
    /* ... */
    cudaMemcpy(decode_device, decode_buffer, memSize, cudaMemcpyHostToDevice);
    Ker1<<<dimGrid, dimBlock>>>(d_device, c_device, e, n);
    Ker2<<<dimGrid, dimBlock>>>(c_device, d_device, d, n);
    cudaMemcpy(decode_buffer, decode_device, memSize, cudaMemcpyDeviceToHost);
}
If not, how would you do something like this?
It is quite unclear what you're asking, but after three readings I assume: "Can I write several kernels in the same source file?"
You can write as many kernel launches as you want in your main function.
An example, from page 9:
...
cudaMemcpy( dev1, host1, size, H2D ) ;
kernel2 <<< grid, block, 0 >>> ( ..., dev2, ... ) ;
kernel3 <<< grid, block, 0 >>> ( ..., dev3, ... ) ;
cudaMemcpy( host4, dev4, size, D2H ) ;
...
From: the Streams and Concurrency webinar.
Kernel launches are asynchronous by default, so as soon as a kernel is launched on the GPU, the CPU moves on to the instructions that follow.
To force synchronization you have to use cudaDeviceSynchronize(), or any memory transfer via cudaMemcpy, which forces synchronization by itself.
Source: the CUDA FAQ.
Q: Can the CPU and GPU run in parallel?
Kernel invocation in CUDA is asynchronous, so the driver will return control to the application as soon as it has launched the kernel.
The "cudaThreadSynchronize()" API call should be used when measuring
performance to ensure that all device operations have completed before
stopping the timer.
CUDA functions that perform memory copies and that control graphics
interoperability are synchronous, and implicitly wait for all kernels
to complete.
By the way, if you don't need to synchronize between kernels, they can be executed concurrently if your GPU has the required compute capability (CC):
Q: Is it possible to execute multiple kernels at the same time?
Yes. GPUs of compute capability 2.x or higher support concurrent kernel execution and launches.
(also from the CUDA FAQ).
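As a minimal, self-contained sketch (the kernel bodies, names and sizes below are made up for illustration, not the asker's mpz code): two __global__ functions can live in the same .cu file and be launched back to back, and launches issued to the same (default) stream execute in order:
__global__ void scaleKernel(float *d, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= s;            // first kernel: scale in place
}

__global__ void offsetKernel(float *d, int n, float o)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += o;            // second kernel: add an offset
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    scaleKernel<<<grid, block>>>(d, n, 2.0f);   // asynchronous launch
    offsetKernel<<<grid, block>>>(d, n, 1.0f);  // same stream, so it runs after scaleKernel
    cudaDeviceSynchronize();                    // wait before timing or reading back results

    cudaFree(d);
    return 0;
}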

CUDA context lifetime

In my application I have some part of the code that works as follows:
main.cpp
int main()
{
    //First dimension usually small (1-10)
    //Second dimension (100 - 1500)
    //Third dimension (10000 - 1000000)
    vector<vector<vector<double>>> someInfo;
    Object someObject(...); //Host class
    for (int i = 0; i < N; i++)
        someObject.functionA(&(someInfo[i]));
}
Object.cpp
void SomeObject::functionB(vector<vector<double>> *someInfo)
{
#define GPU 1
#if GPU == 1
    //GPU COMPUTING
    computeOnGPU(someInfo, aConstValue, aSecondConstValue);
#else
    //CPU COMPUTING
#endif
}
Object.cu
extern "C" void computeOnGPU(vector<vector<double>> *someInfo, int aConstValue, int aSecondConstValue)
{
//Copy values to constant memory
//Allocate memory on GPU
//Copy data to GPU global memory
//Launch Kernel
//Copy data back to CPU
//Free memory
}
So, as (I hope) you can see in the code, the function that prepares the GPU is called many times, depending on the value of the first dimension.
All the values that I send to constant memory always remain the same, and the sizes of the pointers allocated in global memory are always the same (only the data changes).
This is the actual workflow in my code, but I'm not getting any speedup when using the GPU; the kernel does execute faster, but the memory transfers have become my problem (as reported by nvprof).
So I was wondering where in my app the CUDA context starts and finishes to see if there is a way to do only once the copies to constant memory and memory allocations.
Normally, the CUDA context begins with the first CUDA call in your application and ends when the application terminates.
You should be able to do what you have in mind, which is to do the allocations only once (at the beginning of your app) and the corresponding free operations only once (at the end of your app), and to populate __constant__ memory only once, before it is used the first time.
It's not necessary to allocate and free the data structures in GPU memory repeatedly if they are not changing in size.
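A rough sketch of that restructuring, using hypothetical helper names (initGPU, computeOnGPU, releaseGPU) and a flattened double buffer; only the data copies and the kernel launch remain in the per-call function:
// Sketch only: allocate once, set constant memory once, reuse for every call.
__constant__ int c_params[2];          // written once, read by every kernel launch
static double *d_buffer = 0;           // allocated once, reused for each call
static size_t d_capacity = 0;          // capacity in elements

extern "C" void initGPU(size_t maxElems, int aConstValue, int aSecondConstValue)
{
    int h_params[2] = { aConstValue, aSecondConstValue };
    cudaMemcpyToSymbol(c_params, h_params, sizeof(h_params)); // constant memory, once
    d_capacity = maxElems;
    cudaMalloc(&d_buffer, maxElems * sizeof(double));         // device allocation, once
}

extern "C" void computeOnGPU(double *h_data, size_t elems)
{
    cudaMemcpy(d_buffer, h_data, elems * sizeof(double), cudaMemcpyHostToDevice);
    // launch kernel on d_buffer ...
    cudaMemcpy(h_data, d_buffer, elems * sizeof(double), cudaMemcpyDeviceToHost);
}

extern "C" void releaseGPU(void)
{
    cudaFree(d_buffer);                // freed once, when the application is done
    d_buffer = 0;
}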

How randomly accessing small constant arrays from CUDA kernel works

My kernel uses a float array of size 8 by 8 with the random access pattern shown below.
// inds - big array of indices in range 0,...,7
// flts - 8 by 8 array of floats
// kernel essentially processes large 2D array by looping through slow coordinate
// and having block/thread parallelization of fast coordinate.
__global__ void kernel (int const* inds, float const* flt, ...)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x; // Global fast coordinate
    int idy;                                         // Global slow coordinate
    int sx = gridDim.x * blockDim.x;                 // Stride of fast coordinate
    for ( idy = 0; idy < 10000; idy++ )              // Slow coordinate loop
    {
        int id = idx + idy * sx;    // Global coordinate in 2D array
        int ind = inds[id];         // Index of random access to small array
        float f0 = flt[ind * 8 + 0];
        ...
        float f7 = flt[ind * 8 + 7];
        // Next I have some algebraic formulas that use f0, ..., f7
    }
}
What would be the best way to access the flt array?
1. Do not pass flt; use __constant__ memory. I am not sure how fast constant memory is when different threads access different data.
2. Use it as above (plain global memory). Load uniform will not be used because threads access different data. Will it nevertheless be fast because of the cache?
3. Copy it into shared memory and use a shared memory array.
4. Use textures. I have never used textures... Can this approach be fast?
For shared memory, it is probably better to transpose the flt array, i.e. access it this way to avoid bank conflicts:
float fj = flt_shared[j * 8 + ind]; // where j = 0, ..., 7
PS: Target architectures are Fermi and Kepler.
The "best" way depends also on the architecture you are working on. My personal experience with random access (your access seems to be sort of a random due to the use of the mapping inds[id]) on Fermi and Kepler is that L1 is now so fast that in many cases it is better to keep using global memory instead of shared memory or texture memory.
Accelerating global memory random access: invalidating the L1 cache line
Fermi and Kepler architectures support two types of loads from global memory. Full caching is the default mode: it attempts to hit in L1, then L2, then GMEM, and the load granularity is a 128-byte line. L2-only caching attempts to hit in L2, then GMEM, and the load granularity is 32 bytes. For certain random access patterns, memory efficiency can be increased by invalidating L1 and exploiting the lower granularity of L2. This can be done by passing the -Xptxas -dlcm=cg option to nvcc.
General guidelines for accelerating global memory access: disabling ECC support
Fermi and Kepler GPUs support Error Correcting Code (ECC), and ECC is enabled by default. ECC reduces peak memory bandwidth and is intended to enhance data integrity in applications like medical imaging and large-scale cluster computing. If not needed, it can be disabled for improved performance using the nvidia-smi utility on Linux (see the link), or via the Control Panel on Microsoft Windows systems. Note that toggling ECC on or off requires a reboot to take effect.
General guidelines for accelerating global memory access on Kepler: using the read-only data cache
Kepler features a 48KB cache for data that is known to be read-only for the duration of the function. Use of the read-only path is beneficial because it offloads the shared/L1 cache path and it supports full-speed unaligned memory access. Use of the read-only path can be managed automatically by the compiler (use the const __restrict__ qualifiers) or explicitly by the programmer (use the __ldg() intrinsic).
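As a hedged sketch of what the read-only path could look like for the flt lookups (the kernel skeleton mirrors the question; __ldg() requires compute capability 3.5, so a Fermi build falls back to ordinary loads):
// Sketch: routing the flt loads through the read-only data cache on Kepler.
// The const __restrict__ qualifiers let the compiler use the read-only path
// automatically; __ldg() forces it explicitly on compute capability 3.5+.
__global__ void kernel(const int* __restrict__ inds,
                       const float* __restrict__ flt /*, ... */)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int sx  = gridDim.x * blockDim.x;
    for (int idy = 0; idy < 10000; idy++)
    {
        int id  = idx + idy * sx;
        int ind = inds[id];
#if __CUDA_ARCH__ >= 350
        float f0 = __ldg(&flt[ind * 8 + 0]);   // explicit read-only load
        float f7 = __ldg(&flt[ind * 8 + 7]);
#else
        float f0 = flt[ind * 8 + 0];           // Fermi: regular cached load
        float f7 = flt[ind * 8 + 7];
#endif
        // ... algebraic formulas that use f0, ..., f7 ...
        (void)f0; (void)f7;                    // silence unused-variable warnings in this sketch
    }
}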

Thrust: transform_reduce : cudaMalloc in unary_op.operator

In my unary_op.operator, I need to create a temporary array.
I guess cudaMalloc is the way to go.
But, is it performance efficient or is there a better design?
#include <thrust/transform_reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>

struct my_unary_op
{
    __host__ __device__ int operator()(const int& index) const
    {
        int* array;
        cudaMalloc((void**)&array, 10*sizeof(int));
        for(int i = 0; i < 10; i++)
            array[i] = index;
        int sum = 0;
        for(int i = 0; i < 10; i++)
            sum += array[i];
        return sum;
    }
};

int main()
{
    thrust::counting_iterator<int> first(0);
    thrust::counting_iterator<int> last = first + 100;
    my_unary_op unary_op = my_unary_op();
    thrust::plus<int> binary_op;
    int init = 0;
    int sum = thrust::transform_reduce(first, last, unary_op, init, binary_op);
    return 0;
}
You won't be able to compile cudaMalloc() in a __device__ function, because it is a host-only function. You can, however, use plain malloc() or new (on devices of compute capability >= 2.0), but these are not very efficient when running on the device. There are two reasons for this. The first is that concurrently running threads are serialized during the memory allocation call. The second is that the calls allocate global memory in chunks that become arranged in such a way that, when the memory load and store instructions are run by the 32 threads in a warp, they are not adjacent, so you don't get properly coalesced memory accesses.
You can address both of these issues by using fixed-size C-style arrays in your __device__ functions (i.e., int array[10];). Small, fixed-size arrays can sometimes be optimized by the compiler so that they are stored in the register file, for extremely fast access. If the compiler cannot keep them in registers, it will use local memory. Local memory is stored in global memory, but it is interleaved in such a way that when the 32 threads in a warp run a load or store instruction, each thread accesses adjacent locations in memory, enabling the transactions to be fully coalesced.
If you don't know at runtime what the size of your C arrays will be, allocate a maximum size in the array and leave some of it unused.
I think that the total amount of memory that is used by the fixed-size array will depend on the total number of threads that are processed concurrently on the GPU, not on the total number of threads launched by the kernel. In this answer, @mharris shows how to calculate the maximum possible number of concurrent threads, which is 24,576 for a GTX 580. So, if the fixed-size array is 16 32-bit values, the maximum possible amount of memory used by the array would be 1536 KiB.
If you need a wide range of array sizes, you can use templates to compile kernels with a number of different sizes. Then, at runtime, select one that is able to accommodate the size that you need. However, chances are that if you simply allocate the maximum of what you might need, the memory usage will not be the limiting factor in the number of threads that you can launch.
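A sketch of the fixed-size-array version of the question's functor, with no device-side allocation at all (the per-thread array lives in registers or local memory):
#include <thrust/transform_reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>

struct my_unary_op
{
    __host__ __device__ int operator()(const int& index) const
    {
        int array[10];                     // fixed-size, per-thread storage
        for (int i = 0; i < 10; i++)
            array[i] = index;
        int sum = 0;
        for (int i = 0; i < 10; i++)
            sum += array[i];
        return sum;
    }
};

int main()
{
    thrust::counting_iterator<int> first(0);
    thrust::counting_iterator<int> last = first + 100;
    int sum = thrust::transform_reduce(first, last, my_unary_op(),
                                       0, thrust::plus<int>());
    return 0;
}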

Launching a CUDA stream from each host thread

My intention is to use n host threads to create n streams concurrently on an NVIDIA Tesla C2050. The kernel is a simple vector multiplication... I am dividing the data equally amongst the n streams, and each stream would have concurrent execution/data transfer going on.
The data is floating point. I sometimes get CPU/GPU sums that are equal, and sometimes they are wide apart... I guess this could be attributed to the loss of synchronization constructs in my code, but I also don't think any synchronization constructs between streams are necessary, because I want every CPU thread to have a unique stream to control, and I do not care about asynchronous data copy and kernel execution within a thread.
Following is the code each thread runs:
//every thread would run this method in conjunction
static CUT_THREADPROC solverThread(TGPUplan *plan)
{
    //Allocate memory
    cutilSafeCall( cudaMalloc((void**)&plan->d_Data, plan->dataN * sizeof(float)) );
    //Copy input data from CPU
    cutilSafeCall( cudaMemcpyAsync((void *)plan->d_Data, (void *)plan->h_Data, plan->dataN * sizeof(float), cudaMemcpyHostToDevice, plan->stream) );
    //to make cudaMemcpyAsync blocking
    cudaStreamSynchronize( plan->stream );
    //launch
    launch_simpleKernel( plan->d_Data, BLOCK_N, THREAD_N, plan->stream);
    cutilCheckMsg("simpleKernel() execution failed.\n");
    cudaStreamSynchronize(plan->stream);
    //Read back GPU results
    cutilSafeCall( cudaMemcpyAsync(plan->h_Data, plan->d_Data, plan->dataN * sizeof(float), cudaMemcpyDeviceToHost, plan->stream) );
    //to make the cudaMemcpyAsync blocking...
    cudaStreamSynchronize(plan->stream);
    cutilSafeCall( cudaFree(plan->d_Data) );
    CUT_THREADEND;
}
And the creation of multiple threads and the call to the above function:
for(i = 0; i < nkernels; i++)
    threadID[i] = cutStartThread((CUT_THREADROUTINE)solverThread, &plan[i]);
printf("main(): waiting for GPU results...\n");
cutWaitForThreads(threadID, nkernels);
I took this strategy from one of the CUDA SDK code samples. As I've said before, this code works sometimes, and other times it gives wayward results. I need help fixing this code...
First off, I am not an expert by any stretch of the imagination; this is just from my experience.
I don't see why this needs multiple host threads. It seems like you're managing one device and passing it multiple streams. The way I've seen this done (pseudocode):
{
    create a handle
    allocate an array of streams equal to the number of streams you want
    for(int n = 0; n < NUM_STREAMS; n++)
    {
        cudaStreamCreate(&streamArray[n]);
    }
}
From there you can just pass the streams in your array to the various asynchronous calls (cudaMemcpyAsync(), kernel launches, etc.) and the device manages the rest. I've had weird scalability issues with multiple streams (don't try to make 10k streams; I run into problems around 4-8 on a GTX 460), so don't be surprised if you run into those. Best of luck,
John
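For reference, a rough sketch of that single-host-thread pattern (NUM_STREAMS, the equal chunk split, and the reuse of the question's launch_simpleKernel(), BLOCK_N and THREAD_N are assumptions; h_Data must be allocated with cudaMallocHost() for the async copies to actually overlap):
// Sketch: one host thread driving several streams over one device.
void runStreams(float *h_Data, float *d_Data, int dataN)
{
    const int NUM_STREAMS = 4;            // assumed stream count
    int chunkN = dataN / NUM_STREAMS;     // equal split, assumed to divide evenly

    cudaStream_t streams[NUM_STREAMS];
    for (int n = 0; n < NUM_STREAMS; n++)
        cudaStreamCreate(&streams[n]);

    for (int n = 0; n < NUM_STREAMS; n++)
    {
        int offset = n * chunkN;
        cudaMemcpyAsync(d_Data + offset, h_Data + offset, chunkN * sizeof(float),
                        cudaMemcpyHostToDevice, streams[n]);
        launch_simpleKernel(d_Data + offset, BLOCK_N, THREAD_N, streams[n]);
        cudaMemcpyAsync(h_Data + offset, d_Data + offset, chunkN * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[n]);
    }

    cudaDeviceSynchronize();              // wait for all streams to finish

    for (int n = 0; n < NUM_STREAMS; n++)
        cudaStreamDestroy(streams[n]);
}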
My bet is that BLOCK_N and THREAD_N don't cover the exact size of the array you are passing.
Please provide the code for initializing the streams and the size of those buffers.
As a side note, streams are useful for overlapping computation with memory transfer. Synchronizing the stream after each async call is not useful at all.
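A sketch of the question's solverThread with only one synchronization point, assuming plan->h_Data is allocated with cudaMallocHost() so the async copies are truly asynchronous:
//every thread runs this method; operations within one stream are already ordered,
//so a single cudaStreamSynchronize() at the end is enough
static CUT_THREADPROC solverThread(TGPUplan *plan)
{
    cutilSafeCall( cudaMalloc((void**)&plan->d_Data, plan->dataN * sizeof(float)) );

    cutilSafeCall( cudaMemcpyAsync(plan->d_Data, plan->h_Data,
                                   plan->dataN * sizeof(float),
                                   cudaMemcpyHostToDevice, plan->stream) );
    launch_simpleKernel(plan->d_Data, BLOCK_N, THREAD_N, plan->stream);
    cutilCheckMsg("simpleKernel() execution failed.\n");
    cutilSafeCall( cudaMemcpyAsync(plan->h_Data, plan->d_Data,
                                   plan->dataN * sizeof(float),
                                   cudaMemcpyDeviceToHost, plan->stream) );

    cudaStreamSynchronize(plan->stream);   // single sync point
    cutilSafeCall( cudaFree(plan->d_Data) );
    CUT_THREADEND;
}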