Multiple global functions in the same CUDA source file

Can I write two separate __global__ functions that compute different things in the same CUDA source file? Something like this:
__global__ void Ker1(mpz_t *d,mpz_t *c,mpz_t e,mpz_t n )
{
int i=blockIdx.x*blockDim.x + threadIdx.x;
mpz_powm (d[i], c[i], e, n);
}
__global__ void Ker2(mpz_t *in,mpz_t *out,mpz_t d, mpz_t n)
{
// The array parameters are renamed so they no longer clash with the exponent d.
int i=blockIdx.x*blockDim.x + threadIdx.x;
mpz_powm(out[i], in[i],d, n);
}
int main()
{
/* ... */
cudaMemcpy(decode_device,decode_buffer,memSize,cudaMemcpyHostToDevice);
Ker1<<<dimGrid , dimBlock >>>( d_device,c_device,e,n );
Ker2<<<dimGrid , dimBlock>>>(c_device,d_device,d,n);
cudaMemcpy(decode_buffer,decode_device,memSize,cudaMemcpyDeviceToHost);
}
If not, how would you do something like this?

The question is somewhat unclear, but after three readings I assume you are asking: "Can I write several kernels in the same source file?"
Yes. You can define as many kernels as you want in one source file and launch as many of them as you want from your main function.
An example (page 9 of the source cited below):
...
cudaMemcpy( dev1, host1, size, H2D ) ;
kernel2 <<< grid, block, 0 >>> ( ..., dev2, ... ) ;
kernel3 <<< grid, block, 0 >>> ( ..., dev3, ... ) ;
cudaMemcpy( host4, dev4, size, D2H ) ;
...
From: Streams and concurrency webinar
Kernel launches are asynchronous by default: as soon as the kernel has been launched on the GPU, the CPU continues with the instructions that follow.
To force synchronization, call cudaDeviceSynchronize(), or perform a memory transfer with cudaMemcpy, which synchronizes by itself.
Source: the CUDA FAQ.
Q: Can the CPU and GPU run in parallel?
Kernel invocation in CUDA is asynchronous, so the driver will return control to the application as soon as it has launched the kernel.
The "cudaThreadSynchronize()" API call should be used when measuring
performance to ensure that all device operations have completed before
stopping the timer.
CUDA functions that perform memory copies and that control graphics
interoperability are synchronous, and implicitly wait for all kernels
to complete.
By the way, if you don't need to synchronize between kernels, they can be executed concurrently, provided they are launched in different streams and your GPU has the required compute capability (CC):
Q: Is it possible to execute multiple kernels at the same time?
Yes. GPUs of compute capability 2.x or higher support concurrent kernel execution and launches.
(also from the CUDA FAQ).
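As a minimal sketch of the answer above, here are two unrelated kernels in one .cu file, launched back to back from main. The kernel names (squareKernel, addOneKernel) and the integer data are illustrative assumptions, not taken from the question; GMP's mpz_powm is left out on purpose because it is a host library function.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// First kernel (hypothetical): squares each element in place.
__global__ void squareKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

// Second kernel, defined in the same source file: adds one to each element.
__global__ void addOneKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 8;
    int host[n] = {0, 1, 2, 3, 4, 5, 6, 7};
    int *dev = nullptr;

    cudaMalloc(&dev, n * sizeof(int));
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);

    // Both launches go to the default stream, so they execute in issue order.
    squareKernel<<<1, n>>>(dev, n);
    addOneKernel<<<1, n>>>(dev, n);

    // Launches return immediately; check for launch errors, then wait.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // This blocking copy would also wait for the kernels by itself.
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) printf("%d ", host[i]);  // i*i + 1 for each i
    printf("\n");

    cudaFree(dev);
    return 0;
}
```

The explicit cudaDeviceSynchronize() is redundant before a blocking cudaMemcpy, but it gives one place to catch errors raised while the kernels ran.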

Related

Getting an unexpected value in global device memory when multiple threads write to it

I have a problem with CUDA threads and memory management: the program returns a single thread's result, "100", but I would expect the result of 9 threads, "900".
#include <stdio.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <helper_functions.h>
#include <helper_cuda.h>
__global__
void test(int in1,int*ptr){
int e = 0;
for (int i = 0; i < 100; i++){
e++;
}
*ptr +=e;
}
int main(int argc, char **argv)
{
int devID = 0;
cudaError_t error;
error = cudaGetDevice(&devID);
if (error == cudaSuccess)
{
printf("GPU Device fine\n");
}
else{
printf("GPU Device problem, aborting");
abort();
}
int* d_A;
cudaMalloc(&d_A, sizeof(int));
int res=0;
//cudaMemcpy(d_A, &res, sizeof(int), cudaMemcpyHostToDevice);
test <<<3, 3 >>>(0,d_A);
cudaDeviceSynchronize();
cudaMemcpy(&res, d_A, sizeof(int),cudaMemcpyDeviceToHost);
printf("res is : %i",res);
Sleep(10000);
return 0;
}
It returns:
GPU Device fine\n
res is : 100
I would expect it to return a higher number because the launch uses 3x3 (blocks, threads), instead of just one thread's result.
What is done wrong, and where do the numbers get lost?
You can't write your sum to global memory in this way.
You have to use an atomic function to ensure that the store is atomic.
In general, when multiple device threads write to the same value in global memory, you have to use either atomic functions:
float atomicAdd(float* address, float val);
double atomicAdd(double* address, double val);
reads the 32-bit or 64-bit word old located at the address address in
global or shared memory, computes (old + val), and stores the result
back to memory at the same address. These three operations are
performed in one atomic transaction. The function returns old.
or thread synchronization :
Throughput for __syncthreads() is 16 operations per clock cycle for
devices of compute capability 2.x, 128 operations per clock cycle for
devices of compute capability 3.x, 32 operations per clock cycle for
devices of compute capability 6.0 and 64 operations per clock cycle
for devices of compute capability 5.x, 6.1 and 6.2.
Note that __syncthreads() can impact performance by forcing the
multiprocessor to idle as detailed in Device Memory Accesses.
(adapting another answer of mine:)
You are experiencing the effects of the increment operator not being atomic. (C++-oriented description of what that means). What's happening, chronologically, is the following sequence of events (not necessarily in the same order of threads though):
...(other work)...
block 0 thread 0 issues a LOAD instruction with address ptr into register r
block 0 thread 1 issues a LOAD instruction with address ptr into register r
...
block 2 thread 0 issues a LOAD instruction with address ptr into register r
block 0 thread 0 completes the LOAD, now having 0 in register r
...
block 2 thread 2 completes the LOAD, now having 0 in register r
block 0 thread 0 adds 100 to r
...
block 2 thread 2 adds 100 to r
block 0 thread 0 issues a STORE instruction from register r to address ptr
...
block 2 thread 2 issues a STORE instruction from register r to address ptr
Thus every thread sees the initial value of *ptr, which is 0; adds 100; and stores 0+100=100 back. The order of the stores doesn't matter here as long as all of the threads try to store the same false value.
What you need to do is either:
Use atomic operations. This requires the fewest modifications to your code, but is very inefficient, since it serializes your work to a great extent, or
Use a block-level reduction primitive. This will ensure some partial ordering of the computational activity vis-a-vis shared block memory, using __syncthreads() or other mechanisms. Thus it might first have each thread add its own two elements up; then synchronize block threads; then have fewer threads add up pairs of pair-sums, and so on. Here's an NVIDIA blog post on implementing fast reductions on their more modern GPU architectures, or
Compute block-local or warp-local and/or work-group-specific partial results, which require less/cheaper synchronization, and combine them eventually after having done a lot of work on them.
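The first two options can be sketched as follows. This is an illustration under assumptions, not the answerer's code: the kernel names and the MAX_THREADS bound are hypothetical, and the block-level variant uses a deliberately simple serial sum by thread 0 rather than the tree reduction described above. Note also that the device counter is zeroed first, which the question's code skips by commenting out its cudaMemcpy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MAX_THREADS 32  // assumed upper bound on blockDim.x for this sketch

// Variant 1: every thread adds its partial result atomically.
__global__ void testAtomic(int *ptr)
{
    int e = 0;
    for (int i = 0; i < 100; i++) e++;
    atomicAdd(ptr, e);  // load, add and store happen as one indivisible step
}

// Variant 2: combine within the block in shared memory, then one atomicAdd per block.
__global__ void testBlockSum(int *ptr)
{
    __shared__ int partial[MAX_THREADS];
    int e = 0;
    for (int i = 0; i < 100; i++) e++;
    partial[threadIdx.x] = e;
    __syncthreads();  // all per-thread results in this block are now visible

    if (threadIdx.x == 0) {
        int sum = 0;
        for (unsigned int t = 0; t < blockDim.x; t++) sum += partial[t];
        atomicAdd(ptr, sum);  // one atomic per block instead of one per thread
    }
}

int main()
{
    int *d_A = nullptr;
    int res = 0;
    cudaMalloc(&d_A, sizeof(int));

    cudaMemset(d_A, 0, sizeof(int));  // cudaMalloc does not zero the allocation
    testAtomic<<<3, 3>>>(d_A);
    cudaMemcpy(&res, d_A, sizeof(int), cudaMemcpyDeviceToHost);
    printf("atomic: %d\n", res);      // 9 threads * 100 each

    cudaMemset(d_A, 0, sizeof(int));
    testBlockSum<<<3, 3>>>(d_A);
    cudaMemcpy(&res, d_A, sizeof(int), cudaMemcpyDeviceToHost);
    printf("block sum: %d\n", res);   // 3 blocks * 300 each

    cudaFree(d_A);
    return 0;
}
```

With only nine threads the two variants cost about the same; the per-block version starts to pay off when many threads would otherwise contend for the same address.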

CUDA shared memory under the hood questions

I have several questions regarding CUDA shared memory.
First, as mentioned in this post, shared memory may be declared in two different ways:
Either as dynamically allocated shared memory, like the following:
// Launch the kernel
dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);
This may be used inside a kernel as mentioned:
extern __shared__ int s[];
Or as static shared memory, which can be declared in the kernel like the following:
__shared__ int s[64];
Both are used for different reasons; however, which one is better and why?
Second, I'm running a multi-block kernel with 256 threads per block. I'm using static shared memory in both __global__ and __device__ functions. An example is given:
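// Forward declaration, since reduce() is defined after its first use.
__device__ float reduce(float data, unsigned int tid);
__global__ void startKernel(float* p_d_array)
{
__shared__ double matA[3*3];
float a1 = 0;
float a2 = 0;
float a3 = 0;
float b = p_d_array[threadIdx.x];
a1 += reduce( b, threadIdx.x);
a2 += reduce( b, threadIdx.x);
a3 += reduce( b, threadIdx.x);
// continue...
}
__device__ float reduce ( float data , unsigned int tid)
{
// Renamed from data so it does not clash with the parameter of the same name.
__shared__ float sdata[256];
// do reduce ...
}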
I'd like to know how the shared memory is allocated in such a case. I presume each block receives its own shared memory.
What happens when block #0 goes into the reduce function?
Is the shared memory allocated in advance of the function call?
I call the reduce device function three times. In such a case, theoretically, in block #0, threads [0,127] may still be executing the first reduce call ("delayed due to hard work"), while threads [128,255] may already be operating on the second reduce call. In this case, I'd like to know whether both reduce calls use the same shared memory,
even though they come from two different function calls?
On the other hand, is it possible that a single block allocates 3*256*sizeof(float) of shared memory for these function calls? That seems superfluous in CUDA terms, but I still want to know how CUDA operates in such a case.
Third, is it possible to gain higher performance with shared memory due to compiler optimization by using
const float* p_shared ;
or the restrict keyword after the data assignment section?
AFAIR, there is little difference whether you request shared memory "dynamically" or "statically": in either case it's just a kernel launch parameter, whether it is set by your code or by code generated by the compiler.
Regarding the second question: the compiler will sum the shared memory requirements of the kernel function and of the functions called by the kernel.
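To make the two declaration styles from the question concrete, here is a small sketch that reverses a 64-element array in one block, once with a statically sized shared array and once with a dynamically sized one. dynamicReverse and its launch line follow the snippet quoted in the question; the staticReverse kernel, the host code and the single-block, n <= 64 setup are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Static shared memory: the size is fixed at compile time (assumes n <= 64).
__global__ void staticReverse(int *d, int n)
{
    __shared__ int s[64];
    int t = threadIdx.x;
    int tr = n - t - 1;
    s[t] = d[t];
    __syncthreads();  // every element must be staged before anyone reads it
    d[t] = s[tr];
}

// Dynamic shared memory: the size comes from the third launch parameter.
__global__ void dynamicReverse(int *d, int n)
{
    extern __shared__ int s[];
    int t = threadIdx.x;
    int tr = n - t - 1;
    s[t] = d[t];
    __syncthreads();
    d[t] = s[tr];
}

int main()
{
    const int n = 64;
    int a[n], r[n];
    int *d_d = nullptr;
    for (int i = 0; i < n; i++) a[i] = i;
    cudaMalloc(&d_d, n * sizeof(int));

    cudaMemcpy(d_d, a, n * sizeof(int), cudaMemcpyHostToDevice);
    staticReverse<<<1, n>>>(d_d, n);
    cudaMemcpy(r, d_d, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("static:  r[0] = %d, r[%d] = %d\n", r[0], n - 1, r[n - 1]);

    cudaMemcpy(d_d, a, n * sizeof(int), cudaMemcpyHostToDevice);
    // Launch the kernel with n*sizeof(int) bytes of dynamic shared memory.
    dynamicReverse<<<1, n, n * sizeof(int)>>>(d_d, n);
    cudaMemcpy(r, d_d, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("dynamic: r[0] = %d, r[%d] = %d\n", r[0], n - 1, r[n - 1]);

    cudaFree(d_d);
    return 0;
}
```

The kernel bodies are identical; the practical difference is only whether the array size must be known when the code is compiled or can be chosen at launch time.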

Declaring Variables in a CUDA kernel

Say you declare a new variable in a CUDA kernel and then use it in multiple threads, like:
__global__ void kernel(float* delt, float* deltb) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
float a;
a = delt[i] + deltb[i];
a += 1;
}
and the kernel call looks something like below, with multiple threads and blocks:
int threads = 200;
uint3 blocks = make_uint3(200,1,1);
kernel<<<blocks,threads>>>(d_delt, d_deltb);
Is "a" stored on the stack?
Is a new "a" created for each thread when they are initialized?
Or will each thread independently access "a" at an unknown time, potentially messing up the algorithm?
Any variable (scalar or array) declared inside a kernel function without a __shared__ specifier (including extern __shared__) is local to each thread; that is, each thread has its own "copy" of that variable, so no data race among threads will occur.
The compiler chooses whether local variables reside in registers or in local memory (which is physically located in device global memory), depending on the transformations and optimizations it performs.
Further details on which variables go on local memory can be found in the NVIDIA CUDA user guide, chapter 5.3.2.2
None of the above. The CUDA compiler is smart enough and aggressive enough with optimisations that it can detect that a is unused, and the complete code can be optimised away. You can confirm this by compiling the kernel with -Xptxas=-v as an option and looking at the resource count, which should be basically no registers and no local memory or heap.
In a less trivial example, a would probably be stored in a per thread register, or in per thread local memory, which is off-die DRAM.
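A less trivial variant of the question's kernel might look like the sketch below, where a is written to an output array so the compiler cannot discard it. The out parameter, the bounds check and the host-side verification are additions for illustration only; the launch configuration (200 blocks of 200 threads) is the one from the question.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void kernel(const float *delt, const float *deltb, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float a;                 // one private copy per thread, typically held in a register
    a = delt[i] + deltb[i];
    a += 1;
    out[i] = a;              // the store keeps a from being optimised away
}

int main()
{
    const int threads = 200, blocks = 200, n = threads * blocks;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_delt(n), h_deltb(n), h_out(n);
    for (int i = 0; i < n; i++) { h_delt[i] = (float)i; h_deltb[i] = 2.0f * i; }

    float *d_delt, *d_deltb, *d_out;
    cudaMalloc(&d_delt, bytes);
    cudaMalloc(&d_deltb, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_delt, h_delt.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_deltb, h_deltb.data(), bytes, cudaMemcpyHostToDevice);

    kernel<<<blocks, threads>>>(d_delt, d_deltb, d_out, n);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    // If threads shared a single a, some outputs would differ from 3*i + 1.
    int mismatches = 0;
    for (int i = 0; i < n; i++)
        if (h_out[i] != 3.0f * i + 1.0f) mismatches++;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_delt); cudaFree(d_deltb); cudaFree(d_out);
    return 0;
}
```

Compiling this with -Xptxas=-v should now report a nonzero register count for the kernel, in contrast to the original version.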

CUDA context lifetime

In my application I have some part of the code that works as follows
main.cpp
int main()
{
//First dimension usually small (1-10)
//Second dimension (100 - 1500)
//Third dimension (10000 - 1000000)
vector<vector<vector<double>>> someInfo;
Object someObject(...); //Host class
for (int i = 0; i < N; i++)
someObject.functionA(&(someInfo[i]));
}
Object.cpp
void Object::functionA(vector<vector<double>> *someInfo)
{
#define GPU 1
#if GPU == 1
//GPU COMPUTING
computeOnGPU(someInfo, aConstValue, aSecondConstValue);
#else
//CPU COMPUTING
#endif
}
Object.cu
extern "C" void computeOnGPU(vector<vector<double>> *someInfo, int aConstValue, int aSecondConstValue)
{
//Copy values to constant memory
//Allocate memory on GPU
//Copy data to GPU global memory
//Launch Kernel
//Copy data back to CPU
//Free memory
}
So as (I hope) you can see in the code, the function that prepares the GPU is called many times depending on the value of the first dimension.
All the values that I send to constant memory always remain the same, and the sizes of the buffers allocated in global memory are always the same (only the data changes).
This is the actual workflow in my code, but I'm not getting any speedup when using the GPU: the kernel does execute faster, but the memory transfers have become my problem (as reported by nvprof).
So I was wondering where in my app the CUDA context starts and finishes, to see if there is a way to do the copies to constant memory and the memory allocations only once.
Normally, the CUDA context begins with the first CUDA call in your application, and ends when the application terminates.
You should be able to do what you have in mind, which is to do the allocations only once (at the beginning of your app) and the corresponding free operations only once (at the end of your app), and to populate __constant__ memory only once, before it is used the first time.
It's not necessary to allocate and free the data structures in GPU memory repeatedly if they are not changing in size.
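One way to arrange this is to split the work into an init step, a per-call compute step and a cleanup step, as in the sketch below. All names here (gpuInit, gpuCompute, gpuCleanup, c_params, scaleKernel) are hypothetical, and the kernel is a placeholder for the real computation. The sketch only removes the repeated allocation and constant-memory copies; the per-call data transfers remain.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__constant__ int c_params[2];     // hypothetical stand-in for the two constant values
static double *d_buf = nullptr;   // device buffer reused across calls
static size_t d_count = 0;        // capacity of d_buf in elements

// Placeholder kernel: data[i] = data[i] * c_params[0] + c_params[1].
__global__ void scaleKernel(double *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * c_params[0] + c_params[1];
}

// Called once at startup: fill constant memory and allocate the buffer.
void gpuInit(size_t count, int a, int b)
{
    int params[2] = {a, b};
    cudaMemcpyToSymbol(c_params, params, sizeof(params));
    cudaMalloc(&d_buf, count * sizeof(double));
    d_count = count;
}

// Called many times: only the data moves. Assumes v.size() <= d_count.
void gpuCompute(std::vector<double> &v)
{
    size_t bytes = v.size() * sizeof(double);
    cudaMemcpy(d_buf, v.data(), bytes, cudaMemcpyHostToDevice);
    unsigned int blocks = (unsigned int)((v.size() + 255) / 256);
    scaleKernel<<<blocks, 256>>>(d_buf, v.size());
    cudaMemcpy(v.data(), d_buf, bytes, cudaMemcpyDeviceToHost);
}

// Called once at shutdown.
void gpuCleanup()
{
    cudaFree(d_buf);
    d_buf = nullptr;
    d_count = 0;
}

int main()
{
    const size_t count = 1000;
    gpuInit(count, 2, 1);
    for (int i = 0; i < 3; i++) {
        std::vector<double> v(count, (double)i);
        gpuCompute(v);
        printf("call %d: v[0] = %f\n", i, v[0]);  // i * 2 + 1
    }
    gpuCleanup();
    return 0;
}
```

In the question's layout, gpuInit would run before the loop in main, gpuCompute would replace the body of computeOnGPU, and gpuCleanup would run after the loop.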

CUDA Pinned memory for small data

I am running host-to-device bandwidth tests for different sizes of data, and have noticed increased bandwidth when the host memory is pinned rather than pageable. Following is my plot of bandwidth in MB/s vs. data transfer size in bytes. One can notice that for small amounts of data (<300K) pageable fares better than pinned. Is it related to memory allocation by the OS?
This bandwidthTest program is from NVIDIA's code sample SDK (with slight modifications from my side), and I am testing against a Tesla C2050 using CUDA 4.0. The OS is 64-bit Linux.
The cudaMemcpy implementation has different code paths for different devices, source and destination locations, and data sizes, in order to try to achieve the best possible throughput.
The different rates you are seeing are probably due to the implementation switching as the array size changes.
For example, Fermi GPUs have both dedicated copy engines (which can run in parallel with kernels running on the SMs), and SMs which can access host memory over PCI-e. For smaller arrays, it may be more efficient for cudaMemcpy to be implemented as a kernel running on the SMs that reads host memory directly, and stores the loaded data in device memory (or vice versa), so the driver may choose to do this. Or it may be more efficient to use the copy engine -- I'm not sure which it does in practice, but I think switching between them is the cause of the crossover you see in your graph.
It is possible that the test is cheating.
Here is one of the timed sections of code:
cutilSafeCall( cudaEventRecord( start, 0 ) );
if( PINNED == memMode )
{
for( unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++ )
{
cutilSafeCall( cudaMemcpyAsync( h_odata, d_idata, memSize,
cudaMemcpyDeviceToHost, 0) );
}
}
else
{
for( unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++ )
{
cutilSafeCall( cudaMemcpy( h_odata, d_idata, memSize,
cudaMemcpyDeviceToHost) );
}
}
cutilSafeCall( cudaEventRecord( stop, 0 ) );
Note that the test uses different functions to do the memcpy for the different kinds of memory. I think this is cheating, because the main difference between the modes should be how the memory is allocated: with cudaHostAlloc for pinned and with malloc for unpinned.
Different memcpy functions can take different paths for error checking and transfer setup.
So, try to modify the test to do the copy in both modes with cudaMemcpy(), e.g. by changing all the ifs after cudaEventRecord(...) to if( 0 && (PINNED == memMode) )
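A standalone version of that fairer comparison could look like the sketch below, which times the same cudaMemcpy call for a pageable and a pinned host buffer. This is not the SDK's bandwidthTest: the timeCopies helper, the 64 KB transfer size and the iteration count are assumptions chosen for illustration, and the direction is host to device as in the question.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Times `iterations` host-to-device cudaMemcpy calls and returns MB/s.
static float timeCopies(void *d_dst, const void *h_src, size_t bytes, int iterations)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iterations; i++)
        cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);  // same call for both modes
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    float megabytes = (float)bytes * iterations / (1024.0f * 1024.0f);
    return megabytes / (ms / 1000.0f);
}

int main()
{
    const size_t bytes = 64 * 1024;  // a "small" transfer, below the crossover in the question
    const int iterations = 100;

    void *h_pageable = malloc(bytes);   // ordinary pageable host memory
    void *h_pinned = nullptr;
    cudaMallocHost(&h_pinned, bytes);   // page-locked (pinned) host memory
    void *d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);

    memset(h_pageable, 1, bytes);
    memset(h_pinned, 1, bytes);

    // The only difference between the two measurements is the allocation.
    printf("pageable: %.1f MB/s\n", timeCopies(d_buf, h_pageable, bytes, iterations));
    printf("pinned:   %.1f MB/s\n", timeCopies(d_buf, h_pinned, bytes, iterations));

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    free(h_pageable);
    return 0;
}
```

Repeating the run for several values of bytes would show whether the crossover in the plot survives once both modes go through the same copy function.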