I changed the way I allocate host memory from method 1 to method 2, as shown in my code below. The code compiles and runs without any error.
I just wonder whether method 2 is a proper way to allocate memory for a pointer-to-pointer, and whether it has any side effects.
#define TESTSIZE 10
#define DIGITSIZE 5
//Method 1
int **ra;
ra = (int**)malloc(TESTSIZE * sizeof(int *));      // array of TESTSIZE row pointers
for(int i = 0; i < TESTSIZE; i++){
    ra[i] = (int *)malloc(DIGITSIZE * sizeof(int));
}
//Method 2
int **ra;
cudaMallocHost((void**)&ra, TESTSIZE * sizeof(int *));   // pinned array of TESTSIZE row pointers
for(int i = 0; i < TESTSIZE; i++){
    cudaMallocHost((void**)&ra[i], DIGITSIZE * sizeof(int));
}
Both of them work fine. Yet, there are differences between cudaMallocHost and malloc. The reason is that cudaMallocHost allocates pinned memory, so under the hood the OS does something similar to malloc plus some extra work to pin the pages. This means that cudaMallocHost generally takes longer.
That being said, if you repeatedly want to cudaMemcpy from a single buffer, then cudaMallocHost may pay off in the long run, since transfers from pinned memory are faster.
Also, pinned memory is required if you want to overlap data transfers with computation using streams.
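For illustration, here is a minimal sketch of that overlap pattern; my_kernel, h_buf, d_buf, and CHUNK are made-up placeholders, not from your code:
// Sketch: overlap host-to-device copies with kernel work using two streams.
// The async copies are only truly asynchronous because h_buf is pinned (cudaMallocHost).
const int CHUNK = 1 << 20;
float *h_buf, *d_buf;
cudaMallocHost((void**)&h_buf, 2 * CHUNK * sizeof(float));   // pinned host buffer
cudaMalloc((void**)&d_buf, 2 * CHUNK * sizeof(float));

cudaStream_t s[2];
for (int i = 0; i < 2; i++) cudaStreamCreate(&s[i]);

for (int i = 0; i < 2; i++) {
    cudaMemcpyAsync(d_buf + i * CHUNK, h_buf + i * CHUNK,
                    CHUNK * sizeof(float), cudaMemcpyHostToDevice, s[i]);
    my_kernel<<<CHUNK / 256, 256, 0, s[i]>>>(d_buf + i * CHUNK, CHUNK);  // hypothetical kernel
}
cudaDeviceSynchronize();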
I'm new to CUDA/C and new to stack overflow. This is my first question.
I'm trying to allocate memory dynamically in a kernel function, but the results are unexpected.
I read that using malloc() in a kernel can lower performance a lot, but I need it anyway, so I first tried with a simple int ** array just to test the possibility; later I'll actually need to allocate more complex structs.
In my main I used cudaMalloc() to allocate the space for the array of int *, and then in the kernel I used malloc() in every thread to allocate the array for each index of the outer array. I then used a second kernel with a single thread to check the result, but it doesn't always work.
Here's the main code:
#define N_CELLE 1024*2
#define L_CELLE 512

extern "C" {
int main(int argc, char **argv) {
    int *result = (int *)malloc(sizeof(int));
    int *d_result;
    int size_numbers = N_CELLE * sizeof(int *);
    int **d_numbers;

    cudaMalloc((void **)&d_numbers, size_numbers);
    cudaMalloc((void **)&d_result, sizeof(int));

    kernel_one<<<2, 1024>>>(d_numbers);
    cudaDeviceSynchronize();
    kernel_two<<<1, 1>>>(d_numbers, d_result);

    cudaMemcpy(result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", *result);

    cudaFree(d_numbers);
    cudaFree(d_result);
    free(result);
}
}
I used extern "C" because I couldn't compile while importing my header, which is not used in this example code. I left it in since I don't know whether it is relevant or not.
This is kernel_one code:
__global__ void kernel_one(int **d_numbers) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    d_numbers[i] = (int *)malloc(L_CELLE*sizeof(int));
    for(int j=0; j<L_CELLE;j++)
        d_numbers[i][j] = 1;
}
And this is kernel_two code:
__global__ void kernel_two(int **d_numbers, int *d_result) {
    int temp = 0;
    for(int i=0; i<N_CELLE; i++) {
        for(int j=0; j<L_CELLE;j++)
            temp += d_numbers[i][j];
    }
    *d_result = temp;
}
Everything works fine (i.e., the count is correct) as long as I allocate no more than 1024*2*512 ints in total in device memory. For example, if I #define N_CELLE 1024*4, the program starts giving "random" results, such as negative numbers.
Any idea of what the problem could be?
Thanks anyone!
In-kernel memory allocation draws memory from a statically allocated runtime heap. At larger sizes, you are exceeding the size of that heap and then your two kernels are attempting to read and write from uninitialised memory. This produces a runtime error on the device and renders the results invalid. You would already know this if you either added correct API error checking on the host side, or ran your code with the cuda-memcheck utility.
The solution is to ensure that the heap size is set to something appropriate before trying to run a kernel. Adding something like this:
size_t heapsize = sizeof(int) * size_t(N_CELLE) * size_t(2*L_CELLE);
cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize);
to your host code before any other API calls should solve the problem.
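If you want to confirm the limit actually took effect, cudaDeviceGetLimit can read it back; a small sketch (assuming stdio.h is available):
size_t heapsize = sizeof(int) * size_t(N_CELLE) * size_t(2*L_CELLE);
cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize);

// Optional sanity check: the runtime may round the value up.
size_t actual = 0;
cudaDeviceGetLimit(&actual, cudaLimitMallocHeapSize);
printf("device malloc heap size: %zu bytes\n", actual);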
I don't know anything about CUDA but these are severe bugs:
You cannot convert from int** to void**. They are not compatible types. Casting doesn't solve the problem, but hides it.
&d_numbers gives the address of a pointer to pointer which is wrong. It is of type int***.
Both of the above bugs result in undefined behavior. If your program somehow seems to work under some conditions, that's just pure (bad) luck.
I'm new to CUDA, and I can't understand loop unrolling. I've written a piece of code to understand the technique:
__global__ void kernel(float *b, int size)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    #pragma unroll
    for(int i=0;i<size;i++)
        b[i]=i;
}
Above is my kernel function. In main I call it like below
int main()
{
    float * a; //host array
    float * b; //device array
    int size=100;

    a=(float*)malloc(size*sizeof(float));
    cudaMalloc((void**)&b, size*sizeof(float));
    cudaMemcpy(b, a, size*sizeof(float), cudaMemcpyHostToDevice);

    kernel<<<1,size>>>(b,size); //size=100

    cudaMemcpy(a, b, size*sizeof(float), cudaMemcpyDeviceToHost);
    for(int i=0;i<size;i++)
        cout<<a[i]<<"\t";

    _getch();
    return 0;
}
Does it mean I have size*size = 10000 threads running to execute the program? Are 100 of them created when the loop is unrolled?
No. It means you have called a CUDA kernel with one block and that one block has 100 active threads. You're passing size as the second function parameter to your kernel. In your kernel each of those 100 threads executes the for loop 100 times.
#pragma unroll is a compiler optimization that can, for example, replace a piece of code like
for ( int i = 0; i < 5; i++ )
b[i] = i;
with
b[0] = 0;
b[1] = 1;
b[2] = 2;
b[3] = 3;
b[4] = 4;
by putting the #pragma unroll directive right before the loop. The good thing about the unrolled version is that it puts less processing load on the processor. In the for-loop version, the processing, in addition to assigning each i to b[i], involves initializing i, evaluating i<5 six times, and incrementing i five times. In the unrolled case, it only involves filling in the contents of b (perhaps plus int i=5; if i is used later). Another benefit of loop unrolling is enhanced instruction-level parallelism (ILP): in the unrolled version, there are potentially more independent operations that the processor can push into the pipeline without having to worry about the loop condition on every iteration.
Posts like this explain that runtime loop unrolling cannot happen in CUDA. In your case the CUDA compiler doesn't have any clue that size is going to be 100, so compile-time loop unrolling will not occur, and if you force unrolling you may end up hurting performance.
If you are sure that the size is 100 for all executions, you can unroll your loop like below:
#pragma unroll
for(int i=0;i<SIZE;i++) //or simply for(int i=0;i<100;i++)
    b[i]=i;
in which SIZE is known at compile time via #define SIZE 100.
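Another option, if you prefer not to hard-code a macro, is to make the trip count a template parameter so it is still known at compile time; a rough sketch (not from your original code):
template <int SIZE>
__global__ void kernel(float *b)
{
    #pragma unroll
    for (int i = 0; i < SIZE; i++)   // SIZE is a compile-time constant, so full unrolling is possible
        b[i] = i;
}

// launched, for example, as:
// kernel<100><<<1, 100>>>(b);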
I also suggest adding proper CUDA error checking to your code (explained here).
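A common pattern for that error checking looks something like this (a sketch, assuming stdio.h and stdlib.h are included):
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void**)&b, size * sizeof(float)));
// kernel<<<1, size>>>(b, size);
// CUDA_CHECK(cudaGetLastError());        // catches launch configuration errors
// CUDA_CHECK(cudaDeviceSynchronize());   // catches asynchronous execution errors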
I need some help with CUDA global memory. In my project I must declare a global array in order to avoid sending this array with every kernel call.
EDIT:
My application calls the kernel more than 1,000 times, and on every call I'm sending it an array larger than [1000 x 1000]. I think this is taking extra time and that is why my app runs slowly. So I need to declare a global array for the GPU. My questions are:
1 How do I declare a global array?
2 How do I initialize the global array from the CPU before the kernel call?
Thanks in advance
Your edited question is confusing because you say you are sending your kernel an array of size 1000 x 1000 but you want to know how to do this using a global array. The only way I know of to send this much data to a kernel is to use a global array, so you are probably already doing this with an array in global memory.
Nevertheless, there are at least two methods to create and initialize an array in global memory:
1. Statically, using __device__ and cudaMemcpyToSymbol, for example:
#define SIZE 100
__device__ int A[SIZE];
...
int main(){
    int myA[SIZE];
    for (int i=0; i< SIZE; i++) myA[i] = 5;
    cudaMemcpyToSymbol(A, myA, SIZE*sizeof(int));
    ...
    (kernel calls, etc.)
}
(device variable reference, cudaMemcpyToSymbol reference)
2. Dynamically, using cudaMalloc and cudaMemcpy:
#define SIZE 100
...
int main(){
    int myA[SIZE];
    int *A;
    for (int i=0; i< SIZE; i++) myA[i] = 5;
    cudaMalloc((void **)&A, SIZE*sizeof(int));
    cudaMemcpy(A, myA, SIZE*sizeof(int), cudaMemcpyHostToDevice);
    ...
    (kernel calls, etc.)
}
(cudaMalloc reference, cudaMemcpy reference)
For clarity I'm omitting error checking, which you should do on all CUDA calls and kernel calls.
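To illustrate how a kernel would then read these arrays, here is a hypothetical pair of kernels (useStatic, useDynamic, and out are made up for illustration, not part of the answer's code):
__global__ void useStatic(int *out) {
    // The statically declared __device__ array A is visible by name; no parameter needed.
    if (threadIdx.x < SIZE) out[threadIdx.x] = A[threadIdx.x] * 2;
}

__global__ void useDynamic(const int *d_A, int *out) {
    // The dynamically allocated array must be passed in as a kernel argument.
    if (threadIdx.x < SIZE) out[threadIdx.x] = d_A[threadIdx.x] * 2;
}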
If I understand this somewhat unclear question correctly, you want to use a global array and send it to the device on every kernel call. That is bad practice and leads to high latency, because on every kernel call you need to transfer your data to the device. In my experience, it can even result in a negative speed-up.
An optimal way is to use what I call the flip-flop technique. The way you do it is:
Declare two arrays on the device: d_arr1 and d_arr2.
Copy the data host -> device into one of the arrays.
Pass pointers to d_arr1 and d_arr2 as kernel parameters.
Process the data in the kernel.
In subsequent kernel calls, swap the pointers you pass as parameters.
This way you avoid transferring the data on every kernel call. You transfer only at the beginning and at the end of your host loop.
int a;
for(a=0;a<1000;a++)
{
    if (a % 2 == 0)
        //call to the kernel(pointer_a, pointer_b)
    else
        //call to the kernel(pointer_b, pointer_a)
}
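Spelled out a bit further, under the assumption of a hypothetical step_kernel that reads its first argument and writes its second, plus made-up grid, block, N, and h_data names, the pattern might look like:
float *d_arr1, *d_arr2;
cudaMalloc((void**)&d_arr1, N * sizeof(float));
cudaMalloc((void**)&d_arr2, N * sizeof(float));
cudaMemcpy(d_arr1, h_data, N * sizeof(float), cudaMemcpyHostToDevice);   // copy in once

float *d_in = d_arr1, *d_out = d_arr2;
for (int a = 0; a < 1000; a++) {
    step_kernel<<<grid, block>>>(d_in, d_out, N);    // read d_in, write d_out
    float *tmp = d_in; d_in = d_out; d_out = tmp;    // flip the pointers; no host transfer
}
cudaMemcpy(h_data, d_in, N * sizeof(float), cudaMemcpyDeviceToHost);     // copy out once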
I have the following code:
__global__ void interpolation(const double2* __restrict__ data, double2* __restrict__ result, const double* __restrict__ x, const double* __restrict__ y, const int N1, const int N2, int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    [...]

    double phi_cap1, phi_cap2;

    if(i<M) {
        for(int m=0; m<(2*K+1); m++) {
            [calculate phi_cap1];
            for(int n=0; n<(2*K+1); n++) {
                [calculate phi_cap2];
                [calculate phi_cap=phi_cap1*phi_cap2];
                [use phi_cap];
            }
        }
    }
}
I would like to use dynamic parallelism on a Kepler K20 card to dispatch the processing of phi_cap1 and phi_cap2 in parallel to a bunch of threads to reduce the computation time. K=6 in my code, so I'm launching a single block of 13x13 threads.
Following the CUDA Dynamic Parallelism Programming Guide, I'm allocating a matrix phi_cap of 169 elements (formed by the products of phi_cap1 and phi_cap2), needed to exchange the data with the child kernel, in global memory. Indeed, quoting the guide,
As a general rule, all storage passed to a child kernel should be allocated explicitly from the global-memory heap.
I then ended up with the following code:
__global__ void interpolation(const double2* __restrict__ data, double2* __restrict__ result, const double* __restrict__ x, const double* __restrict__ y, const int N1, const int N2, int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    [...]

    dim3 dimBlock(2*K+1,2*K+1); dim3 dimGrid(1,1);

    if(i<M) {
        double* phi_cap; cudaMalloc((void**)&phi_cap,sizeof(double)*(2*K+1)*(2*K+1));

        child_kernel<<<dimGrid,dimBlock>>>(cc_diff1,cc_diff2,phi_cap);

        for(int m=0; m<(2*K+1); m++) {
            for(int n=0; n<(2*K+1); n++) {
                [use phi_cap];
            }
        }
    }
}
The problem is that the first routine takes 5 ms to run, while the second routine, even with the child_kernel launch commented out, takes 23 ms, with practically all of the time spent in the cudaMalloc API call.
Since with dynamic parallelism one often needs to allocate memory to exchange data with child kernels, and the only option seems to be global memory, which takes this much time, it seems to me that data exchange is a serious bottleneck for the usefulness of dynamic parallelism, unless there is a way to circumvent the global memory allocation issue.
The question then is: is there any workaround for this issue, namely the time spent allocating global memory from within a kernel? Thanks.
SOLUTION PROPOSED IN THE COMMENTS
Allocate the required global memory from outside the parent kernel. I have verified that this is much faster.
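A rough sketch of that approach, with hypothetical gridSize/blockSize names and the interpolation kernel assumed to be extended with one extra pointer parameter, could be:
// Host side: one allocation covering all M parent threads, done once before the launch.
double *phi_cap_global;
cudaMalloc((void**)&phi_cap_global, sizeof(double) * M * (2*K+1) * (2*K+1));
interpolation<<<gridSize, blockSize>>>(data, result, x, y, N1, N2, M, phi_cap_global);

// Inside the parent kernel, each thread then uses its own slice instead of calling cudaMalloc:
// double* phi_cap = phi_cap_global + i * (2*K+1) * (2*K+1);
// child_kernel<<<dimGrid,dimBlock>>>(cc_diff1, cc_diff2, phi_cap);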
You are calling cudaMalloc from each thread where i < M, which means that you are making M cudaMalloc calls.
The bigger M is, the worse it is going to get.
Instead, you could make a single cudaMalloc call from the first thread of the block, allocating M times the size you used before (actually, in your case you should allocate more, so that each chunk is properly aligned). After that, sync the threads, and you can start your child kernels with a correctly computed phi_cap address for each one.
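One possible sketch of that per-block idea, here using in-kernel malloc and allocating one slice per thread of the block (the names are illustrative, not from the question):
__shared__ double *block_buf;
if (threadIdx.x == 0)
    block_buf = (double*)malloc(blockDim.x * (2*K+1) * (2*K+1) * sizeof(double));
__syncthreads();

double *phi_cap = block_buf + threadIdx.x * (2*K+1) * (2*K+1);   // this thread's slice
// child_kernel<<<dimGrid,dimBlock>>>(cc_diff1, cc_diff2, phi_cap);
// ... use phi_cap ...
__syncthreads();
if (threadIdx.x == 0) free(block_buf);                           // free once per block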
Alternatively (if your specific situation allows you to allocate enough memory that you can hold on to between the kernel calls) you could allocate the memory once outside of the kernel and reuse it. That would be a lot quicker. If M varies between kernel calls you could allocate as much as you would need for the biggest M.
Is it possible for a CUDA kernel to synchronize writes to device-mapped memory without any host-side invocation (e.g., of cudaDeviceSynchronize)? When I run the following program, it doesn't seem that the kernel waits for the writes to device-mapped memory to complete before terminating because examining the page-locked host memory immediately after the kernel launch does not show any modification of the memory (unless a delay is inserted or the call to cudaDeviceSynchronize is uncommented):
#include <stdio.h>
#include <cuda.h>

__global__ void func(int *a, int N) {
    int idx = threadIdx.x;

    if (idx < N) {
        a[idx] *= -1;
        __threadfence_system();
    }
}

int main(void) {
    int *a, *a_gpu;
    const int N = 8;
    size_t size = N*sizeof(int);

    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **) &a, size, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **) &a_gpu, (void *) a, 0);

    for (int i = 0; i < N; i++) {
        a[i] = i;
    }
    for (int i = 0; i < N; i++) {
        printf("%i ", a[i]);
    }
    printf("\n");

    func<<<1, N>>>(a_gpu, N);
    // cudaDeviceSynchronize();

    for (int i = 0; i < N; i++) {
        printf("%i ", a[i]);
    }
    printf("\n");

    cudaFreeHost(a);
}
I'm compiling the above for sm_20 with CUDA 4.2.9 on Linux and running it on a Fermi GPU (S2050).
A kernel launch will immediately return to the host code before any kernel activity has occurred. Kernel execution is in this way asynchronous to host execution and does not block host execution. So it's no surprise that you have to wait a bit or else use a barrier (like cudaDeviceSynchronize()) to see the results of the kernel.
As described here:
In order to facilitate concurrent execution between host and device, some function calls are asynchronous: Control is returned to the host thread before the device has completed the requested task. These are:
Kernel launches;
Memory copies between two addresses to the same device memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.
This is all intentional of course, so that you can use the GPU and CPU simultaneously. If you don't want this behavior, a simple solution, as you've already discovered, is to insert a barrier. If your kernel is producing data which you will immediately copy back to the host, you don't need a separate barrier: the cudaMemcpy call after the kernel will wait until the kernel is completed before it begins its copy operation.
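In terms of your example, either of these gives that guarantee (a sketch; d_a is a hypothetical device buffer, not in your code):
func<<<1, N>>>(a_gpu, N);
cudaDeviceSynchronize();            // barrier: host waits until the kernel has finished
for (int i = 0; i < N; i++)
    printf("%i ", a[i]);            // now guaranteed to see the updated values

// Or, if the results lived in ordinary device memory, a blocking copy would serve as the barrier:
// cudaMemcpy(a, d_a, size, cudaMemcpyDeviceToHost);   // waits for the kernel before copying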
I guess, to answer your question, you want kernel launches to be synchronous without even having to use a barrier (why do you want this? Is adding the cudaDeviceSynchronize() call a problem?). It is possible to do this:
"Programmers can globally disable asynchronous kernel launches for all
CUDA applications running on a system by setting the
CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is
provided for debugging purposes only and should never be used as a way
to make production software run reliably. "
If you want this synchronous behavior, it's better just to use the barriers (or depend on another subsequent CUDA call, like cudaMemcpy). If you use the above method and depend on it, your code will break as soon as somebody else tries to run it without the environment variable set. So it's really not a good idea.