Calculating PI with Fortran & CUDA

I am trying to make a simple program in PGI's Fortran compiler. This simple program will use the graphics card to calculate pi using the "dart board" algorithm. After battling with this program for quite some time now, I have finally got it to behave for the most part. However, I am currently stuck on passing back the results properly. I must say, this is a rather tricky program to debug since I can no longer shove any print statements into the subroutine. This program currently returns all zeros. I am not really sure what is going on, but I have two ideas, neither of which I am sure how to fix:
The CUDA kernel is not running somehow?
I am not converting the values properly? pi_parts = pi_parts_d
Well, this is the status of my current program. All variables ending in _d stand for the CUDA-prepared device memory, while all the other variables (with the exception of the CUDA kernel) are typical Fortran CPU-prepared variables. There are some print statements I have commented out that I have already tried from CPU Fortran land; these were to check whether I really was generating the random numbers properly. As for the CUDA method, I have currently commented out the calculations and set z statically to 1, just to see something happen.
module calcPi
contains
  attributes(global) subroutine pi_darts(x, y, results, N)
    use cudafor
    implicit none
    integer :: id
    integer, value :: N
    real, dimension(N) :: x, y, results
    real :: z
    id = (blockIdx%x-1)*blockDim%x + threadIdx%x
    if (id .lt. N) then
      ! SQRT NOT NEEDED, SQRT(1) === 1
      ! Anything above and below 1 would stay the same even with the applied
      ! sqrt function. Therefore using the sqrt function wastes GPU time.
      z = 1.0
      !z = x(id)*x(id)+y(id)*y(id)
      !if (z .lt. 1.0) then
      !  z = 1.0
      !else
      !  z = 0.0
      !endif
      results(id) = z
    endif
  end subroutine pi_darts
end module calcPi
program final_project
  use calcPi
  use cudafor
  implicit none
  integer, parameter :: N = 400
  integer :: i
  real, dimension(N) :: x, y, pi_parts
  real, dimension(N), device :: x_d, y_d, pi_parts_d
  type(dim3) :: grid, tBlock
  ! Initialize the random number generator's seed
  call random_seed()
  ! Make sure we initialize the parts with 0
  pi_parts = 0
  ! Prepare the random numbers (these cannot be generated from inside the
  ! CUDA kernel)
  call random_number(x)
  call random_number(y)
  !write(*,*) x, y
  ! Convert the random numbers into graphics card memory land!
  x_d = x
  y_d = y
  pi_parts_d = pi_parts
  ! For the CUDA kernel
  tBlock = dim3(256,1,1)
  grid = dim3((N/tBlock%x)+1,1,1)
  ! Start the CUDA kernel
  call pi_darts<<<grid, tBlock>>>(x_d, y_d, pi_parts_d, N)
  ! Transfer the results back into CPU memory
  pi_parts = pi_parts_d
  write(*,*) pi_parts
  write(*,*) 'PI: ', 4.0*sum(pi_parts)/N
end program final_project
EDIT TO CODE:
Changed various lines to reflect the fixes mentioned by Robert Crovella. Current status: cuda-memcheck reveals an error: Program hit error 8 on CUDA API call to cudaLaunch on my machine.
If there is any method I can use to test this program, please let me know. Throwing darts and seeing where they land is my current style of debugging with CUDA. Not the most ideal, but it will have to do until I find another way.
May the Fortran Gods have mercy on my soul at this dark hour.

When I compile and run your program I get a segfault. This is due to the last parameter you are passing to the kernel (N_d):
call pi_darts<<<grid, tblock>>>(x_d, y_d, pi_parts_d, N_d)
Since N is a scalar quantity, the kernel is expecting to use it directly, rather than as a pointer. So when you pass a pointer to device data (N_d), the process of setting up the kernel generates a seg fault (in host code!) as it attempts to access the value N, which should be passed directly as:
call pi_darts<<<grid, tblock>>>(x_d, y_d, pi_parts_d, N)
When I make that change to the code you have posted, I then get actual printed output (instead of a seg fault), which is an array of ones and zeroes (256 ones, followed by 144 zeroes, for a total of N=400 values), followed by the calculated PI value (which happens to be 2.56 in this case (4*256/400), since you have made the kernel basically a dummy kernel).
This line of code is also probably not doing what you want:
grid = dim3(N/tBlock%x,1,1)
With N = 400 and tBlock%x = 256 (from previous code lines), the result of the calculation is 1 (i.e. grid ends up at (1,1,1), which amounts to one threadblock). But you really want to launch 2 threadblocks, so as to cover the entire range of your data set (N = 400 elements). There are a number of ways to fix this, but for simplicity let's just always add 1 to the calculation:
grid = dim3((N/tBlock%x)+1,1,1)
Under these circumstances, when we launch grids that are larger (in terms of total threads) than our data set size (512 threads but only 400 data elements in this example) it's customary to put a thread check near the beginning of our kernel (in this case, after the initialization of id), to prevent out-of-bounds accesses, like so:
if (id .lt. N) then
(and a corresponding endif at the very end of the kernel code). This way, only the threads that correspond to actual valid data are allowed to do any work.
With the above changes, your code should be essentially functional, and you should be able to revert your kernel code to the proper statements and start to get an estimate of PI.
Note that you can check the CUDA API calls for error return codes, and you can also run your code with cuda-memcheck to get an idea of whether the kernel is making out-of-bounds accesses. Neither of these would have helped with this particular seg fault, however.
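For reference, here is a minimal sketch in CUDA C of the kind of error checking meant above (my_kernel and its arguments are placeholders, and this is a host-code fragment; CUDA Fortran exposes equivalent runtime API calls such as cudaGetLastError and cudaGetErrorString through the cudafor module):

// launch the kernel, then ask the runtime whether anything went wrong
my_kernel<<<grid, block>>>(args);
cudaError_t err = cudaGetLastError();          // reports launch/configuration errors
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();                 // reports errors raised while the kernel ran
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));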

Related

non-linear optimization on the GPU (CUDA) without data transfer latency

I am trying to perform a non-linear optimization problem entirely on the GPU. Computation of the objective function and data transfer from the GPU to CPU are the bottlenecks. To solve this, I want to
heavily parallelize computation of the objective and
perform the entire optimization on the GPU.
More specifically, the problem is as follows in pseudo-code:
x = x0 // initial guess of the vector of unknowns, typically of size ~10,000
for iteration = 1 : max_iter
D = compute_search_direction(x)
alpha = compute_step_along_direction(x)
x = x + D * alpha // update
end for loop
The functions compute_search_direction(x) and compute_step_along_direction(x) both call the objective function f0(x) dozens of times per iteration. The objective function is a complicated CUDA kernel; basically it is a forward Bloch simulation (= the set of equations that describes the dynamics of nuclear spins in a magnetic field). The output of f0(x) is F (the value of the objective function, a scalar) and DF (the Jacobian, or vector of first derivatives, with the same size as x, i.e. ~10,000). On the GPU, f0(x) is really fast, but the transfer of x from the CPU to the GPU and then the transfer back of F and DF from the GPU to the CPU takes a while (~1 second total). Because the function is called dozens of times per iteration, this leads to a pretty slow overall optimization.
Ideally, I would want to have the entire pseudo code above on the GPU. The only solution I can think of now is recursive kernels. The pseudo code above would be the "outer kernel", launched with a number of threads = 1 and a number of blocks = 1 (i.e., this kernel is not really parallel...). This kernel would then call the objective function (i.e., the "inner kernel", this one massively parallel) every time it needs to evaluate the objective function and the vector of first derivatives. Since kernel launches are asynchronous, I can force the GPU to wait until the f0 inner kernel is fully evaluated to move to the next instruction of the outer kernel (using a synchronization point).
In a sense, this is really the same as regular CUDA programming where the CPU controls kernel launches for evaluation of the objective function f0, except the CPU is replaced by an outer kernel that is not parallelized (1 thread, 1 block). However, since everything is on the GPU, there is no data transfer latency anymore.
I am now trying the idea on a simple example to test feasibility. However, this seems quite cumbersome... My questions are:
Does this make any sense to anyone else?
Is there a more direct way to achieve the same result without the added complexity of nested kernels?
It seems you are mixing up "reducing memory transfers between GPU and CPU" and "having the entire code run on the device (i.e. on the GPU)".
In order to reduce memory transfers, you do not need to have the entire code run on GPU.
You can copy your data to the GPU once, and then switch back and forth between GPU code and CPU code. As long as you don't try to access any GPU memory from your CPU code (and vice-versa), you should be fine.
Here's pseudo-code for a correct approach to what you want to do.
// CPU code
cudaMalloc(&x, ...)      // allocate memory for x on the GPU
cudaMemcpy(x, x0, size, cudaMemcpyHostToDevice);   // copy x0 to the freshly allocated array
cudaMalloc(&D, ...)      // allocate D and alpha before the loop
cudaMalloc(&alpha, ...)
for iteration = 1 : max_iter
    compute_search_direction<<<...>>>(x, D)        // kernel that does the computation and stores the result in D
    compute_step_along_direction<<<...>>>(x, alpha)
    combine_result<<<...>>>(x, D, alpha)           // x + D * alpha
end for loop
// Eventually copy x back to the CPU, if need be
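For illustration, the combine_result step in the pseudo-code above could look roughly like the following kernel (a sketch; n is the length of x and D, and alpha is assumed to be a single double living in device memory):

__global__ void combine_result(double *x, const double *D, const double *alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] += D[i] * (*alpha);   // x = x + D * alpha, computed entirely on the GPU
}

// called from the host loop, e.g.:
// combine_result<<<(n + 255) / 256, 256>>>(x, D, alpha, n);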
Hope it helps!

How to perform basic operations (+ - * /) on GPU and store the result on it

I have the following line of code. gamma is a CPU variable that I will afterwards need to copy to the GPU. gamma_x and delta are also stored on the CPU. Is there any way that I can execute the following line and store its result directly on the GPU? So basically, host gamma, gamma_x and delta on the GPU and get the output of the following line on the GPU. It would speed up my code a lot for the lines after.
I tried with magma_dcopy, but so far I couldn't find a way to make it work, because the output of magma_ddot is a CPU double.
gamma = -(gamma_x[i+1] + magma_ddot(i,&d_gamma_x[1],1,&(d_l2)[1],1, queue))/delta;
The very short answer is no, you can't do this, or at least not if you use magma_ddot.
However, magma_ddot is itself only a very thin wrapper around cublasDdot, and the cublas function fully supports having the result of the operation stored in GPU memory rather than returned to the host.
In theory you could do something like this:
// before the apparent loop you have not shown us:
double* dotresult;
cudaMalloc(&dotresult, sizeof(double));
for (int i=....) {
// ...
// magma_ddot(i,&d_gamma_x[1],1,&(d_l2)[1],1, queue);
cublasSetPointerMode( queue->cublas_handle(), CUBLAS_POINTER_MODE_DEVICE);
cublasDdot(queue->cublas_handle(), i, &d_gamma_x[1], 1, &(d_l2)[1], 1, dotresult);
cudaDeviceSynchronize();
cublasSetPointerMode( queue->cublas_handle(), CUBLAS_POINTER_MODE_HOST);
// Now dotresult holds the magma_ddot result in device memory
// ...
}
Note that this might make Magma blow up depending on how you are using it, because Magma uses CUBLAS internally, and how CUBLAS state and asynchronous operations are handled inside Magma is completely undocumented. Having said that, if you are careful, it should be OK.
To then execute your calculation, either write a very simple kernel and launch it with one thread, or perhaps use a simple thrust call with a lambda expression, depending on your preference. I leave that as an exercise to the reader.
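For example, that "very simple kernel launched with one thread" could look roughly like this (a sketch with hypothetical names: d_gamma, d_gamma_x and d_delta are assumed to be device copies of gamma, gamma_x and delta, and dotresult is the device pointer from the loop above):

__global__ void compute_gamma(double *d_gamma, const double *d_gamma_x,
                              const double *dotresult, const double *d_delta, int i)
{
    // gamma = -(gamma_x[i+1] + dot) / delta, kept entirely in device memory
    *d_gamma = -(d_gamma_x[i + 1] + *dotresult) / (*d_delta);
}

// launched with a single thread after the cublasDdot call:
// compute_gamma<<<1, 1>>>(d_gamma, d_gamma_x, dotresult, d_delta, i);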

CUDA out of memory message after using just ~2.2GB of memory on a GTX1080

I'm doing matrix multiplication on a GTX1080 GPU using JCuda, version 0.8.0RC with CUDA 8.0. I load two matrices A and B into the device in row-major vector form, and read the product matrix from the device. But I'm finding that I run out of device memory earlier than I would expect. For example, if matrix A is dimensioned 100000 * 5000 = 500 million entries = 2GB worth of float values, then:
cuMemAlloc(MatrixA, 100000 * 5000 * Sizeof.FLOAT);
works fine. But if I increase the number of rows to 110000 from 100000, I get the following error on this call (which is made before the memory allocations for matrices B and C, so those are not part of the problem):
Exception in thread "main" jcuda.CudaException: CUDA_ERROR_OUT_OF_MEMORY
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:344)
at jcuda.driver.JCudaDriver.cuMemAlloc(JCudaDriver.java:3714)
at JCudaMatrixMultiply.main(JCudaMatrixMultiply.java:84) (my code)
The issue is that allocating a matrix of this size on the device should take only about 2.2GB, and the GTX1080 has 8GB of memory, so I don't see why I'm running out of memory. Does anyone have any thoughts on this? It's true that I'm using JCuda 0.8.0RC with the release version of CUDA 8, but I tried downloading the RC version of CUDA 8 (8.0.27) to use with JCuda 0.8.0RC and had some problems getting it to work. If version compatibility is likely to be the issue, however, I can try again.
Matrices of 100000 * 5000 are pretty big, of course, and I won't need to work with larger matrices for a while on my neural network project, but I would like to be confident that I can use all 8GB of memory on this new card. Thanks for any help.
tl;dr:
When calling
cuMemAlloc(MatrixA, (long)110000 * 5000 * Sizeof.FLOAT);
// ^ cast to long here
or alternatively
cuMemAlloc(MatrixA, 110000L * 5000 * Sizeof.FLOAT);
// ^ use the "long" literal suffix here
it should work.
The last argument to cuMemAlloc is of type size_t. This is an implementation-specific unsigned integer type for "arbitrary" sizes. The closest possible primitive type in Java for this is long. And in general, every size_t in CUDA is mapped to long in JCuda. In this case, the Java long is passed as a jlong into the JNI layer, and this is simply cast to size_t for the actual native call.
(The lack of unsigned types in Java and the odd plethora of integer types in C can still cause problems. Sometimes the C types and the Java types just don't match. But as long as the allocation is not larger than 9 million terabytes (!), a long should be fine here...)
But the comment by havogt pointed in the right direction. What happens here is indeed an integer overflow: the computation of the actual value
110000 * 5000 * Sizeof.FLOAT = 2200000000
is by default done using the int type in Java, and this is where the overflow happens: 2200000000 is larger than Integer.MAX_VALUE. The result will be a negative value. When this is cast to the (unsigned) size_t value in the JNI layer, it will become a ridiculously large positive value, which clearly causes the error.
When doing the computation using long values, either by explicitly casting to long or by appending the L suffix to one of the literals, the value is passed to CUDA as the proper long value of 2200000000.
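For comparison, the same pitfall exists in CUDA C when the byte count is computed in 32-bit int arithmetic (an illustrative sketch; d_A is a hypothetical device pointer):

float *d_A;
int rows = 110000, cols = 5000;

// rows * cols * 4 = 2,200,000,000 exceeds INT_MAX, so this int expression overflows
// (typically wrapping to a negative value) before cudaMalloc ever sees it:
// cudaMalloc((void **)&d_A, rows * cols * 4);

// Forcing the arithmetic into a 64-bit type first passes the intended size:
cudaMalloc((void **)&d_A, (size_t)rows * cols * sizeof(float));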

CUDA memory operation order within a single thread

From the CUDA Programming Guide (v. 5.5):
The CUDA programming model assumes a device with a weakly-ordered memory model, that is:
The order in which a CUDA thread writes data to shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which the data is observed being written by another CUDA or host thread;
The order in which a CUDA thread reads data from shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which the read instructions appear in the program for instructions that are independent of each other.
However, do we have a guarantee that the (dependent) memory operations as seen from a single thread are actually consistent? If I do, say:
arr[x] = 1;
int z = arr[y];
where x happens to be equal to y, and no other thread is touching the memory, do I have a guarantee that z is 1? Or do I still need to put some volatile or a barrier between those two operations?
In response to Orpedo's answer.
If your compiler doesn't compile the functionality stated by your code into equal functionality in machine-code, the compiler is either broken or you haven't taken the optimizations into consideration...
My problem is: what optimizations (done either by the compiler or the hardware) are allowed?
It could happen, for example, that the store instruction is non-blocking and the load instruction that follows is somehow serviced by the memory controller faster than the already queued-up store.
I don't know CUDA hardware. Do I have a guarantee that the above will never happen?
The CUDA Programming Guide is simply stating that you cannot predict the order in which the threads are executed, but every single thread will still run as a sequential thread.
In the example you state, where x and y are the same and NO OTHER THREAD is touching the memory, you DO have a guarantee that z = 1.
Here the point is that if you have several threads doing operations on the same data (e.g. an array), you are NOT guaranteed that thread #9 executes before #10.
Take an example:
__device__ void sum_all(float *x, float *result, int N){
  x[threadIdx.x] = threadIdx.x;
  result[threadIdx.x] = 0;
  for(int i = 0; i < N; i++)
    result[threadIdx.x] += x[i];
}
Here we have some dumb function, which SHOULD fill a shared array (x) with the numbers 0 ... N-1 (one number per thread), and then sum up the numbers already put into the array and store the result in another array.
Given that your lowest-indexed thread is enumerated thread #0, you would expect that the first time this code runs, x should contain
x[] = {0, 0, 0 ... 0} and result[] = {0, 0, 0 ... 0}
next for thread #1
x[] = {0, 1, 0 ... 0} and result[] = {0, 1, 0 ... 0}
next for thread #2
x[] = {0, 1, 2 ... 0} and result[] = {0, 1, 3 ... 0}
and so forth.
But this is NOT guaranteed. You can't know if e.g. thread #3 runs first, hence changing the array x[] before thread #0 runs. You actually don't even know if the arrays are changed by some other thread while you are executing the code.
I am not sure if this is explicitly stated in the CUDA documentation (I wouldn't expect it to be), as this is a basic principle of computing. Basically, what you are asking is whether running your code on a GPU will change the functionality of your code.
The cores of a GPU are generally the same as those of a CPU, just with less control logic, a smaller instruction set and typically only single-precision support.
In a CUDA GPU there is one program counter for each warp (a group of 32 cores running synchronously). Like on a CPU, the program counter increases by one address element after each instruction, unless you have branches or jumps. This gives the sequential flow of the program, and this cannot be changed.
Branches and jumps can only be introduced by the software running on the core, and hence are determined by your compiler. Compiler optimizations can in fact change the functionality of your code, but only in the case where the code is implemented "wrong" with respect to the compiler.
So in short - Your code will always be executed in the order it is ordered in the memory, no matter if it is executed on a CPU or a GPU. If your compiler doesn't compile the functionality stated by your code into equal functionality in machine-code, the compiler is either broken or you haven't taken the optimizations into consideration...
Hope this was clear enough :)
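To make the single-thread guarantee concrete, here is the snippet from the question wrapped in a minimal CUDA C kernel (a sketch; arr, out, x and y are hypothetical parameters, and no other thread touches arr[x]):

__global__ void single_thread_order(int *arr, int *out, int x, int y)
{
    arr[x] = 1;        // store
    int z = arr[y];    // dependent load performed by the same thread
    out[0] = z;        // with x == y and no interference from other threads, z is 1
}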
As far as I understand, you're basically asking whether memory dependencies and alias-analysis information are respected by the CUDA compiler.
The answer to that question is, assuming that the CUDA compiler is free of bugs, yes, because as Robert noted the CUDA compiler uses LLVM under the hood, and two basic passes (which, at the moment, I really don't think could be excluded from the pipeline) are:
Memory dependence analysis
Alias Analysis
These two passes detect memory locations potentially pointing to the same address and use liveness analysis on variables (even outside the block scope) to avoid dangerous optimizations (e.g. you can't write to a live variable before its next read; the data may still be useful).
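As a small illustration of why these passes matter (a sketch, not taken from the compiler sources): when two pointers may alias, the compiler has to keep the store and the dependent load in order:

__global__ void may_alias(int *a, int *b, int *out)
{
    a[0] = 1;          // store through a
    int z = b[0];      // b may point to the same location as a, so this load
                       // cannot be hoisted above the store
    out[0] = z;
}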
I don't know the compiler internals, but assuming (as with any other reasonably trusted compiler) that it does its best to be bug-free, the analyses that take place in there should really not bother you at all, and assure you that at least in theory what you just presented as an example (i.e. the dependent load completing faster than the store) cannot happen.
What guarantees you that? Nothing but the fact that the company is giving you a compiler to use, and there are disclaimers in case it doesn't work for exceptional cases :)
Also, aside from the compiler topic, instruction execution also depends on the hardware specification; in this case, a SIMT hardware instruction-issuing unit.
cf. http://www.csl.cornell.edu/~cbatten/pdfs/kim-simt-vstruct-isca2013.pdf and all the referenced papers for more information.

Cuda measurement of loop

I launch a very simple kernel <<<1,512>>> on a CUDA Fermi GPU.
__global__ void kernel(){
  int x1,x2;
  x1=5;
  x2=1;
  for (int k=0;k<=1000000;k++)
  {
    x1+=x2;
  }
}
The kernel is very simple, it does 10^6 additions and does not transfer anything back to global memory. The result is correct, i.e. after the loop x1 (in all its 512 thread instances) contains 10^6 + 5
I am trying to measure the execution time of the kernel, using both Visual Studio Parallel Nsight and nvvp. Nsight measures 2.5 microseconds and nvvp measures 4 microseconds.
The issue is the following: I can greatly increase the size of the loop, e.g. to 10^8, and the time remains constant. The same happens if I decrease the loop size a lot. Why does this happen?
Please note that if I use shared memory or global memory inside the loop, the measurements reflect the work being performed (i.e. there is proportionality).
As noted, CUDA compiler optimisation is very aggressive at removing dead code. Because x2 doesn't participate in a value which is written to memory, it and the loop can be removed. The compiler will also pre-calculate any results which can be deduced at compile time, so if all the constants in the loop are known to the compiler, it can compute the final result and replace it with a constant.
To get around both of these problems, rewrite your code like this:
__global__
void kernel(int *out, int x0, bool flag)
{
  int x1 = x0, x2 = 1;
  for (int k=0; k<=1000000; k++) {
    x1+=x2;
  }
  if (flag) out[threadIdx.x + blockIdx.x*blockDim.x] = x1;
}
and then run it like this:
kernel<<<1,512>>>((int *)0, 5, false);
By passing the initial value of x1 as an argument to the kernel, you ensure that the loop result isn't available to the compiler. The flag makes the memory store conditional, and the memory store makes the whole calculation unsafe to remove. As long as the flag is set to false at runtime, no store is performed, so it doesn't affect the timing of the loop.
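Independently of the dead-code issue, one simple way to time the kernel from the host is with CUDA events (a sketch of a host-code fragment, reusing the kernel signature from above; printf needs stdio.h):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
kernel<<<1,512>>>((int *)0, 5, false);
cudaEventRecord(stop);

cudaEventSynchronize(stop);               // wait until the kernel has finished
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);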
Because the compiler eliminates the dead paths. Your code doesn't actually do anything. Look at the assembly.
If you are actually seeing the value, then the compiler may have just optimized out the loop, as it can compute the value at compile time.
When you write the register contents out to shared memory, the compiler cannot guarantee that the result will not be used, and hence the value will actually be computed. In other words, the value you compute must eventually be used somewhere or written to memory, otherwise its computation will be dropped.