Multi-GPU for peer to peer - cuda

I am having difficulty setting up two GPUs for peer-to-peer communication.
I am using CUDA 4.0 and programming in Fortran with the PGI compiler.
I wrote a program which confirms that I have 4 GPUs available on my node.
I decided to use two of them, but I get the following error:
0: DEALLOCATE: invalid device pointer.
subroutine directTransfer()
use cudafor
implicit none
integer, parameter :: N = 4*1024*1024
real, pinned, allocatable :: a(:), b(:)
real, device, allocatable :: a_d(:), b_d(:)
!these hold free and total memory before and after
!allocation, used to verify allocation happening on proper devices
integer (int_ptr_kind()),allocatable ::
& freeBefore(:), totalBefore(:),
& freeAfter(:), totalAfter(:)
integer :: istat, nDevices, i, accessPeer, timingDev
type(cudaDeviceProp)::prop
type(cudaEvent)::startEvent,stopEvent
real :: time
!allocate host arrays
allocate(a(N), b(N))
allocate(freeBefore(0:nDevices -1),
& totalBefore(0:nDevices -1))
allocate(freeAfter(0:nDevices -1),
& totalAfter(0:nDevices -1))
write(*,*) 'Start!'
!get device info (including total and free memory)
!before allocation
istat = cudaGetDeviceCount(nDevices)
if(nDevices < 2) then
write(*,*) 'Need at least two CUDA capable devices'
stop
end if
write(*,"('Number of CUDA-capable devices: ',
& i0, /)"),nDevices
do i = 0, nDevices - 1
istat = cudaGetDeviceProperties(prop, i)
istat = cudaSetDevice(i)
istat = cudaMemGetInfo(freeBefore(i), totalBefore(i))
end do
!!!Here is the trouble zone!!!!
istat = cudaSetDevice(0)
allocate(a_d(N))
istat = cudaSetDevice(1)
allocate(b_d(N))
deallocate(freeBefore, totalBefore,freeAfter,totalAfter)
deallocate(a,b,a_d,b_d)
end subroutine directTransfer
With the following I have no error:
istat = cudaSetDevice(0)
allocate(a_d(N))
!istat = cudaSetDevice(1)
!allocate(b_d(N))
With this, also no error:
!istat = cudaSetDevice(0)
!allocate(a_d(N))
istat = cudaSetDevice(1)
allocate(b_d(N))
But this returns the error:
istat = cudaSetDevice(0)
allocate(a_d(N))
istat = cudaSetDevice(1)
allocate(b_d(N))
So it seems I cannot use two GPUs in my program.
Could you help me understand why allocating on two GPUs fails, and give me a hint on how to solve it?

Thank you, JackOLantern!
That did the trick.
I changed the code as follows and it works perfectly:
!clean up
deallocate(freeBefore, totalBefore,freeAfter,totalAfter)
istat = cudaSetDevice(0)
deallocate(a_d)
istat = cudaSetDevice(1)
deallocate(b_d)
deallocate(a,b)
This solved my problem: each device array is now deallocated while the device it was allocated on is the current device. I hope it helps others.

Related

How to use acc_set_cuda_stream(streamId, stream)?

I created a CUDA stream in this way:
integer(kind=cuda_stream_kind) :: stream1
istat = cudaStreamCreate(stream1)
to use it for the plan of a cufft:
err_dir = err_dir + cufftPlan2D(plan_dir1,NY,NY,CUFFT_D2Z)
err_dir = err_dir + cufftSetStream(plan_dir1,stream1)
In the routine that executes the cufft, I pass plan_dir1 and I have
subroutine new_fft_dir(z,plan)
!$acc host_data use_device(z)
ierr = ierr + cufftExecD2Z(plan,z,z)
!$acc end host_data
!$acc parallel loop collapse(2) present(z)
do i=1,NXP2
do j=1,NY
z(i,j) = z(i,j)/NY**2
enddo
enddo
!$acc end parallel loop
I would like to set an OpenACC stream equal to the CUDA stream stream1, but using:
integer(kind=cuda_stream_kind) :: stream1
istat = cudaStreamCreate(stream1)
integer :: stream
istat = cudaStreamCreate(stream1)
acc_set_cuda_stream(stream,stream1)
I get: NVFORTRAN-S-0034-Syntax error at or near end of line (main.f90: 48)
My goal is to add the async clause to
!$acc parallel loop collapse(2) present(z) async(stream)
do i=1,NXP2
do j=1,NY
z(i,j) = z(i,j)/NY**2
enddo
enddo
!$acc end parallel loop
to have this loop and the fft on the same CUDA stream.
Could the problem be that I use integer(kind=cuda_stream_kind) instead of cudaStream_t stream?
"acc_set_cuda_stream" is a subroutine so you do need to add "call " before it. Also, variables need to be declared before executable code, hence "integer :: stream" needs to be moved up a line.
use cudafor
use openacc
integer(kind=cuda_stream_kind) :: stream1
integer :: stream
stream = 1 ! OpenACC async queue id to map onto stream1 (1 is an arbitrary choice)
istat = cudaStreamCreate(stream1)
call acc_set_cuda_stream(stream,stream1)

Adding a print statement in a Fortran 90 function does not work [duplicate]

I'm trying to learn Fortran (unfortunately a necessity for my research group) - one of the tasks I set myself was to package one of the necessary functions (Associated Legendre polynomials) from the Numerical Recipes book into a fortran 03 compliant module. The original program (f77) has some error handling in the form of the following:
if(m.lt.0.or.m.gt.1.or.abs(x).gt.1)pause 'bad arguments in plgndr'
Pause seems to have been deprecated since f77, as using this line gives me a compile error, so I tried the following:
module sha_helper
implicit none
public :: plgndr, factorial!, ylm
contains
! numerical recipes Associated Legendre Polynomials rewritten for f03
function plgndr(l,m,x) result(res_plgndr)
integer, intent(in) :: l, m
real, intent(in) :: x
real :: res_plgndr, fact, pll, pmm, pmmp1, somx2
integer :: i,ll
if (m.lt.0.or.m.gt.l.or.abs(x).gt.1) then
write (*, *) "bad arguments to plgndr, aborting", m, x
res_plgndr=-10e6 !return a ridiculous value
else
pmm = 1.
if (m.gt.0) then
somx2 = sqrt((1.-x)*(1.+x))
fact = 1.
do i = 1, m
pmm = -pmm*fact*somx2
fact = fact+2
end do
end if
if (l.eq.m) then
res_plgndr = pmm
else
pmmp1 = x*(2*m+1)*pmm
if(l.eq.m+1) then
res_plgndr = pmmp1
else
do ll = m+2, l
pll = (x*(2*ll-1)*pmmp1-(ll+m-1)*pmm)/(ll-m)
pmm = pmmp1
pmmp1 = pll
end do
res_plgndr = pll
end if
end if
end if
end function plgndr
recursive function factorial(n) result(factorial_result)
integer, intent(in) :: n
integer, parameter :: RegInt_K = selected_int_kind(20) !should be enough for the factorials I am using
integer (kind = RegInt_K) :: factorial_result
if (n <= 0) then
factorial_result = 1
else
factorial_result = n * factorial(n-1)
end if
end function factorial
! function ylm(l,m,theta,phi) result(res_ylm)
! integer, intent(in) :: l, m
! real, intent(in) :: theta, phi
! real :: res_ylm, front_block
! real, parameter :: pi = 3.1415926536
! front_block = sqrt((2*l+1)*factorial(l-abs(m))/(4*pi*))
! end function ylm
end module sha_helper
The main code after the else works, but if I execute my main program and call the function with bad values, the program freezes before executing the print statement. I know that the print statement is the problem, as commenting it out allows the function to execute normally, returning -10e6 as the value. Ideally, I would like the program to crash after giving a user readable error message, as giving bad values to the plgndr function is a fatal error for the program. The function plgndr is being used by the program sha_lmc. Currently all this does is read some arrays and then print a value of plgndr for testing (early days). The function ylm in the module sha_helper is also not finished, hence it is commented out. The code compiles using gfortran sha_helper.f03 sha_lmc.f03 -o sha_lmc, and
gfortran --version
GNU Fortran (GCC) 4.8.2
!Spherical Harmonic Bayesian Analysis testbed for Lagrangian Dynamical Monte Carlo
program sha_analysis
use sha_helper
implicit none
!Analysis Parameters
integer, parameter :: harm_order = 6
integer, parameter :: harm_array_length = (harm_order+1)**2
real, parameter :: coeff_lo = -0.1, coeff_hi = 0.1, data_err = 0.01 !for now, data_err fixed rather than hierarchical
!Monte Carlo Parameters
integer, parameter :: run = 100000, burn = 50000, thin = 100
real, parameter :: L = 1.0, e = 1.0
!Variables needed by the program
integer :: points, r, h, p, counter = 1
real, dimension(:), allocatable :: x, y, z
real, dimension(harm_array_length) :: l_index_list, m_index_list
real, dimension(:,:), allocatable :: g_matrix
!Open the file, allocate the x,y,z arrays and read the file
open(1, file = 'Average_H_M_C_PcP_boschi_1200.xyz', status = 'old')
read(1,*) points
allocate(x(points))
allocate(y(points))
allocate(z(points))
print *, "Number of Points: ", points
readloop: do r = 1, points
read(1,*) x(r), y(r), z(r)
end do readloop
!Set up the forwards model
allocate(g_matrix(harm_array_length,points))
!Generate the l and m values of spherical harmonics
hloop: do h = 0, harm_order
ploop: do p = -h,h
l_index_list(counter) = h
m_index_list(counter) = p
counter = counter + 1
end do ploop
end do hloop
print *, plgndr(1,2,0.1)
!print *, ylm(1,1,0.1,0.1)
end program sha_analysis
Your program does what is known as recursive IO - the initial call to plgndr is in the output item list of an IO statement (a print statement) [directing output to the console] - inside that function you then also attempt to execute another IO statement [that outputs to the console]. This is not permitted - see 9.11p2 and p3 of F2003 or 9.12p2 of F2008.
A solution is to separate the function invocation from the io statement in the main program, i.e.
REAL :: a_temporary
...
a_temporary = plgndr(1,2,0.1)
PRINT *, a_temporary
Other alternatives in F2008 (but not F2003 - hence the [ ] parts in the first paragraph) include directing the output from the function to a different logical unit (note that WRITE (*, ... and PRINT ... reference the same unit).
In F2008 you could also replace the WRITE statement with a STOP statement with a message (the message must be a constant - which wouldn't let you report the problematic values).
The potential for inadvertently invoking recursive IO is part of the reason that some programming styles discourage conducting IO in functions.
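One way to follow that style is to keep the argument check in a pure function that performs no I/O, and to do the reporting in a separate subroutine that is called as its own statement. The following is only a minimal sketch: the names plgndr_guard, plgndr_args_ok and require_plgndr_args are hypothetical and not part of the original module; the test itself mirrors the one used in plgndr.

```fortran
module plgndr_guard
  implicit none
  private
  public :: plgndr_args_ok, require_plgndr_args
contains
  ! Pure check with no I/O, so it is safe to reference inside a PRINT or WRITE list.
  pure logical function plgndr_args_ok(l, m, x)
    integer, intent(in) :: l, m
    real, intent(in) :: x
    plgndr_args_ok = (m >= 0) .and. (m <= l) .and. (abs(x) <= 1.0)
  end function plgndr_args_ok

  ! Reports bad arguments and halts. Call it as its own statement,
  ! never from inside the output list of another I/O statement.
  subroutine require_plgndr_args(l, m, x)
    integer, intent(in) :: l, m
    real, intent(in) :: x
    if (.not. plgndr_args_ok(l, m, x)) then
      write (*, *) "bad arguments to plgndr, aborting", l, m, x
      stop 1
    end if
  end subroutine require_plgndr_args
end module plgndr_guard
```

In the main program, a line such as call require_plgndr_args(1, 2, 0.1) placed before print *, plgndr(1, 2, 0.1) writes the message outside any other I/O statement, and unlike a constant STOP message it can include the offending values.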
Try:
if (m.lt.0.or.m.gt.l.or.abs(x).gt.1) then
write (*, *) "bad arguments to plgndr, aborting", m, x
stop
else
...
end if

Max reduce in CUDA Fortran

I am trying to perform reduction in CUDA Fortran; what I did so far is something like that, performing the reduction in two steps (see the CUDA kernels below).
In the first kernel I am doing some simple computation and I declare a shared array for a block of threads to store the value of abs(a - anew); once the threads are synchronized, I compute the max value of this shared array, that I store in an intermediate array of dimension gridDim%x * gridDim%y.
In the second kernel, I am reading this array (in a single block of threads) and try to compute the max value of it.
Here is the whole code:
module commons
integer, parameter :: dp=kind(1.d0)
integer, parameter :: nx=1024, ny=1024
integer, parameter :: block_dimx=16, block_dimy=32
end module commons
module kernels
use commons
contains
attributes(global) subroutine kernel_gpu_reduce(a, anew, error, nxi, nyi)
implicit none
integer, value, intent(in) :: nxi, nyi
real(dp), dimension(nxi,nyi), intent(in) :: a
real(dp), dimension(nxi,nyi), intent(inout) :: anew
real(dp), dimension(nxi/block_dimx+1,nyi/block_dimy+1), intent(inout) :: error
real(dp), shared, dimension(block_dimx,block_dimy) :: err_sh
integer :: i, j, k, tx, ty
i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
j = (blockIdx%y - 1)*blockDim%y + threadIdx%y
tx = threadIdx%x
ty = threadIdx%y
if (i > 1 .and. i < nxi .and. j > 1 .and. j < nyi) then
anew(i,j) = 0.25d0*(a(i-1,j) + a(i+1,j) &
& + a(i,j-1) + a(i,j+1))
err_sh(tx,ty) = abs(anew(i,j) - a(i,j))
endif
call syncthreads()
error(blockIdx%x,blockIdx%y) = maxval(err_sh)
end subroutine kernel_gpu_reduce
attributes(global) subroutine max_reduce(local_error, error, nxi, nyi)
implicit none
integer, value, intent(in) :: nxi, nyi
real(dp), dimension(nxi,nyi), intent(in) :: local_error
real(dp), intent(out) :: error
real(dp), shared, dimension(nxi) :: shared_error
integer :: tx, i
tx = threadIdx%x
shared_error(tx) = 0.d0
if (tx >=1 .and. tx <= nxi) shared_error(tx) = maxval(local_error(tx,:))
call syncthreads()
error = maxval(shared_error)
end subroutine max_reduce
end module kernels
program laplace
use cudafor
use kernels
use commons
implicit none
real(dp), allocatable, dimension(:,:) :: a, anew
real(dp) :: error=1.d0
real(dp), device, allocatable, dimension(:,:) :: adev, adevnew
real(dp), device, allocatable, dimension(:,:) :: edev
real(dp), allocatable, dimension(:,:) :: ehost
real(dp), device :: error_dev
integer :: i
integer :: num_device, h_status, ierrSync, ierrAsync
type(dim3) :: dimGrid, dimBlock
num_device = 0
h_status = cudaSetDevice(num_device)
dimGrid = dim3(nx/block_dimx+1, ny/block_dimy+1, 1)
dimBlock = dim3(block_dimx, block_dimy, 1)
allocate(a(nx,ny), anew(nx,ny))
allocate(adev(nx,ny), adevnew(nx,ny))
allocate(edev(dimGrid%x,dimGrid%y), ehost(dimGrid%x,dimGrid%y))
do i = 1, nx
a(i,:) = 1.d0
anew(i,:) = 1.d0
enddo
adev = a
adevnew = anew
call kernel_gpu_reduce<<<dimGrid, dimBlock>>>(adev, adevnew, edev, nx, ny)
ierrSync = cudaGetLastError()
ierrAsync = cudaDeviceSynchronize()
if (ierrSync /= cudaSuccess) write(*,*) &
& 'Sync kernel error - 1st kernel:', cudaGetErrorString(ierrSync)
if (ierrAsync /= cudaSuccess) write(*,*) &
& 'Async kernel error - 1st kernel:', cudaGetErrorString(ierrAsync)
call max_reduce<<<1, dimGrid%x>>>(edev, error_dev, dimGrid%x, dimGrid%y)
ierrSync = cudaGetLastError()
ierrAsync = cudaDeviceSynchronize()
if (ierrSync /= cudaSuccess) write(*,*) &
& 'Sync kernel error - 2nd kernel:', cudaGetErrorString(ierrSync)
if (ierrAsync /= cudaSuccess) write(*,*) &
& 'Async kernel error - 2nd kernel:', cudaGetErrorString(ierrAsync)
error = error_dev
print*, 'error from kernel: ', error
ehost = edev
error = maxval(ehost)
print*, 'error from host: ', error
deallocate(a, anew, adev, adevnew, edev, ehost)
end program laplace
I first had a problem because of the kernel configuration of the second kernel (which was <<<1, dimGrid>>>); I modified the code following Robert's answer. Now I have a memory access error:
Async kernel error - 2nd kernel:
an illegal memory access was encountered
0: copyout Memcpy (host=0x666bf0, dev=0x4203e20000, size=8) FAILED: 77(an illegal memory access was encountered)
And, if I run it with cuda-memcheck:
========= Invalid __shared__ write of size 8
========= at 0x00000060 in kernels_max_reduce_
========= by thread (1,0,0) in block (0,0,0)
========= Address 0x00000008 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/libcuda.so (cuLaunchKernel + 0x2c5) [0x14ad95]
for every thread.
The code is compiled with PGI Fortran 14.9 and CUDA 6.5 on a Tesla K20 card (with CUDA capability 3.5). I compile it with:
pgfortran -Mcuda -ta:nvidia,cc35 laplace.f90 -o laplace
You can do proper cuda error checking in CUDA Fortran. You should do so in your code.
One problem is that you're trying to launch too many threads (per block) in your second kernel:
call max_reduce<<<1, dimGrid>>>(edev, error_dev, dimGrid%x, dimGrid%y)
                     ^^^^^^^
The dimGrid parameter has previously been computed to be:
dimGrid = dim3(nx/block_dimx+1, ny/block_dimy+1, 1)
Substituting actual values, we have:
dimGrid = dim3(1024/16 + 1, 1024/32 + 1, 1)
i.e.
dimGrid = dim3(65, 33, 1)
But you are not allowed to request 65*33 = 2145 threads per block. The maximum is either 512 or 1024 depending on what device architecture target you are compiling for.
Because of this error, your second kernel is not running at all.
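The arithmetic above can be checked on the host before launching anything. This is a rough sketch in plain Fortran, not CUDA Fortran: the module name launch_config and its two functions are hypothetical helpers, and the limit of 1024 threads per block is an assumption that should be replaced by the value reported for the actual device.

```fortran
module launch_config
  implicit none
  ! Assumed per-block thread limit; query the device for the real value.
  integer, parameter :: max_threads_per_block = 1024
contains
  ! Number of blocks along one dimension, using the question's n/block + 1 convention.
  pure integer function blocks_along(n, block_dim)
    integer, intent(in) :: n, block_dim
    blocks_along = n / block_dim + 1
  end function blocks_along

  ! True when a block of bx by by threads stays within the assumed limit.
  pure logical function block_fits(bx, by)
    integer, intent(in) :: bx, by
    block_fits = bx * by <= max_threads_per_block
  end function block_fits
end module launch_config
```

With nx = ny = 1024 and 16 x 32 blocks, blocks_along gives 65 and 33. block_fits(65, 33) is false because 65*33 = 2145 exceeds the limit, while block_fits(65, 1) is true, which corresponds to launching the second kernel with dimGrid%x threads in a single block.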

Cuda illegal memory access error when using array indexes stored in another array

I'm using CUDA Fortran and I've been struggling with this problem in one simple kernel, and I couldn't find the solution.
Isn't it possible to use integer values stored in an array as the indexes for another array?
Here's a simple example (edited to include also the main program):
program test
use cudafor
integer:: ncell, i
integer, allocatable:: values(:)
integer, allocatable, device :: values_d(:)
ncell = 10
allocate(values(ncell), values_d(ncell))
do i=1,ncell
values(i) = i
enddo
values_d = values
call multipleindices_kernel<<< ncell/1024+1,1024 >>> (values_d,
+ ncell)
values = values_d
write (*,*) values
end program test
!////////////////////////////////////////////////////
attributes(global) subroutine multipleindices_kernel(valu, ncell)
use cudafor
implicit none
integer, value:: ncell ! ncell = 10
integer :: valu(ncell)
integer :: tempind(10)
integer:: i
tempind(1)=10
tempind(2)=3
tempind(3)=5
tempind(4)=7
tempind(5)=9
tempind(6)=2
tempind(7)=4
tempind(8)=6
tempind(9)=8
tempind(10)=1
i = (blockidx%x - 1 ) * blockdim%x + threadidx%x
if (i .LE. ncell) then
valu(tempind(i))= 1
endif
end subroutine
I understand that if there were repeated values in the tempind array, different threads could be accessing the same memory location for reading or writing, but that is not the case.
Even so, this gives the error "0: copyout Memcpy (host=0x303610, dev=0x3e20000, size=40) FAILED: 77(an illegal memory access was encountered)".
Does anyone know if it is possible to use indexes coming from another array in CUDA?
After some additional tests, I've noticed that the problem occurs not while running the kernel itself, but on the transfer of the data back to CPU (if I remove "values = values_d" then no error is displayed). Also, if I substitute in the kernel valu(tempind(i)) by valu(i) it works fine, but I want to have the indexes coming from an array since the purpose of this test is to make a parallelization of a CFD code where the indexes are stored like that.
The problem appears to be that the generated executable doesn't pass the variable ncell to the kernel correctly. Running the application through cuda-memcheck shows that threads outside the range 1-10 are passing through the branch statement, and adding a print statement to print ncell inside the kernel also gives strange answers.
It used to be a requirement that all attributes(global) subroutines had to reside within a module. This requirement seems to have been relaxed in more recent versions of CUDA Fortran (I cannot find references to it in the programming guide). I believe the code outside of the module is causing the error here. By placing multipleindices_kernel within a module and using that module in test, I consistently get correct answers with no errors. The code for this is below:
module testmod
contains
attributes(global) subroutine multipleindices_kernel(valu, ncell)
use cudafor
implicit none
integer, value:: ncell ! ncell = 10
integer :: valu(ncell)
integer :: tempind(10)
integer:: i
tempind(1)=10
tempind(2)=3
tempind(3)=5
tempind(4)=7
tempind(5)=9
tempind(6)=2
tempind(7)=4
tempind(8)=6
tempind(9)=8
tempind(10)=1
i = (blockidx%x - 1 ) * blockdim%x + threadidx%x
if (i .LE. ncell) then
valu(tempind(i))= 1
endif
end subroutine
end module testmod
program test
use cudafor
use testmod
integer:: ncell, i
integer, allocatable:: values(:)
integer, allocatable, device :: values_d(:)
ncell = 10
allocate(values(ncell), values_d(ncell))
do i=1,ncell
values(i) = i
enddo
values_d = values
call multipleindices_kernel<<< ncell/1024+1,1024 >>> (values_d, ncell)
values = values_d
write (*,*) values
end program test
!////////////////////////////////////////////////////
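A likely reason the module matters, offered here as an assumption rather than something verified against the compiler: the Fortran standard requires an explicit interface for any procedure that has a dummy argument with the value attribute, and putting the procedure in a module is the usual way to provide one. Without that interface the caller has no way to know that ncell is passed by value. The host-only sketch below uses hypothetical names (value_demo, clamp_count) and plain Fortran to show value semantics working through a module's explicit interface.

```fortran
module value_demo
  implicit none
contains
  ! n is passed by value: the procedure works on a private copy,
  ! so changing it here does not affect the caller's variable.
  subroutine clamp_count(n, limit, clamped)
    integer, value :: n
    integer, intent(in) :: limit
    integer, intent(out) :: clamped
    if (n > limit) n = limit   ! modifies only the local copy
    clamped = n
  end subroutine clamp_count
end module value_demo
```

A caller that has use value_demo and an integer ncell = 10 can run call clamp_count(ncell, 4, out); afterwards out is 4 and ncell is still 10, because the compiler sees the interface and passes a copy.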

Fortran does not understand call statement

I am attempting to use PGFortran for CUDA. I installed PGFortran on my computer and linked everything up to the best of my knowledge. To get going I decided to follow a tutorial listed here. When trying to compile the code:
module mathOps
contains
attributes(global) subroutine saxpy(x, y, a)
implicit none
real :: x(:), y(:)
real, value :: a
integer :: i, n
n = size(x)
i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
if (i <= n) y(i) = y(i) + a*x(i)
end subroutine saxpy
end module mathOps
program testSaxpy
use mathOps
use cudafor
implicit none
integer, parameter :: N = 40000
real :: x(N), y(N), a
real, device :: x_d(N), y_d(N)
type(dim3) :: grid, tBlock
tBlock = dim3(256,1,1)
grid = dim3(ceiling(real(N)/tBlock%x),1,1)
x = 1.0; y = 2.0; a = 2.0
x_d = x
y_d = y
call saxpy<<<grid, tblock="">>>(x_d, y_d, a)
y = y_d
write(*,*) 'Max error: ', maxval(abs(y-4.0))
end program testSaxpy
I got:
PGF90-S-0034-Syntax error at or near identifier saxpy (main.cuf: 29)
0 inform, 0 warnings, 1 severes, 0 fatal for testsaxpy
The error points to the line call saxpy<<<grid, tblock="">>>(x_d, y_d, a). For some reason it apparently hates the fact that I am using <<< and >>>? Going by the tutorial those triple chevrons are meant to be there:
The information between the triple chevrons is the execution
configuration, which dictates how many device threads execute the
kernel in parallel.
Removing these chevrons would not make any sense since they are the purpose of the program. So why does PGFortran dislike this?
As for compilation, I followed the tutorial and used pgf90 -o saxpy main.cuf. Since that gave an error, I also tried pgf90 -Mcuda -o saxpy main.cuf, with the same result.
There does seem to be a text error in that blog at the kernel invocation line:
call saxpy<<<grid, tblock="">>>(x_d, y_d, a)
tblock="" is not correct. You'll notice elsewhere in that blog text, the kernel invocation line is given correctly as:
call saxpy<<<grid,tBlock>>>(x_d, y_d, a)
So if you change that line accordingly in your actual code, I think you'll have better results.