How to use acc_set_cuda_stream(streamId, stream)?

I created a CUDA stream in this way:
integer(kind=cuda_stream_kind) :: stream1
istat = cudaStreamCreate(stream1)
to use it with a cuFFT plan:
err_dir = err_dir + cufftPlan2D(plan_dir1,NY,NY,CUFFT_D2Z)
err_dir = err_dir + cufftSetStream(plan_dir1,stream1)
In the routine that executes the cufft, I pass plan_dir1 and I have
subroutine new_fft_dir(z,plan)
!$acc host_data use_device(z)
ierr = ierr + cufftExecD2Z(plan,z,z)
!$acc end host_data
!$acc parallel loop collapse(2) present(z)
do i=1,NXP2
do j=1,NY
z(i,j) = z(i,j)/NY**2
enddo
enddo
!$acc end parallel loop
I would like to set an OpenACC stream equal to the CUDA stream stream1, but using:
integer(kind=cuda_stream_kind) :: stream1
istat = cudaStreamCreate(stream1)
integer :: stream
istat = cudaStreamCreate(stream1)
acc_set_cuda_stream(stream,stream1)
I get: NVFORTRAN-S-0034-Syntax error at or near end of line (main.f90: 48)
My goal is to add the async clause to
!$acc parallel loop collapse(2) present(z) async(stream)
do i=1,NXP2
do j=1,NY
z(i,j) = z(i,j)/NY**2
enddo
enddo
!$acc end parallel loop
to have this loop and the fft on the same CUDA stream.
Could the problem be that I use integer(kind=cuda_stream_kind) instead of cudaStream_t stream?

"acc_set_cuda_stream" is a subroutine so you do need to add "call " before it. Also, variables need to be declared before executable code, hence "integer :: stream" needs to be moved up a line.
use cudafor
use openacc
integer(kind=cuda_stream_kind) :: stream1
integer :: stream
istat = cudaStreamCreate(stream1)
call acc_set_cuda_stream(stream,stream1)
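Putting it together, a minimal sketch (assuming plan_dir1, z, NXP2 and NY are set up as in the question; the queue id of 1 is just an example choice) of mapping an OpenACC async queue onto the same CUDA stream used by the cuFFT plan:
integer(kind=cuda_stream_kind) :: stream1
integer :: stream, istat, ierr
istat  = cudaStreamCreate(stream1)
stream = 1                                   ! any non-negative OpenACC async queue id
call acc_set_cuda_stream(stream, stream1)    ! OpenACC queue "stream" now uses stream1
ierr   = cufftSetStream(plan_dir1, stream1)  ! cuFFT work goes to stream1 as well
!$acc host_data use_device(z)
ierr = cufftExecD2Z(plan_dir1, z, z)
!$acc end host_data
!$acc parallel loop collapse(2) present(z) async(stream)
do i = 1, NXP2
   do j = 1, NY
      z(i,j) = z(i,j)/NY**2
   enddo
enddo
!$acc end parallel loop
!$acc wait(stream)
Because the FFT and the scaling loop now share stream1, the loop cannot start before the FFT has finished, and the final wait synchronizes the host with that queue.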

Related

Double precision error of cuFFT with PlanMany in Fortran

Following the answer of JackOLantern, I'm trying to compute a batch of 1D FFTs using cufftPlanMany.
The code below performs nwfs=23 forward and backward 1D FFTs of an n=256 complex array. It's an exercise to train me to handle the routine cufftPlanMany. As a second step, the nwfs arrays will be different. At the end, I check the errors of each array.
Because the data are allocated as cinput_d(n,nwfs),
I use the function like this: cufftPlanMany(planmany, 1, fftsize, inembed, nwfs,1, onembed, nwfs,1, CUFFT_C2C, nwfs)
where:
rank = 1
fftsize = {n}, the same length for each FFT
inembed = onembed = {0}, ignored
istride = ostride = nwfs, the distance between two successive input and output elements
idist = odist = 1, the distance between two signals
batch = nwfs, the number of FFTs to be done
program fft
use cudafor
use precision_m
use cufft_m
implicit none
integer, allocatable:: kx(:)
complex(fp_kind), allocatable:: matrix(:)
complex(fp_kind), allocatable, pinned :: cinput(:,:),coutput(:,:)
complex(fp_kind), allocatable, device :: cinput_d(:,:),coutput_d(:,:)
integer:: i,j,k,n,nwfs
integer, allocatable :: fftsize(:),inembed(:),onembed(:)
type(c_ptr):: plan,planmany
real(fp_kind):: twopi=8._fp_kind*atan(1._fp_kind),h
integer::clock_start,clock_end,clock_rate,istat
real :: elapsed_time
character*1:: a
real(fp_kind):: w,x,y,z
integer:: nerrors
n=256
nwfs=23
h=twopi/real(n,fp_kind)
! allocate arrays on the host
allocate (cinput(n,nwfs),coutput(n,nwfs))
allocate (kx(n),matrix(n))
allocate (fftsize(nwfs),inembed(nwfs),onembed(nwfs))
! allocate arrays on the device
allocate (cinput_d(n,nwfs),coutput_d(n,nwfs))
fftsize(:) = n
inembed(:) = 0
onembed(:) = 0
!initialize arrays on host
kx =(/ ((i-0.5)*0.1953125, i=1,n/2), ((-n+i-0.5)*0.1953125, i=n/2+1,n) /)
matrix = (/ ... /)
!write(*,*) cinput
!copy arrays to device
do i =1,nwfs
cinput(:,i)=matrix(:)
end do
cinput_d=cinput
! Initialize the plan for complex to complex transform
if (fp_kind== singlePrecision) call cufftPlan1D(plan,n,CUFFT_C2C,1)
if (fp_kind== doublePrecision) call cufftPlan1D(plan,n,CUFFT_Z2Z,1)
if (fp_kind== doublePrecision) call cufftPlanMany(planmany, 1, fftsize, inembed, &
nwfs,1, &
onembed, &
nwfs,1, &
CUFFT_Z2Z, nwfs)
if (fp_kind== singlePrecision) call cufftPlanMany(planmany, 1, fftsize, inembed, &
nwfs,1, &
onembed, &
nwfs,1, &
CUFFT_C2C, nwfs)
!c_null_ptr fftsize,inembed,onembed
! cufftPlanMany(plan, rank, n, inembed, istride, idist, &
! onembed, ostride, odist, &
! type, batch)
!subroutine cufftPlan1d(plan, nx, type, batch)
call SYSTEM_CLOCK(COUNT_RATE=clock_rate)
istat=cudaThreadSynchronize()
call SYSTEM_CLOCK(count=clock_start)
! Forward transform out of place
call cufftExec(planmany,cinput_d,coutput_d,CUFFT_FORWARD)
!$cuf kernel do <<<*,*>>>
do i=1,n
do j =1,n
coutput_d(i,j) = coutput_d(i,j)/real(n,fp_kind)!sqrt(twopi*real(n,fp_kind))*sqrt(2.*pi)/sqrt(real(maxn))
end do
end do
call cufftExec(planmany,coutput_d,coutput_d,CUFFT_INVERSE)
istat=cudaThreadSynchronize()
call SYSTEM_CLOCK(count=clock_end)
! Copy results back to host
coutput=coutput_d
do i=1,n
! write(*,'(i2,1x,2(f8.4),1x,2(f8.4),2x,e13.7)') i,cinput(i),coutput(i),abs(coutput(i)-cinput(i))
end do
nerrors=0
do i=1,n
!write(*,'(i2,5(1x,2(f8.4),1x,2(f8.4),2x,3(e13.7,2x)))') i,cinput(i,1),coutput(i,1),abs(coutput(i,1)-cinput(i,1)),abs(coutput(i,6)-cinput(i,6)),abs(coutput(i,nwfs)-cinput(i,nwfs))
do j=1,nwfs
if (abs(coutput(i,j)-cinput(i,j))>1.d-5) then
write(*,'(i3,i3,1x,e13.7,2x,4(f8.4))') i,j,abs(coutput(i,j)-cinput(i,j)),cinput(i,j),coutput(i,j)
nerrors = nerrors + 1
end if
end do
end do
elapsed_time = REAL(clock_end-clock_start)/REAL(clock_rate)
write(*,*) 'elapsed_time :',elapsed_time,clock_start,clock_end,clock_rate
if (nerrors .eq. 0) then
print *, "Test Passed"
else
print *, "Test Failed"
endif
!release memory on the host and on the device
deallocate (cinput,coutput,kx,cinput_d,coutput_d)
! Destroy the plans
call cufftDestroy(plan)
end program fft
Can somebody tell me why this "many-FFT" sometimes fails in double precision but never in single precision?
Single precision: "Test Passed" always!
Double precision: "Test Failed" sometimes!
I checked the device-to-host data transfer; that doesn't seem to be the problem.
Thanks for any help.
Thanks to talonmies: it was the WDDM Timeout Detection & Recovery (TDR) limit.
See the link for how to change the TDR.

Reduction in OpenACC

Here is a Fortran subroutine for matrix-vector multiply. It is probably old-fashioned and inefficient in a number of ways, but right now I am just trying to get it to work with OpenACC directives, and I'm trying to figure out how reduction works:
subroutine matrmult(matrix,invec,outvec,n)
integer:: n
real*8, intent(in):: matrix(n,n), invec(n)
real*8, intent(out) :: outvec(n)
real*8 :: tmpmat(n,n)
real*8 :: tmpsclr
integer :: i,j,k
!$acc declare create(matrix, invec, outvec, tmpmat)
outvec = 0.d0
!$acc update device(matrix, invec, tmpmat, outvec)
!$acc parallel
!$acc loop gang
do j=1,n
!$acc loop vector
do i=1,n
tmpmat(i,j) = matrix(i,j)*invec(j)
enddo
enddo
!$acc loop vector reduction(+:tmpsclr)
do j=1,n
tmpsclr = 0.d0
do i=1,n
tmpsclr = tmpsclr+tmpmat(j,i)
enddo
outvec(j) = tmpsclr
enddo
!$acc end parallel
!$acc update host(outvec)
end subroutine
This code actually gives correct results. But when I try a gang/vector combination on the last loops, like so:
!$acc loop gang reduction(+:tmpsclr)
do j=1,n
tmpsclr = 0.d0
!$acc loop vector
do i=1,n
tmpsclr = tmpsclr+tmpmat(j,i)
enddo
outvec(j) = tmpsclr
enddo
the results come back all wrong. It looks like the summation is incomplete for most, but not all, of the elements of outvec. This is the case no matter where I put the reduction clause, whether with the gang or the vector. Changing the location changes the results, but never gives correct results.
The results I am getting in a simple test are like the following. matrix is 10x10 and all 1's, and invec is 1,2,3,...10. So the elements of outvec should each just be the sum of the elements in invec, 55. If I run the gang/vector version of the code, each element of outvec is 1, not 55. If I put the reduction with the vector, well, then I get the right answer, 55. And this continues to work until I get past 90 elements. When I get to 91, every element of outvec should be equal to 4186. But only the last one is, and all the rest are equal to 4095 (the sum of 1 to 90). As the number of elements get bigger the variation of values and the discrepancy from the correct answer gets worse.
I clearly don't understand how the reduction works. Can anyone explain?
The reduction clause needs to be on the loop where the reduction occurs, i.e. the vector loop. I'd also recommend using the "kernels" directive here, since "parallel" will create one kernel launch for the two loops, while "kernels" will create two kernels, one for each loop.
For example:
subroutine foo(n,matrix,invec,outvec)
integer n
real*8, intent(in) :: matrix(n,n)
real*8, intent(in) :: invec(n)
real*8, intent(out) :: outvec(n)
real*8 :: tmpmat(n,n)
real*8 :: tmpsclr
integer :: i,j,k
!$acc declare create(matrix, invec, outvec, tmpmat)
outvec = 0.d0
!$acc update device(matrix, invec, tmpmat, outvec)
!$acc kernels
!$acc loop gang
do j=1,n
!$acc loop vector
do i=1,n
tmpmat(i,j) = matrix(i,j)*invec(j)
enddo
enddo
!$acc loop gang
do j=1,n
tmpsclr = 0.d0
!$acc loop vector reduction(+:tmpsclr)
do i=1,n
tmpsclr = tmpsclr+tmpmat(j,i)
enddo
outvec(j) = tmpsclr
enddo
!$acc end kernels
!$acc update host(outvec)
end subroutine foo
% pgf90 -c -acc -Minfo=accel test2.f90
foo:
11, Generating create(matrix(:,:),invec(:),outvec(:))
15, Generating update device(outvec(:),tmpmat(:,:),invec(:),matrix(:,:))
20, Loop is parallelizable
22, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
20, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
22, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
28, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
28, !$acc loop gang ! blockidx%x
31, !$acc loop vector(128) ! threadidx%x
Sum reduction generated for tmpsclr
31, Loop is parallelizable
39, Generating update host(outvec(:))
Hope this helps,
Mat
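For comparison, a minimal sketch of the same two loop nests written as two separate parallel regions (assuming the same declarations as above, with tmpsclr declared real*8); each parallel construct also maps to its own kernel, and the reduction again sits on the vector loop:
!$acc parallel loop collapse(2)
do j=1,n
   do i=1,n
      tmpmat(i,j) = matrix(i,j)*invec(j)
   enddo
enddo
!$acc parallel loop gang private(tmpsclr)
do j=1,n
   tmpsclr = 0.d0
   !$acc loop vector reduction(+:tmpsclr)
   do i=1,n
      tmpsclr = tmpsclr + tmpmat(j,i)
   enddo
   outvec(j) = tmpsclr
enddo
The explicit private(tmpsclr) gives each gang its own partial sum; a reduction at the gang level instead combines tmpsclr across all gangs into a single value, which is not what outvec(j) needs.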

Max reduce in CUDA Fortran

I am trying to perform a reduction in CUDA Fortran; what I have done so far is shown below, performing the reduction in two steps (see the CUDA kernels below).
In the first kernel I do some simple computation and declare a shared array for a block of threads to store the value of abs(a - anew); once the threads are synchronized, I compute the max value of this shared array, which I store in an intermediate array of dimension gridDim%x * gridDim%y.
In the second kernel, I read this array (in a single block of threads) and try to compute its max value.
Here is the whole code:
module commons
integer, parameter :: dp=kind(1.d0)
integer, parameter :: nx=1024, ny=1024
integer, parameter :: block_dimx=16, block_dimy=32
end module commons
module kernels
use commons
contains
attributes(global) subroutine kernel_gpu_reduce(a, anew, error, nxi, nyi)
implicit none
integer, value, intent(in) :: nxi, nyi
real(dp), dimension(nxi,nyi), intent(in) :: a
real(dp), dimension(nxi,nyi), intent(inout) :: anew
real(dp), dimension(nxi/block_dimx+1,nyi/block_dimy+1), intent(inout) :: error
real(dp), shared, dimension(block_dimx,block_dimy) :: err_sh
integer :: i, j, k, tx, ty
i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
j = (blockIdx%y - 1)*blockDim%y + threadIdx%y
tx = threadIdx%x
ty = threadIdx%y
if (i > 1 .and. i < nxi .and. j > 1 .and. j < nyi) then
anew(i,j) = 0.25d0*(a(i-1,j) + a(i+1,j) &
& + a(i,j-1) + a(i,j+1))
err_sh(tx,ty) = abs(anew(i,j) - a(i,j))
endif
call syncthreads()
error(blockIdx%x,blockIdx%y) = maxval(err_sh)
end subroutine kernel_gpu_reduce
attributes(global) subroutine max_reduce(local_error, error, nxi, nyi)
implicit none
integer, value, intent(in) :: nxi, nyi
real(dp), dimension(nxi,nyi), intent(in) :: local_error
real(dp), intent(out) :: error
real(dp), shared, dimension(nxi) :: shared_error
integer :: tx, i
tx = threadIdx%x
shared_error(tx) = 0.d0
if (tx >=1 .and. tx <= nxi) shared_error(tx) = maxval(local_error(tx,:))
call syncthreads()
error = maxval(shared_error)
end subroutine max_reduce
end module kernels
program laplace
use cudafor
use kernels
use commons
implicit none
real(dp), allocatable, dimension(:,:) :: a, anew
real(dp) :: error=1.d0
real(dp), device, allocatable, dimension(:,:) :: adev, adevnew
real(dp), device, allocatable, dimension(:,:) :: edev
real(dp), allocatable, dimension(:,:) :: ehost
real(dp), device :: error_dev
integer :: i
integer :: num_device, h_status, ierrSync, ierrAsync
type(dim3) :: dimGrid, dimBlock
num_device = 0
h_status = cudaSetDevice(num_device)
dimGrid = dim3(nx/block_dimx+1, ny/block_dimy+1, 1)
dimBlock = dim3(block_dimx, block_dimy, 1)
allocate(a(nx,ny), anew(nx,ny))
allocate(adev(nx,ny), adevnew(nx,ny))
allocate(edev(dimGrid%x,dimGrid%y), ehost(dimGrid%x,dimGrid%y))
do i = 1, nx
a(i,:) = 1.d0
anew(i,:) = 1.d0
enddo
adev = a
adevnew = anew
call kernel_gpu_reduce<<<dimGrid, dimBlock>>>(adev, adevnew, edev, nx, ny)
ierrSync = cudaGetLastError()
ierrAsync = cudaDeviceSynchronize()
if (ierrSync /= cudaSuccess) write(*,*) &
& 'Sync kernel error - 1st kernel:', cudaGetErrorString(ierrSync)
if (ierrAsync /= cudaSuccess) write(*,*) &
& 'Async kernel error - 1st kernel:', cudaGetErrorString(ierrAsync)
call max_reduce<<<1, dimGrid%x>>>(edev, error_dev, dimGrid%x, dimGrid%y)
ierrSync = cudaGetLastError()
ierrAsync = cudaDeviceSynchronize()
if (ierrSync /= cudaSuccess) write(*,*) &
& 'Sync kernel error - 2nd kernel:', cudaGetErrorString(ierrSync)
if (ierrAsync /= cudaSuccess) write(*,*) &
& 'Async kernel error - 2nd kernel:', cudaGetErrorString(ierrAsync)
error = error_dev
print*, 'error from kernel: ', error
ehost = edev
error = maxval(ehost)
print*, 'error from host: ', error
deallocate(a, anew, adev, adevnew, edev, ehost)
end program laplace
I first had a problem because of the kernel configuration of the second kernel (which was <<<1, dimGrid>>>); I modified the code following Robert's answer. Now I have a memory access error:
Async kernel error - 2nd kernel:
an illegal memory access was encountered
0: copyout Memcpy (host=0x666bf0, dev=0x4203e20000, size=8) FAILED: 77(an illegal memory access was encountered)
And, if I run it with cuda-memcheck:
========= Invalid __shared__ write of size 8
========= at 0x00000060 in kernels_max_reduce_
========= by thread (1,0,0) in block (0,0,0)
========= Address 0x00000008 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/libcuda.so (cuLaunchKernel + 0x2c5) [0x14ad95]
for every thread.
The code is compiled with PGI Fortran 14.9 and CUDA 6.5 on a Tesla K20 card (with CUDA capability 3.5). I compile it with:
pgfortran -Mcuda -ta:nvidia,cc35 laplace.f90 -o laplace
You can do proper CUDA error checking in CUDA Fortran, and you should do so in your code.
One problem is that you're trying to launch too many threads (per block) in your second kernel:
call max_reduce<<<1, dimGrid>>>(edev, error_dev, dimGrid%x, dimGrid%y)
^^^^^^^
The dimGrid parameter has previously been computed to be:
dimGrid = dim3(nx/block_dimx+1, ny/block_dimy+1, 1);
Substituting actual values, we have:
dimGrid = dim3(1024/16 + 1, 1024/32 +1);
i.e.
dimGrid = dim3(65,33);
But you are not allowed to request 65*33 = 2145 threads per block. The maximum is either 512 or 1024 depending on what device architecture target you are compiling for.
Because of this error, your second kernel is not running at all.
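As a quick guard, a minimal sketch (in CUDA Fortran, assuming the dim3 variables from the code above) that checks a requested block size against the device limit before launching:
type(cudaDeviceProp) :: prop
integer :: istat, nthreads
istat    = cudaGetDeviceProperties(prop, 0)
nthreads = dimGrid%x * dimGrid%y          ! what <<<1, dimGrid>>> would request per block
if (nthreads > prop%maxThreadsPerBlock) then
   write(*,*) 'Requested ', nthreads, ' threads per block; device limit is ', &
              prop%maxThreadsPerBlock
end if
Checking ierrSync = cudaGetLastError() right after the launch, as the code already does, would typically report this case as an invalid configuration argument.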

How to interface OpenACC with cublasDgetrfBatched in Fortran?

I've been working on a Fortran code which uses the cuBLAS batched LU and cuSPARSE batched tridiagonal solver as part of a BiCG iterative solver with an ADI preconditioner. I'm using a Kepler K20X with compute capability 3.5 and CUDA 5.5. I'm doing this without PGI's CUDA Fortran, so I'm writing my own interfaces:
FUNCTION cublasDgetrfBatched(handle, n, dA, ldda, dP, dInfo, nbatch) BIND(C, NAME="cublasDgetrfBatched")
USE, INTRINSIC :: ISO_C_BINDING
INTEGER(KIND(CUBLAS_STATUS_SUCCESS)) :: cublasDgetrfBatched
TYPE(C_PTR), VALUE :: handle
INTEGER(C_INT), VALUE :: n
TYPE(C_PTR), VALUE :: dA
INTEGER(C_INT), VALUE :: ldda
TYPE(C_PTR), VALUE :: dP
TYPE(C_PTR), VALUE :: dInfo
INTEGER(C_INT), VALUE :: nbatch
END FUNCTION cublasDgetrfBatched
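A companion interface for creating the handle that is later passed in as cublas_handle can be written the same way (a sketch assuming the v2 entry point cublasCreate_v2, which takes the handle by reference, hence no VALUE attribute):
FUNCTION cublasCreate(handle) BIND(C, NAME="cublasCreate_v2")
USE, INTRINSIC :: ISO_C_BINDING
INTEGER(C_INT) :: cublasCreate
TYPE(C_PTR) :: handle
END FUNCTION cublasCreate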
I allocate pinned memory on the host with cudaHostAlloc, allocate the device memory for the matrices and the device array containing the device pointers to the matrices, asynchronously copy each matrix to the device, perform the operations, and then asynchronously copy the decomposed matrix and pivots back to the host to perform the back-substitution with a single right-hand side:
REAL(8), POINTER, DIMENSION(:,:,:) :: A
INTEGER, DIMENSION(:,:), POINTER :: ipiv
TYPE(C_PTR) :: cPtr_A, cPtr_ipiv
TYPE(C_PTR), ALLOCATABLE, DIMENSION(:), TARGET :: dPtr_A
TYPE(C_PTR) :: dPtr_ipiv, dPtr_A_d, dPtr_info
INTEGER(C_SIZE_T) :: sizeof_A, sizeof_ipiv
...
stat = cudaHostAlloc(cPtr_A, sizeof_A, cudaHostAllocDefault)
CALL C_F_POINTER(cPtr_A, A, (/m,m,nbatch/))
stat = cudaHostAlloc(cPtr_ipiv, sizeof_ipiv, cudaHostAllocDefault)
CALL C_F_POINTER(cPtr_ipiv, ipiv, (/m,nbatch/))
ALLOCATE(dPtr_A(nbatch))
DO ibatch=1,nbatch
stat = cudaMalloc(dPtr_A(ibatch), m*m*sizeof_double)
END DO
stat = cudaMalloc(dPtr_A_d, nbatch*sizeof_cptr)
stat = cublasSetVector(nbatch, sizeof_cptr, C_LOC(dPtr_A(1)), 1, dPtr_A_d, 1)
stat = cudaMalloc(dPtr_ipiv, m*nbatch*sizeof_cint)
stat = cudaMalloc(dPtr_info, nbatch*sizeof_cint)
...
!$OMP PARALLEL DEFAULT(shared) PRIVATE( stat, ibatch )
!$OMP DO
DO ibatch = 1,nbatch
stat = cublasSetMatrixAsync(m, m, sizeof_double, C_LOC(A(1,1,ibatch)), m, dPtr_A(ibatch), m, mystream)
END DO
!$OMP END DO
!$OMP END PARALLEL
...
stat = cublasDgetrfBatched(cublas_handle, m, dPtr_A_d, m, dPtr_ipiv, dPtr_info, nbatch)
...
stat = cublasGetMatrixAsync(m, nbatch, sizeof_cint, dPtr_ipiv, m, C_LOC(ipiv(1,1)), m, mystream)
!$OMP PARALLEL DEFAULT(shared) PRIVATE( ibatch, stat )
!$OMP DO
DO ibatch = 1,nbatch
stat = cublasGetMatrixAsync(m, m, sizeof_double, dPtr_A(ibatch), m, C_LOC(A(1,1,ibatch)), m, mystream)
END DO
!$OMP END DO
!$OMP END PARALLEL
...
!$OMP PARALLEL DEFAULT(shared) PRIVATE( ibatch, x, stat )
!$OMP DO
DO ibatch = 1,nbatch
x = rhs(:,ibatch)
CALL dgetrs( 'N', m, 1, A(1,1,ibatch), m, ipiv(1,ibatch), x(1), m, info )
rhs(:,ibatch) = x
END DO
!$OMP END DO
!$OMP END PARALLEL
...
I'd rather not have to do this last step, but the cublasDtrsmBatched routine limits the matrix size to 32, and mine are size 80 (a batched Dtrsv would be better, but this doesn't exist). The cost of launching multiple individual cublasDtrsv kernels makes performing the back-sub on the device untenable.
There are other operations which I need to perform between calls to cublasDgetrfBatched and cusparseDgtsvStridedBatch. Most of these are currently being performed on the host with OpenMP being used to parallelize the loops at the batched level. Some of the operations, like matrix-vector multiplication for each of the matrices being decomposed for example, are being computed on the device with OpenACC:
!$ACC DATA COPYIN(A) COPYIN(x) COPYOUT(Ax)
...
!$ACC KERNELS
DO ibatch = 1,nbatch
DO i = 1,m
Ax(i,ibatch) = zero
END DO
DO j = 1,m
DO i = 1,m
Ax(i,ibatch) = Ax(i,ibatch) + A(i,j,ibatch)*x(j,ibatch)
END DO
END DO
END DO
!$ACC END KERNELS
...
!$ACC END DATA
I'd like to place more of the computation on the GPU with OpenACC, but to do so I need to be able to interface the two. Something like the following:
!$ACC DATA COPYIN(A) CREATE(info,A_d) COPYOUT(ipiv)
!$ACC HOST_DATA USE_DEVICE(A)
DO ibatch = 1,nbatch
A_d(ibatch) = acc_deviceptr(A(1,1,ibatch))
END DO
!$ACC END HOST_DATA
...
!$ACC HOST_DATA USE_DEVICE(ipiv,info)
stat = cublasDgetrfBatched(cublas_handle, m, A_d, m, ipiv, info, nbatch)
!$ACC END HOST_DATA
...
!$ACC END DATA
I know the host_data construct with the use_device clause would be appropriate in most cases, but since I need to actually pass to cuBLAS a device array containing the pointers to the matrices on the device, I'm not sure how to proceed.
Can anyone offer any insight?
Thanks
!! Put everything on the device
!$ACC DATA COPYIN(A) CREATE(info,A_d) COPYOUT(ipiv)
!! populate the device A_d array
!$ACC parallel loop
DO ibatch = 1,nbatch
A_d(ibatch) = A(1,1,ibatch)
END DO
!$ACC end parallel
...
!! send the device address of A_d to the device
!$ACC HOST_DATA USE_DEVICE(A_d,ipiv,info)
stat = cublasDgetrfBatched(cublas_handle, m, A_d, m, ipiv, info, nbatch)
!$ACC END HOST_DATA
...
!$ACC END DATA
or
!! Put everything but A_d on the device
!$ACC DATA COPYIN(A) CREATE(info) COPYOUT(ipiv)
!! populate the host A_d array
DO ibatch = 1,nbatch
A_d(ibatch) = acc_deviceptr( A(1,1,ibatch) )
END DO
!! copy A_d to the device
!$acc data copyin( A_d )
...
!! send the device address of A_d and others to the device
!$ACC HOST_DATA USE_DEVICE(A_d,ipiv,info)
stat = cublasDgetrfBatched(cublas_handle, m, A_d, m, ipiv, info, nbatch)
!$ACC END HOST_DATA
...
!$acc end data
!$ACC END DATA

Multi-GPU for peer to peer

I am having some difficulties setting up 2 GPUs for peer-to-peer communication.
I am using CUDA 4.0 and programming in Fortran with the PGI compiler.
I wrote a program which confirms I have 4 GPUs available on my node.
I decided to use two of them, but I am getting the following error:
0: DEALLOCATE: invalid device pointer.
subroutine directTransfer()
use cudafor
implicit none
integer, parameter :: N = 4*1024*1024
real, pinned, allocatable :: a(:), b(:)
real, device, allocatable :: a_d(:), b_d(:)
!these hold free and total memory before and after
!allocation, used to verify allocation happening on proper devices
integer (int_ptr_kind()),allocatable ::
& freeBefore(:), totalBefore(:),
& freeAfter(:), totalAfter(:)
integer :: istat, nDevices, i, accessPeer, timingDev
type(cudaDeviceProp)::prop
type(cudaEvent)::startEvent,stopEvent
real :: time
!allocate host arrays
allocate(a(N), b(N))
allocate(freeBefore(0:nDevices -1),
& totalBefore(0:nDevices -1))
allocate(freeAfter(0:nDevices -1),
& totalAfter(0:nDevices -1))
write(*,*) 'Start!'
!get device info (including total and free memory)
!before allocation
istat = cudaGetDeviceCount(nDevices)
if(nDevices < 2) then
write(*,*) 'Need at least two CUDA capable devices'
stop
end if
write(*,"('Number of CUDA-capable devices: ',
& i0, /)"),nDevices
do i = 0, nDevices - 1
istat = cudaGetDeviceProperties(prop, i)
istat = cudaSetDevice(i)
istat = cudaMemGetInfo(freeBefore(i), totalBefore(i))
end do
!!!Here is the trouble zone!!!!
istat = cudaSetDevice(0)
allocate(a_d(N))
istat = cudaSetDevice(1)
allocate(b_d(N))
deallocate(freeBefore, totalBefore,freeAfter,totalAfter)
deallocate(a,b,a_d,b_d)
end subroutine directTransfer
With the following I have no error:
istat = cudaSetDevice(0)
allocate(a_d(N))
!istat = cudaSetDevice(1)
!allocate(b_d(N))
With this, also no error:
!istat = cudaSetDevice(0)
!allocate(a_d(N))
istat = cudaSetDevice(1)
allocate(b_d(N))
But this returns an error:
istat = cudaSetDevice(0)
allocate(a_d(N))
istat = cudaSetDevice(1)
allocate(b_d(N))
So it seems I cannot set up 2 GPUs to start my program.
Could you help me understand why it is not possible to set up 2 GPUs, and give me a hint to solve this?
Thank you JackOLantern! That was the trick.
I changed the code as follows and it works perfectly:
!clean up
deallocate(freeBefore, totalBefore,freeAfter,totalAfter)
istat = cudaSetDevice(0)
deallocate(a_d)
istat = cudaSetDevice(1)
deallocate(b_d)
deallocate(a,b)
This was the answer to my problem. I hope it will help others.
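As a follow-on for the actual peer-to-peer part (a minimal sketch, assuming devices 0 and 1 as above), peer access can be checked and enabled through the cudafor API before issuing any cross-device copies:
integer :: istat, accessPeer
istat = cudaDeviceCanAccessPeer(accessPeer, 0, 1)
if (accessPeer == 1) then
   istat = cudaSetDevice(0)
   istat = cudaDeviceEnablePeerAccess(1, 0)   ! second argument (flags) must be 0
   istat = cudaSetDevice(1)
   istat = cudaDeviceEnablePeerAccess(0, 0)
else
   write(*,*) 'Peer access between devices 0 and 1 is not supported'
end if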