I am trying to perform a reduction in CUDA Fortran; what I have done so far is something like the following, performing the reduction in two steps (see the CUDA kernels below).
In the first kernel I do some simple computation and declare a shared array for a block of threads to store the value of abs(a - anew); once the threads are synchronized, I compute the max value of this shared array, which I store in an intermediate array of dimension gridDim%x * gridDim%y.
In the second kernel, I read this array (in a single block of threads) and try to compute its max value.
Here is the whole code:
module commons
integer, parameter :: dp=kind(1.d0)
integer, parameter :: nx=1024, ny=1024
integer, parameter :: block_dimx=16, block_dimy=32
end module commons
module kernels
use commons
contains
attributes(global) subroutine kernel_gpu_reduce(a, anew, error, nxi, nyi)
implicit none
integer, value, intent(in) :: nxi, nyi
real(dp), dimension(nxi,nyi), intent(in) :: a
real(dp), dimension(nxi,nyi), intent(inout) :: anew
real(dp), dimension(nxi/block_dimx+1,nyi/block_dimy+1), intent(inout) :: error
real(dp), shared, dimension(block_dimx,block_dimy) :: err_sh
integer :: i, j, k, tx, ty
i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
j = (blockIdx%y - 1)*blockDim%y + threadIdx%y
tx = threadIdx%x
ty = threadIdx%y
if (i > 1 .and. i < nxi .and. j > 1 .and. j < nyi) then
anew(i,j) = 0.25d0*(a(i-1,j) + a(i+1,j) &
& + a(i,j-1) + a(i,j+1))
err_sh(tx,ty) = abs(anew(i,j) - a(i,j))
endif
call syncthreads()
error(blockIdx%x,blockIdx%y) = maxval(err_sh)
end subroutine kernel_gpu_reduce
attributes(global) subroutine max_reduce(local_error, error, nxi, nyi)
implicit none
integer, value, intent(in) :: nxi, nyi
real(dp), dimension(nxi,nyi), intent(in) :: local_error
real(dp), intent(out) :: error
real(dp), shared, dimension(nxi) :: shared_error
integer :: tx, i
tx = threadIdx%x
shared_error(tx) = 0.d0
if (tx >=1 .and. tx <= nxi) shared_error(tx) = maxval(local_error(tx,:))
call syncthreads()
error = maxval(shared_error)
end subroutine max_reduce
end module kernels
program laplace
use cudafor
use kernels
use commons
implicit none
real(dp), allocatable, dimension(:,:) :: a, anew
real(dp) :: error=1.d0
real(dp), device, allocatable, dimension(:,:) :: adev, adevnew
real(dp), device, allocatable, dimension(:,:) :: edev
real(dp), allocatable, dimension(:,:) :: ehost
real(dp), device :: error_dev
integer :: i
integer :: num_device, h_status, ierrSync, ierrAsync
type(dim3) :: dimGrid, dimBlock
num_device = 0
h_status = cudaSetDevice(num_device)
dimGrid = dim3(nx/block_dimx+1, ny/block_dimy+1, 1)
dimBlock = dim3(block_dimx, block_dimy, 1)
allocate(a(nx,ny), anew(nx,ny))
allocate(adev(nx,ny), adevnew(nx,ny))
allocate(edev(dimGrid%x,dimGrid%y), ehost(dimGrid%x,dimGrid%y))
do i = 1, nx
a(i,:) = 1.d0
anew(i,:) = 1.d0
enddo
adev = a
adevnew = anew
call kernel_gpu_reduce<<<dimGrid, dimBlock>>>(adev, adevnew, edev, nx, ny)
ierrSync = cudaGetLastError()
ierrAsync = cudaDeviceSynchronize()
if (ierrSync /= cudaSuccess) write(*,*) &
& 'Sync kernel error - 1st kernel:', cudaGetErrorString(ierrSync)
if (ierrAsync /= cudaSuccess) write(*,*) &
& 'Async kernel error - 1st kernel:', cudaGetErrorString(ierrAsync)
call max_reduce<<<1, dimGrid%x>>>(edev, error_dev, dimGrid%x, dimGrid%y)
ierrSync = cudaGetLastError()
ierrAsync = cudaDeviceSynchronize()
if (ierrSync /= cudaSuccess) write(*,*) &
& 'Sync kernel error - 2nd kernel:', cudaGetErrorString(ierrSync)
if (ierrAsync /= cudaSuccess) write(*,*) &
& 'Async kernel error - 2nd kernel:', cudaGetErrorString(ierrAsync)
error = error_dev
print*, 'error from kernel: ', error
ehost = edev
error = maxval(ehost)
print*, 'error from host: ', error
deallocate(a, anew, adev, adevnew, edev, ehost)
end program laplace
I first had a problem because of the kernel configuration of the second kernel (which was <<<1, dimGrid>>>); I modified the code following Robert's answer. Now I have a memory access error:
Async kernel error - 2nd kernel:
an illegal memory access was encountered
0: copyout Memcpy (host=0x666bf0, dev=0x4203e20000, size=8) FAILED: 77(an illegal memory access was encountered)
And, if I run it with cuda-memcheck:
========= Invalid __shared__ write of size 8
========= at 0x00000060 in kernels_max_reduce_
========= by thread (1,0,0) in block (0,0,0)
========= Address 0x00000008 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/libcuda.so (cuLaunchKernel + 0x2c5) [0x14ad95]
for every thread.
The code is compiled with PGI Fortran 14.9 and CUDA 6.5 on a Tesla K20 card (with CUDA capability 3.5). I compile it with:
pgfortran -Mcuda -ta:nvidia,cc35 laplace.f90 -o laplace
You can do proper CUDA error checking in CUDA Fortran, and you should do so in your code.
One problem is that you're trying to launch too many threads (per block) in your second kernel:
call max_reduce<<<1, dimGrid>>>(edev, error_dev, dimGrid%x, dimGrid%y)
^^^^^^^
The dimGrid parameter has previously been computed to be:
dimGrid = dim3(nx/block_dimx+1, ny/block_dimy+1, 1);
Substituting actual values, we have:
dimGrid = dim3(1024/16 + 1, 1024/32 +1);
i.e.
dimGrid = dim3(65,33);
But you are not allowed to request 65*33 = 2145 threads per block. The maximum is either 512 or 1024 depending on which device architecture target you are compiling for (1024 for the cc 3.5 target used here).
Because of this error, your second kernel is not running at all.
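As a side note, rather than hard-coding that limit you can query it at run time. Below is a minimal CUDA Fortran sketch (device 0 assumed); cudaGetDeviceProperties and the maxThreadsPerBlock field come from the cudafor module:
program query_limits
use cudafor
implicit none
type(cudaDeviceProp) :: prop
integer :: istat
! query the properties of device 0 and print its per-block thread limit
istat = cudaGetDeviceProperties(prop, 0)
print*, 'maxThreadsPerBlock = ', prop%maxThreadsPerBlock  ! 1024 on a cc 3.5 device
end program query_limits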
Related
I have some Fortran 90 code from which I am trying to call functions from a library I wrote using Cuda.
I am mapping my data on the device using
!$OMP TARGET DATA MAP ( .... )
and I am getting a pointer to the memory allocated on the device using USE_DEVICE_PTR.
However, my arrays are multidimensional. So my question is:
Does OpenMP allocate them as multidimensional arrays (using cudaMallocPitch)?
If so, how do I get the pitch?
I have tried to figure it out myself by computing the pitch as the difference between the addresses of the first elements of two consecutive lines and slices. So I wrote a small Fortran program that calls a function inside an OpenMP TARGET loop (file getpitch.f90):
program omppitch
implicit none
complex(kind = 8), dimension(:,:,:), allocatable :: A
integer(kind = 4) :: M, N, K, NK
integer(kind = 8) :: pA, pB, P
M = 1024
N = 2048
K = 3
allocate( A(0:K,0:M,0:N) )
call zlarnv( 1, (/ 0, 0, 0, 1 /), M*N*K, A )
!$OMP TARGET DATA MAP ( TO: A ) USE_DEVICE_PTR( A, pA )
!$OMP TARGET PARALLEL &
!$OMP& DEFAULT(NONE) &
!$OMP& SHARED( A, M, N, K, P ) PRIVATE( pA )
do NK = 0,K
call getpitch( A(NK,:,:), pA )
!$OMP CRITICAL
P = pA
!$OMP END CRITICAL
end do !NK
!$OMP END TARGET PARALLEL
!$OMP END TARGET DATA
print *, P
deallocate( A )
end program omppitch
This function is implemented in C++ using a Cuda kernel started by only one thread (file getpitch_ker.cu):
#include <thrust/complex.h>
#define complex_double thrust::complex<double>
extern "C" {
void getpitch( complex_double*, uint64_t* );
void getpitch_( complex_double*, uint64_t* );
}
template<class T>
__global__ void ker_getpitch( T *A, uint64_t* addr ){
*addr = (uint64_t)A;
}
void getpitch( complex_double *A, uint64_t* addr ){
ker_getpitch<<<1, 1>>>( A, addr );
}
void getpitch_( complex_double *A, uint64_t* addr ){
getpitch( A, addr );
}
I compile them using:
nvc++ -c getpitch_ker.cu
nvfortran -ta=tesla,pinned,cc80 -i4 -r8 -o getpitch getpitch.f90 getpitch_ker.o -L$BLAS -lopenblas -mp -lstdc++
And I am getting the following error:
$ ./getpitch
malloc: cuMemMallocHost returns error code 201 for new pool allocation
malloc: cuMemMallocHost returns error code 201 for new pool allocation
malloc: cuMemMallocHost returns error code 201 for new pool allocation
malloc: cuMemMallocHost returns error code 201 for new pool allocation
malloc: cuMemMallocHost returns error code 201 for new pool allocation
malloc: cuMemMallocHost returns error code 201 for new pool allocation
malloc: cuMemMallocHost returns error code 201 for new pool allocation
0
I may have a problem with either "cudaMalloc" or "cublasSetMatrix" not being called appropriately from Fortran:
module cuda
interface
integer (C_INT) function cudaMallocHost(buffer, size) bind(C,name="cudaMallocHost")
use iso_c_binding
implicit none
type (C_PTR) :: buffer
integer (C_SIZE_T), value :: size
end function cudaMallocHost
integer (C_INT) function cudaFreeHost(buffer) bind(C,name="cudaFreeHost")
use iso_c_binding
implicit none
type (C_PTR), value :: buffer
end function cudaFreeHost
integer (C_INT) function cudaMalloc(buffer, size) bind(C,name="cudaMalloc")
use iso_c_binding
implicit none
type (C_PTR) :: buffer
integer (C_SIZE_T), value :: size
end function cudaMalloc
integer (C_INT) function cudaFree(buffer) bind(C,name="cudaFree")
use iso_c_binding
implicit none
type (C_PTR), value :: buffer
end function cudaFree
integer (C_INT) function cublassetmatrix(M,N,size,A_host,lda_h&
&,A_dev,lda_d) bind(C,name="cublasSetMatrix")
use iso_c_binding
implicit none
type (C_PTR) :: A_dev
type(c_ptr) :: A_host
integer (C_SIZE_T), value :: size
integer(C_Int) :: M,N,lda_h,lda_d
end function cublassetmatrix
Type(c_ptr) function cudaGetErrorString(err) bind(C,name="cudaGetErrorString")
use iso_c_binding
implicit none
integer (C_SIZE_T), value :: err
end function cudaGetErrorString
integer (C_INT) function cublasCreate(handle) bind(C,name="cublasCreate_v2")
use iso_c_binding
implicit none
Type(C_Ptr) :: handle
end function cublasCreate
end interface
end module cuda
program test
use iso_c_binding
use cuda
implicit none
integer, parameter :: fp_kind = kind(0.0)
type(C_PTR) :: cptr_A, cptr_A_D
real (fp_kind), dimension(:,:), pointer :: A=>null()
real :: time_start,time_end
integer:: i,j, res, m1
integer(c_int) :: x
type(c_ptr) :: handle
logical:: lsexit
CHARACTER(len=50), POINTER :: errchar
m1=500
res=cublasCreate(handle)
if(res/=0) Then
write(*,*) "ERROR 1 ",res;
end if
res = cudaMallocHost ( cptr_A, m1*m1*sizeof(fp_kind) )
if(res/=0) Then
write(*,*) "ERROR 2 ",res;
end if
call c_f_pointer ( cptr_A, A, (/ m1, m1 /) )
A=1._fp_kind
res = cudaMalloc ( cptr_A_D, m1*m1*sizeof(fp_kind) )
if(res/=0) Then
write(*,*) "ERROR 3 ",res;
end if
res=cublasSetMatrix (m1,m1,sizeof(fp_kind),cptr_A,m1,cptr_A_D,m1)
if(res/=0) Then
write(*,*) "ERROR 4 ",res,sizeof(fp_kind)
call c_f_pointer ( cudageterrorstring(int(res,kind=8)),&
& errchar, [ len(errchar) ] )
write(*,*) trim(adjustl(errchar))
end if
end program test
The make command is:
tmp:
ifort -O3 -o tmp $(FFLAGS) tmp.f90 -L/opt/cuda/lib64 -lcublas -lcudart
clean:
rm tmp cuda.mod
Although "cudamallochost" expects "void ** ptr" it seems to work as the fortran pointer is usable, and when put in a loop with "cudafree" does not produce memory leaks.
The code fails at "cublassetmatrix" function either with error code 7 ("too many resources") or 11 ("invalid argument").
Any idea?
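A note for comparison: the C prototype is cublasStatus_t cublasSetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb), i.e. every argument (the integers and both pointers) is passed by value. An interface matching that prototype would look roughly like the sketch below; it is untested here and shown only for comparison with the interface above.
! sketch: interface matching the C prototype of cublasSetMatrix,
! with all arguments passed by value
interface
integer (C_INT) function cublasSetMatrix(rows, cols, elemSize, A_host, lda_h,&
& A_dev, lda_d) bind(C,name="cublasSetMatrix")
use iso_c_binding
implicit none
integer (C_INT), value :: rows, cols, elemSize, lda_h, lda_d
type (C_PTR), value :: A_host, A_dev
end function cublasSetMatrix
end interface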
I am attempting to use PGFortran for CUDA. I installed PGFortran on my computer and linked everything up to the best of my knowledge. To get going I decided to follow a tutorial listed here. When trying to compile the code:
module mathOps
contains
attributes(global) subroutine saxpy(x, y, a)
implicit none
real :: x(:), y(:)
real, value :: a
integer :: i, n
n = size(x)
i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
if (i <= n) y(i) = y(i) + a*x(i)
end subroutine saxpy
end module mathOps
program testSaxpy
use mathOps
use cudafor
implicit none
integer, parameter :: N = 40000
real :: x(N), y(N), a
real, device :: x_d(N), y_d(N)
type(dim3) :: grid, tBlock
tBlock = dim3(256,1,1)
grid = dim3(ceiling(real(N)/tBlock%x),1,1)
x = 1.0; y = 2.0; a = 2.0
x_d = x
y_d = y
call saxpy<<<grid, tblock="">>>(x_d, y_d, a)
y = y_d
write(*,*) 'Max error: ', maxval(abs(y-4.0))
end program testSaxpy
I got:
PGF90-S-0034-Syntax error at or near identifier saxpy (main.cuf: 29)
0 inform, 0 warnings, 1 severes, 0 fatal for testsaxpy
The error points to the line call saxpy<<<grid, tblock="">>>(x_d, y_d, a). For some reason it apparently hates the fact that I am using <<< and >>>? Going by the tutorial those triple chevrons are meant to be there:
The information between the triple chevrons is the execution
configuration, which dictates how many device threads execute the
kernel in parallel.
Removing these chevrons would not make any sense, since they are what launches the kernel on the device. So why does PGFortran dislike this?
As for the compilation: I followed the tutorial and used pgf90 -o saxpy main.cuf. Since that gave an error, I also tried pgf90 -Mcuda -o saxpy main.cuf, with the same results.
There does seem to be a text error in that blog at the kernel invocation line:
call saxpy<<<grid, tblock="">>>(x_d, y_d, a)
tblock="" is not correct. You'll notice elsewhere in that blog text, the kernel invocation line is given correctly as:
call saxpy<<<grid,tBlock>>>(x_d, y_d, a)
So if you change that line accordingly in your actual code, I think you'll have better results.
I've been working on a Fortran code which uses the cuBLAS batched LU and the cuSPARSE batched tridiagonal solver as part of a BiCG iterative solver with an ADI preconditioner. I'm using a Kepler K20X with compute capability 3.5 and CUDA 5.5. I'm doing this without PGI's CUDA Fortran, so I'm writing my own interfaces:
FUNCTION cublasDgetrfBatched(handle, n, dA, ldda, dP, dInfo, nbatch) BIND(C, NAME="cublasDgetrfBatched")
USE, INTRINSIC :: ISO_C_BINDING
INTEGER(KIND(CUBLAS_STATUS_SUCCESS)) :: cublasDgetrfBatched
TYPE(C_PTR), VALUE :: handle
INTEGER(C_INT), VALUE :: n
TYPE(C_PTR), VALUE :: dA
INTEGER(C_INT), VALUE :: ldda
TYPE(C_PTR), VALUE :: dP
TYPE(C_PTR), VALUE :: dInfo
INTEGER(C_INT), VALUE :: nbatch
END FUNCTION cublasDgetrfBatched
I allocate pinned memory on the host with cudaHostAlloc, allocate the device memory for the matrices and the device array containing the device pointers to the matrices, asynchronously copy each matrix to the device, perform the operations, and then asynchronously copy the decomposed matrix and pivots back to the host to perform the back-substitution with a single right-hand side:
REAL(8), POINTER, DIMENSION(:,:,:) :: A
INTEGER, DIMENSION(:,:), POINTER :: ipiv
TYPE(C_PTR) :: cPtr_A, cPtr_ipiv
TYPE(C_PTR), ALLOCATABLE, DIMENSION(:), TARGET :: dPtr_A
TYPE(C_PTR) :: dPtr_ipiv, dPtr_A_d, dPtr_info
INTEGER(C_SIZE_T) :: sizeof_A, sizeof_ipiv
...
stat = cudaHostAlloc(cPtr_A, sizeof_A, cudaHostAllocDefault)
CALL C_F_POINTER(cPtr_A, A, (/m,m,nbatch/))
stat = cudaHostAlloc(cPtr_ipiv, sizeof_ipiv, cudaHostAllocDefault)
CALL C_F_POINTER(cPtr_ipiv, ipiv, (/m,nbatch/))
ALLOCATE(dPtr_A(nbatch))
DO ibatch=1,nbatch
stat = cudaMalloc(dPtr_A(ibatch), m*m*sizeof_double)
END DO
stat = cudaMalloc(dPtr_A_d, nbatch*sizeof_cptr)
stat = cublasSetVector(nbatch, sizeof_cptr, C_LOC(dPtr_A(1)), 1, dPtr_A_d, 1)
stat = cudaMalloc(dPtr_ipiv, m*nbatch*sizeof_cint)
stat = cudaMalloc(dPtr_info, nbatch*sizeof_cint)
...
!$OMP PARALLEL DEFAULT(shared) PRIVATE( stat, ibatch )
!$OMP DO
DO ibatch = 1,nbatch
stat = cublasSetMatrixAsync(m, m, sizeof_double, C_LOC(A(1,1,ibatch)), m, dPtr_A(ibatch), m, mystream)
END DO
!$OMP END DO
!$OMP END PARALLEL
...
stat = cublasDgetrfBatched(cublas_handle, m, dPtr_A_d, m, dPtr_ipiv, dPtr_info, nbatch)
...
stat = cublasGetMatrixAsync(m, nbatch, sizeof_cint, dPtr_ipiv, m, C_LOC(ipiv(1,1)), m, mystream)
!$OMP PARALLEL DEFAULT(shared) PRIVATE( ibatch, stat )
!$OMP DO
DO ibatch = 1,nbatch
stat = cublasGetMatrixAsync(m, m, sizeof_double, dPtr_A(ibatch), m, C_LOC(A(1,1,ibatch)), m, mystream)
END DO
!$OMP END DO
!$OMP END PARALLEL
...
!$OMP PARALLEL DEFAULT(shared) PRIVATE( ibatch, x, stat )
!$OMP DO
DO ibatch = 1,nbatch
x = rhs(:,ibatch)
CALL dgetrs( 'N', m, 1, A(1,1,ibatch), m, ipiv(1,ibatch), x(1), m, info )
rhs(:,ibatch) = x
END DO
!$OMP END DO
!$OMP END PARALLEL
...
I'd rather not have to do this last step, but the cublasDtrsmBatched routine limits the matrix size to 32, and mine are size 80 (a batched Dtrsv would be better, but this doesn't exist). The cost of launching multiple individual cublasDtrsv kernels makes performing the back-sub on the device untenable.
There are other operations which I need to perform between calls to cublasDgetrfBatched and cusparseDgtsvStridedBatch. Most of these are currently being performed on the host with OpenMP being used to parallelize the loops at the batched level. Some of the operations, like matrix-vector multiplication for each of the matrices being decomposed for example, are being computed on the device with OpenACC:
!$ACC DATA COPYIN(A) COPYIN(x) COPYOUT(Ax)
...
!$ACC KERNELS
DO ibatch = 1,nbatch
DO i = 1,m
Ax(i,ibatch) = zero
END DO
DO j = 1,m
DO i = 1,m
Ax(i,ibatch) = Ax(i,ibatch) + A(i,j,ibatch)*x(j,ibatch)
END DO
END DO
END DO
!$ACC END KERNELS
...
!$ACC END DATA
I'd like to place more of the computation on the GPU with OpenACC, but to do so I need to be able to interface the two. Something like the following:
!$ACC DATA COPYIN(A) CREATE(info,A_d) COPYOUT(ipiv)
!$ACC HOST_DATA USE_DEVICE(A)
DO ibatch = 1,nbatch
A_d(ibatch) = acc_deviceptr(A(1,1,ibatch))
END DO
!$ACC END HOST_DATA
...
!$ACC HOST_DATA USE_DEVICE(ipiv,info)
stat = cublasDgetrfBatched(cublas_handle, m, A_d, m, ipiv, info, nbatch)
!$ACC END HOST_DATA
...
!$ACC END DATA
I know the host_data construct with the use_device clause would be appropriate in most cases, but since I need to actually pass to cuBLAS a device array containing the pointers to the matrices on the device, I'm not sure how to proceed.
Can anyone offer any insight?
Thanks
!! Put everything on the device
!$ACC DATA COPYIN(A) CREATE(info,A_d) COPYOUT(ipiv)
!! populate the device A_d array
!$ACC parallel loop
DO ibatch = 1,nbatch
A_d(ibatch) = A(1,1,ibatch)
END DO
!$ACC end parallel
...
!! send the device address of A_d to the device
!$ACC HOST_DATA USE_DEVICE(A_d,ipiv,info)
stat = cublasDgetrfBatched(cublas_handle, m, A_d, m, ipiv, info, nbatch)
!$ACC END HOST_DATA
...
!$ACC END DATA
or
!! Put everything but A_d on the device
!$ACC DATA COPYIN(A) CREATE(info) COPYOUT(ipiv)
!! populate the host A_d array
DO ibatch = 1,nbatch
A_d(ibatch) = acc_deviceptr( A(1,1,ibatch) )
END DO
!! copy A_d to the device
!$acc data copyin( A_d )
...
!! send the device address of A_d and others to the device
!$ACC HOST_DATA USE_DEVICE(A_d,ipiv,info)
stat = cublasDgetrfBatched(cublas_handle, m, A_d, m, ipiv, info, nbatch)
!$ACC END HOST_DATA
...
!$acc end data
!$ACC END DATA
I am having some difficulty setting up 2 GPUs for peer-to-peer communication.
I am using CUDA 4.0 and programming in Fortran with the PGI compiler.
I wrote a program which confirms I have 4 GPUs available on my node.
I decided to use two of them, but I am getting the following error:
0: DEALLOCATE: invalid device pointer.
subroutine directTransfer()
use cudafor
implicit none
integer, parameter :: N = 4*1024*1024
real, pinned, allocatable :: a(:), b(:)
real, device, allocatable :: a_d(:), b_d(:)
!these hold free and total memory before and after
!allocation, used to verify allocation happening on proper devices
integer (int_ptr_kind()),allocatable ::
& freeBefore(:), totalBefore(:),
& freeAfter(:), totalAfter(:)
integer :: istat, nDevices, i, accessPeer, timingDev
type(cudaDeviceProp)::prop
type(cudaEvent)::startEvent,stopEvent
real :: time
!allocate host arrays
allocate(a(N), b(N))
allocate(freeBefore(0:nDevices -1),
& totalBefore(0:nDevices -1))
allocate(freeAfter(0:nDevices -1),
& totalAfter(0:nDevices -1))
write(*,*) 'Start!'
!get device info (including total and free memory)
!before allocation
istat = cudaGetDeviceCount(nDevices)
if(nDevices < 2) then
write(*,*) 'Need at least two CUDA capable devices'
stop
end if
write(*,"('Number of CUDA-capable devices: ',
& i0, /)"),nDevices
do i = 0, nDevices - 1
istat = cudaGetDeviceProperties(prop, i)
istat = cudaSetDevice(i)
istat = cudaMemGetInfo(freeBefore(i), totalBefore(i))
end do
!!!Here is the trouble zone!!!!
istat = cudaSetDevice(0)
allocate(a_d(N))
istat = cudaSetDevice(1)
allocate(b_d(N))
deallocate(freeBefore, totalBefore,freeAfter,totalAfter)
deallocate(a,b,a_d,b_d)
end subroutine directTransfer
With the following I have no error:
istat = cudaSetDevice(0)
allocate(a_d(N))
!istat = cudaSetDevice(1)
!allocate(b_d(N))
With this, also no error:
!istat = cudaSetDevice(0)
!allocate(a_d(N))
istat = cudaSetDevice(1)
allocate(b_d(N))
But this returns an error:
istat = cudaSetDevice(0)
allocate(a_d(N))
istat = cudaSetDevice(1)
allocate(b_d(N))
So it seems I cannot set up 2 GPUs at the start of my program.
Could you help me understand why this is not possible, and give me a hint on how to solve it?
Thank you JackOLantern! That was the trick.
I changed the code as follows and it works perfectly:
!clean up
deallocate(freeBefore, totalBefore,freeAfter,totalAfter)
istat = cudaSetDevice(0)
deallocate(a_d)
istat = cudaSetDevice(1)
deallocate(b_d)
deallocate(a,b)
This was the answer to my problem. I hope it will help others.
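For completeness, here is a minimal sketch of the whole pattern, assuming two visible devices and the arrays from the subroutine above: each device array is allocated and deallocated while the device it lives on is current.
!sketch: a device allocatable is tied to the device that was current when
!it was allocated, so make that device current again before deallocating it
istat = cudaSetDevice(0)
allocate(a_d(N))  !a_d lives on device 0
istat = cudaSetDevice(1)
allocate(b_d(N))  !b_d lives on device 1
!... peer-to-peer work ...
istat = cudaSetDevice(0)
deallocate(a_d)
istat = cudaSetDevice(1)
deallocate(b_d)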