Cuda Fortran: Data copy from cpu to gpu [closed] - cuda

Closed 8 years ago.
I have a problem with copying data from host to device. I have arrays defined as
real, allocatable :: cpuArray(:,:,:)
real, device, allocatable :: gpuArray(:,:,:)
allocate(cpuArray(0:imax-1,0:jmax-1,0:kmax-1))
allocate(gpuArray(-1:imax,-1:jmax,-1:kmax))
! array initialization
cpuArray = randomValue ! non-zero value
gpuArray = 0.0 ! first zero all GPU array elements
gpuArray(0:imax-1,0:jmax-1,0:kmax-1) = cpuArray
I expect only the designated index range of gpuArray to receive data from the host, but it does not work.
Could you help me find what is wrong with this?
PS: I based my approach on this tutorial from the PGI home page.
--
When I set both the cpuArray and the gpuArray the same dimension,
I get exactly the correct result.
But the current setup produces 0 for all elements of gpuArray. I changed the default value to a non-zero one (i.e. gpuArray = 10.0 !first 10 all gpu array elements), but the result is still 0.
Best regards,
Adjeiinfo

My apologies to the whole community. I solved my problem: it was a silly bug I introduced in the test program. Instead of cpuArrray= cpuArray(0:imax-1,0:jmax-1,0:kmax-1) in the check program, I wrote cpuArrray= cpuArray. So the program itself was working correctly, but the program that checked the result was buggy.
Thank you for your follow up.
For your reference this is a part of the program (can be built and run)
module mytest
use cudafor
implicit none
integer :: imax , jmax, kmax
integer :: i,j,k
!host arrays
real,allocatable:: h_a(:,:,:)
real,allocatable:: h_b(:,:,:)
real,allocatable:: h_c(:,:,:)
!device array
real,device,allocatable:: d_b(:,:,:)
real,device,allocatable:: d_c(:,:,:)
real,device,allocatable:: d_b_copy(:,:,:)
real,device,allocatable:: d_c_copy(:,:,:)
contains
attributes(global) subroutine testdata()
integer :: d_i, d_j,d_k
d_i = (blockIdx%x-1) * blockDim%x + threadIdx%x-1
d_j = (blockIdx%y-1) * blockDim%y + threadIdx%y-1
do d_k = 0, 1
d_b_copy(d_i, d_j, d_k) = d_b(d_i, d_j, d_k)
d_c_copy(d_i, d_j, d_k) = d_c(d_i, d_j, d_k)
end do
end subroutine testdata
end module mytest
program Test
use mytest
type(dim3) :: dimGrid, dimBlock,dimGrid1, dimBlock1
imax = 32
jmax = 32
kmax = 2
dimGrid = dim3(1,1,1) ! one imax x jmax block already covers 0:imax-1, 0:jmax-1; more blocks would index out of bounds
dimBlock = dim3(imax,jmax,1)
allocate(h_a(0:imax-1,0:jmax-1,0:1))
allocate(h_b(0:imax-1,0:jmax-1,0:1))
allocate(h_c(0:imax-1,0:jmax-1,0:1))
!real,device,allocatable::d_c(:,:,:)
allocate(d_b(0:imax-1,0:jmax-1,0:1))
allocate(d_c(-1:imax,-1:jmax,-1:16))
allocate(d_b_copy(0:imax-1,0:jmax-1,0:1))
allocate(d_c_copy(-1:imax,-1:jmax,-1:1))
!array initialization
do k = 0,kmax-1
do j=0, jmax-1
do i = 0, imax-1
h_a(i,j,k) = i*0.1
end do
end do
end do
!data transfer (cpu to gpu)
d_b = h_a
d_c(0:imax-1,0:jmax-1,0:kmax-1)= h_a
call testdata<<<dimGrid,dimBlock>>>()
!copy back to cpu
h_b = d_b_copy(0:imax-1,0:jmax-1,0:kmax-1)
h_c = d_c_copy(0:imax-1,0:jmax-1,0:kmax-1)
!just for visual test
write(*,*) h_b
open(24,file='h_a.dat')
write(24,*) h_a
close(24)
open(24,file='d_b_copy.dat')
write(24,*) h_b
close(24)
open(24,file='d_c_copy.dat')
write(24,*) h_c
close(24)
end program Test
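The point of the fix above is that the check has to read back the same array section that was written. Here is a minimal host-only sketch in standard Fortran, so it runs without a GPU. The names inner, padded, back and fill_interior are hypothetical stand-ins, with padded playing the role of the larger device array; in CUDA Fortran the same section assignment would move the data between host and device.

```fortran
module section_copy_demo
  implicit none
contains
  ! Copy `inner` into the interior of `padded`, leaving the halo untouched.
  subroutine fill_interior(inner, padded, imax, jmax, kmax)
    integer, intent(in) :: imax, jmax, kmax
    real, intent(in)    :: inner(0:imax-1, 0:jmax-1, 0:kmax-1)
    real, intent(inout) :: padded(-1:imax, -1:jmax, -1:kmax)
    padded(0:imax-1, 0:jmax-1, 0:kmax-1) = inner
  end subroutine fill_interior
end module section_copy_demo

program check_section_copy
  use section_copy_demo
  implicit none
  integer, parameter :: imax = 4, jmax = 3, kmax = 2
  real, allocatable :: inner(:,:,:), padded(:,:,:), back(:,:,:)
  integer :: i

  allocate(inner(0:imax-1, 0:jmax-1, 0:kmax-1))
  allocate(back(0:imax-1, 0:jmax-1, 0:kmax-1))
  allocate(padded(-1:imax, -1:jmax, -1:kmax))

  ! Non-zero test data in the interior-sized array.
  inner = reshape([(real(i), i = 1, imax*jmax*kmax)], shape(inner))
  padded = 0.0
  call fill_interior(inner, padded, imax, jmax, kmax)

  ! Correct check: read back the same section that was written.
  back = padded(0:imax-1, 0:jmax-1, 0:kmax-1)
  print *, 'interior matches:', all(back == inner)
  ! The halo keeps its initial value.
  print *, 'halo still zero :', all(padded(-1, :, :) == 0.0)
end program check_section_copy
```

Comparing the whole padded array against the smaller one, instead of the section, is the kind of mistake that makes a correct copy look wrong.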

Related

Using matrices as arguments in functions and as output in subroutines in Fortran

I was trying to create a program that requires me to use matrices as input for functions and subroutines and also requires me to take matrix as subroutine output in Fortran. But, I've encountered multiple errors while doing so. I am not able to understand the source of these errors and hence how to fix them.
I'm confident of the logic but I seem to be making errors in dealing with the matrices.
Program to solve system of linear equations(Gauss elimination with partial pivoting)
Code:
program solving_equations
implicit none
real, allocatable :: a(:,:),interchanged(:,:)
real, allocatable :: x(:)
real addition,multiplying_term,alpha,maximum
integer i,j,row,rth_ele,f_maxfinder,k,n,s,inte
read(*,*)n
allocate( a( n,(n+1) ) )
allocate( x(n) )
allocate( interchanged( n,(n+1) ) )
do i=1,n
read(*,*)( a(i,j),j=1,(n+1) )
end do
do rth_ele= 1,(n-1)
row=f_maxfinder( a , n , rth_ele )
if (row==rth_ele) then
continue
else
call interchanger(a,rth_ele,row,n,interchanged)
a = interchanged
end if
do i= (rth_ele+1) , n
! once i is fixed, multiplying term is fixed too
multiplying_term=( a(i,rth_ele)/a(rth_ele,rth_ele) )
do j=1,(n+1)
a(i,j)=a(i,j)-a(rth_ele,j)*multiplying_term
end do
end do
end do
x(n)=a(n,n+1)/a(n,n)
do i=(n-1),1,-1
addition=0.0
do s=n , (i+1) , -1
addition=addition+a(i,s)*x(s)
end do
x(i)= ( ( a(i,n+1)- addition )/a(i,i) )
end do
do i=1,n
print*,x(i)
end do
endprogram solving_equations
!=================
function f_maxfinder(a,n,rth_ele)
integer inte,f_maxfinder
real maximum
maximum=a(rth_ele,rth_ele)
do inte=n,nint(rth_ele+1),-1
if( a(inte,rth_ele) > maximum ) then
maximum = a(inte,rth_ele)
f_maxfinder=inte
else
continue
end if
end do
end
subroutine interchanger( a,rth_ele,row,n,interchanged )
integer i
real alpha
real, allocatable :: interchanged(:,:)
allocate( interchanged( n,(n+1) ) )
do i=1,n+1
alpha=a(row,i)
a(row,i)=a(rth_ele,i)
a(rth_ele,i)=alpha
end do
do i=1,n
do j=1,(n+1)
interchanged(i,j)=a(i,j)
end do
end do
end
Errors:
row=f_maxfinder( a , n , rth_ele )
1
Warning: Rank mismatch in argument 'a' at (1) (scalar and rank-2)
a(row,i)=a(rth_ele,i)
Error: The function result on the lhs of the assignment at (1) must have the pointer attribute.
a(rth_ele,i)=alpha
Error: The function result on the lhs of the assignment at (1) must have the pointer attribute.
call interchanger(a,rth_ele,row,n,interchanged)
1
Error: Explicit interface required for 'interchanger' at (1): allocatable argument
Thanks!
You're missing a declaration of a as an array in f_maxfinder. implicit none is your friend - be sure to use it all the time.
interchanger has a dummy argument interchanged that is an allocatable, assumed-shape array. This requires that an explicit interface to interchanger be visible in the caller. (See my post https://stevelionel.com/drfortran/2012/01/05/doctor-fortran-gets-explicit-again/ for more on this.)
The interface issue could be solved by putting the subroutines in a module and adding a use of the module in the main program.
By the way, there's no need to make a allocatable in f_maxfinder, as you are not allocating or deallocating it. It is still an assumed-shape array so the explicit interface is still required.
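As a rough sketch of the module approach described above: procedures placed in a module get an explicit interface automatically, so assumed-shape dummies such as a(:,:) work without any extra declarations in the caller. The names pivot_tools, pivot_row and swap_rows are hypothetical and are not taken from the question; like the question's code, the pivot search compares signed values.

```fortran
module pivot_tools
  implicit none
contains
  ! Row index (>= col) holding the largest entry of column `col`,
  ! searching from the diagonal downwards. Signed comparison, as in the question.
  integer function pivot_row(a, col)
    real, intent(in)    :: a(:,:)
    integer, intent(in) :: col
    integer :: r
    pivot_row = col
    do r = col + 1, size(a, 1)
      if (a(r, col) > a(pivot_row, col)) pivot_row = r
    end do
  end function pivot_row

  ! Swap rows r1 and r2 of `a` in place.
  subroutine swap_rows(a, r1, r2)
    real, intent(inout) :: a(:,:)
    integer, intent(in) :: r1, r2
    real :: tmp(size(a, 2))
    tmp = a(r1, :)
    a(r1, :) = a(r2, :)
    a(r2, :) = tmp
  end subroutine swap_rows
end module pivot_tools

program use_pivot_tools
  use pivot_tools   ! makes the explicit interfaces visible
  implicit none
  real :: a(3, 4)
  integer :: row
  a(1, :) = [4.0, 3.0, 6.0, 3.0]
  a(2, :) = [7.0, 4.0, 6.0, 7.0]
  a(3, :) = [4.0, 4.0, 2.0, 0.0]
  row = pivot_row(a, 1)
  if (row /= 1) call swap_rows(a, 1, row)
  print *, 'pivot row:', row
  print *, 'first row after swap:', a(1, :)
end program use_pivot_tools
```

Swapping in place also removes the need for a separate interchanged array, though that is a design choice rather than something the interface rules require.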
Here is a working example taking into account @SteveLionel's advice and the following comments:
Always use implicit none, at least once in the main program, and don't forget to enable compiler warnings (for example -Wall with gfortran or -warn all with Intel Fortran).
Either use a module for functions and subroutines, then add use <module> to the main program, or simply use contains and include them inside the main program as I did below.
The interchanged array is already allocated in the main program, so you don't need to re-allocate it in the interchanger subroutine; just pass it as an assumed-shape array.
Remove unused variables; alpha, maximum, k, inte.
Define a in f_maxfinder function.
Function type is better written in front of the function name for readability; see your definition of f_maxfinder and don't declare the function again in main program, unless you're using an explicit interface.
The nint procedure accepts real input, you don't need it here.
Finally add any missing variable declarations in your function/subroutine.
program solving_equations
implicit none
real, allocatable :: a(:,:), interchanged(:,:), x(:)
real :: addition, multiplying_term
integer :: i, j, row, rth_ele, n, s
read (*,*) n
allocate ( a( n,(n+1) ) )
allocate ( x( n ) )
allocate ( interchanged( n,(n+1) ) )
do i = 1,n
do j = 1,(n+1)
read (*,*) a(i,j)
end do
end do
do rth_ele = 1,(n-1)
row = f_maxfinder( a , n , rth_ele )
if (row == rth_ele) then
continue
else
call interchanger(a, rth_ele, row, n, interchanged)
a = interchanged
end if
do i = (rth_ele+1) , n
! once i is fixed, multiplying term is fixed too
multiplying_term = a(i,rth_ele) / a(rth_ele,rth_ele)
do j = 1,(n+1)
a(i,j) = a(i,j) - a(rth_ele,j) * multiplying_term
end do
end do
end do
x(n) = a(n,n+1) / a(n,n)
do i = (n-1),1,-1
addition = 0.0
do s = n,(i+1),-1
addition = addition + a(i,s) * x(s)
end do
x(i)= (a(i,n+1) - addition) / a(i,i)
end do
do i = 1,n
print *, x(i)
end do
contains
integer function f_maxfinder(a, n, rth_ele)
integer :: n, rth_ele, inte
real :: maximum, a(:,:)
! default to the current row so the result is defined even when no larger entry is found
f_maxfinder = rth_ele
maximum = a(rth_ele,rth_ele)
do inte = n,rth_ele+1,-1
if (a(inte,rth_ele) > maximum) then
maximum = a(inte,rth_ele)
f_maxfinder = inte
end if
end do
end
subroutine interchanger( a, rth_ele, row, n, interchanged )
integer :: i, j, rth_ele, row, n
real :: alpha, a(:,:), interchanged(:,:)
do i = 1,n+1
alpha = a(row,i)
a(row,i) = a(rth_ele,i)
a(rth_ele,i) = alpha
end do
do i = 1,n
do j = 1,(n+1)
interchanged(i,j) = a(i,j)
end do
end do
end
end program solving_equations
Entering a sample 3-by-4 array, you get the following output (check the results, you know your algorithm):
3
4
3
6
3
7
4
6
7
4
4
2
0
2.05263186
-2.15789509
0.210526198
Process returned 0 (0x0) execution time : 1.051 s
Press any key to continue.

why the result is different between cpu and gpu?

This is my code running on GPU
tid=threadidx%x
bid=blockidx%x
bdim=blockdim%x
isec = mesh_sec_1(lev)+bid-1
if (isec .le. mesh_sec_0(lev)) then
if(.not. sec_is_int(isec)) return
do iele = tid, sec_n_ele(isec), bdim
idx = n_ele_idx(isec)+iele
u(1:5) = fv_u(1:5,idx)
u(6 ) = fv_t( idx)
g = 0.0d0
do j= sec_iA_ls(idx), sec_iA_ls(idx+1)-1
ss = sec_jA_ls(1,j)
ee = sec_jA_ls(2,j)
tem = n_ele_idx(ss)+ee
du(1:5) = fv_u(1:5, n_ele_idx(ss)+ee)-u(1:5)
du(6 ) = fv_t( n_ele_idx(ss)+ee)-u(6 )
coe(1:3) = sec_coe_ls(1:3,j)
do k=1,6
g(1:3,k)=g(1:3,k)+du(k)*sec_coe_ls(1:3,j)
end do
end do
do j=1,6
do i=1,3
fv_gra(i+(j-1)*3,idx)=g(i,j)
end do
end do
end do
end if
and next is my code running on CPU
do isec = h_mesh_sec_1(lev),h_mesh_sec_0(lev)
if(.not. h_sec_is_int(isec)) cycle
do iele=1,h_sec_n_ele(isec)
idx = h_n_ele_idx(isec)+iele
u(1:5) = h_fv_u(1:5,idx)
u(6 ) = h_fv_t( idx)
g = 0.0d0
do j= h_sec_iA_ls(idx),h_sec_iA_ls(idx+1)-1
ss = h_sec_jA_ls(1,j)
ee = h_sec_jA_ls(2,j)
du(1:5) = h_fv_u(1:5,h_n_ele_idx(ss)+ee)-u(1:5)
du(6 ) = h_fv_t( h_n_ele_idx(ss)+ee)-u(6 )
do k=1,6
g(1:3,k)= g(1:3,k) + du(k)*h_sec_coe_ls(1:3,j)
end do
end do
do j=1,6
do i=1,3
h_fv_gra(i+(j-1)*3,idx) = g(i,j)
enddo
enddo
end do
end do
Variables prefixed with h_ live on the CPU; the same names without the prefix live on the GPU.
The results agree at many points, but at some points they differ slightly. I added check code like this:
do i =1,size(h_fv_gra,1)
do j = 1,size(h_fv_gra,2)
if(hd_fv_gra(i,j)-h_fv_gra(i,j) .ge. 1.0d-9) then
print *,hd_fv_gra(i,j)-h_fv_gra(i,j),i,j
end if
end do
end do
The hd_* array is a host copy of the GPU result. We can see the difference:
1.8626451492309570E-009 13 14306
1.8626451492309570E-009 13 14465
1.8626451492309570E-009 13 14472
1.8626451492309570E-009 14 14128
1.8626451492309570E-009 14 14146
1.8626451492309570E-009 14 14150
1.8626451492309570E-009 14 14153
1.8626451492309570E-009 14 14155
1.8626451492309570E-009 14 14156
So I am confused about that. The error in CUDA's precision should not be as large as this. Any reply is welcome.
In addition, I don't know how to print variables in GPU code, which would help me debug.
In your code, calculation of g value most probably benefits from Fused Multiply Add (fma) optimization in CUDA.
g(1:3,k)=g(1:3,k)+du(k)*sec_coe_ls(1:3,j)
On the CPU side, this is not impossible but strongly depends on compiler choices (and the actual CPU running the code should it implement fma).
To enforce usage of separate multiply and add, you want to use intrinsics from CUDA, as defined here, such as :
__device__ double __dadd_rn ( double x, double y ) Add two floating point values in round-to-nearest-even mode.
and
__device__ double __dmul_rn ( double x, double y ) Multiply two floating point values in round-to-nearest-even mode.
with a rounding mode identical to the one defined on CPU (it depends on the CPU architecture, whether it be Power or Intel x86 or other).
An alternative approach is to pass the --fmad false option to ptxas when compiling CUDA code, using the --ptxas-options option of nvcc detailed here.
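It may also help to put the size of the reported difference in context. The value 1.8626451492309570E-009 is exactly 2**(-29), which is the spacing between adjacent double-precision numbers whose magnitude lies between 2**23 and 2**24 (roughly 8.4e6 to 1.7e7). If the gradient entries are of that size, the CPU and GPU results differ by a single unit in the last place, which is what one differently rounded operation such as an FMA would produce. The sketch below checks this in standard Fortran. The magnitude x is an assumption, since the question does not show the values themselves, and ulps_apart is a hypothetical helper name.

```fortran
module ulp_tools
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
contains
  ! Distance between a and b in units of the last place of the larger one.
  pure function ulps_apart(a, b) result(n)
    real(real64), intent(in) :: a, b
    real(real64) :: n
    n = abs(a - b) / spacing(max(abs(a), abs(b)))
  end function ulps_apart
end module ulp_tools

program ulp_check
  use ulp_tools
  implicit none
  ! Difference printed in the question.
  real(real64), parameter :: diff = 1.8626451492309570e-9_real64
  ! ASSUMPTION: a typical magnitude for the affected entries.
  real(real64), parameter :: x = 1.0e7_real64

  print *, 'reported difference      :', diff
  print *, '2**(-29)                 :', 2.0_real64**(-29)
  print *, 'spacing(x)               :', spacing(x)
  print *, 'ulps between x and x+diff:', ulps_apart(x, x + diff)
end program ulp_check
```

A check based on ulps_apart, or on a relative tolerance, is usually more informative than the absolute threshold 1.0d-9 used in the question, because it scales with the size of the values being compared.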

Double-precision error using Dislin

I get the following error when trying to compile:
call qplot (Z, B, m + 1)
1
Error: Type mismatch in argument 'x' at (1); passed REAL(8) to REAL(4)
Everything seems to be in double precision so I can't help but think it is a Dislin error, especially considering that it appears with reference to a Dislin statement. What am I doing wrong? My code is the following:
program test
use dislin
integer :: i
integer, parameter :: n = 2
integer, parameter :: m = 5000
real (kind = 8) :: X(n + 1), Z(0:m), B(0:m)
X(1) = 1.D0
X(2) = 0.D0
X(3) = 2.D0
do i = 0, m
Z(i) = -1.D0 + (2.D0*i) / m
B(i) = f(Z(i))
end do
call qplot (Z, B, m + 1)
read(*,*)
contains
real (kind = 8) function f(t)
implicit none
real (kind = 8), intent(in) :: t
real (kind = 8), parameter :: pi = Atan(1.D0)*4.D0
f = cos(pi*t)
end function f
end program
From the DISLIN manual I read that qplot requires (single precision) floats:
QPLOT connects data points with lines.
The call is: CALL QPLOT (XRAY, YRAY, N) level 0, 1
or: void qplot (const float *xray, const float *yray, int n);
XRAY, YRAY are arrays that contain X- and Y-coordinates.
N is the number of data points.
So you need to convert Z and B to real:
call qplot (real(Z), real(B), m + 1)
Instead of using fixed numbers for the kind of numbers (which vary between compilers), please consider using the ISO_Fortran_env module and the pre-defined constants REAL32 and REAL64.
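A minimal sketch of that suggestion, without the DISLIN call so that it builds without the library: the data is declared with REAL64 and converted explicitly to REAL32 before it would be handed to a single-precision routine. The array name z_single is hypothetical.

```fortran
program kind_conversion
  use, intrinsic :: iso_fortran_env, only: real32, real64
  implicit none
  integer, parameter :: m = 4
  real(real64) :: z(0:m)          ! working data in double precision
  real(real32) :: z_single(0:m)   ! copy for a single-precision interface
  integer :: i

  do i = 0, m
    z(i) = -1.0_real64 + (2.0_real64 * i) / m
  end do

  ! Explicit conversion; real(z) without a kind argument converts to
  ! default real, which is single precision on most compilers.
  z_single = real(z, kind=real32)

  print *, 'kind of z       :', kind(z)
  print *, 'kind of z_single:', kind(z_single)
  print *, z_single
end program kind_conversion
```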
The qplot routine requires default real arguments. You can convert your data:
call qplot(real(Z), real(B), m + 1)
I second the remark about kind = 8; it is very ugly. If you insist on 8, at least declare a constant
integer, parameter :: rp = 8
and use
real(rp) ::
As the first two answers explain, the standard versions of the dislin routines require single precision arguments. I find it most convenient to use these since I may have single or double arguments, using the real technique to convert the type of double variables. It seems unlikely that the lost precision will be perceptible on a graph. However, if you wish to work exclusively in double precision, there is an alternative set of routines. They have the same names, but take double precision arguments. To obtain them, link in the library "dislin_d".

Cuda illegal memory access error when using array indexes stored in another array

I'm using cuda fortran and I've been struggling with this problem in one simple kernel and I couldn't find the solution.
Isn't it possible to use integer values stored in an array as the indexes for another array?
Here's a simple example (edited to include also the main program):
program test
use cudafor
integer:: ncell, i
integer, allocatable:: values(:)
integer, allocatable, device :: values_d(:)
ncell = 10
allocate(values(ncell), values_d(ncell))
do i=1,ncell
values(i) = i
enddo
values_d = values
call multipleindices_kernel<<< ncell/1024+1,1024 >>> (values_d,
+ ncell)
values = values_d
write (*,*) values
end program test
!////////////////////////////////////////////////////
attributes(global) subroutine multipleindices_kernel(valu, ncell)
use cudafor
implicit none
integer, value:: ncell ! ncell = 10
integer :: valu(ncell)
integer :: tempind(10)
integer:: i
tempind(1)=10
tempind(2)=3
tempind(3)=5
tempind(4)=7
tempind(5)=9
tempind(6)=2
tempind(7)=4
tempind(8)=6
tempind(9)=8
tempind(10)=1
i = (blockidx%x - 1 ) * blockdim%x + threadidx%x
if (i .LE. ncell) then
valu(tempind(i))= 1
endif
end subroutine
I understand that if there were repeated values in the tempind array, different threads could access the same memory location for reading or writing, but that is not the case here.
Even so, this gives the error "0: copyout Memcpy (host=0x303610, dev=0x3e20000, size=40) FAILED: 77(an illegal memory access was encountered)".
Does anyone know if it is possible to use indexes coming from another array in CUDA?
After some additional tests, I noticed that the problem occurs not while running the kernel itself, but on the transfer of the data back to the CPU (if I remove "values = values_d", no error is displayed). Also, if I replace valu(tempind(i)) with valu(i) in the kernel, it works fine, but I want the indexes to come from an array, since the purpose of this test is to parallelize a CFD code where the indexes are stored like that.
The problem appears to be that the generated executable doesn't pass the variable ncell to the kernel correctly. Running the application through cuda-memcheck shows that threads outside the range 1-10 are passing through the branch statement, and adding a print statement to print ncell inside the kernel also gives strange answers.
It used to be a requirement that all attributes(global) subroutines had to reside within a module. This requirement seems to have been relaxed in more recent versions of CUDA Fortran (I cannot find references to it in the programming guide). I believe having the kernel outside of a module is what causes the error here. By placing multipleindices_kernel within a module and using that module in test, I consistently get correct answers with no errors. The code for this is below:
module testmod
contains
attributes(global) subroutine multipleindices_kernel(valu, ncell)
use cudafor
implicit none
integer, value:: ncell ! ncell = 10
integer :: valu(ncell)
integer :: tempind(10)
integer:: i
tempind(1)=10
tempind(2)=3
tempind(3)=5
tempind(4)=7
tempind(5)=9
tempind(6)=2
tempind(7)=4
tempind(8)=6
tempind(9)=8
tempind(10)=1
i = (blockidx%x - 1 ) * blockdim%x + threadidx%x
if (i .LE. ncell) then
valu(tempind(i))= 1
endif
end subroutine
end module testmod
program test
use cudafor
use testmod
integer:: ncell, i
integer, allocatable:: values(:)
integer, allocatable, device :: values_d(:)
ncell = 10
allocate(values(ncell), values_d(ncell))
do i=1,ncell
values(i) = i
enddo
values_d = values
call multipleindices_kernel<<< ncell/1024+1,1024 >>> (values_d, ncell)
values = values_d
write (*,*) values
end program test
!////////////////////////////////////////////////////

Optimizing CUDA FDTD Fortran

I am trying to optimize this FDTD code with CUDA Fortran. I have three 3-D cubic arrays each for input, output and constants.
attributes (global) subroutine kernel_h(k,num_cells_x,num_cells_y,num_cells_z,Hx,Hy,Hz,Ex,Ey,Ez,Cbdx,Cbdy,Cbdz)
implicit none
integer :: idx,idy
integer,value :: k,num_cells_x,num_cells_y,num_cells_z
real(kind=8), intent(in), dimension(1:num_cells_x,1:num_cells_y,1:num_cells_z) :: Ex, Ey, Ez
real(kind=8), intent(inout), dimension(1:num_cells_x,1:num_cells_y,1:num_cells_z) :: Hx, Hy, Hz
real(kind=8), intent(in), constant, dimension(1:num_cells_x,1:num_cells_y,1:num_cells_z) :: Cbdx,Cbdy,Cbdz
idx = threadIdx%x + ((blockIdx%x-1) * blockDim%x)
idy = threadIdx%y + ((blockIdx%y-1) * blockDim%y)
do while (idx < num_cells_x)
Hz(idx,idy,k) = Hz(idx,idy,k) + ((Ex(idx,idy+1,k)-Ex(idx,idy,k))*Cbdy(idx,idy,k) + (Ey(idx,idy,k)-Ey(idx+1,idy,k))*Cbdx(idx,idy,k))
Hx(idx,idy,k) = Hx(idx,idy,k) + ((Ey(idx,idy,k+1)-Ey(idx,idy,k))*Cbdz(idx,idy,k) + (Ez(idx,idy,k)-Ez(idx,idy+1,k))*Cbdy(idx,idy,k))
Hy(idx,idy,k) = Hy(idx,idy,k) + ((Ez(idx+1,idy,k)-Ez(idx,idy,k))*Cbdx(idx,idy,k) + (Ex(idx,idy,k)-Ex(idx,idy,k+1))*Cbdz(idx,idy,k))
idx = idx + (blockDim%x * gridDim%x)
idy = idy + (blockDim%y * gridDim%y)
end do
end subroutine kernel_h
and my kernel launch is:
bdim=dim3(16,16,1)
gdim=dim3((num_cells_x+(bdim%x-1))/bdim%x,(num_cells_y+(bdim%y-1))/bdim%y,1)
do k=1,num_cells_z
call kernel_h<<<gdim,bdim>>>(k,num_cells_x,num_cells_y,num_cells_z,Hx_d,Hy_d,Hz_d,Ex_d,Ey_d,Ez_d,Cbdx_d,Cbdy_d,Cbdz_d)
end do
My questions are: why can't I load a matrix larger than 100x100x100? If I try, I get a kernel launch failure. And can I improve my code's performance? I think it could be written in a better way.
I would guess that you are accessing out of bounds.
Consider a 10x10x10 volume (x,y,z). In that case you will launch a single block of 16x16 threads. These threads will access a 17x17 slice (since stencil radius is 1) which is clearly going to end up out of bounds. You would need to disable those threads that will access out of bounds and also disable those threads that will reach beyond the boundary to apply their stencil.
Consider looking at the FDTD3D sample in the CUDA SDK. Granted it's in C but it illustrates how to handle this problem and it also shows how to use shared memory to have a much more efficient implementation.
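To illustrate the bounds issue, here is a host-only sketch in standard Fortran that emulates one 16x16 block on the 10x10x10 volume and counts the threads that may safely update. The kernel in the question reads idx+1, idy+1 and k+1, so a cell can only be updated when all three neighbours exist. The names stencil_guard and can_update are hypothetical; in the CUDA Fortran kernel the same condition would wrap the three update statements.

```fortran
module stencil_guard
  implicit none
contains
  ! True when cell (idx, idy, k) and its +1 neighbours in x, y and z
  ! all lie inside a 1-based nx x ny x nz grid.
  pure logical function can_update(idx, idy, k, nx, ny, nz)
    integer, intent(in) :: idx, idy, k, nx, ny, nz
    can_update = idx >= 1 .and. idx <= nx - 1 .and. &
                 idy >= 1 .and. idy <= ny - 1 .and. &
                 k   >= 1 .and. k   <= nz - 1
  end function can_update
end module stencil_guard

program count_active_threads
  use stencil_guard
  implicit none
  integer, parameter :: nx = 10, ny = 10, nz = 10
  integer, parameter :: bx = 16, by = 16   ! block size from the question
  integer :: tx, ty, active

  active = 0
  ! Emulate the single 16x16 block (blockIdx = 1) working on slice k = 1.
  do ty = 1, by
    do tx = 1, bx
      if (can_update(tx, ty, 1, nx, ny, nz)) active = active + 1
    end do
  end do

  print *, 'threads launched         :', bx * by
  print *, 'threads allowed to update:', active
end program count_active_threads
```

The same condition also rejects k = num_cells_z, which the launch loop in the question reaches on its last iteration, where the k+1 accesses fall outside the arrays.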