CUDA Fortran speed-up

I am trying to evaluate the speed-up of a simple CUDA Fortran code that increments an array.
CPU version:
module simpleOps_m
contains
subroutine increment (a, b)
implicit none
integer , intent ( inout ) :: a(:)
integer , intent (in) :: b
integer :: i, n
n = size (a)
do i = 1, n
a(i) = a(i)+b
enddo
end subroutine increment
end module simpleOps_m
program incrementTest
use simpleOps_m
implicit none
integer , parameter :: n = 1024*1024*100
integer :: a(n), b
a = 1
b = 3
call increment (a, b)
if ( any(a /= 4)) then
write (* ,*) '**** Program Failed **** '
else
write (* ,*) 'Program Passed '
endif
end program incrementTest
GPU version:
module simpleOps_m
contains
attributes ( global ) subroutine increment (a, b)
implicit none
integer , intent ( inout ) :: a(:)
integer , value :: b
integer :: i, n
n = size (a)
do i=blockDim %x*( blockIdx %x -1) + threadIdx %x ,n, BlockDim %x* GridDim %x
a(i) = a(i)+b
end do
end subroutine increment
end module simpleOps_m
program incrementTest
use cudafor
use simpleOps_m
implicit none
integer , parameter :: n = 1024*1024*100
integer :: a(n), b
integer , device :: a_d(n)
integer :: tPB = 256
a = 1
b = 3
a_d = a
call increment <<< 128,tPB >>>(a_d , b)
a = a_d
if ( any(a /= 4)) then
write (* ,*) '**** Program Failed **** '
else
write (* ,*) 'Program Passed '
endif
end program incrementTest
So I compile both versions with pgf90
http://www.pgroup.com/resources/cudafortran.htm
Using "time" command to evaluate execution time, I obtain:
for CPU version
$ time (cpu executable)
real 0m0.715s
user 0m0.410s
sys 0m0.300s
for GPU version
$ time (gpu executable)
real 0m1.057s
user 0m0.710s
sys 0m0.340s
So the speed-up = (CPU execution time)/(GPU execution time) is < 1.
Is there a reason why the speed-up is not > 1, as one would expect?
Thanks in advance

The problem here is that in this rather contrived example, the cost of initialising the large array on the host (a=1) is almost the same as the cost of the loop that increments the array contents, which is the only part of the code being parallelised on the GPU. Because the total amount of parallel work is about the same as the total amount of serial work, Amdahl's Law is heavily stacked against achieving any significant speed-up by parallelising only part of the code on the GPU.
A more significant speed-up could probably be achieved by fusing the initialisation and increment operations into a single parallel operation on the GPU.
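To put rough numbers on the Amdahl's Law argument, here is a minimal sketch in Python. The function name amdahl_speedup is made up for illustration, and the parallel fraction of 0.5 is an assumption based on the statement above that initialisation and increment cost about the same; it is not a measured value.

```python
def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Overall speed-up when only `parallel_fraction` of the run time
    is accelerated by a factor of `parallel_speedup` (Amdahl's Law)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


# Assumed: about half of the run time (the increment loop) is parallelised.
for s in (10, 100, float("inf")):
    print(s, amdahl_speedup(0.5, s))
```

Under that assumption the overall speed-up can never exceed 2, however fast the GPU kernel is, and that is before host-to-device transfers are counted at all.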
[This answer has been assembled from comments and added as a community wiki entry to get this question off the unanswered list for the CUDA tag]

Related

Need help understanding how this Haskell code works

I am trying to learn the Haskell programming language by figuring out some pieces of code.
I have these 2 small functions but I have no idea how to test them in ghci.
What parameters should I use when calling these functions?
total :: (Integer -> Integer) -> Integer -> Integer
total function count = foldr(\x count -> function x + count) 0 [0..count]
The function above is supposed to return, for a given value n, f 0 + f 1 + ... + f n.
However when calling the function I don't understand what to put in the f part. n is just an integer, but what is f supposed to be?
iter :: Int -> (a -> a) -> (a -> a)
iter n f
  | n > 0 = f . iter (n-1) f
  | otherwise = id
iter' :: Int -> (a -> a) -> (a -> a)
iter' n = foldr (.) id . replicate n
This function is supposed to compose the given function f :: a -> a with itself n :: Int times, e.g., iter 2 f = f . f.
Once again when calling the function I don't understand what to put instead of f as a parameter.
To your first question: you can use any value for f such that
f 0 + f 1 + ... + f n
indeed makes sense. That is, any function that accepts an Integer argument and returns an Integer value, like (1 +), abs, signum, error "error", (\x -> x^3-x^2+5*x-2), etc. In ghci, for instance, total (1 +) 3 evaluates to 1 + 2 + 3 + 4 = 10.
"Makes sense" here means that the resulting expression has a type ("typechecks", in the vernacular), not that it would run without causing an error.
To your second question: any function that returns the same type of value as its argument, like (1+), (2/), etc. For instance, iter 3 (1+) 0 evaluates to 3.
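For readers more used to other languages, the same idea of passing a function as an argument can be sketched in Python. This is only an analogue, not a translation of the Haskell definitions: the names total and iterate_n are chosen for illustration, and Python does not check the argument types the way Haskell does.

```python
def total(f, n):
    """Return f(0) + f(1) + ... + f(n)."""
    return sum(f(x) for x in range(n + 1))


def iterate_n(n, f):
    """Return a function that applies f to its argument n times."""
    def composed(x):
        for _ in range(max(n, 0)):
            x = f(x)
        return x
    return composed


print(total(abs, 3))                                    # 0 + 1 + 2 + 3
print(total(lambda x: x**3 - x**2 + 5 * x - 2, 2))      # the polynomial from the answer
print(iterate_n(2, lambda x: 2 / x)(4.0))               # 2 / (2 / 4.0)
```

In each call the first or second argument is itself a function, which is exactly the role f plays in the Haskell versions.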

Count number of odd digits in Integer Haskell

I'm trying to write a program which counts the number of odd digits in an integer using Haskell. I have run into a problem with checking longer integers. My program looks like this at the moment:
oddDigits :: Integer -> Int
x = 0
oddDigits i
  | i `elem` [1,3,5,7,9] = x + 1
  | otherwise = x + 0
If my integer is, for example, 22334455, my program should return 4, because there are 4 odd digits in that integer. How can I check all digits in that integer? Currently it only checks the first digit and returns 1 or 0. I'm still pretty new to Haskell.
You can first convert the integer 22334455 to the string "22334455" (a list of characters), then keep all the characters that satisfy the requirement and count them.
import Data.List(intersect)
oddDigits = length . (`intersect` "13579") . show
In order to solve such problems, you typically split this up into smaller problems. A typical pipeline would be:
split the number in a list of digits;
filter the digits that are odd; and
count the length of the resulting list.
You thus can here implement/use helper functions. For example we can generate a list of digits with:
digits' :: Integral i => i -> [i]
digits' 0 = []
digits' n = r : digits' q
  where (q, r) = quotRem n 10
Here the digits will be produced in reverse order, but since that does not influence the number of odd digits, that is not a problem. I leave the other helper functions as an exercise.
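The three-step pipeline described above (split into digits, keep the odd ones, count them) can be sketched as follows. The sketch is in Python rather than Haskell so that the Haskell exercise stays open; the names digits and odd_digits are made up, and taking the absolute value for negative input is an assumption of this sketch.

```python
def digits(n):
    """Split an integer into its decimal digits, least significant first."""
    n = abs(n)  # assumption: the sign is ignored
    out = []
    while n:
        n, r = divmod(n, 10)
        out.append(r)
    return out


def odd_digits(n):
    """Count the odd decimal digits of n."""
    odd = [d for d in digits(n) if d % 2 == 1]  # step 2: filter
    return len(odd)                             # step 3: count


print(odd_digits(22334455))
```

As in digits', the digits come out in reverse order and 0 yields an empty list, neither of which affects the count.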
Here's an efficient way to do that:
oddDigits :: Integer -> Int
oddDigits = go 0
  where
    go :: Int -> Integer -> Int
    go s 0 = s
    go s n = s `seq` go (s + fromInteger r `mod` 2) q
      where (q, r) = n `quotRem` 10
This is tail-recursive, doesn't accumulate thunks, and doesn't build unnecessary lists or other structures that will need to be garbage collected. It also handles negative numbers correctly.

Fortran: Calling other functions in a function

I wrote the GNU Fortran code in two separate files on Code::Blocks: main.f95, example.f95. main.f95 content:
program testing
use example
implicit none
integer :: a, b
write(*,"(a)", advance="no") "Enter first number: "
read(*,*) a
write(*,"(a)", advance="no") "Enter second number: "
read(*,*) b
write(*,*) factorial(a)
write(*,*) permutation(a, b)
write(*,*) combination(a, b)
end program testing
example.f95 content:
module example
contains
integer function factorial(x)
implicit none
integer, intent(in) :: x
integer :: product_ = 1, i
if (x < 1) then
factorial = -1
else if (x == 0 .or. x == 1) then
factorial = 1
else
do i = 2, x
product_ = product_ * i
end do
factorial = product_
end if
end function factorial
real function permutation(x, y)
implicit none
integer, intent(in) :: x, y
permutation = factorial(x) / factorial(x - y)
end function permutation
real function combination(x, y)
implicit none
integer, intent(in) :: x, y
combination = permutation(x, y) / factorial(y)
end function combination
end module example
When I run this code, the output is:
Enter first number: 5
Enter second number: 3
120
0.00000000
0.00000000
The permutation and combination functions don't work properly. Thanks for answers.
I think you've fallen foul of one of Fortran's well-known (to those who know it) gotchas. But before revealing that, I have to ask how much testing you did. I ran your code, got the odd result and thought for a minute ...
then I tested the factorial function for a few small values of x which produced
factorial 1 = 1
factorial 2 = 2
factorial 3 = 12
factorial 4 = 288
factorial 5 = 34560
factorial 6 = 24883200
factorial 7 = 857276416
factorial 8 = -511705088
factorial 9 = 1073741824
factorial 10 = 0
which is obviously wrong. So it seems that you didn't test your code properly, if at all, before asking for help. (I didn't test your combination and permutation functions.)
O tempora, o mores
You've initialised the variable product_ in the line
integer :: product_ = 1, i
and this automatically gives product_ the save attribute, so its value persists from one invocation to the next (gotcha!). At the start of each call (other than the first) product_ has the value it had at the end of the previous call.
The remedy is simple: don't initialise product_ in its declaration; assign it in an executable statement instead. Change
integer :: product_ = 1, i
to
integer :: product_ , i
...
product_ = 1
Simpler still would be not to write your own factorial function at all, but to use the intrinsic product function; but that's another story.
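The wrong values listed above can be reproduced with a small simulation of the saved variable. This is a sketch in Python, not Fortran: the closure variable product stands in for the implicitly saved product_, and to_int32 assumes that the default integer is 32 bits wide and wraps on overflow. That wrapping is what the output above suggests, but it is not something the Fortran standard guarantees. The function names are made up.

```python
def to_int32(v):
    """Wrap v to a signed 32-bit integer (assumed overflow behaviour)."""
    v &= 0xFFFFFFFF
    return v - 0x100000000 if v >= 0x80000000 else v


def make_buggy_factorial():
    product = 1  # set once, like `integer :: product_ = 1` (implicit save)

    def factorial(x):
        nonlocal product
        if x < 1:
            return -1
        if x == 1:
            return 1
        for i in range(2, x + 1):
            product = to_int32(product * i)  # keeps growing across calls
        return product

    return factorial


def fixed_factorial(x):
    if x < 1:
        return -1
    product = 1  # reset on every call, like the executable `product_ = 1`
    for i in range(2, x + 1):
        product = to_int32(product * i)
    return product


buggy = make_buggy_factorial()
print([buggy(x) for x in range(1, 11)])
print([fixed_factorial(x) for x in range(1, 11)])
```

The first list matches the table of wrong results; the second gives the expected factorials for this range, since 10! still fits in 32 bits.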

Return an array from a function and store it in the main program

Here is the Main Program:
PROGRAM integration
EXTERNAL funct
DOUBLE PRECISION funct, a , b, sum, h
INTEGER n, i
REAL s
PARAMETER (a = 0, b = 10, n = 200)
h = (b-a)/n
sum = 0.0
DO i = 1, n
sum = sum+funct(i*h+a)
END DO
sum = h*(sum-0.5*(funct(a)+funct(b)))
PRINT *,sum
CONTAINS
END
And below is the Function funct(x)
DOUBLE PRECISION FUNCTION funct(x)
IMPLICIT NONE
DOUBLE PRECISION x
INTEGER K
Do k = 1,10
funct = x ** 2 * k
End Do
PRINT *, 'Value of funct is', funct
RETURN
END
I would like 'sum' in the main program to print 10 different sums, one for each of the 10 values of k in the function funct(x).
I have tried the above program, but it only uses the last value of funct() instead of producing 10 different values in sum.
Array results require an explicit interface. You would also need to adjust funct and sum to actually be arrays, using the dimension statement. An explicit interface requires Fortran 90 or later (thanks to @francescalus and @VladimirF for the hints) and is quite tedious to write by hand:
PROGRAM integration
INTERFACE funct
FUNCTION funct(x) result(r)
IMPLICIT NONE
DOUBLE PRECISION r
DIMENSION r( 10 )
DOUBLE PRECISION x
END FUNCTION
END INTERFACE
DOUBLE PRECISION a , b, sum, h
DIMENSION sum( 10)
INTEGER n, i
PARAMETER (a = 0, b = 10, n = 200)
h = (b-a)/n
sum = 0.0
DO i = 1, n
sum = sum+funct(i*h+a)
END DO
sum = h*(sum-0.5*(funct(a)+funct(b)))
PRINT *,sum
END
FUNCTION funct(x)
IMPLICIT NONE
DOUBLE PRECISION funct
DIMENSION funct( 10)
DOUBLE PRECISION x
INTEGER K
Do k = 1,10
funct(k) = x ** 2 * k
End Do
PRINT *, 'Value of funct is', funct
RETURN
END
If you can, you should switch to a more modern Standard such as Fortran 90+, and use modules. These provide interfaces automatically, which makes the code much simpler.
Alternatively, you could take the loop over k out of the function, and perform the sum element-wise. This would be valid FORTRAN 77:
PROGRAM integration
c ...
DIMENSION sum( 10)
c ...
INTEGER K
c ...
DO i = 1, n
Do k = 1,10
sum(k)= sum(k)+funct(i*h+a, k)
End Do
END DO
c ...
Notice that I pass k to the function. It needs to be adjusted accordingly:
DOUBLE PRECISION FUNCTION funct(x,k)
IMPLICIT NONE
DOUBLE PRECISION x
INTEGER K
funct = x ** 2 * k
PRINT *, 'Value of funct is', funct
RETURN
END
This version just returns a scalar and fills the array in the main program.
Apart from that I'm not sure it is wise to use a variable called sum. There is an intrinsic function with the same name. This could lead to some confusion...
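As a cross-check on what the ten sums should come out to, here is a sketch of the element-wise variant in Python. It is not a translation of the Fortran above: it uses the standard composite trapezoidal rule (interior points plus half of each end point), and the names funct and trapezoid_sums and their default arguments are chosen for illustration.

```python
def funct(x, k):
    """Scalar integrand, with k passed in as in the FORTRAN 77 variant."""
    return x ** 2 * k


def trapezoid_sums(a=0.0, b=10.0, n=200, kmax=10):
    """Trapezoidal estimate of the integral of funct(x, k) over [a, b],
    one result per k = 1..kmax, accumulated element-wise."""
    h = (b - a) / n
    sums = [0.0] * kmax
    for i in range(1, n):              # interior points only
        x = a + i * h
        for k in range(1, kmax + 1):
            sums[k - 1] += funct(x, k)
    return [h * (s + 0.5 * (funct(a, k) + funct(b, k)))
            for k, s in enumerate(sums, start=1)]


for k, value in enumerate(trapezoid_sums(), start=1):
    print(k, value)
```

Each result should be close to the exact integral k * 1000 / 3, which gives a quick way to check the Fortran output.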

pgi cuda fortran compiling error

When I compile a single CUDA Fortran code, the compiler gives me the following error:
PGF90-F-0000-Internal compiler error. Device compiler exited with error status code and
Attempt to call global subroutine without chevrons: increment
I am using Arch Linux with pgf90 2013.
The code is as follows:
module simple
contains
attributes (global) subroutine increment(a,b)
implicit none
integer, intent(inout) :: a(:)
integer , intent(in) :: b
integer :: i , n
n = size( a )
do i = 1 , n
a ( i ) = a ( i )+ b
end do
end subroutine increment
end module simple
program incrementTestCPU
use simple
implicit none
integer :: n = 256
integer :: a ( n ) , b
a = 1
b = 3
call increment ( a , b )
if ( any ( a /= 4)) then
write (* ,*) "pass"
else
write(*,*) "not passed"
end if
end program incrementTestCPU
You're calling this a "cuda fortran" code, but it is syntactically incorrect whether you want to ultimately run the subroutine on the host (CPU) or device (GPU). You may wish to refer to this blog post as a quick start guide.
If you want to run the subroutine increment on the GPU, you have not called it correctly:
call increment ( a , b )
A GPU subroutine call needs kernel launch parameters, which are given in the "triple chevron" <<<...>>> syntax. It should be placed between the subroutine name increment and its argument list, like so:
call increment<<<1,1>>> ( a , b )
The missing chevrons in your call are what give rise to the error message:
Attempt to call global subroutine without chevrons
If, instead, you intend to run this subroutine on the CPU, and are just passing it through the CUDA fortran compiler, then it is incorrect to specify the global attribute on the subroutine:
attributes (global) subroutine increment(a,b)
The following is a modification of your code which would run the subroutine on the GPU, and compiles cleanly for me with PGI 14.9 tools:
$ cat test3.cuf
module simple
contains
attributes (global) subroutine increment(a,b)
implicit none
integer :: a(:)
integer, value :: b
integer :: i , n
n = size( a )
do i = 1 , n
a ( i ) = a ( i )+ b
end do
end subroutine increment
end module simple
program incrementTestCPU
use simple
use cudafor
implicit none
integer, parameter :: n = 256
integer, device :: a_d(n), b_d
integer :: a ( n ) , b
a = 1
b = 3
a_d = a
b_d = b
call increment<<<1,1>>> ( a_d , b_d )
a = a_d
if ( any ( a /= 4)) then
write (* ,*) "not passed"
else
write(*,*) "pass"
end if
end program incrementTestCPU
$ pgf90 -Mcuda -ta=nvidia,cc20,cuda6.5 -Minfo test3.cuf -o test3
incrementtestcpu:
23, Memory set idiom, loop replaced by call to __c_mset4
29, any reduction inlined
$ pgf90 --version
pgf90 14.9-0 64-bit target on x86-64 Linux -tp nehalem
The Portland Group - PGI Compilers and Tools
Copyright (c) 2014, NVIDIA CORPORATION. All rights reserved.
$
If you are trying to create a CPU-only version, then remove all CUDA Fortran syntax from your program. If you still have difficulty, you can ask a Fortran-directed question, as it is not a CUDA issue at that point. As an example, the following (non-CUDA) code compiled cleanly for me:
module simple
contains
subroutine increment(a,b)
implicit none
integer, intent(inout) :: a(:)
integer , intent(in) :: b
integer :: i , n
n = size( a )
do i = 1 , n
a ( i ) = a ( i )+ b
end do
end subroutine increment
end module simple
program incrementTestCPU
use simple
implicit none
integer, parameter :: n = 256
integer :: a ( n ) , b
a = 1
b = 3
call increment ( a , b )
if ( any ( a /= 4)) then
write (* ,*) "not passed"
else
write(*,*) "pass"
end if
end program incrementTestCPU