Calling functions in OpenMP Fortran threads

I have a basic question. In the code below, I am calling the same function, 'add', twice. When I do this using OpenMP, I'm getting incorrect results.
program p
integer::i,j,omp_get_thread_num,n
real::suma
i=5
j=10
!$omp parallel num_threads(2) private(n)
n=omp_get_thread_num()
if(n==0) goto 1111
suma=add(i,n)
write(*,*)'sum for 5=',suma,n,i
goto 1000
1111 suma=add(j,n)
write(*,*)'sum for 10=',suma,n,j
1000 continue
!$omp end parallel
end program p
!----------------------------------------
function add(k,n)result(l)
implicit none
integer::k,s,n
real::l1,l
!write(*,*)'thread employing me is:',n
l1=0.0
do s=k,k+5
l1=l1+s
end do
l=l1
return
end function add
The result of executing this code is:
sum for 10= 45.0000000 0 10
sum for 5= 45.0000000 1 5
However, when I uncomment the write statement inside add, i.e. the line !write(*,*)'thread employing me is:',n,
the result is:
thread employing me is: 0
sum for 10= 75.0000000 0 10
thread employing me is: 1
sum for 5= 45.0000000 1 5
What should I do in order to call the same function from different threads correctly (i.e. without mixing up the variables)? Can anyone explain the results obtained?
This is a simplified version of my actual problem (where I'm calling the same function from multiple threads).
Edit: OK, I've realized the very silly mistake of not including 'suma' in the private list. But still, can someone explain why, when the write statement is uncommented, the program always gives the correct result, even if suma is not made private?

There is a data race in your program. suma is shared (by the implicit data-sharing rules of OpenMP) and both threads assign to it at the same time. Uncommenting the write statement introduces a slight offset in the execution of the second thread and can therefore hide the race condition (it doesn't on my OS X machine - there it just makes the program randomly print 45.0 twice or 75.0 twice).
!$omp parallel num_threads(2) private(n, suma)
...
!$omp end parallel
Besides that, you should really use OpenMP sections instead of the goto logic that you have employed (a combined sketch follows after the skeleton below):
!$omp parallel num_threads(2) private(n, suma)
n=omp_get_thread_num()
!$omp sections
suma=add(i,n)
write(*,*)'sum for 5=',suma,n,i
!$omp section
suma=add(j,n)
write(*,*)'sum for 10=',suma,n,j
!$omp end sections
!$omp end parallel
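For reference, here is a minimal corrected sketch of the full program with both changes applied (suma made private and the goto logic replaced by sections); it only rearranges the code from the question and has not been tuned beyond that:
program p
integer::i,j,omp_get_thread_num,n
real::suma,add
i=5
j=10
!$omp parallel num_threads(2) private(n,suma)
n=omp_get_thread_num()
!$omp sections
!$omp section
suma=add(i,n)
write(*,*)'sum for 5=',suma,n,i
!$omp section
suma=add(j,n)
write(*,*)'sum for 10=',suma,n,j
!$omp end sections
!$omp end parallel
end program p
Note that with sections the runtime decides which thread executes which section, so the thread number printed alongside each sum may differ from run to run.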

Related

Modelica and Event generation of Functions

I am trying to understand when events are generated in Modelica. In the context of functions I noticed behaviour I didn't expect: functions appear to suppress event generation.
I am surprised, as this is, to my knowledge, not explicitly stated in the Modelica reference.
For example, if I run this model in OMEdit of OpenModelica 1.17.0
model timeEventTest
Real z(start=0);
Real dummy(start=0);
equation
der(z) = dummy;
algorithm
if time > 10 then
dummy := 1;
else
dummy := -1.;
end if;
end timeEventTest;
I get the following output in the solver window of OMEdit
### STATISTICS ###
timer
events
1 state events
0 time events
solver: dassl
46 steps taken
46 calls of functionODE
44 evaluations of jacobian
0 error test failures
0 convergence test failures
0.000122251s time of jacobian evaluation
The simulation finished successfully.
Apart from the fact that the solver (I used dassl) interpreted the event at time=10 as a state event rather than a time event, the behaviour is as expected.
However, if I instead run the (mathematically identical) model
model timeEventTest2
Real z(start=0);
equation
der(z) = myfunc(time-10);
end timeEventTest2;
with myfunc defined as
function myfunc
input Real x;
output Real y;
algorithm
if x > 0 then
y := 1;
else
y:= -1;
end if;
end myfunc;
I obtain the following output in OMEdit
### STATISTICS ###
timer
events
0 state events
0 time events
solver: dassl
52 steps taken
79 calls of functionODE
63 evaluations of jacobian
13 error test failures
0 convergence test failures
0.000185296s time of jacobian evaluation
The simulation finished successfully.
Not only is the event at time = 10 NOT detected, but the solver even got into some trouble, as indicated by the error test failures.
This is a trivial example; however, I can imagine that the apparent suppression of events by functions may result in major problems in larger models.
What did I miss here? Can I enforce the strict triggering of events within functions?
Some built-in functions also trigger events, e.g. div and mod (curiously, sign and abs don't).
Edit: Obviously, you need to run the examples for at least 10 s of simulated time. I ran the simulations to 20 s.
Functions in Modelica normally do not generate events.
See 8.5 Events and Synchronization in the Modelica Spec.
All equations and assignment statements within when-clauses and all assignment statements within function classes are implicitly treated with noEvent, i.e., relations within the scope of these operators never induce state or time events.
But it is possible to change this behavior:
Add the annotation GenerateEvents=true to the function.
However, it seems like this is not sufficient for some Modelica simulators (tested with OpenModelica v1.16.5 and Dymola 2021x).
To make it work in OpenModelica and Dymola, you additionally have to add the Inline annotation or assign the function output in a single line.
So if you re-write your function as follows, you will get a state event:
function myfunc
input Real x;
output Real y;
algorithm
y := if x > 0 then 1 else -1;
annotation (GenerateEvents=true);
end myfunc;
or, keeping the original if-statement, by additionally adding the Inline annotation:
function myfunc
input Real x;
output Real y;
algorithm
if x > 0 then
y := 1;
else
y := -1;
end if;
annotation (GenerateEvents=true, Inline=true);
end myfunc;
State event vs time event
To turn the state event in timeEventTest into a time event, change the if-condition to
if time >= 10 then
This is also covered in chapter 8.5 Events and Synchronization of the Modelica Spec. Only the following two cases trigger time events:
time >= discrete expression
time < discrete expression
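For instance, a sketch of the first model with only this change applied (untested, everything else as in timeEventTest above) should then report one time event and zero state events:
model timeEventTest
Real z(start=0);
Real dummy(start=0);
equation
der(z) = dummy;
algorithm
if time >= 10 then
dummy := 1;
else
dummy := -1;
end if;
end timeEventTest;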

How can I check convergence of a numerical method based on the results? (Octave)

I'm writing a fixed-point iteration script in Octave and need to check whether the method converges. At the moment the only thing I've come up with is a rather rudimentary check of the derivative of g(x) evaluated at x0.
if (conv_x<=1)
fprintf("\nThe method guarantees convergence:\n|g'(x0)| <= 1\n%d <= 1\n", conv_x)
else
fprintf("\nThe method does not guarantee convergence:\n|g'(x0)| > 1\n%d > 1\n", conv_x)
endif
Although there are cases in which it does converge even though it isn't guaranteed.
Example (command window):
The method does not guarantee convergence:
|g'(x0)| > 1
2.48318 > 1
i x_i Ea Er Er%
0 1.000000
1 3.623970 0.292484 0.080708 8.07081%
2 3.277427 0.346543 0.105736 10.5736%
3 2.929255 0.348173 0.118860 11.886%
4 2.663926 0.265329 0.099601 9.96007%
5 2.531185 0.132741 0.052442 5.24424%
6 2.490991 0.040194 0.016136 1.61356%
7 2.482583 0.008408 0.003387 0.338681%
8 2.481053 0.001530 0.000617 0.0616501%
9 2.480784 0.000270 0.000109 0.0108692%
10 2.480736 0.000047 0.000019 0.00190502%
11 2.480728 0.000008 0.000003 0.000333541%
>>
Is there a way I can make the program read the results and THEN have it say whether it converges or not, instead of just saying whether convergence is guaranteed before the method is applied?
The solution ended up being extremely simple. I made a vector which stored all the values of the absolute error, then compared the first value with the last to check whether the method was converging.
err_v = [err_v, err_abs]
This line sits inside the fixed-point method loop, so err_v accumulates every value.
Then I compared the first value with the last. I stored the first and last values in separate variables:
err_v_i = err_v(1);
err_v_f = err_v(:,end);
Finally I just compared the two with an if statement.
if(err_v_i > err_v_f)
fprintf("\nMethod is converging\n")
else
fprintf("\nMethod is NOT converging\n")
endif
This made it possible to point out situations in which the method converges even though |g'(x0)| > 1.
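Putting the pieces together, a rough consolidated sketch of the loop and the check might look like this (g, the starting guess, tol and max_iter are placeholders chosen only to make the snippet runnable, not values from the question):
% placeholder iteration function and parameters
g = @(x) cos(x);
x = 1.0;
tol = 1e-6;
max_iter = 50;
err_v = [];                      % stores every absolute error
for i = 1:max_iter
  x_new = g(x);
  err_abs = abs(x_new - x);
  err_v = [err_v, err_abs];
  if err_abs < tol
    break
  endif
  x = x_new;
endfor
err_v_i = err_v(1);              % first absolute error
err_v_f = err_v(end);            % last absolute error
if (err_v_i > err_v_f)
  fprintf("\nMethod is converging\n")
else
  fprintf("\nMethod is NOT converging\n")
endif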
Thanks Cris Luengo, your first comment made me realize how simple this was.

How does this recursive function in Julia work?

This code in Julia:
function seq(n)
if n<2
return BigInt(2)
else
return 1/(3-seq(n-1))
end
end
# and then run
[seq(n) for n=1:10]
replicates the recursive sequence Un = 1/(3 - U(n-1)) where U1 = 2, and it works. But can someone explain to me how it works? For every n, does it recalculate every term before it, or does the "return" store the value somewhere so it can be reused and the earlier terms don't have to be recomputed every time?
It's just a normal recursive function: it calls itself however many times it needs to in order to compute the result. It terminates because every call chain eventually reaches the base case. There is no implicit caching of results or anything like that—it recomputes the same result however many times the function is called. If you want to remember previously calculated values, you can use the Memoize package to automatically "memoize" return values. Here's a terser version of the unmemoized function:
julia> seq(n) = n < 2 ? BigFloat(2) : 1/(3-seq(n-1))
seq (generic function with 1 method)
julia> seq(1) # trigger compilation
2.0
julia> @time [seq(n) for n=1:100];
0.001152 seconds (20.00 k allocations: 1.069 MiB)
julia> @time [seq(n) for n=1:100];
0.001365 seconds (20.00 k allocations: 1.069 MiB)
I changed it to fit on a single line and to return BigFloat(2) instead of BigInt(2) since the function returns BigFloat for larger inputs because of the division operation. Note that the second timing is no faster than the first (slower, in fact, probably because garbage collection kicks in during the second but not the first). Here's the same thing but with memoization:
julia> using Memoize
julia> @memoize seqm(n) = n < 2 ? BigFloat(2) : 1/(3-seqm(n-1))
seqm (generic function with 1 method)
julia> seqm(1) # trigger compilation
2.0
julia> @time [seqm(n) for n=1:100];
0.000071 seconds (799 allocations: 36.750 KiB)
julia> @time [seqm(n) for n=1:100];
0.000011 seconds (201 allocations: 4.000 KiB)
The first timing is significantly faster than the unmemoized version, even though the memoization cache is empty at the start, because the same computation is requested many times and memoization avoids repeating it after the first time. The second timing is even faster because now all 100 computed values are already cached and can simply be returned.
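If you'd rather not depend on a package, roughly the same effect can be obtained by caching results in a Dict by hand. This is only a sketch of the idea behind memoization, not how the Memoize package is actually implemented:
const seq_cache = Dict{Int,BigFloat}()

function seq_cached(n)
    # compute and store the value only if n is not in the cache yet
    get!(seq_cache, n) do
        n < 2 ? BigFloat(2) : 1/(3 - seq_cached(n-1))
    end
end

[seq_cached(n) for n = 1:100]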

Calculating PI with Fortran & CUDA

I am trying to make a simple program with PGI's Fortran compiler. This simple program will use the graphics card to calculate pi using the "dart board" algorithm. After battling with this program for quite some time now, I have finally got it to behave for the most part. However, I am currently stuck on passing back the results properly. I must say, this is a rather tricky program to debug since I can no longer shove any print statements into the subroutine. This program currently returns all zeros. I am not really sure what is going on, but I have two ideas, neither of which I am sure how to fix:
The CUDA kernel is not running somehow?
I am not converting the values properly? pi_parts = pi_parts_d
Well, this is the status of my current program. All variables ending in _d refer to CUDA-prepared device memory, while all the other variables (with the exception of the CUDA kernel) are typical host-side Fortran variables. There are some print statements I have commented out that I had already tried from the CPU Fortran side; these were to check whether I really was generating the random numbers properly. As for the CUDA kernel, I have currently commented out the calculation and set z statically equal to 1 just to see something happen.
module calcPi
contains
attributes(global) subroutine pi_darts(x, y, results, N)
use cudafor
implicit none
integer :: id
integer, value :: N
real, dimension(N) :: x, y, results
real :: z
id = (blockIdx%x-1)*blockDim%x + threadIdx%x
if (id .lt. N) then
! SQRT NOT NEEDED, SQRT(1) === 1
! Anything above and below 1 would stay the same even with the applied
! sqrt function. Therefore using the sqrt function wastes GPU time.
z = 1.0
!z = x(id)*x(id)+y(id)*y(id)
!if (z .lt. 1.0) then
! z = 1.0
!else
! z = 0.0
!endif
results(id) = z
endif
end subroutine pi_darts
end module calcPi
program final_project
use calcPi
use cudafor
implicit none
integer, parameter :: N = 400
integer :: i
real, dimension(N) :: x, y, pi_parts
real, dimension(N), device :: x_d, y_d, pi_parts_d
type(dim3) :: grid, tBlock
! Initialize the random number generator's seed
call random_seed()
! Make sure we initialize the parts with 0
pi_parts = 0
! Prepare the random numbers (These cannot be generated from inside the
! cuda kernel)
call random_number(x)
call random_number(y)
!write(*,*) x, y
! Convert the random numbers into graphics card memory land!
x_d = x
y_d = y
pi_parts_d = pi_parts
! For the cuda kernel
tBlock = dim3(256,1,1)
grid = dim3((N/tBlock%x)+1,1,1)
! Start the cuda kernel
call pi_darts<<<grid, tblock>>>(x_d, y_d, pi_parts_d, N)
! Transform the results into CPU Memory
pi_parts = pi_parts_d
write(*,*) pi_parts
write(*,*) 'PI: ', 4.0*sum(pi_parts)/N
end program final_project
EDIT TO CODE:
Changed various lines to reflect the fixes mentioned by Robert Crovella. Current status: cuda-memcheck catches an error: Program hit error 8 on CUDA API call to cudaLaunch on my machine.
If there is any method I can use to test this program please let me know. I am throwing darts and seeing where they land for my current style of debugging with CUDA. Not the most ideal, but it will have to do until I find another way.
May the Fortran Gods have mercy on my soul at this dark hour.
When I compile and run your program I get a segfault. This is due to the last parameter you are passing to the kernel (N_d):
call pi_darts<<<grid, tblock>>>(x_d, y_d, pi_parts_d, N_d)
Since N is a scalar quantity, the kernel is expecting to use it directly, rather than as a pointer. So when you pass a pointer to device data (N_d), the process of setting up the kernel generates a seg fault (in host code!) as it attempts to access the value N, which should be passed directly as:
call pi_darts<<<grid, tblock>>>(x_d, y_d, pi_parts_d, N)
When I make that change to the code you have posted, I then get actual printed output (instead of a seg fault), which is an array of ones and zeroes (256 ones, followed by 144 zeroes, for a total of N=400 values), followed by the calculated PI value (which happens to be 2.56 in this case (4*256/400), since you have made the kernel basically a dummy kernel).
This line of code is also probably not doing what you want:
grid = dim3(N/tBlock%x,1,1)
With N = 400 and tBlock%x = 256 (from previous code lines), the result of the calculation is 1 (ie. grid ends up at (1,1,1) which amounts to one threadblock). But you really want to launch 2 threadblocks, so as to cover the entire range of your data set (N = 400 elements). There's a number of ways to fix this, but for simplicity let's just always add 1 to the calculation:
grid = dim3((N/tBlock%x)+1,1,1)
Under these circumstances, when we launch grids that are larger (in terms of total threads) than our data set size (512 threads but only 400 data elements in this example), it's customary to put a thread check near the beginning of our kernel (in this case, after the initialization of id) to prevent out-of-bounds accesses, like so:
if (id .lt. N) then
(and a corresponding endif at the very end of the kernel code). This way, only the threads that correspond to actual valid data are allowed to do any work.
With the above changes, your code should be essentially functional, and you should be able to revert your kernel code to the proper statements and start to get an estimate of PI.
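For reference, the "proper statements" are the ones commented out in the question's kernel; the body of the if-block would then read (untested sketch, taken from the question's own commented-out code):
! sqrt not needed: sqrt(1) == 1, so comparing the squared distance against 1 is enough
z = x(id)*x(id) + y(id)*y(id)
if (z .lt. 1.0) then
z = 1.0   ! dart landed inside the quarter circle
else
z = 0.0   ! dart landed outside
endif
results(id) = z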
Note that you can check the CUDA API for error return codes, and you can also run your code with cuda-memcheck to get an idea of whether the kernel is making out-of-bounds accesses. Neither of these would have helped with this particular seg fault, however.

Is the nvidia kepler shuffle "destructive"?

I'm implementing a parallel reduction in CUDA using Kepler's new shuffle instructions, similar to this:
http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
I was searching for the minima of the rows in a given matrix, and at the end of the kernel I had the following code:
my_register = min(my_register, __shfl_down(my_register,8,16));
my_register = min(my_register, __shfl_down(my_register,4,16));
my_register = min(my_register, __shfl_down(my_register,2,16));
my_register = min(my_register, __shfl_down(my_register,1,16));
My blocks are 16*16, so everything worked fine; with that code I was getting the minima of two sub-rows within the very same kernel.
Now I also need to return the indices of the smallest elements in every row of my matrix, so I was going to replace "min" with an "if" statement and handle the indices in a similar fashion, but I got stuck at this code:
if (my_reg > __shfl_down(my_reg,8,16)){my_reg = __shfl_down(my_reg,8,16);};
if (my_reg > __shfl_down(my_reg,4,16)){my_reg = __shfl_down(my_reg,4,16);};
if (my_reg > __shfl_down(my_reg,2,16)){my_reg = __shfl_down(my_reg,2,16);};
if (my_reg > __shfl_down(my_reg,1,16)){my_reg = __shfl_down(my_reg,1,16);};
No cudaErrors whatsoever, but the kernel now returns garbage. Nevertheless, I have a fix for that:
myreg_tmp = __shfl_down(myreg,8,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};
myreg_tmp = __shfl_down(myreg,4,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};
myreg_tmp = __shfl_down(myreg,2,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};
myreg_tmp = __shfl_down(myreg,1,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};
So, using a new temporary variable to peek into neighboring lanes fixes everything for me.
Now the question: are the Kepler shuffle instructions destructive, in the sense that invoking the same instruction twice does not yield the same result? I haven't assigned anything to those registers in "my_reg > __shfl_down(my_reg,8,16)", which adds to my confusion. Can anyone explain what the problem is with invoking shuffle twice? I'm pretty much a newbie in CUDA, so a detailed explanation for dummies is welcome.
Warp shuffle is not destructive. The operation, if repeated under the exact same conditions, will return the same result each time. The var value (myreg in your example) does not get modified by the warp shuffle function itself.
The problem you are experiencing is due to the fact that the number of participating threads on the second invocation of __shfl_down() in your first method is different than the other invocations, in either method.
First, let's remind ourselves of a key point in the documentation:
Threads may only read data from another thread which is actively participating in the __shfl() command. If the target thread is inactive, the retrieved value is undefined.
Now let's take a look at your first "broken" method:
if (my_reg > __shfl_down(my_reg,8,16)){my_reg = __shfl_down(my_reg,8,16);};
The first time you call __shfl_down() above (within the if-clause), all threads are participating. Therefore all values returned by __shfl_down() will be what you expect. However, once the if clause is complete, only threads that satisfied the if-clause will participate in the body of the if-statement. Therefore, on the second invocation of __shfl_down() within the if-statement body, only threads for which their my_reg value was greater than the my_reg value of the thread 8 lanes above them will participate. This means that some of these assignment statements probably will not return the value you expect, because the other thread may not be participating. (The participation of the thread 8 lanes above would be dependent on the result of the if comparison done by that thread, which may or may not be true.)
The second method you propose has no such issue, and works correctly according to your statements. All threads participate in each invocation of __shfl_down().
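Building on the working second method, one way to also carry the index of the minimum along is to shuffle both the value and its index into temporaries before the conditional assignment, so that every thread still participates in every shuffle. A sketch only (val and idx are assumed to have been initialized earlier in the kernel; __shfl_down is the pre-CUDA-9 intrinsic used in the question, later replaced by __shfl_down_sync):
// 16-lane min-reduction that also tracks the column index of the minimum.
__device__ void minWithIndex16(float &val, int &idx)
{
    for (int offset = 8; offset > 0; offset >>= 1) {
        // every thread executes both shuffles, so all lanes participate
        float other_val = __shfl_down(val, offset, 16);
        int   other_idx = __shfl_down(idx, offset, 16);
        if (other_val < val) {   // only the comparison/assignment is conditional
            val = other_val;
            idx = other_idx;
        }
    }
    // lane 0 of each 16-lane group now holds the minimum and its index
}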