Shared memory in CUDA Fortran not working as expected

I am building a CUDA Fortran program and a strange behavior occurs. I don't really understand why my code runs like this and would appreciate your help.
It seems that the value 0 is never assigned, and the loop even executes beyond its bounds.
I tried to put the if condition after the loop, but that did not help either.
Thank you for your help.
real, shared :: s_d_aaa_adk(0:15,0:15)
real, shared :: s_d_bbb_adk(0:15,0:15)
real, shared :: s_d_ccc_adk(0:15,0:15)
d_k = (blockIdx%x-1)
s_d_j = threadIdx%x-1
s_d_l = threadIdx%y-1
if(d_k == kmax-1)then
  s_d_aaa_adk(s_d_j,s_d_l) = 0
  s_d_bbb_adk(s_d_j,s_d_l) = 0
  s_d_ccc_adk(s_d_j,s_d_l) = 0
endif
do d_k = 0, kmax-2
  s_d_bbb_adk(s_d_j,s_d_l) = d_bbb(s_d_j,d_l,d_k+1)
  s_d_ccc_adk(s_d_j,s_d_l) = d_ccc(d_j,s_d_l,d_k+1)
  s_d_aaa_adk(s_d_j,s_d_l) = d_aaa(d_j,s_d_l,d_k+1)
end do
I set all global memory arrays to size (16,16,kmax); the grid is (128,1,1), the block is (16,16,1), and the kernel is launched as testkernell<<<grid,block>>>()

Since you're conditioning the if statement on d_k, which is derived from the block index:
d_k = (blockIdx%x-1)
if(d_k == kmax-1)then
This means that only one block out of the 128 in your grid will actually execute the if statement, setting those particular shared memory values to zero. Most of your blocks will not execute what's inside the if statement.
And if kmax happens to be greater than 128, then none of your blocks will execute the if statement.
If you want that if-statement to be executed within every threadblock, you will need to condition it on something other than the block index.
I would make a suggestion about how to restructure the code, but it's not clear to me what you want to achieve as far as loading data into shared memory. For instance, your do-loop doesn't make much sense to me:
do d_k = 0, kmax-2
  s_d_bbb_adk(s_d_j,s_d_l) = d_bbb(s_d_j,d_l,d_k+1)
  s_d_ccc_adk(s_d_j,s_d_l) = d_ccc(d_j,s_d_l,d_k+1)
  s_d_aaa_adk(s_d_j,s_d_l) = d_aaa(d_j,s_d_l,d_k+1)
end do
(A given thread has specific values for the indices s_d_j and s_d_l.)
Your s_d_j and s_d_l variables are thread indices. So a given thread will see this do loop, and it will execute the loop iteratively, loading successive values from the various global memory arrays (d_bbb, d_ccc, etc.) into the exact same locations in each shared memory array.
It seems to me you don't really understand how thread execution works. Pretend that you are a given thread, assign specific values to s_d_j and s_d_l (and d_k, although you are overwriting the block index when you re-use that variable as your loop index, which also seems strange to me), and then see if your code execution makes sense.
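The "pretend you are a given thread" exercise can be sketched on the host. The following is a minimal Python simulation, not CUDA Fortran: `run_thread`, the dictionary standing in for shared memory, and the test data are all hypothetical names chosen for illustration. It follows one thread with fixed indices through the question's do-loop and records which shared-memory location it writes.

```python
def run_thread(j, l, kmax, d_bbb):
    """Simulate the question's do-loop for one thread with fixed (j, l)."""
    shared = {}                       # stands in for s_d_bbb_adk
    writes = []                       # every shared-memory location written
    for k in range(0, kmax - 1):      # do d_k = 0, kmax-2
        shared[(j, l)] = d_bbb[(j, l, k + 1)]
        writes.append((j, l))
    return shared, writes

kmax = 4
# Hypothetical global array: each value encodes the slice it came from.
d_bbb = {(j, l, k): 100 * k + 10 * j + l
         for j in range(2) for l in range(2) for k in range(1, kmax + 1)}

shared, writes = run_thread(j=1, l=0, kmax=kmax, d_bbb=d_bbb)
print(set(writes))      # the set of locations this thread wrote to
print(shared[(1, 0)])   # whatever the final loop iteration left behind
```

Every iteration targets the same location, so only the value loaded in the final iteration survives; the earlier loads are simply overwritten.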
EDIT: Based on additional comments:
You have stated your overall data set size (x,y,z) is (64,64,32).
You have stated "I am slicing ... array through z. ... I want to put each slice in one block"
That would suggest to me that you should launch one block per slice. Or maybe you have an algorithm in mind that has multiple blocks assigned to a single slice. Regardless, I will assume that you want all the slice data (64, 64) available to a given block that is assigned to that slice. I will assume for now that you will launch 32 blocks. It should not be difficult to extend to the case where multiple blocks are working on a single slice. I will also assume a 32x32 thread block rather than 16x16 that you have indicated. It should not be difficult to extend this to use 16x16 if you want to.
You might do something like this then:
real, shared :: s_d_aaa_adk(0:63,0:63)
real, shared :: s_d_bbb_adk(0:63,0:63)
real, shared :: s_d_ccc_adk(0:63,0:63)
c above uses 48KB of shared mem, so assuming cc 2.0+ and cache config set accordingly
d_k = (blockIdx%x-1)
s_d_j = threadIdx%x-1
s_d_l = threadIdx%y-1
c fill first quadrant
s_d_bbb_adk(s_d_j,s_d_l) = d_bbb(s_d_j,s_d_l,d_k+1)
s_d_ccc_adk(s_d_j,s_d_l) = d_ccc(s_d_j,s_d_l,d_k+1)
s_d_aaa_adk(s_d_j,s_d_l) = d_aaa(s_d_j,s_d_l,d_k+1)
c fill second quadrant
s_d_bbb_adk(s_d_j+blockDim%x,s_d_l) = d_bbb(s_d_j+blockDim%x,s_d_l,d_k+1)
s_d_ccc_adk(s_d_j+blockDim%x,s_d_l) = d_ccc(s_d_j+blockDim%x,s_d_l,d_k+1)
s_d_aaa_adk(s_d_j+blockDim%x,s_d_l) = d_aaa(s_d_j+blockDim%x,s_d_l,d_k+1)
c fill third quadrant
s_d_bbb_adk(s_d_j,s_d_l+blockDim%y) = d_bbb(s_d_j,s_d_l+blockDim%y,d_k+1)
s_d_ccc_adk(s_d_j,s_d_l+blockDim%y) = d_ccc(s_d_j,s_d_l+blockDim%y,d_k+1)
s_d_aaa_adk(s_d_j,s_d_l+blockDim%y) = d_aaa(s_d_j,s_d_l+blockDim%y,d_k+1)
c fill fourth quadrant
s_d_bbb_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = d_bbb(s_d_j+blockDim%x,s_d_l+blockDim%y,d_k+1)
s_d_ccc_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = d_ccc(s_d_j+blockDim%x,s_d_l+blockDim%y,d_k+1)
s_d_aaa_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = d_aaa(s_d_j+blockDim%x,s_d_l+blockDim%y,d_k+1)
c just guessing about what your intent was on filling with zeroes
c this just makes sure that one of the slices at the end gets zeroes
c instead of the values from the global arrays
if(d_k == kmax-1)then
c fill first quadrant
s_d_bbb_adk(s_d_j,s_d_l) = 0
s_d_ccc_adk(s_d_j,s_d_l) = 0
s_d_aaa_adk(s_d_j,s_d_l) = 0
c fill second quadrant
s_d_bbb_adk(s_d_j+blockDim%x,s_d_l) = 0
s_d_ccc_adk(s_d_j+blockDim%x,s_d_l) = 0
s_d_aaa_adk(s_d_j+blockDim%x,s_d_l) = 0
c fill third quadrant
s_d_bbb_adk(s_d_j,s_d_l+blockDim%y) = 0
s_d_ccc_adk(s_d_j,s_d_l+blockDim%y) = 0
s_d_aaa_adk(s_d_j,s_d_l+blockDim%y) = 0
c fill fourth quadrant
s_d_bbb_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = 0
s_d_ccc_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = 0
s_d_aaa_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = 0
endif
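As a sanity check on the quadrant scheme above, here is a small host-side Python sketch (not device code). It assumes the 32x32 thread block and 64x64 slice from the answer, and the names `BLOCK`, `TILE` and `hits` are illustrative only. It counts how often each tile element is written when every thread performs the four offset loads, and recomputes the shared-memory footprint of three single-precision arrays.

```python
BLOCK = 32          # blockDim%x == blockDim%y, as assumed in the answer
TILE = 64           # slice extent in x and y

hits = [[0] * TILE for _ in range(TILE)]
for tx in range(BLOCK):              # threadIdx%x - 1
    for ty in range(BLOCK):          # threadIdx%y - 1
        for dx in (0, BLOCK):        # quadrant offset in x
            for dy in (0, BLOCK):    # quadrant offset in y
                hits[tx + dx][ty + dy] += 1

all_once = all(h == 1 for row in hits for h in row)
shared_bytes = 3 * TILE * TILE * 4   # three arrays of 4-byte reals
print(all_once, shared_bytes)
```

In the actual kernel, a barrier such as syncthreads() would normally follow the fill before any thread reads an element that was loaded by a different thread.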

Is using base case variable in a recursive function important?

I'm currently learning about recursion, it's pretty hard to understand. I found a very common example for it:
function factorial(N)
local Value
if N == 0 then
Value = 1
else
Value = N * factorial(N - 1)
end
return Value
end
print(factorial(3))
N == 0 is the base case. But when I changed it to N == 1, the result remained the same (it still prints 6).
Is using the base case important? (will it break or something?)
What's the difference between using N == 0 (base case) and N == 1?
That's just a coincidence, since 1 * 1 = 1, so it ends up working either way.
But consider the edge case where N = 0: if you check for N == 1, you'd go into the else branch and calculate 0 * factorial(-1), which keeps recursing until the stack overflows.
The same would happen in both cases if you just called factorial(-1) directly, which is why you should either recurse only when N > 0 (effectively treating every negative value as 0 and returning 1), or add another if condition and raise an error when N is negative.
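To make the difference between the two base cases concrete, here is a minimal sketch in Python rather than Lua. The function names are illustrative, and the `ValueError` guard is one possible way to handle negative input, not the only one.

```python
def factorial(n):
    """Factorial with an explicit guard for invalid input."""
    if n < 0:
        raise ValueError("factorial is undefined for negative integers")
    if n == 0:                 # base case
        return 1
    return n * factorial(n - 1)

def factorial_base_one(n):
    """Same recursion, but the base case only matches n == 1."""
    if n == 1:
        return 1
    return n * factorial_base_one(n - 1)

print(factorial(3), factorial_base_one(3))   # same result for n = 3
print(factorial(0))
try:
    factorial_base_one(0)                    # steps past the base case
except RecursionError:
    print("factorial_base_one(0) never reaches its base case")
```

Python stops the runaway recursion with a RecursionError; other languages may fail differently (for example with a stack overflow), but the cause is the same missed base case.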
EDIT: As pointed out in another answer, your implementation is not tail-recursive, meaning it accumulates memory for every recursive function call until it finishes or runs out of memory.
You can make the function tail-recursive, which allows Lua to treat it pretty much like a normal loop that could run as long as it takes to calculate its result:
local function factorial(n, acc)
  acc = acc or 1
  if n <= 0 then
    return acc
  else
    return factorial(n-1, acc*n)
  end
end
print(factorial(3))
Note though, that in the case of factorial, it would take you way longer to run out of stack memory than to overflow Lua's number data type at around 21!, so making it tail-recursive is really just a matter of training yourself to write better code.
As the above answer and comments have pointed out, it is essential to have a base-case in a recursive function; otherwise, one ends up with an infinite loop.
Also, in the case of your factorial function, it is probably more efficient to use a helper function to perform the recursion, so as to take advantage of Lua's tail-call optimizations. Since Lua conveniently allows for local functions, you can define a helper within the scope of your factorial function.
Note that this example is not meant to handle the factorials of negative numbers.
-- Requires: n is an integer greater than or equal to 0.
-- Effects : returns the factorial of n.
function fact(n)
  -- Local function that will actually perform the recursion.
  local function fact_helper(n, i)
    -- This is the base case.
    if (i == 1) then
      return n
    end
    -- Take advantage of tail calls.
    return fact_helper(n * i, i - 1)
  end
  -- Check for edge cases, such as fact(0) and fact(1).
  if ((n == 0) or (n == 1)) then
    return 1
  end
  return fact_helper(n, n - 1)
end

store values of function to prevent from running again

Say I have some complicated function f(fvar1, ..., fvarN) such as:
def f(fvar1,..., fvarN):
    return (complicated function of fvar1, ..., fvarN).
Now function g(gvar1, ..., gvarM) has an expression in terms of f(fvar1, ..., fvarN), let's say:
def g(gvar1, ..., gvarM):
    return stuff * f(gvar1 * gvar2, ..., gvar5 * gvarM) - stuff * f(gvar3, gvar2, ..., gvarM)
where the arguments of f inside g can be different linear combinations of gvar1, ..., gvarM.
Because f is a complicated function, it is costly to call f, but it is also difficult to store the value locally in g because g has many instances of f with different argument combinations.
Is there a way to store values of f such that f of the same values are not called again and again without having to define every different instance of f locally within g?
Yes, this is called memoisation. The basic idea is to have f() maintain some sort of data store based on the parameters passed in. Then, if it's called with the same parameters, it simply returns the stored value rather than recalculating it.
The data store probably needs to be limited in size and optimised for the pattern of calls you expect, by removing parameter sets based on some rules. For example, if the number of times a parameter set is used indicates its likelihood of being used in future, you probably want to remove patterns that are used infrequently, and keep those that are used more often.
Consider, for example, the following Python code for adding two numbers (let us pretend that this is a massively time-expensive operation):
import random
def addTwo(a, b):
    return a + b
for _ in range(100):
    x = random.randint(1, 5)
    y = random.randint(1, 5)
    z = addTwo(x, y)
    print(f"{x} + {y} = {z}")
That works but, of course, is inefficient if you use the same numbers as used previously. You can add memoisation as follows.
The code will "remember" a limited number of calculations. When the cache is full it evicts the first key in the dictionary, which is the oldest entry, since dictionaries preserve insertion order from Python 3.7 onwards. If it gets a pair it already knows about, it just returns the cached value.
Otherwise, it calculates the value, storing it into the cache, and ensuring said cache doesn't grow too big:
import random, time
# Cache, and the stats for it.
(pairToSumMap, cached, calculated) = ({}, 0, 0)
def addTwo(a, b):
    global pairToSumMap, cached, calculated
    # Attempt two different cache lookups first (a:b, b:a).
    total = None
    try:
        total = pairToSumMap[f"{a}:{b}"]
    except KeyError:
        try:
            total = pairToSumMap[f"{b}:{a}"]
        except KeyError:
            pass
    # Found in cache, return.
    if total is not None:
        print("Using cached value: ", end="")
        cached += 1
        return total
    # Not found, calculate and add to cache (with limited cache size).
    print("Calculating value: ", end="")
    calculated += 1
    time.sleep(1)  # Make expensive.
    total = a + b
    if len(pairToSumMap) > 10:
        del pairToSumMap[list(pairToSumMap.keys())[0]]
    pairToSumMap[f"{a}:{b}"] = total
    return total
for _ in range(100):
    x = random.randint(1, 5)
    y = random.randint(1, 5)
    z = addTwo(x, y)
    print(f"{x} + {y} = {z}")
print(f"Calculated {calculated}, cached {cached}")
You'll see I've also added cached/calculated information, including a final statistics line which shows the caching in action, for example:
Calculated 29, cached 71
I've also made the calculation an expensive operation so you can see it in action (as per the speed of output). Ones that are cached will come back immediately, calculating the sum will take a second.
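The hand-rolled cache above can often be replaced by the standard library's `functools.lru_cache`, which provides a bounded cache with least-recently-used eviction. The sketch below is one way to use it; `add_two`, the `calls` counter and the `maxsize` value are illustrative choices. Unlike the dictionary version above, it treats (a, b) and (b, a) as different keys.

```python
from functools import lru_cache

calls = 0  # counts how often the function body actually runs

@lru_cache(maxsize=128)        # bounded cache, least-recently-used eviction
def add_two(a, b):
    global calls
    calls += 1
    return a + b               # stand-in for the expensive computation

for a, b in [(1, 2), (3, 4), (1, 2), (1, 2), (3, 4)]:
    add_two(a, b)

info = add_two.cache_info()
print(calls, info.hits, info.misses)
```

The arguments must be hashable for this to work, so a function taking lists or arrays would need its arguments converted (for example to tuples) first.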

Composite trapezoid rule not running in Octave

I have the following code in Octave for implementing the composite trapezoid rule, and for some reason the function just stalls whenever I execute it on f = @(x) x^2, a = 0, b = 4, TOL = 10^-6. Whenever I call trapezoid(f, a, b, TOL), nothing happens and I have to exit the Terminal in order to do anything else in Octave. Here is the code:
% INPUTS
%
% f : a function
% a : starting point
% b : endpoint
% TOL : tolerance
function root = trapezoid(f, a, b, TOL)
  disp('test');
  max_iterations = 10000;
  disp(max_iterations);
  count = 1;
  disp(count);
  initial = (b-a)*(f(b) + f(a))/2;
  while count < max_iterations
    disp(initial);
    trap_0 = initial;
    trap_1 = 0;
    trap_1_midpoints = a:(0.5^count):b;
    for i = 1:(length(trap_1_midpoints)-1)
      trap_1 = trap_1 + (trap_1_midpoints(i+1) - trap_1_midpoints(i))*(f(trap_1_midpoints(i+1) + f(trap_1_midpoints(i))))/2;
    endfor
    if abs(trap_0 - trap_1) < TOL
      root = trap_1;
      return;
    endif
    intial = trap_1;
    count = count + 1;
    disp(count);
  endwhile
  disp(['Process ended after ' num2str(max_iterations), ' iterations.']);
I have tried your function in Matlab.
Your code is not stalling. Rather, the size of trap_1_midpoints increases exponentially with count, and with it the computation time of trap_1. This is what you experience as stalling.
I also found a possible bug in your code. I guess the line after the if clause should be initial = trap_1. Check the missing 'i'.
With that fixed, your code still takes forever, but if you increase the tolerance (e.g. to a value of 1) your code returns.
You could try to vectorize the for loop for speed up.
Edit: I think inside your for loop, a ) is missing after f(trap_1_midpoints(i+1).
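Putting both fixes together (the misspelled update and the misplaced parenthesis), the refinement loop can be sketched as follows. This is a Python sketch rather than Octave, and it is a restructured version, not the original code: it doubles the number of subintervals instead of building a midpoint vector, and `composite_trapezoid`, `trapezoid` and the `max_levels` cap of 20 are assumptions made for illustration.

```python
def composite_trapezoid(f, a, b, n):
    """Composite trapezoid rule with n equal subintervals."""
    h = (b - a) / n
    interior = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * (f(a) + f(b)) + interior)

def trapezoid(f, a, b, tol, max_levels=20):
    """Double n until two successive estimates agree within tol."""
    n = 1
    previous = composite_trapezoid(f, a, b, n)
    for _ in range(max_levels):
        n *= 2
        current = composite_trapezoid(f, a, b, n)
        if abs(previous - current) < tol:
            return current, n
        previous = current          # the update the original misspells
    raise RuntimeError("tolerance not reached")

value, n = trapezoid(lambda x: x ** 2, 0.0, 4.0, 1e-6)
print(value, n)
```

With the previous estimate actually updated on each pass, the loop terminates after a modest number of doublings for the question's inputs instead of running toward the iteration cap.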
After count=52 or so, the arithmetic sequence trap_1_midpoints is no longer representable in any meaningful fashion in floating point numbers. After count=1075 or similar, the step size is no longer representable as a positive floating point double number. That all is to say, the bound max_iterations = 10000 is ludicrous. As explained below, all computations after count=20 are meaningless.
The theoretical error for stepsize h is O(T·h^2). There is a numerical error accumulation in the summation of O(T/h) numbers that is of that size, i.e., O(mu/h) with mu=1ulp=2^(-52). Which in total means that the lowest error of the numerical integration can be expected around h=mu^(1/3), for double numbers thus h=1e-5 or in the algorithm count=17. This may vary with interval length and how smooth or wavy the function is.
One can expect the behavior that the error divides by four while halving the step size only for step sizes above this boundary 1e-5. This also means that abs(trap_0 - trap_1) is a reliable measure for the error of trap_0 (and abs(trap_0 - trap_1)/3 for trap_1) only inside this range of step sizes.
The error bound TOL=1e-6 should be met for about h=1e-3, which corresponds to count=10. If the recursion does not stop for count = 14 (which should give an error smaller than 1e-8) then the method is not accurately implemented.
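The claim that the error divides by four when the step size is halved, and that the difference of successive estimates divided by three estimates the error of the finer one, can be checked numerically. The sketch below uses the question's integrand on [0, 4]; the helper name and the range of step sizes are chosen for illustration and stay well above the round-off regime discussed above.

```python
def composite_trapezoid(f, a, b, n):
    """Composite trapezoid rule with n equal subintervals."""
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

f = lambda x: x ** 2
exact = 64.0 / 3.0                      # integral of x^2 over [0, 4]

estimates = [composite_trapezoid(f, 0.0, 4.0, 2 ** k) for k in range(1, 9)]
errors = [abs(t - exact) for t in estimates]
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]

# Error estimate for the finer of two successive results.
richardson = [abs(estimates[i] - estimates[i + 1]) / 3
              for i in range(len(estimates) - 1)]

for r, est, err in zip(ratios, richardson, errors[1:]):
    print(f"ratio {r:.4f}  estimated error {est:.3e}  true error {err:.3e}")
```

For a quadratic integrand the agreement is essentially exact; for a general smooth function the ratio only approaches four as the step size shrinks.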

Trying to find a way to construct Julia `generator`

I'm new to Julia; I mainly program in Python.
In Python, if you want to iterate over a large set of values, it is typical to construct a so-called generator to save memory.
Here is an example:
def generator(N):
    for i in range(N):
        yield i
I wonder if there is anything similar in Julia.
After reading the Julia manual, the @task macro seems to have the same (or similar) functionality as a generator in Python.
However, after some experiments, the memory usage seems to be larger than that of a plain array in Julia.
I use @time in IJulia to see the memory usage.
Here is my sample code:
[Update]: Added the code for the generator method.
(The generator method)
function generator(N::Int)
    for i in 1:N
        produce(i)
    end
end
(generator version)
function fun_gener()
    sum = 0
    g = @task generator(100000)
    for i in g
        sum += i
    end
    sum
end
@time fun_gener()
elapsed time: 0.420731828 seconds (6507600 bytes allocated)
(array version)
function fun_arry()
    sum = 0
    c = [1:100000]
    for i in c
        sum += i
    end
    sum
end
@time fun_arry()
elapsed time: 0.000629629 seconds (800144 bytes allocated)
Could anyone tell me why @task requires more space in this case?
And if I want to save memory when dealing with a large set of values, what can I do?
I recommend the "tricked out iterators" blogpost by Carl Vogel, which discusses julia's iterator protocol, tasks and co-routines in some detail.
See also task-aka-coroutines in the julia docs.
In this case you should use the Range type (which defines an iterator protocol):
julia> function fun_arry()
           sum = 0
           c = 1:100000 # remove the brackets, makes this a Range
           for i in c
               sum += i
           end
           sum
       end
fun_arry (generic function with 1 method)
julia> fun_arry() # warm up
5000050000
julia> @time fun_arry()
elapsed time: 8.965e-6 seconds (192 bytes allocated)
5000050000
Faster and less memory allocated (just like xrange in python 2).
A snippet from the blogpost:
From https://github.com/JuliaLang/julia/blob/master/base/range.jl, here’s how a Range’s iterator protocol is defined:
start(r::Ranges) = 0
next{T}(r::Range{T}, i) = (oftype(T, r.start + i*step(r)), i+1)
next{T}(r::Range1{T}, i) = (oftype(T, r.start + i), i+1)
done(r::Ranges, i) = (length(r) <= i)
Notice that the next method calculates the value of the iterator in state i. This is different from an Array iterator, which just reads the element a[i] from memory.
Iterators that exploit delayed evaluation like this can have important performance benefits. If we want to iterate over the integers 1 to 10,000, iterating over an Array means we have to allocate about 80 KB (10,000 eight-byte integers) to hold it. A Range only requires 16 bytes; the same size as the range 1 to 100,000 or 1 to 100,000,000.
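The same idea, computing each value from the iteration state instead of reading it from memory, can be sketched in Python, the language the question starts from. `LazyRange` below is a hypothetical class written for illustration; it mirrors the start/next/done methods quoted above and is compared against a fully materialized list.

```python
import sys

class LazyRange:
    """Iterator that computes each value from its state, like Range's next()."""
    def __init__(self, start, step, length):
        self.start, self.step, self.length = start, step, length

    def __iter__(self):
        i = 0                                   # start(r) = 0
        while i < self.length:                  # done(r, i)
            yield self.start + i * self.step    # next(r, i)
            i += 1

n = 100_000
total = sum(LazyRange(1, 1, n))
as_list = list(range(1, n + 1))                 # materialized, like [1:100000]

print(total)
print(sys.getsizeof(range(1, n + 1)), sys.getsizeof(as_list))
```

`sys.getsizeof` reports only the size of the container object itself, so the list figure understates its real footprint, but the contrast with the constant-size range object is already clear.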
You can write a generator method (using Tasks):
julia> function generator(n)
           for i in 1:n # Note: we're using a Range here!
               produce(i)
           end
       end
generator (generic function with 2 methods)
julia> for x in Task(() -> generator(3))
           println(x)
       end
1
2
3
Note: if you replace the Range with this, the performance is much poorer (and allocates way more memory):
julia> @time fun_arry()
elapsed time: 0.699122659 seconds (9 MB allocated)
5000050000
This question was asked (and answered) quite a while ago. Since this question is ranked high on google searches, I'd like to mention that both the question and answer are outdated.
Nowadays, I'd suggest checking out https://github.com/BenLauwens/ResumableFunctions.jl for a Julia library with a macro that implements Python-like yield generators.
using ResumableFunctions
@resumable function fibonnaci(n::Int) :: Int
    a = 0
    b = 1
    for i in 1:n-1
        @yield a
        a, b = b, a+b
    end
    a
end
for fib in fibonnaci(10)
    println(fib)
end
Since its scope is much more limited than full coroutines, it is also an order of magnitude more efficient than pushing values into a channel, since it can compile the generator into an FSM. (Channels have replaced the old produce() function mentioned in the question and previous answers.)
With that said, I'd still suggest pushing into a channel as your first approach if performance isn't an issue, because ResumableFunctions can sometimes be finicky when compiling your function and can occasionally hit some worst-case behaviour. In particular, because it is a macro that compiles to an FSM rather than a function, you currently need to annotate the types of all variables in the resumable function to get good performance, unlike vanilla Julia functions where this is handled by the JIT when the function is first called.
I think that Task has been superseded by Channel(). The usage in terms of Ben Lauwens's Fibonacci generator is:
fibonacci(n) = Channel(ctype=Int) do c
    a = 1
    b = 1
    for i in 1:n
        push!(c, a)
        a, b = b, a + b
    end
end
It can be used like this:
for a in fibonacci(10)
    println(a)
end
1
1
2
3
5
8
13
21
34
55
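For comparison with the Python generators the question started from, the channel-based Fibonacci above corresponds to the following generator. This is a minimal sketch; the function name simply mirrors the Julia one.

```python
def fibonacci(n):
    """Yield the first n Fibonacci numbers, starting 1, 1."""
    a, b = 1, 1
    for _ in range(n):
        yield a
        a, b = b, a + b

print(list(fibonacci(10)))
```

As with the channel version, values are produced one at a time on demand, so memory use does not grow with n unless the caller collects them into a list as done here.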

How to compute Fourier coefficients with MATLAB

I'm trying to compute the Fourier coefficients for a waveform using MATLAB. The coefficients can be computed using the following formulas:
T is chosen to be 1 which gives omega = 2pi.
However, I'm having issues performing the integrals. The functions are a triangle wave (which can be generated using sawtooth(t,0.5), if I'm not mistaken) as well as a square wave.
I've tried with the following code (For the triangle wave):
function [ a0,am,bm ] = test( numTerms )
    b_m = zeros(1,numTerms);
    w=2*pi;
    for i = 1:numTerms
        f1 = @(t) sawtooth(t,0.5).*cos(i*w*t);
        f2 = @(t) sawtooth(t,0.5).*sin(i*w*t);
        am(i) = 2*quad(f1,0,1);
        bm(i) = 2*quad(f2,0,1);
    end
end
However, it's not getting anywhere near the values I need. The b_m coefficients for a triangle wave are supposed to be 1/m^2 and -1/m^2 for odd m, alternating and beginning with the positive term.
The major issue for me is that I don't quite understand how integrals work in MATLAB and I'm not sure whether or not the approach I've chosen works.
Edit:
To clarify, this is the form in which I want to write the function once the coefficients have been determined:
Here's an attempt using fft:
function [ a0,am,bm ] = test( numTerms )
    T=2*pi;
    w=1;
    t = [0:0.1:2];
    f = fft(sawtooth(t,0.5));
    am = real(f);
    bm = imag(f);
    func = num2str(f(1));
    for i = 1:numTerms
        func = strcat(func,'+',num2str(am(i)),'*cos(',num2str(i*w),'*t)','+',num2str(bm(i)),'*sin(',num2str(i*w),'*t)');
    end
    y = inline(func);
    plot(t,y(t));
end
It looks to me like your problem is what sawtooth returns. The MathWorks documentation says:
sawtooth(t,width) generates a modified triangle wave where width, a scalar parameter between 0 and 1, determines the point between 0 and 2π at which the maximum occurs. The function increases from -1 to 1 on the interval 0 to 2πwidth, then decreases linearly from 1 to -1 on the interval 2πwidth to 2π. Thus a parameter of 0.5 specifies a standard triangle wave, symmetric about time instant π with peak-to-peak amplitude of 1. sawtooth(t,1) is equivalent to sawtooth(t).
So I'm guessing that's part of your problem.
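The period mismatch matters because the coefficient integrals assume one full period of the waveform on [0, T]. Since the quoted documentation gives sawtooth a period of 2π, a period-1 wave would presumably be sawtooth(2*pi*t,0.5). The Python sketch below uses a hand-written `triangle` function with that phase (an assumption: -1 at t = 0, +1 at t = 0.5) and a simple midpoint rule in place of quad; all names are illustrative.

```python
import math

def triangle(t):
    """Period-1 triangle wave: -1 at t = 0, +1 at t = 0.5."""
    t = t % 1.0
    return 4.0 * t - 1.0 if t < 0.5 else 3.0 - 4.0 * t

def fourier_coefficients(f, m, period=1.0, samples=20_000):
    """a_m and b_m by the midpoint rule over one full period."""
    w = 2.0 * math.pi / period
    h = period / samples
    a = b = 0.0
    for k in range(samples):
        t = (k + 0.5) * h
        a += f(t) * math.cos(m * w * t)
        b += f(t) * math.sin(m * w * t)
    return 2.0 / period * a * h, 2.0 / period * b * h

for m in range(1, 5):
    a_m, b_m = fourier_coefficients(triangle, m)
    print(m, round(a_m, 6), round(b_m, 6))
```

With this phase the series is a pure cosine series, with a_m = -8/(π² m²) for odd m and zero otherwise. Shifting the wave by a quarter period moves the same magnitudes into b_m with alternating signs, which matches the 1/m² pattern the question expects up to a constant factor.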
After you responded I looked into it some more. Looks to me like it's the quad function; not very accurate! I recast the problem like this:
function [ a0,am,bm ] = sotest( t, numTerms )
    bm = zeros(1,numTerms);
    am = zeros(1,numTerms);
    % 2L = 1
    L = 0.5;
    for ii = 1:numTerms
        am(ii) = (1/L)*quadl(@(x) aCos(x,ii,L),0,2*L);
        bm(ii) = (1/L)*quadl(@(x) aSin(x,ii,L),0,2*L);
    end
    ii = 0;
    a0 = (1/L)*trapz( t, t.*cos((ii*pi*t)/L) );
    % now let's test it
    y = ones(size(t))*(a0/2);
    for ii=1:numTerms
        y = y + am(ii)*cos(ii*2*pi*t);
        y = y + bm(ii)*sin(ii*2*pi*t);
    end
    figure; plot( t, y);
end
function a = aCos(t,n,L)
    a = t.*cos((n*pi*t)/L);
end
function b = aSin(t,n,L)
    b = t.*sin((n*pi*t)/L);
end
And then I called it like:
[ a0,am,bm ] = sotest( t, 100 );
and I got:
Sweetness!!!
All I really changed was from quad to quadl. I figured that out by using trapz which worked great until the time vector I was using didn't have enough resolution, which led me to believe it was a numerical issue rather than something fundamental. Hope this helps!
To troubleshoot your code I would plot the functions you are using and investigate how the quad function samples them. You might be undersampling them, so make sure your minimum step size is smaller than the period of the function by at least a factor of 10.
I would suggest using the FFTs that are built into MATLAB. Not only is the FFT the most efficient method to compute a spectrum (its cost scales as n*log(n) with the length n of the array, whereas the integral scales as n^2), it will also automatically give you the frequency points that are supported by your (equally spaced) time data. If you compute the integral yourself (which might be needed if the data points are not equally spaced), you might calculate frequency data that are not resolved (closer spacing than 1 over the spacing in time, i.e. beyond the 'Fourier limit').
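To relate a discrete transform to the a_m and b_m coefficients, the samples should cover exactly one period and the result has to be normalized by the number of samples. The Python sketch below uses a direct O(n²) DFT so that it needs only the standard library; an FFT routine would return the same values faster. The `triangle` phase and the sample count are assumptions made for illustration.

```python
import cmath
import math

def triangle(t):
    """Period-1 triangle wave: -1 at t = 0, +1 at t = 0.5."""
    t = t % 1.0
    return 4.0 * t - 1.0 if t < 0.5 else 3.0 - 4.0 * t

def dft(x):
    """Direct O(n^2) DFT; an FFT computes the same values faster."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n))
            for k in range(n)]

N = 256
samples = [triangle(j / N) for j in range(N)]    # exactly one period
X = dft(samples)

a = [2.0 * X[m].real / N for m in range(1, 4)]   # cosine coefficients
b = [-2.0 * X[m].imag / N for m in range(1, 4)]  # sine coefficients
print([round(v, 4) for v in a], [round(v, 4) for v in b])
```

The values agree with the continuous coefficients only up to a small aliasing error that shrinks as the number of samples grows, and only the bins below N/2 correspond to the harmonics m directly.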