This code in Julia:
function seq(n)
    if n < 2
        return BigInt(2)
    else
        return 1/(3 - seq(n-1))
    end
end
# and then run
[seq(n) for n=1:10]
replicates the recursive sequence U(n) = 1/(3 - U(n-1)) with U(1) = 2, and it works. But can someone explain to me how it works? For every n, does it calculate every term before it, or does the `return` store the value somewhere so it can be reused later and doesn't have to be recalculated every time?
It's just a normal recursive function: it calls itself however many times it needs to in order to compute the result. It terminates because every call chain eventually reaches the base case. There is no implicit caching of results or anything like that—it recomputes the same result however many times the function is called. If you want to remember previously calculated values, you can use the Memoize package to automatically "memoize" return values. Here's a terser version of the unmemoized function:
julia> seq(n) = n < 2 ? BigFloat(2) : 1/(3-seq(n-1))
seq (generic function with 1 method)
julia> seq(1) # trigger compilation
2.0
julia> @time [seq(n) for n=1:100];
0.001152 seconds (20.00 k allocations: 1.069 MiB)
julia> @time [seq(n) for n=1:100];
0.001365 seconds (20.00 k allocations: 1.069 MiB)
I changed it to fit on a single line and to return BigFloat(2) instead of BigInt(2) since the function returns BigFloat for larger inputs because of the division operation. Note that the second timing is no faster than the first (slower, in fact, probably because garbage collection kicks in during the second but not the first). Here's the same thing but with memoization:
julia> using Memoize
julia> @memoize seqm(n) = n < 2 ? BigFloat(2) : 1/(3-seqm(n-1))
seqm (generic function with 1 method)
julia> seqm(1) # trigger compilation
2.0
julia> @time [seqm(n) for n=1:100];
0.000071 seconds (799 allocations: 36.750 KiB)
julia> @time [seqm(n) for n=1:100];
0.000011 seconds (201 allocations: 4.000 KiB)
The first timing is significantly faster than the unmemoized version, even though the memoization cache starts out empty, because the same computation is done many times and memoization avoids repeating it after the first time. The second timing is even faster because all 100 computed values are already cached and can simply be returned.
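For anyone curious what the macro does, here is a minimal hand-rolled sketch of the same idea using a Dict as a cache (the names cache and seq_cached are made up for this illustration; Memoize.jl handles this bookkeeping for you):

const cache = Dict{Int,BigFloat}()

function seq_cached(n)
    # get! looks up n in the cache; only if it is missing does it run the
    # do-block, store the result, and return it.
    get!(cache, n) do
        n < 2 ? BigFloat(2) : 1/(3 - seq_cached(n-1))
    end
end

[seq_cached(n) for n in 1:100]  # each value is computed at most once

Every later call with the same n is just a dictionary lookup.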
Running Octave 6.3.0 for Windows. I need to get the smallest eigenvalue of some matrix. eigs(A,1,"sm") is supposed to do that, but I often get wrong results with singular matrices.
eigs(A) (which by default returns the first 6 eigenvalues/eigenvectors, here all three) is correct (at least to machine precision):
>> A = [[1 1 1];[1 1 1];[1 1 1]]
A =
1 1 1
1 1 1
1 1 1
>> [v lambda flag] = eigs(A)
v =
0.5774 -0.3094 -0.7556
0.5774 -0.4996 0.6458
0.5774 0.8091 0.1098
lambda =
Diagonal Matrix
3.0000e+00 0 0
0 -4.5198e-16 0
0 0 -1.5831e-17
flag = 0
But eigs(A,1,"sm") is not:
>> [v lambda flag] = eigs(A,1,"sm")
warning: eigs: 'A - sigma*B' is singular, indicating sigma is exactly an eigenvalue so convergence is not guaranteed
warning: called from
eigs at line 298 column 20
warning: matrix singular to machine precision
warning: called from
eigs at line 298 column 20
warning: matrix singular to machine precision
warning: called from
eigs at line 298 column 20
warning: matrix singular to machine precision
warning: called from
eigs at line 298 column 20
warning: matrix singular to machine precision
warning: called from
eigs at line 298 column 20
v =
-0.7554
0.2745
0.5950
lambda = 0.4322
flag = 0
Not only is the returned eigenvalue wrong, but the returned flag is zero, indicating that everything went right in the function...
Is this a wrong usage of eigs() (from the doc I can't see what would be wrong) or a bug?
EDIT: if not a bug, it is at least a design issue... Also, there is no problem when requesting the 2 smallest values instead of the smallest value alone.
>> eigs(A,2,"sm")
ans =
-1.7700e-17
-5.8485e-16
EDIT 2: the eigs() function in Matlab Online runs fine and returns the correct results (to machine precision)
>> A=ones(3)
A =
1 1 1
1 1 1
1 1 1
>> [v lambda flag] = eigs(A,1,"smallestabs")
v =
-0.7556
0.6458
0.1098
lambda =
-1.5831e-17
flag =
0
After more tests and investigations I think I can answer that yes, Octave eigs() has some flaw.
eigs(A,1,"sm") likely uses the inverse power iteration method, that is, repeatedly solving y = A\x, then setting x = y, starting from an arbitrary vector x. Obviously there's a problem if A is singular. However:
Matlab eigs() runs fine in such a case, and returns the correct eigenvalue (to machine precision). I don't know what it does, maybe adding a tiny value on the diagonal if the matrix is detected as singular, but it does something better (or at least different) than Octave.
If for some (good or bad) reason Octave's algorithm cannot handle a singular matrix, then this should be reflected in the 3rd return argument ("flag"). Instead, it is always zero, as if everything went OK.
eigs(A,1,"sm") is actually equivalent to eigs(A,1,0), and the more general syntax is eigs(A,1,sigma), which means "find the eigenvalue closest to sigma, and the associated eigenvector". For this, the inverse power iteration method is applied to the matrix A - sigma*I. Problem: if sigma is already an exact eigenvalue, this matrix is singular by definition. Octave eigs() fails in this case, while Matlab eigs() succeeds. It's kind of weird to have a failure when one knows the exact eigenvalue in advance, or hits it by chance. So the right thing to do in Octave is to test whether (A - sigma*I) is singular, and if so add a tiny value to sigma: eigs(A,1,sigma+eps*norm(A)). Matlab eigs() probably does something like that.
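To make that concrete, here is a rough sketch of shifted inverse power iteration in Julia (this is only an illustration of the idea, not Octave's or Matlab's actual implementation; the nearest_eig name is made up, and the perturbation of sigma by eps*norm(A) follows the suggestion above):

using LinearAlgebra

# Find the eigenvalue of A closest to sigma via shifted inverse power iteration.
# If sigma happens to be an exact eigenvalue, A - sigma*I is singular, so we
# nudge sigma by a tiny amount before factorizing.
function nearest_eig(A, sigma; iters=50)
    n = size(A, 1)
    if rank(A - sigma*I) < n
        sigma += eps() * norm(A)
    end
    F = lu(A - sigma*I)
    x = normalize(randn(n))
    for _ in 1:iters
        x = normalize(F \ x)        # one inverse-iteration step
    end
    dot(x, A*x)                     # Rayleigh quotient: eigenvalue estimate
end

nearest_eig(ones(3, 3), 0.0)        # roughly 0 to machine precision

With sigma = 0 and A singular, the unperturbed factorization would hit the same "matrix singular to machine precision" situation that Octave warns about; the tiny shift keeps the factorization well defined while barely moving the target eigenvalue.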
I try to understand when events are generated in Modelica. In the context of functions I noticed behaviour I didn't expect: functions appear to suppress event generation.
I am surprised as this is, to my knowledge, not explicitly stated in the Modelica reference.
For example, if I run this model in OMEdit of OpenModelica 1.17.0
model timeEventTest
  Real z(start=0);
  Real dummy(start=0);
equation
  der(z) = dummy;
algorithm
  if time > 10 then
    dummy := 1;
  else
    dummy := -1.;
  end if;
end timeEventTest;
I get the following output in the solver window of OMEdit
### STATISTICS ###
timer
events
1 state events
0 time events
solver: dassl
46 steps taken
46 calls of functionODE
44 evaluations of jacobian
0 error test failures
0 convergence test failures
0.000122251s time of jacobian evaluation
The simulation finished successfully.
Apart from the fact that the solver (I used dassl) interpreted the event at time=10 as a state event rather than a time event, the behaviour is as expected.
However, if I instead run the (mathematically identical) model
model timeEventTest2
  Real z(start=0);
equation
  der(z) = myfunc(time-10);
end timeEventTest2;
with myfunc defined as
function myfunc
  input Real x;
  output Real y;
algorithm
  if x > 0 then
    y := 1;
  else
    y := -1;
  end if;
end myfunc;
I obtain the following output in OMEdit
### STATISTICS ###
timer
events
0 state events
0 time events
solver: dassl
52 steps taken
79 calls of functionODE
63 evaluations of jacobian
13 error test failures
0 convergence test failures
0.000185296s time of jacobian evaluation
The simulation finished successfully.
Not only is the event at time = 10 NOT detected, the solver even got into some trouble as indicated by the error test failures.
This is a trivial example; however, I can imagine that the apparent suppression of events by functions may result in major problems in larger models.
What did I miss here? Can I enforce the strict triggering of events within functions?
Some built-in functions also trigger events, e.g. div and mod (curiously, sign and abs don't).
Edit: Obviously, you need to run the examples at least to a time > 10s. I ran the simulations to 20s.
Functions in Modelica normally do not generate events.
See 8.5 Events and Synchronization in the Modelica Spec.
All equations and assignment statements within when-clauses and all assignment statements within function classes are implicitly treated with noEvent, i.e., relations within the scope of these operators never induce state or time events.
But it's possible to change this behavior:
Add the annotation GenerateEvents=true to the function.
However, it seems this alone is not sufficient for some Modelica simulators (tested with OpenModelica v1.16.5 and Dymola 2021x).
To make it work in OpenModelica and Dymola, you either have to add the Inline annotation as well or assign the function output in a single line.
So if you re-write your function as follows, you will get a state event:
function myfunc
  input Real x;
  output Real y;
algorithm
  y := if x > 0 then 1 else -1;
  annotation (GenerateEvents=true);
end myfunc;
or by additionally adding the Inline annotation:
function myfunc
  input Real x;
  output Real y;
algorithm
  if x > 0 then
    y := 1;
  else
    y := -1;
  end if;
  annotation (GenerateEvents=true, Inline=true);
end myfunc;
State event vs time event
To turn the state event in timeEventTest into a time event, change the if-condition to
if time >= 10 then
This is also covered in chapter 8.5 Events and Synchronization of the Modelica Spec. Only the following two cases trigger time events:
time >= discrete expression
time < discrete expression
I want to have a function that I can call from a struct. For this, I'm trying to mimic (to an extent) C++ class methods in Julia. To achieve this, I add a function field to a Julia struct and assign it the function object I pass in at the constructor stage.
The problem is, it works, but the approach is literally 1000 times slower than just directly calling a function.
Below is a MWE of my code:
struct struct_wh_method{F}
    func::F
    function struct_wh_method(func_in)
        new{typeof(func_in)}(func_in)
    end
end
fun() = 1+1;
Now, instantiating the struct object:
A = struct_wh_method(fun);
Next, loading BenchmarkTools
using BenchmarkTools
I finally compare the performance between A.func() and fun():
@btime A.func()
35.583 ns (0 allocations: 0 bytes)
@btime fun()
0.035 ns (0 allocations: 0 bytes)
Is there a way to have the function call more efficient? I have a feeling that I'm doing something terribly wrong. Perhaps, this is fundamentally the incorrect way of using Julia, in which case I would greatly appreciate anyone guiding me to the elegant and high performance "Julian" way of achieving a similar goal. I greatly appreciate the help of the stack overflow community.
Cheers.
What is taking long in your example is not the call to the function itself, but accessing the field of the struct. I.e. a struct with an Int64 field takes just as long to access as one storing the function. As soon as you put some code in the function that actually does something, there won't be a recognizable difference anymore.
Here some examples:
using BenchmarkTools

struct MyStruct
    F::Function
end

struct MyStructInt
    I::Int64
end

easy_f() = 1

function hard_f()
    count = 0.0
    for i in rand(100000)
        count += i
    end
    count   # return the accumulated sum
end

mseasy = MyStruct(easy_f)
mshard = MyStruct(hard_f)
msint = MyStructInt(1)
I = 1

@btime mseasy.F()
# 29.826 ns (1 allocation: 16 bytes)
@btime easy_f()
# 0.026 ns (0 allocations: 0 bytes)
@btime mshard.F()
# 70.977 μs (3 allocations: 781.34 KiB)
@btime hard_f()
# 69.223 μs (2 allocations: 781.33 KiB)
@btime msint.I
# 29.282 ns (1 allocation: 16 bytes)
@btime I
# 1.539 ns (0 allocations: 0 bytes)
It is remarkable that getting the value of the integer takes longer than calling the easy_f function. I guess the reason is that the compiler is doing a great job at optimizing the function. (Maybe the value is even stored in the CPU cache?)
However, you can still get a slight improvement if, instead of calling the field of the struct directly, you define a function that does that (which is the usual Julia style).
For example like this:
callfunc(ms::MyStruct) = ms.F()
@btime callfunc(mseasy)
#8.606 ns (0 allocations: 0 bytes)
The difference is in the time to look up your struct. If you interpolate the variable in the @btime call (note the $ below), you get the same time:
julia> @btime $A.func()
0.036 ns (0 allocations: 0 bytes)
2
julia> @btime fun()
0.036 ns (0 allocations: 0 bytes)
2
I'd say there are two relatively separate concerns in your question. The first one is how to reliably perform such microbenchmarks. The second one is how to achieve what you want: store a function in a struct without degrading performances.
Consider the following examples, which I think may help understand what goes on here.
If the benchmarked function is too simple, the compiler will be able to actually optimize the code away and simply replace it with a pre-computed result. This usually yields sub-nanosecond benchmarks, which is a good sign that something went wrong: with CPU frequencies being in the GHz range these days, any computation that takes much less than a nanosecond is suspiciously fast.
julia> too_simple(x) = x + 1
too_simple (generic function with 1 method)
julia> @btime too_simple(2)
0.026 ns (0 allocations: 0 bytes)
3
So let's first take a complex enough function for the compiler to not be able to optimize its code away. And let's call it with small enough data that we stay in the nanosecond range. My personal favorite is the sum of all elements in a vector (preferably with floating-point numbers so that the compiler can't make as many optimizations as with integer types). Note that global variables passed to benchmarked functions should be interpolated in @btime. Summing a few elements takes a few nanoseconds, so this looks like a good base for our benchmark: we actually measure something significant, but small enough that any perturbation should be visible:
julia> function fun(x)
acc = 0.0
for elt in x
acc += elt
end
acc
end
fun (generic function with 1 method)
julia> x = rand(8);
julia> using BenchmarkTools
# x is a global variable => interpolate it with $x
julia> @btime fun($x)
5.454 ns (0 allocations: 0 bytes)
3.125754440231318
Now, let's naively try to embed the function into a struct:
julia> struct Bar
func::Function
end
julia> b = Bar(fun)
Bar(fun)
# Both `b` and `x` are global variables => interpolate them
julia> @btime $b.func($x)
22.289 ns (1 allocation: 16 bytes)
3.125754440231318
Not only have we lost some time, but there also was a memory allocation. Of course, if the payload in fun had been larger, we wouldn't have seen anything. But still, this is not the cost-less abstraction one might have hoped for.
The problem here is due to the fact that the func field in Bar is not concretely typed: in Julia, each function has its own specific type (although the types of all functions are subtypes of the Function abstract type). The compiler doesn't know much about it and can't make too many optimizations beforehand: it has to wait until you actually extract the func field from object b, in order to check exactly what function this is.
What you proposed in your question actually solves this by embedding the concrete type of the function as a type parameter. Note how the type of f in the example below embeds fun itself; this allows the compiler to know about fun as soon as the type of f is known (i.e. during Just-Ahead-of-Time compilation).
julia> struct Foo{F}
func::F
end
julia> f = Foo(fun)
Foo{typeof(fun)}(fun)
julia> typeof(f)
Foo{typeof(fun)}
julia> @btime $f.func($x)
5.055 ns (0 allocations: 0 bytes)
3.125754440231318
Now we get the same performance as before.
In conclusion, I'd say that if you can use such a parameterized type (i.e. if you can afford two instances of your structure to have two separate types if they store different functions) then such an approach should be fine. Still, all this does not seem very Julian; you might want to consider other approaches. Maybe ask another question explaining the problem you were trying to solve with such an approach?
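One such alternative, sketched here as an aside rather than taken from the answers above, is to make the parametric struct itself callable (a "functor"), which reads a bit more like a C++ method call while keeping the concrete type parameter and therefore the performance:

# Calling an instance of Foo forwards to the stored function.
(f::Foo)(args...) = f.func(args...)

f = Foo(fun)
f(x)            # equivalent to f.func(x); F is concrete, so no dynamic dispatch

Whether this is clearer than a plain function taking the struct as an argument is mostly a matter of taste.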
I called the cublas_Sgemm_v2 function 10236 times with the first matrix non-transposed and the second transposed. However, in the nvprof results, I saw three items produced from that function call. The (m, n, k) values passed to the function are (588, 588, 20).
These are the items listed in the nvprof results:
Time(%) Time Calls Avg Min Max Name
12.32% 494.86ms 10236 48.344us 47.649us 49.888us sgemm_sm35_ldg_nt_128x8x128x16x16
8.64% 346.91ms 10236 33.890us 32.352us 35.488us sgemm_sm35_ldg_nt_64x16x128x8x32
8.11% 325.63ms 10236 31.811us 31.360us 32.512us sgemm_sm35_ldg_nt_128x16x64x16x16
Is this expected and why is that? Can someone explain what the values in the function names such as sgemm_sm35_ldg_nt_128x8x128x16x16 mean?
I also have other calls to cublas_Sgemm_v2 with different transpose settings, and I only see one item per function call.
UPDATE:
As @Marco13 asked, I put more results here:
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
Resulted from 7984 calls with (Trans, NonTrans) with (m, n, k) = (588, 100, 588)
20.84% 548.30ms 7984 68.675us 58.977us 81.474us sgemm_sm35_ldg_tn_32x16x64x8x16
Resulted from 7984 calls with (NonTrans, NonTrans) with (m, n, k) = (588, 100, 588)
12.95% 340.71ms 7984 42.674us 21.856us 64.514us sgemm_sm35_ldg_nn_64x16x64x16x16
All the following resulted from 3992 calls with (NonTrans, Trans) with (m, n, k) = (588, 588, 100)
9.81% 258.15ms 3992 64.666us 61.601us 68.642us sgemm_sm35_ldg_nt_128x8x128x16x16
6.84% 179.90ms 3992 45.064us 40.097us 49.505us sgemm_sm35_ldg_nt_64x16x128x8x32
6.33% 166.51ms 3992 41.709us 38.304us 61.185us sgemm_sm35_ldg_nt_128x16x64x16x16
Another run with 588 changed to 288:
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
Resulted from 7984 calls with (Trans, NonTrans) with (m, n, k) = (288, 100, 288)
22.01% 269.11ms 7984 33.706us 30.273us 39.232us sgemm_sm35_ldg_tn_32x16x64x8x16
Resulted from 7984 calls with (NonTrans, NonTrans) with (m, n, k) = (288, 100, 288)
14.79% 180.78ms 7984 22.642us 18.752us 26.752us sgemm_sm35_ldg_nn_64x16x64x16x16
Resulted from 3992 calls with (NonTrans, Trans) with (m, n, k) = (288, 288, 100)
7.43% 90.886ms 3992 22.766us 19.936us 25.024us sgemm_sm35_ldg_nt_64x16x64x16x16
From the last three lines it looks like certain transposition settings can be more efficient than others, and certain matrix sizes are more economical in terms of computation time relative to matrix size. What is the guideline for ensuring economical computation?
UPDATE 2:
For the case of (m, n, k) = (588, 100, 588) above, I manually transposed the matrix before calling the sgemm function, so there is only one item in the nvprof result. The time it takes is only a little less than the sum of the two items in the table above, so there is not much performance gain from doing so.
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
31.65% 810.59ms 15968 50.763us 21.505us 72.098us sgemm_sm35_ldg_nn_64x16x64x16x16
Sorry, not an answer - but slightly too long for a comment:
Concerning the edit, about the influence of the "transpose" state: Transposing a matrix might cause an access pattern that is worse in terms of memory coalescing. A quick web search brings up some results about this ( https://devtalk.nvidia.com/default/topic/528450/cuda-programming-and-performance/cublas-related-question/post/3734986/#3734986 ), but with a slightly different setup than yours:
DGEMM performance on a K20c
args: ta=N tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13280010 sec GFLOPS=1034.93
args: ta=T tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13872910 sec GFLOPS=990.7
args: ta=N tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.12521601 sec GFLOPS=1097.61
args: ta=T tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13652611 sec GFLOPS=1006.69
In this case, the differences do not seem worth the hassle of changing the matrix storage (e.g. from column-major to row-major, to avoid transposing the matrix), because all patterns seem to run at a similar speed. But your mileage may vary; particularly, the difference in your tests between (t,n) and (n,n) is very large (548ms vs 340ms), which I found quite surprising. If you have the choice to easily switch between various representations of the matrix, then a benchmark covering all four cases may be worthwhile.
In any case, regarding your question about the functions that are called there: The CUBLAS code for the sgemm function in CUBLAS 1.1 was already full of unrolled loops and already contained 80 (!) versions of the sgemm function for different cases, assembled using a #define-hell. It has to be assumed that this has become even more unreadable in the newer CUBLAS versions, where the newer compute capabilities have to be taken into account, and the function names that you found there indicate that this is indeed the case:
sgemm_sm35_ldg_nt_64x16x128x8x32
sm35 : Runs on a device with compute capability 3.5
ldg : ? Non-texture-memory version ? (CUBLAS 1.1 contained functions called sgemm_main_tex_* which worked on texture memory, and functions sgemm_main_gld_* which worked on normal, global memory)
nt : First matrix is Not transposed, second one is Transposed
64x16x128x8x32 - Probably related to tile sizes, maybe shared memory etc...
Still, I think it's surprising that a single call to sgemm causes three of these internal functions to be called. But as mentioned in the comment: I assume that they try to handle the "main" part of the matrix with a specialized, efficient version, and "border tiles" with one that is capable of doing range checks and/or cope with warps that are not full. (Not very precise, just to be suggestive: A matrix of size 288x288 could be handled by an efficient core for matrices of size 256x256, and two calls for the remaining 32x288 and 288x32 entries).
But all this is also the reason why I guess there can hardly be a general guideline concerning the matrix sizes: The "best" matrix size in terms of computation time over matrix size will at least depend on
the hardware version (compute capability) of the target system
the transposing-flags
the CUBLAS version
EDIT Concerning the comment: One could imagine that there should be a considerable difference between the transposed and the non-transposed processing. When multiplying two matrices
a00 a01 a02 b00 b01 b02
a10 a11 a12 * b10 b11 b12
a20 a21 a22 b20 b21 b22
Then the first element of the result will be
a00 * b00 + a01 * b10 + a02 * b20
(which simply is the dot product of the first row of a and the first column of b). For this computation one has to read consecutive values from a. But the values that are read from b are not consecutive. Instead, they are "the first value in each row". One could think that this would have a negative impact on memory coalescing. But for sure, the NVIDIA engineers have tried hard to avoid any negative impact here, and the implementation of sgemm in CUBLAS is far, far away from "a parallel version of the naive 3-nested-loops implementation" where this access pattern would have such an obvious drawback.
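As a loose CPU-side analogy (in Julia, not CUDA, and only meant to illustrate the contiguous-versus-strided point, not GPU coalescing itself), summing a column of a column-major matrix touches consecutive memory, while summing a row jumps by the column length at every step; the sum_col and sum_row names are made up for this sketch:

using BenchmarkTools

A = rand(4096, 4096)            # Julia stores matrices column-major

sum_col(A) = sum(@view A[:, 1]) # contiguous: stride 1 in memory
sum_row(A) = sum(@view A[1, :]) # strided: jumps 4096 elements between reads

@btime sum_col($A)
@btime sum_row($A)              # typically noticeably slower on most machines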
I have a basic question. In the code below, I am calling the same function, 'add', twice. When I do this using OpenMP, I'm getting incorrect results.
program p
   integer :: i, j, omp_get_thread_num, n
   real :: suma
   i = 5
   j = 10
!$omp parallel num_threads(2) private(n)
   n = omp_get_thread_num()
   if (n == 0) goto 1111
   suma = add(i, n)
   write(*,*) 'sum for 5=', suma, n, i
   goto 1000
1111 suma = add(j, n)
   write(*,*) 'sum for 10=', suma, n, j
1000 continue
!$omp end parallel
end program p
!----------------------------------------
function add(k, n) result(l)
   implicit none
   integer :: k, s, n
   real :: l1, l
   !write(*,*) 'thread employing me is:', n
   l1 = 0.0
   do s = k, k+5
      l1 = l1 + s
   end do
   l = l1
   return
end function add
The result of executing this code is:
sum for 10= 45.0000000 0 10
sum for 5= 45.0000000 1 5
However, when I uncomment line 22, i.e. !write(*,*)'thread employing me is:',n
the result is:
thread employing me is: 0
sum for 10= 75.0000000 0 10
thread employing me is: 1
sum for 5= 45.0000000 1 5
What should I do in order to call the same function from different threads correctly (i.e. without mixing up the variables)? Can anyone explain the results obtained?
This is a simplified version of my actual problem (where I'm using the same function in multiple threads).
Edit: OK, I've realized the very silly mistake of not including 'suma' in the private list. But still, can someone tell me why, if line 22 is uncommented, it always gives the correct result, even though suma is not made private?
There is a data race condition in your program: suma is shared (by the implicit data-sharing rules of OpenMP) and both threads assign to it at the same time. Uncommenting the write statement results in a slight offset in the execution of the second thread and therefore hides the race condition (it doesn't on my OS X; it just makes the program randomly print 45.0 twice or 75.0 twice).
!$omp parallel num_threads(2) private(n, suma)
...
!$omp end parallel
Besides that, you should really really really use OpenMP sections instead of the goto logic that you have employed:
!$omp parallel num_threads(2) private(n, suma)
   n = omp_get_thread_num()
!$omp sections
   suma = add(i, n)
   write(*,*) 'sum for 5=', suma, n, i
!$omp section
   suma = add(j, n)
   write(*,*) 'sum for 10=', suma, n, j
!$omp end sections
!$omp end parallel