Poor performance of function attribute inside a struct in Julia

I want to have a function that I can call from a struct. To do this, I'm trying to mimic (to an extent) C++ class methods in Julia: I add a function field to a Julia struct and assign it to a function object passed in at the constructor stage.
The problem is that, although it works, the approach is literally 1000 times slower than calling the function directly.
Below is an MWE of my code:
struct struct_wh_method{F}
    func::F
    function struct_wh_method(func_in)
        new{typeof(func_in)}(func_in)
    end
end
fun() = 1+1;
Now, instantiating the struct object:
A = struct_wh_method(fun);
Next, I load BenchmarkTools:
using BenchmarkTools
I finally compare the performance between A.func() and fun():
@btime A.func()
35.583 ns (0 allocations: 0 bytes)
@btime fun()
0.035 ns (0 allocations: 0 bytes)
Is there a way to make the function call more efficient? I have a feeling I'm doing something terribly wrong. Perhaps this is fundamentally the wrong way to use Julia, in which case I would greatly appreciate anyone pointing me to the elegant, high-performance "Julian" way of achieving a similar goal. I greatly appreciate the help of the Stack Overflow community.
Cheers.

What takes long in your example is not the call to the function itself but accessing the field of the struct: a struct holding an Int64 takes just as long to access as one holding the function. As soon as you put some code in the function that actually does something, there won't be a recognizable difference anymore.
Here are some examples:
using BenchmarkTools
struct MyStruct
    F::Function
end
struct MyStructInt
    I::Int64
end
easy_f() = 1
function hard_f()
    count = 0.0
    for i in rand(100000)
        count += i
    end
    count
end
mseasy = MyStruct(easy_f)
mshard = MyStruct(hard_f)
msint = MyStructInt(1)
I = 1
@btime mseasy.F()
#29.826 ns (1 allocation: 16 bytes)
@btime easy_f()
#0.026 ns (0 allocations: 0 bytes)
@btime mshard.F()
#70.977 μs (3 allocations: 781.34 KiB)
@btime hard_f()
#69.223 μs (2 allocations: 781.33 KiB)
@btime msint.I
#29.282 ns (1 allocation: 16 bytes)
@btime I
#1.539 ns (0 allocations: 0 bytes)
Remarkably, getting the value of the integer takes longer than getting the value of the easy_f function. I guess the reason is that the compiler is doing a great job of optimizing the function. (Maybe the value is even stored in the CPU cache?)
However, you can still get a slight improvement if, instead of accessing the field of the struct and calling it directly, you define a function that does it for you (which is the usual Julia style). For example:
callfunc(ms::MyStruct) = ms.F()
@btime callfunc(mseasy)
#8.606 ns (0 allocations: 0 bytes)

The difference is in the time needed to look up the global variable holding your struct. If you interpolate the variable into the @btime call (note the $ below), you get the same time:
julia> @btime $A.func()
0.036 ns (0 allocations: 0 bytes)
2
julia> @btime fun()
0.036 ns (0 allocations: 0 bytes)
2
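If you want to take the field access out of the measurement entirely, you can also interpolate the extracted function object itself (an extra illustration added here, not part of the original answer):
julia> @btime $(A.func)()   # interpolate the function object, so only the call is timed
With A concretely parameterized on typeof(fun), this should benchmark the same as fun() itself.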

I'd say there are two relatively separate concerns in your question. The first is how to reliably perform such microbenchmarks. The second is how to achieve what you want: store a function in a struct without degrading performance.
Consider the following examples, which I think may help in understanding what goes on here.
If the benchmarked function is too simple, the compiler will be able to optimize the code away entirely and simply replace it with a pre-computed result. This usually yields sub-nanosecond benchmarks, which is a good sign that something went wrong: with CPU frequencies in the GHz range these days, any computation that takes much less than a nanosecond is suspiciously fast.
julia> too_simple(x) = x + 1
too_simple (generic function with 1 method)
julia> @btime too_simple(2)
0.026 ns (0 allocations: 0 bytes)
3
So let's first take a function complex enough that the compiler cannot optimize its code away, and let's call it with small enough data that we stay in the nanosecond range. My personal favorite is the sum of all elements in a vector (preferably with floating-point numbers, so that the compiler can't make as many optimizations as with integer types). Note that global variables passed to benchmarked functions should be interpolated in @btime. Summing a few elements takes a few nanoseconds, so this looks like a good base for our benchmark: we actually measure something significant, but small enough that any perturbation should be visible:
julia> function fun(x)
           acc = 0.0
           for elt in x
               acc += elt
           end
           acc
       end
fun (generic function with 1 method)
julia> x = rand(8);
julia> using BenchmarkTools
# x is a global variable => interpolate it with $x
julia> @btime fun($x)
5.454 ns (0 allocations: 0 bytes)
3.125754440231318
Now, let's naively try to embed the function into a struct:
julia> struct Bar
           func::Function
       end
julia> b = Bar(fun)
Bar(fun)
# Both `b` and `x` are global variables => interpolate them
julia> @btime $b.func($x)
22.289 ns (1 allocation: 16 bytes)
3.125754440231318
Not only have we lost some time, but there was also a memory allocation. Of course, if the payload in fun had been larger, we wouldn't have noticed anything. But still, this is not the cost-free abstraction one might hope for.
The problem here is that the func field in Bar is not concretely typed: in Julia, each function has its own specific type (although the types of all functions are subtypes of the Function abstract type). The compiler doesn't know much about it and can't make many optimizations beforehand: it has to wait until you actually extract the func field from object b in order to check exactly which function it is.
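A quick way to see this abstractness (an illustrative check added here, not part of the original answer) is to ask for the declared field type:
julia> fieldtype(Bar, :func)   # the compiler only knows it is *some* function
Function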
What you proposed in your question actually solves this by embedding the concrete type of the function as a type parameter. Note how the type of f in the example below embeds fun itself; this allows the compiler to know about fun as soon as the type of f is known (i.e. during Just-Ahead-of-Time compilation).
julia> struct Foo{F}
           func::F
       end
julia> f = Foo(fun)
Foo{typeof(fun)}(fun)
julia> typeof(f)
Foo{typeof(fun)}
julia> @btime $f.func($x)
5.055 ns (0 allocations: 0 bytes)
3.125754440231318
Now we get the same performance as before.
In conclusion, I'd say that if you can use such a parameterized type (i.e. if you can afford two instances of your structure having two separate types when they store different functions), then such an approach should be fine. Still, all this does not seem very Julian; you might want to consider other approaches. Maybe ask another question explaining the problem you were trying to solve with such an approach?
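For completeness, a common Julian alternative is to make the parameterized struct itself callable (a functor). This is a sketch added here for illustration; the Wrapper name is hypothetical, and it reuses the zero-argument fun from the question:
struct Wrapper{F}
    func::F
end

# Calling an instance forwards to the stored function; because F is a concrete
# type parameter, the compiler can fully specialize the call.
(w::Wrapper)(args...; kwargs...) = w.func(args...; kwargs...)

fun() = 1 + 1
A = Wrapper(fun)
A()   # returns 2, with the same performance as calling fun() directly
This reads like a method call on the object while keeping the field concretely typed.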

Related

Octave eigs() function bugged?

Running Octave 6.3.0 for Windows. I need to get the smallest eigenvalue of some matrix. eigs(A,1,"sm") is supposed to do that, but I often get wrong results with singular matrices.
eigs(A) (which returns the first 6 eigenvalues/vectors) is correct (at least to machine precision):
>> A = [[1 1 1];[1 1 1];[1 1 1]]
A =
1 1 1
1 1 1
1 1 1
>> [v lambda flag] = eigs(A)
v =
0.5774 -0.3094 -0.7556
0.5774 -0.4996 0.6458
0.5774 0.8091 0.1098
lambda =
Diagonal Matrix
3.0000e+00 0 0
0 -4.5198e-16 0
0 0 -1.5831e-17
flag = 0
But eigs(A,1,"sm") is not:
>> [v lambda flag] = eigs(A,1,"sm")
warning: eigs: 'A - sigma*B' is singular, indicating sigma is exactly an eigenvalue so convergence is not guaranteed
warning: called from
eigs at line 298 column 20
warning: matrix singular to machine precision
warning: called from
eigs at line 298 column 20
warning: matrix singular to machine precision
warning: called from
eigs at line 298 column 20
warning: matrix singular to machine precision
warning: called from
eigs at line 298 column 20
warning: matrix singular to machine precision
warning: called from
eigs at line 298 column 20
v =
-0.7554
0.2745
0.5950
lambda = 0.4322
flag = 0
Not only is the returned eigenvalue wrong, but the returned flag is zero, indicating that everything went right in the function...
Is it a wrong usage of eigs() (but from the doc I can't see what is wrong) or a bug?
EDIT: if not a bug, at least a design issue... There is also no problem when requesting the 2 smallest values instead of the smallest value alone.
>> eigs(A,2,"sm")
ans =
-1.7700e-17
-5.8485e-16
EDIT 2: the eigs() function in Matlab Online runs fine and returns the correct result (at machine precision):
>> A=ones(3)
A =
1 1 1
1 1 1
1 1 1
>> [v lambda flag] = eigs(A,1,"smallestabs")
v =
-0.7556
0.6458
0.1098
lambda =
-1.5831e-17
flag =
0
After more tests and investigations I think I can answer that yes, Octave eigs() has some flaw.
eigs(A,1,"sm") likely uses the inverse power iteration method, that is, repeatedly solving y = A\x, then setting x = y, starting with an arbitrary x vector. Obviously there's a problem if A is singular. However:
Matlab eigs() runs fine in such a case and returns the correct eigenvalue (at machine precision). I don't know what it does, maybe adding a tiny value to the diagonal if the matrix is detected as singular, but it does something better (or at least different) than Octave.
If for some (good or bad) reason Octave's algorithm cannot handle a singular matrix, then this should be reflected in the 3rd return argument ("flag"). Instead, it is always zero as if everything went OK.
eigs(A,1,"sm") is actually equivalent to eigs(A,1,0), and the more general syntax is eigs(A,1,sigma), which means "find the eigenvalue closest to sigma, and the associated eigenvector". For this, the inverse power iteration method is applied with the matrix A-sigma*I. Problem: if sigma is already an exact eigenvalue, this matrix is singular by definition. Octave eigs() fails in this case, while Matlab eigs() succeeds. It's kind of weird to have a failure when one knows the exact eigenvalue in advance, or hits it by chance. So the right thing to do in Octave is to test whether (A-sigma*I) is singular, and if so, add a tiny value to sigma: eigs(A,1,sigma+eps*norm(A)). Matlab eigs() probably does something like that.
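To make the algorithm under discussion concrete, here is a rough sketch of shifted inverse power iteration, written in Julia purely for illustration (an assumption about the method, not Octave's actual implementation), including the sigma perturbation suggested above:
using LinearAlgebra

# Sketch: find the eigenvalue of A closest to sigma by inverse power iteration.
function inverse_power_iteration(A, sigma; iters = 50)
    shift = sigma + eps() * norm(A)        # perturb so A - shift*I is not exactly singular
    F = lu(A - shift * I)                  # factor once, reuse in every iteration
    x = normalize(rand(size(A, 1)))
    for _ in 1:iters
        x = normalize(F \ x)               # solve (A - shift*I) y = x, then renormalize
    end
    return dot(x, A * x), x                # Rayleigh quotient ≈ eigenvalue closest to sigma
end

A = ones(3, 3)                             # the singular matrix from the question
lambda, v = inverse_power_iteration(A, 0.0)   # lambda ≈ 0, at machine precision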

How does this recursive function in Julia work?

This code in Julia:
function seq(n)
    if n < 2
        return BigInt(2)
    else
        return 1/(3-seq(n-1))
    end
end
# and then run
[seq(n) for n=1:10]
replicates the recursive sequence Un = 1/(3 - U(n-1)) where U1 = 2, and it works. But can someone explain to me how it works? For every n, does it calculate every term before it, or does the "return" store the result somewhere so that it can be reused when needed and every term doesn't have to be recalculated each time?
It's just a normal recursive function: it calls itself however many times it needs to in order to compute the result. It terminates because every call chain eventually reaches the base case. There is no implicit caching of results or anything like that—it recomputes the same result however many times the function is called. If you want to remember previously calculated values, you can use the Memoize package to automatically "memoize" return values. Here's a terser version of the unmemoized function:
julia> seq(n) = n < 2 ? BigFloat(2) : 1/(3-seq(n-1))
seq (generic function with 1 method)
julia> seq(1) # trigger compilation
2.0
julia> @time [seq(n) for n=1:100];
0.001152 seconds (20.00 k allocations: 1.069 MiB)
julia> @time [seq(n) for n=1:100];
0.001365 seconds (20.00 k allocations: 1.069 MiB)
I changed it to fit on a single line and to return BigFloat(2) instead of BigInt(2) since the function returns BigFloat for larger inputs because of the division operation. Note that the second timing is no faster than the first (slower, in fact, probably because garbage collection kicks in during the second but not the first). Here's the same thing but with memoization:
julia> using Memoize
julia> @memoize seqm(n) = n < 2 ? BigFloat(2) : 1/(3-seqm(n-1))
seqm (generic function with 1 method)
julia> seqm(1) # trigger compilation
2.0
julia> @time [seqm(n) for n=1:100];
0.000071 seconds (799 allocations: 36.750 KiB)
julia> @time [seqm(n) for n=1:100];
0.000011 seconds (201 allocations: 4.000 KiB)
The first timing is significantly faster than the unmemoized version even though the memoization cache is empty at the start because the same computation is done many times and memoization avoids doing it all but the first time. The second timing is even faster because now all 100 computed values are already cached and can just be returned.
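Under the hood, memoization just keeps a cache keyed by the argument. A minimal hand-rolled sketch (added here for illustration; the seq_cached name is hypothetical) would be:
const SEQ_CACHE = Dict{Int,BigFloat}()

function seq_cached(n)
    haskey(SEQ_CACHE, n) && return SEQ_CACHE[n]   # reuse a previously computed value
    val = n < 2 ? BigFloat(2) : 1 / (3 - seq_cached(n - 1))
    SEQ_CACHE[n] = val
    return val
end

[seq_cached(n) for n = 1:100];   # later calls for n ≤ 100 become dictionary lookups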

Calculating PI with Fortran & CUDA

I am trying to make a simple program in PGI's fortran compiler. This simple program will use the graphics card to calculate pi using the "dart board" algorithm. After battling with this program for quite some time now I have finally got it to behave for the most part. However, I am currently stuck on passing back the results properly. I must say, this is a rather tricky program to debug since I can no longer shove any print statements into the subroutine. This program currently returns all zeros. I am not really sure what is going on, but I have two ideas. Both of which I am not sure how to fix:
The CUDA kernel is not running somehow?
I am not converting the values properly? pi_parts = pi_parts_d
Well, this is the status of my current program. All variables ending in _d stand for CUDA-prepared device memory, while all the other variables (with the exception of the CUDA kernel) are typical CPU-side Fortran variables. There are some print statements I have commented out that I had already tried from CPU Fortran land; those commands were there to check whether I really was generating the random numbers properly. As for the CUDA method, I have currently commented out the calculations and set z statically equal to 1 just to see something happen.
module calcPi
contains
    attributes(global) subroutine pi_darts(x, y, results, N)
        use cudafor
        implicit none
        integer :: id
        integer, value :: N
        real, dimension(N) :: x, y, results
        real :: z
        id = (blockIdx%x-1)*blockDim%x + threadIdx%x
        if (id .lt. N) then
            ! SQRT NOT NEEDED, SQRT(1) === 1
            ! Anything above and below 1 would stay the same even with the applied
            ! sqrt function. Therefore using the sqrt function wastes GPU time.
            z = 1.0
            !z = x(id)*x(id)+y(id)*y(id)
            !if (z .lt. 1.0) then
            !    z = 1.0
            !else
            !    z = 0.0
            !endif
            results(id) = z
        endif
    end subroutine pi_darts
end module calcPi

program final_project
    use calcPi
    use cudafor
    implicit none
    integer, parameter :: N = 400
    integer :: i
    real, dimension(N) :: x, y, pi_parts
    real, dimension(N), device :: x_d, y_d, pi_parts_d
    type(dim3) :: grid, tBlock
    ! Initialize the random number generator's seed
    call random_seed()
    ! Make sure we initialize the parts with 0
    pi_parts = 0
    ! Prepare the random numbers (These cannot be generated from inside the
    ! cuda kernel)
    call random_number(x)
    call random_number(y)
    !write(*,*) x, y
    ! Convert the random numbers into graphics card memory land!
    x_d = x
    y_d = y
    pi_parts_d = pi_parts
    ! For the cuda kernel
    tBlock = dim3(256,1,1)
    grid = dim3((N/tBlock%x)+1,1,1)
    ! Start the cuda kernel
    call pi_darts<<<grid, tBlock>>>(x_d, y_d, pi_parts_d, N)
    ! Transform the results into CPU Memory
    pi_parts = pi_parts_d
    write(*,*) pi_parts
    write(*,*) 'PI: ', 4.0*sum(pi_parts)/N
end program final_project
EDIT TO CODE:
Changed various lines to reflect the fixes mentioned by Robert Crovella. Current status: error caught by cuda-memcheck revealing: Program hit error 8 on CUDA API call to cudaLaunch on my machine.
If there is any method I can use to test this program please let me know. I am throwing darts and seeing where they land for my current style of debugging with CUDA. Not the most ideal, but it will have to do until I find another way.
May the Fortran Gods have mercy on my soul at this dark hour.
When I compile and run your program I get a segfault. This is due to the last parameter you are passing to the kernel (N_d):
call pi_darts<<<grid, tblock>>>(x_d, y_d, pi_parts_d, N_d)
Since N is a scalar quantity, the kernel is expecting to use it directly, rather than as a pointer. So when you pass a pointer to device data (N_d), the process of setting up the kernel generates a seg fault (in host code!) as it attempts to access the value N, which should be passed directly as:
call pi_darts<<<grid, tblock>>>(x_d, y_d, pi_parts_d, N)
When I make that change to the code you have posted, I then get actual printed output (instead of a seg fault), which is an array of ones and zeroes (256 ones, followed by 144 zeroes, for a total of N=400 values), followed by the calculated PI value (which happens to be 2.56 in this case (4*256/400), since you have made the kernel basically a dummy kernel).
This line of code is also probably not doing what you want:
grid = dim3(N/tBlock%x,1,1)
With N = 400 and tBlock%x = 256 (from previous code lines), the result of the calculation is 1 (ie. grid ends up at (1,1,1) which amounts to one threadblock). But you really want to launch 2 threadblocks, so as to cover the entire range of your data set (N = 400 elements). There's a number of ways to fix this, but for simplicity let's just always add 1 to the calculation:
grid = dim3((N/tBlock%x)+1,1,1)
Under these circumstances, when we launch grids that are larger (in terms of total threads) than our data set size (512 threads but only 400 data elements in this example) it's customary to put a thread check near the beginning of our kernel (in this case, after the initialization of id), to prevent out-of-bounds accesses, like so:
if (id .lt. N) then
(and a corresponding endif at the very end of the kernel code). This way, only the threads that correspond to actual valid data are allowed to do any work.
With the above changes, your code should be essentially functional, and you should be able to revert your kernel code to the proper statements and start to get an estimate of PI.
Note that you can check the CUDA API for error return codes, and you can also run your code with cuda-memcheck to get an idea of whether the kernel is making out-of-bounds accesses. Neither of these would have helped with this particular seg fault, however.

cuBLAS: one function call produced three executions

I called the cublas_Sgemm_v2 function 10236 times with the first matrix non-transposed and the second transposed. However, in the nvprof results, I saw three items produced from that function call. The (m, n, k) values passed to the function call are (588, 588, 20).
These are the items listed in the nvprof results:
Time(%) Time Calls Avg Min Max Name
12.32% 494.86ms 10236 48.344us 47.649us 49.888us sgemm_sm35_ldg_nt_128x8x128x16x16
8.64% 346.91ms 10236 33.890us 32.352us 35.488us sgemm_sm35_ldg_nt_64x16x128x8x32
8.11% 325.63ms 10236 31.811us 31.360us 32.512us sgemm_sm35_ldg_nt_128x16x64x16x16
Is this expected, and why is that? Can someone explain what the values in the function names such as sgemm_sm35_ldg_nt_128x8x128x16x16 mean?
I also have other calls to cublas_Sgemm_v2 with different transpose settings, and for those I only see one item per function call.
UPDATE:
As @Marco13 asked, I put more results here:
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
Resulted from 7984 calls with (Trans, NonTrans) with (m, n, k) = (588, 100, 588)
20.84% 548.30ms 7984 68.675us 58.977us 81.474us sgemm_sm35_ldg_tn_32x16x64x8x16
Resulted from 7984 calls with (NonTrans, NonTrans) with (m, n, k) = (588, 100, 588)
12.95% 340.71ms 7984 42.674us 21.856us 64.514us sgemm_sm35_ldg_nn_64x16x64x16x16
All the following resulted from 3992 calls with (NonTrans, Trans) with (m, n, k) = (588, 588, 100)
9.81% 258.15ms 3992 64.666us 61.601us 68.642us sgemm_sm35_ldg_nt_128x8x128x16x16
6.84% 179.90ms 3992 45.064us 40.097us 49.505us sgemm_sm35_ldg_nt_64x16x128x8x32
6.33% 166.51ms 3992 41.709us 38.304us 61.185us sgemm_sm35_ldg_nt_128x16x64x16x16
Another run with 588 changed to 288:
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
Resulted from 7984 calls with (Trans, NonTrans) with (m, n, k) = (288, 100, 288)
22.01% 269.11ms 7984 33.706us 30.273us 39.232us sgemm_sm35_ldg_tn_32x16x64x8x16
Resulted from 7984 calls with (NonTrans, NonTrans) with (m, n, k) = (288, 100, 288)
14.79% 180.78ms 7984 22.642us 18.752us 26.752us sgemm_sm35_ldg_nn_64x16x64x16x16
Resulted from 3992 calls with (NonTrans, Trans) with (m, n, k) = (288, 288, 100)
7.43% 90.886ms 3992 22.766us 19.936us 25.024us sgemm_sm35_ldg_nt_64x16x64x16x16
From the last three lines it looks like certain transposition types can be more efficient than others, and certain matrix sizes are more economical in terms of computation time relative to matrix size. What is the guideline for ensuring economical computation?
UPDATE 2:
For the case of (m, n, k) = (588, 100, 588) above, I manually transposed the matrix before calling the sgemm function, so that there is only one item in the nvprof result. The time it takes is only a little less than the sum of the two items in the table above, so there is not much performance gain from doing so.
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
31.65% 810.59ms 15968 50.763us 21.505us 72.098us sgemm_sm35_ldg_nn_64x16x64x16x16
Sorry, not an answer - but slightly too long for a comment:
Concerning the edit, about the influence of the "transpose" state: transposing a matrix might cause an access pattern that is worse in terms of memory coalescing. A quick web search brings up some results about this ( https://devtalk.nvidia.com/default/topic/528450/cuda-programming-and-performance/cublas-related-question/post/3734986/#3734986 ), but with a slightly different setup than yours:
DGEMM performance on a K20c
args: ta=N tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13280010 sec GFLOPS=1034.93
args: ta=T tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13872910 sec GFLOPS=990.7
args: ta=N tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.12521601 sec GFLOPS=1097.61
args: ta=T tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13652611 sec GFLOPS=1006.69
In this case, the differences do not seem worth the hassle of changing the matrix storage (e.g. from column-major to row-major, to avoid transposing the matrix), because all patterns seem to run at a similar speed. But your mileage may vary; in particular, the difference in your tests between (t,n) and (n,n) is very large (548ms vs 340ms), which I found quite surprising. If you have the choice to easily switch between various representations of the matrix, then a benchmark covering all four cases may be worthwhile.
In any case, regarding your question about the functions that are called there: the CUBLAS code for the sgemm function in CUBLAS 1.1 was already full of unrolled loops and already contained 80 (!) versions of the sgemm function for different cases, assembled through a #define hell. It has to be assumed that this has become even more unreadable in the newer CUBLAS versions, where the newer compute capabilities have to be taken into account - and the function names that you found there indicate that this is indeed the case:
sgemm_sm35_ldg_nt_64x16x128x8x32
sm35 : Runs on a device with compute capability 3.5
ldg : ? Non-texture-memory version ? (CUBLAS 1.1 contained functions called sgemm_main_tex_* which worked on texture memory, and functions sgemm_main_gld_* which worked on normal, global memory)
nt : First matrix is Not transposed, second one is Transposed
64x16x128x8x32 - Probably related to tile sizes, maybe shared memory etc...
Still, I think it's surprising that a single call to sgemm causes three of these internal functions to be called. But as mentioned in the comment: I assume that they try to handle the "main" part of the matrix with a specialized, efficient version, and the "border tiles" with one that is capable of doing range checks and/or coping with warps that are not full. (Not very precise, just to be suggestive: a matrix of size 288x288 could be handled by an efficient core for matrices of size 256x256, plus two calls for the remaining 32x288 and 288x32 entries.)
But all this is also the reason why I guess there can hardly be a general guideline concerning the matrix sizes: The "best" matrix size in terms of computation time over matrix size will at least depend on
the hardware version (compute capability) of the target system
the transposing-flags
the CUBLAS version
EDIT Concerning the comment: One could imagine that there should be a considerable difference between transposed and non-transposed processing. When multiplying two matrices
a00 a01 a02 b00 b01 b02
a10 a11 a12 * b10 b11 b12
a20 a21 a22 b20 b21 b22
Then the first element of the result will be
a00 * b00 + a01 * b10 + a02 * b20
(which simply is the dot product of the first row of a and the first column of b). For this computation one has to read consecutive values from a. But the values that are read from b are not consecutive. Instead, they are "the first value in each row". One could think that this would have a negative impact on memory coalescing. But for sure, the NVIDIA engineers have tried hard to avoid any negative impact here, and the implementation of sgemm in CUBLAS is far, far away from "a parallel version of the naive 3-nested-loops implementation" where this access pattern would have such an obvious drawback.

Is there a way to avoid creating an array in this Julia expression?

Is there a way to avoid creating an array in this Julia expression:
max((filter(n -> string(n) == reverse(string(n)), [x*y for x = 1:N, y = 1:N])))
and make it behave similar to this Python generator expression:
max(x*y for x in range(N+1) for y in range(x, N+1) if str(x*y) == str(x*y)[::-1])
The Julia version is 2.3 times slower than Python due to array allocation and N*N iterations vs. Python's N*N/2.
EDIT
After playing a bit with a few implementations in Julia, the fastest loop style version I've got is:
function f(N) # 320ms for N=1000 Julia 0.2.0 i686-w64-mingw32
    nMax = 0
    for x = 1:N, y = x:N
        n = x*y
        s = string(n)
        s == reverse(s) || continue
        nMax < n && (nMax = n)
    end
    nMax
end
but an improved functional version isn't far behind (only 14% slower or significantly faster, if you consider 2x larger domain):
function e(N) # 366ms for N=1000 Julia 0.2.0 i686-w64-mingw32
    isPalindrome(n) = string(n) == reverse(string(n))
    max(filter(isPalindrome, [x*y for x = 1:N, y = 1:N]))
end
There is an unexpected 2.6x performance improvement from defining the isPalindrome function, compared to the original version at the top of this page.
We have talked about allowing the syntax
max(f(x) for x in itr)
as a shorthand for producing each of the values f(x) in one coroutine while computing the max in another coroutine. This would basically be shorthand for something like this:
max(@task for x in itr; produce(f(x)); end)
Note, however, that this syntax that explicitly creates a task already works, although it is somewhat less pretty than the above. Your problem can be expressed like this:
max(@task for x=1:N, y=x:N
    string(x*y) == reverse(string(x*y)) && produce(x*y)
end)
With the hypothetical producer syntax above, it could be reduced to something like this:
max(x*y if string(x*y) == reverse(string(x*y)) for x=1:N, y=x:N)
While I'm a fan of functional style, in this case I would probably just use a for loop:
m = 0
for x = 1:N, y = x:N
    n = x*y
    string(n) == reverse(string(n)) || continue
    m < n && (m = n)
end
Personally, I don't find this version much harder to read and it will certainly be quite fast in Julia. In general, while functional style can be convenient and pretty, if your primary focus is on performance, then explicit for loops are your friend. Nevertheless, we should make sure that John's max/filter/product version works. The for loop version also makes other optimizations easier to add, like Harlan's suggestion of reversing the loop ordering and exiting on the first palindrome you find. There are also faster ways to check if a number is a palindrome in a given base than actually creating and comparing strings.
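As an illustration of that last point (a sketch added here, not from the original answer), a base-10 palindrome test can be done with integer arithmetic alone, avoiding the string allocations:
# Reverse the decimal digits of n and compare with n itself.
function ispalindrome(n::Integer)
    n < 0 && return false
    rev, m = zero(n), n
    while m > 0
        rev = 10rev + m % 10   # append the lowest digit of m to rev
        m = div(m, 10)
    end
    return rev == n
end

ispalindrome(9009)   # true
ispalindrome(123)    # false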
As to the general question of "getting flexible generators and list comprehensions in Julia", the language already has
A general high-performance iteration protocol based on the start/done/next functions.
Far more powerful multidimensional array comprehensions than most languages. At this point, the only missing feature is the if guard, which is complicated by the interaction with multidimensional comprehensions and the need to potentially dynamically grow the resulting array.
Coroutines (aka tasks) which allow, among other patterns, the producer-consumer pattern.
Python has the if guard but doesn't worry about comprehension performance nearly as much – if we're going to add that feature to Julia's comprehensions, we're going to do it in a way that's both fast and interacts well with multidimensional arrays, hence the delay.
Update: The max function is now called maximum (maximum is to max as sum is to +) and the generator syntax and/or filters work on master, so for example, you can do this:
julia> @time maximum(100x - x^2 for x = 1:100 if x % 3 == 0)
0.059185 seconds (31.16 k allocations: 1.307 MB)
2499
Once 0.5 is out, I'll update this answer more thoroughly.
There are two questions being mixed together here: (1) can you filter a list comprehension mid-comprehension (for which the answer is currently no) and (2) can you use a generator that doesn't allocate an array (for which the answer is partially yes). Generators are provided by the Iterators package, but the Iterators package seems to not play well with filter at the moment. In principle, the code below should work:
max((x, y) -> x * y,
    filter((x, y) -> string(x * y) == reverse(string(x * y)),
           product(1:N, 1:N)))
I don't think so. There aren't currently filters in Julia array comprehensions. See discussion in this issue.
In this particular case, I'd suggest just nested for loops if you want to get faster computation.
(There might be faster approaches where you start with N and count backwards, stopping as soon as you find something that succeeds. Figuring out how to do that correctly is left as an exercise, etc...)
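One hedged sketch of that backwards-counting idea (added here; the function name is hypothetical). Products are not visited in strictly decreasing order, so a running best is still needed, but the search can be pruned aggressively:
function largest_palindrome_product(N)
    best = 0
    for x in N:-1:1
        x * N <= best && break        # every remaining product is ≤ x*N ≤ best
        for y in N:-1:x
            n = x * y
            n <= best && break        # inner products only shrink as y decreases
            s = string(n)
            s == reverse(s) && (best = n)
        end
    end
    return best
end

largest_palindrome_product(999)   # 906609, the well-known three-digit case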
As mentioned, this is now possible (using Julia 0.5.0)
isPalindrome(n::String) = n == reverse(n)
fun(N::Int) = maximum(x*y for x in 1:N for y in x:N if isPalindrome(string(x*y)))
I'm sure there are better ways that others can comment on. Time (after warm-up):
julia> @time fun(1000);
0.082785 seconds (2.03 M allocations: 108.109 MB, 27.35% gc time)