Memory allocation in a Julia function

Here is a simple function in Julia 0.5.
function foo{T<:AbstractFloat}(x::T)
a = zero(T)
b = zero(T)
return x
end
I started Julia with julia --track-allocation=user, then ran include("test.jl"); test.jl contains only this function. I ran foo(5.), then Profile.clear_malloc_data(), then foo(5.) again in the REPL, and quit Julia. Looking at the file test.jl.mem:
- function foo{T<:AbstractFloat}(x::T)
- a = zero(T)
194973 b = zero(T)
0 return x
- end
-
Why are 194973 bytes of memory allocated here? This is also not the first line of the function, although after Profile.clear_malloc_data() that shouldn't matter.

Let's clarify some parts of the relevant documentation, which can be a little misleading:
In interpreting the results, there are a few important details. Under the user setting, the first line of any function directly called from the REPL will exhibit allocation due to events that happen in the REPL code itself.
Indeed, the line with allocation is not the first line. However, it is still the first tracked line, since Julia 0.5 has some issues with tracking allocation on the actual first statement (this has been fixed in v0.6). Note that the effect may also (contrary to what the documentation says) propagate into called functions, even if they are annotated with @noinline. The only real solution is to ensure that the first statement of what's being called is something you don't want to measure.
More significantly, JIT-compilation also adds to allocation counts, because much of Julia’s compiler is written in Julia (and compilation usually requires memory allocation). The recommended procedure is to force compilation by executing all the commands you want to analyze, then call Profile.clear_malloc_data() to reset all allocation counters. Finally, execute the desired commands and quit Julia to trigger the generation of the .mem files.
You're right that Profile.clear_malloc_data() prevents the allocation due to JIT compilation from being counted. However, this paragraph is separate from the first paragraph: clear_malloc_data does not do anything about allocation due to "events that happen in the REPL code itself".
Indeed, as I'm sure you suspected, there is no allocation in this function:
julia> function foo{T<:AbstractFloat}(x::T)
a = zero(T)
b = zero(T)
return x
end
foo (generic function with 1 method)
julia> @allocated foo(5.)
0
The numbers you see are due to events in the REPL itself. To avoid this issue, wrap the code to measure in a function. That is to say, we can use this as our test harness, perhaps after disabling inlining on foo with @noinline. For instance, here's a revised test.jl:
@noinline function foo{T<:AbstractFloat}(x::T)
a = zero(T)
b = zero(T)
return x
end
function do_measurements()
x = 0. # dummy statement
x += foo(5.)
x # prevent foo call being optimized out
# (it won't, but this is good practice)
end
Then a REPL session:
julia> include("test.jl")
do_measurements (generic function with 1 method)
julia> do_measurements()
5.0
julia> Profile.clear_malloc_data()
julia> do_measurements()
5.0
Which produces the expected result:
- @noinline function foo{T<:AbstractFloat}(x::T)
0 a = zero(T)
0 b = zero(T)
0 return x
- end
-
- function do_measurements()
155351 x = 0. # dummy statement
0 x += foo(5.)
0 x # prevent foo call being optimized out
- # (it won't, but this is good practice)
- end
-
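As an aside, the same workflow can be scripted instead of typed at the REPL. A minimal sketch (the file name runme.jl is made up; on Julia 0.7+ you would also need using Profile first):
# runme.jl -- run with: julia --track-allocation=user runme.jl
include("test.jl")
do_measurements()               # run once to force JIT compilation
Profile.clear_malloc_data()     # discard allocation counted so far (mostly compilation)
do_measurements()               # only this call shows up in test.jl.mem
# the .mem files are written when Julia exits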

Related

Declare a Julia function that returns a function with a specific signature

How can I declare a Julia function that returns a function with a specific signature? For example, say I want to return a function that takes an Int and returns an Int:
function buildfunc()::?????
    mult(x::Int) = x * 2
    return mult
end
What should the question marks be replaced with?
One thing needs to be made clear.
Adding a type declaration to the returned value is just an assertion, not part of the function definition. To understand what is going on, look at the lowered code (a pre-compilation stage) of the function:
julia> f(a::Int)::Int = 2a
f (generic function with 1 method)
julia> #code_lowered f(5)
CodeInfo(
1 ─ %1 = Main.Int
│ %2 = 2 * a
│ %3 = Base.convert(%1, %2)
│ %4 = Core.typeassert(%3, %1)
└── return %4
)
In this case, since the return type is obvious, this assertion will actually be removed during the compilation process (try @code_native f(5) to see for yourself).
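At run time the annotation behaves like convert followed by a typeassert, as the lowered code above shows, so it can either convert the result or throw. A small sketch (the function g below is made up for illustration):
g(a)::Int = a / 2     # lowered to convert(Int, a / 2) plus a typeassert

g(4)    # 2 -- 4 / 2 == 2.0 converts cleanly to Int
g(3)    # ERROR: InexactError -- 1.5 cannot be converted to Int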
If for some reason you want to generate functions, I recommend using the @generated macro. Be warned: metaprogramming is usually overkill for solving Julia-related problems.
@generated function f2(x)
    if x <: Int
        quote
            2x
        end
    else
        quote
            10x
        end
    end
end
Now we have a function f2 whose source code depends on the argument type:
julia> f2(3)
6
julia> f2(3.)
30.0
Note that this function generation actually happens at compile time:
julia> #code_lowered f2(2)
CodeInfo(
# REPL[34]:1 within `f2'
┌ # REPL[34]:4 within `macro expansion'
1 ─│ %1 = 2 * x
└──│ return %1
└
)
Hope that clears things up.
You can use the Function type for this purpose. From the Julia documentation:
Function is the abstract type of all functions
function b(c::Int64)::Int64
    return c + 2
end

function a()::Function
    return b
end
Which prints:
julia> println(a()(2));
4
Julia will throw an exception for a Float64 input:
julia> println(a()(2.0));
ERROR: MethodError: no method matching b(::Float64)
Closest candidates are: b(::Int64)
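Note that Function is an abstract type and does not encode the argument or return types, so the ::Function annotation documents intent rather than pinning down a signature. If you really need the signature reflected in the type, one option is the third-party FunctionWrappers.jl package. A minimal sketch, assuming that package is installed:
# Sketch using the third-party FunctionWrappers.jl package
using FunctionWrappers: FunctionWrapper

# A callable whose type says: takes an Int, returns an Int
function buildfunc()::FunctionWrapper{Int,Tuple{Int}}
    mult(x::Int) = x * 2
    return FunctionWrapper{Int,Tuple{Int}}(mult)
end

buildfunc()(3)   # 6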

Julia waits for function to finish before printing message in a loop

I have a function in Julia that does things in a loop. A parameter is passed to the loop, and the bigger this parameter, the slower the function gets. I would like a message telling me which iteration it is on, but it seems that Julia waits for the whole function to finish before printing anything. This is Julia 1.4; that behaviour did not occur on Julia 1.3.
An example would be like this:
function f(x)
    rr = 0.000:0.0001:x
    aux = 0
    for r in rr
        print(r, " ")
        aux += veryslowfunction(r)
    end
    return aux
end
As it is, f, when called, does not print anything until it has finished.
You need to add this after the print command:
flush(stdout)
Explanation
The standard output of a process is usually buffered. The particular buffer size and behavior will depend on your system setting and perhaps the terminal type.
By flushing the buffer you make sure that the contents are actually sent to the terminal.
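Applied to the example above (assuming the same veryslowfunction), that would look like:
function f(x)
    rr = 0.000:0.0001:x
    aux = 0
    for r in rr
        print(r, " ")
        flush(stdout)   # push the buffered output to the terminal right away
        aux += veryslowfunction(r)
    end
    return aux
end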
Alternatively, you can use a library like ProgressLogging.jl (which needs TerminalLoggers.jl to see actual output) or ProgressMeter.jl, which will automatically update a nicely formatted status bar at each step of the loop.
For example, with ProgressMeter, rewriting the function as
using ProgressMeter

function f(x)
    rr = 0.000:0.0001:x
    aux = 0
    @showprogress for r in rr
        aux += veryslowfunction(r)
    end
    return aux
end
and calling it will show something like this (at the end):
Progress: 100%|██████████████████████████████████████████████████████████████| Time: 0:00:10
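With ProgressLogging.jl the equivalent would be (a sketch, assuming a terminal session with TerminalLoggers.jl installed):
using ProgressLogging
using TerminalLoggers: TerminalLogger
using Logging: global_logger

global_logger(TerminalLogger())   # route progress logs to a terminal progress bar

function f(x)
    rr = 0.000:0.0001:x
    aux = 0
    @progress for r in rr
        aux += veryslowfunction(r)
    end
    return aux
end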
Again, I can't reproduce the behaviour in my terminal (it always prints), but I wanted to add that for these types of situations the @show macro is quite neat:
julia> function f(x)
           rr = 0.000:0.0001:x
           aux = 0
           for r in rr
               @show r
               aux += veryslowfunction(r)
           end
           return aux
       end
f (generic function with 1 method)
julia> f(1)
r = 0.0
r = 0.0001
r = 0.0002
...
It uses println under the hood:
julia> using MacroTools
julia> a = 5
5
julia> prettify(@expand(@show a))
quote
    Base.println("a = ", Base.repr($(Expr(:(=), :ibex, :a))))
    ibex
end

Are fast Anonymous functions automatic now?

Now that fast anonymous functions are native to Julia, do I still have to use the decorator, or is this automatic? Also, when I pass a function as an argument into another function, can I statically type it? What can I do to improve the run speed?
FastAnonymous is definitely not necessary anymore. Here's how you can verify this yourself:
julia> @noinline g(f, x) = f(x) # prevent inlining so you know it's general
g (generic function with 1 method)
julia> h1(x) = g(identity, x)
h1 (generic function with 1 method)
julia> h2(x) = g(sin, x)
h2 (generic function with 1 method)
julia> @code_warntype h1(1)
Variables
#self#::Core.Compiler.Const(h1, false)
x::Int64
Body::Int64
1 ─ %1 = Main.g(Main.identity, x)::Int64
└── return %1
julia> @code_warntype h2(1)
Variables
#self#::Core.Compiler.Const(h2, false)
x::Int64
Body::Float64
1 ─ %1 = Main.g(Main.sin, x)::Float64
└── return %1
julia> h3(x) = g(z->"I'm a string", x)
h3 (generic function with 1 method)
julia> @code_warntype h3(1)
Variables
#self#::Core.Compiler.Const(h3, false)
x::Int64
#9::getfield(Main, Symbol("##9#10"))
Body::String
1 ─ (#9 = %new(Main.:(##9#10)))
│ %2 = #9::Core.Compiler.Const(getfield(Main, Symbol("##9#10"))(), false)
│ %3 = Main.g(%2, x)::Core.Compiler.Const("I'm a string", false)
└── return %3
In every case Julia knows the return type, and that requires that it "understand" what your function-argument is doing. Moreover:
julia> m = first(methods(g))
g(f, x) in Main at REPL[1]:1
julia> m.specializations
Core.TypeMapEntry(Core.TypeMapEntry(Core.TypeMapEntry(nothing, Tuple{typeof(g),typeof(identity),Int64}, nothing, svec(), 1, -1, MethodInstance for g(::typeof(identity), ::Int64), true, true, false), Tuple{typeof(g),typeof(sin),Int64}, nothing, svec(), 1, -1, MethodInstance for g(::typeof(sin), ::Int64), true, true, false), Tuple{typeof(g),getfield(Main, Symbol("##9#10")),Int64}, nothing, svec(), 1, -1, MethodInstance for g(::getfield(Main, Symbol("##9#10")), ::Int64), true, true, false)
This is a bit hard to read, but if you look carefully you'll see that g has been compiled for 3 inputs:
Tuple{typeof(identity), Int64}
Tuple{typeof(sin), Int64}
Tuple{getfield(Main, Symbol("##9#10")),Int64}
(The compiled versions also take g itself as an extra argument, for reasons having to do with things like the internal implementation of keyword-argument handling, but let's ignore that for now.) The last one is the generated name for the type implementing the anonymous function. What this shows you is that each function has its own type, which is the reason why passing functions as arguments is fast.
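You can also see this directly at the REPL; a small illustration (the names p and q are arbitrary):
julia> typeof(sin)
typeof(sin)

julia> typeof(sin) <: Function
true

julia> p = x -> x + 1; q = x -> x + 1;

julia> typeof(p) == typeof(q)   # identical bodies, but each closure gets its own type
false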
For the gurus, there is one other factor that can come into play: because type inference is subject to the unsolvable halting problem, there are circumstances where inference will decide that this is all getting too complex and abort "early." In such cases (which are relatively rare), it can help to force the compiler to specialize against a particular argument. In our example, that would mean declaring g as
@noinline g(f::F, x) where F = f(x)
rather than
@noinline g(f, x) = f(x)
That ::F is normally unnecessary and appears useless, but you can use it as a compiler-hint to increase the amount of effort used to infer the result. I don't recommend doing that by default (it makes your code a bit harder to read), but if you see weird performance problems it's one thing to try.

Cuda illegal memory access error when using array indexes stored in another array

I'm using CUDA Fortran and I've been struggling with this problem in a simple kernel; I couldn't find the solution.
Isn't it possible to use integer values stored in an array as the indexes for another array?
Here's a simple example (edited to also include the main program):
program test
use cudafor
integer:: ncell, i
integer, allocatable:: values(:)
integer, allocatable, device :: values_d(:)
ncell = 10
allocate(values(ncell), values_d(ncell))
do i=1,ncell
values(i) = i
enddo
values_d = values
call multipleindices_kernel<<< ncell/1024+1,1024 >>> (values_d,
+ ncell)
values = values_d
write (*,*) values
end program test
!////////////////////////////////////////////////////
attributes(global) subroutine multipleindices_kernel(valu, ncell)
use cudafor
implicit none
integer, value:: ncell ! ncell = 10
integer :: valu(ncell)
integer :: tempind(10)
integer:: i
tempind(1)=10
tempind(2)=3
tempind(3)=5
tempind(4)=7
tempind(5)=9
tempind(6)=2
tempind(7)=4
tempind(8)=6
tempind(9)=8
tempind(10)=1
i = (blockidx%x - 1 ) * blockdim%x + threadidx%x
if (i .LE. ncell) then
valu(tempind(i))= 1
endif
end subroutine
I understand that if there were repeated values in the tempind array different threads could be accessing the same memory location for reading or writting, but that is not the case.
Even so, this gives the error "0: copyout Memcpy (host=0x303610, dev=0x3e20000, size=40) FAILED: 77 (an illegal memory access was encountered)".
Does anyone know if it is possible to use these indexes coming from another array in CUDA?
After some additional tests, I've noticed that the problem occurs not while running the kernel itself, but during the transfer of the data back to the CPU (if I remove "values = values_d" then no error is displayed). Also, if in the kernel I substitute valu(tempind(i)) with valu(i), it works fine, but I want the indexes to come from an array, since the purpose of this test is to parallelize a CFD code where the indexes are stored like that.
The problem appears to be that the generated executable doesn't pass the variable ncell to the kernel correctly. Running the application through cuda-memcheck shows that threads outside the range 1-10 are passing through the branch statement, and adding a print statement for ncell inside the kernel also gives strange answers.
It used to be a requirement that all attributes(global) subroutines had to reside within a module. This requirement seems to have been relaxed in more recent versions of CUDA Fortran (I cannot find references to it in the programming guide). I believe the code outside of the module is causing the error here. By placing multipleindices_kernel within a module and using that module in test, I consistently get correct answers with no errors. The code for this is below:
module testmod
contains
attributes(global) subroutine multipleindices_kernel(valu, ncell)
use cudafor
implicit none
integer, value:: ncell ! ncell = 10
integer :: valu(ncell)
integer :: tempind(10)
integer:: i
tempind(1)=10
tempind(2)=3
tempind(3)=5
tempind(4)=7
tempind(5)=9
tempind(6)=2
tempind(7)=4
tempind(8)=6
tempind(9)=8
tempind(10)=1
i = (blockidx%x - 1 ) * blockdim%x + threadidx%x
if (i .LE. ncell) then
valu(tempind(i))= 1
endif
end subroutine
end module testmod
program test
use cudafor
use testmod
integer:: ncell, i
integer, allocatable:: values(:)
integer, allocatable, device :: values_d(:)
ncell = 10
allocate(values(ncell), values_d(ncell))
do i=1,ncell
values(i) = i
enddo
values_d = values
call multipleindices_kernel<<< ncell/1024+1,1024 >>> (values_d, ncell)
values = values_d
write (*,*) values
end program test

Fortran does not understand call statement

I am attempting to use PGFortran for CUDA. I installed PGFortran on my computer and linked everything up to the best of my knowledge. To get going I decided to follow a tutorial listed here. When trying to compile the code:
module mathOps
contains
attributes(global) subroutine saxpy(x, y, a)
implicit none
real :: x(:), y(:)
real, value :: a
integer :: i, n
n = size(x)
i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
if (i <= n) y(i) = y(i) + a*x(i)
end subroutine saxpy
end module mathOps
program testSaxpy
use mathOps
use cudafor
implicit none
integer, parameter :: N = 40000
real :: x(N), y(N), a
real, device :: x_d(N), y_d(N)
type(dim3) :: grid, tBlock
tBlock = dim3(256,1,1)
grid = dim3(ceiling(real(N)/tBlock%x),1,1)
x = 1.0; y = 2.0; a = 2.0
x_d = x
y_d = y
call saxpy<<<grid, tblock="">>>(x_d, y_d, a)
y = y_d
write(*,*) 'Max error: ', maxval(abs(y-4.0))
end program testSaxpy
I got:
PGF90-S-0034-Syntax error at or near identifier saxpy (main.cuf: 29)
0 inform, 0 warnings, 1 severes, 0 fatal for testsaxpy
The error points to the line call saxpy<<<grid, tblock="">>>(x_d, y_d, a). For some reason it apparently hates the fact that I am using <<< and >>>? Going by the tutorial those triple chevrons are meant to be there:
The information between the triple chevrons is the execution
configuration, which dictates how many device threads execute the
kernel in parallel.
Removing these chevrons would not make any sense, since they are the whole point of launching the kernel. So why does PGFortran dislike this?
As for the compilation: I followed the tutorial by using pgf90 -o saxpy main.cuf, but since that gave an error I also tried pgf90 -Mcuda -o saxpy main.cuf. Same results.
There does seem to be a text error in that blog at the kernel invocation line:
call saxpy<<<grid, tblock="">>>(x_d, y_d, a)
tblock="" is not correct. You'll notice that elsewhere in the blog text the kernel invocation line is given correctly as:
call saxpy<<<grid,tBlock>>>(x_d, y_d, a)
So if you change that line accordingly in your actual code, I think you'll have better results.