w.

Julia Learning Circle: Memory Allocations and Garbage Collection

julia julia learning circle

Immutable and Mutable Types

Concrete types in Julia are either immutable or mutable. Immutable types are created with struct ImmutableType and mutable types are created with mutable struct MutableType. The advantage of immutable types is that they can be allocated on the stack as opposed to on the heap. Allocating objects on the stack is typically more performant due to cache locality and the stack’s simple, but more rigid memory structure.

An interesting situation occurs when an immutable type references a mutable type. Since Julia 1.5, such immutable types can be allocated on the stack.

julia> struct A
           data::Vector{Float64}
       end

julia> a = A(randn(3))
A([0.9462871255469765, 1.1995018446247545, 0.7153882414691778])

Here A is immutable, but references a Vector{Float64}, which is mutable. This means that a.data cannot be changed, but, since a.data is mutable, e.g. a.data[1] can be changed.

julia> a.data = randn(3)
ERROR: setfield! immutable struct of type A cannot be changed
Stacktrace:
 [1] setproperty!(x::A, f::Symbol, v::Vector{Float64})
   @ Base ./Base.jl:34
 [2] top-level scope
   @ REPL[5]:1

julia> a.data[1] = 1.0
1.0

Types T that satisfy isbitstype(T) == true are a subset of immutable types. They are immutable types that reference only other isbitstype types or primitive types. Primitive types are types whose data are a simple collection of bits. A collection of primitive types is defined by base. The purpose of primitive types is to facilitate interoperability with LLVM.

Case Study: Stack-Allocated Vectors (A.K.A. a Very Brief Introduction to StaticArrays.jl)

The usual Vector{Float64} is mutable, which means that it is heap allocated. Let’s see if we can create a more performant vector by creating a vector type that is allocated on the stack.

struct StackVector{N}
    data::NTuple{N, Float64}
end

StackVector(data::Vector{Float64}) = StackVector(Tuple(data))

Define + for our newly defined StackVector.

import Base: +

+(x::StackVector{N}, y::StackVector{N}) where N = StackVector{N}(x.data .+ y.data)

Let’s check that this works as intended.

julia> x = randn(10);

julia> y = randn(10);

julia> stack_x = StackVector(x);

julia> stack_y = StackVector(y);

julia> x + y
10-element Vector{Float64}:
 -0.5453143850886275
  2.120385168072067
  1.1278328263047377
  1.6358682579762607
 -0.22486252827622277
 -2.1333012655133836
  2.6754332229859767
 -0.7701873679976846
  0.26775849165909
 -2.7389288669831786

julia> collect((stack_x + stack_y).data)
10-element Vector{Float64}:
 -0.5453143850886275
  2.120385168072067
  1.1278328263047377
  1.6358682579762607
 -0.22486252827622277
 -2.1333012655133836
  2.6754332229859767
 -0.7701873679976846
  0.26775849165909
 -2.7389288669831786

That looks good. Now let’s see what avoiding allocations on the heap gets us.

julia> using BenchmarkTools

julia> @benchmark $x + $y
BenchmarkTools.Trial:
  memory estimate:  160 bytes
  allocs estimate:  1
  --------------
  minimum time:     53.664 ns (0.00% GC)
  median time:      56.126 ns (0.00% GC)
  mean time:        59.544 ns (1.83% GC)
  maximum time:     572.958 ns (87.42% GC)
  --------------
  samples:          10000
  evals/sample:     987

julia> @benchmark $stack_x + $stack_y
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     0.052 ns (0.00% GC)
  median time:      0.055 ns (0.00% GC)
  mean time:        0.055 ns (0.00% GC)
  maximum time:     0.099 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Whoa! What happened here is that the compiler is a little too clever: it managed to figure out the answer at compile time and essentially hardcoded the answer. Compare this with

julia> @benchmark $(stack_x + stack_y)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     0.052 ns (0.00% GC)
  median time:      0.055 ns (0.00% GC)
  mean time:        0.056 ns (0.00% GC)
  maximum time:     8.968 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

To stop the compiler from being too clever, BenchmarkTools.jl advises the following trick:

julia> @benchmark $(Ref(stack_x))[] + $(Ref(stack_y))[]
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.276 ns (0.00% GC)
  median time:      2.293 ns (0.00% GC)
  mean time:        2.401 ns (0.00% GC)
  maximum time:     30.049 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

That looks more reasonable. For this small array, compared to the allocating on the heap, that’s an 25x improvement in runtime! This example demonstrates that allocations on the heap can substantially contribute to the total runtime of a program.

The idea of allocating vectors on the stack is certainly not mine. Check out the fantastic StaticArrays.jl, which provides a generic implementation of stack-allocated arrays. If the size of the array is small, these stack-allocated arrays can be significantly more performant than their heap-allocated counterparts. StaticArrays.jl works by automagically generating implementations of linear algebra operations that are optimised for specific sizes of vectors or matrices by using generated functions.

Garbage Collection

As more and more objects are allocated on the heap, eventually the heap fills up. The purpose of the garbage collector is to clean up the heap every once in a while. The underlying principle of garbage collection is that objects are considered garbage, hence can be cleaned, if it can be proven that they cannot be reached (used) anymore in future code.

Julia’s garbage collector algorithm is called mark and sweep. This algorithm consists of two phases: the mark phase, where all objects that are not garbage are found and marked so; and the sweep phase, where all unmarked objects are cleaned. The mark phase first establishes a set of objects that are definitely not garbage. This set is called the root set, and essentially consists of all global variables and everything on the stack. The garbage collector then follows everything that the root set references, and everything that those references reference, and marks those objects along the way.

During the sweep phase, the unmarked objects are freed, which simply means that it is internally recorded that their memory can be freely overwritten and used for something else. These unmarked objects are found by walking through the whole heap. Marked objects, on the other hand, remain untouched. They are also not moved around: you can imagine that the memory used by marked objects can sometimes be rearranged into a more compact arrangement. This, however, takes time. That Julia’s garbage collector does not move marked objects around is referred to by saying that Julia’s mark-and-sweep algorithm is non-moving or non-compacting.

There is more fancy stuff going on. For example, Julia’s garbage collector is generational. You can check out the docstrings of gc.c for more details.

Published on 23 November 2020.