Simulations in Julia: Efficient by default

Author

Ben Elbers

Published

28 June 2021

Inspired by Grant McDermott’s blog post on efficient simulations in R, I decided to reimplement the same exercise in Julia. This post will not make much sense without having read that excellent post, so I’d recommend doing that first.

I recently switched to doing simulation work in Julia instead of R, because you don’t need many tricks to achieve decent performance on the first try. Compared to R, it is not necessary (at least in this example) to generate all the data at once. Instead, we can implement the simulation algorithm more naturally: generate a dataset, extract the quantities of interest, and repeat this process N times. I find that this leads to more intuitive and readable code, and Julia’s great for this kind of task.
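Schematically, the simulation loop looks something like the sketch below. This is only an illustration of the pattern: generate_one_dataset(), extract_estimates(), and n_sims are placeholder names, and the real versions are developed in the rest of this post.

results = map(1:n_sims) do _
  data = generate_one_dataset()   # step 1: simulate one dataset
  extract_estimates(data)         # step 2: pull out the quantities of interest
end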

1. Generate the data

We start by implementing a function gen_data() to generate a DataFrame. The code is basically a one-to-one translation from R, but we only generate one instance of the data.

using Distributions  # Normal() and rand() for sampling
using DataFrames     # DataFrame container
using Statistics     # mean() (also re-exported by Distributions)

const std_normal = Normal(0, 1)

function gen_data()
  # total time periods in the panel = 500
  tt = 500

  # x1 and x2 covariates
  x1_A = 1 .+ rand(std_normal, tt)
  x1_B = 1/4 .+ rand(std_normal, tt)
  x2_A = 1 .+ x1_A .+ rand(std_normal, tt)
  x2_B = 1 .+ x1_B .+ rand(std_normal, tt)

  # outcomes (notice different slope coefs for x2_A and x2_B)
  y_A = x1_A .+ 1*x2_A + rand(std_normal, tt)
  y_B = x1_B .+ 2*x2_B + rand(std_normal, tt)

  # combine
  DataFrame(
    id = vcat(fill(0, length(x1_A)), fill(1, length(x1_B))),
    x1 = vcat(x1_A, x1_B),
    x2 = vcat(x2_A, x2_B),
    x1_dmean = vcat(x1_A .- mean(x1_A), x1_B .- mean(x1_B)),
    x2_dmean = vcat(x2_A .- mean(x2_A), x2_B .- mean(x2_B)),
    y = vcat(y_A, y_B))
end
gen_data (generic function with 1 method)

This should hopefully be pretty clear, even if you have never seen any Julia code before. The function vcat() concatenates vectors, and fill() is similar to rep() in R. Another thing that might be unusual is the dot syntax: writing .+ instead of + broadcasts the addition elementwise, which is needed, for example, when adding a scalar to a vector. While this still trips me up occasionally, I think broadcasting is one of Julia’s best features.
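If the dot syntax is new to you, here is a tiny standalone illustration (not part of the simulation) of fill(), vcat(), and broadcasting:

a = fill(2, 3)        # [2, 2, 2], similar to rep(2, 3) in R
b = vcat(a, [5, 6])   # [2, 2, 2, 5, 6]
b .+ 1                # adds 1 to every element: [3, 3, 3, 6, 7]
b .- mean(b)          # demeans b, as done for x1_dmean and x2_dmean above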

2. Extract the quantities of interest

Given a dataset, we can now run the two regressions and extract the coefficients. Here’s a function to achieve that using the GLM package:

using GLM

function coefs_lm_formula(data)
  mod_level = lm(@formula(y ~ id + x1 * x2), data)
  mod_dmean = lm(@formula(y ~ id + x1_dmean * x2_dmean), data)
  (coef(mod_level)[end], coef(mod_dmean)[end])  # the last coefficient is the interaction term
end

# example
data = gen_data()
coefs_lm_formula(data)
(-0.14253265966089987, -0.01929471565297597)
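The reason coef(...)[end] picks out the quantity we care about is that the formula y ~ id + x1 * x2 expands x1 * x2 into both main effects plus their interaction, and the interaction term comes last. If you want to double-check the ordering, coefnames() should show something like the following (exact labels may vary across GLM/StatsModels versions):

coefnames(lm(@formula(y ~ id + x1 * x2), data))
# ["(Intercept)", "id", "x1", "x2", "x1 & x2"]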

Now we just need to repeat these two steps, generating the data and extracting the coefficients, a large number of times and save the results.

3. Repeat N times

The function below runs the above nsim times and stores the two coefficients in a matrix. I use the BenchmarkTools package to benchmark this function.

function run_simulations(nsim)
  sims = zeros(nsim, 2)  # preallocate an nsim × 2 results matrix
  for i in 1:nsim
    data = gen_data()
    sims[i, :] .= coefs_lm_formula(data)  # broadcast the returned tuple into row i
  end
  sims
end

using BenchmarkTools
n = 20000
@btime run_simulations($n);
  4.862 s (19160002 allocations: 16.35 GiB)

Around 5 seconds – not bad at all for this “naive” implementation that doesn’t make any use of particular performance tricks. A simple graph shows that the results are the same as in Grant’s post:

using Plots
sims = run_simulations(20000)
histogram(sims, label = ["level" "dmean"])

Some performance improvements

While this is a pretty good result already, there are of course numerous ways to speed this up. Being a novice to Julia, I’m probably not the best person to show this, but here’s an attempt anyway. One straightforward improvement is to avoid building the model matrix via the @formula call and instead construct the matrix ourselves.

Here’s a way to do this (I’ll simply overwrite the existing function). Note that only the interaction column needs to be swapped out for the demeaned regression: the intercept, id dummy, x1, and x2 columns span the same space as their demeaned counterparts, so the interaction coefficient is unaffected.

function coefs_lm_formula(data)
  constant = fill(1, nrow(data))
  # hand-built model matrix: intercept, id dummy, x1, x2, and the x1*x2 interaction
  X = Float64[constant data.id data.x1 data.x2 data.x1 .* data.x2]
  mod_level = fit(LinearModel, X, data.y)
  X[:, 5] .= data.x1_dmean .* data.x2_dmean  # replace the interaction column with its demeaned version
  mod_dmean = fit(LinearModel, X, data.y)
  (coef(mod_level)[end], coef(mod_dmean)[end])
end
coefs_lm_formula (generic function with 1 method)
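To convince yourself that the hand-built matrix matches what the @formula version produced, a quick check along these lines should work (modelmatrix() returns the design matrix of a fitted model):

data = gen_data()
mod = lm(@formula(y ~ id + x1 * x2), data)
X = Float64[fill(1, nrow(data)) data.id data.x1 data.x2 data.x1 .* data.x2]
modelmatrix(mod) ≈ X   # should be true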

And benchmark again:

@btime run_simulations($n);
  1.888 s (2700002 allocations: 8.58 GiB)

A good speedup for a rather simple change!

A final idea is to fit the model more efficiently. The fit function from GLM still computes standard errors and p-values, which are unnecessary for this example, since we only need the point estimates. Here’s a benchmark implementing this idea:

using LinearAlgebra  # cholesky! and Symmetric

# solve the normal equations (X'X)β = X'y via an in-place Cholesky factorization
fastfit(X, y) = cholesky!(Symmetric(X' * X)) \ (X' * y)

function coefs_lm_formula(data)
  constant = fill(1, nrow(data))
  X = Float64[constant data.id data.x1 data.x2 data.x1 .* data.x2]
  mod_level = fastfit(X, data.y)
  X[:, 5] .= data.x1_dmean .* data.x2_dmean
  mod_dmean = fastfit(X, data.y)
  (mod_level[end], mod_dmean[end])  # fastfit returns the coefficient vector directly
end

@btime run_simulations($n);
  1.416 s (1900002 allocations: 4.64 GiB)

Looks like calculating the standard errors and the p-values is not such an expensive operation after all.
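Still, it’s worth a quick sanity check that the plain Cholesky solve reproduces GLM’s coefficients. On a single simulated dataset, something like this should return true (up to floating-point error):

data = gen_data()
X = Float64[fill(1, nrow(data)) data.id data.x1 data.x2 data.x1 .* data.x2]
coef(fit(LinearModel, X, data.y)) ≈ fastfit(X, data.y)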

Conclusion

The goal of this post was not to squeeze out the last bit of performance for this particular kind of simulation. Instead, I hope this post shows that the first “naive” implementation in Julia (which would often be extremely slow in R) is often fast enough. There are many other advantages to Julia, but this is a major one for me.