Inspired by Grant McDermott’s blog post on efficient simulations in R, I decided to reimplement the same exercise in Julia. This post will not make much sense without having read that excellent post, so I’d recommend doing that first.
I recently switched to doing simulation work in Julia instead of R, because you don’t need many tricks to achieve decent performance on the first try. Compared to R, it is not necessary (at least in this example) to generate all the data at once. Instead, we can implement the simulation algorithm more naturally: generate a dataset, extract the quantities of interest, and repeat this process N times. I find that this leads to more intuitive and readable code, and Julia’s great for this kind of task.
1. Generate the data
We start by implementing a function gen_data() to generate a DataFrame. The code is basically a one-to-one translation from R, but we only generate one instance of the data.
```julia
using Distributions
using DataFrames

const std_normal = Normal(0, 1)

function gen_data()
    ## total time periods in the panel = 500
    tt = 500

    # x1 and x2 covariates
    x1_A = 1 .+ rand(std_normal, tt)
    x1_B = 1/4 .+ rand(std_normal, tt)
    x2_A = 1 .+ x1_A .+ rand(std_normal, tt)
    x2_B = 1 .+ x1_B .+ rand(std_normal, tt)

    # outcomes (notice different slope coefs for x2_A and x2_B)
    y_A = x1_A .+ 1 * x2_A + rand(std_normal, tt)
    y_B = x1_B .+ 2 * x2_B + rand(std_normal, tt)

    # combine
    DataFrame(
        id = vcat(fill(0, length(x1_A)), fill(1, length(x1_B))),
        x1 = vcat(x1_A, x1_B),
        x2 = vcat(x2_A, x2_B),
        x1_dmean = vcat(x1_A .- mean(x1_A), x1_B .- mean(x1_B)),
        x2_dmean = vcat(x2_A .- mean(x2_A), x2_B .- mean(x2_B)),
        y = vcat(y_A, y_B)
    )
end
```
```
gen_data (generic function with 1 method)
```
This should hopefully be pretty clear, even if you have never seen any Julia code before. The function vcat() concatenates vectors, and fill() is similar to rep() in R. Another thing that might be unusual is having to write .+ instead of + to broadcast an operation, for example when adding a scalar to a vector. While this still trips me up occasionally, I think explicit broadcasting is one of Julia's best features.
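A few toy examples of these functions and of broadcasting (not part of the simulation itself):

```julia
x = [1.0, 2.0, 3.0]

1 .+ x        # scalar + vector needs the dot: [2.0, 3.0, 4.0]
x + x         # two vectors of equal length can use plain +
fill(0, 3)    # like rep(0, 3) in R: [0, 0, 0]
vcat(x, x)    # concatenate vectors, like c(x, x) in R
```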
2. Extract the quantities of interest
Given a dataset, we can now run the two regressions and extract the coefficients. Here's a function to achieve that using the GLM package:
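As a sketch (assuming, as in Grant's post, that the quantities of interest are the coefficients on x1 and on x1_dmean):

```julia
using GLM

function coefs_lm_formula(data)
    # regression in levels and regression on the group-demeaned covariates
    mod_level = lm(@formula(y ~ x1 + x2), data)
    mod_dmean = lm(@formula(y ~ x1_dmean + x2_dmean), data)
    # keep the slopes on x1 and x1_dmean (position 2, after the intercept)
    [coef(mod_level)[2], coef(mod_dmean)[2]]
end

data = gen_data()
coefs_lm_formula(data)
```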
Now we just need to repeat these last two lines a large number of times, and save the coefficients.
3. Repeat N times
The function below runs the above nsim times and stores the two coefficients in a matrix. I use the BenchmarkTools package to benchmark this function.
```julia
function run_simulations(nsim)
    sims = zeros(nsim, 2)
    for i in 1:nsim
        data = gen_data()
        sims[i, :] .= coefs_lm_formula(data)
    end
    sims
end

using BenchmarkTools

n = 20000
@btime run_simulations($n);
```
```
4.862 s (19160002 allocations: 16.35 GiB)
```
Around 5 seconds – not bad at all for this “naive” implementation that doesn’t use any particular performance tricks. A simple graph shows that the results are the same as in Grant’s post.
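For example, kernel densities of the two coefficient series (a quick sketch with the StatsPlots package; any plotting library would do):

```julia
using StatsPlots

sims = run_simulations(n)
density(sims[:, 1], label = "y ~ x1 + x2")
density!(sims[:, 2], label = "y ~ x1_dmean + x2_dmean")
```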
While this is already a pretty good result, there are of course numerous ways to speed it up. Being a novice to Julia, I’m probably not the best person to show this, but here’s an attempt anyway. One straightforward improvement is to avoid building the model matrix through the @formula call and instead construct the matrix ourselves.
Here’s a way to do this (I’ll simply overwrite the existing function):
A last idea is to fit the model more efficiently. The fit function from GLM still computes standard errors and p-values, which are unnecessary for this example. Here’s a last benchmark implementing this:
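As a sketch, one way to do this is to skip the model objects entirely and solve the least-squares problems with the backslash operator, which returns only the coefficient vectors:

```julia
function coefs_lm_formula(data)
    n = nrow(data)
    X_level = hcat(ones(n), data.x1, data.x2)
    X_dmean = hcat(ones(n), data.x1_dmean, data.x2_dmean)
    # X \ y solves the least-squares problem directly: no standard errors,
    # no p-values, just the coefficients
    beta_level = X_level \ data.y
    beta_dmean = X_dmean \ data.y
    [beta_level[2], beta_dmean[2]]
end

@btime run_simulations($n);
```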
Looks like calculating the standard errors and the p-values is not such an expensive operation after all.
Conclusion
The goal of this post was not to squeeze out the last bit of performance for this particular kind of simulation. Instead, I hope this post shows that the first “naive” implementation in Julia (which would often be extremely slow in R) is often fast enough. There are many other advantages to Julia, but this is a major one for me.