A simple form of the IV standard error
Recently, a blog post of mine on encouragement designs was published on the Spotify Engineering blog. In this post, I want to follow up on the formula for the variance of the IV estimator that is shown in that post, which is, with a slight change in notation,
\[\text{Var}[\hat{\beta}_\text{IV}] = \frac{1}{n} \frac{ \widehat{\text{Var}}[Y-\hat{\beta}_\text{IV} X] }{\widehat{\text{Var}}[\hat{E}[X \mid Z]]},\]
where \(Y\), \(X\), and \(Z\) are random variables for the outcome, treatment, and instrument, respectively. Note that this formula is only correct if the instrument \(Z\) is binary.1
The aim of this post is to show how to derive this version of the formula from the more general IV estimator (under the assumption that there is a single binary instrument \(Z\) and a single predictor \(X\)), and how it compares to the OLS estimator. This version of the formula works well to illustrate why IV estimators have lower power than the equivalent OLS model.
To illustrate the derivations with some code, we’ll use a classic example from the econometrics literature – the returns to schooling. Here is a bit of R code to set up the examples:

library(data.table)
library(fixest)

data("SchoolingReturns", package = "ivreg")
returns <- as.data.table(SchoolingReturns)
returns <- returns[, .(y = log(wage), x = education, z = nearcollege == "yes")]
head(returns)
#>           y     x      z
#>       <num> <num> <lgcl>
#> 1: 6.306275     7  FALSE
#> 2: 6.175867    12  FALSE
#> 3: 6.580639    12  FALSE
#> 4: 5.521461    11   TRUE
#> 5: 6.591674    12   TRUE
#> 6: 6.214608    12   TRUE

The data set contains the wage in dollars (which I log-transform here) as the outcome \(Y\), the years of education as the treatment \(X\), and whether the individual grew up near a college, which will be used as the instrument \(Z\).
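As a small optional check (nothing below depends on it, and I omit the output), we can look at how the sample splits on the instrument and how the mean years of education differ between the two groups:

# how the sample splits on the instrument, with the mean years of education per group
returns[, .(n = .N, mean_education = mean(x)), by = z]
# share of respondents who grew up near a college
returns[, mean(z)]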
OLS standard error
To show the similarities and differences between the IV and the OLS standard error, let’s first take a look at the standard error of a simple linear model. Consider a standard linear model of the form \(y_{i}=\alpha+\beta x_{i}+u_{i}\), where we apply all the usual regression assumptions. We’re interested in an estimate of \(\beta\), and its standard error, \(\sqrt{\text{Var}[\hat{\beta}]}\). If we estimate this model using OLS, and call the estimate \(\hat{\beta}_\text{OLS}\), we obtain
\[ \begin{align} \text{Var}[\hat{\beta}_{\text{OLS}}] &=\frac{1}{n}\frac{\text{(residual variance of }y)}{\text{(variance of }x)} \\&=\frac{1}{n}\frac{\widehat{\text{Var}}[Y-\hat{\beta}_\text{OLS} X]}{\widehat{\text{Var}}[X]}. \end{align} \]
(We use a factor of \(\frac{1}{n}\) here for simplicity – use \(\frac{1}{n-2}\) to obtain an unbiased estimate.)
The numerator might look a bit non-standard, so here’s a quick derivation. If we define the predicted values as \(\hat{Y} = \hat{\alpha}+\hat{\beta}_\text{OLS} X\), then the numerator can be written as
\[ \begin{align} \widehat{\text{Var}}[Y-\hat{Y}] &=\widehat{\text{Var}}[Y-(\hat{\alpha}+\hat{\beta}_\text{OLS} X)] \\&=\widehat{\text{Var}}[Y-(\hat{E}[Y]-\hat{\beta}_\text{OLS} \hat{E}[X] + \hat{\beta}_\text{OLS} X)] \\&=\widehat{\text{Var}}[Y - \hat{\beta}_\text{OLS} X], \end{align} \]
where the last term simplifies because constant terms drop out of the variance.
This R code demonstrates the result:
model_ols <- lm(y ~ x, data = returns)
beta_ols <- coef(model_ols)[[2]]
# standard error calculated by lm
sqrt(vcov(model_ols)[2, 2])
#> [1] 0.002869708
# compute standard error manually
adj <- 1 / (nrow(returns) - 2) # use the dof that lm uses
returns[, sqrt(adj * var(y - beta_ols * x) / var(x))]
#> [1] 0.002869708
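As a quick side check of the numerator identity derived above (not part of the original derivation, output not shown), the residual variance of the fitted model and the variance of \(Y-\hat{\beta}_\text{OLS} X\) should agree, because the intercept only shifts the residuals by a constant:

# residual variance from lm vs. the variance of y - beta_ols * x;
# the two differ only by the constant intercept, which drops out of the variance
var(residuals(model_ols))
returns[, var(y - beta_ols * x)]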
Deriving the IV standard error
If we consult any standard textbook on econometrics, we’ll find that the general formula for the variance of the IV estimator is
\[ \hat{\sigma}_{\text{IV}}^{2}(\hat{\mathbf{X}}'\hat{\mathbf{X}})^{-1}. \]
The logic of the IV estimator is that we use only the variation of \(\mathbf{X}\) that is due to \(\mathbf{Z}\) to estimate the effect on \(Y\), and this formula reflects this logic. To see this more clearly, assume that we only have one endogenous variable \(X\) and one instrumental variable \(Z\). The first term, \(\hat{\sigma}_{\text{IV}}^{2}\), then becomes \(\widehat{\text{Var}}[Y-\hat{\beta}_\text{IV} X]\). Note that compared to the OLS estimator, we use \(\hat{\beta}_\text{IV}\) instead of \(\hat{\beta}_\text{OLS}\) – again, this is because we use only the variation of \(X\) that is due to \(Z\).
The second term is based on the matrix \(\hat{\mathbf{X}}\), which contains the predicted values of the regression of \(X\) on \(Z\). In matrix algebra, this is \(\hat{\mathbf{X}}=\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{X}\), but if we assume that we have only one endogenous variable and one instrumental variable, \(\hat{\mathbf{X}}\) has a simpler form. Let’s define \(\hat{\alpha}_{\text{FS}}\) and \(\hat{\beta}_{\text{FS}}\) as the intercept and slope estimates of the regression of \(X\) on \(Z\) (FS = first stage). We can then define the random variable \(\hat{X}\) that contains the predicted values of this regression:
\[\hat{X}=\hat{\alpha}_{\text{FS}} + \hat{\beta}_{\text{FS}}Z=\hat{E}[X]+\frac{\widehat{\text{Cov}}(X,Z)}{\widehat{\text{Var}}(Z)}(Z-\hat{E}[Z]),\]
where the second equality follows from simple regression. The corresponding matrix is then \(\hat{\mathbf{X}}=\begin{bmatrix}\mathbf{1} & \hat{X}\end{bmatrix}\) of size \(n\times 2\). With a bit of matrix algebra, we can now carry out the multiplication and find the inverse:
\[ \begin{align} (\hat{\mathbf{X}}'\hat{\mathbf{X}})^{-1}&=\begin{bmatrix}n & \mathbf{1}'\hat{X}\\ \mathbf{1}'\hat{X} & \hat{X}'\hat{X} \end{bmatrix}^{-1}\\&=\frac{1}{n}\begin{bmatrix}1 & \hat{E}[X]\\ \hat{E}[X] & \hat{E}^{2}[X]+\frac{\widehat{\text{Cov}}^{2}(X,Z)}{\widehat{\text{Var}}(Z)} \end{bmatrix}^{-1}\\&=\frac{1}{n\frac{\widehat{\text{Cov}}^{2}(X,Z)}{\widehat{\text{Var}}(Z)}}\begin{bmatrix}\hat{E}^{2}[X]+\frac{\widehat{\text{Cov}}^{2}(X,Z)}{\widehat{\text{Var}}(Z)} & -\hat{E}[X]\\ -\hat{E}[X] & 1 \end{bmatrix} \end{align} \]
The relevant entry here is in the lower right-hand corner of the matrix, so we have as a preliminary formula
\[\text{Var}[\hat{\beta}_\text{IV}] = \widehat{\text{Var}}[Y-\hat{\beta}_\text{IV} X] \frac{1}{n} \frac{\widehat{\text{Var}}(Z)}{\widehat{\text{Cov}}^{2}(X,Z)}. \]
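To make these steps concrete, here is a small numerical sketch (not part of the derivation, output not shown). It fits the first stage with lm, checks that the slope equals \(\widehat{\text{Cov}}(X,Z)/\widehat{\text{Var}}(Z)\), and compares the lower right-hand entry of \((\hat{\mathbf{X}}'\hat{\mathbf{X}})^{-1}\) with the \(\frac{1}{n}\widehat{\text{Var}}(Z)/\widehat{\text{Cov}}^{2}(X,Z)\) part of the preliminary formula. The objects first_stage and Xhat are only introduced here for illustration, and the last line uses the \(1/n\) variance convention so that it matches the matrix expression exactly:

# first stage: regress x on z and compare the slope with Cov(X, Z) / Var(Z)
first_stage <- lm(x ~ z, data = returns)
coef(first_stage)[[2]]
returns[, cov(x, z) / var(z)] # the n-1 factors cancel in the ratio

# lower right-hand entry of the inverse, using the fitted values of the first stage
Xhat <- cbind(1, fitted(first_stage))
solve(crossprod(Xhat))[2, 2]

# the same quantity from the preliminary formula, with 1/n-style variance and covariance
returns[, (mean((z - mean(z))^2) / .N) / mean((x - mean(x)) * (z - mean(z)))^2]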
Until this point, we have only assumed that there is one instrument, but not that this instrument is binary. We’ll now make this assumption to simplify the formula a bit further. If \(Z\) is binary, we have \(\hat{E}[Z]=P(Z=1)\) as the proportion of cases where the instrument is 1, and \(1-\hat{E}[Z]=P(Z=0)\) as the proportion of cases where the instrument is 0. Because \(Z\) is a Bernoulli random variable, we can then immediately state that
\[\widehat{\text{Var}}(Z)=\hat{E}[Z](1-\hat{E}[Z]).\]
For the next derivations, we’ll make use of the fact that we can rewrite expectations as group weighted averages. For instance,
\[\hat{E}[X]=(1-\hat{E}[Z])\hat{E}[X\mid Z=0]+\hat{E}[Z]\hat{E}[X\mid Z=1].\]
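Here is a one-line sketch of this identity in the data (again just a check, output not shown):

# the overall mean of x, written as a weighted average of the two group means
returns[, (1 - mean(z)) * mean(x[!z]) + mean(z) * mean(x[z])]
returns[, mean(x)]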
We’ll use this strategy to ‘simplify’ the covariance term:
\[ \begin{align} \widehat{\text{Cov}}(X,Z)&=\hat{E}[Z(X-\hat{E}[X])]\\&=(1-\hat{E}[Z])\hat{E}[Z(X-\hat{E}[X])\mid Z=0]+\hat{E}[Z]\hat{E}[Z(X-\hat{E}[X])\mid Z=1]\\&=\hat{E}[Z]\hat{E}[X\mid Z=1]-\hat{E}[Z]\hat{E}[\hat{E}[X]\mid Z=1]\\&=\hat{E}[Z]\left(\hat{E}[X\mid Z=1]-\hat{E}[X]\right)\\&=\hat{E}[Z]\left(\hat{E}[X\mid Z=1]-(1-\hat{E}[Z])\hat{E}[X\mid Z=0]-\hat{E}[Z]\hat{E}[X\mid Z=1]\right)\\&=\hat{E}[Z]\left(\hat{E}[X\mid Z=1](1-\hat{E}[Z])-(1-\hat{E}[Z])\hat{E}[X\mid Z=0]\right)\\&=\hat{E}[Z](1-\hat{E}[Z])\left(\hat{E}[X\mid Z=1]-\hat{E}[X\mid Z=0]\right)\\&=\widehat{\text{Var}}(Z)\left(\hat{E}[X\mid Z=1]-\hat{E}[X\mid Z=0]\right) \end{align} \]
The first equality is just the definition of the covariance. We then rewrite the expectation as a weighted average. Because the covariance involves \(Z\), the term where \(Z=0\) drops out. After simplifying, we replace \(\hat{E}[X]\) with its alternative form as a weighted average. The final version then says that the covariance of a random variable \(X\) and a Bernoulli random variable \(Z\) is equal to the difference in means between \(X\) when \(Z=1\) and \(X\) when \(Z=0\), times the variance of \(Z\).
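A quick numerical sketch of this covariance identity (output not shown). Note that the identity only holds exactly with the \(1/n\) variance convention, so the covariance is computed by hand here instead of with cov:

# covariance of x and the binary instrument z, written both ways (1/n convention)
returns[, mean((x - mean(x)) * (z - mean(z)))]
returns[, mean(z) * (1 - mean(z)) * (mean(x[z]) - mean(x[!z]))]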
Plugging these two results into the formula, we get
\[\text{Var}[\hat{\beta}_{\text{IV}}]=\frac{1}{n}\frac{\widehat{\text{Var}}[Y-\hat{\beta}_{\text{IV}}X]}{\widehat{\text{Var}}(Z)\left(\hat{E}[X\mid Z=1]-\hat{E}[X\mid Z=0]\right)^{2}}.\]
The last step is to show that the denominator is equal to \(\widehat{\text{Var}}[\hat{E}[X\mid Z]]\):
\[ \begin{align} \widehat{\text{Var}}[\hat{E}[X\mid Z]]&=\hat{E}[Z]\left(\hat{E}[X\mid Z=1]-\hat{E}[X]\right)^{2}+(1-\hat{E}[Z])\left(\hat{E}[X\mid Z=0]-\hat{E}[X]\right)^{2}\\&=\hat{E}[Z](1-\hat{E}[Z])^{2}\left(\hat{E}[X\mid Z=1]-\hat{E}[X\mid Z=0]\right)^{2}\\&\quad+(1-\hat{E}[Z])\hat{E}^{2}[Z]\left(\hat{E}[X\mid Z=1]-\hat{E}[X\mid Z=0]\right)^{2}\\&=\widehat{\text{Var}}(Z)\left(\hat{E}[X\mid Z=1]-\hat{E}[X\mid Z=0]\right)^{2} \end{align} \]
The first equality is just the definition of the variance of the group means, when two groups are involved. We then apply the identities that have been used for the covariance term, and simplify the result. This is identical to the denominator, so the result is proven.
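And the corresponding check for the variance of the group means (same \(1/n\) convention, output not shown):

# variance of the two group means around the overall mean, written both ways
returns[, mean(z) * (mean(x[z]) - mean(x))^2 + (1 - mean(z)) * (mean(x[!z]) - mean(x))^2]
returns[, mean(z) * (1 - mean(z)) * (mean(x[z]) - mean(x[!z]))^2]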
We therefore have as the final result:
\[\text{Var}[\hat{\beta}_\text{IV}] = \frac{1}{n} \frac{ \widehat{\text{Var}}[Y-\hat{\beta}_\text{IV} X] }{\widehat{\text{Var}}[\hat{E}[X \mid Z]]}.\]
To translate this into R code, we estimate the between variance using the anova function. (We could also use the alternative version, where we take the squared difference between the means and multiply by the variance of \(Z\).)
# we use `feols` from the fixest package for IV estimation
model_iv <- feols(y ~ 1 | x ~ z, data = returns)
beta_iv <- coef(model_iv)[[2]]
# standard error calculated by feols
sqrt(vcov(model_iv)[2, 2])
#> [1] 0.02629134
# compute standard error manually
between_variance <- anova(lm(x ~ z, data = returns))[1, "Mean Sq"] / nrow(returns)
adj <- 1 / (nrow(returns) - 1) # use the dof that feols uses
returns[, sqrt(adj * var(y - beta_iv * x) / between_variance)]
#> [1] 0.02629134
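As mentioned above, the anova call is not the only option: the between variance can also be computed directly as the squared difference in group means times the variance of \(Z\). A short sketch of this alternative (with the \(1/n\) variance convention it reproduces between_variance exactly; output not shown):

# the alternative version: squared difference in group means, times Var(Z),
# where Var(Z) = mean(z) * (1 - mean(z)) under the 1/n convention
returns[, mean(z) * (1 - mean(z)) * (mean(x[z]) - mean(x[!z]))^2]
between_variance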
Note that the IV standard error is almost ten times the size of the OLS standard error.
Comparing the IV and OLS standard errors
When we directly compare the IV and the OLS standard error, it becomes apparent that the two formulas are very similarly structured (again, this applies only if \(Z\) is binary):
\[\begin{align} \text{Var}[\hat{\beta}_{\text{OLS}}] &=\frac{1}{n}\frac{\widehat{\text{Var}}[Y-\hat{\beta}_\text{OLS} X]}{\widehat{\text{Var}}[X]} \\ \text{Var}[\hat{\beta}_\text{IV}] &= \frac{1}{n} \frac{ \widehat{\text{Var}}[Y-\hat{\beta}_\text{IV} X] }{\widehat{\text{Var}}[\hat{E}[X \mid Z]]} \end{align}\]
In both cases, we have three ways of achieving a lower standard error:
- Increase the sample size, \(n\),
- Reduce the size of the numerator,
- Increase the size of the denominator.
With OLS, we are guaranteed to obtain the smallest possible standard error under the usual regression assumptions. In comparison, the IV estimator replaces \(\hat{\beta}_{\text{OLS}}\) with \(\hat{\beta}_{\text{IV}}\) in the numerator (which must increase the numerator unless the two \(\beta\)’s are identical), and uses only part of the variance of \(X\) in the denominator (which must decrease the denominator unless \(Z\) perfectly determines \(X\)). Hence, the loss in power of the IV estimator comes both from the numerator, where we use only part of the variance of \(X\) to predict \(Y\), and from the denominator, where we use only the part of the variance of \(X\) that is explained by \(Z\).
The denominator is based on the law of total variance, which states that
\[\text{Var}[X]=\text{Var}[E[X \mid Z]]+E[\text{Var}[X \mid Z]].\]
This is a between/within decomposition. The law of total variance states that the total variance is equal to the variance of the group means (“between”), plus the variance within the groups. The IV estimator uses only the “between” term. Hence, to make this term as large as possible, \(Z\) should predict \(X\) well. As we have seen, another way to put this is to maximize \(\widehat{\text{Var}}(Z)\left(E[X\mid Z=1]-E[X\mid Z=0]\right)^{2}\). Hence, in the optimal case, we would like to have \(E[Z]=0.5\) to maximize the variance, and have a large difference in group means.
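A small sketch of this decomposition in the data, with all variances computed using the \(1/n\) convention so that the two sides match exactly (output not shown; between_variance is the object computed earlier):

# law of total variance for x, split by the binary instrument z
returns[, mean((x - mean(x))^2)]                          # total variance of x
returns[, between_variance +                              # between: Var[E[X | Z]]
          mean(z) * mean((x[z] - mean(x[z]))^2) +         # within, z = TRUE
          (1 - mean(z)) * mean((x[!z] - mean(x[!z]))^2)]  # within, z = FALSE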
In the empirical example, the numerators and denominators compare as follows:
# comparing the numerators
num_ols <- returns[, var(y - beta_ols * x)]
num_iv <- returns[, var(y - beta_iv * x)]
c(num_ols, num_iv)
#> [1] 0.1775096 0.3099878
# comparing the denominators
denom_ols <- returns[, var(x)]
denom_iv <- between_variance
c(denom_ols, denom_iv)
#> [1] 7.1658624 0.1490379
The IV numerator is almost twice the size of the OLS numerator, but the real loss in power comes from the denominator, which differs by a factor of almost 50. Clearly, and not surprisingly, most of the variance in education is not between people who grew up close to and far from colleges, but within these two groups. This is confirmed by a quick look at the variance decomposition:
returns[, .(between_variance, within_variance = var(x) - between_variance)]
#>    between_variance within_variance
#>               <num>           <num>
#> 1:        0.1490379        7.016824
Conclusion
In the simple setting with just one endogenous variable and one binary instrument, the IV standard error can be shown to have a simple form that can be compared easily with the OLS standard error. The logic of the IV estimator is that instead of using the full information in \(X\), we use only that part of the information in \(X\) that is also contained in the instrument \(Z\). This affects both the numerator and the denominator of the IV standard error. This highlights how important it is to choose an instrument that is strongly predictive of \(X\).
Footnotes
If \(Z\) is continuous, the formula is still correct if a linear estimator is used for \({\widehat{\text{Var}}[\hat{E}[X \mid Z]]}\).↩︎