<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Ben Elbers</title>
<link>https://elbersb.com/public/posts.html</link>
<atom:link href="https://elbersb.com/public/posts.xml" rel="self" type="application/rss+xml"/>
<description>Ben Elbers&#39; personal website</description>
<generator>quarto-1.7.31</generator>
<lastBuildDate>Sat, 26 Jul 2025 22:00:00 GMT</lastBuildDate>
<item>
  <title>Comparing the scales of the Dissimilarity index and Theil’s index of segregation</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2025-07-27-segregation-scale-H-D/</link>
  <description><![CDATA[ 




<p>When studying segregation, the question often comes up of how to interpret the extent of segregation. Is an index value of 0.3 a meaningful amount of segregation? At what threshold do we speak of “high segregation”? For the Dissimilarity index, D, Massey and Denton (1993) proposed</p>
<blockquote class="blockquote">
<p>a simple rule of thumb … values under 0.3 are low, those between 0.3 and 0.6 are moderate and anything above 0.6 is high. (p.&nbsp;20)</p>
</blockquote>
<p>This simple rule has been frequently used when interpreting the D, but it’s not directly transferable to other segregation indices that operate on a different scale.</p>
<p>In this post, I’ll explore some properties of the Dissimilarity index and Theil’s H index, both of which are frequently used in studies of segregation, and compare how the scales of the D and the H relate to each other.</p>
<section id="understanding-d-and-h" class="level3">
<h3 class="anchored" data-anchor-id="understanding-d-and-h">Understanding D and H</h3>
<p>The Dissimilarity index operates on a linear scale. To illustrate this point, let’s assume we have a city with two racial groups A and B, and two schools. To compare the outcomes for different indices, we define a single parameter <img src="https://latex.codecogs.com/png.latex?n">, so that we can generate different segregation scenarios using this parameter:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>School</th>
<th>A</th>
<th>B</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>School 1</td>
<td><img src="https://latex.codecogs.com/png.latex?n"></td>
<td><img src="https://latex.codecogs.com/png.latex?2000%20-%20n"></td>
</tr>
<tr class="even">
<td>School 2</td>
<td><img src="https://latex.codecogs.com/png.latex?2000%20-%20n"></td>
<td><img src="https://latex.codecogs.com/png.latex?n"></td>
</tr>
</tbody>
</table>
<p>For instance, if we set <img src="https://latex.codecogs.com/png.latex?n=1000">, there is no segregation, or perfect integration: Both schools have an equal number of A and B students. If we set <img src="https://latex.codecogs.com/png.latex?n=0">, there is complete segregation: Every school contains only a single racial group. For values of <img src="https://latex.codecogs.com/png.latex?n"> between 0 and 1000, we get intermediate levels of segregation. It’s important to keep in mind that this is a very restricted scenario: Regardless of the value of <img src="https://latex.codecogs.com/png.latex?n">, we only have two schools, both of equal size, and the two racial groups are also of equal size.</p>
<p>To see how the Dissimilarity index changes when we move away from perfect integration, let’s simplify the formula for our case. Here, <img src="https://latex.codecogs.com/png.latex?A"> and <img src="https://latex.codecogs.com/png.latex?B"> are the totals for each racial group, and <img src="https://latex.codecogs.com/png.latex?a_i"> and <img src="https://latex.codecogs.com/png.latex?b_i"> refer to the number of students of racial groups A and B in school <img src="https://latex.codecogs.com/png.latex?i">. We then have</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AD%20&amp;=%20%5Cfrac%7B1%7D%7B2%7D%20%5Csum_i%20%5Cleft%7C%20%5Cfrac%7Ba_i%7D%7BA%7D%20-%20%5Cfrac%7Bb_i%7D%7BB%7D%20%5Cright%7C%20%5C%5C%0A%20%20&amp;=%20%5Cfrac%7B1%7D%7B2%7D%20%5Cleft%7C%20%5Cfrac%7Bn%7D%7B2000%7D%20-%20%5Cfrac%7B2000%20-%20n%7D%7B2000%7D%20%5Cright%7C%20+%20%5Cfrac%7B1%7D%7B2%7D%20%5Cleft%7C%20%5Cfrac%7B2000%20-%20n%7D%7B2000%7D%20-%20%5Cfrac%7Bn%7D%7B2000%7D%20%5Cright%7C%20%5C%5C%0A%20%20&amp;=%20%5Cfrac%7B1%7D%7B4000%7D%20%5Cleft(%20%5Cleft%7C%202n%20-%202000%20%5Cright%7C%20+%20%5Cleft%7C%202000%20-%202n%20%5Cright%7C%20%5Cright)%20%5C%5C%0A%20%20&amp;=%20%5Cfrac%7B1%7D%7B2000%7D%20%5Cleft%7C%202000%20-%202n%20%5Cright%7C%20%5C%5C%0A%20%20&amp;=%201%20-%20%5Cfrac%7B1%7D%7B1000%7D%20n%20%5C%5C%0A%5Cend%7Balign%7D%0A"></p>
<p>and <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20n%7D%20D%20=%20-%5Cfrac%7B1%7D%7B1000%7D">. Hence, for every pair of students that switches places to increase segregation (decreasing <img src="https://latex.codecogs.com/png.latex?n"> by 1), the D increases by a constant amount, <img src="https://latex.codecogs.com/png.latex?1/1000">. This is an important property of the Dissimilarity index: it operates on a linear scale.</p>
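<p>As a quick sanity check (a Python sketch of the same scenario, not part of the original R analysis), we can compute D directly from its definition for the parametrized table and confirm that moving one pair of students always changes the index by exactly 1/1000:</p>

```python
def dissimilarity(a, b):
    """Dissimilarity index for per-school counts of groups A and B."""
    A, B = sum(a), sum(b)
    return 0.5 * sum(abs(ai / A - bi / B) for ai, bi in zip(a, b))

def d_of_n(n, N=2000):
    # School 1 holds (n, N - n) students of groups A and B; school 2 the mirror image.
    return dissimilarity([n, N - n], [N - n, n])

# D falls from 1 (n = 0) to 0 (n = 1000) in constant steps of 1/1000.
steps = [d_of_n(n - 1) - d_of_n(n) for n in range(1, 1001)]
```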
<p>Let’s do the same exercise for Theil’s index of segregation, which we’ll call the H index. Here, <img src="https://latex.codecogs.com/png.latex?E"> refers to the entropy of the racial group distribution, <img src="https://latex.codecogs.com/png.latex?E_i"> is the entropy of the racial group distribution within school <img src="https://latex.codecogs.com/png.latex?i">, and <img src="https://latex.codecogs.com/png.latex?p_i"> is the proportion of students in school <img src="https://latex.codecogs.com/png.latex?i">. Because we have two groups of equal size, <img src="https://latex.codecogs.com/png.latex?E=%5Clog%202">, and we also have <img src="https://latex.codecogs.com/png.latex?E_1=E_2">, as the distributions are just flipped. We then have:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AH%20&amp;=%20%5Cfrac%7B1%7D%7B%20E%20%7D%20%5Csum_i%20p_i%20%5Cleft(E%20-%20E_i%20%5Cright)%20%5C%5C%0A%20%20&amp;=%20%5Csum_i%20%5Cfrac%7B1%7D%7B2%7D%20%5Cleft(1%20-%20%5Cfrac%7BE_i%7D%7B%5Clog%202%7D%20%5Cright)%20%5C%5C%0A%20%20&amp;=%201%20-%20%5Cfrac%7BE_1%7D%7B%5Clog%202%7D%20%5C%5C%0A%5Cend%7Balign%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?E_1=E_2=-%5Cfrac%7Bn%7D%7B2000%7D%20%5Clog%5Cfrac%7Bn%7D%7B2000%7D%20-%20%5Cfrac%7B2000-n%7D%7B2000%7D%20%5Clog%5Cfrac%7B2000-n%7D%7B2000%7D">. We therefore have <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20n%7D%20H%20=%20%5Cfrac%7B1%7D%7B2000%20%5Clog%202%7D%20%5Cleft(%20%5Clog%20%5Cfrac%7Bn%7D%7B2000%7D%20-%20%5Clog%20%5Cleft(1%20-%20%5Cfrac%7Bn%7D%7B2000%7D%5Cright)%20%5Cright)."></p>
<p>This shows that for the H index, the change in segregation depends on <img src="https://latex.codecogs.com/png.latex?n"> and is not constant. Because the H index operates on a log scale, a marginal change in segregation has a smaller absolute effect when there is little segregation than when there is already a lot of segregation:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20n%7D%20H%20%5Cmid_%7Bn=900%7D%20&amp;=%20-0.0001%20%5C%5C%0A%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20n%7D%20H%20%5Cmid_%7Bn=100%7D%20&amp;=%20-0.0021%0A%5Cend%7Balign%7D%0A"></p>
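<p>The two derivative values above can be verified numerically. The following Python sketch (my own illustration, using the formulas derived in this post) evaluates the closed-form derivative and checks it against a central finite difference:</p>

```python
import math

N = 2000  # total students per group

def entropy_within(n):
    # Entropy of the (n, N - n) split within one school; 0*log(0) is treated as 0.
    p = n / N
    return -sum(t * math.log(t) for t in (p, 1 - p) if t > 0)

def h_index(n):
    # In this symmetric two-school scenario, H = 1 - E_1 / log(2).
    return 1 - entropy_within(n) / math.log(2)

def dh_dn(n):
    # Closed-form derivative of H with respect to n, as derived above.
    return (math.log(n / N) - math.log(1 - n / N)) / (N * math.log(2))

def dh_dn_numeric(n, eps=1e-3):
    # Central-difference approximation for comparison.
    return (h_index(n + eps) - h_index(n - eps)) / (2 * eps)
```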
<p>To make these results a bit more intuitive, let’s directly compare the D and H values across the range of possible values for <img src="https://latex.codecogs.com/png.latex?n">:</p>
<div class="cell">
<details class="code-fold">
<summary>Show the code</summary>
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb1-2"></span>
<span id="cb1-3">N <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2000</span></span>
<span id="cb1-4">e <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(n) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> N <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> N) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> (N <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> n) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> N <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>((N <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> n) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> N)</span>
<span id="cb1-5">h <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(n) <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">e</span>(n) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb1-6">d <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(n) <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> n</span>
<span id="cb1-7"></span>
<span id="cb1-8">seg <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb1-9">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),</span>
<span id="cb1-10">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">measure =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"D"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1001</span>), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"H"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1001</span>)),</span>
<span id="cb1-11">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">value =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">d</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">h</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>))</span>
<span id="cb1-12">)</span>
<span id="cb1-13"></span>
<span id="cb1-14">(</span>
<span id="cb1-15"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(seg, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> n, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> value, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> measure))</span>
<span id="cb1-16">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>()</span>
<span id="cb1-17">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_x_reverse</span>()</span>
<span id="cb1-18">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Segregation"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"&lt; Less segregated | More segregated &gt;"</span>)</span>
<span id="cb1-19">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>()</span>
<span id="cb1-20">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">legend.title =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>())</span>
<span id="cb1-21">)</span></code></pre></div>
</details>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2025-07-27-segregation-scale-H-D/index_files/figure-html/unnamed-chunk-1-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>Clearly, the D increases linearly as <img src="https://latex.codecogs.com/png.latex?n"> decreases, while the <img src="https://latex.codecogs.com/png.latex?H"> shows logarithmic behavior: small increases when segregation is low, large increases when segregation is high. The <img src="https://latex.codecogs.com/png.latex?H"> index is always smaller than the D index, except in the two extreme cases of complete integration and complete segregation, where the index values are identical.</p>
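<p>That H never exceeds D in this scenario, with equality only at the endpoints, is easy to check exhaustively. A small Python sketch (a translation of the R functions above, for illustration only):</p>

```python
import math

N = 2000  # total students per group

def d_of_n(n):
    # D = 1 - n/1000 in the two-school scenario.
    return 1 - n / (N / 2)

def h_of_n(n):
    # H = 1 - E_1/log(2), with 0*log(0) treated as 0.
    p = n / N
    e1 = -sum(t * math.log(t) for t in (p, 1 - p) if t > 0)
    return 1 - e1 / math.log(2)

# H is strictly below D for every intermediate n; they agree at n = 0 and n = 1000.
strictly_below = all(h_of_n(n) < d_of_n(n) for n in range(1, 1000))
```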
<p>Let’s link this back up to Massey and Denton’s interpretation of the D:</p>
<div class="cell">
<details class="code-fold">
<summary>Show the code</summary>
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(kableExtra)</span>
<span id="cb2-2"></span>
<span id="cb2-3">compare <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb2-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">D =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">d</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rev</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>))),</span>
<span id="cb2-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">H =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">h</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rev</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>))), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>),</span>
<span id="cb2-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">level =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"low"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"low"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"low"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"moderate"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"moderate"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"moderate"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"moderate"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"high"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"high"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"high"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"high"</span>)</span>
<span id="cb2-7">)</span>
<span id="cb2-8"></span>
<span id="cb2-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">kable</span>(compare, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">digits =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">col.names =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"D index"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"H index"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Massey/Denton"</span>))</span></code></pre></div>
</details>
<div class="cell-output-display">
<table class="caption-top table table-sm table-striped small">
<thead>
<tr class="header">
<th style="text-align: right;">D index</th>
<th style="text-align: right;">H index</th>
<th style="text-align: left;">Massey/Denton</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: right;">0.0</td>
<td style="text-align: right;">0.00</td>
<td style="text-align: left;">low</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.1</td>
<td style="text-align: right;">0.01</td>
<td style="text-align: left;">low</td>
</tr>
<tr class="odd">
<td style="text-align: right;">0.2</td>
<td style="text-align: right;">0.03</td>
<td style="text-align: left;">low</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.3</td>
<td style="text-align: right;">0.07</td>
<td style="text-align: left;">moderate</td>
</tr>
<tr class="odd">
<td style="text-align: right;">0.4</td>
<td style="text-align: right;">0.12</td>
<td style="text-align: left;">moderate</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.5</td>
<td style="text-align: right;">0.19</td>
<td style="text-align: left;">moderate</td>
</tr>
<tr class="odd">
<td style="text-align: right;">0.6</td>
<td style="text-align: right;">0.28</td>
<td style="text-align: left;">moderate</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.7</td>
<td style="text-align: right;">0.39</td>
<td style="text-align: left;">high</td>
</tr>
<tr class="odd">
<td style="text-align: right;">0.8</td>
<td style="text-align: right;">0.53</td>
<td style="text-align: left;">high</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.9</td>
<td style="text-align: right;">0.71</td>
<td style="text-align: left;">high</td>
</tr>
<tr class="odd">
<td style="text-align: right;">1.0</td>
<td style="text-align: right;">1.00</td>
<td style="text-align: left;">high</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>Hence, in this example, an H value above ~0.07 should already be considered “moderate” segregation, and a value above ~0.28 “high” segregation!</p>
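<p>For this specific scenario, the translation in the table can be computed for any D value. The Python helper below (my own sketch, assuming the two-school setup used throughout) inverts D = 1 - n/1000 to recover n and then evaluates H for the same table:</p>

```python
import math

N = 2000  # total students per group

def h_index(n):
    # H = 1 - E_1/log(2) for the table (n, N - n) / (N - n, n).
    p = n / N
    e1 = -sum(t * math.log(t) for t in (p, 1 - p) if t > 0)
    return 1 - e1 / math.log(2)

def d_to_h(d_value):
    # Invert D = 1 - n/1000, then compute H for the resulting table.
    n = round((N / 2) * (1 - d_value))
    return h_index(n)
```

For example, the Massey/Denton cutoffs D = 0.3 and D = 0.6 map to roughly H = 0.07 and H = 0.28 under these assumptions.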
</section>
<section id="a-more-general-situation" class="level3">
<h3 class="anchored" data-anchor-id="a-more-general-situation">A more general situation</h3>
<p>There is a big problem with the table above: It should not be used to translate D into H values in all kinds of situations. The example is an edge case: there are only two schools, both schools are of equal size, and the two racial groups are of equal size as well. Ultimately, the D and the H index work differently and evaluate the same situations differently. To illustrate this point, we generalize our example slightly by studying <em>all</em> possible 2x2 tables with a fixed total population count.</p>
<p>Generating all possible tables can be computationally expensive, so I’m using all tables with a total population of 100 here. That yields 176,451 unique tables, after removing tables that have empty schools or empty racial groups. I then calculated the H and D for each of these tables, and the result is a two-dimensional distribution that looks like this:</p>
<div class="cell">
<details class="code-fold">
<summary>Show the code</summary>
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(data.table)</span>
<span id="cb3-2"></span>
<span id="cb3-3">generate_tables <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(n, k) {</span>
<span id="cb3-4">    helper <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(n, k, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prefix =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>()) {</span>
<span id="cb3-5">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (k <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(prefix, n)))</span>
<span id="cb3-6"></span>
<span id="cb3-7">        result <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>()</span>
<span id="cb3-8">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> (i <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>n) {</span>
<span id="cb3-9">            result <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(result, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">helper</span>(n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> i, k <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(prefix, i)))</span>
<span id="cb3-10">        }</span>
<span id="cb3-11">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(result)</span>
<span id="cb3-12">    }</span>
<span id="cb3-13"></span>
<span id="cb3-14">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">helper</span>(n, k)</span>
<span id="cb3-15">}</span>
<span id="cb3-16"></span>
<span id="cb3-17">dt <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbindlist</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lapply</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">generate_tables</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>), <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.list</span>(x)))</span>
<span id="cb3-18"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">names</span>(dt) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"w1"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"w2"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"b1"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"b2"</span>)</span>
<span id="cb3-19">dt[, s1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> w1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b1]</span>
<span id="cb3-20">dt[, s2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> w2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b2]</span>
<span id="cb3-21">dt[, w <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> w1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> w2]</span>
<span id="cb3-22">dt[, b <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> b1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b2]</span>
<span id="cb3-23">dt[, n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> w <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b]</span>
<span id="cb3-24">dt <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> dt[s1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> s2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> w <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> b <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb3-25"></span>
<span id="cb3-26">logf <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(x))</span>
<span id="cb3-27"></span>
<span id="cb3-28">dt[, D <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">abs</span>(w1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> w <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> b1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> b) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">abs</span>(w2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> w <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> b2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> b))]</span>
<span id="cb3-29">dt[, E <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>w <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logf</span>(w <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> n) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> b <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logf</span>(b <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> n)]</span>
<span id="cb3-30">dt[, E1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>w1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> s1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logf</span>(w1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> s1) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> b1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> s1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logf</span>(b1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> s1)]</span>
<span id="cb3-31">dt[, E2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>w2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> s2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logf</span>(w2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> s2) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> b2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> s2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logf</span>(b2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> s2)]</span>
<span id="cb3-32">dt[, H <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> E <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (s1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (E <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> E1) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> s2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (E <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> E2))]</span>
<span id="cb3-33"></span>
<span id="cb3-34">(</span>
<span id="cb3-35">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(dt, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x=</span>D, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y=</span>H))</span>
<span id="cb3-36">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">stat_bin_hex</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">bins=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">60</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>))</span>
<span id="cb3-37">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_fill_viridis_c</span>()</span>
<span id="cb3-38">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_abline</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gray"</span>)</span>
<span id="cb3-39">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste0</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Correlation r = "</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(dt[, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cor</span>(D, H)], <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)))</span>
<span id="cb3-40">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>()</span>
<span id="cb3-41">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">legend.position =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"none"</span>)</span>
<span id="cb3-42">)</span></code></pre></div>
</details>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2025-07-27-segregation-scale-H-D/index_files/figure-html/unnamed-chunk-3-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>Lighter areas have higher density, and we can see a thin band of tables where a rough general relationship between H and D holds. There are, however, many scenarios where, for any given value of D, there is a wide range of H values. The plot also shows that, of all possible contingency tables, many are concentrated in the area where segregation is low. Lastly, we can again see that the H index is always lower than the D index.</p>
<p>As a new version of the table above, I now show the possible range of H values via the 5th, 50th, and 95th percentiles of the distribution of H for any given value of D:</p>
<div class="cell">
<details class="code-fold">
<summary>Show the code</summary>
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1">tab <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbindlist</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lapply</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>), <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(r) {</span>
<span id="cb4-2">    dt[<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">abs</span>(D <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> r) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.00001</span>, .(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">D =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(D), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">q5 =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">quantile</span>(H, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">q50 =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">median</span>(H), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">q95 =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">quantile</span>(H, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.95</span>))]</span>
<span id="cb4-3">}))</span>
<span id="cb4-4"></span>
<span id="cb4-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">kable</span>(tab, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">digits =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">col.names =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"D"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"5th percentile"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Median"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"95th percentile"</span>))</span></code></pre></div>
</details>
<div class="cell-output-display">
<table class="caption-top table table-sm table-striped small">
<thead>
<tr class="header">
<th style="text-align: right;">D</th>
<th style="text-align: right;">5th percentile</th>
<th style="text-align: right;">Median</th>
<th style="text-align: right;">95th percentile</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: right;">0.0</td>
<td style="text-align: right;">0.00</td>
<td style="text-align: right;">0.00</td>
<td style="text-align: right;">0.00</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.1</td>
<td style="text-align: right;">0.01</td>
<td style="text-align: right;">0.01</td>
<td style="text-align: right;">0.05</td>
</tr>
<tr class="odd">
<td style="text-align: right;">0.2</td>
<td style="text-align: right;">0.02</td>
<td style="text-align: right;">0.03</td>
<td style="text-align: right;">0.13</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.3</td>
<td style="text-align: right;">0.06</td>
<td style="text-align: right;">0.07</td>
<td style="text-align: right;">0.18</td>
</tr>
<tr class="odd">
<td style="text-align: right;">0.4</td>
<td style="text-align: right;">0.10</td>
<td style="text-align: right;">0.13</td>
<td style="text-align: right;">0.28</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.5</td>
<td style="text-align: right;">0.15</td>
<td style="text-align: right;">0.22</td>
<td style="text-align: right;">0.40</td>
</tr>
<tr class="odd">
<td style="text-align: right;">0.6</td>
<td style="text-align: right;">0.22</td>
<td style="text-align: right;">0.29</td>
<td style="text-align: right;">0.46</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.7</td>
<td style="text-align: right;">0.34</td>
<td style="text-align: right;">0.40</td>
<td style="text-align: right;">0.57</td>
</tr>
<tr class="odd">
<td style="text-align: right;">0.8</td>
<td style="text-align: right;">0.44</td>
<td style="text-align: right;">0.55</td>
<td style="text-align: right;">0.70</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.9</td>
<td style="text-align: right;">0.60</td>
<td style="text-align: right;">0.73</td>
<td style="text-align: right;">0.83</td>
</tr>
<tr class="odd">
<td style="text-align: right;">1.0</td>
<td style="text-align: right;">1.00</td>
<td style="text-align: right;">1.00</td>
<td style="text-align: right;">1.00</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>The median scenario closely matches our simplified example above, but there is quite a range of values: a D value of 0.8, for instance, can correspond to H values from 0.44 to 0.70.</p>
<p>Let’s have a closer look at this scenario. The next figure shows two <a href="https://osf.io/preprints/socarxiv/ruw4g_v1">segplots</a> – a visual display of the contingency table that is used to compute the segregation index. Here we have two examples, both with a 90%-10% distribution of the two racial groups; this overall distribution is shown to the right of each segplot. The D is identical in the two examples, but the H index is quite different: 0.44 on the left and 0.70 on the right – a difference of almost 60%! In the scenario on the left, one school draws about 36% of its students from the minority group, while the second school is completely segregated. In the scenario on the right, one very small school consists only of minority students, while the large school contains only a small share of minority students (~2%).</p>
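<p>Before turning to the code, we can verify the stated values by hand, plugging the counts of the two tables into the D and H formulas used earlier in this post (a sketch for checking; here <code>w</code> denotes the majority group and <code>b</code> the minority group):</p>
<pre class="sourceCode r"><code class="sourceCode r"># D and H for a two-school, two-group table, same formulas as above
logf &lt;- function(x) ifelse(x == 0, 0, log(x))
DH &lt;- function(w1, b1, w2, b2) {
    w &lt;- w1 + w2; b &lt;- b1 + b2; n &lt;- w + b
    s1 &lt;- w1 + b1; s2 &lt;- w2 + b2
    D &lt;- 0.5 * (abs(w1 / w - b1 / b) + abs(w2 / w - b2 / b))
    E  &lt;- -w / n * logf(w / n) - b / n * logf(b / n)
    E1 &lt;- -w1 / s1 * logf(w1 / s1) - b1 / s1 * logf(b1 / s1)
    E2 &lt;- -w2 / s2 * logf(w2 / s2) - b2 / s2 * logf(b2 / s2)
    H &lt;- 1 / E * (s1 / n * (E - E1) + s2 / n * (E - E2))
    round(c(D = D, H = H), 2)
}
DH(w1 = 72, b1 = 0, w2 = 18, b2 = 10)  # left example:  D = 0.8, H = 0.44
DH(w1 = 0, b1 = 8, w2 = 90, b2 = 2)    # right example: D = 0.8, H = 0.70
</code></pre>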
<div class="cell">
<details class="code-fold">
<summary>Show the code</summary>
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(segregation)</span>
<span id="cb5-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(patchwork)</span>
<span id="cb5-3"></span>
<span id="cb5-4">example1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix_to_long</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">72</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">nrow=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))</span>
<span id="cb5-5">example2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix_to_long</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">90</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">nrow=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))</span>
<span id="cb5-6">ent <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">entropy</span>(example1, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>)</span>
<span id="cb5-7"></span>
<span id="cb5-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_local</span>(example1, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">wide =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)[, .(unit, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ls =</span> ls <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> ent, p, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">contrib =</span> ls <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> ent <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> p)]</span>
<span id="cb5-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; Key: &lt;unit&gt;</span></span>
<span id="cb5-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      unit        ls     p   contrib</span></span>
<span id="cb5-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt;     &lt;num&gt; &lt;num&gt;     &lt;num&gt;</span></span>
<span id="cb5-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      1 0.3241035  0.72 0.2333545</span></span>
<span id="cb5-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      2 0.7331267  0.28 0.2052755</span></span>
<span id="cb5-14"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_local</span>(example2, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">wide =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)[, .(unit, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ls =</span> ls <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> ent, p, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">contrib =</span> ls <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> ent <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> p)]</span>
<span id="cb5-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; Key: &lt;unit&gt;</span></span>
<span id="cb5-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      unit        ls     p   contrib</span></span>
<span id="cb5-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt;     &lt;num&gt; &lt;num&gt;     &lt;num&gt;</span></span>
<span id="cb5-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      1 7.0830689  0.08 0.5666455</span></span>
<span id="cb5-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      2 0.1488661  0.92 0.1369568</span></span>
<span id="cb5-20"></span>
<span id="cb5-21">(</span>
<span id="cb5-22">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">segplot</span>(example1, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">bar_space =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>)</span>
<span id="cb5-23">  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"D = 0.8, H = 0.44"</span>)</span>
<span id="cb5-24">  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">segplot</span>(example2, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">bar_space =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>)</span>
<span id="cb5-25">  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"D = 0.8, H = 0.70"</span>)</span>
<span id="cb5-26">  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">legend.position =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"none"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.title.x =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_text</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>))</span>
<span id="cb5-27">)</span></code></pre></div>
</details>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2025-07-27-segregation-scale-H-D/index_files/figure-html/unnamed-chunk-5-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>The key to understanding why the H index sees the second example as more segregated is to think about how <em>surprised</em> one is to find any of these schools. With a 90%-10% split, how surprising is it to find a school that is distributed 64%-36%? How surprising is it to find a school that is 100%-0%? This is the scenario on the left, and the H index quantifies the amount of surprise for the first school as 0.73 and for the second school as 0.32. (These are <a href="https://osf.io/preprints/socarxiv/3juyc_v1">adjusted local segregation scores</a> – local segregation scores divided by the entropy of the overall racial distribution.) Weighting these scores by each school’s share of the total population, we arrive at an H value of 0.44.</p>
<p>For the second scenario we ask: With a 90%-10% split, how surprising is it to find a school that is distributed 0%-100%? How surprising is it to find a school that is 98%-2%? Here, the H index quantifies the amounts of surprise as 7.1 (!) and 0.15, and we arrive at a total H index of 0.70. This intuitively reflects the fact that in a city where the minority group makes up only 10% of the overall student population, it is extremely surprising to find a school that is minority-only. To the D index, the two scenarios are identical, but there is a good argument to be made that the second scenario is, in fact, <em>more</em> segregated.</p>
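<p>As a quick sanity check (a sketch, not part of the code above), these surprise scores can be computed directly: each is the Kullback-Leibler divergence of a school’s racial composition from the city-wide composition, divided by the city-wide entropy:</p>
<pre class="sourceCode r"><code class="sourceCode r">p &lt;- c(0.1, 0.9)                      # city-wide minority/majority shares
E &lt;- sum(-p * log(p))                 # city-wide entropy, ~0.33
# KL divergence of a school's composition q from the city-wide shares p
kl &lt;- function(q) sum(ifelse(q == 0, 0, q * log(q / p)))
kl(c(1, 0)) / E        # all-minority school: ~7.08
kl(c(2, 90) / 92) / E  # 98%-2% school:       ~0.15
</code></pre>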
</section>
<section id="conclusion" class="level3">
<h3 class="anchored" data-anchor-id="conclusion">Conclusion</h3>
<p>This last example has shown that there is no unique mapping between H and D values – and, in fact, if such a mapping existed, there would be no need for a second index in the first place! The example has also shown that the H index arrives at a slightly different, but quite intuitive, conclusion compared to the D index. The H index is therefore its own index, with its own properties and its own scale.</p>
<p>Nonetheless, situations that many would regard as highly segregated yield relatively low absolute values for the H index. When interpreting H index values, it is therefore important to treat even small deviations from 0 as indicating moderate segregation. Some might consider this a downside of the H, but in exchange we gain many desirable properties, such as decomposability, local segregation scores, multigroup indices, and the avoidance of many problems that the D index has (see Winship 1977 for some of these). Instead of relying purely on the index value, it is also a good idea to visualize the data, for instance by using a <a href="https://osf.io/preprints/socarxiv/ruw4g_v1">segplot</a>.</p>
<p>Lastly, if we think about the segregation process from a statistical standpoint, any small deviations that might just be due to noise lead to an increase in the segregation score. The H index is much less susceptible to this than the D index, which is also a desirable property. More details on this aspect are found in an <a href="../../posts/2021-11-24-segregation-bias/index.html">earlier post of mine on the bias of segregation indices</a>.</p>
</section>
<section id="references" class="level3">
<h3 class="anchored" data-anchor-id="references">References</h3>
<p>Massey, Douglas S., and Nancy A. Denton. 1993. American Apartheid. Harvard University Press.</p>
<p>Winship, Christopher. 1977. A Revaluation of Indexes of Residential Segregation. Social Forces 55(4): 1058–1066.</p>


</section>

 ]]></description>
  <category>segregation</category>
  <guid>https://elbersb.com/public/posts/2025-07-27-segregation-scale-H-D/</guid>
  <pubDate>Sat, 26 Jul 2025 22:00:00 GMT</pubDate>
</item>
<item>
  <title>Logistic regression with categorical predictors</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2025-01-19-logistic-regression-categorical/</link>
  <description><![CDATA[ 




<p>I’ve written a little bit on linear regression on this blog before, for instance on the <a href="../../posts/2020-01-08-correlation-model/index.html">correlation model</a>. Mathematically, linear regression and the OLS estimator are nice to work with precisely because of the linearity. Once we move to more complicated models, such as logistic regression, the estimator no longer has a closed-form solution, which makes it harder to see what’s going on under the hood.</p>
<p>In the case of logistic regression, <a href="https://en.wikipedia.org/wiki/Logistic_regression#Maximum_likelihood_estimation_(MLE)">a maximum likelihood estimator</a> is usually used, which has no closed-form solution. One exception to this rule is in the case of categorical predictors. In this case, there is a simple (and indeed somewhat trivial) closed-form solution to the estimation of the model.</p>
<section id="setup" class="level2">
<h2 class="anchored" data-anchor-id="setup">Setup</h2>
<p>In the simplest possible setup, we have a binary predictor <img src="https://latex.codecogs.com/png.latex?X"> and a binary outcome <img src="https://latex.codecogs.com/png.latex?Y">. For instance, similar to the example on Wikipedia, <img src="https://latex.codecogs.com/png.latex?X"> could be whether a student has studied for an exam, and <img src="https://latex.codecogs.com/png.latex?Y"> could be whether the student has passed. Let’s say 80% of the students who studied passed the exam, and 40% of the students who didn’t study passed. This means that, effectively, we are fitting a logistic curve to the two points <img src="https://latex.codecogs.com/png.latex?(0,%200.4)"> and <img src="https://latex.codecogs.com/png.latex?(1,%200.8)">, where the x coordinate specifies whether a student has studied, and the y coordinate specifies the probability of passing the exam. Graphically, this looks like this:</p>
<div class="cell">
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2025-01-19-logistic-regression-categorical/index_files/figure-html/unnamed-chunk-1-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
</section>
<section id="curve-fitting" class="level2">
<h2 class="anchored" data-anchor-id="curve-fitting">Curve-fitting</h2>
<p>With this setup, this is an exercise in fitting a logistic function to the data – and because we have only two points, the fit is perfect. The curve we’re fitting is this one:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ap(x)%20=%20%5Ctext%7Blogit%7D%5E%7B-1%7D(%5Cbeta_0+%5Cbeta_1%20x)%20=%20%5Cfrac%7B1%7D%7B1+%5Ctext%7Bexp%7D(-(%5Cbeta_0+%5Cbeta_1%20x))%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?p(x)"> is the probability of passing, and <img src="https://latex.codecogs.com/png.latex?x"> is 1 if a student has studied, and 0 otherwise. The function <img src="https://latex.codecogs.com/png.latex?%5Ctext%7Blogit%7D%5E%7B-1%7D(x)%20=%20%5Cfrac%7B1%7D%7B1+e%5E%7B-x%7D%7D"> is the inverse logit, also known as the <a href="https://en.wikipedia.org/wiki/Logit">logistic function</a>. The logit function is the inverse of this function, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?%5Ctext%7Blogit%7D(x)%20=%20%5Ctext%7Blog%7D%5Cfrac%7Bx%7D%7B(1-x)%7D">, which is simply the logarithm of the odds.</p>
<p>Let’s use <img src="https://latex.codecogs.com/png.latex?a"> for the probability of passing if a student didn’t study, and <img src="https://latex.codecogs.com/png.latex?b"> for the probability of passing if a student did study. Then we need to solve this system of equations for <img src="https://latex.codecogs.com/png.latex?%5Cbeta_0"> and <img src="https://latex.codecogs.com/png.latex?%5Cbeta_1">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%20%20%20%20a%20&amp;=%5Ctext%7Blogit%7D%5E%7B-1%7D(%5Cbeta_0)%20%5C%5C%0A%20%20%20%20b%20&amp;=%5Ctext%7Blogit%7D%5E%7B-1%7D(%5Cbeta_0+%5Cbeta_1)%0A%5Cend%7Balign%7D%0A"></p>
<p>Trivially, by just applying the logit transformation, we obtain:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%20%20%20%20%5Cbeta_0%20&amp;=%5Ctext%7Blogit%7D(a)%20%5C%5C%0A%20%20%20%20%5Cbeta_1%20&amp;=%5Ctext%7Blogit%7D(b)%20-%20%5Cbeta_0%20=%20%5Ctext%7Blogit%7D(b)%20-%20%5Ctext%7Blogit%7D(a)%0A%5Cend%7Balign%7D%0A"></p>
<p>This is hopefully intuitive. The results are equivalent to those that would be obtained by OLS in a linear model (where we would get <img src="https://latex.codecogs.com/png.latex?%5Cbeta_0=a"> and <img src="https://latex.codecogs.com/png.latex?%5Cbeta_1=b-a">), but on the logit scale instead of on the probability scale.</p>
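<p>The closed-form solution is easy to verify numerically. Here is a minimal Python sketch of the same two equations (the code in this post uses R; the function names here are mine):</p>

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

a, b = 0.4, 0.8                  # pass rates without and with studying
beta0 = logit(a)                 # intercept: logit of the reference rate
beta1 = logit(b) - logit(a)      # slope: a difference of logits (log odds ratio)

# Round trip: the fitted curve passes exactly through both observed points.
assert abs(inv_logit(beta0) - a) < 1e-12
assert abs(inv_logit(beta0 + beta1) - b) < 1e-12
```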
</section>
<section id="example" class="level2">
<h2 class="anchored" data-anchor-id="example">Example</h2>
<p>For the example above, we therefore obtain</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%20%20%20%20%5Cbeta_0%20&amp;=%20%5Ctext%7Blog%7D%5Cfrac%7B0.4%7D%7B(1-0.4)%7D%20=%5Ctext%7Blog%7D%5Cfrac%7B2%7D%7B3%7D%20%5Capprox%20-0.405%20%5C%5C%0A%20%20%20%20%5Cbeta_1%20&amp;=%20%5Ctext%7Blog%7D%5Cfrac%7B0.8%7D%7B(1-0.8)%7D%20-%20%5Cbeta_0%20=%20%5Ctext%7Blog%7D(6)%20%5Capprox%201.792%0A%5Cend%7Balign%7D.%0A"></p>
<p>Let’s check this against what we get from R using the <code>glm</code> function:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># create data for 10 students</span></span>
<span id="cb1-2">exam <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb1-3">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>),</span>
<span id="cb1-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb1-5">)</span>
<span id="cb1-6"></span>
<span id="cb1-7">model <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glm</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> exam, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">family =</span> binomial)</span>
<span id="cb1-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(model)</span>
<span id="cb1-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; (Intercept)           x </span></span>
<span id="cb1-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;  -0.4054651   1.7917595</span></span></code></pre></div>
</div>
<p>The MLE gives the same answer.</p>
<p>Another way to obtain these coefficients is to use an OLS estimator with aggregate data, where we use the logit transform:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># define a logit function to use in linear model</span></span>
<span id="cb2-2">logit <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>x))</span>
<span id="cb2-3">(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logit</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>))</span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] -0.4054651</span></span>
<span id="cb2-5">(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logit</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logit</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>))</span>
<span id="cb2-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 1.791759</span></span>
<span id="cb2-7"></span>
<span id="cb2-8">exam_aggregate <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>))</span>
<span id="cb2-9">model <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logit</span>(y) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> exam_aggregate, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weights =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>))</span>
<span id="cb2-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(model)</span>
<span id="cb2-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; (Intercept)           x </span></span>
<span id="cb2-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;  -0.4054651   1.7917595</span></span></code></pre></div>
</div>
</section>
<section id="extending-to-a-categorical-predictor" class="level2">
<h2 class="anchored" data-anchor-id="extending-to-a-categorical-predictor">Extending to a categorical predictor</h2>
<p>Until now, <img src="https://latex.codecogs.com/png.latex?X"> was assumed to be binary. What happens if <img src="https://latex.codecogs.com/png.latex?X"> is categorical? It turns out that this case is not very different: with a single categorical predictor, we are effectively fitting <em>several</em> separate logistic curves. As an example, assume we have three academic departments with different admission rates. Department 0 admits 40% of students who apply, department 1 admits 80% of students, and department 2 admits 60% of students.</p>
<p>To use a categorical predictor in a regression model, we “dummy code” the department information with department 0 as the <em>reference category</em>. Define two binary variables <img src="https://latex.codecogs.com/png.latex?X_1"> and <img src="https://latex.codecogs.com/png.latex?X_2">:</p>
<ul>
<li>Department 0 is coded as <img src="https://latex.codecogs.com/png.latex?X_1=0"> and <img src="https://latex.codecogs.com/png.latex?X_2=0"></li>
<li>Department 1 is coded as <img src="https://latex.codecogs.com/png.latex?X_1=1"> and <img src="https://latex.codecogs.com/png.latex?X_2=0"></li>
<li>Department 2 is coded as <img src="https://latex.codecogs.com/png.latex?X_1=0"> and <img src="https://latex.codecogs.com/png.latex?X_2=1"></li>
</ul>
<p>We then define the logistic regression model as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ap(x)%20=%20%5Ctext%7Blogit%7D%5E%7B-1%7D(%5Cbeta_0+%5Cbeta_1%20x_1+%5Cbeta_2%20x_2)%0A"></p>
<p>Because the predictors are binary, we are now effectively fitting two separate curves, as in this figure:</p>
<div class="cell">
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2025-01-19-logistic-regression-categorical/index_files/figure-html/unnamed-chunk-4-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>Both logistic curves run through the point <img src="https://latex.codecogs.com/png.latex?(0,%200.4)"> because we picked department 0 as the reference category. The difference in slopes between these two curves is the reason that interaction terms in logistic regression models are much trickier to interpret than in linear models.</p>
<p>Again, use <img src="https://latex.codecogs.com/png.latex?a">, <img src="https://latex.codecogs.com/png.latex?b">, and <img src="https://latex.codecogs.com/png.latex?c"> for the probability of admittance for departments 0, 1, and 2, respectively. The system of equations to solve now becomes</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%20%20%20%20a%20&amp;=%5Ctext%7Blogit%7D%5E%7B-1%7D(%5Cbeta_0)%20%5C%5C%0A%20%20%20%20b%20&amp;=%5Ctext%7Blogit%7D%5E%7B-1%7D(%5Cbeta_0+%5Cbeta_1)%20%5C%5C%0A%20%20%20%20c%20&amp;=%5Ctext%7Blogit%7D%5E%7B-1%7D(%5Cbeta_0+%5Cbeta_2)%0A%5Cend%7Balign%7D%0A"></p>
<p>with solutions</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%20%20%20%20%5Cbeta_0%20&amp;=%5Ctext%7Blogit%7D(a)%20%5C%5C%0A%20%20%20%20%5Cbeta_1%20&amp;=%5Ctext%7Blogit%7D(b)%20-%20%5Ctext%7Blogit%7D(a)%20%5C%5C%0A%20%20%20%20%5Cbeta_2%20&amp;=%5Ctext%7Blogit%7D(c)%20-%20%5Ctext%7Blogit%7D(a).%0A%5Cend%7Balign%7D%0A"></p>
<p>For the example, the resulting coefficients are</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1">(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logit</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>))</span>
<span id="cb3-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] -0.4054651</span></span>
<span id="cb3-3">(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logit</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logit</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>))</span>
<span id="cb3-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 1.791759</span></span>
<span id="cb3-5">(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logit</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logit</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>))</span>
<span id="cb3-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 0.8109302</span></span></code></pre></div>
</div>
<p>And here’s the same using the <code>glm</code> function:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># create data for 15 applicants (5 for each department)</span></span>
<span id="cb4-2">depts <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb4-3">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),</span>
<span id="cb4-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb4-5">)</span>
<span id="cb4-6"></span>
<span id="cb4-7">model <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glm</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.factor</span>(x), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> depts, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">family =</span> binomial)</span>
<span id="cb4-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(model)</span>
<span id="cb4-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;   (Intercept) as.factor(x)1 as.factor(x)2 </span></span>
<span id="cb4-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    -0.4054651     1.7917595     0.8109302</span></span></code></pre></div>
</div>


</section>

 ]]></description>
  <category>regression</category>
  <category>logistic-regression</category>
  <guid>https://elbersb.com/public/posts/2025-01-19-logistic-regression-categorical/</guid>
  <pubDate>Sat, 18 Jan 2025 23:00:00 GMT</pubDate>
</item>
<item>
  <title>A simple form of the IV standard error</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2023-10-07-iv-standard-error/</link>
  <description><![CDATA[ 




<p>Recently, a blog post of mine on encouragement designs was published on the <a href="https://engineering.atspotify.com/2023/08/encouragement-designs-and-instrumental-variables-for-a-b-testing/">Spotify Engineering blog</a>. In this post, I want to follow up on the formula for the variance of the IV estimator that is shown in that post, which is, with a slight change in notation,</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5B%5Chat%7B%5Cbeta%7D_%5Ctext%7BIV%7D%5D%20=%20%5Cfrac%7B1%7D%7Bn%7D%20%5Cfrac%7B%20%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-%5Chat%7B%5Cbeta%7D_%5Ctext%7BIV%7D%20X%5D%20%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5B%5Chat%7BE%7D%5BX%20%5Cmid%20Z%5D%5D%7D."></p>
<p>where <img src="https://latex.codecogs.com/png.latex?Y">, <img src="https://latex.codecogs.com/png.latex?X">, and <img src="https://latex.codecogs.com/png.latex?Z"> are random variables for the outcome, treatment, and instrument, respectively. <em>Note that this formula is only correct if the instrument <img src="https://latex.codecogs.com/png.latex?Z"> is binary.</em><sup>1</sup></p>
<p>The aim of this post is to show how to derive this version of the formula from the more general IV estimator (under the assumption that there is a single binary instrument <img src="https://latex.codecogs.com/png.latex?Z">, and a single predictor <img src="https://latex.codecogs.com/png.latex?X">), and how it compares to the OLS estimator. This version of the formula works well to illustrate why IV estimators have lower power than the equivalent OLS model.</p>
<p>To illustrate the derivations with some code, we’ll use a classic example from the econometrics literature – the returns to schooling, i.e.&nbsp;the effect of education on wages. Here is a bit of R code to set up the examples:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(data.table)</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(fixest)</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SchoolingReturns"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">package =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ivreg"</span>)</span>
<span id="cb1-4">returns <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.data.table</span>(SchoolingReturns)</span>
<span id="cb1-5">returns <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> returns[, .(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(wage), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> education, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">z =</span> nearcollege <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"yes"</span>)]</span>
<span id="cb1-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">head</span>(returns)</span>
<span id="cb1-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;           y     x      z</span></span>
<span id="cb1-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;       &lt;num&gt; &lt;num&gt; &lt;lgcl&gt;</span></span>
<span id="cb1-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1: 6.306275     7  FALSE</span></span>
<span id="cb1-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2: 6.175867    12  FALSE</span></span>
<span id="cb1-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 3: 6.580639    12  FALSE</span></span>
<span id="cb1-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 4: 5.521461    11   TRUE</span></span>
<span id="cb1-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 5: 6.591674    12   TRUE</span></span>
<span id="cb1-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 6: 6.214608    12   TRUE</span></span></code></pre></div>
</div>
<p>The data set contains the wage in dollars (which I log-transform here) as the outcome <img src="https://latex.codecogs.com/png.latex?Y">, the years of education as the treatment <img src="https://latex.codecogs.com/png.latex?X">, and whether the individual grew up near a college, which will be used as the instrument <img src="https://latex.codecogs.com/png.latex?Z">.</p>
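<p>With a binary instrument, the IV estimate reduces to the “Wald estimator”: the difference in mean outcome across the two instrument groups, divided by the difference in mean treatment. As a sketch of both the estimator and the variance formula above, here is a stdlib-only Python snippet on made-up toy numbers (these are illustrative, not the actual estimate for this data set):</p>

```python
# Hypothetical toy data: binary instrument z, treatment x, outcome y.
z = [0, 0, 0, 1, 1, 1, 1]
x = [7, 12, 12, 11, 12, 16, 14]
y = [6.31, 6.18, 6.58, 5.52, 6.59, 6.21, 6.90]

def mean(v):
    return sum(v) / len(v)

def var(v):  # 1/n variance, matching the notation in the post
    m = mean(v)
    return sum((vi - m) ** 2 for vi in v) / len(v)

def split(v, z):  # values where z == 1, values where z == 0
    return [vi for vi, zi in zip(v, z) if zi], [vi for vi, zi in zip(v, z) if not zi]

y1, y0 = (mean(g) for g in split(y, z))
x1, x0 = (mean(g) for g in split(x, z))
beta_iv = (y1 - y0) / (x1 - x0)     # Wald estimator

# Variance formula from above: residual variance of y over the variance
# of the fitted first stage E^[X | Z], scaled by 1/n.
n = len(y)
x_hat = [x1 if zi else x0 for zi in z]
var_beta = (1 / n) * var([yi - beta_iv * xi for yi, xi in zip(y, x)]) / var(x_hat)
se = var_beta ** 0.5
```

<p>Because <code>x_hat</code> only takes two values, its variance is small, which already hints at why the IV standard error tends to be large.</p>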
<section id="ols-standard-error" class="level2">
<h2 class="anchored" data-anchor-id="ols-standard-error">OLS standard error</h2>
<p>To show the similarities and differences between the IV and the OLS standard error, let’s first take a look at the standard error of a simple linear model. Consider a standard linear model of the form <img src="https://latex.codecogs.com/png.latex?y_%7Bi%7D=%5Calpha+%5Cbeta%20x_%7Bi%7D+u_%7Bi%7D">, where we apply all the usual regression assumptions. We’re interested in an estimate of <img src="https://latex.codecogs.com/png.latex?%5Cbeta">, and its standard error, <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7B%5Ctext%7BVar%7D%5B%5Chat%7B%5Cbeta%7D%5D%7D">. If we estimate this model using OLS, and call the estimate <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%5Ctext%7BOLS%7D">, we obtain</p>
<p><img src="https://latex.codecogs.com/png.latex?%20%5Cbegin%7Balign%7D%0A%5Ctext%7BVar%7D%5B%5Chat%7B%5Cbeta%7D_%7B%5Ctext%7BOLS%7D%7D%5D%0A%20%20%20%20&amp;=%5Cfrac%7B1%7D%7Bn%7D%5Cfrac%7B%5Ctext%7B(residual%20variance%20of%20%7Dy)%7D%7B%5Ctext%7B(variance%20of%20%7Dx)%7D%0A%20%20%20%20%5C%5C&amp;=%5Cfrac%7B1%7D%7Bn%7D%5Cfrac%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-%5Chat%7B%5Cbeta%7D_%5Ctext%7BOLS%7D%20X%5D%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BX%5D%7D.%0A%5Cend%7Balign%7D%0A"></p>
<p>(We use a factor of <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7Bn%7D"> here for simplicity – use <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7Bn-2%7D"> to obtain an unbiased estimate.)</p>
<p>The numerator might look a bit non-standard, so here’s a quick derivation. If we define the predicted values as <img src="https://latex.codecogs.com/png.latex?%5Chat%7BY%7D%20=%20%5Chat%7B%5Calpha%7D+%5Chat%7B%5Cbeta%7D_%5Ctext%7BOLS%7D%20X">, then the numerator can be written as</p>
<p><img src="https://latex.codecogs.com/png.latex?%20%5Cbegin%7Balign%7D%0A%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-%5Chat%7BY%7D%5D%0A%20%20%20%20&amp;=%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-(%5Chat%7B%5Calpha%7D+%5Chat%7B%5Cbeta%7D_%5Ctext%7BOLS%7D%20X)%5D%0A%20%20%20%20%5C%5C&amp;=%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-(%5Chat%7BE%7D%5BY%5D-%5Chat%7B%5Cbeta%7D_%5Ctext%7BOLS%7D%20%5Chat%7BE%7D%5BX%5D%20+%20%5Chat%7B%5Cbeta%7D_%5Ctext%7BOLS%7D%20X)%5D%0A%20%20%20%20%5C%5C&amp;=%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY%20-%20%5Chat%7B%5Cbeta%7D_%5Ctext%7BOLS%7D%20X%5D,%0A%5Cend%7Balign%7D%0A"></p>
<p>where the last term simplifies because constant terms drop out of the variance.</p>
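<p>As a quick numerical check of this identity, here is a standalone sketch with simulated, made-up data (it uses NumPy rather than the R used elsewhere in this post):</p>

```python
import numpy as np

# simulate arbitrary data (the parameters here are made up)
rng = np.random.default_rng(4)
x = rng.normal(10, 3, size=1000)
y = 2 + 0.3 * x + rng.normal(size=1000)

# OLS fit; np.polyfit returns the slope first, then the intercept
beta, alpha = np.polyfit(x, y, 1)
yhat = alpha + beta * x

# constants drop out of the variance, so both residual variances agree
assert np.isclose(np.var(y - yhat), np.var(y - beta * x))
```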
<p>This R code demonstrates the result:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">model_ols <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> returns)</span>
<span id="cb2-2">beta_ols <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(model_ols)[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]]</span>
<span id="cb2-3"></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># standard error calculated by lm</span></span>
<span id="cb2-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vcov</span>(model_ols)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span>
<span id="cb2-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 0.002869708</span></span>
<span id="cb2-7"></span>
<span id="cb2-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># compute standard error manually</span></span>
<span id="cb2-9">adj <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(returns) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># use the dof that lm uses</span></span>
<span id="cb2-10">returns[, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(adj <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">var</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> beta_ols <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> x) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">var</span>(x))]</span>
<span id="cb2-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 0.002869708</span></span></code></pre></div>
</div>
</section>
<section id="deriving-the-iv-standard-error" class="level2">
<h2 class="anchored" data-anchor-id="deriving-the-iv-standard-error">Deriving the IV standard error</h2>
<p>If we consult any standard textbook on econometrics, we’ll find that the general formula for the standard error of the IV estimator is</p>
<p><img src="https://latex.codecogs.com/png.latex?%20%5Chat%7B%5Csigma%7D_%7B%5Ctext%7BIV%7D%7D%5E%7B2%7D(%5Chat%7B%5Cmathbf%7BX%7D%7D'%5Chat%7B%5Cmathbf%7BX%7D%7D)%5E%7B-1%7D.%20"></p>
<p>The logic of the IV estimator is that we use only the variation of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D"> that is due to <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BZ%7D"> to estimate the effect on <img src="https://latex.codecogs.com/png.latex?Y">, and this formula reflects this logic. To see this more clearly, assume that we only have one endogenous variable <img src="https://latex.codecogs.com/png.latex?X"> and one instrumental variable <img src="https://latex.codecogs.com/png.latex?Z">. The first term, <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Csigma%7D_%7B%5Ctext%7BIV%7D%7D%5E%7B2%7D">, then becomes <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-%5Chat%7B%5Cbeta%7D_%5Ctext%7BIV%7D%20X%5D">. Note that compared to the OLS estimator, we use <img src="https://latex.codecogs.com/png.latex?%7B%5Cbeta%7D_%5Ctext%7BIV%7D"> instead of <img src="https://latex.codecogs.com/png.latex?%7B%5Cbeta%7D_%5Ctext%7BOLS%7D"> – again, this is because we use only the variation of <img src="https://latex.codecogs.com/png.latex?X"> that is due to <img src="https://latex.codecogs.com/png.latex?Z">.</p>
<p>The second term is based on the matrix <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cmathbf%7BX%7D%7D">, which contains the predicted values of the regression of <img src="https://latex.codecogs.com/png.latex?X"> on <img src="https://latex.codecogs.com/png.latex?Z">. In matrix algebra, this is <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cmathbf%7BX%7D%7D=%5Cmathbf%7BZ%7D(%5Cmathbf%7BZ%7D'%5Cmathbf%7BZ%7D)%5E%7B-1%7D%5Cmathbf%7BZ%7D'%5Cmathbf%7BX%7D">, but if we assume that we have only one endogenous variable and one instrumental variable, <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cmathbf%7BX%7D%7D"> has a simpler form. Let’s define <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Calpha%7D_%7B%5Ctext%7BFS%7D%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7B%5Ctext%7BFS%7D%7D"> as the intercept and slope estimates of the regression of <img src="https://latex.codecogs.com/png.latex?X"> on <img src="https://latex.codecogs.com/png.latex?Z"> (FS = first stage). We can then define the random variable <img src="https://latex.codecogs.com/png.latex?%5Chat%7BX%7D"> that contains the predicted values of this regression:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7BX%7D=%5Chat%7B%5Calpha%7D_%7B%5Ctext%7BFS%7D%7D%20+%20%5Chat%7B%5Cbeta%7D_%7B%5Ctext%7BFS%7D%7DZ=%5Chat%7BE%7D%5BX%5D+%5Cfrac%7B%5Cwidehat%7B%5Ctext%7BCov%7D%7D(X,Z)%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D(Z)%7D(Z-%5Chat%7BE%7D%5BZ%5D),"></p>
<p>where the second equality follows from simple regression. The corresponding matrix is then <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cmathbf%7BX%7D%7D=%5Cbegin%7Bbmatrix%7D%5Cmathbf%7B1%7D%20&amp;%20%5Chat%7BX%7D%5Cend%7Bbmatrix%7D"> of size <img src="https://latex.codecogs.com/png.latex?n%5Ctimes%202">. With a bit of matrix algebra, we can now carry out the multiplication and find the inverse:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%20%20%20%20(%5Chat%7B%5Cmathbf%7BX%7D%7D%5E%7BT%7D%5Chat%7B%5Cmathbf%7BX%7D%7D)%5E%7B-1%7D&amp;=%5Cbegin%7Bbmatrix%7Dn%20&amp;%20%5Cmathbf%7B1%7D%5E%7BT%7D%5Chat%7BX%7D%5C%5C%0A%20%20%20%20%5Cmathbf%7B1%7D%5E%7BT%7D%5Chat%7BX%7D%20&amp;%20%5Chat%7BX%7D%5E%7BT%7D%5Chat%7BX%7D%0A%20%20%20%20%5Cend%7Bbmatrix%7D%5E%7B-1%7D%5C%5C&amp;=%5Cfrac%7B1%7D%7Bn%7D%5Cbegin%7Bbmatrix%7D1%20&amp;%20%5Chat%7BE%7D%5BX%5D%5C%5C%0A%20%20%20%20%5Chat%7BE%7D%5BX%5D%20&amp;%20%5Chat%7BE%7D%5E%7B2%7D%5BX%5D+%5Cfrac%7B%5Cwidehat%7B%5Ctext%7BCov%7D%7D%5E%7B2%7D(X,Z)%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D(Z)%7D%0A%20%20%20%20%5Cend%7Bbmatrix%7D%5E%7B-1%7D%5C%5C&amp;=%5Cfrac%7B1%7D%7Bn%5E%7B2%7D%5Cfrac%7B%5Cwidehat%7B%5Ctext%7BCov%7D%7D%5E%7B2%7D(X,Z)%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D(Z)%7D%7D%5Cbegin%7Bbmatrix%7D%5Chat%7BE%7D%5E%7B2%7D%5BX%5D+%5Cfrac%7B%5Cwidehat%7B%5Ctext%7BCov%7D%7D%5E%7B2%7D(X,Z)%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D(Z)%7D%20&amp;%20-%5Chat%7BE%7D%5BX%5D%5C%5C%0A%20%20%20%20-%5Chat%7BE%7D%5BX%5D%20&amp;%20n%0A%20%20%20%20%5Cend%7Bbmatrix%7D%0A%5Cend%7Balign%7D%0A"></p>
<p>The relevant entry here is in the lower right-hand corner of the matrix, so we have as a preliminary formula</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5B%5Chat%7B%5Cbeta%7D_%5Ctext%7BIV%7D%5D%20=%0A%20%20%20%20%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-%5Chat%7B%5Cbeta%7D_%5Ctext%7BIV%7D%20X%5D%0A%20%20%20%20%5Cfrac%7B1%7D%7Bn%7D%20%5Cfrac%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D(Z)%7D%7B%5Cwidehat%7B%5Ctext%7BCov%7D%7D%5E%7B2%7D(X,Z)%7D.%0A"></p>
<p>Until this point, we have only assumed that there is one instrument, but not that this instrument is binary. We’ll now make this assumption to simplify the formula a bit further. If <img src="https://latex.codecogs.com/png.latex?Z"> is binary, we have <img src="https://latex.codecogs.com/png.latex?%5Chat%7BE%7D%5BZ%5D=P(Z=1)"> as the proportion of cases where the instrument is 1, and <img src="https://latex.codecogs.com/png.latex?1-%5Chat%7BE%7D%5BZ%5D=P(Z=0)"> as the proportion of cases where the instrument is 0. Because <img src="https://latex.codecogs.com/png.latex?Z"> is a Bernoulli random variable, we can then immediately state that</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7B%5Ctext%7BVar%7D%7D(Z)=%5Chat%7BE%7D%5BZ%5D(1-%5Chat%7BE%7D%5BZ%5D)."></p>
<p>For the next derivations, we’ll make use of the fact that we can rewrite expectations as group weighted averages. For instance,</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7BE%7D%5BX%5D=(1-%5Chat%7BE%7D%5BZ%5D)%5Chat%7BE%7D%5BX%5Cmid%20Z=0%5D+%5Chat%7BE%7D%5BZ%5D%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D."></p>
<p>We’ll use this strategy to ‘simplify’ the covariance term:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%5Cwidehat%7B%5Ctext%7BCov%7D%7D(X,Z)&amp;=%5Chat%7BE%7D%5BZ(X-%5Chat%7BE%7D%5BX%5D)%5D%5C%5C&amp;=(1-%5Chat%7BE%7D%5BZ%5D)%5Chat%7BE%7D%5BZ(X-%5Chat%7BE%7D%5BX%5D)%5Cmid%20Z=0%5D+%5Chat%7BE%7D%5BZ%5D%5Chat%7BE%7D%5BZ(X-%5Chat%7BE%7D%5BX%5D)%5Cmid%20Z=1%5D%5C%5C&amp;=%5Chat%7BE%7D%5BZ%5D%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D-%5Chat%7BE%7D%5BZ%5D%5Chat%7BE%7D%5B%5Chat%7BE%7D%5BX%5D%5Cmid%20Z=1%5D%5C%5C&amp;=%5Chat%7BE%7D%5BZ%5D%5Cleft(%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D-%5Chat%7BE%7D%5BX%5D%5Cright)%5C%5C&amp;=%5Chat%7BE%7D%5BZ%5D%5Cleft(%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D-(1-%5Chat%7BE%7D%5BZ%5D)E%5BX%5Cmid%20Z=0%5D-%5Chat%7BE%7D%5BZ%5DE%5BX%5Cmid%20Z=1%5D)%5Cright)%5C%5C&amp;=%5Chat%7BE%7D%5BZ%5D%5Cleft(E%5BX%5Cmid%20Z=1%5D(1-%5Chat%7BE%7D%5BZ%5D)-(1-%5Chat%7BE%7D%5BZ%5D)E%5BX%5Cmid%20Z=0%5D%5Cright)%5C%5C&amp;=%5Chat%7BE%7D%5BZ%5D(1-%5Chat%7BE%7D%5BZ%5D)%5Cleft(E%5BX%5Cmid%20Z=1%5D-E%5BX%5Cmid%20Z=0%5D%5Cright)%5C%5C&amp;=%5Cwidehat%7B%5Ctext%7BVar%7D%7D(Z)%5Cleft(%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D-%5Chat%7BE%7D%5BX%5Cmid%20Z=0%5D%5Cright)%0A%5Cend%7Balign%7D%0A"></p>
<p>The first equality is just the definition of the covariance. We then rewrite the expectation as a weighted average. Because the covariance involves <img src="https://latex.codecogs.com/png.latex?Z">, the term where <img src="https://latex.codecogs.com/png.latex?Z=0"> drops out. After simplifying, we replace <img src="https://latex.codecogs.com/png.latex?%5Chat%7BE%7D%5BX%5D"> with its alternative form as a weighted average. The final version then says that the covariance of a random variable <img src="https://latex.codecogs.com/png.latex?X"> and a Bernoulli random variable <img src="https://latex.codecogs.com/png.latex?Z"> is equal to the difference in means between <img src="https://latex.codecogs.com/png.latex?X"> when <img src="https://latex.codecogs.com/png.latex?Z=1"> and <img src="https://latex.codecogs.com/png.latex?X"> when <img src="https://latex.codecogs.com/png.latex?Z=0">, times the variance of <img src="https://latex.codecogs.com/png.latex?Z">.</p>
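<p>This identity is easy to verify numerically. Here is a standalone NumPy sketch (the data and parameters are made up; divide-by-<em>n</em> moments are used to match the definitions above):</p>

```python
import numpy as np

# simulated data: binary instrument z, continuous x (parameters are made up)
rng = np.random.default_rng(0)
z = rng.binomial(1, 0.3, size=1000)
x = 10 + 2 * z + rng.normal(0, 2, size=1000)

# population (divide-by-n) moments
cov_xz = np.mean(x * z) - np.mean(x) * np.mean(z)
var_z = np.mean(z) * (1 - np.mean(z))  # Bernoulli variance E[Z](1 - E[Z])
mean_diff = x[z == 1].mean() - x[z == 0].mean()

# Cov(X, Z) = Var(Z) * (E[X|Z=1] - E[X|Z=0])
assert np.isclose(cov_xz, var_z * mean_diff)
```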
<p>Plugging these two results into the formula, we get</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5B%5Chat%7B%5Cbeta%7D_%7B%5Ctext%7BIV%7D%7D%5D=%5Cfrac%7B1%7D%7Bn%7D%5Cfrac%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-%5Chat%7B%5Cbeta%7D_%7B%5Ctext%7BIV%7D%7DX%5D%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D(Z)%5Cleft(%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D-%5Chat%7BE%7D%5BX%5Cmid%20Z=0%5D%5Cright)%5E%7B2%7D%7D."></p>
<p>The last step is to show that the denominator is equal to <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5B%5Chat%7BE%7D%5BX%5Cmid%20Z%5D%5D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5B%5Chat%7BE%7D%5BX%5Cmid%20Z%5D%5D&amp;=%5Chat%7BE%7D%5BZ%5D%5Cleft(%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D-%5Chat%7BE%7D%5BX%5D%5Cright)%5E%7B2%7D+(1-%5Chat%7BE%7D%5BZ%5D)%5Cleft(%5Chat%7BE%7D%5BX%5Cmid%20Z=0%5D-%5Chat%7BE%7D%5BX%5D%5Cright)%5E%7B2%7D%5C%5C&amp;=%5Chat%7BE%7D%5BZ%5D(1-%5Chat%7BE%7D%5BZ%5D)%5E%7B2%7D%5Cleft(%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D-%5Chat%7BE%7D%5BX%5Cmid%20Z=0%5D%5Cright)%5E%7B2%7D%5C%5C&amp;%5Cquad+(1-%5Chat%7BE%7D%5BZ%5D)%5Chat%7BE%7D%5E%7B2%7D%5BZ%5D%5Cleft(%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D)-%5Chat%7BE%7D%5BX%5Cmid%20Z=0%5D%5Cright)%5E%7B2%7D%5C%5C&amp;=%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BZ%5D%5Cleft(%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D-%5Chat%7BE%7D%5BX%5Cmid%20Z=0%5D%5Cright)%5E%7B2%7D%0A%5Cend%7Balign%7D%0A"></p>
<p>The first equality is just the definition of the variance of the group means, when two groups are involved. We then apply the identities that have been used for the covariance term, and simplify the result. This is identical to the denominator, so the result is proven.</p>
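<p>The same kind of numerical check works for this identity (again a standalone NumPy sketch with made-up data, using divide-by-<em>n</em> moments):</p>

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.binomial(1, 0.25, size=1500)     # binary instrument
x = 4 * z + rng.normal(0, 1, size=1500)

p = z.mean()
mean_diff = x[z == 1].mean() - x[z == 0].mean()

# variance of the two conditional means, weighted by the group shares
var_group_means = (p * (x[z == 1].mean() - x.mean()) ** 2
                   + (1 - p) * (x[z == 0].mean() - x.mean()) ** 2)

# Var[E[X|Z]] = Var(Z) * (E[X|Z=1] - E[X|Z=0])^2
assert np.isclose(var_group_means, p * (1 - p) * mean_diff ** 2)
```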
<p>We therefore have as the final result:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5B%5Chat%7B%5Cbeta%7D_%5Ctext%7BIV%7D%5D%20=%0A%20%20%20%20%5Cfrac%7B1%7D%7Bn%7D%20%5Cfrac%7B%20%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-%5Chat%7B%5Cbeta%7D_%5Ctext%7BIV%7D%20X%5D%20%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5B%5Chat%7BE%7D%5BX%20%5Cmid%20Z%5D%5D%7D."></p>
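<p>On simulated data, we can verify that this simplified formula agrees exactly with the general matrix formula from the start of this section. This is a standalone NumPy sketch; the data-generating process and all parameters are made up:</p>

```python
import numpy as np

# made-up data-generating process with a binary instrument and an
# endogenous x (u is a confounder of x and y)
rng = np.random.default_rng(1)
n = 5000
u = rng.normal(size=n)
z = rng.binomial(1, 0.5, size=n)
x = 10 + 2 * z + u + rng.normal(size=n)
y = 1 + 0.5 * x + u + rng.normal(size=n)

# Wald/IV estimate for a single binary instrument
beta_iv = (y[z == 1].mean() - y[z == 0].mean()) / (x[z == 1].mean() - x[z == 0].mean())
sigma2 = np.var(y - beta_iv * x)  # divide-by-n convention, as in the text

# general matrix formula: sigma2 * [(Xhat' Xhat)^-1]_{22}
Z = np.column_stack([np.ones(n), z])
X = np.column_stack([np.ones(n), x])
Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)  # first-stage fitted values
var_matrix = sigma2 * np.linalg.inv(Xhat.T @ Xhat)[1, 1]

# simplified formula: (1/n) * Var[Y - beta_iv * X] / Var[E[X|Z]]
var_between = np.var(Xhat[:, 1])  # variance of the first-stage group means
var_simple = sigma2 / n / var_between

assert np.isclose(var_matrix, var_simple)
```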
<p>To translate this into R code, we estimate the between variance using the <code>anova</code> function. (We could also use the alternative version, where we take the squared difference between the means and multiply by the variance of <img src="https://latex.codecogs.com/png.latex?Z">.)</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># we use `feols` from the fixest package for IV estimation</span></span>
<span id="cb3-2">model_iv <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">feols</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> z, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> returns)</span>
<span id="cb3-3">beta_iv <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(model_iv)[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]]</span>
<span id="cb3-4"></span>
<span id="cb3-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># standard error calculated by feols</span></span>
<span id="cb3-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vcov</span>(model_iv)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span>
<span id="cb3-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 0.02629134</span></span>
<span id="cb3-8"></span>
<span id="cb3-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># compute standard error manually</span></span>
<span id="cb3-10">between_variance <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">anova</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> z, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> returns))[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Mean Sq"</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(returns)</span>
<span id="cb3-11">adj <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(returns) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># use the dof that feols uses</span></span>
<span id="cb3-12">returns[, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(adj <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">var</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> beta_iv <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> x) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> between_variance)]</span>
<span id="cb3-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 0.02629134</span></span></code></pre></div>
</div>
<p>Note that the IV standard error is almost ten times the size of the OLS standard error.</p>
</section>
<section id="comparing-the-iv-and-ols-standard-errors" class="level2">
<h2 class="anchored" data-anchor-id="comparing-the-iv-and-ols-standard-errors">Comparing the IV and OLS standard errors</h2>
<p>When we directly compare the IV and the OLS standard error, it becomes apparent that the two formulas are very similarly structured (again, this applies only if <img src="https://latex.codecogs.com/png.latex?Z"> is binary):</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign%7D%0A%5Ctext%7BVar%7D%5B%5Chat%7B%5Cbeta%7D_%7B%5Ctext%7BOLS%7D%7D%5D%0A%20%20%20%20&amp;=%5Cfrac%7B1%7D%7Bn%7D%5Cfrac%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-%5Chat%7B%5Cbeta%7D_%5Ctext%7BOLS%7D%20X%5D%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BX%5D%7D%20%5C%5C%0A%5Ctext%7BVar%7D%5B%5Chat%7B%5Cbeta%7D_%5Ctext%7BIV%7D%5D%20&amp;=%0A%20%20%20%20%5Cfrac%7B1%7D%7Bn%7D%20%5Cfrac%7B%20%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-%5Chat%7B%5Cbeta%7D_%5Ctext%7BIV%7D%20X%5D%20%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5B%5Chat%7BE%7D%5BX%20%5Cmid%20Z%5D%5D%7D%0A%5Cend%7Balign%7D"></p>
<p>In both cases, we have three ways of achieving a lower standard error:</p>
<ol type="1">
<li>Increase the sample size, <img src="https://latex.codecogs.com/png.latex?n">,</li>
<li>Reduce the size of the numerator,</li>
<li>Increase the size of the denominator.</li>
</ol>
<p>With OLS, we are guaranteed to obtain the smallest possible standard error under the <a href="https://en.wikipedia.org/wiki/Gauss–Markov_theorem">usual regression assumptions</a>. In comparison, the IV estimator replaces <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7B%5Ctext%7BOLS%7D%7D"> with <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7B%5Ctext%7BIV%7D%7D"> in the numerator – which <em>must</em> increase the numerator unless the two <img src="https://latex.codecogs.com/png.latex?%5Cbeta">’s are identical – and uses only part of the variance in the denominator – which <em>must</em> decrease the denominator unless <img src="https://latex.codecogs.com/png.latex?Z"> perfectly determines <img src="https://latex.codecogs.com/png.latex?X">. Hence, the decrease in power of the IV estimator comes both from the fact that we predict <img src="https://latex.codecogs.com/png.latex?Y"> less well (in the numerator), and from the fact that we use only part of the variance of <img src="https://latex.codecogs.com/png.latex?X"> (in the denominator).</p>
<p>The denominator is based on the law of total variance, which states that</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5BX%5D=%5Ctext%7BVar%7D%5BE%5BX%20%5Cmid%20Z%5D%5D+E%5B%5Ctext%7BVar%7D%5BX%20%5Cmid%20Z%5D%5D."></p>
<p>This is a between/within decomposition. The law of total variance states that the total variance is equal to the variance of the group means (“between”), plus the variance within the groups. The IV estimator uses only the “between” term. Hence, to make this term as large as possible, <img src="https://latex.codecogs.com/png.latex?Z"> should predict <img src="https://latex.codecogs.com/png.latex?X"> well. As we have seen, another way to put this is to maximize <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7B%5Ctext%7BVar%7D%7D(Z)%5Cleft(E%5BX%5Cmid%20Z=1%5D-E%5BX%5Cmid%20Z=0%5D%5Cright)%5E%7B2%7D">. Hence, in the optimal case, we would like to have <img src="https://latex.codecogs.com/png.latex?E%5BZ%5D=0.5"> to maximize the variance, and have a large difference in group means.</p>
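<p>The between/within decomposition is easy to check numerically as well. A standalone NumPy sketch with made-up data (divide-by-<em>n</em> variances throughout):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.binomial(1, 0.4, size=2000)          # binary instrument
x = 5 + 3 * z + rng.normal(0, 2, size=2000)

p = z.mean()
# "between": variance of the group means, weighted by the group shares
between = (p * (x[z == 1].mean() - x.mean()) ** 2
           + (1 - p) * (x[z == 0].mean() - x.mean()) ** 2)
# "within": weighted average of the within-group variances
within = p * np.var(x[z == 1]) + (1 - p) * np.var(x[z == 0])

# law of total variance: Var[X] = Var[E[X|Z]] + E[Var[X|Z]]
assert np.isclose(np.var(x), between + within)
```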
<p>In the empirical example, the numerators and denominators compare as follows:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># comparing the numerators</span></span>
<span id="cb4-2">num_ols <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> returns[, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">var</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> beta_ols <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> x)]</span>
<span id="cb4-3">num_iv <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> returns[, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">var</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> beta_iv <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> x)]</span>
<span id="cb4-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(num_ols, num_iv)</span>
<span id="cb4-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 0.1775096 0.3099878</span></span>
<span id="cb4-6"></span>
<span id="cb4-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># comparing the denominators</span></span>
<span id="cb4-8">denom_ols <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> returns[, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">var</span>(x)]</span>
<span id="cb4-9">denom_iv <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> between_variance</span>
<span id="cb4-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(denom_ols, denom_iv)</span>
<span id="cb4-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 7.1658624 0.1490379</span></span></code></pre></div>
</div>
<p>The IV numerator is almost twice the size of the OLS numerator, but the real loss in power comes from the denominator, which differs by a factor of almost 50. Clearly, and not surprisingly, most of the variance in education is not <em>between</em> people who grew up near a college and those who did not, but <em>within</em> these two groups. This is confirmed by a quick look at the variance decomposition:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">returns[, .(between_variance, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">within_variance =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">var</span>(x) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> between_variance)]</span>
<span id="cb5-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    between_variance within_variance</span></span>
<span id="cb5-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;               &lt;num&gt;           &lt;num&gt;</span></span>
<span id="cb5-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:        0.1490379        7.016824</span></span></code></pre></div>
</div>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>In the simple setting with just one endogenous variable and one binary instrument, the IV standard error can be shown to have a simple form that can be compared easily with the OLS standard error. The logic of the IV estimator is that instead of using the full information in <img src="https://latex.codecogs.com/png.latex?X">, we use only that part of the information in <img src="https://latex.codecogs.com/png.latex?X"> that is also contained in the instrument <img src="https://latex.codecogs.com/png.latex?Z">. This affects both the numerator and the denominator of the IV standard error. This highlights how important it is to choose an instrument that is strongly predictive of <img src="https://latex.codecogs.com/png.latex?X">.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>If <img src="https://latex.codecogs.com/png.latex?Z"> is continuous, the formula is still correct if a linear estimator is used for <img src="https://latex.codecogs.com/png.latex?%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5B%5Chat%7BE%7D%5BX%20%5Cmid%20Z%5D%5D%7D">.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>regression</category>
  <category>ab-testing</category>
  <guid>https://elbersb.com/public/posts/2023-10-07-iv-standard-error/</guid>
  <pubDate>Fri, 06 Oct 2023 22:00:00 GMT</pubDate>
</item>
<item>
  <title>Eliminating the bias of segregation indices</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2021-11-24-segregation-bias/</link>
  <description><![CDATA[ 




<p>It is well known that most standard estimators of segregation indices are biased. The <a href="http://elbersb.com/segregation">segregation</a> package provides a few tools to assess this bias. This post will discuss this problem with some simple examples and show under what conditions bootstrapping and simulation can help to remove the bias. The post relies on some tools that were only recently added to the package, so install the most recent version to follow along:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"segregation"</span>)</span></code></pre></div>
</div>
<section id="bias-in-small-and-large-samples" class="level2">
<h2 class="anchored" data-anchor-id="bias-in-small-and-large-samples">Bias in small and large samples</h2>
<p>To illustrate the problem, let’s use R’s <code>stats::r2dtable</code> function to simulate a random contingency table. To make the following more concrete, let’s assume that we observe racial segregation in schools. Each school has an equal number of students of each of the two racial groups, but we only observe a sample. If the sample is small, we do not expect to sample exactly an even number of students of each of the two groups, so the segregation index is likely to be biased upwards.</p>
<p>One hypothetical sample could look like this:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mat =</span> stats<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">r2dtable</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">25</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">25</span>))[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]])</span>
<span id="cb2-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      [,1] [,2]</span></span>
<span id="cb2-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1,]    5    5</span></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [2,]    4    6</span></span>
<span id="cb2-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [3,]    3    7</span></span>
<span id="cb2-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [4,]    7    3</span></span>
<span id="cb2-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [5,]    6    4</span></span></code></pre></div>
</div>
<p>Now we can compute the Mutual Information index (M) and its normalized version, the H index:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"segregation"</span>)</span>
<span id="cb3-2">dat <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix_to_long</span>(mat) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># convert to long format</span></span>
<span id="cb3-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_total</span>(dat, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>)</span>
<span id="cb3-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat    est</span></span>
<span id="cb3-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt;  &lt;num&gt;</span></span>
<span id="cb3-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      M 0.0410</span></span>
<span id="cb3-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      H 0.0591</span></span></code></pre></div>
</div>
<p>Clearly, both indices are non-zero. For the index of dissimilarity, the bias is even stronger:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dissimilarity</span>(dat, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>)</span>
<span id="cb4-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat   est</span></span>
<span id="cb4-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt; &lt;num&gt;</span></span>
<span id="cb4-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      D  0.24</span></span></code></pre></div>
</div>
<p>An index value of 0.3 is often interpreted as “moderate segregation”, so this bias is clearly a problem. Generally, the index of dissimilarity suffers more from small-sample bias than the information-theoretic indices.</p>
<p>Importantly, the bias is not simply a function of sample size. For instance, if we increase the number of schools to 10,000, but still expect 5 students of each racial group in each school, the bias is pretty much the same:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">mat_large <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> stats<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">r2dtable</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span>), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50000</span>))[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb5-2">dat_large <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix_to_long</span>(mat_large) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># convert to long format</span></span>
<span id="cb5-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_total</span>(dat_large, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>)</span>
<span id="cb5-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat    est</span></span>
<span id="cb5-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt;  &lt;num&gt;</span></span>
<span id="cb5-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      M 0.0540</span></span>
<span id="cb5-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      H 0.0778</span></span>
<span id="cb5-8"></span>
<span id="cb5-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dissimilarity</span>(dat_large, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>)</span>
<span id="cb5-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat   est</span></span>
<span id="cb5-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt; &lt;num&gt;</span></span>
<span id="cb5-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      D 0.248</span></span></code></pre></div>
</div>
<p>This is despite the fact that in the first case, our sample size is 50, and in the second case it’s 100,000! For the index of dissimilarity, Winship (1977) has described this bias in detail.</p>
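<p>A quick way to see this is to compute D by hand (the <code>d_index</code> helper below is ad hoc, not part of the package) for two simulated datasets with the same total sample size but very different school sizes:</p>

```r
set.seed(1)
# Dissimilarity index for a (schools x 2 groups) count matrix
d_index <- function(mat) {
  0.5 * sum(abs(mat[, 1] / sum(mat[, 1]) - mat[, 2] / sum(mat[, 2])))
}
# 10,000 schools with 10 students each (100,000 students in total)
small_schools <- stats::r2dtable(1, rep(10, 10000), c(50000, 50000))[[1]]
# 100 schools with 1,000 students each (also 100,000 students in total)
large_schools <- stats::r2dtable(1, rep(1000, 100), c(50000, 50000))[[1]]
d_index(small_schools)  # around 0.25, as above
d_index(large_schools)  # an order of magnitude smaller
```

<p>With the same total sample size, the bias all but disappears once the individual schools are large: what matters is the size of the units, not the total number of observations.</p>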
</section>
<section id="solution-1-bootstrapping" class="level2">
<h2 class="anchored" data-anchor-id="solution-1-bootstrapping">Solution 1: Bootstrapping</h2>
<p>In many circumstances, it helps to enable bootstrapping to estimate the bias. When bootstrapping is enabled, the <code>segregation</code> package reports <a href="../2021-01-07-bootstrap-bias">bias-adjusted estimates</a>. Let’s try this for both datasets from above:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_total</span>(dat, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb6-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 100 bootstrap iterations on 50 observations</span></span>
<span id="cb6-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat      est     se              CI   bias</span></span>
<span id="cb6-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt;    &lt;num&gt;  &lt;num&gt;          &lt;list&gt;  &lt;num&gt;</span></span>
<span id="cb6-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      M -0.00933 0.0465 -0.0956, 0.0647 0.0503</span></span>
<span id="cb6-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      H -0.01520 0.0683 -0.1400, 0.0932 0.0743</span></span>
<span id="cb6-7"></span>
<span id="cb6-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_total</span>(dat_large, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb6-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 100 bootstrap iterations on 1e+05 observations</span></span>
<span id="cb6-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat       est      se                CI   bias</span></span>
<span id="cb6-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt;     &lt;num&gt;   &lt;num&gt;            &lt;list&gt;  &lt;num&gt;</span></span>
<span id="cb6-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      M -0.000629 0.00118 -0.00296, 0.00159 0.0546</span></span>
<span id="cb6-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      H -0.000908 0.00170 -0.00427, 0.00230 0.0788</span></span></code></pre></div>
</div>
<p>In this case, the bootstrap estimates the bias pretty well. Because the bias (last column) is subtracted from the segregation estimates, the bootstrap-adjusted estimate may become slightly negative.</p>
<p>For the index of dissimilarity, this procedure does not work as well:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dissimilarity</span>(dat, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb7-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 100 bootstrap iterations on 50 observations</span></span>
<span id="cb7-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat   est     se              CI   bias</span></span>
<span id="cb7-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt; &lt;num&gt;  &lt;num&gt;          &lt;list&gt;  &lt;num&gt;</span></span>
<span id="cb7-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      D 0.171 0.0999 -0.0632, 0.3596 0.0689</span></span>
<span id="cb7-6"></span>
<span id="cb7-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dissimilarity</span>(dat_large, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb7-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 100 bootstrap iterations on 1e+05 observations</span></span>
<span id="cb7-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat   est      se          CI  bias</span></span>
<span id="cb7-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt; &lt;num&gt;   &lt;num&gt;      &lt;list&gt; &lt;num&gt;</span></span>
<span id="cb7-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      D  0.14 0.00201 0.137,0.145 0.108</span></span></code></pre></div>
</div>
<p>Although the bias estimate is fairly large, a substantial bias remains.</p>
</section>
<section id="solution-2-compute-the-expected-value-under-independence" class="level2">
<h2 class="anchored" data-anchor-id="solution-2-compute-the-expected-value-under-independence">Solution 2: Compute the expected value under independence</h2>
<p>The bootstrap may sometimes work to estimate the bias, but two major problems remain. The first, as we have seen, is that the bias estimation does not work well for the index of dissimilarity. The second is that the bootstrap does badly when the contingency table is very sparse and contains many zero entries. I’ll come back to this in the example at the end of the post.</p>
<p>A direct approach to estimating the bias is the following: using the observed marginal distributions, simulate contingency tables under the assumption that true segregation is zero, and record the average index value across a number of repetitions. This quantity is the expected value of the segregation index when students are randomly distributed across schools, conditional on the marginal distributions. In economics, this quantity is sometimes called “random segregation” (Carrington and Troske 1998).</p>
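<p>The algorithm is easy to sketch by hand for the small example from above (the <code>m_index</code> helper is ad hoc, but it uses the natural logarithm and so reproduces the package’s M):</p>

```r
set.seed(1)
# Mutual Information index (M) for a count matrix, in nats
m_index <- function(mat) {
  p <- mat / sum(mat)
  sum(p * log(p / outer(rowSums(p), colSums(p))), na.rm = TRUE)
}
mat <- matrix(c(5, 4, 3, 7, 6, 5, 6, 7, 3, 4), ncol = 2)  # the sample from above
m_index(mat)  # 0.041, matching mutual_total() above

# Simulate 500 tables with the same margins but zero true segregation
sims <- stats::r2dtable(500, rowSums(mat), colSums(mat))
mean(sapply(sims, m_index))  # the expected value of M under independence
```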
<p>The <code>segregation</code> package implements this algorithm in the following two functions:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_expected</span>(dat, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>)</span>
<span id="cb8-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;         stat    est     se</span></span>
<span id="cb8-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;       &lt;char&gt;  &lt;num&gt;  &lt;num&gt;</span></span>
<span id="cb8-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1: M under 0 0.0443 0.0290</span></span>
<span id="cb8-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2: H under 0 0.0639 0.0418</span></span>
<span id="cb8-6"></span>
<span id="cb8-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dissimilarity_expected</span>(dat, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>)</span>
<span id="cb8-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;         stat   est     se</span></span>
<span id="cb8-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;       &lt;char&gt; &lt;num&gt;  &lt;num&gt;</span></span>
<span id="cb8-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1: D under 0 0.226 0.0945</span></span></code></pre></div>
</div>
<p>In both cases, calculating the expected value of the index gives a good estimate of the bias. When reporting the final results, we could simply subtract the bias from the segregation estimates.</p>
</section>
<section id="an-example-with-sparse-data" class="level2">
<h2 class="anchored" data-anchor-id="an-example-with-sparse-data">An example with sparse data</h2>
<p>As a final point, the example in this section demonstrates circumstances under which even the information-theoretic indices may be highly biased.</p>
<p>The <code>segregation</code> package contains an example dataset, <code>school_ses</code>, with artificial data. Each row of this dataset describes a student, with information on the school the student attends (<code>school_id</code>), the student’s ethnic group (one of A, B, or C; <code>ethnic_group</code>), and the student’s socio-economic status (provided in quintiles; <code>ses_quintile</code>). Because there are three ethnic groups, we will only compute the multigroup M and H indices.</p>
<p>The <code>school_ses</code> dataset is sparse: There are 149 schools in total, but only 46 of those contain students of all three ethnic groups, and 26 schools contain only students of a single ethnic group.</p>
<p>The ethnic segregation in this dataset is fairly large, but we may expect this estimate to be upwardly biased:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_total</span>(school_ses, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ethnic_group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"school_id"</span>)</span>
<span id="cb9-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat   est</span></span>
<span id="cb9-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt; &lt;num&gt;</span></span>
<span id="cb9-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      M 0.544</span></span>
<span id="cb9-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      H 0.577</span></span></code></pre></div>
</div>
<p>For this dataset, the two approaches of estimating the bias differ somewhat:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_total</span>(school_ses, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ethnic_group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"school_id"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb10-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 100 bootstrap iterations on 5153 observations</span></span>
<span id="cb10-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat   est      se          CI   bias</span></span>
<span id="cb10-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt; &lt;num&gt;   &lt;num&gt;      &lt;list&gt;  &lt;num&gt;</span></span>
<span id="cb10-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      M 0.529 0.01000 0.512,0.545 0.0160</span></span>
<span id="cb10-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      H 0.559 0.00921 0.542,0.576 0.0181</span></span>
<span id="cb10-7"></span>
<span id="cb10-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_expected</span>(school_ses, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ethnic_group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"school_id"</span>)</span>
<span id="cb10-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;         stat    est      se</span></span>
<span id="cb10-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;       &lt;char&gt;  &lt;num&gt;   &lt;num&gt;</span></span>
<span id="cb10-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1: M under 0 0.0304 0.00240</span></span>
<span id="cb10-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2: H under 0 0.0322 0.00254</span></span></code></pre></div>
</div>
<p>Using bootstrapping, the bias for the M index is estimated to be 0.016, while the bias estimated using the “random segregation” approach is 0.03.</p>
<p>This difference is still rather small, and will not be consequential in many situations. However, the advantage of using information-theoretic measures lies in their decomposability, and in decompositions the bias may be much larger. For instance, assume that we are interested in computing ethnic segregation conditional on SES. We can use the <code>within</code> argument to calculate this:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_total</span>(school_ses, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ethnic_group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"school_id"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">within =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ses_quintile"</span>)</span>
<span id="cb11-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat   est</span></span>
<span id="cb11-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt; &lt;num&gt;</span></span>
<span id="cb11-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      M 0.463</span></span>
<span id="cb11-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      H 0.490</span></span></code></pre></div>
</div>
<p>Estimating the bias of this conditional index using bootstrapping yields a bias estimate of around 0.04:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_total</span>(school_ses, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ethnic_group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"school_id"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">within =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ses_quintile"</span>,</span>
<span id="cb12-2">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb12-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 100 bootstrap iterations on 5153 observations</span></span>
<span id="cb12-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat   est      se          CI   bias</span></span>
<span id="cb12-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt; &lt;num&gt;   &lt;num&gt;      &lt;list&gt;  &lt;num&gt;</span></span>
<span id="cb12-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      M 0.424 0.00853 0.410,0.439 0.0389</span></span>
<span id="cb12-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      H 0.450 0.00909 0.433,0.465 0.0408</span></span></code></pre></div>
</div>
<p>However, if we compute the expected value conditional on SES, the result looks very different:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb13-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_expected</span>(school_ses, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ethnic_group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"school_id"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">within =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ses_quintile"</span>)</span>
<span id="cb13-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;         stat   est      se</span></span>
<span id="cb13-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;       &lt;char&gt; &lt;num&gt;   &lt;num&gt;</span></span>
<span id="cb13-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1: M under 0 0.105 0.00848</span></span>
<span id="cb13-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2: H under 0 0.132 0.01113</span></span></code></pre></div>
</div>
<p>The bias is estimated to be very large – around 0.1 for the M and around 0.13 for the H! The reason for this discrepancy is that the indices are now computed within each group defined by the SES quintiles. These “conditional” contingency tables are much smaller and even sparser than the overall dataset, so the bias is correspondingly larger. One therefore has to be very careful when decomposing segregation measures for small or sparse samples.</p>
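<p>To see where this upward bias comes from, here is a minimal standalone sketch in Python (not the R code used above, and with made-up sample sizes): group membership is assigned completely at random, so true segregation is zero, yet the naive M estimate is systematically positive.</p>

```python
import random
from collections import Counter
from math import log

def m_index(table):
    """Naive M (mutual information) from a dict (unit, group) -> count."""
    total = sum(table.values())
    p_unit, p_group = Counter(), Counter()
    for (u, g), n in table.items():
        p_unit[u] += n / total
        p_group[g] += n / total
    # M = sum over cells of p_ug * log(p_ug / (p_u * p_g))
    return sum((n / total) * log((n / total) / (p_unit[u] * p_group[g]))
               for (u, g), n in table.items())

random.seed(1)
units = [u for u in range(20) for _ in range(10)]  # 20 small units of size 10
ms = []
for _ in range(500):
    groups = [random.random() < 0.5 for _ in units]  # random group labels
    ms.append(m_index(Counter(zip(units, groups))))

print(sum(ms) / len(ms))  # clearly above zero: pure small-sample bias
```

<p>Even though this “city” is completely unsegregated by construction, the naive estimator averages well above zero, and the bias grows as the units get smaller – which is exactly what happens when the data are split into conditional tables.</p>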
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>When working with segregation indices, it is important to be aware that almost all “naive” estimators of these indices are upwardly biased. In many situations, this bias will be small. However, if the overall sample size is small, or some of the groups or units are small, the bias can be substantial. Importantly, a large overall sample does not guarantee that the bias is small. My recommendation is to always check the sensitivity of your results <em>both</em> by bootstrapping and by calculating “random segregation”. Special attention needs to be paid when decomposing segregation measures for small or sparse samples, as the decompositions will be based on even smaller/sparser samples.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p>Winship, Christopher. 1977. A Revaluation of Indexes of Residential Segregation. <em>Social Forces</em> 55(4): 1058-1066.</p>
<p>Carrington, William J. and Kenneth R. Troske. 1998. Interfirm Segregation and the Black/White Wage Gap. <em>Journal of Labor Economics</em> 16(2): 231-260.</p>


</section>

 ]]></description>
  <category>segregation</category>
  <category>packages</category>
  <guid>https://elbersb.com/public/posts/2021-11-24-segregation-bias/</guid>
  <pubDate>Tue, 23 Nov 2021 23:00:00 GMT</pubDate>
</item>
<item>
  <title>Did Residential Racial Segregation in the U.S. Really Increase?</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2021-07-23-segregation-increase/</link>
  <description><![CDATA[ 




<p>A <a href="https://belonging.berkeley.edu/roots-structural-racism">recent report by the Othering and Belonging Institute at UC Berkeley</a> claimed that, of large metropolitan areas in the U.S., 81% have become more segregated over the period 1990-2019. This finding contradicts the recent sociological literature on changes in residential segregation in the U.S., which has generally found that racial residential segregation has slowly declined since the 1970s, especially between Blacks and Whites. The major question then is: What accounts for this difference?</p>
<p><a href="https://osf.io/preprints/socarxiv/dvutw">My new working paper</a> answers this question, and here’s a quick summary:</p>
<ol type="1">
<li><p>The segregation measure of the Berkeley study, the “Divergence Index,” is identical to mutual information, also known as the <img src="https://latex.codecogs.com/png.latex?M"> index. This index is <strong>mechanically</strong> affected by changes in racial diversity. Given that the U.S. has become more diverse over the period 1990 to 2019, it is not surprising that this index shows increases in segregation.</p>
<p>It is important to emphasize again that the index is <em>mechanically</em> affected by rising diversity. This means that if only the diversity of the metropolitan area changes, the index will increase. Of course, this doesn’t mean that in every metropolitan area where racial diversity is increasing the index value also increases—clearly, other things could also change. The fact that the index is mechanically related to diversity is also not a statement about the general relationship between diversity and segregation: It could be the case, for instance, that more diverse cities are more segregated. When one uses the <img src="https://latex.codecogs.com/png.latex?M"> index to answer such a question, one will almost always find that such a relationship exists, because of the mechanical dependency between diversity and the <img src="https://latex.codecogs.com/png.latex?M"> index.</p>
<p>In mathematical terms, the simplest way to see the influence of diversity on the index value is to write the index as the sum of three entropies: <img src="https://latex.codecogs.com/png.latex?M=E(%5Cmathbf%7Bp%7D_%7Bu%5Ccdot%7D)+E(%5Cmathbf%7Bp%7D_%7B%5Ccdot%20g%7D)-E(%5Cmathbf%7Bp%7D_%7Bug%7D)."> The first term is the entropy of the neighborhood distribution, the second term is the entropy of the racial group distribution, and the third term is the entropy of the joint distribution. Given that the racial group entropy increases when diversity increases, the <img src="https://latex.codecogs.com/png.latex?M"> is clearly affected by rising diversity.</p></li>
<li><p>Once I correct for the confounding of index change with diversity using a <a href="http://elbersb.com/public/posts/smr-paper/">decomposition method</a>, I find that the results are in line with the sociological literature: Residential racial segregation as a whole has declined modestly in most metropolitan areas of the U.S., although segregation has increased slightly when focusing on Asian Americans and Hispanics. The following plot shows the <img src="https://latex.codecogs.com/png.latex?M"> index and the adjusted <img src="https://latex.codecogs.com/png.latex?M"> that corrects for the mechanical influence of rising diversity. The <img src="https://latex.codecogs.com/png.latex?H"> index is also shown:</p></li>
</ol>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2021-07-23-segregation-increase/segtrends.png" class="img-fluid figure-img" style="width:50.0%"></p>
<figcaption>Trends</figcaption>
</figure>
</div>
<p>Clearly, once we adjust for the mechanical influence of diversity (which the <img src="https://latex.codecogs.com/png.latex?H"> also does), segregation in the median metropolitan area is declining. The figure also shows that all indices increase sharply in 2019. Why is this the case? The reason is that for the years 1990, 2000, and 2010, Census data are available. For 2019, only estimates from the American Community Survey are available, which are <a href="https://link.springer.com/article/10.1007/s13524-016-0545-z">well known to inflate segregation estimates</a>. Hence, even the increase in the <img src="https://latex.codecogs.com/png.latex?M">, which is almost entirely confined to the period 2010-2019, may be spurious and due to the use of ACS data.</p>
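<p>The three-entropy identity from point 1 is easy to verify numerically. Here is a short standalone Python sketch with an illustrative (made-up) joint distribution:</p>

```python
from math import log

def entropy(ps):
    # Shannon entropy (natural log), ignoring empty cells
    return -sum(p * log(p) for p in ps if p > 0)

# illustrative joint distribution over (neighborhood, group) cells
p_ug = {("u1", "A"): 0.30, ("u1", "B"): 0.10,
        ("u2", "A"): 0.10, ("u2", "B"): 0.50}

p_u, p_g = {}, {}
for (u, g), p in p_ug.items():
    p_u[u] = p_u.get(u, 0) + p  # neighborhood marginals
    p_g[g] = p_g.get(g, 0) + p  # racial group marginals

# M = E(p_u.) + E(p_.g) - E(p_ug)
M = entropy(p_u.values()) + entropy(p_g.values()) - entropy(p_ug.values())

# check: identical to mutual information computed cell by cell
M_direct = sum(p * log(p / (p_u[u] * p_g[g])) for (u, g), p in p_ug.items())
print(round(M, 4), round(M_direct, 4))  # both 0.1777
```

<p>Since the group entropy <img src="https://latex.codecogs.com/png.latex?E(%5Cmathbf%7Bp%7D_%7B%5Ccdot%20g%7D)"> enters the sum directly, anything that raises diversity pushes the index up, all else equal.</p>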
<p>For more details, the working paper is on <a href="https://osf.io/preprints/socarxiv/dvutw">SocArXiv</a>, as well as a complete set of <a href="https://osf.io/mg9q4/">replication materials</a>.</p>
<section id="a-note-on-local-measures-of-segregation" class="level3">
<h3 class="anchored" data-anchor-id="a-note-on-local-measures-of-segregation">A note on local measures of segregation</h3>
<p>Because this came up in the discussion afterwards, here are some remarks on measures of <em>local</em> segregation. The M index (called divergence index in the Berkeley report) can be written as a weighted average of local segregation scores <img src="https://latex.codecogs.com/png.latex?L_u">, where <img src="https://latex.codecogs.com/png.latex?L_u"> measures the segregation of neighborhood <img src="https://latex.codecogs.com/png.latex?u">:</p>
<p><img src="https://latex.codecogs.com/png.latex?L_u%20=%20%5Csum_%7Bg=1%7D%5E%7BG%7Dp_%7Bg%7Cu%7D%5Clog%5Cfrac%7Bp_%7Bg%7Cu%7D%7D%7Bp_%7B%5Ccdot%20g%7D%7D"></p>
<p>(<img src="https://latex.codecogs.com/png.latex?p_%7Bg%7Cu%7D"> is the proportion of racial group <img src="https://latex.codecogs.com/png.latex?g"> in neighborhood <img src="https://latex.codecogs.com/png.latex?u">, and <img src="https://latex.codecogs.com/png.latex?p_%7B%5Ccdot%20g%7D"> is the overall proportion of racial group <img src="https://latex.codecogs.com/png.latex?g"> in the metropolitan area). This measure is the Kullback-Leibler divergence. Once we weight by the size of the neighborhood, we obtain the M index (mutual information):</p>
<p><img src="https://latex.codecogs.com/png.latex?M=%5Csum_%7Bu=1%7D%5E%7BU%7Dp_%7Bu%5Ccdot%7DL_%7Bu%7D"></p>
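<p>As a concrete illustration (a Python sketch with made-up counts, not code from any of the papers), here is the computation of the local scores <img src="https://latex.codecogs.com/png.latex?L_u"> and their size-weighted average for a toy metro with three neighborhoods:</p>

```python
from math import log

# made-up counts of two racial groups in three neighborhoods
counts = {"u1": {"A": 80, "B": 20},
          "u2": {"A": 50, "B": 50},
          "u3": {"A": 10, "B": 90}}

total = sum(sum(c.values()) for c in counts.values())
# overall group proportions p_.g
p_g = {}
for c in counts.values():
    for g, n in c.items():
        p_g[g] = p_g.get(g, 0) + n / total

M = 0.0
for u, c in counts.items():
    n_u = sum(c.values())
    # L_u: KL divergence of the neighborhood's group distribution
    # from the metro-wide group distribution
    L_u = sum((n / n_u) * log((n / n_u) / p_g[g]) for g, n in c.items() if n > 0)
    print(u, round(L_u, 4))
    M += (n_u / total) * L_u  # weight by neighborhood share p_u.

print("M =", round(M, 4))  # 0.1847
```

<p>The evenly mixed neighborhood u2 scores lowest, because its group distribution is closest to the metro-wide distribution; the M index is just the size-weighted average of these scores.</p>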
<p>The scores <img src="https://latex.codecogs.com/png.latex?L_u"> are really useful, but should they be used to compare or rank neighborhoods <em>across</em> metros or across time? This is unproblematic if we just look at one metro area at one point in time, e.g., to learn which neighborhoods are especially segregated. But what if we want to compare over time or across metros? Then it gets tricky, because local segregation scores are also influenced by the diversity of the metro area. The minimum value of local segregation is zero, but the maximum value is (see if you can guess it) <em>the negative of the logarithm of the proportion of the metro area’s <em>smallest</em> racial group</em> (see <a href="https://osf.io/preprints/socarxiv/3juyc">here</a> for a proof). When comparing across metropolitan areas, the range of the local scores will therefore differ whenever the size of the smallest racial group differs. This makes comparisons really tricky, and I therefore wouldn’t be willing to classify neighborhoods across metro areas as “high” or “low” segregation.</p>
<p>If you want to compare over time, again the decomposition method can be used to adjust for changes in diversity. This map of changes in racial segregation in Brooklyn does this, and shows where segregation increased (red) and declined (blue) in Brooklyn between 2000 and 2010 using diversity-adjusted local segregation scores (<a href="https://journals.sagepub.com/doi/10.1177/0049124121986204">source of map</a>).</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2021-07-23-segregation-increase/brooklyn.jpeg" class="img-fluid figure-img" style="width:50.0%"></p>
<figcaption>Changing segregation in Brooklyn</figcaption>
</figure>
</div>
<p>A final point on the H index: The only difference between M and H is the division by the racial group entropy. We can therefore also define local scores for the H index. These are not normalized, but they still work as a decomposition, i.e.,</p>
<p><img src="https://latex.codecogs.com/png.latex?L_%7Bu%7D%5E%7B(H)%7D%20=%20%5Cfrac%7BL_%7Bu%7D%7D%7BE(%5Cmathbf%7Bp%7D_%7B%5Ccdot%20g%7D)%7D"></p>
<p>and</p>
<p><img src="https://latex.codecogs.com/png.latex?%20H%20=%20%5Csum_%7Bu=1%7D%5E%7BU%7Dp_%7Bu%5Ccdot%7DL_%7Bu%7D%5E%7B(H)%7D%20=%20%5Cfrac%7BM%7D%7BE(%5Cmathbf%7Bp%7D_%7B%5Ccdot%20g%7D)%7D,%20"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?E(%5Cmathbf%7Bp%7D_%7B%5Ccdot%20g%7D)"> is the racial group entropy of the metropolitan area.</p>
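<p>As a quick numeric check (a standalone Python sketch; the group proportions and the M value are illustrative, made-up numbers), dividing M by the racial group entropy yields H:</p>

```python
from math import log

# illustrative metro-wide racial proportions and M index value
p_g = {"A": 140 / 300, "B": 160 / 300}
M = 0.1847

# racial group entropy E(p_.g) of the metropolitan area
E = -sum(p * log(p) for p in p_g.values())
H = M / E  # Theil's H normalizes M by the group entropy
print(round(E, 3), round(H, 3))  # 0.691 0.267
```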
<p>The scores <img src="https://latex.codecogs.com/png.latex?L_%7Bu%7D%5E%7B(H)%7D"> are again useful to understand where in a metropolitan area segregation is lowest or highest, but they are equally problematic when used to compare across metro areas or across time.</p>


</section>

 ]]></description>
  <category>segregation</category>
  <guid>https://elbersb.com/public/posts/2021-07-23-segregation-increase/</guid>
  <pubDate>Thu, 22 Jul 2021 22:00:00 GMT</pubDate>
</item>
<item>
  <title>Simulations in Julia: Efficient by default</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2021-06-28-interaction-simulation/</link>
  <description><![CDATA[ 




<p>Inspired by Grant McDermott’s <a href="https://grantmcdermott.com/efficient-simulations-in-r/">blog post on efficient simulations in R</a>, I decided to reimplement the same exercise in Julia. This post will not make much sense without having read that excellent post, so I’d recommend doing that first.</p>
<p>I recently switched to doing simulation work in Julia instead of R, because you don’t need many tricks to achieve decent performance on the first try. Compared to R, it is not necessary (at least in this example) to generate all the data at once. Instead, we can implement the simulation algorithm more naturally: generate a dataset, extract the quantities of interest, and repeat this process <em>N</em> times. I find that this leads to more intuitive and readable code, and Julia’s great for this kind of task.</p>
<section id="generate-the-data" class="level2">
<h2 class="anchored" data-anchor-id="generate-the-data">1. Generate the data</h2>
<p>We start by implementing a function <code>gen_data()</code> to generate a DataFrame. The code is basically a one-to-one translation from R, but we only generate one instance of the data.</p>
<div class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode julia code-with-copy"><code class="sourceCode julia"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">using</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">Distributions</span> </span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">using</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">DataFrames</span>     </span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">const</span> std_normal <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Normal</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb1-5"></span>
<span id="cb1-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gen_data</span>()</span>
<span id="cb1-7">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## total time periods in the the panel = 500</span></span>
<span id="cb1-8">  tt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span></span>
<span id="cb1-9"></span>
<span id="cb1-10">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># x1 and x2 covariates</span></span>
<span id="cb1-11">  x1_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rand</span>(std_normal, tt)</span>
<span id="cb1-12">  x1_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rand</span>(std_normal, tt)</span>
<span id="cb1-13">  x2_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.+</span> x1_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rand</span>(std_normal, tt)</span>
<span id="cb1-14">  x2_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.+</span> x1_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rand</span>(std_normal, tt)</span>
<span id="cb1-15"></span>
<span id="cb1-16">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># outcomes (notice different slope coefs for x2_A and x2_B)</span></span>
<span id="cb1-17">  y_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x1_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>x2_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rand</span>(std_normal, tt)</span>
<span id="cb1-18">  y_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x1_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>x2_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rand</span>(std_normal, tt)</span>
<span id="cb1-19"></span>
<span id="cb1-20">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># combine</span></span>
<span id="cb1-21">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">DataFrame</span>(</span>
<span id="cb1-22">    id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vcat</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fill</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(x1_A)), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fill</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(x1_B))),</span>
<span id="cb1-23">    x1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vcat</span>(x1_A, x1_B),</span>
<span id="cb1-24">    x2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vcat</span>(x2_A, x2_B),</span>
<span id="cb1-25">    x1_dmean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vcat</span>(x1_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(x1_A), x1_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(x1_B)),</span>
<span id="cb1-26">    x2_dmean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vcat</span>(x2_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(x2_A), x2_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(x2_B)),</span>
<span id="cb1-27">    y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vcat</span>(y_A, y_B))</span>
<span id="cb1-28"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span></span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="3">
<pre><code>gen_data (generic function with 1 method)</code></pre>
</div>
</div>
<p>This should hopefully be pretty clear, even if you have never seen any Julia code before. The function <code>vcat()</code> is used to concatenate vectors, and <code>fill()</code> is similar to <code>rep()</code> in R. Another thing that might be unusual is having to write <code>.+</code> instead of <code>+</code> to achieve vector addition. While this still trips me up occasionally, I think it’s one of <a href="https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting">Julia’s best features</a>.</p>
</section>
<section id="extract-the-quantities-of-interest" class="level2">
<h2 class="anchored" data-anchor-id="extract-the-quantities-of-interest">2. Extract the quantities of interest</h2>
<p>Given a dataset, we can now run the two regressions and extract the coefficients. Here’s a function to achieve that using the GLM package:</p>
<div class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode julia code-with-copy"><code class="sourceCode julia"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">using</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">GLM</span></span>
<span id="cb3-2"></span>
<span id="cb3-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coefs_lm_formula</span>(data)</span>
<span id="cb3-4">  mod_level <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(<span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">@formula</span>(y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> x1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> x2), data)</span>
<span id="cb3-5">  mod_dmean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(<span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">@formula</span>(y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> x1_dmean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> x2_dmean), data)</span>
<span id="cb3-6">  (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(mod_level)[<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span>], <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(mod_dmean)[<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span>])</span>
<span id="cb3-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span></span>
<span id="cb3-8"></span>
<span id="cb3-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># example</span></span>
<span id="cb3-10">data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gen_data</span>()</span>
<span id="cb3-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coefs_lm_formula</span>(data)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="4">
<pre><code>(-0.14253265966089987, -0.01929471565297597)</code></pre>
</div>
</div>
<p>Now we just need to repeat these last two lines a large number of times, and save the coefficients.</p>
</section>
<section id="repeat-n-times" class="level2">
<h2 class="anchored" data-anchor-id="repeat-n-times">3. Repeat N times</h2>
<p>The function below runs the above <em>nsim</em> times and stores the two coefficients in a matrix. I use the BenchmarkTools package to benchmark this function.</p>
<div class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode julia code-with-copy"><code class="sourceCode julia"><span id="cb5-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">run_simulations</span>(nsim)</span>
<span id="cb5-2">  sims <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">zeros</span>(nsim, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>);</span>
<span id="cb5-3">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>nsim</span>
<span id="cb5-4">    data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gen_data</span>()</span>
<span id="cb5-5">    sims[i, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coefs_lm_formula</span>(data)</span>
<span id="cb5-6">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span></span>
<span id="cb5-7">  sims</span>
<span id="cb5-8"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span></span>
<span id="cb5-9"></span>
<span id="cb5-10"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">using</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">BenchmarkTools</span></span>
<span id="cb5-11">n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">20000</span></span>
<span id="cb5-12"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">@btime</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">run_simulations</span>(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>n);</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>  4.862 s (19160002 allocations: 16.35 GiB)</code></pre>
</div>
</div>
<p>Around 5 seconds – not bad at all for this “naive” implementation that doesn’t make any use of particular performance tricks. A simple graph shows that the results are the same as in Grant’s post:</p>
<div class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode julia code-with-copy"><code class="sourceCode julia"><span id="cb7-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">using</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">Plots</span></span>
<span id="cb7-2">sims <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">run_simulations</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">20000</span>)</span>
<span id="cb7-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">histogram</span>(sims, label <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"level"</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dmean"</span>])</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="6">
<div>
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2021-06-28-interaction-simulation/index_files/figure-html/cell-6-output-1.svg" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
</section>
<section id="some-performance-improvements" class="level2">
<h2 class="anchored" data-anchor-id="some-performance-improvements">Some performance improvements</h2>
<p>While this is a pretty good result already, there are of course numerous ways to speed this up. Being a novice to Julia, I’m probably not the best person to show this, but here’s an attempt anyway. One straightforward way to speed this up is to avoid creating the model matrix using the <code>@formula</code> call and instead to create the matrix ourselves.</p>
<p>Here’s a way to do this (I’ll simply overwrite the existing function):</p>
<div class="cell" data-execution_count="6">
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode julia code-with-copy"><code class="sourceCode julia"><span id="cb8-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coefs_lm_formula</span>(data)</span>
<span id="cb8-2">  constant <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fill</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(data))</span>
<span id="cb8-3">  X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">Float64</span>[constant data.id data.x1 data.x2 data.x1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.*</span> data.x2]</span>
<span id="cb8-4">  mod_level <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fit</span>(LinearModel, X, data.y)</span>
<span id="cb8-5">  X[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.=</span> data.x1_dmean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.*</span> data.x2_dmean</span>
<span id="cb8-6">  mod_dmean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fit</span>(LinearModel, X, data.y)</span>
<span id="cb8-7">  (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(mod_level)[<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span>], <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(mod_dmean)[<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span>])</span>
<span id="cb8-8"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span></span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="7">
<pre><code>coefs_lm_formula (generic function with 1 method)</code></pre>
</div>
</div>
<p>And benchmark again:</p>
<div class="cell" data-execution_count="7">
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode julia code-with-copy"><code class="sourceCode julia"><span id="cb10-1"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">@btime</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">run_simulations</span>(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>n);</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>  1.888 s (2700002 allocations: 8.58 GiB)</code></pre>
</div>
</div>
<p>A good speedup for a rather simple change!</p>
<p>A final idea is to fit the model more efficiently. The <code>fit</code> function from GLM also computes standard errors and p-values, which are unnecessary for this example. Here’s a benchmark of a version that skips them:</p>
<div class="cell" data-execution_count="8">
<div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode julia code-with-copy"><code class="sourceCode julia"><span id="cb12-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">using</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">LinearAlgebra</span> </span>
<span id="cb12-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fastfit</span>(X, y) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cholesky!</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Symmetric</span>(X<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">'</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> X)) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span> (X<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">'</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> y)</span>
<span id="cb12-3"></span>
<span id="cb12-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coefs_lm_formula</span>(data)</span>
<span id="cb12-5">  constant <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fill</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(data))</span>
<span id="cb12-6">  X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">Float64</span>[constant data.id data.x1 data.x2 data.x1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.*</span> data.x2]</span>
<span id="cb12-7">  mod_level <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fastfit</span>(X, data.y)</span>
<span id="cb12-8">  X[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.=</span> data.x1_dmean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.*</span> data.x2_dmean</span>
<span id="cb12-9">  mod_dmean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fastfit</span>(X, data.y)</span>
<span id="cb12-10">  (mod_level[<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span>], mod_dmean[<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span>])</span>
<span id="cb12-11"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span></span>
<span id="cb12-12"></span>
<span id="cb12-13"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">@btime</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">run_simulations</span>(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>n);</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>  1.416 s (1900002 allocations: 4.64 GiB)</code></pre>
</div>
</div>
<p>Looks like calculating the standard errors and the p-values is not such an expensive operation after all.</p>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>The goal of this post was not to squeeze out the last bit of performance for this particular kind of simulation. Instead, I hope that this post shows that the first “naive” implementation in Julia (which would often be <em>extremely</em> slow in R) is often fast enough. There are many other advantages to Julia, but this is a major one for me.</p>


</section>

 ]]></description>
  <category>statistics</category>
  <guid>https://elbersb.com/public/posts/2021-06-28-interaction-simulation/</guid>
  <pubDate>Sun, 27 Jun 2021 22:00:00 GMT</pubDate>
</item>
<item>
  <title>New paper in SMR: A Method for Studying Difference in Segregation Levels Across Time and Space</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2021-03-08-smr-paper/</link>
  <description><![CDATA[ 




<p>Benjamin Elbers. <a href="https://journals.sagepub.com/doi/10.1177/0049124121986204"><strong>A Method for Studying Difference in Segregation Levels Across Time and Space</strong></a>. Sociological Methods and Research.</p>
<ul>
<li><a href="https://osf.io/preprints/socarxiv/ya7zs/">Preprint</a> and <a href="https://osf.io/pwhdk/">Online Appendix</a> and <a href="https://osf.io/q57tj/">Replication Materials</a></li>
</ul>
<section id="the-problem-margin-dependency" class="level2">
<h2 class="anchored" data-anchor-id="the-problem-margin-dependency">The Problem: Margin dependency</h2>
<p>An important topic in the study of segregation is the comparison of segregation levels across space and time. It has been recognized for a long time that many segregation indices are margin-dependent, which complicates such comparisons. For instance, it can be shown that the index of dissimilarity (<img src="https://latex.codecogs.com/png.latex?D">) is margin-dependent in terms of the units under study (e.g., neighborhoods or schools), but not in terms of the groups (e.g., racial/income groups). This led to a debate in the gender segregation literature in the 1990s, where Charles and Grusky (AJS 1995, Demography 1998) advocated the use of log-linear modeling.</p>
<p>Consider the following four tables, which cross-classify the number of male and female employees across the occupations A, B, and C. Table (1) shows the baseline situation. In Table (2), occupation C has grown, while in Table (3) female employment increased across all occupations. Table (4) shows an extreme example, where the integrated occupation B has grown strongly.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2021-03-08-smr-paper/tables.png" class="img-fluid figure-img"></p>
<figcaption>Tables 1-4</figcaption>
</figure>
</div>
<p>How do different segregation measures characterize these situations? The following table shows how the popular <img src="https://latex.codecogs.com/png.latex?D">, <img src="https://latex.codecogs.com/png.latex?M">, and <img src="https://latex.codecogs.com/png.latex?H"> indices, as well as Charles and Grusky’s log-linear index <img src="https://latex.codecogs.com/png.latex?A">, quantify the amount of segregation. Also shown are the two odds ratios <img src="https://latex.codecogs.com/png.latex?(F_%7BA%7D/M_%7BA%7D)/(F_%7BB%7D/M_%7BB%7D)"> and <img src="https://latex.codecogs.com/png.latex?(F_%7BC%7D/M_%7BC%7D)/(F_%7BB%7D/M_%7BB%7D)">.</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Table</th>
<th><img src="https://latex.codecogs.com/png.latex?D"></th>
<th><img src="https://latex.codecogs.com/png.latex?M"></th>
<th><img src="https://latex.codecogs.com/png.latex?H"></th>
<th><img src="https://latex.codecogs.com/png.latex?A"></th>
<th>Odds ratios</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>(1)</td>
<td>0.465</td>
<td>0.203</td>
<td>0.295</td>
<td>7.22</td>
<td>0.0714 and 9</td>
</tr>
<tr class="even">
<td>(2)</td>
<td>0.501</td>
<td>0.233</td>
<td>0.337</td>
<td>7.22</td>
<td>0.0714 and 9</td>
</tr>
<tr class="odd">
<td>(3)</td>
<td>0.465</td>
<td>0.206</td>
<td>0.297</td>
<td>7.22</td>
<td>0.0714 and 9</td>
</tr>
<tr class="even">
<td>(4)</td>
<td>0.001</td>
<td>0.000</td>
<td>0.000</td>
<td>7.22</td>
<td>0.0714 and 9</td>
</tr>
</tbody>
</table>
<p>The indices <img src="https://latex.codecogs.com/png.latex?D">, <img src="https://latex.codecogs.com/png.latex?M">, and <img src="https://latex.codecogs.com/png.latex?H"> are margin-dependent in one or both directions, while the log-linear index and the odds ratios stay stable. However, they also stay stable in the extreme example of Table (4), which many would regard as not very segregated.</p>
<p>I make use of the M index, which is margin-dependent in both directions, can be standardized (H index), and is highly decomposable:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AM%20=%5Csum_%7Bu%7Dp_%7B%5Ccdot%20u%7D%5Ctext%7BL%7D_%7Bu%7D%5Ctext%7B%20where%20%7D%5Ctext%7BL%7D_%7Bu%7D=%5Csum_%7Bg%7Dp_%7Bg%7Cu%7D%5Clog%5Cfrac%7Bp_%7Bg%7Cu%7D%7D%7Bp_%7Bg%5Ccdot%7D%7D%0A"></p>
<p>The index is defined for a <img src="https://latex.codecogs.com/png.latex?U%20%5Ctimes%20G"> contingency table, where <img src="https://latex.codecogs.com/png.latex?u"> indexes the units and <img src="https://latex.codecogs.com/png.latex?g"> the groups; where <img src="https://latex.codecogs.com/png.latex?p_%7B%5Ccdot%20u%7D"> (<img src="https://latex.codecogs.com/png.latex?p_%7Bg%20%5Ccdot%7D">) is the marginal probability of being in unit <img src="https://latex.codecogs.com/png.latex?u"> (group <img src="https://latex.codecogs.com/png.latex?g">); and where <img src="https://latex.codecogs.com/png.latex?p_%7Bg%7Cu%7D"> is the probability of being in group <img src="https://latex.codecogs.com/png.latex?g"> given unit <img src="https://latex.codecogs.com/png.latex?u">. <img src="https://latex.codecogs.com/png.latex?%7BL%7D_%7Bu%7D"> is called the local segregation score for unit <img src="https://latex.codecogs.com/png.latex?u">.</p>
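<p>As a concrete illustration, here is a minimal Python sketch using a made-up 3×2 table with strictly positive counts (the paper’s actual implementation is the R package segregation), computing the local segregation scores and the M index directly from the formula above:</p>

```python
import numpy as np

# Hypothetical 3x2 contingency table: rows = units, columns = groups
counts = np.array([[10.0, 140.0],
                   [80.0,  80.0],
                   [90.0,  10.0]])

p = counts / counts.sum()          # joint probabilities
p_u = p.sum(axis=1)                # unit marginals p_{.u}
p_g = p.sum(axis=0)                # group marginals p_{g.}
p_g_given_u = p / p_u[:, None]     # conditional probabilities p_{g|u}

# Local segregation scores: L_u = sum_g p_{g|u} * log(p_{g|u} / p_{g.})
L_u = (p_g_given_u * np.log(p_g_given_u / p_g)).sum(axis=1)

# M = sum_u p_{.u} * L_u
M = float(p_u @ L_u)
```

<p>Each local score is a Kullback–Leibler divergence between the unit’s group composition and the overall group composition, and is therefore nonnegative.</p>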
</section>
<section id="the-solution-decomposition-of-m" class="level2">
<h2 class="anchored" data-anchor-id="the-solution-decomposition-of-m">The Solution: Decomposition of <img src="https://latex.codecogs.com/png.latex?M"></h2>
<p>To decompose the difference between two <img src="https://latex.codecogs.com/png.latex?M"> indices at times <img src="https://latex.codecogs.com/png.latex?t_%7B1%7D"> and <img src="https://latex.codecogs.com/png.latex?t_%7B2%7D"> into marginal and structural components, we construct two counterfactual matrices:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?t'_%7B1%7D">, which has the same marginal distributions as <img src="https://latex.codecogs.com/png.latex?t_%7B2%7D">, but the odds ratios from <img src="https://latex.codecogs.com/png.latex?t_%7B1%7D">,</li>
<li><img src="https://latex.codecogs.com/png.latex?t'_%7B2%7D">, which has the same marginal distributions as <img src="https://latex.codecogs.com/png.latex?t_%7B1%7D">, but the odds ratios from <img src="https://latex.codecogs.com/png.latex?t_%7B2%7D">.</li>
</ul>
<p>This allows for the following decomposition:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0AM(t_%7B2%7D)-M(t_%7B1%7D)%20&amp;%20=%5Coverbrace%7B%5Cfrac%7B1%7D%7B2%7D(M(t_%7B2%7D)-M(t'_%7B2%7D))+%5Cfrac%7B1%7D%7B2%7D(M(t'_%7B1%7D)-M(t_%7B1%7D))%7D%5E%7B%5CDelta_%7B%5Ctext%7Bmarginal%7D%7D%7D%5C%5C%0A&amp;%20+%5Cunderbrace%7B%5Cfrac%7B1%7D%7B2%7D(M(t_%7B2%7D)-M(t'_%7B1%7D))+%5Cfrac%7B1%7D%7B2%7D(M(t'_%7B2%7D)-M(t_%7B1%7D))%7D_%7B%5CDelta_%7B%5Ctext%7Bstructural%7D%7D%7D%0A%5Cend%7Baligned%7D%0A"></p>
<p>To construct the two counterfactual matrices, we use Iterative Proportional Fitting (IPF). To construct <img src="https://latex.codecogs.com/png.latex?t'_%7B1%7D">, take <img src="https://latex.codecogs.com/png.latex?t_%7B1%7D"> and adjust all cells towards the column marginals of <img src="https://latex.codecogs.com/png.latex?t_%7B2%7D">. Then adjust all cells towards the row marginals of <img src="https://latex.codecogs.com/png.latex?t_%7B2%7D">. This adjustment towards the column and row marginals is repeated until both marginals have converged, i.e.&nbsp;are similar to those of <img src="https://latex.codecogs.com/png.latex?t_%7B2%7D">.</p>
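<p>The counterfactual construction and the resulting decomposition can be sketched in a few lines of Python (a toy 2×2 example with hypothetical counts; for real analyses, the R package segregation implements all of this):</p>

```python
import numpy as np

def ipf(source, target, tol=1e-10, max_iter=1000):
    """Return a table with the marginals of `target` but the
    odds ratios of `source` (assumes strictly positive cells)."""
    out = source.astype(float).copy()
    row_t = target.sum(axis=1)
    col_t = target.sum(axis=0)
    for _ in range(max_iter):
        out *= (row_t / out.sum(axis=1))[:, None]  # adjust towards row marginals
        out *= col_t / out.sum(axis=0)             # adjust towards column marginals
        if np.allclose(out.sum(axis=1), row_t, atol=tol):
            break
    return out

def m_index(counts):
    """M index for a strictly positive contingency table."""
    p = counts / counts.sum()
    p_u = p.sum(axis=1)[:, None]
    p_g = p.sum(axis=0)
    return float(np.sum(p * np.log(p / (p_u * p_g))))

# Hypothetical tables at t1 and t2 (rows = units, columns = groups)
t1 = np.array([[10.0, 40.0], [30.0, 20.0]])
t2 = np.array([[50.0, 30.0], [20.0, 100.0]])

t1p = ipf(t1, t2)  # marginals of t2, odds ratios of t1
t2p = ipf(t2, t1)  # marginals of t1, odds ratios of t2

delta_marginal = 0.5 * (m_index(t2) - m_index(t2p)) + 0.5 * (m_index(t1p) - m_index(t1))
delta_structural = 0.5 * (m_index(t2) - m_index(t1p)) + 0.5 * (m_index(t2p) - m_index(t1))
```

<p>The two components sum exactly to the total change in the M index, and IPF leaves the odds ratios of the source table untouched.</p>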
<p>There are a few straightforward extensions of this decomposition:</p>
<ul>
<li><strong>Decomposition of <img src="https://latex.codecogs.com/png.latex?%5CDelta">marginal.</strong> It is often of interest to determine how much the row and column marginals have contributed to segregation change separately. To decompose the marginal component further, define <img src="https://latex.codecogs.com/png.latex?M(U;G;O)"> to identify the <img src="https://latex.codecogs.com/png.latex?M"> that is calculated based on the unit marginals from <img src="https://latex.codecogs.com/png.latex?U">, the group marginals from <img src="https://latex.codecogs.com/png.latex?G">, and the odds ratios from <img src="https://latex.codecogs.com/png.latex?O">.</li>
</ul>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5CDelta_%7B%5Ctext%7Bmarginal-units%7D%7D%20&amp;%20=%5Cfrac%7B1%7D%7B4%7D(M(t_%7B2%7D;t_%7B1%7D;t_%7B1%7D)-M(t_%7B1%7D;t_%7B1%7D;t_%7B1%7D))+%5Cfrac%7B1%7D%7B4%7D(M(t_%7B2%7D;t_%7B2%7D;t_%7B1%7D)-M(t_%7B1%7D;t_%7B2%7D;t_%7B1%7D))%5C%5C%0A&amp;%20+%5Cfrac%7B1%7D%7B4%7D(M(t_%7B2%7D;t_%7B2%7D;t_%7B2%7D)-M(t_%7B1%7D;t_%7B2%7D;t_%7B2%7D))+%5Cfrac%7B1%7D%7B4%7D(M(t_%7B2%7D;t_%7B1%7D;t_%7B2%7D)-M(t_%7B1%7D;t_%7B1%7D;t_%7B2%7D))%5C%5C%0A%5CDelta_%7B%5Ctext%7Bmarginal-groups%7D%7D%20&amp;%20=%5Cfrac%7B1%7D%7B4%7D(M(t_%7B1%7D;t_%7B2%7D;t_%7B1%7D)-M(t_%7B1%7D;t_%7B1%7D;t_%7B1%7D))+%5Cfrac%7B1%7D%7B4%7D(M(t_%7B2%7D;t_%7B2%7D;t_%7B1%7D)-M(t_%7B2%7D;t_%7B1%7D;t_%7B1%7D))%5C%5C%0A&amp;%20+%5Cfrac%7B1%7D%7B4%7D(M(t_%7B2%7D;t_%7B2%7D;t_%7B2%7D)-M(t_%7B2%7D;t_%7B1%7D;t_%7B2%7D))+%5Cfrac%7B1%7D%7B4%7D(M(t_%7B1%7D;t_%7B2%7D;t_%7B2%7D)-M(t_%7B1%7D;t_%7B1%7D;t_%7B2%7D))%0A%5Cend%7Baligned%7D%0A"></p>
<p>This decomposition requires six IPF runs in total, and is based on eliminating the marginal contributions in all possible orders (a Shapley value decomposition).</p>
<ul>
<li><strong>Decomposition of <img src="https://latex.codecogs.com/png.latex?%5CDelta">structural.</strong> It is also possible to decompose structural change into the contributions of each individual unit by exploiting the decomposability properties of the <img src="https://latex.codecogs.com/png.latex?M"> index:</li>
</ul>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5CDelta_%7B%5Ctext%7Bstructural%7D%7D%20&amp;%20=%5Cfrac%7B1%7D%7B2%7D(M(t_%7B2%7D)-M(t'_%7B1%7D))+%5Cfrac%7B1%7D%7B2%7D(M(t'_%7B2%7D)-M(t_%7B1%7D))%5C%5C%0A&amp;%20=%5Csum_%7Bu%7D%5Cfrac%7B1%7D%7B2%7D%5Cleft(p_%7B%5Ccdot%20u%7D%5E%7Bt_%7B2%7D%7D%5Cleft%5BL_%7Bu%7D(t_%7B2%7D)-L_%7Bu%7D(t'_%7B1%7D)%5Cright%5D+p_%7B%5Ccdot%20u%7D%5E%7Bt_%7B1%7D%7D%5Cleft%5BL_%7Bu%7D(t'_%7B2%7D)-L_%7Bu%7D(t_%7B1%7D)%5Cright%5D%5Cright)%0A%5Cend%7Baligned%7D%0A"></p>
<ul>
<li><strong>(Dis)appearing units.</strong> In many segregation problems, the researcher has to deal with units that disappear over time, or new units that appear. For instance, in a school segregation problem, schools may close down and new schools may open up. It can be shown that the <img src="https://latex.codecogs.com/png.latex?M"> index provides a clear interpretation for the contribution of these (dis)appearing units towards segregation.</li>
</ul>
</section>
<section id="example-occupational-gender-segregation" class="level2">
<h2 class="anchored" data-anchor-id="example-occupational-gender-segregation">Example: Occupational Gender Segregation</h2>
<p>I now apply the full decomposition to the study of occupational gender segregation of the civilian population of the United States between 1990 and 2016:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0AM(t_%7B2%7D)-M(t_%7B1%7D)%20&amp;%20=%20%5CDelta_%7B%5Ctext%7Badditions%7D%7D%20+%20%5CDelta_%7B%5Ctext%7Bremovals%7D%7D%5C%5C%0A&amp;%20+%20%5CDelta_%7B%5Ctext%7Bmarginal-units%7D%7D%20+%20%5CDelta_%7B%5Ctext%7Bmarginal-groups%7D%7D%5C%5C%0A&amp;%20+%20%5CDelta_%7B%5Ctext%7Bstructural%7D%7D%0A%5Cend%7Baligned%7D%0A"></p>
<p>The data sources are the U.S. Census and the American Community Survey, downloaded from IPUMS; the harmonized occupational codings also come from IPUMS. Some 50 occupations vanish over time, but no new occupations are introduced. The decomposition was carried out for the whole population, as well as for 9 major occupational groups separately.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2021-03-08-smr-paper/diffs.png" class="img-fluid figure-img" style="width:70.0%"></p>
<figcaption>Decomposition of occupational gender segregation by major group</figcaption>
</figure>
</div>
<p>The figure shows that:</p>
<ul>
<li>overall gender segregation has been declining,</li>
<li>much of this is due to changes in the structural component, i.e.&nbsp;the odds ratios,</li>
<li>disappearing occupations do not matter very much, except for operators/laborers,</li>
<li>there is some heterogeneity by major group: declines have been pronounced in some groups, but in some major groups gender segregation has increased,</li>
<li>much of the decline in segregation is structural, while the increase is mostly due to marginal changes,</li>
<li>the three components can offset each other.</li>
</ul>
<p>See also the <a href="https://elbersb.github.io/segregation/">R package segregation</a> which accompanies this paper.</p>


</section>

 ]]></description>
  <category>papers</category>
  <category>segregation</category>
  <guid>https://elbersb.com/public/posts/2021-03-08-smr-paper/</guid>
  <pubDate>Sun, 07 Mar 2021 23:00:00 GMT</pubDate>
</item>
<item>
  <title>Using the bootstrap for bias reduction</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2021-01-07-bootstrap-bias/</link>
  <description><![CDATA[ 




<p>I came across a neat example in Horowitz (2001, p.&nbsp;3174), which demonstrates that, in these specific circumstances at least, the bias-corrected bootstrap estimator has a lower MSE by a large factor. The setup is as follows. We have a sample of 10 iid observations, where <img src="https://latex.codecogs.com/png.latex?X_%7Bi%7D%5Csim%20N(0,6)">. The goal is then to estimate <img src="https://latex.codecogs.com/png.latex?%5Ctheta=%5Cexp(%5Ctext%7BE%7D%5BX_%7Bi%7D%5D)">, for which the true value is <img src="https://latex.codecogs.com/png.latex?%5Ctheta=1">. The plug-in estimator is <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D=%5Cexp%5Cleft(%5Cfrac%7B1%7D%7B10%7D%5Csum_%7Bi=1%7D%5E%7B10%7DX_%7Bi%7D%5Cright)">.</p>
<p>Given a realized sample <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D=(x_%7B1%7D,%5Cldots,x_%7Bn%7D)">, the usual bootstrap estimates are obtained by resampling <img src="https://latex.codecogs.com/png.latex?m"> times from <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D"> with replacement, generating the bootstrap samples <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_%7Bj%7D%5E%7B*%7D">, and the bootstrap estimates <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D_%7Bj%7D%5E%7B*%7D=%5Cexp%5Cleft(%5Cfrac%7B1%7D%7B10%7D%5Csum_%7Bi=1%7D%5E%7B10%7Dx_%7Bj%7D%5E%7B*%7D%5Cright)">. Let <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D%5E%7B*%7D=%5Cfrac%7B1%7D%7Bm%7D%5Csum_%7Bj=1%7D%5E%7Bm%7D%5Chat%7B%5Ctheta%7D_%7Bj%7D%5E%7B*%7D"> be the average across all <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D_%7Bj%7D%5E%7B*%7D">. We can then estimate the bias as <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7B%5Ctext%7BBias%7D%7D%5B%5Chat%7B%5Ctheta%7D%5D=%5Chat%7B%5Ctheta%7D%5E%7B*%7D-%5Chat%7B%5Ctheta%7D">. In R code, this is:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb1-2">data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>))</span>
<span id="cb1-3">(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">thetahat =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(data)))</span>
<span id="cb1-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 1.382411</span></span>
<span id="cb1-5">bs <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">replicate</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, {</span>
<span id="cb1-6">    resample <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(data, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb1-7">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(resample))</span>
<span id="cb1-8">})</span>
<span id="cb1-9">(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">biashat =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(bs) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> thetahat)</span>
<span id="cb1-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 0.2973734</span></span></code></pre></div>
</div>
<p>The “debiased” estimate would hence be <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D-%5Cwidehat%7B%5Ctext%7BBias%7D%7D%5B%5Chat%7B%5Ctheta%7D%5D=2%5Chat%7B%5Ctheta%7D-%5Chat%7B%5Ctheta%7D%5E%7B*%7D">. For the concrete result, this is <img src="https://latex.codecogs.com/png.latex?1.382-0.297=1.085">, much closer to the true value <img src="https://latex.codecogs.com/png.latex?%5Ctheta=1">.</p>
<p>Because we control the data-generating process and know the true value of <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, we can repeat the above procedures any number of times and obtain approximations for the MSEs of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D-%5Cwidehat%7B%5Ctext%7BBias%7D%7D%5B%5Chat%7B%5Ctheta%7D%5D">. The following code accomplishes that for 100 repetitions:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">res <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">replicate</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, {</span>
<span id="cb2-2">    data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>))</span>
<span id="cb2-3">    thetahat <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(data))</span>
<span id="cb2-4">    bs <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">replicate</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, {</span>
<span id="cb2-5">        resample <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(data, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb2-6">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(resample))</span>
<span id="cb2-7">    })</span>
<span id="cb2-8">    (<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">debiased =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> thetahat <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(bs))</span>
<span id="cb2-9">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(thetahat <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, debiased <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, (thetahat <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, (debiased <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb2-10">})</span>
<span id="cb2-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">apply</span>(res, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, mean)</span>
<span id="cb2-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1]  0.37878143 -0.04919049  1.10729457  0.47833810</span></span></code></pre></div>
</div>
<p>By making use of the identity <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BMSE%7D%5B%5Ccdot%5D=%5Ctext%7BBias%7D%5E%7B2%7D%5B%5Ccdot%5D+%5Ctext%7BVar%7D%5B%5Ccdot%5D">, we obtain the following results:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Estimator</th>
<th>MSE</th>
<th>Bias</th>
<th>Variance</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D"></td>
<td>1.107</td>
<td>0.379</td>
<td>0.964</td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D-%5Cwidehat%7B%5Ctext%7BBias%7D%7D%5B%5Chat%7B%5Ctheta%7D%5D"></td>
<td>0.478</td>
<td>-0.049</td>
<td>0.476</td>
</tr>
</tbody>
</table>
<p>Similar to the results reported in Horowitz (2001, p.&nbsp;3175), there is a large reduction in both bias and MSE. Not reported by Horowitz, but also significant, is the reduction in variance. The true bias<sup>1</sup> of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D"> is <img src="https://latex.codecogs.com/png.latex?%5Cexp(0.3)%20-%201%20%5Capprox%200.35">, so the simulation estimate is not far off.</p>
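<p>The simulation above can be sketched in Python as well (a minimal re-implementation with NumPy, not the original code; following the footnote, each run draws n&nbsp;=&nbsp;10 observations with Var[X<sub>i</sub>]&nbsp;=&nbsp;6, so that the sample mean is N(0,&nbsp;0.6), and the number of bootstrap replicates per run is an arbitrary choice here):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def one_run(n=10, var_x=6.0, n_boot=500):
    # X_i ~ N(0, 6), so thetahat = exp(mean(X)) targets theta = exp(0) = 1
    data = rng.normal(0.0, np.sqrt(var_x), size=n)
    thetahat = np.exp(data.mean())
    # nonparametric bootstrap: resample the data with replacement
    resamples = rng.choice(data, size=(n_boot, n), replace=True)
    bs = np.exp(resamples.mean(axis=1))
    # bias-corrected estimator: 2 * thetahat - mean of bootstrap replicates
    debiased = 2.0 * thetahat - bs.mean()
    return thetahat - 1.0, debiased - 1.0

errors = np.array([one_run() for _ in range(1000)])
bias_plain, bias_corrected = errors.mean(axis=0)
# bias_plain should be near exp(0.3) - 1 (about 0.35); bias_corrected near zero
```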
<section id="references" class="level4">
<h4 class="anchored" data-anchor-id="references">References</h4>
<p>Horowitz, Joel L. 2001. “The Bootstrap.” In: Handbook of Econometrics, Volume 5, edited by J. J. Heckman and E. Leamer. Elsevier.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Let <img src="https://latex.codecogs.com/png.latex?Y%20=%20%5Cfrac%7B1%7D%7B10%7D%20%5Csum%20X_i">, then <img src="https://latex.codecogs.com/png.latex?Y%20%5Csim%20N(0,%200.6)">, and <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D=%20%5Cexp%20(Y)%20%5Csim%20%5Ctext%7BLogNormal%7D(0,%200.6)">. A log-normal random variable has mean <img src="https://latex.codecogs.com/png.latex?%5Cexp%20%5Cleft(%20%5Cmu%20+%20%5Cfrac%7B%5Csigma%5E2%7D%7B2%7D%20%5Cright)">, hence <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BE%7D%5B%5Chat%7B%5Ctheta%7D%5D%20=%20%5Cexp(0.3)">.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>statistics</category>
  <guid>https://elbersb.com/public/posts/2021-01-07-bootstrap-bias/</guid>
  <pubDate>Wed, 06 Jan 2021 23:00:00 GMT</pubDate>
</item>
<item>
  <title>Regression Assumptions, one by one</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2020-10-01-regression-assumptions/</link>
  <description><![CDATA[ 




<!-- • Rewrite this under the assumption that X is random (Hayashi explains this nicely). This is more realistic for non-experimental sciences.

• State at which point Gauss-Markov (BLUE) comes into play

• WHY? "Normality is a concern if you are trying to predict a data point but not if you are trying to approximate a conditional expectation."

• "The question, then, is not whether a laundry list of model assumptions hold, but whether we have a sufficient sample size, and whether the object of interest is the best linear approximation of E[y|X]. It’s true that OLS has even nicer properties when some of these assumptions hold (linearity gives us unbiasedness, homoscedasticity gives us efficiency in the class of linear unbiased estimators, normality gives us asymptotic efficiency in the class of (linear and nonlinear) unbiased estimators). But we may still be interested in this object even if none of these assumptions hold. Linear approximations are easily interpreted and analytically/computationally convenient, and in general E[y|X] is going to be a messy object to try to work with directly." -->
<p>Many textbooks on linear regression start with a fully-fledged regression model, including assumptions such as independence and homoscedasticity from the outset. I have always found this treatment unintuitive, as many results, such as the unbiasedness of the parameter estimates, hold without making these stronger assumptions. The goal of this post is to introduce the assumptions of the OLS model one by one, and only when they become necessary. The major assumptions are, in the order they are introduced: linearity, no perfect collinearity, zero conditional mean, independence, homoscedasticity, and the normal distribution of errors.</p>
<p>The material is inspired by a number of textbooks, most importantly Gelman and Hill (2006) and Hayashi (2000). The latter is especially helpful because it clearly states which assumptions are needed for each result.</p>
<section id="the-basic-assumption-linearity" class="level2">
<h2 class="anchored" data-anchor-id="the-basic-assumption-linearity">The basic assumption: Linearity</h2>
<p>In a regression model, we model the outcome as a linear function of a number of predictors. The model is of the following form:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ay=%5Cbeta_%7B0%7D+%5Cbeta_%7B1%7Dx_%7B1%7D+%5Cbeta_%7B2%7Dx_%7B2%7D+%5Cldots+%5Cepsilon%0A"></p>
<p>We know the outcome, <img src="https://latex.codecogs.com/png.latex?y">, and the predictors <img src="https://latex.codecogs.com/png.latex?x_%7B1%7D,x_%7B2%7D,%5Cldots"> The <img src="https://latex.codecogs.com/png.latex?%5Cbeta_%7Bj%7D"> are unknown, and we want to estimate them from the data. Because we rarely assume that the relationship between <img src="https://latex.codecogs.com/png.latex?y"> and <img src="https://latex.codecogs.com/png.latex?x_%7B1%7D,x_%7B2%7D,%5Cldots"> is completely deterministic, we also add an error term, <img src="https://latex.codecogs.com/png.latex?%5Cepsilon."> This term captures anything about <img src="https://latex.codecogs.com/png.latex?y"> that is not captured by the predictors. The most important assumption in regression is in this equation: We assume that the relationship between the outcome and the predictors is additive and linear. This assumption can be relaxed somewhat. For instance, we could interact predictors to produce the model</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ay=%5Cbeta_%7B0%7D+%5Cbeta_%7B1%7Dx_%7B1%7D+%5Cbeta_%7B2%7Dx_%7B2%7D+%5Cbeta_%7B3%7D(x_%7B1%7Dx_%7B2%7D)+%5Cepsilon.%0A"></p>
<p>In this model, <img src="https://latex.codecogs.com/png.latex?y"> is no longer a linear function of <img src="https://latex.codecogs.com/png.latex?x_%7B1%7D"> and <img src="https://latex.codecogs.com/png.latex?x_%7B2%7D">. However, the regression model is still linear in the parameters, although we have to pay attention when interpreting the coefficients. The same is true for a model of the form <img src="https://latex.codecogs.com/png.latex?y=%5Cbeta_%7B0%7D+%5Cbeta_%7B1%7Dx_%7B1%7D+%5Cbeta_%7B2%7Dx_%7B1%7D%5E%7B2%7D+%5Cepsilon,"> which includes a squared term. If we assume a multiplicative model <img src="https://latex.codecogs.com/png.latex?y=x_%7B1%7D%5E%7B%5Cbeta_%7B1%7D%7Dx_%7B2%7D%5E%7B%5Cbeta_%7B2%7D%7D,"> we can transform it into a linear model by taking logarithms: <img src="https://latex.codecogs.com/png.latex?%5Clog%20y=%5Cbeta_%7B1%7D%5Clog%20x_%7B1%7D+%5Cbeta_%7B2%7D%5Clog%20x_%7B2%7D."></p>
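<p>The point that interactions and polynomial terms keep the model linear in the parameters can be made concrete: they are simply additional columns of the design matrix, and the same least-squares machinery applies. A small sketch with made-up data (NumPy, hypothetical coefficients):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=100), rng.normal(size=100)

# interaction and squared terms are just extra columns of the design
# matrix; the model remains linear in the betas
X = np.column_stack([np.ones(100), x1, x2, x1 * x2, x1 ** 2])
y = 1 + 2 * x1 - x2 + 0.5 * x1 * x2 + 0.3 * x1 ** 2 + rng.normal(size=100)

# ordinary least squares handles this design like any other
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```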
</section>
<section id="regression-as-an-optimization-problem" class="level2">
<h2 class="anchored" data-anchor-id="regression-as-an-optimization-problem">Regression as an optimization problem</h2>
<p>We now collect some data in order to estimate the unknown quantities of this model, the <img src="https://latex.codecogs.com/png.latex?%5Cbeta_%7Bj%7D."> We collect data on the outcome <img src="https://latex.codecogs.com/png.latex?y"> and the predictors <img src="https://latex.codecogs.com/png.latex?x_%7B1%7D,x_%7B2%7D,%5Cldots"> To find values for the <img src="https://latex.codecogs.com/png.latex?%5Cbeta_%7Bj%7D">, first, let’s just guess. For instance, we could roll a die repeatedly and record the results. We call the result <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7Bj%7D,"> to show that we have now estimated <img src="https://latex.codecogs.com/png.latex?%5Cbeta_%7Bj%7D">. To be sure, this estimate will be pretty bad (we didn’t use any of our data to find it!). To formalize the idea that our made-up <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7Bj%7D">’s will likely be bad estimates, we compute for each observation the predicted values, <img src="https://latex.codecogs.com/png.latex?%5Chat%7By%7D_%7Bi%7D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7By%7D_%7Bi%7D%20=%20%5Chat%7B%5Cbeta%7D_%7B0%7D+%5Chat%7B%5Cbeta%7D_%7B1%7Dx_%7Bi1%7D+%5Chat%7B%5Cbeta%7D_%7B2%7Dx_%7Bi2%7D+%5Cldots%0A"></p>
<p>We use only the deterministic part of the model to compute <img src="https://latex.codecogs.com/png.latex?%5Chat%7By%7D,"> because, by definition, we do not know <img src="https://latex.codecogs.com/png.latex?%5Cepsilon_%7Bi%7D."> Intuitively, if we simply guess the parameter values, the predicted values <img src="https://latex.codecogs.com/png.latex?%5Chat%7By%7D_%7Bi%7D"> will have little to do with the actual values <img src="https://latex.codecogs.com/png.latex?y_%7Bi%7D">. One way to quantify how good or bad our guesses are is to compute a loss function, which is some function that quantifies how much the predicted values deviate from the true, known values. One such loss function is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7B%5Cmathscr%7BL%7D%7D_%7B2%7D(%5Chat%7B%5Cbeta%7D_%7B0%7D,%5Chat%7B%5Cbeta%7D_%7B1%7D,%5Cldots)=%5Csum_%7Bi=1%7D%5E%7Bn%7D(y_%7Bi%7D-%5Chat%7By%7D_%7Bi%7D)%5E%7B2%7D,%0A"></p>
<p>This loss function, called the squared error loss, is always positive, and ideally, we would like its value to be small. For a random guess, the differences between <img src="https://latex.codecogs.com/png.latex?y_%7Bi%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Chat%7By%7D_%7Bi%7D"> will likely be large. A better guess would reduce the value of the loss function. This choice of loss function may seem somewhat arbitrary. For instance, one might ask why we didn’t choose the following loss function:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7B%5Cmathscr%7BL%7D%7D_%7B1%7D(%5Chat%7B%5Cbeta%7D_%7B0%7D,%5Chat%7B%5Cbeta%7D_%7B1%7D,%5Cldots)=%5Csum_%7Bi=1%7D%5E%7Bn%7D%7Cy_%7Bi%7D-%5Chat%7By%7D_%7Bi%7D%7C.%0A"></p>
<p>This is also a very reasonable loss function: it is always positive, and we would like its value to be small. There are infinitely many other possible loss functions, and in many situations it will be advantageous to choose one of these alternatives. However, the squared error loss has some advantages. Perhaps most importantly, it’s easy to find a closed-form expression for the <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_j">’s that minimizes the loss. It is from this choice of loss function that ordinary least <em>squares</em> gets its name; but notably, this refers to the estimation method. Nothing in our regression model so far has told us that we require <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7B%5Cmathscr%7BL%7D%7D_%7B2%7D"> as a loss function; it’s just one way to estimate the parameters <img src="https://latex.codecogs.com/png.latex?%5Cbeta_%7Bj%7D">.</p>
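<p>The two loss functions can lead to quite different estimates. As a toy illustration (made-up data, intercept-only model, brute-force grid search rather than any closed-form solution): the squared error loss is minimized at the sample mean, while the absolute error loss is minimized at a sample median, which is far less sensitive to an extreme observation.</p>

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 100.0])  # one extreme value

# intercept-only model: scan candidate fits and evaluate both losses
grid = np.linspace(0.0, 100.0, 100001)
l2 = ((data[:, None] - grid) ** 2).sum(axis=0)
l1 = np.abs(data[:, None] - grid).sum(axis=0)

best_l2 = grid[l2.argmin()]  # minimized at the sample mean (pulled by the outlier)
best_l1 = grid[l1.argmin()]  # minimized at a sample median (any point in [2, 3])
```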
<p>The goal is now to find a set of parameters <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7Bj%7D"> that minimize the value of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7B%5Cmathscr%7BL%7D%7D_%7B2%7D(%5Chat%7B%5Cbeta%7D_%7B0%7D,%5Chat%7B%5Cbeta%7D_%7B1%7D,%5Cldots)">. At this point, it becomes convenient to switch to matrix notation. We arrange the outcome values and the predictors into an <img src="https://latex.codecogs.com/png.latex?n%5Ctimes1"> column vector <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D=(y_%7B1%7D,y_%7B2%7D,%5Cldots,y_%7Bn%7D)">, and an <img src="https://latex.codecogs.com/png.latex?n%5Ctimes%20p"> matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D">, respectively. The first column of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D"> is just a vector of 1’s; this way we can include <img src="https://latex.codecogs.com/png.latex?%5Cbeta_%7B0%7D"> in the parameter vector. The parameters are collected in a <img src="https://latex.codecogs.com/png.latex?p%5Ctimes1"> column vector <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5Cbeta%7D=(%5Cbeta_%7B0%7D,%5Cbeta_%7B1%7D,%5Cldots,%5Cbeta_%7Bp-1%7D)">.</p>
<p>Through the use of some matrix algebra, we can then rewrite the loss function as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7B%5Cmathscr%7BL%7D%7D_%7B2%7D(%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D)=%7C%7C%5Cmathbf%7By%7D-%5Cmathbf%7B%5Chat%7By%7D%7D%7C%7C_%7B2%7D%5E%7B2%7D=%5Cmathbf%7B%5Cmathbf%7By%7D%7D%5E%7BT%7D%5Cmathbf%7B%5Cmathbf%7By%7D%7D-2%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5E%7BT%7D%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7By%7D+%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5E%7BT%7D%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D.%0A"></p>
<p>To find the minimum, we differentiate <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7B%5Cmathscr%7BL%7D%7D_%7B2%7D(%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D)"> with respect to <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> (this requires a bit of matrix calculus):</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%7D%5Cmathcal%7B%5Cmathscr%7BL%7D%7D_%7B2%7D(%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D)=-2%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7By%7D+2%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D.%0A"></p>
<p>Setting this equal to zero, we get</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D=%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7By%7D,%0A"></p>
<p>the so-called “normal equations.” The remaining step is to premultiply both sides by the inverse of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D">, from which we obtain</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D=(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7By%7D.%0A"></p>
<p>To show that this is really the minimum of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7B%5Cmathscr%7BL%7D%7D_%7B2%7D(%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D)">, we would need to show that <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7B%5Cmathscr%7BL%7D%7D_%7B2%7D(%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D)"> is convex. We skip that step here.</p>
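<p>The closed-form solution is easy to verify numerically. A minimal sketch with simulated data (NumPy; the design and coefficients are made up for illustration): solving the normal equations directly gives the same answer as a library least-squares routine.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up data: y = 1 + 2*x1 - 0.5*x2 + noise
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=n)

# solve the normal equations (X^T X) beta_hat = X^T y directly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# a library least-squares routine gives the same answer
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```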
<!-- NOT STRICTLY CORRECT, we also need -->
<p>The expression that we derived for <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> is attractive in its simplicity. However, the expression relies on the fact that <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D"> is invertible. This will only be the case if <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D"> has full column rank, that is, if no column of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D"> is a perfect linear combination of the other columns. It is also required that <img src="https://latex.codecogs.com/png.latex?n%5Cge%20p,"> i.e.&nbsp;that we have at least as many observations as predictors. If <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D"> does not have full column rank, we could replace <img src="https://latex.codecogs.com/png.latex?(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D"> with a pseudo-inverse. However, that inverse won’t be unique, and so the estimates <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> will no longer be unique either. Going forward, we will assume that <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D"> has full column rank.</p>
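<p>The non-uniqueness under rank deficiency is easy to demonstrate. In the sketch below (NumPy, contrived data), one column is exactly twice another, so the normal equations cannot be solved by ordinary inversion; the Moore–Penrose pseudo-inverse still returns <em>a</em> least-squares solution, but shifting it along the null space of the design leaves the fitted values unchanged.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
# third column is exactly twice the second: X has rank 2, not 3
X = np.column_stack([np.ones(n), x1, 2.0 * x1])
y = 3.0 + x1 + rng.normal(size=n)

# X^T X is singular here, so np.linalg.solve(X.T @ X, X.T @ y) would fail;
# the pseudo-inverse still yields a least-squares solution
beta_pinv = np.linalg.pinv(X) @ y

# but it is not unique: (0, 2, -1) lies in the null space of X,
# so shifting along it leaves the fitted values unchanged
beta_alt = beta_pinv + np.array([0.0, 2.0, -1.0])
```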
<p>If we just want to learn about <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D,"> we are done at this point. We made two big assumptions: We assumed that <img src="https://latex.codecogs.com/png.latex?y"> depends linearly on the inputs <img src="https://latex.codecogs.com/png.latex?x_%7B1%7D,x_%7B2%7D,%5Cldots">, and we assumed that the predictors are not perfectly collinear.</p>
</section>
<section id="the-zero-conditional-mean-assumption" class="level2">
<h2 class="anchored" data-anchor-id="the-zero-conditional-mean-assumption">The zero conditional mean assumption</h2>
<p>Arguably, until now, the math involved only some optimization, but no statistics. However, in statistics, we also care about uncertainty. For instance, we are often interested in whether some input <img src="https://latex.codecogs.com/png.latex?x_%7Bj%7D"> is associated with the outcome <img src="https://latex.codecogs.com/png.latex?y">. If <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7Bj%7D">, the coefficient for <img src="https://latex.codecogs.com/png.latex?x_%7Bj%7D">, is zero, we would conclude that there is no association. But <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7Bj%7D"> might be non-zero just by chance, and so we would like some measure of the uncertainty of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7Bj%7D">.</p>
<p>To do so, we make an assumption about the error term. Currently our model can be written in matrix notation as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbf%7By%7D=%5Cmathbf%7BX%7D%5Cboldsymbol%7B%5Cbeta%7D+%5Cboldsymbol%7B%5Cepsilon%7D.%0A"></p>
<p>We now additionally assume that <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5Cepsilon%7D"> is a random vector with mean <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7B0%7D"> and covariance matrix <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5CSigma%7D">. The assumption that <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BE%7D%5B%5Cboldsymbol%7B%5Cepsilon%7D%5D"> is specifically zero is not very strict. For instance, we could assume <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BE%7D%5B%5Cboldsymbol%7B%5Cepsilon%7D%5D=c">, where <img src="https://latex.codecogs.com/png.latex?c"> is some constant, and then absorb this constant in the coefficient <img src="https://latex.codecogs.com/png.latex?%5Cbeta_%7B0%7D">. However, assuming that the mean of the errors is <em>constant</em> is a big assumption: Essentially it means that, after we have accounted for the predictors <img src="https://latex.codecogs.com/png.latex?x_%7B1%7D,x_%7B2%7D,%5Cldots">, there is no other systematic variation in <img src="https://latex.codecogs.com/png.latex?y">.</p>
<p>It is possibly easier to recognize the severity of this assumption by finding the expressions for the mean and variance of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D">. These are now also random variables, as they are functions of <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5Cepsilon%7D">. We still assume that the predictors are known and fixed. It is then straightforward to derive the expectation and variance of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Ctext%7BE%7D%5B%5Cmathbf%7By%7D%5D%20&amp;%20=%5Ctext%7BE%7D%5B%5Cmathbf%7BX%7D%5Cboldsymbol%7B%5Cbeta%7D+%5Cboldsymbol%7B%5Cepsilon%7D%5D=%5Cmathbf%7BX%7D%5Cboldsymbol%7B%5Cbeta%7D+%5Ctext%7BE%7D%5B%5Cboldsymbol%7B%5Cepsilon%7D%5D=%5Cmathbf%7BX%7D%5Cboldsymbol%7B%5Cbeta%7D%5C%5C%0A%5Ctext%7BVar%7D%5B%5Cmathbf%7By%7D%5D%20&amp;%20=%5Ctext%7BVar%7D%5B%5Cmathbf%7BX%7D%5Cboldsymbol%7B%5Cbeta%7D+%5Cboldsymbol%7B%5Cepsilon%7D%5D=%5Ctext%7BVar%7D%5B%5Cboldsymbol%7B%5Cepsilon%7D%5D=%5Cboldsymbol%7B%5CSigma%7D.%0A%5Cend%7Baligned%7D%0A"></p>
<p>Note that the assumption about <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5Cepsilon%7D"> is equivalent to assuming that <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D"> is a random vector with expectation <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D%5Cboldsymbol%7B%5Cbeta%7D"> and covariance matrix <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5CSigma%7D">. We therefore assume that the information we have in <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D"> is sufficient to model <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BE%7D%5B%5Cmathbf%7By%7D%5D">. The zero conditional mean assumption is frequently violated—the most common occurrence is omitted variable bias.</p>
<p>If <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D"> is a random vector, this also has consequences for our OLS estimator, <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D">, which is a function of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D">. Therefore, <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> is a random vector as well. We can derive the expectation of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> as follows:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Ctext%7BE%7D%5B%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5D%20&amp;%20=%5Ctext%7BE%7D%5B(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7By%7D%5D%5C%5C%0A&amp;%20=(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D%20%5Ctext%7BE%7D%5B%5Cmathbf%7By%7D%5D%5C%5C%0A&amp;%20=(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D%20%5Cmathbf%7BX%7D%20%5Cboldsymbol%7B%5Cbeta%7D%5C%5C%0A&amp;%20=%5Cboldsymbol%7B%5Cbeta%7D.%0A%5Cend%7Baligned%7D%0A"></p>
<p>Hence, under the present assumptions, <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> is unbiased. To arrive at this result, we had to assume linearity, no perfect collinearity, and <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BE%7D%5B%5Cboldsymbol%7B%5Cepsilon%7D%5D=%5Cmathbf%7B0%7D"> (zero conditional mean).</p>
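<p>A short simulation illustrates this unbiasedness result (a sketch with made-up values; note the errors are deliberately drawn from a skewed, mean-zero distribution, since no normality is needed for unbiasedness):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design
beta = np.array([2.0, -1.0])
A = np.linalg.solve(X.T @ X, X.T)  # (X^T X)^{-1} X^T, reused across draws

betas = np.empty((reps, 2))
for r in range(reps):
    # skewed but mean-zero errors: unbiasedness does not require normality
    eps = rng.exponential(1.0, size=n) - 1.0
    betas[r] = A @ (X @ beta + eps)

mean_beta = betas.mean(axis=0)  # should be close to the true (2, -1)
```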
</section>
<section id="standard-errors" class="level2">
<h2 class="anchored" data-anchor-id="standard-errors">Standard errors</h2>
<p>Next, making use of the fact that for a non-random matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BA%7D"> and random vector <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D">, <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5B%5Cmathbf%7BA%7D%5Cmathbf%7By%7D%5D=%5Cmathbf%7BA%7D%5C%20%5Ctext%7BVar%7D%5B%5Cmathbf%7By%7D%5D%5Cmathbf%7BA%7D%5E%7BT%7D">, we find that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Ctext%7BVar%7D%5B%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5D%20&amp;%20=%5Ctext%7BVar%7D%5B(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7By%7D%5D%5C%5C%0A&amp;%20=(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D%5Ctext%7BVar%7D%5B%5Cmathbf%7By%7D%5D%5C%20%5Cmathbf%7BX%7D(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5C%5C%0A&amp;%20=(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D%5Cboldsymbol%7B%5CSigma%7D%5Cmathbf%7BX%7D(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D.%0A%5Cend%7Baligned%7D%0A"></p>
<p>This is the expression that we are interested in. The diagonal elements of <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5B%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5D"> (a <img src="https://latex.codecogs.com/png.latex?p%5Ctimes%20p"> matrix) tell us about the uncertainty in estimating the elements of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D">. To estimate the variance, however, we need to know or estimate <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5CSigma%7D">, and this is where we run into trouble. This matrix will be of the form</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboldsymbol%7B%5CSigma%7D=%5Cbegin%7Bbmatrix%7D%5Ctext%7BVar%7D%5By_%7B1%7D%5D%20&amp;%20%5Ctext%7BCov%7D%5By_%7B1%7D,y_%7B2%7D%5D%20&amp;%20%5Ccdots%20&amp;%20%5Ctext%7BCov%7D%5By_%7B1%7D,y_%7Bn%7D%5D%5C%5C%0A%5Ctext%7BCov%7D%5By_%7B2%7D,y_%7B1%7D%5D%20&amp;%20%5Ctext%7BVar%7D%5By_%7B2%7D%5D%20&amp;%20%5Ccdots%20&amp;%20%5Ctext%7BCov%7D%5By_%7B2%7D,y_%7Bn%7D%5D%5C%5C%0A%5Cvdots%20&amp;%20%5Cvdots%20&amp;%20%5Cddots%20&amp;%20%5Cvdots%5C%5C%0A%5Ctext%7BCov%7D%5By_%7Bn%7D,y_%7B1%7D%5D%20&amp;%20%5Ctext%7BCov%7D%5By_%7Bn%7D,y_%7B2%7D%5D%20&amp;%20%5Ccdots%20&amp;%20%5Ctext%7BVar%7D%5By_%7Bn%7D%5D%0A%5Cend%7Bbmatrix%7D%0A"></p>
<p>This is an <img src="https://latex.codecogs.com/png.latex?n%5Ctimes%20n"> symmetric matrix (<img src="https://latex.codecogs.com/png.latex?%5Ctext%7BCov%7D%5By_%7Bi%7D,y_%7Bj%7D%5D=%5Ctext%7BCov%7D%5By_%7Bj%7D,y_%7Bi%7D%5D">). We therefore need to estimate <img src="https://latex.codecogs.com/png.latex?n"> variances and <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7B2%7D(n%5E%7B2%7D-n)"> covariances, for a total of <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7B2%7D(n%5E%7B2%7D+n)"> elements. However, we only have <img src="https://latex.codecogs.com/png.latex?n"> observations in total! Clearly, this won’t work, and we therefore have to make another assumption: first, we will assume that the <img src="https://latex.codecogs.com/png.latex?y_%7Bi%7D"> are independent. We do not assume the <img src="https://latex.codecogs.com/png.latex?y_%7Bi%7D"> are i.i.d., independent and <em>identically</em> distributed. The mean of <img src="https://latex.codecogs.com/png.latex?y_%7Bi%7D"> depends on <img src="https://latex.codecogs.com/png.latex?x_%7Bi%7D">, and therefore two different <img src="https://latex.codecogs.com/png.latex?y_%7Bi%7D">’s may have a different mean. We can also state this assumption in terms of the errors, as <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5CSigma%7D"> is also the covariance matrix of <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5Cepsilon%7D">. (Strictly speaking, we only require here that the outcomes/errors are uncorrelated, but the independence assumption is more transparent.)</p>
<p>Substantively, the assumption of independence means that we assume <img src="https://latex.codecogs.com/png.latex?E%5By_i%7Cy_j%5D%20=%20E%5By_i%5D">, i.e.&nbsp;that the mean of any outcome does not depend on the values of the other outcomes. Depending on the problem, this assumption can be unrealistic, and it can be relaxed with techniques such as clustered standard errors or multilevel models. The assumption of independence is sometimes replaced with the assumption that our data is a random sample from a population of interest, but I find the statement that the <img src="https://latex.codecogs.com/png.latex?y_i"> are independent more precise.</p>
<p>With the independence assumption, the form of <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5CSigma%7D"> is radically simplified, as all off-diagonal values are now zero:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboldsymbol%7B%5CSigma%7D'=%5Cbegin%7Bbmatrix%7D%5Ctext%7BVar%7D%5By_%7B1%7D%5D%20&amp;%20%20&amp;%20%20&amp;%20%5Cmathbf%7B0%7D%5C%5C%0A&amp;%20%5Ctext%7BVar%7D%5By_%7B2%7D%5D%5C%5C%0A&amp;%20%20&amp;%20%5Cddots%5C%5C%0A%5Cmathbf%7B0%7D%20&amp;%20%20&amp;%20%20&amp;%20%5Ctext%7BVar%7D%5By_%7Bn%7D%5D%0A%5Cend%7Bbmatrix%7D%0A"></p>
<p>We are left with the <img src="https://latex.codecogs.com/png.latex?n"> variances on the diagonal. Estimating <img src="https://latex.codecogs.com/png.latex?n"> variances with <img src="https://latex.codecogs.com/png.latex?n"> data points might still sound impossible, but it is routinely done: when using robust or heteroscedasticity-consistent standard errors, the variances <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5By_%7Bi%7D%5D"> are estimated as <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5By_%7Bi%7D%5D=(y_%7Bi%7D-%5Chat%7By%7D_%7Bi%7D)%5E%7B2%7D"> (White 1980; but see Freedman 2006 and King and Roberts 2015). However, in the standard regression model, we instead make another assumption: homoscedasticity. This means that we assume that all the <img src="https://latex.codecogs.com/png.latex?y_%7Bi%7D"> have the same variance, which we will denote by <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E%7B2%7D">. Thus,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboldsymbol%7B%5CSigma%7D''=%5Cbegin%7Bbmatrix%7D%5Csigma%5E%7B2%7D%20&amp;%20%20&amp;%20%20&amp;%20%5Cmathbf%7B0%7D%5C%5C%0A&amp;%20%5Csigma%5E%7B2%7D%5C%5C%0A&amp;%20%20&amp;%20%5Cddots%5C%5C%0A%5Cmathbf%7B0%7D%20&amp;%20%20&amp;%20%20&amp;%20%5Csigma%5E%7B2%7D%0A%5Cend%7Bbmatrix%7D=%5Csigma%5E%7B2%7D%5Cmathbf%7BI%7D,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BI%7D"> is the <img src="https://latex.codecogs.com/png.latex?n%5Ctimes%20n"> identity matrix. Now, we only have to estimate one parameter with <img src="https://latex.codecogs.com/png.latex?n"> observations. Before we show how to estimate <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E%7B2%7D">, we first note that the expression for the variance becomes much simpler when we assume that the covariance matrix of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D"> is given by <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E%7B2%7D%5Cmathbf%7BI%7D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Ctext%7BVar%7D%5B%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5D%20&amp;%20=(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D(%5Csigma%5E%7B2%7D%5Cmathbf%7BI%7D)%5Cmathbf%7BX%7D(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5C%5C%0A&amp;%20=%5Csigma%5E%7B2%7D(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5C%5C%0A&amp;%20=%5Csigma%5E%7B2%7D(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%0A%5Cend%7Baligned%7D%0A"></p>
<p>It remains to estimate <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E%7B2%7D">. Under the assumptions made, it can be shown that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7B%5Csigma%7D%5E%7B2%7D=%5Cfrac%7B1%7D%7Bn-p%7D%5Csum_%7Bi=1%7D%5E%7Bn%7D(y_%7Bi%7D-%5Chat%7By%7D_%7Bi%7D)%5E%7B2%7D%0A"></p>
<p>is an unbiased estimator for <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E%7B2%7D">. The expression is just the average squared error that our model makes in predicting <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D">, adjusted for the <img src="https://latex.codecogs.com/png.latex?p"> degrees of freedom that we required to estimate <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D">. Hence, for an estimate of the variance of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D">, we can use</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5B%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5D=%20%5Chat%7B%5Csigma%7D%5E2%20(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D,%0A"></p>
<p>and the square root of this expression is, of course, the standard error. If we are just interested in the standard error, we can stop here.</p>
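<p>These formulas are easy to check directly with numpy. The following is a minimal sketch on simulated data (all names and numbers are made up for illustration, not taken from the post), computing the OLS estimate, the unbiased variance estimate, and the standard errors:</p>

```python
import numpy as np

# Sketch: OLS coefficients and standard errors via the formulas above.
rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])          # illustrative values
y = X @ beta_true + rng.normal(scale=1.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                    # OLS estimate
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)            # unbiased estimate of sigma^2
var_beta_hat = sigma2_hat * XtX_inv             # estimated Var[beta_hat]
se = np.sqrt(np.diag(var_beta_hat))             # standard errors
```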
</section>
<section id="the-normality-assumption" class="level2">
<h2 class="anchored" data-anchor-id="the-normality-assumption">The Normality assumption</h2>
<p>We are now able to estimate <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BE%7D%5B%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5D"> and <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5B%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5D">. But if we want to perform statistical inference, we need to know more than just these first two moments. This is where the normality assumption comes in. This assumption applies to the errors, i.e.&nbsp;we assume that <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5Cepsilon%7D%5Csim%5Ctext%7BMVN%7D(0,%5Csigma%5E%7B2%7D%5Cmathbf%7BI%7D)">. From this it follows that <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D%5Csim%5Ctext%7BMVN%7D(%5Cmathbf%7BX%7D%5Cboldsymbol%7B%5Cbeta%7D,%5Csigma%5E%7B2%7D%5Cmathbf%7BI%7D),"> and</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5Csim%5Ctext%7BMVN%7D%5Cleft(%5Cboldsymbol%7B%5Cbeta%7D,%5Csigma%5E%7B2%7D(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cright).%0A"></p>
<p>With this assumption in place, we know the exact sampling distribution of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D">, which allows us to derive confidence intervals for <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D">. These expressions show that the normality assumption is computationally convenient: assuming normal errors directly implies that <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> is also normally distributed. The normality assumption can also be justified through the central limit theorem. In other words, even without the explicit assumption, we could have justified the use of a normal approximation.</p>
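<p>The payoff of knowing the exact sampling distribution can be checked by simulation. This sketch (mine, not from the post; all values are made up) verifies that the normal-theory 95% confidence interval for a slope covers the true value roughly 95% of the time:</p>

```python
import numpy as np

# Monte Carlo check of normal-theory confidence intervals for a slope.
rng = np.random.default_rng(1)
n, sigma = 100, 1.0
beta = np.array([0.5, -1.0])      # intercept and slope (illustrative)
reps, covered = 2000, 0
for _ in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ beta + rng.normal(scale=sigma, size=n)
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (n - 2)                  # unbiased sigma^2 estimate
    se = np.sqrt(s2 * XtX_inv[1, 1])              # s.e. of the slope
    covered += b[1] - 1.96 * se <= beta[1] <= b[1] + 1.96 * se
coverage = covered / reps                         # should be close to 0.95
```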
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>It turns out that this list is similar to the one presented by Gelman and Hill (2006, p.&nbsp;45f.), who list validity, additivity and linearity, independence of errors, equal variance of errors, and normality of errors as assumptions, in order of their importance. Their list, however, adds another major assumption to the front: validity. In their words,</p>
<blockquote class="blockquote">
<p>Most importantly, the data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient. Optimally, this means that the outcome measure should accurately reflect the phenomenon of interest, the model should include all relevant predictors, and the model should generalize to the cases to which it will be applied. (p.&nbsp;45)</p>
</blockquote>
<p>Furthermore, none of the mathematical assumptions guarantee that the regression coefficients can be interpreted as causal.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li>Freedman, David A. (2006). “On The So-Called ‘Huber Sandwich Estimator’ and ‘Robust Standard Errors’”. The American Statistician. 60 (4): 299–302. doi:10.1198/000313006X152207.</li>
<li>Gelman, Andrew and Jennifer Hill. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.</li>
<li>Hayashi, Fumio. (2000). Econometrics. Princeton University Press.</li>
<li>King, Gary; Roberts, Margaret E. (2015). “How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It”. Political Analysis. 23 (2): 159–179. doi:10.1093/pan/mpu015.</li>
<li>White, Halbert (1980). “A Heteroscedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroscedasticity”. Econometrica. 48 (4): 817–838. doi:10.2307/1912934.</li>
</ul>


</section>

 ]]></description>
  <category>statistics</category>
  <guid>https://elbersb.com/public/posts/2020-10-01-regression-assumptions/</guid>
  <pubDate>Wed, 30 Sep 2020 22:00:00 GMT</pubDate>
</item>
<item>
  <title>New paper in Social Forces: Training Regimes and Skill Formation in France and Germany: An Analysis of Change Between 1970 and 2010</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2020-06-21-france-germany/</link>
  <description><![CDATA[ 




<p>Benjamin Elbers, Thijs Bol, and Thomas A. DiPrete. 2020. <a href="https://doi.org/10.1093/sf/soaa037"><strong>Training Regimes and Skill Formation in France and Germany: An Analysis of Change Between 1970 and 2010</strong></a>. <em>Social Forces</em>, forthcoming.</p>
<ul>
<li><a href="https://osf.io/s8nz4/">Preprint</a></li>
</ul>
<p>France and Germany are often portrayed as very different when it comes to school-to-work linkages: France focuses on general education, which means that graduates find jobs in all kinds of sectors. Therefore, the link between education and occupation should be low in France. Germany provides specialized education, leading to a strong match between educational degrees and jobs. High linkage means that one’s educational degree is very predictive of the occupation one is employed in. Low linkage means that the educational degree has no consequence for the kinds of occupations one works in. We use a segregation index, the Mutual Information Index M, to capture this idea.</p>
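<p>As a side note, the Mutual Information Index M of a two-way table is simply the mutual information between the education and occupation variables: it is zero under independence (no linkage) and grows as education becomes more predictive of occupation. A toy sketch of the computation (the contingency table below is entirely made up; this is not data from the paper):</p>

```python
import numpy as np

# Hypothetical education-by-occupation table of worker counts;
# a strong diagonal means education predicts occupation well.
counts = np.array([
    [80, 10, 10],
    [10, 80, 10],
    [10, 10, 80],
], dtype=float)

p = counts / counts.sum()            # joint distribution
p_e = p.sum(axis=1, keepdims=True)   # education margin
p_o = p.sum(axis=0, keepdims=True)   # occupation margin
# Mutual information; the np.where guards against empty cells
terms = np.where(p > 0, p * np.log(p / (p_e * p_o)), 0.0)
M = terms.sum()
```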
<p>We used different sources of microdata (French labor force surveys, German censuses, and the European Labor Force Survey) to study whether the characterization of France as low-linkage and Germany as high-linkage was true, both historically (in 1970) and now (well, in 2010).</p>
<p>Our first major finding is that there are strong gender differences: France and Germany look very different when we focus on men, but much less different when we focus on women. Historically, many studies have focused on men only, which gives a one-sided picture. Figure 1 shows the time-series for our measure of linkage.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2020-06-21-france-germany/m.png" class="img-fluid figure-img" style="width:75.0%"></p>
<figcaption>Figure 1: Differences in school-to-work linkages when viewed from the male (left) and female (right) perspective</figcaption>
</figure>
</div>
<p>Clearly, linkage has also increased over time. This can happen for many reasons, for instance, it could just be because of educational expansion. (Higher educated people usually have higher linkage, as they are more specialized.) To disentangle the different sources, we used a decomposition method that is described <a href="https://osf.io/preprints/socarxiv/ya7zs/">in another paper</a>.</p>
<p>When comparing Germany in 1970 to Germany in 2010 (right-hand panel of Figure 2), we find that the increase is indeed mostly explained by the differences in educational composition. In France, the changing educational composition has also increased linkage, but a large part of this increase has been offset by declines in structural linkage, which is that part of linkage that is unexplained by changes in the composition of the workforce.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2020-06-21-france-germany/diff-country.png" class="img-fluid figure-img"></p>
<figcaption>Figure 2: Decomposition of Differences in Linkage, 1970-2010</figcaption>
</figure>
</div>
<p>Comparing the countries to each other is even more interesting! In Figure 3, we find that Germany’s higher linkage in 1970 is almost entirely explained by Germany’s different educational distribution. In other words, France’s educational system was providing as good a match as the German system did, but it provided such a good match for a much smaller part of the workforce. In 2010, a lot of the difference is structural, consistent with the over-time comparison. The results in the paper are much more detailed, breaking down the components further by education and occupation.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2020-06-21-france-germany/diff-time.png" class="img-fluid figure-img"></p>
<figcaption>Figure 3: Decomposition of Differences in Linkage, France vs.&nbsp;Germany</figcaption>
</figure>
</div>
<p>To summarize, we find three things:</p>
<ol type="1">
<li><p>School-to-work linkages have increased over time in both France and Germany, due to educational expansion.</p></li>
<li><p>In France, educational expansion has been accompanied by a decline in the effectiveness with which graduates are matched with the labor market.</p></li>
<li><p>In the 1970s, the main difference between France and Germany was compositional, not structural. This is a major departure from earlier studies, which framed France and Germany as opposite poles on a spectrum between low and high linkage.</p></li>
</ol>
<p>Because the change within countries has been so strong, we argue against characterizing educational systems at the national level, especially over longer periods of time. There is a long tradition in sociology of characterizing the German system as “qualificational” and the French system as “organizational”. Our results raise the question of whether such cross-national classifications of skill formation systems do justice to actual cross-national differences. We believe this not to be the case. When looking more closely into how school-to-work linkages are established, countries might be similar in some respects (structural linkage) but differ in others (the composition of workers across programs). Moreover, the differences within countries are as large as or larger than the differences between countries.</p>



 ]]></description>
  <category>papers</category>
  <guid>https://elbersb.com/public/posts/2020-06-21-france-germany/</guid>
  <pubDate>Sat, 20 Jun 2020 22:00:00 GMT</pubDate>
</item>
<item>
  <title>Understanding regression coefficients and multicollinearity through the standardized regression model</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2020-01-08-correlation-model/</link>
  <description><![CDATA[ 




<p>The so-called <em>standardized regression model</em> is often presented in textbooks<sup>1</sup> as a solution to numerical issues that can arise in regression analysis, or as a method to bring the regression coefficients to a common, more interpretable scale. However, this transformation can also be useful to gain a deeper understanding into the construction of regression coefficients, the problem of multicollinearity, and the inflation of standard errors. It can thus also be a useful educational tool.</p>
<section id="correlation-transformation" class="level2">
<h2 class="anchored" data-anchor-id="correlation-transformation">Correlation transformation</h2>
<p>The <em>standardized model</em> refers to the model that is estimated after applying the correlation transformation to the outcome and the predictor variables. Let <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Ba%7D=(a_%7B1%7D,a_%7B2%7D,%5Cldots,a_%7Bn%7D)%5ET"> be a column vector of length <em>n</em>, then the correlation transformation is defined by</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Aa_%7Bi%7D%5E%7B*%7D=%5Cfrac%7Ba_%7Bi%7D-%5Cbar%7Ba%7D%7D%7B%5Csqrt%7B%5Csum_%7Bi=1%7D%5E%7Bn%7D(a_%7Bi%7D-%5Cbar%7Ba%7D)%5E%7B2%7D%7D%7D,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cbar%7Ba%7D"> denotes the mean of the components of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Ba%7D">. The correlation transformation is similar to a z-standardization, but instead of dividing by the standard deviation, we divide by the square root of the sum of squares. If we now consider another vector <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bb%7D%5E%7B*%7D">, for which the same transformation has been applied, we find that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft(%5Cmathbf%7Ba%7D%5E%7B*%7D%5Cright)%5E%7BT%7D%5Cmathbf%7Bb%7D%5E%7B*%7D=r_%7Ba,b%7D,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?r_%7Ba,b%7D"> denotes the Pearson correlation coefficient between vectors <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Ba%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bb%7D">. From this it also follows that the dot product of the transformed vector with itself will be 1, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?%5Cleft(%5Cmathbf%7Ba%7D%5E%7B*%7D%5Cright)%5E%7BT%7D%5Cmathbf%7Ba%7D%5E%7B*%7D=1."> The correlation transformation is the key “trick” that will be used to estimate the standardized model.</p>
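<p>Both properties are easy to verify numerically. A short sketch (the function and variable names are mine, for illustration only):</p>

```python
import numpy as np

def corr_transform(a):
    """Center a vector, then divide by the root of its sum of squared deviations."""
    d = a - a.mean()
    return d / np.sqrt((d ** 2).sum())

rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)
a_star, b_star = corr_transform(a), corr_transform(b)
unit_norm = a_star @ a_star   # dot product with itself: exactly 1
pearson = a_star @ b_star     # dot product with another: the Pearson r
```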
</section>
<section id="the-standardized-model" class="level2">
<h2 class="anchored" data-anchor-id="the-standardized-model">The standardized model</h2>
<p>In a standard regression problem, we have an <img src="https://latex.codecogs.com/png.latex?n%5Ctimes1"> outcome vector <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D"> and a <img src="https://latex.codecogs.com/png.latex?n%5Ctimes%20p"> matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D"> containing the <em>p</em> predictors. To estimate the standardized model, we apply the correlation transformation to the outcome vector <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D"> and to each of the predictors. We then estimate the model</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Cmathbf%7By%7D%20&amp;%20=%5Cmathbf%7BX%7D%5Cboldsymbol%7B%5Cbeta%7D+%5Cboldsymbol%7B%5Cepsilon%7D,%5C%5C%0A%5Cboldsymbol%7B%5Cepsilon%7D%20&amp;%20%5Csim%20N(0,%5Csigma%5E%7B2%7D%5Cmathbf%7BI%7D).%0A%5Cend%7Baligned%7D%0A"></p>
<p>The design matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D=%5Cbegin%7Bbmatrix%7D%5Cmathbf%7Bx%7D_%7B1%7D%20&amp;%20%5Cmathbf%7Bx%7D_%7B2%7D%20&amp;%20%5Ccdots%20&amp;%20%5Cmathbf%7Bx%7D_p%5Cend%7Bbmatrix%7D"> contains the <em>p</em> transformed predictors, but no intercept. This is because any intercept term would always be estimated to be zero after the correlation transformation has been applied.</p>
<p>The correlation transformation makes it much easier to understand the role of the key components that are required when finding the estimates for the vector <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D=(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7By%7D%0A"></p>
<p>The first component is the matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D">, which now has the simple form</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D=%5Cbegin%7Bbmatrix%7D1%20&amp;%20r_%7B1,2%7D%20&amp;%20%5Ccdots%20&amp;%20r_%7B1,p%7D%5C%5C%0Ar_%7B2,1%7D%20&amp;%201%20&amp;%20%5Ccdots%20&amp;%20r_%7B2,p%7D%5C%5C%0A%5Cvdots%20&amp;%20%5Cvdots%20&amp;%20%5Cddots%20&amp;%20%5Cvdots%5C%5C%0Ar_%7Bp,1%7D%20&amp;%20r_%7Bp,2%7D%20&amp;%20%5Ccdots%20&amp;%201%0A%5Cend%7Bbmatrix%7D=%5Cmathbf%7Br%7D_%7BXX%7D,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D"> stands for the correlation between predictors <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_%7B1%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_%7B2%7D">. Since this matrix is simply the correlation matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Br%7D_%7BXX%7D"> between the predictor variables, all of its diagonal elements are 1, and all off-diagonal elements are between -1 and 1.</p>
<p>The second component is the vector <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7By%7D">, which has the simple form</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7By%7D=%5Cbegin%7Bbmatrix%7Dr_%7B1,y%7D%5C%5C%0Ar_%7B2,y%7D%5C%5C%0A%5Cvdots%5C%5C%0Ar_%7Bp,y%7D%0A%5Cend%7Bbmatrix%7D=%5Cmathbf%7Br%7D_%7BXY%7D,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?r_%7Bp,y%7D"> stands for the correlation between the <img src="https://latex.codecogs.com/png.latex?p">th predictor and the outcome vector. Thus, the expression for <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> simply involves the two correlation matrices:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D=(%5Cmathbf%7Br%7D_%7BXX%7D)%5E%7B-1%7D%5Cmathbf%7Br%7D_%7BXY%7D.%0A"></p>
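<p>This identity can be confirmed numerically: after applying the correlation transformation, solving the system built from the two correlation matrices reproduces the coefficients of a direct least-squares fit without intercept. A sketch with made-up data (not from the post):</p>

```python
import numpy as np

def corr_transform(a):
    d = a - a.mean()
    return d / np.sqrt((d ** 2).sum())

rng = np.random.default_rng(2)
n = 100
X_raw = rng.normal(size=(n, 3))
y_raw = X_raw @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# Apply the correlation transformation column by column
X = np.apply_along_axis(corr_transform, 0, X_raw)
y = corr_transform(y_raw)

r_XX = X.T @ X                           # correlation matrix of predictors
r_XY = X.T @ y                           # predictor-outcome correlations
beta_std = np.linalg.solve(r_XX, r_XY)   # (r_XX)^{-1} r_XY

# Same answer from a direct least-squares fit (no intercept needed)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```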
<p>Not only the estimates for <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> are of interest, but also their standard errors. The expression for the variance of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BVar%7D(%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D)=%5Chat%7B%5Csigma%7D%5E%7B2%7D(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D=%5Chat%7B%5Csigma%7D%5E%7B2%7D(%5Cmathbf%7Br%7D_%7BXX%7D)%5E%7B-1%7D,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Csigma%7D%5E%7B2%7D"> is estimated through the mean squared error.</p>
<p>Finding the estimates for <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> and the standard errors requires inverting the correlation matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Br%7D_%7BXX%7D">, which is complicated for large <em>p</em>. We will thus look at two limiting cases, which will make inverting the matrix possible: uncorrelated predictors, and a small number of predictors.</p>
</section>
<section id="uncorrelated-predictors" class="level2">
<h2 class="anchored" data-anchor-id="uncorrelated-predictors">(1) Uncorrelated predictors</h2>
<p>We first consider perfectly uncorrelated predictors. When all the predictors are uncorrelated with each other, the correlation matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Br%7D_%7BXX%7D"> has an extremely simple expression:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbf%7Br%7D_%7BXX%7D=%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D=%5Cmathbf%7BI%7D,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BI%7D"> is the identity matrix. This fact should be obvious from inspection of the matrix above. The full expression for <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> simply becomes:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%20=(%5Cmathbf%7Br%7D_%7BXX%7D)%5E%7B-1%7D%5Cmathbf%7Br%7D_%7BXY%7D%0A%20%20=%5Cmathbf%7Br%7D_%7BXY%7D=%5Cbegin%7Bbmatrix%7Dr_%7B1,y%7D%5C%5C%0Ar_%7B2,y%7D%5C%5C%0A%5Cvdots%5C%5C%0Ar_%7Bp,y%7D%0A%5Cend%7Bbmatrix%7D%0A"></p>
<p>Thus, when the predictors are all uncorrelated with each other, the coefficients are simply given by the correlation coefficients between the predictor and the outcome <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D."></p>
<p>The standard errors for the regression are constant, i.e.&nbsp;each coefficient will have the same standard error regardless of the size of the correlation between the predictor and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D."> It can be shown<sup>2</sup> that the standard errors are given by</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7Bs.e.%7D(%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D)=%5Cfrac%7B1%7D%7B%5Csqrt%7Bn-p%7D%7D%5Csqrt%7B1-%5Csum_%7Bi=1%7D%5E%7Bp%7Dr_%7Bi,y%7D%5E%7B2%7D%7D.%0A"></p>
<p>Thus, the standard errors depend only on the sample size, the number of predictors, and the sum of the squared coefficients. Generally, the standard errors will decrease with increasing sample size, increase with an increasing number of predictors, and increase with lower correlations between the predictors and the outcome. All of these results should make intuitive sense.</p>
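<p>This case can be checked numerically by constructing predictors that are exactly uncorrelated: the QR factor of a column-centered matrix has orthonormal, mean-zero columns, which are already in correlation-transformed form. A sketch (the construction and all names are mine):</p>

```python
import numpy as np

def corr_transform(a):
    d = a - a.mean()
    return d / np.sqrt((d ** 2).sum())

rng = np.random.default_rng(3)
n, p = 60, 4
# Exactly uncorrelated predictors: QR factor of a column-centered matrix
A = rng.normal(size=(n, p))
X, _ = np.linalg.qr(A - A.mean(axis=0))
y = corr_transform(X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n))

r_xy = X.T @ y                              # coefficients = correlations here
resid = y - X @ r_xy
sigma2_hat = resid @ resid / (n - p)
se_direct = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))
# Closed-form expression from the text: constant across coefficients
se_formula = np.sqrt((1 - (r_xy ** 2).sum()) / (n - p)) * np.ones(p)
```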
</section>
<section id="two-correlated-predictors" class="level2">
<h2 class="anchored" data-anchor-id="two-correlated-predictors">(2) Two correlated predictors</h2>
<p>In actual applications, perfectly uncorrelated predictors are rare. In fact, the goal of regression is often to control for correlated predictors. We now look at the case of two correlated predictors.</p>
<p>In this case, it is also straightforward to find an expression for <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D."> First, we need to find the inverse of</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbf%7Br%7D_%7BXX%7D=%5Cbegin%7Bbmatrix%7D1%20&amp;%20r_%7B1,2%7D%5C%5C%0Ar_%7B2,1%7D%20&amp;%201%0A%5Cend%7Bbmatrix%7D=%5Cbegin%7Bbmatrix%7D1%20&amp;%20r_%7B1,2%7D%5C%5C%0Ar_%7B1,2%7D%20&amp;%201%0A%5Cend%7Bbmatrix%7D.%0A"></p>
<p>The determinant of this matrix is <img src="https://latex.codecogs.com/png.latex?%5Cdet%5Cmathbf%7Br%7D_%7BXX%7D=1-r_%7B1,2%7D%5E%7B2%7D">, and the inverse is then</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A(%5Cmathbf%7Br%7D_%7BXX%7D)%5E%7B-1%7D%20&amp;%20=%5Cfrac%7B1%7D%7B1-r_%7B1,2%7D%5E%7B2%7D%7D%5Cbegin%7Bbmatrix%7D1%20&amp;%20-r_%7B1,2%7D%5C%5C%0A-r_%7B1,2%7D%20&amp;%201%0A%5Cend%7Bbmatrix%7D.%0A%5Cend%7Baligned%7D%0A"></p>
<p>As an aside, this form of the matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Br%7D_%7BXX%7D=%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D"> also makes it easy to see why perfectly correlated predictors are problematic: When <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D=%5Cpm1">, the determinant of the matrix is zero and the matrix does not have an inverse.</p>
<p>The full expression for <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%20&amp;%20=(%5Cmathbf%7Br%7D_%7BXX%7D)%5E%7B-1%7D%5Cmathbf%7Br%7D_%7BXY%7D%5C%5C%0A&amp;=%5Cfrac%7B1%7D%7B1-r_%7B1,2%7D%5E%7B2%7D%7D%5Cbegin%7Bbmatrix%7D1%20&amp;%20-r_%7B1,2%7D%5C%5C%0A-r_%7B1,2%7D%20&amp;%201%0A%5Cend%7Bbmatrix%7D%5Cbegin%7Bbmatrix%7Dr_%7B1,y%7D%5C%5C%0Ar_%7B2,y%7D%0A%5Cend%7Bbmatrix%7D%5C%5C%0A&amp;%20=%5Cfrac%7B1%7D%7B1-r_%7B1,2%7D%5E%7B2%7D%7D%5Cbegin%7Bbmatrix%7Dr_%7B1,y%7D-r_%7B1,2%7Dr_%7B2,y%7D%5C%5C%0Ar_%7B2,y%7D-r_%7B1,2%7Dr_%7B1,y%7D%0A%5Cend%7Bbmatrix%7D%0A%5Cend%7Baligned%7D%0A"></p>
<p>Thus,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Chat%7B%5Cbeta%7D_%7B1%7D%20&amp;%20=%5Cfrac%7Br_%7B1,y%7D-r_%7B1,2%7Dr_%7B2,y%7D%7D%7B1-r_%7B1,2%7D%5E%7B2%7D%7D,%5C%5C%0A%5Chat%7B%5Cbeta%7D_%7B2%7D%20&amp;%20=%5Cfrac%7Br_%7B2,y%7D-r_%7B1,2%7Dr_%7B1,y%7D%7D%7B1-r_%7B1,2%7D%5E%7B2%7D%7D.%0A%5Cend%7Baligned%7D%0A"></p>
<p>It is immediately evident that, when the two predictors are uncorrelated <img src="https://latex.codecogs.com/png.latex?(r_%7B1,2%7D=0),"> the estimated regression coefficients are simply given by their correlation with <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D"> (as seen above). When <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D%5Cneq%200,"> both coefficients will change, and the effect will be larger for larger values of <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D."> If we assume that all three correlations are positive, the formula provides an intuitive way of thinking about what it means to “control” for another variable: the raw correlation between <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_1"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D"> will be reduced by an amount that depends both on the size of the correlation between <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_1"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_2"> and on the correlation between <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_2"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D."></p>
<p>For instance, assume that we are interested in the coefficient <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7B1%7D">. We let <img src="https://latex.codecogs.com/png.latex?r_%7B1,y%7D=0.5"> and <img src="https://latex.codecogs.com/png.latex?r_%7B2,y%7D=0.7."> In a simple regression, where we just include <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_1,"> we would find the coefficient to be 0.5. Now we want to control for another predictor, <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_2,"> which is also correlated with the outcome at 0.7. For any “controlling” to happen, <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_1"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_2"> need to be correlated as well. One interesting question is: How large does this correlation need to be to make <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7B1%7D"> zero? This is straightforward – simply plug in the values, set to zero, and solve for <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D:"></p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Chat%7B%5Cbeta%7D_%7B1%7D%20&amp;%20=%5Cfrac%7B0.5-r_%7B1,2%7D0.7%7D%7B1-r_%7B1,2%7D%5E%7B2%7D%7D%20=%200%5C%5C%0Ar_%7B1,2%7D%20&amp;%20=%5Cfrac%7B0.5%7D%7B0.7%7D%20%5Capprox%200.71%20%5C%5C%0A%5Cend%7Baligned%7D%0A"></p>
<p>Hence, the effect for <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_1"> would only vanish completely if <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D"> is fairly large, as should be expected.</p>
<p>In other situations, the coefficient cannot become zero by introducing a control variable. Assume for instance, <img src="https://latex.codecogs.com/png.latex?r_%7B1,y%7D=0.5"> and <img src="https://latex.codecogs.com/png.latex?r_%7B2,y%7D=0.4."> The solution here is <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D=1.25,"> which is impossible. It turns out that the local minimum is attained at <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7B1%7D=0.4">, where <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D=0.5">. In other words, controlling for <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_2"> will <strong>at most</strong> reduce <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7B1%7D"> from 0.5 to 0.4, and this will happen when <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D=0.5">.</p>
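<p>Both worked examples are easy to check numerically from the coefficient formula. The sketch below (names are mine) solves the first example in closed form and scans a grid of predictor correlations for the second:</p>

```python
import numpy as np

def beta1(r1y, r2y, r12):
    """Coefficient of the first predictor with two correlated predictors."""
    return (r1y - r12 * r2y) / (1 - r12 ** 2)

# First example: r1y = 0.5, r2y = 0.7 -> beta1 hits zero at r12 = 5/7
r_zero = 0.5 / 0.7

# Second example: r1y = 0.5, r2y = 0.4 -> beta1 never reaches zero;
# scanning a fine grid locates the minimum 0.4 at r12 = 0.5
grid = np.linspace(-0.99, 0.99, 19801)
vals = beta1(0.5, 0.4, grid)
i = vals.argmin()
```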
<p>The combined effect of different correlations can be explored in <a href="https://elbersb.shinyapps.io/beta1/">the Shiny app</a> shown below. The plot shows the coefficient <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta_1%7D"> (y-axis) as a function of the correlation between the two predictors (x-axis). Because we are dealing with the correlations among three variables, the range of possible values for <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D"> may be restricted depending on the values of <img src="https://latex.codecogs.com/png.latex?r_%7B1,y%7D"> and <img src="https://latex.codecogs.com/png.latex?r_%7B2,y%7D."><sup>3</sup> The Shiny app will show only the range of possible values.</p>
<p>Using the sliders, one can adjust the correlations between the predictors and the outcome variable. In the default setting, the correlations are set as <img src="https://latex.codecogs.com/png.latex?r_%7B1,y%7D=0.5"> and <img src="https://latex.codecogs.com/png.latex?r_%7B2,y%7D=0.7."> For this example, when <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D%3C0,"> the estimated coefficient will be inflated compared to the raw correlation <img src="https://latex.codecogs.com/png.latex?r_%7B1,y%7D"> (indicated by the orange line). When <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D%3E0,"> the estimated coefficient will be attenuated instead. The attenuation will be especially severe as <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D"> approaches 1. This is the problem of <strong>multicollinearity</strong> and can also be seen from the formula for <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_1">: As <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D"> approaches 1, <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_1"> approaches $ .$</p>
<div>
<iframe style="width: 120%;height:600px;" frameborder="0" src="https://elbersb.shinyapps.io/beta1/">
</iframe>
</div>
<p>Another interesting fact to note is that the coefficient of a predictor can be non-zero even if the predictor is completely uncorrelated with the outcome. For instance, if we let <img src="https://latex.codecogs.com/png.latex?r_%7B1,y%7D=0"> and <img src="https://latex.codecogs.com/png.latex?r_%7B2,y%7D=0.5,"> the plot shows a sigmoid shape: <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_1"> will be positive when <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_1"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_2"> are negatively correlated, and vice versa. This happens, of course, because multiple regression provides <em>conditional</em> inference: While <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_1"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D"> may be uncorrelated, they may well be correlated once we condition on <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_2">.</p>
<p>As a last step, we consider the standard errors for the two regression coefficients. As before,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Ctext%7BVar%7D(%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D)%20&amp;%20=%5Chat%7B%5Csigma%7D%5E%7B2%7D(%5Cmathbf%7Br%7D_%7BXX%7D)%5E%7B-1%7D%5C%5C%0A&amp;%20=%5Cfrac%7B%5Chat%7B%5Csigma%7D%5E%7B2%7D%7D%7B1-r_%7B12%7D%5E%7B2%7D%7D%5Cbegin%7Bbmatrix%7D1%20&amp;%20-r_%7B12%7D%5C%5C%0A-r_%7B12%7D%20&amp;%201%0A%5Cend%7Bbmatrix%7D%0A%5Cend%7Baligned%7D%0A"></p>
<p>Thus, the standard errors are again constant:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7Bs.e.%7D(%5Chat%7B%5Cbeta%7D_%7B1%7D)=%5Ctext%7Bs.e.%7D(%5Chat%7B%5Cbeta%7D_%7B2%7D)=%5Cfrac%7B%5Chat%7B%5Csigma%7D%7D%7B%5Csqrt%7B1-r_%7B12%7D%5E%7B2%7D%7D%7D%0A"></p>
<p>This clearly shows that any correlation between <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_%7B1%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_%7B2%7D"> increases the variance and standard errors of the estimated coefficients. In fact, as <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D"> approaches 1, the standard errors approach <img src="https://latex.codecogs.com/png.latex?%5Cinfty">. This is an important result, because, even though either <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_%7B1%7D"> or <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_%7B2%7D"> might be highly correlated with <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D">, under multicollinearity the standard errors might be very large. Thus, statistical tests might not reject the null hypothesis, despite strong correlation.</p>
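<p>To see how quickly the standard errors blow up, one can tabulate the inflation factor for a few values of the predictor correlation (a sketch, taking the estimated error variance as 1 so that the factor equals the standard error itself):</p>

```python
import numpy as np

# Inflation of the standard error relative to uncorrelated predictors,
# i.e. the factor 1 / sqrt(1 - r12^2) from the formula above.
r12 = np.array([0.0, 0.5, 0.9, 0.99, 0.999])
inflation = 1 / np.sqrt(1 - r12 ** 2)
for r, f in zip(r12, inflation):
    print(f"r12 = {r:5.3f}: s.e. multiplied by {f:6.2f}")
```

At a predictor correlation of 0.999 the standard error is already more than twenty times its value in the uncorrelated case.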
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>The standardized regression model, as defined by the correlation transformation, can be used to explore the construction of regression coefficients and standard errors in simple cases. In the model with two predictors, all quantities depend only on the three correlations between <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_%7B1%7D,"> <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_%7B2%7D,"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D">. This makes it easy to see the impact of different correlations on the estimated regression coefficients.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>See for instance, Kutner et al.&nbsp;(2005) <em>Applied Linear Statistical Models</em> (esp.&nbsp;p.&nbsp;271 ff.), on which a lot of this material is based.↩︎</p></li>
<li id="fn2"><p>This result can be shown through the use of the “hat” matrix, which is the matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BH%7D"> that satisfies <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cmathbf%7By%7D%7D=%5Cmathbf%7BX%7D%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D=%5Cmathbf%7BH%7D%5Cmathbf%7By%7D">. Because this matrix is a projection matrix, it is idempotent.</p>
<p>We use the mean squared error to estimate <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E%7B2%7D">. The vector of residuals is denoted by <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Be%7D=%5Cmathbf%7By%7D-%5Chat%7B%5Cmathbf%7By%7D%7D=(%5Cmathbf%7BI%7D-%5Cmathbf%7BH%7D)%5Cmathbf%7By%7D">. The variance of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> can then be found through some matrix algebra:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Ctext%7BVar%7D(%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D)%20&amp;%20=%5Ctext%7BMSE%7D(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5C%5C%0A&amp;%20=%5Cfrac%7B1%7D%7Bn-p%7D%5Cleft(%5Cmathbf%7Be%7D%5E%7BT%7D%5Cmathbf%7Be%7D%5Cright)%5Cmathbf%7BI%7D%5C%5C%0A&amp;%20=%5Cfrac%7B1%7D%7Bn-p%7D%5Cmathbf%7By%7D%5E%7BT%7D(%5Cmathbf%7BI%7D-%5Cmathbf%7BH%7D)%5E%7BT%7D(%5Cmathbf%7BI%7D-%5Cmathbf%7BH%7D)%5Cmathbf%7By%7D%5C%5C%0A&amp;%20=%5Cfrac%7B1%7D%7Bn-p%7D%5Cmathbf%7By%7D%5E%7BT%7D(%5Cmathbf%7BI%7D-%5Cmathbf%7BH%7D)%5Cmathbf%7By%7D%5C%5C%0A&amp;%20=%5Cfrac%7B1%7D%7Bn-p%7D%5Cleft(%5Cmathbf%7By%7D%5E%7BT%7D%5Cmathbf%7By%7D-%5Cmathbf%7By%7D%5E%7BT%7D%5Cmathbf%7BH%7D%5Cmathbf%7By%7D%5Cright)%5C%5C%0A&amp;%20=%5Cfrac%7B1%7D%7Bn-p%7D%5Cleft(1-%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5E%7BT%7D%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5Cright)%5C%5C%0A&amp;%20=%5Cfrac%7B1%7D%7Bn-p%7D%5Cleft(1-%5Csum_%7Bi=1%7D%5E%7Bp%7Dr_%7Bi,y%7D%5E%7B2%7D%5Cright)%0A%5Cend%7Baligned%7D%0A">↩︎</p></li>
<li id="fn3"><p>See, for instance, <a href="http://jakewestfall.org/blog/index.php/2013/09/17/geometric-argument-for-constraints-on-corrxz-given-corrxy-and-corryz/">this blogpost</a> for an explanation.↩︎</p></li>
</ol>
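<p>The last identity in footnote 2 relies on the predictors being uncorrelated, so that the inverse of the correlation matrix is the identity. It can be checked numerically; a sketch in Python with simulated data, orthogonalizing the second predictor so that r_12 = 0 (variable names and seed are arbitrary):</p>

```python
import numpy as np

# Numeric check of the identity e'e = 1 - sum_i r_{i,y}^2, which holds
# when the correlation-transformed predictors are orthogonal.
rng = np.random.default_rng(0)
raw = rng.standard_normal((100, 3))

def correlation_transform(v):
    """Center and scale to unit length (the correlation transformation)."""
    v = v - v.mean()
    return v / np.linalg.norm(v)

x1 = correlation_transform(raw[:, 0])
# Orthogonalize the second predictor against the first, so that r12 = 0:
x2 = raw[:, 1] - raw[:, 1].mean()
x2 = x2 - (x2 @ x1) * x1
x2 = x2 / np.linalg.norm(x2)
y = correlation_transform(raw[:, 2])

X = np.column_stack([x1, x2])
beta = X.T @ y               # OLS solution, since X'X is the identity
e = y - X @ beta             # residuals
r1y, r2y = x1 @ y, x2 @ y    # correlations with the outcome

print(np.isclose(e @ e, 1 - r1y**2 - r2y**2))  # True
```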
</section></div> ]]></description>
  <category>statistics</category>
  <guid>https://elbersb.com/public/posts/2020-01-08-correlation-model/</guid>
  <pubDate>Tue, 07 Jan 2020 23:00:00 GMT</pubDate>
</item>
<item>
  <title>Tidylog 1.0.0</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2020-01-07-tidylog100/</link>
  <description><![CDATA[ 




<p>Before I became a heavy user of R, I mainly used Stata. There are a few things that I miss from Stata, but one issue, specifically, bothered me immensely: The lack of feedback for data wrangling operations in R. Have a look, for instance, at this Stata output:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2020-01-07-tidylog100/stata.png" class="img-fluid figure-img"></p>
<figcaption>Stata output</figcaption>
</figure>
</div>
<p>The <code>merge</code> operation tells us about the number of matched cases, and the <code>drop</code> command tells us how many cases we lost. This feedback is great at preventing simple errors, especially when working with data interactively. This functionality does not exist in base R, the tidyverse, or the data.table package. Hence, my code often looked like this:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(data))</span>
<span id="cb1-2">data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(data, length <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(data))</span></code></pre></div>
</div>
<p>This gets ugly pretty quickly, and does not work for many other common problems, such as joins.</p>
<p>This is why I wrote the <a href="https://github.com/elbersb/tidylog"><strong>tidylog</strong></a> package, which is built on top of the tidyverse’s <a href="https://dplyr.tidyverse.org">dplyr</a> and <a href="https://tidyr.tidyverse.org">tidyr</a> packages. Tidylog provides the missing feedback:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tidyverse"</span>)</span>
<span id="cb2-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tidylog"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">warn.conflicts =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)</span>
<span id="cb2-3"></span>
<span id="cb2-4">filtered <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(mtcars, cyl <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)</span>
<span id="cb2-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; filter: removed 21 rows (66%), 11 rows remaining</span></span>
<span id="cb2-6"></span>
<span id="cb2-7">joined <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">left_join</span>(nycflights13<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>flights, nycflights13<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>weather,</span>
<span id="cb2-8">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"year"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"month"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"day"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"origin"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hour"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"time_hour"</span>))</span>
<span id="cb2-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; left_join: added 9 columns (temp, dewp, humid, wind_dir, wind_speed, …)</span></span>
<span id="cb2-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;            &gt; rows only in x     1,556</span></span>
<span id="cb2-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;            &gt; rows only in y  (  6,737)</span></span>
<span id="cb2-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;            &gt; matched rows     335,220</span></span>
<span id="cb2-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;            &gt;                 =========</span></span>
<span id="cb2-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;            &gt; rows total       336,776</span></span></code></pre></div>
</div>
<p>Tidylog simply overwrites the tidyverse functions for which it provides feedback. This is not very elegant, but means that tidylog is a drop-in solution: Just load it after the tidyverse (or dplyr and/or tidyr), and it will provide feedback.</p>
<p>Since its <a href="https://community.rstudio.com/t/new-package-tidylog-feedback-for-basic-dplyr-operations/22764">first version</a> about a year ago, the package has grown to include most dplyr and many tidyr functions. (Thanks to <a href="https://github.com/elbersb/tidylog/graphs/contributors">all the contributors</a>!) I might consider adding other functions, but for rarer and more complex functions the feedback becomes less useful, because one will usually inspect the output manually anyway. Because tidylog seems pretty much feature-complete to me, I am releasing version 1.0.0 now. The goal for the future is to keep the package updated with developments occurring in dplyr and tidyr.</p>
<p>For more information about tidylog, check out <a href="https://github.com/elbersb/tidylog">the Github page</a>.</p>



 ]]></description>
  <category>packages</category>
  <guid>https://elbersb.com/public/posts/2020-01-07-tidylog100/</guid>
  <pubDate>Mon, 06 Jan 2020 23:00:00 GMT</pubDate>
</item>
<item>
  <title>New paper in Management Science: Obscured Transparency? Compensation benchmarking and the biasing of executive pay</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2019-03-29-obscured-transparency/</link>
  <description><![CDATA[ 




<p>Mathijs de Vaan, Benjamin Elbers, Thomas A. DiPrete. 2019. <a href="https://doi.org/10.1287/mnsc.2018.3151"><strong>Obscured transparency? Compensation benchmarking and the biasing of executive pay</strong></a>. <em>Management Science</em>.</p>
<ul>
<li><a href="https://osf.io/97ebq/">Preprint and replication materials</a></li>
</ul>
<p>The disclosure of compensation peer groups is intended to provide shareholders with valuable information that can be used to scrutinize CEO compensation. Compensation consultants and watchdog organizations have established general principles for selecting peers, the most important being that peers should be companies of similar size, market capitalization, and industry profile. However, research suggests that there are substantial incentives for executives and directors to bias the compensation peer group upward to allow the CEO to extract additional rent. When companies choose compensation peer groups whose CEOs are better paid than would be the case with a neutrally chosen peer group, the focal CEO appears to be underpaid in comparison, which generates an argument for an increase in compensation for the focal CEO. A number of studies have found evidence that CEO peer groups are biased, though some have argued that what appears to be bias is actually just a reflection of the talent of the focal CEO.</p>
<p>We define bias as the difference in pay between the median CEO in the peer group selected by the firm and in neutral peer groups that we construct. We leverage the idea that reciprocated peer nominations are unlikely to be biased in order to construct counterfactual peer groups, which allows us to measure the bias of disclosed peer groups. Specifically, we estimate a model that predicts reciprocated peer nominations, and we use the estimates from this model to identify and select peers that are likely to nominate the focal firm as peer. Using eleven years of comprehensive data on compensation peer groups that was collected as part of this project, we demonstrate that the average firm uses an upwardly biased peer group, and that this bias cannot be accounted for by CEO talent. We also find that upward bias in compensation peer groups is highly predictive of higher CEO compensation – suggesting that there is a strong incentive for CEOs to strategically select peers. Figure 1 shows the bias as a percentage of the compensation of the median peer in a neutrally chosen peer group.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2019-03-29-obscured-transparency/outcomes_bias.png" style="height:70.0%" class="figure-img"></p>
<figcaption>Figure 1: Median peer group bias over time, including 25th and 75th percentiles</figcaption>
</figure>
</div>
<p>As Figure 1 shows, the size of the peer group bias has been diminishing over time since the 2006 SEC requirement that firms disclose their compensation peer groups in their corporate reports. This may be a consequence of the requirement for periodic say on pay votes on executive compensation in the 2009 Dodd-Frank Act along with the greater scrutiny of compensation practices by watchdog agencies such as Institutional Shareholder Services. However, Figure 2 shows that the predictive effect of bias on pay has gone up, which offsets the consequences of the decline in bias shown in Figure 1.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2019-03-29-obscured-transparency/outcomes_effect.png" style="height:70.0%" class="figure-img"></p>
<figcaption>Figure 2: Returns on bias are increasing over time: Controlling for firm size and firm performance, by how much does compensation increase for a 10% increase in bias?</figcaption>
</figure>
</div>
<p>We also demonstrate that ambiguity about membership in a neutrally chosen peer group is used strategically by firms to increase the size of peer group bias. When it is relatively obvious whom the firm should be choosing as peers, the bias tends to be smaller. Conversely, peer group bias is larger when firms have more discretion, that is, when both the set of plausible peers and the spread in the pay of those peers’ CEOs are relatively large. Figure 3 shows the median, 25th, and 75th percentiles of peer group bias, expressed as a percent of the median pay of a neutrally chosen peer group. The figure shows that bias is generally larger when firms have more discretion in the choice of a peer group.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2019-03-29-obscured-transparency/outcomes_discretion.png" style="height:70.0%" class="figure-img"></p>
<figcaption>Figure 3: Bias increases with discretion</figcaption>
</figure>
</div>
<p>When a company is doing very well, there is arguably less need to introduce bias into the peer group because the argument for high pay can be made easily. When the company performs less well, the argument for high pay is more difficult to make. Our results show that bias is generally larger when financial targets are not met and when firms have greater discretion in the selection of peer firms from the set of plausible peers. Taken together, the findings from this research call into question the practical impact of recent efforts to introduce greater transparency into the process for determining executive compensation.</p>



 ]]></description>
  <category>papers</category>
  <guid>https://elbersb.com/public/posts/2019-03-29-obscured-transparency/</guid>
  <pubDate>Thu, 28 Mar 2019 23:00:00 GMT</pubDate>
</item>
</channel>
</rss>
