<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Ben Elbers</title>
<link>https://elbersb.com/public/posts.html</link>
<atom:link href="https://elbersb.com/public/posts.xml" rel="self" type="application/rss+xml"/>
<description>Ben Elbers&#39; personal website</description>
<generator>quarto-1.7.31</generator>
<lastBuildDate>Sat, 26 Jul 2025 22:00:00 GMT</lastBuildDate>
<item>
  <title>Comparing the scales of the Dissimilarity index and Theil’s index of segregation</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2025-07-27-segregation-scale-H-D/</link>
  <description><![CDATA[ 




<p>When studying segregation, the question often comes up of how to interpret the extent of segregation. Is an index value of 0.3 a meaningful amount of segregation? At what threshold do we speak of “high segregation”? For the Dissimilarity index, D, Massey and Denton (1993) proposed</p>
<blockquote class="blockquote">
<p>a simple rule of thumb … values under 0.3 are low, those between 0.3 and 0.6 are moderate and anything above 0.6 is high. (p.&nbsp;20)</p>
</blockquote>
<p>This simple rule has been frequently used when interpreting the D, but it’s not directly transferable to other segregation indices that operate on a different scale.</p>
<p>In this post, I’ll explore some properties of the Dissimilarity index and Theil’s H index, both of which are frequently used in studies of segregation, and compare how the scales of the D and the H relate to each other.</p>
<section id="understanding-d-and-h" class="level3">
<h3 class="anchored" data-anchor-id="understanding-d-and-h">Understanding D and H</h3>
<p>The Dissimilarity index operates on a linear scale. To illustrate this point, let’s assume we have a city with two racial groups A and B, and two schools. To compare the outcomes for different indices, we define a single parameter <img src="https://latex.codecogs.com/png.latex?n">, so that we can generate different segregation scenarios using this parameter:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>School</th>
<th>A</th>
<th>B</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>School 1</td>
<td><img src="https://latex.codecogs.com/png.latex?n"></td>
<td><img src="https://latex.codecogs.com/png.latex?2000%20-%20n"></td>
</tr>
<tr class="even">
<td>School 2</td>
<td><img src="https://latex.codecogs.com/png.latex?2000%20-%20n"></td>
<td><img src="https://latex.codecogs.com/png.latex?n"></td>
</tr>
</tbody>
</table>
<p>For instance, if we set <img src="https://latex.codecogs.com/png.latex?n=1000">, there is no segregation, or perfect integration: Both schools have an equal number of A and B students. If we set <img src="https://latex.codecogs.com/png.latex?n=0">, there is complete segregation: Every school contains only a single racial group. For values of <img src="https://latex.codecogs.com/png.latex?n"> between 0 and 1000, we get intermediate levels of segregation. It’s important to keep in mind that this is a very restricted scenario: Regardless of the value of <img src="https://latex.codecogs.com/png.latex?n">, we only have two schools, both of equal size, and the two racial groups are also of equal size.</p>
<p>To see how the Dissimilarity index changes when we move away from perfect integration, let’s simplify the formula for our case. Here, <img src="https://latex.codecogs.com/png.latex?A"> and <img src="https://latex.codecogs.com/png.latex?B"> are the totals for each racial group, and <img src="https://latex.codecogs.com/png.latex?a_i"> and <img src="https://latex.codecogs.com/png.latex?b_i"> refer to the number of students of racial groups A and B in school <img src="https://latex.codecogs.com/png.latex?i">. We then have</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AD%20&amp;=%20%5Cfrac%7B1%7D%7B2%7D%20%5Csum_i%20%5Cleft%7C%20%5Cfrac%7Ba_i%7D%7BA%7D%20-%20%5Cfrac%7Bb_i%7D%7BB%7D%20%5Cright%7C%20%5C%5C%0A%20%20&amp;=%20%5Cfrac%7B1%7D%7B2%7D%20%5Cleft%7C%20%5Cfrac%7Bn%7D%7B2000%7D%20-%20%5Cfrac%7B2000%20-%20n%7D%7B2000%7D%20%5Cright%7C%20+%20%5Cfrac%7B1%7D%7B2%7D%20%5Cleft%7C%20%5Cfrac%7B2000%20-%20n%7D%7B2000%7D%20-%20%5Cfrac%7Bn%7D%7B2000%7D%20%5Cright%7C%20%5C%5C%0A%20%20&amp;=%20%5Cfrac%7B1%7D%7B4000%7D%20%5Cleft(%20%5Cleft%7C%202n%20-%202000%20%5Cright%7C%20+%20%5Cleft%7C%202000%20-%202n%20%5Cright%7C%20%5Cright)%20%5C%5C%0A%20%20&amp;=%20%5Cfrac%7B1%7D%7B2000%7D%20%5Cleft%7C%202000%20-%202n%20%5Cright%7C%20%5C%5C%0A%20%20&amp;=%201%20-%20%5Cfrac%7B1%7D%7B1000%7D%20n%20%5C%5C%0A%5Cend%7Balign%7D%0A"></p>
<p>and <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20n%7D%20D%20=%20-%5Cfrac%7B1%7D%7B1000%7D">. Hence, for every pair of students that switches places to increase segregation (decreasing <img src="https://latex.codecogs.com/png.latex?n"> by 1), the D increases by a constant amount, <img src="https://latex.codecogs.com/png.latex?1/1000">. This is an important property of the Dissimilarity index: it operates on a linear scale.</p>
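<p>As a quick sanity check (a Python sketch of the same scenario, not part of the original R analysis), we can compute D directly from its definition for the parametrized table and confirm that moving one pair of students always changes the index by exactly 1/1000:</p>

```python
def dissimilarity(a, b):
    """Dissimilarity index for per-school counts of groups A and B."""
    A, B = sum(a), sum(b)
    return 0.5 * sum(abs(ai / A - bi / B) for ai, bi in zip(a, b))

def d_of_n(n, N=2000):
    # School 1 holds (n, N - n) students of groups A and B; school 2 the mirror image.
    return dissimilarity([n, N - n], [N - n, n])

# D falls from 1 (n = 0) to 0 (n = 1000) in constant steps of 1/1000.
steps = [d_of_n(n - 1) - d_of_n(n) for n in range(1, 1001)]
```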
<p>Let’s do the same exercise for Theil’s index of segregation, which we’ll call the H index. Here, <img src="https://latex.codecogs.com/png.latex?E"> refers to the entropy of the racial group distribution, <img src="https://latex.codecogs.com/png.latex?E_i"> is the entropy of the racial group distribution within school <img src="https://latex.codecogs.com/png.latex?i">, and <img src="https://latex.codecogs.com/png.latex?p_i"> is the proportion of students in school <img src="https://latex.codecogs.com/png.latex?i">. Because we have two groups of equal size, <img src="https://latex.codecogs.com/png.latex?E=%5Clog%202">, and we also have <img src="https://latex.codecogs.com/png.latex?E_1=E_2">, as the distributions are just flipped. We then have:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AH%20&amp;=%20%5Cfrac%7B1%7D%7B%20E%20%7D%20%5Csum_i%20p_i%20%5Cleft(E%20-%20E_i%20%5Cright)%20%5C%5C%0A%20%20&amp;=%20%5Csum_i%20%5Cfrac%7B1%7D%7B2%7D%20%5Cleft(1%20-%20%5Cfrac%7BE_i%7D%7B%5Clog%202%7D%20%5Cright)%20%5C%5C%0A%20%20&amp;=%201%20-%20%5Cfrac%7BE_1%7D%7B%5Clog%202%7D%20%5C%5C%0A%5Cend%7Balign%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?E_1=E_2=-%5Cfrac%7Bn%7D%7B2000%7D%20%5Clog%5Cfrac%7Bn%7D%7B2000%7D%20-%20%5Cfrac%7B2000-n%7D%7B2000%7D%20%5Clog%5Cfrac%7B2000-n%7D%7B2000%7D">. We therefore have <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20n%7D%20H%20=%20%5Cfrac%7B1%7D%7B2000%20%5Clog%202%7D%20%5Cleft(%20%5Clog%20%5Cfrac%7Bn%7D%7B2000%7D%20-%20%5Clog%20%5Cleft(1%20-%20%5Cfrac%7Bn%7D%7B2000%7D%5Cright)%20%5Cright)."></p>
<p>This shows that for the H index, the change in segregation depends on <img src="https://latex.codecogs.com/png.latex?n"> and is not constant. Because the H index operates on a log scale, a marginal change in segregation has a smaller absolute effect when there is little segregation than when there is already a lot of segregation:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20n%7D%20H%20%5Cmid_%7Bn=900%7D%20&amp;=%20-0.0001%20%5C%5C%0A%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20n%7D%20H%20%5Cmid_%7Bn=100%7D%20&amp;=%20-0.0021%0A%5Cend%7Balign%7D%0A"></p>
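<p>The two derivative values above can be verified numerically. The following Python sketch (my own illustration, using the formulas derived in this post) evaluates the closed-form derivative and checks it against a central finite difference:</p>

```python
import math

N = 2000  # total students per group

def entropy_within(n):
    # Entropy of the (n, N - n) split within one school; 0*log(0) is treated as 0.
    p = n / N
    return -sum(t * math.log(t) for t in (p, 1 - p) if t > 0)

def h_index(n):
    # In this symmetric two-school scenario, H = 1 - E_1 / log(2).
    return 1 - entropy_within(n) / math.log(2)

def dh_dn(n):
    # Closed-form derivative of H with respect to n, as derived above.
    return (math.log(n / N) - math.log(1 - n / N)) / (N * math.log(2))

def dh_dn_numeric(n, eps=1e-3):
    # Central-difference approximation for comparison.
    return (h_index(n + eps) - h_index(n - eps)) / (2 * eps)
```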
<p>To make these results a bit more intuitive, let’s directly compare the D and H values across the range of possible values for <img src="https://latex.codecogs.com/png.latex?n">:</p>
<div class="cell">
<details class="code-fold">
<summary>Show the code</summary>
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb1-2"></span>
<span id="cb1-3">N <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2000</span></span>
<span id="cb1-4">e <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(n) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> N <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> N) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> (N <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> n) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> N <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>((N <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> n) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> N)</span>
<span id="cb1-5">h <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(n) <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">e</span>(n) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb1-6">d <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(n) <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> n</span>
<span id="cb1-7"></span>
<span id="cb1-8">seg <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb1-9">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),</span>
<span id="cb1-10">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">measure =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"D"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1001</span>), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"H"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1001</span>)),</span>
<span id="cb1-11">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">value =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">d</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">h</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>))</span>
<span id="cb1-12">)</span>
<span id="cb1-13"></span>
<span id="cb1-14">(</span>
<span id="cb1-15"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(seg, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> n, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> value, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> measure))</span>
<span id="cb1-16">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>()</span>
<span id="cb1-17">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_x_reverse</span>()</span>
<span id="cb1-18">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Segregation"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"&lt; Less segregated | More segregated &gt;"</span>)</span>
<span id="cb1-19">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>()</span>
<span id="cb1-20">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">legend.title =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>())</span>
<span id="cb1-21">)</span></code></pre></div>
</details>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2025-07-27-segregation-scale-H-D/index_files/figure-html/unnamed-chunk-1-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>Clearly, the D increases linearly as <img src="https://latex.codecogs.com/png.latex?n"> decreases, while the <img src="https://latex.codecogs.com/png.latex?H"> shows logarithmic behavior: small increases when segregation is low, large increases when segregation is high. The <img src="https://latex.codecogs.com/png.latex?H"> index is always smaller than the D index, except in the two extreme cases of complete integration and complete segregation, where the index values are identical.</p>
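<p>That H never exceeds D in this scenario, with equality only at the endpoints, is easy to check exhaustively. A small Python sketch (a translation of the R functions above, for illustration only):</p>

```python
import math

N = 2000  # total students per group

def d_of_n(n):
    # D = 1 - n/1000 in the two-school scenario.
    return 1 - n / (N / 2)

def h_of_n(n):
    # H = 1 - E_1/log(2), with 0*log(0) treated as 0.
    p = n / N
    e1 = -sum(t * math.log(t) for t in (p, 1 - p) if t > 0)
    return 1 - e1 / math.log(2)

# H is strictly below D for every intermediate n; they agree at n = 0 and n = 1000.
strictly_below = all(h_of_n(n) < d_of_n(n) for n in range(1, 1000))
```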
<p>Let’s link this back up to Massey and Denton’s interpretation of the D:</p>
<div class="cell">
<details class="code-fold">
<summary>Show the code</summary>
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(kableExtra)</span>
<span id="cb2-2"></span>
<span id="cb2-3">compare <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb2-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">D =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">d</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rev</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>))),</span>
<span id="cb2-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">H =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">h</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rev</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>))), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>),</span>
<span id="cb2-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">level =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"low"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"low"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"low"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"moderate"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"moderate"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"moderate"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"moderate"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"high"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"high"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"high"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"high"</span>)</span>
<span id="cb2-7">)</span>
<span id="cb2-8"></span>
<span id="cb2-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">kable</span>(compare, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">digits =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">col.names =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"D index"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"H index"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Massey/Denton"</span>))</span></code></pre></div>
</details>
<div class="cell-output-display">
<table class="caption-top table table-sm table-striped small">
<thead>
<tr class="header">
<th style="text-align: right;">D index</th>
<th style="text-align: right;">H index</th>
<th style="text-align: left;">Massey/Denton</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: right;">0.0</td>
<td style="text-align: right;">0.00</td>
<td style="text-align: left;">low</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.1</td>
<td style="text-align: right;">0.01</td>
<td style="text-align: left;">low</td>
</tr>
<tr class="odd">
<td style="text-align: right;">0.2</td>
<td style="text-align: right;">0.03</td>
<td style="text-align: left;">low</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.3</td>
<td style="text-align: right;">0.07</td>
<td style="text-align: left;">moderate</td>
</tr>
<tr class="odd">
<td style="text-align: right;">0.4</td>
<td style="text-align: right;">0.12</td>
<td style="text-align: left;">moderate</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.5</td>
<td style="text-align: right;">0.19</td>
<td style="text-align: left;">moderate</td>
</tr>
<tr class="odd">
<td style="text-align: right;">0.6</td>
<td style="text-align: right;">0.28</td>
<td style="text-align: left;">moderate</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.7</td>
<td style="text-align: right;">0.39</td>
<td style="text-align: left;">high</td>
</tr>
<tr class="odd">
<td style="text-align: right;">0.8</td>
<td style="text-align: right;">0.53</td>
<td style="text-align: left;">high</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.9</td>
<td style="text-align: right;">0.71</td>
<td style="text-align: left;">high</td>
</tr>
<tr class="odd">
<td style="text-align: right;">1.0</td>
<td style="text-align: right;">1.00</td>
<td style="text-align: left;">high</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>Hence, in this example, an H value above ~0.07 should already be considered “moderate” segregation, and a value above ~0.28 “high” segregation!</p>
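<p>For this specific scenario, the translation in the table can be computed for any D value. The Python helper below (my own sketch, assuming the two-school setup used throughout) inverts D = 1 - n/1000 to recover n and then evaluates H for the same table:</p>

```python
import math

N = 2000  # total students per group

def h_index(n):
    # H = 1 - E_1/log(2) for the table (n, N - n) / (N - n, n).
    p = n / N
    e1 = -sum(t * math.log(t) for t in (p, 1 - p) if t > 0)
    return 1 - e1 / math.log(2)

def d_to_h(d_value):
    # Invert D = 1 - n/1000, then compute H for the resulting table.
    n = round((N / 2) * (1 - d_value))
    return h_index(n)
```

For example, the Massey/Denton cutoffs D = 0.3 and D = 0.6 map to roughly H = 0.07 and H = 0.28 under these assumptions.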
</section>
<section id="a-more-general-situation" class="level3">
<h3 class="anchored" data-anchor-id="a-more-general-situation">A more general situation</h3>
<p>There is a big problem with the table above: It should not be used to translate D into H values in all kinds of situations. The example is an edge case: there are only two schools, both schools are of equal size, and the two racial groups are of equal size as well. Ultimately, the D and the H index work differently and evaluate the same situations differently. To illustrate this point, we generalize our example slightly by studying <em>all</em> possible 2x2 tables with a fixed total population count.</p>
<p>Generating all possible tables can be computationally expensive, so I’m using all tables with a total population of 100 here. That yields 176,451 unique tables, after removing tables that have empty schools or empty racial groups. I then calculated the H and D for each of these tables, and the result is a two-dimensional distribution that looks like this:</p>
<div class="cell">
<details class="code-fold">
<summary>Show the code</summary>
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(data.table)</span>
<span id="cb3-2"></span>
<span id="cb3-3">generate_tables <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(n, k) {</span>
<span id="cb3-4">    helper <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(n, k, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prefix =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>()) {</span>
<span id="cb3-5">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (k <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(prefix, n)))</span>
<span id="cb3-6"></span>
<span id="cb3-7">        result <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>()</span>
<span id="cb3-8">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> (i <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>n) {</span>
<span id="cb3-9">            result <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(result, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">helper</span>(n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> i, k <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(prefix, i)))</span>
<span id="cb3-10">        }</span>
<span id="cb3-11">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(result)</span>
<span id="cb3-12">    }</span>
<span id="cb3-13"></span>
<span id="cb3-14">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">helper</span>(n, k)</span>
<span id="cb3-15">}</span>
<span id="cb3-16"></span>
<span id="cb3-17">dt <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbindlist</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lapply</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">generate_tables</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>), <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.list</span>(x)))</span>
<span id="cb3-18"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">names</span>(dt) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"w1"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"w2"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"b1"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"b2"</span>)</span>
<span id="cb3-19">dt[, s1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> w1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b1]</span>
<span id="cb3-20">dt[, s2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> w2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b2]</span>
<span id="cb3-21">dt[, w <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> w1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> w2]</span>
<span id="cb3-22">dt[, b <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> b1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b2]</span>
<span id="cb3-23">dt[, n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> w <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b]</span>
<span id="cb3-24">dt <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> dt[s1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> s2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> w <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> b <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb3-25"></span>
<span id="cb3-26">logf <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(x))</span>
<span id="cb3-27"></span>
<span id="cb3-28">dt[, D <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">abs</span>(w1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> w <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> b1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> b) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">abs</span>(w2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> w <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> b2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> b))]</span>
<span id="cb3-29">dt[, E <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>w <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logf</span>(w <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> n) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> b <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logf</span>(b <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> n)]</span>
<span id="cb3-30">dt[, E1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>w1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> s1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logf</span>(w1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> s1) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> b1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> s1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logf</span>(b1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> s1)]</span>
<span id="cb3-31">dt[, E2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>w2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> s2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logf</span>(w2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> s2) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> b2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> s2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logf</span>(b2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> s2)]</span>
<span id="cb3-32">dt[, H <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> E <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (s1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (E <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> E1) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> s2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (E <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> E2))]</span>
<span id="cb3-33"></span>
<span id="cb3-34">(</span>
<span id="cb3-35">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(dt, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x=</span>D, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y=</span>H))</span>
<span id="cb3-36">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">stat_bin_hex</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">bins=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">60</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>))</span>
<span id="cb3-37">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_fill_viridis_c</span>()</span>
<span id="cb3-38">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_abline</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gray"</span>)</span>
<span id="cb3-39">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste0</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Correlation r = "</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(dt[, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cor</span>(D, H)], <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)))</span>
<span id="cb3-40">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>()</span>
<span id="cb3-41">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">legend.position =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"none"</span>)</span>
<span id="cb3-42">)</span></code></pre></div>
</details>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2025-07-27-segregation-scale-H-D/index_files/figure-html/unnamed-chunk-3-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>Lighter areas have higher density, and we can see a thin band of tables where a rough general relationship between H and D holds. There are, however, many scenarios where, for any given value of D, there is a wide range of H values. The plot also shows that, of all possible contingency tables, many are concentrated in the area where segregation is low. Lastly, we can again see that the H index is always lower than the D index.</p>
<p>As a new version of the table above, I now show the possible range of H values via the 5th, 50th, and 95th percentiles of the distribution of H for any given value of D:</p>
<div class="cell">
<details class="code-fold">
<summary>Show the code</summary>
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1">tab <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbindlist</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lapply</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>), <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(r) {</span>
<span id="cb4-2">    dt[<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">abs</span>(D <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> r) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.00001</span>, .(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">D =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(D), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">q5 =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">quantile</span>(H, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">q50 =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">median</span>(H), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">q95 =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">quantile</span>(H, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.95</span>))]</span>
<span id="cb4-3">}))</span>
<span id="cb4-4"></span>
<span id="cb4-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">kable</span>(tab, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">digits =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">col.names =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"D"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"5th percentile"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Median"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"95th percentile"</span>))</span></code></pre></div>
</details>
<div class="cell-output-display">
<table class="caption-top table table-sm table-striped small">
<thead>
<tr class="header">
<th style="text-align: right;">D</th>
<th style="text-align: right;">5th percentile</th>
<th style="text-align: right;">Median</th>
<th style="text-align: right;">95th percentile</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: right;">0.0</td>
<td style="text-align: right;">0.00</td>
<td style="text-align: right;">0.00</td>
<td style="text-align: right;">0.00</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.1</td>
<td style="text-align: right;">0.01</td>
<td style="text-align: right;">0.01</td>
<td style="text-align: right;">0.05</td>
</tr>
<tr class="odd">
<td style="text-align: right;">0.2</td>
<td style="text-align: right;">0.02</td>
<td style="text-align: right;">0.03</td>
<td style="text-align: right;">0.13</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.3</td>
<td style="text-align: right;">0.06</td>
<td style="text-align: right;">0.07</td>
<td style="text-align: right;">0.18</td>
</tr>
<tr class="odd">
<td style="text-align: right;">0.4</td>
<td style="text-align: right;">0.10</td>
<td style="text-align: right;">0.13</td>
<td style="text-align: right;">0.28</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.5</td>
<td style="text-align: right;">0.15</td>
<td style="text-align: right;">0.22</td>
<td style="text-align: right;">0.40</td>
</tr>
<tr class="odd">
<td style="text-align: right;">0.6</td>
<td style="text-align: right;">0.22</td>
<td style="text-align: right;">0.29</td>
<td style="text-align: right;">0.46</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.7</td>
<td style="text-align: right;">0.34</td>
<td style="text-align: right;">0.40</td>
<td style="text-align: right;">0.57</td>
</tr>
<tr class="odd">
<td style="text-align: right;">0.8</td>
<td style="text-align: right;">0.44</td>
<td style="text-align: right;">0.55</td>
<td style="text-align: right;">0.70</td>
</tr>
<tr class="even">
<td style="text-align: right;">0.9</td>
<td style="text-align: right;">0.60</td>
<td style="text-align: right;">0.73</td>
<td style="text-align: right;">0.83</td>
</tr>
<tr class="odd">
<td style="text-align: right;">1.0</td>
<td style="text-align: right;">1.00</td>
<td style="text-align: right;">1.00</td>
<td style="text-align: right;">1.00</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>The median scenario closely matches our simplified example above, but there is quite a range of values: a D value of 0.8, for instance, can correspond to H values from 0.44 to 0.70.</p>
<p>Let’s have a closer look at this scenario. The next figure shows two <a href="https://osf.io/preprints/socarxiv/ruw4g_v1">segplots</a> – a visual display of the contingency table that is used to compute the segregation index. Here we have two examples, both with a 90%-10% distribution of the two racial groups; this overall distribution is shown to the right of each segplot. The D is identical in the two examples, but the H index is quite different: 0.44 on the left and 0.70 on the right – a difference of almost 60%! In the scenario on the left, one school draws about 36% of its students from the minority group, while the second school is completely segregated. In the scenario on the right, one very small school consists only of minority students, while the large school contains only a small share of minority students (~2%).</p>
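<p>Before turning to the code, we can verify the stated values by hand, plugging the counts of the two tables into the D and H formulas used earlier in this post (a sketch for checking; here <code>w</code> denotes the majority group and <code>b</code> the minority group):</p>
<pre class="sourceCode r"><code class="sourceCode r"># D and H for a two-school, two-group table, same formulas as above
logf &lt;- function(x) ifelse(x == 0, 0, log(x))
DH &lt;- function(w1, b1, w2, b2) {
    w &lt;- w1 + w2; b &lt;- b1 + b2; n &lt;- w + b
    s1 &lt;- w1 + b1; s2 &lt;- w2 + b2
    D &lt;- 0.5 * (abs(w1 / w - b1 / b) + abs(w2 / w - b2 / b))
    E  &lt;- -w / n * logf(w / n) - b / n * logf(b / n)
    E1 &lt;- -w1 / s1 * logf(w1 / s1) - b1 / s1 * logf(b1 / s1)
    E2 &lt;- -w2 / s2 * logf(w2 / s2) - b2 / s2 * logf(b2 / s2)
    H &lt;- 1 / E * (s1 / n * (E - E1) + s2 / n * (E - E2))
    round(c(D = D, H = H), 2)
}
DH(w1 = 72, b1 = 0, w2 = 18, b2 = 10)  # left example:  D = 0.8, H = 0.44
DH(w1 = 0, b1 = 8, w2 = 90, b2 = 2)    # right example: D = 0.8, H = 0.70
</code></pre>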
<div class="cell">
<details class="code-fold">
<summary>Show the code</summary>
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(segregation)</span>
<span id="cb5-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(patchwork)</span>
<span id="cb5-3"></span>
<span id="cb5-4">example1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix_to_long</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">72</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">nrow=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))</span>
<span id="cb5-5">example2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix_to_long</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">90</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">nrow=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))</span>
<span id="cb5-6">ent <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">entropy</span>(example1, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>)</span>
<span id="cb5-7"></span>
<span id="cb5-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_local</span>(example1, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">wide =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)[, .(unit, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ls =</span> ls <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> ent, p, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">contrib =</span> ls <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> ent <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> p)]</span>
<span id="cb5-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; Key: &lt;unit&gt;</span></span>
<span id="cb5-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      unit        ls     p   contrib</span></span>
<span id="cb5-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt;     &lt;num&gt; &lt;num&gt;     &lt;num&gt;</span></span>
<span id="cb5-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      1 0.3241035  0.72 0.2333545</span></span>
<span id="cb5-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      2 0.7331267  0.28 0.2052755</span></span>
<span id="cb5-14"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_local</span>(example2, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">wide =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)[, .(unit, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ls =</span> ls <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> ent, p, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">contrib =</span> ls <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> ent <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> p)]</span>
<span id="cb5-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; Key: &lt;unit&gt;</span></span>
<span id="cb5-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      unit        ls     p   contrib</span></span>
<span id="cb5-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt;     &lt;num&gt; &lt;num&gt;     &lt;num&gt;</span></span>
<span id="cb5-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      1 7.0830689  0.08 0.5666455</span></span>
<span id="cb5-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      2 0.1488661  0.92 0.1369568</span></span>
<span id="cb5-20"></span>
<span id="cb5-21">(</span>
<span id="cb5-22">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">segplot</span>(example1, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">bar_space =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>)</span>
<span id="cb5-23">  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"D = 0.8, H = 0.44"</span>)</span>
<span id="cb5-24">  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">segplot</span>(example2, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">bar_space =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>)</span>
<span id="cb5-25">  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"D = 0.8, H = 0.70"</span>)</span>
<span id="cb5-26">  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">legend.position =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"none"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.title.x =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_text</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>))</span>
<span id="cb5-27">)</span></code></pre></div>
</details>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2025-07-27-segregation-scale-H-D/index_files/figure-html/unnamed-chunk-5-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>The key to understanding why the H index sees the second example as more segregated is to think about how <em>surprised</em> one is to find any of these schools. With a 90%-10% split, how surprising is it to find a school that is distributed 64%-36%? How surprising is it to find a school that is 100%-0%? This is the scenario on the left, and the H index quantifies the amount of surprise for the first school as 0.73 and for the second school as 0.32. (These are <a href="https://osf.io/preprints/socarxiv/3juyc_v1">adjusted local segregation scores</a> – local segregation scores divided by the entropy of the overall racial distribution.) Weighting these scores by each school’s share of the total population, we arrive at an H value of 0.44.</p>
<p>For the second scenario we ask: With a 90%-10% split, how surprising is it to find a school that is distributed 0%-100%? How surprising is it to find a school that is 98%-2%? Here, the H index quantifies the amounts of surprise as 7.1 (!) and 0.15, and we arrive at a total H index of 0.70. This intuitively reflects the fact that in a city where the minority group makes up only 10% of the overall student population, it is extremely surprising to find a school that is minority-only. To the D index, the two scenarios are identical, but there is a good argument to be made that the second scenario is, in fact, <em>more</em> segregated.</p>
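<p>As a quick sanity check (a sketch, not part of the code above), these surprise scores can be computed directly: each is the Kullback-Leibler divergence of a school’s racial composition from the city-wide composition, divided by the city-wide entropy:</p>
<pre class="sourceCode r"><code class="sourceCode r">p &lt;- c(0.1, 0.9)                      # city-wide minority/majority shares
E &lt;- sum(-p * log(p))                 # city-wide entropy, ~0.33
# KL divergence of a school's composition q from the city-wide shares p
kl &lt;- function(q) sum(ifelse(q == 0, 0, q * log(q / p)))
kl(c(1, 0)) / E        # all-minority school: ~7.08
kl(c(2, 90) / 92) / E  # 98%-2% school:       ~0.15
</code></pre>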
</section>
<section id="conclusion" class="level3">
<h3 class="anchored" data-anchor-id="conclusion">Conclusion</h3>
<p>This last example has shown that there is no unique mapping between H and D values – and, in fact, if such a mapping existed, there would be no need for a second index in the first place! The example has also shown that the H index arrives at a slightly different, but quite intuitive, conclusion compared to the D index. The H index is therefore its own index, with its own properties and its own scale.</p>
<p>Nonetheless, situations that many would regard as highly segregated yield relatively low absolute values for the H index. When interpreting H index values, it is therefore important to treat even small deviations from 0 as indicating moderate segregation. Some might consider this a downside of the H, but in exchange we gain many desirable properties, such as decomposability, local segregation scores, multigroup indices, and the avoidance of many problems that the D index has (see Winship 1977 for some of these). Instead of relying purely on the index value, it is also a good idea to visualize the data, for instance by using a <a href="https://osf.io/preprints/socarxiv/ruw4g_v1">segplot</a>.</p>
<p>Lastly, if we think about the segregation process from a statistical standpoint, any small deviations that might just be due to noise lead to an increase in the segregation score. The H index is much less susceptible to this than the D index, which is also a desirable property. More details on this aspect are found in an <a href="../../posts/2021-11-24-segregation-bias/index.html">earlier post of mine on the bias of segregation indices</a>.</p>
</section>
<section id="references" class="level3">
<h3 class="anchored" data-anchor-id="references">References</h3>
<p>Massey, Douglas S., and Nancy A. Denton. 1993. American Apartheid. Harvard University Press.</p>
<p>Winship, Christopher. 1977. A Revaluation of Indexes of Residential Segregation. Social Forces 55(4): 1058–1066.</p>


</section>

 ]]></description>
  <category>segregation</category>
  <guid>https://elbersb.com/public/posts/2025-07-27-segregation-scale-H-D/</guid>
  <pubDate>Sat, 26 Jul 2025 22:00:00 GMT</pubDate>
</item>
<item>
  <title>Logistic regression with categorical predictors</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2025-01-19-logistic-regression-categorical/</link>
  <description><![CDATA[ 




<p>I’ve written a little bit on linear regression on this blog before, for instance on the <a href="../../posts/2020-01-08-correlation-model/index.html">correlation model</a>. Mathematically, linear regression and the OLS estimator are nice to work with precisely because of the linearity. Once we move to more complicated models, such as logistic regression, the estimator no longer has a closed-form solution, which makes it harder to see what’s going on under the hood.</p>
<p>In the case of logistic regression, <a href="https://en.wikipedia.org/wiki/Logistic_regression#Maximum_likelihood_estimation_(MLE)">a maximum likelihood estimator</a> is usually used, which has no closed-form solution. One exception to this rule is in the case of categorical predictors. In this case, there is a simple (and indeed somewhat trivial) closed-form solution to the estimation of the model.</p>
<section id="setup" class="level2">
<h2 class="anchored" data-anchor-id="setup">Setup</h2>
<p>In the simplest possible setup, we have a binary predictor <img src="https://latex.codecogs.com/png.latex?X"> and a binary outcome <img src="https://latex.codecogs.com/png.latex?Y">. For instance, similar to the example on Wikipedia, <img src="https://latex.codecogs.com/png.latex?X"> could be whether a student has studied for an exam, and <img src="https://latex.codecogs.com/png.latex?Y"> could be whether the student has passed. Let’s say 80% of the students who studied passed the exam, and 40% of the students who didn’t study passed. This means that, effectively, we are fitting a logistic curve to the two points <img src="https://latex.codecogs.com/png.latex?(0,%200.4)"> and <img src="https://latex.codecogs.com/png.latex?(1,%200.8)">, where the x coordinate specifies whether a student has studied, and the y coordinate specifies the probability of passing the exam. Graphically, this looks like this:</p>
<div class="cell">
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2025-01-19-logistic-regression-categorical/index_files/figure-html/unnamed-chunk-1-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
</section>
<section id="curve-fitting" class="level2">
<h2 class="anchored" data-anchor-id="curve-fitting">Curve-fitting</h2>
<p>With this setup, this is an exercise in fitting a logistic function to the data – and because we have only two points, the fit is perfect. The curve we’re fitting is this one:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ap(x)%20=%20%5Ctext%7Blogit%7D%5E%7B-1%7D(%5Cbeta_0+%5Cbeta_1%20x)%20=%20%5Cfrac%7B1%7D%7B1+%5Ctext%7Bexp%7D(-(%5Cbeta_0+%5Cbeta_1%20x))%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?p(x)"> is the probability of passing, and <img src="https://latex.codecogs.com/png.latex?x"> is 1 if a student has studied, and 0 otherwise. The function <img src="https://latex.codecogs.com/png.latex?%5Ctext%7Blogit%7D%5E%7B-1%7D(x)%20=%20%5Cfrac%7B1%7D%7B1+e%5E%7B-x%7D%7D"> is the inverse logit, also known as the <a href="https://en.wikipedia.org/wiki/Logit">logistic function</a>. The logit function is the inverse of this function, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?%5Ctext%7Blogit%7D(x)%20=%20%5Ctext%7Blog%7D%5Cfrac%7Bx%7D%7B(1-x)%7D">, which is simply the logarithm of the odds.</p>
<p>Let’s use <img src="https://latex.codecogs.com/png.latex?a"> for the probability of passing if a student didn’t study, and <img src="https://latex.codecogs.com/png.latex?b"> for the probability of passing if a student did study. Then we need to solve this system of equations for <img src="https://latex.codecogs.com/png.latex?%5Cbeta_0"> and <img src="https://latex.codecogs.com/png.latex?%5Cbeta_1">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%20%20%20%20a%20&amp;=%5Ctext%7Blogit%7D%5E%7B-1%7D(%5Cbeta_0)%20%5C%5C%0A%20%20%20%20b%20&amp;=%5Ctext%7Blogit%7D%5E%7B-1%7D(%5Cbeta_0+%5Cbeta_1)%0A%5Cend%7Balign%7D%0A"></p>
<p>Trivially, by just applying the logit transformation, we obtain:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%20%20%20%20%5Cbeta_0%20&amp;=%5Ctext%7Blogit%7D(a)%20%5C%5C%0A%20%20%20%20%5Cbeta_1%20&amp;=%5Ctext%7Blogit%7D(b)%20-%20%5Cbeta_0%20=%20%5Ctext%7Blogit%7D(b)%20-%20%5Ctext%7Blogit%7D(a)%0A%5Cend%7Balign%7D%0A"></p>
<p>This is hopefully intuitive. The results are equivalent to those that would be obtained by OLS in a linear model (where we would get <img src="https://latex.codecogs.com/png.latex?%5Cbeta_0=a"> and <img src="https://latex.codecogs.com/png.latex?%5Cbeta_1=b-a">), but on the logit scale instead of on the probability scale.</p>
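<p>The closed-form solution is easy to verify numerically. Here is a minimal Python sketch of the same two equations (the code in this post uses R; the function names here are mine):</p>

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

a, b = 0.4, 0.8                  # pass rates without and with studying
beta0 = logit(a)                 # intercept: logit of the reference rate
beta1 = logit(b) - logit(a)      # slope: a difference of logits (log odds ratio)

# Round trip: the fitted curve passes exactly through both observed points.
assert abs(inv_logit(beta0) - a) < 1e-12
assert abs(inv_logit(beta0 + beta1) - b) < 1e-12
```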
</section>
<section id="example" class="level2">
<h2 class="anchored" data-anchor-id="example">Example</h2>
<p>For the example above, we therefore obtain</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%20%20%20%20%5Cbeta_0%20&amp;=%20%5Ctext%7Blog%7D%5Cfrac%7B0.4%7D%7B(1-0.4)%7D%20=%5Ctext%7Blog%7D%5Cfrac%7B2%7D%7B3%7D%20%5Capprox%20-0.405%20%5C%5C%0A%20%20%20%20%5Cbeta_1%20&amp;=%20%5Ctext%7Blog%7D%5Cfrac%7B0.8%7D%7B(1-0.8)%7D%20-%20%5Cbeta_0%20=%20%5Ctext%7Blog%7D(6)%20%5Capprox%201.792%0A%5Cend%7Balign%7D.%0A"></p>
<p>Let’s check this against what we get from R using the <code>glm</code> function:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># create data for 10 students</span></span>
<span id="cb1-2">exam <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb1-3">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>),</span>
<span id="cb1-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb1-5">)</span>
<span id="cb1-6"></span>
<span id="cb1-7">model <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glm</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> exam, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">family =</span> binomial)</span>
<span id="cb1-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(model)</span>
<span id="cb1-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; (Intercept)           x </span></span>
<span id="cb1-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;  -0.4054651   1.7917595</span></span></code></pre></div>
</div>
<p>The MLE gives the same answer.</p>
<p>Another way to obtain these coefficients is to use an OLS estimator with aggregate data, where we use the logit transform:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># define a logit function to use in linear model</span></span>
<span id="cb2-2">logit <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>x))</span>
<span id="cb2-3">(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logit</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>))</span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] -0.4054651</span></span>
<span id="cb2-5">(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logit</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logit</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>))</span>
<span id="cb2-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 1.791759</span></span>
<span id="cb2-7"></span>
<span id="cb2-8">exam_aggregate <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>))</span>
<span id="cb2-9">model <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logit</span>(y) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> exam_aggregate, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weights =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>))</span>
<span id="cb2-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(model)</span>
<span id="cb2-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; (Intercept)           x </span></span>
<span id="cb2-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;  -0.4054651   1.7917595</span></span></code></pre></div>
</div>
</section>
<section id="extending-to-a-categorical-predictor" class="level2">
<h2 class="anchored" data-anchor-id="extending-to-a-categorical-predictor">Extending to a categorical predictor</h2>
<p>Until now, <img src="https://latex.codecogs.com/png.latex?X"> was assumed to be binary. What happens if <img src="https://latex.codecogs.com/png.latex?X"> is categorical? It turns out that this case is not very different: with a single categorical predictor, we are effectively fitting <em>several</em> separate logistic curves. As an example, assume we have three academic departments with different admission rates. Department 0 admits 40% of students who apply, department 1 admits 80% of students, and department 2 admits 60% of students.</p>
<p>To use a categorical predictor in a regression model, we “dummy code” the department information with department 0 as the <em>reference category</em>. Define two binary variables <img src="https://latex.codecogs.com/png.latex?X_1"> and <img src="https://latex.codecogs.com/png.latex?X_2">:</p>
<ul>
<li>Department 0 is coded as <img src="https://latex.codecogs.com/png.latex?X_1=0"> and <img src="https://latex.codecogs.com/png.latex?X_2=0"></li>
<li>Department 1 is coded as <img src="https://latex.codecogs.com/png.latex?X_1=1"> and <img src="https://latex.codecogs.com/png.latex?X_2=0"></li>
<li>Department 2 is coded as <img src="https://latex.codecogs.com/png.latex?X_1=0"> and <img src="https://latex.codecogs.com/png.latex?X_2=1"></li>
</ul>
<p>We then define the logistic regression model as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ap(x)%20=%20%5Ctext%7Blogit%7D%5E%7B-1%7D(%5Cbeta_0+%5Cbeta_1%20x_1+%5Cbeta_2%20x_2)%0A"></p>
<p>Because the predictors are binary, we are now effectively fitting two separate curves, as in this figure:</p>
<div class="cell">
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2025-01-19-logistic-regression-categorical/index_files/figure-html/unnamed-chunk-4-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>Both logistic curves run through the point <img src="https://latex.codecogs.com/png.latex?(0,%200.4)"> because we picked department 0 as the reference category. The difference in slopes between these two curves is the reason that interaction terms in logistic regression models are much trickier to interpret than in linear models.</p>
<p>Again, use <img src="https://latex.codecogs.com/png.latex?a">, <img src="https://latex.codecogs.com/png.latex?b">, and <img src="https://latex.codecogs.com/png.latex?c"> for the probability of admittance for departments 0, 1, and 2, respectively. The system of equations to solve now becomes</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%20%20%20%20a%20&amp;=%5Ctext%7Blogit%7D%5E%7B-1%7D(%5Cbeta_0)%20%5C%5C%0A%20%20%20%20b%20&amp;=%5Ctext%7Blogit%7D%5E%7B-1%7D(%5Cbeta_0+%5Cbeta_1)%20%5C%5C%0A%20%20%20%20c%20&amp;=%5Ctext%7Blogit%7D%5E%7B-1%7D(%5Cbeta_0+%5Cbeta_2)%0A%5Cend%7Balign%7D%0A"></p>
<p>with solutions</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%20%20%20%20%5Cbeta_0%20&amp;=%5Ctext%7Blogit%7D(a)%20%5C%5C%0A%20%20%20%20%5Cbeta_1%20&amp;=%5Ctext%7Blogit%7D(b)%20-%20%5Ctext%7Blogit%7D(a)%20%5C%5C%0A%20%20%20%20%5Cbeta_2%20&amp;=%5Ctext%7Blogit%7D(c)%20-%20%5Ctext%7Blogit%7D(a).%0A%5Cend%7Balign%7D%0A"></p>
<p>For the example, the resulting coefficients are</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1">(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logit</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>))</span>
<span id="cb3-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] -0.4054651</span></span>
<span id="cb3-3">(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logit</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logit</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>))</span>
<span id="cb3-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 1.791759</span></span>
<span id="cb3-5">(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logit</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">logit</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>))</span>
<span id="cb3-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 0.8109302</span></span></code></pre></div>
</div>
<p>And here’s the same using the <code>glm</code> function:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># create data for 15 applicants (5 for each department)</span></span>
<span id="cb4-2">depts <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb4-3">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),</span>
<span id="cb4-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb4-5">)</span>
<span id="cb4-6"></span>
<span id="cb4-7">model <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glm</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.factor</span>(x), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> depts, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">family =</span> binomial)</span>
<span id="cb4-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(model)</span>
<span id="cb4-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;   (Intercept) as.factor(x)1 as.factor(x)2 </span></span>
<span id="cb4-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    -0.4054651     1.7917595     0.8109302</span></span></code></pre></div>
</div>


</section>

 ]]></description>
  <category>regression</category>
  <category>logistic-regression</category>
  <guid>https://elbersb.com/public/posts/2025-01-19-logistic-regression-categorical/</guid>
  <pubDate>Sat, 18 Jan 2025 23:00:00 GMT</pubDate>
</item>
<item>
  <title>A simple form of the IV standard error</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2023-10-07-iv-standard-error/</link>
  <description><![CDATA[ 




<p>Recently, a blog post of mine on encouragement designs was published on the <a href="https://engineering.atspotify.com/2023/08/encouragement-designs-and-instrumental-variables-for-a-b-testing/">Spotify Engineering blog</a>. In this post, I want to follow up on the formula for the variance of the IV estimator that is shown in that post, which is, with a slight change in notation,</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5B%5Chat%7B%5Cbeta%7D_%5Ctext%7BIV%7D%5D%20=%20%5Cfrac%7B1%7D%7Bn%7D%20%5Cfrac%7B%20%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-%5Chat%7B%5Cbeta%7D_%5Ctext%7BIV%7D%20X%5D%20%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5B%5Chat%7BE%7D%5BX%20%5Cmid%20Z%5D%5D%7D."></p>
<p>where <img src="https://latex.codecogs.com/png.latex?Y">, <img src="https://latex.codecogs.com/png.latex?X">, and <img src="https://latex.codecogs.com/png.latex?Z"> are random variables for the outcome, treatment, and instrument, respectively. <em>Note that this formula is only correct if the instrument <img src="https://latex.codecogs.com/png.latex?Z"> is binary.</em><sup>1</sup></p>
<p>The aim of this post is to show how to derive this version of the formula from the more general IV estimator (under the assumption that there is a single binary instrument <img src="https://latex.codecogs.com/png.latex?Z">, and a single predictor <img src="https://latex.codecogs.com/png.latex?X">), and how it compares to the OLS estimator. This version of the formula works well to illustrate why IV estimators have lower power than the equivalent OLS model.</p>
<p>To illustrate the derivations with some code, we’ll use a classic example from the econometrics literature – the returns to schooling, i.e.&nbsp;the effect of education on wages. Here is a bit of R code to set up the examples:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(data.table)</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(fixest)</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SchoolingReturns"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">package =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ivreg"</span>)</span>
<span id="cb1-4">returns <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.data.table</span>(SchoolingReturns)</span>
<span id="cb1-5">returns <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> returns[, .(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(wage), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> education, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">z =</span> nearcollege <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"yes"</span>)]</span>
<span id="cb1-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">head</span>(returns)</span>
<span id="cb1-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;           y     x      z</span></span>
<span id="cb1-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;       &lt;num&gt; &lt;num&gt; &lt;lgcl&gt;</span></span>
<span id="cb1-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1: 6.306275     7  FALSE</span></span>
<span id="cb1-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2: 6.175867    12  FALSE</span></span>
<span id="cb1-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 3: 6.580639    12  FALSE</span></span>
<span id="cb1-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 4: 5.521461    11   TRUE</span></span>
<span id="cb1-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 5: 6.591674    12   TRUE</span></span>
<span id="cb1-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 6: 6.214608    12   TRUE</span></span></code></pre></div>
</div>
<p>The data set contains the wage in dollars (which I log-transform here) as the outcome <img src="https://latex.codecogs.com/png.latex?Y">, the years of education as the treatment <img src="https://latex.codecogs.com/png.latex?X">, and whether the individual grew up near a college, which will be used as the instrument <img src="https://latex.codecogs.com/png.latex?Z">.</p>
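<p>With a binary instrument, the IV estimate reduces to the “Wald estimator”: the difference in mean outcome across the two instrument groups, divided by the difference in mean treatment. As a sketch of both the estimator and the variance formula above, here is a stdlib-only Python snippet on made-up toy numbers (these are illustrative, not the actual estimate for this data set):</p>

```python
# Hypothetical toy data: binary instrument z, treatment x, outcome y.
z = [0, 0, 0, 1, 1, 1, 1]
x = [7, 12, 12, 11, 12, 16, 14]
y = [6.31, 6.18, 6.58, 5.52, 6.59, 6.21, 6.90]

def mean(v):
    return sum(v) / len(v)

def var(v):  # 1/n variance, matching the notation in the post
    m = mean(v)
    return sum((vi - m) ** 2 for vi in v) / len(v)

def split(v, z):  # values where z == 1, values where z == 0
    return [vi for vi, zi in zip(v, z) if zi], [vi for vi, zi in zip(v, z) if not zi]

y1, y0 = (mean(g) for g in split(y, z))
x1, x0 = (mean(g) for g in split(x, z))
beta_iv = (y1 - y0) / (x1 - x0)     # Wald estimator

# Variance formula from above: residual variance of y over the variance
# of the fitted first stage E^[X | Z], scaled by 1/n.
n = len(y)
x_hat = [x1 if zi else x0 for zi in z]
var_beta = (1 / n) * var([yi - beta_iv * xi for yi, xi in zip(y, x)]) / var(x_hat)
se = var_beta ** 0.5
```

<p>Because <code>x_hat</code> only takes two values, its variance is small, which already hints at why the IV standard error tends to be large.</p>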
<section id="ols-standard-error" class="level2">
<h2 class="anchored" data-anchor-id="ols-standard-error">OLS standard error</h2>
<p>To show the similarities and differences between the IV and the OLS standard error, let’s first take a look at the standard error of a simple linear model. Consider a standard linear model of the form <img src="https://latex.codecogs.com/png.latex?y_%7Bi%7D=%5Calpha+%5Cbeta%20x_%7Bi%7D+u_%7Bi%7D">, where we apply all the usual regression assumptions. We’re interested in an estimate of <img src="https://latex.codecogs.com/png.latex?%5Cbeta">, and its standard error, <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7B%5Ctext%7BVar%7D%5B%5Chat%7B%5Cbeta%7D%5D%7D">. If we estimate this model using OLS, and call the estimate <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%5Ctext%7BOLS%7D">, we obtain</p>
<p><img src="https://latex.codecogs.com/png.latex?%20%5Cbegin%7Balign%7D%0A%5Ctext%7BVar%7D%5B%5Chat%7B%5Cbeta%7D_%7B%5Ctext%7BOLS%7D%7D%5D%0A%20%20%20%20&amp;=%5Cfrac%7B1%7D%7Bn%7D%5Cfrac%7B%5Ctext%7B(residual%20variance%20of%20%7Dy)%7D%7B%5Ctext%7B(variance%20of%20%7Dx)%7D%0A%20%20%20%20%5C%5C&amp;=%5Cfrac%7B1%7D%7Bn%7D%5Cfrac%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-%5Chat%7B%5Cbeta%7D_%5Ctext%7BOLS%7D%20X%5D%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BX%5D%7D.%0A%5Cend%7Balign%7D%0A"></p>
<p>(We use a factor of <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7Bn%7D"> here for simplicity – use <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7Bn-2%7D"> to obtain an unbiased estimate.)</p>
<p>The numerator might look a bit non-standard, so here’s a quick derivation. If we define the predicted values as <img src="https://latex.codecogs.com/png.latex?%5Chat%7BY%7D%20=%20%5Chat%7B%5Calpha%7D+%5Chat%7B%5Cbeta%7D_%5Ctext%7BOLS%7D%20X">, then the numerator can be written as</p>
<p><img src="https://latex.codecogs.com/png.latex?%20%5Cbegin%7Balign%7D%0A%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-%5Chat%7BY%7D%5D%0A%20%20%20%20&amp;=%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-(%5Chat%7B%5Calpha%7D+%5Chat%7B%5Cbeta%7D_%5Ctext%7BOLS%7D%20X)%5D%0A%20%20%20%20%5C%5C&amp;=%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-(%5Chat%7BE%7D%5BY%5D-%5Chat%7B%5Cbeta%7D_%5Ctext%7BOLS%7D%20%5Chat%7BE%7D%5BX%5D%20+%20%5Chat%7B%5Cbeta%7D_%5Ctext%7BOLS%7D%20X)%5D%0A%20%20%20%20%5C%5C&amp;=%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY%20-%20%5Chat%7B%5Cbeta%7D_%5Ctext%7BOLS%7D%20X%5D,%0A%5Cend%7Balign%7D%0A"></p>
<p>where the last term simplifies because constant terms drop out of the variance.</p>
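<p>As a quick numerical check of this identity, here is a standalone sketch with simulated, made-up data (it uses NumPy rather than the R used elsewhere in this post):</p>

```python
import numpy as np

# simulate arbitrary data (the parameters here are made up)
rng = np.random.default_rng(4)
x = rng.normal(10, 3, size=1000)
y = 2 + 0.3 * x + rng.normal(size=1000)

# OLS fit; np.polyfit returns the slope first, then the intercept
beta, alpha = np.polyfit(x, y, 1)
yhat = alpha + beta * x

# constants drop out of the variance, so both residual variances agree
assert np.isclose(np.var(y - yhat), np.var(y - beta * x))
```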
<p>This R code demonstrates the result:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">model_ols <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> returns)</span>
<span id="cb2-2">beta_ols <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(model_ols)[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]]</span>
<span id="cb2-3"></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># standard error calculated by lm</span></span>
<span id="cb2-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vcov</span>(model_ols)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span>
<span id="cb2-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 0.002869708</span></span>
<span id="cb2-7"></span>
<span id="cb2-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># compute standard error manually</span></span>
<span id="cb2-9">adj <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(returns) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># use the dof that lm uses</span></span>
<span id="cb2-10">returns[, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(adj <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">var</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> beta_ols <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> x) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">var</span>(x))]</span>
<span id="cb2-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 0.002869708</span></span></code></pre></div>
</div>
</section>
<section id="deriving-the-iv-standard-error" class="level2">
<h2 class="anchored" data-anchor-id="deriving-the-iv-standard-error">Deriving the IV standard error</h2>
<p>If we consult any standard textbook on econometrics, we’ll find that the general formula for the standard error of the IV estimator is</p>
<p><img src="https://latex.codecogs.com/png.latex?%20%5Chat%7B%5Csigma%7D_%7B%5Ctext%7BIV%7D%7D%5E%7B2%7D(%5Chat%7B%5Cmathbf%7BX%7D%7D'%5Chat%7B%5Cmathbf%7BX%7D%7D)%5E%7B-1%7D.%20"></p>
<p>The logic of the IV estimator is that we use only the variation of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D"> that is due to <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BZ%7D"> to estimate the effect on <img src="https://latex.codecogs.com/png.latex?Y">, and this formula reflects this logic. To see this more clearly, assume that we only have one endogenous variable <img src="https://latex.codecogs.com/png.latex?X"> and one instrumental variable <img src="https://latex.codecogs.com/png.latex?Z">. The first term, <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Csigma%7D_%7B%5Ctext%7BIV%7D%7D%5E%7B2%7D">, then becomes <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-%5Chat%7B%5Cbeta%7D_%5Ctext%7BIV%7D%20X%5D">. Note that compared to the OLS estimator, we use <img src="https://latex.codecogs.com/png.latex?%7B%5Cbeta%7D_%5Ctext%7BIV%7D"> instead of <img src="https://latex.codecogs.com/png.latex?%7B%5Cbeta%7D_%5Ctext%7BOLS%7D"> – again, this is because we use only the variation of <img src="https://latex.codecogs.com/png.latex?X"> that is due to <img src="https://latex.codecogs.com/png.latex?Z">.</p>
<p>The second term is based on the matrix <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cmathbf%7BX%7D%7D">, which contains the predicted values of the regression of <img src="https://latex.codecogs.com/png.latex?X"> on <img src="https://latex.codecogs.com/png.latex?Z">. In matrix algebra, this is <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cmathbf%7BX%7D%7D=%5Cmathbf%7BZ%7D(%5Cmathbf%7BZ%7D'%5Cmathbf%7BZ%7D)%5E%7B-1%7D%5Cmathbf%7BZ%7D'%5Cmathbf%7BX%7D">, but if we assume that we have only one endogenous variable and one instrumental variable, <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cmathbf%7BX%7D%7D"> has a simpler form. Let’s define <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Calpha%7D_%7B%5Ctext%7BFS%7D%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7B%5Ctext%7BFS%7D%7D"> as the intercept and slope estimates of the regression of <img src="https://latex.codecogs.com/png.latex?X"> on <img src="https://latex.codecogs.com/png.latex?Z"> (FS = first stage). We can then define the random variable <img src="https://latex.codecogs.com/png.latex?%5Chat%7BX%7D"> that contains the predicted values of this regression:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7BX%7D=%5Chat%7B%5Calpha%7D_%7B%5Ctext%7BFS%7D%7D%20+%20%5Chat%7B%5Cbeta%7D_%7B%5Ctext%7BFS%7D%7DZ=%5Chat%7BE%7D%5BX%5D+%5Cfrac%7B%5Cwidehat%7B%5Ctext%7BCov%7D%7D(X,Z)%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D(Z)%7D(Z-%5Chat%7BE%7D%5BZ%5D),"></p>
<p>where the second equality follows from simple regression. The corresponding matrix is then <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cmathbf%7BX%7D%7D=%5Cbegin%7Bbmatrix%7D%5Cmathbf%7B1%7D%20&amp;%20%5Chat%7BX%7D%5Cend%7Bbmatrix%7D"> of size <img src="https://latex.codecogs.com/png.latex?n%5Ctimes%202">. With a bit of matrix algebra, we can now carry out the multiplication and find the inverse:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%20%20%20%20(%5Chat%7B%5Cmathbf%7BX%7D%7D%5E%7BT%7D%5Chat%7B%5Cmathbf%7BX%7D%7D)%5E%7B-1%7D&amp;=%5Cbegin%7Bbmatrix%7Dn%20&amp;%20%5Cmathbf%7B1%7D%5E%7BT%7D%5Chat%7BX%7D%5C%5C%0A%20%20%20%20%5Cmathbf%7B1%7D%5E%7BT%7D%5Chat%7BX%7D%20&amp;%20%5Chat%7BX%7D%5E%7BT%7D%5Chat%7BX%7D%0A%20%20%20%20%5Cend%7Bbmatrix%7D%5E%7B-1%7D%5C%5C&amp;=%5Cfrac%7B1%7D%7Bn%7D%5Cbegin%7Bbmatrix%7D1%20&amp;%20%5Chat%7BE%7D%5BX%5D%5C%5C%0A%20%20%20%20%5Chat%7BE%7D%5BX%5D%20&amp;%20%5Chat%7BE%7D%5E%7B2%7D%5BX%5D+%5Cfrac%7B%5Cwidehat%7B%5Ctext%7BCov%7D%7D%5E%7B2%7D(X,Z)%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D(Z)%7D%0A%20%20%20%20%5Cend%7Bbmatrix%7D%5E%7B-1%7D%5C%5C&amp;=%5Cfrac%7B1%7D%7Bn%5E%7B2%7D%5Cfrac%7B%5Cwidehat%7B%5Ctext%7BCov%7D%7D%5E%7B2%7D(X,Z)%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D(Z)%7D%7D%5Cbegin%7Bbmatrix%7D%5Chat%7BE%7D%5E%7B2%7D%5BX%5D+%5Cfrac%7B%5Cwidehat%7B%5Ctext%7BCov%7D%7D%5E%7B2%7D(X,Z)%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D(Z)%7D%20&amp;%20-%5Chat%7BE%7D%5BX%5D%5C%5C%0A%20%20%20%20-%5Chat%7BE%7D%5BX%5D%20&amp;%20n%0A%20%20%20%20%5Cend%7Bbmatrix%7D%0A%5Cend%7Balign%7D%0A"></p>
<p>The relevant entry here is in the lower right-hand corner of the matrix, so we have as a preliminary formula</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5B%5Chat%7B%5Cbeta%7D_%5Ctext%7BIV%7D%5D%20=%0A%20%20%20%20%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-%5Chat%7B%5Cbeta%7D_%5Ctext%7BIV%7D%20X%5D%0A%20%20%20%20%5Cfrac%7B1%7D%7Bn%7D%20%5Cfrac%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D(Z)%7D%7B%5Cwidehat%7B%5Ctext%7BCov%7D%7D%5E%7B2%7D(X,Z)%7D.%0A"></p>
<p>Until this point, we have only assumed that there is one instrument, but not that this instrument is binary. We’ll now make this assumption to simplify the formula a bit further. If <img src="https://latex.codecogs.com/png.latex?Z"> is binary, we have <img src="https://latex.codecogs.com/png.latex?%5Chat%7BE%7D%5BZ%5D=P(Z=1)"> as the proportion of cases where the instrument is 1, and <img src="https://latex.codecogs.com/png.latex?1-%5Chat%7BE%7D%5BZ%5D=P(Z=0)"> as the proportion of cases where the instrument is 0. Because <img src="https://latex.codecogs.com/png.latex?Z"> is a Bernoulli random variable, we can then immediately state that</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7B%5Ctext%7BVar%7D%7D(Z)=%5Chat%7BE%7D%5BZ%5D(1-%5Chat%7BE%7D%5BZ%5D)."></p>
<p>For the next derivations, we’ll make use of the fact that we can rewrite expectations as group weighted averages. For instance,</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7BE%7D%5BX%5D=(1-%5Chat%7BE%7D%5BZ%5D)%5Chat%7BE%7D%5BX%5Cmid%20Z=0%5D+%5Chat%7BE%7D%5BZ%5D%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D."></p>
<p>We’ll use this strategy to ‘simplify’ the covariance term:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%5Cwidehat%7B%5Ctext%7BCov%7D%7D(X,Z)&amp;=%5Chat%7BE%7D%5BZ(X-%5Chat%7BE%7D%5BX%5D)%5D%5C%5C&amp;=(1-%5Chat%7BE%7D%5BZ%5D)%5Chat%7BE%7D%5BZ(X-%5Chat%7BE%7D%5BX%5D)%5Cmid%20Z=0%5D+%5Chat%7BE%7D%5BZ%5D%5Chat%7BE%7D%5BZ(X-%5Chat%7BE%7D%5BX%5D)%5Cmid%20Z=1%5D%5C%5C&amp;=%5Chat%7BE%7D%5BZ%5D%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D-%5Chat%7BE%7D%5BZ%5D%5Chat%7BE%7D%5B%5Chat%7BE%7D%5BX%5D%5Cmid%20Z=1%5D%5C%5C&amp;=%5Chat%7BE%7D%5BZ%5D%5Cleft(%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D-%5Chat%7BE%7D%5BX%5D%5Cright)%5C%5C&amp;=%5Chat%7BE%7D%5BZ%5D%5Cleft(%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D-(1-%5Chat%7BE%7D%5BZ%5D)E%5BX%5Cmid%20Z=0%5D-%5Chat%7BE%7D%5BZ%5DE%5BX%5Cmid%20Z=1%5D)%5Cright)%5C%5C&amp;=%5Chat%7BE%7D%5BZ%5D%5Cleft(E%5BX%5Cmid%20Z=1%5D(1-%5Chat%7BE%7D%5BZ%5D)-(1-%5Chat%7BE%7D%5BZ%5D)E%5BX%5Cmid%20Z=0%5D%5Cright)%5C%5C&amp;=%5Chat%7BE%7D%5BZ%5D(1-%5Chat%7BE%7D%5BZ%5D)%5Cleft(E%5BX%5Cmid%20Z=1%5D-E%5BX%5Cmid%20Z=0%5D%5Cright)%5C%5C&amp;=%5Cwidehat%7B%5Ctext%7BVar%7D%7D(Z)%5Cleft(%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D-%5Chat%7BE%7D%5BX%5Cmid%20Z=0%5D%5Cright)%0A%5Cend%7Balign%7D%0A"></p>
<p>The first equality is just the definition of the covariance. We then rewrite the expectation as a weighted average. Because the covariance involves <img src="https://latex.codecogs.com/png.latex?Z">, the term where <img src="https://latex.codecogs.com/png.latex?Z=0"> drops out. After simplifying, we replace <img src="https://latex.codecogs.com/png.latex?%5Chat%7BE%7D%5BX%5D"> with its alternative form as a weighted average. The final version then says that the covariance of a random variable <img src="https://latex.codecogs.com/png.latex?X"> and a Bernoulli random variable <img src="https://latex.codecogs.com/png.latex?Z"> is equal to the difference in means between <img src="https://latex.codecogs.com/png.latex?X"> when <img src="https://latex.codecogs.com/png.latex?Z=1"> and <img src="https://latex.codecogs.com/png.latex?X"> when <img src="https://latex.codecogs.com/png.latex?Z=0">, times the variance of <img src="https://latex.codecogs.com/png.latex?Z">.</p>
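<p>This identity is easy to verify numerically. Here is a standalone NumPy sketch (the data and parameters are made up; divide-by-<em>n</em> moments are used to match the definitions above):</p>

```python
import numpy as np

# simulated data: binary instrument z, continuous x (parameters are made up)
rng = np.random.default_rng(0)
z = rng.binomial(1, 0.3, size=1000)
x = 10 + 2 * z + rng.normal(0, 2, size=1000)

# population (divide-by-n) moments
cov_xz = np.mean(x * z) - np.mean(x) * np.mean(z)
var_z = np.mean(z) * (1 - np.mean(z))  # Bernoulli variance E[Z](1 - E[Z])
mean_diff = x[z == 1].mean() - x[z == 0].mean()

# Cov(X, Z) = Var(Z) * (E[X|Z=1] - E[X|Z=0])
assert np.isclose(cov_xz, var_z * mean_diff)
```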
<p>Plugging these two results into the formula, we get</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5B%5Chat%7B%5Cbeta%7D_%7B%5Ctext%7BIV%7D%7D%5D=%5Cfrac%7B1%7D%7Bn%7D%5Cfrac%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-%5Chat%7B%5Cbeta%7D_%7B%5Ctext%7BIV%7D%7DX%5D%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D(Z)%5Cleft(%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D-%5Chat%7BE%7D%5BX%5Cmid%20Z=0%5D%5Cright)%5E%7B2%7D%7D."></p>
<p>The last step is to show that the denominator is equal to <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5B%5Chat%7BE%7D%5BX%5Cmid%20Z%5D%5D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5B%5Chat%7BE%7D%5BX%5Cmid%20Z%5D%5D&amp;=%5Chat%7BE%7D%5BZ%5D%5Cleft(%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D-%5Chat%7BE%7D%5BX%5D%5Cright)%5E%7B2%7D+(1-%5Chat%7BE%7D%5BZ%5D)%5Cleft(%5Chat%7BE%7D%5BX%5Cmid%20Z=0%5D-%5Chat%7BE%7D%5BX%5D%5Cright)%5E%7B2%7D%5C%5C&amp;=%5Chat%7BE%7D%5BZ%5D(1-%5Chat%7BE%7D%5BZ%5D)%5E%7B2%7D%5Cleft(%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D-%5Chat%7BE%7D%5BX%5Cmid%20Z=0%5D%5Cright)%5E%7B2%7D%5C%5C&amp;%5Cquad+(1-%5Chat%7BE%7D%5BZ%5D)%5Chat%7BE%7D%5E%7B2%7D%5BZ%5D%5Cleft(%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D)-%5Chat%7BE%7D%5BX%5Cmid%20Z=0%5D%5Cright)%5E%7B2%7D%5C%5C&amp;=%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BZ%5D%5Cleft(%5Chat%7BE%7D%5BX%5Cmid%20Z=1%5D-%5Chat%7BE%7D%5BX%5Cmid%20Z=0%5D%5Cright)%5E%7B2%7D%0A%5Cend%7Balign%7D%0A"></p>
<p>The first equality is just the definition of the variance of the group means, when two groups are involved. We then apply the identities that have been used for the covariance term, and simplify the result. This is identical to the denominator, so the result is proven.</p>
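<p>The same kind of numerical check works for this identity (again a standalone NumPy sketch with made-up data, using divide-by-<em>n</em> moments):</p>

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.binomial(1, 0.25, size=1500)     # binary instrument
x = 4 * z + rng.normal(0, 1, size=1500)

p = z.mean()
mean_diff = x[z == 1].mean() - x[z == 0].mean()

# variance of the two conditional means, weighted by the group shares
var_group_means = (p * (x[z == 1].mean() - x.mean()) ** 2
                   + (1 - p) * (x[z == 0].mean() - x.mean()) ** 2)

# Var[E[X|Z]] = Var(Z) * (E[X|Z=1] - E[X|Z=0])^2
assert np.isclose(var_group_means, p * (1 - p) * mean_diff ** 2)
```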
<p>We therefore have as the final result:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5B%5Chat%7B%5Cbeta%7D_%5Ctext%7BIV%7D%5D%20=%0A%20%20%20%20%5Cfrac%7B1%7D%7Bn%7D%20%5Cfrac%7B%20%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-%5Chat%7B%5Cbeta%7D_%5Ctext%7BIV%7D%20X%5D%20%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5B%5Chat%7BE%7D%5BX%20%5Cmid%20Z%5D%5D%7D."></p>
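<p>On simulated data, we can verify that this simplified formula agrees exactly with the general matrix formula from the start of this section. This is a standalone NumPy sketch; the data-generating process and all parameters are made up:</p>

```python
import numpy as np

# made-up data-generating process with a binary instrument and an
# endogenous x (u is a confounder of x and y)
rng = np.random.default_rng(1)
n = 5000
u = rng.normal(size=n)
z = rng.binomial(1, 0.5, size=n)
x = 10 + 2 * z + u + rng.normal(size=n)
y = 1 + 0.5 * x + u + rng.normal(size=n)

# Wald/IV estimate for a single binary instrument
beta_iv = (y[z == 1].mean() - y[z == 0].mean()) / (x[z == 1].mean() - x[z == 0].mean())
sigma2 = np.var(y - beta_iv * x)  # divide-by-n convention, as in the text

# general matrix formula: sigma2 * [(Xhat' Xhat)^-1]_{22}
Z = np.column_stack([np.ones(n), z])
X = np.column_stack([np.ones(n), x])
Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)  # first-stage fitted values
var_matrix = sigma2 * np.linalg.inv(Xhat.T @ Xhat)[1, 1]

# simplified formula: (1/n) * Var[Y - beta_iv * X] / Var[E[X|Z]]
var_between = np.var(Xhat[:, 1])  # variance of the first-stage group means
var_simple = sigma2 / n / var_between

assert np.isclose(var_matrix, var_simple)
```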
<p>To translate this into R code, we estimate the between variance using the <code>anova</code> function. (We could also use the alternative version, where we take the squared difference between the means and multiply by the variance of <img src="https://latex.codecogs.com/png.latex?Z">.)</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># we use `feols` from the fixest package for IV estimation</span></span>
<span id="cb3-2">model_iv <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">feols</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> z, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> returns)</span>
<span id="cb3-3">beta_iv <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(model_iv)[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]]</span>
<span id="cb3-4"></span>
<span id="cb3-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># standard error calculated by feols</span></span>
<span id="cb3-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vcov</span>(model_iv)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span>
<span id="cb3-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 0.02629134</span></span>
<span id="cb3-8"></span>
<span id="cb3-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># compute standard error manually</span></span>
<span id="cb3-10">between_variance <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">anova</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> z, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> returns))[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Mean Sq"</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(returns)</span>
<span id="cb3-11">adj <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(returns) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># use the dof that feols uses</span></span>
<span id="cb3-12">returns[, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(adj <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">var</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> beta_iv <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> x) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> between_variance)]</span>
<span id="cb3-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 0.02629134</span></span></code></pre></div>
</div>
<p>Note that the IV standard error is almost ten times the size of the OLS standard error.</p>
</section>
<section id="comparing-the-iv-and-ols-standard-errors" class="level2">
<h2 class="anchored" data-anchor-id="comparing-the-iv-and-ols-standard-errors">Comparing the IV and OLS standard errors</h2>
<p>When we directly compare the IV and the OLS standard error, it becomes apparent that the two formulas are very similarly structured (again, this applies only if <img src="https://latex.codecogs.com/png.latex?Z"> is binary):</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Balign%7D%0A%5Ctext%7BVar%7D%5B%5Chat%7B%5Cbeta%7D_%7B%5Ctext%7BOLS%7D%7D%5D%0A%20%20%20%20&amp;=%5Cfrac%7B1%7D%7Bn%7D%5Cfrac%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-%5Chat%7B%5Cbeta%7D_%5Ctext%7BOLS%7D%20X%5D%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BX%5D%7D%20%5C%5C%0A%5Ctext%7BVar%7D%5B%5Chat%7B%5Cbeta%7D_%5Ctext%7BIV%7D%5D%20&amp;=%0A%20%20%20%20%5Cfrac%7B1%7D%7Bn%7D%20%5Cfrac%7B%20%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5BY-%5Chat%7B%5Cbeta%7D_%5Ctext%7BIV%7D%20X%5D%20%7D%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5B%5Chat%7BE%7D%5BX%20%5Cmid%20Z%5D%5D%7D%0A%5Cend%7Balign%7D"></p>
<p>In both cases, we have three ways of achieving a lower standard error:</p>
<ol type="1">
<li>Increase the sample size, <img src="https://latex.codecogs.com/png.latex?n">,</li>
<li>Reduce the size of the numerator,</li>
<li>Increase the size of the denominator.</li>
</ol>
<p>With OLS, we are guaranteed to obtain the smallest possible standard error under the <a href="https://en.wikipedia.org/wiki/Gauss–Markov_theorem">usual regression assumptions</a>. In comparison, the IV estimator replaces <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7B%5Ctext%7BOLS%7D%7D"> with <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7B%5Ctext%7BIV%7D%7D"> in the numerator – which <em>must</em> increase the numerator unless the two <img src="https://latex.codecogs.com/png.latex?%5Cbeta">’s are identical – and uses only part of the variance in the denominator – which <em>must</em> decrease the denominator unless <img src="https://latex.codecogs.com/png.latex?Z"> perfectly determines <img src="https://latex.codecogs.com/png.latex?X">. Hence, the decrease in power of the IV estimator comes both from the fact that we predict <img src="https://latex.codecogs.com/png.latex?Y"> less well (in the numerator), and from the fact that we use only part of the variance of <img src="https://latex.codecogs.com/png.latex?X"> (in the denominator).</p>
<p>The denominator is based on the law of total variance, which states that</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5BX%5D=%5Ctext%7BVar%7D%5BE%5BX%20%5Cmid%20Z%5D%5D+E%5B%5Ctext%7BVar%7D%5BX%20%5Cmid%20Z%5D%5D."></p>
<p>This is a between/within decomposition. The law of total variance states that the total variance is equal to the variance of the group means (“between”), plus the variance within the groups. The IV estimator uses only the “between” term. Hence, to make this term as large as possible, <img src="https://latex.codecogs.com/png.latex?Z"> should predict <img src="https://latex.codecogs.com/png.latex?X"> well. As we have seen, another way to put this is to maximize <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7B%5Ctext%7BVar%7D%7D(Z)%5Cleft(E%5BX%5Cmid%20Z=1%5D-E%5BX%5Cmid%20Z=0%5D%5Cright)%5E%7B2%7D">. Hence, in the optimal case, we would like to have <img src="https://latex.codecogs.com/png.latex?E%5BZ%5D=0.5"> to maximize the variance, and have a large difference in group means.</p>
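<p>The between/within decomposition is easy to check numerically as well. A standalone NumPy sketch with made-up data (divide-by-<em>n</em> variances throughout):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.binomial(1, 0.4, size=2000)          # binary instrument
x = 5 + 3 * z + rng.normal(0, 2, size=2000)

p = z.mean()
# "between": variance of the group means, weighted by the group shares
between = (p * (x[z == 1].mean() - x.mean()) ** 2
           + (1 - p) * (x[z == 0].mean() - x.mean()) ** 2)
# "within": weighted average of the within-group variances
within = p * np.var(x[z == 1]) + (1 - p) * np.var(x[z == 0])

# law of total variance: Var[X] = Var[E[X|Z]] + E[Var[X|Z]]
assert np.isclose(np.var(x), between + within)
```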
<p>In the empirical example, the numerators and denominators compare as follows:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># comparing the numerators</span></span>
<span id="cb4-2">num_ols <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> returns[, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">var</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> beta_ols <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> x)]</span>
<span id="cb4-3">num_iv <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> returns[, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">var</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> beta_iv <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> x)]</span>
<span id="cb4-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(num_ols, num_iv)</span>
<span id="cb4-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 0.1775096 0.3099878</span></span>
<span id="cb4-6"></span>
<span id="cb4-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># comparing the denominators</span></span>
<span id="cb4-8">denom_ols <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> returns[, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">var</span>(x)]</span>
<span id="cb4-9">denom_iv <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> between_variance</span>
<span id="cb4-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(denom_ols, denom_iv)</span>
<span id="cb4-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 7.1658624 0.1490379</span></span></code></pre></div>
</div>
<p>The IV numerator is almost twice the size of the OLS numerator, but the real loss in power comes from the denominator, which differs by a factor of almost 50. Clearly, and not surprisingly, most of the variance in education is not <em>between</em> people who grew up near a college and those who did not, but <em>within</em> these two groups. This is confirmed by a quick look at the variance decomposition:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">returns[, .(between_variance, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">within_variance =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">var</span>(x) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> between_variance)]</span>
<span id="cb5-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    between_variance within_variance</span></span>
<span id="cb5-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;               &lt;num&gt;           &lt;num&gt;</span></span>
<span id="cb5-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:        0.1490379        7.016824</span></span></code></pre></div>
</div>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>In the simple setting with just one endogenous variable and one binary instrument, the IV standard error can be shown to have a simple form that can be compared easily with the OLS standard error. The logic of the IV estimator is that instead of using the full information in <img src="https://latex.codecogs.com/png.latex?X">, we use only that part of the information in <img src="https://latex.codecogs.com/png.latex?X"> that is also contained in the instrument <img src="https://latex.codecogs.com/png.latex?Z">. This affects both the numerator and the denominator of the IV standard error. This highlights how important it is to choose an instrument that is strongly predictive of <img src="https://latex.codecogs.com/png.latex?X">.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>If <img src="https://latex.codecogs.com/png.latex?Z"> is continuous, the formula is still correct if a linear estimator is used for <img src="https://latex.codecogs.com/png.latex?%7B%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5B%5Chat%7BE%7D%5BX%20%5Cmid%20Z%5D%5D%7D">.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>regression</category>
  <category>ab-testing</category>
  <guid>https://elbersb.com/public/posts/2023-10-07-iv-standard-error/</guid>
  <pubDate>Fri, 06 Oct 2023 22:00:00 GMT</pubDate>
</item>
<item>
  <title>Eliminating the bias of segregation indices</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2021-11-24-segregation-bias/</link>
  <description><![CDATA[ 




<p>It is well known that most standard estimators of segregation indices are biased. The <a href="http://elbersb.com/segregation">segregation</a> package provides a few tools to assess this bias. This post will discuss this problem with some simple examples and show under what conditions bootstrapping and simulation can help to remove the bias. The post relies on some tools that were only recently added to the package, so install the most recent version to follow along:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"segregation"</span>)</span></code></pre></div>
</div>
<section id="bias-in-small-and-large-samples" class="level2">
<h2 class="anchored" data-anchor-id="bias-in-small-and-large-samples">Bias in small and large samples</h2>
<p>To illustrate the problem, let’s use R’s <code>stats::r2dtable</code> function to simulate a random contingency table. To make the following more concrete, let’s assume that we observe racial segregation in schools. Each school has an equal number of students of each of the two racial groups, but we only observe a sample. If the sample is small, we do not expect to sample exactly an even number of students of each of the two groups, so the segregation index is likely to be biased upwards.</p>
<p>One hypothetical sample could look like this:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mat =</span> stats<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">r2dtable</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">25</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">25</span>))[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]])</span>
<span id="cb2-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      [,1] [,2]</span></span>
<span id="cb2-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1,]    5    5</span></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [2,]    4    6</span></span>
<span id="cb2-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [3,]    3    7</span></span>
<span id="cb2-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [4,]    7    3</span></span>
<span id="cb2-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [5,]    6    4</span></span></code></pre></div>
</div>
<p>Now we can compute the Mutual Information index (M) and its normalized version, the H index:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"segregation"</span>)</span>
<span id="cb3-2">dat <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix_to_long</span>(mat) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># convert to long format</span></span>
<span id="cb3-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_total</span>(dat, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>)</span>
<span id="cb3-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat    est</span></span>
<span id="cb3-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt;  &lt;num&gt;</span></span>
<span id="cb3-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      M 0.0410</span></span>
<span id="cb3-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      H 0.0591</span></span></code></pre></div>
</div>
<p>Clearly, both indices are non-zero. For the index of dissimilarity, the bias is even stronger:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dissimilarity</span>(dat, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>)</span>
<span id="cb4-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat   est</span></span>
<span id="cb4-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt; &lt;num&gt;</span></span>
<span id="cb4-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      D  0.24</span></span></code></pre></div>
</div>
<p>An index value of 0.3 is often interpreted as “moderate segregation”, so this bias is clearly a problem. Generally, the index of dissimilarity suffers more from small-sample bias than the information-theoretic indices.</p>
<p>Importantly, the bias is not simply a function of sample size. For instance, if we increase the number of schools to 10,000, but still expect 5 students of each racial group in each school, the bias is pretty much the same:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">mat_large <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> stats<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">r2dtable</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span>), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50000</span>))[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb5-2">dat_large <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix_to_long</span>(mat_large) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># convert to long format</span></span>
<span id="cb5-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_total</span>(dat_large, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>)</span>
<span id="cb5-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat    est</span></span>
<span id="cb5-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt;  &lt;num&gt;</span></span>
<span id="cb5-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      M 0.0540</span></span>
<span id="cb5-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      H 0.0778</span></span>
<span id="cb5-8"></span>
<span id="cb5-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dissimilarity</span>(dat_large, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>)</span>
<span id="cb5-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat   est</span></span>
<span id="cb5-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt; &lt;num&gt;</span></span>
<span id="cb5-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      D 0.248</span></span></code></pre></div>
</div>
<p>This is despite the fact that in the first case, our sample size is 50, and in the second case it’s 100,000! For the index of dissimilarity, Winship (1977) has described this bias in detail.</p>
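<p>A quick way to see this is to compute D by hand (the <code>d_index</code> helper below is ad hoc, not part of the package) for two simulated datasets with the same total sample size but very different school sizes:</p>

```r
set.seed(1)
# Dissimilarity index for a (schools x 2 groups) count matrix
d_index <- function(mat) {
  0.5 * sum(abs(mat[, 1] / sum(mat[, 1]) - mat[, 2] / sum(mat[, 2])))
}
# 10,000 schools with 10 students each (100,000 students in total)
small_schools <- stats::r2dtable(1, rep(10, 10000), c(50000, 50000))[[1]]
# 100 schools with 1,000 students each (also 100,000 students in total)
large_schools <- stats::r2dtable(1, rep(1000, 100), c(50000, 50000))[[1]]
d_index(small_schools)  # around 0.25, as above
d_index(large_schools)  # an order of magnitude smaller
```

<p>With the same total sample size, the bias all but disappears once the individual schools are large: what matters is the size of the units, not the total number of observations.</p>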
</section>
<section id="solution-1-bootstrapping" class="level2">
<h2 class="anchored" data-anchor-id="solution-1-bootstrapping">Solution 1: Bootstrapping</h2>
<p>In many circumstances, it helps to enable bootstrapping to estimate the bias. When bootstrapping is enabled, the <code>segregation</code> package reports <a href="../2021-01-07-bootstrap-bias">bias-adjusted estimates</a>. Let’s try this for both datasets from above:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_total</span>(dat, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb6-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 100 bootstrap iterations on 50 observations</span></span>
<span id="cb6-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat      est     se              CI   bias</span></span>
<span id="cb6-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt;    &lt;num&gt;  &lt;num&gt;          &lt;list&gt;  &lt;num&gt;</span></span>
<span id="cb6-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      M -0.00933 0.0465 -0.0956, 0.0647 0.0503</span></span>
<span id="cb6-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      H -0.01520 0.0683 -0.1400, 0.0932 0.0743</span></span>
<span id="cb6-7"></span>
<span id="cb6-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_total</span>(dat_large, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb6-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 100 bootstrap iterations on 1e+05 observations</span></span>
<span id="cb6-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat       est      se                CI   bias</span></span>
<span id="cb6-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt;     &lt;num&gt;   &lt;num&gt;            &lt;list&gt;  &lt;num&gt;</span></span>
<span id="cb6-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      M -0.000629 0.00118 -0.00296, 0.00159 0.0546</span></span>
<span id="cb6-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      H -0.000908 0.00170 -0.00427, 0.00230 0.0788</span></span></code></pre></div>
</div>
<p>In this case, the bootstrap estimates the bias pretty well. Because the bias (last column) is subtracted from the segregation estimates, the bootstrap-adjusted estimate may become slightly negative.</p>
<p>For the index of dissimilarity, this procedure does not work as well:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dissimilarity</span>(dat, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb7-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 100 bootstrap iterations on 50 observations</span></span>
<span id="cb7-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat   est     se              CI   bias</span></span>
<span id="cb7-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt; &lt;num&gt;  &lt;num&gt;          &lt;list&gt;  &lt;num&gt;</span></span>
<span id="cb7-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      D 0.171 0.0999 -0.0632, 0.3596 0.0689</span></span>
<span id="cb7-6"></span>
<span id="cb7-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dissimilarity</span>(dat_large, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb7-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 100 bootstrap iterations on 1e+05 observations</span></span>
<span id="cb7-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat   est      se          CI  bias</span></span>
<span id="cb7-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt; &lt;num&gt;   &lt;num&gt;      &lt;list&gt; &lt;num&gt;</span></span>
<span id="cb7-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      D  0.14 0.00201 0.137,0.145 0.108</span></span></code></pre></div>
</div>
<p>Although the bias estimate is fairly large, a substantial bias remains.</p>
</section>
<section id="solution-2-compute-the-expected-value-under-independence" class="level2">
<h2 class="anchored" data-anchor-id="solution-2-compute-the-expected-value-under-independence">Solution 2: Compute the expected value under independence</h2>
<p>The bootstrap may sometimes work to estimate the bias, but two major problems remain. The first, as we have seen, is that the bias estimation does not work well for the index of dissimilarity. The second is that the bootstrap does badly when the contingency table is very sparse and contains many zero entries. I’ll come back to this in the example at the end of the post.</p>
<p>A direct approach to estimating the bias is the following: using the observed marginal distributions, simulate contingency tables under the assumption that true segregation is zero, and record the average index value across a number of repetitions. This quantity is the expected value of the segregation index when students are randomly distributed across schools, conditional on the marginal distributions. In economics, this quantity is sometimes called “random segregation” (Carrington and Troske 1998).</p>
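<p>The algorithm is easy to sketch by hand for the small example from above (the <code>m_index</code> helper is ad hoc, but it uses the natural logarithm and so reproduces the package’s M):</p>

```r
set.seed(1)
# Mutual Information index (M) for a count matrix, in nats
m_index <- function(mat) {
  p <- mat / sum(mat)
  sum(p * log(p / outer(rowSums(p), colSums(p))), na.rm = TRUE)
}
mat <- matrix(c(5, 4, 3, 7, 6, 5, 6, 7, 3, 4), ncol = 2)  # the sample from above
m_index(mat)  # 0.041, matching mutual_total() above

# Simulate 500 tables with the same margins but zero true segregation
sims <- stats::r2dtable(500, rowSums(mat), colSums(mat))
mean(sapply(sims, m_index))  # the expected value of M under independence
```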
<p>The <code>segregation</code> package implements this algorithm in the following two functions:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_expected</span>(dat, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>)</span>
<span id="cb8-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;         stat    est     se</span></span>
<span id="cb8-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;       &lt;char&gt;  &lt;num&gt;  &lt;num&gt;</span></span>
<span id="cb8-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1: M under 0 0.0443 0.0290</span></span>
<span id="cb8-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2: H under 0 0.0639 0.0418</span></span>
<span id="cb8-6"></span>
<span id="cb8-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dissimilarity_expected</span>(dat, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unit"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n"</span>)</span>
<span id="cb8-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;         stat   est     se</span></span>
<span id="cb8-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;       &lt;char&gt; &lt;num&gt;  &lt;num&gt;</span></span>
<span id="cb8-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1: D under 0 0.226 0.0945</span></span></code></pre></div>
</div>
<p>In both cases, calculating the expected value of the index gives a good estimate of the bias. When reporting the final results, we could simply subtract the bias from the segregation estimates.</p>
</section>
<section id="an-example-with-sparse-data" class="level2">
<h2 class="anchored" data-anchor-id="an-example-with-sparse-data">An example with sparse data</h2>
<p>As a final point, the example in this section demonstrates circumstances under which even the information-theoretic indices may be highly biased.</p>
<p>The <code>segregation</code> package contains an example dataset, <code>school_ses</code>, with artificial data. Each row of this dataset describes a student, with information on the school the student attends (<code>school_id</code>), the student’s ethnic group (one of A, B, or C; <code>ethnic_group</code>), and the student’s socio-economic status (provided in quintiles; <code>ses_quintile</code>). Because there are three ethnic groups, we will only compute the multigroup M and H indices.</p>
<p>The <code>school_ses</code> dataset is sparse: There are 149 schools in total, but only 46 of those contain students of all three ethnic groups, and 26 schools contain only students of a single ethnic group.</p>
<p>The ethnic segregation in this dataset is fairly large, but we may expect this estimate to be upwardly biased:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_total</span>(school_ses, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ethnic_group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"school_id"</span>)</span>
<span id="cb9-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat   est</span></span>
<span id="cb9-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt; &lt;num&gt;</span></span>
<span id="cb9-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      M 0.544</span></span>
<span id="cb9-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      H 0.577</span></span></code></pre></div>
</div>
<p>For this dataset, the two approaches of estimating the bias differ somewhat:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_total</span>(school_ses, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ethnic_group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"school_id"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb10-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 100 bootstrap iterations on 5153 observations</span></span>
<span id="cb10-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat   est      se          CI   bias</span></span>
<span id="cb10-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt; &lt;num&gt;   &lt;num&gt;      &lt;list&gt;  &lt;num&gt;</span></span>
<span id="cb10-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      M 0.529 0.01000 0.512,0.545 0.0160</span></span>
<span id="cb10-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      H 0.559 0.00921 0.542,0.576 0.0181</span></span>
<span id="cb10-7"></span>
<span id="cb10-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_expected</span>(school_ses, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ethnic_group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"school_id"</span>)</span>
<span id="cb10-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;         stat    est      se</span></span>
<span id="cb10-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;       &lt;char&gt;  &lt;num&gt;   &lt;num&gt;</span></span>
<span id="cb10-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1: M under 0 0.0304 0.00240</span></span>
<span id="cb10-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2: H under 0 0.0322 0.00254</span></span></code></pre></div>
</div>
<p>Using bootstrapping, the bias for the M index is estimated to be 0.016, while the bias estimated using the “random segregation” approach is 0.03.</p>
<p>This difference is still rather small, and will not be consequential in many situations. However, the advantage of using information-theoretic measures lies in their decomposability, and in decompositions the bias may be much larger. For instance, assume that we are interested in computing ethnic segregation conditional on SES. We can use the <code>within</code> argument to calculate this:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_total</span>(school_ses, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ethnic_group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"school_id"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">within =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ses_quintile"</span>)</span>
<span id="cb11-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat   est</span></span>
<span id="cb11-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt; &lt;num&gt;</span></span>
<span id="cb11-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      M 0.463</span></span>
<span id="cb11-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      H 0.490</span></span></code></pre></div>
</div>
<p>Estimating the bias of this conditional index using bootstrapping yields a bias estimate of around 0.04:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_total</span>(school_ses, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ethnic_group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"school_id"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">within =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ses_quintile"</span>,</span>
<span id="cb12-2">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb12-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 100 bootstrap iterations on 5153 observations</span></span>
<span id="cb12-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;      stat   est      se          CI   bias</span></span>
<span id="cb12-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;    &lt;char&gt; &lt;num&gt;   &lt;num&gt;      &lt;list&gt;  &lt;num&gt;</span></span>
<span id="cb12-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1:      M 0.424 0.00853 0.410,0.439 0.0389</span></span>
<span id="cb12-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2:      H 0.450 0.00909 0.433,0.465 0.0408</span></span></code></pre></div>
</div>
<p>However, if we compute the expected value conditional on SES, the result looks very different:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb13-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutual_expected</span>(school_ses, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ethnic_group"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"school_id"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">within =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ses_quintile"</span>)</span>
<span id="cb13-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;         stat   est      se</span></span>
<span id="cb13-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;       &lt;char&gt; &lt;num&gt;   &lt;num&gt;</span></span>
<span id="cb13-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 1: M under 0 0.105 0.00848</span></span>
<span id="cb13-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; 2: H under 0 0.132 0.01113</span></span></code></pre></div>
</div>
<p>The bias is estimated to be very large – around 0.1 for the M and around 0.13 for the H! The reason for this discrepancy is that the indices are now computed within each group defined by the SES quintiles. These “conditional” contingency tables are much smaller and even sparser than the overall dataset, so the bias is correspondingly larger. One therefore has to be very careful when decomposing segregation measures for small or sparse samples.</p>
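<p>To see where this upward bias comes from, here is a minimal standalone sketch in Python (not the R code used above, and with made-up sample sizes): group membership is assigned completely at random, so true segregation is zero, yet the naive M estimate is systematically positive.</p>

```python
import random
from collections import Counter
from math import log

def m_index(table):
    """Naive M (mutual information) from a dict (unit, group) -> count."""
    total = sum(table.values())
    p_unit, p_group = Counter(), Counter()
    for (u, g), n in table.items():
        p_unit[u] += n / total
        p_group[g] += n / total
    # M = sum over cells of p_ug * log(p_ug / (p_u * p_g))
    return sum((n / total) * log((n / total) / (p_unit[u] * p_group[g]))
               for (u, g), n in table.items())

random.seed(1)
units = [u for u in range(20) for _ in range(10)]  # 20 small units of size 10
ms = []
for _ in range(500):
    groups = [random.random() < 0.5 for _ in units]  # random group labels
    ms.append(m_index(Counter(zip(units, groups))))

print(sum(ms) / len(ms))  # clearly above zero: pure small-sample bias
```

<p>Even though this “city” is completely unsegregated by construction, the naive estimator averages well above zero, and the bias grows as the units get smaller – which is exactly what happens when the data are split into conditional tables.</p>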
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>When working with segregation indices, it is important to be aware that almost all “naive” estimators of these indices are upwardly biased. In many situations, this bias will be small. However, if the overall sample size is small, or some of the groups or units are small, the bias can be substantial. Importantly, a large overall sample does not guarantee that the bias is small. My recommendation is to always check the sensitivity of your results <em>both</em> by bootstrapping and by calculating “random segregation”. Special attention needs to be paid when decomposing segregation measures for small or sparse samples, as the decompositions will be based on even smaller/sparser samples.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p>Winship, Christopher. 1977. A Revaluation of Indexes of Residential Segregation. <em>Social Forces</em> 55(4): 1058-1066.</p>
<p>Carrington, William J. and Kenneth R. Troske. 1998. Interfirm Segregation and the Black/White Wage Gap. <em>Journal of Labor Economics</em> 16(2): 231-260.</p>


</section>

 ]]></description>
  <category>segregation</category>
  <category>packages</category>
  <guid>https://elbersb.com/public/posts/2021-11-24-segregation-bias/</guid>
  <pubDate>Tue, 23 Nov 2021 23:00:00 GMT</pubDate>
</item>
<item>
  <title>Did Residential Racial Segregation in the U.S. Really Increase?</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2021-07-23-segregation-increase/</link>
  <description><![CDATA[ 




<p>A <a href="https://belonging.berkeley.edu/roots-structural-racism">recent report by the Othering and Belonging Institute at UC Berkeley</a> claimed that, of large metropolitan areas in the U.S., 81% have become more segregated over the period 1990-2019. This finding contradicts the recent sociological literature on changes in residential segregation in the U.S., which has generally found that racial residential segregation has slowly declined since the 1970s, especially between Blacks and Whites. The major question then is: What accounts for this difference?</p>
<p><a href="https://osf.io/preprints/socarxiv/dvutw">My new working paper</a> answers this question, and here’s a quick summary:</p>
<ol type="1">
<li><p>The segregation measure of the Berkeley study, the “Divergence Index,” is identical to mutual information, also known as the <img src="https://latex.codecogs.com/png.latex?M"> index. This index is <strong>mechanically</strong> affected by changes in racial diversity. Given that the U.S. has become more diverse over the period 1990 to 2019, it is not surprising that this index shows increases in segregation.</p>
<p>It is important to emphasize again that the index is <em>mechanically</em> affected by rising diversity. This means that if only the diversity of the metropolitan area changes, the index will increase. Of course, this doesn’t mean that in every metropolitan area where racial diversity is increasing the index value also increases—clearly, other things could also change. The fact that the index is mechanically related to diversity is also not a statement about the general relationship between diversity and segregation: It could be the case, for instance, that more diverse cities are more segregated. When one uses the <img src="https://latex.codecogs.com/png.latex?M"> index to answer such a question, one will almost always find that such a relationship exists, because of the mechanical dependency between diversity and the <img src="https://latex.codecogs.com/png.latex?M"> index.</p>
<p>In mathematical terms, the simplest way to see the influence of diversity on the index value is to write the index as the sum of three entropies: <img src="https://latex.codecogs.com/png.latex?M=E(%5Cmathbf%7Bp%7D_%7Bu%5Ccdot%7D)+E(%5Cmathbf%7Bp%7D_%7B%5Ccdot%20g%7D)-E(%5Cmathbf%7Bp%7D_%7Bug%7D)."> The first term is the entropy of the neighborhood distribution, the second term is the entropy of the racial group distribution, and the third term is the entropy of the joint distribution. Given that the racial group entropy increases when diversity increases, the <img src="https://latex.codecogs.com/png.latex?M"> is clearly affected by rising diversity.</p></li>
<li><p>Once I correct for the confounding of index change with diversity using a <a href="http://elbersb.com/public/posts/smr-paper/">decomposition method</a>, I find that the results are in line with the sociological literature: Residential racial segregation as a whole has declined modestly in most metropolitan areas of the U.S., although segregation has increased slightly when focusing on Asian Americans and Hispanics. The following plot shows the <img src="https://latex.codecogs.com/png.latex?M"> index and the adjusted <img src="https://latex.codecogs.com/png.latex?M"> that corrects for the mechanical influence of rising diversity. The <img src="https://latex.codecogs.com/png.latex?H"> index is also shown:</p></li>
</ol>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2021-07-23-segregation-increase/segtrends.png" class="img-fluid figure-img" style="width:50.0%"></p>
<figcaption>Trends</figcaption>
</figure>
</div>
<p>Clearly, once we adjust for the mechanical influence of diversity (which the <img src="https://latex.codecogs.com/png.latex?H"> also does), segregation in the median metropolitan area is declining. The figure also shows that all indices increase sharply in 2019. Why is this the case? The reason is that for the years 1990, 2000, and 2010, Census data are available. For 2019, only estimates from the American Community Survey are available, which are <a href="https://link.springer.com/article/10.1007/s13524-016-0545-z">well known to inflate segregation estimates</a>. Hence, even the increase in the <img src="https://latex.codecogs.com/png.latex?M">, which is almost entirely confined to the period 2010-2019, may be spurious and due to the use of ACS data.</p>
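<p>The three-entropy identity from point 1 is easy to verify numerically. Here is a short standalone Python sketch with an illustrative (made-up) joint distribution:</p>

```python
from math import log

def entropy(ps):
    # Shannon entropy (natural log), ignoring empty cells
    return -sum(p * log(p) for p in ps if p > 0)

# illustrative joint distribution over (neighborhood, group) cells
p_ug = {("u1", "A"): 0.30, ("u1", "B"): 0.10,
        ("u2", "A"): 0.10, ("u2", "B"): 0.50}

p_u, p_g = {}, {}
for (u, g), p in p_ug.items():
    p_u[u] = p_u.get(u, 0) + p  # neighborhood marginals
    p_g[g] = p_g.get(g, 0) + p  # racial group marginals

# M = E(p_u.) + E(p_.g) - E(p_ug)
M = entropy(p_u.values()) + entropy(p_g.values()) - entropy(p_ug.values())

# check: identical to mutual information computed cell by cell
M_direct = sum(p * log(p / (p_u[u] * p_g[g])) for (u, g), p in p_ug.items())
print(round(M, 4), round(M_direct, 4))  # both 0.1777
```

<p>Since the group entropy <img src="https://latex.codecogs.com/png.latex?E(%5Cmathbf%7Bp%7D_%7B%5Ccdot%20g%7D)"> enters the sum directly, anything that raises diversity pushes the index up, all else equal.</p>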
<p>For more details, the working paper is on <a href="https://osf.io/preprints/socarxiv/dvutw">SocArXiv</a>, as well as a complete set of <a href="https://osf.io/mg9q4/">replication materials</a>.</p>
<section id="a-note-on-local-measures-of-segregation" class="level3">
<h3 class="anchored" data-anchor-id="a-note-on-local-measures-of-segregation">A note on local measures of segregation</h3>
<p>Because this came up in the discussion afterwards, here are some remarks on measures of <em>local</em> segregation. The M index (called divergence index in the Berkeley report) can be written as a weighted average of local segregation scores <img src="https://latex.codecogs.com/png.latex?L_u">, where <img src="https://latex.codecogs.com/png.latex?L_u"> measures the segregation of neighborhood <img src="https://latex.codecogs.com/png.latex?u">:</p>
<p><img src="https://latex.codecogs.com/png.latex?L_u%20=%20%5Csum_%7Bg=1%7D%5E%7BG%7Dp_%7Bg%7Cu%7D%5Clog%5Cfrac%7Bp_%7Bg%7Cu%7D%7D%7Bp_%7B%5Ccdot%20g%7D%7D"></p>
<p>(<img src="https://latex.codecogs.com/png.latex?p_%7Bg%7Cu%7D"> is the proportion of racial group <img src="https://latex.codecogs.com/png.latex?g"> in neighborhood <img src="https://latex.codecogs.com/png.latex?u">, and <img src="https://latex.codecogs.com/png.latex?p_%7B%5Ccdot%20g%7D"> is the overall proportion of racial group <img src="https://latex.codecogs.com/png.latex?g"> in the metropolitan area). This measure is the Kullback-Leibler divergence. Once we weight by the size of the neighborhood, we obtain the M index (mutual information):</p>
<p><img src="https://latex.codecogs.com/png.latex?M=%5Csum_%7Bu=1%7D%5E%7BU%7Dp_%7Bu%5Ccdot%7DL_%7Bu%7D"></p>
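<p>As a concrete illustration (a Python sketch with made-up counts, not code from any of the papers), here is the computation of the local scores <img src="https://latex.codecogs.com/png.latex?L_u"> and their size-weighted average for a toy metro with three neighborhoods:</p>

```python
from math import log

# made-up counts of two racial groups in three neighborhoods
counts = {"u1": {"A": 80, "B": 20},
          "u2": {"A": 50, "B": 50},
          "u3": {"A": 10, "B": 90}}

total = sum(sum(c.values()) for c in counts.values())
# overall group proportions p_.g
p_g = {}
for c in counts.values():
    for g, n in c.items():
        p_g[g] = p_g.get(g, 0) + n / total

M = 0.0
for u, c in counts.items():
    n_u = sum(c.values())
    # L_u: KL divergence of the neighborhood's group distribution
    # from the metro-wide group distribution
    L_u = sum((n / n_u) * log((n / n_u) / p_g[g]) for g, n in c.items() if n > 0)
    print(u, round(L_u, 4))
    M += (n_u / total) * L_u  # weight by neighborhood share p_u.

print("M =", round(M, 4))  # 0.1847
```

<p>The evenly mixed neighborhood u2 scores lowest, because its group distribution is closest to the metro-wide distribution; the M index is just the size-weighted average of these scores.</p>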
<p>The scores <img src="https://latex.codecogs.com/png.latex?L_u"> are really useful, but should they be used to compare or rank neighborhoods <em>across</em> metros or across time? This is unproblematic if we just look at one metro area at one point in time, e.g., to learn which neighborhoods are especially segregated. But what if we want to compare over time or across metros? Then it gets tricky, because local segregation scores are also influenced by the diversity of the metro area. The minimum value of local segregation is zero, but the maximum value is (see if you can guess it) <em>the negative of the logarithm of the proportion of the metro area’s <em>smallest</em> racial group</em> (see <a href="https://osf.io/preprints/socarxiv/3juyc">here</a> for a proof). When comparing across metropolitan areas, the range of the local scores will therefore differ whenever the size of the smallest racial group differs. This makes comparisons really tricky, and I therefore wouldn’t be willing to classify neighborhoods across metro areas as “high” or “low” segregation.</p>
<p>If you want to compare over time, again the decomposition method can be used to adjust for changes in diversity. This map of changes in racial segregation in Brooklyn does this, and shows where segregation increased (red) and declined (blue) in Brooklyn between 2000 and 2010 using diversity-adjusted local segregation scores (<a href="https://journals.sagepub.com/doi/10.1177/0049124121986204">source of map</a>).</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2021-07-23-segregation-increase/brooklyn.jpeg" class="img-fluid figure-img" style="width:50.0%"></p>
<figcaption>Changing segregation in Brooklyn</figcaption>
</figure>
</div>
<p>A final point on the H index: The only difference between M and H is the division by the racial group entropy. We can therefore also define local scores for the H index. These are not normalized, but they still work as a decomposition, i.e.,</p>
<p><img src="https://latex.codecogs.com/png.latex?L_%7Bu%7D%5E%7B(H)%7D%20=%20%5Cfrac%7BL_%7Bu%7D%7D%7BE(%5Cmathbf%7Bp%7D_%7B%5Ccdot%20g%7D)%7D"></p>
<p>and</p>
<p><img src="https://latex.codecogs.com/png.latex?%20H%20=%20%5Csum_%7Bu=1%7D%5E%7BU%7Dp_%7Bu%5Ccdot%7DL_%7Bu%7D%5E%7B(H)%7D%20=%20%5Cfrac%7BM%7D%7BE(%5Cmathbf%7Bp%7D_%7B%5Ccdot%20g%7D)%7D,%20"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?E(%5Cmathbf%7Bp%7D_%7B%5Ccdot%20g%7D)"> is the racial group entropy of the metropolitan area.</p>
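<p>As a quick numeric check (a standalone Python sketch; the group proportions and the M value are illustrative, made-up numbers), dividing M by the racial group entropy yields H:</p>

```python
from math import log

# illustrative metro-wide racial proportions and M index value
p_g = {"A": 140 / 300, "B": 160 / 300}
M = 0.1847

# racial group entropy E(p_.g) of the metropolitan area
E = -sum(p * log(p) for p in p_g.values())
H = M / E  # Theil's H normalizes M by the group entropy
print(round(E, 3), round(H, 3))  # 0.691 0.267
```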
<p>The scores <img src="https://latex.codecogs.com/png.latex?L_%7Bu%7D%5E%7B(H)%7D"> are again useful to understand where in a metropolitan area segregation is lowest or highest, but they are equally problematic when used to compare across metro areas or across time.</p>


</section>

 ]]></description>
  <category>segregation</category>
  <guid>https://elbersb.com/public/posts/2021-07-23-segregation-increase/</guid>
  <pubDate>Thu, 22 Jul 2021 22:00:00 GMT</pubDate>
</item>
<item>
  <title>Simulations in Julia: Efficient by default</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2021-06-28-interaction-simulation/</link>
  <description><![CDATA[ 




<p>Inspired by Grant McDermott’s <a href="https://grantmcdermott.com/efficient-simulations-in-r/">blog post on efficient simulations in R</a>, I decided to reimplement the same exercise in Julia. This post will not make much sense without having read that excellent post, so I’d recommend doing that first.</p>
<p>I recently switched to doing simulation work in Julia instead of R, because you don’t need many tricks to achieve decent performance on the first try. Compared to R, it is not necessary (at least in this example) to generate all the data at once. Instead, we can implement the simulation algorithm more naturally: generate a dataset, extract the quantities of interest, and repeat this process <em>N</em> times. I find that this leads to more intuitive and readable code, and Julia’s great for this kind of task.</p>
<section id="generate-the-data" class="level2">
<h2 class="anchored" data-anchor-id="generate-the-data">1. Generate the data</h2>
<p>We start by implementing a function <code>gen_data()</code> to generate a DataFrame. The code is basically a one-to-one translation from R, but we only generate one instance of the data.</p>
<div class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode julia code-with-copy"><code class="sourceCode julia"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">using</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">Distributions</span> </span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">using</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">DataFrames</span>     </span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">const</span> std_normal <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Normal</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb1-5"></span>
<span id="cb1-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gen_data</span>()</span>
<span id="cb1-7">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## total time periods in the the panel = 500</span></span>
<span id="cb1-8">  tt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span></span>
<span id="cb1-9"></span>
<span id="cb1-10">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># x1 and x2 covariates</span></span>
<span id="cb1-11">  x1_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rand</span>(std_normal, tt)</span>
<span id="cb1-12">  x1_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rand</span>(std_normal, tt)</span>
<span id="cb1-13">  x2_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.+</span> x1_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rand</span>(std_normal, tt)</span>
<span id="cb1-14">  x2_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.+</span> x1_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rand</span>(std_normal, tt)</span>
<span id="cb1-15"></span>
<span id="cb1-16">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># outcomes (notice different slope coefs for x2_A and x2_B)</span></span>
<span id="cb1-17">  y_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x1_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>x2_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rand</span>(std_normal, tt)</span>
<span id="cb1-18">  y_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x1_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>x2_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rand</span>(std_normal, tt)</span>
<span id="cb1-19"></span>
<span id="cb1-20">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># combine</span></span>
<span id="cb1-21">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">DataFrame</span>(</span>
<span id="cb1-22">    id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vcat</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fill</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(x1_A)), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fill</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(x1_B))),</span>
<span id="cb1-23">    x1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vcat</span>(x1_A, x1_B),</span>
<span id="cb1-24">    x2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vcat</span>(x2_A, x2_B),</span>
<span id="cb1-25">    x1_dmean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vcat</span>(x1_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(x1_A), x1_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(x1_B)),</span>
<span id="cb1-26">    x2_dmean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vcat</span>(x2_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(x2_A), x2_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(x2_B)),</span>
<span id="cb1-27">    y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vcat</span>(y_A, y_B))</span>
<span id="cb1-28"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span></span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="3">
<pre><code>gen_data (generic function with 1 method)</code></pre>
</div>
</div>
<p>This should hopefully be pretty clear, even if you have never seen any Julia code before. The function <code>vcat()</code> is used to concatenate vectors, and <code>fill()</code> is similar to <code>rep()</code> in R. Another thing that might be unusual is having to write <code>.+</code> instead of <code>+</code> to achieve vector addition. While this still trips me up occasionally, I think it’s one of <a href="https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting">Julia’s best features</a>.</p>
</section>
<section id="extract-the-quantities-of-interest" class="level2">
<h2 class="anchored" data-anchor-id="extract-the-quantities-of-interest">2. Extract the quantities of interest</h2>
<p>Given a dataset, we can now run the two regressions and extract the coefficients. Here’s a function to achieve that using the GLM package:</p>
<div class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode julia code-with-copy"><code class="sourceCode julia"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">using</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">GLM</span></span>
<span id="cb3-2"></span>
<span id="cb3-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coefs_lm_formula</span>(data)</span>
<span id="cb3-4">  mod_level <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(<span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">@formula</span>(y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> x1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> x2), data)</span>
<span id="cb3-5">  mod_dmean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(<span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">@formula</span>(y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> x1_dmean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> x2_dmean), data)</span>
<span id="cb3-6">  (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(mod_level)[<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span>], <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(mod_dmean)[<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span>])</span>
<span id="cb3-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span></span>
<span id="cb3-8"></span>
<span id="cb3-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># example</span></span>
<span id="cb3-10">data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gen_data</span>()</span>
<span id="cb3-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coefs_lm_formula</span>(data)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="4">
<pre><code>(-0.14253265966089987, -0.01929471565297597)</code></pre>
</div>
</div>
<p>Now we just need to repeat these last two lines a large number of times, and save the coefficients.</p>
</section>
<section id="repeat-n-times" class="level2">
<h2 class="anchored" data-anchor-id="repeat-n-times">3. Repeat N times</h2>
<p>The function below runs the above <em>nsim</em> times and stores the two coefficients in a matrix. I use the BenchmarkTools package to benchmark this function.</p>
<div class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode julia code-with-copy"><code class="sourceCode julia"><span id="cb5-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">run_simulations</span>(nsim)</span>
<span id="cb5-2">  sims <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">zeros</span>(nsim, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>);</span>
<span id="cb5-3">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>nsim</span>
<span id="cb5-4">    data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gen_data</span>()</span>
<span id="cb5-5">    sims[i, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coefs_lm_formula</span>(data)</span>
<span id="cb5-6">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span></span>
<span id="cb5-7">  sims</span>
<span id="cb5-8"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span></span>
<span id="cb5-9"></span>
<span id="cb5-10"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">using</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">BenchmarkTools</span></span>
<span id="cb5-11">n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">20000</span></span>
<span id="cb5-12"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">@btime</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">run_simulations</span>(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>n);</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>  4.862 s (19160002 allocations: 16.35 GiB)</code></pre>
</div>
</div>
<p>Around 5 seconds – not bad at all for this “naive” implementation that doesn’t make any use of particular performance tricks. A simple graph shows that the results are the same as in Grant’s post:</p>
<div class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode julia code-with-copy"><code class="sourceCode julia"><span id="cb7-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">using</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">Plots</span></span>
<span id="cb7-2">sims <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">run_simulations</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">20000</span>)</span>
<span id="cb7-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">histogram</span>(sims, label <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"level"</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dmean"</span>])</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="6">
<div>
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2021-06-28-interaction-simulation/index_files/figure-html/cell-6-output-1.svg" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
</section>
<section id="some-performance-improvements" class="level2">
<h2 class="anchored" data-anchor-id="some-performance-improvements">Some performance improvements</h2>
<p>While this is a pretty good result already, there are of course numerous ways to speed this up. Being a novice to Julia, I’m probably not the best person to show this, but here’s an attempt anyway. One straightforward way to speed this up is to avoid creating the model matrix using the <code>@formula</code> call and instead to create the matrix ourselves.</p>
<p>Here’s a way to do this (I’ll simply overwrite the existing function):</p>
<div class="cell" data-execution_count="6">
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode julia code-with-copy"><code class="sourceCode julia"><span id="cb8-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coefs_lm_formula</span>(data)</span>
<span id="cb8-2">  constant <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fill</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(data))</span>
<span id="cb8-3">  X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">Float64</span>[constant data.id data.x1 data.x2 data.x1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.*</span> data.x2]</span>
<span id="cb8-4">  mod_level <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fit</span>(LinearModel, X, data.y)</span>
<span id="cb8-5">  X[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.=</span> data.x1_dmean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.*</span> data.x2_dmean</span>
<span id="cb8-6">  mod_dmean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fit</span>(LinearModel, X, data.y)</span>
<span id="cb8-7">  (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(mod_level)[<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span>], <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coef</span>(mod_dmean)[<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span>])</span>
<span id="cb8-8"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span></span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="7">
<pre><code>coefs_lm_formula (generic function with 1 method)</code></pre>
</div>
</div>
<p>And benchmark again:</p>
<div class="cell" data-execution_count="7">
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode julia code-with-copy"><code class="sourceCode julia"><span id="cb10-1"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">@btime</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">run_simulations</span>(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>n);</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>  1.888 s (2700002 allocations: 8.58 GiB)</code></pre>
</div>
</div>
<p>A good speedup for a rather simple change!</p>
<p>A final idea is to fit the model more efficiently. The <code>fit</code> function from GLM also computes standard errors and p-values, which are unnecessary for this example. Here’s a benchmark of a version that skips them:</p>
<div class="cell" data-execution_count="8">
<div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode julia code-with-copy"><code class="sourceCode julia"><span id="cb12-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">using</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">LinearAlgebra</span> </span>
<span id="cb12-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fastfit</span>(X, y) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cholesky!</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Symmetric</span>(X<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">'</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> X)) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span> (X<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">'</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> y)</span>
<span id="cb12-3"></span>
<span id="cb12-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coefs_lm_formula</span>(data)</span>
<span id="cb12-5">  constant <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fill</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(data))</span>
<span id="cb12-6">  X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">Float64</span>[constant data.id data.x1 data.x2 data.x1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.*</span> data.x2]</span>
<span id="cb12-7">  mod_level <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fastfit</span>(X, data.y)</span>
<span id="cb12-8">  X[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.=</span> data.x1_dmean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.*</span> data.x2_dmean</span>
<span id="cb12-9">  mod_dmean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fastfit</span>(X, data.y)</span>
<span id="cb12-10">  (mod_level[<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span>], mod_dmean[<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span>])</span>
<span id="cb12-11"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span></span>
<span id="cb12-12"></span>
<span id="cb12-13"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">@btime</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">run_simulations</span>(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>n);</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>  1.416 s (1900002 allocations: 4.64 GiB)</code></pre>
</div>
</div>
<p>Looks like calculating the standard errors and the p-values is not such an expensive operation after all.</p>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>The goal of this post was not to squeeze out the last bit of performance for this particular kind of simulation. Instead, I hope that this post shows that the first “naive” implementation in Julia (which would often be <em>extremely</em> slow in R) is often fast enough. There are many other advantages to Julia, but this is a major one for me.</p>


</section>

 ]]></description>
  <category>statistics</category>
  <guid>https://elbersb.com/public/posts/2021-06-28-interaction-simulation/</guid>
  <pubDate>Sun, 27 Jun 2021 22:00:00 GMT</pubDate>
</item>
<item>
  <title>New paper in SMR: A Method for Studying Difference in Segregation Levels Across Time and Space</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2021-03-08-smr-paper/</link>
  <description><![CDATA[ 




<p>Benjamin Elbers. <a href="https://journals.sagepub.com/doi/10.1177/0049124121986204"><strong>A Method for Studying Difference in Segregation Levels Across Time and Space</strong></a>. Sociological Methods and Research.</p>
<ul>
<li><a href="https://osf.io/preprints/socarxiv/ya7zs/">Preprint</a> and <a href="https://osf.io/pwhdk/">Online Appendix</a> and <a href="https://osf.io/q57tj/">Replication Materials</a></li>
</ul>
<section id="the-problem-margin-dependency" class="level2">
<h2 class="anchored" data-anchor-id="the-problem-margin-dependency">The Problem: Margin dependency</h2>
<p>An important topic in the study of segregation is the comparison of segregation levels across space and time. It has been recognized for a long time that many segregation indices are margin-dependent, which complicates such comparisons. For instance, it can be shown that the index of dissimilarity (<img src="https://latex.codecogs.com/png.latex?D">) is margin-dependent in terms of the units under study (e.g., neighborhoods or schools), but not in terms of the groups (e.g., racial/income groups). This led to a debate in the gender segregation literature in the 1990s, where Charles and Grusky (AJS 1995, Demography 1998) advocated the use of log-linear modeling.</p>
<p>Consider the following four tables, which cross-classify the number of male and female employees across the occupations A, B, and C. Table (1) shows the baseline situation. In Table (2), occupation C has grown, while in Table (3) female employment increased across all occupations. Table (4) shows an extreme example, where the integrated occupation B has grown strongly.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2021-03-08-smr-paper/tables.png" class="img-fluid figure-img"></p>
<figcaption>Tables 1-4</figcaption>
</figure>
</div>
<p>How do different segregation measures characterize these situations? The following table shows how the popular <img src="https://latex.codecogs.com/png.latex?D">, <img src="https://latex.codecogs.com/png.latex?M">, and <img src="https://latex.codecogs.com/png.latex?H"> indices, as well as Charles and Grusky’s log-linear index <img src="https://latex.codecogs.com/png.latex?A">, quantify the amount of segregation. Also shown are the two odds ratios <img src="https://latex.codecogs.com/png.latex?(F_%7BA%7D/M_%7BA%7D)/(F_%7BB%7D/M_%7BB%7D)"> and <img src="https://latex.codecogs.com/png.latex?(F_%7BC%7D/M_%7BC%7D)/(F_%7BB%7D/M_%7BB%7D)">.</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Table</th>
<th><img src="https://latex.codecogs.com/png.latex?D"></th>
<th><img src="https://latex.codecogs.com/png.latex?M"></th>
<th><img src="https://latex.codecogs.com/png.latex?H"></th>
<th><img src="https://latex.codecogs.com/png.latex?A"></th>
<th>Odds ratios</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>(1)</td>
<td>0.465</td>
<td>0.203</td>
<td>0.295</td>
<td>7.22</td>
<td>0.0714 and 9</td>
</tr>
<tr class="even">
<td>(2)</td>
<td>0.501</td>
<td>0.233</td>
<td>0.337</td>
<td>7.22</td>
<td>0.0714 and 9</td>
</tr>
<tr class="odd">
<td>(3)</td>
<td>0.465</td>
<td>0.206</td>
<td>0.297</td>
<td>7.22</td>
<td>0.0714 and 9</td>
</tr>
<tr class="even">
<td>(4)</td>
<td>0.001</td>
<td>0.000</td>
<td>0.000</td>
<td>7.22</td>
<td>0.0714 and 9</td>
</tr>
</tbody>
</table>
<p>The indices <img src="https://latex.codecogs.com/png.latex?D">, <img src="https://latex.codecogs.com/png.latex?M">, and <img src="https://latex.codecogs.com/png.latex?H"> are margin-dependent in one or both directions, while the log-linear index and the odds ratios stay stable. However, they also stay stable in the extreme example of Table (4), which many would regard as not very segregated.</p>
<p>I make use of the M index, which is margin-dependent in both directions, can be standardized (H index), and is highly decomposable:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AM%20=%5Csum_%7Bu%7Dp_%7B%5Ccdot%20u%7D%5Ctext%7BL%7D_%7Bu%7D%5Ctext%7B%20where%20%7D%5Ctext%7BL%7D_%7Bu%7D=%5Csum_%7Bg%7Dp_%7Bg%7Cu%7D%5Clog%5Cfrac%7Bp_%7Bg%7Cu%7D%7D%7Bp_%7Bg%5Ccdot%7D%7D%0A"></p>
<p>The index is defined for a <img src="https://latex.codecogs.com/png.latex?U%20%5Ctimes%20G"> contingency table, where <img src="https://latex.codecogs.com/png.latex?u"> indexes the units and <img src="https://latex.codecogs.com/png.latex?g"> the groups; where <img src="https://latex.codecogs.com/png.latex?p_%7B%5Ccdot%20u%7D"> (<img src="https://latex.codecogs.com/png.latex?p_%7Bg%20%5Ccdot%7D">) is the marginal probability of being in unit <img src="https://latex.codecogs.com/png.latex?u"> (group <img src="https://latex.codecogs.com/png.latex?g">); and where <img src="https://latex.codecogs.com/png.latex?p_%7Bg%7Cu%7D"> is the probability of being in group <img src="https://latex.codecogs.com/png.latex?g"> given unit <img src="https://latex.codecogs.com/png.latex?u">. <img src="https://latex.codecogs.com/png.latex?%7BL%7D_%7Bu%7D"> is called the local segregation score for unit <img src="https://latex.codecogs.com/png.latex?u">.</p>
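<p>As a concrete illustration, here is a minimal Python sketch using a made-up 3×2 table with strictly positive counts (the paper’s actual implementation is the R package segregation), computing the local segregation scores and the M index directly from the formula above:</p>

```python
import numpy as np

# Hypothetical 3x2 contingency table: rows = units, columns = groups
counts = np.array([[10.0, 140.0],
                   [80.0,  80.0],
                   [90.0,  10.0]])

p = counts / counts.sum()          # joint probabilities
p_u = p.sum(axis=1)                # unit marginals p_{.u}
p_g = p.sum(axis=0)                # group marginals p_{g.}
p_g_given_u = p / p_u[:, None]     # conditional probabilities p_{g|u}

# Local segregation scores: L_u = sum_g p_{g|u} * log(p_{g|u} / p_{g.})
L_u = (p_g_given_u * np.log(p_g_given_u / p_g)).sum(axis=1)

# M = sum_u p_{.u} * L_u
M = float(p_u @ L_u)
```

<p>Each local score is a Kullback–Leibler divergence between the unit’s group composition and the overall group composition, and is therefore nonnegative.</p>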
</section>
<section id="the-solution-decomposition-of-m" class="level2">
<h2 class="anchored" data-anchor-id="the-solution-decomposition-of-m">The Solution: Decomposition of <img src="https://latex.codecogs.com/png.latex?M"></h2>
<p>To decompose the difference between two <img src="https://latex.codecogs.com/png.latex?M"> indices at times <img src="https://latex.codecogs.com/png.latex?t_%7B1%7D"> and <img src="https://latex.codecogs.com/png.latex?t_%7B2%7D"> into marginal and structural components, we construct two counterfactual matrices:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?t'_%7B1%7D">, which has the same marginal distributions as <img src="https://latex.codecogs.com/png.latex?t_%7B2%7D">, but the odds ratios from <img src="https://latex.codecogs.com/png.latex?t_%7B1%7D">,</li>
<li><img src="https://latex.codecogs.com/png.latex?t'_%7B2%7D">, which has the same marginal distributions as <img src="https://latex.codecogs.com/png.latex?t_%7B1%7D">, but the odds ratios from <img src="https://latex.codecogs.com/png.latex?t_%7B2%7D">.</li>
</ul>
<p>This allows for the following decomposition:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0AM(t_%7B2%7D)-M(t_%7B1%7D)%20&amp;%20=%5Coverbrace%7B%5Cfrac%7B1%7D%7B2%7D(M(t_%7B2%7D)-M(t'_%7B2%7D))+%5Cfrac%7B1%7D%7B2%7D(M(t'_%7B1%7D)-M(t_%7B1%7D))%7D%5E%7B%5CDelta_%7B%5Ctext%7Bmarginal%7D%7D%7D%5C%5C%0A&amp;%20+%5Cunderbrace%7B%5Cfrac%7B1%7D%7B2%7D(M(t_%7B2%7D)-M(t'_%7B1%7D))+%5Cfrac%7B1%7D%7B2%7D(M(t'_%7B2%7D)-M(t_%7B1%7D))%7D_%7B%5CDelta_%7B%5Ctext%7Bstructural%7D%7D%7D%0A%5Cend%7Baligned%7D%0A"></p>
<p>To construct the two counterfactual matrices, we use Iterative Proportional Fitting (IPF). To construct <img src="https://latex.codecogs.com/png.latex?t'_%7B1%7D">, take <img src="https://latex.codecogs.com/png.latex?t_%7B1%7D"> and adjust all cells towards the column marginals of <img src="https://latex.codecogs.com/png.latex?t_%7B2%7D">. Then adjust all cells towards the row marginals of <img src="https://latex.codecogs.com/png.latex?t_%7B2%7D">. This adjustment towards the column and row marginals is repeated until both marginals have converged, i.e.&nbsp;are similar to those of <img src="https://latex.codecogs.com/png.latex?t_%7B2%7D">.</p>
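<p>The counterfactual construction and the resulting decomposition can be sketched in a few lines of Python (a toy 2×2 example with hypothetical counts; for real analyses, the R package segregation implements all of this):</p>

```python
import numpy as np

def ipf(source, target, tol=1e-10, max_iter=1000):
    """Return a table with the marginals of `target` but the
    odds ratios of `source` (assumes strictly positive cells)."""
    out = source.astype(float).copy()
    row_t = target.sum(axis=1)
    col_t = target.sum(axis=0)
    for _ in range(max_iter):
        out *= (row_t / out.sum(axis=1))[:, None]  # adjust towards row marginals
        out *= col_t / out.sum(axis=0)             # adjust towards column marginals
        if np.allclose(out.sum(axis=1), row_t, atol=tol):
            break
    return out

def m_index(counts):
    """M index for a strictly positive contingency table."""
    p = counts / counts.sum()
    p_u = p.sum(axis=1)[:, None]
    p_g = p.sum(axis=0)
    return float(np.sum(p * np.log(p / (p_u * p_g))))

# Hypothetical tables at t1 and t2 (rows = units, columns = groups)
t1 = np.array([[10.0, 40.0], [30.0, 20.0]])
t2 = np.array([[50.0, 30.0], [20.0, 100.0]])

t1p = ipf(t1, t2)  # marginals of t2, odds ratios of t1
t2p = ipf(t2, t1)  # marginals of t1, odds ratios of t2

delta_marginal = 0.5 * (m_index(t2) - m_index(t2p)) + 0.5 * (m_index(t1p) - m_index(t1))
delta_structural = 0.5 * (m_index(t2) - m_index(t1p)) + 0.5 * (m_index(t2p) - m_index(t1))
```

<p>The two components sum exactly to the total change in the M index, and IPF leaves the odds ratios of the source table untouched.</p>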
<p>There are a few straightforward extensions of this decomposition:</p>
<ul>
<li><strong>Decomposition of <img src="https://latex.codecogs.com/png.latex?%5CDelta">marginal.</strong> It is often of interest to determine how much the row and column marginals have contributed to segregation change separately. To decompose the marginal component further, define <img src="https://latex.codecogs.com/png.latex?M(U;G;O)"> to identify the <img src="https://latex.codecogs.com/png.latex?M"> that is calculated based on the unit marginals from <img src="https://latex.codecogs.com/png.latex?U">, the group marginals from <img src="https://latex.codecogs.com/png.latex?G">, and the odds ratios from <img src="https://latex.codecogs.com/png.latex?O">.</li>
</ul>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5CDelta_%7B%5Ctext%7Bmarginal-units%7D%7D%20&amp;%20=%5Cfrac%7B1%7D%7B4%7D(M(t_%7B2%7D;t_%7B1%7D;t_%7B1%7D)-M(t_%7B1%7D;t_%7B1%7D;t_%7B1%7D))+%5Cfrac%7B1%7D%7B4%7D(M(t_%7B2%7D;t_%7B2%7D;t_%7B1%7D)-M(t_%7B1%7D;t_%7B2%7D;t_%7B1%7D))%5C%5C%0A&amp;%20+%5Cfrac%7B1%7D%7B4%7D(M(t_%7B2%7D;t_%7B2%7D;t_%7B2%7D)-M(t_%7B1%7D;t_%7B2%7D;t_%7B2%7D))+%5Cfrac%7B1%7D%7B4%7D(M(t_%7B2%7D;t_%7B1%7D;t_%7B2%7D)-M(t_%7B1%7D;t_%7B1%7D;t_%7B2%7D))%5C%5C%0A%5CDelta_%7B%5Ctext%7Bmarginal-groups%7D%7D%20&amp;%20=%5Cfrac%7B1%7D%7B4%7D(M(t_%7B1%7D;t_%7B2%7D;t_%7B1%7D)-M(t_%7B1%7D;t_%7B1%7D;t_%7B1%7D))+%5Cfrac%7B1%7D%7B4%7D(M(t_%7B2%7D;t_%7B2%7D;t_%7B1%7D)-M(t_%7B2%7D;t_%7B1%7D;t_%7B1%7D))%5C%5C%0A&amp;%20+%5Cfrac%7B1%7D%7B4%7D(M(t_%7B2%7D;t_%7B2%7D;t_%7B2%7D)-M(t_%7B2%7D;t_%7B1%7D;t_%7B2%7D))+%5Cfrac%7B1%7D%7B4%7D(M(t_%7B1%7D;t_%7B2%7D;t_%7B2%7D)-M(t_%7B1%7D;t_%7B1%7D;t_%7B2%7D))%0A%5Cend%7Baligned%7D%0A"></p>
<p>This decomposition requires six IPF runs in total, and is based on eliminating the marginal contributions in all possible orders (a Shapley value decomposition).</p>
<ul>
<li><strong>Decomposition of <img src="https://latex.codecogs.com/png.latex?%5CDelta">structural.</strong> It is also possible to decompose structural change into the contributions of each individual unit by exploiting the decomposability properties of the <img src="https://latex.codecogs.com/png.latex?M"> index:</li>
</ul>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5CDelta_%7B%5Ctext%7Bstructural%7D%7D%20&amp;%20=%5Cfrac%7B1%7D%7B2%7D(M(t_%7B2%7D)-M(t'_%7B1%7D))+%5Cfrac%7B1%7D%7B2%7D(M(t'_%7B2%7D)-M(t_%7B1%7D))%5C%5C%0A&amp;%20=%5Csum_%7Bu%7D%5Cfrac%7B1%7D%7B2%7D%5Cleft(p_%7B%5Ccdot%20u%7D%5E%7Bt_%7B2%7D%7D%5Cleft%5BL_%7Bu%7D(t_%7B2%7D)-L_%7Bu%7D(t'_%7B1%7D)%5Cright%5D+p_%7B%5Ccdot%20u%7D%5E%7Bt_%7B1%7D%7D%5Cleft%5BL_%7Bu%7D(t'_%7B2%7D)-L_%7Bu%7D(t_%7B1%7D)%5Cright%5D%5Cright)%0A%5Cend%7Baligned%7D%0A"></p>
<ul>
<li><strong>(Dis)appearing units.</strong> In many segregation problems, the researcher has to deal with units that disappear over time, or new units that appear. For instance, in a school segregation problem, schools may close down and new schools may open up. It can be shown that the <img src="https://latex.codecogs.com/png.latex?M"> index provides a clear interpretation for the contribution of these (dis)appearing units towards segregation.</li>
</ul>
</section>
<section id="example-occupational-gender-segregation" class="level2">
<h2 class="anchored" data-anchor-id="example-occupational-gender-segregation">Example: Occupational Gender Segregation</h2>
<p>I now apply the full decomposition to the study of occupational gender segregation of the civilian population of the United States between 1990 and 2016:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0AM(t_%7B2%7D)-M(t_%7B1%7D)%20&amp;%20=%20%5CDelta_%7B%5Ctext%7Badditions%7D%7D%20+%20%5CDelta_%7B%5Ctext%7Bremovals%7D%7D%5C%5C%0A&amp;%20+%20%5CDelta_%7B%5Ctext%7Bmarginal-units%7D%7D%20+%20%5CDelta_%7B%5Ctext%7Bmarginal-groups%7D%7D%5C%5C%0A&amp;%20+%20%5CDelta_%7B%5Ctext%7Bstructural%7D%7D%0A%5Cend%7Baligned%7D%0A"></p>
<p>The data sources are the U.S. Census and the American Community Survey, downloaded from IPUMS; the harmonized occupational codings also come from IPUMS. Some 50 occupations vanish over time, but no new occupations are introduced. The decomposition was carried out for the whole population, as well as for 9 major occupational groups separately.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2021-03-08-smr-paper/diffs.png" class="img-fluid figure-img" style="width:70.0%"></p>
<figcaption>Decomposition of occupational gender segregation by major group</figcaption>
</figure>
</div>
<p>The figure shows that:</p>
<ul>
<li>overall gender segregation has been declining,</li>
<li>much of this is due to changes in the structural component, i.e.&nbsp;the odds ratios,</li>
<li>disappearing occupations do not matter very much, except for operators/laborers,</li>
<li>there is some heterogeneity by major group: declines have been pronounced in some groups, but in some major groups gender segregation has increased,</li>
<li>much of the decline in segregation is structural, while the increase is mostly due to marginal changes,</li>
<li>the three components can offset each other.</li>
</ul>
<p>See also the <a href="https://elbersb.github.io/segregation/">R package segregation</a> which accompanies this paper.</p>


</section>

 ]]></description>
  <category>papers</category>
  <category>segregation</category>
  <guid>https://elbersb.com/public/posts/2021-03-08-smr-paper/</guid>
  <pubDate>Sun, 07 Mar 2021 23:00:00 GMT</pubDate>
</item>
<item>
  <title>Using the bootstrap for bias reduction</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2021-01-07-bootstrap-bias/</link>
  <description><![CDATA[ 




<p>I came across a neat example in Horowitz (2001, p.&nbsp;3174), which demonstrates that, in these specific circumstances at least, the bias-corrected bootstrap estimator has a lower MSE by a large factor. The setup is as follows. We have a sample of 10 iid observations, where <img src="https://latex.codecogs.com/png.latex?X_%7Bi%7D%5Csim%20N(0,6)">. The goal is then to estimate <img src="https://latex.codecogs.com/png.latex?%5Ctheta=%5Cexp(%5Ctext%7BE%7D%5BX_%7Bi%7D%5D)">, for which the true value is <img src="https://latex.codecogs.com/png.latex?%5Ctheta=1">. The plug-in estimator is <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D=%5Cexp%5Cleft(%5Cfrac%7B1%7D%7B10%7D%5Csum_%7Bi=1%7D%5E%7B10%7DX_%7Bi%7D%5Cright)">.</p>
<p>Given a realized sample <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D=(x_%7B1%7D,%5Cldots,x_%7Bn%7D)">, the usual bootstrap estimates are obtained by resampling <img src="https://latex.codecogs.com/png.latex?m"> times from <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D"> with replacement, generating the bootstrap samples <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_%7Bj%7D%5E%7B*%7D">, and the bootstrap estimates <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D_%7Bj%7D%5E%7B*%7D=%5Cexp%5Cleft(%5Cfrac%7B1%7D%7B10%7D%5Csum_%7Bi=1%7D%5E%7B10%7Dx_%7Bj%7D%5E%7B*%7D%5Cright)">. Let <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D%5E%7B*%7D=%5Cfrac%7B1%7D%7Bm%7D%5Csum_%7Bj=1%7D%5E%7Bm%7D%5Chat%7B%5Ctheta%7D_%7Bj%7D%5E%7B*%7D"> be the average across all <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D_%7Bj%7D%5E%7B*%7D">. We can then estimate the bias as <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7B%5Ctext%7BBias%7D%7D%5B%5Chat%7B%5Ctheta%7D%5D=%5Chat%7B%5Ctheta%7D%5E%7B*%7D-%5Chat%7B%5Ctheta%7D">. In R code, this is:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb1-2">data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>))</span>
<span id="cb1-3">(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">thetahat =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(data)))</span>
<span id="cb1-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 1.382411</span></span>
<span id="cb1-5">bs <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">replicate</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, {</span>
<span id="cb1-6">    resample <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(data, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb1-7">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(resample))</span>
<span id="cb1-8">})</span>
<span id="cb1-9">(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">biashat =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(bs) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> thetahat)</span>
<span id="cb1-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1] 0.2973734</span></span></code></pre></div>
</div>
<p>The “debiased” estimate would hence be <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D-%5Cwidehat%7B%5Ctext%7BBias%7D%7D%5B%5Chat%7B%5Ctheta%7D%5D=2%5Chat%7B%5Ctheta%7D-%5Chat%7B%5Ctheta%7D%5E%7B*%7D">. For the concrete result, this is <img src="https://latex.codecogs.com/png.latex?1.382-0.297=1.085">, much closer to the true value <img src="https://latex.codecogs.com/png.latex?%5Ctheta=1">.</p>
<p>Because we control the data-generating process and know the true value of <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, we can repeat the above procedures any number of times and obtain approximations for the MSEs of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D-%5Cwidehat%7B%5Ctext%7BBias%7D%7D%5B%5Chat%7B%5Ctheta%7D%5D">. The following code accomplishes that for 100 repetitions:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">res <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">replicate</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, {</span>
<span id="cb2-2">    data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>))</span>
<span id="cb2-3">    thetahat <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(data))</span>
<span id="cb2-4">    bs <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">replicate</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, {</span>
<span id="cb2-5">        resample <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(data, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb2-6">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(resample))</span>
<span id="cb2-7">    })</span>
<span id="cb2-8">    (<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">debiased =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> thetahat <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(bs))</span>
<span id="cb2-9">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(thetahat <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, debiased <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, (thetahat <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, (debiased <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb2-10">})</span>
<span id="cb2-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">apply</span>(res, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, mean)</span>
<span id="cb2-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; [1]  0.37878143 -0.04919049  1.10729457  0.47833810</span></span></code></pre></div>
</div>
<p>By making use of the identity <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BMSE%7D%5B%5Ccdot%5D=%5Ctext%7BBias%7D%5E%7B2%7D%5B%5Ccdot%5D+%5Ctext%7BVar%7D%5B%5Ccdot%5D">, we obtain the following results:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Estimator</th>
<th>MSE</th>
<th>Bias</th>
<th>Variance</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D"></td>
<td>1.107</td>
<td>0.379</td>
<td>0.964</td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D-%5Cwidehat%7B%5Ctext%7BBias%7D%7D%5B%5Chat%7B%5Ctheta%7D%5D"></td>
<td>0.478</td>
<td>-0.049</td>
<td>0.476</td>
</tr>
</tbody>
</table>
<p>Similar to the results reported in Horowitz (2001, p.&nbsp;3175), there is a large reduction in both bias and MSE. Not reported by Horowitz, but also significant, is the reduction in variance. The true bias<sup>1</sup> of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D"> is <img src="https://latex.codecogs.com/png.latex?%5Cexp(0.3)%20-%201%20%5Capprox%200.35">, so the simulation estimate is not far off.</p>
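<p>The simulation above can be sketched in Python as well (a minimal re-implementation with NumPy, not the original code; following the footnote, each run draws n&nbsp;=&nbsp;10 observations with Var[X<sub>i</sub>]&nbsp;=&nbsp;6, so that the sample mean is N(0,&nbsp;0.6), and the number of bootstrap replicates per run is an arbitrary choice here):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def one_run(n=10, var_x=6.0, n_boot=500):
    # X_i ~ N(0, 6), so thetahat = exp(mean(X)) targets theta = exp(0) = 1
    data = rng.normal(0.0, np.sqrt(var_x), size=n)
    thetahat = np.exp(data.mean())
    # nonparametric bootstrap: resample the data with replacement
    resamples = rng.choice(data, size=(n_boot, n), replace=True)
    bs = np.exp(resamples.mean(axis=1))
    # bias-corrected estimator: 2 * thetahat - mean of bootstrap replicates
    debiased = 2.0 * thetahat - bs.mean()
    return thetahat - 1.0, debiased - 1.0

errors = np.array([one_run() for _ in range(1000)])
bias_plain, bias_corrected = errors.mean(axis=0)
# bias_plain should be near exp(0.3) - 1 (about 0.35); bias_corrected near zero
```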
<section id="references" class="level4">
<h4 class="anchored" data-anchor-id="references">References</h4>
<p>Horowitz, Joel L. 2001. “The Bootstrap.” In: Handbook of Econometrics, Volume 5, edited by J. J. Heckman and E. Leamer. Elsevier.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Let <img src="https://latex.codecogs.com/png.latex?Y%20=%20%5Cfrac%7B1%7D%7B10%7D%20%5Csum%20X_i">, then <img src="https://latex.codecogs.com/png.latex?Y%20%5Csim%20N(0,%200.6)">, and <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D=%20%5Cexp%20(Y)%20%5Csim%20%5Ctext%7BLogNormal%7D(0,%200.6)">. A log-normal random variable has mean <img src="https://latex.codecogs.com/png.latex?%5Cexp%20%5Cleft(%20%5Cmu%20+%20%5Cfrac%7B%5Csigma%5E2%7D%7B2%7D%20%5Cright)">, hence <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BE%7D%5B%5Chat%7B%5Ctheta%7D%5D%20=%20%5Cexp(0.3)">.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>statistics</category>
  <guid>https://elbersb.com/public/posts/2021-01-07-bootstrap-bias/</guid>
  <pubDate>Wed, 06 Jan 2021 23:00:00 GMT</pubDate>
</item>
<item>
  <title>Regression Assumptions, one by one</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2020-10-01-regression-assumptions/</link>
  <description><![CDATA[ 




<!-- • Rewrite this under the assumption that X is random (Hayashi explains this nicely). This is more realistic for non-experimental sciences.

• State at which point Gauss-Markov (BLUE) comes into play

• WHY? "Normality is a concern if you are trying to predict a data point but not if you are trying to approximate a conditional expectation."

• "The question, then, is not whether a laundry list of model assumptions hold, but whether we have a sufficient sample size, and whether the object of interest is the best linear approximation of E[y|X]. It’s true that OLS has even nicer properties when some of these assumptions hold (linearity gives us unbiasedness, homoscedasticity gives us efficiency in the class of linear unbiased estimators, normality gives us asymptotic efficiency in the class of (linear and nonlinear) unbiased estimators). But we may still be interested in this object even if none of these assumptions hold. Linear approximations are easily interpreted and analytically/computationally convenient, and in general E[y|X] is going to be a messy object to try to work with directly." -->
<p>Many textbooks on linear regression start with a fully-fledged regression model, including assumptions such as independence and homoscedasticity from the outset. I have always found this treatment unintuitive, as many results, such as the unbiasedness of the parameter estimates, hold without making these stronger assumptions. The goal of this post is to introduce the assumptions of the OLS model one by one, and only when they become necessary. The major assumptions are, in the order they are introduced: linearity, no perfect collinearity, zero conditional mean, independence, homoscedasticity, and the normal distribution of errors.</p>
<p>The material is inspired by a number of textbooks, most importantly Gelman and Hill (2006) and Hayashi (2000). The latter is especially helpful because it clearly states which assumptions are needed for each result.</p>
<section id="the-basic-assumption-linearity" class="level2">
<h2 class="anchored" data-anchor-id="the-basic-assumption-linearity">The basic assumption: Linearity</h2>
<p>In a regression model, we model the outcome as a linear function of a number of predictors. The model is of the following form:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ay=%5Cbeta_%7B0%7D+%5Cbeta_%7B1%7Dx_%7B1%7D+%5Cbeta_%7B2%7Dx_%7B2%7D+%5Cldots+%5Cepsilon%0A"></p>
<p>We know the outcome, <img src="https://latex.codecogs.com/png.latex?y">, and the predictors <img src="https://latex.codecogs.com/png.latex?x_%7B1%7D,x_%7B2%7D,%5Cldots"> The <img src="https://latex.codecogs.com/png.latex?%5Cbeta_%7Bj%7D"> are unknown, and we want to estimate them from the data. Because we rarely assume that the relationship between <img src="https://latex.codecogs.com/png.latex?y"> and <img src="https://latex.codecogs.com/png.latex?x_%7B1%7D,x_%7B2%7D,%5Cldots"> is completely deterministic, we also add an error term, <img src="https://latex.codecogs.com/png.latex?%5Cepsilon."> This term captures anything about <img src="https://latex.codecogs.com/png.latex?y"> that is not captured by the predictors. The most important assumption in regression is in this equation: We assume that the relationship between the outcome and the predictors is additive and linear. This assumption can be relaxed somewhat. For instance, we could interact predictors to produce the model</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ay=%5Cbeta_%7B0%7D+%5Cbeta_%7B1%7Dx_%7B1%7D+%5Cbeta_%7B2%7Dx_%7B2%7D+%5Cbeta_%7B3%7D(x_%7B1%7Dx_%7B2%7D)+%5Cepsilon.%0A"></p>
<p>In this model, <img src="https://latex.codecogs.com/png.latex?y"> is no longer a linear function of <img src="https://latex.codecogs.com/png.latex?x_%7B1%7D"> and <img src="https://latex.codecogs.com/png.latex?x_%7B2%7D">. However, the regression model is still linear in the parameters, although we have to pay attention when interpreting the coefficients. The same is true for a model of the form <img src="https://latex.codecogs.com/png.latex?y=%5Cbeta_%7B0%7D+%5Cbeta_%7B1%7Dx_%7B1%7D+%5Cbeta_%7B2%7Dx_%7B1%7D%5E%7B2%7D+%5Cepsilon,"> which includes a squared term. If we assume a multiplicative model <img src="https://latex.codecogs.com/png.latex?y=x_%7B1%7D%5E%7B%5Cbeta_%7B1%7D%7Dx_%7B2%7D%5E%7B%5Cbeta_%7B2%7D%7D,"> we can transform it into a linear model by taking logarithms: <img src="https://latex.codecogs.com/png.latex?%5Clog%20y=%5Cbeta_%7B1%7D%5Clog%20x_%7B1%7D+%5Cbeta_%7B2%7D%5Clog%20x_%7B2%7D."></p>
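<p>The point that interactions and polynomial terms keep the model linear in the parameters can be made concrete: they are simply additional columns of the design matrix, and the same least-squares machinery applies. A small sketch with made-up data (NumPy, hypothetical coefficients):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=100), rng.normal(size=100)

# interaction and squared terms are just extra columns of the design
# matrix; the model remains linear in the betas
X = np.column_stack([np.ones(100), x1, x2, x1 * x2, x1 ** 2])
y = 1 + 2 * x1 - x2 + 0.5 * x1 * x2 + 0.3 * x1 ** 2 + rng.normal(size=100)

# ordinary least squares handles this design like any other
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```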
</section>
<section id="regression-as-an-optimization-problem" class="level2">
<h2 class="anchored" data-anchor-id="regression-as-an-optimization-problem">Regression as an optimization problem</h2>
<p>We now collect some data in order to estimate the unknown quantities of this model, the <img src="https://latex.codecogs.com/png.latex?%5Cbeta_%7Bj%7D."> We collect data on the outcome <img src="https://latex.codecogs.com/png.latex?y"> and the predictors <img src="https://latex.codecogs.com/png.latex?x_%7B1%7D,x_%7B2%7D,%5Cldots"> To find values for the <img src="https://latex.codecogs.com/png.latex?%5Cbeta_%7Bj%7D">, first, let’s just guess. For instance, we could roll a die repeatedly and record the results. We call the result <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7Bj%7D,"> to show that we have now estimated <img src="https://latex.codecogs.com/png.latex?%5Cbeta_%7Bj%7D">. To be sure, this estimate will be pretty bad (we didn’t use any of our data to find it!). To formalize the idea that our made-up <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7Bj%7D">’s will likely be bad estimates, we compute for each observation the predicted values, <img src="https://latex.codecogs.com/png.latex?%5Chat%7By%7D_%7Bi%7D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7By%7D_%7Bi%7D%20=%20%5Chat%7B%5Cbeta%7D_%7B0%7D+%5Chat%7B%5Cbeta%7D_%7B1%7Dx_%7Bi1%7D+%5Chat%7B%5Cbeta%7D_%7B2%7Dx_%7Bi2%7D+%5Cldots%0A"></p>
<p>We use only the deterministic part of the model to compute <img src="https://latex.codecogs.com/png.latex?%5Chat%7By%7D,"> because, by definition, we do not know <img src="https://latex.codecogs.com/png.latex?%5Cepsilon_%7Bi%7D."> Intuitively, if we simply guess the parameter values, the predicted values <img src="https://latex.codecogs.com/png.latex?%5Chat%7By%7D_%7Bi%7D"> will have little to do with the actual values <img src="https://latex.codecogs.com/png.latex?y_%7Bi%7D">. One way to quantify how good or bad our guesses are is to compute a loss function, which is some function that quantifies how much the predicted values deviate from the true, known values. One such loss function is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7B%5Cmathscr%7BL%7D%7D_%7B2%7D(%5Chat%7B%5Cbeta%7D_%7B0%7D,%5Chat%7B%5Cbeta%7D_%7B1%7D,%5Cldots)=%5Csum_%7Bi=1%7D%5E%7Bn%7D(y_%7Bi%7D-%5Chat%7By%7D_%7Bi%7D)%5E%7B2%7D,%0A"></p>
<p>This loss function, called the squared error loss, is always positive, and ideally, we would like its value to be small. For a random guess, the differences between <img src="https://latex.codecogs.com/png.latex?y_%7Bi%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Chat%7By%7D_%7Bi%7D"> will likely be large. A better guess would reduce the value of the loss function. This choice of loss function may seem somewhat arbitrary. For instance, one might ask why we didn’t choose the following loss function:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7B%5Cmathscr%7BL%7D%7D_%7B1%7D(%5Chat%7B%5Cbeta%7D_%7B0%7D,%5Chat%7B%5Cbeta%7D_%7B1%7D,%5Cldots)=%5Csum_%7Bi=1%7D%5E%7Bn%7D%7Cy_%7Bi%7D-%5Chat%7By%7D_%7Bi%7D%7C.%0A"></p>
<p>This is also a very reasonable loss function: it is always positive, and we would like its value to be small. There are infinitely many other possible loss functions, and in many situations it will be advantageous to choose one of these alternatives. However, the squared error loss has some advantages. Perhaps most importantly, it’s easy to find a closed-form expression for the <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_j">’s that minimizes the loss. It is from this choice of loss function that ordinary least <em>squares</em> gets its name; but notably, this refers to the estimation method. Nothing in our regression model so far has told us that we require <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7B%5Cmathscr%7BL%7D%7D_%7B2%7D"> as a loss function; it’s just one way to estimate the parameters <img src="https://latex.codecogs.com/png.latex?%5Cbeta_%7Bj%7D">.</p>
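<p>The two loss functions can lead to quite different estimates. As a toy illustration (made-up data, intercept-only model, brute-force grid search rather than any closed-form solution): the squared error loss is minimized at the sample mean, while the absolute error loss is minimized at a sample median, which is far less sensitive to an extreme observation.</p>

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 100.0])  # one extreme value

# intercept-only model: scan candidate fits and evaluate both losses
grid = np.linspace(0.0, 100.0, 100001)
l2 = ((data[:, None] - grid) ** 2).sum(axis=0)
l1 = np.abs(data[:, None] - grid).sum(axis=0)

best_l2 = grid[l2.argmin()]  # minimized at the sample mean (pulled by the outlier)
best_l1 = grid[l1.argmin()]  # minimized at a sample median (any point in [2, 3])
```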
<p>The goal is now to find a set of parameters <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7Bj%7D"> that minimize the value of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7B%5Cmathscr%7BL%7D%7D_%7B2%7D(%5Chat%7B%5Cbeta%7D_%7B0%7D,%5Chat%7B%5Cbeta%7D_%7B1%7D,%5Cldots)">. At this point, it becomes convenient to switch to matrix notation. We arrange the outcome values and the predictors into an <img src="https://latex.codecogs.com/png.latex?n%5Ctimes1"> column vector <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D=(y_%7B1%7D,y_%7B2%7D,%5Cldots,y_%7Bn%7D)">, and an <img src="https://latex.codecogs.com/png.latex?n%5Ctimes%20p"> matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D">, respectively. The first column of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D"> is just a vector of 1’s; this way we can include <img src="https://latex.codecogs.com/png.latex?%5Cbeta_%7B0%7D"> in the parameter vector. The parameters are collected in a <img src="https://latex.codecogs.com/png.latex?p%5Ctimes1"> column vector <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5Cbeta%7D=(%5Cbeta_%7B0%7D,%5Cbeta_%7B1%7D,%5Cldots,%5Cbeta_%7Bp-1%7D)">.</p>
<p>Through the use of some matrix algebra, we can then rewrite the loss function as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7B%5Cmathscr%7BL%7D%7D_%7B2%7D(%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D)=%7C%7C%5Cmathbf%7By%7D-%5Cmathbf%7B%5Chat%7By%7D%7D%7C%7C_%7B2%7D%5E%7B2%7D=%5Cmathbf%7B%5Cmathbf%7By%7D%7D%5E%7BT%7D%5Cmathbf%7B%5Cmathbf%7By%7D%7D-2%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5E%7BT%7D%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7By%7D+%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5E%7BT%7D%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D.%0A"></p>
<p>To find the minimum, we differentiate <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7B%5Cmathscr%7BL%7D%7D_%7B2%7D(%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D)"> with respect to <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> (this requires a bit of matrix calculus):</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%7D%5Cmathcal%7B%5Cmathscr%7BL%7D%7D_%7B2%7D(%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D)=-2%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7By%7D+2%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D.%0A"></p>
<p>Setting this equal to zero, we get</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D=%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7By%7D,%0A"></p>
<p>the so-called “normal equations.” The remaining step is to premultiply both sides by the inverse of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D">, from which we obtain</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D=(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7By%7D.%0A"></p>
<p>To show that this is really the minimum of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7B%5Cmathscr%7BL%7D%7D_%7B2%7D(%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D)">, we would need to show that <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7B%5Cmathscr%7BL%7D%7D_%7B2%7D(%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D)"> is convex. We skip that step here.</p>
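<p>The closed-form solution is easy to verify numerically. A minimal sketch with simulated data (NumPy; the design and coefficients are made up for illustration): solving the normal equations directly gives the same answer as a library least-squares routine.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up data: y = 1 + 2*x1 - 0.5*x2 + noise
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=n)

# solve the normal equations (X^T X) beta_hat = X^T y directly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# a library least-squares routine gives the same answer
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```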
<!-- NOT STRICTLY CORRECT, we also need -->
<p>The expression that we derived for <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> is attractive in its simplicity. However, the expression relies on the fact that <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D"> is invertible. This will only be the case if <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D"> has full column rank, that is, if no column of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D"> is a perfect linear combination of the other columns. It is also required that <img src="https://latex.codecogs.com/png.latex?n%5Cge%20p,"> i.e.&nbsp;that we have at least as many observations as predictors. If <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D"> does not have full column rank, we could replace <img src="https://latex.codecogs.com/png.latex?(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D"> with a pseudo-inverse. However, that inverse won’t be unique, and so the estimates <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> will no longer be unique either. Going forward, we will assume that <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D"> has full column rank.</p>
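<p>The non-uniqueness under rank deficiency is easy to demonstrate. In the sketch below (NumPy, contrived data), one column is exactly twice another, so the normal equations cannot be solved by ordinary inversion; the Moore–Penrose pseudo-inverse still returns <em>a</em> least-squares solution, but shifting it along the null space of the design leaves the fitted values unchanged.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
# third column is exactly twice the second: X has rank 2, not 3
X = np.column_stack([np.ones(n), x1, 2.0 * x1])
y = 3.0 + x1 + rng.normal(size=n)

# X^T X is singular here, so np.linalg.solve(X.T @ X, X.T @ y) would fail;
# the pseudo-inverse still yields a least-squares solution
beta_pinv = np.linalg.pinv(X) @ y

# but it is not unique: (0, 2, -1) lies in the null space of X,
# so shifting along it leaves the fitted values unchanged
beta_alt = beta_pinv + np.array([0.0, 2.0, -1.0])
```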
<p>If we just want to learn about <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D,"> we are done at this point. We made two big assumptions: We assumed that <img src="https://latex.codecogs.com/png.latex?y"> depends linearly on the inputs <img src="https://latex.codecogs.com/png.latex?x_%7B1%7D,x_%7B2%7D,%5Cldots">, and we assumed that the predictors are not perfectly collinear.</p>
</section>
<section id="the-zero-conditional-mean-assumption" class="level2">
<h2 class="anchored" data-anchor-id="the-zero-conditional-mean-assumption">The zero conditional mean assumption</h2>
<p>Arguably, until now, the math involved only some optimization, but no statistics. However, in statistics, we also care about uncertainty. For instance, we are often interested in whether some input <img src="https://latex.codecogs.com/png.latex?x_%7Bj%7D"> is associated with the outcome <img src="https://latex.codecogs.com/png.latex?y">. If <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7Bj%7D">, the coefficient for <img src="https://latex.codecogs.com/png.latex?x_%7Bj%7D">, is zero, we would conclude that there is no association. But <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7Bj%7D"> might be non-zero just by chance, and so we would like some measure of the uncertainty of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7Bj%7D">.</p>
<p>To do so, we make an assumption about the error term. Currently our model can be written in matrix notation as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbf%7By%7D=%5Cmathbf%7BX%7D%5Cboldsymbol%7B%5Cbeta%7D+%5Cboldsymbol%7B%5Cepsilon%7D.%0A"></p>
<p>We now additionally assume that <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5Cepsilon%7D"> is a random vector with mean <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7B0%7D"> and covariance matrix <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5CSigma%7D">. The assumption that <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BE%7D%5B%5Cboldsymbol%7B%5Cepsilon%7D%5D"> is specifically zero is not very strict. For instance, we could assume <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BE%7D%5B%5Cboldsymbol%7B%5Cepsilon%7D%5D=c">, where <img src="https://latex.codecogs.com/png.latex?c"> is some constant, and then absorb this constant in the coefficient <img src="https://latex.codecogs.com/png.latex?%5Cbeta_%7B0%7D">. However, assuming that the mean of the errors is <em>constant</em> is a big assumption: Essentially it means that, after we have accounted for the predictors <img src="https://latex.codecogs.com/png.latex?x_%7B1%7D,x_%7B2%7D,%5Cldots">, there is no other systematic variation in <img src="https://latex.codecogs.com/png.latex?y">.</p>
<p>It is possibly easier to recognize the severity of this assumption by finding the expressions for the mean and variance of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D">. These are now also random variables, as they are functions of <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5Cepsilon%7D">. We still assume that the predictors are known and fixed. It is then straightforward to derive the expectation and variance of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Ctext%7BE%7D%5B%5Cmathbf%7By%7D%5D%20&amp;%20=%5Ctext%7BE%7D%5B%5Cmathbf%7BX%7D%5Cboldsymbol%7B%5Cbeta%7D+%5Cboldsymbol%7B%5Cepsilon%7D%5D=%5Cmathbf%7BX%7D%5Cboldsymbol%7B%5Cbeta%7D+%5Ctext%7BE%7D%5B%5Cboldsymbol%7B%5Cepsilon%7D%5D=%5Cmathbf%7BX%7D%5Cboldsymbol%7B%5Cbeta%7D%5C%5C%0A%5Ctext%7BVar%7D%5B%5Cmathbf%7By%7D%5D%20&amp;%20=%5Ctext%7BVar%7D%5B%5Cmathbf%7BX%7D%5Cboldsymbol%7B%5Cbeta%7D+%5Cboldsymbol%7B%5Cepsilon%7D%5D=%5Ctext%7BVar%7D%5B%5Cboldsymbol%7B%5Cepsilon%7D%5D=%5Cboldsymbol%7B%5CSigma%7D.%0A%5Cend%7Baligned%7D%0A"></p>
<p>Note that the assumption about <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5Cepsilon%7D"> is equivalent to assuming that <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D"> is a random vector with expectation <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D%5Cboldsymbol%7B%5Cbeta%7D"> and covariance matrix <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5CSigma%7D">. We therefore assume that the information we have in <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D"> is sufficient to model <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BE%7D%5B%5Cmathbf%7By%7D%5D">. The zero conditional mean assumption is frequently violated—the most common occurrence is omitted variable bias.</p>
<p>If <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D"> is a random vector, this also has consequences for our OLS estimator, <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D">, which is a function of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D">. Therefore, <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> is a random vector as well. We can derive the expectation of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> as follows:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Ctext%7BE%7D%5B%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5D%20&amp;%20=%5Ctext%7BE%7D%5B(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7By%7D%5D%5C%5C%0A&amp;%20=(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D%20%5Ctext%7BE%7D%5B%5Cmathbf%7By%7D%5D%5C%5C%0A&amp;%20=(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D%20%5Cmathbf%7BX%7D%20%5Cboldsymbol%7B%5Cbeta%7D%5C%5C%0A&amp;%20=%5Cboldsymbol%7B%5Cbeta%7D.%0A%5Cend%7Baligned%7D%0A"></p>
<p>Hence, under the present assumptions, <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> is unbiased. To arrive at this result, we had to assume linearity, no perfect collinearity, and <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BE%7D%5B%5Cboldsymbol%7B%5Cepsilon%7D%5D=%5Cmathbf%7B0%7D"> (zero conditional mean).</p>
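<p>A short simulation illustrates this unbiasedness result (a sketch with made-up values; note the errors are deliberately drawn from a skewed, mean-zero distribution, since no normality is needed for unbiasedness):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design
beta = np.array([2.0, -1.0])
A = np.linalg.solve(X.T @ X, X.T)  # (X^T X)^{-1} X^T, reused across draws

betas = np.empty((reps, 2))
for r in range(reps):
    # skewed but mean-zero errors: unbiasedness does not require normality
    eps = rng.exponential(1.0, size=n) - 1.0
    betas[r] = A @ (X @ beta + eps)

mean_beta = betas.mean(axis=0)  # should be close to the true (2, -1)
```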
</section>
<section id="standard-errors" class="level2">
<h2 class="anchored" data-anchor-id="standard-errors">Standard errors</h2>
<p>Next, making use of the fact that for a non-random matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BA%7D"> and random vector <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D">, <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5B%5Cmathbf%7BA%7D%5Cmathbf%7By%7D%5D=%5Cmathbf%7BA%7D%5C%20%5Ctext%7BVar%7D%5B%5Cmathbf%7By%7D%5D%5Cmathbf%7BA%7D%5E%7BT%7D">, we find that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Ctext%7BVar%7D%5B%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5D%20&amp;%20=%5Ctext%7BVar%7D%5B(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7By%7D%5D%5C%5C%0A&amp;%20=(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D%5Ctext%7BVar%7D%5B%5Cmathbf%7By%7D%5D%5C%20%5Cmathbf%7BX%7D(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5C%5C%0A&amp;%20=(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D%5Cboldsymbol%7B%5CSigma%7D%5Cmathbf%7BX%7D(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D.%0A%5Cend%7Baligned%7D%0A"></p>
<p>This is the expression that we are interested in. The diagonal elements of <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5B%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5D"> (a <img src="https://latex.codecogs.com/png.latex?p%5Ctimes%20p"> matrix) tell us about the uncertainty in estimating the elements of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D">. To estimate the variance, however, we need to know or estimate <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5CSigma%7D">, and this is where we run into trouble. This matrix will be of the form</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboldsymbol%7B%5CSigma%7D=%5Cbegin%7Bbmatrix%7D%5Ctext%7BVar%7D%5By_%7B1%7D%5D%20&amp;%20%5Ctext%7BCov%7D%5By_%7B1%7D,y_%7B2%7D%5D%20&amp;%20%5Ccdots%20&amp;%20%5Ctext%7BCov%7D%5By_%7B1%7D,y_%7Bn%7D%5D%5C%5C%0A%5Ctext%7BCov%7D%5By_%7B2%7D,y_%7B1%7D%5D%20&amp;%20%5Ctext%7BVar%7D%5By_%7B2%7D%5D%20&amp;%20%5Ccdots%20&amp;%20%5Ctext%7BCov%7D%5By_%7B2%7D,y_%7Bn%7D%5D%5C%5C%0A%5Cvdots%20&amp;%20%5Cvdots%20&amp;%20%5Cddots%20&amp;%20%5Cvdots%5C%5C%0A%5Ctext%7BCov%7D%5By_%7Bn%7D,y_%7B1%7D%5D%20&amp;%20%5Ctext%7BCov%7D%5By_%7Bn%7D,y_%7B2%7D%5D%20&amp;%20%5Ccdots%20&amp;%20%5Ctext%7BVar%7D%5By_%7Bn%7D%5D%0A%5Cend%7Bbmatrix%7D%0A"></p>
<p>This is an <img src="https://latex.codecogs.com/png.latex?n%5Ctimes%20n"> symmetric matrix (<img src="https://latex.codecogs.com/png.latex?%5Ctext%7BCov%7D%5By_%7Bi%7D,y_%7Bj%7D%5D=%5Ctext%7BCov%7D%5By_%7Bj%7D,y_%7Bi%7D%5D">). We therefore need to estimate <img src="https://latex.codecogs.com/png.latex?n"> variances and <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7B2%7D(n%5E%7B2%7D-n)"> covariances, for a total of <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7B2%7D(n%5E%7B2%7D+n)"> elements. However, we only have <img src="https://latex.codecogs.com/png.latex?n"> observations in total! Clearly, this won’t work, and we therefore have to make another assumption: first, we will assume that the <img src="https://latex.codecogs.com/png.latex?y_%7Bi%7D"> are independent. We do not assume the <img src="https://latex.codecogs.com/png.latex?y_%7Bi%7D"> are i.i.d., independent and <em>identically</em> distributed. The mean of <img src="https://latex.codecogs.com/png.latex?y_%7Bi%7D"> depends on <img src="https://latex.codecogs.com/png.latex?x_%7Bi%7D">, and therefore two different <img src="https://latex.codecogs.com/png.latex?y_%7Bi%7D">’s may have a different mean. We can also state this assumption in terms of the errors, as <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5CSigma%7D"> is also the covariance matrix of <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5Cepsilon%7D">. (Strictly speaking, we only require here that the outcomes/errors are uncorrelated, but the independence assumption is more transparent.)</p>
<p>Substantively, the assumption of independence means that we assume <img src="https://latex.codecogs.com/png.latex?E%5By_i%7Cy_j%5D%20=%20E%5By_i%5D">, i.e.&nbsp;that the mean of any outcome does not depend on the values of the other outcomes. Depending on the problem, this assumption can be unrealistic, and it can be relaxed with techniques such as clustered standard errors or multilevel models. The assumption of independence is sometimes replaced with the assumption that our data is a random sample from a population of interest, but I find the statement that the <img src="https://latex.codecogs.com/png.latex?y_i"> are independent more precise.</p>
<p>With the independence assumption, the form of <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5CSigma%7D"> is radically simplified, as all off-diagonal values are now zero:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboldsymbol%7B%5CSigma%7D'=%5Cbegin%7Bbmatrix%7D%5Ctext%7BVar%7D%5By_%7B1%7D%5D%20&amp;%20%20&amp;%20%20&amp;%20%5Cmathbf%7B0%7D%5C%5C%0A&amp;%20%5Ctext%7BVar%7D%5By_%7B2%7D%5D%5C%5C%0A&amp;%20%20&amp;%20%5Cddots%5C%5C%0A%5Cmathbf%7B0%7D%20&amp;%20%20&amp;%20%20&amp;%20%5Ctext%7BVar%7D%5By_%7Bn%7D%5D%0A%5Cend%7Bbmatrix%7D%0A"></p>
<p>We are left with the <img src="https://latex.codecogs.com/png.latex?n"> variances on the diagonal. Estimating <img src="https://latex.codecogs.com/png.latex?n"> variances with <img src="https://latex.codecogs.com/png.latex?n"> data points might still sound impossible, but it is routinely done: when using robust or heteroscedasticity-consistent standard errors, the variances <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5By_%7Bi%7D%5D"> are estimated as <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5By_%7Bi%7D%5D=(y_%7Bi%7D-%5Chat%7By%7D_%7Bi%7D)%5E%7B2%7D"> (White 1980; but see Freedman 2006 and King and Roberts 2015). However, in the standard regression model, we instead make another assumption: homoscedasticity. This means that we assume that all the <img src="https://latex.codecogs.com/png.latex?y_%7Bi%7D"> have the same variance, which we will denote by <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E%7B2%7D">. Thus,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboldsymbol%7B%5CSigma%7D''=%5Cbegin%7Bbmatrix%7D%5Csigma%5E%7B2%7D%20&amp;%20%20&amp;%20%20&amp;%20%5Cmathbf%7B0%7D%5C%5C%0A&amp;%20%5Csigma%5E%7B2%7D%5C%5C%0A&amp;%20%20&amp;%20%5Cddots%5C%5C%0A%5Cmathbf%7B0%7D%20&amp;%20%20&amp;%20%20&amp;%20%5Csigma%5E%7B2%7D%0A%5Cend%7Bbmatrix%7D=%5Csigma%5E%7B2%7D%5Cmathbf%7BI%7D,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BI%7D"> is the <img src="https://latex.codecogs.com/png.latex?n%5Ctimes%20n"> identity matrix. Now, we only have to estimate one parameter with <img src="https://latex.codecogs.com/png.latex?n"> observations. Before we show how to estimate <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E%7B2%7D">, we first note that the expression for the variance becomes much simpler when we assume that the covariance matrix of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D"> is given by <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E%7B2%7D%5Cmathbf%7BI%7D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Ctext%7BVar%7D%5B%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5D%20&amp;%20=(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D(%5Csigma%5E%7B2%7D%5Cmathbf%7BI%7D)%5Cmathbf%7BX%7D(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5C%5C%0A&amp;%20=%5Csigma%5E%7B2%7D(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5C%5C%0A&amp;%20=%5Csigma%5E%7B2%7D(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%0A%5Cend%7Baligned%7D%0A"></p>
<p>It remains to estimate <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E%7B2%7D">. Under the assumptions made, it can be shown that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7B%5Csigma%7D%5E%7B2%7D=%5Cfrac%7B1%7D%7Bn-p%7D%5Csum_%7Bi=1%7D%5E%7Bn%7D(y_%7Bi%7D-%5Chat%7By%7D_%7Bi%7D)%5E%7B2%7D%0A"></p>
<p>is an unbiased estimator for <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E%7B2%7D">. The expression is just the average squared error that our model makes in predicting <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D">, adjusted for the <img src="https://latex.codecogs.com/png.latex?p"> degrees of freedom that we required to estimate <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D">. Hence, for an estimate of the variance of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D">, we can use</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cwidehat%7B%5Ctext%7BVar%7D%7D%5B%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5D=%20%5Chat%7B%5Csigma%7D%5E2%20(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D,%0A"></p>
<p>and the square root of this expression is, of course, the standard error. If we are just interested in the standard error, we can stop here.</p>
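<p>These formulas are easy to check directly with numpy. The following is a minimal sketch on simulated data (all names and numbers are made up for illustration, not taken from the post), computing the OLS estimate, the unbiased variance estimate, and the standard errors:</p>

```python
import numpy as np

# Sketch: OLS coefficients and standard errors via the formulas above.
rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])          # illustrative values
y = X @ beta_true + rng.normal(scale=1.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                    # OLS estimate
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)            # unbiased estimate of sigma^2
var_beta_hat = sigma2_hat * XtX_inv             # estimated Var[beta_hat]
se = np.sqrt(np.diag(var_beta_hat))             # standard errors
```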
</section>
<section id="the-normality-assumption" class="level2">
<h2 class="anchored" data-anchor-id="the-normality-assumption">The Normality assumption</h2>
<p>We are now able to estimate <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BE%7D%5B%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5D"> and <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5B%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5D">. But if we want to perform statistical inference, we need to know more than just these first two moments. This is where the normality assumption comes in. This assumption applies to the errors, i.e.&nbsp;we assume that <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5Cepsilon%7D%5Csim%5Ctext%7BMVN%7D(0,%5Csigma%5E%7B2%7D%5Cmathbf%7BI%7D)">. From this it follows that <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D%5Csim%5Ctext%7BMVN%7D(%5Cmathbf%7BX%7D%5Cboldsymbol%7B%5Cbeta%7D,%5Csigma%5E%7B2%7D%5Cmathbf%7BI%7D),"> and</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5Csim%5Ctext%7BMVN%7D%5Cleft(%5Cboldsymbol%7B%5Cbeta%7D,%5Csigma%5E%7B2%7D(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cright).%0A"></p>
<p>With this assumption in place, we know the exact sampling distribution of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D">, which allows us to derive confidence intervals for <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D">. These expressions show that the normality assumption is computationally convenient: assuming normal errors directly implies that <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> is also normally distributed. The normality assumption can also be justified through the central limit theorem. In other words, even without the explicit assumption, we could have justified the use of a normal approximation.</p>
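<p>The payoff of knowing the exact sampling distribution can be checked by simulation. This sketch (mine, not from the post; all values are made up) verifies that the normal-theory 95% confidence interval for a slope covers the true value roughly 95% of the time:</p>

```python
import numpy as np

# Monte Carlo check of normal-theory confidence intervals for a slope.
rng = np.random.default_rng(1)
n, sigma = 100, 1.0
beta = np.array([0.5, -1.0])      # intercept and slope (illustrative)
reps, covered = 2000, 0
for _ in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ beta + rng.normal(scale=sigma, size=n)
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (n - 2)                  # unbiased sigma^2 estimate
    se = np.sqrt(s2 * XtX_inv[1, 1])              # s.e. of the slope
    covered += b[1] - 1.96 * se <= beta[1] <= b[1] + 1.96 * se
coverage = covered / reps                         # should be close to 0.95
```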
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>It turns out that this list is similar to the one presented by Gelman and Hill (2006, p.&nbsp;45f.), who list validity, additivity and linearity, independence of errors, equal variance of errors, and normality of errors as assumptions, in order of their importance. Their list, however, adds another major assumption to the front: validity. In their words,</p>
<blockquote class="blockquote">
<p>Most importantly, the data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient. Optimally, this means that the outcome measure should accurately reflect the phenomenon of interest, the model should include all relevant predictors, and the model should generalize to the cases to which it will be applied. (p.&nbsp;45)</p>
</blockquote>
<p>Furthermore, none of the mathematical assumptions guarantee that the regression coefficients can be interpreted as causal.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li>Freedman, David A. (2006). “On The So-Called ‘Huber Sandwich Estimator’ and ‘Robust Standard Errors’”. The American Statistician. 60 (4): 299–302. doi:10.1198/000313006X152207.</li>
<li>Gelman, Andrew and Jennifer Hill. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.</li>
<li>Hayashi, Fumio. (2000). Econometrics. Princeton University Press.</li>
<li>King, Gary; Roberts, Margaret E. (2015). “How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It”. Political Analysis. 23 (2): 159–179. doi:10.1093/pan/mpu015.</li>
<li>White, Halbert (1980). “A Heteroscedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroscedasticity”. Econometrica. 48 (4): 817–838. doi:10.2307/1912934.</li>
</ul>


</section>

 ]]></description>
  <category>statistics</category>
  <guid>https://elbersb.com/public/posts/2020-10-01-regression-assumptions/</guid>
  <pubDate>Wed, 30 Sep 2020 22:00:00 GMT</pubDate>
</item>
<item>
  <title>New paper in Social Forces: Training Regimes and Skill Formation in France and Germany: An Analysis of Change Between 1970 and 2010</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2020-06-21-france-germany/</link>
  <description><![CDATA[ 




<p>Benjamin Elbers, Thijs Bol, and Thomas A. DiPrete. 2020. <a href="https://doi.org/10.1093/sf/soaa037"><strong>Training Regimes and Skill Formation in France and Germany: An Analysis of Change Between 1970 and 2010</strong></a>. <em>Social Forces</em>, forthcoming.</p>
<ul>
<li><a href="https://osf.io/s8nz4/">Preprint</a></li>
</ul>
<p>France and Germany are often portrayed as very different when it comes to school-to-work linkages: France focuses on general education, which means that graduates find jobs in all kinds of sectors. Therefore, the link between education and occupation should be low in France. Germany provides specialized education, leading to a strong match between educational degrees and jobs. High linkage means that one’s educational degree is very predictive of the occupation one is employed in. Low linkage means that the educational degree has no consequence for the kinds of occupations one works in. We use a segregation index, the Mutual Information Index M, to capture this idea.</p>
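<p>As a side note, the Mutual Information Index M of a two-way table is simply the mutual information between the education and occupation variables: it is zero under independence (no linkage) and grows as education becomes more predictive of occupation. A toy sketch of the computation (the contingency table below is entirely made up; this is not data from the paper):</p>

```python
import numpy as np

# Hypothetical education-by-occupation table of worker counts;
# a strong diagonal means education predicts occupation well.
counts = np.array([
    [80, 10, 10],
    [10, 80, 10],
    [10, 10, 80],
], dtype=float)

p = counts / counts.sum()            # joint distribution
p_e = p.sum(axis=1, keepdims=True)   # education margin
p_o = p.sum(axis=0, keepdims=True)   # occupation margin
# Mutual information; the np.where guards against empty cells
terms = np.where(p > 0, p * np.log(p / (p_e * p_o)), 0.0)
M = terms.sum()
```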
<p>We used different sources of microdata (French labor force surveys, German censuses, and the European Labor Force Survey) to study whether the characterization of France as low-linkage and Germany as high-linkage was true, both historically (in 1970) and now (well, in 2010).</p>
<p>Our first major finding is that there are strong gender differences: France and Germany look very different when we focus on men, but much less different when we focus on women. Historically, many studies have focused on men only, which gives a one-sided picture. Figure 1 shows the time-series for our measure of linkage.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2020-06-21-france-germany/m.png" class="img-fluid figure-img" style="width:75.0%"></p>
<figcaption>Figure 1: Differences in school-to-work linkages when viewed from the male (left) and female (right) perspective</figcaption>
</figure>
</div>
<p>Clearly, linkage has also increased over time. This can happen for many reasons, for instance, it could just be because of educational expansion. (Higher educated people usually have higher linkage, as they are more specialized.) To disentangle the different sources, we used a decomposition method that is described <a href="https://osf.io/preprints/socarxiv/ya7zs/">in another paper</a>.</p>
<p>When comparing Germany in 1970 to Germany in 2010 (right-hand panel of Figure 2), we find that the increase is indeed mostly explained by the differences in educational composition. In France, the changing educational composition has also increased linkage, but a large part of this increase has been offset by declines in structural linkage, which is that part of linkage that is unexplained by changes in the composition of the workforce.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2020-06-21-france-germany/diff-country.png" class="img-fluid figure-img"></p>
<figcaption>Figure 2: Decomposition of Differences in Linkage, 1970-2010</figcaption>
</figure>
</div>
<p>Comparing the countries to each other is even more interesting! In Figure 3, we find that Germany’s higher linkage in 1970 is almost entirely explained by Germany’s different educational distribution. In other words, France’s educational system was providing as good a match as the German system did, but it provided such a good match for a much smaller part of the workforce. In 2010, a lot of the difference is structural, consistent with the over-time comparison. The results in the paper are much more detailed, breaking down the components further by education and occupation.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2020-06-21-france-germany/diff-time.png" class="img-fluid figure-img"></p>
<figcaption>Figure 3: Decomposition of Differences in Linkage, France vs.&nbsp;Germany</figcaption>
</figure>
</div>
<p>To summarize, we find three things:</p>
<ol type="1">
<li><p>School-to-work linkages have increased over time in both France and Germany, due to educational expansion.</p></li>
<li><p>In France, educational expansion has been accompanied by a decline in the effectiveness with which graduates are matched with the labor market.</p></li>
<li><p>In the 1970s, the main difference between France and Germany was compositional, not structural. This is a major departure from earlier studies, which framed France and Germany as opposite poles on a spectrum between low and high linkage.</p></li>
</ol>
<p>Because the change within countries has been so strong, we argue against characterizing educational systems at the national level, especially over longer periods of time. There is a long tradition in sociology of characterizing the German system as “qualificational” and the French system as “organizational”. Our results raise the question of whether such cross-national classifications of skill formation systems do justice to actual cross-national differences. We believe this not to be the case. When looking more closely into how school-to-work linkages are established, countries might be similar in some respects (structural linkage) but differ in others (the composition of workers across programs). Moreover, the differences within countries are as large as or larger than the differences between countries.</p>



 ]]></description>
  <category>papers</category>
  <guid>https://elbersb.com/public/posts/2020-06-21-france-germany/</guid>
  <pubDate>Sat, 20 Jun 2020 22:00:00 GMT</pubDate>
</item>
<item>
  <title>Understanding regression coefficients and multicollinearity through the standardized regression model</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2020-01-08-correlation-model/</link>
  <description><![CDATA[ 




<p>The so-called <em>standardized regression model</em> is often presented in textbooks<sup>1</sup> as a solution to numerical issues that can arise in regression analysis, or as a method to bring the regression coefficients to a common, more interpretable scale. However, this transformation can also be useful to gain a deeper understanding into the construction of regression coefficients, the problem of multicollinearity, and the inflation of standard errors. It can thus also be a useful educational tool.</p>
<section id="correlation-transformation" class="level2">
<h2 class="anchored" data-anchor-id="correlation-transformation">Correlation transformation</h2>
<p>The <em>standardized model</em> refers to the model that is estimated after applying the correlation transformation to the outcome and the predictor variables. Let <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Ba%7D=(a_%7B1%7D,a_%7B2%7D,%5Cldots,a_%7Bn%7D)%5ET"> be a column vector of length <em>n</em>, then the correlation transformation is defined by</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Aa_%7Bi%7D%5E%7B*%7D=%5Cfrac%7Ba_%7Bi%7D-%5Cbar%7Ba%7D%7D%7B%5Csqrt%7B%5Csum_%7Bi=1%7D%5E%7Bn%7D(a_%7Bi%7D-%5Cbar%7Ba%7D)%5E%7B2%7D%7D%7D,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cbar%7Ba%7D"> denotes the mean of the components of <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Ba%7D">. The correlation transformation is similar to a z-standardization, but instead of dividing by the standard deviation, we divide by the square root of the sum of squares. If we now consider another vector <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bb%7D%5E%7B*%7D">, for which the same transformation has been applied, we find that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft(%5Cmathbf%7Ba%7D%5E%7B*%7D%5Cright)%5E%7BT%7D%5Cmathbf%7Bb%7D%5E%7B*%7D=r_%7Ba,b%7D,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?r_%7Ba,b%7D"> denotes the Pearson correlation coefficient between vectors <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Ba%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bb%7D">. From this it also follows that the dot product of the transformed vector with itself will be 1, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?%5Cleft(%5Cmathbf%7Ba%7D%5E%7B*%7D%5Cright)%5E%7BT%7D%5Cmathbf%7Ba%7D%5E%7B*%7D=1."> The correlation transformation is the key “trick” that will be used to estimate the standardized model.</p>
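<p>Both properties are easy to verify numerically. A short sketch (the function and variable names are mine, for illustration only):</p>

```python
import numpy as np

def corr_transform(a):
    """Center a vector, then divide by the root of its sum of squared deviations."""
    d = a - a.mean()
    return d / np.sqrt((d ** 2).sum())

rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)
a_star, b_star = corr_transform(a), corr_transform(b)
unit_norm = a_star @ a_star   # dot product with itself: exactly 1
pearson = a_star @ b_star     # dot product with another: the Pearson r
```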
</section>
<section id="the-standardized-model" class="level2">
<h2 class="anchored" data-anchor-id="the-standardized-model">The standardized model</h2>
<p>In a standard regression problem, we have an <img src="https://latex.codecogs.com/png.latex?n%5Ctimes1"> outcome vector <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D"> and a <img src="https://latex.codecogs.com/png.latex?n%5Ctimes%20p"> matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D"> containing the <em>p</em> predictors. To estimate the standardized model, we apply the correlation transformation to the outcome vector <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D"> and to each of the predictors. We then estimate the model</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Cmathbf%7By%7D%20&amp;%20=%5Cmathbf%7BX%7D%5Cboldsymbol%7B%5Cbeta%7D+%5Cboldsymbol%7B%5Cepsilon%7D,%5C%5C%0A%5Cboldsymbol%7B%5Cepsilon%7D%20&amp;%20%5Csim%20N(0,%5Csigma%5E%7B2%7D%5Cmathbf%7BI%7D).%0A%5Cend%7Baligned%7D%0A"></p>
<p>The design matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D=%5Cbegin%7Bbmatrix%7D%5Cmathbf%7Bx%7D_%7B1%7D%20&amp;%20%5Cmathbf%7Bx%7D_%7B2%7D%20&amp;%20%5Ccdots%20&amp;%20%5Cmathbf%7Bx%7D_p%5Cend%7Bbmatrix%7D"> contains the <em>p</em> transformed predictors, but no intercept. This is because any intercept term would always be estimated to be zero after the correlation transformation has been applied.</p>
<p>The correlation transformation makes it much easier to understand the role of the key components that are required when finding the estimates for the vector <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D=(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7By%7D%0A"></p>
<p>The first component is the matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D">, which now has the simple form</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D=%5Cbegin%7Bbmatrix%7D1%20&amp;%20r_%7B1,2%7D%20&amp;%20%5Ccdots%20&amp;%20r_%7B1,p%7D%5C%5C%0Ar_%7B2,1%7D%20&amp;%201%20&amp;%20%5Ccdots%20&amp;%20r_%7B2,p%7D%5C%5C%0A%5Cvdots%20&amp;%20%5Cvdots%20&amp;%20%5Cddots%20&amp;%20%5Cvdots%5C%5C%0Ar_%7Bp,1%7D%20&amp;%20r_%7Bp,2%7D%20&amp;%20%5Ccdots%20&amp;%201%0A%5Cend%7Bbmatrix%7D=%5Cmathbf%7Br%7D_%7BXX%7D,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D"> stands for the correlation between predictors <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_%7B1%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_%7B2%7D">. Since this matrix is simply the correlation matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Br%7D_%7BXX%7D"> between the predictor variables, all of its diagonal elements are 1, and all off-diagonal elements are between -1 and 1.</p>
<p>The second component is the vector <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7By%7D">, which has the simple form</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7By%7D=%5Cbegin%7Bbmatrix%7Dr_%7B1,y%7D%5C%5C%0Ar_%7B2,y%7D%5C%5C%0A%5Cvdots%5C%5C%0Ar_%7Bp,y%7D%0A%5Cend%7Bbmatrix%7D=%5Cmathbf%7Br%7D_%7BXY%7D,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?r_%7Bp,y%7D"> stands for the correlation between the <img src="https://latex.codecogs.com/png.latex?p">th predictor and the outcome vector. Thus, the expression for <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> simply involves the two correlation matrices:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D=(%5Cmathbf%7Br%7D_%7BXX%7D)%5E%7B-1%7D%5Cmathbf%7Br%7D_%7BXY%7D.%0A"></p>
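<p>This identity can be confirmed numerically: after applying the correlation transformation, solving the system built from the two correlation matrices reproduces the coefficients of a direct least-squares fit without intercept. A sketch with made-up data (not from the post):</p>

```python
import numpy as np

def corr_transform(a):
    d = a - a.mean()
    return d / np.sqrt((d ** 2).sum())

rng = np.random.default_rng(2)
n = 100
X_raw = rng.normal(size=(n, 3))
y_raw = X_raw @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# Apply the correlation transformation column by column
X = np.apply_along_axis(corr_transform, 0, X_raw)
y = corr_transform(y_raw)

r_XX = X.T @ X                           # correlation matrix of predictors
r_XY = X.T @ y                           # predictor-outcome correlations
beta_std = np.linalg.solve(r_XX, r_XY)   # (r_XX)^{-1} r_XY

# Same answer from a direct least-squares fit (no intercept needed)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```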
<p>Not only the estimates for <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> are of interest, but also their standard errors. The expression for the variance of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BVar%7D(%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D)=%5Chat%7B%5Csigma%7D%5E%7B2%7D(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D=%5Chat%7B%5Csigma%7D%5E%7B2%7D(%5Cmathbf%7Br%7D_%7BXX%7D)%5E%7B-1%7D,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Csigma%7D%5E%7B2%7D"> is estimated through the mean squared error.</p>
<p>Finding the estimates for <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> and the standard errors requires inverting the correlation matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Br%7D_%7BXX%7D">, which is complicated for large <em>p</em>. We will thus look at two limiting cases, which will make inverting the matrix possible: uncorrelated predictors, and a small number of predictors.</p>
</section>
<section id="uncorrelated-predictors" class="level2">
<h2 class="anchored" data-anchor-id="uncorrelated-predictors">(1) Uncorrelated predictors</h2>
<p>We first consider perfectly uncorrelated predictors. When all the predictors are uncorrelated with each other, the correlation matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Br%7D_%7BXX%7D"> has an extremely simple expression:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbf%7Br%7D_%7BXX%7D=%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D=%5Cmathbf%7BI%7D,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BI%7D"> is the identity matrix. This fact should be obvious from inspection of the matrix above. The full expression for <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> simply becomes:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%20=(%5Cmathbf%7Br%7D_%7BXX%7D)%5E%7B-1%7D%5Cmathbf%7Br%7D_%7BXY%7D%0A%20%20=%5Cmathbf%7Br%7D_%7BXY%7D=%5Cbegin%7Bbmatrix%7Dr_%7B1,y%7D%5C%5C%0Ar_%7B2,y%7D%5C%5C%0A%5Cvdots%5C%5C%0Ar_%7Bp,y%7D%0A%5Cend%7Bbmatrix%7D%0A"></p>
<p>Thus, when the predictors are all uncorrelated with each other, the coefficients are simply given by the correlation coefficients between the predictor and the outcome <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D."></p>
<p>The standard errors for the regression are constant, i.e.&nbsp;each coefficient will have the same standard error regardless of the size of the correlation between the predictor and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D."> It can be shown<sup>2</sup> that the standard errors are given by</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7Bs.e.%7D(%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D)=%5Cfrac%7B1%7D%7B%5Csqrt%7Bn-p%7D%7D%5Csqrt%7B1-%5Csum_%7Bi=1%7D%5E%7Bp%7Dr_%7Bi,y%7D%5E%7B2%7D%7D.%0A"></p>
<p>Thus, the standard errors depend only on the sample size, the number of predictors, and the sum of the squared coefficients. Generally, the standard errors will decrease with increasing sample size, increase with an increasing number of predictors, and increase with lower correlations between the predictors and the outcome. All of these results should make intuitive sense.</p>
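<p>This case can be checked numerically by constructing predictors that are exactly uncorrelated: the QR factor of a column-centered matrix has orthonormal, mean-zero columns, which are already in correlation-transformed form. A sketch (the construction and all names are mine):</p>

```python
import numpy as np

def corr_transform(a):
    d = a - a.mean()
    return d / np.sqrt((d ** 2).sum())

rng = np.random.default_rng(3)
n, p = 60, 4
# Exactly uncorrelated predictors: QR factor of a column-centered matrix
A = rng.normal(size=(n, p))
X, _ = np.linalg.qr(A - A.mean(axis=0))
y = corr_transform(X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n))

r_xy = X.T @ y                              # coefficients = correlations here
resid = y - X @ r_xy
sigma2_hat = resid @ resid / (n - p)
se_direct = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))
# Closed-form expression from the text: constant across coefficients
se_formula = np.sqrt((1 - (r_xy ** 2).sum()) / (n - p)) * np.ones(p)
```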
</section>
<section id="two-correlated-predictors" class="level2">
<h2 class="anchored" data-anchor-id="two-correlated-predictors">(2) Two correlated predictors</h2>
<p>In actual applications, perfectly uncorrelated predictors are rare. In fact, the goal of regression is often to control for correlated predictors. We now look at the case of two correlated predictors.</p>
<p>In this case, it is also straightforward to find an expression for <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D."> First, we need to find the inverse of</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbf%7Br%7D_%7BXX%7D=%5Cbegin%7Bbmatrix%7D1%20&amp;%20r_%7B1,2%7D%5C%5C%0Ar_%7B2,1%7D%20&amp;%201%0A%5Cend%7Bbmatrix%7D=%5Cbegin%7Bbmatrix%7D1%20&amp;%20r_%7B1,2%7D%5C%5C%0Ar_%7B1,2%7D%20&amp;%201%0A%5Cend%7Bbmatrix%7D.%0A"></p>
<p>The determinant of this matrix is <img src="https://latex.codecogs.com/png.latex?%5Cdet%5Cmathbf%7Br%7D_%7BXX%7D=1-r_%7B1,2%7D%5E%7B2%7D">, and the inverse is then</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A(%5Cmathbf%7Br%7D_%7BXX%7D)%5E%7B-1%7D%20&amp;%20=%5Cfrac%7B1%7D%7B1-r_%7B1,2%7D%5E%7B2%7D%7D%5Cbegin%7Bbmatrix%7D1%20&amp;%20-r_%7B1,2%7D%5C%5C%0A-r_%7B1,2%7D%20&amp;%201%0A%5Cend%7Bbmatrix%7D.%0A%5Cend%7Baligned%7D%0A"></p>
<p>As an aside, this form of the matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Br%7D_%7BXX%7D=%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D"> also makes it easy to see why perfectly correlated predictors are problematic: When <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D=%5Cpm1">, the determinant of the matrix is zero and the matrix does not have an inverse.</p>
<p>The full expression for <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%20&amp;%20=(%5Cmathbf%7Br%7D_%7BXX%7D)%5E%7B-1%7D%5Cmathbf%7Br%7D_%7BXY%7D%5C%5C%0A&amp;=%5Cfrac%7B1%7D%7B1-r_%7B1,2%7D%5E%7B2%7D%7D%5Cbegin%7Bbmatrix%7D1%20&amp;%20-r_%7B1,2%7D%5C%5C%0A-r_%7B1,2%7D%20&amp;%201%0A%5Cend%7Bbmatrix%7D%5Cbegin%7Bbmatrix%7Dr_%7B1,y%7D%5C%5C%0Ar_%7B2,y%7D%0A%5Cend%7Bbmatrix%7D%5C%5C%0A&amp;%20=%5Cfrac%7B1%7D%7B1-r_%7B1,2%7D%5E%7B2%7D%7D%5Cbegin%7Bbmatrix%7Dr_%7B1,y%7D-r_%7B1,2%7Dr_%7B2,y%7D%5C%5C%0Ar_%7B2,y%7D-r_%7B1,2%7Dr_%7B1,y%7D%0A%5Cend%7Bbmatrix%7D%0A%5Cend%7Baligned%7D%0A"></p>
<p>Thus,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Chat%7B%5Cbeta%7D_%7B1%7D%20&amp;%20=%5Cfrac%7Br_%7B1,y%7D-r_%7B1,2%7Dr_%7B2,y%7D%7D%7B1-r_%7B1,2%7D%5E%7B2%7D%7D,%5C%5C%0A%5Chat%7B%5Cbeta%7D_%7B2%7D%20&amp;%20=%5Cfrac%7Br_%7B2,y%7D-r_%7B1,2%7Dr_%7B1,y%7D%7D%7B1-r_%7B1,2%7D%5E%7B2%7D%7D.%0A%5Cend%7Baligned%7D%0A"></p>
<p>It is immediately evident that, when the two predictors are uncorrelated <img src="https://latex.codecogs.com/png.latex?(r_%7B1,2%7D=0),"> the estimated regression coefficients are simply given by their correlation with <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D"> (as seen above). When <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D%5Cneq%200,"> both coefficients will change, and the effect will be larger for larger values of <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D."> If we assume that all three correlations are positive, the formula provides an intuitive way of thinking about what it means to “control” for another variable: the raw correlation between <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_1"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D"> will be reduced by an amount that depends both on the size of the correlation between <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_1"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_2"> and on the correlation between <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_2"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D."></p>
<p>For instance, assume that we are interested in the coefficient <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7B1%7D">. We let <img src="https://latex.codecogs.com/png.latex?r_%7B1,y%7D=0.5"> and <img src="https://latex.codecogs.com/png.latex?r_%7B2,y%7D=0.7."> In a simple regression, where we just include <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_1,"> we would find the coefficient to be 0.5. Now we want to control for another predictor, <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_2,"> which is also correlated with the outcome at 0.7. For any “controlling” to happen, <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_1"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_2"> need to be correlated as well. One interesting question is: How large does this correlation need to be to make <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7B1%7D"> zero? This is straightforward – simply plug in the values, set to zero, and solve for <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D:"></p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Chat%7B%5Cbeta%7D_%7B1%7D%20&amp;%20=%5Cfrac%7B0.5-r_%7B1,2%7D0.7%7D%7B1-r_%7B1,2%7D%5E%7B2%7D%7D%20=%200%5C%5C%0Ar_%7B1,2%7D%20&amp;%20=%5Cfrac%7B0.5%7D%7B0.7%7D%20%5Capprox%200.71%20%5C%5C%0A%5Cend%7Baligned%7D%0A"></p>
<p>Hence, the effect for <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_1"> would only vanish completely if <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D"> is fairly large, as should be expected.</p>
<p>In other situations, the coefficient cannot become zero by introducing a control variable. Assume for instance, <img src="https://latex.codecogs.com/png.latex?r_%7B1,y%7D=0.5"> and <img src="https://latex.codecogs.com/png.latex?r_%7B2,y%7D=0.4."> The solution here is <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D=1.25,"> which is impossible. It turns out that the local minimum is attained at <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7B1%7D=0.4">, where <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D=0.5">. In other words, controlling for <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_2"> will <strong>at most</strong> reduce <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_%7B1%7D"> from 0.5 to 0.4, and this will happen when <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D=0.5">.</p>
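<p>Both worked examples are easy to check numerically from the coefficient formula. The sketch below (names are mine) solves the first example in closed form and scans a grid of predictor correlations for the second:</p>

```python
import numpy as np

def beta1(r1y, r2y, r12):
    """Coefficient of the first predictor with two correlated predictors."""
    return (r1y - r12 * r2y) / (1 - r12 ** 2)

# First example: r1y = 0.5, r2y = 0.7 -> beta1 hits zero at r12 = 5/7
r_zero = 0.5 / 0.7

# Second example: r1y = 0.5, r2y = 0.4 -> beta1 never reaches zero;
# scanning a fine grid locates the minimum 0.4 at r12 = 0.5
grid = np.linspace(-0.99, 0.99, 19801)
vals = beta1(0.5, 0.4, grid)
i = vals.argmin()
```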
<p>The combined effect of different correlations can be explored in <a href="https://elbersb.shinyapps.io/beta1/">the Shiny app</a> shown below. The plot shows the coefficient <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta_1%7D"> (y-axis) as a function of the correlation between the two predictors (x-axis). Because we are dealing with the correlations among three variables, the range of possible values for <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D"> may be restricted depending on the values of <img src="https://latex.codecogs.com/png.latex?r_%7B1,y%7D"> and <img src="https://latex.codecogs.com/png.latex?r_%7B2,y%7D."><sup>3</sup> The Shiny app will show only the range of possible values.</p>
<p>Using the sliders, one can adjust the correlations between the predictors and the outcome variable. In the default setting, the correlations are set as <img src="https://latex.codecogs.com/png.latex?r_%7B1,y%7D=0.5"> and <img src="https://latex.codecogs.com/png.latex?r_%7B2,y%7D=0.7."> For this example, when <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D%3C0,"> the estimated coefficient will be inflated compared to the raw correlation <img src="https://latex.codecogs.com/png.latex?r_%7B1,y%7D"> (indicated by the orange line). When <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D%3E0,"> the estimated coefficient will be attenuated instead. The attenuation will be especially severe as <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D"> approaches 1. This is the problem of <strong>multicollinearity</strong> and can also be seen from the formula for <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_1">: As <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D"> approaches 1, <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_1"> approaches $ .$</p>
<div>
<iframe style="width: 120%;height:600px;" frameborder="0" src="https://elbersb.shinyapps.io/beta1/">
</iframe>
</div>
<p>Another interesting fact to note is that the coefficient of a predictor can be non-zero even if the predictor is completely uncorrelated with the outcome. For instance, if we let <img src="https://latex.codecogs.com/png.latex?r_%7B1,y%7D=0"> and <img src="https://latex.codecogs.com/png.latex?r_%7B2,y%7D=0.5,"> the plot shows a sigmoid shape: <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cbeta%7D_1"> will be positive when <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_1"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_2"> are negatively correlated, and vice versa. This happens, of course, because multiple regression provides <em>conditional</em> inference: While <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_1"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D"> may be uncorrelated, they may well be correlated once we condition on <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_2">.</p>
<p>As a last step, we consider the standard errors for the two regression coefficients. As before,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Ctext%7BVar%7D(%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D)%20&amp;%20=%5Chat%7B%5Csigma%7D%5E%7B2%7D(%5Cmathbf%7Br%7D_%7BXX%7D)%5E%7B-1%7D%5C%5C%0A&amp;%20=%5Cfrac%7B%5Chat%7B%5Csigma%7D%5E%7B2%7D%7D%7B1-r_%7B12%7D%5E%7B2%7D%7D%5Cbegin%7Bbmatrix%7D1%20&amp;%20-r_%7B12%7D%5C%5C%0A-r_%7B12%7D%20&amp;%201%0A%5Cend%7Bbmatrix%7D%0A%5Cend%7Baligned%7D%0A"></p>
<p>Thus, the standard errors are again constant:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7Bs.e.%7D(%5Chat%7B%5Cbeta%7D_%7B1%7D)=%5Ctext%7Bs.e.%7D(%5Chat%7B%5Cbeta%7D_%7B2%7D)=%5Cfrac%7B%5Chat%7B%5Csigma%7D%7D%7B%5Csqrt%7B1-r_%7B12%7D%5E%7B2%7D%7D%7D%0A"></p>
<p>This clearly shows that any correlation between <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_%7B1%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_%7B2%7D"> increases the variance and standard errors of the estimated coefficients. In fact, as <img src="https://latex.codecogs.com/png.latex?r_%7B1,2%7D"> approaches 1, the standard errors approach <img src="https://latex.codecogs.com/png.latex?%5Cinfty">. This is an important result, because, even though either <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_%7B1%7D"> or <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_%7B2%7D"> might be highly correlated with <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D">, under multicollinearity the standard errors might be very large. Thus, statistical tests might not reject the null hypothesis, despite strong correlation.</p>
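<p>To see how quickly the standard errors blow up, one can tabulate the inflation factor for a few values of the predictor correlation (a sketch, taking the estimated error variance as 1 so that the factor equals the standard error itself):</p>

```python
import numpy as np

# Inflation of the standard error relative to uncorrelated predictors,
# i.e. the factor 1 / sqrt(1 - r12^2) from the formula above.
r12 = np.array([0.0, 0.5, 0.9, 0.99, 0.999])
inflation = 1 / np.sqrt(1 - r12 ** 2)
for r, f in zip(r12, inflation):
    print(f"r12 = {r:5.3f}: s.e. multiplied by {f:6.2f}")
```

At a predictor correlation of 0.999 the standard error is already more than twenty times its value in the uncorrelated case.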
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>The standardized regression model, as defined by the correlation transformation, can be used to explore the construction of regression coefficients and standard errors in simple cases. In the model with two predictors, all quantities depend only on the three correlations between <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_%7B1%7D,"> <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bx%7D_%7B2%7D,"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7By%7D">. This makes it easy to see the impact of different correlations on the estimated regression coefficients.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>See for instance, Kutner et al.&nbsp;(2005) <em>Applied Linear Statistical Models</em> (esp.&nbsp;p.&nbsp;271 ff.), on which a lot of this material is based.↩︎</p></li>
<li id="fn2"><p>This result can be shown through the use of the “hat” matrix, which is the matrix <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BH%7D"> that satisfies <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cmathbf%7By%7D%7D=%5Cmathbf%7BX%7D%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D=%5Cmathbf%7BH%7D%5Cmathbf%7By%7D">. Because this matrix is a projection matrix, it is idempotent.</p>
<p>We use the mean squared error to estimate <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E%7B2%7D">. The vector of residuals is denoted by <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Be%7D=%5Cmathbf%7By%7D-%5Chat%7B%5Cmathbf%7By%7D%7D=(%5Cmathbf%7BI%7D-%5Cmathbf%7BH%7D)%5Cmathbf%7By%7D">. The variance of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D"> can then be found through some matrix algebra:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Ctext%7BVar%7D(%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D)%20&amp;%20=%5Ctext%7BMSE%7D(%5Cmathbf%7BX%7D%5E%7BT%7D%5Cmathbf%7BX%7D)%5E%7B-1%7D%5C%5C%0A&amp;%20=%5Cfrac%7B1%7D%7Bn-p%7D%5Cleft(%5Cmathbf%7Be%7D%5E%7BT%7D%5Cmathbf%7Be%7D%5Cright)%5Cmathbf%7BI%7D%5C%5C%0A&amp;%20=%5Cfrac%7B1%7D%7Bn-p%7D%5Cmathbf%7By%7D%5E%7BT%7D(%5Cmathbf%7BI%7D-%5Cmathbf%7BH%7D)%5E%7BT%7D(%5Cmathbf%7BI%7D-%5Cmathbf%7BH%7D)%5Cmathbf%7By%7D%5C%5C%0A&amp;%20=%5Cfrac%7B1%7D%7Bn-p%7D%5Cmathbf%7By%7D%5E%7BT%7D(%5Cmathbf%7BI%7D-%5Cmathbf%7BH%7D)%5Cmathbf%7By%7D%5C%5C%0A&amp;%20=%5Cfrac%7B1%7D%7Bn-p%7D%5Cleft(%5Cmathbf%7By%7D%5E%7BT%7D%5Cmathbf%7By%7D-%5Cmathbf%7By%7D%5E%7BT%7D%5Cmathbf%7BH%7D%5Cmathbf%7By%7D%5Cright)%5C%5C%0A&amp;%20=%5Cfrac%7B1%7D%7Bn-p%7D%5Cleft(1-%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5E%7BT%7D%5Chat%7B%5Cboldsymbol%7B%5Cbeta%7D%7D%5Cright)%5C%5C%0A&amp;%20=%5Cfrac%7B1%7D%7Bn-p%7D%5Cleft(1-%5Csum_%7Bi=1%7D%5E%7Bp%7Dr_%7Bi,y%7D%5E%7B2%7D%5Cright)%0A%5Cend%7Baligned%7D%0A">↩︎</p></li>
<li id="fn3"><p>See, for instance, <a href="http://jakewestfall.org/blog/index.php/2013/09/17/geometric-argument-for-constraints-on-corrxz-given-corrxy-and-corryz/">this blogpost</a> for an explanation.↩︎</p></li>
</ol>
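<p>The last identity in footnote 2 relies on the predictors being uncorrelated, so that the inverse of the correlation matrix is the identity. It can be checked numerically; a sketch in Python with simulated data, orthogonalizing the second predictor so that r_12 = 0 (variable names and seed are arbitrary):</p>

```python
import numpy as np

# Numeric check of the identity e'e = 1 - sum_i r_{i,y}^2, which holds
# when the correlation-transformed predictors are orthogonal.
rng = np.random.default_rng(0)
raw = rng.standard_normal((100, 3))

def correlation_transform(v):
    """Center and scale to unit length (the correlation transformation)."""
    v = v - v.mean()
    return v / np.linalg.norm(v)

x1 = correlation_transform(raw[:, 0])
# Orthogonalize the second predictor against the first, so that r12 = 0:
x2 = raw[:, 1] - raw[:, 1].mean()
x2 = x2 - (x2 @ x1) * x1
x2 = x2 / np.linalg.norm(x2)
y = correlation_transform(raw[:, 2])

X = np.column_stack([x1, x2])
beta = X.T @ y               # OLS solution, since X'X is the identity
e = y - X @ beta             # residuals
r1y, r2y = x1 @ y, x2 @ y    # correlations with the outcome

print(np.isclose(e @ e, 1 - r1y**2 - r2y**2))  # True
```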
</section></div> ]]></description>
  <category>statistics</category>
  <guid>https://elbersb.com/public/posts/2020-01-08-correlation-model/</guid>
  <pubDate>Tue, 07 Jan 2020 23:00:00 GMT</pubDate>
</item>
<item>
  <title>Tidylog 1.0.0</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2020-01-07-tidylog100/</link>
  <description><![CDATA[ 




<p>Before I became a heavy user of R, I mainly used Stata. There are a few things that I miss from Stata, but one issue, specifically, bothered me immensely: The lack of feedback for data wrangling operations in R. Have a look, for instance, at this Stata output:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2020-01-07-tidylog100/stata.png" class="img-fluid figure-img"></p>
<figcaption>Stata output</figcaption>
</figure>
</div>
<p>The <code>merge</code> operation tells us about the number of matched cases, and the <code>drop</code> command tells us how many cases we lost. This feedback is great at preventing simple errors, especially when working with data interactively. This functionality does not exist in base R, the tidyverse, or the data.table package. Hence, my code often looked like this:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(data))</span>
<span id="cb1-2">data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(data, length <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(data))</span></code></pre></div>
</div>
<p>This gets ugly pretty quickly, and does not work for many other common problems, such as joins.</p>
<p>This is why I wrote the <a href="https://github.com/elbersb/tidylog"><strong>tidylog</strong></a> package, which is built on top of the tidyverse’s <a href="https://dplyr.tidyverse.org">dplyr</a> and <a href="https://tidyr.tidyverse.org">tidyr</a> packages. Tidylog provides the missing feedback:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tidyverse"</span>)</span>
<span id="cb2-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tidylog"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">warn.conflicts =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)</span>
<span id="cb2-3"></span>
<span id="cb2-4">filtered <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(mtcars, cyl <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)</span>
<span id="cb2-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; filter: removed 21 rows (66%), 11 rows remaining</span></span>
<span id="cb2-6"></span>
<span id="cb2-7">joined <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">left_join</span>(nycflights13<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>flights, nycflights13<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>weather,</span>
<span id="cb2-8">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"year"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"month"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"day"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"origin"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hour"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"time_hour"</span>))</span>
<span id="cb2-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt; left_join: added 9 columns (temp, dewp, humid, wind_dir, wind_speed, …)</span></span>
<span id="cb2-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;            &gt; rows only in x     1,556</span></span>
<span id="cb2-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;            &gt; rows only in y  (  6,737)</span></span>
<span id="cb2-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;            &gt; matched rows     335,220</span></span>
<span id="cb2-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;            &gt;                 =========</span></span>
<span id="cb2-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&gt;            &gt; rows total       336,776</span></span></code></pre></div>
</div>
<p>Tidylog simply overwrites the tidyverse functions for which it provides feedback. This is not very elegant, but means that tidylog is a drop-in solution: Just load it after the tidyverse (or dplyr and/or tidyr), and it will provide feedback.</p>
<p>Since its <a href="https://community.rstudio.com/t/new-package-tidylog-feedback-for-basic-dplyr-operations/22764">first version</a> about a year ago, the package has grown to include most dplyr and many tidyr functions. (Thanks to <a href="https://github.com/elbersb/tidylog/graphs/contributors">all the contributors</a>!) I might consider adding other functions, but for rarer and more complex functions the feedback becomes less useful, because one will usually inspect the output manually anyway. Because tidylog seems pretty much feature-complete to me, I am releasing version 1.0.0 now. The goal for the future is to keep the package updated with developments occurring in dplyr and tidyr.</p>
<p>For more information about tidylog, check out <a href="https://github.com/elbersb/tidylog">the Github page</a>.</p>



 ]]></description>
  <category>packages</category>
  <guid>https://elbersb.com/public/posts/2020-01-07-tidylog100/</guid>
  <pubDate>Mon, 06 Jan 2020 23:00:00 GMT</pubDate>
</item>
<item>
  <title>New paper in Management Science: Obscured Transparency? Compensation benchmarking and the biasing of executive pay</title>
  <dc:creator>Ben Elbers</dc:creator>
  <link>https://elbersb.com/public/posts/2019-03-29-obscured-transparency/</link>
  <description><![CDATA[ 




<p>Mathijs de Vaan, Benjamin Elbers, Thomas A. DiPrete. 2019. <a href="https://doi.org/10.1287/mnsc.2018.3151"><strong>Obscured transparency? Compensation benchmarking and the biasing of executive pay</strong></a>. <em>Management Science</em>.</p>
<ul>
<li><a href="https://osf.io/97ebq/">Preprint and replication materials</a></li>
</ul>
<p>The disclosure of compensation peer groups is intended to provide shareholders with valuable information that can be used to scrutinize CEO compensation. Compensation consultants and watchdog organizations have established general principles for selecting peers, the most important being that peers should be companies of similar size, market capitalization, and industry profile. However, research suggests that there are substantial incentives for executives and directors to bias the compensation peer group upward to allow the CEO to extract additional rent. When companies choose compensation peer groups whose CEOs are better paid than would be the case with a neutrally chosen peer group, the focal CEO appears to be underpaid in comparison, which generates an argument for an increase in compensation for the focal CEO. A number of studies have found evidence that CEO peer groups are biased, though some have argued that what appears to be bias is actually just a reflection of the talent of the focal CEO.</p>
<p>We define bias as the difference in pay between the median CEO in the peer group selected by the firm and in neutral peer groups that we construct. We leverage the idea that reciprocated peer nominations are unlikely to be biased in order to construct counterfactual peer groups, which allows us to measure the bias of disclosed peer groups. Specifically, we estimate a model that predicts reciprocated peer nominations, and we use the estimates from this model to identify and select peers that are likely to nominate the focal firm as peer. Using eleven years of comprehensive data on compensation peer groups that was collected as part of this project, we demonstrate that the average firm uses an upwardly biased peer group, and that this bias cannot be accounted for by CEO talent. We also find that upward bias in compensation peer groups is highly predictive of higher CEO compensation – suggesting that there is a strong incentive for CEOs to strategically select peers. Figure 1 shows the bias as a percentage of the compensation of the median peer in a neutrally chosen peer group.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2019-03-29-obscured-transparency/outcomes_bias.png" style="height:70.0%" class="figure-img"></p>
<figcaption>Figure 1: Median peer group bias over time, including 25th and 75th percentiles</figcaption>
</figure>
</div>
<p>As Figure 1 shows, the size of the peer group bias has been diminishing over time since the 2006 SEC requirement that firms disclose their compensation peer groups in their corporate reports. This may be a consequence of the requirement for periodic say on pay votes on executive compensation in the 2009 Dodd-Frank Act along with the greater scrutiny of compensation practices by watchdog agencies such as Institutional Shareholder Services. However, Figure 2 shows that the predictive effect of bias on pay has gone up, which offsets the consequences of the decline in bias shown in Figure 1.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2019-03-29-obscured-transparency/outcomes_effect.png" style="height:70.0%" class="figure-img"></p>
<figcaption>Figure 2: Returns on bias are increasing over time: Controlling for firm size and firm performance, by how much does compensation increase for a 10% increase in bias?</figcaption>
</figure>
</div>
<p>We also demonstrate that ambiguity about membership in a neutrally chosen peer group is used strategically by firms to increase the size of peer group bias. When it is relatively obvious whom the firm should be choosing as peers, the bias tends to be smaller. Conversely, peer group bias is larger when firms have more discretion, that is, when both the set of plausible peers and the spread in the pay of those peers’ CEOs are relatively large. Figure 3 shows the median, 25th, and 75th percentiles of peer group bias, expressed as a percent of the median pay of a neutrally chosen peer group. The figure shows that bias is generally larger when firms have more discretion in the choice of a peer group.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://elbersb.com/public/posts/2019-03-29-obscured-transparency/outcomes_discretion.png" style="height:70.0%" class="figure-img"></p>
<figcaption>Figure 3: Bias increases with discretion</figcaption>
</figure>
</div>
<p>When a company is doing very well, there is arguably less need to introduce bias into the peer group because the argument for high pay can be made easily. When the company performs less well, the argument for high pay is more difficult to make. Our results show that bias is generally larger when financial targets are not met and when firms have greater discretion in the selection of peer firms from the set of plausible peers. Taken together, the findings from this research call into question the practical impact of recent efforts to introduce greater transparency into the process for determining executive compensation.</p>



 ]]></description>
  <category>papers</category>
  <guid>https://elbersb.com/public/posts/2019-03-29-obscured-transparency/</guid>
  <pubDate>Thu, 28 Mar 2019 23:00:00 GMT</pubDate>
</item>
</channel>
</rss>
