Reference: Angrist and Pischke, Chapter 3, Wooldridge 15.1-15.2
Black and Hispanic students score much lower on standardized tests compared to White and Asian students.
One potential reason for this:
Charter schools: publicly funded, independently operated schools.
Example: KIPP (Knowledge is Power Program) schools
Minority students who go to KIPP schools score higher on standardized tests than minority students in public schools. But:
Is this a causal effect of KIPP schools? (Treatment effect)
Or is this because KIPP attracts a different type of student (e.g. more intelligent, more motivated)? (Selection bias)
$$ \text{Standardized test score} = \beta_0 + \beta_1 \text{KIPP attendance} + u $$
Offers were randomly assigned. Can we treat this as a randomized experiment?
Yes, we can estimate the causal effect of receiving an offer.
Note: Table 3.1 shows the standardized test score (mean = 0, SD = 1) in Math and Verbal.
Standardizing is done by subtracting the mean and dividing by the standard deviation for each test score.
This makes the variables easier to interpret:
A test score of 0 = mean score in the population
A test score of 1 = 1 SD above the mean
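The standardization step can be sketched in a few lines of Python. This is only an illustration: the raw scores are made up, the function name `standardize` is hypothetical, and the sample's own mean and SD are used here, whereas in practice the mean and SD of a reference population would be used.

```python
import numpy as np

def standardize(scores):
    """Return z-scores: subtract the mean, then divide by the standard deviation."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / scores.std()

# Made-up raw test scores, for illustration only.
raw = np.array([480.0, 520.0, 500.0, 560.0, 440.0])
z_scores = standardize(raw)

# After standardizing, the scores have mean 0 and SD 1 (up to rounding error).
print(z_scores.mean(), z_scores.std())
```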
What is the IV estimate of the effect of attending KIPP Lynn on Verbal score?
A) 0.15
B) 0.48
C) 0.61
D) Can’t say
Consider this econometric model
$$ y=\beta_0 + \beta_1 x + u $$
We suspect that x and u are correlated, i.e. x is endogenous.
$$ Cov(x,u) \neq 0 $$
In this case OLS would lead to biased and inconsistent estimates of $\beta_1$.
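To see where the inconsistency comes from, plug the model into the population version of the OLS slope. This is a standard textbook step, added here as a sketch:

$$ \text{plim} \, \hat{\beta}_{1,OLS} = \frac{Cov(x,y)}{Var(x)} = \frac{Cov(x,\beta_0 + \beta_1 x + u)}{Var(x)} = \beta_1 + \frac{Cov(x,u)}{Var(x)} $$

The second term disappears only if $Cov(x,u) = 0$, so OLS does not converge to $\beta_1$ when $x$ is endogenous.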
The good news is that we can still get a consistent estimate of $\beta_1$ if we have a suitable instrumental variable.
To get a consistent estimate of $\beta_1$, we need a variable $z$ that fulfills the following conditions:
Instrument relevance: $Cov(z,x) \neq 0$. $z$ should be correlated with $x$.
Instrument exogeneity: $Cov(z,u) = 0$. $z$ should not be correlated with the error term.
$$ \text{Standardized test score} = \beta_0 + \beta_1 \text{KIPP attendance} + u $$
KIPP attendance might be endogenous
The instrument $z$ is the lottery status (win or lose).
Does the relevance assumption ($Cov(z,x) \neq 0$) hold?
Is lottery status correlated with KIPP attendance? Is $Cov(lottery, attendance) \neq 0$?
Does the exogeneity assumption ($Cov(z,u) = 0$) hold?
Is lottery status uncorrelated with the other factors that influence test scores ($u$)? Is $Cov(lottery, u) = 0$?
Recall from Chapter 2 that the slope estimator of a simple regression is $\hat{\beta}_1=\frac{Cov(x,y)}{Var(x)}$.
To test whether $Cov(z,x) \neq 0$ in the population we simply regress $x$ on $z$ in the sample.
If the coefficient is significantly different from zero we conclude that the relevance assumption holds.
In the KIPP example, we would run the following regression: $$ \widehat{attendance} = \hat{\beta}_0 + \hat{\beta}_{FS} \, lottery $$
If $\hat{\beta}_{FS}$ is significantly different from zero, we conclude that the relevance assumption holds.
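A first-stage check of this kind might look as follows in Python. The data are simulated, not the KIPP data: the attendance probabilities (0.8 for winners, 0.05 for losers) and the sample size are assumptions made purely for illustration, and the standard error uses the simple homoskedastic formula.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Simulated lottery status: 1 = won, 0 = lost.
lottery = rng.integers(0, 2, size=n)
# Hypothetical attendance rule: winners attend far more often than losers.
attend_prob = np.where(lottery == 1, 0.8, 0.05)
attendance = (rng.random(n) < attend_prob).astype(float)

# First stage: OLS of attendance on a constant and lottery status.
X = np.column_stack([np.ones(n), lottery])
coef, *_ = np.linalg.lstsq(X, attendance, rcond=None)
resid = attendance - X @ coef
sigma2 = resid @ resid / (n - 2)                      # homoskedastic error variance
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X).diagonal())

beta_fs = coef[1]
t_fs = beta_fs / se[1]
print(f"first-stage slope = {beta_fs:.3f}, t-statistic = {t_fs:.1f}")
```

With a binary instrument, the first-stage slope is simply the difference in attendance rates between lottery winners and losers.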
The exogeneity assumption cannot be tested directly. We have to argue that it holds.
In the KIPP example, the argument is straightforward: if the lottery was random, lottery status ($z$) is uncorrelated with all other factors that influence test scores ($u$).
instrument relevance $Cov(x,z) \neq 0$
instrument exogeneity $Cov(z,u) = 0$
Begin by studying the covariance between $z$ and $y$: $$ Cov(z,y)= Cov(z,\beta_0+\beta_1 x+u) $$ $$ Cov(z,y)= Cov(z,\beta_0)+ Cov(z,\beta_1 x) + Cov(z,u) $$ $$ Cov(z,y)= 0 + \beta_1 Cov(z,x) + Cov(z,u) $$
Under the IV assumptions (above), we can solve for $\beta_1$, $$ \beta_1 = \frac{Cov(z,y)}{Cov(z,x)} $$
Given a random sample, we can estimate $\beta_1$: $$ \hat{\beta}_1= \frac{\sum_{i=1}^{n} (z_i-\bar{z})(y_i-\bar{y})}{\sum_{i=1}^{n} (z_i-\bar{z})(x_i-\bar{x})} $$
This is the instrumental variables (IV) estimator of $\beta_1$.
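A small simulation can illustrate the difference between OLS and the IV estimator. Everything here is an assumption chosen for illustration: the true parameters ($\beta_0 = 1$, $\beta_1 = 2$), the way $x$ depends on $z$ and $u$, and the helper names `ols_slope` and `iv_slope`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta0, beta1 = 1.0, 2.0            # hypothetical true parameters

z = rng.normal(size=n)             # instrument: independent of u by construction
u = rng.normal(size=n)             # unobserved error term
x = 0.8 * z + 0.5 * u + rng.normal(size=n)   # x depends on u, so x is endogenous
y = beta0 + beta1 * x + u

def ols_slope(x, y):
    """Simple-regression OLS slope: sample Cov(x, y) / Var(x)."""
    xc = x - x.mean()
    return np.sum(xc * (y - y.mean())) / np.sum(xc * xc)

def iv_slope(z, x, y):
    """IV slope: sample Cov(z, y) / Cov(z, x)."""
    zc = z - z.mean()
    return np.sum(zc * (y - y.mean())) / np.sum(zc * (x - x.mean()))

print("OLS:", ols_slope(x, y))
print("IV: ", iv_slope(z, x, y))
```

In this setup the population OLS slope is $\beta_1 + Cov(x,u)/Var(x) = 2 + 0.5/1.89 \approx 2.26$, so OLS overstates the effect, while the IV estimate should be close to 2.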
Recall that: $$ \hat{\beta}_{IV} = \frac{\hat{\beta}_{RF}}{\hat{\beta}_{FS}} $$
We know that $\hat{\beta}_{RF} = \frac{Cov(y,z)}{Var(z)} $ and $\hat{\beta}_{FS} = \frac{Cov(x,z)}{Var(z)} $
Plugging these into the equation above, it is straightforward to see that: $$ \hat{\beta}_{IV} = \frac{\frac{Cov(y,z)}{Var(z)}}{\frac{Cov(x,z)}{Var(z)}} = \frac{Cov(y,z)}{Cov(x,z)} $$
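The equivalence can be checked numerically. The sketch below uses simulated data with a binary instrument. The data-generating process and the helper name `slope` are assumptions for illustration only. With a binary $z$, both regression slopes reduce to differences in means between the $z=1$ and $z=0$ groups, so the same number is obtained a third way.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

z = rng.integers(0, 2, size=n).astype(float)   # binary instrument
u = rng.normal(size=n)
# Hypothetical take-up rule: more likely with z = 1 and with high u (endogenous).
x = ((0.2 + 0.5 * z + 0.2 * (u > 0)) > rng.random(n)).astype(float)
y = 1.0 + 0.4 * x + u

def slope(a, b):
    """OLS slope from regressing b on a: Cov(a, b) / Var(a)."""
    return np.cov(a, b)[0, 1] / np.var(a, ddof=1)

beta_rf = slope(z, y)                          # reduced form: y on z
beta_fs = slope(z, x)                          # first stage: x on z
ratio = beta_rf / beta_fs
cov_ratio = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
diff_means = (y[z == 1].mean() - y[z == 0].mean()) / (
    x[z == 1].mean() - x[z == 0].mean()
)

print(ratio, cov_ratio, diff_means)            # three routes to the same estimate
```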
Consider the econometric model $$ y = \beta_0 + \beta_1 x + \beta_2 c + u $$
We suspect that $x$ and $u$ are correlated ($x$ is endogenous). $$ Cov(x,u) \neq 0$$
We have to make the same two assumptions (relevance and exogeneity).
First stage: $$ x = \beta_{0,first} + \beta_{FS} z + \beta_{c,first} c + \upsilon $$
We then use the predicted values of $x$ from the first stage instead of $x$ in a regression on $y$.
Calculate predicted values of x: $$ \hat{x} = \hat{\beta}_{0,first} + \hat{\beta}_{FS} z + \hat{\beta}_{c,first} c $$
Second stage regression: $$ y = \beta_0 + \beta_1 \hat{x} + \beta_2 c + u $$
The statistical software also computes the correct standard errors.
Note that also in 2SLS estimation:
$$ \hat{\beta}_{2SLS} = \frac{\text{Reduced Form}}{\text{First Stage}} = \frac{\hat{\beta}_{RF}}{\hat{\beta}_{FS}} $$
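The two stages can be run by hand to see the mechanics. This is a sketch on simulated data with one instrument $z$, one endogenous regressor $x$, and one control $c$. The coefficients in the data-generating process and the helper name `ols` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

c = rng.normal(size=n)                       # exogenous control
z = 0.3 * c + rng.normal(size=n)             # instrument (may correlate with c)
u = rng.normal(size=n)                       # unobserved error term
x = 0.7 * z + 0.4 * c + 0.6 * u + rng.normal(size=n)   # endogenous regressor
y = 1.0 + 2.0 * x - 1.0 * c + u              # hypothetical true beta_1 = 2

def ols(dep, *regressors):
    """OLS coefficients (constant first) of dep on the given regressors."""
    X = np.column_stack([np.ones(len(dep)), *regressors])
    coef, *_ = np.linalg.lstsq(X, dep, rcond=None)
    return coef

fs = ols(x, z, c)                            # first stage: x on z and c
x_hat = fs[0] + fs[1] * z + fs[2] * c        # predicted values of x
second = ols(y, x_hat, c)                    # second stage: y on x_hat and c
rf = ols(y, z, c)                            # reduced form: y on z and c

beta_2sls = second[1]
beta_ratio = rf[1] / fs[1]
print(beta_2sls, beta_ratio)                 # identical up to rounding error
```

The point estimate from the hand-run second stage is the 2SLS estimate, but its standard errors are not valid because the residuals are computed with $\hat{x}$ instead of $x$. This is why a dedicated IV routine should be used in practice.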
If the relevance and exogeneity assumptions hold, the IV estimator is consistent but not unbiased.
(see topic 5, where we talked about consistency).
Compared to OLS the IV estimator is less efficient (i.e., it has a larger variance, larger standard errors)
A stronger first stage leads to more efficient IV estimates.
(a strong instrument has high Cov(z,x))
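The role of instrument strength can be read off the asymptotic variance of the simple IV estimator. The formula below is the standard textbook result under homoskedasticity, added here as a reminder. The symbols $\sigma^2$ (error variance), $\sigma_x^2$ (variance of $x$) and $\rho_{x,z}$ (correlation between $x$ and $z$) do not appear elsewhere in these notes.

$$ Avar(\hat{\beta}_{1,IV}) = \frac{\sigma^2}{n \, \sigma_x^2 \, \rho_{x,z}^2} \qquad \text{compared to} \qquad Avar(\hat{\beta}_{1,OLS}) = \frac{\sigma^2}{n \, \sigma_x^2} $$

Since $\rho_{x,z}^2 \leq 1$, the IV variance is at least as large as the OLS variance, and it shrinks as the correlation between $z$ and $x$ gets stronger.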
Goal: Estimate the causal effect of education on wages, taking into account the possibility that education is endogenous.
How: Use a dummy variable for whether someone grew up near a four-year college ($nearc4$) as an instrument for education.
Is the instrument relevant? Is $Cov(educ,nearc4) \neq 0$?
Yes, college proximity significantly predicts years of education.
IV estimate of return to education: $$ \hat{\beta}_{2SLS} = \frac{\hat{\beta}_{RF}}{\hat{\beta}_{FS}} = \frac{0.156}{0.829} = 0.188 $$
IV estimation is very easily implemented using Stata:
Our IV estimate suggests that a one-year increase in education leads to an approx. 19% increase in wages.
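The 19% figure reads the coefficient directly as a percentage change, which is the usual approximation when the dependent variable is in logs. Assuming the dependent variable here is log wage, the exact percentage change implied by the estimate is somewhat larger:

$$ \% \Delta wage = 100 \cdot (e^{0.188} - 1) \approx 20.7\% $$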
Stata automatically estimates the correct standard errors.
Whether we should trust the IV estimates depends on whether we believe the exogeneity assumption.
Is $Corr(nearc4, u) = 0$? (nearc4 = grew up near a four-year college; u = other factors that influence (log) wage)
Maybe people who live close to colleges have characteristics that make them earn more?
No problem if we can control for these factors.
Estimated return to education: approx. 18%
Let’s go back to the KIPP Lynn example:
The effect estimated with IV is a Local Average Treatment Effect (LATE).
Local = only for those who are moved by the instrument
Local Average Treatment Effect = average treatment effect for those who were moved by the instrument (the compliers, provided there are no defiers).
In the KIPP example, this means that IV estimates the average effect of going to KIPP on test scores for kids whose decision to go to KIPP was influenced by the lottery.
This means: causal effect might be different for never-takers and always-takers.
Are all of these moved in the same direction? Does the monotonicity assumption hold?
If yes, we have estimated the average effect of education on wage for people who went to college because it is close, but wouldn’t have gone otherwise (compliers).
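The complier interpretation can be illustrated with a simulation in which the treatment effect differs across groups. The group shares (60% compliers, 20% always-takers, 20% never-takers) and the group-specific effects are made-up values, and there are no defiers, so monotonicity holds by construction.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Hypothetical population shares of the three groups (no defiers).
group = rng.choice(["complier", "always", "never"], size=n, p=[0.6, 0.2, 0.2])
z = rng.integers(0, 2, size=n)               # randomly assigned instrument

# Treatment status: compliers follow z, always-takers are always treated,
# never-takers are never treated.
d = np.where(group == "always", 1, np.where(group == "complier", z, 0))

# Hypothetical heterogeneous effects: 0.5 for compliers, 1.5 for always-takers,
# 0.0 for never-takers.
effect = np.where(group == "complier", 0.5, np.where(group == "always", 1.5, 0.0))
y = rng.normal(size=n) + effect * d

first_stage = d[z == 1].mean() - d[z == 0].mean()
reduced_form = y[z == 1].mean() - y[z == 0].mean()
iv_estimate = reduced_form / first_stage
print(f"first stage = {first_stage:.3f}, IV estimate = {iv_estimate:.3f}")
```

The IV estimate should be close to 0.5, the effect assumed for compliers. The larger effect assumed for always-takers does not show up, because the instrument never changes their treatment status.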
Instrumental variable estimation is a powerful and popular tool to get an estimate of a causal effect if $x$ is endogenous.
All we need is an instrumental variable that is correlated with $x$ (relevance assumption) and uncorrelated with $u$ (exogeneity assumption).
We can test the relevance assumption. We have to argue the exogeneity assumption.
IV is less efficient than OLS.
Instrumental variable (IV) estimation is a powerful tool to estimate causal effects if OLS can’t.
All we need is an instrument $z$ that fulfils two assumptions:
IV relevance: $Cov(z,x) \neq 0$.
IV exogeneity: $Cov(z,u) = 0$.