Reference: Wooldridge, Chapter 1 and 13
Why might we want to pool cross sections?
How should we interpret the coefficients on the year dummy variables? (Hint: First establish what is the base year – see lecture on dummy variables)
Are these estimated effects statistically and economically significant? Explain.
How would you test the null hypothesis that, conditional on the other explanatory variables (not related to time), fertility rates are constant over time?
Test if year dummies jointly predict kids after controlling for educ, age, …?
We do an F-test:
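A joint test of this kind compares the sum of squared residuals of the restricted model (year dummies dropped) with that of the unrestricted model, using $F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-k-1)}$. The sketch below is only an illustration on simulated data, not the lecture's dataset: the three survey years, the single control `educ`, the coefficient values, and the helper `ssr` are all assumptions made for the example.

```python
import random

def ssr(X, y):
    """Sum of squared residuals from OLS of y on the columns of X (normal equations)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y], reduced by Gauss-Jordan elimination.
    M = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda row: abs(M[row][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(k):
            if r != c:
                M[r] = [vr - M[r][c] * vc for vr, vc in zip(M[r], M[c])]
    b = [M[i][k] for i in range(k)]
    return sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2
               for r, yi in zip(X, y))

# Simulated pooled cross section (hypothetical numbers, year 0 is the base year).
random.seed(0)
n = 300
data = []
for i in range(n):
    year = i % 3
    educ = random.randint(8, 18)
    d1, d2 = int(year == 1), int(year == 2)
    kids = 4.0 - 0.1 * educ - 0.5 * d1 - 1.0 * d2 + random.gauss(0.0, 0.5)
    data.append((educ, d1, d2, kids))

y = [row[3] for row in data]
X_ur = [[1.0, educ, d1, d2] for educ, d1, d2, _ in data]  # unrestricted model
X_r = [[1.0, educ] for educ, _, _, _ in data]             # restricted: no year dummies

ssr_ur, ssr_r = ssr(X_ur, y), ssr(X_r, y)
q = 2                    # number of year dummies being tested
df = n - len(X_ur[0])    # n - k - 1 in the unrestricted model
f_stat = ((ssr_r - ssr_ur) / q) / (ssr_ur / df)
print(f_stat)
```

The statistic is then compared with the critical value of the $F_{q,\,n-k-1}$ distribution; a large value leads to rejecting the null that the year dummies are jointly zero.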
With pooled cross section data we can find out if:
We have data on two cross sectional datasets. One from 1978 and one from 1985.
$$ \ln(wage) = \beta_0 + \delta_0 y85 + \underbrace{\overbrace{\beta_1 educ}^{\text{Effect of education on wage in 1978: } \beta_1} + \delta_1 (y85*educ)}_{\text{Effect of education on wage in 1985: }\beta_1+\delta_1} + \beta_2 exper \\ \quad \quad +\beta_3 exper^2 + \beta_4 union + \underbrace{\overbrace{\beta_5 female}^{\text{Effect of female on wage in 1978: }\beta_5} + \delta_5 (y85*female)}_{\text{Effect of female on wage in 1985: }\beta_5+\delta_5} + u $$
From 1978 to 1985: Estimated return to education increased from about 7% to about 10%
Estimated gender wage gap, conditional on education, experience, experience squared, and union status, decreased from approx. 32% to approx. 23% (remember the approximation error discussed in topic 5)
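The approximation error can be made concrete. In a log-wage model, a dummy coefficient $\hat{\beta}$ implies an exact percentage difference of $100\,(e^{\hat{\beta}} - 1)$. The sketch below assumes, purely for illustration, that the rounded gaps above correspond to coefficients of -0.32 (1978) and -0.23 (1985); the function name `exact_pct` is hypothetical.

```python
import math

def exact_pct(log_coef):
    """Exact percentage difference implied by a dummy coefficient in a log model."""
    return 100.0 * (math.exp(log_coef) - 1.0)

# Assumed coefficients, read off the rounded percentages in the text.
for year, coef in [(1978, -0.32), (1985, -0.23)]:
    print(year, "approx:", round(100 * coef, 1), "exact:", round(exact_pct(coef), 1))
```

Under these assumed values the exact gaps are roughly 27.4% and 20.5%, so the log approximation overstates the size of a negative gap, and more so the larger the coefficient is in absolute value.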
Background: A rumor that a new incinerator would be built in North Andover started after 1978; construction started in 1981.
We expect that the building of the incinerator has a negative effect on house prices.
Our simple econometric model is thus written as follows: $$ rprice=\beta_0 + \beta_1 nearinc + u $$
Where rprice is the house price in 1978 dollars, nearinc is a dummy variable equal to 1 if the house is near the incinerator and 0 otherwise.
We expect $\beta_1 < 0$
Diff-in-Diff estimator of the effect of the incinerator: $\Delta_{1981} - \Delta_{1978} = -\$11,863$
We divide the observations into 2 groups:
We have 2 periods of data
We define 2 dummy variables:
We thus have four groups:
$$ y =\beta_0 + \delta_0 d2 + \beta_1 dT + \delta_1 \underbrace{(d2*dT)}_{\text{interaction term}} + u$$
As long as there are no other explanatory variables in the regression, the OLS estimate of $\delta_1$ is numerically identical to $$ \hat{\delta}_1 = (\bar{y}_{2,T} - \bar{y}_{2,C}) - (\bar{y}_{1,T} - \bar{y}_{1,C}) $$ where $\bar{y}_{2,T}$ is the sample mean for the treated group in the period after the policy change (period 2), $\bar{y}_{1,C}$ is the sample mean for the control group before the policy change (period 1), and so on.
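This equivalence can be checked numerically. The sketch below uses simulated data with hypothetical coefficient values and a balanced design (100 observations per cell); the small `ols` helper solves the normal equations and is only an illustration, not a replacement for a regression package.

```python
import random

def ols(X, y):
    """OLS coefficients from the normal equations, solved by Gauss-Jordan elimination."""
    k = len(X[0])
    M = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda row: abs(M[row][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(k):
            if r != c:
                M[r] = [vr - M[r][c] * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][k] for i in range(k)]

# Simulated data: d2 = 1 in period 2, dT = 1 for the treatment group.
random.seed(0)
rows = []
for i in range(400):
    d2, dT = i % 2, (i // 2) % 2
    y = 10.0 + 2.0 * d2 - 3.0 * dT - 1.5 * d2 * dT + random.gauss(0.0, 1.0)
    rows.append((d2, dT, y))

def cell_mean(period, treated):
    vals = [y for d2, dT, y in rows if d2 == period and dT == treated]
    return sum(vals) / len(vals)

# Difference-in-differences from the four sample means.
did_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

# Same quantity as the OLS coefficient on the interaction term.
X = [[1.0, d2, dT, d2 * dT] for d2, dT, _ in rows]
y = [r[2] for r in rows]
beta0, delta0, beta1, delta1 = ols(X, y)
print(did_means, delta1)
```

The two numbers agree up to floating-point error because the regression is saturated: with only the two dummies and their interaction, the fitted values are exactly the four cell means.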
Diff-in-Diff in OLS framework $$ rprice = \beta_0 + \beta_1 nearinc + \beta_2 y81 + \beta_3 ( nearinc* y81) + u $$
Should we reject the null hypothesis that the building of the incinerator did not cause lower house prices?
There are 2 good reasons for adding control variables to a diff-in-diff model:
In Maastricht:
From October 2011, they were open only to Dutch, Germans and Belgians (DGB) and closed to all other nationalities.
What is the effect of coffeeshop closing on academic performance?
(Coffeeshops are what the Dutch call places where you can legally smoke weed.)
Poster Announcing Application on 1st of October 2011
The performance of students who are no longer legally permitted to buy marijuana increases substantially.
This is evidence for a causal effect of restricting access to marijuana on grades.
They then estimated the following regression: $$ Y_{it} = \alpha + \beta_1 (NonDGB_i*Discrim_t) + \beta_2 NonDGB_i +\beta_3 Discrim_t + \epsilon_{it} $$
Where NonDGB is a dummy indicating if a student is not Dutch, German or Belgian.
CRIME2.dta: a panel data set on crime and unemployment rates for 46 cities for 1982 & 1987.
We observe each city twice!
Crmrte = crimes per 100,000 people
Unem = unemployment rate
y87 = year dummy (1 if year 1987)
We can then write a panel data model with a single observed explanatory variable as: $$ y_{it} = \beta_0 + \delta_0 period2_t + \beta_1 x_{it} + \upsilon_{it}, \quad \quad t=1,2$$
The error term $\upsilon_{it}$ can then be split up into:
We can thus rewrite the model as follows: $$ y_{it} = \beta_0 + \delta_0 period2_t + \beta_1 x_{it} + a_i + u_{it}, \quad \quad t=1,2$$
$a_i$ is called a fixed effect.
For each cross-sectional observation $i$, write the two years as $$ y_{i2} = (\beta_0 + \delta_0) + \beta_1 x_{i2} + a_i + u_{i2}, \quad \quad t=2$$ $$ y_{i1} = \beta_0 + \beta_1 x_{i1} + a_i + u_{i1}, \quad \quad t=1$$ Next, subtract the second equation from the first; $\beta_0$ and $a_i$ cancel: $$ (y_{i2}-y_{i1}) = \delta_0 + \beta_1 (x_{i2}-x_{i1}) + (u_{i2}-u_{i1}) $$ $$ \Delta y_{i2} = \delta_0 + \beta_1 \Delta x_{i2} + \Delta u_{i2} $$
$$ \Delta y_{it} = \delta_0 + \beta_1 \Delta x_{it} + \Delta u_{it} \quad \quad \quad \quad \quad \text{(Equation 13.17)} $$
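To see why differencing helps, the sketch below simulates a two-period panel in which the fixed effect $a_i$ is correlated with $x_{it}$. All numbers (true $\beta_1 = 2$, $\delta_0 = 0.5$, sample size, variances) are hypothetical. Pooled OLS, which leaves $a_i$ in the error term, is compared with OLS on the first-differenced equation.

```python
import random

def mean(v):
    return sum(v) / len(v)

def slope(x, y):
    """Slope from a simple OLS regression of y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def demean(v):
    m = mean(v)
    return [e - m for e in v]

random.seed(1)
n, beta1, delta0 = 2000, 2.0, 0.5
x1, x2, y1, y2 = [], [], [], []
for _ in range(n):
    a = random.gauss(0.0, 1.0)            # unobserved fixed effect a_i
    xi1 = a + random.gauss(0.0, 1.0)      # x is correlated with a_i in both periods
    xi2 = a + random.gauss(0.0, 1.0)
    x1.append(xi1)
    x2.append(xi2)
    y1.append(1.0 + beta1 * xi1 + 3.0 * a + random.gauss(0.0, 1.0))
    y2.append(1.0 + delta0 + beta1 * xi2 + 3.0 * a + random.gauss(0.0, 1.0))

# Pooled OLS of y on x and a period dummy: demeaning within each period
# partials out the intercept and the dummy (Frisch-Waugh), leaving the x slope.
pooled_slope = slope(demean(x1) + demean(x2), demean(y1) + demean(y2))

# First differences: a_i drops out, and the intercept estimates delta_0.
dx = [b - a for a, b in zip(x1, x2)]
dy = [b - a for a, b in zip(y1, y2)]
fd_slope = slope(dx, dy)
fd_intercept = mean(dy) - fd_slope * mean(dx)
print(pooled_slope, fd_slope, fd_intercept)
```

In this setup the pooled slope is pushed well above 2 by the omitted $a_i$, while the first-difference slope stays close to the true value.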
Cunem = change in unemployment rate (unemployment rate in 87 – unemployment rate in 82).
Ccrmrte = change in crime rate (crime rate in 87 – crime rate in 82)
The FD estimate predicts that if the unemployment rate increases by 1 percentage point, the crime rate increases by 2.217 (per 100,000 population).
The pooled OLS estimate of the effect of area (in square miles) on crime rate is 0.0031. What do you think is the First-Difference OLS estimate of the effect of area (in square miles) on crime rate?
A) 0.056
B) -0.024
C) 0.325
D) Can’t say/Not defined
BUT:
A measure of productivity is the scrap rate (scrap): defective items per 100.
Model: $$ scrap_{it} = \beta_0 + \delta_0 d88_t + \beta_1 grant_{it} + a_i + u_{it} $$
Years: 1987 and 1988.
Differencing to remove $a_i$ gives: $$ \Delta scrap_{it} = \delta_0 + \beta_1 \Delta grant_{it} + \Delta u_{it} $$
Therefore, we simply regress the change in the scrap rate on the change in the grant indicator.
Note that, since no firm received a grant in 1987, the change in the grant indicator is equal to the indicator for whether the firm received a grant in 1988.
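Because the differenced regressor is then a 0/1 indicator, the first-difference slope is just the gap in the average change in scrap between firms with and without a grant. A minimal sketch on simulated firms follows; the firm count, the grant assignment rule, and the effect sizes are hypothetical and are not taken from the lecture's data.

```python
import random

def mean(v):
    return sum(v) / len(v)

random.seed(2)
d_scrap, d_grant = [], []
for i in range(60):
    a_i = random.gauss(3.0, 1.0)          # unobserved firm effect, differenced away below
    grant87 = 0                           # no firm received a grant in 1987
    grant88 = 1 if i % 3 == 0 else 0      # hypothetical rule: every third firm gets a grant
    scrap87 = a_i + random.gauss(0.0, 0.3)
    scrap88 = a_i - 0.5 - 1.0 * grant88 + random.gauss(0.0, 0.3)
    d_scrap.append(scrap88 - scrap87)
    d_grant.append(grant88 - grant87)     # equals grant88 because grant87 = 0

# OLS slope and intercept of the differenced regression.
mg, ms = mean(d_grant), mean(d_scrap)
beta1_hat = (sum((g - mg) * (s - ms) for g, s in zip(d_grant, d_scrap))
             / sum((g - mg) ** 2 for g in d_grant))
delta0_hat = ms - beta1_hat * mg

# With a binary regressor, the slope equals the difference in mean changes.
gap = (mean([s for g, s in zip(d_grant, d_scrap) if g == 1])
       - mean([s for g, s in zip(d_grant, d_scrap) if g == 0]))
print(beta1_hat, gap, delta0_hat)
```

Here `delta0_hat` is the average change in scrap among firms without a grant, and `beta1_hat` is the additional change for firms that received one.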