Reference: Wooldridge, Chapter 7
We can express belonging to a certain category with a binary indicator:
1 = belonging to a category
0 = not belonging to a category
Example: We can express $female$ as a dummy variable by assigning the value 1 to all females and 0 to all others.
You can transform almost any piece of information into a dummy variable; how you define the category (e.g., the cutoff) is up to you.
$DHigh$ = 1 if the individual has more than 12 years of education, 0 otherwise (12 years or fewer)
We can then express the effect of being in a category by simply including it in the DGP.
The effect of being female on wage can be expressed as follows, $$ wage = \beta_0 + \delta_0 female + u $$
$\delta_0$ captures the effect of being female on wage.
The base group (reference group/comparison group) is the group against which we compare the effect of the dummy variable. $$ wage = \beta_0 + \delta_0 female + u $$
In this example, the base group is males$^*$. Thus, $\delta_0$ shows the effect of being female compared to being male on wage.
$*$: We will assume for this exercise that the dataset only consists of males and females. If it contained other genders, the base group would be males and all other genders.
Now include education in the model: $ wage = \beta_0 + \delta_0 female + \beta_1 educ + u $. Recall that $female$ takes the value 1 for women, 0 for men.
When $female=1$ (i.e., for women), get $$ wage = \underbrace{\beta_0 + \delta_0}_{intercept} + \beta_1 educ + u $$
When $female=0$ (i.e., for men), get $$ wage = \underbrace{\beta_0}_{intercept} + \beta_1 educ + u $$
Note that the only difference between the two equations is $\delta_0$, which generates a shift in the intercept.
Let's estimate the following model: $ wage = \beta_0 + \delta_0 female + \beta_1 educ + u $
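As a minimal sketch, this is how one could run that regression in Stata, assuming Wooldridge's WAGE1 dataset (with variables $wage$, $female$, $educ$) is available from the Boston College archive:

```stata
* Load the assumed WAGE1 dataset (Wooldridge); the URL is the commonly
* used Boston College mirror and may need adjusting.
use "http://fmwww.bc.edu/ec-p/data/wooldridge/wage1", clear

* The coefficient on female estimates delta_0, the intercept shift
* for women relative to the male base group.
regress wage female educ
```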
Old specification of the DGP: $ wage = \beta_0 + \delta_0 female + u $
We could instead specify the DGP as: $ wage = \beta_0 + \gamma_0 male + u $
where $male$ is a dummy variable equal to $1$ if the individual is male and $0$ if the individual is female.
What if we included both $female$ and $male$? Since $male = 1 - female$, you can perfectly predict one from the other, so including both (together with an intercept) generates perfect collinearity, violating MLR.3. This is called the dummy variable trap.
Stata will automatically drop one of the redundant dummy variables.
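A quick sketch of the trap, continuing with the assumed WAGE1 variables from above:

```stata
* male is the mirror image of female, so female + male = 1 for everyone.
gen male = 1 - female

* With an intercept, female and male are perfectly collinear;
* Stata flags this and omits one of the redundant dummies.
regress wage female male educ
```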
In practice you may have qualitative variables with more than two categories.
In this case we create more than one dummy variable: one for each category.
In the regression, we must leave one of these dummy variables out; the omitted category is the base group.
Let’s say we want to distinguish between four marital statuses:
Single, married, divorced and widowed.
We can then create a dummy variable for each category.
E.g.: $single$ = 1 if person is single and 0 otherwise (i.e., if married, divorced or widowed)
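A sketch of how these dummies could be created in Stata, assuming a hypothetical categorical variable `marstat` coded 0–3:

```stata
* marstat is a hypothetical variable: 0 = single, 1 = married,
* 2 = divorced, 3 = widowed.
gen single   = (marstat == 0)
gen married  = (marstat == 1)
gen divorced = (marstat == 2)
gen widowed  = (marstat == 3)

* Leave one dummy out: here single is the base group, so each
* coefficient is measured relative to being single.
regress wage married divorced widowed
```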
An ordinal variable is a variable where the order of the values matters but the differences between them do not.
E.g., $raceresult$, which takes the value 1 for first place, 2 for second, 3 for third, et cetera (there is an ordering, but no reason to think the difference between 1st and 2nd place is the same as the difference between 2nd and 3rd place).
We can estimate the effect of an ordinal variable by creating a dummy for each category.
We then again include all of them except one (the base group) in the regression.
The coefficient on each dummy variable can then be interpreted as the effect of that category relative to the base group.
What not to do: what goes wrong if we create a variable $maritalstatus$ that takes the value 0 for $single$, 1 for $married$, 2 for $divorced$ and 3 for $widowed$, and simply put that in our regression? (Hint: why would you expect the effect of being divorced to be double the effect of being married? Why would you expect the difference between the effects of $widowed$ and $divorced$ to be the same as the difference between $married$ and $single$?)
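To make the contrast concrete, here is a hedged sketch (using the shorter hypothetical name `marstat` from the sketch above for the same coding) of the linear coding versus Stata's factor-variable notation:

```stata
* Wrong: entering marstat linearly forces the effect of divorced (=2)
* to be exactly double the effect of married (=1), and imposes equal
* spacing between adjacent categories.
regress wage marstat

* Better: i.marstat expands into a full set of category dummies and
* automatically omits one category as the base group.
regress wage i.marstat
```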
Hamermesh and Biddle (1994) estimate the effect of physical attractiveness on wage.
They use beauty ratings with three categories (below average, average, above average).
The ordering of the ratings is clear (above average > average > below average), but the differences between them are not (is above average worth twice as much as average?).
They therefore include a dummy for above average (1 if beauty is above average, else 0) and a dummy for below average (1 if beauty is below average, else 0).
The base group is average.
Let's estimate the following model: $wage=\beta_0 + \beta_1 belavg + \beta_2 abvavg + u$
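A sketch of this estimation, assuming the BEAUTY dataset as distributed with Wooldridge (variables $wage$, $belavg$, $abvavg$):

```stata
* Load the assumed BEAUTY dataset; the URL may need adjusting.
use "http://fmwww.bc.edu/ec-p/data/wooldridge/beauty", clear

* Base group: average-looking individuals. belavg and abvavg measure
* the wage difference relative to that base group.
regress wage belavg abvavg
```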
Is the OLS regression line a good description of the data?
To create an interaction term, simply multiply the two variables:
$educXfemale = educ \times female$
Let's estimate the following model: $ wage = \beta_0 + \delta_0 female + \beta_1 educ + \delta_1 educXfemale + u$
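A sketch of the interaction regression, again assuming the WAGE1 data:

```stata
* Reload the assumed WAGE1 dataset and build the interaction by hand...
use "http://fmwww.bc.edu/ec-p/data/wooldridge/wage1", clear
gen educXfemale = educ * female
regress wage female educ educXfemale

* ...or, equivalently, use Stata's factor-variable notation, which
* creates the dummy, the interaction, and the main effects in one go.
regress wage i.female##c.educ
```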
Estimated return to education for women: $\hat{\beta}_1 + \hat{\delta}_1 = 0.54 - 0.09 = 0.45$
Is the estimated return to education significantly different for men compared to women?
No, because the p-value on the interaction term educXfemale is 0.407
An advantage of estimating this in one regression is that we can immediately see whether the difference is statistically significant.
We often observe differences in outcomes, but...
Problem 1: these differences could be due to differences in "skill" or actual behaviour.
Even if people had the same "skills" (i.e., everything else were the same)…
Problem 2: the remaining differences could be due to discrimination or favouritism.
$$ S = \beta_1 \,\text{MATCH} \times \text{VISIBLE} + \beta_2 \,\text{MATCH} \times \text{BLIND} + \beta_3 \,\text{NON-MATCH} \times \text{VISIBLE} + \beta_4 \,\text{NON-MATCH} \times \text{BLIND} + \gamma' Z + \epsilon $$
S = Exam score per question (standardised)
Z = controls for student and grader gender and nationality.
Endophilia $= E[S \mid \text{VISIBLE}, \text{MATCH}] - E[S \mid \text{BLIND}, \text{MATCH}] = \beta_1 - \beta_2$
Exophobia $= -\left( E[S \mid \text{VISIBLE}, \text{NON-MATCH}] - E[S \mid \text{BLIND}, \text{NON-MATCH}] \right) = \beta_4 - \beta_3$
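As a heavily stylised sketch of how such a model could be estimated and the two quantities recovered, with all variable names (`score`, `match`, `visible`, `z1`, `z2`) being hypothetical placeholders:

```stata
* score = standardised exam score per question; match = 1 if grader and
* student identities match; visible = 1 if grading was non-blind.
gen blind    = 1 - visible
gen nonmatch = 1 - match
gen mv = match * visible      // MATCH x VISIBLE
gen mb = match * blind        // MATCH x BLIND
gen nv = nonmatch * visible   // NON-MATCH x VISIBLE
gen nb = nonmatch * blind     // NON-MATCH x BLIND

* The four cells are exhaustive, so the model has no separate constant.
regress score mv mb nv nb z1 z2, noconstant

* Endophilia = beta_1 - beta_2; exophobia = beta_4 - beta_3.
lincom mv - mb
lincom nb - nv
```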
See paper: Jan Feld, Nicolas Salamanca and Daniel S. Hamermesh (2016), "Endophilia or Exophobia: Beyond Discrimination", The Economic Journal.