In [1]:
using Pkg
#Pkg.add("DataFrames")
#Pkg.add("HTTP")
#Pkg.add("CSV")
#Pkg.add("GLM")
using DataFrames # To be able to deal with datasets
using HTTP # To be able to request a file from a website
using CSV # To be able to import csv files
using Statistics # To be able to calculate statistics like 'mean'
using GLM # To be able to run regressions

QUAN201 - Introduction to Econometrics

datetoday = Dates.datetime.today().strftime("%m/%d/%Y")

datetoday

Topic 7, what we saw.

  • We discussed estimates of wage on gender, and on gender controlling for years of education.
  • We talked about dummy variables, and how to use these to estimate effect of gender.
  • We talked about interpreting the results when using dummy variables, and importance of remembering which is the base group (a.k.a. reference group).

Here,

  • This 'Notebook' shows how to use Julia (a programming language specialising in computation and data) to analyse the 'New Zealand Income Survey', repeating some examples we saw in class that were done with Stata on an older data set from US.
  • It first downloads a simulated data set that is intended as a copy of the 2011 New Zealand Income Survey. This data set is created by New Zealand Statistics, it is simulated data that is designed to have the properties of the original 2011 New Zealand Income Survey, but because it is not the original raw data there are no concerns about privacy, and so can be made publicly available for us to download at use anywhere.
  • In the following we download the data set, which contains gender and weekly income. It then shows some things like how to look at some summary statistics, setting up categorical variables, and running some basic regressions.

  • Disclaimer: As NZ Statistics note "The income variable is gross weekly income from all sources, not just wages and salary. This means that if you divide income by hours you do not get hourly rates for jobs." In the following I will do exactly that, divide income by hours to create an 'implied hourly wage'. But I do this only for people working between 30 and 60 hours per week, and with income greater than zero, I then further exclude those with implied hourly wages less than 10NZD per hour. While this means the results are not likely to be accurate, the main point of this exercise is to demonstrate how such an analysis can be performed with Julia. Links to studies containing the actual NZ numbers can be found at bottom of page.

Description of NZIS 2011 CART SURF

The New Zealand Income Survey (NZIS) is an annual survey by NZ Stats. The NZIS 2011 CART SURF is a simulated 'copy' of the NZIS 2011, there are no privacy concerns as the data for any specific individual in the SURF is all fake, but overall the dataset still has all the same 'features' as the real data in the original NZIS 2011.

This simulated data set contains 29,471 records with eight variables. The eight variables are: sex is either female or male; agegrp measures age in 5-year intervals; ethnicity is the ethnicity of the individual; qualification is highest level of education completed; hours is weekly hours in all wage and salary jobs; income is gross weekly income from all sources (in NZ dollars); occupation is ten different categories of occupation; lgr is the geographical region of which there are 12.

For more info on the data: http://archive.stats.govt.nz/tools_and_services/microdata-access/nzis-2011-cart-surf.aspx

Getting Set Up

First, import the 'NZ Income Survey (NZIS) 2011 CART SURF' data

This is a 'copy' of the NZ Income Survey of 2011; that is, it is a 'simulated' dataset which has the properties of the original dataset. This means that none of the individual observations is an actual New Zealander (meaning there are no anonymity concerns), but taken together the observations have all the properties of the original dataset.

csv file downloaded from http://archive.stats.govt.nz/tools_and_services/microdata-access/nzis-2011-cart-surf.aspx
an excel containing the dictionary (definition of each variable), can be had from same place

In [18]:
NZIS2011csvlink = HTTP.request("GET", "http://archive.stats.govt.nz/~/media/Statistics/services/microdata-access/nzis11-cart-surf/nzis11-cart-surf.csv?accessType=DOWNLOAD")
NZIS2011data = CSV.read(IOBuffer(String(NZIS2011csvlink.body)));

Now we have the data imported.

It contains 29,471 records with eight variables.
The variables are:

  • sex: 1=male, 2=female
  • agegrp: 5 year bins: 15-19, 20-24,...,60-64, 65+
  • ethnicity: 1=European, 2=Māori, 3=Pacific Peoples, 4=Asian, 5=Middle Eastern/Latin American/African, 6=Other Ethnicity, 9=Residual Categories. (Up to two given.)
  • qualification: 1=None, 2=School, 3=Vocational/Trade, 4=Bachelor or Higher, 5=Other.
  • hours: weekly hours in all wage and salary jobs; 0 to 168 hours rounded to half hours.
  • income: gross weekly income from all sources; -5100 to 25443 New Zealand dollars rounded to whole dollars (losses are negative).
  • occupation: 1=Managers, 2=Professionals, 3=Technicians and Trades Workers, 4=Community and Personal Service Workers, 5=Clerical and Administrative Workers, 6=Sales Workers, 7=Machinery Operators and Drivers, 8=Labourers, 9=Residual Categories, 10=No occupation.
  • lgr: region; 1=Northland, 2=Auckland, 3=Waikato, 4=Bay of Plenty, 5=Gisborne/Hawke's Bay, 6=Taranaki, 7=Manawatu-Wanganui, 8=Wellington, 9=Nelson/Tasman/Marlborough/West Coast, 10=Canterbury, 11=Otago, 12=Southland.
In [19]:
# Let's look at some summary statistics of our data set
describe(NZIS2011data)
Out[19]:

8 rows × 8 columns

variablemeanminmedianmaxnuniquenmissingeltype
SymbolFloat64RealFloat64RealNothingInt64DataType
1ethnicity2.4346311.0360Int64
2lgr5.5568915.0120Int64
3sex1.5163412.020Int64
4agegrp41.15911540.0650Int64
5qualification2.5959813.050Int64
6occupation6.0407916.0100Int64
7hours19.11090.010.0168.00Float64
8income692.174-5100536.0254430Int64

Modern datasets, like Julia's dataframes, can actually handle us just giving the categorical variables their actual names (behind the scenes it then does the appropriate thing in terms of dummy variables, etc.)

We will do this below creating a categorical 'gender' from the current 1-or-2 valued 'sex'

In [20]:
# Ignoring gender for a moment, let's just look at the average (weekly) income
mean(NZIS2011data.income)
Out[20]:
692.1743748091344
In [21]:
# Now let's look at the average income conditional on gender
by(NZIS2011data, :sex, :income => mean)
# Recall, sex: 2 is female, 1 is male
Out[21]:

2 rows × 2 columns

sexincome_mean
Int64⍰Float64
12611.108
21778.717

Warning: the following 'implicit hourly wage' is exactly what NZ Stats warn not to do with this dataset!

We will do it anyway, just so we can see how to use Julia to do such an analysis, but from here on all the actual numbers should be treated as unlikely to be correct. [At the bottom of this page there is a link to a study containing results based on the actual New Zealand numbers.]

In [22]:
# We want to look at wages and the gender wage gap. So create 'hourly wage' by dividing income by hours worked.
NZIS2011data[:wage]=map((income,hours) -> income/hours, NZIS2011data[:income],NZIS2011data[:hours])
# In the hope of making this 'implicit hourly wage' reasonable:
# Let's also just look at people working more than 30 hours, and less than 60 hours. 
# And who have an implied hourly wage of at least $5. 

# We can do this by restricting the dataset.
# People who don't meet these restriction likely have unreliable 'implied hourly wage'
NZIS2011data = NZIS2011data[NZIS2011data[:hours] .> 30, :] # Impose restriction on dataset that hours>30
NZIS2011data = NZIS2011data[NZIS2011data[:hours] .< 60, :]
NZIS2011data = NZIS2011data[NZIS2011data[:wage] .>= 10, :] # Assume anyone with implied hourly wage $5 is some kind of error in data

describe(NZIS2011data)
Out[22]:

9 rows × 8 columns

variablemeanminmedianmaxnuniquenmissingeltype
SymbolFloat64RealFloat64RealNothingUnion…DataType
1ethnicity2.3467411.0350Int64
2lgr5.5635415.0120Int64
3sex1.4341611.020Int64
4agegrp39.50181540.0650Int64
5qualification2.7978913.050Int64
6occupation3.820213.0100Int64
7hours41.346630.540.059.00Float64
8income1095.18328949.096100Int64
9wage26.530610.022.95218.238Float64
In [23]:
by(NZIS2011data, :sex, :wage => mean)
Out[23]:

2 rows × 2 columns

sexwage_mean
Int64⍰Float64
1226.1733
2126.8047

Recall from Lecture Slides

  • The effect of being female on wage can be expressed as follows, $$ wage = \beta_0 + \beta_1 female + u $$

  • $\beta_1$ captures the effect of being female on wage.

In [24]:
# Run the regression: wage=beta0+beta1*female+u
# Two options for how to set this up:
# First, create a dummy variable for female, and then run regression
NZIS2011data[:female]=map(sex -> sex-1, NZIS2011data[:sex]);
lm(@formula(wage ~ female), NZIS2011data)

### Test yourself: what would go wrong if we regress income ~ sex?
Out[24]:
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: wage ~ 1 + female

Coefficients:
──────────────────────────────────────────────────────
              Estimate  Std.Error    t value  Pr(>|t|)
──────────────────────────────────────────────────────
(Intercept)  26.8047     0.182334  147.009      <1e-99
female       -0.631411   0.276722   -2.28175    0.0225
──────────────────────────────────────────────────────
In [25]:
# Second, create a categorical variable for gender, instead of a number that is interpreted as a category, and just use this directly.
myDict=["male","female"] # This is a 'dictionary' that tells us 1 is male and 2 is female
NZIS2011data[:gender]=map(akey->myDict[akey], NZIS2011data[:sex])
lm(@formula(wage ~ gender), NZIS2011data)
# Remark: notice that regression results are telling you that they have used a dummy for male, 
#         rather than female.

### Test yourself: What is the advantage of using the regression of wage on female (or male), 
###                rather than the average income conditional on gender that we got using the by() command?
Out[25]:
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: wage ~ 1 + gender

Coefficients:
───────────────────────────────────────────────────────
               Estimate  Std.Error    t value  Pr(>|t|)
───────────────────────────────────────────────────────
(Intercept)   26.1733     0.208157  125.739      <1e-99
gender: male   0.631411   0.276722    2.28175    0.0225
───────────────────────────────────────────────────────
In [26]:
# More advanced way to do exactly the same regression is
glm(@formula(wage ~ female), NZIS2011data, Normal(), IdentityLink())
# glm allows you to easily switch to logit, probit, etc. by changing the Normal() and IdentityLink().
# These are alternative regression models that we will not see in this course, but are common in more advanced courses.

# lm: 'linear regression model'
# glm: 'generalized linear regression model'
Out[26]:
StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: wage ~ 1 + female

Coefficients:
──────────────────────────────────────────────────────
              Estimate  Std.Error    z value  Pr(>|z|)
──────────────────────────────────────────────────────
(Intercept)  26.8047     0.182334  147.009      <1e-99
female       -0.631411   0.276722   -2.28175    0.0225
──────────────────────────────────────────────────────
In [27]:
# We might also decide that it makes more sense to use log of wage.
NZIS2011data[:lwage]=map(wage->log(wage), NZIS2011data[:wage]);
lm(@formula(lwage ~ female), NZIS2011data)
Out[27]:
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: lwage ~ 1 + female

Coefficients:
─────────────────────────────────────────────────────────
                Estimate   Std.Error    t value  Pr(>|t|)
─────────────────────────────────────────────────────────
(Intercept)   3.18102     0.00538904  590.275      <1e-99
female       -0.00621086  0.00817875   -0.75939    0.4476
─────────────────────────────────────────────────────────

Recall from Lecture Slides

We next looked at the gender wage gap controlling for education level.

This was done running the regression, $$ wage=\beta_0+\beta_1*female+educ+u $$

In [28]:
# Run the regression: wage=beta0+beta1*female+educ+u
# Unlike the example from class slides we don't have a continuous variable for education (in slides it was 
# number of years), so instead we will set qualification as a categorical value and use that.
# qualification  1=None, 2=School, 3=Vocational/Trade, 4=Bachelor or Higher, 5=Other.
myDict=["None", "School", "Vocational", "BachelororHigher", "Other"];
NZIS2011data[:qualification]=map(akey->myDict[akey], NZIS2011data[:qualification]);
glm(@formula(wage ~ gender+qualification), NZIS2011data, Normal(), IdentityLink())

# Note: just use categorical 'gender', so Julia is choosing how to create the 
# dummy, and has set a dummy for male rather than female

### Test yourself: Which qualification level has been used as the Base (aka Reference) group?
Out[28]:
StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: wage ~ 1 + gender + qualification

Coefficients:
────────────────────────────────────────────────────────────────────
                            Estimate  Std.Error    z value  Pr(>|z|)
────────────────────────────────────────────────────────────────────
(Intercept)                30.3269     0.31134    97.4077     <1e-99
gender: male                0.953131   0.273445    3.48564    0.0005
qualification: None        -6.11959    0.426774  -14.3392     <1e-45
qualification: Other       -3.29383    0.601474   -5.47626    <1e-7 
qualification: School      -6.80131    0.407003  -16.7107     <1e-61
qualification: Vocational  -5.09794    0.370181  -13.7715     <1e-42
────────────────────────────────────────────────────────────────────
In [29]:
# Run the regression: wage=beta0+beta1*female+educ+female*educ+u
glm(@formula(wage ~ gender+qualification+gender*qualification), NZIS2011data, Normal(), IdentityLink())

# Note: we did not need to 'pre-create' the gender-qualification interaction term as a variable, 
# we were able to simply include it in the regression equation.
Out[29]:
StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: wage ~ 1 + gender + qualification + gender & qualification

Coefficients:
─────────────────────────────────────────────────────────────────────────────────
                                          Estimate  Std.Error   z value  Pr(>|z|)
─────────────────────────────────────────────────────────────────────────────────
(Intercept)                               29.6307    0.394794  75.0536     <1e-99
gender: male                               2.35487   0.560209   4.20356    <1e-4 
qualification: None                       -5.40791   0.63382   -8.53225    <1e-16
qualification: Other                      -2.42001   0.910192  -2.65879    0.0078
qualification: School                     -5.79006   0.592418  -9.77361    <1e-21
qualification: Vocational                 -4.02691   0.551702  -7.29906    <1e-12
gender: male & qualification: None        -1.42846   0.858587  -1.66374    0.0962
gender: male & qualification: Other       -1.70817   1.21405   -1.407      0.1594
gender: male & qualification: School      -1.96975   0.815801  -2.4145     0.0158
gender: male & qualification: Vocational  -2.01352   0.745791  -2.69985    0.0069
─────────────────────────────────────────────────────────────────────────────────

A reminder to finish that we used an 'implied hourly wage' that was precisely what NZ Stats warned us not to do. So do not interpret the resulting numbers too seriously. This exercise is simply to show how this kind of analysis can nowadays be performed in a fully open and online manner.

If you want to read more about the differences in incomes and wages by gender in New Zealand here are three studies:

  • Some basic numbers can be found in this study which uses the (not publicly available) version of the 2015 New Zealand Income Study. It performs an 'Oaxaca-Blinder' decomposition, which is the standard empirical measure of discrimination.
    [This study found "Focusing on the working age population in NZ, and utilising 2015 Income Survey data (sourced from Statistics NZ) we find that females earn on average \$25 per hour, while the comparable average for males is \$29 per hour.", a 16% difference in contrast to the 2-3% difference in our implied hourly wage above.]
  • The role of productivity differences in the gender wage gap: http://motu-www.motu.org.nz/wpapers/17_15.pdf
  • The effect of parenthood on the gender wage gap: http://motu-www.motu.org.nz/wpapers/18_08.pdf