using Pkg
#Pkg.add("DataFrames")
#Pkg.add("HTTP")
#Pkg.add("CSV")
#Pkg.add("GLM")
using DataFrames # To be able to deal with datasets
using HTTP # To be able to request a file from a website
using CSV # To be able to import csv files
using Statistics # To be able to calculate statistics like 'mean'
using GLM # To be able to run regressions
In the following we download the data set, which contains gender and weekly income. It then shows some things like how to look at some summary statistics, setting up categorical variables, and running some basic regressions.
Disclaimer: As NZ Statistics note "The income variable is gross weekly income from all sources, not just wages and salary. This means that if you divide income by hours you do not get hourly rates for jobs." In the following I will do exactly that, divide income by hours to create an 'implied hourly wage'. But I do this only for people working between 30 and 60 hours per week, and with income greater than zero, I then further exclude those with implied hourly wages less than 10NZD per hour. While this means the results are not likely to be accurate, the main point of this exercise is to demonstrate how such an analysis can be performed with Julia. Links to studies containing the actual NZ numbers can be found at bottom of page.
The New Zealand Income Survey (NZIS) is an annual survey by NZ Stats. The NZIS 2011 CART SURF is a simulated 'copy' of the NZIS 2011, there are no privacy concerns as the data for any specific individual in the SURF is all fake, but overall the dataset still has all the same 'features' as the real data in the original NZIS 2011.
This simulated data set contains 29,471 records with eight variables. The eight variables are: sex is either female or male; agegrp measures age in 5-year intervals; ethnicity is the ethnicity of the individual; qualification is highest level of education completed; hours is weekly hours in all wage and salary jobs; income is gross weekly income from all sources (in NZ dollars); occupation is ten different categories of occupation; lgr is the geographical region of which there are 12.
First, import the 'NZ Income Survey (NZIS) 2011 CART SURF' data
This is a 'copy' of the NZ Income Survey of 2011; that is, it is a 'simulated' dataset which has the properties of the original dataset. This means that none of the individual observations is an actual New Zealander (meaning there are no anonymity concerns), but taken together the observations have all the properties of the original dataset.
csv file downloaded from http://archive.stats.govt.nz/tools_and_services/microdata-access/nzis-2011-cart-surf.aspx
an excel containing the dictionary (definition of each variable), can be had from same place
NZIS2011csvlink = HTTP.request("GET", "http://archive.stats.govt.nz/~/media/Statistics/services/microdata-access/nzis11-cart-surf/nzis11-cart-surf.csv?accessType=DOWNLOAD")
NZIS2011data = CSV.read(IOBuffer(String(NZIS2011csvlink.body)));
Now we have the data imported.
It contains 29,471 records with eight variables.
The variables are:
# Let's look at some summary statistics of our data set
describe(NZIS2011data)
Modern datasets, like Julia's dataframes, can actually handle us just giving the categorical variables their actual names (behind the scenes it then does the appropriate thing in terms of dummy variables, etc.)
We will do this below creating a categorical 'gender' from the current 1-or-2 valued 'sex'
# Ignoring gender for a moment, let's just look at the average (weekly) income
mean(NZIS2011data.income)
# Now let's look at the average income conditional on gender
by(NZIS2011data, :sex, :income => mean)
# Recall, sex: 2 is female, 1 is male
We will do it anyway, just so we can see how to use Julia to do such an analysis, but from here on all the actual numbers should be treated as unlikely to be correct. [At the bottom of this page there is a link to a study containing results based on the actual New Zealand numbers.]
# We want to look at wages and the gender wage gap. So create 'hourly wage' by dividing income by hours worked.
NZIS2011data[:wage]=map((income,hours) -> income/hours, NZIS2011data[:income],NZIS2011data[:hours])
# In the hope of making this 'implicit hourly wage' reasonable:
# Let's also just look at people working more than 30 hours, and less than 60 hours.
# And who have an implied hourly wage of at least $5.
# We can do this by restricting the dataset.
# People who don't meet these restriction likely have unreliable 'implied hourly wage'
NZIS2011data = NZIS2011data[NZIS2011data[:hours] .> 30, :] # Impose restriction on dataset that hours>30
NZIS2011data = NZIS2011data[NZIS2011data[:hours] .< 60, :]
NZIS2011data = NZIS2011data[NZIS2011data[:wage] .>= 10, :] # Assume anyone with implied hourly wage $5 is some kind of error in data
describe(NZIS2011data)
by(NZIS2011data, :sex, :wage => mean)
The effect of being female on wage can be expressed as follows, $$ wage = \beta_0 + \beta_1 female + u $$
$\beta_1$ captures the effect of being female on wage.
# Run the regression: wage=beta0+beta1*female+u
# Two options for how to set this up:
# First, create a dummy variable for female, and then run regression
NZIS2011data[:female]=map(sex -> sex-1, NZIS2011data[:sex]);
lm(@formula(wage ~ female), NZIS2011data)
### Test yourself: what would go wrong if we regress income ~ sex?
# Second, create a categorical variable for gender, instead of a number that is interpreted as a category, and just use this directly.
myDict=["male","female"] # This is a 'dictionary' that tells us 1 is male and 2 is female
NZIS2011data[:gender]=map(akey->myDict[akey], NZIS2011data[:sex])
lm(@formula(wage ~ gender), NZIS2011data)
# Remark: notice that regression results are telling you that they have used a dummy for male,
# rather than female.
### Test yourself: What is the advantage of using the regression of wage on female (or male),
### rather than the average income conditional on gender that we got using the by() command?
# More advanced way to do exactly the same regression is
glm(@formula(wage ~ female), NZIS2011data, Normal(), IdentityLink())
# glm allows you to easily switch to logit, probit, etc. by changing the Normal() and IdentityLink().
# These are alternative regression models that we will not see in this course, but are common in more advanced courses.
# lm: 'linear regression model'
# glm: 'generalized linear regression model'
# We might also decide that it makes more sense to use log of wage.
NZIS2011data[:lwage]=map(wage->log(wage), NZIS2011data[:wage]);
lm(@formula(lwage ~ female), NZIS2011data)
We next looked at the gender wage gap controlling for education level.
This was done running the regression, $$ wage=\beta_0+\beta_1*female+educ+u $$
# Run the regression: wage=beta0+beta1*female+educ+u
# Unlike the example from class slides we don't have a continuous variable for education (in slides it was
# number of years), so instead we will set qualification as a categorical value and use that.
# qualification 1=None, 2=School, 3=Vocational/Trade, 4=Bachelor or Higher, 5=Other.
myDict=["None", "School", "Vocational", "BachelororHigher", "Other"];
NZIS2011data[:qualification]=map(akey->myDict[akey], NZIS2011data[:qualification]);
glm(@formula(wage ~ gender+qualification), NZIS2011data, Normal(), IdentityLink())
# Note: just use categorical 'gender', so Julia is choosing how to create the
# dummy, and has set a dummy for male rather than female
### Test yourself: Which qualification level has been used as the Base (aka Reference) group?
# Run the regression: wage=beta0+beta1*female+educ+female*educ+u
glm(@formula(wage ~ gender+qualification+gender*qualification), NZIS2011data, Normal(), IdentityLink())
# Note: we did not need to 'pre-create' the gender-qualification interaction term as a variable,
# we were able to simply include it in the regression equation.
If you want to read more about the differences in incomes and wages by gender in New Zealand here are three studies: