We use a software called MatchIt to perform matching for us. Our example here mostly follow the tutorial of MatchIt package here[https://kosukeimai.github.io/MatchIt/articles/MatchIt.html].
We use the Lalonde data that we encountered in HW2 and will encounter again in HW4. Here we focus on the observational data of the job training program. The job training program took place in 1976, and we are interested in its impact on 1978 earnings.
library("MatchIt")
data("lalonde")
lalonde
m.out0 <- matchit(treat ~ age + educ + race + married +
nodegree + re74 + re75, data = lalonde,
method = NULL, distance = "glm")
## distance here refer to a method to estimate the propensity score
## No matching is performed at this stage
summary(m.out0)
Call:
matchit(formula = treat ~ age + educ + race + married + nodegree +
re74 + re75, data = lalonde, method = NULL, distance = "glm")
Summary of Balance for All Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance 0.5774 0.1822 1.7941 0.9211 0.3774 0.6444
age 25.8162 28.0303 -0.3094 0.4400 0.0813 0.1577
educ 10.3459 10.2354 0.0550 0.4959 0.0347 0.1114
raceblack 0.8432 0.2028 1.7615 . 0.6404 0.6404
racehispan 0.0595 0.1422 -0.3498 . 0.0827 0.0827
racewhite 0.0973 0.6550 -1.8819 . 0.5577 0.5577
married 0.1892 0.5128 -0.8263 . 0.3236 0.3236
nodegree 0.7081 0.5967 0.2450 . 0.1114 0.1114
re74 2095.5737 5619.2365 -0.7211 0.5181 0.2248 0.4470
re75 1532.0553 2466.4844 -0.2903 0.9563 0.1342 0.2876
Sample Sizes:
Control Treated
All 429 185
Matched 429 185
Unmatched 0 0
Discarded 0 0
## Love plot before matching
plot(summary(m.out0), abs = F)
## Default thresholds: solid line 0.1, dashed line 0.05
– Add linear terms
Set re74 is the base covariate that we know that we need to add a priori
temp <- glm(treat ~ . , data = lalonde[, -9], family = "binomial")
temp0 <- glm(treat ~re74, data = lalonde[, -9], family = "binomial")
library(MASS)
## can set trace = F to have more concise messages
model <- stepAIC(temp0, scope=list(upper = formula(temp), lower = formula(temp0)), direction = "forward")
Start: AIC=708.54
treat ~ re74
Df Deviance AIC
+ race 2 504.05 512.05
+ married 1 675.29 681.29
<none> 704.54 708.54
+ nodegree 1 702.77 708.77
+ educ 1 702.85 708.85
+ re75 1 704.15 710.15
+ age 1 704.23 710.23
Step: AIC=512.05
treat ~ re74 + race
Df Deviance AIC
+ married 1 496.34 506.34
+ educ 1 501.25 511.25
<none> 504.05 512.05
+ re75 1 503.72 513.72
+ nodegree 1 503.82 513.82
+ age 1 503.94 513.94
Step: AIC=506.34
treat ~ re74 + race + married
Df Deviance AIC
<none> 496.34 506.34
+ educ 1 494.55 506.55
+ re75 1 494.92 506.92
+ nodegree 1 495.94 507.94
+ age 1 496.00 508.00
summary(model)
Call:
glm(formula = treat ~ re74 + race + married, family = "binomial",
data = lalonde[, -9])
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5694 -0.4638 -0.3033 0.8308 2.6076
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.863e-01 1.565e-01 5.663 1.48e-08 ***
re74 -4.949e-05 2.275e-05 -2.176 0.02958 *
racehispan -2.150e+00 3.596e-01 -5.979 2.25e-09 ***
racewhite -3.062e+00 2.826e-01 -10.833 < 2e-16 ***
married -7.262e-01 2.631e-01 -2.760 0.00578 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 751.49 on 613 degrees of freedom
Residual deviance: 496.34 on 609 degrees of freedom
AIC: 506.34
Number of Fisher Scoring iterations: 5
– Add interactions
tempfull <- glm(treat ~(race + married + re74)^2 + I(re74^2), data = lalonde[, -9], family = "binomial")
library(MASS)
model <- stepAIC(model, scope=list(upper = formula(tempfull), lower = formula(model)), direction = "forward")
Start: AIC=506.34
treat ~ re74 + race + married
Df Deviance AIC
+ race:re74 2 489.63 503.63
+ I(re74^2) 1 491.84 503.84
<none> 496.34 506.34
+ race:married 2 492.51 506.51
+ married:re74 1 496.04 508.04
Step: AIC=503.63
treat ~ re74 + race + married + re74:race
Df Deviance AIC
+ I(re74^2) 1 486.57 502.57
<none> 489.63 503.63
+ married:re74 1 489.55 505.55
+ race:married 2 488.29 506.29
Step: AIC=502.57
treat ~ re74 + race + married + I(re74^2) + re74:race
Df Deviance AIC
<none> 486.57 502.57
+ married:re74 1 485.92 503.92
+ race:married 2 485.13 505.13
summary(model)
Call:
glm(formula = treat ~ re74 + race + married + I(re74^2) + re74:race,
family = "binomial", data = lalonde[, -9])
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5612 -0.5542 -0.2251 0.8372 3.0364
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.683e-01 1.664e-01 5.218 1.81e-07 ***
re74 -1.148e-04 6.260e-05 -1.834 0.0667 .
racehispan -2.034e+00 4.175e-01 -4.871 1.11e-06 ***
racewhite -2.642e+00 3.183e-01 -8.300 < 2e-16 ***
married -6.451e-01 2.665e-01 -2.420 0.0155 *
I(re74^2) 4.976e-09 3.270e-09 1.522 0.1281
re74:racehispan -2.781e-05 7.658e-05 -0.363 0.7165
re74:racewhite -1.644e-04 9.291e-05 -1.770 0.0768 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 751.49 on 613 degrees of freedom
Residual deviance: 486.57 on 606 degrees of freedom
AIC: 502.57
Number of Fisher Scoring iterations: 7
## obtain linearized propensity scores
lps <- predict(model)
– Drop control and treated units that are out of the estimated propenstiy score range
library(ggplot2)
temp.data <- data.frame(lps = lps, treated = as.factor(lalonde$treat))
ggplot(temp.data, aes(x = lps, fill = treated, color = treated)) +
geom_histogram(alpha = 0.5, position = "identity") +
xlab("Linearized propensity score")
idx.treated <- which((lalonde$treat == 1) & (lps > max(lps[lalonde$treat == 0])))
idx.control <- which((lalonde$treat == 0) & (lps < min(lps[lalonde$treat == 1])))
idx <- c(idx.treated, idx.control) ## dropped two treated units and 77 control units
lalonde <- lalonde[-idx, ]
lps <- lps[-idx]
From the histogram, we see a potential issue here, which is that there are not enough controls that can be potentially matched with the treated units for units whose lps > 0.
Greedy algorithm, covariate balancing is not satisfying
## Greedy algorithm
m.out1 <- matchit(treat ~ age + educ + race + married +
nodegree + re74 + re75, data = lalonde,
estimand = "ATT",
method = "nearest", distance = lps)
plot(summary(m.out1), abs = F)
We change to optimal matching, and covariate balancing is still not satisfying
## Optimal matching algorithm
## install additional package
##install.packages("optmatch") and choose the binary version
library("optmatch")
m.out2 <- matchit(treat ~ age + educ + race + married +
nodegree + re74 + re75, data = lalonde,
estimand = "ATT",
method = "optimal", distance = lps)
plot(summary(m.out2), abs = F)
You see that even the linearized propensity score are not matched well, which indicates that some treated units are not able to be well matched (as there are not enough controls with lps > 0).
Three strategies to improve balancing:
## Strategy I: require exact match for race, as race seems to be the most unbalanced covariate and is discrete
m.out3 <- matchit(treat ~ age + educ + race + married +
nodegree + re74 + re75, data = lalonde,
estimand = "ATT", exact = "race",
method = "optimal", distance = lps)
Fewer control units than treated units in some 'exact' strata; not all treated units will get a match.
summary(m.out3)
Call:
matchit(formula = treat ~ age + educ + race + married + nodegree +
re74 + re75, data = lalonde, method = "optimal", distance = lps,
estimand = "ATT", exact = "race")
Summary of Balance for All Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance 0.2259 -1.8038 1.9434 0.4496 0.3409 0.6040
age 25.8197 26.4091 -0.0819 0.4847 0.0685 0.1514
educ 10.3279 10.1193 0.1041 0.4802 0.0372 0.1027
raceblack 0.8415 0.2472 1.6276 . 0.5944 0.5944
racehispan 0.0601 0.1733 -0.4762 . 0.1132 0.1132
racewhite 0.0984 0.5795 -1.6158 . 0.4812 0.4812
married 0.1858 0.4205 -0.6033 . 0.2347 0.2347
nodegree 0.7104 0.6335 0.1694 . 0.0769 0.0769
re74 1785.3081 3243.4437 -0.3770 0.7722 0.1575 0.3977
re75 1448.6596 1860.5826 -0.1318 1.3614 0.0915 0.2458
Summary of Balance for Matched Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max Std. Pair Dist.
distance -0.0701 -0.1097 0.0379 1.0038 0.0095 0.0948 0.0442
age 26.1207 26.2241 -0.0144 0.4638 0.0810 0.1983 1.1792
educ 10.3793 10.0259 0.1764 0.3449 0.0513 0.1466 1.3814
raceblack 0.7500 0.7500 0.0000 . 0.0000 0.0000 0.0000
racehispan 0.0948 0.0948 0.0000 . 0.0000 0.0000 0.0000
racewhite 0.1552 0.1552 0.0000 . 0.0000 0.0000 0.0000
married 0.2328 0.2931 -0.1552 . 0.0603 0.0603 0.1995
nodegree 0.7155 0.6379 0.1711 . 0.0776 0.0776 0.9313
re74 2261.7502 2786.3308 -0.1356 0.6877 0.0323 0.1293 0.5261
re75 2073.3520 1608.2498 0.1488 1.4552 0.0724 0.1552 0.6351
Percent Balance Improvement:
Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance 98.1 99.5 97.2 84.3
age 82.4 -6.1 -18.3 -31.0
educ -69.5 -45.1 -37.7 -42.7
raceblack 100.0 . 100.0 100.0
racehispan 100.0 . 100.0 100.0
racewhite 100.0 . 100.0 100.0
married 74.3 . 74.3 74.3
nodegree -0.9 . -0.9 -0.9
re74 64.0 -44.9 79.5 67.5
re75 -12.9 -21.6 20.8 36.9
Sample Sizes:
Control Treated
All 352 183
Matched 116 116
Unmatched 236 67
Discarded 0 0
plot(summary(m.out3), abs = F)
Strategy II: we can to improve balancing by discard units if the distance between pairs are two large
head(m.out2$match.matrix)
[,1]
NSW1 "PSID429"
NSW2 "PSID170"
NSW3 "PSID370"
NSW4 "PSID140"
NSW5 "PSID159"
NSW6 "PSID281"
names(lps) <- rownames(lalonde)
dists <- lps[rownames(m.out2$match.matrix)] - lps[m.out2$match.matrix[,1]]
hist(dists, main = "Histogram of distances between matched pairs")
## We see that some distances are large.
## We set a threshold 1 to make a balance between removing poor matches and retain most of the treated units
ID.treated.notmatched <- names(dists[abs(dists) > 1])
lalonde.new <- lalonde[!(rownames(lalonde) %in% ID.treated.notmatched),]
lps.new <- lps[!(rownames(lalonde) %in% ID.treated.notmatched)]
m.out4 <- matchit(treat ~ age + educ + race + married +
nodegree + re74 + re75, data = lalonde.new,
estimand = "ATT",
method = "optimal", distance = lps.new)
plot(summary(m.out4), abs = F)
summary(m.out4)
Call:
matchit(formula = treat ~ age + educ + race + married + nodegree +
re74 + re75, data = lalonde.new, method = "optimal", distance = lps.new,
estimand = "ATT")
Summary of Balance for All Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance 0.0425 -1.8038 1.4736 0.6469 0.3145 0.5148
age 25.1034 26.4091 -0.1971 0.4108 0.0724 0.1295
educ 10.3793 10.1193 0.1375 0.4280 0.0401 0.0988
raceblack 0.7500 0.2472 1.1613 . 0.5028 0.5028
racehispan 0.0948 0.1733 -0.2678 . 0.0785 0.0785
racewhite 0.1552 0.5795 -1.1721 . 0.4244 0.4244
married 0.1121 0.4205 -0.9776 . 0.3084 0.3084
nodegree 0.6897 0.6335 0.1213 . 0.0561 0.0561
re74 1793.4120 3243.4437 -0.3621 0.8277 0.1699 0.4146
re75 1623.3948 1860.5826 -0.0653 1.8385 0.0938 0.2685
Summary of Balance for Matched Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max Std. Pair Dist.
distance 0.0425 -0.1095 0.1213 1.0880 0.0364 0.2931 0.1218
age 25.1034 25.8966 -0.1197 0.3837 0.0884 0.2069 1.5072
educ 10.3793 10.0431 0.1778 0.3657 0.0467 0.0948 1.3538
raceblack 0.7500 0.7500 0.0000 . 0.0000 0.0000 0.0000
racehispan 0.0948 0.0776 0.0588 . 0.0172 0.0172 0.0588
racewhite 0.1552 0.1724 -0.0476 . 0.0172 0.0172 0.0476
married 0.1121 0.2845 -0.5466 . 0.1724 0.1724 0.6012
nodegree 0.6897 0.6379 0.1118 . 0.0517 0.0517 0.9317
re74 1793.4120 2587.1274 -0.1982 0.7040 0.0779 0.2414 0.5919
re75 1623.3948 1557.9356 0.0180 1.4168 0.0220 0.0776 0.5909
Percent Balance Improvement:
Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance 91.8 80.6 88.4 43.1
age 39.3 -7.7 -22.0 -59.8
educ -29.3 -18.6 -16.5 4.1
raceblack 100.0 . 100.0 100.0
racehispan 78.0 . 78.0 78.0
racewhite 95.9 . 95.9 95.9
married 44.1 . 44.1 44.1
nodegree 7.9 . 7.9 7.9
re74 45.3 -85.7 54.1 41.8
re75 72.4 42.8 76.6 71.1
Sample Sizes:
Control Treated
All 352 116
Matched 116 116
Unmatched 236 0
Discarded 0 0
Covariate balancing improves. However, notice that because we have removed more than 1/3 of of the treated units, the treated population may have changed dramastically. The ATT we estimate here can be biased for the ATT of the original population.
Strategy III: In the software tutorial, it uses another method called “optimal full matching” and reaches extremely good balancing. It assigns every treated and control unit in the sample to one subclass each (Hansen 2004; Stuart and Green 2008). Each subclass contains one treated unit and one or more control units or one control units and one or more treated units. Here it performs better than optimal matching, as full matching allows one control be matched with many treated units.
We can also use matching with replacement here, which does a similar job.
## We match each treated with 2 controls to increase stability (as some controls may be used multiple times)
m.out5 <- matchit(treat ~ age + educ + race + married +
nodegree + re74 + re75, data = lalonde,
estimand = "ATT", replace= T, ratio = 2,
method = "nearest", distance = lps)
plot(summary(m.out5), abs = F)
Actually, for matching with replacement you can change the propensity score estimation method to the simple logistic regression using all covariates as linear terms, and you will see an even better covariate balancing result.
m.out6 <- matchit(treat ~ age + educ + race + married +
nodegree + re74 + re75, data = lalonde,
estimand = "ATT", replace= T, ratio = 2,
method = "nearest", distance = "glm")
plot(summary(m.out6), abs = F)
We can continue with either m.out3 or m.out6.
m.data <- match.data(m.out3)
head(m.data)
## Neyman's approach, treating the data as from a pairwise randomized experiment
tau_hat_vec <- sapply(levels(m.data$subclass), function(sc) {
treated <- m.data$re78[m.data$subclass == sc & m.data$treat == 1]
control <- m.data$re78[m.data$subclass == sc & m.data$treat == 0]
return(treated - control)
})
tau_hat <- mean(tau_hat_vec, na.rm = T)
sd_tau_hat <- sd(tau_hat_vec)/sqrt(length(tau_hat_vec))
print(c(tau_hat, sd_tau_hat))
[1] 1633.733 1058.917
## 95% CI
print(c(tau_hat- 1.96 * sd_tau_hat, tau_hat + 1.96 * sd_tau_hat))
[1] -441.7443 3709.2105
Neyman’s approach with regression adjustment of the bias
## First run a linear regression on the control (remember to remove the treatment assignment variable!)
fit0 <- lm(re78 ~ age + educ + race + married + nodegree +
re74 + re75, data = m.data[m.data$treat == 0, ])
m.data$predicted_y0 <- predict(fit0, m.data)
## Neyman's approach with regression adjustment
tau_hat_vec_adjusted <- sapply(levels(m.data$subclass), function(sc) {
treated <- m.data$re78[m.data$subclass == sc & m.data$treat == 1]
control <- m.data$re78[m.data$subclass == sc & m.data$treat == 0]
bias <- m.data$predicted_y0[m.data$subclass == sc & m.data$treat == 1] -
m.data$predicted_y0[m.data$subclass == sc & m.data$treat == 0]
control.adjusted <- control + bias
return(treated - control.adjusted)
})
tau_hat_adjusted <- mean(tau_hat_vec_adjusted)
sd_tau_hat_adjusted <- sd(tau_hat_vec_adjusted)/sqrt(length(tau_hat_vec_adjusted))
print(c(tau_hat_adjusted, sd_tau_hat_adjusted))
[1] 1316.275 1039.687
## 95% CI
print(c(tau_hat_adjusted- 1.96 * sd_tau_hat_adjusted, tau_hat_adjusted + 1.96 * sd_tau_hat_adjusted))
[1] -721.5113 3354.0616
Run outcome regression directly:
fit <- lm(re78 ~ treat + age + educ + race + married + nodegree +
re74 + re75, data = m.data, weights = m.data$weights)
summary(fit)
Call:
lm(formula = re78 ~ treat + age + educ + race + married + nodegree +
re74 + re75, data = m.data, weights = m.data$weights)
Residuals:
Min 1Q Median 3Q Max
-8775 -5569 -2461 3739 53111
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -647.84836 4259.36037 -0.152 0.8792
treat 1369.97912 1033.01727 1.326 0.1861
age 45.34815 59.45116 0.763 0.4464
educ 418.06660 281.66535 1.484 0.1392
racehispan 2955.46825 1776.60496 1.664 0.0976 .
racewhite 922.70695 1429.48384 0.645 0.5193
married 171.05278 1289.29833 0.133 0.8946
nodegree 422.36639 1490.59059 0.283 0.7772
re74 -0.05998 0.13628 -0.440 0.6603
re75 0.14356 0.18787 0.764 0.4456
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7626 on 222 degrees of freedom
Multiple R-squared: 0.0391, Adjusted R-squared: 0.0001402
F-statistic: 1.004 on 9 and 222 DF, p-value: 0.438
Results are similar to the regression adjustment of bias.
m.data <- match.data(m.out6)
head(m.out6$match.matrix)
[,1] [,2]
NSW1 "PSID345" "PSID234"
NSW2 "PSID79" "PSID66"
NSW3 "PSID134" "PSID173"
NSW4 "PSID412" "PSID226"
NSW5 "PSID406" "PSID401"
NSW6 "PSID416" "PSID277"
## Neyman's approach
tau_hat_vec <- sapply(rownames(m.out6$match.matrix), function(ID) {
treated <- m.data[ID,]$re78
control <- m.data[m.out6$match.matrix[ID,1],]$re78
return(treated - control)
})
tau_hat <- mean(tau_hat_vec, na.rm = T)
sd_tau_hat <- sd(tau_hat_vec)/sqrt(length(tau_hat_vec))
print(c(tau_hat, sd_tau_hat))
[1] 1504.2209 696.2598
## 95% CI
print(c(tau_hat- 1.96 * sd_tau_hat, tau_hat + 1.96 * sd_tau_hat))
[1] 139.5517 2868.8901
Neyman’s approach with regression adjustment of the bias
## First run a linear regression on the control (remember to remove the treatment assignment variable!)
fit0 <- lm(re78 ~ age + educ + race + married + nodegree +
re74 + re75, data = m.data[m.data$treat == 0, ])
m.data$predicted_y0 <- predict(fit0, m.data)
## Neyman's approach with regression adjustment
tau_hat_vec_adjusted <- sapply(rownames(m.out6$match.matrix), function(ID) {
treated <- m.data[ID,]$re78
control <- m.data[m.out6$match.matrix[ID,1],]$re78
bias <- m.data[ID,]$predicted_y0 -
m.data[m.out6$match.matrix[ID,1],]$predicted_y0
control.adjusted <- control + bias
return(treated - control.adjusted)
})
tau_hat_adjusted <- mean(tau_hat_vec_adjusted)
sd_tau_hat_adjusted <- sd(tau_hat_vec_adjusted)/sqrt(length(tau_hat_vec_adjusted))
print(c(tau_hat_adjusted, sd_tau_hat_adjusted))
[1] 1421.1407 697.8793
## 95% CI
print(c(tau_hat_adjusted- 1.96 * sd_tau_hat_adjusted, tau_hat_adjusted + 1.96 * sd_tau_hat_adjusted))
[1] 53.29722 2788.98425
Results are different because: