We continue analyzing the lalonde data obtained from the MatchIt package. We estimate the propensity scores using the selected logistic regression model in R Example 7
library("MatchIt")
data("lalonde")
model <- glm(treat ~ . , data = lalonde[, -9], family = "binomial")
eps <- predict(model, type = "response")
lps <- predict(model)
m.out0 <- matchit(treat ~ age + educ + race + married +
nodegree + re74 + re75, data = lalonde,
method = NULL, distance = "glm")
## distance here refer to a method to estimate the propensity score
## No matching is performed at this stage
summary(m.out0)
Call:
matchit(formula = treat ~ age + educ + race + married + nodegree +
re74 + re75, data = lalonde, method = NULL, distance = "glm")
Summary of Balance for All Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance 0.5774 0.1822 1.7941 0.9211 0.3774
age 25.8162 28.0303 -0.3094 0.4400 0.0813
educ 10.3459 10.2354 0.0550 0.4959 0.0347
raceblack 0.8432 0.2028 1.7615 . 0.6404
racehispan 0.0595 0.1422 -0.3498 . 0.0827
racewhite 0.0973 0.6550 -1.8819 . 0.5577
married 0.1892 0.5128 -0.8263 . 0.3236
nodegree 0.7081 0.5967 0.2450 . 0.1114
re74 2095.5737 5619.2365 -0.7211 0.5181 0.2248
re75 1532.0553 2466.4844 -0.2903 0.9563 0.1342
eCDF Max
distance 0.6444
age 0.1577
educ 0.1114
raceblack 0.6404
racehispan 0.0827
racewhite 0.5577
married 0.3236
nodegree 0.1114
re74 0.4470
re75 0.2876
Sample Sizes:
Control Treated
All 429 185
Matched 429 185
Unmatched 0 0
Discarded 0 0
## Love plot before matching
plot(summary(m.out0), abs = F)
## Default thresholds: solid line 0.1, dashed line 0.05
library(ggplot2)
temp.data <- data.frame(lps = lps, treated = as.factor(lalonde$treat))
ggplot(temp.data, aes(x = lps, fill = treated, color = treated)) +
geom_histogram(alpha = 0.5, position = "identity") +
xlab("Linearized propensity score")
idx.treated <- which((lalonde$treat == 1) & (lps > max(lps[lalonde$treat == 0])))
idx.control <- which((lalonde$treat == 0) & (lps < min(lps[lalonde$treat == 1])))
idx <- c(idx.treated, idx.control) ## dropped two treated units and 77 control units
lalonde <- lalonde[-idx, ]
lps <- lps[-idx]
From the histogram, we see a potential issue here, which is that there are not enough controls that can be potentially matched with the treated units for units whose lps > 0.
Greedy algorithm, covariate balancing is not satisfying
## Greedy algorithm
m.out1 <- matchit(treat ~ age + educ + race + married +
nodegree + re74 + re75, data = lalonde,
estimand = "ATT",
method = "nearest", distance = lps)
summary(m.out1)
Call:
matchit(formula = treat ~ age + educ + race + married + nodegree +
re74 + re75, data = lalonde, method = "nearest", distance = lps,
estimand = "ATT")
Summary of Balance for All Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance 0.2121 -1.8606 1.8456 0.6082 0.3534
age 25.4463 26.9167 -0.2103 0.4537 0.0748
educ 10.3220 10.2366 0.0417 0.5423 0.0291
raceblack 0.8362 0.2339 1.6272 . 0.6023
racehispan 0.0621 0.1640 -0.4218 . 0.1018
racewhite 0.1017 0.6022 -1.6558 . 0.5005
married 0.1977 0.4382 -0.6037 . 0.2404
nodegree 0.6949 0.6290 0.1431 . 0.0659
re74 2179.3904 4051.3213 -0.3760 0.8686 0.1692
re75 1485.9177 2329.2386 -0.2622 0.9815 0.1236
eCDF Max
distance 0.6074
age 0.1413
educ 0.0837
raceblack 0.6023
racehispan 0.1018
racewhite 0.5005
married 0.2404
nodegree 0.0659
re74 0.4022
re75 0.2685
Summary of Balance for Matched Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance 0.2121 -0.7040 0.8158 0.7784 0.1285
age 25.4463 25.2316 0.0307 0.4336 0.0870
educ 10.3220 10.5876 -0.1294 0.6076 0.0235
raceblack 0.8362 0.4915 0.9311 . 0.3446
racehispan 0.0621 0.2203 -0.6553 . 0.1582
racewhite 0.1017 0.2881 -0.6168 . 0.1864
married 0.1977 0.2090 -0.0284 . 0.0113
nodegree 0.6949 0.6384 0.1227 . 0.0565
re74 2179.3904 2348.2864 -0.0339 1.3547 0.0478
re75 1485.9177 1612.6659 -0.0394 1.4896 0.0506
eCDF Max Std. Pair Dist.
distance 0.3955 0.8163
age 0.2486 1.4414
educ 0.0678 1.2146
raceblack 0.3446 0.9311
racehispan 0.1582 1.0297
racewhite 0.1864 0.6916
married 0.0113 0.8511
nodegree 0.0565 0.8344
re74 0.2542 0.7272
re75 0.2034 0.7374
Percent Balance Improvement:
Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance 55.8 49.6 63.6 34.9
age 85.4 -5.7 -16.3 -75.9
educ -210.7 18.6 19.3 19.0
raceblack 42.8 . 42.8 42.8
racehispan -55.3 . -55.3 -55.3
racewhite 62.7 . 62.7 62.7
married 95.3 . 95.3 95.3
nodegree 14.2 . 14.2 14.2
re74 91.0 -115.6 71.7 36.8
re75 85.0 -2032.9 59.1 24.2
Sample Sizes:
Control Treated
All 372 177
Matched 177 177
Unmatched 195 0
Discarded 0 0
plot(summary(m.out1), abs = F)
## Optimal matching algorithm
## install additional package
##install.packages("optmatch") and choose the binary version
library("optmatch")
m.out2 <- matchit(treat ~ age + educ + race + married +
nodegree + re74 + re75, data = lalonde,
estimand = "ATT",
method = "optimal", distance = lps)
summary(m.out2)
Call:
matchit(formula = treat ~ age + educ + race + married + nodegree +
re74 + re75, data = lalonde, method = "optimal", distance = lps,
estimand = "ATT")
Summary of Balance for All Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance 0.2121 -1.8606 1.8456 0.6082 0.3534
age 25.4463 26.9167 -0.2103 0.4537 0.0748
educ 10.3220 10.2366 0.0417 0.5423 0.0291
raceblack 0.8362 0.2339 1.6272 . 0.6023
racehispan 0.0621 0.1640 -0.4218 . 0.1018
racewhite 0.1017 0.6022 -1.6558 . 0.5005
married 0.1977 0.4382 -0.6037 . 0.2404
nodegree 0.6949 0.6290 0.1431 . 0.0659
re74 2179.3904 4051.3213 -0.3760 0.8686 0.1692
re75 1485.9177 2329.2386 -0.2622 0.9815 0.1236
eCDF Max
distance 0.6074
age 0.1413
educ 0.0837
raceblack 0.6023
racehispan 0.1018
racewhite 0.5005
married 0.2404
nodegree 0.0659
re74 0.4022
re75 0.2685
Summary of Balance for Matched Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance 0.2121 -0.7041 0.8158 0.7783 0.1285
age 25.4463 25.2316 0.0307 0.4336 0.0870
educ 10.3220 10.5876 -0.1294 0.6076 0.0235
raceblack 0.8362 0.4915 0.9311 . 0.3446
racehispan 0.0621 0.2203 -0.6553 . 0.1582
racewhite 0.1017 0.2881 -0.6168 . 0.1864
married 0.1977 0.2090 -0.0284 . 0.0113
nodegree 0.6949 0.6384 0.1227 . 0.0565
re74 2179.3904 2348.2864 -0.0339 1.3547 0.0478
re75 1485.9177 1611.7353 -0.0391 1.4891 0.0504
eCDF Max Std. Pair Dist.
distance 0.3955 0.8163
age 0.2486 1.3719
educ 0.0678 1.2531
raceblack 0.3446 0.9311
racehispan 0.1582 1.0297
racewhite 0.1864 0.7290
married 0.0113 0.7376
nodegree 0.0565 0.9816
re74 0.2542 0.7311
re75 0.2034 0.7723
Percent Balance Improvement:
Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance 55.8 49.6 63.6 34.9
age 85.4 -5.7 -16.3 -75.9
educ -210.7 18.6 19.3 19.0
raceblack 42.8 . 42.8 42.8
racehispan -55.3 . -55.3 -55.3
racewhite 62.7 . 62.7 62.7
married 95.3 . 95.3 95.3
nodegree 14.2 . 14.2 14.2
re74 91.0 -115.6 71.7 36.8
re75 85.1 -2031.2 59.2 24.2
Sample Sizes:
Control Treated
All 372 177
Matched 177 177
Unmatched 195 0
Discarded 0 0
plot(summary(m.out2), abs = F)
You see that even the linearized propensity score are not matched well, which indicates that some treated units are not able to be well matched (as there are not enough controls with lps > 0).
Three strategies to improve balancing:
m.out3 <- matchit(treat ~ age + educ + race + married +
nodegree + re74 + re75, data = lalonde,
estimand = "ATT", exact = "race",
method = "optimal", distance = lps)
Fewer control units than treated units in some 'exact' strata; not all treated units will get a match.
summary(m.out3)
Call:
matchit(formula = treat ~ age + educ + race + married + nodegree +
re74 + re75, data = lalonde, method = "optimal", distance = lps,
estimand = "ATT", exact = "race")
Summary of Balance for All Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance 0.2121 -1.8606 1.8456 0.6082 0.3534
age 25.4463 26.9167 -0.2103 0.4537 0.0748
educ 10.3220 10.2366 0.0417 0.5423 0.0291
raceblack 0.8362 0.2339 1.6272 . 0.6023
racehispan 0.0621 0.1640 -0.4218 . 0.1018
racewhite 0.1017 0.6022 -1.6558 . 0.5005
married 0.1977 0.4382 -0.6037 . 0.2404
nodegree 0.6949 0.6290 0.1431 . 0.0659
re74 2179.3904 4051.3213 -0.3760 0.8686 0.1692
re75 1485.9177 2329.2386 -0.2622 0.9815 0.1236
eCDF Max
distance 0.6074
age 0.1413
educ 0.0837
raceblack 0.6023
racehispan 0.1018
racewhite 0.5005
married 0.2404
nodegree 0.0659
re74 0.4022
re75 0.2685
Summary of Balance for Matched Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance -0.1307 -0.1576 0.0240 1.0094 0.0061
age 25.7414 26.1034 -0.0518 0.4951 0.0716
educ 10.1207 10.3793 -0.1261 0.6544 0.0254
raceblack 0.7500 0.7500 0.0000 . 0.0000
racehispan 0.0948 0.0948 0.0000 . 0.0000
racewhite 0.1552 0.1552 0.0000 . 0.0000
married 0.2414 0.2759 -0.0866 . 0.0345
nodegree 0.6724 0.6207 0.1123 . 0.0517
re74 2782.5274 3052.6559 -0.0543 1.4129 0.0680
re75 1720.7984 1913.4844 -0.0599 1.6225 0.0683
eCDF Max Std. Pair Dist.
distance 0.0776 0.0380
age 0.1897 1.3068
educ 0.0690 1.1430
raceblack 0.0000 0.0000
racehispan 0.0000 0.0000
racewhite 0.0000 0.0000
married 0.0345 0.4329
nodegree 0.0517 0.9361
re74 0.2845 0.7281
re75 0.1638 0.8358
Percent Balance Improvement:
Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance 98.7 98.1 98.3 87.2
age 75.4 11.1 4.3 -34.2
educ -202.6 30.7 12.7 17.6
raceblack 100.0 . 100.0 100.0
racehispan 100.0 . 100.0 100.0
racewhite 100.0 . 100.0 100.0
married 85.7 . 85.7 85.7
nodegree 21.5 . 21.5 21.5
re74 85.6 -145.4 59.8 29.3
re75 77.2 -2490.3 44.7 39.0
Sample Sizes:
Control Treated
All 372 177
Matched 116 116
Unmatched 256 61
Discarded 0 0
plot(summary(m.out3), abs = F)
names(lps) <- rownames(lalonde)
dists <- lps[rownames(m.out2$match.matrix)] - lps[m.out2$match.matrix[,1]]
hist(dists, main = "Histogram of distances between matched pairs")
## We see that some distances are large.
## We set a threshold 1 to make a balance between removing poor matches and retain most of the treated units
ID.treated.notmatched <- names(dists[abs(dists) > 1])
lalonde.new <- lalonde[!(rownames(lalonde) %in% ID.treated.notmatched),]
lps.new <- lps[!(rownames(lalonde) %in% ID.treated.notmatched)]
m.out4 <- matchit(treat ~ age + educ + race + married +
nodegree + re74 + re75, data = lalonde.new,
estimand = "ATT",
method = "optimal", distance = lps.new)
plot(summary(m.out4), abs = F)
summary(m.out4)
Call:
matchit(formula = treat ~ age + educ + race + married + nodegree +
re74 + re75, data = lalonde.new, method = "optimal", distance = lps.new,
estimand = "ATT")
Summary of Balance for All Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance 0.0469 -1.8606 1.4078 0.8854 0.3314
age 25.8000 26.9167 -0.1626 0.4377 0.0772
educ 10.4091 10.2366 0.0919 0.4546 0.0367
raceblack 0.7455 0.2339 1.1744 . 0.5116
racehispan 0.0909 0.1640 -0.2542 . 0.0731
racewhite 0.1636 0.6022 -1.1853 . 0.4385
married 0.1545 0.4382 -0.7846 . 0.2836
nodegree 0.7182 0.6290 0.1982 . 0.0891
re74 1705.0367 4051.3213 -0.4815 0.8321 0.2130
re75 1083.5616 2329.2386 -0.5774 0.4417 0.1578
eCDF Max
distance 0.5344
age 0.1518
educ 0.1147
raceblack 0.5116
racehispan 0.0731
racewhite 0.4385
married 0.2836
nodegree 0.0891
re74 0.4562
re75 0.3185
Summary of Balance for Matched Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance 0.0469 -0.1394 0.1374 1.1420 0.0393
age 25.8000 26.3091 -0.0741 0.3857 0.0945
educ 10.4091 10.5000 -0.0484 0.5131 0.0287
raceblack 0.7455 0.7364 0.0209 . 0.0091
racehispan 0.0909 0.0909 0.0000 . 0.0000
racewhite 0.1636 0.1727 -0.0246 . 0.0091
married 0.1545 0.2545 -0.2766 . 0.1000
nodegree 0.7182 0.6182 0.2223 . 0.1000
re74 1705.0367 2618.8808 -0.1876 1.1223 0.1054
re75 1083.5616 1803.4733 -0.3337 0.5553 0.0978
eCDF Max Std. Pair Dist.
distance 0.2182 0.1397
age 0.2545 1.5407
educ 0.1000 1.1422
raceblack 0.0091 0.0209
racehispan 0.0000 0.0909
racewhite 0.0091 0.2703
married 0.1000 0.5784
nodegree 0.1000 0.8285
re74 0.3273 0.6800
re75 0.2364 1.0188
Percent Balance Improvement:
Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance 90.2 -9.2 88.1 59.2
age 54.4 -15.3 -22.4 -67.7
educ 47.3 15.4 21.7 12.8
raceblack 98.2 . 98.2 98.2
racehispan 100.0 . 100.0 100.0
racewhite 97.9 . 97.9 97.9
married 64.7 . 64.7 64.7
nodegree -12.2 . -12.2 -12.2
re74 61.1 37.2 50.5 28.3
re75 42.2 28.0 38.0 25.8
Sample Sizes:
Control Treated
All 372 110
Matched 110 110
Unmatched 262 0
Discarded 0 0
Covariate balancing improves. However, notice that because we have removed more than 1/3 of of the treated units, the treated population may have changed dramastically. The ATT we estimate here can be biased for the ATT of the original population.
We can also use matching with replacement here, which does a similar job.
## We match each treated with 2 controls to increase stability (as some controls may be used multiple times)
m.out5 <- matchit(treat ~ age + educ + race + married +
nodegree + re74 + re75, data = lalonde,
estimand = "ATT", replace= T, ratio = 2,
method = "nearest", distance = lps)
plot(summary(m.out5), abs = F)
We can continue with either m.out3 or m.out5.
m.data <- match.data(m.out3)
## Neyman's approach, treating the data as from a pairwise randomized experiment
tau_hat_vec <- sapply(levels(m.data$subclass), function(sc) {
treated <- m.data$re78[m.data$subclass == sc & m.data$treat == 1]
control <- m.data$re78[m.data$subclass == sc & m.data$treat == 0]
return(treated - control)
})
tau_hat <- mean(tau_hat_vec, na.rm = T)
sd_tau_hat <- sd(tau_hat_vec)/sqrt(length(tau_hat_vec))
print(c(tau_hat, sd_tau_hat))
[1] 1009.6593 861.6323
## 95% CI
print(c(tau_hat- 1.96 * sd_tau_hat, tau_hat + 1.96 * sd_tau_hat))
[1] -679.1401 2698.4587
## First run a linear regression on the control (remember to remove the treatment assignment variable!)
fit0 <- lm(re78 ~ age + educ + race + married + nodegree +
re74 + re75, data = m.data[m.data$treat == 0, ])
m.data$predicted_y0 <- predict(fit0, m.data)
## Neyman's approach with regression adjustment
tau_hat_vec_adjusted <- sapply(levels(m.data$subclass), function(sc) {
treated <- m.data$re78[m.data$subclass == sc & m.data$treat == 1]
control <- m.data$re78[m.data$subclass == sc & m.data$treat == 0]
bias <- m.data$predicted_y0[m.data$subclass == sc & m.data$treat == 1] -
m.data$predicted_y0[m.data$subclass == sc & m.data$treat == 0]
control.adjusted <- control + bias
return(treated - control.adjusted)
})
tau_hat_adjusted <- mean(tau_hat_vec_adjusted)
sd_tau_hat_adjusted <- sd(tau_hat_vec_adjusted)/sqrt(length(tau_hat_vec_adjusted))
print(c(tau_hat_adjusted, sd_tau_hat_adjusted))
[1] 1081.3067 847.2499
## 95% CI
print(c(tau_hat_adjusted- 1.96 * sd_tau_hat_adjusted, tau_hat_adjusted + 1.96 * sd_tau_hat_adjusted))
[1] -579.3032 2741.9165
Run outcome regression directly:
fit <- lm(re78 ~ treat + age + educ + race + married + nodegree +
re74 + re75, data = m.data, weights = m.data$weights)
summary(fit)
Call:
lm(formula = re78 ~ treat + age + educ + race + married + nodegree +
re74 + re75, data = m.data, weights = m.data$weights)
Residuals:
Min 1Q Median 3Q Max
-9848 -4897 -1746 3506 26323
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2983.1800 3660.7375 -0.815 0.4160
treat 1160.2892 827.2007 1.403 0.1621
age 45.6270 50.5176 0.903 0.3674
educ 534.1462 246.7257 2.165 0.0315 *
racehispan 548.6558 1429.8698 0.384 0.7016
racewhite 1165.0281 1185.9980 0.982 0.3270
married -568.3086 1066.0661 -0.533 0.5945
nodegree 940.0476 1242.3852 0.757 0.4501
re74 0.0134 0.1015 0.132 0.8950
re75 0.3143 0.1625 1.934 0.0544 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6284 on 222 degrees of freedom
Multiple R-squared: 0.07724, Adjusted R-squared: 0.03983
F-statistic: 2.065 on 9 and 222 DF, p-value: 0.03376
Results are similar to the regression adjustment of bias.
m.data <- match.data(m.out5)
## Neyman's approach
tau_hat_vec <- sapply(rownames(m.out5$match.matrix), function(ID) {
treated <- m.data[ID,]$re78
control <- m.data[m.out5$match.matrix[ID,1],]$re78
return(treated - control)
})
tau_hat <- mean(tau_hat_vec, na.rm = T)
sd_tau_hat <- sd(tau_hat_vec)/sqrt(length(tau_hat_vec))
print(c(tau_hat, sd_tau_hat))
[1] 1761.0360 643.3557
## 95% CI
print(c(tau_hat- 1.96 * sd_tau_hat, tau_hat + 1.96 * sd_tau_hat))
[1] 500.0589 3022.0132
## First run a linear regression on the control (remember to remove the treatment assignment variable!)
fit0 <- lm(re78 ~ age + educ + race + married + nodegree +
re74 + re75, data = m.data[m.data$treat == 0, ])
m.data$predicted_y0 <- predict(fit0, m.data)
## Neyman's approach with regression adjustment
tau_hat_vec_adjusted <- sapply(rownames(m.out5$match.matrix), function(ID) {
treated <- m.data[ID,]$re78
control <- m.data[m.out5$match.matrix[ID,1],]$re78
bias <- m.data[ID,]$predicted_y0 -
m.data[m.out5$match.matrix[ID,1],]$predicted_y0
control.adjusted <- control + bias
return(treated - control.adjusted)
})
tau_hat_adjusted <- mean(tau_hat_vec_adjusted)
sd_tau_hat_adjusted <- sd(tau_hat_vec_adjusted)/sqrt(length(tau_hat_vec_adjusted))
print(c(tau_hat_adjusted, sd_tau_hat_adjusted))
[1] 1859.113 646.671
## 95% CI
print(c(tau_hat_adjusted- 1.96 * sd_tau_hat_adjusted, tau_hat_adjusted + 1.96 * sd_tau_hat_adjusted))
[1] 591.6382 3126.5883
Results are different because: