Example: modeling correlated survey response

The data is a survey result for whether the respondents support legalizing abortion under three situations: 1) if the family has a very low income and cannot afford anymore children, 2) when the women is not married and does not want to marry the man, 3) when the women wants it for any reason. We aim to understand how one person’s opinion change under different situations and whether gender has an effect.

Abortion <- read.table("Abortion.dat",header=T)
Abortion

1.1 Applying the GLMM model

Gauss-Hermite Quadrature

We want to use a GLMM model to take into account the correlation of the choices from the same individual.

library(glmmML) ## do not install package from source if you face an installation error
## one can also use the glmer function in the lme4 package
fit.glmm <- glmmML(response ~ gender + factor(situation),
                   cluster = case, 
                   data = Abortion,
                   family = binomial, 
                   method = "ghq", n.points = 70)
summary(fit.glmm)
## 
## Call:  glmmML(formula = response ~ gender + factor(situation), family = binomial,      data = Abortion, cluster = case, method = "ghq", n.points = 70) 
## 
## 
##                        coef se(coef)        z Pr(>|z|)
## (Intercept)         0.21596   0.3766  0.57347 5.66e-01
## gender              0.01257   0.4888  0.02572 9.79e-01
## factor(situation)2 -0.54230   0.1572 -3.45022 5.60e-04
## factor(situation)3 -0.83470   0.1601 -5.21346 1.85e-07
## 
## Scale parameter in mixing distribution:  8.736 gaussian 
## Std. Error:                              0.5421 
## 
##         LR p-value for H_0: sigma = 0:  0 
## 
## Residual deviance: 4579 on 5545 degrees of freedom   AIC: 4589

Intepretations of the results: - Intepretation of coefficients - individual heteogeneity: our estimate random effect is \(\hat \sigma_u = 8.736\), indicating strong correlation among the choices within one individual. - results change with n.points

Using the glmer function from lme4 will give you the same solution

library(lme4)
## Loading required package: Matrix
fit.glmm2 <- glmer(response ~ gender + factor(situation) + (1|case), data = Abortion,
                   family = binomial, nAGQ = 70)
summary(fit.glmm2)
## Generalized linear mixed model fit by maximum likelihood (Adaptive
##   Gauss-Hermite Quadrature, nAGQ = 70) [glmerMod]
##  Family: binomial  ( logit )
## Formula: response ~ gender + factor(situation) + (1 | case)
##    Data: Abortion
## 
##      AIC      BIC   logLik deviance df.resid 
##   4588.5   4621.7  -2289.3   4578.5     5545 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.7809 -0.1224 -0.1056  0.1397  1.7149 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  case   (Intercept) 76.32    8.736   
## Number of obs: 5550, groups:  case, 1850
## 
## Fixed effects:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.21596    0.37680   0.573 0.566544    
## gender              0.01258    0.48904   0.026 0.979478    
## factor(situation)2 -0.54230    0.15719  -3.450 0.000561 ***
## factor(situation)3 -0.83470    0.16010  -5.214 1.85e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) gender fct()2
## gender      -0.727              
## fctr(sttn)2 -0.206  0.000       
## fctr(sttn)3 -0.206  0.000  0.512

Laplace approximation

library(lme4)
fit.glmm3 <- glmer(response ~ gender + factor(situation) + (1|case), data = Abortion,
                   family = binomial, nAGQ = 1)
summary(fit.glmm3)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: response ~ gender + factor(situation) + (1 | case)
##    Data: Abortion
## 
##      AIC      BIC   logLik deviance df.resid 
##   4942.7   4975.8  -2466.3   4932.7     5545 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.6419 -0.1794 -0.1614  0.1996  1.6073 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  case   (Intercept) 31.24    5.59    
## Number of obs: 5550, groups:  case, 1850
## 
## Fixed effects:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.12930    0.21730   0.595    0.552    
## gender              0.01072    0.28191   0.038    0.970    
## factor(situation)2 -0.38903    0.08894  -4.374 1.22e-05 ***
## factor(situation)3 -0.59852    0.08960  -6.680 2.39e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) gender fct()2
## gender      -0.727              
## fctr(sttn)2 -0.206  0.000       
## fctr(sttn)3 -0.205  0.000  0.506

Laplace approximation is less accurate, but it is the only choice if the random effect is not a scalar.

1.2 Compare with the results from the standard logistic regression

Let’s look at the marginal model results. Here we just use standard GLM ignoring correlation structure. The book uses a more sophisticated marginal model which provides valid inference. Our standard glm here would not provide valid inference for the marginal model, but the estimates should still be good.

(Why? Brief explanation: though the correlation structure of the samples is not considered, the score equations from the standard GLM is typically still consistent. Specifically, the expectation of score equation at the true parameter value is 0, and when correlation across samples are weak, the consistency of score equation can be obtained from the LLN)

fit.glm <- glm(response ~ gender + factor(situation), data = Abortion, family = binomial)
summary(fit.glm)
## 
## Call:
## glm(formula = response ~ gender + factor(situation), family = binomial, 
##     data = Abortion)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)  
## (Intercept)         0.023940   0.055528   0.431   0.6664  
## gender              0.003582   0.054138   0.066   0.9472  
## factor(situation)2 -0.097329   0.065783  -1.480   0.1390  
## factor(situation)3 -0.149347   0.065825  -2.269   0.0233 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7689.5  on 5549  degrees of freedom
## Residual deviance: 7684.2  on 5546  degrees of freedom
## AIC: 7692.2
## 
## Number of Fisher Scoring iterations: 3

The marginal model coefficients are very scaled down towards \(0\) compared with GLMM. We can check that the ratio is roughly \(1/\sqrt{1 + \sigma_u^2/1.7^2}\).

plot(fit.glm$coefficients, fit.glmm$coefficients, xlim = c(-0.5, 0.25), ylim = c(-1, 0.25))
abline(v = 0, lty = 2)
abline(h = 0, lty = 2)
abline(a = 0, b = sqrt(1 + (fit.glmm$sigma/1.7)^2), col = "red")

1.2 GEE for modeling the marginal GLM

library(gee)
fit.marginal.glm <- gee(response ~ gender + factor(situation),
                          id = case,
                          data = Abortion,
                          family = binomial, 
                          corstr = "exchangeable")
## Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
## running glm to get initial regression estimate
##        (Intercept)             gender factor(situation)2 factor(situation)3 
##        0.023939537        0.003582051       -0.097329124       -0.149347113
summary(fit.marginal.glm)
## 
##  GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
##  gee S-function, version 4.13 modified 98/01/27 (1998) 
## 
## Model:
##  Link:                      Logit 
##  Variance to Mean Relation: Binomial 
##  Correlation Structure:     Exchangeable 
## 
## Call:
## gee(formula = response ~ gender + factor(situation), id = case, 
##     data = Abortion, family = binomial, corstr = "exchangeable")
## 
## Summary of Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.5068644 -0.4825396 -0.4687095  0.5174604  0.5312905 
## 
## 
## Coefficients:
##                        Estimate Naive S.E.     Naive z Robust S.E.    Robust z
## (Intercept)         0.024021377 0.06776549  0.35447800  0.06779334  0.35433237
## gender              0.003437873 0.08790630  0.03910838  0.08784072  0.03913758
## factor(situation)2 -0.097329120 0.02812582 -3.46049001  0.02753161 -3.53517666
## factor(situation)3 -0.149347107 0.02814374 -5.30658404  0.02973865 -5.02198729
## 
## Estimated Scale Parameter:  1.000721
## Number of Iterations:  2
## 
## Working Correlation
##           [,1]      [,2]      [,3]
## [1,] 1.0000000 0.8173308 0.8173308
## [2,] 0.8173308 1.0000000 0.8173308
## [3,] 0.8173308 0.8173308 1.0000000

The naive method assume that the variance model is correct. The robust methods is the Sandwich estimator.

The point estimates are very similar to the GLM model without considering correlations within individuals, but the standard deviations of the situation difference estimates are smaller if we take into account positive correlations of the responses within individuals.

Sandwich estimator under model mis-specification

library(gee)
fit.marginal.glm2 <- gee(response ~ gender + factor(situation),
                          id = case,
                          data = Abortion,
                          family = binomial, 
                          corstr = "independence")
## Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
## running glm to get initial regression estimate
##        (Intercept)             gender factor(situation)2 factor(situation)3 
##        0.023939537        0.003582051       -0.097329124       -0.149347113
summary(fit.marginal.glm2)
## 
##  GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
##  gee S-function, version 4.13 modified 98/01/27 (1998) 
## 
## Model:
##  Link:                      Logit 
##  Variance to Mean Relation: Binomial 
##  Correlation Structure:     Independent 
## 
## Call:
## gee(formula = response ~ gender + factor(situation), id = case, 
##     data = Abortion, family = binomial, corstr = "independence")
## 
## Summary of Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.5068800 -0.4825552 -0.4686891  0.5174448  0.5313109 
## 
## 
## Coefficients:
##                        Estimate Naive S.E.     Naive z Robust S.E.    Robust z
## (Intercept)         0.023939537 0.05554845  0.43096681  0.06779337  0.35312505
## gender              0.003582051 0.05415761  0.06614123  0.08784012  0.04077921
## factor(situation)2 -0.097329124 0.06580705 -1.47900750  0.02753161 -3.53517682
## factor(situation)3 -0.149347113 0.06584875 -2.26803253  0.02973865 -5.02198759
## 
## Estimated Scale Parameter:  1.000721
## Number of Iterations:  1
## 
## Working Correlation
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

Final conclusion

If you consider correlations within individuals, we will have enough power to detect that response to abortion changes significantly across situations. If you treat the responses as independent, you overestimate the uncertainty in your difference estimates.