As an introduction to generalized linear models, we briefly analyze two datasets and make a connection with the linear models that we are familiar with.

1. Example 1: Male Satellites for Female Horseshoe Crabs (Agresti Chapter 1.5)

An illustration, image downloaded from here where you can also read more about the related science


A scientific question: why do some females mate multiply?

This is generally a hard question. Here we try to get a relevant answer from a dataset.

1.1 Read the dataset

Crabs <- read.table("Crabs.dat", header = TRUE)
Crabs

Variables:

  • \(y\): Number of male satellites
  • spine: spine condition (1, both good; 2, one worn or broken; 3, both worn or broken)
  • weight: in kg
  • width: carapace width (cm)
  • color: (1, medium light; 2, medium; 3, medium dark; 4, dark)

We are interested in understanding what factors are associated with \(y\).

1.2 Some exploratory visualization

  • The histogram of number of satellites
hist(Crabs$y, breaks = 50)

  • Pairwise scatterplot
pairs(Crabs[, -1], pch = 19)

1.3 Using a linear regression

There seems to be a linear trend between \(y\) and weight/width. As weight and width are highly correlated, we include only weight in the model.

result <- lm(y ~ weight + factor(color) + factor(spine), data = Crabs)
summary(result)
## 
## Call:
## lm(formula = y ~ weight + factor(color) + factor(spine), data = Crabs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5667 -2.1053 -0.6737  1.4428 11.1453 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.64718    1.42262  -0.455    0.650    
## weight          1.82676    0.41283   4.425 1.74e-05 ***
## factor(color)2 -0.69966    0.96243  -0.727    0.468    
## factor(color)3 -1.34027    1.06110  -1.263    0.208    
## factor(color)4 -1.31893    1.16743  -1.130    0.260    
## factor(spine)2 -0.46795    0.93605  -0.500    0.618    
## factor(spine)3  0.06789    0.62228   0.109    0.913    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.953 on 166 degrees of freedom
## Multiple R-squared:  0.151,  Adjusted R-squared:  0.1204 
## F-statistic: 4.922 on 6 and 166 DF,  p-value: 0.0001166
plot(result)

  • Is the linear model good? (not too bad)
  • Do we trust the p-values for the weight variable? (increased variability when y increases)
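The second point can be illustrated on simulated data. The sketch below does not use the crab data: it assumes hypothetical Poisson counts whose mean grows with a single covariate `x`, fits a straight line with `lm`, and compares the residual variance in the lower and upper halves of the fitted values. All names and parameter values are illustrative.

```r
set.seed(1)

# Hypothetical data: counts whose mean increases with x (not the crab data)
n  <- 500
x  <- runif(n, 1, 5)
mu <- exp(-0.5 + 0.5 * x)
y  <- rpois(n, mu)

# Ordinary linear regression, which assumes constant error variance
fit <- lm(y ~ x)

# Split observations by fitted value and compare residual variances
low      <- fitted(fit) <= median(fitted(fit))
var_low  <- var(resid(fit)[low])
var_high <- var(resid(fit)[!low])
c(low = var_low, high = var_high)
```

If the residual variance is clearly larger where the fitted values are larger, the equal-variance assumption behind the usual `lm` standard errors is questionable.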

In fact, linear models can be made general enough to handle non-Gaussian data, unequal error variances, and so on. We may discuss this later, but first we expand our linear model toolbox by introducing generalized linear models.

1.4 Using a Poisson GLM

As \(y\) are counts, we instead assume that

  • \(y_i \sim \text{Poisson}(\mu_i)\) for sample \(i\)
  • \(\mu_i = g(X_i^T\beta)\) where \(X_i\) are the covariates
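For reference, the Poisson assumption above corresponds to the probability mass function

\[ P(y_i = k) = \frac{e^{-\mu_i}\mu_i^{k}}{k!}, \qquad k = 0, 1, 2, \ldots, \]

which implies

\[ E(y_i) = \operatorname{Var}(y_i) = \mu_i. \]

Unlike the linear model with constant error variance, the variance here grows with the mean, and \(\mu_i\) must be positive.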

Case 1: assume that \(g(X_i^T\beta) = X_i^T\beta\), which is the same mean relation as in the linear model.

try(result1 <- glm(y ~ weight + factor(color) + factor(spine), data = Crabs, family = poisson(link=identity)))
## Warning in log(y/mu): NaNs produced
## Error : no valid set of coefficients has been found: please supply starting values

We see a computation error. (Why? We will discuss this later in the lectures.)

The fit does succeed if we include only the color variable.
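One possible source of difficulty with the identity link, offered here only as a hint and not as a diagnosis of the error above, is that \(X_i^T\beta\) is not constrained to be positive, whereas a Poisson mean must be. The sketch below uses a few made-up linear predictor values (not taken from the crab data) and the `linkinv` and `validmu` components of R family objects to contrast the identity and log links.

```r
# Made-up linear predictor values; the first one is negative
eta <- c(-0.5, 1, 2)

fam_identity <- poisson(link = "identity")
fam_log      <- poisson(link = "log")

# Map the linear predictor to the mean under each link
mu_identity <- fam_identity$linkinv(eta)  # mean equals eta
mu_log      <- fam_log$linkinv(eta)       # mean equals exp(eta)

# validmu() checks that all means are finite and positive
ok_identity <- fam_identity$validmu(mu_identity)
ok_log      <- fam_log$validmu(mu_log)
c(identity = ok_identity, log = ok_log)
```

With the log link every value of the linear predictor gives a valid mean, which is one reason it is the default for the Poisson family.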

result2 <- glm(y ~ factor(color), data = Crabs, family = poisson(link="identity"))
summary(result2)
## 
## Call:
## glm(formula = y ~ factor(color), family = poisson(link = "identity"), 
##     data = Crabs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8577  -2.1106  -0.1649   0.8721   4.7491  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      4.0833     0.5833   7.000 2.56e-12 ***
## factor(color)2  -0.7886     0.6123  -1.288  0.19780    
## factor(color)3  -1.8561     0.6252  -2.969  0.00299 ** 
## factor(color)4  -2.0379     0.6582  -3.096  0.00196 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 609.14  on 169  degrees of freedom
## AIC: 972.44
## 
## Number of Fisher Scoring iterations: 3

Compare with the results from the linear model:

result3 <- lm(y ~ factor(color), data = Crabs)
summary(result3)
## 
## Call:
## lm(formula = y ~ factor(color), data = Crabs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0833 -2.2273 -0.2947  1.7053 11.7053 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.0833     0.8985   4.544 1.05e-05 ***
## factor(color)2  -0.7886     0.9536  -0.827   0.4094    
## factor(color)3  -1.8561     1.0137  -1.831   0.0689 .  
## factor(color)4  -2.0379     1.1170  -1.824   0.0699 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.113 on 169 degrees of freedom
## Multiple R-squared:  0.0396, Adjusted R-squared:  0.02256 
## F-statistic: 2.323 on 3 and 169 DF,  p-value: 0.07685

Question: The point estimates of the coefficients are the same, while the standard errors from the Poisson GLM are much smaller than those from the linear model. Why? Which ones do you think are more trustworthy?
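A quick check on the intercept hints at where the difference comes from. Assume that color group 1 contains \(n_1 = 12\) crabs; this count is not shown in the output above, but it is consistent with both reported standard errors. The intercept is the group mean \(\bar{y}_1 = 4.0833\). The Poisson model takes the variance to be equal to the mean, while the linear model uses the pooled residual variance \(\hat\sigma^2 = 3.113^2 \approx 9.69\):

\[ \text{SE}_{\text{Poisson}} = \sqrt{\bar{y}_1 / n_1} = \sqrt{4.0833/12} \approx 0.583, \qquad \text{SE}_{\text{lm}} = \hat\sigma / \sqrt{n_1} = 3.113/\sqrt{12} \approx 0.899. \]

The estimated residual variance is more than twice the group mean, so which standard error to trust depends on whether the Poisson variance assumption is plausible for these data.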

Case 2: assume that \(g(X_i^T\beta) = e^{X_i^T\beta}\)

result4 <- glm(y ~ weight + factor(color), data = Crabs, family = poisson())
summary(result4)
## 
## Call:
## glm(formula = y ~ weight + factor(color), family = poisson(), 
##     data = Crabs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9833  -1.9272  -0.5553   0.8646   4.8270  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.04978    0.23315  -0.214   0.8309    
## weight          0.54618    0.06811   8.019 1.07e-15 ***
## factor(color)2 -0.20511    0.15371  -1.334   0.1821    
## factor(color)3 -0.44980    0.17574  -2.560   0.0105 *  
## factor(color)4 -0.45205    0.20844  -2.169   0.0301 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 551.80  on 168  degrees of freedom
## AIC: 917.1
## 
## Number of Fisher Scoring iterations: 6

Should we trust the results here?

1.5 Interpretation of the coefficients

Suppose we believe the result that the coefficient \(\beta\) for weight is significantly nonzero. Can we describe this finding in words? How is it related to the scientific question “why do some females mate multiply”?

  • What is the difference between the interpretation of \(\beta\) in the linear model and in the GLM?
  • Can we say that a heavier female weight is a cause of more satellite males?
  • Association is not causation; read Agresti Chapter 1.2.3.
  • You can take a causal inference course in the future, which will answer under what assumptions we can say that heavier weight is a cause of more satellite males.
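For the first question, a small sketch may help. It takes the weight coefficients reported above (1.82676 from the linear model in Section 1.3 and 0.54618 from the log-link Poisson model `result4`) and contrasts an additive change in the mean with a multiplicative one. The variable names are illustrative.

```r
# Weight coefficients copied from the summaries above
beta_lm  <- 1.82676   # linear model: additive change in the mean per kg
beta_glm <- 0.54618   # Poisson GLM with log link: change in log(mean) per kg

# Under the log link, one extra kg multiplies the expected count by exp(beta)
rate_ratio       <- exp(beta_glm)
percent_increase <- 100 * (rate_ratio - 1)
c(rate_ratio = rate_ratio, percent_increase = percent_increase)
```

Under the log link, each extra kilogram multiplies the expected number of satellites by roughly 1.73, holding color fixed; under the linear model, it adds about 1.83 satellites, holding color and spine fixed. Either statement describes an association, not a causal effect.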

2. Example 2: Election counts (Faraway Chapter 1)

You can read the original paper, Uncounted Votes: Does Voting Equipment Matter?, for the background of the problem. You can learn more about the equipment concerns in the 2000 election from this Scientific American article. In short, people were concerned that the voting machinery might have biased the voting results.

We analyze a small dataset of the voting results for the 2000 United States Presidential election in Georgia.

2.1 Read the dataset

# install.packages("faraway")
data(gavote, package = "faraway")
gavote

Variables:

  • equip: the voting method, which takes five values: “LEVER”, “OS-CC” (optical scan, central count), “OS-PC” (optical scan, precinct count), “PAPER”, and “PUNCH” (punch card)
  • econ: the economic level of the county, takes three values “middle”, “poor” and “rich”
  • perAA: the percentage of African Americans
  • rural: whether the county is rural or urban
  • atlanta: whether the county is part of the Atlanta metropolitan area
  • gore: number of votes for Al Gore
  • bush: number of votes for George Bush
  • other: number of votes for other candidates
  • votes: total vote counts
  • ballots: number of ballots issued

We are interested in understanding whether the voting machinery affects the undercount:

gavote$undercount <- (gavote$ballots - gavote$votes)/gavote$ballots
hist(gavote$undercount, breaks = 50)
rug(gavote$undercount)

2.2 Using a linear regression

We first look at the pairwise comparisons to gain some impression of the data

gavote$pergore <- gavote$gore/gavote$votes
pairs(gavote[, c("undercount", colnames(gavote)[1:5], "pergore")])

Now we try a linear model including other control variables

result <- lm(undercount ~ pergore + factor(rural) + factor(econ) + factor(atlanta) + factor(equip), data = gavote)
summary(result)
## 
## Call:
## lm(formula = undercount ~ pergore + factor(rural) + factor(econ) + 
##     factor(atlanta) + factor(equip), data = gavote)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.066004 -0.012196 -0.002656  0.010145  0.116770 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                0.032140   0.012972   2.478 0.014344 *  
## pergore                   -0.004207   0.020009  -0.210 0.833740    
## factor(rural)urban        -0.007036   0.004951  -1.421 0.157411    
## factor(econ)poor           0.018858   0.004404   4.282  3.3e-05 ***
## factor(econ)rich          -0.015538   0.007767  -2.000 0.047278 *  
## factor(atlanta)notAtlanta  0.001654   0.008418   0.196 0.844508    
## factor(equip)OS-CC         0.008975   0.004378   2.050 0.042104 *  
## factor(equip)OS-PC         0.020715   0.005510   3.759 0.000244 ***
## factor(equip)PAPER        -0.010151   0.015720  -0.646 0.519443    
## factor(equip)PUNCH         0.016159   0.006449   2.506 0.013303 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02173 on 149 degrees of freedom
## Multiple R-squared:  0.2852, Adjusted R-squared:  0.2421 
## F-statistic: 6.607 on 9 and 149 DF,  p-value: 6.481e-08
  • How can we interpret the significant equipment effects? If we do not adjust for the other control variables, none of the equipment effects is significant:
result <- lm(undercount ~ factor(equip), data = gavote)
summary(result)
## 
## Call:
## lm(formula = undercount ~ factor(equip), data = gavote)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.046179 -0.015147 -0.004215  0.012349  0.136557 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.189e-02  2.910e-03  14.395   <2e-16 ***
## factor(equip)OS-CC  9.049e-05  4.766e-03   0.019    0.985    
## factor(equip)OS-PC  9.670e-03  6.080e-03   1.591    0.114    
## factor(equip)PAPER -1.647e-03  1.794e-02  -0.092    0.927    
## factor(equip)PUNCH  5.200e-03  6.734e-03   0.772    0.441    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02504 on 154 degrees of freedom
## Multiple R-squared:  0.0198, Adjusted R-squared:  -0.005662 
## F-statistic: 0.7776 on 4 and 154 DF,  p-value: 0.5413

Should we include the other control variables or not?

  • Do we trust the p-values? Notice that the total number of ballots varies across counties, so it may not be suitable to assume equal variance across samples.

2.3 Using a binomial GLM (logistic regression)

As the undercounts are proportions, let \(y_i\) denote the number of uncounted ballots and \(n_i\) the total number of ballots in county \(i\). We assume:

  • \(y_i \sim \text{Binomial}(n_i, p_i)\)
  • \(\log(p_i/(1-p_i)) = X_i^T\beta\)

This model naturally accounts for the unequal error variances that arise from the differences in the total number of ballots.
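To see why, note that under the binomial model the observed proportion \(y_i/n_i\) has variance \(p_i(1-p_i)/n_i\), which shrinks as \(n_i\) grows. The sketch below uses simulated counts (not the `gavote` data; all names and values are hypothetical) to show two equivalent ways of passing such data to `glm`: a two-column matrix of successes and failures, or the proportion with the totals as weights.

```r
set.seed(2)

# Hypothetical grouped data: m units with very different totals n_i
m   <- 50
n_i <- sample(100:5000, m, replace = TRUE)
x   <- rnorm(m)
p   <- plogis(-3 + 0.3 * x)           # true success probabilities
y   <- rbinom(m, size = n_i, prob = p)

# Form 1: two-column response (successes, failures)
fit_counts <- glm(cbind(y, n_i - y) ~ x, family = binomial)

# Form 2: proportion as response, totals as prior weights
fit_props <- glm(y / n_i ~ x, weights = n_i, family = binomial)

# Model-implied variance of each observed proportion
var_prop <- p * (1 - p) / n_i

rbind(counts = coef(fit_counts), props = coef(fit_props))
```

Both calls fit the same model; in either form the totals \(n_i\) enter the likelihood, so units with more trials carry more information.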

Let’s first consider the simple GLM without adding other control covariates

gavote$undercountNumber <- gavote$ballots - gavote$votes
result.glm <- glm(cbind(undercountNumber, votes) ~ factor(equip), data = gavote, family = "binomial")
summary(result.glm)
## 
## Call:
## glm(formula = cbind(undercountNumber, votes) ~ factor(equip), 
##     family = "binomial", data = gavote)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -59.303   -3.927    1.441    8.028   54.388  
## 
## Coefficients:
##                     Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)        -3.183865   0.007823 -406.976   <2e-16 ***
## factor(equip)OS-CC -0.140390   0.010423  -13.470   <2e-16 ***
## factor(equip)OS-PC -0.640942   0.010975  -58.400   <2e-16 ***
## factor(equip)PAPER -0.202773   0.095969   -2.113   0.0346 *  
## factor(equip)PUNCH  0.167267   0.009406   17.783   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 36829  on 158  degrees of freedom
## Residual deviance: 28379  on 154  degrees of freedom
## AIC: 29559
## 
## Number of Fisher Scoring iterations: 5

Compared with the linear model results, the p-values are much smaller. Should we trust them?

Now let’s add all the control variables:

result.glm <- glm(cbind(undercountNumber, votes) ~ pergore + factor(rural) + factor(econ) + factor(atlanta) + factor(equip), data = gavote, family = "binomial")
summary(result.glm)
## 
## Call:
## glm(formula = cbind(undercountNumber, votes) ~ pergore + factor(rural) + 
##     factor(econ) + factor(atlanta) + factor(equip), family = "binomial", 
##     data = gavote)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -30.313   -5.008   -0.545    5.659   36.642  
## 
## Coefficients:
##                           Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)               -3.10478    0.02696 -115.180  < 2e-16 ***
## pergore                   -0.74836    0.04384  -17.070  < 2e-16 ***
## factor(rural)urban        -0.16871    0.01036  -16.287  < 2e-16 ***
## factor(econ)poor           0.40147    0.01102   36.424  < 2e-16 ***
## factor(econ)rich          -0.89529    0.01631  -54.903  < 2e-16 ***
## factor(atlanta)notAtlanta  0.06652    0.01226    5.427 5.73e-08 ***
## factor(equip)OS-CC         0.22233    0.01143   19.451  < 2e-16 ***
## factor(equip)OS-PC         0.08584    0.01271    6.754 1.43e-11 ***
## factor(equip)PAPER        -0.40060    0.09606   -4.170 3.04e-05 ***
## factor(equip)PUNCH         0.75731    0.01358   55.767  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 36829  on 158  degrees of freedom
## Residual deviance: 15668  on 149  degrees of freedom
## AIC: 16857
## 
## Number of Fisher Scoring iterations: 4

Why are all the p-values so much smaller? We will return to this issue when we discuss GLMs for binary data.
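One quantity worth looking at in the meantime, offered as a rough heuristic and not as the full answer, is the ratio of residual deviance to residual degrees of freedom. If the binomial variance assumption held, this ratio would typically be close to 1. The sketch below only does the arithmetic on the numbers printed in the two binomial GLM summaries above.

```r
# Residual deviance and degrees of freedom copied from the summaries above
ratio_equip_only <- 28379 / 154   # model with equip only
ratio_full       <- 15668 / 149   # model with all control variables

c(equip_only = ratio_equip_only, full = ratio_full)
```

Both ratios are far above 1 (roughly 184 and 105), which suggests that the counts vary much more than the binomial model allows, so the reported standard errors may be too small.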