As an introduction to generalized linear models, we briefly analyze two datasets and make a connection with the linear models that we are familiar with.
A scientific question: why do some females mate multiply?
This is generally a hard question. We may alternatively ask: what features of the female horseshoe crab cause it to mate multiply? This is still a hard question, as it is about causality. We want to gain some understanding of this by performing regression on an observational dataset.
# Read in the horseshoe crab data
Crabs <- read.table("Crabs.dat", header = TRUE)
Crabs
crab <int> | y <int> | weight <dbl> | width <dbl> | color <int> | spine <int> |
---|---|---|---|---|---|
1 | 8 | 3.050 | 28.3 | 2 | 3 |
2 | 0 | 1.550 | 22.5 | 3 | 3 |
3 | 9 | 2.300 | 26.0 | 1 | 1 |
4 | 0 | 2.100 | 24.8 | 3 | 3 |
5 | 4 | 2.600 | 26.0 | 3 | 3 |
6 | 0 | 2.100 | 23.8 | 2 | 3 |
7 | 0 | 2.350 | 26.5 | 1 | 1 |
8 | 0 | 1.900 | 24.7 | 3 | 2 |
9 | 0 | 1.950 | 23.7 | 2 | 1 |
10 | 0 | 2.150 | 25.6 | 3 | 3 |
## Change color and spine to categorical
Crabs$color <- factor(Crabs$color)
Crabs$spine <- factor(Crabs$spine)
Crabs
crab <int> | y <int> | weight <dbl> | width <dbl> | color <fct> | spine <fct> |
---|---|---|---|---|---|
1 | 8 | 3.050 | 28.3 | 2 | 3 |
2 | 0 | 1.550 | 22.5 | 3 | 3 |
3 | 9 | 2.300 | 26.0 | 1 | 1 |
4 | 0 | 2.100 | 24.8 | 3 | 3 |
5 | 4 | 2.600 | 26.0 | 3 | 3 |
6 | 0 | 2.100 | 23.8 | 2 | 3 |
7 | 0 | 2.350 | 26.5 | 1 | 1 |
8 | 0 | 1.900 | 24.7 | 3 | 2 |
9 | 0 | 1.950 | 23.7 | 2 | 1 |
10 | 0 | 2.150 | 25.6 | 3 | 3 |
Variables:
- crab: ID of the female crab
- y: number of satellite males residing near the female
- weight: weight of the female (kg)
- width: carapace width of the female (cm)
- color: color of the female, coded 1 (lightest) to 4 (darkest)
- spine: spine condition, coded 1 (both good) to 3 (both worn or broken)
We are interested in understanding what factors are associated with y.
# Distribution of the response y
hist(Crabs$y, breaks = 50)
# Pairwise scatter plots of all variables except the crab ID
pairs(Crabs[, -1], pch = 19)
The scatter-plot matrix above cannot display the categorical variables well. We can instead use the ggpairs function in the GGally package:
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(Crabs[, -1])
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There seems to be a linear trend between y and weight/width. As weight and width are highly correlated, we include only weight in the model (why?).
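A quick numerical check of this collinearity (a minimal sketch; the exact value depends on the data file):

# Correlation between the two size measurements; a value near 1 means
# they carry largely redundant information
cor(Crabs$weight, Crabs$width)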
result <- lm(y ~ weight + factor(color) + factor(spine), data = Crabs)
summary(result)
##
## Call:
## lm(formula = y ~ weight + factor(color) + factor(spine), data = Crabs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5667 -2.1053 -0.6737 1.4428 11.1453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.64718 1.42262 -0.455 0.650
## weight 1.82676 0.41283 4.425 1.74e-05 ***
## factor(color)2 -0.69966 0.96243 -0.727 0.468
## factor(color)3 -1.34027 1.06110 -1.263 0.208
## factor(color)4 -1.31893 1.16743 -1.130 0.260
## factor(spine)2 -0.46795 0.93605 -0.500 0.618
## factor(spine)3 0.06789 0.62228 0.109 0.913
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.953 on 166 degrees of freedom
## Multiple R-squared: 0.151, Adjusted R-squared: 0.1204
## F-statistic: 4.922 on 6 and 166 DF, p-value: 0.0001166
plot(result)
- A linear mean relationship might be fine (but the response is bounded below by zero).
- The variability increases as y increases; this is very common when the response is restricted to be positive (see the sketch below).
- Gaussianity fails. (Is that fine?)
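One way to see the increasing variability directly (a sketch using the lm fit above): plot the absolute residuals against the fitted values; an upward trend suggests the variance grows with the mean.

# Absolute residuals vs fitted values: an upward trend indicates
# Var(y) increasing with E(y), i.e. heteroscedasticity
plot(fitted(result), abs(residuals(result)),
     xlab = "Fitted values", ylab = "|Residuals|")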
We now expand our linear model toolbox by introducing generalized linear models.
As the responses $y_i$ are counts, we instead assume that $y_i \sim \text{Poisson}(\mu_i)$ independently, where the mean $\mu_i = E(y_i)$ is related to the covariates through a link function $g$: $g(\mu_i) = X_i^\top \beta$.

Case 1: assume that $g(\mu_i) = \mu_i = X_i^\top \beta$ (the identity link), which is the same mean relationship as in the linear model.
try(result1 <- glm(y ~ weight + factor(color) + factor(spine), data = Crabs, family = poisson(link = "identity")))
## Warning in log(y/mu): NaNs produced
## Error : no valid set of coefficients has been found: please supply starting values
We see a computation error. This happens because $X_i^\top \hat\beta$ is not always positive, while the Poisson mean $\mu_i$ must be.
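If we really want the identity link, glm allows supplying starting values through its start argument; below is a minimal sketch (our own workaround, not part of the original analysis): starting with the sample mean as the intercept and zeros elsewhere makes every initial fitted mean positive, though the iterations may still leave the valid region and fail.

# Hypothetical workaround: intercept starts at mean(y) > 0, all other
# coefficients at 0, so every initial fitted mean is positive.
# Convergence is still not guaranteed with the identity link.
try(result1 <- glm(y ~ weight + factor(color) + factor(spine),
                   data = Crabs, family = poisson(link = "identity"),
                   start = c(mean(Crabs$y), rep(0, 6))))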
Case 2: assume that $g(\mu_i) = \log \mu_i = X_i^\top \beta$ ($\mu_i$ must always be positive, but after the transformation, $\log \mu_i$ has no support restrictions).
result4 <- glm(y ~ weight + factor(color) + factor(spine), data = Crabs, family = poisson())
summary(result4)
##
## Call:
## glm(formula = y ~ weight + factor(color) + factor(spine), family = poisson(),
## data = Crabs)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.04263 0.25354 -0.168 0.8665
## weight 0.54756 0.07318 7.482 7.31e-14 ***
## factor(color)2 -0.26768 0.16781 -1.595 0.1107
## factor(color)3 -0.52087 0.19414 -2.683 0.0073 **
## factor(color)4 -0.53966 0.22525 -2.396 0.0166 *
## factor(spine)2 -0.16066 0.21146 -0.760 0.4474
## factor(spine)3 0.09086 0.11948 0.760 0.4470
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 549.70 on 166 degrees of freedom
## AIC: 919
##
## Number of Fisher Scoring iterations: 6
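Because the link is the log, each coefficient acts multiplicatively on the mean: a one-unit increase in weight multiplies the expected count by $\exp(0.548) \approx 1.73$, holding the other covariates fixed. A quick way to view all effects on the multiplicative scale:

# Multiplicative effects on the mean count E(y) = exp(X'beta)
exp(coef(result4))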
Should we trust the results here? We will discuss model diagnosis for the above Poisson regression later.
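One quick heuristic we can apply already (a sketch; formal diagnostics come later): for a well-specified Poisson model, the residual deviance should be roughly comparable to its degrees of freedom, so a ratio far above 1 hints at overdispersion.

# Residual deviance / residual df; here about 549.7/166, i.e. above 3,
# suggesting the counts are more variable than Poisson allows
result4$deviance / result4$df.residual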
Case 3: a comparison between assuming "Gaussian" noise vs. "Poisson" noise in the linear model.
If we insist on using the identity link, it is feasible if we only include the color variable. (This is only for illustration purposes; we may not want to use this model in real data analysis, as weight seems to be more important.)
result2 <- glm(y ~ factor(color), data = Crabs, family = poisson(link="identity"))
summary(result2)
##
## Call:
## glm(formula = y ~ factor(color), family = poisson(link = "identity"),
## data = Crabs)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.0833 0.5833 7.000 2.56e-12 ***
## factor(color)2 -0.7886 0.6123 -1.288 0.19780
## factor(color)3 -1.8561 0.6252 -2.969 0.00299 **
## factor(color)4 -2.0379 0.6582 -3.096 0.00196 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 609.14 on 169 degrees of freedom
## AIC: 972.44
##
## Number of Fisher Scoring iterations: 3
In the above model we assumed Poisson noise for the response, so the noise model is $\text{Var}(y_i) = E(y_i) = \mu_i$. The mean relationship with $X_i$ is $\mu_i = X_i^\top \beta$. We now compare with the results from the linear model, which assumes the same mean relationship but homoscedastic noise, $\text{Var}(y_i) = \sigma^2$.
result3 <- lm(y ~ factor(color), data = Crabs)
summary(result3)
##
## Call:
## lm(formula = y ~ factor(color), data = Crabs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0833 -2.2273 -0.2947 1.7053 11.7053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0833 0.8985 4.544 1.05e-05 ***
## factor(color)2 -0.7886 0.9536 -0.827 0.4094
## factor(color)3 -1.8561 1.0137 -1.831 0.0689 .
## factor(color)4 -2.0379 1.1170 -1.824 0.0699 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.113 on 169 degrees of freedom
## Multiple R-squared: 0.0396, Adjusted R-squared: 0.02256
## F-statistic: 2.323 on 3 and 169 DF, p-value: 0.07685
Question: we obtain the same point estimates of the coefficients, but the standard errors from the Poisson linear model are much smaller than those from the Gaussian linear model. Why? Which do you think is more trustworthy?
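One way to probe which standard errors to trust (a sketch, assuming the sandwich and lmtest packages are installed): heteroscedasticity-robust (sandwich) standard errors remain valid when the variance model is misspecified, for either fit of the same mean relationship.

# Sandwich standard errors do not rely on the homoscedastic Gaussian
# assumption or on the Poisson mean-variance relationship
library(sandwich)
library(lmtest)
coeftest(result2, vcov = vcovHC(result2, type = "HC0"))  # Poisson fit
coeftest(result3, vcov = vcovHC(result3, type = "HC0"))  # Gaussian fit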
Say we believe the result that the coefficient $\beta$ for weight is significantly nonzero. Can we describe this finding in words? How is it related to the scientific question "why do some females mate multiply"? Is the weight of the female a cause of attracting more satellite males?

You can read the original paper, "Uncounted Votes: Does Voting Equipment Matter?", for the background of the problem. You can learn more about the equipment concern in the 2000 election from this Scientific American article. Basically, people were concerned about possible bias in the voting results introduced by the voting machinery.
We analyze a small dataset of the voting results for the 2000 United States Presidential election in Georgia.
# install.packages("faraway")
data(gavote, package = "faraway")
gavote
county | equip <fct> | econ <fct> | perAA <dbl> | rural <fct> | atlanta <fct> | gore <int> | bush <int> | other <int> | votes <int> |
---|---|---|---|---|---|---|---|---|---|
APPLING | LEVER | poor | 0.182 | rural | notAtlanta | 2093 | 3940 | 66 | 6099 | |
ATKINSON | LEVER | poor | 0.230 | rural | notAtlanta | 821 | 1228 | 22 | 2071 | |
BACON | LEVER | poor | 0.131 | rural | notAtlanta | 956 | 2010 | 29 | 2995 | |
BAKER | OS-CC | poor | 0.476 | rural | notAtlanta | 893 | 615 | 11 | 1519 | |
BALDWIN | LEVER | middle | 0.359 | rural | notAtlanta | 5893 | 6041 | 192 | 12126 | |
BANKS | LEVER | middle | 0.024 | rural | notAtlanta | 1220 | 3202 | 111 | 4533 | |
BARROW | OS-CC | middle | 0.079 | urban | notAtlanta | 3657 | 7925 | 520 | 12102 | |
BARTOW | OS-PC | middle | 0.079 | urban | Atlanta | 7508 | 14720 | 552 | 22780 | |
BEN.HILL | OS-PC | poor | 0.282 | rural | notAtlanta | 2234 | 2381 | 46 | 4661 | |
BERRIEN | OS-CC | poor | 0.107 | rural | notAtlanta | 1640 | 2718 | 52 | 4410 |
We are interested in understanding whether the voting machinery affects the undercount (besides the columns shown above, gavote also contains a ballots column with the total number of ballots cast in each county):
# Undercount: proportion of ballots with no counted presidential vote
gavote$undercount <- (gavote$ballots - gavote$votes)/gavote$ballots
hist(gavote$undercount, breaks = 50)
rug(gavote$undercount)
We first look at the pairwise comparisons to get an impression of the data:
# Proportion of the counted votes cast for Gore
gavote$pergore <- gavote$gore/gavote$votes
ggpairs(gavote[, c("undercount", colnames(gavote)[1:5], "pergore")])
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We are interested in understanding the effect of equipment on the undercount. We now try a linear model including other control variables:
result <- lm(undercount ~ pergore + perAA + factor(rural) + factor(econ) + factor(atlanta) + factor(equip) + ballots, data = gavote)
#result <- lm(undercount ~ pergore + perAA + factor(rural) + factor(equip), data = gavote)
summary(result)
##
## Call:
## lm(formula = undercount ~ pergore + perAA + factor(rural) + factor(econ) +
## factor(atlanta) + factor(equip) + ballots, data = gavote)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.066426 -0.012006 -0.002564 0.009781 0.115446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.141e-02 1.616e-02 1.943 0.053928 .
## pergore 1.281e-02 4.438e-02 0.289 0.773299
## perAA -1.051e-02 2.994e-02 -0.351 0.726132
## factor(rural)urban -6.313e-03 5.069e-03 -1.245 0.214964
## factor(econ)poor 1.879e-02 4.720e-03 3.981 0.000108 ***
## factor(econ)rich -1.624e-02 7.886e-03 -2.059 0.041218 *
## factor(atlanta)notAtlanta -1.700e-03 9.777e-03 -0.174 0.862183
## factor(equip)OS-CC 8.756e-03 4.436e-03 1.974 0.050263 .
## factor(equip)OS-PC 2.127e-02 5.626e-03 3.780 0.000227 ***
## factor(equip)PAPER -1.063e-02 1.581e-02 -0.672 0.502672
## factor(equip)PUNCH 1.673e-02 6.527e-03 2.563 0.011376 *
## ballots -4.563e-08 6.607e-08 -0.691 0.490932
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02184 on 147 degrees of freedom
## Multiple R-squared: 0.2879, Adjusted R-squared: 0.2346
## F-statistic: 5.404 on 11 and 147 DF, p-value: 3.514e-07
result <- lm(undercount ~ factor(equip), data = gavote)
summary(result)
##
## Call:
## lm(formula = undercount ~ factor(equip), data = gavote)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.046179 -0.015147 -0.004215 0.012349 0.136557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.189e-02 2.910e-03 14.395 <2e-16 ***
## factor(equip)OS-CC 9.049e-05 4.766e-03 0.019 0.985
## factor(equip)OS-PC 9.670e-03 6.080e-03 1.591 0.114
## factor(equip)PAPER -1.647e-03 1.794e-02 -0.092 0.927
## factor(equip)PUNCH 5.200e-03 6.734e-03 0.772 0.441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02504 on 154 degrees of freedom
## Multiple R-squared: 0.0198, Adjusted R-squared: -0.005662
## F-statistic: 0.7776 on 4 and 154 DF, p-value: 0.5413
Should we include the other control variables or not? (Are we adjusting for confounding variables?)
plot(result)
The above diagnostic plots are not very informative, as there are too few unique fitted values. Notice also that each county has a different total number of ballots, so it may not be suitable to assume equal variance across samples.
hist(gavote$votes, breaks = 100)
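A simple way to visualize the unequal variability (a sketch): plot the undercount proportion against county size; if the variation is driven by binomial sampling, the spread should shrink as the number of ballots grows.

# Undercount proportion vs total ballots per county
plot(gavote$ballots, gavote$undercount,
     xlab = "Total ballots", ylab = "Undercount proportion")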
As the undercounts are proportions, denote $y_i$ as the number of uncounted ballots and $n_i$ as the total number of ballots in county $i$. We assume $y_i \sim \text{Binomial}(n_i, p_i)$, with the default logit link relating $p_i$ to the covariates: $\log\frac{p_i}{1 - p_i} = X_i^\top \beta$.
This model naturally accounts for the unequal variances of the responses brought by the differences in the total number of ballots: $\text{Var}(y_i) = n_i p_i (1 - p_i)$.
Let’s first consider the simple GLM without the other control covariates:
# Number of uncounted ballots in each county
gavote$undercountNumber <- gavote$ballots - gavote$votes
# Binomial response via cbind(successes, failures) = (uncounted, counted)
result.glm <- glm(cbind(undercountNumber, votes) ~ factor(equip), data = gavote, family = "binomial")
summary(result.glm)
##
## Call:
## glm(formula = cbind(undercountNumber, votes) ~ factor(equip),
## family = "binomial", data = gavote)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.183865 0.007823 -406.976 <2e-16 ***
## factor(equip)OS-CC -0.140390 0.010423 -13.470 <2e-16 ***
## factor(equip)OS-PC -0.640942 0.010975 -58.400 <2e-16 ***
## factor(equip)PAPER -0.202773 0.095969 -2.113 0.0346 *
## factor(equip)PUNCH 0.167267 0.009406 17.783 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 36829 on 158 degrees of freedom
## Residual deviance: 28379 on 154 degrees of freedom
## AIC: 29559
##
## Number of Fisher Scoring iterations: 5
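As a side note, the same binomial GLM can equivalently be specified with the observed proportion as the response and the number of ballots as prior weights (a sketch; result.glm.alt is our own name, and the fit should reproduce the coefficients above):

# Equivalent specification: proportion response with the trial counts
# supplied as weights (undercount * ballots = undercountNumber)
result.glm.alt <- glm(undercount ~ factor(equip), weights = ballots,
                      data = gavote, family = "binomial")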
Compared with the linear model results, the p-values are much smaller. Should we trust them more than those from the linear model?
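One standard check before deciding (a sketch): the residual deviance (28379) is far larger than its degrees of freedom (154), a sign of overdispersion; a quasibinomial fit keeps the same coefficients but scales the standard errors by an estimated dispersion parameter.

# Quasi-likelihood fit: same point estimates, inflated standard errors
result.quasi <- glm(cbind(undercountNumber, votes) ~ factor(equip),
                    data = gavote, family = "quasibinomial")
summary(result.quasi)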
Now let’s add all the control variables:
result.glm <- glm(cbind(undercountNumber, votes) ~ pergore + perAA + factor(rural) + factor(econ) + factor(atlanta) + factor(equip) + ballots, data = gavote, family = "binomial")
summary(result.glm)
##
## Call:
## glm(formula = cbind(undercountNumber, votes) ~ pergore + perAA +
## factor(rural) + factor(econ) + factor(atlanta) + factor(equip) +
## ballots, family = "binomial", data = gavote)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.835e+00 3.617e-02 -78.379 < 2e-16 ***
## pergore -1.782e+00 9.596e-02 -18.573 < 2e-16 ***
## perAA 8.681e-01 7.171e-02 12.104 < 2e-16 ***
## factor(rural)urban -1.907e-01 1.056e-02 -18.051 < 2e-16 ***
## factor(econ)poor 3.453e-01 1.205e-02 28.650 < 2e-16 ***
## factor(econ)rich -9.215e-01 1.701e-02 -54.178 < 2e-16 ***
## factor(atlanta)notAtlanta 3.334e-02 1.615e-02 2.065 0.0389 *
## factor(equip)OS-CC 2.308e-01 1.149e-02 20.083 < 2e-16 ***
## factor(equip)OS-PC 7.814e-02 1.288e-02 6.067 1.31e-09 ***
## factor(equip)PAPER -3.818e-01 9.607e-02 -3.974 7.06e-05 ***
## factor(equip)PUNCH 7.325e-01 1.378e-02 53.175 < 2e-16 ***
## ballots 1.160e-07 6.523e-08 1.779 0.0753 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 36829 on 158 degrees of freedom
## Residual deviance: 15519 on 147 degrees of freedom
## AIC: 16712
##
## Number of Fisher Scoring iterations: 4
Why do all the p-values become much smaller? We will return to this issue when we discuss GLMs for binary data.