As an introduction to generalized linear models, we briefly analyze two datasets and make a connection with the linear models that we are familiar with.
A scientific question: why do some females mate multiply?
This is generally a hard question. We may alternatively ask: what features of the female horseshoe crab cause it to mate multiply? This is still a hard question, as it is about causality. We want to gain some understanding of this by performing regression on an observational dataset.
# Read in the horseshoe crab data
Crabs <- read.table("Crabs.dat", header = TRUE)
Crabs
crab <int> | y <int> | weight <dbl> | width <dbl> | color <int> | spine <int> |
---|---|---|---|---|---|
1 | 8 | 3.050 | 28.3 | 2 | 3 |
2 | 0 | 1.550 | 22.5 | 3 | 3 |
3 | 9 | 2.300 | 26.0 | 1 | 1 |
4 | 0 | 2.100 | 24.8 | 3 | 3 |
5 | 4 | 2.600 | 26.0 | 3 | 3 |
6 | 0 | 2.100 | 23.8 | 2 | 3 |
7 | 0 | 2.350 | 26.5 | 1 | 1 |
8 | 0 | 1.900 | 24.7 | 3 | 2 |
9 | 0 | 1.950 | 23.7 | 2 | 1 |
10 | 0 | 2.150 | 25.6 | 3 | 3 |
## Change color and spine to categorical
Crabs$color <- factor(Crabs$color)
Crabs$spine <- factor(Crabs$spine)
Crabs
crab <int> | y <int> | weight <dbl> | width <dbl> | color <fct> | spine <fct> |
---|---|---|---|---|---|
1 | 8 | 3.050 | 28.3 | 2 | 3 |
2 | 0 | 1.550 | 22.5 | 3 | 3 |
3 | 9 | 2.300 | 26.0 | 1 | 1 |
4 | 0 | 2.100 | 24.8 | 3 | 3 |
5 | 4 | 2.600 | 26.0 | 3 | 3 |
6 | 0 | 2.100 | 23.8 | 2 | 3 |
7 | 0 | 2.350 | 26.5 | 1 | 1 |
8 | 0 | 1.900 | 24.7 | 3 | 2 |
9 | 0 | 1.950 | 23.7 | 2 | 1 |
10 | 0 | 2.150 | 25.6 | 3 | 3 |
Variables:
- crab: ID of the female crab
- y: number of satellite males residing near the female
- weight: weight of the female (kg)
- width: carapace width of the female (cm)
- color: color of the female, coded 1 (lightest) to 4 (darkest)
- spine: spine condition, coded 1 (both good) to 3 (both worn or broken)
We are interested in understanding what factors are associated with y.
# Distribution of the response y
hist(Crabs$y, breaks = 50)
# Pairwise scatter plots of all variables except the crab ID
pairs(Crabs[, -1], pch = 19)
The scatter-plot matrix above cannot display the categorical variables well. We can instead use the ggpairs function in the GGally package:
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(Crabs[, -1])
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There seems to be a linear trend between y and weight/width. As weight and width are highly correlated, we include only weight in the model (why?).
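A quick numerical check of this collinearity (a minimal sketch; the exact value depends on the data file):

# Correlation between the two size measurements; a value near 1 means
# they carry largely redundant information
cor(Crabs$weight, Crabs$width)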
result <- lm(y ~ weight + factor(color) + factor(spine), data = Crabs)
summary(result)
##
## Call:
## lm(formula = y ~ weight + factor(color) + factor(spine), data = Crabs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5667 -2.1053 -0.6737 1.4428 11.1453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.64718 1.42262 -0.455 0.650
## weight 1.82676 0.41283 4.425 1.74e-05 ***
## factor(color)2 -0.69966 0.96243 -0.727 0.468
## factor(color)3 -1.34027 1.06110 -1.263 0.208
## factor(color)4 -1.31893 1.16743 -1.130 0.260
## factor(spine)2 -0.46795 0.93605 -0.500 0.618
## factor(spine)3 0.06789 0.62228 0.109 0.913
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.953 on 166 degrees of freedom
## Multiple R-squared: 0.151, Adjusted R-squared: 0.1204
## F-statistic: 4.922 on 6 and 166 DF, p-value: 0.0001166
plot(result)
- A linear mean relationship might be fine (but the response is bounded below by zero).
- The variability increases as y increases; this is very common when the response is restricted to be positive (see the sketch below).
- Gaussianity fails. (Is that fine?)
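One way to see the increasing variability directly (a sketch using the lm fit above): plot the absolute residuals against the fitted values; an upward trend suggests the variance grows with the mean.

# Absolute residuals vs fitted values: an upward trend indicates
# Var(y) increasing with E(y), i.e. heteroscedasticity
plot(fitted(result), abs(residuals(result)),
     xlab = "Fitted values", ylab = "|Residuals|")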
We now expand our linear model toolbox by introducing generalized linear models.
As the responses $y_i$ are counts, we instead assume that $y_i \sim \text{Poisson}(\mu_i)$ independently, where the mean $\mu_i = E(y_i)$ is related to the covariates through a link function $g$: $g(\mu_i) = X_i^\top \beta$.

Case 1: assume that $g(\mu_i) = \mu_i = X_i^\top \beta$ (the identity link), which is the same mean relationship as in the linear model.
try(result1 <- glm(y ~ weight + factor(color) + factor(spine), data = Crabs, family = poisson(link = "identity")))
## Warning in log(y/mu): NaNs produced
## Error : no valid set of coefficients has been found: please supply starting values
We see a computation error. This happens because $X_i^\top \hat\beta$ is not always positive, while the Poisson mean $\mu_i$ must be.
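If we really want the identity link, glm allows supplying starting values through its start argument; below is a minimal sketch (our own workaround, not part of the original analysis): starting with the sample mean as the intercept and zeros elsewhere makes every initial fitted mean positive, though the iterations may still leave the valid region and fail.

# Hypothetical workaround: intercept starts at mean(y) > 0, all other
# coefficients at 0, so every initial fitted mean is positive.
# Convergence is still not guaranteed with the identity link.
try(result1 <- glm(y ~ weight + factor(color) + factor(spine),
                   data = Crabs, family = poisson(link = "identity"),
                   start = c(mean(Crabs$y), rep(0, 6))))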
Case 2: assume that $g(\mu_i) = \log \mu_i = X_i^\top \beta$ ($\mu_i$ must always be positive, but after the transformation, $\log \mu_i$ has no support restrictions).
result4 <- glm(y ~ weight + factor(color) + factor(spine), data = Crabs, family = poisson())
summary(result4)
##
## Call:
## glm(formula = y ~ weight + factor(color) + factor(spine), family = poisson(),
## data = Crabs)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.04263 0.25354 -0.168 0.8665
## weight 0.54756 0.07318 7.482 7.31e-14 ***
## factor(color)2 -0.26768 0.16781 -1.595 0.1107
## factor(color)3 -0.52087 0.19414 -2.683 0.0073 **
## factor(color)4 -0.53966 0.22525 -2.396 0.0166 *
## factor(spine)2 -0.16066 0.21146 -0.760 0.4474
## factor(spine)3 0.09086 0.11948 0.760 0.4470
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 549.70 on 166 degrees of freedom
## AIC: 919
##
## Number of Fisher Scoring iterations: 6
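Because the link is the log, each coefficient acts multiplicatively on the mean: a one-unit increase in weight multiplies the expected count by $\exp(0.548) \approx 1.73$, holding the other covariates fixed. A quick way to view all effects on the multiplicative scale:

# Multiplicative effects on the mean count E(y) = exp(X'beta)
exp(coef(result4))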
Should we trust the results here? We will discuss model diagnosis for the above Poisson regression later.
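One quick heuristic we can apply already (a sketch; formal diagnostics come later): for a well-specified Poisson model, the residual deviance should be roughly comparable to its degrees of freedom, so a ratio far above 1 hints at overdispersion.

# Residual deviance / residual df; here about 549.7/166, i.e. above 3,
# suggesting the counts are more variable than Poisson allows
result4$deviance / result4$df.residual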
Case 3: a comparison between assuming "Gaussian" noise vs. "Poisson" noise in the linear model.
If we insist on using the identity link, it is feasible if we only include the color variable. (This is only for illustration purposes; we may not want to use this model in real data analysis, as weight seems to be more important.)
result2 <- glm(y ~ factor(color), data = Crabs, family = poisson(link="identity"))
summary(result2)
##
## Call:
## glm(formula = y ~ factor(color), family = poisson(link = "identity"),
## data = Crabs)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.0833 0.5833 7.000 2.56e-12 ***
## factor(color)2 -0.7886 0.6123 -1.288 0.19780
## factor(color)3 -1.8561 0.6252 -2.969 0.00299 **
## factor(color)4 -2.0379 0.6582 -3.096 0.00196 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 609.14 on 169 degrees of freedom
## AIC: 972.44
##
## Number of Fisher Scoring iterations: 3
In the above model we assumed Poisson noise for the response, so the noise model is $\text{Var}(y_i) = E(y_i) = \mu_i$. The mean relationship with $X_i$ is $\mu_i = X_i^\top \beta$. We now compare with the results from the linear model, which assumes the same mean relationship but homoscedastic noise, $\text{Var}(y_i) = \sigma^2$.
result3 <- lm(y ~ factor(color), data = Crabs)
summary(result3)
##
## Call:
## lm(formula = y ~ factor(color), data = Crabs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0833 -2.2273 -0.2947 1.7053 11.7053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0833 0.8985 4.544 1.05e-05 ***
## factor(color)2 -0.7886 0.9536 -0.827 0.4094
## factor(color)3 -1.8561 1.0137 -1.831 0.0689 .
## factor(color)4 -2.0379 1.1170 -1.824 0.0699 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.113 on 169 degrees of freedom
## Multiple R-squared: 0.0396, Adjusted R-squared: 0.02256
## F-statistic: 2.323 on 3 and 169 DF, p-value: 0.07685
Question: we obtain the same point estimates of the coefficients, but the standard errors from the Poisson linear model are much smaller than those from the Gaussian linear model. Why? Which do you think is more trustworthy?
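One way to probe which standard errors to trust (a sketch, assuming the sandwich and lmtest packages are installed): heteroscedasticity-robust (sandwich) standard errors remain valid when the variance model is misspecified, for either fit of the same mean relationship.

# Sandwich standard errors do not rely on the homoscedastic Gaussian
# assumption or on the Poisson mean-variance relationship
library(sandwich)
library(lmtest)
coeftest(result2, vcov = vcovHC(result2, type = "HC0"))  # Poisson fit
coeftest(result3, vcov = vcovHC(result3, type = "HC0"))  # Gaussian fit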
Say we believe the result that the coefficient $\beta$ for weight is significantly nonzero. Can we describe this finding in words? How is it related to the scientific question "why do some females mate multiply"? Is the weight of the female a cause of attracting more satellite males?

You can read the original paper, "Uncounted Votes: Does Voting Equipment Matter?", for the background of the problem. You can learn more about the equipment concern in the 2000 election from this Scientific American article. Basically, people were concerned about possible bias in the voting results introduced by the voting machinery.
We analyze a small dataset of the voting results for the 2000 United States Presidential election in Georgia.
# install.packages("faraway")
data(gavote, package = "faraway")
gavote
county | equip <fct> | econ <fct> | perAA <dbl> | rural <fct> | atlanta <fct> | gore <int> | bush <int> | other <int> | votes <int> |
---|---|---|---|---|---|---|---|---|---|
APPLING | LEVER | poor | 0.182 | rural | notAtlanta | 2093 | 3940 | 66 | 6099 | |
ATKINSON | LEVER | poor | 0.230 | rural | notAtlanta | 821 | 1228 | 22 | 2071 | |
BACON | LEVER | poor | 0.131 | rural | notAtlanta | 956 | 2010 | 29 | 2995 | |
BAKER | OS-CC | poor | 0.476 | rural | notAtlanta | 893 | 615 | 11 | 1519 | |
BALDWIN | LEVER | middle | 0.359 | rural | notAtlanta | 5893 | 6041 | 192 | 12126 | |
BANKS | LEVER | middle | 0.024 | rural | notAtlanta | 1220 | 3202 | 111 | 4533 | |
BARROW | OS-CC | middle | 0.079 | urban | notAtlanta | 3657 | 7925 | 520 | 12102 | |
BARTOW | OS-PC | middle | 0.079 | urban | Atlanta | 7508 | 14720 | 552 | 22780 | |
BEN.HILL | OS-PC | poor | 0.282 | rural | notAtlanta | 2234 | 2381 | 46 | 4661 | |
BERRIEN | OS-CC | poor | 0.107 | rural | notAtlanta | 1640 | 2718 | 52 | 4410 |
We are interested in understanding whether the voting machinery affects the undercount (besides the columns shown above, gavote also contains a ballots column with the total number of ballots cast in each county):
# Undercount: proportion of ballots with no counted presidential vote
gavote$undercount <- (gavote$ballots - gavote$votes)/gavote$ballots
hist(gavote$undercount, breaks = 50)
rug(gavote$undercount)
We first look at the pairwise comparisons to get an impression of the data:
# Proportion of the counted votes cast for Gore
gavote$pergore <- gavote$gore/gavote$votes
ggpairs(gavote[, c("undercount", colnames(gavote)[1:5], "pergore")])
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We are interested in understanding the effect of equipment on the undercount. We now try a linear model including other control variables:
result <- lm(undercount ~ pergore + perAA + factor(rural) + factor(econ) + factor(atlanta) + factor(equip) + ballots, data = gavote)
#result <- lm(undercount ~ pergore + perAA + factor(rural) + factor(equip), data = gavote)
summary(result)
##
## Call:
## lm(formula = undercount ~ pergore + perAA + factor(rural) + factor(econ) +
## factor(atlanta) + factor(equip) + ballots, data = gavote)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.066426 -0.012006 -0.002564 0.009781 0.115446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.141e-02 1.616e-02 1.943 0.053928 .
## pergore 1.281e-02 4.438e-02 0.289 0.773299
## perAA -1.051e-02 2.994e-02 -0.351 0.726132
## factor(rural)urban -6.313e-03 5.069e-03 -1.245 0.214964
## factor(econ)poor 1.879e-02 4.720e-03 3.981 0.000108 ***
## factor(econ)rich -1.624e-02 7.886e-03 -2.059 0.041218 *
## factor(atlanta)notAtlanta -1.700e-03 9.777e-03 -0.174 0.862183
## factor(equip)OS-CC 8.756e-03 4.436e-03 1.974 0.050263 .
## factor(equip)OS-PC 2.127e-02 5.626e-03 3.780 0.000227 ***
## factor(equip)PAPER -1.063e-02 1.581e-02 -0.672 0.502672
## factor(equip)PUNCH 1.673e-02 6.527e-03 2.563 0.011376 *
## ballots -4.563e-08 6.607e-08 -0.691 0.490932
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02184 on 147 degrees of freedom
## Multiple R-squared: 0.2879, Adjusted R-squared: 0.2346
## F-statistic: 5.404 on 11 and 147 DF, p-value: 3.514e-07
result <- lm(undercount ~ factor(equip), data = gavote)
summary(result)
##
## Call:
## lm(formula = undercount ~ factor(equip), data = gavote)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.046179 -0.015147 -0.004215 0.012349 0.136557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.189e-02 2.910e-03 14.395 <2e-16 ***
## factor(equip)OS-CC 9.049e-05 4.766e-03 0.019 0.985
## factor(equip)OS-PC 9.670e-03 6.080e-03 1.591 0.114
## factor(equip)PAPER -1.647e-03 1.794e-02 -0.092 0.927
## factor(equip)PUNCH 5.200e-03 6.734e-03 0.772 0.441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02504 on 154 degrees of freedom
## Multiple R-squared: 0.0198, Adjusted R-squared: -0.005662
## F-statistic: 0.7776 on 4 and 154 DF, p-value: 0.5413
Should we include the other control variables or not? (Are we adjusting for confounding variables?)
plot(result)
The above diagnostic plots are not very informative, as there are too few unique fitted values. Notice also that each county has a different total number of ballots, so it may not be suitable to assume equal variance across samples.
hist(gavote$votes, breaks = 100)
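A simple way to visualize the unequal variability (a sketch): plot the undercount proportion against county size; if the variation is driven by binomial sampling, the spread should shrink as the number of ballots grows.

# Undercount proportion vs total ballots per county
plot(gavote$ballots, gavote$undercount,
     xlab = "Total ballots", ylab = "Undercount proportion")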
As the undercounts are proportions, denote $y_i$ as the number of uncounted ballots and $n_i$ as the total number of ballots in county $i$. We assume $y_i \sim \text{Binomial}(n_i, p_i)$, with the default logit link relating $p_i$ to the covariates: $\log\frac{p_i}{1 - p_i} = X_i^\top \beta$.
This model naturally accounts for the unequal variances of the responses brought by the differences in the total number of ballots: $\text{Var}(y_i) = n_i p_i (1 - p_i)$.
Let’s first consider the simple GLM without the other control covariates:
# Number of uncounted ballots in each county
gavote$undercountNumber <- gavote$ballots - gavote$votes
# Binomial response via cbind(successes, failures) = (uncounted, counted)
result.glm <- glm(cbind(undercountNumber, votes) ~ factor(equip), data = gavote, family = "binomial")
summary(result.glm)
##
## Call:
## glm(formula = cbind(undercountNumber, votes) ~ factor(equip),
## family = "binomial", data = gavote)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.183865 0.007823 -406.976 <2e-16 ***
## factor(equip)OS-CC -0.140390 0.010423 -13.470 <2e-16 ***
## factor(equip)OS-PC -0.640942 0.010975 -58.400 <2e-16 ***
## factor(equip)PAPER -0.202773 0.095969 -2.113 0.0346 *
## factor(equip)PUNCH 0.167267 0.009406 17.783 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 36829 on 158 degrees of freedom
## Residual deviance: 28379 on 154 degrees of freedom
## AIC: 29559
##
## Number of Fisher Scoring iterations: 5
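As a side note, the same binomial GLM can equivalently be specified with the observed proportion as the response and the number of ballots as prior weights (a sketch; result.glm.alt is our own name, and the fit should reproduce the coefficients above):

# Equivalent specification: proportion response with the trial counts
# supplied as weights (undercount * ballots = undercountNumber)
result.glm.alt <- glm(undercount ~ factor(equip), weights = ballots,
                      data = gavote, family = "binomial")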
Compared with the linear model results, the p-values are much smaller. Should we trust them more than those from the linear model?
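One standard check before deciding (a sketch): the residual deviance (28379) is far larger than its degrees of freedom (154), a sign of overdispersion; a quasibinomial fit keeps the same coefficients but scales the standard errors by an estimated dispersion parameter.

# Quasi-likelihood fit: same point estimates, inflated standard errors
result.quasi <- glm(cbind(undercountNumber, votes) ~ factor(equip),
                    data = gavote, family = "quasibinomial")
summary(result.quasi)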
Now let’s add all the control variables:
result.glm <- glm(cbind(undercountNumber, votes) ~ pergore + perAA + factor(rural) + factor(econ) + factor(atlanta) + factor(equip) + ballots, data = gavote, family = "binomial")
summary(result.glm)
##
## Call:
## glm(formula = cbind(undercountNumber, votes) ~ pergore + perAA +
## factor(rural) + factor(econ) + factor(atlanta) + factor(equip) +
## ballots, family = "binomial", data = gavote)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.835e+00 3.617e-02 -78.379 < 2e-16 ***
## pergore -1.782e+00 9.596e-02 -18.573 < 2e-16 ***
## perAA 8.681e-01 7.171e-02 12.104 < 2e-16 ***
## factor(rural)urban -1.907e-01 1.056e-02 -18.051 < 2e-16 ***
## factor(econ)poor 3.453e-01 1.205e-02 28.650 < 2e-16 ***
## factor(econ)rich -9.215e-01 1.701e-02 -54.178 < 2e-16 ***
## factor(atlanta)notAtlanta 3.334e-02 1.615e-02 2.065 0.0389 *
## factor(equip)OS-CC 2.308e-01 1.149e-02 20.083 < 2e-16 ***
## factor(equip)OS-PC 7.814e-02 1.288e-02 6.067 1.31e-09 ***
## factor(equip)PAPER -3.818e-01 9.607e-02 -3.974 7.06e-05 ***
## factor(equip)PUNCH 7.325e-01 1.378e-02 53.175 < 2e-16 ***
## ballots 1.160e-07 6.523e-08 1.779 0.0753 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 36829 on 158 degrees of freedom
## Residual deviance: 15519 on 147 degrees of freedom
## AIC: 16712
##
## Number of Fisher Scoring iterations: 4
Why do all the p-values become much smaller? We will return to this issue when we discuss GLMs for binary data.