1. Revisit the horseshoe data in Example 1

For case 3 in Section 1.3 of the first R example, we compared Poisson GLM and Gaussian linear model using the same identity link.

Crabs <- read.table("Crabs.dat", header = T)
Crabs
ABCDEFGHIJ0123456789
crab
<int>
y
<int>
weight
<dbl>
width
<dbl>
color
<int>
spine
<int>
183.05028.323
201.55022.533
392.30026.011
402.10024.833
542.60026.033
602.10023.823
702.35026.511
801.90024.732
901.95023.721
1002.15025.633
Crabs$color <- factor(Crabs$color)

result.Poisson <- glm(y ~ factor(color), data = Crabs, family = poisson(link="identity"))
result.linear <- glm(y ~ factor(color), data = Crabs, family = gaussian(link="identity"))

Both models might be problematic. Now we perform some model diagnosis plotting the residuals.

library(ggplot2)
df <- data.frame(x = predict(result.Poisson), y = abs(residuals(result.Poisson, type = "pearson")))
ggplot(df, aes(x = x, y= y)) +
  # geom_jitter is basically geom_point + position_jitter
  geom_jitter(width = 0.02,  # Controls horizontal jitter
              height = 0.02, # Controls vertical jitter
              alpha = 0.7,  # Make points partially transparent
              color = "blue") +
  labs(title = "Poisson regression",
    x = "Predicted values", y = "|Pearson residual|") +
  theme_minimal()

library(ggplot2)

# we need a correction here as the R function only computes pearson and deviance residuals with dispersion = 1

df <- data.frame(x = predict(result.linear), 
                 y = abs(residuals(result.linear, type = "pearson")/
                           sqrt(summary(result.linear)$dispersion)))
ggplot(df, aes(x = x, y= y)) +
  # geom_jitter is basically geom_point + position_jitter
  geom_jitter(width = 0.02,  # Controls horizontal jitter
              height = 0.02, # Controls vertical jitter
              alpha = 0.7,  # Make points partially transparent
              color = "blue") +
  labs(title = "Gaussian linear regression",
    x = "Predicted values", y = "|Pearson residuals|") +
  theme_minimal()

library(ggplot2)
df <- data.frame(x = predict(result.Poisson), y = abs(residuals(result.Poisson, type = "deviance")))
ggplot(df, aes(x = x, y= y)) +
  # geom_jitter is basically geom_point + position_jitter
  geom_jitter(width = 0.02,  # Controls horizontal jitter
              height = 0.02, # Controls vertical jitter
              alpha = 0.7,  # Make points partially transparent
              color = "blue") +
  labs(title = "Poisson regression",
    x = "Predicted values", y = "|Deviance residual|") +
  theme_minimal()

2. Housing price data

We use the same dataset as Agresti Chapter 4.7. The question here is to understand the question: what affects the selling price of a house? This dataset contains 100 house selling records in Gainesville, Florida.

Houses <- read.table("Houses.dat", header = T)
Houses$new <- factor(Houses$new)
Houses
ABCDEFGHIJ0123456789
case
<int>
taxes
<int>
beds
<int>
baths
<int>
new
<fct>
price
<dbl>
size
<int>
13104420279.92048
21173210146.5912
33076420237.71654
41608320200.02068
51454330159.91477
62997321499.93153
74054320265.51355
83002321289.92075
96627540587.03990
1032032070.01160

2.1 Some data exploration

Pair-wise scattor plots

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(Houses[, -1])
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2.2 Using a linear model

We include all covariates, and can decide whether we also want to include interactions

We can compare the model with and without interactions.

fit.1 <- lm(price ~ size + new + baths + beds + taxes, data = Houses)
fit.2 <- lm(price ~ (size + new + beds + baths + taxes)^2, data = Houses)
anova(fit.1, fit.2, test = "Chisq")
ABCDEFGHIJ0123456789
 
 
Res.Df
<dbl>
RSS
<dbl>
Df
<dbl>
Sum of Sq
<dbl>
Pr(>Chi)
<dbl>
194209754.3NANANA
284113596.01096158.272.711037e-11

If we use the glm function, the default is also to use the Gaussian linear model, though the ANOVA table column names are different.

fit.1 <- glm(price ~ size + new + baths + beds + taxes, data = Houses)
fit.2 <- glm(price ~ (size + new + baths + beds + taxes)^2, data = Houses)
anova(fit.1, fit.2, test = "Chisq")
ABCDEFGHIJ0123456789
 
 
Resid. Df
<dbl>
Resid. Dev
<dbl>
Df
<dbl>
Deviance
<dbl>
Pr(>Chi)
<dbl>
194209754.3NANANA
284113596.01096158.272.711037e-11

One thing to notice is that the deviance and residual deviance reported here are values with σ2=1. The computation of p-values takes into account the estimation of σ2.

We can even compare with a model adding third order interactions.

fit.3 <- glm(price ~ (size + new + beds + baths + taxes)^3, data = Houses)
anova(fit.2, fit.3, test = "Chisq")
ABCDEFGHIJ0123456789
 
 
Resid. Df
<dbl>
Resid. Dev
<dbl>
Df
<dbl>
Deviance
<dbl>
Pr(>Chi)
<dbl>
184113596.01NANANA
27691903.28821692.740.0216889

We see that the third-order interactions also seems help.

2.2.1 Model checking for the linear models

All the above p-values are valid only when the equal variance assumption holds. So before we further select a good linear model, we should check whether the residuals look good.

par(mfrow=c(1,3), mar = c(4, 4, 6, 4)) 
## we plot 
plot(fit.1, which = 3, main = "main effect model")
plot(fit.2, which = 3, main = "second-order interations")
plot(fit.3, which = 3, main = "third-order interactions")
## Warning: not plotting observations with leverage one:
##   22, 35, 68, 84, 92

We can see that for all three models, the residual variances tend to increase with the fitted values. This suggests that the Gaussian linear model assuming equal variance of the noise is not proper.

This suggests that we would like to change to a GLM model.

2.3 Using a Gamma GLM

2.3.1 Gamma distribution

The density function is f(y;k,μ)=(k/μ)kΓ(k)eky/μyk1

Now we can look at the ANOVA table (in GLM, it is the deviance table) to obtain a simplier model.

with E(y)=μ and Var(y)=μ2/k. The canonical parameter is θ=1/μ and the dispersion function is a(ϕ)=1/k.

2.3.2 Applying the Gamma GLM to our dataset

As a housing price is always positive, we can choose to use a Gamma GLM model, which allows the standard deviations of the noise to increase with μi.

If we use the same link as the linear model, we see that the residual variance trend using the Gamma GLM is much better, though there seems to be a bit over correction.

fit.gamma1 <- glm(price ~ size + new + beds + baths + taxes, family = Gamma(link = identity), data = Houses)
fit.gamma2 <- glm(price ~ (size + new + beds + baths + taxes)^2, family = Gamma(link = identity), data = Houses)
fit.gamma3 <- glm(price ~ (size + new + beds + baths + taxes)^3, family = Gamma(link = identity), data = Houses)
## we plot deviance against predicted values
par(mfrow=c(1,3), mar = c(4, 4, 6, 4)) 
plot(residuals(fit.gamma1)/sqrt(summary(fit.gamma1)$dispersion)~predict(fit.gamma1), 
     main = "main effect model", xlab = "predicted response", ylab = "deviance residual")
plot(residuals(fit.gamma2)/sqrt(summary(fit.gamma2)$dispersion)~predict(fit.gamma2), 
     main = "second-order interations", xlab = "predicted response", ylab = "deviance residual")
plot(residuals(fit.gamma3)/sqrt(summary(fit.gamma3)$dispersion)~predict(fit.gamma3), 
     main = "third-order interactions", xlab = "predicted response", ylab = "deviance residual")

What if we use the canonical link for the Gamma? (Inverse link: 1μ=XTβ)

fit.gamma.canonical <- glm(price ~ size + new + beds + baths + taxes, family = Gamma(), data = Houses)

plot(residuals(fit.gamma.canonical)/sqrt(summary(fit.gamma.canonical)$dispersion)~predict(fit.gamma.canonical, type = "response"),
     main = "main effect model", xlab = expression(hat(mu)), ylab = "deviance residual")

## we can plot the linear link if we need the points can be more evenly spreaded across x axis (no change here as we use identity link)
plot(residuals(fit.gamma.canonical)/sqrt(summary(fit.gamma.canonical)$dispersion)~predict(fit.gamma.canonical, type = "link"), 
     main = "main effect model", xlab = expression(hat(eta)), ylab = "deviance residual")

We do see non-random trend suggesting that the link function is not appropriate.

2.3.3 Building a good Gamma GLM model

How to build a good Gamma GLM model for predicting the housing price? (You can read Agresti Chapter 4.7.1 for how a linear model is chosen). Here, “good” means that we want to find a model that can fit the data well while being as simple as possible to avoid unnessary uncertainties.

The following steps to pick a good model are a bit at-hoc. I would just like to provide a simple guidance and it’s fine if you have a different opinion.

anova(fit.gamma1, fit.gamma2, test = "LRT")
ABCDEFGHIJ0123456789
 
 
Resid. Df
<dbl>
Resid. Dev
<dbl>
Df
<dbl>
Deviance
<dbl>
Pr(>Chi)
<dbl>
1946.509248NANANA
2845.550293100.95895460.1443698

What we find out is that the interaction terms may not be really necessary when we use a Gamma GLM.

Now we compute the deviance analysis table for the main effect model and see if some covariates are not that important.

anova(fit.gamma1,  test = "LRT")
ABCDEFGHIJ0123456789
 
 
Df
<int>
Deviance
<dbl>
Resid. Df
<int>
Resid. Dev
<dbl>
Pr(>Chi)
<dbl>
NULLNANA9931.940111NA
size120.65673119811.2833801.063933e-67
new10.34830469710.9350762.397478e-02
beds10.35422089610.5808552.280860e-02
baths10.13913559510.4417191.536277e-01
taxes13.9324718946.5092483.310670e-14
anova(fit.gamma1,  test = "Chisq")
ABCDEFGHIJ0123456789
 
 
Df
<int>
Deviance
<dbl>
Resid. Df
<int>
Resid. Dev
<dbl>
Pr(>Chi)
<dbl>
NULLNANA9931.940111NA
size120.65673119811.2833801.063933e-67
new10.34830469710.9350762.397478e-02
beds10.35422089610.5808552.280860e-02
baths10.13913559510.4417191.536277e-01
taxes13.9324718946.5092483.310670e-14

The result suggests that the baths covariate may also be unnecessary to include.

fit.gamma4 <- glm(price ~ size + new + beds + taxes, family = Gamma(link = identity), data = Houses)
anova(fit.gamma4,  test = "LRT")
ABCDEFGHIJ0123456789
 
 
Df
<int>
Deviance
<dbl>
Resid. Df
<int>
Resid. Dev
<dbl>
Pr(>Chi)
<dbl>
NULLNANA9931.940111NA
size120.65673119811.2833801.140683e-67
new10.34830469710.9350762.400716e-02
beds10.35422089610.5808552.283987e-02
taxes13.9279903956.6528653.469166e-14
par(mfrow=c(2,2)) 
plot(fit.gamma4)

summary(fit.gamma4)
## 
## Call:
## glm(formula = price ~ size + new + beds + taxes, family = Gamma(link = identity), 
##     data = Houses)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  23.829561  13.540229   1.760  0.08164 .  
## size          0.066093   0.012128   5.450 3.97e-07 ***
## new1         22.494342  19.256820   1.168  0.24568    
## beds        -17.031149   6.339464  -2.687  0.00852 ** 
## taxes         0.037877   0.005124   7.393 5.61e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.06837392)
## 
##     Null deviance: 31.9401  on 99  degrees of freedom
## Residual deviance:  6.6529  on 95  degrees of freedom
## AIC: 1003.1
## 
## Number of Fisher Scoring iterations: 6

The deviance computed from GLM still is not devided by a(ϕ), which is estimated as a(ˆϕ)=0.0684. For example, the deviance for covariate “new” in our deviance table is 0.3483/0.0684=5.1, resulting in a p-value of 0.024 after comparing with χ21.