1. Data

We use the same dataset as Agresti, Chapter 4.7. The question we want to answer is: what affects the selling price of a house? The dataset contains house selling records from Gainesville, Florida.

Houses <- read.table("Houses.dat", header = TRUE)
Houses

case <int>	taxes <int>	beds <int>	baths <int>	new <int>	price <dbl>	size <int>
1	3104	4	2	0	279.9	2048
2	1173	2	1	0	146.5	912
3	3076	4	2	0	237.7	1654
4	1608	3	2	0	200.0	2068
5	1454	3	3	0	159.9	1477
6	2997	3	2	1	499.9	3153
7	4054	3	2	0	265.5	1355
8	3002	3	2	1	289.9	2075
9	6627	5	4	0	587.0	3990
10	320	3	2	0	70.0	1160

price: selling price of the house, in thousands of dollars.
size: size of the house, in square feet.
taxes: annual property tax.
beds: number of bedrooms.
baths: number of bathrooms.
new: whether the house is a new construction (1) or not (0).

1.1 Some data exploration

Pair-wise scatter plots

pairs(Houses[, -1])

Pair-wise correlations

cor(Houses[, -1])

##           taxes       beds     baths        new     price      size
## taxes 1.0000000 0.47392873 0.5948543 0.38087410 0.8419802 0.8187958
## beds  0.4739287 1.00000000 0.4922224 0.04931556 0.3939570 0.5447831
## baths 0.5948543 0.49222235 1.0000000 0.25148095 0.5582533 0.6582247
## new   0.3808741 0.04931556 0.2514810 1.00000000 0.4732608 0.3843277
## price 0.8419802 0.39395702 0.5582533 0.47326080 1.0000000 0.8337848
## size  0.8187958 0.54478311 0.6582247 0.38432773 0.8337848 1.0000000

2. Using a linear model

We include all covariates, and can decide whether we also want to include interactions by comparing the models with and without them.

fit.1 <- lm(price ~ size + new + baths + beds + taxes, data = Houses)
fit.2 <- lm(price ~ (size + new + beds + baths + taxes)^2, data = Houses)
anova(fit.1, fit.2, test = "Chisq")

	Res.Df <dbl>	RSS <dbl>	Df <dbl>	Sum of Sq <dbl>	Pr(>Chi) <dbl>
1	94	209754.3	NA	NA	NA
2	84	113596.0	10	96158.27	2.711037e-11

If we use the glm function, the default is also to use the Gaussian linear model, though the ANOVA table column names are different.

fit.1 <- glm(price ~ size + new + baths + beds + taxes, data = Houses)
fit.2 <- glm(price ~ (size + new + baths + beds + taxes)^2, data = Houses)
anova(fit.1, fit.2, test = "Chisq")

	Resid. Df <dbl>	Resid. Dev <dbl>	Df <dbl>	Deviance <dbl>	Pr(>Chi) <dbl>
1	94	209754.3	NA	NA	NA
2	84	113596.0	10	96158.27	2.711037e-11

One thing to notice is that the deviance and residual deviance reported here are the values with $\phi = 1$, that is, they are not divided by the dispersion $\phi = \sigma^2$. The computation of the p-values does take into account the estimation of $\phi$.
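To make the scaling concrete, the p-value in the table above can be reproduced by hand from the printed numbers. This is a minimal sketch, assuming that `anova()` divides the change in deviance by the noise variance estimated from the larger model (its residual deviance over its residual degrees of freedom) before referring it to a chi-squared distribution. The variable names are illustrative only.

```r
# Numbers copied from the table above (main-effect model vs. second-order model).
rss_small <- 209754.3   # residual deviance (RSS) of the main-effect model
rss_big   <- 113596.0   # residual deviance (RSS) of the second-order model
df_big    <- 84         # residual degrees of freedom of the larger model
df_diff   <- 10         # number of interaction terms added

# Noise variance estimated from the larger model (assumed scaling, see lead-in).
sigma2_hat <- rss_big / df_big

# Scaled change in deviance, compared with a chi-squared distribution.
chisq_stat <- (rss_small - rss_big) / sigma2_hat
p_value <- pchisq(chisq_stat, df = df_diff, lower.tail = FALSE)
print(c(statistic = chisq_stat, p.value = p_value))
```

The printed p-value should agree with the `Pr(>Chi)` entry of the table up to rounding of the inputs.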

We can also compare with a model that adds third-order interactions.

fit.3 <- glm(price ~ (size + new + beds + baths + taxes)^3, data = Houses)
anova(fit.2, fit.3, test = "Chisq")

	Resid. Df <dbl>	Resid. Dev <dbl>	Df <dbl>	Deviance <dbl>	Pr(>Chi) <dbl>
1	84	113596.01	NA	NA	NA
2	76	91903.28	8	21692.74	0.0216889

We see that the third-order interactions also seem to help.

2.1 Model checking for the linear models

All the above p-values are valid only when the equal-variance assumption holds. So before we go further in selecting a linear model, we should check whether the residuals show signs of unequal variance.

par(mfrow=c(1,3), mar = c(4, 4, 6, 4)) 
plot(fit.1, which = 3, main = "main effect model")
plot(fit.2, which = 3, main = "second-order interactions")
plot(fit.3, which = 3, main = "third-order interactions")

## Warning: not plotting observations with leverage one:
##   22, 35, 68, 84, 92

We can see that for all three models, the residual variance tends to increase with the fitted values. This suggests that the Gaussian linear model, which assumes equal variance of the noise, is not appropriate here.

We would therefore like to switch to a different GLM.
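The pattern read off the scale-location plots can also be checked numerically. The sketch below uses simulated data (not the Houses data) in which the noise standard deviation is proportional to the mean, and computes the quantity shown by `plot(fit, which = 3)`: the square root of the absolute standardized residuals against the fitted values. All names and parameter values are assumptions chosen for illustration.

```r
set.seed(1)
# Simulated data: the noise standard deviation grows with the mean.
n <- 2000
x <- runif(n, 1, 10)
mu <- 5 + 3 * x
y <- rnorm(n, mean = mu, sd = 0.3 * mu)

fit <- lm(y ~ x)

# Quantity drawn by plot(fit, which = 3): sqrt(|standardized residuals|).
scale_loc <- sqrt(abs(rstandard(fit)))

# A clearly positive correlation with the fitted values signals increasing variance.
trend <- cor(fitted(fit), scale_loc)
print(trend)
```

A correlation near zero would be consistent with equal variance; a clearly positive one matches the upward trend seen in the plots.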

3. Using a Gamma GLM

3.1 Gamma distribution (Chapter 4.7.2)

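The density function is

$$f(y; k, \mu) = \frac{(k/\mu)^k}{\Gamma(k)} \, e^{-ky/\mu} \, y^{k-1}, \qquad y \ge 0,$$

with $E(Y) = \mu$ and $\mathrm{var}(Y) = \mu^2/k$. The canonical parameter is $\theta = -1/\mu$ and the dispersion parameter is $\phi = 1/k$.

Once a Gamma GLM is fitted, we can look at the ANOVA table (for a GLM, it is the deviance table) to obtain a simpler model.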
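As a quick check of the parameterization, the sketch below assumes the mean/shape form of the Gamma density, $f(y; k, \mu) = (k/\mu)^k \, e^{-ky/\mu} \, y^{k-1} / \Gamma(k)$, and compares it with R's built-in `dgamma()`, which is parameterized by shape and rate. The helper `gamma_density()` and the values of `mu` and `k` are hypothetical and only serve the illustration.

```r
# Gamma density in the mean/shape parameterization (mu = mean, k = shape).
gamma_density <- function(y, mu, k) {
  (k / mu)^k / gamma(k) * exp(-k * y / mu) * y^(k - 1)
}

mu <- 200   # illustrative mean, not an estimate from the data
k  <- 15    # illustrative shape, so the dispersion is 1 / k
y  <- c(50, 150, 200, 400)

# dgamma() uses shape and rate; rate = k / mu gives the same density.
manual  <- gamma_density(y, mu, k)
builtin <- dgamma(y, shape = k, rate = k / mu)
print(all.equal(manual, builtin))

# Simulation check of the moments: mean mu and variance mu^2 / k.
set.seed(1)
draws <- rgamma(1e5, shape = k, rate = k / mu)
print(c(mean = mean(draws), var = var(draws), mu2_over_k = mu^2 / k))
```

Because the variance is $\mu^2/k$, the standard deviation is $\mu/\sqrt{k}$, which is proportional to the mean.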

3.2 Applying the Gamma GLM to our dataset

As a housing price is always positive, we can choose to use a Gamma GLM, which allows the standard deviation of the noise to increase with the mean $\mu$.

If we use the same link as the linear model (the identity link), we see that the trend in the residual variance under the Gamma GLM is much better, though there seems to be a bit of overcorrection.

fit.gamma1 <- glm(price ~ size + new + beds + baths + taxes, family = Gamma(link = identity), data = Houses)
fit.gamma2 <- glm(price ~ (size + new + beds + baths + taxes)^2, family = Gamma(link = identity), data = Houses)
fit.gamma3 <- glm(price ~ (size + new + beds + baths + taxes)^3, family = Gamma(link = identity), data = Houses)
par(mfrow=c(1,3), mar = c(4, 4, 6, 4)) 
plot(fit.gamma1, which = 3, main = "main effect model")
plot(fit.gamma2, which = 3, main = "second-order interactions")
plot(fit.gamma3, which = 3, main = "third-order interactions")

## Warning: not plotting observations with leverage one:
##   22, 35, 68, 84, 92
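To see what the Gamma GLM with identity link assumes, the following sketch fits it to simulated data (not the Houses data) where the truth is known. The mean is linear in a single covariate and the Gamma shape is fixed, so the estimated dispersion should be close to one over the shape. All names and values are assumptions made for illustration.

```r
set.seed(1)
# Simulated data; the variable names mimic the Houses data but the values are made up.
n <- 2000
size <- runif(n, 800, 4000)
mu <- 20 + 0.08 * size            # identity link: the mean is linear in size
k <- 15                           # Gamma shape, so the dispersion is 1 / k
price <- rgamma(n, shape = k, rate = k / mu)

fit <- glm(price ~ size, family = Gamma(link = "identity"))

print(coef(fit))                  # should be near 20 and 0.08
print(summary(fit)$dispersion)    # should be near 1 / k

# Under the Gamma model the standard deviation is mu / sqrt(k), so it grows with the mean.
print(c(sd_low = sd(price[mu < 150]), sd_high = sd(price[mu > 250])))
```

The last line shows the spread of the simulated prices for small and large means; under this model the second value is larger, which is the behaviour the equal-variance linear model cannot capture.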

3.3 Building a good Gamma GLM model

How do we build a good Gamma GLM for predicting the housing price? (You can read Agresti, Chapter 4.7.1, for how a linear model is chosen.) Here, “good” means that we want a model that fits the data well while being as simple as possible, to avoid unnecessary uncertainty.

The following steps for picking a good model are a bit ad hoc. I would just like to provide some simple guidance, and it’s fine if you have a different opinion.

anova(fit.gamma1, fit.gamma2, test = "LRT")

	Resid. Df <dbl>	Resid. Dev <dbl>	Df <dbl>	Deviance <dbl>	Pr(>Chi) <dbl>
1	94	6.509248	NA	NA	NA
2	84	5.550293	10	0.9589546	0.1443698

With a p-value of about 0.14, we find that the interaction terms may not really be necessary when we use a Gamma GLM.

Now we compute the deviance analysis table for the main effect model and see if some covariates are not that important.

anova(fit.gamma1, test = "LRT")

	Df <int>	Deviance <dbl>	Resid. Df <int>	Resid. Dev <dbl>	Pr(>Chi) <dbl>
NULL	NA	NA	99	31.940111	NA
size	1	20.6567311	98	11.283380	1.063933e-67
new	1	0.3483046	97	10.935076	2.397478e-02
beds	1	0.3542208	96	10.580855	2.280860e-02
baths	1	0.1391355	95	10.441719	1.536277e-01
taxes	1	3.9324718	94	6.509248	3.310670e-14

The result suggests that the baths covariate (p-value of about 0.15) may also be unnecessary, so we refit the model without it.

fit.gamma4 <- glm(price ~ size + new + beds + taxes, family = Gamma(link = identity), data = Houses)
anova(fit.gamma4, test = "LRT")

	Df <int>	Deviance <dbl>	Resid. Df <int>	Resid. Dev <dbl>	Pr(>Chi) <dbl>
NULL	NA	NA	99	31.940111	NA
size	1	20.6567311	98	11.283380	1.140683e-67
new	1	0.3483046	97	10.935076	2.400716e-02
beds	1	0.3542208	96	10.580855	2.283987e-02
taxes	1	3.9279903	95	6.652865	3.469166e-14

par(mfrow=c(2,2)) 
plot(fit.gamma4)

summary(fit.gamma4)

## 
## Call:
## glm(formula = price ~ size + new + beds + taxes, family = Gamma(link = identity), 
##     data = Houses)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.82167  -0.18460  -0.02599   0.14511   0.65720  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  23.829561  13.540229   1.760  0.08164 .  
## size          0.066093   0.012128   5.450 3.97e-07 ***
## new          22.494342  19.256820   1.168  0.24568    
## beds        -17.031149   6.339464  -2.687  0.00852 ** 
## taxes         0.037877   0.005124   7.393 5.61e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.06837392)
## 
##     Null deviance: 31.9401  on 99  degrees of freedom
## Residual deviance:  6.6529  on 95  degrees of freedom
## AIC: 1003.1
## 
## Number of Fisher Scoring iterations: 6

The deviance computed by glm is still not divided by the dispersion $\phi$, which is estimated as $\hat\phi = 0.0684$. For example, the deviance for the covariate “new” in our deviance table is $0.3483$, resulting in a p-value of $0.0240$ after comparing $0.3483 / 0.0684 \approx 5.09$ with $\chi^2_1$.
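The scaling can be reproduced directly from the printed output. This is a minimal sketch that takes the deviance for `new` from the deviance table of `fit.gamma4` and the dispersion estimate from its summary, and assumes the p-value is obtained by dividing the former by the latter and comparing with a chi-squared distribution on one degree of freedom. The variable names are illustrative.

```r
# Numbers copied from the deviance table and summary of fit.gamma4 above.
dev_new <- 0.3483046      # deviance explained by adding `new`
phi_hat <- 0.06837392     # estimated dispersion parameter

# Scale the deviance by the estimated dispersion, then compare with chi-squared(1).
scaled_dev <- dev_new / phi_hat
p_new <- pchisq(scaled_dev, df = 1, lower.tail = FALSE)
print(c(scaled.deviance = scaled_dev, p.value = p_new))
```

The printed p-value should agree with the `Pr(>Chi)` entry for `new` in the table, up to rounding of the inputs. The same calculation applies to the other rows with their own deviances.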

Example: Building a GLM
