Crabs <- read.table("Crabs.dat", header = T)
Crabs
Variables:
hist(Crabs$y, breaks = c(0:16) - 0.5)
Two modes, possibly we need a zero-inflated GLM model
In Data example 1, we used a Poisson model:
fit.pois <- glm(y ~ weight + factor(color) + factor(spine), data = Crabs, family = poisson())
summary(fit.pois)
##
## Call:
## glm(formula = y ~ weight + factor(color) + factor(spine), family = poisson(),
## data = Crabs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0433 -1.8609 -0.5944 0.9193 4.9483
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.04263 0.25354 -0.168 0.8665
## weight 0.54756 0.07318 7.482 7.31e-14 ***
## factor(color)2 -0.26768 0.16781 -1.595 0.1107
## factor(color)3 -0.52087 0.19414 -2.683 0.0073 **
## factor(color)4 -0.53966 0.22525 -2.396 0.0166 *
## factor(spine)2 -0.16066 0.21146 -0.760 0.4474
## factor(spine)3 0.09086 0.11948 0.760 0.4470
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 549.70 on 166 degrees of freedom
## AIC: 919
##
## Number of Fisher Scoring iterations: 6
plot(fit.pois, which = 3)
We see some sign of over-dispersion and the fit is quite poor.
We can try a ZINB model
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
fit.zinb <- zeroinfl(y ~ weight + factor(color) + factor(spine) | weight + factor(color) + factor(spine), dist = "negbin", data = Crabs)
summary(fit.zinb)
##
## Call:
## zeroinfl(formula = y ~ weight + factor(color) + factor(spine) | weight +
## factor(color) + factor(spine), data = Crabs, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.4662 -0.7573 -0.2847 0.5384 3.9855
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.06668 0.37345 2.856 0.00429 **
## weight 0.23167 0.11191 2.070 0.03844 *
## factor(color)2 -0.12300 0.24963 -0.493 0.62221
## factor(color)3 -0.28019 0.27637 -1.014 0.31068
## factor(color)4 0.38110 0.34082 1.118 0.26349
## factor(spine)2 -0.07365 0.28227 -0.261 0.79415
## factor(spine)3 -0.13142 0.16620 -0.791 0.42908
## Log(theta) 1.80490 0.38961 4.633 3.61e-06 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.0997 1.3893 2.231 0.025669 *
## weight -1.6968 0.4399 -3.858 0.000115 ***
## factor(color)2 0.1501 0.8945 0.168 0.866738
## factor(color)3 0.6331 0.9722 0.651 0.514934
## factor(color)4 1.9139 1.0441 1.833 0.066795 .
## factor(spine)2 0.2357 0.7340 0.321 0.748105
## factor(spine)3 -0.5560 0.5619 -0.989 0.322464
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 6.0794
## Number of iterations in BFGS optimization: 25
## Log-likelihood: -344.6 on 15 Df
mu.pois <- predict(fit.pois, type = "response")
mu.zinb <- predict(fit.zinb, type = "response")
plot(data.frame(mu.pois, Crabs$y, mu.zinb), xlim = c(0, 13), ylim = c(0, 13))
If we ignore the zero-inflation
library(MASS)
fit.nb <- glm.nb(y ~ weight + factor(color) + factor(spine), data = Crabs)
summary(fit.nb)
##
## Call:
## glm.nb(formula = y ~ weight + factor(color) + factor(spine),
## data = Crabs, init.theta = 0.9650308392, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8782 -1.3684 -0.3245 0.4242 2.2309
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.32230 0.56387 -0.572 0.568
## weight 0.69294 0.16565 4.183 2.88e-05 ***
## factor(color)2 -0.32061 0.37253 -0.861 0.389
## factor(color)3 -0.59550 0.41590 -1.432 0.152
## factor(color)4 -0.57850 0.46433 -1.246 0.213
## factor(spine)2 -0.24107 0.39339 -0.613 0.540
## factor(spine)3 0.04246 0.24792 0.171 0.864
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.965) family taken to be 1)
##
## Null deviance: 220.67 on 172 degrees of freedom
## Residual deviance: 196.52 on 166 degrees of freedom
## AIC: 761.32
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.965
## Std. Err.: 0.176
##
## 2 x log-likelihood: -745.321
We can compare the two models and check whether the zero-inflation part is useful or not. We can not use the anova function but we can still calculate a likelihood-ratio test p-value
1 - pchisq(2 * as.numeric((logLik(fit.zinb) - logLik(fit.nb))), df = df.residual(fit.nb) - df.residual(fit.zinb))
## [1] 2.785625e-09
which significantly shows that the NB model is not sufficient without considering the zero inflation.