7.5.1 Horseshoe Crab satellites count

Crabs <- read.table("Crabs.dat", header = T)
Crabs

Variables:

hist(Crabs$y, breaks = c(0:16) - 0.5)

Two modes, possibly we need a zero-inflated GLM model

1. Possible issues using the Poisson model

In Data example 1, we used a Poisson model:

fit.pois <- glm(y ~ weight + factor(color) + factor(spine), data = Crabs, family = poisson())
summary(fit.pois)
## 
## Call:
## glm(formula = y ~ weight + factor(color) + factor(spine), family = poisson(), 
##     data = Crabs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0433  -1.8609  -0.5944   0.9193   4.9483  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.04263    0.25354  -0.168   0.8665    
## weight          0.54756    0.07318   7.482 7.31e-14 ***
## factor(color)2 -0.26768    0.16781  -1.595   0.1107    
## factor(color)3 -0.52087    0.19414  -2.683   0.0073 ** 
## factor(color)4 -0.53966    0.22525  -2.396   0.0166 *  
## factor(spine)2 -0.16066    0.21146  -0.760   0.4474    
## factor(spine)3  0.09086    0.11948   0.760   0.4470    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 549.70  on 166  degrees of freedom
## AIC: 919
## 
## Number of Fisher Scoring iterations: 6
plot(fit.pois, which = 3)

We see some sign of over-dispersion and the fit is quite poor.

2. Using the ZINB model

We can try a ZINB model

library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
fit.zinb <- zeroinfl(y ~ weight + factor(color) + factor(spine) | weight + factor(color) + factor(spine), dist = "negbin", data = Crabs)
summary(fit.zinb)
## 
## Call:
## zeroinfl(formula = y ~ weight + factor(color) + factor(spine) | weight + 
##     factor(color) + factor(spine), data = Crabs, dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4662 -0.7573 -0.2847  0.5384  3.9855 
## 
## Count model coefficients (negbin with log link):
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     1.06668    0.37345   2.856  0.00429 ** 
## weight          0.23167    0.11191   2.070  0.03844 *  
## factor(color)2 -0.12300    0.24963  -0.493  0.62221    
## factor(color)3 -0.28019    0.27637  -1.014  0.31068    
## factor(color)4  0.38110    0.34082   1.118  0.26349    
## factor(spine)2 -0.07365    0.28227  -0.261  0.79415    
## factor(spine)3 -0.13142    0.16620  -0.791  0.42908    
## Log(theta)      1.80490    0.38961   4.633 3.61e-06 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      3.0997     1.3893   2.231 0.025669 *  
## weight          -1.6968     0.4399  -3.858 0.000115 ***
## factor(color)2   0.1501     0.8945   0.168 0.866738    
## factor(color)3   0.6331     0.9722   0.651 0.514934    
## factor(color)4   1.9139     1.0441   1.833 0.066795 .  
## factor(spine)2   0.2357     0.7340   0.321 0.748105    
## factor(spine)3  -0.5560     0.5619  -0.989 0.322464    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 6.0794 
## Number of iterations in BFGS optimization: 25 
## Log-likelihood: -344.6 on 15 Df
  • The dispersion parameter \(\gamma = 1/\theta = 0.14\)
  • We have inflated estimates of the standard errors of the estimated coefficients
  • The deviance residuals are slightly better
mu.pois <- predict(fit.pois, type = "response")
mu.zinb <- predict(fit.zinb, type = "response")
plot(data.frame(mu.pois, Crabs$y, mu.zinb), xlim = c(0, 13), ylim = c(0, 13))

3. Compare with the NB model

If we ignore the zero-inflation

library(MASS)
fit.nb <- glm.nb(y ~ weight + factor(color) + factor(spine), data = Crabs)
summary(fit.nb)
## 
## Call:
## glm.nb(formula = y ~ weight + factor(color) + factor(spine), 
##     data = Crabs, init.theta = 0.9650308392, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8782  -1.3684  -0.3245   0.4242   2.2309  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.32230    0.56387  -0.572    0.568    
## weight          0.69294    0.16565   4.183 2.88e-05 ***
## factor(color)2 -0.32061    0.37253  -0.861    0.389    
## factor(color)3 -0.59550    0.41590  -1.432    0.152    
## factor(color)4 -0.57850    0.46433  -1.246    0.213    
## factor(spine)2 -0.24107    0.39339  -0.613    0.540    
## factor(spine)3  0.04246    0.24792   0.171    0.864    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.965) family taken to be 1)
## 
##     Null deviance: 220.67  on 172  degrees of freedom
## Residual deviance: 196.52  on 166  degrees of freedom
## AIC: 761.32
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.965 
##           Std. Err.:  0.176 
## 
##  2 x log-likelihood:  -745.321

We can compare the two models and check whether the zero-inflation part is useful or not. We can not use the anova function but we can still calculate a likelihood-ratio test p-value

1 - pchisq(2 * as.numeric((logLik(fit.zinb) - logLik(fit.nb))), df = df.residual(fit.nb) - df.residual(fit.zinb))
## [1] 2.785625e-09

which significantly shows that the NB model is not sufficient without considering the zero inflation.