(Section 6.2.5 of Agresti's book)
We simulate data from the following model: \[y_i^* = 2 + 0.6x_i - 4z_i + \epsilon_i\]
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
z <- rbinom(n, 1, 0.5)
epsilon <- rnorm(n, 0, 1)
y_star <- 2 + 0.6 * x - 4 * z + epsilon
pch <- rep("1", n)
pch[z == 0] <- "0"
plot(x, y_star, pch = pch, xlim = c(0, 10), ylim = c(-3, 9))
abline(2, 0.6)
abline(-2, 0.6)
This figure shows the relationship between \(x\) and \(y^*\) at the two levels of \(z\). The two lines share the same slope (0.6), so clearly there is no interaction between \(x\) and \(z\) in their effect on \(y^*\).
Now consider the scenario that we can only measure \(y\), a discretized version of \(y^*\).
y <- rep(1, n)
y[y_star > 2 & y_star <= 4] <- 2
y[y_star > 4 & y_star <= 6] <- 3
y[y_star > 6 & y_star <= 8] <- 4
y[y_star > 8] <- 5
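The nested reassignments above can be written more idiomatically with `cut()`. A minimal self-contained sketch (re-running the same simulation) checking that the two discretizations agree:

```r
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
z <- rbinom(n, 1, 0.5)
y_star <- 2 + 0.6 * x - 4 * z + rnorm(n, 0, 1)

# Discretize via cut(): intervals (-Inf,2], (2,4], (4,6], (6,8], (8,Inf)
y_cut <- as.integer(cut(y_star, breaks = c(-Inf, 2, 4, 6, 8, Inf)))

# Threshold-based assignments, as in the text
y <- rep(1, n)
y[y_star > 2 & y_star <= 4] <- 2
y[y_star > 4 & y_star <= 6] <- 3
y[y_star > 6 & y_star <= 8] <- 4
y[y_star > 8] <- 5

all(y_cut == y)  # TRUE
```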
lm.fit1 <- lm(y[z == 1] ~ x[z == 1])
lm.fit2 <- lm(y[z == 0] ~ x[z == 0])
plot(x, y, pch = pch, xlim = c(0, 10), ylim = c(0, 5))
abline(lm.fit1$coefficients[1], lm.fit1$coefficients[2])
abline(lm.fit2$coefficients[1], lm.fit2$coefficients[2])
One can see that the two fitted lines no longer share the same slope, and that the variance of \(y\) differs across categories.
lm.fit3 <- lm(y ~ x + z + x:z)
summary(lm.fit3)
##
## Call:
## lm(formula = y ~ x + z + x:z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10736 -0.34012 0.01254 0.23616 1.17697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.87577 0.17175 10.921 < 2e-16 ***
## x 0.23284 0.02894 8.045 2.28e-12 ***
## z -1.17386 0.21924 -5.354 5.86e-07 ***
## x:z -0.09913 0.03747 -2.646 0.00952 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4881 on 96 degrees of freedom
## Multiple R-squared: 0.8137, Adjusted R-squared: 0.8079
## F-statistic: 139.8 on 3 and 96 DF, p-value: < 2.2e-16
The significant interaction is spurious, caused by the flooring and ceiling effects of the discretization. In contrast, if we can observe \(y^*\), we obtain the correct estimates.
lm.fit4 <- lm(y_star ~ x + z + x:z)
summary(lm.fit4)
##
## Call:
## lm(formula = y_star ~ x + z + x:z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9498 -0.5791 -0.1286 0.5798 2.4582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.44125 0.33727 7.238 1.12e-10 ***
## x 0.52209 0.05683 9.187 8.31e-15 ***
## z -4.72214 0.43051 -10.969 < 2e-16 ***
## x:z 0.11644 0.07358 1.583 0.117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9584 on 96 degrees of freedom
## Multiple R-squared: 0.8906, Adjusted R-squared: 0.8871
## F-statistic: 260.4 on 3 and 96 DF, p-value: < 2.2e-16
Instead, if we use a cumulative link model, we can correctly identify that there is no interaction between \(x\) and \(z\). As \(\epsilon\) follows a normal distribution, we use a cumulative probit model.
library(VGAM)
## Loading required package: stats4
## Loading required package: splines
fit <- suppressWarnings(vglm(y ~ x + z + x:z, family = cumulative(link = "probitlink", parallel = T)))
summary(fit)
##
## Call:
## vglm(formula = y ~ x + z + x:z, family = cumulative(link = "probitlink",
## parallel = T))
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## probitlink(P[Y<=1]) -3.499 -1.591e-01 -0.006832 0.18263 2.4354
## probitlink(P[Y<=2]) -2.253 -1.351e-01 0.009128 0.15963 2.0628
## probitlink(P[Y<=3]) -3.574 6.988e-05 0.008994 0.13790 1.2357
## probitlink(P[Y<=4]) -1.953 3.658e-06 0.001032 0.04755 0.8009
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -0.62218 0.47087 -1.321 0.1864
## (Intercept):2 1.08514 0.42447 2.556 0.0106 *
## (Intercept):3 3.27600 0.62037 5.281 1.29e-07 ***
## (Intercept):4 4.85644 0.78166 6.213 5.20e-10 ***
## x -0.44869 0.09002 -4.984 6.22e-07 ***
## z 3.49280 0.83168 4.200 2.67e-05 ***
## x:z 0.02389 0.12895 0.185 0.8530
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: probitlink(P[Y<=1]), probitlink(P[Y<=2]),
## probitlink(P[Y<=3]), probitlink(P[Y<=4])
##
## Residual deviance: 123.7853 on 393 degrees of freedom
##
## Log-likelihood: NA on 393 degrees of freedom
##
## Number of Fisher scoring iterations: 12
##
##
## Exponentiated coefficients:
## x z x:z
## 0.6384641 32.8779657 1.0241817
Question: what are the true values of the estimated coefficients?
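A sketch of the answer: since \(\epsilon_i \sim N(0,1)\) and \(y_i\) is obtained by cutting \(y_i^*\) at the thresholds \(c_k \in \{2, 4, 6, 8\}\),
\[P(y_i \le k) = P(y_i^* \le c_k) = \Phi\left(c_k - 2 - 0.6x_i + 4z_i\right),\]
so on the probit scale the true intercepts are \(c_k - 2 \in \{0, 2, 4, 6\}\), the true coefficients are \(-0.6\) for \(x\) and \(4\) for \(z\), and the true interaction is \(0\), which the estimates above recover up to sampling error.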
Alligators <- read.table("Alligators.dat", header = T)
Alligators
This is a dataset with grouped multinomial responses.
We can use a baseline-category logit model. We use the first food category (y1, fish) as the baseline; it is placed last in cbind() because VGAM treats the last column as the reference level.
library(VGAM)
fit <- vglm(formula = cbind(y2, y3, y4, y5, y1) ~ size + factor(lake), family = multinomial, data = Alligators)
summary(fit)
##
## Call:
## vglm(formula = cbind(y2, y3, y4, y5, y1) ~ size + factor(lake),
## family = multinomial, data = Alligators)
##
## Pearson residuals:
## log(mu[,1]/mu[,5]) log(mu[,2]/mu[,5]) log(mu[,3]/mu[,5]) log(mu[,4]/mu[,5])
## 1 0.0953 0.028205 -0.54130 -0.7268
## 2 -0.5082 0.003228 0.66646 1.2589
## 3 -0.3693 -0.461102 -0.42005 1.8347
## 4 0.4125 0.249983 0.19772 -1.3779
## 5 -0.5526 -0.191149 0.07215 0.2790
## 6 0.6500 0.110694 -0.02784 -0.2828
## 7 0.6757 0.827737 0.79863 -0.3081
## 8 -1.3051 -0.802694 -0.69525 0.4629
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -3.2074 0.6387 -5.021 5.13e-07 ***
## (Intercept):2 -2.0718 0.7067 -2.931 0.003373 **
## (Intercept):3 -1.3980 0.6085 -2.297 0.021601 *
## (Intercept):4 -1.0781 0.4709 -2.289 0.022061 *
## size:1 1.4582 0.3959 3.683 0.000231 ***
## size:2 -0.3513 0.5800 -0.606 0.544786
## size:3 -0.6307 0.6425 -0.982 0.326296
## size:4 0.3316 0.4482 0.740 0.459506
## factor(lake)2:1 2.5956 0.6597 3.934 8.34e-05 ***
## factor(lake)2:2 1.2161 0.7860 1.547 0.121824
## factor(lake)2:3 -1.3483 1.1635 -1.159 0.246529
## factor(lake)2:4 -0.8205 0.7296 -1.125 0.260713
## factor(lake)3:1 2.7803 0.6712 4.142 3.44e-05 ***
## factor(lake)3:2 1.6925 0.7804 2.169 0.030113 *
## factor(lake)3:3 0.3926 0.7818 0.502 0.615487
## factor(lake)3:4 0.6902 0.5597 1.233 0.217511
## factor(lake)4:1 1.6584 0.6129 2.706 0.006813 **
## factor(lake)4:2 -1.2428 1.1854 -1.048 0.294466
## factor(lake)4:3 -0.6951 0.7813 -0.890 0.373608
## factor(lake)4:4 -0.8262 0.5575 -1.482 0.138378
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: log(mu[,1]/mu[,5]), log(mu[,2]/mu[,5]),
## log(mu[,3]/mu[,5]), log(mu[,4]/mu[,5])
##
## Residual deviance: 17.0798 on 12 degrees of freedom
##
## Log-likelihood: -47.5138 on 12 degrees of freedom
##
## Number of Fisher scoring iterations: 5
##
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):1'
##
##
## Reference group is level 5 of the response
We can check whether we need a more complicated model with a likelihood-ratio goodness-of-fit test based on the residual deviance:
1 - pchisq(17.0798, df = 12)
## [1] 0.1466201
The large p-value indicates that interaction terms are not needed. We can also verify that both the lake and size main effects are needed:
library(VGAM)
fit.nosize <- vglm(formula = cbind(y2, y3, y4, y5, y1) ~ factor(lake), family = multinomial, data = Alligators)
anova(fit.nosize, fit, type = "I")
The fitted probabilities:
fitted(fit)
## y2 y3 y4 y5 y1
## 1 0.09309880 0.04745657 0.070401523 0.25373963 0.5353035
## 2 0.02307168 0.07182461 0.140896287 0.19400964 0.5701978
## 3 0.60189675 0.07722761 0.008817482 0.05387208 0.2581861
## 4 0.24864518 0.19483742 0.029416085 0.06866281 0.4584385
## 5 0.51683852 0.08876722 0.035894709 0.17420051 0.1842990
## 6 0.19296122 0.20239954 0.108225068 0.20066164 0.2957525
## 7 0.41285579 0.01156654 0.029671169 0.09380245 0.4521040
## 8 0.13967784 0.02389871 0.081067366 0.09791362 0.6574425
The response categories are y1: Fish, y2: Invertebrate, y3: Reptile, y4: Bird, y5: Other.
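As an interpretation sketch (assuming, as in Agresti's data, that size is a 0/1 indicator for small alligators; this coding is not shown above): exponentiating a baseline-category logit coefficient gives an odds ratio relative to the fish baseline.

```r
# Hypothetical interpretation of the size coefficient in the y2-vs-y1 logit:
# a one-unit change in size multiplies the estimated odds of
# Invertebrate (y2) versus Fish (y1) by exp(coefficient).
or_size <- exp(1.4582)  # coefficient size:1 from the summary above
or_size                 # roughly 4.3
```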
Mental impairment is ordinal: 1 = well, 2 = mild symptom formation, 3 = moderate symptom formation, 4 = impaired
Mental <- read.table("Mental.dat", header = T)
Mental
#table(Mental$impair)
fit <- vglm(impair ~ life + ses, family = cumulative(link = "logitlink", parallel = T), data = Mental)
summary(fit)
##
## Call:
## vglm(formula = impair ~ life + ses, family = cumulative(link = "logitlink",
## parallel = T), data = Mental)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## logitlink(P[Y<=1]) -1.568 -0.7048 -0.2102 0.8070 2.713
## logitlink(P[Y<=2]) -2.328 -0.4666 0.2657 0.6904 1.615
## logitlink(P[Y<=3]) -3.688 0.1198 0.2039 0.4194 1.892
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -0.2819 0.6231 -0.452 0.65096
## (Intercept):2 1.2128 0.6511 1.863 0.06251 .
## (Intercept):3 2.2094 0.7171 3.081 0.00206 **
## life -0.3189 0.1194 -2.670 0.00759 **
## ses 1.1112 0.6143 1.809 0.07045 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]),
## logitlink(P[Y<=3])
##
## Residual deviance: 99.0979 on 115 degrees of freedom
##
## Log-likelihood: -49.5489 on 115 degrees of freedom
##
## Number of Fisher scoring iterations: 5
##
## No Hauck-Donner effect found in any of the estimates
##
##
## Exponentiated coefficients:
## life ses
## 0.7269742 3.0380707
Interpretation of the signs of the coefficients: because the model is for the cumulative probabilities \(P(y_i \le k)\), a positive coefficient increases the probability of the lower categories, i.e., it represents a negative effect on \(y_i\).
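To make the direction concrete, here is a small sketch using the fitted coefficients from the summary above (the covariate values life = 2 and ses = 1 are arbitrary illustrations) that converts cumulative logits into category probabilities:

```r
# Fitted cumulative-logit coefficients from the summary above
alpha  <- c(-0.2819, 1.2128, 2.2094)  # intercepts for P(Y <= 1), ..., P(Y <= 3)
b_life <- -0.3189
b_ses  <- 1.1112

# Cumulative probabilities at illustrative values life = 2, ses = 1
cum_p <- plogis(alpha + b_life * 2 + b_ses * 1)  # P(Y <= k), k = 1, 2, 3

# Category probabilities P(Y = k), k = 1, ..., 4
p <- diff(c(0, cum_p, 1))

# Because b_life < 0, increasing life lowers every P(Y <= k),
# shifting probability mass toward the higher (more impaired) categories.
```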
We check whether the proportional odds assumption is reasonable by comparing with a more flexible model that allows a different \(\beta_k\) for each category \(k\).
fit.nonpo <- vglm(impair ~ life + ses, family = cumulative(link = "logitlink", parallel = F), data = Mental)
anova(fit, fit.nonpo, type = "I")
#1 - pchisq(2 * (logLik(fit.nonpo) - logLik(fit)), df = df.residual(fit) - df.residual(fit.nonpo))
The p-value is not small enough to reject the proportional odds assumption, so the simpler parallel model is adequate.