(This example follows Section 6.2.5 of Agresti's book.)
We simulate data from the following latent-variable model: \[y_i^* = 2 + 0.6x_i - 4z_i + \epsilon_i, \qquad \epsilon_i \sim N(0, 1),\] with \(x_i \sim \text{Uniform}(0, 10)\) and \(z_i \sim \text{Bernoulli}(0.5)\).
set.seed(1)
n <- 100
x <- runif(n, 0, 10)       # continuous covariate
z <- rbinom(n, 1, 0.5)     # binary group indicator
epsilon <- rnorm(n, 0, 1)  # latent error
y_star <- 2 + 0.6 * x - 4 * z + epsilon  # latent response
# Plot the latent response, labelling each point by its value of z,
# and add the two true regression lines
pch <- rep("1", n)
pch[z == 0] <- "0"
plot(x, y_star, pch = pch, xlim = c(0, 10), ylim = c(-3, 9))
abline(2, 0.6)   # true line for z = 0: intercept 2, slope 0.6
abline(-2, 0.6)  # true line for z = 1: intercept 2 - 4 = -2, slope 0.6
This figure shows the relationship between \(x\) and \(y^*\) at the two levels of \(z\). Clearly, there is no interaction between \(x\) and \(z\) in their effects on \(y^*\): the two true regression lines are parallel.
Now consider the scenario in which we can only measure \(y\), a discretized version of \(y^*\).
# Discretize y_star into five ordered categories using cut-points 2, 4, 6, 8
y <- rep(1, n)  # category 1: y_star <= 2
y[y_star > 2 & y_star <= 4] <- 2
y[y_star > 4 & y_star <= 6] <- 3
y[y_star > 6 & y_star <= 8] <- 4
y[y_star > 8] <- 5
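As a quick sanity check on the discretization (optional), we can look at how many observations end up in each category:
# Category counts for the discretized response
table(y)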
# Fit separate least-squares lines to the discretized response in each group
lm.fit1 <- lm(y[z == 1] ~ x[z == 1])
lm.fit2 <- lm(y[z == 0] ~ x[z == 0])
plot(x, y, pch = pch, xlim = c(0, 10), ylim = c(0, 5))
abline(coef(lm.fit1)[1], coef(lm.fit1)[2])
abline(coef(lm.fit2)[1], coef(lm.fit2)[2])
One can see that the two fitted lines are no longer parallel and that the variability of the observations differs across categories: the slope for the \(z = 1\) group is attenuated because many of its observations are stuck in the lowest category (a floor effect).
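We can confirm this numerically by comparing the group-specific slopes from the two fits above; the gap between them is exactly what the x:z interaction term in the next model estimates.
# Intercept and slope estimates for the z = 1 and z = 0 groups, respectively
coef(lm.fit1)
coef(lm.fit2)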
lm.fit3 <- lm(y ~ x + z + x:z)
summary(lm.fit3)
##
## Call:
## lm(formula = y ~ x + z + x:z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10736 -0.34012 0.01254 0.23616 1.17697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.87577 0.17175 10.921 < 2e-16 ***
## x 0.23284 0.02894 8.045 2.28e-12 ***
## z -1.17386 0.21924 -5.354 5.86e-07 ***
## x:z -0.09913 0.03747 -2.646 0.00952 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4881 on 96 degrees of freedom
## Multiple R-squared: 0.8137, Adjusted R-squared: 0.8079
## F-statistic: 139.8 on 3 and 96 DF, p-value: < 2.2e-16
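Before interpreting this fit, it helps to see where the observations land on the discrete scale. A cross-tabulation of the observed category against the group (a purely descriptive check) shows that the \(z = 1\) group is concentrated in the lowest categories:
# Distribution of the observed categories within each group
table(y, z)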
The interaction term appears highly significant here, but this is an artifact of the flooring and ceiling effects introduced by the discretization, not a real interaction. In contrast, if we could observe \(y^*\) directly, we would obtain the correct estimates.
lm.fit4 <- lm(y_star ~ x + z + x:z)
summary(lm.fit4)
##
## Call:
## lm(formula = y_star ~ x + z + x:z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9498 -0.5791 -0.1286 0.5798 2.4582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.44125 0.33727 7.238 1.12e-10 ***
## x 0.52209 0.05683 9.187 8.31e-15 ***
## z -4.72214 0.43051 -10.969 < 2e-16 ***
## x:z 0.11644 0.07358 1.583 0.117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9584 on 96 degrees of freedom
## Multiple R-squared: 0.8906, Adjusted R-squared: 0.8871
## F-statistic: 260.4 on 3 and 96 DF, p-value: < 2.2e-16
Instead, if we use a cumulative link model, we can correctly identify that there is no interaction between \(x\) and \(z\). Since \(\epsilon\) follows a standard normal distribution, we use a cumulative probit model.
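Concretely, with the parallel (proportional) constraint the model fitted below is \[\Phi^{-1}\{P(y_i \le k)\} = \alpha_k + \beta_1 x_i + \beta_2 z_i + \beta_3 x_i z_i, \qquad k = 1, \dots, 4,\] where \(\Phi\) is the standard normal CDF. Note that VGAM's cumulative() family puts a plus sign in front of the covariate terms, so the signs of the \(\beta\)'s are flipped relative to the latent-variable formulation above.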
library(VGAM)
## Loading required package: stats4
## Loading required package: splines
# parallel = T imposes a common set of slopes across the four cumulative probits
fit <- suppressWarnings(vglm(y ~ x + z + x:z, family = cumulative(link = "probitlink", parallel = T)))
summary(fit)
##
## Call:
## vglm(formula = y ~ x + z + x:z, family = cumulative(link = "probitlink",
## parallel = T))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -0.62218 0.47087 -1.321 0.1864
## (Intercept):2 1.08514 0.42447 2.556 0.0106 *
## (Intercept):3 3.27600 0.62037 5.281 1.29e-07 ***
## (Intercept):4 4.85644 0.78166 6.213 5.20e-10 ***
## x -0.44869 0.09002 -4.984 6.22e-07 ***
## z 3.49280 0.83168 4.200 2.67e-05 ***
## x:z 0.02389 0.12895 0.185 0.8530
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: probitlink(P[Y<=1]), probitlink(P[Y<=2]),
## probitlink(P[Y<=3]), probitlink(P[Y<=4])
##
## Residual deviance: 123.7853 on 393 degrees of freedom
##
## Log-likelihood: NA on 393 degrees of freedom
##
## Number of Fisher scoring iterations: 12
##
##
## Exponentiated coefficients:
## x z x:z
## 0.6384641 32.8779657 1.0241817
Question: what are the true values of the estimated coefficients?
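A sketch of one way to answer this, using the latent-variable representation with cut-points \(c_1, \dots, c_4 = 2, 4, 6, 8\): \[P(y_i \le k) = P(y_i^* \le c_k) = P(2 + 0.6x_i - 4z_i + \epsilon_i \le c_k) = \Phi(c_k - 2 - 0.6x_i + 4z_i).\] Since the latent error has unit variance, no rescaling is needed, so the true values are \(\alpha_k = c_k - 2 \in \{0, 2, 4, 6\}\), \(\beta_x = -0.6\), \(\beta_z = 4\), and \(\beta_{xz} = 0\), which the estimates above recover up to sampling error.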
We now turn to a real-data example. Mental impairment is ordinal: 1 = well, 2 = mild symptom formation, 3 = moderate symptom formation, 4 = impaired.
Mental <- read.table("Mental.dat", header = T)
Mental
#table(Mental$impair)
# Proportional-odds model: cumulative logits with common effects of life and ses
fit <- vglm(impair ~ life + ses, family = cumulative(link = "logitlink", parallel = T), data = Mental)
summary(fit)
##
## Call:
## vglm(formula = impair ~ life + ses, family = cumulative(link = "logitlink",
## parallel = T), data = Mental)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -0.2819 0.6231 -0.452 0.65096
## (Intercept):2 1.2128 0.6511 1.863 0.06251 .
## (Intercept):3 2.2094 0.7171 3.081 0.00206 **
## life -0.3189 0.1194 -2.670 0.00759 **
## ses 1.1112 0.6143 1.809 0.07045 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]),
## logitlink(P[Y<=3])
##
## Residual deviance: 99.0979 on 115 degrees of freedom
##
## Log-likelihood: -49.5489 on 115 degrees of freedom
##
## Number of Fisher scoring iterations: 5
##
## No Hauck-Donner effect found in any of the estimates
##
##
## Exponentiated coefficients:
## life ses
## 0.7269742 3.0380707
Interpretation of the sign of the coefficients: since the linear predictors model the cumulative probabilities \(P(y_i \le j)\), a positive coefficient increases the probability of the lower categories, i.e., it represents a negative effect on \(y_i\).
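To make this concrete, we can compute fitted category probabilities at the lowest and highest observed numbers of life events, holding ses at its median (a small illustration; predict() with type = "response" returns the fitted probabilities for each category):
# Fitted category probabilities at the extremes of the life-events range
newdat <- data.frame(life = c(min(Mental$life), max(Mental$life)),
                     ses = median(Mental$ses))
predict(fit, newdata = newdat, type = "response")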
We check whether the proportional odds assumption is reasonable by comparing this fit with a more flexible model that allows a different \(\beta\) for each category \(k\).
# More flexible model: a separate effect of life and ses for each cumulative logit
fit.nonpo <- vglm(impair ~ life + ses, family = cumulative(link = "logitlink", parallel = F), data = Mental)
anova(fit, fit.nonpo, type = "I")
# Equivalent likelihood-ratio test computed by hand:
#1 - pchisq(2 * (logLik(fit.nonpo) - logLik(fit)), df = df.residual(fit) - df.residual(fit.nonpo))
The p-value is not small enough to reject the proportional odds assumption, so we retain the simpler parallel model.