1. A simulation showing the problem of using linear regression for ordinal response

(Chapter 6.2.5 of Agresti's book)

We simulate data from the following model: \[y_i^* = 2 + 0.6x_i - 4z_i + \epsilon_i\]

set.seed(1)
n <- 100
x <- runif(n, 0, 10)        # continuous predictor
z <- rbinom(n, 1, 0.5)      # binary group indicator
epsilon <- rnorm(n, 0, 1)
y_star <- 2 + 0.6 * x - 4 * z + epsilon  # latent continuous response
pch <- rep("1", n)          # plot symbol: "1" for z = 1, "0" for z = 0
pch[z == 0] <- "0"
plot(x, y_star, pch = pch, xlim = c(0, 10), ylim = c(-3, 9))
abline(2, 0.6)              # true regression line for z = 0
abline(-2, 0.6)             # true regression line for z = 1

This figure shows the relationship between \(x\) and \(y^*\) at the two levels of \(z\). Clearly, there is no interaction between \(x\) and \(z\) in their effect on \(y^*\): the two true regression lines are parallel.

1.1 Using a linear model

Now consider the scenario in which we can only measure \(y\), a discretized version of \(y^*\).
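Specifically, with cutpoints \(c_1 = 2\), \(c_2 = 4\), \(c_3 = 6\), \(c_4 = 8\) (matching the code below), \[y_i = k \quad \text{if } c_{k-1} < y_i^* \le c_k, \qquad c_0 = -\infty, \; c_5 = \infty.\]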

y <- rep(1, n)              # y = 1 if y_star <= 2
y[y_star > 2 & y_star <= 4] <- 2
y[y_star > 4 & y_star <= 6] <- 3
y[y_star > 6 & y_star <= 8] <- 4
y[y_star > 8] <- 5
lm.fit1 <- lm(y[z == 1] ~ x[z == 1])  # group-specific linear fits
lm.fit2 <- lm(y[z == 0] ~ x[z == 0])
plot(x, y, pch = pch, xlim = c(0, 10), ylim = c(0, 5))
abline(coef(lm.fit1)[1], coef(lm.fit1)[2])
abline(coef(lm.fit2)[1], coef(lm.fit2)[2])

One can see that the conditional variance of \(y\) is not constant: the response is compressed near the floor (\(y = 1\)) and the ceiling (\(y = 5\)), so the homoscedasticity assumption of linear regression is violated.
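As an illustrative check (not in the original notes), we can tabulate the spread of the group-specific residuals by observed category; unequal spread indicates heteroscedasticity:

# Illustrative check: residual variance of the two group-specific fits,
# tabulated by the observed category of y (NA for singleton categories).
res <- c(residuals(lm.fit1), residuals(lm.fit2))
cat_y <- c(y[z == 1], y[z == 0])
tapply(res, cat_y, var)

Fitting the full linear model with an interaction term to the discretized response: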

lm.fit3 <- lm(y ~ x + z + x:z)
summary(lm.fit3)
## 
## Call:
## lm(formula = y ~ x + z + x:z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10736 -0.34012  0.01254  0.23616  1.17697 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.87577    0.17175  10.921  < 2e-16 ***
## x            0.23284    0.02894   8.045 2.28e-12 ***
## z           -1.17386    0.21924  -5.354 5.86e-07 ***
## x:z         -0.09913    0.03747  -2.646  0.00952 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4881 on 96 degrees of freedom
## Multiple R-squared:  0.8137, Adjusted R-squared:  0.8079 
## F-statistic: 139.8 on 3 and 96 DF,  p-value: < 2.2e-16
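
A quick tabulation (an illustrative check, not in the original notes) shows how the discretized response piles up at the boundary categories within each group:

# Illustrative: counts of each observed category by group. A large share of
# the z = 1 group sits at the lowest category (flooring), which flattens
# that group's fitted slope and manufactures a spurious interaction.
table(y, z)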

The interaction is incorrectly inferred to be significant because of a flooring effect and a ceiling effect in the discretized response. In contrast, if we could observe \(y^*\), we would obtain essentially correct estimates, with a non-significant interaction.

lm.fit4 <- lm(y_star ~ x + z + x:z)
summary(lm.fit4)
## 
## Call:
## lm(formula = y_star ~ x + z + x:z)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9498 -0.5791 -0.1286  0.5798  2.4582 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.44125    0.33727   7.238 1.12e-10 ***
## x            0.52209    0.05683   9.187 8.31e-15 ***
## z           -4.72214    0.43051 -10.969  < 2e-16 ***
## x:z          0.11644    0.07358   1.583    0.117    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9584 on 96 degrees of freedom
## Multiple R-squared:  0.8906, Adjusted R-squared:  0.8871 
## F-statistic: 260.4 on 3 and 96 DF,  p-value: < 2.2e-16

1.2 Using a cumulative probit model

Instead, if we use a cumulative link model, we can correctly identify that there is no interaction between \(x\) and \(z\). Since \(\epsilon\) follows a normal distribution, we use a cumulative probit model.
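Concretely, in VGAM's parameterization the linear predictor enters \(P(y_i \le j)\), so the model fitted below is \[\Phi^{-1}\{P(y_i \le j)\} = \alpha_j + \beta_1 x_i + \beta_2 z_i + \beta_3 x_i z_i, \qquad j = 1, \dots, 4,\] where parallel = T constrains the \(\beta\)'s to be common across \(j\) while the intercepts \(\alpha_j\) vary.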

library(VGAM)
## Loading required package: stats4
## Loading required package: splines
# Cumulative probit model with parallel (common) slopes across categories;
# warnings from vglm() are suppressed here.
fit <- suppressWarnings(vglm(y ~ x + z + x:z,
                             family = cumulative(link = "probitlink", parallel = T)))
summary(fit)
## 
## Call:
## vglm(formula = y ~ x + z + x:z, family = cumulative(link = "probitlink", 
##     parallel = T))
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1 -0.62218    0.47087  -1.321   0.1864    
## (Intercept):2  1.08514    0.42447   2.556   0.0106 *  
## (Intercept):3  3.27600    0.62037   5.281 1.29e-07 ***
## (Intercept):4  4.85644    0.78166   6.213 5.20e-10 ***
## x             -0.44869    0.09002  -4.984 6.22e-07 ***
## z              3.49280    0.83168   4.200 2.67e-05 ***
## x:z            0.02389    0.12895   0.185   0.8530    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: probitlink(P[Y<=1]), probitlink(P[Y<=2]), 
## probitlink(P[Y<=3]), probitlink(P[Y<=4])
## 
## Residual deviance: 123.7853 on 393 degrees of freedom
## 
## Log-likelihood: NA on 393 degrees of freedom
## 
## Number of Fisher scoring iterations: 12 
## 
## 
## Exponentiated coefficients:
##          x          z        x:z 
##  0.6384641 32.8779657  1.0241817

Question: what are the true values of the estimated coefficients?
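One way to reason about this (a sketch, using the data-generating model above): since \(\epsilon_i \sim N(0, 1)\), \[P(y_i \le j) = P(y_i^* \le c_j) = \Phi(c_j - 2 - 0.6x_i + 4z_i),\] so the true values can be read off by matching this expression to the fitted linear predictor \(\alpha_j + \beta_1 x_i + \beta_2 z_i + \beta_3 x_i z_i\).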

2. Chapter 6.3.3, Ordinal response, Mental Impairment

Mental impairment is ordinal: 1 = well, 2 = mild symptom formation, 3 = moderate symptom formation, 4 = impaired.

Mental <- read.table("Mental.dat", header = TRUE)
Mental
# table(Mental$impair)   # category counts
# Proportional odds model: cumulative logit with parallel slopes
fit <- vglm(impair ~ life + ses, family = cumulative(link = "logitlink", parallel = T), data = Mental)
summary(fit)
## 
## Call:
## vglm(formula = impair ~ life + ses, family = cumulative(link = "logitlink", 
##     parallel = T), data = Mental)
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept):1  -0.2819     0.6231  -0.452  0.65096   
## (Intercept):2   1.2128     0.6511   1.863  0.06251 . 
## (Intercept):3   2.2094     0.7171   3.081  0.00206 **
## life           -0.3189     0.1194  -2.670  0.00759 **
## ses             1.1112     0.6143   1.809  0.07045 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]), 
## logitlink(P[Y<=3])
## 
## Residual deviance: 99.0979 on 115 degrees of freedom
## 
## Log-likelihood: -49.5489 on 115 degrees of freedom
## 
## Number of Fisher scoring iterations: 5 
## 
## No Hauck-Donner effect found in any of the estimates
## 
## 
## Exponentiated coefficients:
##      life       ses 
## 0.7269742 3.0380707

Interpretation of the signs of the coefficients: because the model is \(\text{logit}\, P(y_i \le j) = \alpha_j + \beta_1 \mathrm{life}_i + \beta_2 \mathrm{ses}_i\), a positive coefficient increases the probability of the lower (less impaired) categories, i.e., it represents a negative effect on \(y_i\). For example, \(e^{-0.3189} \approx 0.727\): each additional life event multiplies the cumulative odds \(P(y \le j)/P(y > j)\) by 0.727, for every \(j\).
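
As an illustration (the covariate values below are arbitrary examples), the fitted category probabilities shift toward higher impairment as the life-events index increases:

# Illustrative: fitted probabilities for each impairment category at a low
# and a high life-events score, holding ses fixed at 0.
predict(fit, newdata = data.frame(life = c(0, 9), ses = c(0, 0)), type = "response")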

We check whether the proportional odds assumption is reasonable by comparing with a more flexible model that allows a different \(\beta_k\) for each category \(k\).

fit.nonpo <- vglm(impair ~ life + ses, family = cumulative(link = "logitlink", parallel = F), data = Mental)
anova(fit, fit.nonpo, type = "I")
# Equivalent manual likelihood-ratio test:
# 1 - pchisq(2 * (logLik(fit.nonpo) - logLik(fit)), df = df.residual(fit) - df.residual(fit.nonpo))

The p-value is not small enough to reject the proportional odds (parallel slopes) assumption, so the simpler model is adequate.