(Section 6.2.5 of Agresti's book)
We simulate data from the following model: \[y_i^* = 2 + 0.6x_i - 4z_i + \epsilon_i\]
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
z <- rbinom(n, 1, 0.5)
epsilon <- rnorm(n, 0, 1)
y_star <- 2 + 0.6 * x - 4 * z + epsilon
pch <- rep("1", n)
pch[z == 0] <- "0"
plot(x, y_star, pch = pch, xlim = c(0, 10), ylim = c(-3, 9))
abline(2, 0.6)
abline(-2, 0.6)
This figure shows the relationship between \(x\) and \(y^*\) at the two levels of \(z\). The two lines share the same slope (0.6), so clearly there is no interaction between \(x\) and \(z\) in their effect on \(y^*\).
Now consider the scenario that we can only measure \(y\), a discretized version of \(y^*\).
y <- rep(1, n)
y[y_star > 2 & y_star <= 4] <- 2
y[y_star > 4 & y_star <= 6] <- 3
y[y_star > 6 & y_star <= 8] <- 4
y[y_star > 8] <- 5
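The nested reassignments above can be written more idiomatically with `cut()`. A minimal self-contained sketch (re-running the same simulation) checking that the two discretizations agree:

```r
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
z <- rbinom(n, 1, 0.5)
y_star <- 2 + 0.6 * x - 4 * z + rnorm(n, 0, 1)

# Discretize via cut(): intervals (-Inf,2], (2,4], (4,6], (6,8], (8,Inf)
y_cut <- as.integer(cut(y_star, breaks = c(-Inf, 2, 4, 6, 8, Inf)))

# Threshold-based assignments, as in the text
y <- rep(1, n)
y[y_star > 2 & y_star <= 4] <- 2
y[y_star > 4 & y_star <= 6] <- 3
y[y_star > 6 & y_star <= 8] <- 4
y[y_star > 8] <- 5

all(y_cut == y)  # TRUE
```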
lm.fit1 <- lm(y[z == 1] ~ x[z == 1])
lm.fit2 <- lm(y[z == 0] ~ x[z == 0])
plot(x, y, pch = pch, xlim = c(0, 10), ylim = c(0, 5))
abline(lm.fit1$coefficients[1], lm.fit1$coefficients[2])
abline(lm.fit2$coefficients[1], lm.fit2$coefficients[2])
One can see that the two fitted lines no longer share the same slope, and that the variance of \(y\) differs across categories.
lm.fit3 <- lm(y ~ x + z + x:z)
summary(lm.fit3)
##
## Call:
## lm(formula = y ~ x + z + x:z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10736 -0.34012 0.01254 0.23616 1.17697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.87577 0.17175 10.921 < 2e-16 ***
## x 0.23284 0.02894 8.045 2.28e-12 ***
## z -1.17386 0.21924 -5.354 5.86e-07 ***
## x:z -0.09913 0.03747 -2.646 0.00952 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4881 on 96 degrees of freedom
## Multiple R-squared: 0.8137, Adjusted R-squared: 0.8079
## F-statistic: 139.8 on 3 and 96 DF, p-value: < 2.2e-16
The significant interaction is spurious, caused by the flooring and ceiling effects of the discretization. In contrast, if we can observe \(y^*\), we obtain the correct estimates.
lm.fit4 <- lm(y_star ~ x + z + x:z)
summary(lm.fit4)
##
## Call:
## lm(formula = y_star ~ x + z + x:z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9498 -0.5791 -0.1286 0.5798 2.4582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.44125 0.33727 7.238 1.12e-10 ***
## x 0.52209 0.05683 9.187 8.31e-15 ***
## z -4.72214 0.43051 -10.969 < 2e-16 ***
## x:z 0.11644 0.07358 1.583 0.117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9584 on 96 degrees of freedom
## Multiple R-squared: 0.8906, Adjusted R-squared: 0.8871
## F-statistic: 260.4 on 3 and 96 DF, p-value: < 2.2e-16
Instead, if we use a cumulative link model, we can correctly identify that there is no interaction between \(x\) and \(z\). As \(\epsilon\) follows a normal distribution, we use a cumulative probit model.
library(VGAM)
## Loading required package: stats4
## Loading required package: splines
fit <- suppressWarnings(vglm(y ~ x + z + x:z, family = cumulative(link = "probitlink", parallel = T)))
summary(fit)
##
## Call:
## vglm(formula = y ~ x + z + x:z, family = cumulative(link = "probitlink",
## parallel = T))
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## probitlink(P[Y<=1]) -3.499 -1.591e-01 -0.006832 0.18263 2.4354
## probitlink(P[Y<=2]) -2.253 -1.351e-01 0.009128 0.15963 2.0628
## probitlink(P[Y<=3]) -3.574 6.988e-05 0.008994 0.13790 1.2357
## probitlink(P[Y<=4]) -1.953 3.658e-06 0.001032 0.04755 0.8009
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -0.62218 0.47087 -1.321 0.1864
## (Intercept):2 1.08514 0.42447 2.556 0.0106 *
## (Intercept):3 3.27600 0.62037 5.281 1.29e-07 ***
## (Intercept):4 4.85644 0.78166 6.213 5.20e-10 ***
## x -0.44869 0.09002 -4.984 6.22e-07 ***
## z 3.49280 0.83168 4.200 2.67e-05 ***
## x:z 0.02389 0.12895 0.185 0.8530
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: probitlink(P[Y<=1]), probitlink(P[Y<=2]),
## probitlink(P[Y<=3]), probitlink(P[Y<=4])
##
## Residual deviance: 123.7853 on 393 degrees of freedom
##
## Log-likelihood: NA on 393 degrees of freedom
##
## Number of Fisher scoring iterations: 12
##
##
## Exponentiated coefficients:
## x z x:z
## 0.6384641 32.8779657 1.0241817
Question: what are the true values of the estimated coefficients?
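A sketch of the answer: since \(\epsilon_i \sim N(0,1)\) and \(y_i\) is obtained by cutting \(y_i^*\) at the thresholds \(c_k \in \{2, 4, 6, 8\}\),
\[P(y_i \le k) = P(y_i^* \le c_k) = \Phi\left(c_k - 2 - 0.6x_i + 4z_i\right),\]
so on the probit scale the true intercepts are \(c_k - 2 \in \{0, 2, 4, 6\}\), the true coefficients are \(-0.6\) for \(x\) and \(4\) for \(z\), and the true interaction is \(0\), which the estimates above recover up to sampling error.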
Alligators <- read.table("Alligators.dat", header = T)
Alligators
This is a dataset with grouped multinomial responses.
We can use a baseline-category logit model. We use the first food category (y1, fish) as the baseline; it is placed last in cbind() because VGAM treats the last column as the reference level.
library(VGAM)
fit <- vglm(formula = cbind(y2, y3, y4, y5, y1) ~ size + factor(lake), family = multinomial, data = Alligators)
summary(fit)
##
## Call:
## vglm(formula = cbind(y2, y3, y4, y5, y1) ~ size + factor(lake),
## family = multinomial, data = Alligators)
##
## Pearson residuals:
## log(mu[,1]/mu[,5]) log(mu[,2]/mu[,5]) log(mu[,3]/mu[,5]) log(mu[,4]/mu[,5])
## 1 0.0953 0.028205 -0.54130 -0.7268
## 2 -0.5082 0.003228 0.66646 1.2589
## 3 -0.3693 -0.461102 -0.42005 1.8347
## 4 0.4125 0.249983 0.19772 -1.3779
## 5 -0.5526 -0.191149 0.07215 0.2790
## 6 0.6500 0.110694 -0.02784 -0.2828
## 7 0.6757 0.827737 0.79863 -0.3081
## 8 -1.3051 -0.802694 -0.69525 0.4629
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -3.2074 0.6387 -5.021 5.13e-07 ***
## (Intercept):2 -2.0718 0.7067 -2.931 0.003373 **
## (Intercept):3 -1.3980 0.6085 -2.297 0.021601 *
## (Intercept):4 -1.0781 0.4709 -2.289 0.022061 *
## size:1 1.4582 0.3959 3.683 0.000231 ***
## size:2 -0.3513 0.5800 -0.606 0.544786
## size:3 -0.6307 0.6425 -0.982 0.326296
## size:4 0.3316 0.4482 0.740 0.459506
## factor(lake)2:1 2.5956 0.6597 3.934 8.34e-05 ***
## factor(lake)2:2 1.2161 0.7860 1.547 0.121824
## factor(lake)2:3 -1.3483 1.1635 -1.159 0.246529
## factor(lake)2:4 -0.8205 0.7296 -1.125 0.260713
## factor(lake)3:1 2.7803 0.6712 4.142 3.44e-05 ***
## factor(lake)3:2 1.6925 0.7804 2.169 0.030113 *
## factor(lake)3:3 0.3926 0.7818 0.502 0.615487
## factor(lake)3:4 0.6902 0.5597 1.233 0.217511
## factor(lake)4:1 1.6584 0.6129 2.706 0.006813 **
## factor(lake)4:2 -1.2428 1.1854 -1.048 0.294466
## factor(lake)4:3 -0.6951 0.7813 -0.890 0.373608
## factor(lake)4:4 -0.8262 0.5575 -1.482 0.138378
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: log(mu[,1]/mu[,5]), log(mu[,2]/mu[,5]),
## log(mu[,3]/mu[,5]), log(mu[,4]/mu[,5])
##
## Residual deviance: 17.0798 on 12 degrees of freedom
##
## Log-likelihood: -47.5138 on 12 degrees of freedom
##
## Number of Fisher scoring iterations: 5
##
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):1'
##
##
## Reference group is level 5 of the response
We can check whether we need a more complicated model with a likelihood-ratio goodness-of-fit test based on the residual deviance:
1 - pchisq(17.0798, df = 12)
## [1] 0.1466201
The large p-value indicates that interaction terms are not needed. We can also verify that both the lake and size main effects are needed:
library(VGAM)
fit.nosize <- vglm(formula = cbind(y2, y3, y4, y5, y1) ~ factor(lake), family = multinomial, data = Alligators)
anova(fit.nosize, fit, type = "I")
The fitted probabilities:
fitted(fit)
## y2 y3 y4 y5 y1
## 1 0.09309880 0.04745657 0.070401523 0.25373963 0.5353035
## 2 0.02307168 0.07182461 0.140896287 0.19400964 0.5701978
## 3 0.60189675 0.07722761 0.008817482 0.05387208 0.2581861
## 4 0.24864518 0.19483742 0.029416085 0.06866281 0.4584385
## 5 0.51683852 0.08876722 0.035894709 0.17420051 0.1842990
## 6 0.19296122 0.20239954 0.108225068 0.20066164 0.2957525
## 7 0.41285579 0.01156654 0.029671169 0.09380245 0.4521040
## 8 0.13967784 0.02389871 0.081067366 0.09791362 0.6574425
The response categories are y1: Fish, y2: Invertebrate, y3: Reptile, y4: Bird, y5: Other.
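As an interpretation sketch (assuming, as in Agresti's data, that size is a 0/1 indicator for small alligators; this coding is not shown above): exponentiating a baseline-category logit coefficient gives an odds ratio relative to the fish baseline.

```r
# Hypothetical interpretation of the size coefficient in the y2-vs-y1 logit:
# a one-unit change in size multiplies the estimated odds of
# Invertebrate (y2) versus Fish (y1) by exp(coefficient).
or_size <- exp(1.4582)  # coefficient size:1 from the summary above
or_size                 # roughly 4.3
```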
Mental impairment is ordinal: 1 = well, 2 = mild symptom formation, 3 = moderate symptom formation, 4 = impaired
Mental <- read.table("Mental.dat", header = T)
Mental
#table(Mental$impair)
fit <- vglm(impair ~ life + ses, family = cumulative(link = "logitlink", parallel = T), data = Mental)
summary(fit)
##
## Call:
## vglm(formula = impair ~ life + ses, family = cumulative(link = "logitlink",
## parallel = T), data = Mental)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## logitlink(P[Y<=1]) -1.568 -0.7048 -0.2102 0.8070 2.713
## logitlink(P[Y<=2]) -2.328 -0.4666 0.2657 0.6904 1.615
## logitlink(P[Y<=3]) -3.688 0.1198 0.2039 0.4194 1.892
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -0.2819 0.6231 -0.452 0.65096
## (Intercept):2 1.2128 0.6511 1.863 0.06251 .
## (Intercept):3 2.2094 0.7171 3.081 0.00206 **
## life -0.3189 0.1194 -2.670 0.00759 **
## ses 1.1112 0.6143 1.809 0.07045 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]),
## logitlink(P[Y<=3])
##
## Residual deviance: 99.0979 on 115 degrees of freedom
##
## Log-likelihood: -49.5489 on 115 degrees of freedom
##
## Number of Fisher scoring iterations: 5
##
## No Hauck-Donner effect found in any of the estimates
##
##
## Exponentiated coefficients:
## life ses
## 0.7269742 3.0380707
Interpretation of the signs of the coefficients: because the model is for the cumulative probabilities \(P(y_i \le k)\), a positive coefficient increases the probability of the lower categories, i.e., it represents a negative effect on \(y_i\).
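To make the direction concrete, here is a small sketch using the fitted coefficients from the summary above (the covariate values life = 2 and ses = 1 are arbitrary illustrations) that converts cumulative logits into category probabilities:

```r
# Fitted cumulative-logit coefficients from the summary above
alpha  <- c(-0.2819, 1.2128, 2.2094)  # intercepts for P(Y <= 1), ..., P(Y <= 3)
b_life <- -0.3189
b_ses  <- 1.1112

# Cumulative probabilities at illustrative values life = 2, ses = 1
cum_p <- plogis(alpha + b_life * 2 + b_ses * 1)  # P(Y <= k), k = 1, 2, 3

# Category probabilities P(Y = k), k = 1, ..., 4
p <- diff(c(0, cum_p, 1))

# Because b_life < 0, increasing life lowers every P(Y <= k),
# shifting probability mass toward the higher (more impaired) categories.
```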
We check whether the proportional odds assumption is reasonable by comparing with a more flexible model that allows a different \(\beta_k\) for each category \(k\).
fit.nonpo <- vglm(impair ~ life + ses, family = cumulative(link = "logitlink", parallel = F), data = Mental)
anova(fit, fit.nonpo, type = "I")
#1 - pchisq(2 * (logLik(fit.nonpo) - logLik(fit)), df = df.residual(fit) - df.residual(fit.nonpo))
The p-value is not small enough to reject the proportional odds assumption, so the simpler parallel model is adequate.