Endometrial <- read.table("Endometrial.dat", header = T)
Endometrial
table(Endometrial$NV, Endometrial$HG)
##
##      0  1
##   0 49 17
##   1  0 13
From the table (rows are NV, columns are HG): if \(NV_i = 1\), then \(y_i = 1\); if \(NV_i = 0\), then \(y_i\) can be either \(0\) or \(1\). This is quasi-complete separation, so the MLE of the NV coefficient does not exist (it diverges to \(\infty\)), and R returns a very large estimate with an enormous standard error:
fit <- glm(HG ~ NV + PI + EH, family = binomial, data = Endometrial)
summary(fit)
##
## Call:
## glm(formula = HG ~ NV + PI + EH, family = binomial, data = Endometrial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.50137 -0.64108 -0.29432 0.00016 2.72777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.30452 1.63730 2.629 0.008563 **
## NV 18.18556 1715.75089 0.011 0.991543
## PI -0.04218 0.04433 -0.952 0.341333
## EH -2.90261 0.84555 -3.433 0.000597 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 104.903 on 78 degrees of freedom
## Residual deviance: 55.393 on 75 degrees of freedom
## AIC: 63.393
##
## Number of Fisher Scoring iterations: 17
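Note that Fisher scoring stops here only because the deviance has stabilized, not because the NV estimate has converged. A quick diagnostic (a sketch, assuming otherwise default glm settings; the tolerance and iteration limit are arbitrary choices) is to tighten the convergence tolerance: the NV estimate keeps growing while the deviance barely moves, the signature of a divergent MLE under quasi-complete separation.
# With a stricter tolerance, the NV estimate drifts even higher while the
# deviance is essentially unchanged: the MLE is infinite, not ill-conditioned.
fit.strict <- glm(HG ~ NV + PI + EH, family = binomial, data = Endometrial,
                  control = glm.control(epsilon = 1e-14, maxit = 100))
coef(fit.strict)["NV"]
deviance(fit.strict)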
The Wald test is useless here, but we can still run a likelihood-ratio test with df = 1 for \(H_0\): \(\beta_1 = 0\), where \(\beta_1\) is the coefficient of NV:
deviance(glm(HG ~ PI + EH, family = binomial, data = Endometrial)) - deviance(fit)
## [1] 9.357643
Comparing with a \(\chi^2_1\) distribution, the p-value is about \(0.002\), so we can conclude that \(\beta_1 > 0\). We will present other inference methods in later sections.
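For reference, this p-value is the upper-tail \(\chi^2_1\) probability of the observed statistic:
pchisq(9.357643, df = 1, lower.tail = FALSE)  # approximately 0.002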
We can also run an analysis of deviance; since NV is entered last in fit1, its row reproduces the likelihood-ratio statistic computed above.
fit1 <- glm(HG ~ PI + EH + NV, family = "binomial", data = Endometrial)
anova(fit1, test = "LRT")
Here we used all samples to train the logistic regression; we now look at the training performance with an ROC curve.
Endometrial$pred <- predict(fit, Endometrial, type = "response")
Endometrial
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
pred <- prediction(Endometrial$pred, Endometrial$HG)
perf <- performance(pred,"sens","fpr")
plot(perf)
## Compare with a simpler model
fit2 <- glm(HG ~ PI + EH, family = "binomial", data = Endometrial)
Endometrial$pred2 <- predict(fit2, Endometrial, type = "response")
pred2 <- prediction(Endometrial$pred2, Endometrial$HG)
perf2 <- performance(pred2,"sens","fpr")
plot(perf2, add = T, col = "red")
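The comparison can also be made numerically with the area under the ROC curve (AUC), which ROCR computes via its "auc" measure:
# Training AUC for the full model and for the simpler model
performance(pred, measure = "auc")@y.values[[1]]
performance(pred2, measure = "auc")@y.values[[1]]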
We see a slightly larger training error for the simpler model.
For an unbiased evaluation of the classification performance, we should look at the ROC curve on a test dataset. Here we randomly select \(20\%\) of the samples as test data.
n <- nrow(Endometrial)
set.seed(1)
test.idx <- sample(n, round(0.2*n))
fit.train <- glm(HG ~ PI + EH + NV, family = "binomial", data = Endometrial[-test.idx,])
test <- Endometrial[test.idx, ]
test$pred <- predict(fit.train, test, type = "response")
pred <- prediction(test$pred, test$HG)
perf <- performance(pred,"sens","fpr")
plot(perf)
## Compare with a simpler model
fit.train2 <- glm(HG ~ PI + EH, family = "binomial", data = Endometrial[-test.idx,])
test$pred2 <- predict(fit.train2, test, type = "response")
pred2 <- prediction(test$pred2, test$HG)
perf2 <- performance(pred2,"sens","fpr")
plot(perf2, add = T, col = "red")
On the test data, the simpler model performs better in terms of classification. However, the sample size in this dataset is small, so the randomness in the test-data ROC curve is non-negligible.
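One way to reduce this randomness (a minimal sketch, not part of the analysis above; the number of repetitions is an arbitrary choice, and warnings from refitting the separated NV model are expected) is to average the test AUC over repeated random 80/20 splits:
# Average test-set AUC over B random splits for both models
set.seed(1)
B <- 100
auc.full <- auc.simple <- numeric(B)
for (b in 1:B) {
  idx <- sample(n, round(0.2 * n))
  f.full <- glm(HG ~ PI + EH + NV, family = binomial, data = Endometrial[-idx, ])
  f.simple <- glm(HG ~ PI + EH, family = binomial, data = Endometrial[-idx, ])
  p.full <- predict(f.full, Endometrial[idx, ], type = "response")
  p.simple <- predict(f.simple, Endometrial[idx, ], type = "response")
  auc.full[b] <- performance(prediction(p.full, Endometrial$HG[idx]), "auc")@y.values[[1]]
  auc.simple[b] <- performance(prediction(p.simple, Endometrial$HG[idx]), "auc")@y.values[[1]]
}
c(full = mean(auc.full), simple = mean(auc.simple))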
The next dataset reports the deaths of adult flour beetles after exposure to gaseous carbon disulfide at various dosages. The data are in group-level form.
beetles2 <- read.table("beetles2.dat", header = T)
beetles2
Let’s use the probit link.
alive <- beetles2$n - beetles2$dead
data <- cbind(dead = beetles2$dead, alive = alive)  # (successes, failures) for glm
logdose <- beetles2$logdose
dead <- beetles2$dead
n <- beetles2$n
fit.probit <- glm(data ~ logdose, family = binomial(link = probit))
summary(fit.probit)
##
## Call:
## glm(formula = data ~ logdose, family = binomial(link = probit))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5627 -0.4848 0.7647 1.0530 1.3149
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -34.956 2.649 -13.20 <2e-16 ***
## logdose 19.741 1.488 13.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 284.202 on 7 degrees of freedom
## Residual deviance: 9.987 on 6 degrees of freedom
## AIC: 40.185
##
## Number of Fisher Scoring iterations: 4
The residual deviance is \(9.99\) on \(6\) degrees of freedom (p-value \(0.125\) from the likelihood-ratio test against the group-level saturated model), so the probit model fits the grouped data adequately.
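This p-value is just the \(\chi^2_6\) upper-tail probability of the residual deviance:
pchisq(deviance(fit.probit), df = df.residual(fit.probit), lower.tail = FALSE)  # approximately 0.125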
Now let’s fit the same model to the ungrouped data.
Beetles <- read.table("Beetles.dat", header = T)
Beetles
fit.probit2 <- glm(y ~ x, family = binomial(link = probit), data = Beetles)
summary(fit.probit2)
##
## Call:
## glm(formula = y ~ x, family = binomial(link = probit), data = Beetles)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5638 -0.6263 0.1597 0.4478 2.3883
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -34.956 2.649 -13.20 <2e-16 ***
## x 19.741 1.488 13.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.44 on 480 degrees of freedom
## Residual deviance: 371.23 on 479 degrees of freedom
## AIC: 375.23
##
## Number of Fisher Scoring iterations: 6
The residual deviance is \(371.23\). Note that the parameter estimates and standard errors are identical to those from the grouped fit; only the deviance changes. The likelihood-ratio test based on this residual deviance is invalid here: with ungrouped binary data, the saturated model has one parameter per observation, so the deviance does not have an asymptotic \(\chi^2\) distribution.
Below is the fit using the probit link. The data do not support the symmetry of the response curve around \(0.5\) that the probit link implies.
plot(logdose, dead/n, pch = 20, ylim = c(0, 1))
curve(predict(fit.probit, newdata = list(logdose = x), type = "response"), add = T, lty = 2)
abline(h = 0.5, col = "red", lty = 2)
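A small by-product of the probit fit: the fitted death probability crosses \(0.5\) exactly where the linear predictor is zero, so the corresponding log-dose is \(-\hat\alpha/\hat\beta\):
# Log-dose at which the fitted probit curve equals 0.5
-coef(fit.probit)[1] / coef(fit.probit)[2]  # about 1.77 from the estimates above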
The fit using the logit link
fit.logit <- glm(data ~ logdose, family = binomial(link = logit))
summary(fit.logit)
##
## Call:
## glm(formula = data ~ logdose, family = binomial(link = logit))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5878 -0.4085 0.8442 1.2455 1.5860
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -60.740 5.182 -11.72 <2e-16 ***
## logdose 34.286 2.913 11.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 284.202 on 7 degrees of freedom
## Residual deviance: 11.116 on 6 degrees of freedom
## AIC: 41.314
##
## Number of Fisher Scoring iterations: 4
plot(logdose, dead/n, pch = 20, ylim = c(0, 1))
curve(predict(fit.probit, newdata = list(logdose = x), type = "response"), add = T, lty = 2)
curve(predict(fit.logit, newdata = list(logdose = x), type = "response"), add = T, lty = 1)
The fitted curve is very similar to the probit fit, but the residual deviance (\(11.12\) vs \(9.99\)) is slightly larger.
The fit using the complementary log-log link
fit.cloglog <- glm(data ~ logdose, family = binomial(link = cloglog))
summary(fit.cloglog)
##
## Call:
## glm(formula = data ~ logdose, family = binomial(link = cloglog))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.80002 -0.56588 0.01475 0.38096 1.31591
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -39.522 3.236 -12.21 <2e-16 ***
## logdose 22.015 1.797 12.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 284.2024 on 7 degrees of freedom
## Residual deviance: 3.5143 on 6 degrees of freedom
## AIC: 33.712
##
## Number of Fisher Scoring iterations: 4
plot(logdose, dead/n, pch = 20, ylim = c(0, 1))
curve(predict(fit.probit, newdata = list(logdose = x), type = "response"), add = T, lty = 2)
curve(predict(fit.cloglog, newdata = list(logdose = x), type = "response"), add = T, lty = 1, col = "blue")
The fitted curve fits the data better, and the residual deviance (\(3.51\)) is clearly smaller.
The fit using the log-log link. R does not provide a log-log link directly, but we can fit a log-log binomial GLM using the cloglog link with the two response columns swapped (see Agresti, Section 5.6.3).
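To see why this works, let \(\pi(x)\) be the death probability and fit the cloglog model to the swapped response (alive, dead), whose success probability is \(1 - \pi(x)\):
\[
\log\bigl[-\log\{1 - (1 - \pi(x))\}\bigr] = \log\{-\log \pi(x)\} = \alpha + \beta x,
\]
which is exactly the log-log link for \(\pi(x)\). The fitted death probability is therefore one minus the fitted value of the swapped model, which explains the 1 - predict(...) in the plotting code below.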
data2 <- cbind(alive = alive, dead = dead)  # response columns swapped: (alive, dead)
fit.loglog <- glm(data2 ~ logdose, family = binomial(link = cloglog))
summary(fit.loglog)
##
## Call:
## glm(formula = data2 ~ logdose, family = binomial(link = cloglog))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4425 -2.0554 -0.7002 0.4494 2.5362
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 37.661 2.949 12.77 <2e-16 ***
## logdose -21.583 1.680 -12.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 284.202 on 7 degrees of freedom
## Residual deviance: 27.573 on 6 degrees of freedom
## AIC: 57.771
##
## Number of Fisher Scoring iterations: 6
plot(logdose, dead/n, pch = 20, ylim = c(0, 1))
curve(predict(fit.probit, newdata = list(logdose = x), type = "response"), add = T, lty = 2)
curve(1 - predict(fit.loglog, newdata = list(logdose = x), type = "response"), add = T, lty = 1, col = "red")
The fitted curve is much worse, and the residual deviance (\(27.57\)) is much larger.
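Finally, a compact side-by-side view of the four links on the grouped data, just collecting the deviances and AICs reported above (recall that fit.loglog was fitted to the swapped response, so its deviance is that of the log-log model for death):
# Residual deviance and AIC for each link (smaller is better)
fits <- list(probit = fit.probit, logit = fit.logit,
             cloglog = fit.cloglog, loglog = fit.loglog)
sapply(fits, function(f) c(deviance = deviance(f), AIC = AIC(f)))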