Example: Classification for endometrial cancer

Endometrial <- read.table("Endometrial.dat", header = T)
Endometrial
##    NV PI   EH HG
## 1   0 13 1.64  0
## 2   0 16 2.26  0
## 3   0  8 3.14  0
## 4   0 34 2.68  0
## 5   0 20 1.28  0
## 6   0  5 2.31  0
## 7   0 17 1.80  0
## 8   0 10 1.68  0
## 9   0 26 1.56  0
## 10  0 17 2.31  0
## (first 10 of 79 rows shown)

1.1 Quasi-complete separation

table(Endometrial$NV, Endometrial$HG)
##    
##      0  1
##   0 49 17
##   1  0 13

If NVi = 1, then yi = 1; if NVi = 0, then yi can be either 0 or 1. This is quasi-complete separation: NV perfectly predicts the response on part of the sample, so the MLE of its coefficient is infinite.
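The separation can also be checked mechanically. A minimal sketch, assuming the detectseparation package is installed (it is not used elsewhere in these notes):

## detect_separation replaces the fitting routine and reports whether the
## ML estimates are finite; NV should be flagged with an infinite estimate.
library(detectseparation)
glm(HG ~ NV + PI + EH, family = binomial, data = Endometrial,
    method = "detect_separation")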

We can see that R returns a very large estimate (and an enormous standard error) for the NV coefficient:

fit <- glm(HG ~ NV + PI + EH, family = binomial, data = Endometrial)
summary(fit)
## 
## Call:
## glm(formula = HG ~ NV + PI + EH, family = binomial, data = Endometrial)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.50137  -0.64108  -0.29432   0.00016   2.72777  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    4.30452    1.63730   2.629 0.008563 ** 
## NV            18.18556 1715.75089   0.011 0.991543    
## PI            -0.04218    0.04433  -0.952 0.341333    
## EH            -2.90261    0.84555  -3.433 0.000597 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 104.903  on 78  degrees of freedom
## Residual deviance:  55.393  on 75  degrees of freedom
## AIC: 63.393
## 
## Number of Fisher Scoring iterations: 17
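In fact the NV estimate has not converged to a finite maximum: the likelihood keeps increasing as the coefficient grows, and Fisher scoring stops only when the deviance change drops below the tolerance (note the 17 iterations above). As a rough illustration, tightening the tolerance pushes the estimate even higher:

## With a stricter convergence tolerance, Fisher scoring runs longer and the
## NV estimate grows further; the reported value depends only on when we stop.
fit.tight <- glm(HG ~ NV + PI + EH, family = binomial, data = Endometrial,
                 control = glm.control(epsilon = 1e-12, maxit = 100))
coef(fit.tight)["NV"]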

We can still carry out a likelihood-ratio test with df = 1 for H0: β1 = 0 (the NV coefficient):

deviance(glm(HG ~ PI + EH, family = binomial, data = Endometrial)) - deviance(fit)
## [1] 9.357643

Comparing this statistic with a chi-squared distribution with 1 degree of freedom gives a p-value of about 0.002, so we can conclude that β1 > 0 (the estimate is positive and significantly different from zero). We will present other inference methods in later sections.
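The p-value can be computed directly from the chi-squared distribution:

pchisq(9.357643, df = 1, lower.tail = FALSE)
## [1] 0.002220576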

We can also run an analysis of deviance:

fit1 <- glm(HG ~ PI + EH + NV, family = "binomial", data = Endometrial)
anova(fit1, test = "LRT")
##      Df Deviance Resid. Df Resid. Dev   Pr(>Chi)
## NULL                    78  104.90253           
## PI    1   0.3010        77  104.60153  0.5832573
## EH    1  39.8506        76   64.75090 2.7415e-10
## NV    1   9.3576        75   55.39326  0.0022206
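Note that anova() adds the terms sequentially, so the per-term deviances depend on their order in the formula; the last row reproduces the likelihood-ratio test for NV computed above because NV enters last. As a quick illustration:

## Reordering the formula generally changes the earlier rows of the table,
## while the final residual deviance of the full model stays the same.
anova(glm(HG ~ NV + PI + EH, family = "binomial", data = Endometrial),
      test = "LRT")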

1.2 ROC curve

Here we used all the samples to train the logistic regression, so we now look at the training performance via an ROC curve.

Endometrial$pred <- predict(fit, Endometrial, type = "response")
Endometrial
##    NV PI   EH HG        pred
## 1   0 13 1.64  0 0.268128293
## 2   0 16 2.26  0 0.050675633
## 3   0  8 3.14  0 0.005782436
## 4   0 34 2.68  0 0.007327976
## 5   0 20 1.28  0 0.436719782
## 6   0  5 2.31  0 0.068407170
## 7   0 17 1.80  0 0.162834126
## 8   0 10 1.68  0 0.270183124
## 9   0 26 1.56  0 0.210765814
## 10  0 17 2.31  0 0.042386309
## (first 10 of 79 rows shown)
library(ROCR)
pred <- prediction(Endometrial$pred, Endometrial$HG)
perf <- performance(pred, "sens", "fpr")  # sensitivity (TPR) vs. false positive rate
plot(perf)

## Compare with a simpler model without NV
fit2 <- glm(HG ~ PI + EH, family = "binomial", data = Endometrial)
Endometrial$pred2 <- predict(fit2, Endometrial, type = "response")
pred2 <- prediction(Endometrial$pred2, Endometrial$HG)
perf2 <- performance(pred2, "sens", "fpr")
plot(perf2, add = T, col = "red")
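To attach a number to the visual comparison, we can compute the area under each ROC curve (AUC) from the same ROCR prediction objects:

## AUC for the full model and the simpler model (training data)
performance(pred, "auc")@y.values[[1]]
performance(pred2, "auc")@y.values[[1]]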

We see a slightly larger training error for the simpler model.

For an unbiased evaluation of the classification performance, we should look at the ROC curve on a test dataset. Here we randomly select 20% of the samples as test data.

n <- nrow(Endometrial)
set.seed(1)
test.idx <- sample(n, round(0.2*n))
fit.train <- glm(HG ~ PI + EH + NV, family = "binomial", data = Endometrial[-test.idx,])
test <- Endometrial[test.idx, ]
test$pred <- predict(fit.train, test, type = "response")
pred <- prediction(test$pred, test$HG)
perf <- performance(pred, "sens", "fpr")
plot(perf)

## Compare with a simpler model without NV
fit.train2 <- glm(HG ~ PI + EH, family = "binomial", data = Endometrial[-test.idx,])
test$pred2 <- predict(fit.train2, test, type = "response")
pred2 <- prediction(test$pred2, test$HG)
perf2 <- performance(pred2, "sens", "fpr")
plot(perf2, add = T, col = "red")

We see that the simpler model performs better here in terms of classification. However, the sample size in this dataset is small, so the randomness in the test-data ROC curve is non-negligible.
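To see how much the test evaluation depends on the particular split, a minimal sketch (reusing the objects defined above) repeats the 80/20 split over 20 seeds and records the test AUC of the full model:

## Test AUC of the full model across 20 random 80/20 splits; glm may warn
## about fitted probabilities of 0 or 1 because of the separation in NV.
## (Assumes every test split contains both classes, which is very likely here.)
aucs <- sapply(1:20, function(s) {
  set.seed(s)
  idx <- sample(n, round(0.2 * n))
  f <- glm(HG ~ PI + EH + NV, family = "binomial", data = Endometrial[-idx, ])
  p <- predict(f, Endometrial[idx, ], type = "response")
  performance(prediction(p, Endometrial$HG[idx]), "auc")@y.values[[1]]
})
summary(aucs)

The spread of these AUC values shows how much a single random split can mislead with a dataset of this size.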