library(palmerpenguins)
library(mclust)

1) Prepare the same penguins data used in Lecture 5

We reuse the same four numeric measurements and keep complete cases.

raw_df <- penguins[, c(
  "species", "island",
  "bill_length_mm", "bill_depth_mm",
  "flipper_length_mm", "body_mass_g"
)]

keep <- complete.cases(raw_df)
penguins_df <- raw_df[keep, ]

X <- as.matrix(penguins_df[, c(
  "bill_length_mm", "bill_depth_mm",
  "flipper_length_mm", "body_mass_g"
)])

X_scaled <- scale(X)

2) Quick baseline: K-means (hard assignments)

set.seed(32950)
km3 <- kmeans(X_scaled, centers = 3, nstart = 20)

table(km3$cluster)

  1   2   3 
132  87 123 
table(TrueSpecies = penguins_df$species, KmeansCluster = km3$cluster)
           KmeansCluster
TrueSpecies   1   2   3
  Adelie    127  24   0
  Chinstrap   5  63   0
  Gentoo      0   0 123

3) Compute GMM using MClust()

Model selection with BIC surface

Though GMM can model scale differences, the solution may not be stable if scales across variables differ a lot. So we still tend to scale the data first before running GMM in practice.

We use R function Mclust(). By default, it automatically select both number of components and covariance structure by BIC.

set.seed(32950)
gmm_auto <- Mclust(X_scaled, G = 1:12, verbose = F) ## G: the possible numbers of components

summary(gmm_auto)
---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust VEE (ellipsoidal, equal shape and orientation) model with 4 components: 

Clustering table:
  1   2   3   4 
154  66  57  65 

The BIC in MClust is defined as the reverse of regularly defined BIC, so higher BIC represents better model in MClust: \[\text{BIC}_{\text{mclust}} = -\text{BIC} =2l(\hat \Theta) - d\log n\]

We can actually visualize how BIC changes across different modeling choices.

plot(gmm_auto, what = "BIC")

You may also specify a specific convariance structure and number of components (for instance, fixed G = 3).

gmm_3 <- Mclust(X_scaled, G = 3, modelNames = "VEE", verbose = F)
gmm_3$bic
[1] -2510.485

The output of Mclust

We have the fitted model stored in gmm_auto$paramters, the estimated responsibilites stored in gmm_auto$z, and the corresponding hard clustering results in gmm_auto$classification and the uncertainties stored in gmm_auto$uncertainty. \[\gamma_{ik} = P(Z_i=k\mid x_i), \quad \text{uncertainty}_i=1-\max_k\gamma_{ik}\]

print("First 6 entries of the responsibilities:")
[1] "First 6 entries of the responsibilities:"
head(gmm_auto$z)
          [,1]         [,2]         [,3]         [,4]
[1,] 0.9999859 7.831331e-42 1.088997e-27 1.405042e-05
[2,] 0.9997910 1.289681e-26 3.344178e-19 2.089817e-04
[3,] 0.9917237 7.758721e-30 1.436243e-21 8.276320e-03
[4,] 0.9999983 2.581199e-43 4.760017e-29 1.671648e-06
[5,] 0.9999928 1.510400e-52 3.022682e-33 7.230942e-06
[6,] 0.9999509 1.237485e-35 1.283305e-24 4.905485e-05
print("First 6 entries of the uncertainty:")
[1] "First 6 entries of the uncertainty:"
head(gmm_auto$uncertainty)
[1] 1.405042e-05 2.089817e-04 8.276320e-03 1.671648e-06 7.230942e-06 4.905485e-05
print("First 6 entries of hard clustering labels:")
[1] "First 6 entries of hard clustering labels:"
head(gmm_auto$classification)
[1] 1 1 1 1 1 1

4) Visualize the clustering results

MClust provides visualzation of clustering results in the original data space

plot(gmm_auto, what = "classification")

plot(gmm_3, what = "classification")

Though BIC prefers \(K = 4\), compared to the results when \(K =3\), the additional cluster is not very well seperated in the data, so \(K=3\) seems a more reasonable choice.

plot(gmm_3, what = "uncertainty")

5) Visualize clustering results in PC space

pca <- prcomp(X_scaled)
score12 <- pca$x[, 1:2]

op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

plot(score12,
     col = penguins_df$species,
     pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "Ground truth (Species)")
legend("topright",
       legend = levels(as.factor(penguins_df$species)),
       col = 1:length(levels(as.factor(penguins_df$species))),
       pch = 19)

plot(score12,
     col = km3$cluster,
     pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "K-means clusters")

plot(score12,
     col = gmm_3$classification,
     pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = paste("GMM:", gmm_3$modelName, "(G=", gmm_3$G, ")"))

plot(score12,
     col = "grey80",
     pch = 16,
     xlab = "PC1", ylab = "PC2",
     main = "GMM uncertainty (>0.1 highlighted)")

idx <- gmm_3$uncertainty > 0.1   # adjust threshold
points(score12[idx, ],
       col = "red",
       pch = 19)

par(op)