library(palmerpenguins)
library(mclust)
We reuse the same four numeric measurements and keep complete cases.
raw_df <- penguins[, c(
"species", "island",
"bill_length_mm", "bill_depth_mm",
"flipper_length_mm", "body_mass_g"
)]
keep <- complete.cases(raw_df)
penguins_df <- raw_df[keep, ]
X <- as.matrix(penguins_df[, c(
"bill_length_mm", "bill_depth_mm",
"flipper_length_mm", "body_mass_g"
)])
X_scaled <- scale(X)
set.seed(32950)
km3 <- kmeans(X_scaled, centers = 3, nstart = 20)
table(km3$cluster)
1 2 3
132 87 123
table(TrueSpecies = penguins_df$species, KmeansCluster = km3$cluster)
KmeansCluster
TrueSpecies 1 2 3
Adelie 127 24 0
Chinstrap 5 63 0
Gentoo 0 0 123
MClust()Though GMM can model scale differences, the solution may not be stable if scales across variables differ a lot. So we still tend to scale the data first before running GMM in practice.
We use R function Mclust(). By default, it automatically select both number of components and covariance structure by BIC.
set.seed(32950)
gmm_auto <- Mclust(X_scaled, G = 1:12, verbose = F) ## G: the possible numbers of components
summary(gmm_auto)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VEE (ellipsoidal, equal shape and orientation) model with 4 components:
Clustering table:
1 2 3 4
154 66 57 65
The BIC in MClust is defined as the reverse of regularly defined BIC, so higher BIC represents better model in MClust: \[\text{BIC}_{\text{mclust}} = -\text{BIC} =2l(\hat \Theta) - d\log n\]
We can actually visualize how BIC changes across different modeling choices.
plot(gmm_auto, what = "BIC")
You may also specify a specific convariance structure and number of components (for instance, fixed G = 3).
gmm_3 <- Mclust(X_scaled, G = 3, modelNames = "VEE", verbose = F)
gmm_3$bic
[1] -2510.485
MclustWe have the fitted model stored in gmm_auto$paramters, the estimated responsibilites stored in gmm_auto$z, and the corresponding hard clustering results in gmm_auto$classification and the uncertainties stored in gmm_auto$uncertainty. \[\gamma_{ik} = P(Z_i=k\mid x_i), \quad \text{uncertainty}_i=1-\max_k\gamma_{ik}\]
print("First 6 entries of the responsibilities:")
[1] "First 6 entries of the responsibilities:"
head(gmm_auto$z)
[,1] [,2] [,3] [,4]
[1,] 0.9999859 7.831331e-42 1.088997e-27 1.405042e-05
[2,] 0.9997910 1.289681e-26 3.344178e-19 2.089817e-04
[3,] 0.9917237 7.758721e-30 1.436243e-21 8.276320e-03
[4,] 0.9999983 2.581199e-43 4.760017e-29 1.671648e-06
[5,] 0.9999928 1.510400e-52 3.022682e-33 7.230942e-06
[6,] 0.9999509 1.237485e-35 1.283305e-24 4.905485e-05
print("First 6 entries of the uncertainty:")
[1] "First 6 entries of the uncertainty:"
head(gmm_auto$uncertainty)
[1] 1.405042e-05 2.089817e-04 8.276320e-03 1.671648e-06 7.230942e-06 4.905485e-05
print("First 6 entries of hard clustering labels:")
[1] "First 6 entries of hard clustering labels:"
head(gmm_auto$classification)
[1] 1 1 1 1 1 1
MClust provides visualzation of clustering results in the original data space
plot(gmm_auto, what = "classification")
plot(gmm_3, what = "classification")
Though BIC prefers \(K = 4\), compared to the results when \(K =3\), the additional cluster is not very well seperated in the data, so \(K=3\) seems a more reasonable choice.
plot(gmm_3, what = "uncertainty")
pca <- prcomp(X_scaled)
score12 <- pca$x[, 1:2]
op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
plot(score12,
col = penguins_df$species,
pch = 19,
xlab = "PC1", ylab = "PC2",
main = "Ground truth (Species)")
legend("topright",
legend = levels(as.factor(penguins_df$species)),
col = 1:length(levels(as.factor(penguins_df$species))),
pch = 19)
plot(score12,
col = km3$cluster,
pch = 19,
xlab = "PC1", ylab = "PC2",
main = "K-means clusters")
plot(score12,
col = gmm_3$classification,
pch = 19,
xlab = "PC1", ylab = "PC2",
main = paste("GMM:", gmm_3$modelName, "(G=", gmm_3$G, ")"))
plot(score12,
col = "grey80",
pch = 16,
xlab = "PC1", ylab = "PC2",
main = "GMM uncertainty (>0.1 highlighted)")
idx <- gmm_3$uncertainty > 0.1 # adjust threshold
points(score12[idx, ],
col = "red",
pch = 19)
par(op)