---
title: "Lecture 10 Demo: Regularization with the Hitters Data"
output: html_notebook
---

```{r setup}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

library(ggplot2)
library(glmnet)
```

## Packages and data

This notebook uses the `Hitters` data from the `ISLR2` package. The `Hitters` data record Major League Baseball players from the 1986 season. `Salary` is the response variable: annual salary in thousands of dollars.

```{r}
data("Hitters", package = "ISLR2")
hitters <- na.omit(Hitters)

dim(hitters)
summary(hitters$Salary)
```



### A quick look at the original predictors

There are 19 predictors in the original dataset, and 263 samples. 

- Variables such as `AtBat`, `Hits`, `HmRun`, `Runs`, `RBI`, and `Walks` describe recent batting performance.
- Variables beginning with `C` are career totals.
- `PutOuts`, `Assists`, and `Errors` describe fielding.
- `League`, `Division`, and `NewLeague` are categorical indicators.

```{r}
hitters
```

```{r}
ggplot(hitters, aes(x = Years, y = Salary, color = League, size = Hits)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Hitters: salary versus years in the league",
    x = "Years",
    y = "Salary"
  ) +
  theme_minimal(base_size = 13)
```

```{r}
num_vars <- c("Salary", "AtBat", "Hits", "HmRun", "Runs", "RBI", "Walks",
              "Years", "CAtBat", "CHits", "CHmRun", "CRuns", "CRBI", "CWalks")

corr_mat <- cor(hitters[, num_vars])

corr_df <- as.data.frame(as.table(corr_mat))
names(corr_df) <- c("Var1", "Var2", "Correlation")

ggplot(corr_df, aes(x = Var1, y = Var2, fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "#2166ac", mid = "white", high = "#b2182b",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(
    title = "Correlation structure among selected quantitative variables",
    x = NULL,
    y = NULL
  ) +
  theme_minimal(base_size = 11) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid = element_blank()
  )
```

There is substantial correlation among the career totals, and several batting variables are also strongly related. This is a good setting for regularization.
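To make this concrete, we can rank predictor pairs by absolute correlation using the `corr_df` table built above (a small sketch; `corr_pairs` is just an illustrative name):

```{r}
# Keep one triangle of the correlation table so each unordered pair appears once
corr_pairs <- subset(corr_df, as.integer(Var1) < as.integer(Var2))

# Show the most strongly correlated pairs
head(corr_pairs[order(-abs(corr_pairs$Correlation)), ], 5)
```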

### Create a richer design matrix

To make the prediction problem more genuinely high-dimensional, we add two-way interactions among the quantitative predictors while keeping the categorical variables as main effects only.

```{r}
reg_formula <- Salary ~ League + Division + NewLeague +
  (AtBat + Hits + HmRun + Runs + RBI + Walks + Years +
   CAtBat + CHits + CHmRun + CRuns + CRBI + CWalks +
   PutOuts + Assists + Errors)^2

X <- model.matrix(reg_formula, data = hitters)[, -1]
y <- hitters$Salary

dim(X)
```

This creates a design matrix with 139 predictors in total. 
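The count follows from simple bookkeeping: the three factors contribute one dummy column each, the 16 quantitative variables contribute their main effects, and `^2` adds all $\binom{16}{2} = 120$ pairwise interactions:

```{r}
# 3 factor dummies + 16 quantitative main effects + choose(16, 2) interactions
3 + 16 + choose(16, 2) # 139
```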



## Train/test split

```{r}
set.seed(32950)
n <- nrow(X)
train_id <- sample(seq_len(n), size = floor(0.8 * n))
test_id <- setdiff(seq_len(n), train_id)

X_train <- X[train_id, , drop = FALSE]
X_test <- X[test_id, , drop = FALSE]
y_train <- y[train_id]
y_test <- y[test_id]

length(train_id) # number of training samples
length(test_id) # number of test samples
```

### OLS on the same design

```{r}
train_dat <- hitters[train_id, ]
test_dat <- hitters[test_id, ]

ols_fit <- lm(reg_formula, data = train_dat)
ols_pred <- predict(ols_fit, newdata = test_dat)
ols_mse <- mean((y_test - ols_pred)^2)

ols_mse
```

Even without exact degeneracy, this expanded design makes OLS much less stable. This is exactly the kind of setting where regularization becomes useful.

## Ridge, lasso, and elastic net

We will use the `glmnet` package to fit ridge, lasso, and elastic net models. The `alpha` argument selects the penalty: `alpha = 0` gives ridge, `alpha = 1` gives lasso, and intermediate values give the elastic net. The `glmnet` functions estimate an intercept automatically and do not penalize it.

We first use `cv.glmnet` to choose `lambda` by cross-validation. The function fits models along a grid of `lambda` values on the entire training dataset and records the cross-validated error at each value. By default, `cv.glmnet` uses `nfolds = 10` folds. We also standardize all the predictors.

```{r}
set.seed(32950)
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0, standardize = TRUE)

set.seed(32950)
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1, standardize = TRUE)

set.seed(32950)
cv_enet <- cv.glmnet(X_train, y_train, alpha = 0.5, standardize = TRUE)
```

### Cross-validation curves

```{r}
par(mfrow = c(1, 3))
plot(cv_ridge, main = "Ridge CV")
plot(cv_lasso, main = "Lasso CV")
plot(cv_enet, main = "Elastic net CV")
par(mfrow = c(1, 1))
```

### Test data prediction error

Use the tuning parameters selected by CV on the whole training data, and then test the models on the test data.

```{r}
ridge_pred <- predict(cv_ridge, newx = X_test, s = "lambda.min")
lasso_pred <- predict(cv_lasso, newx = X_test, s = "lambda.min")
enet_pred <- predict(cv_enet, newx = X_test, s = "lambda.min")

ridge_mse <- mean((y_test - ridge_pred)^2)
lasso_mse <- mean((y_test - lasso_pred)^2)
enet_mse <- mean((y_test - enet_pred)^2)

results <- data.frame(
  Method = c("OLS", "Ridge", "Lasso", "Elastic net"),
  Test_MSE = c(ols_mse, ridge_mse, lasso_mse, enet_mse)
)

results[order(results$Test_MSE), ]
```
Lasso, elastic net, and ridge all predict the test data much better than OLS, with lasso performing the best here.
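Note that `cv.glmnet` also stores `lambda.1se`, the largest `lambda` whose CV error is within one standard error of the minimum; it typically yields a sparser, more conservative fit. Swapping it in is a one-line change:

```{r}
# More conservative choice: largest lambda within 1 SE of the CV minimum
lasso_pred_1se <- predict(cv_lasso, newx = X_test, s = "lambda.1se")
mean((y_test - lasso_pred_1se)^2)
```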


## Coefficient paths

We can also visualize how the coefficients change with `lambda` on the training data.

```{r}
ridge_fit <- glmnet(X_train, y_train, alpha = 0, standardize = TRUE)
lasso_fit <- glmnet(X_train, y_train, alpha = 1, standardize = TRUE)
enet_fit <- glmnet(X_train, y_train, alpha = 0.5, standardize = TRUE)
```

```{r}
par(mfrow = c(1, 3))
plot(ridge_fit, xvar = "lambda", label = FALSE, main = "Ridge path")
plot(lasso_fit, xvar = "lambda", label = FALSE, main = "Lasso path")
plot(enet_fit, xvar = "lambda", label = FALSE, main = "Elastic net path")
par(mfrow = c(1, 1))
```

The lasso and elastic net paths are actually very similar. The elastic net path is just smoother because of the added $\ell_2$ penalty.
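For reference, `glmnet` minimizes the penalized least-squares criterion below, where `alpha` interpolates between the ridge ($\alpha = 0$) and lasso ($\alpha = 1$) penalties:

$$
\min_{\beta_0,\, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2
+ \lambda \left[ \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right]
$$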

## How sparse is the lasso fit?

```{r}
lasso_coef <- coef(cv_lasso, s = "lambda.min")
enet_coef <- coef(cv_enet, s = "lambda.min")

sum(lasso_coef[-1, 1] != 0) # drop the intercept and count the nonzero coefficients
sum(enet_coef[-1, 1] != 0)
```

```{r}
lasso_nonzero <- lasso_coef[lasso_coef[, 1] != 0, , drop = FALSE]
lasso_nonzero
```

## Takeaway

- OLS becomes unstable on the full interaction-expanded design.
- Ridge stabilizes the fit by shrinking all coefficients.
- Lasso performs both shrinkage and variable selection.
- Elastic net gives a compromise between sparsity and stability.
