Machine Learning

Insufficient Solution

(1) The least-squares estimates are not unique. There are infinitely many solutions, and most of them overfit the data.

(2) In many instances the solution is computationally infeasible to obtain.

Regularized Regression

Regularization

elitedatascience.com definition

Regularization is a technique used to prevent overfitting by artificially penalizing model coefficients.

Wikipedia definition of Regularization

Regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting.

Strengths and weaknesses of regularization

Strengths:

Weaknesses:

Three regularized regression algorithms


Lasso regression

Ridge regression

Elastic-net

The objective function of regularized regression methods…
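All three methods minimize the usual least-squares loss plus a penalty on the size of the coefficients. A sketch in the notation common in the glmnet literature:

$$\min_{\beta_0,\,\beta} \; \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + P_\lambda(\beta)$$

where the penalty $P_\lambda(\beta)$ is $\lambda \sum_j \beta_j^2$ for Ridge regression, $\lambda \sum_j |\beta_j|$ for the Lasso, and a weighted combination of the two for the Elastic Net.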

Preparations

Necessary packages

library(rsample)  # data splitting
library(glmnet)   # implementing regularized regression approaches
library(dplyr)    # basic data manipulation procedures
library(ggplot2)  # plotting
library(broom)    # tidying model output (used with coef() below)
library(knitr)    # for tables

The example dataset

library(AmesHousing) 
ames_data <- AmesHousing::make_ames()

Create training (70%) and test (30%) sets

set.seed(123)
ames_split <- rsample::initial_split(ames_data, prop = .7, 
                            strata = "Sale_Price")
ames_train <- rsample::training(ames_split)
ames_test  <- rsample::testing(ames_split)

Ridge Regression

Exemplar coefficients

Exemplar coefficients have been regularized with $\lambda$ ranging from 0 to over 8,000 (log(8103) ≈ 9).

How to choose the right $\lambda$

Implementation in glmnet

# Sale_Price is strongly right-skewed, which motivates the log
# transformation of the response used below
plot(density(ames_data$Sale_Price), main = "")

Training and testing feature model matrices and response vectors.

ames_train_x <- model.matrix(Sale_Price ~ ., ames_train)[, -1]
ames_train_y <- log(ames_train$Sale_Price)

ames_test_x <- model.matrix(Sale_Price ~ ., ames_test)[, -1]
ames_test_y <- log(ames_test$Sale_Price)

# What is the dimension of your feature matrix?
dim(ames_train_x)
## [1] 2054  307

Behind the scenes

(1.) It is essential that the predictor variables are standardized when performing regularized regression. glmnet does this for you. If you standardize your predictors before passing them to glmnet, you can turn this behaviour off with standardize = FALSE.

(2.) glmnet fits Ridge models across a wide range of $\lambda$ values, illustrated in the figure on the next slide.

ames_ridge <- glmnet(x = ames_train_x,y = ames_train_y,
  alpha = 0)
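glmnet fits the entire path of solutions, by default for a sequence of 100 $\lambda$ values chosen from the data. You can also supply your own decreasing grid through the lambda argument; a small sketch (the object name ames_ridge_grid is illustrative):

ames_ridge_grid <- glmnet(x = ames_train_x, y = ames_train_y, alpha = 0,
                          lambda = 10^seq(3, -3, length.out = 100))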

A wide range of $\lambda$ parameters

plot(ames_ridge, xvar = "lambda")

$\lambda$ values in glmnet

head(ames_ridge$lambda)
## [1] 289.0010 263.3270 239.9337 218.6187 199.1972 181.5011

Access the coefficients with coef.

coef(ames_ridge)[c("Gr_Liv_Area", "TotRms_AbvGrd"),100]
##   Gr_Liv_Area TotRms_AbvGrd 
##  0.0001108687  0.0083032186
coef(ames_ridge)[c("Gr_Liv_Area", "TotRms_AbvGrd"), 1] 
##   Gr_Liv_Area TotRms_AbvGrd 
##  5.848028e-40  1.341550e-37
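The column index selects a position in the $\lambda$ sequence, which glmnet stores from largest to smallest; that is why column 1 shows heavily shrunken coefficients and column 100 nearly unpenalized ones. A quick check:

ames_ridge$lambda[1]     # largest lambda in the path (strong shrinkage)
ames_ridge$lambda[100]   # smallest lambda in the path (little shrinkage)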

Tuning

ames_ridge <- cv.glmnet(x = ames_train_x,y = ames_train_y,
  alpha = 0)

Results of cv Ridge regression

plot(ames_ridge)

The plot explained (I)

The plot explained (II)

min(ames_ridge$cvm)       # minimum MSE
## [1] 0.01955871
ames_ridge$lambda.min     # lambda for this min MSE
## [1] 0.1542312
# 1 st.error of min MSE
ames_ridge$cvm[ames_ridge$lambda == ames_ridge$lambda.1se]  
## [1] 0.02160821
ames_ridge$lambda.1se  # lambda for this MSE
## [1] 0.5169216
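Because the model is fit to log(Sale_Price), the cross-validated MSE is on the log scale; taking its square root gives an RMSE of roughly 0.14:

sqrt(min(ames_ridge$cvm))   # cross-validated RMSE on the log scale, about 0.14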

The plot explained (III)

ames_ridge_min <- glmnet(x = ames_train_x,y = ames_train_y,
  alpha = 0)

Coefficients across the $\lambda$ values

plot(ames_ridge_min, xvar = "lambda")
abline(v = log(ames_ridge$lambda.1se), col = "red", 
       lty = "dashed")

Advantages and Disadvantages

coef(ames_ridge, s = "lambda.1se") %>%
  tidy() %>%   # convert the sparse coefficient matrix into a data frame (broom)
  filter(row != "(Intercept)") %>%
  top_n(25, wt = abs(value)) %>%
  ggplot(aes(value, reorder(row, value))) +
  geom_point() +
  ggtitle("Top 25 influential variables") +
  xlab("Coefficient") +
  ylab(NULL)

Top 25 influential variables

Exercise: ridge regression (I)

1) Load the lars package and the diabetes dataset.

2) Load the glmnet package to implement ridge regression.

The dataset contains three components: x, x2 and y. x holds a smaller set of independent variables, while x2 contains the full set including quadratic and interaction terms. y is the dependent variable, a quantitative measure of the progression of diabetes.

3) Generate separate scatterplots with the line of best fit for all the predictors in x with y on the vertical axis.

4) Regress y on the predictors in x using OLS. We will use this result as a benchmark for comparison.

Exercise: ridge regression (II)

5) Fit the ridge regression model using the glmnet function and plot the trace of the estimated coefficients against lambda. Note that the coefficients are shrunk closer to zero for higher values of lambda.

6) Use the cv.glmnet function to get the cross validation curve and the value of lambda that minimizes the mean cross validation error.

7) Using the minimum value of lambda from the previous exercise, get the estimated beta matrix. Note that the coefficients are smaller than the least-squares estimates.

8) To get a more parsimonious model we can use a higher value of lambda that is within one standard error of the minimum. Use this value of lambda to get the beta coefficients. Note the shrinkage effect on the estimates.

Exercise: ridge regression (III)

9) Split the data randomly between a training set (80%) and test set (20%). We will use these to get the prediction standard error for least squares and ridge regression models.

10) Fit the ridge regression model on the training set and get the estimated beta coefficients for both the minimum lambda and the higher lambda within 1-standard error of the minimum.

11) Get predictions from the ridge regression model for the test set and calculate the prediction standard error. Do this for both the minimum lambda and the higher lambda within 1-standard error of the minimum.

12) Fit the least squares model on the training set.

13) Get predictions from the least squares model for the test set and calculate the prediction standard error. (A compact sketch of the full workflow is given below.)
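A compact sketch of one possible solution path for the thirteen steps above. Object names are illustrative, glmnet is used for the ridge fits, and the exact numbers will depend on the random split:

library(lars)      # provides the diabetes dataset
library(glmnet)

data(diabetes)
x <- diabetes$x    # smaller set of predictors
y <- diabetes$y    # quantitative measure of disease progression

# 3) scatterplots of y against each predictor with a line of best fit
for (j in colnames(x)) {
  plot(x[, j], y, xlab = j, ylab = "y")
  abline(lm(y ~ x[, j]), col = "red")
}

ols_fit <- lm(y ~ x)                      # 4) OLS benchmark

ridge_fit <- glmnet(x, y, alpha = 0)      # 5) ridge coefficient paths
plot(ridge_fit, xvar = "lambda")

cv_fit <- cv.glmnet(x, y, alpha = 0)      # 6) cross-validation curve
plot(cv_fit)
coef(cv_fit, s = "lambda.min")            # 7) coefficients at the minimum lambda
coef(cv_fit, s = "lambda.1se")            # 8) more parsimonious model

set.seed(1)                               # 9) 80/20 train/test split
train_id <- sample(seq_len(nrow(x)), size = floor(0.8 * nrow(x)))

cv_train <- cv.glmnet(x[train_id, ], y[train_id], alpha = 0)           # 10)
pred_min <- predict(cv_train, s = "lambda.min", newx = x[-train_id, ])
pred_1se <- predict(cv_train, s = "lambda.1se", newx = x[-train_id, ])
sqrt(mean((y[-train_id] - pred_min)^2))   # 11) test error at lambda.min
sqrt(mean((y[-train_id] - pred_1se)^2))   #     and at lambda.1se

full_df   <- data.frame(y = y, as.data.frame(unclass(x)))
ols_train <- lm(y ~ ., data = full_df[train_id, ])                     # 12)
ols_pred  <- predict(ols_train, newdata = full_df[-train_id, ])        # 13)
sqrt(mean((y[-train_id] - ols_pred)^2))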

Ridge and Lasso

A Ridge model…

Lasso Regression

Lasso penalty pushes coefficients to zero

Lasso regression coefficients as $\lambda$ grows from $0 \rightarrow \infty$.

The Lasso improves the model through regularization and performs automated feature selection.

The reduction of coefficients

When a data set has many features, Lasso can be used to identify and extract those features with the largest (and most consistent) signal.

Implementation of Lasso regression on the Ames data

ames_lasso <- glmnet(x = ames_train_x, y = ames_train_y, alpha = 1)

A quick drop in the number of features

plot(ames_lasso, xvar = "lambda")

Tuning with cv.glmnet

ames_lasso <- cv.glmnet(x = ames_train_x, y = ames_train_y, alpha = 1)
names(ames_lasso)
##  [1] "lambda"     "cvm"        "cvsd"       "cvup"       "cvlo"      
##  [6] "nzero"      "name"       "glmnet.fit" "lambda.min" "lambda.1se"

MSE for cross validation

plot(ames_lasso)

Minimum and one-standard-error MSE and their $\lambda$ values.

min(ames_lasso$cvm) # minimum MSE
## [1] 0.02246344
ames_lasso$lambda.min # lambda for this min MSE
## [1] 0.00332281
# 1 st.error of min MSE
ames_lasso$cvm[ames_lasso$lambda == ames_lasso$lambda.1se]  
## [1] 0.02482119
ames_lasso$lambda.1se  # lambda for this MSE
## [1] 0.01472211
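Because the Lasso zeroes out coefficients, the cv.glmnet object also records how many non-zero coefficients remain at each $\lambda$ (see nzero in the names() output above); output not shown:

# number of predictors retained at the two reference lambdas
ames_lasso$nzero[ames_lasso$lambda == ames_lasso$lambda.min]
ames_lasso$nzero[ames_lasso$lambda == ames_lasso$lambda.1se]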

MSE within one standard error

Model with minimum MSE

plot(ames_lasso, xvar = "lambda")
abline(v = log(ames_lasso$lambda.min), col = "red", lty = "dashed")
abline(v = log(ames_lasso$lambda.1se), col = "red", lty = "dashed")

Lasso for models other than least squares

Advantages and Disadvantages

R code for plotting influential variables

coef(ames_lasso, s = "lambda.1se") %>%
  tidy() %>%
  filter(row != "(Intercept)") %>%
  ggplot(aes(value, reorder(row, value), color = value > 0)) +
  geom_point(show.legend = FALSE) +
  ggtitle("Influential variables") +
  xlab("Coefficient") +
  ylab(NULL)

Plot Influential variables

MSE for Ridge and Lasso

# minimum Ridge MSE
min(ames_ridge$cvm)
## [1] 0.01955871
# minimum Lasso MSE
min(ames_lasso$cvm)
## [1] 0.02246344

Elastic net

Elastic Nets

A generalization of the Ridge and Lasso models is the Elastic Net (Zou and Hastie, 2005), which combines the two penalties.
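In glmnet's parameterization the penalty becomes

$$P_{\lambda,\alpha}(\beta) = \lambda \left[ \frac{1-\alpha}{2} \sum_j \beta_j^2 + \alpha \sum_j |\beta_j| \right],$$

so that $\alpha = 0$ recovers Ridge regression and $\alpha = 1$ recovers the Lasso; intermediate values mix the two penalties.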

Summary overview

Implementation

lasso    <- glmnet(ames_train_x, ames_train_y, alpha = 1.0) 
elastic1 <- glmnet(ames_train_x, ames_train_y, alpha = 0.25) 
elastic2 <- glmnet(ames_train_x, ames_train_y, alpha = 0.75) 
ridge    <- glmnet(ames_train_x, ames_train_y, alpha = 0.0)

The four model results plotted

Tuning the Elastic Net model

# maintain the same folds across all models
fold_id <- sample(1:10, size = length(ames_train_y), 
                  replace=TRUE)

Creation of a tuning grid

# search across a range of alphas
tuning_grid <- tibble::tibble(
  alpha      = seq(0, 1, by = .1),
  mse_min    = NA,
  mse_1se    = NA,
  lambda_min = NA,
  lambda_1se = NA
)

Iteration over $\alpha$ values - Elastic Net

Now we can iterate over each $\alpha$ value, apply a CV Elastic Net, and extract the minimum and one standard error MSE values and their respective $\lambda$ values.

for(i in seq_along(tuning_grid$alpha)) {
  # fit CV model for each alpha value
  fit <- cv.glmnet(ames_train_x, ames_train_y, 
                   alpha = tuning_grid$alpha[i], 
                   foldid = fold_id)
  # extract MSE and lambda values
  tuning_grid$mse_min[i]    <- fit$cvm[fit$lambda == fit$lambda.min]
  tuning_grid$mse_1se[i]    <- fit$cvm[fit$lambda == fit$lambda.1se]
  tuning_grid$lambda_min[i] <- fit$lambda.min
  tuning_grid$lambda_1se[i] <- fit$lambda.1se
}

The resulting tuning grid

tuning_grid
## # A tibble: 11 x 5
##    alpha mse_min mse_1se lambda_min lambda_1se
##    <dbl>   <dbl>   <dbl>      <dbl>      <dbl>
##  1   0    0.0198  0.0227    0.141       0.623 
##  2   0.1  0.0205  0.0237    0.0365      0.134 
##  3   0.2  0.0210  0.0243    0.0182      0.0736
##  4   0.3  0.0213  0.0249    0.0122      0.0539
##  5   0.4  0.0215  0.0249    0.00912     0.0404
##  6   0.5  0.0216  0.0250    0.00729     0.0323
##  7   0.6  0.0217  0.0255    0.00608     0.0296
##  8   0.7  0.0218  0.0255    0.00521     0.0253
##  9   0.8  0.0219  0.0255    0.00456     0.0222
## 10   0.9  0.0219  0.0255    0.00405     0.0197
## 11   1    0.0220  0.0256    0.00365     0.0177
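A quick way to pull out the row with the lowest cross-validated MSE (here the pure Ridge model, $\alpha = 0$):

# alpha value with the lowest cross-validated MSE
tuning_grid[which.min(tuning_grid$mse_min), ]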

Plot the MSE

tuning_grid %>%
  mutate(se = mse_1se - mse_min) %>%
  ggplot(aes(alpha, mse_min)) +
  geom_line(size = 2) +
  geom_ribbon(aes(ymax = mse_min + se, ymin = mse_min - se), 
              alpha = .25) +
  ggtitle("MSE +/- one standard error")

MSE +/- one standard error

Predicting

# fit a final model (here: the lasso)
cv_lasso   <- cv.glmnet(ames_train_x, ames_train_y, alpha = 1.0)
min(cv_lasso$cvm)
## [1] 0.02036225
# predict
pred <- predict(cv_lasso, s = cv_lasso$lambda.min, ames_test_x)
mean((ames_test_y - pred)^2)
## [1] 0.02040651
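Because the model was trained on log(Sale_Price), these predictions are on the log scale; exponentiating converts them back to dollars:

# back-transform predictions to the original Sale_Price scale
head(exp(pred))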

The package caret - Classification and Regression Training

**Vignette for the caret package**

library(caret)
train_control <- trainControl(method = "cv", number = 10)
caret_mod <- train(x = ames_train_x,y = ames_train_y,
                   method = "glmnet",
                   preProc = c("center", "scale", "zv", "nzv"),
                   trControl = train_control,
                   tuneLength = 10)

Output for caret model

caret_mod
## glmnet 
## 
## 2054 samples
##  307 predictor
## 
## Pre-processing: centered (113), scaled (113), remove (194) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1849, 1848, 1847, 1850, 1849, 1849, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE       Rsquared   MAE       
##   0.1    0.0001335259  0.1529759  0.8626587  0.09899496
##   0.1    0.0003084622  0.1529697  0.8626702  0.09899085
##   0.1    0.0007125878  0.1527312  0.8630945  0.09882473
##   0.1    0.0016461703  0.1523040  0.8638461  0.09858775
##   0.1    0.0038028670  0.1518043  0.8647353  0.09829651
##   0.1    0.0087851163  0.1511589  0.8658996  0.09792048
##   0.1    0.0202947586  0.1512482  0.8658861  0.09846301
##   0.1    0.0468835259  0.1542133  0.8616162  0.10095388
##   0.1    0.1083070283  0.1613716  0.8525304  0.10562397
##   0.1    0.2502032890  0.1783894  0.8378278  0.11798124
##   0.2    0.0001335259  0.1530144  0.8625906  0.09901019
##   0.2    0.0003084622  0.1529054  0.8627844  0.09893138
##   0.2    0.0007125878  0.1526093  0.8633042  0.09873037
##   0.2    0.0016461703  0.1520597  0.8642754  0.09843475
##   0.2    0.0038028670  0.1515362  0.8652246  0.09801179
##   0.2    0.0087851163  0.1511130  0.8660407  0.09800962
##   0.2    0.0202947586  0.1531822  0.8629272  0.10015289
##   0.2    0.0468835259  0.1583264  0.8557192  0.10357302
##   0.2    0.1083070283  0.1713989  0.8406963  0.11319420
##   0.2    0.2502032890  0.2015887  0.8196545  0.13607965
##   0.3    0.0001335259  0.1530175  0.8625852  0.09901246
##   0.3    0.0003084622  0.1528417  0.8628943  0.09887464
##   0.3    0.0007125878  0.1524303  0.8636143  0.09862411
##   0.3    0.0016461703  0.1519261  0.8645281  0.09829674
##   0.3    0.0038028670  0.1513782  0.8655399  0.09792851
##   0.3    0.0087851163  0.1517901  0.8649707  0.09874489
##   0.3    0.0202947586  0.1548532  0.8604377  0.10124984
##   0.3    0.0468835259  0.1631931  0.8487612  0.10712019
##   0.3    0.1083070283  0.1804627  0.8317568  0.12000176
##   0.3    0.2502032890  0.2242488  0.8048855  0.15383093
##   0.4    0.0001335259  0.1530032  0.8626107  0.09900104
##   0.4    0.0003084622  0.1527835  0.8629949  0.09883071
##   0.4    0.0007125878  0.1522777  0.8638796  0.09853749
##   0.4    0.0016461703  0.1517960  0.8647653  0.09816823
##   0.4    0.0038028670  0.1512813  0.8657246  0.09790412
##   0.4    0.0087851163  0.1527729  0.8634153  0.09973034
##   0.4    0.0202947586  0.1567219  0.8575524  0.10244652
##   0.4    0.0468835259  0.1674084  0.8431901  0.11030906
##   0.4    0.1083070283  0.1900641  0.8220999  0.12713979
##   0.4    0.2502032890  0.2456477  0.7968665  0.17219318
##   0.5    0.0001335259  0.1529694  0.8626695  0.09897170
##   0.5    0.0003084622  0.1527206  0.8631031  0.09877948
##   0.5    0.0007125878  0.1521743  0.8640690  0.09847427
##   0.5    0.0016461703  0.1517015  0.8649410  0.09808645
##   0.5    0.0038028670  0.1514410  0.8654740  0.09816784
##   0.5    0.0087851163  0.1535069  0.8622717  0.10039370
##   0.5    0.0202947586  0.1585892  0.8546972  0.10366586
##   0.5    0.0468835259  0.1711919  0.8386196  0.11310210
##   0.5    0.1083070283  0.1988715  0.8144491  0.13394394
##   0.5    0.2502032890  0.2677720  0.7874492  0.19208881
##   0.6    0.0001335259  0.1529457  0.8627110  0.09894655
##   0.6    0.0003084622  0.1526476  0.8632329  0.09872871
##   0.6    0.0007125878  0.1521007  0.8642136  0.09841560
##   0.6    0.0016461703  0.1516403  0.8650668  0.09805262
##   0.6    0.0038028670  0.1516698  0.8651021  0.09848230
##   0.6    0.0087851163  0.1541771  0.8612517  0.10084110
##   0.6    0.0202947586  0.1606510  0.8515691  0.10511838
##   0.6    0.0468835259  0.1751425  0.8337697  0.11597593
##   0.6    0.1083070283  0.2072075  0.8085757  0.14014121
##   0.6    0.2502032890  0.2895134  0.7807597  0.21111363
##   0.7    0.0001335259  0.1529182  0.8627588  0.09892152
##   0.7    0.0003084622  0.1525582  0.8633865  0.09868078
##   0.7    0.0007125878  0.1520463  0.8643179  0.09835732
##   0.7    0.0016461703  0.1515912  0.8651754  0.09802856
##   0.7    0.0038028670  0.1519555  0.8646544  0.09881899
##   0.7    0.0087851163  0.1548004  0.8603025  0.10119692
##   0.7    0.0202947586  0.1627999  0.8483414  0.10680467
##   0.7    0.0468835259  0.1792009  0.8286142  0.11899322
##   0.7    0.1083070283  0.2158269  0.8020943  0.14685245
##   0.7    0.2502032890  0.3123587  0.7639857  0.23129960
##   0.8    0.0001335259  0.1528934  0.8628024  0.09889940
##   0.8    0.0003084622  0.1524796  0.8635221  0.09864034
##   0.8    0.0007125878  0.1519877  0.8644285  0.09830023
##   0.8    0.0016461703  0.1515406  0.8652735  0.09802800
##   0.8    0.0038028670  0.1522942  0.8641256  0.09918411
##   0.8    0.0087851163  0.1554853  0.8592392  0.10161487
##   0.8    0.0202947586  0.1649098  0.8451450  0.10842937
##   0.8    0.0468835259  0.1832344  0.8233507  0.12199727
##   0.8    0.1083070283  0.2242570  0.7962765  0.15379654
##   0.8    0.2502032890  0.3360986  0.7248864  0.25185135
##   0.9    0.0001335259  0.1528683  0.8628464  0.09887948
##   0.9    0.0003084622  0.1524011  0.8636585  0.09859931
##   0.9    0.0007125878  0.1519283  0.8645367  0.09825245
##   0.9    0.0016461703  0.1514541  0.8654270  0.09800839
##   0.9    0.0038028670  0.1526743  0.8635214  0.09958496
##   0.9    0.0087851163  0.1562250  0.8580869  0.10208126
##   0.9    0.0202947586  0.1667279  0.8425633  0.10988380
##   0.9    0.0468835259  0.1869510  0.8187765  0.12486550
##   0.9    0.1083070283  0.2328192  0.7897299  0.16118585
##   0.9    0.2502032890  0.3598416  0.6471565  0.27230177
##   1.0    0.0001335259  0.1528421  0.8628890  0.09885896
##   1.0    0.0003084622  0.1523370  0.8637698  0.09856552
##   1.0    0.0007125878  0.1518838  0.8646164  0.09822006
##   1.0    0.0016461703  0.1514135  0.8655058  0.09802221
##   1.0    0.0038028670  0.1530411  0.8629353  0.09994305
##   1.0    0.0087851163  0.1570340  0.8568153  0.10261590
##   1.0    0.0202947586  0.1683894  0.8402729  0.11115096
##   1.0    0.0468835259  0.1904441  0.8148559  0.12743546
##   1.0    0.1083070283  0.2409139  0.7855310  0.16832017
##   1.0    0.2502032890  0.3805446  0.5646772  0.29026265
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.2 and lambda
##  = 0.008785116.
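Two natural follow-ups with the fitted caret model (a sketch; predict() on a train object applies the same pre-processing to the new data):

# winning tuning parameters (alpha = 0.2 and lambda ≈ 0.0088 above)
caret_mod$bestTune

# test-set MSE on the log(Sale_Price) scale
caret_pred <- predict(caret_mod, newdata = ames_test_x)
mean((ames_test_y - caret_pred)^2)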

Which regularization method should we choose?

Advantages and Disadvantages

Further packages

# https://cran.rstudio.com/web/packages/biglasso/biglasso.pdf
install.packages("biglasso")
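A minimal sketch of how biglasso could be applied to the same training data. It works on big.matrix objects from the bigmemory package (which may be file-backed for data that does not fit in memory), so bigmemory is assumed to be installed as well:

library(biglasso)
library(bigmemory)

# convert the numeric feature matrix into a big.matrix
X_big <- as.big.matrix(ames_train_x)

# Lasso fit on the big.matrix backend
fit_big <- biglasso(X_big, ames_train_y, penalty = "lasso")
plot(fit_big)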

A comprehensive beginner's guide to Linear, Ridge and Lasso Regression