Fixel Algorithms

Polynomial Fit with LASSO Regularization#

Notebook by:

Revision History#

| Version | Date       | User        | Content / Changes |
|---------|------------|-------------|-------------------|
| 1.0.000 | 24/03/2024 | Royi Avital | First version     |


# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.linear_model import Lasso
from sklearn.linear_model import lars_path, lasso_path
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

# Miscellaneous
import math
import os
from platform import python_version
import random
import timeit

# Typing
from typing import Callable, Dict, List, Optional, Set, Tuple, Union

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image
from IPython.display import display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout, SelectionSlider

Notations#

  • (?) Question to answer interactively.

  • (!) Simple task to add code for the notebook.

  • (@) Optional / Extra self practice.

  • (#) Note / Useful resource / Food for thought.

Code Notations:

someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler

Code Exercise#

  • Single line fill

valToFill = ???
  • Multi Line to Fill (At least one)

# You need to start writing
????
  • Section to Fill

#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

???
#===============================================================#
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2
# Courses Packages
import sys
sys.path.append('../')
sys.path.append('../../')
sys.path.append('../../../')
from utils.DataVisualization import PlotRegressionData
# General Auxiliary Functions

def PlotPolyFitLasso( vX: np.ndarray, vY: np.ndarray, vP: Optional[np.ndarray] = None, P: int = 1, λ: float = 0.0, 
                     numGridPts: int = 1001, hA: Optional[plt.Axes] = None, figSize: Tuple[int, int] = FIG_SIZE_DEF, 
                     markerSize: int = MARKER_SIZE_DEF, lineWidth: int = LINE_WIDTH_DEF, axisTitle: Optional[str] = None ) -> None:

    if hA is None:
        hF, hA = plt.subplots(1, 2, figsize = figSize)
    else:
        hF = hA[0].get_figure()

    numSamples = len(vY)

    # Polyfit
    if λ == 0:
        # No Lasso (Classic Polyfit)
        vW  = np.polyfit(vX, vY, P)
    else:
        # Lasso
        mX   = PolynomialFeatures(degree = P, include_bias = False).fit_transform(vX[:, None])
        oMdl = Lasso(alpha = λ, fit_intercept = True, max_iter = 500000).fit(mX, vY)
        # Lasso coefficients
        vW   = np.r_[oMdl.coef_[::-1], oMdl.intercept_]
    
    # R2 Score
    vHatY = np.polyval(vW, vX)
    R2    = r2_score(vY, vHatY)
    
    # Plot
    xx  = np.linspace(np.around(np.min(vX), decimals = 1) - 0.1, np.around(np.max(vX), decimals = 1) + 0.1, numGridPts)
    yy  = np.polyval(vW, xx)

    hA[0].plot(vX, vY, '.r', ms = 10, label = '$y_i$')
    hA[0].plot(xx, yy, 'b',  lw = 2,  label = r'$\hat{f}(x)$')
    hA[0].set_title(f'P = {P}, R2 = {R2:0.3f}')
    hA[0].set_xlabel('$x$')
    hA[0].set_xlim(left = xx[0], right = xx[-1])
    hA[0].set_ylim(bottom = np.floor(np.min(vY)), top = np.ceil(np.max(vY)))
    hA[0].grid()
    hA[0].legend()
    
    hA[1].stem(vW[::-1], label = 'Estimated')
    if vP is not None:
        hA[1].stem(vP[::-1], linefmt = 'g', markerfmt = 'gD', label = 'Ground Truth')
    hA[1].set_title('Coefficients')
    hA[1].set_xlabel('$w$')
    hA[1].set_ylim(bottom = -2, top = 6)
    hA[1].legend()

Polynomial Fit with Regularization#

The Degrees of Freedom (DoF) of a polynomial model depend mainly on the polynomial degree.
One way to regularize the model is through the degree parameter.
Yet this parameter is discrete, hence it is harder to tune and gives only coarse control over the Bias and Variance tradeoff.

The regularized model is given by:

\[ \arg \min_{\boldsymbol{w}} \frac{1}{2} {\left\| X \boldsymbol{w} - \boldsymbol{y} \right\|}_{2}^{2} + \lambda R \left( \boldsymbol{w} \right) \]

Where \(R \left( \cdot \right)\) is the regularizer and \(\lambda \geq 0\) is the continuous regularization parameter, where a higher value means stronger regularization.
The properties of the solution are determined by the regularizer (sparsity, low values, etc…).
The continuous regularization parameter allows smoother and more finely tuned regularization.

  • (#) The model above can, in many cases, be interpreted as a prior on the values of the parameters as in Bayesian Estimation context.
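As a concrete illustration of the objective above, here is a minimal sketch (with hypothetical data, not the notebook's) using the Ridge regularizer \(R \left( \boldsymbol{w} \right) = {\left\| \boldsymbol{w} \right\|}_{2}^{2}\), whose solution has the closed form \(\boldsymbol{w} = {\left( \boldsymbol{X}^{T} \boldsymbol{X} + \lambda I \right)}^{-1} \boldsymbol{X}^{T} \boldsymbol{y}\):

```python
import numpy as np

# A minimal sketch of the regularized objective with the Ridge regularizer.
# The data is hypothetical, mimicking a noisy quadratic model.
np.random.seed(0)
vX = np.linspace(-2, 2, 50)
vY = 0.5 * (vX ** 2) + 2 * vX + 5 + 0.3 * np.random.randn(50)

P = 15                     #<! Intentionally over parameterized degree
mX = np.vander(vX, P + 1)  #<! Model matrix (highest degree first)
λ = 0.1                    #<! Continuous regularization parameter

vW  = np.linalg.solve(mX.T @ mX + λ * np.eye(P + 1), mX.T @ vY) #<! Ridge solution
vW0 = np.linalg.lstsq(mX, vY, rcond = None)[0]                  #<! λ = 0 (no regularization)

# Regularization shrinks the coefficients: ||vW|| <= ||vW0||
print(np.linalg.norm(vW), np.linalg.norm(vW0))
```

With \(\lambda > 0\) the matrix \(\boldsymbol{X}^{T} \boldsymbol{X} + \lambda I\) is strictly positive definite, so the solution is unique even when the polynomial degree is high.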

# Parameters

# Data Generation
numSamples  = 50
noiseStd    = 0.3
vP = np.array([0.5, 2, 5])

# Model
polyDeg = 2
λ       = 0.1

# Data Visualization
gridNoiseStd = 0.05
numGridPts = 250

Generate / Load Data#

In the following, we generate data according to the model:

\[ y_{i} = f \left( x_{i} \right) + \epsilon_{i} \]

Where

\[ f \left( x \right) = \frac{1}{2} x^{2} + 2x + 5 \]
# The Data Generating Function

def f( vX: np.ndarray, vP: np.ndarray ) -> np.ndarray:
    # return 0.5 * (vX ** 2) + 2 * vX + 5
    return np.polyval(vP, vX)


hF = lambda vX: f(vX, vP)
# Generate Data

vX = np.linspace(-2, 2, numSamples, endpoint = True) + (gridNoiseStd * np.random.randn(numSamples))
vN = noiseStd * np.random.randn(numSamples)
vY = hF(vX) + vN

print(f'The features data shape: {vX.shape}')
print(f'The labels data shape: {vY.shape}')
The features data shape: (50,)
The labels data shape: (50,)

Plot Data#

# Plot the Data

PlotRegressionData(vX, vY)

plt.show()

Train Polyfit Regressor with LASSO Regularization#

The \({L}_{1}\) regularized PolyFit optimization problem is given by:

\[ \arg \min_{\boldsymbol{w}} \frac{1}{2} {\left\| X \boldsymbol{w} - \boldsymbol{y} \right\|}_{2}^{2} + \lambda {\left\| \boldsymbol{w} \right\|}_{1} \]

Where

\[\begin{split} \boldsymbol{X} = \begin{bmatrix} 1 & x_{1} & x_{1}^{2} & \cdots & x_{1}^{p} \\ 1 & x_{2} & x_{2}^{2} & \cdots & x_{2}^{p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{N} & x_{N}^{2} & \cdots & x_{N}^{p} \end{bmatrix} \end{split}\]

This regularization is called Least Absolute Shrinkage and Selection Operator (LASSO).
Since the \({L}_{1}\) norm promotes sparsity, the model effectively has a built in feature selector.
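A quick sketch (with toy values) to confirm the structure of \(\boldsymbol{X}\) above and how it maps to scikit-learn's `PolynomialFeatures`:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Build the polynomial model matrix X two equivalent ways
vX = np.array([1.0, 2.0, 3.0])
p  = 2

mX1 = np.column_stack([vX ** k for k in range(p + 1)]) #<! Columns [1, x, x^2, ..., x^p]
mX2 = PolynomialFeatures(degree = p, include_bias = True).fit_transform(vX[:, None])

assert np.allclose(mX1, mX2)
```

The notebook uses `include_bias = False` and lets `Lasso(fit_intercept = True)` handle the constant column, so the intercept itself is not penalized.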


# Polynomial Fit with Lasso Regularization

mX         = PolynomialFeatures(degree = polyDeg, include_bias = False).fit_transform(vX[:, None]) #<! Build the model matrix
oLinRegL1  = Lasso(alpha = λ, fit_intercept = True, max_iter = 30000).fit(mX, vY)
vW         = np.r_[oLinRegL1.coef_[::-1], oLinRegL1.intercept_]

# Display the weights
vW
array([0.48820408, 1.95534482, 5.04258165])

Plot Regressor for Various Regularization (λ) Values#

Let’s see the effect of the strength of the regularization on the data.

hPolyFitLasso = lambda λ: PlotPolyFitLasso(vX, vY, vP = vP, P = 15, λ = λ)
lamSlider = FloatSlider(min = 0, max = 1, step = 0.001, value = 0, readout_format = '.4f', layout = Layout(width = '30%'))
interact(hPolyFitLasso, λ = lamSlider)
plt.show()
  • (?) How do you expect the \({R}^{2}\) score to behave with increasing \(\lambda\)?
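To probe the question, here is a minimal sketch (regenerating hypothetical data along the lines above) that tracks the train \({R}^{2}\) as \(\lambda\) grows; since a larger \(\lambda\) only constrains the fit, the training score should not increase:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data mimicking the generation above
np.random.seed(512)
vX = np.linspace(-2, 2, 50)
vY = 0.5 * (vX ** 2) + 2 * vX + 5 + 0.3 * np.random.randn(50)
mX = PolynomialFeatures(degree = 15, include_bias = False).fit_transform(vX[:, None])

# Train R2 per λ
lR2 = []
for λ in [0.001, 0.1, 10.0, 1000.0]:
    oMdl = Lasso(alpha = λ, max_iter = 500000).fit(mX, vY)
    lR2.append(r2_score(vY, oMdl.predict(mX)))

print(lR2) #<! Non increasing sequence (up to solver tolerance)
```

Note this is the *train* score; the validation \({R}^{2}\) typically peaks at an intermediate \(\lambda\) due to the Bias and Variance tradeoff.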

Lasso Path for Feature Importance#

The point along the path at which a feature's coefficient rises from zero reflects how strongly the feature is correlated with the target.
Hence we can use the Lasso Path for feature selection / significance.

  • (#) The LASSO checks the conditional correlation, namely the contribution of a feature given the specific combination of the other features,
    while selection based on correlation uses the marginal correlation, namely a statistic of a specific feature on its own.
    In practice, LASSO can potentially make a better selection when there are inter correlations between the features.

  • (#) See Partial / Conditional Correlation vs. Marginal Correlation.
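A small synthetic sketch (hypothetical data, not the data set below) of the difference: a feature that nearly duplicates another has a high marginal correlation with the target, yet the LASSO path assigns it a near zero coefficient since its conditional contribution is redundant:

```python
import numpy as np
from sklearn.linear_model import lasso_path

# Synthetic data: x2 nearly duplicates x1; only x1 and x3 drive y
np.random.seed(0)
numSamples = 200
vX1 = np.random.randn(numSamples)
vX2 = vX1 + 0.05 * np.random.randn(numSamples) #<! Almost a copy of x1
vX3 = np.random.randn(numSamples)
vY  = 2 * vX1 + 0.5 * vX3 + 0.1 * np.random.randn(numSamples)
mX  = np.column_stack([vX1, vX2, vX3])

# Marginal correlation ranks the redundant x2 almost as high as x1
vCorr = np.array([abs(np.corrcoef(mX[:, ii], vY)[0, 1]) for ii in range(3)])

# Along the LASSO path, x1 absorbs the shared signal and x2 stays near zero
vAlphas, mCoefs, _ = lasso_path(mX, vY)
print(vCorr)                 #<! x2's marginal correlation rivals x1's
print(np.abs(mCoefs[:, 50])) #<! Mid path: x1 large, x2 (near) zero
```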

# Data from https://gist.github.com/seankross/a412dfbd88b3db70b74b
# mpg - Miles per Gallon
# cyl - # of cylinders
# disp - displacement, in cubic inches
# hp - horsepower
# drat - driveshaft ratio
# wt - weight
# qsec - 1/4 mile time; a measure of acceleration
# vs - 'V' or straight - engine shape
# am - transmission; auto or manual
# gear - # of gears
# carb - # of carburetors
dfMpg = pd.read_csv('https://github.com/FixelAlgorithmsTeam/FixelCourses/raw/master/DataSets/mtcars.csv')
dfMpg
model mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
5 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
7 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
8 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
9 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
10 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
11 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
12 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
13 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
14 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
15 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
16 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
17 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
18 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
19 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
20 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
21 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
22 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
23 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
24 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
25 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
26 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
27 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
28 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
29 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
30 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
31 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
# Data for Analysis
# The target data is the fuel consumption (`mpg`).
dfX = dfMpg.drop(columns = ['model', 'mpg'], inplace = False)
dfX = (dfX - dfX.mean()) / dfX.std() #<! Normalize
dsY = dfMpg['mpg'].copy() #<! Data Series
# LASSO Path Analysis

alphasPath, coefsPath, *_ = lasso_path(dfX, dsY)
# alphasPath, coefsPath, *_ = lars_path(dfX, dsY, method = 'lasso')
# Display the LASSO Path

hF, hA = plt.subplots(figsize = (16, 8))
hA.plot(alphasPath, np.abs(coefsPath.T), lw = 2, label = dfX.columns.to_list())
hA.set_title('The Lasso Path')
hA.set_xlabel(r'$\lambda$')
hA.set_ylabel('Coefficient Value (${w}_{i}$)')
hA.legend()
plt.show()
  • (#) Feature selection can be an objective on its own. For instance, think of the questionnaire an insurance company uses to assess the risk of a customer.
    Achieving the same accuracy in the risk assessment with fewer questions (features) is valuable on its own.
    See Why LASSO for Feature Selection.

Test Cross Correlation of Top Features#

sns.heatmap(dfX.corr(), annot = True, cmap = 'coolwarm', center = 0, fmt = '.2f')
<Axes: >