Fixel Algorithms

Heart Disease Tree Classifier#

Classification - Decision Tree Classifier#

Notebook by: Royi Avital

Revision History#

Version | Date       | User        | Content / Changes
--------|------------|-------------|------------------
1.0.000 | 23/03/2024 | Royi Avital | First version


# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.datasets import fetch_openml
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier


# Miscellaneous
import math
import os
from platform import python_version
import random
import timeit

# Typing
from typing import Callable, Dict, List, Optional, Set, Tuple, Union

# Visualization
import matplotlib as mpl
from matplotlib.colors import LogNorm
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image
from IPython.display import display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout, SelectionSlider

Notations#

  • (?) Question to answer interactively.

  • (!) Simple task to add code for the notebook.

  • (@) Optional / Extra self practice.

  • (#) Note / Useful resource / Food for thought.

Code Notations:

someVar    = 2 #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler

Code Exercise#

  • Single line fill

valToFill = ???
  • Multi Line to Fill (At least one)

# You need to start writing
????
  • Section to Fill

#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

???
#===============================================================#
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())
# Constants

FIG_SIZE_DEF = (8, 8)
ELM_SIZE_DEF = 50
CLASS_COLOR = ('b', 'r')
# Courses Packages
import sys
sys.path.append('../')
sys.path.append('../../')
sys.path.append('../../../')
from utils.DataVisualization import PlotConfusionMatrix, PlotLabelsHistogram
# General Auxiliary Functions

Exercise - Decision Tree#

In this exercise we'll use the Decision Tree model as a classifier.
The SciKit Learn library implements it in the DecisionTreeClassifier class.

We'll use the Heart Disease Data Set (also known as the Cleveland Heart Disease data set).
The data set contains binary and categorical features, which Decision Trees excel at utilizing.

The data set has the following columns:

  • age: Age in years.

  • sex: Sex (1: male; 0: female).

  • cp: Chest pain type: {0: typical angina, 1: atypical angina, 2: non-anginal pain, 3: asymptomatic}.

  • trestbps: Resting blood pressure (in mm Hg on admission to the hospital).

  • chol: Serum cholesterol in mg/dl.

  • fbs: Check fasting blood sugar: {1: Above 120 mg/dl, 0: Below 120 mg/dl}.

  • restecg: Resting electrocardiographic results: {0: normal, 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria}.

  • thalach: Maximum heart rate achieved.

  • exang: Exercise induced angina: {1: yes, 0: no}.

  • oldpeak: ST depression induced by exercise relative to rest.

  • slope: The slope of the peak exercise ST segment: {0: upsloping, 1: flat, 2: downsloping}.

  • ca: Number of major vessels (0-3) colored by flourosopy.

  • thal: {0: normal, 1: fixed defect, 2: reversable defect}.

  • num: The target variable: {0: <50 (No disease), 1: >50_1 (disease)}.

The exercise will also show the process of handling real world data: removing invalid data, mapping values, etc.

In this exercise we'll do the following:

  1. Load the Heart Disease Data Set using fetch_openml() with the id heart-c.

  2. Validate data.

  3. Convert text to numerical data (Though still as categorical data).

  4. Train a decision tree.

  5. Optimize the parameters: criterion and max_leaf_nodes by the f1 score.

  6. Train the optimal model on all data.

  7. Display the Confusion Matrix and extract the different types of predictions.

  8. Show the feature importance rank of the model.

  • (#) In order to let the classifier know the data is binary / categorical we’ll use a Data Frame as the data structure.

# Parameters

#===========================Fill This===========================#
# 1. Set the options for the `criterion` parameter (Use all options).
# 2. Set the options for the `max_leaf_nodes` parameter.
lCriterion   = ['gini', 'entropy', 'log_loss'] #<! List
lMaxLeaf     = np.linspace(2,20,10) #<! List
#===============================================================#

Generate / Load Data#

# Load Data

dfData, dsY = fetch_openml('heart-c', version = 1, return_X_y = True, as_frame = True, parser = 'auto')

print(f'The data shape: {dfData.shape}')
print(f'The labels shape: {dsY.shape}')
print(f'The unique values of the labels: {dsY.unique()}')
The data shape: (303, 13)
The labels shape: (303,)
The unique values of the labels: ['<50', '>50_1']
Categories (2, object): ['<50', '>50_1']

Plot Data#

# Distribution of Labels

hA = PlotLabelsHistogram(dsY)
plt.show()
  • (?) Is the data balanced?

Pre Process Data#

In this section we’ll transform the data into features which the algorithms can work with.

Remove Missing / Undefined Values#

There are 3 main strategies for dealing with missing values:

  1. Use a model to interpolate (impute) them.

  2. Remove the sample.

  3. Remove the feature.

The choice between (2) and (3) depends on the occurrence of the missing values.
If there is a feature which is dominated by missing values, we might want to consider removing it.
Otherwise, we’ll remove samples with missing values.

  • (#) In the case of a large data set we might even build different models for different combinations of features.

  • (#) If missing values can happen in production, we need to think of a strategy that holds in that case as well.

  • (#) In practice, another factor to take into account is the importance of the feature.
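The three strategies above can be sketched with Pandas on a toy frame (hypothetical data, illustrative values only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values, for illustration only
dfToy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan], 'c': [7.0, 8.0, 9.0]})

dfDropSamples  = dfToy.dropna(axis = 0)     #<! Strategy (2): remove samples with missing values
dfDropFeatures = dfToy.dropna(axis = 1)     #<! Strategy (3): remove features with missing values
dfImputed      = dfToy.fillna(dfToy.mean()) #<! Strategy (1): a simple model (column mean) to impute

print(dfDropSamples.shape)  #<! (1, 3)
print(dfDropFeatures.shape) #<! (3, 1)
```

In production one would typically fit an imputer (for example SciKit Learn's SimpleImputer) on the train set only, so the same statistics are applied to new samples.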

# Null / NA / NaN Matrix

dfData['Positive'] = dsY #<! Merge data

#===========================Fill This===========================#
# 1. Calculate the logical map of invalid values using the method `isna()`.
dfInvData = dfData.isna() #<! The logical matrix (DF) of invalid values
#===============================================================#

hF, hA = plt.subplots(figsize = FIG_SIZE_DEF)
sns.heatmap(data = dfInvData, square = False, ax = hA)
hA.set_title('Invalid Data Map')

plt.show()

print(f'The features data shape: {dfData.shape}')
The features data shape: (303, 14)
  • (?) Given the results above, would you remove a feature or few samples?

# Remove NaN / Null Values

#===========================Fill This===========================#
# 1. Remove the NaN / Null values. Use `dropna()`.
# !! Choose the correct policy (Remove samples or features) by `axis`.
dfX = dfData.dropna(axis = 0) #<! Remove samples
#===============================================================#

print(f'The features data shape: {dfX.shape}')
The features data shape: (296, 14)
# Drop Duplicate Rows

#===========================Fill This===========================#
# 1. Drop duplicate rows (Samples) using the method `drop_duplicates()`.
# 2. Reset index using the method `reset_index()` .
dfX = dfX.drop_duplicates()
print(f'dfX shape after drop: {dfX.shape}')

dfX = dfX.reset_index(drop = True)
print(f'dfX shape after reset_index: {dfX.shape}')
#===============================================================#
dfX = dfX.astype(dtype = {'ca': np.int64}) #<! It's integer mistyped as Float64
dfX shape after drop: (296, 14)
dfX shape after reset_index: (296, 14)
# Split Data & Labels
# Create the X, y data.

dsY = dfX['Positive']
dfX = dfX.drop(columns = ['Positive'])
# Distribution of Labels

hA = PlotLabelsHistogram(dsY)
plt.show()

Convert Data into Numeric Values#

Some of the categorical and binary data is given by text values.
It is better to convert them into numerical values (Though some models can work with them as is).
For some visualizations, the textual data is great, hence we keep it.

# Features Type

# See the Type of the Features
dfX.info() #<! Look at `dfX.dtypes`
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 296 entries, 0 to 295
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   age       296 non-null    int64   
 1   sex       296 non-null    category
 2   cp        296 non-null    category
 3   trestbps  296 non-null    int64   
 4   chol      296 non-null    int64   
 5   fbs       296 non-null    category
 6   restecg   296 non-null    category
 7   thalach   296 non-null    int64   
 8   exang     296 non-null    category
 9   oldpeak   296 non-null    float64 
 10  slope     296 non-null    category
 11  ca        296 non-null    int64   
 12  thal      296 non-null    category
dtypes: category(7), float64(1), int64(5)
memory usage: 17.0 KB
# Lists of the Features Type

lBinaryFeature  = ['sex', 'fbs', 'exang']
lCatFeature     = ['cp', 'restecg', 'slope', 'thal']
lNumFeature     = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'ca']
# Creating a Copy (Numerical)

#===========================Fill This===========================#
# 1. Create a copy (Not a view) using the method `copy()`.
dfXNum = dfX.copy()
#===============================================================#
# Encode Binary Categorical Features
# Usually this is done using `Binarizer` and `OrdinalEncoder`.
# Yet there is a defined mapping in the data description which will be used.

dSex        = {'female': 0, 'male': 1}
dCp         = {'typ_angina': 0, 'atyp_angina': 1, 'non_anginal': 2, 'asympt': 3}
dFbs        = {'f': 0, 't': 1}
dRestEcg    = {'normal': 0, 'st_t_wave_abnormality': 1, 'left_vent_hyper': 2}
dExAng      = {'no': 0, 'yes': 1}
dSlope      = {'up': 0, 'flat': 1, 'down': 2}
dThal       = {'normal': 0, 'fixed_defect': 1, 'reversable_defect': 2}

dMapper = {'sex': dSex, 'fbs': dFbs, 'exang': dExAng, 'cp': dCp, 'restecg': dRestEcg, 'slope': dSlope, 'thal': dThal}

for colName in (lBinaryFeature + lCatFeature):
    # dMapping = dMapper[colName]
    dfXNum[colName] = dfXNum[colName].map(dMapper[colName])
dfXNum.head()
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
0 63 1 0 145 233 1 2 150 0 2.3 2 0 1
1 67 1 3 160 286 0 2 108 1 1.5 1 3 0
2 67 1 3 120 229 0 2 129 1 2.6 1 2 2
3 37 1 2 130 250 0 0 187 0 3.5 2 0 0
4 41 0 1 130 204 0 2 172 0 1.4 0 0 0
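For reference, the same encoding could be produced with SciKit Learn's OrdinalEncoder; a minimal sketch on a hypothetical toy column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column, for illustration only
dfToy = pd.DataFrame({'thal': ['normal', 'fixed_defect', 'reversable_defect', 'normal']})

# Passing `categories` explicitly keeps the same order as the manual mapping above
oEnc = OrdinalEncoder(categories = [['normal', 'fixed_defect', 'reversable_defect']])
vEnc = oEnc.fit_transform(dfToy[['thal']]).ravel()
print(vEnc) #<! [0. 1. 2. 0.]
```

Without an explicit `categories` list the encoder orders categories alphabetically, which may not match the data description.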
# Encode the Labels

#===========================Fill This===========================#
# 1. Create a dictionary which maps the string `<50` to 0 and `>50_1` to 1.
# 2. Apply a mapping on `dsY` using the method `map()`.
dMapY   = {'<50': 0, '>50_1': 1} #<! Mapping dictionary
dsY     = dsY.map(dMapY)
#===============================================================#

dsY = dsY.rename('Positive')
dsY.head()
0    0
1    1
2    1
3    0
4    0
Name: Positive, dtype: category
Categories (2, int64): [0, 1]

Exploratory Data Analysis (EDA)#

This is the stage we’re trying to infer insights on the data using visualizations.
This is a skill which requires experience and creativity.

We’ll do some very basic operations for this data set.
We'll see the distribution of each feature for the 2 label values.

# Binary Data

numFeatures = len(lBinaryFeature)
hF, hA = plt.subplots(1, numFeatures, figsize = (24, 8))
hA = hA.flat

for ii, colName in enumerate(lBinaryFeature):
    sns.histplot(data = dfX, x = colName, hue = dsY, discrete = True, multiple = 'dodge', ax = hA[ii])
# Categorical Data

numFeatures = len(lCatFeature)
hF, hA = plt.subplots(1, numFeatures, figsize = (24, 8))
hA = hA.flat

for ii, colName in enumerate(lCatFeature):
    sns.histplot(data = dfX, x = colName, hue = dsY, discrete = True, multiple = 'dodge', ax = hA[ii])
# Numerical Data

lDiscreteData = []

numFeatures = len(lNumFeature)
hF, hA = plt.subplots(1, numFeatures, figsize = (24, 8))
hA = hA.flat

for ii, colName in enumerate(lNumFeature):
    # if pd.api.types.is_integer_dtype(dfX[colName]):
    #     sns.histplot(data = dfX, x = colName, hue = dsY, discrete = True, multiple = 'dodge', ax = hA[ii])
    if colName == 'ca':
        sns.histplot(data = dfX, x = colName, hue = dsY, discrete = True, multiple = 'dodge', ax = hA[ii])
    else:
        sns.kdeplot(data = dfX, x = colName, hue = dsY, fill = True, common_norm = True, ax = hA[ii])
  • (?) How would you handle the case where a feature has a single value? Look at VarianceThreshold.

  • (#) Usually the work on features includes a process to select the best of them. For example, a brute force method is given by SelectKBest.
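As a minimal sketch of the VarianceThreshold idea mentioned above (toy data, illustrative values):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: the first column has a single value (zero variance)
mX = np.array([[0.0, 2.0, 1.1], [0.0, 1.5, 0.9], [0.0, 2.2, 1.3]])

oVarTh = VarianceThreshold(threshold = 0.0) #<! Removes zero variance features
mXSel  = oVarTh.fit_transform(mX)
print(mXSel.shape) #<! (3, 2): the constant column is removed
```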

Train a Decision Tree Model and Optimize Hyper Parameters#

In this section we'll optimize the model according to the F1 score.
The F1 score is the harmonic mean of the precision and the recall.
Hence it handles imbalanced data reasonably well (though this case is not severely imbalanced).

We'll use the f1_score() function to calculate the measure.
The process to optimize the hyper parameters is as follows:

  1. Build a data frame to keep the scoring of the different hyper parameters combination.

  2. Optimize the model:

  • Construct a model using the current combination of hyper parameters.

  • Apply a cross validation process to predict the data using cross_val_predict().

  • As the cross validation iterator (the cv parameter), use KFold to implement the Leave One Out policy.

  3. Calculate the F1 score of the predicted classes.

  4. Store the result in the performance data frame.

  • (#) Pay attention that while we optimize the hyper parameters according to the F1 score, the model itself has a different loss function.
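A minimal check of the two building blocks above, on hypothetical toy labels: F1 as the harmonic mean of precision and recall, and KFold with n_splits equal to the number of samples acting as Leave One Out:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import KFold, LeaveOneOut

# Hypothetical toy labels, for illustration only
vY     = np.array([0, 0, 1, 1, 1, 0, 1])
vYPred = np.array([0, 1, 1, 1, 0, 0, 1])

# F1 is the harmonic mean of precision and recall
precision = precision_score(vY, vYPred) #<! TP / (TP + FP)
recall    = recall_score(vY, vYPred)    #<! TP / (TP + FN)
f1Manual  = 2 * precision * recall / (precision + recall)
print(np.isclose(f1Manual, f1_score(vY, vYPred))) #<! True

# KFold with `n_splits` equal to the number of samples is Leave One Out
lKF  = list(KFold(n_splits = len(vY)).split(vY))
lLOO = list(LeaveOneOut().split(vY))
print(all(np.array_equal(a[1], b[1]) for a, b in zip(lKF, lLOO))) #<! True
```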

# Creating the Data Frame

#===========================Fill This===========================#
# 1. Calculate the number of combinations.
# 2. Create a nested loop to create the combinations between the parameters.
# 3. Store the combinations as the columns of a data frame.

# For advanced Python users: use `itertools.product()` to create the Cartesian product
numComb = len(lCriterion) * len(lMaxLeaf)  #<! Number combinations
dData   = {'Criterion': [], 'Max Leaves': [], 'F1': [0.0] * numComb}  #<! Dictionary (To create the DF from)

for ii, paramCriteria in enumerate(lCriterion):
    for jj, maxLeaf in enumerate(lMaxLeaf):
        dData['Criterion'].append(paramCriteria)
        dData['Max Leaves'].append(maxLeaf)
        
#===============================================================#

# The DF: Each row is a combination to evaluate.
# The columns are the parameters and the F1 score.
dfModelScore = pd.DataFrame(data = dData)

# Display the DF
dfModelScore
Criterion Max Leaves F1
0 gini 2.0 0.0
1 gini 4.0 0.0
2 gini 6.0 0.0
3 gini 8.0 0.0
4 gini 10.0 0.0
5 gini 12.0 0.0
6 gini 14.0 0.0
7 gini 16.0 0.0
8 gini 18.0 0.0
9 gini 20.0 0.0
10 entropy 2.0 0.0
11 entropy 4.0 0.0
12 entropy 6.0 0.0
13 entropy 8.0 0.0
14 entropy 10.0 0.0
15 entropy 12.0 0.0
16 entropy 14.0 0.0
17 entropy 16.0 0.0
18 entropy 18.0 0.0
19 entropy 20.0 0.0
20 log_loss 2.0 0.0
21 log_loss 4.0 0.0
22 log_loss 6.0 0.0
23 log_loss 8.0 0.0
24 log_loss 10.0 0.0
25 log_loss 12.0 0.0
26 log_loss 14.0 0.0
27 log_loss 16.0 0.0
28 log_loss 18.0 0.0
29 log_loss 20.0 0.0
# Optimize the Model

#===========================Fill This===========================#
# 1. Iterate over each row of the data frame `dfModelScore`. Each row defines the hyper parameters.
# 2. Construct the model.
# 3. Train it on the Train Data Set.
# 4. Calculate the score.
# 5. Store the score into the data frame column.

for ii in range(numComb):
    paramCriteria    = dfModelScore['Criterion'][ii]
    maxLeaf          = dfModelScore['Max Leaves'][ii]

    print(f'Processing model {ii + 1:03d} out of {numComb} with `criterion` = {paramCriteria} and `max_leaf_nodes` = {maxLeaf}.')

    oDecTreeCls = DecisionTreeClassifier(criterion = paramCriteria, max_leaf_nodes = int(maxLeaf), random_state = 0) #<! The model with the hyper parameters of the current combination
    
    vYPred = cross_val_predict(oDecTreeCls, dfXNum, dsY, cv = KFold(n_splits = dfXNum.shape[0])) #<! Predict by cross validation

    f1Score = f1_score(dsY, vYPred)
    dfModelScore.loc[ii, 'F1'] = f1Score
    print(f'Finished processing model {ii + 1:03d} with `f1 = {f1Score}.')
#===============================================================#
Processing model 001 out of 30 with `criterion` = gini and `max_leaf_nodes` = 2.0.
Finished processing model 001 with `f1 = 0.7434944237918215.
Processing model 002 out of 30 with `criterion` = gini and `max_leaf_nodes` = 4.0.
Finished processing model 002 with `f1 = 0.6899563318777293.
Processing model 003 out of 30 with `criterion` = gini and `max_leaf_nodes` = 6.0.
Finished processing model 003 with `f1 = 0.7984496124031008.
Processing model 004 out of 30 with `criterion` = gini and `max_leaf_nodes` = 8.0.
Finished processing model 004 with `f1 = 0.7380952380952381.
Processing model 005 out of 30 with `criterion` = gini and `max_leaf_nodes` = 10.0.
Finished processing model 005 with `f1 = 0.7330677290836654.
Processing model 006 out of 30 with `criterion` = gini and `max_leaf_nodes` = 12.0.
Finished processing model 006 with `f1 = 0.728.
Processing model 007 out of 30 with `criterion` = gini and `max_leaf_nodes` = 14.0.
Finished processing model 007 with `f1 = 0.7193675889328063.
Processing model 008 out of 30 with `criterion` = gini and `max_leaf_nodes` = 16.0.
Finished processing model 008 with `f1 = 0.7159533073929961.
Processing model 009 out of 30 with `criterion` = gini and `max_leaf_nodes` = 18.0.
Finished processing model 009 with `f1 = 0.7315175097276264.
Processing model 010 out of 30 with `criterion` = gini and `max_leaf_nodes` = 20.0.
Finished processing model 010 with `f1 = 0.7315175097276264.
Processing model 011 out of 30 with `criterion` = entropy and `max_leaf_nodes` = 2.0.
Finished processing model 011 with `f1 = 0.7434944237918215.
Processing model 012 out of 30 with `criterion` = entropy and `max_leaf_nodes` = 4.0.
Finished processing model 012 with `f1 = 0.6796875.
Processing model 013 out of 30 with `criterion` = entropy and `max_leaf_nodes` = 6.0.
Finished processing model 013 with `f1 = 0.7346938775510204.
Processing model 014 out of 30 with `criterion` = entropy and `max_leaf_nodes` = 8.0.
Finished processing model 014 with `f1 = 0.7295081967213115.
Processing model 015 out of 30 with `criterion` = entropy and `max_leaf_nodes` = 10.0.
Finished processing model 015 with `f1 = 0.7107438016528925.
Processing model 016 out of 30 with `criterion` = entropy and `max_leaf_nodes` = 12.0.
Finished processing model 016 with `f1 = 0.7228915662650602.
Processing model 017 out of 30 with `criterion` = entropy and `max_leaf_nodes` = 14.0.
Finished processing model 017 with `f1 = 0.749034749034749.
Processing model 018 out of 30 with `criterion` = entropy and `max_leaf_nodes` = 16.0.
Finished processing model 018 with `f1 = 0.7193675889328063.
Processing model 019 out of 30 with `criterion` = entropy and `max_leaf_nodes` = 18.0.
Finished processing model 019 with `f1 = 0.7294117647058823.
Processing model 020 out of 30 with `criterion` = entropy and `max_leaf_nodes` = 20.0.
Finished processing model 020 with `f1 = 0.7244094488188977.
Processing model 021 out of 30 with `criterion` = log_loss and `max_leaf_nodes` = 2.0.
Finished processing model 021 with `f1 = 0.7434944237918215.
Processing model 022 out of 30 with `criterion` = log_loss and `max_leaf_nodes` = 4.0.
Finished processing model 022 with `f1 = 0.6796875.
Processing model 023 out of 30 with `criterion` = log_loss and `max_leaf_nodes` = 6.0.
Finished processing model 023 with `f1 = 0.7346938775510204.
Processing model 024 out of 30 with `criterion` = log_loss and `max_leaf_nodes` = 8.0.
Finished processing model 024 with `f1 = 0.7295081967213115.
Processing model 025 out of 30 with `criterion` = log_loss and `max_leaf_nodes` = 10.0.
Finished processing model 025 with `f1 = 0.7107438016528925.
Processing model 026 out of 30 with `criterion` = log_loss and `max_leaf_nodes` = 12.0.
Finished processing model 026 with `f1 = 0.7228915662650602.
Processing model 027 out of 30 with `criterion` = log_loss and `max_leaf_nodes` = 14.0.
Finished processing model 027 with `f1 = 0.749034749034749.
Processing model 028 out of 30 with `criterion` = log_loss and `max_leaf_nodes` = 16.0.
Finished processing model 028 with `f1 = 0.7193675889328063.
Processing model 029 out of 30 with `criterion` = log_loss and `max_leaf_nodes` = 18.0.
Finished processing model 029 with `f1 = 0.7294117647058823.
Processing model 030 out of 30 with `criterion` = log_loss and `max_leaf_nodes` = 20.0.
Finished processing model 030 with `f1 = 0.7244094488188977.
  • (?) Should we use Stratified K Fold above? Why?

  • (#) Pay attention to the speed of the process. This is one of the main advantages of models based on trees.
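Regarding the stratification question, a small sketch (hypothetical toy labels) showing that StratifiedKFold keeps the class ratio in each fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

vY = np.array([0] * 8 + [1] * 4) #<! Imbalanced toy labels (ratio 2:1)
oSkf = StratifiedKFold(n_splits = 4, shuffle = False)

# Count the labels in each test fold
lCounts = [np.bincount(vY[vTestIdx]).tolist() for _, vTestIdx in oSkf.split(np.zeros((len(vY), 1)), vY)]
print(lCounts) #<! [[2, 1], [2, 1], [2, 1], [2, 1]]
```

With Leave One Out each test fold holds a single sample, so stratification is not applicable there.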

# Display Sorted Results (Descending)
# Pandas allows sorting data by any column using the `sort_values()` method.
# The `head()` method allows us to see only the first values.
dfModelScore.sort_values(by = ['F1'], ascending = False).head(10)
Criterion Max Leaves F1
2 gini 6.0 0.798450
16 entropy 14.0 0.749035
26 log_loss 14.0 0.749035
20 log_loss 2.0 0.743494
10 entropy 2.0 0.743494
0 gini 2.0 0.743494
3 gini 8.0 0.738095
22 log_loss 6.0 0.734694
12 entropy 6.0 0.734694
4 gini 10.0 0.733068
  • (?) Do you find the results satisfactory? How would you verify this?
    See Code 001 below.

  • (@) Since the class DecisionTreeClassifier doesn’t really support categorical features, you may use One Hot Encoding.
    See Pandas’ get_dummies(). You may use Code 002 below.

# Code 001
from sklearn.linear_model import LogisticRegression

oCls = LogisticRegression(max_iter = 10_000)
oCls = oCls.fit(dfXNum, dsY)
f1_score(dsY, oCls.predict(dfXNum))
# Code 002
dfXCat = dfXNum.copy()

for colName in lCatFeature:
    dfTmp = pd.get_dummies(dfXNum[colName], prefix = colName)
    dfXCat = pd.concat((dfXCat, dfTmp), axis = 1)
    dfXCat = dfXCat.drop(columns = colName)

dfXCat
# Plotting the Train Data F1 as a Heat Map
# We can pivot the data set created to have a 2D matrix of the `F1` as a function of `Criterion` and the `Max Leaves`.

hA = sns.heatmap(data = dfModelScore.pivot(index = 'Max Leaves', columns = 'Criterion', values = 'F1'), robust = True, linewidths = 1, annot = True, fmt = '0.2%', norm = LogNorm())
hA.set_title('F1 of the Cross Validation')
plt.show()
# Extract the Optimal Hyper Parameters

#===========================Fill This===========================#
# 1. Extract the index of row which maximizes the score.
# 2. Use the index of the row to extract the hyper parameters which were optimized.
# !! You may find the `idxmax()` method of a Pandas data frame useful.

idxArgMax = dfModelScore['F1'].idxmax()
#===============================================================#

optimalCriterion = dfModelScore.loc[idxArgMax, 'Criterion']
optimalMaxLeaf   = dfModelScore.loc[idxArgMax, 'Max Leaves']

print(f'The optimal hyper parameters are: `criterion` = {optimalCriterion}, `max_leaf_nodes` = {optimalMaxLeaf}')
The optimal hyper parameters are: `criterion` = gini, `max_leaf_nodes` = 6.0

Optimal Model#

In this section we'll construct the best model and retrain it on the whole data (dfXNum).
We need to export the model which has the best Test values.

# Construct the Optimal Model & Train on the Whole Data

#===========================Fill This===========================#
# 1. Construct the model with the optimal hyper parameters.
# 2. Fit the model on the whole data set.
oDecTreeCls = DecisionTreeClassifier(criterion = optimalCriterion, max_leaf_nodes = int(optimalMaxLeaf))
oDecTreeCls = oDecTreeCls.fit(dfXNum, dsY)
#===============================================================#
# Model Score (Accuracy)

print(f'The model score (Accuracy) is: {oDecTreeCls.score(dfXNum, dsY):0.2%}.')
The model score (Accuracy) is: 85.14%.
# Plot the Confusion Matrix
hF, hA = plt.subplots(figsize = (10, 10))

#===========================Fill This===========================#
# 1. Plot the confusion matrix using the `PlotConfusionMatrix()` function.
hA, mConfMat = PlotConfusionMatrix(dsY, oDecTreeCls.predict(dfXNum), hA = hA)
#===============================================================#

plt.show()
  • (?) Calculate the TP, TN, FP and FN rates.

  • (?) Calculate the precision and recall.

  • (?) Calculate the precision and recall assuming the labels 0 is the positive label.

Feature Significance#

One advantage of the decision tree based models is having access to the significance of each feature during training.
We can access it using the feature_importances_ attribute (only available after training via the fit() method).

  • (#) This ability is useful as a pre processing step of the data for any model, not only tree based ones.

  • (#) The idea is measuring the total contribution of the feature to the reduction in loss.

  • (#) This is a good importance metric mainly for categorical features. For features with a high number of unique values (continuous features), it might not be as accurate.
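When this bias is a concern, permutation importance is a common alternative; a minimal sketch on hypothetical toy data where only the first feature is informative:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: only the first feature determines the label
np.random.seed(0)
mX = np.random.rand(200, 3)
vY = (mX[:, 0] > 0.5).astype(int)

oTree = DecisionTreeClassifier(max_leaf_nodes = 4, random_state = 0).fit(mX, vY)
oPerm = permutation_importance(oTree, mX, vY, n_repeats = 10, random_state = 0)
print(oPerm.importances_mean.argmax()) #<! 0: the informative feature
```

Permutation importance measures the score drop when a feature's values are shuffled, so it does not depend on the split counting used by feature_importances_.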

# Extract the Importance of the Features

#===========================Fill This===========================#
# 1. Extract the feature importance using the `feature_importances_` attribute.
vFeatImportance = oDecTreeCls.feature_importances_
#===============================================================#

The feature importance is normalized, hence we can display it like a discrete probability mass function.

# Plot the Feature Importance

hF, hA = plt.subplots(figsize = (16, 8))
hA.bar(x = dfXNum.columns, height = vFeatImportance)
hA.set_title('Features Importance of the Model')
hA.set_xlabel('Feature Name')
hA.set_ylabel('Importance')

plt.show()
  • (?) How many non-zero values could we have? Look at the number of splits.

  • (?) What can be done with the features with low value?

  • (#) Can you explain what you see using the EDA phase plots?

  • (#) Pay attention: in the context of feature importance we may choose a high number of splits even if it means overfitting. It won't be a production model, but it will give a better view of the features.