SVM Breast Cancer Classification#
Notebook by:
Royi Avital RoyiAvital@fixelalgorithms.com
Revision History#
| Version | Date | User | Content / Changes |
|---|---|---|---|
| 1.0.000 | 09/03/2024 | Royi Avital | First version |
# Import Packages
# General Tools
import numpy as np
import scipy as sp
import pandas as pd
# Machine Learning
# Miscellaneous
import os
from platform import python_version
import random
import timeit
# Typing
from typing import Callable, List, Tuple
# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
# from bokeh.plotting import figure, show
# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout
Notations#
(?) Question to answer interactively.
(!) Simple task to add code for the notebook.
(@) Optional / Extra self practice.
(#) Note / Useful resource / Food for thought.
Code Notations:
someVar = 2; #<! Notation for a variable
vVector = np.random.rand(4) #<! Notation for 1D array
mMatrix = np.random.rand(4, 3) #<! Notation for 2D array
tTensor = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple = (1, 2, 3) #<! Notation for a tuple
lList = [1, 2, 3] #<! Notation for a list
dDict = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj = MyClass() #<! Notation for an object
dfData = pd.DataFrame() #<! Notation for a data frame
dsData = pd.Series() #<! Notation for a series
hObj = plt.Axes() #<! Notation for an object / handler / function handler
Code Exercise#
Single line fill
valToFill = ???
Multi Line to Fill (At least one)
# You need to start writing
????
Section to Fill
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???
???
#===============================================================#
# Configuration
# %matplotlib inline
seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)
# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #>! Apply SeaBorn theme
runInGoogleColab = 'google.colab' in str(get_ipython())
# Constants
FIG_SIZE_DEF = (8, 8)
ELM_SIZE_DEF = 50
CLASS_COLOR = ('b', 'r')
EDGE_COLOR = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF = 2
# Courses Packages
import sys
sys.path.append('../')
sys.path.append('../../')
sys.path.append('../../../')
from utils.DataVisualization import Plot2DLinearClassifier, PlotBinaryClassData
# General Auxiliary Functions
# Parameters
# Data Generation
# Data Visualization
numGridPts = 250
Exercise#
In this exercise we'll do the following:
1. Apply an SVM classifier on the Breast Cancer Wisconsin (Diagnostic) data set.
2. Use the `predict()` method of the SVM object.
3. Implement our own score function: `ClsAccuracy()`.
4. Compare it to the `score()` method of the SVM object.
5. Find the value of the parameter `C` which maximizes the accuracy.
Generate / Load Data#
# Load Modules
#===========================Fill This===========================#
# 1. Load the `load_breast_cancer` function from the `sklearn.datasets` module.
# 2. Load the `SVC` class from the `sklearn.svm` module.
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
#===============================================================#
# Load Data
dData = load_breast_cancer()
mX = dData.data
vY = dData.target
print(f'The features data shape: {mX.shape}')
print(f'The labels data shape: {vY.shape}')
The features data shape: (569, 30)
The labels data shape: (569,)
# Pre Process Data
# Standardizing the features to have zero mean and unit variance and labels into {-1, 1}.
#===========================Fill This===========================#
# 1. Normalize Data (Features): Each column to have zero mean and unit standard deviation.
mX = mX - np.mean(mX, axis = 0)
mX = mX / np.std(mX, axis = 0)
print(f"mX.shape = {mX.shape}")
# 2. Transforming the Labels into {-1, 1}.
vY[vY == 0] = -1
print(f"vY.shape = {vY.shape}")
#===============================================================#
mX.shape = (569, 30)
vY.shape = (569,)
(#) Normalization is ambiguous in this context. In some cases it is used to describe the manipulation of the minimum and maximum values of the data.
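To make the distinction concrete, here is a minimal sketch contrasting the two common meanings of "normalization" (the array `mA` is dummy data for illustration only):

```python
import numpy as np

mA = np.random.rand(100, 3) * 10 #<! Dummy data: 100 samples, 3 features

# Standardization (what this notebook does): zero mean, unit standard deviation per column
mStd = (mA - np.mean(mA, axis = 0)) / np.std(mA, axis = 0)

# Min / Max normalization: maps each column onto the [0, 1] range
mMinMax = (mA - np.min(mA, axis = 0)) / (np.max(mA, axis = 0) - np.min(mA, axis = 0))
```

Both are linear per-column transforms, but they fix different statistics: the first fixes the mean and spread, the second fixes the extreme values.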
# Data Dimensions
numSamples = mX.shape[0]
print(f'The features data shape: {mX.shape}') #>! Should be (569, 30)
The features data shape: (569, 30)
(?) Does the data have a constant column of \(1\) or \(-1\)?
(?) Should we add a constant column? Look at the mathematical formulation of the SVC in SciKit Learn.
(?) What's the `intercept_` attribute of the object?
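These questions can be explored directly in code. The sketch below (with `oSvc` being a fresh linear SVC fitted only for inspection) checks whether any feature column is constant, and shows that SciKit Learn's `SVC` learns the bias term internally, which is why no constant column needs to be added:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

dData = load_breast_cancer()
mX = (dData.data - np.mean(dData.data, axis = 0)) / np.std(dData.data, axis = 0) #<! Standardized features
vY = np.where(dData.target == 0, -1, 1) #<! Labels in {-1, 1}

# A constant column would have zero standard deviation in the raw data
hasConstCol = np.any(np.std(dData.data, axis = 0) == 0)
print(f'Constant column present: {hasConstCol}')

oSvc = SVC(kernel = 'linear').fit(mX, vY)
print(f'Intercept shape: {oSvc.intercept_.shape}') #<! The bias is learned internally by the model
```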
Train a SVM Classifier#
This section trains an SVM classifier using SciKit Learn.
The SciKit Learn Package#
In the course, from now on, we’ll mostly use modules and functions from the SciKit Learn package.
It is best known for its uniform API of `<model>.fit()` and `<model>.predict()`.
This simple convention is what makes composition possible: models can be chained into pipelines, forming a single, larger model.
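For instance, the standardization and classification steps of this notebook could be chained into a single estimator. A sketch (the step names `'Scaler'` and `'Classifier'` are arbitrary choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

dData = load_breast_cancer()

# The pipeline exposes the same `fit()` / `predict()` API as a single model
oPipe = Pipeline([
    ('Scaler', StandardScaler()), #<! Zero mean, unit standard deviation per feature
    ('Classifier', SVC(C = 1.0, kernel = 'linear')),
])
oPipe = oPipe.fit(dData.data, dData.target)
vYPred = oPipe.predict(dData.data) #<! Scaling and prediction happen in one call
```

Calling `fit()` on the pipeline fits the scaler, transforms the data, and fits the classifier; `predict()` reapplies the same learned scaling before classifying.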
SVM Classifier Class#
# Construct the SVC Object
# Use the SVC constructor and the parameters below.
paramC = 0.0001
kernelType = 'linear'
#===========================Fill This===========================#
# 1. Create a realization of the `SVC` class using the `C` and `kernel` parameters.
oSvmClassifier = SVC(C = paramC, kernel = kernelType)
#===============================================================#
# Train the Model
#===========================Fill This===========================#
# 1. Train the model using the `fit()` method.
oSvmClassifier = oSvmClassifier.fit(mX, vY)
#===============================================================#
(!) Create a function called `ClsAccuracy( oCls, mX, vY )`:
The function input is a model with a `predict()` method, the data, and the labels.
The function output is the accuracy of the model (in the range [0, 1]).
# Scoring (Accuracy) Function
#===========================Fill This===========================#
# 1. Implement the function `ClsAccuracy()` as defined.
def ClsAccuracy( oCls, mX: np.ndarray, vY: np.ndarray ) -> np.floating:
    '''
    Calculates the accuracy (fraction of correct predictions) of a model.
    oCls - A classifier with a `predict()` method.
    mX   - The input data, mX.shape = (N, d).
    vY   - The true labels, vY.shape = (N,).
    '''
    vYPred = oCls.predict(mX) #<! Predicted labels
    valAcc = np.mean(vYPred == vY) #<! Fraction of correct predictions
    return valAcc
#===============================================================#
# Score the Model
modelAcc = ClsAccuracy(oSvmClassifier, mX, vY)
print(f'The model accuracy on the training data is: {modelAcc:0.2%}')
The model accuracy on the training data is: 67.31%
The model accuracy on the training data is: 67.31%
(!) Compare the manual scoring function to the `score()` method of the classifier.
# Comparing the Score
#===========================Fill This===========================#
# 1. Use the model's method `score()` to evaluate the accuracy.
modelAccRef = oSvmClassifier.score(mX, vY)
#===============================================================#
print(f'The model accuracy (Based on the `score()` method) on the training data is: {modelAccRef:0.2%}')
The model accuracy (Based on the `score()` method) on the training data is: 67.31%
Optimizing the Parameter C of the Model#
(!) Create an array of values of the parameter `C`.
(!) Create a loop which checks the score for each `C` value.
(!) Keep the `C` value which maximizes the score.
#===========================Fill This===========================#
# lC = [0.0001 , 0.1 , 1 , 2 , 2.5 ,3 ] #<! The list of `C` values to optimize over
numParams = 100 #<! Number of different values of `C`
lC = np.linspace(0.001, 5, numParams) #<! The list of `C` values to optimize over
dBestScore = {'Accuracy': 0, 'C': 0} #<! Dictionary to keep the highest score and the corresponding `C`
for ii, paramC in enumerate(lC):
oSvmClassifier = SVC(C = paramC, kernel = kernelType) #<! Construct the SVC object
oSvmClassifier = oSvmClassifier.fit(mX, vY) #<! Train on the data
modelScore = oSvmClassifier.score(mX, vY) #<! Calculate the score (Accuracy)
if (modelScore > dBestScore['Accuracy']):
dBestScore['Accuracy'] = modelScore #<! Update the new best score
dBestScore['C'] = paramC #<! Update the corresponding `C` hyper parameter
#===============================================================#
print(f'The best model has accuracy of {dBestScore["Accuracy"]:0.2%} with `C = {dBestScore["C"]}`')
The best model has accuracy of 99.12% with `C = 1.869313131313131`
(!) Plot the score of the model as a function of the parameter C.
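One possible sketch for the plotting task above, recomputing the training scores over a coarser grid of `C` values (the grid size and figure styling are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

dData = load_breast_cancer()
mX = (dData.data - np.mean(dData.data, axis = 0)) / np.std(dData.data, axis = 0) #<! Standardized features
vY = np.where(dData.target == 0, -1, 1) #<! Labels in {-1, 1}

vC = np.linspace(0.001, 5, 25) #<! Coarser grid than above, to keep run time short
vScore = np.array([SVC(C = paramC, kernel = 'linear').fit(mX, vY).score(mX, vY) for paramC in vC])

hF, hA = plt.subplots(figsize = (8, 4))
hA.plot(vC, vScore, lw = 2, marker = 'o', ms = 4)
hA.set_xlabel('C')
hA.set_ylabel('Accuracy')
hA.set_title('Training Accuracy as a Function of C')
```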
(?) Is the above a good strategy to optimize the model?
(@) Read the documentation of the `SVC` class. Try other values of `kernel`.
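As food for thought on the questions above: the loop scores each model on its own training data, which rewards overfitting as `C` grows. A sketch of a sounder alternative using cross validation (the 5 fold count and the `C` grid are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

dData = load_breast_cancer()
mX = (dData.data - np.mean(dData.data, axis = 0)) / np.std(dData.data, axis = 0) #<! Standardized features
vY = np.where(dData.target == 0, -1, 1) #<! Labels in {-1, 1}

vC = np.linspace(0.001, 5, 20) #<! Candidate values of `C`
# Mean accuracy over 5 folds of held out data, per value of `C`
vCvScore = np.array([np.mean(cross_val_score(SVC(C = paramC, kernel = 'linear'), mX, vY, cv = 5)) for paramC in vC])
bestC = vC[np.argmax(vCvScore)]
```

Strictly speaking, the standardization above uses statistics of the whole data set, which leaks information across folds; wrapping a `StandardScaler` and the `SVC` in a `sklearn.pipeline.Pipeline` and cross validating the pipeline avoids this.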
