PPS - Predictive Power Score

  • PPS is a statistical metric that measures the predictive relationship between two variables. Unlike correlation, it can capture non-linear and asymmetric relationships.

  • Pearson correlation measures the strength of a linear relationship between two variables and is symmetric: the correlation of x with y equals the correlation of y with x.

  • PPS assesses how well one variable predicts another using a machine learning model, and it is directional: the score for x predicting y can differ from the score for y predicting x.
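The contrast can be made concrete with a small standard-library sketch. The helper `pearson_r` and the toy data are illustrative assumptions, not part of the original notebook. With y = x² on a range symmetric around zero, y is fully determined by x, yet Pearson's r is zero (up to floating-point error) in both directions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# y is a deterministic, non-linear function of x
xs = list(range(-10, 11))
ys = [x * x for x in xs]

r_xy = pearson_r(xs, ys)
r_yx = pearson_r(ys, xs)  # same value: correlation has no direction

print(f"r(x, y) = {r_xy:.3f}")
print(f"r(y, x) = {r_yx:.3f}")
```

A directional score can tell the two cases apart: knowing x pins down y exactly, while knowing y leaves the sign of x undetermined.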

How PPS Works

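  • Fits a decision tree that predicts one variable (the target) from the other (the feature): a classifier when the target is categorical, a regressor when it is numeric.

  • Evaluates the predictions against a naive baseline and reports a score between 0 and 1, with 0 indicating no predictive power beyond the baseline and 1 indicating perfect prediction.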
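The two steps above can be sketched with the standard library alone. This is a toy illustration under stated assumptions, not the ppscore implementation: a binned-median predictor (`fit_binned_median`, a hypothetical helper) stands in for the decision tree, a single 75/25 hold-out split stands in for cross-validation, and the baseline always predicts the training median.

```python
import random
import statistics


def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_binned_median(x_train, y_train, n_bins=20):
    """Stand-in for a decision tree: predict the median of y within each x bin."""
    lo, hi = min(x_train), max(x_train)
    width = (hi - lo) / n_bins or 1.0
    fallback = statistics.median(y_train)

    def bin_of(x):
        # Clamp so values outside the training range land in an edge bin
        return min(n_bins - 1, max(0, int((x - lo) / width)))

    buckets = {}
    for x, y in zip(x_train, y_train):
        buckets.setdefault(bin_of(x), []).append(y)
    medians = {b: statistics.median(values) for b, values in buckets.items()}
    return lambda x: medians.get(bin_of(x), fallback)


def toy_pps(x, y, seed=0):
    """Hold-out score in [0, 1]: 1 - MAE_model / MAE_naive, floored at 0."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    cut = int(0.75 * len(idx))
    train, test = idx[:cut], idx[cut:]

    predict = fit_binned_median([x[i] for i in train], [y[i] for i in train])
    naive = statistics.median(y[i] for i in train)  # baseline prediction

    y_test = [y[i] for i in test]
    mae_naive = mae(y_test, [naive] * len(y_test))
    mae_model = mae(y_test, [predict(x[i]) for i in test])
    if mae_naive == 0 or mae_model >= mae_naive:
        return 0.0
    return 1.0 - mae_model / mae_naive


rng = random.Random(42)
xs = [rng.uniform(-2, 2) for _ in range(2000)]
ys = [v * v for v in xs]

print(f"x -> y: {toy_pps(xs, ys):.2f}")  # x determines y, so the score is high
print(f"y -> x: {toy_pps(ys, xs):.2f}")  # the sign of x is lost, so the score is low
```

The score is directional by construction: swapping the arguments changes which variable is the feature and which is the target.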

The Predictive Power Score (PPS) is an alternative to the correlation coefficient (like Pearson’s r) that can reveal insights about the predictive relationship between two variables. While correlation measures linear relationships, PPS can detect more complex patterns such as non-linear relationships. It can be used not only with numerical data but also with categorical variables.

PPS is essentially a score that can tell you how well one variable can predict another. It is based on the concept that if one variable can be used to predict another using a machine learning model (typically a decision tree), then there is likely a meaningful relationship.

PPS is calculated by building a model to predict one variable from another and then comparing the model’s performance with a naive baseline. In the ppscore library, the metric is mean absolute error (MAE) for regression tasks and weighted F1 for classification, as the metric column of the pps.matrix output shows. The PPS is normalized to a range of 0 to 1, where 0 means the model does no better than the baseline and 1 indicates perfect predictive ability.
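The normalization can be written out as two small functions. This is a sketch: the function names are illustrative, and the formulas are inferred from the `baseline_score`, `model_score` and `ppscore` columns of the diamonds output in this document, which they reproduce.

```python
def normalized_f1_score(model_f1, baseline_f1):
    """Classification: share of the gap between baseline and perfect F1 that the model closes."""
    if model_f1 <= baseline_f1:
        return 0.0
    return (model_f1 - baseline_f1) / (1.0 - baseline_f1)


def normalized_mae_score(model_mae, naive_mae):
    """Regression: relative reduction in MAE compared with the naive baseline."""
    if naive_mae == 0 or model_mae >= naive_mae:
        return 0.0
    return 1.0 - model_mae / naive_mae


# (baseline_score, model_score) pairs copied from the diamonds output
carat_to_cut = normalized_f1_score(model_f1=0.354467, baseline_f1=0.29420)
carat_to_color = normalized_f1_score(model_f1=0.208037, baseline_f1=0.15720)
carat_to_depth = normalized_mae_score(model_mae=1.051711, naive_mae=1.01662)

print(f"carat -> cut:   {carat_to_cut:.4f}")
print(f"carat -> color: {carat_to_color:.4f}")
print(f"carat -> depth: {carat_to_depth:.4f}")
```

For carat predicting depth, the tree's MAE is worse than the baseline's, so the score is floored at 0 rather than going negative.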

import seaborn as sns
import ppscore as pps
import matplotlib.pyplot as plt

# Load example dataset
diamonds = sns.load_dataset('diamonds')

# cut, color and clarity are 'category' columns; ppscore treats them as
# categorical variables, so no label encoding is needed
print(diamonds.dtypes)

# Calculate the Predictive Power Score for every ordered pair of columns.
# The result is a long-format DataFrame with one row per (x, y) pair.
pps_matrix = pps.matrix(diamonds)

# sns.heatmap needs a numeric 2-D table. Passing the long-format frame directly
# raises "ValueError: could not convert string to float: 'carat'", so pivot first:
# feature x on the columns, target y on the rows.
pps_df = pps_matrix.pivot(index='y', columns='x', values='ppscore')

plt.figure(figsize=(10, 8))
sns.heatmap(pps_df, annot=True, fmt=".2f", cmap='coolwarm')
plt.title('PPS Matrix')
plt.show()
carat       float64
cut        category
color      category
clarity    category
depth       float64
table       float64
price         int64
x           float64
y           float64
z           float64
dtype: object
import seaborn as sns
import ppscore as pps
import matplotlib.pyplot as plt

# Load the dataset
diamonds = sns.load_dataset('diamonds')

# Calculate the Predictive Power Score
pps_matrix = pps.matrix(diamonds)

# pps.matrix already returns a DataFrame, but in long format:
# one row per (x, y) pair with the score and model details as columns
print(pps_matrix.head())

# Reshape to a square matrix of scores before plotting
# (feature x on the columns, target y on the rows)
pps_wide = pps_matrix.pivot(index='y', columns='x', values='ppscore')

# PPS lies in [0, 1], so fix the color scale to that range
plt.figure(figsize=(10, 8))
sns.heatmap(pps_wide, annot=True, fmt=".2f", cmap='coolwarm', vmin=0, vmax=1)
plt.title('PPS Matrix')
plt.show()
       x        y   ppscore            case  is_valid_score  \
0  carat    carat  1.000000  predict_itself            True   
1  carat      cut  0.085389  classification            True   
2  carat    color  0.060319  classification            True   
3  carat  clarity  0.064141  classification            True   
4  carat    depth  0.000000      regression            True   

                metric  baseline_score  model_score                     model  
0                 None         0.00000     1.000000                      None  
1          weighted F1         0.29420     0.354467  DecisionTreeClassifier()  
2          weighted F1         0.15720     0.208037  DecisionTreeClassifier()  
3          weighted F1         0.18000     0.232596  DecisionTreeClassifier()  
4  mean absolute error         1.01662     1.051711   DecisionTreeRegressor()  