PPS - Predictive Power Score
PPS is a statistical metric that measures how well one variable predicts another. Unlike correlation, it can capture non-linear and asymmetric relationships.
Correlation measures the linear relationship between two variables and is symmetric: the value for x and y is the same as the value for y and x.
PPS assesses the ability of one variable to predict another using a machine learning model, and is directional: the score for x predicting y can differ from the score for y predicting x.
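As a small illustration of the limits of correlation, the sketch below computes Pearson's r by hand for y = x² on a range that is symmetric around zero. The helper name `pearson_r` and the data are made up for this example. The relationship is fully deterministic, yet the coefficient comes out as zero, and swapping the arguments gives the same value.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


# Illustrative data: y is fully determined by x, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)  # same value: correlation has no direction
print(f"r(x, y) = {r_xy:.3f}, r(y, x) = {r_yx:.3f}")
```

A directional score would be expected to rate x as a strong predictor of y here, and y as a weak predictor of x, because each y value above zero maps back to two possible x values.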
How PPS Works:
Fits a decision tree that predicts one variable (the target) from another (the feature).
Evaluates the quality of those predictions and converts it into a score, with 0 indicating no predictive power and 1 indicating perfect prediction.
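The two steps above can be sketched without any libraries. This is not the ppscore implementation: to keep the example dependency-free it swaps the decision tree for a much simpler binned-median predictor, uses a plain even/odd holdout split, and scores with mean absolute error against a naive median baseline. All function names here (`fit_binned_median`, `pps_like_score`) are hypothetical.

```python
from statistics import median


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def fit_binned_median(feature, target, n_bins=10):
    """Stand-in for a decision tree: predict the median target of each feature bin."""
    lo, hi = min(feature), max(feature)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for a constant feature

    def bin_of(value):
        return min(max(int((value - lo) / width), 0), n_bins - 1)

    buckets = {}
    for f, t in zip(feature, target):
        buckets.setdefault(bin_of(f), []).append(t)
    bin_medians = {b: median(ts) for b, ts in buckets.items()}
    fallback = median(target)  # used for bins that saw no training rows
    return lambda value: bin_medians.get(bin_of(value), fallback)


def pps_like_score(feature, target, n_bins=10):
    """Score in [0, 1]: how much the model beats a naive median baseline."""
    # Simple holdout: even-indexed rows train the model, odd-indexed rows test it.
    f_train, t_train = feature[0::2], target[0::2]
    f_test, t_test = feature[1::2], target[1::2]

    predict = fit_binned_median(f_train, t_train, n_bins)
    model_mae = mean_absolute_error(t_test, [predict(f) for f in f_test])

    naive = median(t_train)
    baseline_mae = mean_absolute_error(t_test, [naive] * len(t_test))
    if baseline_mae == 0:
        return 0.0
    return max(0.0, 1.0 - model_mae / baseline_mae)


# Illustrative data: y = x^2 on a grid that is symmetric around zero.
x = [(i - 100) / 50 for i in range(201)]
y = [v * v for v in x]

score_x_to_y = pps_like_score(x, y)  # x predicts y well
score_y_to_x = pps_like_score(y, x)  # y cannot tell +x from -x
print(f"x -> y: {score_x_to_y:.2f}, y -> x: {score_y_to_x:.2f}")
```

The point of the sketch is the direction of the result: the score for x predicting y is high, while the score for y predicting x falls to the baseline level, which is the asymmetry that a correlation coefficient cannot express.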
The Predictive Power Score (PPS) is an alternative to the correlation coefficient (like Pearson’s r) that can reveal insights about the predictive relationship between two variables. While correlation measures linear relationships, PPS can detect more complex patterns such as non-linear relationships. It can be used not only with numerical data but also with categorical variables.
PPS is essentially a score that can tell you how well one variable can predict another. It is based on the concept that if one variable can be used to predict another using a machine learning model (typically a decision tree), then there is likely a meaningful relationship.
PPS is calculated by building a model that predicts one variable from another and then assessing the model's performance. In the ppscore library, performance is measured with mean absolute error (MAE) for regression targets and weighted F1 for classification targets, and the model's score is compared against a naive baseline. The result is normalized to a range of 0 to 1, where 0 means the model does no better than the baseline and 1 indicates perfect prediction.
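The normalization step can be written out in a few lines. The ppscore output reproduced later in this document reports weighted F1 and mean absolute error as its metrics, together with a baseline score and a model score for each pair. The sketch below assumes the score is the model's improvement over the baseline, rescaled to 0 to 1 and clipped at 0. The function names are hypothetical, and the input numbers are copied from that output.

```python
def pps_from_f1(model_f1, baseline_f1):
    """Classification: share of the gap between baseline and a perfect F1 of 1.0 that the model closes."""
    if baseline_f1 >= 1.0:
        return 0.0
    return max(0.0, (model_f1 - baseline_f1) / (1.0 - baseline_f1))


def pps_from_mae(model_mae, baseline_mae):
    """Regression: relative reduction in error compared with the baseline."""
    if baseline_mae == 0:
        return 0.0
    return max(0.0, 1.0 - model_mae / baseline_mae)


# carat -> cut (classification): baseline_score 0.29420, model_score 0.354467
carat_to_cut = pps_from_f1(0.354467, 0.29420)

# carat -> depth (regression): baseline_score 1.01662, model_score 1.051711
# The model's error is larger than the baseline's, so the score clips to 0.
carat_to_depth = pps_from_mae(1.051711, 1.01662)

print(f"carat -> cut:   {carat_to_cut:.4f}")
print(f"carat -> depth: {carat_to_depth:.4f}")
```

Under these assumptions the two results agree with the `ppscore` column of the same output (0.085389 and 0.000000) up to the rounding of the displayed baseline values.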
import pandas as pd
import seaborn as sns
import ppscore as pps
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

# Load example dataset and inspect the column types
diamonds = sns.load_dataset('diamonds')
print(diamonds.dtypes)

# Label-encode string (object) columns.
# Note: cut, color and clarity load with dtype 'category', not 'object',
# so this selection is empty and the loop leaves the data unchanged.
encoder = LabelEncoder()
categorical_cols = diamonds.select_dtypes(include=['object']).columns
for col in categorical_cols:
    diamonds[col] = encoder.fit_transform(diamonds[col])

# Calculate the Predictive Power Score for every pair of columns
pps_matrix = pps.matrix(diamonds)
pps_matrix.head()

# Attempt to plot the scores as a heatmap.
# Note: this call raises the ValueError shown in the output, because
# pps.matrix returns one row per (x, y) pair with string columns such as
# 'x' and 'y', not a square numeric matrix.
pps_df = pd.DataFrame(pps_matrix)
plt.figure(figsize=(10, 8))
sns.heatmap(pps_df, annot=True, fmt=".2f", cmap='coolwarm')
plt.title('PPS Matrix')
plt.show()
carat float64
cut category
color category
clarity category
depth float64
table float64
price int64
x float64
y float64
z float64
dtype: object
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[6], line 37
34 pps_df = pd.DataFrame(pps_matrix)
36 plt.figure(figsize=(10, 8))
---> 37 sns.heatmap(pps_df, annot=True, fmt=".2f", cmap='coolwarm')
38 plt.title('PPS Matrix')
39 plt.show()
File /data/solai/venvMamabaFixel/lib/python3.11/site-packages/seaborn/matrix.py:446, in heatmap(data, vmin, vmax, cmap, center, robust, annot, fmt, annot_kws, linewidths, linecolor, cbar, cbar_kws, cbar_ax, square, xticklabels, yticklabels, mask, ax, **kwargs)
365 """Plot rectangular data as a color-encoded matrix.
366
367 This is an Axes-level function and will draw the heatmap into the
(...)
443
444 """
445 # Initialize the plotter object
--> 446 plotter = _HeatMapper(data, vmin, vmax, cmap, center, robust, annot, fmt,
447 annot_kws, cbar, cbar_kws, xticklabels,
448 yticklabels, mask)
450 # Add the pcolormesh kwargs here
451 kwargs["linewidths"] = linewidths
File /data/solai/venvMamabaFixel/lib/python3.11/site-packages/seaborn/matrix.py:163, in _HeatMapper.__init__(self, data, vmin, vmax, cmap, center, robust, annot, fmt, annot_kws, cbar, cbar_kws, xticklabels, yticklabels, mask)
160 self.ylabel = ylabel if ylabel is not None else ""
162 # Determine good default values for the colormapping
--> 163 self._determine_cmap_params(plot_data, vmin, vmax,
164 cmap, center, robust)
166 # Sort out the annotations
167 if annot is None or annot is False:
File /data/solai/venvMamabaFixel/lib/python3.11/site-packages/seaborn/matrix.py:197, in _HeatMapper._determine_cmap_params(self, plot_data, vmin, vmax, cmap, center, robust)
194 """Use some heuristics to set good defaults for colorbar and range."""
196 # plot_data is a np.ma.array instance
--> 197 calc_data = plot_data.astype(float).filled(np.nan)
198 if vmin is None:
199 if robust:
ValueError: could not convert string to float: 'carat'
<Figure size 1000x800 with 0 Axes>
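The error above comes from handing `sns.heatmap` a table that contains text columns. A minimal sketch of the difference between the two layouts, using a small hand-made frame with hypothetical variable names and scores in place of real ppscore output:

```python
import pandas as pd

# Hypothetical long-format scores: one row per (feature x, target y) pair.
long_scores = pd.DataFrame({
    "x": ["a", "a", "b", "b"],
    "y": ["a", "b", "a", "b"],
    "ppscore": [1.0, 0.3, 0.0, 1.0],
})


def can_cast_to_float(frame):
    """Mimic the float conversion that a heatmap needs before plotting."""
    try:
        frame.astype(float)
        return True
    except (ValueError, TypeError):
        return False


# Reshape to a square matrix: one column per feature, one row per target.
wide_scores = long_scores.pivot(index="y", columns="x", values="ppscore")

print(can_cast_to_float(long_scores))  # the text columns block the conversion
print(can_cast_to_float(wide_scores))  # purely numeric, ready for a heatmap
print(wide_scores)
```

In the wide layout each cell holds the score for the column variable predicting the row variable, so the matrix is generally not symmetric.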
import seaborn as sns
import ppscore as pps
import matplotlib.pyplot as plt

# Load the dataset. The categorical columns (cut, color, clarity) are passed
# to ppscore as they are, so no label encoding is needed.
diamonds = sns.load_dataset('diamonds')

# Calculate the Predictive Power Score for every pair of columns.
# The result is a long-format DataFrame with one row per (x, y) pair.
pps_matrix = pps.matrix(diamonds)

# Check the structure of pps_matrix
print(pps_matrix.head())

# Pivot to a square numeric matrix before plotting:
# columns are the features (x), rows are the targets (y).
pps_wide = pps_matrix[['x', 'y', 'ppscore']].pivot(index='y', columns='x', values='ppscore')

plt.figure(figsize=(10, 8))
sns.heatmap(pps_wide, annot=True, fmt=".2f", cmap='coolwarm', vmin=0, vmax=1)
plt.title('PPS Matrix')
plt.show()
x y ppscore case is_valid_score \
0 carat carat 1.000000 predict_itself True
1 carat cut 0.085389 classification True
2 carat color 0.060319 classification True
3 carat clarity 0.064141 classification True
4 carat depth 0.000000 regression True
metric baseline_score model_score model
0 None 0.00000 1.000000 None
1 weighted F1 0.29420 0.354467 DecisionTreeClassifier()
2 weighted F1 0.15720 0.208037 DecisionTreeClassifier()
3 weighted F1 0.18000 0.232596 DecisionTreeClassifier()
4 mean absolute error 1.01662 1.051711 DecisionTreeRegressor()