Model evaluation¶

Evaluation metrics¶

Evaluation metrics are used to quantify the performance of a trained model on a given problem, throughout the model's lifecycle.

In [59]:
from sklearn import metrics

Warning about metrics!¶

Metrics must be chosen according to the characteristics of the task: classification, regression, clustering...

A metric alone gives you only a partial view of a model's performance on a task: it is often desirable to compare different evaluation metrics.

In addition, a metric alone tells you little about the model's interpretability: you also need to inspect the model through complementary analyses, such as analysis of the relationships between the variable to be predicted and the features, or between the features themselves.

No single model can offer the best performance for every task (No Free Lunch Theorem)!

Baseline score¶

Always calculate the score of a reference model to serve as a point of comparison for your evaluation metric.

This may be a state-of-the-art result in the field under study, for example:

  • a physical model for predicting global warming
  • human performance on the same task

Or else a stupid model giving a stereotyped response, for example:

  • giving a random response
  • for classification: predict the most frequent class
  • for regression: predict a measure of central tendency (mean, median, mode)
  • ...

Example of a stupid baseline model with sklearn¶

DummyClassifier for classification:

In [61]:
import numpy as np
from sklearn.dummy import DummyClassifier
X = np.array([-1, 1, 1, 1])
y = np.array([0, 1, 1, 1])

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X, y)

dummy_clf.predict(X)

dummy_clf.score(X, y)
Out[61]:
0.75

DummyRegressor for regression:

In [63]:
import numpy as np
from sklearn.dummy import DummyRegressor
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 10.0])

dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(X, y)

dummy_regr.predict(X)

dummy_regr.score(X, y)
Out[63]:
0.0

Common regression metrics¶

Mean Squared Error (MSE)¶

It measures the mean of the squared differences between the labels $y$ and their predicted values $\hat y$:

$$MSE = \frac{1}{n}\sum_{i=1}^n(y_i - \hat y_i)^2$$

Important points

  • useful for penalizing large errors
  • very sensitive to outliers
  • does not give error direction
  • is not on the same order of magnitude (scale) as $y$.
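
As a quick sketch, the MSE can be computed with sklearn.metrics (the toy values below are purely illustrative):

In [ ]:
from sklearn.metrics import mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# mean of the squared differences: (0.25 + 0.25 + 0 + 1) / 4 = 0.375
mean_squared_error(y_true, y_pred)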

Root Mean Squared Error (RMSE)¶

$$ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n(y_i - \hat y_i)^2}$$

Taking the square root gives an error of the same order of magnitude as the labels $y$.
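
A minimal sketch, reusing the same toy values; depending on your scikit-learn version a dedicated root_mean_squared_error function may exist, but taking the square root of the MSE works everywhere:

In [ ]:
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# square root of the MSE, expressed on the same scale as y
np.sqrt(mean_squared_error(y_true, y_pred))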

Mean Absolute Error (MAE)¶

It measures the mean of the absolute differences ($L_1$ norm) between the labels $y$ and their predicted values $\hat y$:

$$MAE = \frac{1}{n}\sum_{i=1}^n |y_i - \hat y_i|$$

Important points:

  • useful for comparing models
  • less sensitive than MSE to outliers
  • does not give error direction
  • is on the same order of magnitude as $y$.
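
Again a quick sketch with the same toy values:

In [ ]:
from sklearn.metrics import mean_absolute_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# mean of the absolute differences: (0.5 + 0.5 + 0 + 1) / 4 = 0.5
mean_absolute_error(y_true, y_pred)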

Max Error¶

It measures the greatest error made by the model:

$$ME = \max_{i=1}^{n} |y_i - \hat y_i|$$

Use Max error when you want to limit the magnitude of errors.
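
A quick sketch with sklearn.metrics.max_error on the same toy values:

In [ ]:
from sklearn.metrics import max_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# largest absolute deviation: |7.0 - 8.0| = 1.0
max_error(y_true, y_pred)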

The coefficient of determination $R^2$¶

It measures the proportion of the variance of $y$ that is explained by the features $X$ of the dataset. It assesses the model's goodness of fit compared with a stupid model that always predicts the mean $\bar y$.

$$R^2(y,\hat y) = 1 - \frac{\sum_{i=1}^n (y_i-\hat y_i)^2}{\sum_{i=1}^n (y_i-\bar y)^2}$$
  • $R^2$ = 1 characterizes a model that fits the data perfectly
  • $R^2$ ~ 0 characterizes a model that does no better than the stupid model
  • $R^2$ < 0 characterizes an even worse model!

Important points

  • gives a standardized measure of fit (at most 1, and possibly negative)
  • can be used to determine whether or not the chosen model is better than a stupid model.
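
A quick sketch with r2_score, still on the same toy values:

In [ ]:
from sklearn.metrics import r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# proportion of the variance of y_true explained by the predictions (close to 1 here)
r2_score(y_true, y_pred)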

Examples of metric comparison during cross-validation¶

In [66]:
import pandas as pd
from sklearn.model_selection import cross_validate
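# NB: mlp_model, X_train and y_train refer to the model and training data prepared earlier in the notebook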

cv_results = cross_validate(mlp_model, X_train, y_train, cv=5, 
                            scoring = ['neg_mean_absolute_error',
                                       'neg_mean_squared_error',
                                       'max_error','r2'])
In [67]:
pd.DataFrame(cv_results)
Out[67]:
fit_time score_time test_neg_mean_absolute_error test_neg_mean_squared_error test_max_error test_r2
0 3.835665 0.008013 -30.412854 -2321.110081 -392.599692 0.928451
1 3.393434 0.010037 -28.486267 -2060.943748 -464.888348 0.938005
2 3.495894 0.011007 -28.249311 -1979.778788 -280.844825 0.938423
3 4.059982 0.008575 -28.350779 -2177.317791 -440.987206 0.934958
4 5.503218 0.018209 -27.754710 -1900.744156 -385.141612 0.941775
In [68]:
cv_results['test_r2'].mean()
Out[68]:
0.9363221440235678

Common classification metrics¶

We often distinguish three types of classification:
binary {$C_0$, $C_1$}, multiclass {$C_1, \ldots, C_n$}, and multilabel {{$C_1$, $D_1$}, $\ldots$, {$C_n$, $D_n$}}

Confusion matrix¶

In binary classification tasks, we distinguish two types of error, which can be summarized in a confusion matrix:

confusion matrix

Example of a confusion matrix for multiclass classification¶

We train an SVM on the iris data set:

In [82]:
import matplotlib.pyplot as plt

from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay

# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Run classifier, using a model that is too regularized (C too low) to see
# the impact on the results
classifier = svm.SVC(kernel="linear", C=0.01).fit(X_train, y_train)

The confusion matrix is often represented as a heat map:

In [85]:
confmat = ConfusionMatrixDisplay.from_estimator(
        classifier,
        X_test,
        y_test,
        display_labels=class_names,
        cmap=plt.cm.Blues,
)
confmat.ax_.set_title("Confusion matrix");
Confusion matrix heat map for the SVM classifier on the iris test set

Many classification metrics are based on the confusion matrix! (See the dedicated Wikipedia page for more details.)

Accuracy score¶

This is the simplest score, corresponding to the fraction of correct answers:

$$ accuracy = \frac{TP + TN}{TP + TN + FP +FN} $$

Be careful! This measure can give an overly optimistic score, especially when the dataset contains imbalanced classes. In that case, the balanced accuracy score is preferable.

Balanced accuracy score¶

$$ balanced~accuracy = \frac{1}{2}(\frac{TP}{TP + FN} + \frac{TN}{TN + FP})$$
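
A small sketch illustrating the difference on an imbalanced toy example (90 samples of one class, 10 of the other, and a classifier that always predicts the majority class):

In [ ]:
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)  # always predict the majority class

print(accuracy_score(y_true, y_pred))           # 0.9: looks good but is misleading
print(balanced_accuracy_score(y_true, y_pred))  # 0.5: no better than chance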

Recall / sensitivity / true positive rate¶

This metric measures the classifier's ability to detect true positives among positive samples:

$$ recall = \frac{TP}{TP + FN}$$

Recall is preferred when it is important not to miss any occurrences of a class, for example in a disease screening test.

Precision¶

This metric measures the classifier's ability to detect true positives among positive predictions:

$$ precision = \frac{TP}{TP + FP}$$

Precision is preferred when it is important that positive predictions be correct, for example for the results returned by a search engine.
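
As a quick illustration of recall and precision on a toy binary example (TP = 3, FN = 1, FP = 2 here):

In [ ]:
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print(precision_score(y_true, y_pred))  # 3 / (3 + 2) = 0.6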

F-score¶

It is a metric that combines precision and recall into a single weighted score. The $F_1$ score is often used:

$$ F_1 = 2\frac{precision \times recall}{precision + recall}$$

but the weighting can be generalized to any value of $\beta$:

$$ F_\beta = (1+\beta^2)\frac{precision \times recall}{\beta^2 \times precision + recall}$$
  • The best $F_\beta$ score is 1, the worst 0
  • $F_1$ gives the same weight to recall and precision; with $\beta > 1$ recall is favored, with $\beta < 1$ precision
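
A quick sketch on the same toy predictions, using f1_score and fbeta_score from sklearn.metrics:

In [ ]:
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

print(f1_score(y_true, y_pred))             # harmonic mean of precision and recall
print(fbeta_score(y_true, y_pred, beta=2))  # beta > 1 gives more weight to recall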

Classification report¶

This scikit-learn function returns a report integrating several classification metrics:

  • precision
  • recall
  • F1-score
In [105]:
from sklearn.metrics import classification_report

print(classification_report(y_test, classifier.predict(X_test), target_names=class_names))
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        13
  versicolor       1.00      0.62      0.77        16
   virginica       0.60      1.00      0.75         9

    accuracy                           0.84        38
   macro avg       0.87      0.88      0.84        38
weighted avg       0.91      0.84      0.84        38

Specificity / selectivity / true negative rate¶

This metric measures the classifier's ability to detect true negatives among negative samples:

$$ specificity = \frac{TN}{FP + TN}$$
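
scikit-learn does not provide a dedicated specificity scorer, but in the binary case it can be read directly off the confusion matrix (a minimal sketch on the same toy predictions):

In [ ]:
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

# for a binary problem, ravel() returns tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn / (tn + fp))  # specificity = 2 / (2 + 2) = 0.5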

ROC curve area (AUC)¶

The ROC curve (Receiver Operating Characteristic) is a measure of the performance of a binary classifier, derived from signal detection theory (it was used to separate radar signals from background noise).

It is obtained by plotting the true positive rate (recall) against the false positive rate (1 - specificity) while varying a discrimination threshold. The metric then consists in using the area under the ROC curve (AUC) as a score.

ROC curve

For an interactive example of its construction, see this website.
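
A minimal sketch of the AUC computed from predicted scores (toy scores chosen for illustration):

In [ ]:
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# area under the ROC curve obtained by varying the decision threshold on y_scores
roc_auc_score(y_true, y_scores)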

Multilabel metrics¶

Although some of the metrics presented may have multilabel variants, there are metrics specific to multilabel classification tasks (which involve predicting several labels at the same time).

Some of these are described and implemented in scikit-learn, such as the coverage error, label ranking average precision, ranking loss...
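
For example, a minimal sketch of these ranking metrics on a toy multilabel problem (rows are samples, columns are labels; y_score holds the scores a hypothetical model assigns to each label):

In [ ]:
import numpy as np
from sklearn.metrics import (coverage_error,
                             label_ranking_average_precision_score,
                             label_ranking_loss)

y_true = np.array([[1, 0, 0], [0, 0, 1]])                 # binary label indicators
y_score = np.array([[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]])   # predicted scores

print(coverage_error(y_true, y_score))
print(label_ranking_average_precision_score(y_true, y_score))
print(label_ranking_loss(y_true, y_score))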

Analysis of relationships between variables¶

Using one or more metrics is a first step, but it's not enough to understand the relationships between variables and their impacts on predictions.

Analysis of relationships between features¶

This involves analyzing the relationships between features, for example by examining the importance of the features used by the model.

Some models are natively equipped with such a score:

  • the coefficients of a regression.
  • the importance of the features used in the partitions of a decision tree.
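
As an illustration, a quick sketch of both kinds of native scores on the iris data set (models fitted here only to read these attributes, not tuned):

In [ ]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# importance of each feature in the partitions of a decision tree
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(dict(zip(iris.feature_names, tree.feature_importances_)))

# coefficients of a (logistic) regression, one row per class
logreg = LogisticRegression(max_iter=1000).fit(X, y)
print(logreg.coef_)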

Analysis of the relationship between features and target variable¶

This involves analyzing the relationships between predictors and target variables, in particular their degree of dependence.

Partial dependence analysis (partial dependence plots - PDP)¶

PDP plots are often used to represent the dependence of the response variable on a subset of features of interest (usually a small number of the most important ones), marginalizing over all the other features.

Example of PDP on a predictive model of bicycle rental with the variables temperature and humidity

PDP of the predicted number of bike rentals as a function of temperature and humidity

From scikit-learn
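
As a minimal sketch (the bike rental figure above comes from the scikit-learn documentation), a PDP can be drawn with sklearn.inspection.PartialDependenceDisplay; here the diabetes data set is used only for illustration:

In [ ]:
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# partial dependence of the predicted target on two features of interest
PartialDependenceDisplay.from_estimator(model, X, features=["bmi", "bp"]);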

Individual conditional expectation plot (ICE)¶

ICE plots represent the dependence between the response variable and a subset of features of interest, for each observation separately. For readability, a separate ICE plot is generally drawn for each feature of interest.

ICE example on the same bike rental prediction model:

ICE curves of the predicted number of bike rentals as a function of temperature and humidity

From scikit-learn
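
Likewise, a minimal sketch of ICE curves with the same kind of setup; kind="individual" draws one curve per observation and kind="both" overlays the average (PDP) curve:

In [ ]:
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# one curve per observation for the feature of interest, plus the PDP average
PartialDependenceDisplay.from_estimator(model, X, features=["bmi"], kind="both");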

For more details on the partial dependence analysis of the variables in this data set, see the complete example on scikit-learn.