Model evaluation¶

Evaluation metrics¶

Evaluation metrics are used to quantify the performance of a trained model on a given problem, throughout the model's lifecycle.

In [59]:
from sklearn import metrics

Warning about metrics!¶

Metrics must be chosen according to the characteristics of the task: classification, regression, clustering...

A metric alone gives you only a partial view of a model's performance on a task: it is often desirable to compare different evaluation metrics.

In addition, a metric alone tells you little about the model's interpretability: you also need to inspect the model through complementary analyses, such as analysis of the relationships between the variable to be predicted and the features, or between the features themselves.

No single model can offer the best performance for every task (No Free Lunch Theorem)!

Baseline score¶

Always calculate the score of a reference model to serve as a point of comparison for your evaluation metric.

This may be a state-of-the-art result in the field under study, for example:

  • a physical model for predicting global warming
  • human performance on the same task

Or else a stupid model giving a stereotyped response, for example:

  • giving a random response
  • for classification: predict the most frequent class
  • for regression: predict a measure of central tendency (mean, median, mode)
  • ...

Example of a stupid baseline model with sklearn¶

DummyClassifier for classification:

In [61]:
import numpy as np
from sklearn.dummy import DummyClassifier
X = np.array([-1, 1, 1, 1])
y = np.array([0, 1, 1, 1])

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X, y)

dummy_clf.predict(X)

dummy_clf.score(X, y)
Out[61]:
0.75

DummyRegressor for regression:

In [63]:
import numpy as np
from sklearn.dummy import DummyRegressor
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 10.0])

dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(X, y)

dummy_regr.predict(X)

dummy_regr.score(X, y)
Out[63]:
0.0

Common regression metrics¶

Mean Squared Error (MSE)¶

It measures the mean of the squared differences between the labels $y$ and their predicted values $\hat y$:

$$MSE = \frac{1}{n}\sum_{i=1}^n(y_i - \hat y_i)^2$$

Important points

  • useful for penalizing large errors
  • very sensitive to outliers
  • does not give error direction
  • is not on the same order of magnitude (scale) as $y$.
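
As a quick sketch, the MSE can be computed with sklearn.metrics (the toy values below are purely illustrative):

In [ ]:
from sklearn.metrics import mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# mean of the squared differences: (0.25 + 0.25 + 0 + 1) / 4 = 0.375
mean_squared_error(y_true, y_pred)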

Root Mean Squared Error (RMSE)¶

$$ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n(y_i - \hat y_i)^2}$$

Taking the square root gives an error of the same order of magnitude as the labels $y$.
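
A minimal sketch, reusing the same toy values; depending on your scikit-learn version a dedicated root_mean_squared_error function may exist, but taking the square root of the MSE works everywhere:

In [ ]:
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# square root of the MSE, expressed on the same scale as y
np.sqrt(mean_squared_error(y_true, y_pred))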

Mean Absolute Error (MAE)¶

It measures the mean of the absolute differences ($L_1$ norm) between the labels $y$ and their predicted values $\hat y$:

$$MAE = \frac{1}{n}\sum_{i=1}^n |y_i - \hat y_i|$$

Important points:

  • useful for comparing models
  • less sensitive than MSE to outliers
  • does not give error direction
  • is on the same order of magnitude as $y$.
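
Again a quick sketch with the same toy values:

In [ ]:
from sklearn.metrics import mean_absolute_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# mean of the absolute differences: (0.5 + 0.5 + 0 + 1) / 4 = 0.5
mean_absolute_error(y_true, y_pred)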

Max Error¶

It measures the greatest error made by the model:

$$ME = \max_{i=1}^{n} |y_i - \hat y_i|$$

Use Max error when you want to limit the magnitude of errors.
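
A quick sketch with sklearn.metrics.max_error on the same toy values:

In [ ]:
from sklearn.metrics import max_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# largest absolute deviation: |7.0 - 8.0| = 1.0
max_error(y_true, y_pred)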

The coefficient of determination $R^2$¶

It measures the proportion of the variance of $y$ that is explained by the features $X$ of the dataset. It assesses the model's goodness of fit compared with a stupid model that always predicts the mean $\bar y$.

$$R^2(y,\hat y) = 1 - \frac{\sum_{i=1}^n (y_i-\hat y_i)^2}{\sum_{i=1}^n (y_i-\bar y)^2}$$
  • $R^2$ = 1 characterizes a model that fits the data perfectly
  • $R^2$ ~ 0 characterizes a model that does no better than the stupid model
  • $R^2$ < 0 characterizes an even worse model!

Important points

  • gives a standardized measure of fit (at most 1, and possibly negative)
  • can be used to determine whether or not the chosen model is better than a stupid model.
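
A quick sketch with r2_score, still on the same toy values:

In [ ]:
from sklearn.metrics import r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# proportion of the variance of y_true explained by the predictions (close to 1 here)
r2_score(y_true, y_pred)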

Examples of metric comparison during cross-validation¶

In [66]:
import pandas as pd
from sklearn.model_selection import cross_validate
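# NB: mlp_model, X_train and y_train refer to the model and training data prepared earlier in the notebook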

cv_results = cross_validate(mlp_model, X_train, y_train, cv=5, 
                            scoring = ['neg_mean_absolute_error',
                                       'neg_mean_squared_error',
                                       'max_error','r2'])
In [67]:
pd.DataFrame(cv_results)
Out[67]:
fit_time score_time test_neg_mean_absolute_error test_neg_mean_squared_error test_max_error test_r2
0 3.835665 0.008013 -30.412854 -2321.110081 -392.599692 0.928451
1 3.393434 0.010037 -28.486267 -2060.943748 -464.888348 0.938005
2 3.495894 0.011007 -28.249311 -1979.778788 -280.844825 0.938423
3 4.059982 0.008575 -28.350779 -2177.317791 -440.987206 0.934958
4 5.503218 0.018209 -27.754710 -1900.744156 -385.141612 0.941775
In [68]:
cv_results['test_r2'].mean()
Out[68]:
0.9363221440235678

Common classification metrics¶

We often distinguish three types of classification:
binary {$C_0$, $C_1$}, multiclass {$C_1, \ldots, C_n$}, and multilabel {{$C_1$, $D_1$}, $\ldots$, {$C_n$, $D_n$}}

Confusion matrix¶

In binary classification tasks, we distinguish two types of error, which can be summarized in a confusion matrix:

confusion matrix

Example of a confusion matrix for multiclass classification¶

We train an SVM on the iris data set:

In [82]:
import matplotlib.pyplot as plt

from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay

# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Run classifier, using a model that is too regularized (C too low) to see
# the impact on the results
classifier = svm.SVC(kernel="linear", C=0.01).fit(X_train, y_train)

The confusion matrix is often represented as a heat map:

In [85]:
confmat = ConfusionMatrixDisplay.from_estimator(
        classifier,
        X_test,
        y_test,
        display_labels=class_names,
        cmap=plt.cm.Blues,
)
confmat.ax_.set_title("Confusion matrix");
Confusion matrix heat map for the SVM classifier on the iris test set

Many classification metrics are based on the confusion matrix! (See the dedicated Wikipedia page for more details.)

Accuracy score¶

This is the simplest score, corresponding to the fraction of correct answers:

$$ accuracy = \frac{TP + TN}{TP + TN + FP +FN} $$

Be careful! This measure can give an overly optimistic score, especially when the dataset contains imbalanced classes. In that case, the balanced accuracy score is preferable.

Balanced accuracy score¶

$$ balanced~accuracy = \frac{1}{2}(\frac{TP}{TP + FN} + \frac{TN}{TN + FP})$$
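
A small sketch illustrating the difference on an imbalanced toy example (90 samples of one class, 10 of the other, and a classifier that always predicts the majority class):

In [ ]:
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)  # always predict the majority class

print(accuracy_score(y_true, y_pred))           # 0.9: looks good but is misleading
print(balanced_accuracy_score(y_true, y_pred))  # 0.5: no better than chance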

Recall / sensitivity / true positive rate¶

This metric measures the classifier's ability to detect true positives among positive samples:

$$ recall = \frac{TP}{TP + FN}$$

Recall is preferred when it is important not to miss any occurrences of a class, for example in a disease screening test.

Precision¶

This metric measures the classifier's ability to detect true positives among positive predictions:

$$ precision = \frac{TP}{TP + FP}$$

Precision is preferred when it is important that positive predictions be correct, for example for the results returned by a search engine.
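
As a quick illustration of recall and precision on a toy binary example (TP = 3, FN = 1, FP = 2 here):

In [ ]:
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print(precision_score(y_true, y_pred))  # 3 / (3 + 2) = 0.6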

F-score¶

It is a metric that combines precision and recall into a single weighted score. The $F_1$ score is often used:

$$ F_1 = 2\frac{precision \times recall}{precision + recall}$$

but the weighting can be generalized to any value of $\beta$:

$$ F_\beta = (1+\beta^2)\frac{precision \times recall}{\beta^2 \times precision + recall}$$
  • The best $F_\beta$ score is 1, the worst 0
  • $F_1$ gives the same weight to recall and precision; with $\beta > 1$ recall is favored, with $\beta < 1$ precision
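
A quick sketch on the same toy predictions, using f1_score and fbeta_score from sklearn.metrics:

In [ ]:
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

print(f1_score(y_true, y_pred))             # harmonic mean of precision and recall
print(fbeta_score(y_true, y_pred, beta=2))  # beta > 1 gives more weight to recall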

Classification report¶

This scikit-learn function returns a report integrating several classification metrics:

  • precision
  • recall
  • F1-score
In [105]:
from sklearn.metrics import classification_report

print(classification_report(y_test, classifier.predict(X_test), target_names=class_names))
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        13
  versicolor       1.00      0.62      0.77        16
   virginica       0.60      1.00      0.75         9

    accuracy                           0.84        38
   macro avg       0.87      0.88      0.84        38
weighted avg       0.91      0.84      0.84        38

Specificity / selectivity / true negative rate¶

This metric measures the classifier's ability to detect true negatives among negative samples:

$$ specificity = \frac{TN}{FP + TN}$$
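
scikit-learn does not provide a dedicated specificity scorer, but in the binary case it can be read directly off the confusion matrix (a minimal sketch on the same toy predictions):

In [ ]:
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

# for a binary problem, ravel() returns tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn / (tn + fp))  # specificity = 2 / (2 + 2) = 0.5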

ROC curve area (AUC)¶

The ROC curve (Receiver Operating Characteristic) is a measure of the performance of a binary classifier, derived from signal detection theory (it was used to separate radar signals from background noise).

It is obtained by plotting the true positive rate (recall) against the false positive rate (1 - specificity) while varying a discrimination threshold. The metric then consists in using the area under the ROC curve (AUC) as a score.

ROC curve

For an interactive example of its construction, see this website.
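
A minimal sketch of the AUC computed from predicted scores (toy scores chosen for illustration):

In [ ]:
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# area under the ROC curve obtained by varying the decision threshold on y_scores
roc_auc_score(y_true, y_scores)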

Multilabel metrics¶

Although some of the metrics presented may have multilabel variants, there are metrics specific to multilabel classification tasks (which involve predicting several labels at the same time).

Some of these are described and implemented in scikit-learn, such as the coverage error, label ranking average precision, ranking loss...
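
For example, a minimal sketch of these ranking metrics on a toy multilabel problem (rows are samples, columns are labels; y_score holds the scores a hypothetical model assigns to each label):

In [ ]:
import numpy as np
from sklearn.metrics import (coverage_error,
                             label_ranking_average_precision_score,
                             label_ranking_loss)

y_true = np.array([[1, 0, 0], [0, 0, 1]])                 # binary label indicators
y_score = np.array([[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]])   # predicted scores

print(coverage_error(y_true, y_score))
print(label_ranking_average_precision_score(y_true, y_score))
print(label_ranking_loss(y_true, y_score))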

Analysis of relationships between variables¶

Using one or more metrics is a first step, but it's not enough to understand the relationships between variables and their impacts on predictions.

Analysis of relationships between features¶

This involves analyzing the relationships between features, for example by examining the importance of the features used by the model.

Some models are natively equipped with such a score:

  • the coefficients of a regression.
  • the importance of the features used in the partitions of a decision tree.
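
As an illustration, a quick sketch of both kinds of native scores on the iris data set (models fitted here only to read these attributes, not tuned):

In [ ]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# importance of each feature in the partitions of a decision tree
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(dict(zip(iris.feature_names, tree.feature_importances_)))

# coefficients of a (logistic) regression, one row per class
logreg = LogisticRegression(max_iter=1000).fit(X, y)
print(logreg.coef_)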

Analysis of the relationship between features and target variable¶

This involves analyzing the relationships between predictors and target variables, in particular their degree of dependence.

Partial dependence analysis (partial dependence plots - PDP)¶

PDP plots are often used to represent the dependence of the response variable on a subset of features of interest (usually a small number of the most important ones), marginalizing over all the other features.

Example of PDP on a predictive model of bicycle rental with the variables temperature and humidity

PDP of the predicted number of bike rentals as a function of temperature and humidity

From scikit-learn
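
As a minimal sketch (the bike rental figure above comes from the scikit-learn documentation), a PDP can be drawn with sklearn.inspection.PartialDependenceDisplay; here the diabetes data set is used only for illustration:

In [ ]:
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# partial dependence of the predicted target on two features of interest
PartialDependenceDisplay.from_estimator(model, X, features=["bmi", "bp"]);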

Individual conditional expectation plot (ICE)¶

ICE plots represent the dependence between the response variable and a subset of features of interest, for each observation separately. For readability, a separate ICE plot is generally drawn for each feature of interest.

ICE example on the same bike rental prediction model:

ICE curves of the predicted number of bike rentals as a function of temperature and humidity

From scikit-learn
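
Likewise, a minimal sketch of ICE curves with the same kind of setup; kind="individual" draws one curve per observation and kind="both" overlays the average (PDP) curve:

In [ ]:
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# one curve per observation for the feature of interest, plus the PDP average
PartialDependenceDisplay.from_estimator(model, X, features=["bmi"], kind="both");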

For more details on the partial dependence analysis of the variables in this data set, see the complete example on scikit-learn.