“You Can’t Predict the Errors of Your Model”… Or Can You?
NannyML has released DLE, an algorithm able to predict the MAE and the MSE of your regression model, in the absence of the ground truth
In a previous article, we have seen how to predict the performance of your classification model, before the ground truth is available. This is super useful in a real-world setting because it gives you early feedback about how your model is performing in production.
This was made possible by an algorithm conceived by NannyML, called “Confidence-Based Performance Estimation” (CBPE), which uses the probabilities predicted by the model to obtain a reliable estimate of any classification metric. A natural question is:
Can we do it also for regression models?
According to NannyML, yes, we can. They have developed a method called “Direct Loss Estimation” (DLE) that allows estimating the performance (specifically Mean Absolute Error and Mean Square Error) of any regression model, when the ground truth is not available.
They claim here that, as a one-stop solution to estimate performance, DLE outperforms other approaches such as Bayesian methods or Conformalized Quantile Regression. But seeing is believing, right? So, in this article, we will explore DLE, try it on a real dataset, and see whether it actually performs as well as NannyML promises.
The main idea
The intuition behind DLE is very simple. So simple that it seems too good to be true. Basically, the idea is to directly predict the errors made by the model.
I know what you are thinking. You are probably skeptical because you’ve heard that:
“You cannot make a model predict the errors of another model.”
But is it true? To answer this question, we need to start from the posterior distribution.
Introducing the posterior distribution
We are used to models that return a single value for each observation, also called “point prediction”. However, we must keep in mind that, behind that point prediction, there is always a full distribution. If you like fancy statistical terms, you can call this the “posterior distribution”.
What’s the meaning of the posterior distribution?
The posterior distribution gives you a full account of the uncertainty of the prediction.
Let’s try to understand it with the aid of an example.
Suppose that we trained a model to predict the income of a person based on her country, gender, age, marital status, and job. Now, imagine that we have 10,000 people, all with the following characteristics:
- Country: USA.
- Gender: Female.
- Age: 27.
- Marital status: Married.
- Job: Salesperson.
Of course, even though they share exactly the same features, they will not all have the same income. We can imagine the distribution of their incomes as the posterior distribution for this particular combination of features.
Let’s generate a posterior distribution in Python:
# generate posterior distribution
import numpy as np
posterior_distribution = np.random.lognormal(0, .25, 10000) * 50000
This is the histogram of the distribution:
The cool thing about knowing the posterior distribution is that we can calculate anything we want: percentiles, mean, median… Anything.
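For example, here is a minimal sketch of a few of those quantities, computed on the distribution we have just simulated:
# a few quantities we can read off the simulated posterior distribution
print(np.mean(posterior_distribution))                  # mean income
print(np.median(posterior_distribution))                # median income
print(np.percentile(posterior_distribution, [5, 95]))   # 90% interval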
However, most predictive models are designed to output a single point prediction. Indeed, given the 10,000 individuals above, your model will predict the same income for each of them. As you can guess, models are usually designed to predict the mean of the posterior distribution.
point_prediction = np.mean(posterior_distribution)
In our case, this is around $50,000. So, this is the model's prediction.
Why it doesn’t make sense to predict the error…
For each one of our individuals, the error is given by the difference between the point prediction and the true value. For example, if the income of a person is $65,000, the error made by the model is -$15,000.
error_distribution = point_prediction - posterior_distribution
And if you take the mean error:
mean_error = np.mean(error_distribution)
This is of course zero by definition.
The fact that the mean error is zero makes a lot of sense. It means that, on average, our model makes the right predictions, because positive errors balance out negative errors.
At this point, it's easy to see why it doesn't make sense to predict the error: it would mean trying to predict something that is zero by definition. ***
But didn’t we say that DLE is based exactly on predicting the error? So, what’s the catch?
*** We are assuming that the model predicts the mean of the posterior distribution. This is not always the case. For example, if you use a loss function different from MSE, your errors may not be 0-centered. However, in that case, predicting the signed error would still be pointless: it would be like “correcting” the loss function that you chose in the first place.
… But it makes a lot of sense to predict the absolute error
The point is that DLE does not predict the signed error, but the absolute error!
It may seem like a small difference, but actually, it’s a whole different story. In fact, contrary to the signed error:
The absolute error is a measure of how uncertain our point prediction is.
Having the posterior distribution, we can compute the absolute errors:
absolute_error_distribution = np.abs(point_prediction - posterior_distribution)
And if we take the mean, we get the mean absolute error:
mean_absolute_error = np.mean(absolute_error_distribution)
With the distribution we have simulated, the mean absolute error is around $10,000. It means that, on average, the actual income is $10,000 away from the point prediction given by our model.
To sum up:
- Trying to predict the signed error makes no sense, because we are trying to correct a model that we believe to be the best one.
- Trying to predict the absolute (squared) error makes a lot of sense, because we are trying to quantify the uncertainty associated with the prediction.
Note that the final performance of our model is directly linked to the uncertainty associated with it. It's intuitive: the more uncertainty, the worse the performance; the less uncertainty, the better the performance.
Describing the DLE algorithm
In practice, DLE trains a new model that learns the uncertainty associated with the predictions of the original model. In fact:
- The original model makes the point predictions (as always).
- The model introduced by DLE predicts the absolute (or squared) errors made by the main model. NannyML calls this one “nanny model” (for obvious reasons).
We can summarize the whole process in 4 steps.
As usual, everything starts with training the model on the training dataset.
Then, the model is used to make predictions on the test dataset. Once we have the predictions, we can also calculate the absolute errors as the absolute difference between target and prediction:
At this point, a second model — called the nanny model — uses the original features and the predictions made by the first model to learn patterns about absolute errors.
Finally, when the model is used in production to make predictions, we can use the first model to predict the target variable and the nanny model to predict absolute errors.
To summarize, this is a view of the entire process (with no distinction between training, test and production datasets for clarity):
Predicting performance on a real dataset
Let’s see if DLE works on a real dataset.
We will use the Superconductor dataset, an open-source dataset from UCI. This dataset consists of 21,263 materials, described by 81 features such as atomic mass, atomic radius, and density. The goal is to predict the critical temperature of each material.
We will split the dataset into a training, a test, and a production dataset.
# load data and split into train, test, prod
import pandas as pd
df = pd.read_csv("/kaggle/input/superconductor-dataset/train.csv").sample(frac=1, random_state=123)
df_train = df.iloc[:int(len(df)/5)]
df_test = df.iloc[int(len(df)/5):int(len(df)/5*2)]
df_prod = df.iloc[int(len(df)/5*2):]
X_train = df_train.drop("critical_temp", axis=1)
X_test = df_test.drop("critical_temp", axis=1)
X_prod = df_prod.drop("critical_temp", axis=1)
y_train = df_train["critical_temp"]
y_test = df_test["critical_temp"]
y_prod = df_prod["critical_temp"]
At this point, we are ready to train our model:
# train model
from lightgbm import LGBMRegressor
model = LGBMRegressor().fit(X=X_train,y=y_train)
Then, we use the test dataset to compute the errors made by the model.
# compute observed errors made by the model on test data
pred_test = pd.Series(model.predict(X_test), index=X_test.index).clip(0)  # critical temperature can't be negative
error_test = pred_test - y_test
Once we have the errors, we can train the nanny model. Note that the nanny model uses as features all the original features plus the predictions made by the first model.
# train model to predict the absolute error
model_abs_error = LGBMRegressor().fit(
    X=pd.concat([X_test, pred_test], axis=1),
    y=error_test.abs()
)
On the production dataset, we first use the main model to obtain the predictions. Then, we use the nanny model to get the predicted absolute errors.
# predict the absolute errors on production data
pred_prod = pd.Series(model.predict(X_prod), index=X_prod.index).clip(0)
pred_abs_error_prod = pd.Series(
    model_abs_error.predict(pd.concat([X_prod, pred_prod], axis=1)),
    index=X_prod.index
)
At this point, we have the prediction of the absolute error for every single observation. Thus, we can finally predict the Mean Absolute Error (MAE) on the production dataset, which was our initial goal.
# predict MAE on production set
pred_mae_prod = np.mean(pred_abs_error_prod)
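The same recipe also gives us an estimate of the Mean Squared Error: we simply train a second nanny model on the squared errors instead of the absolute errors. Here is a minimal sketch (the variable names are mine, chosen just for this example):
# train a second nanny model, this time on the squared errors
model_sq_error = LGBMRegressor().fit(
    X=pd.concat([X_test, pred_test], axis=1),
    y=error_test ** 2
)
# predict the squared errors on production data and average them to get the MSE
pred_sq_error_prod = pd.Series(
    model_sq_error.predict(pd.concat([X_prod, pred_prod], axis=1)),
    index=X_prod.index
)
pred_mse_prod = np.mean(pred_sq_error_prod)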
Testing DLE against data drift
The whole purpose of DLE is to predict the MAE (or MSE) of our model when we don’t have the ground truth. This is extremely useful in a real-world scenario when we want to know in advance if there is a deterioration in the performance of our model.
In order to simulate this scenario, we will split the production set into ten folds. To reproduce data drift, we will not split the folds randomly, but divide them based on the predictions made by the model. In this way, we make sure that the folds are sufficiently different from each other, and that the performance varies noticeably across the folds.
So, let's compare the actual MAE on each fold with the MAE obtained as the mean of the absolute errors predicted by the nanny model.
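Here is a minimal sketch of this comparison, assuming the folds are obtained by cutting the production set into ten quantiles of the model predictions (one simple way to obtain folds with different error profiles):
# split the production set into 10 folds ordered by the model predictions
folds = pd.qcut(pred_prod, q=10, labels=False)
# actual vs estimated MAE on each fold
comparison = pd.DataFrame({
    "actual_mae": (pred_prod - y_prod).abs().groupby(folds).mean(),
    "estimated_mae": pred_abs_error_prod.groupby(folds).mean(),
})
print(comparison)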
The results are impressive. The MAE predicted through the nanny model is practically the same as the actual MAE.
Let’s try with the MSE:
Also in this case, the results are amazingly good.
What happens if we try to estimate the error (rather than the absolute error)?
We have already seen that, from a theoretical point of view, trying to estimate the signed error makes no sense. But since data scientists prefer practice over theory, let's try it anyway.
In other words, this means repeating the procedure above but, rather than training the nanny model on the test absolute errors, training it on the test signed errors.
Translated into Python, this means the following:
# train model to predict the error (which makes no sense)
model_error = LGBMRegressor().fit(
    X=pd.concat([X_test, pred_test], axis=1),
    y=error_test
)
Now, we are able to get a prediction of the signed error for each observation in the production set. If we take these predictions, take their absolute value, and then average them, we obtain a new estimate of the MAE.
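In code, reusing the production objects from before, this naive estimate would look like this:
# predict the signed errors on production data...
pred_error_prod = pd.Series(
    model_error.predict(pd.concat([X_prod, pred_prod], axis=1)),
    index=X_prod.index
)
# ...then take the absolute value and average: a naive estimate of the MAE
pred_mae_naive = np.mean(np.abs(pred_error_prod))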
Let's see how this estimate performs on the same production folds as above:
It’s evident how, using this new strategy, the predicted MAE systematically underestimates the actual MAE.
This is even worse if we try to do it for MSE:
This is just further proof that predicting the absolute error is completely different from predicting the signed error and then taking its absolute value.
And if you don’t want to reinvent the wheel…
In this article, we have implemented DLE from scratch to show how it works under the hood. However, in real life, it’s often preferable to use libraries that are well maintained.
So, you may want to use NannyML directly, since it offers several native conveniences, such as tuning the hyperparameters of the nanny model for you.
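Before fitting the estimator, the data must be arranged in the format NannyML expects: dataframes containing the feature columns, a 'y_pred' column with the model predictions and, for the reference data, the target. Here is a minimal sketch of that preparation; sorting the production set by prediction is just one way to reproduce the drift simulated above, not necessarily the exact construction behind the plots:
# arrange the data in the format NannyML expects:
# feature columns + a 'y_pred' column (+ the target in the reference set)
features = X_train.columns.tolist()
# reference set: test data with the model predictions attached
df_test = df_test.assign(y_pred=pred_test)
# analysis set: production data with predictions, sorted by prediction
# to reproduce drift across chunks (an assumption made for this sketch)
df_prod_drifted = df_prod.assign(y_pred=pred_prod).sort_values("y_pred")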
# use DLE directly with NannyML
import nannyml as nml
estimator = nml.DLE(
    feature_column_names=features,
    y_pred='y_pred',
    y_true='critical_temp',
    metrics=['mae', 'mse'],
    chunk_number=10,
    tune_hyperparameters=False
)
estimator.fit(df_test)
results = estimator.estimate(df_prod_drifted)
Moreover, you can easily obtain fancy plots about your model performance with one line of code:
results.plot(metric='mae')
When it won’t work
When using DLE to estimate performance, there are some assumptions that you should pay attention to. Indeed, if some of these assumptions are violated, the performance estimated by DLE may not be reliable anymore. These are the assumptions:
- There is no concept drift. If the relation between the features and the target variable changes in unforeseen ways, the error that you are trying to predict changes as well, and so the nanny model may fail.
- There is no covariate shift to previously unseen regions in the input space. Both the main model and the nanny model learn from the original features. If the features drift to values not seen during the training/validation phases, both models may make incorrect guesses (a rough check is sketched right after this list).
- The sample of data is large enough. Of course, both the training and the validation datasets need to be large enough for both models to be robust.
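As a rough sanity check for the covariate-shift assumption, you can at least verify that the production features stay within the ranges observed during training. This is only a coarse heuristic (it will not catch every kind of shift), sketched here with the objects from the example above:
# coarse heuristic: share of production rows falling outside the range
# observed in training, per feature (possible unseen regions)
train_min, train_max = X_train.min(), X_train.max()
out_of_range_share = ((X_prod < train_min) | (X_prod > train_max)).mean()
print(out_of_range_share.sort_values(ascending=False).head(10))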
Summing up
In this article, we have seen how to reliably predict the expected performance (specifically, MAE and MSE) of a regression model, when the ground truth is not available.
This algorithm — called “Direct Loss Estimation” (DLE) — has been proposed in this article by NannyML (an open-source library focused on post-deployment data science).
DLE is based on the intuition of directly predicting the absolute (or the squared) error of each single observation. We have tested it on the Superconductor dataset and obtained outstanding results.
You can find all the code used for this article in this GitHub repository.
Thank you for reading! I hope you enjoyed this article. If you'd like, add me on LinkedIn!
This article was originally published in Towards Data Science.