Counterfactual Evaluation Policy for Machine Learning Models


This post was originally published by Sanjeev Suresh at Towards Data Science

How do we monitor models whose actions prevent us from observing ground truth?

The goal of monitoring any system is to track its health. In the context of machine learning, it is crucial to track the performance of the models we serve in production. Monitoring can tell us when a model has gone stale and needs retraining. It can also help us detect abuse in cases like fraud detection, where adversarial actors may be trying to game the model.

To monitor the performance of the models we need to compare their predictions against the true labels. However, a model’s actions can often prevent us from observing the ground truth.

To see why that may be the case, consider a credit card fraud detection model, which we will use as the running example for this article. It predicts each transaction as either fraud or not fraud, and the application code acts on these predictions by blocking predicted fraudulent transactions. Once a transaction is blocked, we cannot observe what would have happened had we let it through, so we never learn whether it was actually fraudulent. We can only observe and label the transactions we let through. In other words, we only observe a skewed distribution of non-blocked transactions.

This absence of ground truth labels for a part of our model predictions introduces two main problems:

  1. How can we continuously monitor the health of the model in production, that is, track metrics like precision and recall?
  2. How can we retrain the model? Before the first version is deployed, we have a dataset that is representative of the real-world distribution. After deployment, we can only train on the transactions we allowed, because those are the only ones with ground truth labels. Every retraining therefore uses a skewed distribution that differs from the real-world one, and over time this can cause a gradual loss in model performance.

For both evaluation and retraining, we want to approximate the distribution of examples and their labels that we would observe in the absence of our intervention.

Let’s define our goals — We need a policy that lets us:

  1. Evaluate the performance of the model in production
  2. Generate unbiased training data for future retraining

To meet both goals, we let through for review a fraction of the transactions we would otherwise block. Let's call this fraction P(allow).

Going back to our fraud detection example, this means allowing a fraction of predicted fraudulent transactions to go through. These allowed transactions have real costs, but that is the price of keeping a healthy model in production, and the practice is widely adopted in industry.

In the absence of a counterfactual evaluation policy, our model logic might look like the following: when the model score exceeds a threshold, the model predicts the transaction is fraud and we block it.
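A minimal sketch of that baseline logic, assuming a hypothetical decision threshold of 0.5:

```python
FRAUD_THRESHOLD = 0.5  # hypothetical value; real systems tune this

def should_block(score: float) -> bool:
    """Block every transaction the model scores as fraud."""
    return score > FRAUD_THRESHOLD
```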

Here is the modified logic with P(allow) = 0.1: we allow 10% of the transactions we would otherwise have blocked, and for those 10% we can observe the true label.
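A sketch of the modified logic, again assuming a hypothetical threshold of 0.5. Transactions routed to "review" must be logged together with the allow probability so they can be re-weighted later:

```python
import random

FRAUD_THRESHOLD = 0.5  # hypothetical value
P_ALLOW = 0.1          # fraction of would-be blocks let through for review

def decide(score: float, rng: random.Random) -> str:
    """Return 'allow', 'review', or 'block' for a scored transaction."""
    if score <= FRAUD_THRESHOLD:
        return "allow"   # predicted not fraud: always let through
    if rng.random() < P_ALLOW:
        return "review"  # would-be block, let through to observe the label
    return "block"
```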

Let's take an example scenario with P(allow) = 0.1 and see how to calculate precision and recall. Out of 1000 transactions, the model predicted 100 as fraud. Because of our counterfactual evaluation policy, we still let 10 of them through. By observing the true labels on these review transactions, 8 of them turned out to be actually fraudulent. Among the 900 transactions predicted not fraud, we later observed 40 to be fraudulent.


With this approach, computing precision is straightforward using just the review transactions: since they are a random sample of the predicted-fraud transactions, precision is simply the fraction of review transactions that are actually fraud.

Precision = Review Fraud / Review Transactions = 8 / 10 = 80%


To compute recall, we first need to estimate the total number of fraud transactions: both those caught by the model and the overall total. Fraud caught by the model can be estimated by weighting the review transactions by a factor of 10 (that is, 1/P(allow)). Every transaction that would have been blocked but was randomly allowed through represents, in expectation, 10 blocked transactions.

Estimated fraud transactions caught by the model = Review Fraud * 1/P(allow) = 8 * (1/0.1) = 80

Estimated overall fraud transactions = Estimated fraud transactions caught by the model + Other Fraud = 80 + 40 = 120

Recall = Estimated fraud transactions caught by the model / Estimated overall fraud transactions = 80/120 = 66.67%
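The arithmetic above, written out as a short script using the numbers from the worked example:

```python
p_allow = 0.1
review_transactions = 10  # predicted fraud, allowed for review
review_fraud = 8          # of those, actually fraudulent
other_fraud = 40          # fraud later observed among the 900 allowed

precision = review_fraud / review_transactions
est_caught = review_fraud / p_allow       # 8 * 10 = 80
est_total = est_caught + other_fraud      # 80 + 40 = 120
recall = est_caught / est_total

print(f"precision={precision:.0%}, recall={recall:.2%}")
# prints: precision=80%, recall=66.67%
```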


Training can be done on the 910 allowed transactions, that is, the ones for which we know the ground truth. Use a weight of 1 for the 900 predicted-not-fraud transactions and a weight of 10 (that is, 1/P(allow)) for the 10 review transactions. This reweighting recovers an unbiased estimate of the real-world data distribution.
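A small helper capturing that weighting rule (the weights can then be passed to a trainer that supports per-example weights, e.g. a `sample_weight` argument):

```python
def training_weight(predicted_fraud: bool, p_allow: float = 0.1) -> float:
    """Importance weight for an allowed, labeled transaction.

    Predicted-not-fraud transactions are always observed, so weight 1.
    Review transactions stand in for the ones we blocked, so each is
    up-weighted by 1 / P(allow).
    """
    return 1.0 / p_allow if predicted_fraud else 1.0
```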

Instead of a constant value, we could use a custom propensity function for P(allow). The idea is to spend more of the review budget on less "obvious" fraud transactions.

If the classifier gives a transaction a score of 1, it is quite certain the transaction is fraud. If the score is, say, 0.49, 0.50, or 0.51, it is much less certain: we are at the decision boundary. With the uniform policy above, transactions are allowed for review at the same rate regardless of score, even though higher scores carry more confidence. To make the most of a limited review budget, it is wiser to spend it where the outcome is least certain, namely near the decision boundary.

To do this, we replace the constant P(allow) with a propensity function that decreases with the score, allowing more transactions near the boundary and fewer near 1. As before, for training and recall estimation, we weight each allowed transaction by 1/P(allow), now using that transaction's own allow probability.
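One possible shape for such a propensity function: a sketch that decays linearly from a base rate at the decision boundary down to a floor for near-certain fraud. The threshold, base, and floor values here are illustrative assumptions, not values from the original article:

```python
def p_allow(score: float, threshold: float = 0.5,
            base: float = 0.2, floor: float = 0.01) -> float:
    """Allow probability that decays as the score moves above the threshold.

    base  : allow rate right at the decision boundary (assumed value)
    floor : minimum allow rate, kept above zero to cap the 1/P(allow) weights
    """
    if score <= threshold:
        return 1.0  # predicted not fraud: always allowed
    # Linearly interpolate from `base` at the boundary to `floor` at score 1.0
    frac = (score - threshold) / (1.0 - threshold)
    return max(floor, base * (1.0 - frac) + floor * frac)
```

Keeping the floor strictly positive ensures every region of the score space has a chance of being observed, which the unbiasedness of the 1/P(allow) weighting relies on.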


This can result in more efficient use of the limited review budget and less overall fraud getting through: the transactions the model is least sure about are the ones reviewed most. We are balancing exploration against exploitation, and the 1/P(allow) weighting has a solid theoretical grounding as inverse propensity scoring.


A single high-weight transaction that gets misclassified can move our estimates significantly. This variance can make the estimates unreliable, and in practice it can be a serious problem, which is one reason to keep P(allow) bounded away from zero and thus cap the weights.

If you are developing a model whose actions affect which ground truth values you observe, design a counterfactual evaluation policy before the first production deployment. This lets you monitor your model and retain an unbiased data distribution for retraining. Start with a simple constant P(allow) and move to a custom propensity function only if the situation warrants the additional complexity.
