How to set up holdout tests to evaluate prediction accuracy

If you've ever used AI to make predictions, chances are you've asked yourself "How do I know these are . . . correct?" It's a fair question, and healthy skepticism is a critical component of any Responsible AI practice.

Luckily there are plenty of great options for confirming that the predictions you're getting are mostly right. First, a bit of background.

Prediction basics: How do AI models make predictions?

To set the stage, let's review how prediction works. Throughout this post, I'll use the example of a company wanting to predict which of its leads are most likely to "convert" (become a customer). In Faraday you'd represent this prediction objective with an outcome like so:

Lead conversion outcome configuration

Finding patterns

Now we'll take you behind the scenes. To predict something, a system must comb through historical examples of that thing happening to find patterns.

Once you know the patterns, you can compare newly observed behavior against the patterns that signal "true": if they're similar, you predict "true"; otherwise, "false."

So, to follow our example, the first step is to pull together some historical lead data, including some leads that converted and some that didn't.

Training data diagram

Here, the top row represents your leads over time at their first capture point. Some of them—the green dots—"convert" by becoming customers: we call these examples. Others—the red dots—go "stale" and represent failures: we call these counterexamples. It's too early to say what the few remaining gray dots will do, so we'll just ignore them.

One useful thing to note is that 6/13 of our leads converted, or about half the time. This will come in handy down the road.

Your prediction system will then use one of a variety of methods to look for interesting patterns. Is there something in common among the green leads? What about the red ones? How are they different? This identification of patterns (the model) is what lets the system make predictions about new leads.
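To make this concrete, here's a minimal sketch of the pattern-finding step in Python with scikit-learn. The features (pages_viewed, used_work_email) and the 13 tiny rows are invented for illustration—a real system like Faraday works from far richer data—but the shape of the step is the same: label your examples and counterexamples, then fit a model.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical historical leads: 6 converted (green), 7 went stale (red).
history = pd.DataFrame({
    "pages_viewed":    [12, 2, 8, 1, 15, 3, 9, 2, 11, 4, 7, 1, 5],
    "used_work_email": [ 1, 0, 1, 0,  1, 0, 1, 0,  1, 0, 1, 0, 0],
    "converted":       [ 1, 0, 1, 0,  1, 0, 1, 0,  1, 0, 1, 0, 0],
})

# The base rate we'll need later: 6/13, or about half.
print(f"Conversion rate: {history['converted'].mean():.0%}")

# "Finding patterns" is fitting a model on examples and counterexamples.
X = history[["pages_viewed", "used_work_email"]]
y = history["converted"]
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
```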

Using patterns to make predictions

So now, let's take those pesky remaining gray dots, as well as some new ones that might have popped up since we figured out our patterns. I've put a dark border around them:

Data diagram including new leads

Your prediction system will scrutinize each of these leads to see which of the patterns it follows. If a lead follows a lot of the patterns associated with conversion, the system will predict that it will convert; if it follows a lot of the patterns associated with non-conversion, the system will predict non-conversion.

Diagram showing predictions
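In code, this step is just applying the fitted model from the previous sketch to leads whose outcome is still unknown (same invented features; `model` and the imports carry over from above):

```python
# New / still-gray leads (outcome unknown).
new_leads = pd.DataFrame({
    "pages_viewed":    [14, 2, 6],
    "used_work_email": [ 1, 0, 1],
})

# Compare each new lead to the learned patterns.
predictions = model.predict(new_leads)          # 1 = predicted to convert
scores = model.predict_proba(new_leads)[:, 1]   # how strongly it matches
print(list(zip(predictions, scores.round(2))))
```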

Beware silly patterns

Easy, right? Too easy, you might say. The reason to be skeptical here is that the patterns your prediction system found may be coincidences rather than reflections of some underlying truth.

For example, your prediction system may say:

Here's a pattern I found! Leads with one of these exact email addresses convert, all others don't: amy@example.com, leon@example.com, ..., pat@example.com

Sounds silly, but it happens all the time when the wrong kind of data is examined for patterns. The methods your prediction system uses aren't always clever enough to distinguish between goofy patterns and serious ones. Based on the data available, this pattern is actually incredibly powerful! In fact, it's 100% accurate.

Problem is, it will never work with any new leads, because they'll all have different email addresses and the system will therefore predict non-conversion for all of them.
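Here's a sketch of that failure mode with made-up addresses: a model that keys on the literal email address scores perfectly on the data it was shown, and is useless on anything new.

```python
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up leads where the only "feature" is the email address itself.
emails = [["amy@example.com"], ["leon@example.com"], ["pat@example.com"],
          ["bob@example.com"], ["dee@example.com"], ["sam@example.com"]]
converted = [1, 1, 1, 0, 0, 0]

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
memorizer = DecisionTreeClassifier().fit(enc.fit_transform(emails), converted)

print(memorizer.score(enc.transform(emails), converted))  # 1.0 -- "100% accurate"!

# Every future lead has an address the model has never seen, so its
# predictions for them carry no real information.
print(memorizer.predict(enc.transform([["kim@example.com"], ["raj@example.com"]])))
```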

Three holdout testing methods to see if your predictions are accurate

The way to avoid this kind of silliness is to "test" your patterns to see if they're right. If they pass the test, you can trust them going forward.

Think about your tests in school: your teacher asked you questions and you provided answers. Your teacher knew the true answers. If your answers matched the teacher's true answers, you got an A.

The hardest part of all this is that somebody has to "know" the "true" answers. If only we had some lead data where we knew the right answer . . .

But of course we do! We know which leads actually converted and which didn't. We just can't use the same data to both find patterns and put those patterns to the test.

There are a few great ways to split up your data for these two simultaneous purposes. They all rely on a "holdout"—setting aside a portion of the data to use exclusively for testing. This allows you to see whether the system "understands" enough to assess new data versus only being able to regurgitate answers for the questions it has already seen. Let's dive into these methods now.

Method 1: random holdout

Setup

This is the most common approach, and it involves choosing a random subset of your historical data to set aside as a holdout. Only then do you use your prediction system to find patterns among the remaining data.

Cross validation diagram

Here, we've used 50% of our data as a holdout. The system will find patterns exclusively among the remaining leads.
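In code, the setup is one call to scikit-learn's train_test_split, continuing with the X and y from the earlier sketch: half the labeled leads become the holdout, and the model only ever sees the other half.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Randomly set aside 50% of labeled leads as the holdout.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)

# Find patterns ONLY in the non-holdout half.
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
```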

Testing

Now we're ready to "take the test." The leads in the holdout are the questions, and their true disposition (converted or not) is the answer key. But for the time being we pretend we don't have the answer key and we treat the holdout leads like new leads. What can the patterns we have tell us about these leads?

I'll illustrate those predictions with colored borders on the holdout leads:

Prediction step of cross validation

Once we've used the patterns to make predictions about the holdout leads, we can dramatically reveal the answer key and see how accurate the patterns really are. Let's see how we did:

Cross validation results

We got 4 right and 2 wrong for about 67% accuracy. Is that good or bad?

Evaluation

In real life you'll rarely get an A on this test. In some cases you'll predict conversion when the lead didn't actually convert (false positive). In other cases, you'll predict non-conversion when the lead did actually convert (false negative).

Recall that our actual lead conversion rate is about 50%. That means that a really dumb prediction system could always predict "convert" and get it right about 50% of the time, or flip a virtual coin and get the same result. By this measure, 67% is quite a bit better!

So assuming you got more right than wrong, it's an improvement. Before, your patterns were always right about your historical leads, but were useless for assessing new leads. Now, maybe your patterns aren't as perfect, but at least they're useful!
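The grading step, plus that "really dumb" baseline, looks like this (continuing the split above; DummyClassifier simply always predicts the most common outcome):

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Reveal the answer key and grade the model's holdout predictions.
holdout_predictions = model.predict(X_holdout)
print(f"Holdout accuracy: {accuracy_score(y_holdout, holdout_predictions):.0%}")

# Baseline: always guess the most common outcome (~50% here).
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(f"Baseline accuracy: {baseline.score(X_holdout, y_holdout):.0%}")
```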

Conclusion

Finding the right balance between false positives and false negatives is part of the art of prediction and ultimately comes down to risk versus reward. It's out of scope for this post but I'll be writing one on that soon—subscribe below to get notified.

The upshot here is that the patterns the system found on the non-holdout leads appear to be effective in making predictions about the holdout leads. This is a very good sign that the patterns themselves are "real" and not based on silly coincidences.

The name for this approach is cross-validation. In practice, you actually perform this random holdout method multiple times to really make sure your patterns are real. The Faraday platform automatically uses 3 passes like this every time we look for patterns.
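Repeating the random holdout is exactly what scikit-learn's cross_val_score automates; the 3-fold run below mirrors those 3 passes (an illustrative sketch, not Faraday's actual implementation):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Three passes: each third of the data takes a turn as the holdout.
scores = cross_val_score(DecisionTreeClassifier(max_depth=2, random_state=0),
                         X, y, cv=3, scoring="accuracy")
print(scores, scores.mean())
```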

Method 2: recent holdout

Setup

Another approach is to use your most recent data as your holdout, a method called a backtest. As a refresher, here's what we started with:

Training data diagram

To use this method, we temporarily "rewind" time to a previous day, like so:

Rewound training data diagram

From here, the process is much like above. Our prediction system attempts to find patterns looking backwards from our "pretend now," ignoring the lead data that has transpired since then, which becomes the holdout.
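If each lead carries a capture date, a backtest split is just a cutoff date instead of a random draw. A self-contained sketch with invented dates and features:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labeled leads with a capture date (illustrative values only).
leads = pd.DataFrame({
    "captured_at": pd.to_datetime(
        ["2023-08-01", "2023-09-15", "2023-10-02", "2023-11-20",
         "2024-01-10", "2024-02-05", "2024-02-28", "2024-03-12"]),
    "pages_viewed":    [12, 2, 8, 1, 15, 3, 9, 2],
    "used_work_email": [ 1, 0, 1, 0,  1, 0, 1, 0],
    "converted":       [ 1, 0, 1, 0,  1, 0, 1, 0],
})

cutoff = pd.Timestamp("2024-01-01")              # our "pretend now"
train = leads[leads["captured_at"] < cutoff]     # older leads: find patterns here
holdout = leads[leads["captured_at"] >= cutoff]  # recent leads: the answer key

features = ["pages_viewed", "used_work_email"]
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(train[features], train["converted"])
print(f"Backtest accuracy: {model.score(holdout[features], holdout['converted']):.0%}")
```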

Testing

Then it uses those "old" patterns to examine recent leads. Again, predictions here are represented by the colored borders on the holdout leads:

Backtest prediction step

And finally we use our answer key to see how we did:

Backtest results

Evaluation

Here again, we went 4 for 6, or 67%. By the same logic as above, not bad.

The big advantage of the recent holdout versus the random holdout in my mind is that it's easier to understand and, ultimately, trust.

The main disadvantage of this type of holdout is that the system can only look for patterns in data spanning a shorter period of time. If the important patterns take time to appear, a recent holdout won't perform as well as a random one.

Faraday has this feature built in as an additional option if you want it, although it's very compute-intensive so it's not available on all accounts by default. If you don't see it and want to use it, just ask support.

Method 3: secret holdout

You may hear other folks call this a blind holdout, although I hope to see the end of that ableist terminology soon.

The idea here is that you go to extreme lengths to hide your holdout from the prediction system—or even the company running the prediction system!—in order to boost your confidence even further.

Here, the principal doesn't even trust the teacher with the answer key! The diagram looks something like this:

Secret holdout

The user has pre-emptively removed a random portion of the data to serve as the holdout and shared the remainder with the prediction system, which has no idea what's in the holdout.

The system finds patterns using all the data it has. Then the user performs the test by presenting the holdout leads to the system without their true dispositions. The system makes its predictions as usual:

Secret test prediction step

The user then secretly compares these predictions to the "true" answer key to determine accuracy.

In practice the secret holdout is exactly the same as the random holdout, just a heck of a lot more complicated and time-consuming. Nevertheless, you could consider it if you don't trust your prediction system for one reason or another.

Faraday supports secret holdout testing. When connecting your data, make sure to remove your holdout first, so Faraday never sees it. Configure your account as usual. Then upload your holdout—minus disposition—as a CSV dataset with a holdout trait, create a matching Holdout cohort, and use it as the population in a new pipeline. Add the outcome you want to test to the payload, and deploy an identified CSV target. Now match Faraday's predictions back to your original holdout and compare them to the true dispositions to evaluate accuracy.
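Outside of any particular platform, the bookkeeping on your side might look like the sketch below. The file names and the email join key are assumptions for illustration; the essential point is that the true disposition column never leaves your hands until grading time.

```python
import pandas as pd

leads = pd.read_csv("all_leads.csv")  # assumed to include a "converted" column

# 1. Secretly carve off a random holdout before sharing anything.
holdout = leads.sample(frac=0.2, random_state=7)
shared = leads.drop(holdout.index)
shared.to_csv("shared_with_prediction_system.csv", index=False)

# 2. Hand over the holdout WITHOUT its true disposition for scoring.
holdout.drop(columns=["converted"]).to_csv("holdout_questions.csv", index=False)

# 3. Later: match returned predictions against your secret answer key.
predictions = pd.read_csv("returned_predictions.csv")   # e.g. email, predicted
graded = holdout.merge(predictions, on="email")
accuracy = (graded["converted"] == graded["predicted"]).mean()
print(f"Secret holdout accuracy: {accuracy:.0%}")
```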

The real deal

Prediction systems finding silly patterns is a real problem—you could even say it's the problem of machine learning. Luckily there are lots of ways to avoid these kinds of shenanigans, and good prediction systems employ them automatically.

Behind the scenes, Faraday runs random and/or recent holdout testing constantly to confirm it's finding legitimate patterns that are practically useful going forward.

I know all this (I helped build the system!), but I still treasure my ongoing skepticism, and you should too. Hopefully understanding how pattern validation works, and seeing how it operates behind the scenes, will help you develop a careful confidence in the predictions you build your workflows around.
