This article is part of Faraday's Out of the Lab series, which highlights initiatives our Data Science team undertakes and challenges they solve.
Persona development is a valuable application of machine learning in marketing. These quantitatively developed personas provide brands with a truly data-driven perspective of who their customers are, and let them deliver personalized experiences at a greater scale.
How to find the optimal number of personas using k-means clustering
Our previous article discusses how we develop personas using the k-means clustering algorithm. In this article, we’ll share a few methods we use to find the optimal number of personas for the brands we work with.
The elbow method: finding diminishing returns
The most common metric for measuring the success of k-means clustering is called the ratio of sum of squares (rss). At Faraday, we determine the optimal number of personas by comparing the rss for k-means models with two to fourteen personas. We employ the elbow method, which shows the cutoff after which we see diminishing returns for increasing the number of personas. For most clients, the optimal number of distinct personas is usually between four and seven.
Going beyond the elbow method
The elbow method is simple, but it has a downside: sometimes, there just isn't a clear single point that is optimal. Maybe three personas could be optimal, maybe six, and we have to intervene based on our client’s business needs to choose one or the other as the optimal number of personas.
Additionally, while reducing variability within each group is an extremely useful way to measure k-means clustering success, it doesn't tell the whole story. We believe that an important aspect of clustered groups is stability, which isn't considered with the elbow method selection process. Our clients expect that their persona groups won't shift every time they convert a lead. This is why we are currently researching a new method for determining the optimal number of personas, which focuses on the stability of the persona group.
The bootstrap method: focusing on stability
Our first step to assess the stability of k-means clustering involves resampling the client’s data using the bootstrap method. Like k-means clustering, bootstrapping has been around for a while – it was first introduced in 1979 by Bradley Efron. The basic idea behind bootstrapping is to take data from the pool we have, with the ability to pick the same data point multiple times. This is a method known as sampling with replacement.
The bootstrap method provides us with a powerful tool to measure how much our desired clustering outcome could change with different data. We can get a good sense of how much the personas will change with the addition of new customers. Our goal is to minimize how much influence single customers, or small groups of customers, have on the resulting personas. We want personas that remain relatively unchanged by single customers. In other words, we want more stability.
Here’s an example of the bootstrap method estimating the amount of change in each of the clusters for a given dataset.
Putting it all together
Once we have the results from bootstrapping, we use the ratio of sum of squares again to measure just how stable the personas are. What we’re looking for is a ratio close to 1, which would indicate that there is no variability within the personas.
The stability metric for determining the optimal number of personas shows promising results for the basic examples in this blog post. At Faraday, we’re actively working on applying this methodology to the more complex persona groupings our clients rely on. Stay tuned for more developments with respect to improving our personas methodology.