Why first party data (and RFM) is not enough

First party data-based AI modeling struggles with limitations like noisy data and the cold start problem. By integrating third party data through providers like Faraday, companies can enhance predictions and accelerate AI deployment.

Seamus Abshere

Companies are looking for the fastest way to build AI into their software. Using techniques like RFM (recency-frequency-monetary), many are implementing AI features that learn from how people use websites, respond to emails, and pay for products. This process is called "first party data-based modeling". Companies like Twilio and Klaviyo have used this simple approach to rapidly add features to existing large products.

At the same time, increasing privacy regulation (did you know Colorado and Virginia have their own versions of CCPA?) and the retirement of "third party cookies" have convinced many companies that "first party data only" is the sole option left. It turns out this is not quite right - the very same transaction and customer records that make up first party data can be linked to traditional, legal, consented offline data to create more powerful technology. Over time, companies that start with RFM tend to upgrade to using offline data too.

Problem

Why is first party data not enough? The total information contained in a user's activity may be insufficient to predict their future behavior - the model may be weak. Over time, repeat customers may build up enough data to be useful. But new users will always be a mystery. And once a first party model is built, it must be constantly monitored for new patterns in relatively noisy data. In other words, building AI features into software but limiting them to first party data makes them weaker, less widely applicable, and less stable.

Activity vs affinity

First party activity data is traditionally thought of in terms of "recency" (time stamps), "frequency" (aggregation over time), and "monetary" (revenue impact). Product choice can be added to this (this person bought a pillow!). Does this combination contain enough information to predict their next purchase or how much they will spend over their lifetime? Often, it does not. It would be more predictive to know, in addition to RFM, their lifestyle preferences and financial resources. Activity data is a relatively incomplete view into a person's overall affinity for a product.
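
To make this concrete, here is a minimal sketch of how RFM features might be derived from a first party transactions table. It uses pandas with hypothetical column names (`customer_id`, `order_date`, `order_total`); a real schema will vary.

```python
import pandas as pd

# Hypothetical first party transactions table: one row per order.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime([
        "2023-01-05", "2023-03-10", "2023-02-20",
        "2023-01-15", "2023-02-01", "2023-03-01",
    ]),
    "order_total": [40.0, 55.0, 120.0, 15.0, 25.0, 30.0],
})

as_of = pd.Timestamp("2023-04-01")

rfm = transactions.groupby("customer_id").agg(
    recency_days=("order_date", lambda d: (as_of - d.max()).days),  # recency: days since last order
    frequency=("order_date", "count"),                              # frequency: number of orders
    monetary=("order_total", "sum"),                                # monetary: total revenue
)
print(rfm)
```

Every feature here is derived purely from the customer's own activity - nothing about their lifestyle or financial resources enters the picture.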

Cold start problem

The first time a person shows up at your business, you don't know anything about them - so you can't make any first party-based decisions about them. This is called the "cold start problem" with first party data. Over time, you will know more about their purchase behavior, activity on your website, opens/responses to emails, etc., and RFM models will become more powerful (though still not as powerful as the same model with lifestyle and financial data mixed in). However, if a software company's clients need to make predictions at first contact, first party data will not help.
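
A minimal sketch of what this looks like in practice, reusing the hypothetical RFM table from above: a first-time visitor has no transaction history, so every first party feature is missing, while appended offline attributes (illustrative names only) still give a model something to score.

```python
import pandas as pd

# Hypothetical RFM feature table keyed by customer_id
# (e.g. the output of the groupby sketch above).
rfm = pd.DataFrame(
    {"recency_days": [22, 40], "frequency": [2, 1], "monetary": [95.0, 120.0]},
    index=pd.Index([1, 2], name="customer_id"),
)

def first_party_features(customer_id: int) -> dict:
    """Return a customer's first party features, or all-missing on cold start."""
    if customer_id in rfm.index:
        return rfm.loc[customer_id].to_dict()
    # Cold start: a first-time visitor has no activity history at all.
    return {"recency_days": None, "frequency": None, "monetary": None}

print(first_party_features(99))
# -> {'recency_days': None, 'frequency': None, 'monetary': None}

# Appending consented offline attributes (illustrative names) gives the
# model usable signal even at first contact.
offline_attributes = {"household_income_band": "75k-100k", "hobby_outdoor": True}
scoring_input = {**first_party_features(99), **offline_attributes}
```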

Noisy data

Activity data is noisy. Many random factors can affect when a user opens a browser, comes to your site, stops, buys a product, or opens an email. First party-based models must be able to adapt quickly to the same changing environment. In some industries, like fraud detection, the ability to update models in real time to address ever-changing threat actors is a key feature. In other industries, the need or capacity to constantly refresh models based on noisy data is not necessarily a positive. For example, a model that uses lifestyle or financial attributes to predict the next best product is likely to exhibit more stable performance over time, especially if it incorporates (but does not rely solely on) first product purchased and time spent on site.

Solution

What possible solutions are there? One is to use offline data - a supply chain that has existed for decades. Ever since sewing magazines realized their mailing lists offered valuable insight into people's hobbies, they have been asking their readers for permission to share their data with marketing partners. In terms of regulation, this data is consented - unlike web scraping or even some forms of first party data. Sewing is only one of the many niches with dedicated providers who can derive affinity data from purchases. And every major data aggregator - who has relationships with all the niche providers - has a vested interest in defending its continued use and teams dedicated to adapting to changing laws.

There are a few challenges to using offline data directly. The contracting process is lengthy and requires a very secure infrastructure to receive and store the data - not something all companies have (or even want). Also, there is a lot of data - hundreds of columns. It can take data science teams years just to understand what data is valuable to each market segment and build repeatable workflows to create models. Even if this is done, issues like proxy discrimination and bias propagation arise - and may disqualify the product's use in certain industries.

Are we back to square one - first party data only? Not necessarily. AI infrastructure companies like Faraday that have offline data built-in are solving contracting, security, model discovery, and responsible AI concerns with standardized components. The most important benefit is time to value - software companies feeling the imperative to add AI can release it an order of magnitude faster.

Contracting

Traditionally, buying offline data has been a multi-month contracting process with "what the market will bear" pricing. AI infrastructure companies like Faraday have standardized contracts and usage-based pricing, making it affordable to try new AI product initiatives without high upfront data costs.

Security

Data security requirements for holding offline data are high. A data breach may release not only customers' order history but also more personal data. Having a SOC 2 that specifically audits for the protection of consumer data is a common requirement. AI infrastructure companies store offline data on behalf of other companies, letting them use it in batch and real-time use cases without ever coming into possession of it.

Holding offline data always increases legal requirements. For example, if you hold more than first party transaction and activity data, responding to CCPA (and other) data access and data deletion requests becomes much more expensive. It also reclassifies a company as a data broker, forcing it to register in many states. AI infrastructure companies respond to such requests on behalf of users and are registered data brokers so that other companies don't have to be.

Time to value

Data scientists need time to figure out what data matters, but data usually arrives in an unusable state. It must be cleaned and reformatted. Different techniques are then used to measure each data point's value to the business. Experimental models are built in Jupyter notebooks; finally, the whole process is handed over to a development team to make ready for production. AI infrastructure companies can accelerate this process, for example, by automatically matching first party transaction data to offline data and generating information gain metrics and models that can be backtested, field tested, and iterated on.
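
One way to picture the "information gain metrics" step: rank candidate appended columns by their mutual information with the outcome you care about. Here is a minimal sketch using scikit-learn on synthetic data (the column names and the conversion rule are invented for illustration).

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 1_000

# Synthetic appended columns: two informative, one pure noise.
income_band = rng.integers(0, 5, n)    # ordinal income band
hobby_outdoor = rng.integers(0, 2, n)  # binary hobby flag
noise = rng.normal(size=n)             # carries no signal

# Synthetic outcome that depends only on the first two columns.
converted = ((income_band >= 3) & (hobby_outdoor == 1)).astype(int)

X = np.column_stack([income_band, hobby_outdoor, noise])
scores = mutual_info_classif(
    X, converted, discrete_features=[True, True, False], random_state=0
)

for name, score in zip(["income_band", "hobby_outdoor", "noise"], scores):
    print(f"{name}: {score:.3f}")  # higher = more informative about conversion
```

In practice this ranking runs over hundreds of appended columns, which is exactly the step that takes in-house teams so long to build repeatably.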

Responsible AI

In addition to general public concern about malicious use of AI, specific industry regulators and trade groups have been adding requirements. Longstanding requirements - such as fair housing and fair lending - apply to AI just as much as traditional methods. Other requirements - such as model transparency and elimination of proxy discrimination - take on new forms when using automatically generated models. AI infrastructure companies can provide these as out-of-the-box reporting and mitigation tools instead of individual company data science or compliance teams implementing them on a one-off basis.
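
As one concrete example of what such out-of-the-box reporting can include, here is a minimal sketch of the classic "four-fifths rule" screen for disparate impact, run on synthetic model decisions. This is just one common check, not a complete fairness audit, and the threshold and group labels are illustrative.

```python
import pandas as pd

# Synthetic model decisions: 1 = favorable outcome (e.g. offer extended).
decisions = pd.DataFrame({
    "group": ["A"] * 100 + ["B"] * 100,
    "approved": [1] * 60 + [0] * 40 + [1] * 40 + [0] * 60,
})

# Favorable-outcome rate per group, then the ratio of worst to best.
rates = decisions.groupby("group")["approved"].mean()
disparate_impact = rates.min() / rates.max()

print(rates.to_dict())                   # {'A': 0.6, 'B': 0.4}
print(f"ratio: {disparate_impact:.2f}")  # 0.67

# The four-fifths rule flags ratios below 0.8 for review.
if disparate_impact < 0.8:
    print("Potential disparate impact - review the model and its features.")
```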

Don't leave money on the table

First party data is not a magic bullet; offline data has a steep learning curve. AI infrastructure providers like Faraday, Twilio Segment, and Snowflake can help add AI to existing products faster and with less complexity, while also bypassing the limiting first party data-only stage.

Ready for easy AI?

Skip the ML struggle and focus on your downstream application. We have built-in sample data so you can get started without sharing yours.