Find repeat purchasers
This tutorial uses the Faraday API to predict repeat purchasing. You upload customer identifiers and first-party data, and we provide responsibly sourced third-party data and the infrastructure to make predictions.
😅This tutorial seems long, but it's only 7 POST requests to transform a raw CSV of orders into finished predictions. Go forth and conquer!
📘You can't accidentally incur charges
The steps in this guide, including generating your predictive model, are completely free. You won't be charged until you want to start retrieving buy-again scores at scale.
Account & credentials
Create a free account if you haven't already. You will immediately get an API key that works for test data.
Prepare and send your data
You are ready to send some data over to Faraday. This is done by placing your data into a CSV file and sending it through the API.
📘Sample data
Don't have access to customer data just yet? No problem — grab our sample data from the Testing page.
Make a CSV
Because this tutorial is about your customers, your data source will typically be an export of your orders, but it could also be a list of users from your CRM or other marketing tools. You will need to format your data as a CSV. See Sending data to Faraday for examples and validation details.
Here's an example list of columns in a valid CSV:
- customer ID
- first name
- last name
- address
- city
- state
But you could also (or alternatively) include:
- phone
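Put together, a small orders CSV along these lines might look like the following (the column names and values here are purely illustrative; use whatever your export produces):
customer_id,first_name,last_name,address,city,state,phone,date,total
1001,Jane,Smith,123 Main St,Burlington,VT,802-555-0101,2023-04-01,58.90
1002,John,Doe,9 Elm Ave,Albany,NY,518-555-0142,2023-04-03,24.50
1001,Jane,Smith,123 Main St,Burlington,VT,802-555-0101,2023-05-11,102.75
Note that customer 1001 appears twice: that second order is exactly the repeat-purchase behavior we'll model below.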
🚧️Include a header row
Your CSV file should have a "header" row, but you can use any headers you like. We suggest using recognizable headers that make sense to you.
👍Additional columns are OK
If you're exporting a larger dataset that's convenient to produce, there's no need to remove the extra columns; just upload the whole thing!
Uploading your CSV
After preparing your CSV file, you are going to upload it using the API's upload endpoint.
Note that you will always upload your files to a subfolder underneath uploads. The example below uploads a local file named acme_orders.csv to a folder and file on Faraday at orders/file1.csv. You can pick whatever folder name and filename you want; we'll use it in the next step. You can even upload multiple files with the same column structure into the same folder if that's easier: they'll all get merged together. This is especially useful if you want to update your model over time, for example as new orders come in.
curl --request POST \
--url https://api.faraday.ai/v1/uploads/orders/file1.csv \
--header 'Accept: application/json' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--header 'Content-Type: application/octet-stream' \
--data-binary "@acme_orders.csv"
❗Repeated calls to the `/uploads` endpoint
Repeated calls to the same URL will overwrite any existing file at that location without warning (even if your data have changed). If you want to add a new file rather than overwrite an existing one, it is your responsibility to make sure the filename is unique.
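For example, to add a later batch of orders without overwriting your first file, upload it under a new, unique filename in the same folder (the local and remote filenames below are just illustrative):
curl --request POST \
--url https://api.faraday.ai/v1/uploads/orders/file2.csv \
--header 'Accept: application/json' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--header 'Content-Type: application/octet-stream' \
--data-binary "@acme_orders_week2.csv"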
Mapping your data
Once your file has finished uploading, Faraday needs to know how to understand it. You'll use Datasets to define this mapping.
📘All data across an account is used in modeling
Make sure that all of your configurations and data are up to date: we use as much of the available information as we can in order to build the best models. If you connect a value to the wrong date, so that the value is actually recorded after the date you describe, models can cheat. For instance, if you collect 'favorite color' when a customer churns but associate it with that customer's last transaction, 'favorite color' may look like a strong predictor of churn even though it isn't actually available at prediction time.
If you're using the sample file, check out Testing for an example API call that includes the right field configuration.
curl --request POST \
--url https://api.faraday.ai/v1/datasets \
--header 'Accept: application/json' \
--header "Authorization: Bearer YOUR_API_KEY" \
--header "Content-Type: application/json" \
--data '
{
"name": "orders_data",
"identity_sets": {
"customer": {
"house_number_and_street": [
"address"
],
"person_first_name": "first_name",
"person_last_name": "last_name",
"city": "city",
"state": "state"
}
},
"output_to_streams": {
"orders": {
"data_map": {
"datetime": {
"column_name": "date",
"format": "date_iso8601"
},
"value": {
"column_name": "total",
"format": "currency_dollars"
}
}
}
},
"options": {
"type": "hosted_csv",
"upload_directory": "orders"
}
}
'
Let's break down the above example.
- `upload_directory`: Here you are telling Faraday which files we're talking about by specifying the subfolder you uploaded your data to, e.g. orders in the example above. If there are multiple files in this folder (and they all have the same structure), they will be merged together.
- `identity_sets`: Here's where you specify how Faraday should recognize the people in each of your rows. Your data may have multiple ways of identifying people per row, especially in lists of orders where you may have separate billing and shipping info. The example above creates an arbitrary identity set named customer. It uses name and address (mapping the first_name, last_name, address, city, and state columns from the CSV to the fields Faraday expects), but if you also have emails or phone numbers it's important to include them to improve identity resolution. Faraday will always use the best combination of available identifiers to recognize people. Mapping options are available in Datasets.
- `output_to_streams`: Here's where you tell Faraday how to recognize events in your data. Here, we're calling our events orders, because that's how many companies define their customers' transactional behavior, but you can use any name you like, and one dataset may represent multiple event types. The datetime field specifies when the event occurred; in this case, the date column from the CSV. You can also include metadata about products involved in the event and a dollar value associated with the event (here, the total column). All of these fields are optional.
❗Repeated calls to the `/datasets` endpoint
Repeated calls with identical configurations will create duplicate resources. This can cause problems for downstream models by introducing uneven duplication in your data.
For instance, suppose you had a folder of system_x_orders with one schema and another folder of system_y_orders with a different schema, both mapping into the orders stream, and you called the /datasets endpoint twice for system_y_orders. A customer's total lifetime order count would then increase by one for each system X order but by two for each system Y order. If you do this after building a model, it will result in inaccurate predictions.
You only need to call the /datasets endpoint once for an entire upload_directory of data. If you add a new week's CSV file (with a new filename but into an existing upload_directory) using the /uploads endpoint, you do not need to make another call to the /datasets endpoint.
Create your cohorts
Now you're going to use this identity and event data to formally define the groups of people that matter for your objective. We want to know who has attained the event of interest (a second purchase) and, optionally, when, as well as who is eligible to make a second purchase (and, optionally, when that eligibility starts and ends). Specifying event dates allows us to model the attainment rate as a function of time-varying attributes (such as tenure and recency) and give you better predictions. In this case, individuals who have made a second purchase are a subset of the customers group:
For this tutorial, you want to include all the people in the dataset you created, turning them into a cohort. All you have to do is point to the orders stream you created above and give your cohort a name like "Customers." By default, when a cohort is defined from an event stream, it captures the first date in the stream, which in this case (the first order) is when a customer becomes eligible to make a second purchase.
You will reference these groups both when you define your propensity objective and when you later want to generate predictions.
Customers
Start by creating a cohort of individuals who have placed an order. Point to the orders stream you created above and give your cohort a name like "Customers." Like so:
curl --request POST \
--url https://api.faraday.ai/v1/cohorts \
--header 'Accept: application/json' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data '
{
"stream_name": "orders",
"name": "Customers"
}
'
You'll need the UUID of the cohort you just created in the next step, so copy it now!
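If you're scripting these calls rather than copying IDs by hand, one convenient option is to capture the new cohort's UUID by piping the response through jq in place of the plain curl above. This is just a sketch: it assumes you have jq installed and that the response body includes the new resource's id field, so confirm both against the response you actually receive.
CUSTOMERS_COHORT_ID=$(curl --silent --request POST \
--url https://api.faraday.ai/v1/cohorts \
--header 'Accept: application/json' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data '{"stream_name": "orders", "name": "Customers"}' \
| jq -r '.id')
echo "$CUSTOMERS_COHORT_ID"   # save this for the outcome and scope steps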
Repeat buyers
Now create another cohort, this time of customers who have placed multiple orders. Again, we'll point to the orders stream, but this time we'll use the min_count option to restrict membership to people who have experienced an order event at least twice. We'll call this cohort "Repeat purchasers":
curl --request POST \
--url https://api.faraday.ai/v1/cohorts \
--header "Authorization: Bearer YOUR_API_TOKEN" \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--data '
{
"name": "Repeat purchasers",
"stream_name": "orders",
"min_count": 2
}
'
Copy down this cohort's ID too.
Now we have two cohorts, with one being a subset of the other. This structure will make for straightforward outcome, scope, and scoring configurations later on.
Create your outcome
Now that you've formally defined your customer groups, it's time to move on to prediction. For this tutorial, you'll create an outcome from the two cohorts you just made, which uses machine learning to build a model that predicts whether a given individual looks more like someone who will attain the event of interest (a second purchase) within the next 30 days or not.
You will take the cohort UUIDs returned in the previous step and use them to make the following call to create an outcome:
curl --request POST \
--url https://api.faraday.ai/v1/outcomes \
--header "Authorization: Bearer YOUR_API_KEY" \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--data '
{
"name": "Repeat Purchasers",
"attainment_cohort_id": "_REPEAT_PURCHASERS_COHORT_ID",
"eligible_cohort_id": "YOUR_CUSTOMERS_COHORT_ID"
}
'
When you create this outcome, Faraday starts building and validating the appropriate ML model behind the scenes. Remember to save the UUID you get back in your response.
Learn about your model
Once the model has finished building, we will generate an outcome model report for you. The report explains how we generated your model, how well your model performed, and more. You can use the call below to view the report:
curl --request GET \
--url https://api.faraday.ai/v1/outcomes/YOUR_OUTCOME_ID/report.html \
--header "Authorization: Bearer YOUR_API_KEY" \
--header 'Accept: text/html'
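To view the report in a browser, you can save the response to a local file and open it, for example:
curl --request GET \
--url https://api.faraday.ai/v1/outcomes/YOUR_OUTCOME_ID/report.html \
--header "Authorization: Bearer YOUR_API_KEY" \
--header 'Accept: text/html' > report.html
open report.html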
Generate repeat purchaser predictions
Now that you have a model for repeat purchasing, you can predict which customers are likely to buy again in the next 30 days, and then retrieve those results.
Set up your scope
To do this, you will first create a Scope—this is how you tell Faraday which predictions you may want on which populations. You'll need three UUIDs from resources you created:
- The repeat purchaser outcome (the model)
- The customers cohort (to include this population)
- The repeat purchasers cohort (to exclude this population)
Rather than defining a new cohort of one-time purchasers to score, you can specify the cohort to include in cohort_ids and the cohort to exclude in exclusion_cohort_ids within the population.
Here you can also explicitly set preview to true, although it's the default. Preview mode limits the scope's output so that you can validate your setup without incurring billing charges.
curl --request POST \
--url https://api.faraday.ai/v1/scopes \
--header "Authorization: Bearer YOUR_API_KEY" \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--data '
{
"payload": {
"outcome_ids": [
"YOUR_OUTCOME_ID"
]
},
"population": {
"cohort_ids": [
"YOUR_CUSTOMERS_COHORT_ID"
],
"exclusion_cohort_ids": [
"YOUR_REPEAT_PURCHASERS_COHORT_ID"
]
},
"name": "SCOPE_NAME",
"preview": false
}
'
🚧️Broadening your scope
The above configuration is great at controlling costs, because Faraday will only calculate and cache scores for the "Customers" you already have each day. If you want to retrieve a buy-again score for each new customer the moment they place their first order, you'll want your scope's population to be "Everybody." See the scopes reference for more information.
Checking scope status
Faraday proactively makes and caches the predictions you defined in your scope, which may take some time. To see whether your scope is ready, you can fetch https://api.faraday.ai/v1/scopes/{scope_id}.
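For example:
curl --request GET \
--url https://api.faraday.ai/v1/scopes/YOUR_SCOPE_ID \
--header "Authorization: Bearer YOUR_API_KEY" \
--header "Accept: application/json"
Inspect the response to confirm the scope has finished building before moving on.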
Deploying predictions
Now it's time to download the results! The simplest way to do this is to retrieve them all in a single CSV file.
Add a target
First you'll add a Target to your scope with type hosted_csv.
curl --request POST \
--url https://api.faraday.ai/v1/targets \
--header 'Accept: application/json' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data '
{
"name": "repeat_purchaser_csv_target",
"options": {
"type": "hosted_csv"
},
"representation": {
"mode": "hashed"
},
"scope_id": "YOUR_SCOPE_ID"
}
'
Check whether your target is ready
Before trying to download your CSV, check whether the resource (along with its dependencies) is ready:
curl --request GET \
--url https://api.faraday.ai/v1/targets/YOUR_TARGET_ID \
--header "Authorization: Bearer YOUR_API_KEY" \
--header "Accept: application/json"
Retrieve your CSV
Once your deployment is ready, you can download the hosted CSV you created when you added your deploy target.
Looking at the file, you'll see that each one of your customers has been scored for propensity to buy again.
In production, you'll generally automate the retrieval of this file and its insertion into your data warehouse and other systems. Faraday supports integration with a wide variety of tools.
curl --request GET \
--url https://api.faraday.ai/v1/targets/YOUR_TARGET_ID/download.csv \
--header "Authorization: Bearer YOUR_API_KEY" \
--header "Accept: application/json" > my_local_file.csv
open my_local_file.csv
🚧️Preview mode
If a scope is in preview mode, you will only get a sample of the complete results back. This helps you validate the results you're getting and build your integrations before incurring charges.