Preparing your data for seamless ingestion - a technical guide
To get started with Faraday, proper data ingestion is essential. This article offers a technical overview of the processes and best practices for preparing and ingesting your data into Faraday's system, ensuring fast and accurate predictions.



This post is part of a series called Getting started with Faraday, which helps familiarize users with the platform.
When you're ready to use Faraday to build AI Agents, the first step is getting your data into our system. There are two primary methods to make that happen: using a data warehouse or uploading individual .csv files. Your best option depends on your team's setup and resources. This guide covers how to ingest your data and the best practices that make it usable for modeling as quickly as possible.
Part 1: Locating your data
Data warehouse vs. file upload
Data warehouse:
If your data resides in a warehouse (e.g., Snowflake or BigQuery), integration is straightforward. Your team can provision a small, dedicated space within the data warehouse for Faraday, providing secure, efficient, and timely transfers without the hassle of repeated manual uploads. Our connections console provides the necessary credential details for integration.
File uploads:
If your data is managed as individual files, you can upload them directly via our datasets console. Although this method is simple, it is time-consuming to keep data fresh over time because of the level of effort required each time.
Note: Faraday is looking for .csv (comma-separated values) files - they can be exported from databases, various business platforms, and spreadsheets alike.
Automation considerations:
For brands using cloud storage (e.g., Amazon S3 or Google Cloud Storage), setting up automated data transfers is simple and efficient. These services can be connected to Faraday with minimal effort and can be nearly as effective as connecting directly to a data warehouse. An SFTP transfer option is equally suitable, but may require slightly more knowledge and access to resources within your organization.
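For example, if your exports land in Amazon S3, a small scheduled script can push each fresh file into the bucket Faraday reads from. The sketch below is illustrative only: it uses boto3, and the bucket name, key prefix, and local file path are placeholders rather than real Faraday values.

```python
# Minimal sketch: push a fresh CSV export to a cloud bucket that Faraday reads from.
# Bucket name, key prefix, and local path are placeholders for illustration.
import boto3

s3 = boto3.client("s3")  # uses credentials from your environment or IAM role

s3.upload_file(
    Filename="exports/orders_2024-06-01.csv",   # local export from your database or CRM
    Bucket="your-company-faraday-exports",      # bucket you have connected to Faraday
    Key="orders/orders_2024-06-01.csv",         # keep a consistent prefix per dataset
)
```

Run a script like this from cron or your existing pipeline scheduler after each export completes, and the connected dataset stays fresh without manual uploads.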
.csv file formatting requirements
To ensure smooth processing, follow these guidelines when preparing your CSV files (a quick validation sketch follows the list):
- Format:
- Files: These should be comma-separated and UTF-8 encoded, ideally with a header row listing the column names. If headers are missing, you can add them in the dataset configuration's 'Advanced options'.
- Data warehouse: Sticking to simple, common field types such as decimals, integers, booleans (true/false), and text (string/varchar) is most effective. Tables and views are both acceptable.
- Size:
- Files: Individual files can be up to 5GB in size and should generally contain fewer than 50 columns. Consider splitting larger files into smaller parts and/or removing any fields not related to behavior, time, quality, or identity.
- Data warehouse: Generally, there is no size limit for data warehouse tables, but large tables should be handled with care: the larger the dataset, the longer the processing time.
- Structure:
- Files: If providing updates to data, ensure a consistent structure across all files in a dataset (cloud bucket/SFTP: "prefix") - the same column names, column order, and data types (if applicable).
- Data warehouse: If using a table, ensure it remains up to date through a regular refresh process. Flattened tables (not nested) are preferred in most cases. Plan on establishing a set of fields to include and avoid changing them over time (such as adding or removing fields).
- Delimiter consistency:
- Files: Commas are the preferred delimiter, but this setting is customizable at the dataset level (tabs, semicolons). If your values may contain the delimiter, ensure all columns are quoted.
- Additional flexibility:
- Files: We accept compressed files such as .gz or .zip if you are batching uploads or saving on storage.
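Before uploading, it can help to run a quick sanity check against the guidelines above. The Python sketch below is only a starting point: the file name is a placeholder and the thresholds simply mirror the limits described in this section.

```python
# Minimal sketch: sanity-check a CSV against the guidelines above before uploading.
import csv
import os

MAX_BYTES = 5 * 1024**3   # 5GB per-file limit
MAX_COLUMNS = 50          # recommended column ceiling

def check_csv(path: str, delimiter: str = ",") -> None:
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        print(f"WARN: {path} is {size / 1024**3:.1f}GB; consider splitting it.")

    # Opening with encoding="utf-8" will raise if the file uses another encoding.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader)
        if len(header) > MAX_COLUMNS:
            print(f"WARN: {len(header)} columns; consider trimming fields unrelated to behavior, time, quality, or identity.")
        for line_number, row in enumerate(reader, start=2):
            if len(row) != len(header):
                print(f"WARN: line {line_number} has {len(row)} values, expected {len(header)}.")

check_csv("orders.csv")  # placeholder file name
```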
Part 2: Data essentials for accurate predictions
Once your data is uploaded, it must include the necessary elements for precise predictions. Faraday supports various data types, but certain formats yield better results.
Types of data
- Event stream data:
This type of data is preferred for its detailed insights into customer behavior: each row represents an event (e.g., account creation or purchase) with a timestamp. This granularity is key for understanding customer journeys. (See our Datasets documentation for more details.)
Flavors:
- email engagement data
- website engagement data
- app usage data
- order data
- sales funnel data
- customer journey mapping
- direct mail response tracking
- Customer-level (summary) data:
This type of data captures one row per customer with summary information such as purchase history metrics, customer tier, original source, or lifetime value. With a joining key (e.g., customer_id), it works best when combined with event stream data.
Note: Summary-type data may hold valuable insights on your customers' first-party traits. It can also include event timestamps as additional columns when relevant, and is frequently seen in CRM exports.
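To make the two shapes concrete, here is a small illustrative example; the column names (order_id, customer_id, customer_tier, and so on) are made up for demonstration. One table holds one row per event, the other one row per customer, and the customer_id key lets them work together.

```python
# Illustrative only: the two data shapes described above, with made-up column names.
import pandas as pd

# Event stream: one row per event, each with a timestamp
events = pd.DataFrame({
    "order_id":    ["A-100", "A-101", "A-102"],   # primary key for the event
    "customer_id": [1, 1, 2],                     # reference key to the individual
    "event_date":  ["2024-01-05", "2024-03-12", "2024-02-20"],
    "order_total": [49.00, 120.00, 35.50],
})

# Customer-level summary: one row per customer
customers = pd.DataFrame({
    "customer_id":    [1, 2],
    "customer_tier":  ["gold", "bronze"],
    "lifetime_value": [169.00, 35.50],
})

# The joining key is what lets the two datasets complement each other
combined = events.merge(customers, on="customer_id", how="left")
print(combined)
```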
Essential fields
To ensure accurate modeling, include these key fields:
- Personally identifiable information (PII):
First name, last name, address (street, city, state, postal code), email, and phone number. Name + address or email is required at a minimum. More PII enables enhanced data enrichment through the Faraday Identity Graph and can unlock a world of value.
- Event timestamps:
Accurate timestamps are critical for tracking when events occur. "YYYY-MM-DD" is the preferred format, but other common selections are available via the app and API (see the normalization sketch after this list).
Note: Ensure all time-based fields follow the same format, whichever option you choose.
- Event properties:
Other values captured at the time of the event that may exist in your records (cost of goods, method of payment, unique products or services, result of the event). In these cases, the additional data should answer the question: how much value was generated for your business by this event?
- Unique identifiers:
Each set of data should have a primary key (e.g., an order number) and a reference key linking the event to an individual. These datapoints deliver maximum flexibility to both you and Faraday as data is updated over time.
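As noted above, time-based fields should all follow the same format. A minimal normalization sketch, assuming a few common input formats (extend the list to match whatever your exports actually produce), might look like this:

```python
# Minimal sketch: normalize mixed date strings to the preferred YYYY-MM-DD format.
# The input formats below are assumptions; add or remove entries to match your data.
from datetime import datetime

KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"]

def to_iso_date(value: str) -> str:
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {value!r}")

print(to_iso_date("06/01/2024"))   # -> 2024-06-01
print(to_iso_date("2024-06-01"))   # -> 2024-06-01
```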
Identity resolution and matching
After your first-party data is connected, Faraday’s identity resolution software links individuals to additional third-party attributes with over 1,500 available traits and counting. Providing as much quality PII as possible improves matching accuracy and minimizes reliance on aggregated traits.
At a minimum, you must provide an email or a combination of first and last name with street address and postal code.
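If you want to estimate how much of a file meets that minimum before uploading, a quick check along these lines can help. The column names used here (email, first_name, last_name, street, postal_code) are placeholders for whatever your export actually contains.

```python
# Sketch: estimate how many rows meet the minimum PII needed for identity resolution.
# Column names are placeholders; adapt them to your own file.
import pandas as pd

df = pd.read_csv("customers.csv", dtype=str)

has_email = df["email"].notna()
has_name_and_address = (
    df["first_name"].notna() & df["last_name"].notna()
    & df["street"].notna() & df["postal_code"].notna()
)

matchable = has_email | has_name_and_address
print(f"{matchable.mean():.0%} of rows meet the minimum PII requirement")
```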
Note on data hygiene
Organizing and labeling your data correctly is vital — poor quality input yields poor quality output every time. Clean, well-curated data leads to smoother processing and more accurate insights.
Best practices include:
- Maintaining consistent formatting: keep types consistent within each field, format names and addresses with a similar structure, give categorical fields a fixed set of standard values, and leave missing values blank rather than using filler values (e.g., "None", "N/A")
- Verifying that phone numbers are a full 10 digits, accurate emails are provided, and names come with full physical addresses
- Deduplication
- Ensuring you do not have large gaps in field coverage or in time between groups of events
- Not providing more data than you intend to use (aside from fields reserved for future use cases)
Investing time in even minimal data hygiene upfront saves time and reduces the likelihood of missing a problem that was lying in wait.
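As a rough starting point, a small cleanup pass covering several of the practices above (filler values, deduplication, phone-number length) might look like the following pandas sketch; the file and column names are placeholders.

```python
# Minimal hygiene sketch: drop filler values, deduplicate, and flag short phone numbers.
# File and column names are placeholders; adapt them to your own export.
import pandas as pd

FILLER_VALUES = ["None", "N/A", "n/a", "null", ""]

df = pd.read_csv("customers.csv", dtype=str)

# Replace filler strings with true missing values
df = df.replace(FILLER_VALUES, pd.NA)

# Deduplicate on the field you treat as the primary key
df = df.drop_duplicates(subset=["customer_id"])

# Flag phone numbers that are not a full 10 digits once non-digits are stripped
digits = df["phone"].fillna("").str.replace(r"\D", "", regex=True)
bad_phones = df[(digits != "") & (digits.str.len() != 10)]
print(f"{len(bad_phones)} rows have malformed phone numbers")

df.to_csv("customers_clean.csv", index=False)
```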
What’s next?
Getting your data into Faraday is just the beginning. By following these best practices, your data will be structured and ready for predictive modeling—whether you’re predicting lead conversion or conducting market opportunity analyses. If you have trouble checking all the boxes above, don't worry - Faraday will help find a solution for your needs.
Let us know if you have additional questions as you prepare your data. You can connect with support at support@faraday.ai - happy modeling!
Ready for easy AI agents?
Skip the struggle and focus on your downstream application. We have built-in sample data so you can get started without sharing yours.