Preparing your data for seamless ingestion - a technical guide
To get started with Faraday, proper data ingestion is essential. This article offers a technical overview of the processes and best practices for preparing and ingesting your data into Faraday's system, ensuring fast and accurate predictions.



This post is part of a series called Getting started with Faraday, which helps familiarize users with the platform.
When you're ready to use Faraday to build AI Agents, the first step is getting your data into our system. There are two primary methods to make that happen: using a data warehouse or uploading individual .csv files. Your best option depends on your team's setup and resources. This guide covers how to ingest your data and the best practices that make it usable for modeling as quickly as possible.
Part 1: Locating your data
Data warehouse vs. file upload
Data warehouse:
If your data resides in a warehouse (e.g., Snowflake or BigQuery), integration is straightforward. Your team can provision a small, dedicated space within the data warehouse for Faraday, providing secure, efficient, and timely transfers without the hassle of repeated manual uploads. Our connections console provides the necessary credential details for integration.
File uploads:
If your data is managed as individual files, you can upload them directly via our datasets console. Although this method is simple, it is time-consuming to keep data fresh over time because of the level of effort required each time.
Note: Faraday is looking for .csv (comma-separated values) files - they can be exported from databases, various business platforms, and spreadsheets alike.
Automation considerations:
For brands using cloud storage (e.g., Amazon S3 or Google Cloud Storage), setting up automated data transfers is simple and efficient. These services can be connected to Faraday with minimal effort and can be nearly as effective as connecting directly to a data warehouse. An SFTP transfer option is equally suitable, but may require slightly more knowledge and access to resources within your organization.
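For example, if your exports land in Amazon S3, a small scheduled script can push each fresh file into the bucket Faraday reads from. The sketch below is illustrative only: it uses boto3, and the bucket name, key prefix, and local file path are placeholders rather than real Faraday values.

```python
# Minimal sketch: push a fresh CSV export to a cloud bucket that Faraday reads from.
# Bucket name, key prefix, and local path are placeholders for illustration.
import boto3

s3 = boto3.client("s3")  # uses credentials from your environment or IAM role

s3.upload_file(
    Filename="exports/orders_2024-06-01.csv",   # local export from your database or CRM
    Bucket="your-company-faraday-exports",      # bucket you have connected to Faraday
    Key="orders/orders_2024-06-01.csv",         # keep a consistent prefix per dataset
)
```

Run a script like this from cron or your existing pipeline scheduler after each export completes, and the connected dataset stays fresh without manual uploads.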
.csv file formatting requirements
To ensure smooth processing, follow these guidelines when preparing your CSV files (a quick validation sketch follows the list):
- Format:
- Files: These should be comma-separated and UTF-8 encoded, ideally with a header row listing the column names. If headers are missing, you can add them in the dataset configuration's 'Advanced options'.
- Data warehouse: Sticking to simple, common field types such as decimals, integers, booleans (true/false), and text (string/varchar) is most effective. Tables and views are both acceptable.
- Size:
- Files: Individual files can be up to 5GB in size and should generally contain fewer than 50 columns. Consider splitting larger files into smaller parts and/or removing any fields not related to behavior, time, quality, or identity.
- Data warehouse: Generally, there is no size limit for data warehouse tables, but large tables should be handled with care: the larger the dataset, the longer the processing time.
- Structure:
- Files: If providing updates to data, ensure a consistent structure across all files in a dataset (cloud bucket/SFTP: "prefix") - the same column names, column order, and data types (if applicable).
- Data warehouse: If using a table, ensure it remains up to date through a regular refresh process. Flattened tables (not nested) are preferred in most cases. Plan on establishing a set of fields to include and avoid changing them over time (such as adding or removing fields).
- Delimiter consistency:
- Files: Commas are the preferred delimiter, but this setting is customizable at the dataset level (tabs, semicolons). If your values may contain the delimiter, ensure all columns are quoted.
- Additional flexibility:
- Files: We accept compressed files such as .gz or .zip if you are batching uploads or saving on storage.
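Before uploading, it can help to run a quick sanity check against the guidelines above. The Python sketch below is only a starting point: the file name is a placeholder and the thresholds simply mirror the limits described in this section.

```python
# Minimal sketch: sanity-check a CSV against the guidelines above before uploading.
import csv
import os

MAX_BYTES = 5 * 1024**3   # 5GB per-file limit
MAX_COLUMNS = 50          # recommended column ceiling

def check_csv(path: str, delimiter: str = ",") -> None:
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        print(f"WARN: {path} is {size / 1024**3:.1f}GB; consider splitting it.")

    # Opening with encoding="utf-8" will raise if the file uses another encoding.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader)
        if len(header) > MAX_COLUMNS:
            print(f"WARN: {len(header)} columns; consider trimming fields unrelated to behavior, time, quality, or identity.")
        for line_number, row in enumerate(reader, start=2):
            if len(row) != len(header):
                print(f"WARN: line {line_number} has {len(row)} values, expected {len(header)}.")

check_csv("orders.csv")  # placeholder file name
```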
Part 2: Data essentials for accurate predictions
Once your data is uploaded, it must include the necessary elements for precise predictions. Faraday supports various data types, but certain formats yield better results.
Types of data
- Event stream data:
This type of data is preferred for its detailed insights into customer behavior: each row represents an event (e.g., account creation or purchase) with a timestamp. This granularity is key for understanding customer journeys. (See our Datasets documentation for more details.)
Flavors:
- email engagement data
- website engagement data
- app usage data
- order data
- sales funnel data
- customer journey mapping
- direct mail response tracking
- Customer-level (summary) data:
This type of data captures one row per customer with summary information such as purchase history metrics, customer tier, original source, or lifetime value. With a joining key (e.g., customer_id), it works best when combined with event stream data.
Note: Summary-type data may hold valuable insights on your customers' first-party traits. It can also include event timestamps as additional columns when relevant, and is frequently seen in CRM exports.
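To make the two shapes concrete, here is a small illustrative example; the column names (order_id, customer_id, customer_tier, and so on) are made up for demonstration. One table holds one row per event, the other one row per customer, and the customer_id key lets them work together.

```python
# Illustrative only: the two data shapes described above, with made-up column names.
import pandas as pd

# Event stream: one row per event, each with a timestamp
events = pd.DataFrame({
    "order_id":    ["A-100", "A-101", "A-102"],   # primary key for the event
    "customer_id": [1, 1, 2],                     # reference key to the individual
    "event_date":  ["2024-01-05", "2024-03-12", "2024-02-20"],
    "order_total": [49.00, 120.00, 35.50],
})

# Customer-level summary: one row per customer
customers = pd.DataFrame({
    "customer_id":    [1, 2],
    "customer_tier":  ["gold", "bronze"],
    "lifetime_value": [169.00, 35.50],
})

# The joining key is what lets the two datasets complement each other
combined = events.merge(customers, on="customer_id", how="left")
print(combined)
```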
Essential fields
To ensure accurate modeling, include these key fields:
- Personally identifiable information (PII):
First name, last name, address (street, city, state, postal code), email, and phone number. Name + address or email is required at a minimum. More PII enables enhanced data enrichment through the Faraday Identity Graph and can unlock a world of value.
- Event timestamps:
Accurate timestamps are critical for tracking when events occur. "YYYY-MM-DD" is the preferred format, but other common selections are available via the app and API (see the normalization sketch after this list).
Note: Ensure all time-based fields follow the same format, whichever option you choose.
- Event properties:
Other values captured at the time of the event that may exist in your records (cost of goods, method of payment, unique products or services, result of the event). In these cases, the additional data should answer the question: how much value was generated for your business by this event?
- Unique identifiers:
Each set of data should have a primary key (e.g., an order number) and a reference key linking the event to an individual. These datapoints deliver maximum flexibility to both you and Faraday as data is updated over time.
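As noted above, time-based fields should all follow the same format. A minimal normalization sketch, assuming a few common input formats (extend the list to match whatever your exports actually produce), might look like this:

```python
# Minimal sketch: normalize mixed date strings to the preferred YYYY-MM-DD format.
# The input formats below are assumptions; add or remove entries to match your data.
from datetime import datetime

KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"]

def to_iso_date(value: str) -> str:
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {value!r}")

print(to_iso_date("06/01/2024"))   # -> 2024-06-01
print(to_iso_date("2024-06-01"))   # -> 2024-06-01
```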
Identity resolution and matching
After your first-party data is connected, Faraday’s identity resolution software links individuals to additional third-party attributes with over 1,500 available traits and counting. Providing as much quality PII as possible improves matching accuracy and minimizes reliance on aggregated traits.
At a minimum, you must provide an email or a combination of first and last name with street address and postal code.
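If you want to estimate how much of a file meets that minimum before uploading, a quick check along these lines can help. The column names used here (email, first_name, last_name, street, postal_code) are placeholders for whatever your export actually contains.

```python
# Sketch: estimate how many rows meet the minimum PII needed for identity resolution.
# Column names are placeholders; adapt them to your own file.
import pandas as pd

df = pd.read_csv("customers.csv", dtype=str)

has_email = df["email"].notna()
has_name_and_address = (
    df["first_name"].notna() & df["last_name"].notna()
    & df["street"].notna() & df["postal_code"].notna()
)

matchable = has_email | has_name_and_address
print(f"{matchable.mean():.0%} of rows meet the minimum PII requirement")
```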
Note on data hygiene
Organizing and labeling your data correctly is vital — poor quality input yields poor quality output every time. Clean, well-curated data leads to smoother processing and more accurate insights.
Best practices include:
- Maintaining consistent formatting: keep types consistent within each field, format names and addresses with a similar structure, give categorical fields a fixed set of standard values, and leave missing values blank rather than using filler values (e.g., "None", "N/A")
- Verifying that phone numbers are a full 10 digits, accurate emails are provided, and names come with full physical addresses
- Deduplication
- Ensuring you do not have large gaps in field coverage or in time between groups of events
- Not providing more data than you intend to use (aside from fields reserved for future use cases)
Investing time in even minimal data hygiene upfront saves time and reduces the likelihood of missing a problem that was lying in wait.
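As a rough starting point, a small cleanup pass covering several of the practices above (filler values, deduplication, phone-number length) might look like the following pandas sketch; the file and column names are placeholders.

```python
# Minimal hygiene sketch: drop filler values, deduplicate, and flag short phone numbers.
# File and column names are placeholders; adapt them to your own export.
import pandas as pd

FILLER_VALUES = ["None", "N/A", "n/a", "null", ""]

df = pd.read_csv("customers.csv", dtype=str)

# Replace filler strings with true missing values
df = df.replace(FILLER_VALUES, pd.NA)

# Deduplicate on the field you treat as the primary key
df = df.drop_duplicates(subset=["customer_id"])

# Flag phone numbers that are not a full 10 digits once non-digits are stripped
digits = df["phone"].fillna("").str.replace(r"\D", "", regex=True)
bad_phones = df[(digits != "") & (digits.str.len() != 10)]
print(f"{len(bad_phones)} rows have malformed phone numbers")

df.to_csv("customers_clean.csv", index=False)
```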
What’s next?
Getting your data into Faraday is just the beginning. By following these best practices, your data will be structured and ready for predictive modeling—whether you’re predicting lead conversion or conducting market opportunity analyses. If you have trouble checking all the boxes above, don't worry - Faraday will help find a solution for your needs.
Let us know if you have additional questions as you prepare your data. You can connect with support at support@faraday.ai - happy modeling!
Ready for easy AI agents?
Skip the struggle and focus on your downstream application. We have built-in sample data so you can get started without sharing yours.