How we made our CSV processing 142x faster
A data science hack to process CSVs faster using xsv.
At Faraday, we've long used csvkit to understand, transform, and beat senseless our many streams of data. However, even this inimitable Swiss Army knife can be improved on: we've switched to xsv.
xsv is a fast CSV-parsing toolkit written in Rust that mostly matches the functionality of csvkit (including the clutch ability to pipe between modules), with a few extras tacked on (like smart sampling). Did I mention it's fast? In a head-to-head comparison, I ran the "stats" module of xsv against "csvstat" from csvkit on a 30k-line, 400-column CSV file:
- Python-based csvkit chews through it in a respectable-and-now-expected 4m16s.
- xsv takes 1.8 seconds. I don't even have time for a sip of my coffee.
The difference between csvkit and xsv is partly a matter of scale; both tools are plenty fast on smaller datasets. But once you get into the 10MB-and-upward range, xsv's processing speed pulls away dramatically: in the test above, 4m16s versus 1.8 seconds works out to roughly 142x.
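If you want to run a similar comparison on your own data, here is a minimal sketch. It is not the exact benchmark from this post: the `CSV_FILE` variable, the `bench` helper, and the generated demo file are all hypothetical stand-ins, and the timer only has whole-second resolution. Keep in mind that csvstat and `xsv stats` do not report exactly the same statistics by default, so treat the numbers as a rough guide rather than a strict apples-to-apples measurement.

```shell
# Hypothetical input: point CSV_FILE at your own large file, or let the
# script generate a small demo file so it runs anywhere.
csv="${CSV_FILE:-${TMPDIR:-/tmp}/bench_demo.csv}"
if [ ! -f "$csv" ]; then
  printf 'id,value\n' > "$csv"
  i=1
  while [ "$i" -le 1000 ]; do
    printf '%s,%s\n' "$i" "$((i * 3))" >> "$csv"
    i=$((i + 1))
  done
fi

# Time a command only if the tool is installed. Output is discarded so
# terminal rendering does not skew the measurement.
bench() {
  if command -v "$1" >/dev/null 2>&1; then
    start=$(date +%s)
    "$@" > /dev/null || echo "$1 exited with an error" >&2
    end=$(date +%s)
    echo "$*: $((end - start))s"
  else
    echo "$1: not installed, skipping"
  fi
}

bench csvstat "$csv"
bench xsv stats "$csv"
```

On the tiny generated file both tools will finish almost instantly; the gap only becomes visible on files in the size range discussed above.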
If you've been using csvkit forever (like me), or if you want to be able to transform and analyze CSVs without loading them into a DB, give xsv a shot:
Install Rust
curl https://sh.rustup.rs -sSf | sh
. . . which also gives you the Rust package manager cargo, which lets you:
Install xsv
cargo install xsv
Then be sure your PATH is configured correctly:
export PATH=~/.cargo/bin:$PATH
. . . and try it out on a demo CSV with 10k rows, some messy strings, and multiple data types:
curl https://gist.githubusercontent.com/wboykinm/044e2af62fc0c7f77e17f6ccd55b8fb0/raw/fca391e6c03a06a7be770fefca6c47a9acdd2305/mock_data.csv \
| xsv stats \
| xsv table
(xsv table formats the data so it's readable in the console):
field           type     sum                 min                  max                  min_length  max_length  mean                stddev
id              Integer  5005000             1                    1000                 1           4           500.49999999999994  288.6749902572106
first_name      Unicode                      Aaron                Willie               3           11
last_name       Unicode                      Adams                Young                3           10
email           Unicode                      aadamsp5@senate.gov  wwrightd8@upenn.edu  12          34
gender          Unicode                      Female               Male                 4           6
ip_address      Unicode                      0.111.40.87          99.50.37.244         9           15
value           Unicode                      $1007.98             $999.37              0           8
company         Unicode                      Abata                Zoovu                0           13
lat             Float    243963.82509999987  -47.75034            69.70287             0           9           24.42080331331331   24.98767816017553
lon             Float    443214.19009999954  -179.12198           170.29993            0           10          44.36578479479489   71.16647723898215
messed_up_data  Unicode                      !@#$%^&*()           𠜎𠜱𠝹𠱓𠱸𠲖𠳏       0           393
version         Unicode                      0.1.1                9.99                 3           14
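To illustrate the module piping and sampling mentioned earlier, here is a small sketch that chains a few xsv subcommands. The sample rows and the temp-file path are hypothetical; they only borrow a few column names from the demo file so the commands have something to chew on.

```shell
# Hypothetical rows reusing a few column names from the demo CSV.
sample="${TMPDIR:-/tmp}/xsv_pipe_demo.csv"
cat > "$sample" <<'EOF'
id,first_name,gender,company
1,Aaron,Male,Abata
2,Willie,Female,Zoovu
3,Dana,Female,Abata
EOF

if command -v xsv >/dev/null 2>&1; then
  # Filter rows on one column, keep two columns, then pretty-print.
  xsv search -s gender '^Female$' "$sample" \
    | xsv select first_name,company \
    | xsv table

  # Count how often each value appears in a column.
  xsv frequency -s company "$sample" | xsv table

  # Pull a random sample of 2 rows.
  xsv sample 2 "$sample" | xsv table
else
  echo "xsv not found on PATH; install it first" >&2
fi
```

Each subcommand reads CSV on stdin and writes CSV on stdout, which is what makes the chaining work; `xsv table` goes last because its aligned output is meant for reading, not for further parsing.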
Happy parsing!