Writing Ruby extensions in Rust: Part 2
In part 1 I showed how to create a Ruby gem with a simple native extension written in Rust. This time I want to extend it and create something useful, namely a fast CSV parser. So what’s the plan? I want to create an Enumerator-style class, which can be represented in Ruby as:
class HelixCSV
  def initialize(file_path)
    # Save path
  end
  def open
    # Open file
  end
  def next
    # Returns an Array of fields or nil if EOF
  end
  def close
    # Close file
  end
end
Let’s first look at the Rust code from part 1:
#[macro_use]
extern crate helix;

ruby! {
    class HelixCSV {
        // skipped
        def add(&mut self, m: i64) -> i64 {
            self.n += m;
            self.n
        }
    }
}
Here ruby! is a macro which does all the heavy lifting. The def body looks just like normal Rust code, but Ruby doesn’t have i64 or any other Rust type, so what happens behind the scenes is a type coercion between Ruby and Rust. When I call helix_csv.add(1) in Ruby, helix will convert the Ruby Integer into a Rust i64 and convert the return value of type i64 back to an Integer. Apart from simple types (Float <> f64, Integer <> i64, etc.) and collections (e.g. Array of Float <> Vec<f64>, Hash of String, Float <> HashMap<String, f64>), coercions also cover two special cases:
- Rust doesn’t have nils while Ruby does, so helix uses the Option<_> type to represent nilable values (String|nil <> Option<String>), and of course you are free to compose them, e.g. Array of Float|nil <> Option<Vec<f64>>.
- Rust doesn’t have exceptions, so to raise a Ruby exception we need to wrap the return value in Result<_, helix::Error>; then we can use the raise! macro to raise an exception. These types can be combined too:
def toArray(&self, x: f64) -> Result<Option<Vec<f64>>, helix::Error> {
    if x < 0.0 {
        raise!("Negative value!") //=> Exception
    } else if x < 1.0 {
        Ok(Some(vec![x])) //=> [x]
    } else {
        Ok(None) //=> nil
    }
}
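To make the three outcomes concrete without pulling in helix, here is a minimal std-only sketch. SketchError is a stand-in for helix::Error and ruby_view is a hypothetical helper that only describes what the Ruby caller would observe; neither is part of helix.

```rust
// Stand-in for helix::Error (an assumption, so the sketch compiles without helix).
#[derive(Debug, PartialEq)]
struct SketchError(String);

// Mirrors the three branches of `toArray` above.
fn to_array(x: f64) -> Result<Option<Vec<f64>>, SketchError> {
    if x < 0.0 {
        // In the helix version this branch is `raise!("Negative value!")`.
        Err(SketchError("Negative value!".to_string()))
    } else if x < 1.0 {
        Ok(Some(vec![x])) // an Array on the Ruby side
    } else {
        Ok(None) // nil on the Ruby side
    }
}

// Hypothetical helper: describes what the Ruby caller would observe.
fn ruby_view(result: &Result<Option<Vec<f64>>, SketchError>) -> String {
    match result {
        Ok(Some(values)) => format!("{:?}", values),
        Ok(None) => "nil".to_string(),
        Err(SketchError(msg)) => format!("raise {:?}", msg),
    }
}
```

The outer Result decides whether the call raises, and the inner Option decides between a value and nil, which is why the two wrappers nest in that order.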
There is one caveat: initialize has to return the struct, which means it must be infallible and can’t raise exceptions. So we won’t be able to open the file in HelixCSV::new and skip open; initialize will only save the file path. open will open the file and save some kind of iterator, Option<CSVIter>, for next to use. The Rust CSV crate creates an iterator that implements Iterator<Item=Result<csv::StringRecord, csv::Error>>; I store it as a trait object, and since that is a dynamically sized type I have to box it and wrap it in a structure:
type CSVIterType = Iterator<Item=Result<csv::StringRecord, csv::Error>>;

struct CSVIter {
    iter: Box<CSVIterType>,
}
Also, all fields used in a helix struct have to implement the Debug and Clone traits. Since I’m not going to use them, I can simply define them as:
impl std::fmt::Debug for CSVIter {
    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
        write!(f, "CSVIter")
    }
}

impl Clone for CSVIter {
    fn clone(&self) -> CSVIter {
        panic!("Not cloneable!")
    }
}
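The boxing pattern can be tried in isolation with the standard library alone. The sketch below is an illustration, not the gem’s code: Row stands in for Result<csv::StringRecord, csv::Error>, and RowIter and its new constructor are hypothetical names. It spells the trait object as dyn Iterator, the current form of the bare Iterator<...> type used above.

```rust
use std::fmt;

// Stand-in for Result<csv::StringRecord, csv::Error>; no external crates needed.
type Row = Result<Vec<String>, String>;

// A trait object has no size known at compile time, so it lives behind a Box.
struct RowIter {
    iter: Box<dyn Iterator<Item = Row>>,
}

impl RowIter {
    // Hypothetical constructor: accepts any owned iterator that yields rows.
    fn new<I>(rows: I) -> RowIter
    where
        I: Iterator<Item = Row> + 'static,
    {
        RowIter { iter: Box::new(rows) }
    }
}

// Box<dyn Iterator> does not implement Debug, so #[derive(Debug)] would not
// compile here; a hand-written impl that prints only the type name is enough.
impl fmt::Debug for RowIter {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "RowIter")
    }
}
```

The same reasoning applies to Clone: a boxed trait object cannot be derived as cloneable, which is why the article falls back to a hand-written impl.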
Now it’s easy to implement open and next. Here’s what the final version of lib.rs looks like:
extern crate csv;
#[macro_use]
extern crate helix;

use std::fs::File;
use std::io::BufReader;

type CSVIterType = Iterator<Item=Result<csv::StringRecord, csv::Error>>;

struct CSVIter {
    iter: Box<CSVIterType>,
}

impl std::fmt::Debug for CSVIter {
    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
        write!(f, "CSVIter")
    }
}

impl Clone for CSVIter {
    fn clone(&self) -> CSVIter {
        panic!("Not cloneable!")
    }
}

ruby! {
    class HelixCSV {
        struct {
            path: String,
            iter: Option<CSVIter>,
        }

        def initialize(helix, path: String) {
            HelixCSV { helix, path, iter: None }
        }

        def open(&mut self) -> Result<(), helix::Error> {
            self.iter = None;
            let buf_reader =
                match File::open(&self.path) {
                    Ok(f) => BufReader::new(f),
                    Err(e) => raise!(e.to_string()),
                };
            let csv_reader =
                csv::ReaderBuilder::new()
                    .has_headers(false)
                    .from_reader(buf_reader);
            self.iter = Some(CSVIter{iter: Box::new(csv_reader.into_records())});
            Ok(())
        }

        def next(&mut self) -> Result<Option<Vec<String>>, helix::Error> {
            match self.iter {
                Some(ref mut iter) =>
                    match iter.iter.next() {
                        Some(Ok(record)) =>
                            Ok(Some(record.iter().map(|s| s.to_string()).collect())),
                        Some(Err(e)) =>
                            raise!(e.to_string()),
                        None =>
                            Ok(None)
                    }
                None =>
                    raise!("closed file")
            }
        }

        def close(&mut self) -> () {
            self.iter = None
        }
    }
}
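The open/next/close flow above is a small state machine around an Option, and it can be exercised without helix or the csv crate. The following is a rough std-only sketch under stated assumptions: LineReader and next_record are hypothetical names, a String stands in for helix::Error, the input is an in-memory string rather than a file, and naive comma splitting stands in for real CSV parsing (so quoted fields are not handled).

```rust
// Stand-in for helix::Error; an assumption so the sketch compiles without helix.
type SketchError = String;

// Hypothetical std-only analogue of HelixCSV: `iter` stays None until `open`.
struct LineReader {
    text: String,
    iter: Option<Box<dyn Iterator<Item = Vec<String>>>>,
}

impl LineReader {
    // Like `initialize` above: only stores the input and cannot fail.
    fn new(text: &str) -> LineReader {
        LineReader { text: text.to_string(), iter: None }
    }

    // Builds the iterator; naive comma splitting stands in for CSV parsing.
    fn open(&mut self) {
        let rows: Vec<Vec<String>> = self
            .text
            .lines()
            .map(|line| line.split(',').map(|s| s.to_string()).collect())
            .collect();
        self.iter = Some(Box::new(rows.into_iter()));
    }

    // Ok(Some(fields)) for a record, Ok(None) at end of input,
    // Err when the reader has not been opened or was already closed.
    fn next_record(&mut self) -> Result<Option<Vec<String>>, SketchError> {
        match self.iter {
            Some(ref mut iter) => Ok(iter.next()),
            None => Err("closed file".to_string()),
        }
    }

    // Dropping the iterator returns the reader to the closed state.
    fn close(&mut self) {
        self.iter = None;
    }
}
```

Calling next_record before open or after close yields the "closed file" error, matching the raise! branch in the helix version, while an exhausted iterator yields Ok(None), which Ruby sees as nil.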
We also need to add the csv crate as a dependency in Cargo.toml, and then the only thing left is to build the native extension:
$ rake build
{}
cargo rustc --release -- -C link-args=-Wl,-undefined,dynamic_lookup
Compiling version_check v0.1.5
Compiling libc v0.2.43
Compiling cfg-if v0.1.6
Compiling serde v1.0.80
Compiling libcruby-sys v0.7.5
Compiling cstr-macro v0.1.0
Compiling memchr v2.1.1
Compiling csv-core v0.1.4
Compiling helix v0.7.5
Compiling csv v1.0.2
Compiling helix_csv v0.1.0 (.../helix_csv)
Finished release [optimized] target(s) in 25.91s
$ bin/console
2.5.3 :001 > csv = HelixCSV.new("sample.csv")
=> #<HelixCSV:0x000055b9b1aa6e08>
2.5.3 :002 > csv.open
=> nil
2.5.3 :003 > csv.next
=> ["a407cd64-b473-44b6-910c-8f6d09dc2a6a", "6982c5ab-8f07-4f8e-bd0b-9f2bf5ec626e", ...]
2.5.3 :004 > csv.close
=> nil
Benchmarks
Code
# FastestCSV
FastestCSV.foreach('sample.csv') do |rec|
end
# HelixCSV
csv_reader = HelixCSV.new('sample.csv')
csv_reader.open
while (rec = csv_reader.next) do
end
csv_reader.close
# CSV
CSV.foreach('sample.csv') do |rec|
end
Data
Records | Fields per record | File size | CPU |
---|---|---|---|
1M | 21 | 742M | i7-6600U @ 2.60GHz |
Results
gem | FastestCSV | HelixCSV | CSV |
---|---|---|---|
Time, secs | 9.54 | 12.9 | 42.5 |
Note: FastestCSV doesn’t seem to decode embedded newlines, which in my case was a deal breaker.
Was it worth the time and effort?
At one of my jobs I had to read and analyze large CSV files (it’s a very popular format in the enterprise world!), and the company codebase was mostly written in Ruby. Finding a fast gem that handles “standard” CSV (i.e. the CSV you get from Excel or Postgres) proved to be pretty complicated, not only in the Ruby world but even in Java. Many libraries fail to decode newlines inside quoted fields (both Excel and Postgres exports can easily contain them). We used a patched FastestCSV, but even my simple gem can easily outperform FastestCSV in some cases, e.g. if I want only certain columns. A big part of the time is spent instantiating Ruby objects and coercing types, so if I only want one column out of the 21 in my sample file, 12.9 secs becomes only 4.7 secs. I can also offload a lot of other work to Rust, e.g. finding unique values, checking and converting types, etc. This will be much faster than Ruby, with a relatively small amount of Rust code. Not to mention that the Rust CSV crate has tons of other options.
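As an illustration of the kind of work that could stay on the Rust side, here is a small std-only sketch. The function names select_columns and count_unique are hypothetical and not part of the gem, and records are modelled as slices of &str rather than csv::StringRecord so the example needs no external crates.

```rust
use std::collections::HashSet;

// Copies only the wanted columns of one parsed record into owned Strings,
// so fewer values would need to be turned into Ruby objects.
// Indices past the end of the record are skipped.
fn select_columns(record: &[&str], wanted: &[usize]) -> Vec<String> {
    wanted
        .iter()
        .filter_map(|&i| record.get(i))
        .map(|field| field.to_string())
        .collect()
}

// Counts the distinct values in one column without returning any rows.
fn count_unique(records: &[Vec<&str>], column: usize) -> usize {
    let mut seen = HashSet::new();
    for record in records {
        if let Some(field) = record.get(column) {
            seen.insert(*field);
        }
    }
    seen.len()
}
```

In both cases only a small result (a few strings or a single number) would have to be coerced back to Ruby, which is where the article reports most of the time going.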