Victor Moroz - Writing Ruby extensions in Rust: Part 2

Posted on November 18, 2018

In part 1 I showed how to create a Ruby gem with a simple native extension written in Rust. This time I want to extend it into something useful, namely a fast CSV parser. So what’s the plan? I want an Enumerator-style class that, written in Ruby, would look like this:

class HelixCSV
  def initialize(file_path)
    # Save path
  end

  def open
    # Open file
  end

  def next
    # Returns an Array of fields or nil if EOF
  end

  def close
    # Close file
  end
end

Let’s first look at the Rust code from part 1:

#[macro_use]
extern crate helix;

ruby! {
    class HelixCSV {
        // skipped

        def add(&mut self, m: i64) -> i64 {
            self.n += m;
            self.n
        }
    }

}

Here ruby! is a macro that does all the heavy lifting. The body of a def looks just like normal Rust code, but Ruby has no i64 or any other Rust type, so behind the scenes helix coerces types between Ruby and Rust. When I call helix_csv.add(1) in Ruby, helix converts the Ruby Integer into a Rust i64 and converts the i64 return value back into an Integer. Apart from simple types (Float <> f64, Integer <> i64 etc.) and collections (e.g. Array of Float <> Vec<f64>, Hash of String to Float <> HashMap<String, f64>; note that f64 cannot be a HashMap key because it implements neither Eq nor Hash), coercions also cover two special cases:

Rust doesn’t have nil while Ruby does, so helix uses Option<_> to represent nilable values (String|nil <> Option<String>). You are free to compose these, e.g. (Array of Float)|nil <> Option<Vec<f64>>.
Rust doesn’t have exceptions, so to raise a Ruby exception we wrap the return value in Result<_, helix::Error> and use the raise! macro. These types combine too:

    def toArray(&self, x: f64) -> Result<Option<Vec<f64>>, helix::Error> {
        if x < 0.0 {
            raise!("Negative value!") //=> Exception
        } else if x < 1.0 {
            Ok(Some(vec![x])) //=> [x]
        } else {
            Ok(None) //=> nil
        }
    }
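On the first point, the position of Option in the type decides what is allowed to be nil. The sketch below is plain Rust with made-up function names; it only illustrates the two nestings and says nothing about which combinations helix's coercions actually cover.

```rust
// Plain Rust, no helix: where Option sits decides what may be missing.

// The whole array may be missing: (Array of Float) or nil.
fn maybe_array(present: bool) -> Option<Vec<f64>> {
    if present {
        Some(vec![1.0, 2.5])
    } else {
        None
    }
}

// The array is always there, but single elements may be missing:
// Array of (Float or nil).
fn array_of_maybe() -> Vec<Option<f64>> {
    vec![Some(1.0), None, Some(2.5)]
}

fn main() {
    println!("{:?}", maybe_array(true));
    println!("{:?}", maybe_array(false));
    println!("{:?}", array_of_maybe());
}
```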

There is one caveat: initialize has to return the struct itself rather than a Result, so it must be infallible and can’t raise exceptions. That means we can’t open the file in HelixCSV::new and skip open: initialize only saves the file path, while open opens the file and stores an iterator, Option<CSVIter>, for next to use. The Rust csv crate gives us an iterator whose items are Result<csv::StringRecord, csv::Error>, but Iterator<Item=...> is a trait, not a sized type, so to keep it in a struct field I have to box it as a trait object and wrap it in a structure:

type CSVIterType = dyn Iterator<Item=Result<csv::StringRecord, csv::Error>>;

struct CSVIter {
    iter: Box<CSVIterType>,
}
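To see what the boxing buys, here is a minimal plain-Rust sketch that does not use the csv crate. Row is a hypothetical stand-in for Result<csv::StringRecord, csv::Error>, and RowIter, from_rows and from_text are illustrative names only. Two different concrete iterator types end up in the same struct field because the field only promises "some iterator yielding Row".

```rust
// Hypothetical stand-in for Result<csv::StringRecord, csv::Error>.
type Row = Result<Vec<String>, String>;

// One field type that can hold any iterator yielding Row.
struct RowIter {
    iter: Box<dyn Iterator<Item = Row>>,
}

// Source 1: rows that are already in memory.
fn from_rows(rows: Vec<Vec<String>>) -> RowIter {
    RowIter {
        iter: Box::new(rows.into_iter().map(Ok::<Vec<String>, String>)),
    }
}

// Source 2: naive comma splitting of text (no quoting rules at all).
// Lines are copied first so the boxed iterator owns its data.
fn from_text(text: &str) -> RowIter {
    let lines: Vec<String> = text.lines().map(str::to_string).collect();
    RowIter {
        iter: Box::new(lines.into_iter().map(|line| -> Row {
            if line.is_empty() {
                Err("empty line".to_string())
            } else {
                Ok(line.split(',').map(str::to_string).collect())
            }
        })),
    }
}

fn main() {
    let mut a = from_rows(vec![vec!["x".to_string(), "y".to_string()]]);
    let mut b = from_text("1,2\n\n3,4");
    println!("{:?}", a.iter.next());
    println!("{:?}", b.iter.next());
}
```

The cost is one heap allocation and dynamic dispatch on every call to next, which is small next to the work of parsing a record.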

All fields of a helix struct also have to implement the Debug and Clone traits. Since I’m not going to use either of them, I can define them minimally:

impl std::fmt::Debug for CSVIter {
    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
        write!(f, "CSVIter")
    }
}

impl Clone for CSVIter {
    fn clone(&self) -> CSVIter { 
        panic!("Not cloneable!") 
    }
}

Now it’s easy to implement open and next. Here’s what the final version of lib.rs looks like:

extern crate csv;

#[macro_use]
extern crate helix;

use std::fs::File;
use std::io::BufReader;

type CSVIterType = dyn Iterator<Item=Result<csv::StringRecord, csv::Error>>;

struct CSVIter {
    iter: Box<CSVIterType>,
}

impl std::fmt::Debug for CSVIter {
    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
        write!(f, "CSVIter")
    }
}

impl Clone for CSVIter {
    fn clone(&self) -> CSVIter { 
        panic!("Not cloneable!") 
    }
}

ruby! {
    class HelixCSV {
        struct {
            path: String,
            iter: Option<CSVIter>,
        }

        def initialize(helix, path: String) {
            HelixCSV { helix, path, iter: None }
        }

        def open(&mut self) -> Result<(), helix::Error> {
            self.iter = None;

            let buf_reader = 
                match File::open(&self.path) {
                    Ok(f)   => BufReader::new(f),
                    Err(e)  => raise!(e.to_string()),
                };

            let csv_reader =
                csv::ReaderBuilder::new()
                    .has_headers(false)
                    .from_reader(buf_reader);

            self.iter = Some(CSVIter{iter: Box::new(csv_reader.into_records())});
            Ok(())
        }

        def next(&mut self) -> Result<Option<Vec<String>>, helix::Error> {
            match self.iter {
                Some(ref mut iter) =>
                    match iter.iter.next() {
                        Some(Ok(record)) =>
                            Ok(Some(record.iter().map(|s| s.to_string()).collect())), 
                        Some(Err(e)) =>
                            raise!(e.to_string()),
                        None =>
                            Ok(None)
                    }
                None =>
                    raise!("closed file")
            }
        }

        def close(&mut self) -> () {
            self.iter = None
        }
    }
}

We also need to add the csv dependency to Cargo.toml:

[dependencies]
helix = "0.7.5"
csv = "1.0.2"

and the only thing left is to build the native extension:

$ rake build
{}
cargo rustc --release -- -C link-args=-Wl,-undefined,dynamic_lookup
   Compiling version_check v0.1.5
   Compiling libc v0.2.43
   Compiling cfg-if v0.1.6
   Compiling serde v1.0.80
   Compiling libcruby-sys v0.7.5
   Compiling cstr-macro v0.1.0
   Compiling memchr v2.1.1
   Compiling csv-core v0.1.4
   Compiling helix v0.7.5
   Compiling csv v1.0.2
   Compiling helix_csv v0.1.0 (.../helix_csv)
    Finished release [optimized] target(s) in 25.91s

$ bin/console
2.5.3 :001 > csv = HelixCSV.new("sample.csv")
 => #<HelixCSV:0x000055b9b1aa6e08> 
2.5.3 :002 > csv.open
 => nil 
2.5.3 :003 > csv.next
 => ["a407cd64-b473-44b6-910c-8f6d09dc2a6a", "6982c5ab-8f07-4f8e-bd0b-9f2bf5ec626e", ...]
2.5.3 :004 > csv.close
 => nil

Benchmarks

Code

# FastestCSV
FastestCSV.foreach('sample.csv') do |rec|
end

# HelixCSV
csv_reader = HelixCSV.new('sample.csv')
csv_reader.open
while (rec = csv_reader.next) do
end
csv_reader.close

# CSV
CSV.foreach('sample.csv') do |rec|
end

Data

Records	Fields per record	File size	CPU
1M	21	742M	i7-6600U @ 2.60GHz

Results

gem	FastestCSV	HelixCSV	CSV
Time, secs	9.54	12.9	42.5

Note: FastestCSV doesn’t seem to handle newlines embedded in quoted fields, which in my case was a deal breaker.

Was it worth the time and effort?

At one of my jobs I had to read and analyze large CSV files (it’s a very popular format in the enterprise world!), and the company codebase was mostly written in Ruby. Finding a fast gem that handles “standard” CSV (i.e. the CSV you get from Excel or Postgres) proved to be pretty complicated, not only in the Ruby world but even in Java. Many libraries fail to decode newlines inside quoted fields, which both Excel and Postgres exports can easily contain. We used a patched FastestCSV, but even my simple gem can easily outperform FastestCSV, e.g. if I want only certain columns. A big part of the time is spent instantiating Ruby objects and coercing types, so if I only want one column out of the 21 in my sample file, 12.9 secs becomes only 4.7 secs. I can also offload a lot of other work to Rust, e.g. finding unique values or checking and converting types. This will be way faster than Ruby and takes a relatively small amount of Rust code. Not to mention that the Rust csv crate has tons of other options.
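As a rough illustration of offloading such work, here is a plain-Rust sketch that collects the unique values of a single column. The function name unique_in_column is hypothetical, and each record is a plain Vec<String> standing in for the csv crate's record type so that the example needs no external crates. How this would be exposed through helix is not shown.

```rust
use std::collections::BTreeSet;

// Hypothetical helper: distinct values of one column, sorted.
// `records` stands in for the csv record iterator.
fn unique_in_column<I>(records: I, column: usize) -> BTreeSet<String>
where
    I: IntoIterator<Item = Vec<String>>,
{
    records
        .into_iter()
        // Take the wanted field out of each record; records that are
        // too short contribute nothing.
        .filter_map(|mut rec| {
            if column < rec.len() {
                Some(rec.swap_remove(column))
            } else {
                None
            }
        })
        .collect()
}

fn main() {
    let rows = vec![
        vec!["1".to_string(), "red".to_string()],
        vec!["2".to_string(), "blue".to_string()],
        vec!["3".to_string(), "red".to_string()],
        vec!["4".to_string()],
    ];
    let colours = unique_in_column(rows, 1);
    println!("{:?}", colours);
}
```

With this shape only the final set would have to be converted into Ruby objects, instead of every field of every record.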