Parsing file uploads at 500 mb/s with node.js

Posted on 31/5/10 by Felix Geisendörfer

A few weeks ago I set out to create a new multipart/form-data parser for node.js. We need this parser for the new version of transloadit that we have been working on since our setback last month.

The result is a new library called formidable, which, on a high level, makes receiving file uploads with node.js as easy as:

var formidable = require('formidable')
  , http = require('http')
  , sys = require('sys');

http.createServer(function(req, res) {
  if (req.url == '/upload' && req.method.toLowerCase() == 'post') {
    // parse a file upload
    var form = new formidable.IncomingForm();
    form.parse(req, function(fields, files) {
      res.writeHead(200, {'content-type': 'text/plain'});
      res.write('received upload:\n\n');
      res.end(sys.inspect({fields: fields, files: files}));
    });
    return;
  }

  // show a file upload form
  res.writeHead(200, {'content-type': 'text/html'});
  res.end
    ( '<form action="/upload" enctype="multipart/form-data" method="post">'
    + '<input type="text" name="title"><br>'
    + '<input type="file" name="upload" multiple="multiple"><br>'
    + '<input type="submit" value="Upload">'
    + '</form>'
    );
}).listen(8000); // pick a port so the example actually runs; 8000 is arbitrary

Essentially this works similarly to other platforms, where file uploads are saved to disk before your script is invoked with a path to the uploaded file.
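
Since the file is already sitting on disk by the time the callback fires, using it can be as simple as moving it into place. A rough sketch (the '/uploads/' destination is just an example path, not something formidable provides):

var fs = require('fs');

form.parse(req, function(fields, files) {
  // files.upload.path points at the temporary file that was written to disk
  fs.rename(files.upload.path, '/uploads/' + files.upload.filename, function(err) {
    if (err) throw err;
    // the upload now lives at its final destination
  });
});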

What's nice about this, however, is that you can hook into the whole thing at a lower level:

form.parse(req);
form.addListener('file', function(field, file) {
  // file looks like this:
  // {path: '...' , filename: '...', mime: '...'}
});

We use that interface for processing HTML5 multi-file uploads as they come in, rather than waiting for the entire upload to finish.
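
In code that boils down to doing the work inside the 'file' listener (resizeImage here is a made-up placeholder for whatever per-file processing you need):

form.addListener('file', function(field, file) {
  // this fires as soon as an individual file has been received, even if
  // other files of the same upload are still on their way
  resizeImage(file.path, function(err) {
    console.log('done with ' + file.filename);
  });
});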

You could even override the onPart handler, which gives you direct access to the raw data stream:

form.onPart = function(part) {
  part.addListener('data', function(chunk) {
    // do cool stuff, like streaming incoming files somewhere else
  });
};
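
For example, each incoming file could be streamed straight to disk as it arrives. A minimal sketch (assuming the part also emits an 'end' event once its data is complete):

var fs = require('fs');

form.onPart = function(part) {
  if (!part.filename) {
    // not a file part, ignore it in this sketch
    return;
  }

  var out = fs.createWriteStream('/tmp/' + part.filename);
  part.addListener('data', function(chunk) {
    out.write(chunk);
  });
  part.addListener('end', function() {
    out.end();
  });
};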

All of this is possible thanks to the underlying multipart parser, which makes heavy use of node.js buffers.

Buffers in node are basically just (efficient) arrays of raw memory that you can access byte by byte. The parser works by looping over each incoming buffer chunk, while maintaining as little state as possible to do its work:

// simplified excerpt from MultipartParser.write
// chunk = Buffer of incoming data
// (this.boundaryIndex stands in for the parser's position within the
// boundary, which has to be kept as state because a boundary can be
// split across two chunks)

for (var i = 0; i < chunk.length; i++) {
  var character = chunk[i];
  switch (this.state) {
    case 'BOUNDARY-BEGIN':
      if (character != this.boundary[this.boundaryIndex]) {
        // unexpected character, abort parsing
        return 0;
      }

      if (++this.boundaryIndex == this.boundary.length) {
        // full boundary matched: emit event, advance to next state
        this.onPartHeaderBegin();
        this.state = 'HEADER-BEGIN';
      }
      break;
    case 'HEADER-BEGIN':
      // ...
      break;
  }
}

But, as you can imagine, this approach turned out to be somewhat limited in speed. I was only able to get about 16-20 mb/s out of this. My goal, however, was to get somewhere around 125 mb/s, enough to saturate a 1 gbit network connection.

So I started to look for ways to optimize. The data I was parsing looks like this:

--AaB03x
content-disposition: form-data; name="title"

A picture of me on my unicycle!
--AaB03x
content-disposition: form-data; name="pic"; filename="unicycle.jpg"
Content-Type: image/jpeg

... binary data ...
--AaB03x--

The sequence here being:

  1. Series of boundary characters (--AaB03x), announcing the beginning of a new part
  2. Headers for that part
  3. \r\n\r\n
  4. Data for this part
  5. Series of boundary characters, announcing the end of the part or end of the stream

What stands out is that there is no parsing needed for step #4. All the data of a part itself is a plain octet stream. So after talking to Ryan about it, he recommended that I look into the Boyer-Moore algorithm to speed things up.

The algorithm is usually the fastest method for finding a substring within a larger string of text. It basically works by analyzing the needle string and building a lookup table of its characters in order to efficiently skip over parts of the haystack.
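
To give a feel for it (this is just a sketch of the general idea, in the spirit of the Boyer-Moore-Horspool simplification, not formidable's code), the "bad character" table could look like this:

// for every byte in the needle except the last one, record its distance
// from the end of the needle
function badCharTable(needle) {
  var table = {};
  for (var i = 0; i < needle.length - 1; i++) {
    table[needle.charCodeAt(i)] = needle.length - 1 - i;
  }
  return table;
}

// shift by the table entry for the byte aligned with the end of the
// current window, or by the full needle length if that byte never
// occurs in the needle at all
function shiftFor(table, needleLength, byte) {
  return (byte in table) ? table[byte] : needleLength;
}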

But implementing it was not exactly easy. The algorithm is not trivial, and many of the example implementations I found were wrong. That, however, was not the problem. The real challenge was that I was working with a stream of data, rather than a big string I had full access to.

This meant keeping lots of additional state in the parser, as well as creating a very complicated piece of code. I like challenges, but I also like efficiently using my time, so I started looking for a shortcut.

And then it hit me. Most of the Boyer-Moore algorithm is designed to improve the performance of the worst-case scenario. The worst-case scenario for this problem is hitting a character in your haystack that is also a character in the needle string. Boyer-Moore deals with this by knowing the offset of each character in the needle string, so it can maximize the number of characters to skip in every case.

But file uploads rarely cause these worst-case scenarios! With human text, character repetition is pretty high. File uploads, however, are binary data, so most bytes are likely to fall outside the ASCII range of characters usually used for the boundary.

That made the solution much simpler. All I had to do was generate a list of all characters in the boundary, and whenever I hit a character that was not in that list, I knew I could safely skip the full length of the boundary:

while (i + boundary.length <= chunk.length) {
  if (chunk[i + boundary.length - 1] in boundaryChars) {
    // worst case, go back to byte by byte parsing until a non-matching char occurs
    break;
  }
  i += boundary.length;
}
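
The boundaryChars lookup used above only needs to be built once, when the boundary becomes known. A minimal sketch of how that could look (the exact names are mine, not necessarily the parser's):

// record every byte value that occurs in the boundary, e.g. for
// '--AaB03x' the codes for '-', 'A', 'a', 'B', '0', '3' and 'x'
var boundaryChars = {};
for (var b = 0; b < boundary.length; b++) {
  boundaryChars[boundary[b]] = true;
}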

This resulted in an incredible speed-up, allowing uploads to be parsed at 500 mb/sec. Parsing can be even faster if a longer boundary sequence is used; short boundaries and frequent worst-case hits will slow things down.

The benchmark I created uses an actual boundary generated by Mozilla Firefox. Your mileage may vary slightly.

The whole thing could still be optimized further, but at this point I believe it is fast enough to make other parts of the system more likely to become the bottleneck.

Formidable is licensed under the MIT license. Questions & suggestions regarding the library, node.js or the parser would be most welcome.

--fg

 

John Smith  said on May 31, 2010:

What about raw file uploads (without multipart handling) a la CouchDB (http://wiki.apache.org/couchdb/HTTP_Document_API#Standalone_Attachments)? I know you can't do it with a browser, but I'm curious about the speed :)

Nice article, by the way!

Felix Geisendörfer said on May 31, 2010:

John Smith: If the HTTP body contains nothing but the uploaded file, node.js should be able to receive it as fast as your operating system can read from the network since there is no need to parse anything (other than the initial http headers).

However, that would not allow for additional fields or more than one file to be uploaded at a time - so it's not really comparable to multipart.
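
For illustration, such a raw receiver boils down to something like this (just a sketch, with an arbitrary destination path):

var http = require('http')
  , fs = require('fs');

http.createServer(function(req, res) {
  // no parsing at all: stream the raw request body straight to disk
  var out = fs.createWriteStream('/tmp/raw-upload');
  req.addListener('data', function(chunk) {
    out.write(chunk);
  });
  req.addListener('end', function() {
    out.end();
    res.writeHead(200, {'content-type': 'text/plain'});
    res.end('ok\n');
  });
}).listen(8000);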

Karl G said on May 31, 2010:

Very clever with the range check.

I actually faced a similar problem this past week and wound up using the KMP Algorithm, which is straightforward both to implement and to convert to streaming. I looked into Boyer-Moore but I knew I'd never hit the KMP worst case for my problem domain and KMP is much simpler. Wound up being ~80 lines of Java.

Allen H  said on Jul 20, 2010:

Very cool. I'm looking at a similar project right now but it would involve unzipping very large (200mb+) text files containing event data and then parsing that. Any thoughts on approach - would you do anything differently if you were dealing with the whole file at once?

Felix Geisendörfer said on Jul 20, 2010:

Allen H: I would use a command line tool to do the unzipping; iterating over the text file could then be done with node. I doubt you'd need a sophisticated parser for it, since you'll be dealing with CSV/TSV, right?

Allen H  said on Jul 20, 2010:

Ya, CSV. Thanks!
