debuggable

 

Streaming UTF-8 (with node.js)

Posted on 18/5/10 by Felix Geisendörfer

UTF-8 is a variable-length character encoding for Unicode. This text is encoded in UTF-8, so the characters you are reading can consist of 1 to 4 bytes. As long as the characters fit within the ASCII range (0-127), there is exactly 1 byte used per character.
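To make the byte counts concrete, here is a small sketch that prints how many bytes a few sample characters take up in UTF-8. The helper name utf8Hex is made up for this example, and Buffer.from assumes a Node.js version newer than the one this post was written for.

```javascript
// Hypothetical helper: encode a string as UTF-8 and return its bytes as hex.
function utf8Hex(text) {
  return Buffer.from(text, 'utf8').toString('hex');
}

['a', '¢', '€', '😀'].forEach(function(ch) {
  // Buffer.byteLength counts encoded bytes, not characters
  console.log(ch, Buffer.byteLength(ch, 'utf8'), utf8Hex(ch));
});
```

The four samples were picked to cover the 1, 2, 3 and 4 byte cases.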

But if I want to express a character outside the ASCII range, such as '¢', I need more bytes. The character '¢', for example, consists of two bytes: 0xC2 and 0xA2. The first byte, 0xC2, indicates that '¢' is a 2-byte character. This is easy to understand if you look at the binary representation of 0xC2:

11000010

As you can see, the bit sequence begins with '110', which as per the UTF-8 specification means: "2 byte character ahead!". Another character such as '€' (0xE2, 0x82, 0xAC) would work the same way. The first byte, 0xE2, looks like this in binary:

11100010

The prefix '1110' specifies that there are 3 bytes forming the current character. More exotic characters may even start with '11110', which indicates a 4 byte character.
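The prefix rules above can be sketched as a tiny function. The name utf8SequenceLength is an assumption for illustration only, and the function only looks at the first byte; it does not validate the bytes that follow.

```javascript
// Hypothetical helper: how many bytes does the character starting
// with `firstByte` occupy, judging by its bit prefix?
function utf8SequenceLength(firstByte) {
  if (firstByte >> 7 === 0x00) return 1; // 0XXXXXXX (ASCII)
  if (firstByte >> 5 === 0x06) return 2; // 110XXXXX
  if (firstByte >> 4 === 0x0E) return 3; // 1110XXXX
  if (firstByte >> 3 === 0x1E) return 4; // 11110XXX
  return 0; // 10XXXXXX continuation byte, or not a valid first byte
}

console.log((0xC2).toString(2), utf8SequenceLength(0xC2)); // 11000010 2
console.log((0xE2).toString(2), utf8SequenceLength(0xE2)); // 11100010 3
```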

As you can guess, UTF-8 text is not trivial to stream. Networks and file systems are not UTF-8 aware, so they will often split a chunk of text in the middle of a character.

To make sure you don't process a partial character, you have to analyze the last 3 bytes of any given chunk in your stream to check for the bit-prefixes that are used to announce a multibyte character. If you detect an incomplete character, you need to buffer the bytes you have for it, and then prepend them to the next chunk that comes in.

This way you can completely avoid breaking apart multibyte characters within a UTF-8 text, while still getting great performance and memory usage (only the last 3 bytes need checking / buffering).
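As a rough sketch of that idea (not the decoder shown further down), the following checks the last 3 bytes of a chunk, holds back an incomplete character and prepends it to the next chunk. The function name incompleteTailLength is made up for this example, the input is assumed to be valid UTF-8, and Buffer.from / Buffer.concat assume a newer Node.js than the one this post was written for.

```javascript
// Returns how many bytes at the end of `chunk` belong to a character whose
// remaining bytes have not arrived yet (0 if the chunk ends cleanly).
function incompleteTailLength(chunk) {
  var max = Math.min(3, chunk.length);
  for (var i = 1; i <= max; i++) {
    var byte = chunk[chunk.length - i];
    if ((byte & 0xC0) === 0x80) continue; // 10XXXXXX: continuation, look further back
    var needed = byte >> 5 === 0x06 ? 2   // 110XXXXX
               : byte >> 4 === 0x0E ? 3   // 1110XXXX
               : byte >> 3 === 0x1E ? 4   // 11110XXX
               : 1;                       // ASCII
    return needed > i ? i : 0;
  }
  return 0; // three continuation bytes: the end of a complete 4 byte character
}

var euro = Buffer.from('€'); // 0xE2, 0x82, 0xAC

// first chunk ends in the middle of '€'
var chunk = Buffer.concat([Buffer.from('5 '), euro.subarray(0, 2)]);
var tail = incompleteTailLength(chunk);
var complete = chunk.subarray(0, chunk.length - tail).toString();
var leftover = chunk.subarray(chunk.length - tail);

// prepend the buffered bytes to the next chunk before decoding it
var next = Buffer.concat([leftover, euro.subarray(2), Buffer.from(' only')]);

console.log(JSON.stringify(complete), JSON.stringify(next.toString()));
```

Only the held-back bytes are copied around; everything before them can be decoded right away.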

As of yesterday, node.js's net / http modules are now fully UTF-8 safe, thanks to the streaming Utf8Decoder (undocumented, API may change) you can see below:

var Buffer = require('buffer').Buffer;

var Utf8Decoder = exports.Utf8Decoder = function() {
  this.charBuffer = new Buffer(4);
  this.charReceived = 0;
  this.charLength = 0;
};

Utf8Decoder.prototype.write = function(buffer) {
  var charStr = '';
  // if our last write ended with an incomplete multibyte character
  if (this.charLength) {
    // determine how many remaining bytes this buffer has to offer for this char
    var i = (buffer.length >= this.charLength - this.charReceived)
      ? this.charLength - this.charReceived
      : buffer.length;

    // add the new bytes to the char buffer
    buffer.copy(this.charBuffer, this.charReceived, 0, i);
    this.charReceived += i;

    if (this.charReceived < this.charLength) {
      // still not enough bytes in this buffer? wait for more ...
      return;
    }

    // get the character that was split
    charStr = this.charBuffer.slice(0, this.charLength).toString();
    this.charReceived = this.charLength = 0;

    if (i == buffer.length) {
      // if there are no more bytes in this buffer, just emit our char
      this.onString(charStr);
      return;
    }

    // otherwise cut off the character's end from the beginning of this buffer
    buffer = buffer.slice(i, buffer.length);
  }


  // determine how many bytes we have to check at the end of this buffer
  var i = (buffer.length >= 3)
    ? 3
    : buffer.length;

  // figure out if one of the last i bytes of our buffer announces an incomplete char
  for (; i > 0; i--) {
    var c = buffer[buffer.length - i];

    // See http://en.wikipedia.org/wiki/UTF-8#Description

    // 110XXXXX
    if (i == 1 && c >> 5 == 0x06) {
      this.charLength = 2;
      break;
    }

    // 1110XXXX
    if (i <= 2 && c >> 4 == 0x0E) {
      this.charLength = 3;
      break;
    }

    // 11110XXX
    if (i <= 3 && c >> 3 == 0x1E) {
      this.charLength = 4;
      break;
    }
  }

  if (!this.charLength) {
    // no incomplete char at the end of this buffer, emit the whole thing
    this.onString(charStr+buffer.toString());
    return;
  }

  // buffer the incomplete character bytes we got
  buffer.copy(this.charBuffer, 0, buffer.length - i, buffer.length);
  this.charReceived = i;

  if (buffer.length - i > 0) {
    // buffer had more bytes before the incomplete char, emit them
    this.onString(charStr+buffer.slice(0, buffer.length - i).toString());
  } else if (charStr) {
    // or just emit the charStr if any
    this.onString(charStr);
  }
};

I feel like this implementation could still be somewhat simplified, so if you have any suggestions or comments, please let me know!
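If you want to try the buffering behaviour without copying the class above: later Node.js releases ship a string_decoder module that works along the same lines. This is a minimal sketch assuming such a release; it is not the Utf8Decoder from this post, and it returns the decoded string from write() instead of calling onString.

```javascript
var StringDecoder = require('string_decoder').StringDecoder;

var decoder = new StringDecoder('utf8');
var euro = Buffer.from('€'); // 0xE2, 0x82, 0xAC

// decoding the first byte on its own yields a replacement character
var naive = euro.subarray(0, 1).toString();

// the decoder holds the partial character back until it is complete
var first = decoder.write(euro.subarray(0, 1));
var second = decoder.write(euro.subarray(1));

console.log(JSON.stringify(naive), JSON.stringify(first), JSON.stringify(second));
```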

--fg

PS: Another buffer-based project I'm working on is a fast multipart parser - stay tuned for another post!

 

Understanding node.js

Posted on 29/4/10 by Felix Geisendörfer

Node.js has generally caused two reactions in people I've introduced it to. Basically people either "got it" right away, or they ended up being very confused.

If you have been in the second group so far, here is my attempt to explain node:

  • It is a command line tool. You download a tarball, compile and install the source.
  • It lets you run JavaScript programs by typing 'node my_app.js' in your terminal.
  • The JS is executed by the V8 JavaScript engine (the thing that makes Google Chrome so fast).
  • Node provides a JavaScript API to access the network and file system.

"But I can do everything I need in: ruby, python, php, java, ... !".

I hear you. And you are right! Node is no freaking unicorn that will come and do your work for you, sorry. It's just a tool, and it probably won't replace your regular tools completely, at least not for now.

"Get to the point!"

Alright, I will. Node is basically very good when you need to do several things at the same time. Have you ever written a piece of code and said "I wish this would run in parallel"? Well, in node everything runs in parallel, except your code.

"Huh?"

That's right, everything runs in parallel, except your code. To understand that, imagine your code is the king, and node is his army of servants.

The day starts by one servant waking up the king and asking him if he needs anything. The king gives the servant a list of tasks and goes back to sleep a little longer. The servant now distributes those tasks among his colleagues and they get to work.

Once a servant finishes a task, he lines up outside the king's quarters to report. The king lets one servant in at a time, and listens to what he reports. Sometimes the king will give the servant more tasks on the way out.

Life is good, for the king's servants carry out all of his tasks in parallel, but only report with one result at a time, so the king can focus. *

"That's fantastic, but could you quit the silly metaphor and speak geek to me?"

Sure. A simple node program may look like this:

var fs = require('fs')
  , sys = require('sys');

fs.readFile('treasure-chamber-report.txt', function(err, report) {
  // the first callback argument is the error, if any
  if (err) throw err;
  sys.puts("oh, look at all my money: "+report);
});

fs.writeFile('letter-to-princess.txt', '...', function(err) {
  if (err) throw err;
  sys.puts("can't wait to hear back from her!");
});

Your code gives node the two tasks to read and write a file, and then goes to sleep. Once node has completed a task, the callback for it is fired. But there can only be one callback firing at the same time. Until that callback has finished executing, all other callbacks have to wait in line. In addition to that, there is no guarantee on the order in which the callbacks will fire.
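Here is a small sketch of that behaviour, using timers as stand-ins for real I/O. The task helper and its delays are made up for this example. Your code runs to the end first, and the callbacks then report in the order the tasks finish, not the order they were started in.

```javascript
var log = [];

// Hypothetical stand-in for an I/O task that takes `ms` milliseconds.
function task(name, ms) {
  setTimeout(function() {
    log.push(name + ' reported');
    if (log.length === 3) console.log(log);
  }, ms);
}

task('slow task', 40); // started first
task('fast task', 10); // started second, reports first

// no callback can fire before this line has run
log.push('main code finished');

// eventually prints:
// [ 'main code finished', 'fast task reported', 'slow task reported' ]
```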

"So I don't have to worry about code accessing the same data structures at the same time?"

You got it! That's the entire beauty of JavaScript's single-threaded / event loop design!

"Very nice, but why should I use it?"

One reason is efficiency. In a web application, your main response time cost is usually the sum of time it takes to execute all your database queries. With node, you can execute all your queries at once, reducing the response time to the duration it takes to execute the slowest query.
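To illustrate, here is a sketch with timers standing in for database queries. Both runAll and fakeQuery are hypothetical helpers written for this example, not part of node or any database driver. All queries start right away, so the total time is roughly that of the slowest one.

```javascript
// Start every task immediately; call done(err, results) once all have reported.
function runAll(tasks, done) {
  var results = new Array(tasks.length);
  var pending = tasks.length;
  var failed = false;

  if (pending === 0) return done(null, results);

  tasks.forEach(function(task, index) {
    task(function(err, result) {
      if (failed) return;
      if (err) {
        failed = true;
        return done(err);
      }
      results[index] = result;
      pending -= 1;
      if (pending === 0) done(null, results);
    });
  });
}

// Hypothetical stand-in for a query that takes `ms` milliseconds.
function fakeQuery(name, ms) {
  return function(callback) {
    setTimeout(function() { callback(null, name + ' rows'); }, ms);
  };
}

var started = Date.now();
runAll([fakeQuery('users', 30), fakeQuery('posts', 50), fakeQuery('tags', 20)],
  function(err, results) {
    if (err) throw err;
    // elapsed time should be close to the slowest query (50 ms), not the 100 ms sum
    console.log(results, 'after', Date.now() - started, 'ms');
  });
```

The results array keeps the order in which the queries were passed in, regardless of which one finishes first.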

Another reason is JavaScript. You can use node to share code between the browser and your backend. JavaScript is also on its way to becoming a really universal language. No matter if you did python, ruby, java, php, ... in the past, you've probably picked up some JS along the way, right?

And the last reason is raw speed. V8 is constantly pushing the boundaries in being one of the fastest dynamic language interpreters on the planet. I can't think of any other language that is being pushed for speed as aggressively as JavaScript is right now. In addition to that, node's I/O facilities are really lightweight, bringing you as close as possible to fully utilizing your system's I/O capacity.

"So you are saying I should write all my apps in node from now on?"

Yes and no. Once you start to swing the node hammer, everything is obviously going to start looking like a nail. But if you're working on something with a deadline, you might want to base your decision on:

  • Are low response times / high concurrency important? Node is really good at that.
  • How big is the project? Small projects should be fine. Big projects should evaluate carefully (available libraries, resources to fix a bug or two upstream, etc.).

"Does node run on Windows?"

No. If you are on Windows, you need to run a virtual machine (I recommend VirtualBox) with Linux. Windows support for node is planned, but don't hold your breath for the next few months unless you want to help with the port.

"Can I access the DOM in node?"

Excellent question! No, the DOM is a browser thingy, and node's JS engine (V8) is thankfully totally separate from all that mess. However, there are people working on implementing the DOM as a node module, which may open very exciting possibilities such as unit testing client-side code.

"Isn't event driven programming really hard?"

That depends on you. If you already learned how to juggle AJAX calls and user events in the browser, getting used to node shouldn't be a problem.

Either way, test driven development can really help you to come up with maintainable designs.

"Who is using it?"

There is a small / incomplete list in the node wiki (scroll to "Companies using Node"). Yahoo is experimenting with node for YUI, Plurk is using it for massive comet and Paul Bakaus (of jQuery UI fame) is building a mind-blowing game engine that has some node in the backend. Joyent has hired Ryan Dahl (the creator of node) and heavily sponsors the development.

Oh, and Heroku just announced (experimental) hosting support for node.js as well.

"Where can I learn more?"

Tim Caswell is running the excellent How To Node blog. Follow #nodejs on twitter. Subscribe to the mailing list. And come and hang out in the IRC channel, #node.js (yes, the dot is in the name). We're close to hitting the 200-lurker mark there : ).

I'll also continue to write articles here on debuggable.com.

That's it for now. Feel free to comment if you have more questions!

--fg

*: The metaphor is obviously a simplification, but it's hard to find a counterpart for the concept of non-blocking in reality.

 

Interview on the changelog

Posted on 6/4/10 by Felix Geisendörfer

Adam and Wynn interviewed me for the latest episode of the changelog:

Episode 0.2.0 - Node.js with Felix Geisendörfer

For those interested in node.js (frameworks, unit testing, etc.) this should be a nice introduction to the current status of the project and ecosystem.

--fg

 

Quitting open source

Posted on 1/4/10 by Felix Geisendörfer

Update: Yes, this of course was an April fools joke : )

This is a decision that has not been easy for us. Over the years, Tim and I built our reputation and company on the basis of open source. For a long time we worked on CakePHP, and more recently I became very involved with node.js.

Generally, open source has been very good to us. We learned a lot about programming, collaborated with great people and, thanks to our clients, made payroll consistently.

But sometimes, even if things are good, you have to evaluate if they are aligned with your true goals. And today, we finally decided they are not. Our goals and the nature of open source are in fundamental conflict.

What changed? Well, when we initially started debuggable, our goal was to build commercial web applications and make a living from that. But it turns out, that is actually really hard.

37signals may wash your head about getting real until it comes off, but the truth is:

You will not build a product like basecamp with a limit of 10h / week. And you certainly will not, let me stress that, develop an open source technology like Ruby on Rails along the way, which will then help you promote your product(s). 37signals either got insanely lucky, or they are glorifying their story after the fact. It is probably both.

This leads back to our decision. We have concluded that, in order to reach our goals for 2010, we need more time. It is not really possible to cut back on client work, since we've got bills to pay. So instead, we will drastically cut back on our open source involvement.

Beginning today, we will only work on open source if we really need a critical bug fixed and the maintainer of the project refuses to fix it for us. We will only send those bug fixes upstream if the patches are big enough that it would be more difficult for us to maintain them privately.

We estimate that this will free up 20 hours / week for us. Those hours, for now, will be used to finally get transloadit.com out of the testing phase. At the same time, we are evaluating a more radical technology change. We recently developed our first native iPhone application for a client, and it is a truly great platform. The iPad also seems like a great opportunity we shouldn't miss. So I wouldn't be surprised if Tim and I moved to the App Store ecosystem at some point.

So farewell, open source community. We are sorry we can no longer be a part of it. Our website will be updated to reflect our new philosophy by tomorrow.

-- Felix & Tim

 

JavaScript Meetup Hamburg + Slides

Posted on 4/3/10 by Felix Geisendörfer

Update: Andy Wenk posted a very nice summary of the event.

Last night Tim and I took a little road trip to Hamburg. I had no idea the Reeperbahn looked like Las Vegas : ).

Anyway, our actual destination was the first Hamburg JS meetup where Malte Ubl invited me to speak about node.js. The turnout was fantastic, and thanks to SinnerSchrader's hosting of the event, there was plenty of pizza, beer and an absolutely fantastic location.

I've also updated my previous node.js talk. All the examples should now be 0.1.30+ compatible, and the section about "The Future" of node has a more recent and interesting list of things that are on the radar.

You can also download the slides as PDF (164 kb).

--fg

 