debuggable

 

Queues in the cloud - Debuggable PHP SQS Library

Posted on 6/3/09 by Felix Geisendörfer

Hey folks,

the cloud promises an abundance of CPU cycles and huge opportunities for parallel processing. However, before you can tap into this potential, you need to have a good mechanism for distribution. Hashing can help, but there are other options as well. One of them is to use a queue.

So how does it work? Well, essentially a queue is a big list where new items are added to the bottom and existing items are read from the top. So let's say you want to encode 10,000 videos using 100 machines. A queue can help with this by having one machine loop over all videos and put a message in the queue for each one of them. This message is a text string (JSON is nice for this) that represents one job of converting one video. So you end up with a queue that has 10,000 messages in it, each representing one job.

Now you can use EC2 to launch 100 High CPU instances at $0.20 / hour. Each one of these workers now runs a program in an infinite loop that queries the queue for job messages. As soon as a message is fetched it becomes invisible to all other workers querying the queue. This period is called the visibility timeout and lasts for a number of seconds that you define. During this time the worker who fetched the message is responsible for completing the job. If it does, it sends a request to the queue to delete the message. If the worker fails (for example because it crashes), the message will appear in the queue again after the timeout, and other workers can process it.
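To make the visibility timeout more tangible, here is a minimal sketch of the fetch / hide / delete cycle. The InMemoryQueue class is a hypothetical in-memory stand-in, not the SQS API, and the current time is passed in by hand so the behaviour is easy to follow:

```php
<?php
// Hypothetical in-memory stand-in for a queue with a visibility timeout.
// This is NOT the SQS API; time is passed in explicitly to keep it deterministic.
class InMemoryQueue {
    private $messages = array(); // id => array('body' => string, 'visibleAt' => int)
    private $nextId = 1;
    private $visibilityTimeout;

    public function __construct($visibilityTimeout) {
        $this->visibilityTimeout = $visibilityTimeout;
    }

    public function send($body) {
        $this->messages[$this->nextId] = array('body' => $body, 'visibleAt' => 0);
        return $this->nextId++;
    }

    // Returns array('id' => ..., 'body' => ...) or null if nothing is visible.
    public function receive($now) {
        foreach ($this->messages as $id => $message) {
            if ($message['visibleAt'] <= $now) {
                // Hide the message from other workers until the timeout expires
                $this->messages[$id]['visibleAt'] = $now + $this->visibilityTimeout;
                return array('id' => $id, 'body' => $message['body']);
            }
        }
        return null;
    }

    public function delete($id) {
        unset($this->messages[$id]);
    }
}

$queue = new InMemoryQueue(30);
$queue->send(json_encode(array('video' => 'a.avi')));

$job    = $queue->receive(0);   // worker 1 fetches the job at t=0
$hidden = $queue->receive(10);  // worker 2 sees nothing at t=10
$retry  = $queue->receive(31);  // worker 1 crashed, the job reappears after the timeout
$queue->delete($retry['id']);   // worker 2 finishes the job and deletes the message
$empty  = $queue->receive(100); // nothing left to do
```

A real worker would wrap receive() in its infinite loop and only call delete() once the job has actually succeeded.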

So as you can see there are a lot of advantages to this approach:

  • The worker machines don't need to be aware of each other
  • Fault tolerant, no job gets lost because a worker machine fails
  • Effectiveness, the queue tries to make sure no message gets processed twice
  • The system itself can never get overloaded and it always works as fast as it can
  • Queue length can serve as a very nice measurement of load on a given system

Of course all of those advantages depend on the queue service not failing which of course is very difficult to achieve. So in a scenario as the one described above, Amazon's SQS service is a very interesting solution. Why?

Well first of all it is cheap. While testing the service I put 1.7 million messages in my queue - Amazon charged me $1.74, so it's roughly $1 per million messages plus bandwidth. The next good thing is that it's highly scalable. Whether you put 10 or 10 million messages in your queue, Amazon says they'll sort it out for you. And last but not least, there are already AWS monitoring tools out there to monitor your queues hosted with Amazon.

So far so good. There are also things that suck about SQS. First of all, the latency is pretty high. My tests confirmed what Wikipedia and others say about SQS: It takes ~2-10s for a message added to a queue to be available for reading. If you need a very responsive queue, that rules SQS out for you. Also very stupid is the lack of a "flush" function. So while you are developing you have to write your own tool for flushing a queue. And last but not least is the fact that SQS requires your system to be idempotent. This basically means that SQS does not guarantee that a message cannot be fetched by 2 workers at the same time and thus be processed twice. Idempotent means that your app needs to be prepared for that: processing anything twice needs to lead to the same result as processing it once. But of course, SQS tries to avoid this scenario as much as possible.
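One way to get there is to remember which jobs have already been completed and to skip duplicates. The following is only a sketch: processOnce() and the $completed array are made up for illustration, and $completed stands in for durable storage such as a database table. It also does not protect against two workers processing the same job at the very same moment. For that case the job itself has to be safe to repeat, for example by always writing its output to the same location.

```php
<?php
// Hypothetical job handler; $completed stands in for durable storage
// (a database table, for example) that survives worker restarts.
function processOnce(array &$completed, $jobId, $work) {
    if (isset($completed[$jobId])) {
        return false; // duplicate delivery, nothing to do
    }
    call_user_func($work);
    $completed[$jobId] = true;
    return true;
}

$completed = array();
$encoded = 0;
$encode = function () use (&$encoded) { $encoded++; };

$first  = processOnce($completed, 'video-42', $encode);
$second = processOnce($completed, 'video-42', $encode); // same message delivered again
```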

Anyway, if you come to the conclusion that SQS can solve more problems for you than it creates, it is an amazing service. So how do you use it in your apps? Well I first tried using the PHP library Amazon provides, but I have come to hate it. I mean it is very comprehensive and does the job. But the people who wrote it were clearly Java engineers forced to write PHP. I feel sorry for them.

For a long time my searches for an alternative came up empty, but last night I discovered at least 2 viable options. The first is called php-aws and provides very clean, easy to use classes for S3, SQS, EC2 and AWIS. From their project page I also found another project called Tarzan. PHP-AWS recommends Tarzan as a super-robust and comprehensive alternative. And from my analysis of it, it indeed looks like a very mature project and I encourage everybody to check it out.

Well - too bad for me that I was already way into implementing my own class last night when discovering those two options : ). But nevertheless I am very proud to present the Debuggable.com SQS PHP5 Library. Besides a very easy to use and intuitive interface, it features the following attributes:

  • Exponential backoff on failures and retry maximum
  • Uses CURL for reliable HTTP communication
  • It is completely unit tested, which was an interesting challenge
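For readers who have not met exponential backoff before, the idea can be sketched as follows. This is not code from the library: retryWithBackoff() and its parameters are hypothetical, and the sleep function is injected so the delays can be recorded instead of actually waiting.

```php
<?php
// Hypothetical helper, not taken from the library: retry $operation up to
// $maxRetries times, doubling the delay after each failure.
function retryWithBackoff($operation, $maxRetries, $baseDelayMs, $sleep) {
    $attempt = 0;
    while (true) {
        try {
            return call_user_func($operation);
        } catch (Exception $e) {
            if ($attempt >= $maxRetries) {
                throw $e; // retry maximum reached, give up
            }
            call_user_func($sleep, $baseDelayMs * pow(2, $attempt));
            $attempt++;
        }
    }
}

$delays = array();
$calls = 0;
$result = retryWithBackoff(
    function () use (&$calls) {
        $calls++;
        if ($calls < 4) {
            throw new Exception('temporary failure');
        }
        return 'ok';
    },
    5,   // retry maximum
    100, // first delay in milliseconds
    function ($ms) use (&$delays) { $delays[] = $ms; } // record instead of usleep($ms * 1000)
);
```

In real use the last argument would simply sleep, and the operation would be the HTTP request to the queue service.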

The lib itself is as simple as they come:

$queue = new SqsQueue('my_queue', array('key' => '...', 'secretKey' => '...'));
$queue->sendMessage(array('autoJsonSerialize' => 'is fantastic'));

$lookMaIAmUsingSPL = count($queue);

I do not recommend using the library for production purposes right now, but if you want to get started with SQS or study the implementation I think you have found an excellent library. Writing the library certainly has been a great experience and opportunity for me to study SQS in detail.

Anyway, enough said. Back to my cloud bed.

-- Felix Geisendörfer aka the_undefined

 

Embracing the Cloud - Locating Resources

Posted on 3/3/09 by Felix Geisendörfer

With the rise of affordable cloud computing (especially services like EC2 and S3) we need to learn to apply additional skills to our craft. One of those is using Hashtables to locate resources in our system.

Here is an example. Let's say your application stores image files uploaded by its users. At some point your hard disk is filling up. Now you have two choices.

  • Add another hard drive
  • Distribute the files amongst multiple servers

Choice A is called "Vertical Scaling" and is usually easier. Choice B is "Horizontal Scaling". Horizontal scaling is obviously the future, since the free lunch is over.

So once you decided you want to be serious and not place your bets on vertical scaling, you face a problem. On any given server X how do I know which server holds File Y? The answer is simple, you look it up using a hash function. Since the rise of Gravatar and Git, cryptographic hash functions, especially MD5 and SHA1, have become popular general-purpose hashing functions of choice.

So if you previously stored your file urls in the database like this:

$file['url'] = 'http://example.org/files/'. $file['id'];
storeAt($file['url'], file_get_contents($file['path']));

You would rewrite it to look like this:

function url($file) {
  $servers = file_get_contents('http://resources.example.org/file_servers.json');
  $servers = json_decode($servers, true);
  $sha1 = sha1_file($file['path']);

  foreach ($servers as $key => $server) {
    if (preg_match($key, $sha1)) {
      return $server . '/' . $file['id'];
    }
  }
  throw new Exception('invalid server list');
}

$file['url'] = url($file);
storeAt($file['url'], file_get_contents($file['path']));

And store a file called file_servers.json on resources.example.org:

{
  "/^[0-7]/": "http://bob.files.example.org",
  "/^[89a-f]/": "http://john.files.example.org"
}

If you now cache the file_servers.json list on each machine and make sure resources.example.org is available, you can scale to pretty much infinity.

How does it work? Simple. We wrote a hash function called url() which will assign a unique location to each newly uploaded file. This works by computing the sha1 fingerprint for the file. Then we pick a server based on the leading characters of the sha1, using the ranges defined in file_servers.json.
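The lookup step on its own can be tried without any network access. In this sketch pickServer() and the inline server map are assumptions made for illustration. The patterns are chosen so that every possible first character of a sha1 hash (0-9 and a-f) matches exactly one server:

```php
<?php
// Hypothetical server map; the patterns are chosen so that every
// hexadecimal first character of a sha1 hash matches exactly one entry.
$servers = array(
    '/^[0-7]/'   => 'http://bob.files.example.org',
    '/^[89a-f]/' => 'http://john.files.example.org',
);

// Returns the first server whose pattern matches the given hash.
function pickServer(array $servers, $sha1) {
    foreach ($servers as $pattern => $server) {
        if (preg_match($pattern, $sha1)) {
            return $server;
        }
    }
    throw new Exception('invalid server list');
}

$hash = sha1('hello'); // starts with "a"
$server = pickServer($servers, $hash);
```

Because the same content always produces the same hash, any machine holding the same server list arrives at the same answer without asking anybody else.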

If you need to remove full servers or add new ones, you simply modify the file_servers.json file.

Probably the largest and most sophisticated implementation of the concept above seems to be Amazon S3 right now.

The jury is still out if they will manage to win the upcoming cloud war, but there are exciting times ahead of us. The recession is putting many companies, big and small, in need of saving costs through innovation. So what renewable energies and fuel efficient cars are for other industries, is what cloud computing is to us.

Think that cloud computing is not going to help with your current clients and applications? Invest an hour to put your client's static files (and uploads) in Amazon S3 and downgrade his shared hosting account. He now needs less storage, bandwidth and CPU power and gets redundant backup as a bonus.

-- Felix Geisendörfer aka the_undefined

 

How To Save Half A Second On Every CakePHP Request

Posted on 26/2/09 by Tim Koschützki

Hey folks,

as an application comes closer and closer to its launch date, not having cared about performance during development becomes more and more of a problem.
There are several ways to improve the performance of your CakePHP application. The first thing one would think about is generally caching of your views and database queries. However, during the development stage implementing caching can cause a lot of confusion and phantom bugs. In short, it might waste your time which is better put into features. Any performance improvement that does not affect how data is retrieved, stored and cached is welcome. If it affects your entire site and not only parts of it, it's all the better.

With the help of Mark Story's excellent CakePHP DebugKit I got the idea of disabling Cake's reverse route lookup to gain performance. Almost half a second per request for link-heavy sites.

The Hack

As you use the HtmlHelper to create links with the helper's link() method, you throw a url in the form of an array at Router::url() every time. With the complex routes parsing done for every $html->link() call, this becomes a big issue for link-heavy sites. The overhead is largest if you have a lot of routes installed. So I wondered whether it is easily possible to override the behaviour for standard urls that don't need routes parsing. A classical example would be:

$html->link('Check this Article', array('controller' => 'articles', 'action' => 'view', $article['Article']['id']));

Since this link is dynamic (depending on the article id) it does not need routes parsing (most of the time, more on that later). There are other cases, like 4 parameters, 2 parameters, pagination links that have named params in the url, or only one parameter if only the 'controller' is present and you want to access the index action, etc. I have tinkered a little and written some code that we are using in most of our projects now and it has worked out quite well. In fact it is saving almost half a second for every request:

<?php
class AppHelper extends Helper {
  function url($url = null, $full = false) {
    $Router =& Router::getInstance();
    if (!empty($Router->__params)) {
      if (isset($Router) && !isset($Router->params['requested'])) {
        $params = $Router->__params[0];
      } else {
        $params = end($Router->__params);
      }
    }

    if (isset($params['admin']) && $params['admin'] && !isset($url['admin'])) {
      $url['admin'] = $params['admin'];
    }

    if (is_array($url) && isset($url['controller']) && !isset($url['page'])) {
      if (!isset($url['action'])) {
        $url['action'] = 'index';
      }

      $admin = '';
      if (isset($url['admin']) && $url['admin']) {
        $prefix = Configure::read('Routing.admin');
        $admin = $prefix . '/';
        // Strip a leading "<prefix>_" (e.g. "admin_") from the action name
        if (strpos($url['action'], $prefix . '_') === 0) {
          $url['action'] = substr($url['action'], strlen($prefix) + 1);
        }
      }
      unset($url['admin']);
      unset($url['plugin']);

      $count = count($url);
      if (4 == $count) {
        return '/' . $admin . $url['controller'] . '/' . $url['action'] . '/' . $url[0] . '/' . $url[1];
      }

      if (3 == $count) {
        if (isset($url['id'])) {
          $url[0] = $url['id'];
        }
        return '/' . $admin . $url['controller'] . '/' . $url['action'] . '/' . $url[0];
      }

      if (2 == $count) {
        return '/' . $admin . $url['controller'] . '/' . $url['action'];
      }

      if (1 == $count) {
        return '/' . $admin . $url['controller'] . '/index';
      }
    }

    return parent::url($url, $full);
  }
}
?>

How does it work?

As you can see, it's simply overriding the default behaviour for Helper::url() in your AppHelper. It goes through the classical cases and builds out the url via lightweight string concatenation. Not a beauty, but fast.

What it really does is break the reverse routing feature. That means if you specify such a route:

Router::connect('/login', array('controller' => 'auth', 'action' => 'login'));

Then create a link with $html->link('Login Now!', array('controller' => 'auth', 'action' => 'login')), the link url will not automagically transform into '/login' anymore. This is pretty bad; however, if your site hardcodes the urls used in routes, this is not a problem. This means that instead of providing array('controller' => 'auth', 'action' => 'login'), you just do '/login', which I always do for specific pages since there aren't so many of them. So the hack helps for non-dynamic cases.

Usage

To use the code, acknowledge it's more like a site-specific optimization. By the way, for those of you who think I should just go ahead and contribute a patch to CakePHP that will make it possible to disable reverse route lookup, I talked to Nate already. He said he is in the process of rewriting the Routing for Cake 1.3 and it will be much faster. In the meantime, this might help you.

Copy and paste the url() function to your AppHelper class in app_helper.php. Try it out and see if it works for you. Obviously, it doesn't affect all cases and is pretty specific to how we write urls. However, it falls back to the normal url parsing if no appropriate case is found. Any suggestions, bug reports and contributions are welcome.

Disclaimer:

In no way do I claim that this code will work as well for you as it does for us. It's specific to how urls are written using $html->link() and you might screw over your app by using this. Take extra caution when using it - especially with reverse route lookup - and make sure to test your app thoroughly.

-- Tim Koschuetzki aka DarkAngelBGE

 

Are we done yet?

Posted on 11/2/09 by Felix Geisendörfer

If you have worked with progress indicators in the past, you might have run into the same question that always makes me wonder.

Should I write:

if ($done == $total) {
    // We are done, move on
}

or should I write:

if ($done >= $total) {
    // We are done, move on
}

that is the question!

Today I went for the '==' route because the $done variable should never be bigger than the $total one.

I think this comes right down to your development philosophy: Should weird conditions in your application that seem "harmless" be silently ignored or should they cause a crash?

I guess I'm the "fail often, fail early" type.
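If you want the weird condition to crash loudly instead of just never matching, a sketch could look like this. The isDone() helper is made up for illustration:

```php
<?php
// Hypothetical helper: returns true when finished, false while in progress,
// and throws if the counters are in a state that should never occur.
function isDone($done, $total) {
    if ($done > $total) {
        throw new LogicException("done ($done) exceeds total ($total)");
    }
    return $done == $total;
}

$inProgress = isDone(3, 10);
$finished   = isDone(10, 10);
```

With a plain '==' check an overshoot simply never finishes, while this variant stops at the first impossible value.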

What type are you?

-- Felix Geisendörfer aka the_undefined

 

Disable strict host checking for git clone

Posted on 4/2/09 by Felix Geisendörfer

Hey folks,

while playing with automated machine configuration in EC2 for a few minutes this morning, I stumbled across a little hurdle. One of the items in my init script was the cloning of a git repository from GitHub.

This normally isn't a very difficult task to automate. However, it can become so if you see the following message:

$ git clone git@github.com:debuggable/secret-project.debuggable.com.git
Initialized empty Git repository in /var/git/secret-project.debuggable.com/.git/
The authenticity of host 'github.com (65.74.177.129)' can't be established.
RSA key fingerprint is 16:27:ac:a5:76:28:2d:36:63:1b:56:4d:eb:df:a6:48.
Are you sure you want to continue connecting (yes/no)?

Interactive questions like this can be really annoying when it comes to automation. Luckily there is an easy fix available.

$ echo -e "Host github.com\n\tStrictHostKeyChecking no\n" >> ~/.ssh/config

This will add a configuration entry to your ~/.ssh/config file that makes ssh skip host key verification for github.com, so the question above is never asked.

-- Felix Geisendörfer aka the_undefined

PS: If the topic of passing ssh options to your git commands is interesting to you, make sure to also check out this git wiki page.

 