Embracing the Cloud - Locating Resources
Posted on 3/3/09 by Felix Geisendörfer
With the rise of affordable cloud computing (especially services like EC2 and S3), we need to add some new skills to our craft. One of those is using hash tables to locate resources in our systems.
Here is an example. Let's say your application stores image files uploaded by its users. At some point your hard disk starts filling up, and you have two choices:
- A: Add another hard drive
- B: Distribute the files amongst multiple servers

Choice A is called "Vertical Scaling" and is usually the easier one. Choice B is "Horizontal Scaling". Horizontal scaling is obviously the future, since the free lunch is over.
So once you have decided to get serious and not place your bets on vertical scaling, you face a problem: on any given server X, how do I know which server holds file Y? The answer is simple: you look it up using a hash function. Since the rise of Gravatar and Git, cryptographic hash functions, especially MD5 and SHA1, have become popular choices for general-purpose hashing.
So if you previously stored your file URLs in the database like this:
storeAt($file['url'], file_get_contents($file['path']));
You would rewrite it to look like this:
function url($file) {
    $servers = file_get_contents('http://resources.example.org/file_servers.json');
    $servers = json_decode($servers, true);

    $sha1 = sha1_file($file['path']);
    foreach ($servers as $key => $server) {
        if (preg_match($key, $sha1)) {
            return $server . '/' . $file['id'];
        }
    }
    throw new Exception('invalid server list');
}

$file['url'] = url($file);
storeAt($file['url'], file_get_contents($file['path']));
And store a file called file_servers.json on resources.example.org:
"/^[a-l]/": "http://bob.files.example.org",
"/^[l-z]/": "http://john.files.example.org",
}
If you now cache the file_servers.json list on each machine and make sure resources.example.org is available, you can scale to pretty much infinity.
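A minimal sketch of such a cache (the cache path and the five-minute lifetime are arbitrary choices of mine, not part of the setup above):

function cached_server_list() {
    $cacheFile = '/tmp/file_servers.json'; // local copy on this machine

    // Refresh the local copy at most every 5 minutes.
    if (!file_exists($cacheFile) || filemtime($cacheFile) < time() - 300) {
        $json = @file_get_contents('http://resources.example.org/file_servers.json');
        if ($json !== false) {
            file_put_contents($cacheFile, $json);
        }
    }
    return json_decode(file_get_contents($cacheFile), true);
}

url() would then call cached_server_list() instead of downloading the list for every single upload.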
How does it work? Simple. We wrote a hash function called url() which assigns a unique location to each newly uploaded file. It works by computing the sha1 fingerprint of the file and then picking a server based on the leading character of the hash, using the ranges defined in file_servers.json. For example, a file whose sha1 starts with "3" ends up on bob.files.example.org, while one starting with "c" goes to john.files.example.org.
If you need to remove full servers or add new ones, you simply modify the file_servers.json file.
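For example, to bring a third machine into rotation you could split one of the ranges (the server name below is made up, and keep in mind that files whose hashes fall into a range that moved would have to be copied over to their new home):

{
    "/^[0-3]/": "http://bob.files.example.org",
    "/^[4-7]/": "http://jane.files.example.org",
    "/^[89a-f]/": "http://john.files.example.org"
}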
Probably the largest and most sophisticated implementation of the concept above right now is Amazon S3.
The jury is still out on whether they will manage to win the upcoming cloud war, but there are exciting times ahead of us. The recession is putting many companies, big and small, in need of cutting costs through innovation. So what renewable energy and fuel-efficient cars are to other industries, cloud computing is to us.
Think that cloud computing is not going to help with your current clients and applications? Invest an hour to put your client's static files (and uploads) in Amazon S3 and downgrade his shared hosting account. He now needs less storage, bandwidth and CPU power, and gets redundant backup as a bonus.
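A minimal sketch of such an upload, assuming Donovan Schönknecht's standalone S3.php class (the credentials, bucket name and file path below are placeholders; adjust the calls if your S3 library of choice differs):

require_once 'S3.php';

S3::setAuth('ACCESS_KEY', 'SECRET_KEY'); // placeholder credentials

// Push a static file into a bucket and make it publicly readable,
// so it can be served straight from S3 instead of the shared host.
S3::putObject(
    S3::inputFile('/var/www/client/uploads/logo.png'),
    'client-static-files',
    'uploads/logo.png',
    S3::ACL_PUBLIC_READ
);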
-- Felix Geisendörfer aka the_undefined
Good article Felix, and a simple, straightforward implementation.
Nice Felix. Just sold a new project to a customer that involves uploading for clients. Might as well put it on S3.
If you're really concerned about scaling the storage of small files, you do not want to do this. I've manually maintained a setup of about 100TB of small files in a similar fashion and it became a nightmare. Basically, if users keep uploading, the number of images matching each regex keeps growing too, so your servers will get full nonetheless, and it becomes really hard to rebalance.
A really nice implementation of distributed storage is MogileFS (http://www.danga.com/mogilefs/), which we ended up deploying.
It keeps track of all files in a database, it can move files around without downtime, and you can set specific redundancy policies (e.g. store thumbnails once, full images twice, etc.).
Also, you probably don't want to do so many HTTP requests just to get some really basic data. Rather, keep it in a database/memcache/etc.
The other big player when it comes to hash tables is Google. They use BigTable as their storage engine for nearly everything. Right now I am looking into Google App Engine, and it makes clear that BigTable is a good implementation for distributing large amounts of data across a really large number of servers. Since you can use App Engine, there should be a way to use a small app to store what you need in BigTable. Just another idea instead of implementing everything yourself. And by the way, they now offer their services the way Amazon AWS does.
"Think that cloud computing is not going to help with your current clients and applications? Invest an hour to put your clients static files (and uploads) in Amazon S3 and downgrade his shared hosting account. He now needs less storage, bandwidth and CPU power and gets redundant backup as a bonus."
Wow, so that's a really good advantage if you own/use a cloud.
@Dieter_be MogileFS looks interesting. How involved is the set-up (i.e., will the average developer with some *nix skills be able to implement this on a server)? If anything, the library seems pretty straightforward.
Aside: @felix? How can I unsubscribe from the comment notifications for this post?
Cameron: it's not that hard, you just need to find a good tutorial for your distro, and with a bit of luck everything works out of the box if you install it through Perl CPAN (check their wiki for tutorials).
They have some good APIs, though no official one for PHP IIRC; we ended up using Erik Osterman's version. See http://groups.google.com/group/mogile/browse_thread/thread/e0183a72ad6b0f1a
primeminister: Not yet, but I'll add this feature very soon. Sorry : (
Hi Felix,
I'm about to get into the cloud with a CakePHP YouTube-like app, and right now I'm set on using a Mosso server and Mosso files. They say that the files are automatically served using the LimeLight CDN, so that sounds really sweet.
Would you say that the Amazon AWS suite would be better suited for some heavy ffmpeg encoding queue management?
Have a Great Day,
cosmin
Nifty. I really like how simple and straightforward this solution is - I might have to use the idea on my next project.