Why UUIDs?
Posted on 11/9/08 by Felix Geisendörfer
What are UUIDs? Well, according to Wikipedia:
A Universally Unique Identifier (UUID) is an identifier standard used in software construction, standardized by the Open Software Foundation (OSF) as part of the Distributed Computing Environment (DCE).
In English that would roughly translate to really long random strings. So random in fact that Wikipedia thinks:
After generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%
So why should you care and maybe even use those UUIDs? I mean there are certainly reasons against using them. Ever tried to memorize a bunch of UUIDs? Give it a shot:
48c907b0-b8ac-4161-84c9-4fbf1030b5da
48c907b0-dc38-475c-a9c4-4a2e1030b5da
48c907b0-f088-44ae-8be5-4e811030b5da
When it comes to primary keys you might miss the ease with which one can remember auto incrementing values. But hang on, with a bit of copy & paste one can get over this inconvenience and carry those values around.
But why should you go through that trouble? The answer can largely be found in their uniqueness. With UUIDs, you can give an identifier to any object in the real world. This id you give it is guaranteed to not have been used for labeling any other object in the universe so far. Guaranteed in this context means you are more likely to win the lottery without buying lottery tickets than generating a duplicate UUID.
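To put a rough number on "guaranteed", here is a small Python sketch of the birthday-bound approximation. It assumes random (version 4) UUIDs, which carry 122 random bits. The function name and the two workloads are made up for illustration, and other UUID versions behave differently.

```python
import math

# Assumption: version 4 (random) UUIDs, which have 122 random bits.
RANDOM_BITS = 122


def collision_probability(count, bits=RANDOM_BITS):
    """Approximate chance of at least one duplicate among `count` random ids."""
    space = 2.0 ** bits
    # Birthday bound: p ~= 1 - exp(-n(n-1) / 2N); expm1 keeps precision for tiny p.
    return -math.expm1(-count * (count - 1) / (2.0 * space))


SECONDS_PER_YEAR = 365.25 * 24 * 3600

# One billion UUIDs per second for 100 years.
extreme = 1e9 * SECONDS_PER_YEAR * 100
# A more down-to-earth workload: one billion UUIDs in total.
modest = 1e9

print(f"{extreme:.2e} ids -> p = {collision_probability(extreme):.2f}")
print(f"{modest:.2e} ids -> p = {collision_probability(modest):.2e}")
```

Under these assumptions the extreme case lands in the same ballpark as the Wikipedia figure quoted above, while a billion ids in total leaves a chance far below one in a quintillion.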
This is an amazing property for several reasons. Imagine the fairly real-world task of merging the blog posts of two previously separate sites into a single new one. Back when Tim and I did that for this blog, we had to re-index all our blog post ids as well as all foreign keys pointing to them. Why? Because we had both given out the same auto-incrementing ids to different blog posts. If we (WordPress) had used UUIDs, merging our two blogs (or any other sets of data) would have been orders of magnitude easier.
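The merge problem can be sketched in a few lines of Python. The post titles and dictionary names below are invented for illustration, and plain dicts stand in for database tables; this is not how our actual migration was done.

```python
import uuid

# Hypothetical data: two blogs that each numbered their posts 1, 2, 3, ...
felix_posts = {1: "Why UUIDs?", 2: "Hello World"}
tim_posts = {1: "First post", 2: "Second post"}

# Merging by integer id silently overwrites posts that share an id.
merged_by_int = {**felix_posts, **tim_posts}
print(len(merged_by_int))  # 2 posts survive out of 4


def with_uuid_keys(posts):
    """Re-key posts by a freshly generated random UUID string."""
    return {str(uuid.uuid4()): title for title in posts.values()}


# Had each site used UUID keys from the start, a plain union would be enough.
merged_by_uuid = {**with_uuid_keys(felix_posts), **with_uuid_keys(tim_posts)}
print(len(merged_by_uuid))  # all 4 posts kept
```

With integer keys you have to renumber one side and patch every foreign key that points at it. With UUID keys the union simply works.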
And there is more. If you want to create polymorphic associations, then UUIDs are going to make that a lot easier for you in CakePHP. Just create a table called 'comments' and give it a belongsTo of [Photo, Post, User]. But instead of creating a foreign key field for each model, you get away with just creating a single foreign_id field and setting that up as the foreign key for all the belongsTo associations. Since the UUID the foreign_id points to is unique across tables, you don't even have to track what model / table it actually belongs to. This avoids a more complex query and leads to improved performance for the setup. (You might still want to populate Comment.model just so you know where the link goes without doing the actual lookup.)
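Here is a minimal sketch of the single foreign_id idea, written in Python with an in-memory SQLite database rather than CakePHP. The table layout, sample rows and helper names are assumptions made for the example, not what CakePHP generates.

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE photos   (id CHAR(36) PRIMARY KEY, title TEXT);
    CREATE TABLE posts    (id CHAR(36) PRIMARY KEY, title TEXT);
    CREATE TABLE comments (
        id CHAR(36) PRIMARY KEY,
        foreign_id CHAR(36),   -- id of a photo OR a post
        model TEXT,            -- optional hint, like Comment.model
        body TEXT
    );
""")


def new_id():
    """Return a random UUID in its 36-character text form."""
    return str(uuid.uuid4())


photo_id, post_id = new_id(), new_id()
db.execute("INSERT INTO photos VALUES (?, ?)", (photo_id, "Sunset"))
db.execute("INSERT INTO posts VALUES (?, ?)", (post_id, "Why UUIDs?"))
db.execute("INSERT INTO comments VALUES (?, ?, ?, ?)",
           (new_id(), photo_id, "Photo", "Nice shot"))
db.execute("INSERT INTO comments VALUES (?, ?, ?, ?)",
           (new_id(), post_id, "Post", "Interesting read"))


def comments_for(record_id):
    """Fetch comments by foreign_id alone; the model column is not in the WHERE."""
    rows = db.execute(
        "SELECT body FROM comments WHERE foreign_id = ?", (record_id,))
    return [body for (body,) in rows]


print(comments_for(photo_id))
print(comments_for(post_id))
```

Because a photo and a post can never share an id, the lookup needs only one column. With integer keys the same query would also have to filter on the model name.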
Last but not least there is another advantage hidden in the randomness of those ids. By not having your ids follow any pattern, an attacker won't be able to iterate through your database's records without you giving him a map of primary ids for it. This in itself doesn't automatically make your application secure, but it lessens the damage that is likely to be done if a permission bug is exploited.
Anyway, I'm sure I forgot a bunch of reasons and I am also looking forward to people sharing their concerns on performance and the concept itself. In the meantime just know that CakePHP will automatically generate UUID keys if your primary key is a char(36) field.
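The char(36) width follows from the canonical text form of a UUID: 32 hex digits plus 4 hyphens. A quick check using Python's standard uuid module (not CakePHP's own generator, which may produce ids with a different internal layout):

```python
import uuid

key = uuid.uuid4()
text = str(key)

print(text)
print(len(text))                   # 36: 32 hex digits plus 4 hyphens
print(len(text.replace("-", "")))  # 32
print(len(key.bytes))              # 16 raw bytes
```

The same value fits into 16 bytes in raw form, which is worth knowing if storage size matters, though the automatic key generation described here relies on the 36-character text column.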
HTH,
-- Felix Geisendörfer aka the_undefined
Very interesting post. I agree that the uniqueness of UUIDs may be useful in some situations. But for non-massive tasks like blogging and most web apps I still prefer other indexing methods.
Why? Mainly because of URL beauty. I prefer a clean URL like "/posts/why-uuids" to "/posts/why-uuids:48c906cc-7a6c-4f22-9e20-6ffd4834cda3". This is not merely an aesthetic reason. Sending URLs over 80 characters long through many mail user agents breaks links even nowadays.
I also care about search engine optimization. First, "/posts/why-uuids:48c906cc-7a6c-4f22-9e20-6ffd4834cda3" and "/posts/hello-world:48c906cc-7a6c-4f22-9e20-6ffd4834cda3" are duplicate content for a search engine (unless you add new logic to handle redirects). Second, the hash may look like a parameter (id, hash) to search engines, which may affect ranking.
I have seen and used the /posts/view/123 pattern many times, but to me it offers no advantage over the /post/view/my-post-title-here approach. I just use the sluggable behaviour and set Model.slug as a key index in my DB schema. Then I use $model->findBySlug($this->params['slug']).
That's my opinion. Great to learn about UUIDs, but currently they offer me more cons than pros.
PS: I love your blog! (and this is my first comment on it)
I agree with Jaime: for many clients, SEO is a significant concern and the URL is one of the more important parts of a page... It must be unique, but also contain key terms specific to the content/focus of that page. Also, sometimes I like to say "hey, take a look at post #1234" and with UUIDs I wouldn't be able to do that; I'd have to reference by name or UUID.
So - for public facing URLs, slugs it is... for existing projects, ID remains...
I may still use UUIDs in a future project, for situations where no human ever has to read the ID number.
Great explanation though - thanks much!
I am with Jaime on this one. While I can see the need for them in the backend (however you choose to implement it), I just don't see the point in having them in the URLs. It's much easier and cleaner to use a permalink/slug (incrementing with duplicates) than it is to have URLs with the UUID appended to the end. I would focus more on the SEO + clean URL aspect before I would put that in the URL.
So, not disagreeing with you - I see the value, I just don't see the need for URLs like these.
I'm in the UUID camp, myself, and have been for a few years. Felix offered the polymorphic justification, but extrapolate that to portability in general. Integers (the traditional unique identifier) are simply too predictable. Good for counting change, but really, really crappy when you need to move data around.
I can't tell you how many times I've had to tell a client that I can't partially migrate their data into another database because either ID conflicts will occur or ambiguous data will result. It's not an uncommon business case in large operations.
Jaime Gómez Obregón: Our urls will redirect if you supply a wrong slug - so no duplicate content issue here.
As far as SEO goes: The UUIDs in our urls should not have any effect on that whatsoever. I can't verify that for sure, but afaik as long as your actual slug content is before the UUID stuff google will be plenty happy.
Depending on your SEO / email marketing / whatever needs you may choose to not use UUIDs or hide their internal usage. But from a data architecture point of view I think they are an amazingly helpful concept and if you can use them the rewards will be great ; ).
uuid is bad for mysql.
http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/
@laowang You beat me to it :)
If anyone is depending solely on the randomness of UUIDs for security, you are on a downhill road to hell anyway - In other words security by obscurity is just *not*
But the real strength of uuid is indeed revealed when you have to move data around
@Corey - try a cluster of memcache servers and remember NFS sux big time on reliability
@laowang - in defence of UUIDs the performance issue is primarily with the maintenance of indexes during bulk inserts, and to a lesser extent with compound keys. There might be a negligible impact on index reads, but not so as you'd notice. If your datasets are large enough to be affected, you probably have the experience in the team (and the hardware) to work around the issues.
This would apply to virtually any RDBMS, not just MySQL. Most indexes 'chunk' the data (btree for example: http://en.wikipedia.org/wiki/Btree ), so when a new key is added, the tree must be re-computed. For integer based indexes, the key value is normally just incremented, so the tree does not have to be completely restructured for each insert. UUIDs on the other hand, frequently force significant changes to the index structure (which is already larger anyway) - hence the performance impact.
There is an easy workaround (the oldest trick in the book) for bulk inserts: disable the index and re-enable afterwards, thereby recreating the index only once. With UUIDs of course, you don't have to worry about enforcing a unique constraint during the insert. Massive numbers of individual inserts might be a trickier problem but probably not as tricky as managing the bandwidth required to feed those inserts :)
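The difference in insert pattern can be illustrated with a rough Python sketch. A plain sorted list stands in for the index here; it is not a B-tree and says nothing about real database timings, and the function name is made up. It only counts how often a new key lands at the very end of the sorted order.

```python
import bisect
import uuid


def count_tail_inserts(keys):
    """Insert keys into a sorted list; count how many landed at the very end."""
    index, tail = [], 0
    for key in keys:
        pos = bisect.bisect_left(index, key)
        if pos == len(index):
            tail += 1
        index.insert(pos, key)
    return tail


N = 10_000
sequential = count_tail_inserts(range(1, N + 1))
random_ids = count_tail_inserts(str(uuid.uuid4()) for _ in range(N))

print(f"auto-increment: {sequential} of {N} inserts appended at the end")
print(f"random UUIDs:   {random_ids} of {N} inserts appended at the end")
```

Auto-incrementing keys always append, so only the tail of the structure changes. Random UUIDs land almost anywhere, which is the part that forces a real index to keep reshuffling.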
Tarique Sani: Using unpredictable ids is not security by obscurity but security by design. I think so because even if someone had the complete source code of an app using UUIDs along with the db schema, he'd still be unable to guess record ids.
UUIDs are also quite nice if you have to set up a "multi-master" database setup. That way you don't have to do complicated things with an auto_increment ID field. But there are cases where it is not reasonable to use UUIDs. That is the case for key-value tables where the values are always shorter than the UUID (and are also more or less static). (common examples: producer names, role names, ...)
I think UUID keys are especially interesting for (large) web-applications and integer auto_increment keys simply suit the needs of a "traditional" web-site.
Hey Felix, you've sold me on using UUIDs for the upcoming project I'm working on. Does CakePHP have something to manage or automatically generate UUIDs? Also, what type of field do you use in MySQL to store these beasts? VARCHAR(36)? or binary?
Thanks,
Lane
I've just recently fallen in love with UUIDs, and I'm not completely sure why. This article goes some way towards justification for using them. I do like the complete randomness, and for a larger project I have coming up UUIDs do provide a level of protection (not security) against people traversing all of the records with predictable integer primary keys.
I intend on using UUIDs for users||profiles primary keys because I do not want it to be easy to trawl through a list of users and scrape profile or contact information, to give away the total number of site members, or allow members to rank themselves on being the newest and having the lowest UID (ala slashdot). It's a great leveller.
@Lane -
Yes, Cake can handle that. Just name your field "id" and type it as a char(36). varchar(36) may also work, but it will always be 36 chars so there's no need to eat up that extra byte on every record.
Hi Felix,
Can you explain this in detail:
"But instead of creating all the foreign keys for that, you get away with just creating a single foreign_id field and setting that up as the foreign key for all the belongsTo associations."
So far I used UUID in association same way as I would with INT keys..
Thanks,
Andras
Andras: Oh that was a little blurry indeed. Changed it to:
"But instead of creating a foreign key field for each model, you get away with just creating a single foreign_id field and setting that up as the foreign key for all the belongsTo associations."
Thanks.
Great post. We had been 'discussing' whether or not to use UUIDs on the way home from the workshop. I think this really cleared up the pros of using them.
If you need material for a new post I have a suggestion:
When using CakePHP in a High Availability & Failover environment (specifically using Heartbeat & ldirector), and using an NFS as the backend where the content is housed for all the head units, should the tmp directory within cake be stored by the NFS server also and allow the cache to be accessed by all servers? Or should each head unit have its own tmp directory and hence its own cache?
Thanks,
Corey