Muscles on demand - Clean a large git repository the cloud way » Debuggable

Hey folks,

don't you hate it when you have to stop your work because your dev machine is ultra-busy with some CPU- or I/O-heavy operation that will take hours?

Even though it doesn't happen to me a lot, I actually ran into such a case last night while trying to fix the Git repository of a project we are working on. The repository itself was not corrupted, but it had become so fat that git-index-pack would explode on many of the team members. How did that happen? Well, it turns out that over time some of the image directories of the project were committed into the repository by accident. This ended up being an insane 1.7 GB of '.git' objects.

With SVN, this is when you realize you made a poor choice in version control software and it is time to start the repository over - losing all history.

Not so much with Git. Git has an excellent tool called git-filter-branch that you can use to rewrite an entire repository's history.

In our case we wanted to pretend app/webroot/files had never existed:

git filter-branch --index-filter 'git rm -r --cached --ignore-unmatch app/webroot/files' -- --all
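An index filter like this runs once for every commit, and a plain `git rm` exits with an error on any commit where the path does not exist, which aborts the rewrite. The git-filter-branch documentation pairs `git rm --cached` with `--ignore-unmatch` for that reason. Here is a minimal sketch on a throwaway repository; the demo identity, file names and commit messages are made up for illustration, and `FILTER_BRANCH_SQUELCH_WARNING` only silences the deprecation notice that newer Git versions print.

```shell
#!/bin/sh
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1

# Throwaway repository (hypothetical contents, not the real project).
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

# First commit: the directory does not exist yet.
echo "code" > app.txt
git add app.txt
git commit -q -m "add code"

# Second commit: the accidental upload directory sneaks in.
mkdir -p app/webroot/files
echo "pretend this is a huge image" > app/webroot/files/big.bin
git add app/webroot/files
git commit -q -m "accidentally add uploads"

# Rewrite every commit on every ref. --ignore-unmatch keeps git rm from
# failing on the first commit, where the path is missing.
git filter-branch -f --index-filter \
  'git rm -r -q --cached --ignore-unmatch app/webroot/files' -- --all

# Count commits on local branches that still touch the path.
remaining=$(git log --branches --oneline -- app/webroot/files | wc -l | tr -d ' ')
echo "commits still touching app/webroot/files: $remaining"
```

The check uses `--branches` rather than `--all` on purpose: filter-branch keeps the pre-rewrite history under `refs/original/`, and `--all` would include it.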

However, our repository is full of commits (3.5k) and as I mentioned has some incredibly huge and ugly blobs in it. This means the operation above is slow like crazy.

So instead of having my poor laptop tortured with it for an hour, I decided to hire some muscle. I knew the operation I planned to do was fairly CPU and I/O heavy. For that reason I fired up a c1.xlarge EC2 instance featuring 8 cores and 7 GB RAM.

Launching one of Eric Hammond's excellent Ubuntu EC2 AMIs and installing git took < 5 minutes. Add a few minutes to transfer the 1.7GB repository over and I was ready to go.

Having recently read Kevin's excellent article on tmpfs, I just put the repository in /dev/shm. This simply meant that the repository was now fully stored in memory - 30x faster than HDD!
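A rough sketch of that step, assuming a Linux box where /dev/shm is a tmpfs mount: check that the repository fits into the free space there, then copy it over. The stand-in repository and the `repo-clean` directory name are hypothetical, and the fallback to the regular temp directory is only there so the script also runs on machines without /dev/shm.

```shell
#!/bin/sh
set -eu

# Stand-in for the real repository (hypothetical path).
src=$(mktemp -d)
git init -q "$src/repo"

# Prefer the RAM-backed /dev/shm; fall back to the normal temp dir elsewhere.
ramdir=/dev/shm
if [ ! -d "$ramdir" ] || [ ! -w "$ramdir" ]; then
  ramdir=${TMPDIR:-/tmp}
fi

# Compare the repository size with the free space, both in kilobytes.
need_kb=$(du -sk "$src/repo" | awk '{print $1}')
free_kb=$(df -Pk "$ramdir" | awk 'NR==2 {print $4}')

if [ "$free_kb" -le "$need_kb" ]; then
  echo "not enough room in $ramdir for the repository" >&2
  exit 1
fi

# Copy the repository into memory and work on the copy.
work=$(mktemp -d "$ramdir/repo-clean.XXXXXX")
cp -R "$src/repo" "$work/repo"
echo "working copy: $work/repo"
```

Keep in mind that everything in /dev/shm is gone after a reboot, so the result has to be copied or pushed somewhere durable. git filter-branch also has a `-d` option that moves only its temporary directory, which can be pointed at tmpfs instead of relocating the whole repository.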

Even with all this power, the whole process still took 15 minutes to complete, but the result was impressive. Instead of 1.7GB the repository was shrunk down to 80MB and little angels were dancing & singing around it. It was beautiful : ).
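Note that filter-branch on its own does not free any disk space: the old objects stay reachable through the backup refs under `refs/original/` and through the reflog. The post does not spell out how the 80MB figure was reached, so the following is only a sketch of the cleanup described in the git-filter-branch documentation, run on a throwaway repository with made-up file names.

```shell
#!/bin/sh
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1

# Throwaway repository with one large random blob (hypothetical contents).
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

echo "code" > app.txt
git add app.txt
git commit -q -m "add code"

mkdir -p app/webroot/files
head -c 1048576 /dev/urandom > app/webroot/files/big.bin
git add app/webroot/files
git commit -q -m "accidentally add uploads"

# Remember the blob's object id so we can check for it later.
blob=$(git rev-parse HEAD:app/webroot/files/big.bin)

git filter-branch -f --index-filter \
  'git rm -r -q --cached --ignore-unmatch app/webroot/files' -- --all

# 1. Drop the backup refs that filter-branch created.
git for-each-ref --format='%(refname)' refs/original/ |
while read -r ref; do
  git update-ref -d "$ref"
done

# 2. Expire all reflog entries, then 3. prune unreachable objects right away.
git reflog expire --expire=now --all
git gc -q --prune=now

if git cat-file -e "$blob" 2>/dev/null; then
  blob_state=present
else
  blob_state=gone
fi
echo "large blob after cleanup: $blob_state"
```

A fresh clone of the rewritten repository (for example via a `file://` URL) is another way to end up with only the objects that are still reachable.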

I pushed the lean and mean clone up to GitHub using 'git push -f', and then switching over the local clones of each team member was just a matter of:

git checkout -b backup-master
git branch -D master
git fetch
git checkout -b master origin/master

Of course the new master branch and its backup wouldn't be very nice to each other as far as merges are concerned, but cherry-picking the most recent commits worked great.
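To make the switch-over and the cherry-picking concrete, here is a self-contained sketch that plays all roles locally. The `author`, `origin.git` and `teammate` directories, the identities and the commit messages are all hypothetical, and a local bare repository stands in for the hosted one.

```shell
#!/bin/sh
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1

demo=$(mktemp -d)
cd "$demo"

# The author's repository, with the accidental uploads in its history.
git init -q author
cd author
git symbolic-ref HEAD refs/heads/master
git config user.email "author@example.com"
git config user.name "Author"
echo "code" > app.txt
mkdir -p app/webroot/files
echo "pretend this is a huge image" > app/webroot/files/big.bin
git add .
git commit -q -m "code plus accidental uploads"
cd ..

# A bare repository plays the shared remote; the teammate clones it.
git clone -q --bare author origin.git
git clone -q origin.git teammate

# The teammate has one local commit that was never pushed.
cd teammate
git config user.email "teammate@example.com"
git config user.name "Teammate"
echo "local fix" >> app.txt
git commit -q -am "local fix not yet pushed"
old=$(git rev-parse master)
cd ..

# The author rewrites history and force-pushes the result.
cd author
git remote add origin ../origin.git
git filter-branch -f --index-filter \
  'git rm -r -q --cached --ignore-unmatch app/webroot/files' -- --all
git push -q -f origin master
cd ..

# The teammate switches over with the four commands from the post ...
cd teammate
git checkout -q -b backup-master
git branch -q -D master
git fetch -q
git checkout -q -b master origin/master

# ... and carries the unpushed commit over from the backup branch.
git cherry-pick backup-master > /dev/null
git log --oneline master
```

The backup branch still points at the old history, so nothing is lost if a cherry-pick goes wrong; it can be deleted once everything has been carried over.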

As you can see, cloud computing is not only for the application level, but can also be a great tool for your development process as a whole. After all, it is incredible what can be accomplished for $0.80 this way ; ).

-- Felix Geisendörfer aka the_undefined

Nate Abele said on Mar 13, 2009:

I have to say, that's pretty clever. Now write me a script that does all that for me, so I can just pipe commands to it.

Francis Gulotta said on Mar 16, 2009:

The only part you lost me on was the "couple of minutes" to transfer over a 1.7G repository. =)

Felix Geisendörfer said on Mar 16, 2009:

Francis Gulotta: I transferred it directly from one server to another ; ).

KKovacs said on Mar 31, 2009:

This is the coolest thing I read in this month. Thanks! You made me want to look into EC2 again. :-)
