Crawl Google, they do the same to you ; )

Posted by Felix Geisendörfer, on Jun 10, 2008 - in PHP & CakePHP » Other

Hey folks,

Marc Grabanski just had the great idea of using Google to help with migrating your site to a new domain or URL schema. Get a list of all the pages Google has indexed for your site, then use it as the basis for checking whether your migration worked. This is very convenient because you do not have to know all of your own URLs, and you only get the relevant ones (if they are not in Google, they are unlikely to get traffic).

So here is some quick CakePHP code for crawling Google instead of being crawled by them:

php
class GoogleIndexShell extends Shell {
  function main() {
    App::import('HttpSocket');
    // The site to look up is the first console argument, e.g. debuggable.com
    list($site) = $this->args;
    $Socket = new HttpSocket();
    $links = array();

    // Page through the results, $num at a time
    $start = 0;
    $num = 100;
    do {
      $r = $Socket->get('http://www.google.com/search', array(
        'hl' => 'en',
        'as_sitesearch' => $site,
        'num' => $num,
        'filter' => 0,
        'start' => $start,
      ));
      // Result links are the anchors carrying class "l"
      if (!preg_match_all('/href="([^"]+)" class="?l"?/is', $r, $matches)) {
        $this->out('Error: Could not parse google results');
        exit(1);
      }
      $links = array_merge($links, $matches[1]);
      $start = $start + $num;
    // A page with fewer than $num links was the last one
    } while (count($matches[1]) >= $num);

    $links = array_unique($links);
    $this->out(sprintf('-> Found %d links on google:', count($links)));
    $this->hr();
    $this->out(join("\n", $links));
  }
}

Usage is as simple as running:

./cake google_index debuggable.com

Which should produce output like this:

Welcome to CakePHP v1.2.0.7125 beta Console
---------------------------------------------------------------
App : app
Path: /Users/felix/dev/www/php5/debuggable/app
---------------------------------------------------------------
-> Found 293 links on google:
---------------------------------------------------------------
http://debuggable.com/
http://debuggable.com/contracting
http://debuggable.com/contact
http://debuggable.com/workshops
http://debuggable.com/open-source/fixtures-shell
http://debuggable.com/open-source/google-analytics-api
http://debuggable.com/posts/thinking-what:480f4dd5-5f1c-4d37-99b0-4768cbdd56cb
http://debuggable.com/posts/jquerycamp07:480f4dd6-8d40-44e1-8551-4a58cbdd56cb
...
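
Once you have that list, the migration check itself could look roughly like the sketch below. This is not part of the shell above, just plain PHP under a few assumptions: the new domain `example.org` is made up, the function names `rewriteHost`, `findBrokenUrls` and `fetchStatus` are hypothetical, every indexed URL is expected to have a scheme, and a page counts as migrated when it ends in a 200 after redirects.

```php
<?php
// Hypothetical helper: swap the host of an indexed URL for the new domain.
// Assumes $url is absolute (has a scheme), as the URLs in the list above are.
function rewriteHost($url, $newHost) {
    $parts = parse_url($url);
    $path  = isset($parts['path']) ? $parts['path'] : '/';
    $query = isset($parts['query']) ? '?' . $parts['query'] : '';
    return $parts['scheme'] . '://' . $newHost . $path . $query;
}

// Returns an array of target URL => status for everything that is not a 200.
// $fetchStatus is any callable that takes a URL and returns a status code.
function findBrokenUrls(array $urls, $newHost, $fetchStatus) {
    $broken = array();
    foreach ($urls as $url) {
        $target = rewriteHost($url, $newHost);
        $status = call_user_func($fetchStatus, $target);
        if ($status !== 200) {
            $broken[$target] = $status;
        }
    }
    return $broken;
}

// One possible status fetcher built on get_headers(), which follows redirects.
function fetchStatus($url) {
    $headers = @get_headers($url);
    if ($headers === false) {
        return 0; // the request failed entirely
    }
    $status = 0;
    foreach ($headers as $line) {
        // With redirects there are several status lines; keep the last one
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}
```

Calling `findBrokenUrls($links, 'example.org', 'fetchStatus')` would then give you the pages that still need attention. Passing the fetcher in as a callable keeps the loop easy to try out with a fake one before you hit a real server.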

Oh and if you want to see more shell sample code, also check out our FixtureShell and the blog post for it.

-- Felix Geisendörfer aka the_undefined

PS: Please note that this is a quick hack, and any non-trivial change in the markup Google uses will break it. This is only meant for temporary use.
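
To make the fragility concrete, here is a small sketch that runs the same pattern over a made-up HTML fragment. The fragment only imitates the kind of markup the pattern expects and is an assumption, not real Google output; `extractResultLinks` is a hypothetical helper name.

```php
<?php
// Same pattern as in the shell: an href immediately followed by class "l"
$pattern = '/href="([^"]+)" class="?l"?/is';

// Made-up fragment (an assumption): two result links and one unrelated link
$html = '<h3><a href="http://debuggable.com/contact" class=l>Contact</a></h3>'
      . '<a href="http://www.google.com/preferences" class="gb">Settings</a>'
      . '<h3><a href="http://debuggable.com/workshops" class="l">Workshops</a></h3>';

// Returns the captured href values, or an empty array if nothing matched
function extractResultLinks($html, $pattern) {
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}

// Picks up the two class "l" links and skips the "gb" one
$found = extractResultLinks($html, $pattern);

// Merely swapping the attribute order is enough to make the pattern miss
$missed = extractResultLinks('<a class="l" href="http://debuggable.com/">Home</a>', $pattern);
```

Here `$found` holds the two debuggable.com links and `$missed` is empty, which is exactly the case where the shell bails out with its parse error.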


6 Comments

Kim Biesbjerg on Jun 10, 2008:

Cool! Just as I'm about to finish a Cake website that is going to replace an old one! Any tips for redirecting these URLs to the new ones? I usually do it in .htaccess.

Khaled on Jun 10, 2008:

Great post ... I didn't expect it to be that simple ... thanks Felix

Felix Geisendörfer on Jun 11, 2008:

Kim: mod_rewrite is a good choice if you don't have a lot of URLs (< 1000?). For everything else I would catch CakePHP's error404 using an AppError handler and then check a table called legacy_urls for the correct mapping.
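
A rough sketch of the lookup part of that idea, in plain PHP rather than actual CakePHP code: the array stands in for rows of the legacy_urls table, the paths in it are made up, and `findLegacyRedirect` is a hypothetical name.

```php
<?php
// Stand-in for the legacy_urls table: old path => new path (made-up rows)
$legacyUrls = array(
    '/old-about.html' => '/about',
    '/old/contact'    => '/contact',
);

// Returns the new path for an old request URI, or null if there is no mapping
function findLegacyRedirect($requestUri, array $legacyUrls) {
    // Drop any query string before looking up the path
    $parts = explode('?', $requestUri, 2);
    $path = $parts[0];
    // Treat "/old/contact/" and "/old/contact" as the same page
    if ($path !== '/') {
        $path = rtrim($path, '/');
    }
    return isset($legacyUrls[$path]) ? $legacyUrls[$path] : null;
}
```

Inside the 404 handler you would call this with the requested URI, send a 301 via `header('Location: ' . $target, true, 301)` when it returns a path, and fall through to the normal 404 page when it returns null.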

Marc on Jun 11, 2008:

This worked beautifully by the way. Thanks a lot.

Kim Biesbjerg on Jun 13, 2008:

Felix: Right, that might actually make a useful plugin for my CMS. Fill the database with the indexed URLs and have a user interface where you can map each indexed URL to the corresponding content on the new site. Great! Thanks for the code snippet - really enjoy your posts!

Jean on Jul 01, 2008:

Cool indeed.
