Crawl Google, they do the same to you ; )
Posted on 10/6/08 by Felix Geisendörfer
Hey folks,
Marc Grabanski just had the great idea of using Google to help with migrating your site to a new domain / URL schema. Just get a list of all the pages Google has indexed from your site, then use that as the basis for checking whether your migration worked. This is very convenient because you do not have to know all of your own URLs, and you'll only get the relevant ones (if they are not in Google, they are unlikely to get traffic).
So here is some quick code for crawling Google instead of being crawled by them in CakePHP:
class GoogleIndexShell extends Shell {
    function main() {
        App::import('HttpSocket');
        // The site to look up is the first shell argument, e.g. debuggable.com
        list($site) = $this->args;
        $Socket = new HttpSocket();
        $links = array();
        $start = 0;  // offset of the first result on the current page
        $num = 100;  // results requested per page
        do {
            $r = $Socket->get('http://www.google.com/search', array(
                'hl' => 'en',
                'as_sitesearch' => $site,
                'num' => $num,
                'filter' => 0,
                'start' => $start,
            ));
            // Result links are the anchors carrying class "l" (quoted or unquoted)
            if (!preg_match_all('/href="([^"]+)" class="?l"?/is', $r, $matches)) {
                die($this->out('Error: Could not parse google results'));
            }
            $links = array_merge($links, $matches[1]);
            $start = $start + $num;
            // A page with fewer than $num links is taken to be the last one
        } while (count($matches[1]) >= $num);
        $links = array_unique($links);
        $this->out(sprintf('-> Found %d links on google:', count($links)));
        $this->hr();
        $this->out(join("\n", $links));
    }
}
Usage is as simple as running:
./cake google_index debuggable.com
Which should produce output like this:
Welcome to CakePHP v1.2.0.7125 beta Console
---------------------------------------------------------------
App : app
Path: /Users/felix/dev/www/php5/debuggable/app
---------------------------------------------------------------
-> Found 293 links on google:
---------------------------------------------------------------
http://debuggable.com/
http://debuggable.com/contracting
http://debuggable.com/contact
http://debuggable.com/workshops
http://debuggable.com/open-source/fixtures-shell
http://debuggable.com/open-source/google-analytics-api
http://debuggable.com/posts/thinking-what:480f4dd5-5f1c-4d37-99b0-4768cbdd56cb
http://debuggable.com/posts/jquerycamp07:480f4dd6-8d40-44e1-8551-4a58cbdd56cb
...
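Once you have that list, the actual migration check could look roughly like the sketch below. This is not part of the shell above, just plain PHP (5.4 or later, because of the callable type hint). The names findBrokenUrls and $fetchStatus are made up for illustration, and so are the status codes in the stub. In real use you would swap the stub for something that requests each URL on the new site and returns the HTTP status code.

```php
<?php
// Hypothetical helper: returns every URL whose status code is not accepted,
// keyed by URL. $fetchStatus is an assumed callable that takes a URL and
// returns an integer HTTP status code.
function findBrokenUrls(array $urls, callable $fetchStatus, array $okCodes = array(200, 301)) {
    $broken = array();
    foreach ($urls as $url) {
        $code = $fetchStatus($url);
        if (!in_array($code, $okCodes, true)) {
            $broken[$url] = $code;
        }
    }
    return $broken;
}

// Stubbed status lookup so the example runs without network access.
// The codes are invented; anything not listed counts as a 404.
$fakeStatus = function ($url) {
    $known = array(
        'http://debuggable.com/' => 200,
        'http://debuggable.com/contact' => 301,
    );
    return isset($known[$url]) ? $known[$url] : 404;
};

$broken = findBrokenUrls(
    array(
        'http://debuggable.com/',
        'http://debuggable.com/contact',
        'http://debuggable.com/workshops',
    ),
    $fakeStatus
);
print_r($broken);
```

Whether a 301 counts as "worked" depends on your migration. The $okCodes parameter is there so you can decide that yourself.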
Oh, and if you want to see more shell sample code, also check out our FixtureShell and the blog post for it.
-- Felix Geisendörfer aka the_undefined
PS: Please note that this is a quick hack, and any non-trivial change in the markup Google uses will break it. This is only meant for temporary usage.
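To make the PS a bit more concrete, here is a small sketch that pulls the link extraction out into a plain function and runs it on two snippets. The function name extractResultLinks and both HTML snippets are made up for illustration; they are not real Google output. The pattern is the same one the shell uses: it only matches when class "l" directly follows the href attribute.

```php
<?php
// Hypothetical helper wrapping the same pattern the shell uses.
// Returns the unique hrefs of anchors where class "l" follows the href.
function extractResultLinks($html) {
    if (!preg_match_all('/href="([^"]+)" class="?l"?/is', $html, $matches)) {
        return array();
    }
    return array_values(array_unique($matches[1]));
}

// Made-up markup in the shape the pattern expects (duplicate on purpose).
$sample = '<a href="http://debuggable.com/" class=l>Home</a>'
        . '<a href="http://debuggable.com/contact" class="l">Contact</a>'
        . '<a href="http://debuggable.com/" class=l>Home again</a>';

// Made-up markup after an imagined change: different class, attributes swapped.
$changed = '<a class="result" href="http://debuggable.com/">Home</a>';

print_r(extractResultLinks($sample));   // two unique links
print_r(extractResultLinks($changed));  // empty array, nothing matches anymore
```

So even a reordering of the attributes is enough to make the shell bail out with its parse error.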
Great post ... I didn't expect to be simple like that ... thanks Felix
Kim: mod_rewrite is a good choice if you don't have a lot of URLs (< 1000?). For everything else I would catch CakePHP's error404 using an AppError handler and then check a table called legacy_urls for the correct mapping.
This worked beautifully by the way. Thanks a lot.
Felix: Right, might actually make a useful plugin for my CMS. Fill the database with indexed urls and have a user interface where you can map the indexed url to the corresponding content on the new site. Great! Thanks for the code snippet - Really enjoy your posts!
Cool indeed.
Cool! Just as I'm about to finish a cake website that is going to replace an old one! Any tips for redirecting these urls to the new ones? I usually do it in htaccess.