debuggable

 
Contact Us
 

How I Turned A Slow Array Sort Into A Quick One Using The Quicksort Algorithmn

Posted on 31/5/07 by Tim Koschützki

Today in the morning I had a very unusual programming job to do - or at least what is for me rather unusual. I was confronted with the question whether it is easy to implement the quicksort algorithmn to sort an array of arrays based on a key in the second-dimension of the array. Join me to find out.

As you probably know from my About me page, I am currently working for LinkLift as a web programmer. The company is the largest German Textlink-Marketplace and soon we will expand to other countries within Europe. When you click on one of the link categories on our homepage, you are presented with (unfortunately German) website from which you can buy textlinks.

(Note: We are using the term “adspace” here for purchasable textlinks on a publisher's website.)

Using the links in the table header, you can sort the adspaces by title, available adspaces, page rank and current value.

Here is an array of adspaces, which we are willing to sort. The data is of course changed to preserve publisher anonymity.

[code]
Array
(
[0] => Array
(
[idv11] => 624059d1f8N0
[url] => http://example0.com,
[title] => Some title0,
[current_pagerank] => 7,
[adspots] => 2,
[current_available] => 1,
[links] => 33,
[links_external] => 5,
)
[1] => Array
(
[idv11] => 67059d1f8N0
[url] => http://example1.com,
[title] => Some title1,
[current_pagerank] => 4,
[adspots] => 4,
[current_available] => 3,
[links] => 132,
[links_external] => 9,
)
[2] => Array
(
[idv11] => 6759d1f8N0
[url] => http://example2.com,
[title] => Some title2,
[current_pagerank] => 2,
[adspots] => 8,
[current_available] => 6,
[links] => 12,
[links_external] => 3,
)
)
[/code]

The idv11 is the primary key. We decided to generate a random number for each new adspace so that our customers cannot guess how many adspaces we have in the marketplace by simply looking at their id. ;) The rest of the information should be straightforward - adspots is the number of textlinks the publisher offers on his site, current_available is the number of textlinks that can be purchased still, links is the number of links on the page where our textlinks will appear and links external is the number of those links that go to an external site.

The old sort algorithmn was rather slow and thus it was my task to optimize it. I tend to be used for optimisation tasks quite often these days.. Okay, so let's look at the old sort algorithmn:

function sortAdspacesAndUnify($adspaceArray, $sortBy)
  { 
    $orderBy = 'asc';
    if($sortBy < 0 && $sortBy != -1) $orderBy = 'desc';
    $sortUsBy = determineAdspaceSortBy($sortBy);
    if($sortBy == -1) $sortUsBy = 'revenue';
   
   
    $orderedArray = array()
    $usedIdv11s = array()
    $numResultArray = count($adspaceArray);
   
    while ($numResultArray > 0) {
      $latestKey = 0;
      $highestValue = 0;
      foreach($adspaceArray as $key => $adspace) {   
        if($sortUsBy == 'revenue')
          $value = LL_Admin::calcAdspaceConversion($adspace)
        else
          $value = $adspace[$sortUsBy];
                     
        if ($value >= $highestValue) {
          $highestValue = $value;
          $latestKey = $key;
        }
      }
     
      if (! empty($adspaceArray[$latestKey])
        && !in_array($adspaceArray[$latestKey]['idv11'],$usedIdv11s))
      {
        $orderedArray[] = $adspaceArray[$latestKey];
        $usedIdv11s[] = $adspaceArray[$latestKey]['idv11'];
      }
      unset($adspaceArray[$latestKey]);
     
      $numResultArray--;
    }
     
    if($orderBy == 'desc')
      $orderedArray = array_reverse($orderedArray);
             
    return $orderedArray;
  }

The function might seem a bit overwhelming to you at first, but in fact it's all quite easy. Let's explore the code a bit.

What the function receives is an array of adspaces. Each adspace itself is an array again to hold its information as presented earlier. The second parameter is an integer determining what to sort by. The function call $sortUsBy = determineAdspaceSortBy($sortBy); maps that integer to one of the following strings: "current_available", "link_value", "title" or "current_pagerank". Unfortunately, I cannot show you all of the code, but the person above me will not allow me to do so. It is not a problem, though, in my opinion.

If the $sortBy integer is negative we will sort in descending order. If it is -1, we will sort by adspace "revenue" – the default sortation for each category. Sorting by revenue basically means sorting by the product of current_value * (adspots – current_available). As you can see, revenue in this case means the amount of revenue an adspace has generated for the company already.

Next we declare two arrays - orderedArray, which will be our result array of sorted adspaces, and usedIdv11s, which allows us to track duplicated publishers just in case our input array contained some publishers more than once.

Now the sortation begins. What my colleagues were doing is simply iterating over the array and for each iteration iterate over all existing publishers again, determine the publisher with the highest value to sort by, track it in the ordered array and remove it from the publisher array.

What that means is the number of iterations is n + (n-1) + (n-2) + ... + 1 = (n2 + n)/2, where n is the number of adspaces of the input array. The average complexity class is therefore n2 and thus the the performance gets a lot worse with increasing numbers of adspaces. By the way, note that the $numResultArray variable gets decreased with every iteration. If we did the outer iteration via

while (count($adspaceArray) > 0) {

..we would have had another real performance bottleneck. Check my article Optimising For-Loops for more information on that one.

Okay, so now we are basically where I had been at 9:00am this morning. What I did at this point is turn to Wikipedia and searched for Quicksort, one of the fastest sorting algorithmns ever created. I simply remembered quicksort as being a fast sort algorithmn from my past computer science lessons at school. :) Quicksort is rather simple. It determines a random pivot element, then sorts the array so that all elements greater than the pivot are right from it and all which are less than the pivot's value are left to it. After such a sortation the pivot is at the correct place already and we have to sort the two sub-lists via the same approach.

Quicksort does that recursively and it is extremely fast, check the complexity and time analysis at the end of the wikipedia article there. I thought it will be a lot faster than the old publisher sortation algorithmn and off I went implementing quicksort. :)

What I turned up with is the following:

/**
     * Sorts an adspace array based on the quicksort algorithmn
     *
     * @author tim
     * @param array array of adspaces - note that each has to have an 'idv11' key
     * @param string value to sort by - each adspace element has to have such a key
     * @return array sorted array of adspaces
     */

  function quicksortAdspaces($q = array(), $sortUsBy, $level = 0, $orderBy='desc')
  {  
      $greater = array();
      $less = array();
      $pivotList = array();
     
      if (count($q) < = 1)
        return $q;
     
      $pivot = $q[0];
      foreach ($q as $adspace) {
        $value = '';
        $compareValue = '';
        if ($sortUsBy == 'revenue')
        {
        $value = LL_Admin::calcAdspaceConversion($adspace);
        $compareValue = LL_Admin::calcAdspaceConversion($pivot)
        } else {
        $value = $adspace[$sortUsBy];
        $compareValue = $pivot[$sortUsBy]
      }
       
        if ($value < $compareValue) $less[] = $adspace;
          if ($value == $compareValue) $pivotList[] = $adspace;
          if ($value > $compareValue) $greater[] = $adspace;
      }
      $return = array_merge(quicksortAdspaces($less,$sortUsBy, $level+1), $pivotList, quicksortAdspaces($greater,$sortUsBy, $level+1));
     
      if ($level == 0) {
        if($orderBy == 'desc')  $return = array_reverse($return);
        $return = cleanUpDoubleAdspaces($return);
      }
      return $return;
     }

Should not be difficult to understand. We go through all adspaces, determine a pivot by simply using $pivot = $q[0], which takes the first element of the array. Then we determine a value to sort by - either we sort by adspace revenue, which uses our revenue formula or we sort by one of the adspace's keys (current_pagerank, link_value, current_available or title). If the recursion level is zero, we clean up duplicate adspaces and reverse the array if we are to sort in descending order.

The result? The sortation now is lightning fast with an average complexity class of n * log2n, which is a lot better than n2.

Hope you liked my little story from today's morning. Feel free to modify and use this code. :)


 
&nsbp;

You can skip to the end and add a comment.

bill  said on Jun 08, 2007:

Why not use array_multisort ?

http://us2.php.net/manual/en/function.array-multisort.php

i think this is the function I used to sort an array of arrays by one of the keys in the arrays.

[...] the PHP-Coding-Practices site, there’s a sort of case study posted showing how the author (Tim Koschuetzki) took a chunk of code that was slow at sorting an array and [...]

Mgccl said on Jun 08, 2007:

I agree with bill...
The sort system inside PHP knows what sort is best..

it's even faster if you build another array to sort a individual thing:

$array[67059d1f8N0] = $stuff[$i][adspots];

BTW, quicksort can run in n^2 time...
when you are sorting an array with below 20 keys.. insert sort is good... so you could use quicksort with insert sort.

Usually, when problems like this occur, you want the sorting to be done by the database instead of PHP. Especially there is a huge amount of data. I'm really not sure why is your company ask PHP to do jobs like this.

Tim Koschuetzki said on Jun 08, 2007:

I am not sure either. Actually I have to admit that I have a looked at array_multisort for a while, but couldn't figure out a solution to the problem at hand in a reasonable amount of time (I had a deadline for the task). Maybe one of you two would like to provide some source code?

That would be great!

Mgccl said on Jun 09, 2007:

Yes, I could write one..
just one question..

Do you only need to sort for adspots? just remain the relationship between adspots and idv11?

like the array are like this:

$array[] = array('idv11' => '624059d1f8N0',

'adspots' => 2);

Tim Koschuetzki said on Jun 09, 2007:

yes, exactly.

Mgccl said on Jun 10, 2007:

Ok.. I wrote something in my blog about it..
A example using usort()

http://www.webdevlogs.com/2007/06/10/much-better-than-quicksort

Tim Koschuetzki said on Jun 11, 2007:

Good stuff man! You beat me to it. :] I was really looking at the usort function but could not figure out a way at first glance. Since I was under time constraints I decided to write it based on quicksort.

Thanks!

Tim Koschuetzki said on Jun 12, 2007:

Great stuff, this is what I could come up with. I struggled a bit to get the $sortBy variable into the adcomp function, but I got it done via $GLOBALS.

/**
   * Sorts a given adspace array based on a supplied key

   *

   * @author tim

   * @param array array of adspaces

   * @param string the array key to sort by

   * @param string the order of the sortation, either 'asc' or 'desc'

   * @return array a sorted array

   */

  function sortAdspacesAndUnify($adspaceArray, $sortBy, $orderBy)

  { 

    if ($sortBy == 'revenue') {

      foreach($adspaceArray as &$adspace) {

        $adspace['revenue'] = LL_Admin::calcAdspaceConversion($adspace);

      }

    }

    $GLOBALS['sortBy'] = $sortBy;

   

        function adcomp($a, $b) {

          $sortBy = $GLOBALS['sortBy'];

         

            if ($a[$sortBy] == $b[$sortBy]) {

                return 0;

            }

            return ($a[$sortBy] < $b[$sortBy]) ? -1 : 1;

        }

        usort($adspaceArray, "adcomp");

    if($orderBy == 'desc')

      $adspaceArray = array_reverse($adspaceArray);

             

    return cleanUpDoubleAdspaces($adspaceArray);;

  }
Todd said on Dec 06, 2007:

@Tim Koschuetzki, you could avoid using $GLOBALS by wrapping your sorting algorithms into a class then $GLOBALS['sortBy'] will instead be a private variable for that class.

This post is too old. We do not allow comments here anymore in order to fight spam. If you have real feedback or questions for the post, please contact us.