Programming challenge: find top values in unsorted array

Approach

There are three basic approaches which I think are viable for finding the maximum value of a series:

  • scan all values, keeping track of the maximum
  • quicksort the values in increasing order and return the last element
  • create a max-heap and return the root

The first and third are O(n) and the second is O(n logn).

However, when you need the n-largest values, the recommended approach is to create a max-heap and to iteratively remove the root after restoring the heap property. For example, a comparison of using priority queues versus a complete sort in python shows the former to be about 30 times faster on 2 million to 10 million data elements. [WordAligned.org] The code provided implements the algorithm using a priority queue, another name for a max-heap.

Authorship and adaptations

The priority queue code was adapted from Writing a priority queue in C# (part 1) by Julian Bucknall. [website, book] There a couple of interesting differences:

  1. Use of generics
    I did not want to box and unbox objects, so I used List<> to hold the results. To this end, I created a private struct to hold the user's data and the integer priority, and that is the template type of the List.
  2. Implementation of Dequeue()
    Because the data did not go into a fixed size array any longer, the implementation of Dequeue() had to change. I store the root, move the last element to the root, remove the storage for the last element (because it is now popped), and trickle the root down to its correct place.
  3. Implementation of bubbleUp() and trickleDown()
    Neither takes a second argument. Each stores the index element at the start of the routine and places it in its correct place at the end of the loop.
  4. ParentOf() and LeftChildOf()
    Instead of inlining the index calculation, I moved each to separate routines and gave them descriptive names.
  5. Naming boolean conditions
    Instead of inlining boolean conditions, I gave them names and assigned them to variables. It's easier to understand what is going on.
Design choices
  • Throw exceptions for invalid counts (negative numbers, counts exceeding size of data set) passed to FindTopNValues()
  • Throw exception if Dequeue-ing from empty priority queue
  • Use generics
    It seemed likely that someone would use this class for something other than integers, so I generalized it to work with any objects given a priority, at the slight cost of clarity for the test. In particular, it is not obvious at first that the priority queue uses a Pair made up of the priority value twice.
  • Use single implementation for both FindMaxValue() and FindTopNValues()
    FindMaxValue() could be written with a simple linear scan:
    int FindMaxValue(int[] anyOldOrderValues)
    {
        if (anyOldOrderValues.Length == 0)
            throw...
        int max = anyOldOrderValues[0];
        foreach (int val in anyOldOrderValues)
            if (val > max) max = val;
        return max;
    }
    But the actual runtime complexity for the priority queue is not greater and the priority queue works with templates.
  • Optimization for count of top-n elements
    It seems likely that with larger numbers of top elements selected from the data set, performance would favor just sorting the data and extracting the top n-values. I did not experiment with this, but I expect there is a crossover value that might be used. Something like:
        if (count > anyOldOrderValues.Length/2)
            ...Quicksort the data
    It's also possible that Quicksorting the data might be preferable for small data sets, in the same way that using bubblesort is acceptable on small data sets.
Limitations
  1. Optimization
    The code is not quite optimal. I did not benchmark it, given that the time was limited. I think that the construction of the heap could be faster if the heap were made at once instead of incrementally adding data elements and bubble-ing up.
  2. Data ownership
    The code takes a copy of the data instead of using the int[] array passed in. True, this would destroy the original data set, but the semantics of whether the called function owned the data was not provided. I assumed the most defensive implementation -- that the data was not owned by the called function and that it was necessary to copy the data.
Unit tests
FindMaxValue()
  • throws exception on data set of size 0 [not demonstrated]
  • finds max of 1 element
  • finds maximum of 2 elements in both permutations
  • finds maximum of arbitrarily large data sets [not demonstrated]
FindTopNValues()
  • test with various selection sizes
  • test with various permutations of the order of the test data
  • test with different test data sizes

[Download ZIP file] [PriorityQueue.cs]

1 TrackBacks

Listed below are links to blogs that reference this entry: Programming challenge: find top values in unsorted array.

TrackBack URL for this entry: http://www.iwebthereforeiam.com/mt-tb.cgi/1732

Recently, I was given a programming problem for a pre-interview job screen. Implementation Create a class which implements either the C# or Java interface Implement the two methods defined in the interface. FindMaxValue should return the single highest... Read More

Leave a comment

About this Archive

Find recent content on the main index or look in the archives to find all content.