Introduction
Generalized partition sort is the very cool, mysterious algorithm you’ve never heard of. What surprises me about this algorithm is that it’s so performant, but I haven’t seen it discussed explicitly and in the way I am about to show you. By customizing generalized partition sort with the right function, it can solve any of the following problems:
- Find the smallest k elements from a collection
- Find the largest k elements numbers in one-dimensional space
- Find the closest k elements to k’ in euclidean space
- Find the closest k to j elements from k’ in euclidean space
All of the above are reformulations of the same general problem: Find the slice of elements from a set of values ordered by their image of some function.
Here, a slice just means a contiguous subset of an ordered container. Like the slice function in python. The image of some value v on some function f is just f(v). So we are looking for a slice from the set x0, x1, x2, … where f(x0) <= f(x1) <= f(x2) <= ... <= f(xn) and we are given the input values in an unordered fashion.
Let me show you some examples:
- If that slice of values is from indices 0 to k-1, the transform is the identity function, we are looking for maximal value images, and our collection is an array of numerical values, we get the largest k numbers.
- If that range is from 0 to k-1, the transform is the absolute value of the quantity (each element subtracted from some fixed value k’), we are looking for maximal value images, and our collection is an array of numerical values, we get the furthest k numbers to k’.
- If that range is from j to k-1, the transform is the euclidean distance between each element and some fixed multi-dimensional point k’, we are looking for minimal value images, and our collection is an array of multi-dimensional points (e.g. (x,y,z)), we get the k – j closest points to k’ after the closest j-1 points. This is geometrically a set of points contained within a spherical shell concentric to k’.
Generalized partition sort requires O(n) time. This is really neat, because the most commonly discussed solution to finding some optimal set of k values from a collection of n elements is O(n log k) time. Specifically, the solution is to iterate through the collection and for each item, add it to a min-heap, and pop the minimal element from the min heap if its size is greater than k. So there are n iterations, and adding and popping the min-heap is log(k), so together we have O(n log k) operations, which is often approximated as O(n) for small k. This approximation is not valid if k is not small. For k = n/2, we are guaranteed O(n log n) performance.
So what’s partition sort again?
Partition sort is a repeated use of partitioning data in our active search area to isolate k optimal values. Each partition step narrows the search space. There are two ways to think of partition sort:
- That step in quick-sort before we recurse
- A special case of in-place bucket sort for two buckets
I’ll describe the bucket sort way of looking at things briefly. Consider the problem of partitioning an array of numerical values, like [3, -1, 6, 2], so that all negative values are on the left and all non-negative values are on the right. If we were to use extra memory, we could iterate through our collection, adding all negative values to one bucket and all non-negative values to the other bucket. Then we’d, have two collections: [-1], and [3, 6, 2]. The next step is to populate our original array with all values from the bucket with negative values and then append this array with values from the bucket of non-negative numbers. Our re-constructed collection would now be something like [-1, 3, 6, 2]. Note that when we partition, we don’t really care about the order of elements within each partition. So if our final result was [-1, 2, 3, 6], that would also be a valid partition. The partition step takes O(n) time for a collection of size n given that we need to iterate over every element.
When we finish the partition step, we know the index of the boundary between the two partitions in our active search space. Suppose the array was size k and the partitions had size L and R, respectively. Then we know the first L values have some property and the last R values have some other property. If our comparison was checking for values beyond some fixed value, then we would have access to the minimal L values and maximal R values. Suppose we were looking for the minimal L’ values. We just transformed our problem of finding the minimal L’ values from an array of size k to finding the minimal L’ values from an array of size L = k – R < k. We transformed our original problem to a smaller problem in time O(k).
So we can repeat the partition step until we find the index that separates the smallest L’ values from the rest of the collection. If we are looking for some range L” to L’, we find the minimal L’ values, then from the collection we apply the partition step to find the minimal L” values, discarding them. Then we have the range of values from L” to L’. Pretty neat.
As far as the comparison function, for each step we can choose a random value (the pivot) to construct it. Best case scenario each pivot point is the midpoint dividing our problem in half each time. The infinite sum 1 + 1/2 + 1/4 + … is equal to 2, and our time complexity here is n + n/2 + n/4 + … = 2n = O(n). Worst case our pivot point is the maximum or minimum each time for the duration of the search making the performance O(n^2). This is extremely unlikely for randomly chosen points. Going into depth regarding why the performance is O(n) from a probabilistic standpoint is beyond the scope of this post and I am going to accept the O(n) performance on average. It is likely the true performance of selecting random pivots is O(n^x) where 1 < x << 2.
So how does generalized partition sort differ from partition sort?
When we generalize something we make it more abstract and flexible. There is basically one thing we can abstract: How we choose which partition to place an element
Classic partition sort is a two-bin bucket sort. The function we use to generate value images is the identity function.
Some examples of partition functions to create a partition
In the original partition sort, we are dealing with an array of integers. The elements are compared by their value to a pivot. The partition function is f(x) { return x < pivot; } where true and false values denote the left and right partitions, respectively. Notice that we are not comparing two values in this collection to each other. The partition function is just supposed to tell us which bucket they fall into. Pretty simple. Since partition sort can be used to divide a group with minimal characteristics to another group with maximal characteristics, if we are creative we can choose a partition function to select something we are looking for. Let's consider an example.
We have an array of integers. We want to find the k closest values to some value V. Let’s make our partition function f(x) { return |x – V| < |pivot - V|; }. Note that the absolute value of the difference of two values is the distance between them. Here we are saying that a given point is closer to the target value than the pivot is, we place it in the left bucket. So each time we pick a pivot, we partition our collection into elements closer to the target than the pivot, and elements of equal distance or greater distance to the target from the pivot. That is all we need for a working partition function to get us the k closest values to our target V.
Consider the case where we have three dimensional points as elements (x,y,z). We want to find the k closest values to a thin spherical shell of radius R concentric to C. This is a bit harder but it is still possible. Let’s try f(point) = { |distance(point, C) – R| < |distance(pivot, C) - R| }. What we are doing here is computing the distance of our point from the center of the circle and comparing this distance to R to see how far away we are from the shell.
Conclusions
Generalized partition sort can solve any problem asking for a slice of k optimal elements based on their images from some function in O(n) time regardless of k. This approach is more performant than the canonical O(n log k) linear traversal while maintaining a min-heap of size k.
It is easy to use generalized partition sort. All you need to do is design your own partition comparison function which tells us if a given value image is more or less optimal than the pivot image.