Posts tagged ‘algorithms’

## An interesting algorithm

So, even before I get to what the algorithm does (which is quite interesting on its own, even before you find out how it accomplishes that), the very first intriguing part is its name. It’s officially referred to as the Blum-Floyd-Pratt-Rivest-Tarjan Algorithm. Is your mouth watering yet? I thought so.^{1} However, CLRS refers to it as the Select Algorithm (see section 9.3), so that’s what I’m going to call it. If you have an unsorted array of n items, it’s a worst-case O(n) algorithm for finding the i^{th} largest item (often the median, since that’s a commonly used statistic). Here’s how the algorithm works:

- Put all the elements of the input into groups of 5 (with one group at the very end that could have fewer than 5 elements, of course). Find the median of each group.
- Recursively use the Select algorithm to find the median of all the group-of-5 medians from step 1. Call this median of medians *M*.
- Take all the items from the input and partition them into the group of items less than *M* and the group of items bigger than *M*.
- If *M* is the i^{th} biggest, that’s the answer, and we’re done. If it’s not, you know which partition contains the i^{th} biggest item, and you can recursively call Select on that group (with a different i if the group it’s in is the group of items smaller than *M*).

To give an example of that last statement, suppose we’re looking for the 65th biggest number from a group of 100 numbers, and we find that *M* is the 53rd largest. There are 47 numbers smaller than *M*, and when we recurse, we’ll now be looking for the 12th largest element of that set of 47. If instead *M* had been the 68th largest number, in our recursive step we’d look at the 67 numbers larger than *M* and continue to look for the 65th largest one.
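The four steps above can be sketched in a few lines of Python. This is a minimal illustration of the procedure as described, not a tuned implementation (the helper structure and names are mine):

```python
def select(items, i):
    """Return the i-th largest element of items (i is 1-based)."""
    # Base case: small inputs are sorted directly.
    if len(items) <= 5:
        return sorted(items, reverse=True)[i - 1]

    # Step 1: split into groups of 5 and take each group's median.
    groups = [items[j:j + 5] for j in range(0, len(items), 5)]
    medians = [sorted(g)[len(g) // 2] for g in groups]

    # Step 2: recursively find the median of medians, M.
    m = select(medians, (len(medians) + 1) // 2)

    # Step 3: partition around M.
    larger = [x for x in items if x > m]
    smaller = [x for x in items if x < m]
    equal = len(items) - len(larger) - len(smaller)

    # Step 4: recurse into whichever partition holds the i-th largest,
    # adjusting i when we descend into the smaller-than-M partition.
    if i <= len(larger):
        return select(larger, i)
    elif i <= len(larger) + equal:
        return m
    else:
        return select(smaller, i - len(larger) - equal)
```

Note that i only needs adjusting when we recurse into the smaller partition, exactly as in the 65th-of-100 example above.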

Now, I claimed the running time is linear, and I’d better back up that claim. The trick is to consider how many items get thrown out in that last step. Half the group-of-5 medians are larger than *M*, and half are smaller. Without loss of generality, suppose the i^{th} largest item is in the partition larger than *M*. Now, not only do we know that we can henceforth ignore the half of the group-of-5 medians smaller than *M*, but for each of those medians we can also ignore the two items in its group of 5 that were smaller than that median. In other words, we can eliminate at least 3/10 of the elements we were searching over (we can eliminate 3/5 of the items in the groups whose median is smaller than *M*, and half of all the groups fit this category). Sure, I’m eliding an additive constant for the items in the group with *M* itself (not to mention the items in that last small group at the end when n is not a multiple of 5), but that’s not going to affect the asymptotic running time.

So now we can build the recurrence relation at work here. If T(n) is the amount of work done to perform Select on n items, we have

T(n) = O(n) + T(n/5) + T(7n/10)

That is to say, we do O(n) work in steps 1 and 3, we do T(n/5) work in step 2, and we do T(7n/10) work in step 4 because at least 3/10 of the elements can be discarded before we recurse. I’ll leave it as an exercise to the reader to draw out the recursion tree and find the actual amount of work done, but I assure you T(n) is in O(n).
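For the impatient, here is one way the bound can be verified without drawing the whole tree: the substitution method, assuming the O(n) term is at most an for some constant a, with the guess T(n) ≤ cn.

```latex
T(n) \le an + T(n/5) + T(7n/10)
     \le an + \tfrac{c n}{5} + \tfrac{7 c n}{10}
     = an + \tfrac{9}{10} c n
     \le cn \quad \text{whenever } c \ge 10a.
```

The crucial point is that 1/5 + 7/10 = 9/10 < 1; when the recursive fractions sum to 1 or more, the linear bound breaks down.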

The crazy part here is that magic number 5: If you do this with 3 items per group, the running time is O(n log n). If you do it with more than 5, the running time is still linear, but slower (you spend more time in step 1 finding all the initial medians). The optimal number of items per group really is 5. Nifty!

[1] For those of you not in the know, these are all famous names in algorithms. Blum is a co-inventor of the CAPTCHA. Floyd is from the Floyd-Warshall all-pairs shortest-paths algorithm, which is a staple of introductory algorithms courses. Pratt is from the Knuth-Morris-Pratt string matching algorithm, which was groundbreaking when it was new but is nearly obsolete now. Rivest is the R in RSA encryption, one of the most widely used encryption schemes around. Tarjan is the co-creator of splay trees, a kind of self-adjusting binary search tree with the handy property that commonly accessed items float to the top. Blum, Floyd, Rivest, and Tarjan have each won the Turing Award, which is like the Nobel Prize of computer science.

## An interesting data structure

Admittedly, I can’t think of an uncontrived use case where I would prefer this to a hash table, union-find, or traditional bit field, but I still think it’s a pretty cool concept. It’s a data structure that holds *n* items, with constant time initialization (i.e. you don’t have to touch all *n* slots at the beginning) and constant time random access. Moreover, it can be used as a set: on top of the constant time random access, you can store and remove items as well as get the number of items in it in constant time, and you can iterate through all the inserted items in O(*n*) time (where *n* is the number of items currently stored in the set, not the maximum number of items it can hold). And on top of that, if you don’t need to call the items’ destructors (for instance, you’re storing doubles or ints), you can empty the set in constant time!

The only drawback to using it as a set is that it takes memory proportional to the size of the keyspace (for instance, if you’re trying to store items indexed by *b*-bit numbers, it will take O(2^*b*) space). However, it can also be used as a bit vector: to store *u* bits, it takes a mere O(*u*) space, and you don’t need to zero all the bits out at the beginning! This means that if you only end up using a sparse subset of the bits, you don’t waste time initializing the bits you never touch. To be fair, the constant hidden behind that O(*u*) space is bigger than the one in a normal bit vector, but I think it’s still a pretty neat idea. Besides, space is cheap these days; people already often use a byte or four for every bit needed.

Anywho, here’s a great write-up of it, along with the original 1993 paper that introduces it (see the top of page 4 to compare the running times of this to a traditional bit field). On top of what is written in those links, the space required can be cut in half by only having a single array instead of two: just have it point to other parts of itself. Insertion/deletion gets a little more complicated, but it totally works. Neat!
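In case those links ever rot, here’s a minimal sketch of the two-array version in Python. (Python zero-fills lists on allocation, so the no-initialization payoff only materializes in a language like C, where the arrays can be left as uninitialized garbage; the logic is the same either way.)

```python
class SparseSet:
    """Set of ints in [0, universe_size) with O(1) init, add, remove, clear."""

    def __init__(self, universe_size):
        self.sparse = [0] * universe_size  # may hold garbage in C
        self.dense = [0] * universe_size   # dense[0:n] are the members
        self.n = 0                         # number of items stored

    def __contains__(self, x):
        # Safe even if sparse[x] is garbage: we cross-check against dense.
        s = self.sparse[x]
        return s < self.n and self.dense[s] == x

    def add(self, x):
        if x not in self:
            self.dense[self.n] = x
            self.sparse[x] = self.n
            self.n += 1

    def remove(self, x):
        if x in self:
            # Swap the last dense element into x's slot to keep dense packed.
            i = self.sparse[x]
            last = self.dense[self.n - 1]
            self.dense[i] = last
            self.sparse[last] = i
            self.n -= 1

    def clear(self):
        self.n = 0  # constant time: nothing is zeroed, no destructors run

    def __len__(self):
        return self.n

    def __iter__(self):
        # O(len) iteration over the current members only.
        return iter(self.dense[:self.n])
```

The membership test is the heart of the trick: a garbage `sparse[x]` either points outside `dense[0:n]` or at a slot holding some other value, so it can never produce a false positive.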

## New Algorithms Text

This year, the Algorithms course at Mudd is using a new book, written by Dasgupta, Papadimitriou (the complexity theorist), and Vazirani (the quantum computing guy). You can view a July 2006 draft of the book, which actually looks pretty nice. It doesn’t cover all that I think an algorithms course should (in particular it handles the standard data structures cursorily at best), but it definitely skips the boring bits and goes right to the fun parts. It starts out with cryptography, then goes into matrix multiplication and the FFT (which sound boring, but are actually pretty neat problems). It even has a chapter on quantum computing!

I don’t think this will ever replace CLRS in terms of completeness, but it’s a lot more fun. Definitely worth a read if you’re looking for an algorithms textbook. And really, who isn’t looking for one these days?

## An Algorithms Question

Define a “hub” to be a vertex that shares an edge with every other vertex (such as the middle of a star graph, or any vertex in a complete graph). Suppose we have the adjacency matrix of an undirected, unweighted graph with V vertices (so our input has size V^2). Find an algorithm with running time o(V^2) that can determine whether a hub exists in the graph, or prove that no such algorithm exists. Note the use of little-o notation: the algorithm must be asymptotically *faster* than (big) O(V^2).

My bet is that no such algorithm exists, but I can’t figure out how to prove it.
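For concreteness, here is the obvious Θ(V^2) baseline the question asks us to beat (a straightforward sketch; `adj` is the adjacency matrix as a list of 0/1 rows):

```python
def has_hub(adj):
    """Return True iff some vertex is adjacent to every other vertex."""
    v = len(adj)
    for i in range(v):
        # Check row i: a hub's row is all 1s off the diagonal.
        if all(adj[i][j] for j in range(v) if j != i):
            return True
    return False
```

One tempting observation: if adj[i][j] == 0, then *neither* i nor j can be a hub, so a 0 entry eliminates two vertices at once. But a 1 entry eliminates nobody, and on a dense graph you may see nothing but 1s, which is roughly where my attempts at an o(V^2) algorithm stall.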

I’ll be at Mudd tomorrow, btw.

## Integer Factorisation program

I wrote an integer factorisation program (Java bytecode also available for those without a compiler) using an algorithm I just made up, and it works surprisingly well (significantly better than brute force, not nearly as good as the best-known algorithms out there today). Yes, it still has exponential running time, but I thought it was a neat idea.

## Seymour’s my friend

\begin{LittleShopOfHorrors}

Someone show me a way to get outa here,

‘Cause I constantly pray I’ll get outa here

Please, won’t somebody say I’ll get outa here

Someone gimme my shot or I’ll rot here.

…

I’ll do I dunno what to get outa [Mudd]

But a hell of a lot to get outa [Mudd]

People tell me there’s not a way outa [Mudd]

But believe me I gotta get outa [Mudd now!]

\end{LittleShopOfHorrors}

I’m pulling my third all-nighter of the week (well, not quite – I’ve gotten at least an hour of sleep every day), and I don’t expect to have less work until Wednesday afternoon, when I will only have a 15-minute presentation and a 20-page paper left to do. I am *so* ready to graduate right now!

However, the Big Algorithms presentation on Tuesday went just about perfectly (Reid and I discussed pattern matching, and in particular the Knuth-Morris-Pratt and Boyer-Moore algorithms): people seemed to understand everything we covered, we answered questions well, and they asked leading questions (this was awesome – on 3 separate occasions, we answered questions with “actually, that’s covered on the next slide…”). Our timing was perfect down to about 2 minutes of the 75-minute lecture, which blew me away because we’d never actually practiced the entire thing before. Moreover, Reid (who has a speech impediment where he stops talking when he gets flustered) went through his half perfectly, without any long pauses. On the feedback forms, one person even wrote that this was the best student lecture of the entire class. \/\/00T!

Right. Back to work…