Archive for April 2008

An interesting algorithm

So, even before I get to the what the algorithm does (which is quite interesting on its own, even before you find out how it accomplishes that), the very first intriguing part is its name. It’s officially referred to as the Blum-Floyd-Pratt-Rivest-Tarjan Algorithm. Is your mouth watering yet? I thought so.1 However, CLRS refers to it as the Select Algorithm (see section 9.3), so that’s what I’m going to call it. If you have an unsorted array of n items, it’s a worst-case O(n) algorithm for finding the ith largest item (often the median, since that’s a commonly used statistic). Here’s how the algorithm works:

  1. Put all the elements of the input into groups of 5 (with one group at the very end that could have less than 5 elements, of course). Find the median of each group.
  2. Recursively use the Select algorithm to find the median of all the group-of-5 medians from step 1. Call this median of medians M.
  3. Take all the items from the input and partition them into the group of items less than M and the group of items bigger than M.
  4. If M is the ith biggest, that’s the answer, and we’re done. If it’s not, you know which partition contains the ith biggest item, and you can recursively call Select on that group (with a different i if the group it’s in is the group of items smaller than M).

To give an example of that last statement, suppose we’re looking for the 65th biggest number from a group of 100 numbers, and we find that M is the 53rd largest. There are 47 numbers smaller than M, and when we recurse, we’ll now be looking for the 12th largest element of that set of 47. If instead M had been the 68th largest number, in our recursive step we’d look at the 67 numbers larger than M and continue to look for the 65th largest one.

Now, I claimed the running time is linear, and I’d better back up that claim. The trick is to consider how many items get thrown out in that last step. Half the group-of-5 medians are larger than M, and half are smaller. Without loss of generality, suppose the ith largest item is in the group that is larger than M. Now, not only do we know that we can henceforth ignore the half of the group-of-5 medians smaller than M, for each of those medians we can ignore the two items in the group of 5 that were smaller than that median. In other words, we can eliminate at least 3/10 of the elements we were searching over (we can eleminate 3/5 of the items in the groups whose median is smaller than M, and half of all the groups fit this category). Sure, I’m eliding an additive constant for the items in the group with M itself (not to mention the items in that last small group at the end when n is not a multiple of 5), but that’s not going to affect the asymptotic running time.

So now we can build the recurrence relation at work here. If T(n) is the amount of work done to perform Select on n items, we have

T(n) = O(n) + T(n/5) + T(7n/10)

That is to say, we do O(n) work in steps 1 and 3, we do T(n/5) work in step 2, and we do T(7n/10) work in step 4 because we know at least 3/10 of the elements are in the wrong partition. I’ll leave it as an exercise to the reader to draw out the recursion tree and find the actual amount of work done, but I assure you T(n) is in O(n).

The crazy part here is that magic number 5: If you do this with 3 items per group, the running time is O(n log n). If you do it with more than 5, the running time is still linear, but slower (you spend more time in step 1 finding all the initial medians). The optimal number of items per group really is 5. Nifty!

[1] For those of you not in the know, these are all really famous algorithms names. Blum is the co-inventor of the CAPTCHA. Floyd is from the Floyd-Warshall all-pairs shortest-paths algorithm, which is a staple in introductory algorithms courses. Pratt is from the Knuth-Morris-Pratt string matching algorithm, which was groundbreaking when it was new but is nearly obsolete now. Rivest is the R in RSA encryption, one of the most widely used encryption schemes around. Tarjan is the co-creator of splay trees, a kind of self-balancing tree with the handy property that commonly accessed items float to the top. Blum, Floyd, Rivest, and Tarjan have each won the Turing Award, which is like the Nobel Prize of computer science.

Science is awesome: the Kaye effect

A couple cool hacks

First off, Flash is vulnerable to by far the most awesome hack I’ve ever seen (there’s also a good summary of that paper). The attack has several different steps with integer overflows and failed memory allocations, but the heart of the matter is that the Flash player uses a 2-step process to validate that the code it’s running is probably safe, and this exploit changes the representation of the code in between the two checkers (it marks more of it as a no-op, so the second checker ignores the code with the exploit in it). This attack is awesome enough that it can carry out its task without disrupting the Flash player, so an unwary user will be none the wiser. and since there’s basically only one Flash player out there, every version of Flash is vulnerable. Yes, on Windows and even on Windows Vista despite their added security systems, as well as in principle on Linux and Mac. Yes, in both IE and Firefox (and presumably Safari also). This is yet another reason to install NoScript and FlashBlock on Firefox, so that sites cannot use Flash unless you give them permission. This is also another reason why standards should be open, so we can have more than one implementation of the Flash player, so not everyone will be vulnerable when something like this comes along.

The second hack I recently read about comes from Defcon celebrity Dan Kaminsky, who recently showed a very dangerous exploit that makes use of the way many ISPs these days turn DNS errors into pages of ads. This practice breaks the Same Origin Policy, so that your browser trusts these pages as though they came from the actual domain you typed in. To give an example, suppose I have an account with Bank of America and I go to ww.bankofamerica.com. Ordinarily, I’d get a DNS error. However, with certain ISPs these days, I would instead get a valid webpage saying the site doesn’t exist, but here are some ads instead. However, my browser asked for a website from bankofamerica.com and got back a website, so it trusts that it came from the bank. Consequently, it trusts the site with any cookies I have from BoA (these cookies are how BoA knows which account I’m logged into). If someone can put an XSS attack on the ISP’s ad injection system, they can grab my cookies and log into the bank as me. Yes, the bank can defend against things like this, but it’s an unusual enough hack that many companies aren’t defending against it. So beware, and if your ISP is doing this (for instance, if ww.bankofamerica.com returns a valid website), opt out of it! In addition to exposing users to this sort of attack, these ad injection systems often break DNS, which in turn breaks non-HTTP error handling (for instance, I could not VPN into work until I opted out of my ISP’s version of this crap).

C++ is bad: problems with the ternary operator, addendum

I’ve found another interesting wrinkle in the ternary operator. So, here’s a bonus question for that quiz from last time. You start with the following code:

class Abstract;
class DerivedOne;  // Inherits from Abstract
class DerivedTwo;  // Inherits from Abstract
Abstract x;
DerivedOne one;
DerivedTwo two;
bool test;

Again, you can assume that all of these are defined/initialized (and you can assume the comments are correct—both Derived* classes inherit from Abstract). Now, consider these snippets:

# Code Snippet
A
Code Snippet
B
Bonus
if (test) x = one;
else x = two;
x = (test ? one : two);
 


If you read the previous installment, you’ve probably guessed by now that these are not equivalent, and you’d be right. In what ways do they differ?

Another unexpected answer →

C++ is bad: problems with the ternary operator

In today’s installment of “why not to program in C++,” I give you the following quiz, which Dustin, Steve, Tom, and I had to figure out today (Dustin’s code was doing the weirdest things, and we eventually traced it down to this):

Suppose you start out with the following code:

class Argument;
Argument x;
void Foo(const Argument& arg);
bool test;

You can assume that all of these are defined/initialized elsewhere in the code. For each pair of code snippets below, decide whether the two snippets are equivalent to each other.

# Code Snippet
A
Code Snippet
B
1
if (test) Foo(x);
else Foo(Argument());
Foo(test ? x : Argument());
 
2
{ // limit the scope of y
  Argument y;
  Foo(test ? x : y);
}
Foo(test ? x : Argument());
 
 
 
3
Foo(x);
Foo(test ? x : x);
4
Foo(x);
Foo(true ? x : Argument());


Edit: what I meant by the curly braces in Question 2 is that you shouldn’t consider “y is now a defined variable” to be a significant difference between the two snippets.

The answers are worse than you'd think. →