Programming challenge: anagrams in PCRE

12 November 2010, 2:37 PM

(inspired by a question a student asked me in the course I TA)

In the language of your choice, write a function that takes a string argument and returns a string representing the PCRE pattern that matches all anagrams of the input word. For example, if the argument to the function is “retrains”, it should return a regex pattern that matches “restrain”, “retrains”, “strainer”, “terrains”, and “trainers” as well as a bunch of strings that aren’t real words. You can assume that all characters in the input string are English letters. The length of the output string should be polynomial in the length of the input (i.e., just enumerating all possible anagrams doesn’t count, since that takes exponential space).

Also, if you haven’t seen this chestnut: write a POSIX-compliant regex pattern that matches strings of the form a* (i.e., “”, “a”, “aa”, “aaa”, etc) which have composite (non-prime) length.

…and people claim that regular expressions are only as powerful as DFAs are. :-P

Category: Computer Science & Coding, puzzles & games | Comment (RSS) | Trackback

11 Comments

Jacob Mandelson says:

12 November 2010 at 3:28 PM

Well, people keep extending REs with operators that Kleene didn’t have.
PCRE takes that to an extreme!
Maybe we should call Kleene’s original formulation, which does permit reduction to a FSM, a special name, like Kleene RE?

For Q2, I came up with perl -e ‘print pack(“H*”, “28617b322c7d295c312b0a”);’ but I’m a bit disappointed perl -e ‘print pack(“H*”, “617b322c7d7b322c7d0a”);’ didn’t work.

Reply to this comment
- Alan says:
  
  13 November 2010 at 9:44 AM
  
  I like your first solution, and I agree that it would have been very cool if the second one had worked.
  
  Reply to this comment
Jacob Mandelson says:

12 November 2010 at 4:46 PM

Here’s my solution: in C.
I figured counting the characters in an interpolated perl code block would be cheating.

What class are you TAing?

Reply to this comment
- Alan says:
  
  13 November 2010 at 9:43 AM
  
  Whoa! What was that!? Your code didn’t compile with GCC because of an invalid cast from void* on line 52. After I fixed that, I couldn’t get the output regex to work for small words like “cat”, and I’m pretty sure the size of your regex is exponential in the size of the input (your code gave several megabytes of output for the “restrain” example).
  
  Here’s my solution; it’s 2 lines of Python. Given an input string of length n, the length of its output is O(log n).
  
  I’m TAing the freshman “Intro to CS” course at my school. We’re currently discussing regex, which is how this came up.
  
  Reply to this comment
  - Jacob Mandelson says:
    
    13 November 2010 at 10:38 AM
    
    > Your code didnâ€™t compile with GCC because of an invalid cast from void* on line 52.
    
    It compiles cleanly with -Wall, unless you rename it to a C++ file and compile it as such. I’m surprised you don’t know this difference between C and C++. :)
    
    For “cat”, I should produce the PCRE ^(?&endc1a1t1)(?(DEFINE)(?c$)(?a$)(?t$)(?a(?&endt1)|t(?&enda1))(?c(?&endt1)|t(?&endc1))(?c(?&enda1)|a(?&endc1))(?c(?&enda1t1)|a(?&endc1t1)|t(?&endc1a1)))
    
    [Hope those angle brackets and ampersands come through, no “preview” button, unfortunately.]
    
    Is this what you get? Testing this with print “$i\n” if $i =~ /…/ in perl seems to work…
    
    I though my solution would be cubic: It calls printreelems n times, each of which prints n terms, each of size n. But I erred in my term count, it’s the number of ways to pick out i letters from the letters in the word, which has an exponent of # distinct letters. So I can get a shred of “polynomial time” with O(n^26) from the English letters stipulation, but not what I intended, if I had analyzed it correctly I would have not pursued this technique and tried to think of another.
    
    Your solution is much more elegant. I like that way of counting characters.
    
    Reply to this comment
    - Jacob Mandelson says:
      
      13 November 2010 at 10:46 AM
      
      Lemme try my “cat” regex again, with a bit of sed to protect the brackets:
      ^(?&endc1a1t1)(?(DEFINE)(?<endc1>c$)(?<enda1>a$)(?<endt1>t$)(?<enda1t1>a(?&endt1)|t(?&enda1))(?<endc1t1>c(?&endt1)|t(?&endc1))(?<endc1a1>c(?&enda1)|a(?&endc1))(?<endc1a1t1>c(?&enda1t1)|a(?&endc1t1)|t(?&endc1a1)))
      
      Reply to this comment
  - Jacob Mandelson says:
    
    13 November 2010 at 11:02 AM
    
    > (your code gave several megabytes of output for the â€œrestrainâ€ example)
    
    Hmm, there must be some undefined behavior or something I accidentally put in my solution, even with the lot-worse-than-I-expected performance, “restrain” only produces 0.6 M for me.
    
    Reply to this comment
    - Alan says:
      
      13 November 2010 at 11:52 AM
      
      No, you’re right; I had a typo. The output of “restrains” (with an s on the end) is 2.8 MB long, and you’re right that “restrain” without the s only gives 0.6 MB output. However, “restrained” yields 27.7 MB of output, and “restraining” uses 154 MB. I suspect your solution uses exponential space because every letter I add increases the space used by an order of magnitude.
      
      Reply to this comment
      - Jacob Mandelson says:
        
        13 November 2010 at 12:46 PM
        
        Do you get the same output for “cat” as I do?
        
        Let’s see, my (crummy) solution will output one term for each way of selecting letters from the word. If there are a_i of letter i, that gives PRODUCT[a_i+1]. For d distinct letters, that’ll max out with a_i = n/d. Which gives (n/d)^d terms, of size d. Which is not the way my code is behaving, so it’s buggy.
        
        And indeed, the way I recurse through the space is simply broken. A fixed version is here. This gives smaller output (academic since we already know this approach is not good): 50k for “restrain”, 102k for “restrains”, 337k for “restrained”, 1.8M for “restraining”. And does seem to be growing with an exponent of d.
        
        Reply to this comment
        
        Alan says:
        
        14 November 2010 at 2:30 PM
        
        I did indeed get the same output as you for “cat”. Furthermore, I’m no longer convinced that your solution uses exponential space. You use a constant amount of space for each distinct subset of letters, which is “only” O(n^26) (or O(n^52) if you allow for capital letters as well). So, it’s exponential in the size of your alphabet, but not in the size of your input (which adheres to what I wrote in the problem specification, but not exactly to its spirit).
        
        Your new link doesn’t link to anything; are you sure you typed it in correctly?
        
        Reply to this comment
        
        Jacob Mandelson says:
        
        14 November 2010 at 3:02 PM
        
        > Your new link doesnâ€™t link to anything; are you sure you typed it in correctly?
        
        Try here.
        
        > I did indeed get the same output as you for â€œcatâ€.
        
        Hmm, how are you testing it? When I test the regex in comment 1124 with print “$i\n” if $i =~ /…/, it seems to match correctly.
        
        > (which adheres to what I wrote in the problem specification, but not exactly to its spirit)
        
        Yes, if I had realized my solution was n^d instead of incorrectly thinking it was n^3, I wouldn’t have implemented it. I might have tried to use | to make a conjunction with de Morgan’s laws. I doubt I would have thought of using the zero-width lookahead assertion to make a conjunction, I’ll have to remember that trick.
        
        Reply to this comment

Interesting Stuff

Programming challenge: anagrams in PCRE

11 Comments

Leave a Reply

Categories

Archives

Other Cool Stuff