Wednesday, March 19, 2008

gawk vs. mawk and Company

A co-worker of mine somehow lost (deleted?) her e-mails, at least some of them quite important. As it turned out, three months' worth of mail had mysteriously vanished from her Thunderbird Inbox.

A quick search on the internet turned up a nice little article, "Restore Deleted Email in Thunderbird", on jivebay. Exactly what I needed! The only problem: my co-worker's Inbox wasn't just "over 90MB", it was about 850MB. Even my favorite Windows text editor (for those who want to know: it's JujuEdit, which can handle very large files, supports regular expressions, among other goodies, and is freeware) couldn't do a job like this in a reasonable amount of time. As far as I recall, JujuEdit would have needed about 2 to 3 hours to replace all the 'X-Mozilla-Status: xxxx' values with zeros.

I decided to have a look at some command line tools from the UNIX/Linux world: awk and sed, both designed to process text-based data (like the Thunderbird mbox files). After a while spent studying the sed and awk syntax, I ended up with two perfectly working 'one-line scripts'.

Both versions work as expected, but they differ in execution time, and this is, finally, why I'm writing this: a short and thoroughly incomplete speed comparison between some text-processing tools.

The scripts:

awk '/^X-Mozilla-Status: [0-9]*/ {gsub($2, "0000\r")} {print}' Inbox > NewIn

sed -e 's/^X-Mozilla-Status: [0-9]\{4\}/X-Mozilla-Status: 0000/' Inbox > NewIn

I 'time'd both snippets several times on an (Ubuntu) Linux box, and this is what I got:

(m)awk: ~50s
sed: ~1m40s
JujuEdit (Windows): approx. 3 hours
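
Before comparing run times it is worth checking that both one-liners really produce the same file. The following is only a sketch: the sample data and the temporary file names are made up for illustration, and it assumes the mbox uses CRLF line endings (which is what the "\r" in the awk version relies on). To get timings, each command can be prefixed with `time` in an interactive shell.

```shell
# Build a tiny CRLF sample in place of a real Thunderbird Inbox (made-up data).
sample=$(mktemp)
printf 'From - Wed Mar 19 10:00:00 2008\r\nX-Mozilla-Status: 0008\r\nX-Mozilla-Status2: 00000000\r\nSubject: test\r\n\r\nbody\r\n' > "$sample"

# Run both one-liners from above on the same input.
sed -e 's/^X-Mozilla-Status: [0-9]\{4\}/X-Mozilla-Status: 0000/' "$sample" > "$sample.sed"
awk '/^X-Mozilla-Status: [0-9]*/ {gsub($2, "0000\r")} {print}' "$sample" > "$sample.awk"

# cmp -s exits with 0 only if the two outputs are byte-for-byte identical.
if cmp -s "$sample.sed" "$sample.awk"; then result=identical; else result=different; fi
echo "$result"
```

On a file with plain LF line endings the awk version would append a stray carriage return, so the two outputs would no longer match.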

I know, my testing environment is rather special and the results are not really representative, but IMHO they are impressive nonetheless.

This way I found out that Ubuntu's standard awk is in fact mawk, and that mawk is (in this special case) about twice as fast as sed, even though awk is a full-blown programming language and sed 'just' a Unix utility.

Today I added GNU awk (gawk) and, of course, Perl to my little test suite, and tried it on my own Inbox (about 160MB) in a small VM Ubuntu installation.
The gawk code snippet is similar to the awk version; the Perl line looks like this:

perl -w -p -e 's/X-Mozilla-Status: [0-9]{4}/X-Mozilla-Status: 0000/' Inbox > NewIn

mawk: ~25s
Perl: ~26s
sed: ~35s
gawk: ~40s

My conclusion:
  • mawk is indeed a very fast implementation of awk.
  • I couldn't get the quantifier / repetition operator {} to work in awk.
  • Perl is such a nice thing to have.
  • I still like Linux.
Just my two cents.
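
About the {} problem: as far as I know, the mawk version Ubuntu shipped at the time does not implement interval expressions at all, and older gawk releases only enable them with --re-interval or --posix. A workaround that should run on any awk is to spell the four digits out. This is a minimal sketch; the function name `reset_status` is made up for illustration.

```shell
# Hypothetical helper: reset the status using only regex syntax every awk understands.
reset_status() {
  awk '/^X-Mozilla-Status: [0-9][0-9][0-9][0-9]/ {
         sub(/[0-9][0-9][0-9][0-9]/, "0000")   # rewrite the first 4-digit run only
       }
       { print }'
}
# Usage: reset_status < Inbox > NewIn
```

Because sub() only touches the first match, digits elsewhere on the line and the trailing carriage return are left alone.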

It is possible, or even likely, that I made some mistakes here. Comments are welcome, I'm eager to learn... BTW, I'm well aware that zeroing out the Mozilla-Status is not the perfect way to get deleted mails back; changing only 0008 to 0000 or 0001 should do the trick.

Useful links: