A co-worker of mine somehow lost (deleted?) her e-mails, at least some of them quite important. As it turned out, three months' worth of mail had mysteriously vanished from her Thunderbird Inbox.
A quick search on the internet revealed a nice little article, "Restore Deleted Email in Thunderbird", on jivebay. Exactly what I needed! The only problem: my co-worker's Inbox wasn't just "over 90MB", it was about 850MB. So even my favorite Windows text editor (for those who want to know: it's JujuEdit, which can handle very large files, supports regular expressions, among other goodies, and is freeware) couldn't handle a job like this in a reasonable amount of time. As far as I recall, JujuEdit would have needed about 2 to 3 hours to replace all the 'X-Mozilla-Status: xxxx' values with zeros.
I decided to have a look at some command line tools from the UNIX/Linux world: awk and sed, both designed to process text-based data (like the Thunderbird mbox files). After a while studying the sed and awk syntax, I ended up with two perfectly working 'one-line scripts'.
Both versions work as expected, but they differ in execution time, and this is - at last - why I'm writing this... a short and completely fragmentary speed comparison between some text-processing tools.
The scripts:
awk '/^X-Mozilla-Status: [0-9]*/ {gsub($2, "0000\r")} {print}' Inbox > NewIn
sed -e 's/^X-Mozilla-Status: [0-9]\{4\}/X-Mozilla-Status: 0000/' Inbox > NewIn
I 'time'd both snippets several times on a (Ubuntu) Linux box and this is what I got:
(m)awk: ~50s
sed: ~1m40s
JujuEdit (Windows): approx. 3 hours
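For anyone who wants to repeat this kind of measurement, here is a rough sketch of running a command several times and printing the wall-clock seconds per run. The function name time_runs is made up for this example, and it assumes a date that understands +%s (GNU date does). Output is thrown away to /dev/null, so the numbers won't include the cost of writing NewIn.

```shell
# Hypothetical helper: run a command N times and report whole seconds per run.
# Assumes `date +%s` is available (true for GNU date on Ubuntu).
time_runs() {
    runs=$1
    shift
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s)
        "$@" > /dev/null          # discard the rewritten mbox, keep only timing
        end=$(date +%s)
        echo "run $((i + 1)): $((end - start))s"
        i=$((i + 1))
    done
}
```

Usage would look something like: time_runs 3 sed -e 's/^X-Mozilla-Status: [0-9]\{4\}/X-Mozilla-Status: 0000/' Inbox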
I know my testing environment is somewhat unique and the results are not really representative, but IMHO they are impressive nonetheless.
This way I found out that Ubuntu's standard awk is in fact mawk, and that mawk is (in this special case) about twice as fast as sed... even though awk is a full-blown programming language and sed 'just' a Unix utility.
Today I added GNU awk (gawk) and, of course, Perl to my little test suite, and tried it on my own Inbox (about 160MB) in a small VM Ubuntu installation.
The gawk code snippet is similar to the awk version; the Perl line looks like this:
perl -w -p -e 's/X-Mozilla-Status: [0-9]{4}/X-Mozilla-Status: 0000/' Inbox > NewIn
mawk: ~25s
Perl: ~26s
sed: ~35s
gawk: ~40s
My conclusion:
- mawk is indeed a very fast implementation of awk.
- I couldn't get the quantifier / repetition operator {} to work in awk.
- Perl is such a nice thing to have.
- I still like Linux.
Just my two cents.
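On the {} quantifier: as far as I know, some awk implementations (older mawk releases, and gawk before version 4.0 unless it is started with --re-interval) don't support interval expressions, which may be the reason it didn't work. A minimal sketch that sidesteps the problem by spelling out the four digits; the function name reset_status is made up for this example, and it reads the mbox on stdin:

```shell
# Hypothetical helper: zero the X-Mozilla-Status value without using {4}.
# The header name contains no digits, so the first run of four digits
# on a matching line is the status value itself.
reset_status() {
    awk '/^X-Mozilla-Status: [0-9][0-9][0-9][0-9]/ {
             sub(/[0-9][0-9][0-9][0-9]/, "0000")
         }
         { print }'
}
```

It would be called as reset_status < Inbox > NewIn. Because only the digits are replaced, whatever line ending the file uses stays as it is.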
It is possible, or even likely, that I made some mistakes here. Comments are welcome, I'm eager to learn... BTW, I'm well aware that zeroing out the X-Mozilla-Status is not the perfect solution to get deleted mails back; changing 0008 to 0000 or 0001 should do the trick.
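The gentler variant mentioned above could look roughly like this. It is only a sketch: the function name undelete_only is invented for the example, it touches nothing but messages whose status is exactly 0008 and sets them to 0001, and any other flag combination is left alone.

```shell
# Hypothetical helper: rewrite only status 0008 to 0001, leave all
# other X-Mozilla-Status values untouched. Reads the mbox on stdin.
undelete_only() {
    sed -e 's/^X-Mozilla-Status: 0008/X-Mozilla-Status: 0001/'
}
```

Called as undelete_only < Inbox > NewIn, this keeps the status of all other messages as it was.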
Useful links: