Sunday, July 18, 2004

A Case of Mistaken Identifiers

Through a story at The Register, "Excel Ate My DNA", I discover a delightful note entitled "Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics".
The problem?  In its default settings, Excel converts some gene names to dates, and interprets some clone codes as numbers.  A gene name like APR-1 will be converted to 1-apr, for example.  And the ... erm ... helpfulness of Excel can cause a clone code like "2310009E13" to be converted to the number 2.31E+13.  You can't always convert back once you discover the error - SEPT1 and SEP1 both become 1-sep, and notice how the trailing 9 disappeared from the above number?
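For the curious, here is a rough sketch of how one might flag identifiers at risk before they ever get near a spreadsheet. It is only an illustration under my own assumptions: the two patterns are guesses modeled on the examples above (APR-1, SEPT1, 2310009E13), not Excel's actual conversion rules, and the names `risk` and `risky` are made up for this example.

```python
import re

# Month prefixes a spreadsheet might read as the start of a date.
# Assumption: an illustrative list, not Excel's real rule set.
MONTHS = ("SEPT", "JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

# Month prefix, optional hyphen, then one or two digits: APR-1, SEPT1, SEP1.
DATE_LIKE = re.compile(r"^(?:%s)-?(\d{1,2})$" % "|".join(MONTHS), re.IGNORECASE)

# Digits, a single E, more digits: 2310009E13 reads as scientific notation.
SCI_LIKE = re.compile(r"^\d+E\d+$", re.IGNORECASE)


def risk(identifier):
    """Return "date", "number", or None for a single identifier."""
    s = identifier.strip()
    m = DATE_LIKE.match(s)
    if m and 1 <= int(m.group(1)) <= 31:
        return "date"
    if SCI_LIKE.match(s):
        return "number"
    return None


def risky(identifiers):
    """Return (identifier, kind) pairs for every identifier that looks at risk."""
    flagged = []
    for name in identifiers:
        kind = risk(name)
        if kind is not None:
            flagged.append((name, kind))
    return flagged


if __name__ == "__main__":
    for name, kind in risky(["APR-1", "SEPT1", "TP53", "2310009E13"]):
        print(name, "may be read as a", kind)
```

A check like this only tells you which names to worry about. The safer habit is to import such columns as text in the first place, since the conversion can't always be undone afterwards.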
Thanks, Microsoft.  You've set gene-based cancer therapy back a couple years, probably, with your Freedom to Enervate.  And look how you've done it: with what computer science used to call DWIM: Do What I Mean.  Pioneered by Warren Teitelman at Xerox PARC, DWIM didn't swim - it sank.  As my old professor Bernard Mont-Reynaud at U.C. Berkeley used to say, "it should be called DWWM - Do What Warren Means."  Hacker dictionaries wryly repeat the most common definition:
1. Able to guess, sometimes even correctly, the result intended when bogus input was provided.

There is at least one scornful lyric on the subject.
"Mistaken Identifiers" yields a few other delights.  For example, those unacquainted with crediting trends in the scientific literature may boggle slightly at the eight author attributions in what amounts to a bug report.  But that's nothing!  Check the RIKEN [4] reference and you'll learn about an inquiry into the mouse transcriptome that credits what looks like 135 co-authors.
Special bonus link: a shell script that pokes through Excel data and finds likely errors.  What's a shell script, you may ask?  Oh, never mind: just some bad old technology from the 1970s, dating back to a time before real computers.  (But not before Real Programmers.)


