New JABBA awards for crimes against bioinformatics acronyms

JABBA is an acronym for Just Another Bogus Bioinformatics Acronym. I hand out JABBA awards to bioinformatics papers that reach just a little bit too far when trying to come up with acronyms (or initialisms). See details of previous winners here.

So without further delay, let's see who the new recipients are. As always, journals like Bioinformatics produce many strong candidates for JABBA awards. Here are three winners from the latest issue:

  1. mpMoRFsDB: a database of molecular recognition features in membrane proteins - What is it with the need for so much use of mixed-case characters these days? It makes names harder to read, and this database has a name which doesn't really roll off the tongue.
  2. CoDNaS: a database of conformational diversity in the native state of proteins - More mixed-case madness. I'm really unsure how this database should be pronounced. 'Cod Naz'? 'Code Nass'? 'Coe Dee-en-ays'? The abstract suggests that the acronym stems from the name 'Conformational Diversity of Native State', so I guess we should be thankful that they avoided the potential confusion of naming this database 'CoDoNS'.
  3. GALANT: a Cytoscape plugin for visualizing data as functional landscapes projected onto biological networks - If you want your bioinformatics tool to have a cool sounding name, just give it that name. Don't feel that you somehow have to tenuously arrive at that name from a dubious method of selecting just the letters that make it work. The abstract of this paper reveals that GALANT stems from 'GrAph LANdscape VisualizaTion'. So the full name only consists of three words, and the initial letter of one of those words doesn't even make it into the acronym/initialism. They may as well call their tool 'GREAT' (GRaph landscapE visualizATion).

The joys of dealing with well-annotated 'reference' genomes

Important update included at bottom of this post

Arabidopsis thaliana has one of the best curated eukaryotic genome sequences so you might expect working with Arabidopsis data to be a joy? Well, not always. Consider gene AT1G79920. This gene has two coding variants (AT1G79920.1 and AT1G79920.2) which only seem to differ in their 5' UTR regions:

2013-08-30 at 3.23 PM.png

The primary source for Arabidopsis genome information is The Arabidopsis Information Resource (TAIR) but another site called Phytozome has started collating all plant-related genome data (often pulling it from sites like TAIR).

Both sites allow you to download coding sequences for each gene from their FTP sites, or view the same sequence information on their web sites. You could also download GFF files and, using the coordinate information, extract the coding sequences from the whole genome sequence.
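
For anyone who wants to try the GFF route, here is a minimal Python sketch of the idea. It assumes you already have a chromosome sequence in memory and a list of CDS coordinates for one transcript (GFF coordinates are 1-based and inclusive). The function names and the toy sequence are made up for illustration; this is not a TAIR or Phytozome tool.

```python
# Translation table for complementing DNA (upper and lower case, plus N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_cds(chromosome, intervals, strand):
    """Join the CDS segments of one transcript.

    intervals: (start, end) pairs, 1-based and inclusive, as in a GFF file.
    strand: "+" or "-".
    """
    # Convert each 1-based inclusive interval into a Python slice.
    pieces = [chromosome[start - 1:end] for start, end in sorted(intervals)]
    cds = "".join(pieces)
    # Minus-strand genes are read from the opposite strand.
    return reverse_complement(cds) if strand == "-" else cds


# Toy chromosome: 2 bp flank, exon, intron, exon, 2 bp flank (not real data).
toy_chromosome = "AA" + "ATGGCC" + "GTAAGTTTTAG" + "CATTAA" + "GG"
toy_intervals = [(3, 8), (20, 25)]

print(extract_cds(toy_chromosome, toy_intervals, "+"))  # ATGGCCCATTAA
```

Sorting the intervals by start position before joining means the order of the CDS lines in the file doesn't matter, and the minus-strand case only needs one reverse complement at the end.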

A simple sanity check when working with coding sequences is that the length of a coding sequence should be divisible by 3; otherwise there might be a frameshift. There seems to be an issue with AT1G79920 in that it sometimes has a frameshift. I say sometimes because it depends on where you look at the data.
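
The divisible-by-3 check is easy to script. The sketch below is one way to do it in plain Python, assuming the coding sequences are in FASTA format. The helper names (read_fasta, bad_length_cds) and the two toy records are invented for this example.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the identifier.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def bad_length_cds(lines):
    """Return identifiers of sequences whose length is not a multiple of 3."""
    return [name for name, seq in read_fasta(lines) if len(seq) % 3 != 0]


# Toy records, not real Arabidopsis sequence.
example = """>toy_cds_1 made-up record
ATGGCTTAA
>toy_cds_2 made-up record
ATGGC
TTA
""".splitlines()

print(bad_length_cds(example))  # ['toy_cds_2']
```

For a real file you could pass an open file handle instead of the toy list, since the function only needs something it can loop over line by line.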

Here is a sequence alignment for exons 5 and 6 of both coding variants of this gene. The sequence identifiers include a 'P' or 'T' to indicate whether the information came from Phytozome or TAIR and they also denote whether it was taken from their web or FTP site. I also insert 'gtag' into one sequence to illustrate where the intron between these exons would occur.

2013-08-30 at 3.25 PM.png

The first box highlights a base that looks like a SNP, but these are coding variants that differ only in their UTRs, so the underlying genome sequence in the coding regions should be identical.

The second box is perhaps more disturbing. The web site versions of exon 7 all have a 1 bp deletion which would lead to a frameshift. I guess this error started at TAIR and propagated to Phytozome, but the fact that both sites also have a correct version available on their FTP site is confusing and troubling.

My boss first discovered this by looking at the GFF files for the Arabidopsis genome and this was one of 25 genes with a 'not-divisible-by-3' length error. So it pays to always check — and double check — your data.
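
A rough sketch of that kind of GFF scan is below. It is my own illustration rather than the script that was actually used: it assumes standard nine-column, tab-separated GFF3 lines in which each CDS feature names its transcript through a Parent attribute, and the toy records are invented.

```python
from collections import defaultdict


def cds_lengths_from_gff(lines):
    """Sum the lengths of CDS features for each parent ID in GFF3 lines."""
    totals = defaultdict(int)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue  # only CDS features matter for this check
        start, end = int(fields[3]), int(fields[4])
        # Attributes are semicolon-separated key=value pairs.
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        # GFF3 allows several comma-separated parents for one feature.
        for parent in attributes.get("Parent", "").split(","):
            if parent:
                # GFF coordinates are 1-based and inclusive.
                totals[parent] += end - start + 1
    return dict(totals)


def frameshift_candidates(lines):
    """Return parent IDs whose total CDS length is not a multiple of 3."""
    lengths = cds_lengths_from_gff(lines)
    return sorted(parent for parent, total in lengths.items() if total % 3)


# Toy GFF3 records, not real TAIR10 annotation.
toy_gff = [
    "##gff-version 3",
    "chrToy\tdemo\tCDS\t1\t90\t.\t+\t0\tParent=toy.1",
    "chrToy\tdemo\tCDS\t201\t260\t.\t+\t0\tParent=toy.1",
    "chrToy\tdemo\tCDS\t1\t90\t.\t+\t0\tParent=toy.2",
    "chrToy\tdemo\tCDS\t201\t259\t.\t+\t0\tParent=toy.2",
    "chrToy\tdemo\texon\t1\t300\t.\t+\t.\tParent=toy.2",
]

print(frameshift_candidates(toy_gff))  # ['toy.2']
```

Anything this reports is only a candidate. The length check flags a transcript for a closer look; it doesn't tell you whether the annotation or the genome sequence is at fault.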

Time to send an email to TAIR and Phytozome to report this.

Update

I heard back from TAIR and Phytozome and it seems that there are a small number of likely genome sequence errors in the latest (TAIR10) release of the A. thaliana genome. When these affect genes and would introduce a frameshift error, TAIR make manual edits to the resulting files of CDSs, transcripts, and proteins. They do this when there is clear homology from genes in other species that suggests the change should be made.

So if you work with downloaded FASTA files from the FTP site, you won't see these errors. If you work from GFF files (which is presumably what some of their web displays do), you'll run into these issues. There is a small file (TAIR10sequenceedits.txt) included as part of the TAIR10 release which documents these changes.

Thanks to both sites for a speedy and helpful response. Perhaps I should retitle this post The Joys of Dealing with Fast and Knowledgeable Genome Curators?

Crappy science spam from Photon Journals

I've started receiving more and more science-related spam recently. Some of these are semi-legitimate, like those from the OMICS group, but should still be avoided (here's a good cautionary tale of what to expect if you go to an OMICS Group conference).

Sometimes I'm just disappointed by how little effort these spammers take to make their emails look professional. Over the last couple of weeks I've received four emails from 'Photon Journals' asking me to submit my research to some of their (many) journals. To give you an idea of how lame their attempts at spamming people are, consider that:

  1. Their email comes from a Hotmail account.
  2. They address me as 'Dear Dr. Krbradnam' (clearly they have phished my academic email address from somewhere and just assumed that this is my surname).
  3. Their email is written in purple!
  4. Two emails on the same day ask me to submit material to two completely different journals (see below).
  5. The 'website' for the journal is hosted on Google Sites.
  6. The website looks like it was designed by a five year old on acid (see below).
2013-08-29 at 8.31 PM.png
2013-08-29 at 8.32 PM.png
2013-08-29 at 8.44 PM.png

The modus operandi of these fake journals is just to charge you to submit your paper (and really it can be any paper, no-one is going to check it, let alone read it). I guess we should be grateful that some of these are so obviously fake that it is easy to ignore them. Several sites are now collating lists of all of these bogus journals (there are a lot).

New JABBA award recipients for the bogus use of bioinformatics acronyms

The latest issue of the journal Bioinformatics was released today and as with most issues, it features a large number of bioinformatics tools and resources. Many of these tools feature questionable use of acronyms and initialisms and so it is time to hand out some new JABBA awards.

A quick reminder: JABBA stands for 'Just Another Bogus Bioinformatics Acronym'. I recently gave out the inaugural JABBA award and have since discovered that JABBA is not the only game in town (if the game is critically evaluating the overreaching use of acronyms in science). Step forward ORCA - the Organisation of Really Contrived Acronyms, a blog that was started by an old bioinformatics colleague of mine. I highly recommend checking this out to see more acronym-derived crimes.

Anyway, here are several nominations for JABBA awards from the latest issue of Bioinformatics:

  1. MaGnET: Malaria Genome Exploration Tool - My main issue with this one is the ungainly capitalization. There is also the issue that the tool is in no way related to magnets, so if you don't remember the fact that it is called MaGnET then it is hard to find. A Google search for malaria tool doesn't feature MaGnET on the first page of hits.
  2. MMuFLR: missense mutation and frameshift location reporter - Okay, so part of me thinks that this is probably meant to be pronounced 'muffler'? If this is not the case, then it doesn't really roll off the tongue. "Oh you're looking to find missense and frameshift mutations? Then you should try em-em-yoo-ef-el-ar". It doesn't help when you follow the link in the article to find the resource (a Galaxy workflow) only to find no mentions of MMuFLR on the resulting page (unless you search for it).
  3. mRMRe: an R package for parallelized mRMR ensemble feature selection - Some of the same reasons that applied to MMuFLR also apply to mRMRe. Try saying this three times fast and you'll see what I mean. It seems that mRMRe is an 'ensemble' implementation of something called mRMR (minimum redundancy maximum relevance). Still not sure why the first 'm' is free from the need of capitalization (or the 'e' for 'ensemble').
  4. TIGAR: transcript isoform abundance estimation method with gapped alignment of RNA-Seq data by variational Bayesian inference - I'm always a bit suspicious when a tool name features 15 words but somehow gets reduced to 5 letters for the abbreviated name. You could equally make a case that this tool should be called TIRBI. In any case, TIGAR is quite an unusual spelling and you might think that a Google search for TIGAR would put this resource near the top of the results. However, it also seems that TIGAR is The Inland Gateway Association of Realtors as well as The International Gymnastics Academy of the Rockies. But perhaps more importantly, TIGAR already has a scientific connection as it is also the name of a gene (TIGAR: TP53-induced glycolysis and apoptosis regulator).

Thanks to the journal Bioinformatics for providing some new JABBA recipients.