More duplicate names for bioinformatics software: a tale of two HIPPIES

Thanks to Sara Gosline (@sargoshoe) for bringing this to my attention. Compare and contrast the following:

The former tool, published in 2012 in PLOS ONE, takes its name from 'Human Integrated Protein-Protein Interaction rEference' (it was doing so well until it reached the last letter). The latter tool ('High-throughput Identification Pipeline for Promoter Interacting Enhancer elements') was published in 2014 in the journal Bioinformatics.

Leaving aside the question of whether these names are worthy of a JABBA award, the issue here is that we have yet another duplicate set of software names for two different bioinformatics tools. The authors of the second paper could, and should, have checked for 'prior art'.

If you are planning to develop a new bioinformatics tool and have thought of a possible name, please take the time to do the following:

  1. Visit http://google.com (or your preferred web search engine)
  2. In the search box type the proposed name of your tool followed by a space
  3. Then add the word 'bioinformatics'
  4. Click search
  5. That's it
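
For anyone who would rather script the check, here is a minimal sketch of the same steps in Python. The function name `name_check_url` is made up for this example, and the `https://www.google.com/search?q=...` URL pattern is an assumption about how the search engine accepts queries; swap in your own search engine's URL as needed.

```python
from urllib.parse import urlencode


def name_check_url(tool_name, keyword="bioinformatics",
                   base="https://www.google.com/search"):
    """Build a search URL for '<tool_name> <keyword>'.

    The base URL and the 'q' parameter are assumptions about the
    search engine; adjust them for the engine you prefer.
    """
    # Steps 2 and 3: the proposed name, a space, then 'bioinformatics'
    query = f"{tool_name.strip()} {keyword}"
    # urlencode escapes the space and any other special characters
    return f"{base}?{urlencode({'q': query})}"


if __name__ == "__main__":
    # Step 4: open the printed URL in a browser and read the results
    print(name_check_url("HIPPIE"))
```

The script only builds the URL. Reading the results and deciding whether the name is already taken is still up to you.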

Inconsistent bioinformatics branding: SAMtools vs Samtools vs samtools

The popular Sequence Alignment Map format, SAM, has given rise to an equally popular toolkit for working with SAM files (and BAM and CRAM files too). But what is the name of this tool?


SAMtools?

If we read the official publication, then we see this software described as 'SAMtools' (Wikipedia uses the same form).

Samtools?

Head to the official website and we see consistent references to 'Samtools'.

samtools?

Head to the official GitHub repository and we see consistent references to 'samtools'.


This is not exactly a problem that is halting the important work of bioinformaticians around the world, but I find it surprising that all of these names are in use by the people who developed the software. Unix-based software is typically, but not always, implemented as a set of lower-case commands, and this can add a level of confusion when comparing a tool's name to the actual commands that are run ('samtools' is what you type at the terminal). However, you can still be consistent in your documentation!

How do people choose a single isoform of a gene to use for bioinformatics analyses?

 

Update 2015-09-29: in addition to the comments at the end of the post below, also see the follow-up post that I wrote, which offers some more suggestions, including the APPRIS database/webserver, which looks very useful.

 

This post is somewhat of a follow-up to something that I wrote earlier this week. In bioinformatics, we often want to analyze all genes from an organism (or from multiple organisms). In many well-annotated genome databases, there is often a choice of isoforms available for each protein-coding gene, and the number of isoforms only ever seems to increase.

For example, in the latest set of human gene annotations (Ensembl 78), there are 406 protein-coding genes that have more than 25 transcripts. At one extreme, the human GPR56 gene has 77 transcripts, 61 of which are annotated as protein-coding! The length of these 61 putative protein products ranges from just 6 amino acids (!) all the way up to 693.

In Caenorhabditis elegans, sequence identifiers for genes were historically based on appending numbers to the identifier of the BAC/YAC/Cosmid clone containing that gene. E.g. B0348.1 would represent the first predicted gene on the B0348 clone, B0348.2 the second gene…and so on. When splice variants were discovered, curators appended letters for each isoform. E.g. B0348.2a and B0348.2b represent the two alternative isoforms of this gene. In the latest WS248 release of WormBase, one gene (egl-8) has 25 isoforms (all the way up to B0348.4y). I wonder what WormBase will do when a 27th isoform is discovered?
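
To make the naming scheme concrete, here is a small Python sketch that splits an identifier into its clone, gene number, and isoform letter, and generates isoform names for a gene. The regular expression is a simplified assumption based only on the examples above (it is not WormBase's official specification), and the function names are made up for illustration.

```python
import re
import string

# Simplified pattern: <clone>.<gene number><optional isoform letter>
# This is an assumption drawn from examples such as B0348.2a.
SEQ_ID = re.compile(r"^(?P<clone>[A-Z0-9]+)\.(?P<gene>\d+)(?P<isoform>[a-z]?)$")


def parse_sequence_id(seq_id):
    """Return (clone, gene_number, isoform_letter or None)."""
    match = SEQ_ID.match(seq_id)
    if match is None:
        raise ValueError(f"unrecognised sequence identifier: {seq_id!r}")
    isoform = match.group("isoform") or None
    return match.group("clone"), int(match.group("gene")), isoform


def isoform_ids(gene_id, n_isoforms):
    """Name n_isoforms isoforms of gene_id using suffixes a, b, c, ..."""
    letters = string.ascii_lowercase
    if n_isoforms > len(letters):
        # Single-letter suffixes run out after 'z'
        raise ValueError("single-letter suffixes only cover 26 isoforms")
    return [gene_id + letter for letter in letters[:n_isoforms]]


if __name__ == "__main__":
    print(parse_sequence_id("B0348.2a"))
    print(isoform_ids("B0348.4", 25)[-1])
```

With single-letter suffixes the scheme covers at most 26 isoforms per gene, which is why a 27th isoform would force some change to the convention.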

So how does one attempt to choose a single variant for use in a bioinformatics pipeline, and is this something that we should even be attempting? Historically, people have often opted for a quick-and-easy approach in order to get around this problem. Some examples from papers indexed by Google Scholar:

"In cases of alternative splicing, we chose the longest protein to represent a gene"

"In cases of multiple transcript isoforms, we chose the isoform with the longest CDS supported by transcript and protein homology in other mammalian species"

"Because of the redundancy of protein sequences, we chose only the longest isoform for every entry"

"In cases where a gene possesses more than one reference sequence, we chose the longest"

"When multiple protein entries are found for the same EntrezGene identifier, choose the longest sequence isoform"
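
The 'choose the longest' heuristic that these quotes describe can be sketched in a few lines of Python. This is only an illustration of the general idea, not the method used in any of the quoted papers; the function name, the input layout (gene identifier, isoform identifier, protein sequence), and the toy records are all invented for the example.

```python
def longest_isoform_per_gene(isoforms):
    """Pick the longest protein isoform for each gene.

    isoforms: iterable of (gene_id, isoform_id, protein_sequence) tuples.
    Returns a dict mapping gene_id -> (isoform_id, protein_sequence).
    Ties are broken arbitrarily: the first isoform seen is kept.
    """
    best = {}
    for gene_id, isoform_id, sequence in isoforms:
        current = best.get(gene_id)
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (isoform_id, sequence)
    return best


if __name__ == "__main__":
    # Toy records, made up for illustration only
    records = [
        ("geneA", "geneA.1", "MKT"),
        ("geneA", "geneA.2", "MKTAYIAK"),
        ("geneB", "geneB.1", "MSE"),
    ]
    for gene, (isoform, seq) in longest_isoform_per_gene(records).items():
        print(gene, isoform, len(seq))
```

Note that the sketch looks only at length. It knows nothing about expression, tissue specificity, or whether the longest product is even real.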

This methodology is obviously not without problems (as others have reported). So I'm genuinely curious as to what people do in order to choose a 'representative' isoform (whatever that means). The problem is further complicated by the fact that some genes might consistently use different isoforms in different tissues or at different developmental time points.

Please comment below if you think you have found a good solution to this problem!