New JABBA award recipients for the bogus use of bioinformatics acronyms

The latest issue of the journal Bioinformatics was released today and, as with most issues, it features a large number of bioinformatics tools and resources. Many of these tools rely on questionable acronyms and initialisms, so it is time to hand out some new JABBA awards.

A quick reminder: JABBA stands for 'Just Another Bogus Bioinformatics Acronym'. I recently gave out the inaugural JABBA award and have since discovered that JABBA is not the only game in town (if the game is critically evaluating the overreaching use of acronyms in science). Step forward ORCA - the Organisation of Really Contrived Acronyms, a blog that was started by an old bioinformatics colleague of mine. I highly recommend checking it out to see more acronym-derived crimes.

Anyway, here are several nominations for JABBA awards from the latest issue of Bioinformatics:

  1. MaGnET: Malaria Genome Exploration Tool - My main issue with this one is the ungainly capitalization. There is also the issue that the tool is in no way related to magnets, so if you don't remember that it is called MaGnET then it is hard to find. A Google search for 'malaria tool' doesn't feature MaGnET on the first page of hits.
  2. MMuFLR: missense mutation and frameshift location reporter - Okay, so part of me thinks that this is probably meant to be pronounced 'muffler'? If not, then it doesn't really roll off the tongue. "Oh, you're looking to find missense and frameshift mutations? Then you should try em-em-yoo-ef-el-ar". It doesn't help that when you follow the link in the article to the resource (a Galaxy workflow), there is no mention of MMuFLR on the resulting page (unless you search for it).
  3. mRMRe: an R package for parallelized mRMR ensemble feature selection - Some of the same objections that applied to MMuFLR also apply to mRMRe. Try saying this three times fast and you'll see what I mean. It seems that mRMRe is an 'ensemble' implementation of something called mRMR (minimum redundancy maximum relevance). I'm still not sure why the first 'm' escapes capitalization (or why the 'e' for 'ensemble' does too).
  4. TIGAR: transcript isoform abundance estimation method with gapped alignment of RNA-Seq data by variational Bayesian inference - Always a bit suspicious when you have a tool name that features 15 words but somehow gets reduced to 5 letters for the abbreviated name. You could equally make a case that this tool should be called TIRBI. In any case, TIGAR is quite an unusual spelling and you might think that a Google search for TIGAR would put this resource near the top of the results. However, it seems that TIGAR is also The Inland Gateway Association of Realtors as well as The International Gymnastics Academy of the Rockies. Perhaps more importantly, TIGAR already has a scientific connection as it is also the name of a gene (TIGAR: TP53-induced glycolysis and apoptosis regulator).

Thanks to Bioinformatics journal for providing some new JABBA recipients.

Mining Altmetric data to discover what types of research article get the most social media engagement

Altmetric is a service that tracks the popularity of published research articles via the impact that those articles make on social media sites. Put simply, the more an article is tweeted and blogged about, the higher its Altmetric score will be. The scoring system is fairly complex as it also tracks who is reading your article on sites such as Mendeley.

I was pleased to see that the recent Assemblathon 2 paper — on which I am the lead author — gained a lot of mentions on Twitter. Curious, I looked up its Altmetric score and was surprised to see that it was 61. This puts it in the 99th percentile of all articles tracked by Altmetric (almost 1.4 million articles). I imagine that the score may rise a little more in the coming weeks (at the time of writing, the paper is only four days old).

I was then curious as to how the Assemblathon 1 paper fared. Published in September 2011, it has an Altmetric score of 71. This made me wonder where both papers ranked in the entire list of 1,384,477 articles tracked by the service. So I signed up for the free trial of Altmetric (they have a few paid services) and was able to download details of the top 25,000 articles. This revealed that the two Assemblathon papers came in at a not-too-shabby 5,616th and 10,250th place overall. If you're interested, this paper — with an off-the-charts Altmetric score of 11,152 — takes the top spot.

Just to satisfy my curiosity, I made a word cloud based on the titles of the research papers that appear in the top 10,000 Altmetric articles.
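For anyone who wants to try something similar, here is a minimal Python sketch of how such a word cloud could be built. The CSV filename and the 'Title' column name are assumptions about the Altmetric export format, and the wordcloud package is just one of several ways to do this:

# A minimal sketch: build a word cloud from article titles in an Altmetric export.
# The filename and the 'Title' column are assumptions; adjust to match the real export.
import pandas as pd
from wordcloud import WordCloud

titles = pd.read_csv("altmetric_top_10000.csv")["Title"].dropna()
text = " ".join(titles)

# WordCloud removes common English stop words by default
cloud = WordCloud(width=800, height=600, background_color="white").generate(text)
cloud.to_file("altmetric_titles_wordcloud.png")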

[Image: word cloud generated from the titles of the top 10,000 Altmetric-ranked articles]

Perhaps unsurprisingly, this reveals that analyses (and meta-analyses) of data relating to human health prompt a lot of engagement via social media. Okay, time to go and write my next paper 'A Global Study of Human Health Risks'.

Thinking of writing a FASTA parser? 12 scenarios that might break your code

In today's Bits and Bites Coding class, we talked about the basics of FASTA and GFF file formats. Specifically, we were discussing the problems that can arise when you write scripts to parse these files, and the kinds of issues that might be present in a file which could break (or confuse) your script.

Even though you can find code to parse FASTA files in projects such as BioPerl, it can be instructive to try writing your own when you are learning a language. Many of the problems that occur when parsing FASTA files fall into the 'I wasn't expecting the file to look like that' category. I recently wrote about how the simplicity of the FASTA format is a double-edged sword. Because almost anything is allowed, someone will — accidentally or otherwise — produce a FASTA file at some point that contains one of the following 12 scenarios. These are all things that a good FASTA parser should be able to deal with and, if necessary, warn the user about (a minimal parser sketch follows the examples):

> space_at_start_of_header_line
ACGTACGTACGTACGT

>Extra_>_in_FASTA_header
ACGTACGTACGTACGT

>Spaces_in_sequence
ACGTACGT ACGTACGT

>Spaces_in_sequence_and_between_lines
A C 

G T 

A C

A G 

A T

>Redundant_sequence_in_header_ACGTACGTACGT
ACGTACGTACGTACGT

><- missing definition line
ACGTACGTACGTACGT

>mixed_case
ACGTACGTACGTACGTgtagaggacgcaccagACGTACGTACGTACGT

>missing_sequence

>rare, but valid, IUPAC characters 
ACGTRYSWKMBDHVN

>errant sequence
Maybe I accidentally copied and pasted something
Maybe I used Microsoft Word to edit my FASTA sequence

>duplicate_FASTA_header
ACGTACGTACGTACGT
>duplicate_FASTA_header
ACGTACGTACGTACGT

>line_ending_problem^MACGTACGTACGTACGT^MACGTACGTACGTACGT^M>another_sequence_goes_here^MACGTACGTACGTACGT^MACGTACGTACGTACGT
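If you do decide to roll your own parser, the sketch below shows one way of coping with most of the scenarios above (leading spaces before '>', blank lines, spaces within sequences, empty or duplicate headers, missing sequence, mixed case, and non-IUPAC characters). This is only a minimal Python sketch, not the code from the class and no substitute for a mature library; the warning messages and the choice of IUPAC character set are my own:

import warnings

IUPAC_NT = set("ACGTRYSWKMBDHVN")   # valid nucleotide codes (upper case)

def parse_fasta(path):
    """Minimal, defensive FASTA parser: yields (header, sequence) tuples and
    warns about records that match the problem scenarios described above."""
    seen_headers = set()
    header, seq_parts = None, []

    def finish():
        # Validate the record collected so far and return it (or None)
        if header is None:
            return None
        seq = "".join(seq_parts).replace(" ", "").upper()   # strip spaces, normalise case
        if not header:
            warnings.warn("record with an empty definition line")
        if header in seen_headers:
            warnings.warn(f"duplicate FASTA header: {header!r}")
        seen_headers.add(header)
        if not seq:
            warnings.warn(f"no sequence for header: {header!r}")
        bad = set(seq) - IUPAC_NT
        if bad:
            warnings.warn(f"non-IUPAC characters in {header!r}: {sorted(bad)}")
        return header, seq

    with open(path) as fh:              # text mode copes with Unix, Windows and old Mac line endings
        for line in fh:
            line = line.strip()         # handles leading spaces before '>' and trailing whitespace
            if not line:
                continue                # ignore blank lines within and between records
            if line.startswith(">"):
                record = finish()
                if record:
                    yield record
                header, seq_parts = line[1:].strip(), []
            else:
                seq_parts.append(line)
    record = finish()
    if record:
        yield record

# Example usage (the filename is hypothetical):
# for name, seq in parse_fasta("example.fa"):
#     print(name, len(seq))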

Why is N50 used as an assembly metric (and what's the deal with NG50)?

This post started out as a lengthy reply on a thread at SEQanswers.com. I thought it was worth reproducing here (though I've also added some more information)...

Why N50?

Maybe it is worth remembering why we even have N50 as a statistic. The final assembly for any genome project, i.e. the one that is described in a paper or uploaded to a database, is not necessarily the biggest assembly that was generated.

This is because there will always be contigs that are too short to be of any use to anyone. In the pre-NGS era of assembly, these contigs could potentially be represented by a single Sanger read that didn't align to anything else.

However, one might consider that there is still some utility in including a long Sanger read in an assembly, even if it doesn't overlap with any other reads. After quality trimming, there are many Sanger reads that are over 1,000 bp in length and this is long enough to contain an exon or two, or even an entire gene (depending on the species). But what if a contig was 500 bp? Or 250 bp? Or 100 bp? Where do you draw the line?

Clearly it is not desirable to fill up an assembly with an abundance of overly short sequences, even if they represent accurate, and unique, biological sequences from the genome of interest. So it has always been common to remove very short contigs from an assembly. The problem is that different groups might use very different criteria for what to keep and what to exclude. Imagine a simple assembly consisting of the following contigs:

  • 5,000 x 100 bp
  • 100 x 10 Kbp
  • 10 x 1 Mbp
  • 1 x 10 Mbp

Now let's say that Distinguished Bioinformatics Center #1 decides to produce an assembly (DBC1) that includes all of these contigs. However, another DBC decides to make another assembly from the same data, but they remove the 100 bp contigs. A third DBC decides to also remove the 10 Kbp contigs. What does this do to the mean contig length of each assembly?

Mean contig lengths

  • DBC1 = 4,207 bp
  • DBC2 = 189,189 bp
  • DBC3 = 1,818,182 bp

Hopefully it becomes obvious that if you only considered mean contig length, it would be extremely easy to 'game' the system by deliberately excluding shorter contigs (no matter their utility) just to increase the average contig length. But what do we get if we instead rely on N50?

N50 contig lengths

  • DBC1 = 1 Mbp
  • DBC2 = 1 Mbp
  • DBC3 = 10 Mbp

This now reduces the overall discrepancies, and puts DBC1 and DBC2 on an equal footing. But you might still, naively, conclude that DBC3 is the better assembly, and if you were extremely wide-eyed and innocent, then maybe you would conclude that DBC3 was ten times better than the other two assemblies.

So N50 does a better, though still imperfect, job of avoiding the dangers inherent in relying on the mean length. In some ways, the actual method you use to calculate N50 does not matter too much, as long as you use the same method when comparing all assemblies. Back in the days of Sanger-sequence-derived assemblies, it was fairly common to see assembly statistics report not just N50, but everything from N10 through to N90. This gives you a much better insight into the overall variation in contig (or scaffold) lengths.
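To make this concrete, here is a minimal Python sketch that reproduces the mean and N50 values above. It treats N(x) as the length of the contig at which the running total of sorted lengths first reaches x% of the total assembly size, which is one common convention rather than the only one:

def n_x(lengths, x=50):
    # Length of the contig at which the running total of sorted contig
    # lengths first reaches x% of the total assembly size
    threshold = sum(lengths) * x / 100
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length

# The three fictional assemblies from above
dbc1 = [100] * 5000 + [10_000] * 100 + [1_000_000] * 10 + [10_000_000]
dbc2 = [10_000] * 100 + [1_000_000] * 10 + [10_000_000]
dbc3 = [1_000_000] * 10 + [10_000_000]

for name, contigs in [("DBC1", dbc1), ("DBC2", dbc2), ("DBC3", dbc3)]:
    mean = sum(contigs) / len(contigs)
    print(f"{name}: mean = {mean:,.0f} bp, N50 = {n_x(contigs):,} bp")

# Expected output:
#   DBC1: mean = 4,207 bp, N50 = 1,000,000 bp
#   DBC2: mean = 189,189 bp, N50 = 1,000,000 bp
#   DBC3: mean = 1,818,182 bp, N50 = 10,000,000 bp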

In the Assemblathon and Assemblathon 2 contests, we actually plotted all N(x) values to see the full distribution of contig lengths. Except that we didn't use the N50 metric; we used something called NG50. This normalizes for differences in overall assembly sizes by asking 'at what contig/scaffold length — from a set of sorted scaffold lengths — do we see a sum length that is greater than 50% of the estimated or known genome size?'

Returning to my earlier fictional example, let's assume that the estimated size of the genome being assembled was 25 Mbp. This means we want to see what contig length takes us past 12.5 Mbp (when summing all contig lengths from longest to shortest):

NG50 contig lengths

  • DBC1 = 1 Mbp
  • DBC2 = 1 Mbp
  • DBC3 = 1 Mbp

We now see parity between all assemblies, at least when assessing their contig lengths. The actual differences between the assemblies mostly reflect variations in which short sequences are included, which may, or may not, be of any utility to the end user of the assembly. In my mind, this gives us a much fairer way of comparing the assemblies, at least in terms of their average contig lengths.
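In code, NG50 only needs the 50% threshold to be based on the estimated genome size instead of the total assembly size. Reusing the contig lists from the earlier sketch (and again, this is an illustration rather than any official implementation):

def ng_x(lengths, genome_size, x=50):
    # Length of the contig at which the running total of sorted contig lengths
    # first reaches x% of the estimated genome size (None if it never does)
    threshold = genome_size * x / 100
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return None

genome_size = 25_000_000    # estimated genome size from the example
for name, contigs in [("DBC1", dbc1), ("DBC2", dbc2), ("DBC3", dbc3)]:
    print(f"{name}: NG50 = {ng_x(contigs, genome_size):,} bp")

# All three assemblies report NG50 = 1,000,000 bp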

NG(X) graphs

Here is an example of an NG(X) chart that borrows some results from the Assemblathon 2 contest:

[Image: NG(X) plot of scaffold lengths for several Assemblathon 2 assemblies]

Some things to note:

  • Y-axis (scaffold length) is plotted on a log scale
  • Each line represents scaffold lengths from a genome assembly
  • The NG50 value is shown as a dotted line. If you only looked at this you might consider that the red assembly is the best and purple the worst.
  • The first data point in each series reveals the longest scaffold length. This is why you can see horizontal lines for some entries. E.g. the longest scaffold for the red assembly is about 70 Mbp; this exceeds the value of NG1, NG2, NG3, etc.
  • From NG1 to NG50, the relative differences between assemblies are fairly constant. The one exception is that the longest scaffold in the purple assembly is longer than that in the orange assembly, but by NG5 the orange assembly has longer scaffolds.
  • Around NG75, we start to see changes. The lengths of scaffolds in the blue assembly start to drop off.
  • At NG90, the lengths of scaffolds in the green assembly plummet. The fact that this series touches the x-axis (at about NG91) tells us that the sum size of this assembly is less than the estimated genome size of the species in question (i.e. you have to have a value at NG100 to have 100% of the estimated genome size present in your assembly).
  • By NG95, the orange assembly has moved from 5th to 2nd. This just means that the assembly contains a higher proportion of longer scaffolds. However...
  • Three of the assemblies reach NG100 with positive values on the y-axis. This reveals that these assemblies are all bigger than the estimated genome size. Ideally we could continue this plot to see where they would eventually touch the x-axis. If they reached NG200, then we would know that the genome assembly is twice the estimated genome size.
  • Hopefully, this shows that you can get a lot of information out of a plot like this.
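To round things off, here is a minimal sketch of how an NG(x) curve like the one above could be plotted with matplotlib. It reuses ng_x() and the fictional assemblies from the earlier sketches rather than the actual Assemblathon 2 data:

import matplotlib.pyplot as plt

assemblies = {"DBC1": dbc1, "DBC2": dbc2, "DBC3": dbc3}
xs = range(1, 101)

for name, contigs in assemblies.items():
    ys = [ng_x(contigs, genome_size, x) for x in xs]
    # NG(x) is undefined once the assembly runs out of sequence; use NaN to leave a gap
    ys = [y if y is not None else float("nan") for y in ys]
    plt.plot(xs, ys, label=name)

plt.yscale("log")            # contig/scaffold lengths span several orders of magnitude
plt.xlabel("NG(x)")
plt.ylabel("Contig/scaffold length (bp)")
plt.legend()
plt.savefig("ngx_plot.png")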

In summary

N50 is just a better, though still flawed, way of calculating an average sequence length from a set of sequences. It can still be biased, and you really should consider what the estimated/known genome size is (when comparing two or more assemblies of the same species), and/or look to see the full distribution of sequence lengths (e.g. with an NG(x) graph).