A paper recently published in Genome Biology by Smolka et al. may offer some help with the problem of choosing which read mapping program to use in order to align a set of sequencing reads to a genome:

Teaser: Individualized benchmarking and optimization of read mapping results for NGS data

The paper starts by neatly summarising the problem:

Recent and ongoing advances in sequencing technologies and applications lead to a rapid growth of methods that align next generation sequencing reads to a reference genome (read mapping). By mid 2015, nearly 100 different mappers are available, although not all are equally suited for a given application or dataset.

The program Teaser attempts to automate the benchmarking of not just different mappers, but also (some of) the different parameters that are available to these programs. The latter problem should not be underestimated. The Bowtie 2 program describes almost 100 different command-line options in its documentation and many of these options control how Bowtie runs and/or what output it generates.

Teaser uses small sets of simulated read data, leading to very quick run times (< 30 minutes for many comparisons), but you can also supply real data to it. By default, Teaser will test the performance of five read mapping programs: BWA, BWA-MEM, BWA-SW, Bowtie2, and NextGenMap.

Impressively, you can run Teaser on the web as well as a standalone program. The web output includes results displayed graphically, with the many different test datasets plotted along the x-axis.

The paper concludes by asking the community to submit optimal parameter combinations to the Teaser GitHub repository:

Teaser is easy to use and at the same time extendable to other methods and parameters combinations. Future work will include the incorporation of benchmarking RNA-Seq mappers and variant calling methods. We furthermore encourage the scientific community to contribute the optimal parameter combinations they detected to our github repository (available at github.com/Cibiv/Teaser) for their particular organism of interest. This will help others to quickly select the optimal combination of mapper and parameter values using Teaser.

I can't wait for the companion program Firecat!

2015-10-26 11.05: Updated to remove specific references to software versions of mapping tools.


I have previously written about the range of mapping quality scores (MAPQ) that you might see in BAM/SAM files, as produced by popular read mapping programs. A very quick recap:

Bowtie 2 generates MAPQ scores between 0–42
BWA generates MAPQ scores between 0–37
Neither piece of software describes the range of possible scores in its documentation
The SAM specification defines the possible ranges of the MAPQ score as 0–255 (though 255 should indicate that mapping quality was not available)
I advocated that you should always take a look at your mapped sequence data to see what ranges of scores are present before doing anything else with your BAM/SAM files
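
As a minimal sketch of that last check, the following Python tallies the MAPQ column (the fifth tab-separated field of an alignment record in the SAM specification) from SAM-formatted text. The function name mapq_histogram and the example records are made up for illustration; for a BAM file you would first need to convert it to SAM text, for example with samtools view.

```python
from collections import Counter


def mapq_histogram(sam_lines):
    """Count how often each MAPQ value occurs in SAM-formatted lines."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        # MAPQ is the fifth mandatory field of a SAM alignment record.
        counts[int(fields[4])] += 1
    return counts


# Made-up records, for illustration only.
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t0\tchr1\t100\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t0\tchr1\t200\t3\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read3\t16\tchr2\t300\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]

# Print each observed MAPQ value with its count, lowest score first.
for mapq, n in sorted(mapq_histogram(example).items()):
    print(mapq, n)
```

Running something like this on the first few thousand records of a file is usually enough to reveal which scores a mapper actually emits.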

So what is my latest gripe? Well, I've recently been running TopHat (version 2.0.13) to map some RNA-Seq reads to a genome sequence. TopHat uses Bowtie (or Bowtie 2) as the tool to do the initial mapping of reads to the genome, so you might expect it to generate the same range of MAPQ scores as the standalone version of Bowtie.

But it doesn't.

From my initial testing, it seems that the BAM/SAM output file from TopHat only contains MAPQ scores of 0, 1, 3, or 50. I find this puzzling and incongruous. Why produce only four MAPQ scores (compared to >30 different values that Bowtie 2 can produce), and why change the maximum possible value to 50? I turned to the TopHat manual, but found no explanation regarding MAPQ scores.

Turning to Google, I found this useful Biostars post which suggests that five MAPQ values are possible with TopHat (you can also have a value of 2 which I didn't see in my data), and that these values correspond to the following:

0 = maps to 10 or more locations
1 = maps to 4-9 locations
2 = maps to 3 locations
3 = maps to 2 locations
50 = unique mapping

The post also reveals that, confusingly, TopHat previously used a value of 255 to indicate uniquely mapped reads. However, I then found another Biostars post which says that a MAPQ score of 2 isn't possible with TopHat, and that the meanings of the scores are as follows:

0 = maps to 5 or more locations
1 = maps to 3-4 locations
3 = maps to 2 locations
255 = unique mapping

This post was in reference to an older version of TopHat (1.4.1) which probably explains the use of the 255 score rather than 50. The comments on this post reflect some of the confusion over this topic. Going back to the original Biostars post, I then noticed a recent comment suggesting that MAPQ scores of 24, 28, 41, 42, and 44 are also possible with TopHat (version 2.0.13).
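
To make the disagreement concrete, here is a small sketch that encodes the second Biostars interpretation as a function of the number of locations a read maps to. It reflects only what that forum post claims, not documented TopHat behaviour, and the function name reported_tophat_mapq and its unique_score parameter are made up for illustration.

```python
def reported_tophat_mapq(num_locations, unique_score=50):
    """MAPQ that TopHat is reported (on Biostars) to assign to a read.

    Assumption: this follows the second Biostars post (0 = 5 or more
    locations, 1 = 3-4 locations, 3 = 2 locations). unique_score is 50
    for recent TopHat versions, or 255 for older ones such as 1.4.1.
    """
    if num_locations < 1:
        raise ValueError("a mapped read has at least one location")
    if num_locations == 1:
        return unique_score
    if num_locations == 2:
        return 3
    if num_locations <= 4:
        return 1
    return 0


# Show the score reported for a few multiplicities.
for n in (1, 2, 3, 4, 5, 12):
    print(n, reported_tophat_mapq(n))
```

Under the first post's interpretation the thresholds differ (and a score of 2 exists), so the same read could be assigned a different meaning depending on which explanation you trust.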

As this situation shows, when there is no official explanation that fully describes how a piece of software should work, it can lead to mass speculation by others. Such speculation can sometimes be inconsistent, which can end up making things even more confusing. This is what drives bioinformaticians crazy.

I find it deeply frustrating when so much of this confusion could be removed with better documentation by the people that developed the original software. In this case the documentation needs just one paragraph added; something along the lines of…

Mapping Quality scores (MAPQ)
TopHat outputs MAPQ scores in the BAM/SAM files with possible values 0, 1, 3, or 50. The first three values indicate mappings to 5 or more, 3–4, or 2 locations, whereas a value of 50 represents a unique match. Please note that older versions of TopHat used a value of 255 for unique matches. Further note that standalone versions of Bowtie and Bowtie 2 (used by TopHat) produce a different range of MAPQ scores (0–42).

Would that be so hard?

Teaser: a solution for our read mapping dilemma?

More madness with MAPQ scores (a.k.a. why bioinformaticians hate poor and incomplete software documentation)