Comparisons of computational methods for differential alternative splicing detection using RNA-seq in plant systems →

December 18, 2014 by Keith Bradnam

Marc Robinson-Rechavi (@marc_rr) tweeted about this great new paper in BMC Bioinformatics by Ruolin Liu, Ann Loraine, and Julie Dickerson. From the abstract:

The goal of this paper is to benchmark existing computational differential splicing (or transcription) detection methods so that biologists can choose the most suitable tools to accomplish their goals.

Like so many other areas of bioinformatics, there are many methods available for detecting alternative splicing, and it is far from clear which — if any — is the best. This paper attempts to compare eight of them, and the abstract contains a sobering conclusion:

No single method performs the best in all situations

Figure 5 from the paper is especially depressing. It looks at the overlap of differentially spliced genes as detected by five different methods. There are zero differentially spliced genes that all methods agreed on.

Liu et al. BMC Bioinformatics 2014 15:364   doi:10.1186/s12859-014-0364-4 — Liu et al. BMC Bioinformatics 2014 15:364 doi:10.1186/s12859-014-0364-4

Understanding MAPQ scores in SAM files: does 37 = 42?

December 16, 2014 by Keith Bradnam

The official specification for the Sequence Alignment Map (SAM) format outlines what is stored in each column of this tab-separated value file format. The fifth column of a SAM file stores MAPping Quality (MAPQ) values. From the SAM specification:

MAPQ: MAPping Quality. It equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.

So if you happened to know that the probability of correctly mapping some random read was 0.99, then the MAPQ score should be 20 (i.e. log10 of 0.01 * -10). If the probability of a correct match increased to 0.999, the MAPQ score would increase to 30. So the upper bounds of a MAPQ score depends on the level of precision of your probability (though elswhere in the SAM spec, it defines an upper limit of 255 for this value). Conversely, as the probability of a correct match tends towards zero, so does the MAPQ score.

So I'm sure that the first thing that everyone does after generating a SAM file is to assess the spread of MAPQ scores in your dataset. Right? Anyone?

< sound of crickets >

Okay, so maybe you don't do this. Maybe you don't really care, and you are happy to trust the default output of whatever short read alignment program that you used to generate your SAM file. Why should it matter? Will these scores really vary all that much?

Here is a frequency distribution of MAPQ scores from two mapping experiments. The bottom panel zooms in to more clearly show the distribution of low frequency MAPQ scores:

Distribution of MAPQ scores from two experiments: bottom panel shows zoomed in view of MAPQ scores with frequencies < 1%. Click to enlarge.

What might we conclude from this? There seems to be some clear differences between both experiments. The most frequent MAPQ scores in the first experiment are 42 followed by 1. In the second experiment, scores only reach a maximum value of 37, and scores of 0 are the second most frequent value.

These two experiments reflect some real world data. Experiment 1 is based on data from mouse, and experiment 2 uses data from Arabidopsis thaliana. However, that is probably not why the distributions are different. The mouse data is based on unpaired Illumina reads from a DNase-Seq experiment, wheras the A. thaliana data is from paired Illumina reads from whole genome sequencing. However, that still probably isn't the reason for the differences.

The reason for the different distributions is that experiment 1 used Bowtie 2 to map the reads whereas experiment 2 used BWA. It turns out that different mapping programs calculate MAPQ scores in different ways and you shouldn't really compare these values unless they came from the same program.

The maximum MAPQ value that Bowtie 2 generates is 42 (though it doesn't say this anywhere in the documentation). In contrast, the maximum MAPQ value that BWA will generate is 37 (though once again, you — frustratingly — won't find this information in the manual).

The data for Experiment 1 is based on a sample of over 25 million mapped reads. However, you never see MAPQ scores of 9, 10, or 20, something that presumably reflects some aspect of how Bowtie 2 calculates these scores.

In the absence of any helpful information in the manuals of these two popular aligners, others have tried doing their own experimentation to work out what the values correspond to. Dave Tang has a useful post on Mappinq Qualities on his Musings from a PhD Candidate blog. There are also lots of posts about mapping quality on the SEQanswers site (e.g. see here, here or here). However, the prize for the most detailed investigation of MAPQ scores — from Bowtie 2 at least — goes to John Urban who has written a fantastic post on his Biofinysics blog:

How does Bowtie 2 assign MAPQ scores?

So in conclusion, there are 3 important take home messages:

MAPQ scores vary between different programs and you should not directly compare results from, say, Bowtie 2 and BWA.
You should look at your MAPQ scores and potentially filter out the really bad alignments.
Bioinformatics software documentation can often omit some really important details (see also my last blog post on this subject).

More thoughts on the documentation for the SAM file format

December 15, 2014 by Keith Bradnam

I recently wrote about how bioinformatics documentation is not always very user-friendly. I made some references to the opaque nature of the official SAM Format Specification (specifically the lack of clarity in explaining the nature of what is in column 2 of a SAM file). After writing this post, I raised an issue on the relevant SAMtools GitHub repository:

I feel that many people have trouble understanding what is meant by bitwise FLAG values. The documentation is very technical and not very transparent to people who may be new to bioinformatics.

Many people might be turning to the documentation after looking at their SAM output file. Maybe they see that their output file has a range of integer values in column 2 and are puzzled by the explanation in the documentation (this is very likely if you have no familiarity with bit patterns).

I think this section would be greatly helped by the following:

A reminder that the SAM file itself stores an integer value

An explicit description that a bitwise value of zero means that your read has mapped to the forward strand of the reference

Some specific examples that explain what various integer values correspond to.

Most of the responses to this were of the form well people can go elsewhere if they want to find better documentation that explains this. E.g.

"There is space on SEQwiki for user created format information."

I accept that the official specification for a file format is not necessarily the same thing as a user guide, but people presumably arrive at the official SAM documentation when searching for help with this kind of thing. E.g. if I search for sam format documentation or sam file format, the top hit is the aforementioned SAM Format Specification.

So the advice seems to be that you don't need to bother making your bioinformatics documentation easy to understand because someone else might come along and do this for you.

Kablammo: an interactive, web-based BLAST results visualizer →

December 11, 2014 by Keith Bradnam

Another great name for a piece of bioinformatics software! This tool has just been published in the journal Bioinformatics by Jeff Wintersinger and James Wasmuth. From the abstract:

Kablammo is a web-based application that produces interactive, vector-based visualizations of sequence alignments generated by BLAST. These visualizations can illustrate many features, including shared protein domains, chromosome structural modifications, and genome misassembly.