Filtering a SAM file generated by TopHat to find uniquely mapped, concordant read pairs: AWK vs SAMtools

Software suites like SAMtools (or should that be [Ss][Aa][Mm]tools?) offer a powerful way to slice and dice files in the SAM, BAM, and CRAM file formats. But sometimes other approaches work just as well.

If you have aligned paired RNA-Seq read data to a genome or transcriptome using TopHat, you may be interested in filtering the resulting SAM/BAM file to keep only those reads that are:

  • a) uniquely aligned (only match one place in the target sequence)
  • b) mapped as concordant read pairs (both reads in a pair map to the same sequence, in the correct orientation, and with a suitable distance between them)

TopHat has a --no-discordant command-line option which only reports read alignments if both reads in a pair can be mapped, but the name of this option is somewhat misleading, as you still end up with discordantly mapped read pairs in the final output (it is always good to check what difference a command-line option actually makes to your output!).

So if you have a TopHat SAM file and you want to filter it to keep only uniquely mapped, concordant read pairs, you could use two of the options that the samtools view command provides:

  • -q 50 — This filters on the MAPQ field (5th column of SAM file). TopHat uses a value of 50 in this field to denote unique mappings (this important piece of information is not in the TopHat manual).
  • -f 0x2 — This option filters on the bitwise FLAG field (column 2 of the SAM file), and will extract only those mappings where the second bit is set. The SAM documentation describes this bit as 'each segment properly aligned according to the aligner'. In practice this means looking for FLAG values of 83, 99, 147, or 163 (see this helpful Biobeat blog post for more information).

So if you have a SAM file, in essence you just need to filter that file based on matching certain numbers in two different columns. This is something that the Unix AWK tool excels at, and unlike SAMtools, AWK is installed on just about every Unix/Linux system by default. So do both tools give you the same result? Only one way to find out:

Using SAMtools

The 'unfiltered.sam' file is the result of a TopHat run that used the --no-discordant and --no-mixed options. The SAM file contained 34,340,754 lines of mapping information:

time samtools view -q 50 -f 0x2 unfiltered.sam > filtered_by_samtools.sam

real    1m57.068s
user    1m18.705s
sys     0m13.712s

Using AWK

time awk '$5 == 50 && ($2 == 163 || $2 == 147 || $2 == 83 || $2 == 99) {print}' unfiltered.sam  > filtered_by_awk.sam

real    1m31.734s
user    0m46.855s
sys     0m15.775s

Does it make a difference?

wc -l filtered_by_*.sam
31710476 filtered_by_awk.sam
31710476 filtered_by_samtools.sam

diff filtered_by_samtools.sam filtered_by_awk.sam

No difference in the final output files, with AWK running quite a bit quicker than SAMtools. In this situation, filtering throws away about 8% of the mapped reads.
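
One final aside: samtools view -f 0x2 keeps any FLAG value that has the proper-pair bit set, whereas the AWK command only matches four specific values and would silently drop proper pairs that also carry other bits (secondary alignments, for example). The two approaches agree on this particular file, but they are not equivalent in general. If your awk is GNU awk (gawk), you can test the bit directly; a minimal sketch (and() is a gawk extension, so this will not work in every awk):

awk 'and($2, 2) && $5 == 50' unfiltered.sam > filtered_by_gawk.sam

Also note that samtools -q 50 means 'a MAPQ of at least 50'; testing $5 == 50 gives the same result here only because 50 is the highest MAPQ value that TopHat assigns.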

More duplicate names for bioinformatics software: a tale of two HIPPIES

Thanks to Sara Gosline (@sargoshoe) for bringing this to my attention. Compare and contrast the following:

The former tool, published in 2012 in PLOS ONE, takes its name from 'Human Integrated Protein-Protein Interaction rEference' (it was doing so well until it reached the last letter). The latter tool ('High-throughput Identification Pipeline for Promoter Interacting Enhancer elements') was published in 2014 in the journal Bioinformatics.

Leaving aside the question of whether these names are worthy of a JABBA award, the issue here is that we have yet another duplicate set of software names for two different bioinformatics tools. The authors of the second paper could, and should, have checked for 'prior art'.

If you are planning to develop a new bioinformatics tool and have thought of a possible name, please take the time to do the following:

  1. Visit http://google.com (or your web search engine of choice)
  2. In the search box type the proposed name of your tool followed by a space
  3. Then add the word 'bioinformatics'
  4. Click search
  5. That's it
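
Or, to compress steps 1-4 into a single line at the terminal (a tongue-in-cheek sketch: 'open' is the macOS command for launching a URL in your browser; Linux users will want xdg-open):

open 'https://www.google.com/search?q=HIPPIE+bioinformatics'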

Inconsistent bioinformatics branding: SAMtools vs Samtools vs samtools

The popular Sequence Alignment Map format, SAM, has given rise to an equally popular toolkit for working with SAM files (and BAM, CRAM too). But what is the name of this tool?


SAMtools?

If we read the official publication, then we see this software described as 'SAMtools' (also described by Wikipedia in this manner).

Samtools?

Head to the official website and we see consistent references to 'Samtools'.

samtools?

Head to the official GitHub repository and we see consistent references to 'samtools'.


This is not exactly a problem that is halting the important work of bioinformaticians around the world, but I find it surprising that all of these names are in use by the people who developed the software. Unix-based software is typically (but not always) implemented as a set of lower-case commands, and this can add a level of confusion when comparing a tool's name to the actual commands that are run ('samtools' is what you type at the terminal). However, you can still be consistent in your documentation!

How do people choose a single isoform of a gene to use for bioinformatics analyses?


Update 2015-09-29: in addition to the comments at the end of the post below, also see the follow-up post that I wrote, which offers some more suggestions including the APPRIS database/webserver (which looks very useful).


This post is somewhat of a follow-up to something that I wrote earlier this week. In bioinformatics, we often want to analyze all genes from an organism (or from multiple organisms). In well-annotated genome databases, there is often a choice of isoforms available for each protein-coding gene, and the number of isoforms only ever seems to increase.

For example, in the latest set of human gene annotations (Ensembl 78), there are 406 protein-coding genes that have more than 25 transcripts. At one extreme, the human GPR56 gene has 77 transcripts, 61 of which are annotated as protein-coding! The length of these 61 putative protein products ranges from just 6 amino acids (!) all the way up to 693.
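
If you want to derive this kind of count for yourself, the transcript lines of an Ensembl GTF file are enough, and AWK is again up to the job. A minimal sketch, assuming the Ensembl 78 human GTF (the filename here is a guess at the conventional one); note that this counts all annotated transcripts per gene, not just protein-coding ones, so its output will not exactly match the figures quoted above:

awk -F '\t' '$3 == "transcript" {
    match($9, /gene_id "[^"]+"/)                 # pull gene_id out of the attributes column
    n[substr($9, RSTART + 9, RLENGTH - 10)]++    # count transcripts per gene
}
END { for (g in n) if (n[g] > 25) c++; print c + 0, "genes with more than 25 transcripts" }' Homo_sapiens.GRCh38.78.gtf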

In Caenorhabditis elegans, sequence identifiers for genes were historically based on appending numbers to the identifier of the BAC/YAC/Cosmid clone containing that gene. E.g. B0348.1 would represent the first predicted gene on the B0348 clone, B0348.2 the second gene…and so on. When splice variants were discovered, curators appended letters for each isoform. E.g. B0348.2a and B0348.2b represent the two alternative isoforms of this gene. In the latest WS248 release of WormBase, one gene (egl-8) has 25 isoforms (all the way up to B0348.4y). I wonder what WormBase will do when a 27th isoform is discovered and the single-letter scheme runs out of alphabet.

So how does one attempt to choose a single variant for use in a bioinformatics pipeline, and is this something that we should even be attempting? Historically, people have often opted for a quick-and-easy approach in order to get around this problem. Some examples from papers indexed by Google Scholar:

"In cases of alternative splicing, we chose the longest protein to represent a gene"

"In cases of multiple transcript isoforms, we chose the isoform with the longest CDS supported by transcript and protein homology in other mammalian species"

"Because of the redundancy of protein sequences, we chose only the longest isoform for every entry"

"In cases where a gene possesses more than one reference sequence, we chose the longest"

"When multiple protein entries are found for the same EntrezGene identifier, choose the longest sequence isoform"

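For what it's worth, the 'longest isoform' approach quoted above boils down to very little code, which is presumably part of its appeal. A minimal sketch, assuming a protein FASTA file whose headers consist of a transcript ID followed by its parent gene ID (a hypothetical format; real headers vary, which is one of the annoyances of this approach):

awk '/^>/ { # a new record starts: first bank the previous one if it is the longest seen so far
        if (tid != "" && length(seq) > len[gid]) { len[gid] = length(seq); pick[gid] = tid }
        tid = substr($1, 2); gid = $2; seq = ""; next
      }
      { seq = seq $0 }  # accumulate sequence lines
  END { if (tid != "" && length(seq) > len[gid]) { len[gid] = length(seq); pick[gid] = tid }
        for (g in pick) print g, pick[g], len[g] }' proteins.fa

Even this sketch has to make an arbitrary decision (ties go to whichever isoform appears first in the file), which rather underlines the problem.
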
This methodology is obviously not without problems (as others have reported). So I'm genuinely curious as to what people do in order to choose a 'representative' isoform (whatever that means). The problem is further complicated by the likelihood that some genes consistently use different isoforms in different tissues or at different developmental time points.

Please comment below if you think you have found a good solution to this problem!

24 carat JABBA awards


Here is a new paper published in the journal PLOSBuzzFeed…sorry, I mean PLOS Computational Biology:

It's a good job that they mention the name of the algorithm ninety-one times in the paper, otherwise you might forget just how bogus the name is. At least DIAMOnD has that lower-case 'n' which means that no-one will confuse it with:

This second DIAMOND paper dates all the way back to November 2014. Where does this DIAMOND get its name?

Double Index AlignMent Of Next-generation sequencing Data

This DIAMOND gets a bonus point for having a website link in the paper which doesn't seem to work.

So DIAMOnD and DIAMOND are both the latest recipients of JABBA awards for giving us Just Another Bogus Bioinformatics Acronym.

101 questions with a bioinformatician #24: Sara Gosline

Sara Gosline is a postdoc in the Fraenkel Lab in the Department of Biological Engineering at MIT. Her current work focuses on studying the impact of microRNA changes on global mRNA expression. As her postdoc comes to an end, Sara is seeking a tenure-track faculty position where she can further explore the broader impacts of RNA regulation and better interpret gene expression data in a network context (contact her if interested).


Excellent blog post about coding and documentation

There was an exchange on Twitter today between several bioinformaticians regarding the need for good documentation of bioinformatics tools. I was all set to write something about my own thoughts on this topic, but Robert Davey (@froggleston) has already written an excellent post on the subject (and has probably done a better job of expressing my views than I could have):

I highly recommend reading his post as he makes some great points, including the following:

We need, as a community, usable requirements and standards for saying "this is how code should go from being available to being reusable". How do we get our lab notebook code into that form via a number of checkpoints that both programmers and reviewers agree on?


Transcriptional noise, isoform prediction, and the utility of mass spec data in gene annotation

The human genome may be 80% functional or 8.2% functional. Maybe it's 93.7% functional or only 6.1%. I guess that all we know for sure is that it is not 0% functional (although my genome on a Monday morning may provide evidence to the contrary).

Transcript data can be used to ascribe some sort of functionality to a genome and, in an ideal world, we would sequence full-length cDNAs for every gene. But in the less-than-ideal world we often end up sequencing lots of small bits of RNA using an ever-changing set of technologies. ESTs, SAGE, CAGE, RACE, MPSS, and RNA-Seq have all been used to provide evidence for where genes are and how highly they are being expressed.

Having some transcriptional evidence is (usually) better than not having any transcriptional evidence, but it doesn't necessarily imply functionality. A protein-coding gene that is transcribed may not be translated. Transcript data is used in gene annotation to add new genes, especially in the case of a first-pass annotation of a new genome. But in established genomes, it is probably used more to annotate transcript isoforms (e.g. splice variants). This can lead to a problem for the end users of such data…how to tell if all isoforms are equally likely?

Consider the transcript data for the rpl-22 gene in Caenorhabditis elegans. This gene has two annotated splice variants and there is indeed EST evidence for both variants, but it is a little bit unbalanced:

This gene encodes a large ribosomal subunit protein…a pretty essential protein! Notice how the secondary isoform (shown on top) a) encodes a much shorter protein and b) has very little transcript evidence. In my mind, this secondary isoform is the result of 'transcriptional noise'. Maybe a couple of lucky ESTs captured the transcript in the process of heading towards destruction via nonsense-mediated decay? It seems highly unlikely that this secondary transcript gives rise to a functional protein, though someone who is new to viewing data like this might initially consider each isoform as equally valid.

If we turn on some additional data tracks to look at protein homology to human (shown in orange) and mass spectrometry data from C. elegans (shown in red), it becomes clear that all of the evidence really points towards just one functional isoform:

Indeed, mass spec data has the potential to really clean up a lot of noisy gene annotations. In light of this, I was very happy to see this new paper published in the Journal of Proteome Research (where talented up-and-coming scientists publish!):

Pooling data from 8 mass spec analyses of human data, the authors attempted to see how much protein support there was for the different annotated isoforms of the human genome. They could reliably map peptides to about two-thirds of the protein-coding genes from the GENCODE 20 gene set (Ensembl 76). What did they find?

We identified alternative splice isoforms for just 246 human genes; this clearly suggests that the vast majority of genes express a single main protein isoform.

They also found that the mass spec data was not always in agreement with the dominant isoforms that can be predicted from RNA-Seq data:

…part of the reason for this will be that more RNAseq reads map to longer sequences, it does suggest that either transcript expression is very different from protein expression for many genes or that transcript reconstruction methods may not be interpreting the RNAseq reads correctly.

The headline conclusion that mass spec evidence only supports alternate isoforms for 1.2% of human genes is thought-provoking. It suggests to me that we should be careful about relying too heavily on gene annotations that describe large numbers of isoforms based mostly on transcript data. Paraphrasing George Orwell:

All isoforms are equal, but some isoforms are more equal than others

The top 10 #PLOSBuzzFeed tweets that will put a smile on your face

It all started so innocently. Nick Loman (@pathogenomenick) expressed his dissatisfaction with yet another PLOS Computational Biology article that uses the 10 Simple Rules… template:


There were two immediate responses from Kai Blin (@kaiblin) and Phil Spear (@Duke_of_neural):


I immediately saw the possibility that this could become a meme-worthy hashtag, so I simply retweeted Phil's tweet, added the hashtag #PLOSBuzzFeed, and waited to see what would happen (as well as making some of my own contributions).

At the time of writing — about 10 hours later — there have been several hundred tweets using this hashtag. Presented in reverse order, here are the most ‘popular’ tweets from today (as judged by summing retweets and favorites):