How can you choose a single isoform to best represent a gene?

I asked this question a few weeks ago. There are situations in bioinformatics where you want to look at all genes from one or more organisms, but you only want a single representative of each gene. This is not always a straightforward task, and I previously discussed how many people opt for overly simplistic methods such as 'choosing the longest isoform'.

Since writing that post I have had some feedback and have also come across a relevant paper, both of which I would like to share here.

Feedback

Sara G said:

We use epigenetic data from ENCODE/modENCODE/Epigenetic roadmap or our own data to identify the transcription start site that has the most evidence of transcriptional activity.

In my experience with data from fly, worm, and Arabidopsis, I have often seen genes with multiple isoforms that all share the same transcription start site (and differ only in their downstream exon/intron structure).

Michael Paulini commented:

What you can easily do (if you got the data) is just pick the isoform with the highest FPKM value per gene.

The important caveat that Michael mentions is that you might not have the data, and if you are trying to do this as part of a multi-species comparison then it becomes a much more complex task.

Richard Edwards added to the debate (I've edited his response a little bit):

My previous solution to this problem was this:

"…a dataset consisting of one protein sequence per gene was constructed from the EnsEMBL human genome…by mapping all EnsEMBL human peptides onto their genes and assessing them in the context of the external database entry used as evidence for that gene. If the external database was SwissProt and one of the peptides has the exact same sequence as the SwissProt sequence, this peptide sequence was used for that gene. In all other cases, the longest peptide (in terms of non-X amino acids) was used."

Richard also conceded that this approach "only works for well annotated proteomes".
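
To make that rule concrete, here is a minimal sketch in Python of how the selection step might look. The function name and the example peptides are hypothetical, and it assumes you have already grouped the Ensembl peptides for each gene and looked up the SwissProt sequence (if any) used as evidence for that gene.

```python
def pick_representative_peptide(peptides, swissprot_seq=None):
    """Choose one peptide per gene: prefer a peptide identical to the
    SwissProt evidence sequence; otherwise take the longest peptide,
    counting only non-X (known) residues."""
    if swissprot_seq is not None and swissprot_seq in peptides:
        return swissprot_seq
    return max(peptides, key=lambda pep: len(pep) - pep.upper().count("X"))

# Hypothetical example: two isoform peptides for one gene
print(pick_representative_peptide(["MKTAYIAKQR", "MKTAYIXXXX"]))  # -> MKTAYIAKQR
```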

Another approach

For Arabidopsis thaliana, I have previously used data from their exon confidence ranking system, which produces a five-star rating for every transcript of a gene. This is based on various strands of evidence and can be useful for choosing between isoforms that have different star ratings (but not so helpful when all isoforms have the same rating). The TAIR database remains the only model organism database (that I know of) that has attempted to address this problem and provide a systematic ranking of this kind. Anecdotally, I would say that not many people know about this system, and you need to find the relevant files on their FTP site to make use of this information.
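
As a sketch of how you might use such ratings, here is a short Python example that keeps the highest-rated transcript per gene. The input file name and its three-column layout (gene, transcript, star rating) are assumptions rather than TAIR's actual file format, and ties (e.g. when every isoform has the same rating) are simply broken by whichever transcript appears first.

```python
import csv

# Assumed input: tab-delimited lines of gene_id, transcript_id, star_rating (0-5)
best = {}  # gene_id -> (star_rating, transcript_id)
with open("transcript_star_ratings.tsv") as fh:
    for gene, transcript, stars in csv.reader(fh, delimiter="\t"):
        stars = int(stars)
        # keep the highest-rated transcript seen so far for this gene
        if gene not in best or stars > best[gene][0]:
            best[gene] = (stars, transcript)

for gene, (stars, transcript) in sorted(best.items()):
    print(gene, transcript, stars, sep="\t")
```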

Hopefully someone will correct me if other databases have implemented similar schemes!

From the literature

If your research focuses on vertebrates, then the approach described in the following paper looks promising.

Here's the abstract (emphasis mine):

Here, we present APPRIS (http://appris.bioinfo.cnio.es), a database that houses annotations of human splice isoforms. APPRIS has been designed to provide value to manual annotations of the human genome by adding reliable protein structural and functional data and information from cross-species conservation. The visual representation of the annotations provided by APPRIS for each gene allows annotators and researchers alike to easily identify functional changes brought about by splicing events. In addition to collecting, integrating and analyzing reliable predictions of the effect of splicing events, APPRIS also selects a single reference sequence for each gene, here termed the principal isoform, based on the annotations of structure, function and conservation for each transcript. APPRIS identifies a principal isoform for 85% of the protein-coding genes in the GENCODE 7 release for ENSEMBL. Analysis of the APPRIS data shows that at least 70% of the alternative (non-principal) variants would lose important functional or structural information relative to the principal isoform.

It seems that since this paper was published, they have expanded their dataset and the online APPRIS database now has annotations for five vertebrate genomes (human, mouse, zebrafish, rat, and pig).

The future

This APPRIS approach seems promising, yet I wish there were a standardized ranking system used by all model organism databases (as well as Ensembl, the UCSC Genome Browser, etc.).

There are always going to be different levels of supporting evidence available in different species. For example, not every organism is going to be able to make use of mass spec data to assess the differential usage of transcript variants. Furthermore, there are going to be tissue-specific (not to mention time-specific) patterns of differential isoform expression for some genes.

However, what I would love to see is each database using whatever information is available to them to assign relative 'weights' to each annotated isoform. Initially, this would focus on protein-coding genes, and ideally the scheme would entail the following:

  1. A score for every annotated isoform of a gene where all scores sum to 100. In my favorite example of the Caenorhabditis elegans rpl-22 gene these scores would probably end up as 99 for the dominant isoform and 1 for the secondary isoform. The scores should reflect the relative likelihood that this transcript encodes a functional protein. To keep things simple, I would prefer integer values with values of zero allowed for annotated isoforms that represent pseudogenic transcripts, targets of nonsense-mediated decay etc.
  2. Evidence for the score. This would be something akin to the Gene Ontology evidence codes. Each gene may have one or more codes attached to it, and these would reflect the source of the experimental evidence (transcript data, mass spec, functional assay, etc.).
  3. Spatial/temporal information. If some of the supporting evidence is from a particular cell-line, tissue, or developmental time point then it should be possible to annotate this.
  4. Historical information. Such scores should be expected to change over time as new evidence emerges. All transcripts should keep a time-stamped history of previous scores to allow researchers to see whether isoforms have increased or decreased in their relative probability of usage over time.
  5. Data to be added to GTF/GFF files. The score of each transcript should be embedded in the attributes column of these files, maybe as an 'RF' (Relative Functionality) tag; see the sketch below.
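
As a sketch of item 5, here is how such a score might be written into the attributes column of a GFF3 file. The transcript IDs, scores, file names, and the 'RF' tag itself are all hypothetical; no current format defines this attribute.

```python
# Hypothetical relative functionality scores for the isoforms of one gene
rf_scores = {"rpl-22.1": 99, "rpl-22.2": 1}

with open("genes.gff3") as infile, open("genes_with_rf.gff3", "w") as outfile:
    for line in infile:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9 and fields[2] == "mRNA":
            # GFF3 attributes are semicolon-separated Tag=Value pairs
            attrs = dict(p.split("=", 1) for p in fields[8].split(";") if "=" in p)
            transcript_id = attrs.get("ID", "")
            if transcript_id in rf_scores:
                fields[8] += f";RF={rf_scores[transcript_id]}"
        outfile.write("\t".join(fields) + "\n")
```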

Can someone please make this happen? :-)

The dangers of default parameters in bioinformatics: lessons from Bowtie and TopHat

Updated 2015-04-28 to include a reference to a relevant blog post by Nick Loman.

Most bioinformatics tools are equipped with a vast array of command-line options which let the user configure the inputs, outputs, and performance of the software. You may not wish to explore every possible option when using a particular piece of software, but you should always try to have a look at the manual. In fact I would go so far as to say:

Never use a piece of bioinformatics software for the first time without looking to see what command-line options are available and what default parameters are being used

If a bioinformatics tool provides command-line options, it's a good bet that some of those options will have default parameters associated with them. The most common default parameter that you should always check for might be called something like -p, -t, or --threads; i.e., you should always check whether a program can make use of multiple CPUs/cores to speed up its operation. The default value for such a parameter is usually '1'.

Failure to check (and change) this option will merely result in your program taking longer than it otherwise could, so no harm done. More serious issues arise when you stick with default parameters which might actually make you miss interesting results, or produce misleading results.

For example, the TopHat RNA-Seq aligner has a command-line option, -i, which lets you set a minimum intron length. The manual reveals that TopHat will ignore donor/acceptor pairs closer than this many bases apart. The default value for this parameter is 70 bp, which is probably fine for all vertebrate genomes but probably not suitable for genomes such as those of Arabidopsis thaliana or Caenorhabditis elegans, where many introns are shorter than this.
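
Before trusting that default, it is worth checking the intron length distribution of your own annotation. Here is a rough Python sketch that counts annotated introns shorter than 70 bp from a GFF3 file; the file name is made up, and it assumes each exon line carries a single Parent= attribute naming its transcript.

```python
from collections import defaultdict

exons_by_transcript = defaultdict(list)
with open("annotation.gff3") as fh:
    for line in fh:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9 and fields[2] == "exon":
            attrs = dict(p.split("=", 1) for p in fields[8].split(";") if "=" in p)
            if "Parent" in attrs:
                exons_by_transcript[attrs["Parent"]].append((int(fields[3]), int(fields[4])))

intron_lengths = []
for exons in exons_by_transcript.values():
    exons.sort()
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        intron_lengths.append(start2 - end1 - 1)  # gap between adjacent exons

short = sum(1 for length in intron_lengths if length < 70)
print(f"{short} of {len(intron_lengths)} annotated introns are shorter than 70 bp")
```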

Now let's look at some specific examples which shed some light on what can happen if you stick with the default parameters.

Testing Bowtie

I used Bowtie 2 to align two FASTQ files of bovine RNA-Seq data to the reference cow transcriptome. These two files represented paired read data, with about 25 million reads in each file. Following alignment, I filtered the data to keep only uniquely mapped, concordant read pairs, and further restricted it to read pairs that aligned with the highest possible mapping quality (for Bowtie 2, this means a MAPQ value of 42).
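
For anyone wanting to apply a similar filter, here is a minimal pysam sketch that keeps concordant ('proper') pairs with MAPQ 42 and also computes the inner read pair distance discussed below. The BAM file name is made up, and the calculation assumes both mates have the same read length.

```python
import pysam

inner_distances = []
with pysam.AlignmentFile("bovine_rnaseq_bt2.bam", "rb") as bam:
    for read in bam:
        # keep concordant pairs at Bowtie 2's maximum MAPQ (42 = unique, best quality);
        # TLEN > 0 ensures each pair is counted once (via its leftmost mate)
        if read.is_proper_pair and read.mapping_quality == 42 and read.template_length > 0:
            # inner distance = fragment length (TLEN) minus both read lengths,
            # assuming equal-length mates
            inner_distances.append(read.template_length - 2 * read.query_length)

print(f"{len(inner_distances):,} pairs; mean inner distance "
      f"{sum(inner_distances) / len(inner_distances):.1f} bp")
```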

From the resulting set of aligned reads, I calculated the inner read pair distance (see this great post about the difference between this and insert sizes), which I then plotted:

Thicker red vertical line indicates mean.

You will notice that most read pairs have a negative inner read pair distance, i.e. the reads overlap. But do you notice anything odd about these results? Look at the far right of the distribution…the maximum inner read pair distance cuts off abruptly at 300 bp. To me this is very suspicious. After looking in the Bowtie manual I discovered this:

-X <int> The maximum fragment length for valid paired-end alignments. Default: 500 bp.

Aha! My reads are 100 bp, so the cutoff at 300 bp in the above graph corresponds to fragment lengths of 500 bp (500 bp minus two 100 bp reads leaves an inner distance of 300 bp). If there are read pairs that correctly map further than 300 bp apart, I am not going to see them because of the default value of -X. So what happens if I increase this default parameter to 2,000 bp?

Thicker red line indicates mean of dataset with -X 500, thicker blue line indicates mean of dataset at -X 2000.

I have colored the new data in blue; we now see many more read pairs that map further than 300 bp apart. There are actually still some reads that correctly map even further than 1,800 bp apart (the limit imposed by -X 2000). So not only does this default parameter of Bowtie mean that I lose some data, it also means that I lose the type of data that can be particularly useful in applications where we especially want the reads that map far apart (e.g. transcriptome assembly).

It turns out that Nick Loman was writing about the dangers of the -X parameter of Bowtie back in 2013, and I encourage everyone to read his blog post: Use -X with bowtie2 to set minimum and maximum insert sizes for Nextera libraries.

Testing TopHat

The TopHat program actually uses Bowtie for the alignment phase, and there are a couple of default parameters used by TopHat that might cause issues for you. From the TopHat manual:

-r <int>  This is the expected (mean) inner distance between mate pairs. The default is 50 bp.

--mate-std-dev <int> The standard deviation for the distribution on inner distances between mate pairs. The default is 20 bp.

The default value for --mate-std-dev seems worryingly low; how many biological datasets ever have standard deviations that are less than half the mean? So I did some testing…

I used TopHat to map a small sample of my RNA-Seq data (200,000 paired reads) to the cow transcriptome (using the -T option of TopHat). In addition to using the defaults, I also tried various runs that used different values for the -r and --mate-std-dev options. My run with default parameters ended up with 150,235 reads mapped to the transcriptome, whereas increasing those parameters (-r 100 and --mate-std-dev 125) produced fewer reads in my output (138,970). So you might conclude that the default parameters are performing better. However, of the 150,235 reads mapped with default parameters, only 84,996 represent high quality alignments (uniquely mapped, concordant read pairs). My custom parameters produced 105,748 high quality alignments. So the custom parameters are giving me better results overall.
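
For reference, here is one way such high-quality alignments could be counted from TopHat's accepted_hits.bam output, using pysam and the NH (number of hits) tag to judge uniqueness. Treat this as a sketch of the idea rather than the exact filter I ran.

```python
import pysam

high_quality = 0
with pysam.AlignmentFile("tophat_out/accepted_hits.bam", "rb") as bam:
    for read in bam:
        # concordant pair and uniquely mapped (NH == 1 means a single reported hit)
        if read.is_proper_pair and read.has_tag("NH") and read.get_tag("NH") == 1:
            high_quality += 1

print(f"{high_quality:,} reads in uniquely mapped, concordant pairs")
```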

Conclusions

It takes time and effort to properly test command-line options, especially for programs like Bowtie and TopHat, which have a lot of them. For TopHat, I ended up doing 40 different test runs in order to investigate the effects of various parameters and work out what was most suitable (and effective) for my input data.

Bioinformatics programs should never be treated as 'black box' devices. You may get results out of them, but you will never know whether you could have obtained more and/or better results by changing some of those default parameters.

7 things that brilliant scientists do: the last item might surprise you!

I was presenting in our weekly lab meeting and I used the occasion to talk on a few different subjects (each of which will probably end up as a separate blog post). I started by trying to get people to think about the different things that scientists do and I offered my own list of suggestions:

The last item on the list is maybe not something that everyone thinks about. I believe that scientists should have some curiosity about their field (and about science in general), and one way that this can manifest itself is by asking questions.

I then sprang a surprise on the lab by revealing that, for the last year, I've secretly been tracking which people ask questions in our lab meetings. The data is incomplete: I have probably missed some questions, and not everyone has been to every lab meeting. Also, we have some new people in the lab who weren't here a year ago. But in any case, I showed the (anonymized) data, which shows how many times (out of 41 lab meetings) someone asked at least one question:

I stressed that it isn't automatically a good thing if you are on the left-hand side of the graph, and it isn't automatically a bad thing if you are on the right-hand side. But clearly, we have some people in our lab meetings who ask more questions than others. I then shared some reasons why I think many people do not ask questions:

The 2nd point is very common for younger scientists, and I fully understand that people can be nervous about asking what they feel might be a stupid question (especially to people that they perceive as being 'above them'). But if you don't understand something…then it is not a stupid question. It is only stupid if you never ask and instead sit through meetings being confused.

The 3rd point sometimes happens when it is the speaker who is the junior person and the audience doesn't want to risk giving them a hard time. It is laudable to be kind to others, but this doesn't mean that you can't ask questions. If nobody ever asked questions, then scientists would never get useful feedback about which parts of their talks are confusing or unclear.

Learning to ask (and answer) questions is a useful and necessary skill for scientists. If you can't practice this skill in the relative safety of a lab meeting, then when will you practice it? I ended this section of my lab talk by pointing out that your scientific curiosity (or lack thereof) is something that may be used when people assess you (for job interviews, promotions, etc.):

Viewing online figures in a Nucleic Acids Research article…an exercise in frustration

I found a Nucleic Acids Research article that I wanted to read online. The article in question had — like most articles — some figures. Here's how my web browser displays one of the figures:

I've blurred out the surrounding text so as not to risk any copyright issues (and also to let you focus on just the figure). If your eyesight is like mine, you may feel that the three subpanels of this figure are too small to be of much use.

So I clicked the image…

The white rectangle enclosing the figure increases in size from about 250 x 130 pixels to about 460 x 210. If I squint, I can just about make out some of the figure text, but it remains far from clear.

So I clicked the image…

Now I see exactly the same image as before, but it's in a new webpage which has a wider figure legend. The new legend is actually a little bit harder to read than before, as there are more words per line. I'm really not sure what the point of this page is.

So I clicked the image…

Hooray! Now I can finally see the figure at a decent size where all of the figure text is legible. It only took me three clicks to get there. To make sense of the figure I turn to the legend, only to find that there is no legend!

So you can have your figure at a reasonable size without the legend, or you can have the legend but only with a small version of the figure. It is obviously beyond the journal's ability to give you both.