When will ‘open science’ become simply ‘science’?

A great commentary piece by Mick Watson (@BioMickWatson) in Genome Biology, in which he discusses the six O's (Open data, Open access, Open methodology, Open source, Open peer review, and Open education). On the first of these, open data:

…it is no longer acceptable for scientists to hold on to data until they have extracted every last possible publication from it. The data do not belong to the scientist, they belong to the funder (quite often the taxpayer). Datasets should be freely available to those who funded them. Scientists who hoard data, far from pushing back the boundaries of human knowledge, instead act as barriers to discovery.

Amen.

101 questions with a bioinformatician #26: Kerstin Howe

Kerstin Howe is a Senior Scientific Manager, leading the Genome Reference Informatics group at the Wellcome Trust Sanger Institute. As part of the Genome Reference Consortium (GRC), Kerstin’s group is helping ensure that “the human, mouse and zebrafish reference assemblies are biologically relevant by closing gaps, fixing errors and representing complex variation”.

This important work entails the generation of ‘long range’ information (sequencing and optical mapping) for a variety of genomes and using that information to provide genome analyses, visualise assembly evaluations, and curate assemblies. You may also wish to check out 101 questions interviewee #3 (Deanna Church), another key player in the GRC.

Kerstin is not the first ‘101 Questions’ interviewee whom I know from my time working on the WormBase project. Unlike with interviewee #23 though, I did have the pleasure of sharing an office with Kerstin — or WBPerson3103 as she will forever be known in WormBase circles — during my time at the Sanger Institute. It was after leaving WormBase that she became a big fish (of a little fish) in the vertebrate genomics community. And now, on to the 101 questions…


How do we assess the value of biological data repositories?

For the most part, model organism databases such as WormBase, FlyBase, SGD etc. offer an invaluable and indispensable resource to their respective communities. These sites exist because funding agencies recognize their essential nature, and they are typically funded on five-year grants.

There are many other bioinformatics resources out there, most of which have probably been useful to some people at some time or other. But how should we decide what is useful enough to merit continued funding? It's a very tricky question, and one that Todd Harris is starting to explore on his blog:

In an era of constrained funding and shifting focus, how do we effectively measure the value of online biological data repositories? Which should receive sustained funding? Which should be folded into other resources? What efficiencies can be gained over how these repositories are currently operated?

Worth a read. I've written before about the issue of whether there are too many biological databases, especially when multiple databases emerge with heavily overlapping areas of interest. I think funding agencies and journals should think carefully before supporting/publishing new resources without first establishing:

  1. Is this really needed?
  2. Does it overlap with other existing resources?
  3. What will happen to the resource should funding and/or personnel not be available to keep it going?

Goodbye CEGMA, hello BUSCO!

Some history

CEGMA (Core Eukaryotic Genes Mapping Approach) is a bioinformatics tool that I helped develop about a decade ago. It led to a paper that outlined how you could use CEGMA to find a small number of highly conserved genes in any new genome sequence that might be devoid of annotation. Once you have even a handful of genes, then you can use that subset to train a gene finder in order to annotate the rest of the genome.

It was a bit of a struggle to get the paper accepted by a journal, and so it didn't appear until 2007 (a couple of years after we had started work on the project). For CEGMA to work it needed to use a set of orthologous genes that were highly conserved across different eukaryotic species. We used the KOGs database (euKaryotic Orthologous Groups) that was first described in 2003. So by the time of our publication we were already using a dataset that was a few years old.

The original CEGMA paper did not attract much attention, but we subsequently realized that the same software could be used to broadly assess how complete the 'gene space' of any published genome assembly was. To do this, we defined a subset of core genes that were the most highly conserved and which tended to be single-copy. The resulting 2009 paper seemed to generate a lot of interest in CEGMA, and citations to the original paper have increased every year since (139 citations in 2014).

This is good news except:

  1. CEGMA can be a real pain to install due to its dependency on many other tools (though we've made things easier)
  2. CEGMA has been very hard to continue developing. The original developer, who was the principal software architect, left our group about seven years ago, and I have struggled to keep CEGMA working and updated.
  3. CEGMA continues to generate a lot of support email requests (that end up being dealt with by me).

We have no time or resources to devote to CEGMA, but the emails keep on coming. It's easy to envisage many ways in which CEGMA could be improved and extended; we submitted a grant proposal to do this, but it was unsuccessful. One planned aspect of 'CEGMA v3' was to replace the reliance on the aging KOGs database. Another was to develop clade-specific sets of core genes: if you are interested in plant genomes, for example, we should be able to develop a much larger set of plant-specific core genes.

And so?

Today I draw a line in the sand and say…

CEGMA is dead

CEGMA had a good life and we shall look back with fond memories, but its time has passed. Please don't grieve (and don't send flowers), but be thankful that people will no longer have to deal with the software-dependency-headache of trying to get Genewise working on Ubuntu Linux.

But what now?

The future of CEGMA has arrived and it's called BUSCO.

  • BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs

This new tool (Benchmarking Universal Single-Copy Orthologs) has been developed by Felipe A. Simao, Robert Waterhouse, Panagiotis Ioannidis, Evgenia Kriventseva, and Evgeny Zdobnov. You can visit the BUSCO website, have a read of the manual, or look at the supplementary online material (this is set out like a paper…I don't think the tool has been formally published yet though). Here is the first section from that supplementary material:

Benchmarking Universal Single-Copy Orthologs (BUSCO) sets are collections of orthologous groups with near-universally-distributed single-copy genes in each species, selected from OrthoDB root-level orthology delineations across arthropods, vertebrates, metazoans, fungi, and eukaryotes (Kriventseva, et al., 2014; Waterhouse, et al., 2013). BUSCO groups were selected from each major radiation of the species phylogeny requiring genes to be present as single-copy orthologs in at least 90% of the species; in others they may be lost or duplicated, and to ensure broad phyletic distribution they cannot all be missing from one sub-clade. The species that define each major radiation were selected to include the majority of OrthoDB species, excluding only those with unusually high numbers of missing or duplicated orthologs, while retaining representation from all major sub-clades. Their widespread presence means that any BUSCO can therefore be expected to be found as a single-copy ortholog in any newly-sequenced genome from the appropriate phylogenetic clade (Waterhouse, et al., 2011). A total of 38 arthropods (3,078 BUSCO groups), 41 vertebrates (4,425 BUSCO groups), 93 metazoans (1,008 BUSCO groups), 125 fungi (1,438 BUSCO groups), and 99 eukaryotes (431 BUSCO groups), were selected from OrthoDB to make up the initial BUSCO sets which were then filtered based on uniqueness and conservation as described below to produce the final BUSCO sets for each clade, representing 2,675 genes for arthropods, 3,023 for vertebrates, 843 for metazoans, 1,438 for fungi, and 429 for eukaryotes. For bacteria, 40 universal marker genes were selected from (Mende, et al., 2013).

BUSCO seems to do everything that we wanted to include in CEGMA v3, and it is based on OrthoDB, a resource developed by the same authors that provides a newer set of orthologs. The online material includes a comparison of BUSCO to CEGMA, and also outlines how BUSCO can be much quicker than CEGMA (depending on which set of orthologs you use).

DISCLAIMER: I have not installed, tested, or analyzed BUSCO in any way. I make no promises as to its performance, but they seem to have gone about things in the right way.

This is what happens when you name your bioinformatics tool 'JEPEG'

This is always going to be a problem if you search for 'JEPEG' alone, but if you include 'bioinformatics' in your search, then the top hit correctly points to this recently described bioinformatics tool:

The name 'JEPEG' is derived from Joint Effect on Phenotype of Eqtl/functional snps associated with a Gene, a name that is worthy of a JABBA award. Also, it's not clear to me how the name should be pronounced: 'jee-peg', 'jep-egg', 'jep-peg'?

Kablammo: an interactive, web-based BLAST results visualizer

I like it when someone publishes a new tool that:

  1. Has a clever, fun name which isn't just another bogus bioinformatics acronym.
  2. Has a great looking website with a fun logo (bonus points for being in charge of the domain name so that links won't break should the Wasmuth lab move somewhere else).
  3. Has the code available on GitHub.

Congratulations to Jeff Wintersinger and James Wasmuth!

How can you choose a single isoform to best represent a gene?

I asked this question a few weeks ago. There are some situations in bioinformatics where you want to look at all genes from one or more organisms, but only want one representative sequence per gene. This is not always a straightforward task, and I previously discussed how many people opt for overly simplistic methods such as 'choosing the longest isoform'.

Since that blog post, I have had some feedback and have also come across a relevant paper, both of which I would like to share here.

Feedback

Sara G said:

We use epigenetic data from ENCODE/modENCODE/Epigenetic roadmap or our own data to identify the transcription start site that has the most evidence of transcriptional activity.

In my experience with data from fly, worm, and Arabidopsis, I have often seen genes with multiple isoforms that all share the same transcription start site (and differ in downstream exon/intron structure).

Michael Paulini commented:

What you can easily do (if you got the data) is just pick the isoform with the highest FPKM value per gene.

The important caveat that Michael mentions is that you might not have the data, and if you are trying to do this as part of a multi-species comparison, it becomes a much more complex task.
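As a rough illustration of Michael's suggestion, here is a minimal sketch that picks the highest-FPKM isoform per gene from a tab-delimited expression table (the file name and column names are hypothetical; adjust them to match whatever your quantification tool actually produces):

```python
import csv

def pick_top_isoforms(fpkm_table):
    """Return {gene_id: transcript_id}, keeping the highest-FPKM isoform per gene.

    Assumes a tab-delimited file with 'gene_id', 'transcript_id' and 'FPKM'
    columns (a hypothetical layout; real quantification output will differ).
    """
    best = {}  # gene_id -> (fpkm, transcript_id)
    with open(fpkm_table) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            gene, tx, fpkm = row["gene_id"], row["transcript_id"], float(row["FPKM"])
            if gene not in best or fpkm > best[gene][0]:
                best[gene] = (fpkm, tx)
    return {gene: tx for gene, (fpkm, tx) in best.items()}

if __name__ == "__main__":
    representatives = pick_top_isoforms("isoform_fpkm.tsv")  # hypothetical file name
    for gene, tx in sorted(representatives.items()):
        print(gene, tx, sep="\t")
```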

Richard Edwards added to the debate (I've edited his response a little bit):

My previous solution to this problem was this:

"…a dataset consisting of one protein sequence per gene was constructed from the EnsEMBL human genome…by mapping all EnsEMBL human peptides onto their genes and assessing them in the context of the external database entry used as evidence for that gene. If the external database was SwissProt and one of the peptides has the exact same sequence as the SwissProt sequence, this peptide sequence was used for that gene. In all other cases, the longest peptide (in terms of non-X amino acids) was used."

Richard also conceded that this approach "only works for well annotated proteomes".
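For what it's worth, the core logic of Richard's approach might be sketched like this (this is not his actual code; the inputs are hypothetical stand-ins for whatever you extract from Ensembl and SwissProt):

```python
def choose_representative(peptides, swissprot_seq=None):
    """Pick one peptide per gene, following the logic Richard describes.

    peptides      -- list of (peptide_id, sequence) tuples for one gene
    swissprot_seq -- the SwissProt sequence used as evidence for the gene,
                     or None if the gene's evidence came from another database
    """
    # 1. If a peptide exactly matches the SwissProt evidence sequence, use it
    if swissprot_seq is not None:
        for pep_id, seq in peptides:
            if seq == swissprot_seq:
                return (pep_id, seq)

    # 2. Otherwise fall back to the longest peptide, counting only non-X residues
    return max(peptides, key=lambda p: len(p[1].replace("X", "")))
```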

Another approach

For Arabidopsis thaliana, I have previously used data from their exon confidence ranking system, which produces a 5-star rating for every transcript of a gene. This is based on various strands of evidence and can be useful for choosing between isoforms which have different star ratings (but not so helpful when all isoforms have the same rating). The TAIR database remains the only model organism database (that I know of) that has attempted to address this problem and provide a systematic way of ranking all genes. Anecdotally, I would say that not many people know about this system, and you need to find the relevant files on their FTP site to make use of this information.
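Once you have tracked down the relevant TAIR files, the selection step itself is simple. Here is a minimal sketch, assuming you have already parsed the rankings into a mapping of transcript ID to star rating (the parsing will depend on the exact file format, which I am not reproducing here):

```python
from collections import defaultdict

def pick_by_star_rating(ratings):
    """Pick the highest-rated isoform per gene from {transcript_id: stars}.

    Assumes TAIR-style transcript IDs such as 'AT1G01040.1', where the text
    before the '.' is the gene ID. Genes whose isoforms all tie on rating are
    reported separately so they can be resolved by some other criterion.
    """
    by_gene = defaultdict(list)
    for tx, stars in ratings.items():
        gene = tx.rsplit(".", 1)[0]
        by_gene[gene].append((stars, tx))

    chosen, ties = {}, {}
    for gene, isoforms in by_gene.items():
        isoforms.sort(reverse=True)
        if len(isoforms) > 1 and isoforms[0][0] == isoforms[1][0]:
            ties[gene] = [tx for _, tx in isoforms]  # needs another tie-breaker
        else:
            chosen[gene] = isoforms[0][1]
    return chosen, ties
```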

Hopefully someone will correct me if other databases have implemented similar schemes!

From the literature

If your research focuses on vertebrates, then this approach looks promising.

Here's the abstract (emphasis mine):

Here, we present APPRIS (http://appris.bioinfo.cnio.es), a database that houses annotations of human splice isoforms. APPRIS has been designed to provide value to manual annotations of the human genome by adding reliable protein structural and functional data and information from cross-species conservation. The visual representation of the annotations provided by APPRIS for each gene allows annotators and researchers alike to easily identify functional changes brought about by splicing events. In addition to collecting, integrating and analyzing reliable predictions of the effect of splicing events, APPRIS also selects a single reference sequence for each gene, here termed the principal isoform, based on the annotations of structure, function and conservation for each transcript. APPRIS identifies a principal isoform for 85% of the protein-coding genes in the GENCODE 7 release for ENSEMBL. Analysis of the APPRIS data shows that at least 70% of the alternative (non-principal) variants would lose important functional or structural information relative to the principal isoform.

It seems that since this paper was published, they have expanded their dataset and the online APPRIS database now has annotations for five vertebrate genomes (human, mouse, zebrafish, rat, and pig).

The future

This APPRIS approach seems promising, yet I wish there were a standardized ranking system used by all model organism databases (as well as Ensembl, the UCSC Genome Browser, etc.).

There are always going to be different levels of supporting evidence available in different species. For example, not every organism is going to have mass spec data available to assess the differential usage of transcript variants. Furthermore, there are going to be tissue-specific — not to mention time-specific — patterns of differential isoform expression for some genes.

However, what I would love to see is each database using whatever information it has available to assign relative 'weights' to each annotated isoform. Initially, this would focus on protein-coding genes, and ideally the scheme would entail the following:

  1. A score for every annotated isoform of a gene where all scores sum to 100. In my favorite example of the Caenorhabditis elegans rpl-22 gene these scores would probably end up as 99 for the dominant isoform and 1 for the secondary isoform. The scores should reflect the relative likelihood that this transcript encodes a functional protein. To keep things simple, I would prefer integer values with values of zero allowed for annotated isoforms that represent pseudogenic transcripts, targets of nonsense-mediated decay etc.
  2. Evidence for the score. This would be something akin to the Gene Ontology evidence codes. Each score could have one or more codes attached to it, reflecting the source of the experimental evidence (transcript data, mass spec, functional assay etc.).
  3. Spatial/temporal information. If some of the supporting evidence is from a particular cell-line, tissue, or developmental time point then it should be possible to annotate this.
  4. Historical information. Such scores should be expected to change over time as new evidence emerges. All transcripts should keep a time-stamped history of previous scores to allow researchers to see whether isoforms have increased or decreased in their relative probability of usage over time.
  5. Data to be added to GTF/GFF files: the score of each transcript should be embedded in the attributes column of these files, maybe as an 'RF' (Relative Functionality) score; see the sketch after this list.
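To make the last point concrete, here is a minimal sketch of what such a hypothetical 'RF' attribute could look like in a GTF line, and how you might read it back out (the coordinates, attribute name, and score are all illustrative; no database currently provides anything like this):

```python
import re

# A hypothetical GTF line carrying a Relative Functionality (RF) score of 99
# for the dominant rpl-22 isoform (coordinates and values are made up).
example = ('I\tWormBase\ttranscript\t3000\t4000\t.\t+\t.\t'
           'gene_id "rpl-22"; transcript_id "rpl-22a"; RF "99";')

def rf_score(gtf_line):
    """Extract the hypothetical RF score from the attributes column (column 9)."""
    attributes = gtf_line.rstrip("\n").split("\t")[8]
    match = re.search(r'RF "(\d+)"', attributes)
    return int(match.group(1)) if match else None

print(rf_score(example))  # prints: 99
```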

Can someone please make this happen? :-)

The dangers of default parameters in bioinformatics: lessons from Bowtie and TopHat

Updated 2015-04-28 to include a reference to a relevant blog post by Nick Loman.

Most bioinformatics tools are equipped with a vast array of command-line options which let the user configure the inputs, outputs, and performance of the software. You may not wish to explore every possible option when using a particular piece of software, but you should always try to have a look at the manual. In fact I would go so far as to say:

Never use a piece of bioinformatics software for the first time without looking to see what command-line options are available and what default parameters are being used

If a bioinformatics tool provides command-line options, it's a good bet that some of those options will have default parameters associated with them. The most common default parameter to check for might be called something like -p, -t, or --threads, i.e. whether the program can make use of multiple CPUs/cores to speed up its operation. The default value for such a parameter is usually '1'.

Failure to check (and change) this option will merely result in your program taking longer than it otherwise could, so no harm done. More serious issues arise when you stick with default parameters which might actually make you miss interesting results, or produce misleading results.

For example, the TopHat RNA-Seq aligner has a command-line option -i which lets you set a minimum intron length. The manual reveals that TopHat will ignore donor/acceptor pairs closer than this many bases apart. The default value for this parameter is 70 bp, which is probably fine for vertebrate genomes, but is probably not suitable for genomes such as Arabidopsis thaliana or Caenorhabditis elegans, where many introns are shorter than this.

Now let's look at some specific examples which shed some light on what can happen if you stick with the default parameters.

Testing Bowtie

I used Bowtie 2 to align two FASTQ files of bovine RNA-Seq data to the reference cow transcriptome. These two files represented paired read data, with about 25 million reads in each file. Following alignment, I filtered the data to keep only uniquely mapped, concordant read pairs, and then further filtered it to keep only read pairs that aligned with the highest possible mapping quality (for Bowtie 2, this means a MAPQ value of 42).
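This filtering step is easy to reproduce with pysam; here is a minimal sketch, assuming a BAM file of Bowtie 2 output (the file names are hypothetical, and the MAPQ threshold of 42 is specific to Bowtie 2's scoring scheme):

```python
import pysam

def filter_high_quality_pairs(in_bam, out_bam, min_mapq=42):
    """Keep only primary, properly paired alignments with MAPQ >= min_mapq."""
    with pysam.AlignmentFile(in_bam, "rb") as bam_in, \
         pysam.AlignmentFile(out_bam, "wb", template=bam_in) as bam_out:
        kept = 0
        for read in bam_in:
            if (read.is_proper_pair             # concordant read pair
                    and not read.is_unmapped
                    and not read.is_secondary    # primary alignments only
                    and not read.is_supplementary
                    and read.mapping_quality >= min_mapq):
                bam_out.write(read)
                kept += 1
    return kept

if __name__ == "__main__":
    n = filter_high_quality_pairs("bowtie2_alignments.bam", "filtered.bam")
    print(f"kept {n} alignments")
```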

From the resulting set of aligned reads, I calculated the inner read pair distance (see this great post about the difference between this and insert sizes), which I then plotted:

Thicker red vertical line indicates mean.

You will notice that most read pairs have a negative inner read pair distance, i.e. the reads overlap. But do you notice anything odd about these results? Look at the far right of the distribution…the maximum inner read pair distance cuts off abruptly at 300 bp. To me this is very suspicious. After looking in the Bowtie manual I discovered this:

-X <int> The maximum fragment length for valid paired-end alignments. Default: 500 bp.

Aha! My reads are 100 bp, so the cutoff at 300 bp in the above graph represents fragment lengths of 500 bp (100 bp + 300 bp + 100 bp). If there are read pairs that correctly map further than 300 bp apart, I am not going to see them because of the default value of -X. So what happens if I increase this parameter to 2,000 bp?

Thicker red line indicates mean of dataset with -X 500, thicker blue line indicates mean of dataset at -X 2000.

I have colored the new data in blue; we now see many more read pairs that map further than 300 bp apart. There are actually still some reads that correctly map even further than 1,800 bp apart (the limit imposed by -X 2000). So not only does this default parameter of Bowtie 2 mean that I lose some data, it also means that I lose the type of data that might be particularly useful in some applications where we especially want reads that map far apart (e.g. transcriptome assembly).

It turns out that Nick Loman was writing about the dangers of the -X parameter of Bowtie back in 2013, and I encourage everyone to read his blog post: Use -X with bowtie2 to set minimum and maximum insert sizes for Nextera libraries.

Testing TopHat

The TopHat program actually uses Bowtie for the alignment phase, and there are a couple of default parameters used by TopHat that might cause issues for you. From the TopHat manual:

-r <int>  This is the expected (mean) inner distance between mate pairs. The default is 50 bp.

--mate-std-dev <int> The standard deviation for the distribution on inner distances between mate pairs. The default is 20 bp.

The default value for --mate-std-dev seems worryingly low; how many biological datasets ever have standard deviations that are less than half the mean? So I did some testing…

I used TopHat to map a small sample of my RNA-Seq data (200,000 paired reads) to the cow transcriptome (using the -T option of TopHat). In addition to using the defaults, I also tried various runs that used different values for the -r and --mate-std-dev options. My run with default parameters ended up with 150,235 reads mapped to the transcriptome, whereas increasing those parameters (-r 100 and --mate-std-dev 125) produced fewer reads in my output (138,970). So you might conclude that the default parameters are performing better. However, of the 150,235 reads mapped with default parameters, only 84,996 represent high quality alignments (uniquely mapped, concordant read pairs). My custom parameters produced 105,748 high quality alignments. So the custom parameters are giving me better results overall.

Conclusions

It takes time and effort to test command-line options, especially for programs like Bowtie and TopHat which have a lot of them. For TopHat, I ended up doing 40 different test runs in order to investigate the effects of various parameters and work out what was most suitable (and effective) for my input data.

Bioinformatics programs should never be treated as 'black box' devices. You may get results out of them, but you will never know whether you could have obtained more and/or better results by changing some of those default parameters.