Are there too many biological databases?

The annual 'Database' issue of Nucleic Acids Research (N.A.R.) was recently published. It contains a mammoth set of 172 papers that describe 56 new biological databases as well as updates to 115 others. I've already briefly commented on one of these papers, and expect that I'll be nominating several others for JABBA awards.

In this post I just wanted to comment on the seemingly inexorable growth of these computational resources. There are databases for just about everything these days. Different species, different diseases, different types of sequence, different biological mechanisms…every possible biological topic has a relevant database, and sometimes several.

It is increasingly hard to even stay on top of just how many databases are out there. Wikipedia has a listing of biological databases as well as a category for biological databases, but both of these barely scratch the surface of what is out there.

So maybe one might turn to 'DBD': a Database of Biological Databases, or even MetaBase, which also describes itself as a 'Database of Biological Databases' (please don't start thinking about creating 'DBDBBDB': A Database of Databases of Biological Databases!).

However, the home pages of these two sites were last updated in 2008 and 2011 respectively, perfectly reflecting one of the problems in the world of biological databases…they often don't get removed when they go out of date. In a past life, I was a developer of several databases at something called UK CropNet. Curation of these databases, particularly the Arabidopsis Genome Resource, effectively stopped when I left the job in 2001, but the databases were only taken offline in 2013!!!

So old, out-of-date databases are part of the problem, but the other issue is that there seem to be some independent databases that, in an ideal world, should really be merged with similar databases. For example, there is a database called BeetleBase that describes its remit as follows:

BeetleBase is a comprehensive sequence database and important community resource for Tribolium genetics, genomics and developmental biology.

This database has been around since at least 2007 though I'm not entirely sure if it is still being actively developed. However, I was still surprised to see this paper as part of the N.A.R. Database issue:

iBeetle-Base has seemingly been developed by a different group of people from those behind BeetleBase. Is it helpful to the wider community to have two databases like this, with confusingly similar names? It's possible that the iBeetle-Base people tried reaching out to the BeetleBase folks to include their data in the pre-existing database, but were rebuffed or found out that BeetleBase is no longer a going concern. Who knows, but it just seems a shame to have so much genomics information for a species split across multiple databases.

I'm not sure what could, or should, be done to tackle these issues. Should we discourage new databases if there are already existing resources that cover much of the subject matter? Should we require the people who run databases to 'wind up' the resources in a better way when funding runs out (i.e. retire databases or make it abundantly clear that a resource is no longer being updated)? Is it even possible to set some minimum standards for database usage that must be met in order for subsequent 'update papers' to get published (e.g. 'X' DB accesses per month)?

diArk – the database for eukaryotic genome and transcriptome assemblies in 2014

A new paper in Nucleic Acids Research describes a database that I was not aware of. The abstract features an eye-catching, not to mention ambitious, claim (the emphasis is mine):

The database…has been developed with the aim to provide access to all available assembled genomes and transcriptomes.

The diArk database currently features data on 2,771 species. There are many options to filter your search queries, including filtering by 'sequencing type' and by the status of completion. So when I search for 'completed' genome sequencing projects, it reports that there are 3,626 projects corresponding to 1,848 species. The FAQ has this to say regarding 'completeness':

The term completeness is intended to describe the coverage of the genome and the chance to find all homologs of the gene of interest.

I was a bit put off by the interface to this database. As far as I can tell, diArk mostly contains links to other resources (rather than hosting any sequence information). There are lots of very small icons everywhere which are hard to understand (unless you mouse over each icon). When I went to the page for Caenorhabditis elegans, I was struck by the confusing nature of just posting links to every C. elegans resource on the web. There are 12 'Project' links listed. Which one gives you access to the latest version of the genome sequence?

diArk summary of Caenorhabditis elegans data

As a final comment, I noticed that the latest entry on the diArk news page is from September 2011 which is a bit worrying (nothing newsworthy has happened in the last 3 years?).

Red flag alert for a bogus bioinformatics acronym

The first JABBA award of 2015 goes to a paper that was published at the end of 2014 (thanks to Twitter user @chenghlee for bringing this to my attention). The paper, published in BMC Medical Genomics, has a succinct title that contains a very bogus name:

The title doesn't explicitly reveal the source of the acronym 'FLAGS', but you can probably take a guess. From the abstract:

We termed these genes FLAGS for FrequentLy mutAted GeneS

This gets a JABBA award because a majority (3 out of 5) of the letters in 'FLAGS' are not from the initial letters of words.
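
That '3 out of 5' tally is easy to check with a few lines of Python. This is only a rough sketch: the function name is made up for illustration, and it assumes that the capital letters in the expansion (as written in the abstract) mark exactly which letters feed the acronym.

```python
def acronym_letter_sources(expansion):
    """Count capitalised letters that start a word vs. those that don't.

    Assumes (for illustration) that capital letters in the expansion
    mark the letters used to build the acronym.
    Returns a tuple: (initial_count, non_initial_count).
    """
    initial = 0
    non_initial = 0
    for word in expansion.split():
        for index, char in enumerate(word):
            if char.isupper():
                if index == 0:
                    initial += 1      # letter comes from the start of a word
                else:
                    non_initial += 1  # letter plucked from inside a word
    return initial, non_initial


initial, non_initial = acronym_letter_sources("FrequentLy mutAted GeneS")
print(initial, non_initial)  # prints "2 3"
```

Only the 'F' and the 'G' start a word; the 'L', 'A' and 'S' are all borrowed from the middle or end of one.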