Tales of drafty genomes: part 3 – all genomes are complete…except for those that aren't

This is the third post in an infrequent series that looks at the world of unfinished genomes.

One of the many, many resources at the NCBI is their Genome database. Here's how they describe themselves:

The Genome database contains sequence and map data from the whole genomes of over 1000 species or strains. The genomes represent both completely sequenced genomes and those with sequencing in-progress. All three main domains of life (bacteria, archaea, and eukaryota) are represented, as well as many viruses, phages, viroids, plasmids, and organelles.

This text could probably be updated because the size of the database is now wrong by an order of magnitude…there are currently 11,322 genomes represented in this database. But how many of them are 'completely sequenced' and how many are at the 'sequencing in-progress' stage?

Luckily, the NCBI classifies all genomes into one of four 'levels':

  • Complete
  • Chromosome
  • Scaffold
  • Contig

I couldn't find any definitions for these categories within the NCBI Genome database, but elsewhere on the NCBI website I found the following definitions for the latter three categories:

Chromosome - there is sequence for one or more chromosomes. This could be a completely sequenced chromosome (gapless) or a chromosome containing scaffolds with unlinked gaps between them.

Scaffold - some sequence contigs have been connected across gaps to create scaffolds, but the scaffolds are all unplaced or unlocalized.

Contig - nothing is assembled beyond the level of sequence contigs.
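The three definitions above read like a simple decision rule, so here is a minimal sketch of them in Python. The class name and its two fields are hypothetical (they are not part of any NCBI API), and 'Complete' is deliberately left out because I couldn't find a definition for it.

```python
from dataclasses import dataclass


@dataclass
class AssemblySummary:
    # Hypothetical fields, invented for this illustration only.
    has_chromosome_sequence: bool  # sequence exists for one or more chromosomes
    has_scaffolds: bool            # some contigs have been joined across gaps


def assembly_level(summary: AssemblySummary) -> str:
    """Return the highest of the three quoted levels that applies."""
    if summary.has_chromosome_sequence:
        return "Chromosome"
    if summary.has_scaffolds:
        return "Scaffold"
    # Nothing is assembled beyond the level of sequence contigs.
    return "Contig"


print(assembly_level(AssemblySummary(False, True)))
```

The checks run from the most assembled level downwards, so an assembly with both chromosome sequence and leftover scaffolds is reported as 'Chromosome'.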

So considering just the 2,032 eukaryotic species in the NCBI Genome database, we can ask…how many of them are complete?

Completion status of 2,032 eukaryotic genomes, as classified by NCBI

The somewhat depressing answer is that only a meagre 24 eukaryotic genomes are listed as complete, about 1% of the total. Even if we include genomes with chromosome sequences, we are still only talking about 13% of all genomes. You might imagine that the state of completion would be markedly better when looking at prokaryotes. However, only 11.5% of the 31,696 prokaryotic genomes are classified as complete.
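For anyone who wants to check the arithmetic, here is a small sketch using only the numbers quoted above. The prokaryotic count is back-calculated from the quoted percentage, so treat it as an estimate rather than a figure reported by the NCBI.

```python
# Counts quoted in this post (NCBI Genome database at the time of writing).
EUKARYOTE_TOTAL = 2032
EUKARYOTE_COMPLETE = 24
PROKARYOTE_TOTAL = 31696
PROKARYOTE_COMPLETE_FRACTION = 0.115  # only the percentage is quoted


def percent(part: float, whole: float) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


euk_complete_pct = percent(EUKARYOTE_COMPLETE, EUKARYOTE_TOTAL)

# Back-calculated estimate, not a number taken from the database.
prok_complete_estimate = round(PROKARYOTE_COMPLETE_FRACTION * PROKARYOTE_TOTAL)

print(f"Complete eukaryotic genomes: {euk_complete_pct:.1f}%")
print(f"Estimated complete prokaryotic genomes: ~{prok_complete_estimate}")
```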

In the last post in this series, I included a dictionary definition of the word 'draft'. This time, let's look to see how Merriam-Webster defines 'complete':

having all necessary parts : not lacking anything

not limited in any way

not requiring more work : entirely done or completed

By these definitions, I think we could all agree that very few genomes are actually complete.

Choosing names for bioinformatics software: it's a snap

Image from flickr user plashingvole

Compare the following published bioinformatics resources:

  1. SNAP: Semi-HMM-based Nucleic Acid Parser (published 2004)
  2. SNAP: Suite of Nucleotide Analysis Programs (published 2005)
  3. SNAP: SNP Annotation And Proxy search (published 2008)
  4. SNAP: Screening for NonAcceptable Polymorphisms (published 2008)
  5. SNAP: Scalable Nucleotide Alignment Program (published 2011)

Every new bioinformatics tool that reuses an existing name, whether wilfully or through ignorance, makes it that little bit harder for people to find the other similarly named tools that they might be searching for.

h/t to @byuhobbes for bringing some of these duplicates to my attention.

Time for a classic example of a JABBA-award winning piece of bioinformatics software

Normally, I introduce the JABBA-award-worthy acronym before I show you the full name of the offending piece of software. But this time, let's play a little game. Here is the title of a recent article from the journal Bioinformatics, only I have removed the software acronym and the tell-tale capitalization from the name:

small molecule activity scanner web service based

So now that you know the name, have a guess at what the acronym/initialism is. I feel confident that no-one will guess the answer. You'll have to scroll down for the reveal…

 

Okay, here it is:

SEABED: Small molEcule activity scanner weB servicE baseD

Note that:

  1. Only the 'S' is clearly derived from the initial letter of a word
  2. The 'A' is left ambiguously unexplained in the capitalization (as presented in the journal title). One might presume that it comes from 'Activity' but I wouldn't rule out 'scAnner'.
  3. However you derive the letters in SEABED, one (or more) words don't contribute to the acronym at all.

All of which makes SEABED a worthy recipient of a JABBA award. The only saving grace is that a Google search for seabed bioinformatics finds the paper as the top hit.

One downside to this tool is that the SEABED webserver (http://www.bsc.es/SEABED) doesn't seem to be working at all at the moment.
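If you want to check the three points in the list above for yourself, here is a rough sketch that reads the capital letters straight out of the title as the journal presents it. The function and variable names are my own, and the approach is just one simple way of auditing an acronym.

```python
from collections import Counter

# Title exactly as capitalized in the journal.
TITLE = "Small molEcule activity scanner weB servicE baseD"
ACRONYM = "SEABED"


def capital_report(title):
    """For each word, list the (position, letter) of every capital in it."""
    return [
        (word, [(i, ch) for i, ch in enumerate(word) if ch.isupper()])
        for word in title.split()
    ]


report = capital_report(TITLE)

# Letters that the capitalization actually spells out, in order.
spelled = "".join(ch for _, caps in report for _, ch in caps)

# Words whose only capital is their initial letter.
initial_only = [w for w, caps in report if caps and all(i == 0 for i, _ in caps)]

# Words that contribute no capital letter at all.
silent = [w for w, caps in report if not caps]

# Acronym letters that no capital in the title accounts for.
missing = sorted((Counter(ACRONYM) - Counter(spelled)).elements())

print("Capitals spell:", spelled)
print("Initial-letter words:", initial_only)
print("Words with no capitals:", silent)
print("Unexplained acronym letters:", missing)
```

The capitals alone spell SEBED rather than SEABED, which is why the 'A' has to be guessed at from either 'activity' or 'scanner'.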

Tales of drafty genomes: part 2 — when draft genomes took over the world

This is the second post in an infrequent series that looks at draft genomes.

At the time of writing, Google has indexed almost 400,000 pages that include a mention of the phrase draft genome. Prior to the year 2000, there are zero mentions of this phrase in the tech giant’s search index.

The phrase ‘draft genome’ came to prominence with the publication of the ‘working draft’ version of the human genome[1]. But referring to published genomes as anything other than ‘complete’ was still atypical at this time. This can be seen if you search Google Scholar for papers that include in their titles either the phrase draft genome sequence or complete genome sequence. When you look at how these results change over time, an interesting pattern emerges:

Number of papers indexed by Google Scholar that include the phrases 'Complete genome sequence' or 'Draft genome sequence' in their titles.

Around 2000–2003, there were a small number of papers mentioning draft genome sequences, nearly all of them related to the draft sequences of the human or rice genomes. Usage of the phrase (in paper titles) didn’t break double digits until 2011. ‘Draft genome sequence’ then became a much more widely used phrase in 2012, and by 2013 it had overtaken ‘complete genome sequence’.

I think this reveals something about the nature of sequencing and genome assembly. It almost feels like we are giving up our ambition to finish genomes (whatever ‘finished’ actually means) and are more willing to settle for something that is clearly incomplete.

A definition of ‘draft’ provided by Merriam-Webster is as follows:

A version of something (such as a document) that you make before you make the final version

In an ideal world, I would hope that all of these draft genomes would also end up being replaced by ‘final versions’. But I’m doubtful that many of these published sequences will be completed any time soon.


  1. See part 1 in this series for more details about the drafty nature of the human genome.