Trying to download the cow genome (again): where's the beef (again)?

March 04, 2015 by Keith Bradnam

Almost a year ago, I blogged about my frustrations regarding the extremely confusing nature of the cow genome and the many genome assemblies that are out there. Much of that frustration was due to websites and FTP sites that had broken links, misleading information, and woefully incomplete documentation.

One year on and I hear a rumor that a new version of the cow genome is available. So I went off in search of 'UMD 3.1.1'. My first stop was bovinegenome.org which is one place where you can find the previous 'UMD 3.1' assembly. But alas, they do not list UMD 3.1.1.

After some Google searching I managed to find this information at the UCSC Genome Bioinformatics news archive:

We are pleased to announce the release of a Genome Browser for the June 2014 assembly of cow, Bos taurus (BostaurusUMD 3.1.1, UCSC version bosTau8). This updated cow assembly was provided by the UMD Center for Bioinformatics and Computational Biology (CBCB). This assembly is an update to the previous UMD 3.1 (bosTau6) assembly. UMD 3.1 contained 138 unlocalized contigs that were found to be contaminants. These have been suppressed in UMD 3.1.1.

This reveals that the update is pretty minor (removal of contaminant contigs which were never part of any chromosome sequence anyway). In any case, the UCSC FTP site contains the UMD 3.1.1 assembly so that's great.

But out of curiosity I followed UCSC's link to the UMD Center for Bioinformatics and Computational Biology (CBCB) website. The home page doesn't make it easy to find the cow genome data. Searching the site for 'UMD 3.1.1' didn't help but searching for 'cow genome' did take me to their Assembly data page which lists the cow genome. Unfortunately the link for the Bos taurus genome takes you to 'page not found'. In contrast, the 'data download' link does work and takes you to their FTP site which fails to include the new assembly (but it does list all of the older cow genome assemblies).

Plus ça change, plus c'est la même chose.

Community annotation — by any name — still isn’t a part of the research process. It should be →

March 03, 2015 by Keith Bradnam

In order for community annotation efforts to succeed, they need to become part of the established research process: mine annotations, generate hypotheses, do experiments, write manuscripts, submit annotations. Rinse and repeat.

Todd Harris has written a thoughtful post on his blog which lists some suggestions for how to fix the failure of community annotation projects.

I particularly like Todd's 3rd suggestion:

We need to recognize the efforts of people who do [community annotation]. This system must have professional currency to it, akin to writing a review paper, and should be citable…

Tales of drafty genomes: part 3 – all genomes are complete…except for those that aren't

March 03, 2015 by Keith Bradnam

This is the third post in an infrequent series that looks at the world of unfinished genomes.

One of the many, many resources at the NCBI is their Genome database. Here's how they describe themselves:

The Genome database contains sequence and map data from the whole genomes of over 1000 species or strains. The genomes represent both completely sequenced genomes and those with sequencing in-progress. All three main domains of life (bacteria, archaea, and eukaryota) are represented, as well as many viruses, phages, viroids, plasmids, and organelles.

This text could probably be updated because the size of the database is now wrong by an order of magnitude…there are currently 11,322 genomes represented in this database. But how many of them are 'completely sequenced' and how many are at the 'sequencing in-progress' stage?

Luckily, the NCBI classifies all genomes into one of four 'levels':

Complete
Chromosome
Scaffold
Contig

I couldn't find any definitions for these categories within the NCBI Genome database, but elsewhere on the NCBI website I found the following definitions for the latter three categories:

Chromosome - there is sequence for one or more chromosomes. This could be a completely sequenced chromosome (gapless) or a chromosome containing scaffolds with unlinked gaps between them.

Scaffold - some sequence contigs have been connected across gaps to create scaffolds, but the scaffolds are all unplaced or unlocalized.

Contig - nothing is assembled beyond the level of sequence contigs

So considering just the 2,032 eukaryotic species in the NCBI Genome database, we can ask…how many of them are complete?

Completion status of 2,032 eukaryotic genomes, as classified by NCBI

The somewhat depressing answer is that only a meagre 24 eukaryotic genomes are listed as complete, about 1% of the total. Even if we include genomes with chromosome sequences, we are still only talking about 13% of all genomes. You might imagine that the state of completion would be markedly better when looking at prokaryotes. However, only 11.5% of the 31,696 prokaryotic genomes are classified as complete.
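The arithmetic behind these percentages is easy to check. The short Python sketch below uses only the figures quoted above. The variable names are mine, and the two implied counts are back-calculated from rounded percentages, so treat them as approximate.

```python
# Figures quoted in the post (NCBI Genome database, early March 2015).
eukaryote_total = 2032
eukaryote_complete = 24
prokaryote_total = 31696

# Percentage of eukaryotic genomes classified as 'Complete'.
pct_complete = 100 * eukaryote_complete / eukaryote_total

# Approximate counts implied by the rounded percentages in the text
# (13% of eukaryotes, 11.5% of prokaryotes).
approx_complete_or_chromosome = round(0.13 * eukaryote_total)
approx_prokaryote_complete = round(0.115 * prokaryote_total)

print(f"Complete eukaryotes: {pct_complete:.1f}%")
print(f"Complete or chromosome-level eukaryotes: ~{approx_complete_or_chromosome}")
print(f"Complete prokaryotes: ~{approx_prokaryote_complete}")
```

In other words, roughly 264 eukaryotic genomes reach at least chromosome level, and roughly 3,645 prokaryotic genomes are listed as complete.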

In the last post in this series, I included a dictionary definition of the word 'draft'. This time, let's look to see how Merriam-Webster defines 'complete':

having all necessary parts : not lacking anything

not limited in any way

not requiring more work : entirely done or completed

By this definition, I think we could all agree that very few genomes are actually complete.

Choosing names for bioinformatics software: it's a snap

March 02, 2015 by Keith Bradnam

Compare the following published bioinformatics resources:

SNAP: Semi-HMM-based Nucleic Acid Parser (published 2004)
SNAP: Suite of Nucleotide Analysis Programs (published 2005)
SNAP: SNP Annotation And Proxy search (published 2008)
SNAP: Screening for NonAcceptable Polymorphisms (published 2008)
SNAP: Scalable Nucleotide Alignment Program (published 2011)

Every new bioinformatics tool that decides to reuse an existing name, either wilfully or through ignorance, makes it that little bit harder for people to find one of the other similarly named tools that they might be searching for.

h/t to @byuhobbes for bringing some of these duplicates to my attention.

Time for a classic example of a JABBA-award winning piece of bioinformatics software

February 27, 2015 by Keith Bradnam

Normally, I introduce the name of the JABBA-award-worthy acronym before I show you the full name of the offending piece of software. But this time, let's play a little game. Here is the title of a recent article from the journal Bioinformatics, only I have removed the software acronym and the tell-tale capitalization from the name:

small molecule activity scanner web service based

So now you know the name, have a guess at what the acronym/initialism is. I feel confident that no-one will guess the answer. You'll have to scroll down for the reveal…

Okay, here it is:

SEABED: Small molEcule activity scanner weB servicE baseD

Note that:

Only the 'S' is clearly derived from the initial letter of a word
The 'A' is left ambiguously unexplained in the capitalization (as presented in the journal title). One might presume that it comes from 'Activity' but I wouldn't rule out 'scAnner'.
However you derive the letters in SEABED, one (or more) words don't contribute to the acronym at all.

All of which makes SEABED a worthy recipient of a JABBA award. The only saving grace is that a Google search for seabed bioinformatics finds the paper as the top hit.
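The points about capitalization are easy to verify mechanically. The Python sketch below (the function name is my own, purely illustrative) pulls out every capital letter from the title as printed by the journal, records whether it starts a word, and lists the words that contribute no capital at all.

```python
def capital_letters(title):
    """Return (letter, word, is_word_initial) for every uppercase letter."""
    found = []
    for word in title.split():
        for position, char in enumerate(word):
            if char.isupper():
                found.append((char, word, position == 0))
    return found

title = "Small molEcule activity scanner weB servicE baseD"
letters = capital_letters(title)

# The string spelled out by the capitals, in order of appearance.
spelled = "".join(letter for letter, _, _ in letters)

# Capitals that are genuinely the first letter of a word.
initials = [letter for letter, _, is_initial in letters if is_initial]

# Words that contribute no capital letter to the name.
unused = [w for w in title.split() if not any(c.isupper() for c in w)]

print(spelled)
print(initials)
print(unused)
```

The capitals in the title spell SEBED rather than SEABED, only the 'S' is word-initial, and neither 'activity' nor 'scanner' supplies a capital, which is why the origin of the 'A' can only be guessed.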

One downside to this tool is that the SEABED webserver (http://www.bsc.es/SEABED) doesn't seem to be working at all at the moment.


Tales of drafty genomes: part 2 — when draft genomes took over the world

February 18, 2015 by Keith Bradnam

This is the second post in an infrequent series that looks at draft genomes.

At the time of writing, Google has indexed almost 400,000 pages that include a mention of the phrase draft genome. Prior to the year 2000, there are zero mentions of this phrase in the tech giant’s search index.

The phrase ‘draft genome’ came to prominence with the publication of the ‘working draft’ version of the human genome[1]. But referring to published genomes as anything other than ‘complete’ was still atypical at this time. This can be seen if you search Google Scholar for papers that include in their titles either the phrase draft genome sequence or complete genome sequence. When you look at how these results change over time, an interesting pattern emerges:

Number of papers indexed by Google Scholar that include the phrases 'Complete genome sequence' or 'Draft genome sequence' in their titles.

Around 2000–2003, there were a small number of papers mentioning draft genome sequences. These are nearly all related to the draft sequences of the human or rice genomes. Usage of the phrase (in journal titles) didn’t break double digits until 2011. Draft genomes then became a much more widely used phrase in 2012 and by 2013 they overtook usage of ‘complete genome sequence’.

I find this reveals something about the nature of sequencing and genome assembly. It almost feels like we are giving up our ambition to finish genomes (whatever ‘finished’ actually means) and are more willing to settle for something that is clearly incomplete.

A definition of ‘draft’ provided by Merriam-Webster is as follows:

A version of something (such as a document) that you make before you make the final version

In an ideal world, I would hope that all of these draft genomes would also end up being replaced by ‘final versions’. But I’m doubtful that many of these published sequences will be completed any time soon.

[1] See part 1 in this series for more details about the drafty nature of the human genome.

Sowing the seeds of bad bioinformatics names

February 17, 2015 by Keith Bradnam

Here are two simple pieces of advice for people who are looking for a name for their latest bioinformatics tool/database/resource:

Avoid common words which might cause people searching for your tool to find something else instead.
Choose a name that hasn't been used before by the bioinformatics community.

Having said that, let's look at a new paper in the journal Bioinformatics:

Seed: a user-friendly tool for exploring and visualizing microbial community data

The name 'Seed' is a not-too-offensive acronym for Simple Exploration of Ecological Data. So what's my beef with it?

The problem is that words like seed are going to appear all over the Internet. My standard test for the 'searchability' of a bioinformatics tool is to search for the tool name followed by the word 'bioinformatics'. Your resource's website or publication should hopefully be the number one result (or somewhere on the first page). However, that is not what happens here.

And searching for 'seed bioinformatics' raises more problems by clashing with my second piece of advice. For example, here are a couple of papers that were in my first page of Google results:

2010: Accessing the SEED Genome Databases via Web Services API: Tools for Programmers

2011: SEED: efficient clustering of next-generation sequences

So what happens if you include 'microbial' into your search terms? Won't that help?

Nope. Turns out that the SEED (not an acronym as far as I can tell) is an annotation environment for microbial genomes that has been around for a decade, and which has spawned many papers, e.g.:

2014: The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST)

All of which means that people looking to find the newly published Seed tool are not going to have much luck when using search engines.

Is it a 'bad idea' to include gratuitous pictures of cleavage on an Oxford Journals website?

February 14, 2015 by Keith Bradnam

In a word, 'yes'.

2015-02-16: Note that this story has been updated after Oxford Journals contacted me about this (see end of post).

I know that journals need to make money, but it seems a bit shoddy when they allow any form of advertising to appear on their websites. I came across an article at Nucleic Acids Research today which featured the following advert:

Given that I have published in this journal before, I suppose that people reading our articles will also have a chance of seeing ads like this. I would ask Oxford Journals to think carefully about whether they really want adverts like this appearing on their site. This doesn't seem a particularly good fit for them as a scientific publisher; for that matter, it doesn't seem a great fit for the advertiser either.

Update: Oxford Journals reached out to me on Twitter with some good news:

@kbradnam Thank you for letting us know. The advertiser has been removed from our list of online ads.
— Oxford Journals (@OxfordJournals) February 16, 2015

BLAST bug (or feature?) in NCBI BLAST v2.2.30+

February 13, 2015 by Keith Bradnam

Something changed in the latest version of NCBI BLAST+ which breaks our CEGMA software. Compare the behavior of this simple TBLASTN command in v2.2.29+ and v2.2.30+ (from October 2014):

v2.2.29+

tblastn -db sample.dna -query sequence.prot -word_size 5

TBLASTN 2.2.29+

Database: sample.dna
           1 sequences; 2,499,950 total letters

Query= 7292122___KOG0292

Length=1234
                                                 Score     E
Sequences producing significant alignments:      (Bits)  Value

CHROMOSOME_I 1 15072418                           38.9    0.002

v2.2.30+

tblastn -db sample.dna -query sequence.aa -word_size 5

BLAST query/options error: Compressed alphabet lookup table requires word size 6 or 7
Please refer to the BLAST+ user manual.

One step in the CEGMA pipeline involves running TBLASTN with a word size of 5. This no longer works in the latest version and the error message suggests that only a word size of 6 or 7 is permitted. I can confirm that this is the case by looking at the latest source code for the blast_options.c file:


else if (options->lut_type == eCompressedAaLookupTable &&
         options->word_size != 6 && options->word_size != 7) {
         Blast_MessageWrite(blast_msg, eBlastSevError, kBlastMessageNoContext,
               "Compressed alphabet lookup table requires "
               "word size 6 or 7");
         return BLASTERR_OPTION_VALUE_INVALID;
}

The error message suggests I look at the BLAST+ user manual. I did this, and according to Table C5:

tblastn application options:

option = word_size    
type = integer
default value  = 3 
description and notes = "Valid word sizes are 2-7."

There also seems to be no mention of this change in the release notes, all of which makes me think that this is a bug. So I will report this to the NCBI, but any CEGMA users out there may wish to hold off updating to v2.2.30+.
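For anyone who would rather have a pipeline fail early than halfway through a run, one option is to check the BLAST+ version before calling TBLASTN with a word size of 5. The Python sketch below is not part of CEGMA. The function names are hypothetical, it only parses a version banner of the form shown in the output above, and it assumes that every release from 2.2.30+ onwards is affected, which I have not verified for anything later.

```python
import re

# First release in which the post reports that -word_size 5 is rejected.
FIRST_AFFECTED = (2, 2, 30)

def parse_blast_version(text):
    """Extract (major, minor, patch) from text such as 'TBLASTN 2.2.29+'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)\+?", text)
    if match is None:
        raise ValueError(f"no BLAST+ version found in: {text!r}")
    return tuple(int(part) for part in match.groups())

def word_size_5_may_fail(version_text):
    """True if the version is 2.2.30+ or later (assumed to be affected)."""
    return parse_blast_version(version_text) >= FIRST_AFFECTED

# Check the two banners that appear in the example output above.
for banner in ("TBLASTN 2.2.29+", "TBLASTN 2.2.30+"):
    status = "may reject" if word_size_5_may_fail(banner) else "should accept"
    print(f"{banner}: {status} -word_size 5")
```

Comparing versions as integer tuples avoids the trap of comparing them as strings, where '2.2.9' would sort after '2.2.30'.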

10 bioinformatics tools you should be using on Valentine's Day

February 13, 2015 by Keith Bradnam

1. HUGS: the database of HUman Genome Sequences

"We envisage that the growth in personal genomics will mean that researchers will increasingly want HUGS to cope with their work."

2. LOVE: LncRNA Ortholog Validation and Evaluation

"If you are unsure as to the quality of your lncRNA annotations, we suggest that you need LOVE."

3. KISSES: Kmers In aSsembled SEquenceS

"We envisage that KISSES will be widely distributed by people working in the field of genome assembly."

4. HEART: Histidine Enrichment Analysis Report Tool

"Accurate detection of histidine-enriched sequences can be achieved if researchers have HEART."

5. ILOVEYOU: Intergenic LOng VariablE Yeast Operational Units

"Detection of this new class of conserved intergenic element will open new avenues for S. cerevisiae researchers, and we predict that many will benefit from a deeper understanding of ILOVEYOUs."

6. ROSESARERED: Random Ortholog SEquence Simulations that ARE REDundant

"This tool effectively generates a series of, largely pointless, simulated ortholog sequences. See also our companion software: ValIdation Of Long Eukaryotic TranscriptS thAt Randomly appEar BioLogically UsEful (VIOLETSAREBLUE)."

7. VALENTINE: VALidation of ENcode Transcriptomes IN Eukaryotes

"We believe that the ENCODE annotations of the human genome are only 80% useful, therefore genome annotators will likely appreciate a VALENTINE."

8. PASSI(ON): Predicting ASSembly Integrity (Or Not)

"Based on our observations, we feel that there is an urgent need for PASSI(ON) within the genomics community."

9. CHOCOLATES: CHOosing COmputationaL AlgoriThms for Testing Evolutionary Simulations

"In a field which increasingly offers a bewildering choice of bioinformatics tools, we feel that researchers will appreciate CHOCOLATES."

10. SNUGGLES: SearchiNg for Unique Genes in orGanisms Like Eels and Snakes

"There is a desperate shortage of bioinformatics tools that are dedicated to finding unique genes in creatures that look a bit like worms. Hence we are confident that the community of people who work on snakes, eels, nematodes, and other tubular-like organisms will be receptive to SNUGGLES."