Sowing the seeds of bad bioinformatics names

February 17, 2015 by Keith Bradnam

Here are two simple pieces of advice for people who are looking for a name for their latest bioinformatics tool/database/resource:

Avoid common words which might cause people searching for your tool to find something else instead.
Choose a name that hasn't been used before by the bioinformatics community.

Having said that, let's look at a new paper in the journal Bioinformatics:

Seed: a user-friendly tool for exploring and visualizing microbial community data

The name 'Seed' is a not-too-offensive acronym for Simple Exploration of Ecological Data. So what's my beef with it?

The problem is that words like seed are going to appear all over the Internet. My standard test for the 'searchability' of a bioinformatics tool is to search for the tool name followed by the word 'bioinformatics'. Your resource's website or publication should hopefully be the number one result (or somewhere on the first page). However, that is not what happens here.
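As a rough sketch, the test could be scripted in Python like this. The function names are hypothetical, and the code only builds the query and a Google search URL; it doesn't fetch or rank any results, so you still have to look at the first page yourself.

```python
from urllib.parse import urlencode


def searchability_query(tool_name, keyword="bioinformatics"):
    """Return the query used for the test: the tool name followed by the keyword."""
    return f"{tool_name.strip()} {keyword}"


def google_search_url(tool_name, keyword="bioinformatics"):
    """Build a Google search URL for the query (hypothetical helper; nothing is fetched)."""
    query = searchability_query(tool_name, keyword)
    return "https://www.google.com/search?" + urlencode({"q": query})


print(google_search_url("Seed"))
# https://www.google.com/search?q=Seed+bioinformatics
```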

And searching for 'seed bioinformatics' raises more problems by clashing with my second piece of advice. For example, here are a couple of papers that were on my first page of Google results:

2010: Accessing the SEED Genome Databases via Web Services API: Tools for Programmers

2011: SEED: efficient clustering of next-generation sequences

So what happens if you include 'microbial' into your search terms? Won't that help?

Nope. It turns out that the SEED (not an acronym, as far as I can tell) is an annotation environment for microbial genomes that has been around for a decade and has spawned many papers, e.g.:

2014: The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST)

All of which means that people looking for the newly published Seed tool are not going to have much luck when using search engines.

Is it a 'bad idea' to include gratuitous pictures of cleavage on an Oxford Journals website?

February 14, 2015 by Keith Bradnam

In a word, 'yes'.

2015-02-16: Note that this story has been updated after Oxford Journals contacted me about this (see end of post).

I know that journals need to make money, but it seems a bit shoddy when they allow any form of advertising to appear on their websites. I came across an article at Nucleic Acids Research today which featured the following advert:

Given that I have published in this journal before, I suppose that people reading our articles will also have a chance of seeing ads like this. I would ask Oxford Journals to think carefully about whether they really want adverts like this appearing on their site. This doesn't seem a particularly good fit for them as a scientific publisher; for that matter, it doesn't seem a great fit for the advertiser either.

Update: Oxford Journals reached out to me on Twitter with some good news:

@kbradnam Thank you for letting us know. The advertiser has been removed from our list of online ads.
— Oxford Journals (@OxfordJournals) February 16, 2015

BLAST bug (or feature?) in NCBI BLAST v2.2.30+

February 13, 2015 by Keith Bradnam

Something changed in the latest version of NCBI BLAST+ which breaks our CEGMA software. Compare the behavior of this simple TBLASTN command in v2.2.29+ and v2.2.30+ (from October 2014):

v2.2.29+

tblastn -db sample.dna -query sequence.prot -word_size 5

TBLASTN 2.2.29+

Database: sample.dna
           1 sequences; 2,499,950 total letters

Query= 7292122___KOG0292

Length=1234
                                                 Score     E
Sequences producing significant alignments:      (Bits)  Value

CHROMOSOME_I 1 15072418                           38.9    0.002

v2.2.30+

tblastn -db sample.dna -query sequence.aa -word_size 5

BLAST query/options error: Compressed alphabet lookup table requires word size 6 or 7
Please refer to the BLAST+ user manual.

One step in the CEGMA pipeline involves running TBLASTN with a word size of 5. This no longer works in the latest version and the error message suggests that only a word size of 6 or 7 is permitted. I can confirm that this is the case by looking at the latest source code for the blast_options.c file:


else if (options->lut_type == eCompressedAaLookupTable &&
         options->word_size != 6 && options->word_size != 7) {
         Blast_MessageWrite(blast_msg, eBlastSevError, kBlastMessageNoContext,
               "Compressed alphabet lookup table requires "
               "word size 6 or 7");
         return BLASTERR_OPTION_VALUE_INVALID;
}

The error message suggests I look at the BLAST+ user manual. I did this, and according to Table C5:

tblastn application options:

option = word_size    
type = integer
default value  = 3 
description and notes = "Valid word sizes are 2-7."

There also seems to be no mention of this change in the release notes, all of which makes me think that this is a bug. So I will report this to the NCBI, but any CEGMA users out there may wish to hold off updating to v2.2.30+.
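For anyone wrapping TBLASTN in a pipeline, a guard along these lines could flag the problem before a run fails. This is a minimal sketch in Python and not part of CEGMA: the function names are hypothetical, the version format ('2.2.30+') is taken from the output banner shown above, and the rule that versions from 2.2.30 onwards reject a word size of 5 is an assumption based only on the behavior observed in this post.

```python
import re


def parse_blast_version(text):
    """Extract (major, minor, patch) from text such as 'TBLASTN 2.2.30+'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no BLAST+ version found in: {text!r}")
    return tuple(int(part) for part in match.groups())


def word_size_5_expected_to_work(version_text):
    """Assumption from this post: 2.2.29+ accepts -word_size 5, 2.2.30+ rejects it."""
    return parse_blast_version(version_text) < (2, 2, 30)


for banner in ("TBLASTN 2.2.29+", "TBLASTN 2.2.30+"):
    status = "ok" if word_size_5_expected_to_work(banner) else "expect an error"
    print(f"{banner}: word_size 5 -> {status}")
```

If the behavior turns out to be a bug that gets fixed in a later release, the comparison would need an upper bound as well.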

10 bioinformatics tools you should be using on Valentine's Day

February 13, 2015 by Keith Bradnam

1. HUGS: the database of HUman Genome Sequences

"We envisage that the growth in personal genomics will mean that researchers will increasingly want HUGS to cope with their work."

2. LOVE: LncRNA Ortholog Validation and Evaluation

"If you are unsure as to the quality of your lncRNA annotations, we suggest that you need LOVE."

3. KISSES: Kmers In aSsembled SEquenceS

"We envisage that KISSES will be widely distributed by people working in the field of genome assembly."

4. HEART: Histidine Enrichment Analysis Report Tool

"Accurate detection of histidine-enriched sequences can be achieved if researchers have HEART."

5. ILOVEYOU: Intergenic LOng VariablE Yeast Operational Units

"Detection of this new class of conserved intergenic element will open new avenues for S. cerevisiae researchers, and we predict that many will benefit from a deeper understanding of ILOVEYOUs."

6. ROSESARERED: Random Ortholog SEquence Simulations that ARE REDundant

"This tool effectively generates a series of, largely pointless, simulated ortholog sequences. See also our companion software: ValIdation Of Long Eukaryotic TranscriptS thAt Randomly appEar BioLogically UsEful (VIOLETSAREBLUE)."

7. VALENTINE: VALidation of ENcode Transcriptomes IN Eukaryotes

"We believe that the ENCODE annotations of the human genome are only 80% useful, therefore genome annotators will likely appreciate a VALENTINE."

8. PASSI(ON): Predicting ASSembly Integrity (Or Not)

"Based on our observations, we feel that there is an urgent need for PASSI(ON) within the genomics community."

9. CHOCOLATES: CHOosing COmputationaL AlgoriThms for Testing Evolutionary Simulations

"In a field which increasingly offers a bewildering choice of bioinformatics tools, we feel that researchers will appreciate CHOCOLATES."

10. SNUGGLES: SearchiNg for Unique Genes in orGanisms Like Eels and Snakes

"There is a desperate shortage of bioinformatics tools that are dedicated to finding unique genes in creatures that look a bit like worms. Hence we are confident that the community of people who work on snakes, eels, nematodes, and other tubular-like organisms will be receptive to SNUGGLES."

Can you say the name of this new bioinformatics method three times fast?

February 12, 2015 by Keith Bradnam

New in the journal Bioinformatics:

jNMFMA: a joint non-negative matrix factorization meta-analysis of transcriptomics data

JNMF stands for Joint Non-negative Matrix Factorization. Throw in some meta-analysis, randomly decide to make the 'J' lower-case as well as italicized, and you end up with the trips-off-the-tongue name of jNMFMA. Try saying it three times fast! Actually, I had trouble pronouncing this just once.

How not to write a sequence assembly comparison paper →

February 07, 2015 by Keith Bradnam

A great post by Keith Robison in which he casts his expert eye in the direction of a new pre-print published at F1000 Research.

In the preprint there is a very ill-designed figure, which should be a table, that prints badly, with the font much too light for the unnecessarily heavy background. Displaying a four quadrant image adds nothing; a neatly organized table could display more information in a far more readable format allowing easy comparison. Even if the design were defensible, the content is embarrassingly out-of-date, I'd estimate by close to two years.

The post ends with a great coda.

If cars were made by bioinformaticians... →

February 02, 2015 by Keith Bradnam

Saw this on Twitter today:

If cars were made by bioinformaticians... a tribute to the JABBA award of @kbradnam (among others) #bioinformatics http://t.co/SfeMvvuMpH
— Guillaume Filion (@thegrandlocus) February 2, 2015

Guillaume has some fun with this topic on his blog (The Grand Locus). Obviously I liked the first item on the list the most ('Cars would have nice names'), which included:

Here we present CaЯ (vehiCle for chAnging geo-cooЯdinates), a fast and accurate tool as an alternative to existing vehicles.

Genome Assembly: the art of trying to make one BIG thing from millions of very small things

January 30, 2015 by Keith Bradnam

Here are the slides from a talk I gave this week at UC Davis (also embedded below). This talk was for a group of graduate students (from different backgrounds).

Note, because I tend to make very visual slides which don't always work well in isolation (you need to hear my sparkling narrative!), I have taken time to duplicate many slides and embed notes to indicate approximately what I would have said to explain the slide.


Why Siri might not be the best tool for dictating ideas about bioinformatics

January 27, 2015 by Keith Bradnam

I recently tried using Siri on my iPhone to dictate some notes regarding the outline of a lecture that I am giving this week. It seems that although Siri is great for capturing lots of 'regular' text, she doesn't do so well at bioinformatics/genomics jargon. Here is what I captured before I abandoned this attempt:

Some background for the total truth, why is Tina assembly important and why is genome is somebody difficult.

Do you have a background will consist of looking at twin we still about the human genome and then we will move onto the positives it most human gene into multiple GMC different individuals and different tissues such as cancer

We can show some background for project such as the 959 Nemitz hoot genomes. And the 1001 hour blocks is genomes and Jean and tank hey and I 5K and Billy and genomes project

Make the analogy of chicks holes maybe share the wealth Lodge is Txell is. Then showed my old slides to inches terminology. We should also have a definition of genome assembly.

For my genome Dicsal slides I can introduce the concept called holds but I have to make a point, but in routine assembly there are many little pieces. Showed how many pieces that would be 100 base pay rate. How can I how much covers there is a huge 11 translate into detail pieces. Make the point there is no image to work from. Or if there is an image it's for a flirt. Then mention it the issue of repeats in the genome. Also discussed the issue of deployed genomes

If you're curious, these notes refer to a talk about genome assembly. I like how 'nematode' came out as 'Nemitz hoot', 'jigsaws' became 'chicks holes', and 'diploid' turned into 'deployed'.

From CASP to Poreathon: what makes for a good bioinformatics 'brand' name?

January 23, 2015 by Keith Bradnam

One of my more significant contributions to the world of bioinformatics is that I came up with the name for The Assemblathon.

Towards the end of 2010, our group at the UC Davis Genome Center was tasked with helping organize a new competition to assess software in the field of genome assembly. I remember a midweek meeting with my boss (Ian Korf) where he informed me that by the end of the week we had to come up with a name for the project, set up a website, and have a mailing list up and running…and by 'we' he meant 'me'.

I was aware that there had been several other comparative software assessments in the field of bioinformatics, and that a certain theme had arisen in the naming of such exercises:

CASP - Critical Assessment of protein Structure Prediction: running since 1994 and organized by a team that are also in the Genome Center
GASP - Genome Annotation aSsessment Project (later renamed GASP1): a 1999 attempt to assess annotation in a region of the Drosophila melanogaster genome
EGASP - the human ENCODE Genome Annotation aSsessment Project: 2005–2006
nGASP - nematode Genome Annotation aSsessment Project: 2006–2008
RGASP - RNA-seq Genome Annotation aSsessment Project: 2005–2013 (RGASP1 and RGASP2 were designed to evaluate computational methods for RNA-seq data analysis whereas the latest RGASP3 is focusing on comparing RNA-seq read alignment software)
dnGASP - de novo Genome Assembly aSsessment Project: 2010–2011 (something that ran in parallel with Assemblathon 1)

It seems amazing to me that after GASP decided to make a bogus acronym by including the 'S' from 'aSsessment', all subsequent evaluation exercises followed suit (although you could also argue that CASP could have worked equally well as 'CAPS').

I felt quite strongly that the world did not need another '…ASP' style of name and so I came up with 'The Assemblathon'. Although many might shudder at this, I was really thinking of it as a 'brand' name, rather than just another forgettable scientific project name. The Assemblathon name ticked several boxes:

Memorable
Different
Pronounceable
Website name was available
Twitter account name was available

The last two items are kind of obvious when you realize that this is a completely new word. You may disagree, but I think that these are important — but not essential — aspects of naming a scientific project.

So what has happened since I bequeathed the Assemblathon brand to the world? Well, we've now had:

Alignathon - A collaborative competition to assess the state of the art in whole genome sequence alignment (published in 2014)
Variathon - A challenge to analyze existing or new pipelines for variant calling in terms of accuracy and efficiency (completed in 2013, but not published yet as far as I can tell)
Poreathon - Assessment of bioinformatics pipelines relating to Oxford Nanopore sequencing data (announced by Nick Loman this week)

I don't have any issues with 'Alignathon', as the name is based on a verb and the goal of the project is probably guessable by any bioinformatician. Like Assemblathon, it is a portmanteau that just seems to work.

In contrast, I find 'Variathon' a horrible name. The name doesn't scan well and may not make as much sense to others. If you search Google for this name you will see the following:

Not a good sign if your project name is regarded as a spelling mistake!

So what about 'Poreathon'? While I find this less offensive than Variathon, I still don't think it is a particularly snappy name…a bit of a snoreathon perhaps? ;-) Pore is both a noun and a verb, so the dual meaning of the word somewhat dilutes its impact as a project name.

5 suggestions for naming scientific projects

You should not feel committed to naming something in order to continue a previous naming trend.
Acronyms are not the only option for the name of a scientific project!
If there is any confusion as to how your project name is spelt or pronounced, this will not help you promote the name among your peers.
Consider treating the intended name as a brand, and explore the issues that arise (how discoverable is the name, how similar is it to other 'brands', can you trademark it, is it offensive in other languages, can you buy a suitable domain name, etc.).
At the very least, perform a Google search for your intended name to see if others in your field have already used it (see my post on Identical Classifications In Science)
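The last suggestion can be partly automated once you have a list of names that are already in use. Here is a minimal sketch in Python, assuming you maintain such a list yourself: the function name is hypothetical, and the example names are only those mentioned in these posts.

```python
def find_name_clashes(candidate, existing_names):
    """Return existing names that match the candidate, ignoring case and surrounding spaces."""
    key = candidate.strip().casefold()
    return [name for name in existing_names if name.strip().casefold() == key]


# Names taken from these posts; a real check would need a much larger list.
known = ["SEED", "CASP", "GASP", "EGASP", "nGASP", "RGASP", "dnGASP",
         "Assemblathon", "Alignathon", "Variathon", "Poreathon"]

print(find_name_clashes("Seed", known))        # ['SEED']
print(find_name_clashes("Snoreathon", known))  # []
```

An exact, case-insensitive match is the bare minimum; it will not catch near misses or names that are simply common words, so it complements a search engine check and does not replace it.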