Why Siri might not be the best tool for dictating ideas about bioinformatics

I recently tried using Siri on my iPhone to dictate some notes regarding the outline of a lecture that I am giving this week. It seems that although Siri is great for capturing lots of 'regular' text, she doesn't do so well with bioinformatics/genomics jargon. Here is what I captured before I abandoned this attempt:

Some background for the total truth, why is Tina assembly important and why is genome is somebody difficult.

Do you have a background will consist of looking at twin we still about the human genome and then we will move onto the positives it most human gene into multiple GMC different individuals and different tissues such as cancer

We can show some background for project such as the 959 Nemitz hoot genomes. And the 1001 hour blocks is genomes and Jean and tank hey and I 5K and Billy and genomes project

Make the analogy of chicks holes maybe share the wealth Lodge is Txell is. Then showed my old slides to inches terminology. We should also have a definition of genome assembly.

For my genome Dicsal slides I can introduce the concept called holds but I have to make a point, but in routine assembly there are many little pieces. Showed how many pieces that would be 100 base pay rate. How can I how much covers there is a huge 11 translate into detail pieces. Make the point there is no image to work from. Or if there is an image it's for a flirt. Then mention it the issue of repeats in the genome. Also discussed the issue of deployed genomes

If you're curious, these notes refer to a talk about genome assembly. I like how 'nematode' came out as 'Nemitz hoot', 'jigsaws' became 'chicks holes', and 'diploid' turned into 'deployed'.

From CASP to Poreathon: what makes for a good bioinformatics 'brand' name?

One of my more significant contributions to the world of bioinformatics is that I came up with the name for The Assemblathon.

Towards the end of 2010, our group at the UC Davis Genome Center was tasked with helping organize a new competition to assess software in the field of genome assembly. I remember a midweek meeting with my boss (Ian Korf) where he informed me that by the end of the week we had to come up with a name for the project, set up a website, and have a mailing list up and running…and by 'we' he meant 'me'.

I was aware that there had been several other comparative software assessments in the field of bioinformatics, and that a certain theme had arisen in the naming of such exercises:

It seems amazing to me that after GASP decided to make a bogus acronym by including the 'S' from 'aSsessment', all subsequent evaluation exercises followed suit (although you could also argue that CASP could have worked equally well as 'CAPS').

I felt quite strongly that the world did not need another '…ASP' style of name and so I came up with 'The Assemblathon'. Although many might shudder at this, I was really thinking of it as a 'brand' name, rather than just another forgettable scientific project name. The Assemblathon name ticked several boxes:

  1. Memorable
  2. Different
  3. Pronounceable
  4. Website name was available
  5. Twitter account name was available

The last two items are kind of obvious when you realize that this is a completely new word. You may disagree, but I think that these are important — but not essential — aspects of naming a scientific project.

So what has happened since I bequeathed the Assemblathon brand to the world? Well, we've now had:

  1. Alignathon - A collaborative competition to assess the state of the art in whole genome sequence alignment (published in 2014)
  2. Variathon - A challenge to analyze existing or new pipelines for variant calling in terms of accuracy and efficiency (completed in 2013, but not published yet as far as I can tell)
  3. Poreathon - Assessment of bioinformatics pipelines relating to Oxford Nanopore sequencing data (announced by Nick Loman this week)

I don't have any issues with 'Alignathon', as the name is based on a verb and the goal of the project is probably guessable by any bioinformatician. Like Assemblathon, it is a portmanteau that just seems to work.

In contrast, I find 'Variathon' a horrible name. The name doesn't scan well and may not make as much sense to others. If you search Google for this name you will see the following:

Not a good sign if your project name is regarded as a spelling mistake!

So what about 'Poreathon'? While I find this less offensive than Variathon, I still don't think it is a particularly snappy name…a bit of a snoreathon perhaps? ;-) Pore is both a noun and a verb, so the dual meaning of the word somewhat dilutes its impact as a project name.

5 suggestions for naming scientific projects

  1. You should not feel committed to naming something in order to continue a previous naming trend
  2. Acronyms are not the only option for the name of a scientific project!
  3. If there is any confusion as to how your project name is spelt or pronounced, this will not help you promote the name among your peers.
  4. Consider treating the intended name as a brand, and explore the issues that arise (how discoverable is the name, how similar to other 'brands', can you trademark it, is your name offensive in other languages, can you buy a suitable domain name? etc.)
  5. At the very least, perform a Google search for your intended name to see if others in your field have already used it (see my post on Identical Classifications In Science)

Unpronounceable bioinformatics database names

First a quick reminder that an acronym is something that is meant to be pronounced as an entire word (e.g. NATO, AIDS etc.). Sometimes these end up becoming regular, non-capitalized, words (e.g. radar, laser).

In contrast, an initialism is something where the component letters are read out individually (e.g. BBC, CPU). In bioinformatics, there are also names which are part acronym and part initialism (e.g. GWAS…which I have only ever heard pronounced as gee-was).

Most initialisms that we use in everyday life tend to be short (2–4 letters) because this makes them easier to read and to pronounce. As you move past 4 letters, you run the risk of making your initialism unpronounceable and unmemorable.

So here are some recently published bioinformatics tools with names that are a bit cumbersome to repeat. For each one I include how someone might try to pronounce them. Try repeating these names quickly and for an added test, see how many of these names you can remember 5 minutes after you read this:

5 characters

6 characters

7 characters

And the winner goes to…

Conclusions

If you want people to actually use your bioinformatics tools, then you should aim to give them names that are memorable and pronounceable.

More bioinformatics link rot: where is EUROCarbDB?

Update 2015-01-19 15.19: I contacted the corresponding author about this and now the EurocarbDB link in the original paper works.

First published online a few months ago in the journal Bioinformatics (September 12th, 2014):

The name of this resource is not the snappiest out there ("Oh, you're interested in resources for glycomics? Have you tried EuroCarbDB-open parentheses-cee-cee-ar-cee-close parentheses?"). Leaving that aside, the paper lists the following URLs as part of the abstract:

Availability and implementation: The installation with the glycan standards is available at http://glycomics.ccrc.uga.edu/eurocarb/. The source code of the project is available at https://code.google.com/p/ucdb/.

The first link says that the server is down. The parent page (http://glycomics.ccrc.uga.edu/) seems to make no mention at all of this resource (not that I can find anywhere). Following the second link in the abstract, I found the following text:

An incubator project for the future direction of the EUROCarbDB project. More to follow.... This new project is in it's infancy - please use the original EUROCarbDB site. A new project will be hosted at UniCarb-DB (http://www.unicarb-db.org to reflect the continued work of the developers

I followed the first of these links to the 'original' EUROCarbDB site. This Google Code page in turn told me that the online version of EuroCarbDB is hosted by the European Bioinformatics Institute.

Following the link for the online version of EUROCarbDB takes me to what seems to be a closed down site at the EBI titled 'What happened to the EuroCarbDB website?' which has this to say:

The pilot project ended in 2009 but efforts to obtain renewed funding have unfortunately not been successful. The EuroCarbDB website was hosted by the Protein Data Bank in Europe at EMBL-EBI but has now been discontinued

So that's all very helpful then.
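If you wanted to catch this kind of link rot before your readers do, something like the following could help. This is only a minimal Python sketch using the standard library, and not anything that the EUROCarbDB authors or the journal actually use. The function name `check_url` and its `opener` parameter are my own inventions for illustration; the `opener` argument is only there so that the function can be tried out without touching the network.

```python
import urllib.error
import urllib.request


def check_url(url, opener=urllib.request.urlopen, timeout=10):
    """Return 'ok', 'http <code>', or 'unreachable' for a URL.

    `opener` is a hypothetical hook (defaulting to urlopen) so that
    the function can be exercised with a stand-in during testing.
    """
    # A HEAD request asks for headers only; note that some servers
    # refuse HEAD, in which case a GET would be needed instead.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            if response.status < 400:
                return "ok"
            return f"http {response.status}"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return f"http {err.code}"
    except OSError:
        # Covers URLError, DNS failures, refused connections, timeouts.
        return "unreachable"


# Example (needs network access, so it is left commented out):
# for link in ("http://glycomics.ccrc.uga.edu/eurocarb/",
#              "https://code.google.com/p/ucdb/"):
#     print(link, check_url(link))
```

Bear in mind that a check like this only tells you whether a server answers. A page such as the EBI one above would presumably still come back as 'ok', even though all it tells you is that the resource has been discontinued.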