If cars were made by bioinformaticians... →

February 02, 2015 by Keith Bradnam

Saw this on Twitter today:

If cars were made by bioinformaticians... a tribute to the JABBA award of @kbradnam (among others) #bioinformatics http://t.co/SfeMvvuMpH
— Guillaume Filion (@thegrandlocus) February 2, 2015

Guillaume has some fun with this topic on his blog (The Grand Locus). Obviously I liked the first item on the list the most ('Cars would have nice names'), which included:

Here we present CaЯ (vehiCle for chAnging geo-cooЯdinates), a fast and accurate tool as an alternative to existing vehicles.

Genome Assembly: the art of trying to make one BIG thing from millions of very small things

January 30, 2015 by Keith Bradnam

Here are the slides from a talk I gave this week at UC Davis (also embedded below). This talk was for a group of graduate students (from different backgrounds).

Note, because I tend to make very visual slides which don't always work well in isolation (you need to hear my sparkling narrative!), I have taken time to duplicate many slides and embed notes to indicate approximately what I would have said to explain the slide.

Genome Assembly: the art of trying to make one BIG thing from millions of very small things from Keith Bradnam

Why Siri might not be the best tool for dictating ideas about bioinformatics

January 27, 2015 by Keith Bradnam

I recently tried using Siri on my iPhone to dictate some notes regarding the outline of a lecture that I am giving this week. It seems that although Siri is great for capturing lots of 'regular' text, she doesn't do so well at bioinformatics/genomics jargon. Here is what I captured before I abandoned this attempt:

Some background for the total truth, why is Tina assembly important and why is genome is somebody difficult.

Do you have a background will consist of looking at twin we still about the human genome and then we will move onto the positives it most human gene into multiple GMC different individuals and different tissues such as cancer

We can show some background for project such as the 959 Nemitz hoot genomes. And the 1001 hour blocks is genomes and Jean and tank hey and I 5K and Billy and genomes project

Make the analogy of chicks holes maybe share the wealth Lodge is Txell is. Then showed my old slides to inches terminology. We should also have a definition of genome assembly.

For my genome Dicsal slides I can introduce the concept called holds but I have to make a point, but in routine assembly there are many little pieces. Showed how many pieces that would be 100 base pay rate. How can I how much covers there is a huge 11 translate into detail pieces. Make the point there is no image to work from. Or if there is an image it's for a flirt. Then mention it the issue of repeats in the genome. Also discussed the issue of deployed genomes

If you're curious, these notes refer to a talk about genome assembly. I like how 'nematode' came out as 'Nemitz hoot', 'jigsaws' became 'chicks holes', and 'diploid' turned into 'deployed'.

From CASP to Poreathon: what makes for a good bioinformatics 'brand' name?

January 23, 2015 by Keith Bradnam

One of my more significant contributions to the world of bioinformatics is that I came up with the name for The Assemblathon.

Towards the end of 2010, our group at the UC Davis Genome Center was tasked with helping organize a new competition to assess software in the field of genome assembly. I remember a midweek meeting with my boss (Ian Korf) where he informed me that by the end of the week we had to come up with a name for the project, set up a website, and have a mailing list up and running…and by 'we' he meant 'me'.

I was aware that there had been several other comparative software assessments in the field of bioinformatics, and that a certain theme had arisen in the naming of such exercises:

CASP - Critical Assessment of protein Structure Prediction: running since 1994 and organized by a team that are also in the Genome Center
GASP - Genome Annotation aSsessment Project (later renamed GASP1): a 1999 attempt to assess annotation in a region of the Drosophila melanogaster genome
EGASP - the human ENCODE Genome Annotation aSsessment Project: 2005–2006
nGASP - nematode Genome Annotation aSsessment Project: 2006–2008
RGASP - RNA-seq Genome Annotation aSsessment Project: 2005–2013 (RGASP1 and RGASP2 were designed to evaluate computational methods for RNA-seq data analysis whereas the latest RGASP3 is focusing on comparing RNA-seq read alignment software)
dnGASP - de novo Genome Assembly aSsessment Project: 2010–2011 (something that ran in parallel with Assemblathon 1)

It seems amazing to me that after GASP decided to make a bogus acronym by including the 'S' from 'aSsessment', all subsequent evaluation exercises followed suit (although you could also argue that CASP could have worked equally well as 'CAPS').

I felt quite strongly that the world did not need another '…ASP' style of name and so I came up with 'The Assemblathon'. Although many might shudder at this, I was really thinking of it as a 'brand' name, rather than just another forgettable scientific project name. The Assemblathon name ticked several boxes:

Memorable
Different
Pronounceable
Website name was available
Twitter account name was available

The last two items are kind of obvious when you realize that this is a completely new word. You may disagree, but I think that these are important — but not essential — aspects of naming a scientific project.

So what has happened since I bequeathed the Assemblathon brand to the world? Well, we've now had:

Alignathon - A collaborative competition to assess the state of the art in whole genome sequence alignment (published in 2014)
Variathon - A challenge to analyze existing or new pipelines for variant calling in terms of accuracy and efficiency (completed in 2013, but not published yet as far as I can tell)
Poreathon - Assessment of bioinformatics pipelines relating to Oxford Nanopore sequencing data (announced by Nick Loman this week)

I don't have any issues with 'Alignathon', as the name is based on a verb and the goal of the project is probably guessable by any bioinformatician. Like Assemblathon, it is a portmanteau that just seems to work.

In contrast, I find 'Variathon' a horrible name. The name doesn't scan well and may not make as much sense to others. If you search Google for this name you will see the following:

Not a good sign if your project name is regarded as a spelling mistake!

So what about 'Poreathon'? While I find this less offensive than Variathon, I still don't think it is a particularly snappy name…a bit of a snoreathon perhaps? ;-) Pore is both a noun and a verb, so the dual meaning of the word somewhat dilutes its impact as a project name.

5 suggestions for naming scientific projects

You should not feel committed to naming something in order to continue a previous naming trend
Acronyms are not the only option for the name of a scientific project!
If there is any confusion as to how your project name is spelt or pronounced, this will not help you promote the name among your peers.
Consider treating the intended name as a brand, and explore the issues that arise (how discoverable is the name, how similar to other 'brands', can you trademark it, is your name offensive in other languages, can you buy a suitable domain name? etc.)
At the very least, perform a Google search for your intended name to see if others in your field have already used it (see my post on Identical Classifications In Science)

Unpronounceable bioinformatics database names

January 21, 2015 by Keith Bradnam

First a quick reminder that an acronym is something that is meant to be pronounced as an entire word (e.g. NATO, AIDS etc.). Sometimes these end up becoming regular, non-capitalized, words (e.g. radar, laser).

In contrast, an initialism is something where the component letters are read out individually (e.g. BBC, CPU). In bioinformatics, there are also names which are part acronym and part initialism (e.g. GWAS…which I have only ever heard pronounced as gee-was).

Most initialisms that we use in everyday life tend to be short (2–4 letters) because this makes them easier to read and to pronounce. As you move past 4 letters, you run the risk of making your initialism unpronounceable and unmemorable.

So here are some recently published bioinformatics tools with names that are a bit cumbersome to repeat. For each one I include how someone might try to pronounce them. Try repeating these names quickly and for an added test, see how many of these names you can remember 5 minutes after you read this:

5 characters

CeCaFDB: a curated database for the documentation, visualization and comparative analysis of central carbon metabolic flux distributions explored by 13C-fluxomics: cee-car-eff-dee-bee? — this assumes that 'Ce' and 'Ca' are not treated separately as two letters…one could argue that if it is not clear how your bioinformatics tool name should be pronounced, then it does not have a good name.
EHFPI: a database and analysis resource of essential host factors for pathogenic infection: ee-aitch-eff-pee-aye
PAIDB v2.0: exploration and analysis of pathogenicity and resistance islands: pee-ay-aye-dee-bee — this is a particularly bad choice of name as it will read to many as 'paid-bee'
rrnDB: improved tools for interpreting rRNA gene abundance in bacteria and archaea and a new foundation for future development: ar-ar-en-dee-bee (the first 3 characters are not easy to say quickly!)
The TTSMI database: a catalog of triplex target DNA sites associated with genes and regulatory elements in the human genome: tee-tee-ess-em-aye

6 characters

DBTMEE: a database of transcriptome in mouse early embryos: dee-bee-tee-em-ee-ee (I accept that maybe this one is just pronounced dee-bee-tee-me, but once again, do you really want there to be uncertainty as to how the name of your bioinformatics tool is read by others?)
euL1db: the European database of L1HS retrotransposon insertions in humans: ee-you-ell-one-dee-bee
SASBDB, a repository for biological small-angle scattering data: ess-ay-ess-bee-dee-bee
WDSPdb: a database for WD40-repeat proteins: dub-ball-you-dee-ess-pee-dee-bee

7 characters

BCCTBbp: the Breast Cancer Campaign Tissue Bank bioinformatics portal: bee-cee-cee-tee-bee-bee-pee
PFP/ESG: automated protein function prediction servers enhanced with Gene Ontology visualization tool: pee-eff-pee-slash-ee-ess-gee (only 6 characters if you omit the slash I guess)
PHI-DAC: protein homology database through dihedral angle conservation: pee-aitch-aye-dash-dee-ay-cee (shorter if you omit dash and/or pronounce 'DAC' as a word)

And the winner is…

BioVLAB-MMIA-NGS: microRNA–mRNA integrated analysis using high-throughput sequencing data: this is a 7-letter initialism that comes after a three-syllable (non-standard) word, so to pronounce this you have to say bio-vee-lab-em-em-aye-ay-en-gee-ess!!!

Conclusions

If you want people to actually use your bioinformatics tools, then you should aim to give them names that are memorable and pronounceable.

Are there too many biological databases?

January 16, 2015 by Keith Bradnam

The annual 'Database' issue of Nucleic Acids Research (N.A.R.) was recently published. It contains a mammoth set of 172 papers that describe 56 new biological databases as well as updates to 115 others. I've already briefly commented on one of these papers, and expect that I'll be nominating several others for JABBA awards.

In this post I just wanted to comment on the seemingly inexorable growth of these computational resources. There are databases for just about everything these days. Different species, different diseases, different types of sequence, different biological mechanisms…every possible biological topic has a relevant database, and sometimes they have several.

It is increasingly hard to even stay on top of just how many databases are out there. Wikipedia has a listing of biological databases as well as a category for biological databases, but both of these barely scratch the surface of what is out there.

So maybe one might turn to 'DBD': a Database of Biological Databases or even MetaBase which also describes itself as a 'Database of Biological Databases' (please don't start thinking about creating 'DBDBBDB': A Database of Databases of Biological Databases!).

However, the home pages of these two sites were last updated in 2008 and 2011 respectively, perfectly reflecting one of the problems in the world of biological databases…they often don't get removed when they go out of date. In a past life, I was a developer of several databases at something called UK CropNet. Curation of these databases, particularly the Arabidopsis Genome Resource, effectively stopped when I left the job in 2001 but the databases were only taken offline in 2013!!!

So old, out-of-date databases are part of the problem, but the other issue is that there seem to be some independent databases that, in an ideal world, should really be merged with similar databases. E.g. there is a database called BeetleBase that describes its remit as follows:

BeetleBase is a comprehensive sequence database and important community resource for Tribolium genetics, genomics and developmental biology.

This database has been around since at least 2007 though I'm not entirely sure if it is still being actively developed. However, I was still surprised to see this paper as part of the N.A.R. Database issue:

iBeetle-Base: a database for RNAi phenotypes in the red flour beetle Tribolium castaneum

iBeetle-Base has seemingly been developed by a group of people separate from BeetleBase. Is it helpful to the wider community to have two databases like this, with confusingly similar names? It's possible that the iBeetle-Base people tried reaching out to the BeetleBase folks to include their data in the pre-existing database, but were rebuffed or found out that BeetleBase is no longer a going concern. Who knows, but it just seems a shame to have so much genomics information for a species split across multiple databases.

I'm not sure what could, or should, be done to tackle these issues. Should we discourage new databases if there are already existing resources that cover much of the subject matter? Should we require the people who run databases to 'wind up' the resources in a better way when funding runs out (i.e. retire databases or make it abundantly clear that a resource is no longer being updated)? Is it even possible to set some minimum standards for database usage that must be met in order for subsequent 'update papers' to get published (i.e. 'X' DB accesses per month)?

diArk – the database for eukaryotic genome and transcriptome assemblies in 2014 →

January 15, 2015 by Keith Bradnam

A new paper in Nucleic Acids Research describes a database that I was not aware of. The abstract features an eye-catching, not to mention ambitious, claim (the emphasis is mine):

The database…has been developed with the aim to provide access to all available assembled genomes and transcriptomes.

The diArk database currently features data on 2,771 species. There are many options to filter your search queries including filtering by 'sequencing type' and by the status of completion. So when I search for 'completed' genome sequencing projects, it reports that there are 3,626 projects corresponding to 1,848 species. The FAQ has this to say regarding 'completeness':

The term completeness is intended to describe the coverage of the genome and the chance to find all homologs of the gene of interest.

I was a bit put off by the interface to this database. As far as I can tell, diArk mostly contains links to other resources (rather than hosting any sequence information). There are lots of very small icons everywhere which are hard to understand (unless you mouse over each icon). When I went to the page for Caenorhabditis elegans, I was struck by the confusing nature of just posting links to every C. elegans resource on the web. There are 12 'Project' links listed. Which one gives you access to the latest version of the genome sequence?

diArk summary of Caenorhabditis elegans data

As a final comment, I noticed that the latest entry on the diArk news page is from September 2011 which is a bit worrying (nothing newsworthy has happened in the last 3 years?).

Red flag alert for a bogus bioinformatics acronym

January 12, 2015 by Keith Bradnam

The first JABBA award of 2015 goes to a paper that was published at the end of 2014 (thanks to Twitter user @chenghlee for bringing this to my attention). The paper, published in BMC Medical Genomics, has a succinct title that contains a very bogus name:

FLAGS, frequently mutated genes in public exomes

The title doesn't explicitly reveal the source of the acronym 'FLAGS', but you can probably take a guess. From the abstract:

We termed these genes FLAGS for FrequentLy mutAted GeneS

This gets a JABBA award because a majority (3 out of 5) of the letters in 'FLAGS' are not from the initial letters of words.
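The '3 out of 5' count can be checked mechanically. The snippet below is only a rough sketch: it assumes that the capital letters in the paper's styled phrase mark the letters used in the acronym, and the function name and the 'more than half' threshold are my own illustrative choices rather than any formal JABBA rule.

```python
def initial_letter_stats(styled_phrase):
    """Return (word_initial, total) counts for the capitalized letters.

    Assumption: capitals in the styled phrase mark the acronym letters,
    as in 'FrequentLy mutAted GeneS'.
    """
    total = 0
    word_initial = 0
    for word in styled_phrase.split():
        for position, char in enumerate(word):
            if char.isupper():
                total += 1
                if position == 0:
                    word_initial += 1
    return word_initial, total


word_initial, total = initial_letter_stats("FrequentLy mutAted GeneS")
non_initial = total - word_initial

# Hypothetical threshold: call a name bogus when most of its letters
# are not taken from the start of a word.
is_bogus = non_initial > total / 2

print(f"{non_initial} of {total} letters are not word-initial; bogus: {is_bogus}")
```

Running this on other styled names only works if the authors have capitalized the letters they borrowed, which, to be fair, most of them do.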

A little bit of end-of-year DNA from ACGT

December 31, 2014 by Keith Bradnam

It just remains for me to say:

CATGCCCCCCCCTATAATGAATGGTATGAAGCCCGCTA
ACATGCCGTCGAAGCCGGCCGCGAAGCCACCACCTGGG
AAAATACCTATTTTATTTTTACCGAAGAAAATTAAATG
GCCTATACCCATGAAAATGAATGGTATGAAGCCCGCGG
CGAAAATGAACGCGCCACCGAATTAGAAAATGGCACCC
ATTATCGCGAAGCCGATTCCTAA
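If you would rather let a computer read this for you, here is a minimal sketch of one way to do it. It assumes the sequence is meant to be read as codons from the very first base using the standard genetic code, and that stop codons act as separators; those are my working assumptions for the example, and the names `CODON_TABLE`, `translate` and `MESSAGE` are just illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in frame 1; stop codons are written as '*'."""
    dna = "".join(dna.split()).upper()   # drop line breaks and spaces
    usable = len(dna) - len(dna) % 3     # ignore any trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


MESSAGE = """
CATGCCCCCCCCTATAATGAATGGTATGAAGCCCGCTA
ACATGCCGTCGAAGCCGGCCGCGAAGCCACCACCTGGG
AAAATACCTATTTTATTTTTACCGAAGAAAATTAAATG
GCCTATACCCATGAAAATGAATGGTATGAAGCCCGCGG
CGAAAATGAACGCGCCACCGAATTAGAAAATGGCACCC
ATTATCGCGAAGCCGATTCCTAA
"""

protein = translate(MESSAGE)

# Print one line per stretch between stop codons.
for phrase in protein.split("*"):
    if phrase:
        print(phrase)
```

The one-letter amino acid codes have no word breaks, so you will still need to add the spaces yourself.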