What’s in a (gene) name?

Screengrab from WormBase database showing exon and intron structure of a gene called PPA52350

A chance conversation this week gave me a reason to check in on everybody’s favourite model organism database for nematodes…WormBase. Over two decades ago I spent four years of my life as a project manager for the UK arm of WormBase. Based at the Sanger Institute near Cambridge, we partnered with three groups in the USA (CalTech, CSHL and WashU) to maintain and develop the database that was used by thousands of nematode researchers around the world.

At the heart of WormBase was genetic and genomic data for the model organism Caenorhabditis elegans. This was the first animal to have its genome mapped and then sequenced. More impressively, the work to accurately fill in every last gap of the genome continued for many years after the formal publication of the genome sequence.

The 1998 Science publication described just over 19,000 protein-coding genes and a few hundred non-coding RNA genes.

I joined WormBase in 2001 and within a year or so I was tasked with developing what would frequently be referred to as simply ‘the new gene model’. Prior to any genome sequencing, there were many genes in C. elegans that had been defined by over two decades of classical genetic mapping approaches.

These genes were named in a simple, but consistent, way with three letters and then a number (this would necessarily have to be expanded to four letters in later years). E.g. unc-10 was the 10th gene named in the unc gene family (referring to UNCoordinated movements in worms with that mutation).

Enter the world of gene and genome sequencing

As the worm genome project began, genes would gain new identifiers based on in silico gene predictions made against the sequences of the cosmids, fosmids and YACs that would comprise the genome. So the first gene prediction on cosmid clone ‘T10A3’ would be named T10A3.1, the next adjacent gene prediction would be 'T10A3.2' and so on. Alternative splicing further complicated this nomenclature but it was relatively easy to append ‘a’, ‘b’, ‘c’ etc for splice variants. Luckily (for WormBase) I think 25 splice variants were the most ever discovered (see egl-8 in WormBase).

These sequence-based identifiers would sometimes become confusing when a new gene was identified between the .1 and .2 gene predictions within a cosmid, fosmid or YAC. This messed up the original co-linearity of gene identifier with genomic location. E.g., you might end up with genes F46H5.3 and F46H5.4 flanking a newly discovered gene which would gain a number such as F46H5.12 (you would use the next available number for that cosmid/fosmid/YAC).
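
As a rough sketch of the naming scheme described above, the Python below splits a sequence-based identifier into its clone, gene number and optional splice-variant letter, and picks the next available number for a clone. The regular expression and the function names are my own assumptions for illustration; this is not the code that WormBase used, and real clone names may not all fit this simple pattern.

```python
import re

# Assumed pattern: clone name, a dot, a gene number, and an optional
# lower-case letter for a splice variant (e.g. T10A3.1 or C55B6.4a).
SEQ_NAME = re.compile(r"(?P<clone>[A-Z0-9]+)\.(?P<number>\d+)(?P<isoform>[a-z]?)")


def parse_sequence_name(name):
    """Split a sequence-based identifier into (clone, number, isoform)."""
    match = SEQ_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a sequence-based gene name: {name!r}")
    return match["clone"], int(match["number"]), match["isoform"]


def next_gene_name(clone, existing_names):
    """Return the next unused gene name for a clone (hypothetical helper)."""
    parsed = (parse_sequence_name(n) for n in existing_names)
    used = [number for c, number, _ in parsed if c == clone]
    return f"{clone}.{max(used, default=0) + 1}"
```

With existing genes F46H5.3, F46H5.4 and F46H5.11, `next_gene_name("F46H5", ...)` gives F46H5.12 no matter where on the cosmid the new gene sits, which is exactly how the numbering drifts away from genomic order.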

When two genes go to war

The situation just continued to get more confusing. Turns out that a lot of computer gene predictions were not always right. This might mean removing a gene, or (more commonly) merging two genes into one new gene structure (where the original genes might now become splice variants). So genes C55B6.4 and C55B6.5 might become C55B6.4a and C55B6.4b (I’m using some made up examples here, but I recall it was somewhat arbitrary as to which gene identifier of the two merged genes would survive and which would die).

As genes were originally predicted on each assembled cosmid/fosmid/YAC sequence, and as these sequences overlapped — a necessary aspect of how the genome project was completed — it was possible that the same gene might be predicted independently on two overlapping cosmids/fosmids/YACs (and hence initially gain two unrelated gene identifiers).

Then of course there was the frequent situation where a classically defined gene would be matched with its sequence-identified counterpart. Hence, from my earlier example the unc-10 gene could also be known as T10A3.1.

Both names needed to be searchable and take you to the same entry in the database. But even more confusingly, there would be many more names that might exist. This could be because two researchers had independently mapped/published the same gene at different times, or just because some people would give the gene their own name without first referring to the literature. Generally speaking, the worm community is very good at avoiding this sort of thing but it did occasionally happen.

So unc-10 is not only the same gene as T10A3.1 but it could also be referred to as rim-1 or CELE_T10A3.1. I love that the WormBase database has preserved so many ‘curatorial remarks’ left by WormBase staff (myself included) as we tried grappling with how to merge genes and deal with other anomalies. E.g. here is a remark about the twk-18 gene:

“There are two twk-18 loci. One is CGC-approved (C24A3.6) and one is CGC-unapproved, which became an other-name of unc-58(T06H11.1a/b)”

This reflects yet another level of complexity where two different researchers or groups had used the same gene name for different genes.
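
To make the problem concrete, here is a toy alias index in Python. It is only a sketch: the `alias_index`, `add_names` and `lookup` names are hypothetical, and I key each gene record by its sequence name purely for illustration. The names themselves come from the examples above.

```python
from collections import defaultdict

# Toy alias table: each name maps to the set of gene records it may refer to.
alias_index = defaultdict(set)


def add_names(record, names):
    """Register every name that should lead to the given gene record."""
    for name in names:
        alias_index[name].add(record)


add_names("T10A3.1", ["unc-10", "rim-1", "T10A3.1", "CELE_T10A3.1"])
add_names("C24A3.6", ["twk-18", "C24A3.6"])
# twk-18 was also used (unapproved) for the gene better known as unc-58
add_names("T06H11.1", ["unc-58", "twk-18", "T06H11.1"])


def lookup(name):
    """Return every record a name could refer to, sorted for stable output."""
    return sorted(alias_index.get(name, set()))
```

Here `lookup("rim-1")` finds the single unc-10 record, whereas `lookup("twk-18")` returns two records. That second case is the ambiguity that a curator (or a stable identifier) has to resolve.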

Enter ‘the new gene model’

So it became my job to work out how we could:

  1. Create a new systematic identifier for every gene (yes, this reminds me a bit of that XKCD comic)
  2. Roll this out across all of our partner groups in a way that didn’t break anything

This was a challenging problem but we eventually got there and all genes gained a new stable identifier that acted (hopefully) as a way to bring some order to how we (and others) worked with genes. These identifiers were ultimately rolled out in 2004 and an important part of ‘the new gene model’ was to allow a better way of capturing future gene births, deaths, mergers, and splits.

WormBase was based on ACeDB (A C. elegans DataBase), a bespoke database that was originally created by Richard Durbin and Jean Thierry-Mieg for the express purpose of managing C. elegans genetic data (it would go on to be used for many other organisms).

In ACeDB you can always see the underlying model for any object by switching to something called the ‘Tree Display’. Amazingly (to me anyway), you can still do this in WormBase to the present day. It is buried within the Tree Display of genes that you can see the original version control information that we added when we first migrated everything to ‘the new gene model’:

Screengrab from Tree Display view of a gene. Text in columns starts with 'Version_change' and includes a date column, a person identifier (WBPerson1971) and then a break down of an 'Event' which explains the detail of how the first import

Screengrab from Tree Display view of a gene in WormBase showing the Version change information for a gene.

In this case ‘geneace’ was the local ACeDB database that we used at the Sanger Institute to store information regarding all of the classically mapped genes. I was (and perhaps still am?) ‘WBPerson1971’; this means I am forever indirectly associated with most of the initial gene set of C. elegans in WormBase! This now brings me to the fundamental point that I wanted to make with this very long blog post…

What was the format of the new gene identifier?

I remember wanting to borrow from the format that the Ensembl genome database had established. Gene identifiers should have a fixed width by using leading zeroes to pad out the identifier (this makes it much friendlier to computer programs that have to process such data). I also wanted there to be a simple text prefix to the identifier (Ensembl has gene IDs such as ENSG00000139618: ENSG = ENSembl Gene).

I went with ‘WBGene’ for the prefix part which just left the question of how many digits to reserve for the gene space. At the time I was working on this I think the genome of a related nematode C. briggsae had been finished. As I recall, WormBase contained that data as well as some very limited data for a few other nematodes (including some classically mapped genes as well as many sequence derived genes). So I knew that WormBase would grow beyond the 20,000 or so genes there were at the time of the C. elegans genome publication.

I opted for eight digits in the identifier, allowing for 99,999,999 genes. At the time this did seem like overkill, but I would rather err on the side of caution than give future bioinformaticians major headaches, e.g. what if we had gone for a five-digit number and then it turned out that we needed to store over 100,000 genes?

All of this meant that our unc-10 gene from before could now also be known as WBGene00006750.
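
For anyone who wants to see why the fixed width matters, here is a minimal Python sketch of formatting and parsing identifiers in this style. The function names are my own (hypothetical), and I am assuming that numbering starts at 1, which matches the 99,999,999 figure above.

```python
import re

PREFIX = "WBGene"
WIDTH = 8  # eight zero-padded digits, as described above

WBGENE = re.compile(rf"{PREFIX}(\d{{{WIDTH}}})")


def format_gene_id(number):
    """Turn a gene number into a fixed-width identifier."""
    if not 0 < number < 10 ** WIDTH:
        raise ValueError(f"gene number out of range: {number}")
    return f"{PREFIX}{number:0{WIDTH}d}"


def parse_gene_id(identifier):
    """Recover the gene number from a fixed-width identifier."""
    match = WBGENE.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a valid gene identifier: {identifier!r}")
    return int(match.group(1))
```

So `format_gene_id(6750)` gives WBGene00006750. A handy side effect of the padding is that a plain alphabetical sort of these identifiers is also a numerical sort, which would not be true of unpadded names such as WBGene9 and WBGene10.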

An end to WormBase and an end to ‘the new gene model’

This week I wanted to have a look to see what the highest gene identifier was that had been added to WormBase. In taking a look at the website, I was saddened to see that WormBase came to an end in July 2025 with the 298th release of the database.

Thankfully, most of the data will live on in the Alliance of Genome Resources which is a consortium of seven model organism databases. It’s not clear to me whether the alliance will end up with yet another tier of identifier that will span all of the species that are represented. I note that unc-10 in their database currently gains a slight tweak with an extra prefix of ‘WB:’ to become WB:WBGene00006750 (presumably because they have the extra challenge of needing to know which database an identifier came from).

I wonder whether the new gene model will live on in the Alliance of Genome Resources and whether those eight digits of reserved identifier space will continue to fill up. Given how easy (and relatively cheap) it has become to sequence genomes these days, someone might decide they want to sequence the genomes of all nematodes (there are about 25,000 described species, but there could be many more).

The last gene in WormBase

And so I can finally bring this post to a close with the reveal of the last gene identifier that made it into the last release of WormBase:

WBGene00311061 (also known as PPA52350) is a protein-coding gene from the nematode Pristionchus pacificus. It is the highest number gene that I could find and it reflects the fact that the gene count of WormBase has increased over 15-fold since my time there.

I bet this is due to a lot of other species being added, but also due to an explosion in RNA genes in C. elegans. I’m glad that my choice of gene identifier all those years ago has survived.

In writing this blog post, it’s been a lot of fun to take this trip down memory lane. In my research for this post, I came across a recent video (March 2026) by the Alliance of Genome Resources which explains much more about how worm and fly genes are named. Tim Schedl is the gene name curator for the WormBase data within the Alliance dataset. He took over from Jonathan Hodgkin, who I worked very closely with during my time at WormBase.

If there are any lessons to be learned from all of this (especially for any young bioinformaticians out there), I would say:

  1. It’s hard to design databases for biological data. Biology is messy and will produce surprises that might only emerge years after you define a schema that you thought would capture all possible edge cases (biology will then laugh at your schema)
  2. If you can, try to future proof things as much as possible
  3. Do not - under any circumstances - allow asterisks or question marks to be part of a valid gene name! There were originally some genes in the geneace database that included asterisks which caused all manner of problems in ACeDB as asterisks were also used as a wildcard search operator.
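
That last lesson is easy to demonstrate. The Python sketch below uses the standard library’s `fnmatch` module as a stand-in for a wildcard search; it is not ACeDB’s query syntax, and the gene names starting with ‘abc’ are made up for the example.

```python
from fnmatch import fnmatchcase

# "abc-1*" is a made-up gene name that contains a literal asterisk
gene_names = ["unc-10", "abc-1*", "abc-10", "abc-11"]


def wildcard_search(pattern, names):
    """Return the names matching a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Searching for the gene by its own name also drags in abc-10 and abc-11,
# because the asterisk is read as 'match anything'.
too_many = wildcard_search("abc-1*", gene_names)

# In fnmatch, wrapping the asterisk in brackets makes it literal.
exact = wildcard_search("abc-1[*]", gene_names)
```

Every user and every script now has to remember the escaping rule, which is a good argument for simply banning such characters from gene names in the first place.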

Bad bioinformatics software names revisited

I recently have been sorting through lots of old notes files, including many from my time as a genomics researcher at UC Davis. One note file I had was called ‘Strategies for naming bioinformatics software’ and I initially assumed it was one of the blog posts posted on this blog.

However, I couldn’t find it as an actual post and when I did a quick web search, I instead discovered this ‘Bioinformatics lab’ podcast from earlier this year:

I have been out of the field of genomics/bioinformatics for many years now and didn’t know about The Bioinformatics Lab podcast which describes itself as ‘ramblings on all things bioinformatics’.

The conversation between the hosts (Kevin Libuit and Andrew Page) is good and listening to it brought back lots of memories from the many things I’ve written about on this blog. At the end of the episode, Andrew concludes:

“It’s kind of hard. People should put a bit of effort into it”

100% this! Naming software should definitely not be an afterthought. Andrew goes on:

“Before you do any development on anything, go and choose a really good name and make sure it doesn’t conflict with any trademarks or existing tools, you can Google it easily and it’s not offensive in any language.”

These are the types of things that I have written about extensively on this blog. If you are interested, perhaps start with

Then you can read any one of the nearly forty posts I wrote which handed out ‘JABBA awards’ (JABBA stands for Just Another Bogus Bioinformatics Acronym).

This award series started all the way back in 2013 and the inaugural award went to a tool with the crazy capitalisation of 'BeAtMuSiC'.

There’s also a series of posts on duplicate names in bioinformatics where people haven’t checked whether their software name is stepping on someone else’s toes.

This includes a post about the audacious attempt to name a new piece of bioinformatics software BLAST. There is also a post about the five different tools that are all called ‘SNAP’.

Admittedly I’ve been out of the loop for so long there is the possibility of there being many more SNAPs out there now!

The moral of this blog post is that names are important and it is very easy to mess them up, which could mean that fewer people ever discover your tool in the first place.

CEGMA is dying…just very, very slowly

This is my first post on this blog in almost three years and it is now almost nine years since I could legitimately call myself a genomics researcher or bioinformatician.

However, I feel that I need to 'come out of retirement' for one quick blog post on a topic that has spanned many others…CEGMA.

As I outlined in my last post on this blog, the CEGMA tool that I helped develop back in 2005 and which was first published in 2007, continues to be used.

This is despite many attempts to tell/remind people not to use it anymore! There are better tools out there (probably many that I'm not even aware of). Fundamentally, the weakness of using CEGMA is that it is based on an identified set of orthologs that was published over two decades ago.

And yet, every week I receive Google Scholar alerts that tell me that someone else has cited the tool again. We (myself and Ian Korf) should perhaps take some of the blame for keeping the software available on the Korf Lab website (I wonder how many other bioinformatics tools from 2007 can still be downloaded and successfully run?).

CEGMA citations (2011-2024)

When I saw that citations had peaked in 2017 and when I saw better tools come along, I thought it would be only a couple of years until the death knell tolled for CEGMA. I was wrong. It is dying…just very, very slowly. There were 119 citations last year and there have been 88 so far this year.

Academics (including former academics) obviously love to see their work cited. It is good to know that you have built tools that were actively used. But please, stop using CEGMA now! Myself and the other co-authors no longer need the citations to justify our existence.

Come back to this blog in another three years when I will no doubt post yet another post about CEGMA ('For the love of all that is holy why won't you just curl up and die!').

New BUSCO vs (very old) CEGMA

If I’m only going to write one or two blog posts a year on this blog, then it makes sense to return to my recurring theme of don’t use CEGMA, use BUSCO!

In 2015 I was foolishly optimistic that the development of BUSCO would mean that people would stop using CEGMA — a tool that we started developing in 2005 and which used a set of orthologs published in 2003! — and that we would reach ‘peak-CEGMA’ citations that year.

That didn’t happen. At the end of 2017, I again asked the question have we reached peak-CEGMA? because we had seen ten consecutive years of increasing publications.

Well I’m happy to announce that 2017 did indeed see citations to our 2007 CEGMA paper finally peak:

CEGMA citations by year (from Google Scholar)

Although we have definitely passed peak CEGMA, it still receives over 100 citations a year and people really should be using tools like BUSCO instead.

This neatly leads me to mention that a recent publication in Molecular Biology and Evolution describes an update to BUSCO:

From the introduction:

With respect to v3, the last BUSCO version, v5, features: 1) a major upgrade of the underlying data sets in sync with OrthoDB v10; 2) an updated workflow for the assessment of prokaryotic and viral genomes using the gene predictor Prodigal (Hyatt et al. 2010); 3) an alternative workflow for the assessment of eukaryotic genomes using the gene predictor MetaEuk (Levy Karin et al. 2020); 4) a workflow to automatically select the most appropriate BUSCO data set, enabling the analysis of sequences of unknown origin; 5) an option to run batch analysis of multiple inputs to facilitate high-throughput assessments of large data sets and metagenomic bins; and 6) a major refactoring of the code, and maintenance of two distribution channels on Bioconda (Grüning et al. 2018) and Docker (Merkel 2014).

Please, please, please…don’t use CEGMA anymore! It is enjoying a well-earned retirement at the Sunnyvale Home for Senior Bioinformatics Tools.

Three cheers for JABBA awards

These days, I mostly think of this blog as a time capsule to my past life as a scientist. Every so often though, I’m tempted out of retirement for one more post. This time I’ve actually been asked to bring back my JABBA awards by Martin Hunt (@martibartfast)…and with good reason!

There is a new preprint in bioRxiv…

I’m almost lost for words about this one. You know that it is a tenuous attempt at an acronym or initialism when you don’t use any letters from the 2nd, 3rd, 4th, or 5th words of the full software name!

The approach here is very close to just choosing a random five-letter word. The authors could also have had:

CLAMP: hierarChical taxonomic cLassification for virAl Metagenomic data via deeP learning

HOTEL: hierarcHical taxOnomic classificaTion for viral mEtagenomic data via deep Learning

RAVEN: hieraRchical tAxonomic classification for Viral metagenomic data via dEep learNing

ALIEN: hierArchical taxonomic cLassification for vIral metagEnomic data via deep learniNg

LARVA: hierarchicaL taxonomic classificAtion for viRal metagenomic data Via deep leArning

Okay, as this might be my only blog post of 2020, I’ll say CHEERio!

DOGMA: a new tool for assessing the quality of proteomes and transcriptomes

A new tool, recently published in Nucleic Acids Research, caught my eye this week:

The tool, by a team from the University of Münster, uses protein domains and domain arrangements in order to assess 'completeness' of a proteome or transcriptome. From the abstract…

Even in the era of next generation sequencing, in which bioinformatics tools abound, annotating transcriptomes and proteomes remains a challenge. This can have major implications for the reliability of studies based on these datasets. Therefore, quality assessment represents a crucial step prior to downstream analyses on novel transcriptomes and proteomes. DOGMA allows such a quality assessment to be carried out. The data of interest are evaluated based on a comparison with a core set of conserved protein domains and domain arrangements. Depending on the studied species, DOGMA offers precomputed core sets for different phylogenetic clades

Unlike CEGMA and BUSCO, which run against unannotated assemblies, DOGMA first requires a set of gene annotations. The paper focuses on the web server version of DOGMA but you can also access the source code online.

It's good to see that other groups are continuing to look at new ways of assessing the quality of large genome/transcriptome/proteome datasets.

What's in a name?

Initially, I thought the name was just a word that both echoed 'CEGMA' and reinforced the central dogma of molecular biology. Hooray I thought, a bioinformatics tool that just has a regular word as a name without relying on contrived acronyms.

Then I saw the website…

  • DOGMA: DOmain-based General Measure for transcriptome and proteome quality Assessment

This is even more tenuous than the older, unrelated, version of DOGMA:

  • DOGMA: Dual Organellar GenoMe Annotator

Beyond Generations: My Vocabulary for Sequencing Tech

Many writers have attempted to divide Next Generation Sequencing into Second Generation Sequencing and Third Generation Sequencing. Personally, I think it isn't helpful and just confuses matters. I'm not the biggest fan of Next Generation Sequencing (NGS) to start with, as like "post-modern architecture" (or heck, "modern architecture") it isn't future-proofed.

Keith Robison gives an interesting deep dive on how sequencing technologies have been named and potentially could be named.

This post reminded me of my previous takes on the confusing, and inconsistent labelling of these technologies:

Reflections on the 2019 Festival of Genomics conference in London

For the third year in a row, I attended the Festival of Genomics conference in London. This year saw the conference change venue, moving from the ExCel Arena to the Business Design Centre in Islington.

The new venue was notably smaller, leading to many sessions being heavily overcrowded. There were also fewer 'fun' activities compared to previous years: no graffiti wall and no recharging stations (massage stands and power points for phones).

The opening keynote was given by Professor Mark Caulfield (Chief Scientist at Genomics England).

From 100K to 500K

Reflecting on the completion of the 100,000 Genomes Project, Professor Caulfield revealed that the 100,000th genome was completed at 2:40 am on the 2nd December.

He also shared details that at the peak, the project was completing 6,000 genomes a month and it has now reached 103,311 genomes.

The next phase will see 500,000 genomes completed within the NHS over the next five years, with an 'ambition' to go on to sequence five million genomes.

Looking at the global picture of human genome sequencing, Professor Caulfield projected that there will be 60 million completed genomes by 2023.

I wrote more about the conference in a blog post for The Institute of Cancer Research:

Damn and blast…I can't think of what to name my software

As many people have pointed out on Twitter this week, there is a new preprint on bioRxiv that merits some discussion:

The full name of the test that is the subject of this article is the Bron/Lyon Attention Stability Test. You have to admit that 'BLAST' is a punchy and catchy acronym for a software tool.

It's just a shame that it is also an acronym for another piece of software that you may have come across.

It's a bold move to give your software the same name as another tool that has only been cited at least 135,000 times!

This is not the first, nor will it be the last, example of duplicate names in bioinformatics software, many of which I have written about before.

The 100,000 Genomes Project has finished

This week I helped write a blog post for The Institute of Cancer Research to mark the completion of the 100,000 Genomes Project. This blog post was co-written by a former colleague, Dr Sam Dick, who wrote the majority of the article:

Read the blog post:

Reflecting on this milestone achievement, I also took to Twitter this week for a lengthy (and admittedly rambling) thread that reflected on how far genomics has come as a field. Click on the tweet below to see the full Twitter thread: