Assembling a Twitter following: people continue to be interested in genome assembly

Late in 2010, I was asked to help organise what would initially be known as The Assemblathon and then, more formally, Assemblathon 1. One of the very first things I did was to come up with the name itself — more here on naming bioinformatics projects — register the domain name, and secure the Twitter account @Assemblathon.

The original goal was to use the website and Twitter account to promote the contest and then share details of how the competition was unfolding. This is exactly what we did, all the way through to the publication of the Assemblathon 1 paper in late 2011. Around this time it seemed sensible to also use the Twitter account to promote anything else related to the field of genome assembly, and that is what I have done ever since.

As well as tweeting a lot about Assemblathon 2 and a little bit about the aborted but oh-so-close-to-launching Assemblathon 3, I have found time to tweet (and retweet) several thousand links to many relevant publications and software tools.

It seems that people are finding this useful, as the account keeps gaining a steady trickle of followers. The graph below shows data from when I started tracking follower growth in early 2014:

All of which leaves me to make two concluding remarks:

  1. There can be tremendous utility in having an outlet — such as a Twitter account — to focus on a very niche subject (maybe some would say that genome assembly is no longer a niche field?).
  2. Although I am no longer working on the Assemblathon projects — I'm not even a researcher any more — I'm happy to keep posting to this account as long as people find it useful.

Assemble a genome and evaluate the result [Link]

There is a new page on the bioboxes site (such a great name!) which details how bioboxes can be used to assemble a genome and then evaluate the results:

A common task in genomics is to assemble a FASTQ file of reads into a genome assembly and then evaluate the quality of this assembly. This recipe will explore using bioboxes to do this task.

A third Assemblathon contest came very close to launching earlier this year…except that it didn't (maybe this will be the subject of a future blog post!). We had planned to make biobox containers a required part of submitting assemblies, so if Assemblathon 3 ever gets off the ground, I feel happier knowing that the bioboxes team is doing so much great work that will make running such a contest easier to manage.

Metassembler: Merging and optimizing de novo genome assemblies

There's a great new paper on bioRxiv by Alejandro Hernandez Wences and Michael Schatz. They directly address something I wondered about as we were running the Assemblathon contests: namely, can you combine some of the submitted assemblies to make an even better assembly? Well, the answer seems to be a resounding 'yes'.

For each of three species in the Assemblathon 2 project we applied our algorithm to the top 6 assemblies as ranked by the cumulative Z-score reported in the paper…

We evaluated the correctness and contiguity of the metassembly at each merging step using the metrics used by the Assemblathon 2 evaluation…

In all three species, the contiguity statistics are significantly improved by our metassembly algorithm

Hopefully their Metassembler tool will be useful in improving many of the other poor-quality assemblies out there!

Slides: Thoughts on the feasibility of Assemblathon 3

The slides below represent the draft assembly version of the talk that Ian Korf will be giving today at the Genome 10K meeting. That is, these are slides that I made for him to use as the basis of his talk; I expect his final version will differ somewhat.

After I made these slides I discovered that two of the species that I listed as potential candidates for Assemblathon 3 already have genome projects. The tuatara genome project is actually the subject of another talk at the Genome 10K meeting, and a colleague tells me that there is also a California condor genome project.

Thoughts on a possible Assemblathon 3

Lex Nederbragt has written a post outlining his thoughts on what any Assemblathon 3 contest should look like. This is something that Ian Korf will be talking about today at the Genome 10K meeting, which is happening at the moment (though it seems that there has already been a lot of discussion about this in other sessions). From his post:

I believe it is here that Assemblathon 3 could make a contribution. By switching the focus from the assembly developers to the assembly users, Assemblathon 3 could help to answer the question:

How to choose the ‘right’ assembly from a set of generated assemblies

From CASP to Poreathon: what makes for a good bioinformatics 'brand' name?

One of my more significant contributions to the world of bioinformatics is that I came up with the name for The Assemblathon.

Towards the end of 2010, our group at the UC Davis Genome Center was tasked with helping organize a new competition to assess software in the field of genome assembly. I remember a midweek meeting with my boss (Ian Korf) where he informed me that by the end of the week we had to come up with a name for the project, set up a website, and have a mailing list up and running…and by 'we' he meant 'me'.

I was aware that there had been several other comparative software assessments in the field of bioinformatics — CASP and GASP being the best-known examples — and that a certain theme had arisen in the naming of such exercises.

It seems amazing to me that after GASP decided to make a bogus acronym by including the 'S' from 'aSsessment', all subsequent evaluation exercises followed suit (although you could also argue that CASP could have worked equally well as 'CAPS').

I felt quite strongly that the world did not need another '…ASP' style of name and so I came up with 'The Assemblathon'. Although many might shudder at this, I was really thinking of it as a 'brand' name, rather than just another forgettable scientific project name. The Assemblathon name ticked several boxes:

  1. Memorable
  2. Different
  3. Pronounceable
  4. Website name was available
  5. Twitter account name was available

The last two items are kind of obvious when you realize that this is a completely new word. You may disagree, but I think that these are important — but not essential — aspects of naming a scientific project.

So what has happened since I bequeathed the Assemblathon brand to the world? Well, we've now had:

  1. Alignathon - A collaborative competition to assess the state of the art in whole genome sequence alignment (published in 2014)
  2. Variathon - A challenge to analyze existing or new pipelines for variant calling in terms of accuracy and efficiency (completed in 2013, but not published yet as far as I can tell)
  3. Poreathon - Assessment of bioinformatics pipelines relating to Oxford Nanopore sequencing data (announced by Nick Loman this week)

I don't have any issues with 'Alignathon': the name is based on a verb, and the goal of the project is probably guessable by any bioinformatician. Like Assemblathon, it is a portmanteau that just seems to work.

In contrast, I find 'Variathon' a horrible name. It doesn't scan well and may not make much sense to others. If you search Google for this name, the top suggestion is a spelling correction.

Not a good sign if your project name is regarded as a spelling mistake!

So what about 'Poreathon'? While I find this less offensive than Variathon, I still don't think it is a particularly snappy name…a bit of a snoreathon perhaps? ;-) Pore is both a noun and a verb, so the dual meaning of the word somewhat dilutes its impact as a project name.

5 suggestions for naming scientific projects

  1. You should not feel committed to naming something in order to continue a previous naming trend
  2. Acronyms are not the only option for the name of a scientific project!
  3. If there is any confusion as to how your project name is spelt or pronounced, this will not help you promote the name among your peers.
  4. Consider treating the intended name as a brand, and explore the issues that arise (how discoverable is the name, how similar to other 'brands', can you trademark it, is your name offensive in other languages, can you buy a suitable domain name? etc.)
  5. At the very least, perform a Google search for your intended name to see if others in your field have already used it (see my post on Identical Classifications In Science)

The Assemblathon Gives Back (a bit like The Empire Strikes Back, but with fewer lightsabers)

So we won an award for Open Data. Aside from a nice-looking slab of glass that is weighty enough to hold down all of the papers that someone with a low K-index has published, the award also comes with a cash prize.

Naturally, my first instinct was to find the nearest sculptor and request that they chisel a 20-foot recreation of my brain out of Swedish green marble. However, this prize has been — somewhat annoyingly — awarded to all of the Assemblathon 2 co-authors.

While we could split the cash prize 92 ways, this would probably only leave us with enough money to buy a packet of pork scratchings each (which is not such a bad thing if you are a fan of salty, fatty, porcine goodness).

Instead we decided — and by 'we', I'm really talking about 'me' — to give that money back to the community. Not literally of course…though the idea of throwing a wad of cash into the air at an ISMB meeting is appealing.

Rather, we have worked with the fine folks at BioMed Central (that's BMC to those of us in the know), to pay for two waivers that will cover the cost of Article Processing Charges (that's APCs to those of us in the know). We decided that these will be awarded to papers in a few select categories relating to 'omics' assembly, Assemblathon-like contests, and things to do with 'Open Data' (sadly, papers that relate to 'pork scratchings' are not eligible).

We are calling this event the Assemblathon 'Publish For Free' Contest (that's APFFC to those of us in the know), and you can read all of the boring details and contest rules on the Assemblathon website.

Winning an award that shouldn't exist: progress towards 'open data' and 'open science'

It was announced yesterday that the Assemblathon 2 paper has won the 2013 BioMed Central award for ‘Open Data’ (sponsored by Lab Archives). For more details on this see here and here.

While it is flattering to be recognized for our efforts to conduct science transparently, it still feels a little strange that we need to have awards for this kind of thing. All data that results from publicly funded science research should be open data. Although I feel there is growing support for the open science movement, much still needs to be done.

One of the things that needs to become commonplace is for scientists to put their data and code in stable online repositories that are hopefully citable as independent resources (i.e. with a DOI). For too long, people have used their lab websites as the end point for all of their (non-sequence[1]) related data (something that I have also been guilty of).

Part of the problem is that even when you take steps to submit data to an online repository of some kind, not all journals allow you to cite them. This tweet by Vince Buffalo from yesterday illustrated one such issue (see this Storify page for more details of the resulting discussion):


Tools like arXiv, bioRxiv, Figshare, SlideShare, GitHub, and GigaDB are making it easier to make our data, code, presentations, and preliminary results more available to others. I hope that we see more innovation in this area, and I hope that more people take an ‘open’ approach to other aspects of science, not just the sharing of data[2]. Luckily, with people like Jonathan Eisen and C. Titus Brown around, we have some great role models for how to do this.

How will we know when we are all good practitioners of open science? When we no longer need to give out awards to people just for doing what we should all be doing.


  1. For the most part, journals require authors to submit nucleotide and protein sequences to an INSDC database, though this doesn’t always happen.  ↩

  2. I have written elsewhere about the steps that the Assemblathon 2 took to try to be open throughout the whole process of doing the science, writing the paper, and communicating the results.  ↩

Mining Altmetric data to discover what types of research article get the most social media engagement

Altmetric is a service that tracks the popularity of published research articles via the impact that those articles make on social media sites. Put simply, the more an article is tweeted and blogged about, the higher its Altmetric score will be. The scoring system is fairly complex as it also tracks who is reading your article on sites such as Mendeley.

I was pleased to see that the recent Assemblathon 2 paper — on which I am the lead author — gained a lot of mentions on Twitter. Curious, I looked up its Altmetric score and was surprised to see that it was 61. This puts it in the 99th percentile of all articles tracked by Altmetric (almost 1.4 million articles). I imagine that the score may rise a little more in the coming weeks (at the time of writing, the paper is only four days old).

I was then curious as to how the Assemblathon 1 paper fared. Published in September 2011, it has an Altmetric score of 71. This made me wonder where both papers ranked in the entire list of 1,384,477 articles tracked by the service. So I joined the free trial of Altmetric (they have a few paid services) and was able to download details of the top 25,000 articles. This revealed that the two Assemblathon papers came in at a not-too-shabby 5,616th and 10,250th place overall. If you're interested, this paper — with an off-the-charts Altmetric score of 11,152 — takes the top spot.
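As a quick sanity check on the percentile claims, the percentile follows directly from an article's rank and the total number of tracked articles. A minimal sketch (the ranks and total are the figures quoted above; the function name is mine):

```python
def rank_to_percentile(rank, total):
    # Percentage of tracked articles that this article scores at or above
    return 100.0 * (1 - rank / total)

total = 1_384_477                  # articles tracked by Altmetric at the time
for rank in (5616, 10250):         # the two Assemblathon papers' overall ranks
    print(f"rank {rank:>6,}: {rank_to_percentile(rank, total):.2f}th percentile")
```

Both papers land above the 99th percentile, consistent with the figure quoted for the Assemblathon 2 paper.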

Just to satisfy my curiosity, I made a word cloud based on the titles of the research papers that appear in the top 10,000 Altmetric articles.

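Under the hood, a cloud like this is just a word-frequency count over the titles, with each word drawn at a size proportional to its count. A minimal sketch of that counting step, using made-up titles in place of the real Altmetric export (the titles, stopword list, and function name are all illustrative):

```python
from collections import Counter
import re

# Hypothetical stand-ins for titles from the Altmetric top 10,000 export
titles = [
    "A meta-analysis of dietary risk factors for human health",
    "Global trends in human health and disease",
    "Assessing genome assembly quality across vertebrate species",
]

STOPWORDS = {"a", "of", "for", "in", "and", "the", "across"}

def word_counts(titles):
    """Count lower-cased words across all titles, skipping stopwords."""
    words = []
    for title in titles:
        words += [w for w in re.findall(r"[a-z-]+", title.lower())
                  if w not in STOPWORDS]
    return Counter(words)

counts = word_counts(titles)
print(counts.most_common(3))
```

A word-cloud library would then take these counts and handle the drawing; the interesting signal is entirely in the frequencies.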

Perhaps unsurprisingly, this reveals that analyses (and meta-analyses) of data relating to human health prompt a lot of engagement via social media. Okay, time to go and write my next paper 'A Global Study of Human Health Risks'.