MinIONs...do my bidding!

Oh the fun you can have with new sequencing technologies...

Which genome assembler gives you the best genome assembly?

This is a question that I have been asked many times. I think that the opposite question should also be asked (which genome assembler gives you the worst genome assembly?), but people seem less interested in asking it. By the end of this blog post, I will try to answer both questions.

Earlier this week, I wrote about the fact that I get to look at lots of genome assemblies that people have asked me to run CEGMA on. The reason I run CEGMA is to calculate what percentage of a set of 248 Core Eukaryotic Genes (CEGs) are present in each genome (or transcriptome) assembly. This figure can be used as a proxy for the much larger set of 'all genes'.

I can also calculate the N50 length of each assembly, and if you plot both statistics for each assembly you capture a tiny slice of that nebulous characteristic known as 'assembly quality'.
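As an aside, since N50 features so heavily here: it can be computed from a list of contig (or scaffold) lengths in just a few lines. Here's a minimal Python sketch of the idea (illustrative, not the exact code behind the numbers in this post):

```python
def n50(lengths):
    """Return the N50 of a list of contig/scaffold lengths: the length L
    such that contigs of length >= L together contain at least half of
    the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy example: total = 2,900 bp, so half is 1,450 bp; walking down the
# sorted lengths, 1000 + 800 = 1800 is the first to pass that mark.
print(n50([1000, 800, 500, 400, 200]))  # -> 800
```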

Here's what such a plot looks like for 32 different genome assemblies (from various species, assembled by various genome assemblers):

Figure 1: CEGMA results vs N50 length for 32 genome assemblies.

Three key observations can be made from this figure:

  1. There's a lot of variation in the percentage of CEGs present (52.4–98.8%).
  2. There's a lot of variation in N50 length (997 bp all the way to 3.9 million bp).
  3. There is almost no correlation between the two measures (r = 0.04).
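If you want to reproduce that kind of number yourself, Pearson's r is easy to compute from paired metric values. A minimal sketch with made-up numbers (the real calculation used the 32 pairs of values plotted above; `statistics.correlation` needs Python 3.10+):

```python
from statistics import correlation  # Pearson's r; Python 3.10+

# Made-up values for four assemblies: %CEGs present and N50 (bp)
pct_cegs = [52.4, 98.8, 75.0, 60.1]
n50_bp = [120_000, 35_000, 3_900_000, 997]

print(f"r = {correlation(pct_cegs, n50_bp):.2f}")
```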

Let's dwell on that last point by highlighting a couple of the extreme data points:

Figure 2: CEGMA results vs N50 length. Red data points highlight the assemblies with the highest %248 CEGs value and the highest N50 length.

The highlighted point at the top of the graph represents the assembly with the best CEGMA result (245 out of 248 CEGs present); however, this assembly ranks only 13th for N50 length. The data point on the far right of the graph represents the assembly with the highest N50 length (almost 4 million bp), but it ranks only 27th for its CEGMA results. Such inconsistency is exactly what we saw in the Assemblathon 2 paper (but with even more metrics involved).

Can we shed any light on which particular genome assemblers excel in either of these statistics? Well, since I now ask people who submit CEGMA jobs to me what the principal assembly tool used was, I can overlay this information on the plot:

Figure 3: CEGMA results vs N50 length. Colors refer to the different assemblers used as the principal software tool in each assembly.

It might not be clear from the figure, but there are more data points for the Velvet assembler than for any other (12/32). You can see that Velvet assemblies produce a relatively wide range of CEGMA results. Assemblies made by SOAPdenovo produce an even wider range of CEGMA results (not to mention a wide range of N50 lengths). The truth is that there is no consistent pattern of quality in the results of any one assembler (and remember, we are measuring 'quality' by just two paltry metrics).

To answer the questions raised at the start of this post:

  1. All assemblers can be used to make terrible genome assemblies.
  2. Some assemblers can (probably) be used to make great genome assemblies.

There is no magic bullet in genome assembly and there are so many parameters that can affect the quality of your final assembly (repeat content of genome, sequencing technology biases, amount of heterozygosity in genome, quality of input DNA, quality of sample preparation steps, suitable mix of libraries with different insert sizes, use of most suitable assembler options for your genome of interest, amount of coffee drunk by person running the assembler, etc. etc.).

Don't even ask the question 'which genome assembler gives you the best genome assembly?' if you are not prepared to define what you mean by 'best' (and please don't define it just on the basis of two simple metrics like %248 CEGs present and N50 length).

101 questions with a bioinformatician #2: Titus Brown

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


There is only one bioinformatician that I know of who has been made the subject of a Twitter hashtag comparing him to Chuck Norris... step forward C. Titus Brown (just 'C' to his friends, 'Titus' to everyone else).

Titus is Assistant Professor in the Departments of Microbiology & Molecular Genetics and Computer Science & Engineering at Michigan State University (or 'MMG & CSE at MSU' if you find that easier to say). You can find out more about his work on his lab's website, on his Living in an Ivory Basement blog, or from his twitter feed (@ctitusbrown).

Aside from his research, I think one of Titus's biggest contributions to bioinformatics is his commitment to open science. This (much longer) interview with him is a great read if you want to know more about his views on this. I really feel that science would be all the better if more scientists followed the path that Titus walks on this important issue.

In his spare time — and this just reiterates that he really is Chuck Norris — Titus is also a linebacker who played for a couple of years with the Cleveland Browns. Anyway, on to the 101 questions...


001. What's something that you enjoy about current bioinformatics research?

The range of biology problems that can be tackled!  Our ability to ask more detailed questions about important organisms and communities is growing daily, and it's just amazing to start to connect the genomic "dots".


010. What's something that you *don't* enjoy about current bioinformatics research?

Oy... I get the sense that an awful lot of bioinformatics is focused on improving methods by 1 or 10%.  I don't know if that's useful and it doesn't seem like it would be satisfying, but you read papers with all these benchmarks and you realize that some poor graduate student slaved for half a year to push the edge on an algorithm that was already fast enough.

On the other hand, I would be the last person to gainsay someone else's bioinformatics project.  We've gone in what seemed like well-travelled directions and found virgin territory so I don't think this can be predicted very well.


011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Learn more statistics. And chemistry.


100. What's your all-time favorite piece of bioinformatics software, and why?

It's a little self-absorbed, but: digital normalization. The fact that it works at all is stunning; that it works as well as it does is amazing; it's only 5 lines of code; and it took us over three years to figure it out!?
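(For readers who haven't come across digital normalization: the core idea is to estimate each read's coverage as the median abundance of its k-mers seen so far, and to drop the read if that coverage is already high enough. Here's a toy Python sketch of that idea, with exact k-mer counting and illustrative parameter values; it is not the khmer implementation itself:)

```python
from collections import Counter
from statistics import median

K = 20       # k-mer size (illustrative choice)
CUTOFF = 20  # keep a read only while estimated coverage is below this

kmer_counts = Counter()  # exact counts; khmer instead uses a
                         # fixed-memory probabilistic counter

def kmers(read):
    """All overlapping k-mers in a read."""
    return [read[i:i + K] for i in range(len(read) - K + 1)]

def keep_read(read):
    """Estimate the read's coverage as the median abundance (so far) of
    its k-mers. If coverage is still below CUTOFF, keep the read and
    count its k-mers; otherwise it adds little new information."""
    if len(read) < K:
        return False
    coverage = median(kmer_counts[km] for km in kmers(read))
    if coverage < CUTOFF:
        kmer_counts.update(kmers(read))
        return True
    return False

# Usage sketch: stream reads, retaining roughly CUTOFF-fold coverage.
reads = ["ACGT" * 10, "ACGT" * 10, "TTGCATTGCA" * 4]
kept = [r for r in reads if keep_read(r)]
```

The scalability of the real tool comes from swapping the exact counter for a fixed-memory probabilistic counting structure, but the keep/drop logic is essentially the handful of lines shown above.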


101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality? 

N. I like being flexible.

Fun with TimeTree.org: but remember to check the sample size

On Friday, I saw a tweet by Richard Smith-Unna (@blahah404) about the sheep/goat divergence time listed on TimeTree.org.

TimeTree.org is a great site and can be really useful but, as always in science, you should be careful with summary statistics. TimeTree uses published estimates of divergence times to list the 'mean' and 'median' times at which two species diverged.

For some species, e.g. cat and dog, there are lots of published studies, so you can perhaps be more confident about the averages:

Figure: TimeTree divergence estimates for the cat vs dog comparison.

Each black dot in this figure represents a date from a separate published study. In the sheep/goat comparison that Richard tweeted about, there are only two data points (6.2 and 8.5 million years ago). However, these are close enough together that the headline mean divergence time of 7.3 million years gives a reasonable estimate (albeit based on two data points).

But then you get to species comparisons such as Caenorhabditis elegans vs Caenorhabditis briggsae:

Figure: TimeTree divergence estimates for the C. elegans vs C. briggsae comparison.

I would hate it if anyone seriously reported this 'average' figure of 51.3 million years without pointing out that it is an average of just two estimates: 1.0 and 101.5. Always be suspicious of averages when you can't see the spread of the data!
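To make that last point concrete, here's a throwaway Python snippet comparing the two pairs of estimates quoted above. The means look like similar kinds of numbers; the spreads tell very different stories:

```python
from statistics import mean, stdev

# Published divergence estimates quoted above (millions of years ago)
comparisons = {
    "sheep vs goat": [6.2, 8.5],
    "C. elegans vs C. briggsae": [1.0, 101.5],
}

for name, dates in comparisons.items():
    print(f"{name}: mean = {mean(dates):.1f}, sd = {stdev(dates):.1f}")

# sheep vs goat:             mean = 7.3,  sd = 1.6
# C. elegans vs C. briggsae: mean = 51.2, sd = 71.1
```

Two data points is a tiny sample either way, but at least in the sheep/goat case the two estimates agree with each other.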

Update (2014-04-07, 14:48): Richard Smith-Unna has now looked at the references behind the 1.0 and 101.5 million year dates and spotted errors in how they were reported in TimeTree. See his comment below. This merits another cautionary warning when using data like this: don't assume that data extracted from papers by humans/robots/software will always be extracted correctly.