L50 vs N50: that's another fine mess that bioinformatics got us into

N50 is a statistic that is widely used to describe genome assemblies. It describes an average length of a set of sequences, but the average is not the mean or median length. Rather, it is the length of the sequence that takes the sum length of all sequences — when summing from longest to shortest — past 50% of the total size of the assembly. The reason for using N50, rather than the mean or median length, is something that I've written about before in detail.

The number of sequences evaluated at the point when the sum length exceeds 50% of the assembly size is sometimes referred to as the L50 number. Admittedly, this is somewhat confusing: N50 describes a sequence length whereas L50 describes a number of sequences. This oddity has led many people to invert the usage of these terms, which helps no one and only leads to more confusion and debate.
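To make the distinction concrete, here is a minimal Python sketch (using the 'at least 50% of the total' reading of the definition) that returns both values from a list of sequence lengths:

# Compute N50 and L50 from a list of sequence lengths.
# N50: the length of the sequence at which the running total (summing from
# longest to shortest) first reaches half of the assembly size.
# L50: how many sequences it took to get there.

def n50_l50(lengths):
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= total / 2:
            return length, count   # (N50, L50)

# Toy example: ten contigs
contigs = [5000, 4000, 3000, 2000, 1000, 900, 800, 700, 600, 500]
n50, l50 = n50_l50(contigs)
print(f"N50 = {n50} bp, L50 = {l50}")   # N50 = 3000 bp, L50 = 3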

I believe that the aforementioned definition of N50 was first used in the 2001 publication of the human genome sequence:

We used a statistic called the ‘N50 length’, defined as the largest length L such that 50% of all nucleotides are contained in contigs of size at least L.

I've since had some independent confirmation of this from Deanna Church (@deannachurch).

I also have a vague memory that other genome sequences — made available by the Sanger Institute around this time — also included statistics such as N60, N70, N80 etc. (at least I recall seeing these details in README files on an FTP site). Deanna also pointed out that the Celera Human Genome paper (published in Science, also in 2001) describes something that we might call N25 and N90, even though those terms weren't used in the paper:

More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger

I don't know when L50 first started being used to describe lengths, but I would bet it was after 2001. If I'm wrong, please comment below and maybe we can settle this once and for all. Without evidence for an earlier use of L50 to describe lengths, I think people should stick to the 2001 definition of N50 (which I would also argue is the most common definition in use today).

Updated 2015-06-26 - Article includes new evidence from Deanna Church.

Which genome assembler gives you the best genome assembly?

This is a question that I have been asked many times. I think that the opposite question should also be asked — which genome assembler gives you the worst genome assembly? — but people seem less interested in asking this. By the end of this blog post, I will try to answer both questions.

Earlier this week, I wrote about the fact that I get to look at lots of genome assemblies that people have asked me to run CEGMA on. The reason I run CEGMA is to calculate what percentage of a set of 248 Core Eukaryotic Genes (CEGs) are present in each genome (or transcriptome) assembly. This figure can be used as a proxy for the much larger set of 'all genes'.

I can also calculate the N50 length of each assembly, and if you plot both statistics for each assembly you capture a tiny slice of that nebulous characteristic known as 'assembly quality'. Here's what such a plot looks like for 32 different genome assemblies (from various species, assembled by various genome assemblers):

Figure 1. CEGMA results vs N50 length for 32 genome assemblies.

Three key observations can be made from this figure:

  1. There's a lot of variation in the percentage of CEGs present (52.4–98.8%).
  2. There's a lot of variation in N50 length (997 bp all the way to 3.9 million bp).
  3. There is almost no correlation between the two measures (correlation coefficient = 0.04); see the sketch below.
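For anyone who wants to reproduce this kind of plot, here is a rough Python sketch; the file name (assemblies.tsv) and its column names are just placeholders for wherever you keep the %CEGs and N50 values for your own assemblies:

# Scatter plot of CEGMA completeness vs N50 length, plus the correlation.
# Assumes a tab-separated file with columns: assembly, pct_cegs, n50.
import csv
import numpy as np
import matplotlib.pyplot as plt

pct_cegs, n50 = [], []
with open("assemblies.tsv") as fh:
    for row in csv.DictReader(fh, delimiter="\t"):
        pct_cegs.append(float(row["pct_cegs"]))
        n50.append(float(row["n50"]))

r = np.corrcoef(n50, pct_cegs)[0, 1]   # Pearson correlation coefficient
plt.scatter(n50, pct_cegs)
plt.xscale("log")                      # N50 values span ~1 kbp to ~4 Mbp
plt.xlabel("N50 length (bp)")
plt.ylabel("% of 248 CEGs present")
plt.title(f"CEGMA completeness vs N50 (r = {r:.2f})")
plt.savefig("cegma_vs_n50.png")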

Let's dwell on that last point by highlighting a couple of the extreme data points:

Figure 2: CEGMA results vs N50 length — red data points highlight assemblies with highest values of %248 CEGs or N50 length.

The highlighted point at the top of the graph represents the assembly with the best CEGMA result (245 out of 248 CEGs present). However, this assembly ranks 13th for N50 length. The data point on the far right of the graph represents a genome assembly with the highest N50 length (almost 4 million bp). But this assembly only ranks 27th for its CEGMA results. Such inconsistency was exactly what we saw in the Assemblathon 2 paper (but with even more metrics involved).

Can we shed any light on which particular genome assemblers excel in either of these statistics? Well, as I now ask people who submit CEGMA jobs to me 'what was the principal assembly tool used?', I can overlay this information on the plot:

Figure 3: CEGMA results vs N50 length — colors refer to different assemblers used as the principal software tool in the assembly process.

It might not be clear, but there are more data points for the Velvet assembler than for any other (12/32). You can see that Velvet assemblies produce a relatively wide range of CEGMA results. Assemblies made by SOAPdenovo produce an even wider range of CEGMA results (not to mention a wide range of N50 results). The truth is that there is no consistent pattern of quality in the results of any one assembler (and remember, we are only measuring 'quality' by two paltry metrics).

To answer the questions raised at the start of this post:

  1. All assemblers can be used to make terrible genome assemblies
  2. Some assemblers can (probably) be used to make great genome assemblies

There is no magic bullet in genome assembly, and many parameters can affect the quality of your final assembly: repeat content of the genome, sequencing technology biases, amount of heterozygosity in the genome, quality of input DNA, quality of sample preparation steps, a suitable mix of libraries with different insert sizes, use of the most suitable assembler options for your genome of interest, amount of coffee drunk by the person running the assembler, and so on.

Don't even ask the question 'which genome assembler gives you the best genome assembly?' if you are not prepared to define what you mean by 'best' (and please don't define it just on the basis of two simple metrics like %248 CEGs present and N50 length).

The long tail of the distribution: do people ever really look at their genome assemblies?

Since I (somewhat foolishly) volunteered to run CEGMA on behalf of people who were having trouble installing it, I've had the opportunity to look at a lot of genome (and transcriptome) assemblies.

Some people get a little bit obsessed by just how many (or how few) core genes are present in their assembly. They also get obsessed by how long (or short) the N50 length of their assembly is.

Earlier this week, I wrote about a new tool (N50 Booster!!!) that can increase the N50 length of any assembly. Although this was published on April 1st, the tool does actually work and it merits some further discussion. The tool achieves higher N50 lengths by doing two things:

  1. It removes the shortest 25% of all sequences
  2. It then sums the lengths of sequences that were removed and adds an equivalent length of Ns to the end of the longest sequence

Both steps contribute to increasing the N50 length, though the amount of the increase really depends on the distribution of sequence lengths in the assembly. The second step is clearly bogus, and I added it just to preserve the assembly size. The first step might seem drastic, but is exactly the same sort of step that is performed by many genome assemblers (or by the people post-processing the assembly).
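For the curious, here is a minimal Python sketch of those two steps (just an illustration of the idea, not the actual n50_booster.pl code):

# Illustration of the two steps above: drop the shortest 25% of sequences, then
# pad the longest remaining sequence with Ns so the assembly size is unchanged.

def boost_n50(seqs):
    """seqs: dict of {name: sequence}. Returns a new 'boosted' dict."""
    by_length = sorted(seqs, key=lambda name: len(seqs[name]))
    n_to_remove = len(by_length) // 4                 # shortest 25% of sequences
    removed, kept = by_length[:n_to_remove], by_length[n_to_remove:]

    padding = sum(len(seqs[name]) for name in removed)
    boosted = {name: seqs[name] for name in kept}
    longest = max(kept, key=lambda name: len(seqs[name]))
    boosted[longest] += "N" * padding                 # preserve total assembly size
    return boosted

# Toy example: drops the 100 bp sequence and pads the 1,000 bp one with 100 Ns
toy = {"seq1": "A" * 1000, "seq2": "A" * 500, "seq3": "A" * 200, "seq4": "A" * 100}
boosted = boost_n50(toy)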

If you don't remove short sequences from an assembly, you could end up including a large number of reads that don't overlap with any other read. In other words, the shortest contig/scaffold length in an assembly may simply be the read length of whatever sequencing technology was used. If you trim reads for quality, you could end up with contigs/scaffolds that are even shorter.

How useful is it to include such short sequences in an assembly, and how often do 'assemblers' (the software and/or the people running the assemblers) do this? Well, by examining some of the assemblies that I have run CEGMA against, I can find out.

For 34 genome assemblies (from many different species, using many different assemblers) I looked to see whether the shortest 10 sequences were all the same length or were unique:

So about half of the assemblies have unique lengths for their 10 shortest sequences. The remainder represent assemblies that probably either removed all sequences below a certain length (which seems likely for the assembly whose shortest sequences are 2,000 bp), or simply included all unassembled reads (six assemblies have an abundance of 100 bp sequences).
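If you want to run the same check on your own assembly, a few lines of Python will do it (the FASTA file name is just a placeholder):

# Report the ten shortest sequence lengths in a FASTA file; a quick way to see
# whether a minimum-length cutoff was applied, or whether lone reads were kept.

def fasta_lengths(path):
    lengths, current = [], 0
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
    if current:
        lengths.append(current)
    return lengths

lengths = fasta_lengths("assembly.fa")
print("Ten shortest sequences:", sorted(lengths)[:10])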

This raises the question: how useful is it to include all of these short sequences? It's always possible that a 100 bp read that doesn't overlap anything else contains something useful, but probably not. You might see an exon, possibly an intron, and very possibly an entire gene (depending on the species)...but can anyone do much with this information?

The most extreme example I came across is actually from one of the 16 assemblies which initially appeared to have sequences with unique lengths. This vertebrate genome assembly contains 72,214 sequences. If I ask what percentage of these sequences are shorter than various cutoffs, this is what I find:

This is pretty depressing. The vast majority of sequences in this assembly are likely to be too short to be of any use. The reason that this assembly counted in my 'unique' category is that its shortest ten sequences have the following lengths: 100, 100, 99, 88, 87, 76, 73, 63, 12, and 3 bp. That's right, this assembly includes a sequence that's only three base pairs in length!
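The equivalent check on your own assembly takes a couple more lines; this sketch reuses the fasta_lengths() function from above, with some illustrative cutoffs:

# What fraction of sequences fall below various length cutoffs?
lengths = fasta_lengths("assembly.fa")
for cutoff in (1_000, 10_000, 100_000):
    short = sum(1 for length in lengths if length < cutoff)
    print(f"< {cutoff:,} bp: {100 * short / len(lengths):.1f}% of sequences")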

There are a couple of points that I am trying to make here:

  1. You can increase the N50 length of your assembly by removing the shortest sequences. This is not cheating — though you should clearly state what cutoff has been used — and should be considered part of the genome assembly process (getting rid of the crap).
  2. Please look at your assembly before doing anything with it. On its own, N50 is not a very useful statistic. Look at the distribution of all sequence lengths (or plot an NG(X) graph).

It turned out that this assembly did contain a fair number of core genes, and these were almost all located in the 1.2% of sequences that were > 10,000 bp. That 3 bp sequence, though, turned out to contain no core genes at all. Shocker!

A new tool to boost the N50 length of your genome assembly

We all know that the most important aspect of any genome assembly is the N50 length of its contigs or scaffolds. Higher N50 lengths are clearly correlated with increases in assembly quality and any good bioinformatician should be looking to maximize the N50 length of any assembly they are making.

I am therefore pleased that I can today announce the release of a new software tool, N50 Booster!!! that can help you increase the N50 length of an existing assembly. This tool was written in C for maximum computational efficiency and then reverse engineered into Perl for maximum obfuscation.

This powerful software is available as a Perl script (n50_booster.pl) that can be downloaded from our lab's website. The only requirement for this script is the FAlite.pm Perl module (also available from our lab's website).

Before I explain how this script works to boost an assembly's N50 length, I will show a real-world example. I ran the script on release WS230 of the Caenorhabditis japonica genome assembly:

$ n50_booster.pl c_japonica.WS230.genomic.fa

Before:
==============
Total assembly size = 166256191 bp
N50 length = 94149 bp

Boosting N50...please wait

After:
==============
Total assembly size = 166256191 bp
N50 length = 104766 bp

Improvement in N50 length = 10617 bp

See file c_japonica.WS230.genomic.fa.n50 for your new (and improved) assembly

As you can see, N50 Booster!!! not only makes a substantial increase to the N50 length of the C. japonica assembly, it does so while preserving the assembly size. No other post-assembly manipulation tool boasts this feature!

The n50_booster.pl script works by creating a new FASTA file based on the original (but which includes a .n50 suffix) and ensures that the new file has an increased N50 length. The exact mechanism by which N50 Booster!!! works will be evident from an inspection of the code.

I am confident that N50 Booster!!! can give your genome assembly a much needed boost and the resultant increase in N50 length will lead to a much superior assembly which will increase your chances of a publication in a top-tier journal such as the International Journal of Genome Assembly or even the Journal of International Genome Assembly.

Update: 2014-04-08 09.44 — I wrote a follow up post to this one which goes into more detail about how N50 Booster!!! works and discusses what people could (and should) do to the shortest sequences in their genome assemblies.

As more people use N50 as a metric, fewer genomes seem to be 'completed'

If you search Google Scholar for the term genome contig|scaffold|sequence +"N50 size|length" and then filter by year, you can see that papers which mention N50 length have increased dramatically in recent years:

Google Scholar results for papers that mention N50 length. 2000–2013.

I'm sure that my search term doesn't capture all mentions of N50, and it probably includes a few false positives as well. It doesn't appear to be mentioned before 2001 at all, and I think that the 2001 Nature human genome paper may have been the first publication to use this metric.

Obviously, part of this growth simply reflects the fact that more people are sequencing genomes (or at least writing about sequenced genomes), and therefore feel the need to include some form of genome assembly metric. A Google Scholar search for "genome sequence|assembly" shows another pattern of growth, but this time with a notable spurt in 2013:

Google Scholar results for papers that mention genome sequences or assemblies. 2000–2013.

Okay, so more and more people are sequencing genomes. This is good news, but only if those genomes are actually usable. This led me to my next query: how many people refer to their published genome sequence as complete? That is, I searched Google Scholar for "complete|completed genome sequence|assembly". Again, this is not a perfect search term, and I'm sure it will miss some descriptions of what people consider to be complete genomes. But at the same time it probably filters out all of the 'draft genomes' that have been published. The results are a little depressing:

Google Scholar results for papers that mention genome sequences or assemblies vs those that make mention of 'completed' genome sequences or assemblies. 2000–2013.

So although there were nearly 90,000 publications last year that mentioned a genome sequence (or assembly), approximately just 7,500 papers mentioned the C-word. This is a little easier to visualize if you show the number of 'completed' genome publications as a percentage of the number of publications that mention 'genome sequence' (irrespective of completion status):

Numbers of publications that mention 'completed' genomes as percentage of those that mention genomes. 2000–2013.

Maybe journal reviewers are more stringent about not allowing people to use the 'completed' word if the genome isn't really complete (which, depending on your definition of 'complete', may include most genomes)? Or maybe people are just happier these days to sequence something, throw it through an assembler, and then publish it, regardless of how incomplete it is?

Why is N50 used as an assembly metric (and what's the deal with NG50)?

This post started out as a lengthy reply on a thread at SEQanswers.com. I thought that it was worth reproducing it here (though I've also added some more information)...

Why N50?

Maybe it is worth remembering why we even have N50 as a statistic. The final assembly for any genome project, i.e. the one that is described in a paper or uploaded to a database, is not necessarily the biggest assembly that was generated.

This is because there will always be contigs that are too short to be of any use to anyone. In the pre-NGS era of assembly, these contigs could potentially be represented by a single Sanger read that didn't align to anything else.

However, one might consider that there is still some utility in including a long Sanger read in an assembly, even if it doesn't overlap with any other reads. After quality trimming, there are many Sanger reads that are over 1,000 bp in length and this is long enough to contain an exon or two, or even an entire gene (depending on the species). But what if a contig was 500 bp? Or 250 bp? Or 100 bp? Where do you draw the line?

Clearly it is not desirable to fill up an assembly with an abundance of overly short sequences, even if they represent accurate and unique biological sequences from the genome of interest. So it has always been common to remove very short contigs from an assembly. The problem is that different groups might use very different criteria for what to keep and what to exclude. Imagine a simple assembly consisting of the following contigs:

  • 5,000 x 100 bp
  • 100 x 10 Kbp
  • 10 x 1 Mbp
  • 1 x 10 Mbp

Now let's say that Distinguished Bioinformatics Center #1 decides to produce an assembly (DBC1) that includes all of these contigs. A second center (DBC2) makes an assembly from the same data but removes the 100 bp contigs, and a third (DBC3) also removes the 10 Kbp contigs. What does this do to the mean contig length of each assembly?

Mean contig lengths

  • DBC1 = 4,207 bp
  • DBC2 = 189,189 bp
  • DBC3 = 1,818,182 bp

Hopefully it becomes obvious that if you only considered mean contig length, it would be extremely easy to 'game' the system by deliberately excluding shorter contigs (no matter their utility) just to increase the average contig length. But what do we get if we instead rely on N50?

N50 contig lengths

  • DBC1 = 1 Mbp
  • DBC2 = 1 Mbp
  • DBC3 = 10 Mbp

This now reduces the overall discrepancies, and puts DBC1 and DBC2 on an equal footing. But you might still, naively, conclude that DBC3 is the better assembly, and if you were extremely wide-eyed and innocent, then maybe you would conclude that DBC3 was ten times better than the other two assemblies.
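If you want to check those numbers, here is a quick Python sketch (it uses the 'at least 50% of the total' reading of N50; the optional total argument will come in handy for NG50 below):

# Reproduce the mean and N50 values for the three hypothetical assemblies.

def n50(lengths, total=None):
    total = sum(lengths) if total is None else total
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:   # 'at least 50%' convention
            return length

dbc1 = [100] * 5000 + [10_000] * 100 + [1_000_000] * 10 + [10_000_000]
dbc2 = [10_000] * 100 + [1_000_000] * 10 + [10_000_000]
dbc3 = [1_000_000] * 10 + [10_000_000]

for name, contigs in (("DBC1", dbc1), ("DBC2", dbc2), ("DBC3", dbc3)):
    mean = sum(contigs) / len(contigs)
    print(f"{name}: mean = {mean:,.0f} bp, N50 = {n50(contigs):,} bp")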

So N50 does a better, though still imperfect, job of avoiding the dangers inherent in relying on the mean length. In some ways, the actual method you use to calculate N50 does not matter too much, as long as you use the same method when comparing all assemblies. Back in the days of Sanger-sequence-derived assemblies, it was fairly common to see assembly statistics report not just N50, but everything from N10 through to N90. This gives you a much better insight into the overall variation in contig (or scaffold) lengths.

In the Assemblathon and Assemblathon 2 contests, we actually plotted all N(x) values to see the full distribution of contig lengths. Except we didn't use the N50 metric; we used something called NG50. This normalizes for differences in overall assembly sizes by asking 'at what contig/scaffold length — from a set of sorted scaffold lengths — do we see a sum length that is greater than 50% of the estimated or known genome size?'

Returning to my earlier fictional example, let's assume that the estimated genome size that was being assembled was 25 Mbp. This means we want to see what contig length takes us past 12.5 Mbp (when summing all contig lengths from longest to shortest):

NG50 contig lengths

  • DBC1 = 1 Mbp
  • DBC2 = 1 Mbp
  • DBC3 = 1 Mbp

We now see parity between all assemblies, at least when assessing their contig lengths. The actual differences in the assemblies mostly reflect variations in which short sequences are included, and these may, or may not, be of any utility to the end user of the assembly. In my mind, this gives us a much fairer way of comparing the assemblies, at least in terms of their average contig lengths.
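In code, NG50 only needs a different denominator: compare the running total to half of the estimated genome size (12.5 Mbp here) instead of half of the assembly size. Reusing the n50() function and contig lists from the previous sketch:

# NG50 for the three hypothetical assemblies, assuming a 25 Mbp genome size.
genome_size = 25_000_000
for name, contigs in (("DBC1", dbc1), ("DBC2", dbc2), ("DBC3", dbc3)):
    print(f"{name}: NG50 = {n50(contigs, total=genome_size):,} bp")   # 1 Mbp each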

NG(X) graphs

Here is an example of an NG(X) chart that borrows some results from the Assemblathon 2 contest:

Figure: example NG(X) chart (ngx.png).

Some things to note:

  • Y-axis (scaffold length) is plotted on a log scale
  • Each line represents scaffold lengths from a genome assembly
  • The NG50 value is shown as a dotted line. If you only looked at this, you might conclude that the red assembly is the best and the purple one the worst.
  • The first data point in each series reveals the longest scaffold length. This is why you can see horizontal lines for some entries: e.g. the longest scaffold for the red assembly is about 70 Mbp, which on its own exceeds 1%, 2%, 3% (and so on) of the genome size, so NG1, NG2, NG3 etc. all equal that length.
  • From NG1 to NG50, the relative differences between assemblies are fairly constant. One exception: the longest scaffold in the purple assembly is longer than that in the orange assembly, but by NG5 the orange assembly has longer scaffolds.
  • Around NG75, we start to see changes. The lengths of scaffolds in the blue assembly start to drop off.
  • At NG90, the lengths of scaffolds in the green assembly plummet. The fact that this series touches the x-axis (at about NG91) tells us that the sum size of this assembly is less than the estimated genome size of the species in question (i.e. you have to have a value at NG100 to have 100% of the estimated genome size present in your assembly).
  • By NG95, the orange assembly has moved from 5th to 2nd. This just means that the assembly contains a higher proportion of longer scaffolds. However...
  • Three of the assemblies reach NG100 with positive values on the y-axis. This reveals that these assemblies are all bigger than the estimated genome size. Ideally we could continue this plot to see where they would eventually touch the x-axis. If they reached NG200, then we would know that the genome assembly is twice the estimated genome size.
  • Hopefully, this shows that you can get a lot of information out of a plot like this.
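If you want to draw an NG(X) graph of your own, the calculation is just the NG50 one repeated for every value of x. A rough sketch, reusing the toy dbc1 contig list from the earlier example:

import matplotlib.pyplot as plt

# NG(x): the scaffold/contig length at which the running total (longest first)
# reaches x% of the estimated genome size. Returns None when the assembly runs
# out of sequence before reaching x% (where a line would drop to the x-axis).

def ngx(lengths, genome_size, x):
    threshold = genome_size * x / 100
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return None

curve = [(x, ngx(dbc1, 25_000_000, x)) for x in range(1, 101)]
xs, ys = zip(*[(x, y) for x, y in curve if y is not None])
plt.semilogy(xs, ys)                     # log-scale y-axis, as in the chart above
plt.xlabel("NG(x)")
plt.ylabel("Scaffold length (bp)")
plt.savefig("ngx_curve.png")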

In summary

N50 is just a better, though still flawed, way of calculating an average sequence length from a set of sequences. It can still be biased, and you really should consider what the estimated/known genome size is (when comparing two or more assemblies of the same species), and/or look to see the full distribution of sequence lengths (e.g. with an NG(x) graph).