This post started out as a lengthy reply on a thread at SEQanswers.com. I thought it was worth reproducing here (though I've also added some more information)...
Why N50?
Maybe it is worth remembering why we even have N50 as a statistic. The final assembly for any genome project, i.e. the one that is described in a paper or uploaded to a database, is not necessarily the biggest assembly that was generated.
This is because there will always be contigs that are too short to be of any use to anyone. In the pre-NGS era of assembly, these contigs could potentially be represented by a single Sanger read that didn't align to anything else.
However, one might consider that there is still some utility in including a long Sanger read in an assembly, even if it doesn't overlap with any other reads. After quality trimming, many Sanger reads are over 1,000 bp in length, and this is long enough to contain an exon or two, or even an entire gene (depending on the species). But what if a contig were 500 bp? Or 250 bp? Or 100 bp? Where do you draw the line?
Clearly it is not desirable to fill up an assembly with an abundance of overly short sequences, even if they represent accurate, and unique, biological sequences from the genome of interest. So it has always been common to remove very short contigs from an assembly. The problem is that different groups might use very different criteria for what to keep and what to exclude. Imagine a simple assembly consisting of the following contigs:
- 5,000 x 100 bp
- 100 x 10 Kbp
- 10 x 1 Mbp
- 1 x 10 Mbp
Now let's say that Distinguished Bioinformatics Center #1 decides to produce an assembly (DBC1) that includes all of these contigs. However, a second DBC decides to make an assembly from the same data, but removes the 100 bp contigs (DBC2). A third DBC decides to also remove the 10 Kbp contigs (DBC3). What does this do to the mean contig length of each assembly? (There is a short calculation sketch after the list below.)
Mean contig lengths
- DBC1 = 4,207 bp
- DBC2 = 189,189 bp
- DBC3 = 1,818,182 bp
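To make the arithmetic explicit, here is a minimal Python sketch that reproduces those mean values. The `assemblies` dictionary and its contents are just my own shorthand for the made-up contig lists above:

```python
# Contig lengths (in bp) for the three fictional assemblies
assemblies = {
    "DBC1": [100] * 5000 + [10_000] * 100 + [1_000_000] * 10 + [10_000_000],
    "DBC2": [10_000] * 100 + [1_000_000] * 10 + [10_000_000],   # 100 bp contigs removed
    "DBC3": [1_000_000] * 10 + [10_000_000],                     # 10 Kbp contigs also removed
}

for name, contigs in assemblies.items():
    mean_length = sum(contigs) / len(contigs)
    print(f"{name}: {len(contigs):,} contigs, mean length = {mean_length:,.0f} bp")
```

Running this reproduces the three mean values listed above.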
Hopefully it becomes obvious that if you only considered mean contig length, it would be extremely easy to 'game' the system by deliberately excluding shorter contigs (no matter their utility) just to increase the average contig length. But what do we get if we instead rely on N50?
N50 contig lengths
- DBC1 = 1 Mbp
- DBC2 = 1 Mbp
- DBC3 = 10 Mbp
This now reduces the overall discrepancies, and puts DBC1 and DBC2 on an equal footing. But you might still, naively, conclude that DBC3 is the better assembly, and if you were extremely wide-eyed and innocent, then maybe you would conclude that DBC3 was ten times better than the other two assemblies.
So N50 does a better, though still imperfect, job of avoiding the dangers inherent in relying on the mean length. In some ways, the actual method you use to calculate N50 does not matter too much, as long as you use the same method when comparing all assemblies. Back in the day of Sanger-sequence-derived assemblies, it was fairly common to see assembly statistics report not just N50, but everything from N10 through to N90. This gives you a much better insight into the overall variation in contig (or scaffold) lengths.
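For what it's worth, here is one common way of calculating N50, and any other N(x) value, sketched out in Python. The `nx` function name and the hard-coded contig lists are just illustrative, and real tools differ in details such as whether the running total has to reach or exceed the threshold; I've used 'reach' (>=) here because that is what makes DBC3's N50 come out as 10 Mbp, as in the list above.

```python
def nx(contig_lengths, x=50):
    """N(x): the length of the contig at which the running total
    (summing from longest contig to shortest) first reaches x% of
    the total assembly length."""
    threshold = sum(contig_lengths) * x / 100
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= threshold:
            return length

# The same three fictional assemblies as before
assemblies = {
    "DBC1": [100] * 5000 + [10_000] * 100 + [1_000_000] * 10 + [10_000_000],
    "DBC2": [10_000] * 100 + [1_000_000] * 10 + [10_000_000],
    "DBC3": [1_000_000] * 10 + [10_000_000],
}

for name, contigs in assemblies.items():
    print(f"{name}: N50 = {nx(contigs):,} bp")

# An N10-through-N90 profile, as was often reported for Sanger-era assemblies
print([nx(assemblies["DBC1"], x) for x in range(10, 100, 10)])
```

Note that DBC3's single 10 Mbp contig accounts, on its own, for half of that assembly's (now smaller) total length, which is why its N50 jumps to 10 Mbp.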
In the Assemblathon and Assemblathon 2 contests, we actually plotted all N(x) values to see the full distribution of contig lengths. Except we didn't use the N50 metric; we used something called NG50. This normalizes for differences in overall assembly size by asking 'at what contig/scaffold length (working down a list of lengths sorted from longest to shortest) does the running total exceed 50% of the estimated or known genome size?'
Returning to my earlier fictional example, let's assume that the estimated size of the genome being assembled was 25 Mbp. This means we want to see which contig length takes us past 12.5 Mbp (when summing all contig lengths from longest to shortest):
NG50 contig lengths
- DBC1 = 1 Mbp
- DBC2 = 1 Mbp
- DBC3 = 1 Mbp
We now see parity between all assemblies, at least when assessing their contig lengths. The actual differences between the assemblies mostly reflect variations in which short sequences are included, sequences which may, or may not, be of any use to the end user of the assembly. In my mind, this gives us a much fairer way of comparing the assemblies, at least in terms of their average contig lengths.
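The only change needed to turn the earlier N(x) calculation into NG50 is to take the 50% threshold from the estimated genome size rather than from the total assembly length. A minimal sketch, again assuming the 25 Mbp genome size from the example (the `ngx` name is, once more, just my own):

```python
def ngx(contig_lengths, genome_size, x=50):
    """NG(x): the length of the contig at which the running total
    (summing from longest contig to shortest) first reaches x% of
    the estimated or known genome size."""
    threshold = genome_size * x / 100
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= threshold:
            return length
    return None  # the assembly never reaches x% of the genome size

GENOME_SIZE = 25_000_000  # the 25 Mbp estimated genome size from the example

# The same three fictional assemblies as before
assemblies = {
    "DBC1": [100] * 5000 + [10_000] * 100 + [1_000_000] * 10 + [10_000_000],
    "DBC2": [10_000] * 100 + [1_000_000] * 10 + [10_000_000],
    "DBC3": [1_000_000] * 10 + [10_000_000],
}

for name, contigs in assemblies.items():
    print(f"{name}: NG50 = {ngx(contigs, GENOME_SIZE):,} bp")
```

Plotting NG(x) for every value of x from 1 to 100 gives the kind of NG(X) graph described in the next section.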
NG(X) graphs
Here is an example of an NG(X) chart that borrows some results from the Assemblathon 2 contest: