Some short slide decks from a recent Bioinformatics Core workshop

Last week I helped teach at a workshop organized by the Bioinformatics Core facility of the UC Davis Genome Center. The workshop was on:

  • Using the Linux Command Line for Analysis of High Throughput Sequence Data

I like that the Bioinformatics Core makes all of their workshop documentation available for free, even if you didn't attend the workshop. So have a look at the docs if you want to learn about genome assembly, RNA-Seq, or the basics of the Unix command line (these were just some of the topics covered).

Anyway, I tried making some fun slide decks to kick off some topics. They are included below.


This bioinformatics lesson is brought to you by the letter 'D'

'D' is for 'Default parameters', 'Danger', and 'Documentation'


This bioinformatics lesson is brought to you by the letter 'T'

'T' is for 'Text editors', 'Time', and 'Tab-completion'


This bioinformatics lesson is brought to you by the letter 'W'

'W' is for 'Workflows', 'What?', and 'Why?'

Developments in high throughput sequencing – June 2015 edition

If you're at all interested in the latest developments in sequencing technology, then you should be following Lex Nederbragt's In between lines of code blog. In particular, you should always take time to read his annual snapshot overview of how the major players are all faring.

This is the fourth edition of this visualisation… As before, full run throughput in gigabases (billion bases) is plotted against single-end read length for the different sequencing platforms, both on a log scale.

The 2015 update looks interesting because of the addition of a certain new player!

L50 vs N50: that's another fine mess that bioinformatics got us into

N50 is a statistic that is widely used to describe genome assemblies. It describes an average length of a set of sequences, but the average is not the mean or median length. Rather, it is the length of the sequence that takes the cumulative length of all sequences, when summing from longest to shortest, past 50% of the total size of the assembly. The reasons for using N50, rather than the mean or median length, are something that I've written about before in detail.

The number of sequences evaluated at the point when the cumulative length exceeds 50% of the assembly size is sometimes referred to as the L50 number. Admittedly, this is somewhat confusing: N50 describes a sequence length whereas L50 describes a number of sequences. This oddity has led many people to invert the usage of these terms, which doesn't help anyone and only leads to confusion and debate.
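To make the distinction concrete, here is a minimal Python sketch of how both statistics could be computed from a list of contig lengths. The function name and the toy contig lengths are purely illustrative; this isn't taken from any particular assembly tool.

```python
def n50_and_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths (illustrative sketch).

    N50: length of the sequence at which the running total, summing from
         longest to shortest, first reaches 50% of the total assembly size.
    L50: how many sequences it took to reach that point.
    """
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= total / 2:
            return length, count
    raise ValueError("empty list of lengths")

# Toy 'assembly' of seven contigs totalling 400 bp
contigs = [100, 80, 70, 60, 40, 30, 20]
n50, l50 = n50_and_l50(contigs)
print(n50, l50)  # N50 = 70 (100 + 80 + 70 = 250 >= 200), L50 = 3
```

Swapping the two terms would mean reporting '3' as an N50 and '70 bp' as an L50, which is exactly the kind of mix-up that causes the confusion.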

I believe that the aforementioned definition of N50 was first used in the 2001 publication of the human genome sequence:

We used a statistic called the ‘N50 length’, defined as the largest length L such that 50% of all nucleotides are contained in contigs of size at least L.

I've since had some independent confirmation of this from Deanna Church (@deannachurch):

I also have a vague memory that other genome sequences made available by the Sanger Institute around this time also included statistics such as N60, N70, N80, etc. (at least I recall seeing these details in README files on an FTP site). Deanna also pointed out that the Celera Human Genome paper (published in Science, also in 2001) describes something that we might call N25 and N90, even though they didn't use these terms in the paper:

More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger
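The same idea generalises to any threshold. As another illustrative sketch (not anyone's official implementation), the 50% cut-off in the function above just needs to become a parameter to report N25, N90, or any other Nx value:

```python
def nx(lengths, x):
    """Return the Nx statistic for a list of sequence lengths (illustrative sketch).

    Nx is the length of the sequence at which the running total, summing
    from longest to shortest, first reaches x% of the total length.
    E.g. x=50 gives N50, x=90 gives N90.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total * x / 100:
            return length
    raise ValueError("empty list of lengths")

contigs = [100, 80, 70, 60, 40, 30, 20]
print(nx(contigs, 25), nx(contigs, 90))  # N25 = 100, N90 = 30
```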

I don't know when L50 first started being used to describe lengths, but I would bet it was after 2001. If I'm wrong, please comment below and maybe we can settle this once and for all. Without evidence for an earlier use of L50 to describe lengths, I think people should stick to the 2001 definition of N50 (which I would also argue is the most common definition in use today).

Updated 2015-06-26 - Article includes new evidence from Deanna Church.