The long tail of the distribution: do people ever really look at their genome assemblies?

Since I (somewhat foolishly) volunteered to run CEGMA on behalf of people who were having trouble installing it, I've had the opportunity to look at a lot of genome (and transcriptome) assemblies.

Some people get a little bit obsessed by just how many (or how few) core genes are present in their assembly. They also get obsessed by how long (or short) the N50 length of their assembly is.

Earlier this week, I wrote about a new tool (N50 Booster!!!) that can increase the N50 length of any assembly. Although this was published on April 1st, the tool does actually work and it merits some further discussion. The tool achieves higher N50 lengths by doing two things:

  1. It removes the shortest 25% of all sequences
  2. It then sums the lengths of sequences that were removed and adds an equivalent length of Ns to the end of the longest sequence

Both steps contribute to increasing the N50 length, though the amount of the increase really depends on the distribution of sequence lengths in the assembly. The second step is clearly bogus, and I added it just to preserve the assembly size. The first step might seem drastic, but is exactly the same sort of step that is performed by many genome assemblers (or by the people post-processing the assembly).
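The two steps above can be sketched in a few lines of Python. This is not the n50_booster.pl code, just a minimal illustration that works on a list of sequence lengths. The function names `n50` and `boost` and the toy lengths are made up for this example, and it assumes the common N50 convention: the length at which the running total, longest sequence first, reaches at least half of the assembly size.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths.

    Assumed convention: walk the sequences from longest to shortest and
    report the length at which the running total first reaches at least
    half of the total assembly size.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def boost(lengths, fraction=0.25):
    """Drop the shortest `fraction` of sequences, then add the removed
    bases to the longest remaining sequence (standing in for a run of Ns)."""
    ordered = sorted(lengths)
    n_removed = int(len(ordered) * fraction)
    removed, kept = ordered[:n_removed], ordered[n_removed:]
    if kept:
        kept[-1] += sum(removed)  # the longest sequence absorbs the padding
    return kept


# Toy lengths, invented for illustration only
lengths = [200, 200, 300, 400, 500, 600, 700, 1100]
boosted = boost(lengths)

print(sum(lengths), n50(lengths))   # 4000 600
print(sum(boosted), n50(boosted))   # 4000 700
```

In this toy case the assembly size is unchanged and the N50 goes up, but nothing about the underlying sequence has improved.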

If you don't remove short sequences from an assembly, you could end up including a large number of reads that don't overlap with any other read. That is, the shortest contig/scaffold in an assembly may be exactly as long as the read length of the sequencing technology being used. If you trim reads for quality, you could potentially end up with contigs/scaffolds that are even shorter.

How useful is it to include such short sequences in an assembly, and how often do 'assemblers' (the software and/or the people running the assemblers) do this? Well, by looking at some of the assemblies that I have run CEGMA against, I can find out.

For 34 genome assemblies (from many different species, using many different assemblers) I looked to see whether the shortest 10 sequences were all the same length or were unique:

So about half of the assemblies have unique lengths for their 10 shortest sequences. The remainder represent assemblies that probably either removed all sequences below a certain length (which seems likely with the assembly that had the shortest sequences at 2,000 bp), or which simply included all unassembled reads (six assemblies have an abundance of 100 bp sequences).
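A check like this is easy to script. The sketch below is one possible way to do it, not the exact procedure used for the numbers above. The function name `shortest_profile`, the three labels, and the toy inputs are all assumptions made for illustration.

```python
def shortest_profile(lengths, n=10):
    """Classify the n shortest sequence lengths in an assembly.

    Returns 'identical' if they all share one length (suggesting a length
    cutoff or a pile of unassembled reads), 'unique' if every length is
    different, and 'mixed' otherwise.
    """
    shortest = sorted(lengths)[:n]
    distinct = len(set(shortest))
    if distinct == 1:
        return "identical"
    if distinct == len(shortest):
        return "unique"
    return "mixed"


# Toy inputs, invented for illustration only
reads_left_in = [100] * 12 + [5000, 12000]
cutoff_free = list(range(250, 270)) + [40000]

print(shortest_profile(reads_left_in))  # identical
print(shortest_profile(cutoff_free))    # unique
```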

This raises the question: how useful is it to include all of these short sequences? It's always possible that a 100 bp read that doesn't overlap anything else contains something useful, but probably not. You might see an exon, possibly an intron, and very possibly an entire gene (depending on the species)...but can anyone do much with this information?

The most extreme example I came across is actually from one of the 16 assemblies which initially appeared to have sequences with unique lengths. This vertebrate genome assembly contains 72,214 sequences. If I ask what percentage of these sequences are shorter than various cutoffs, this is what I find:

This is pretty depressing. The vast majority of sequences in this assembly are likely to be too short to be of any use. The reason that this assembly counted among my 'unique' category is because the shortest ten sequences have lengths as follows: 100, 100, 99, 88, 87, 76, 73, 63, 12, and 3 bp. That's right, this assembly includes a sequence that's only three base pairs in length!
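If you want to ask the same question of your own assembly, something like the following would do it. This is a minimal sketch that assumes you already have the sequence lengths in a list. The function name `percent_shorter_than`, the cutoffs, and the toy lengths are hypothetical and are not taken from the assembly discussed here.

```python
def percent_shorter_than(lengths, cutoffs):
    """Return {cutoff: percentage of sequences strictly shorter than cutoff}."""
    total = len(lengths)
    if total == 0:
        return {cutoff: 0.0 for cutoff in cutoffs}
    return {
        cutoff: 100.0 * sum(1 for length in lengths if length < cutoff) / total
        for cutoff in cutoffs
    }


# Toy lengths and cutoffs, invented for illustration only
lengths = [3, 12, 63, 150, 900, 5000, 20000, 40000]
summary = percent_shorter_than(lengths, [100, 1000, 10000])

for cutoff, pct in summary.items():
    print(f"< {cutoff} bp: {pct:.1f}%")
```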

There are a couple of points that I am trying to make here:

  1. You can increase the N50 length of your assembly by removing the shortest sequences.  This is not cheating — though you should clearly state what cutoff has been used — and should be considered part of the genome assembly process (getting rid of the crap).
  2. Please look at your assembly before doing anything with it. On its own, N50 is not a very useful statistic. Look at the distribution of all sequence lengths (or plot an NG(X) graph).
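For anyone who hasn't made an NG(X) graph before, here is a rough sketch of how the values could be calculated. NG(X) is like N(X) except that the target is X% of an estimated genome size rather than X% of the assembly size. The function name `ngx`, the toy lengths, and the genome size are assumptions for illustration. Plotting the resulting pairs is left to whatever tool you prefer.

```python
def ngx(lengths, genome_size, x):
    """Return NG(X): the sequence length at which the running total,
    longest sequence first, reaches x% of the estimated genome size.
    Returns None if the assembly is too small to reach that fraction."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 100 * running >= genome_size * x:
            return length
    return None


# Toy lengths and an assumed genome size, invented for illustration only
lengths = [200, 200, 300, 400, 500, 600, 700, 1100]
genome_size = 5000

# One (X, NG(X)) pair per percentage point, ready to plot
curve = [(x, ngx(lengths, genome_size, x)) for x in range(1, 101)]

print(ngx(lengths, genome_size, 50))  # 500
print(ngx(lengths, genome_size, 90))  # None (assembly covers only 80%)
```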

It turned out that this assembly did contain a fair number of core genes and these were almost all located in the 1.2% of sequences that were > 10,000 bp. As for that 3 bp sequence, it turned out to contain no core genes at all. Shocker!

101 questions with a bioinformatician #1: Mick Watson

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

 

This week's interviewee is well suited for a career in biology and bioinformatics. After all, his name is only one substitution, insertion, and translocation away from reading as 'Watson Crick'. Welcome to 101 questions...Mick Watson.

Mick is Head of Bioinformatics at Edinburgh Genomics and a research group leader at The Roslin Institute. He also professes to be Titus Brown’s code-tester, and Nick Loman’s conscience.

His Opiniomics blog should be required reading for anyone interested in bioinformatics and you can also learn a lot if you follow him on twitter (@biomickwatson). The most important thing that you should know about Mick is that he doesn't really have strong views on anything...especially about what's wrong with the current state of bioinformatics research.

 

001. What's something that you enjoy about current bioinformatics research?

I just love biology, and bioinformatics is part of biology. Being a bioinformatician means you come into contact with lots of biologists and every single one of them has an interesting story to tell, a fascinating challenge and a problem that needs to be solved. I love solving those problems. Bioinformatics is now the key skill required in so many areas of biology, and it is bioinformaticians who now make the discovery, it is bioinformaticians who now have the eureka moment. It’s all very exciting!

 

010. What's something that you *don't* enjoy about current bioinformatics research?

How long have you got? I hate the language wars – I hate how every single bioinformatics algorithm has to be implemented and re-implemented in 7 different languages; I hate “the 2% club” who publish something because it’s 2% better than an existing tool on a very specific dataset dreamt up by the authors; I hate the fact that we have several hundred “short-read aligners”. I hate the waste of time and resource that that represents. I hate the way that many bioinformaticians have stopped being biologists, and I hate the way our science has been enslaved; I hate that we have allowed it to happen that bioinformaticians are employed in lab groups just to process their data, and no more. I hate that people see us as “support”, not researchers. I hate that, after 15 years in the field, the same problems come around again and again, and I hate that we haven’t learned from our mistakes.

And I despise anonymous peer review. Stand proud next to your words, it’s the only way.

 

011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Be confident. YES. YOU. CAN. I know this is cheesy, but the simple fact is that most bioinformaticians have the ability to be amazing; we have biological knowledge and we are not scared of computers. So much of bioinformatics is about setting yourself a goal and just doing it. If there is one thing you don’t think you can do, that’s the thing I’d recommend you go out and do right now. Be confident. You can do it. Nothing is impossible.

 

100. What's your all-time favorite piece of bioinformatics software, and why?

The first one I used – I remember telnet-ing into a server somewhere in about 1997 and using The Wisconsin Package. Wonderful. I also like Clustal and its weird menu system. I’m constantly amazed that people assembled a 3 Gbp genome using Staden. And I love the fact you can run EMBOSS’s revseq and choose not to reverse or complement the sequence.

In terms of impact, I’d say it is BLAST. However, I also think Ensembl is amazing – it is a complete genome annotation and management package, and it is completely free and open-source. I think it’s the biggest and best open-source bioinformatics project out there.

 

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality? 

I’d be N, because as a bioinformatician, you have to be everything: software engineer, mathematician, bioinformatician, database designer, biologist, statistician etc. etc.

 

A new tool to boost the N50 length of your genome assembly

We all know that the most important aspect of any genome assembly is the N50 length of its contigs or scaffolds. Higher N50 lengths are clearly correlated with increases in assembly quality and any good bioinformatician should be looking to maximize the N50 length of any assembly they are making.

I am therefore pleased that I can today announce the release of a new software tool, N50 Booster!!! that can help you increase the N50 length of an existing assembly. This tool was written in C for maximum computational efficiency and then reverse engineered into Perl for maximum obfuscation.

This powerful software is available as a Perl script (n50_booster.pl) that can be downloaded from our lab's website. The only requirement for this script is the FAlite.pm Perl module (also available from our lab's website).

Before I explain how this script works to boost an assembly's N50 length, I will show a real-world example. I ran the script on release WS230 of the Caenorhabditis japonica genome assembly:

$ n50_booster.pl c_japonica.WS230.genomic.fa

Before:
==============
Total assembly size = 166256191 bp
N50 length = 94149 bp

Boosting N50...please wait

After:
==============
Total assembly size = 166256191 bp
N50 length = 104766 bp

Improvement in N50 length = 10617 bp

See file c_japonica.WS230.genomic.fa.n50 for your new (and improved) assembly

As you can see, N50 Booster!!! not only makes a substantial increase to the N50 length of the C. japonica assembly, it does so while preserving the assembly size. No other post-assembly manipulation tool boasts this feature!

The n50_booster.pl script works by creating a new FASTA file based on the original (but which includes a .n50 suffix) and ensures that the new file has an increased N50 length. The exact mechanism by which N50 Booster!!! works will be evident from an inspection of the code.

I am confident that N50 Booster!!! can give your genome assembly a much needed boost and the resultant increase in N50 length will lead to a much superior assembly which will increase your chances of a publication in a top-tier journal such as the International Journal of Genome Assembly or even the Journal of International Genome Assembly.

Update: 2014-04-08 09.44 — I wrote a follow up post to this one which goes into more detail about how N50 Booster!!! works and discusses what people could (and should) do to the shortest sequences in their genome assemblies.

Why I think buzzword phrases like 'big data' are a huge problem

There are many phrases in bioinformatics that I find annoying because they can be highly subjective:

  • Short-read mapping
  • Long-read technology
  • High-throughput sequencing

One person's definition of 'short', 'long', or 'high' may differ greatly from someone else's. Furthermore, our understanding of these phrases changes with the onwards march of technological innovation. Back in 2000 'short' meant 16–20 bp whereas in 2014, 'short' can mean 100–200 bp.

The new kid on the block, which is not specific to bioinformatics, is 'big data'. Over the last week, I've been helping with an NIH grant application entitled Courses for Skills Development in Biomedical Big Data Science. This grant mentions the phrase thirty-nine times so it must be important. Why do I dislike the phrase so much? Here is why:

  1. Even within a field like bioinformatics, it's a subjective term and may not mean the same thing to everyone.
  2. Just as the phrases 'next-generation' and 'second-generation' sequencing inspired a set of clumsy and ill-defined successors (e.g. '2.5th generation', 'next-next-next generation' etc.), I expect that 'big data' might lead to similar language atrocities being committed. Will people start talking about 'really big data' or 'extremely large data' to distinguish themselves from one another?
  3. This term might be subjective within bioinformatics, but it is probably much more subjective when used across different scientific disciplines. In astronomy there are space telescopes that are producing petabytes of data. In the field of particle physics, the Data Center at the Wigner Research Centre for Physics processes one petabyte of data per day. If you work for the NSA, then you may well have exabytes of data lying around.

I joked about the issue of 'big data' on twitter:

My Genome Center colleague Jo Fass had a great comment in response to this:

This is an excellent point. When people talk about the challenges of working with 'big data', it really depends on how well your infrastructure is equipped to deal with such data. If your data is readily accessible and securely backed up, then you may only be working with 'data' and not 'big data'.

In another post, I will suggest that the issue for much of bioinformatics is not 'big data' per se but 'obese data', or even 'grotesquely obese data'. I will also suggest a sophisticated computational tool that I call Operational Heuristics for Management of Your Grotesquely Obese Data (OHMYGOD), but which you might know as rm -f.