Tales of drafty genomes: part 1 — The Human Genome

One of my recent blog posts discussed this new paper in PLOS Computational Biology:

There has also been a lot of chatter on Twitter about this paper. Here is just part of an exchange that I was involved in yesterday:


The issue of what is or isn’t a draft genome — and whether this even matters — is something on which I have much to say. It’s worth mentioning that there are a lot of draft genomes out there: Google Scholar reports that there are 1,440 articles that mention the phrase ‘draft genome’ in their title [1]. In the first part of this series, I’ll take a look at one of the most well-studied genome sequences in existence…the human genome.


The most famous example of a draft genome is probably the ‘working draft’ of the human genome that was announced — with much fanfare — in July 2000 [2]. At this time, the assembly was reported as consisting of “overlapping fragments covering 97 percent of the human genome”. By the time the working draft was formally published in Nature in January 2001, the assembly was reported as covering “about 94% of the human genome” (incidentally, this Nature paper seems to be the first published use of the N50 statistic).
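As an aside, the N50 statistic is simple to compute: it is the length L such that contigs of length ≥ L together cover at least half of the total assembly size. A minimal Python sketch (the contig lengths here are made up for illustration):

```python
def n50(lengths):
    """Return the N50 of a list of contig/scaffold lengths: the length L
    such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy assembly of 100 bp total: 40 + 30 = 70 bp >= half (50 bp)
print(n50([40, 30, 20, 10]))  # 30
```

Note that N50 rewards a few long contigs: two assemblies with the same total size can have very different N50 values depending on how fragmented they are.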

On April 14, 2003 the National Human Genome Research Institute and the Department of Energy announced the “successful completion of the Human Genome Project” (emphasis mine). This was followed by the October 2004 Nature paper that discussed the ongoing work in finishing the euchromatic portion of the human genome[3]. Now, the genome was being referred to as ‘near-complete’ and if you focus on the euchromatic portion, it was indeed about 99% complete. However, if you look at the genome as a whole, it was still only 93.5% complete [4].

Of course the work to correctly sequence, assemble, and annotate the human genome has never stopped, and will probably continue for some time yet. As of October 14, 2014, the latest version of the human genome reference sequence is GRCh38.p1 [5], lovingly maintained by the Genome Reference Consortium (GRC). The size of the human genome has increased just a little bit compared to the earlier publications from a decade ago [6], but there are still several things that we don’t know about this ‘complete/near-complete/finished’ genome. Unknown bases still account for 5% of the total size (that’s over 150 million bp). Furthermore, there are almost 11 million bp of unplaced scaffolds that are still waiting to be given a (chromosomal) home. Finally, there remain 875 gaps in the genome (526 are spanned gaps and 349 are unspanned gaps [7]).
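That ‘over 150 million bp’ figure is easy to sanity-check against the assembly size given in footnote 6 (the 5% value is itself a rounded figure):

```python
assembly_size = 3_212_670_709          # total bp in GRCh38.p1 (footnote 6)
unknown_bases = assembly_size * 5 // 100  # ~5% of bases are Ns
print(unknown_bases)  # 160633535 (comfortably over 150 million bp)
```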

If we leave aside other problematic issues in deciding what a reference genome actually is, and what it should contain [8], we can ask the simple question: is the current human genome a draft genome? Clearly, I think everyone would say ‘no’. But what if I asked: is the current human genome complete? I’m curious how many people would say ‘yes’ and how many people would ask me to first define ‘complete’.

Here are some results for how many hits you get when Googling for the following phrases:

Scientists and journalists don’t help the situation by being, perhaps, too eager to overhype the state of completion of the human genome [9]. In conclusion, the human genome is no longer a draft genome, but it is still just a little bit drafty. More on this topic of drafty genomes in part 2!


  1. There are 1,570 if you don’t require the words ‘draft’ and ‘genome’ to be together in the article title.  ↩

  2. The phrase ‘working draft’ had been in use since at least late 1998.  ↩

  3. There is also the entire batch of chromosome-specific papers published between 2001 and 2006.  ↩

  4. This percentage is based on the following line in the paper: “The euchromatic genome is thus ~2.88 Gb and the overall human genome is ~3.08 Gb”  ↩

  5. This is the first patch update to version 38 of the reference sequence.  ↩

  6. There are 3,212,670,709 bp in the latest assembly  ↩

  7. The GRC defines the two categories as follows:

    Spanned gaps are found within scaffolds and there is some evidence suggesting linkage between the two sequences flanking the gap. Unspanned gaps are found between scaffolds and there is no evidence of linkage.  ↩

  8. Remember, human genomes are diploid and not only vary between individuals but can also vary from cell-to-cell. The idea of a ‘reference’ sequence is therefore a nebulous one. How much known variation do you try to represent (the GRC represents many alternative loci)? How should a reference sequence represent things like ribosomal DNA arrays or other tandem repeats?  ↩

  9. Jonathan Eisen wrote a great blog post on this: Some history of hype regarding the human genome project and genomics  ↩

Recent changes to the ACGT blog

There are probably some published rules for how to write blog posts, and I imagine that one of those rules might be 'don't blog about your blog'. Oh well…

Over the last few months I've been making lots of tweaks to this site. Most of them have been subtle changes in order to make the pages less cluttered and more aesthetically pleasing — e.g. did you notice that I moved the 'About Blog Contact' menu bar ~20 pixels closer to the top of the page as it was just a little bit too lonely where it used to be? Aside from these minor cosmetic alterations, I've also made some more substantial changes:

  • You can no longer see tags for individual blog posts (but I have kept the tag cloud on the About page so you can still find all posts that share the same tag).
  • On the same About page, I added an option to let people subscribe to the blog via email (maximum one email per day).
  • Just about every post listed on the main page of this site used to have a 'Read More' link which you needed to click in order to read the entire article. I'm now restricting this only to my longer posts.
  • I've disabled comments from the blog; partly because I wasn't getting a lot of them but mostly because of the many excellent reasons stated on the @avoidcomments Twitter account.
  • I've started including occasional 'link blog' posts. These are short posts — often just a paragraph or two — that typically comment on a paper or on another blog post. The titles of link blog posts are themselves links to the source paper/blog post, and include a little arrow to indicate this. E.g. see here, here, or here.

As a final note, I'd like to thank everyone who has tweeted or otherwise spread the word about this blog. Aside from churning out more posts about bioinformatics tools with bogus names, I do want to make a bit more of an effort to write about some more substantive issues in bioinformatics. Stay tuned!

Ten recommendations for software engineering in research

The list of recommendations in this new GigaScience paper by Janna Hastings et al. is not aimed at bioinformatics in particular, but many bioinformaticians would benefit from reading it. I can particularly relate to this suggestion:

Document everything
Comprehensive documentation helps other developers who may take over your code, and will also help you in the future. Use code comments for in-line documentation, especially for any technically challenging blocks, and public interface methods. However, there is no need for comments that mirror the exact detail of code line-by-line.

3 word summary of new PLOS Computational Biology paper: genome assemblies suck

Okay, I guess a more accurate four word summary would be 'genome assemblies sometimes suck'.

The paper that I'm referring to, by Denton et al, looks at the problem of fragmentation (of genes) in many draft genome assemblies:

As they note in their introduction:

Our results suggest that low-quality assemblies can result in huge numbers of both added and missing genes, and that most of the additional genes are due to genome fragmentation (“cleaved” gene models).

One section of this paper looks at the quality of different versions of the chicken genome and CEGMA is one of the tools they use in this analysis. I am a co-author of CEGMA, and reading this paper brought back some memories of when we were also looking at similar issues.

In our 2nd CEGMA paper we tried to find out why 36 core genes were not present in the v2.1 version of the chicken genome (6.6x coverage). Turns out that there were ESTs available for 29 of those genes, indicating that they are not absent from the genome, just from the genome assembly. This led us to find pieces of these missing genes in the unanchored set of sequences that were included as part of the set of genome sequences (these often appear as a 'ChrUn' FASTA file in genome sequence releases).

Something else that we reported on in our CEGMA paper is that, sometimes, a newer version of a genome assembly can actually be worse than what it replaces (at least in terms of genic content). Version 1.95 of the Ciona intestinalis genome contained several core genes that subsequently disappeared in the v2.0 release.

In conclusion — and echoing some of the findings of this new paper by Denton et al.:

  1. Many genomes are poorly assembled
  2. Many genomes are poorly annotated (often a result of the poor assembly)
  3. Newer versions of genome assemblies are not always better

101 questions with a bioinformatician #19: Valerie Schneider

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Valerie Schneider is a Staff Scientist at the NCBI in charge of the teams that provide bioinformatics support for the Genome Reference Consortium (GRC). Her teams develop web resources and databases for the analysis and visualization of genomic data, including Map Viewer, various genome browsers and the NCBI Genome Remapping Service.

She tells me that the work of her group focuses on "providing tools that enable researchers to take advantage of the wealth of genomic data available in public databases". Valerie also wanted to mention the following:

The Genome Reference Consortium always appreciates feedback on the human, mouse and zebrafish reference assemblies. If you think the genome looks wrong, or have questions about it, please let us know.

And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

I’m excited by the current push for the sharing of big data. Projects like the Global Alliance for Genomics and Health are bridging the gap between bioinformatics and clinical research. At the NCBI, we’re actively developing new resources that will help researchers navigate and use all this data effectively. I’m also really excited by the effect that longer read sequencing technologies are having on software development. For many years, we’ve been attacking questions and developing software tools based on short read data. Long reads make it possible to investigate genomic regions, such as segmental duplications and other repetitive regions that were previously largely inaccessible, and should also result in more contiguous assemblies. I’m looking forward to seeing what will likely be a variety of new tools to manipulate this data.



010. What's something that you don't enjoy about current bioinformatics research?

The ability to reproduce somebody else’s results is integral to good science. Unfortunately, this is often a challenge in current bioinformatics research. This is not because the science is unreliable or the results wrong. On the contrary, it can be hard because software and datasets aren’t always in public databases that are maintained for the long-term, software versions change (and may not be explicitly noted in publications), or software uses non-standard file formats or works as part of a specific tool chain. As someone who works for an informatics repository, I’m well aware that for most researchers, data management isn’t as exciting as the data. But if it’s not well-managed, science suffers in the long term because we can’t easily reanalyze the data in light of new findings. As bioinformatics-based analyses work their way into journals that haven’t traditionally dealt with big data, this becomes more and more of a challenge. Journals are looking at ways to get it right, and archive resources are storing data and tools in more forms than ever, but researchers must also make this a priority when submitting and reviewing publications.



011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Don’t over-specialize in any one area of biology (and take more programming classes!). Bioinformatics can lead you all over the genome. A solid grounding in genetics, population biology, structural biology and evo/devo is helpful not only in defining your own interests, but will also prepare you to follow the data, whatever it reveals.



100. What's your all-time favorite piece of bioinformatics software, and why?

I’m going to put in a shameless plug for the NCBI Coordinate Remapping service. Between last year’s release of an updated human reference assembly and the growth in the number of genome assemblies, it’s critical that researchers be able to translate their data between coordinate systems. The Remapping service is based upon assembly-assembly alignments (which are also available); the remapping is done as a base-by-base comparison. It has a notion of first and second pass alignments that can be useful for identifying duplicate sequences. It also stands out because it not only lets you map between chromosomes, but also between chromosomes and alt loci in GRC assemblies.



101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

C (cytosine). I appreciate its propensity for change, in its ability to spontaneously convert to Uracil, and I like the way it can take alternate methylated forms. The former reflects the way I work, tackling a wide variety of projects every day and switching between biological and technical discussions. The latter reflects the fact that I wear multiple professional hats: scientist and scuba instructor.

Buy one bogus bioinformatics acronym, get one free!

New in the journal Bioinformatics:

So the software is called AdvISER-M-PYRO, and this is (presumably) derived from Amplicon identification using SparsE representation of multiplex PYROsequencing. However, I don't understand why 'S', 'E', and 'PYRO' get capitalized but not the 'I' of 'identification' or the 'M' of 'multiplex'. Also, I like how they sneak in two letters ('d' and 'v') that don't occur in the name of the software (at least not before the 'I' of 'identification').

This paper is a 2-for-1 type of deal, because if you read a bit more of the abstract you will see:

In parallel, the nucleotide dispensation order was improved by developing the SENATOR (‘SElecting the Nucleotide dispensATion Order’) algorithm.

This second bogus acronym also lacks some clarity. Is the 'R' of 'SENATOR' derived from the first or second 'r' of the word 'Order'?

Time for a new JABBA award for Just Another Bogus Bioinformatics Acronym

From the journal Bioinformatics, we have:

The bogus nature of this acronym is quickly revealed from the very first line of the abstract:

We introduce PEPPER (Protein complex Expansion using Protein–Protein intERactions)

Winning a JABBA award is one thing, but you get bonus points if you decide to use the same name as a completely different bioinformatics tool (something that is, sadly, becoming more common). So if you run a Google search for pepper bioinformatics, you may also come across a molecular visualizer called PeppeR.

Turkey bioinformatics by the numbers

  • 109,700,700 - mean divergence time (in years) between turkeys (Meleagris gallopavo) and turkey vultures (Cathartes aura)
  • 101,849 - number of turkey nucleotide sequences in GenBank
  • 3,597 - number of selenoproteins in UniProt which allow for the possibility of generating a peptide called ‘TURKEY’ (the IUPAC code for selenocysteine is U)[1]
  • 1 - number of published turkey genomes
  • 0 - number of bioinformatics tools that have tried using ‘turkey’ as part of a bogus acronym in their name[2]

  1. I can’t find any such peptide when using the UniProt BLAST server though :-(  ↩

  2. And let's try to keep it that way! ↩

More mixed-case madness in the name of a bioinformatics tool

From the latest issue of Bioinformatics we have:

According to the abstract, the 'SUB' comes from subcellular, the 'A' comes from Arabidopsis, and the 'con' comes from 'consensus'. So why isn't it SUBACON? Maybe because people might then read it as 'sue bacon'?

It's not clear to me if this is meant to be pronounced 'soo-ba-con' or 'sub-ay-con'. The abstract then goes on to mention something called the ASURE portal (pronounced 'azure' or 'ay-sure'???), where ASURE = Arabidopsis SUbproteome REference. If this were following the same rules as SUBAcon, shouldn't it be called ASUre (or even ASUBre)?

How user-friendly should bioinformatics documentation be?

Imagine that you have never seen a SAM output file before. Now imagine that you are relatively new to bioinformatics, perhaps you are PhD student doing a rotation in a bioinformatics lab. If you are asked to work with some SAM files, you might reasonably want to look at the SAM documentation to understand the structure of this 11-column plain text file format.

Let's consider just the second column of a SAM output file. You've been looking at the SAM file that your boss provided to you and you notice that column 2 is full of integer values, mostly 0, 4, and 16. You want to know what these mean and so you turn to the relevant section of the SAM documentation to find out more about column 2:

Column 2 — FLAG: bitwise FLAG

Each bit is explained in the following table:

Bit — Description
0x1 — template having multiple segments in sequencing
0x2 — each segment properly aligned according to the aligner
0x4 — segment unmapped
0x8 — next segment in the template unmapped
0x10 — SEQ being reverse complemented
0x20 — SEQ of the next segment in the template being reversed
0x40 — the first segment in the template
0x80 — the last segment in the template
0x100 — secondary alignment
0x200 — not passing quality controls
0x400 — PCR or optical duplicate
0x800 — supplementary alignment

  • For each read/contig in a SAM file, it is required that one and only one line associated with the read satisfies ‘FLAG & 0x900 == 0’. This line is called the primary line of the read.
  • Bit 0x100 marks the alignment not to be used in certain analyses when the tools in use are aware of this bit. It is typically used to flag alternative mappings when multiple mappings are presented in a SAM.
  • Bit 0x800 indicates that the corresponding alignment line is part of a chimeric alignment. A line flagged with 0x800 is called as a supplementary line.
  • Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, bits 0x2, 0x10, 0x100 and 0x800, and the bit 0x20 of the previous read in the template.
  • If 0x40 and 0x80 are both set, the read is part of a linear template, but it is neither the first nor the last read. If both 0x40 and 0x80 are unset, the index of the read in the template is unknown. This may happen for a non-linear template or the index is lost in data processing.
  • If 0x1 is unset, no assumptions can be made about 0x2, 0x8, 0x20, 0x40 and 0x80.

So having read all of this, my question to you is: what does a value of zero in your SAM file correspond to?

To me this is far from clear from the documentation. You first have to understand what bitwise actually means. You then need to understand that these bitwise flag values will be represented as an integer value in the SAM file (this is mentioned in passing elsewhere in the documentation).

Finally, you must deduce that a value of zero in your SAM output file means that no bitwise flags have been set. So, if the 3rd 'segment unmapped' bit isn't set, then that means that your segment (i.e. sequence) was mapped. Likewise, the lack of a 5th bit (reverse complemented) means that your sequence match must be on the forward strand.

Phew. I find this to be frustratingly opaque and in desperate need of some examples, particularly because zero values in a SAM output file are among the most common things that a user will see. The above table could also benefit from including the equivalent integer values, to make it clearer that 0x10 = 16, 0x20 = 32, etc.
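To illustrate what such an example might look like, here is a toy Python decoder (my own sketch, not part of any official SAM tooling) that turns an integer FLAG value into its named bits:

```python
# Bit descriptions taken from the FLAG table in the SAM specification
SAM_FLAGS = {
    0x1:   "template has multiple segments",
    0x2:   "each segment properly aligned",
    0x4:   "segment unmapped",
    0x8:   "next segment unmapped",
    0x10:  "SEQ reverse complemented",
    0x20:  "SEQ of next segment reversed",
    0x40:  "first segment in template",
    0x80:  "last segment in template",
    0x100: "secondary alignment",
    0x200: "not passing quality controls",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}

def decode_flag(flag):
    """Return the descriptions of all bits set in an integer FLAG value."""
    return [desc for bit, desc in SAM_FLAGS.items() if flag & bit]

print(decode_flag(0))   # [] — no bits set: mapped, forward strand
print(decode_flag(4))   # ['segment unmapped']
print(decode_flag(16))  # ['SEQ reverse complemented'] (16 == 0x10)
```

The empty list for a FLAG of 0 makes the point directly: zero means no bits are set, i.e. a mapped read on the forward strand.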

I've raised a GitHub issue regarding these points. The larger issue here is that I think software developers sometimes assume too much about the skill set of their end users and fail to write their documentation in terms that mere mortals will understand.