The Graphical Fragment Assembly (GFA) format

Shaun Jackman added a comment to my previous post about the ongoing development of a new format by which to represent genome assemblies. I thought I would reproduce this in a separate blog post in order to bring this issue to more attention.

But first, a quick reminder that currently nearly all genome assemblies are ultimately stored as DNA sequences in FASTA format. This format was developed over 25 years ago and is not best suited to representing a genome assembly.

One obvious reason for this is that we commonly sequence the genomes of diploid individuals who have two genomes present in every cell (one derived from each parent). We often know that a particular region of the genome should be represented as sequence X or sequence Y, but the FASTA format requires you to choose one or the other.

There has already been one effort to develop a new file format to best represent the variation present in an assembly, and a final specification was formalized. However, this FASTG format has seemingly not been widely adopted by the community (at least, not that I know of).

At this point, I will simply reproduce Shaun's comment from the earlier post (minor edits made to restructure some of the links and layout):

There has been three fantastic blog posts in the past three months on the topic of devising a common file format for a sequence overlap graph to enable modular assembly pipelines.

Heng Li (@lh3lh3) has proposed a Grapical Fragment Assembly (GFA) file format. An implementation will be included in the next release of ABySS. Jared Simpson (@jaredtsimpson) is working on an implementation for String Graph Assembler (SGA). I hope that other implementations will follow.

  1. Dear assemblers, we need to talk … together by Páll Melsted (@pmelsted) and Michael R. Crusoe (@biocrusoe). tl;dr we need a common file format for contig graphs and other stuff too
  2. A proposal of the Grapical Fragment Assembly format by Heng Li and…
  3. First update on GFA by Heng Li

Please add you comments to this posting with your thoughts on the GFA file format. 

There are a lot of comments on the two blog posts by Heng and I tweeted my (minor) concerns regarding how this format proposal has developed. This led to some further discussion on twitter, some of which I have storified:

I hope that Heng takes up Shaun's suggestion to move the spec to GitHub. The FASTG proposal used a mailing list to help focus some of the discussion and I feel that something similar needs to happen to ensure that any future debate about the GFA format is productive.

101 questions with a bioinformatician #14: Shaun Jackman

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Shaun Jackman is a PhD student working on various problems relating to genome assembly at the University of British Columbia. Specifically, Shaun works under the supervision of  İnanç Birol in the Bioinformatics Technology Lab at the BC Cancer Agency's Genome Sciences Centre in Vancouver. You may know him for his work in writing and directing the 1989 smash hit The Abyss, which was later developed into a popular genome assembler.

In addition to being a talented bioinformatician who has contributed to lots of useful software, he is also a very patient guy. I say this because he has been waiting for me to publish this interview for over 3 months (my sincere apologies for the delay, I will try to make this series a regular feature once again).

You can find out more about Shaun from his website or by following him on twitter (@sjackman). And now, on to the 101 questions...

 

 

001. What's something that you enjoy about current bioinformatics research?

I’m excited to see the increasing popularity of enabling reproducible research using tools such as R Markdown and iPython Notebook. After reading a paper, it should be straight forward to download the raw data, install the necessary software, reproduce the results and regenerate the figures. I’m really hoping that we get to that point.

I'm also happy to see more interaction between developers and users using revision control web sites, such as GitHub.

 

010. What's something that you *don't* enjoy about current  bioinformatics research?

Most genome sequence assembly tools are structured as a pipeline: for example, count the k-mers of a set of reads, construct a de Bruijn graph of those k-mers, remove k-mers caused by sequencing errors, identify heterozygous sequences and finally assemble contigs.

It should be possible to mix and match these individual components from different assemblers to create new assembly pipelines that are hybrids of existing tools. Not only could it create a better overall assembler, but it could identify which of the individual components of the various assemblers are strongest. It should be encouraged to improve on an individual component without having to reinvent an entire assembly pipeline.

 

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

Learn to use R and R Studio to visualize your data. I wasted a lot of time making ugly figures with inferior tools. Use Make to automate every analysis pipeline. No pipeline is too small or too large. A one-off analysis never is.

 

100. What's your all-time favorite piece of bioinformatics software, and why?

  • I use Make and R nearly every day.
  • I like Heng Li‘s tools because they stick to the principle of doing one thing well.
  • I’m fond of ABySS, for one because it was the first bioinformatics tool that I helped to develop, but primarily because it’s designed as a pipeline of reusable modular tools that use standard (when possible) file formats, all bound together by a Makefile.

 

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

I’m an N, because it leaves all options open.

If music be the food of bioinformatics, play on: time for a new JABBA award

It's been a while, but it is time for another JABBA award. I occasionally hand out these awards whenever I see Just Another Bogus Bioinformatics Acronym (though the awards also apply to initialisms). You can see details of many previous JABBA award winners elsewhere on this site.

The latest recipient of a JABBA award is this new paper published in Genome Biology:

As soon as I saw the title, I assumed that MUSIC would be an acronym or initialism but the paper itself does not explain what the name means. My hopes were raised that maybe this was just a simple, pleasant-sounding, name for a piece of software — a name which didn't try to clumsily form itself from a desperate grab of various letters from a longer description of the software. I was just about to thank them for the MUSIC when I thought I would check the software page that is linked to from the paper. That's when I saw this:

They chose not to mention this in the paper, but MUSIC is derived from Multiscale Enrichment Calling! The use of the letter 'i' from the word 'Enrichment' is what really escalates this to the status of a bogus acronym. By this logic they could have called this tool many other things (including 'MENTAL'). I'm a bit surprised that they didn't go for the slightly longer 'MUSICAL' (or even 'MUSICALL' if they wanted a genuinely unique name).

Although this software tool will be hard to find with a Google Search — unless people specifically search for 'ChIP-seq MUSIC —  we should at least be thankful that no-one else has ever published a bioinformatics tool called MUSIC. Oh wait, they have.

Update 2014-10-10: Nicolas Robine (@notsojunkDNA) has alerted me to the fact that there is also some bioinformatics software called Genome MuSiC.

Academic link rot seems to be getting faster: should a published URL last more than 100 days?

Consider this paper that was recently published in the journal Bioinformatics, and which showed up today in my RSS feed:

Presumably it is a typo when the journal says that it was received on November 14th 2014:

I'll assume that this is meant to be 2013! The paper first appeared online on June 13th 2014, just 103 days ago. The text of this paper links to some software that should be available at http://ww2.cs.mu.oz.au/∼gwong/LICRE. Except that this URL doesn't work. Neither does http://ww2.cs.mu.oz.au/∼gwong/. Only when I visit http://ww2.cs.mu.oz.au/ do I discover the following:

The new website for the merged departments says that the merger happened in 2012, and this is confirmed by the redirection page which has a date of 18th January 2012. It is also confirmed by looking at the Internet Archive's Wayback Machine which shows that the redirection page has been in place since at least February 2012. 

All of which suggests that the software link in the paper may have not even worked properly at the time they submitted the manuscript. I'm sure there are other similar examples of speedy link rot, but this seems particularly striking. Especially since a search for 'LICRE' on the new website doesn't return any hits (nor can I find any mention of it on Google or various search engine caches).

I will contact the lead author to let him know about the disappearance of the software. In the meantime, I'll remind people of this previous post of mine:

Update 2014-09-24 19.52:  I heard back from the author, the LICRE code is now at https://sites.google.com/site/licrerepository/

Another CEGMA post: KOGs vs CEGs and 458 vs 248

I posted another answer about CEGMA on seqanswers.com last week. I thought I'd cover this in a little more detail here (note, questions edited from how they originally appeared):

Question 1: CEGMA uses a 'kogs.fa' file — containing 2,748 proteins — to compare to a user's genome sequence. These KOGs define a set of 458 core eukaryotic genes (CEGs). Some CEGMA publications present the number of 458 CEGs that are present, others list results from the 248 most-highly-conserved CEGS. Does anyone know why kogs.fa is the default? Does it get 'curated' down to a smaller set during a CEGMA run?

The kogs.fa file represents a subset of the published set of 4,852 KOGs (euKaryotic Orthologous Groups). The KOGs database — which is still available online — describes protein groups that are present among seven different eukaryotes (not all groups are present in all species). We excluded data from the microsporidian Encephalitozoon cuniculi as it is a parasite and may have an atypical protein complement and focused on the 1,788 groups that were present in all of the remaining six species. We then applied various filtering criteria — see methods in original paper — to reduce this to the 458 KOGs (renaming this subset as CEGs in the process). We also chose just one protein to represent each species.

So that's why our kogs.fa file contains 2,748 proteins (458 x 6). CEGMA tries to determine which of these 458 CEGs are present in your input file. It's worth pointing out that the original purpose of CEGMA was to try to find a handful of genes in a genome which may lack gene annotations. Someone could then use this small gene set to train a gene finder, by which to annotate the entire genome.

After CEGMA has found which of the 458 CEGs are present, it then performs its secondary role of assessing the completeness of the gene space. To do this, it only wants to use the most conserved, and least paralogous of the 458 CEGs. Paralogy is a big issue here. The original KOGs database grouped together proteins when there were often many, many paralogs for each group. E.g. KOG0001 corresponds to the Ubiquitin gene family. Here are how many proteins occur in each of the seven species that represent this KOG:

  • Arabidopsis thaliana - 28
  • Caenorhabditis elegans - 12
  • Drosophila melanogaster - 3
  • Encephalitozoon cuniculi - 1
  • Homo sapiens - 17
  • Saccharomyces cerevisiae - 2
  • Schizosaccharomyces pombe - 1

The high degree of paralogy from A. thaliana is one reason why this KOG is not included in our subset of 248 CEGs. In contrast, KOG0018  — Structural maintenance of chromosome protein 1 (sister chromatid cohesion complex Cohesin, subunit SMC1) — is included in the 248 CEGs:

  • Arabidopsis thaliana - 1
  • Caenorhabditis elegans - 4
  • Drosophila melanogaster - 1
  • Encephalitozoon cuniculi - 1
  • Homo sapiens - 3
  • Saccharomyces cerevisiae - 1
  • Schizosaccharomyces pombe - 1

This secondary role of CEGMA uses information in the completeness_cutoff.tbl file (inside the CEGMA data directory) to narrow the 458 CEGs results down to a subset of 248 CEGs. Because different filtering criteria are used, a CEG may be classed as present in the 458 CEG set, but not in the 248 CEG set, even if it was on the list of 248 candidate CEGs.

Question 2: CEGMA output includes many KOG IDs but no descripition of what protein name/function each KOG ID represents. This makes it not so useful for annotating new genomes. Is there a lookup table somewhere?

One of the reason why we maintained KOG identifiers in the CEGMA output was so that people could, if so inclined, look up more information in the KOGs database. If you download the 'kog' file from the KOGs database, you will see each KOG has a one line description. E.g.

[O] KOG0019 Molecular chaperone (HSP90 family)
[KC] KOG0025 Zn2+-binding dehydrogenase (nuclear receptor binding factor-1)
[ZD] KOG0028 Ca2+-binding protein (centrin/caltractin), EF-Hand superfamily protein
[C] KOG0042 Glycerol-3-phosphate dehydrogenase
[T] KOG0044 Ca2+ sensor (EF-Hand superfamily)
[K] KOG0048 Transcription factor, Myb superfamily

The letters inside square brackets, represent various functional categories annotated by the KOGs database. These are as follows:

INFORMATION STORAGE AND PROCESSING
 [J] Translation, ribosomal structure and biogenesis
 [A] RNA processing and modification
 [K] Transcription
 [L] Replication, recombination and repair
 [B] Chromatin structure and dynamics

CELLULAR PROCESSES AND SIGNALING
 [D] Cell cycle control, cell division, chromosome partitioning
 [Y] Nuclear structure
 [V] Defense mechanisms
 [T] Signal transduction mechanisms
 [M] Cell wall/membrane/envelope biogenesis
 [N] Cell motility
 [Z] Cytoskeleton
 [W] Extracellular structures
 [U] Intracellular trafficking, secretion, and vesicular transport
 [O] Posttranslational modification, protein turnover, chaperones

METABOLISM
 [C] Energy production and conversion
 [G] Carbohydrate transport and metabolism
 [E] Amino acid transport and metabolism
 [F] Nucleotide transport and metabolism
 [H] Coenzyme transport and metabolism
 [I] Lipid transport and metabolism
 [P] Inorganic ion transport and metabolism
 [Q] Secondary metabolites biosynthesis, transport and catabolism

POORLY CHARACTERIZED
 [R] General function prediction only
 [S] Function unknown

Maybe this is useful to someone. However, I would remind people that KOGs was published over a decade ago (and presumably the work to generate the KOGs database begun in 2002 if not earlier). There were probably several gene annotations that were missing in the source genomes at that time, and many annotations have presumably since been updated (I bet many genes have had minor alterations to their structure).

 

Updated version of my 'Genome assembly: then and now' talk is now available

This is a presentation that I have probably given five times now. Originally, the main focus of the talk was purely about the Assemblathon 2 paper, with some thoughts about how the field of genome assembly has changed since the days of Sanger-only sequencing.

Over time, I've increasingly downplayed the Assemblathon 2 content of the talk, and made way for updates relating to the latest developments in genome sequencing and assembly. To that end, I've decided to start adding version numbers to this talk to help make it easier to distinguish between different versions.

So here is version 1.2 of my talk, presented below with and without notes (my talks are very visual, so I have embedded notes to try to capture what I talk about for each slide). Don't be put off by the high slide count (many of these just reflect animated steps).

Without notes…

With notes (probably need to go full-screen to be able to clearly read these)…

Too many genome assemblers to keep track of? Nucleotid.es to the rescue!

Yesterday, I presented an updated version of my 'Genome Assembly: Then and Now' talk. I'll try to post the full set of slides (with notes) later today on Slideshare. But I thought I'd share just one of the new slides from the talk; here are six papers that describe new genome assembly tools…

Read More

New option to subscribe to this blog via email

Shamelessly borrowing this idea from Matt Gemmell's excellent blog, I thought I'd offer the chance to subscribe to my infrequent ramblings via email. If you enter your email address below, you can receive a weekly email (sent on Friday afternoons) with all of my posts for that week.

Your email address will only be used for the purpose of receiving my blog content and will not be shared with anyone else. Each email will offer a simple link by which to unsubscribe.

Some sage advice on avoiding confusing names for bioinformatics tools

SAGE is a molecular technique used to investigate the mRNA population from a chosen sample. It stands for Serial Analysis of Gene Expression and was first described back in 1995. The technique spawned spin-offs such as LongSAGE, RL-SAGE (Really Long SAGE), and SuperSAGE.

Although this technique has largely been superseded by other methods (such as RNA-Seq), it is still widely referenced (over 1,300 publications from 2013 mention this technique).

Fast-forward to the present day and I note that a new tool has just been published in the journal BMC Bioinformatics:

SAGE: String-overlap Assembly of GEnomes

As long as you query your favorite web search engine for some combination of 'SAGE' and 'genome assembly' you will probably find this tool and not end up on one of the half a million pages that talk about the other SAGE. I'm still not sure whether it is a bit risky giving a new tool the same name as such an established molecular technique.

All of this means that there is the potential for a certain company to use the aforementioned molecular technique to help annotate the output of the aforementioned computational technique, and apply both of these techniques to data from a certain plant. This could give you the world's first SAGE, SAGE, SAGE, sage genome!