Real bioinformaticians and old bioinformaticians

A passing mention of the phrase 'real bioinformaticians' by Michael Hoffman (@michaelhoffman) yesterday prompted me to elevate the concept to be worthy of its own hashtag. This is what happened next:

You will notice that Sara G's response (@sargoshoe) humorously introduced the concept of #oldbioinformaticians, and this in turn spawned an even longer set of tweets (see below). I think that many of the more (how shall we put this) wise and distinguished members of the bioinformatics community enjoyed the chance for a trip down memory lane.

Musical encores in bioinformatics and other sciences

I've previously flagged a few examples of independently developed bioinformatics software tools that share the same name. My recent post about the JABBA-award-winning software called MUSIC prompted some people to let me know that this is another name that has been used repeatedly by different groups.

So thanks to Nicolas Robine and commenter LMikeF, we can see that MUSIC is a very popular name for bioinformatics tools:

  1. MuSiC: a tool for multiple sequence alignment with constraints (2004)
  2. RE-MuSiC: a tool for multiple sequence alignment with regular expression constraints (2007)
  3. MuSiC: identifying mutational significance in cancer genomes (2012)
  4. MUSIC: Identification of Enriched Regions in ChIP-Seq Experiments using a Mappability-Corrected Multiscale Signal Processing Framework (2014)
  5. MUSiCC: Towards an accurate estimation of average genomic copy-numbers in the human microbiome (2014)

The first two publications sadly suffer from link rot and the provided URLs no longer work. These two publications are also by the same group, which raises the question: what would they call a third iteration of their software (RE-RE-MuSiC?).

A little bit of additional searching reveals that MUSIC is a popular name in other scientific endeavors as well:

  1. MUSIC: MUltiScale Initial Conditions — software to generate initial conditions for cosmological simulations
  2. MUSIC: MUltiScale SImulation Code — fluid dynamics software: warning this website will make you nauseous!
  3. MUSIC: MUerte Súbita en Insuficiencia Cardiaca — a longitudinal study to assess risk predictors of death in patients with heart failure
  4. MUSIC: MUtation-based SQL Injection vulnerabilities Checking tool — a tool to help check for vulnerabilities in web based applications

I guess people like the name MUSIC and will go to almost any lengths to make an acronym/initialism for it. 

The Graphical Fragment Assembly (GFA) format

Shaun Jackman added a comment to my previous post about the ongoing development of a new format by which to represent genome assemblies. I thought I would reproduce this in a separate blog post in order to bring this issue to more attention.

But first, a quick reminder that currently nearly all genome assemblies are ultimately stored as DNA sequences in FASTA format. This format was developed over 25 years ago and is not best suited to representing a genome assembly.

One obvious reason for this is that we commonly sequence the genomes of diploid individuals who have two genomes present in every cell (one derived from each parent). We often know that a particular region of the genome should be represented as sequence X or sequence Y, but the FASTA format requires you to choose one or the other.
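To make that limitation concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the sequences, the segment names (`left`, `x`, `y`, `right`), and the helper functions `to_fasta` and `spell_paths`. The graph is a plain dictionary, not FASTG or any other real file format.

```python
# Hypothetical region with one heterozygous site; all sequences are invented.
flank_left = "ACGTAC"
allele_x = "G"   # variant carried by one parental copy
allele_y = "T"   # variant carried by the other parental copy
flank_right = "TTGCA"


def to_fasta(name, sequence, width=60):
    """Format one record as FASTA: a header line plus wrapped sequence lines."""
    lines = [f">{name}"]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines)


# A FASTA record holds one linear string, so we have to pick X or Y.
fasta_record = to_fasta("contig1", flank_left + allele_x + flank_right)

# A graph can keep both alleles as alternative paths between the flanks.
graph = {
    "segments": {
        "left": flank_left,
        "x": allele_x,
        "y": allele_y,
        "right": flank_right,
    },
    "links": [("left", "x"), ("left", "y"), ("x", "right"), ("y", "right")],
}


def spell_paths(graph, start, end):
    """Return every sequence spelled by a path from start to end (acyclic graphs only)."""
    if start == end:
        return [graph["segments"][end]]
    sequences = []
    for source, target in graph["links"]:
        if source == start:
            for tail in spell_paths(graph, target, end):
                sequences.append(graph["segments"][start] + tail)
    return sequences


haplotypes = spell_paths(graph, "left", "right")
print(fasta_record)
print(haplotypes)
```

The FASTA record contains only the X version of the region, while the graph still spells out both versions.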

There has already been one effort to develop a new file format to best represent the variation present in an assembly, and a final specification was formalized. However, this FASTG format has seemingly not been widely adopted by the community (at least, not that I know of).

At this point, I will simply reproduce Shaun's comment from the earlier post (minor edits made to restructure some of the links and layout):

There have been three fantastic blog posts in the past three months on the topic of devising a common file format for a sequence overlap graph to enable modular assembly pipelines.

Heng Li (@lh3lh3) has proposed a Graphical Fragment Assembly (GFA) file format. An implementation will be included in the next release of ABySS. Jared Simpson (@jaredtsimpson) is working on an implementation for String Graph Assembler (SGA). I hope that other implementations will follow.

  1. Dear assemblers, we need to talk … together by Páll Melsted (@pmelsted) and Michael R. Crusoe (@biocrusoe). tl;dr we need a common file format for contig graphs and other stuff too
  2. A proposal of the Grapical Fragment Assembly format by Heng Li and…
  3. First update on GFA by Heng Li

Please add your comments to this posting with your thoughts on the GFA file format.

There are a lot of comments on the two blog posts by Heng, and I tweeted my (minor) concerns regarding how this format proposal has developed. This led to some further discussion on Twitter, some of which I have storified:

I hope that Heng takes up Shaun's suggestion to move the spec to GitHub. The FASTG proposal used a mailing list to help focus some of the discussion and I feel that something similar needs to happen to ensure that any future debate about the GFA format is productive.

101 questions with a bioinformatician #14: Shaun Jackman

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Shaun Jackman is a PhD student working on various problems relating to genome assembly at the University of British Columbia. Specifically, Shaun works under the supervision of İnanç Birol in the Bioinformatics Technology Lab at the BC Cancer Agency's Genome Sciences Centre in Vancouver. You may know him for his work in writing and directing the 1989 smash hit The Abyss, which was later developed into a popular genome assembler.

In addition to being a talented bioinformatician who has contributed to lots of useful software, he is also a very patient guy. I say this because he has been waiting for me to publish this interview for over 3 months (my sincere apologies for the delay, I will try to make this series a regular feature once again).

You can find out more about Shaun from his website or by following him on twitter (@sjackman). And now, on to the 101 questions...

 

 

001. What's something that you enjoy about current bioinformatics research?

I’m excited to see the increasing popularity of enabling reproducible research using tools such as R Markdown and IPython Notebook. After reading a paper, it should be straightforward to download the raw data, install the necessary software, reproduce the results and regenerate the figures. I’m really hoping that we get to that point.

I'm also happy to see more interaction between developers and users using revision control web sites, such as GitHub.

 

010. What's something that you *don't* enjoy about current bioinformatics research?

Most genome sequence assembly tools are structured as a pipeline: for example, count the k-mers of a set of reads, construct a de Bruijn graph of those k-mers, remove k-mers caused by sequencing errors, identify heterozygous sequences and finally assemble contigs.

It should be possible to mix and match these individual components from different assemblers to create new assembly pipelines that are hybrids of existing tools. Not only could this create a better overall assembler, but it could identify which of the individual components of the various assemblers are strongest. We should encourage people to improve on an individual component without having to reinvent an entire assembly pipeline.
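The pipeline stages listed above can be sketched as separate, swappable functions. This is only a toy Python illustration under stated assumptions, not how ABySS or any real assembler works: the reads, the function names, and the `min_count` threshold are all invented, the contig walk ignores reverse complements, and the heterozygosity step is left out entirely.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Stage 1: count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def remove_rare_kmers(counts, min_count):
    """Stage 2: drop k-mers seen fewer than min_count times (likely errors)."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


def build_de_bruijn_graph(kmers):
    """Stage 3: nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for kmer in kmers:
        graph[kmer[:-1]].add(kmer[1:])
    return graph


def assemble_contig(graph, start):
    """Stage 4: walk forward from start while there is exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:  # stop rather than loop forever on a cycle
            break
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


# Invented reads from a made-up sequence; the last read carries an error.
reads = ["ATGGCGT", "GGCGTGC", "GCGTGCA", "ATGGCG", "GTGCA", "TGGCAT"]

counts = count_kmers(reads, k=4)
solid = remove_rare_kmers(counts, min_count=2)
contig = assemble_contig(build_de_bruijn_graph(solid), start="ATG")

# Skipping the error-removal stage leaves a branch that ends the walk early.
unfiltered_contig = assemble_contig(build_de_bruijn_graph(counts), start="ATG")

print(contig)
print(unfiltered_contig)
```

Because each stage takes and returns plain data, any one of them could be replaced by a different implementation without touching the others, which is the kind of mixing and matching described above.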

 

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

Learn to use R and RStudio to visualize your data. I wasted a lot of time making ugly figures with inferior tools. Use Make to automate every analysis pipeline. No pipeline is too small or too large. A one-off analysis never is.

 

100. What's your all-time favorite piece of bioinformatics software, and why?

  • I use Make and R nearly every day.
  • I like Heng Li’s tools because they stick to the principle of doing one thing well.
  • I’m fond of ABySS, for one because it was the first bioinformatics tool that I helped to develop, but primarily because it’s designed as a pipeline of reusable modular tools that use standard (when possible) file formats, all bound together by a Makefile.

 

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

I’m an N, because it leaves all options open.