More mixed-case madness in the name of a bioinformatics tool

November 26, 2014 by Keith Bradnam

From the latest issue of Bioinformatics we have:

SUBAcon: a consensus algorithm for unifying the subcellular localization data of the Arabidopsis proteome

According to the abstract, the 'SUB' comes from subcellular, the 'A' comes from Arabidopsis, and the 'con' comes from 'consensus'. So why isn't it SUBACON? Maybe because people might then read it as 'sue bacon'?

It's not clear to me if this is meant to be pronounced 'soo-ba-con' or 'sub-ay-con'. The abstract then goes on to mention something called the ASURE portal (pronounced 'azure' or 'ay-sure'???), where ASURE = Arabidopsis SUbproteome REference. If this was following the same rules as SUBAcon, shouldn't this be called ASUre (or even ASUBre)?

How user-friendly should bioinformatics documentation be?

November 25, 2014 by Keith Bradnam

Imagine that you have never seen a SAM output file before. Now imagine that you are relatively new to bioinformatics; perhaps you are a PhD student doing a rotation in a bioinformatics lab. If you are asked to work with some SAM files, you might reasonably want to look at the SAM documentation to understand the structure of this 11-column plain text file format.

Let's consider just the second column of a SAM output file. You've been looking at the SAM file that your boss provided to you and you notice that column 2 is full of integer values, mostly 0, 4, and 16. You want to know what these mean and so you turn to the relevant section of the SAM documentation to find out more about column 2:

Column 2 — FLAG: bitwise FLAG

Each bit is explained in the following table:

Bit — Description
0x1 — template having multiple segments in sequencing
0x2 — each segment properly aligned according to the aligner
0x4 — segment unmapped
0x8 — next segment in the template unmapped
0x10 — SEQ being reverse complemented
0x20 — SEQ of the next segment in the template being reversed
0x40 — the first segment in the template
0x80 — the last segment in the template
0x100 — secondary alignment
0x200 — not passing quality controls
0x400 — PCR or optical duplicate
0x800 — supplementary alignment

For each read/contig in a SAM file, it is required that one and only one line associated with the read satisfies ‘FLAG & 0x900 == 0’. This line is called the primary line of the read.

Bit 0x100 marks the alignment not to be used in certain analyses when the tools in use are aware of this bit. It is typically used to flag alternative mappings when multiple mappings are presented in a SAM.

Bit 0x800 indicates that the corresponding alignment line is part of a chimeric alignment. A line flagged with 0x800 is called as a supplementary line.

Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, bits 0x2, 0x10, 0x100 and 0x800, and the bit 0x20 of the previous read in the template.

If 0x40 and 0x80 are both set, the read is part of a linear template, but it is neither the first nor the last read. If both 0x40 and 0x80 are unset, the index of the read in the template is unknown. This may happen for a non-linear template or the index is lost in data processing.

If 0x1 is unset, no assumptions can be made about 0x2, 0x8, 0x20, 0x40 and 0x80.

So having read all of this, my question to you is: what does a value of zero in your SAM file correspond to?

To me this is far from clear from the documentation. You first have to understand what bitwise actually means. You then need to understand that these bitwise flag values will be represented as an integer value in the SAM file (this is mentioned in passing elsewhere in the documentation).

Finally, you must deduce that a value of zero in your SAM output file means that no bitwise flags have been set. So, if the 3rd 'segment unmapped' bit isn't set, then that means that your segment (i.e. sequence) was mapped. Likewise, the lack of a 5th bit (reverse complemented) means that your sequence match must be on the forward strand.

Phew. I find this to be frustratingly opaque and in desperate need of some examples, particularly because zero values in a SAM output file are among the most common things that a user will see. The above table could also benefit from including equivalent integer values, to make it clearer that 0x10 = 16, 0x20 = 32 etc.
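To make the point concrete, here is a minimal Python sketch of the sort of worked example that would help. The bit values and descriptions are copied from the table quoted above; the names `SAM_FLAG_BITS` and `decode_flag` are made up for this illustration and are not part of any SAM tool or library.

```python
# Bit values and descriptions copied from the FLAG table in the SAM documentation.
# SAM_FLAG_BITS and decode_flag are hypothetical names used only for illustration.
SAM_FLAG_BITS = [
    (0x1, "template having multiple segments in sequencing"),
    (0x2, "each segment properly aligned according to the aligner"),
    (0x4, "segment unmapped"),
    (0x8, "next segment in the template unmapped"),
    (0x10, "SEQ being reverse complemented"),
    (0x20, "SEQ of the next segment in the template being reversed"),
    (0x40, "the first segment in the template"),
    (0x80, "the last segment in the template"),
    (0x100, "secondary alignment"),
    (0x200, "not passing quality controls"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """Return (bit, description) pairs for every bit that is set in flag."""
    return [(bit, desc) for bit, desc in SAM_FLAG_BITS if flag & bit]


# Show each bit next to its plain integer equivalent (0x10 is 16, and so on).
for bit, desc in SAM_FLAG_BITS:
    print(f"{hex(bit):>6} = {bit:>4}  {desc}")

# Decode the values that turn up most often in column 2.
for value in (0, 4, 16):
    set_bits = [f"{hex(bit)}: {desc}" for bit, desc in decode_flag(value)]
    print(value, "->", set_bits or "no bits set")
```

A FLAG of 0 decodes to an empty list: no bits are set, so the read is mapped (0x4 is unset) and matches the forward strand (0x10 is unset).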

I've raised a GitHub issue regarding these points. The larger issue here is that I think software developers sometimes assume too much about the skill set of their end users and fail to write their documentation in terms that mere mortals will understand.

Is this acceptable behavior for a bioinformatics program developed in the year 2014?

November 23, 2014 by Keith Bradnam

Last week I installed a relatively new read aligner with the humorous name of ARYANA:

ARYANA: Aligning Reads by Yet Another Approach

The journal article describing the tool was published on September 10th 2014, and the associated code repository on GitHub first appeared earlier in the same year. So we're not talking about an old program here.

If I have time I'm planning to investigate the use of ARYANA alongside other established mapping tools like BWA and Bowtie 2. Installing ARYANA was straightforward, so then I proceeded to try the first thing that I attempt with all new bioinformatics software (and most Unix command-line software):

Run the program without any parameters to see what happens

I don't think I'm alone in this approach. In the absence of any necessary command-line options, a good Unix program will return helpful information about how it should be used. At the very least it might prompt you with the minimal use scenario and/or point out how you can find out more information by invoking the help mode. So here is what happened with ARYANA:

% aryana
Need more inputs

Not very helpful. So I tried the next obvious thing and checked whether there is a help mode:

% aryana -h
Need more inputs

% aryana --help
Need more inputs

Hmm. This is really not helpful. Out of curiosity, I tried to see if ARYANA would tell me what version it is (a fairly common behavior for a lot of command-line software):

% aryana -v
Need more inputs

% aryana --version
Need more inputs

At this point I sighed. Not figuratively. I literally sighed, because this type of feedback from a program, especially a bioinformatics program developed in the year 2014, is maddening. I tweeted about this issue and, judging by the feedback, I am not alone in my views on this.

It may have been less frustrating to return no output at all rather than return just those three words. I feel like the program is taunting me. It may as well have returned any of the following output:

% aryana
Not gonna work

% aryana
No can do

% aryana
Please go away

I could use this blog post to tell you about some of the basic requirements of a bioinformatics command-line program, but I don't need to do this because others have already done so. Specifically, people should look at this great paper by Torsten Seemann (@torstenseemann), published in GigaScience last year:

Ten recommendations for creating usable bioinformatics command line software

This is a fantastic set of recommendations, and coincidentally the first three things on the list relate to the first three things that I tried doing when running the ARYANA program:

Print something if no parameters are supplied
Always have a “-h” or “--help” switch
Have a “-v” or “--version” switch

This is good advice for developers of bioinformatics software, but equally it is good advice for reviewers of bioinformatics software. If I were a reviewer of the ARYANA paper, I would have made comments regarding the lack of useful output from the program.
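For what it's worth, the three recommendations listed above take very little code to satisfy. Here is a minimal sketch using Python's standard `argparse` module. The tool name `mytool`, its version string, and its two positional arguments are all hypothetical; this has nothing to do with how ARYANA itself is written.

```python
import argparse
import sys

VERSION = "0.1.0"  # hypothetical version string, for illustration only


def build_parser():
    # argparse adds the -h/--help switch automatically.
    parser = argparse.ArgumentParser(
        prog="mytool",  # hypothetical tool name
        description="Hypothetical read aligner, used only as an illustration.",
    )
    # A -v/--version switch that prints the version and exits.
    parser.add_argument(
        "-v", "--version", action="version", version="%(prog)s " + VERSION
    )
    parser.add_argument("reference", help="reference sequence file (FASTA)")
    parser.add_argument("reads", help="file of reads to align (FASTQ)")
    return parser


def main(argv=None):
    argv = sys.argv[1:] if argv is None else argv
    parser = build_parser()
    if not argv:
        # No parameters supplied: print the full usage message, not three words.
        parser.print_help(sys.stderr)
        return 2
    args = parser.parse_args(argv)
    print(f"Would align {args.reads} against {args.reference}")
    return 0

# In a real script, the last line would be: sys.exit(main())
```

With this in place, running the tool with no arguments prints the usage text and the list of options, `-h` or `--help` prints the same text and exits cleanly, and `-v` or `--version` prints the program name and version.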

101 questions with a bioinformatician #18: Richard Emes

November 21, 2014 by Keith Bradnam

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

Richard Emes is an Associate Professor and Reader in Bioinformatics at The University of Nottingham (where they let in lots of riffraff). He is also the Director of the University's shiny, new Advanced Data Analysis Centre (ADAC).

His research interests include the comparative genomics and epigenomics of (mostly) animal species to understand health and disease, and in his role as Director of ADAC, he is forging collaborations that help others with their informatics needs across the university and further afield. Most importantly, he and his team know how to come up with a decidedly non-bogus acronym for a piece of bioinformatics software.

You can find out more about Richard by visiting his lab's website/blog, or by following him on Twitter (@rdemes). And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

I love the variation of ideas. I could never have followed a career of working on a single gene, protein, or disorder. Bioinformatics lets you think in a slightly less reductionist way. Letting the data drive discovery can be exciting and rewarding.

010. What's something that you *don't* enjoy about current bioinformatics research?

Seeing junior researchers working really hard to clean and analyze a complex dataset to allow visualization that provokes insight, then getting little recognition because, “they made a figure”. Recognition of author contribution is changing, but slowly.

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

I would say get a deep understanding of statistics and start learning helpful one-liners. The fact that sed -i 's/old/new/g' filename edits a file without you having to open it is mind blowing when you first come to the command-line.

100. What's your all-time favorite piece of bioinformatics software, and why?

My first full project in bioinformatics was looking for gene family expansions as part of the Mouse Genome project. All the alignments and editing were done in SeaView and this is still my go-to editor.

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

Arginine. I was brought up in the West Country of England and my accent becomes more pronounced when presenting. Arginine makes me sound most like a Pirate when I pronounce it “Arrrrrjenine” (KB: 15 years' experience as a bioinformatician and Richard doesn't seem to have learnt the difference between nucleotides and amino acids ;-) I will note his answer as an 'R').