Should reviewers of bioinformatics software insist that some form of documentation is always included alongside the code?

Yesterday I gave out some JABBA awards and one recipient was a tool called HEALER. I found it disappointing that the webpage that hosts the HEALER software contains nothing but the raw C++ files (I also found it strange that none of the filenames contain the word 'HEALER'). This is what you would see if you go to the download page:

Today, Mick Watson alerted me to a piece of software called ScaffoldScaffolder. It's a somewhat unusual name, but I guess it at least avoids any ambiguity about what it does. Out of curiosity I went to the website to look at the software and this is what I found:

Ah, but maybe there is some documentation inside that tar.gz file? Nope.

At the very least, I think it is good practice to include a README file alongside any software. Developers should remember that some people will end up on these software pages, not from reading the paper, but by following a link somewhere else. The landing page for your software should make the following things clear:

  1. What is this software for?
  2. Who made it?
  3. How do I install it or get it running?
  4. What license is the software distributed under?
  5. What is the version of this software?

The last item can be important for enabling reproducible science. Give your software a version number (ScaffoldScaffolder did include one in its file name) or, at the very least, include a release date. Ideally, the landing page for your software should contain even more information:

  1. Where to go for more help, e.g. a supplied PDF/text file, link to online documentation, or instructions about activating help from within the software
  2. Contact email address(es)
  3. Change log

This is something that I feel reviewers of software-based manuscripts need to be thinking about. In turn, this means that it is something that the relevant journals may wish to start including in the guidelines for their reviewers.

How user-friendly should bioinformatics documentation be?

Imagine that you have never seen a SAM output file before. Now imagine that you are relatively new to bioinformatics, perhaps a PhD student doing a rotation in a bioinformatics lab. If you are asked to work with some SAM files, you might reasonably want to look at the SAM documentation to understand the structure of this 11-column plain text file format.

Let's consider just the second column of a SAM output file. You've been looking at the SAM file that your boss provided to you and you notice that column 2 is full of integer values, mostly 0, 4, and 16. You want to know what these mean and so you turn to the relevant section of the SAM documentation to find out more about column 2:

Column 2 — FLAG: bitwise FLAG

Each bit is explained in the following table:

Bit — Description
0x1 — template having multiple segments in sequencing
0x2 — each segment properly aligned according to the aligner
0x4 — segment unmapped
0x8 — next segment in the template unmapped
0x10 — SEQ being reverse complemented
0x20 — SEQ of the next segment in the template being reversed
0x40 — the first segment in the template
0x80 — the last segment in the template
0x100 — secondary alignment
0x200 — not passing quality controls
0x400 — PCR or optical duplicate
0x800 — supplementary alignment

  • For each read/contig in a SAM file, it is required that one and only one line associated with the read satisfies ‘FLAG & 0x900 == 0’. This line is called the primary line of the read.
  • Bit 0x100 marks the alignment not to be used in certain analyses when the tools in use are aware of this bit. It is typically used to flag alternative mappings when multiple mappings are presented in a SAM.
  • Bit 0x800 indicates that the corresponding alignment line is part of a chimeric alignment. A line flagged with 0x800 is called as a supplementary line.
  • Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, bits 0x2, 0x10, 0x100 and 0x800, and the bit 0x20 of the previous read in the template.
  • If 0x40 and 0x80 are both set, the read is part of a linear template, but it is neither the first nor the last read. If both 0x40 and 0x80 are unset, the index of the read in the template is unknown. This may happen for a non-linear template or the index is lost in data processing.
  • If 0x1 is unset, no assumptions can be made about 0x2, 0x8, 0x20, 0x40 and 0x80.

So having read all of this, my question to you is: what does a value of zero in your SAM file correspond to?

To me this is far from clear from the documentation. You first have to understand what bitwise actually means. You then need to understand that these bitwise flag values will be represented as an integer value in the SAM file (this is mentioned in passing elsewhere in the documentation).

Finally, you must deduce that a value of zero in your SAM output file means that no bitwise flags have been set. So, if the 3rd 'segment unmapped' bit isn't set, then that means that your segment (i.e. sequence) was mapped. Likewise, the lack of a 5th bit (reverse complemented) means that your sequence match must be on the forward strand.

Phew. I find this to be frustratingly opaque and in desperate need of some examples, particularly because zero values in a SAM output file are among the most common things that a user will see. The above table could also benefit from including equivalent integer values, to make it clearer that 0x10 = 16, 0x20 = 32 etc.
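To show the kind of worked example I have in mind, here is a minimal Python sketch that decodes a FLAG integer into the descriptions from the table above and prints each bit next to its decimal equivalent. The names SAM_FLAG_BITS and decode_flag are my own inventions for illustration; they are not part of the SAM specification or of any SAM library.

```python
# Bit values and descriptions copied from the FLAG table in the SAM documentation.
# SAM_FLAG_BITS and decode_flag are illustrative names, not an official API.
SAM_FLAG_BITS = [
    (0x1, "template having multiple segments in sequencing"),
    (0x2, "each segment properly aligned according to the aligner"),
    (0x4, "segment unmapped"),
    (0x8, "next segment in the template unmapped"),
    (0x10, "SEQ being reverse complemented"),
    (0x20, "SEQ of the next segment in the template being reversed"),
    (0x40, "the first segment in the template"),
    (0x80, "the last segment in the template"),
    (0x100, "secondary alignment"),
    (0x200, "not passing quality controls"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the description of every bit that is set in a FLAG integer."""
    return [description for bit, description in SAM_FLAG_BITS if flag & bit]


if __name__ == "__main__":
    # Each hexadecimal bit alongside its decimal equivalent.
    for bit, description in SAM_FLAG_BITS:
        print(f"{hex(bit):>6} = {bit:>4}  {description}")

    # The three values that fill column 2 in the scenario above.
    for flag in (0, 4, 16):
        print(flag, decode_flag(flag) or ["no bits set"])
```

With this sketch, a FLAG of 0 decodes to an empty list (no bits set, so the read is mapped and on the forward strand), 4 decodes to 'segment unmapped', and 16 decodes to 'SEQ being reverse complemented'.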

I've raised a GitHub issue regarding these points. The larger issue here is that I think software developers sometimes assume too much about the skill set of their end users and fail to write their documentation in terms that mere mortals will understand.

Is this acceptable behavior for a bioinformatics program developed in the year 2014?

Last week I installed a relatively new read aligner with the humorous name of ARYANA:

The journal article describing the tool was published on September 10th 2014, and the associated code repository on GitHub first appeared earlier in the same year. So we're not talking about an old program here.

If I have time I'm planning to investigate the use of ARYANA alongside other established mapping tools like BWA and Bowtie 2. Installing ARYANA was straightforward, so then I proceeded to try the first thing that I attempt with all new bioinformatics software (and most Unix command-line software):

Run the program without any parameters to see what happens

I don't think I'm alone in this approach. In the absence of any necessary command-line options, a good Unix program will return helpful information about how it should be used. At the very least it might prompt you with the minimal use scenario and/or point out how you can find out more information by invoking the help mode. So here is what happened with ARYANA:

% aryana
Need more inputs

Not very helpful. So I tried the next obvious thing, let's see if there is a help mode:

% aryana -h
Need more inputs

% aryana --help
Need more inputs

Hmm. This is really not helpful. Out of curiosity, I tried to see if ARYANA would tell me what version it is (a fairly common behavior for a lot of command-line software):

% aryana -v
Need more inputs

% aryana --version
Need more inputs

At this point I sighed. Not figuratively. I literally sighed, because this type of feedback from a program, especially a bioinformatics program developed in the year 2014, is maddening. I tweeted about this issue and, judging by the feedback, I am not alone in my views on this.

It may have been less frustrating to return no output at all rather than return just those three words. I feel like the program is taunting me. It may as well have returned any of the following output:

% aryana
Not gonna work

% aryana
No can do

% aryana
Please go away

I could use this blog post to tell you about some of the basic requirements of a bioinformatics command-line program, but I don't need to do this because others have already done so. Specifically, people should look at this great paper by Torsten Seemann (@torstenseemann), published in GigaScience last year:

Ten recommendations for creating usable bioinformatics command line software

This is a fantastic set of recommendations, and coincidentally the first three things on the list relate to the first three things that I tried doing when running the ARYANA program:

  1. Print something if no parameters are supplied
  2. Always have a “-h” or “--help” switch
  3. Have a “-v” or “--version” switch

This is good advice for developers of bioinformatics software, but equally it is good advice for reviewers of bioinformatics software. If I were a reviewer of the ARYANA paper, I would have made comments regarding the lack of useful output from the program.
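For what it's worth, those first three recommendations take very little code to satisfy. Below is a minimal sketch using Python's standard argparse module. Everything specific in it is an assumption made up for illustration: the program name 'mytool', the version string, and the two positional arguments. It has nothing to do with how ARYANA itself is written.

```python
import argparse
import sys

VERSION = "0.1.0"  # hypothetical version string


def build_parser():
    """Build a parser for a hypothetical read aligner called 'mytool'."""
    parser = argparse.ArgumentParser(
        prog="mytool",  # hypothetical program name
        description="Hypothetical tool that aligns reads to a reference.",
    )
    # Recommendation 2: argparse adds -h/--help automatically.
    # Recommendation 3: a -v/--version switch that prints the version and exits.
    parser.add_argument(
        "-v", "--version", action="version", version=f"%(prog)s {VERSION}"
    )
    parser.add_argument("reference", help="reference FASTA file")
    parser.add_argument("reads", help="reads FASTQ file")
    return parser


def main(argv=None):
    argv = sys.argv[1:] if argv is None else argv
    parser = build_parser()
    # Recommendation 1: print something useful if no parameters are supplied.
    if not argv:
        parser.print_help(sys.stderr)
        return 1
    args = parser.parse_args(argv)
    # Placeholder for the real work of the program.
    print(f"Would align {args.reads} against {args.reference}")
    return 0

# In a real script, finish with:
#     if __name__ == "__main__":
#         sys.exit(main())
```

Run with no arguments, this prints the full help text rather than a bare error message; -h and --help do the same, and -v or --version report the version string.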