More thoughts on the documentation for the SAM file format

I recently wrote about how bioinformatics documentation is not always very user-friendly. I made some references to the opaque nature of the official SAM Format Specification (specifically the lack of clarity in explaining the nature of what is in column 2 of a SAM file). After writing this post, I raised an issue on the relevant SAMtools GitHub repository:

I feel that many people have trouble understanding what is meant by bitwise FLAG values. The documentation is very technical and not very transparent to people who may be new to bioinformatics.

Many people might be turning to the documentation after looking at their SAM output file. Maybe they see that their output file has a range of integer values in column 2 and are puzzled by the explanation in the documentation (this is very likely if you have no familiarity with bit patterns).

I think this section would be greatly helped by the following:

  1. A reminder that the SAM file itself stores an integer value
  2. An explicit description that a bitwise value of zero means that your read has mapped to the forward strand of the reference
  3. Some specific examples that explain what various integer values correspond to.
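As an illustration of the third point, here is a minimal Python sketch of how an integer from column 2 breaks down into individual flags. It is not taken from the issue or from the specification itself: the bit meanings follow the FLAG table in the SAM Format Specification, while the `FLAG_BITS` table, the wording of each description, and the `decode_flag` function name are hypothetical choices made for this example.

```python
# Bit values follow the FLAG table in the SAM Format Specification.
# The description strings and the function name are illustrative only.
FLAG_BITS = [
    (0x1, "read is paired"),
    (0x2, "read is mapped in a proper pair"),
    (0x4, "read is unmapped"),
    (0x8, "mate is unmapped"),
    (0x10, "read is on the reverse strand"),
    (0x20, "mate is on the reverse strand"),
    (0x40, "read is first in pair"),
    (0x80, "read is second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "read is a PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """Return a description for every bit that is set in an integer FLAG."""
    return [description for bit, description in FLAG_BITS if flag & bit]


# A few integer values that commonly appear in column 2 of a SAM file.
for value in (0, 16, 99, 147):
    print(value, decode_flag(value))
```

A FLAG of 0 returns an empty list: no bits are set, so the read is unpaired, mapped, and on the forward strand. A FLAG of 16 sets only the reverse-strand bit, and 99 (1 + 2 + 32 + 64) describes the first read of a properly paired alignment whose mate is on the reverse strand.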

Most of the responses to this were of the form "well, people can go elsewhere if they want to find better documentation that explains this". E.g.

"There is space on SEQwiki for user created format information."

I accept that the official specification for a file format is not necessarily the same thing as a user guide, but people presumably arrive at the official SAM documentation when searching for help with this kind of thing. E.g. if I search for "sam format documentation" or "sam file format", the top hit is the aforementioned SAM Format Specification.

So the advice seems to be that you don't need to bother making your bioinformatics documentation easy to understand because someone else might come along and do this for you.

Kablammo: an interactive, web-based BLAST results visualizer

Another great name for a piece of bioinformatics software! This tool has just been published in the journal Bioinformatics by Jeff Wintersinger and James Wasmuth. From the abstract:

Kablammo is a web-based application that produces interactive, vector-based visualizations of sequence alignments generated by BLAST. These visualizations can illustrate many features, including shared protein domains, chromosome structural modifications, and genome misassembly.

101 questions with a bioinformatician #20: Roy Chaudhuri

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Roy Chaudhuri is a Lecturer in Bioinformatics in the Department of Molecular Biology and Biotechnology at the University of Sheffield, and is part of the Sheffield Bioinformatics Hub. Roy's expertise concerns the comparative genomics and phylogenetics of bacterial pathogens and in a previous life he helped set up the coliBASE and xBASE databases. In a previous-previous life he was also a pioneering website designer (I shouldn't judge: people in glass houses… and all that).

He claims that his current duties involve "research, teaching, publishing, and trying to convince people to give me money". If you would like to give Roy money (perhaps a £1 donation towards his Eccles Cake fund?), you can get in contact with him via the Sheffield Bioinformatics Hub website. You can also find out more about Roy by following him on Twitter (@RoyChaudhuri)... but be warned, he is a non-stop tweeter! And now, on to the 101 questions...



001. What's something that you enjoy about current bioinformatics research?

I like that after 16 years as a bioinformatician, I'm still learning new things every day, and that there's no shortage of cool datasets and interesting problems to keep me busy. I also like how far it's possible to get by knowing a little bit of biology and a little Perl.

010. What's something that you don't enjoy about current bioinformatics research?

I worry that too much community effort has been devoted to dealing with problems that are specific to short-read data. I'd like to think that in five years' time sequencing will just work, and we will be able to devote our time to dealing with biological quirks rather than technical ones. I'm pretty sure I said the same thing five years ago, though.



011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Most of my advice wouldn't be work-related, but I'd certainly mention that the clock starts ticking on potential fellowship opportunities as soon as you get your PhD. I definitely missed the starting gun on that one.



100. What's your all-time favorite piece of bioinformatics software, and why?

I'll go for Prokka, because it does an astonishingly good job at annotating bacterial genomes (better than many manual attempts...), and because Torsten wrote the book (well, blog post) on creating usable command-line bioinformatics tools. I particularly like that it checks for its dependencies at the start, rather than choking half-way through, and that it sometimes finishes with a quote from the Hitchhiker's Guide to the Galaxy.

Other than that, I'm a big fan of MUMmer, and I'm always impressed by how many different things it's possible to achieve by stringing two or three SAMtools commands together. If non-bioinformatics-specific software counts, then I'd also mention GNU Parallel, Perl and UNIX itself.



101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

M, because it's Not K.

Searching for sausage rolls: using Google Scholar to look at the popularity of British culinary delights

Sometimes it can be fun to search Google Scholar for words or phrases that you might not expect to ever appear in the title of an academic article. So last night, I conducted an important scientific study and looked at the popularity of various quintessential items of British cuisine:

Updated: 2014-12-10: includes addition of 'Spotted Dick' thanks to reader @MattBashton.