101 questions with a bioinformatician #19: Valerie Schneider

December 04, 2014 by Keith Bradnam

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

Valerie Schneider is a Staff Scientist at the NCBI in charge of the teams that provide bioinformatics support for the Genome Reference Consortium (GRC). Her teams develop web resources and databases for the analysis and visualization of genomic data, including Map Viewer, various genome browsers and the NCBI Genome Remapping Service.

She tells me that the work of her group focuses on "providing tools that enable researchers to take advantage of the wealth of genomic data available in public databases". Valerie also wanted to mention the following:

The Genome Reference Consortium always appreciates feedback on the human, mouse and zebrafish reference assemblies. If you think the genome looks wrong, or have questions about it, please let us know.

And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

I’m excited by the current push for the sharing of big data. Projects like the Global Alliance for Genomics and Health are bridging the gap between bioinformatics and clinical research. At the NCBI, we’re actively developing new resources that will help researchers navigate and use all this data effectively. I’m also really excited by the effect that longer read sequencing technologies are having on software development. For many years, we’ve been attacking questions and developing software tools based on short read data. Long reads make it possible to investigate genomic regions, such as segmental duplications and other repetitive regions, that were previously largely inaccessible, and should also result in more contiguous assemblies. I’m looking forward to seeing what will likely be a variety of new tools to manipulate this data.

010. What's something that you don't enjoy about current bioinformatics research?

The ability to reproduce somebody else’s results is integral to good science. Unfortunately, this is often a challenge in current bioinformatics research. This is not because the science is unreliable or the results wrong. On the contrary, it can be hard because software and datasets aren’t always in public databases that are maintained for the long term, software versions change (and may not be explicitly noted in publications) or software uses non-standard file formats or works only as part of a specific tool chain. As someone who works for an informatics repository, I’m well aware that for most researchers, data management isn’t as exciting as the data. But if it’s not well-managed, science suffers in the long term because we can’t easily reanalyze the data in light of new findings. As bioinformatics-based analyses work their way into journals that haven’t traditionally dealt with big data, this becomes more and more of a challenge. Journals are looking at ways to get it right, and archive resources are storing data and tools in more forms than ever, but researchers must also make this a priority when submitting and reviewing publications.

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

Don’t over-specialize in any one area of biology (and take more programming classes!). Bioinformatics can lead you all over the genome. A solid grounding in genetics, population biology, structural biology and evo/devo is not only helpful in defining your own interests, but will also prepare you to follow the data, whatever it reveals.

100. What's your all-time favorite piece of bioinformatics software, and why?

I’m going to put in a shameless plug for the NCBI Coordinate Remapping service. Between last year’s release of an updated human reference assembly and the growth in the number of genome assemblies, it’s critical that researchers be able to translate their data between coordinate systems. The Remapping service is based upon assembly-assembly alignments (which are also available); the remapping is done as a base-by-base comparison. It has a notion of first and second pass alignments that can be useful for identifying duplicate sequences. It also stands out because not only does it let you map between chromosomes, you can map between chromosomes and alt loci in GRC

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

C (cytosine). I appreciate its propensity for change, in its ability to spontaneously convert to Uracil, and I like the way it can take alternate methylated forms. The former reflects the way I work, tackling a wide variety of projects every day and switching between biological and technical discussions. The latter reflects the fact that I wear multiple professional hats: scientist and scuba instructor.

Buy one bogus bioinformatics acronym, get one free!

December 04, 2014 by Keith Bradnam

New in the journal Bioinformatics:

Amplicon identification using SparsE representation of multiplex PYROsequencing signal (AdvISER-M-PYRO): application to bacterial resistance genotyping

So the software is called AdvISER-M-PYRO and this is (presumably) derived from Amplicon identification using SparsE representation of multiplex PYROsequencing. However, I don't understand why 'S', 'E', and 'PYRO' get capitalized but not the 'I' of 'identification' or the 'M' of 'multiplex'? Also, I like how they sneak in two letters ('d' and 'v') that don't occur in the name of the software (at least not before the 'I' of 'identification').

This paper is a 2-for-1 type of deal, because if you read a bit more of the abstract you will see:

In parallel, the nucleotide dispensation order was improved by developing the SENATOR (‘SElecting the Nucleotide dispensATion Order’) algorithm.

This second bogus acronym also lacks some clarity. Is the 'R' of 'SENATOR' derived from the first or second 'r' of the word 'Order'???

Time for a new JABBA award for Just Another Bogus Bioinformatics Acronym

December 02, 2014 by Keith Bradnam

From the journal Bioinformatics, we have:

PEPPER: cytoscape app for protein complex expansion using protein–protein interaction network

The bogus nature of this acronym is quickly revealed from the very first line of the abstract:

We introduce PEPPER (Protein complex Expansion using Protein–Protein intERactions)

Winning a JABBA award is one thing, but you get bonus points if you decide to use the same name as a completely different bioinformatics tool (something that is, sadly, becoming more common). So if you run a Google search for pepper bioinformatics, you may also come across a molecular visualizer called PeppeR.

Turkey bioinformatics by the numbers

November 27, 2014 by Keith Bradnam

109,700,700 - mean divergence time (in years) between turkeys (Meleagris gallopavo) and turkey vultures (Cathartes aura)
101,849 - number of turkey nucleotide sequences in GenBank
3,597 - number of selenoproteins in UniProt which allow for the possibility of generating a peptide called ‘TURKEY’ (the IUPAC code for selenocysteine is U)[1]
1 - number of published turkey genomes
0 - number of bioinformatics tools that have tried using ‘turkey’ as part of a bogus acronym in their name[2]

[1] I can’t find any such peptide when using the UniProt BLAST server though :-(
[2] And let's try to keep it that way!

More mixed-case madness in the name of a bioinformatics tool

November 26, 2014 by Keith Bradnam

From the latest issue of Bioinformatics we have:

SUBAcon: a consensus algorithm for unifying the subcellular localization data of the Arabidopsis proteome

According to the abstract, the 'SUB' comes from subcellular, the 'A' comes from Arabidopsis, and the 'con' comes from 'consensus'. So why isn't it SUBACON? Maybe because people might then read it as 'sue bacon'?

It's not clear to me if this is meant to be pronounced 'soo-ba-con' or 'sub-ay-con'. The abstract then goes on to mention something called the ASURE portal (pronounced 'azure' or 'ay-sure'???), where ASURE = Arabidopsis SUbproteome REference. If this was following the same rules as SUBAcon, shouldn't this be called ASUre (or even ASUBre)?

How user-friendly should bioinformatics documentation be?

November 25, 2014 by Keith Bradnam

Imagine that you have never seen a SAM output file before. Now imagine that you are relatively new to bioinformatics, perhaps you are a PhD student doing a rotation in a bioinformatics lab. If you are asked to work with some SAM files, you might reasonably want to look at the SAM documentation to understand the structure of this 11-column plain text file format.

Let's consider just the second column of a SAM output file. You've been looking at the SAM file that your boss provided to you and you notice that column 2 is full of integer values, mostly 0, 4, and 16. You want to know what these mean and so you turn to the relevant section of the SAM documentation to find out more about column 2:

Column 2 — FLAG: bitwise FLAG

Each bit is explained in the following table:

Bit — Description
0x1 — template having multiple segments in sequencing
0x2 — each segment properly aligned according to the aligner
0x4 — segment unmapped
0x8 — next segment in the template unmapped
0x10 — SEQ being reverse complemented
0x20 — SEQ of the next segment in the template being reversed
0x40 — the first segment in the template
0x80 — the last segment in the template
0x100 — secondary alignment
0x200 — not passing quality controls
0x400 — PCR or optical duplicate
0x800 — supplementary alignment

For each read/contig in a SAM file, it is required that one and only one line associated with the read satisfies ‘FLAG & 0x900 == 0’. This line is called the primary line of the read.

Bit 0x100 marks the alignment not to be used in certain analyses when the tools in use are aware of this bit. It is typically used to flag alternative mappings when multiple mappings are presented in a SAM.

Bit 0x800 indicates that the corresponding alignment line is part of a chimeric alignment. A line flagged with 0x800 is called as a supplementary line.

Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, bits 0x2, 0x10, 0x100 and 0x800, and the bit 0x20 of the previous read in the template.

If 0x40 and 0x80 are both set, the read is part of a linear template, but it is neither the first nor the last read. If both 0x40 and 0x80 are unset, the index of the read in the template is unknown. This may happen for a non-linear template or the index is lost in data processing.

If 0x1 is unset, no assumptions can be made about 0x2, 0x8, 0x20, 0x40 and 0x80.

So having read all of this, my question to you is: what does a value of zero in your SAM file correspond to?

To me this is far from clear from the documentation. You first have to understand what bitwise actually means. You then need to understand that these bitwise flag values will be represented as an integer value in the SAM file (this is mentioned in passing elsewhere in the documentation).

Finally, you must deduce that a value of zero in your SAM output file means that no bitwise flags have been set. So, if the 3rd 'segment unmapped' bit isn't set, then that means that your segment (i.e. sequence) was mapped. Likewise, the lack of a 5th bit (reverse complemented) means that your sequence match must be on the forward strand.

Phew. I find this to be frustratingly opaque and in desperate need of some examples. Particularly because zero values in a SAM output file are among the most common things that a user will see. The above table could also benefit from including equivalent integer values, to make it clearer that 0x10 = 16, 0x20 = 32 etc.
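To make the bitwise idea concrete, here is a minimal Python sketch of how an integer from column 2 can be decoded against the table quoted above. The bit values and descriptions are copied from that table; the names `SAM_FLAG_BITS`, `decode_flag` and `is_primary` are made up for this illustration and are not part of the SAM specification or of any SAM library.

```python
# Bit values and descriptions copied from the FLAG table quoted above.
# The names used here are hypothetical, chosen only for this illustration.
SAM_FLAG_BITS = [
    (0x1, "template having multiple segments in sequencing"),
    (0x2, "each segment properly aligned according to the aligner"),
    (0x4, "segment unmapped"),
    (0x8, "next segment in the template unmapped"),
    (0x10, "SEQ being reverse complemented"),
    (0x20, "SEQ of the next segment in the template being reversed"),
    (0x40, "the first segment in the template"),
    (0x80, "the last segment in the template"),
    (0x100, "secondary alignment"),
    (0x200, "not passing quality controls"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the description of every bit that is set in an integer FLAG."""
    return [description for bit, description in SAM_FLAG_BITS if flag & bit]


def is_primary(flag):
    """True when neither 0x100 nor 0x800 is set (the 'FLAG & 0x900 == 0' rule).

    The parentheses matter: in Python, == binds more tightly than &.
    """
    return (flag & 0x900) == 0


if __name__ == "__main__":
    # Print the table again with the decimal value next to each hex bit.
    for bit, description in SAM_FLAG_BITS:
        print(f"{hex(bit):>5} = {bit:>4}  {description}")

    # The three values that dominate column 2 in the example above.
    for flag in (0, 4, 16):
        print(flag, decode_flag(flag) or "no bits set")
```

Under this sketch a FLAG of 0 decodes to an empty list (no bits set, so the sequence was mapped and is on the forward strand), 4 decodes to 'segment unmapped', and 16 decodes to 'SEQ being reverse complemented'.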

I've raised a GitHub issue regarding these points. The larger issue here is that I think software developers sometimes assume too much about the skill set of their end users and fail to write their documentation in terms that mere mortals will understand.

Is this acceptable behavior for a bioinformatics program developed in the year 2014?

November 23, 2014 by Keith Bradnam

Last week I installed a relatively new read aligner with the humorous name of ARYANA:

ARYANA: Aligning Reads by Yet Another Approach

The journal article describing the tool was published on September 10th 2014, and the associated code repository on GitHub first appeared earlier in the same year. So we're not talking about an old program here.

If I have time I'm planning to investigate the use of ARYANA alongside other established mapping tools like BWA and Bowtie 2. Installing ARYANA was straightforward, so then I proceeded to try the first thing that I attempt with all new bioinformatics software (and most Unix command-line software):

Run the program without any parameters to see what happens

I don't think I'm alone in this approach. In the absence of any necessary command-line options, a good Unix program will return helpful information about how it should be used. At the very least it might prompt you with the minimal use scenario and/or point out how you can find out more information by invoking the help mode. So here is what happened with ARYANA:

% aryana
Need more inputs

Not very helpful. So I tried the next obvious thing, let's see if there is a help mode:

% aryana -h
Need more inputs

% aryana --help
Need more inputs

Hmm. This is really not helpful. Out of curiosity, I tried to see if ARYANA would tell me what version it is (a fairly common behavior for a lot of command-line software):

% aryana -v
Need more inputs

% aryana --version
Need more inputs

At this point I sighed. Not figuratively. I literally sighed, because this type of feedback from a program — especially a bioinformatics program developed in the year 2014 — is maddening. I tweeted about this issue and judging by the feedback, I am not alone with my views on this.

It may have been less frustrating to return no output at all rather than return just those three words. I feel like the program is taunting me. It may as well have returned any of the following output:

% aryana
Not gonna work

% aryana
No can do

% aryana
Please go away

I could use this blog post to tell you about some of the basic requirements of a bioinformatics command-line program, but I don't need to do this because others have already done so. Specifically, people should look at this great paper by Torsten Seemann (@torstenseemann), published in GigaScience last year:

Ten recommendations for creating usable bioinformatics command line software

This is a fantastic set of recommendations, and coincidentally the first three things on the list relate to the first three things that I tried doing when running the ARYANA program:

Print something if no parameters are supplied
Always have a “-h” or “--help” switch
Have a “-v” or “--version” switch

This is good advice for developers of bioinformatics software, but equally it is good advice for reviewers of bioinformatics software. If I were a reviewer of the ARYANA paper, I would have made comments regarding the lack of useful output from the program.
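For anyone wondering how much work those first three recommendations involve, here is a minimal sketch using Python's standard argparse module. It is not ARYANA's code and says nothing about how ARYANA is implemented; the tool name `mytool`, the version string and the two positional arguments are all hypothetical placeholders.

```python
import argparse
import sys

VERSION = "0.1.0"  # hypothetical version string for this illustration


def build_parser():
    """Build a parser for an imaginary aligner called 'mytool'."""
    parser = argparse.ArgumentParser(
        prog="mytool",  # hypothetical tool name
        description="Align reads to a reference (illustrative placeholder).",
    )
    # argparse adds -h/--help automatically; --version needs one line.
    parser.add_argument(
        "-v", "--version", action="version", version=f"%(prog)s {VERSION}"
    )
    parser.add_argument("reference", help="reference sequence file")
    parser.add_argument("reads", help="file of reads to align")
    return parser


def main(argv=None):
    """Return an exit status: 0 on success, 1 if nothing was supplied."""
    argv = sys.argv[1:] if argv is None else argv
    parser = build_parser()
    if not argv:
        # No parameters at all: show the full help rather than a terse error.
        parser.print_help(sys.stderr)
        return 1
    args = parser.parse_args(argv)
    print(f"Would align {args.reads} against {args.reference}")
    return 0


# As a script, finish with:
#     if __name__ == "__main__":
#         sys.exit(main())
```

With this in place, running the tool with no parameters prints the usage message and option list, `-h` and `--help` do the same, and `-v` or `--version` prints `mytool 0.1.0`.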

101 questions with a bioinformatician #18: Richard Emes

November 21, 2014 by Keith Bradnam

Richard Emes is an Associate Professor and Reader in Bioinformatics at The University of Nottingham (where they let in lots of riffraff). He is also the Director of the University's shiny, new Advanced Data Analysis Centre (ADAC).

His research interests include the comparative genomics and epigenomics of (mostly) animal species to understand health and disease, and in his role as Director of ADAC, he is forging collaborations that help others with their informatics needs across the university and further afield. Most importantly, he and his team know how to come up with a decidedly non-bogus acronym for a piece of bioinformatics software.

You can find out more about Richard by visiting his lab's website/blog, or by following him on Twitter (@rdemes). And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

I love the variation of ideas. I could never have followed a career of working on a single gene, protein, or disorder. Bioinformatics lets you think in a slightly less reductionist way. Letting the data drive discovery can be exciting and rewarding.

010. What's something that you *don't* enjoy about current bioinformatics research?

Seeing junior researchers working really hard to clean and analyze a complex dataset to allow visualization that provokes insight, then getting little recognition because “they made a figure”. Recognition of author contribution is changing, but slowly.

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

I would say get a deep understanding of statistics and start learning helpful one-liners. The fact that sed -i 's/old/new/g' filename edits a file without you having to open it is mind blowing when you first come to the command-line.

100. What's your all-time favorite piece of bioinformatics software, and why?

My first full project in bioinformatics was looking for gene family expansions as part of the Mouse Genome project. All the alignments and editing were done in SeaView and this is still my go-to editor.

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

Arginine. I was brought up in the West Country of England and my accent becomes more pronounced when presenting. Arginine makes me sound most like a pirate when I pronounce it “Arrrrrjenine” (KB: 15 years' experience as a bioinformatician and Richard doesn't seem to have learnt the difference between nucleotides and amino acids ;-) I will note his answer as an 'R').

THe popUlARiTY of VARioUS iUpAC NUCleoTiDe AMBiGUiTY CHARACTeRS

November 18, 2014 by Keith Bradnam

There have now been 18 interviews in my series of 101 questions with a bioinformatician. The final question in each interview is always:

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

So after 18 interviews we have the statistical possibility of equal representation by all possible nucleotide ambiguity codes. Let's take a look at what the results actually look like:

So N and Y are the most popular choices so far but no love for A, C, G, U, K, M, D, or H! What's so bad about the letter K? I always thought of K as a distinguished member of the IUPAC ambiguity code community!

If you are sharp-eyed you may notice that there are actually 19 responses shown here. That's because a certain someone claimed two characters in their answer. I'm sure that you will all be glued to the next 18 interviews to see if, and how, these frequencies change. And I will be Keen in my undertaKing to maKe sure that I Keep this blog free from any subtle bias that may influence folK.

10 tips for improving your presentations & speeches →

November 18, 2014 by Keith Bradnam

Some fantastic advice here from the Presentation Zen site (which is always worth looking at). Many scientific presentations would be greatly enlivened if presenters made more of an effort to turn a collection of facts and observations into a story. Tip #4 is something that I frequently mention to students in our lab:

(4) Have a clear theme.
What is your key message? What is it you REALLY want people to remember? What action do you want them to take? Details are important. Data and evidence and logical flow are important. But we must not lose sight of what is really important and what is not. Often, talks take people down a path of great detail and loads of information, most of which is completely forgotten (if it was ever understood in the first place) after the talk is finished. The more details that you include and the more complex your talk, the more you must be very clear on what it is you want your audience to hear, understand, and remember. If the audience only remembers one thing, what should it be? Write it down and stick it on the wall so it's never out of your sight.

Sometimes students seem almost surprised by the notion that the audience should be expected to remember something from their talk.