101 questions with a bioinformatician #19: Valerie Schneider

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Valerie Schneider is a Staff Scientist at the NCBI in charge of the teams that provide bioinformatics support for the Genome Reference Consortium (GRC). Her teams develop web resources and databases for the analysis and visualization of genomic data, including Map Viewer, various genome browsers and the NCBI Genome Remapping Service.

She tells me that the work of her group focuses on "providing tools that enable researchers to take advantage of the wealth of genomic data available in public databases". Valerie also wanted to mention the following:

The Genome Reference Consortium always appreciates feedback on the human, mouse and zebrafish reference assemblies. If you think the genome looks wrong, or have questions about it, please let us know.

And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

I’m excited by the current push for the sharing of big data. Projects like the Global Alliance for Genomics and Health are bridging the gap between bioinformatics and clinical research. At the NCBI, we’re actively developing new resources that will help researchers navigate and use all this data effectively. I’m also really excited by the effect that longer read sequencing technologies are having on software development. For many years, we’ve been attacking questions and developing software tools based on short read data. Long reads make it possible to investigate genomic regions, such as segmental duplications and other repetitive regions that were previously largely inaccessible, and should also result in more contiguous assemblies. I’m looking forward to seeing what will likely be a variety of new tools to manipulate this data.



010. What's something that you don't enjoy about current bioinformatics research?

The ability to reproduce somebody else’s results is integral to good science. Unfortunately, this is often a challenge in current bioinformatics research. This is not because the science is unreliable or the results wrong. On the contrary, it can be hard because software and datasets aren’t always in public databases that are maintained for the long term, software versions change (and may not be explicitly noted in publications), or software uses non-standard file formats or works only as part of a specific tool chain. As someone who works for an informatics repository, I’m well aware that for most researchers, data management isn’t as exciting as the data. But if it’s not well-managed, science suffers in the long term because we can’t easily reanalyze the data in light of new findings. As bioinformatics-based analyses work their way into journals that haven’t traditionally dealt with big data, this becomes more and more of a challenge. Journals are looking at ways to get it right, and archive resources are storing data and tools in more forms than ever, but researchers must also make this a priority when submitting and reviewing publications.



011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Don’t over-specialize in any one area of biology (and take more programming classes!). Bioinformatics can lead you all over the genome. A solid grounding in genetics, population biology, structural biology and evo/devo not only helps you define your own interests, but also prepares you to follow the data, whatever it reveals.



100. What's your all-time favorite piece of bioinformatics software, and why?

I’m going to put in a shameless plug for the NCBI Coordinate Remapping service. Between last year’s release of an updated human reference assembly and the growth in the number of genome assemblies, it’s critical that researchers be able to translate their data between coordinate systems. The Remapping service is based upon assembly-assembly alignments (which are also available); the remapping is done as a base-by-base comparison. It has a notion of first and second pass alignments that can be useful for identifying duplicate sequences. It also stands out because not only does it let you map between chromosomes, you can map between chromosomes and alt loci in GRC
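To make the idea of translating a coordinate through an assembly-assembly alignment more concrete, here is a toy sketch. It is not how the NCBI Remapping service is implemented: the `Block` type, the `remap` function and every coordinate below are hypothetical, and the sketch assumes ungapped alignment blocks with 1-based, inclusive coordinates.

```python
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    """One ungapped alignment block between two assemblies (hypothetical)."""
    src_start: int  # first aligned base on the source assembly (1-based)
    src_end: int    # last aligned base on the source assembly (inclusive)
    dst_start: int  # lowest aligned coordinate on the target assembly
    strand: int     # +1 if same orientation, -1 if reverse-complemented


def remap(pos: int, blocks: List[Block]) -> Optional[int]:
    """Translate a source position to the target assembly, base by base.

    Returns None when the position is not covered by any alignment block.
    """
    for block in blocks:
        if block.src_start <= pos <= block.src_end:
            offset = pos - block.src_start
            if block.strand == 1:
                return block.dst_start + offset
            # Reverse strand: the first source base pairs with the
            # highest target coordinate of the block.
            length = block.src_end - block.src_start
            return block.dst_start + length - offset
    return None


# Made-up blocks, purely for illustration.
blocks = [
    Block(src_start=100, src_end=199, dst_start=1100, strand=1),
    Block(src_start=300, src_end=399, dst_start=5000, strand=-1),
]

print(remap(150, blocks))  # inside the forward-strand block
print(remap(300, blocks))  # inside the reverse-strand block
print(remap(250, blocks))  # falls in a gap between blocks
```

A position that falls outside every block returns `None`, which is the kind of case where a real remapping tool would have to report that a feature could not be placed.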



101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

C (cytosine). I appreciate its propensity for change, in its ability to spontaneously convert to uracil, and I like the way it can take alternate methylated forms. The former reflects the way I work, tackling a wide variety of projects every day and switching between biological and technical discussions. The latter reflects the fact that I wear multiple professional hats: scientist and scuba instructor.

Buy one bogus bioinformatics acronym, get one free!

New in the journal Bioinformatics:

So the software is called AdvISER-M-PYRO and this is (presumably) derived from Amplicon identification using SparsE representation of multiplex PYROsequencing. However, I don't understand why 'S', 'E', and 'PYRO' get capitalized but not the 'I' of 'identification' or the 'M' of 'multiplex'? Also, I like how they sneak in two letters ('d' and 'v') that don't occur in the name of the software (at least not before the 'I' of 'identification').

This paper is a 2-for-1 type of deal, because if you read a bit more of the abstract you will see:

In parallel, the nucleotide dispensation order was improved by developing the SENATOR (‘SElecting the Nucleotide dispensATion Order’) algorithm.

This second bogus acronym also lacks some clarity. Is the 'R' of 'SENATOR' derived from the first or second 'r' of the word 'Order'???
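For anyone who wants to check such claims mechanically, here is a small sketch. The helper names `capitals` and `is_subsequence` are my own invention, and the test is only a rough heuristic for 'does the capitalization actually spell the name?'.

```python
def capitals(expansion: str) -> str:
    """Return the uppercase letters of the expansion, in order."""
    return "".join(ch for ch in expansion if ch.isupper())


def is_subsequence(acronym: str, expansion: str) -> bool:
    """True if the acronym's letters occur in order in the expansion.

    The comparison ignores case, so any letter of the expansion may be used.
    """
    letters = iter(expansion.lower())
    # 'ch in letters' consumes the iterator up to the first match,
    # which forces the letters to be found in order.
    return all(ch in letters for ch in acronym.lower())


senator = "SElecting the Nucleotide dispensATion Order"

print(capitals(senator))                   # the letters the authors capitalized
print(is_subsequence("SENATOR", senator))  # can the name be spelled at all?
```

The capitalized letters stop one short of the full name, so the final 'R' has to be borrowed from an uncapitalized letter, and the code cannot tell you which 'r' of 'Order' was intended either.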

Time for a new JABBA award for Just Another Bogus Bioinformatics Acronym

From the journal Bioinformatics, we have:

The bogus nature of this acronym is quickly revealed from the very first line of the abstract:

We introduce PEPPER (Protein complex Expansion using Protein–Protein intERactions)

Winning a JABBA award is one thing, but you get bonus points if you decide to use the same name as a completely different bioinformatics tool (something that is, sadly, becoming more common). So if you run a Google search for pepper bioinformatics, you may also come across a molecular visualizer called PeppeR.

Turkey bioinformatics by the numbers

  • 109,700,700 - mean divergence time (in years) between turkeys (Meleagris gallopavo) and turkey vultures (Cathartes aura)
  • 101,849 - number of turkey nucleotide sequences in GenBank
  • 3,597 - number of selenoproteins in UniProt which allow for the possibility of generating a peptide called ‘TURKEY’ (the IUPAC code for selenocysteine is U)[1]
  • 1 - number of published turkey genomes
  • 0 - number of bioinformatics tools that have tried using ‘turkey’ as part of a bogus acronym in their name[2]

  1. I can’t find any such peptide when using the UniProt BLAST server though :-(

  2. And let's try to keep it that way!
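The hunt described in footnote 1 could also be done locally as a plain substring search, sketched below. The function name `find_peptide` and both example sequences are made up for illustration; real sequences would have to come from a UniProt download, which this sketch does not attempt.

```python
from typing import List


def find_peptide(sequence: str, peptide: str = "TURKEY") -> List[int]:
    """Return the 0-based start of every occurrence of peptide in sequence."""
    positions = []
    start = sequence.find(peptide)
    while start != -1:
        positions.append(start)
        # Step one residue forward so overlapping matches are not missed.
        start = sequence.find(peptide, start + 1)
    return positions


# Invented toy sequences; 'U' is the one-letter code for selenocysteine.
with_turkey = "MKTURKEYLA"
without_turkey = "MKTUAKEYLA"

print(find_peptide(with_turkey))
print(find_peptide(without_turkey))
```

An empty list means the peptide does not occur, which (so far) seems to be the answer for every real selenoprotein.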