101 questions with a bioinformatician #0: Ian Korf

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

 

Ian Korf is the Associate Director of the UC Davis Genome Center and interim manager of the Genome Center Bioinformatics Core facility. He also mentioned that "According to the cake I was given after getting tenure, my title is also 'ASS PRO' (there wasn't enough room to write Associate Professor)". He also asks: "Please just call me Ian. If you call me Dr. Korf, I'll think you're talking about my dad (the in-famous Mycologist)".

You can find out more about Ian from the Korf Lab website and from his new SuperScience and Sorcery blog. Ian is also on Twitter as @iankorf. Careful though, you might find his constant tweeting a bit distracting.

And now, on to the 101 questions...

 

001. What's something that you enjoy about current bioinformatics research?

The constant innovation. There are so many problems to solve and so much cleverness out there.


010. What's something that you *don't* enjoy about current bioinformatics research?

The constant innovation. Many people are inventing the same thing and giving it a different name. Redundancy is sometimes unavoidable (but also sometimes useful).


011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Be nicer to everyone. Science is a social activity. When choosing between being correct and being agreeable, I generally tend towards correctness. Sometimes it's necessary to beat someone with the club of correctness (sounds like a Munchkin card), but more and more I think it's better to walk a little off the optimal path if you can walk there with someone.


100. What's your all-time favorite piece of bioinformatics software? 

BLAST, for many reasons. It has great historical importance and is still relevant today. It has a lot of educational value because it has a mixture of rigorous theory and rational heuristics.


101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

That website has only 15 (U is for RNA and gaps do not symbolize letters - see Karlin-Altschul statistics). But to answer the question, I think B, because B is a lot of things, but not A.
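
For anyone who wants to check Ian's count and his choice of B, here is a minimal Python sketch of the standard IUPAC DNA ambiguity codes, leaving out U and the gap symbols as he does. The names IUPAC_DNA and bases_for are made up for this illustration and do not come from any particular library.

```python
# The 15 IUPAC nucleotide codes for DNA, excluding U (RNA) and gap symbols.
# IUPAC_DNA and bases_for are illustrative names, not part of any library.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def bases_for(code):
    """Return the set of concrete bases that an IUPAC code can stand for."""
    return set(IUPAC_DNA[code.upper()])


print(len(IUPAC_DNA))          # 15
print(sorted(bases_for("B")))  # ['C', 'G', 'T']
print("A" in bases_for("B"))   # False
```

So B really is a lot of things (C, G, or T), but not A.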

101 Questions: a series of interviews with notable bioinformaticians

I've noticed recently that more and more bioinformaticians are talking about what it means to be a bioinformatician, and, in an ideal world, what sort of training bioinformaticians of the future should be getting. Mick Watson's Opiniomics blog in particular has a lot of great material on these subjects. One of his blog posts, which talks about the disconnect between the actual bioinformatics skills a student/postdoc might have and the skills their trainer/supervisor expects them to have, makes for sobering reading (especially if you see all of the comments).

So I thought it might be instructive to ask a simple series of questions to a bunch of notable bioinformaticians to assess their feelings on the current state of bioinformatics research, and maybe get any tips they have about what has been useful to their bioinformatics careers.

I will attempt to keep this post updated with links to all of the interviews that I have published:

List of interviews

Next-generation sequencing must die (part 3) — a tale of two titles

This morning I came across two new papers. Compare and contrast:

  1. De Novo Assembly and Annotation of Salvia splendens Transcriptome Using the Illumina Platform — Ge et al. PLOS ONE
  2. RepARK—de novo creation of repeat libraries from whole-genome NGS reads — Koch et al. Nucleic Acids Research

The former paper lets me know that the research is based on a specific sequencing technology whereas the latter paper is possibly suggesting that the RepARK tool might work with any 'NGS' data.

Given the wide, and sometimes inappropriate, use of the 'NGS' phrase, it is not always obvious what someone means when they refer to 'NGS reads'. This could include 25–30 bp reads from older Illumina sequencing all the way to PacBio reads that may be >15,000 bp (and which contain a high fraction of indels).

Reading the Methods section of the paper, I see that they only used simulated 101 bp reads as well as real Illumina reads (average length = 82 bp). They do point out in the discussion that "long [PacBio] reads may also provide new opportunities for de novo repeat prediction". This is something that I have an interest in because we have previously published work that used PacBio data to find tandem repeats.

In order to find out that they don't have any PacBio data, I had to read the title, abstract, methods, and then scan the rest of the paper. I accept that 'NGS' is a convenient term to use, but it would have been helpful (to me anyway) if at least the abstract could have pinpointed which NGS technologies the paper was using.

A very tenuously derived bioinformatics acronym...and winner of a JABBA award!

I can only imagine that some papers start off with the name of a software tool that they want to use, and then work backwards to form an acronym or initialism. After exhausting valid combinations (where the letters are derived from the initial letters of each word) they switch to plucking letters at random. All that matters is coming up with a fun word or phrase for your tool, right?

And so that brings us to MATE-CLEVER, a new tool described in the latest issue of the journal Bioinformatics. The article title at least provides a hint for where the 'M' comes from:

MATE-CLEVER: Mendelian-inheritance-aware discovery and genotyping of midsize and long indels

The fact that the title of the paper includes no words beginning with 'T', 'E', 'C', 'V', or 'R' makes me a little bit afraid as to just where these letters will come from. Ready for this? MATE-CLEVER is derived from:

Mendelian-inheritance-AtTEntive CLique-Enumerating Variant findER

This is a particularly egregious case of selectively choosing letters to fit your desired name, and for that feat this paper becomes a recipient of yet another JABBA award.
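
To see just how many letters were plucked from the middle of words, here is a rough Python sketch that pulls the capitals out of the expanded name and lists the ones that are not word-initial. It assumes words are separated by hyphens or spaces; the function names are my own invention for this illustration.

```python
import re

# The expanded name exactly as given in the paper.
EXPANSION = "Mendelian-inheritance-AtTEntive CLique-Enumerating Variant findER"


def capitals(text):
    """Join every uppercase letter in the text, in order."""
    return "".join(ch for ch in text if ch.isupper())


def non_initial_capitals(text):
    """List uppercase letters that are not the first letter of a word.

    Words are assumed to be separated by hyphens or whitespace.
    """
    words = re.split(r"[-\s]+", text)
    return [ch for word in words for ch in word[1:] if ch.isupper()]


print(capitals(EXPANSION))              # MATECLEVER
print(non_initial_capitals(EXPANSION))  # ['T', 'E', 'L', 'E', 'R']
```

By this count, half of the ten letters in the name come from somewhere other than the start of a word.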

Update 2014-03-18 14.10: The more I look at this name, the more I think they missed the opportunity to name it 'MEAT-CLEAVER'.