101 questions with a bioinformatician #13: Michael Schatz

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Mike Schatz is an Assistant Professor of Quantitative Biology at Cold Spring Harbor Laboratory.  Prior to getting into the world of genomics and bioinformatics, Mike worked for a startup company that specialized in network security (working on encryption software for online banking amongst other things):

It was unplanned serendipity, but code breaking turned out to be perfect training for genomics, and the startup turned out to be perfect training to become a PI. 

His research focuses on the development of scalable algorithms and systems to analyze biological sequence data, concentrating on the alignment, assembly, and analysis of high-throughput DNA sequencing reads. If you visit his lab research page, you will see an impressive list of software tools that he has helped develop.

Aside from his contributions to genomics, I am perhaps more impressed that Mike has made available slides from all of his major research presentations going back to 2005 (over 80 talks). I wish more scientists were as dedicated to sharing talks like this. You can find out more about Mike from his lab website or by following him on twitter (@mike_schatz). And now, on to the 101 questions...

 

 

001. What's something that you enjoy about current bioinformatics research?

What brought me into the field was the opportunity to apply my training and experience in computer science to really meaningful problems in biology and medicine. I’m fascinated by the deep connections between how computers and software are organized and operate compared to how cells and genomes are replicated, transcribed, and evolve.

Right now is by far the most fantastic time to be in a field that is driven by rapid improvements to the biotechnology. How amazing that just 15 or 20 years ago it would have been cheaper and easier to land a team on the moon than to sequence their genomes, but now we do it on a routine basis!

This growth has fundamentally and forever changed the types of questions that we can even ask. The really exciting and scary point is we are still at the very beginning, and are still feeling around in the dark. I recently gave a talk about how long we should expect to wait until we have sequenced one billion genomes (hint: it is a lot sooner than you might expect).

 

 

010. What's something that you *don't* enjoy about current bioinformatics research?

The FASTQ “file format”. Do we really need the read identifier listed twice (sometimes), newlines within a single record, and an unspecified encoding scheme for quality values that changes every so often depending on when the software was run?

I cringe every time I have to teach it to a new student. There is no rationale to it and it's so obviously flawed. It just feels dirty to teach it. I like to think that in 10 or 100 years this will all be sorted out, but today, this and so many other poorly designed systems are entrenched into our day-to-day lives. It is a constant, if dull, irritation that makes everything slow to change, and brittle to use.
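To make the quality-encoding complaint concrete, here is a minimal Python sketch (an editorial illustration, not something from the interview). FASTQ files do not declare which ASCII offset their quality characters use, so software has to guess. The function names and the cut-off of 59 are assumptions chosen for illustration, based on the commonly described Phred+33 and Phred+64 conventions; the guess is a heuristic and can be wrong for uniformly high-quality Phred+33 data.

```python
def guess_phred_offset(quality_strings):
    """Heuristically guess the ASCII offset (33 or 64) of FASTQ quality strings.

    Assumption: characters with codes below 59 (';') cannot occur in the
    +64 style encodings, so seeing one implies Phred+33. Otherwise we
    guess 64, which may be wrong for very high-quality Phred+33 reads.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        raise ValueError("no quality characters to inspect")
    return 33 if min(codes) < 59 else 64


def decode_qualities(quality_string, offset):
    """Convert a quality string into a list of integer quality scores."""
    return [ord(ch) - offset for ch in quality_string]


if __name__ == "__main__":
    quals = ["IIII#", "FFFFF"]
    offset = guess_phred_offset(quals)
    print(offset, decode_qualities(quals[0], offset))
```

The fact that a guess is needed at all is the point: the same quality line can decode to different scores depending on which offset the reader assumes.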

 

011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Take more probability and statistics. So much of my life now is spent looking for patterns in enormously large and complex data that the only hope is through statistical analysis. I used to stay up late reading algorithms textbooks, but now this is where I spend my free time.

The one really successful tip I’ve learned is that, even though my intuition for probability is poor, I can often work backwards using a simulator. I’ll write a little code so I can look at what happens to the distribution if this rate goes up, or if the genome was twice as complex. I then use that to guide me to the analytical form. I always understand an algorithm better if I implement it from scratch, and I think that this is an extension of that concept.
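As an editorial illustration of this "simulate first, derive later" tip, here is a minimal Python sketch. It is not Mike's code: the scenario (counting sequencing errors per read at a fixed per-base error rate) and all names and parameter values are assumptions picked to keep the example small. The simulated mean can be compared against the analytical expectation, read length multiplied by error rate, and the rate can be changed to see how the distribution shifts.

```python
import random


def simulate_errors_per_read(read_length, error_rate, n_reads, seed=0):
    """Simulate the number of erroneous bases in each of n_reads reads.

    Each base is independently wrong with probability error_rate, so the
    per-read count should follow a binomial distribution.
    """
    rng = random.Random(seed)
    return [
        sum(rng.random() < error_rate for _ in range(read_length))
        for _ in range(n_reads)
    ]


def mean(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    for rate in (0.01, 0.02):
        counts = simulate_errors_per_read(100, rate, 5000)
        # Compare the simulated mean with the analytical value n * p.
        print(rate, round(mean(counts), 3), 100 * rate)
```

Doubling the rate should roughly double the simulated mean, which is the kind of observation that points toward the analytical form.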

 

100. What's your all-time favorite piece of bioinformatics software, and why?

Do I have to pick just one? Ben Langmead blew my mind when he taught me about the FM-index. A very close second was the genome assembler Art Delcher wrote in about 50 lines of awk. More recently my lab went over the SGA algorithm from Simpson and Durbin in great detail. All of these have beauty in their simplicity and elegance — like a great work of art everything locks together perfectly in step.
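For readers who have not met the FM-index, here is a toy Python sketch of the core idea (an editorial addition, not code from Ben Langmead or from any aligner). It builds the Burrows-Wheeler transform by sorting all rotations and then counts pattern occurrences with backward search. The function names are hypothetical, and the naive rotation sort and the slice-and-count rank lookup are only sensible for tiny strings; real implementations use suffix arrays and sampled rank tables.

```python
def bwt(text):
    """Burrows-Wheeler transform of text, using '$' as the end marker."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count occurrences of pattern in the original text by backward search."""
    # For each character, the row where its block starts in the sorted
    # first column (the number of smaller characters in the text).
    counts = {}
    for ch in bwt_string:
        counts[ch] = counts.get(ch, 0) + 1
    first_row = {}
    total = 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Number of times ch appears in the first i characters of the BWT.
        return bwt_string[:i].count(ch)

    lo, hi = 0, len(bwt_string)
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + occ(ch, lo)
        hi = first_row[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = bwt("GATTACA")
    print(index, count_occurrences(index, "TA"))
```

The pattern is consumed right to left, and each step narrows a range of rows in the sorted rotations, which is why the search time depends on the pattern length rather than on the genome length (given a fast rank lookup).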

 

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

S – It is the strongest code, of course! ;)

 

Is there ever a valid reason for storing bioinformatics data in a Microsoft Word document?

Short answer

No.

Long answer

Noooooooooo!!!

Background

Yesterday I finished reviewing a paper. My review was generally very positive and I enjoyed reading the manuscript. The authors linked to some supplementary files that were available on another website. As I'm the type of reviewer that likes to look at every file that is part of a submission, I logged on to the website to see what files were there.

The first file that was listed had a 'docx' extension. Someone might argue that if this file contained a textual description of how the other files were being generated, then maybe there is nothing wrong with somebody using Microsoft Word. I would disagree. Any sort of documentation should ideally be in plain text, and maybe PDF as an alternative.

In any case, I opened the file to see what we were dealing with. The file contained a list of several thousand gene identifiers, one identifier per line. There was nothing else in the thirty-six page file.

This is not an acceptable practice! Use of Microsoft Word to store bioinformatics data will only ever result in unhappiness, frustration, and anger. And we all know what anger leads to…

Supplemental madness: on the hunt for 'Figure S1'

I've just been looking at this new paper by Vanneste et al. in Genome Research:

Analysis of 41 plant genomes supports a wave of successful genome duplications in association with the Cretaceous–Paleogene boundary

I was curious as to where their 41 plant genomes came from, so I jumped to the Methods section to see: 

No surprise there, this is exactly the sort of thing you expect to find in the supplementary material of a paper. So I followed the link to the supplementary material only to see this:

So the 'Supplemental Material' contains 'Supplemental Information' and the — recursively named — 'Supplemental Material'. So where do you think Supplemental Table S1 is? Well it turns out that this table is in the Supplemental Material PDF. But when looking at both of these files, I noticed something odd. Here is Figure S1 from the Supplemental Information:

And here is part of another Figure S1 from the Supplemental Material file:

You will notice that the former figure S1 (in the Supplemental Information) is actually called a Supporting Figure. I guess this helps distinguish it from the completely-different-and-in-no-way-to-be-confused Supplementary Figure S1.

This would possibly make some sort of sense if the main body of the paper distinguished between the two different types of Figure S1. Except the paper mentions 'Supplemental Figure S1' twice (not even 'Supplementary Figure S1') and doesn't mention Supporting Figure S1 at all (or any supporting figures for that matter)!

What does all of this mean? It means that Supplementary Material is a bit like the glove compartment in your car: a great place to stick all sorts of stuff that will possibly never be seen again. Maybe we need better reviewer guidelines to stop this sort of confusion happening? 

 

The Assemblathon Gives Back (a bit like The Empire Strikes Back, but with fewer lightsabers)

So we won an award for Open Data. Aside from a nice-looking slab of glass that is weighty enough to hold down all of the papers that someone with a low K-index has published, the award also comes with a cash prize.

Naturally, my first instinct was to find the nearest sculptor and request that they chisel a 20 foot recreation of my brain out of Swedish green marble. However, this prize has been — somewhat annoyingly — awarded to all of the Assemblathon 2 co-authors.

While we could split the cash prize 92 ways, this would probably only leave us with enough money to buy a packet of pork scratchings each (which is not such a bad thing if you are a fan of salty, fatty, porcine goodness).

Instead we decided — and by 'we', I'm really talking about 'me' — to give that money back to the community. Not literally of course…though the idea of throwing a wad of cash into the air at an ISMB meeting is appealing.

Rather, we have worked with the fine folks at BioMed Central (that's BMC to those of us in the know), to pay for two waivers that will cover the cost of Article Processing Charges (that's APCs to those of us in the know). We decided that these will be awarded to papers in a few select categories relating to 'omics' assembly, Assemblathon-like contests, and things to do with 'Open Data' (sadly, papers that relate to 'pork scratchings' are not eligible).

We are calling this event the Assemblathon 'Publish For Free' Contest (that's APFFC to those of us in the know), and you can read all of the boring details and contest rules on the Assemblathon website.