Front Line Genomics interview with Craig Venter includes a question from yours truly

Issue 4 of the Front Line Genomics magazine is now available online. It includes an interview with Craig Venter, who gave a much-anticipated talk at their recent Festival of Genomics conference in Boston. Front Line Genomics kindly allowed some of their previous interviewees (myself included) to pose some of the questions. Here's mine:

KRB: What do you see as the limits of synthetic biology? Could we assemble a functional eukaryotic genome, and what are the practical applications of such technology?

JCV: That’s a great question! The limitations will ultimately be more society limitations, ethical limitations, and standards rather than technology. I think a synthetic single eukaryotic cell would be very straightforward to do today. Various groups of scientists have been trying to build the yeast genome. It’s kind of like rebuilding a house one brick at a time, but they’re making a synthetic version of yeast. That’s not quite the same as writing the genetic code and then booting it up as we did, but that’s just because of the limitations on writing the genetic code now.

I think understanding what makes a multicellular organism, and all the regulation associated with that, are so far away from design that we’re going to have to learn a whole lot more biology before we get to that stage of deliberate design. I think about 10% of the genes in our designed synthetic bacterial cell, are of unknown function. All we know is that you can’t get life without them. That problem expands tremendously with eukaryotic cells. If you extrapolate to the challenge of interpreting the human genome, we only understand a tiny fraction of the human genome today.

Get With the Program: DIY tips for adding coding to your analysis arsenal

A new article in The Scientist magazine by Jeffrey M. Perkel shares some coding advice from Cole Trapnell, C. Titus Brown, and Vince Buffalo (I interviewed Vince in my last blog post). It is a great article, and worth a look. I particularly enjoyed this piece of advice (something that is not mentioned enough):

Treat data as "read-only"
Use an abundance of caution when working with your hard-won data, Buffalo says. For instance, “treat data as read-only.” In other words, don’t work with original copies of the data, make working copies instead. “If you have the data in an Excel spreadsheet and you make a change, that original data is gone forever,” he says.

I have seen too many students double-click on FASTA, GFF, and other large bioinformatics text files and end up 'viewing' them in some inappropriate program (including Microsoft Word). If you want to view text data, use a text viewer (such as less).
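
For anyone who would rather script this habit than rely on remembering it, here is a minimal sketch in Python. It is only my illustration of the 'read-only' advice, not anything taken from the article: the function names (make_working_copy and peek) and the example file layout are hypothetical. The idea is to copy the original file into a separate working directory, remove write permission from the original, and then peek at the first few lines of the copy without opening it in an editor.

```python
import shutil
import stat
from itertools import islice
from pathlib import Path


def make_working_copy(original: Path, workdir: Path) -> Path:
    """Copy `original` into `workdir`, then mark the original read-only.

    Hypothetical helper: the name and directory layout are assumptions.
    """
    workdir.mkdir(parents=True, exist_ok=True)
    working = workdir / original.name
    shutil.copy2(original, working)

    # Make sure the working copy is writable by its owner, even if the
    # original was already read-only when it was copied.
    working.chmod(working.stat().st_mode | stat.S_IWUSR)

    # Strip every write bit from the original so it cannot be edited by accident.
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    original.chmod(original.stat().st_mode & ~write_bits)
    return working


def peek(path: Path, n: int = 5) -> list[str]:
    """Return the first `n` lines of a text file without loading all of it."""
    with path.open("r") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]
```

With a (hypothetical) file at raw/reads.fasta, calling make_working_copy(Path("raw/reads.fasta"), Path("work")) would leave the original untouched and read-only, and peek(Path("work/reads.fasta")) would show the first five lines, much as head or less would at the command line.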

101 questions with a bioinformatician #30: Vince Buffalo

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Vince is a second year graduate student in the lab of Graham Coop at UC Davis. Before that he earned his bioinformatics 'chops' working in other groups on the UC Davis campus as a bioinformatician and statistical programmer.

I came to know Vince when he was working as part of the Genome Center's Bioinformatics Core Facility; I was immediately impressed, not only by his diverse set of computational skills, but by the way he applied those skills. Put simply, Vince does things the right way. He believes that bioinformatics should be a carefully documented, reproducible science. He also sees the strengths and advantages of using core Unix skills to organize and manage bioinformatics pipelines. These skills will provide a more useful, and lasting, toolbox than if you only ever learn how to use the latest and greatest set of published bioinformatics tools.

Impressively, Vince has recently published a book (Bioinformatics Data Skills, from O'Reilly). I highly encourage people to buy it, and I'm convinced that it will become an indispensable guide for everyone working in this field. In the book's introduction, he neatly states the problem that I alluded to earlier:

Many biologists starting out in bioinformatics tend to equate “learning bioinformatics” with “learning how to run bioinformatics software.” This is an unfortunate and misinformed idea of what bioinformaticians actually do. This is analogous to thinking “learning molecular biology” is just “learning pipetting.” … the approach of this book is to focus on the skills bioinformaticians use to explore and extract meaning from complex, large bioinformatics datasets.

You can find out more about Vince by visiting his 'digital notebook' website at vincebuffalo.org, or by following him on twitter @vsbuffalo. And now, on to the 101 questions...



001. What's something that you enjoy about current bioinformatics research?

Watching bioinformatics grow to tackle exciting evolutionary questions, especially with non-model organisms. While bioinformatics has clearly revolutionized the human genomics field, I think in the next decade we'll see interesting developments in bioinformatics tailored to problems in complex non-model organism genomics.

I love plants and have worked in plant genomics, and I've seen first hand that it's very hard. Many bioinformatics tools we used were designed to work with human data, not gigantic polyploid genomes. It will be exciting over the next few years to see how reads grow in length, new algorithms emerge, and how this will enable more non-model research. As a budding evolutionary biologist, I'm hopeful that these bioinformatics advances will fuel more discoveries in neat species that have traditionally been harder to work with.



010. What's something that you don't enjoy about current bioinformatics research?

A large proportion of a bioinformatician's time is spent tackling unnecessary human-made problems: data is poorly organized, file formats are both poorly specified and followed, and software is often poorly documented or isn't robust to different data. These are neither interesting scientific problems nor fun computational problems — these are frustrating social and community issues. No one wants to tackle these problems for that reason, but at some point we'll have to as a community — to avoid wasting our collective time on these annoyances.



011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Study more mathematics. I fell in love with statistics before I did math because I quickly saw the beauty in using statistics to understand data. Now I'm working backwards and trying to bolster my maths skills and seeing the beauty in other mathematical fields and really enjoying it. Darwin said "mathematics seems to endow one with something like a new sense" — I'd argue that this is especially true in biology.



100. What's your all-time favorite piece of bioinformatics software, and why?

It's a tie — SAMtools and PSMC. SAMtools is an amazing piece of engineering — from an algorithmic perspective, from a usability perspective, and from a community perspective. If you dig inside the source, everything is so cleverly written and carefully optimized (e.g. the klib library). I've learned a lot of C tricks from reading Heng Li's code.

SAMtools is also extremely well designed from the user perspective — it adopts the Unix philosophy and its subcommand interface is much like Git's. However, SAMtools is not a perfect program; there have been numerous bugs found over the years and some folks attack it for this. But these bugs are quickly patched thanks to active development and an excellent community. I don't work on SAMtools (other than one tiny bug fix) but I enjoy following along via GitHub and reading and learning from the source.



101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

S — and it's a simple puzzle why this is the letter I chose.

ORCID: Over 1.5 million IDs served

I was pleased to read that there are now over 1.5 million ORCID IDs in existence. If you didn't know, ORCID provides unique identifiers to researchers. Once you have an ORCID identifier, you can start linking all of your research to that identifier. If you change your name, or if you suffer from having a very common surname, ORCID makes it easier to track your contributions to science.

From their about page:

ORCID is an open, non-profit, community-driven effort to create and maintain a registry of unique researcher identifiers and a transparent method of linking research activities and outputs to these identifiers. ORCID is unique in its ability to reach across disciplines, research sectors and national boundaries. It is a hub that connects researchers and research through the embedding of ORCID identifiers in key workflows, such as research profile maintenance, manuscript submissions, grant applications, and patent applications.

I'm hopeful that ORCID will one day become the glue to tie together all scientific output. Written a blog post, or grant, or git commit message for some piece of scientific software? There is no reason why these couldn't all be 'digitally signed' with your ORCID identifier.

If you don't have an ORCID ID (and yes, I appreciate that this is a somewhat clumsy nomenclature), you really should register now.