101 questions with a bioinformatician #12: Karen Eilbeck

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Karen Eilbeck is an Associate Professor of Biomedical Informatics at the University of Utah. Karen comes from a long line of distinguished bioinformaticians who learned their skills at the highly regarded Bioinformatics M.Sc. program at the UK's University of Manchester (although they do let some riff-raff in).

If you read Karen's research statement, you will see that there is a clear focus to her work:

Quality control of genomic annotations; Management and analysis of personal genomics data; Ontology development to structure biological, genomic and phenotypic data

In helping build both the Gene Ontology and Sequence Ontology resources, Karen's work has led to the development of powerful structured vocabularies that help ensure that all biologists can speak the same language. Developing ontologies is harder than you might imagine, especially when you are trying to generate precise definitions for very nebulous concepts such as 'what is a gene?'

You can find out more about Karen from the Eilbeck Lab website. And now, on to the 101 questions...

 

 

001. What's something that you enjoy about current bioinformatics research?

I think genomic analysis is fascinating. The human genetics stories suck me in, where bioinformatics is used to find the variant causing the phenotype. The story does not end there: tests are developed, or therapies targeted.

 

010. What's something that you *don't* enjoy about current bioinformatics research?

This is a positive and a negative. I like being part of collaborative projects. It is exciting and things get done. The downside is the amount of time on the phone. It is not something I would ever have anticipated. Conference calls either go OK, or someone is heavy breathing in a train station and hasn’t put their phone on mute. The video conference is either delayed or the resolution is not great. One of my colleagues shared this video with me, which has a lot of truth to it.

 

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

Only a single piece? OK, take your math classes more seriously. I wish I had known how to program when I was doing statistics classes. Instead of using packages like SPSS it may have been more educational to implement tests myself. 

 

100. What's your all-time favorite piece of bioinformatics software, and why?

I am totally in love with a piece of software right now called Phevor, which re-ranks variant prioritization based on phenotype descriptions and uses a variety of ontologies to do its magic. Which brings me to my all-time fave tool: OBO-Edit. I think that OBO-Edit was underrated. This tool was developed by the Gene Ontology consortium to build their ontology, and it was rapidly adopted by the biological community. It is easy to use and underpins many of the ontologies in the bioinformatics domain today. The lead developer for a long time was John Richter, who is also a stand-up comedian and who went on to work for Google. OBO-Edit will always have a place in my heart.

 

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

W (A or T). On the one hand it's reserved and to the point (A), on the other hand it's full of energy and works well with others (T). Also, much like my name, there is confusion when it comes to pronunciation (Eel-beck or I’ll-Beck).

W and its skinny friend V seem interchangeable regarding pronunciation. A friend of mine calls a character from Star Wars Darth Wader, which makes me smile.
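For readers who have not met these codes before, here is a minimal sketch of how the IUPAC letter codes map to the bases they can stand for. It covers only the standard letter codes and leaves out gap symbols; the names `IUPAC_CODES`, `expand`, and `matches` are illustrative and do not come from any particular library.

```python
# Standard IUPAC nucleotide letter codes and the bases each can represent.
# Gap symbols are deliberately left out of this sketch.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG",   # purine
    "Y": "CT",   # pyrimidine
    "S": "CG",   # strong pairing
    "W": "AT",   # weak pairing
    "K": "GT",   # keto
    "M": "AC",   # amino
    "B": "CGT",  # not A
    "D": "AGT",  # not C
    "H": "ACT",  # not G
    "V": "ACG",  # not T
    "N": "ACGT", # any base
}


def expand(code):
    """Return the set of bases that a single IUPAC code can stand for."""
    return set(IUPAC_CODES[code.upper()])


def matches(code, base):
    """Return True if the given base is one that the IUPAC code allows."""
    return base.upper() in expand(code)


if __name__ == "__main__":
    print(sorted(expand("W")))  # ['A', 'T']
    print(matches("W", "G"))    # False
```

So Karen's W is the code that accepts either A or T, while V accepts anything except T.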

Good news: CEGMA is more popular than ever — Bad news: CEGMA is more popular than ever

I noticed from my Google Scholar page today that our 2007 CEGMA paper continues to gain more and more citations. It turns out that there have now been more citations to this paper in 2014 than in any previous year (69 so far and we still have almost half a year to go):

Growth of citations to CEGMA paper, as reported by Google Scholar

I've previously written about the problems of supporting software that a) was written by someone else and b) is based on an underlying dataset that is now over a decade old. These problems are not getting any easier to deal with.

In a typical week I receive 3–5 emails relating to CEGMA; these are mostly requests for help with installing and/or running CEGMA, but we also receive bug reports and feature requests. We hope to shortly announce something that will help with the most common problem, that of getting CEGMA to work. We are putting together a virtual machine that will come pre-installed and configured to run CEGMA. So you'll just need to install something like VirtualBox, and then download the CEGMA VM. Hopefully we can make this available in the coming week or two.

Unfortunately, we have almost zero resources to devote to the continuing development of this old version of CEGMA; any development that does happen is therefore extremely limited (and slow). A forthcoming grant submission will request resources to completely redevelop CEGMA and add many new capabilities. If this grant is not successful then we may need to consider holding some sort of memorial service for CEGMA, as it is becoming untenable to support the old code base. Seven years of usage in bioinformatics is a pretty good run, and the website link in the original paper still works (how many other bioinformatics papers can claim this, I wonder?).

 

Update: 2014-07-21 14.44

Shaun Jackman (@sjackman on Twitter) helpfully reminded me that CEGMA is available as a Homebrew package. There is also an iPlant application for CEGMA. I've added details of both of these to a new item in the CEGMA FAQ.

 

Update: 2014-07-22 07.36

Since publishing this post, I've been contacted by three different people who have pointed out different ways to get CEGMA running. I'm really glad that I blogged about this, otherwise I might not have found out about these other methods.

In addition to Shaun's suggestion (above), it seems that you can also install CEGMA on Linux using the Local Package Manager software. Thanks to Masahiro Kasahara for bringing this to my attention. Finally, Matt MacManes alerted me to the fact that there is a public Amazon Machine Image called CEGMA on the Cloud. More details here.

 

Update: 2014-07-30 19.31

Thanks to Rob Syme, there is now a Docker container for CEGMA. And finally, we have now made an Ubuntu VM that is pre-installed with CEGMA (thanks to Richard Feltstykket at the UC Davis Genome Center's Bioinformatics Core).

None shall stare into the face of Medusa (or Medusa, or MeDUSA): more bioinformatics tools that use the same name

Following on from yesterday's post where I pointed out that there are three completely different bioinformatics tools that are all called 'Kraken', I bring you more news of the same. Scott Edmunds (@SCEdmunds on Twitter) brought to my attention today that there is some bioinformatics software called Medusa that is either:

  1. A tool from 2005 for interaction graph analysis
  2. Some software published in 2011 that explores and clusters biological networks
  3. Or an acronym for a 2012 resource (MeDUSA) that can be used for methylome analysis

So if someone asks you to install Kraken and Medusa, it's good to know that there are only nine different combinations of tools that they might be referring to.
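The arithmetic behind that count is simply 3 × 3. As a quick sketch, using short labels paraphrased from the tool descriptions in these posts (the labels and variable names are illustrative only):

```python
from itertools import product

# Short labels for the three tools called Kraken (paraphrased).
krakens = [
    "genomic coordinate translator",
    "metagenomic sequence classifier",
    "QC and analysis toolkit for high-throughput sequence data",
]

# Short labels for the three tools called Medusa or MeDUSA (paraphrased).
medusas = [
    "interaction graph analysis tool (2005)",
    "biological network explorer (2011)",
    "MeDUSA methylome analysis resource (2012)",
]

# Every possible (Kraken, Medusa) pairing someone might mean.
combinations = list(product(krakens, medusas))

for kraken, medusa in combinations:
    print(f"Kraken = {kraken}; Medusa = {medusa}")

print(len(combinations))  # 9
```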

You wait ages for somebody to develop a bioinformatics tool called 'Kraken' and then three come along at once

I recently wrote about the growing problem of duplicated names for bioinformatics tools. A couple of weeks ago, Stephen Turner (@genetics_blog) pointed out another case:

So Kraken is either a universal genomic coordinate translator for comparative genomics, or a tool for ultrafast metagenomic sequence classification using exact alignments, or even a set of tools for quality control and analysis of high-throughput sequence data. The latter publication is from 2013, and the other two are from this year (2014).

I feel sorry for the poor grad student who is going to lose a day of their life trying to install one of these tools before realizing that they have been installing the wrong Kraken.