101 questions with a bioinformatician #12: Karen Eilbeck

July 23, 2014 by Keith Bradnam

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

Karen Eilbeck is an Associate Professor of Biomedical Informatics at the University of Utah. Karen comes from a long line of distinguished bioinformaticians who learned their skills at the highly regarded Bioinformatics M.Sc. program at the UK's University of Manchester (although they do let some riff-raff in).

If you read Karen's research statement, you will see that there is a clear focus to her work:

Quality control of genomic annotations; Management and analysis of personal genomics data; Ontology development to structure biological, genomic and phenotypic data

In helping build both the Gene Ontology and Sequence Ontology resources, Karen's work has led to the development of powerful structured vocabularies that help ensure that all biologists can speak the same language. Developing ontologies is harder than you might imagine, especially when you are trying to generate precise definitions for very nebulous concepts such as what is a gene?

You can find out more about Karen from the Eilbeck Lab website. And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

I think genomic analysis is fascinating. The human genetics stories suck me in, where bioinformatics is used to find the variant causing the phenotype. The story does not end there: tests are developed, or therapies targeted.

010. What's something that you *don't* enjoy about current bioinformatics research?

This is a positive and a negative. I like being part of collaborative projects. It is exciting and things get done. The downside is the amount of time on the phone. It is not something I would ever have anticipated. Conference calls either go OK, or someone is heavy breathing in a train station and hasn’t put their phone on mute. The video conference is either delayed or the resolution is not great. One of my colleagues shared this video with me, which has a lot of truth to it.

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

Only a single piece? OK, take your math classes more seriously. I wish I had known how to program when I was doing statistics classes. Instead of using packages like SPSS it may have been more educational to implement tests myself.

100. What's your all-time favorite piece of bioinformatics software, and why?

I am totally in love with a piece of software right now called Phevor, which re-ranks variant prioritization based on phenotype descriptions and uses a variety of ontologies to do its magic. Which brings me to my all time fave tool: OBO-Edit. I think that OBO-Edit was underrated. This tool was developed by the Gene Ontology consortium to build their ontology, and it rapidly became adopted by the biological community. It is easy to use and underpinned many of the ontologies in the bioinformatics domain today. The lead developer for a long time was John Richter, who is also a stand-up comedian and went on to work for Google. OBO-Edit will always have a place in my heart.

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

W (A or T). On the one hand it's reserved and to the point (A), on the other hand it's full of energy and works well with others (T). Also, much like my name, there is confusion when it comes to pronunciation (Eel-beck or I’ll-Beck).

W and its skinny friend V seem interchangeable regarding pronunciation. A friend of mine calls a character from Star Wars Darth Wader, which makes me smile.

Good news: CEGMA is more popular than ever — Bad news: CEGMA is more popular than ever

July 21, 2014 by Keith Bradnam

I noticed from my Google Scholar page today that our 2007 CEGMA paper continues to gain more and more citations. It turns out that there have now been more citations to this paper in 2014 than in any previous year (69 so far and we still have almost half a year to go):

Growth of citations to CEGMA paper, as reported by Google Scholar

I've previously written about the problems of supporting software that a) was written by someone else and b) is based on an underlying dataset that is now over a decade old. These problems are not getting any easier to deal with.

In a typical week I receive 3–5 emails relating to CEGMA; these are mostly requests for help with installing and/or running CEGMA, but we also receive bug reports and feature requests. We hope to shortly announce something that will help with the most common problem, that of getting CEGMA to work. We are putting together a virtual machine that will come pre-installed and configured to run CEGMA. So you'll just need to install something like VirtualBox, and then download the CEGMA VM. Hopefully we can make this available in the coming week or two.

Unfortunately, we have almost zero resources to devote to the continuing development of this old version of CEGMA; any development that does happen is therefore extremely limited (and slow). A forthcoming grant submission will request resources to completely redevelop CEGMA and add many new capabilities. If this grant is not successful then we may need to consider holding some sort of memorial service for CEGMA, as it is becoming untenable to support the old code base. Seven years of usage in bioinformatics is a pretty good run and the website link in the original paper still works (how many other bioinformatics papers can claim this I wonder?).

Update: 2014-07-21 14.44

Shaun Jackman (@sjackman on twitter) helpfully reminded me that CEGMA is available as a homebrew package. There is also an iPlant application for CEGMA. I've added details of both of these to a new item in the CEGMA FAQ:

Are there other - less painful - ways that I can install CEGMA?

Update: 2014-07-22 07.36

Since publishing this post, I've been contacted by three different people who have pointed out different ways to get CEGMA running. I'm really glad that I blogged about this, else I may not have found out about these other methods.

In addition to Shaun's suggestion (above), it seems that you can also install CEGMA on Linux using the Local Package Manager software. Thanks to Masahiro Kasahara for bringing this to my attention. Finally, Matt MacManes alerted me to the fact that there is a public Amazon Machine Image called CEGMA on the Cloud. More details here.

Update: 2014-07-30 19.31

Thanks to Rob Syme, there is now a Docker container for CEGMA. And finally, we have now made an Ubuntu VM that is pre-installed with CEGMA (thanks to Richard Feltstykket at the UC Davis Genome Center's Bioinformatics Core).

None shall stare into the face of Medusa (or Medusa, or MeDUSA): more bioinformatics tools that use the same name

July 18, 2014 by Keith Bradnam

Following on from yesterday's post where I pointed out that there are three completely different bioinformatics tools that are all called 'Kraken', I bring you more news of the same. Scott Edmunds (@SCEdmunds on twitter) brought to my attention today that there is some bioinformatics software called Medusa that is either:

A tool from 2005 for interaction graph analysis
Some software published in 2011 that explores and clusters biological networks
Or an acronym for a 2012 resource (MeDUSA) that can be used for methylome analysis

So if someone asks you to install Kraken and Medusa, it's good to know that there are only nine different combinations of tools that they might be referring to.
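For anyone who wants to check the arithmetic, here is a minimal sketch that pairs every Kraken with every Medusa. The short labels are paraphrased from the descriptions in these two posts and are only illustrative; they are not official tool names.

```python
from itertools import product

# Illustrative labels only, paraphrased from the blog posts (not official names).
kraken_tools = [
    "Kraken (genomic coordinate translator)",
    "Kraken (metagenomic sequence classifier)",
    "Kraken (QC and analysis of high-throughput sequence data)",
]
medusa_tools = [
    "Medusa (interaction graph analysis, 2005)",
    "Medusa (biological network clustering, 2011)",
    "MeDUSA (methylome analysis, 2012)",
]

# Every possible pairing of one Kraken with one Medusa.
combinations = list(product(kraken_tools, medusa_tools))

for kraken, medusa in combinations:
    print(f"{kraken} + {medusa}")

print(len(combinations))  # 3 x 3 = 9
```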

You wait ages for somebody to develop a bioinformatics tool called 'Kraken' and then three come along at once

July 17, 2014 by Keith Bradnam

I recently wrote about the growing problem of duplicated names for bioinformatics tools. A couple of weeks ago, Stephen Turner (@genetics_blog) pointed out another case:

Another "kraken" in #bioinformatics, this one for translating genomic coord's for cross-species comparative #genomics http://t.co/5v4NjV4qYR
— Stephen Turner (@genetics_blog) July 7, 2014

So Kraken is either a universal genomic coordinate translator for comparative genomics, or a tool for ultrafast metagenomic sequence classification using exact alignments, or even a set of tools for quality control and analysis of high-throughput sequence data. The last of these publications is from 2013, and the other two are from this year (2014).

I feel sorry for the poor Grad Student who is going to lose a day of their life trying to install one of these tools before realizing that they have been installing the wrong Kraken.

Which 'omics' assembly tools are currently the most popular?

July 03, 2014 by Keith Bradnam

I recently organized an online poll to find out which tools for genome, transcriptome, and metagenome assembly are currently the most popular with researchers. After a week or so of collecting results, I ended up with 116 responses that describe over 30 different assembly tools.

Thanks to everyone who took part. I've posted the results to Figshare as a PDF report, and have also embedded this below (I suggest downloading the PDF so that you can use all of the embedded hyperlinks in the report).

Impactstory: Publications are an important part of research…but they’re not the only part

July 01, 2014 by Keith Bradnam

I'm a great fan of the Impactstory service that makes it easy to aggregate all of your research output in one place, and then see how people are engaging with your research. I like it so much that I signed up to be an Impactstory Advisor.

Today I'm giving a talk at UC Davis about Impactstory, and so that everyone can see why I like this service so much, I've made a video version of my presentation.

Visit the Impactstory website to find out more or follow them on twitter (@Impactstory). For an example of the types of things that Impactstory can track, have a look at my own Impactstory page (impactstory.org/KeithBradnam).

These go up to 13

June 27, 2014 by Keith Bradnam

No words can add to this video.

Survey: what are your preferred genome assembly tools?

June 24, 2014 by Keith Bradnam

Time for a quick survey to identify what the preferred tools are for people who are currently doing genome/transcriptome/metagenome assembly. Please add your responses to the survey below and I'll publish the results in a week or two.

Some slides from a recent talk about genome assembly (and thoughts on evolving slide decks)

June 23, 2014 by Keith Bradnam

Last week I presented a talk about genome assembly at a UC Davis Bioinformatics Core workshop. The first time that I gave this talk, it concentrated almost exclusively on the results from our Assemblathon 2 paper. In the handful of times that I've subsequently given this talk, it has always evolved:

More background to better explain some of the more common terminology in this field
Less detail about the specifics of the Assemblathon 2 results
New information relating to the latest developments in sequencing and assembly
Added an 'intermission' so I can explain why I think Next-generation sequencing must die

Even if I didn't add any new content to my talk, and even if I was giving the same talk twice in the same week, I would still almost certainly change some aspect of my presentation. Here are some reasons for why I often end up changing things:

Things which seemed like a good idea when planning and making slides don't always work as well in front of an actual audience. Sometimes this might be unnecessary detail which slows things down, or it might be something which is no longer as relevant (or exciting) as when you first gave a talk on this topic.
Inevitably there will be some parts of my talk which don't flow as well as others. Sometimes I will switch the order of sections, or drop sections altogether.
If people ask me questions during the talk, then this is often because something is unclear. I try to make mental reminders about this, as it might mean that there is something I can better explain.
Some visual elements will look great on my screen, and even on certain projectors, but then I will give a talk somewhere where a different projector makes a slide look horrible. Most commonly this happens when two colors end up looking far too similar. It's always a good idea to change things so that they look clear on any projector.
If I know that the audience for a talk may contain many people that don't speak English as their primary language, I might add more text content on key slides.
A final reason for changing content is just to keep your talk fresh. It's possible that you become stale when you give the exact same talk over and over again. Changing the order of sections, or adding/removing content, means that you have to re-engage with your own material.

But hey, enough of my yakking…here are the slides. Note that I include two versions; the first version doesn't have any notes (harder to follow as I often prefer to talk around what's on my slides). The second version has notes added below each slide (these notes try to capture the gist of what I talk about on each slide). Also, don't be alarmed by the high slide count; each animation step appears as a separate slide (so that you can almost capture all of the animated fun of a real Keith Bradnam presentation).

Genome Assembly: then and now from Keith Bradnam

Genome Assembly: then and now — with notes from Keith Bradnam

Designing a musical motif for the UC Davis Genome Center

June 20, 2014 by Keith Bradnam

Over the last month, I have spent much of my time helping to develop a new website for the UC Davis Genome Center (a site which will hopefully be launched very soon). In trying to bring the website into the modern era, I've been trying to set things up so that we can better promote any news that arises from the work of the talented faculty, staff, and students that we have.

In particular, I'm keen to feature some video clips on the new site, and that made me think that we should have our own Genome Center 'ident' to use in any videos. Idents are a bit like stingers on radio stations, something that gives an audio signature that people might come to recognize (and maybe even like).

I have a smattering of music knowledge so I thought it might be fun to create something based on DNA. As there are four canonical DNA bases (A, C, G, and T), I thought that the musical motif should have four principal notes. I then decided to arrange the notes with musical intervals based on the intervals between the alphabet positions of A, C, G, and T. If you start this sequence on a C note, you end up with C, D, F# and G (one octave up). This progression feels like it needs to be resolved, and a basic G major chord seems to work.
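For the curious, the derivation of those four notes can be sketched in a few lines of Python. This is only an illustration, and it assumes that each gap between alphabet positions is read as a number of semitones (that reading reproduces the notes listed above). The names NOTE_NAMES and motif are made up for this example.

```python
# Alphabet positions of the four DNA bases (A=1, B=2, ..., Z=26).
bases = ["A", "C", "G", "T"]
positions = [ord(base) - ord("A") + 1 for base in bases]  # [1, 3, 7, 20]

# Gaps between consecutive positions, treated as semitone steps (an assumption).
intervals = [b - a for a, b in zip(positions, positions[1:])]  # [2, 4, 13]

# The twelve notes of the chromatic scale, starting from C.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def motif(start_semitone=0):
    """Return (note name, octaves above the start) for each note of the motif."""
    pitches = [start_semitone]
    for step in intervals:
        pitches.append(pitches[-1] + step)
    return [(NOTE_NAMES[p % 12], p // 12) for p in pitches]


notes = motif()  # start on C
print(notes)  # [('C', 0), ('D', 0), ('F#', 0), ('G', 1)]
```

The second value in each pair is the octave offset, so the final G lands one octave above the starting C.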

So this is what I have come up with so far. This may end up being vetoed by the powers-that-be, but I'm still pretty happy with it:

Update: just to add that this piece was made entirely using GarageBand on my Mac. There are: three tracks that use Classic Electric Piano (I was using the onscreen keyboard which is why I ended up doing these as three separate tracks); one Tonewheel Organ track; one Upright Studio Bass track; one Classic Analog Pad track; and one String Ensemble track. The latter three tracks combine to form the final chord.