Paper review: anybody who works in bioinformatics and/or genomics should read this paper!

I rarely blog about specific papers but felt moved to write about a new paper by Jonathan Mudge, Adam Frankish, and Jennifer Harrow who work in the Vertebrate Annotation group at the Wellcome Trust Sanger Institute.

Their paper, now out in Genome Research, is titled: Functional transcriptomics in the post-ENCODE era.

They brilliantly, and comprehensively, list the various ways in which gene architecture — and by extension gene annotation — is incredibly complex and far from a solved problem. However, they also provide an exhaustive description of all the various experimental technologies that are starting to shine a lot more light on this, at times, dimly lit field of genomics.

In their summary, they state:

Modern genomics (and indeed medicine) demands to understand the entirety of the genome and transcriptome right now

I'd go so far as to say that many people in genomics assume that genomes and transcriptomes are already understood. I often feel that too many people enter this field with false beliefs that many genomes are complete and that we know about all of the genes in these genomes. Jonathan Mudge et al. start this paper by firmly pointing out that even the simple question of 'what is a gene?' is something that we are far from certain about.

Reading this paper, I was impressed by how comprehensively they have reviewed the relevant literature, pulling in numerous examples that indicate just how complex genes are, and which show that we need to move away from the very protein-centric world view that has dominated much of the history of this field.

LncRNAs, microRNAs, and piwi-interacting RNAs are three categories of RNA that you probably wouldn't find mentioned anywhere in textbooks from a decade ago, but which now (along with 'traditional' non-coding RNAs such as rRNAs, tRNAs, snoRNAs etc.) probably outnumber the protein-coding genes in the human genome. Many parts of this paper tackle the issue of transcriptional complexity, particularly trying to address the all-important question: how much of this is functional?

I found that so many parts of this paper touched on previous, current, and possible future projects in our lab. Producing an accurate catalog of genes, understanding alternative splicing, examining the relationship between mRNA and protein abundances, looking for conservation of signals between species...these are all topics that are near and dear to people in our lab.

Even if you have no interest in the importance of gene annotation — and shame on you if that is how you feel — this paper also serves as a fantastic catalog of the latest experimental techniques that can be used to capture and study genes (e.g. CAGE, ribosome profiling, polyA-seq etc).

If you have ever worked with a set of genes from a well curated organism, spare a thought for the huge amount of work that goes into trying to provide those annotations and keep them up to date. I'll leave you with the last couple of sentences from the paper...please repeat this every morning as your new mantra:

Finally, no one knows what proportion of the transcriptome is functional at the present time; therefore, the appropriate scientific position to take is to be open-minded. We thus do not claim that the annotation of the human genome is close to completion. If anything, it seems as if the hard work is just beginning.

More JABBA awards for inventive bioinformatics acronyms

A quick set of new JABBA award recipients. Once again these are drawn from the journal Bioinformatics.

  1. NetWeAvers: an R package for integrative biological network analysis with mass spectrometry data - the mixed capitalization of this software tool is a little uneasy on the eye. But more importantly, a Google search for 'netweavers' returns lots of links about something entirely different. I.e. NetWeavers (and NetWeaving) is already a recognized term in another field.
  2. GIM3E: condition-specific models of cellular metabolism developed from metabolomics and expression data. - the 3 part of this algorithm's name is deliberately written in superscript by the authors. This implies 'cubed', but I think it is really referring to 3 lots of 'M' related words because the full name of the algorithm is 'Gene Inactivation Moderated by Metabolism, Metabolomics and Expression'. GIM3E is not something that is particularly easy to say quickly, though it is much more Google friendly than NetWeavers.
  3. INSECT: IN-silico SEarch for Co-occurring Transcription factors - turning an acronym into the name of a plant or animal is quite common in bioinformatics. A couple of examples are worth mentioning. There is the MOUSE resource (Mitochondria and Other Useful SEquences) and also something called HAMSTeRS (the Haemophilia A Mutation, Structure, Test and Resource Site). The main problem with acronyms like these is that they can be too hard to find using online search tools (e.g. Google for hamster resources). A secondary issue is that the name just doesn't really connect to what the resource/database/algorithm is about. The INSECT database contains information about 14 different species, only one of which is an insect.

I'll no doubt be posting again the next time I come across some more dubious acronyms.

Top twitter talent: UC Davis genome scientists lead the way

The Next Gen Seq website has just published its 2013 list of the Top N Genome Scientists to Follow on Twitter. Over 10% of the scientists on this international list are staff or faculty here at UC Davis, which says a lot about the quality of genomics talent here on campus.

It is also worth mentioning that there are so many other people at UC Davis who work in genomics and bioinformatics and who use twitter to effectively communicate their research and engage with the community. E.g.

  • @dr_bik - Holly Bik (Postdoc in Jon Eisen's lab)
  • @ryneches - Russell Neches (Grad student in Jon Eisen's lab)
  • @theladybeck - Kristen Beck (Grad student in Ian Korf's lab)
  • @sudogenes - Gina Turco (Grad student in Siobhan Brady's lab...and winner of best twitter account name)

Great to see UC Davis recognized like this.

 

Update

Updated at 9:09 am to reflect that Next Gen Seq have now added Vince Buffalo to the list (he was apparently meant to be on the list anyway).

Another winner of the JABBA award for horrible bioinformatics acronyms

It's time to hand out another JABBA (Just Another Bogus Bioinformatics Acronym) award. Joining the recent recipients is a tool described in the latest issue of the Bioinformatics journal.

I don't have any problem with the acronym itself, and this is not a tool which is randomly adding or removing letters from the full name to produce the acronym. So what is my problem? Well the tool — which calculates a score to assess the local quality of a protein structure — is called The Local Distance Difference Test. And the acronym? Oh, the acronym is just 'lDDT' with a lower-case 'L'.

Now, this might not be so bad if it were not for the fact that all fonts used by the Bioinformatics journal (HTML & PDF versions) as well as the author's own website make this 'L' look like the letter I or the number 1.

[Screenshots of the name as it appears in the journal's HTML version, in the PDF, and on the author's website]

I can't help but imagine that people will only ever read this as IDDT and not LDDT...which of course doesn't bode well if someone ends up Googling for this tool at a later date. Compare a search for LDDT (which finds the correct tool) vs a search for IDDT (which doesn't):

[Screenshots of the Google results for 'LDDT' and for 'IDDT']
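
The l/I/1 ambiguity described above is something that could be checked for mechanically before a name goes to print. Here is a minimal sketch in Python. The set of confusable characters is my own hand-picked assumption for illustration (it is not a standard list), and the function names are hypothetical.

```python
# Map each easily confused character to one representative of its group.
# These groups are an illustrative assumption, not an exhaustive standard.
CANONICAL = {
    "l": "l", "I": "l", "1": "l", "|": "l",  # lower-case L, upper-case i, one, pipe
    "O": "O", "0": "O",                      # upper-case o, zero
}


def ambiguous_characters(name):
    """Return the characters in name that belong to a confusable group."""
    return [ch for ch in name if ch in CANONICAL]


def looks_the_same(name_a, name_b):
    """True if two names differ only by characters within the same confusable group."""
    def normalize(name):
        return "".join(CANONICAL.get(ch, ch) for ch in name)
    return normalize(name_a) == normalize(name_b)


if __name__ == "__main__":
    print(ambiguous_characters("lDDT"))    # characters a reader might misread
    print(looks_the_same("lDDT", "IDDT"))  # the two spellings collapse together
```

A name that returns anything from `ambiguous_characters` is probably worth viewing in a few different fonts before settling on it.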

Congratulations on being the recipient of another JABBA award!

What's in a name? Better vocabularies = better bioinformatics?

About 7:00 this morning I was somewhat relieved because my scheduled lab talk had been postponed (my boss was not around). But we were still having the lab meeting anyway.

About 8:00 this morning, I stumbled across this blog post by @biomickwatson on twitter. I really enjoyed the post and thought I would mention it in the lab meeting. Suddenly though that prompted me to think about some other topics relating to Mick's blog post.

Before I knew it, I had made about 30 slides and ended up speaking for most of the lab meeting. I thought I'd add some notes and post the talk on SlideShare.

Slides: What's in a name? Better vocabularies = better bioinformatics? (from Keith Bradnam)

I get very frustrated by people who rely heavily on GO term analysis, without having a good understanding of what Gene Ontology terms are, or how they get assigned to database objects. There are too many published analyses which see an enrichment of a particular GO term as some reliable indicator that there is a difference in datasets X & Y. Do they ever check to see how these GO terms were assigned? No.
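
One quick way to see how the terms behind an enrichment result were assigned is to tally their evidence codes. The sketch below assumes the annotations are in the tab-delimited GAF format, where column 5 holds the GO ID and column 7 holds the evidence code. The function name is hypothetical and the example records are invented purely for illustration.

```python
from collections import Counter


def evidence_code_counts(gaf_lines, go_id):
    """Count evidence codes for all annotations to one GO term.

    gaf_lines: iterable of lines from a GAF annotation file.
    go_id:     GO identifier to look up, e.g. "GO:0006412".
    """
    counts = Counter()
    for line in gaf_lines:
        if line.startswith("!"):       # GAF header/comment lines start with '!'
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:            # skip malformed or truncated lines
            continue
        if fields[4] == go_id:         # column 5: GO ID
            counts[fields[6]] += 1     # column 7: evidence code
    return counts


if __name__ == "__main__":
    # Invented example records (only the first seven columns are filled in).
    example = [
        "!gaf-version: 2.2",
        "DB\tGENE1\tgene1\t\tGO:0006412\tREF:1\tIEA",
        "DB\tGENE2\tgene2\t\tGO:0006412\tREF:2\tIEA",
        "DB\tGENE3\tgene3\t\tGO:0006412\tREF:3\tIDA",
        "DB\tGENE4\tgene4\t\tGO:0008150\tREF:4\tIEA",
    ]
    print(evidence_code_counts(example, "GO:0006412"))
```

If most of the annotations behind an 'enriched' term turn out to be IEA (Inferred from Electronic Annotation, i.e. assigned automatically without review by a curator), that is worth knowing before reading too much into the result.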

New recipient of the Just Another Bogus Bioinformatics Acronym (JABBA) award

It was only a few weeks ago that I gave out the last JABBA award. One of the winning recipients that time was a database — featuring excessive use of mixed-case characters — called 'mpMoRFsDB'.

Well it seems that if you work on 'MoRFs' (Molecular Recognition Features) then you must love coming up with fun acronyms. This week in BMC Bioinformatics we have another MoRFs related tool that is worthy of a JABBA award:

The oh-so-catchy 'MFSPSSMpred' (Masked, Filtered and Smoothed Position-Specific Scoring Matrix-based Predictor) is the kind of name that requires you to first sit down and take a deep breath before attempting to pronounce it. Just imagine having to tell someone about this tool:

"Hi Keith, can you recommend any bioinformatics tools for identifying MoRFs?"

"Why certainly, have you tried em-eff-ess-pee-ess-ess-em-pred?"

Congratulations MFSPSSMpred, you join the ranks of former JABBA winners.

Some free code editors for Macs (that work in a UC Davis computer lab)

Every year I help teach a course[1] to grad students that hopefully leaves them with an understanding of how to use Unix and Perl to solve bioinformatics-related problems. We use a Mac-based computer lab because all Macs come with Unix and Perl installed. Many of our students are new to programming and many are new to Macs. Because of this, and because they need to use a code editor to write their Perl scripts, we have previously pointed them towards Fraise. Despite its age [2], this relatively lightweight editor has proved fine for the last few years that we have taught this course.

This year, however, Fraise proved problematic. The computer lab has now upgraded to OS X 10.8 which provides extra safeguards about what apps can be run. This Gatekeeper technology has been set up to only allow ‘signed’ apps[3]. The version of Fraise that we were using required administrator privileges for it to be opened (not possible in this computer lab).

My first thought was to direct students to download and install TextWrangler. This is an excellent, powerful, and well maintained code editor for Macs. Most importantly, it is free and also a signed app. However, it does try to install a helper app which caused a persistent dialog window to keep popping up during the installation. Clicking ‘cancel’ worked…but only after about 20 attempts[4]. I like TextWrangler as an app, but prefer the cleaner look of Fraise. So today I set out to find code editors for Macs that:

  • were free
  • could be run on the Macs in our computer lab (i.e. had to be signed apps)
  • were relatively simple to use and/or were easy on the eye

Here is what I came up with. These are all apps that seem to be under current development (updated at some point in 2013):

| App            | Size in MB | Free?      | Notes                                                                  |
|----------------|------------|------------|------------------------------------------------------------------------|
| Komodo Edit    | 301.1      | Yes        | Big because it is part of a larger IDE tool which is not free[5]       |
| Sublime Text 2 | 27.3       | sort of[6] | Gaining in popularity (a version 3 beta is also available)             |
| TextMate 2     | 30.3       | Yes        | While this is technically an ‘alpha’[7] release, it seems very stable. |
| TextWrangler   | 19.2       | Yes        | Very robust and venerable app. Free since 2005                         |
| Tincta 2       | 5.6        | Yes        | Small app, similar to Fraise in appearance                             |

 

If I had to suggest one, it would probably be Sublime Text 2 (though I will encourage students to buy this if they like it). TextMate 2 is a good second choice, particularly because it is also a very clean looking app. Of course, at some point we need to tell students about the joys of real text editors such as vi, vim, and emacs…but of course this might lead to hostilities![8]

  1. This course material is available for free and became the basis for our much more expansive book on the same topic

  2. Fraise is itself a fork of Smultron which stopped development in 2009 but which reappeared as a paid app in the Mac App Store in 2011.

  3. Those apps that are approved by Apple, even if they are not in the Mac App Store.

  4. Seriously, it takes a lot of clicks to make this dialog box go away. It then produces more pop-up dialogs asking whether you want to register, or install command-line tools.

  5. Currently $332 for a single license

  6. This is a paid app, but can be used in trial mode indefinitely with occasional reminders.

  7. TextMate 2 has been in alpha status since 2009

  8. Editor wars should be avoided if possible

How well do UC Davis Graduate Groups communicate their work to the wider world?

PhD students in our lab are mostly split between a couple of UC Davis's many graduate groups. A conversation with some of the students today about 'outreach' and 'social media' led me to wonder how well these graduate groups are communicating their presence to the outside world. The simplest ways of doing this would be:

  • maintain a current website for your graduate group (i.e. with news items)
  • use Facebook (ideally with an open group)
  • establish a blog
  • use twitter

I looked at 11 different graduate groups to see how well they ticked the above boxes. I might be missing some blogs, Facebook groups, and twitter accounts, but if I can't find the relevant details from a Google/Facebook/Twitter search, then I'm assuming that others won't discover them either. This is what I found:

Headline links take you to the home page for the respective graduate group.

Biochemistry, Molecular, Cellular and Developmental Biology (BMCDB)

  • No news page but has an actively maintained blog (easily linked to from above site)
  • Facebook group (open)
  • Active twitter account

Biomedical Engineering (BME)

Biostatistics

  • No news page, though there is a short 'announcements' box on main page
  • No Facebook group
  • No twitter account

Ecology

Epidemiology

  • No news page
  • Facebook group (closed)
  • No twitter account

Integrative Genetics and Genomics (IGG, formerly GGG)

  • Has a news page, but only one item from 2013, remaining items from 2009 and 2008!
  • Facebook group (closed, and have to search for GGG or IGG to find it)
  • No twitter account

Immunology

  • No news page
  • No Facebook group
  • No twitter account

Microbiology

  • No news page
  • Facebook group (closed)
  • No twitter account
  • Has separate website
  • Other: website told me I had to enable JavaScript to view their home page even though I have JavaScript enabled

Nutritional Biology

  • No news page
  • No Facebook group
  • No twitter account

Plant Biology

  • No news page
  • No Facebook group
  • No twitter account

Population Biology

  • No news page
  • No Facebook group
  • No twitter account

Please let me know of any updates or additions that I can make to this list.

So overall it is pretty poor. BMCDB outshines the others, though BME and Ecology also have a good presence on the web. In many ways, I think it looks worse to do these things badly than not to do them at all. Closed Facebook groups don't send out an inviting message, and having a 'news' page for your graduate group with items from 5 years ago also sends out the wrong signals.

It takes time and effort to maintain a social media presence, but it doesn't take much effort to at least maintain a news page or simple twitter account (even posting just 1–2 times a week is better than nothing).

Furthermore, the ability to show that you can communicate your work to the wider world is of increasing relevance when applying for grants. It can also raise your profile with your peers and be a useful addition on a resume that helps you stand out from other applicants. Finally, starting a blog or twitter account also helps you hone your writing skills (the latter is great for making you think about how to condense complex thoughts into 'bite size' chunks).

I hope that some of UC Davis's graduate groups make more of an effort in this area (and of course the same can be said for many of UC Davis's departmental and lab websites).

 

Updated 26th September: Added details of some graduate groups that do have blogs and/or websites but which, unhelpfully, are not linked to from their official graduate group webpage.

New JABBA awards for crimes against bioinformatics acronyms

JABBA is an acronym for Just Another Bogus Bioinformatics Acronym. I hand out JABBA awards to bioinformatics papers that reach just a little bit too far when trying to come up with acronyms (or initialisms). See details of previous winners here.

So without further delay, let's see who the new recipients are. As always, journals like Bioinformatics produce many strong candidates for JABBA awards. Here are three winners from the latest issue:

  1. mpMoRFsDB: a database of molecular recognition features in membrane proteins - What is it with the need for so much use of mixed-case characters these days? It makes names harder to read, and this database has a name which doesn't really roll off the tongue.
  2. CoDNaS: a database of conformational diversity in the native state of proteins - More mixed case madness. I'm really unsure how this database should be pronounced. 'Cod Naz'? 'Code Nass'? 'Coe Dee-en-ays'? The abstract suggests that the acronym stems from the name 'Conformational Diversity of Native State', so I guess we should be thankful that they avoided the potential confusion of naming this database 'CoDoNS'.
  3. GALANT: a Cytoscape plugin for visualizing data as functional landscapes projected onto biological networks - If you want your bioinformatics tool to have a cool sounding name, just give it that name. Don't feel that you somehow have to tenuously arrive at that name from a dubious method of selecting just the letters that make it work. The abstract of this paper reveals that GALANT stems from 'GrAph LANdscape VisualizaTion'. So the full name only consists of three words, and the initial letter of one of those words doesn't even make it into the acronym/initialism. They may as well call their tool 'GREAT' (GRaph landscapE visualizATion).
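
The GALANT complaint (a word whose initial letter never makes it into the acronym) can be turned into a small check. This is only a sketch: it spells the acronym out left to right using the earliest matching letters in the full name, which is just one plausible way of assigning letters, and the function name is hypothetical.

```python
def skipped_initials(acronym, full_name):
    """Return the words whose first letter is not used by the acronym.

    Letters are matched case-insensitively, left to right, always taking
    the earliest available match. Returns None if the acronym cannot be
    spelled in order from the letters of the full name at all.
    """
    words = full_name.split()

    # Flatten the name into (letter, word index, is first letter of word).
    letters = []
    for word_index, word in enumerate(words):
        for char_index, ch in enumerate(word):
            letters.append((ch.lower(), word_index, char_index == 0))

    used = set()  # indices of words that contributed their initial letter
    pos = 0
    for target in acronym.lower():
        # Advance to the next occurrence of the letter we need.
        while pos < len(letters) and letters[pos][0] != target:
            pos += 1
        if pos == len(letters):
            return None
        _, word_index, is_initial = letters[pos]
        if is_initial:
            used.add(word_index)
        pos += 1

    return [word for i, word in enumerate(words) if i not in used]


if __name__ == "__main__":
    print(skipped_initials("GALANT", "GrAph LANdscape VisualizaTion"))
    print(skipped_initials("GREAT", "GRaph landscapE visualizATion"))
```

An empty list means every word contributes its initial; anything else is a hint that the name was chosen first and the acronym worked out afterwards.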