New option to subscribe to this blog via email

Shamelessly borrowing this idea from Matt Gemmell's excellent blog, I thought I'd offer the chance to subscribe to my infrequent ramblings via email. If you enter your email address below, you can receive a weekly email (sent on Friday afternoons) with all of my posts for that week.

Your email address will only be used for the purpose of receiving my blog content and will not be shared with anyone else. Each email will offer a simple link by which to unsubscribe.

Some sage advice on avoiding confusing names for bioinformatics tools

SAGE is a molecular technique used to investigate the mRNA population from a chosen sample. It stands for Serial Analysis of Gene Expression and was first described back in 1995. The technique spawned spin-offs such as LongSAGE, RL-SAGE (Really Long SAGE), and SuperSAGE.

Although this technique has largely been superseded by other methods (such as RNA-Seq), it is still widely referenced (over 1,300 publications from 2013 mention this technique).

Fast-forward to the present day and I note that a new tool has just been published in the journal BMC Bioinformatics:

SAGE: String-overlap Assembly of GEnomes

As long as you query your favorite web search engine for some combination of 'SAGE' and 'genome assembly' you will probably find this tool and not end up on one of the half a million pages that talk about the other SAGE. I'm still not sure whether it is a bit risky giving a new tool the same name as such an established molecular technique.

All of this means that there is the potential for a certain company to use the aforementioned molecular technique to help annotate the output of the aforementioned computational technique, and apply both of these techniques to data from a certain plant. This could give you the world's first SAGE, SAGE, SAGE, sage genome!

Understanding CEGMA output: complete vs partial

On Friday I posted a reply to a thread on SEQanswers about CEGMA. I thought I'd include a modified version of that response here as it is an issue that gets raised fairly frequently. It concerns the 'complete' and 'partial' results that CEGMA includes in the final output file that it generates (typically called 'output.completeness_report'). Here were the two questions that were posted:

1) If a partial score is higher than a complete score then does this indicate that the assembly is fragmented?

2) Also, should the partial score be lower than the complete score in an ideal situation?

Remember, these are not scores per se. Both of these figures describe a number of core eukaryotic genes (CEGs) that the CEGMA pipeline predicts to be present in the input assembly file. The 'complete' set refers to those gene predictions which CEGMA classes as 'full-length'. Note that even if CEGMA says something is 'complete' there is still the possibility that parts of the protein are missing.

This is because CEGMA takes each CEG that it has predicted and aligns the protein sequence of that CEG to the HMM profile generated from the corresponding core gene family (made up of six proteins from Schizosaccharomyces pombe, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, and Homo sapiens). As I recall, if the alignment spans more than 70% of the protein profile, the CEG is considered to be 'complete'. This 70% threshold is an arbitrary cut-off, but seems to work well in finding genuine orthologs of CEGs.

Somewhat confusingly, although we consider 'partial' matches to be those below 70% (but above some unspecified minimum score), the output in output.completeness_report uses 'partial' to include both 'complete' and 'partial' matches. So the number of partial matches will always be at least as high as the number of complete matches.

You should look at both results. If you don't have 248 core genes 'completely' present, the next thing is to look at how many additional partial matches there are. If you have a result like 200/240 (i.e. 200 complete CEGs and 40 additional partial matches) then this at least suggests that most of the core gene set is present in your assembly, but some may be split across contigs or missing from the assembly. Remember, CEGMA only looks for genes that are located inside individual contigs or scaffolds. Theoretically, you could have an assembly that splits every gene across contigs which might lead to a 'complete' result of zero, and a partial result of '248'.

From looking at results of many different runs of CEGMA, it is common to see something like 90–95% of core genes present in the 'complete' category, and another 1–5% present as partial genes (for good assemblies at least). I have also seen one case where the results were 157/223. This is more unusual, suggesting that a relatively large number (27%) of the core genes were present as fragments. This might simply reflect lots of short contigs/scaffolds in the assembly. In contrast to this, one of the best results that I have seen is 245/248. It is rare to see all core genes present, even when you allow for partial matches.
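
To make the arithmetic concrete, here is a minimal Python sketch that turns a complete/partial pair into the number of additional partial matches and the corresponding percentages of the 248 CEGs. This is not part of CEGMA: the function name summarize_cegma and the keys it returns are hypothetical, chosen only for illustration. It assumes, as described above, that the 'partial' figure already includes the 'complete' matches.

```python
TOTAL_CEGS = 248  # size of the core eukaryotic gene set that CEGMA looks for


def summarize_cegma(complete, partial, total=TOTAL_CEGS):
    """Summarize a 'complete/partial' pair from a CEGMA completeness report.

    Assumes 'partial' already includes the 'complete' matches, so it can
    never be smaller than 'complete'.
    """
    if not 0 <= complete <= partial <= total:
        raise ValueError("expected 0 <= complete <= partial <= total")

    additional = partial - complete  # CEGs found only as fragments
    return {
        "complete_pct": 100 * complete / total,
        "additional_partial": additional,
        "additional_partial_pct": 100 * additional / total,
        "missing": total - partial,  # CEGs with no match at all
    }


if __name__ == "__main__":
    # The two results mentioned in the text above.
    for complete, partial in [(157, 223), (245, 248)]:
        print(f"{complete}/{partial}:", summarize_cegma(complete, partial))
```

For the 157/223 case this gives 66 additional partial matches, which is roughly 27% of the 248 core genes.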

Below is a chart that shows the results from 50 runs of CEGMA against different assemblies. The x-axis shows the percentage of 248 CEGs that were completely present, and the y-axis shows the percentage of CEGs that were only partially present.

Is yours bigger than mine? Big data revisited

Google Scholar lists 2,090 publications that contain the phrase 'big data' in their title. And that's just from the first 9 months of 2014! The titles of these articles reflect the interest/concern/fear in this increasingly popular topic:

One paper, Managing Big Data for Scientific Visualization, starts out by identifying a common challenge of working with 'big data':

Many areas of endeavor have problems with big data…while engineering and scientific visualization have also faced the problem for some time, solutions are less well developed, and common techniques are less well understood

They then go on to discuss some of the problems of storing 'big data', one of which is listed as:

Data too big for local disk — clearly, not only do some of these data objects not fit in main memory, but they do not even fit on local disk on most workstations. In fact, the largest CFD study of which we are aware is 650 gigabytes, which would not fit on centralized storage at most installations!

Wait, what!?! 650 GB is too large for storage? Oh yes, that's right. I forgot to mention that this paper is from 1997. My point is that 'big data' has been a problem for some time now and will no doubt continue to be a problem.

I understand that having a simple, user-friendly, label like 'big data' helps with the discussion, but it remains such an ambiguous, and highly relative term. It's relative because whether you deem something to be 'big data' or not might depend heavily on the size of your storage media and/or the speed of your networking infrastructure. It's also relative in terms of your field of study; a typical set of 'big data' in astrophysics might be much bigger than a typical set of 'big data' in genomics.

Maybe it would help to use big data™ when talking about any data that you like to think of as big, and then use BIG data for those situations where your future data acquisition plans cause your sys admin to have sleepless nights.

The problem with posters at academic conferences

I recently attended the Genome Science: Biology, Technology, and Bioinformatics meeting in the UK, where I presented a poster. As I was walking around, looking at other people's posters, I was reminded of the common problem that occurs with many academic posters. Here are some pseudo-anonymous examples to show what I mean:

The problem here is not with the total amount of text — though that can sometimes be an issue — but with the width of the text. These posters are 84 cm (33 inches) wide, and it is not ideal to create text blocks that span the entire width of the poster. The reasons behind this are the same reasons why you never see newspapers display text like this…we are not very good at reading information in this manner.

To quote from Lynch & Horton's Web Style Guide; specifically the section on Page Width and Line Length:

The ideal line length for text layout is based on the physiology of the human eye. The area of the retina used for tasks requiring high visual acuity is called the macula. The macula is small, typically less than 15 percent of the area of the retina. At normal reading distances the arc of the visual field covered by the macula is only a few inches wide—about the width of a well-designed column of text, or about twelve words per line. Research shows that reading slows as line lengths begin to exceed the ideal width, because the reader then needs to use the muscles of the eye or neck to track from the end of one line to the beginning of the next line. If the eye must traverse great distances on a page, the reader must hunt for the beginning of the next line.

In contrast to the above examples, there were a couple of posters at the #UKGS2014 meeting that I thought were beautifully displayed. Bright, colorful, clearly laid out, not too much text, and good use of big fonts. Congratulations to Warry Owen et al. and Karim Gharbi et al. for your poster presentation prowess!

When is a citation not a citation?

Today I received a notification from Google Scholar that one of my papers had been cited. I often have a quick look at such papers to see how our work is being referenced. The article in question was from the Proceedings of the 3rd Annual Symposium on Biological Data Visualization: Data Analysis and Redesign Contests:

FixingTIM: interactive exploration of sequence and structural data to identify functional mutations in protein families

The paper describes a tool that helps "identify protein mutations across a family of structural models and to help discover the effect of these mutations on protein function". I was a bit surprised by this because this isn't a topic that I've published on. So I looked to see what paper of mine was being cited and how it was being cited. Here is the relevant sentence from the background section of the paper:

To improve the exploration process, many efforts have been made, from folding the sequences through classification [1,2], to tools for 3D view exploration [3] and to web-based applications which present large amounts of information to the users [4].

Citation number 2 is the paper on which I am a co-author:

  • Chen N, Harris TW, Antoshechkin I, Bastiani C, Bieri T, Blasiar D, Bradnam K, Canaran P, Chan J, Chen C, Chen WJ, Cunningham F, Davis P, Kenny E, Kishore R, Lawson D, Lee R, Muller H, Nakamura C, Pai S, Ozersky P, Petcherski A, Rogers A, Sabo A, Schwarz EM, Van Auken K, Wang Q, Durbin R, Spieth J, Sternberg PW, Stein LD: Wormbase: A comprehensive data resource for Caenorhabditis biology and genomics. Nucleic Acids Res 2005, 33(1):383-389.

The cited paper simply describes the WormBase database and includes only a passing reference to the fact that WormBase contains some links to protein structures (when known), but that's about it. The WormBase paper doesn't mention 'folding' or 'classification' anywhere, which makes it seem a really odd choice of paper to be cited. It makes me wonder how many other papers end up gaining seemingly spurious citations like this one.

Thoughts on the supply of bioinformatics services and training in the UK

I am currently at the 2014 UK Genome Sciences meeting (hashtag #UKGS2014). It has been a long time since I have been at a UK science conference and it has been good to meet old colleagues and acquaintances who I have known from various stages of my career.

From informal chats with various people, it seems that UK universities are tackling their bioinformatics needs in different ways. Some have specialized facilities that try to meet the bioinformatics need from local users (and potentially from those further afield). E.g. the University of Surrey has a Bioinformatics Core Facility, Newcastle University has a Bioinformatics Support Unit, and here at Oxford there is the Computational Biology Research Group.

These examples represent core facilities with dedicated staff. An alternative approach is to bring together — physically or virtually — existing bioinformatics talent, with a view that they will be able to help others. This is the strategy taken by the new Bioinformatics Hub at the University of Sheffield, which brings together six talented folk who are based in different departments. The success of strategies like this may heavily depend on having enough skilled bioinformatics faculty who also have enough time to help others.

Other universities seem to lack any central pooling of bioinformatics expertise, and instead rely on people doing bioinformatics themselves or outsourcing it to places like TGAC. The former approach (doing it yourself) will be fine for some people, particularly those who are comfortable learning new computational skills themselves, but this will not be a good fit for everyone. 

If you are not outsourcing your bioinformatics and you don't have the necessary skills yourself, then the other approach is to attend one or more training courses. Three places that seem to be leading the field for bioinformatics training are TGAC, CGAT, and Edinburgh Genomics…and all three have a heavy presence at this conference.

Depending on your definition, bioinformatics has been around — as either a recognized skill set, or a field of study — since the early 1990s. The number of people who might consider themselves a bioinformatician has probably grown exponentially since then. Likewise, the demand for skilled bioinformaticians, or for facilities that offer bioinformatics services and training, continues to grow. Clearly, there are different ways of meeting this demand.

The current diversity of approaches to bioinformatics services and training presumably reflects the local supply of, and demand for, such services. If you are about to join a new university, and if you plan on needing some bioinformatics help at some point, it may be useful to first find out more about that university's bioinformatics strategy.

My poster for the UK Genome Sciences meeting is about a new version of our IMEter software

One of the many projects I am involved with looks at Intron-mediated enhancement (IME) of gene expression. Our collaboration with Alan Rose at UC Davis has been a fruitful one, and has led to the development of computational tools that can predict how much an intron might enhance expression.

The initial version of what we called 'the IMEter' was published in 2008 and an improved v2.0 version was published in 2011. The online version of this software only lets you test Arabidopsis introns…not so useful when there are now so many different sequenced plant genomes.

We addressed this limitation in a new — as yet unpublished — v2.1 version which is available online. IMEter v2.1 can now test the expression enhancing ability of introns from 34 different plant species.

The new IMEter is the subject of my poster at the forthcoming UK Genome Sciences meeting in Oxford. The poster, available below via Figshare, explains a little more about how the new version of the IMEter came about. It also discusses some of the problems that arise in trying to adapt a software tool from working with one, very well annotated, genome, to working with many different genomes of varying quality.

5 things to consider when publishing links to academic websites

Preamble

One of the reasons I've been somewhat quiet on this blog recently is because I've been involved with a big push to finish the new Genome Center website. This has been in development for a long time and provides a much needed update to the previous website that was really showing its age. Compare and contrast:

The old Genome Center website…what's with all that whitespace in the middle?

The new Genome Center website, less than 24 hours old at the time of writing.

This type of redesign is a once-in-a-decade event, and provides the opportunity not just to add new features (e.g. proper RSS news feed, Twitter account, YouTube channel, responsive website design etc.), but also to clean up a lot of legacy material (e.g. webpages for people who left the Genome Center many years ago).

This cleanup prompted me to check Google Scholar to see if there are any published papers that include links to Genome Center websites. This includes links to the main site and also to all of the many subdomains that exist (for different labs, core facilities etc.). It's pretty easy to search Google Scholar for the core part of a URL, e.g. genomecenter.ucdavis.edu, and I would encourage anyone else who is looking after an aging academic website to do so.

Here are some of the key things that I noticed:

  1. Most mentions of Genome Center URLs are to resources by Peggy Farnham's lab. Although Peggy left UC Davis several years ago (she is now here), her — very old, and out of date — lab page still exists (http://farnham.genomecenter.ucdavis.edu).
  2. Many people link to Craig Benham's work using http://genomecenter.ucdavis.edu/benham/. This redirects to Craig's own lab site (http://benham.genomecenter.ucdavis.edu), but the redirect doesn't quite work when people have linked to a specific tool (e.g. http://genomecenter.ucdavis.edu/benham/sidd). This redirects to http://benham.genomecenter.ucdavis.edu/sidd which then produces a 404 error (page not found).
  3. There are many papers that link to resources from Jonathan Eisen's group and these papers all point to various pages on a domain that is either down or no longer in existence (http://bobcat.genomecenter.ucdavis.edu).

There is an issue here of just how long it is valid to try to keep links active and working. In the case of Peggy Farnham, she no longer works at UC Davis, so would it be okay if I redirected all of her web traffic to her new website? I plan to do this but will let Peggy know so that she can maybe arrange to copy some of the existing material over to her new site.

In the case of Craig's lab, maybe he should be adding his own redirect links for tools that now have new URLs. What would also help would be to have a dedicated 404 page which might point to the likely target page that people are looking for (a completely blank 'not found' page is rarely ever helpful).
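
As a rough illustration of what a more helpful 404 page could do, here is a minimal Python sketch that guesses the likely target by comparing the requested path against a list of paths that do exist. Everything site-specific here is an assumption: the paths in KNOWN_PATHS are hypothetical, and the function names are made up for this example. It uses fuzzy string matching from Python's standard difflib module, which is just one of several ways this could be done.

```python
from difflib import get_close_matches

# Hypothetical paths that exist on the current site (illustration only).
KNOWN_PATHS = ["/tools/sidd/", "/people/", "/publications/"]


def suggest_target(requested_path, known_paths=KNOWN_PATHS, cutoff=0.5):
    """Return the known path most similar to the requested one, or None."""
    matches = get_close_matches(requested_path, known_paths, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def not_found_message(requested_path):
    """Build the text for a 404 page, with a suggestion when one is found."""
    suggestion = suggest_target(requested_path)
    if suggestion is None:
        return f"Sorry, {requested_path} was not found."
    return (
        f"Sorry, {requested_path} was not found. "
        f"Were you looking for {suggestion}?"
    )


if __name__ == "__main__":
    print(not_found_message("/sidd"))
```

The cutoff value is a judgment call: set it too low and the page will suggest unrelated targets, too high and it will rarely suggest anything.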

In the case of Jonathan's lab, there is a big problem here in that all of the papers are tied to a very specific domain name (which itself has no obvious naming connection to his lab). You can always rename a new machine to be called 'bobcat', but maybe there are better things we should be doing to avoid these situations arising in the first place…

5 things to consider when publishing links to academic websites

  1. Don't do it! Use resources like Figshare, Github, or Dryad if at all possible. Of course this might not be possible if you are publishing some sort of online software tool.
  2. If you have to link to a lab webpage, consider spending $10 a year or so and buying your own domain name that you can take with you if you ever move anywhere else in future. I bought http://korflab.com for my boss, and I see that Peggy Farnham is now using http://farnhamlab.com.
  3. If you can't, or don't want to, buy your own domain name, try using a generic lab domain name and not a machine-specific domain name. E.g. our lab's website is on a machine called 'raiden' and can be accessed at http://raiden.genomecenter.ucdavis.edu. But we only ever use the domain name http://korflab.ucdavis.edu which allows us to use a different machine as the server without breaking any links.
  4. If you must link to a specific machine, try avoiding URLs that get too complex. E.g. http://supersciencelab.ucdavis.edu/Tools/Foo/v1/foo_v1.cgi. The more complex the URL, the more likely it will break in future. Instead, link to your top level domain (http://supersciencelab.ucdavis.edu) and provide clear links on that page on how to find things.
  5. Any time you publish a link to a URL, make sure you keep a record of this in a simple text file somewhere. This might really help if/when you decide to redesign your website 5 years from now and want to know whether you might be breaking any pre-existing links.
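
The last point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record format (one URL per line, optionally followed by a note, with '#' for comments) is hypothetical, as are the function names. The idea is to compare the URLs you have published against the URLs that a redesigned site will still serve, and list any that would break.

```python
from urllib.parse import urlparse

# Hypothetical record format: one published URL per line, optional note after it.
EXAMPLE_RECORD = """\
# Links we have published in papers
http://supersciencelab.ucdavis.edu  lab home page
http://supersciencelab.ucdavis.edu/Tools/Foo/v1/foo_v1.cgi  Foo tool
"""


def parse_link_record(text):
    """Extract URLs from the record, skipping blank lines and comments."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        urls.append(line.split()[0])  # the URL is the first field on the line
    return urls


def normalize(url):
    """Reduce a URL to (host, path) so trivial differences are ignored."""
    parts = urlparse(url)
    path = parts.path.rstrip("/") or "/"
    return (parts.netloc.lower(), path)


def find_links_at_risk(published_urls, new_site_urls):
    """Return published URLs that the redesigned site will no longer serve."""
    served = {normalize(url) for url in new_site_urls}
    return [url for url in published_urls if normalize(url) not in served]


if __name__ == "__main__":
    published = parse_link_record(EXAMPLE_RECORD)
    # Hypothetical list of URLs that will exist after the redesign.
    planned = ["http://supersciencelab.ucdavis.edu/"]
    for url in find_links_at_risk(published, planned):
        print("Would break:", url)
```

A check like this only compares lists, so it says nothing about redirects; any published URL it flags is a candidate for a redirect rather than proof of a broken link.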

 

Random capitalization strikes again, or am I only dreaming?

A paper in BMC Bioinformatics describes a new piece of software:

morFeus: a web-based program to detect remotely conserved orthologs using symmetrical best hits and orthology network scoring

Naturally, my first instincts were to check whether this was a name worthy of a JABBA award, but morFeus does not appear to be an acronym or initialism. I say that because although the name morFeus appears 116 times in the manuscript, no explanation is ever given as to why the software has that name.

My first thought was that maybe it is a reference to Morpheus, the Greek god of dreams, or maybe to the character of Morpheus from The Matrix. I don't really care about why it is called morFeus — a name that my spell checker keeps correcting to morgues — but it is another example of the, seemingly random, capitalization of bioinformatics tools.

When I visited the web server for the morFeus tool, I did notice something in small print at the bottom of the page:

  • morFeus stands for meta-analysis based orthology finder using symmetrical best hits

This is something that also appears as a keyword in the manuscript, but it is not entirely obvious as to whether this really is meant to be an initialism, or why the F is capitalized. I'm completely stuMped.