Thoughts on the supply of bioinformatics services and training in the UK

September 02, 2014 by Keith Bradnam

I am currently at the 2014 UK Genome Sciences meeting (hashtag #UKGS2014). It has been a long time since I have been at a UK science conference and it has been good to meet old colleagues and acquaintances who I have known from various stages of my career.

From informal chats with various people, it seems that UK universities are tackling their bioinformatics needs in different ways. Some have specialized facilities that try to meet the bioinformatics need from local users (and potentially from those further afield). E.g. the University of Surrey has a Bioinformatics Core Facility, Newcastle University has a Bioinformatics Support Unit, and here at Oxford there is the Computational Biology Research Group.

These examples represent core facilities with dedicated staff. An alternative approach is to bring together — physically or virtually — existing bioinformatics talent, with a view that they will be able to help others. This is the strategy taken by the new Bioinformatics Hub at the University of Sheffield, which brings together six talented folk who are based in different departments. The success of strategies like this may heavily depend on having enough skilled bioinformatics faculty who also have enough time to help others.

Other universities seem to lack any central pooling of bioinformatics expertise, and instead rely on people doing bioinformatics themselves or outsourcing it to places like TGAC. The former approach (doing it yourself) will be fine for some people, particularly those who are comfortable learning new computational skills themselves, but this will not be a good fit for everyone.

If you are not outsourcing your bioinformatics and you don't have the necessary skills yourself, then the other approach is to attend one or more training courses. Three places that seem to be leading the field for bioinformatics training are TGAC, CGAT, and Edinburgh Genomics…and all three have a heavy presence at this conference.

Depending on your definition, bioinformatics has been around — as either a recognized skill set, or a field of study — since the early 1990s. The number of people who might consider themselves a bioinformatician has probably grown exponentially since then. Likewise, the demand for skilled bioinformaticians, or for facilities that offer bioinformatics services and training, continues to grow. Clearly, there are different ways of meeting this demand.

The current diversity of approaches to bioinformatics services and training presumably is a reflection on the local supply of, and demand for, such services. If you are about to join a new university, and if you plan on needing some bioinformatics help at some point, it may be useful to first find out more about that university's bioinformatics strategy.

My poster for the UK Genome Sciences meeting is about a new version of our IMEter software

August 22, 2014 by Keith Bradnam

One of the many projects I am involved with looks at Intron-mediated enhancement (IME) of gene expression. Our collaboration with Alan Rose at UC Davis has been a fruitful one, and has led to the development of computational tools that can predict how much an intron might enhance expression.

The initial version of what we called 'the IMEter' was published in 2008 and an improved v2.0 version was published in 2011. The online version of this software only lets you test Arabidopsis introns…not so useful when there are now so many different sequenced plant genomes.

We addressed this limitation in a new — as yet unpublished — v2.1 version which is available online. IMEter v2.1 can now test the expression enhancing ability of introns from 34 different plant species.

The new IMEter is the subject of my poster at the forthcoming UK Genome Sciences meeting in Oxford. The poster, available below via Figshare, explains a little more about how the new version of the IMEter came about. It also discusses some of the problems that arise in trying to adapt a software tool from working with one, very well annotated, genome, to working with many different genomes of varying quality.

5 things to consider when publishing links to academic websites

August 20, 2014 by Keith Bradnam

Preamble

One of the reasons I've been somewhat quiet on this blog recently is because I've been involved with a big push to finish the new Genome Center website. This has been in development for a long time and provides a much needed update to the previous website that was really showing its age. Compare and contrast:

The old Genome Center website…what's with all that whitespace in the middle?

The new Genome Center website, less than 24 hours old at the time of writing.

This type of redesign is a once-in-a-decade event, and provides the opportunity not just to add new features (e.g. proper RSS news feed, twitter account, YouTube channel, responsive website design etc.), but also to clean up a lot of legacy material (e.g. webpages for people who left the Genome Center many years ago).

This cleanup prompted me to check Google Scholar to see if there are any published papers that include links to Genome Center websites. This includes links to the main site and also to all of the many subdomains that exist (for different labs, core facilities etc.). It's pretty easy to search Google Scholar for the core part of a URL (e.g. genomecenter.ucdavis.edu), and I would encourage anyone else who is looking after an aging academic website to do so.

Here are some of the key things that I noticed:

Most mentions of Genome Center URLs are to resources by Peggy Farnham's lab. Although Peggy left UC Davis several years ago (she is now here), her — very old, and out of date — lab page still exists (http://farnham.genomecenter.ucdavis.edu).
Many people link to Craig Benham's work using http://genomecenter.ucdavis.edu/benham/. This redirects to Craig's own lab site (http://benham.genomecenter.ucdavis.edu), but the redirect doesn't quite work when people have linked to a specific tool (e.g. http://genomecenter.ucdavis.edu/benham/sidd). This redirects to http://benham.genomecenter.ucdavis.edu/sidd which then produces a 404 error (page not found).
There are many papers that link to resources from Jonathan Eisen's group and these papers all point to various pages on a domain that is either down or no longer in existence (http://bobcat.genomecenter.ucdavis.edu).

There is an issue here of just how long is it valid to try to keep links active and working. In the case of Peggy Farnham, she no longer works at UC Davis, so is it okay if I redirected all of her web traffic to her new website? I plan to do this but will let Peggy know so that she can maybe arrange to copy some of the existing material over to her new site.

In the case of Craig's lab, maybe he should be adding his own redirect links for tools that now have new URLs. What would also help would be to have a dedicated 404 page which might point to the likely target page that people are looking for (a completely blank 'not found' page is rarely ever helpful).
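As a rough sketch of the two fixes suggested above (per-tool redirects plus a helpful fallback instead of a blank 'not found' page), here is a minimal Python example. The paths, the example.org hosts, and the function name are all hypothetical; nothing here describes how the Genome Center servers are actually configured.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from old path prefixes to their new locations.
REDIRECTS = {
    "/oldlab": "http://lab.example.org",
    "/oldlab/tool": "http://lab.example.org/software/tool",
}

# Hypothetical page that explains where moved resources now live.
FALLBACK = "http://lab.example.org/moved"


def resolve(old_url):
    """Return the new URL for old_url, preferring the most specific prefix."""
    path = urlsplit(old_url).path.rstrip("/")
    # Try longer prefixes first so a tool-specific rule beats the lab-wide one.
    for prefix in sorted(REDIRECTS, key=len, reverse=True):
        if path == prefix or path.startswith(prefix + "/"):
            remainder = path[len(prefix):]
            return REDIRECTS[prefix] + remainder
    # Nothing matched: send people somewhere useful rather than to a blank 404.
    return FALLBACK
```

The point of trying the longest prefix first is that a tool which has moved to a new path gets its own rule, while everything else under the old lab path is carried across unchanged.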

In the case of Jonathan's lab, there is a big problem here in that all of the papers are tied to a very specific domain name (which itself has no obvious naming connection to his lab). You can always rename a new machine to be called 'bobcat', but maybe there are better things we should be doing to avoid these situations arising in the first place…

5 things to consider when publishing links to academic websites

Don't do it! Use resources like Figshare, Github, or Dryad if at all possible. Of course this might not be possible if you are publishing some sort of online software tool.
If you have to link to a lab webpage, consider spending $10 a year or so and buying your own domain name that you can take with you if you ever move anywhere else in future. I bought http://korflab.com for my boss, and I see that Peggy Farnham is now using http://farnhamlab.com.
If you can't, or don't want to, buy your own domain name, try using a generic lab domain name and not a machine-specific domain name. E.g. our lab's website is on a machine called 'raiden' and can be accessed at http://raiden.genomecenter.ucdavis.edu. But we only ever use the domain name http://korflab.ucdavis.edu which allows us to use a different machine as the server without breaking any links.
If you must link to a specific machine, try avoiding URLs that get too complex. E.g. http://supersciencelab.ucdavis.edu/Tools/Foo/v1/foo_v1.cgi. The more complex the URL, the more likely it will break in future. Instead, link to your top level domain (http://supersciencelab.ucdavis.edu) and provide clear links on that page on how to find things.
Any time you publish a link to a URL, make sure you keep a record of this in a simple text file somewhere. This might really help if/when you decide to redesign your website 5 years from now and want to know whether you might be breaking any pre-existing links.
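The last suggestion is easy to automate. The sketch below assumes a record file with one published link per line, in the form URL, tab, where it was published; that layout, the function names, and the example.org URLs are my own invention for illustration. Given the list of URLs that a redesigned site will serve, it reports which published links would break.

```python
def load_published_links(text):
    """Parse lines of 'URL<TAB>where it was published'; skip blanks and # comments."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        url, _, source = line.partition("\t")
        records.append((url.strip(), source.strip()))
    return records


def links_at_risk(records, urls_on_new_site):
    """Return the recorded links that the redesigned site would no longer serve."""
    # Ignore trailing slashes so 'http://x.org/' and 'http://x.org' compare equal.
    live = {url.rstrip("/") for url in urls_on_new_site}
    return [(url, source) for url, source in records if url.rstrip("/") not in live]


# Hypothetical record file contents.
record_file = (
    "# URL<TAB>where published\n"
    "http://lab.example.org/tools/foo\thypothetical paper A\n"
    "http://lab.example.org/\thypothetical paper B\n"
)

for url, source in links_at_risk(load_published_links(record_file),
                                 ["http://lab.example.org"]):
    print(f"Would break: {url} (cited in {source})")
```

Running this prints only the deep link to the tool, which is also a small demonstration of why linking to the top level domain is the safer choice.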

Random capitalization strikes again, or am I only dreaming?

August 08, 2014 by Keith Bradnam

A paper in BMC Bioinformatics describes a new piece of software:

morFeus: a web-based program to detect remotely conserved orthologs using symmetrical best hits and orthology network scoring

Naturally, my first instincts were to check whether this was a name worthy of a JABBA award, but morFeus does not appear to be an acronym or initialism. I say that because although the name morFeus appears 116 times in the manuscript, no explanation is ever given as to why the software has that name.

My first thought was that maybe it is a reference to Morpheus, the Greek god of dreams, or maybe to the character of Morpheus from The Matrix. I don't really care about why it is called morFeus — a name that my spell checker keeps correcting to morgues — but it is another example of the, seemingly random, capitalization of bioinformatics tools.

When I visited the web server for the morFeus tool, I did notice something in small print at the bottom of the page:

morFeus stands for meta-analysis based orthology finder using symmetrical best hits

This is something that also appears as a keyword in the manuscript, but it is not entirely obvious as to whether this really is meant to be an initialism, or why the F is capitalized. I'm completely stuMped.

101 questions with a bioinformatician #13: Michael Schatz

August 07, 2014 by Keith Bradnam

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

Mike Schatz is an Assistant Professor of Quantitative Biology at Cold Spring Harbor Laboratory. Prior to getting into the world of genomics and bioinformatics, Mike worked for a startup company that specialized in network security (working on encryption software for online banking amongst other things):

It was unplanned serendipity, but code breaking turned out to be perfect training for genomics, and the startup turned out to be perfect training to become a PI.

His research focuses on the development of scalable algorithms and systems to analyze biological sequence data, concentrating on the alignment, assembly, and analysis of high-throughput DNA sequencing reads. If you visit his lab research page, you will see an impressive list of software tools that he has helped develop.

Aside from his contributions to genomics, I am perhaps more impressed that Mike has made available slides from all of his major research presentations going back to 2005 (over 80 talks). I wish more scientists were as dedicated to sharing talks like this. You can find out more about Mike from his lab website or by following him on twitter (@mike_schatz). And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

What brought me into the field was the opportunity to apply my training and experience in computer science to really meaningful problems in biology and medicine. I’m fascinated by the deep connections between how computers and software are organized and operate compared to how cells and genomes are replicated, transcribed, and evolve.

Right now is by far the most fantastic time to be in a field that is driven by rapid improvements to the biotechnology. How amazing that just 15 or 20 years ago it would have been cheaper and easier to land a team on the moon than to sequence their genomes, but now we do it on a routine basis!

This growth has fundamentally and forever changed the types of questions that we can even ask. The really exciting and scary point is we are still at the very beginning, and are still feeling around in the dark. I recently gave a talk about how long we should expect to wait until we have sequenced one billion genomes (hint: it is a lot sooner than you might expect).

010. What's something that you *don't* enjoy about current bioinformatics research?

The FASTQ “file format”. Do we really need the read identifier listed twice (sometimes), newlines within a single record, and an unspecified encoding scheme for quality values that changes every so often depending on when the software was run?

I cringe every time I have to teach it to a new student. There is no rationale to it and it's so obviously flawed. It just feels dirty to teach it. I like to think that in 10 or 100 years this will all be sorted out, but today, this and so many other poorly designed systems are entrenched into our day-to-day lives. It is a constant, if dull, irritation that makes everything slow to change, and brittle to use.
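(An editorial aside, not part of Mike's answer: the 'unspecified encoding scheme' complaint is easy to illustrate. Quality characters are ASCII codes shifted by an offset of either 33 or 64 depending on which software wrote the file, and the file itself does not say which. The sketch below is a common heuristic rather than a definitive test, and the function names are made up for this example.)

```python
def guess_quality_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from observed quality characters.

    Returns None when there is no data or the observed range fits both schemes.
    """
    codes = [ord(ch) for qual in quality_strings for ch in qual]
    if not codes:
        return None
    lowest, highest = min(codes), max(codes)
    # Characters below ';' (ASCII 59) cannot occur in the older +64 encodings.
    if lowest < 59:
        return 33
    # Heuristic: typical +33 files rarely go above 'J' (ASCII 74).
    if highest > 74:
        return 64
    return None


def decode_quality(quality_string, offset):
    """Convert one quality string to a list of integer scores."""
    return [ord(ch) - offset for ch in quality_string]
```

A file whose quality characters all fall between ';' and 'J' is genuinely ambiguous, which is why the function returns None in that case instead of guessing.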

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

Take more probability and statistics. So much of my life now is spent looking for patterns in enormously large and complex data that the only hope is through statistical analysis. I used to stay up late reading algorithms textbooks, but now this is where I spend my free time.

The one really successful tip I’ve learned is that, even though my intuition for probability is poor, I can often work backwards using a simulator. I’ll write a little code so I can look at what happens to the distribution if this rate goes up, or if the genome was twice as complex. I then use that to guide me to the analytical form. I always understand an algorithm better if I implement it from scratch, and I think that this is an extension of that concept.

100. What's your all-time favorite piece of bioinformatics software, and why?

Do I have to pick just one? Ben Langmead blew my mind when he taught me about the FM-index. A very close second was the genome assembler Art Delcher wrote in about 50 lines of awk. More recently my lab went over the SGA algorithm from Simpson and Durbin in great detail. All of these have beauty in their simplicity and elegance — like a great work of art everything locks together perfectly in step.

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

S – It is the strongest code, of course! ;)

Is there ever a valid reason for storing bioinformatics data in a Microsoft Word document?

August 05, 2014 by Keith Bradnam

Short answer

No.

Long answer

Noooooooooo!!!

Background

Yesterday I finished reviewing a paper. My review was generally very positive and I enjoyed reading the manuscript. The authors linked to some supplementary files that were available on another website. As I'm the type of reviewer that likes to look at every file that is part of a submission, I logged on to the website to see what files were there.

The first file that was listed had a 'docx' extension. Someone might argue that if this file contained a textual description of how the other files were being generated, then maybe there is nothing wrong with somebody using Microsoft Word. I would disagree. Any sort of documentation should ideally be in plain text, and maybe PDF as an alternative.

In any case, I opened the file to see what we were dealing with. The file contained a list of several thousand gene identifiers, one identifier per line. There was nothing else in the thirty-six page file.

This is not an acceptable practice! Use of Microsoft Word to store bioinformatics data will only ever result in unhappiness, frustration, and anger. And we all know what anger leads to…

Supplemental madness: on the hunt for 'Figure S1'

August 02, 2014 by Keith Bradnam

I've just been looking at this new paper by Vanneste et al. in Genome Research:

Analysis of 41 plant genomes supports a wave of successful genome duplications in association with the Cretaceous–Paleogene boundary

I was curious as to where their 41 plant genomes came from, so I jumped to the Methods section to see:

No surprise there, this is exactly the sort of thing you expect to find in the supplementary material of a paper. So I followed the link to the supplementary material only to see this:

So the 'Supplemental Material' contains 'Supplemental Information' and the — recursively named — 'Supplemental Material'. So where do you think Supplemental Table S1 is? Well it turns out that this table is in the Supplemental Material PDF. But when looking at both of these files, I noticed something odd. Here is Figure S1 from the Supplemental Information:


And here is part of another Figure S1 from the Supplemental Material file:

You will notice that the former figure S1 (in the Supplemental Information) is actually called a Supporting Figure. I guess this helps distinguish it from the completely-different-and-in-no-way-to-be-confused Supplementary Figure S1.

This would possibly make some sort of sense if the main body of the paper distinguished between the two different types of Figure S1. Except the paper mentions 'Supplemental Figure S1' twice (not even 'Supplementary Figure S1') and doesn't mention Supporting Figure S1 at all (or any supporting figures for that matter)!

What does all of this mean? It means that Supplementary Material is a bit like the glove compartment in your car: a great place to stick all sorts of stuff that will possibly never be seen again. Maybe we need better reviewer guidelines to stop this sort of confusion happening?

The Assemblathon Gives Back (a bit like The Empire Strikes Back, but with fewer lightsabers)

August 01, 2014 by Keith Bradnam

So we won an award for Open Data. Aside from a nice-looking slab of glass that is weighty enough to hold down all of the papers that someone with a low K-index has published, the award also comes with a cash prize.

Naturally, my first instinct was to find the nearest sculptor and request that they chisel a 20 foot recreation of my brain out of Swedish green marble. However, this prize has been — somewhat annoyingly — awarded to all of the Assemblathon 2 co-authors.

While we could split the cash prize 92 ways, this would probably only leave us with enough money to buy a packet of pork scratchings each (which is not such a bad thing if you are a fan of salty, fatty, porcine goodness).

Instead we decided — and by 'we', I'm really talking about 'me' — to give that money back to the community. Not literally of course…though the idea of throwing a wad of cash into the air at an ISMB meeting is appealing.

Rather, we have worked with the fine folks at BioMed Central (that's BMC to those of us in the know), to pay for two waivers that will cover the cost of Article Processing Charges (that's APCs to those of us in the know). We decided that these will be awarded to papers in a few select categories relating to 'omics' assembly, Assemblathon-like contests, and things to do with 'Open Data' (sadly, papers that relate to 'pork scratchings' are not eligible).

We are calling this event the Assemblathon 'Publish For Free' Contest (that's APFFC to those of us in the know), and you can read all of the boring details and contest rules on the Assemblathon website.

The Tesla index: a measure of social isolation for scientists

July 31, 2014 by Keith Bradnam

Abstract

In the era of social media there are now many different ways that a scientist can build their public profile; the publication of high-quality scientific papers being just one. While publishing journal and book articles is a valuable tool for the dissemination of knowledge, there is a danger that scientists become isolated, and remain disconnected from reality, sitting alone in their ivory towers. Such reclusiveness has long been all too common among academic scientists and we are losing sight of other key outreach efforts such as the use of social media as a tool for communicating science. To help quantify this problem of social isolation, I propose the ‘Tesla Index’, a measure of the discrepancy between the somewhat stuffy, outdated practice of generating peer-reviewed publications and the growing trend of vibrant, dynamic engagement with other scientists and the general public through use of social media.

Introduction

There are many scientists who actively take the time to pursue their science in as much of a public manner as possible. They work hard to ensure that their peers, and the public at large, are kept informed of their latest research. Consider Titus Brown, a genomics and evolution professor at Michigan State University[1]. Although he has contributed to a meagre number of — largely uninteresting — publications[2], he has instead embraced social media[3] to excite and stimulate others with news of his past, current, and future work.

Now consider Nikola Tesla[4]; although he may have forever changed the world through his many scientific inventions[5], he was a famous recluse[6] and surprisingly did not contribute to any blog, nor did he even bother to set up an account on twitter. I am concerned that the anti-social and secretive behavior of Nikola Tesla is something that is all too common in many other scientists, particularly in those who continue their obsession with publishing work that will forever live behind pay-walls, invisible to all but the privileged few.

I therefore think it’s time that we develop a metric that will clearly indicate if a scientist is a reclusive introvert with no interest in sharing their work with others or engaging with the wider community. This will allow others to adjust our expectations of them accordingly. In order to quantify the problem and to devise a solution, I have compared the numbers of followers that research scientists have on twitter with the number of citations they have for their peer-reviewed work. This analysis has identified clear outliers, or ‘Teslas’, within the scientific community. I propose a new metric, which I call the ‘Tesla Index’, which allows a simple quantification as to the degree of social isolation of any particular scientist.

Results and Discussion

I took the number of Twitter followers as a measure of ‘social outreach and engagement’ while the number of citations was taken as a measure of ‘boring scientific output’. The data gathered are shown in Figure 1.

Figure 1: Twitter followers versus number of scientific citations for a sort-of-random sample of researcher tweeters

I propose that the Tesla Index (T-index) can be calculated as simply the number of Twitter followers a user has, divided by their total number of citations. A low T-index is a warning to the community that researcher 'X' may be forsaking all methods of publicly sharing their work in favor of solely publishing manuscripts. In contrast, a very high T-index suggests that a scientist is being active in the community, informing and educating their peers, colleagues, and the wider public. They are thus playing a positive role in society. Here, I propose that those people whose T-index is lower than 0.5 can be considered ‘Science Teslas’; these individuals are highlighted in Figure 1.
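For anyone wishing to check their own standing, the calculation can be sketched in a few lines of Python. The function names are mine, the follower and citation counts are invented for illustration, and the handling of zero citations is an assumption, since the definition above does not cover that case.

```python
TESLA_THRESHOLD = 0.5  # a T-index below this marks a 'Science Tesla'


def tesla_index(followers, citations):
    """Return Twitter followers divided by total citations.

    Assumption: with zero citations the ratio is undefined, so return None.
    """
    if citations == 0:
        return None
    return followers / citations


def is_science_tesla(followers, citations):
    """True when the T-index is defined and lower than the 0.5 threshold."""
    t_index = tesla_index(followers, citations)
    return t_index is not None and t_index < TESLA_THRESHOLD


# Invented numbers: 500 followers and 2,000 citations.
print(tesla_index(500, 2000), is_science_tesla(500, 2000))
```

With these invented numbers the T-index is 0.25, which falls below the threshold and so earns the label.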

References

[1] http://ged.msu.edu
[2] http://scholar.google.com/citations?user=O4rYanMAAAAJ&hl=en
[3] https://twitter.com/ctitusbrown
[4] http://en.wikipedia.org/wiki/Nikola_Tesla#Literary_works
[5] http://theoatmeal.com/comics/tesla
[6] http://www.viewzone.com/tesla.html

Acknowledgments

This research was inspired by a piece of completely unrelated work by Neil Hall.

A CEGMA Virtual Machine (VM) is now available!

July 30, 2014 by Keith Bradnam

Last week I blogged about the ever-growing popularity of CEGMA and also the problems of maintaining this difficult-to-install piece of software. In response to that post, people helpfully pointed out that you can more easily install/run CEGMA by using package managers such as Homebrew and/or even run CEGMA on a dedicated Amazon Machine Image.

These responses led me to update the CEGMA FAQ to list all of the alternative methods of getting CEGMA to run (including running it as an iPlant application). I’m happy that I can today announce a new addition to this list: CEGMA is now available through virtualization:

Korflab CEGMA VM Information

Our CEGMA VM runs the Ubuntu operating system and is pre-configured to have everything installed that CEGMA needs. I’ve tested the VM using the free VirtualBox software and it seems to work just fine [1].

This also means that I will no longer be offering a service to run CEGMA on behalf of others. I had previously offered to run CEGMA for people who had trouble installing the software (or more commonly, the pieces of software that CEGMA requires). I’ve run CEGMA over 100 times for others and this has been a bit of a drain on my time to say the least. Hopefully, our CEGMA VM is a viable alternative. Many thanks are due to Richard Feltstykket at the UC Davis Genome Center’s Bioinformatics Core for setting this up.

[1] Words that will come back to haunt me, I expect!