24 carat JABBA awards

Here is a new paper published in the journal PLOSBuzzFeed…sorry, I mean PLOS Computational Biology:

It's a good job that they mention the name of the algorithm ninety-one times in the paper, otherwise you might forget just how bogus the name is. At least DIAMOnD has that lower-case 'n' which means that no-one will confuse it with:

This second DIAMOND paper dates all the way back to November 2014. Where does this DIAMOND get its name?

Double Index AlignMent Of Next-generation sequencing Data

This DIAMOND gets a bonus point for having a website link in the paper which doesn't seem to work.

So DIAMOnD and DIAMOND are both the latest recipients of JABBA awards for giving us Just Another Bogus Bioinformatics Acronym.

Trying to download the cow genome (again): where's the beef (again)?

Almost a year ago, I blogged about my frustrations regarding the extremely confusing nature of the cow genome and the many genome assemblies that are out there. Much of that frustration was due to websites and FTP sites that had broken links, misleading information, and woefully incomplete documentation.

One year on and I hear a rumor that a new version of the cow genome is available. So I went off in search of 'UMD 3.1.1'. My first stop was bovinegenome.org which is one place where you can find the previous 'UMD 3.1' assembly. But alas, they do not list UMD 3.1.1.

After some Google searching I managed to find this information at the UCSC Genome Bioinformatics news archive:

We are pleased to announce the release of a Genome Browser for the June 2014 assembly of cow, Bos taurus (BostaurusUMD 3.1.1, UCSC version bosTau8). This updated cow assembly was provided by the UMD Center for Bioinformatics and Computational Biology (CBCB). This assembly is an update to the previous UMD 3.1 (bosTau6) assembly. UMD 3.1 contained 138 unlocalized contigs that were found to be contaminants. These have been suppressed in UMD 3.1.1.

This reveals that the update is pretty minor (removal of contaminant contigs which were never part of any chromosome sequence anyway). In any case, the UCSC FTP site contains the UMD 3.1.1 assembly so that's great.

But out of curiosity I followed UCSC's link to the UMD Center for Bioinformatics and Computational Biology (CBCB) website. The home page doesn't make it easy to find the cow genome data. Searching the site for 'UMD 3.1.1' didn't help but searching for 'cow genome' did take me to their Assembly data page which lists the cow genome. Unfortunately the link for the Bos taurus genome takes you to 'page not found'. In contrast, the 'data download' link does work and takes you to their FTP site which fails to include the new assembly (but it does list all of the older cow genome assemblies).

Plus ça change, plus c'est la même chose.

More bioinformatics link rot: where is EUROCarbDB?

Update 2015-01-19 15.19: I contacted the corresponding author about this and now the EurocarbDB link in the original paper works.

First published online a few months ago in the journal Bioinformatics (September 12th, 2014):

The name of this resource is not the snappiest name out there. "Oh, you're interested in resources for glycomics, have you tried EuroCarbDB-open parentheses-cee-cee-ar-cee-close parentheses?", but leaving that aside the paper lists the following URLs as part of the abstract:

Availability and implementation: The installation with the glycan standards is available at http://glycomics.ccrc.uga.edu/eurocarb/. The source code of the project is available at https://code.google.com/p/ucdb/.

The first link says that the server is down. The parent page (http://glycomics.ccrc.uga.edu/) seems to make no mention at all of this resource (at least, not that I can find). Following the second link in the abstract, I found the following text:

An incubator project for the future direction of the EUROCarbDB project. More to follow.... This new project is in it's infancy - please use the original EUROCarbDB site. A new project will be hosted at UniCarb-DB (http://www.unicarb-db.org to reflect the continued work of the developers

I followed the first of these links to the 'original' EUROCarbDB site. This Google Code page in turn told me that the online version of EuroCarbDB is hosted by the European Bioinformatics Institute.

Following the link for the online version of EUROCarbDB takes me to what seems to be a closed down site at the EBI titled 'What happened to the EuroCarbDB website?' which has this to say:

The pilot project ended in 2009 but efforts to obtain renewed funding have unfortunately not been successful. The EuroCarbDB website was hosted by the Protein Data Bank in Europe at EMBL-EBI but has now been discontinued

So that's all very helpful then.

Academic link rot seems to be getting faster: should a published URL last more than 100 days?

Consider this paper that was recently published in the journal Bioinformatics, and which showed up today in my RSS feed:

Presumably it is a typo when the journal says that it was received on November 14th 2014:

I'll assume that this is meant to be 2013! The paper first appeared online on June 13th 2014, just 103 days ago. The text of this paper links to some software that should be available at http://ww2.cs.mu.oz.au/~gwong/LICRE. Except that this URL doesn't work. Neither does http://ww2.cs.mu.oz.au/~gwong/. Only when I visit http://ww2.cs.mu.oz.au/ do I discover the following:

The new website for the merged departments says that the merger happened in 2012, and this is confirmed by the redirection page which has a date of 18th January 2012. It is also confirmed by looking at the Internet Archive's Wayback Machine which shows that the redirection page has been in place since at least February 2012. 

All of which suggests that the software link in the paper may not even have worked properly at the time they submitted the manuscript. I'm sure there are other similar examples of speedy link rot, but this seems particularly striking, especially since a search for 'LICRE' on the new website doesn't return any hits (nor can I find any mention of it on Google or various search engine caches).
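
The detective work above (try the full URL, then each parent directory, then the site root) is easy to script. Here is a minimal Python sketch of that idea. The function names are made up for illustration, the actual 'is this page alive?' check is left to the caller (it could be an HTTP request), and the tilde in the URL is written as a plain '~'.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url):
    """Yield the URL itself, then each parent directory, ending at the site root."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Walk from the full path down to an empty path (the site root).
    for depth in range(len(segments), -1, -1):
        path = "/" + "/".join(segments[:depth])
        # Parent directories get a trailing slash; the original URL keeps its own form.
        if depth and (depth < len(segments) or parts.path.endswith("/")):
            path += "/"
        yield urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def first_working(url, is_alive):
    """Return the first URL in the chain for which is_alive(url) is true, else None."""
    for candidate in parent_urls(url):
        if is_alive(candidate):
            return candidate
    return None


if __name__ == "__main__":
    for candidate in parent_urls("http://ww2.cs.mu.oz.au/~gwong/LICRE"):
        print(candidate)
```

Run on the LICRE link, this prints the same three URLs that I tried by hand, in the same order.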

I will contact the lead author to let him know about the disappearance of the software. In the meantime, I'll remind people of this previous post of mine:

Update 2014-09-24 19.52: I heard back from the author; the LICRE code is now at https://sites.google.com/site/licrerepository/

5 things to consider when publishing links to academic websites

Preamble

One of the reasons I've been somewhat quiet on this blog recently is because I've been involved with a big push to finish the new Genome Center website. This has been in development for a long time and provides a much needed update to the previous website that was really showing its age. Compare and contrast:

The old Genome Center website…what's with all that whitespace in the middle?

The new Genome Center website, less than 24 hours old at the time of writing.

This type of redesign is a once-in-a-decade event, and provides the opportunity not just to add new features (e.g. proper RSS news feed, Twitter account, YouTube channel, responsive website design etc.), but also to clean up a lot of legacy material (e.g. webpages for people who left the Genome Center many years ago).

This cleanup prompted me to check Google Scholar to see if there are any published papers that include links to Genome Center websites. This includes links to the main site and also to all of the many subdomains that exist (for different labs, core facilities etc.). It's pretty easy to search Google Scholar for the core part of a URL, e.g. genomecenter.ucdavis.edu, and I would encourage anyone else who is looking after an aging academic website to do so.
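
If you end up with the text of a paper (or a saved page of search results), a short script can pull out every link to the main site and its subdomains. This is only a sketch: the function name is invented for illustration and the regular expression is deliberately simple, so expect to check the results by eye.

```python
import re


def find_site_links(text, core_domain):
    """Return distinct URLs in text whose host is core_domain or one of its subdomains."""
    pattern = re.compile(
        r"https?://(?:[A-Za-z0-9-]+\.)*"      # optional subdomain labels
        + re.escape(core_domain)
        + r"(?![A-Za-z0-9-]|\.[A-Za-z0-9])"   # the host name must end here
        + r"(?:/[^\s)<>]*)?",                 # optional path
        re.IGNORECASE,
    )
    found = []
    for match in pattern.finditer(text):
        url = match.group(0).rstrip(".,;")    # drop sentence punctuation
        if url not in found:
            found.append(url)
    return found


if __name__ == "__main__":
    sample = (
        "Tools are at http://genomecenter.ucdavis.edu/benham/sidd "
        "and http://farnham.genomecenter.ucdavis.edu."
    )
    for url in find_site_links(sample, "genomecenter.ucdavis.edu"):
        print(url)
```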

Here are some of the key things that I noticed:

  1. Most mentions of Genome Center URLs are to resources by Peggy Farnham's lab. Although Peggy left UC Davis several years ago (she is now here), her very old and out-of-date lab page still exists (http://farnham.genomecenter.ucdavis.edu).
  2. Many people link to Craig Benham's work using http://genomecenter.ucdavis.edu/benham/. This redirects to Craig's own lab site (http://benham.genomecenter.ucdavis.edu), but the redirect doesn't quite work when people have linked to a specific tool (e.g. http://genomecenter.ucdavis.edu/benham/sidd). This redirects to http://benham.genomecenter.ucdavis.edu/sidd which then produces a 404 error (page not found).
  3. There are many papers that link to resources from Jonathan Eisen's group and these papers all point to various pages on a domain that is either down or no longer in existence (http://bobcat.genomecenter.ucdavis.edu).

There is an issue here of just how long it is valid to try to keep links active and working. In the case of Peggy Farnham, she no longer works at UC Davis, so is it okay if I redirect all of her web traffic to her new website? I plan to do this but will let Peggy know so that she can maybe arrange to copy some of the existing material over to her new site.

In the case of Craig's lab, maybe he should be adding his own redirect links for tools that now have new URLs. What would also help would be to have a dedicated 404 page which might point to the likely target page that people are looking for (a completely blank 'not found' page is rarely ever helpful).
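
To make the redirect and 404 ideas a little more concrete, here is a rough Python sketch of the lookup logic a site could use: serve the page if it exists, redirect if the old path has a known new home, and otherwise return a 'not found' response that suggests the closest matching pages. All of the names and paths below are hypothetical (example.org stands in for a real lab site), and a real web server would normally do this with its own redirect rules rather than hand-written code.

```python
import difflib


def resolve(path, known_pages, redirects):
    """Return (status, targets) for a requested path.

    200: the page exists; 301: an old path with a recorded new location;
    404: unknown path, with up to three similar-looking pages as suggestions.
    """
    path = "/" + path.strip("/")  # normalize leading and trailing slashes
    if path in known_pages:
        return 200, [path]
    if path in redirects:
        return 301, [redirects[path]]
    suggestions = difflib.get_close_matches(path, sorted(known_pages), n=3, cutoff=0.6)
    return 404, suggestions


if __name__ == "__main__":
    # Hypothetical site contents and redirect table.
    pages = ["/tools/sidd", "/people", "/contact"]
    moved = {"/benham/sidd": "https://lab.example.org/tools/sidd"}

    print(resolve("/benham/sidd/", pages, moved))  # old link from a paper
    print(resolve("/tools/sid", pages, moved))     # mistyped link
```

The point of the 404 branch is that even when nobody has recorded a redirect, the visitor still gets a pointer to the page they were probably looking for.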

In the case of Jonathan's lab, there is a big problem here in that all of the papers are tied to a very specific domain name (which itself has no obvious naming connection to his lab). You can always rename a new machine to be called 'bobcat', but maybe there are better things we should be doing to avoid these situations arising in the first place…

5 things to consider when publishing links to academic websites

  1. Don't do it! Use resources like Figshare, GitHub, or Dryad if at all possible. Of course this might not be possible if you are publishing some sort of online software tool.
  2. If you have to link to a lab webpage, consider spending $10 a year or so and buying your own domain name that you can take with you if you ever move anywhere else in future. I bought http://korflab.com for my boss, and I see that Peggy Farnham is now using http://farnhamlab.com.
  3. If you can't, or don't want to, buy your own domain name, try using a generic lab domain name and not a machine-specific domain name. E.g. our lab's website is on a machine called 'raiden' and can be accessed at http://raiden.genomecenter.ucdavis.edu. But we only ever use the domain name http://korflab.ucdavis.edu which allows us to use a different machine as the server without breaking any links.
  4. If you must link to a specific machine, try avoiding URLs that get too complex. E.g. http://supersciencelab.ucdavis.edu/Tools/Foo/v1/foo_v1.cgi. The more complex the URL, the more likely it will break in future. Instead, link to the home page of your site (http://supersciencelab.ucdavis.edu) and provide clear links on that page on how to find things.
  5. Any time you publish a link to a URL, make sure you keep a record of this in a simple text file somewhere. This might really help if/when you decide to redesign your website 5 years from now and want to know whether you might be breaking any pre-existing links.
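
As an illustration of point 5, here is one way that text file could be put to work before a redesign. This is only a sketch under my own assumptions: the file has one published URL per line (optionally followed by a note), lines starting with '#' are comments, every URL is on the site being redesigned, and the function names are invented for this example.

```python
from urllib.parse import urlsplit


def load_published_urls(lines):
    """Parse the record file: first token on each line is the URL, '#' lines are comments."""
    urls = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line.split()[0])
    return urls


def broken_after_redesign(published, new_paths, redirected_paths=()):
    """Return published URLs whose path is neither served nor redirected by the new site."""
    covered = {p.rstrip("/") for p in new_paths} | {p.rstrip("/") for p in redirected_paths}
    return [url for url in published if urlsplit(url).path.rstrip("/") not in covered]


if __name__ == "__main__":
    record = [
        "# URLs we have published in papers",
        "http://supersciencelab.ucdavis.edu/Tools/Foo/v1/foo_v1.cgi  Foo paper",
        "http://supersciencelab.ucdavis.edu",
    ]
    published = load_published_urls(record)
    # Hypothetical paths that will exist on the redesigned site.
    for url in broken_after_redesign(published, new_paths={"/", "/tools/foo"}):
        print("will break:", url)
```

In practice you would read the lines from the real file with open() and build the set of new paths from the redesigned site, then add a redirect for anything the script reports.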