Academic link rot seems to be getting faster: should a published URL last more than 100 days?

Consider this paper that was recently published in the journal Bioinformatics, and which showed up today in my RSS feed:

Presumably it is a typo when the journal says that it was received on November 14th 2014:

I'll assume that this is meant to be 2013! The paper first appeared online on June 13th 2014, just 103 days ago. The text of this paper links to some software that should be available at http://ww2.cs.mu.oz.au/~gwong/LICRE. Except that this URL doesn't work. Neither does http://ww2.cs.mu.oz.au/~gwong/. Only when I visit http://ww2.cs.mu.oz.au/ do I discover the following:

The new website for the merged departments says that the merger happened in 2012, and this is confirmed by the redirection page, which is dated 18th January 2012. It is also confirmed by the Internet Archive's Wayback Machine, which shows that the redirection page has been in place since at least February 2012.

All of which suggests that the software link in the paper may not even have worked at the time the manuscript was submitted. I'm sure there are other similar examples of speedy link rot, but this one seems particularly striking, especially since a search for 'LICRE' on the new website doesn't return any hits (nor can I find any mention of it on Google or various search engine caches).
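For anyone who wants to run a similar check on the links from a paper, here is a minimal sketch (Python standard library only; the URLs are just the ones discussed above) that reports whether each one still resolves:

# Minimal sketch: check whether the URLs cited in a paper still resolve.
# Uses only the Python standard library; the example URLs are the ones
# discussed in this post.
import urllib.request
import urllib.error

urls = [
    "http://ww2.cs.mu.oz.au/~gwong/LICRE",
    "http://ww2.cs.mu.oz.au/~gwong/",
    "http://ww2.cs.mu.oz.au/",
]

for url in urls:
    try:
        # A HEAD request is enough to see whether the page is still there
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=10) as response:
            print(f"{url}\t{response.status}")
    except urllib.error.HTTPError as err:
        print(f"{url}\tHTTP error {err.code}")
    except urllib.error.URLError as err:
        print(f"{url}\tfailed: {err.reason}")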

I will contact the lead author to let him know about the disappearance of the software. In the meantime, I'll remind people of this previous post of mine:

Update 2014-09-24 19.52: I heard back from the author; the LICRE code is now available at https://sites.google.com/site/licrerepository/

Another CEGMA post: KOGs vs CEGs and 458 vs 248

I posted another answer about CEGMA on seqanswers.com last week. I thought I'd cover this in a little more detail here (note, questions edited from how they originally appeared):

Question 1: CEGMA uses a 'kogs.fa' file (containing 2,748 proteins) to compare to a user's genome sequence. These KOGs define a set of 458 core eukaryotic genes (CEGs). Some CEGMA publications report how many of the 458 CEGs are present, while others list results for the 248 most highly conserved CEGs. Does anyone know why kogs.fa is the default? Does it get 'curated' down to a smaller set during a CEGMA run?

The kogs.fa file represents a subset of the published set of 4,852 KOGs (euKaryotic Orthologous Groups). The KOGs database, which is still available online, describes protein groups that are present among seven different eukaryotes (not all groups are present in all species). We excluded data from the microsporidian Encephalitozoon cuniculi, as it is a parasite and may have an atypical protein complement, and focused on the 1,788 groups that were present in all of the remaining six species. We then applied various filtering criteria (see the methods of the original paper) to reduce this to 458 KOGs, renaming this subset as CEGs in the process. We also chose just one protein from each species to represent each CEG.

So that's why our kogs.fa file contains 2,748 proteins (458 x 6). CEGMA tries to determine which of these 458 CEGs are present in your input file. It's worth pointing out that the original purpose of CEGMA was to find a handful of genes in a genome that may lack any gene annotation. Someone could then use this small gene set to train a gene finder, which in turn could be used to annotate the entire genome.
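If you want to verify those numbers against your own copy of kogs.fa, a minimal sketch along the following lines should do it. It only assumes that each FASTA header contains the KOG identifier somewhere (e.g. 'KOG0001'); I'm not guaranteeing the exact header layout here.

# Minimal sketch: count how many proteins and how many distinct KOG/CEG
# groups are present in CEGMA's kogs.fa. It only assumes each FASTA header
# contains a KOG identifier somewhere (e.g. 'KOG0001'); the exact header
# layout may differ from this assumption.
import re
from collections import defaultdict

counts = defaultdict(int)

with open("kogs.fa") as fasta:
    for line in fasta:
        if line.startswith(">"):
            match = re.search(r"KOG\d+", line)
            if match:
                counts[match.group()] += 1

total = sum(counts.values())
# If the file matches what's described above, this should report 458 CEGs
# and 2748 proteins.
print(f"{len(counts)} CEGs, {total} proteins")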

After CEGMA has found which of the 458 CEGs are present, it then performs its secondary role of assessing the completeness of the gene space. To do this, it only uses the most conserved and least paralogous of the 458 CEGs. Paralogy is a big issue here: the original KOGs database often grouped together many, many paralogs in a single group. For example, KOG0001 corresponds to the Ubiquitin gene family. Here is how many proteins from each of the seven species belong to this KOG:

  • Arabidopsis thaliana - 28
  • Caenorhabditis elegans - 12
  • Drosophila melanogaster - 3
  • Encephalitozoon cuniculi - 1
  • Homo sapiens - 17
  • Saccharomyces cerevisiae - 2
  • Schizosaccharomyces pombe - 1

The high degree of paralogy from A. thaliana is one reason why this KOG is not included in our subset of 248 CEGs. In contrast, KOG0018, "Structural maintenance of chromosome protein 1 (sister chromatid cohesion complex Cohesin, subunit SMC1)", is included in the 248 CEGs:

  • Arabidopsis thaliana - 1
  • Caenorhabditis elegans - 4
  • Drosophila melanogaster - 1
  • Encephalitozoon cuniculi - 1
  • Homo sapiens - 3
  • Saccharomyces cerevisiae - 1
  • Schizosaccharomyces pombe - 1

This secondary role of CEGMA uses information in the completeness_cutoff.tbl file (inside the CEGMA data directory) to narrow the 458 CEG results down to a subset of 248 CEGs. Because different filtering criteria are used, a CEG may be classed as present in the 458 CEG results but absent from the 248 CEG results, even if it is one of the 248 candidate CEGs.
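To illustrate the paralogy issue only (this is not the actual 458-to-248 filtering, which follows the criteria in the original paper and the completeness_cutoff.tbl file), here is a toy sketch using the two example KOGs above, with an entirely arbitrary cut-off:

# Toy illustration of the paralogy issue only; the real 458 -> 248 filtering
# uses the criteria described in the CEGMA paper and completeness_cutoff.tbl.
# Counts below are the two examples from this post (E. cuniculi is omitted
# because it was excluded from the CEG set); the threshold is arbitrary.
paralog_counts = {
    "KOG0001": {"A. thaliana": 28, "C. elegans": 12, "D. melanogaster": 3,
                "H. sapiens": 17, "S. cerevisiae": 2, "S. pombe": 1},
    "KOG0018": {"A. thaliana": 1, "C. elegans": 4, "D. melanogaster": 1,
                "H. sapiens": 3, "S. cerevisiae": 1, "S. pombe": 1},
}

MAX_PARALOGS = 5  # arbitrary cut-off, purely for this illustration

for kog, counts in paralog_counts.items():
    worst = max(counts.values())
    status = "low paralogy" if worst <= MAX_PARALOGS else "high paralogy"
    print(f"{kog}: {status} (max {worst} paralogs in one species)")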

Question 2: CEGMA output includes many KOG IDs but no description of the protein name/function that each KOG ID represents. This makes it less useful for annotating new genomes. Is there a lookup table somewhere?

One of the reasons why we maintained KOG identifiers in the CEGMA output was so that people could, if so inclined, look up more information in the KOGs database. If you download the 'kog' file from the KOGs database, you will see that each KOG has a one-line description. For example:

[O] KOG0019 Molecular chaperone (HSP90 family)
[KC] KOG0025 Zn2+-binding dehydrogenase (nuclear receptor binding factor-1)
[ZD] KOG0028 Ca2+-binding protein (centrin/caltractin), EF-Hand superfamily protein
[C] KOG0042 Glycerol-3-phosphate dehydrogenase
[T] KOG0044 Ca2+ sensor (EF-Hand superfamily)
[K] KOG0048 Transcription factor, Myb superfamily

The letters inside square brackets represent various functional categories annotated by the KOGs database. These are as follows:

INFORMATION STORAGE AND PROCESSING
 [J] Translation, ribosomal structure and biogenesis
 [A] RNA processing and modification
 [K] Transcription
 [L] Replication, recombination and repair
 [B] Chromatin structure and dynamics

CELLULAR PROCESSES AND SIGNALING
 [D] Cell cycle control, cell division, chromosome partitioning
 [Y] Nuclear structure
 [V] Defense mechanisms
 [T] Signal transduction mechanisms
 [M] Cell wall/membrane/envelope biogenesis
 [N] Cell motility
 [Z] Cytoskeleton
 [W] Extracellular structures
 [U] Intracellular trafficking, secretion, and vesicular transport
 [O] Posttranslational modification, protein turnover, chaperones

METABOLISM
 [C] Energy production and conversion
 [G] Carbohydrate transport and metabolism
 [E] Amino acid transport and metabolism
 [F] Nucleotide transport and metabolism
 [H] Coenzyme transport and metabolism
 [I] Lipid transport and metabolism
 [P] Inorganic ion transport and metabolism
 [Q] Secondary metabolites biosynthesis, transport and catabolism

POORLY CHARACTERIZED
 [R] General function prediction only
 [S] Function unknown

Maybe this is useful to someone. However, I would remind people that KOGs was published over a decade ago (and presumably the work to generate the KOGs database began in 2002, if not earlier). Several gene annotations were probably missing from the source genomes at that time, and many annotations have presumably been updated since (I bet many genes have had minor alterations to their structure).
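If you want to attach those descriptions to the KOG IDs in CEGMA's output automatically, a minimal sketch is shown below. It assumes the description lines in the 'kog' file look exactly like the examples above; any other lines (such as the lists of member proteins) are simply skipped.

# Minimal sketch: build a KOG ID -> (categories, description) lookup from the
# 'kog' file. Assumes the description lines look like the examples above
# ("[O] KOG0019 Molecular chaperone (HSP90 family)"); any other lines in the
# file (e.g. the member protein lists) are skipped.
import re

descriptions = {}
pattern = re.compile(r"^\[([A-Z]+)\]\s+(KOG\d+)\s+(.+)")

with open("kog") as kog_file:
    for line in kog_file:
        match = pattern.match(line)
        if match:
            categories, kog_id, description = match.groups()
            descriptions[kog_id] = (categories, description.strip())

# Example lookup for one of the KOG IDs shown above
print(descriptions.get("KOG0019"))  # ('O', 'Molecular chaperone (HSP90 family)')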


Updated version of my 'Genome assembly: then and now' talk is now available

This is a presentation that I have probably given five times now. Originally, the talk focused purely on the Assemblathon 2 paper, with some thoughts about how the field of genome assembly has changed since the days of Sanger-only sequencing.

Over time, I've increasingly downplayed the Assemblathon 2 content of the talk and made room for updates on the latest developments in genome sequencing and assembly. I've also decided to start adding version numbers to the talk to make it easier to distinguish between different versions.

So here is version 1.2 of my talk, presented below with and without notes (my talks are very visual, so I have embedded notes to try to capture what I talk about for each slide). Don't be put off by the high slide count (many of these just reflect animated steps).

Without notes…

With notes (probably need to go full-screen to be able to clearly read these)…

Too many genome assemblers to keep track of? Nucleotid.es to the rescue!

Yesterday, I presented an updated version of my 'Genome Assembly: Then and Now' talk. I'll try to post the full set of slides (with notes) later today on Slideshare. But I thought I'd share just one of the new slides from the talk; here are six papers that describe new genome assembly tools…


New option to subscribe to this blog via email

Shamelessly borrowing this idea from Matt Gemmell's excellent blog, I thought I'd offer the chance to subscribe to my infrequent ramblings via email. If you enter your email address below, you can receive a weekly email (sent on Friday afternoons) with all of my posts for that week.

Your email address will only be used for the purpose of receiving my blog content and will not be shared with anyone else. Each email will offer a simple link by which to unsubscribe.

Some sage advice on avoiding confusing names for bioinformatics tools

SAGE is a molecular technique used to investigate the mRNA population from a chosen sample. It stands for Serial Analysis of Gene Expression and was first described back in 1995. The technique spawned spin-offs such as LongSAGE, RL-SAGE (Really Long SAGE), and SuperSAGE.

Although this technique has largely been superseded by other methods (such as RNA-Seq), it is still widely referenced (over 1,300 publications from 2013 mention this technique).

Fast-forward to the present day and I note that a new tool has just been published in the journal BMC Bioinformatics:

SAGE: String-overlap Assembly of GEnomes

As long as you query your favorite web search engine for some combination of 'SAGE' and 'genome assembly', you will probably find this tool rather than ending up on one of the half a million pages that discuss the other SAGE. Even so, I'm not sure it was wise to give a new tool the same name as such an established molecular technique.

All of this means that there is the potential for a certain company to use the aforementioned molecular technique to help annotate the output of the aforementioned computational technique, and apply both of these techniques to data from a certain plant. This could give you the world's first SAGE, SAGE, SAGE, sage genome!

Understanding CEGMA output: complete vs partial

On Friday I posted a reply to a thread on SEQanswers about CEGMA. I thought I'd include a modified version of that response here as it is an issue that gets raised fairly frequently. It concerns the 'complete' and 'partial' results that CEGMA includes in the final output file that it generates (typically called 'output.completeness_report'). Here were the two questions that were posted:

1) If a partial score is higher than a complete score then does this indicate that the assembly is fragmented?

2) Also, should the partial score be lower than the complete score in an ideal situation?

Remember, these are not scores per se. Both of these figures describe the number of core eukaryotic genes (CEGs) that the CEGMA pipeline predicts to be present in the input assembly file. The 'complete' set refers to those gene predictions which CEGMA classes as 'full-length'. Note that even if CEGMA says something is 'complete', there is still the possibility that parts of the protein are missing.

This is because CEGMA takes each CEG that it has predicted and aligns the protein sequence of that CEG to the HMM profile generated from the corresponding core gene family (made up of six proteins, one each from Schizosaccharomyces pombe, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, and Homo sapiens). As I recall, if the alignment spans more than 70% of the protein profile, the CEG is considered to be 'complete'. This 70% threshold is an arbitrary cut-off, but it seems to work well in finding genuine orthologs of CEGs.

Somewhat confusingly, although we consider 'partial' matches to be those below 70% (but above some unspecified minimum score), the output in output.completeness_report uses 'partial' to include both 'complete' and 'partial' matches. So the number of partial matches will always be at least as high as the number of complete matches.
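As a rough sketch of that classification logic (the 70% figure is quoted from memory above, and the minimum score for a 'partial' match is a placeholder, not CEGMA's actual value):

# Rough sketch of how a CEG prediction ends up in the 'complete' and 'partial'
# counts. The 70% profile-coverage threshold is quoted from memory in the text
# above, and the minimum alignment score for a partial match is a placeholder.
COMPLETE_COVERAGE = 0.70   # fraction of the HMM profile the alignment must span
MIN_PARTIAL_SCORE = 50.0   # placeholder; CEGMA uses its own internal cut-off

def classify_prediction(profile_coverage, alignment_score):
    """Return 'complete', 'partial', or None for a single CEG prediction."""
    if alignment_score < MIN_PARTIAL_SCORE:
        return None
    if profile_coverage >= COMPLETE_COVERAGE:
        return "complete"
    return "partial"

# In the completeness report, 'partial' counts every prediction that passed
# the minimum score, so it always includes the 'complete' ones as well.
predictions = [(0.95, 210.0), (0.55, 80.0), (0.30, 12.0)]
labels = [classify_prediction(cov, score) for cov, score in predictions]
complete = sum(1 for label in labels if label == "complete")
partial = sum(1 for label in labels if label is not None)
print(complete, partial)  # 1 2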

You should look at both results. If you don't have all 248 core genes 'completely' present, the next thing is to look at how many additional partial matches there are. If you have a result like 200/240 (i.e. 200 complete CEGs and 40 additional partial matches), then this at least suggests that most of the core gene set is present in your assembly, but some genes may be split across contigs or missing from the assembly. Remember, CEGMA only looks for genes that are located within individual contigs or scaffolds. Theoretically, you could have an assembly that splits every gene across contigs, which might lead to a 'complete' result of zero and a 'partial' result of 248.

From looking at the results of many different runs of CEGMA, it is common to see something like 90–95% of core genes present in the 'complete' category, and another 1–5% present as partial genes (for good assemblies at least). I have also seen one case where the results were 157/223. This is more unusual, suggesting that a relatively large fraction (27%) of the core genes was present only as fragments. This might simply reflect lots of short contigs/scaffolds in the assembly. In contrast, one of the best results that I have seen is 245/248. It is rare to see all core genes present, even when you allow for partial matches.
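For reference, converting counts like 200/240 into the percentages used in the chart below is just simple arithmetic:

# Simple arithmetic for interpreting a CEGMA result such as '200/240':
# 200 CEGs are complete, 240 are at least partially present (so 40 are
# partial-only), out of the 248 CEGs used for the completeness check.
TOTAL_CEGS = 248

def summarise(complete, partial):
    partial_only = partial - complete
    return (100.0 * complete / TOTAL_CEGS,      # % completely present
            100.0 * partial_only / TOTAL_CEGS)  # % only partially present

for complete, partial in [(200, 240), (157, 223), (245, 248)]:
    complete_pct, partial_pct = summarise(complete, partial)
    print(f"{complete}/{partial}: {complete_pct:.1f}% complete, "
          f"{partial_pct:.1f}% partial only")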

Below is a chart that shows the results from 50 runs of CEGMA against different assemblies. The x-axis shows the percentage of 248 CEGs that were completely present, and the y-axis shows the percentage of CEGs that were only partially present.

Is yours bigger than mine? Big data revisited

Google Scholar lists 2,090 publications that contain the phrase 'big data' in their title. And that's just from the first 9 months of 2014! The titles of these articles reflect the interest/concern/fear in this increasingly popular topic:

One paper, Managing Big Data for Scientific Visualization, starts out by identifying a common challenge of working with 'big data':

Many areas of endeavor have problems with big data…while engineering and scientific visualization have also faced the problem for some time, solutions are less well developed, and common techniques are less well understood

They then go on to discuss some of the problems of storing 'big data', one of which is listed as:

Data too big for local disk — clearly, not only do some of these data objects not fit in main memory, but they do not even fit on local disk on most workstations. In fact, the largest CFD study of which we are aware is 650 gigabytes, which would not fit on centralized storage at most installations!

Wait, what!?! 650 GB is too large for storage? Oh yes, that's right. I forgot to mention that this paper is from 1997. My point is that 'big data' has been a problem for some time now and will no doubt continue to be a problem.

I understand that having a simple, user-friendly label like 'big data' helps with the discussion, but it remains an ambiguous and highly relative term. It's relative because whether you deem something to be 'big data' or not might depend heavily on the size of your storage media and/or the speed of your networking infrastructure. It's also relative in terms of your field of study; a typical set of 'big data' in astrophysics might be much bigger than a typical set of 'big data' in genomics.

Maybe it would help to use big data™ when talking about any data that you like to think of as big, and then use BIG data for those situations where your future data acquisition plans cause your sys admin to have sleepless nights.

The problem with posters at academic conferences

I recently attended the Genome Science: Biology, Technology, and Bioinformatics meeting in the UK, where I presented a poster. As I was walking around, looking at other people's posters, I was reminded of a common problem that occurs with many academic posters. Here are some pseudo-anonymous examples to show what I mean (click images to enlarge):

The problem here is not with the total amount of text — though that can sometimes be an issue — but with the width of the text. These posters are 84 cm (33 inches) wide, and it is not ideal to create text blocks that span the entire width of the poster. The reasons behind this are the same reasons why you never see newspapers display text like this…we are not very good at reading information in this manner.

To quote from Lynch & Horton's Web Style Guide; specifically the section on Page Width and Line Length:

The ideal line length for text layout is based on the physiology of the human eye. The area of the retina used for tasks requiring high visual acuity is called the macula. The macula is small, typically less than 15 percent of the area of the retina. At normal reading distances the arc of the visual field covered by the macula is only a few inches wide—about the width of a well-designed column of text, or about twelve words per line. Research shows that reading slows as line lengths begin to exceed the ideal width, because the reader then needs to use the muscles of the eye or neck to track from the end of one line to the beginning of the next line. If the eye must traverse great distances on a page, the reader must hunt for the beginning of the next line.

In contrast to the above examples, there were a couple of posters at the #UKGS2014 meeting that I thought were beautifully displayed. Bright, colorful, clearly laid out, not too much text, and good use of big fonts. Congratulations to Warry Owen et al. and Karim Gharbi et al. for your poster presentation prowess!

When is a citation not a citation?

Today I received a notification from Google Scholar that one of my papers had been cited. I often have a quick look at such papers to see how our work is being referenced. The article in question was from the Proceedings of the 3rd Annual Symposium on Biological Data Visualization: Data Analysis and Redesign Contests:

FixingTIM: interactive exploration of sequence and structural data to identify functional mutations in protein families

The paper describes a tool that helps "identify protein mutations across a family of structural models and to help discover the effect of these mutations on protein function". I was a bit surprised by this because this isn't a topic that I've published on. So I looked to see what paper of mine was being cited and how it was being cited. Here is the relevant sentence from the background section of the paper:

To improve the exploration process, many efforts have been made, from folding the sequences through classification [1,2], to tools for 3D view exploration [3] and to web-based applications which present large amounts of information to the users [4].

Citation number 2 is the paper of which I am a co-author:

  • Chen N, Harris TW, Antoshechkin I, Bastiani C, Bieri T, Blasiar D, Bradnam K, Canaran P, Chan J, Chen C, Chen WJ, Cunningham F, Davis P, Kenny E, Kishore R, Lawson D, Lee R, Muller H, Nakamura C, Pai S, Ozersky P, Petcherski A, Rogers A, Sabo A, Schwarz EM, Van Auken K, Wang Q, Durbin R, Spieth J, Sternberg PW, Stein LD: Wormbase: A comprehensive data resource for Caenorhabditis biology and genomics. Nucleic Acids Res 2005, 33(1):383-389.

The cited paper simply describes the WormBase database and includes only a passing reference to the fact that WormBase contains some links to protein structures (when known), but that's about it. The WormBase paper doesn't mention 'folding' or 'classification' anywhere, which makes it seem a really odd choice of paper to be cited. It makes me wonder how many other papers end up gaining seemingly spurious citations like this one.