New option to subscribe to this blog via email

Shamelessly borrowing this idea from Matt Gemmell's excellent blog, I thought I'd offer the chance to subscribe to my infrequent ramblings via email. If you enter your email address below, you can receive a weekly email (sent on Friday afternoons) with all of my posts for that week.

Your email address will only be used for the purpose of receiving my blog content and will not be shared with anyone else. Each email will offer a simple link by which to unsubscribe.

Some sage advice on avoiding confusing names for bioinformatics tools

SAGE is a molecular technique used to investigate the mRNA population from a chosen sample. It stands for Serial Analysis of Gene Expression and was first described back in 1995. The technique spawned spin-offs such as LongSAGE, RL-SAGE (Really Long SAGE), and SuperSAGE.

Although this technique has largely been superseded by other methods (such as RNA-Seq), it is still widely referenced (over 1,300 publications from 2013 mention this technique).

Fast-forward to the present day and I note that a new tool has just been published in the journal BMC Bioinformatics:

SAGE: String-overlap Assembly of GEnomes

As long as you query your favorite web search engine for some combination of 'SAGE' and 'genome assembly', you will probably find this tool rather than ending up on one of the half a million pages that talk about the other SAGE. Even so, I can't help wondering whether it is a bit risky to give a new tool the same name as such an established molecular technique.

All of this means that there is the potential for a certain company to use the aforementioned molecular technique to help annotate the output of the aforementioned computational technique, and apply both of these techniques to data from a certain plant. This could give you the world's first SAGE, SAGE, SAGE, sage genome!

Understanding CEGMA output: complete vs partial

On Friday I posted a reply to a thread on SEQanswers about CEGMA. I thought I'd include a modified version of that response here as it is an issue that gets raised fairly frequently. It concerns the 'complete' and 'partial' results that CEGMA includes in the final output file that it generates (typically called 'output.completeness_report'). Here were the two questions that were posted:

1) If a partial score is higher than a complete score then does this indicate that the assembly is fragmented?

2) Also, should the partial score be lower than the complete score in an ideal situation?

Remember, these are not scores per se. Both of these figures describe the number of core eukaryotic genes (CEGs) that the CEGMA pipeline predicts to be present in the input assembly file. The 'complete' set refers to those gene predictions which CEGMA classes as 'full-length'. Note that even if CEGMA says something is 'complete', parts of the protein may still be missing.

This is because CEGMA takes each CEG that it has predicted and aligns the protein sequence of that CEG to the HMM profile generated from the corresponding core gene family (made up of six proteins, from Schizosaccharomyces pombe, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, and Homo sapiens). As I recall, if the alignment spans more than 70% of the protein profile, the CEG is considered to be 'complete'. This 70% threshold is an arbitrary cut-off, but it seems to work well in finding genuine orthologs of CEGs.

Somewhat confusingly, although we consider 'partial' matches to be those below 70% (but above some unspecified minimum score), the output in output.completeness_report uses 'partial' to include both 'complete' and 'partial' matches. So the number of partial matches will always be at least as high as the number of complete matches.
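
To make the relationship between these two numbers concrete, here is a minimal sketch in Python of the counting logic as I have described it above. This is not CEGMA's actual code: the 70% figure is simply the threshold as I recall it, and the lower cut-off for calling something a partial match (MIN_PARTIAL below) is a made-up placeholder, since I don't know the real value.

```python
# Sketch of how 'complete' and 'partial' counts relate in CEGMA's
# completeness report. Thresholds are illustrative, not CEGMA's own code.

COMPLETE_CUTOFF = 0.70   # fraction of the HMM profile covered (as recalled above)
MIN_PARTIAL = 0.30       # hypothetical lower bound; the real cut-off is unspecified

def classify(profile_coverage):
    """Class a CEG prediction by how much of the protein profile its alignment spans."""
    if profile_coverage >= COMPLETE_CUTOFF:
        return "complete"
    if profile_coverage >= MIN_PARTIAL:
        return "partial"
    return None  # no reportable match

def summarise(coverages):
    """Return the two numbers reported in output.completeness_report.

    'complete' counts only full-length predictions; 'partial' counts
    complete AND partial predictions, so it is always >= 'complete'.
    """
    calls = [classify(c) for c in coverages]
    complete = sum(1 for call in calls if call == "complete")
    partial = complete + sum(1 for call in calls if call == "partial")
    return complete, partial

print(summarise([0.95, 0.71, 0.50, 0.10]))  # (2, 3)
```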

You should look at both results. If you don't have all 248 core genes 'completely' present, the next thing is to look at how many additional partial matches there are. If you have a result like 200/240 (i.e. 200 complete CEGs and 40 additional partial matches), this at least suggests that most of the core gene set is present in your assembly, but some genes may be split across contigs or missing from the assembly altogether. Remember, CEGMA only looks for genes that are located within individual contigs or scaffolds. Theoretically, you could have an assembly that splits every gene across contigs, which might lead to a 'complete' result of zero and a 'partial' result of 248.

From looking at the results of many different runs of CEGMA, it is common to see something like 90–95% of core genes present in the 'complete' category, with another 1–5% present only as partial genes (for good assemblies at least). I have also seen one case where the results were 157/223. This is more unusual, suggesting that a relatively large fraction (27%) of the core genes were present only as fragments. This might simply reflect lots of short contigs/scaffolds in the assembly. In contrast, one of the best results that I have seen is 245/248. It is rare to see all core genes present, even when you allow for partial matches.
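
If it helps, here is a trivial bit of arithmetic (again, just a sketch in Python) that turns a pair of these numbers into the sort of percentages I've been quoting, using the 200/240 example from above:

```python
# Express a complete/partial pair from output.completeness_report as
# percentages of the 248 core eukaryotic genes.

TOTAL_CEGS = 248

def interpret(complete, partial):
    fragments_only = partial - complete  # CEGs found only as partial matches
    return {
        "complete (%)": round(100 * complete / TOTAL_CEGS, 1),
        "partial (%)": round(100 * partial / TOTAL_CEGS, 1),
        "partial only (%)": round(100 * fragments_only / TOTAL_CEGS, 1),
    }

print(interpret(200, 240))
# {'complete (%)': 80.6, 'partial (%)': 96.8, 'partial only (%)': 16.1}
```

The last of these three figures corresponds to the y-axis of the chart below.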

Below is a chart that shows the results from 50 runs of CEGMA against different assemblies. The x-axis shows the percentage of 248 CEGs that were completely present, and the y-axis shows the percentage of CEGs that were only partially present.

Is yours bigger than mine? Big data revisited

Google Scholar lists 2,090 publications that contain the phrase 'big data' in their title. And that's just from the first 9 months of 2014! The titles of these articles reflect the interest/concern/fear surrounding this increasingly popular topic.

One paper, Managing Big Data for Scientific Visualization, starts out by identifying a common challenge of working with 'big data':

Many areas of endeavor have problems with big data…while engineering and scientific visualization have also faced the problem for some time, solutions are less well developed, and common techniques are less well understood

They then go on to discuss some of the problems of storing 'big data', one of which is listed as:

Data too big for local disk — clearly, not only do some of these data objects not fit in main memory, but they do not even fit on local disk on most workstations. In fact, the largest CFD study of which we are aware is 650 gigabytes, which would not fit on centralized storage at most installations!

Wait, what!?! 650 GB is too large for storage? Oh yes, that's right. I forgot to mention that this paper is from 1997. My point is that 'big data' has been a problem for some time now and will no doubt continue to be a problem.

I understand that having a simple, user-friendly label like 'big data' helps with the discussion, but it remains such an ambiguous and highly relative term. It's relative because whether you deem something to be 'big data' or not might depend heavily on the size of your storage media and/or the speed of your networking infrastructure. It's also relative in terms of your field of study; a typical set of 'big data' in astrophysics might be much bigger than a typical set of 'big data' in genomics.

Maybe it would help to use big data™ when talking about any data that you like to think of as big, and then use BIG data for those situations where your future data acquisition plans cause your sysadmin to have sleepless nights.