JABBA vs Jabba: when is software not really software?

It was only a matter of time, I guess. Today Simon Cockell (@sjcockell) alerted me to a new publication, a book chapter titled:

From the abstract:

Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data
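
For anyone unfamiliar with the approach the abstract describes, the first step is a textbook construction: chop the short reads into k-mers and link overlapping (k-1)-mers. Below is a minimal Python sketch of that step (my own illustration, not the authors' implementation; the function name and toy reads are invented):

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a simple de Bruijn graph: nodes are (k-1)-mers, and each
    k-mer observed in a read adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Toy second-generation 'reads'; a hybrid corrector like Jabba would then
# align the long, error-prone reads against paths in (a cleaned-up version
# of) such a graph in order to correct them.
graph = de_bruijn_graph(["ACGTAC", "CGTACG"], k=4)
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))
```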

Now as far as I can tell, this Jabba is not an acronym, so we safely avoid the issue of presenting a JABBA award for Jabba. However, one might argue that naming any bioinformatics software 'Jabba' is going to present some problems because this is what happens when you search Google for 'Jabba bioinformatics'.

There is a bigger issue with this paper that I'd like to address though. It is extremely disappointing to read a bioinformatics software paper in the year 2015 and not find any explicit link to the software. The publication includes a link to http://www.ibcn.intec.ugent.be, but only as part of the author details. This web page is for the Internet Based Communication Networks and Services research group at Ghent University. The page contains no mention of Jabba, nor does their 'Facilities and Tools' page, nor does a search of their site for 'Jabba' turn anything up.

Initially I wondered whether this paper is more about the algorithm behind Jabba (equations are provided) than about an actual software implementation. However, the paper includes results from their Jabba tool in comparison to another piece of software (LoRDEC), and includes details of CPU time and memory requirements. This suggests that the Jabba software exists somewhere.

To me this is an example of 'closed science' and represents a failure of whoever reviewed this article. I will email the authors to find out if the software exists anywhere…it's a crazy idea, but maybe they'd be interested in letting people, you know, use their software.

Update 2015-11-20: I heard back from the authors…the Jabba software is on GitHub.

Get shorty: the decreasing usefulness of referring to 'short read' sequences

I came across a new paper today:

When I see such papers, I always want to know 'what do you consider short?'. This particular paper makes no attempt to define what 'short' refers to, with the only mention of the 'S' word in the paper being as follows:

Finally, qAlign stores metadata for all generated BAM files, including information about alignment parameters and checksums for genome and short read sequences

There are hundreds of papers that mention 'short read' in their title, and many others that refer to 'long read' sequences.

But 'short' and 'long' are tremendously unhelpful terms for describing sequences. They mean different things to different people, and they can even mean different things to the same person at different times. I think most people would agree that Illumina's HiSeq and MiSeq platforms are considered 'short read' technologies. Yet the HiSeq 2500 is currently capable of generating 250 bp reads (in Rapid Run Mode), an order of magnitude longer than the ~25 bp reads that Illumina/Solexa generated when they started out. So should we refer to these as long-short reads?

The lengths of reads generated by the first wave of new sequencing technologies (Solexa/Illumina, ABI SOLiD, and Ion Torrent) were initially compared to the 'long' (~800 bp) reads generated by Sanger sequencing methods. But these technologies have evolved steadily. The latest reagent kits for the MiSeq platform offer the possibility of 300 bp reads, and if you perform paired-end sequencing of libraries with insert sizes of ~600 bp, you may end up generating single consensus reads that approach this length. Thus we are already at the point where a 'short read' sequencing technology can generate some reads that are longer than some of the reads produced by the former gold-standard 'long read' technology.
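
To make that arithmetic concrete, here is a minimal Python sketch (my own illustration; the function name and numbers are invented, though tools such as FLASH and PEAR perform this kind of read merging for real):

```python
def merged_read_length(read_len, insert_size, min_overlap=10):
    """Length of the single consensus read produced by merging an
    overlapping read pair; returns None if the pair does not overlap
    by at least min_overlap bases."""
    overlap = 2 * read_len - insert_size
    if overlap < min_overlap:
        return None      # reads too far apart to merge reliably
    return insert_size   # the merged consensus spans the whole insert

# 2 x 300 bp MiSeq reads from a 580 bp insert share a 20 bp overlap,
# so they merge into a 580 bp consensus: longer than many Sanger reads.
print(merged_read_length(300, 580))   # -> 580
print(merged_read_length(100, 500))   # -> None (no overlap)
```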

But the read lengths of any of these technologies pale in comparison when we consider the output of instruments from Pacific Biosciences and Oxford Nanopore. By their standards, even Sanger sequencing reads could be considered 'short'.

If someone currently has reads that are 500-600 bp in length, it is not clear whether a software tool that claims to work with 'short reads' is suitable for them. Just as the 'Short Read Archive' (SRA) became the more meaningfully named Sequence Read Archive, so we as a community should banish these unhelpful names. If you develop tools that are optimized to work with 'short' or 'long' read data, please provide explicit guidelines as to what you mean!
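
To show the kind of thing I'm asking for, here is a sketch of what such explicit guidelines might look like in a tool's code; the 50-600 bp range and all names below are invented for illustration, not taken from any real tool:

```python
# Hypothetical example: document and enforce the read lengths a tool was
# actually designed and tested for, rather than just saying 'short reads'.
SUPPORTED_MIN_BP = 50    # invented bounds, for illustration only
SUPPORTED_MAX_BP = 600

def check_read_lengths(read_lengths):
    """Warn when input reads fall outside the documented, tested range."""
    outside = [n for n in read_lengths
               if not SUPPORTED_MIN_BP <= n <= SUPPORTED_MAX_BP]
    if outside:
        print(f"Warning: {len(outside)} read(s) fall outside the tested "
              f"{SUPPORTED_MIN_BP}-{SUPPORTED_MAX_BP} bp range; "
              "results may be unreliable.")

check_read_lengths([150, 300, 550, 8000])  # flags the 8000 bp read
```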

To conclude:

There are no 'short' or 'long' reads, there are only sequences that are shorter or longer than other sequences.

Data access for the 1,000 Plants (1KP) project

From the abstract of a new paper in GigaScience:

The 1,000 plants (1KP) project is an international multi-disciplinary consortium that has generated transcriptome data from over 1,000 plant species, with exemplars for all of the major lineages across the Viridiplantae (green plants) clade. Here, we describe how to access the data used in a phylogenomics analysis of the first 85 species, and how to visualize our gene and species trees.

The paper doesn't provide a link to what seems to be the actual project website; instead, it mentions directories within the iPlant Collaborative where you can access the data. The project website reveals that this project can be referred to as '1000 plants', 'oneKP', or '1KP' (but not '1000P'?).

Being a pedantic kind of guy, I was curious about the paper's vague mention of 'over 1,000 plant species'. How many species exactly? The paper doesn't say. But if you go to one of the iPlant pages for 1KP, you will see this:

Altogether, we sequenced 1320 samples (from 1162 species)

So this project seems to have exceeded the boundaries suggested by its name. How about the '1.2KP' project?