BLAST bug (or feature?) in NCBI BLAST v2.2.30+

Something changed in the latest version of NCBI BLAST+ which breaks our CEGMA software. Compare the behavior of this simple TBLASTN command in v2.2.29+ and v2.2.30+ (from October 2014):


v2.2.29+

tblastn -db sample.dna -query sequence.prot -word_size 5

TBLASTN 2.2.29+

Database: sample.dna
           1 sequences; 2,499,950 total letters

Query= 7292122___KOG0292

Length=1234
                                                 Score     E
Sequences producing significant alignments:      (Bits)  Value

CHROMOSOME_I 1 15072418                           38.9    0.002

v2.2.30+

tblastn -db sample.dna -query sequence.aa -word_size 5

BLAST query/options error: Compressed alphabet lookup table requires word size 6 or 7
Please refer to the BLAST+ user manual.

One step in the CEGMA pipeline involves running TBLASTN with a word size of 5. This no longer works in the latest version, and the error message suggests that only a word size of 6 or 7 is permitted. I can confirm that this is the case by looking at the latest source code for the blast_options.c file:


else if (options->lut_type == eCompressedAaLookupTable &&
         options->word_size != 6 && options->word_size != 7) {
         Blast_MessageWrite(blast_msg, eBlastSevError, kBlastMessageNoContext,
               "Compressed alphabet lookup table requires "
               "word size 6 or 7");
         return BLASTERR_OPTION_VALUE_INVALID;
}
    

The error message suggests I look at the BLAST+ user manual. I did this, and according to Table C5:

tblastn application options:

option = word_size    
type = integer
default value  = 3 
description and notes = "Valid word sizes are 2-7."

There also seems to be no mention of this change in the release notes, all of which makes me think that this is a bug. So I will report this to the NCBI, but any CEGMA users out there may wish to hold off updating to v2.2.30+.
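
If you want a pipeline to fail early rather than halfway through a TBLASTN run, one option is to check the BLAST+ version string before using a word size of 5. The following is only a minimal sketch: the function names and the KNOWN_AFFECTED set are hypothetical, and the set lists v2.2.30+ only because that is the one release I have seen reject this option.

```python
import re

# Releases observed (in this post) to reject "-word_size 5" for tblastn.
# Assumption: only 2.2.30 is listed because it is the only affected
# release reported here; other releases have not been checked.
KNOWN_AFFECTED = {(2, 2, 30)}


def parse_blast_version(text):
    """Extract (major, minor, patch) from text such as 'TBLASTN 2.2.30+'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)\+", text)
    if match is None:
        raise ValueError(f"no BLAST+ version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def word_size_5_known_broken(version_text):
    """Return True if the version is one reported to reject word size 5."""
    return parse_blast_version(version_text) in KNOWN_AFFECTED


# The banner lines shown in the output above are enough to identify a release.
for banner in ("TBLASTN 2.2.29+", "TBLASTN 2.2.30+"):
    status = "known to fail" if word_size_5_known_broken(banner) else "not known to fail"
    print(f"{banner}: -word_size 5 is {status}")
```

The version text could come from the banner at the top of any TBLASTN report, as in the examples above, or from whatever your installation prints when asked for its version.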

10 bioinformatics tools you should be using on Valentine's Day

1. HUGS: the database of HUman Genome Sequences

"We envisage that the growth in personal genomics will mean that researchers will increasingly want HUGS to cope with their work."


2. LOVE: LncRNA Ortholog Validation and Evaluation

"If you are unsure as to the quality of your lncRNA annotations, we suggest that you need LOVE."


3. KISSES: Kmers In aSsembled SEquenceS 

"We envisage that KISSES will be widely distributed by people working in the field of genome assembly."


4. HEART: Histidine Enrichment Analysis Report Tool

"Accurate detection of histidine-enriched sequences can be achieved if researchers have HEART."


5. ILOVEYOU: Intergenic LOng VariablE Yeast Operational Units

"Detection of this new class of conserved intergenic element will open new avenues for S. cerevisiae researchers, and we predict that many will benefit from a deeper understanding of ILOVEYOUs."


6. ROSESARERED: Random Ortholog SEquence Simulations that ARE REDundant

"This tool effectively generates a series of, largely pointless, simulated ortholog sequences. See also our companion software: ValIdation Of Long Eukaryotic TranscriptS thAt Randomly appEar BioLogically UsEful (VIOLETSAREBLUE)."


7. VALENTINE: VALidation of ENcode Transcriptomes IN Eukaryotes

"We believe that the ENCODE annotations of the human genome are only 80% useful, therefore genome annotators will likely appreciate a VALENTINE."

 

8. PASSI(ON): Predicting ASSembly Integrity (Or Not)

"Based on our observations, we feel that there is an urgent need for PASSI(ON) within the genomics community."

 

9. CHOCOLATES: CHOosing COmputationaL AlgoriThms for Testing Evolutionary Simulations

"In a field which increasingly offers a bewildering choice of bioinformatics tools, we feel that researchers will appreciate CHOCOLATES."


10. SNUGGLES: SearchiNg for Unique Genes in orGanisms Like Eels and Snakes

"There is a desperate shortage of bioinformatics tools that are dedicated to finding unique genes in creatures that look a bit like worms. Hence we are confident that the community of people who work on snakes, eels, nematodes, and other tubular-like organisms will be receptive to SNUGGLES."

 

 

 

Can you say the name of this new bioinformatics method three times fast?

New in the journal Bioinformatics:

jNMFMA: a joint non-negative matrix factorization meta-analysis of transcriptomics data

JNMF stands for Joint Non-negative Matrix Factorization. Throw in some meta-analysis, randomly decide to make the 'J' lower-case as well as italicized, and you end up with the trips-off-the-tongue name of jNMFMA. Try saying it three times fast! Actually, I had trouble pronouncing this just once.

How not to write a sequence assembly comparison paper

A great post by Keith Robison in which he casts his expert eye in the direction of a new pre-print published at F1000 Research.

In the preprint there is a very ill-designed figure, which should be a table, that prints badly, with the font much too light for the unnecessarily heavy background. Displaying a four quadrant image adds nothing; a neatly organized table could display more information in a far more readable format allowing easy comparison. Even if the design were defensible, the content is embarrassingly out-of-date, I'd estimate by close to two years.

The post ends with a great coda.