Details of GFF version 4 have emerged

One of the most widely used file formats in bioinformatics is the General Feature Format (GFF). This venerable tab-delimited format uses 9 columns of text to help describe any set of features that can be localized to a DNA or RNA sequence.

It is most commonly used to provide a set of genome annotations that accompany a genome sequence file, and the success of this format has also spawned the similar Gene Transfer Format (GTF), which focuses on gene structural information.

GFF has been an evolving format, and the widely adopted 2nd version has largely been superseded by GFF version 3. This was developed by Lincoln Stein from around 2003 onwards.

As version 3 is now over a decade old, work has been ongoing to develop a new version, GFF 4, that is suitable for the rigors of modern-day genomics. The principal change in version 4 will be the addition of a 10th GFF column. This 'Feature ID' column is defined in the spec as follows:

Column 10: Feature ID

Format: FeatureID=<integer>

Every feature in a GFF file should be referenced by a numerical identifier which is unique to that particular feature across all GFF files in existence.

This field will store an integer in the range 1–999,999,999,999,999 (no zero-padding) and identifiers will be generated via tools available from the GFF 4 consortium. If you wish to generate a GFF 4 file, you will need to obtain officially sanctioned Feature IDs for this mandatory field.
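The format rule above (the literal prefix 'FeatureID=', then an integer from 1 to 999,999,999,999,999 with no zero-padding) could be checked with something like the following Python sketch. The function and constant names are my own invention for illustration; nothing here comes from any consortium tooling.

```python
import re

# Upper bound taken from the range quoted in the column definition above.
MAX_FEATURE_ID = 999_999_999_999_999

# 'FeatureID=' followed by digits that do not start with zero,
# which rules out both zero itself and zero-padded values.
FEATURE_ID_PATTERN = re.compile(r"FeatureID=([1-9][0-9]*)")


def parse_feature_id(field: str) -> int:
    """Return the integer Feature ID held in a column-10 field.

    Raises ValueError if the field does not follow the format
    described above. (Hypothetical helper, for illustration only.)
    """
    match = FEATURE_ID_PATTERN.fullmatch(field.strip())
    if match is None:
        raise ValueError(f"not a valid Feature ID field: {field!r}")
    value = int(match.group(1))
    if value > MAX_FEATURE_ID:
        raise ValueError(f"Feature ID out of range: {value}")
    return value
```

Note that this only checks the syntax of the field. Whether an ID has actually been issued is a separate question that the text says would be answered by the consortium.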

The advantage of this new field is that all bioinformatics tools and databases will have a convenient way to uniquely reference any feature in any GFF file (as long as it is version 4 compliant).

Large institutions may wish to work with the GFF 4 consortium to reserve blocks of consecutive numeric ranges for Feature IDs.

It is intended that the GFF 4 consortium will act as a gatekeeper to all Feature IDs, and that via their APIs you will be able to check whether any given Feature ID exists, and if it does, you will be able to extract the relevant details of that feature from whatever GFF file in the world contains that specific Feature ID.

Here is an example of how GFF version 4 would describe an intron from a gene:

## gff-version 4
## sub-version 1.02
## generated: 2015-02-01
## sequence-region   chr1 1 2097228       
chrX    Coding_transcript   intron 14192   14266   .   -   gene=Gene00071  FeatureID=125731789

In this example, the intron is the 125,731,789th feature to be registered globally with the GFF 4 consortium. The big advantage of this format is that a researcher can now guarantee that this particular Feature ID will not exist in any other GFF file anywhere in the world. The use of unique identifiers like this will be a huge leap forward for bioinformatics as we will no longer have to worry about lines in our GFF files possibly existing in someone else's GFF files as well.
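If you wanted to test that uniqueness claim on files you already hold, a rough Python sketch might look like the one below. It assumes the Feature ID is always the last whitespace-separated field on a feature line, as in the example above (the original tabs are not preserved here, so the code splits on any whitespace). The function names and the file names in the usage example are made up for illustration.

```python
from collections import defaultdict

PREFIX = "FeatureID="


def feature_ids(lines):
    """Yield the integer Feature ID from each feature line.

    Blank lines and lines starting with '#' (such as the '##'
    directives in the example) are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        last_field = line.split()[-1]
        if last_field.startswith(PREFIX):
            yield int(last_field[len(PREFIX):])


def find_shared_ids(files):
    """Map each Feature ID seen in more than one file to those file names.

    'files' is a dict of file name -> iterable of lines.
    """
    seen = defaultdict(set)
    for name, lines in files.items():
        for fid in feature_ids(lines):
            seen[fid].add(name)
    return {fid: sorted(names) for fid, names in seen.items() if len(names) > 1}


# Usage with two made-up, in-memory "files":
example = "chrX Coding_transcript intron 14192 14266 . - gene=Gene00071 FeatureID=125731789"
shared = find_shared_ids({
    "mine.gff": ["## gff-version 4", example],
    "theirs.gff": ["## gff-version 4", example],
})
print(shared)
```

An empty result would mean no Feature ID appears in more than one of the files you supplied, which is of course a much weaker statement than global uniqueness.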

Update: check the date

Bogus bioinformatics acronyms…there's a lot of them about

Time for some new JABBA awards to recognize the ongoing series of crimes perpetrated in the name of bioinformatics. Two new examples this week…

 

Exhibit A (h/t @attilacsordas): from arxiv.org we have…

CoMEt derives from 'Combinations of Mutually Exclusive Alterations'. Of course the best way of making it easy for people to find your bioinformatics tool is to give it the same name as an existing tool which does something completely different. So don't be surprised if you search the web for 'CoMEt' only to find a bioinformatics tool called 'CoMet' from 2011 (note the lower-case 'e'!). CoMet is a web server for comparative functional profiling of metagenomes.

 

Exhibit B: from the journal Bioinformatics — the leading provider of bogus bioinformatics acronyms since 1998 — we have…

MUSCLE is derived from 'Multi-platform Unbiased optimization of Spectrometry via Closed-Loop Experimentation'. Multi-platform you say? What platforms would those be? From the paper:

MUSCLE is a stand-alone desktop application and has been tested on Windows XP, 7 and 8

What, no love for Windows Vista?

Of course, it should be obvious to anyone that this bioinformatics tool called MUSCLE should in no way be confused with the other (pre-existing) bioinformatics tool called MUSCLE.

Is Amazon's new 'unlimited' cloud drive suitable for bioinformatics?

Amazon have revealed new plans for their cloud drive service. Impressively, their 'Unlimited Everything' plan offers the chance to store an unlimited number of files and documents for just $59.99 per year (after a 3-month free trial no less).

News of this new unlimited storage service caught the attention of more than one bioinformatician:

If you didn't know, bioinformatics research can generate a lot of data. It is not uncommon to see individual files of DNA sequences stored in the FASTQ format reach 15–20 GB in size (and this is just plain text). Such files are nearly always processed to remove errors and contamination, resulting in slightly smaller versions of each file. These processed files are often mapped to a genome or transcriptome, which generates even more output files. The new output files may in turn be processed with other software tools, leading to yet more output files. The raw input data should always be kept in case experiments need to be re-run with different settings, so a typical bioinformatics pipeline may end up generating terabytes of data. Compression can help, but the typical research group will always be generating more and more data, which usually means a constant struggle to store (and back up) everything.

So could Amazon's new unlimited storage offer a way of dealing with the common file-management headache which plagues bioinformaticians (and their sys admins)? Well, probably not. Their Terms of Use contain an important section (emphasis mine):

3.2 Usage Restrictions and Limits. The Service is offered in the United States. We may restrict access from other locations. There may be limits on the types of content you can store and share using the Service, such as file types we don't support, and on the number or type of devices you can use to access the Service. We may impose other restrictions on use of the Service.

You may be able to get away with using this service to store large amounts of bioinformatics data, but I don't think Amazon are intending for it to be used by anyone in this manner. So it wouldn't surprise me if Amazon quietly started imposing restrictions on certain file types or slowing bandwidth for heavy users such that it would make it impractical to rely on for day-to-day usage.

Google and WormBase: these are not the search results you're looking for

Today I wanted to look up a particular gene in the WormBase database. Rather than go to the WormBase website, I thought I would just search Google for the word 'wormbase' followed by the gene name (rpl-22). Surely this would be enough to put the Gene Summary page for rpl-22 at the top of the results?

Sadly no. Here are the results that I was presented with:

All ten of these results include information from the WormBase database regarding the rpl-22 gene and/or link to the WormBase page for the gene. But there are no search results for wormbase.org at all.

Very odd. Is WormBase not allowing itself to be indexed by search engines? I see a similar lack of wormbase.org results when using Bing, Ask, or DuckDuckGo. However, if I search Google for flybase rpl-22 or pombase rpl-22 I find the desired fly/yeast orthologs of the worm rpl-22 gene as the top Google result.
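One way to investigate the indexing question is to look at a site's robots.txt rules. The Python sketch below uses the standard library's urllib.robotparser to ask whether a given set of rules lets a crawler fetch a URL. The two rule sets and the URL in the usage example are invented for illustration; they are not WormBase's actual rules or a real gene page, and the helper name is my own.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text allows 'agent' to fetch 'url'."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Made-up rule sets, NOT taken from any real site:
BLOCK_EVERYTHING = "User-agent: *\nDisallow: /\n"
ALLOW_EVERYTHING = "User-agent: *\nDisallow:\n"

# Placeholder URL for illustration only.
page = "http://www.wormbase.org/example-gene-page"

print(is_crawlable(BLOCK_EVERYTHING, page))
print(is_crawlable(ALLOW_EVERYTHING, page))
```

To check a live site you could instead call the parser's set_url() and read() methods to fetch its robots.txt. Bear in mind that robots.txt is only one possible explanation: pages can also be kept out of search results in other ways, so a permissive robots.txt would not settle the question.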