The top 10 #PLOSBuzzFeed tweets that will put a smile on your face

It all started so innocently. Nick Loman (@pathogenomenick) expressed his dissatisfaction with yet another PLOS Computational Biology article that uses the 10 Simple Rules… template:


There were two immediate responses from Kai Blin (@kaiblin) and Phil Spear (@Duke_of_neural):


I immediately saw the possibility that this could become a meme-worthy hashtag, so I simply retweeted Phil’s tweet, added the hashtag #PLOSBuzzFeed, and waited to see what would happen (as well as making some of my own contributions).

At the time of writing — about 10 hours later — there have been several hundred tweets using this hashtag. Presented in reverse order, here are the most ‘popular’ tweets from today (as judged by summing retweets and favorites):

How to cite bioinformatics resources

I saw a post on Biostars today that asked how specific versions of genome assemblies should be cited. This question also applies to the more general issue of citing any bioinformatics resource which may have multiple releases/versions which are not all formally published in papers. Here is how I replied:

Citing versions of any particular bioinformatics/genomics resources can get tricky because there is often no formal publication for every release of a given dataset. Further complicating the situation is the fact that you will often come across different dates (and even names) for the same resource. E.g. the latest cow genome assembly generated by the University of Maryland is known as 'UMD 3.1.1'. However, the UCSC genome browser uses their own internal IDs for all cow genome assemblies and refers to this as 'bosTau8'. Someone new to the field might see the UCSC version and not know about the original UMD name.

Sometimes you can use dates of files on FTP sites to approximately date sequence files, but these can sometimes change (sometimes files accidentally get removed and replaced from backups, which can change their date).

The key thing to aim for is to provide suitable information so that someone can reproduce your work. In my mind, this requires 2–3 pieces of information:

  1. The name or release number of the dataset you are downloading (provide alternate names when known)
  2. The specific URL for the website or FTP site that you used to download the data
  3. The date on which you downloaded the data

E.g. The UMD 3.1.1 version of the cow genome assembly (also known as bosTau8) was downloaded from the UCSC Genome FTP site (ftp://hgdownload.cse.ucsc.edu/bosTau8/bigZips/bosTau8.fa.gz).

When no version number is available (it is very unhelpful not to provide version numbers for sequence resources: files can, and will, change), I always refer to the date on which I downloaded the data instead.
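The three pieces of information listed above can also be captured programmatically at download time. Here is a minimal Python sketch of that idea (the function name and the provenance-file format are my own invention; the URL is the UCSC example quoted above):

```python
# Hypothetical sketch: write a small provenance file (name, URL, download
# date, checksum) next to a downloaded dataset. The function and file
# naming scheme are assumptions, not an established convention.
import datetime
import hashlib
import pathlib

def record_provenance(path, url, resource_name):
    """Record what was downloaded, from where, when, and a checksum."""
    data = pathlib.Path(path).read_bytes()
    lines = [
        f"resource: {resource_name}",
        f"url: {url}",
        f"downloaded: {datetime.date.today().isoformat()}",
        f"md5: {hashlib.md5(data).hexdigest()}",
    ]
    pathlib.Path(f"{path}.provenance.txt").write_text("\n".join(lines) + "\n")

# Stand-in file so the sketch is self-contained (a real run would download first)
pathlib.Path("bosTau8.fa.gz").write_bytes(b"example data")
record_provenance("bosTau8.fa.gz",
                  "ftp://hgdownload.cse.ucsc.edu/bosTau8/bigZips/bosTau8.fa.gz",
                  "UMD 3.1.1 (bosTau8)")
```

The checksum adds a fourth, optional safeguard: even if a file on an FTP site is silently replaced, a stored checksum lets you detect that your copy differs from the current one.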

Get shorty: the decreasing usefulness of referring to 'short read' sequences

I came across a new paper today:

When I see such papers, I always want to know 'what do you consider short?'. This particular paper makes no attempt to define what 'short' refers to, with the only mention of the 'S' word in the paper being as follows:

Finally, qAlign stores metadata for all generated BAM files, including information about alignment parameters and checksums for genome and short read sequences

There are hundreds of papers that mention 'short read' in their title and many more which refer to 'long read' sequences.

But 'short' and 'long' are tremendously unhelpful terms to refer to sequences. They mean different things to different people, and they can even mean different things to the same person at different times. I think that most people would agree that Illumina's HiSeq and MiSeq platforms are considered 'short read' technologies. The HiSeq 2500 is currently capable of generating 250 bp reads (in Rapid Run Mode), yet this is an order of magnitude greater than when Illumina/Solexa started out generating ~25 bp reads. So should we refer to these as long-short reads?

The lengths of reads generated by the first wave of new sequencing technologies (Solexa/Illumina, ABI SOLiD, and Ion Torrent) were initially compared to the 'long' (~800 bp) reads generated by Sanger sequencing methods. But these technologies have evolved steadily. The latest reagent kits for the MiSeq platform offer the possibility of 300 bp reads. However, if you perform paired-end sequencing of libraries with insert sizes of ~600 bp, then you may end up generating single consensus reads that approach this length. Thus we are already at the point where a 'short read' sequencing technology can generate some reads that are longer than some of the reads produced by the former gold-standard 'long read' technology.
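The arithmetic behind those merged read pairs is simple enough to sketch (a hypothetical helper, assuming perfect overlap and ignoring adapter trimming and sequencing error):

```python
# Hypothetical helper: when the two reads of a pair overlap, merging them
# yields a single consensus read whose length equals the library insert size.
def merged_read_length(read_len, insert_size):
    overlap = 2 * read_len - insert_size
    if overlap <= 0:
        return None  # reads do not overlap; no single consensus read possible
    return insert_size

# 2 x 300 bp MiSeq reads from a ~550 bp insert overlap by 50 bp,
# giving a 550 bp consensus read, i.e. longer than many Sanger reads.
print(merged_read_length(300, 550))  # prints 550
```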

But the read lengths of any of these technologies pale in comparison when we consider the output of instruments from Pacific Biosciences and Oxford Nanopore. By their standards, even Sanger sequencing reads could be considered 'short'.

If someone currently has reads that are 500–600 bp in length, it is not clear whether any software tool that claims to work with 'short reads' is suitable or not. Just as the 'Short Read Archive' (SRA) became the more meaningfully named Sequence Read Archive, so we as a community should banish these unhelpful names. If you develop tools that are optimized to work with 'short' or 'long' read data, please provide explicit guidelines as to what you mean!

To conclude:

There are no 'short' or 'long' reads, there are only sequences that are shorter or longer than other sequences.

101 questions with a bioinformatician #23: Todd Harris

Todd Harris is a Bioinformatics Consultant and Project Manager at WormBase. I first came to know Todd when I was also working on the WormBase project. As part of the UK operation (based at the Sanger Institute), we would frequently refer to him as 'SuperTodd' for his amazing skills at single-handedly keeping the WormBase website updated and working smoothly.

Read More

Details of GFF version 4 have emerged

One of the most widely used file formats in bioinformatics is the General Feature Format (GFF). This venerable tab-delimited format uses 9 columns of text to help describe any set of features that can be localized to a DNA or RNA sequence.

It is most commonly used to provide a set of genome annotations that accompany a genome sequence file, and the success of this format has also spawned the similar Gene Transfer Format (GTF), which focuses on gene structural information.
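To make the nine-column layout concrete, here is a minimal Python sketch of a GFF line parser (the column names follow the GFF3 specification; the example line itself is hypothetical):

```python
# Minimal sketch: split one GFF line into its nine tab-delimited columns.
# Column names are taken from the GFF3 spec; the sample line is made up.
GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]

def parse_gff_line(line):
    """Return a dict mapping GFF column names to field values."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(GFF_COLUMNS, fields))
    record["start"] = int(record["start"])  # coordinates are 1-based integers
    record["end"] = int(record["end"])
    return record

line = "chrX\tCoding_transcript\tintron\t14192\t14266\t.\t-\t.\tgene=Gene00071"
rec = parse_gff_line(line)
print(rec["type"], rec["start"], rec["end"])  # prints: intron 14192 14266
```

A real parser would also need to skip `##` pragma lines and unpack the semicolon-separated key=value pairs in the attributes column, but the nine-field split above is the heart of the format.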

GFF has been an evolving format, and the widely adopted 2nd version has largely been superseded by GFF version 3, which was developed by Lincoln Stein from around 2003 onwards.

As version 3 is now over a decade old, work has been ongoing to develop a new GFF version 4 that is suitable for the rigors of modern-day genomics. The principal change in version 4 will be the addition of a 10th GFF column. This 'Feature ID' column is defined in the spec as follows:

Column 10: Feature ID

Format: FeatureID=<integer>

Every feature in a GFF file should be referenced by a numerical identifier which is unique to that particular feature across all GFF files in existence.

This field will store an integer in the range 1–999,999,999,999,999 (no zero-padding) and identifiers will be generated via tools available from the GFF 4 consortium. If you wish to generate a GFF 4 file, you will need to obtain officially sanctioned Feature IDs for this mandatory field.

The advantage of this new field is that all bioinformatics tools and databases will have a convenient way to uniquely reference any feature in any GFF file (as long as it is version 4 compliant).

Large institutions may wish to work with the GFF 4 consortium to reserve blocks of consecutive numeric ranges for Feature IDs.

It is intended that the GFF 4 consortium will act as a gatekeeper to all Feature IDs, and that via their APIs you will be able to check whether any given Feature ID exists; if it does, you will be able to extract the relevant details of that feature from whatever GFF file in the world contains that specific Feature ID.

Here is an example of how GFF version 4 would describe an intron from a gene:

## gff-version 4
## sub-version 1.02
## generated: 2015-02-01
## sequence-region   chrX 1 2097228
chrX    Coding_transcript   intron   14192   14266   .   -   .   gene=Gene00071   FeatureID=125731789

In this example, the intron is the 125,731,789th feature to be registered globally with the GFF 4 consortium. The big advantage of this format is that a researcher can now guarantee that this particular Feature ID will not exist in any other GFF file anywhere in the world. The use of unique identifiers like this will be a huge leap forward for bioinformatics, as we will no longer have to worry about lines in our GFF files possibly existing in someone else's GFF files as well.

Update: check the date

Bogus bioinformatics acronyms…there's a lot of them about

Time for some new JABBA awards to recognize the ongoing series of crimes perpetrated in the name of bioinformatics. Two new examples this week…

 

Exhibit A (h/t @attilacsordas): from arxiv.org we have…

CoMEt derives from 'Combinations of Mutually Exclusive Alterations'. Of course, the best way of making it easy for people to find your bioinformatics tool is to give it a name identical to that of an existing tool which does something completely different. So don't be surprised if you search the web for 'CoMEt' only to find a bioinformatics tool called 'CoMet' from 2011 (note the lower-case 'e'!). CoMet is a web server for comparative functional profiling of metagenomes.

 

Exhibit B: from the journal Bioinformatics — the leading provider of bogus bioinformatics acronyms since 1998 — we have…

MUSCLE is derived from 'Multi-platform Unbiased optimization of Spectrometry via Closed-Loop Experimentation'. Multi-platform you say? What platforms would those be? From the paper:

MUSCLE is a stand-alone desktop application and has been tested on Windows XP, 7 and 8

What, no love for Windows Vista?

Of course, it should be obvious to anyone that this bioinformatics tool called MUSCLE should in no way be confused with the other (pre-existing) bioinformatics tool called MUSCLE.

Is Amazon's new 'unlimited' cloud drive suitable for bioinformatics?

Amazon have revealed new plans for their cloud drive service. Impressively, their 'Unlimited Everything' plan offers the chance to store an unlimited number of files and documents for just $59.99 per year (after a 3-month free trial no less).

News of this new unlimited storage service caught the attention of more than one bioinformatician:

If you didn't know, bioinformatics research can generate a lot of data. It is not uncommon to see individual files of DNA sequences stored in the FASTQ format reach 15–20 GB in size (and this is just plain text). Such files are nearly always processed to remove errors and contamination resulting in slightly smaller versions of each file. These processed files are often mapped to a genome or transcriptome which generates even more output files. The new output files in turn may be processed with other software tools leading to yet more output files. The raw input data should always be kept in case experiments need to be re-run with different settings so a typical bioinformatics pipeline may end up generating terabytes of data. Compression can help, but the typical research group will always be generating more and more data, which usually means a constant struggle to store (and backup) everything.
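A quick back-of-envelope calculation shows how fast this adds up (every number and ratio below is an illustrative assumption, not a measurement from any real pipeline):

```python
# Illustrative sketch of per-project storage growth; all figures are
# assumptions chosen only to show the shape of the arithmetic.
raw_fastq_gb = 20      # one raw FASTQ file (plain text), as mentioned above
trimmed_ratio = 0.9    # cleaned file slightly smaller than the raw input
bam_ratio = 1.5        # assumed mapped-output volume relative to raw input
n_samples = 50         # a modest project

per_sample_gb = raw_fastq_gb * (1 + trimmed_ratio + bam_ratio)
total_tb = n_samples * per_sample_gb / 1024
print(f"~{total_tb:.1f} TB before any downstream analysis")
```

Even with these modest assumptions the total lands in the terabytes, and that is before counting downstream outputs or backups.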

So could Amazon's new unlimited storage offer a way of dealing with the common file-management headache which plagues bioinformaticians (and their sys admins)? Well probably not. Their Terms of Use contain an important section (emphasis mine):

3.2 Usage Restrictions and Limits. The Service is offered in the United States. We may restrict access from other locations. There may be limits on the types of content you can store and share using the Service, such as file types we don't support, and on the number or type of devices you can use to access the Service. We may impose other restrictions on use of the Service.

You may be able to get away with using this service to store large amounts of bioinformatics data, but I don't think Amazon intend for it to be used by anyone in this manner. So it wouldn't surprise me if Amazon quietly started imposing restrictions on certain file types, or throttling bandwidth for heavy users, to the point where the service would be impractical to rely on for day-to-day usage.

Google and WormBase: these are not the search results you're looking for

Today I wanted to look up a particular gene in the WormBase database. Rather than go to the WormBase website, I thought I would just search Google for the word 'wormbase' followed by the gene name (rpl-22). Surely this would be enough to put the Gene Summary page for rpl-22 at the top of the results?

Sadly no. Here are the results that I was presented with:

All ten of these results include information from the WormBase database regarding the rpl-22 gene and/or link to the WormBase page for the gene. But there are no search results for wormbase.org at all.

Very odd. Is WormBase not allowing themselves to be indexed by search engines? I see a similar lack of wormbase.org results when using Bing, Ask, or DuckDuckGo. However, if I search Google for flybase rpl-22 or pombase rpl-22, I find the desired fly/yeast orthologs of the worm rpl-22 gene as the top Google result.
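One thing worth checking in a situation like this is the site's robots.txt file, which tells search engines what they may index. Here is a sketch using Python's standard library; the two rules shown are purely hypothetical and are not taken from WormBase's actual file:

```python
# Sketch: test what a robots.txt file would let a crawler fetch.
# The User-agent/Disallow rules below are invented for illustration only.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /db/",
])

# A dynamic database URL blocked by the hypothetical rule above:
print(rp.can_fetch("*", "http://www.wormbase.org/db/get?name=rpl-22"))   # False
# A static page not covered by any Disallow rule:
print(rp.can_fetch("*", "http://www.wormbase.org/species/c_elegans"))    # True
```

For a live check you could point the parser at the real file with `rp.set_url(...)` followed by `rp.read()`, bearing in mind that the answer depends on whatever the site serves at that moment.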

Regarding the current state of bioinformatics training

Todd Harris (@tharris) is on a bit of a roll at the moment. Last month I linked to his excellent blog post regarding community annotation, and today I find myself linking to his latest blog post:

Todd makes a convincing argument that bioinformatics education has largely failed, and he lists three reasons for this, the last of which is as follows:

Finally, the nature of much bioinformatics training is too rarefied. It doesn’t spend enough time on core skills like basic scripting and data processing. For example, algorithm development has no place in a bioinformatics overview course, more so if that is the only exposure to the field the student will have.

I particularly empathize with this point. There should be a much greater emphasis on core data processing skills in bioinformatics training, but ideally students should be getting access to some of these skills at an even earlier age. Efforts such as the Hour of Code initiative are helping raise awareness regarding the need to teach coding skills — and it's good to see the President join in with this — but it would be so much better if coding was part of the curriculum everywhere. As Steve Jobs once said:

"I think everybody in this country should learn … a computer language because it teaches you how to think … I view computer science as a liberal art. It should be something that everybody learns, takes a year in their life, one of the courses they take is learn how to program" — Steve Jobs, 1995.

Taken from 'Steve Jobs: The Lost Interview'.

Maybe this is still a pipe dream, but if we can't teach useful coding skills for everyone, we should at least be doing this for everyone who is considering any sort of career in the biological sciences. During my time at UC Davis, I've helped teach some basic Unix and Perl skills to many graduate students, but frustratingly this teaching has often come at the end of their first year in Grad School. By this point in their graduate training, they have often already encountered many data management problems and have not been equipped with the necessary skills to help them deal with those problems.

I think that part of the problem is that we still use the label 'bioinformatics training' and this reinforces the distinction from a more generic 'biological training'. It may once have been the case that bioinformatics was its own specialized field, but today I find that bioinformatics mostly just describes a useful set of data processing skills…skills which will be needed by anybody working in the life sciences.

Maybe we need to rebrand 'bioinformatics training', and use a name which better describes the general importance of these skills ('Essential data training for biologists?'). Whatever we decide to call it, it is clear that we need it more than ever. Todd ends his post with a great piece of advice for any current graduate students in the biosciences:

You should be receiving bioinformatics training as part of your core curriculum. If you aren’t, your program is failing you and you should seek out this training independently. You should also ask your program leaders and department chairs why training in this field isn’t being made available to you.


Thinking of naming your bioinformatics software? Be afraid (of me), be very afraid (of me)

Saw this tweet today by Jessica Chong (@jxchong):

I like the idea that people might be fearful of choosing a name that could provoke me into writing a JABBA post. If my never-ending tirade about bogus bioinformatics acronyms causes some people to at least think twice about their intended name, then I will take that as a minor victory for this website!