Red flag alert for a bogus bioinformatics acronym

January 12, 2015 by Keith Bradnam

The first JABBA award of 2015 goes to a paper that was published at the end of 2014 (thanks to twitter user @chenghlee for bringing this to my attention). The paper, published in BMC Medical Genomics, has a succinct title that contains a very bogus name:

FLAGS, frequently mutated genes in public exomes

The title doesn't explicitly reveal the source of the acronym 'FLAGS', but you can probably take a guess. From the abstract:

We termed these genes FLAGS for FrequentLy mutAted GeneS

This gets a JABBA award because a majority (3 out of 5) of the letters in 'FLAGS' are not from the initial letters of words.

A little bit of end-of-year DNA from ACGT

December 31, 2014 by Keith Bradnam

It just remains for me to say:

CATGCCCCCCCCTATAATGAATGGTATGAAGCCCGCTA
ACATGCCGTCGAAGCCGGCCGCGAAGCCACCACCTGGG
AAAATACCTATTTTATTTTTACCGAAGAAAATTAAATG
GCCTATACCCATGAAAATGAATGGTATGAAGCCCGCGG
CGAAAATGAACGCGCCACCGAATTAGAAAATGGCACCC
ATTATCGCGAAGCCGATTCCTAA

101 questions with a bioinformatician #21: Stephen Turner

December 19, 2014 by Keith Bradnam

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

Stephen Turner is Director of the Bioinformatics Core and Assistant Professor of Public Health Sciences at the University of Virginia School of Medicine.

His blog, Getting Genetics Done, should be required reading for anyone who wishes to get lots of practical, hands-on advice about doing bioinformatics. This is especially so if you want to know more about R (he has 140 posts on the topic!). He has a great overview about the goal of the blog:

Many resources offer a 10,000-foot view of the current trends in the field, reviews of various technologies, and guidelines on how to effectively design, analyze, and interpret experiments in human genetics and bioinformatics research. By comparison very few resources focus on the mundane, yet critical know-how for those on the ground actually doing the science (i.e. grad students, postdocs, analysts, and junior faculty). Getting Genetics Done aims to fill that gap by featuring software, code snippets, literature of interest, workflow philosophy, and anything else that can boost productivity and simplify getting things done in human genetics research.

You can find out more about Stephen by visiting his aforementioned blog, or by following him on twitter (@genetics_blog). And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

I'm faculty in Public Health but my primary position is directing our Bioinformatics Core. That means I get to work on all kinds of projects with a very diverse set of collaborators. Monday I might be assembling plant genomes for a collaborator in the biology department, Tuesday I might be working on RNA-seq in patient kidney biopsies with a urologist in the hospital, the next day I might be figuring out how to best approach hybrid assembly with Nanopore and short read sequencing for a plasmid genome. Every day is something different, and the job never gets boring.

010. What's something that you don't enjoy about current bioinformatics research?

Same answer as 001: working on all kinds of projects with a very diverse set of collaborators.

Seriously, as fun as this can be, I often have to sacrifice depth of expertise for breadth. And I think most other bioinformaticians who exist for collaboration have to do the same. I have to be an expert in data analysis and study design of hundreds of different *-seq assays. I can't spend two months working on hybrid assembly with Nanopore and short read sequencing for one collaborator when I have a PAR-CLIP project, an exome variant-calling/annotation project, a 16S microbial profiling project, and a breakpoint mapping project with other collaborators, all needing the same level of attention to detail.

011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Take some programming classes in college, and try contributing to an open-source project.

I, like many other bioinformaticians, am a self-taught programmer. I cut my teeth on Perl years ago before Python was so popular, and have picked up a handful of other generic programming languages and numerical/statistical computing languages since then. But I'm not a software engineer, and at this point I'll only be able to polish my software development practices so much. Sure, most of my code is version controlled, and I know very well how to modularize code with functions, but there's much more to writing and contributing to good software than this. Good science increasingly relies on great software, and not just in genomics. More formal training would have been nice to have.

100. What's your all-time favorite piece of bioinformatics software, and why?

It's not one piece of software, but the Bioconductor community in general is just awesome. Pick any of the applications I mentioned in questions 001 and 010, and there's probably a Bioconductor package to help you with it. Most packages have great documentation, and reliance on a common set of data structures really simplifies things. The mailing list is responsive, and you don't have to have the same thick skin necessary to email R-help.

If I had to nail it down to just one single application, I'm going to have to be unoriginal and go with BEDTools. Way back when, I used to load genomic intervals into MySQL database tables and write impossibly complex (and slow) queries to do very simple BEDTools-y kinds of operations. Just when you think you have a one-of-a-kind "genome arithmetic" problem that no one has ever seen before, you'll often find that you're not so special after all and there's a BEDTools subcommand or recipe that gets you exactly what you need.

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

Besides knowing the ins and outs of many different kinds of NGS studies, what makes a bioinformatician a great scientist is being really good at lots of things at once: a skilled programmer, a skeptical statistician, an influential writer, a perceptive reader, a captivating speaker, a convincing salesman, a careful financial planner, a creative graphic designer, a thoughtful experimentalist, and a friendly colleague. I'm certainly not all of these things, but I'm still going to go with N.

University of Spin: every British university is ranked #1 for research

December 18, 2014 by Keith Bradnam

The UK government published the latest Research Excellence Framework (REF) results today. One goal of this exercise is to make it easier for everyone to see who is winning and losing at academic research¹. The Times Higher Education website has produced a Table of Excellence showing the overall rankings.

The underlying results are broken down by subject area and measured using three different criteria (‘Output’, ‘Impact’, and ‘Environment’), each of which is further broken down into four main grades (1* through to 4*). All of which means that everyone has something to cheer about.

If you looked at the #REF2014 hashtag on twitter today, you might conclude that everyone is a winner. I’ve gathered together some of these tweets in the Storify below, but also check out the tweets at the end which offer further comment regarding all of this spinning:

¹ In reality, these results will be used to distribute future research funding to universities.

Comparisons of computational methods for differential alternative splicing detection using RNA-seq in plant systems →

December 18, 2014 by Keith Bradnam

Marc Robinson-Rechavi (@marc_rr) tweeted about this great new paper in BMC Bioinformatics by Ruolin Liu, Ann Loraine, and Julie Dickerson. From the abstract:

The goal of this paper is to benchmark existing computational differential splicing (or transcription) detection methods so that biologists can choose the most suitable tools to accomplish their goals.

Like so many other areas of bioinformatics, there are many methods available for detecting alternative splicing, and it is far from clear which — if any — is the best. This paper attempts to compare eight of them, and the abstract contains a sobering conclusion:

No single method performs the best in all situations

Figure 5 from the paper is especially depressing. It looks at the overlap of differentially spliced genes as detected by five different methods. There are zero differentially spliced genes that all methods agreed on.

Liu et al. BMC Bioinformatics 2014 15:364 doi:10.1186/s12859-014-0364-4

Understanding MAPQ scores in SAM files: does 37 = 42?

December 16, 2014 by Keith Bradnam

The official specification for the Sequence Alignment Map (SAM) format outlines what is stored in each column of this tab-separated value file format. The fifth column of a SAM file stores MAPping Quality (MAPQ) values. From the SAM specification:

MAPQ: MAPping Quality. It equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.

So if you happened to know that the probability of correctly mapping some random read was 0.99, then the MAPQ score should be 20 (i.e. -10 * log10 of 0.01). If the probability of a correct match increased to 0.999, the MAPQ score would increase to 30. So the upper bound of a MAPQ score depends on the level of precision of your probability (though elsewhere in the SAM spec, it defines an upper limit of 255 for this value). Conversely, as the probability of a correct match tends towards zero, so does the MAPQ score.
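To make the arithmetic concrete, here is a minimal Python sketch of the formula quoted from the SAM specification. The function names are mine (not from the spec or any aligner), and as the rest of this post shows, real aligners do not necessarily report values that follow this formula exactly.

```python
import math


def mapq_from_error_probability(p_wrong: float) -> int:
    """Return -10 * log10(p_wrong), rounded to the nearest integer.

    p_wrong is the probability that the mapping position is wrong,
    i.e. 1 minus the probability of a correct mapping.
    """
    if not 0 < p_wrong <= 1:
        raise ValueError("p_wrong must be in the interval (0, 1]")
    return round(-10 * math.log10(p_wrong))


def error_probability_from_mapq(mapq: int) -> float:
    """Invert the formula: the error probability implied by a MAPQ score."""
    return 10 ** (-mapq / 10)


# The two cases from the text: 99% and 99.9% chance of a correct mapping.
print(mapq_from_error_probability(0.01))   # probability wrong = 0.01
print(mapq_from_error_probability(0.001))  # probability wrong = 0.001
```

Going the other way, `error_probability_from_mapq(37)` and `error_probability_from_mapq(42)` show how small the nominal error probabilities become at the top of the scale.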

So I'm sure that the first thing that everyone does after generating a SAM file is to assess the spread of MAPQ scores in your dataset. Right? Anyone?

< sound of crickets >

Okay, so maybe you don't do this. Maybe you don't really care, and you are happy to trust the default output of whatever short read alignment program that you used to generate your SAM file. Why should it matter? Will these scores really vary all that much?

Here is a frequency distribution of MAPQ scores from two mapping experiments. The bottom panel zooms in to more clearly show the distribution of low frequency MAPQ scores:

Distribution of MAPQ scores from two experiments: bottom panel shows zoomed in view of MAPQ scores with frequencies < 1%.
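If you want to produce this kind of tally for your own data, one simple approach is to count the values in the fifth column of the SAM file. The sketch below is not necessarily how the figure above was made; the function names, the tiny example records, and the decision to skip unmapped reads (FLAG bit 0x4) are all assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def mapq_distribution(sam_lines: Iterable[str]) -> Counter:
    """Count MAPQ values (column 5) across the mapped reads in SAM text."""
    counts: Counter = Counter()
    for line in sam_lines:
        # Header lines start with '@'; also skip blank lines.
        if line.startswith("@") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:  # assumption: ignore reads flagged as unmapped
            continue
        counts[int(fields[4])] += 1
    return counts


def as_percentages(counts: Counter) -> Dict[int, float]:
    """Convert raw counts into percentages, ordered by MAPQ value."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mapq: 100 * n / total for mapq, n in sorted(counts.items())}


# Made-up records, purely to show the expected input layout.
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t0\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t16\tchr1\t200\t1\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t0\tchr1\t300\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
counts = mapq_distribution(example)
percentages = as_percentages(counts)
```

For a real file you could pass an open file handle instead of the list, since the function only needs something that yields lines.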

What might we conclude from this? There seem to be some clear differences between the two experiments. The most frequent MAPQ scores in the first experiment are 42 followed by 1. In the second experiment, scores only reach a maximum value of 37, and scores of 0 are the second most frequent value.

These two experiments reflect some real world data. Experiment 1 is based on data from mouse, and experiment 2 uses data from Arabidopsis thaliana. However, that is probably not why the distributions are different. The mouse data is based on unpaired Illumina reads from a DNase-Seq experiment, whereas the A. thaliana data is from paired Illumina reads from whole genome sequencing. However, that still probably isn't the reason for the differences.

The reason for the different distributions is that experiment 1 used Bowtie 2 to map the reads whereas experiment 2 used BWA. It turns out that different mapping programs calculate MAPQ scores in different ways and you shouldn't really compare these values unless they came from the same program.

The maximum MAPQ value that Bowtie 2 generates is 42 (though it doesn't say this anywhere in the documentation). In contrast, the maximum MAPQ value that BWA will generate is 37 (though once again, you — frustratingly — won't find this information in the manual).

The data for Experiment 1 is based on a sample of over 25 million mapped reads. However, you never see MAPQ scores of 9, 10, or 20, something that presumably reflects some aspect of how Bowtie 2 calculates these scores.

In the absence of any helpful information in the manuals of these two popular aligners, others have tried doing their own experimentation to work out what the values correspond to. Dave Tang has a useful post on Mapping Qualities on his Musings from a PhD Candidate blog. There are also lots of posts about mapping quality on the SEQanswers site (e.g. see here, here or here). However, the prize for the most detailed investigation of MAPQ scores (from Bowtie 2 at least) goes to John Urban who has written a fantastic post on his Biofinysics blog:

How does Bowtie 2 assign MAPQ scores?

So in conclusion, there are 3 important take home messages:

MAPQ scores vary between different programs and you should not directly compare results from, say, Bowtie 2 and BWA.
You should look at your MAPQ scores and potentially filter out the really bad alignments.
Bioinformatics software documentation can often omit some really important details (see also my last blog post on this subject).
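As a rough illustration of the second message, here is a small Python sketch that drops alignments below a chosen MAPQ threshold. The function name and the threshold of 10 are arbitrary choices for the example, not recommendations; given the first message, a sensible cut-off depends on which aligner produced the scores.

```python
from typing import Iterable, Iterator


def filter_by_mapq(sam_lines: Iterable[str], min_mapq: int) -> Iterator[str]:
    """Yield header lines plus alignments whose MAPQ is at least min_mapq.

    Note: a MAPQ of 255 means 'not available' in the SAM spec, so such
    records pass any threshold here; handle them separately if needed.
    """
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # keep the header so the output is still valid SAM
            continue
        fields = line.split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record; skip it
        if int(fields[4]) >= min_mapq:
            yield line


# Made-up records, purely to show the behaviour.
example_lines = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t0\tchr1\t200\t1\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t0\tchr1\t300\t0\t4M\t*\t0\t0\tACGT\tIIII",
]
kept = list(filter_by_mapq(example_lines, min_mapq=10))
```

In practice most people would let SAMtools do this (`samtools view -q` takes a minimum MAPQ), but writing it out makes it clear that the filter is nothing more than a comparison on column five.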

Kablammo: an interactive, web-based BLAST results visualizer →

December 11, 2014 by Keith Bradnam

Another great name for a piece of bioinformatics software! This tool has just been published in the journal Bioinformatics by Jeff Wintersinger and James Wasmuth. From the abstract:

Kablammo is a web-based application that produces interactive, vector-based visualizations of sequence alignments generated by BLAST. These visualizations can illustrate many features, including shared protein domains, chromosome structural modifications, and genome misassembly.

101 questions with a bioinformatician #20: Roy Chaudhuri

December 11, 2014 by Keith Bradnam

Roy Chaudhuri is a Lecturer in Bioinformatics in the Department of Molecular Biology and Biotechnology at the University of Sheffield, and is part of the Sheffield Bioinformatics Hub. Roy's expertise concerns the comparative genomics and phylogenetics of bacterial pathogens and in a previous life he helped set up the coliBASE and xBASE databases. In a previous-previous life he was also a pioneering website designer (I shouldn't judge: people in glass houses… and all that).

He claims that his current duties involve "research, teaching, publishing, and trying to convince people to give me money". If you would like to give Roy money (perhaps a £1 donation towards his Eccles Cake fund?), you can get in contact with him via the Sheffield Bioinformatics Hub website. You can also find out more about Roy by following him on twitter (@RoyChaudhuri)…but be warned, he is a non-stop tweeter! And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

I like that after 16 years as a bioinformatician, I'm still learning new things every day, and that there's no shortage of cool datasets and interesting problems to keep me busy. I also like how far it's possible to get by knowing a little bit of biology and a little Perl.

010. What's something that you don't enjoy about current bioinformatics research?

I worry that too much community effort has been devoted to dealing with problems that are specific to short-read data. I'd like to think that in five years' time sequencing will just work, and we will be able to devote our time to dealing with biological quirks rather than technical ones. I'm pretty sure I said the same thing five years ago, though.

011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Most of my advice wouldn't be work-related, but I'd certainly mention that the clock starts ticking on potential fellowship opportunities as soon as you get your PhD. I definitely missed the starting gun on that one.

100. What's your all-time favorite piece of bioinformatics software, and why?

I'll go for Prokka, because it does an astonishingly good job at annotating bacterial genomes (better than many manual attempts...), and because Torsten wrote the book (well, blog post) on creating usable command-line bioinformatics tools. I particularly like that it checks for its dependencies at the start, rather than choking half-way through, and that it sometimes finishes with a quote from the Hitchhiker's Guide to the Galaxy.

Other than that, I'm a big fan of MUMmer, and I'm always impressed by how many different things it's possible to achieve by stringing two or three SAMtools commands together. If non-bioinformatics-specific software counts, then I'd also mention GNU Parallel, Perl and UNIX itself.

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

M, because it's Not K.

Searching for sausage rolls: using Google Scholar to look at the popularity of British culinary delights

December 10, 2014 by Keith Bradnam

Sometimes it can be fun to search Google Scholar for words or phrases that you might not expect to ever appear in the title of an academic article. So last night, I conducted an important scientific study and looked at the popularity of various quintessential items of British cuisine:

Among all of the foods that I searched for, Fish and Chips proved the most popular item with 52 results. Most of these are articles talking about Fluorescent In Situ Hybridization (FISH) and Chromatin IP (ChIP) experiments.
The next most popular item was the healthy delight that is Black Pudding. This was represented by 9 results, one of which is this gem from the British Medical Journal: Controlled prospective study of faecal occult blood screening for colorectal cancer in Bury, black pudding capital of the world.
Sticking with puddings, I looked at the popularity of what is unquestionably the King of all puddings: the Yorkshire pudding. This has just 3 search results and one of these appears to be a gripping thriller: Rheological Study of Batter Dough for Yorkshire Pudding Production.
There were only 3 results for pork pies but they include the wonderful title of a PhD thesis from the University of Nottingham: Storage changes in pork pies (a real page turner!).
The beloved Cornish Pasty also merits just 3 results including a paper in the Journal of Genetics and Development that sounds bizarre: A modified cornish pasty method for ex ovo culture of the chick embryo.
There are 3 mentions for Spotted Dick, one of which seems to be a zinc-finger protein in Drosophila.
The humble Sausage roll gets only a solitary mention (a piece in New Scientist titled Sausage-roll science: the battle of the buffet).
Last, but not least, is a dish that often confuses people who are not from the UK. The delightful Toad in the hole was something that I thought would never feature at all in this list. It only merits 1 result, but what a result! The article in question is something from a 1924 issue of The Boston Medical and Surgical Journal titled: The Toad in the Hole Circumcision—A Surgical Bugbear.

Updated: 2014-12-10: includes addition of 'Spotted Dick' thanks to reader @MattBashton.