The Take-Home Message: a new web comic about biology, genomics, and bioinformatics

I am extremely excited to announce the launch of a new venture that I am involved in:

The Take-Home Message

The Take-Home Message is a new web comic that tries to graphically represent interesting and fun ideas from the world of biology, with a special focus on genomics and bioinformatics. It was partly inspired by the growing trend towards journals that require graphical abstracts, though we hope to be entertaining as well as informative.

The driving force behind THM (as it will surely become known) is the very talented Abby Yu. Until very recently, Abby was working hard as a graduate student in our lab at UC Davis. Abby's amazing artistic talents were always in evidence whenever she presented at our lab meetings. But she isn't just a great artist and illustrator; she also uses those talents to be a great science communicator.

After obtaining her PhD and getting a job in the real world™, Abby approached me with the idea of starting a web comic together. We should all be grateful that she didn't suggest that I do any of the drawings! I play more of an editorial role and will help come up with the ideas and write the explanatory text that accompanies each comic. You can see more of Abby's amazing artistic creations on her Tumblr: Oh THAT sketch blog.

The first issue of THM is now online and concerns the Tuxedo Suite of bioinformatics software. We are provisionally aiming to have a new comic online every two weeks. As well as reading the comic online, you can also subscribe to the RSS feed, have the comics emailed to you, or follow us on Twitter (@takehomemessage).

To celebrate the launch of issue #1, Abby has prepared a little celebratory version of our banner image:

When 'verbose' mode is maybe a little too verbose: lessons from the Trinity transcriptome assembler

The transcriptome assembler Trinity, like many other bioinformatics command-line tools, sends its principal output (a transcriptome assembly) to a named output file. It writes other information about the status of the run to standard output.

Another feature it has in common with other bioinformatics programs is that it provides a --verbose mode. The Trinity command-line help describes this as follows:

verbose: provide additional job status info during the run

I recently helped a colleague use Trinity to generate a primate transcriptome assembly, and when we ran the program we did two runs, one with standard logging and one with the verbose output turned on. In both cases we used file redirection to send the output to a file. So what did we end up with?

  1. transcriptome.fasta - 60.4 MB
  2. stdout.log - 2.1 MB
  3. stdout_verbose.log - 140.7 MB

The verbose log file was almost 70 times bigger than the standard log file, and over twice the size of the final transcriptome assembly! I tried converting the verbose text file to a PDF, which gave me a 15,385-page document. The Unix word count program tells me that this file contains over 15 million 'words', but the problem is that these are not words that you would necessarily want to read. There are thousands and thousands of pages of output with text that looks like this:

If you run Trinity without redirecting the output to a file, you will just see the percentage completion number overwrite itself on a single line of output. This doesn't work so well, though, if someone chooses to redirect the output to a file. You could also argue that no one really needs to see such a high level of precision when reporting the state of completion of each step (four decimal places!).
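If you are already stuck with a bloated log like this, one option is to collapse the overwritten updates after the fact. The sketch below assumes that the progress updates in the file are separated by carriage returns ('\r'), which is the usual way to overwrite a line in a terminal; I have not checked that this is exactly what Trinity writes. The function name and the sample text are made up for illustration.

```python
def collapse_progress(text: str) -> str:
    """Keep only the last update of each line that was rewritten with '\\r'."""
    cleaned = []
    for line in text.split("\n"):
        # A terminal would (roughly) end up showing whatever follows the
        # final carriage return, so keep only that last non-empty segment.
        segments = [s for s in line.split("\r") if s]
        cleaned.append(segments[-1] if segments else "")
    return "\n".join(cleaned)


# Hypothetical log fragment, NOT real Trinity output.
sample = "step 1: 10.0000%\rstep 1: 55.5000%\rstep 1: 100.0000%\ndone\n"
print(collapse_progress(sample), end="")
```

For a log of this size you would want to read and write the file in chunks or line by line rather than loading it whole, but the idea is the same.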

I think this is an example where the verbose log file ends up being so big as to be largely unusable. It might still be helpful if you wanted to search for a specific string in that file. The main problem is that the Trinity developers are trying to be smart by having the program overwrite output (regarding the percentage completion status of each step) on various lines of output. However, this is only useful if the user chooses not to redirect the output to a file (and redirecting output is incredibly common in bioinformatics). I would argue that for 99% of cases, it is more than sufficient for a program to print 10–20 lines of output regarding the state of completion, e.g.

Calculating stage 1 of shamrock.pl…
10% complete
20% complete
30% complete
40% complete
50% complete
60% complete
70% complete
80% complete
90% complete
100% complete
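Output like this is not hard to produce. Here is a minimal sketch of one way a program could do it: overwrite a single status line when writing to an interactive terminal, and print one line per 10% step when the output has been redirected. This is not how Trinity is implemented; report_progress and total_items are hypothetical names used only for illustration.

```python
import sys


def report_progress(done: int, total: int, last_decile: int, out=sys.stdout) -> int:
    """Report progress and return the highest decile (0-10) reported so far."""
    decile = (10 * done) // total
    if out.isatty():
        # Interactive terminal: overwrite one status line in place.
        out.write(f"\r{100 * done / total:.0f}% complete")
        if done == total:
            out.write("\n")
    elif decile > last_decile:
        # Redirected to a file or pipe: one plain line per 10% step.
        out.write(f"{10 * decile}% complete\n")
    return max(decile, last_decile)


total_items = 250  # hypothetical amount of work
last = 0
for i in range(1, total_items + 1):
    last = report_progress(i, total_items, last)
```

The isatty() check is what lets the same program behave sensibly in both situations, so the log file stays at ten lines per stage no matter how many items are processed.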

About my idea of a 33% target for women speakers at genomics conferences…

Last week I wrote a post on the subject of gender bias at genomics/bioinformatics conferences. I suggested a figure of 33% might make for a minimum target for the proportion of women (and men) who give talks at such conferences. I also went so far as to end that post by saying:

I don't attend many conferences, but from now on I won't be attending any if at least 33% of the talks are not by women.

At the time that I wrote this, I knew that I was going to be speaking at a genomics conference myself later this year. What I didn't know at the time was the gender ratio of speakers at this conference. That information only came to light this week. So what is the proportion of talks by women at this conference?

28.2%

If you're quick on the uptake, you will notice that 28 < 33. So what did I do? Well, I wrote to the conference organizers, explained my position, and told them that I would like to withdraw from my speaking role. I also suggested that they find a woman to take my place (and offered a suggestion of a female co-worker who has worked on the very project that I was intending to talk about).

The conference in question is the new Festival of Genomics that will take place in California in November. This is the second Festival of Genomics conference organized by Front Line Genomics and you may have read about the first conference in this series that recently took place in Boston. This conference was very well received (e.g. see this, this, or this) and so I was very much looking forward to speaking in November (especially as this was the first time that I have been asked to speak at a conference).

The current list of speakers shows 66 men and 26 women. It's possible that these numbers might change slightly; adding just 7 more women speakers, or replacing only 5 male speakers with women would be enough to reach my suggested 33% target.
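The arithmetic behind those two numbers can be checked with a few lines of Python. This is just a sketch using the speaker counts quoted above; the function and variable names are my own, and the target is taken as a flat 33%.

```python
men, women = 66, 26   # speaker counts quoted in the post
TARGET_PCT = 33       # suggested minimum percentage of talks by women


def meets_target(n_women: int, n_total: int) -> bool:
    # Integer comparison avoids floating-point rounding at the boundary.
    return 100 * n_women >= TARGET_PCT * n_total


total = men + women

# Smallest number of women to ADD (the speaker list grows).
added = next(k for k in range(total + 1) if meets_target(women + k, total + k))

# Smallest number of men to REPLACE with women (the list size is fixed).
swapped = next(k for k in range(men + 1) if meets_target(women + k, total))

print(f"add {added} women, or replace {swapped} men")
```

Adding speakers takes more people than swapping because every addition also increases the total number of talks.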

I have had several productive exchanges with Front Line Genomics about this issue. They acknowledge the problem and seem to genuinely want to do something about it to reduce gender bias in this field. I'm confident that subsequent conferences that they organize will do an even better job at representing women in speaking roles. It also must be said that they are doing much better than most genomics conferences and 28% is higher than the current proportion of women in senior roles at most genome institutes. Once again, I want to reiterate that I have found Front Line Genomics to be extremely open about this issue, and I genuinely believe that they are receptive to suggestions that might improve the situation in future.

What can be done?

If you are a male scientist who is concerned by the current level of gender bias at genomics conferences, and if you are ever invited to give a talk at such a conference, then you do have the power to help change the situation. If you learn that women speakers are going to be underrepresented, you can withdraw your speaking position and instead make some suggestions of female scientists to take your place. You can also raise this issue when first invited to speak. If conference organizers received responses from all potential speakers saying 'I will only talk if your conference has an unbiased gender ratio of speakers', then this could change the situation dramatically.

Time to conclude this post by saying (once again): I don't attend many conferences, but from now on I won't be attending any if at least 33% of the talks are not by women.

Everything you ever wanted to know about working with RNA-Seq data but were afraid to ask

  1. Do you work with RNA-Seq data?
  2. Do you plan to work with RNA-Seq data?
  3. Have you ever heard of RNA-Seq data?

If the answer to any of these questions is 'yes' (or even 'maybe') then you should definitely check out this fantastic online guide to all things RNA-Seq:

RNA-seqlopedia

The RNA-seqlopedia provides an overview of RNA-seq and of the choices necessary to carry out a successful RNA-seq experiment

Written by Rodger Voelker and Clay Small of the Cresko Lab at the University of Oregon, it is a fantastically detailed, beautifully written resource to walk you through every step of working with RNA-Seq data.

I wish there were more online guides like this! Here's the Table of Contents, with the 'Analysis' section expanded, to give you a feeling for what it covers:

  1. Experimental Design
  2. RNA Preparation
  3. Library Preparation
  4. Sequencing
  5. Analysis
    • Overview
    • Initial Processing
    • Demultiplexing
    • Removing adapters
    • Trimming
    • Kmer Normalization
    • de Novo Assembly
    • de Bruijn Graph assembly
    • Overlap Layout Assembly
    • Aligning reads to a reference
    • Aligning to a ref. genome
    • Aligning to a transcriptome
    • microRNA Aligners
    • Short Read Aligners Output
    • Annotation of transcripts
    • Differential gene expression
    • Normalization
    • Discrete Distribution Models
    • Continuous Distribution Models
    • Nonparametric Models
    • Choice of Analysis Software
  6. References