How to ask for bioinformatics help online

Part two of a two-part series.

In part one I covered where to ask for bioinformatics help. Now it is time to turn to the issue of how you should go about asking for help. Hat tip to reader Venu Thatikonda (@nerd_yie) for pointing me to this 2011 PLOS Computational Biology article that covers similar ground to this blog post. Here are my five main suggestions, with the last one broken down into nine more specific tips:

  1. Be polite. Posting a question to an online forum does not mean that you deserve to be answered. If people do answer, they are usually doing so by giving up their own free time to try to help you. Don't berate people for their answers, or insult them in any way.
  2. Be relevant. Choose the right forum in which to ask your question. Sites like SEQanswers have different forums that discuss particular topics, so don't post your PacBio question in the Ion Torrent forum.
  3. Be aware of the rules. Most online forums will have some rules, guidelines, and/or an FAQ which covers general posting etiquette and other things that you should know. It is a good idea to check this before posting on a site for the first time.
  4. Be clever. Search the forum before asking your question; there is a good chance that your question has already been asked (and answered) by others.
  5. Be helpful. Probably the biggest thing you can do to get a useful answer is to provide as many relevant details as possible. These include:
    1. Type of operating system and version number, e.g. Mac OS X 10.10.5.
    2. Version number/name of software tool(s) you are using, e.g. NCBI BLAST+ v2.2.26, Perl v5.18.2, etc. A good bioinformatics or Unix tool will have a -v, -V, or --version command-line option that will give you this information (see the example after this list).
    3. Any error message that you saw. Report the full error message exactly as it appeared.
    4. Where possible, provide steps that would let someone else reproduce the problem (assuming it is reproducible).
    5. Outline the steps that you have tried, if any, to fix the problem. Don't wait for someone to suggest 'quit and restart your terminal' before you reply 'Already tried that'.
    6. A description of what you were expecting to happen. Some perceived errors are not actually errors at all (the software was doing exactly what was asked of it, though this may not be what the user was expecting).
    7. Any other information that could help someone troubleshoot your problem, e.g. a listing of your Unix terminal before and/or after you ran a command which caused a problem.
    8. A snippet of your data that would allow others to reproduce the problem. You may not be able to upload data to the website in question, but small data snippets can be shared via a Dropbox or Google Drive link, or on sites like GitHub Gist.
    9. Attach a screenshot that illustrates the problem. Many forum sites allow you to add image files to a post.
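
For the first two of these (operating system and tool versions), a couple of shell commands will usually capture what you need. This is only a rough sketch: the exact flags vary from tool to tool, and the tool names below (blastn, samtools) are just examples:

```
# Operating system details (macOS shown; on Linux try 'uname -a' or 'lsb_release -a')
sw_vers
uname -a

# Tool versions: most well-behaved tools support -v, -V, or --version
blastn -version
samtools --version
perl -v
```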

Any other suggestions?

 

Updates

2015-11-08 09.44: Added link to PLOS Computational Biology article

Where to ask for bioinformatics help online

Part one of a two-part series. In part two I tackle the issue of how to ask for help online.

You have many options when seeking bioinformatics help online. Here are eleven possible places to ask for help, loosely arranged by their usefulness (as perceived by me):

  1. SEQanswers — the most popular online forum devoted to bioinformatics?
  2. Biostars — another very popular forum.
  3. Mailing lists — many useful bioinformatics tools have their own mailing lists where you can ask questions and get help from the developers or from other users, e.g. SAMtools and Bioconductor. Also note that resources such as Ensembl have their own mailing lists for developers.
  4. Google Discussion Groups — as well as having very general discussion groups, e.g. Bioinformatics, there are also groups like Tuxedo Tool Users…the perfect place to ask your TopHat or Cufflinks question.
  5. Stack Overflow — more suited for questions related to programming languages or Unix/Linux.
  6. Google — I'm including this here because I have solved countless bioinformatics problems just by searching Google with an error message.
  7. Reddit — try asking in r/bioinformatics or r/genome.
  8. Twitter — this works best if you have enough followers who know something about bioinformatics; it can be a good place to ask a question, though not a great forum for long questions (or replies). Try using the hashtag #askabioinformatician (this was @sjcockell's idea).
  9. Voat — like Reddit's younger, hipster nephew. However, the bioinformatics 'subverse' is not very active.
  10. ResearchGate — you may know it better as 'that site that sends me email every day', but some people use this site to ask questions about science. Surprisingly, they have 15 different categories relating to bioinformatics.
  11. LinkedIn — Another generator of too many emails, but they do have discussion groups for bioinformatics geeks and NGS.

Other suggestions welcome.

 

Updates

2015-11-02 09.53: Added twitter at the suggestion of Stephen Turner (@nextgenseek).

Welcome to the JABBA menagerie: a collection of animal-themed, bogus bioinformatics names…that have nothing to do with animals!

Bioinformaticians make the worst zookeepers:

[A–Z index of menagerie entries]

Other suggestions welcome! Only requirements are that:

  1. The name is bogus, i.e. not a straightforward acronym and worthy of a JABBA award
  2. The acronym is named after an animal (or animal grouping)
  3. The software/tool has nothing to do with the animal in question

Understanding MAPQ scores in SAM files: does 37 = 42?

The official specification for the Sequence Alignment Map (SAM) format outlines what is stored in each column of this tab-separated value file format. The fifth column of a SAM file stores MAPping Quality (MAPQ) values. From the SAM specification:

MAPQ: MAPping Quality. It equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.

So if you happened to know that the probability of correctly mapping some random read was 0.99, then the MAPQ score should be 20 (i.e. −10 × log10(0.01)). If the probability of a correct match increased to 0.999, the MAPQ score would increase to 30. So the upper bound of a MAPQ score depends on the level of precision of your probability (though elsewhere in the SAM spec, an upper limit of 255 is defined for this value). Conversely, as the probability of a correct match tends towards zero, so does the MAPQ score.
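
If you want to sanity-check those numbers yourself, any POSIX awk can do the conversion. This is just an illustration of the formula, not anything aligner-specific:

```
# MAPQ = -10 * log10(P(mapping position is wrong)), rounded to the nearest integer
# P(correct) = 0.99 -> P(wrong) = 0.01 -> MAPQ = 20
echo "0.99 0.999 0.5" | awk '{ for (i = 1; i <= NF; i++) printf "P(correct)=%s  MAPQ=%.0f\n", $i, -10 * log(1 - $i) / log(10) }'
```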

So I'm sure that the first thing that everyone does after generating a SAM file is to assess the spread of MAPQ scores in your dataset. Right? Anyone?

< sound of crickets >

Okay, so maybe you don't do this. Maybe you don't really care, and you are happy to trust the default output of whatever short-read alignment program you used to generate your SAM file. Why should it matter? Will these scores really vary all that much?
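
If you do want to look, tabulating the MAPQ values in your own data only takes a one-liner. A minimal sketch, assuming samtools is installed and 'aln.bam' is a placeholder for your alignment file:

```
# Count how often each MAPQ value (SAM column 5) occurs, ordered by MAPQ
samtools view aln.bam | cut -f5 | sort -n | uniq -c | sort -k2,2n
```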

Here is a frequency distribution of MAPQ scores from two mapping experiments. The bottom panel zooms in to more clearly show the distribution of low frequency MAPQ scores:

Distribution of MAPQ scores from two experiments: bottom panel shows zoomed in view of MAPQ scores with frequencies < 1%.

What might we conclude from this? There seem to be some clear differences between the two experiments. The most frequent MAPQ score in the first experiment is 42, followed by 1. In the second experiment, scores only reach a maximum value of 37, and scores of 0 are the second most frequent value.

These two experiments reflect some real-world data. Experiment 1 is based on data from mouse, and experiment 2 uses data from Arabidopsis thaliana. However, that is probably not why the distributions are different. The mouse data is based on unpaired Illumina reads from a DNase-Seq experiment, whereas the A. thaliana data is from paired Illumina reads from whole-genome sequencing. However, that still probably isn't the reason for the differences.

The reason for the different distributions is that experiment 1 used Bowtie 2 to map the reads whereas experiment 2 used BWA. It turns out that different mapping programs calculate MAPQ scores in different ways, and you shouldn't really compare these values unless they came from the same program.

The maximum MAPQ value that Bowtie 2 generates is 42 (though it doesn't say this anywhere in the documentation). In contrast, the maximum MAPQ value that BWA will generate is 37 (though once again, you — frustratingly — won't find this information in the manual).

The data for Experiment 1 is based on a sample of over 25 million mapped reads. However, you never see MAPQ scores of 9, 10, or 20, something that presumably reflects some aspect of how Bowtie 2 calculates these scores.

In the absence of any helpful information in the manuals of these two popular aligners, others have tried doing their own experimentation to work out what the values correspond to. Dave Tang has a useful post on Mapping Qualities on his Musings from a PhD Candidate blog. There are also lots of posts about mapping quality on the SEQanswers site (e.g. see here, here or here). However, the prize for the most detailed investigation of MAPQ scores — from Bowtie 2 at least — goes to John Urban, who has written a fantastic post on his Biofinysics blog.

So in conclusion, there are three important take-home messages:

  1. MAPQ scores vary between different programs and you should not directly compare results from, say, Bowtie 2 and BWA.
  2. You should look at your MAPQ scores and potentially filter out the really bad alignments (see the example after this list).
  3. Bioinformatics software documentation can often omit some really important details (see also my last blog post on this subject).
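
On point 2, filtering is straightforward with samtools; just remember that the threshold you pick only makes sense relative to the aligner that produced the scores. A hedged example (the cutoff of 30 and the filenames are arbitrary):

```
# Discard alignments with MAPQ below 30 and write the rest to a new BAM file
samtools view -b -q 30 aln.bam > aln.mapq30.bam
```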

Making genome assemblies in the year 2014

I often like to encourage students to explain their work without using any complex scientific vocabulary. If you can explain what you do to your parents or grandparents, then this is great practice for explaining your work to other scientists from outside your field.

I also encourage students to think of analogies and metaphors for their work as these can really help others to grasp difficult concepts. Yesterday, I wrote a post called Making cakes in the year 2014 which was (hopefully) an obvious attempt to explain some of the complexities and problems inherent in the field of genome assembly.

It almost feels wrong to even attempt to convert millions of ~100 bp DNA fragments into — in the case of some species — a small number of sequences that span billions of bp. Every single step in the process is fraught with errors and difficulties. Every single step is controlled by software with numerous options that are often unexplored. Every single step has many alternative pieces of software available.

Let's focus on just one of the earliest steps in any modern sequencing pipeline: the need to remove adapter contamination from your sequenced reads. There are at least thirty-four different tools that can be used for this step, and there are over 240 different threads on SEQanswers.com that contain the words 'trim' and 'adapter' (suggesting that this process is not straightforward, and that many people need help).

I had a look at some of these tools. The program Btrim has 12 different command-line options that can all affect how the program trims adapter sequences (it has 27 command-line options in total). Skewer has 9 different command-line options that will affect the output of the program. The trimmer Concerti has 8 options that will also affect the output. Do we even have a good idea of the best way to remove adapter sequences? Maybe we need a 'trimmathon' to help test all of these tools!
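
For what it's worth, here is the shape of a typical adapter-trimming command. I'm using cutadapt purely as an illustration (it is not one of the three tools above), and the adapter sequence, quality cutoff, and minimum read length are exactly the kinds of choices that change the output:

```
# Trim a 3' adapter, quality-trim to Q20, and drop reads shorter than 30 bp
cutadapt -a AGATCGGAAGAGC -q 20 -m 30 -o trimmed.fastq reads.fastq
```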

If there is a point to this post, maybe it is that genome assembly is an amazingly complex, time-consuming, and fundamentally difficult problem. But even the 'little steps' that have to be done before you even start assembling your sequences are far from straightforward. Don't convince yourself for a moment that a single tool — with default parameters — will do all of the hard work for you.

 

 

Making cakes in the year 2014

I've been trying to make a cake. There are lots of published recipes out there for how to make this cake, but the one that I used came with only a very blurry image of what the finished cake should look like. So I really had to hope that the recipe was a good one, because I wasn't entirely sure if I would be able to tell whether it worked or not.

To get started, I used one of those online shopping services that can deliver all the ingredients to your door. Even though they claimed that they stocked everything on my shopping list, they then informed me that there were a small number of ingredients that they were not able to physically access at the moment. Frustratingly, they weren't able to tell me which ingredients would be missing when they delivered them. How odd. 

Something else that seemed unusual was that my cake recipe specified that I needed almost 100 times the amount of ingredients compared to what will end up in the finished cake. Seems a bit wasteful, but who am I to argue with the recipe?

Before I could actually start the baking process, I found that there were a few issues that I had to overcome. Lots of the ingredients had become stuck to the packaging and I had to use a tool which could separate the two. Only, some of the time it didn't get rid of all the packaging, and some of the time it ended up getting rid of not just the packaging but some of the ingredient as well. There's actually several tools on the market for doing this, but they all seem to perform slightly differently.

After I got rid of the packaging I noticed that lots of the ingredients had started to spoil and had to be thrown away, but some of them could be salvaged by cutting off the bad parts. There also seemed to be a lot of implements that you can buy to help with the cutting. Wasn't obvious which one was the best, so I used the first one that Google suggested.

At this point it was kind of frustrating to notice that a small proportion of my ingredients weren't cake ingredients at all. I had to throw them all away, but I think that some of them may have ended up in the final cake.

When it came to the actual baking, I was a bit overwhelmed by the fact that there were dozens of different manufacturers who all claimed that I could make a better cake if only I used their brand of oven. Nearly all of these ovens just let you put your raw ingredients in one slot — after you have removed packaging, the spoilt ingredients, and the non-cake ingredients — and voila, out comes your cake!

I chose one of the more popular ovens on the market and waited patiently for many hours as my cake baked happily in the oven. When the timer buzzed and I took the cake out, I was surprised to see that many of the raw ingredients were left behind in the oven's 'waste overflow unit'. The real surprise, however, was that the finished cake didn't really look anything like the — admittedly blurry — photo that came with the recipe.

The cake had many different layers, but they weren't quite all the same size and some of them seemed to have been assembled in the wrong order. The pattern on the cake decoration — yes, this oven also decorates the cake — was inconsistent at best. It would mostly use one color of icing, but every now and then, it would insert a different color. The same thing happened with the fillings: it would randomly switch from one flavor to another, and then back again. It was almost like there were two different cakes which had been squished together to make a new one.

When I finally showed the cake to one of my baking friends, I was hoping that he would enjoy it. However, all he kept asking me was "How big are the layers?". When I told him, he replied "My cake has bigger layers so yours can't be very good", and then he left. How rude. I took it to another friend and she just said "Your cake is smaller than mine so mine must be better". She also left without trying it. Finally, I took it to another baking colleague. Before I could show him the cake he just said "My cake has most of the common ingredients expected in all cakes, how many does yours have?". I didn't know, so he left.

Making cakes is a very strange business.

Developing CEGMA: how working on old code can drive you mad and some tips on how to avoid this

Today marks the day when the original paper that describes the CEGMA software (Core Eukaryotic Gene Mapping Approach) becomes my most cited paper (as tracked by Google Scholar):

Does this fact make me happy? Not really. In fact, you may be surprised to learn that I find working on CEGMA a little bit depressing. I say this on a day when, purely coincidentally, I am releasing a new version of CEGMA. Why the grumpy face Keith? (I hear you ask). Let's take a trip down memory lane to find out why:

  • Early 2004: A paper is published that describes the KOGs database of euKaryotic Orthologous Groups.
  • Early 2005: I become the first person to join the Korf Lab after Ian Korf moves to Davis in 2004.
  • Mid 2005: Genís Parra becomes the second person to join the lab.
  • 2005–2006: The three of us work on the idea which became CEGMA. This project was primarily driven forward by Genís; during this time our initial CEGMA manuscript was rejected by two journals.
  • Late 2006: Our CEGMA paper was accepted!
  • Early 2007: CEGMA paper is published — as an aside, the URL for CEGMA that we include in the paper still works!
  • 2007: We work on the CEGMA spin-off idea: that it can be used to assess the 'gene space' of draft genomes.
  • 2008: Write new manuscript, get rejected twice (again), finally get accepted late 2008.
  • Early 2009: The 2nd CEGMA paper gets published!
  • Mid 2010: Genís leaves the lab.

By the time Genís left Davis, our original CEGMA paper had been cited 11 times (one of which was by our second CEGMA paper). I think that we had all expected the tool to have been a little more popular, but our expectations had been dampened somewhat by the difficulties in getting the paper published. Anyway, no sooner had Genís left the lab than the paper started getting a lot more attention:

Growth in citations to the two CEGMA papers.

This was no doubt due to its use as a tool in the Assemblathon 1 paper (in which I was also involved), a project that started in late 2010. However, any interest generated from the Assemblathon project probably just reflected the fact that everyone and their dog had started sequencing genomes and producing — how best to describe them? — 'assemblies of questionable quality'.

This is also about the time when I started to turn into this guy:

This was because it had fallen on me to continue to deal with all CEGMA-related support requests. Until 2010, there hadn't really been any support requests because almost no-one was using CEGMA. This changed dramatically and I started to receive lots of emails that:

  • Asked questions about interpreting CEGMA output
  • Reported bugs
  • Asked for help installing CEGMA
  • Suggested new features
  • Asked me to run CEGMA for them

I started receiving lots of the latter requests because CEGMA is admittedly a bit of a pig to install (on non-Mac-based Unix systems at least). In the last 6 months alone, I've run CEGMA 80 times for various researchers who (presumably) are unable to install it themselves.

After the version 2.3 release — necessary to transition to the use of NCBI BLAST+ instead of WU-BLAST — and 2.4 release — necessary to fix the bugs I introduced in v2.3! — I swore an oath never to update CEGMA again. This was mostly because we no longer have any money to work on the current version of CEGMA. However, it was also because it is not much fun to spend your days working on code that you barely understand.

It should be said that we do have plans for a completely new version of CEGMA that will — subject to our grant proposal being successful — be redeveloped from the ground up, and will include many completely new features. Perhaps most importantly — for me at least — a version 3.0 release of CEGMA will be much more maintainable.

And now we get to the main source of my ire when dealing with CEGMA. It is built on a complex web of Perl scripts and modules, which make various system calls to run BLAST, genewise, geneid, and hmmsearch (from HMMER). I still find the scripts difficult to understand — I didn't write any of the original code — and therefore I find it almost impossible to maintain. One of the reasons I had to make this v2.5 update is that the latest versions of Perl have deprecated a particular feature, causing CEGMA to break for some people.

Most fundamentally, the biggest problem with CEGMA (v2.x) is that it is centered around use of the KOGs database, a resource that is now over a decade old. This wasn't an issue when we were developing the software in 2005, but it is an issue now. Our plans for CEGMA v3.0 will address this by moving to a much more modern source of orthologous group information.

In making this final update to v2.x of CEGMA, I've tried adopting some changes to bring us up to date with the modern age. Although the code remains available from our lab's website, I've also pushed the code to GitHub (which wasn't in existence when we started developing CEGMA!). In doing this, I've also taken the step to give our repository a DOI and therefore make the latest version citable in its own right. This is done through use of Zenodo.

Although I hope that this is the last thing that I ever have to write about CEGMA v2.x, it is worth reflecting on some of the ways that the process of managing and maintaining CEGMA could have been made easier:

  1. Maintain documentation for your code that is more than just an installation guide and a set of embedded comments. From time to time, I've had some help from Genís in understanding how the code is working, but the complexity of this software really requires a detailed document that explains how and why everything works the way it does. There have been times when I have been unable to help people with CEGMA-related questions because I still can't understand what some of the code is doing.
  2. Start a FAQ file from day one. This is something that, foolishly, I have only recently started. I could have probably saved myself many hours of email-related support if I had sorted this out earlier.
  3. Put your code online for others to contribute to. Although GitHub wasn't around when we started CEGMA, I could have put the code up there at some point before today!
  4. Don't assume that people will use a mailing list for support, or even contact you directly. One thing I did do many years ago is set up a CEGMA mailing list. However, I'm still surprised that many people just report their CEGMA problems on sites like SEQanswers or BioStars. I probably should have started checking these sites earlier.
  5. Don't underestimate how much time can be spent supporting software! I probably should have started setting aside a fixed portion of time each week to deal with CEGMA-related issues, rather than trying to tackle things as and when they landed on my doorstep.
  6. Assume that you will not be the last person to manage a piece of software. There are many things you can do to start good practices very early on, including using email addresses for support which are not tied to a personal account, ensuring that your changes to the code base have meaningful (and helpful) commit messages, and making sure that more than one person has access to wherever the code is going to end up.

In some ways it is very unusual for software to have this type of popularity where people only start using it several years after it is originally developed. But as CEGMA shows, it can happen, and hopefully these notes will serve as a bit of a warning to others who are developing bioinformatics software.

Survey results: The extent of gender bias in bioinformatics

I have completed an analysis of my survey that attempted to see whether there is notable gender bias among bioinformaticians. Thank you to the 370 people that completed the survey! A few things to note:

  1. All survey responses are available on Figshare (in tab-separated value format). Anyone else can come along and play with this data, and maybe ask more intelligent questions about it than I did.
  2. My detailed analysis of these responses is also on Figshare as a separate document.
  3. The original Google survey form remains available (also see my blog post about it). If people continue to complete the survey, I will update the main data file on Figshare.

I encourage people to read the full document on Figshare. Because of the high response to this survey, I had enough data to compare gender bias at different career stages, and also between different countries (for a small number of countries).

I'll leave you with just one result from my analysis. I had asked people to identify their current career position, and  I offered 10 possible career stages as answers:

  1. Currently pursuing undergraduate degree (with focus on bioinformatics/genomics)
  2. Undergraduate level position in academia or industry (e.g. Research officer / Junior specialist)
  3. Currently pursuing postgraduate qualification (with focus on bioinformatics/genomics)
  4. Postgraduate level position (e.g. Research assistant). MSc or PhD required for role.
  5. Postdoctoral scholar / Fellow / Research Associate
  6. Lecturer / Instructor/ Senior Fellow / Project Scientist (3+ years post-PhD research experience)
  7. Assistant Professor / Reader / Senior Lecturer (5+ years post-PhD research experience)
  8. Associate or Full Professor / Team Leader (7+ years post-PhD research experience)
  9. Senior Professorial role (e.g. head of a department, 10+ years post-PhD research experience)
  10. Super Senior role (e.g. Dean of a school or CEO, 15+ years post-PhD research experience)

Because these categories are a little bit subjective, and because some of the categories (levels 1, 9, and 10) had the fewest responses, I decided to smooth the data by combining adjacent categories, i.e. 1&2, 2&3, etc.

So this is what the percentage of male and female bioinformaticians looks like with respect to progress through their scientific career:

Things start off looking quite equitable but proceed to diverge around the time that people are becoming Associate Professors. However, the situation is more complex than this (see Figure 3 in my full analysis).

ACGT...TGCA — has every possible DNA-based initialism been used by the bioinformatics/genomics community?

 

Short answer

Yes. 

Long answer…

You might work in a field that's related to biology, genetics, genomics, or bioinformatics. You might be working on a new piece of software, or a research proposal, or you need to form a committee. Maybe you have even been given the power to name a new research facility.

Suddenly you have an inspiration...why don't we name our new software, proposal, committee, or facility after a DNA-based initialism! That would be clever and make us stand out from the crowd, right? Maybe...maybe not.

What follows is a fairly exhaustive list of — presumably intentional — DNA-based initialisms that are in use (or have been used). As of 2020-07-20 the current list contains 67 names in total with all 24 possible combinations of [ACGT] being used. The additions since I first created this page are included at the end.

See also this related blog post by David Lawrence from 2014, which I only discovered in mid-2020. His post — which beat me to the punch by just a couple of weeks! — has provided me with a few additional examples which I hadn’t heard about and which have now been included here.

Please let me know of any errors or omissions, though note that potential names have to be initialisms and have to be somewhat related to the fields of genetics, genomics, or bioinformatics.


ACGT

  1. Advisory Committee on Genetic Testing — Committee — 1996
  2. Alliance for Cancer Gene Therapy — Research Network — 2001
  3. A Comparative Genomics Tool — Software — 2003
  4. Advancing Clinico-genomic Trials on Cancer — Research Project — 2011
  5. Algorithms in Computational Genomics at Tau — Lab web page — ???
  6. Advanced Center for Genome Technology — Research Center? — ???
  7. African Centre for Gene Technologies — Research Network — ???
  8. Applied Computational Genomics Team — Research Group — ???
  9. Amino aCids To Genome — Software — 2017
  10. Analysis of Czech Genomes for Theranostics — Research Project? — 2020?

ACTG

  1. Automatic Correspondence of Tags and Genes — Software — 2007

AGCT

  1. Applied Genomics & Cancer Therapeutics — Research Program? — ???

AGTC

  1. Applied Genomics Technology Center — Core Facility? — 1998
  2. Advanced Genome Technologies Core — Core Facility — ???
  3. University of Kentucky Advanced Genetic Technologies Center — Core Facility (now defunct?) — ???

ATCG

  1. Applied Technology in Conservation Genetics — Research Lab — ???

ATGC

  1. Arabidopsis Thaliana Genome Center — Core Facility? — 2000?
  2. Another Tool for Genome Comparison — Software — 2001
  3. Advanced Thermal Gradient dna Chip — Patent — 2002
  4. Another Tool for Genomic Comprehension — Database & web tool — 2012
  5. Alignable Tight Genomic Clusters — Database — 2009

CAGT

  1. Center for Advanced Genomic Technology — Research Facility — 2000?
  2. Center for Applied Genetics and Technology — Research Facility — 2004
  3. Center for Applied Genetic Technologies — Research Facility — ???
  4. Clustering AGgregation Tool — Software — 2012?

CATG

  1. Cross-legume Advances Through Genomics — Conference — 2004?
  2. Center for Advanced Technologies in Genomics — Research Facility — 2008

CGAT

  1. Comparative Genome Analysis Tool — Software — 2006
  2. Computational Genomics Analysis and Training — Training program — 2010
  3. Computational Genomics Analysis Toolkit — Software — 2013
  4. Centre for Gene Analysis and Technology — Research Facility — ???
  5. Canadian Genome Analysis and Technology program — Research program (now defunct) — 1992

CGTA

  1. CNS Gene therapy Translation Acceleration — Research Group — ???

CTAG

  1. Corn Transcriptome Analysis Group — Working Group — 2014
  2. Canadian Triticum Advancement Through Genomics — Research project — 2011

CTGA

  1. the Catalogue for Transmission Genetics in Arabs — Database — 2006

GACT

  1. The Center for Genetic Architecture of Complex Traits — Research Center — 2013

GATC

  1. Genetic Analysis Technology Consortium — Biotech Consortium (now defunct?) — circa 1997?

GCAT

  1. Genome Comparison & Analytic Testing — Software? — ???
  2. Genome Consortium for Active Teaching — Teaching Consortium — 2007?
  3. Gene-set Cohesion Analysis Tool — Software — 2011 (or 2007)
  4. Genotype-Conditional Association Test — Statistical method — 2015
  5. Genomics, Computational biology And Technology — study section — ???

GCTA

  1. Genome-wide Complex Trait Analysis — Software — 2011

GTAC

  1. Gene Technology Access Center — Teaching Facility — 2000
  2. Genomics Technology Access Center — Core Facility — 2009?
  3. Genome Technology Access Center — Core Facility — 2010
  4. Genomics/Transcriptomics Analysis Core — Core Facility — ???
  5. Genomes and Transcriptomes of Arctic Chromists — Research Program — 2012
  6. Gene Technology Advisory Committee — Government Committee — ???

GTCA

  1. Genomic Tetranucleotide Composition Analysis — Database — 2006
  2. Genome Transcriptome Correlation Analysis — Software — 2007

TACG

  1. Talking About Computing and Genomics — Workshop — 2013

TAGC

  1. The Applied Genomics Core — Core Facility — 1998
  2. The Ashkenazi Genome Consortium — Consortium — 2012
  3. Technological Advances for Genomics and Clinics — Research Lab/Program? — ???
  4. The Arts & Genomics Centre — An Arts/Science Center — ???
  5. The Allied Genetics Conference — Conference — 2016?
  6. Taxon-Annotated GC plots — software visualisation method/tool — 2013

TCAG

  1. The Centre for Applied Genomics — Research Facility — 2007?
  2. The Center for the Advancement of Genomics — Research Facility (superseded by this) — ???

TCGA

  1. The Centre for Genetic Anthropology — Research Facility — 1996
  2. The Tayside Centre for Genomic Analysis — Core facility — 2001 (?)
  3. The Center for Genomic Application — Core Facility — 2004
  4. The Cancer Genome Atlas — Research Program — 2006

TGAC

  1. The Genome Access Course — Training Course — 2002
  2. The Genome Analysis Center — Research Facility — 2009

TGCA

  1. The Genome Counselling App — iOS Application — 2014
 

Updates:

  • 2020-08-20 Added 5th example of ATGC, 3rd example of AGTC, 2nd example of CTAG, and 4th example of GCAT (all courtesy of David Lawrence)

  • 2020-07-18 Added 10th example of ACGT

  • 2019-07-23 Added 9th example of ACGT (thanks to Sam Lent @samanthalent)

  • 2016-09-03 Added 4th example of TCGA (thanks to @malcolmacaulay)

  • 2016-02-16 Added 6th example of TAGC

  • 2015-09-11 - Added 5th example of TAGC

  • 2015-07-06 - Added 8th example of ACGT

  • 2015-04-06 - Added 4th example of GCTA (thanks to John Didion)

  • 2014-12-12 - Added first usage of TACG (thanks to @NazeefaFatima)

  • 2014-04-25 - Added Jeff Ross-Ibarra's planned use of CTAG

  • 2014-04-25 - Included a second instance of AGTC

  • 2014-05-18 - Included a fourth example of TAGC

  • 2014-09-08 - Included first usage of CGTA, GACT, and TGCA

When is a genome complete...and does it even matter? Part 1: the 1% rule vs Sydney Brenner's CAP criteria

This will be the first in a new series of blog posts that discuss my thoughts on the utility of genomes at various stages of completion (both in terms of genome assembly and annotation). These posts will mostly be addressing issues that pertain to eukaryotic genomes...are there any other kind? ;-)




I often find myself torn between two conflicting viewpoints about the utility of unfinished genomes. First, let's look at the any-amount-of-sequence-is-better-than-no-sequence-at-all argument. This is clearly true in many cases. If you sequence only 1% of a genome, and if that 1% contains something you're interested in (gene, repeat, binding site, sequence variant etc), then you may well think that the sequencing effort was tremendously useful.

Indeed, one of my all-time favorite papers in science is an early bioinformatics analysis of gene sequences in GenBank. Published way back in 1980, this paper (Codon catalog usage and the genome hypothesis) studied "all published mRNA sequences of more than about 50 codons". Today, that would be a daunting exercise. Back then, the dataset comprised just 90 genes! Most of these were viral sequences, with just six vertebrate species represented (and only four sequences from human).

The abstract of this paper concluded:

Each gene in a genome tends to conform to its species' usage of the codon catalog; this is our genome hypothesis.

This mostly remains true today and the original work on this tiny dataset established a pattern that spawned an entire sub-discipline of genomics, that of codon-usage bias (now with over 7,000 publications). So clearly, you can do lots of great and useful science with only a tiny amount of genome sequence information. So what's the problem?

pause-to-switch-hats-to-argue-the-other-point

Well, 1% of a genome may be better than 0%, and 2% is better than 1%, and so on. But I want 100% of a genome (yes, I'm greedy like that). However, I begrudgingly accept that generating a complete and accurate genome assembly (not to mention a complete set of gene annotations) currently falls into the nice-idea-kid-but-we-can't-all-be-dreamers category.

The danger in not getting to 100% completion is that there is a perception — by scientists as well as the general public — that these genomes are indeed all finished. This disconnect between the actual state of completion, versus the perceived state of completion can lead to reactions of the wait-a-minute-I-thought-this-was-meant-to-be-finished!?! variety. Indeed, it can be highly confusing when people go to download the genome of their species of interest, under the impression that the genome was 'finished' many years ago, only to find that they can't find what they're looking for.

Someone might be looking for their favorite gene annotation, but maybe this 'finished' genome hasn't actually been annotated. Or maybe it's been annotated by four different gene finders and left in a state where the user has to decide which ones to trust. Maybe the researcher is interested in chromosome evolution and is surprised to find that the genome doesn't consist of chromosome sequences, just scaffolds. Maybe they find that there are two completely different versions of the same genome, that were assembled by different groups. Or maybe they find that the download link provided by the paper no longer works and they can't even find the genome in question.

The great biologist Sydney Brenner has often spoken of the need to achieve CAP criteria in efforts such as genome sequencing. What are these criteria?

  • C - Complete. I.e. if you're going to do it, do a thorough job so that someone doesn't have to come along later to redo it.
  • A - Accurate. This is kind of obvious, but there are so many published genomes out there that are far from accurate.
  • P - Permanent. Do it once, and forever.

The last point is probably not something that is thought about as much as the first two criteria. It relates to where these genomes end up being stored and the file formats that people use. But it also applies to other subtle issues. I.e. let's assume that research group 'X' has sequenced a genome to an impressive depth but that they made a terrible assembly. As long as their raw reads remain available, someone else can (in theory) attempt a better assembly, or attempt to remake the exact same assembly (science should be reproducible, right?).

However, reproducibility is not always easy in bioinformatics. Even if all of the methodologies are carefully documented, the software involved may no longer be available, or it may only run on an architecture that no longer exists. If you are attempting to make a better genome assembly, you could face issues if some critical piece of information was missing from the SRA Experiment metadata. A potentially more problematic situation would be if the metadata was incorrect in some way (e.g. a wrong insert size was listed).

In subsequent posts, I'll explore how different genomes hold up to these criteria. I will also suggest my own 'five levels of genome completeness' criteria (for genome sequences and annotations).