Mining UniProt: the Roy Chaudhuri quest to find DNA-like protein sequences

Image adapted from original image by flickr user lwr

So I recently came up with this idea for #tweetsascode: using twitter to write tweets which contain functional programs within their 140 characters. I posted one such example last Friday, a Perl script which checks a FASTA file (specified on the command-line), in order to determine whether it contains a protein or DNA sequence:

#!/usr/bin/perl
use strict; 
use warnings;

while(<>){
    next if m/(^>)|(^$)/;
    die "Protein" if (m/[EFILOPQ]/i);
    die "DNA";
}
#tweetsascode

This code skips FASTA definition lines (those starting with '>') and blank lines, and then asks: does the first line of sequence contain any of the seven amino-acid characters which are not IUPAC nucleotide characters? If so, then it must be protein; otherwise the sequence is probably DNA.

This led @RoyChaudhuri to comment:

Roy's point is that there are so many IUPAC nucleotide characters, that a protein sequence which only contained 13 out of the 20 canonical amino acids, would also pass the test as a valid nucleotide sequence. Is it possible to therefore determine how many 'DNA-like' proteins there are?

Experiment

With the help of a little Perl script, I did the following:

  1. I first downloaded the FASTA files for SWISS-PROT and TrEMBL, which collectively comprise the UniProt protein database. If you didn't know, SWISS-PROT contains manually annotated entries whereas the much larger TrEMBL database is automatically annotated.
  2. For every protein sequence in SWISS-PROT or TrEMBL, my script counts the use of various protein ambiguity characters (this was just out of curiosity). These are B (aspartic acid or asparagine), J (leucine or iosoleucine), Z (glutamate or glutamine), and X (unknown amino acid).
  3. The script also counts usage of the 21st and 22nd amino acids (selenocysteine and pyrrolysine, which have the valid IUPAC characters U and O respectively).
  4. The script counted any protein sequences which only contained amino acids that have equivalent IUPAC characters for the set of four canonical nucleotides (i.e. alanine, cysteine, glycine, and threonine).
  5. Finally, the script counted any protein sequences which only contained amino acids that have equivalents from any of the 16 IUPAC nucleotide characters.

Results (SWISS-PROT)

Dataset = 549,008 proteins

  • 546,360 only contained the 20 'classic' amino acid characters
  • 254 contained selenocysteine characters (U)
  • 29 contained pyrrolysine characters (O)
  • 138 contained alternative amino acid character B (representing D or N)
  • 0 contained alternative amino acid character J (representing L or I)
  • 114 contained alternative amino acid character Z (representing E or Q)
  • 2,222 contained unknown amino acid characters (X)
  • Only 1 protein was comprised entirely of A, C, G, and T
  • An additional 123 proteins were comprised entirely from valid IUPAC nucleotide characters ([ACGTNUWSKMRYBDHV])

The sequence that contained only 'classic' DNA characters was a 31 amino acid fragment, which turned out to contain only two different amino acids (alanine and threonine):

>sp|P02732|ANP3_PAGBO Ice-structuring glycoprotein 3 (Fragments) OS=Pagothenia borchgrevinki PE=1 SV=1 AATAATAATAATAATAATAATAATAATAATA

Of the 123 proteins that used various characters from the full set of IUPAC nucleotide characters, this 128 amino acid protein was the longest:

>sp|Q925H4|KR211_MOUSE Keratin-associated protein 21-1 OS=Mus musculus GN=Krtap21-1 PE=2 SV=2 MCCNYYGNSCGGCGYGSRYGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYG SGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYGSG YGCGYGSRYGCGYGSGCCSYRKCYSSCC

This SWISS-PROT entry (accession Q925H4) is a mouse protein which has experimental evidence, and which has an Annotation score of 5 out of 5.

Results (TrEMBL)

Dataset = 50,011,027 proteins

  • 49,373,499 only contained the 20 'classic' amino acid characters
  • 1697 contained selenocysteine characters (U)
  • 199 contained pyrrolysine characters (O)
  • 15,314 contained alternative amino acid character B (representing D or N)
  • 0 contained alternative amino acid character J (representing L or I)
  • 5,842 contained alternative amino acid character Z (representing E or Q)
  • 632,742 contained unknown amino acid characters (X)
  • Only 2 proteins were comprised entirely of A, C, G, and T
  • An additional 1,827 proteins were comprised entirely from valid IUPAC nucleotide characters ([ACGTNUWSKMRYBDHV])

In this much larger set of proteins we still don't see a sequence that resembles 'classic DNA' that is any longer than the 31 amino acid fragment found in SWISS-PROT. Instead, the longest sequence was a 24 amino acid fragment (which only contains alanine and glycine):

>tr|U6PUI2|U6PUI2HAECO ISE/inbred ISE genomic scaffold, scaffoldpathogensHcontortusscaffold6340 (Fragment) OS=Haemonchus contortus GN=HCOI01698300 PE=4 SV=1
GAAAAGGGGGGGGGGGGAAAAGGA

However, in TrEMBL there was a much longer sequence which contains various characters from the full set of IUPAC nucleotide characters. This uncharacterized protein is 495 amino acids long, and contains mostly serine, arginine, and cysteine:

>tr|A0A0E9H024|A0A0E9H024STREE Uncharacterised protein OS=Streptococcus pneumoniae GN=ERS23250802220 PE=4 SV=1 MRSRSYYTSVSRRKSSSSSSRSSSSSRSSSSCSSCRSSSSSRSSSSCRSS SSCSSCRSSRSSRSSSSCSSSRSCRSCSSSRSCSSCRSSSSCSSCRSSRS SRSSSSSRSSRSSSSCSSSRSSSSCRSSRSSRSSRSSRSCSSCRSSSSCS SCRSCRSSSSCSSCRSSRSSRSSSSCRSSSSCSSCSSCRSSRSSSSCSSS RSCRSCRSCRSSSSSSSCSSSRSSRSSSSSRSCSSCRSSSSCSSCRSSSS CSSCRSSRSSSSCSSCSSCRSSRSSSSCSSSRSSRSSRSCRSSSSSRSCR SCSSCRSSSSCSSCRSSRSSRSCSSCRSSRSSRSCRSSSSSRSSRSSSSS RSSRSSSSCRSSRSSSSSRSCSSCRSSRSSSSCSSSRSSSSSRSCSSSRS CSSCRSSSSCSSCRSSRSSRSSSSSRSSRSSSSCSSSRSSSSCRSSRSSR SSRSSRSSSSSRSSRSSSSSRSSRSSSSCRSCRSSSSCSSCSSSR

Honorable mention to XXX-rated protein

It is worth giving a shout out to UniProt accession W4XLU5 (from the TrEMBL database). This uncharacterized protein has a length of 21,842 amino acids…21,292 of which are represented by unknown amino acids!!! This is probably why the protein has an Annotation score of just 1 out of 5.

Conclusions

  1. To answer Roy's initial question, only 0.00004% of proteins in UniProt (1,953 / 50,560,035) fulfil the requirement of only containing amino acids that have equivalent IUPAC nucleotide characters.
  2. From a coding point of view, one should possibly account for the fact that you can see almost 500 DNA-like characters in a sequence, but you still could be looking at a protein sequence.
  3. A ~22,000 amino acid protein which contains 97% 'unknown' residues, should maybe take the award for 'least-useful protein annotation'

Insights into the mind and practices of a bioinformatics developer

Michael Barton  — who has the coolest twitter handle: @bioinformatics — is a talented bioinformatician who has a great, though infrequently updated, blog (Bioinformatics Zen).

His latest post (How I Develop) takes a look at his overall coding and working practices, reflecting on his decade long experience in bioinformatics. I particularly like the section of 'Focus':

Getting up very early. I built nucleotid.es by getting up at 6am every morning for about 6 months. An average of 2-3 hours of time everyday allowed me to create the prototype site to prove that this could be useful. These hours in the morning feel like my most productive and usually there's no one else awake to interrupt you.

Well worth a read, as is his FAQ.

Does your bioinformatics software pass the 'elevator test'?

The name of your bioinformatics software is important. A good name should be clear, unambiguous, pronouncable, memorable, and meaningful. Sadly many (most?) names of existing tools do not satisfy all of these criteria. Here is a simple thought experiment that you can use when trying to decide on a new name for your software; this is something which might help you avoid many common naming problems that can arise.

Imagine that you are in an elevator going from the 6th floor of a building to the ground floor. The elevator stops at the 5th floor and a visiting bioinformatics/genomics scholar steps in. He/she is someone that you admire and someone who you would really like to know about the latest software tool that you've been working on.

They press the button for the 2nd floor. You have maybe 30 seconds to introduce the tool and hopefully make them curious enough to check it out when they next get back to their computer. You say something like:

Hi. I'm a big fan of your work. I wanted to let you know that I've been working on a tool that you might be interested in…it's called 'X'

In this example, we will assume that you may never see this person again and that you don't know when they will have time to look up your software tool. It might be days, so the name has to be something that they will remember. The more meaningful and pronouncable the name, the more chance that it will be memorable.

Now, let's consider the names of some recently published bioinformatics tools…do these pass the elevator test? You should always consider how you might have to spell out the name of your software:

  • tmle.npvi — tee-em-el-ee-dot-en-pee-vee-aye
  • EW_dmGWAS — Ee-double-you-underscore-dee-em-gee-was
  • do_x3dna — dee-oh- (or do?) -underscore-ex-three-DNA
  • R3D-2-MSA — ar-three-dee-dash-two-dash-em-ess-ay 
  • Pse-in-One — pee-ess-ee- (or see?) -dash-in-dash-one
  • (PS)2 — open-parentheses-pee-ess-close-parentheses-superscript-two

In these examples you would probably choose to omit details of the dots, dashes, underscores, parentheses, and superscript characters that are part of the name. So you should ask yourself whether you really need to include them in the first place.

The bottom line is that it is not enough for the name of your sofware to be comprehensible when read from a screen or page…it should also sound good!

Bioinformatics and genomics resources on reddit

 

Although many people in this field turn to sites like SEQanswers and Biostars to get help with bioinformatics problems, there are a number of subreddits that are devoted to discussion of bioinformatics and genomics. Reddit isn't just a forum for asking questions though, and people also share lots of relevant links (to papers, resources, news etc.). As a new subreddit appeared this week, I thought I'd present a quick roundup:

 

1. bioinformatics

This is the most popular subreddit in this list (in terms of readers). The posts are roughly split equally between sharing links of interest and asking questions. Some of the questions frequently relate to career advice from people wanting to get into this field.

 

2. genomics

  • URL:https://www.reddit.com/r/genomics/
  • Description: Genomics, genetics, DNA, health, and personalized medicine
  • Frequency of posts: Infrequent, ~1–5 new items per week_
  • Current readership:_ 3,085 readers

This subreddit seems to be declining in popularity. It mostly features shared links rather than people asking questions.

 

3. genome

  • URL:https://www.reddit.com/r/genome/
  • Description: Please submit primary genomics literature, discussions of primary literature (e.g. blog posts and serious news stories), and resources for genomics research.
  • Frequency of posts: Moderate, ~5–10 new items per week
  • Current readership: 211 readers

This is a relatively new subreddit and it is focused on people sharing new and interesting papers, and using the subreddit to discuss those papers.

 

4. learnbioinformatics

  • URL:https://www.reddit.com/r/learnbioinformatics/
  • Description: This is a subreddit for providing you with the most relevant academic papers, textbooks, websites, and tutorials in the field of bioinformatics. If you have any recommended resources, please feel free to post away!
  • Frequency of posts: Too early to reliably predict
  • Current readership: 299 readers

This is the newest subreddit (just a few days old), and so has only attracted about a dozen posts. The intended role of this subreddit has obvious overlaps with the other subreddits.