Mining UniProt: the Roy Chaudhuri quest to find DNA-like protein sequences
So I recently came up with this idea for #tweetsascode: using Twitter to post tweets that contain functional programs within their 140 characters. I posted one such example last Friday, a Perl script which checks a FASTA file (specified on the command line) to determine whether it contains a protein or DNA sequence:
#!/usr/bin/perl
use strict;
use warnings;
while(<>){
next if m/(^>)|(^$)/;
die "Protein" if (m/[EFILOPQ]/i);
die "DNA";
}
#tweetsascode
This code skips FASTA definition lines (those starting with '>') and blank lines, and then asks: does the first line of sequence contain any of the seven amino-acid characters which are not IUPAC nucleotide characters? If so, then it must be protein; otherwise the sequence is probably DNA.
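To make the choice of those seven characters explicit, here is a minimal Python sketch (not the original Perl) that derives them as a set difference and then applies the same first-line check. The names AMINO_ACIDS, IUPAC_NUCLEOTIDES, PROTEIN_ONLY, and guess_type are invented for illustration, and the amino-acid set is assumed to be the 20 canonical letters plus U and O.

```python
# Assumption: the protein alphabet is the 20 canonical amino acids
# plus U (selenocysteine) and O (pyrrolysine).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY") | {"U", "O"}
# The 16 IUPAC nucleotide codes (bases, U, N, and ambiguity codes).
IUPAC_NUCLEOTIDES = set("ACGTUNWSKMRYBDHV")

# Letters that can only appear in a protein sequence.
PROTEIN_ONLY = AMINO_ACIDS - IUPAC_NUCLEOTIDES
print("".join(sorted(PROTEIN_ONLY)))  # EFILOPQ


def guess_type(lines):
    """Classify using only the first sequence line, as the tweet does."""
    for line in lines:
        line = line.strip()
        # Skip blank lines and FASTA definition lines.
        if not line or line.startswith(">"):
            continue
        if PROTEIN_ONLY & set(line.upper()):
            return "Protein"
        return "DNA"
    return None  # no sequence lines were found


print(guess_type([">seq1", "", "ACGTTGCA"]))  # DNA
print(guess_type([">seq2", "MKLVE"]))         # Protein
```

One small difference from the Perl version: this sketch strips whitespace before testing for a blank line, so a line containing only spaces is also skipped.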
This led @RoyChaudhuri to comment:
@kbradnam Wonder what percentage of protein seqs in nr would be valid IUPAC nucleotide sequences?
— Roy Chaudhuri (@RoyChaudhuri) July 31, 2015
Roy's point is that there are so many IUPAC nucleotide characters that a protein sequence built only from the 14 canonical amino acids whose one-letter codes are also nucleotide codes (A, C, D, G, H, K, M, N, R, S, T, V, W, and Y) would also pass the test as a valid nucleotide sequence. Is it therefore possible to determine how many 'DNA-like' proteins there are?
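Counting the overlap directly is a quick sanity check on Roy's point. This is a small Python sketch rather than anything from the original Perl; the function name passes_as_nucleotide and the two test peptides are hypothetical.

```python
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")       # 20 canonical amino acids
IUPAC_NUCLEOTIDES = set("ACGTUNWSKMRYBDHV")  # 16 IUPAC nucleotide codes

# Canonical amino acids whose letters are also nucleotide codes.
dual_use = sorted(CANONICAL & IUPAC_NUCLEOTIDES)
print(len(dual_use), "".join(dual_use))  # 14 ACDGHKMNRSTVWY


def passes_as_nucleotide(seq):
    """True if every character is a valid IUPAC nucleotide code."""
    return bool(seq) and set(seq.upper()) <= IUPAC_NUCLEOTIDES


# Hypothetical peptides, made up for this example.
print(passes_as_nucleotide("MASTRYGDKNHVW"))  # True
print(passes_as_nucleotide("MELK"))           # False (E and L)
```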
Experiment
With the help of a little Perl script, I did the following:
- I first downloaded the FASTA files for SWISS-PROT and TrEMBL, which collectively comprise the UniProt protein database. If you didn't know, SWISS-PROT contains manually annotated entries whereas the much larger TrEMBL database is automatically annotated.
- For every protein sequence in SWISS-PROT or TrEMBL, my script counts the use of various protein ambiguity characters (this was just out of curiosity). These are B (aspartic acid or asparagine), J (leucine or isoleucine), Z (glutamate or glutamine), and X (unknown amino acid).
- The script also counts usage of the 21st and 22nd amino acids (selenocysteine and pyrrolysine, which have the valid IUPAC characters U and O respectively).
- The script counts any protein sequences made up solely of amino acids whose one-letter codes match the four canonical nucleotides (i.e. alanine, cysteine, glycine, and threonine: A, C, G, and T).
- Finally, the script counts any protein sequences made up solely of amino acids whose one-letter codes appear among the 16 IUPAC nucleotide characters.
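The original script was written in Perl and is not reproduced here. The steps above can be roughly sketched in Python as follows; the function names (read_fasta, tally), the counter keys, and the toy records are all invented for illustration, and the real script may well have handled edge cases differently.

```python
from collections import Counter

CLASSIC_AA = set("ACDEFGHIKLMNPQRSTVWY")
CLASSIC_DNA = set("ACGT")
IUPAC_NUCLEOTIDES = set("ACGTNUWSKMRYBDHV")
SPECIAL = "UOBJZX"  # selenocysteine, pyrrolysine, ambiguity codes, unknown


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def tally(records):
    """Count sequences per category, one count per sequence."""
    counts = Counter()
    for _header, seq in records:
        if not seq:
            continue  # ignore records with no residues
        residues = set(seq)
        counts["total"] += 1
        if residues <= CLASSIC_AA:
            counts["classic_only"] += 1
        for char in SPECIAL:
            if char in residues:
                counts[char] += 1
        if residues <= CLASSIC_DNA:
            counts["acgt_only"] += 1
        elif residues <= IUPAC_NUCLEOTIDES:
            # "Additional" sequences: IUPAC-valid but not pure ACGT.
            counts["iupac_only"] += 1
    return counts


# Toy records, made up for this example.
demo = [">toy1", "AATAAT", ">toy2", "MCCNYYGNSC",
        ">toy3", "MKLVXX", ">toy4", "MEEPQ"]
counts = tally(read_fasta(demo))
print(counts["total"], counts["classic_only"], counts["X"],
      counts["acgt_only"], counts["iupac_only"])  # 4 3 1 1 1
```

Because read_fasta accepts any iterable of lines, an open file handle for a downloaded FASTA file could be passed in place of the toy list.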
Results (SWISS-PROT)
Dataset = 549,008 proteins
- 546,360 only contained the 20 'classic' amino acid characters
- 254 contained selenocysteine characters (U)
- 29 contained pyrrolysine characters (O)
- 138 contained alternative amino acid character B (representing D or N)
- 0 contained alternative amino acid character J (representing L or I)
- 114 contained alternative amino acid character Z (representing E or Q)
- 2,222 contained unknown amino acid characters (X)
- Only 1 protein was composed entirely of A, C, G, and T
- An additional 123 proteins were composed entirely of valid IUPAC nucleotide characters ([ACGTNUWSKMRYBDHV])
The sequence that contained only 'classic' DNA characters was a 31 amino acid fragment, which turned out to contain only two different amino acids (alanine and threonine):
>sp|P02732|ANP3_PAGBO Ice-structuring glycoprotein 3 (Fragments) OS=Pagothenia borchgrevinki PE=1 SV=1
AATAATAATAATAATAATAATAATAATAATA
Of the 123 proteins that used various characters from the full set of IUPAC nucleotide characters, this 128 amino acid protein was the longest:
>sp|Q925H4|KR211_MOUSE Keratin-associated protein 21-1 OS=Mus musculus GN=Krtap21-1 PE=2 SV=2
MCCNYYGNSCGGCGYGSRYGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYG
SGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYGSG
YGCGYGSRYGCGYGSGCCSYRKCYSSCC
This SWISS-PROT entry (accession Q925H4) is a mouse protein which has experimental evidence, and which has an Annotation score of 5 out of 5.
Results (TrEMBL)
Dataset = 50,011,027 proteins
- 49,373,499 only contained the 20 'classic' amino acid characters
- 1,697 contained selenocysteine characters (U)
- 199 contained pyrrolysine characters (O)
- 15,314 contained alternative amino acid character B (representing D or N)
- 0 contained alternative amino acid character J (representing L or I)
- 5,842 contained alternative amino acid character Z (representing E or Q)
- 632,742 contained unknown amino acid characters (X)
- Only 2 proteins were composed entirely of A, C, G, and T
- An additional 1,827 proteins were composed entirely of valid IUPAC nucleotide characters ([ACGTNUWSKMRYBDHV])
In this much larger set of proteins we still don't see a 'classic DNA' sequence that is any longer than the 31 amino acid fragment found in SWISS-PROT. The longer of the two TrEMBL sequences was a 24 amino acid fragment (which only contains alanine and glycine):
>tr|U6PUI2|U6PUI2_HAECO ISE/inbred ISE genomic scaffold, scaffoldpathogensHcontortusscaffold6340 (Fragment) OS=Haemonchus contortus GN=HCOI01698300 PE=4 SV=1
GAAAAGGGGGGGGGGGGAAAAGGA
However, in TrEMBL there was a much longer sequence which contains various characters from the full set of IUPAC nucleotide characters. This uncharacterized protein is 495 amino acids long, and contains mostly serine, arginine, and cysteine:
>tr|A0A0E9H024|A0A0E9H024_STREE Uncharacterised protein OS=Streptococcus pneumoniae GN=ERS23250802220 PE=4 SV=1
MRSRSYYTSVSRRKSSSSSSRSSSSSRSSSSCSSCRSSSSSRSSSSCRSS
SSCSSCRSSRSSRSSSSCSSSRSCRSCSSSRSCSSCRSSSSCSSCRSSRS
SRSSSSSRSSRSSSSCSSSRSSSSCRSSRSSRSSRSSRSCSSCRSSSSCS
SCRSCRSSSSCSSCRSSRSSRSSSSCRSSSSCSSCSSCRSSRSSSSCSSS
RSCRSCRSCRSSSSSSSCSSSRSSRSSSSSRSCSSCRSSSSCSSCRSSSS
CSSCRSSRSSSSCSSCSSCRSSRSSSSCSSSRSSRSSRSCRSSSSSRSCR
SCSSCRSSSSCSSCRSSRSSRSCSSCRSSRSSRSCRSSSSSRSSRSSSSS
RSSRSSSSCRSSRSSSSSRSCSSCRSSRSSSSCSSSRSSSSSRSCSSSRS
CSSCRSSSSCSSCRSSRSSRSSSSSRSSRSSSSCSSSRSSSSCRSSRSSR
SSRSSRSSSSSRSSRSSSSSRSSRSSSSCRSCRSSSSCSSCSSSR
Honorable mention to XXX-rated protein
It is worth giving a shout-out to UniProt accession W4XLU5 (from the TrEMBL database). This uncharacterized protein has a length of 21,842 amino acids… 21,292 of which are unknown amino acids (X)!!! This is probably why the protein has an Annotation score of just 1 out of 5.
Conclusions
- To answer Roy's initial question, only about 0.004% of proteins in UniProt (1,953 / 50,560,035) fulfil the requirement of only containing amino acids that have equivalent IUPAC nucleotide characters.
- From a coding point of view, one should probably account for the fact that a sequence can contain almost 500 DNA-like characters and still be a protein.
- A ~22,000 amino acid protein in which 97% of the residues are 'unknown' should maybe take the award for 'least-useful protein annotation'.
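As a quick check on the arithmetic behind these conclusions, the figures can be recomputed from the counts reported in the two results sections. This is just a Python sketch; the variable names are made up, and the numbers are copied from the results above.

```python
# Dataset sizes reported above.
swissprot_total = 549_008
trembl_total = 50_011_027
total = swissprot_total + trembl_total

# DNA-like proteins: ACGT-only plus the additional IUPAC-only ones,
# for SWISS-PROT (1 + 123) and TrEMBL (2 + 1,827).
dna_like = 1 + 123 + 2 + 1_827

print(total)                               # 50560035
print(dna_like)                            # 1953
print(f"{100 * dna_like / total:.4f}%")    # 0.0039%

# W4XLU5: unknown residues (X) as a share of the full length.
unknown, length = 21_292, 21_842
print(f"{100 * unknown / length:.1f}%")    # 97.5%
```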