Ewan Birney's EBI press conference on being elected to the Royal Society

Speaker: And that concludes this EBI press conference to congratulate Ewan Birney on being elected to the Royal Society. We just have time for one or two questions. Ah okay...the first question goes to…Ewan Birney.

Ewan: Hi Ewan. Just wanted to say that this is all great and I've found your work to be really interesting. Can I just ask whether you've looked at the opportunity of widening this effort by joining other Royal Societies as well? This would allow for a much better comparative analysis of the scope and impact of Royal Society members? The Royal Statistical Society may be a good choice to begin with, or maybe the Royal Society of Marine Artists.

Ewan: Thanks Ewan, that's a really good question. It is something that I'm considering and I think there is a lot to gain from such a comparative approach. But to do this properly I think it needs to be part of a much larger effort. So I'm hopeful of trying to join every Royal Society and then see what can be learned from a cross-societal analysis of such memberships. Furthermore I'm hopeful that Her Majesty could be persuaded to start a new Royal Society for the Promotion of Questions by People Named Ewan at Academic Conferences…something that is very near and dear to my heart.

Speaker: Okay, I think we have time for just one more question. Oh, Ewan…again.

Ewan: Just to follow up Ewan, given the advanced age of many Royal Society members, have you thought about trying to assess what fraction of the Royal Society is functional?

Ewan: That's a fantastic question Ewan, very perceptive of you. This is something else that I have a strong interest in. I am currently involved in some preliminary discussions with various people to form a new pan-European working group that will investigate how much of the Royal Society is functional. This effort will hopefully be called ENCODEMBLIXIR…or something snappy like that. 

 

Jesting aside, congratulations Ewan, this is great news!

101 questions with a bioinformatician #5: Laura Clarke

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Laura Clarke is the Project Coordinator for Resequencing Informatics, part of the Vertebrate Genomics team led by Paul Flicek at EMBL-EBI. Before joining the EBI, she was applying her considerable bioinformatics skills at the Wellcome Trust Sanger Institute (a move ranked #1 on the annual list of Easiest-employers-to-transition-between). 

Her role sees her help with the analysis and coordination of high throughput genomics efforts such as the 1000 Genomes project, BLUEPRINT (deciphering the epigenome of blood cells), and HipSci (the Human Induced Pluripotent Stem Cells Initiative). If you're wondering what this actually entails, I'll hand you over to Laura:

"This work boils down to making sure that data gets into and out of the sequence archives; running primary analysis and QC; and then making sure the resulting analysis makes it out to the community".

You can find out more about Laura by following her on Twitter (@laurastephen), and of course you can also follow @blueprint_eu and @hipsci. And now, on to the 101 questions...

 

 

001. What's something that you enjoy about current bioinformatics research?

The possibility. With modern sequencing technologies, computational techniques have the ability to draw together these new data types and massive volumes of data, allowing us to get much closer to a proper understanding of cellular biology, which of course brings us closer to understanding organismal biology.

Add to that the diverse range of species being sequenced and what that can teach us about evolution and the forces which drive evolution.

That is of course before you consider how it might impact medicine or food security or any real world applications.

 

010. What's something that you *don't* enjoy about current bioinformatics research?

Extracting data from people. My life would be easier if people weren't so begrudging about sharing data and describing the data they do share well. I work with many people who do share data freely and easily but there are still too many people who are too reticent or reluctant to make data publicly available from within a consortium.

 

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

For data coordination purposes we produce a lot of tab-delimited text files. The Unix command cut is wonderful for making those easier to work with and manipulate; learning about cut sooner would at least have made mucking about with various types of GFF files easier, I suspect.
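For readers who haven't met cut yet, here is a rough Python sketch of the kind of column picking that `cut -f1,4,5` does on a tab-delimited file. The function name `cut_fields` and the toy GFF lines are made up for illustration, and skipping comment lines is a convenience that cut itself does not offer.

```python
def cut_fields(lines, fields):
    """Yield the requested 1-based tab-delimited columns from each line.

    A loose analogue of `cut -f`, not a reimplementation: it also skips
    blank lines and '#' comment lines, which cut would pass through.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        yield [columns[i - 1] for i in fields if i <= len(columns)]


# Hypothetical GFF-style records (seqid, source, type, start, end, ...).
gff_lines = [
    "##gff-version 3",
    "chr1\ttoy\tgene\t100\t200\t.\t+\t.\tID=gene1",
    "chr1\ttoy\texon\t150\t180\t.\t+\t.\tParent=gene1",
]

# Keep sequence name, start, and end (columns 1, 4, and 5).
selected = list(cut_fields(gff_lines, [1, 4, 5]))
for row in selected:
    print("\t".join(row))
```

On the command line the one-liner is obviously shorter; the sketch is only meant to show what is happening to each line.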

 

100. What's your all-time favorite piece of bioinformatics software, and why?

I have to say I did enjoy pairedends.com, very funny

 

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

R: this is because Adenine and Guanine are the same molecule type (purines) as both Theobromine and Caffeine, both of which are quite important to me and at least influence my personality.

An 18 Kbp read from a MinION sequencer!

The UC Davis Genome Center was fortunate to receive a few MinIONs from Oxford Nanopore the other week:

One of the things that we have been trying to do with these wondrous machines is to study variation in a mixed pan-European population. For this study, we simply combined saliva samples from individuals that represent 32 distinct European ethnicities (but no Belgiums, obviously), and the combined sample was applied directly to the MinION using the WF10 setting (WF = warp factor).

The preliminary results look very promising with an N50 read length of 12.2 Kbp (and this was before applying N50 Booster!!!). Here is the very first read from the device...18,731 bp of pan-European goodness (though note that there was a problem with base quality at the end of the read...contamination with Belgium DNA maybe?).

>PanEuroMix_read00001 

AÂŤŦĈCŤÂǴĞĢÇGÂÄÇTȦÃĈĜŤCŤŤČAÀŢTŦŢȦǴÅÀAĞŢŤÄŤĊÇÃĜGČTÁÀĢÀATÁȦ  CÂÇGCÁḈŤČÇÅŢÇȦČḈČGÄTÃĜĢĊŦĈTĢTÁÅAȦÄŢČGḈḈḈĢÅḈĈAČÄÁĢÇCĜŦGG  ĞŤÃḈĈĢŢTÁḈŤGCḈŦÄŦGĢȦŦĢḈĢŢÀĈÃÀȦAǴÅCḈĢČŢTĢĢŢČĊÅȦĢĊÇÀTGŦČÂA  ÄÄĊÀĢḈŦǴÂĈÇÁŢGÇŢŦŤŢǴǴÅḈÄĈŦAȦŢÃḈÄŦÅAGŢŦGCÀÀÇTĈÂŤĈTÁÃĈḈȦÃÃĜ  ÂČAĜÇÂÅŤĢŦȦGCÃÀȦŢÃÁÃÅÀÀÅTĊŦÇÅČǴȦĈÅǴŢḈÃĢGĈAÂĢĢǴAAÀĞCÅŤÇČĈ  ĞḈĊAČÃÀČÁĞĊŢCÀȦĊÇĈŢŦḈGAŦĊĢÀTTČGĞȦAȦĢĢČÄŦĢAĢĞŤŦĜÇÄÄÀŦÅÇCŤÇĈĞČȦǴTĜCǴȦÇŤÁACÃAǴTCĢŢĜŤÂCÂÀÂAÄŦŤÄÄÂCTĊĢÀTŢTŢŢÁŤČGĞĞÁĞḈÂȦŤCCÀŢŤǴĞŤĜAĈTĊÇŢŢÃȦCČČÅĞGĜĢÃÇĞȦÅÃCĜȦÂŢÄÅŢŤTÃCŢÃÁČĜŢÇÀĜÀÂĞȦĞĞĢĢÀḈĈĞÄÅḈÄÄÅAÂĢǴŤÅȦĈÂÅCŦÇĞČĊTÃŦÇŤČĜAȦČGGÃÇÅAĊĞTÄḈÀÀȦŦḈŦÀÄĢÀŦḈĈAÃŤǴĊÇȦḈŢÄÇÇÂÂÃĜÄÇÁǴĞǴŦGTÀGǴÀĈČĈĈÂḈTĞTÂÂÁŦÁÅÀGȦÂŦCĢÁÁĈÇŦŢĜĞÇTĈÅȦCÄAǴČĈÄÀŦŦAÀǴGĢĈĈĢĞȦÄÇTĞGĢAÁĢǴÁGÃÁŦĢÇḈĞĜČĈÃĢŢĢTÀGĜÀǴĞÀĈÅAÀÂÃŦÇÃĜĜŤAÁÃĈTÁĜĜŦÂCĢČÁÃḈĞÄĈÁḈḈȦGĊȦČÇĜGTÀÁÁÂÇŢÄÁŤÃḈǴGÃÂȦĈŢĞĞTǴĈŢŤÄŦÁÀḈGÁǴGǴĈÁĢCCTTÂÁÀŢAĜĞÀÂÂÅÁĊĊÄȦTĜŢTÅŤÄAĢĈÂḈTGĢŢĞÀŢAÄḈȦÂḈÄÃÂCÂÁÅĞĊĞÃCĊÇȦÀĜÇȦŦGĊÇŢTÂÀÅĜÂĊÀAĢŢǴČÅĞÇĈĈÅÇŦÇGGTḈCȦGÁÄŦĈḈǴĞÀĜÂĊÁÁÁÁŢĈĞÂĊĢATÃĜŦḈŤĈŦǴḈŢŤǴȦĞÄǴŦATĊÄĜĢĜÄĈĈÅÃAŢČĢÄĈḈČÁǴTČÂAŤǴĞÅAȦCCCČÄÇCÁAĞĜĊCÃÀĊĢĊGÁAÅĢŤÂÀŤĊÂÂÁĢÄǴÂÂĈÅǴŢÇĢĜĊÅÀḈÅŢÁǴÃȦAÅḈȦǴĊCĢŦĈGĞGCÀŢŦḈŤḈĊÅÇÁCÁGŢǴÃŦŦĊGAĢĜŦÄAǴĜǴČĊĢĞĜĢÂŦĜĈȦĜĈÇÂĜÅÀĜTAČŢĞŤŤĜĜŦČȦĊÃÀÄȦGĢŢCÇÂGȦÃÄŢÇḈĈĊÃGŢŢĜȦḈĢGḈÂĞÅAŢGĞÃAACĈAÀCÀÅȦÂÄGÁŦÂĜÄTŦĜČÁǴÄGGĜŦĈÀḈÀÀĢĜÁǴČÃAĈÄÁĜȦÀTǴČGŦŦÄŤŢĊȦĊĈÂǴǴǴČÄŦŦĊČÃŦĊAŤAŦÃŢČÇĢĈÇÁAÃǴCÀŦGACČĞŦǴǴÂCĜĊČAÁÀḈCCĈÂCÂČŢÀĢȦTTĞŤAŤŤCḈŤTÄÁŤĞÅĜĊÄTŢÃÇÃÅÀŤŤĢCŤŢĊǴŤÁÁȦŢÀÂÀĊÁÅŢÂĢĊĜĞŤTÄÂŦČTŢČŤŦĜČĈŦŢŤÇĢĜGŤČÁÄÀŤÂČCĜCAÇĜGȦĜÄĊÂŦȦÁÁŤTÃÄÃAČḈĜĢÁĈǴÂÁÁÄŤĈCŦTČŤŦŦGĈĊÄŦĈȦḈǴÇĜÅĈĢĢŤAÂÂCĜAÄÃÅĜĊŢǴÇÄĜCÄÃḈÁTŦŦŢCÁŢÁÇTTǴḈŦÇÅÄTGĈÂÄȦǴÀAŢÃǴÁĊČÅČĜÁÃGÅȦÀĢĢÇGÁGĜGĈGĞĈČÇÇÇŦÀÅǴḈĢGÃǴTĜĊǴŦĞGĊĞGCÄTGĢÃAÀÂAǴCÂÂTĞŢŢÅČĞÀŤCŦÃĢŦĊŢAŦÅTÂŢĜÇÀÂĜTAǴĢĢĢĈCÂǴÂČČCŢǴÅGĈĈAŦĈĜAĜGCÂCĞĜÅÅŤČŤGŤǴCĈÄĈÇÁĢAḈȦĜȦGŤĜĢÅÄĜḈČŦĞÁĞGGCǴĊŢŢĞŤŢǴÅĢTĢÇGĊŤĊÁČCĈTGČÄĜČÄGÁŦÁÅĊÁÂÅÇÇGCḈÂḈAÁĊČÁḈÀĜCȦĜTŤGŦÁGĊÇŦŤĈÇĢŢÀĞGĞŢŢŢĢÃŤŦÅGÇĢĈĊTTCŤÀÁÁḈĜAÃČCAGŦCCÁÃĞŢĜĢÁAÄĞČÃȦĈŢḈCTĜǴŤGǴAÇÂÁČÁÅĊḈÁÄAČÇŢŦČÂÅÃŤḈTÁCCŤŤŤǴÅĈḈTĊÃŤĈTÂĢḈÀŢĜĢÅŢTÃĈȦȦÂȦȦŤÇCÂČČČCÃÁŤĢAĜÃÃĊŤÁḈǴŤÁÀÅÀŢÅḈÅÇŤÅÃŢÄÃÃÅCŤĢĊÁǴǴÀĈĜTĊÀȦŢŢŤÅÀĊŢŢAÂĊĜTĊǴĜGŢÂČĞÂŤȦGĊŦČÂÀǴȦÅĈÂÇŢÂÇÀÃŤÂGḈÁAAÄȦÂTĜȦÂGĢÁAÂTĢÁĊŤÁĊȦÁĢÂČÇČTǴŤÃĊǴĞŢÃÅŦČAĜĊĞŤGÇTÃÇȦȦǴÅÅÀÀȦḈÄĞĞȦTĈŤÂḈŦŢȦĊCÂÁÄȦĞCÂÄÀĜĈḈĢTĈŦĞĢAAĈŦÁȦŢḈȦȦĢCÃAŦĢÅCĜCÇTÄÅŢĈÁŤTĈÀÄTĞGÃŢÂȦĜÂAÀŤÁÀǴÂÄÅȦÄČŢÂĈÄÅĈŦÁÇÇĞAĜÃŢČĊÀĈÃḈȦḈĈȦḈÇǴTGÄCCÄÇŦǴĊÄŢÅÄGĞÀǴĈÅGŦ
ĢĊĢŤAÁÂÁĜȦÇÃĞĈĞĜȦÃÇĈŤÄAĜÄÁĈŦĢĈǴGÀĞAĞÇCGŤČĈÃĊÀÃÃČTĊČȦAḈŤGGGḈĈÀÂGÁŢĞǴḈTĊTGĊČŦČTGŤŦÁTÂÃAĈȦÁĊÄÄÁČĜĞTĜÂÇǴÇĜTÃḈAÃÂŢÂÅČÃĞÂÁTČǴČĞḈÇǴÀŤÂĞÅĢŦÄḈȦȦČĊŦÀÇČŦČÀÅĢĊĈCÃTÀĈĊČČĢÀÂÃÂÄḈCÀCÄÄÂÅČǴÇĊȦŦȦŦĈÁCÄǴÄČḈGTÂŢÃÂĊŢŢCÃŦĈŦŤȦÅÄĞÁÇĈTĢČĜḈŤÅŤCCTĢȦCḈÃǴŤÇAŢGŤĢÅÀÂÀÃĊÅTŢÇĢAÃȦĞÅTTAĞŦCČGÀĜÅCĈḈǴŢĢGĞÃĜÇĢŢÄÀĢCGĞŤĈĞḈĜÅĊĜŤŦÀĢĢÁŤÅŢĈĢĊŢTÁÇÄĢĜÀŢÃĜĊĈÅĊGĈÂĈÄḈÇTCÇĜǴŤČÂÃĢĊŦÀÃÂÃCḈÃÇĊŢÄĞÁȦĊŦTCCÄĈCĊÄÅÃÀÂTĊǴŤÃAĈAÃḈÁÅŦĞGḈĢÀÅČḈGĢÇŤǴḈGĜŦÇÅŢĜTČĜŢÂŦŦÂČŤÁŢÂÂÂCČḈÅÀCḈÃCḈÁŢGḈTḈÄÁȦĊĈÇȦǴCÃÂČCŦÃĜÃAÅÀÇŢĜŤAAĞĊÃÇŢǴŤĜḈÁČŢḈÄḈĞČŦÇÇÄGŢÁǴÄTÂÅCÅĞŢČĜTĞǴǴŤÃAAÂČŢḈÅTÀĢAĢĜÅŦȦÂÃÄŢĜÂĜČḈÁÃŢǴTĢḈḈȦŤČĈǴŢŦĞÇÀĞŤÃTČÇCÀČĈÃGÀĞÅĞĊŢÃǴČĞǴÀCĊČĞCĜǴĈAŢḈĊÃŦǴŤAḈÇĜǴÅŢḈḈÅGGŢŢĞÇÇCÃÄŢÂÅḈĊÃÂÅĈĈCȦAÃȦCǴCÇČÂÁĢÄÁGĈĞÀÁŢCÂCŤÄǴĈÄGČÄḈTÃĜTŤŢŦÄÄCḈȦĞĈGŦȦAÄÃTCÁŦĊĢÂĈÄTÅĢŢĊCGTĊÃŤÄḈÄḈĞČÁĈŤČČŤÂĞÇÁĊŢŢÅAÂÃŦGȦḈGǴŤŦÁČĜÂÃGḈTÃAÇCĜGČĈȦĈĢAÀÅÀĜĢCĜÃGÀĢĜĞÃÀŢÃĈĞÁŦĜÁĜǴǴÄĊGǴǴŦGŤÁĞĞAḈÁĈÅĊĜḈǴÀAŤŦAḈȦÇÄÂÇCĢŦĜĊÁCÁAATĈÄĢḈÄĜÄCGĢĊĊÅÀḈGÇǴĜŦÅÃŤĜAĢGÃĊCǴĞÄÇǴÇǴḈÅCǴÅĜŤĈǴÂÅAÁǴĊCĈÅÇČÂĜĊĜÃĞĈÂČḈCÂÂḈǴTTÇŦŦÃȦGŤĢĊḈĈḈḈÂḈÄŦTȦȦÁḈĊÀĞȦĊĜGÁŢĢATTCȦCŤȦĊŤÃCČÁŤÅÂĜÂĢGÁGŦÄÀCÅÇÁĈÀÁĊǴĜĢÁŦÁĞŦÂǴÇÂGŦĞǴTĈČŦĈÁÃĢĜĞGCCŢÄȦĜŦḈŤÀÂČĜĊĢÃÅǴŦĢTĜĞČĢÀĜÁÅŤĈÁĞŢÄÅǴĈÁTĜÀÃŦÄĊȦÅČĜḈTŤĢŦĊAČTŦÃČTÇĈČȦÀČǴÂĊGÇÇČḈĈÁGÁȦȦGǴÅĢÄÄŦÀŢÁĜÀÇȦĊÃĜÄĜÇĞŤĢĊŦÀĢÄAÁḈCÁGÄǴĢÇÄCḈŤḈǴḈTÅÂÃGǴḈǴÂÃCČÀŦGCČAȦĈȦCŤĈȦǴCĢÄÀŦÁĞAÀÀŦTČÄȦĢŤÇÂTCÁGŤȦȦȦĢAŦAAČŢǴÀÇÀĜCGCŦÃÃŤĞÅCÂŤČĜĜÅÃÇḈGŢÀÀŢḈĜŤḈǴAÇTTĊÄÅŦĜÃĜŦĈÅÇCȦŤÁǴÃḈŦŤÀŤĞGCȦŢĊĢÁĢĈÀAĊÄĊGŦḈČŢĞÃĜÂÀÅĞÇÀĊGŢÄǴÅŢÅĊŦÄÁȦĜȦĢÁḈǴGČÄÁĈŦŤǴḈǴĊĞÂTĜĈTǴÁÄŤAȦȦĢÇĞCȦĈĞĜŦTĢĜÄĈÂĢÇÄÄČGĜČGḈÂŦĈŢTĈĞCCÂŦÂĜCÇÇŦÀÄȦÀÇÀČÇĈŦḈCČÇÃÃŢŦĈGŢŦÃĞḈÂÁGḈÃÅŢGȦÅĞTÃĜAÀÀÁGÇAŤCCÀŦGŦŤĈĞĈĊGÅĜÀČḈḈAĊÂČCĢĞĢAĊGÁCTĊŤǴĢTŤÄŢÇĞÀÅŦÇTŢĊĢǴAŢǴĊĜÅCÅŤǴȦAÁTĞÄĈÇŢŦAÇÂGĜŦAĜÅĜŢŤĜḈȦGÇǴḈGĈĊÅČŤTǴÇÂŤĊŦȦÁGÃÃȦČÁḈĢḈŢTȦȦCGÇČÅČÀḈÃĞTAÁĞĞAAĊÇĈǴÁĢǴŤǴŦTCÅÂĞÄḈČĢȦÄCTĞÅÁÇAȦĜGȦĈÁĈÀḈǴŢŢǴCÁČĊÀÇĊĈGGÇCŦŢČĢȦČḈÂCÂÅCĈŤȦTCÃǴĞŤÁČĞCÄAČTÁŢGGÅÃÇGÁĊȦĊÀŤÇTCÂGĊAĊǴÀĢTÀTĈÀĜÂÀÃTÅŢŦÇÄÁḈÄÇȦÁÇĊTÅÁAȦĈÁTḈÀŦĢÇÇĊCĢĢȦÅĢḈĞÁČĜŦÁAĊĞÀÅCÅÇAÃŤÀǴCŤĈȦÅÅŦÅÇǴŤḈTǴŢḈĈĢČŦÀḈȦČĊḈĊŦÃÀÃĞÁCĞČĢĈĜĊḈCÄÇÅĞÄČŦĜÇŤÁŦĢAĈÀĞAAĢŢŤĊÃÃAÇÃĊGĞĊTÇTÇÂĊGÀCCTÇŤĊÄǴÂḈCTÅČCŦCĊĊČǴŢĜÁȦTĞȦȦÃĊÅĞĢĊÇÅÂCŤGÇḈŢȦǴÃÀĜTTŤÁȦTĜĈĊAÅŢḈÀŦĜCĢȦȦÇḈÅḈGĜCÃǴŢAĊŦCĜÃČÄÄCČĜÅĢĜÂŢĞĢǴĜÄČŢŦŤĊĈÇAḈŦÁÄḈḈČĊŦḈÀǴÅǴǴǴĞŦAŤÂAÄÃŤÂÁÂḈĈĞÂȦĞÂĈĢĈÃÇÃÇÁĢĊḈŢTḈĊÀÇĊÀȦÀĞḈȦÀŦÃĜCĢÇŤAǴŢÂĞČAḈŤÄĊČÄČÁĢĈAǴČÄĈÄǴÇÅÇǴÀCÄÀÁÅȦĢÄÂGĞȦÀCĞŦĜĢȦÅGȦǴŦĞCĞCÄŤÁČǴŦ
ÁÂĜÀÂḈĞĜÀGÇḈȦŦÄŤĊȦȦĈÁǴÁÂAĢĢÃÁÀĢÄÂÀÃCAĜTÂĈAȦÁTÅÅÁǴŤÇČŤČÂÁÁÀŦAÁĜǴĢÄŢĞĊĢĊȦGǴÇÂŦḈÇCÂŦǴĊḈÅȦCGCĞČČÁÇĞŢÃÄČÁAÄÃÅÇÇÃAÁĈÅAÀĞTÄȦÃČTŢAÂGÀTÅÄĊĊĊŢÂĢÇÀĈÃĢÂÄḈAÀĞÃÁÅÇĢȦĈTÂĈČGĜĜČTÅĜǴǴÃĞĢÅŢÄÀḈÇAĊĞCCŤḈĈŦÂĈÅĜḈAÄŢŤÁGḈACŢÄĞÂAŢĢĊĊÄÅŤÇĜGŢḈÁŤḈǴḈĈČĊḈÇĊÂÂĈŦÃÂÀČÇAŤǴÅCAÁGĞĞČÂAÄŤCĞÀĢȦCÅÄĈǴGĈȦGÅḈǴÅĜĞĢĞŢČÂGȦǴÇŦAÄÃÇAĜȦÃĢÂÁCŢÇḈÄÂĢÃGǴǴGǴTĈḈÁÀČÇÂČÁŤȦĈȦĞAŢČŤȦŢĢÅǴŦÁĢḈČÄÂĜĊÃĞÄĞÁČÄĞĜĜȦǴTȦĈÄČȦŦÁŦČČGȦǴČÅÇÇÄĞĞḈÁĜÁÄĢŦÀCŢÀĢTĞĈŦĈCĊÅǴŦȦĊĊĢḈĜĊÁĢÁČÅĊĊĊȦÃGGÅCÃĈĜȦÄŢÃÀÂÃḈĊȦÁÂÁŢAÃĢÄÁČŢAĢCŤĜGÄÀCÃÁĞAÅȦĊÀḈÁǴĜČCĞǴǴĊŢÅTĢÇĊŢÂǴȦḈḈĊÁĜÂÄÇTĈČÄĞAĈĜȦĈĢȦŦČÇĢCÄĈŦGĜÀĈĈÄĞŦǴǴÂŢŤÁÀŢGÀŤŤÄÀĜTÃÁĊCÃÂAÃǴĜÃTČGGTĜÇȦGĊÄŢÅŦḈȦǴGĈȦAǴĜČĈAĢÂǴȦAȦÄÃĜAÅǴGÅČḈÂĜÂĊǴŤŤȦŦGÂÃŤÂĊŢÃĜĜAĜȦŢÁŢČČÁḈÃĞÀḈÅCÀĊÃĈTÇĈÀCGĈĈÅGĈĊĈŦÄTÇÀÅḈÁÇGŢÀŢÁÄǴÄÇŢĈCAAÃÃCĈGǴĢÄḈCÀÂŦČÁÇGǴAČĈČĢĈŤÇŤĊȦŤŤȦTÇÂČÄÄAĢĞĢȦAŤŦÄŢÁĊĜGČÄḈĢÁŢÂǴÄAŦŤĊĊTḈȦÇÂÇǴĈTÅČÇĈȦŦČÁḈĢÄÄGGTÅĢÀÂÂĈǴÂGCÄĈȦĢĈAŢGĢŦĊŤȦTÇCŦŢĊÁÅGĊAĞÂĊÄȦĞĊGĊŤĊŦÃĊĜCÅČȦȦȦÄĞÅGÀÃÃŢŦĞǴÄCÂCĢČÄÄGÇḈTÂÁÂÀĞTĜÂÃAÇÇȦĜȦÁAÀĊTǴŢTÁĢŢÀÇÃĜTÀGÀÃGḈÅŦĞȦŤGŦŦĜAÇÃŦĜŤĢGÇÇḈŦÄĢÀŦḈÄÁḈGGÄCÄḈĢÅGĜCĢĢḈÂǴÁḈČGǴĢCGĢŤÁČÂĈÃĞCŦÄÄĞĢÅȦČÀĜȦCȦḈǴTÅĈĞÃAŢTÃŢĜTÀÅTGÃÇÁCGĞÅCÂÅĈŤĞČÂȦŢCÁĜÂḈÂÅTÅÄÀĢŦAŤTÅÂÄĜTȦÇÀÅTGḈÀÃĞŦŢḈŤǴÁÇÇĢĢÅČŦĜÄĜḈÁAÇGĊÇĜÀÅČĞĊŦǴȦĞATČĈḈČǴŦČĈÃĊÄČĢÀÅÄȦȦĢČAÄTÁŤÄÅḈŦĊÄĜĜÁȦÄǴÁÅÅĈÁÃŦĊÀÇŢǴŢĈŤȦĈÃCŢŤÅŤTČĢĈÀGŦAČḈAATĈĢĊĜŢĢÇĢŢŢGGGÄÁŦÀÃTĞGÅĜTĢǴÀḈGÅAŦÂĞḈTĜǴÂAGÃÅAĊÃÂḈTÀǴŢCGĜÁŢÀĈĊĈCŦÀĈŤÅȦȦǴŦĈŦÂȦŤČCŦŢTÅCAĜĈÃĢŤŢÂĢĞÅŤŦŦČATÀŤḈĊÄÀŢĞḈĊĢÅǴČĊCTÀTÇĈĈÃḈĞÄČÀTÃÂŢÂÀĊḈĜÂŢĢÁŢȦĜĈTŤÀCĢÄAÁÄĞTÃŦĊȦŤĈǴČÇAÇTÅĜŦÁĈÀĜŢŦÂĊĊĊĊŤCČŤŦČĜĢḈĜŦČŦÀŢÁÂGĊÀČÅĜÃĊCĈTTÄĜCĜÁÂGGĜĜḈĢǴÅÄÂŤCḈTȦÁČĞĢCÇŤĊĜAÂGGŦTĜÇĈĈĜȦȦÄŦÄĞḈÀÂCĞŢÀĞÄAĊGTŢGǴAČŢÄŤḈǴÀÄĞÇĢÅĈĞḈǴAǴĈĊŦŢČĢŤĈŦḈḈÂÂČḈGǴAĞAŤÃŦŦÃĊĜÄÀÅÇŦŤÅÂǴTCĞĞǴÇŦAḈȦÅTÇÂŤÃĊŢTÁÅĞÇÀĜȦÅÇǴÃGǴÅÀŢĊĢGČÅÂĞĞAŦĜḈÅĈĜĊĜĢĈÄCCŢAÂÃŦÄĞŢCǴĈTÃÂḈĢÄĢAǴÇÁĞĜŤĊĜÁŦȦÇḈČÇAŤÄÇĢCÃÄĢAÄÂĞÂĈĜŢÃČȦÃÀȦAČÀȦTTÁǴÂǴŢŦCÇÁĢÁČĢÁCŤÀÇĢTǴŦČŦCĜÁGGGCCTȦÄCĜÂÃŤŢĊGĢČȦÄḈÃČĊGÇŦĊǴÁŢĞȦȦĜAGGÀŢÀÄÄÁŤAŤÅŤAĊGCĜCÄÁÇÅĈÀÂÇCǴŢŦÀÇȦĊÄĞÇĜČĈÄTǴÇÁŢḈCCGCĞÇḈÇÃĈḈÂGĈČČTÀGŢAÃĊŦŦÅCĢÅČÂTǴGÀĊÂŢÃGÄGÁÄŤĞÄTŢĞǴĈȦÅŤĜĢGŦÃĈĈÂTÀŢŢÃĈŤGÂŤǴḈÁÁGȦŤÅCÂŤȦÃÂCĈÇAĢḈÅǴÂÅĢTGĢÀGȦÄÅCAŦČĜÁČÅÁŤŦCĊȦȦĞTḈḈAÁÅĞǴḈÅGGŦAĈḈȦȦCĜĜŦĞĢǴÅÂCĈÁGĞĈAÁÅȦǴḈĊĈĊĊȦĞĞÂÁČŤÀÅǴĊGĢAÇȦCÀǴÀǴCŤÂAȦČGĊĊŢGĈŢÁĈŦĞÂŤŤĈǴǴĈĞĊÂÄÀĊĜŢÀÁGTAÅČȦČÂÄĜḈÇÄÂÅǴŤŤÀCĜÇĢGḈĊGÂÀḈȦĢŤAGĞŢǴÅÇĜCŢČÀĈŢĜÅǴTǴÃŦÃÁŢGÂÀÄŤČĈGGḈÀAĞGǴĞÅÁÀȦÇÃCÃĊÁĊḈŢTḈȦȦŤĊÂḈÄČŤÂÃĊÄÂĈĞCÃTĜCÀǴAȦ
ĊǴÁCÁÅŦȦȦĢŦGÅĊÄÅŦCǴČĞČŤĞǴÃAČŢÂĜĢĈŢÅĞTÅTÁĈĈḈÁȦǴÄŤǴŦÅĊGĜŤĢCŦŦĢCǴĢĈÇÄÄÂÇČÇÀÅTÄCTÇĢḈĜŤÁŦȦÃŤŦĊAĊŢȦǴTĞĢGČĈÅGŢĊTGÀCÀTÇĜÄŤÃÁÁÇTÁĢŤĈŢÃǴÇĢÂŢÄGĞÁTTĈŢĢAGTÄŦÁÇGŢĢTĢÁȦÇÇȦÀĊĜÃAŤÇĈȦAÀĈTÀḈǴÀÃÄÃTCĢÀŦGḈĈGǴĞŤÄĊĜĈÅÃČŤÄGÇŢĢŢḈÀTŤḈŢÅCÄŢÃÁŤĞÄŤĊĞÅǴÇÃCŢÂÀAĈŤĞÁÃĈGĜḈŤGĢGČAAÄCĊĞTŦǴÄŤÇÄÁTAǴÃAĈĈŦŤȦĊČÁÄǴĢÇĢĈḈČÇŦGÂŤĊAÀČÂČĜTÇÅÀŤĜÁČȦTĢǴÁÄÂCÅĜŤĊȦĈÃḈĢĜĈŤTĜĢŤĢŢŢCAĊŤÅŢCÇÂČǴŦCȦÃĊÁḈŢĊÀÃÀĜĊḈÁÄĈĞĜCŢĊĜĜŦÃǴÂĜȦǴǴĢÂŢĜGĢḈÇĞČČGTÁČCÅǴĈÀŦŤČÇÀĢÅŦĊÁČÂAÄTČÇČÅŦÅÇǴȦÃTḈĊȦǴǴĊÃȦÇĈÀČĊÅĜÅĊŤŢĜĜŦĜǴATAÇǴCĊĞŦḈŦŤTŦTÀĈÇĊÃḈÅÀCḈÀǴĊÀÂĜĈÀAÁŦǴÅÇĈÄĞÄŢĜÂȦĢĈǴḈTŢŦÁĞǴÃÂTĢÀǴCAḈĞĜAAGÄĜŤŢTAGǴÃḈÄŢĞÇGŤCĞGĊŢÃÁÅÂŤCTŦTGĞÄGŤÃAÁĢÄŢŦÃÀĊAÇČÅǴȦÃÇŢŢǴÀČČÀÁĊŢŦŤĈĈČÇĞÄȦŤGÇCĢCCÇATČTÁĈČĜÂAḈĞTÃĜÇŦTÃAÇÅĢÀÃȦÇĈǴĊȦĊḈÀŢÀÃÇĞĢÅGÀĈTĢÅĊḈĞĈĜÁTAGÂĢĊḈĢḈǴÃĈÂŤĜTTÅCĞĢTĢTTTŦÅÂAĢŢȦÅTŤŦCḈĞÃĞĊĞÅTÂŤĢTŢŦČGAŢĈŢǴŤAÁĞÅČŢŢCGĊÄĢÃCÂĈAĜŤǴǴŢÃÃȦĞÇÃĊĈḈŤÀČĢŤĢCTĊǴTĞḈAȦĊÁÁÀÃÁĈÁÀÂǴÃÁCÄÅÀĈǴČȦŦÀĈAÁŦGŢÇÀḈÅTḈÂÂŦÃĞǴĈĊCȦÀÀÀCḈTTÂŢĊŢÁȦĢĢǴÃAÁȦCTȦĈĞĊĞČŤŤÂĈŦĞŢÃTǴAÄȦĜÃČGȦÃĊĈŤGĞŢǴÅÁCČḈÅTĢÇĞGÁĢĢŦĞĞČAĈAÇŢĢÁŦŢĈÃÂŤḈAÀCŤTÄÅŤŤÀTAŤÇĢĢÄGGĈĜŤÅÅĊAÅŤŦAĊĈČÂḈGÃǴÄĞGÇÅÀĈÅGÅGŤÄĢĢĈŦĞĢĞĢŦÁCŦTCĞÂTĈÀĊŢȦAÄÄÇAÂTÂÄGḈÀČAĊĢĊŤȦÂAĊĞČČÅAAŢǴÇÃĊÁTÂȦŦÃACÀÄČĈTÀĈĢḈTAČĢÁȦCĞŤĞĜÂȦȦŦĊÇÇǴȦŤČÂŢGÂŢĢḈÇĜŤÂÇŦČGÇĈÅǴÃĜĜĊAÂÃÄȦȦǴÁÂCGĊÁĈTTǴǴĊAĜÁČŤĢĜÄÁÁÄČḈÀȦĢGČĢÀCÇCÇǴĜÇÀŤĈĞĈĢḈTĜĜĞÁĜĜĞAĞȦÃÃÄḈḈǴȦÄÂŤḈÅČĈAĢĞĊĈŦÄŤÃĢCAŤĜĞCĜĈȦĊŢÅÄATÃÅÁŦḈGÇŦTŦÄČÂGǴĈÁÃĢCGÀÃAĞÂĊĜÂÇÀĈḈĢACÅAÄȦÅĢḈĊǴÂCĊCĊḈTḈCĜĢĊÄÃĜTÂĢÄÁČĜÇÂĈÀĜAÄĢȦŦÅŤĈŢĈĜČĞŤḈTGÀGĈǴŤĞŢAŢČGḈÁŦÁŢŤAČÇĊÂČŦŢĜÄŦGÃTĊÀTÀAĊȦÂĢĈḈŦȦÂÄAÀTČŤḈȦḈḈĞĈḈÀÂĢÂÀČÀÂȦTĈÄḈÅĜČÁAĜŦÁÇŦĈÃÀÂĞÄCCĈĞǴŦĞĞŦÇÅAŦǴÄĜČGÃȦĊÇÇÃĊÇÃŤÄÀGŢǴȦĊÄČÅĜÅÄTŤČȦÅÃĈĞGČĢÅÇÇÄÅĊŢÄTÀGǴÄČTŦĊȦTḈÁĢČTTĞŤȦĢĞŢAAČCŢÀĊĈÅAĊÁĞĈCÀĢÁĜȦÀḈĊŢGČÀŦŦACÀCÁŤĈÅȦǴĞAĞÁǴCĞŦŢÃĢȦCÇATȦAĞĞĢĈAĢŦÁǴÃĜĜTÁTĢĈÅĞCŦÇÄÃǴÇÄȦÃTĊÀĢÇÂÃÃḈÅĈȦŦÃŤGŢÇÇÃCÀŦTÅÃCGĈÁĊÄÄŢČÁÁTĜĊÂŤCGŢȦḈŤǴŤĞĜÄŤŢŤGÀŢÅCTĈŦÀĜĊǴŢĈAÃĞČÄCÄÁÂTǴḈÅÄÃĢȦŢḈĊḈȦÄÀḈÇÇTĈŢĊTḈCĢÂČǴŢTÂTÄÄÁÀĢĜÃÂĢÁGĊŢĊĞĈÂÇÃḈÀŦĈĊŢÅŦČCÃAÄĢÀÃȦȦĜÄĜȦŤTGĜÅGŦĈÂÂÇTĞGŢŤGÃČCĢŢCÅÃAŤŦḈŢAĞŤǴÃAĈČĢCČÀÁĞČĜGĜÂÀŤCĈTĊÄŢĞĢČŤĜŤĊČTŤǴTÇÅÃĞĢÅǴAḈŦÃCŤǴĞÀĞÃÇÂŦCḈČÄŦGCŢGŢĢÅŢǴĞĢĢŦÁĊÇĈĜĞÀGÂTȦȦȦÀȦCŦǴCĈÅÂTŤŤĈÂČȦÂÂŤŢAÅÂŦĞAĞÃÅTÅTĞÅČŦŦÀÀCŤGŤÁGÇÁŤČÃȦGÅŦŦAÀÇTÇÄČAĞTÅCÇĈȦÄCGĊŤTĞTÂĈGČÂĞČÀCĈĞȦCĈŤÄTÃCǴŢTČŢȦAÇŢCŢÇĢĞȦÁĢÂŤTÇĢḈÃŦGÁÁĊŤḈÁŦČÂAŤĞȦČÀÀÄĞȦAÂGGÂĈŦÀČČÄĞÀŢCĈĊÀÇĈÄCČČÅĈĞǴGĞÀÁTAÀŢŤÇǴŤCĊǴÃȦČÄǴÅÁŦÅĈǴḈÇAĜĞAŤÁŢCŤÅÄĞAŤĢÃḈÅĢŤÄTÇÅḈČÃǴÅGÁĊ
ÅǴČÅÄÀÂĜḈĞǴÁÁŦÄÇĜŦÄÇČĈÂÂŢĞÁÁĢTĞŦĢÅȦČÂȦÇḈĢČŢŤÂȦȦŤĞĈĜĞŢḈḈǴĜḈĞŢCḈĢGÅĈŢǴŢAÃĢŤĢĜÇȦĢĞÄĈGḈÅČḈŢĢÅAAĈḈÇŤTÀCAǴÃḈĊḈČÇCČǴÄĈĊŦĢGÄĈĈGÁCŤŢĜĞČŦÀTÁǴÄȦCĞÃĜAÂÀĞĊĈÃGAĊŢÅḈĞǴĈAĊÇĜÁTÀÂÀĊĈĈŦĜÀÇGĞǴĜÂTŢÅĞCĈCGǴŦĜÂĈÅÃTAÂÄÀȦŤÇÅČĈŢÄŤÂĞḈÅǴÁĈČŢŤŤČÀÂTĈḈŤCÄCǴTĞĊTĢÃŦGŦÁĞȦÃḈÇŦGĞĊÅĞŢÂŤḈŢȦÀĜÂĞÄÄÀȦḈÇĜǴAĞÇȦǴCÇĢĢÃŤḈḈĢŢĞĊĢGGĊḈĞAǴḈŢÁȦÂGÀĢĞAĈǴǴÅǴĞȦŢÄŤÃĢĞAÁḈÀÇĞĞÀĈŢAÅAȦČÅÄĜǴÃÀČTÂČGCAÀAÅĢĈĢÄĞŦĞĜÀÇTŤÄÅÄÂŦTŦTÄÃTŦČǴĜĈĞCŢTGÀAŤȦŢÅĞAŦÂÀŤĈÄÁĜȦČŤÁAAȦĜȦTĜÁÄČŢĢǴÂȦŢÁḈḈĞGĞTĢÁÀĢŦḈÃÃCTÁĜŢÅAŦÃĊḈÅÇŢǴÃĊŢCÃCTGCÄGǴTǴAČŢCČÄĞÅȦAḈŦḈČGĈGĊĞȦÇČĢCÀĊÇÃǴĜAÅĊÀĜĢTĈǴȦÇĈŤÇŢÄŢÇŢÅÁÇŤÅAÃŤAGÅĢȦGGAǴȦÅÇǴŢGŤĜŢÃAĊȦĞĞTĜÄĈÃTĞĞŤĈĢȦŢCĞĊAGĊḈAḈĞÄŢŤCÀĞŤǴÀCȦÅĊŤČŦÀÅAAÄǴČǴḈTŦĢǴȦÇÇŦȦĜḈŦÁĢĜCÁCÅÄÂĢTÃÀÁÄȦÅGŢÂCŦŦÄÇAŦCČTÂÄÇĈCŦACÄÅǴḈCCÄĜGŦAĞĊÇČČŦĊĜÇḈÁḈȦÅĢAÁŤAŤCĞḈĈȦĊÄĞĊÅÅŦÀÂÃÇḈÃÄÃḈÁÄGGÅȦÃCǴGÁÀĊŢAȦȦĜŢĢǴĊĢǴÁĜÂÅÅĢAĜÂĞŤĢĜČÃĢÂĊȦŤÃȦŢĈCĞÁĈÀAȦCŤĢḈŢÂȦCGĜŤÃÂǴATÄACḈÃCŢḈÅAÃÄÀÅĞGÄĜÂŦĊÃÀÂÁŦǴŦTŢḈȦTÃÀCÂĊŢÅATÃÂÃȦÅÄŢĜČȦCĢŢĊÂÇÃĢĈCĞĞĢTÇŦGŢÅÃŤḈḈǴĈTAȦÃČÃAÂĊĊCÇȦÃCĊÇǴÃTÂǴÀAĈĢĊAĈCĊŦĜĜŦĈAǴĈÅŢḈŦÅGGÀCÁČÁÃĊÇÃĊÁAĈŢĜŢḈȦÄČCÄÂÅĢÂǴĞǴǴĢCĢÅTŢÃÁÀŢĞĊÀĞĊĜČĞĢĜÅĜTÇÃĜÅÅÃĊŤČÂĜÂÅÃȦŦĞḈÁÃŤÇČÀĊCŦĈÃTĈÅÀÇGÅĈÂḈĜTAÅÅCÄĈÄĞÂḈÅGĞĞǴĈŤÅÃḈȦȦÇȦÄǴĜČŦČÇȦTAÅÀÁÃȦÇTȦÄTÅTAÄÂÁÇCÁÂŢĜḈÇŤŤĈŦĜÂÃŦAÇĈḈÁÀTÀŢCÂÁTÅÇTÄĊĢĞḈĜÅÄÇḈÃČĢTŢĊĊÂCȦÂGÂĞÇĊGȦÃŤÄȦÀÇŦḈÂÇÇÄḈČŢŦÃȦĞÄTÁĈÃḈÁCǴČŦÁǴŦÃÀTCĈŤTŢŤÂĜÄḈČǴGĈĊȦÀĊÅĜŤŢTTǴGAÃĈȦÀÇÇȦČGČTÂȦŦÀÄĊḈÂGŢḈĞÅÂÀŤĜCĈŢÂŤÀAAÇĢĈĜĞÂÃŤÄŢGĈĞĊÃTŦTGÄÂAAŦĞGŦŢÅŦACŢĜĈŦGÅÁȦÃÁĊÂḈÂCŤŦTĊȦǴŦAÁÂḈGÇÀÇȦÀȦGŢĢŦÀĞAÄĢÁĊTḈǴCTAḈĞCGḈǴÁAḈÀÂḈŢǴĈČÅŦTǴǴŦÂÀTŤĜCĈGĜĈCAŢČĜŤĊḈŢŤȦĢÁÃȦĊÂÃĈCGĈÇḈAÁǴAŦĞŤÀḈĞÄȦȦǴĢĞĞÁÃČÇÄCÅȦÁŤAĊŦČĈŢŦĢÁGÃḈÅḈŤĜÄÅÂŤCĜÅAĊGĜĞŤGÁČÃĞŤȦÃÄĜÄČÂÅŤAĊḈĢḈŢGǴČGȦGḈÅCÂÄČŢĢÀÃGŤĜĊAĞĞGŦŤCTĢČĜTCCŦŤÇĈAÃŦÇĜÇAÁÇTŦŢÄÄAČÄĊḈĢĊĢÇḈĜĞAAÇŦŢĊČAAÀÅȦÂGÃÂḈTǴÂAĢǴĢŦǴÀÃČĜȦŦÇŦÇÀCAǴGÃGÃÅĜAČČÄTĈĈČÁTAȦTÁÁÅȦÅĊĜÅȦČĊAǴǴTĜȦĊÃÃÂŦĞĢĜĞǴČǴÁČTÀŤÁĜȦTÄTGĢŢĞŢĞȦAĈǴĈĊǴŤATĊAĞĊĊTÁČĜCÀȦCÄGAǴŤÄÃÃĊǴGÁÇGÄǴAĜAŤÀÂḈÁÀĢČŢÇÅĈÄČČÂÁÀŦÂŤĊŢÇĜĈÃĢȦĜCTŤĈǴŢCŦĜĞTČǴÀŦĜĞÁGČAǴÄÀGḈÂÂȦÅǴGȦŇŅŅÑŇŃÑÑŅŇŅŅÑŇŃÑÑŅŃŅŇŇŅŇŃŇÑŇŃÑŃÑÑÑŅŃŃŅŃŅŃÑŇŅŃŅŇŇÑŇŅŅÑŇŃÑÑŅŃŅŇŇŅŇŃŇÑŇŃÑŃÑÑÑŅŃŃŅŃŅŃÑŇŅŃŅŇŇÑŇŅŅÑŇŃÑÑŅŃŅŇŇŅŇŃŇÑŇŃÑŃÑÑÑŅŃŃŅŃŅŃÑŇŅŃŅŇŇÑŃŅŇŇŅŇŃŇÑŇŃÑŃÑÑÑŅŃŃŅŃŅŃÑŇŅŃŅŇŇÑŇŅŅÑŇŃÑÑŅŃŅŇŇŅŇŃŇÑŇŃÑŃÑÑÑŅŃŃŅŃŅŃÑŇŅŃŅŇŇÑŇŅŅÑŇŃÑÑŅŃŅŇŇŅŇŃŇÑŇŃÑŃÑÑÑŅŃŃŅŃŅŃÑŇŅŃŅŇŇÑŇŅŅÑŇŃÑÑŅŃŅŇŇŅŇŃŇÑŇŃÑŃÑÑÑŅŃŃŅŃŅŃÑŇŅŃŅŇŇÑŇŅŅÑŇŃÑÑŅŃŅŇŇŅŇŃŇÑŇŃÑŃÑÑÑŅŃŃŅŃŅŃÑŇŅŃŅŇŇÑŇŅŅÑŇŃÑÑ
ŅŃŅŇŇŅŇŃŇÑŇ

ACGT...TGCA — has every possible DNA-based initialism been used by the bioinformatics/genomics community?

 

Short answer

Yes. 

Long answer…

You might work in a field that's related to biology, genetics, genomics, or bioinformatics. You might be working on a new piece of software, or a research proposal, or you need to form a committee. Maybe you have even been given the power to name a new research facility.

Suddenly you have an inspiration...why don't we name our new software, proposal, committee, or facility after a DNA-based initialism! That would be clever and make us stand out from the crowd, right? Maybe...maybe not.

What follows is a fairly exhaustive list of — presumably intentional — DNA-based initialisms that are in use (or have been used). As of 2020-07-20 the current list contains 67 names in total with all 24 possible combinations of [ACGT] being used. The additions since I first created this page are included at the end.
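If you want to check where the figure of 24 comes from, it is just the number of ways of ordering four distinct letters (4 × 3 × 2 × 1). A minimal Python sketch, with a variable name of my own choosing, lists them all:

```python
from itertools import permutations

# Every ordering of the four bases, each used exactly once.
initialisms = sorted("".join(p) for p in permutations("ACGT"))

print(len(initialisms))                 # 24
print(initialisms[0], initialisms[-1])  # ACGT TGCA
```

Sorted alphabetically, the list runs from ACGT to TGCA, which is the same order used for the headings below.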

See also this related blog post by David Lawrence from 2014, which I only discovered in mid-2020. His post — which beat me to the punch by just a couple of weeks! — has provided me with a few additional examples which I hadn’t heard about and which have now been included here.

Please let me know of any errors or omissions, though note that potential names have to be initialisms and have to be somewhat related to the fields of genetics, genomics, or bioinformatics.


ACGT

  1. Advisory Committee on Genetic Testing — Committee — 1996
  2. Alliance for Cancer Gene Therapy — Research Network — 2001
  3. A Comparative Genomics Tool — Software — 2003
  4. Advancing Clinico-genomic Trials on Cancer — Research Project — 2011
  5. Algorithms in Computational Genomics at Tau — Lab web page — ???
  6. Advanced Center for Genome Technology — Research Center? — ???
  7. African Centre for Gene Technologies — Research Network — ???
  8. Applied Computational Genomics Team — Research Group — ???
  9. Amino aCids To Genome — Software — 2017
  10. Analysis of Czech Genomes for Theranostics — Research Project? — 2020?

ACTG

  1. Automatic Correspondence of Tags and Genes — Software — 2007

AGCT

  1. Applied Genomics & Cancer Therapeutics — Research Program? — ???

AGTC

  1. Applied Genomics Technology Center — Core Facility? — 1998
  2. Advanced Genome Technologies Core — Core Facility — ???
  3. University of Kentucky Advanced Genetic Technologies Center — Core Facility (now defunct?) — ???

ATCG

  1. Applied Technology in Conservation Genetics — Research Lab — ???

ATGC

  1. Arabidopsis Thaliana Genome Center — Core Facility? — 2000?
  2. Another Tool for Genome Comparison — Software — 2001
  3. Advanced Thermal Gradient dna Chip — Patent — 2002
  4. Another Tool for Genomic Comprehension — Database & web tool — 2012
  5. Alignable Tight Genomic Clusters - Database - 2009

CAGT

  1. Center for Advanced Genomic Technology — Research Facility — 2000?
  2. Center for Applied Genetics and Technology — Research Facility — 2004
  3. Center for Applied Genetic Technologies — Research Facility — ???
  4. Clustering AGgregation Tool — Software — 2012?

CATG

  1. Cross-legume Advances Through Genomics — Conference — 2004?
  2. Center for Advanced Technologies in Genomics — Research Facility — 2008

CGAT

  1. Comparative Genome Analysis Tool — Software — 2006
  2. Computational Genomics Analysis and Training — Training program — 2010
  3. Computational Genomics Analysis Toolkit — Software — 2013
  4. Centre for Gene Analysis and Technology — Research Facility — ???
  5. Canadian Genome Analysis and Technology program — Research program (now defunct) — 1992

CGTA

  1. CNS Gene therapy Translation Acceleration - Research Group - ???

CTAG

  1. Corn Transcriptome Analysis Group — Working Group — 2014
  2. Canadian Triticum Advancement Through Genomics - Research project - 2011

CTGA

  1. the Catalogue for Transmission Genetics in Arabs — Database — 2006

GACT

  1. The Center for Genetic Architecture of Complex Traits - Research Center - 2013

GATC

  1. Genetic Analysis Technology Consortium — Biotech Consortium (now defunct?) — circa 1997?

GCAT

  1. Genome Comparison & Analytic Testing — Software? — ???
  2. Genome Consortium for Active Teaching — Teaching Consortium — 2007?
  3. Gene-set Cohesion Analysis Tool — Software — 2011 (or 2007)
  4. Genotype-Conditional Association Test — Statistical method — 2015
  5. Genomics, Computational biology And Technology - study section - ???

GCTA

  1. Genome-wide Complex Trait Analysis — Software — 2011

GTAC

  1. Gene Technology Access Center — Teaching Facility — 2000
  2. Genomics Technology Access Center — Core Facility — 2009?
  3. Genome Technology Access Center — Core Facility — 2010
  4. Genomics/Transcriptomics Analysis Core — Core Facility — ???
  5. Genomes and Transcriptomes of Arctic Chromists — Research Program — 2012
  6. Gene Technology Advisory Committee — Government Committee — ???

GTCA

  1. Genomic Tetranucleotide Composition Analysis — Database — 2006
  2. Genome Transcriptome Correlation Analysis — Software — 2007

TACG

  1. Talking About Computing and Genomics — Workshop — 2013

TAGC

  1. The Applied Genomics Core — Core Facility — 1998
  2. The Ashkenazi Genome Consortium — Consortium — 2012
  3. Technological Advances for Genomics and Clinics — Research Lab/Program? — ???
  4. The Arts & Genomics Centre — An Arts/Science Center — ???
  5. The Allied Genetics Conference — Conference — 2016?
  6. Taxon-Annotated GC plots — software visualisation method/tool — 2013

TCAG

  1. The Centre for Applied Genomics — Research Facility — 2007?
  2. The Center for the Advancement of Genomics — Research Facility (superseded by this) — ???

TCGA

  1. The Centre for Genetic Anthropology — Research Facility — 1996
  2. The Tayside Centre for Genomic Analysis — Core facility — 2001 (?)
  3. The Center for Genomic Application — Core Facility — 2004
  4. The Cancer Genome Atlas — Research Program — 2006

TGAC

  1. The Genome Access Course — Training Course — 2002
  2. The Genome Analysis Center — Research Facility — 2009

TGCA

  1. The Genome Counselling App — iOS Application — 2014
 

Updates:

  • 2020-08-20 Added 5th example of ATGC, 3rd example of AGTC, 2nd example of CTAG, and 5th example of GCAT (all courtesy of David Lawrence)

  • 2020-07-18 Added 10th example of ACGT

  • 2019-07-23 Added 9th example of ACGT (thanks to Sam Lent @samanthalent)

  • 2016-09-03 Added 4th example of TCGA (thanks to @malcolmacaulay)

  • 2016-02-16 Added 6th example of TAGC

  • 2015-09-11 - Added 5th example of TAGC

  • 2015-07-06 - Added 8th example of ACGT

  • 2015-04-06 - Added 4th example of GCAT (thanks to John Didion)

  • 2014-12-12 - Added first usage of TACG (thanks to @NazeefaFatima)

  • 2014-04-25 - Added Jeff Ross-Ibarra's planned use of CTAG

  • 2014-04-25 - Included a second instance of AGTC

  • 2014-05-18 - Included a fourth example of TAGC

  • 2014-09-08 - Included first usage of CGTA, GACT, and TGCA

Winning an award that shouldn't exist: progress towards 'open data' and 'open science'

It was announced yesterday that the Assemblathon 2 paper has won the 2013 BioMed Central award for ‘Open Data’ (sponsored by Lab Archives). For more details on this see here and here.

While it is flattering to be recognized for our efforts to conduct science transparently, it still feels a little strange that we need to have awards for this kind of thing. All data that results from publicly funded science research should be open data. Although I feel there is growing support for the open science movement, much still needs to be done.

One of the things that needs to become commonplace is for scientists to put their data and code in stable, online repositories, that are hopefully citable as independent resources (i.e. with a DOI). For too long, people have used their lab websites as the end point for all of their (non-sequence[1]) related data (something that I have also been guilty of).

Part of the problem is that even when you take steps to submit data to an online repository of some kind, not all journals allow you to cite them. This tweet by Vince Buffalo from yesterday illustrated one such issue (see this Storify page for more details of the resulting discussion):


Tools like arXiv.org, BioRxiv, Figshare, Slideshare, GitHub, and GigaDB are making it easier to make our data, code, presentations, and preliminary results more available to others. I hope that we see more innovation in this area and I hope that more people take an ‘open’ approach to other aspects of science, not just the sharing of data[2]. Luckily, with people around like Jonathan Eisen and C. Titus Brown, we have some great role models for how to do this.

How will we know when we are all good practitioners of open science? When we no longer need to give out awards to people just for doing what we should all be doing.


  1. For the most part, journals require authors to submit nucleotide and protein sequences to an INSDC database, though this doesn’t always happen.  ↩

  2. I have written elsewhere about the steps that the Assemblathon 2 took to try to be open throughout the whole process of doing the science, writing the paper, and communicating the results.  ↩

101 questions with a bioinformatician #4: Michael Hoffman

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Michael Hoffman is a principal investigator at the Princess Margaret Cancer Center in Toronto. His research group is based in the glamorous-sounding Toronto Medical Discovery Tower, and the focus of his current work is on developing machine learning techniques to better understand chromatin biology. The highest compliment that I can pay to Michael is that he understands the need to properly document his code; the description for his Segway software states:

Our software has extensive documentation and was designed from the outset with external users in mind.

I wish more bioinformaticians had this attitude! You can find out more about Michael by following him on Twitter (@michaelhoffman).

 

001. What's something that you enjoy about current bioinformatics research?

I love how easy it is to experiment with new ideas. The activation energy for writing and managing a useful piece of code or looking at results keeps reducing. Improvements in lower levels of abstraction keep making it easier to think about more complex problems rather than low-level implementation details.

 

010. What's something that you *don't* enjoy about current bioinformatics research?

The amount of time wasted by moving data around, converting it from one format to another. Was it Nick Loman who referred to bioinformatics as "advanced file copying"? I hate that stuff. I can't believe no one has solved this problem yet.

 

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

I was a biochemistry undergraduate in a chemistry and biochemistry department. I would have been served better by more statistics classes and fewer advanced chemistry classes. I still learned some cool stuff in those classes though, and I got to quantify the hotness of commercial salsas via HPLC. Best lab teaching experiment ever.

 

100. What's your all-time favorite piece of bioinformatics software, and why?

Can I bend the rules and name my all-time favorite bioinformatics data resource? That would be Margaret Dayhoff's Atlas of Protein Sequence and Structure (here is a good review on how this resource was developed). Dayhoff and colleagues were the first people to realize that we needed to gather all the available protein sequence information in a database so that we could do cool stuff with it. The whole field traces its origin to Dayhoff's work starting in the 1950s. Of course, back then you could print out all the sequence information available in a book. Try doing that today (well there is this, KB).

Bioinformatics has been around longer than people realize.

 

 

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

I'm going to go with R because of my interest in pure science.

 

2014-04-22 11.04 - Article updated to correct typo and correct the web link for Michael's research group.

When is a genome complete...and does it even matter? Part 1: the 1% rule vs Sydney Brenner's CAP criteria

This will be the first in a new series of blog posts that discuss my thoughts on the utility of genomes at various stages of completion (both in terms of genome assembly and annotation). These posts will mostly be addressing issues that pertain to eukaryotic genomes...are there any other kind? ;-)




I often find myself torn between two conflicting viewpoints about the utility of unfinished genomes. First, let's look at the any-amount-of-sequence-is-better-than-no-sequence-at-all argument. This is clearly true in many cases. If you sequence only 1% of a genome, and if that 1% contains something you're interested in (gene, repeat, binding site, sequence variant etc), then you may well think that the sequencing effort was tremendously useful.

Indeed, one of my all-time favorite papers in science is an early bioinformatics analysis of gene sequences in GenBank. Published way back in 1980, this paper (Codon catalog usage and the genome hypothesis) studied "all published mRNA sequences of more than about 50 codons". Today, that would be a daunting exercise. Back then, the dataset comprised just 90 genes! Most of these were viral sequences, with just six vertebrate species represented (and only four sequences from human).

The abstract of this paper concluded:

Each gene in a genome tends to conform to its species' usage of the codon catalog; this is our genome hypothesis.

This mostly remains true today and the original work on this tiny dataset established a pattern that spawned an entire sub-discipline of genomics, that of codon-usage bias (now with over 7,000 publications). So clearly, you can do lots of great and useful science with only a tiny amount of genome sequence information. So what's the problem?
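As an aside, 'usage of the codon catalog' simply means how often a gene picks each of the synonymous codons available for an amino acid. The toy Python sketch below is my own illustration and not the method from the 1980 paper; the sequence is invented and the function name `codon_counts` is hypothetical.

```python
from collections import Counter


def codon_counts(cds):
    """Count in-frame codons in a coding sequence, ignoring any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


# Made-up coding sequence: ATG CTG CTG CTC TTA GGC GGC TAA
toy_cds = "ATGCTGCTGCTCTTAGGCGGCTAA"

# The six leucine codons in the standard genetic code.
LEUCINE_CODONS = ("TTA", "TTG", "CTT", "CTC", "CTA", "CTG")

counts = codon_counts(toy_cds)
leu_total = sum(counts[c] for c in LEUCINE_CODONS)

# Fraction of leucine positions that use each codon (only codons actually seen).
leu_usage = {c: counts[c] / leu_total for c in LEUCINE_CODONS if counts[c]}
print(leu_usage)  # {'TTA': 0.25, 'CTC': 0.25, 'CTG': 0.5}
```

Repeat that tally over every gene in a species and compare it with another species, and you have the raw material for studying codon-usage bias.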

pause-to-switch-hats-to-argue-the-other-point

Well, 1% of a genome may be better than 0%, and 2% is better than 1%, and so on. But I want 100% of a genome (yes, I'm greedy like that). However, I begrudgingly accept that generating a complete and accurate genome assembly (not to mention a complete set of gene annotations) currently falls into the nice-idea-kid-but-we-can't-all-be-dreamers category.

The danger in not getting to 100% completion is that there is a perception, among scientists as well as the general public, that these genomes are indeed all finished. This disconnect between the actual and the perceived state of completion can lead to reactions of the wait-a-minute-I-thought-this-was-meant-to-be-finished!?! variety. Indeed, it can be highly confusing when people go to download the genome of their species of interest, under the impression that the genome was 'finished' many years ago, only to find that they can't find what they're looking for.

Someone might be looking for their favorite gene annotation, but maybe this 'finished' genome hasn't actually been annotated. Or maybe it's been annotated by four different gene finders and left in a state where the user has to decide which ones to trust. Maybe the researcher is interested in chromosome evolution and is surprised to find that the genome doesn't consist of chromosome sequences, just scaffolds. Maybe they find that there are two completely different versions of the same genome, that were assembled by different groups. Or maybe they find that the download link provided by the paper no longer works and they can't even find the genome in question.

The great biologist Sydney Brenner has often spoken of the need to achieve CAP criteria in efforts such as genome sequencing. What are these criteria?

  • C - Complete: if you're going to do it, do a thorough job so that someone doesn't have to come along later to redo it.
  • A - Accurate: this is kind of obvious, but there are so many published genomes out there that are far from accurate.
  • P - Permanent: do it once, and forever.

The last point is probably not something that is thought about as much as the first two criteria. It relates to where these genomes end up being stored and the file formats that people use. But it also applies to other subtle issues. For example, let's assume that research group 'X' has sequenced a genome to an impressive depth but that they made a terrible assembly. As long as their raw reads remain available, someone else can (in theory) attempt a better assembly, or attempt to remake the exact same assembly (science should be reproducible, right?).

However, reproducibility is not always easy in bioinformatics. Even if all of the methodologies are carefully documented, the software involved may no longer be available, or it may only run on an architecture that no longer exists. If you are attempting to make a better genome assembly, you could face issues if some critical piece of information was missing from the SRA Experiment metadata. A potentially more problematic situation would be if the metadata was incorrect in some way (e.g. a wrong insert size was listed).

In subsequent posts, I'll explore how different genomes hold up to these criteria. I will also suggest my own 'five levels of genome completeness' criteria (for genome sequences and annotations).

101 questions with a bioinformatician #3: Deanna Church

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


After 15 years working at the NCBI as a staff scientist, Deanna Church packed her bags and headed over to the West Coast (which some of us think of as the best coast) to join Personalis, a company that is 'pioneering genome guided medicine'. In her new role as Senior Director of Genomics and Content, Deanna is helping to improve their bioinformatics pipelines, which will help lead to improved analysis of human genome data. This work will also involve supporting the move to GRCh38.

If you don't know what GRCh38 is, then you've either been living under a rock or you probably have never worked with vertebrate genomes. The 'GRC' part of GRCh38 refers to the Genome Reference Consortium, an organization that Deanna was heavily involved with during her time at the NCBI. The GRC are the official 'gatekeepers of genomic light and truth' (a title which I may or may not have just invented)...the key point is that they ensure that the 'reference sequence' for the genomes of human and other species remains a trusted reference. They coordinate the incorporation of changes to the reference sequence, changes that need to be made based on the latest sequencing and genome variation data.

I think that Deanna's work in genomics can best be summarized using her very own words taken from her About.me page:

Deanna Church: making the genome a friendlier place

To find out more about Deanna, follow her on twitter (@DeannaChurch). And now, on to the 101 questions...

 

 

001. What's something that you enjoy about current bioinformatics research?

In general, I really enjoy bioinformatics for the problem solving aspects. Most of the time, even the (seemingly) smallest problem will throw you unanticipated challenges. The thing I like most about the work I’m currently doing is that I feel like I’m part of a team that is really working on processes that will have a direct impact on people’s medical care. 

 

010. What's something that you *don't* enjoy about current  bioinformatics research?

This could change on a day to day basis, but my current woe is managing sequence identifiers. This is a serious problem: while I understand the convenience of reporting results as either ‘chr1’ or ‘1’, these are not robust sequence identifiers. We should be managing and exchanging data using a more robust nomenclature (e.g. by using things like accession.version), as these provide a robust and traceable history of a sequence. The current standards make it too easy to make simple mistakes, and I fear we may see a lot of this as folks transition from GRCh37 to GRCh38.
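To illustrate the point about identifiers, here is a minimal sketch (an editorial illustration, not something from Deanna or from any NCBI tool) of why a bare name like 'chr1' is ambiguous while an accession.version is not. The two RefSeq accessions shown are the chromosome 1 records for GRCh37 and GRCh38; the lookup table and the function name `to_accession` are hypothetical and cover only this one chromosome.

```python
# Tiny illustrative subset: RefSeq accession.version for human chromosome 1.
# The same bare name ('1' or 'chr1') points at different sequences depending
# on the assembly, whereas the accession.version pins down exactly one.
ACCESSIONS = {
    ("GRCh37", "1"): "NC_000001.10",
    ("GRCh38", "1"): "NC_000001.11",
}


def to_accession(assembly, name):
    """Translate a bare chromosome name into an accession.version.

    Hypothetical helper: accepts both 'chr1' and '1' style names.
    """
    key = name[3:] if name.lower().startswith("chr") else name
    try:
        return ACCESSIONS[(assembly, key)]
    except KeyError:
        raise KeyError(f"no accession known for {name!r} in {assembly}") from None


# 'chr1' alone cannot tell these two sequences apart; the accession can.
print(to_accession("GRCh37", "chr1"))
print(to_accession("GRCh38", "1"))
```

In a real pipeline the mapping would come from the assembly's own report files rather than a hand-typed dictionary.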

 

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

Can I give two? Keep taking the liberal arts classes as an undergrad, but work more computer programming and math into your schedule!

 

100. What's your all-time favorite piece of bioinformatics software, and why?

This is a little self-serving, but I really like the GeT-RM browser. I managed the development of this tool while I was at NCBI. It is not my favorite necessarily because of the usage or impact this has had in the community, but rather for what I learned while we were doing this project. I learned a huge amount about gathering user requirements, writing specifications, agile development and testing. Plus, we’ve gotten good feedback from users so that is always a plus.

 

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

I think I might have to say ‘.’ for a couple of reasons. First, I’ve spent a huge amount of my career trying to fill the actual gaps in assemblies — especially the human and mouse assemblies. Second, on many of my projects I’ve been a metaphorical gap filler: project manager in some cases, backend developer in others, and even a couple of turns at web UI development. I’m not quite comfortable calling myself a jack of all trades, but I try not to be too afraid of taking on new roles. It is good to continually test yourself...and to fail every now and again.

MinIONs...do my bidding!

Oh the fun you can have with new sequencing technologies...

Which genome assembler gives you the best genome assembly?

This is a question that I have been asked many times. I think that the opposite question should also be asked — which genome assembler gives you the worst genome assembly? —  but people seem less interested in asking this. By the end of this blog post, I will try to answer both questions.

Earlier this week, I wrote about the fact that I get to look at lots of genome assemblies that people have asked me to run CEGMA on. The reason I run CEGMA is to calculate what percentage of a set of 248 Core Eukaryotic Genes (CEGs) are present in each genome (or transcriptome) assembly. This figure can be used as a proxy for the much larger set of 'all genes'.

I can also calculate the N50 length of each assembly, and if you plot both statistics for each assembly you capture a tiny slice of that nebulous characteristic known as 'assembly quality'. Here's what such a plot looks like for 32 different genome assemblies (from various species, assembled by various genome assemblers):

Figure 1. CEGMA results vs N50 length for 32 genome assemblies.
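For anyone who hasn't calculated N50 before, here is a minimal sketch of the usual definition: sort the sequences from longest to shortest and report the length at which the running total first reaches half of the assembly size. This is not the script used to make these plots, and the scaffold lengths below are made up purely for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running total to avoid fractional halves
        if running * 2 >= total:
            return length


# Toy scaffold lengths (hypothetical, total = 300 bp)
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))  # 80 + 70 = 150 reaches half of 300, so N50 is 70
```

Note that N50 says nothing about whether the sequences are correct, only how contiguous they are.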

Three key observations can be made from this figure:

  1. There's a lot of variation in the percentage of CEGs present (52.4–98.8%).
  2. There's a lot of variation in N50 length (997 bp all the way to 3.9 million bp).
  3. There is almost no correlation between the two measures (coefficient = 0.04).
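If you want to run the same kind of check on your own assemblies, something like the following would do. It is only a sketch: it assumes a plain Pearson correlation coefficient (which may not be the exact measure reported above), and the CEG counts and N50 values are invented for illustration, not taken from the 32 assemblies in the figure.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical assemblies: number of CEGs found (out of 248) and N50 in bp
ceg_counts = [245, 130, 210, 228, 190]
n50_lengths = [45_000, 12_000, 3_900_000, 997, 250_000]

# Convert CEG counts to the percentage plotted on the y-axis
ceg_percent = [100 * count / 248 for count in ceg_counts]

print(round(pearson_r(n50_lengths, ceg_percent), 2))
```

Because N50 values span several orders of magnitude, you may also want to try the calculation on log-transformed lengths before reading too much into the result.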

Let's dwell on that last point by highlighting a couple of the extreme data points:

Figure 2: CEGMA results vs N50 length. Red data points highlight assemblies with the highest values of %248 CEGs or N50 length.

The highlighted point at the top of the graph represents the assembly with the best CEGMA result (245 out of 248 CEGs present). However, this assembly ranks 13th for N50 length. The data point on the far right of the graph represents a genome assembly with the highest N50 length (almost 4 million bp). But this assembly only ranks 27th for its CEGMA results. Such inconsistency was exactly what we saw in the Assemblathon 2 paper (but with even more metrics involved).

Can we shed any light as to which particular genome assemblers excel in either of these statistics? Well, as I now ask people who submit CEGMA jobs to me 'what was the principal assembly tool used?', I can overlay this information on the plot:

Figure 3: CEGMA results vs N50 length. Colors refer to different assemblers used as the principal software tool in the assembly process.

It might not be clear but there are more data points for the Velvet assembler than any other (12/32). You can see that Velvet assemblies produce a relatively wide range in CEGMA results. Assemblies made by SOAPdenovo produce an even wider range of CEGMA results (not to mention a wide range of N50 results). The truth is that there is no consistent pattern of quality in the results of any one assembler (and remember we are only measuring 'quality' by just two paltry metrics).

To answer the questions raised at the start of this post:

  1. All assemblers can be used to make terrible genome assemblies
  2. Some assemblers can (probably) be used to make great genome assemblies

There is no magic bullet in genome assembly and there are so many parameters that can affect the quality of your final assembly (repeat content of genome, sequencing technology biases, amount of heterozygosity in genome, quality of input DNA, quality of sample preparation steps, suitable mix of libraries with different insert sizes, use of most suitable assembler options for your genome of interest, amount of coffee drunk by person running the assembler, etc. etc.).

Don't even ask the question 'which genome assembler gives you the best genome assembly?' if you are not prepared to define what you mean by 'best' (and please don't define it just on the basis of two simple metrics like %248 CEGs present and N50 length).