Why the UCSC Genome Browser FTP site is one of my least favorite places to visit

If you visit the Golden Path directory of the UCSC Genome Browser FTP site (ftp://hgdownload.cse.ucsc.edu//apache/htdocs/goldenPath), you will come across the following quirks:

  1. Multiple genomes for the same species are not grouped together under a parent directory for each species, so the number of items in this directory (~250) gives no indication of the number of species represented (~125).
  2. Species identifiers are ambiguous. You have to know that 'mm9' refers to Mus musculus and not Macaca mulatta.
  3. Species identifiers are also inconsistent. Some species get just two lower-case characters (e.g. 'mm' = Mus musculus, 'dm' = Drosophila melanogaster) whereas most get six characters (e.g. 'felCat' = Felis catus, 'sacCer' = Saccharomyces cerevisiae).
  4. Humans, hallowed species that we are, simply get 'hg' (presumably for 'human genome').
  5. The six-character format reverses centuries (!) of naming convention by making the genus part of the name start with a lower-case character and the specific part of the name start with an upper-case character.
  6. Some species also have date-versioned directories in addition to numerically suffixed directories. So do you want to download the 'hg7' version of the human genome or instead get 'hg7oct2000_oo21' (don't ask me what the '_oo21' part means)?

If you want a challenge, try writing some bioinformatics software that goes from the Latin name for a species to the correct directory on their FTP site! I guess the UCSC team are going to hope that six characters is enough to uniquely identify any future species that end up here. So I hope they don't start sequencing too many more Drosophila species.
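
To give a flavor of that challenge, here is a minimal Python sketch that guesses a directory prefix from a Latin name. It only encodes the conventions described in the list above. The function names and the small exceptions table are my own illustrative choices and not anything UCSC provides, and the sketch makes no attempt to work out the version suffix (such as the '9' in 'mm9').

```python
# Hand-maintained exceptions: the short prefixes mentioned above that do not
# follow the six-character pattern. This table is illustrative, not complete.
SHORT_PREFIXES = {
    "Homo sapiens": "hg",
    "Mus musculus": "mm",
    "Drosophila melanogaster": "dm",
}


def six_char_prefix(latin_name: str) -> str:
    """Build a prefix such as 'felCat' from a binomial such as 'Felis catus'."""
    genus, species = latin_name.split()[:2]
    # Lower-case genus part, capitalized species part (the reversed convention).
    return genus[:3].lower() + species[:3].capitalize()


def two_letter_prefix(latin_name: str) -> str:
    """Build a two-letter prefix from the initials of genus and species."""
    genus, species = latin_name.split()[:2]
    return (genus[0] + species[0]).lower()


def guess_prefix(latin_name: str) -> str:
    """Use a known short prefix if one is listed, else the six-character form."""
    return SHORT_PREFIXES.get(latin_name, six_char_prefix(latin_name))


if __name__ == "__main__":
    for name in ("Felis catus", "Saccharomyces cerevisiae", "Mus musculus"):
        print(name, "->", guess_prefix(name))
    # The two-letter scheme cannot tell these two species apart:
    print(two_letter_prefix("Mus musculus") == two_letter_prefix("Macaca mulatta"))
```

Even this toy version needs a lookup table of special cases, and it would still have to list the FTP directory to discover which numbered or dated versions exist for each prefix.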

Compare this madness — and it is madness — to the calming orderliness of the Ensembl Genomes FTP site (e.g. ftp://ftp.ensemblgenomes.org//pub/release-23/metazoa/fasta):

A view from UCSC Genome Browser FTP site…

…compared to a view from the Ensembl Genomes FTP site

I think the key point from this story is that a lot of bioinformatics research can be hard enough without the added complexities of working with unstructured data. When you start building any new resource in bioinformatics, be it an FTP site, web site, or GitHub repository, you should plan for the future! I.e. expect things to expand, grow, and greatly increase in complexity.

Even if you intend for a resource to only ever contain information for a single species, assume that it will end up containing hundreds of species. You should also assume that people may wish to automate the querying of your data. If you plan for these things from the moment you start building your resource, you might make some bioinformaticians happy, and you certainly don't want to make us angry…you wouldn't like us when we're angry.

How does the popularity of the UC Davis Genome Center vary with geographic location?

If I perform a Google search for the two words genome center, I see that the UC Davis Genome Center (henceforth UCDGC) is the top hit. But this is to be expected because Google has been personalizing search results for some time now, so this result is obviously tailored to me (if you didn't know, I work at the UCDGC).

If you are signed in to Google when you perform a search, the results will be heavily influenced by your search history and by what Google knows about you and your interests. Even if you sign out of Google, the search engine giant can track some information via cookies. Even if you disable cookies or use a private browsing mode, Google is still altering your search results because it knows your location (even if only approximately).

This explains why I will almost always see UCDGC as the top result when I search for 'genome center'. To get around this, I could use a search engine that doesn't track my activity, or I could use a private browsing mode in combination with a little-known feature of Google, that of changing your search location. It's possible to perform a search as if I was located in any major city or state in America.

So this allows me to see how often the UCDGC appears in the #1 position as I move around the country. I first performed a search for 'genome center' as if I was located in each state (e.g. set location to be 'Alabama', 'Alaska', 'Arkansas' etc.):

Ranking of UC Davis Genome Center among Google search results when searching for 'genome center' in each state

When you search for 'genome center', the UCDGC is the top search result in every state! One caveat to this approach is that it may not be all that meaningful to set your location to be an entire state. So I repeated the approach but this time I set my location to be the most populous city in each state:

Ranking of UC Davis Genome Center among Google search results when searching for 'genome center' in the most populous city of each state (as indicated by position of marker within each state). 

This shows that UCDGC is the #1 search result for cities in 36/50 states. The places where UCDGC is not #1 are all cities that have a notable genome center of their own (or are located close to one). A few notes relating to this:

  1. The New York Genome Center dominates results not only in New York City (NY), but also in Newark (NJ), Bridgeport (CT), and Philadelphia (PA).
  2. The #1 result in Baltimore (MD) is for the Institute for Genome Sciences at the University of Maryland.
  3. St. Louis (MO) sees The Genome Institute at Washington University take the top spot.
  4. In the northwest, a search from Seattle gives the Seattle Structural Genomics Center for Infectious Disease as the most popular result. But if you head to Spokane (Washington's second city), then the UCDGC becomes the #1 result again.
  5. In Texas, the Department of Genomic Medicine at the Houston Methodist Research Institute pushes UCDGC to 4th place. However, move to San Antonio or Dallas and the UCDGC regains first place.
  6. Chicago (IL) has the Institute for Genomics and Systems Biology at #1.
  7. In Minneapolis (MN) it is the University of Minnesota Genomics Center that is the top dog.
  8. The home of the King (Memphis, TN) is also home to the W. Harry Feinstone Center for Genomic Research which takes the #1 position. Once again, if you move to this state's second city (Nashville), the UCDGC regains the top spot in the search results.
  9. Las Vegas, NV is home to the University of Nevada Las Vegas Genomics Core Facility. Moving to Nevada's second city (Henderson) puts UCDGC back on top.
  10. In Salt Lake City (UT) you can find the Utah Genome Depot at the University of Utah dominating the rankings.
  11. Finally, in Atlanta (GA), it is the Emory University Integrated Genomics Core which denies the UCDGC the #1 position.
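
As a quick sanity check, the 36/50 figure can be recovered from the notes above by counting the states whose most populous city did not put UCDGC at #1. This is just a tally of the list, written as a short Python sketch. The state abbreviations for Washington and Texas are filled in by me, and the variable names are arbitrary.

```python
TOTAL_STATES = 50

# One entry per state mentioned in the numbered notes above, where the most
# populous city returned something other than UCDGC as the top result.
states_without_ucdgc_on_top = [
    "NY", "NJ", "CT", "PA",  # New York Genome Center
    "MD", "MO", "WA", "TX", "IL",
    "MN", "TN", "NV", "UT", "GA",
]

# Use a set so that an accidental duplicate would not skew the count.
states_with_ucdgc_on_top = TOTAL_STATES - len(set(states_without_ucdgc_on_top))

print(f"UCDGC is #1 in {states_with_ucdgc_on_top}/{TOTAL_STATES} states")
```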

The UC Davis Genome Center is not only the top hit when you search for 'genome center' in various locations in the USA. If you use the Google location option to go truly global, you will see that we rank as the top search result for 'genome center' in London, Paris, Berlin, Moscow, Delhi, Seoul, Cairo, Buenos Aires, Bogota, Rio de Janeiro, Cape Town, Kuala Lumpur, and Sydney!

While this could all be the result of UC Davis spending millions of dollars to adopt search engine optimization strategies to unduly influence our position in the search results, I prefer to believe that it reflects our reputation for world-class genomics research and training.

Real bioinformaticians and old bioinformaticians

A passing mention of the phrase 'real bioinformaticians' by Michael Hoffman (@michaelhoffman) yesterday prompted me to elevate the concept to be worthy of its own hashtag. This is what happened next:

You will notice that Sara G's response (@sargoshoe) humorously introduced the concept of #oldbioinformaticians, and this in turn spawned an even longer set of tweets (see below). I think that many of the more (how shall we put this) wise and distinguished members of the bioinformatics community enjoyed the chance for a trip down memory lane.

Musical encores in bioinformatics and other sciences

I've previously flagged a few examples of independently developed bioinformatics software tools that share the same name. My recent post about the JABBA-award-winning software called MUSIC prompted some people to let me know that this is another name that has been used repeatedly by different groups.

So thanks to Nicolas Robine and commenter LMikeF, we can see that MUSIC is a very popular name for bioinformatics tools:

  1. MuSiC: a tool for multiple sequence alignment with constraints (2004)
  2. RE-MuSiC: a tool for multiple sequence alignment with regular expression constraints (2007)
  3. MuSiC: identifying mutational significance in cancer genomes (2012)
  4. MUSIC: Identification of Enriched Regions in ChIP-Seq Experiments using a Mappability-Corrected Multiscale Signal Processing Framework (2014)
  5. MUSiCC: Towards an accurate estimation of average genomic copy-numbers in the human microbiome (2014)

The first two publications sadly suffer from link rot and the provided URLs no longer work. These two publications are also by the same group, which raises the question: what would they call a third iteration of their software (RE-RE-MuSiC?).

A little bit of additional searching reveals that MUSIC is a popular name in other scientific endeavors as well:

  1. MUSIC: MUltiScale Initial Conditions, software to generate initial conditions for cosmological simulations
  2. MUSIC: MUltiScale SImulation Code, fluid dynamics software (warning: this website will make you nauseous!)
  3. MUSIC: MUerte Subita en Insuficiencia Cardiaca, a longitudinal study to assess risk predictors of death in patients with heart failure
  4. MUSIC: MUtation-based SQL Injection vulnerabilities Checking tool, a tool to help check for vulnerabilities in web-based applications

I guess people like the name MUSIC and will go to almost any lengths to make an acronym/initialism for it.