Why the UCSC Genome Browser FTP site is one of my least favorite places to visit
If you visit the Golden Path directory of the UCSC Genome Browser FTP site (ftp://hgdownload.cse.ucsc.edu//apache/htdocs/goldenPath), you will come across the following quirks:
- Multiple genomes for the same species are not grouped together under a parent directory for each species, so the number of items in this directory (~250) gives no indication of the number of species represented (~125).
- Species identifiers are ambiguous. You have to know that 'mm9' refers to Mus musculus and not Macaca mulatta.
- Species identifiers are also inconsistent. Some species get just two lower-case characters (e.g. 'mm' = Mus musculus, 'dm' = Drosophila melanogaster) whereas most get six characters (e.g. 'felCat' = Felis catus, 'sacCer' = Saccharomyces cerevisiae).
- Humans, hallowed species that we are, simply get 'hg' (presumably for 'human genome').
- The six-character format reverses centuries (!) of naming convention by making the genus part of the name start with a lower-case character and the specific part of the name start with an upper-case character.
- Some species also have date-versioned directories in addition to numerically suffixed directories. So do you want to download the 'hg7' version of the human genome or the 'hg7oct2000_oo21' version instead (don't ask me what the 'oo21' part means)?
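To make the grouping problem concrete, here is a minimal Python sketch that tries to recover a per-species grouping from a flat listing by splitting each directory name into its leading letters and the version that follows. The regular expression is my own guess at the naming pattern, not anything UCSC documents, and the version numbers after 'felCat' and 'sacCer' are placeholders for illustration.

```python
import re
from collections import defaultdict

# Assumed pattern: letters (species prefix), digits (version), optional tail.
ASSEMBLY_RE = re.compile(r"^([A-Za-z]+)(\d+)(.*)$")


def group_by_prefix(directory_names):
    """Group assembly directory names by their leading species prefix."""
    groups = defaultdict(list)
    for name in directory_names:
        match = ASSEMBLY_RE.match(name)
        if match is None:
            continue  # not an assembly-style name, so skip it
        groups[match.group(1)].append(name)
    return dict(groups)


# Illustrative input; the numbers after 'felCat' and 'sacCer' are placeholders.
listing = ["hg7", "hg7oct2000_oo21", "mm9", "felCat3", "sacCer2"]
for prefix, names in sorted(group_by_prefix(listing).items()):
    print(prefix, names)
```

Even when this works, it only tells you which directories share a prefix. It still cannot tell you that 'mm' means mouse rather than macaque.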
If you want a challenge, try writing some bioinformatics software that goes from the Latin name for a species to the correct directory on the UCSC FTP site! I guess the UCSC team are hoping that six characters will be enough to uniquely identify any future species that ends up here. So I hope they don't start sequencing too many more Drosophila species.
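As a rough illustration of why six characters might not be enough, here is a small Python sketch that builds a prefix from a Latin name using the apparent rule (first three letters of the genus in lower case, then the first three letters of the species with a leading capital) and reports any names that collide. The rule is inferred from 'felCat' and 'sacCer', so treat it as an assumption; it deliberately ignores the two-letter exceptions such as 'hg', 'mm', and 'dm'. The function names and the species list are mine, chosen only for illustration.

```python
from collections import defaultdict


def six_char_prefix(latin_name):
    """Build a prefix like 'felCat' from a 'Genus species' name.

    Assumed rule, inferred from 'felCat' and 'sacCer'; raises ValueError
    if the name has fewer than two words.
    """
    genus, species = latin_name.split()[:2]
    return genus[:3].lower() + species[:3].capitalize()


def find_collisions(latin_names):
    """Return only those prefixes that are shared by more than one name."""
    by_prefix = defaultdict(list)
    for name in latin_names:
        by_prefix[six_char_prefix(name)].append(name)
    return {p: names for p, names in by_prefix.items() if len(names) > 1}


# Illustrative input only.
species = [
    "Felis catus",
    "Saccharomyces cerevisiae",
    "Drosophila melanogaster",
    "Drosophila melanica",
]
for prefix, names in find_collisions(species).items():
    print(prefix, "is shared by", names)
```

Going the other way, from a Latin name to the right directory, would additionally need a hand-maintained table of exceptions, which is exactly the kind of thing a consistent naming scheme makes unnecessary.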
Compare this madness (and it is madness) to the calming orderliness of the Ensembl Genomes FTP site (e.g. ftp://ftp.ensemblgenomes.org//pub/release-23/metazoa/fasta).
I think the key point from this story is that a lot of bioinformatics research can be hard enough without the added complexities of working with unstructured data. When you start building any new resource in bioinformatics, be it an FTP site, web site, or GitHub repository, you should plan for the future! I.e. expect things to expand, grow, and greatly increase in complexity.
Even if you intend for a resource to only ever contain information for a single species, assume that it will end up containing hundreds of species. You should also assume that people may wish to automate the querying of your data. If you plan for these things from the moment you start building your resource, you might make some bioinformaticians happy, and you certainly don't want to make us angry…you wouldn't like us when we're angry.