How to cite bioinformatics resources
I saw a post on Biostars today that asked how specific versions of genome assemblies should be cited. The question also applies to the more general issue of citing any bioinformatics resource that has multiple releases/versions, not all of which are formally published in papers. Here is how I replied:
Citing a specific version of a bioinformatics/genomics resource can get tricky because there is often no formal publication for every release of a given dataset. Further complicating the situation, you will often come across different dates (and even names) for the same resource. E.g. the latest cow genome assembly generated by the University of Maryland is known as 'UMD 3.1.1'. However, the UCSC Genome Browser uses its own internal IDs for all cow genome assemblies and refers to this one as 'bosTau8'. Someone new to the field might see the UCSC name and not know about the original UMD name.
You can sometimes use the dates of files on FTP sites to approximately date sequence files, but these dates can change (e.g. files occasionally get removed by accident and restored from backups, which can alter their timestamps).
The key thing to aim for is to provide suitable information so that someone can reproduce your work. In my mind, this requires 2–3 pieces of information:
- The name or release number of the dataset you are downloading (provide alternate names when known)
- The specific URL for the website or FTP site that you used to download the data
- The date on which you downloaded the data
E.g. The UMD 3.1.1 version of the cow genome assembly (also known as bosTau8) was downloaded from the UCSC Genome FTP site (ftp://hgdownload.cse.ucsc.edu/bosTau8/bigZips/bosTau8.fa.gz).
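One way to make sure those three pieces of information always travel together is to record them at download time and generate the citation sentence from the record. The following is only a minimal sketch: the `DatasetCitation` class and its field names are hypothetical, and the download date shown is a placeholder rather than the date on which the file was actually fetched.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DatasetCitation:
    """Hypothetical record of the details needed to cite a downloaded dataset."""
    name: str            # name or release number of the dataset
    url: str             # exact URL the data was downloaded from
    downloaded: date     # date on which the download happened
    aliases: list = field(default_factory=list)  # alternate names, when known

    def sentence(self) -> str:
        # Only mention alternate names if some were recorded.
        alias = f" (also known as {', '.join(self.aliases)})" if self.aliases else ""
        return (f"{self.name}{alias} was downloaded from {self.url} "
                f"on {self.downloaded.isoformat()}.")


citation = DatasetCitation(
    name="UMD 3.1.1",
    aliases=["bosTau8"],
    url="ftp://hgdownload.cse.ucsc.edu/bosTau8/bigZips/bosTau8.fa.gz",
    downloaded=date(2015, 1, 1),  # placeholder date, for illustration only
)
print(citation.sentence())
```

In practice you would pass `date.today()` at the moment you fetch the file, so that the date in the citation is the real download date and not the timestamp shown on the FTP site.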
When no version number is available, I always refer to the date on which I downloaded the resource instead. (It is very unhelpful when sequence resources are released without version numbers: they can, and will, change.)