New BUSCO vs (very old) CEGMA
If I’m only going to write one or two blog posts a year on this blog, then it makes sense to return to my recurring theme of don’t use CEGMA, use BUSCO!
In 2015 I was foolishly optimistic that the development of BUSCO would mean that people would stop using CEGMA — a tool that we started developing in 2005 and which used a set of orthologs published in 2003! — and that we would reach ‘peak-CEGMA’ citations that year.
That didn’t happen. At the end of 2017, I again asked the question have we reached peak-CEGMA? because we had seen ten consecutive years of increasing publications.
Well I’m happy to announce that 2017 did indeed see citations to our 2007 CEGMA paper finally peak:
CEGMA citations by year (from Google Scholar)
Although we have definitely passed peak CEGMA, it still receives over a 100 citations a year and people really should be using tools like BUSCO instead.
This neatly leads me to mention that a recent publication in Molecular Biology and Evolution describes an update to BUSCO:
From the introduction:
With respect to v3, the last BUSCO version, v5, features: 1) a major upgrade of the underlying data sets in sync with OrthoDB v10; 2) an updated workflow for the assessment of prokaryotic and viral genomes using the gene predictor Prodigal (Hyatt et al. 2010); 3) an alternative workflow for the assessment of eukaryotic genomes using the gene predictor MetaEuk (Levy Karin et al. 2020); 4) a workflow to automatically select the most appropriate BUSCO data set, enabling the analysis of sequences of unknown origin; 5) an option to run batch analysis of multiple inputs to facilitate high-throughput assessments of large data sets and metagenomic bins; and 6) a major refactoring of the code, and maintenance of two distribution channels on Bioconda (Grüning et al. 2018) and Docker (Merkel 2014).
Please, please, please…don’t use CEGMA anymore! It is enjoying a well-earned retirement at the Sunnyvale Home for Senior Bioinformatics Tools.

