Is yours bigger than mine? Big data revisited

Google Scholar lists 2,090 publications that contain the phrase 'big data' in their title. And that's just from the first 9 months of 2014! The titles of these articles reflect the interest/concern/fear surrounding this increasingly popular topic.

One paper, Managing Big Data for Scientific Visualization, starts out by identifying a common challenge of working with 'big data':

Many areas of endeavor have problems with big data…while engineering and scientific visualization have also faced the problem for some time, solutions are less well developed, and common techniques are less well understood

They then go on to discuss some of the problems of storing 'big data', one of which is listed as:

Data too big for local disk — clearly, not only do some of these data objects not fit in main memory, but they do not even fit on local disk on most workstations. In fact, the largest CFD study of which we are aware is 650 gigabytes, which would not fit on centralized storage at most installations!

Wait, what!?! 650 GB is too large for storage? Oh yes, that's right. I forgot to mention that this paper is from 1997. My point is that 'big data' has been a problem for some time now and will no doubt continue to be a problem.

I understand that having a simple, user-friendly label like 'big data' helps with the discussion, but it remains such an ambiguous and highly relative term. It's relative because whether you deem something to be 'big data' or not might depend heavily on the size of your storage media and/or the speed of your networking infrastructure. It's also relative in terms of your field of study; a typical set of 'big data' in astrophysics might be much bigger than a typical set of 'big data' in genomics.

Maybe it would help to use big data™ when talking about any data that you like to think of as big, and then use BIG data for those situations where your future data acquisition plans cause your sys admin to have sleepless nights.

Why I think buzzword phrases like 'big data' are a huge problem

There are many phrases in bioinformatics that I find annoying because they can be highly subjective:

  • Short-read mapping
  • Long-read technology
  • High-throughput sequencing

One person's definition of 'short', 'long', or 'high' may differ greatly from someone else's. Furthermore, our understanding of these phrases changes with the onward march of technological innovation. Back in 2000, 'short' meant 16–20 bp, whereas in 2014 'short' can mean 100–200 bp.

The new kid on the block, which is not specific to bioinformatics, is 'big data'. Over the last week, I've been helping with an NIH grant application entitled Courses for Skills Development in Biomedical Big Data Science. This grant mentions the phrase thirty-nine times, so it must be important. Why do I dislike the phrase so much? Here is why:

  1. Even within a field like bioinformatics, it's a subjective term and may not mean the same thing to everyone.
  2. Just as the phrases 'next-generation' and 'second-generation' sequencing inspired a set of clumsy and ill-defined successors (e.g. '2.5th generation', 'next-next-next generation' etc.), I expect that 'big data' might lead to similar language atrocities being committed. Will people start talking about 'really big data' or 'extremely large data' to distinguish themselves from one another?
  3. This term might be subjective within bioinformatics, but it is probably much more subjective when used across different scientific disciplines. In astronomy, there are space telescopes producing petabytes of data. In the field of particle physics, the Data Center at the Wigner Research Centre for Physics processes one petabyte of data per day. If you work for the NSA, then you may well have exabytes of data lying around.

I joked about the issue of 'big data' on Twitter:

My Genome Center colleague Jo Fass had a great comment in response to this:

This is an excellent point. When people talk about the challenges of working with 'big data', a lot depends on how well their infrastructure is equipped to deal with such data. If your data is readily accessible and securely backed up, then you may only be working with 'data' and not 'big data'.

In another post, I will suggest that the issue for much of bioinformatics is not 'big data' per se but 'obese data', or even 'grotesquely obese data'. I will also suggest a sophisticated computational tool that I call Operational Heuristics for Management of Your Grotesquely Obese Data (OHMYGOD), but which you might know as rm -f.