In Texas, there’s a deeply held belief that if it’s bigger, it’s better. Just look at the 159’ by 71’ big-screen TV in the Cowboys’ new football stadium as a prime example of the prevalent “go big, or go home” mentality. But it’s not just Texas that’s enamored with this “bigger is better” type of thinking. Many IT professionals focusing on the new “big data” craze follow the mantra that if a lot of data is good, even more must be better.

Alignment around the exact definition of “big data” is hard to come by, especially since much of the discussion is being driven by enabling vendors. That said, big data was concisely defined in a recent New York Times article as a “shorthand label that typically means applying the tools of artificial intelligence, like machine learning, to vast new troves of data beyond that captured in standard databases.” Most big data definitions go on to reference the three Vs: volume, velocity and variety. Yet two additional Vs are often overlooked: value and veracity, both of which are critical in an information governance and legal context. To harmonize the five Vs of big data, it’s important to examine each definition in sequence.

The Five Vs of “Big Data”

  1. Volume: Volume, not surprisingly, is the hallmark of the big data concept. Since data creation doubles every 18 months, we’ve rapidly moved from a gigabyte world to a universe where terabytes and exabytes rule the day.  In fact, according to a 2011 report from the McKinsey Global Institute, numerous U.S. companies now have more data stored than the U.S. Library of Congress, which has more than 285 terabytes of data (as of early this year). And to complicate matters, this trend is escalating exponentially with no reasonable expectation of abating. 
  2. Velocity: According to the analyst firm Gartner, velocity can be thought of in terms of “streams of data, structured record creation, and availability for access and delivery.” In practical terms, this means organizations constantly have to address a torrential flow of data into and out of their information management systems. Take Twitter, for example, where it’s possible to see more than 400 million tweets per day. As with the first V, data velocity isn’t slowing down anytime soon either.
  3. Variety: Perhaps more vexing than either the volume or velocity issues, the variety element of big data increases complexity exponentially, as organizations must account for data sources and types that are moving in different directions. To name just a few variants, most organizations must routinely wrestle with structured data (databases), unstructured data (loose files/documents), email, video, static images, audio files, transactional data, social media, cloud content and more.
  4. Value: A more novel big data concept, value hasn’t typically been part of the standard definition. Here, the critical inquiry is whether the retained information is valuable, either individually or in combination with other data elements that together can reveal patterns and insights. Given the rampant existence of spam, nonbusiness data (like fantasy football emails) and duplicative content, it’s easy to see that just because data may have the first three Vs, it isn’t inherently valuable from a big data perspective.
  5. Veracity: Particularly in an information governance era, it’s vital that the big data elements have the requisite level of veracity (or integrity). In other words, specific controls must be put in place to ensure that the integrity of the data is not impugned. Otherwise, any subsequent usage (particularly for a legal or regulatory proceeding, like e-discovery) may be unnecessarily compromised.
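
To make the volume claim above concrete, here is a minimal sketch of what “doubling every 18 months” implies over time. The function name and the 100-terabyte starting point are hypothetical and used for illustration only; the sole figure taken from the article is the 18-month doubling period.

```python
def projected_volume(current_tb: float, months: float,
                     doubling_months: float = 18.0) -> float:
    """Project data volume, assuming it doubles every `doubling_months` months.

    The 18-month default is the doubling period quoted in the article;
    everything else here is an illustrative assumption.
    """
    return current_tb * 2 ** (months / doubling_months)


if __name__ == "__main__":
    starting_tb = 100.0  # hypothetical starting volume, not from the article
    for years in (3, 6, 9):
        volume = projected_volume(starting_tb, years * 12)
        print(f"After {years} years: {volume:,.0f} TB")
```

Under these assumptions, volume quadruples every three years, which is why storage decisions made today compound quickly.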

When the five Vs are considered in concert and cutting-edge analytical software is applied, the promise of “big data” starts to be revealed. In healthcare, for example, researchers are employing big data analytics to analyze factors in multiple sclerosis in search of personalized treatments. Similarly, healthcare professionals are mining large genomic databases to find the best ways to treat cancer. Many of these previously unheard-of insights are coming from novel data sources (new varieties) like web-browsing data trails, social network communications, sensor data and surveillance content.

And yet, given the relatively narrow range of existing big data use cases (retail trending, advertising insights, healthcare data-mining, etc.), most organizations should still carefully assess the value of information before blindly provisioning another terabyte of storage simply on the premise that big data insights might be possible. While there are clearly nuggets to be mined in this new big data era, these analytical insights don’t come without potential costs and risks.

Many organizations, sadly, aren’t cognizant of the tension between rapidly accelerating big data initiatives and competing corporate concerns such as information governance. Latent information risk is a byproduct of keeping too much data: the resulting exposure includes e-discovery costs and sanctions, potential security breaches and regulatory investigations. To illustrate this potential liability, it costs only $0.20 a day to manage 1GB of storage. Yet, according to a recent RAND survey, it costs $18,000 to review that same gigabyte for e-discovery purposes.
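
The gap between those two figures is easier to appreciate with a little arithmetic. The sketch below uses only the two numbers quoted above ($0.20 per gigabyte per day to manage, $18,000 per gigabyte to review); the function names and the 500GB example are hypothetical.

```python
STORAGE_COST_PER_GB_PER_DAY = 0.20  # USD, figure quoted in the article
REVIEW_COST_PER_GB = 18_000.00      # USD, figure quoted in the article


def breakeven_days(review_cost: float = REVIEW_COST_PER_GB,
                   storage_cost_per_day: float = STORAGE_COST_PER_GB_PER_DAY) -> float:
    """Days of storage spend on one gigabyte that equal one e-discovery review of it."""
    return review_cost / storage_cost_per_day


def review_exposure(gigabytes: float,
                    review_cost_per_gb: float = REVIEW_COST_PER_GB) -> float:
    """Cost of reviewing `gigabytes` of data if all of it is swept into e-discovery."""
    return gigabytes * review_cost_per_gb


if __name__ == "__main__":
    days = breakeven_days()
    print(f"One review costs as much as {days:,.0f} days "
          f"(about {days / 365:,.0f} years) of storage")
    # Hypothetical example: a 500GB archive pulled into a single matter.
    print(f"Reviewing 500GB: ${review_exposure(500):,.0f}")
```

On these numbers, a single review of one gigabyte costs as much as managing it for roughly 90,000 days, so the cheapness of storage says little about the true cost of keeping data.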

To combat these risks and costs, many entities have deployed information archives as a way to attack the data deluge, periodically deleting data when legally permissible. It is this necessary and laudable goal of defensible deletion/expiration that can be at odds with concepts like big data. The challenge for many organizations is the rather straightforward exercise of weighing the potential risk of keeping too much information against the conceptual value of mining information for a given big data project. Even in the absence of big data analytics, this type of risk/reward inquiry is at the core of the information governance dilemma that every organization faces. At least with the potential value big data can generate, organizations have a better chance to reap some value out of the terabytes of data that many have been mindlessly keeping in perpetuity.
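
The risk/reward inquiry described above can be sketched as a simple comparison. This is an illustrative model only, not an established methodology: the `DataSet` fields, the yearly litigation probability and the value estimates are all hypothetical inputs an organization would have to supply itself, and only the two cost constants come from the article.

```python
from dataclasses import dataclass

STORAGE_COST_PER_GB_PER_DAY = 0.20  # USD, figure quoted in the article
REVIEW_COST_PER_GB = 18_000.00      # USD, figure quoted in the article


@dataclass
class DataSet:
    name: str
    gigabytes: float
    expected_annual_value: float    # hypothetical estimate of analytic value, USD
    litigation_probability: float   # hypothetical yearly chance of e-discovery review
    under_legal_hold: bool = False
    retention_required: bool = False  # e.g. an applicable regulation still applies


def expected_annual_cost(ds: DataSet) -> float:
    """Yearly storage spend plus probability-weighted e-discovery review cost."""
    storage = ds.gigabytes * STORAGE_COST_PER_GB_PER_DAY * 365
    review = ds.gigabytes * REVIEW_COST_PER_GB * ds.litigation_probability
    return storage + review


def disposition(ds: DataSet) -> str:
    """Legal holds and retention rules win; otherwise keep only if value exceeds cost."""
    if ds.under_legal_hold or ds.retention_required:
        return "retain"
    if ds.expected_annual_value > expected_annual_cost(ds):
        return "retain"
    return "eligible for defensible deletion"


if __name__ == "__main__":
    examples = [
        DataSet("fantasy football email", 50, 0.0, 0.01),
        DataSet("sensor readings", 10, 100_000.0, 0.01),
        DataSet("old contracts", 5, 0.0, 0.05, under_legal_hold=True),
    ]
    for ds in examples:
        print(f"{ds.name}: {disposition(ds)}")
```

The ordering of the checks is the design choice that matters: legal and regulatory obligations are evaluated first, so the value comparison only ever applies to data the organization is actually free to delete.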

In the end, it is critical to have a laser focus on the fourth V (value) to ensure that data that won’t be mined or analyzed isn’t kept any longer than can be justified by other business needs or applicable regulations. Retaining meaningless data that has no big data potential threatens to turn big data into “bad data” that merely increases information risk.