InformationWeek: Big Data: Avoid ‘Wanna V’ Confusion
By Seth Grimes | InformationWeek
The three V’s — volume, velocity and variety — do a fine job of defining big data. Don’t be misled by the “wanna-V’s”: variability, veracity, validity and value.
When it comes to big data, how many V’s are enough?
Analyst Doug Laney used three — volume, velocity and variety — in defining big data back in 2001. In recent years, revisionists have blown out the count to a too-many seven or eight. “Embrace and extend” is alive and well, it seems, expanding the market space but also creating confusion.
When a concept resonates, as big data has, vendors, pundits and gurus — the revisionists — spin it for their own ends. Big data revisionists would elevate value, veracity, variability/variance, viability and even victory (the last being a notion so obscure that I won’t mention it further) to canonical V status. Each of the various new V’s has its champions. Joining them are the contrarians who have given us the “small data” countertrend.
In my opinion, the wanna-V backers and the contrarians mistake interpretive, derived qualities for essential attributes.
The original 3 V’s do a fine job of capturing essential big data attributes, but they do have shortcomings, specifically related to usefulness. As Forrester analyst Mike Gualtieri puts it, the original 3 V’s are not “actionable.” Gualtieri poses three pragmatic questions. The first relates to big data capture. The others relate to data processing and use: “Can you cleanse, enrich and analyze the data?” and “Can you retrieve, search, integrate and visualize the data?”
As for “small data”: the concept is a misframing of the data challenge. Small data is nothing more or less than a filtered and reduced topical subset of the big data motherlode, again the product of analytics. Fortunately, attention to this bit of big data backlash seems to have ebbed, which lets us get back to the big picture.
3 V’s and Beyond
The big picture is that the original 3 V’s work well. I won’t explain them; instead, I will refer you to “Big Data 3 V’s: Volume, Variety, Velocity,” an infographic posted by Gil Press. You’ll see that the infographic posits viability — essentially, can the data be analyzed in a way that makes it decision-relevant? — as “the missing V.” The concluding line: “Many data scientists believe that perfecting as few as 5% of the relevant variables will get a business 95% of the same benefit. The trick is identifying that viable 5%, and extracting the most value from it.” Hmm… It seems to me that the missing V could equally well have been Value.
Neil Biehn, writing in Wired, sees viability and value as distinct missing V’s. Biehn’s take on viability is similar to Press’s. “We want to carefully select the attributes and factors that are most likely to predict outcomes that matter most to businesses,” Biehn says. I agree, but note that the selection process is purpose-driven and external to the data.
“The secret is uncovering the latent, hidden relationships among these variables,” Biehn continues. Again, I agree, but how do you determine predictive viability, generated by those latent relationships among variables? Professor Gary King of Harvard University read my mind when he stated, at a conference I attended in June, “Big data isn’t about the data. It’s about analytics.” Viability isn’t a big data property. It’s a quality that you determine via big data analytics.
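To make that concrete, here is a minimal sketch — not Biehn’s or Press’s method — of how analytics might score candidate variables and surface a “viable” subset. The scikit-learn call is real; the DataFrame, the column name and the 5% cutoff are assumptions for illustration, and the features are assumed to be numeric.

```python
# A minimal sketch of "determining viability via analytics": score candidate
# variables by how much predictive signal they carry for an outcome, then
# keep the small, viable subset. Names and the 5% cutoff are illustrative.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def viable_subset(df: pd.DataFrame, outcome: str, keep_fraction: float = 0.05):
    """Rank candidate variables by mutual information with the outcome
    and return the top fraction -- one stand-in for 'viability' scoring."""
    X = df.drop(columns=[outcome])        # candidate variables (assumed numeric)
    y = df[outcome]                       # the business outcome of interest
    scores = mutual_info_classif(X, y, random_state=0)
    ranked = pd.Series(scores, index=X.columns).sort_values(ascending=False)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked.head(n_keep)

# Hypothetical usage, with 'churned' as the outcome column:
# print(viable_subset(customer_df, outcome="churned"))
```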
“We define prescriptive, needle-moving actions and behaviors and start to tap into the fifth V from big data: Value,” Biehn asserts. Again, how do you determine prescriptive value, which Biehn notes is derived from, and hence is not an intrinsic quality of, big data? Analytics.
Analytics verifies not only the accuracy of predictions, but also the effectiveness of outcomes in achieving goals. Analytics ascertains the validity of the methods and the ROI impact of the overall data-centered initiative. ROI quantifies value, complementing validity, a qualitative measure. Both V’s are external to the data itself.
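A rough illustration of those two external measures, with every name and figure hypothetical: prediction accuracy stands in for validity, and a simple gain-over-cost ratio stands in for ROI.

```python
# Validity checked against realized outcomes, and ROI as a gain/cost ratio.
# All figures are hypothetical.
from sklearn.metrics import accuracy_score

predicted = [1, 0, 1, 1, 0, 1]   # what the model predicted
actual    = [1, 0, 0, 1, 0, 1]   # what actually happened

validity_proxy = accuracy_score(actual, predicted)   # how often we were right

gain_from_actions = 250_000.0    # revenue attributed to data-driven actions
cost_of_initiative = 100_000.0   # spend on the data-centered initiative
roi = (gain_from_actions - cost_of_initiative) / cost_of_initiative

print(f"Prediction accuracy: {validity_proxy:.0%}, ROI: {roi:.0%}")
```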
Compounding the Confusion
Variability and veracity are similarly analytics-derived qualities that relate more to data uses than to the data itself.
Variability is particularly confusing. “Many options or variable interpretations confound analysis,” observed Forrester analysts Brian Hopkins and Boris Evelson back in 2011. Sure, and you can use a stapler to bang in a nail (I have), but that doesn’t make it any less a stapler.
“For example, natural language search requires interpretation of complex and highly variable grammar,” Hopkins and Evelson wrote. Put aside that grammar doesn’t vary so much; rather, it’s usage that is highly variable. Natural-language processing (NLP) techniques, as implemented in search and text-analytics systems, deal with variable usage by modeling language. NLP facilitates entity and information extraction, applied for particular business purposes.
(An entity is a uniquely identifiable thing or object; for instance, the name of a person, place, product or pattern, such as an e-mail address or Social Security number. Extractable information may include attributes of entities, relationships among entities, and constructs such as events — “Michelle LaVaughn Robinson Obama, born January 17, 1964, an American lawyer and writer, is the wife of the 44th and current President of the United States” — that we recognize as facts.)
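As one illustration — not necessarily what the search and text-analytics systems described above use — here is a minimal entity-extraction sketch with spaCy, an open-source NLP library; it assumes the pretrained en_core_web_sm model is installed.

```python
# A minimal entity-extraction sketch using spaCy (one common NLP toolkit).
# Requires the model: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Michelle LaVaughn Robinson Obama, born January 17, 1964, "
        "an American lawyer and writer, is the wife of the 44th "
        "President of the United States.")

doc = nlp(text)
for ent in doc.ents:
    # Prints each recognized entity with its type, e.g. PERSON, DATE, NORP, ORDINAL, GPE
    print(ent.text, ent.label_)
```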
IBM sees veracity as a fourth big data V. (Like me, IBM doesn’t advocate variability, validity, or value as big data essentials.) Regarding veracity, IBM asks, “How can you act upon information if you don’t trust it?”
Yet facts, whether captured in natural language or in a structured database, are not always true. False or outdated data may nonetheless be useful, and so may non-factual, subjective data (feelings and opinions).
Consider two statements, one asserting a fact that proved false and the other containing a fact that is no longer true. Join me in concluding that data may contain value unlinked from veracity:
— “The Iraqi regime… possesses and produces chemical and biological weapons.” — George W. Bush, October 7, 2002.
— “I am glad that George Bush is President.” — Daniel Pinchbeck, writing ironically, June 2003.
Veracity does matter. I’ll cite an old Russian proverb: “Trust, but verify.” That is, analyze your data — evaluate it in context, taking into account provenance — in order to understand it and use it appropriately.
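In code terms, “trust, but verify” might look like a few provenance and freshness checks run before a record is used. The field names, trusted sources and thresholds below are hypothetical, offered only to show the idea of evaluating data in context.

```python
# "Trust, but verify" as code: illustrative checks that evaluate a record's
# provenance, age and subjectivity before it is used. Names are hypothetical.
from datetime import datetime, timedelta

TRUSTED_SOURCES = {"internal_crm", "audited_feed"}
MAX_AGE = timedelta(days=365)

def verify(record: dict) -> list:
    """Return a list of reasons to distrust the record (empty list = no flags)."""
    flags = []
    if record.get("source") not in TRUSTED_SOURCES:
        flags.append("unknown provenance")
    if datetime.now() - record["captured_at"] > MAX_AGE:
        flags.append("possibly outdated")
    if record.get("is_opinion"):
        flags.append("subjective content -- weigh accordingly")
    return flags

# Hypothetical usage:
# verify({"source": "web_scrape", "captured_at": datetime(2002, 10, 7), "is_opinion": False})
```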
3 V’s Versus ‘Wanna-V’s’
My aim here is to differentiate the essence of big data, as defined by Doug Laney’s original-and-still-valid 3 V’s, from the derived qualities of new V’s proposed by various vendors, pundits and gurus. My hope is to maintain clarity and stave off market-confusing fragmentation begotten by the wanna-V’s.
On one side of the divide we have data capture and storage; on the other, business-goal-oriented filtering, analysis and presentation. Databases and data streaming technologies answer the big data need; for the balance, the smart stuff, you need big data analytics.
Variability, veracity, validity and value aren’t intrinsic, definitional big data properties. They are not absolutes. Rather, they reflect the uses you intend for your data. They relate to your particular business needs.
You discover context-dependent variability, veracity, validity and value in your data via analyses that assess and reduce data and present insights in forms that facilitate business decision-making. This function — analytics — is the key to understanding big data.
Seth Grimes is the leading industry analyst covering text analytics and sentiment analysis. He founded Alta Plana Corporation, a Washington-based technology strategy consultancy, in 1997.