Wired: The Missing Vs in Big Data: Viability and Value

May 6, 2013

By Neil Biehn

The era of Big Data is not “coming soon.” It’s here – today – and it has brought both painful changes and unprecedented opportunity to businesses in countless high-transaction, data-rich industries. In this first wave of Big Data, IT professionals have rightly focused on the underlying resource demands of Big Data, which are outstripping traditional data infrastructures and, in many cases, rewriting the rules for how and where data is stored, managed, and processed.

Data scientists are looking at the classic Vs:

• Volume – The costs of compute, storage, and connectivity resources are plunging, and new technologies like scanners, smartphones, ubiquitous video, and other data-collectors mean we are awash in volumes of data that dwarf what was available even five to 10 years ago. We capture every mouse click, phone call, text message, Web search, transaction, and more. As the volume of data grows, we can learn more – but only if we uncover the meaningful relationships and patterns.

• Variety – From the endless streams of text data in social networking and geolocation data, to structured wallet share and demographics, companies are capturing a more diverse set of data than ever. Bringing it together is no small task.

• Velocity – It’s a truism that the pace of business is inexorably accelerating. The volume and variety of Big Data alone would be daunting enough. But now, that data is coming faster than ever. For some applications, the data shelf life is short. Speed kills competitors if you tame these waves of data – or it can kill your organization if it overwhelms you.

IBM has coined a worthy V – “veracity” – that addresses the inherent trustworthiness of data. The uncertainty about the consistency or completeness of data and other ambiguities can become major obstacles. As a result, such basic principles as data quality, data cleansing, master data management, and data governance remain critical disciplines when working with Big Data.

It wasn’t very long ago when a terabyte was considered large. But now, that seems like a rounding error. Today, we create 2.5 quintillion bytes of data every day. In fact, we’re creating so much data so quickly that 90 percent of the data in the world today has been created in the last two years alone. Clearly, traditional ways of managing data must change.

In response, IT organizations have rethought their infrastructures and made tremendous progress in designing sophisticated computing architectures to tackle these extraordinary computing challenges. Data scientists have harnessed such technologies as grid computing, cloud computing, and in-database processing to bring a level of pragmatic feasibility to what were inconceivable computing challenges.

The Fourth V: Viability

But we need more than shiny plumbing to analyze massive data sets in real time. That’s merely a great start. But what can we do with that infrastructure? Where do we start? The first place to look is in the metadata. We want to carefully select the attributes and factors that are most likely to predict outcomes that matter most to businesses. With Big Data, we’re not simply collecting a large number of records. We’re collecting multidimensional data that spans a broadening array of variables. The secret is uncovering the latent, hidden relationships among these variables.

• What effect does time of day or day of week have on buying behavior?

• Does a surge in Twitter or Facebook mentions presage an increase or decrease in purchases?

• How do geolocation, product availability, time of day, purchasing history, age, family size, credit limit, and vehicle type all converge to predict a consumer’s propensity to buy?

Our first task is to assess the viability of that data because, with so many varieties of data and variables to consider in building an effective predictive model, we want to quickly and cost-effectively test and confirm a particular variable’s relevance before investing in the creation of a fully featured model. And, like virtually all scientific disciplines, that process begins with a simple hypothesis.

For instance, does weather (e.g., precipitation) affect sales volumes? We want to validate that hypothesis before we take further action and, in the process of determining the viability of a variable, we can expand our view to determine whether other variables – those that were not part of our initial hypothesis – have a meaningful impact on our desired or observed outcomes.

For example, a data scientist at a telecom provider might theorize that product mentions on Twitter spike shortly before a customer churns. She then extracts a sample of the data and performs some simple statistical tests and calculations to determine if there is a statistically significant correlation between the chosen variable (Twitter mentions) and customer churn. If so, we’ve established the viability of that variable and will want to broaden our scope and invest more resources in collecting and refining that data source. We can then repeat this process of confirming the viability of key variables (and ruling out others) until our model demonstrates a high level of predictability. Perhaps the risk of attrition increases after 30 months (regardless of the number of support calls). Or maybe attrition events are more likely to occur after a corporate customer’s stock price rises 10 percent in two months.
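
A quick viability check of this kind can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method any particular provider uses: the sample values, the names `mentions` and `churned`, and the choice of a permutation test as the "simple statistical test" are all hypothetical.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Two-sided permutation test: the share of random shufflings of ys
    whose correlation with xs is at least as strong as the observed one."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


# Hypothetical sample: Twitter mentions per customer in the prior month,
# and whether that customer churned (1) or stayed (0).
mentions = [0, 1, 0, 2, 5, 0, 4, 1, 6, 0, 3, 7, 0, 1, 5, 0]
churned = [0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0]

r = pearson(mentions, churned)
p = permutation_p_value(mentions, churned)
print(f"correlation = {r:.2f}, permutation p-value = {p:.4f}")
```

A small p-value here only says the variable is worth a closer look; a real study would use a far larger sample and check for confounding factors before investing in the data source.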

The Fifth V: Value

Once we have confirmed the viability of our key variables, we can then create a model that answers sophisticated queries, delivers counterintuitive insights, and creates unique learnings. We define prescriptive, needle-moving actions and behaviors and start to tap into the fifth V from Big Data: value.

Data science can help us uncover these subtle interactions, enabling a manufacturer, for instance, to manipulate heretofore hidden – often counterintuitive – levers that directly impact sales results. Our fictitious telecom provider trying to reduce churn, for instance, might look at the number or duration of calls to a support center. But data science might further analyze the Big Data and present the things you didn’t know. We extend the value of a predictive model by subsequently uncovering a virtually unfathomable combination of additional variables – the so-called “long tail” – that collectively predicts what you’re seeking to measure.

For our telecom provider, a sales executive might hypothesize that region, income, and age will help improve the accuracy of attrition forecasts among consumers. But once the viability of those dimensions is confirmed, we might expand our exploration to learn that customers in warm-weather Southwestern states with master’s degrees who own automobiles with a model year of 2008 or earlier and have a credit score of 625-650 show an outsized, statistically significant propensity to churn in the 45 days following their birthday.
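
Whether a narrow segment like this one really churns more than everyone else can be checked with a standard two-proportion z-test. The sketch below is illustrative only: the customer counts are invented, and the function name `two_proportion_z_test` is a hypothetical helper, not part of any library.

```python
from math import erfc, sqrt


def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Return (z, two-sided p-value) for the difference between two rates,
    using the pooled standard error under the null of equal rates."""
    rate_a, rate_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # For a standard normal, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Hypothetical counts: churn events in the 45 days after a birthday.
segment_churned, segment_size = 60, 400      # the narrow segment
others_churned, others_size = 500, 9600      # all other customers

z, p_value = two_proportion_z_test(
    segment_churned, segment_size, others_churned, others_size
)
lift = (segment_churned / segment_size) / (others_churned / others_size)
print(f"lift = {lift:.2f}x, z = {z:.2f}, p-value = {p_value:.2g}")
```

One caution applies when many candidate segments are scanned this way: the more combinations tested, the more likely some will look significant by chance, so the significance threshold should be tightened accordingly.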

Even if our aggregation of predictive variables – our model – is producing excellent results, we must remember what every undergrad student learns: Correlation does not mean causation. It would be foolhardy to blindly follow a predictive model of correlations without examining and understanding the interrelationships they embody. (Although a Super Bowl win by an NFC team has been correlated with gains in the Dow Jones Industrial Average, few of us would immediately put in buy orders on the following morning if the Dallas Cowboys take the Lombardi Trophy.)

But we can prudently and analytically validate these correlations with business intuition to better understand the drivers of buyer behavior and initiate micro-campaigns, at much lower cost, to present attractive offers to prevent churn. Regardless of how we get there, what matters is that our model points us to actions we can take that improve business outcomes.

What’s more, we needn’t pursue perfection in validating our hypotheses. If there are 100 relevant variables that affect the metric you’re seeking to measure and improve, you’re facing a tremendous analytical problem. But many data scientists believe that as few as 5 percent of the relevant variables will get you 95 percent of the sales lift/benefit. The trick, of course, is identifying the right 5 percent of the variables – and that’s what good data scientists can do by determining viability.
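
The arithmetic behind that rule of thumb can be made concrete. The sketch below assumes, as a strong simplification, that each variable's contribution to sales lift has already been estimated and that contributions simply add up; the variable names and numbers are invented for illustration.

```python
def smallest_covering_set(contributions, target=0.95):
    """Pick variables in descending order of contribution until their
    combined share of the total reaches the target."""
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(contributions.values())
    chosen, running = [], 0
    for name, value in ranked:
        chosen.append(name)
        running += value
        if running / total >= target:
            break
    return chosen, running / total


# Hypothetical estimates for 100 variables: five strong drivers
# and a long tail of 95 weak ones (arbitrary integer units).
contributions = {
    "purchase_history": 4000,
    "time_of_day": 2500,
    "geolocation": 1500,
    "credit_limit": 1000,
    "product_availability": 600,
}
contributions.update({f"tail_variable_{i}": 4 for i in range(95)})

chosen, share = smallest_covering_set(contributions)
print(f"{len(chosen)} of {len(contributions)} variables "
      f"cover {share:.1%} of the estimated lift")
```

In practice the hard part is producing those contribution estimates in the first place, which is where the viability testing described earlier comes in.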

Unquestionably, Big Data is a key trend that corporate IT must accommodate with proper computing infrastructures. But without high-performance analytics and data scientists to make sense of it all, you run the risk of simply creating Big Costs without creating the value that translates into business advantage.

Neil Biehn is vice president and leader of the science and research group at PROS.

