A Blog by Jonathan Low


Dec 7, 2014

Lies, Damn Lies and the Siren Song of Big Data

Just follow the data. That is, increasingly, what we are advised to do. The truth will then supposedly set us free. 

But the question is, who produced the data that miraculously revealed this truth? And given the difficulty of making sense of it, who interpreted the data? What are their personal or philosophical or ideological inclinations? Who trained them? What's their agenda?

So, yes, it would be lovely to know that 'the data' are somehow infallible. That they will always lead us in the right direction and deliver us from evil, or at least from declining market share. The chances of that happening, however, are somewhere south of certainty.

So, it seems, our work is not done. Data have not - yet anyway - supplanted thought, discussion, analysis and debate. They have not eliminated uncertainty and, in fact, may have complicated rather than simplified our lives. Which is not to say that data have not improved them and made for better decisions. It's just not quite as simple as we might have hoped. Bummer. JL

Ron Miller comments in TechCrunch:

You have to be sure the data is based on solid questions that produced accurate information and follows best data-gathering practices, but even then it’s possible to produce unexpected outcomes, proving that just following the data isn’t always as simple as it sounds.
We are told to follow the data and the truth will be revealed, but data tells many tales, and which one you hear depends on the data and how you interpret it. It makes me wonder if anything is definitive when you can present two similar sets of data and draw wildly different conclusions, depending on your emphasis. That’s because data is a tool in the hands of humans and we can interpret it as we choose. And to be clear, this isn’t because we choose to be deliberately deceptive either, although that’s probably true sometimes. It’s because, being human, we can bring unintended biases to the data.
It’s a huge conundrum in the age of big data. How do you find definitive answers when you can look at different data points on the same topic and come to different interpretations?
Pam Baker, author of the book Data Divination: Big Data Strategies, looks at it from a data science perspective, but she still acknowledges you have to ask the right questions to get good answers.
“Data is pulled according to its relevancy to the precise question being asked of the data. Algorithms are written to include several inputs as identified as necessary to answering the question,” Baker explained to me in an email.
She says data scientists have a number of tools at their disposal to do this work, but mistakes are always possible. “There is always room for error, of course, but data science and statistics have hammered out many of these issues long before big data came to be. But it is true that if the wrong data points are used in the algorithm or the data is flawed in some way, the algorithm output (answer) will be wrong or flawed too.”
That’s useful as far as it goes, but we know there is a shortage of data scientists. I’ve heard there is one or none at the vast majority of companies, so there is all this data, but companies lack the expertise to understand it, and data can be manipulated to give you the answers you want.
I listened to a speaker earlier this week at the Gilbane Conference in Boston give a bunch of statistics that suggested people didn’t use that many apps and most had fewer than 10. He also suggested 90 percent of users didn’t mind receiving spam SMS messages. Not coincidentally he worked for a company that offered an SMS advertising solution. He shared a bunch of data that suggested you would be foolish to build an app if you wanted to get a customer’s attention.
The speaker who followed displayed a data point that indicated we download 154,000 apps a minute. So which is it? How can you have fewer than 10 apps and at the same time be downloading apps at that pace? When you have clearly conflicting data like this, it makes it hard to answer questions definitively, suggesting once again the old axiom of ‘lies, damn lies and statistics’ could be truer than we imagine.
And when we put data into the hands of people other than the data scientists as Baker recommended, it could get even dicier, especially when those folks are in marketing and trying to use data to put their products and services in the best possible light. It could get even worse if they try to draw conclusions about their markets based on bad information.
Scott Liewehr, president of consulting firm Digital Clarity Group, says this is a very real danger. He told me that marketers have to be willing to put in the hard work to build valid studies, or they could be using bad data to make bad decisions about where to direct company resources. “Everybody can use data to tell whatever story you want to tell and it’s a big challenge for marketers,” Liewehr told me. “If they don’t know how to run studies, they can make a lot of bad decisions.”
Baker agrees, but she says folks in the line of business can also add value because they know their markets better than the data experts and putting the two together could produce better results. “Sometimes marketing and sales people understand better than data scientists what to ask — which is why it’s important to have a diverse data team,” she said.
But she cautions that lay people don’t always have all the right information. “Other times, business users can flounder and come to the wrong conclusion because they don’t understand statistical methods and other necessary methods involved in correctly doing this work.”
Last week I ran a story looking at the most popular enterprise sync and share tools, based on a study run by 451 Research. Now, this is a highly reputable firm, which ran not one, but two, studies several months apart before publishing this research. That’s clearly careful methodology and I’m not casting aspersions here, but in that article I wondered if they had asked the right questions or the right people. Instead of looking at usage generically, if they had asked specifically about enterprise licenses versus consumer licenses, would they have seen a different picture? It’s not as easy to figure out as you would imagine, as I learned in researching this article.
First of all, the 451 Research data found that more than 40 percent of respondents reported using Dropbox, giving it a commanding lead in the enterprise, a finding I described as surprising. Box, which has been the cloud poster child for the enterprise market, was listed fourth, with around 15 percent of respondents saying they used it, but that data does not necessarily tell the whole story.
Consumer licenses are the ones all of us can buy. Each service offers a certain amount of storage for free, and more if you’re willing to pay for it. For example, I have a one terabyte Dropbox account for $99 per year. This is in contrast to the business or enterprise license, which comes with a variety of tools to help IT manage all of the licenses in an organization. It also provides access to the product’s APIs to build solutions on top of the base product that tie into other enterprise software (a feature Dropbox released just this week).
Ilya Fushman, head of product at Dropbox for Business, told me last week that Dropbox recently passed 100,000 business customers (some of which are very small businesses and some of which are larger), a number that’s fairly impressive when you consider it launched the product in April 2013. Interestingly, by comparison, Box tells me it has 39,000 business customers, but the numbers don’t tell the whole story because Box has some pretty big customers.
For instance, Box counts Eli Lilly, Toyota, Dreamworks, Comcast, MD Anderson and GlaxoSmithKline among its clients and recently sold 300,000 enterprise licenses in its deal with General Electric. If you add in Schneider Electric with 65,000 licenses and another 44,000 from Procter and Gamble, you could begin to come to different conclusions than the 451 Research study about enterprise usage, even with the difference in the total number of customers.
For the record, it’s hard to know just how many business users Dropbox has because it won’t share seat numbers, but its reference customers include big-name brands such as Hearst, Hyatt, MIT and News Corp, and it lists a number of smaller customers on its Dropbox for Business site.
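To make the customers-versus-seats distinction concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above; the variable and function names are illustrative, and the "break-even" number is simply an arithmetic illustration, not something either company reported. It asks how many seats the average Dropbox business customer would need just to match the three Box deals named here.

```python
# Figures quoted in the article. Names are illustrative, not from any vendor.
box_named_deals = {
    "General Electric": 300_000,
    "Schneider Electric": 65_000,
    "Procter and Gamble": 44_000,
}
box_business_customers = 39_000
dropbox_business_customers = 100_000  # seat count not disclosed


def total_seats(deals):
    """Sum the license (seat) counts across a set of named deals."""
    return sum(deals.values())


def break_even_seats_per_customer(seats, customers):
    """Average seats per customer needed to reach a given seat total."""
    return seats / customers


# Seats from just the three Box deals named above.
box_seats_from_three_deals = total_seats(box_named_deals)

# Average seats each Dropbox business customer would need to match
# those three deals alone (an illustration, not a reported figure).
dropbox_break_even = break_even_seats_per_customer(
    box_seats_from_three_deals, dropbox_business_customers
)

print(f"Seats in three named Box deals: {box_seats_from_three_deals:,}")
print(f"Dropbox break-even seats per customer: {dropbox_break_even:.2f}")
```

Under these numbers, three Box customers out of 39,000 account for more seats than Dropbox has business customers. That shows why counting customers and counting seats can point to different leaders, and why the answer stays open until seat numbers are available for both.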
Alan Pelz-Sharpe, a 451 Research analyst who helped author the study, says they are still working on the methodology, and the data they reported on is just the start of a long process of analyzing this market.
“The October survey data I think revealed a number of things – firstly Dropbox has a huge footprint in companies (no surprise to anyone – particularly its competitors). That the market is very immature, but growing at a clip, and that many enterprises are reluctant to embrace public cloud options. Those are all trends that will become more interesting over time – and as this is the first release of a new survey that information is what is really going to be of value – what changes over time.  Added to this is the fact that we are doing some detailed market and revenue modeling for this space and new dimensions will emerge,” he wrote me in an email.
Data does have great value, of course, but even when you’re careful, it’s possible to come to ambiguous interpretations or have trouble nailing down an answer. That’s because even with all of the data we have, sometimes we still have gaps. Obviously, you have to be sure the data you have is based on solid questions that produced accurate information and follows best data-gathering practices, but even then it’s possible to produce unexpected outcomes, proving that just following the data isn’t always as simple as it sounds.

