A Blog by Jonathan Low

 

Aug 1, 2019

Is Data Science Legit?

Data is not necessarily The Truth. JL

Andrea Jones-Rooy reports in Quartz:

Data is now perceived as the Rosetta stone for cracking the code of human existence. But we’ve conflated data with truth. Data doesn’t say anything. Humans say things. They say what they notice in data, data that only exists because humans chose to collect it, and collected it using human-made tools. Data can’t say anything any more than a hammer can build a house or almond meal can make a macaron. Data is a necessary ingredient in discovery, but you need a human to select it, shape it, and turn it into an insight. Data is only as useful as its quality and the skills of the person wielding it. We need to question data rather than assuming that because we’ve assigned a number to something it’s The Truth.
After millennia of relying on anecdotes, instincts, and old wives’ tales as evidence for our opinions, most of us today demand that people use data to support their arguments and ideas. Whether it’s curing cancer, solving workplace inequality, or winning elections, data is now perceived as being the Rosetta stone for cracking the code of pretty much all of human existence.
But in the frenzy, we’ve conflated data with truth. And this has dangerous implications for our ability to understand, explain, and improve the things we care about.
I have skin in this game. I am a professor of data science at NYU and a social-science consultant for companies, where I conduct quantitative research to help them understand and improve diversity. I make my living from data, yet I consistently find that whether I’m talking to students or clients, I have to remind them that data is not a perfect representation of reality: It’s a fundamentally human construct, and therefore subject to biases, limitations, and other meaningful and consequential imperfections.
The clearest expression of this misunderstanding is the question heard from boardrooms to classrooms when well-meaning people try to get to the bottom of tricky issues:
“What does the data say?”
Data doesn’t say anything. Humans say things. They say what they notice or look for in data—data that only exists in the first place because humans chose to collect it, and they collected it using human-made tools.
Data can’t say anything about an issue any more than a hammer can build a house or almond meal can make a macaron. Data is a necessary ingredient in discovery, but you need a human to select it, shape it, and then turn it into an insight.
Data is therefore only as useful as its quality and the skills of the person wielding it. (You know this if you’ve ever tried to make a macaron. Which I have. And let’s just say that data would certainly not be up to a French patisserie’s standard.)
So if data on its own can’t do or say anything, then what is it?

What is data?

Data is an imperfect approximation of some aspect of the world at a certain time and place. (I know, that definition is a lot less sexy than we were all hoping for.) It’s what results when humans want to know something about something, try to measure it, and then combine those measurements in particular ways.
Here are four big ways that we can introduce imperfections into data.
  • random errors
  • systematic errors
  • errors of choosing what to measure
  • errors of exclusion
These errors don’t mean that we should throw out all data, or that nothing is knowable. They mean we should approach data collection with thoughtfulness, ask ourselves what we might be missing, and welcome the collection of further data.
This view is not anti-science or anti-data. To the contrary, the strength of both comes from being transparent about the limitations of our work. Being aware of possible errors can make our inferences stronger.
The first is random errors. This is when humans decide to measure something and, whether because of broken equipment or their own mistakes, the data recorded is wrong. Say you hang a thermometer on a wall to measure the temperature, or use a stethoscope to count heartbeats. If the thermometer is broken, it might not tell you the right number of degrees. The stethoscope might not be broken, but the human doing the counting might space out and miss a beat.
A big way this plays out in the rest of our lives (when we’re not assiduously logging temperatures and heartbeats) is in the form of false positives in medical screenings. A false positive for, say, breast cancer, means the results suggest we have cancer but we don’t. There are lots of reasons this might happen, most of which boil down to a misstep in the process of turning a fact about the world (whether or not we have cancer) into data (through mammograms and humans).
The consequences of this error are very real, too. Studies show a false positive can lead to years of negative mental-health consequences, even though the patient turned out to be physically well. On the bright side, the fear of false positives can also lead to more vigilant screening (…which increases the chances of further false positives, but I digress).
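Working through the base rates shows why false positives are so common even when the underlying test is fairly accurate. Here is a minimal sketch; the sensitivity, specificity, and prevalence figures are made-up round numbers for illustration, not real mammography statistics:

```python
# Illustrative only: sensitivity, specificity, and prevalence are assumed
# round numbers, not real mammography statistics.
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(actually sick | positive result), via Bayes' rule."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# If 1% of those screened are sick, even a test that is right 90% of the
# time for both sick and healthy people yields mostly false positives.
print(positive_predictive_value(sensitivity=0.90, specificity=0.90, prevalence=0.01))
# ~0.083: roughly 11 of every 12 positive results are false alarms
```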
Generally speaking, as long as our equipment isn’t broken and we’re doing our best, we hope these errors are statistically random and thus cancel out over time—though that’s not a great consolation if your medical screening is one of the errors.
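One way to see what “cancel out over time” means in practice is a quick simulation. This is a minimal sketch with arbitrary numbers (the true temperature, the noise level, the miscalibration offset are all assumptions); it also hints at why the next kind of error is different:

```python
import numpy as np

rng = np.random.default_rng(0)
true_temperature = 21.0  # hypothetical "true" room temperature, in Celsius

# Each reading is off by a small random amount: a cheap sensor, shaky hands,
# a momentarily distracted human.
readings = true_temperature + rng.normal(loc=0.0, scale=0.5, size=1000)
print(readings.mean())  # close to 21.0: random errors average out

# A broken or miscalibrated thermometer is different: a constant offset is a
# systematic error and never cancels, no matter how many readings we take.
print((readings + 1.5).mean())  # about 22.5
```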
The second is systematic errors. This refers to the possibility that some data is consistently making its way into your dataset at the expense of others, thus potentially leading you to make faulty conclusions about the world. This might happen for lots of different reasons: who you sample, when you sample them, or who joins your study or fills out your survey.
A common kind of systematic error is selection bias. For example, using data from Twitter posts to understand public sentiment about a particular issue is flawed because most of us don’t tweet—and those who do don’t always post their true feelings. Instead, a collection of data from Twitter is just that: a way of understanding what some people who have selected to participate in this particular platform have selected to share with the world, and no more.
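A toy simulation makes the Twitter problem concrete. All of the numbers below are assumptions for illustration (the share of people who tweet and how their views differ), not estimates from any real dataset:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000  # hypothetical population

# Assume 20% of people tweet, and that tweeting is correlated with opinion:
# 60% support the issue among people who tweet, 40% among those who don't.
tweets = rng.random(n) < 0.20
supports = np.where(tweets, rng.random(n) < 0.60, rng.random(n) < 0.40)

print(supports.mean())          # true population support: about 0.44
print(supports[tweets].mean())  # what sampling only Twitter shows: about 0.60
```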
The 2016 US presidential election is an example where a series of systematic biases may have led the polls to wrongly favor Hillary Clinton. It can be tempting to conclude that all polling is wrong—and it is, but not in the general way we might think.
One possibility is that voters were less likely to report that they were going to vote for Trump due to perceptions that this was the unpopular choice. We call this social desirability bias. It’s useful to stop to think about this, because if we’d been more conscious of this bias ahead of time, we might have been able to build it into our models and better predict the election results.
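In the crudest possible form, “building it into our models” could look like the sketch below: if past surveys gave us a guess at how often a candidate’s supporters misreport their intentions, we could back out an adjusted estimate from the raw poll number. The 5% misreporting rate and the 44% reported share are assumptions for illustration, not figures any real pollster would use as-is:

```python
def adjust_for_underreporting(reported_share, underreport_rate=0.05):
    """If a fixed fraction of a candidate's supporters don't admit their
    support to pollsters, back out the implied true share."""
    return reported_share / (1 - underreport_rate)

print(adjust_for_underreporting(0.44))  # a reported 44% implies ~46.3% actual support
```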
Medical studies are sadly riddled with systematic biases, too: They are often based on people who are already sick and who have the means to get to a doctor or enroll in a clinical trial. There’s some excitement about wearable technology as a way of overcoming this. If everyone who has an Apple Watch, for example, could just send their heart rates and steps per day to the cloud, then we would have tons more data with less bias. But this may introduce a whole new bias: The data will likely now be skewed to wealthy members of the Western world.
The third is errors of choosing what to measure. This is when we think we’re measuring one thing, but in fact we’re measuring something else.
I work with many companies that are interested—laudably—in finding ways to make more objective hiring and promotion decisions. The temptation is often to turn to technology: How can we get more data in front of our managers so they make better decisions, and how can we apply the right filters to make sure we are getting the best talent in front of our recruiters?
But very few pause to ask if their data is measuring what they think it’s measuring. For example, if we are looking for top job candidates, we might prefer those who went to top universities. But rather than that being a measure of talent, it might just be a measure of membership in a social network that gave someone the “right” sequence of opportunities to get them into a good college in the first place. A person’s GPA is perhaps a great measure of someone’s ability to select classes they’re guaranteed to ace, and their SAT scores might be a lovely expression of the ability of their parents to pay for a private tutor.
Companies—and my students—are so obsessed with being on the cutting edge of methodologies that they’re skipping the deeper question: Why are we measuring this in this way in the first place? Is there another way we could more thoroughly understand people? And, given the data we have, how can we adjust our filters to reduce some of this bias?
Finally, errors of exclusion. This happens when populations are systematically ignored in datasets, which can set a precedent for further exclusion.
For example, women are now more likely to die from heart attacks than men, which is thought to be largely due to the fact that most cardiovascular data is based on men, who experience different symptoms from women, thus leading to incorrect diagnoses.
We also currently have a lot of data on how white women fare when they run for political office in the US, but not a lot on the experiences of people of color (of any gender, for that matter), who face different biases compared to white women on the campaign trail. (And that’s not even mentioning the data on the different experiences of, say, black candidates compared to Latinx candidates, and so on). Until we do these studies, we’ll be trying to make inferences about apples from data about oranges—but with worse consequences than an unbalanced fruit salad.
Choosing to study something can also incentivize further research on that topic, which is a bias in and of itself. As it’s easier to build from existing datasets than create your own, researchers often gather around certain topics—like white women running for office or male cardiovascular health—at the expense of others. If you repeat this enough times, all of a sudden men are the default in heart-disease studies and white women are the default in political participation studies.
Other examples abound. Measuring “leadership” might incentivize people to be more aggressive in meetings, thus breaking down communication in the long run. Adding an “adversity” score to the SATs might incentivize parents to move to different zip codes so their children’s scores are worth more.
I also see this play out in the diversity space: DiversityInc and other organizations that try to evaluate the diversity of companies have chosen a few metrics on which they reward companies—for example, “leadership buy-in,” which is measured by having a Chief Diversity Officer. Ticking this box has incentivized a burst of behaviors that may not actually do anything, like appointing a CDO who has no real power.

Why we still need to believe in data

In the age of anti-intellectualism, fake news, alternative facts, and pseudo-science, I am very reluctant to say any of this. Sometimes it feels like we scientists are barely hanging on as it is. But I believe that the usefulness of data and science comes not from the fact that it’s perfect and complete, but from the fact that we recognize the limitations of our efforts. Just as we want to analyze data carefully with statistics and algorithms, we also need to collect it carefully. We are only as strong as our humility and awareness of our limitations.
This doesn’t mean throw out data. It means that when we include evidence in our analysis, we should think about the biases that may have affected its reliability. We should not just ask “what does it say?” but also ask, “who collected it, how did they do it, and how did those decisions affect the results?”
We need to question data rather than assuming that just because we’ve assigned a number to something, it’s suddenly the cold, hard Truth. When you encounter a study or dataset, I urge you to ask: What might be missing from this picture? What’s another way to consider what happened? And what does this particular measure rule in, rule out, or incentivize?
We need to be as thoughtful about data as we are starting to be about statistics, algorithms, and privacy. As long as data is considered cold, hard, infallible truth, we run the risk of generating and reinforcing a lot of inaccurate understandings of the world around us.
