Why Nobody Said Big Data Is Truth
Over the weekend, the New York Times’ Bits blog featured a post by Quentin Hardy summarizing a talk given by Kate Crawford, a Microsoft Research employee, at a Berkeley I School conference. The (somewhat provocative) title of the post is “Why Big Data is Not Truth,” and it reaches the rather uncontroversial conclusion, by way of a quotation from Ms. Crawford, that “We need to think about how we will navigate these [big data] systems. Not just individually, but as a society.” It gets there, however, by arguing against a number of strawmen surrounding big data, ones that I’ve heard elsewhere and that I want to highlight.
First, Ms. Crawford sets up the “myth” that big data is objective, which she then dismantles, in part by pointing out that the users of Twitter skew toward the young, urban, and affluent. This weekend’s protesters in Gezi Park may disagree with this characterization (fun fact: around 90% of the tweets using the various hashtags associated with the protests are coming from local Turkish users, and around 88% are in Turkish). Still, the broader point that Crawford seems to be making (that the data in our big data systems is inherently biased in various ways and that we must account for that) describes a well-understood limitation of all statistical systems, and it is no reason to stop engaging in statistics. What wasn’t mentioned in the Times post, however, was the way in which big data attempts to compensate for the biases in statistical modeling through larger and larger data sets. No data set is ever perfect, but the more you have, the closer to “truth” you can get.
Second, the Times piece has Crawford questioning the myth that big data doesn’t discriminate. She points out that even anonymized data can have information such as gender, race, and sexual orientation extracted from it by a determined analyst. By offering this as her supporting fact, it seems to me, she undermines her own argument: it is the analyst, not the data, who does the extracting. Indeed, big data doesn’t discriminate. Big data is just a term for the force multipliers that come into play when you cram a lot of information into one system. People discriminate. How any one entity uses big data should be examined for discrimination. Casting the blame on big data itself is meaningless.
Finally, Ms. Crawford sets up as one of her myths the idea that “you can opt-out.” This is a somewhat misleading myth to put forward, because in some cases you can actually opt out, and in others you can’t, and that’s how it should be. For example, many websites collect large-scale data about how people use the website itself: which pages they visit, in which order, where they linger, and what they breeze by. That information is usually called web analytics, and it’s used to modify websites so that people find what they’re looking for more quickly and easily. Similarly, Google scans the queries that its users enter for various terms that it has learned are associated with flu outbreaks, and uses that information to predict the impact of the flu state by state. As a user you cannot, and should not be able to, opt out of this collection.
On the other hand, there are some collections of data that you can opt out of, such as those used for interest-based advertising. In these circumstances, Ms. Crawford claims, companies can change their terms and conditions, and she points out that Instagram recently did just that to expand the sharing of photos uploaded to the site. She then laments the lack of a paid option at Instagram that would avoid this sort of sharing. She seems to disregard, however, the vast competition in online photo storage, including Flickr, Picasa, Imgur, TwitPic, and many others. Many of these other options allow you to carefully control who gets to see your photos, or you can use a service like Dropbox (which does allow you to pay for storage) if you want to share them with nobody at all.
Big data is, at the end of the day, just a tool. Like all other tools, its results should be examined for efficacy and desired outcomes. On that, I suspect Ms. Crawford and I agree. The ways that she gets to her conclusion, however, are questionable and should be examined closely.