Does Dog Dataset Decide Data Dominance Debate?
In discussions about technology markets, the notion often prevails that a larger dataset is an insurmountable lead. As a result, notwithstanding that the math tells us otherwise, many assume that large tech firms will necessarily dominate smaller entrants with less data.
This may be because, when it comes to tech, there has long been a sense that “this time it is different” — that ‘big data’-related services won’t behave like other statistical prediction models. However, a growing body of empirical evidence refutes this supposed exception and shows that there are observable diminishing returns to dataset size in improving prediction.
Instead of beginning with tech companies’ data sets, however, let’s begin with a generally uncontested illustration of this statistical phenomenon: surveys of voting practices for an upcoming election.
As DisCo has previously reported, to survey voting preferences for an upcoming election, one needs a large enough sample to ensure accuracy. However, each additional survey participant does not improve the quality of the survey by as much as the one before her.
For example, if a pollster has 5 respondents, adding a 6th proves extremely valuable. If the pollster already has 100,000 respondents, adding one more is almost insignificant. A survey with 100,000 respondents is therefore not twice as accurate as a survey with 50,000 respondents; in fact, in both cases the margin of error is less than 1%.
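That plateau follows directly from the standard margin-of-error formula, MoE = z·√(p(1−p)/n). A quick check (using the conventional 95% z-score of 1.96 and the worst-case proportion p = 0.5; an illustrative calculation, not drawn from any particular poll):

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a proportion estimated from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

# Going from 5 to 6 respondents shaves off several percentage points...
print(f"n=5:       ±{margin_of_error(5):.1%}")
print(f"n=6:       ±{margin_of_error(6):.1%}")
# ...while at 50,000 and 100,000 both margins are already under 1%.
print(f"n=50,000:  ±{margin_of_error(50_000):.2%}")
print(f"n=100,000: ±{margin_of_error(100_000):.2%}")
```

Doubling the sample from 50,000 to 100,000 shrinks the margin of error only by a factor of √2, which is why the second survey is nowhere near twice as accurate.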
This mathematical reality leads to the conclusion that how one utilizes and parses data is much more important than the sheer volume of data used.
DOG DATASETS AND INTERNET SEARCH
Papers published by Hal Varian, the chief economist at Google, and Catherine Tucker, the Sloan Distinguished Professor of Management Science and Professor of Marketing at MIT Sloan, reinforce the universality of this principle, and particularly its applicability to machine learning.
One experiment that Varian notes in his paper, “Artificial Intelligence, Economics, and Industrial Organization,” tests how the accuracy of classification on the Stanford dog-breed dataset behaves as the amount of training data increases. The study found that, “as one would expect accuracy improves as the number of training images increases, but it does so at a decreasing rate” — meaning that ever-greater quantities of data yielded ever-smaller improvements in the accuracy of dog classifications.
Tucker conducted a similar experiment in which she investigated whether a large amount of data “confers a competitive advantage on firms that offer Internet search.” Tucker found scant evidence that “reducing the length of storage of past search engine searches affected the accuracy of search.” Thus, her results cast doubt on the assumption that possessing historical data confers a significant competitive advantage.
Instead, Tucker argues that the tools used to analyze the data and “provide value to consumers” in unique ways confer a more “sustainable advantage.” Varian echoes this, stating that “…other factors such as improved algorithms, improved hardware, and improved expertise have been much more important than the number of observations in the training data.”
AMAZON’S PRODUCT FORECASTS
Amazon’s Chief Economist, Patrick Bajari, and his colleagues came to a similar conclusion in a recent paper examining the relationship between the quantity of data Amazon collects for retail forecasts and the accuracy of machine learning models.
Like Tucker and Varian, Bajari and his colleagues found that “as more and more data is available for a particular product, demand forecasts for that product improve over time, though with diminishing returns to scale.”
That is, the authors found that the accuracy of the machine learning models that forecast which products will be in demand by customers in the current week, and in the future, improved at a decreasing rate as the amount of data collected increased.
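The shape of that curve can be reproduced with a toy simulation. The sketch below is entirely synthetic: `TRUE_DEMAND`, `NOISE_SD`, and the simple averaging “forecaster” are illustrative assumptions, not Amazon’s actual models. It estimates a product’s mean weekly demand from n weeks of history and measures how forecast error shrinks as history grows:

```python
import random
import statistics

random.seed(0)

TRUE_DEMAND = 100.0  # hypothetical true mean weekly demand for one product
NOISE_SD = 20.0      # hypothetical week-to-week variation in demand

def forecast_rmse(n_weeks, trials=1000):
    """RMSE of forecasting mean demand from n_weeks of simulated history."""
    sq_errs = []
    for _ in range(trials):
        history = [random.gauss(TRUE_DEMAND, NOISE_SD) for _ in range(n_weeks)]
        forecast = statistics.fmean(history)  # naive forecast: historical mean
        sq_errs.append((forecast - TRUE_DEMAND) ** 2)
    return statistics.fmean(sq_errs) ** 0.5

for n in (10, 100, 1000):
    print(f"{n:>4} weeks of history -> forecast RMSE ~ {forecast_rmse(n):.2f}")
```

Error keeps falling as history accumulates, but each tenfold increase in data buys a smaller absolute improvement than the last, the same diminishing-returns pattern the paper reports.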
WHY PEOPLE STILL THINK MORE DATA MEANS SUCCESS
But, in spite of this research, there remains a prevailing concern that the collection of large quantities of data (‘big data’), particularly users’ data, “leads to markets ‘tipping’ to dominant online platforms” and, as a result, a question of “whether online markets merit earlier and more aggressive antitrust intervention.”
According to Andres Lerner, these concerns are predicated on the false assumption that there is a strong feedback loop present — that, in order to compete effectively in the digital marketplace, online entities require a large user base from which to collect data.
However, while these firms may have a larger user base, they tend to use only a small fraction of the data collected from them.
As Varian notes in his paper, firms build models on small samples of data and typically do not use all the data they collect. Google’s samples, for example, use about 0.1% of the data collected. This is partly because data has a short shelf life: information about customer preferences from a year ago is much less valuable than a current dataset.
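How far a small sample can go is easy to demonstrate. In this sketch (synthetic data and a hypothetical estimation task, not Google’s actual pipeline), a 0.1% random sample of a million-observation dataset recovers the full dataset’s mean almost exactly:

```python
import random
import statistics

random.seed(42)

# A synthetic "full" dataset: one million observations.
full_data = [random.gauss(50.0, 10.0) for _ in range(1_000_000)]

# A 0.1% random sample: just 1,000 observations.
sample = random.sample(full_data, k=len(full_data) // 1000)

full_mean = statistics.fmean(full_data)
sample_mean = statistics.fmean(sample)
print(f"full-data mean:   {full_mean:.2f}")
print(f"0.1% sample mean: {sample_mean:.2f}")
```

The sample estimate typically lands within a fraction of a unit of the full-data answer, at a thousandth of the processing cost.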
And, because data is more freely available now than ever before, smaller firms can get access to these timely datasets.
Not only are there millions of public datasets available online for free, but firms can also collect and analyze their own website data using data analytics tools that are likewise freely available online. What’s more, cloud computing has significantly reduced the cost of collecting and analyzing this data.
Thus, Lerner concludes, “the conclusion that the benefits of user data lead to significant returns to scale and to the entrenchment of dominant online platforms is based on assumptions unsupported by real-world evidence.”
Rather, it is “engineering resources and technological investments and innovation” that provide a firm with a competitive advantage.