What is Data Science?

The phrase “data science” has many meanings and there is currently no consensus about the right definition. This is a problem because right now the phrase is hot and means big bucks, since employers want to hire data scientists and granting agencies want to fund research in the area.

Naturally, there are many prophets claiming to provide the true definition of data science. When these prophets come from academia, they tend to assert that data science has been around for a long time and that the current phrase is just a buzzword that is at best a convenient new bottle into which to pour old wine. Interestingly, both computer science (Naur 1966) and statistics (Wu 1997, Cleveland 2001) have laid claim to the phrase data science. A good example of this point of view is Donoho’s “50 Years of Data Science” (2017), which damns the field with the faint praise of contributing some nice software to the work that statisticians have been doing all along.

These kinds of narratives are off-base. They assume that defining data science is like tracing the ancestry of an idea or phrase in the family tree of academic fields. It is only partly that. The reality is that data science became popular not because of any effort to expand an existing academic field, and certainly not because statisticians began to realize the importance of the work of computer scientists in the late 1990s. The reason we use the phrase today is that it was adopted, appropriately or not, by social media companies like LinkedIn and Facebook to define a new kind of work that could not be classified by the traditional categories of “data analyst” or “programmer.”

This new kind of work was, essentially, the efficient application of machine learning to large data sets produced by the datasphere. The work combined expertise in software engineering with computational methods in statistical learning. Most of these methods were developed by computer scientists interested in artificial intelligence and data mining. Most importantly, this kind of work is a messy crossroads of knowledge, technology, and circumstance that is not easily defined.

It’s not that the academic definitions are invalid. They represent an important phase in the history of ideas that we have the good fortune to experience. In the language of Hegel, the academic response to industry’s appropriation of an idea is a predictable dialectical response. But these definitions need to be put in perspective. To be blunt, the reason students want to get an MS or PhD in data science is not because they are interested in causality, even though they perhaps ought to be. It is because they want to learn this new kind of knowledge that got called “data science.” So we need to have a good, empirical understanding of this new kind of knowledge, even as we want to connect it to other endeavors, such as computational science and information science.

References

Cleveland, William S. 2001. “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.” International Statistical Review / Revue Internationale de Statistique 69 (1): 21–26. https://doi.org/10.2307/1403527.

Donoho, David. 2017. “50 Years of Data Science.” Journal of Computational and Graphical Statistics 26 (4): 745–66. https://doi.org/10.1080/10618600.2017.1384734.

Naur, Peter. 1966. “The Science of Datalogy.” Communications of the ACM 9 (7): 485. https://doi.org/10.1145/365719.366510.

Wu, C. F. J. 1997. “Statistics = Data Science?”