Discovery, Truth and Utility: Defining ‘Data Science’
Frank and Martin explore the definition of data science.
Gregory Piatetsky-Shapiro knows a thing or two about extracting insight from data. In 1989 he co-founded the first Knowledge Discovery and Data Mining workshop, which we briefly discussed in the second installment of this blog series. And he has been practicing and teaching pretty much continuously ever since.
But what is it, exactly, that he has been practicing? Even Piatetsky-Shapiro might struggle to give you a consistent answer to that question, as this quote of his from 2012 hints:
Although the buzzwords describing the field have changed – from ‘knowledge discovery’ to ‘data mining’ to ‘predictive analytics’, and now to ‘data science’, the essence has remained the same – discovery of what is true and useful in mountains of data.
We like this quote a lot. Firstly, because it acknowledges that historically we have used at least four different terms – knowledge discovery, data mining, predictive analytics and data science – to describe substantially the same thing. The tools, techniques and technologies that we use continue to evolve, but our objective is basically the same.
And the second reason that we like this quote so much is that it contains three words that we think are key to understanding the analytic process.
Discovery. True. And Useful.
Let’s take each of these in turn.
Analytics is fundamentally about discovery. It’s about revealing patterns in data that we didn’t know existed – and extrapolating from them to try and know things that we otherwise wouldn’t know.
In fact, the analytic discovery process has more in common with research and development (R&D) than with software engineering. If we are doing it right, we should have a reasonably clear idea about the business challenges or opportunities that we are trying to address. For example, we may want to measure customer sentiment to establish whether it is correlated with store performance, and to understand which parts of the shopping experience we should improve to increase customer satisfaction. Or we might want to predict the failure of train-sets based on patterns in sensor data.
But often we won’t know which approach is likely to be most successful, whether the data available to us can support the desired outcome – or even whether the project is feasible at all. And that means – first and foremost – that whatever we call it, analytics is about experimentation. Repeated experimentation. As Foster Provost and Tom Fawcett put it in their (excellent) textbook Data Science for Business: “the results of a given step may change the fundamental understanding of the problem.” Traditional notions of scope and requirements are therefore often difficult to apply to analytics projects.
Secondly, whilst many process models have been developed to try and codify the analytic process and so make it more reliable and repeatable – of which the Cross-Industry Standard Process for Data Mining (CRISP-DM) shown below is probably the most successful and the most widely known – the reality is that analytics is an iterative, rather than a linear, process. We can’t simply execute each step of the process in turn and hope that insight will miraculously “pop” out of the end. An unsuccessful attempt at modelling, say, customer propensity-to-buy may cause us to revisit the data preparation step to create new metrics that we hope will be more predictive. Or it may cause us to realize that we are insufficiently clear in our understanding of the business problem – and require us to start over.
One important outcome of all of this is that “failure” rates for analytics initiatives are high. Often, these “failures” really aren’t failures in the traditional sense at all; rather, they represent important learning about which approaches, tools and techniques are relevant to a particular problem. The industry refers to this as “fail fast”, although it might be more appropriate to call it a “learn quick” approach to analytics. But whatever we call it, this high failure rate has important consequences for the way we organize and manage analytic projects, which we will return to later in this series.
There are many ways in which data can mislead rather than inform us. Sometimes we can find results that appear to be interesting, but that are not statistically significant. We may conflate correlation with causality. Or we may be misled by Simpson’s paradox, where a trend that holds within every subgroup reverses when the subgroups are pooled. Ironically, as Kaiser Fung points out in his book Numbersense, big data can get us into big trouble by multiplying the number of blind alleys and irrelevant correlations that we can chase – and so causing us to waste precious time and organizational resources.
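
Simpson’s paradox is easier to believe once you have seen it in numbers. The sketch below is purely illustrative: the two campaigns, the two customer segments and every count are made up for the example, and the function names are our own. It simply compares conversion rates within each segment and then again after pooling the segments.

```python
# Illustrative counts only (not real data): (conversions, customers) for
# two hypothetical campaigns, split by a hypothetical customer segment.
counts = {
    "campaign_a": {"new": (81, 87), "returning": (192, 263)},
    "campaign_b": {"new": (234, 270), "returning": (55, 80)},
}


def rate(successes, total):
    """Return the success rate as a fraction between 0 and 1."""
    return successes / total


def overall_rate(segments):
    """Pool every segment of one campaign and return the combined rate."""
    successes = sum(s for s, _ in segments.values())
    total = sum(t for _, t in segments.values())
    return rate(successes, total)


def winner_by_segment(data):
    """Return the campaign with the higher rate within each segment."""
    segment_names = next(iter(data.values())).keys()
    return {
        seg: max(data, key=lambda c: rate(*data[c][seg]))
        for seg in segment_names
    }


def winner_overall(data):
    """Return the campaign with the higher pooled rate."""
    return max(data, key=lambda c: overall_rate(data[c]))


by_segment = winner_by_segment(counts)
overall = winner_overall(counts)
print(by_segment)  # campaign_a wins in both segments
print(overall)     # campaign_b wins once the segments are pooled
```

With these made-up counts, campaign A converts better among both new and returning customers, yet campaign B looks better overall, because most of campaign A’s customers fall in the harder-to-convert returning segment. Which comparison is the “true” one depends on what we know about how customers ended up in each campaign, not on the arithmetic.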
But something even more basic can also trip us up: data quality. The most sophisticated techniques, algorithms and analytic technologies are still hostage to the quality of our data. If we feed them garbage, garbage is what they will give us in return.
We cannot automatically assume that data are “true” – in particular, because the data that we are seeking to re-use and re-purpose for our analytics project are likely to have been collected to serve very different purposes. Analytics of the sort that we are undertaking may never have been intended or foreseen. That is why the CRISP-DM model places so much emphasis on “data understanding”; it is important that we first establish whether the data that are available to us are “fit for purpose” – or whether we need to change our purpose, get better data, or both.
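
A first “fit for purpose” check does not need sophisticated tooling. As a minimal sketch, assuming a handful of hypothetical sensor readings (the field names, the valid temperature range and the `profile` helper are all our own inventions for illustration), we can count missing values, implausible values and duplicated records before any modelling starts.

```python
from collections import Counter

# Hypothetical sensor readings; the field names and the valid range used
# below are assumptions made for this sketch, not from any real system.
readings = [
    {"sensor_id": "s1", "timestamp": "2024-01-01T00:00", "temp_c": 21.5},
    {"sensor_id": "s1", "timestamp": "2024-01-01T00:05", "temp_c": None},
    {"sensor_id": "s2", "timestamp": "2024-01-01T00:00", "temp_c": 999.0},
    {"sensor_id": "s2", "timestamp": "2024-01-01T00:00", "temp_c": 999.0},
    {"sensor_id": "s3", "timestamp": "2024-01-01T00:00", "temp_c": 19.8},
]


def profile(rows, field, low, high, key_fields):
    """Count missing, out-of-range and duplicate-key records for one field."""
    values = [row.get(field) for row in rows]
    missing = sum(v is None for v in values)
    out_of_range = sum(
        v is not None and not (low <= v <= high) for v in values
    )
    # A record is a duplicate if its key has already been seen.
    keys = Counter(tuple(row.get(k) for k in key_fields) for row in rows)
    duplicates = sum(n - 1 for n in keys.values() if n > 1)
    return {
        "rows": len(rows),
        "missing": missing,
        "out_of_range": out_of_range,
        "duplicate_keys": duplicates,
    }


report = profile(
    readings,
    "temp_c",
    low=-40.0,
    high=60.0,
    key_fields=("sensor_id", "timestamp"),
)
print(report)
```

Counts like these do not tell us whether the data are “true”, but they do tell us quickly whether a placeholder value such as 999.0 or a double-loaded batch is about to be treated as a genuine observation.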
Defining data science
So how, then, should we define data science? Spend 10 minutes with Google and you will find plenty of contradictory definitions. Our personal favorite is –
Data Science = Machine Learning + Data Mining + Experimental Method
It may lack mathematical rigor, but it’s short, sweet – and, if we say so ourselves – spot-on!