Discovery, Truth and Utility: Defining ‘Data Science’

Frank and Martin explore the definition of data science.

May 15, 2017 · 4 min read

Gregory Piatetsky-Shapiro knows a thing or two about extracting insight from data. In 1989 he co-founded the first Knowledge Discovery and Data Mining workshop, which we briefly discussed in the second installment of this series of blogs. And he has been practicing and instructing pretty much continuously since then.

But what is it, exactly, that he has been practicing? Even Piatetsky-Shapiro might struggle to give you a consistent answer to that question, as this quote of his from 2012 hints:

Although the buzzwords describing the field have changed – from ‘knowledge discovery’ to ‘data mining’ to ‘predictive analytics’, and now to ‘data science’, the essence has remained the same – discovery of what is true and useful in mountains of data.

We like this quote a lot. Firstly, because it speaks to the fact that historically we have used at least four different terms (knowledge discovery, data mining, predictive analytics and data science) to describe substantially the same thing. The tools, techniques and technologies that we use continue to evolve, but our objective is basically the same.

And the second reason that we like this quote so much is that it contains three words that we think are key to understanding the analytic process.

Discovery. True. And Useful.

Let’s take each of these in turn.

Analytics is fundamentally about discovery. It’s about revealing patterns in data that we didn’t know existed – and extrapolating from them to try and know things that we otherwise wouldn’t know.

In fact, the analytic discovery process has more in common with research and development (R&D) than with software engineering. If we are doing it right, we should have a reasonably clear idea of the business challenges or opportunities that we are trying to address. For example, we may want to measure customer sentiment to establish whether it is correlated with store performance, and to understand which parts of the shopping experience we should improve to increase customer satisfaction. Or we might want to predict the failure of train-sets based on patterns in sensor data. But often we won’t know which approach is likely to be most successful, whether the data available to us can support the desired outcome – or even whether the project is feasible at all. And that means, first and foremost, that whatever we call it, analytics is about experimentation. Repeated experimentation. As Foster Provost and Tom Fawcett put it in their (excellent) textbook Data Science for Business: “the results of a given step may change the fundamental understanding of the problem.” Traditional notions of scope and requirements are therefore often difficult to apply to analytics projects.

Secondly, whilst many process models have been developed to try and codify the analytic process and so make it more reliable and repeatable – of which the Cross-Industry Standard Process for Data Mining (CRISP-DM) shown below is probably the most successful and the most widely known – the reality is that analytics is an iterative, rather than a linear, process. We can’t simply execute each step of the process in turn and hope that insight will miraculously “pop” out of the end. An unsuccessful attempt at modelling, say, customer propensity-to-buy may cause us to revisit the data preparation step to create new metrics that we hope will be more predictive. Or it may cause us to realize that we are insufficiently clear in our understanding of the business problem – and require us to start over. One important outcome of all of this is that “failure” rates for analytics initiatives are high. Often, these “failures” really aren’t failures in the traditional sense at all; rather, they represent important learning about which approaches, tools and techniques are relevant to a particular problem. The industry refers to this as “fail fast”, although it might be more appropriate to call it a “learn quick” approach to analytics. But whatever we call it, this high failure rate has important consequences for the way we organize and manage analytic projects, which we will return to later in this series.

There are many ways in which data can mislead, rather than inform us. Sometimes we can find results that appear to be interesting, but that are not statistically significant. We may conflate correlation with causality. Or we may be misled by Simpson’s paradox.  Paradoxically, as Kaiser Fung points out in his book Numbersense, big data can get us into big trouble, by multiplying the number of blind alleys and irrelevant correlations that we can chase - and so causing us to waste precious time and organizational resources.
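Simpson’s paradox is easier to believe once you have seen the arithmetic. The sketch below is a minimal illustration, not taken from any real project: the two campaigns, the customer segments and every count are invented for the purpose. It shows how campaign A can convert better than campaign B within every segment and still look worse once the segments are pooled, simply because the two campaigns were aimed at very different mixes of customers.

```python
from fractions import Fraction

# Hypothetical (conversions, contacts) per campaign and customer segment.
# All names and numbers are invented purely to illustrate the effect.
results = {
    "A": {"new": (81, 87), "returning": (192, 263)},
    "B": {"new": (234, 270), "returning": (55, 80)},
}


def rate(conversions, contacts):
    """Conversion rate as an exact fraction, to avoid rounding surprises."""
    return Fraction(conversions, contacts)


def segment_rates(campaign):
    """Conversion rate of one campaign within each segment."""
    return {seg: rate(*counts) for seg, counts in results[campaign].items()}


def overall_rate(campaign):
    """Conversion rate of one campaign with all segments pooled."""
    conversions = sum(c for c, _ in results[campaign].values())
    contacts = sum(n for _, n in results[campaign].values())
    return rate(conversions, contacts)


# Campaign A beats campaign B inside every individual segment...
a_wins_every_segment = all(
    segment_rates("A")[seg] > segment_rates("B")[seg] for seg in results["A"]
)
# ...yet campaign B beats campaign A on the pooled figures.
b_wins_overall = overall_rate("B") > overall_rate("A")

for campaign in results:
    per_segment = {s: f"{float(r):.1%}" for s, r in segment_rates(campaign).items()}
    print(campaign, per_segment, f"overall {float(overall_rate(campaign)):.1%}")
print("A wins in every segment:", a_wins_every_segment)
print("B wins overall:", b_wins_overall)
```

Which comparison is the “true” one depends on how customers ended up in each campaign, and that is a question about the business, not one that the numbers can settle on their own.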

But something even more basic can also trip us up: data quality. The most sophisticated techniques, algorithms and analytic technologies are still hostage to the quality of our data.  If we feed them garbage, garbage is what they will give us in return.

We cannot automatically assume that data are “true” – in particular, because the data that we are seeking to re-use and re-purpose for our analytics project are likely to have been collected to serve very different purposes. Analytics of the sort that we are undertaking may never have been intended or foreseen. That is why the CRISP-DM model places so much emphasis on “data discovery”; it is important that we first understand whether the data that are available to us are “fit for purpose” – or whether we need to change our purpose, get better data, or both.
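A first “fit for purpose” check can be very simple. The sketch below assumes a handful of hypothetical sensor readings; the field names, the sample records and the plausible temperature range are all assumptions made for illustration. It counts missing values per field and readings outside the plausible range, which is often enough to tell us whether a data set deserves a closer look before any modelling starts.

```python
# Hypothetical sensor readings being re-purposed for an analytics project.
# Field names, values and the plausible range are illustrative assumptions.
records = [
    {"sensor_id": "s1", "temperature_c": 71.5, "recorded_at": "2017-05-01"},
    {"sensor_id": "s2", "temperature_c": None, "recorded_at": "2017-05-01"},
    {"sensor_id": "s3", "temperature_c": -999.0, "recorded_at": "2017-05-02"},
    {"sensor_id": None, "temperature_c": 68.2, "recorded_at": ""},
]

PLAUSIBLE_TEMPERATURE_C = (-40.0, 150.0)  # assumed physical limits


def profile(rows):
    """Count missing values per field and implausible temperature readings."""
    fields = sorted({key for row in rows for key in row})
    missing = {
        field: sum(1 for row in rows if row.get(field) in (None, ""))
        for field in fields
    }
    low, high = PLAUSIBLE_TEMPERATURE_C
    out_of_range = sum(
        1
        for row in rows
        if row.get("temperature_c") is not None
        and not low <= row["temperature_c"] <= high
    )
    return {
        "rows": len(rows),
        "missing": missing,
        "temperature_out_of_range": out_of_range,
    }


report = profile(records)
print(report)
```

A placeholder such as -999.0 is a typical example of data that served its original purpose (flagging a faulty reading to an operator) but would quietly distort an average or a model if we re-used it unchecked.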

Defining data science

So how, then, should we define data science? Spend 10 minutes with Google and you will find plenty of contradictory definitions. Our personal favorite is –

Data Science = Machine Learning + Data Mining + Experimental Method

It may lack mathematical rigor, but it’s short, sweet – and, if we say so ourselves - spot-on!

About Martin Willcox

Martin has over 27 years of experience in the IT industry and has twice been listed in dataIQ’s “Data 100” as one of the most influential people in data-driven business. Before joining Teradata, Martin held data leadership roles at a major UK retailer and a large conglomerate. Since joining Teradata, Martin has worked globally with over 250 organisations to help them realise increased business value from their data. He has helped organisations develop data and analytic strategies aligned with business objectives; designed and delivered complex technology benchmarks; pioneered the deployment of “big data” technologies; and led the development of Teradata’s AI/ML strategy. Originally a physicist, Martin has a postgraduate certificate in computing and continues to study statistics.

About Dr. Frank Säuberlich

Dr. Frank Säuberlich leads the Data Science & Data Innovation unit of Teradata Germany. His responsibilities include making the latest market and technology developments available to Teradata customers. Currently, his main focus is on topics such as predictive analytics, machine learning and artificial intelligence.
Following his studies of business mathematics, Frank Säuberlich worked as a research assistant at the Institute for Decision Theory and Corporate Research at the University of Karlsruhe (TH), where he was already dealing with data mining questions.

His professional career included the positions of a senior technical consultant at SAS Germany and of a regional manager customer analytics at Urban Science International. Frank has been with Teradata since 2012. He began as an expert in advanced analytics and data science in the International Data Science team. Later on, he became Director Data Science (International).
