
Statistical Investigations Pty. Ltd.

ACN 075 797 401

2/6 Garden Court, Elwood, Victoria, Australia 3184
Tel +61 3 9531 5249 Fax +61 3 9525 7822 Email: tony@statistical.com.au

What is "Data Mining"?

(Revised 17/10/1999)

"Data mining … is the exploration and analysis, by automatic and semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules." (Berry, J. A. & Linoff, G. (1997). Data mining Techniques For Marketing, Sales and Customer Support, John Wiley & Sons, Inc. New York, p.5, http://www.data-miners.com/order/order.html/ )

"Data mining is the process of selecting, exploring, and modelling large amounts of data to uncover previously unknown patterns of data for business advantage." (SAS Institute Inc., http://www.sas.com/software/data_mining/ )

"Data mining simply means finding patterns in your business data which you can use to do your business better" (ISL, http://www.isl.co.uk/DM/vision.htm )

"… the use of statistical analysis and machine learning techniques, in a semiautomatic fashion, on large collections of data." (Jorgensen, M. & Gentleman, R. (1998). Data Mining. Chance 11, 34–42.)

Several concepts recur in these definitions: large quantities of data, a degree of automation, pattern finding and a business motivation. However, the definitions are vague about how these concepts fit together in practice. In part this is because data mining (DM) has mostly evolved and been applied within the 'data and business process'. That is, before DM the appropriate data must be accessed, prepared and manipulated; after DM, decision makers must interpret the results, understand their implications and be prepared to make the appropriate operational changes. The role of DM in the data and business process is somewhat similar to that of quality control (QC) in total quality management (TQM). Without an understanding of the manufacturing process (so that the appropriate quantities are monitored) and the will to improve processes found to be deficient, QC cannot contribute to product improvement. Likewise, to be successful DM must be an embedded and routine part of the data and business process. As a consequence, DM is hard to define succinctly.

David Hand (Hand, D. (1998). Data Mining: Statistics and More? The American Statistician, 52, pp. 112–118) uses the following definition.

"Data mining is a new discipline lying at the interface of statistics, database technology, pattern recognition, machine learning, and other areas. It is concerned with the secondary analysis of large databases in order to find previously unsuspected relationships which are of interest or value to the database owners."

Hand uses the term 'secondary' to indicate that the analysis in data mining is carried out on data that were collected for purposes other than analysis (e.g. ATM transactions, credit card transactions, bank accounts, billing information, etc.). This contrasts with statistics, which is mostly (although not always) concerned with 'primary' data analysis, that is, the analysis of data collected primarily for that purpose (planned experiments, surveys, longitudinal studies, clinical trials, etc.).

David Hand (all following quotes are from his article cited above) then goes on to raise some issues that distinguish DM from 'mainstream' statistics. They are:

Large Size of Data Sets (millions/billions of records, thousands of variables):

Even with the large memory capacities of modern computers, data on this scale have to be analyzed adaptively or sequentially. Typically, the information to be analyzed is not stored as a single flat file but in several files that may be linked in a relational database. "As a consequence of the structured way in which data are necessarily stored, it might be the case that straightforward statistical methods cannot be applied ...". The usual procedures for drawing statistical inferences may not be appropriate because, with very large data sets, even very tiny effects will be statistically significant. "In place of statistical significance, we need to consider more carefully substantive significance: is the effect important or valuable or not?"
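The point about statistical versus substantive significance can be illustrated with a small calculation. The sketch below is not taken from Hand's article. It assumes a hypothetical difference of 0.5 units between two groups on a measurement whose standard deviation is 100, and uses a simple two-sample z-test with a known common standard deviation. The function name and all the numbers are illustrative assumptions.

```python
import math


def two_sample_z_pvalue(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means between two equal-sized
    groups, assuming a known common standard deviation (z-test)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical effect: tiny relative to the spread of the data.
MEAN_DIFF = 0.5
SD = 100.0

for n in (1_000, 100_000, 10_000_000):
    p = two_sample_z_pvalue(MEAN_DIFF, SD, n)
    print(f"n per group = {n:>10,}  p-value = {p:.3g}")
```

The effect is the same size in every row. Only the number of records changes, yet the p-value moves from unremarkable to vanishingly small. Whether a difference of 0.5 units matters to the business is a separate question that the p-value cannot answer.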

Contaminated Data:

"... when data sets are large, it is practically certain that some of the data will be invalid in some way." Yet the amount of data and its origin as part of a complex database structure means that checking source data may not be feasible. As it is often not be possible to identify and correct invalid data, prior to DM, some form of imputation may be required and/or the DM methods themselves must be robust to missing and incomplete data.

Selection bias and dependent observations:

"In general, very large data sets are likely to have been subjected to selection bias of various kinds - they are likely to be convenience or opportunity samples rather than statisticians' idealized random samples." For example, "... of people offered a bank loan, comprehensive data is available only for those who take up the offer. If these are used to construct the models, to make inferences about the behaviour of future applicants, then errors are likely to introduced."

Finding 'useful' patterns:

To recognise a pattern one must have some idea what one is looking for and what constitutes a 'useful' pattern. "The essence of data mining is that one does not know precisely what sort of structure one is seeking, so a fairly general definition will be appropriate". "Familiarity with the problem domain and a willingness to try ad hoc approaches seems essential …".

Statistical Fishing:

The pejorative terms 'statistical fishing' and 'data dredging' describe the practice of repeatedly fitting models to data in the hope of finding some 'interesting' results. If this is done often enough, some interesting results will turn up by chance alone. This is a recognised problem in conventional statistics, but in DM "… with huge data sets it is possible to model small idiosyncrasies which are of little practical import." How do we know our useful patterns are not spurious, especially given the possibly biased nature of our data?
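How quickly chance findings accumulate can be shown with a short calculation. The sketch below assumes that the tests are independent and that there is truly nothing to find, both of which are simplifications. The function name is hypothetical.

```python
def chance_of_spurious_finding(alpha, n_tests):
    """Probability that at least one of n_tests independent tests comes out
    'significant' at level alpha when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


for k in (1, 10, 100, 1000):
    prob = chance_of_spurious_finding(0.05, k)
    print(f"{k:>5} tests -> chance of at least one spurious result: {prob:.3f}")
```

At the conventional 5% level, ten looks at the data already give roughly a 40% chance of at least one spurious 'finding', and a hundred looks make one all but certain.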

Although the references given below discuss the popular DM techniques (decision trees, neural networks, memory-based reasoning, cluster analysis, logistic regression, genetic algorithms, etc.), they also contain many case studies and examples of DM in areas such as database marketing, fraud detection, credit scoring, market basket analysis, segmentation, even astrophysics, and more. It is via these case studies and examples that DM is really understood. Understanding the techniques (with reference to "statistics, database technology, pattern recognition, machine learning", etc.) is secondary to understanding why you are using them in the first place. And that at least requires an understanding of the cycle of data acquisition, analysis, decision making, … , whether in the field of 'data and business process', TQM, scientific research, etc.

To find out more about DM:

books, articles, etc.

Hand, D. (1998). Data Mining: Statistics and More? The American Statistician, 52, pp. 112–118.

Jorgensen, M. & Gentleman, R. (1998). Data Mining. Chance 11, 34–42.

Mackinnon, M. J. & Glick, Ned (1999). Data Mining and Knowledge Discovery in Databases — An Overview. The Australian and New Zealand Journal of Statistics, 41, pp. 255-275.

Berry, M. J. A. & Linoff, G. (1997). Data Mining Techniques for Marketing, Sales and Customer Support, John Wiley & Sons, Inc., New York.

http://www.data-miners.com/books/suggested.html

software reviews

software vendors http://www.data-miners.com/products/vendors.html

For Melbourne readers of this page there is The Data Mining Special Interest Group.

For Sydney readers try nbhalerao@westpac.com.au