Thursday, December 18, 2008

Data Mining Methodologies

I use the CRISP-DM methodology for all Data Mining projects because it is industry- and tool-neutral, and also the most comprehensive of the methodologies available. Some Data Mining software vendors have come up with their own methodologies, though they are basically the same. Check them out.


1. Defining the Problem: Analyze business requirements, define the scope of the problem, define the metrics by which the model will be evaluated, and define specific objectives for the data mining project.

2. Preparing Data: Remove/handle bad data, find correlations in the data, identify the sources of data that are the most accurate, and determine which columns are the most appropriate for use in analysis.

3. Exploring the Data: Calculate the minimum and maximum values, calculate the mean and standard deviation, and look at the distribution of the data.

4. Building Models: Specify the input columns, the attribute that you are predicting, and parameters that tell the algorithm how to process the data.

5. Exploring & Validating Models: Use the models to create predictions, which you can then use to make business decisions; create content queries to retrieve statistics, rules, or formulas from the model; embed data mining functionality directly into an application; and update the models after review and analysis, or dynamically as more data comes into the organization.
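To make the "Exploring the Data" step in the list above a little more concrete, here is a minimal sketch using only the Python standard library. The `summarize` helper, the `bin_width` parameter, and the `ages` column are all made up for illustration; they are not part of any vendor's tooling.

```python
import statistics
from collections import Counter


def summarize(values, bin_width):
    """Return basic descriptive statistics and a coarse histogram."""
    # Bucket each value by the lower edge of its bin, e.g. 35 -> 30.
    histogram = Counter((v // bin_width) * bin_width for v in values)
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "histogram": dict(sorted(histogram.items())),
    }


# Hypothetical column of customer ages, invented for this example.
ages = [23, 35, 31, 47, 52, 29, 38, 41, 35, 60]
summary = summarize(ages, bin_width=10)
for name, value in summary.items():
    print(name, value)
```

Even a rough histogram like this is often enough to spot skew or outliers before any modeling starts.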


1. Problem Definition: Specify the project objectives and requirements from a business perspective, formulate them as a data mining problem, and develop a preliminary implementation plan.

2. Data Gathering and Preparation: Take a closer look at the data, remove some of the data or add additional data, identify data quality problems, and scan for patterns in the data. Typical tasks include table, case, and attribute selection as well as data cleansing and transformation.

3. Model Building and Evaluation: Select and apply various modeling techniques and calibrate the parameters to optimal values. If the algorithm requires data transformations, step back to the previous phase to implement them.

4. Knowledge Deployment: Can involve scoring (the application of models to new data), the extraction of model details (for example the rules of a decision tree), or the integration of data mining models within applications, data warehouse infrastructure, or query and reporting tools.
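As a rough illustration of the scoring mentioned under Knowledge Deployment above, the sketch below applies a few rules, of the kind that might be extracted from a decision tree, to new cases. The rules, field names (`age`, `income`), and labels are hypothetical and invented for this example; a real deployment would use whatever the model actually produced.

```python
# Hypothetical rules, as they might be extracted from a decision tree.
# Each rule is (condition, predicted label); the first matching rule wins.
RULES = [
    (lambda c: c["income"] >= 50000 and c["age"] < 40, "likely buyer"),
    (lambda c: c["income"] >= 50000, "possible buyer"),
]
DEFAULT_LABEL = "unlikely buyer"


def score(case):
    """Apply the extracted rules to one new case and return its label."""
    for condition, label in RULES:
        if condition(case):
            return label
    return DEFAULT_LABEL


# New data the model has not seen before (made-up values).
new_cases = [
    {"age": 30, "income": 62000},
    {"age": 55, "income": 80000},
    {"age": 25, "income": 20000},
]
predictions = [score(c) for c in new_cases]
print(predictions)
```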


1. Sample the data by creating one or more data tables. The sample should be large enough to contain the significant information, yet small enough to process.

2. Explore the data by searching for anticipated relationships, unanticipated trends, and anomalies in order to gain understanding and ideas.

3. Modify the data by creating, selecting, and transforming the variables to focus the model selection process.

4. Model the data by using the analytical tools to search for a combination of the data that reliably predicts a desired outcome.

5. Assess the data by evaluating the usefulness and reliability of the findings from the data mining process.
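The Sample step in the list above can be sketched in a few lines. This is only a simple random sample with a fixed seed so the draw is repeatable; the `draw_sample` helper and the stand-in `population` are assumptions for illustration, and real projects may need stratified or weighted sampling instead.

```python
import random
import statistics


def draw_sample(rows, size, seed=0):
    """Draw a simple random sample without replacement, repeatably."""
    rng = random.Random(seed)  # private generator, so the seed is local
    return rng.sample(rows, size)


# Stand-in for 1000 records; a real table would hold full rows.
population = list(range(1, 1001))
sample = draw_sample(population, size=100)

# A quick sanity check that the sample still resembles the full data.
print("population mean:", statistics.mean(population))
print("sample mean:", statistics.mean(sample))
```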

CRISP-DM (CRoss Industry Standard Process for Data Mining)

1. Business Understanding: Understand the project objectives and requirements from a business perspective, then convert this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

2. Data Understanding: Collect initial data and proceed with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

3. Data Preparation: Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.

4. Modeling: Select and apply various modeling techniques, calibrate their parameters to optimal values, and step back to the data preparation phase if needed.

5. Evaluation: Evaluate the model and review the steps executed to construct it, to be certain it properly achieves the business objectives. At the end of this phase, a decision on the use of the data mining results should be reached.

6. Deployment: Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps.
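The Modeling and Evaluation phases above can be sketched with a toy example: calibrate a single parameter (a cut-off threshold) on training data, then check the result on data held back for evaluation. CRISP-DM itself does not prescribe any of this; the one-parameter threshold "model", the helper names, and the data are all hypothetical.

```python
def accuracy(threshold, rows):
    """Fraction of rows where (value >= threshold) matches the label."""
    correct = sum((value >= threshold) == label for value, label in rows)
    return correct / len(rows)


def calibrate(rows, candidates):
    """Modeling: pick the candidate threshold with the best accuracy."""
    return max(candidates, key=lambda t: accuracy(t, rows))


# Hypothetical (value, label) pairs, invented for this example.
train = [(1, False), (2, False), (3, False), (4, True), (5, True), (6, True)]
holdout = [(2, False), (3, True), (5, True), (7, True)]

best = calibrate(train, candidates=[2, 3, 4, 5])

# Evaluation: the training score is optimistic, so also score held-out data.
print("threshold:", best)
print("train accuracy:", accuracy(best, train))
print("holdout accuracy:", accuracy(best, holdout))
```

The gap between the two accuracy figures is the kind of evidence the Evaluation phase weighs before deciding whether the results are fit to deploy.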