Thursday, December 18, 2008

Data Mining Methodologies

I use the CRISP-DM methodology for all Data Mining projects as it is industry and tool neutral, and also the most comprehensive of all the methodologies available. Some Data Mining software vendors have come up with their own methodologies though they are basically the same. Check them out.


1. Defining the Problem: Analyze business requirements, define the scope of the problem, define the metrics by which the model will be evaluated, and define specific objectives for the data mining project.

2. Preparing Data: Remove/handle bad data, find correlations in the data, identify sources of data that are the most accurate, and determining which columns are the most appropriate for use in analysis.

3. Exploring the Data: Calculate the minimum and maximum values, calculate mean and standard deviations, and look at the distribution of the data.

4. Building Models: Specify the input columns, the attribute that you are predicting, and parameters that tell the algorithm how to process the data.

5. Exploring & Validating Models: Use the models to create predictions, which you can then use to make business decisions, create content queries to retrieve statistics, rules, or formulas from the model, embed data mining functionality directly into an application, update the models after review and analysis or update the models dynamically, as more data comes into the organization.


1. Problem Definition: Specify the project objectives and requirements from a business perspective, formulate it as a data mining problem and develop a preliminary implementation plan.

2. Data Gathering and Preparation: Take a closer look at the data, remove some of the data or add additional data, identify data quality problems, and scan for patterns in the data. Typical tasks include table, case, and attribute selection as well as data cleansing and transformation.

3. Model Building and Evaluation: Select and apply various modeling techniques and calibrate the parameters to optimal values. If the algorithm requires data transformations, step back to the previous phase to implement them.

4. Knowledge Deployment: Can involve scoring (the application of models to new data), the extraction of model details (for example the rules of a decision tree), or the integration of data mining models within applications, data warehouse infrastructure, or query and reporting tools.


1. Sample the data by creating one or more data tables. The sample should be large enough to contain the significant information, yet small enough to process.

2. Explore the data by searching for anticipated relationships, unanticipated trends, and anomalies in order to gain understanding and ideas.

3. Modify the data by creating, selecting, and transforming the variables to focus the model selection process.

4. Model the data by using the analytical tools to search for a combination of the data that reliably predicts a desired outcome.

5. Assess the data by evaluating the usefulness and reliability of the findings from the data mining process.

CRISP-DM (CRoss Industry Standard Process for Data Mining)

1. Business Understanding: Understand the project objectives and requirements from a business perspective, convert this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives.

2. Data Understanding: Collect initial data and proceed with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

3. Data Preparation: Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.

4. Modeling: Select and apply various modeling techniques, calibrate their parameters to optimal values, step back to the data preparation phase if needed.

5. Evaluation: Evaluate the model, review the steps executed to construct the model, to be certain it properly achieves the business objectives. At the end of this phase, a decision on the use of the data mining results should be reached.

6. Deployment: Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps.

Tuesday, November 25, 2008

Fraud Prediction - Decision Trees & Support Vector Machines (Classification)

My first thought when I was asked to learn and use Oracle Data Mining (ODM) was, “Oh no! Yet another Data Mining Software!!!”

It’s been about 2 weeks now since I have been using ODM, particularly focusing on two classification techniques – Decision Trees & Support Vector Machines. As I don’t want to get into the details of the interface/usability of ODM (unless Oracle pays me!!), I will limit this post on a comparison of these two classification techniques at a very basic level, using ODM.

A very brief introduction of DT & SVM.

DT – A flow chart or diagram representing a classification system or a predictive model. The tree is structured as a sequence of simple questions. The answers to these questions trace a path down the tree. The end product is a collection of hierarchical rules that segment the data into groups, where a decision (classification or prediction) is made for each group.

-The hierarchy is called a tree, and each segment is called a node.
-The original segment contains the entire data set, referred to as the root node of the tree.
-A node with all of its successors forms a branch of the node that created it.
-The final nodes (terminal nodes) are called leaves. For each leaf, a decision is made and applied to all observations in the leaf.

SVM – A Support Vector Machine (SVM) performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories.

In SVM jargon, a predictor variable is called an attribute, and a transformed attribute that is used to define the hyperplane is called a feature. A set of features that describes one case/record is called a vector. The goal of SVM modeling is to find the optimal hyperplane that separates clusters of vector in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other size of the plane. The vectors near the hyperplane are the support vectors.

SVM is a kernel-based algorithm. A kernel is a function that transforms the input data to a high-dimensional space where the problem is solved. Kernel functions can be linear or nonlinear.

The linear kernel function reduces to a linear equation on the original attributes in the training data. The Gaussian kernel transforms each case in the training data to a point in an n-dimensional space, where n is the number of cases. The algorithm attempts to separate the points into subsets with homogeneous target values. The Gaussian kernel uses nonlinear separators, but within the kernel space it constructs a linear equation.

I worked on this dataset which has fraudulent fuel card transactions. Two techniques I previously tried are Logistic Regression (using SAS/STAT) & Decision Trees (using SPSS Answer Tree). Neither of them was found to be suitable for this dataset/problem.

The dataset has about 300,000 records/transactions and about 0.06% of these have been flagged as fraudulent. The target variable is the fraud indicator with 0s as non-frauds, and 1s as frauds.

The Data Preparation consisted of missing value treatments, normalization, etc. Predictor variables that are strongly associated with the fraud indicator – both from the business & statistics perspective – were selected.

The dataset was divided into a Build Data (60% of the records) and Test Data (40% of the records).

Algorithm Settings for DT,

Accuracy/Confusion Matrix for DT,

Algorithm Settings for SVM,

Accuracy/Confusion Matrix for SVM,

We can see clearly that SVM is outperforming DT in predicting the fraudulent cases (93% vs. 72%).

Though it depends a lot on the data/business domain & problem, SVM generally performs well on data sets where there are very few cases on which to train the model.

Wednesday, October 29, 2008

Eight Levels Of Analytics

The number of companies advertising Analytics as one of their service offerings leaves me astounded. I also come across an endless number of people who say they are into Analytics.

Maybe all this has to do with the very broad definition of Analytics. Or the way some people/companies interpret Analytics. A little more research, and a few more minutes of conversation reveals that the "Analytics" these companies or people actually do is just reporting. Personally speaking, I feel "Analytics" has become a widely misinterpreted and misused term.

According to Gartner, “Analytics leverage data in a particular functional process (or application) to enable context-specific insight that is actionable.”

The lastest
sascom online magazine describes the eight level of Analytics, from the simplest to the most advanced.


What happened? When did it happen?
E.g. Monthly or quarterly financial reports.

Generated on a regular basis, they just describe “what happened” in a particular area. They’re useful to some extent, but not for making long-term decisions.


How many? How often? Where?
E.g. Custom reports that describe the number of hospital patients for every diagnosis code for each day of the week.

At their best, ad hoc reports let you ask the questions and request a couple of custom reports to find the answers.


Where exactly is the problem? How do I find the answers?
E.g. Sort and explore data about different types of cell phone users and their calling behaviors.

Query drilldown allows for a little bit of discovery. OLAP lets you manipulate the data yourself to find out how many, what color and where.


When should I react? What actions are needed now?
E.g. Sales executives receive alerts when sales targets are falling behind.

With alerts, you can learn when you have a problem and be notified when something similar happens again in the future. Alerts can appear via e-mail, RSS feeds or as red dials on a scorecard or dashboard.


Why is this happening? What opportunities am I missing?
E.g. Banks can discover why an increasing number of customers are refinancing their homes.

Here we can begin to run some complex analytics, like frequency models and regression analysis. We can begin to look at why things are happening using the stored data and then begin to answer questions based on the data.


What if these trends continue? How much is needed? When will it be needed?
E.g. Retailers can predict how demand for individual products will vary from store to store.

Forecasting is one of the hottest markets – and hottest analytical applications – right now. It applies everywhere. In particular, forecasting demand helps supply just enough inventory, so you don’t run out or have too much.


What will happen next? How will it affect my business?
E.g. Hotels and casinos can predict which VIP customers will be more interested in particular vacation packages.

If you have 10 million customers and want to do a marketing campaign, who’s most likely to respond? How do you segment that group? And how do you determine who’s most likely to leave your organization? Predictive modeling provides the answers.


How do we do things better? What is the best decision for a complex problem?
E.g. Given business priorities, resource constraints and available technology, determine the best way to optimize your IT platform to satisfy the needs of every user.

Optimization supports innovation. It takes your resources and needs into consideration and helps you find the best possible way to accomplish your goals.

Thursday, October 16, 2008

Market Basket Analysis

Market Basket Analysis (MBA) is the process of analyzing transactional level data to determine the likelihood that a set of items/products will be bought together.

Retailers use the results/observations from an MBA to understand the purchase behaviour of customers for cross-selling, store design, discount plans and promotions. MBA can, and should be done across different branches/stores as the customer demographics/profiles and their purchase behavior usually varies across regions.

The most common technique used in MBA is Association Rules. The three measures of Association Rules are - Support, Confidence, and Lift

A --> B = if a customer buys A, then B is also purchased
Condition --> Result
Antecedent --> Consequent

Support: Ratio of the # of transactions that includes both A & B to the total number of all transactions

Confidence: Ratio of the # of transactions with all items in the rule (A + B) to the # of transactions with items in the condition (A )

Lift: Indicates how much better the rule is at predicting the “result” or “consequent” as compared to having no rule at all, or how much better the rule does rather than just guessing

Lift = Confidence/P(result) = [P (A+B)/P(A)]/P(B)

If a customer buys milk, what is the likelihood of orange juice being purchased?

Milk --> Orange Juice
Customer Base: 1000
600 customers buy milk
400 customers buy orange juice
300 customers buy milk & orange juice

Support = P(milk & orange juice)/1000 = 300/1000 = 0.3

Confidence = P(milk & orange juice)/P(milk) = (300/1000)/(600/1000) = 0.5

Lift = Confidence/P(result) = 0.5/(400/1000) = 1.25

Interpretation: A customer who purchases milk is 1.25 times likely to purchase orange juice, than a randomly chosen customer.

Actionable: rules that can be justified and lead to actionable information
Trivial: rules that are obvious or already known (because of past/existing promotions, mandatory/required purchase of a stabilizer with an air conditioner…)
Inexplicable: rules that have no explanation and no course of action

Transactional data characterized by multiple rows per customer or order is the norm for MBA.

1. Choose the right set of items/level of detail – items, product category, brands…?
2. Generate rules - one-way rules (2 items, A-->B), 2-way rules (3 items, A & B --> C)…?
3. Limit the orders/items in the analysis by
- considering only orders having at least as many items as are in the rule
- requiring a minimum support for the rule
- removing the largest orders having multiple items/products

MBA doesn’t refer to a single technique but a set of business problems related to understanding of POS transaction data. The most popular of these techniques happens to be Association Rules.

Friday, September 19, 2008

Will WPS Do Anything To SAS?

One of the first things a fresher entering the field of analytics will say is “I want to learn SAS!” I used to have the same aspiration about 2 years back and getting certified in SAS was a dream then. But experience has taught me otherwise.

I work with SPSS and SAS most of the times, with MS Excel & MS SQL thrown in once in a while. When the dataset is huge, I use SAS. But for smaller datasets, I prefer to use SPSS most of the time.

The truth is people don’t use SAS because it gives more accurate results compared to other similar softwares. Nor do they use it because it’s user-friendly. In an ideal world, these two should be very important parameters for purchasing or using any software. From my experience, the top reason people use SAS is because of its data handling capacity & faster processing time. As they say, it’s all about marketing!! 

I also don’t agree with the opinion that a SAS certification is the BIGGEST thing an analytics/data mining professional need. The biggest marketing research firms use SPSS Base/Clementine. So if your interest is marketing research and you apply to these companies, your SAS certification is going to be of little value to them. Likewise, different verticals use different softwares.

I think vendor-neutral/tool-independent Data Mining knowledge should be the top priority for both analytics professionals and recruiters. Moreover, if you work in a team and you focus too much on the SAS, SPSS or SQL programming part/syntaxes, there’s the danger of becoming a Programmer or a Database specialist at the cost of becoming an Analytics Leader or Manager. But if you work independently as a consultant, you may be required to know everything!!

A lot of other companies are turning to open source or cheaper software like Weka, R, RapidMinerGate (for Text Mining), etc. The latest and most interesting (to me!!!) is the World Programming System (WPS) developed by the World Programming Company. The WPS is a code interpreter that can execute programs written in the language of SAS.

SAS programs can typically be executed with WPS with little or no modification because the language supported by WPS covers much of that found in Base SAS® plus many data access methods found in SAS/ACCESS®. There is also support for some basic language elements that users may be familiar with from other SAS products. There is also the WPS2R Bridge by which users can use WPS and R together to have access to R for statistics. The product comparison info is given below

A detailed cost comparison of WPS and SAS can also be found here. I doubt that this will affect Mr.Jim Goodnight's sleep, but SAS better start listening and do something about its ridiculous pricing.

Saturday, August 30, 2008

The Numerati

Data Mining is continuously being considered, applied and adopted in new areas. The Numerati by Stephen Baker has a very interesting chapter called, "The Worker."

At IBM's Thomas J. Watson Research Center, a team of data miners, statisticians and anthropologists is building mathematical models of their colleagues (50,000 of IBM's tech consultants) to improve productivity and automate management. The idea is to pile up inventories of all of their skills and then to calculate mathematically (the job fit, for example), how best to deploy them.

Quoting the author, "IBM, for example, will also be able to place workers and their skills into the same type of analytic software that they use for financial projections. This way, they will project skills that will be needed (or in surplus) in coming years. This eventually could result in something like futures markets for skills and workers."

The data sources used for the modeling include resumes, project records, online calendars, cell phone & handheld computer usage, call records and emails, etc.

The article also mentions an interesting example of how an IBM manager can select and assign a team of five to set up a call center in Manila.

The criticism or shall we say the skepticism is directed at this idea that the complexity of highly intelligent knowledge workers can be translated into equations and algorithms. Comments left by readers include concerns about freedom, privacy, harassment by the management, discrimination, etc.

But how is this different from the racial profiling techniques used by the United States government after 9/11? Or, insurance agencies charging different premiums to persons based on their demographic profiles?

My guess is that, in the near future a few companies are going to adopt what IBM is currently doing, in some form or the other. According to IBM - the workforce has become too big, the world too vast and complicated for managers to get a grip on their workers the old-fashioned way.

And then one day, will your manager come up to you and say that you’ve been assigned a different role because your "job fit" with the work you are currently doing is only 72%? Will you get promoted in your team because you scored 1% higher than your colleague?

Or will an unmentioned and unwritten class system based on an employee’s score define the workplace of tomorrow?

Thursday, August 21, 2008

A Few Questions Before You Churn!

Everyone seems to be modeling customer churn these days. But before you roll up your sleeves and take a dive, here are a few things I learned from David Ogden’s webcasts.

How will you use your churn model?
- Do you want to identify/rank likely churners?
- Do you want to identify/quantify the churn drivers?

Data Collection Window
- How much historical data do you want to use – 3 years data, 5 years data?

Prediction Window
- Who will churn next month? Who will churn in the next 6 months?

You build a model and predict who will churn next month. But what if the client’s business is such that it usually takes 2-3 months to implement the results from your churn model - set up campaigns, target customers with customized retention offers, send out mailers, etc.? Understand the client’s business and decide on an appropriate prediction window before simply doing what they ask.

Involuntary Churn vs. Voluntary Churn
- Voluntary churn occurs when a customer decides to switch to a competitor or another service provider because of dissatisfaction with the service or the associated fees
- Involuntary churn occurs due to factors like relocation, death, non-payment, etc.

Sometimes models are built leaving out one or the other group of customers. There is a clear difference between the two; decide which one is more important for the client’s business.

Drivers vs. Indicators
- Both influence churn, but drivers are those factors/measures that the company can control or manipulate. Indicators are mostly demographic measures, macro-economic factors, or seasonality, and they are outside the company's control.

Expected time to churn, vs. probability to churn tomorrow
- Survival Time Modeling answers the question, “What is the expected time to churn?” The response variable here is the Time (months, weeks, etc. until a customer will churn).
- Binary Response Modeling answers the question – “Who is likely to churn next week/month/quarter?” The response variable here is the Churn Indicator (customer stays or leaves).

Monday, August 4, 2008

Log Transformation

One of the most commonly used data transformation method is taking the natural logs of the original values. Log transformation works for data where the errors/residuals get larger for larger values of the variable (s). And this trend occurs in most data because the error or change in the value of a variable is often a percent of the value rather than an absolute value. For the same percent error, a larger value of the variable means a larger absolute error, so errors are larger too.

For example, a 5% error translates into an error that is 5% of the value of the variable. If the original value is 100, the error is 5% x 100, or 5. If the original value is 500, the error becomes 5% x 500, or 25.

When we take logs, this multiplicative factor becomes an additive factor, because of the nature of logs.

log(X * error) = log(X) + log(error)

The percent error therefore becomes the same additive error, regardless of the original value of the variable. In other words, the non-uniform errors become uniform. And that's why taking logs of the variable(s) helps in meeting the requirements for our statistical analysis most of the times.

A New View of Statistics website

Monday, July 28, 2008

Logistic Regression - Continous or Categorical?

A prediction/classification problem involving a lot of categorical variables and the first thing that comes to mind is Logistic Regression.

One thing I normally come across in Logistic Regression models is the low percentage of true positives, or cases/records correctly classified. And most of the times, the problem lies with the selection of the predictor variables. Many people tend to select as many predictor variables as they can. They have this wrong notion that they will miss something really BIG if they don’t include certain variables in the model.

And this is exactly where the idea of statisticians being the best and only candidates for analytics jobs is proved wrong. Someone with an understanding of the domain/business will easily point out the variables that will influence the independent/response variable. I always say to my managers – A Statistician, a Database Expert and an MBA are absolutely required for a successful Analytics Team.

Coming back to the accuracy of the Logistic Regression topic; while variable selection is the most important factor (besides the data quality, of course!!) influencing the accuracy of the model, I would like to say variable transformation and/or how you interpret the predictor variable is the second most important factor.

In a churn prediction model for a telecom company, I was working on Logistic Regression techniques and one of the predictor variables was “Months in Service”. In the initial runs, I specified it as a continuous variable in the model. After a lot of reruns that failed to increase the accuracy of the model, something made me think about the relation between “Probability of Churn” & “Months in Service”. Will the probability increase with an increase in the months of service? Will it decrease? Or will it be a little more complicated - with a lot of customers leaving in the initial few months of service, staying back for the next couple of months, and then churning again for another block of months, and so on?

I reran the model, this time specifying ”Months in Service” as a categorical variable. And the model accuracy shot up by about 12%!!!

Friday, June 13, 2008

Factor Analysis - Work Orientation Survey

The data used in this project has been taken from the "2005 - Work Orientation" SURVEY CONDUCTED BY the International Social Survey Programme.

Total number of cases/records: 43,440

No. of variables: 91

In the survey, questions were asked on job perception, job satisfaction, working conditions, job content, job commitment, etc. Which of these job parameters/variables are strongly related? Which of them can be grouped together? Which scores/ratings should be used to measure a respondent's overall job satisfaction, job commitment, job security etc.?

The original data was in text format and it was read using the SAS column input method. Based on the analysis objective, out of the total 43,440 records, only employed (both full-time & part-time) respondents were selected for the analysis.

In the employed data, there are 24268 records. And out of the 91 variables, all country specific variables were removed from the final dataset. From the remaining variables, 38 rating/likert scale variables were selected.

Assuming these as ordinal variables, the spearman rank correlation was considered to be the most appropriate correlation for generating the correlation matrix/output data which will be used for running the Factor Analysis. Based on the MSA values and the significant factor loadings, 8 variables were removed during the analysis procedures.

An interesting thing turned up while using both the default Pearson's and the more appropriate Spearman's correlation in the analysis. When the Spearman correlation was used, 8 factors were extracted. But when I tried to summarize the variables based on these factors, I was not satisfied as the variables have been divided into too many small groups without any pattern or consistency in their meanings.

But when I used the default Pearson's correlation (in proc factor), I got 4 factors only. But the best thing was that the related variables have been grouped together under each of these factors. For example - Job content, Job Security, Work-Life Balance, & Job Satisfaction were clubbed together. While Work Environment, Working Relations, Organization Image & Job Commitment came under one factor.

The second approach of using Pearson's correlation while running Factor Analysis was thus found to give a much better, useful, and meaningful result in spite of what textbooks say about using Pearson's correlation on rating scale variables.