Tuesday, November 25, 2008

Fraud Prediction - Decision Trees & Support Vector Machines (Classification)

My first thought when I was asked to learn and use Oracle Data Mining (ODM) was, “Oh no! Yet another data mining tool!!!”

I’ve been using ODM for about two weeks now, focusing on two classification techniques – Decision Trees (DT) and Support Vector Machines (SVM). As I don’t want to get into the details of the interface/usability of ODM (unless Oracle pays me!!), I will limit this post to a comparison of these two classification techniques at a very basic level, using ODM.

A very brief introduction to DT & SVM.

DT – A flow chart or diagram representing a classification system or a predictive model. The tree is structured as a sequence of simple questions. The answers to these questions trace a path down the tree. The end product is a collection of hierarchical rules that segment the data into groups, where a decision (classification or prediction) is made for each group.

- The hierarchy is called a tree, and each segment is called a node.
- The original segment, containing the entire data set, is referred to as the root node of the tree.
- A node with all of its successors forms a branch of the node that created it.
- The final nodes (terminal nodes) are called leaves. For each leaf, a decision is made and applied to all observations in the leaf.
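
As a minimal illustration of these ideas (a sketch in Python/scikit-learn on synthetic data – this is not ODM, and all names here are illustrative), the tree's "sequence of simple questions" can be printed out as explicit hierarchical rules:

    # Minimal sketch (scikit-learn, synthetic data): fit a shallow tree and
    # print the hierarchical rules it learned.
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Synthetic two-class data standing in for transaction records
    X, y = make_classification(n_samples=1000, n_features=4, random_state=0)

    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X, y)

    # Each root-to-leaf path is one rule; the leaf's majority class is the
    # decision applied to all observations falling into that leaf.
    print(export_text(tree, feature_names=[f"attr_{i}" for i in range(4)]))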


SVM – A Support Vector Machine (SVM) performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories.

In SVM jargon, a predictor variable is called an attribute, and a transformed attribute that is used to define the hyperplane is called a feature. A set of features that describes one case/record is called a vector. The goal of SVM modeling is to find the optimal hyperplane that separates clusters of vectors in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side of the plane. The vectors near the hyperplane are the support vectors.
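
A quick sketch of this jargon in code (Python/scikit-learn on a toy data set – not ODM): a fitted linear SVM exposes both the hyperplane and the support vectors.

    # Minimal sketch: the fitted model exposes the support vectors, i.e. the
    # training cases that lie nearest the separating hyperplane.
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])  # vectors
    y = np.array([0, 0, 1, 1])               # two categories of the target

    clf = SVC(kernel="linear").fit(X, y)
    print(clf.support_vectors_)              # the vectors near the hyperplane
    print(clf.coef_, clf.intercept_)         # hyperplane: coef . x + intercept = 0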


SVM is a kernel-based algorithm. A kernel is a function that transforms the input data to a high-dimensional space where the problem is solved. Kernel functions can be linear or nonlinear.

The linear kernel function reduces to a linear equation on the original attributes in the training data. The Gaussian kernel transforms each case in the training data to a point in an n-dimensional space, where n is the number of cases. The algorithm attempts to separate the points into subsets with homogeneous target values. The Gaussian kernel uses nonlinear separators, but within the kernel space it constructs a linear equation.
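
To make the linear/Gaussian distinction concrete, here is a hedged sketch (Python/scikit-learn, synthetic data – not ODM) on a problem that is not linearly separable in the original attributes:

    # Minimal sketch: contrast the two kernel types on concentric circles,
    # a pattern no linear equation on the original attributes can separate.
    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

    linear_svm = SVC(kernel="linear").fit(X, y)  # linear equation on the attributes
    gaussian_svm = SVC(kernel="rbf").fit(X, y)   # Gaussian kernel, nonlinear separator

    print("linear kernel accuracy:  ", linear_svm.score(X, y))
    print("gaussian kernel accuracy:", gaussian_svm.score(X, y))

The Gaussian kernel fits this data almost perfectly, while the linear kernel can do little better than chance.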



I worked on a dataset of fuel card transactions that includes fraudulent ones. Two techniques I had previously tried were Logistic Regression (using SAS/STAT) and Decision Trees (using SPSS Answer Tree); neither was found suitable for this dataset/problem.

The dataset has about 300,000 records/transactions and about 0.06% of these have been flagged as fraudulent. The target variable is the fraud indicator with 0s as non-frauds, and 1s as frauds.

Data preparation consisted of missing-value treatment, normalization, etc. Predictor variables that are strongly associated with the fraud indicator – from both the business and statistical perspectives – were selected.
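
A hedged sketch of this kind of preparation (Python/scikit-learn; the column names are hypothetical, not the actual predictors):

    # Minimal sketch: median imputation for missing values, then rescaling
    # each predictor to [0, 1].
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import MinMaxScaler

    df = pd.DataFrame({"txn_amount": [120.0, None, 45.5, 980.0],
                       "txn_count_30d": [3.0, 12.0, None, 7.0]})

    imputed = SimpleImputer(strategy="median").fit_transform(df)   # missing values
    normalized = MinMaxScaler().fit_transform(imputed)             # normalization
    print(normalized)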

The dataset was divided into Build Data (60% of the records) and Test Data (40% of the records).
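
With only ~0.06% frauds, a stratified split is the safe way to do this; a sketch (Python/scikit-learn, on synthetic data of the same shape as described above):

    # Minimal sketch: a 60/40 build/test split, stratified so the rare
    # fraud class is represented in both partitions.
    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300_000, 5))               # ~300,000 transactions
    y = (rng.random(300_000) < 0.0006).astype(int)  # ~0.06% flagged as fraud

    X_build, X_test, y_build, y_test = train_test_split(
        X, y, test_size=0.40, stratify=y, random_state=0)
    print(y_build.mean(), y_test.mean())  # fraud rate preserved in both sets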


Algorithm Settings for DT:

[screenshot]

Accuracy/Confusion Matrix for DT:

[screenshot]

Algorithm Settings for SVM:

[screenshot]

Accuracy/Confusion Matrix for SVM:

[screenshot]



We can clearly see that SVM outperforms DT in predicting the fraudulent cases (93% vs. 72%).
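
For readers without the screenshots: the figure being compared is the per-class accuracy on the frauds (the recall), read off the confusion matrix. A sketch of the computation (Python/scikit-learn; the predictions below are placeholders, not the ODM output):

    # Minimal sketch: per-class recall from a binary confusion matrix.
    from sklearn.metrics import confusion_matrix

    y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]  # 1 = fraud
    y_pred = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1]  # hypothetical model output

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("fraud recall:", tp / (tp + fn))   # share of frauds caught
    print("false positives:", fp)            # non-frauds flagged as fraud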

Though it depends a lot on the data/business domain & problem, SVM generally performs well on data sets where there are very few cases on which to train the model.

9 comments:

Sandro Saitta said...

Thanks for the details, Romakanta. Did you obtain the same kind of results with Decision Trees (using SPSS Answer Tree) as your current results (72%)?

jonathan polon said...

I don't necessarily agree that SVM outperforms decision trees. While SVM correctly classified a larger number of the fraudulent cases, it also had a greater number of false positives (non-fraud classified as fraud). The better model depends on the relative costs of misclassifying fraud as non-fraud and misclassifying non-fraud as fraud. Alternatively, you can adjust some settings (like the prior probabilities or misclassification matrix on the trees) until both methods correctly predict the same number of fraudulent cases - the better model will then be the one with fewer false positives.

Datalligence said...

you are absolutely right jonathan. this post was meant to be a comparison of the 2 techniques on a very very basic level (i mentioned that too).

any model's performance has to be judged on a whole lot of parameters – true +ves/-ves, false +ves/-ves, misclassification costs (if the info is available), gains/lift charts....

also, in a majority of fraud prediction problems, the emphasis is on true positives as the cost of fraud typically tends to be higher (i'm generalizing here!) than the time/cost/inconvenience arising from the false positives.

Bhupendra said...

Great article. Thanks for the details.

I have two pieces of feedback on the article.

1. I am not surprised that SVM outperformed DT, as it almost always does. Neural Networks would have been a better comparison. But the issue with SVM and NN is that they tend to overfit the data heavily. It will be interesting to see if you get similar results on in-time and out-of-time validation data sets. I have always seen a significant drop in performance for SVM models.

2. It is really nice to see that Oracle's tools provide so many facilities. I worked with Microsoft SQL Server Analysis Services for a week, and was impressed with their tool too. With biggies joining this market, it is going to be interesting.

-- Bhupendra

Datalligence said...

sandro: SPSS has 4 algos for Decision Trees. I built and tested a DT model on the same dataset with the same inputs (same transformation), using CHAID as the tree-growing criterion. The accuracy was in the 50s. I will check on the other 3 algos in SPSS and let you know if i have time:-)

Datalligence said...

yup bhupen, SVM & ANN would be a more appropriate comparison, but ODM (10g) doesn't have ANN :-)
i tested the models on a few more datasets; performance dropped in both cases, a lot more so in the case of DT.

have heard a lot about MS SQL Server DM, would love to check it out. yeah, you can be sure that the 2 biggies are going to be major players in the DM market soon, 'coz the world's data reside in their systems!

Anonymous said...

Oracle Data Mining release 10.2 introduced a new variation of the Support Vector Machine algorithm, called 1-class SVM. 1-class SVMs are specially designed for fraud and anomaly detection when you lack examples of the "rare events". In those cases, you can do various tricks w/ stratified samples, ROC, and/or a cost matrix for false pos/neg costs, but all struggle.

1-class SVMs work on the principle of learning what is considered "normal" – e.g. expenses, phone calls, employees, etc. If your training data contains examples of the rare events, remove them for 1-class SVM model building. Applying the 1-class SVM model then scores each record on the likelihood that it is "abnormal". Oracle Data Mining's 1-class SVM can also mine transactional (nested) data, unstructured data (i.e. text), and star schema data. You can read more about them in the documentation available online at http://www.oracle.com/technology/products/bi/odm/index.html and in the tech info posted on the OTN web site.

Hope this helps!

cb

Will Dwinnell said...

Bhupendra wrote: But the issue with SVM and NN is that they tend to overfit the data heavily.

Could you expand on this, or provide references? Is there some reason that early stopping (in the case of neural networks) or constraining the number of hidden nodes does not address the over-fitting issue?


-Will Dwinnell
Data Mining in MATLAB

Anonymous said...

have you considered random forests?