Monday, July 28, 2008

Logistic Regression - Continuous or Categorical?

When a prediction/classification problem involves a lot of categorical variables, the first thing that comes to mind is Logistic Regression.

One thing I often come across in Logistic Regression models is a low percentage of true positives, or of cases/records correctly classified. Most of the time, the problem lies with the selection of the predictor variables. Many people tend to select as many predictor variables as they can, on the mistaken notion that they will miss something really BIG if they leave certain variables out of the model.

And this is exactly where the idea of statisticians being the best and only candidates for analytics jobs is proved wrong. Someone with an understanding of the domain/business will easily point out the variables that will influence the dependent/response variable. I always say to my managers: a Statistician, a Database Expert and an MBA are absolutely required for a successful Analytics Team.

Coming back to the accuracy of Logistic Regression models: while variable selection is the most important factor (besides the data quality, of course!!) influencing the accuracy of the model, I would say that variable transformation, and/or how you represent the predictor variable, is the second most important factor.

In a churn prediction model for a telecom company, I was working on Logistic Regression techniques and one of the predictor variables was “Months in Service”. In the initial runs, I specified it as a continuous variable in the model. After a lot of reruns that failed to increase the accuracy of the model, something made me think about the relation between “Probability of Churn” & “Months in Service”. Will the probability increase with an increase in the months of service? Will it decrease? Or will it be a little more complicated - with a lot of customers leaving in the initial few months of service, staying back for the next couple of months, and then churning again for another block of months, and so on?
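One simple way to explore that question before rerunning any model is to tabulate the observed churn rate for each value of “Months in Service”. The snippet below is only a minimal sketch: the function name `churn_rate_by_month` and the toy records are hypothetical, invented for illustration, and are not the telecom data.

```python
from collections import defaultdict

def churn_rate_by_month(records):
    """records: iterable of (months_in_service, churned) pairs, churned is 0 or 1."""
    totals = defaultdict(lambda: [0, 0])  # month -> [customers, churners]
    for month, churned in records:
        totals[month][0] += 1
        totals[month][1] += churned
    # Share of customers who churned, keyed by month in ascending order
    return {m: c / n for m, (n, c) in sorted(totals.items())}

# Hypothetical toy records, made up purely to show the shape of the output
records = [(1, 1), (1, 1), (1, 0), (2, 1), (2, 0),
           (5, 0), (5, 0), (5, 1), (5, 0), (8, 1), (8, 1)]

rates = churn_rate_by_month(records)
for month, rate in rates.items():
    print(f"month {month:2d}: churn rate {rate:.2f}")
```

If the rates rise and fall from one block of months to the next instead of moving steadily in one direction, a single slope on a continuous variable is unlikely to capture the pattern.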

I reran the model, this time specifying “Months in Service” as a categorical variable. And the model accuracy shot up by about 12%!!!
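To make the difference between the two specifications concrete, here is a rough sketch in plain Python. Everything in it is an assumption made for illustration: the synthetic customers (months 1-3 and 7-9 are made "risky" on purpose), the helper names, and the bare-bones gradient descent fit. It is not the model or the data from the telecom project, and the categorical version uses one indicator per month with no separate intercept, which is just one of several equivalent ways to code a categorical variable.

```python
import math

def make_data(per_month, churners_in_risky_month):
    """Synthetic, made-up customers: months 1-3 and 7-9 are 'risky'."""
    rows = []
    for month in range(1, 13):
        risky = month in (1, 2, 3, 7, 8, 9)
        n_churn = churners_in_risky_month if risky else per_month - churners_in_risky_month
        rows.extend((month, 1 if i < n_churn else 0) for i in range(per_month))
    return rows

def as_continuous(month):
    return [1.0, month / 12.0]  # intercept plus one scaled slope term

def as_categorical(month):
    return [1.0 if month == level else 0.0 for level in range(1, 13)]  # one indicator per month

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(rows, encode, steps=500, lr=0.5):
    """Plain batch gradient descent on the logistic log-loss."""
    X = [encode(month) for month, _ in rows]
    y = [label for _, label in rows]
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def accuracy(w, rows, encode):
    hits = 0
    for month, label in rows:
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, encode(month))))
        hits += int((p > 0.5) == (label == 1))
    return hits / len(rows)

train = make_data(per_month=10, churners_in_risky_month=8)
test = make_data(per_month=5, churners_in_risky_month=4)  # separate held-out rows

results = {}
for name, encode in (("continuous", as_continuous), ("categorical", as_categorical)):
    w = fit(train, encode)
    results[name] = (len(w), accuracy(w, test, encode))
    print(name, "parameters:", results[name][0], "held-out accuracy:", results[name][1])
```

On this invented pattern the indicator coding gets 80% of the held-out rows right by construction, while a single slope cannot do better than 65%, because it can only split the months at one threshold. That says nothing about the size of the gain on real data; it only shows why a non-monotonic relationship can favour the categorical specification.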


Kevin said...

Good job redefining the variable as a categorical variable. There's always something to be gained by exploring how to define variables.

Will Dwinnell said...

I reran the model, this time specifying “Months in Service” as a categorical variable. And the model accuracy shot up by about 12%!!!

This adds as many new parameters to the model as one less than the number of distinct values in that variable. While this certainly makes the model more flexible, and one would expect an improvement in the apparent performance, this increased complexity can also lead to overfitting. Was the 12% improvement measured on out-of-sample data?

-Will Dwinnell
Data Mining in MATLAB

Romakanta said...

yes, overfitting is something that usually happens after this kind of variable transformation.

the 12% improvement was on the out-of-sample testing data set