Thursday, March 19, 2009

Software Dependence & Model Accuracy

I work a lot with the Data Mining/Analytics business development team at my current company. My primary role is to be there during client presentations/conferences and answer the client’s queries on modeling techniques, and the USP of our approach related to model performance and/or business benefits.

During one of these interactions, we found out that a particular client is using THREE Data Mining software packages. Not statistical packages or the base versions, but the complete, very expensive Data Mining suites: SAS EM, SPSS Clementine and KXEN.

I was like, “Wow!!! But do you really need 3 Data Mining packages???” Our initial questions and the client’s answers confirmed that inconsistent data formats were not the reason, as the client already has a BI/DW system. Their reason? Well, they are of the opinion that some algorithms/techniques in one DM package are much better and more accurate than the same algorithms/techniques in another.

I was, and I am, not convinced. Unless a particular DM package has a totally new and different algorithm, for which you obviously can’t make a comparison, I haven’t come across or heard of any stark differences in model performance and results for the same algorithms offered by the reputed DM packages. Data Mining solutions and the subsequent business benefits are not driven solely by model accuracy; a lot depends on how you interpret and apply the model’s results too.

What’s your opinion on this?

On a slightly different but related note, I learned of an interesting case from Rob Mattison’s webcast on Telco Churn Management, available on the SAS website. He mentioned an incident where a client’s existing churn model was giving an impressive “above 90%” accuracy. Sensing that something was amiss, he went and talked with the Marketing people and found out that they were sending the same communication that is sent at the time of acquisition to the list of customers identified by the model as the most likely churners.

The result? The already unsatisfied customers who were thinking of switching got an inappropriate message/treatment, got further irritated and eventually left. In other words, all customers identified as likely churners by the model were encouraged to leave thereby shooting up the model accuracy!!!

If you have come across such cases, please share them with me in your comments:-)


Themos Kalafatis said...

I am finding these differences hard to believe also. Some packages (such as SPSS Clementine) do change their algorithms for optimized speed though.

In my experience, having the option of algorithm parameter optimization is much more useful for achieving superior model accuracy.

Romakanta said...

yup, the options available in the algorithm settings are far more important. and most DM packages have the option to customize their algorithms through coding too.

Tim Manns said...


Regarding your second point about 90% accuracy, I have two comments:

1) control groups. I don't need to say more than that...

2) 90% isn't good. If I always predict active/no churn I'll be correct/accurate 98% of the time (a typical telco might have monthly churn of 2%).
I'd suggest using the metric of lift, or in simple business terms "how many times would a response rate or churn rate be higher if I use the model?". And of course results based upon control groups and retrospective analysis over a period of months.
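To put numbers on the accuracy-versus-lift point above, here is a minimal sketch in Python. The 2% monthly churn rate is the figure from the comment; the flagged list of 1,000 customers and the 80 churners among them are made-up values for illustration only, and the function names are hypothetical.

```python
def baseline_accuracy(churn_rate):
    """Accuracy of a 'model' that always predicts active / no churn."""
    return 1.0 - churn_rate


def lift(churners_in_target, target_size, churn_rate):
    """How many times higher the churn rate is in the targeted group
    than in the customer base as a whole."""
    target_rate = churners_in_target / target_size
    return target_rate / churn_rate


churn_rate = 0.02  # typical monthly telco churn, as quoted in the comment

# Hypothetical figures, for illustration only:
# the model flags 1,000 customers and 80 of them go on to churn.
flagged = 1000
flagged_churners = 80

print(f"Always-predict-active accuracy: {baseline_accuracy(churn_rate):.0%}")
print(f"Lift in the flagged group: {lift(flagged_churners, flagged, churn_rate):.1f}x")
```

With these assumed numbers the do-nothing baseline already beats a "90% accurate" model, while the lift figure says how much more concentrated churners are in the flagged list than in the base. Neither number accounts for churn caused by the campaign itself, which is what a control group is for.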



Tim Manns said...

Is your post about multiple data mining tools related to your previous post about segmentation?

If these posts relate to the same customer / project then I do have an idea and comments as to why they probably use multiple tools.

Romakanta said...

Tim: I think Rob was referring to the overall accuracy or maybe True Positives only. I agree, lift is a much more appropriate measure.

Regarding your second comment, nope. This post is not related to the previous post on customer segmentation but I would still love to know your opinion :-)

Bhupendra said...

Romakanta, you have touched on a very genuine case. Too much belief in BI tools has led to huge investment in systems, leaving very little room for business use. This has been the major reason for the failure of BI teams to show ROI.

I had a client who had SAS ETL, SAS Eminer, SPSS Clementine, FICO's Model Builder, FICO's Model Builder for Decision Trees etc. They were then evaluating whether to buy Business Objects for reporting.

Wow!! This is terrible. They have sunk more than half of their BI budget into the system. Now they will need a team that can use these tools to the limit.
In my experience, SAS ETL can do around 70% of what BO does; SAS Eminer can do almost everything that MB and MB DT do. FICO products are simply the best for building models, but they require people who know how to use them. Such people are not easy to find in the market.

I would suggest that people stop spending on tools and start spending on people. They all need smart analysts who understand the business and can connect findings to revenue gain. Tools, however good, are a cost center to the company till proven otherwise.


Romakanta said...

yup. they should rely more on smart people, and as soon as possible!