Just came across this post from Vincent Granville on AnalyticBridge. Quite interesting and informative. Am sharing it here, along with a few additions of my own.
- The model is more accurate than the data warrants: you try to kill a fly with a nuclear weapon.
- You spent one month designing a perfect solution when a 95% accurate solution could have been designed in one day. You focus on 1% of the business revenue, lack vision, and miss the big picture.
- Poor communication. Not listening to the client, not requesting the proper data. Providing too much data to decision makers rather than four bullet points with actionable information. Failure to leverage external data sources. Not using the right metrics. Problems with gathering requirements. Poor graphics, or graphics that are too complicated.
- Failure to remove or aggregate conclusions that do not have statistical significance. Or repeating the same statistical tests many times, thus eroding the real confidence levels (see the multiple-testing sketch after this list).
- Sloppy modeling, poor design of experiments, or sampling issues. A large bucket of data looks statistically significant, but everything in that bucket comes from one old client with inaccurate statistics. Or you join sales and revenue from two databases, but the join is messy, or the sales and revenue data do not overlap because of different latencies.
- Lack of maintenance. The data flow is highly dynamic and patterns change over time, but the model was tested a year ago on a data set that has since evolved significantly. The model is never revisited, or parameters and blacklists are not updated at the right frequency.
- Changes in definition (e.g. including international users in the definition of a user, or removing filtered users) resulting in metrics that lack consistency, making vertical comparisons (trending for the same client) or horizontal comparisons (comparing multiple clients at the same time) impossible.
- Blending data from multiple sources without proper standardization: using (non-normalized) conversion rates instead of (normalized) odds of conversion (see the odds sketch after this list).
- Poor cross-validation. Cross-validation should not just be about randomly splitting the training set into a number of subsets, but rather about comparing before (training) with after (test). Or comparing 50 training clients with 50 different test clients, rather than 5,000 training observations from 100 clients with another 5,000 test observations from the same 100 clients (see the cross-validation sketch after this list). Eliminate features that are statistically significant but lack robustness when compared across two time periods.
- Improper use of statistical packages. Don't feed decision tree software a raw metric such as an IP address: it just does not make sense. Instead, provide a smart binned metric such as the type of IP address (corporate proxy, bot, anonymous proxy, edu proxy, static IP, IP from an ISP, etc.); a binning sketch follows the list.
- Wrong assumptions. Working with dependent (correlated) independent variables and not handling the problem (see the VIF sketch after this list). Violating the Gaussian model, or ignoring multimodality. An external factor, not your independent variables, explains the variation in the response. When running an A/B test, ignoring important changes made to the website during the test period.
- Lack of good sense. Analytics is a science AND an art, and the best solutions require sophisticated craftsmanship (the stuff you will never learn at school), yet can usually be implemented pretty fast: elegant, efficient simplicity versus inefficient, complicated solutions.
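A quick illustration of the multiple-testing point above. This is my own minimal Python sketch (not from Granville's post), on synthetic data: it runs twenty A/B-style tests where there is no real effect at all, and counts how many look "significant" at the usual 5% level, with and without a Bonferroni correction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_tests = 0.05, 20

p_values = []
for _ in range(n_tests):
    a = rng.normal(0.0, 1.0, 500)   # "control": no true effect
    b = rng.normal(0.0, 1.0, 500)   # "treatment": no true effect either
    p_values.append(stats.ttest_ind(a, b, equal_var=False).pvalue)

naive = sum(p < alpha for p in p_values)
corrected = sum(p < alpha / n_tests for p in p_values)   # Bonferroni-adjusted threshold
print(f"'Significant' results out of {n_tests} null tests: naive={naive}, Bonferroni={corrected}")
```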
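On the conversion rate vs. odds point, a tiny sketch with made-up numbers: averaging raw conversion rates from sources with very different baselines gives a different answer than blending them on the (normalized) log-odds scale.

```python
import math

def to_log_odds(p: float) -> float:
    return math.log(p / (1.0 - p))

def from_log_odds(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical conversion rates for three traffic sources
rates = {"email": 0.02, "paid_search": 0.10, "retargeting": 0.40}

naive_blend = sum(rates.values()) / len(rates)   # averaging raw rates
odds_blend = from_log_odds(sum(to_log_odds(p) for p in rates.values()) / len(rates))

print(f"average of raw rates   : {naive_blend:.3f}")
print(f"blend on log-odds scale: {odds_blend:.3f}")
```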
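On cross-validation: a minimal sketch using scikit-learn's GroupKFold on synthetic data (the sizes and names are invented), so that every observation from a given client falls entirely in the training set or entirely in the test set. This is much closer to comparing 50 clients with 50 different clients than to randomly shuffling observations that share a client across both sets.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(1)
n_clients, obs_per_client = 100, 50
client_id = np.repeat(np.arange(n_clients), obs_per_client)
X = rng.normal(size=(n_clients * obs_per_client, 5))
y = rng.integers(0, 2, size=n_clients * obs_per_client)

# Each fold keeps every observation of a given client entirely in train OR test.
gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=client_id)):
    shared = set(client_id[train_idx]) & set(client_id[test_idx])
    print(f"fold {fold}: {len(set(client_id[train_idx]))} train clients, "
          f"{len(set(client_id[test_idx]))} test clients, shared clients: {len(shared)}")
```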
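On binning raw metrics such as IP addresses: a minimal sketch of the idea. The classify_ip rules and the KNOWN_PROXIES lookup below are invented for illustration, not a real IP taxonomy service.

```python
import ipaddress

# Hypothetical lookup table; a real one would come from proxy/bot intelligence feeds.
KNOWN_PROXIES = {"203.0.113.7": "corporate proxy", "198.51.100.3": "anonymous proxy"}

def classify_ip(ip: str) -> str:
    """Map a raw IP string to a coarse category a decision tree can split on."""
    if ip in KNOWN_PROXIES:
        return KNOWN_PROXIES[ip]
    if ipaddress.ip_address(ip).is_private:
        return "internal"
    return "isp / unknown"   # catch-all bucket for everything else

raw_ips = ["203.0.113.7", "10.0.0.12", "198.51.100.3", "8.8.8.8"]
print([classify_ip(ip) for ip in raw_ips])
# -> ['corporate proxy', 'internal', 'anonymous proxy', 'isp / unknown']
```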
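On dependent independent variables: one way to catch the problem is to compute variance inflation factors (VIF) before trusting regression coefficients. A minimal sketch on synthetic data, using statsmodels; the nearly duplicated x2 column gets a huge VIF, while the genuinely independent x3 stays near 1.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
n = 1000
df = pd.DataFrame({"x1": rng.normal(size=n), "x3": rng.normal(size=n)})
df["x2"] = 0.95 * df["x1"] + 0.05 * rng.normal(size=n)   # nearly a copy of x1

X = sm.add_constant(df[["x1", "x2", "x3"]])
for i, col in enumerate(X.columns):
    if col != "const":
        print(f"VIF({col}) = {variance_inflation_factor(X.values, i):.1f}")
```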
My additions:
- What is your problem?
Without a real business problem, modeling or data mining will just give you numbers. I have come across people, on both the delivery and the client side, who come up with this often-repeated line: "I've got this data, tell me what you can do?"
My answer to that: "I will give you the probability that your customer will attrite, based on the last two digits of her transaction ID. Now tell me, what are you going to do to make her stay with your business?"
Start with a REAL business problem.
- What are you gonna do about it?
So if your customers are leaving in alarming numbers, don't just say that you want an attrition/churn model. Think about how you would like to use the model results. Are you thinking of a retention campaign? Are you going to reduce churn by focusing on ALL the customers most likely to leave? Or are you going to focus on a specific subset of these customers (based on their profitability, for example)? How soon can you launch a campaign? How frequently will you be targeting these customers?
Have a CLEAR idea on what you are going to do with the model results.
- What have you got?
Don't expect a wonderful, earth-shattering surprise from every modeling project. Model results and performance depend on many factors, with data quality being the most important one in almost all cases. If your database is full of @#$%, remember one thing: garbage in, garbage out. Period.
- Modeling is not going to give you a nice easy-to-read chart on how to run businesses.
- Technique is not everything.
A complex technique like artificial neural networks doesn't guarantee a prize-winning model. Selecting a technique depends on many factors, the most important being the data types, the data quality and the business requirement.
- Educate yourself
It's never too late to learn. For people on the delivery side, modeling is not about the t-test and regression alone. For people on the client side, know what analytics or data mining can do, and what it CANNOT do. Know when, where and how to relate the model results to your business.