Friday, December 17, 2010

What's behind your Tree?

Considering the number of target customer selection projects I do, Direct Mail appears to be a very popular communication and marketing channel among retailers.

Almost all the time, I use a combination of RFM, Decision Tree and Logistic Regression techniques for sorting, profiling and/or scoring customers (hopefully, I can post a separate, detailed blog on this).

The best thing about a decision tree is that it makes very few assumptions about the data, unlike, let’s say, logistic regression. Another thing is that everyone can understand it! Depending on the software you use, there are a number of different Tree algorithms available, the most common being CHAID, CART and C5.

CART can handle only binary splits (each node splits into two child nodes). It uses a measure of impurity called Gini for splitting the nodes. This is a measure of dispersion that depends on the distribution of the outcome variable. It is 0 (best) when a node is pure and grows toward 1 (worst) as the classes become more evenly mixed (for a binary outcome, the worst case is 0.5). You get a 0 when all records of a node fall under a single category level (e.g. all 10,000 customers in a terminal node are responders). This is a purely theoretical example, by the way!
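To see how the Gini measure behaves, here is a small Python sketch (the counts are made up and this is not tied to any particular tree package):

```python
def gini(counts):
    """Gini impurity of a node, given the count of records in each class."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# A pure terminal node (all 10,000 customers are responders) scores 0 (best):
print(gini([10000, 0]))    # 0.0

# A perfectly mixed binary node scores 0.5, the worst case for two classes:
print(gini([5000, 5000]))  # 0.5
```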

In C5, splits are based on the information gain ratio. C5 prunes the tree by examining the error rate at each node and assuming that the true error rate is actually substantially worse. If N records arrive at a node, and E of them are classified incorrectly, then the error rate at that node is E/N.

Information gain can also be simply defined as –

Information (Parent Node) – Information (after splitting on a particular variable)

CHAID is an efficient decision tree technique based on the Chi-square test of independence between two categorical fields. CHAID makes use of the Chi-square test in several ways: first to merge classes that do not have significantly different effects on the target variable; then to choose the best split; and finally to decide whether it is worth performing any additional splits on a node.
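The underlying Chi-square test of independence is simple to compute by hand. Here is a sketch on a made-up 2x2 table (two customer segments versus respond/no-respond; the counts are purely illustrative):

```python
def chi_square(table):
    """Pearson Chi-square statistic for a contingency table (a list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Segment A: 120 responders out of 1,000; Segment B: 80 out of 1,000
table = [[120, 880], [80, 920]]
print(round(chi_square(table), 2))  # 8.89 -- above the 5% critical value of
                                    # 3.84 (df = 1), so the two segments would
                                    # be kept separate rather than merged
```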

CHAID and C5 can handle multiple splits, unlike CART. And as far as my own experience goes, I prefer CHAID over C5, as C5 tends to produce very bushy trees.

Data Mining Techniques: Michael J.A. Berry & Gordon S. Linoff
Data Mining Techniques (Inside Customer Segmentation): Konstantinos Tsiptsis & Antonios Chorianopoulos

Wednesday, October 20, 2010

So you thought...?

Sometimes you get the feeling that everyone around you is confused about, or simply doesn't know, things that are basic and essential in Analytics. Below is a list of the most common terms that a majority think they know but don't.

1. Linear/Pearson Correlation: The most misunderstood term as far as I know. Before doing anything else, check whether the two variables share a linear relationship. A correlation value without a linear pattern is meaningless. Also be aware that in many software packages (including MS Excel), the default is Pearson correlation, for which a linear relationship between the two variables is a requirement.
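A quick way to see why the linearity requirement matters: a perfect but non-linear (here, quadratic) relationship can produce a Pearson correlation of exactly zero. A small sketch with made-up numbers:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

x = [-2, -1, 0, 1, 2]
print(pearson(x, [2 * v + 1 for v in x]))  # 1.0 (perfectly linear)
print(pearson(x, [v ** 2 for v in x]))     # 0.0 (perfect relationship,
                                           #      zero "correlation")
```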

2. Significance Test: Many, many people in Analytics (?) will never understand this, or will never try to understand this. Just because you see two groups doesn't mean that you can do a significance test. Know something about sampling and experimental design before talking about significance tests.

3. Lift and Cumulative Gains Charts: They are different, period. Don't confuse one with the other.

Lift - Without a model, we get 30% of the responders by contacting 30% of the customers. Using a model, we get 60% of responders. The lift is 60/30 = 2 times.

Cumulative Gains - Using the model, if we contact 30% of the customers we get 60% of all responders.
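Both quantities fall out of the same calculation once customers are sorted by model score. A toy sketch with 10 customers (the 1/0 response flags are made up, in score-descending order, so the numbers match the example above):

```python
def cumulative_gains(flags, depth):
    """Share of all responders captured in the top `depth` fraction of customers.

    `flags` are 1/0 response indicators, already sorted by model score,
    best scores first.
    """
    k = int(len(flags) * depth)
    return sum(flags[:k]) / sum(flags)

# 10 customers, 5 responders; the model has pushed most responders to the top
flags = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

gains = cumulative_gains(flags, 0.3)
print(gains)        # 0.6 -> contacting the top 30% reaches 60% of responders
print(gains / 0.3)  # 2.0 -> a lift of 2x over contacting 30% at random
```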

4. Clustering/Segmentation and Profiling: Let's make this simple. Clustering/Segmenting will answer - Can my customer base be broken up into distinct groups based on certain attributes/characteristics? Customers within a group will be very similar to one another while customers across groups will be different.

Profiling will answer - Who are my best customers? What do they purchase? How often? What is their ethnicity, their household size and income, etc.? In many cases, profiling usually follows clustering/segmentation. Who are the customers in Group 1?

Signing off with:
"There must be some kind of way out of here,"
Said the joker to the thief
"There's too much confusion,
I can get no relief"
-- "All Along the Watchtower", written by Bob Dylan, as performed by Jimi Hendrix

Sunday, April 18, 2010

The dark side of Statistically Significant

According to Wikipedia:

In statistics, a result is called statistically significant if it is unlikely to have occurred by chance.

The use of the word significance in statistics is different from the standard one, which suggests that something is important or meaningful. For example, a study that included tens of thousands of participants might be able to say with very great confidence that people of one state are more intelligent than people of another state by 1/20 of an IQ point. This result would be statistically significant, but the difference is small enough to be utterly unimportant. Many researchers urge that tests of significance should always be accompanied by effect-size statistics, which approximate the size and thus the practical importance of the difference.

Statistically significant is something I come across every day in my line of work and, to be honest, it is the most abused and misunderstood term. There are people who assume that if something comes out statistically significant, all their questions are answered and their problems are solved.

An article by Tom Siegfried in Science News throws up interesting facts and assumptions about statistical significance testing. I have changed the second part to use a channel effectiveness example instead of the clinical trials in the original article.

1. The Hunger Hypothesis

The amount of evidence required to accept that an event is unlikely to have arisen by chance is known as the significance level or critical p-value. In other words, a p-value of .05 means that there is only a 5 % chance of obtaining the observed (or more extreme) result by chance.

So does this mean that you are 95% certain that the observed difference between groups, or sets of samples, is real and could not have arisen by chance? No. It is incorrect to transpose that finding into a 95 percent probability that the null hypothesis is false. “The P value is calculated under the assumption that the null hypothesis is true,” writes biostatistician Steven Goodman. “It therefore cannot simultaneously be a probability that the null hypothesis is false.”

That interpretation commits an egregious logical error (technical term: “transposed conditional”): confusing the odds of getting a result (if a hypothesis is true) with the odds favoring the hypothesis if you observe that result.

Consider this simplified example. Suppose a certain dog is known to bark constantly when hungry. But when well-fed, the dog barks less than 5 percent of the time. So if you assume for the null hypothesis that the dog is not hungry, the probability of observing the dog barking (given that hypothesis) is less than 5 percent. If you then actually do observe the dog barking, what is the likelihood that the null hypothesis is incorrect and the dog is in fact hungry?

That probability cannot be computed with the information given. The dog barks 100% of the time when hungry, and less than 5% of the time when not hungry. A well-fed dog may seldom bark, but observing the rare bark does not imply that the dog is hungry.
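What's missing is the prior probability that the dog is hungry in the first place. With an assumed prior, Bayes' rule gives the answer; the sketch below (taking "less than 5 percent" as exactly 5 percent, and with made-up priors) shows how wildly the answer swings with that assumption:

```python
def p_hungry_given_bark(prior_hungry, p_bark_if_hungry=1.0, p_bark_if_fed=0.05):
    """Posterior probability the dog is hungry, given a bark (Bayes' rule)."""
    numerator = prior_hungry * p_bark_if_hungry
    denominator = numerator + (1 - prior_hungry) * p_bark_if_fed
    return numerator / denominator

# If the dog is hungry half the time, a bark almost certainly means hunger:
print(round(p_hungry_given_bark(0.5), 3))    # 0.952

# But for a well-fed dog that is hungry only 0.1% of the time, the same
# bark means almost nothing:
print(round(p_hungry_given_bark(0.001), 3))  # 0.02
```

Same evidence, completely different conclusions, which is exactly why the probability cannot be computed from the error rates alone.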

2. Statistical significance is not always statistically significant.

The effectiveness of communication channels (Telemarketing, Emails or Direct Mails) is usually tested by comparing the results from Test & Control groups.

Using significance tests, the channel’s effect (response rate, purchase, etc) on the Test group is pronounced to be greater than the Control group by an amount unlikely to occur by chance.

The standard in most significance tests is 5%, and a result expected to occur less than 5% of the time is considered “statistically significant.” So if Email drives a higher response in the Test group than in the Control (the non-Emailed group) by an amount that would be expected by chance only 4% of the time, it would be concluded that the Email campaign really worked.

Now suppose Direct Mail also delivered similar results – Test group having a higher response than the Control group, but by an amount that would be expected by chance 6% of the time. In that case, conventional analysis would say that such an effect lacked statistical significance and that there was insufficient evidence to conclude that Direct Mail worked.

If the two channels were tested against each other rather than separately against Control groups (one group getting Emails and another, similar group receiving Direct Mails), the difference between the performance of Email and Direct Mail might very well NOT be statistically significant.

“Comparisons of the sort, ‘X is statistically significant but Y is not,’ can be misleading,” statisticians Andrew Gelman of Columbia University and Hal Stern of the University of California, Irvine, noted in an article discussing this issue in 2006 in the American Statistician. “Students and practitioners [should] be made more aware that the difference between ‘significant’ and ‘not significant’ is not itself statistically significant.”

The Control group for Email may simply be doing a lot worse than the Control group for Direct Mail. The Test-versus-Control difference in response rates thus comes out higher for Email, and a statistically significant result is observed.
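The point is easy to reproduce with a standard two-proportion z-test on made-up campaign counts (5,000 customers per group; none of these numbers come from a real campaign):

```python
import math

def two_proportion_test(x1, n1, x2, n2):
    """Two-sided z-test p-value for the difference between two response rates."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

# Email: Test 6.0% vs its Control 5.0% -> significant at the 5% level
print(round(two_proportion_test(300, 5000, 250, 5000), 3))  # 0.028

# Direct Mail: Test 5.8% vs its Control 5.0% -> NOT significant at 5%
print(round(two_proportion_test(290, 5000, 250, 5000), 3))  # 0.077

# But Email Test vs Direct Mail Test head to head -> nowhere near significant
print(round(two_proportion_test(300, 5000, 290, 5000), 3))  # 0.671
```

One channel "works," the other "doesn't," yet the two channels are statistically indistinguishable from each other.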

Signing off, with a quote from William W. Watt:
"Do not put your faith in what statistics say until you have carefully considered what they do not say."

Friday, January 1, 2010

Why Does Analytics Fail to Deliver?

Just came across this post from Vincent Granville on AnalyticBridge. Quite interesting and informative. I am sharing it here, along with a few additions of my own.

- The model is more accurate than the data: you try to kill a fly with a nuclear weapon.
- You spent one month designing a perfect solution when a 95% accurate solution can be designed in one day. You focus on 1% of the business revenue, lack vision / lack the big picture.
- Poor communication. Not listening to the client, not requesting the proper data. Providing too much data to decision makers rather than 4 bullets with actionable information. Failure to leverage external data sources. Not using the right metrics. Problems with gathering requirements. Poor graphics, or graphics that are too complicated.

- Failure to remove or aggregate conclusions that do not have statistical significance. Or repeating many times the same statistical tests, thus negatively impacting the confidence levels.
- Sloppy modeling or design of experiments, or sampling issues. One large bucket of data has good statistical significance, but all the data in the bucket in question is from one old client with inaccurate statistics. Or you use two databases to join sales and revenue, but the join is messy, or sales and revenue data do not overlap because of a different latency.
- Lack of maintenance. The data flow is highly dynamic and patterns change over time, but the model was tested 1 year ago on a data set that has significantly evolved. Model is never revisited, or parameters / blacklists are not updated with the right frequency.
- Changes in definition (e.g. include international users in the definition of a user, or remove filtered users) resulting in metrics that lack consistency, making vertical comparisons (trending, for a same client) or horizontal comparisons (comparing multiple clients at a same time) impossible.
- Blending data from multiple sources without proper standardizations: using (non-normalized) conversion rates instead of (normalized) odds of conversion.
- Poor cross-validation. Cross-validation should not be about randomly splitting the training set into a number of subsets, but rather comparing before (training) with after (test). Or comparing 50 training clients with 50 different test clients, rather than 5,000 training observations from 100 clients with another 5,000 test observations from the same 100 clients. Eliminate features that are statistically significant but lack robustness when comparing two time periods.
- Improper use of statistical packages. Don't feed decision tree software a raw metric such as IP address: it just does not make sense. Instead, provide a smart binned metric such as type of IP address (corporate proxy, bot, anonymous proxy, edu proxy, static IP, IP from an ISP, etc.)
- Wrong assumptions. Working with "independent" variables that are actually correlated with one another and not handling the problem. Violations of the Gaussian model, multimodality ignored. An external factor explains the variation in the response, not your independent variables. When doing A/B testing, ignoring important changes made to the website during the A/B testing time period.
- Lack of good sense. Analytics is a science AND an art, and the best solutions require sophisticated craftsmanship (the stuff you will never learn at school), but might usually be implemented pretty fast: elegant/efficient simplicity vs. inefficient complicated solutions.

My additions:

- What is your problem?
Without a real business problem, modeling or data mining will just give you numbers. I have come across people, on both the delivery and the client side, who come up with this oft-repeated line: “I’ve got this data, tell me what you can do?”
My answer to that – “I will give you the probability that your customer will attrite based on the last 2 digits of her transaction ID. Now, tell me what are you going to do to make her stay with your business?”

Start with a REAL business problem.

- What are you gonna do about it?
So if your customers are leaving in alarming numbers, don’t just say that you want an attrition/churn model. Think about how you would like to use the model results. Are you thinking of a retention campaign? Are you going to reduce churn by focusing on ALL the customers most likely to leave? Or are you going to focus on a specific subset of these customers (based on their profitability, for example)? How soon can you launch a campaign? How frequently will you be targeting these customers?

Have a CLEAR idea on what you are going to do with the model results.

- What have you got?
Don’t expect a wonderful earth-shattering surprise from all modeling projects. Model results or performances are based on many factors, with data quality as the most important one, in almost all the cases. If your database is full of @#$%, remember one thing. Garbage in, Garbage out. Period.

- Modeling is not going to give you a nice easy-to-read chart on how to run businesses.

- Technique is not everything.
A complex technique like (Artificial) Neural Networks doesn’t guarantee a prize winning model. Selecting a technique depends on many factors, with the most important ones being data types, data quality and the business requirement.

- Educate yourself
It’s never too late to learn. For people on the delivery side, modeling is not about the T-test and Regression alone. For people on the client side, know what Analytics or Data Mining can do, and CANNOT do. Know when, where and how to relate the model results with your business.