Wednesday, October 20, 2010

So you thought...?

Sometimes you get the feeling that everyone around you is either confused about, or simply doesn't know, things which are basic and essential in Analytics. Below is a list of the most common terms that most people think they know but don't.

1. Linear/Pearson Correlation: The most misunderstood term as far as I know. Before doing anything else, check if the 2 variables share a linear relation. A Pearson correlation value without a linear pattern is meaningless. Also be aware that in many software packages (including MS Excel), the default is Pearson correlation, for which a linear relation between the two variables is a requirement.
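
To see why the linearity check matters, here is a minimal sketch in plain Python. The `pearson` helper and the toy data are made up for illustration; they are not taken from any particular package. One variable is a perfectly linear function of `xs`, the other a perfectly quadratic one.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Illustrative data only (an assumption, not real customer data).
xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # perfectly linear in x
quadratic = [x ** 2 for x in xs]   # fully determined by x, but not linear

r_linear = pearson(xs, linear)
r_quadratic = pearson(xs, quadratic)
print(r_linear, r_quadratic)
```

The quadratic variable is completely determined by `xs`, yet its Pearson value comes out at zero, while the linear one comes out at one. A correlation value read without first looking at the shape of the relation would call the quadratic pair "unrelated".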

2. Significance Test: Many, many people in Analytics (?) will never understand this, or will never even try to. Just because you see 2 groups doesn't mean that you can do a significance test. Know something, or everything, about sampling and study design before talking about significance tests.

3. Lift and Cumulative Gains Charts: They are different, period. Don't confuse one with another.

Lift - Without a model, we get 30% of the responders by contacting 30% of the customers. Using a model, we get 60% of the responders by contacting the same 30% of the customers. The lift is 60/30 = 2 times.

Cumulative Gains - Using the model, if we contact 30% of the customers we get 60% of all responders.
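
The two numbers can be computed side by side. The sketch below assumes a hypothetical list of model scores and 0/1 response flags for 10 customers; the function name `gain_and_lift` and the data are invented for illustration.

```python
def gain_and_lift(scores, responded, n_contacted):
    """Return (cumulative gain, lift) when contacting the top-scored customers.

    gain = share of all responders found among the contacted customers
    lift = gain divided by the share of customers contacted
           (the share of responders expected without a model)
    """
    # Rank customers from highest to lowest model score.
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    captured = sum(flag for _, flag in ranked[:n_contacted])
    gain = captured / sum(responded)
    contacted_share = n_contacted / len(scores)
    return gain, gain / contacted_share

# Hypothetical scores and responses for 10 customers (5 responders in total).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responded = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]

# Contact the top 3 of 10 customers, i.e. 30% of the base.
gain, lift = gain_and_lift(scores, responded, n_contacted=3)
print(gain, lift)
```

With this toy data the top 30% of customers contain 3 of the 5 responders, so the cumulative gain is about 0.6 (60%) and the lift is about 2, matching the numbers above. The gain is a share of responders; the lift is a ratio against the no-model baseline.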

4. Clustering/Segmentation and Profiling: Let's make this simple. Clustering/Segmenting will answer - Can my customer base be broken up into distinct groups based on certain attributes/characteristics? Customers within a group will be very similar to one another while customers across groups will be different.

Profiling will answer - Who are my best customers? What do they purchase? How often? What is their ethnicity, their household size and income, etc.? In many cases, profiling follows clustering/segmentation: Who are the customers in Group 1?
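
A rough sketch of the two steps, one after the other. Everything here is an assumption made for illustration: the six customers, the choice of a bare-bones k-means as the clustering method, and the hand-picked starting centroids. Real work would also scale the attributes first.

```python
from statistics import mean

# Hypothetical customers: (annual_spend, visits_per_year, household_size)
customers = [
    (200, 2, 1), (250, 3, 2), (220, 2, 1),
    (1500, 20, 4), (1600, 24, 3), (1550, 22, 4),
]

def kmeans(points, initial_centroids, iterations=10):
    """Bare-bones k-means; returns one group label per point."""
    centroids = list(initial_centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(mean(col) for col in zip(*members))
    return labels

def profile(rows, labels):
    """Describe each group: its size and the average of each attribute."""
    summary = {}
    for k in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == k]
        summary[k] = {
            "size": len(members),
            "avg_spend": mean(r[0] for r in members),
            "avg_visits": mean(r[1] for r in members),
            "avg_household": mean(r[2] for r in members),
        }
    return summary

# Step 1 (clustering): group on spend and visits only.
behaviour = [(spend, visits) for spend, visits, _ in customers]
labels = kmeans(behaviour, initial_centroids=[behaviour[0], behaviour[3]])

# Step 2 (profiling): describe who is in each group, using all attributes.
profiles = profile(customers, labels)
print(labels)
print(profiles)
```

Clustering only produces the group labels. Profiling is the separate step that takes those labels and answers "who are the customers in Group 1?".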

Signing off with:
"There must be some kind of way out of here,"
Said the joker to the thief
"There's too much confusion,
I can get no relief"
-- All along the watchtower by Jimi Hendrix


John said...

"All Along the Watchtower" was written by Bob Dylan . . . If you want to beat people up for their mistakes regarding terms like "correlation," you need to get it ALL right!

Datalligence said...

thanks john, but i know that. i was listening to the hendrix version when i wrote this, i like it better and mentioned that.

and something that might interest you. Dylan said: "I liked Jimi Hendrix's record of this and ever since he died I've been doing it that way...Strange how when I sing it, I always feel it's a tribute to him in some kind of way."

was browsing your blog. you are into software/technology, what about analytics or data mining?

Brian said...

Saying only linear relationships matter, or only linear correlations matter, is just flat out wrong. Pearson Correlation is linear, yes, but Spearman Rank is not, for example. A quadratic relationship between two variables is very useful, as is a cubic, an exponential, a logistic, etc.

Datalligence said...

I was talking about linear/Pearson correlation. Lots of people just use proc corr (default is Pearson) and/or proc reg in SAS, or the correlation function in MS Excel, and then just infer anything from the correlation value. They rarely or never check the relation between the 2 variables before using these functions.

Thanks for pointing that out Brian. I have made a few changes to make it more clear.

Sandro Saitta said...

Thanks for these very interesting points. It's always good to get such a reminder!

BasiaBernstein said...

Correlation and dependence are any of a broad class of statistical relationships between two or more random variables or observed data values.

Pearson correlation