Wednesday, January 25, 2012

You want more drama? Use less data!

Beware of making statements about averages or proportions based on small samples. Suppose the response rate across your customer base is 10%. Now consider a particular customer segment, say customers who bought hiking boots, and you find that the response rate within this group is 30%. Before you share that with your clients, or conclude that this segment will be a significant driver in your response model, look at the size of the segment. Chances are that the segment is too small to be of any business interest, or that there is an interaction effect you should explore and identify in your model.
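One rough way to see how much the segment size matters is to put a confidence interval around the observed rate. The sketch below uses the Wilson score interval at an approximate 95% level; the segment sizes are made-up numbers for illustration, and the choice of interval is my assumption rather than something prescribed here.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate Wilson score interval for a proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

# Hypothetical segment sizes, each with the same observed 30% response rate.
for n in (10, 100, 10000):
    responders = 3 * n // 10
    low, high = wilson_interval(responders, n)
    print(f"n={n}: observed 30%, plausible range {low:.1%} to {high:.1%}")
```

With only a handful of customers the plausible range is very wide, so a 30% rate in a tiny segment says far less than the same rate in a large one.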

In many instances, I have come across interesting findings and patterns that turn out to be nothing but occurrences in extremely small samples.

The title of this post and the idea of writing something on this came from "Deceptive Data and Statistical Skullduggery" by Gordon H. Bell.

Sunday, January 8, 2012

Sampling or Weights

Below is a small excerpt from a post on one of my favorite analytics blogs, the Data Miners Blog:

When we have a binary target variable and our goal is to predict the rare outcome - we either do oversampling or we use weights.

Oversampling is when we use all the rare outcomes and an equal-sized random sample of the common outcomes. This is most useful when there are a large number of cases, and reducing the number of rows makes the modeling tool run faster.
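As a minimal sketch of that description, the function below keeps every rare row and draws an equal-sized random sample of the common rows. The function name, the `is_rare` predicate, and the toy data are hypothetical, chosen only for illustration.

```python
import random

def balanced_sample(rows, is_rare, seed=0):
    """Keep all rare rows plus an equal-sized random sample of common rows."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    rare = [r for r in rows if is_rare(r)]
    common = [r for r in rows if not is_rare(r)]
    k = min(len(rare), len(common))
    return rare + rng.sample(common, k)

# Toy data: 1,000 rows, 10% of which have the rare outcome (target == 1).
rows = [{"id": i, "target": 1 if i < 100 else 0} for i in range(1000)]
sample = balanced_sample(rows, lambda r: r["target"] == 1)
rare_count = sum(r["target"] for r in sample)
print(len(sample), rare_count)
```

The result has far fewer rows than the original data, which is where the speed-up comes from.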

The second approach is weighting. Rare cases are given a weight of more than 1 and common cases are given a weight less than 1, so that the sum of the weights of the two groups is equal.

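Assume that we have data that is 10% rare and 90% common, and we want the equivalent of a 50%-50% split. Oversampling would keep all the rare cases and an equal number of common ones. With weights, we instead give each rare observation a weight of 5 and each common observation a weight of 5/9, so that each group carries half of the total weight (10% x 5 = 50% and 90% x 5/9 = 50%).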
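The arithmetic behind those two numbers can be checked with a few lines. This is only a sketch: the function name is hypothetical, and it assumes we want each group to end up with half of the total weight.

```python
from fractions import Fraction

def balancing_weights(rare_share):
    """Per-observation weights that give each class half of the total weight."""
    rare_share = Fraction(rare_share)
    common_share = 1 - rare_share
    half = Fraction(1, 2)
    return half / rare_share, half / common_share

rare_w, common_w = balancing_weights(Fraction(1, 10))
print(rare_w, common_w)

# Weighted share of each group after applying the weights.
print(Fraction(1, 10) * rare_w, Fraction(9, 10) * common_w)
```

The same function works for other class mixes, for example a 2% rare outcome, as long as the shares are passed as exact fractions.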

---

Dear Readers,

In the past few months, my roles and responsibilities have changed and I will be spending less time on this blog. Instead of lengthy, detailed posts once every 2-3 months, I would like to write short posts more regularly. These posts will mostly be tips, common mistakes, something new about an old familiar technique, interesting insights, etc. All of these will be based on my own experience or on something I came across in a book or on a website. I hope you will like the new format and find these short posts useful and interesting.

Thanks,
DataLLigence