Sunday, January 8, 2012

Sampling or Weights

Below is a small excerpt from a post on one of my favorite analytics blogs, the Data Miners Blog.

When we have a binary target variable and our goal is to predict the rare outcome, we either oversample or use weights.

Oversampling is when we use all the rare outcomes and an equal-sized random sample of the common outcomes. This is most useful when there are a large number of cases, and reducing the number of rows makes the modeling tool run faster.
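As a rough illustration of the oversampling described above, here is a minimal Python sketch. The function name `balanced_sample`, the `target` field, and the toy data are my own assumptions for the example and are not from the original post.

```python
import random

def balanced_sample(rows, is_rare, seed=0):
    """Keep every rare row plus an equal-sized random sample of common rows."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    rare = [r for r in rows if is_rare(r)]
    common = [r for r in rows if not is_rare(r)]
    # Draw as many common rows as there are rare rows (without replacement).
    k = min(len(rare), len(common))
    return rare + rng.sample(common, k)

# Toy data (hypothetical): 1,000 rows, 10% rare (target == 1).
rows = [{"id": i, "target": 1 if i < 100 else 0} for i in range(1000)]
balanced = balanced_sample(rows, lambda r: r["target"] == 1)
```

With this toy data, the result keeps all 100 rare rows and 100 randomly chosen common rows, so the modeling tool sees 200 rows instead of 1,000.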

The second approach is weighting. Rare cases are given a weight of more than 1 and common cases are given a weight of less than 1, so that the total weight of the rare group equals the total weight of the common group.

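Assume that we have data that is 10% rare and 90% common, and we want the two groups to count 50%-50%. With oversampling, we keep all the rare cases and sample an equal number of common cases. With weights, we give each rare observation a weight of 5 and each common observation a weight of 5/9. The rare group then carries 10% x 5 = 50% of the total weight, and the common group carries 90% x 5/9 = 50%.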
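The arithmetic behind these weights can be checked with a short Python sketch. The helper name `balancing_weights` is hypothetical, and the sketch assumes each group should end up with half of the total weight, which is how I read the example above.

```python
from fractions import Fraction

def balancing_weights(rare_share):
    """Return (rare_weight, common_weight) so each group gets half the total weight."""
    rare_share = Fraction(rare_share)
    common_share = 1 - rare_share
    rare_weight = Fraction(1, 2) / rare_share      # 0.5 / 0.1 = 5
    common_weight = Fraction(1, 2) / common_share  # 0.5 / 0.9 = 5/9
    return rare_weight, common_weight

rare_w, common_w = balancing_weights(Fraction(1, 10))
print(rare_w, common_w)  # 5 5/9
```

Exact fractions are used here only so that 5/9 shows up as 5/9 rather than as a rounded decimal.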


Dear Readers,

In the past few months, my roles and responsibilities have changed, and I will be spending less time on this blog. Instead of lengthy, detailed posts once every 2-3 months, I would like to do short posts more regularly. These posts will now be mostly tips, common mistakes, something new about an old familiar technique we all know, interesting insights, etc. All of these will be based on my own experiences or on something I came across in a book or on a website. I hope you will like the new format and find these short posts useful and interesting.

