Thursday, February 19, 2009

Two Step Cluster - Customer Segmentation in Telecom

I love Cluster Analysis because unlike a lot of other techniques, I don’t have to make any assumptions about the underlying distribution of the data. Though there are a few assumptions for best performance, it’s perfectly okay to cluster data that may not meet these assumptions. Only the business requirements/goals can determine whether the clusters/segments are useful or the solution is satisfactory.

Customer Segmentation is the process of splitting a customer database into distinct, meaningful, and homogeneous groups based on specific parameters or attributes. At a macro level, the main objectives of customer segmentation are to understand the customer base, to monitor how it changes over time, and to support critical strategies and functions such as CRM, loyalty programs, and product development.

At a micro level, the goal is to support specific campaigns, commercial policies, and cross-selling & up-selling activities, and to analyze/manage churn & loyalty.

SPSS has three different procedures that can be used to cluster data: hierarchical cluster analysis, k-means cluster, and two-step cluster. The two-step cluster is appropriate for large datasets or datasets that have a mixture of continuous and categorical variables. It requires only one pass of data (which is important for very large data files).

The first step - Formation of Preclusters
Preclusters are just clusters of the original cases that are used in place of the raw data to reduce the size of the matrix that contains distances between all possible pairs of cases. When preclustering is complete, all cases in the same precluster are treated as a single entity. The size of the distance matrix is no longer dependent on the number of cases but on the number of preclusters. These preclusters are then used in hierarchical clustering.

The second step - Hierarchical Clustering of Preclusters
In the second step, the standard hierarchical clustering algorithm is used on the preclusters.
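
The two steps above can be illustrated with a small sketch in plain Python. This is not SPSS's implementation: it assumes a simple distance threshold for forming preclusters and plain Euclidean distance between centroids for the hierarchical merge, and the toy rows, the threshold, and the function names are all made up for illustration.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def precluster(rows, threshold):
    """Step 1: a single pass over the data. Each row joins the nearest
    precluster if it lies within `threshold`; otherwise it starts a new one.
    Each precluster is stored as [centroid, case_count]."""
    pre = []
    for row in rows:
        best = min(pre, key=lambda p: distance(p[0], row), default=None)
        if best is not None and distance(best[0], row) <= threshold:
            centroid, n = best
            # Update the running centroid with the new case.
            best[0] = [(c * n + x) / (n + 1) for c, x in zip(centroid, row)]
            best[1] = n + 1
        else:
            pre.append([list(row), 1])
    return pre

def merge_preclusters(pre, k):
    """Step 2: agglomerative clustering of the preclusters. Repeatedly
    merge the two closest centroids until only k clusters remain."""
    clusters = [[list(c), n] for c, n in pre]
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: distance(clusters[ij[0]][0], clusters[ij[1]][0]),
        )
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        # The merged centroid is the case-weighted average of the two.
        merged = [[(a * ni + b * nj) / (ni + nj) for a, b in zip(ci, cj)],
                  ni + nj]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical toy data: (usage score, revenue score) per customer.
rows = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
        (4.0, 4.0), (4.1, 3.9),
        (9.0, 9.0), (9.2, 8.8), (8.9, 9.1)]

pre = precluster(rows, threshold=0.5)
clusters = merge_preclusters(pre, k=2)
print(len(pre), "preclusters ->", len(clusters), "clusters")
```

The point to notice is that the pairwise distances in the second step are computed between preclusters, not between the original cases, which is what keeps the distance matrix small.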

The dataset I am going to use has information on 75 attributes for more than 70,000 customers. Product/service usage variables for all customers in the dataset are averages calculated over a period of four months.

In SPSS Clementine, the Data Audit available under the Output nodes palette gives the basic/descriptive statistics (mean, min, max...) and the quality (outliers, missing values...) of the variables.

Out of the 75 variables in the dataset, I used about 15 original variables and 3 new derived variables after considering their quality and business relevance. These selected variables were a combination of demographic, billing, and usage information.

The two-step cluster analysis produced 3 clusters. A very interesting difference was observed between Clusters 1 and 2.

Customers in Cluster 2 display the following characteristics:
- few of them are married
- few of them have children
- few of them have a credit card
- they own the most expensive mobile sets

- maximum # of incoming & outgoing calls
- maximum # of roaming calls
- maximum MOU (minutes of usage)
- maximum # of active subscriptions
- maximum recurring charge (or, subscribes to the most expensive calling plan)
- maximum revenue

- maximum # of calls to customer care
- largest proportion of customers with a low credit rating

Customers in Cluster 1 display characteristics that are exactly the opposite in ALMOST all of the areas mentioned above. So we have customers who are married with children, possess a credit card, own a cheap mobile set, subscribe to the least expensive calling plan, make the minimum # of calls (incoming, outgoing, roaming & customer care), and have the highest credit rating.

Customers in Cluster 3 follow the middle path (in almost all the attributes) and offered no interesting or meaningful insights.

So what can be the business application of this exercise?
To put it simply, cluster analysis has thrown up two very distinct groups of customers: highly profitable but high-risk customers in Cluster 2, and low-profit, low-risk customers in Cluster 1.

For the highly profitable but high-risk customers, one or more of the following actions can be implemented:
- Enhance credit risk monitoring
- Establish stringent usage thresholds
- Educate customers about alternative payment options, or make CC a mandatory payment method
- Migrate to pre-paid plans

For the low-profit, low-risk customers, usage stimulation campaigns can be attempted with or without further segmentation.

This is one of the most basic examples of customer segmentation. If we consider traffic analysis information by taking ratios of certain call/service usage parameters, we can identify customer groups who have increased or decreased their usage. If we consider customer tenure, we can have an understanding of customer loyalty. Accordingly, specific actions can be taken for these groups.
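
As a rough illustration of the ratio idea, here is a minimal Python sketch. It assumes we have each customer's average usage for a recent period and for an earlier period; the function name, the 10% tolerance, and the sample figures are hypothetical and not taken from the dataset used in this post.

```python
def usage_trend(recent_avg, earlier_avg, tolerance=0.10):
    """Label a customer by the ratio of recent to earlier usage.

    A ratio within +/- `tolerance` of 1.0 is treated as stable.
    """
    if earlier_avg <= 0:
        # No earlier usage to compare against, so no ratio can be formed.
        return "new or inactive"
    ratio = recent_avg / earlier_avg
    if ratio > 1 + tolerance:
        return "increased"
    if ratio < 1 - tolerance:
        return "decreased"
    return "stable"

# Hypothetical customers: (recent average MOU, earlier average MOU).
customers = {"A": (150, 100), "B": (60, 100), "C": (105, 100), "D": (50, 0)}

for name, (recent, earlier) in customers.items():
    print(name, usage_trend(recent, earlier))
```

The resulting label (or the raw ratio) can then be used either as an extra clustering variable or as a way to split an existing segment further.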


Anonymous said...


is it necessary to normalize data before 2 step clustering?

how to ascertain that a cluster solution is a good solution?

Tim Manns said...

Ok, I'm being brutally honest because partly I am hoping I've misunderstood... But if I have understood your analysis correctly, then you need to know this.

for example re:"maximum # of incoming & outgoing calls"
-> this is a huge generalisation. Within a telco's call detail records there is lots of information pertaining to the time of day, how many distinct numbers were dialled during the month, average call duration, and calls to other customers or to different networks. For mobile networks there is additionally SMS, picture calls, data, ringtones etc.

I'm not saying anything you did seemed incorrect; many telcos were doing this exact same stuff 10 years ago. As a data miner I find it a very simple project and use of potential data for segmentation; graduate level work. Maybe the data was summarised into this form and passed to you in a nice text file, maybe you simplified it for the weblog and beginner data miners. I don't know... If you are working with the call data there's far more you could have done with it, and I believe the segmentation would have benefited greatly and therefore highlighted many more interesting features about your customer base.

The Clementine stream you provided was as simple as possible. I'm guessing that's just for illustrative purposes, and the actual analysis did far more than the picture suggests.

Romakanta said...

Anon: The two-step method standardizes variables within the procedure itself. You can override the settings, and standardize the data yourself.
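
If you do standardize the data yourself, one common choice is the z-score (subtract the mean, divide by the standard deviation). The sketch below is only an illustration in plain Python: it assumes the population standard deviation and made-up values, and it is not meant to reproduce exactly what SPSS does internally.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD; an assumption here
    if sd == 0:
        # A constant variable carries no information for clustering.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

# Hypothetical monthly revenue figures for three customers.
z = standardize([100, 200, 300])
print(z)
```

After this step every continuous variable has mean 0 and standard deviation 1, so a variable measured in large units (say, revenue) does not dominate the distance calculation.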

To ascertain that a cluster solution is a good solution, you’ll have to rely less on the statistical parameters/outputs and more on your business problem and requirements!

Tim: This was not a project at all :-)

I downloaded the data from a university website, so you are right again - the data was summarized and passed to me in a nice text file:-))

This post was meant to be a simple (and very easy to understand) introduction to customer segmentation (and 2-step cluster method) and Clementine. Thanks for your insights Tim, I would love to see some detailed posts regarding Telecom Analytics on your blog :-)

Tim Manns said...

re: "meant to be a simple"

-> phew! Ok, in that case I sincerely apologise. For a while I was thinking, this couldn't be right...

I'll write up a description of my segmentation on my blog soon. It is difficult to avoid giving away too many secrets :) Telcos are very protective of their data (obviously) and any data mining is often considered a competitive advantage.

Romakanta said...

not at all...comments and insights are always appreciated:-)

yup, we data miners have to be careful of privacy/confidentiality issues all the time! looking forward to your post.

shofi said...

it's really interesting. I was looking for more information 'bout this topic, and I found ur blog.
but I have a question: for the second step of two-step clustering, is it a must to use Hierarchical Clustering? what about k-means?

Datalligence said...

hi shofi,

it's not a must, it's just the method/procedure SPSS uses.

but i'm a bit skeptical about using the k-means here. you need to assume/know the number of clusters in your dataset for using the k-means.

Anonymous said...

Hi! This article is very interesting.
I would like to know if the two-step cluster is implemented in SAS. I have to do a segmentation with a mixture of continuous and categorical variables in SAS, but I don't know how to do it there.

Thanks in advance.

jennifersign said...

Great information! Customer Segmentation can be great for a company if successfully done.