Datalligence: analytics

Showing posts with label analytics. Show all posts

Tuesday, January 14, 2014

Analytics 3.0

This slide is from the ebook available at http://iianalytics.com/a3/
The full article on Analytics 3.0 is at http://hbr.org/2013/12/analytics-30/ar/1

Friday, March 16, 2012

Clustering vs. Segmentation vs. Profiling

Tuesday, October 4, 2011

Discount/Variety stores productivity

I was looking at the "RetailSails 2011 Chain Store Productivity Report" available at retailsails. I do most of my "everyday needs" shopping at Discount/Variety stores and so that section was the most interesting to me.

The table has the following columns:
1. Sales
2. YoY change in Sales
3. # of Stores
4. Average Store Size
5. Sales per Store
6. YoY change in Sales/Store
7. Sales per sqft
8. YoY change in Sales/sqft

So, which companies are doing good?

Total Sales is very much dependent on the number of stores and the size of the stores. YoY change in Sales would have been useful if there was information on the change in the number of stores, YoY.

Sales per store is again strongly associated with size of the store. But does bigger store size always mean higher sales?

Walmart has the biggest stores, with an average size of about 162,000 sqft. But Costco, with an average store size of 145,000 sqft has an impressive sales per store - $58 million more than a Walmart store. A Costco store is about 0.9 times the size of a Walmart store but its sales is 1.9 times that of a Walmart store.

Another interesting comparison is Target and Sam's Club. Almost the same store size, but sales at a Sam's Club store is almost twice that of a Target store.

You will see a very similar story if you use Sales per sqft instead of Sales per store. But what makes these analyses incomplete is the absence of average price/item. The comparison between Costco and Walmart is obvious - prices will be much lower at Walmart. But what about Target & Sam's Club? Prices at Sam's Club may still be lower but can that alone explain a Target store having twice the sales of a Sam's Club? Other factors that can be the reason behind a Target store's higher sales will be - optimized layouts (more efficient use of space, more items per unit area), higher traffic, larger proportion of high ticket items, etc..

And how are these companies growing?

Looks good - almost all of these companies are seeing positive trends in both total sales and sales per sqft, YoY. What about Kmart? Sales per sqft is growing YoY but total sales YoY is down. The first thing that comes to my mind - Kmart must have closed some of its non-performing stores. Any other reasons you can think of?

Wednesday, June 29, 2011

How to make an impact with Analytics

From the sascom magazine blog - How to make an impact with Analytics by Bob Messier, Senior Director of Business Analytics, SAS.

Sunday, June 5, 2011

Keep it simple

Is analytics all about actionable insights? Go and check out some of the websites of companies that offers analytics as a service. Chances are extremely high that you will come across either “actionable” or “actionable insights”.

But the truth is a lot of the work done at Analytics companies is not actionable at all. A lot of the daily or regular requirements will be the “good to know” numbers or information, or what many analysts will derogatorily refer to as Reporting work.

To be honest, when I just started my Analytics career I had the same biased or uninformed opinions. Predictive Analytics or Modeling was the only “cool” thing in Analytics. As they say, much water has flowed under the bridge and below are some of the things I have learned along the way, over the years.

All numbers and insights need not be actionable

When working with a new department or analyzing a new customer base, most of the analyses will start with understanding the business or the customers. The simple reports or the exploratory data analysis is going to be a very useful and important guide for future business plans and strategies.

If the results of your analysis can answer a business question, that’s good. And sometimes, it can be very good.

Averages are sometimes the best

I learned this all over again recently. The client wanted to see how different their 2 groups of customers were. As the initial discussion was focused on the purchase behavior or purchase life cycles of these two groups, I jumped into analyzing the customers’ monthly transactions since their acquisition dates - trying to see if these two groups have different buying patterns across their tenures.

When their overall sales didn’t throw up any surprises, I went into sales within specific product categories. By the end of the week, I found a few interesting patterns. But during the second meeting with the client next week, as I was going through the slides one by one (about 7-8 slides) – both my client and I realized that though the buying patterns were very similar, there was a big gap between the lines. Instead of analyzing how the behavior spiked or dipped or flattened out month on month, the biggest and most important analysis would have been a single slide on the averages of the two customer groups over their 1 year tenure. The differences in their average sales, basket size, number of trips, etc. was clearly seen – one single slide, one table – that was what I should have done when my hypothesis or what we all wanted to see was proved wrong by the data.

Understand the drivers – is modeling really required?

When clients say they want to understand the drivers of customer attrition or response, the first thing many analysts will do is to develop a model. But wait a minute, have some patience and ask a few more questions. A model has to be built if the client has a marketing plan or strategy because you will need to score customers for targeting.

What if all the client wants is to understand the drivers? Maybe all she needs is to identify and understand who these attriters or responders are. And to answer that, you don’t need to develop a model that will take a lot of time and money.

All you need is EDA. Means, frequencies and cross tabs based on the target variable (example, responded or not) will reveal things like – 70% of the responders visited the store in the last 3 months, 80% of the responders live within 5 miles of the store, 65% of the responders use a Credit Card, etc. And this will very much answer your client’s questions.

Signing off with:

Karma police, arrest this man
He talks in maths
He buzzes like a fridge
He's like a detuned radio
-- Karma Police by Radiohead

Wednesday, February 23, 2011

The Keyword Tree

Lately, I am getting very interested in data visualization and text mining. Just got the evaluation version of Tibco Spotfire and I would have to say this - it rocks! Beautiful high-quality visualizations, and a good number of features.

Was also playing around with the Keyword Tree from Juice Analytics. A brief description of this tool:

What search words and phrases are driving traffic to my site?

In the word tree visualization, you'll see a frequently used search term at the center. To the right and left, it shows the search terms that are most often used in combination with that word. The words are sized by their frequency of use and colored by bounce rate (or 0% new visitors or average time on site).

I then linked it to my Google Analytics account, and to this blog and here's what I got. Simply BEAUTIFUL!

Wednesday, October 20, 2010

So you thought...?

Sometimes you get the feeling that everyone around you is so confused or just don't know about things which are basic and essential in Analytics. Below is a list of the most common terms that a majority thinks they know but don't.

1. Linear/Pearson Correlation: The most misunderstood term as far as i know. Before doing anything else, check if the 2 variables share a linear relation. Correlation values without a linear pattern is meaningless. And also be aware that in many softwares (including MS Excel), the default is pearson correlation, for which a linear relation between the two variables is a requirement.

2. Significance Test: Many many people into Analytics (?) will never ever understand this or will never try to understand this. Just because you see 2 groups doesn't mean that you can do a significance test. Know something or everything about sampling and designs before talking about significance test.

3. Lift and Cumulative Gains Charts: They are different, period. Don't confuse one with another.

Lift - Without a model, we get 30% of the responders by contacting 30% of the customers. Using a model, we get 60% of responders. The lift is 60/30 = 2 times.

Cumulative Gains - Using the model, if we contact 30% of the customers we get 60% of all responders.

4. Clustering/Segmentation and Profiling: Let's make this simple. Clustering/Segmenting will answer - Can my customer base be broken up into distinct groups based on certain attributes/characteristics? Customers within a group will be very similar to one another while customers across groups will be different.

Profiling will answer - Who are my best customers? What do they purchase? How often? What is their ethnicity, their household size and income, etc.? In many cases, profiling usually follows clustering/segmentation. Who are the customers in Group 1?

Signing off with:
"There must be some kind of way out of here,"
Said the joker to the thief
"There's too much confusion,
I can get no relief"
-- All along the watchtower by Jimi Hendrix

Friday, January 1, 2010

Why Analytics Fail To Deliver?

Just came across this post from Vincent Granville on AnalyticBridge. Quite interesting and informative. Am sharing it here, along with a few additions of my own.

- The model is more accurate than the data: you try to kill a fly with a nuclear weapon.
- You spent one month designing a perfect solution when a 95% accurate solution can be designed in one day. You focus on 1% of the business revenue, lack vision / lack the big picture.
- Poor communication. Not listening to the client, not requesting the proper data. Providing too much data to decision makers rather than 4 bullets with actionable information. Failure to leverage external data sources. Not using the right metrics. Problems with gathering requirements. Poor graphics or graphics that is too complicated.
- Failure to remove or aggregate conclusions that do not have statistical significance. Or repeating many times the same statistical tests, thus negatively impacting the confidence levels.
- Sloppy modeling or design of experiments, or sampling issues. One large bucket of data has good statistical significance, but all the data in the bucket in question is from one old client with inaccurate statistics. Or you use two databases to join sales and revenue, but the join is messy, or sales and revenue data do not overlap because of a different latency.
- Lack of maintenance. The data flow is highly dynamic and patterns change over time, but the model was tested 1 year ago on a data set that has significantly evolved. Model is never revisited, or parameters / blacklists are not updated with the right frequency.
- Changes in definition (e.g. include international users in the definition of a user, or remove filtered users) resulting in metrics that lack consistency, making vertical comparisons (trending, for a same client) or horizontal comparisons (comparing multiple clients at a same time) impossible.
- Blending data from multiple sources without proper standardizations: using (non-normalized) conversion rates instead of (normalized) odds of conversion.
- Poor cross validation. Cross validation should not be about randomly splitting the training set into a number of subsets, but rather comparing before (training) with after (test). Or comparing 50 training clients with 50 different test clients, rather than 5000 training observations from100 clients with another 5000 test observations from the same 100 clients. Eliminate features with statistical significance but lack of robustness when comparing 2 time periods.
- Improper use of statistical packages. Don't feed a decision tree software with a raw metric such as IP address: it just does not make sense. Instead provide a smart binned metric such as type of IP address (corporate proxy, bot, anonymous proxy, edu proxy, static IP, IP from ISP, etc.)
- Wrong assumptions. Working with dependent independent variables and not handling the problem. Violations of the Gaussian model, multimodality ignored. External factor explains the variations in response, not your independent variables. When doing A/B testing, ignoring important changes made to the website during the A/B testing time period.
- Lack of good sense. Analytics is a science AND an art, and the best solutions require sophisticated craftsmanship (the stuff you will never learn at school), but might usually be implemented pretty fast: elegant/efficient simplicity vs. inefficient complicated solutions.

My additions:

- What is your problem?
Without a real business problem, modeling or data mining will just give you numbers. I have come across people, on both the delivery and client side, who comes up with this often repeated line “I’ve got this data, tell me what you can do?”
My answer to that – “I will give you the probability that your customer will attrite based on the last 2 digits of her transaction ID. Now, tell me what are you going to do to make her stay with your business?”

Start with a REAL business problem.

- What are you gonna do about it?
So if your customers are leaving in alarming numbers, don’t just say that you want an attrition/churn model. Think on how you would like to use the model results. Are you thinking of a retention campaign? Are you to going to reduce churn by focusing on ALL the customers most likely to leave? Or are you going to focus on a specific subset of these customers (based on their profitability, for example)? How soon can you launch a campaign? How frequently will you be targeting these customers?

Have a CLEAR idea on what you are going to do with the model results.

- What have you got?
Don’t expect a wonderful earth-shattering surprise from all modeling projects. Model results or performances are based on many factors, with data quality as the most important one, in almost all the cases. If your database is full of @#$%, remember one thing. Garbage in, Garbage out. Period.

- Modeling is not going to give you a nice easy-to-read chart on how to run businesses.

- Technique is not everything.
A complex technique like (Artificial) Neural Networks doesn’t guarantee a prize winning model. Selecting a technique depends on many factors, with the most important ones being data types, data quality and the business requirement.

- Educate yourself
It’s never too late to learn. For people on the delivery side, modeling is not about the T-test and Regression alone. For people on the client side, know what Analytics or Data Mining can do, and CANNOT do. Know when, where and how to relate the model results with your business.

Tuesday, December 1, 2009

Profiling your Customers!

From the "Field Guide to Mathematical Marketing" by Mark Klein, available at Loyalty Builders

Sunday, May 31, 2009

Analytics: Reality and the Growing Interest

This is a guest post by Bhupendra Khanal, CEO of InRev Systems.

InRev Systems is a Bangalore based Decision Management Company, which works on Data Based Information Systems. Their interest areas are Marketing Services, Web Information, MIS Reporting, Social Media Services and Economic Research. Bhupendra also maintains a personal blog at Business Analytics.

Introduction
Huge amount of data is collected by any business houses today. There are also certain data collection agencies who have information like economic variables, demographic variables, police fraud list, loan default list, telephone and electricity bill payment history etc. All these data if analyzed, tend to separate people into various similar groups. These groups can be fraudulent group, defaulter group, risk averse and risk takers group, high income and low income groups, etc.

Based on this information, many business decisions can be made in a better and rational way. Analytics leverages almost the same concept but with assumptions:
· The behavior of people do not change with time
· People with similar profile behave similarly

Predictive Modeling and Segmentation are the major components of Analytics. The profile and the behavior of a set of people are taken, and a relation is found out. This same relation is used for building future profiles and predicting the behavior of people having the same profiles. This is commonly done using advanced statistical techniques like Regression Modeling (Linear, Logistic, and Poisson etc) and Neural Networks.

The Business Analytics Services Market comprises solutions for storing, analyzing, modeling, and delivering information in support of decision-making and reporting processes.

Analytics, regardless of its complexity, serve the same purpose – to assist in improving or standardizing decisions at all levels of an organization.

Size and Type of Market
The size of the Analytics market globally is estimated around $25 billion today. It is increasing very fast, doubling almost every five years for the last few decades, with $19 billion dollars in 2006. It is expected to grow to $31 billion by the end of 2011 (source - IDC 2007).

There are many areas for the implementation of Analytics. The most common Analytics practices are Risk Management, Marketing Analytics, Web Analytics, and Fraud Prediction etc. These functions are handled by different organizations in different ways, with most companies maintaining a fine balance with the in-house team, outsourcing partner and consulting project vendors. Such variation makes calculation of market size very difficult.

Risk Management is one of the largest component of the Analytics Industry today, and it is the pioneer component too. The market is huge in the US and Europe; Asia Pacific is coming up fast and it is yet to get full swing in China and South Asia, including Nepal.

Web Analytics (analytics of Web Data) has picked up very fast and it is increasing by more than 20% for the last few years, thanks to the revolution led by Amazon, Google and Yahoo. Web Analytics market is a late entrant but have already passed the one billion mark.

The biggest of all is the Marketing Analytics (MA) and Strategy Science component. This is really huge owing to the effort put by companies like Dunhumby, Axiom and others. MA is critical as it tends to compete directly with Marketing Research Firms and Strategy Consulting firms in the type of work it does. This makes it difficult to calculate the market size but it is worth billions of dollars.

Big Players in the Area
The size of the Analytics market is huge today but the industry is fragmented. None of the core Analytics and Decision Management Company has ever touch a billion dollar revenue mark.

Fair Isaac is the pioneer and the largest Decision Management Company with revenue of around 800 million dollars. The other core DM and Analytics companies are quite smaller. This has happened due to the aggressive moves by IT Services Companies and Information Bureaus to acquire Analytics companies.

Experian, Transunion and Equifax are three major bureaus in the US while there are others too - Innovis, Axiom, Teletrack, Lexis Nexis etc. Each of these bureaus has analytics services as their offerings. Apart from these the BI majors like SAS, SPSS, Salford Systems etc. offer Analytics Services.

The Indian Analytics Market is small but growing. This can be justified by the more than double salary hike rates in Analytics compared to the Software domain. The outsourcing shops and India-focused companies both have mushroomed in the last five years in India while the problem remains in getting proper talent and retaining them.

Major Banks like ICICI and HDFC have their strong in-house analytics unit while SBI has partnered with GE Money for Analytics support for their Cards portfolio. The smaller banks are yet to start.

Other majors in non banking industries are also using Analytics through Outsourcing or Consulting. Airtel and Reliance are leading the way for the use of Analytics in India in Telecom.

Scope on how it can grow: Indian Context
The Analytics Industry is growing fast. It has the scope of forming a separate process across industries like HR, Operations, and IT System. IT is now in the early stage of process metamorphosis, where each process start through consulting, grow through in-house establishments and finally settle down as outsourcing to third parties.

The future growth depends on the approach of major players and consolidation of the industry. All this will make it a high value and high growth industry, where players can provide high quality products and services while maintaining their profitability.

Another big challenge is the supply of quality man-power and training. Today India neither has good institutes training people on Analytics nor has it got an Infosys for Analytics (who can employ and train huge number of freshers). Even the number of good Statistics and Mathematics Institutes in India is less.

Amidst all these challenges, India is positioned fairly well in the world today, and it will be interesting to see if it can become a Knowledge Process and Analytics hub in the days ahead.

Monday, May 18, 2009

A Tale Of Two Banks and One Telecom Service Provider

I have bank accounts at ICICI & Citibank (India). I also use credit cards of these two companies. Let’s first talk about their Mobile Banking services. ICICI has the following options:

My account getting credited above
My account getting debited above
Salary credited to my account #
Cheque deposited in my account bounced
Account Balance above
Account Balance below
Debit Card Purchases above

Messages for the above can be received through SMS, Email or both.

Citibank calls it Alerts, and they offer the following options:

Withdrawal balance by account
Withdrawal balance by account
Time deposit maturity advice
Cheque Status
Cheque Bounce Alert
Time Deposit Redemption Notice
Cheque dishonor

These messages can be received through SMS, Email or Both depending on the alert type. Also, for some of the alerts, message frequencies can be chosen as Daily, Weekly, or Monthly.

At first glance, it seems that Citibank offers more options but a closer look will reveal ICICI has done more research and come up with a better offering. The Citibank alerts are based on how frequently you want them while ICICI’s alerts are based on particular (defined by the customer) credited and debited amounts. ICICI’s options make more sense as I don’t want alerts daily or weekly, but only when I have made a transaction.

And because of the lack of options, I continue to receive daily alerts on my Citibank account balance irrespective of the fact that the balance remains the same or I haven’t done any transaction for a month. Also, the system at Citibank looks like a typical CRM system while the one at ICICI looks more like a BI system.

Now, let’s discuss their Credit Cards and their Customer Analytics.

I have been using an ICICI Gold credit card for almost 3 years now. I used it to pay all my bills – electricity bill, mobile bill, internet bill, shopping bills….and I paid all dues in time. I applied for and got the Citibank Gold card about a year after I got my ICICI card. I used the Citibank Gold card for 2-3 purchases, paid all the dues in time, and Citibank increased my credit limit every time.

Encouraged by their response (I got very nice emails from their customer service) and actions (increase of credit limit), I started using the Citibank Gold card more frequently. In a few months’ time, I got a free-for-life Citibank Platinum card with all these attractive features and benefits. I even got an invitation to join a wine club though I’m more of a rum and whiskey guy. Don’t blame them though; getting your hands on such kind of consumer lifestyle/preferences data will be next to impossible in India.

I have now almost forgotten the ICICI Gold card; I use it very rarely these days. The credit limit given to me 3 years back still remains the same. And I have never received a single email or communication from ICICI. I also know that ICICI bank outsources its Customer Analytics to an Analytics Service Provider in Mumbai. So where is the up sell analytics? Doesn’t their data show the fact that I am “almost” leaving now? That again reveals the fact that customer spending information is not at all analyzed and they are not doing much about customer churn either.

On a different note, I am a post-paid mobile customer of India’s largest telecom service provider, Airtel. Every month, whenever my bill is generated I receive about 6 SMSs from Airtel within the next 5-6 days. The messages can be summarized as:

1. Your bill has been generated…
2. You can view your bill at your online account…
3. Your bill has been emailed to abc@gmail.com and the password to open it is… and if you haven’t received it… (this is actually 2 SMSs because of the length of the message)
4. Your bill amount is XXX…
5. Your bill amount is XXX and the last date of payment is…

For the last 3 years or so, I have been paying all my bills before the due date. Once I got so irritated that I emailed their customer service, and their reply? These are server generated messages and we can’t do anything about it. I got more irritated and asked my email to be forwarded to Airtel’s CRM, Business Intelligence, Analytics or whatever the team there like to call themselves!

I got just one SMS alert when my next month’s bill was generated. But it was back to square one from the 2nd month onwards.

My question is which “smart” manager came up with the idea that 6 SMSs should be sent to all their post-paid customers every month? Why doesn’t one SMS saying “Your bill amount of XXX has been generated and the due date is ABC” suffice? Has anyone at Airtel calculated the cost of sending 5-6 SMSs to all their post-paid customers, every month? Shouldn’t the last SMS be sent only to those customers who have a habit of making late payments? Do they send the second SMS to those customers who don’t have online accounts too? And why can’t these alerts be customized based on a customer’s usage and payment behavior?

I can give more examples of Indian companies in the retail, entertainment, and services sector that are not doing nothing or very little about all the customer data they have, inspite of mentioning or advertising that they use BI & Analytics. So how mature is the Business Intelligence, and CRM Analytics setup at Indian companies? And how skilled or knowledgeable are the senior people associated with it?

Thursday, February 19, 2009

Two Step Cluster - Customer Segmentation in Telecom

I love Cluster Analysis because unlike a lot of other techniques, I don’t have to make any assumptions about the underlying distribution of the data. Though there are a few assumptions for best performance, it’s perfectly okay to cluster data that may not meet these assumptions. Only the business requirements/goals can determine whether the clusters/segments are useful or the solution is satisfactory.

Customer Segmentation is the process of splitting a customer database into distinct, meaningful, and homogenous groups based on specific parameters or attributes. At a macro level, the main objective for customer segmentation is to understand the customer base, monitor and understand changes over time, and to support critical strategies and functions such as CRM, Loyalty programs, and product development.

At a micro level, the goal is to support specific campaigns, commercial policies, cross-selling & up-selling activities, and analyze/manage churn & loyalty

SPSS has three different procedures that can be used to cluster data: hierarchical cluster analysis, k-means cluster, and two-step cluster. The two-step cluster is appropriate for large datasets or datasets that have a mixture of continuous and categorical variables. It requires only one pass of data (which is important for very large data files).

The first step - Formation of Preclusters
Preclusters are just clusters of the original cases that are used in place of the raw data to reduce the size of the matrix that contains distances between all possible pairs of cases. When preclustering is complete, all cases in the same precluster are treated as a single entity. The size of the distance matrix is no longer dependent on the number of cases but on the number of preclusters. These preclusters are then used in hierarchical clustering.

The second step - Hierarchical Clustering of Preclusters
In the second step, the standard hierarchical clustering algorithm is used on the preclusters.

The dataset I am going to use has information on 75 attributes for more than 70,000 customers. Product/service usage variables for all customers in the dataset are averages calculated over a period of four months.

In SPSS Clementine, the Data Audit available under the Output nodes palette gives the basic/descriptive statistics (mean, min, max...) and the quality (outliers, missing values...) of the variables.

Out of the 75 variables in the dataset, I used about 15 original variables and 3 new derived variables after considering their quality and business relevance. These selected variables were a combination of demographic, billing, and usage information.

The two-step cluster analysis produced 3 clusters. A very interesting difference was observed between Clusters 1 and 2.

Customers in Cluster 2 display the following characteristics:
- few of them are married
- few of them have children
- few of them have a credit card
- owns the most expensive mobile set

- maximum # of incoming & outgoing calls
- maximum # of roaming calls
- maximum MOU (minutes of usage)
- maximum # of active subscriptions
- maximum recurring charge (or, subscribes to the most expensive calling plan)
- maximum revenue

- maximum # of calls to customer care
- has the largest proportion of customers with low credit rating

Customers in Cluster 1 display characteristics that were exactly the opposite in ALMOST all of the areas mentioned above. So we have these customers who are married with children, posses a credit card, own a cheap mobile set, subscribe to the least expensive calling plan, make the minimum # of calls (incoming, outgoing, roaming & customer care), and has the highest credit rating.

Customers in Cluster 3 follow the middle path (in almost all the attributes) and offered no interesting or meaningful insights.

So what can be the business application of this exercise?
To put it simply, cluster analysis has thrown up two very distinct groups of customers – highly profitable but high risk customers in Cluster 2, and low profitable and low risk customers in Cluster 1.

For the highly profitable but high risk customers, one or more of the following actions can be implemented:
- Enhance credit risk monitoring
- Establish stringent usage thresholds
- Educate customers about alternative payment options, or make CC a mandatory payment method
- Migrate to pre-paid plans

For the low profitable and low risk customers, usage stimulation campaigns can be attempted with or without further segmentation.

This is one of the most basic examples of customer segmentation. If we consider traffic analysis information by taking ratios of certain call/service usage parameters, we can identify customer groups who have increased or decreased their usage. If we consider customer tenure, we can have an understanding of customer loyalty. Accordingly, specific actions can be taken for these groups.

Friday, January 9, 2009

Q & A with Eric Siegel, President of Prediction Impact

It's my pleasure to welcome Eric Siegel, President of Prediction Impact on datalligence. He has kindly answered some of my questions related to Data Mining.

Q1. A brief intro about yourself and your DM experience
Eric: I've been in data mining for 16 years and commercially applying predictive analytics with Prediction Impact since 2003. As a professor at Columbia University, I taught the graduate course in predictive modeling (referred to as "machine learning" at universities), and have continued to lead training seminars in predictive analytics as part of my consulting career.

I'm also the program chair for Predictive Analytics World, coming to San Francisco Feb 18-19. This is the business-focused event for predictive analytics professionals, managers and commercial practitioners. This conference delivers case studies, expertise and resources in order to strengthen the business impact delivered by predictive analytics.

Q2. What are the most common mistakes you've encountered while working on DM projects?
Eric: The main mistake is not following best practice organizational processes, as set forth by standards such as by CRISP-DM (mentioned in your Dec 18th blog on "Methodologies").

Predictive analytics' success hinges on deciding as an organization which specific customer behavior to predict. The decision must be guided not only by what is analytically feasible with the data available, but by which predictions will provide a positive business impact. This can be an elusive thing to pin down, requiring truly informed buy-in by various parties, including those who's operational activities will be changed by integrating predictive scores output by a model. The interactive process model defined by CRISP-DM and other standards ensures that you "plan backwards," starting from the end deployment goal, including the right personnel at key decision points throughout the project, and establishing realistic timelines and performance expectations

Dr. John Elder has a somewhat famous list of the top 10 common-but-deadly mistakes, which is an integral part of the workshop he's conducting at Predictive Analytics World, "The Best and the Worst of Predictive Analytics: Predictive Modeling Methods and Common Data Mining Mistakes". As he likes to say, "Best Practices by seeing their flip side: Worst Practices". For more information about the workshop, see The Best and the Worst of Predictive Analytics

Q3. Translating the Business Goal to a Data Mining Goal, and then defining the acceptable model performance/accuracy level for the success of the DM project appears to be one of the biggest challenges in a DM project. One approach is to use the typical accuracy level used in that particular domain. Another method is to model on a sample dataset (sort of a POC) to come up with an acceptable model performance/accuracy level for the entire dataset/project. Which approaches do you recommend/use to define the acceptable accuracy/cut-off level for a DM project?
Eric: Acceptable performance should be defined as the level where your company attains true business value. Establishing typical performance for a domain can be very tricky, since, even within one domain, each company is so unique - the context in which predictive models will be deployed is unique in the available data (which reflects unique customer lists and their responses or lack thereof to unique products) and in the operational systems and processes. Instead, forecast the ROI that will be attained in model deployment, based on both optimistic and conservative model performance levels. Then, if the conservative ROI looks healthy enough to move forward (or the optimistic ROI is exciting enough to take a risk), determine a minimal acceptable ROI and the corresponding model performance that would attain it as the target model performance level. This is then followed as the goal that must be attained in order to deploy the model, putting its predictive scores into play "in the field".

Q4. One thing I hear a lot from freshers entering the DM field is that they want to learn SAS. Considering the fact that SAS programming skills are highly respected and earn more than any other DM software skills, it's actually a futile exercise to convince these freshers that a tool-neutral DM knowledge is what they should actually strive for. What's your opinion on this?
Eric: Well, I think most people understand there are advantages to taking general driving lessons, rather than lessons that teach you only how to drive a Porsche. On the other hand, you can only sit in one car at a time, and when you learn how to drive your first car, most of what you learn applies in general, for other cars as well. All cars have steering wheels and accelerators; many predictive modeling tools share the same standard, non-proprietary core analytical methods developed at universities (decision trees, neural networks, etc.), and all of them help you prepare the data, evaluate model performance by viewing lift curves and such, and deploy the models.

Q5. According to you, what are the new areas/domains where DM is being applied?
Eric: I see human resource applications, including human capital retention, as an up-and-coming, and an interesting contrast to marketing applications: predict which employees will quit rather than the more standard prediction of which customer will defect.

I consider these the hottest areas (all represented by named case studies at PAW-09, by the way):

* Marketing and CRM (offline and online)
- Response modeling
- Customer retention with churn modeling
- Acquisition of high-value customers
- Direct marketing
- Database marketing
- Profiling and cloning
* Online marketing optimization
- Behavior-based advertising
- Email targeting
- Website content optimization
* Product recommendation systems (e.g., the Netflix Prize)
* Insurance pricing
* Credit scoring

Q6. In spite of the fact that a lot of companies in India provide Analytics or Data Mining as a service/solution to many companies around the world, there are no institutions/companies providing quality and industry focused Data Mining education. There are no colleges/universities offering Masters in Analytics/Data Mining in India. I have a lot of friends/colleagues who will gladly take up such courses/programs if they are made available in India. Can we expect this kind of courses/trainings from Prediction Impact, The Modeling Agency, TDWI, etc. in the near future?
Eric: I'm in on discussions several times a year about bringing a training seminar to other regions beyond North America and Europe, but it isn't clear when this will happen. For now, Prediction Impact does offer an online training program, "Predictive Analytics Applied" available on-demand at any time.

Tuesday, November 25, 2008

Fraud Prediction - Decision Trees & Support Vector Machines (Classification)

My first thought when I was asked to learn and use Oracle Data Mining (ODM) was, “Oh no! Yet another Data Mining Software!!!”

It’s been about 2 weeks now since I have been using ODM, particularly focusing on two classification techniques – Decision Trees & Support Vector Machines. As I don’t want to get into the details of the interface/usability of ODM (unless Oracle pays me!!), I will limit this post on a comparison of these two classification techniques at a very basic level, using ODM.

A very brief introduction of DT & SVM.

DT – A flow chart or diagram representing a classification system or a predictive model. The tree is structured as a sequence of simple questions. The answers to these questions trace a path down the tree. The end product is a collection of hierarchical rules that segment the data into groups, where a decision (classification or prediction) is made for each group.

-The hierarchy is called a tree, and each segment is called a node.
-The original segment contains the entire data set, referred to as the root node of the tree.
-A node with all of its successors forms a branch of the node that created it.
-The final nodes (terminal nodes) are called leaves. For each leaf, a decision is made and applied to all observations in the leaf.

SVM – A Support Vector Machine (SVM) performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories.

In SVM jargon, a predictor variable is called an attribute, and a transformed attribute that is used to define the hyperplane is called a feature. A set of features that describes one case/record is called a vector. The goal of SVM modeling is to find the optimal hyperplane that separates clusters of vector in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other size of the plane. The vectors near the hyperplane are the support vectors.

SVM is a kernel-based algorithm. A kernel is a function that transforms the input data to a high-dimensional space where the problem is solved. Kernel functions can be linear or nonlinear.

The linear kernel function reduces to a linear equation on the original attributes in the training data. The Gaussian kernel transforms each case in the training data to a point in an n-dimensional space, where n is the number of cases. The algorithm attempts to separate the points into subsets with homogeneous target values. The Gaussian kernel uses nonlinear separators, but within the kernel space it constructs a linear equation.

I worked on this dataset which has fraudulent fuel card transactions. Two techniques I previously tried are Logistic Regression (using SAS/STAT) & Decision Trees (using SPSS Answer Tree). Neither of them was found to be suitable for this dataset/problem.

The dataset has about 300,000 records/transactions and about 0.06% of these have been flagged as fraudulent. The target variable is the fraud indicator with 0s as non-frauds, and 1s as frauds.

The Data Preparation consisted of missing value treatments, normalization, etc. Predictor variables that are strongly associated with the fraud indicator – both from the business & statistics perspective – were selected.

The dataset was divided into a Build Data (60% of the records) and Test Data (40% of the records).

Algorithm Settings for DT,

Accuracy/Confusion Matrix for DT,

Algorithm Settings for SVM,

Accuracy/Confusion Matrix for SVM,

We can see clearly that SVM is outperforming DT in predicting the fraudulent cases (93% vs. 72%).

Though it depends a lot on the data/business domain & problem, SVM generally performs well on data sets where there are very few cases on which to train the model.

Wednesday, October 29, 2008

Eight Levels Of Analytics

The number of companies advertising Analytics as one of their service offerings leaves me astounded. I also come across an endless number of people who say they are into Analytics.

Maybe all this has to do with the very broad definition of Analytics. Or the way some people/companies interpret Analytics. A little more research, and a few more minutes of conversation reveals that the "Analytics" these companies or people actually do is just reporting. Personally speaking, I feel "Analytics" has become a widely misinterpreted and misused term.

According to Gartner, “Analytics leverage data in a particular functional process (or application) to enable context-specific insight that is actionable.”

The lastest sascom online magazine describes the eight level of Analytics, from the simplest to the most advanced.

1. STANDARD REPORTS
What happened? When did it happen?
E.g. Monthly or quarterly financial reports.

Generated on a regular basis, they just describe “what happened” in a particular area. They’re useful to some extent, but not for making long-term decisions.

2. AD HOC REPORTS
How many? How often? Where?
E.g. Custom reports that describe the number of hospital patients for every diagnosis code for each day of the week.

At their best, ad hoc reports let you ask the questions and request a couple of custom reports to find the answers.

3. QUERY DRILLDOWN (OR OLAP)
Where exactly is the problem? How do I find the answers?
E.g. Sort and explore data about different types of cell phone users and their calling behaviors.

Query drilldown allows for a little bit of discovery. OLAP lets you manipulate the data yourself to find out how many, what color and where.

4. ALERTS
When should I react? What actions are needed now?
E.g. Sales executives receive alerts when sales targets are falling behind.

With alerts, you can learn when you have a problem and be notified when something similar happens again in the future. Alerts can appear via e-mail, RSS feeds or as red dials on a scorecard or dashboard.

5. STATISTICAL ANALYSIS
Why is this happening? What opportunities am I missing?
E.g. Banks can discover why an increasing number of customers are refinancing their homes.

Here we can begin to run some complex analytics, like frequency models and regression analysis. We can begin to look at why things are happening using the stored data and then begin to answer questions based on the data.

6. FORECASTING
What if these trends continue? How much is needed? When will it be needed?
E.g. Retailers can predict how demand for individual products will vary from store to store.

Forecasting is one of the hottest markets – and hottest analytical applications – right now. It applies everywhere. In particular, forecasting demand helps supply just enough inventory, so you don’t run out or have too much.

7. PREDICTIVE MODELING
What will happen next? How will it affect my business?
E.g. Hotels and casinos can predict which VIP customers will be more interested in particular vacation packages.

If you have 10 million customers and want to do a marketing campaign, who’s most likely to respond? How do you segment that group? And how do you determine who’s most likely to leave your organization? Predictive modeling provides the answers.

8. OPTIMIZATION
How do we do things better? What is the best decision for a complex problem?
E.g. Given business priorities, resource constraints and available technology, determine the best way to optimize your IT platform to satisfy the needs of every user.

Optimization supports innovation. It takes your resources and needs into consideration and helps you find the best possible way to accomplish your goals.

Thursday, October 16, 2008

Market Basket Analysis

Market Basket Analysis (MBA) is the process of analyzing transactional level data to determine the likelihood that a set of items/products will be bought together.

Retailers use the results/observations from an MBA to understand the purchase behaviour of customers for cross-selling, store design, discount plans and promotions. MBA can, and should be done across different branches/stores as the customer demographics/profiles and their purchase behavior usually varies across regions.

The most common technique used in MBA is Association Rules. The three measures of Association Rules are - Support, Confidence, and Lift

A --> B = if a customer buys A, then B is also purchased
LHS --> RHS
Condition --> Result
Antecedent --> Consequent

Support: Ratio of the # of transactions that includes both A & B to the total number of all transactions

Confidence: Ratio of the # of transactions with all items in the rule (A + B) to the # of transactions with items in the condition (A )

Lift: Indicates how much better the rule is at predicting the “result” or “consequent” as compared to having no rule at all, or how much better the rule does rather than just guessing

Lift = Confidence/P(result) = [P (A+B)/P(A)]/P(B)

EXAMPLE
If a customer buys milk, what is the likelihood of orange juice being purchased?

Milk --> Orange Juice
Customer Base: 1000
600 customers buy milk
400 customers buy orange juice
300 customers buy milk & orange juice

Support = P(milk & orange juice)/1000 = 300/1000 = 0.3

Confidence = P(milk & orange juice)/P(milk) = (300/1000)/(600/1000) = 0.5

Lift = Confidence/P(result) = 0.5/(400/1000) = 1.25

Interpretation: A customer who purchases milk is 1.25 times likely to purchase orange juice, than a randomly chosen customer.

THREE TYPES OF RULES PRODUCED BY ASSOCIATION RULES
Actionable: rules that can be justified and lead to actionable information
Trivial: rules that are obvious or already known (because of past/existing promotions, mandatory/required purchase of a stabilizer with an air conditioner…)
Inexplicable: rules that have no explanation and no course of action

DATA TYPE
Transactional data characterized by multiple rows per customer or order is the norm for MBA.

BASIC PROCESS FOR BUILDING ASSOCIATION RULES
1. Choose the right set of items/level of detail – items, product category, brands…?
2. Generate rules - one-way rules (2 items, A-->B), 2-way rules (3 items, A & B --> C)…?
3. Limit the orders/items in the analysis by

- considering only orders having at least as many items as are in the rule

- requiring a minimum support for the rule

- removing the largest orders having multiple items/products

MBA doesn’t refer to a single technique but a set of business problems related to understanding of POS transaction data. The most popular of these techniques happens to be Association Rules.

Thursday, August 21, 2008

A Few Questions Before You Churn!

Everyone seems to be modeling customer churn these days. But before you roll up your sleeves and take a dive, here are a few things I learned from David Ogden’s webcasts.

How will you use your churn model?
- Do you want to identify/rank likely churners?
- Do you want to identify/quantify the churn drivers?

Data Collection Window
- How much historical data do you want to use – 3 years data, 5 years data?

Prediction Window
- Who will churn next month? Who will churn in the next 6 months?

You build a model and predict who will churn next month. But what if the client’s business is such that it usually takes 2-3 months to implement the results from your churn model - set up campaigns, target customers with customized retention offers, send out mailers, etc.? Understand the client’s business and decide on an appropriate prediction window before simply doing what they ask.

Involuntary Churn vs. Voluntary Churn
- Voluntary churn occurs when a customer decides to switch to a competitor or another service provider because of dissatisfaction with the service or the associated fees
- Involuntary churn occurs due to factors like relocation, death, non-payment, etc.

Sometimes models are built leaving out one or the other group of customers. There is a clear difference between the two; decide which one is more important for the client’s business.

Drivers vs. Indicators
- Both influence churn, but drivers are those factors/measures that the company can control or manipulate. Indicators are mostly demographic measures, macro-economic factors, or seasonality, and they are outside the company's control.

Expected time to churn, vs. probability to churn tomorrow
- Survival Time Modeling answers the question, “What is the expected time to churn?” The response variable here is the Time (months, weeks, etc. until a customer will churn).
- Binary Response Modeling answers the question – “Who is likely to churn next week/month/quarter?” The response variable here is the Churn Indicator (customer stays or leaves).

Monday, July 28, 2008

Logistic Regression - Continous or Categorical?

A prediction/classification problem involving a lot of categorical variables and the first thing that comes to mind is Logistic Regression.

One thing I normally come across in Logistic Regression models is the low percentage of true positives, or cases/records correctly classified. And most of the times, the problem lies with the selection of the predictor variables. Many people tend to select as many predictor variables as they can. They have this wrong notion that they will miss something really BIG if they don’t include certain variables in the model.

And this is exactly where the idea of statisticians being the best and only candidates for analytics jobs is proved wrong. Someone with an understanding of the domain/business will easily point out the variables that will influence the independent/response variable. I always say to my managers – A Statistician, a Database Expert and an MBA are absolutely required for a successful Analytics Team.

Coming back to the accuracy of the Logistic Regression topic; while variable selection is the most important factor (besides the data quality, of course!!) influencing the accuracy of the model, I would like to say variable transformation and/or how you interpret the predictor variable is the second most important factor.

In a churn prediction model for a telecom company, I was working on Logistic Regression techniques and one of the predictor variables was “Months in Service”. In the initial runs, I specified it as a continuous variable in the model. After a lot of reruns that failed to increase the accuracy of the model, something made me think about the relation between “Probability of Churn” & “Months in Service”. Will the probability increase with an increase in the months of service? Will it decrease? Or will it be a little more complicated - with a lot of customers leaving in the initial few months of service, staying back for the next couple of months, and then churning again for another block of months, and so on?

I reran the model, this time specifying ”Months in Service” as a categorical variable. And the model accuracy shot up by about 12%!!!

Friday, June 13, 2008

Factor Analysis - Work Orientation Survey

The data used in this project has been taken from the "2005 - Work Orientation" SURVEY CONDUCTED BY the International Social Survey Programme.

Total number of cases/records: 43,440
No. of variables: 91

BUSINESS REQUIREMENT
In the survey, questions were asked on job perception, job satisfaction, working conditions, job content, job commitment, etc. Which of these job parameters/variables are strongly related? Which of them can be grouped together? Which scores/ratings should be used to measure a respondent's overall job satisfaction, job commitment, job security etc.?

ANALYSIS
The original data was in text format and it was read using the SAS column input method. Based on the analysis objective, out of the total 43,440 records, only employed (both full-time & part-time) respondents were selected for the analysis.

In the employed data, there are 24268 records. And out of the 91 variables, all country specific variables were removed from the final dataset. From the remaining variables, 38 rating/likert scale variables were selected.

Assuming these as ordinal variables, the spearman rank correlation was considered to be the most appropriate correlation for generating the correlation matrix/output data which will be used for running the Factor Analysis. Based on the MSA values and the significant factor loadings, 8 variables were removed during the analysis procedures.

An interesting thing turned up while using both the default Pearson's and the more appropriate Spearman's correlation in the analysis. When the Spearman correlation was used, 8 factors were extracted. But when I tried to summarize the variables based on these factors, I was not satisfied as the variables have been divided into too many small groups without any pattern or consistency in their meanings.

But when I used the default Pearson's correlation (in proc factor), I got 4 factors only. But the best thing was that the related variables have been grouped together under each of these factors. For example - Job content, Job Security, Work-Life Balance, & Job Satisfaction were clubbed together. While Work Environment, Working Relations, Organization Image & Job Commitment came under one factor.

The second approach of using Pearson's correlation while running Factor Analysis was thus found to give a much better, useful, and meaningful result in spite of what textbooks say about using Pearson's correlation on rating scale variables.