Datalligence: August 2009

Statistical inference about means and proportions with two populations seems to be one of the most common applications in the field of analytics - comparing campaign response rates between two groups of customers, pre- and post-campaign sales, membership renewal rates, etc.

Call it chance or whatever, but whenever these kinds of tasks come up, I hear people talking about t-tests only. No issues as long as you want to compare means, or when your target variable is a continuous value. But how or why do people talk about the t-test when they want to compare ratios or proportions? Whatever happened to the Chi-Square test or the Z-test for a difference in proportions?

I did a bit of research on the net, a bit of calculation using pen and paper [very good exercise for the brain in this age of calculators and spreadsheets :-) ], read a very good article by Gerard E. Dallal, and I found the answers.

Going back to our introductory class in statistics, let’s check out the formulae for the t-tests.

1. Assuming that the population variances are equal,
T = (X₁ – X₂)/sqrt(Sp²(1/n₁ + 1/n₂)) ..........Equation 1
where
X₁, X₂ = means of sample 1 and 2
n₁, n₂ = size of sample 1 and 2
Sp² = pooled variance = [((n₁-1)S₁²+(n₂-1)S₂²)/(n₁+n₂-2)]

2. Assuming that the population variances are not equal,
T = (X₁ – X₂)/sqrt(S₁²/n₁ + S₂²/n₂) ..........Equation 2
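Equations 1 and 2 can be sketched in a few lines of Python. This is only an illustration using the standard library; the function names and the sales figures are made up for the example, and S₁², S₂² are taken to be the usual sample variances (n – 1 divisor).

```python
import math
from statistics import mean, variance  # variance() uses the n - 1 divisor

def pooled_t(sample1, sample2):
    """Equation 1: t statistic assuming equal population variances."""
    n1, n2 = len(sample1), len(sample2)
    s1_sq, s2_sq = variance(sample1), variance(sample2)
    # Pooled variance Sp^2
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    return (mean(sample1) - mean(sample2)) / math.sqrt(sp_sq * (1 / n1 + 1 / n2))

def unpooled_t(sample1, sample2):
    """Equation 2: t statistic without assuming equal variances."""
    n1, n2 = len(sample1), len(sample2)
    return (mean(sample1) - mean(sample2)) / math.sqrt(
        variance(sample1) / n1 + variance(sample2) / n2
    )

# Hypothetical pre- and post-campaign sales, purely for illustration
before = [12.0, 15.0, 11.0, 14.0, 13.0]
after = [16.0, 18.0, 15.0, 17.0, 19.0, 17.0]
t_pooled = pooled_t(before, after)
t_unpooled = unpooled_t(before, after)
print(t_pooled, t_unpooled)
```

When the two samples have the same size, the two statistics coincide; they differ only in the degrees of freedom used to look up the p-value.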

We have also been taught that the test statistic Z is used to determine the difference between two population proportions based on the difference between the two sample proportions.

And the formula for the Z statistic is given by
Z = (P₁ – P₂)/ sqrt(P(1-P)(1/n₁ + 1/n₂)) ..........Equation 3

where
P₁, P₂ = proportions of success (or target category) in samples 1 and 2
S₁², S₂² = variances of samples 1 and 2 (used in equations 1 and 2)
n₁, n₂ = size of samples 1 and 2
P = pooled estimate of the sample proportion of successes =(X₁ + X₂)/(n₁ + n₂)
X₁, X₂ = number of successes (or target category) in samples 1 and 2 (here X denotes counts, not the sample means of equations 1 and 2)
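As a rough sketch, equation 3 translates into Python as follows. The function names and the response counts are hypothetical, and the two-sided p-value is taken from the standard normal distribution via math.erfc.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Equation 3: z statistic for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)  # pooled proportion of successes
    return (p1 - p2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

def two_sided_p_value(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical campaign: 120 of 1000 responders vs. 90 of 1000
z = two_proportion_z(120, 1000, 90, 1000)
p_value = two_sided_p_value(z)
print(z, p_value)
```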

The test statistic Z (equation 3) is equivalent to the chi-square test of homogeneity of proportions, i.e. the chi-square test applied to the 2×2 table of group by outcome.

But how different are proportions from means? The proportion having the desired outcome is the number of individuals/observations with the outcome divided by the total number of individuals/observations. Suppose we create a variable that equals 1 if the subject has the outcome and 0 if not. The proportion of individuals/observations with the outcome is the mean of this variable, because the sum of these 0s and 1s is the number of individuals/observations with the outcome.

Let's suppose there are m 1s and (n – m) 0s among the n observations. Then X_Mean (= P) = m/n, and X_i – X_Mean is equal to (1 – m/n) for the m observations that are 1 and (0 – m/n) for the (n – m) observations that are 0. Summing the squared deviations gives

∑(X_i – X_Mean)² = m(1 – m/n)² + (n – m)(0 – m/n)²
= m(1 – 2m/n + m²/n²) + (n – m)m²/n²
= m – 2(m²/n) + (m³/n²) + (m²/n) – (m³/n²)
= m – (m²/n)
= m(1 – m/n)
= nP(1 – P)

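So, variance = ∑(X_i – X_Mean)²/n = P(1-P). (Dividing by n – 1 instead, as the sample variances S₁² and S₂² in the t-tests do, gives nP(1-P)/(n – 1), which is practically the same for large n.)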
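The identity is easy to check numerically. The sketch below builds a 0/1 variable with made-up values of m and n and compares its variance with P(1-P); statistics.pvariance divides by n, while statistics.variance divides by n – 1.

```python
import math
from statistics import mean, pvariance, variance

m, n = 30, 200                 # hypothetical: 30 successes among 200 observations
x = [1] * m + [0] * (n - m)    # the 0/1 indicator variable

p = mean(x)                    # mean of the indicator = proportion m/n
var_n = pvariance(x)           # divides by n
var_n_minus_1 = variance(x)    # divides by n - 1

print(p, var_n, p * (1 - p))
print(var_n_minus_1, n * p * (1 - p) / (n - 1))
```

The first line prints the proportion and two matching variance values; the second shows that the n – 1 version differs only by the factor n/(n – 1).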

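Substituting this into equation 3 (for the Z statistic), we get
Z = (P₁ – P₂)/sqrt(Variance/n₁ + Variance/n₂), where Variance = P(1-P). This is not so different from equation 2 (the formula for the "equal variances not assumed" version of the t test); and because the same pooled Variance is used for both samples, it can equally be written as (P₁ – P₂)/sqrt(Variance(1/n₁ + 1/n₂)), which has the form of equation 1.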

As long as the sample size is relatively large, the distributional assumptions are met, and the response is binary (0/1), the t test and the z test will give p-values that are very close to one another.
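To see how close the two statistics are, here is a small sketch that computes both from the same (hypothetical) counts. For 0/1 data the sample variance with the n – 1 divisor is nP(1-P)/(n – 1), so the unpooled t statistic of equation 2 can be computed without building the raw data. The function names are made up for the example.

```python
import math

def z_statistic(x1, n1, x2, n2):
    """Equation 3, using the pooled proportion."""
    p = (x1 + x2) / (n1 + n2)
    return (x1 / n1 - x2 / n2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

def t_statistic_binary(x1, n1, x2, n2):
    """Equation 2 applied to 0/1-coded data, computed from the counts."""
    p1, p2 = x1 / n1, x2 / n2
    s1_sq = n1 * p1 * (1 - p1) / (n1 - 1)  # sample variance of the 0/1 variable
    s2_sq = n2 * p2 * (1 - p2) / (n2 - 1)
    return (p1 - p2) / math.sqrt(s1_sq / n1 + s2_sq / n2)

# Hypothetical response counts: 120 of 1000 vs. 90 of 1000
z = z_statistic(120, 1000, 90, 1000)
t = t_statistic_binary(120, 1000, 90, 1000)
print(z, t, abs(z - t))
```

With samples of this size the two statistics differ only in the third decimal place, and the t distribution with about 2000 degrees of freedom is almost indistinguishable from the normal.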

And in the case where we have only two categories, the z test and the chi-square test (without a continuity correction) turn out to be exactly equivalent, though the chi-square test is by nature two-tailed. The chi-square distribution with 1 df is just the distribution of the square of a standard normal (z) variable, so the chi-square statistic equals Z².
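The equivalence can be verified numerically. The sketch below computes the Pearson chi-square statistic (no continuity correction) for the 2×2 table and compares it with the square of the z statistic from equation 3; the counts and function names are made up for the illustration.

```python
import math

def z_statistic(x1, n1, x2, n2):
    """Equation 3, using the pooled proportion."""
    p = (x1 + x2) / (n1 + n2)
    return (x1 / n1 - x2 / n2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

def chi_square_2x2(x1, n1, x2, n2):
    """Pearson chi-square (no continuity correction) for group x outcome."""
    observed = [[x1, n1 - x1], [x2, n2 - x2]]
    total = n1 + n2
    row_totals = [n1, n2]
    col_totals = [x1 + x2, total - (x1 + x2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

# Hypothetical response counts: 120 of 1000 vs. 90 of 1000
z = z_statistic(120, 1000, 90, 1000)
chi2 = chi_square_2x2(120, 1000, 90, 1000)

# Two-sided normal p-value and chi-square (1 df) p-value
p_from_z = math.erfc(abs(z) / math.sqrt(2))
p_from_chi2 = math.erfc(math.sqrt(chi2 / 2))
print(z ** 2, chi2, p_from_z, p_from_chi2)
```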

The various tests and their assumptions as listed in Wikipedia are given below:
1. Two-sample pooled t-test, equal variances
(Normal populations or n1 + n2 > 40) and independent observations and σ1 = σ2 and (σ1 and σ2 unknown)

2. Two-sample unpooled t-test, unequal variances
(Normal populations or n1 + n2 > 40) and independent observations and σ1 ≠ σ2 and (σ1 and σ2 unknown)

3. Two-proportion z-test, equal variances
n1 p1 > 5 and n1(1 − p1) > 5 and n2 p2 > 5 and n2(1 − p2) > 5 and independent observations

4. Two-proportion z-test, unequal variances
n1 p1 > 5 and n1(1 − p1) > 5 and n2 p2 > 5 and n2(1 − p2) > 5 and independent observations
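The sample-size condition for the z-tests is simple to check before running the test. A minimal sketch, assuming the sample proportions are plugged in for p1 and p2 (so that n1 p1 is just the number of successes in sample 1); the function name is hypothetical.

```python
def large_enough_for_z_test(x1, n1, x2, n2, threshold=5):
    """Check n*p > threshold and n*(1 - p) > threshold in both samples.

    With sample proportions p = x/n, n*p is the count of successes
    and n*(1 - p) is the count of failures.
    """
    counts = [x1, n1 - x1, x2, n2 - x2]
    return all(count > threshold for count in counts)

print(large_enough_for_z_test(120, 1000, 90, 1000))  # large samples
print(large_enough_for_z_test(3, 40, 10, 40))        # only 3 successes in sample 1
```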

Sunday, August 30, 2009

Means and Proportions with two populations
