Two-proportion Z-test

The two-proportion Z-test (or two-sample proportion Z-test) is a statistical method used to determine whether the difference between the proportions of two groups, coming from a binomial distribution, is statistically significant. The approach relies on the assumption that the sample proportions are approximately normally distributed under the central limit theorem, allowing the construction of a z-test for hypothesis testing and confidence interval estimation. It is used in many fields to compare success rates, response rates, or other proportions across groups.


Hypothesis test

The z-test for comparing two proportions is a statistical hypothesis test for evaluating whether the proportion of a certain characteristic differs significantly between two independent samples. The test leverages the fact that a sample proportion (the mean of observations drawn from a Bernoulli distribution) is asymptotically normal under the central limit theorem, enabling the construction of a z-test.

The test involves two competing hypotheses:
* Null hypothesis (H_0): the proportions in the two populations are equal, i.e., p_1 = p_2.
* Alternative hypothesis (H_1): the proportions in the two populations differ, i.e., p_1 \neq p_2 (two-tailed), or p_1 > p_2 or p_1 < p_2 (one-tailed).

The z-statistic for comparing two proportions is computed as

z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}

where:
* \hat{p}_1 = sample proportion in the first sample
* \hat{p}_2 = sample proportion in the second sample
* n_1 = size of the first sample
* n_2 = size of the second sample
* \hat{p} = pooled proportion, calculated as \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}, where x_1 and x_2 are the counts of successes in the two samples.

The pooled proportion estimates the shared probability of success under the null hypothesis, and the standard error accounts for variability across the two samples. The z-test determines statistical significance by comparing the calculated z-statistic to a critical value: for a significance level of \alpha = 0.05, the null hypothesis is rejected if |z| > 1.96 (two-tailed test). Equivalently, one can compute the p-value and reject the null hypothesis if p < \alpha.
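As a sketch, the z-statistic and two-tailed p-value above can be computed by hand using only the Python standard library. The counts (120/1000 successes vs. 150/1000) are the same illustrative figures used in the R example later in the article:

```python
from math import sqrt
from statistics import NormalDist

# Illustrative counts: 120/1000 successes in group 1, 150/1000 in group 2
x1, n1 = 120, 1000
x2, n2 = 150, 1000

p1_hat = x1 / n1
p2_hat = x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)          # pooled proportion under H0

# Pooled standard error, as in the formula above
se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se

# Two-tailed p-value from the standard normal distribution
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(z, 4), round(p_value, 5))  # → -1.9631 0.04964
```

The p-value agrees with the R `prop.test` output shown later (X-squared = 3.8536 = 1.9631², p = 0.04964), since the chi-squared statistic of the 2×2 test is the square of this z-statistic.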


Confidence interval

The confidence interval for the difference between two proportions, based on the definitions above, is:

(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}

where z_{\alpha/2} is the critical value of the standard normal distribution (e.g., 1.96 for a 95% confidence level). This interval provides a range of plausible values for the true difference between the population proportions.

Using the z-test confidence interval for hypothesis testing gives the same results as the chi-squared test for a two-by-two contingency table. Fisher's exact test is more suitable when the sample sizes are small.

Notice that the variance is estimated differently for the hypothesis test and for the confidence interval. The test uses a pooled variance (based on the null hypothesis), while the confidence interval estimates the variance from each sample separately (so that the interval can accommodate a range of differences in proportions). This difference may lead to slightly different results when the confidence interval is used as an alternative to the hypothesis testing method.
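A minimal sketch of this interval in standard-library Python, using the same illustrative counts (120/1000 vs. 150/1000) and the unpooled standard error described above:

```python
from math import sqrt
from statistics import NormalDist

# Illustrative counts: 120/1000 vs. 150/1000
x1, n1 = 120, 1000
x2, n2 = 150, 1000

p1_hat, p2_hat = x1 / n1, x2 / n2
z_crit = NormalDist().inv_cdf(0.975)      # ≈ 1.96 for a 95% interval

# Unpooled standard error: each proportion's variance is estimated separately
se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)

diff = p1_hat - p2_hat
ci = (diff - z_crit * se, diff + z_crit * se)
print(ci)  # ≈ (-0.0599, -0.0001)
```

The endpoints match the 95% interval reported by R's `prop.test` in the software section. Because zero lies (just) outside the interval, it agrees with the hypothesis test's rejection of H_0 at \alpha = 0.05.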


Minimum detectable effect (MDE)

The minimum detectable effect (MDE) is the smallest difference between two proportions (p_1 and p_2) that a statistical test can detect for a chosen type I error level (\alpha), statistical power (1 - \beta), and sample sizes (n_1 and n_2). It is commonly used in study design to determine whether the sample sizes allow a test with sufficient sensitivity to detect meaningful differences.

The MDE for the (two-sided) z-test comparing two proportions, incorporating the critical values for \alpha and 1 - \beta and the standard errors of the proportions, is:

\text{MDE} = |p_1 - p_2| = z_{1-\alpha/2} \sqrt{p_0(1 - p_0)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} + z_{1-\beta} \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}

where:
* z_{1-\alpha/2}: critical value for the significance level.
* z_{1-\beta}: quantile for the desired power.
* p_0 = p_1 = p_2: the common proportion assumed under the null hypothesis.

The MDE depends on the sample sizes, the baseline proportions (p_1, p_2), and the test parameters. When the baseline proportions are not known, they must be assumed or roughly estimated from a pilot study. Larger samples or lower power requirements lead to a smaller MDE, making the test sensitive to smaller differences. Researchers may use the MDE to assess the feasibility of detecting meaningful differences before conducting a study.

The MDE is the smallest difference, denoted \Delta = |p_1 - p_2|, that satisfies two essential criteria in hypothesis testing:
# The null hypothesis (H_0: p_1 = p_2) is rejected at the specified significance level (\alpha).
# Statistical power (1 - \beta) is achieved under the alternative hypothesis (H_a: p_1 \neq p_2).

Since the test statistic is approximately normal under both the null and the alternative hypothesis, both criteria hold when the distance |p_1 - p_2| is such that the critical value for rejecting the null (X_\text{crit}) sits exactly where the probability of exceeding it is \alpha under the null and 1 - \beta under the alternative. The first criterion establishes the critical value required to reject the null hypothesis. The second criterion specifies how far the alternative distribution must lie from X_\text{crit} so that the probability of exceeding it under the alternative hypothesis is at least 1 - \beta (see "Power, minimal detectable effect, and bucket size estimation in A/B tests", which has some nice figures to illustrate the tradeoffs).

Condition 1: Rejecting H_0

Under the null hypothesis, the test statistic is based on the pooled standard error:

Z_\text{null} = \frac{\hat{p}_1 - \hat{p}_2}{\text{SE}_\text{null}}, \quad \text{where } \text{SE}_\text{null} = \sqrt{p_0(1 - p_0)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}

(p_0 may be estimated, as described above). To reject H_0, the observed difference must exceed the critical threshold after scaling by the standard error:

|p_1 - p_2| \geq X_\text{crit} = z_{1-\alpha/2} \cdot \text{SE}_\text{null}

If the MDE were defined solely as z_{1-\alpha/2} \cdot \text{SE}_\text{null}, the statistical power would be only 50%, because the alternative distribution is symmetric about the threshold. Achieving a higher power level requires an additional component in the MDE calculation.

Condition 2: Achieving power 1 - \beta

Under the alternative hypothesis, the standard error is

\text{SE}_\text{alt} = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}.

If the alternative distribution is centered at some value (e.g., X_\text{crit}), then the minimal |p_1 - p_2| must exceed z_{1-\beta} \cdot \text{SE}_\text{alt} to ensure that the probability of detecting the difference under the alternative hypothesis is at least 1 - \beta.

Combining the conditions

To meet both conditions, the total detectable difference combines components from the null and alternative distributions. The MDE is defined as:

\text{MDE} = z_{1-\alpha/2} \cdot \text{SE}_\text{null} + z_{1-\beta} \cdot \text{SE}_\text{alt}.

By summing the critical threshold under the null and the relevant quantile of the alternative distribution, the MDE ensures that the test satisfies the dual requirements of rejecting H_0 at significance level \alpha and achieving statistical power of at least 1 - \beta.
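The combined formula can be sketched in a few lines of Python. As a simplifying assumption (reasonable when the true difference is small), both standard-error terms are evaluated at the baseline proportion p_0, so SE_null and SE_alt coincide here; the two components are kept separate in the code to mirror the two conditions above:

```python
from math import sqrt
from statistics import NormalDist

def minimum_detectable_effect(p0, n1, n2, alpha=0.05, power=0.8):
    """Two-sided MDE for a two-proportion z-test.

    Simplification: both SE terms are evaluated at the baseline
    proportion p0, which is reasonable when the true difference is small.
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # ≈ 1.96 for alpha = 0.05
    z_beta = nd.inv_cdf(power)            # ≈ 0.84 for 80% power

    se_null = sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))          # Condition 1
    se_alt = sqrt(p0 * (1 - p0) / n1 + p0 * (1 - p0) / n2)     # Condition 2

    return z_alpha * se_null + z_beta * se_alt

# Smallest detectable lift from a 12% baseline with 1000 users per group
print(round(minimum_detectable_effect(0.12, 1000, 1000), 4))  # → 0.0407
```

With 1000 observations per group and a 12% baseline, differences smaller than about 4 percentage points would be detected with less than 80% power.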


Assumptions and conditions

To ensure valid results, the following assumptions must be met:
# Independent random samples: the samples must be drawn independently from the populations of interest.
# Large sample sizes: typically, n_1 and n_2 should exceed 30.
# Success/failure condition:
## n_1 \hat{p}_1 > 10 and n_1(1 - \hat{p}_1) > 10
## n_2 \hat{p}_2 > 10 and n_2(1 - \hat{p}_2) > 10

The z-test is most reliable when sample sizes are large and all assumptions are satisfied.
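The success/failure condition is mechanical to verify. A small hypothetical helper (the function name and threshold parameter are illustrative, not from any library):

```python
def check_success_failure(x1, n1, x2, n2, threshold=10):
    """Return True if observed successes and failures in both samples
    exceed `threshold` (10 by the convention above)."""
    p1, p2 = x1 / n1, x2 / n2
    counts = [n1 * p1, n1 * (1 - p1),   # successes/failures, sample 1
              n2 * p2, n2 * (1 - p2)]   # successes/failures, sample 2
    return all(c > threshold for c in counts)

print(check_success_failure(120, 1000, 150, 1000))  # → True
print(check_success_failure(3, 40, 5, 40))          # → False (only 3 successes)
```

When the condition fails, as in the second call, Fisher's exact test (mentioned above) is the usual alternative.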


Software implementation


R

Use prop.test() with continuity correction disabled:

prop.test(x = c(120, 150), n = c(1000, 1000), correct = FALSE)

The output includes z-test-equivalent results: the chi-squared statistic, p-value, and confidence interval:

        2-sample test for equality of proportions without continuity correction

data:  c(120, 150) out of c(1000, 1000)
X-squared = 3.8536, df = 1, p-value = 0.04964
alternative hypothesis: two.sided
95 percent confidence interval:
 -5.992397e-02 -7.602882e-05
sample estimates:
prop 1 prop 2
  0.12   0.15


Python

Use proportions_ztest from statsmodels:

from statsmodels.stats.proportion import proportions_ztest
z, p = proportions_ztest([120, 150], [1000, 1000])

# For a confidence interval on the difference between the two proportions:
from statsmodels.stats.proportion import confint_proportions_2indep


See also

* Two-sample hypothesis testing
* A/B testing


References


External links


Inference for Proportions: Comparing Two Independent Samples (power calculator)