
The Kaplan–Meier estimator, also known as the product limit estimator, is a non-parametric statistic used to estimate the survival function from lifetime data. In medical research, it is often used to measure the fraction of patients living for a certain amount of time after treatment. In other fields, Kaplan–Meier estimators may be used to measure the length of time people remain unemployed after a job loss, the time-to-failure of machine parts, or how long fleshy fruits remain on plants before they are removed by frugivores. The estimator is named after Edward L. Kaplan and Paul Meier, who each submitted similar manuscripts to the ''Journal of the American Statistical Association''. The journal editor, John Tukey, convinced them to combine their work into one paper, which has been cited more than 34,000 times since its publication in 1958.
The estimator of the survival function <math>S(t)</math> (the probability that life is longer than <math>t</math>) is given by:
:<math>\widehat S(t) = \prod_{i:\ t_i \le t} \left(1 - \frac{d_i}{n_i}\right),</math>
with <math>t_i</math> a time when at least one event happened, <math>d_i</math> the ''number of events'' (e.g., deaths) that happened at time <math>t_i</math>, and <math>n_i</math> the ''individuals known to have survived'' (have not yet had an event or been censored) up to time <math>t_i</math>.
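The product-limit rule (at each event time, multiply the running estimate by one minus the ratio of events to subjects still at risk) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `kaplan_meier`, its arguments, and the toy data are all assumptions made for the example, and subjects censored at an event time are counted as still at risk at that time.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t_i, S_hat(t_i))] for each distinct event time t_i.

    times  -- observed time per subject (event time or censoring time)
    events -- True if the event was observed, False if right-censored
    """
    # d_i: number of observed events at each distinct event time
    deaths = Counter(t for t, e in zip(times, events) if e)
    curve = []
    survival = 1.0
    for t_i in sorted(deaths):
        # n_i: subjects with no event and no censoring before t_i
        n_i = sum(1 for t in times if t >= t_i)
        d_i = deaths[t_i]
        survival *= 1.0 - d_i / n_i
        curve.append((t_i, survival))
    return curve


# Hypothetical toy data: six subjects, two of them right-censored (False).
times = [1, 2, 2, 3, 4, 5]
events = [True, True, False, True, False, True]
curve = kaplan_meier(times, events)
for t_i, s in curve:
    print(f"t = {t_i}: S_hat = {s:.4f}")
```

The estimate only changes at times where an event was observed; censored subjects lower the at-risk count for later times without producing a step of their own.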
Basic concepts
A plot of the Kaplan–Meier estimator is a series of declining horizontal steps which, with a large enough sample size, approaches the true survival function for that population. The value of the survival function between successive distinct sampled observations ("clicks") is assumed to be constant.
An important advantage of the Kaplan–Meier curve is that the method can take into account some types of censored data, particularly ''right-censoring'', which occurs if a patient withdraws from a study, is lost to follow-up, or is alive without event occurrence at last follow-up. On the plot, small vertical tick-marks indicate individual patients whose survival times have been right-censored. When no truncation or censoring occurs, the Kaplan–Meier curve is the complement of the empirical distribution function.
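The relationship to the empirical distribution function can be checked numerically on a small, fully observed sample. The sketch below uses hypothetical helper names (`km_no_censoring`, `ecdf`) and invented data; it compares the product-limit value at each observed time with one minus the empirical distribution function at that time.

```python
def km_no_censoring(times):
    """Product-limit estimate at each distinct time, all events observed."""
    survival, out = 1.0, {}
    for t_i in sorted(set(times)):
        n_i = sum(1 for t in times if t >= t_i)  # at risk just before t_i
        d_i = sum(1 for t in times if t == t_i)  # events at t_i
        survival *= 1.0 - d_i / n_i
        out[t_i] = survival
    return out


def ecdf(times, t):
    """Empirical distribution function: fraction of observations <= t."""
    return sum(1 for x in times if x <= t) / len(times)


times = [2, 3, 3, 5, 8]  # hypothetical lifetimes, none censored
km = km_no_censoring(times)

# Largest absolute difference between the two curves at the observed times.
max_gap = max(abs(km[t] - (1.0 - ecdf(times, t))) for t in km)
print(max_gap)  # zero up to floating-point rounding
```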
In medical statistics, a typical application might involve grouping patients into categories, for instance, those with Gene A profile and those with Gene B profile. In the graph, patients with Gene B die much more quickly than those with Gene A. After two years, about 80% of the Gene A patients survive, but less than half of the patients with Gene B.
To generate a Kaplan–Meier estimator, at least two pieces of data are required for each patient (or each subject): the status at last observation (event occurrence or right-censored), and the time to event (or time to censoring). If the survival functions between two or more groups are to be compared, then a third piece of data is required: the group assignment of each subject.
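As a rough illustration of these inputs, each subject can be stored as a small record holding the time, the status, and (when groups are compared) the group label. The class name `Subject`, its field names, and the values below are assumptions made for this sketch only.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Subject:
    time: float   # time to event, or time to censoring
    event: bool   # True = event occurred, False = right-censored
    group: str    # only needed when survival is compared across groups


# Hypothetical records; the values are invented for illustration.
subjects = [
    Subject(6.0, True, "A"),
    Subject(9.5, False, "A"),
    Subject(2.0, True, "B"),
    Subject(4.0, True, "B"),
    Subject(7.0, False, "B"),
]

# Split the (time, status) pairs by group so that one curve can be
# estimated per group.
by_group = defaultdict(list)
for s in subjects:
    by_group[s.group].append((s.time, s.event))
```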
Problem definition
Let <math>\tau \ge 0</math> be a random variable representing the time that passes between the start of the possible exposure period, <math>t_0</math>, and the time that the event of interest takes place, <math>t_1</math>. As indicated above, the goal is to estimate the survival function <math>S</math> underlying <math>\tau</math>. Recall that this function is defined as
:<math>S(t) = \operatorname{Prob}(\tau > t)</math>, where <math>t = 0, 1, \dots</math> is the time.
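Let <math>\tau_1, \dots, \tau_n \ge 0</math> be independent, identically distributed random variables, whose common distribution is that of <math>\tau</math>: <math>\tau_j</math> is the random time when some event <math>j</math> happened. The data available for estimating <math>S</math> is not <math>(\tau_j)_{j=1,\dots,n}</math>, but the list of pairs <math>((\tilde\tau_j, c_j))_{j=1,\dots,n}</math> where for <math>j \in [n] := \{1, 2, \dots, n\}</math>, <math>c_j \ge 0</math> is a fixed, deterministic integer, the censoring time of event <math>j</math>, and <math>\tilde\tau_j = \min(\tau_j, c_j)</math>. In particular, the information available about the timing of event <math>j</math> is whether the event happened before the fixed time <math>c_j</math> and, if so, the actual time of the event. The challenge is to estimate <math>S(t)</math> given this data.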
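The observation scheme described above can be mimicked with a short simulation: latent event times are drawn at random, censoring times are fixed in advance, and only the smaller of the two is recorded for each subject. Everything specific here is an assumption for illustration, including the uniform distribution of the latent times, the censoring schedule, and the variable names.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 8

# Latent integer event times (never fully observed in practice); the
# uniform distribution on {0, ..., 10} is an arbitrary illustrative choice.
tau = [rng.randint(0, 10) for _ in range(n)]

# Fixed, deterministic integer censoring times (a hypothetical schedule).
c = [3 + (j % 5) for j in range(n)]

# What the analyst actually sees: pairs (min(tau_j, c_j), c_j).
data = [(min(t, cj), cj) for t, cj in zip(tau, c)]

for (obs, cj), t in zip(data, tau):
    status = "event seen before c_j" if obs < cj else "cut off at c_j"
    print(f"observed={obs:2d}  c_j={cj}  latent={t:2d}  {status}")
```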
Derivation of the Kaplan–Meier estimator
Two derivations of the Kaplan–Meier estimator are shown. Both are based on rewriting the survival function in terms of what is sometimes called hazard, or mortality rates. However, before doing this it is worthwhile to consider a naive estimator.
A naive estimator
To understand the power of the Kaplan–Meier estimator, it is worthwhile to first describe a naive estimator of the survival function.
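Fix <math>k \in \{1, \dots, n\}</math> and let <math>X_k = \mathbb{I}(\tilde\tau_k \ge t)</math>, where <math>\tilde\tau_k = \min(\tau_k, c_k)</math> is the observed time and <math>c_k</math> the censoring time of event <math>k</math>. A basic argument shows that the following proposition holds:
:Proposition 1: If the censoring time <math>c_k</math> of event <math>k</math> is at least <math>t</math> (<math>c_k \ge t</math>), then <math>\tilde\tau_k \ge t</math> if and only if <math>\tau_k \ge t</math>.
Let <math>k</math> be such that <math>c_k \ge t</math>. It follows from the above proposition that
:<math>\operatorname{Prob}(X_k = 1) = \operatorname{Prob}(\tau_k \ge t) = S(t - 1).</math>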
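Let <math>C(t) = \{k : c_k \ge t\}</math> and consider only those <math>k \in C(t)</math>, i.e. the events for which the outcome was not censored before time <math>t</math>. Let <math>m(t) = |C(t)|</math> be the number of elements in <math>C(t)</math>. Note that the set <math>C(t)</math> is not random and so neither is <math>m(t)</math>. Furthermore, <math>(X_k)_{k \in C(t)}</math> is a sequence of independent, identically distributed Bernoulli random variables with common parameter <math>S(t - 1)</math>. Assuming that <math>m(t) > 0</math>, this suggests estimating <math>S(t - 1)</math> using
:<math>\hat S_\text{naive}(t - 1) = \frac{1}{m(t)} \sum_{k : c_k \ge t} X_k = \frac{|\{1 \le k \le n : \tilde\tau_k \ge t\}|}{|\{1 \le k \le n : c_k \ge t\}|} = \frac{|\{1 \le k \le n : \tilde\tau_k \ge t\}|}{m(t)},</math>
where the second equality follows because <math>\tilde\tau_k \ge t</math> implies <math>c_k \ge t</math>, while the last equality is simply a change of notation.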
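The naive estimator amounts to a ratio of two counts: among the subjects whose censoring time is at least the chosen time, the fraction whose observed time is also at least that time. A minimal sketch follows, with a hypothetical function name `naive_survival` and invented data given as pairs of observed time and censoring time.

```python
def naive_survival(data, t):
    """Naive estimate of Prob(event time >= t) from (observed, censoring) pairs.

    Only subjects whose censoring time is at least t are used. Returns None
    when there are no such subjects, since the ratio is then undefined.
    """
    eligible = [obs for obs, c in data if c >= t]  # not censored before t
    m_t = len(eligible)
    if m_t == 0:
        return None
    return sum(1 for obs in eligible if obs >= t) / m_t


# Hypothetical pairs (min(event time, censoring time), censoring time).
data = [(2, 5), (5, 5), (3, 3), (1, 4), (4, 6), (2, 2)]
estimate = naive_survival(data, t=3)
print(estimate)
```

In this toy sample the last subject, censored at time 2, is dropped entirely when the estimate is evaluated at time 3, which is the loss of information that motivates looking beyond the naive estimator.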
The quality of this estimate is governed by the size of <math>m(t)</math>. This can be problematic when <math>m(t)</math> is small, which happens, by definition, when a lot of the events are censored. A particularly unpleasant property of this estimator, which suggests that perhaps it is not the "best" estimator, is that it ignores all the observations whose censoring time precedes <math>t</math>. Intuitively, these observations still contain information about <math>S(t)</math>: For example, when for many events with <math>c_k < t</math>,