K-medians clustering is a partitioning technique used in cluster analysis. It groups data into ''k'' clusters by minimizing the sum of distances, typically the Manhattan (L1) distance, between data points and the median of their assigned clusters. This method is robust to outliers and well suited to discrete or categorical data. It is a generalization of the geometric median (or 1-median) algorithm, which is defined for a single cluster. ''k''-medians is a variation of
''k''-means clustering in which, instead of calculating the mean of each cluster to determine its centroid, one calculates the median. This has the effect of minimizing error over all clusters with respect to the 2-norm distance metric, as opposed to the squared 2-norm distance metric used by ''k''-means.
This relates directly to the ''k''-median problem, the problem of finding ''k'' centers such that the clusters formed by them are the most compact with respect to the 2-norm. Formally, given a set of data points ''x'', the ''k'' centers ''c''<sub>''i''</sub> are to be chosen so as to minimize the sum of the distances from each ''x'' to the nearest ''c''<sub>''i''</sub>.
The criterion function formulated in this way is sometimes a better criterion than the one used in the ''k''-means clustering algorithm, in which the sum of squared distances is used. The sum of (unsquared) distances is widely used in applications such as the facility location problem.
A common algorithm uses Lloyd-style iteration, which alternates between an expectation (E) and a maximization (M) step, making it a form of expectation–maximization algorithm. In the E step, all objects are assigned to their nearest median. In the M step, the medians are recomputed, using the median in each single dimension.
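The E/M alternation can be sketched in Python. This is an illustrative implementation of the Manhattan-distance formulation (function and variable names are our own, not from a library):

```python
import statistics

def k_medians(points, centers, iters=100):
    """Lloyd-style alternation for Manhattan-distance k-medians.
    E step: assign each point to its nearest median (L1 distance).
    M step: recompute each median coordinate-wise."""
    for _ in range(iters):
        # E step: nearest center under Manhattan (L1) distance.
        clusters = [[] for _ in centers]
        for x in points:
            dists = [sum(abs(a - b) for a, b in zip(x, c)) for c in centers]
            clusters[dists.index(min(dists))].append(x)
        # M step: per-dimension median of each non-empty cluster.
        new_centers = [
            tuple(statistics.median(col) for col in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
        if new_centers == centers:  # converged: assignments are stable
            break
        centers = new_centers
    return centers

# Two well-separated groups; the initial guesses are deliberately poor.
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(k_medians(pts, [(0, 9), (9, 0)]))
```

Despite the poor initialization, the iteration settles on one median inside each group.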
Medians and medoids
The median is computed in each single dimension in the Manhattan-distance formulation of the ''k''-medians problem, so the individual attributes will come from the dataset (or be an average of two values from the dataset). This makes the algorithm more reliable for discrete or even binary data sets. In contrast, using means or Euclidean-distance medians will not necessarily yield individual attributes from the dataset. Even with the Manhattan-distance formulation, the individual attributes may come from different instances in the dataset; thus, the resulting median may not be a member of the input dataset.
This algorithm is often confused with the
''k''-medoids algorithm. However, a medoid has to be an actual instance from the dataset, while for the multivariate Manhattan-distance median this only holds for single attribute values. The actual median can thus be a combination of multiple instances. For example, given the vectors (0,1), (1,0) and (2,2), the Manhattan-distance median is (1,1), which does not exist in the original data, and thus cannot be a medoid.
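The example above is easy to verify directly with a coordinate-wise median:

```python
from statistics import median

data = [(0, 1), (1, 0), (2, 2)]
# Manhattan-distance median: the median taken in each dimension separately.
m = tuple(median(col) for col in zip(*data))
print(m)          # (1, 1)
print(m in data)  # False: the median is not an instance from the dataset
```

Each coordinate of (1, 1) appears somewhere in the data, but the combination does not, so (1, 1) could never be chosen as a medoid.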
Comparison with related algorithms
K-medians clustering is closely related to other partitional clustering techniques such as ''k''-means and ''k''-medoids; the methods differ primarily in how cluster centers are determined and in the distance metric employed. These differences lead to distinct behaviors with respect to robustness, computational cost, and applicability to various data distributions.

The ''k''-means algorithm minimizes the sum of squared Euclidean distances between data points and their cluster mean (centroid). Because the mean can be heavily influenced by extreme values, ''k''-means is sensitive to outliers and noise. In contrast, ''k''-medians minimizes the sum of absolute differences (typically using the Manhattan/L1 distance), selecting the median along each dimension as the cluster center. Because the median is less affected by extreme values, ''k''-medians is generally more robust in the presence of outliers.

''k''-medoids also emphasizes robustness, but instead of using computed means or medians, it selects actual data points (medoids) as cluster centers. This makes it particularly suitable for non-Euclidean or categorical data. However, because it evaluates pairwise dissimilarities and repeatedly searches for representative points, it tends to be more computationally intensive than both ''k''-means and ''k''-medians, especially on large datasets.
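The robustness contrast between the mean and the median can be illustrated in one dimension: a single extreme value drags the mean far from the bulk of the data while leaving the median unchanged. A minimal sketch with made-up values:

```python
from statistics import mean, median

# A one-dimensional cluster with a single extreme outlier.
values = [1, 2, 3, 4, 100]
print(mean(values))    # 22: pulled far toward the outlier
print(median(values))  # 3: unaffected by the extreme value
```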
Software
* ELKI includes various ''k''-means variants, including ''k''-medians.
* FORTRAN: kmedians
* GNU R includes ''k''-medians in the "flexclust" package.
* Stata: kmedians
See also
* Cluster analysis
* ''k''-means
* Medoid
* Silhouette (clustering)
References
{{reflist}}
Cluster analysis algorithms