In
computational geometry, a coreset of an input set is a subset of points, such that solving a problem on the coreset provably yields similar results as solving the problem on the entire
point set
In geometry, a point is an abstract idealization of an exact position, without size, in physical space, or its generalization to other kinds of mathematical spaces. As zero-dimensional objects, points are usually taken to be the fundamental ind ...
, for some given family of problems. Coresets are commonly used in
Mathematical optimization
Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criteria, from some set of available alternatives. It is generally divided into two subfiel ...
,
Cluster analysis
Cluster analysis or clustering is the data analyzing technique in which task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more Similarity measure, similar (in some specific sense defined by the ...
and
Range Queries
In computer science, the range query problem consists of efficiently answering several queries regarding a given interval of elements within an array data structure, array. For example, a common task, known as range minimum query, is finding the ...
to reduce
computational complexity
In computer science, the computational complexity or simply complexity of an algorithm is the amount of resources required to run it. Particular focus is given to computation time (generally measured by the number of needed elementary operations ...
while maintaining high accuracy. They allow algorithms to operate efficiently on large datasets by replacing the original data with a significantly smaller representative subset.
Many natural geometric optimization problems have coresets that approximate an optimal solution to within a factor of , that can be found quickly (in
linear time
In theoretical computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations ...
or near-linear time), and that have size bounded by a function of independent of the input size, where is an arbitrary positive number. When this is the case, one obtains a linear-time or near-linear time approximation scheme, based on the idea of finding a coreset and then applying an exact optimization algorithm to the coreset. Regardless of how slow the exact optimization algorithm is, for any fixed choice of , the running time of this approximation scheme will be plus the time to find the coreset.
[.]
Definition
A coreset is a subset
of a point set
, possibly with associated weights, that preserves an optimization cost function within a factor of
, where
is some user defined approximation parameter. Formally, for an optimization problem with some cost function COST
, a coreset
satisfies the following inequality:
COST
COST
(1 +
)
COST
Applications
Coresets are used in a variety of problems, a few key examples include:
* Clustering: Approximating solutions for
K-means clustering
''k''-means clustering is a method of vector quantization, originally from signal processing, that aims to partition of a set, partition ''n'' observations into ''k'' clusters in which each observation belongs to the cluster (statistics), cluste ...
,
K-medians clustering
K-medians clustering is a partitioning technique used in cluster analysis. It groups data into ''k'' clusters by minimizing the sum of distances—typically using the Manhattan (L1) distance—between data points and the median of their assigned c ...
and
K-center clustering while significantly reducing computation.
* Range Queries: Speeding up spatial searches in
Geographic Information Systems
A geographic information system (GIS) consists of integrated computer hardware and software that store, manage, analyze, edit, output, and visualize geographic data. Much of this often happens within a spatial database; however, this is not ...
or large databases by efficiently summarizing data.
* Machine Learning: Enhancing performance in
Hyperparameter optimization
In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process, which must be con ...
by working with a smaller representative set.
References
Computational geometry
{{algorithm-stub