Selection algorithm
   HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includi ...
, a selection algorithm is an
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...
for finding the ''k''th smallest number in a
list A ''list'' is any set of items in a row. List or lists may also refer to: People * List (surname) Organizations * List College, an undergraduate division of the Jewish Theological Seminary of America * SC Germania List, German rugby unio ...
or
array An array is a systematic arrangement of similar objects, usually in rows and columns. Things called an array include: {{TOC right Music * In twelve-tone and serial composition, the presentation of simultaneous twelve-tone sets such that the ...
; such a number is called the ''k''th ''
order statistic In statistics, the ''k''th order statistic of a statistical sample is equal to its ''k''th-smallest value. Together with rank statistics, order statistics are among the most fundamental tools in non-parametric statistics and inference. Importan ...
''. This includes the cases of finding the
minimum In mathematical analysis, the maxima and minima (the respective plurals of maximum and minimum) of a function, known collectively as extrema (the plural of extremum), are the largest and smallest value of the function, either within a given r ...
,
maximum In mathematical analysis, the maxima and minima (the respective plurals of maximum and minimum) of a function, known collectively as extrema (the plural of extremum), are the largest and smallest value of the function, either within a given r ...
, and median elements. There are O(''n'')-time (worst-case linear time) selection algorithms, and sublinear performance is possible for structured data; in the extreme, O(1) for an array of sorted data. Selection is a subproblem of more complex problems like the nearest neighbor and
shortest path In graph theory, the shortest path problem is the problem of finding a path between two vertices (or nodes) in a graph such that the sum of the weights of its constituent edges is minimized. The problem of finding the shortest path between ...
problems. Many selection algorithms are derived by generalizing a
sorting algorithm In computer science, a sorting algorithm is an algorithm that puts elements of a list into an order. The most frequently used orders are numerical order and lexicographical order, and either ascending or descending. Efficient sorting is important ...
, and conversely some sorting algorithms can be derived as repeated application of selection. The simplest case of a selection algorithm is finding the minimum (or maximum) element by iterating through the list, keeping track of the running minimum – the minimum so far – (or maximum) and can be seen as related to the
selection sort In computer science, selection sort is an in-place comparison sorting algorithm. It has an O(''n''2) time complexity, which makes it inefficient on large lists, and generally performs worse than the similar insertion sort. Selection sort is no ...
. Conversely, the hardest case of a selection algorithm is finding the median. In fact, a specialized median-selection algorithm can be used to build a general selection algorithm, as in
median of medians In computer science, the median of medians is an approximate (median) selection algorithm, frequently used to supply a good pivot for an exact selection algorithm, mainly the quickselect, that selects the ''k''th smallest element of an initially u ...
. The best-known selection algorithm is
Quickselect In computer science, quickselect is a selection algorithm to find the ''k''th smallest element in an unordered list. It is also known as the kth order statistics . It is related to the quicksort sorting algorithm. Like quicksort, it was devel ...
, which is related to
Quicksort Quicksort is an efficient, general-purpose sorting algorithm. Quicksort was developed by British computer scientist Tony Hoare in 1959 and published in 1961, it is still a commonly used algorithm for sorting. Overall, it is slightly faster than ...
; like Quicksort, it has (asymptotically) optimal average performance, but poor worst-case performance, though it can be modified to give optimal worst-case performance as well.


Selection by sorting

By sorting the list or array then selecting the desired element, selection can be reduced to sorting. This method is inefficient for selecting a single element, but is efficient when many selections need to be made from an array, in which case only one initial, expensive sort is needed, followed by many cheap selection operations – O(1) for an array, though selection is O(''n'') in a linked list, even if sorted, due to lack of random access. In general, sorting requires O(''n'' log ''n'') time, where ''n'' is the length of the list, although a lower bound is possible with non-comparative sorting algorithms like
radix sort In computer science, radix sort is a non-comparative sorting algorithm. It avoids comparison by creating and distributing elements into buckets according to their radix. For elements with more than one significant digit, this bucketing process i ...
and
counting sort In computer science, counting sort is an algorithm for sorting a collection of objects according to keys that are small positive integers; that is, it is an integer sorting algorithm. It operates by counting the number of objects that possess dis ...
. Rather than sorting the whole list or array, one can instead use partial sorting to select the ''k'' smallest or ''k'' largest elements. The ''k''th smallest (resp., ''k''th largest element) is then the largest (resp., smallest element) of the partially sorted list – this then takes O(1) to access in an array and O(''k'') to access in a list.


Unordered partial sorting

If partial sorting is relaxed so that the ''k'' smallest elements are returned, but not in order, the factor of O(''k'' log ''k'') can be eliminated. An additional maximum selection (taking O(''k'') time) is required, but since k \leq n, this still yields asymptotic complexity of O(''n''). In fact, partition-based selection algorithms yield both the ''k''th smallest element itself and the ''k'' smallest elements (with other elements not in order). This can be done in O(''n'') time – average complexity of
Quickselect In computer science, quickselect is a selection algorithm to find the ''k''th smallest element in an unordered list. It is also known as the kth order statistics . It is related to the quicksort sorting algorithm. Like quicksort, it was devel ...
, and worst-case complexity of refined partition-based selection algorithms. Conversely, given a selection algorithm, one can easily get an unordered partial sort (''k'' smallest elements, not in order) in O(''n'') time by iterating through the list and recording all elements less than the ''k''th element. If this results in fewer than ''k'' − 1 elements, any remaining elements equal the ''k''th element. Care must be taken, due to the possibility of equality of elements: one must not include all elements less than ''or equal to'' the ''k''th element, as elements greater than the ''k''th element may also be equal to it. Thus unordered partial sorting (lowest ''k'' elements, but not ordered) and selection of the ''k''th element are very similar problems. Not only do they have the same asymptotic complexity, O(''n''), but a solution to either one can be converted into a solution to the other by a straightforward algorithm (finding a max of ''k'' elements, or filtering elements of a list below a cutoff of the value of the ''k''th element).


Partial selection sort

A simple example of selection by partial sorting is to use the partial
selection sort In computer science, selection sort is an in-place comparison sorting algorithm. It has an O(''n''2) time complexity, which makes it inefficient on large lists, and generally performs worse than the similar insertion sort. Selection sort is no ...
. The obvious linear time algorithm to find the minimum (resp. maximum) – iterating over the list and keeping track of the minimum (resp. maximum) element so far – can be seen as a partial selection sort that selects the 1 smallest element. However, many other partial sorts also reduce to this algorithm for the case ''k'' =1, such as a partial heap sort. More generally, a partial selection sort yields a simple selection algorithm which takes O(''kn'') time. This is asymptotically inefficient, but can be sufficiently efficient if ''k'' is small, and is easy to implement. Concretely, we simply find the minimum value and move it to the beginning, repeating on the remaining list until we have accumulated ''k'' elements, and then return the ''k''th element. Here is partial selection sort-based algorithm: function select(list ..n k) for i from 1 to k minIndex = i minValue = list for j from i+1 to n do if list < minValue then minIndex = j minValue = list swap list and list inIndex return list


Partition-based selection

Linear performance can be achieved by a partition-based selection algorithm, most basically
Quickselect In computer science, quickselect is a selection algorithm to find the ''k''th smallest element in an unordered list. It is also known as the kth order statistics . It is related to the quicksort sorting algorithm. Like quicksort, it was devel ...
. Quickselect is a variant of
Quicksort Quicksort is an efficient, general-purpose sorting algorithm. Quicksort was developed by British computer scientist Tony Hoare in 1959 and published in 1961, it is still a commonly used algorithm for sorting. Overall, it is slightly faster than ...
– in both one chooses a pivot and then partitions the data by it, but while Quicksort recurses on both sides of the partition, Quickselect only recurses on one side, namely the side on which the desired ''k''th element is. As with Quicksort, this has optimal average performance, in this case linear, but poor worst-case performance, in this case quadratic. This occurs for instance by taking the first element as the pivot and searching for the maximum element, if the data is already sorted. In practice this can be avoided by choosing a random element as pivot, which yields almost certain linear performance. Alternatively, a more careful deterministic pivot strategy can be used, such as
median of medians In computer science, the median of medians is an approximate (median) selection algorithm, frequently used to supply a good pivot for an exact selection algorithm, mainly the quickselect, that selects the ''k''th smallest element of an initially u ...
. These are combined in the hybrid introselect algorithm (analogous to introsort), which starts with Quickselect but falls back to median of medians if progress is slow, resulting in both fast average performance and optimal worst-case performance of O(''n''). The partition-based algorithms are generally done in place, which thus results in partially sorting the data. They can be done out of place, not changing the original data, at the cost of O(''n'') additional space.


Median selection as pivot strategy

A median-selection algorithm can be used to yield a general selection algorithm or sorting algorithm, by applying it as the pivot strategy in Quickselect or Quicksort; if the median-selection algorithm is asymptotically optimal (linear-time), the resulting selection or sorting algorithm is as well. In fact, an exact median is not necessary – an approximate median is sufficient. In the
median of medians In computer science, the median of medians is an approximate (median) selection algorithm, frequently used to supply a good pivot for an exact selection algorithm, mainly the quickselect, that selects the ''k''th smallest element of an initially u ...
selection algorithm, the pivot strategy computes an approximate median and uses this as pivot, recursing on a smaller set to compute this pivot. In practice the overhead of pivot computation is significant, so these algorithms are generally not used, but this technique is of theoretical interest in relating selection and sorting algorithms. In detail, given a median-selection algorithm, one can use it as a pivot strategy in Quickselect, obtaining a selection algorithm. If the median-selection algorithm is optimal, meaning O(''n''), then the resulting general selection algorithm is also optimal, again meaning linear. This is because Quickselect is a
divide and conquer algorithm In computer science, divide and conquer is an algorithm design paradigm. A divide-and-conquer algorithm recursively breaks down a problem into two or more sub-problems of the same or related type, until these become simple enough to be solved direc ...
, and using the median at each pivot means that at each step the search set decreases by half in size, so the overall complexity is a
geometric series In mathematics, a geometric series is the sum of an infinite number of terms that have a constant ratio between successive terms. For example, the series :\frac \,+\, \frac \,+\, \frac \,+\, \frac \,+\, \cdots is geometric, because each suc ...
times the complexity of each step, and thus simply a constant times the complexity of a single step, in fact 2 = 1/(1-(1/2)) times (summing the series). Similarly, given a median-selection algorithm or general selection algorithm applied to find the median, one can use it as a pivot strategy in Quicksort, obtaining a sorting algorithm. If the selection algorithm is optimal, meaning O(''n''), then the resulting sorting algorithm is optimal, meaning O(''n'' log ''n''). The median is the best pivot for sorting, as it evenly divides the data, and thus guarantees optimal sorting, assuming the selection algorithm is optimal. A sorting analog to median of medians exists, using the pivot strategy (approximate median) in Quicksort, and similarly yields an optimal Quicksort.


Incremental sorting by selection

Converse to selection by sorting, one can incrementally sort by repeated selection. Abstractly, selection only yields a single element, the ''k''th element. However, practical selection algorithms frequently involve partial sorting, or can be modified to do so. Selecting by partial sorting naturally does so, sorting the elements up to ''k'', and selecting by partitioning also sorts some elements: the pivots are sorted to the correct positions, with the ''k''th element being the final pivot, and the elements between the pivots have values between the pivot values. The difference between partition-based selection and partition-based sorting, as in Quickselect versus Quicksort, is that in selection one recurses on only one side of each pivot, sorting only the pivots (an average of log(''n'') pivots are used), rather than recursing on both sides of the pivot. This can be used to speed up subsequent selections on the same data; in the extreme, a fully sorted array allows O(1) selection. Further, compared with first doing a full sort, incrementally sorting by repeated selection amortizes the sorting cost over multiple selections. For partially sorted data (up to ''k''), so long as the partially sorted data and the index ''k'' up to which the data is sorted are recorded, subsequent selections of ''j'' less than or equal to ''k'' can simply select the ''j''th element, as it is already sorted, while selections of ''j'' greater than ''k'' only need to sort the elements above the ''k''th position. For partitioned data, if the list of pivots is stored (for example, in a sorted list of the indices), then subsequent selections only need to select in the interval between two pivots (the nearest pivots below and above). The biggest gain is from the top-level pivots, which eliminate costly large partitions: a single pivot near the middle of the data cuts the time for future selections in half. The pivot list will grow over subsequent selections, as the data becomes more sorted, and can even be passed to a partition-based sort as the basis of a full sort.


Using data structures to select in sublinear time

Given an unorganized list of data, linear time (Ω(''n'')) is required to find the minimum element, because we have to examine every element (otherwise, we might miss it). If we organize the list, for example by keeping it sorted at all times, then selecting the ''k''th largest element is trivial, but then insertion requires linear time, as do other operations such as combining two lists. The strategy to find an order statistic in sublinear time is to store the data in an organized fashion using suitable data structures that facilitate the selection. Two such data structures are tree-based structures and frequency tables. When only the minimum (or maximum) is needed, a good approach is to use a heap, which is able to find the minimum (or maximum) element in constant time, while all other operations, including insertion, are O(log ''n'') or better. More generally, a
self-balancing binary search tree In computer science, a self-balancing binary search tree (BST) is any node-based binary search tree that automatically keeps its height (maximal number of levels below the root) small in the face of arbitrary item insertions and deletions.Donal ...
can easily be augmented to make it possible to both insert an element and find the ''k''th largest element in O(log ''n'') time; this is called an ''
order statistic tree In computer science, an order statistic tree is a variant of the binary search tree (or more generally, a B-tree) that supports two additional operations beyond insertion, lookup and deletion: * Select(''i'') – find the ''ith smallest element st ...
.'' We simply store in each node a count of how many descendants it has, and use this to determine which path to follow. The information can be updated efficiently since adding a node only affects the counts of its O(log ''n'') ancestors, and tree rotations only affect the counts of the nodes involved in the rotation. Another simple strategy is based on some of the same concepts as the
hash table In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', ...
. When we know the range of values beforehand, we can divide that range into ''h'' subintervals and assign these to ''h'' buckets. When we insert an element, we add it to the bucket corresponding to the interval it falls in. To find the minimum or maximum element, we scan from the beginning or end for the first nonempty bucket and find the minimum or maximum element in that bucket. In general, to find the ''k''th element, we maintain a count of the number of elements in each bucket, then scan the buckets from left to right adding up counts until we find the bucket containing the desired element, then use the expected linear-time algorithm to find the correct element in that bucket. If we choose ''h'' of size roughly sqrt(''n''), and the input is close to uniformly distributed, this scheme can perform selections in expected O(sqrt(''n'')) time. Unfortunately, this strategy is also sensitive to clustering of elements in a narrow interval, which may result in buckets with large numbers of elements (clustering can be eliminated through a good hash function, but finding the element with the ''k''th largest hash value isn't very useful). Additionally, like hash tables this structure requires table resizings to maintain efficiency as elements are added and ''n'' becomes much larger than ''h''2. A useful case of this is finding an order statistic or extremum in a finite range of data. Using above table with bucket interval 1 and maintaining counts in each bucket is much superior to other methods. Such hash tables are like frequency tables used to classify the data in
descriptive statistics A descriptive statistic (in the count noun sense) is a summary statistic that quantitatively describes or summarizes features from a collection of information, while descriptive statistics (in the mass noun sense) is the process of using and an ...
.


Lower bounds

In ''
The Art of Computer Programming ''The Art of Computer Programming'' (''TAOCP'') is a comprehensive monograph written by the computer scientist Donald Knuth presenting programming algorithms and their analysis. Volumes 1–5 are intended to represent the central core of com ...
'',
Donald E. Knuth Donald Ervin Knuth ( ; born January 10, 1938) is an American computer scientist, mathematician, and professor emeritus at Stanford University. He is the 1974 recipient of the ACM Turing Award, informally considered the Nobel Prize of computer sc ...
discussed a number of lower bounds for the number of comparisons required to locate the ''t'' smallest entries of an unorganized list of ''n'' items (using only comparisons). There is a trivial lower bound of ''n'' − 1 for the minimum or maximum entry. To see this, consider a tournament where each game represents one comparison. Since every player except the winner of the tournament must lose a game before we know the winner, we have a lower bound of ''n'' − 1 comparisons. The story becomes more complex for other indexes. We define W_(n) as the minimum number of comparisons required to find the ''t'' smallest values. Knuth references a paper published by S. S. Kislitsyn, which shows an upper bound on this value: :W_(n) \leq n - t + \sum_ \lceil\rceil \quad \text\, n \geq t This bound is achievable for ''t''=2 but better, more complex bounds are known for larger ''t''.


Space complexity

The required space complexity of selection is O(1) additional storage, in addition to storing the array in which selection is being performed. Such space complexity can be achieved while preserving optimal O(n) time complexity.


Online selection algorithm

Online In computer technology and telecommunications, online indicates a state of connectivity and offline indicates a disconnected state. In modern terminology, this usually refers to an Internet connection, but (especially when expressed "on line" o ...
selection may refer narrowly to computing the ''k''th smallest element of a stream, in which case partial sorting algorithms — with ''k'' + O(1) space for the ''k'' smallest elements so far — can be used, but partition-based algorithms cannot be. Alternatively, selection itself may be required to be
online In computer technology and telecommunications, online indicates a state of connectivity and offline indicates a disconnected state. In modern terminology, this usually refers to an Internet connection, but (especially when expressed "on line" o ...
, that is, an element can only be selected from a sequential input at the instance of observation and each selection, respectively refusal, is irrevocable. The problem is to select, under these constraints, a specific element of the input sequence (as for example the largest or the smallest value) with largest probability. This problem can be tackled by the
Odds algorithm The odds algorithm (or Bruss algorithm) is a mathematical method for computing optimal strategies for a class of problems that belong to the domain of optimal stopping problems. Their solution follows from the ''odds strategy'', and the importance ...
, which yields the optimal under an independence condition; it is also optimal itself as an algorithm with the number of computations being linear in the length of input. The simplest example is the
secretary problem The secretary problem demonstrates a scenario involving optimal stopping theory For French translation, secover storyin the July issue of ''Pour la Science'' (2009). that is studied extensively in the fields of applied probability, statistics, ...
of choosing the maximum with high probability, in which case optimal strategy (on random data) is to track the running maximum of the first ''n''/''e'' elements and reject them, and then select the first element that is higher than this maximum.


Related problems

One may generalize the selection problem to apply to ranges within a list, yielding the problem of range queries. The question of range median queries (computing the medians of multiple ranges) has been analyzed.


Language support

Very few languages have built-in support for general selection, although many provide facilities for finding the smallest or largest element of a list. A notable exception is
C++ C++ (pronounced "C plus plus") is a high-level general-purpose programming language created by Danish computer scientist Bjarne Stroustrup as an extension of the C programming language, or "C with Classes". The language has expanded significan ...
, which provides a templated nth_element method with a guarantee of expected linear time, and also partitions the data, requiring that the ''n''th element be sorted into its correct place, elements before the ''n''th element are less than it, and elements after the ''n''th element are greater than it. It is implied but not required that it is based on Hoare's algorithm (or some variant) by its requirement of expected linear time and partitioning of data. For
Perl Perl is a family of two high-level, general-purpose, interpreted, dynamic programming languages. "Perl" refers to Perl 5, but from 2000 to 2019 it also referred to its redesigned "sister language", Perl 6, before the latter's name was offic ...
, the modul
Sort::Key::Top
available from CPAN, provides a set of functions to select the top n elements from a list using several orderings and custom key extraction procedures. Furthermore, th
Statistics::CaseResampling
module provides a function to calculate quantiles using Quickselect.
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (pro ...
's standard library (since 2.4) includes heapq
nsmallest() and nlargest(), returning sorted lists, in O(''n'' log ''k'') time. Numpy has the partition() function.
Matlab MATLAB (an abbreviation of "MATrix LABoratory") is a proprietary multi-paradigm programming language and numeric computing environment developed by MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementa ...
includes maxk() and mink() functions, which return the maximal (minimal) k values in a vector as well as their indices. Because language support for sorting is more ubiquitous, the simplistic approach of sorting followed by indexing is preferred in many environments despite its disadvantage in speed. Indeed, for lazy languages, this simplistic approach can even achieve the best complexity possible for the ''k'' smallest/greatest sorted (with maximum/minimum as a special case) if the sort is lazy enough.


See also

* Ordinal optimization *
Search algorithm In computer science, a search algorithm is an algorithm designed to solve a search problem. Search algorithms work to retrieve information stored within particular data structure, or calculated in the search space of a problem domain, with eith ...


References


Bibliography

* * * *
Donald Knuth Donald Ervin Knuth ( ; born January 10, 1938) is an American computer scientist, mathematician, and professor emeritus at Stanford University. He is the 1974 recipient of the ACM Turing Award, informally considered the Nobel Prize of computer sc ...
. ''
The Art of Computer Programming ''The Art of Computer Programming'' (''TAOCP'') is a comprehensive monograph written by the computer scientist Donald Knuth presenting programming algorithms and their analysis. Volumes 1–5 are intended to represent the central core of com ...
'', Volume 3: ''Sorting and Searching'', Third Edition. Addison-Wesley, 1997. . Section 5.3.3: Minimum-Comparison Selection, pp. 207–219. * Thomas H. Cormen,
Charles E. Leiserson Charles Eric Leiserson is a computer scientist, specializing in the theory of parallel computing and distributed computing, and particularly practical applications thereof. As part of this effort, he developed the Cilk multithreaded language. ...
,
Ronald L. Rivest Ronald Linn Rivest (; born May 6, 1947) is a cryptographer and an Institute Professor at MIT. He is a member of MIT's Department of Electrical Engineering and Computer Science (EECS) and a member of MIT's Computer Science and Artificial Inte ...
, and
Clifford Stein Clifford Seth Stein (born December 14, 1965), a computer scientist, is a professor of industrial engineering and operations research at Columbia University in New York, NY, where he also holds an appointment in the Department of Computer Scien ...
. ''
Introduction to Algorithms ''Introduction to Algorithms'' is a book on computer programming by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. The book has been widely used as the textbook for algorithms courses at many universities and is ...
'', Second Edition. MIT Press and McGraw-Hill, 2001. . Chapter 9: Medians and Order Statistics, pp. 183–196. Section 14.1: Dynamic order statistics, pp. 302–308. *


External links

*
Lecture notes for January 25, 1996: Selection and order statistics
, ''ICS 161: Design and Analysis of Algorithms,'' David Eppstein {{DEFAULTSORT:Selection Algorithm