In database management, an aggregate function or aggregation function is a function where the values of multiple rows are grouped together to form a single summary value.

Common aggregate functions include:
* Average (i.e., arithmetic mean)
* Count
* Maximum
* Median
* Minimum
* Mode
* Range
* Sum

Others include:
* Nanmean (mean ignoring NaN values, also known as "nil" or "null")
* Stddev

Formally, an aggregate function takes as input a set, a multiset (bag), or a list from some input domain and outputs an element of an output domain. The input and output domains may be the same, such as for SUM, or may be different, such as for COUNT.

Aggregate functions occur commonly in numerous programming languages, in spreadsheets, and in relational algebra.

The listagg function, as defined in the SQL:2016 standard, aggregates data from multiple rows into a single concatenated string.
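The definition above can be illustrated with a minimal sketch in Python. The rows, the group keys, and the helper name `aggregate` are hypothetical and chosen only for illustration; the sketch groups values by key and reduces each group to a single summary value.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical rows of (group key, value); not taken from any real dataset.
rows = [("a", 10), ("a", 20), ("b", 5), ("b", 7), ("b", 9)]

def aggregate(rows, func):
    """Group rows by key, then reduce each group's values to one summary value."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: func(values) for key, values in groups.items()}

sums = aggregate(rows, sum)        # SUM: output domain equals input domain
counts = aggregate(rows, len)      # COUNT: output is a count, whatever the input type
averages = aggregate(rows, mean)   # AVERAGE (arithmetic mean)
medians = aggregate(rows, median)  # MEDIAN
```

Each call takes the multiset of values in a group and returns one value per group, which is roughly what a GROUP BY query with an aggregate function does in SQL.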


Decomposable aggregate functions

Aggregate functions present a bottleneck, because they potentially require having all input values at once. In distributed computing, it is desirable to divide such computations into smaller pieces and distribute the work, usually computing in parallel, via a divide-and-conquer algorithm.

Some aggregate functions can be computed by computing the aggregate for subsets, and then aggregating these aggregates; examples include COUNT, MAX, MIN, and SUM. In other cases the aggregate can be computed by computing auxiliary numbers for subsets, aggregating these auxiliary numbers, and finally computing the overall number at the end; examples include AVERAGE (tracking sum and count, dividing at the end) and RANGE (tracking max and min, subtracting at the end). In other cases the aggregate cannot be computed without analyzing the entire set at once, though in some cases approximations can be distributed; examples include DISTINCT COUNT, MEDIAN, and MODE.

Functions of the first two kinds are called decomposable aggregation functions or decomposable aggregate functions. The simplest may be referred to as self-decomposable aggregation functions, which are defined as those functions f such that there is a ''merge operator'' \diamond such that
:f(X \uplus Y) = f(X) \diamond f(Y)
where \uplus is the union of multisets (see monoid homomorphism).

For example, SUM:
:\operatorname{SUM}(\{x\}) = x, for a singleton;
:\operatorname{SUM}(X \uplus Y) = \operatorname{SUM}(X) + \operatorname{SUM}(Y), meaning that merge \diamond is simply addition.
COUNT:
:\operatorname{COUNT}(\{x\}) = 1,
:\operatorname{COUNT}(X \uplus Y) = \operatorname{COUNT}(X) + \operatorname{COUNT}(Y).
MAX:
:\operatorname{MAX}(\{x\}) = x,
:\operatorname{MAX}(X \uplus Y) = \max\bigl(\operatorname{MAX}(X), \operatorname{MAX}(Y)\bigr).
MIN:
:\operatorname{MIN}(\{x\}) = x,
:\operatorname{MIN}(X \uplus Y) = \min\bigl(\operatorname{MIN}(X), \operatorname{MIN}(Y)\bigr).

Note that self-decomposable aggregation functions can be combined (formally, taking the product) by applying them separately, so for instance one can compute both the SUM and COUNT at the same time, by tracking two numbers.

More generally, one can define a decomposable aggregation function f as one that can be expressed as the composition of a final function g and a self-decomposable aggregation function h,
:f = g \circ h, \quad f(X) = g(h(X)).
For example, AVERAGE = SUM / COUNT and RANGE = MAX − MIN.

In the MapReduce framework, these steps are known as InitialReduce (value on individual record/singleton set), Combine (binary merge on two aggregations), and FinalReduce (final function on auxiliary values), and moving decomposable aggregation before the Shuffle phase is known as an InitialReduce step.

Decomposable aggregation functions are important in online analytical processing (OLAP), as they allow aggregation queries to be computed on the pre-computed results in the OLAP cube, rather than on the base data. For example, it is easy to support COUNT, MAX, MIN, and SUM in OLAP, since these can be computed for each cell of the OLAP cube and then summarized ("rolled up"), but it is difficult to support MEDIAN, as that must be computed for every view separately.
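The decomposition into an initial value, a binary merge, and a final function can be sketched in Python as follows. This is an illustration under stated assumptions rather than any framework's actual API: the dictionary keys `initial`, `combine`, and `final`, the helper names, and the sample chunks are all hypothetical, and every chunk is assumed to be non-empty.

```python
from functools import reduce

# Each aggregate is described by three hypothetical steps:
#   initial: auxiliary value for a single record (singleton set)
#   combine: binary merge of two auxiliary values
#   final:   function applied once to the fully merged auxiliary value
AVERAGE = {
    "initial": lambda x: (x, 1),                         # (sum, count)
    "combine": lambda a, b: (a[0] + b[0], a[1] + b[1]),
    "final": lambda s: s[0] / s[1],                      # sum / count
}
RANGE = {
    "initial": lambda x: (x, x),                         # (max, min)
    "combine": lambda a, b: (max(a[0], b[0]), min(a[1], b[1])),
    "final": lambda s: s[0] - s[1],                      # max - min
}

def partial_aggregate(agg, chunk):
    """Merge the auxiliary values of one non-empty chunk (for example, one worker)."""
    return reduce(agg["combine"], (agg["initial"](x) for x in chunk))

def aggregate(agg, chunks):
    """Aggregate each chunk independently, merge the partial results, then finalize."""
    partials = [partial_aggregate(agg, chunk) for chunk in chunks]
    return agg["final"](reduce(agg["combine"], partials))

# Hypothetical data split across three chunks.
chunks = [[2, 3], [4, 7], [10]]
average = aggregate(AVERAGE, chunks)
value_range = aggregate(RANGE, chunks)
```

Because only the small auxiliary tuples cross chunk boundaries, the per-chunk work could run in parallel; the result matches aggregating the concatenated data directly. No such triple is given here for MEDIAN, which is why it does not fit this pattern.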


Other decomposable aggregate functions

In order to calculate the average and standard deviation from aggregate data, it is necessary to have available for each group: the total of the values (Σx_i = SUM(x)), the number of values (N = COUNT(x)), and the total of the squares of the values (Σx_i^2 = SUM(x^2)).

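AVG:
:\operatorname{AVG}(X \uplus Y) = \bigl(\operatorname{AVG}(X) * \operatorname{COUNT}(X) + \operatorname{AVG}(Y) * \operatorname{COUNT}(Y)\bigr) / \bigl(\operatorname{COUNT}(X) + \operatorname{COUNT}(Y)\bigr)
or
:\operatorname{AVG}(X \uplus Y) = \bigl(\operatorname{SUM}(X) + \operatorname{SUM}(Y)\bigr) / \bigl(\operatorname{COUNT}(X) + \operatorname{COUNT}(Y)\bigr)
or, only if COUNT(X) = COUNT(Y),
:\operatorname{AVG}(X \uplus Y) = \bigl(\operatorname{AVG}(X) + \operatorname{AVG}(Y)\bigr) / 2.

SUM(x^2): The sum of the squares of the values is needed in order to calculate the standard deviation of groups:
:\operatorname{SUM}(X^2 \uplus Y^2) = \operatorname{SUM}(X^2) + \operatorname{SUM}(Y^2)

STDDEV: For a finite population with equal probabilities at all points, we have (see Standard deviation, Identities and mathematical properties)
:\operatorname{STDDEV}(X) = s(x) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2} = \sqrt{\frac{1}{N} \left(\sum_{i=1}^{N} x_i^2\right) - \bar{x}^2} = \sqrt{\operatorname{SUM}(x^2) / \operatorname{COUNT}(x) - \operatorname{AVG}(x)^2}
This means that the standard deviation is equal to the square root of the difference between the average of the squares of the values and the square of the average value.
:\operatorname{STDDEV}(X \uplus Y) = \sqrt{\operatorname{SUM}(X^2 \uplus Y^2) / \operatorname{COUNT}(X \uplus Y) - \operatorname{AVG}(X \uplus Y)^2}.
:\operatorname{STDDEV}(X \uplus Y) = \sqrt{\frac{\operatorname{SUM}(X^2) + \operatorname{SUM}(Y^2)}{\operatorname{COUNT}(X) + \operatorname{COUNT}(Y)} - \left(\frac{\operatorname{SUM}(X) + \operatorname{SUM}(Y)}{\operatorname{COUNT}(X) + \operatorname{COUNT}(Y)}\right)^2}.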
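A minimal Python sketch of merging group summaries to obtain the average and the population standard deviation of the combined data is given below. The function names `summarize`, `merge`, `avg`, and `stddev` and the sample values are hypothetical; the sketch assumes each group is summarized as the triple (COUNT(x), SUM(x), SUM(x^2)) described above.

```python
import math
from statistics import pstdev

def summarize(values):
    """Auxiliary values for one group: (COUNT(x), SUM(x), SUM(x^2))."""
    return (len(values), sum(values), sum(v * v for v in values))

def merge(a, b):
    """Merge two summaries; each component is self-decomposable by addition."""
    return tuple(p + q for p, q in zip(a, b))

def avg(summary):
    count, total, _ = summary
    return total / count

def stddev(summary):
    """Population standard deviation: sqrt(SUM(x^2)/COUNT(x) - AVG(x)^2)."""
    count, total, total_sq = summary
    return math.sqrt(total_sq / count - (total / count) ** 2)

# Hypothetical groups X and Y, summarized separately and then merged.
x = [2.0, 4.0, 4.0, 4.0]
y = [5.0, 5.0, 7.0, 9.0]
merged = merge(summarize(x), summarize(y))
merged_avg = avg(merged)
merged_std = stddev(merged)
```

The merged results agree with computing the mean and `statistics.pstdev` over the concatenated data. With floating-point input, the subtraction inside the square root can lose precision when the mean is large relative to the spread, so this form is best read as an illustration of the identity rather than a numerically robust implementation.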


See also

* Cross-tabulation, a.k.a. contingency table
* Data drilling
* Data mining
* Data processing
* Extract, transform, load
* Fold (higher-order function)
* Group by (SQL), SQL clause
* OLAP cube
* Online analytical processing
* Pivot table
* Relational algebra
* Utility functions on indivisible goods#Aggregates of utility functions
* XML for Analysis
* AggregateIQ




Further reading

* Oracle Aggregate Functions: MAX, MIN, COUNT, SUM, AVG Examples


External links


* Aggregate Functions (Transact-SQL)