In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes (typically just one). In most models, these algorithms have access to limited memory (generally logarithmic in the size of the stream and/or in the maximum value in the stream). They may also have limited processing time per item.
These constraints may mean that an algorithm produces an approximate answer based on a summary or "sketch" of the data stream.
History
Though streaming algorithms had already been studied by Munro and Paterson as early as 1978, as well as by Philippe Flajolet and G. Nigel Martin in 1982/83, the field of streaming algorithms was first formalized and popularized in a 1996 paper by Noga Alon, Yossi Matias, and Mario Szegedy. For this paper, the authors later won the Gödel Prize in 2005 "for their foundational contribution to streaming algorithms." There has since been a large body of work centered around data streaming algorithms that spans a diverse spectrum of computer science fields such as theory, databases, networking, and natural language processing.
Semi-streaming algorithms were introduced in 2005 as a relaxation of streaming algorithms for graphs, in which the space allowed is linear in the number of vertices $n$, but only logarithmic in the number of edges $m$. This relaxation is still meaningful for dense graphs, and can solve interesting problems (such as connectivity) that are insoluble in $o(n)$ space.
Models
Data stream model
In the data stream model, some or all of the input is represented as a finite sequence of integers (from some finite domain) which is generally not available for random access, but instead arrives one at a time in a "stream". If the stream has length $n$ and the domain has size $m$, algorithms are generally constrained to use space that is logarithmic in $m$ and $n$. They can generally make only some small constant number of passes over the stream, sometimes just one.
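As a minimal illustration of the one-pass, small-memory constraint, the sketch below consumes a stream exactly once and keeps only four integers of state. The names `Summary`, `one_pass_summary`, and `sensor_readings` are hypothetical and chosen only for this example.

```python
from typing import Iterable, Iterator, NamedTuple


class Summary(NamedTuple):
    count: int
    total: int
    minimum: int
    maximum: int


def one_pass_summary(stream: Iterable[int]) -> Summary:
    """Summarise a non-empty integer stream in a single pass.

    Only four integers are kept, so the state needs O(log n + log m) bits
    for a stream of length n over a domain of size m.
    """
    it = iter(stream)
    try:
        first = next(it)
    except StopIteration:
        raise ValueError("stream must contain at least one item") from None
    count, total, lo, hi = 1, first, first, first
    for x in it:
        count += 1
        total += x
        lo = min(lo, x)
        hi = max(hi, x)
    return Summary(count, total, lo, hi)


def sensor_readings() -> Iterator[int]:
    """Hypothetical source: items arrive one at a time and cannot be replayed."""
    yield from [7, 3, 9, 3, 5]


print(one_pass_summary(sensor_readings()))
```

Because the input is a generator, the function cannot revisit earlier items, which mirrors the lack of random access in the model.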
Turnstile and cash register models
Much of the streaming literature is concerned with computing statistics on frequency distributions that are too large to be stored. For this class of problems, there is a vector $\mathbf{a} = (a_1, \dots, a_n)$ (initialized to the zero vector $\mathbf{0}$) that has updates presented to it in a stream. The goal of these algorithms is to compute functions of $\mathbf{a}$ using considerably less space than it would take to represent $\mathbf{a}$ precisely. There are two common models for updating such streams, called the "cash register" and "turnstile" models.
In the cash register model, each update is of the form $\langle i, c \rangle$, so that $a_i$ is incremented by some positive integer $c$. A notable special case is when $c = 1$ (only unit insertions are permitted).
In the turnstile model, each update is of the form $\langle i, c \rangle$, so that $a_i$ is incremented by some (possibly negative) integer $c$. In the "strict turnstile" model, no $a_i$ at any time may be less than zero.
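The difference between the update models can be sketched as follows. This is only an illustration of what each model permits: it stores the vector explicitly, which an actual streaming algorithm would avoid. The function name `apply_updates` and the model labels are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Update = Tuple[int, int]  # (i, c): add c to coordinate a_i

MODELS = {"cash_register", "turnstile", "strict_turnstile"}


def apply_updates(updates: Iterable[Update], model: str = "turnstile") -> Dict[int, int]:
    """Replay a stream of (i, c) updates and return the non-zero entries of a."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    a: Dict[int, int] = defaultdict(int)  # implicit zero vector
    for i, c in updates:
        # Cash register: only positive increments are allowed.
        if model == "cash_register" and c <= 0:
            raise ValueError(f"cash register updates must be positive, got {c}")
        a[i] += c
        # Strict turnstile: no coordinate may ever become negative.
        if model == "strict_turnstile" and a[i] < 0:
            raise ValueError(f"a[{i}] dropped below zero")
    return {i: v for i, v in a.items() if v != 0}


print(apply_updates([(1, 2), (4, 1), (1, 3)], model="cash_register"))  # {1: 5, 4: 1}
print(apply_updates([(1, 2), (1, -2), (4, -1)]))  # {4: -1}
```

The second call is valid in the general turnstile model but would raise an error under `"strict_turnstile"`, because coordinate 4 becomes negative.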
Sliding window model
Several papers also consider the "sliding window" model. In this model, the function of interest is computed over a fixed-size window in the stream. As the stream progresses, the oldest items in the window are removed from consideration while new items from the stream take their place.
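The window semantics described above can be sketched with an exact running sum. This example keeps the whole window in memory, so it shows only how items enter and expire; it is not a space-efficient sliding-window algorithm. The function name `sliding_window_sums` is hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def sliding_window_sums(stream: Iterable[int], window: int) -> Iterator[int]:
    """Yield the sum of the most recent `window` items after each arrival."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent: Deque[int] = deque()
    total = 0
    for x in stream:
        recent.append(x)
        total += x
        if len(recent) > window:
            total -= recent.popleft()  # the oldest item leaves the window
        yield total


print(list(sliding_window_sums([1, 2, 3, 4, 5], window=3)))  # [1, 3, 6, 9, 12]
```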
Besides the above frequency-based problems, some other types of problems have also been studied. Many graph problems are solved in the setting where the adjacency matrix or the adjacency list of the graph is streamed in some unknown order. There are also some problems that are very dependent on the order of the stream (i.e., asymmetric functions), such as counting the number of inversions in a stream and finding the longest increasing subsequence.
Evaluation
The performance of an algorithm that operates on data streams is measured by three basic factors:
* The number of passes the algorithm must make over the stream.
* The available memory.
* The running time of the algorithm.
These algorithms have many similarities with online algorithms since they both require decisions to be made before all data are available, but they are not identical. Data stream algorithms have only limited memory available, but they may be able to defer action until a group of points arrives, while online algorithms are required to take action as soon as each point arrives.
If the algorithm is an approximation algorithm then the accuracy of the answer is another key factor. The accuracy is often stated as an $(\epsilon, \delta)$ approximation, meaning that the algorithm achieves an error of less than $\epsilon$ with probability $1 - \delta$.
Applications
Streaming algorithms have several applications in networking, such as monitoring network links for elephant flows, counting the number of distinct flows, estimating the distribution of flow sizes, and so on. They also have applications in databases, such as estimating the size of a join.
Some streaming problems
Frequency moments
The $k$th frequency moment of a set of frequencies $\mathbf{a}$ is defined as $F_k(\mathbf{a}) = \sum_{i=1}^{n} a_i^k$.
The first moment $F_1$ is simply the sum of the frequencies (i.e., the total count). The second moment $F_2$ is useful for computing statistical properties of the data, such as the Gini coefficient of variation. $F_\infty$ is defined as the frequency of the most frequent item.
The seminal paper of Alon, Matias, and Szegedy dealt with the problem of estimating the frequency moments.
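For reference, the definition can be evaluated exactly with the sketch below. It keeps one counter per distinct item, so it does not meet streaming space bounds and serves only to make the definition concrete. The function name `frequency_moment` is hypothetical, and $F_0$ is taken to count only items that actually occur.

```python
from collections import Counter
from typing import Hashable, Iterable


def frequency_moment(stream: Iterable[Hashable], k: float) -> float:
    """Exact k-th frequency moment F_k = sum_i a_i**k of a stream of items.

    k = 0 counts distinct items, k = 1 gives the stream length, and
    k = float("inf") returns the largest frequency.
    """
    freqs = Counter(stream)  # a_i for every item i that occurs
    if not freqs:
        return 0
    if k == float("inf"):
        return max(freqs.values())
    return sum(a ** k for a in freqs.values())


stream = ["a", "b", "a", "c", "a", "b"]  # frequencies: a=3, b=2, c=1
print(frequency_moment(stream, 0))  # 3
print(frequency_moment(stream, 1))  # 6
print(frequency_moment(stream, 2))  # 14
print(frequency_moment(stream, float("inf")))  # 3
```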
Calculating frequency moments
A direct approach to finding the frequency moments maintains a register for every distinct element, which requires memory of order at least $\Omega(N)$, where $N$ is the number of distinct elements. Under space limitations, an algorithm that uses much less memory is required. This can be achieved by using approximations instead of exact values. Such an algorithm computes an (''ε'',''δ'')-approximation $F'_k$ of $F_k$, where ''ε'' is the approximation parameter and ''δ'' is the confidence parameter.
Calculating ''F''0 (distinct elements in a data stream)
FM-Sketch algorithm
Flajolet et al. introduced a probabilistic method of counting that was inspired by a paper by Robert Morris. Morris observed that if the requirement of accuracy is dropped, a counter ''n'' can be replaced by a counter $\log n$, which can be stored in $\log \log n$ bits. Flajolet et al. improved this method by using a hash function $h$ that is assumed to distribute the elements uniformly over the hash space (binary strings of length $L$).
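Morris's idea of keeping only an approximate logarithm of the count can be sketched as follows. This is a minimal version of the standard approximate counter, not code taken from either paper; the class name `MorrisCounter` and the choice of seed are assumptions for illustration.

```python
import random


class MorrisCounter:
    """Approximate counter that stores only an exponent c, roughly log2(n)."""

    def __init__(self, rng: random.Random) -> None:
        self.c = 0
        self._rng = rng

    def increment(self) -> None:
        # Raise the exponent with probability 2**-c.
        if self._rng.random() < 2.0 ** -self.c:
            self.c += 1

    def estimate(self) -> int:
        # 2**c - 1 is an unbiased estimate of the number of increments.
        return 2 ** self.c - 1


counter = MorrisCounter(random.Random(0))  # the seed is arbitrary
for _ in range(10_000):
    counter.increment()
print(counter.c, counter.estimate())
```

A single counter has high variance, so in practice several independent counters are usually averaged; the sketch omits that step to keep the core idea visible.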