Misra and Gries defined the ''heavy-hitters problem'' (though they did not introduce the term ''heavy-hitters'') and described the first algorithm for it in the paper ''Finding repeated elements''.
Their algorithm extends the Boyer-Moore majority finding algorithm in a significant way.
One version of the heavy-hitters problem is as follows: given a bag b of n elements and an integer k ≥ 2, find the values that
occur more than n/k times in b. The Misra-Gries algorithm solves
the problem by making two passes over the values in b, while storing
at most k values from b and their number of occurrences during the
course of the algorithm.
Misra-Gries is one of the earliest streaming algorithms,
and it is described below in those terms in the Summaries section.
Misra–Gries algorithm
A bag is like a set in which the same value may occur multiple
times. Assume that a bag is available as an array b[0..n-1] of n elements.
In the abstract description of the algorithm, we treat b
and its segments also as bags. Henceforth, a ''heavy hitter'' of
bag b is a value that occurs more than n/k times in it, for some integer k, k ≥ 2.
A ''k-reduced bag'' for bag b is derived from b by
repeating the following operation until it is no longer possible: delete k distinct elements from b. By definition, a k-reduced bag contains fewer than k different values.
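The following theorem is easy to prove:
Theorem 1. Each heavy-hitter of b is an element of a k-reduced bag for b.
The reason is that each deletion step removes k elements, so at most n/k steps are possible, and each step removes at most one occurrence of any given value. A value that occurs more than n/k times therefore keeps at least one occurrence.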
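The definition of a k-reduced bag and the claim of Theorem 1 can be checked on a small example. The following is a minimal sketch that applies the deletion step literally; the function name `k_reduce`, the sample bag, and the rule for choosing which k distinct values to delete are assumptions made for illustration, not part of the original description.

```python
from collections import Counter

def k_reduce(bag, k):
    """Return one k-reduced bag of `bag` as a Counter (value -> multiplicity)."""
    counts = Counter(bag)
    # While at least k distinct values remain, delete one occurrence of k of them.
    while len(counts) >= k:
        # Assumed choice: take the first k distinct values in insertion order.
        for value in list(counts)[:k]:
            counts[value] -= 1
            if counts[value] == 0:
                del counts[value]
    return counts

b = [1, 1, 1, 2, 2, 3, 1, 4, 1]
k = 3
reduced = k_reduce(b, k)
# Heavy hitters by definition: values occurring more than n/k times.
heavy = {v for v, c in Counter(b).items() if c * k > len(b)}
```

On this sample the reduced bag has fewer than k distinct values and still contains every heavy hitter, as the theorem states.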
The first pass of the heavy-hitters computation constructs a k-reduced
bag t. The second pass declares an element of t to be a heavy-hitter if
it occurs more than n/k times in b. According to Theorem 1, this
procedure determines all and only the heavy-hitters. The second pass
is easy to program, so we describe only the first pass.
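For completeness, the second pass could be sketched as follows. The function name `second_pass` and its parameters are hypothetical; `candidates` stands for the distinct values of the bag produced by the first pass.

```python
def second_pass(b, candidates, k):
    """Keep the candidates that occur more than n/k times in b."""
    n = len(b)
    counts = dict.fromkeys(candidates, 0)
    for value in b:
        if value in counts:
            counts[value] += 1
    # count > n/k, written with integers to avoid floating-point division
    return {v for v, c in counts.items() if c * k > n}
```

Only the candidates are counted, so the extra storage stays proportional to the number of candidates rather than to the number of distinct values in b.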
In order to construct t, scan the values in b in arbitrary order; for
specificity, the following algorithm scans them in order of
increasing index. The invariant of the
algorithm is that t is a k-reduced bag for the scanned values and d is
the number of distinct values in t. Initially, no value has been
scanned, t is the empty bag, and d is zero.
Whenever element b[i] is scanned, in order to preserve the
invariant: (1) if b[i] is not in t, add it to t and increase d by 1,
(2) if b[i] is in t, add it to t but don't modify d, and
(3) if d becomes equal to k, reduce t by deleting k distinct values from
it and update d appropriately.
algorithm Misra–Gries is
    t, d := { }, 0
    for i from 0 to n-1 do
        if b[i] ∉ t then
            t, d := t ∪ {b[i]}, d+1
        else
            t, d := t ∪ {b[i]}, d
        endif
        if d = k then
            Delete k distinct values from t; update d
        endif
    endfor
A possible implementation of t is as a set of pairs of the form
(v_i, c_i), where each v_i is a distinct value in t
and c_i is the number of occurrences of v_i in t.
Then d is the size of this set. The
step "Delete k distinct values from t" amounts to reducing each c_i by
1 and then removing any pair (v_i, c_i) from the set if c_i becomes 0.
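The first pass with value-counter pairs might look like the sketch below. It is an illustration under stated assumptions: the function name `misra_gries_first_pass` is hypothetical, and a Python dict (a hash table) stands in for the set of pairs, so d is simply the size of the dict.

```python
def misra_gries_first_pass(b, k):
    """Return a k-reduced bag of b as a dict mapping value -> count."""
    t = {}  # d, the number of distinct values in t, is len(t)
    for value in b:
        # Cases (1) and (2): add the scanned value to t.
        t[value] = t.get(value, 0) + 1
        # Case (3): d reached k, so delete k distinct values.
        if len(t) == k:
            # Decrement every counter and drop the pairs that reach 0.
            t = {v: c - 1 for v, c in t.items() if c > 1}
    return t
```

For example, with b = [1, 1, 1, 2, 2, 3, 1, 4, 1] and k = 3 the function returns {1: 3}, and 1 is indeed the only value occurring more than 9/3 = 3 times.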
Using an AVL tree implementation of t, the algorithm has a
running time of O(n log k). In order to assess the space requirement, assume that the elements of b
can have m possible values, so the storage of a value v_i needs
O(log m) bits. Since each counter c_i may have a value as high as
n, its storage needs O(log n) bits. Therefore, for O(k) value-counter pairs,
the space requirement is O(k (log n + log m)).
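To make the space bound concrete, the sketch below evaluates k (log n + log m) in bits for given parameters. The function name `summary_bits` and the sample figures are assumptions, and the result ignores the constant factors and data-structure overhead hidden by the O-notation.

```python
def summary_bits(n, m, k):
    """Rough bit count for k value-counter pairs (illustrative only)."""
    value_bits = max(1, (m - 1).bit_length())  # bits to name one of m values
    counter_bits = n.bit_length()              # bits to hold a count up to n
    return k * (value_bits + counter_bits)

# Assumed figures: a billion items, 32-bit values, k = 100.
bits = summary_bits(n=10**9, m=2**32, k=100)
```

With these figures each pair takes 32 + 30 = 62 bits, so `bits` is 6200, far less than storing the input itself.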
Summaries
In the field of streaming algorithms, the output of the Misra-Gries algorithm in the first pass may be called a ''summary'', and such summaries are used to solve the frequent elements problem in the data stream model. A streaming algorithm makes a small, bounded number of
passes over a list of data items called a ''stream''. It processes the elements using at most a logarithmic amount of extra space in the size of the list to produce an answer.
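In the streaming setting the input is consumed item by item and its length need not be known in advance. One way to express the first pass in that style is sketched below; the class `MisraGriesSummary`, its attribute names, and the generator `read_stream` are hypothetical and serve only as an illustration.

```python
class MisraGriesSummary:
    """One-pass summary holding fewer than k counters between updates."""

    def __init__(self, k):
        if k < 2:
            raise ValueError("k must be at least 2")
        self.k = k
        self.n = 0          # number of stream items seen so far
        self.counters = {}  # value -> count

    def update(self, item):
        self.n += 1
        self.counters[item] = self.counters.get(item, 0) + 1
        if len(self.counters) == self.k:
            # Delete k distinct values: decrement all counters, drop zeros.
            self.counters = {v: c - 1 for v, c in self.counters.items() if c > 1}

def read_stream():
    # Stand-in for a data source that can be read only once.
    yield from ["a", "b", "a", "c", "a", "a", "d", "a"]

summary = MisraGriesSummary(k=3)
for item in read_stream():
    summary.update(item)
```

After the loop, `summary.counters` holds the candidate values; confirming which of them are true heavy hitters still requires a second pass over the data.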
The term ''Misra–Gries summary'' appears to have been coined by Graham Cormode.