Count-distinct Problem

	Count-distinct Problem In computer science, the count-distinct problem (also known in applied mathematics as the cardinality estimation problem) is the problem of finding the number of distinct elements in a data stream with repeated elements. This is a well-known problem with numerous applications. The elements might represent IP addresses of packets passing through a router, unique visitors to a web site, elements in a large database, motifs in a DNA sequence, or elements of RFID/sensor networks. Formal definition. Instance: A stream of elements x_1, x_2, \ldots, x_s with repetitions, and an integer m. Let n be the number of distinct elements, namely n = |\{x_1, x_2, \ldots, x_s\}|, and let these elements be \{e_1, e_2, \ldots, e_n\}. Objective: Find an estimate \widehat{n} of n using only m storage units, where m \ll n. An example of an instance for the cardinality estimation problem is the stream a, b, a, c, d, b, d. For this instance, n = |\{a, b, c, d\}| = 4. Naive solution The naive solution to the problem is as follows: ...
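The example instance above can be checked with a few lines of Python. This is only a minimal sketch of an exact count, not the naive solution the excerpt goes on to describe (that text is cut off here). The function name `count_distinct_exact` is an assumption made for illustration. The set it keeps grows with n, which is exactly what the m \ll n storage constraint rules out.

```python
def count_distinct_exact(stream):
    """Return the exact number of distinct elements in an iterable.

    Memory use is proportional to the number of distinct elements,
    so this does not satisfy the m << n storage constraint.
    """
    seen = set()
    for x in stream:
        seen.add(x)  # adding a repeated element leaves the set unchanged
    return len(seen)


# The example stream from the formal definition.
stream = ["a", "b", "a", "c", "d", "b", "d"]
print(count_distinct_exact(stream))
```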
	IP Addresses An Internet Protocol address (IP address) is a numerical label, such as 192.0.2.1, that is connected to a computer network that uses the Internet Protocol for communication. An IP address serves two main functions: network interface identification and location addressing. Internet Protocol version 4 (IPv4) defines an IP address as a 32-bit number. However, because of the growth of the Internet and the depletion of available IPv4 addresses, a new version of IP (IPv6), using 128 bits for the IP address, was standardized in 1998. IPv6 deployment has been ongoing since the mid-2000s. IP addresses are written and displayed in human-readable notations, such as 192.0.2.1 in IPv4, and 2001:db8:0:1234:0:567:8:1 in IPv6. The size of the routing prefix of the address is designated in CIDR notation by suffixing the address with the number of significant bits, e.g., 192.0.2.1/24, which is equivalent to the historically used subnet mask 255.255.255.0. The IP address space is managed globally by the Internet Assigned Numbers Authority ...
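The relationship between a CIDR prefix length and a subnet mask can be illustrated with Python's standard `ipaddress` module. This is a small sketch. The addresses 192.0.2.1/24 and 2001:db8::1 are assumptions chosen from the ranges reserved for documentation, not values taken from any real network.

```python
import ipaddress

# An IPv4 interface address written in CIDR notation (documentation range).
iface = ipaddress.ip_interface("192.0.2.1/24")

print(iface.ip)                 # the host address itself
print(iface.network)            # the network the routing prefix identifies
print(iface.network.prefixlen)  # number of significant bits in the prefix
print(iface.netmask)            # the equivalent dotted-decimal subnet mask

# IPv4 addresses are 32-bit numbers; IPv6 addresses use 128 bits.
addr4 = ipaddress.ip_address("192.0.2.1")
addr6 = ipaddress.ip_address("2001:db8::1")
print(addr4.max_prefixlen, addr6.max_prefixlen)
```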
	HyperLogLog HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset. Calculating the ''exact'' cardinality of the distinct elements of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, at the cost of obtaining only an approximation of the cardinality. The HyperLogLog algorithm is able to estimate cardinalities of > 10^9 with a typical accuracy (standard error) of 2%, using 1.5 kB of memory. HyperLogLog is an extension of the earlier LogLog algorithm, itself deriving from the 1984 Flajolet–Martin algorithm. Terminology In the original paper by Flajolet ''et al.'' and in related literature on the count-distinct problem, the term "cardinality" is used to mean the number of distinct elements in a data stream with repeated elements. How ...
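To make the memory-for-accuracy trade concrete, here is a simplified HyperLogLog-style estimator in Python. It is a sketch under stated assumptions, not the reference algorithm. The class name, the use of SHA-256 truncated to 64 bits as the hash, and the default of 2^10 registers are all choices made for illustration. It also stores registers in a plain Python list, so it does not reproduce the 1.5 kB footprint quoted above.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog-style cardinality estimator (illustrative)."""

    def __init__(self, p=10):
        # p index bits give m = 2**p registers; the alpha formula below
        # is the usual approximation for m >= 128, so keep p >= 7.
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash of the item (assumption: SHA-256 truncated to 8 bytes).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        # Harmonic-mean estimate over all registers.
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(p=10)
for i in range(10000):
    hll.add(f"item-{i}")
print(round(hll.estimate()))
```

With m registers the standard error is roughly 1.04 / sqrt(m), about 3% for m = 1024. Re-adding elements that were already seen never changes the registers, which is why repetitions in the stream do not affect the estimate.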
	Maximum Likelihood In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference. If the likelihood function is differentiable, the derivative test for finding maxima can be applied. In some cases, the first-order conditions of the likelihood function can be solved analytically; for instance, the ordinary least squares estimator for a linear regression model maximizes the likelihood when all observed outcomes are assumed to have Normal distributions with the same variance. From the perspective of Bayesian in ...
	Streaming Algorithm In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes (typically just one). In most models, these algorithms have access to limited memory (generally logarithmic in the size of and/or the maximum value in the stream). They may also have limited processing time per item. These constraints may mean that an algorithm produces an approximate answer based on a summary or "sketch" of the data stream. History Though streaming algorithms had already been studied by Munro and Paterson as early as 1978, as well as Philippe Flajolet and G. Nigel Martin in 1982/83, the field of streaming algorithms was first formalized and popularized in a 1996 paper by Noga Alon, Yossi Matias, and Mario Szegedy. For this paper, the authors later won the Gödel Prize in 2005 "for their foundational contribution to streaming algorithms." There has since been a large body of work ...
	Count–min Sketch In computing, the count–min sketch (CM sketch) is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table uses only sub-linear space, at the expense of overcounting some events due to collisions. The count–min sketch was invented in 2003 by Graham Cormode and S. Muthu Muthukrishnan and described by them in a 2005 paper. Count–min sketch is an alternative to count sketch and AMS sketch and can be considered an implementation of a counting Bloom filter (Fan et al., 1998) or multistage-filter. However, they are used differently and therefore sized differently: a count–min sketch typically has a sublinear number of cells, related to the desired approximation quality of the sketch, while a counting Bloom filter is more typically sized to match the number of elements in the set. Data structure The goal of the basic version of the count–min sketch is to cons ...
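The idea of mapping events to a small table of counters through several hash functions, and accepting overcounts from collisions, can be sketched as follows. This is a minimal illustration, not the construction from the 2005 paper. The class name, the table dimensions, and the use of salted BLAKE2b digests as the row hash functions are assumptions.

```python
import hashlib


class CountMinSketch:
    """Minimal count-min sketch: depth rows of width counters each."""

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, obtained by salting BLAKE2b with the
        # row number (an illustrative choice of hash family).
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only ever add to a counter, so the minimum over the
        # rows is the least-inflated estimate and never undercounts.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


cms = CountMinSketch()
for event in ["a", "b", "a", "c", "d", "b", "d"]:
    cms.add(event)
print(cms.estimate("a"))
```

Because every counter can only be inflated by collisions, `estimate` returns a value at least as large as the true frequency. Widening the table makes collisions less likely, and adding rows makes it less likely that every row collides at once.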
	Internet Protocol The Internet Protocol (IP) is the network layer communications protocol in the Internet protocol suite for relaying datagrams across network boundaries. Its routing function enables internetworking, and essentially establishes the Internet. IP has the task of delivering packets from the source host to the destination host solely based on the IP addresses in the packet headers. For this purpose, IP defines packet structures that encapsulate the data to be delivered. It also defines addressing methods that are used to label the datagram with source and destination information. IP was the connectionless datagram service in the original Transmission Control Program introduced by Vint Cerf and Bob Kahn in 1974, which was complemented by a connection-oriented service that became the basis for the Transmission Control Protocol (TCP). The Internet protocol suite is therefore often referred to as ''TCP/IP''. The first major version of IP, Internet Protocol Version 4 (IPv ...
	Daniel Kane (mathematician) Daniel Mertz Kane (born 1986) is an American mathematician. He is an associate professor with a joint position in the Mathematics Department and the Computer Science and Engineering Department at the University of California, San Diego. Early life and education Kane was born in Madison, Wisconsin, to Janet E. Mertz and Jonathan M. Kane, professors of oncology and of mathematics and computer science, respectively. He attended Wingra School, a small alternative K-8 school in Madison that focuses on self-guided education. By 3rd grade, he had mastered K through 9th-grade mathematics. Starting at age 13, he took honors math courses at the University of Wisconsin–Madison and did research under the mentorship of Ken Ono while dual enrolled at Madison West High School. He earned gold medals in the 2002 and 2003 International Mathematical Olympiads. Prior to his 17th b ...
	Philippe Flajolet Philippe Flajolet (1 December 1948 – 22 March 2011) was a French computer scientist. Biography A former student of École Polytechnique, Philippe Flajolet received his PhD in computer science from University Paris Diderot in 1973 and state doctorate from Paris-Sud 11 University in 1979. Most of Philippe Flajolet's research work was dedicated towards general methods for analyzing the computational complexity of algorithms, including the theory of average-case complexity. He introduced the theory of analytic combinatorics. With Robert Sedgewick of Princeton University, he wrote the first book-length treatment of the topic, the 2009 book entitled ''Analytic Combinatorics''. In 1993, together with Rainer Kemp, Helmut Prodinger and Robert Sedgewick, Flajolet initiated the successful series of workshops and conferences which was key to the development of a research community around the analysis of algorithms, and which evolved into the AofA—International Meeting on Combinatoria ...
	Random Variable A random variable (also called random quantity, aleatory variable, or stochastic variable) is a mathematical formalization of a quantity or object which depends on random events. It is a mapping or a function from possible outcomes (e.g., the possible upper sides of a flipped coin such as heads H and tails T) in a sample space (e.g., the set \{H, T\}) to a measurable space, often the real numbers (e.g., \{-1, 1\}, in which 1 corresponds to H and -1 corresponds to T). Informally, randomness typically represents some fundamental element of chance, such as in the roll of a die; it may also represent uncertainty, such as measurement error. However, the interpretation of probability is philosophically complicated, and even in specific cases is not always straightforward. The purely mathematical analysis of random variables is independent of such interpretational difficulties, and can be based upon a rigorous axiomatic setup. In the formal mathematical language of measure theory, a rando ...
	Router (computing) A router is a networking device that forwards data packets between computer networks. Routers perform the traffic directing functions between networks and on the global Internet. Data sent through a network, such as a web page or email, is in the form of data packets. A packet is typically forwarded from one router to another router through the networks that constitute an internetwork (e.g. the Internet) until it reaches its destination node. A router is connected to two or more data lines from different IP networks. When a data packet comes in on one of the lines, the router reads the network address information in the packet header to determine the ultimate destination. Then, using information in its routing table or routing policy, it directs the packet to the next network on its journey. The most familiar type of IP routers are home and small office routers that simply forward IP packets between the home computers and the Internet. More sophisticated routers, s ...
	Minimum-variance Unbiased Estimator In statistics, a minimum-variance unbiased estimator (MVUE) or uniformly minimum-variance unbiased estimator (UMVUE) is an unbiased estimator that has lower variance than any other unbiased estimator for all possible values of the parameter. For practical statistics problems, it is important to determine the MVUE if one exists, since less-than-optimal procedures would naturally be avoided, other things being equal. This has led to substantial development of statistical theory related to the problem of optimal estimation. While combining the constraint of unbiasedness with the desirability metric of least variance leads to good results in most practical settings (making MVUE a natural starting point for a broad range of analyses), a targeted specification may perform better for a given problem; thus, MVUE is not always the best stopping point. Definition Consider estimation of g(\theta) based on data X_1, X_2, \ldots, X_n i.i.d. from some member of a family of densities p_\theta ...