Min-wise Independence

	Min-wise Independence In computer science and data mining, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words. Jaccard similarity and minimum hash values: The Jaccard similarity coefficient is a commonly used indicator of the similarity between two sets. Let $U$ be a set and let $A$ and $B$ be subsets of $U$; then the Jaccard index is defined to be the ratio of the number of elements of their intersection to the number of elements of their union: $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$. This value is 0 when the two sets are disjoint, 1 when they are equal, and strictly between 0 and 1 otherwise. Two sets are more similar (i.e. have relatively more members in common) ...
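The Jaccard index described in the entry above can be illustrated with a minimal Python sketch. The function name `jaccard` and the convention that two empty sets have similarity 1 are assumptions made for this example; the excerpt does not specify either.

```python
def jaccard(a: set, b: set) -> float:
    """Return |A ∩ B| / |A ∪ B| for two finite sets."""
    union = a | b
    if not union:
        # Assumption for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


# Two of the four distinct elements are shared.
similarity = jaccard({1, 2, 3}, {2, 3, 4})
```

Computing this exactly requires materializing both sets, which is the cost that a scheme such as MinHash is meant to avoid by estimating the same quantity from short signatures.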
	Computer Science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (including the design and implementation of hardware and software). Computer science is generally considered an area of academic research and distinct from computer programming. Algorithms and data structures are central to computer science. The theory of computation concerns abstract models of computation and general classes of problems that can be solved using them. The fields of cryptography and computer security involve studying the means for secure communication and for preventing security vulnerabilities. Computer graphics and computational geometry address the generation of images. Programming language theory considers different ways to describe computational processes, and database theory concerns the management of repositories ...
	Bias Of An Estimator In statistics, the bias of an estimator (or bias function) is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called ''unbiased''. In statistics, "bias" is a property of an estimator. Bias is a distinct concept from consistency: consistent estimators converge in probability to the true value of the parameter, but may be biased or unbiased; see bias versus consistency for more. All else being equal, an unbiased estimator is preferable to a biased estimator, although in practice, biased estimators (with generally small bias) are frequently used. When a biased estimator is used, bounds on the bias are calculated. A biased estimator may be used for various reasons: because an unbiased estimator does not exist without further assumptions about a population; because an estimator is difficult to compute (as in unbiased estimation of standard deviation); because a biased esti ...
	Association Rule Learning Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness [Piatetsky-Shapiro, Gregory (1991), ''Discovery, analysis, and presentation of strong rules'', in Piatetsky-Shapiro, Gregory; and Frawley, William J.; eds., ''Knowledge Discovery in Databases'', AAAI/MIT Press, Cambridge, MA]. In any given transaction with a variety of items, association rules are meant to discover the rules that determine how or why certain items are connected. Based on the concept of strong rules, Rakesh Agrawal, Tomasz Imieliński and Arun Swami introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule $\{\ldots\} \Rightarrow \{\ldots\}$ found in the sales data of a supermarket would indicate that if a customer buys ...
	Tabulation Hashing In computer science, tabulation hashing is a method for constructing universal families of hash functions by combining table lookup with exclusive or operations. It was first studied in the form of Zobrist hashing for computer games; later work by Carter and Wegman extended this method to arbitrary fixed-length keys. Generalizations of tabulation hashing have also been developed that can handle variable-length keys such as text strings. Despite its simplicity, tabulation hashing has strong theoretical properties that distinguish it from some other hash functions. In particular, it is 3-independent: every 3-tuple of keys is equally likely to be mapped to any 3-tuple of hash values. However, it is not 4-independent. More sophisticated but slower variants of tabulation hashing extend the method to higher degrees of independence. Because of its high degree of independence, tabulation hashing is usable with hashing methods that require a high-quality hash function, including hopsc ...
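The combination of table lookup and exclusive-or operations described in the entry above might look like the following Python sketch. The key width (4 bytes), output width (32 bits), the seeded `random.Random` source, and the name `make_tabulation_hash` are all assumptions chosen for illustration, not details taken from the excerpt.

```python
import random


def make_tabulation_hash(seed: int = 0, key_bytes: int = 4, out_bits: int = 32):
    """Build a simple tabulation hash for fixed-length integer keys."""
    rng = random.Random(seed)
    # One table of 256 random values per byte position of the key.
    tables = [
        [rng.getrandbits(out_bits) for _ in range(256)]
        for _ in range(key_bytes)
    ]

    def h(key: int) -> int:
        if not 0 <= key < 1 << (8 * key_bytes):
            raise ValueError("key does not fit in the assumed key width")
        result = 0
        for i in range(key_bytes):
            byte = (key >> (8 * i)) & 0xFF  # extract the i-th byte
            result ^= tables[i][byte]       # look up and combine with XOR
        return result

    return h


h = make_tabulation_hash(seed=1)
code = h(123456)
```

Each call costs one lookup and one XOR per key byte, which is why the method is fast despite its independence guarantees.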
	K-independent Hashing In computer science, a family of hash functions is said to be ''k''-independent, ''k''-wise independent or ''k''-universal if selecting a function at random from the family guarantees that the hash codes of any designated ''k'' keys are independent random variables (see precise mathematical definitions below). Such families allow good average-case performance in randomized algorithms or data structures, even if the input data is chosen by an adversary. The trade-offs between the degree of independence and the efficiency of evaluating the hash function are well studied, and many ''k''-independent families have been proposed. Background: The goal of hashing is usually to map keys from some large domain (universe) $U$ into a smaller range, such as $m$ bins (labelled $[m] = \{0, \dots, m-1\}$). In the analysis of randomized algorithms and data structures, it is often desirable for the hash codes of various keys to "behave randomly". For instance, if the hash code of each key were an independent random choi ...
	Piotr Indyk Piotr Indyk is Thomas D. and Virginia W. Cabot Professor in the Theory of Computation Group at the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology. Academic biography: Indyk received the Magister (MA) degree from the University of Warsaw in 1995 and a PhD in computer science from Stanford University in 2000 under the supervision of Rajeev Motwani. In 2000, Indyk joined MIT, where he currently holds the title of Thomas D. and Virginia W. Cabot Professor in the Department of Electrical Engineering and Computer Science. Research: Indyk's research focuses primarily on computational geometry in high dimensions, streaming algorithms, and computational learning theory. He has made a range of contributions to these fields, particularly in the study of low-distortion embeddings, algorithmic coding theory, and geometric and combinatorial pattern matching. He has also made contributions to the theory of compressed sensing. His work on algo ...
	Information Processing Letters ''Information Processing Letters'' is a peer-reviewed scientific journal in the field of computer science, published by Elsevier. The aim of the journal is to enable fast dissemination of results in the field of information processing in the form of short papers. Submissions are limited to nine double-spaced pages. Both theoretical and experimental research is covered. The journal was established in 1971 and is published semi-monthly.
	Universal Hashing In mathematics and computing, universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known (for hashing integers, vectors, strings), and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of hash tables, randomized algorithms, and cryptography. Introduction: Assume we want to map keys from some universe $U$ into $m$ bins (labelled $[m] = \{0, \dots, m-1\}$). The algorithm will have to handle some data set $S \subseteq U$ of $|S| = n$ keys, which is not known in advance. Usually, the goal of hashing is to obtain a low number of collisions (keys from $S$ that land in the same bin). A deterministic hash function cannot offer any guarantee in an adversarial sett ...
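As a hedged illustration of selecting a hash function at random from a family, the sketch below uses the well-known multiply-add-mod-prime construction for integer keys, h(x) = ((a·x + b) mod p) mod m. This particular family is not named in the excerpt above; the prime p = 2^61 − 1, the restriction of keys to integers below p, and the name `make_universal_hash` are assumptions of this example.

```python
import random

# Assumed prime modulus p (a Mersenne prime); keys must be integers below p.
MERSENNE_PRIME = (1 << 61) - 1


def make_universal_hash(m: int, rng: random.Random = None):
    """Pick h(x) = ((a*x + b) mod p) mod m at random from the family."""
    rng = rng or random.Random()
    a = rng.randrange(1, MERSENNE_PRIME)  # a is drawn from 1 .. p-1
    b = rng.randrange(0, MERSENNE_PRIME)  # b is drawn from 0 .. p-1

    def h(key: int) -> int:
        if not 0 <= key < MERSENNE_PRIME:
            raise ValueError("key outside the assumed universe")
        return ((a * key + b) % MERSENNE_PRIME) % m

    return h


h = make_universal_hash(m=10, rng=random.Random(0))
bins = [h(x) for x in range(5)]  # every value is a bin index in 0 .. 9
```

Because `a` and `b` are drawn only after the data set is fixed, an adversary who must choose the keys in advance cannot target one specific function.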
	Random Permutation A random permutation is a random ordering of a set of objects, that is, a permutation-valued random variable. The use of random permutations is often fundamental to fields that use randomized algorithms such as coding theory, cryptography, and simulation. A good example of a random permutation is the shuffling of a deck of cards: this is ideally a random permutation of the 52 cards. Generating random permutations: Entry-by-entry brute force method: One method of generating a random permutation of a set of size $n$ uniformly at random (i.e., each of the $n!$ permutations is equally likely to appear) is to generate a sequence by taking a random number between 1 and $n$ sequentially, ensuring that there is no repetition, and interpreting this sequence $(x_1, \ldots, x_n)$ as the permutation $\begin{pmatrix} 1 & 2 & 3 & \cdots & n \\ x_1 & x_2 & x_3 & \cdots & x_n \end{pmatrix}$, shown here in two-line notation. This brute-force method will require occasional retries whenever the r ...
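The entry-by-entry brute-force method described above can be sketched in Python as follows. The function name `brute_force_permutation` and the optional `rng` parameter are assumptions introduced for the example; the retry-on-repeat logic follows the description in the excerpt.

```python
import random


def brute_force_permutation(n: int, rng: random.Random = None) -> list:
    """Draw numbers in 1..n one at a time, retrying on repeats."""
    rng = rng or random.Random()
    seen = set()
    perm = []
    while len(perm) < n:
        x = rng.randint(1, n)  # uniform on 1..n, both ends inclusive
        if x in seen:
            continue           # repeated value: discard it and retry
        seen.add(x)
        perm.append(x)
    return perm                # bottom row (x_1, ..., x_n) of the two-line notation


perm = brute_force_permutation(10, random.Random(0))
```

The retries become more frequent as the sequence fills up, since fewer unused values remain, so this sketch is meant to mirror the description rather than to be an efficient shuffle.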
	Exponential Distribution In probability theory and statistics, the exponential distribution is the probability distribution of the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate. It is a particular case of the gamma distribution. It is the continuous analogue of the geometric distribution, and it has the key property of being memoryless. In addition to being used for the analysis of Poisson point processes it is found in various other contexts. The exponential distribution is not the same as the class of exponential families of distributions. This is a large class of probability distributions that includes the exponential distribution as one of its members, but also includes many other distributions, like the normal, binomial, gamma, and Poisson distributions. Definitions: Probability density function: The probability density function (pdf) of an exponential distribution is : f(x;\lambda) = \begin \l ...
	Inverse Transform Sampling Inverse transform sampling (also known as inversion sampling, the inverse probability integral transform, the inverse transformation method, Smirnov transform, or the golden rule [Aalto University, N. Hyvönen, ''Computational methods in inverse problems'', twelfth lecture, https://noppa.tkk.fi/noppa/kurssi/mat-1.3626/luennot/Mat-1_3626_lecture12.pdf]) is a basic method for pseudo-random number sampling, i.e., for generating sample numbers at random from any probability distribution given its cumulative distribution function. Inverse transformation sampling takes uniform samples of a number $u$ between 0 and 1, interpreted as a probability, and then returns the largest number $x$ from the domain of the distribution $P(X)$ such that $P(-\infty < X < x) \le u$. The method proceeds in three steps: (1) Generate a random number $u$ from the uniform distribution on $[0,1]$, e.g. from $U \sim \mathrm{Unif}[0,1]$. (2) Find the inverse of the desired CDF, e.g. $F_X^{-1}(x)$. (3) Compute $X = F_X^{-1}(u)$. The computed random variable $X$ has distribution $F_X(x)$. Expressed differently, given a continuous uniform variable $U$ in $[0,1]$ and an invertible cumul ...
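The steps of inverse transform sampling can be sketched in Python for one concrete case. The choice of the exponential distribution as the target, with CDF F(x) = 1 − e^(−λx) and inverse F⁻¹(u) = −ln(1 − u)/λ, is an assumption made for this example, as are the names `exponential_inverse_cdf` and `sample_exponential`.

```python
import math
import random


def exponential_inverse_cdf(u: float, rate: float) -> float:
    """Inverse of F(x) = 1 - exp(-rate * x), valid for 0 <= u < 1."""
    return -math.log(1.0 - u) / rate


def sample_exponential(rate: float, rng: random.Random) -> float:
    u = rng.random()  # uniform sample in [0, 1)
    return exponential_inverse_cdf(u, rate)  # map u through the inverse CDF


rng = random.Random(42)
samples = [sample_exponential(2.0, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)  # should be near 1 / rate = 0.5
```

Since `rng.random()` never returns exactly 1, the argument of the logarithm stays strictly positive in this sketch.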
	Linear Time In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by the algorithm, supposing that each elementary operation takes a fixed amount of time to perform. Thus, the amount of time taken and the number of elementary operations performed by the algorithm are taken to be related by a constant factor. Since an algorithm's running time may vary among different inputs of the same size, one commonly considers the worst-case time complexity, which is the maximum amount of time required for inputs of a given size. Less common, and usually specified explicitly, is the average-case complexity, which is the average of the time taken on inputs of a given size (this makes sense because there are only a finite number of possible inputs of a given size). In both cases, the time complexity is generally expresse ...