Dasgupta's Objective
In the study of hierarchical clustering, Dasgupta's objective is a measure of the quality of a clustering, defined from a similarity measure on the elements to be clustered. It is named after Sanjoy Dasgupta, who formulated it in 2016. Its key property is that, when the similarity comes from an ultrametric space, the optimal clustering for this quality measure follows the underlying structure of the ultrametric space. In this sense, clustering methods that produce good clusterings for this objective can be expected to approximate the ground truth underlying the given similarity measure. In Dasgupta's formulation, the input to a clustering problem consists of similarity scores between certain pairs of elements, represented as an undirected graph G=(V,E), with the elements as its vertices and with non-negative real weights on its edges. Large weights indicate elements that should be considered more similar to each other, while small weights or missing edges indicate pairs of elements t ...
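For concreteness, here is a minimal Python sketch of the cost function, assuming the standard formulation from Dasgupta's paper: each weighted edge (i, j) pays its weight times the number of leaves under the least common ancestor of i and j in the clustering tree, so a good tree separates dissimilar elements near the root. The tuple-based tree encoding and the toy graph are illustrative choices, not part of the original text.

def leaves(tree):
    """Return the set of leaf labels under a tree node."""
    if not isinstance(tree, tuple):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])

def dasgupta_cost(tree, weighted_edges):
    """Sum w(i, j) * |leaves(lca(i, j))| over all weighted edges."""
    if not isinstance(tree, tuple):
        return 0.0  # a single leaf contains no edges
    left, right = map(leaves, tree)
    n = len(left) + len(right)
    cost = 0.0
    for i, j, w in weighted_edges:
        # Edges split between the two children are charged at this node,
        # which is exactly the least common ancestor of their endpoints.
        if (i in left and j in right) or (j in left and i in right):
            cost += w * n
    # Edges contained in one child are charged deeper in the tree.
    return cost + dasgupta_cost(tree[0], weighted_edges) + dasgupta_cost(tree[1], weighted_edges)

# Example: a path a - b - c with similarities 2 and 1. The tree that
# merges the more similar pair {a, b} first scores lower (better).
edges = [("a", "b", 2.0), ("b", "c", 1.0)]
print(dasgupta_cost((("a", "b"), "c"), edges))  # 2*2 + 1*3 = 7.0
print(dasgupta_cost(("a", ("b", "c")), edges))  # 2*3 + 1*2 = 8.0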



Hierarchical Clustering
In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two categories: * Agglomerative: This is a "bottom-up" approach: Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. * Divisive: This is a "top-down" approach: All observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram. The standard algorithm for hierarchical agglomerative clustering (HAC) has a time complexity of \mathcal{O}(n^3) and requires \Omega(n^2) memory, which makes it too slow for even medium-sized data sets. However, for some special cases, optimal efficient agglomerative methods (of comp ...
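As a hedged illustration of the bottom-up strategy, the sketch below uses SciPy's hierarchy module: linkage() greedily merges the closest pair of clusters at each step (dendrogram() can render the resulting tree). The sample points and the choice of average linkage are assumptions made for the example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0], [0.1, 0.2], [4.0, 4.1], [4.2, 3.9]])
Z = linkage(points, method="average")  # naive HAC: O(n^3) time, Omega(n^2) memory
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 flat clusters
print(labels)  # e.g. [1 1 2 2]: the two nearby pairs merge first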


Similarity Measure
In statistics and related fields, a similarity measure or similarity function or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of similarity exists, such measures are usually in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects. In broader terms, though, a similarity function may also satisfy metric axioms. Cosine similarity is a commonly used similarity measure for real-valued vectors, used in (among other fields) information retrieval to score the similarity of documents in the vector space model. In machine learning, common kernel functions such as the RBF kernel can be viewed as similarity functions. Use in clustering In spectral clustering, a similarity, or affinity, measure is used to transform data to overcome difficulties related to lack of convexity in the shape of the data dist ...
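A minimal sketch of cosine similarity, the measure named above: the dot product of two vectors divided by the product of their norms, large for similar directions and near zero for nearly orthogonal ones. The example vectors standing in for document term counts are illustrative.

import numpy as np

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

doc_a = np.array([3.0, 1.0, 0.0])  # e.g. term counts for a document
doc_b = np.array([2.0, 1.5, 0.5])
print(cosine_similarity(doc_a, doc_b))  # close to 1: similar documents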



Ultrametric Space
In mathematics, an ultrametric space is a metric space in which the triangle inequality is strengthened to d(x,z)\leq\max\left\{d(x,y),d(y,z)\right\}. Sometimes the associated metric is also called a non-Archimedean metric or super-metric. Although some of the theorems for ultrametric spaces may seem strange at first glance, they appear naturally in many applications. Formal definition An ultrametric on a set M is a real-valued function d\colon M \times M \rightarrow \mathbb{R} (where \mathbb{R} denotes the real numbers), such that for all x,y,z\in M: # d(x,y)\geq 0; # d(x,y)=d(y,x) (''symmetry''); # d(x,x)=0; # if d(x,y)=0 then x=y; # d(x,z)\leq\max\left\{d(x,y),d(y,z)\right\} (strong triangle inequality or ultrametric inequality). An ultrametric space is a pair (M,d) consisting of a set M together with an ultrametric d on M, which is called the space's associated distance function (also called a metric). If d satisfies all of the conditions except possibly condition 4 then d is called an ultrapseudometric on M. An ultrapseudometric space is a pair (M,d) consisting of a set M and an ultrapseudometric d on M. In the ...
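A minimal sketch that checks the strong triangle inequality on a finite distance matrix; the example matrix, chosen to look like cophenetic (dendrogram) distances, is an assumption for illustration.

from itertools import permutations

def is_ultrametric(d):
    """True if the square distance matrix d satisfies d(x,z) <= max(d(x,y), d(y,z))."""
    n = len(d)
    return all(d[x][z] <= max(d[x][y], d[y][z])
               for x, y, z in permutations(range(n), 3))

# Distances read off a dendrogram (cophenetic distances) are ultrametric:
d = [[0, 2, 5],
     [2, 0, 5],
     [5, 5, 0]]
print(is_ultrametric(d))  # True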



Ground Truth
Ground truth is information that is known to be real or true, provided by direct observation and measurement (i.e. empirical evidence) as opposed to information provided by inference. Etymology The ''Oxford English Dictionary'' (s.v. "ground truth") records the use of the word "Groundtruth" in the sense of a "fundamental truth" from Henry Ellison's poem "The Siberian Exile's Tale", published in 1833. Statistics and machine learning "Ground truth" may be seen as a conceptual term relative to the knowledge of the truth concerning a specific question. It is the ideal expected result. This is used in statistical models to prove or disprove research hypotheses. The term "ground truthing" refers to the process of gathering the proper objective (provable) data for this test. Compare with gold standard. For example, suppose we are testing a stereo vision system to see how well it can estimate 3D positions. The "ground truth" might be the positions given by a laser rangefinder whi ...



Undirected Graph
In discrete mathematics, and more specifically in graph theory, a graph is a structure amounting to a set of objects in which some pairs of the objects are in some sense "related". The objects correspond to mathematical abstractions called '' vertices'' (also called ''nodes'' or ''points'') and each of the related pairs of vertices is called an ''edge'' (also called ''link'' or ''line''). Typically, a graph is depicted in diagrammatic form as a set of dots or circles for the vertices, joined by lines or curves for the edges. Graphs are one of the objects of study in discrete mathematics. The edges may be directed or undirected. For example, if the vertices represent people at a party, and there is an edge between two people if they shake hands, then this graph is undirected because any person ''A'' can shake hands with a person ''B'' only if ''B'' also shakes hands with ''A''. In contrast, if an edge from a person ''A'' to a person ''B'' means that ''A'' owes money to ''B'', th ...
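As a small illustration of the handshake example, here is a sketch of an undirected graph stored as adjacency sets, where adding an edge records it in both directions; the vertex labels are illustrative.

from collections import defaultdict

graph = defaultdict(set)

def add_edge(u, v):
    """Insert the undirected edge {u, v} in both adjacency sets."""
    graph[u].add(v)
    graph[v].add(u)

add_edge("A", "B")
add_edge("B", "C")
print(graph["B"])         # {'A', 'C'} (set order may vary)
print("A" in graph["B"])  # True: the relation is symmetric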



NP-hard
In computational complexity theory, NP-hardness (non-deterministic polynomial-time hardness) is the defining property of a class of problems that are informally "at least as hard as the hardest problems in NP". A simple example of an NP-hard problem is the subset sum problem. A more precise specification is: a problem ''H'' is NP-hard when every problem ''L'' in NP can be reduced in polynomial time to ''H''; that is, assuming a solution for ''H'' takes 1 unit time, a solution for ''H'' can be used to solve ''L'' in polynomial time. As a consequence, finding a polynomial-time algorithm to solve any NP-hard problem would give polynomial-time algorithms for all the problems in NP. As it is suspected that P≠NP, it is unlikely that such an algorithm exists. It is suspected that there are no polynomial-time algorithms for NP-hard problems, but that has not been proven. Moreover, the class P, in which all problems can be solved in polynomial time, is contained in the NP class. Def ...



Polynomial Time
In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by the algorithm, supposing that each elementary operation takes a fixed amount of time to perform. Thus, the amount of time taken and the number of elementary operations performed by the algorithm are taken to be related by a constant factor. Since an algorithm's running time may vary among different inputs of the same size, one commonly considers the worst-case time complexity, which is the maximum amount of time required for inputs of a given size. Less common, and usually specified explicitly, is the average-case complexity, which is the average of the time taken on inputs of a given size (this makes sense because there are only a finite number of possible inputs of a given size). In both cases, the time complexity is generally express ...
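As a rough illustration of estimating complexity by counting elementary operations, the sketch below counts comparisons in a naive duplicate scan; the choice of algorithm is an assumption for the example.

def count_comparisons(items):
    """Return the number of pairwise comparisons the naive duplicate scan makes."""
    comparisons = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            comparisons += 1  # one elementary operation per pair examined
            if items[i] == items[j]:
                return comparisons  # early exit on the first duplicate
    return comparisons

# Worst case (no duplicate): n*(n-1)/2 comparisons, i.e. Theta(n^2).
print(count_comparisons(list(range(6))))  # 15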


Approximation Algorithm
In computer science and operations research, approximation algorithms are efficient algorithms that find approximate solutions to optimization problems (in particular NP-hard problems) with provable guarantees on the distance of the returned solution to the optimal one. Approximation algorithms naturally arise in the field of theoretical computer science as a consequence of the widely believed P ≠ NP conjecture. Under this conjecture, a wide class of optimization problems cannot be solved exactly in polynomial time. The field of approximation algorithms, therefore, tries to understand how closely it is possible to approximate optimal solutions to such problems in polynomial time. In the overwhelming majority of cases, the guarantee of such algorithms is a multiplicative one expressed as an approximation ratio or approximation factor, i.e., the optimal solution is always guaranteed to be within a (predetermined) multiplicative factor of the returned solution. However, there are ...
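As one classic example (not mentioned in the excerpt above), here is a sketch of the matching-based 2-approximation for minimum vertex cover: it runs in polynomial time, and the returned cover is provably at most twice the optimal size.

def vertex_cover_2approx(edges):
    """Take both endpoints of a greedily built maximal matching: a 2-approximation."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))  # both endpoints of a matched edge
    return cover

edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
print(vertex_cover_2approx(edges))  # e.g. {1, 2, 3, 4}; the optimum {1, 4} has size 2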



Sparsest Cut Problem
In graph theory, a cut is a partition of the vertices of a graph into two disjoint subsets. Any cut determines a cut-set, the set of edges that have one endpoint in each subset of the partition. These edges are said to cross the cut. In a connected graph, each cut-set determines a unique cut, and in some cases cuts are identified with their cut-sets rather than with their vertex partitions. In a flow network, an s–t cut is a cut that requires the ''source'' and the ''sink'' to be in different subsets, and its ''cut-set'' only consists of edges going from the source's side to the sink's side. The ''capacity'' of an s–t cut is defined as the sum of the capacity of each edge in the ''cut-set''. Definition A cut C = (S,T) is a partition of V of a graph G = (V,E) into two subsets S and T. The cut-set of a cut C = (S,T) is the set of edges that have one endpoint in S and the other endpoint in T. If s and t are specified vertices of the graph G, then an s–t cut is a cut in which s belongs to the set S and t belongs to the set T ...
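Since this entry is linked from the sparsest cut problem, here is a hedged sketch of the uniform sparsest-cut objective, cut weight divided by |S| * |V - S|, evaluated by brute force on a toy graph; exhaustive enumeration is for illustration only, as the problem is NP-hard in general.

from itertools import combinations

def sparsity(vertices, edges, S):
    """Cut weight across the bipartition (S, V - S) divided by |S| * |V - S|."""
    S = set(S)
    T = set(vertices) - S
    crossing = sum(w for u, v, w in edges if (u in S) != (v in S))
    return crossing / (len(S) * len(T))

# A 4-cycle: 1 - 2 - 3 - 4 - 1, all edge weights 1.
vertices = [1, 2, 3, 4]
edges = [(1, 2, 1.0), (2, 3, 1.0), (3, 4, 1.0), (4, 1, 1.0)]
best = min((frozenset(S) for k in range(1, len(vertices))
            for S in combinations(vertices, k)),
           key=lambda S: sparsity(vertices, edges, S))
print(best, sparsity(vertices, edges, best))  # {1, 2}: 2 crossing edges / (2*2) = 0.5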


Approximation Ratio
An approximation is anything that is intentionally similar but not exactly equal to something else. Etymology and usage The word ''approximation'' is derived from Latin ''approximatus'', from ''proximus'' meaning ''very near'' and the prefix ''ad-'' (''ad-'' before ''p'' becomes ap- by assimilation) meaning ''to''. Words like ''approximate'', ''approximately'' and ''approximation'' are used especially in technical or scientific contexts. In everyday English, words such as ''roughly'' or ''around'' are used with a similar meaning. It is often found abbreviated as ''approx.'' The term can be applied to various properties (e.g., value, quantity, image, description) that are nearly, but not exactly correct; similar, but not exactly the same (e.g., the approximate time was 10 o'clock). Although approximation is most often applied to numbers, it is also frequently applied to such things as mathematical functions, shapes, and physical laws. In science, approximation can refer to us ...




Journal Of The ACM
The ''Journal of the ACM'' is a peer-reviewed scientific journal covering computer science in general, especially theoretical aspects. It is an official journal of the Association for Computing Machinery. Its current editor-in-chief is Venkatesan Guruswami. The journal was established in 1954 and "computer scientists universally hold the ''Journal of the ACM'' in high esteem". See also * ''Communications of the ACM''