data analysis Data analysis is the process of inspecting, Data cleansing, cleansing, Data transformation, transforming, and Data modeling, modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Da ...

, cosine similarity is a measure of similarity between two non-zero vectors defined in an

inner product space In mathematics, an inner product space (or, rarely, a Hausdorff pre-Hilbert space) is a real vector space or a complex vector space with an operation called an inner product. The inner product of two vectors in the space is a scalar, ofte ...

. Cosine similarity is the

cosine In mathematics, sine and cosine are trigonometric functions of an angle. The sine and cosine of an acute angle are defined in the context of a right triangle: for the specified angle, its sine is the ratio of the length of the side opposite that ...

of the angle between the vectors; that is, it is the

dot product In mathematics, the dot product or scalar productThe term ''scalar product'' means literally "product with a Scalar (mathematics), scalar as a result". It is also used for other symmetric bilinear forms, for example in a pseudo-Euclidean space. N ...

of the vectors divided by the product of their lengths. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle. The cosine similarity always belongs to the interval

1, +1

For example, two proportional vectors have a cosine similarity of +1, two

orthogonal vectors In mathematics, an inner product space (or, rarely, a Hausdorff space, Hausdorff pre-Hilbert space) is a real vector space or a complex vector space with an operation (mathematics), operation called an inner product. The inner product of two v ...

have a similarity of 0, and two opposite vectors have a similarity of −1. In some contexts, the component values of the vectors cannot be negative, in which case the cosine similarity is bounded in

,1 /math>.

For example, in

information retrieval Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...

and

text mining Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from differe ...

, each word is assigned a different coordinate and a document is represented by the vector of the numbers of occurrences of each word in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be, in terms of their subject matter, and independently of the length of the documents. The technique is also used to measure cohesion within clusters in the field of

data mining Data mining is the process of extracting and finding patterns in massive data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and ...

. One advantage of cosine similarity is its low complexity, especially for sparse vectors: only the non-zero coordinates need to be considered. Other names for cosine similarity include Orchini similarity and Tucker coefficient of congruence; the Otsuka–Ochiai similarity (see below) is cosine similarity applied to

binary data Binary data is data whose unit can take on only two possible states. These are often labelled as 0 and 1 in accordance with the binary numeral system and Boolean algebra. Binary data occurs in many different technical and scientific fields, wh ...

Definition

The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula: :

\mathbf\cdot\mathbf
=\left\, \mathbf\right\, \left\, \mathbf\right\, \cos\theta

Given two ''n''-dimensional vectors of attributes, A and B, the cosine similarity, , is represented using a

and magnitude as :

\text =S_C (A,B):= \cos(\theta) =  = \frac,

where

A_i

and

B_i

are the

i

components Component may refer to: In engineering, science, and technology Generic systems *System components, an entity with discrete structure, such as an assembly or software module, within a system considered at a particular level of analysis * Lumped e ...

of vectors

\mathbf

and

\mathbf

, respectively. The resulting similarity ranges from −1 meaning exactly opposite, to +1 meaning exactly the same, with 0 indicating

orthogonality In mathematics, orthogonality is the generalization of the geometric notion of '' perpendicularity''. Although many authors use the two terms ''perpendicular'' and ''orthogonal'' interchangeably, the term ''perpendicular'' is more specifically ...

or decorrelation, while in-between values indicate intermediate similarity or dissimilarity. For text matching, the attribute vectors ''A'' and ''B'' are usually the term frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing document length during comparison. In the case of

, the cosine similarity of two documents will range from

0 \to 1

, since the term frequencies cannot be negative. This remains true when using TF-IDF weights. The angle between two term frequency vectors cannot be greater than 90°. If the attribute vectors are normalized by subtracting the vector means (e.g.,

A - \bar

), the measure is called the centered cosine similarity and is equivalent to the

Pearson correlation coefficient In statistics, the Pearson correlation coefficient (PCC) is a correlation coefficient that measures linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviatio ...

. For an example of centering, :

\text\, A =_1, A_2 T, \text \bar = \left frac,\frac\right T,

\text A-\bar=  \left frac,\frac\right T.

Cosine distance

When the distance between two unit-length vectors is defined to be the length of their vector difference then

\operatorname(\mathbf A, \mathbf B)
= \sqrt
= \sqrt
= \sqrt\,.

Nonetheless the cosine distance is often defined without the square root or factor of 2: :

\text = D_C(A,B) := 1 - S_C(A,B)\,.

It is important to note that, by virtue of being proportional to squared Euclidean distance, the cosine distance is not a true distance metric; it does not exhibit the

triangle inequality In mathematics, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side. This statement permits the inclusion of Degeneracy (mathematics)#T ...

property — or, more formally, the

Schwarz inequality Schwarz may refer to: * Schwarz, Germany, a municipality in Mecklenburg-Vorpommern, Germany * Schwarz (surname), a surname (and list of people with the surname) * Schwarz (musician), American DJ and producer * Schwarz (Böhse Onkelz album), ''Schw ...

— and it violates the coincidence axiom. To repair the triangle inequality property while maintaining the same ordering, one can convert to

Euclidean distance In mathematics, the Euclidean distance between two points in Euclidean space is the length of the line segment between them. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, and therefore is o ...

\sqrt

or angular distance . Alternatively, the triangular inequality that does work for angular distances can be expressed directly in terms of the cosines; see below.

Angular distance and similarity

The normalized angle, referred to as angular distance, between any two vectors

A

and

B

is a formal distance metric and can be calculated from the cosine similarity. The complement of the angular distance metric can then be used to define angular similarity function bounded between 0 and 1, inclusive. When the vector elements may be positive or negative: :

\text = D_ := \frac = \frac

\text = S_ := 1 - \text = 1 - \frac

Or, if the vector elements are always positive: :

\text = D_ := \frac  = \frac

\text = S_ := 1 - \text = 1 - \frac

Unfortunately, computing the inverse cosine () function is slow, making the use of the angular distance more computationally expensive than using the more common (but not metric) cosine distance above.

L₂-normalized Euclidean distance

Another effective proxy for cosine distance can be obtained by

L_2

normalisation of the vectors, followed by the application of normal

. Using this technique each term in each vector is first divided by the magnitude of the vector, yielding a vector of unit length. Then the Euclidean distance over the end-points of any two vectors is a proper metric which gives the same ordering as the cosine distance (a monotonic transformation of Euclidean distance; see below) for any comparison of vectors, and furthermore avoids the potentially expensive trigonometric operations required to yield a proper metric. Once the normalisation has occurred, the vector space can be used with the full range of techniques available to any Euclidean space, notably standard

dimensionality reduction Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally ...

techniques. This normalised form distance is often used within many

deep learning Deep learning is a subset of machine learning that focuses on utilizing multilayered neural networks to perform tasks such as classification, regression, and representation learning. The field takes inspiration from biological neuroscience a ...

algorithms.

Otsuka–Ochiai coefficient

In biology, there is a similar concept known as the Otsuka–Ochiai coefficient named after Yanosuke Otsuka (also spelled as Ōtsuka, Ootsuka or Otuka, ) and Akira Ochiai (), also known as the Ochiai–Barkman or Ochiai coefficient, which can be represented as: :

K =\frac

Here,

A

and

B

are sets, and

, A,

is the number of elements in

A

. If sets are represented as bit vectors, the Otsuka–Ochiai coefficient can be seen to be the same as the cosine similarity. It is identical to the score introduced by Godfrey Thomson. In a recent book, the coefficient is tentatively misattributed to another Japanese researcher with the family name Otsuka. The confusion arises because in 1957 Akira Ochiai attributes the coefficient only to Otsuka (no first name mentioned) by citing an article by Ikuso Hamai (), who in turn cites the original 1936 article by Yanosuke Otsuka.

Properties

The most noteworthy property of cosine similarity is that it reflects a relative, rather than absolute, comparison of the individual vector dimensions. For any positive constant

a

and vector

V

, the vectors

V

and

aV

are maximally similar. The measure is thus most appropriate for data where frequency is more important than absolute values; notably, term frequency in documents. However more recent metrics with a grounding in information theory, such as Jensen–Shannon, SED, and triangular divergence have been shown to have improved semantics in at least some contexts. Cosine similarity is related to

as follows. Denote Euclidean distance by the usual

\, A - B\,

, and observe that :

\, A - B\, ^2 = (A - B) \cdot (A - B) = \, A\, ^2 + \, B\, ^2 - 2 (A \cdot B)\

( polarization identity) by expansion. When and are normalized to unit length,

\, A\, ^2 = \, B\, ^2 = 1

so this expression is equal to :

2 (1 - \cos(A, B)).

In short, the cosine distance can be expressed in terms of Euclidean distance as :

D_C(A, B) = \frac\quad\mathrm\quad\, A\, ^2 = \, B\, ^2 = 1

. The Euclidean distance is called the ''chord distance'' (because it is the length of the chord on the unit circle) and it is the Euclidean distance between the vectors which were normalized to unit sum of squared values within them. Null distribution: For data which can be negative as well as positive, the null distribution for cosine similarity is the distribution of the

of two independent random

unit vector In mathematics, a unit vector in a normed vector space is a Vector (mathematics and physics), vector (often a vector (geometry), spatial vector) of Norm (mathematics), length 1. A unit vector is often denoted by a lowercase letter with a circumfle ...

s. This distribution has a

mean A mean is a quantity representing the "center" of a collection of numbers and is intermediate to the extreme values of the set of numbers. There are several kinds of means (or "measures of central tendency") in mathematics, especially in statist ...

of zero and a

variance In probability theory and statistics, variance is the expected value of the squared deviation from the mean of a random variable. The standard deviation (SD) is obtained as the square root of the variance. Variance is a measure of dispersion ...

1/n

(where

n

is the number of dimensions), and although the distribution is bounded between −1 and +1, as

n

grows large the distribution is increasingly well-approximated by the

normal distribution In probability theory and statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is f(x) = \frac ...

. Other types of data such as

bitstream A bitstream (or bit stream), also known as binary sequence, is a sequence of bits. A bytestream is a sequence of bytes. Typically, each byte is an 8-bit quantity, and so the term octet stream is sometimes used interchangeably. An octet may ...

s, which only take the values 0 or 1, the null distribution takes a different form and may have a nonzero mean.

Triangle inequality for cosine similarity

The ordinary

for angles (i.e., arc lengths on a unit hypersphere) gives us that :

, ~\angle - \angle~,  \le ~\angle~ \le ~\angle~ + ~\angle~.

Because the cosine function decreases as an angle in radians increases, the sense of these inequalities is reversed when we take the cosine of each value: :

\cos(\angle - \angle) \ge \cos(\angle) \ge \cos(\angle + \angle).

Using the cosine addition and subtraction formulas, these two inequalities can be written in terms of the original cosines, :

\cos(A,C) \cdot \cos(C,B) + \sqrt \geq \cos(A,B),

\cos(A,B) \geq \cos(A,C) \cdot \cos(C,B) - \sqrt.

This form of the triangle inequality can be used to bound the minimum and maximum similarity of two objects A and B if the similarities to a reference object C is already known. This is used for example in metric data indexing, but has also been used to accelerate spherical

k-means clustering ''k''-means clustering is a method of vector quantization, originally from signal processing, that aims to partition of a set, partition ''n'' observations into ''k'' clusters in which each observation belongs to the cluster (statistics), cluste ...

the same way the Euclidean triangle inequality has been used to accelerate regular k-means.

Soft cosine measure

A soft cosine or ("soft" similarity) between two vectors considers similarities between pairs of features. The traditional cosine similarity considers the

vector space model Vector space model or term vector model is an algebraic model for representing text documents (or more generally, items) as vector space, vectors such that the distance between vectors represents the relevance between the documents. It is used in i ...

(VSM) features as independent or completely different, while the soft cosine measure proposes considering the similarity of features in VSM, which help generalize the concept of cosine (and soft cosine) as well as the idea of (soft) similarity. For example, in the field of

natural language processing Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...

(NLP) the similarity among features is quite intuitive. Features such as words, ''n''-grams, or syntactic ''n''-grams can be quite similar, though formally they are considered as different features in the VSM. For example, words "play" and "game" are different words and thus mapped to different points in VSM; yet they are semantically related. In case of ''n''-grams or syntactic ''n''-grams, Levenshtein distance can be applied (in fact, Levenshtein distance can be applied to words as well). For calculating soft cosine, the matrix is used to indicate similarity between features. It can be calculated through Levenshtein distance,

WordNet WordNet is a lexical database of semantic relations between words that links words into semantic relations including synonyms, hyponyms, and meronyms. The synonyms are grouped into ''synsets'' with short definitions and usage examples. It can thu ...

similarity, or other

similarity measure In statistics and related fields, a similarity measure or similarity function or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity exists, usually such mea ...

s. Then we just multiply by this matrix. Given two -dimension vectors

a

and

b

, the soft cosine similarity is calculated as follows: :

\begin
    \operatorname_1(a,b)=
    \frac,
\end

where . If there is no similarity between features (, for ), the given equation is equivalent to the conventional cosine similarity formula. The

time complexity In theoretical computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations ...

of this measure is quadratic, which makes it applicable to real-world tasks. Note that the complexity can be reduced to subquadratic. An efficient implementation of such soft cosine similarity is included in the Gensim open source library.

References

External links

Weighted cosine measure

A tutorial on cosine similarity using Python
{{DEFAULTSORT:Cosine Similarity Information retrieval techniques Similarity measures Data analysis