GloVe (machine learning)

GloVe, coined from ''Global Vectors'', is a model for distributed word representation. The model is an unsupervised learning algorithm for obtaining vector representations of words. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. As a log-bilinear regression model for unsupervised learning of word representations, it combines the features of two model families, namely global matrix factorization and local context window methods. It was developed as an open-source project at Stanford University and launched in 2014. It was designed as a competitor to word2vec, and the original paper noted multiple improvements of GloVe over word2vec. Both approaches are now considered outdated, and Transformer-based models, such as BERT, which add multiple neural-network attention layers on top of a word embedding model similar to word2vec, have come to be regarded as the state of the art in NLP.


Definition

You shall know a word by the company it keeps (Firth, J. R. 1957:11)
The idea of GloVe is to construct, for each word i, two vectors w_i, \tilde w_i, such that the relative positions of the vectors capture part of the statistical regularities of the word i. The statistical regularities are defined in terms of co-occurrence probabilities: words that resemble each other in meaning should also resemble each other in their co-occurrence probabilities.


Word counting

Let the vocabulary be V, the set of all possible words (aka "tokens"). Punctuation is either ignored, or treated as vocabulary, and similarly for capitalization and other typographical details. If two words occur close to each other, then we say that they occur in the context of each other. For example, if the context length is 3, then we say that in the following sentence
GloVe(1), coined(2) from(3) Global(4) Vectors(5), is(6) a(7) model(8) for(9) distributed(10) word(11) representation(12)
the word "model8" is in the context of "word11" but not the context of "representation12". A word is not in the context of itself, so "model8" is not in the context of the word "model8", although, if a word appears again in the same context, then it does count. Let X_ be the number of times that the word j appears in the context of the word i over the entire corpus. For example, if the corpus is just "I don't think that that is a problem." we have X_ = 2 since the first "that" appears in the second one's context, and vice versa. Let X_i = \sum_ X_ be the number of words in the context of all instances of word i. By counting, we haveX_i = 2 \times(\text) \times \# (\texti)(except for words occurring right at the start and end of the corpus)


Probabilistic modelling

Let P_{ik} := P(k \mid i) := \frac{X_{ik}}{X_i} be the co-occurrence probability. That is, if one samples a random occurrence of the word i in the entire document, and a random word within its context, that word is k with probability P_{ik}. Note that P_{ik} \neq P_{ki} in general. For example, in a typical modern English corpus, P_{\text{ado},\text{much}} is close to one, but P_{\text{much},\text{ado}} is close to zero. This is because the word "ado" is almost only used in the context of the archaic phrase "much ado about", but the word "much" occurs in all kinds of contexts. For example, in a 6 billion token corpus, we have the following co-occurrence probabilities and ratios:

                             k = solid     k = gas       k = water     k = fashion
P(k | ice)                   1.9 x 10^-4   6.6 x 10^-5   3.0 x 10^-3   1.7 x 10^-5
P(k | steam)                 2.2 x 10^-5   7.8 x 10^-4   2.2 x 10^-3   1.8 x 10^-5
P(k | ice) / P(k | steam)    8.9           8.5 x 10^-2   1.36          0.96

Inspecting the table, we see that the words "ice" and "steam" are indistinguishable along the "water" dimension (often co-occurring with both) and the "fashion" dimension (rarely co-occurring with either), but distinguishable along the "solid" dimension (co-occurring more with "ice") and the "gas" dimension (co-occurring more with "steam"). The idea is to learn two vectors w_i, \tilde w_i for each word i, such that we have a multinomial logistic regression:

w_i^T \tilde w_j + b_i + \tilde b_j \approx \ln P_{ij}

where the terms b_i, \tilde b_j are unimportant bias parameters. This means that if the words i, j have similar co-occurrence probabilities, (P_{ik})_{k \in V} \approx (P_{jk})_{k \in V}, then their vectors should also be similar: w_i \approx w_j.
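Continuing the hypothetical sketch above, the co-occurrence probability can be computed directly from the counts, and the model's target relationship can then be stated in code-like form (names are illustrative, not part of any released implementation):

def cooccurrence_probability(X, i, k):
    # P_ik = X_ik / X_i: probability that a word sampled from the contexts of
    # word i is the word k.
    X_i = sum(X[i].values())
    return X[i][k] / X_i if X_i else 0.0

# The model seeks vectors w[i], w_tilde[j] and biases b[i], b_tilde[j] so that
#     w[i] @ w_tilde[j] + b[i] + b_tilde[j]  is approximately  log(P_ij),
# which makes words with similar co-occurrence profiles receive similar vectors.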


Logistic regression

Naively, logistic regression can be run by minimizing the squared loss:

L = \sum_{i, j \in V} (w_i^T \tilde w_j + b_i + \tilde b_j - \ln P_{ij})^2

However, this would be noisy for rare co-occurrences. To fix the issue, the squared loss is weighted so that the loss is slowly ramped up as the absolute number of co-occurrences X_{ij} increases:

L = \sum_{i, j \in V} f(X_{ij}) (w_i^T \tilde w_j + b_i + \tilde b_j - \ln P_{ij})^2

where

f(x) = \begin{cases} (x / x_{\max})^\alpha & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}

and x_{\max}, \alpha are hyperparameters. In the original paper, the authors found that x_{\max} = 100 and \alpha = 3/4 seem to work well in practice.
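A minimal NumPy sketch of this weighted objective is shown below; the names are assumptions, and the released implementation instead fits \ln X_{ij} directly (absorbing \ln X_i into the bias b_i) rather than evaluating the loss in this form:

import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # f(X_ij): ramps up for rare pairs and saturates at 1 for frequent ones.
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(W, W_tilde, b, b_tilde, X):
    # W, W_tilde: (V, d) arrays of word and context vectors; b, b_tilde: (V,) biases.
    # X: dict mapping integer word-id pairs (i, j) to co-occurrence counts.
    X_row = {}
    for (i, j), x in X.items():
        X_row[i] = X_row.get(i, 0.0) + x
    loss = 0.0
    for (i, j), x in X.items():
        log_p = np.log(x / X_row[i])  # ln P_ij as in the formula above
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - log_p
        loss += glove_weight(x) * diff ** 2
    return loss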


Use

Once the model is trained, we have four trained quantities for each word: the vectors w_i, \tilde w_i and the biases b_i, \tilde b_i. The biases are discarded; only the vectors are used. The authors recommended using w_i + \tilde w_i as the final representation vector for word i, because empirically it worked better than w_i or \tilde w_i alone.
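As an illustration, summing the two vector tables and querying them by cosine similarity might look as follows (array and function names are assumptions, not part of the released code):

import numpy as np

def final_vectors(W, W_tilde):
    # Recommended representation: the sum of the word and context vectors.
    return W + W_tilde

def nearest_neighbors(vectors, query_id, top_k=5):
    # Rank all words by cosine similarity to the query word's vector.
    v = vectors[query_id]
    sims = vectors @ v / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(v) + 1e-9)
    order = np.argsort(-sims)
    return [int(w) for w in order if w != query_id][:top_k]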


Applications

GloVe can be used to find relations between words like synonyms, company-product relations, zip codes and cities, etc. However, the unsupervised learning algorithm is not effective in identifying homographs, i.e., words with the same spelling and different meanings. This is because the algorithm computes a single set of vectors for words with the same morphological structure. The algorithm is also used by the spaCy library to build semantic word embedding features, while computing the words that best match under distance measures such as cosine similarity and Euclidean distance. GloVe was also used as the word representation framework for the online and offline systems designed to detect psychological distress in patient interviews.
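As a sketch of such similarity queries, the following snippet loads vectors from the whitespace-separated text format in which pretrained GloVe vectors are commonly distributed and answers an analogy by cosine similarity (the file name is a hypothetical local path, and this is an illustration rather than the spaCy pipeline mentioned above):

import numpy as np

def load_glove(path):
    # Each line: a token followed by its vector components, space-separated.
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vecs

def analogy(vecs, a, b, c, top_k=1):
    # Solve "a is to b as c is to ?" by vector arithmetic plus cosine similarity.
    target = vecs[b] - vecs[a] + vecs[c]
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))
    scored = [(w, cos(v, target)) for w, v in vecs.items() if w not in (a, b, c)]
    return sorted(scored, key=lambda t: -t[1])[:top_k]

# vecs = load_glove("glove.6B.50d.txt")          # hypothetical local file
# print(analogy(vecs, "man", "king", "woman"))   # often yields "queen"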


See also

* ELMo
* BERT
* Word2vec
* fastText
* Natural language processing


References


External links


GloVe

