In natural language processing (NLP), word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using a set of language modeling and feature learning techniques in which words or phrases from the vocabulary are mapped to vectors of real numbers. Methods to generate this mapping include neural networks, dimensionality reduction on the word co-occurrence matrix, probabilistic models, explainable knowledge base methods, and explicit representation in terms of the context in which words appear. Word and phrase embeddings, when used as the underlying input representation, have been shown to boost performance in NLP tasks such as syntactic parsing and sentiment analysis.
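
As a concrete illustration of the "nearby vectors mean related words" property, the following minimal sketch queries pretrained GloVe vectors through the gensim library; the model name "glove-wiki-gigaword-50" is one of gensim's downloadable vector sets, and any pretrained word vectors would serve equally well:

    import gensim.downloader as api

    # Load pretrained 50-dimensional GloVe vectors (downloaded on first use).
    vectors = api.load("glove-wiki-gigaword-50")

    # Cosine similarity: semantically related words score higher.
    print(vectors.similarity("king", "queen"))   # relatively high
    print(vectors.similarity("king", "banana"))  # relatively low

    # Nearest neighbours of a word in the vector space.
    print(vectors.most_similar("king", topn=3))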


Development and history of the approach

In distributional semantics, a quantitative methodological approach to understanding meaning in observed language, word embeddings or semantic vector space models have been used as a knowledge representation for some time. Such models aim to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The underlying idea that "a word is characterized by the company it keeps" was proposed in a 1957 article by John Rupert Firth, but also has roots in the contemporaneous work on search systems and in cognitive psychology.

The notion of a semantic space with lexical items (words or multi-word terms) represented as vectors or embeddings is based on the computational challenges of capturing distributional characteristics and using them for practical application to measure similarity between words, phrases, or entire documents. The first generation of semantic space models is the vector space model for information retrieval. Such vector space models for words and their distributional data, implemented in their simplest form, result in a very sparse vector space of high dimensionality (cf. curse of dimensionality). Reducing the number of dimensions using linear algebraic methods such as singular value decomposition then led to the introduction of latent semantic analysis in the late 1980s and the random indexing approach for collecting word co-occurrence contexts. In 2000, Bengio et al. provided, in a series of papers on "neural probabilistic language models", a way to reduce the high dimensionality of word representations in contexts by "learning a distributed representation for words".

A study published in NeurIPS (NIPS) 2002 introduced the use of both word and document embeddings applying the method of kernel CCA to bilingual (and multilingual) corpora, also providing an early example of self-supervised learning of word embeddings.

Word embeddings come in two different styles, one in which words are expressed as vectors of co-occurring words, and another in which words are expressed as vectors of the linguistic contexts in which the words occur; these different styles are studied in (Lavelli et al., 2004). Roweis and Saul published in ''Science'' how to use "locally linear embedding" (LLE) to discover representations of high-dimensional data structures. Most new word embedding techniques after about 2005 rely on a neural network architecture instead of more probabilistic and algebraic models, following foundational work by Yoshua Bengio and colleagues.

The approach has been adopted by many research groups after theoretical advances around 2010 on the quality of vectors and the training speed of the model, together with hardware advances that allowed a broader parameter space to be explored profitably. In 2013, a team at Google led by Tomas Mikolov created word2vec, a word embedding toolkit that can train vector space models faster than previous approaches. The word2vec approach has been widely used in experimentation and was instrumental in raising interest in word embeddings as a technology, moving the research strand out of specialised research into broader experimentation and eventually paving the way for practical application.
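
To make the dimensionality-reduction step concrete, here is a minimal LSA-style sketch: it factorizes a toy word co-occurrence matrix with a truncated singular value decomposition and keeps the leading components as dense word vectors. The vocabulary and counts are invented for illustration:

    import numpy as np

    # Toy word-word co-occurrence counts (symmetric; rows and columns
    # are both indexed by the vocabulary).
    vocab = ["king", "queen", "man", "woman", "apple"]
    C = np.array([
        [0, 8, 5, 2, 0],
        [8, 0, 2, 6, 0],
        [5, 2, 0, 7, 1],
        [2, 6, 7, 0, 1],
        [0, 0, 1, 1, 0],
    ], dtype=float)

    # Truncated SVD: keep the top-k singular directions as dense embeddings,
    # turning a sparse high-dimensional space into a low-dimensional one.
    U, s, _ = np.linalg.svd(C, full_matrices=False)
    k = 2
    embeddings = U[:, :k] * s[:k]  # one k-dimensional vector per word

    # Words with similar co-occurrence profiles end up close together.
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(embeddings[vocab.index("king")],
                 embeddings[vocab.index("queen")]))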


Polysemy and homonymy

Historically, one of the main limitations of static word embeddings or word vector space models is that words with multiple meanings are conflated into a single representation (a single vector in the semantic space). In other words, polysemy and homonymy are not handled properly. For example, in the sentence "The club I tried yesterday was great!", it is not clear if the term ''club'' is related to the word sense of a ''club sandwich'', ''baseball club'', ''clubhouse'', ''golf club'', or any other sense that ''club'' might have. The necessity of accommodating multiple meanings per word in different vectors (multi-sense embeddings) is the motivation for several contributions in NLP to split single-sense embeddings into multi-sense ones.

Most approaches that produce multi-sense embeddings can be divided into two main categories for their word sense representation, i.e., unsupervised and knowledge-based. Based on word2vec skip-gram, Multi-Sense Skip-Gram (MSSG) performs word-sense discrimination and embedding simultaneously, improving its training time, while assuming a specific number of senses for each word. In the Non-Parametric Multi-Sense Skip-Gram (NP-MSSG) this number can vary depending on each word. Combining the prior knowledge of lexical databases (e.g., WordNet, ConceptNet, BabelNet), word embeddings and word sense disambiguation, Most Suitable Sense Annotation (MSSA) labels word senses through an unsupervised and knowledge-based approach, considering a word's context in a pre-defined sliding window. Once the words are disambiguated, they can be used in a standard word embedding technique, so multi-sense embeddings are produced. The MSSA architecture allows the disambiguation and annotation process to be performed recurrently in a self-improving manner.

The use of multi-sense embeddings is known to improve performance in several NLP tasks, such as part-of-speech tagging, semantic relation identification, semantic relatedness, named entity recognition and sentiment analysis.

As of the late 2010s, contextually-meaningful embeddings such as ELMo and BERT have been developed. Unlike static word embeddings, these embeddings are at the token level, in that each occurrence of a word has its own embedding. These embeddings better reflect the multi-sense nature of words, because occurrences of a word in similar contexts are situated in similar regions of BERT's embedding space.
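
This token-level behaviour can be observed directly with a pretrained model. The sketch below assumes the Hugging Face transformers library and the "bert-base-uncased" checkpoint, and extracts the contextual vector of "club" from three invented sentences:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def club_vector(sentence):
        # Return BERT's contextual vector for the token "club".
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, 768)
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
        return hidden[tokens.index("club")]

    a = club_vector("He joined the chess club at school.")
    b = club_vector("She runs the debate club on Fridays.")
    c = club_vector("He swung the golf club and missed the ball.")

    # Same-sense occurrences (organization) typically score closer to each
    # other than to the golf-equipment sense.
    print(torch.cosine_similarity(a, b, dim=0))
    print(torch.cosine_similarity(a, c, dim=0))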


For biological sequences: BioVectors

Word embeddings for ''n''-grams in biological sequences (e.g. DNA, RNA, and proteins) for bioinformatics applications have been proposed by Asgari and Mofrad. Named bio-vectors (BioVec) to refer to biological sequences in general, with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. The results presented by Asgari and Mofrad suggest that BioVectors can characterize biological sequences in terms of biochemical and biophysical interpretations of the underlying patterns.
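
As a rough sketch of the underlying idea (not Asgari and Mofrad's exact training setup), a protein sequence can be split into 3-mers that play the role of words, after which a standard embedding method such as gensim's Word2Vec applies; the sequences below are invented:

    from gensim.models import Word2Vec

    def to_kmers(sequence, k=3):
        # Split a biological sequence into overlapping k-mer "words".
        return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

    # Invented amino-acid sequences; real ProtVec training uses large
    # protein databases with millions of sequences.
    sequences = ["MKTAYIAKQR", "MKTAYLAKQR", "GAVLIPFYWS"]
    corpus = [to_kmers(seq) for seq in sequences]

    # Train skip-gram embeddings over the 3-mer vocabulary.
    model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=1)

    # Each 3-mer now has a dense vector; a whole-sequence vector can be
    # built by summing or averaging its constituent 3-mer vectors.
    print(model.wv["MKT"][:5])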


Game design

Word embeddings with applications in game design have been proposed by Rabii and Cook as a way to discover emergent gameplay using logs of gameplay data. The process requires transcribing actions that happen during the game within a formal language and then using the resulting text to create word embeddings. The results presented by Rabii and Cook suggest that the resulting vectors can capture expert knowledge about games like chess that is not explicitly stated in the game's rules.


Sentence embeddings

The idea has been extended to embeddings of entire sentences or even documents, e.g. in the form of the thought vectors concept. In 2015, some researchers suggested "skip-thought vectors" as a means to improve the quality of machine translation. A more recent and popular approach for representing sentences is Sentence-BERT, or SentenceTransformers, which modifies pre-trained BERT with the use of siamese and triplet network structures.
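
A minimal usage sketch, assuming the sentence-transformers library and its small pretrained "all-MiniLM-L6-v2" model:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    sentences = [
        "The cat sat on the mat.",
        "A feline rested on the rug.",
        "Stock prices fell sharply on Monday.",
    ]
    embeddings = model.encode(sentences)

    # Paraphrases land close together in the sentence embedding space.
    print(util.cos_sim(embeddings[0], embeddings[1]))  # high
    print(util.cos_sim(embeddings[0], embeddings[2]))  # low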


Software

Software for training and using word embeddings includes Tomas Mikolov's Word2vec, Stanford University's GloVe, GN-GloVe, Flair embeddings, AllenNLP's ELMo, BERT, fastText, Gensim, Indra and Deeplearning4j. Principal component analysis (PCA) and t-distributed stochastic neighbour embedding (t-SNE) are both used to reduce the dimensionality of word vector spaces and visualize word embeddings and clusters.
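
A brief sketch of both projections using scikit-learn; the embedding matrix here is random stand-in data, where a real use would load trained word vectors:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    # Stand-in for a real embedding matrix: 200 words, 100 dimensions.
    rng = np.random.default_rng(0)
    vectors = rng.normal(size=(200, 100))

    # PCA: linear projection onto the two directions of highest variance.
    coords_pca = PCA(n_components=2).fit_transform(vectors)

    # t-SNE: nonlinear projection that preserves local neighbourhoods,
    # often better than PCA at revealing word clusters.
    coords_tsne = TSNE(n_components=2, perplexity=30,
                       init="pca").fit_transform(vectors)

    print(coords_pca.shape, coords_tsne.shape)  # (200, 2) (200, 2)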


Examples of application

For instance, fastText is used to calculate word embeddings for the text corpora in Sketch Engine that are available online.


See also

* Brown clustering
* Distributional–relational database

