natural language processing Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...

, a word embedding is a representation of a word. The

embedding In mathematics, an embedding (or imbedding) is one instance of some mathematical structure contained within another instance, such as a group (mathematics), group that is a subgroup. When some object X is said to be embedded in another object Y ...

is used in

text analysis Content analysis is the study of documents and communication artifacts, known as texts e.g. photos, speeches or essays. Social scientists use content analysis to examine patterns in communication in a replicable and systematic manner. One of the ...

. Typically, the representation is a

real-valued In mathematics, value may refer to several, strongly related notions. In general, a mathematical value may be any definite mathematical object. In elementary mathematics, this is most often a number – for example, a real number such as or an ...

vector that encodes the meaning of the word in such a way that the words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using

language model A language model is a model of the human brain's ability to produce natural language. Language models are useful for a variety of tasks, including speech recognition, machine translation,Andreas, Jacob, Andreas Vlachos, and Stephen Clark (2013)"S ...

ing and

feature learning In machine learning (ML), feature learning or representation learning is a set of techniques that allow a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual fea ...

techniques, where words or phrases from the vocabulary are mapped to vectors of

real numbers In mathematics, a real number is a number that can be used to measurement, measure a continuous variable, continuous one-dimensional quantity such as a time, duration or temperature. Here, ''continuous'' means that pairs of values can have arbi ...

. Methods to generate this mapping include

neural networks A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either Cell (biology), biological cells or signal pathways. While individual neurons are simple, many of them together in a netwo ...

dimensionality reduction Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally ...

on the word

co-occurrence matrix A co-occurrence matrix or co-occurrence distribution (also referred to as : ''gray-level co-occurrence matrices'' GLCMs) is a matrix (mathematics), matrix that is defined over an Digital image, image to be the distribution of co-occurring pixel v ...

, probabilistic models, explainable knowledge base method, and explicit representation in terms of the context in which words appear. Word and phrase embeddings, when used as the underlying input representation, have been shown to boost the performance in NLP tasks such as

syntactic parsing Parsing, syntax analysis, or syntactic analysis is a process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar by breaking it into parts. The term ''pa ...

and

sentiment analysis Sentiment analysis (also known as opinion mining or emotion AI) is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subje ...

Development and history of the approach

In distributional semantics, a quantitative methodological approach for understanding meaning in observed language, word embeddings or semantic

feature space Feature may refer to: Computing * Feature recognition, could be a hole, pocket, or notch * Feature (computer vision), could be an edge, corner or blob * Feature (machine learning), in statistics: individual measurable properties of the phenom ...

models have been used as a knowledge representation for some time. Such models aim to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The underlying idea that "a word is characterized by the company it keeps" was proposed in a 1957 article by

John Rupert Firth John Rupert Firth OBE (17 June 1890 in Keighley, Yorkshire – 14 December 1960 in Lindfield, West Sussex), commonly known as J. R. Firth, was an English linguist and a leading figure in British linguistics during the 1950s. Education and care ...

, but also has roots in the contemporaneous work on search systems and in cognitive psychology. The notion of a semantic space with lexical items (words or multi-word terms) represented as vectors or embeddings is based on the computational challenges of capturing distributional characteristics and using them for practical application to measure similarity between words, phrases, or entire documents. The first generation of semantic space models is the

vector space model Vector space model or term vector model is an algebraic model for representing text documents (or more generally, items) as vector space, vectors such that the distance between vectors represents the relevance between the documents. It is used in i ...

for information retrieval. Such vector space models for words and their distributional data implemented in their simplest form results in a very sparse vector space of high dimensionality (cf.

curse of dimensionality The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. T ...

). Reducing the number of dimensions using linear algebraic methods such as

singular value decomposition In linear algebra, the singular value decomposition (SVD) is a Matrix decomposition, factorization of a real number, real or complex number, complex matrix (mathematics), matrix into a rotation, followed by a rescaling followed by another rota ...

then led to the introduction of

latent semantic analysis Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the d ...

in the late 1980s and the

random indexing Random indexing is a dimensionality reduction method and computational framework for distributional semantics, based on the insight that very-high-dimensional vector space model implementations are impractical, that models need not grow in dimensi ...

approach for collecting word co-occurrence contexts. In 2000, Bengio et al. provided in a series of papers titled "Neural probabilistic language models" to reduce the high dimensionality of word representations in contexts by "learning a distributed representation for words". A study published in

NeurIPS The Conference and Workshop on Neural Information Processing Systems (abbreviated as NeurIPS and formerly NIPS) is a machine learning and computational neuroscience conference held every December. Along with ICLR and ICML, it is one of the three ...

(NIPS) 2002 introduced the use of both word and document embeddings applying the method of kernel CCA to bilingual (and multi-lingual) corpora, also providing an early example of

self-supervised learning Self-supervised learning (SSL) is a paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on externally-provided labels. In the context of neural networks, self ...

of word embeddings. Word embeddings come in two different styles, one in which words are expressed as vectors of co-occurring words, and another in which words are expressed as vectors of linguistic contexts in which the words occur; these different styles are studied in Lavelli et al., 2004. Roweis and Saul published in ''

Science Science is a systematic discipline that builds and organises knowledge in the form of testable hypotheses and predictions about the universe. Modern science is typically divided into twoor threemajor branches: the natural sciences, which stu ...

'' how to use "

locally linear embedding Nonlinear dimensionality reduction, also known as manifold learning, is any of various related techniques that aim to project high-dimensional data, potentially existing across non-linear manifolds which cannot be adequately captured by linear de ...

" (LLE) to discover representations of high dimensional data structures. Most new word embedding techniques after about 2005 rely on a

neural network A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either biological cells or signal pathways. While individual neurons are simple, many of them together in a network can perfor ...

architecture instead of more probabilistic and algebraic models, after foundational work done by Yoshua Bengio and colleagues. The approach has been adopted by many research groups after theoretical advances in 2010 had been made on the quality of vectors and the training speed of the model, as well as after hardware advances allowed for a broader

parameter space The parameter space is the space of all possible parameter values that define a particular mathematical model. It is also sometimes called weight space, and is often a subset of finite-dimensional Euclidean space. In statistics, parameter spaces a ...

to be explored profitably. In 2013, a team at

Google Google LLC (, ) is an American multinational corporation and technology company focusing on online advertising, search engine technology, cloud computing, computer software, quantum computing, e-commerce, consumer electronics, and artificial ...

led by

Tomas Mikolov Tomas may refer to: People * Tomás (given name), a Spanish, Portuguese, and Gaelic given name * Tomas (given name), a Swedish, Dutch, and Lithuanian given name * Tomáš, a Czech and Slovak given name * Tomàs, a Catalan given name and surname * ...

created

word2vec Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these rep ...

, a word embedding toolkit that can train vector space models faster than previous approaches. The word2vec approach has been widely used in experimentation and was instrumental in raising interest for word embeddings as a technology, moving the research strand out of specialised research into broader experimentation and eventually paving the way for practical application.

Polysemy and homonymy

Historically, one of the main limitations of static word embeddings or word

s is that words with multiple meanings are conflated into a single representation (a single vector in the semantic space). In other words,

polysemy Polysemy ( or ; ) is the capacity for a Sign (semiotics), sign (e.g. a symbol, morpheme, word, or phrase) to have multiple related meanings. For example, a word can have several word senses. Polysemy is distinct from ''monosemy'', where a word h ...

and

homonym In linguistics, homonyms are words which are either; '' homographs''—words that mean different things, but have the same spelling (regardless of pronunciation), or '' homophones''—words that mean different things, but have the same pronunciat ...

y are not handled properly. For example, in the sentence "The club I tried yesterday was great!", it is not clear if the term ''club'' is related to the word sense of a ''

club sandwich A club sandwich or clubhouse sandwich, is a three-layer sandwich consisting of three slices of bread (traditionally toasted), sliced cooked poultry, fried bacon, lettuce, tomato, and mayonnaise.Mariani, John (July 1995). "The club sandwich." '' ...

'', ''

clubhouse Clubhouse may refer to: Locations * The meetinghouse of: ** A club (organization), an association of two or more people united by a common interest or goal ** In the United States, a country club ** In the United Kingdom, a gentlemen's club * A ...

'', ''

golf club A golf club is a club used to hit a golf ball in a game of golf. Each club is composed of a shaft with a grip and a club head. Woods are mainly used for long-distance fairway or tee shots; irons, the most versatile class, are used for a variety o ...

'', or any other sense that ''club'' might have. The necessity to accommodate multiple meanings per word in different vectors (multi-sense embeddings) is the motivation for several contributions in NLP to split single-sense embeddings into multi-sense ones. Most approaches that produce multi-sense embeddings can be divided into two main categories for their word sense representation, i.e., unsupervised and knowledge-based. Based on

skip-gram, Multi-Sense Skip-Gram (MSSG) performs word-sense discrimination and embedding simultaneously, improving its training time, while assuming a specific number of senses for each word. In the Non-Parametric Multi-Sense Skip-Gram (NP-MSSG) this number can vary depending on each word. Combining the prior knowledge of lexical databases (e.g.,

WordNet WordNet is a lexical database of semantic relations between words that links words into semantic relations including synonyms, hyponyms, and meronyms. The synonyms are grouped into ''synsets'' with short definitions and usage examples. It can thu ...

ConceptNet Open Mind Common Sense (OMCS) is an artificial intelligence project based at the Massachusetts Institute of Technology (MIT) Media Lab whose goal is to build and utilize a large commonsense knowledge base from the contributions of many thousands ...

BabelNet BabelNet is a multilingual lexical-semantic knowledge graph, ontology and encyclopedic dictionary developed at the NLP group of the Sapienza University of Rome under the supervision of Roberto Navigli.R. Navigli and S. P Ponzetto. 2012BabelNet: ...

), word embeddings and

word sense disambiguation Word-sense disambiguation is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious. Given that natural language requires re ...

, Most Suitable Sense Annotation (MSSA) labels word-senses through an unsupervised and knowledge-based approach, considering a word's context in a pre-defined sliding window. Once the words are disambiguated, they can be used in a standard word embeddings technique, so multi-sense embeddings are produced. MSSA architecture allows the disambiguation and annotation process to be performed recurrently in a self-improving manner. The use of multi-sense embeddings is known to improve performance in several NLP tasks, such as

part-of-speech tagging In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging, is the process of marking up a word in a text ( corpus) as corresponding to a particular part of speech, based on both its defini ...

, semantic relation identification,

semantic relatedness Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. These are mathematical too ...

named entity recognition Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pr ...

and sentiment analysis. As of the late 2010s, contextually-meaningful embeddings such as

ELMo Elmo is a Muppet character on the children's television show ''Sesame Street''. A furry red monster who speaks in a high-pitched falsetto voice and frequently refers to himself in the third person, he hosts the last full 15-minute segmen ...

and BERT have been developed. Unlike static word embeddings, these embeddings are at the token-level, in that each occurrence of a word has its own embedding. These embeddings better reflect the multi-sense nature of words, because occurrences of a word in similar contexts are situated in similar regions of BERT’s embedding space.

For biological sequences: BioVectors

Word embeddings for ''n-''grams in biological sequences (e.g. DNA, RNA, and Proteins) for

bioinformatics Bioinformatics () is an interdisciplinary field of science that develops methods and Bioinformatics software, software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, ...

applications have been proposed by Asgari and Mofrad. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in

proteomics Proteomics is the large-scale study of proteins. Proteins are vital macromolecules of all living organisms, with many functions such as the formation of structural fibers of muscle tissue, enzymatic digestion of food, or synthesis and replicatio ...

and

genomics Genomics is an interdisciplinary field of molecular biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, ...

. The results presented by Asgari and Mofrad suggest that BioVectors can characterize biological sequences in terms of biochemical and biophysical interpretations of the underlying patterns.

Game design

Word embeddings with applications in

game design Game design is the process of creating and shaping the mechanics, systems, rules, and gameplay of a game. Game design processes apply to board games, card games, dice games, casino games, role-playing games, sports, Wargame (video games), war ga ...

have been proposed by Rabii and Cook as a way to discover

emergent gameplay Emergent gameplay refers to complex situations in video games, board games, or role-playing games that emerge from the interaction of relatively simple game mechanics. Designers have attempted to encourage emergent play by providing tools to play ...

using logs of gameplay data. The process requires transcribing actions that occur during a game within a

formal language In logic, mathematics, computer science, and linguistics, a formal language is a set of strings whose symbols are taken from a set called "alphabet". The alphabet of a formal language consists of symbols that concatenate into strings (also c ...

and then using the resulting text to create word embeddings. The results presented by Rabii and Cook suggest that the resulting vectors can capture expert knowledge about games like

chess Chess is a board game for two players. It is an abstract strategy game that involves Perfect information, no hidden information and no elements of game of chance, chance. It is played on a square chessboard, board consisting of 64 squares arran ...

that are not explicitly stated in the game's rules.

Sentence embeddings

The idea has been extended to embeddings of entire sentences or even documents, e.g. in the form of the thought vectors concept. In 2015, some researchers suggested "skip-thought vectors" as a means to improve the quality of

machine translation Machine translation is use of computational techniques to translate text or speech from one language to another, including the contextual, idiomatic and pragmatic nuances of both languages. Early approaches were mostly rule-based or statisti ...

. A more recent and popular approach for representing sentences is Sentence-BERT, or SentenceTransformers, which modifies pre-trained BERT with the use of siamese and triplet network structures.

Software

Software for training and using word embeddings includes

Tomáš Mikolov Tomáš Mikolov is a Czech computer scientist working in the field of machine learning. In March 2020, Mikolov became a senior research scientist at the Czech Institute of Informatics, Robotics and Cybernetics. Career Mikolov obtained his PhD ...

Word2vec Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these rep ...

, Stanford University's

GloVe A glove is a garment covering the hand, with separate sheaths or openings for each finger including the thumb. Gloves protect and comfort hands against cold or heat, damage by friction, abrasion or chemicals, and disease; or in turn to provide a ...

, GN-GloVe, Flair embeddings, AllenNLP's

, BERT,

fastText fastText is a library for learning of word embeddings and text classification created by Facebook's AI Research (FAIR) lab. The model allows one to create an unsupervised learning or supervised learning algorithm for obtaining vector representati ...

Gensim Gensim is an open-source library for unsupervised topic modeling, document indexing, retrieval by similarity, and other natural language processing functionalities, using modern statistical machine learning. Gensim is implemented in Python and ...

, Indra, and

Deeplearning4j Eclipse Deeplearning4j is a programming library written in Java for the Java virtual machine (JVM). It is a framework with wide support for deep learning algorithms. Deeplearning4j includes implementations of the restricted Boltzmann machine, ...

Principal Component Analysis Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing. The data is linearly transformed onto a new coordinate system such that th ...

(PCA) and

T-Distributed Stochastic Neighbour Embedding t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map. It is based on Stochastic Neighbor Embedding originally de ...

(t-SNE) are both used to reduce the dimensionality of word vector spaces and visualize word embeddings and

clusters may refer to: Science and technology Astronomy * Cluster (spacecraft), constellation of four European Space Agency spacecraft * Cluster II (spacecraft), a European Space Agency mission to study the magnetosphere * Asteroid cluster, a small ...

Examples of application

For instance, the fastText is also used to calculate word embeddings for

text corpora In linguistics and natural language processing, a corpus (: corpora) or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated. Annotated, they have been used in cor ...

Sketch Engine Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing since 2003. Its purpose is to enable people studying language behaviour (lexicographers, researchers in corpus linguistics, translators or language learn ...

that are available online.

Ethical implications

Word embeddings may contain the biases and stereotypes contained in the trained dataset, as Bolukbasi et al. points out in the 2016 paper “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings” that a publicly available (and popular) word2vec embedding trained on Google News texts (a commonly used data corpus), which consists of text written by professional journalists, still shows disproportionate word associations reflecting gender and racial biases when extracting word analogies. For example, one of the analogies generated using the aforementioned word embedding is “man is to computer programmer as woman is to homemaker”. Research done by Jieyu Zhou et al. shows that the applications of these trained word embeddings without careful oversight likely perpetuates existing bias in society, which is introduced through unaltered training data. Furthermore, word embeddings can even amplify these biases .

References

{{Artificial intelligence navbox Language modeling Artificial neural networks Natural language processing Computational linguistics Semantic relations