
Seq2seq is a family of machine learning approaches used for natural language processing. Applications include language translation, image captioning, conversational models, speech recognition, and text summarization. Seq2seq uses sequence transformation: it turns one sequence into another sequence.


History

seq2seq is an approach to machine translation (or more generally, sequence transduction) with roots in information theory, where communication is understood as an encode-transmit-decode process, and machine translation can be studied as a special case of communication. This viewpoint was elaborated, for example, in the noisy channel model of machine translation.

In practice, seq2seq maps an input sequence into a real-valued vector by using a neural network (the ''encoder''), and then maps it back to an output sequence using another neural network (the ''decoder''). The idea of encoder-decoder sequence transduction had been developed in the early 2010s. The papers most commonly cited as the originators of seq2seq are two papers from 2014. In the seq2seq they proposed, both the encoder and the decoder were LSTMs. This had a "bottleneck" problem: since the encoding vector has a fixed size, information from long input sequences tends to be lost, as it is difficult to fit into the fixed-length encoding vector.

The attention mechanism, proposed in 2014, resolved the bottleneck problem. Its authors called their model ''RNNsearch'', as it "emulates searching through a source sentence during decoding a translation". A remaining problem with seq2seq models at this point was that recurrent neural networks are difficult to parallelize. The 2017 publication of the Transformer resolved this by replacing the encoding RNN with self-attention Transformer blocks ("encoder blocks"), and the decoding RNN with cross-attention, causally-masked Transformer blocks ("decoder blocks").


Priority dispute

One of the papers cited as the originator of seq2seq is (Sutskever et al 2014), published at Google Brain while its authors were on Google's machine translation project. The research allowed Google to overhaul Google Translate into Google Neural Machine Translation in 2016. Tomáš Mikolov claims to have developed the idea (before joining Google Brain) of using a "neural language model on pairs of sentences... and then [generating] translation after seeing the first sentence"—which he equates with seq2seq machine translation, and to have mentioned the idea to Ilya Sutskever and Quoc Le (while at Google Brain), who failed to acknowledge him in their paper. Mikolov had worked on RNNLM (using RNNs for language modelling) for his PhD thesis, and is more notable for developing word2vec.


Architecture

The main reference for this section is.


Encoder

The encoder is responsible for processing the input sequence and capturing its essential information, which is stored as the hidden state of the network and, in a model with an attention mechanism, a context vector. The context vector is a weighted sum of the encoder hidden states and is generated for every time step of the output sequence.
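
To make the data flow concrete, here is a minimal encoder sketch in NumPy. It uses a simple Elman-style recurrent cell rather than the LSTMs of the original 2014 papers, and all weight names and sizes are illustrative assumptions, not any published implementation.

```python
import numpy as np

def rnn_encoder(inputs, W_xh, W_hh, b_h):
    """Run a simple recurrent cell over the input sequence and return
    every hidden state. Real seq2seq encoders use LSTM or GRU cells,
    but the overall data flow is the same."""
    h = np.zeros(W_hh.shape[0])                 # initial hidden state
    hidden_states = []
    for x in inputs:                            # one embedded input token per time step
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)
        hidden_states.append(h)
    return np.stack(hidden_states)              # shape: (sequence length, hidden size)

# Toy usage with random weights; sizes are arbitrary for illustration.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(7, 4))                # 7 tokens, embedding size 4
W_xh, W_hh, b_h = rng.normal(size=(4, 8)), rng.normal(size=(8, 8)), np.zeros(8)
H = rnn_encoder(inputs, W_xh, W_hh, b_h)        # encoder hidden states h_0, h_1, ...
```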


Decoder

The decoder takes the context vector and hidden states from the encoder and generates the final output sequence. It operates in an autoregressive manner, producing one element of the output sequence at a time. At each step, it considers the previously generated elements, the context vector, and the input sequence information to predict the next element of the output sequence. Specifically, in a model with an attention mechanism, the context vector and the decoder hidden state are concatenated to form an attention hidden vector, which is used as an input for the decoder (a minimal decoder step is sketched below). The seq2seq method developed in the early 2010s uses two neural networks: an encoder network that converts an input sentence into numerical vectors, and a decoder network that converts those vectors into sentences in the target language. The attention mechanism was grafted onto this structure in 2014, and the design was later refined into the encoder-decoder Transformer architecture of 2017.
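
The sketch below shows a single decoder step in the attention-augmented style described above: the embedding of the previously generated token updates the decoder state, which is concatenated with the context vector to form the attention hidden vector before scoring the next token. The simple tanh cell and all weight names are assumptions for illustration only.

```python
import numpy as np

def decoder_step(y_prev_embed, h_dec, context, W_xh, W_hh, b_h, W_ch, W_out, b_out):
    """One autoregressive decoder step: update the decoder hidden state
    from the embedding of the previously generated token, concatenate it
    with the context vector to form the attention hidden vector, and
    score the next output token."""
    h_dec = np.tanh(y_prev_embed @ W_xh + h_dec @ W_hh + b_h)
    attn_hidden = np.tanh(np.concatenate([context, h_dec]) @ W_ch)
    logits = attn_hidden @ W_out + b_out        # unnormalised scores over the vocabulary
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax over the target vocabulary
    return probs, h_dec
```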


Training vs prediction

There is a subtle difference between training and prediction. During training, both the input and the output sequences are known. During prediction, only the input sequence is known, and the output sequence must be decoded by the network itself.

Specifically, consider an input sequence x_{1:n} and output sequence y_{1:m}. The encoder processes the input x_{1:n} step by step. After that, the decoder takes the output from the encoder, as well as a start-of-sequence token, as input, and produces a prediction \hat y_1. The question is then: what should be input to the decoder at the next step?

A standard method for training is "teacher forcing". In teacher forcing, no matter what is output by the decoder, the next input to the decoder is always the reference. That is, even if \hat y_1 \neq y_1, the next input to the decoder is still y_1, and so on. During prediction, the "teacher" y_{1:m} is unavailable, so the input to the decoder must be \hat y_1, then \hat y_2, and so on. It has been found that a model trained purely with teacher forcing degrades at prediction time, since generation based on the model's own output is different from generation based on the teacher's output. This is called exposure bias, or a train/test distribution shift. A 2015 paper recommends randomly switching between teacher forcing and no teacher forcing during training.
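
The toy sketch below illustrates this training-time choice. The decoder step is a placeholder (a fake state update and random scores), and the per-step random switching controlled by teacher_forcing_ratio is an illustrative rendering of the recipe mentioned above, not the method of any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def decode_step(h, token_id, vocab_size=5):
    """Toy stand-in for one decoder step: returns scores over a tiny
    vocabulary and an updated state (a real model would use an RNN/LSTM)."""
    h = np.tanh(h + 0.1 * token_id)
    logits = rng.normal(size=vocab_size)      # placeholder next-token scores
    return logits, h

def run_decoder(reference_ids, teacher_forcing_ratio=0.5, start_id=0):
    """At each step, feed either the reference token (teacher forcing)
    or the model's own previous prediction, chosen at random per step."""
    h = np.zeros(8)
    prev = start_id
    predictions = []
    for y_ref in reference_ids:
        logits, h = decode_step(h, prev)
        y_hat = int(np.argmax(logits))        # the model's own prediction
        predictions.append(y_hat)
        use_teacher = rng.random() < teacher_forcing_ratio
        prev = y_ref if use_teacher else y_hat
    return predictions

print(run_decoder([1, 3, 2, 4]))              # prints four toy predictions
```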


Attention for seq2seq

The attention mechanism is an enhancement introduced by Bahdanau et al. in 2014 to address a limitation of the basic seq2seq architecture, where a long input sequence causes the fixed-size hidden state output of the encoder to become uninformative for the decoder. It enables the model to selectively focus on different parts of the input sequence during decoding. At each decoder step, an alignment model calculates the attention score, using the current decoder state and all of the attention hidden vectors as input. The alignment model is another neural network, trained jointly with the seq2seq model, that calculates how well an input, represented by its hidden state, matches the previous output, represented by the attention hidden state. A softmax function is then applied to the attention scores to obtain the attention weights. In some models, the encoder states are fed directly into an activation function, removing the need for an alignment model. Such an activation function receives one decoder state and one encoder state and returns a scalar value of their relevance.

Consider a seq2seq English-to-French translation task. To be concrete, consider the translation of "the zone of international control <end>", which should translate to "la zone de contrôle international <end>". Here, the special token <end> is used as a control character to delimit the end of input for both the encoder and the decoder.

An input sequence of text x_0, x_1, \dots is processed by a neural network (which can be an LSTM, a Transformer encoder, or some other network) into a sequence of real-valued vectors h_0, h_1, \dots, where h stands for "hidden vector". After the encoder has finished processing, the decoder operates over the hidden vectors to produce an output sequence y_0, y_1, \dots autoregressively. That is, it always takes as input both the hidden vectors produced by the encoder and what the decoder itself has produced before, in order to produce the next output word:

# (h_0, h_1, \dots, "<start>") → "la"
# (h_0, h_1, \dots, "<start> la") → "la zone"
# (h_0, h_1, \dots, "<start> la zone") → "la zone de"
# ...
# (h_0, h_1, \dots, "<start> la zone de contrôle international") → "la zone de contrôle international <end>"

Here, the special token <start> is used as a control character to delimit the start of input for the decoder. Decoding terminates as soon as "<end>" appears in the decoder output; a minimal sketch of this decoding loop is shown below.
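
The following Python sketch shows the prediction-time decoding loop. The step_fn callable, the integer token ids, and the greedy argmax choice are illustrative assumptions; real systems typically use beam search rather than pure greedy decoding.

```python
import numpy as np

START_ID, END_ID = 0, 1                      # illustrative ids for "<start>" and "<end>"

def greedy_decode(encoder_states, step_fn, max_len=50):
    """Prediction-time decoding: repeatedly feed the decoder its own
    previous outputs until it emits the end token (or hits max_len).
    `step_fn(encoder_states, prefix)` is assumed to return a score
    vector over the target vocabulary for the next token."""
    prefix = [START_ID]
    for _ in range(max_len):
        logits = step_fn(encoder_states, prefix)
        next_id = int(np.argmax(logits))     # greedy choice of the next word
        prefix.append(next_id)
        if next_id == END_ID:                # "<end>" terminates decoding
            break
    return prefix[1:]                        # output sequence without "<start>"
```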


Attention weights

As hand-crafting weights defeats the purpose of machine learning, the model must compute the attention weights on its own. Taking an analogy from the language of database queries, we make the model construct a triple of vectors: key, query, and value. The rough idea is that we have a "database" in the form of a list of key-value pairs. The decoder sends in a query and obtains a reply in the form of a weighted sum of the values, where the weight is proportional to how closely the query resembles each key.

The decoder first processes the "<start>" input partially, to obtain an intermediate vector h^d_0, the 0th hidden vector of the decoder. Then, the intermediate vector is transformed by a linear map W^Q into a query vector q_0 = h_0^d W^Q. Meanwhile, the hidden vectors output by the encoder are transformed by another linear map W^K into key vectors k_0 = h_0 W^K, k_1 = h_1 W^K, \dots. The linear maps are useful for providing the model with enough freedom to find the best way to represent the data.

Now, the query and keys are compared by taking dot products: q_0 k_0^T, q_0 k_1^T, \dots. Ideally, the model should have learned to compute the keys and values such that q_0 k_0^T is large, q_0 k_1^T is small, and the rest are very small. This can be interpreted as saying that the attention weight should be applied mostly to the 0th hidden vector of the encoder, a little to the 1st, and essentially none to the rest. In order to make a properly weighted sum, we need to transform this list of dot products into a probability distribution over 0, 1, \dots. This can be accomplished by the softmax function, thus giving us the attention weights:

(w_{00}, w_{01}, \dots) = \mathrm{softmax}(q_0 k_0^T, q_0 k_1^T, \dots)

This is then used to compute the context vector:

c_0 = w_{00} v_0 + w_{01} v_1 + \cdots

where v_0 = h_0 W^V, v_1 = h_1 W^V, \dots are the value vectors, linearly transformed by another matrix to provide the model with freedom to find the best way to represent values. Without the matrices W^Q, W^K, W^V, the model would be forced to use the same hidden vector for both key and value, which might not be appropriate, as these two tasks are not the same.

This is the dot-attention mechanism. The particular version described in this section is "decoder cross-attention": the output context vector is used by the decoder, and the input keys and values come from the encoder, but the query comes from the decoder, hence "cross-attention". More succinctly, we can write it as

c_0 = \mathrm{Attention}(h_0^d W^Q, H W^K, H W^V) = \mathrm{softmax}\big((h_0^d W^Q) \, (H W^K)^T\big) \, (H W^V)

where the matrix H is the matrix whose rows are h_0, h_1, \dots. Note that the querying vector, h_0^d, is not necessarily the same as the key-value vector h_0. In fact, it is theoretically possible for query, key, and value vectors to all be different, though that is rarely done in practice.
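
The computation above translates almost line for line into code. The NumPy sketch below computes decoder cross-attention for a single decoder position; the matrix sizes and random inputs are illustrative assumptions only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_attention(h_dec, H_enc, W_Q, W_K, W_V):
    """Dot-product decoder cross-attention for a single decoder position:
    the query comes from the decoder state, keys and values from the
    encoder hidden states."""
    q = h_dec @ W_Q                       # query vector  q_0 = h_0^d W^Q
    K = H_enc @ W_K                       # key vectors   k_i = h_i W^K (as rows)
    V = H_enc @ W_V                       # value vectors v_i = h_i W^V (as rows)
    weights = softmax(q @ K.T)            # attention weights w_00, w_01, ...
    return weights @ V                    # context vector c_0

# Tiny numerical check with random matrices; all sizes are illustrative.
rng = np.random.default_rng(0)
H_enc = rng.normal(size=(6, 16))          # six encoder hidden vectors, as rows of H
h_dec = rng.normal(size=16)               # decoder hidden vector h_0^d
W_Q, W_K, W_V = (rng.normal(size=(16, 16)) for _ in range(3))
c_0 = cross_attention(h_dec, H_enc, W_Q, W_K, W_V)
print(c_0.shape)                          # (16,)
```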


Other applications

In 2019, Facebook announced its use in symbolic integration and the solution of differential equations. The company claimed that it could solve complex equations more rapidly and with greater accuracy than commercial solutions such as Mathematica, MATLAB and Maple. First, the equation is parsed into a tree structure to avoid notational idiosyncrasies. An LSTM neural network then applies its standard pattern recognition facilities to process the tree.

In 2020, Google released Meena, a 2.6 billion parameter seq2seq-based chatbot trained on a 341 GB data set. Google claimed that the chatbot has 1.7 times greater model capacity than OpenAI's GPT-2.

In 2022, Amazon introduced AlexaTM 20B, a moderate-sized (20 billion parameter) seq2seq language model. It uses an encoder-decoder to accomplish few-shot learning. The encoder outputs a representation of the input that the decoder uses as input to perform a specific task, such as translating the input into another language. The model outperforms the much larger GPT-3 in language translation and summarization. Training mixes denoising (appropriately inserting missing text in strings) and causal language modeling (meaningfully extending an input text). This allows adding features across different languages without massive training workflows. AlexaTM 20B achieved state-of-the-art performance in few-shot-learning tasks across all Flores-101 language pairs, outperforming GPT-3 on several tasks.


See also

* Artificial neural network

