
In representation learning, knowledge graph embedding (KGE), also referred to as knowledge representation learning (KRL) or multi-relation learning, is a machine learning task of learning a low-dimensional representation of a knowledge graph's entities and relations while preserving their semantic meaning.
Leveraging their embedded representation, knowledge graphs (KGs) can be used for various applications such as link prediction, triple classification, entity recognition, clustering, and relation extraction.
Definition
A knowledge graph G = {E, R, F} is a collection of entities E, relations R, and facts F.
A ''fact'' is a triple (h, r, t) ∈ F that denotes a link r ∈ R between the head h ∈ E and the tail t ∈ E of the triple. Another notation that is often used in the literature to represent a triple (or fact) is ⟨head, relation, tail⟩. This notation is called the resource description framework (RDF).
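As a concrete illustration of this triple notation, a small knowledge graph can be stored as a set of (head, relation, tail) tuples; the entity and relation names below are invented for the example.

```python
# A toy knowledge graph stored as (head, relation, tail) triples.
# All names here are illustrative, not taken from a real dataset.
facts = {
    ("Berlin", "capital_of", "Germany"),
    ("Germany", "located_in", "Europe"),
    ("Berlin", "located_in", "Germany"),
}

# The entity set E and relation set R are recovered from the facts F.
entities = {h for h, _, _ in facts} | {t for _, _, t in facts}
relations = {r for _, r, _ in facts}

print(sorted(entities))   # the entities E
print(sorted(relations))  # the relations R
```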
A knowledge graph represents the knowledge related to a specific domain; leveraging this structured representation, it is possible to infer new knowledge from it after some refinement steps.
However, knowledge graphs are typically sparse, and using them directly in real-world applications is computationally inefficient.
The embedding of a knowledge graph translates each entity and relation of a knowledge graph G into a vector of a given dimension d, called the embedding dimension. In the general case, we can have different embedding dimensions for the entities (d) and for the relations (k).
The collection of embedding vectors for all the entities and relations in the knowledge graph can then be used for downstream tasks.
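A minimal sketch of this mapping, with invented entity and relation names and arbitrary dimensions, could look as follows:

```python
import random

random.seed(0)

entities = ["Berlin", "Germany", "Europe"]   # illustrative entity set
relations = ["capital_of", "located_in"]     # illustrative relation set

d = 4  # embedding dimension for entities (arbitrary choice)
k = 4  # embedding dimension for relations (may differ from d in general)

# Each entity and each relation is mapped to its own vector of real numbers,
# here randomly initialized, as at the start of the embedding procedure.
entity_emb = {e: [random.uniform(-1, 1) for _ in range(d)] for e in entities}
relation_emb = {r: [random.uniform(-1, 1) for _ in range(k)] for r in relations}

print(entity_emb["Berlin"])  # a d-dimensional vector
```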
A knowledge graph embedding is characterized by four different aspects:
# Representation space: The low-dimensional space in which the entities and relations are represented.
# Scoring function: A measure of the goodness of the embedded representation of a triple.
# Encoding models: The modality in which the embedded representation of the entities and relations interact with each other.
# Additional information: Any additional information coming from the knowledge graph that can enrich the embedded representation.
Usually, an ad hoc scoring function is integrated into the general scoring function for each kind of additional information.
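As an example of a scoring function, the well-known translational model TransE scores a triple by how closely the tail embedding matches head + relation; the sketch below uses plain Python lists and is only illustrative.

```python
import math

def transe_score(h, r, t):
    """TransE-style score: negative L2 distance of h + r from t (higher is better)."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# A triple whose embeddings satisfy h + r ≈ t scores near the maximum of 0.
h = [0.1, 0.2]
r = [0.3, 0.1]
t = [0.4, 0.3]
print(transe_score(h, r, t))  # ≈ 0

# A mismatched tail receives a lower score.
t_bad = [2.0, -1.0]
print(transe_score(h, r, t_bad))
```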
Embedding procedure
All the different knowledge graph embedding models follow roughly the same procedure to learn the semantic meaning of the facts.
First of all, to learn an embedded representation of a knowledge graph, the embedding vectors of the entities and relations are initialized to random values.
Then, starting from a training set, the algorithm continuously optimizes the embeddings until a stop condition is reached.
Usually, the stop condition is given by the overfitting over the training set.
At each iteration, a batch of size b is sampled from the training set, and for each triple in the batch a random corrupted fact is sampled, i.e., a triple that does not represent a true fact in the knowledge graph.
The corruption of a triple involves substituting the head or the tail (or both) of the triple with another entity that makes the fact false.
The original triple and the corrupted triple are added to the training batch, and then the embeddings are updated by optimizing a scoring function.
At the end of the algorithm, the learned embeddings should have extracted the semantic meaning from the triples and should correctly predict unseen true facts in the knowledge graph.
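The corruption step described above can be sketched as follows; the replacement entity is drawn at random, with re-sampling to avoid accidentally producing a triple that is already a true fact (all names are illustrative).

```python
import random

random.seed(1)

facts = {
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "capital_of", "France"),
    ("Berlin", "located_in", "Germany"),
}
entities = ["Berlin", "Germany", "Paris", "France"]

def corrupt(triple, entities, facts):
    """Replace the head or the tail with a random entity so the triple is false."""
    h, r, t = triple
    while True:
        e = random.choice(entities)
        candidate = (e, r, t) if random.random() < 0.5 else (h, r, e)
        # Keep only candidates that are not true facts in the graph.
        if candidate not in facts:
            return candidate

neg = corrupt(("Berlin", "capital_of", "Germany"), entities, facts)
print(neg)  # a corrupted (false) triple with the same relation
```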
Pseudocode
The following is the pseudocode for the general embedding procedure.
algorithm Compute entity and relation embeddings is
    input: The training set S = {(h, r, t)},
           entity set E,
           relation set R,
           embedding dimension k
    output: Entity and relation embeddings

    ''initialization:'' ''the entity and relation embeddings (vectors) are randomly initialized''

    while stop condition do
        S_batch ← sample(S, b)  // from the training set, randomly sample a batch of size b
        for each (h, r, t) in S_batch do
            (h', r, t') ← sample a corrupted fact of triple (h, r, t)
            T_batch ← T_batch ∪ {((h, r, t), (h', r, t'))}
        end for
        Update embeddings by minimizing the loss function over T_batch
    end while
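The general procedure above can be sketched in Python using a TransE-style score and a margin-based ranking loss; the toy training set, dimensions, learning rate, and margin are all illustrative choices, not part of any specific published model.

```python
import random

random.seed(0)

facts = [("a", "r1", "b"), ("b", "r1", "c"), ("a", "r2", "c")]  # toy training set
entities = ["a", "b", "c"]
relations = ["r1", "r2"]
d = 8          # embedding dimension (illustrative)
lr = 0.05      # learning rate (illustrative)
margin = 1.0   # margin of the ranking loss (illustrative)

# initialization: embeddings are randomly initialized
E = {e: [random.uniform(-0.5, 0.5) for _ in range(d)] for e in entities}
R = {r: [random.uniform(-0.5, 0.5) for _ in range(d)] for r in relations}

def score(h, r, t):
    # Negative squared L2 distance of h + r from t (TransE-style): higher is better.
    return -sum((E[h][i] + R[r][i] - E[t][i]) ** 2 for i in range(d))

def corrupt(h, r, t):
    # Replace the head or the tail with a random entity to build a false triple.
    while True:
        e = random.choice(entities)
        cand = (e, r, t) if random.random() < 0.5 else (h, r, e)
        if cand not in facts:
            return cand

for epoch in range(200):                 # stop condition: a fixed epoch budget
    batch = random.sample(facts, 2)      # sample a batch of size b = 2
    for h, r, t in batch:
        h2, _, t2 = corrupt(h, r, t)
        # The margin loss is positive when the true triple does not beat the
        # corrupted one by at least `margin`; take a gradient step in that case.
        if margin + score(h2, r, t2) - score(h, r, t) > 0:
            for i in range(d):
                g_true = 2 * (E[h][i] + R[r][i] - E[t][i])
                g_corr = 2 * (E[h2][i] + R[r][i] - E[t2][i])
                E[h][i] -= lr * g_true
                R[r][i] -= lr * (g_true - g_corr)
                E[t][i] += lr * g_true
                E[h2][i] += lr * g_corr
                E[t2][i] -= lr * g_corr
```

The gradient step follows from differentiating the loss margin − ||h′ + r − t′||² + ||h + r − t||² with respect to each embedding vector involved.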
Performance indicators
The following indexes are often used to measure the embedding quality of a model. Their simplicity makes them very suitable for evaluating the performance of an embedding algorithm even on a large scale.
Given Q as the set of all ranked predictions of a model, it is possible to define three different performance indexes: Hits@K, MR (mean rank), and MRR (mean reciprocal rank).
Hits@K
Hits@K, or in short H@K, is a performance index that measures the probability of finding the correct prediction among the top K model predictions.
Usually, K = 10 is used.
Hits@K reflects the accuracy of an embedding model in correctly predicting the relation between two given entities.
Hits@K = |{q ∈ Q : q < K}| / |Q|, with values in [0, 1].
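Given the ranks q ∈ Q of the correct answers, the three indexes can be computed directly; the sketch below assumes 1-based ranks (rank 1 means the top prediction was correct), so the condition of the Hits@K formula becomes q ≤ K. The rank values are invented for the example.

```python
def hits_at_k(ranks, k):
    # Fraction of queries whose correct answer is ranked within the top k.
    return sum(1 for q in ranks if q <= k) / len(ranks)

def mean_rank(ranks):
    # MR: average rank of the correct answers (lower is better).
    return sum(ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    # MRR: average of 1/rank (higher is better, bounded by 1).
    return sum(1 / q for q in ranks) / len(ranks)

ranks = [1, 3, 12, 2, 7]  # illustrative ranks of the correct predictions
print(hits_at_k(ranks, 10))         # 0.8
print(mean_rank(ranks))             # 5.0
print(mean_reciprocal_rank(ranks))  # ≈ 0.41
```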