network theory Network theory is the study of graphs as a representation of either symmetric relations or asymmetric relations between discrete objects. In computer science and network science, network theory is a part of graph theory: a network can be defi ...

, link prediction is the problem of predicting the existence of a link between two entities in a network. Examples of link prediction include predicting friendship links among users in a

social network A social network is a social structure made up of a set of social actors (such as individuals or organizations), sets of dyadic ties, and other social interactions between actors. The social network perspective provides a set of methods for an ...

, predicting co-authorship links in a

citation network A citation graph (or citation network), in information science and bibliometrics, is a directed graph that describes the citations within a collection of documents. Each Vertex (graph theory), vertex (or Vertex (graph theory), node) in the gra ...

, and predicting interactions between genes and proteins in a

biological network A biological network is a method of representing systems as complex sets of binary interactions or relations between various biological entities. In general, networks or graphs are used to capture relationships between entities or objects. A typi ...

. Link prediction can also have a temporal aspect, where, given a snapshot of the set of links at time

t

, the goal is to predict the links at time

t + 1

. Link prediction is widely applicable. In e-commerce, link prediction is often a subtask for recommending items to users. In the curation of citation databases, it can be used for record deduplication. In bioinformatics, it has been used to predict protein-protein interactions (PPI). It is also used to identify hidden groups of terrorists and criminals in security related applications.

Problem definition

Consider a network

G = (V, E)

, where

V

represents the entity nodes in the network and

E \subseteq , V,

, V,

represents the set of "true" links across entities in the network. We are given the set of entities

V

and a subset of true links which are referred to as ''observed links''. The goal of link prediction is to identify the unobserved true links. In the temporal formulation of link prediction the observed links correspond to true links at a time

t

, and the goal is to infer the set of true links at time

t+1

Usually, we are also given a subset of unobserved links called ''potential links''

E'

, and we need to identify true links among these potential links. In the binary classification formulation of the link prediction task the potential links are classified as either true links or false links. Link prediction approaches for this setting learn a classifier

M_b

that maps links in

E'

to positive and negative labels i.e.

M_b:E' \to \

. In the probability estimation formulation, potential links are associated with existence probabilities. Link prediction approaches for this setting learn a model

M_p

that maps links in

E'

to a probability i.e.

M_p:E' \to,1 /math>.

Single link approaches learn a model that classifies each link independently. 
Structured prediction approaches capture the correlation between potential links by formulating the task as a collective link prediction task. 
Collective link prediction approaches learn a model that jointly identify all the true links among the set of potential links.

Link prediction task can also be formulated as an instance of missing value estimation task.
Here, the graph is represented as an adjacency matrix with missing values. 
The task is to complete the matrix by identifying the missing values.
Matrix factorization based methods commonly use this formulation.

History

The task of link prediction has attracted attention from several research communities ranging from

statistics Statistics (from German language, German: ''wikt:Statistik#German, Statistik'', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of ...

and

network science Network science is an academic field which studies complex networks such as telecommunication networks, computer networks, biological networks, cognitive and semantic networks, and social networks, considering distinct elements or actors repre ...

machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

and data mining. In statistics, generative random graph models such as

stochastic block model The stochastic block model is a generative model for random graphs. This model tends to produce graphs containing ''communities'', subsets of nodes characterized by being connected with one another with particular edge densities. For example, edg ...

s propose an approach to generate links between nodes in a

random graph In mathematics, random graph is the general term to refer to probability distributions over graphs. Random graphs may be described simply by a probability distribution, or by a random process which generates them. The theory of random graphs li ...

. For social networks, Liben-Nowell and Kleinberg proposed a link prediction models based on different graph proximity measures. Several statistical models have been proposed for link prediction by the machine learning and data mining community. For example, Popescul et al. proposed a structured logistic regression model that can make use of relational features. Local conditional probability models based on attribute and structural features were proposed by O’Madadhain et al. Several models based on directed graphical models for collective link prediction have been proposed by Getoor. Other approached based on random walks. and matrix factorization have also been proposed With the advent of deep learning, several graph embedding based approaches for link prediction have also been proposed. For more information on link prediction refer to the survey by Getoor et al. and Yu et al.

Approaches and methods

Several link predication approaches have been proposed including unsupervised approaches such as similarity measures computed on the entity attributes,

random walk In mathematics, a random walk is a random process that describes a path that consists of a succession of random steps on some mathematical space. An elementary example of a random walk is the random walk on the integer number line \mathbb Z ...

and

matrix factorization In the mathematical discipline of linear algebra, a matrix decomposition or matrix factorization is a factorization of a matrix into a product of matrices. There are many different matrix decompositions; each finds use among a particular class of ...

based approaches, and supervised approaches based on

graphical models ''Graphical Models'' is an academic journal in computer graphics and geometry processing publisher by Elsevier. , its editor-in-chief is Jorg Peters of the University of Florida. History This journal has gone through multiple names. Founded in 1 ...

and

deep learning Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised. De ...

. Link prediction approaches can be divided into two broad categories based on the type of the underlying network: (1) link prediction approaches for homogeneous networks (2) link prediction approaches for heterogeneous networks. Based on the type of information used to predict links, approaches can be categorized as topology-based approaches, content-based approaches, and mixed methods.

Topology-based methods

Topology-based methods broadly make the assumption that nodes with similar network structure are more likely to form a link.

Common neighbors

This is a common approach to link prediction that computes the number of common neighbors. Entities with more neighbors in common are more likely to have a link. It is computed as follows: :

CN(A,B) =

A weakness of this approach is that it does not take into account the relative number of common neighbors.

Jaccard measure

The Jaccard Measure addresses the problem of Common Neighbors by computing the relative number of neighbors in common: :

J(A,B) =

Adamic–Adar measure

The Adamic–Adar measure is the sum of the log of the intersection of the neighbors of two nodes. This captures a two-hop similarity, which can yield better results than simple one-hop methods. It is computed as follows: :

A(x,y) = \sum_ \frac,

where

N(u)

is the

set Set, The Set, SET or SETS may refer to: Science, technology, and mathematics Mathematics *Set (mathematics), a collection of elements *Category of sets, the category whose objects and morphisms are sets and total functions, respectively Electro ...

of nodes adjacent to

u

Katz measure

Neighbor based methods can be effective when the number of neighbors is large, but this is not the case in sparse graphs. In these situations it is appropriate to use methods that account for longer walks. The Katz Measure is one metric that captures this. It is computed by searching the graph for paths of length

t

in the graph and adding the counts of each path length weighted by user specified weights. Let ''A'' be the

adjacency matrix In graph theory and computer science, an adjacency matrix is a square matrix used to represent a finite graph. The elements of the matrix indicate whether pairs of vertices are adjacent or not in the graph. In the special case of a finite simp ...

of a network under consideration. Elements

(a_)

of ''A'' are variables that take a value 1 if a node ''i'' is connected to node ''j'' and 0 otherwise. The powers of ''A'' indicate the presence (or absence) of links between two nodes through intermediaries. For instance, in matrix

A^3

, if element

(a_) = 1

, it indicates that node 2 and node 12 are connected through some walk of length 3. If

C_(i)

denotes Katz centrality of a node ''i'', then mathematically: :

C_(i) = \sum_^\infty \sum_^n \alpha^k (A^k)_

Note that the above definition uses the fact that the element at location

(i,j)

A^k

reflects the total number of

k

degree connections between nodes

i

and

j

Node attribute-based methods

Node-similarity methods predict the existence of a link based on the similarity of the node attributes.

Euclidean distance

The attribute values are represented as normalized vector and the distance between the vectors used to measure similarity. Small distances indicate higher similarity.

Cosine similarity

After normalizing the attribute values, computing the cosine between the two vectors is a good measure of similarity, with higher values indicating higher similarity.

Mixed methods

Mixed methods combine attribute and topology based methods.

Graph embeddings

Graph embedding In topological graph theory, an embedding (also spelled imbedding) of a Graph (discrete mathematics), graph G on a surface (mathematics), surface \Sigma is a representation of G on \Sigma in which points of \Sigma are associated with graph the ...

s also offer a convenient way to predict links. Graph embedding algorithms, such as Node2vec, learn an embedding space in which neighboring nodes are represented by vectors so that vector similarity measures, such as

dot product In mathematics, the dot product or scalar productThe term ''scalar product'' means literally "product with a scalar as a result". It is also used sometimes for other symmetric bilinear forms, for example in a pseudo-Euclidean space. is an algebra ...

similarity, or euclidean distance, hold in the embedding space. These similarities are functions of both topological features and attribute-based similarity. One can then use other machine learning techniques to predict edges on the basis of vector similarity.

Probabilistic relationship models

A probabilistic relational model (PRM) specifies a template for a probability distribution over a databases. The template describes the relational schema for the domain, and the probabilistic dependencies between attributes in the domain. A PRM, together with a particular database of entities and unobserved links, defines a probability distribution over the unobserved links.

Probabilistic soft logic (PSL)

Probabilistic soft logic Probabilistic Soft Logic (PSL) is a statistical relational learning (SRL) framework for modeling probabilistic and relational domains. It is applicable to a variety of machine learning problems, such as collective classification, entity reso ...

(PSL) is a probabilistic graphical model over hinge-loss Markov random field (HL-MRF). HL-MRFs are created by a set of templated first-order logic-like rules, which are then grounded over the data. PSL can combine attribute, or local, information with topological, or relational, information. While PSL can incorporate local predictors, such as

cosine similarity In data analysis, cosine similarity is a measure of similarity between two sequences of numbers. For defining it, the sequences are viewed as vectors in an inner product space, and the cosine similarity is defined as the cosine of the angle betwe ...

, it also supports relational rules, such as triangle completion in a network.

Markov logic networks (MLNs)

Markov logic network A Markov logic network (MLN) is a probabilistic logic which applies the ideas of a Markov network to first-order logic, enabling uncertain inference. Markov logic networks generalize first-order logic, in the sense that, in a certain limit, all uns ...

s (MLNs) is a probabilistic graphical model defined over Markov networks. These networks are defined by templated first-order logic-like rules, which is then grounded over the training data. MLNs are able to incorporate both local and relational rules for the purpose of link prediction.

R-Model (RMLs)

R-Models (RMLs) is a neural network model created to provide a deep learning approach to the link weight prediction problem. This model uses a node embedding technique that extracts node embeddings (knowledge of nodes) from the known links’ weights (relations between nodes) and uses this knowledge to predict the unknown links’ weights.

Applications

Link prediction has found varied uses, but any domain in which entities interact in a structures way can benefit from link prediction. A common applications of link prediction is improving similarity measures for

collaborative filtering Collaborative filtering (CF) is a technique used by recommender systems.Francesco Ricci and Lior Rokach and Bracha ShapiraIntroduction to Recommender Systems Handbook Recommender Systems Handbook, Springer, 2011, pp. 1-35 Collaborative filtering ...

approaches to recommendation. Link prediction is also frequently used in social networks to suggest friends to users. It has also been used to predict criminal associations. In biology, link prediction has been used to predict interactions between proteins in protein-protein interaction networks. Link prediction has also been used to infer interactions between drugs and targets using link prediction Another application is found in collaboration prediction in scientific co-authorship networks.

Entity resolution Record linkage (also known as data matching, data linkage, entity resolution, and many other terms) is the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, and da ...

, also known as deduplication, commonly uses link prediction to predict whether two entities in a network are references to the same physical entity. Some authors have used context information in network structured domains to improve entity resolution. Link prediction in the context of network effects has been used to analyze tendencies to spread across networks and can be used to improve marketing strategies, in particular viral marketing.

Software packages

Free and open-source software

* Caffe * CNTK *

Deeplearning4j Eclipse Deeplearning4j is a programming library written in Java for the Java virtual machine (JVM). It is a framework with wide support for deep learning algorithms. Deeplearning4j includes implementations of the restricted Boltzmann machine, d ...

DeepSpeed DeepSpeed is an open source deep learning optimization library for PyTorch. The library is designed to reduce computing power and memory use and to train large distributed models with better parallelism on existing computer hardware. DeepSpeed i ...

ELKI ELKI (for ''Environment for DeveLoping KDD-Applications Supported by Index-Structures'') is a data mining (KDD, knowledge discovery in databases) software framework developed for use in research and teaching. It was originally at the database s ...

Infer.NET Infer.NET is a free and open source .NET software library for machine learning. It supports running Bayesian inference in graphical models and can also be used for probabilistic programming. Overview Infer.NET follows a model-based approach and ...

Keras Keras is an open-source software library that provides a Python interface for artificial neural networks. Keras acts as an interface for the TensorFlow library. Up until version 2.3, Keras supported multiple backends, including TensorFlow, Micro ...

Mahout A mahout is an elephant rider, trainer, or keeper. Mahouts were used since antiquity for both civilian and military use. Traditionally, mahouts came from ethnic groups with generations of elephant keeping experience, with a mahout retaining h ...

Mallet A mallet is a tool used for imparting force on another object, often made of rubber or sometimes wood, that is smaller than a maul or beetle, and usually has a relatively large head. The term is descriptive of the overall size and proport ...

ML.NET ML.NET is a free software machine learning library for the C# and F# programming languages. It also supports Python models when used together with NimbusML. The preview release of ML.NET included transforms for feature engineering like n-gram cr ...

mlpack mlpack is a machine learning software library for C++, built on top of the Armadillo library and thensmallennumerical optimization library. mlpack has an emphasis on scalability, speed, and ease-of-use. Its aim is to make machine learning possibl ...

MXNet Apache MXNet is an open-source deep learning software framework, used to train and deploy deep neural networks. It is scalable, allowing for fast model training and supports a flexible programming model and multiple programming languages (inclu ...

Octave In music, an octave ( la, octavus: eighth) or perfect octave (sometimes called the diapason) is the interval between one musical pitch and another with double its frequency. The octave relationship is a natural phenomenon that has been refer ...

OpenNN OpenNN (Open Neural Networks Library) is a software library written in the C++ programming language which implements neural networks, a main area of deep learning research. The library is open-source, licensed under the GNU Lesser General Public L ...

Orange Orange most often refers to: *Orange (fruit), the fruit of the tree species '' Citrus'' × ''sinensis'' ** Orange blossom, its fragrant flower *Orange (colour), from the color of an orange, occurs between red and yellow in the visible spectrum * ...

Perl Data Language Perl Data Language (abbreviated PDL) is a set of free software array programming extensions to the Perl programming language. PDL extends the data structures built into Perl, to include large multidimensional arrays, and adds functionality to m ...

Probabilistic Soft Logic Probabilistic Soft Logic (PSL) is a statistical relational learning (SRL) framework for modeling probabilistic and relational domains. It is applicable to a variety of machine learning problems, such as collective classification, entity reso ...

* R *

ROOT In vascular plants, the roots are the organs of a plant that are modified to provide anchorage for the plant and take in water and nutrients into the plant body, which allows plants to grow taller and faster. They are most often below the sur ...

(TMVA with ROOT) *

scikit-learn scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support-vector m ...

Shogun , officially , was the title of the military dictators of Japan during most of the period spanning from 1185 to 1868. Nominally appointed by the Emperor, shoguns were usually the de facto rulers of the country, though during part of the Kamakur ...

* Spark MLlib * SystemML *

TensorFlow TensorFlow is a free and open-source software library for machine learning and artificial intelligence. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks. "It is machine learning ...

Torch A torch is a stick with combustible material at one end, which is ignited and used as a light source. Torches have been used throughout history, and are still used in processions, symbolic and religious events, and in juggling entertainment. In ...

PyTorch PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing, originally developed by Meta AI and now part of the Linux Foundation umbrella. It is free and open ...

Weka The weka, also known as the Māori hen or woodhen (''Gallirallus australis'') is a flightless bird species of the rail family. It is endemic to New Zealand. It is the only extant member of the genus ''Gallirallus''. Four subspecies are recognize ...

MOA Moa are extinct giant flightless birds native to New Zealand. The term has also come to be used for chicken in many Polynesian cultures and is found in the names of many chicken recipes, such as Kale moa and Moa Samoa. Moa or MOA may also refer ...

Yooreeka Yooreeka is a library for data mining, machine learning, soft computing, and mathematical analysis. The project started with the code of the book "Algorithms of the Intelligent Web". Although the term "Web" prevailed in the title, in essence, the ...

Proprietary software with free and open-source editions

KNIME KNIME (), the Konstanz Information Miner, is a free and open-source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining "Building Blocks ...

RapidMiner RapidMiner is a data science platform designed for enterprises that analyses the collective impact of organizations’ employees, expertise and data. Rapid Miner's data science platform is intended to support many analytics users across a broad AI ...

Proprietary software

Amazon Machine Learning Amazon Web Services, Inc. (AWS) is a subsidiary of Amazon that provides on-demand cloud computing platforms and APIs to individuals, companies, and governments, on a metered pay-as-you-go basis. These cloud computing web services provide di ...

Angoss Angoss Software Corporation, headquartered in Toronto, Ontario, Canada, with offices in the United States and UK, acquired by Datawatch and now owned by Altair, was a provider of predictive analytics systems through software licensing and ser ...

KnowledgeSTUDIO *

Azure Machine Learning Microsoft Azure, often referred to as Azure ( , ), is a cloud computing platform operated by Microsoft for application management via around the world-distributed data centers. Microsoft Azure has multiple capabilities such as software as a ...

* Ayasdi *

IBM Watson Studio Watson Studio, formerly Data Science Experience or DSX, is IBM’s software platform for data science. The platform consists of a workspace that includes multiple collaboration and open-source tools for use in data science. In Watson Studio, a d ...

* Google Prediction API * IBM SPSS Modeler * KXEN Modeler *

LIONsolver LIONsolver is an integrated software for data mining, business intelligence, analytics, and modeling and reactive business intelligence approach. A non-profit version is also available as LIONoso. LIONsolver is used to build models, visualize th ...

Mathematica Wolfram Mathematica is a software system with built-in libraries for several areas of technical computing that allow machine learning, statistics, symbolic computation, data manipulation, network analysis, time series analysis, NLP, optimizat ...

MATLAB MATLAB (an abbreviation of "MATrix LABoratory") is a proprietary multi-paradigm programming language and numeric computing environment developed by MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementation ...

Microsoft Azure Microsoft Azure, often referred to as Azure ( , ), is a cloud computing platform operated by Microsoft for application management via around the world-distributed data centers. Microsoft Azure has multiple capabilities such as software as a ...

Neural Designer Neural Designer is a software tool for machine learning based on neural networks, a main area of artificial intelligence research, and contains a graphical user interface which simplifies data entry and interpretation of results. In 2015, Neural ...

NeuroSolutions NeuroSolutions is a neural network development environment developed by NeuroDimension. It combines a modular, icon-based (component-based) network design interface with an implementation of advanced learning procedures, such as conjugate gradie ...

Oracle Data Mining Oracle Data Mining (ODM) is an option of Oracle Database Enterprise Edition. It contains several data mining and data analysis algorithms for classification, prediction, regression, associations, feature selection, anomaly detection, feature ...

* Oracle AI Platform Cloud Service * RCASE * SAS Enterprise Miner *

SequenceL SequenceL is a general purpose functional programming language and auto-parallelizing (Parallel computing) compiler and tool set, whose primary design objectives are performance on multi-core processor hardware, ease of programming, platform porta ...

Splunk Splunk Inc. is an American software company based in San Francisco, California, that produces software for searching, monitoring, and analyzing machine-generated data via a Web-style interface. Its software helps capture, index and correlate ...

STATISTICA Statistica is an advanced analytics software package originally developed by StatSoft and currently maintained by TIBCO Software Inc. Statistica provides data analysis, data management, statistics, data mining, machine learning, text analytics a ...

Data Miner

References

{{Authority control Graph algorithms Link analysis Network theory