Datalog is a declarative

logic programming Logic programming is a programming, database and knowledge representation paradigm based on formal logic. A logic program is a set of sentences in logical form, representing knowledge about some problem domain. Computation is performed by applyin ...

language. While it is syntactically a subset of

Prolog Prolog is a logic programming language that has its origins in artificial intelligence, automated theorem proving, and computational linguistics. Prolog has its roots in first-order logic, a formal logic. Unlike many other programming language ...

, Datalog generally uses a bottom-up rather than top-down evaluation model. This difference yields significantly different behavior and properties from

. It is often used as a

query language A query language, also known as data query language or database query language (DQL), is a computer language used to make queries in databases and information systems. In database systems, query languages rely on strict theory to retrieve informa ...

for

deductive database A deductive database is a database system that can make deductions (i.e. conclude additional facts) based on rules and facts stored in its database. Datalog is the language typically used to specify facts, rules and queries in deductive database ...

s. Datalog has been applied to problems in

data integration Data integration refers to the process of combining, sharing, or synchronizing data from multiple sources to provide users with a unified view. There are a wide range of possible applications for data integration, from commercial (such as when a ...

, networking,

program analysis In computer science, program analysis is the process of analyzing the behavior of computer programs regarding a property such as correctness, robustness, safety and liveness. Program analysis focuses on two major areas: program optimization an ...

, and more.

Example

A Datalog program consists of ''facts'', which are statements that are held to be true, and ''rules'', which say how to deduce new facts from known facts. For example, here are two facts that mean ''xerces is a parent of brooke'' and ''brooke is a parent of damocles'': parent(xerces, brooke). parent(brooke, damocles). The names are written in lowercase because strings beginning with an uppercase letter stand for variables. Here are two rules: ancestor(X, Y) :- parent(X, Y). ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y). The :- symbol is read as "if", and the comma is read "and", so these rules mean: * ''X is an ancestor of Y if X is a parent of Y.'' * ''X is an ancestor of Y if X is a parent of some Z, and Z is an ancestor of Y.'' The meaning of a program is defined to be the set of all of the facts that can be deduced using the initial facts and the rules. This program's meaning is given by the following facts: parent(xerces, brooke). parent(brooke, damocles). ancestor(xerces, brooke). ancestor(brooke, damocles). ancestor(xerces, damocles). Some Datalog implementations don't deduce all possible facts, but instead answer ''queries'': ?- ancestor(xerces, X). This query asks: ''Who are all the X that xerces is an ancestor of?'' For this example, it would return ''brooke'' and ''damocles''.

Comparison to relational databases

The non-recursive subset of Datalog is closely related to query languages for

relational database A relational database (RDB) is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A Relational Database Management System (RDBMS) is a type of database management system that stores data in a structured for ...

s, such as SQL. The following table maps between Datalog,

relational algebra In database theory, relational algebra is a theory that uses algebraic structures for modeling data and defining queries on it with well founded semantics (computer science), semantics. The theory was introduced by Edgar F. Codd. The main applica ...

, and SQL concepts: More formally, non-recursive Datalog corresponds precisely to unions of conjunctive queries, or equivalently, negation-free relational algebra. s(x, y). t(y). r(A, B) :- s(A, B), t(B). CREATE TABLE s ( z0 TEXT NONNULL, z1 TEXT NONNULL, PRIMARY KEY (z0, z1) ); CREATE TABLE t ( z0 TEXT NONNULL PRIMARY KEY ); INSERT INTO s VALUES ('x', 'y'); INSERT INTO t VALUES ('y'); CREATE VIEW r AS SELECT s.z0, s.z1 FROM s, t WHERE s.z1 = t.z0;

Syntax

A Datalog program consists of a list of ''rules'' ( Horn clauses). If ''constant'' and ''variable'' are two

countable In mathematics, a Set (mathematics), set is countable if either it is finite set, finite or it can be made in one to one correspondence with the set of natural numbers. Equivalently, a set is ''countable'' if there exists an injective function fro ...

sets of constants and variables respectively and ''relation'' is a countable set of predicate symbols, then the following BNF grammar expresses the structure of a Datalog program: ::= , "" ::= ":-" "." ::= "(" ")" ::= , "," , "" ::= ,

::= , "," , ""

Atoms are also referred to as . The atom to the left of the :- symbol is called the of the rule; the atoms to the right are the . Every Datalog program must satisfy the condition that every variable that appears in the head of a rule also appears in the body (this condition is sometimes called the ). There are two common conventions for variable names: capitalizing variables, or prefixing them with a question mark ?. Note that under this definition, Datalog does include negation nor aggregates; see for more information about those constructs. Rules with empty bodies are called . For example, the following rule is a fact: r(x) :- . The set of facts is called the or of the Datalog program. The set of tuples computed by evaluating the Datalog program is called the or .

Syntactic sugar

Many implementations of logic programming extend the above grammar to allow writing facts without the :-, like so: r(x). Some also allow writing 0-ary relations without parentheses, like so: p :- q. These are merely abbreviations (

syntactic sugar In computer science, syntactic sugar is syntax within a programming language that is designed to make things easier to read or to express. It makes the language "sweeter" for human use: things can be expressed more clearly, more concisely, or in an ...

); they have no impact on the semantics of the program.

Semantics

There are three widely-used approaches to the semantics of Datalog programs: model-theoretic, fixed-point, and proof-theoretic. These three approaches can be proven equivalent. An atom is called if none of its subterms are variables. Intuitively, each of the semantics define the meaning of a program to be the set of all ground atoms that can be deduced from the rules of the program, starting from the facts.

Model theoretic

A rule is called ground if all of its atoms (head and body) are ground. A ground rule ''R''₁ is a ''ground instance'' of another rule ''R''₂ if ''R''₁ is the result of a substitution of constants for all the variables in ''R''₂. The '' Herbrand base'' of a Datalog program is the set of all ground atoms that can be made with the constants appearing in the program. The of a Datalog program is the smallest subset of the Herbrand base such that, for each ground instance of each rule in the program, if the atoms in the body of the rule are in the set, then so is the head. The model-theoretic semantics define the minimal Herbrand model to be the meaning of the program.

Fixed-point

Let be the

power set In mathematics, the power set (or powerset) of a set is the set of all subsets of , including the empty set and itself. In axiomatic set theory (as developed, for example, in the ZFC axioms), the existence of the power set of any set is po ...

of the Herbrand base of a program ''P''. The ''immediate consequence operator'' for ''P'' is a map from to that adds all of the new ground atoms that can be derived from the rules of the program in a single step. The least-fixed-point semantics define the least fixed point of to be the meaning of the program; this coincides with the minimal Herbrand model. The

fixpoint In mathematics, a fixed point (sometimes shortened to fixpoint), also known as an invariant point, is a value that does not change under a given transformation. Specifically, for functions, a fixed point is an element that is mapped to itself ...

semantics suggest an algorithm for computing the minimal model: Start with the set of ground facts in the program, then repeatedly add consequences of the rules until a fixpoint is reached. This algorithm is called naïve evaluation.

Proof-theoretic

The proof-theoretic semantics defines the meaning of a Datalog program to be the set of facts with corresponding ''proof trees''. Intuitively, a proof tree shows how to derive a fact from the facts and rules of a program. One might be interested in knowing whether or not a particular ground atom appears in the minimal Herbrand model of a Datalog program, perhaps without caring much about the rest of the model. A top-down reading of the proof trees described above suggests an algorithm for computing the results of such ''queries''. This reading informs the

SLD resolution SLD resolution (''Selective Linear Definite'' clause resolution) is the basic rule of inference, inference rule used in logic programming. It is a refinement of Resolution (logic), resolution, which is both Soundness, sound and refutation Completen ...

algorithm, which forms the basis for the evaluation of

Evaluation

There are many different ways to evaluate a Datalog program, with different performance characteristics.

Bottom-up evaluation strategies

Bottom-up evaluation strategies start with the facts in the program and repeatedly apply the rules until either some goal or query is established, or until the complete minimal model of the program is produced.

Naïve evaluation

''Naïve evaluation'' mirrors the fixpoint semantics for Datalog programs. Naïve evaluation uses a set of "known facts", which is initialized to the facts in the program. It proceeds by repeatedly enumerating all ground instances of each rule in the program. If each atom in the body of the ground instance is in the set of known facts, then the head atom is added to the set of known facts. This process is repeated until a fixed point is reached, and no more facts may be deduced. Naïve evaluation produces the entire minimal model of the program.

Semi-naïve evaluation

Semi-naïve evaluation is a bottom-up evaluation strategy that can be asymptotically faster than naïve evaluation.

Performance considerations

Naïve and semi-naïve evaluation both evaluate recursive Datalog rules by repeatedly applying them to a set of known facts until a fixed point is reached. In each iteration, rules are only run for "one step", i.e., non-recursively. As mentioned above, each non-recursive Datalog rule corresponds precisely to a

conjunctive query In database theory, a conjunctive query is a restricted form of first-order queries using the logical conjunction operator. Many first-order queries can be written as conjunctive queries. In particular, a large part of queries issued on relational ...

. Therefore, many of the techniques from

database theory Database theory encapsulates a broad range of topics related to the study and research of the theoretical realm of databases and database management systems. Theoretical aspects of data management include, among other areas, the foundations of q ...

used to speed up conjunctive queries are applicable to bottom-up evaluation of Datalog, such as *

Index Index (: indexes or indices) may refer to: Arts, entertainment, and media Fictional entities * Index (''A Certain Magical Index''), a character in the light novel series ''A Certain Magical Index'' * The Index, an item on the Halo Array in the ...

selection *

Query optimization Query optimization is a feature of many relational database management systems and other databases such as NoSQL and graph databases. The query optimizer attempts to determine the most efficient way to execute a given query by considering the po ...

, especially join order * Join algorithms * Selection of

data structure In computer science, a data structure is a data organization and storage format that is usually chosen for Efficiency, efficient Data access, access to data. More precisely, a data structure is a collection of data values, the relationships amo ...

s used to store relations; common choices include

hash table In computer science, a hash table is a data structure that implements an associative array, also called a dictionary or simply map; an associative array is an abstract data type that maps Unique key, keys to Value (computer science), values. ...

s and

B-tree In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing fo ...

s, other possibilities include disjoint set data structures (for storing

equivalence relation In mathematics, an equivalence relation is a binary relation that is reflexive, symmetric, and transitive. The equipollence relation between line segments in geometry is a common example of an equivalence relation. A simpler example is equ ...

s), bries (a variant of

trie In computer science, a trie (, ), also known as a digital tree or prefix tree, is a specialized search tree data structure used to store and retrieve strings from a dictionary or set. Unlike a binary search tree, nodes in a trie do not store t ...

s), binary decision diagrams, and even SMT formulas Many such techniques are implemented in modern bottom-up Datalog engines such as Soufflé. Some Datalog engines integrate SQL databases directly. Bottom-up evaluation of Datalog is also amenable to

parallelization Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different for ...

. Parallel Datalog engines are generally divided into two paradigms: * In the shared-memory, multi-core setting, Datalog engines execute on a single node. Coordination between threads may be achieved using locking or lock-free data structures. The shared-memory setting may be further divided into

single instruction, multiple data Single instruction, multiple data (SIMD) is a type of parallel computer, parallel processing in Flynn's taxonomy. SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneousl ...

and multiple instruction, multiple data paradigms: ** Datalog engines that execute on

graphics processing unit A graphics processing unit (GPU) is a specialized electronic circuit designed for digital image processing and to accelerate computer graphics, being present either as a discrete video card or embedded on motherboards, mobile phones, personal ...

s fall into the SIMD paradigm. ** Datalog engines using

OpenMP OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, ...

are instances of the MIMD paradigm. * In the shared-nothing setting, Datalog engines execute on a cluster of nodes. Such engines generally operate by splitting relations into disjoint subsets based on a

hash function A hash function is any Function (mathematics), function that can be used to map data (computing), data of arbitrary size to fixed-size values, though there are some hash functions that support variable-length output. The values returned by a ...

, performing computations (joins) on each node, and then exchanging newly-generated tuples over the network. Examples include Datalog engines based on MPI,

Hadoop Apache Hadoop () is a collection of Open-source software, open-source software utilities for reliable, scalable, distributed computing. It provides a software framework for Clustered file system, distributed storage and processing of big data usin ...

, and Spark.

Top-down evaluation strategies

is sound and complete for Datalog programs.

Magic sets

Top-down evaluation strategies begin with a ''query'' or ''goal''. Bottom-up evaluation strategies can answer queries by computing the entire minimal model and matching the query against it, but this can be inefficient if the answer only depends on a small subset of the entire model. The ''magic sets'' algorithm takes a Datalog program and a query, and produces a more efficient program that computes the same answer to the query while still using bottom-up evaluation. A variant of the magic sets algorithm has been shown to produce programs that, when evaluated using semi-naïve evaluation, are as efficient as top-down evaluation.

Complexity

The

decision problem In computability theory and computational complexity theory, a decision problem is a computational problem that can be posed as a yes–no question on a set of input values. An example of a decision problem is deciding whether a given natura ...

formulation of Datalog evaluation is as follows: Given a Datalog program split into a set of facts (EDB) and a set of rules , and a ground atom , is in the minimal model of ? In this formulation, there are three variations of the

computational complexity In computer science, the computational complexity or simply complexity of an algorithm is the amount of resources required to run it. Particular focus is given to computation time (generally measured by the number of needed elementary operations ...

of evaluating Datalog programs: * The is the complexity of the decision problem when and are inputs and is fixed. * The is the complexity of the decision problem when and are inputs and is fixed. * The is the complexity of the decision problem when , , and are inputs. With respect to data complexity, the decision problem for Datalog is

P-complete In computational complexity theory, a decision problem is P-complete ( complete for the complexity class P) if it is in P and every problem in P can be reduced to it by an appropriate reduction. The notion of P-complete decision problems is use ...

(See Theorem 4.4 in ). P-completeness for data complexity means that there exists a fixed datalog query for which evaluation is P-complete. The proof is based on ''Datalog metainterpreter'' for propositional logic programs. With respect to program complexity, the decision problem is EXPTIME-complete. In particular, evaluating Datalog programs always terminates; Datalog is not

Turing-complete In computability theory, a system of data-manipulation rules (such as a model of computation, a computer's instruction set, a programming language, or a cellular automaton) is said to be Turing-complete or computationally universal if it can be ...

. Some extensions to Datalog do not preserve these complexity bounds. Extensions implemented in some Datalog engines, such as algebraic data types, can even make the resulting language Turing-complete.

Extensions

Several extensions have been made to Datalog, e.g., to support negation,

aggregate function In database management, an aggregate function or aggregation function is a function where multiple values are processed together to form a single summary statistic. Common aggregate functions include: * Average (i.e., arithmetic mean) * Count ...

s, inequalities, to allow

object-oriented programming Object-oriented programming (OOP) is a programming paradigm based on the concept of '' objects''. Objects can contain data (called fields, attributes or properties) and have actions they can perform (called procedures or methods and impl ...

, or to allow

disjunction In logic, disjunction (also known as logical disjunction, logical or, logical addition, or inclusive disjunction) is a logical connective typically notated as \lor and read aloud as "or". For instance, the English language sentence "it is ...

s as heads of

clause In language, a clause is a Constituent (linguistics), constituent or Phrase (grammar), phrase that comprises a semantic predicand (expressed or not) and a semantic Predicate (grammar), predicate. A typical clause consists of a subject (grammar), ...

s. These extensions have significant impacts on the language's semantics and on the implementation of a corresponding interpreter. Datalog is a syntactic subset of

, disjunctive Datalog, answer set programming, DatalogZ, and constraint logic programming. When evaluated as an answer set program, a Datalog program yields a single answer set, which is exactly its minimal model. Many implementations of Datalog extend Datalog with additional features; see for more information.

Aggregation

Datalog can be extended to support

s. Notable Datalog engines that implement aggregation include: * LogicBlox * Soufflé

Negation

Adding negation to Datalog complicates its semantics, leading to whole new languages and strategies for evaluation. For example, the language that results from adding negation with the stable model semantics is exactly answer set programming. Stratified negation can be added to Datalog while retaining its model-theoretic and fixed-point semantics. Notable Datalog engines that implement stratified negation include: * LogicBlox * Soufflé

Comparison to Prolog

Unlike in

, statements of a Datalog program can be stated in any order. Datalog does not have Prolog's cut operator. This makes Datalog a fully

declarative language In computer science, declarative programming is a programming paradigm—a style of building the structure and elements of computer programs—that expresses the logic of a computation without describing its control flow. Many languages that app ...

. In contrast to Prolog, Datalog * disallows complex terms as arguments of predicates, e.g., p(x, y) is admissible but not p(f(x), y), * disallows negation, * requires that every variable that appears in the head of a

also appear in a literal in the body of the clause. This article deals primarily with Datalog without negation (see also ). However, stratified negation is a common addition to Datalog; the following list contrasts

with Datalog with stratified negation. Datalog with stratified negation * also disallows complex terms as arguments of predicates, * requires that every variable that appears in the head of a

also appear in a positive (i.e., not negated) atom in the body of the clause, * requires that every variable appearing in a negative literal in the body of a clause also appear in some positive literal in the body of the clause.

Expressiveness

Datalog generalizes many other query languages. For instance, conjunctive queries and union of conjunctive queries can be expressed in Datalog. Datalog can also express regular path queries. When we consider ''ordered databases'', i.e., databases with an

order relation Order theory is a branch of mathematics that investigates the intuitive notion of order using binary relations. It provides a formal framework for describing statements such as "this is less than that" or "this precedes that". This article intro ...

on their active domain, then the Immerman–Vardi theorem implies that the expressive power of Datalog is precisely that of the class PTIME: a property can be expressed in Datalog if and only if it is computable in polynomial time. The for Datalog asks, given a Datalog program, whether it is , i.e., the maximal recursion depth reached when evaluating the program on an input database can be bounded by some constant. In other words, this question asks whether the Datalog program could be rewritten as a nonrecursive Datalog program, or, equivalently, as a union of conjunctive queries. Solving the boundedness problem on arbitrary Datalog programs is undecidable, but it can be made decidable by restricting to some fragments of Datalog.

Datalog engines

Systems that implement languages inspired by Datalog, whether

compiler In computing, a compiler is a computer program that Translator (computing), translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primaril ...

s, interpreters,

libraries A library is a collection of Book, books, and possibly other Document, materials and Media (communication), media, that is accessible for use by its members and members of allied institutions. Libraries provide physical (hard copies) or electron ...

, or embedded DSLs, are referred to as . Datalog engines often implement extensions of Datalog, extending it with additional

data type In computer science and computer programming, a data type (or simply type) is a collection or grouping of data values, usually specified by a set of possible values, a set of allowed operations on these values, and/or a representation of these ...

s, foreign function interfaces, or support for user-defined lattices. Such extensions may allow for writing non-terminating or otherwise ill-defined programs. Here is a short list of systems that are either based on Datalog or provide a Datalog interpreter:

Free software/open source

Non-free software

* FoundationDB provides a free-of-charge database binding for pyDatalog, with a tutorial on its use. * Leapsight Semantic Dataspace (LSD) is a distributed deductive database that offers high availability, fault tolerance, operational simplicity, and scalability. LSD uses Leaplog (a Datalog implementation) for querying and reasoning and was create by Leapsight. * LogicBlox, a commercial implementation of Datalog used for web-based retail planning and insurance applications. * Profium Sense is a native RDF compliant

graph database A graph database (GDB) is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A key concept of the system is the graph (or edge or relationship). The graph relates the dat ...

written in Java. It provides Datalog evaluation support of user defined rules. * .QL, a commercial object-oriented variant of Datalog created by Semmle for analyzing source code to detect security vulnerabilities. * SecPAL a security policy language developed by

Microsoft Research Microsoft Research (MSR) is the research subsidiary of Microsoft. It was created in 1991 by Richard Rashid, Bill Gates and Nathan Myhrvold with the intent to advance state-of-the-art computing and solve difficult world problems through technologi ...

. * Stardog is a

, implemented in

Java Java is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea (a part of Pacific Ocean) to the north. With a population of 156.9 million people (including Madura) in mid 2024, proje ...

. It provides support for RDF and all OWL 2 profiles providing extensive reasoning capabilities, including datalog evaluation. * StrixDB: a commercial RDF graph store, SPARQL compliant with Lua API and Datalog inference capabilities. Could be used as httpd (

Apache HTTP Server The Apache HTTP Server ( ) is a free and open-source software, free and open-source cross-platform web server, released under the terms of Apache License, Apache License 2.0. It is developed and maintained by a community of developers under the ...

) module or standalone (although beta versions are under the Perl Artistic License 2.0).

Uses and influence

Datalog is quite limited in its expressivity. It is not

, and doesn't include basic data types such as

integers An integer is the number zero (0), a positive natural number (1, 2, 3, ...), or the negation of a positive natural number (−1, −2, −3, ...). The negations or additive inverses of the positive natural numbers are referred to as negative in ...

or strings. This parsimony is appealing from a theoretical standpoint, but it means Datalog ''per se'' is rarely used as a programming language or

knowledge representation Knowledge representation (KR) aims to model information in a structured manner to formally represent it as knowledge in knowledge-based systems whereas knowledge representation and reasoning (KRR, KR&R, or KR²) also aims to understand, reason, and ...

language. Most Datalog engines implement substantial extensions of Datalog. However, Datalog has a strong influence on such implementations, and many authors don't bother to distinguish them from Datalog as presented in this article. Accordingly, the applications discussed in this section include applications of realistic implementations of Datalog-based languages. Datalog has been applied to problems in

, information extraction, networking, security,

cloud computing Cloud computing is "a paradigm for enabling network access to a scalable and elastic pool of shareable physical or virtual resources with self-service provisioning and administration on-demand," according to International Organization for ...

and

machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...

Google Google LLC (, ) is an American multinational corporation and technology company focusing on online advertising, search engine technology, cloud computing, computer software, quantum computing, e-commerce, consumer electronics, and artificial ...

has developed an extension to Datalog for

big data Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data processing, data-processing application software, software. Data with many entries (rows) offer greater statistical power, while data with ...

processing. Datalog has seen application in

static program analysis In computer science, static program analysis (also known as static analysis or static simulation) is the analysis of computer programs performed without executing them, in contrast with dynamic program analysis, which is performed on programs duri ...

. The Soufflé dialect has been used to write pointer analyses for

and a

control-flow analysis In computer science, control-flow analysis (CFA) is a static code analysis, static-code-analysis technique for determining the control flow of a program. The control flow is expressed as a control-flow graph (CFG). For both functional programming ...

for Scheme. Datalog has been integrated with SMT solvers to make it easier to write certain static analyses. The Flix dialect is also suited to writing static program analyses. Some widely used database systems include ideas and algorithms developed for Datalog. For example, the SQL:1999 standard includes recursive queries, and the Magic Sets algorithm (initially developed for the faster evaluation of Datalog queries) is implemented in IBM's DB2.

History

The origins of Datalog date back to the beginning of

, but it became prominent as a separate area around 1977 when Hervé Gallaire and Jack Minker organized a workshop on

logic Logic is the study of correct reasoning. It includes both formal and informal logic. Formal logic is the study of deductively valid inferences or logical truths. It examines how conclusions follow from premises based on the structure o ...

and

database In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and a ...

s. David Maier is credited with coining the term Datalog..

Notes

References

* * {{Authority control Query languages Logic programming languages Declarative programming languages