Semi-naïve Evaluation
   HOME

TheInfoList



OR:

Datalog is a
declarative Declarative may refer to: * Declarative learning, acquiring information that one can speak about * Declarative memory, one of two types of long term human memory * Declarative programming, a computer programming paradigm * Declarative sentence, a t ...
logic programming Logic programming is a programming, database and knowledge representation paradigm based on formal logic. A logic program is a set of sentences in logical form, representing knowledge about some problem domain. Computation is performed by applyin ...
language. While it is syntactically a subset of
Prolog Prolog is a logic programming language that has its origins in artificial intelligence, automated theorem proving, and computational linguistics. Prolog has its roots in first-order logic, a formal logic. Unlike many other programming language ...
, Datalog generally uses a bottom-up rather than top-down evaluation model. This difference yields significantly different behavior and properties from
Prolog Prolog is a logic programming language that has its origins in artificial intelligence, automated theorem proving, and computational linguistics. Prolog has its roots in first-order logic, a formal logic. Unlike many other programming language ...
. It is often used as a
query language A query language, also known as data query language or database query language (DQL), is a computer language used to make queries in databases and information systems. In database systems, query languages rely on strict theory to retrieve informa ...
for
deductive database A deductive database is a database system that can make deductions (i.e. conclude additional facts) based on rules and facts stored in its database. Datalog is the language typically used to specify facts, rules and queries in deductive database ...
s. Datalog has been applied to problems in
data integration Data integration refers to the process of combining, sharing, or synchronizing data from multiple sources to provide users with a unified view. There are a wide range of possible applications for data integration, from commercial (such as when a ...
, networking,
program analysis In computer science, program analysis is the process of analyzing the behavior of computer programs regarding a property such as correctness, robustness, safety and liveness. Program analysis focuses on two major areas: program optimization an ...
, and more.


Example

A Datalog program consists of ''facts'', which are statements that are held to be true, and ''rules'', which say how to deduce new facts from known facts. For example, here are two facts that mean ''xerces is a parent of brooke'' and ''brooke is a parent of damocles'': parent(xerces, brooke). parent(brooke, damocles). The names are written in lowercase because strings beginning with an uppercase letter stand for variables. Here are two rules: ancestor(X, Y) :- parent(X, Y). ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y). The :- symbol is read as "if", and the comma is read "and", so these rules mean: * ''X is an ancestor of Y if X is a parent of Y.'' * ''X is an ancestor of Y if X is a parent of some Z, and Z is an ancestor of Y.'' The meaning of a program is defined to be the set of all of the facts that can be deduced using the initial facts and the rules. This program's meaning is given by the following facts: parent(xerces, brooke). parent(brooke, damocles). ancestor(xerces, brooke). ancestor(brooke, damocles). ancestor(xerces, damocles). Some Datalog implementations don't deduce all possible facts, but instead answer ''queries'': ?- ancestor(xerces, X). This query asks: ''Who are all the X that xerces is an ancestor of?'' For this example, it would return ''brooke'' and ''damocles''.


Comparison to relational databases

The non-recursive subset of Datalog is closely related to query languages for
relational database A relational database (RDB) is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A Relational Database Management System (RDBMS) is a type of database management system that stores data in a structured for ...
s, such as
SQL Structured Query Language (SQL) (pronounced ''S-Q-L''; or alternatively as "sequel") is a domain-specific language used to manage data, especially in a relational database management system (RDBMS). It is particularly useful in handling s ...
. The following table maps between Datalog,
relational algebra In database theory, relational algebra is a theory that uses algebraic structures for modeling data and defining queries on it with well founded semantics (computer science), semantics. The theory was introduced by Edgar F. Codd. The main applica ...
, and
SQL Structured Query Language (SQL) (pronounced ''S-Q-L''; or alternatively as "sequel") is a domain-specific language used to manage data, especially in a relational database management system (RDBMS). It is particularly useful in handling s ...
concepts: More formally, non-recursive Datalog corresponds precisely to
unions of conjunctive queries In database theory, a conjunctive query is a restricted form of first-order queries using the logical conjunction operator. Many first-order queries can be written as conjunctive queries. In particular, a large part of queries issued on relational ...
, or equivalently, negation-free relational algebra. s(x, y). t(y). r(A, B) :- s(A, B), t(B). CREATE TABLE s ( z0 TEXT NONNULL, z1 TEXT NONNULL, PRIMARY KEY (z0, z1) ); CREATE TABLE t ( z0 TEXT NONNULL PRIMARY KEY ); INSERT INTO s VALUES ('x', 'y'); INSERT INTO t VALUES ('y'); CREATE VIEW r AS SELECT s.z0, s.z1 FROM s, t WHERE s.z1 = t.z0;


Syntax

A Datalog program consists of a list of ''rules'' (
Horn clause In mathematical logic and logic programming, a Horn clause is a logical formula of a particular rule-like form that gives it useful properties for use in logic programming, formal specification, universal algebra and model theory. Horn clauses are ...
s). If ''constant'' and ''variable'' are two
countable In mathematics, a Set (mathematics), set is countable if either it is finite set, finite or it can be made in one to one correspondence with the set of natural numbers. Equivalently, a set is ''countable'' if there exists an injective function fro ...
sets of constants and variables respectively and ''relation'' is a countable set of predicate symbols, then the following BNF grammar expresses the structure of a Datalog program: ::= , "" ::= ":-" "." ::= "(" ")" ::= , "," , "" ::= , ::= , "," , "" Atoms are also referred to as . The atom to the left of the :- symbol is called the of the rule; the atoms to the right are the . Every Datalog program must satisfy the condition that every variable that appears in the head of a rule also appears in the body (this condition is sometimes called the ). There are two common conventions for variable names: capitalizing variables, or prefixing them with a question mark ?. Note that under this definition, Datalog does include negation nor aggregates; see for more information about those constructs. Rules with empty bodies are called . For example, the following rule is a fact: r(x) :- . The set of facts is called the or of the Datalog program. The set of tuples computed by evaluating the Datalog program is called the or .


Syntactic sugar

Many implementations of logic programming extend the above grammar to allow writing facts without the :-, like so: r(x). Some also allow writing 0-ary relations without parentheses, like so: p :- q. These are merely abbreviations (
syntactic sugar In computer science, syntactic sugar is syntax within a programming language that is designed to make things easier to read or to express. It makes the language "sweeter" for human use: things can be expressed more clearly, more concisely, or in an ...
); they have no impact on the semantics of the program.


Semantics

There are three widely-used approaches to the semantics of Datalog programs:
model-theoretic In mathematical logic, model theory is the study of the relationship between formal theories (a collection of sentences in a formal language expressing statements about a mathematical structure), and their models (those structures in which the ...
, fixed-point, and
proof-theoretic Proof theory is a major branchAccording to , proof theory is one of four domains mathematical logic, together with model theory, axiomatic set theory, and recursion theory. consists of four corresponding parts, with part D being about "Proof Theor ...
. These three approaches can be proven equivalent. An atom is called if none of its subterms are variables. Intuitively, each of the semantics define the meaning of a program to be the set of all ground atoms that can be deduced from the rules of the program, starting from the facts.


Model theoretic

A rule is called ground if all of its atoms (head and body) are ground. A ground rule ''R''1 is a ''ground instance'' of another rule ''R''2 if ''R''1 is the result of a
substitution Substitution may refer to: Arts and media *Substitution (poetry), a variation in poetic scansion * Substitution (theatre), an acting methodology Music *Chord substitution, swapping one chord for a related one within a chord progression *Tritone ...
of constants for all the variables in ''R''2. The ''
Herbrand base In first-order logic, a Herbrand structure S is a structure over a vocabulary \sigma (also sometimes called a ''signature'') that is defined solely by the syntactical properties of \sigma. The idea is to take the symbol strings of terms as their ...
'' of a Datalog program is the set of all ground atoms that can be made with the constants appearing in the program. The of a Datalog program is the smallest subset of the Herbrand base such that, for each ground instance of each rule in the program, if the atoms in the body of the rule are in the set, then so is the head. The model-theoretic semantics define the minimal Herbrand model to be the meaning of the program.


Fixed-point

Let be the
power set In mathematics, the power set (or powerset) of a set is the set of all subsets of , including the empty set and itself. In axiomatic set theory (as developed, for example, in the ZFC axioms), the existence of the power set of any set is po ...
of the Herbrand base of a program ''P''. The ''immediate consequence operator'' for ''P'' is a map from to that adds all of the new ground atoms that can be derived from the rules of the program in a single step. The least-fixed-point semantics define the least fixed point of to be the meaning of the program; this coincides with the minimal Herbrand model. The
fixpoint In mathematics, a fixed point (sometimes shortened to fixpoint), also known as an invariant point, is a value that does not change under a given transformation. Specifically, for functions, a fixed point is an element that is mapped to itself ...
semantics suggest an algorithm for computing the minimal model: Start with the set of ground facts in the program, then repeatedly add consequences of the rules until a fixpoint is reached. This algorithm is called naïve evaluation.


Proof-theoretic

The proof-theoretic semantics defines the meaning of a Datalog program to be the set of facts with corresponding ''proof trees''. Intuitively, a proof tree shows how to derive a fact from the facts and rules of a program. One might be interested in knowing whether or not a particular ground atom appears in the minimal Herbrand model of a Datalog program, perhaps without caring much about the rest of the model. A top-down reading of the proof trees described above suggests an algorithm for computing the results of such ''queries''. This reading informs the
SLD resolution SLD resolution (''Selective Linear Definite'' clause resolution) is the basic rule of inference, inference rule used in logic programming. It is a refinement of Resolution (logic), resolution, which is both Soundness, sound and refutation Completen ...
algorithm, which forms the basis for the evaluation of
Prolog Prolog is a logic programming language that has its origins in artificial intelligence, automated theorem proving, and computational linguistics. Prolog has its roots in first-order logic, a formal logic. Unlike many other programming language ...
.


Evaluation

There are many different ways to evaluate a Datalog program, with different performance characteristics.


Bottom-up evaluation strategies

Bottom-up evaluation strategies start with the facts in the program and repeatedly apply the rules until either some goal or query is established, or until the complete minimal model of the program is produced.


Naïve evaluation

''Naïve evaluation'' mirrors the fixpoint semantics for Datalog programs. Naïve evaluation uses a set of "known facts", which is initialized to the facts in the program. It proceeds by repeatedly enumerating all ground instances of each rule in the program. If each atom in the body of the ground instance is in the set of known facts, then the head atom is added to the set of known facts. This process is repeated until a fixed point is reached, and no more facts may be deduced. Naïve evaluation produces the entire minimal model of the program.


Semi-naïve evaluation

Semi-naïve evaluation is a bottom-up evaluation strategy that can be asymptotically faster than naïve evaluation.


Performance considerations

Naïve and semi-naïve evaluation both evaluate recursive Datalog rules by repeatedly applying them to a set of known facts until a fixed point is reached. In each iteration, rules are only run for "one step", i.e., non-recursively. As mentioned
above Above may refer to: *Above (artist) Tavar Zawacki (b. 1981, California) is a Polish, Portuguese - American abstract artist and internationally recognized visual artist based in Berlin, Germany. From 1996 to 2016, he created work under the ...
, each non-recursive Datalog rule corresponds precisely to a
conjunctive query In database theory, a conjunctive query is a restricted form of first-order queries using the logical conjunction operator. Many first-order queries can be written as conjunctive queries. In particular, a large part of queries issued on relational ...
. Therefore, many of the techniques from
database theory Database theory encapsulates a broad range of topics related to the study and research of the theoretical realm of databases and database management systems. Theoretical aspects of data management include, among other areas, the foundations of q ...
used to speed up conjunctive queries are applicable to bottom-up evaluation of Datalog, such as *
Index Index (: indexes or indices) may refer to: Arts, entertainment, and media Fictional entities * Index (''A Certain Magical Index''), a character in the light novel series ''A Certain Magical Index'' * The Index, an item on the Halo Array in the ...
selection *
Query optimization Query optimization is a feature of many relational database management systems and other databases such as NoSQL and graph databases. The query optimizer attempts to determine the most efficient way to execute a given query by considering the po ...
, especially join order * Join algorithms * Selection of
data structure In computer science, a data structure is a data organization and storage format that is usually chosen for Efficiency, efficient Data access, access to data. More precisely, a data structure is a collection of data values, the relationships amo ...
s used to store relations; common choices include
hash table In computer science, a hash table is a data structure that implements an associative array, also called a dictionary or simply map; an associative array is an abstract data type that maps Unique key, keys to Value (computer science), values. ...
s and
B-tree In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing fo ...
s, other possibilities include disjoint set data structures (for storing
equivalence relation In mathematics, an equivalence relation is a binary relation that is reflexive, symmetric, and transitive. The equipollence relation between line segments in geometry is a common example of an equivalence relation. A simpler example is equ ...
s), bries (a variant of
trie In computer science, a trie (, ), also known as a digital tree or prefix tree, is a specialized search tree data structure used to store and retrieve strings from a dictionary or set. Unlike a binary search tree, nodes in a trie do not store t ...
s),
binary decision diagram In computer science, a binary decision diagram (BDD) or branching program is a data structure that is used to represent a Boolean function. On a more abstract level, BDDs can be considered as a compressed representation of sets or relations. Un ...
s, and even SMT formulas Many such techniques are implemented in modern bottom-up Datalog engines such as
Soufflé A soufflé () is a baked egg dish originating in France in the early 18th century. Combined with various other ingredients, it can be served as a savoury main dish or sweetened as a dessert. The word ''soufflé'' is the past participle of the Fr ...
. Some Datalog engines integrate SQL databases directly. Bottom-up evaluation of Datalog is also amenable to
parallelization Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different for ...
. Parallel Datalog engines are generally divided into two paradigms: * In the shared-memory, multi-core setting, Datalog engines execute on a single node. Coordination between threads may be achieved using locking or lock-free data structures. The shared-memory setting may be further divided into
single instruction, multiple data Single instruction, multiple data (SIMD) is a type of parallel computer, parallel processing in Flynn's taxonomy. SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneousl ...
and
multiple instruction, multiple data In computing, multiple instruction, multiple data (MIMD) is a technique employed to achieve parallelism. Machines using MIMD have a number of processor cores that function asynchronously and independently. At any time, different processors may b ...
paradigms: ** Datalog engines that execute on
graphics processing unit A graphics processing unit (GPU) is a specialized electronic circuit designed for digital image processing and to accelerate computer graphics, being present either as a discrete video card or embedded on motherboards, mobile phones, personal ...
s fall into the SIMD paradigm. ** Datalog engines using
OpenMP OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, ...
are instances of the MIMD paradigm. * In the shared-nothing setting, Datalog engines execute on a
cluster may refer to: Science and technology Astronomy * Cluster (spacecraft), constellation of four European Space Agency spacecraft * Cluster II (spacecraft), a European Space Agency mission to study the magnetosphere * Asteroid cluster, a small ...
of nodes. Such engines generally operate by splitting relations into disjoint subsets based on a
hash function A hash function is any Function (mathematics), function that can be used to map data (computing), data of arbitrary size to fixed-size values, though there are some hash functions that support variable-length output. The values returned by a ...
, performing computations (joins) on each node, and then exchanging newly-generated tuples over the network. Examples include Datalog engines based on
MPI MPI or Mpi may refer to: Science and technology Biology and medicine * Magnetic particle imaging, a tomographic technique * Myocardial perfusion imaging, a medical procedure that illustrates heart function * Mannose phosphate isomerase, an enzyme ...
,
Hadoop Apache Hadoop () is a collection of Open-source software, open-source software utilities for reliable, scalable, distributed computing. It provides a software framework for Clustered file system, distributed storage and processing of big data usin ...
, and
Spark Spark commonly refers to: * Spark (fire), a small glowing particle or ember * Electric spark, a form of electrical discharge Spark may also refer to: People * Spark (surname) * Jessica Morgan (born 1992; formerly known as Spark), female singe ...
.


Top-down evaluation strategies

SLD resolution SLD resolution (''Selective Linear Definite'' clause resolution) is the basic rule of inference, inference rule used in logic programming. It is a refinement of Resolution (logic), resolution, which is both Soundness, sound and refutation Completen ...
is sound and complete for Datalog programs.


Magic sets

Top-down evaluation strategies begin with a ''query'' or ''goal''. Bottom-up evaluation strategies can answer queries by computing the entire minimal model and matching the query against it, but this can be inefficient if the answer only depends on a small subset of the entire model. The ''magic sets'' algorithm takes a Datalog program and a query, and produces a more efficient program that computes the same answer to the query while still using bottom-up evaluation. A variant of the magic sets algorithm has been shown to produce programs that, when evaluated using semi-naïve evaluation, are as efficient as top-down evaluation.


Complexity

The
decision problem In computability theory and computational complexity theory, a decision problem is a computational problem that can be posed as a yes–no question on a set of input values. An example of a decision problem is deciding whether a given natura ...
formulation of Datalog evaluation is as follows: Given a Datalog program split into a set of facts (EDB) and a set of rules , and a ground atom , is in the minimal model of ? In this formulation, there are three variations of the
computational complexity In computer science, the computational complexity or simply complexity of an algorithm is the amount of resources required to run it. Particular focus is given to computation time (generally measured by the number of needed elementary operations ...
of evaluating Datalog programs: * The is the complexity of the decision problem when and are inputs and is fixed. * The is the complexity of the decision problem when and are inputs and is fixed. * The is the complexity of the decision problem when , , and are inputs. With respect to data complexity, the decision problem for Datalog is
P-complete In computational complexity theory, a decision problem is P-complete ( complete for the complexity class P) if it is in P and every problem in P can be reduced to it by an appropriate reduction. The notion of P-complete decision problems is use ...
(See Theorem 4.4 in ). P-completeness for data complexity means that there exists a fixed datalog query for which evaluation is P-complete. The proof is based on ''Datalog metainterpreter'' for propositional logic programs. With respect to program complexity, the decision problem is
EXPTIME-complete In computational complexity theory, the complexity class EXPTIME (sometimes called EXP or DEXPTIME) is the set of all decision problems that are solvable by a deterministic Turing machine in exponential time, i.e., in O(2''p''(''n'')) time, w ...
. In particular, evaluating Datalog programs always terminates; Datalog is not
Turing-complete In computability theory, a system of data-manipulation rules (such as a model of computation, a computer's instruction set, a programming language, or a cellular automaton) is said to be Turing-complete or computationally universal if it can be ...
. Some extensions to Datalog do not preserve these complexity bounds. Extensions implemented in some Datalog engines, such as algebraic data types, can even make the resulting language Turing-complete.


Extensions

Several extensions have been made to Datalog, e.g., to support negation,
aggregate function In database management, an aggregate function or aggregation function is a function where multiple values are processed together to form a single summary statistic. Common aggregate functions include: * Average (i.e., arithmetic mean) * Count ...
s, inequalities, to allow
object-oriented programming Object-oriented programming (OOP) is a programming paradigm based on the concept of '' objects''. Objects can contain data (called fields, attributes or properties) and have actions they can perform (called procedures or methods and impl ...
, or to allow
disjunction In logic, disjunction (also known as logical disjunction, logical or, logical addition, or inclusive disjunction) is a logical connective typically notated as \lor and read aloud as "or". For instance, the English language sentence "it is ...
s as heads of
clause In language, a clause is a Constituent (linguistics), constituent or Phrase (grammar), phrase that comprises a semantic predicand (expressed or not) and a semantic Predicate (grammar), predicate. A typical clause consists of a subject (grammar), ...
s. These extensions have significant impacts on the language's semantics and on the implementation of a corresponding interpreter. Datalog is a syntactic subset of
Prolog Prolog is a logic programming language that has its origins in artificial intelligence, automated theorem proving, and computational linguistics. Prolog has its roots in first-order logic, a formal logic. Unlike many other programming language ...
,
disjunctive Datalog Disjunctive Datalog is an extension of the logic programming language Datalog that allows Logical disjunction, disjunctions in the heads of rules. This extension enables disjunctive Datalog to express several NP-hardness, NP-hard problems that are n ...
,
answer set programming Answer set programming (ASP) is a form of declarative programming oriented towards difficult (primarily NP-hard) search problems. It is based on the stable model (answer set) semantics of logic programming. In ASP, search problems are reduced ...
, DatalogZ, and
constraint logic programming Constraint logic programming is a form of constraint programming, in which logic programming is extended to include concepts from constraint satisfaction. A constraint logic program is a logic program that contains constraints in the body of claus ...
. When evaluated as an answer set program, a Datalog program yields a single answer set, which is exactly its minimal model. Many implementations of Datalog extend Datalog with additional features; see for more information.


Aggregation

Datalog can be extended to support
aggregate function In database management, an aggregate function or aggregation function is a function where multiple values are processed together to form a single summary statistic. Common aggregate functions include: * Average (i.e., arithmetic mean) * Count ...
s. Notable Datalog engines that implement aggregation include: *
LogicBlox The LogicBlox system is a commercial, declarative, incremental logic programming language and deductive database inspired by Datalog. The LogiQL programming language extends Datalog with several features, including stratified negation, aggreg ...
*
Soufflé A soufflé () is a baked egg dish originating in France in the early 18th century. Combined with various other ingredients, it can be served as a savoury main dish or sweetened as a dessert. The word ''soufflé'' is the past participle of the Fr ...


Negation

Adding negation to Datalog complicates its semantics, leading to whole new languages and strategies for evaluation. For example, the language that results from adding negation with the
stable model semantics The concept of a stable model, or answer set, is used to define a declarative semantics for logic programs with negation as failure. This is one of several standard approaches to the meaning of negation in logic programming, along with program com ...
is exactly
answer set programming Answer set programming (ASP) is a form of declarative programming oriented towards difficult (primarily NP-hard) search problems. It is based on the stable model (answer set) semantics of logic programming. In ASP, search problems are reduced ...
.
Stratified Stratification may refer to: Mathematics * Stratification (mathematics), any consistent assignment of numbers to predicate symbols * Data stratification in statistics Earth sciences * Stable and unstable stratification * Stratification, or st ...
negation can be added to Datalog while retaining its model-theoretic and fixed-point semantics. Notable Datalog engines that implement stratified negation include: *
LogicBlox The LogicBlox system is a commercial, declarative, incremental logic programming language and deductive database inspired by Datalog. The LogiQL programming language extends Datalog with several features, including stratified negation, aggreg ...
*
Soufflé A soufflé () is a baked egg dish originating in France in the early 18th century. Combined with various other ingredients, it can be served as a savoury main dish or sweetened as a dessert. The word ''soufflé'' is the past participle of the Fr ...


Comparison to Prolog

Unlike in
Prolog Prolog is a logic programming language that has its origins in artificial intelligence, automated theorem proving, and computational linguistics. Prolog has its roots in first-order logic, a formal logic. Unlike many other programming language ...
, statements of a Datalog program can be stated in any order. Datalog does not have Prolog's
cut Cut or CUT may refer to: Common uses * The act of cutting, the separation of an object into two through acutely directed force ** A type of wound ** Cut (archaeology), a hole dug in the past ** Cut (clothing), the style or shape of a garment ** ...
operator. This makes Datalog a fully
declarative language In computer science, declarative programming is a programming paradigm—a style of building the structure and elements of computer programs—that expresses the logic of a computation without describing its control flow. Many languages that app ...
. In contrast to Prolog, Datalog * disallows complex terms as arguments of
predicate Predicate or predication may refer to: * Predicate (grammar), in linguistics * Predication (philosophy) * several closely related uses in mathematics and formal logic: **Predicate (mathematical logic) **Propositional function **Finitary relation, o ...
s, e.g., p(x, y) is admissible but not p(f(x), y), * disallows negation, * requires that every variable that appears in the head of a
clause In language, a clause is a Constituent (linguistics), constituent or Phrase (grammar), phrase that comprises a semantic predicand (expressed or not) and a semantic Predicate (grammar), predicate. A typical clause consists of a subject (grammar), ...
also appear in a literal in the body of the clause. This article deals primarily with Datalog without negation (see also ). However, stratified negation is a common addition to Datalog; the following list contrasts
Prolog Prolog is a logic programming language that has its origins in artificial intelligence, automated theorem proving, and computational linguistics. Prolog has its roots in first-order logic, a formal logic. Unlike many other programming language ...
with Datalog with stratified negation. Datalog with stratified negation * also disallows complex terms as arguments of
predicate Predicate or predication may refer to: * Predicate (grammar), in linguistics * Predication (philosophy) * several closely related uses in mathematics and formal logic: **Predicate (mathematical logic) **Propositional function **Finitary relation, o ...
s, * requires that every variable that appears in the head of a
clause In language, a clause is a Constituent (linguistics), constituent or Phrase (grammar), phrase that comprises a semantic predicand (expressed or not) and a semantic Predicate (grammar), predicate. A typical clause consists of a subject (grammar), ...
also appear in a positive (i.e., not negated) atom in the body of the clause, * requires that every variable appearing in a negative literal in the body of a clause also appear in some positive literal in the body of the clause.


Expressiveness

Datalog generalizes many other query languages. For instance, conjunctive queries and
union of conjunctive queries In database theory, a conjunctive query is a restricted form of first-order queries using the logical conjunction operator. Many first-order queries can be written as conjunctive queries. In particular, a large part of queries issued on relational ...
can be expressed in Datalog. Datalog can also express regular path queries. When we consider ''ordered databases'', i.e., databases with an
order relation Order theory is a branch of mathematics that investigates the intuitive notion of order using binary relations. It provides a formal framework for describing statements such as "this is less than that" or "this precedes that". This article intro ...
on their active domain, then the Immerman–Vardi theorem implies that the expressive power of Datalog is precisely that of the class
PTIME In computational complexity theory, P, also known as PTIME or DTIME(''n''O(1)), is a fundamental complexity class. It contains all decision problems that can be solved by a deterministic Turing machine using a polynomial amount of computation time ...
: a property can be expressed in Datalog if and only if it is computable in polynomial time. The for Datalog asks, given a Datalog program, whether it is , i.e., the maximal recursion depth reached when evaluating the program on an input database can be bounded by some constant. In other words, this question asks whether the Datalog program could be rewritten as a nonrecursive Datalog program, or, equivalently, as a
union of conjunctive queries In database theory, a conjunctive query is a restricted form of first-order queries using the logical conjunction operator. Many first-order queries can be written as conjunctive queries. In particular, a large part of queries issued on relational ...
. Solving the boundedness problem on arbitrary Datalog programs is undecidable, but it can be made decidable by restricting to some fragments of Datalog.


Datalog engines

Systems that implement languages inspired by Datalog, whether
compiler In computing, a compiler is a computer program that Translator (computing), translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primaril ...
s,
interpreters Interpreting is translation from a spoken or signed language into another language, usually in real time to facilitate live communication. It is distinguished from the translation of a written text, which can be more deliberative and make use o ...
,
libraries A library is a collection of Book, books, and possibly other Document, materials and Media (communication), media, that is accessible for use by its members and members of allied institutions. Libraries provide physical (hard copies) or electron ...
, or embedded DSLs, are referred to as . Datalog engines often implement extensions of Datalog, extending it with additional
data type In computer science and computer programming, a data type (or simply type) is a collection or grouping of data values, usually specified by a set of possible values, a set of allowed operations on these values, and/or a representation of these ...
s,
foreign function interface A foreign function interface (FFI) is a mechanism by which a program written in one programming language can call routines or make use of services written or compiled in another one. An FFI is often used in contexts where calls are made into a bin ...
s, or support for user-defined lattices. Such extensions may allow for writing non-terminating or otherwise ill-defined programs. Here is a short list of systems that are either based on Datalog or provide a Datalog interpreter:


Free software/open source


Non-free software

*
FoundationDB FoundationDB is a free and open-source multi-model distributed NoSQL database developed by Apple Inc. with a shared-nothing architecture. The product was designed around a "core" database, with additional features supplied in "layers." The core ...
provides a free-of-charge database binding for pyDatalog, with a tutorial on its use. * Leapsight Semantic Dataspace (LSD) is a distributed deductive database that offers high availability, fault tolerance, operational simplicity, and scalability. LSD uses Leaplog (a Datalog implementation) for querying and reasoning and was create by Leapsight. *
LogicBlox The LogicBlox system is a commercial, declarative, incremental logic programming language and deductive database inspired by Datalog. The LogiQL programming language extends Datalog with several features, including stratified negation, aggreg ...
, a commercial implementation of Datalog used for web-based retail planning and insurance applications. * Profium Sense is a native RDF compliant
graph database A graph database (GDB) is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A key concept of the system is the graph (or edge or relationship). The graph relates the dat ...
written in Java. It provides Datalog evaluation support of user defined rules. * .QL, a commercial object-oriented variant of Datalog created by Semmle for analyzing source code to detect security vulnerabilities. * SecPAL a security policy language developed by
Microsoft Research Microsoft Research (MSR) is the research subsidiary of Microsoft. It was created in 1991 by Richard Rashid, Bill Gates and Nathan Myhrvold with the intent to advance state-of-the-art computing and solve difficult world problems through technologi ...
. * Stardog is a
graph database A graph database (GDB) is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A key concept of the system is the graph (or edge or relationship). The graph relates the dat ...
, implemented in
Java Java is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea (a part of Pacific Ocean) to the north. With a population of 156.9 million people (including Madura) in mid 2024, proje ...
. It provides support for RDF and all OWL 2 profiles providing extensive reasoning capabilities, including datalog evaluation. * StrixDB: a commercial RDF graph store,
SPARQL SPARQL (pronounced ":wikt:sparkle, sparkle", a recursive acronym for SPARQL Protocol and RDF Query Language) is an RDF query language—that is, a Semantic Query, semantic query language for databases—able to retrieve and manipulate data sto ...
compliant with Lua API and Datalog inference capabilities. Could be used as httpd (
Apache HTTP Server The Apache HTTP Server ( ) is a free and open-source software, free and open-source cross-platform web server, released under the terms of Apache License, Apache License 2.0. It is developed and maintained by a community of developers under the ...
) module or standalone (although beta versions are under the Perl Artistic License 2.0).


Uses and influence

Datalog is quite limited in its expressivity. It is not
Turing-complete In computability theory, a system of data-manipulation rules (such as a model of computation, a computer's instruction set, a programming language, or a cellular automaton) is said to be Turing-complete or computationally universal if it can be ...
, and doesn't include basic data types such as
integers An integer is the number zero (0), a positive natural number (1, 2, 3, ...), or the negation of a positive natural number (−1, −2, −3, ...). The negations or additive inverses of the positive natural numbers are referred to as negative in ...
or
strings String or strings may refer to: *String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects Arts, entertainment, and media Films * ''Strings'' (1991 film), a Canadian anim ...
. This parsimony is appealing from a theoretical standpoint, but it means Datalog ''per se'' is rarely used as a programming language or
knowledge representation Knowledge representation (KR) aims to model information in a structured manner to formally represent it as knowledge in knowledge-based systems whereas knowledge representation and reasoning (KRR, KR&R, or KR²) also aims to understand, reason, and ...
language. Most Datalog engines implement substantial extensions of Datalog. However, Datalog has a strong influence on such implementations, and many authors don't bother to distinguish them from Datalog as presented in this article. Accordingly, the applications discussed in this section include applications of realistic implementations of Datalog-based languages. Datalog has been applied to problems in
data integration Data integration refers to the process of combining, sharing, or synchronizing data from multiple sources to provide users with a unified view. There are a wide range of possible applications for data integration, from commercial (such as when a ...
, information extraction, networking,
security Security is protection from, or resilience against, potential harm (or other unwanted coercion). Beneficiaries (technically referents) of security may be persons and social groups, objects and institutions, ecosystems, or any other entity or ...
,
cloud computing Cloud computing is "a paradigm for enabling network access to a scalable and elastic pool of shareable physical or virtual resources with self-service provisioning and administration on-demand," according to International Organization for ...
and
machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...
.
Google Google LLC (, ) is an American multinational corporation and technology company focusing on online advertising, search engine technology, cloud computing, computer software, quantum computing, e-commerce, consumer electronics, and artificial ...
has developed an extension to Datalog for
big data Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data processing, data-processing application software, software. Data with many entries (rows) offer greater statistical power, while data with ...
processing. Datalog has seen application in
static program analysis In computer science, static program analysis (also known as static analysis or static simulation) is the analysis of computer programs performed without executing them, in contrast with dynamic program analysis, which is performed on programs duri ...
. The
Soufflé A soufflé () is a baked egg dish originating in France in the early 18th century. Combined with various other ingredients, it can be served as a savoury main dish or sweetened as a dessert. The word ''soufflé'' is the past participle of the Fr ...
dialect has been used to write pointer analyses for
Java Java is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea (a part of Pacific Ocean) to the north. With a population of 156.9 million people (including Madura) in mid 2024, proje ...
and a
control-flow analysis In computer science, control-flow analysis (CFA) is a static code analysis, static-code-analysis technique for determining the control flow of a program. The control flow is expressed as a control-flow graph (CFG). For both functional programming ...
for
Scheme Scheme or schemer may refer to: Arts and entertainment * ''The Scheme'', a BBC Scotland documentary TV series * The Scheme (band), an English pop band * ''The Scheme'', an action role-playing video game for the PC-8801, made by Quest Corporation * ...
. Datalog has been integrated with SMT solvers to make it easier to write certain static analyses. The Flix dialect is also suited to writing static program analyses. Some widely used database systems include ideas and algorithms developed for Datalog. For example, the SQL:1999 standard includes recursive queries, and the Magic Sets algorithm (initially developed for the faster evaluation of Datalog queries) is implemented in IBM's DB2.


History

The origins of Datalog date back to the beginning of
logic programming Logic programming is a programming, database and knowledge representation paradigm based on formal logic. A logic program is a set of sentences in logical form, representing knowledge about some problem domain. Computation is performed by applyin ...
, but it became prominent as a separate area around 1977 when Hervé Gallaire and
Jack Minker Jack Minker (4 July 1927 – 9 April 2021) was a leading authority in artificial intelligence, deductive databases, logic programming and non-monotonic reasoning. He was also an internationally recognized leader in the field of human rights of c ...
organized a workshop on
logic Logic is the study of correct reasoning. It includes both formal and informal logic. Formal logic is the study of deductively valid inferences or logical truths. It examines how conclusions follow from premises based on the structure o ...
and
database In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and a ...
s.
David Maier David Maier (born 2 June 1953) is the Maseeh Professor of Emerging Technologies in the Department of Computer Science at Portland State University. Born in Eugene, OR, he has also been a computer science faculty member at the State University of ...
is credited with coining the term Datalog..


See also

*
Answer set programming Answer set programming (ASP) is a form of declarative programming oriented towards difficult (primarily NP-hard) search problems. It is based on the stable model (answer set) semantics of logic programming. In ASP, search problems are reduced ...
*
Conjunctive query In database theory, a conjunctive query is a restricted form of first-order queries using the logical conjunction operator. Many first-order queries can be written as conjunctive queries. In particular, a large part of queries issued on relational ...
* DatalogZ *
Disjunctive Datalog Disjunctive Datalog is an extension of the logic programming language Datalog that allows Logical disjunction, disjunctions in the heads of rules. This extension enables disjunctive Datalog to express several NP-hardness, NP-hard problems that are n ...
* Flix * SWRL *
Tuple-generating dependency In relational database theory, a tuple-generating dependency (TGD) is a certain kind of constraint on a relational database. It is a subclass of the class of embedded dependencies (EDs). An algorithm known as the chase takes as input an instance ...
(TGD), a language for
integrity constraint Data integrity is the maintenance of, and the assurance of, data accuracy and consistency over its entire life-cycle. It is a critical aspect to the design, implementation, and usage of any system that stores, processes, or retrieves data. The ter ...
s on
relational database A relational database (RDB) is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A Relational Database Management System (RDBMS) is a type of database management system that stores data in a structured for ...
s with a similar syntax to Datalog


Notes


References

* * {{Authority control Query languages Logic programming languages Declarative programming languages