Vadalog is a system for performing complex logic reasoning tasks over

knowledge graph In knowledge representation and reasoning, a knowledge graph is a knowledge base that uses a Graph (discrete mathematics), graph-structured data model or topology to represent and operate on data. Knowledge graphs are often used to store interl ...

s. Its language is based on an extension of the rule-based language

Datalog Datalog is a declarative logic programming language. While it is syntactically a subset of Prolog, Datalog generally uses a bottom-up rather than top-down evaluation model. This difference yields significantly different behavior and properties ...

, ''Warded Datalog^±''. Vadalog was developed by researchers at the

University of Oxford The University of Oxford is a collegiate university, collegiate research university in Oxford, England. There is evidence of teaching as early as 1096, making it the oldest university in the English-speaking world and the List of oldest un ...

and

Technische Universität Wien TU Wien () is a public research university in Vienna, Austria. The university's teaching and research are focused on engineering, computer science, and natural sciences. It currently has about 28,100 students (29% women), eight faculties, and ...

as well as employees at the

Bank of Italy The Bank of Italy (Italian language, Italian: ''Banca d'Italia'', , informally referred to as ''Bankitalia'') is the National central bank (Eurosystem), national central bank for Italy within the Eurosystem. It was the Italian central bank from ...

Knowledge graph management systems (KGMS)

A knowledge graph management system (KGMS) has to manage

s, which incorporate large amounts of data in the form of ''facts'' and ''relationships.'' In general, it can be seen as the union of three components: # KBMS, that is, a knowledge base management system, # Big Data, which is the need of handling large amounts of data, especially when considering that knowledge graphs have been thought as a solution for integrating multiple data sources, both corporate and public knowledge, which can be integrated into large knowledge graphs, # (Data) Analytics is the need to provide access to existing software packages for machine learning, text mining, data analytics, and data visualization and to combine them together in the same platform. From a more technical standpoint, some additional requirements can be identified for defining a proper KGMS: * definition of a language and a formalism with high expressive power, * cost-effective

data wrangling Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one " raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes ...

, in all its steps, from data cleaning to web data extraction and big data access from many different sources, * efficient

logical Logic is the study of correct reasoning. It includes both formal and informal logic. Formal logic is the study of deductively valid inferences or logical truths. It examines how conclusions follow from premises based on the structure of arg ...

probabilistic Probability is a branch of mathematics and statistics concerning events and numerical descriptions of how likely they are to occur. The probability of an event is a number between 0 and 1; the larger the probability, the more likely an e ...

and ontological reasoning, * low complexity, both in terms of

space complexity The space complexity of an algorithm or a data structure is the amount of memory space required to solve an instance of the computational problem as a function of characteristics of the input. It is the memory required by an algorithm until it exec ...

(to handle big data) and syntax, * interfaces ( APIs) to access many heterogeneous data sources, such as corporate

RDBMS A relational database (RDB) is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A Relational Database Management System (RDBMS) is a type of database management system that stores data in a structured forma ...

NoSQL NoSQL (originally meaning "Not only SQL" or "non-relational") refers to a type of database design that stores and retrieves data differently from the traditional table-based structure of relational databases. Unlike relational databases, which ...

or RDF stores, the web, machine-learning and analytics packages. Other requirements may include more typical

DBMS In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and ana ...

functions and services, as the ones proposed by Codd.

Vadalog system

Vadalog offers a platform that fulfills all the requirements of a KGMS listed above. It is able to perform ''rule-based'' reasoning tasks on top of knowledge graphs and it also supports the data science workflow, such as data visualization and machine learning.

Reasoning task and recursion

A rule is an expression of the form :− where: * are the atoms of the ''body,'' * is the atom of the ''head''. A ''rule'' allows to infer new knowledge starting from the variables that are in the body: when all the variables in the body of a rule are successfully assigned, the rule is activated and it results in the derivation of the head predicate: given a database and a set of rules , ''a reasoning task'' aims at inferring new knowledge, applying the rules of the set to the database (the ''extensional knowledge).'' The most widespread form of knowledge that has been adopted over the last decades has been in the form of rules, be it in

rule-based system In computer science, a rule-based system is a computer system in which domain-specific knowledge is represented in the form of rules and general-purpose reasoning is used to solve problems in the domain. Two different kinds of rule-based systems ...

s, ontology-based systems or other forms and it can be typically captured in knowledge graphs. The nature of knowledge graphs also makes the presence of

recursion Recursion occurs when the definition of a concept or process depends on a simpler or previous version of itself. Recursion is used in a variety of disciplines ranging from linguistics to logic. The most common application of recursion is in m ...

in these rules a particularly important aspect. Recursion means that the same rules might be called multiple times before obtaining the final answer of the reasoning task and it is particularly powerful as it allows an inference based on previously inferred results. This implies that the system must provide a strategy that guarantees termination. More technically, a program is recursive if the

dependency graph In mathematics, computer science and digital electronics, a dependency graph is a directed graph representing dependencies of several objects towards each other. It is possible to derive an evaluation order or the absence of an evaluation order th ...

built with the application of the rules is cyclical. The simplest form of recursion is that in which the head of a rule also appears in the body (''self-recursive rules'').

The query language

The Vadalog language allows to answer reasoning queries that also include recursion. It is based on Warded Datalog^±, which belongs to the Datalog^± family of languages that extends

with

existential quantifier Existentialism is a family of philosophy, philosophical views and inquiry that explore the human individual's struggle to lead an Authenticity (philosophy), authentic life despite the apparent Absurdity#The Absurd, absurdity or incomprehensibili ...

s in rule heads and at the same time restricts its syntax in order to achieve decidability and tractability. Existential rules are also known as tuple-generating dependencies (''tgds''). An ''existential rule'' has the following form: :

\varphi(x) \Rightarrow \exist z \Psi(x,z)

or, alternatively, in Datalog syntax, it can be written as follows: p(X,Z) :- r(X). Variables in Vadalog are like variables in

first-order logic First-order logic, also called predicate logic, predicate calculus, or quantificational logic, is a collection of formal systems used in mathematics, philosophy, linguistics, and computer science. First-order logic uses quantified variables over ...

and a variable is local to the rule in which it occurs. This means that occurrences of the same variable name in different rules refer to different variables.

Warded Datalog^±

In case of a set of rules

\Sigma

, consisting of the following: r(X,Y) :- p(X). p(Z) :- r(X,Z). the variable Z in the second rule is said to be ''dangerous,'' since the first rule will generate a ''null'' in the second term of the atom ''r'' and this will be injected to the second rule to get the atom ''p,'' leading to a propagation of ''nulls'' when trying to find an answer to the program. If arbitrary propagation is allowed, reasoning is undecidable and the program will be infinite. Warded Datalog^± overcomes this issue asking that for every rule defined in a set

\Sigma

, all the variables in the rule bodies must coexist in at least one atom in the head, called a ''ward.'' The concept of ''wardness'' restricts the way dangerous variable can be used inside a program. Although this is a limit in terms of expressive power, with this requirement and thanks to its architecture and termination algorithms'','' Warded Datalog^± is able to find answers to a program in a finite number of steps. It also exhibits a good trade-off between computational complexity and expressive power, capturing PTIME data complexity while allowing ontological reasoning and the possibility of running programs with recursion.

Vadalog extension

Vadalog replicates in its entirety Warded Datalog^± and extends it with the inclusion in the language of: * monotonic aggregations (''min, max, sum, prod, count'' operators), * stratified negation, * support for different data types (''strings, integer, double, date, boolean, set, lists, marked nulls),'' * rich annotation mechanism to define how to interact with data sources and external libraries. In addition, the system provides a highly engineered architecture to allow efficient computation. This is done in the following two ways. # Adopting an

in-memory processing The term is used for two different things: # In computer science, in-memory processing, also called compute-in-memory (CIM), or processing-in-memory (PIM), is a computer architecture in which data operations are available directly on the data ...

architecture and a pull stream-based approach, which limits the memory consumption or make it predictable. # Using an aggressive ''termination control strategy'', which detects patterns and redundancy while building of the chase-graph (used to generate the answers) as early as possible and cuts off computation when they occur. This is connected with the concept of

isomorphism In mathematics, an isomorphism is a structure-preserving mapping or morphism between two structures of the same type that can be reversed by an inverse mapping. Two mathematical structures are isomorphic if an isomorphism exists between the ...

of facts, which reduces the number of steps needed to get an answer: if a fact ''h'' is isomorphic to ''h, the system will only explore the chase graph of fact ''h''. The Vadalog system is therefore able to perform ontological reasoning tasks, as it belongs to the

family. Reasoning with the logical core of Vadalog captures OWL 2 QL and

SPARQL SPARQL (pronounced ":wikt:sparkle, sparkle", a recursive acronym for SPARQL Protocol and RDF Query Language) is an RDF query language—that is, a Semantic Query, semantic query language for databases—able to retrieve and manipulate data sto ...

(through the use of existential quantifiers), and graph analytics (through support for recursion and aggregation).

Example of ontological reasoning task

Consider the following set of Vadalog rules: ancestor(Y,X) :- person(X). ancestor(Y,Z) :- ancestor(Y,X), parent(X,Z). The first rule states that for each person

X

there exists an ancestor

Y

. The second rule states that, if

X

is a parent of

Z

, then

Y

is an ancestor of

Z

too. Note the existential quantification in the first position of the ancestor predicate in the first rule, which will generate a null ν_i in the chase procedure. Such null is then propagated to the head of the second rule. Consider a database D = with the extensional facts and the query of finding all the entailed ancestor facts as reasoning task. By performing the chase procedure, the fact ''ancestor''(ν₁,Alice) is generated by triggering the first rule on ''person''(Alice). Then, ''ancestor''(ν₁,Bob) is created by activating the second rule on ''ancestor''(ν₁,Alice) and ''parent''(Alice,Bob). Finally, the first rule could be triggered on ''person''(Bob), but the resulting fact ''ancestor''(ν₂,Bob) is isomorphic with ''ancestor''(ν₁,Bob), thus this fact is not generated and the corresponding portion of the chase graph is not explored. In conclusion, the answer to the query is the set of facts .

Additional features

The integration of Vadalog with data science tools is achieved by means of data bindings primitives and functions''.'' # ''Data binding primitives'': bindings allow to connect to external data sources or systems like a database, a framework or a library and tell the system how to deal with them. This includes connections with relational database systems as

PostgreSQL PostgreSQL ( ) also known as Postgres, is a free and open-source software, free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance. PostgreSQL features transaction processing, transactions ...

MySQL MySQL () is an Open-source software, open-source relational database management system (RDBMS). Its name is a combination of "My", the name of co-founder Michael Widenius's daughter My, and "SQL", the acronym for Structured Query Language. A rel ...

, and graph databases, such as

Neo4j Neo4j is a graph database management system (GDBMS) developed by Neo4j Inc. The data elements Neo4j stores are nodes, edges connecting them, and attributes of nodes and edges. Described by its developers as an ACID-compliant transactional d ...

. It is also possible to integrate machine learning libraries (

Weka The weka, also known as the Māori hen or woodhen (''Gallirallus australis'') is a flightless bird species of the rail family. It is endemic to New Zealand. Some authorities consider it as the only extant member of the genus '' Gallirallus''. ...

scikit-learn scikit-learn (formerly scikits.learn and also known as sklearn) is a free and open-source machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support ...

Tensorflow TensorFlow is a Library (computing), software library for machine learning and artificial intelligence. It can be used across a range of tasks, but is used mainly for Types of artificial neural networks#Training, training and Statistical infer ...

) and a web data extraction tool, OXPath, which allows to retrieve data directly from the web. This is possible through the so-called ''annotations:'' they are special ''facts'' that augment the sets of existential rules with specific behaviours. The @input annotation tells the system that the facts for an atom of the program are imported from an external data source, e.g., a relational database. The @output annotation specifies that the facts for an atom of the program will be exported to an external target, e.g., the standard output or a relational database. Other annotations like @bind, @mapping, @qbind allow to customize the data sources for the @input annotation or the targets for the @output annotation. # ''Functions'': custom expressions which make use of arithmetic operators, string manipulation or comparison can also be supported. Libraries written in Python are also enabled and they can be integrated with the dedicated annotation @library. The system also provides an integration with the JupyterLab platform, where Vadalog programs can be written and run and the output can be read, exploiting the functionalities of the platform. It gives also the possibility to evaluate the correctness of the program, run it and analyse the derivation process of output facts by means of tools as ''syntax highlighting'', ''code analysis'' (checking whether the code is correct or there are errors) and ''explanations of results'' (how the result has been obtained): all these functionalities are embedded in the notebook and help in writing and analyzing Vadalog code.

Use cases

The Vadalog system can be employed to address many real-world use cases from distinct research and industry fields. Among the latter, this section presents two relevant and accessible cases belonging to the financial domain.

Company control

A company ownership graph shows entities as nodes and shares as edges. When an entity has a certain amount of shares on another one (commonly identified in the absolute majority), it is able to exert a decision power on that entity and this configures a company control and, more generally, a group structure. Searching for all ''control'' relationships requires to investigate different scenarios and very complex group structures, namely direct and indirect control. This query be translated into the following rules: * a company directly controls a company if directly owns more than 50% of the total equity of , * a company indirectly controls a company if controls a set of intermediary companies that jointly (i.e., summing their shares) own more than 50% of . These rules can be written in a Vadalog program that will derive all ''control'' edges like the following: control(X,X) :- company(X). control(X,Y) :- control(X,Y), own(Y,Z,W), V = sum(W,), V > 0.5. The first rule states that each company controls itself. The second rule defines control of over by summing the shares of held by companies , over all companies controlled by .

Close link

This scenario consists in determining whether there exists a link between two entities in a company ownerships graph. Determining the existence of such links is relevant, for instance, in banking supervision and credit worthiness evaluation, as a company cannot act as guarantor for loans to another company if the two share such a relationship. Formally, two companies and are involved in a ''close link'' if: * (resp., ) owns directly or indirectly 20% or more of the equity of (resp., ), or * a common third-party owns directly or indirectly 20% or more of the equity of both and . These rules can be written in a Vadalog program that will derive all ''close link'' edges like the following: mcl(X,Y,S) :- own(X,Y,S). mcl(X,Z,S1 * S2) :- mc1(X,Y,S1), own(Y,Z,S2). cl1(X,Y) :- mcl(X,Y,S), TS = sum(S), TS > 0.2. cl2(X,Y) :- cl1(Z,X), cl1(Z,Y), X != Y. closelink(X,Y) :- cl1(X,Y). closelink(X,Y) :- cl2(X,Y). The first rule states that two companies and connected by an ownership edge are possible close links. The second rule states that, if and are possible close links with a share and there exists an ownership edge from to a company with a share , then also and are possible close links with a share . The third rule states that, if the sum of all the partial shares of owned directly or indirectly by is greater than or equal to 20% of the equity of , then they are close links according to the first definition. The fourth rule models the second definition of close links, i.e., the third-party case.

References

{{Reflist Graph databases Declarative programming languages