
Data integration refers to the process of combining, sharing, or synchronizing data from multiple sources to provide users with a unified view. There is a wide range of possible applications for data integration, from commercial (such as when a business merges multiple databases) to scientific (such as combining research data from different bioinformatics repositories). The decision to integrate data tends to arise when the volume and complexity of data (that is, big data) and the need to share existing data grow rapidly. Data integration has become the focus of extensive theoretical work, and numerous open problems remain unsolved. It encourages collaboration between internal as well as external users. The data being integrated are typically received from heterogeneous database systems and transformed into a single coherent data store that provides synchronous data across a network of files for clients. A common use of data integration is in data mining, when analyzing and extracting information from existing databases that can be useful for business intelligence.


History

Issues with combining heterogeneous data sources, often referred to as information silos, under a single query interface have existed for some time. In the early 1980s, computer scientists began designing systems for interoperability of heterogeneous databases. The first data integration system driven by structured metadata was designed in 1991 at the University of Minnesota for the Integrated Public Use Microdata Series (IPUMS). IPUMS used a data warehousing approach, which extracts, transforms, and loads data from heterogeneous sources into a single view schema so that data from different sources become compatible. By making thousands of population databases interoperable, IPUMS demonstrated the feasibility of large-scale data integration. The data warehouse approach offers a tightly coupled architecture because the data are already physically reconciled in a single queryable repository, so it usually takes little time to resolve queries.

The data warehouse approach is less feasible for data sets that are frequently updated, since it requires the extract, transform, load (ETL) process to be continuously re-executed for synchronization. Difficulties also arise in constructing data warehouses when one has only a query interface to summary data sources and no access to the full data. This problem frequently emerges when integrating several commercial query services such as travel or classified advertisement web applications.

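The following is a minimal sketch of the ETL pattern just described, assuming two hypothetical CSV extracts with different local layouts and a SQLite warehouse; all file, table, and column names are invented for illustration.

```python
import csv
import sqlite3

# Minimal ETL sketch: extract from two hypothetical source files with
# different layouts, transform both into one warehouse schema, and load.
# File names and column names are illustrative, not from any real system.

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows, name_key, pop_key):
    # Reconcile each source's local column names to the warehouse schema.
    return [(r[name_key].strip().title(), int(r[pop_key])) for r in rows]

warehouse = sqlite3.connect("warehouse.db")
warehouse.execute("CREATE TABLE IF NOT EXISTS city (name TEXT, population INTEGER)")

# Each source uses its own column names; the transform step unifies them.
for path, name_key, pop_key in [
    ("census_source.csv", "CITY_NAME", "POP_TOTAL"),
    ("survey_source.csv", "city", "residents"),
]:
    warehouse.executemany(
        "INSERT INTO city VALUES (?, ?)",
        transform(extract(path), name_key, pop_key),
    )
warehouse.commit()
```

Keeping the warehouse synchronized means re-running this entire pipeline whenever a source changes, which is exactly the weakness of the warehouse approach for frequently updated data sets.
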
A trend began in 2009 favoring the loose coupling of data and providing a unified query interface to access real-time data over a mediated schema, which allows information to be retrieved directly from the original databases. This is consistent with the service-oriented architecture (SOA) approach popular in that era. This approach relies on mappings between the mediated schema and the schemas of the original sources, and on translating a query into decomposed queries that match the schemas of the original databases. Such mappings can be specified in two ways: as a mapping from entities in the mediated schema to entities in the original sources (the "Global-as-View" (GAV) approach), or as a mapping from entities in the original sources to the mediated schema (the "Local-as-View" (LAV) approach). The latter approach requires more sophisticated inferences to resolve a query on the mediated schema, but makes it easier to add new data sources to a (stable) mediated schema.

Some of the work in data integration research concerns the semantic integration problem. This problem addresses not the structuring of the integration architecture, but how to resolve semantic conflicts between heterogeneous data sources. For example, if two companies merge their databases, certain concepts and definitions in their respective schemas, such as "earnings", inevitably have different meanings. In one database it may mean profits in dollars (a floating-point number), while in the other it might represent the number of sales (an integer). A common strategy for resolving such problems involves the use of ontologies, which explicitly define schema terms and thus help to resolve semantic conflicts. This approach is known as ontology-based data integration.

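As a minimal sketch of the idea, the following maps each source's local term onto a shared concept, resolving the "earnings" conflict described above; the concept names, converters, and record layouts are assumptions made for illustration.

```python
# Minimal sketch: a tiny shared vocabulary ("ontology") that maps each
# source's local term onto an explicitly defined concept, resolving the
# "earnings" conflict. Concept names and conversions are illustrative.

SHARED_CONCEPTS = {
    # (source, local term) -> (shared concept, converter)
    ("company_a", "earnings"): ("profit_usd", float),
    ("company_b", "earnings"): ("units_sold", int),
}

def integrate(source, record):
    """Rewrite a source record's keys into the shared vocabulary."""
    out = {}
    for key, value in record.items():
        concept, convert = SHARED_CONCEPTS.get((source, key), (key, lambda v: v))
        out[concept] = convert(value)
    return out

print(integrate("company_a", {"earnings": "1250000.50"}))  # {'profit_usd': 1250000.5}
print(integrate("company_b", {"earnings": "40000"}))       # {'units_sold': 40000}
```

Once both records speak the shared vocabulary, the two "earnings" columns can no longer be conflated, because each has been renamed to the concept it actually denotes.
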
In other cases, such as combining research results from different bioinformatics repositories, integration requires benchmarking the similarities computed from the different data sources on a single criterion, such as positive predictive value. This makes the data sources directly comparable, so they can be integrated even when the natures of the experiments are distinct.

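Positive predictive value is the standard ratio of true positives to all predicted positives; as a reminder of the criterion being benchmarked:

```latex
\mathrm{PPV} = \frac{TP}{TP + FP}
```

where TP and FP are the counts of true and false positives measured against a common gold standard. Scoring each repository's computed similarities with this single criterion places otherwise dissimilar experiments on one comparable scale.
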
Data modeling methods in common use have been found to impart data isolation into every data architecture in the form of islands of disparate data and information silos. This data isolation is an unintended artifact of the data modeling methodology, which results in the development of disparate data models; disparate data models, when instantiated as databases, form disparate databases. Enhanced data modeling methodologies have been developed to eliminate the data isolation artifact and to promote the development of integrated data models. One enhanced method recasts data models by augmenting them with structural metadata in the form of standardized data entities. As a result, the set of recast data models shares one or more commonality relationships that relate the structural metadata now common to those models. Commonality relationships are a peer-to-peer type of entity relationship that relates the standardized data entities of multiple data models; multiple data models that contain the same standard data entity may participate in the same commonality relationship. When integrated data models are instantiated as databases and are properly populated from a common set of master data, those databases are integrated.

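As a minimal sketch of this idea, the following assumes two SQLite databases that were each populated with the same standardized "region" entity from a common master list, so a cross-database join is designed in rather than accidental; all table and column names are invented for illustration.

```python
import sqlite3

# Two separately developed databases that each instantiate the same
# standardized "region" entity (shared codes from a common master list),
# so joining across them is reliable by design. Names are illustrative.

sales = sqlite3.connect(":memory:")
sales.executescript("""
    CREATE TABLE std_region (region_code TEXT PRIMARY KEY, region_name TEXT);
    CREATE TABLE sales (region_code TEXT REFERENCES std_region, amount REAL);
    INSERT INTO std_region VALUES ('US-MN', 'Minnesota');
    INSERT INTO sales VALUES ('US-MN', 1200.0);
""")

# A second database, built from the same master set of standard entities.
sales.execute("ATTACH DATABASE ':memory:' AS hr")
sales.executescript("""
    CREATE TABLE hr.std_region (region_code TEXT PRIMARY KEY, region_name TEXT);
    CREATE TABLE hr.staff (region_code TEXT, headcount INTEGER);
    INSERT INTO hr.std_region VALUES ('US-MN', 'Minnesota');
    INSERT INTO hr.staff VALUES ('US-MN', 14);
""")

# The shared standardized entity makes this cross-database join dependable
# without any intervening ETL step.
for row in sales.execute("""
    SELECT s.region_code, s.amount, h.headcount
    FROM sales s JOIN hr.staff h USING (region_code)
"""):
    print(row)  # ('US-MN', 1200.0, 14)
```
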
Since 2011, data hub approaches have been of greater interest than fully structured (typically relational) enterprise data warehouses, and since 2013 data lake approaches have risen to the level of data hubs. These approaches combine unstructured or varied data into one location, but do not necessarily require an (often complex) master relational schema to structure and define all the data in the hub. More recently, as the number of applications in use has increased many-fold and application-to-application integration has become critical, unified APIs have arisen to help developers integrate their applications with one another, and the Model Context Protocol (MCP) has taken this a step further for AI agents.

Data integration plays a big role in business with regard to the data collection used for studying the market. Converting the raw data retrieved from consumers into coherent data is something businesses try to do when considering what steps to take next. Organizations are increasingly using data mining to collect information and patterns from their databases, and this process helps them develop new business strategies, increase business performance, and perform economic analyses more efficiently. Compiling the large amounts of data they collect to be stored in their systems is a form of data integration adapted for business intelligence, improving their chances of success.


Example

Consider a web application where a user can query a variety of information about cities (such as crime statistics, weather, hotels, and demographics). Traditionally, the information would have to be stored in a single database with a single schema, but any single enterprise would find information of this breadth somewhat difficult and expensive to collect. Even if the resources existed to gather the data, doing so would likely duplicate data in existing crime databases, weather websites, and census data.

A data integration solution may address this problem by treating these external resources as materialized views over a virtual mediated schema, resulting in "virtual data integration". Application developers construct a virtual schema, the ''mediated schema'', to best model the kinds of answers their users want. Next, they design "wrappers" or adapters for each data source, such as the crime database and the weather website. These adapters simply transform the local query results (those returned by the respective websites or databases) into an easily processed form for the data integration solution. When an application user queries the mediated schema, the data integration solution transforms this query into appropriate queries over the respective data sources. Finally, the virtual database combines the results of these queries into the answer to the user's query.

This solution offers the convenience of adding new sources by simply constructing an adapter or an application software blade for them; a minimal sketch of the pattern appears below. It contrasts with ETL systems, or with a single-database solution, which require manual integration of each entire new data set into the system. Virtual ETL solutions leverage the virtual mediated schema to implement data harmonization, whereby data are copied from the designated "master" source to the defined targets, field by field. Advanced data virtualization is also built on the concept of object-oriented modeling, constructing the virtual mediated schema or virtual metadata repository using a hub-and-spoke architecture.

Each data source is disparate and, as such, is not designed to support reliable joins between data sources. Data virtualization and data federation therefore depend on accidental data commonality to support combining data and information from disparate data sets. Because of the lack of data value commonality across data sources, the returned set may be inaccurate, incomplete, or impossible to validate. One solution is to recast the disparate databases to integrate them without the need for ETL. The recast databases support commonality constraints where referential integrity may be enforced between databases, and they provide designed data access paths with data value commonality across databases.
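The following is a minimal sketch of the wrapper-and-mediator pattern from this example, with toy in-memory stand-ins for the crime database and the weather website; all names and data are invented for illustration.

```python
# Minimal sketch of virtual data integration: one wrapper per source
# translates mediated-schema terms into a source-local lookup, and the
# mediator fans the user's query out and merges the results. The sources
# are toy stand-ins; all names and data are illustrative.

CRIME_DB = {"Springfield": {"burglaries": 12}}    # pretend SQL database
WEATHER_SITE = {"springfield": {"temp_c": 21}}    # pretend web service

def crime_wrapper(city):
    # This source already keys cities by their display name.
    return CRIME_DB.get(city, {})

def weather_wrapper(city):
    # This source keys cities in lowercase; the wrapper hides that detail.
    return WEATHER_SITE.get(city.lower(), {})

WRAPPERS = [crime_wrapper, weather_wrapper]

def query_mediated_schema(city):
    """Answer a query over the virtual schema by fanning out to wrappers."""
    answer = {"city": city}
    for wrapper in WRAPPERS:
        answer.update(wrapper(city))
    return answer

print(query_mediated_schema("Springfield"))
# {'city': 'Springfield', 'burglaries': 12, 'temp_c': 21}
```

Adding a new source amounts to writing one more wrapper function and appending it to the list, which mirrors the adapter convenience described above.
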


Theory

The theory of data integration forms a subset of database theory and formalizes the underlying concepts of the problem in first-order logic. Applying these theories gives indications of the feasibility and difficulty of data integration. While its definitions may appear abstract, they are general enough to accommodate all manner of integration systems, including those that include nested relational or XML databases and those that treat databases as programs. Connections to particular database systems such as Oracle or DB2 are provided by implementation-level technologies such as JDBC and are not studied at the theoretical level.


Definitions

Data integration systems are formally defined as a tuple \left \langle G,S,M\right \rangle where G is the global (or mediated) schema, S is the heterogeneous set of source schemas, and M is the mapping that maps queries between the source and the global schemas. Both G and S are expressed in languages over alphabets composed of symbols for each of their respective relations. The mapping M consists of assertions between queries over G and queries over S. When users pose queries over the data integration system, they pose queries over G, and the mapping then asserts connections between the elements in the global schema and the source schemas.

A database over a schema is defined as a set of sets, one for each relation (in a relational database). The database corresponding to the source schema S comprises the set of sets of tuples for each of the heterogeneous data sources and is called the ''source database''. Note that this single source database may actually represent a collection of disconnected databases. The database corresponding to the virtual mediated schema G is called the ''global database''. The global database must satisfy the mapping M with respect to the source database. The legality of this mapping depends on the nature of the correspondence between G and S. Two popular ways to model this correspondence exist: ''Global-as-View'' (GAV) and ''Local-as-View'' (LAV).

GAV systems model the global database as a set of views over S. In this case M associates to each element of G a query over S. Query processing becomes a straightforward operation because of the well-defined associations between G and S. The burden of complexity falls on implementing mediator code instructing the data integration system exactly how to retrieve elements from the source databases. If any new sources join the system, considerable effort may be necessary to update the mediator, so the GAV approach is preferable when the sources seem unlikely to change. In a GAV approach to the example data integration system above, the system designer would first develop mediators for each of the city information sources and then design the global schema around these mediators. For example, if one of the sources served a weather website, the designer would likely add a corresponding element for weather to the global schema. The bulk of the effort then concentrates on writing the proper mediator code that will transform predicates on weather into a query over the weather website. This effort can become complex if some other source also relates to weather, because the designer may need to write code to properly combine the results from the two sources.

In LAV, on the other hand, the source database is modeled as a set of views over G. In this case M associates to each element of S a query over G. Here the exact associations between G and S are no longer well defined. As illustrated in the next section, the burden of determining how to retrieve elements from the sources is placed on the query processor. The benefit of LAV modeling is that new sources can be added with far less work than in a GAV system, so the LAV approach should be favored when the mediated schema is less stable or likely to change. In an LAV approach to the example data integration system above, the system designer designs the global schema first and then simply inputs the schemas of the respective city information sources. Consider again a source serving a weather website: the designer would add corresponding elements for weather to the global schema only if none existed already. Then programmers write an adapter or wrapper for the website and add a schema description of the website's results to the source schemas. The complexity of adding the new source moves from the designer to the query processor.
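As a small worked instance of the two mapping styles, suppose the mediated schema G contains a single relation City(name, country) and S contains two sources: S1(name), which lists only Dutch cities, and S2(name, country). The relations are invented for illustration, and the containment notation is the one used in the next section.

```latex
% GAV: the mapping M defines each element of G as a query over S.
\mathrm{City}(n, c) \;\supseteq\; \{(n, \text{Netherlands}) \mid S_1(n)\} \;\cup\; \{(n, c) \mid S_2(n, c)\}

% LAV: the mapping M describes each element of S as a query over G.
S_1(n) \;\subseteq\; \{\, n \mid \mathrm{City}(n, \text{Netherlands}) \,\}
```

Under GAV, the definition of City tells the query processor exactly where to look; under LAV, the system must instead infer which sources can contribute City tuples, which is what makes new sources cheap to add but queries harder to answer.
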


Query processing

The theory of query processing in data integration systems is commonly expressed using conjunctive queries and Datalog, a purely declarative logic programming language. One can loosely think of a conjunctive query as a logical function applied to the relations of a database, such as "f(A,B) where A < B". If a tuple or set of tuples is substituted into the rule and satisfies it (makes it true), then that tuple is considered part of the set of answers to the query. While formal languages like Datalog express these queries concisely and without ambiguity, common SQL queries count as conjunctive queries as well.

In terms of data integration, "query containment" represents an important property of conjunctive queries. A query A contains another query B (denoted A \supset B) if the results of applying B are a subset of the results of applying A for any database. The two queries are said to be equivalent if the resulting sets are equal for any database. This is important because in both GAV and LAV systems, a user poses conjunctive queries over a ''virtual'' schema represented by a set of views, or "materialized" conjunctive queries. Integration seeks to rewrite the queries represented by the views so that their results are equivalent to, or maximally contained in, the user's query. This corresponds to the problem of answering queries using views (AQUV).

In GAV systems, a system designer writes mediator code to define the query rewriting. Each element in the user's query corresponds to a substitution rule, just as each element in the global schema corresponds to a query over the sources. Query processing simply expands the subgoals of the user's query according to the rules specified in the mediator, and thus the resulting query is likely to be equivalent. While the designer does the majority of the work beforehand, some GAV systems such as Tsimmis involve simplifying the mediator description process.

In LAV systems, queries undergo a more radical process of rewriting because no mediator exists to align the user's query with a simple expansion strategy. The integration system must execute a search over the space of possible queries in order to find the best rewrite. The resulting rewrite may not be an equivalent query but merely maximally contained, and the resulting tuples may be incomplete. The GQR algorithm is the leading query-rewriting algorithm for LAV data integration systems. In general, the complexity of query rewriting is NP-complete. If the space of rewrites is relatively small, this does not pose a problem, even for integration systems with hundreds of sources.
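The GAV expansion step described above is mechanical enough to sketch in a few lines. The toy rewriter below substitutes each global predicate of a conjunctive query with the single source relation its mediator rule names; all predicate and source names are invented, and a real mediator would additionally handle joins across sources, multiple rules per predicate, and assembling the results.

```python
# Toy GAV "unfolding": rewrite a conjunctive query over the global schema
# into a query over the sources by replacing each global predicate with
# its mediator-defined source relation. All names are illustrative.

# Mediator mapping M, one source definition per global predicate:
#   weather(c, t) :- weather_site(c, t)
#   crime(c, r)   :- crime_db(c, r)
GAV_MAPPING = {
    "weather": "weather_site",
    "crime": "crime_db",
}

def unfold(query):
    """Expand each subgoal (predicate, variables) into its source form."""
    return [(GAV_MAPPING[pred], args) for pred, args in query]

# q(c) :- weather(c, t), crime(c, r)
user_query = [("weather", ("c", "t")), ("crime", ("c", "r"))]
print(unfold(user_query))
# -> [('weather_site', ('c', 't')), ('crime_db', ('c', 'r'))]
```

Because each global predicate has exactly one well-defined source query here, the rewrite is a direct substitution; the search over candidate rewrites that LAV systems require never arises.
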


Medicine and life sciences

Large-scale questions in science, such as real-world evidence, global warming, invasive species spread, and resource depletion, increasingly require the collection of disparate data sets for meta-analysis. This type of data integration is especially challenging for ecological and environmental data because metadata standards are not agreed upon and many different data types are produced in these fields. National Science Foundation initiatives such as Datanet are intended to make data integration easier for scientists by providing cyberinfrastructure and setting standards. The five funded Datanet initiatives are DataONE, led by William Michener at the University of New Mexico; The Data Conservancy, led by Sayeed Choudhury of Johns Hopkins University; SEAD: Sustainable Environment through Actionable Data, led by Margaret Hedstrom of the University of Michigan; the DataNet Federation Consortium, led by Reagan Moore of the University of North Carolina; and ''Terra Populus'', led by Steven Ruggles of the University of Minnesota. The Research Data Alliance has more recently explored creating global data integration frameworks. The OpenPHACTS project, funded through the European Union Innovative Medicines Initiative, built a drug discovery platform by linking datasets from providers such as the European Bioinformatics Institute, the Royal Society of Chemistry, UniProt, WikiPathways, and DrugBank.


See also

* Business semantics management
* Change data capture
* Core data integration
* Customer data integration
* Cyberinfrastructure
* Data blending
* Data curation
* Data fusion
* Data mapping
* Data wrangling
* Database model
* Dataspaces
* Edge data integration
* Enterprise application integration
* Enterprise architecture framework
* Enterprise information integration (EII)
* Enterprise integration
* Geodi: Geoscientific Data Integration
* Information integration
* Information silo
* Integration Competency Center
* Integration Consortium
* ISO 15926: Integration of life-cycle data for process plants including oil and gas production facilities
* JXTA
* Master data management
* Object-relational mapping
* Open Text
* Semantic integration
* Schema matching
* Three schema approach
* UDEF
* Web data integration
* Web service

