
In metadata, metadata discovery (also metadata harvesting) is the process of using automated tools to discover the semantics of a data element in data sets. This process usually ends with a set of mappings between the data source elements and a centralized metadata registry. Metadata discovery is also known as metadata scanning.


Data source formats for metadata discovery

Data sets may be in a variety of different forms, including:
# Relational databases
# NoSQL databases
# Spreadsheets
# XML files
# Web services
# Software source code such as Fortran, Jovial, COBOL, Assembler, RPG, PL/1, EasyTrieve, Java, C# or C++ classes, and many other programming languages
# Unstructured text documents such as Microsoft Word or PDF files


A taxonomy of metadata matching algorithms

Automated metadata discovery techniques fall into three distinct categories:


Lexical matching

# Exact match - data element linkages are made based on the exact name of a column in a database, the name of an XML element, or a label on a screen. For example, if a database column has the name "PersonBirthDate" and a data element in a metadata registry also has the name "PersonBirthDate", automated tools can infer that the database column has the same semantics (meaning) as the data element in the metadata registry.
# Synonym match - the discovery tool is given not just a single name but a set of synonyms.
# Pattern match - the tool is given a set of lexical patterns that it can match. For example, the tool may search for "*gender*" or "*sex*".
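The three lexical strategies above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the `REGISTRY` contents, the synonym sets, and the function name `lexical_match` are assumptions made up for the example, and the patterns use shell-style wildcards from the standard `fnmatch` module.

```python
from fnmatch import fnmatchcase

# Hypothetical registry: data element name -> synonyms and wildcard patterns.
# Only "PersonBirthDate", "*gender*" and "*sex*" come from the text above;
# everything else is invented for illustration.
REGISTRY = {
    "PersonBirthDate": {
        "synonyms": {"BirthDate", "DateOfBirth", "DOB"},
        "patterns": ["*birth*date*"],
    },
    "PersonGenderCode": {
        "synonyms": {"Gender", "Sex"},
        "patterns": ["*gender*", "*sex*"],
    },
}


def lexical_match(column_name):
    """Return (data_element, match_kind) for a column name, or None."""
    # 1. Exact match on the registered data element name.
    if column_name in REGISTRY:
        return column_name, "exact"
    lowered = column_name.lower()
    # 2. Synonym match, compared case-insensitively.
    for element, rules in REGISTRY.items():
        if lowered in {synonym.lower() for synonym in rules["synonyms"]}:
            return element, "synonym"
    # 3. Pattern match using shell-style wildcards on the lowercased name.
    for element, rules in REGISTRY.items():
        if any(fnmatchcase(lowered, pattern) for pattern in rules["patterns"]):
            return element, "pattern"
    return None


if __name__ == "__main__":
    for name in ["PersonBirthDate", "dob", "cust_gender_cd", "OrderTotal"]:
        print(name, "->", lexical_match(name))
```

The strategies are tried from most to least specific, so a weaker pattern hit never overrides an exact name. Pattern matching is also the most error-prone of the three: "*sex*" matches a column such as "MiddlesexCounty", so pattern hits are usually best treated as candidates for review rather than confirmed mappings.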


Semantic matching

Semantic matching attempts to use semantics to associate target data with registered data elements.
# Semantic similarity - this algorithm relies on a database of conceptual nearness between words. For example, the WordNet system can rank how close words are conceptually to each other: the terms "Person", "Individual" and "Human" may be highly similar concepts.
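As a rough illustration of ranking by conceptual nearness, the sketch below scores two terms by the length of the path between them in a small "is-a" hierarchy. It does not use WordNet itself: the `HYPERNYMS` table is a tiny hand-made stand-in, and the names `similarity` and `rank_candidates` are hypothetical. A real tool would query a full lexical database instead.

```python
# Hypothetical "is-a" links standing in for a lexical database such as WordNet.
HYPERNYMS = {
    "individual": "person",
    "human": "person",
    "person": "organism",
    "animal": "organism",
    "organism": "entity",
    "date": "entity",
}


def path_to_root(word):
    """Return the chain of terms from `word` up to the top of the hierarchy."""
    path = [word]
    while path[-1] in HYPERNYMS:
        path.append(HYPERNYMS[path[-1]])
    return path


def similarity(a, b):
    """Path-based score in [0, 1]: 1 / (1 + number of edges between a and b)."""
    path_a = path_to_root(a.lower())
    path_b = path_to_root(b.lower())
    depth_in_b = {node: depth for depth, node in enumerate(path_b)}
    # The first shared ancestor found walking up from `a` gives the shortest path.
    for depth_a, node in enumerate(path_a):
        if node in depth_in_b:
            return 1.0 / (1 + depth_a + depth_in_b[node])
    return 0.0  # No shared ancestor: the terms are unrelated in this table.


def rank_candidates(term, candidates):
    """Sort candidate terms from most to least similar to `term`."""
    return sorted(candidates, key=lambda c: similarity(term, c), reverse=True)


if __name__ == "__main__":
    print(rank_candidates("Person", ["Date", "Individual", "Animal"]))
```

In this toy hierarchy "Individual" scores higher against "Person" than "Date" does, which mirrors the ranking behaviour described above. The absolute scores depend entirely on how the hierarchy is built, so they are only meaningful relative to one another.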


Statistical matching

Statistical matching uses statistics about the data in the data sources themselves to derive similarities with registered data elements.
# Distinct value analysis - by analyzing all the distinct values in a column, its similarity to a registered data element can be determined. For example, if a column has only the two distinct values 'male' and 'female', it could be mapped to 'PersonGenderCode'.
# Data distribution analysis - by analyzing the distribution of values within a single column and comparing this distribution with those of known data elements, a semantic linkage can be inferred.
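Both statistical techniques can be sketched with the standard library alone. The example below is an assumption-laden illustration: the `REGISTERED_VALUES` table, the "YesNoIndicator" element, and the function names are invented, and total variation distance is just one of several reasonable ways to compare two value distributions.

```python
from collections import Counter

# Hypothetical registered data elements and their permitted values.
# Only 'PersonGenderCode' with 'male'/'female' comes from the text above.
REGISTERED_VALUES = {
    "PersonGenderCode": {"male", "female"},
    "YesNoIndicator": {"y", "n"},
}


def distinct_value_match(column_values):
    """Return registered elements whose value set covers every distinct column value."""
    distinct = {str(v).strip().lower() for v in column_values if v is not None}
    if not distinct:
        return []
    return [name for name, allowed in REGISTERED_VALUES.items() if distinct <= allowed]


def relative_frequencies(values):
    """Map each value to its share of the column (shares sum to 1)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def distribution_distance(column_values, reference):
    """Total variation distance in [0, 1]; 0 means identical distributions."""
    observed = relative_frequencies(column_values)
    keys = set(observed) | set(reference)
    return 0.5 * sum(abs(observed.get(k, 0.0) - reference.get(k, 0.0)) for k in keys)


if __name__ == "__main__":
    print(distinct_value_match(["male", "female", "Male", None]))
    print(distribution_distance(["a", "a", "b", "b"], {"a": 0.5, "b": 0.5}))
```

A small distance suggests that the column may carry the same semantics as the reference data element, but it is evidence rather than proof: unrelated columns can share a distribution by coincidence, so a threshold and a manual review step are usually still needed.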


Vendors

The following vendors (listed in alphabetical order) provide metadata discovery and metadata mapping software and solutions:
* Atlan
* BigHand/Esquire Innovations
* IBM
* InfoLibrarian Corporation
* MindHARBOR Metadata Database application
* Octopai - a Cross-Platform Metadata Discovery and Management Automation
* Revelytix
* Silver Creek Systems
* Stratio
* Sypherlink: Harvester
* Talend
* Unicorn Systems


Research

* INDUS project at Iowa State University
* Mercury - a distributed metadata management and data discovery system developed at the Oak Ridge National Laboratory DAAC


See also

* Metadata
* Data mapping
* Data warehouse
* Semantic web
* Defense Discovery Metadata Specification

