Data extraction is the act or process of retrieving
data
In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpret ...
out of (usually
unstructured or poorly structured) data sources for further
data processing
Data processing is the collection and manipulation of digital data to produce meaningful information.
Data processing is a form of '' information processing'', which is the modification (processing) of information in any manner detectable by ...
or
data storage
Data storage is the recording (storing) of information (data) in a storage medium. Handwriting, phonographic recording, magnetic tape, and optical discs are all examples of storage media. Biological molecules such as RNA and DNA are cons ...
(
data migration). The
import into the intermediate extracting system is thus usually followed by
data transformation and possibly the addition of
metadata prior to
export
An export in international trade is a good produced in one country that is sold into another country or a service provided in one country for a national or resident of another country. The seller of such goods or the service provider is an ...
to another stage in the data
workflow
A workflow consists of an orchestrated and repeatable pattern of activity, enabled by the systematic organization of resources into processes that transform materials, provide services, or process information. It can be depicted as a sequence ...
.
Usually, the term data extraction is applied when (
experiment
An experiment is a procedure carried out to support or refute a hypothesis, or determine the efficacy or likelihood of something previously untried. Experiments provide insight into cause-and-effect by demonstrating what outcome occurs wh ...
al) data is first imported into a computer from primary sources, like
measuring
Measurement is the quantification of attributes of an object or event, which can be used to compare with other objects or events.
In other words, measurement is a process of determining how large or small a physical quantity is as compared ...
or
recording devices. Today's
electronic devices will usually present an
electrical connector
Components of an electrical circuit are electrically connected if an electric current can run between them through an electrical conductor. An electrical connector is an electromechanical device used to create an electrical connection betwee ...
(e.g.
USB) through which '
raw data' can be
streamed into a
personal computer
A personal computer (PC) is a multi-purpose microcomputer whose size, capabilities, and price make it feasible for individual use. Personal computers are intended to be operated directly by an end user, rather than by a computer expert or tech ...
.
Data sources
Typical unstructured data sources include
web pages,
email
Electronic mail (email or e-mail) is a method of exchanging messages ("mail") between people using electronic devices. Email was thus conceived as the electronic ( digital) version of, or counterpart to, mail, at a time when "mail" mean ...
s, documents,
PDFs, scanned text, mainframe reports, spool files, classifieds, etc. which is further used for sales or marketing leads. Extracting data from these unstructured sources has grown into a considerable technical challenge where as historically data extraction has had to deal with changes in physical hardware formats, the majority of current data extraction deals with extracting data from these unstructured data sources, and from different software formats. This growing process of data extraction
data extraction.
/ref> from the web is referred to as "Web data extraction" or "Web scraping
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scrapin ...
".
Imposing structure
The act of adding structure to unstructured data takes a number of forms
* Using text pattern matching
In computer science, pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact: "either it will or will not be ...
such as regular expression
A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...
s to identify small or large-scale structure e.g. records in a report and their associated data from headers and footers;
* Using a table-based approach to identify common sections within a limited domain e.g. in emailed resumes, identifying skills, previous work experience, qualifications etc. using a standard set of commonly used headings (these would differ from language to language), e.g. Education might be found under Education/Qualification/Courses;
* Using text analytics
Text mining, also referred to as ''text data mining'', similar to text analytics, is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extrac ...
to attempt to understand the text and link it to other information
See also
* Information extraction
* Data retrieval
* Extract, transform, load (ETL)
* Data mining
References
{{DEFAULTSORT:Data Extraction
Data management
Data warehousing