Unstructured data (or unstructured information) is information that either does not have a pre-defined
data model or is not organized in a pre-defined manner. Unstructured information is typically
text-heavy, but may also contain data such as dates, numbers, and facts. This results in irregularities and
ambiguities that make it harder for traditional programs to interpret than data stored in fielded form in databases or annotated (semantically tagged) in documents.
In 1998,
Merrill Lynch said "unstructured data comprises the vast majority of data found in an organization, some estimates run as high as 80%." It's unclear what the source of this number is, but nonetheless it is accepted by some.
Other sources have reported similar or higher percentages of unstructured data.
IDC and
Dell EMC project that data will grow to 40
zettabytes
by 2020, resulting in a 50-fold growth from the beginning of 2010.
More recently, IDC and
Seagate predict that the global datasphere will grow to 163 zettabytes by 2025 and that the majority of that data will be unstructured.
Computer World magazine states that unstructured information might account for more than 70–80% of all data in organizations.
Background
The earliest research into
business intelligence focused on unstructured textual data rather than numerical data.
As early as 1958,
computer science
researchers like
H.P. Luhn were particularly concerned with the extraction and classification of unstructured text.
However, only since the turn of the century has the technology caught up with the research interest. In 2004, the
SAS Institute developed the
SAS
Text Miner, which uses
Singular Value Decomposition
(SVD) to reduce a
hyper-dimensional textual
space
into smaller dimensions for significantly more efficient machine-analysis.
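As a rough illustration of the idea (not SAS Text Miner's actual implementation), the sketch below approximates a truncated SVD of a tiny term-document matrix using power iteration with deflation, in pure Python. The matrix, the function names `top_singular_triplets` and `document_coordinates`, and the choice of two components are all assumptions made for the example.

```python
import math
import random


def top_singular_triplets(matrix, k, iterations=200, seed=0):
    """Approximate the k largest singular triplets (sigma, u, v) of a matrix
    by power iteration, removing each found component before the next pass."""
    rng = random.Random(seed)
    rows, cols = len(matrix), len(matrix[0])
    residual = [[float(x) for x in row] for row in matrix]
    triplets = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(cols)]
        for _ in range(iterations):
            # One power-iteration step on A^T A: u = A v, w = A^T u, normalise w.
            u = [sum(residual[i][j] * v[j] for j in range(cols)) for i in range(rows)]
            w = [sum(residual[i][j] * u[i] for i in range(rows)) for j in range(cols)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        u = [sum(residual[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        sigma = math.sqrt(sum(x * x for x in u))
        if sigma == 0.0:
            break
        u = [x / sigma for x in u]
        triplets.append((sigma, u, v))
        # Deflation: subtract the rank-1 component just found.
        for i in range(rows):
            for j in range(cols):
                residual[i][j] -= sigma * u[i] * v[j]
    return triplets


def document_coordinates(triplets, doc_index):
    """Coordinates of one document (matrix column) in the reduced space."""
    return [sigma * v[doc_index] for sigma, _, v in triplets]


# Hypothetical term-document counts: rows are terms, columns are documents.
# Documents 0-1 are about finance, documents 2-3 about sport.
term_document = [
    [2, 1, 0, 0],  # "loan"
    [1, 2, 0, 0],  # "bank"
    [1, 1, 0, 0],  # "rate"
    [0, 0, 2, 1],  # "goal"
    [0, 0, 1, 2],  # "match"
]

triplets = top_singular_triplets(term_document, k=2)
doc_coords = [document_coordinates(triplets, j) for j in range(4)]
```

Each document is now described by two numbers instead of five term counts, and the two finance documents end up at the same point in the reduced space while the sport documents sit along the other axis. Production systems would use an optimized linear algebra library rather than this hand-rolled loop.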
The mathematical and technological advances sparked by
machine textual analysis prompted a number of businesses to research applications, leading to the development of fields like
sentiment analysis,
voice of the customer mining, and call center optimization.
The emergence of
Big Data in the late 2000s led to a heightened interest in the applications of unstructured data analytics in contemporary fields such as
predictive analytics
and
root cause analysis.
Issues with terminology
The term is imprecise for several reasons:
# Structure, while not formally defined, can still be implied.
# Data with some form of structure may still be characterized as unstructured if its structure is not helpful for the processing task at hand.
# Unstructured information might have some structure (semi-structured) or even be highly structured but in ways that are unanticipated or unannounced.
Dealing with unstructured data
Techniques such as
data mining,
natural language processing (NLP), and
text analytics provide different methods to
find patterns in, or otherwise interpret, this information. Common techniques for structuring text usually involve manual
tagging with metadata or
part-of-speech tagging for further
text mining-based structuring. The
Unstructured Information Management Architecture (UIMA) standard provided a common framework for processing this information to extract meaning and create structured data about the information.
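A very small sketch of what "creating structured data about the information" can look like in practice: a pattern-based annotator that pulls dates and currency amounts out of free text. This is not UIMA and not a full text-analytics pipeline; the patterns, the sample message, and the `extract_fields` helper are assumptions chosen for illustration.

```python
import re

# Assumed formats for this example: ISO dates and dollar amounts.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
AMOUNT_PATTERN = re.compile(r"\$(\d+(?:\.\d{2})?)")


def extract_fields(text):
    """Return a structured record describing dates and amounts found in text."""
    return {
        "dates": [m.group(0) for m in DATE_PATTERN.finditer(text)],
        "amounts": [float(m.group(1)) for m in AMOUNT_PATTERN.finditer(text)],
    }


message = (
    "Invoice 118 was issued on 2023-04-02 for $1250.00; "
    "payment is due by 2023-05-02."
)
record = extract_fields(message)
```

The resulting `record` can be stored in fielded form alongside the original text. Real systems typically combine many such annotators with linguistic analysis, since fixed patterns miss dates or amounts written in other styles.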
Software that creates machine-processable structure can utilize the linguistic, auditory, and visual structure that exists in all forms of human communication.
Algorithms can infer this inherent structure from text, for instance, by examining word
morphology, sentence syntax, and other small- and large-scale patterns. Unstructured information can then be enriched and tagged to address ambiguities, and relevancy-based techniques can then be used to facilitate search and discovery. Examples of "unstructured data" may include books, journals, documents,
metadata,
health records,
audio,
video,
analog data, images, files, and unstructured text such as the body of an
e-mail message,
Web page, or
word-processor document. While the main content being conveyed does not have a defined structure, it generally comes packaged in objects (e.g. in files or documents, ...) that themselves have structure and are thus a mix of structured and unstructured data, but collectively this is still referred to as "unstructured data". For example, an
HTML
web page is tagged, but HTML mark-up typically serves solely for rendering. It does not capture the meaning or function of tagged elements in ways that support automated processing of the information content of the page.
XHTML
tagging does allow machine processing of elements, although it typically does not capture or convey the semantic meaning of tagged terms.
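The point about markup can be seen with a short sketch using Python's standard `html.parser` module. The sample page and the `TagAndTextCollector` class are made up for the example; the parser recovers the tags and the text easily, but nothing in the markup says which text is a product name or a price.

```python
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Collect start-tag names and non-empty text chunks from an HTML page."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_chunks.append(stripped)


# Hypothetical page used only for illustration.
page = (
    "<html><body><h1>Acme Kettle</h1>"
    "<p>Now only <b>$24.99</b> until Friday.</p></body></html>"
)

collector = TagAndTextCollector()
collector.feed(page)
collector.close()
```

Here `collector.tags` lists presentation-oriented elements such as `h1`, `p`, and `b`, and `collector.text_chunks` holds the visible text. Deciding that the bold chunk is a price still requires separate interpretation of the content.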
Since unstructured data commonly occurs in
electronic documents, the use of a
content or
document management system that can categorize entire documents is often preferred over data transfer and manipulation from within the documents. Document management thus provides the means to impose structure on
document collections.
Search engines have become popular tools for indexing and searching through such data, especially text.
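A common data structure behind text search is the inverted index, which maps each term to the documents that contain it. The sketch below is a minimal version under simplifying assumptions (lowercased alphanumeric tokens, all query terms required); the document IDs, texts, and function names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each token to the set of document IDs in which it appears."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


documents = {
    "memo-1": "Quarterly sales report for the northern region",
    "memo-2": "Customer complaint about late delivery",
    "memo-3": "Sales team response to the customer complaint",
}
index = build_inverted_index(documents)
complaint_hits = search(index, "customer complaint")
report_hits = search(index, "sales report")
```

Full search engines add ranking, stemming, and phrase handling on top of this structure, none of which is shown here.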
Approaches in natural language processing
Specific computational workflows have been developed to impose structure upon the unstructured data contained within text documents. These workflows are generally designed to handle sets of thousands or even millions of documents, far more than manual annotation could feasibly cover. Several of these approaches are based upon the concept of
online analytical processing, or OLAP, and may be supported by data models such as text cubes. Once document metadata is available through a data model, generating summaries of subsets of documents (i.e., cells within a text cube) may be performed with phrase-based approaches.
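The text cube idea can be sketched as follows: documents are grouped into cells by their metadata values, and each cell is summarized by its most frequent phrases. This is a simplified illustration, not a published text cube or CaseOLAP algorithm; it uses raw two-word phrase counts, and the documents, dimension names, and function names are assumptions.

```python
from collections import Counter, defaultdict


def build_text_cube(documents, dimensions):
    """Group document texts into cells keyed by their metadata values."""
    cube = defaultdict(list)
    for doc in documents:
        cell = tuple(doc[dimension] for dimension in dimensions)
        cube[cell].append(doc["text"])
    return cube


def top_phrases(texts, n=1):
    """Return the n most frequent two-word phrases across the given texts."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(zip(words, words[1:]))
    return [" ".join(pair) for pair, _ in counts.most_common(n)]


# Hypothetical documents with two metadata dimensions.
documents = [
    {"year": 2020, "topic": "cardiology",
     "text": "heart failure risk rises with age"},
    {"year": 2020, "topic": "cardiology",
     "text": "new therapy reduces heart failure admissions"},
    {"year": 2021, "topic": "oncology",
     "text": "tumor growth slowed by new therapy"},
]

cube = build_text_cube(documents, dimensions=("year", "topic"))
cell_summary = top_phrases(cube[(2020, "cardiology")])
```

Practical phrase-based summarization usually scores phrases by how distinctive they are for a cell relative to the others, rather than by raw frequency alone.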
Approaches in medicine and biomedical research
Biomedical research generates one major source of unstructured data as researchers often publish their findings in scholarly journals. Though the language in these documents is challenging to derive structural elements from (e.g., due to the complicated technical vocabulary contained within and the domain knowledge required to fully contextualize observations), the results of these activities may yield links between technical and medical studies and clues regarding new disease therapies. Recent efforts to enforce structure upon biomedical documents include
self-organizing map approaches for identifying topics among documents, general-purpose
unsupervised algorithms, and an application of the CaseOLAP workflow
to determine associations between protein names and
cardiovascular disease
topics in the literature.
CaseOLAP defines phrase-category relationships in an accurate (identifies relationships), consistent (highly reproducible), and efficient manner. This platform offers enhanced accessibility and empowers the biomedical community with phrase-mining tools for widespread biomedical research applications.
The use of "unstructured" in data privacy regulations
In Sweden (EU), before 2018, some data privacy regulations did not apply if the data in question was confirmed as "unstructured".
The term "unstructured data" has rarely been used in the EU since
GDPR came into force in 2018. GDPR neither mentions nor defines "unstructured data". It does use the word "structured", without defining it, as follows:
* Parts of GDPR Recital 15, "The protection of natural persons should apply to the processing of personal data ... if ... contained in a filing system."
* GDPR Article 4, "‘filing system’ means any structured set of personal data which are accessible according to specific criteria ..."
GDPR case law on what defines a "filing system": "the specific criterion and the specific form in which the set of personal data collected by each of the members who engage in preaching is actually structured is irrelevant, so long as that set of data makes it possible for the data relating to a specific person who has been contacted to be easily retrieved, which is however for the referring court to ascertain in the light of all the circumstances of the case in the main proceedings." (CJEU, Todistajat v. Tietosuojavaltuutettu, Jehovan, Paragraph 61).
If personal data can be easily retrieved, it forms part of a filing system and is therefore in scope for GDPR, regardless of whether it is "structured" or "unstructured". Most electronic systems today, subject to access and applied software, can allow for easy retrieval of data.
See also
* Clustering
* Pattern recognition
* List of text mining software
* Semi-structured data
* Structured data
Notes
# Today's Challenge in Government: What to do with Unstructured Information and Why Doing Nothing Isn't An Option, Noel Yuhanna, Principal Analyst, Forrester Research, Nov 2010
External links
* Matching Unstructured Data and Structured Data
* A brief description for Structured Data
* Unstructured Data Definition, Examples, Benefits & Challenges