Auto-entity Extraction

Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al. (2005), there are three perspectives of text mining: information extraction, data mining, and knowledge discovery in databases (KDD). Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).

Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via the application of natural language processing (NLP), different types of algorithms and analytical methods. An important phase of this process is the interpretation of the gathered information.

A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted. The document is the basic element when starting with text mining. Here, we define a document as a unit of textual data, which normally exists in many types of collections.
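As a rough illustration of what "turning text into data" and studying word frequency distributions can look like, the following minimal Python sketch counts tokens per document and across a small collection. The two sample documents, the function names and the simple letters-only tokenizer are assumptions made for this example; they are not taken from any particular text mining system.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(documents):
    """Return one Counter of token counts per document."""
    return [Counter(tokenize(doc)) for doc in documents]


# Hypothetical two-document collection, for illustration only.
docs = [
    "Text mining turns text into data.",
    "Data mining finds patterns in data.",
]

doc_counts = term_frequencies(docs)

# Summing the per-document counters gives the frequency
# distribution over the whole collection.
corpus_counts = sum(doc_counts, Counter())

print(doc_counts[0]["text"])         # occurrences of "text" in the first document
print(corpus_counts.most_common(3))  # most frequent tokens in the collection
```

Each document becomes a mapping from token to count, which is one simple structured representation that later steps (clustering, classification, indexing) could consume.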


Text analytics

Text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation. The term is roughly synonymous with text mining; indeed, Ronen Feldman modified a 2000 description of "text mining" in 2004 to describe "text analytics". The latter term is now used more frequently in business settings, while "text mining" is used in some of the earliest application areas, dating to the 1980s, notably life-sciences research and government intelligence. The term text analytics also describes the application of text analytics to respond to business problems, whether independently or in conjunction with query and analysis of fielded, numerical data. It is a truism that 80% of business-relevant information originates in unstructured form, primarily text. These techniques and processes discover and present knowledge (facts, business rules, and relationships) that is otherwise locked in textual form, impenetrable to automated processing.


Text analysis processes

Subtasks (components of a larger text-analytics effort) typically include:
* Dimensionality reduction is an important technique for pre-processing data. It is used to identify the root word for actual words and reduce the size of the text data.
* Information retrieval or identification of a corpus is a preparatory step: collecting or identifying a set of textual materials, on the Web or held in a file system, database, or content corpus manager, for analysis.
* Although some text analytics systems apply exclusively advanced statistical methods, many others apply more extensive natural language processing, such as part-of-speech tagging, syntactic parsing, and other types of linguistic analysis.
* Named entity recognition is the use of gazetteers or statistical techniques to identify named text features: people, organizations, place names, stock ticker symbols, certain abbreviations, and so on.
* Disambiguation (the use of contextual clues) may be required to decide whether, for instance, "Ford" refers to a former U.S. president, a vehicle manufacturer, a movie star, a river crossing, or some other entity.
* Recognition of pattern-identified entities: features such as telephone numbers, e-mail addresses, and quantities (with units) can be discerned via regular expression or other pattern matches.
* Document clustering: identification of sets of similar text documents.
* Coreference resolution: identification of noun phrases and other terms that refer to the same object.
* Extraction of relationships, facts and events: identification of associations among entities and other information in texts.
* Sentiment analysis: discerning of subjective material and extracting information about attitudes: sentiment, opinion, mood, and emotion. This is done at the entity, concept, or topic level and aims to distinguish opinion holders and objects.
* Quantitative text analysis: a set of techniques stemming from the social sciences where either a human judge or a computer extracts semantic or grammatical relationships between words in order to find out the meaning or stylistic patterns of, usually, a casual personal text for the purpose of psychological profiling etc.
* Pre-processing usually involves tasks such as tokenization, filtering and stemming.
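Two of the subtasks above, pre-processing (tokenization, filtering and stemming) and recognition of pattern-identified entities, can be sketched in a few lines of Python. This is only an illustrative sketch: the stop-word list, the suffix-stripping rule and the regular expressions are simplified assumptions, and a real system would more likely use an established stemmer and far more robust patterns.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much larger).
STOP_WORDS = {"a", "an", "and", "at", "in", "is", "of", "or", "the", "to"}

# Deliberately simplified patterns; they will miss many real-world formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def tokenize(text):
    """Tokenization: lowercase the text and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Stemming: strip one common English suffix (a crude stand-in
    for a real stemming algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, filter out stop words, then stem what remains."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def find_pattern_entities(text):
    """Return e-mail addresses and phone numbers matched by the patterns."""
    return {"email": EMAIL_RE.findall(text), "phone": PHONE_RE.findall(text)}


# Hypothetical input text, for illustration only.
sample = "Mining the reports: contact jane.doe@example.com or call 555-123-4567."

print(preprocess("The analysts tokenized and filtered the documents"))
print(find_pattern_entities(sample))
```

Note that the pattern matching runs on the raw text, not on the pre-processed tokens, because lowercasing and stripping punctuation would destroy the very shapes the patterns look for.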


Applications

Text mining technology is now broadly applied to a wide variety of government, research, and business needs. All these groups may use text mining for records management and searching documents relevant to their daily activities. Legal professionals may use text mining for e-discovery, for example. Governments and military groups use text mining for national security and intelligence purposes. Scientific researchers incorporate text mining approaches into efforts to organize large sets of text data (i.e., addressing the problem of unstructured data), to determine ideas communicated through text (e.g., sentiment analysis in social media) and to support scientific discovery in fields such as the life sciences and bioinformatics. In business, applications are used to support competitive intelligence and automated ad placement, among numerous other activities.


Security applications

Many text mining software packages are marketed for security applications, especially monitoring and analysis of online plain text sources such as Internet news, blogs, etc. for national security purposes. It is also involved in the study of text encryption/decryption.


Biomedical applications

A range of text mining applications in the biomedical literature has been described, including computational approaches to assist with studies in protein docking, protein interactions, and protein-disease associations. In addition, with large patient textual datasets in the clinical field, datasets of demographic information in population studies and adverse event reports, text mining can facilitate clinical studies and precision medicine. Text mining algorithms can facilitate the stratification and indexing of specific clinical events in large patient textual datasets of symptoms, side effects, and comorbidities from electronic health records, event reports, and reports from specific diagnostic tests. One online text mining application in the biomedical literature is PubGene, a publicly accessible search engine that combines biomedical text mining with network visualization. GoPubMed is a knowledge-based search engine for biomedical texts. Text mining techniques also enable us to extract unknown knowledge from unstructured documents in the clinical domain.


Software applications

Text mining methods and software are also being researched and developed by major firms, including IBM and Microsoft, to further automate the mining and analysis processes, and by different firms working in the area of search and indexing in general as a way to improve their results. Within the public sector, much effort has been concentrated on creating software for tracking and monitoring terrorist activities. For study purposes, Weka software is one of the most popular options in the scientific world, acting as an excellent entry point for beginners. For Python programmers, there is an excellent toolkit called NLTK for more general purposes. For more advanced programmers, there is also the Gensim library, which focuses on word embedding-based text representations.


Online media applications

Text mining is being used by large media companies, such as the Tribune Company, to clarify information and to provide readers with greater search experiences, which in turn increases site "stickiness" and revenue. Additionally, on the back end, editors are benefiting by being able to share, associate and package news across properties, significantly increasing opportunities to monetize content.


Business and marketing applications

Text analytics is being used in business, particularly in marketing, for example in customer relationship management. Coussement and Van den Poel (2008) apply it to improve predictive analytics models for customer churn (customer attrition). Text mining is also being applied in stock returns prediction.


Sentiment analysis

Sentiment analysis may involve analysis of reviews of products such as movies, books, or hotels to estimate how favorable a review is toward the product. Such an analysis may need a labeled data set or labeling of the affectivity of words. Resources for affectivity of words and concepts have been made for WordNet and ConceptNet, respectively. Text has been used to detect emotions in the related area of affective computing. Text-based approaches to affective computing have been used on multiple corpora such as student evaluations, children's stories and news stories.
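To make the idea of labeling the affectivity of words concrete, here is a minimal lexicon-based scoring sketch in Python. The lexicon entries and their scores are invented for illustration and are not drawn from WordNet, ConceptNet or any published resource. Averaging word scores is only one simple way to estimate how favorable a review is; it ignores negation, context and intensity.

```python
import re

# Hypothetical affect lexicon: word -> polarity score (illustrative values only).
AFFECT_LEXICON = {
    "good": 1.0,
    "great": 2.0,
    "excellent": 2.0,
    "bad": -1.0,
    "boring": -1.5,
    "terrible": -2.0,
}


def sentiment_score(review):
    """Average the polarity of the lexicon words found in the review.

    Returns 0.0 when no lexicon word occurs, i.e. the review is
    treated as neutral under this simplistic scheme.
    """
    tokens = re.findall(r"[a-z]+", review.lower())
    scores = [AFFECT_LEXICON[t] for t in tokens if t in AFFECT_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical reviews, for illustration only.
print(sentiment_score("A great film with an excellent cast"))
print(sentiment_score("Terrible plot and boring dialogue"))
```

A positive score suggests a favorable review and a negative score an unfavorable one. A supervised alternative would instead train a classifier on a labeled data set of reviews.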


Scientific literature mining and academic applications

The issue of text mining is of importance to publishers who hold large databases of information needing indexing for retrieval. This is especially true in scientific disciplines, in which highly specific information is often contained within the written text. Therefore, initiatives have been taken such as Nature's proposal for an Open Text Mining Interface (OTMI) and the National Institutes of Health's common Journal Publishing Document Type Definition (DTD) that would provide semantic cues to machines to answer specific queries contained within the text without removing publisher barriers to public access. Academic institutions have also become involved in the text mining initiative:
* The National Centre for Text Mining (NaCTeM) is the first publicly funded text mining centre in the world. NaCTeM is operated by the University of Manchester in close collaboration with the Tsujii Lab, University of Tokyo. NaCTeM provides customised tools and research facilities and offers advice to the academic community. They are funded by the Joint Information Systems Committee (JISC) and two of the UK research councils (EPSRC & BBSRC). With an initial focus on text mining in the biological and biomedical sciences, research has since expanded into the areas of social sciences.
* In the United States, the School of Information at the University of California, Berkeley is developing a program called BioText to assist biology researchers in text mining and analysis.
* The Text Analysis Portal for Research (TAPoR), currently housed at the University of Alberta, is a scholarly project to catalogue text analysis applications and create a gateway for researchers new to the practice.


Methods for scientific literature mining

Computational methods have been developed to assist with information retrieval from scientific literature. Published approaches include methods for searching, determining novelty, and clarifying homonyms among technical reports.


Digital humanities and computational sociology

The automatic analysis of vast textual corpora has made it possible for scholars to analyze millions of documents in multiple languages with very limited manual intervention. Key enabling technologies have been parsing, machine translation, topic categorization, and machine learning. The automatic parsing of textual corpora has enabled the extraction of actors and their relational networks on a vast scale, turning textual data into network data. The resulting networks, which can contain thousands of nodes, are then analyzed using tools from network theory to identify the key actors, the key communities or parties, and general properties such as the robustness or structural stability of the overall network, or the centrality of certain nodes. This automates the approach introduced by quantitative narrative analysis, whereby subject-verb-object triplets are identified with pairs of actors linked by an action, or pairs formed by actor-object.
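The step from subject-verb-object triplets to network data can be illustrated with a minimal sketch. The triplets below are invented for illustration (a real pipeline would obtain them from a parser), the function names are hypothetical, and degree centrality is only one of several centrality measures that network theory offers.

```python
from collections import defaultdict

# Hypothetical subject-verb-object triplets, as a parser might emit them.
TRIPLETS = [
    ("Senate", "approves", "Budget"),
    ("President", "signs", "Budget"),
    ("President", "meets", "Senate"),
    ("Union", "criticizes", "President"),
]

def build_network(triplets):
    """Return an undirected adjacency map linking each subject to its object."""
    graph = defaultdict(set)
    for subject, _verb, obj in triplets:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return dict(graph)

def degree_centrality(graph):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}

network = build_network(TRIPLETS)
centrality = degree_centrality(network)
# The node with the highest centrality is a candidate "key actor".
key_actor = max(centrality, key=centrality.get)
```

In this toy network `key_actor` is the node connected to every other node. On corpora with thousands of nodes, a dedicated graph library would normally replace the hand-written adjacency map.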
Content analysis has long been a traditional part of the social sciences and media studies. The automation of content analysis has allowed a "big data" revolution to take place in that field, with studies of social media and newspaper content that include millions of news items. Gender bias, readability, content similarity, reader preferences, and even mood have been analyzed using text mining methods over millions of documents. The analysis of readability, gender bias and topic bias was demonstrated in Flaounas et al., showing how different topics have different gender biases and levels of readability; the possibility of detecting mood patterns in a vast population by analyzing Twitter content was demonstrated as well.
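As a rough illustration of how content similarity between two documents might be measured, the sketch below compares sets of two-word shingles using the Jaccard coefficient. This is one common technique, chosen here as an assumption. The source does not state which similarity measure the cited studies used, and the sample sentences and function names are made up.

```python
def shingles(text, w=2):
    """Return the set of contiguous w-word sequences in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)

# Made-up example documents.
doc_a = "the council approved the new budget"
doc_b = "the council rejected the new budget"
similarity = jaccard(shingles(doc_a), shingles(doc_b))
```

A value near 1 indicates nearly identical wording and a value near 0 indicates little overlap. Larger shingle widths make the measure stricter about word order.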


Software

Text mining computer programs are available from many commercial and open source companies and sources.


Intellectual property law


Situation in Europe

Under European copyright and database laws, the mining of in-copyright works (such as by web mining) without the permission of the copyright owner is illegal. In the UK in 2014, on the recommendation of the Hargreaves review, the government amended copyright law to allow text mining as a limitation and exception. The UK was the second country in the world to do so, following Japan, which introduced a mining-specific exception in 2009. However, owing to the restriction of the Information Society Directive (2001), the UK exception only allows content mining for non-commercial purposes. UK copyright law does not allow this provision to be overridden by contractual terms and conditions. The European Commission facilitated stakeholder discussion on text and data mining in 2013, under the title of Licenses for Europe. Because the discussion treated licenses, rather than limitations and exceptions to copyright law, as the solution to this legal issue, representatives of universities, researchers, libraries, civil society groups and open access publishers left the stakeholder dialogue in May 2013.


Situation in the United States

Under US copyright law, and in particular its fair use provisions, text mining in the United States, as well as in other fair use countries such as Israel, Taiwan and South Korea, is viewed as being legal. As text mining is transformative, meaning that it does not supplant the original work, it is viewed as being lawful under fair use. For example, as part of the Google Book settlement the presiding judge on the case ruled that Google's digitization project of in-copyright books was lawful, in part because of the transformative uses that the digitization project displayed, one such use being text and data mining.


Situation in Australia

There is no exception in the copyright law of Australia for text or data mining within the ''Copyright Act 1968''. The Australian Law Reform Commission has noted that it is unlikely that the "research and study" fair dealing exception would extend to cover such activity either, given that it would be beyond the "reasonable portion" requirement.


Implications

Until recently, websites most often used text-based searches, which only found documents containing specific user-defined words or phrases. Now, through use of a semantic web, text mining can find content based on meaning and context (rather than just by a specific word). Additionally, text mining software can be used to build large dossiers of information about specific people and events. For example, large datasets based on data extracted from news reports can be built to facilitate social network analysis or counter-intelligence. In effect, the text mining software may act in a capacity similar to an intelligence analyst or research librarian, albeit with a more limited scope of analysis. Text mining is also used in some email spam filters as a way of determining the characteristics of messages that are likely to be advertisements or other unwanted material. Text mining plays an important role in determining financial market sentiment.
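The spam-filtering use mentioned above can be sketched with a small naive Bayes word model. The source does not say which technique particular filters use, so naive Bayes is an assumption here, picked because it is a widely taught approach. The training messages and all function names are invented for illustration.

```python
import math
from collections import Counter

# Tiny made-up training set; a real filter would learn from many messages.
SPAM = ["win money now", "cheap pills buy now", "win a free prize"]
HAM = ["meeting agenda for monday", "please review the attached report", "lunch on monday"]

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def train(messages):
    """Count word occurrences across a list of messages."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts

spam_counts, ham_counts = train(SPAM), train(HAM)
vocabulary = set(spam_counts) | set(ham_counts)

def log_likelihood(words, counts):
    """Sum of log P(word | class) with add-one (Laplace) smoothing."""
    total = sum(counts.values())
    return sum(
        math.log((counts[w] + 1) / (total + len(vocabulary)))
        for w in words
        if w in vocabulary  # ignore words never seen in training
    )

def is_spam(text):
    """Compare class scores; priors come from the training-set proportions."""
    words = tokenize(text)
    n_total = len(SPAM) + len(HAM)
    spam_score = math.log(len(SPAM) / n_total) + log_likelihood(words, spam_counts)
    ham_score = math.log(len(HAM) / n_total) + log_likelihood(words, ham_counts)
    return spam_score > ham_score
```

For example, `is_spam("win free money")` is true with this training data, while `is_spam("review the monday report")` is false. Working in log space avoids numerical underflow when messages contain many words.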


See also

* Concept mining
* Document processing
* Full text search
* List of text mining software
* Market sentiment
* Name resolution (semantics and text extraction)
* Named entity recognition
* News analytics
* Ontology learning
* Record linkage
* Sequential pattern mining (string and sequence mining)
* w-shingling
* Web mining, a task that may involve text mining (e.g. first find appropriate web pages by classifying crawled web pages, then extract the desired information from the text content of these pages considered relevant)


References


Sources

* Ananiadou, S. and McNaught, J. (Editors) (2006). ''Text Mining for Biology and Biomedicine''. Artech House Books.
* Bilisoly, R. (2008). ''Practical Text Mining with Perl''. New York: John Wiley & Sons.
* Feldman, R., and Sanger, J. (2006). ''The Text Mining Handbook''. New York: Cambridge University Press.
* Hotho, A., Nürnberger, A. and Paaß, G. (2005). "A brief survey of text mining". ''LDV Forum'', Vol. 20(1), pp. 19–62.
* Indurkhya, N., and Damerau, F. (2010). ''Handbook of Natural Language Processing'', 2nd Edition. Boca Raton, FL: CRC Press.
* Kao, A., and Poteet, S. (Editors). ''Natural Language Processing and Text Mining''. Springer.
* Konchady, M. ''Text Mining Application Programming (Programming Series)''. Charles River Media.
* Manning, C., and Schutze, H. (1999). ''Foundations of Statistical Natural Language Processing''. Cambridge, MA: MIT Press.
* Miner, G., Elder, J., Hill, T., Nisbet, R., Delen, D. and Fast, A. (2012). ''Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications''. Elsevier Academic Press.
* McKnight, W. (2005). "Building business intelligence: Text data mining in business intelligence". ''DM Review'', 21–22.
* Srivastava, A., and Sahami, M. (2009). ''Text Mining: Classification, Clustering, and Applications''. Boca Raton, FL: CRC Press.
* Zanasi, A. (Editor) (2007). ''Text Mining and its Applications to Intelligence, CRM and Knowledge Management''. WIT Press.


External links


* Marti Hearst: What Is Text Mining? (October 2003)
* Automatic Content Extraction, Linguistic Data Consortium
* Automatic Content Extraction, NIST