Data mining
Data mining is the process of extracting and discovering patterns in massive data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with the overall goal of extracting information (with intelligent methods) from a data set and transforming the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

The term "data mining" is a misnomer because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (''mining'') of data itself. It is also a buzzword and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support systems, including artificial intelligence (e.g., machine learning) and business intelligence. Often the more general terms (''large scale'') ''data analysis'' and ''analytics''—or, when referring to actual methods, ''artificial intelligence'' and ''machine learning''—are more appropriate.

The actual data mining task is the semi-automatic or automatic analysis of massive quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting is part of the data mining step, although they do belong to the overall KDD process as additional steps.

The difference between data analysis and data mining is that data analysis is used to test models and hypotheses on the data set, e.g., analyzing the effectiveness of a marketing campaign, regardless of the amount of data; in contrast, data mining uses machine learning and statistical models to uncover hidden patterns in a large volume of data. The related terms ''data dredging'', ''data fishing'', and ''data snooping'' refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.


Etymology

In the 1960s, statisticians and economists used terms like ''data fishing'' or ''data dredging'' to refer to what they considered the bad practice of analyzing data without an a-priori hypothesis. The term "data mining" was used in a similarly critical way by economist Michael Lovell in an article published in the ''Review of Economic Studies'' in 1983. Lovell indicates that the practice "masquerades under a variety of aliases, ranging from 'experimentation' (positive) to 'fishing' or 'snooping' (negative)".

The term ''data mining'' appeared around 1990 in the database community, with generally positive connotations. For a short time in the 1980s, the phrase "database mining"™ was used, but since it was trademarked by HNC, a San Diego–based company, to pitch their Database Mining Workstation, researchers turned to ''data mining''. Other terms used include ''data archaeology'', ''information harvesting'', ''information discovery'', ''knowledge extraction'', etc. Gregory Piatetsky-Shapiro coined the term "knowledge discovery in databases" for the first workshop on the same topic (KDD-1989), and this term became more popular in the AI and machine learning communities. However, the term data mining became more popular in the business and press communities. Currently, the terms ''data mining'' and ''knowledge discovery'' are used interchangeably.


Background

The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity, and increasing power of computer technology have dramatically increased data collection, storage, and manipulation ability. As data sets have grown in size and complexity, direct "hands-on" data analysis has increasingly been augmented with indirect, automated data processing, aided by other discoveries in computer science, especially in the field of machine learning, such as neural networks, cluster analysis, genetic algorithms (1950s), decision trees and decision rules (1960s), and support vector machines (1990s). Data mining is the process of applying these methods with the intention of uncovering hidden patterns in large data sets. It bridges the gap from applied statistics and artificial intelligence (which usually provide the mathematical background) to database management by exploiting the way data is stored and indexed in databases to execute the actual learning and discovery algorithms more efficiently, allowing such methods to be applied to ever-larger data sets.


Process

The ''knowledge discovery in databases (KDD) process'' is commonly defined with the stages:

# Selection
# Pre-processing
# Transformation
# ''Data mining''
# Interpretation/evaluation

It exists, however, in many variations on this theme, such as the Cross-industry standard process for data mining (CRISP-DM), which defines six phases:

# Business understanding
# Data understanding
# Data preparation
# Modeling
# Evaluation
# Deployment

or a simplified process such as (1) pre-processing, (2) data mining, and (3) results validation.

Polls conducted in 2002, 2004, 2007, and 2014 show that the CRISP-DM methodology is the leading methodology used by data miners. The only other data mining standard named in these polls was SEMMA; however, 3–4 times as many people reported using CRISP-DM. Several teams of researchers have published reviews of data mining process models (Lukasz Kurgan and Petr Musilek, "A survey of Knowledge Discovery and Data Mining process models", ''The Knowledge Engineering Review'', Volume 21, Issue 1, March 2006, pp. 1–24, Cambridge University Press, New York), and Azevedo and Santos conducted a comparison of CRISP-DM and SEMMA in 2008 (Azevedo, A. and Santos, M. F., "KDD, SEMMA and CRISP-DM: a parallel overview", in Proceedings of the IADIS European Conference on Data Mining 2008, pp. 182–185).
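As a concrete illustration of these stages, the minimal sketch below walks one data set through selection, pre-processing, transformation, mining, and evaluation using scikit-learn. The data set and model choices are illustrative assumptions, not part of the KDD or CRISP-DM definitions.

```python
# A minimal sketch of the KDD stages with scikit-learn (illustrative only).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Selection: assemble a target data set (a bundled example set here).
X, y = load_breast_cancer(return_X_y=True)

# 2. Pre-processing: hold out data the mining step will never see.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 3. Transformation: scale features to comparable ranges.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 4. Data mining: induce a model (a decision tree, one of the methods
#    mentioned in the Background section, in its modern form).
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 5. Interpretation/evaluation: check the learned patterns on unseen data.
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```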


Pre-processing

Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. A common source for data is a data mart or data warehouse. Pre-processing is essential to analyzing the multivariate data sets before data mining. The target set is then cleaned: data cleaning removes the observations containing noise and those with missing data.
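For instance, the pandas sketch below shows one common cleaning pass: dropping records with missing data and removing noisy observations via a plausibility bound. The column names, values, and threshold are assumptions made for illustration.

```python
import pandas as pd

# Illustrative target set; the columns and values are assumed for this sketch.
raw = pd.DataFrame({
    "age":    [34, 41, None, 29, 230],          # 230 is implausible noise
    "income": [52_000, 61_000, 58_000, None, 49_000],
})

# Remove observations with missing data ...
clean = raw.dropna()

# ... and those containing noise, here via a simple plausibility bound.
clean = clean[clean["age"].between(0, 120)]

print(clean)
```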


Data mining

Data mining involves six common classes of tasks:

* Anomaly detection (outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or data errors that require further investigation due to being out of standard range.
* Association rule learning (dependency modeling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
* Clustering – The task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
* Classification – The task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
* Regression – Attempts to find a function that models the data with the least error; that is, to estimate the relationships among data or data sets.
* Summarization – Providing a more compact representation of the data set, including visualization and report generation.
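Two of these tasks, clustering and classification, can be sketched in a few lines with scikit-learn. The synthetic data and parameter choices below are assumptions made for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Synthetic records with three latent groups (assumed for this sketch).
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Clustering: discover groups without using the known labels.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("discovered groups:", sorted(set(clusters)))

# Classification: generalize known structure (the labels) to new records.
clf = LogisticRegression().fit(X, y)
print("predicted class of a new record:", clf.predict([[0.0, 0.0]]))
```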


Results validation

Data mining can unintentionally be misused, producing results that appear significant but which do not actually predict future behavior and cannot be reproduced on a new sample of data, and therefore bear little use. This is sometimes caused by investigating too many hypotheses and not performing proper statistical hypothesis testing. A simple version of this problem in machine learning is known as overfitting, but the same problem can arise at different phases of the process, and thus a train/test split—when applicable at all—may not be sufficient to prevent this from happening.

The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the algorithms are necessarily valid. It is common for data mining algorithms to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set, and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish "spam" from "legitimate" e-mails would be trained on a training set of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which it had ''not'' been trained. The accuracy of the patterns can then be measured from how many e-mails they correctly classify. Several statistical methods may be used to evaluate the algorithm, such as ROC curves.

If the learned patterns do not meet the desired standards, it is necessary to re-evaluate and change the pre-processing and data mining steps. If the learned patterns do meet the desired standards, then the final step is to interpret the learned patterns and turn them into knowledge.
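A minimal sketch of this train/test evaluation, including a ROC-based score, might look as follows in scikit-learn. The synthetic stand-in for spam-like data and the choice of classifier are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for "spam" vs. "legitimate" e-mail features (assumed).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Train on one sample; evaluate on data the algorithm was *not* trained on.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = GaussianNB().fit(X_train, y_train)

# Area under the ROC curve on the held-out set; patterns that only fit the
# training set (overfitting) show up as a gap between train and test scores.
print("test ROC AUC:",
      roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```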


Research

The premier professional body in the field is the Association for Computing Machinery's (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining (SIGKDD). Since 1989, this ACM SIG has hosted an annual international conference and published its proceedings, and since 1999 it has published a biannual academic journal titled "SIGKDD Explorations".

Computer science conferences on data mining include:

* CIKM Conference – ACM Conference on Information and Knowledge Management
* European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
* KDD Conference – ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Data mining topics are also present in many data management/database conferences such as the ICDE Conference, SIGMOD Conference, and International Conference on Very Large Data Bases.


Standards

There have been some efforts to define standards for the data mining process, for example, the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). Development on successors to these processes (CRISP-DM 2.0 and JDM 2.0) was active in 2006 but has stalled since; JDM 2.0 was withdrawn without reaching a final draft. For exchanging the extracted models—in particular for use in predictive analytics—the key standard is the Predictive Model Markup Language (PMML), an XML-based language developed by the Data Mining Group (DMG) and supported as an exchange format by many data mining applications. As the name suggests, it only covers prediction models, a particular data mining task of high importance to business applications. However, extensions to cover (for example) subspace clustering have been proposed independently of the DMG.
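To make the idea concrete, the sketch below assembles a toy PMML fragment for a one-variable regression model using Python's standard library. The field names and coefficients are invented, and the element layout is a simplified reading of the PMML 4.x conventions, not a validated document.

```python
import xml.etree.ElementTree as ET

# A simplified, illustrative PMML 4.x fragment for y = 2.0*x + 1.5
# (element names follow PMML conventions; this is a sketch, not schema-valid).
pmml = ET.Element("PMML", version="4.4", xmlns="http://www.dmg.org/PMML-4_4")
ET.SubElement(pmml, "Header")

dd = ET.SubElement(pmml, "DataDictionary", numberOfFields="2")
ET.SubElement(dd, "DataField", name="x", optype="continuous", dataType="double")
ET.SubElement(dd, "DataField", name="y", optype="continuous", dataType="double")

model = ET.SubElement(pmml, "RegressionModel", functionName="regression")
schema = ET.SubElement(model, "MiningSchema")
ET.SubElement(schema, "MiningField", name="x")
ET.SubElement(schema, "MiningField", name="y", usageType="target")
table = ET.SubElement(model, "RegressionTable", intercept="1.5")
ET.SubElement(table, "NumericPredictor", name="x", coefficient="2.0")

print(ET.tostring(pmml, encoding="unicode"))
```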


Notable uses

Data mining is used wherever there is digital data available. Notable examples of data mining can be found throughout business, medicine, science, finance, construction, and surveillance.


Privacy concerns and ethics

While the term "data mining" itself may have no ethical implications, it is often associated with the mining of information in relation to user behavior (ethical and otherwise). The ways in which data mining can be used can in some cases and contexts raise questions regarding privacy, legality, and ethics. In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.

Data mining requires data preparation which can uncover information or patterns that compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation. Data aggregation involves combining data together (possibly from various sources) in a way that facilitates analysis, but that also might make identification of private, individual-level data deducible or otherwise apparent (''Think Before You Dig: Privacy Implications of Data Mining & Aggregation'', NASCIO Research Brief, September 2004). This is not data mining ''per se'', but a result of the preparation of data before—and for the purposes of—the analysis. The threat to an individual's privacy comes into play when the data, once compiled, cause the data miner, or anyone who has access to the newly compiled data set, to be able to identify specific individuals, especially when the data were originally anonymous.

Data may also be modified so as to ''become'' anonymous, so that individuals may not readily be identified. However, even "anonymized" data sets can potentially contain enough information to allow identification of individuals, as occurred when journalists were able to find several individuals based on a set of search histories that were inadvertently released by AOL. The inadvertent revelation of personally identifiable information by the provider violates Fair Information Practices. This indiscretion can cause financial, emotional, or bodily harm to the indicated individual. In one instance of privacy violation, the patrons of Walgreens filed a lawsuit against the company in 2011 for selling prescription information to data mining companies, who in turn provided the data to pharmaceutical companies.


Situation in Europe

Europe has rather strong privacy laws, and efforts are underway to further strengthen the rights of consumers. However, the U.S.–E.U. Safe Harbor Principles, developed between 1998 and 2000, currently effectively expose European users to privacy exploitation by U.S. companies. As a consequence of Edward Snowden's global surveillance disclosure, there has been increased discussion of revoking this agreement, as in particular the data would be fully exposed to the National Security Agency, and attempts to reach an agreement with the United States have failed.

In the United Kingdom in particular, there have been cases of corporations using data mining as a way to target certain groups of customers, forcing them to pay unfairly high prices. These groups tend to be people of lower socio-economic status who are not savvy to the ways they can be exploited in digital marketplaces.


Situation in the United States

In the United States, privacy concerns have been addressed by the US Congress via the passage of regulatory controls such as the Health Insurance Portability and Accountability Act (HIPAA). HIPAA requires individuals to give their "informed consent" regarding information they provide and its intended present and future uses. According to an article in ''Biotech Business Week'', "'[i]n practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena,' says the AAHC. More importantly, the rule's goal of protection through informed consent is undermined by the complexity of consent forms that are required of patients and participants, which approach a level of incomprehensibility to average individuals." This underscores the necessity for data anonymity in data aggregation and mining practices.

U.S. information privacy legislation such as HIPAA and the Family Educational Rights and Privacy Act (FERPA) applies only to the specific areas that each such law addresses. The use of data mining by the majority of businesses in the U.S. is not controlled by any legislation.


Copyright law


Situation in Europe

Under European copyright and database laws, the mining of in-copyright works (such as by web mining) without the permission of the copyright owner is not legal. Where a database is pure data in Europe, it may be that there is no copyright—but database rights may exist, so data mining becomes subject to intellectual property owners' rights that are protected by the Database Directive. On the recommendation of the Hargreaves review, the UK government amended its copyright law in 2014 to allow content mining as a limitation and exception. The UK was the second country in the world to do so, after Japan, which introduced an exception in 2009 for data mining. However, due to the restrictions of the Information Society Directive (2001), the UK exception only allows content mining for non-commercial purposes. UK copyright law also does not allow this provision to be overridden by contractual terms and conditions. Since 2020, Switzerland has also regulated data mining, allowing it in the research field under certain conditions laid down by art. 24d of the Swiss Copyright Act; this new article entered into force on 1 April 2020.

The European Commission facilitated stakeholder discussion on text and data mining in 2013, under the title of Licences for Europe. The focus on licensing, rather than limitations and exceptions, as the solution to this legal issue led representatives of universities, researchers, libraries, civil society groups, and open access publishers to leave the stakeholder dialogue in May 2013.


Situation in the United States

US copyright law, and in particular its provision for fair use, upholds the legality of content mining in America and in other fair use countries such as Israel, Taiwan, and South Korea. As content mining is transformative (that is, it does not supplant the original work), it is viewed as lawful under fair use. For example, as part of the Google Book settlement, the presiding judge on the case ruled that Google's digitization project of in-copyright books was lawful, in part because of the transformative uses that the digitization project displayed—one being text and data mining.


Software


Free open-source data mining software and applications

The following applications are available under free/open-source licenses. Public access to application source code is also available.

* Carrot2: Text and search results clustering framework.
* Chemicalize.org: A chemical structure miner and web search engine.
* ELKI: A university research project with advanced cluster analysis and outlier detection methods written in the Java language.
* GATE: A natural language processing and language engineering tool.
* KNIME: The Konstanz Information Miner, a user-friendly and comprehensive data analytics framework.
* Massive Online Analysis (MOA): A real-time big data stream mining tool with concept drift handling, written in the Java programming language.
* MEPX: Cross-platform tool for regression and classification problems based on a Genetic Programming variant.
* mlpack: A collection of ready-to-use machine learning algorithms written in the C++ language.
* NLTK (Natural Language Toolkit): A suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python language.
* OpenNN: Open neural networks library.
* Orange: A component-based data mining and machine learning software suite written in the Python language.
* PSPP: Data mining and statistics software under the GNU Project, similar to SPSS.
* R: A programming language and software environment for statistical computing, data mining, and graphics. It is part of the GNU Project.
* scikit-learn: An open-source machine learning library for the Python programming language.
* Torch: An open-source deep learning library for the Lua programming language and scientific computing framework with wide support for machine learning algorithms.
* UIMA: The UIMA (Unstructured Information Management Architecture) is a component framework for analyzing unstructured content such as text, audio, and video, originally developed by IBM.
* Weka: A suite of machine learning software applications written in the Java programming language.


Proprietary data-mining software and applications

The following applications are available under proprietary licenses.

* Angoss KnowledgeSTUDIO: Data mining tool.
* LIONsolver: An integrated software application for data mining, business intelligence, and modeling that implements the Learning and Intelligent OptimizatioN (LION) approach.
* PolyAnalyst: Data and text mining software by Megaputer Intelligence.
* Microsoft Analysis Services: Data mining software provided by Microsoft.
* NetOwl: Suite of multilingual text and entity analytics products that enable data mining.
* Oracle Data Mining: Data mining software by Oracle Corporation.
* PSeven: Platform for automation of engineering simulation and analysis, multidisciplinary optimization, and data mining provided by DATADVANCE.
* Qlucore Omics Explorer: Data mining software.
* RapidMiner: An environment for machine learning and data mining experiments.
* SAS Enterprise Miner: Data mining software provided by the SAS Institute.
* SPSS Modeler: Data mining software provided by IBM.
* STATISTICA Data Miner: Data mining software provided by StatSoft.
* Tanagra: Visualisation-oriented data mining software, also for teaching.
* Vertica: Data mining software provided by Hewlett-Packard.
* Google Cloud Platform: Automated custom ML models managed by Google.
* Amazon SageMaker: Managed service provided by Amazon for creating and productionising custom ML models.


See also

; Methods
; Application domains
; Application examples
; Related topics
: For more information about extracting information out of data (as opposed to ''analyzing'' data), see:
; Other resources
* International Journal of Data Warehousing and Mining


References


Further reading

* Cabena, Peter; Hadjnian, Pablo; Stadler, Rolf; Verhees, Jaap; Zanasi, Alessandro (1997); ''Discovering Data Mining: From Concept to Implementation'', Prentice Hall
* Chen, M.S.; Han, J.; Yu, P.S. (1996); "Data mining: an overview from a database perspective", ''IEEE Transactions on Knowledge and Data Engineering'' 8 (6), 866–883
* Feldman, Ronen; Sanger, James (2007); ''The Text Mining Handbook'', Cambridge University Press
* Guo, Yike; Grossman, Robert (editors) (1999); ''High Performance Data Mining: Scaling Algorithms, Applications and Systems'', Kluwer Academic Publishers
* Han, Jiawei; Kamber, Micheline; Pei, Jian (2006); ''Data Mining: Concepts and Techniques'', Morgan Kaufmann
* Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2001); ''The Elements of Statistical Learning: Data Mining, Inference, and Prediction'', Springer
* Liu, Bing (2007, 2011); ''Web Data Mining: Exploring Hyperlinks, Contents and Usage Data'', Springer
* Nisbet, Robert; Elder, John; Miner, Gary (2009); ''Handbook of Statistical Analysis & Data Mining Applications'', Academic Press/Elsevier
* Poncelet, Pascal; Masseglia, Florent; Teisseire, Maguelonne (editors) (October 2007); "Data Mining Patterns: New Methods and Applications", ''Information Science Reference''
* Tan, Pang-Ning; Steinbach, Michael; Kumar, Vipin (2005); ''Introduction to Data Mining''
* Theodoridis, Sergios; Koutroumbas, Konstantinos (2009); ''Pattern Recognition'', 4th Edition, Academic Press
* Weiss, Sholom M.; Indurkhya, Nitin (1998); ''Predictive Data Mining'', Morgan Kaufmann
* Witten, Ian H.; Frank, Eibe (2011); ''Data Mining: Practical Machine Learning Tools and Techniques'', Morgan Kaufmann (see also the free Weka software)
* Ye, Nong (2003); ''The Handbook of Data Mining'', Mahwah, NJ: Lawrence Erlbaum


External links
