Oracle Data Mining (ODM) is an option of Oracle Database Enterprise Edition. It contains several data mining and data analysis algorithms for classification, prediction, regression, associations, feature selection, anomaly detection, feature extraction, and specialized analytics. It provides means for the creation, management and operational deployment of data mining models inside the database environment.
Overview
Oracle Corporation has implemented a variety of data mining algorithms inside its Oracle Database relational database product. These implementations integrate directly with the Oracle database kernel and operate natively on data stored in the relational database tables. This eliminates the need for extraction or transfer of data into standalone mining/analytic servers. The relational database platform is leveraged to securely manage models and to efficiently execute SQL queries on large volumes of data. The system is organized around a few generic operations providing a general unified interface for data-mining functions. These operations include functions to create, apply, test, and manipulate data-mining models. Models are created and stored as database objects, and their management is done within the database, similar to tables, views, indexes and other database objects.
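Because models are ordinary database objects, they can be inspected and removed much like tables or views. The following is a minimal sketch, assuming an 11g or later database where the USER_MINING_MODELS dictionary view is available; the model name 'credit_risk_model' is only an illustrative name borrowed from the examples later in this article.

```sql
-- List the mining models owned by the current user, newest first.
SELECT model_name, mining_function, algorithm, creation_date
FROM   user_mining_models
ORDER  BY creation_date DESC;

-- Remove a model that is no longer needed
-- ('credit_risk_model' is an illustrative name).
BEGIN
  DBMS_DATA_MINING.DROP_MODEL('credit_risk_model');
END;
/
```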
In data mining, the process of using a model to derive predictions or descriptions of behavior that is yet to occur is called "scoring". In traditional analytic workbenches, either a model built in the analytic engine has to be deployed in a mission-critical system to score new data, or the data has to be moved from relational tables into the analytical workbench; most workbenches offer proprietary scoring interfaces. ODM simplifies model deployment by offering Oracle SQL functions that score data where it is stored in the database. The user or application developer can therefore use the full capabilities of Oracle SQL, both to pipeline and manipulate the results over several levels and to parallelize and partition data access for performance.
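As a rough illustration of pipelining scoring results through further SQL, the sketch below scores every row in an inner query and aggregates the predictions in an outer one. It assumes a classification model named credit_risk_model and a table named credit_card_data, the names used in the examples later in this article; it is not taken from Oracle's documentation.

```sql
-- Inner query: score each row with the model.
-- Outer query: summarize the predictions like any other column.
SELECT predicted_risk,
       COUNT(*)                        AS customers,
       ROUND(AVG(risk_probability), 3) AS avg_probability
FROM  (SELECT PREDICTION(credit_risk_model USING *)             AS predicted_risk,
              PREDICTION_PROBABILITY(credit_risk_model USING *) AS risk_probability
       FROM   credit_card_data)
GROUP  BY predicted_risk
ORDER  BY customers DESC;
```

Since the scoring functions behave like any other single-row SQL function, the same pattern extends to joins, filters and views.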
Models can be created and managed by one of several means. Oracle Data Miner provides a graphical user interface that steps the user through the process of creating, testing, and applying models (e.g. along the lines of the CRISP-DM methodology). Application and tools developers can embed predictive and descriptive mining capabilities using PL/SQL or Java APIs. Business analysts can quickly experiment with, or demonstrate the power of, predictive analytics using Oracle Spreadsheet Add-In for Predictive Analytics, a dedicated Microsoft Excel adaptor interface.

ODM offers a choice of well-known machine learning approaches such as Decision Trees, Naive Bayes, Support vector machines, and Generalized linear model (GLM) for predictive mining, and Association rules, K-means and Orthogonal Partitioning Clustering (O-Cluster; Milenova and Campos, 2002), and Non-negative matrix factorization for descriptive mining. A minimum description length based technique to grade the relative importance of input mining attributes for a given problem is also provided. Most Oracle Data Mining functions also allow text mining by accepting text (unstructured data) attributes as input. Users do not need to configure text-mining options; a database option handles this behind the scenes.
History
Oracle Data Mining was first introduced in 2002 and its releases are named according to the corresponding Oracle database release:
* Oracle Data Mining 9iR2 (9.2.0.1.0 - May 2002)
* Oracle Data Mining 10gR1 (10.1.0.2.0 - February 2004)
* Oracle Data Mining 10gR2 (10.2.0.1.0 - July 2005)
* Oracle Data Mining 11gR1 (11.1 - September 2007)
* Oracle Data Mining 11gR2 (11.2 - September 2009)
Oracle Data Mining is a logical successor of the Darwin data mining toolset developed by Thinking Machines Corporation in the mid-1990s and later distributed by Oracle after its acquisition of Thinking Machines in 1999. However, the product itself is a complete redesign and rewrite from the ground up: while Darwin was a classic GUI-based analytical workbench, ODM offers a data mining development/deployment platform integrated into the Oracle database, along with the Oracle Data Miner GUI.
The Oracle Data Miner 11gR2 New Workflow GUI was previewed at Oracle OpenWorld 2009. An updated Oracle Data Miner GUI was released in 2012. It is free and is available as an extension to Oracle SQL Developer 3.1.
Functionality
As of release 11gR1, Oracle Data Mining contains the following data mining functions:
* Data transformation and model analysis:
** Data sampling, binning, discretization, and other data transformations.
** Model exploration, evaluation and analysis.
* Feature selection (Attribute Importance):
** Minimum description length (MDL).
* Classification:
** Naive Bayes (NB).
** Generalized linear model (GLM) for logistic regression.
** Support Vector Machine (SVM).
** Decision Trees (DT).
* Anomaly detection:
** One-class Support Vector Machine (SVM).
* Regression:
** Support Vector Machine (SVM).
** Generalized linear model (GLM) for multiple regression.
* Clustering:
** Enhanced k-means (EKM).
** Orthogonal Partitioning Clustering (O-Cluster).
* Association rule learning:
** Itemsets and association rules (AM).
* Feature extraction:
** Non-negative matrix factorization (NMF).
* Text and spatial mining:
** Combined text and non-text columns of input data.
** Spatial/GIS data.
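As one illustration of the functions listed above, anomaly detection with a one-class SVM is built through the same CREATE_MODEL call as a classifier, but with no target column. The sketch below is a hedged example rather than an official recipe: the names anomaly_model_settings and card_anomaly_model are hypothetical, and the table credit_card_data with its customer_id column is borrowed from the examples later in this article.

```sql
-- Settings table: one row per model setting.
CREATE TABLE anomaly_model_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);

BEGIN
  -- Select the SVM algorithm.
  INSERT INTO anomaly_model_settings (setting_name, setting_value)
  VALUES (DBMS_DATA_MINING.algo_name,
          DBMS_DATA_MINING.algo_support_vector_machines);

  -- A classification build with a NULL target yields a one-class model.
  DBMS_DATA_MINING.CREATE_MODEL (
    model_name          => 'card_anomaly_model',
    mining_function     => DBMS_DATA_MINING.classification,
    data_table_name     => 'credit_card_data',
    case_id_column_name => 'customer_id',
    target_column_name  => NULL,
    settings_table_name => 'anomaly_model_settings');
END;
/

-- For a one-class model, a prediction of 0 marks a row as atypical
-- and 1 marks it as typical.
SELECT customer_id
FROM   credit_card_data
WHERE  PREDICTION(card_anomaly_model USING *) = 0;
```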
Input sources and data preparation
Most Oracle Data Mining functions accept as input one relational table or view. Flat data can be combined with transactional data through the use of nested columns, enabling mining of data involving one-to-many relationships (e.g. a star schema). The full functionality of SQL can be used when preparing data for data mining, including dates and spatial data.
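A nested column can be produced with ordinary SQL by collecting the "many" side of a relationship into one value per case. The sketch below assumes a hypothetical detail table customer_purchases(customer_id, product_name, quantity) alongside the credit_card_data table used later in this article; the view name is also made up for illustration.

```sql
-- One row per customer; the purchases column nests that customer's
-- (product_name, quantity) pairs as name/value attributes.
-- Customers without purchases are dropped by the inner join.
CREATE OR REPLACE VIEW credit_card_data_nested AS
SELECT c.customer_id,
       c.credit_risk,
       CAST(COLLECT(DM_NESTED_NUMERICAL(p.product_name, p.quantity))
            AS DM_NESTED_NUMERICALS) AS purchases
FROM   credit_card_data c
JOIN   customer_purchases p
       ON p.customer_id = c.customer_id
GROUP  BY c.customer_id, c.credit_risk;
```

A view of this kind could then be named as the data_table_name when building a model.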
Oracle Data Mining distinguishes numerical, categorical, and unstructured (text) attributes. The product also provides utilities for data preparation steps prior to model building, such as outlier treatment, discretization, normalization and binning (grouping values into a limited number of ranges).
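The same kinds of preparation can also be expressed in plain SQL. The sketch below is not the product's own transformation utility; it uses standard Oracle analytic functions to show quantile binning and min-max normalization, and the column annual_income and the view name are hypothetical.

```sql
CREATE OR REPLACE VIEW credit_card_data_prepared AS
SELECT customer_id,
       credit_risk,
       -- Binning: assign each row to one of four equal-sized income groups.
       NTILE(4) OVER (ORDER BY annual_income) AS income_quartile,
       -- Normalization: rescale income to the range 0..1
       -- (NULLIF avoids division by zero when all values are equal).
       (annual_income - MIN(annual_income) OVER ())
         / NULLIF(MAX(annual_income) OVER () - MIN(annual_income) OVER (), 0)
         AS income_scaled
FROM   credit_card_data;
```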
Graphical user interface: Oracle Data Miner
Users can access Oracle Data Mining through Oracle Data Miner, a GUI client application that provides access to the data mining functions and structured templates (called Mining Activities) that automatically prescribe the order of operations, perform required data transformations, and set model parameters. The user interface also allows the automated generation of Java and/or SQL code associated with the data-mining activities. The Java Code Generator is an extension to Oracle JDeveloper. An independent interface also exists: the Spreadsheet Add-In for Predictive Analytics, which enables access to the Oracle Data Mining Predictive Analytics PL/SQL package from Microsoft Excel.

From version 11.2 of the Oracle database, Oracle Data Miner integrates with Oracle SQL Developer.
PL/SQL and Java interfaces
Oracle Data Mining provides a native PL/SQL package (DBMS_DATA_MINING) to create, destroy, describe, apply, test, export and import models. The code below illustrates a typical call to build a classification model:
    BEGIN
      DBMS_DATA_MINING.CREATE_MODEL (
        model_name          => 'credit_risk_model',
        mining_function     => DBMS_DATA_MINING.classification,
        data_table_name     => 'credit_card_data',
        case_id_column_name => 'customer_id',
        target_column_name  => 'credit_risk',
        settings_table_name => 'credit_risk_model_settings');
    END;
    /
where 'credit_risk_model' is the model name, built for the express purpose of classifying future customers' 'credit_risk', based on training data provided in the table 'credit_card_data', each case distinguished by a unique 'customer_id', with the rest of the model parameters specified through the table 'credit_risk_model_settings'.
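The settings table named in the call above is an ordinary two-column table that must exist before the model is built. The sketch below shows one plausible way to populate it; the choice of a decision tree and of automatic data preparation is an assumption made for illustration, not something the original example specifies.

```sql
-- Settings table: one row per model setting.
CREATE TABLE credit_risk_model_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);

BEGIN
  -- Choose the algorithm (assumed here: decision tree).
  INSERT INTO credit_risk_model_settings (setting_name, setting_value)
  VALUES (DBMS_DATA_MINING.algo_name, DBMS_DATA_MINING.algo_decision_tree);

  -- Turn on automatic data preparation (available from 11g).
  INSERT INTO credit_risk_model_settings (setting_name, setting_value)
  VALUES (DBMS_DATA_MINING.prep_auto, DBMS_DATA_MINING.prep_auto_on);

  COMMIT;
END;
/
```

Settings that are not listed in the table fall back to the defaults of the chosen mining function.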
Oracle Data Mining also supports a Java API consistent with the Java Data Mining (JDM) standard for data mining (JSR-73), to enable integration with web and Java EE applications and to facilitate portability across platforms.
SQL scoring functions
As of release 10gR2, Oracle Data Mining contains built-in SQL functions for scoring data mining models. These single-row functions support classification, regression, anomaly detection, clustering, and feature extraction. The code below illustrates a typical usage of a classification model:
SELECT customer_name
FROM credit_card_data
WHERE PREDICTION (credit_risk_model USING *) = 'LOW' AND customer_value = 'HIGH';
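Besides the predicted class, the probability of a particular class can be requested and used for ranking. The sketch below reuses the model, table and class value from the query above to list the ten customers most likely to be low risk; it uses ROWNUM so that it also runs on releases that predate the row-limiting clause.

```sql
SELECT customer_name, low_risk_probability
FROM  (SELECT customer_name,
              -- Probability that the model assigns to the class 'LOW'.
              PREDICTION_PROBABILITY(credit_risk_model, 'LOW' USING *)
                AS low_risk_probability
       FROM   credit_card_data
       ORDER  BY low_risk_probability DESC)
WHERE  ROWNUM <= 10;
```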
PMML
In Release 11gR2 (11.2.0.2), ODM supports the import of externally created PMML for some of the data mining models. PMML is an XML-based standard for representing data mining models.
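A PMML import might look like the sketch below, which passes the document to DBMS_DATA_MINING.IMPORT_MODEL as an XMLType. Everything named here is an assumption: PMML_DIR is a hypothetical directory object that the database user can read, regression_model.xml is a hypothetical file, and imported_pmml_model is a made-up model name. Which model types are accepted depends on the release.

```sql
BEGIN
  DBMS_DATA_MINING.IMPORT_MODEL (
    'imported_pmml_model',                            -- name for the new model
    XMLTYPE(BFILENAME('PMML_DIR', 'regression_model.xml'),
            NLS_CHARSET_ID('AL32UTF8')));             -- PMML document as XML
END;
/
```

Once imported, the model would be scored with the same SQL functions as a model built inside the database.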
Predictive analytics Microsoft Excel add-in
The PL/SQL package DBMS_PREDICTIVE_ANALYTICS automates the data mining process, including data preprocessing, model building and evaluation, and scoring of new data. The PREDICT operation is used for predicting target values (classification or regression), while EXPLAIN ranks attributes in order of influence in explaining a target column (feature selection). The 11g feature PROFILE finds customer segments and their profiles, given a target attribute. These operations can be used as part of an operational pipeline providing actionable results, or displayed for interpretation by end users.
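The two most commonly used operations can be sketched as follows. The input table and column names are borrowed from the earlier examples in this article, and the result table names are hypothetical; each procedure creates its result table itself.

```sql
SET SERVEROUTPUT ON

DECLARE
  v_accuracy NUMBER;  -- predictive confidence returned by PREDICT
BEGIN
  -- Rank the attributes by how much they explain credit_risk.
  DBMS_PREDICTIVE_ANALYTICS.EXPLAIN (
    data_table_name     => 'credit_card_data',
    explain_column_name => 'credit_risk',
    result_table_name   => 'credit_risk_explain_result');

  -- Predict credit_risk for every case in the table.
  DBMS_PREDICTIVE_ANALYTICS.PREDICT (
    accuracy            => v_accuracy,
    data_table_name     => 'credit_card_data',
    case_id_column_name => 'customer_id',
    target_column_name  => 'credit_risk',
    result_table_name   => 'credit_risk_predict_result');

  DBMS_OUTPUT.PUT_LINE('Predictive confidence: ' || ROUND(v_accuracy, 3));
END;
/
```

No model object is left behind by these calls; the results are available only in the named result tables.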
References and further reading
* T. H. Davenport. Competing on Analytics. Harvard Business Review, January 2006.
* I. Ben-Gal. Outlier detection. In: Maimon, O. and Rokach, L. (Eds.), Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers. Kluwer Academic Publishers, 2005.
* M. M. Campos, P. J. Stengard, and B. L. Milenova. Data-centric Automated Data Mining. In Proceedings of the ''Fourth International Conference on Machine Learning and Applications 2005'', 15–17 December 2005. pp. 8.
* M. F. Hornick, Erik Marcade, and Sunil Venkayala. Java Data Mining: Strategy, Standard, and Practice. Morgan-Kaufmann, 2006.
* B. L. Milenova, J. S. Yarmus, and M. M. Campos. SVM in Oracle database 10g: removing the barriers to widespread adoption of support vector machines. In Proceedings of the ''31st International Conference on Very Large Data Bases'' (Trondheim, Norway, August 30 - September 2, 2005). pp. 1152–1163.
* B. L. Milenova and M. M. Campos. O-Cluster: scalable clustering of large high dimensional data sets. In Proceedings of the ''2002 IEEE International Conference on Data Mining: ICDM 2002''. pp. 290–297.
* P. Tamayo, C. Berger, M. M. Campos, J. S. Yarmus, B. L. Milenova, A. Mozes, M. Taft, M. Hornick, R. Krishnan, S. Thomas, M. Kelly, D. Mukhin, R. Haberstroh, S. Stephens and J. Myczkowski. Oracle Data Mining - Data Mining in the Database Environment. In Part VII of ''Data Mining and Knowledge Discovery Handbook'', Maimon, O.; Rokach, L. (Eds.), 2005. pp. 1315–1329.
* Brendan Tierney. Predictive Analytics using Oracle Data Miner: for the data scientist, Oracle analyst, Oracle developer & DBA. Oracle Press, McGraw Hill, Spring 2014.
See also
* Oracle LogMiner: in contrast to generic data mining, targets the extraction of information from the internal logs of an Oracle database
External links
Oracle Data Mining at Oracle Technology Network
Oracle Data Mining Blog
Oracle Data Mining and Analytics Blog
Oracle Wiki for Data Mining
Oracle Data Mining RSS Feed
Oracle Data Mining related blog by Brendan Tierney (Oracle ACE Director)
Oracle Data Mining Examples (on Panoply Technology)