Oracle Data Mining (ODM) is an option of Oracle Database Enterprise Edition. It contains several data mining and data analysis algorithms for classification, prediction, regression, associations, feature selection, anomaly detection, feature extraction, and specialized analytics. It provides means for the creation, management and operational deployment of data mining models inside the database environment.

Overview

Oracle Corporation has implemented a variety of data mining algorithms inside its relational database product. These implementations integrate directly with the Oracle database kernel and operate natively on data stored in the database tables. This eliminates the need to extract or transfer data into standalone mining or analytic servers. The relational database platform is leveraged to securely manage models and to efficiently execute SQL queries on large volumes of data.

The system is organized around a few generic operations that provide a unified interface for data mining functions. These operations include functions to create, apply, test, and manipulate data mining models. Models are created and stored as database objects, and they are managed within the database in the same way as tables, views, indexes and other database objects.

In data mining, the process of using a model to derive predictions or descriptions of behavior that is yet to occur is called "scoring". In traditional analytic workbenches, a model built in the analytic engine has to be deployed in a mission-critical system to score new data, or the data has to be moved from relational tables into the analytical workbench; most workbenches offer proprietary scoring interfaces. ODM simplifies model deployment by offering Oracle SQL functions to score data stored in the database. The user or application developer can therefore use the full power of Oracle SQL, both to pipeline and manipulate the results over several levels and to parallelize and partition data access for performance.

Models can be created and managed by one of several means. Oracle Data Miner provides a graphical user interface that steps the user through the process of creating, testing, and applying models (e.g. along the lines of the CRISP-DM methodology). Application and tools developers can embed predictive and descriptive mining capabilities using PL/SQL or Java APIs. Business analysts can quickly experiment with, or demonstrate the power of, predictive analytics using the Oracle Spreadsheet Add-In for Predictive Analytics, a dedicated Microsoft Excel adaptor interface.

ODM offers a choice of well-known machine learning approaches: decision trees, naive Bayes, support vector machines and generalized linear models (GLM) for predictive mining; and association rules, k-means clustering, Orthogonal Partitioning Clustering (O-Cluster; Milenova and Campos, 2002) and non-negative matrix factorization for descriptive mining. A technique based on minimum description length is also provided to grade the relative importance of input mining attributes for a given problem.

Most Oracle Data Mining functions also allow text mining by accepting text (unstructured data) attributes as input. Users do not need to configure text-mining options; a database option handles this behind the scenes.
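Because models are ordinary database objects, they can be inspected and removed with the same tools as tables or views. The following is a minimal sketch, assuming a release that exposes the USER_MINING_MODELS dictionary view (11g and later) and a model named 'credit_risk_model', a name used here only for illustration:

```sql
-- List the mining models owned by the current user, newest first.
SELECT model_name,
       mining_function,
       algorithm,
       creation_date
FROM   user_mining_models
ORDER  BY creation_date DESC;

-- Drop a model that is no longer needed (the model name is illustrative).
BEGIN
  DBMS_DATA_MINING.DROP_MODEL(model_name => 'credit_risk_model');
END;
/
```

The trailing slash is the SQL*Plus convention for running a PL/SQL block; other clients may not need it.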

History

Oracle Data Mining was first introduced in 2002, and its releases are named according to the corresponding Oracle database release:

* Oracle Data Mining 9iR2 (9.2.0.1.0 - May 2002)
* Oracle Data Mining 10gR1 (10.1.0.2.0 - February 2004)
* Oracle Data Mining 10gR2 (10.2.0.1.0 - July 2005)
* Oracle Data Mining 11gR1 (11.1 - September 2007)
* Oracle Data Mining 11gR2 (11.2 - September 2009)

Oracle Data Mining is a logical successor of the Darwin data mining toolset developed by Thinking Machines Corporation in the mid-1990s and later distributed by Oracle after its acquisition of Thinking Machines in 1999. However, the product itself is a complete redesign and rewrite from the ground up: while Darwin was a classic GUI-based analytical workbench, ODM offers a data mining development and deployment platform integrated into the Oracle database, along with the Oracle Data Miner GUI.

The Oracle Data Miner 11gR2 New Workflow GUI was previewed at Oracle Open World 2009. An updated Oracle Data Miner GUI was released in 2012. It is free, and is available as an extension to Oracle SQL Developer 3.1.

Functionality

As of release 11gR1, Oracle Data Mining contains the following data mining functions:

* Data transformation and model analysis:
** Data sampling, binning, discretization, and other data transformations.
** Model exploration, evaluation and analysis.
* Feature selection (Attribute Importance):
** Minimum description length (MDL).
* Classification:
** Naive Bayes (NB).
** Generalized linear model (GLM) for logistic regression.
** Support vector machine (SVM).
** Decision trees (DT).
* Anomaly detection:
** One-class support vector machine (SVM).
* Regression:
** Support vector machine (SVM).
** Generalized linear model (GLM) for multiple regression.
* Clustering:
** Enhanced k-means (EKM).
** Orthogonal Partitioning Clustering (O-Cluster).
* Association rule learning:
** Itemsets and association rules (AM).
* Feature extraction:
** Non-negative matrix factorization (NMF).
* Text and spatial mining:
** Combined text and non-text columns of input data.
** Spatial/GIS data.

Input sources and data preparation

Most Oracle Data Mining functions accept as input one relational table or view. Flat data can be combined with transactional data through the use of nested columns, enabling mining of data involving one-to-many relationships (e.g. a star schema). The full functionality of SQL can be used when preparing data for data mining, including dates and spatial data.

Oracle Data Mining distinguishes numerical, categorical, and unstructured (text) attributes. The product also provides utilities for data preparation steps prior to model building, such as outlier treatment, normalization and binning (grouping values into a small number of ranges).
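Since ordinary SQL can be used for preparation, normalization and binning can be sketched as a view over the training table. The example below uses plain Oracle SQL rather than the product's own transformation utilities. The view name, the columns income and age, and the bin boundaries are assumptions made for illustration only.

```sql
-- Illustrative view that prepares two numeric columns for mining.
CREATE OR REPLACE VIEW credit_card_data_prepared AS
SELECT customer_id,
       credit_risk,
       -- Min-max normalization of income to the range [0, 1];
       -- NULLIF yields NULL instead of a division by zero
       -- when every row has the same income.
       (income - MIN(income) OVER ())
         / NULLIF(MAX(income) OVER () - MIN(income) OVER (), 0) AS income_norm,
       -- Equal-width binning of age into 5 buckets between 18 and 98.
       -- Values below 18 fall into bucket 0, values of 98 or more into bucket 6.
       WIDTH_BUCKET(age, 18, 98, 5) AS age_bin
FROM   credit_card_data;
```

A view of this kind can be supplied wherever a mining function expects a table or view as input, so the source table itself is left unchanged.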

Graphical user interface: Oracle Data Miner

Users can access Oracle Data Mining through Oracle Data Miner, a GUI client application that provides access to the data mining functions and structured templates (called Mining Activities) that automatically prescribe the order of operations, perform required data transformations, and set model parameters. The user interface also allows the automated generation of Java and/or SQL code associated with the data-mining activities. The Java Code Generator is an extension to Oracle JDeveloper. An independent interface also exists: the Spreadsheet Add-In for Predictive Analytics, which enables access to the Oracle Data Mining Predictive Analytics PL/SQL package from Microsoft Excel. From version 11.2 of the Oracle database, Oracle Data Miner integrates with Oracle SQL Developer.

PL/SQL and Java interfaces

Oracle Data Mining provides a native PL/SQL package (DBMS_DATA_MINING) to create, destroy, describe, apply, test, export and import models. The code below illustrates a typical call to build a classification model:

    BEGIN
      DBMS_DATA_MINING.CREATE_MODEL (
        model_name          => 'credit_risk_model',
        mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
        data_table_name     => 'credit_card_data',
        case_id_column_name => 'customer_id',
        target_column_name  => 'credit_risk',
        settings_table_name => 'credit_risk_model_settings');
    END;

where 'credit_risk_model' is the model name, built for the express purpose of classifying future customers' 'credit_risk', based on training data provided in the table 'credit_card_data', each case distinguished by a unique 'customer_id', with the rest of the model parameters specified through the table 'credit_risk_model_settings'.

Oracle Data Mining also supports a Java API consistent with the Java Data Mining (JDM) standard for data mining (JSR-73), enabling integration with web and Java EE applications and facilitating portability across platforms.
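The settings table named in the CREATE_MODEL call has to exist before the model is built. The following is a minimal sketch of how such a table might be created and populated, assuming the two-column layout (setting_name, setting_value) and the package constants documented for DBMS_DATA_MINING. The choice of a decision tree and the column sizes are assumptions; check them against the release in use.

```sql
-- Settings table referenced by settings_table_name in CREATE_MODEL.
CREATE TABLE credit_risk_model_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);

BEGIN
  -- Select the algorithm; a decision tree is chosen here only as an example.
  INSERT INTO credit_risk_model_settings (setting_name, setting_value)
  VALUES (DBMS_DATA_MINING.ALGO_NAME, DBMS_DATA_MINING.ALGO_DECISION_TREE);
  COMMIT;
END;
/
```

The insert is wrapped in a PL/SQL block because the package constants are resolved there. Settings that are not listed fall back to the defaults of the chosen mining function.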

SQL scoring functions

As of release 10gR2, Oracle Data Mining contains built-in SQL functions for scoring data mining models. These single-row functions support classification, regression, anomaly detection, clustering, and feature extraction. The code below illustrates a typical usage of a classification model:

    SELECT customer_name
    FROM credit_card_data
    WHERE PREDICTION (credit_risk_model USING *) = 'LOW'
      AND customer_value = 'HIGH';
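Because scoring functions are ordinary SQL expressions, they can also appear in the select list and be combined with sorting or filtering. As a sketch under the same assumptions as the query above (the model 'credit_risk_model' and the table 'credit_card_data'), the following ranks customers by the probability that the model assigns to the 'LOW' risk class:

```sql
-- Score every row and rank customers by the probability of low credit risk.
SELECT customer_name,
       PREDICTION(credit_risk_model USING *)                    AS predicted_risk,
       PREDICTION_PROBABILITY(credit_risk_model, 'LOW' USING *) AS prob_low
FROM   credit_card_data
ORDER  BY prob_low DESC;
```

USING * passes all columns of the row to the model. Columns that the model does not use are ignored.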

PMML

In Release 11gR2 (11.2.0.2), ODM supports the import of externally created PMML for some of the data mining models. PMML (Predictive Model Markup Language) is an XML-based standard for representing data mining models.
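A PMML import might look like the sketch below, which assumes the IMPORT_MODEL overload of DBMS_DATA_MINING that accepts an XMLType document. The directory object 'PMML_DIR', the file name and the model name are hypothetical. Which PMML model types are accepted depends on the release.

```sql
BEGIN
  -- Read the PMML file through a directory object and import it as a model.
  DBMS_DATA_MINING.IMPORT_MODEL(
    model_name => 'imported_pmml_model',
    pmmldoc    => XMLType(
                    BFILENAME('PMML_DIR', 'credit_risk.pmml'),
                    NLS_CHARSET_ID('AL32UTF8')));
END;
/
```

Once imported, the model can be scored with the same SQL functions as a model built inside the database.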

Predictive analytics Microsoft Excel add-in

The PL/SQL package DBMS_PREDICTIVE_ANALYTICS automates the data mining process including data preprocessing, model building and evaluation, and scoring of new data. The PREDICT operation is used for predicting target values (classification or regression), while EXPLAIN ranks attributes in order of influence in explaining a target column (feature selection). The new 11g feature PROFILE finds customer segments and their profiles, given a target attribute. These operations can be used as part of an operational pipeline providing actionable results or displayed for interpretation by end users.
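The EXPLAIN and PREDICT operations can be called as in the following sketch. It reuses the 'credit_card_data' table from the earlier examples. The result table names are hypothetical and are assumed not to exist yet, since the procedures create them.

```sql
DECLARE
  v_accuracy NUMBER;  -- quality measure returned by PREDICT
BEGIN
  -- Rank the attributes by how well they explain credit_risk.
  DBMS_PREDICTIVE_ANALYTICS.EXPLAIN(
    data_table_name     => 'credit_card_data',
    explain_column_name => 'credit_risk',
    result_table_name   => 'credit_risk_explain');

  -- Build a model, score every case, and report the accuracy measure.
  DBMS_PREDICTIVE_ANALYTICS.PREDICT(
    accuracy            => v_accuracy,
    data_table_name     => 'credit_card_data',
    case_id_column_name => 'customer_id',
    target_column_name  => 'credit_risk',
    result_table_name   => 'credit_risk_predictions');

  DBMS_OUTPUT.PUT_LINE('Accuracy measure: ' || v_accuracy);
END;
/
```

Neither call leaves a persistent model behind: the results are written to the named tables, which can then be queried or joined like any other table.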

References and further reading

* T. H. Davenport. Competing on Analytics. Harvard Business Review, January 2006.
* I. Ben-Gal. Outlier detection. In: Maimon, O. and Rokach, L. (Eds.), Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers. Kluwer Academic Publishers, 2005.
* M. M. Campos, P. J. Stengard, and B. L. Milenova. Data-centric Automated Data Mining. In Proceedings of the Fourth International Conference on Machine Learning and Applications 2005, 15–17 December 2005, 8 pp.
* M. F. Hornick, Erik Marcade, and Sunil Venkayala. Java Data Mining: Strategy, Standard, and Practice. Morgan-Kaufmann, 2006.
* B. L. Milenova, J. S. Yarmus, and M. M. Campos. SVM in Oracle database 10g: removing the barriers to widespread adoption of support vector machines. In Proceedings of the 31st International Conference on Very Large Data Bases (Trondheim, Norway, August 30 - September 2, 2005), pp. 1152–1163.
* B. L. Milenova and M. M. Campos. O-Cluster: scalable clustering of large high dimensional data sets. In Proceedings of the 2002 IEEE International Conference on Data Mining: ICDM 2002, pp. 290–297.
* P. Tamayo, C. Berger, M. M. Campos, J. S. Yarmus, B. L. Milenova, A. Mozes, M. Taft, M. Hornick, R. Krishnan, S. Thomas, M. Kelly, D. Mukhin, R. Haberstroh, S. Stephens and J. Myczkowski. Oracle Data Mining - Data Mining in the Database Environment. In Part VII of Data Mining and Knowledge Discovery Handbook, Maimon, O.; Rokach, L. (Eds.), 2005, pp. 1315–1329.
* Brendan Tierney. Predictive Analytics using Oracle Data Miner: for the data scientist, oracle analyst, oracle developer & DBA. Oracle Press, McGraw Hill, Spring 2014.


External links

Oracle Data Mining at Oracle Technology Network

Oracle Data Mining Blog

Oracle Data Mining and Analytics Blog

Oracle Wiki for Data Mining

Oracle Data Mining RSS Feed

Oracle Data Mining related blog by Brendan Tierney (Oracle ACE Director)

Oracle Data Mining Examples (on Panoply Technology)