The Predictive Model Markup Language (PMML) is an XML-based predictive model interchange format conceived by Dr. Robert Lee Grossman, then the director of the National Center for Data Mining at the University of Illinois at Chicago. PMML provides a way for analytic applications to describe and exchange predictive models produced by data mining and machine learning algorithms. It supports common models such as logistic regression and feedforward neural networks. Version 0.9 was published in 1998. Subsequent versions have been developed by the Data Mining Group.
Since PMML is an XML-based standard, the specification comes in the form of an XML schema. PMML itself is a mature standard, with over 30 organizations having announced products that support it.
PMML Components
A PMML file can be described by the following components:
* Header: contains general information about the PMML document, such as copyright information for the model, its description, and information about the application used to generate the model, such as its name and version. It can also carry a timestamp recording when the model was created.
* Data Dictionary: contains definitions for all the possible fields used by the model. It is here that a field is defined as continuous, categorical, or ordinal (attribute optype). Depending on this definition, the appropriate value ranges are then defined as well as the data type (such as string or double).
* Data Transformations: transformations allow for the mapping of user data into a more desirable form to be used by the mining model. PMML defines several kinds of simple data transformations.
** Normalization: map values to numbers; the input can be continuous or discrete.
** Discretization: map continuous values to discrete values.
** Value mapping: map discrete values to discrete values.
** Functions (custom and built-in): derive a value by applying a function to one or more parameters.
** Aggregation: used to summarize or collect groups of values.
* Model: contains the definition of the data mining model. For example, a multi-layered feedforward neural network is represented in PMML by a "NeuralNetwork" element which contains attributes such as:
** Model Name (attribute modelName)
** Function Name (attribute functionName)
** Algorithm Name (attribute algorithmName)
** Activation Function (attribute activationFunction)
** Number of Layers (attribute numberOfLayers)
:This information is then followed by three kinds of neural layers which specify the architecture of the neural network model being represented in the PMML document. These elements are NeuralInputs, NeuralLayer, and NeuralOutputs. Besides neural networks, PMML allows for the representation of many other types of models including support vector machines, association rules, Naive Bayes classifiers, clustering models, text models, decision trees, and different regression models.
* Mining Schema: a list of all fields used in the model. This can be a subset of the fields as defined in the data dictionary. It contains specific information about each field, such as:
** Name (attribute name): must refer to a field in the data dictionary
** Usage type (attribute usageType): defines the way a field is to be used in the model. Typical values are: active, predicted, and supplementary. Predicted fields are those whose values are predicted by the model.
** Outlier Treatment (attribute outliers): defines the outlier treatment to be used. In PMML, outliers can be treated as missing values, as extreme values (based on the definition of high and low values for a particular field), or as is.
** Missing Value Replacement Policy (attribute missingValueReplacement): if this attribute is specified, a missing value is automatically replaced by the given value.
** Missing Value Treatment (attribute missingValueTreatment): indicates how the missing value replacement was derived (e.g. as value, mean or median).
* Targets: allows for post-processing of the predicted value in the form of scaling if the output of the model is continuous. Targets can also be used for classification tasks. In this case, the attribute priorProbability specifies a default probability for the corresponding target category. It is used if the prediction logic itself did not produce a result, which can happen, for example, if an input value is missing and there is no other method for treating missing values.
* Output: this element can be used to name all the desired output fields expected from the model. These are features of the predicted field and so are typically the predicted value itself, the probability, cluster affinity (for clustering models), standard error, etc. PMML 4.1 extended Output to allow for generic post-processing of model outputs: all the built-in and custom functions that were originally available only for pre-processing became available for post-processing too.
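As a rough illustration of how these components fit together, the sketch below builds a small, made-up PMML document and reads it with Python's standard xml.etree.ElementTree module. The field names, coefficients, application name, and the helper functions summarize and score are all assumptions introduced for this example. The scorer handles only this toy regression and is not a conforming PMML engine.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"
NS = {"p": PMML_NS}

# Made-up toy document: names, values and coefficients are illustrative only.
PMML_DOC = f"""
<PMML xmlns="{PMML_NS}" version="4.4">
  <Header copyright="Example Corp" description="Toy linear regression">
    <Application name="ExampleApp" version="1.0"/>
    <Timestamp>2020-01-01T00:00:00</Timestamp>
  </Header>
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="region" optype="categorical" dataType="string">
      <Value value="north"/>
      <Value value="south"/>
    </DataField>
    <DataField name="spend" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="toy" functionName="regression">
    <MiningSchema>
      <MiningField name="age" usageType="active" missingValueReplacement="40"/>
      <MiningField name="spend" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="age" coefficient="0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def summarize(root):
    """Collect a few facts from the Header, DataDictionary and MiningSchema."""
    app = root.find("p:Header/p:Application", NS)
    fields = {
        f.get("name"): (f.get("optype"), f.get("dataType"))
        for f in root.findall("p:DataDictionary/p:DataField", NS)
    }
    usage = {
        m.get("name"): m.get("usageType", "active")
        for m in root.findall("p:RegressionModel/p:MiningSchema/p:MiningField", NS)
    }
    return {"application": app.get("name"), "fields": fields, "usage": usage}


def score(root, row):
    """Toy scorer: fill missing inputs, then apply the linear formula."""
    model = root.find("p:RegressionModel", NS)
    filled = dict(row)
    for m in model.findall("p:MiningSchema/p:MiningField", NS):
        replacement = m.get("missingValueReplacement")
        if filled.get(m.get("name")) is None and replacement is not None:
            # Assumes a numeric field; a real engine would honour dataType.
            filled[m.get("name")] = float(replacement)
    table = model.find("p:RegressionTable", NS)
    result = float(table.get("intercept"))
    for p in table.findall("p:NumericPredictor", NS):
        result += float(p.get("coefficient")) * filled[p.get("name")]
    return result


root = ET.fromstring(PMML_DOC)
summary = summarize(root)
print(summary["application"], summary["usage"])
print(score(root, {"age": 30.0}))  # age supplied by the caller
print(score(root, {}))             # age missing, replaced via the mining schema
```

Here the mining schema names a subset of the data dictionary (region is defined but not used by the model), and the missingValueReplacement attribute supplies the input when the caller omits age.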
PMML 4.0, 4.1, 4.2 and 4.3
PMML 4.0 was released on June 16, 2009.
Examples of new features included:
* Improved Pre-Processing Capabilities: Additions to built-in functions include a range of Boolean operations and an If-Then-Else function.
* Time Series Models: New exponential smoothing models; also placeholders for ARIMA, Seasonal Trend Decomposition, and Spectral Density Estimation, which were marked for support in future versions.
* Model Explanation: Saving of evaluation and model performance measures to the PMML file itself.
* Multiple Models: Capabilities for model composition, ensembles, and segmentation (e.g., combining regression and decision trees).
* Extensions of Existing Elements: Addition of multi-class classification for Support Vector Machines, improved representation for Association Rules, and the addition of Cox Regression Models.
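To make the Boolean and If-Then-Else built-in functions more concrete, the following sketch evaluates a small PMML-style Apply expression in Python. The field names, thresholds, and the evaluate helper are assumptions made for this example. The evaluator covers only a handful of functions, omits the PMML XML namespace for brevity, and ignores missing-value handling, so it should be read as an illustration rather than a reference implementation.

```python
import xml.etree.ElementTree as ET

# Illustrative expression; field names and thresholds are made up.
# The PMML namespace is omitted to keep tag handling short.
EXPRESSION = """
<Apply function="if">
  <Apply function="and">
    <Apply function="greaterThan">
      <FieldRef field="age"/>
      <Constant dataType="double">65</Constant>
    </Apply>
    <Apply function="not">
      <Apply function="lessThan">
        <FieldRef field="income"/>
        <Constant dataType="double">20000</Constant>
      </Apply>
    </Apply>
  </Apply>
  <Constant dataType="string">senior</Constant>
  <Constant dataType="string">other</Constant>
</Apply>
"""

# Small subset of built-in functions, mapped to Python callables.
SIMPLE_FUNCTIONS = {
    "greaterThan": lambda a, b: a > b,
    "lessThan": lambda a, b: a < b,
    "and": lambda *args: all(args),
    "or": lambda *args: any(args),
    "not": lambda a: not a,
}


def evaluate(node, row):
    """Recursively evaluate a FieldRef, Constant or Apply element."""
    if node.tag == "FieldRef":
        return row[node.get("field")]
    if node.tag == "Constant":
        if node.get("dataType") in ("double", "float", "integer"):
            return float(node.text)
        return node.text
    if node.tag == "Apply":
        name = node.get("function")
        children = list(node)
        if name == "if":
            # Evaluate the condition first, then only the selected branch.
            if evaluate(children[0], row):
                return evaluate(children[1], row)
            return evaluate(children[2], row) if len(children) > 2 else None
        args = [evaluate(child, row) for child in children]
        return SIMPLE_FUNCTIONS[name](*args)
    raise ValueError(f"unsupported element: {node.tag}")


expr = ET.fromstring(EXPRESSION)
print(evaluate(expr, {"age": 70.0, "income": 30000.0}))  # senior
print(evaluate(expr, {"age": 30.0, "income": 30000.0}))  # other
```

The same expression tree could appear in a derived field for pre-processing or, from PMML 4.1 onward, in an output field for post-processing.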
PMML 4.1 was released on December 31, 2011.
New features included:
* New model elements for representing Scorecards, k-Nearest Neighbors (KNN) and Baseline Models.
* Simplification of multiple models. In PMML 4.1, the same element is used to represent model segmentation, ensemble, and chaining.
* Overall definition of field scope and field names.
* A new attribute that identifies for each model element if the model is ready or not for production deployment.
* Enhanced post-processing capabilities (via the Output element).
PMML 4.2 was released on February 28, 2014.
New features included:
* Transformations: New elements for implementing text mining
* New built-in functions for implementing regular expressions: matches, concat, and replace
* Simplified outputs for post-processing
* Enhancements to Scorecard and Naive Bayes model elements
PMML 4.3 was released on August 23, 2016.
New features included:
* New Model Types:
** Gaussian Process
** Bayesian Network
* New built-in functions
* Usage clarifications
* Documentation improvements
Version 4.4 was released in November 2019.
Release history
Data Mining Group
The Data Mining Group is a consortium managed by the Center for Computational Science Research, Inc., a nonprofit founded in 2008.
The Data Mining Group also developed a standard called Portable Format for Analytics, or PFA, which is complementary to PMML.
See also
* Open Neural Network Exchange
External links
* Data Pre-processing in PMML and ADAPA - A Primer
* Video of Dr. Alex Guazzelli's PMML presentation for the ACM Data Mining Group (hosted by LinkedIn)
* Representing predictive solutions in PMML: Move from raw data to predictions - Article published on the IBM developerWorks website.
* Predictive analytics in healthcare: The importance of open standards - Article published on the IBM developerWorks website.