Multifactor dimensionality reduction (MDR) is a statistical approach, also used in
machine learning automatic approaches, for detecting and characterizing combinations of
attributes or
independent variable
Dependent and independent variables are variables in mathematical modeling, statistical modeling and experimental sciences. Dependent variables receive this name because, in an experiment, their values are studied under the supposition or demand ...
s that interact to influence a dependent or class variable.
MDR was designed specifically to identify nonadditive
interactions among
discrete variables that influence a
binary outcome and is considered a
nonparametric and model-free alternative to traditional statistical methods such as
logistic regression.
The basis of the MDR method is a constructive induction or
feature engineering algorithm that converts two or more variables or attributes to a single attribute.
This process of constructing a new attribute changes the representation space of the data. The end goal is to create or discover a representation that facilitates the detection of
nonlinear or nonadditive interactions among the attributes such that prediction of the class variable is improved over that of the original representation of the data.
Illustrative example
Consider the following simple example using the
exclusive OR (XOR) function. XOR is a
logical operator that is commonly used in data mining and
machine learning as an example of a function that is not linearly separable. The table below represents a simple dataset where the relationship between the attributes (X1 and X2) and the class variable (Y) is defined by the XOR function such that Y = X1 XOR X2.
Table 1
A
machine learning algorithm would need to discover or approximate the XOR function in order to accurately predict Y using information about X1 and X2. An alternative strategy would be to first change the representation of the data using constructive induction to facilitate predictive modeling. The MDR algorithm would change the representation of the data (X1 and X2) in the following manner. MDR starts by selecting two attributes. In this simple example, X1 and X2 are selected. Each combination of values for X1 and X2 are examined and the number of times Y=1 and/or Y=0 is counted. In this simple example, Y=1 occurs zero times and Y=0 occurs once for the combination of X1=0 and X2=0. With MDR, the ratio of these counts is computed and compared to a fixed threshold. Here, the ratio of counts is 0/1 which is less than our fixed threshold of 1. Since 0/1 < 1 we encode a new attribute (Z) as a 0. When the ratio is greater than one we encode Z as a 1. This process is repeated for all unique combinations of values for X1 and X2. Table 2 illustrates our new transformation of the data.
Table 2
The machine learning algorithm now has much less work to do to find a good predictive function. In fact, in this very simple example, the function Y = Z has a classification accuracy of 1. A nice feature of constructive induction methods such as MDR is the ability to use any data mining or machine learning method to analyze the new representation of the data.
Decision trees,
neural networks
A neural network is a network or circuit of biological neurons, or, in a modern sense, an artificial neural network, composed of artificial neurons or nodes. Thus, a neural network is either a biological neural network, made up of biological ...
, or a
naive Bayes classifier
In statistics, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features (see Bayes classifier). They are among the simplest Baye ...
could be used in combination with measures of model quality such as balanced accuracy and mutual information.
Machine learning with MDR
As illustrated above, the basic constructive induction algorithm in MDR is very simple. However, its implementation for mining patterns from real data can be computationally complex. As with any machine learning algorithm there is always concern about
overfitting
mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfitt ...
. That is, machine learning algorithms are good at finding patterns in completely random data. It is often difficult to determine whether a reported pattern is an important signal or just chance. One approach is to estimate the generalizability of a model to independent datasets using methods such as
cross-validation.
Models that describe random data typically don't generalize. Another approach is to generate many random permutations of the data to see what the data mining algorithm finds when given the chance to overfit.
Permutation testing makes it possible to generate an empirical
p-value
In null-hypothesis significance testing, the ''p''-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. A very small ''p''-value means ...
for the result. Replication in independent data may also provide evidence for an MDR model but can be sensitive to difference in the data sets. These approaches have all been shown to be useful for choosing and evaluating MDR models. An important step in a machine learning exercise is interpretation. Several approaches have been used with MDR including entropy analysis
and pathway analysis. Tips and approaches for using MDR to model gene-gene interactions have been reviewed.
Extensions to MDR
Numerous extensions to MDR have been introduced. These include family-based methods, fuzzy methods, covariate adjustment, odds ratios, risk scores, survival methods, robust methods, methods for quantitative traits, and many others.
Applications of MDR
MDR has mostly been applied to detecting gene-gene interactions or
epistasis
Epistasis is a phenomenon in genetics in which the effect of a gene mutation is dependent on the presence or absence of mutations in one or more other genes, respectively termed modifier genes. In other words, the effect of the mutation is dep ...
in genetic studies of common human diseases such as
atrial fibrillation,
autism
The autism spectrum, often referred to as just autism or in the context of a professional diagnosis autism spectrum disorder (ASD) or autism spectrum condition (ASC), is a neurodevelopmental condition (or conditions) characterized by difficulti ...
,
bladder cancer,
breast cancer,
cardiovascular disease
Cardiovascular disease (CVD) is a class of diseases that involve the heart or blood vessels. CVD includes coronary artery diseases (CAD) such as angina and myocardial infarction (commonly known as a heart attack). Other CVDs include stroke, h ...
,
hypertension
Hypertension (HTN or HT), also known as high blood pressure (HBP), is a long-term medical condition in which the blood pressure in the arteries is persistently elevated. High blood pressure usually does not cause symptoms. Long-term high bl ...
,
obesity,
pancreatic cancer
Pancreatic cancer arises when cell (biology), cells in the pancreas, a glandular organ behind the stomach, begin to multiply out of control and form a Neoplasm, mass. These cancerous cells have the malignant, ability to invade other parts of t ...
,
prostate cancer
Prostate cancer is cancer of the prostate. Prostate cancer is the second most common cancerous tumor worldwide and is the fifth leading cause of cancer-related mortality among men. The prostate is a gland in the male reproductive system that sur ...
and
tuberculosis. It has also been applied to other biomedical problems such as the genetic analysis of
pharmacology
Pharmacology is a branch of medicine, biology and pharmaceutical sciences concerned with drug or medication action, where a drug may be defined as any artificial, natural, or endogenous (from within the body) molecule which exerts a biochemica ...
outcomes. A central challenge is the scaling of MDR to
big data
Though used sometimes loosely partly because of a lack of formal definition, the interpretation that seems to best describe Big data is the one associated with large body of information that we could not comprehend when used only in smaller am ...
such as that from
genome-wide association studies (GWAS). Several approaches have been used. One approach is to filter the features prior to MDR analysis. This can be done using biological knowledge through tools such as BioFilter. It can also be done using computational tools such as ReliefF. Another approach is to use
stochastic search
Stochastic optimization (SO) methods are optimization methods that generate and use random variables. For stochastic problems, the random variables appear in the formulation of the optimization problem itself, which involves random objective fun ...
algorithms such as
genetic programming to explore the search space of feature combinations. Yet another approach is a brute-force search using
high-performance computing.
Implementations
www.epistasis.orgprovides an
open-source
Open source is source code that is made freely available for possible modification and redistribution. Products include permission to use the source code, design documents, or content of the product. The open-source model is a decentralized sof ...
and freely-available MDR software package.
* A
R packagefor MDR.
* An sklearn-compatibl
Python implementation
* A
for Model-Based MDR.
*
MDR in Weka.
Generalized MDR
See also
*
Data mining
*
Dimensionality reduction
*
Epistasis
Epistasis is a phenomenon in genetics in which the effect of a gene mutation is dependent on the presence or absence of mutations in one or more other genes, respectively termed modifier genes. In other words, the effect of the mutation is dep ...
*
Feature Engineering
*
Machine learning
*
Multilinear subspace learning
References
Further reading
*Michalski, R. S., "Pattern Recognition as Knowledge-Guided Computer Induction," Department of Computer Science Reports, No. 927, University of Illinois, Urbana, June 1978.
{{DEFAULTSORT:Multifactor Dimensionality Reduction
Data mining
Dimension reduction
Classification algorithms