Multifactor dimensionality reduction (MDR) is a statistical approach, also used in automated machine learning approaches, for detecting and characterizing combinations of attributes or independent variables that interact to influence a dependent or class variable. MDR was designed specifically to identify nonadditive interactions among discrete variables that influence a binary outcome and is considered a nonparametric and model-free alternative to traditional statistical methods such as logistic regression.

The basis of the MDR method is a constructive induction or feature engineering algorithm that converts two or more variables or attributes to a single attribute. This process of constructing a new attribute changes the representation space of the data. The end goal is to create or discover a representation that facilitates the detection of nonlinear or nonadditive interactions among the attributes such that prediction of the class variable is improved over that of the original representation of the data.

Illustrative example

Consider the following simple example using the exclusive OR (XOR) function. XOR is a logical operator that is commonly used in data mining and machine learning as an example of a function that is not linearly separable. The table below represents a simple dataset where the relationship between the attributes (X1 and X2) and the class variable (Y) is defined by the XOR function such that Y = X1 XOR X2.

Table 1

X1  X2  Y
0   0   0
0   1   1
1   0   1
1   1   0

A machine learning algorithm would need to discover or approximate the XOR function in order to accurately predict Y using information about X1 and X2. An alternative strategy would be to first change the representation of the data using constructive induction to facilitate predictive modeling. The MDR algorithm would change the representation of the data (X1 and X2) in the following manner.

MDR starts by selecting two attributes. In this simple example, X1 and X2 are selected. Each combination of values for X1 and X2 is examined, and the number of times Y=1 and the number of times Y=0 are counted. In this simple example, Y=1 occurs zero times and Y=0 occurs once for the combination of X1=0 and X2=0. With MDR, the ratio of these counts is computed and compared to a fixed threshold. Here, the ratio of counts is 0/1, which is less than our fixed threshold of 1. Since 0/1 < 1, we encode a new attribute (Z) as a 0. When the ratio is greater than one, we encode Z as a 1. For the combination of X1=0 and X2=1, for instance, Y=1 occurs once and Y=0 never occurs, so the ratio exceeds the threshold and Z is encoded as a 1. This process is repeated for all unique combinations of values for X1 and X2. Table 2 illustrates our new transformation of the data.

Table 2

Z   Y
0   0
1   1
1   1
0   0

The machine learning algorithm now has much less work to do to find a good predictive function. In fact, in this very simple example, the function Y = Z has a classification accuracy of 1. A nice feature of constructive induction methods such as MDR is the ability to use any data mining or machine learning method to analyze the new representation of the data.
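The encoding step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the reference MDR implementation: the function names fit_mdr and transform_mdr are hypothetical, and the handling of a ratio exactly equal to the threshold (encoded as 0 here) is an assumption, since the text only specifies the "less than" and "greater than" cases.

```python
from collections import Counter

def fit_mdr(rows, labels, threshold=1.0):
    """Learn a mapping from each observed attribute combination to Z (0 or 1)."""
    cases = Counter()     # number of times Y=1 was seen for a combination
    controls = Counter()  # number of times Y=0 was seen for a combination
    for combo, y in zip(rows, labels):
        if y == 1:
            cases[tuple(combo)] += 1
        else:
            controls[tuple(combo)] += 1
    mapping = {}
    for combo in set(cases) | set(controls):
        n1, n0 = cases[combo], controls[combo]
        # n1 / n0 > threshold is rewritten as n1 > threshold * n0
        # so that a combination with n0 == 0 needs no division.
        # Assumption: a ratio equal to the threshold is encoded as 0.
        mapping[combo] = 1 if n1 > threshold * n0 else 0
    return mapping

def transform_mdr(rows, mapping):
    """Replace each attribute combination by its constructed attribute Z."""
    return [mapping[tuple(r)] for r in rows]

# The XOR dataset from the example: Y = X1 XOR X2.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
Y = [0, 1, 1, 0]

mapping = fit_mdr(X, Y)
Z = transform_mdr(X, mapping)
print(Z)  # [0, 1, 1, 0], identical to Y
```

Comparing n1 with threshold * n0 instead of dividing avoids a special case for combinations in which Y=0 never occurs.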

Decision trees, neural networks, or a naive Bayes classifier could be used in combination with measures of model quality such as balanced accuracy and mutual information.
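To make the two quality measures concrete, here is a small sketch that computes them with the Python standard library only. The function names are hypothetical, the labels are assumed to be coded 0/1 with both classes present, and mutual information is estimated from observed frequencies in bits.

```python
from collections import Counter
from math import log2

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    # Assumes pos > 0 and neg > 0 (both classes occur in y_true).
    return 0.5 * (tp / pos + tn / neg)

def mutual_information(a, b):
    """Mutual information I(A;B) in bits, estimated from observed counts."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written in terms of counts.
    return sum((c / n) * log2(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())

# XOR example: the single attribute X1 versus the constructed attribute Z.
X1 = [0, 0, 1, 1]
Z = [0, 1, 1, 0]
Y = [0, 1, 1, 0]

print(balanced_accuracy(Y, Z))    # 1.0
print(mutual_information(Z, Y))   # 1.0 bit
print(mutual_information(X1, Y))  # 0.0 bits
```

In the XOR data, X1 alone carries no information about Y, while the constructed attribute Z carries all of it.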

Machine learning with MDR

As illustrated above, the basic constructive induction algorithm in MDR is very simple. However, its implementation for mining patterns from real data can be computationally complex. As with any machine learning algorithm, there is always concern about overfitting. That is, machine learning algorithms are good at finding patterns even in completely random data. It is often difficult to determine whether a reported pattern is an important signal or just chance. One approach is to estimate the generalizability of a model to independent datasets using methods such as cross-validation. Models that describe random data typically do not generalize. Another approach is to generate many random permutations of the data to see what the data mining algorithm finds when given the chance to overfit. Permutation testing makes it possible to generate an empirical p-value for the result. Replication in independent data may also provide evidence for an MDR model but can be sensitive to differences between the data sets. These approaches have all been shown to be useful for choosing and evaluating MDR models.

An important step in a machine learning exercise is interpretation. Several approaches have been used with MDR, including entropy analysis and pathway analysis. Tips and approaches for using MDR to model gene-gene interactions have been reviewed.
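The permutation-testing idea can be sketched as follows. This is an illustrative outline, not the procedure used by any particular MDR package: the function names, the use of plain training accuracy as the test statistic, the fixed threshold of 1, the number of permutations, and the toy dataset (the XOR table repeated ten times) are all assumptions made for the example.

```python
import random
from collections import Counter

def mdr_training_accuracy(rows, labels):
    """Fit the MDR encoding (threshold 1) and return the accuracy of Y = Z."""
    cases, controls = Counter(), Counter()
    for combo, y in zip(rows, labels):
        (cases if y == 1 else controls)[tuple(combo)] += 1
    correct = 0
    for combo, y in zip(rows, labels):
        z = 1 if cases[tuple(combo)] > controls[tuple(combo)] else 0
        correct += (z == y)
    return correct / len(labels)

def permutation_p_value(rows, labels, n_permutations=1000, seed=0):
    """Empirical p-value: how often shuffled labels score at least as well."""
    rng = random.Random(seed)
    observed = mdr_training_accuracy(rows, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real link between rows and labels
        if mdr_training_accuracy(rows, shuffled) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least_as_good + 1) / (n_permutations + 1)

# Toy data: the XOR table repeated ten times (40 rows).
rows = [(0, 0), (0, 1), (1, 0), (1, 1)] * 10
labels = [x1 ^ x2 for x1, x2 in rows]

observed, p_value = permutation_p_value(rows, labels)
print(observed, p_value)
```

Because the model is refitted on every shuffled dataset, the null distribution reflects what the algorithm finds when given the chance to overfit, which is the point made in the text.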

Extensions to MDR

Numerous extensions to MDR have been introduced. These include family-based methods, fuzzy methods, covariate adjustment, odds ratios, risk scores, survival methods, robust methods, methods for quantitative traits, and many others.

Applications of MDR

MDR has mostly been applied to detecting gene-gene interactions or epistasis in genetic studies of common human diseases such as atrial fibrillation, autism, bladder cancer, breast cancer, cardiovascular disease, hypertension, obesity, pancreatic cancer, prostate cancer, and tuberculosis. It has also been applied to other biomedical problems such as the genetic analysis of pharmacology outcomes.

A central challenge is the scaling of MDR to big data such as that from genome-wide association studies (GWAS). Several approaches have been used. One approach is to filter the features prior to MDR analysis. This can be done using biological knowledge through tools such as BioFilter. It can also be done using computational tools such as ReliefF. Another approach is to use stochastic search algorithms such as genetic programming to explore the search space of feature combinations. Yet another approach is a brute-force search using high-performance computing.
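The scaling problem comes from the number of feature combinations that a brute-force search must evaluate. The sketch below, with hypothetical function names and a toy dataset of four binary features in which the class is the XOR of features 0 and 2, scores every feature pair with the MDR encoding and keeps the best one. The figure of 1000 features is only an illustrative assumption.

```python
from collections import Counter
from itertools import combinations, product
from math import comb

def mdr_accuracy(data, labels, features):
    """Training accuracy of Y = Z when MDR (threshold 1) uses `features`."""
    keys = [tuple(row[f] for f in features) for row in data]
    cases, controls = Counter(), Counter()
    for k, y in zip(keys, labels):
        (cases if y == 1 else controls)[k] += 1
    hits = sum((1 if cases[k] > controls[k] else 0) == y
               for k, y in zip(keys, labels))
    return hits / len(labels)

def exhaustive_search(data, labels, order=2):
    """Evaluate every feature combination of the given order; return the best."""
    n_features = len(data[0])
    return max(combinations(range(n_features), order),
               key=lambda fs: mdr_accuracy(data, labels, fs))

# Toy data: all 16 rows over four binary features; Y = X0 XOR X2.
data = list(product([0, 1], repeat=4))
labels = [row[0] ^ row[2] for row in data]

best = exhaustive_search(data, labels)
print(best, mdr_accuracy(data, labels, best))  # (0, 2) 1.0

# Number of models to evaluate for a hypothetical 1000 features.
print(comb(1000, 2))  # 499500 pairs
print(comb(1000, 3))  # 166167000 triples
```

Each combination is scored independently, so the loop parallelizes naturally, while filtering and stochastic search reduce the number of combinations that are scored at all.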

Implementations

* www.epistasis.org provides an open-source and freely-available MDR software package.
* An R package for MDR.
* An sklearn-compatible Python implementation.
* A package for Model-Based MDR.
* MDR in Weka.
* Generalized MDR.

