HOME

TheInfoList



OR:

Quantitative structure–activity relationship models (QSAR models) are
regression Regression or regressions may refer to: Science * Marine regression, coastal advance due to falling sea level, the opposite of marine transgression * Regression (medicine), a characteristic of diseases to express lighter symptoms or less extent ( ...
or classification models used in the chemical and biological sciences and engineering. Like other regression models, QSAR regression models relate a set of "predictor" variables (X) to the potency of the response variable (Y), while classification QSAR models relate the predictor variables to a categorical value of the response variable. In QSAR modeling, the predictors consist of physico-chemical properties or theoretical
molecular descriptor Molecular descriptors play a fundamental role in chemistry, pharmaceutical sciences, environmental protection policy, and health researches, as well as in quality control, being the way molecules, thought of as real bodies, are transformed into num ...
s of chemicals; the QSAR response-variable could be a
biological activity In pharmacology, biological activity or pharmacological activity describes the beneficial or adverse effects of a drug on living matter. When a drug is a complex chemical mixture, this activity is exerted by the substance's active ingredient or p ...
of the chemicals. QSAR models first summarize a supposed relationship between
chemical structure A chemical structure determination includes a chemist's specifying the molecular geometry and, when feasible and necessary, the electronic structure of the target molecule or other solid. Molecular geometry refers to the spatial arrangement of a ...
s and
biological activity In pharmacology, biological activity or pharmacological activity describes the beneficial or adverse effects of a drug on living matter. When a drug is a complex chemical mixture, this activity is exerted by the substance's active ingredient or p ...
in a data-set of chemicals. Second, QSAR models predict the activities of new chemicals. Related terms include ''quantitative structure–property relationships'' (''QSPR'') when a chemical property is modeled as the response variable. "Different properties or behaviors of chemical molecules have been investigated in the field of QSPR. Some examples are quantitative structure–reactivity relationships (QSRRs), quantitative structure–chromatography relationships (QSCRs) and, quantitative structure–toxicity relationships (QSTRs), quantitative structure–electrochemistry relationships (QSERs), and quantitative structure–
biodegradability Biodegradation is the breakdown of organic matter by microorganisms, such as bacteria and fungi. It is generally assumed to be a natural process, which differentiates it from composting. Composting is a human-driven process in which biodegradat ...
relationships (QSBRs)." As an example, biological activity can be expressed quantitatively as the concentration of a substance required to give a certain biological response. Additionally, when physicochemical properties or structures are expressed by numbers, one can find a mathematical relationship, or quantitative structure-activity relationship, between the two. The mathematical expression, if carefully validated can then be used to predict the modeled response of other chemical structures. A QSAR has the form of a mathematical model: * Activity = ''f''(physiochemical properties and/or structural properties) + error The error includes model error (
bias Bias is a disproportionate weight ''in favor of'' or ''against'' an idea or thing, usually in a way that is closed-minded, prejudicial, or unfair. Biases can be innate or learned. People may develop biases for or against an individual, a group, ...
) and observational variability, that is, the variability in observations even on a correct model.


Essential steps in QSAR studies

Principal steps of QSAR/QSPR including (i) Selection of Data set and extraction of structural/empirical descriptors (ii) variable selection, (iii) model construction and (iv) validation evaluation."


SAR and the SAR paradox

The basic assumption for all molecule-based hypotheses is that similar molecules have similar activities. This principle is also called Structure–Activity Relationship ( SAR). The underlying problem is therefore how to define a ''small'' difference on a molecular level, since each kind of activity, e.g.
reaction Reaction may refer to a process or to a response to an action, event, or exposure: Physics and chemistry *Chemical reaction * Nuclear reaction *Reaction (physics), as defined by Newton's third law * Chain reaction (disambiguation). Biology and ...
ability,
biotransformation Biotransformation is the biochemical modification of one chemical compound or a mixture of chemical compounds. Biotransformations can be conducted with whole cells, their lysates, or purified enzymes. Increasingly, biotransformations are effected w ...
ability, solubility, target activity, and so on, might depend on another difference. Examples were given in the bioisosterism reviews by Patanie/LaVoie and Brown. In general, one is more interested in finding strong trends. Created hypotheses usually rely on a
finite Finite is the opposite of infinite. It may refer to: * Finite number (disambiguation) * Finite set, a set whose cardinality (number of elements) is some natural number * Finite verb Traditionally, a finite verb (from la, fīnītus, past particip ...
number of chemicals, so care must be taken to avoid overfitting: the generation of hypotheses that fit training data very closely but perform poorly when applied to new data. The ''SAR paradox'' refers to the fact that it is not the case that all similar molecules have similar activities.


Types


Fragment based (group contribution)

Analogously, the "
partition coefficient In the physical sciences, a partition coefficient (''P'') or distribution coefficient (''D'') is the ratio of concentrations of a compound in a mixture of two immiscible solvents at equilibrium. This ratio is therefore a comparison of the solu ...
"—a measurement of differential solubility and itself a component of QSAR predictions—can be predicted either by atomic methods (known as "XLogP" or "ALogP") or by chemical fragment methods (known as "CLogP" and other variations). It has been shown that the logP of compound can be determined by the sum of its fragments; fragment-based methods are generally accepted as better predictors than atomic-based methods. Fragmentary values have been determined statistically, based on empirical data for known logP values. This method gives mixed results and is generally not trusted to have accuracy of more than ±0.1 units. Group or Fragment based QSAR is also known as GQSAR. GQSAR allows flexibility to study various molecular fragments of interest in relation to the variation in biological response. The molecular fragments could be substituents at various substitution sites in congeneric set of molecules or could be on the basis of pre-defined chemical rules in case of non-congeneric sets. GQSAR also considers cross-terms fragment descriptors, which could be helpful in identification of key fragment interactions in determining variation of activity. Lead discovery using Fragnomics is an emerging paradigm. In this context FB-QSAR proves to be a promising strategy for fragment library design and in fragment-to-lead identification endeavours. An advanced approach on fragment or group-based QSAR based on the concept of pharmacophore-similarity is developed. This method, pharmacophore-similarity-based QSAR (PS-QSAR) uses topological pharmacophoric descriptors to develop QSAR models. This activity prediction may assist the contribution of certain pharmacophore features encoded by respective fragments toward activity improvement and/or detrimental effects.


3D-QSAR

The acronym 3D-QSAR or 3-D QSAR refers to the application of force field calculations requiring three-dimensional structures of a given set of small molecules with known activities (training set). The training set needs to be superimposed (aligned) by either experimental data (e.g. based on ligand-protein
crystallography Crystallography is the experimental science of determining the arrangement of atoms in crystalline solids. Crystallography is a fundamental subject in the fields of materials science and solid-state physics (condensed matter physics). The wor ...
) or molecule superimposition software. It uses computed potentials, e.g. the
Lennard-Jones potential The Lennard-Jones potential (also termed the LJ potential or 12-6 potential) is an intermolecular pair potential. Out of all the intermolecular potentials, the Lennard-Jones potential is probably the one that has been the most extensively studied ...
, rather than experimental constants and is concerned with the overall molecule rather than a single substituent. The first 3-D QSAR was named Comparative Molecular Field Analysis (CoMFA) by Cramer et al. It examined the steric fields (shape of the molecule) and the electrostatic fields which were correlated by means of
partial least squares regression Partial least squares regression (PLS regression) is a statistical method that bears some relation to principal components regression; instead of finding hyperplanes of maximum variance between the response and independent variables, it finds a ...
(PLS). The created data space is then usually reduced by a following feature extraction (see also dimensionality reduction). The following learning method can be any of the already mentioned machine learning methods, e.g. support vector machines. An alternative approach uses multiple-instance learning by encoding molecules as sets of data instances, each of which represents a possible molecular conformation. A label or response is assigned to each set corresponding to the activity of the molecule, which is assumed to be determined by at least one instance in the set (i.e. some conformation of the molecule). On June 18, 2011 the Comparative Molecular Field Analysis (CoMFA) patent has dropped any restriction on the use of GRID and partial least-squares (PLS) technologies.


Chemical descriptor based

In this approach, descriptors quantifying various electronic, geometric, or steric properties of a molecule are computed and used to develop a QSAR. This approach is different from the fragment (or group contribution) approach in that the descriptors are computed for the system as whole rather than from the properties of individual fragments. This approach is different from the 3D-QSAR approach in that the descriptors are computed from scalar quantities (e.g., energies, geometric parameters) rather than from 3D fields. An example of this approach is the QSARs developed for olefin polymerization by half sandwich compounds.


Modeling

In the literature it can be often found that chemists have a preference for
partial least squares Partial least squares regression (PLS regression) is a statistical method that bears some relation to principal components regression; instead of finding hyperplanes of maximum variance between the response and independent variables, it finds a li ...
(PLS) methods, since it applies the feature extraction and induction in one step.


Data mining approach

Computer SAR models typically calculate a relatively large number of features. Because those lack structural interpretation ability, the preprocessing steps face a feature selection problem (i.e., which structural features should be interpreted to determine the structure-activity relationship). Feature selection can be accomplished by visual inspection (qualitative selection by a human); by data mining; or by molecule mining. A typical data mining based prediction uses e.g. support vector machines, decision trees,
artificial neural network Artificial neural networks (ANNs), usually simply called neural networks (NNs) or neural nets, are computing systems inspired by the biological neural networks that constitute animal brains. An ANN is based on a collection of connected units ...
s for inducing a predictive learning model. Molecule mining approaches, a special case of
structured data mining Structure mining or structured data mining is the process of finding and extracting useful information from semi-structured data sets. Graph mining, sequential pattern mining and molecule mining are special cases of structured data mining. Descri ...
approaches, apply a similarity matrix based prediction or an automatic fragmentation scheme into molecular substructures. Furthermore, there exist also approaches using maximum common subgraph searches or
graph kernel In structure mining, a graph kernel is a kernel function that computes an inner product on graphs. Graph kernels can be intuitively understood as functions measuring the similarity of pairs of graphs. They allow kernelized learning algorithms su ...
s.


Matched molecular pair analysis

Typically QSAR models derived from non linear machine learning is seen as a "black box", which fails to guide medicinal chemists. Recently there is a relatively new concept of matched molecular pair analysis or prediction driven MMPA which is coupled with QSAR model in order to identify activity cliffs.


Evaluation of the quality of QSAR models

QSAR modeling produces predictive
model A model is an informative representation of an object, person or system. The term originally denoted the plans of a building in late 16th-century English, and derived via French and Italian ultimately from Latin ''modulus'', a measure. Models c ...
s derived from application of statistical tools correlating
biological activity In pharmacology, biological activity or pharmacological activity describes the beneficial or adverse effects of a drug on living matter. When a drug is a complex chemical mixture, this activity is exerted by the substance's active ingredient or p ...
(including desirable therapeutic effect and undesirable side effects) or physico-chemical properties in QSPR models of chemicals (drugs/toxicants/environmental pollutants) with descriptors representative of molecular structure or
properties Property is the ownership of land, resources, improvements or other tangible objects, or intellectual property. Property may also refer to: Mathematics * Property (mathematics) Philosophy and science * Property (philosophy), in philosophy and ...
. QSARs are being applied in many disciplines, for example:
risk assessment Broadly speaking, a risk assessment is the combined effort of: # identifying and analyzing potential (future) events that may negatively impact individuals, assets, and/or the environment (i.e. hazard analysis); and # making judgments "on the t ...
, toxicity prediction, and regulatory decisions in addition to drug discovery and
lead optimization Hit to lead (H2L) also known as lead generation is a stage in early drug discovery where small molecule hits from a high throughput screen (HTS) are evaluated and undergo limited optimization to identify promising lead compounds. These lead compo ...
. Obtaining a good quality QSAR model depends on many factors, such as the quality of input data, the choice of descriptors and statistical methods for modeling and for validation. Any QSAR modeling should ultimately lead to statistically robust and predictive models capable of making accurate and reliable predictions of the modeled response of new compounds. For validation of QSAR models, usually various strategies are adopted: # internal validation or cross-validation (actually, while extracting data, cross validation is a measure of model robustness, the more a model is robust (higher q2) the less data extraction perturb the original model); # external validation by splitting the available data set into training set for model development and prediction set for model predictivity check; # blind external validation by application of model on new external data and # data randomization or Y-scrambling for verifying the absence of chance correlation between the response and the modeling descriptors. The success of any QSAR model depends on accuracy of the input data, selection of appropriate descriptors and statistical tools, and most importantly validation of the developed model. Validation is the process by which the reliability and relevance of a procedure are established for a specific purpose; for QSAR models validation must be mainly for robustness, prediction performances and applicability domain (AD) of the models. Some validation methodologies can be problematic. For example, ''leave one-out'' cross-validation generally leads to an overestimation of predictive capacity. Even with external validation, it is difficult to determine whether the selection of training and test sets was manipulated to maximize the predictive capacity of the model being published. Different aspects of validation of QSAR models that need attention include methods of selection of training set compounds, setting training set size and impact of variable selection for training set models for determining the quality of prediction. Development of novel validation parameters for judging quality of QSAR models is also important.


Application


Chemical

One of the first historical QSAR applications was to predict
boiling point The boiling point of a substance is the temperature at which the vapor pressure of a liquid equals the pressure surrounding the liquid and the liquid changes into a vapor. The boiling point of a liquid varies depending upon the surrounding envi ...
s. It is well known for instance that within a particular family of
chemical compound A chemical compound is a chemical substance composed of many identical molecules (or molecular entities) containing atoms from more than one chemical element held together by chemical bonds. A molecule consisting of atoms of only one elemen ...
s, especially of organic chemistry, that there are strong correlations between structure and observed properties. A simple example is the relationship between the number of carbons in
alkanes In organic chemistry, an alkane, or paraffin (a historical trivial name that also has other meanings), is an acyclic saturated hydrocarbon. In other words, an alkane consists of hydrogen and carbon atoms arranged in a tree structure in ...
and their
boiling point The boiling point of a substance is the temperature at which the vapor pressure of a liquid equals the pressure surrounding the liquid and the liquid changes into a vapor. The boiling point of a liquid varies depending upon the surrounding envi ...
s. There is a clear trend in the increase of boiling point with an increase in the number carbons, and this serves as a means for predicting the boiling points of higher alkanes. A still very interesting application is the Hammett equation,
Taft equation The Taft equation is a linear free energy relationship (LFER) used in physical organic chemistry in the study of reaction mechanisms and in the development of quantitative structure–activity relationships for organic compounds. It was develop ...
and pKa prediction methods.


Biological

The biological activity of molecules is usually measured in
assay An assay is an investigative (analytic) procedure in laboratory medicine, mining, pharmacology, environmental biology and molecular biology for qualitatively assessing or quantitatively measuring the presence, amount, or functional activity of a ...
s to establish the level of inhibition of particular signal transduction or metabolic pathways. Drug discovery often involves the use of QSAR to identify chemical structures that could have good inhibitory effects on specific targets and have low
toxicity Toxicity is the degree to which a chemical substance or a particular mixture of substances can damage an organism. Toxicity can refer to the effect on a whole organism, such as an animal, bacterium, or plant, as well as the effect on a subs ...
(non-specific activity). Of special interest is the prediction of
partition coefficient In the physical sciences, a partition coefficient (''P'') or distribution coefficient (''D'') is the ratio of concentrations of a compound in a mixture of two immiscible solvents at equilibrium. This ratio is therefore a comparison of the solu ...
log ''P'', which is an important measure used in identifying "
druglikeness Druglikeness is a qualitative concept used in drug design for how "druglike" a substance is with respect to factors like bioavailability. It is estimated from the molecular structure before the substance is even synthesized and tested. A druglike m ...
" according to
Lipinski's Rule of Five Lipinski's rule of five, also known as Pfizer's rule of five or simply the rule of five (RO5), is a rule of thumb to evaluate druglikeness or determine if a chemical compound with a certain pharmacological or biological activity has chemical pro ...
. While many quantitative structure activity relationship analyses involve the interactions of a family of molecules with an enzyme or
receptor Receptor may refer to: *Sensory receptor, in physiology, any structure which, on receiving environmental stimuli, produces an informative nerve impulse *Receptor (biochemistry), in biochemistry, a protein molecule that receives and responds to a n ...
binding site, QSAR can also be used to study the interactions between the
structural domain In molecular biology, a protein domain is a region of a protein's polypeptide chain that is self-stabilizing and that folds independently from the rest. Each domain forms a compact folded three-dimensional structure. Many proteins consist of ...
s of proteins. Protein-protein interactions can be quantitatively analyzed for structural variations resulted from
site-directed mutagenesis Site-directed mutagenesis is a molecular biology method that is used to make specific and intentional mutating changes to the DNA sequence of a gene and any gene products. Also called site-specific mutagenesis or oligonucleotide-directed mutagenesi ...
. It is part of the machine learning method to reduce the risk for a SAR paradox, especially taking into account that only a finite amount of data is available (see also MVUE). In general, all QSAR problems can be divided into coding and learning.


Applications

(Q)SAR models have been used for risk management. QSARS are suggested by regulatory authorities; in the European Union, QSARs are suggested by the
REACH Reach or REACH may refer to: Companies and organizations * Reach plc, formerly Trinity Mirror, large British newspaper, magazine, and digital publisher * Reach Canada, an NGO in Canada * Reach Limited, an Asia Pacific cable network company ...
regulation, where "REACH" abbreviates "Registration, Evaluation, Authorisation and Restriction of Chemicals". Regulatory application of QSAR methods includes ''in silico'' toxicological assessment of genotoxic impurities. Commonly used QSAR assessment software such as DEREK or CASE Ultra (MultiCASE) is used to genotoxicity of impurity according to ICH M7.
/ref> The chemical descriptor space whose
convex hull In geometry, the convex hull or convex envelope or convex closure of a shape is the smallest convex set that contains it. The convex hull may be defined either as the intersection of all convex sets containing a given subset of a Euclidean spac ...
is generated by a particular training set of chemicals is called the training set's applicability domain. Prediction of properties of novel chemicals that are located outside the applicability domain uses
extrapolation In mathematics, extrapolation is a type of estimation, beyond the original observation range, of the value of a variable on the basis of its relationship with another variable. It is similar to interpolation, which produces estimates between kno ...
, and so is less reliable (on average) than prediction within the applicability domain. The assessment of the reliability of QSAR predictions remains a research topic. The QSAR equations can be used to predict biological activities of newer molecules before their synthesis. Examples of machine learning tools for QSAR modeling include:


See also


References


Further reading

* *


External links

* * * *
Chemoinformatics Tools
Drug Theoretics and Cheminformatics Laboratory
Multiscale Conceptual Model Figures for QSARs in Biological and Environmental Science
{{DEFAULTSORT:Quantitative structure-activity relationship Medicinal chemistry Drug discovery Cheminformatics Computational chemistry Structure-Activity Relationship paradox