Soft independent modelling by class analogy (SIMCA) is a
statistical
Statistics (from German: ''Statistik'', "description of a state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industria ...
method for
supervised classification of data. The method requires a
training data set
In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from ...
consisting of samples (or objects) with a set of attributes and their class membership. The term soft refers to the fact the classifier can identify samples as belonging to multiple classes and not necessarily producing a classification of samples into non-overlapping classes.
Method
In order to build the classification models, the samples belonging to each class need to be analysed using
principal component analysis
Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and ...
(PCA); only the significant components are retained.
For a given class, the resulting model then describes either a line (for one Principal Component or PC), plane (for two PCs) or
hyper-plane
In geometry, a hyperplane is a subspace whose dimension is one less than that of its '' ambient space''. For example, if a space is 3-dimensional then its hyperplanes are the 2-dimensional planes, while if the space is 2-dimensional, its hyper ...
(for more than two PCs). For each modelled class, the mean
orthogonal distance In geometry, the perpendicular distance between two objects is the distance from one to the other, measured along a line that is perpendicular to one or both.
The distance from a point to a line is the distance to the nearest point on that line. Th ...
of training data samples from the line, plane, or hyper-plane (calculated as the residual standard deviation) is used to determine a critical distance for classification. This critical distance is based on the
F-distribution and is usually calculated using 95% or 99% confidence intervals.
New observations are projected into each PC model and the residual distances calculated. An observation is assigned to the model class when its residual distance from the model is below the statistical limit for the class. The observation may be found to belong to multiple classes and a measure of
goodness of the model can be found from the number of cases where the observations are classified into multiple classes. The classification efficiency is usually indicated by
Receiver operating characteristic
A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The method was originally developed for operators of ...
s.
In the original SIMCA method, the ends of the hyper-plane of each class are closed off by setting statistical control limits along the retained principal components axes (i.e., score value between plus and minus 0.5 times score standard deviation).
More recent adaptations of the SIMCA method close off the hyper-plane by construction of ellipsoids (e.g.
Hotelling's T2 or
Mahalanobis distance The Mahalanobis distance is a measure of the distance between a point ''P'' and a distribution ''D'', introduced by P. C. Mahalanobis in 1936. Mahalanobis's definition was prompted by the problem of identifying the similarities of skulls based ...
). With such modified SIMCA methods, classification of an object requires both that its orthogonal distance from the model and its projection within the model (i.e. score value within the region defined by the ellipsoid) are not significant.
Application
SIMCA as a method of classification has gained widespread use especially in applied statistical fields such as
chemometrics and spectroscopic data analysis.
References
* Wold, Svante, and Sjostrom, Michael, 1977, SIMCA: A method for analyzing chemical data in terms of similarity and analogy, in Kowalski, B.R., ed., Chemometrics Theory and Application, American Chemical Society Symposium Series 52, Wash., D.C., American Chemical Society, p. 243-282.
*