CRM114 (program)

The CRM114 Discriminator, or simply CRM114, is a program based on a statistical approach to classifying data, used especially for filtering email spam.


Nomenclature

The name comes from the CRM-114 Discriminator in Stanley Kubrick's film ''Dr. Strangelove'', a piece of radio equipment designed to filter out messages lacking a specific code prefix.


Operation

While others have done statistical Bayesian spam filtering based upon the frequency of single-word occurrences in email, CRM114 achieves a higher rate of spam recognition by creating hits based upon phrases up to five words in length. These phrases are used to form a Markov random field representing the incoming texts. With this additional contextual recognition, it is one of the more accurate spam filters available. Initial testing in 2002 by author Bill Yerazunis gave 99.87% accuracy; Holden (''Spam Filtering II'') and TREC 2005 and 2006 (''Spam Track Overview'', 2005 and 2006) gave results of better than 99%, with significant variation depending on the particular corpus.

CRM114's classifier can also be switched to use Littlestone's Winnow algorithm, character-by-character correlation, a variant on KNN (k-nearest neighbor) classification called Hyperspace, a bit-entropic classifier that uses entropy encoding to determine similarity, an SVM (support vector machine), classification by mutual compressibility as calculated by a modified LZ77 algorithm, and other more experimental classifiers. The actual features matched are based on a generalization of skip-grams. The CRM114 algorithms are multilingual (compatible with UTF-8 encodings) and null-safe. A voting set of CRM114 classifiers has been demonstrated to distinguish confidential from non-confidential documents written in Japanese with a detection rate better than 99.9% and a false alarm rate of 5.3%.

CRM114 is a good example of pattern recognition software, demonstrating how machine learning can be accomplished with a reasonably simple algorithm. The program's C source code is available under the GPL.

At a deeper level, CRM114 is also a string pattern matching language, similar to grep or even Perl. Although it is Turing complete, it is highly tuned for matching text, and even a simple (recursive) definition of the factorial takes almost ten lines. Part of the reason is that the CRM114 language syntax is not positional but declensional. As a programming language, it may be used for many applications other than detecting spam. CRM114 uses the TRE approximate-match regex engine, so it is possible to write programs that do not depend on absolutely identical strings matching to function correctly.

CRM114 has been applied to email filtering in the KMail client and to a number of other applications, including detection of bots on Twitter and Yahoo, and as the first-level filter in the US Department of Transportation's vehicle defect detection system. It has also been used as a predictive method for classifying fault-prone software modules.
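
The idea of matching on phrases of up to five words, with some words allowed to be skipped, can be illustrated with a short sketch. The following Python example is only an illustration under stated assumptions and is not CRM114's actual implementation: the function name `phrase_features`, the `<skip>` placeholder and whitespace tokenization are all hypothetical choices made for this example. For each word, it emits that word followed by every combination of kept and skipped words from the next four positions.

```python
from itertools import combinations

SKIP = "<skip>"  # hypothetical placeholder marking a skipped word


def phrase_features(text, window=5):
    """Return phrase features of up to `window` words, allowing skipped words.

    Illustrative only: tokenization is plain whitespace splitting, and the
    features are returned as strings rather than hashed into a table.
    """
    tokens = text.split()
    features = []
    for start in range(len(tokens)):
        span = tokens[start:start + window]
        anchor, rest = span[0], span[1:]
        # Try every subset of the following words to keep; the others are
        # replaced by SKIP so that the word positions are preserved.
        for keep_count in range(len(rest) + 1):
            for kept in combinations(range(len(rest)), keep_count):
                phrase = [anchor] + [
                    rest[i] if i in kept else SKIP for i in range(len(rest))
                ]
                # Drop trailing skips so each feature ends on a real word.
                while phrase[-1] == SKIP:
                    phrase.pop()
                features.append(" ".join(phrase))
    return features


if __name__ == "__main__":
    for feature in phrase_features("click here now"):
        print(feature)
```

For the three-word input above, the sketch yields seven features, among them the single words, the full phrase, and "click <skip> now". A classifier trained on such features can therefore still register a hit when the middle word of a phrase is changed, which a filter based only on single-word frequencies cannot do.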


See also

* String matching


External links


* The CRM114 home page on SourceForge
* The TRE approximate regex matcher homepage