TRE is an open-source library for pattern matching in text, which works like a regular expression engine with the ability to do approximate string matching.
It was developed by Ville Laurikari and is distributed under a 2-clause BSD-like license.
The library is written in C and provides functions for searching lines of input text with regular expressions. Its main difference from other regular expression engines is that TRE can match text fragments approximately, that is, on the assumption that the text may contain some number of typos.
Features
TRE uses extended regular expression syntax, with the addition of "directions" for matching the preceding fragment approximately. Each such direction specifies how many typos are allowed in that fragment.
Approximate matching is performed in a way similar to Levenshtein distance, which means that three types of typos are recognized:
* insertion of an extra character;
* deletion of a character (a character of the pattern is missing from the text);
* substitution of one character for another.
TRE allows a "cost" to be specified for each of the three typo types independently.
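The effect of per-type costs can be sketched as a weighted edit distance. The C code below is a minimal illustration of that idea and is not TRE's implementation or API. The names typo_costs and weighted_distance are hypothetical, and the function compares two whole strings rather than searching inside a text.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cost table: one cost per typo type. */
typedef struct {
    int ins;   /* extra character present in the text */
    int del;   /* pattern character missing from the text */
    int subst; /* pattern character replaced by another one */
} typo_costs;

static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/*
 * Minimum total cost of the typos needed to turn `pattern` into `text`.
 * Returns -1 if memory allocation fails.
 */
int weighted_distance(const char *pattern, const char *text, typo_costs c) {
    assert(pattern != NULL && text != NULL);
    size_t m = strlen(pattern), n = strlen(text);
    int *prev = malloc((n + 1) * sizeof *prev);
    int *cur = malloc((n + 1) * sizeof *cur);
    if (prev == NULL || cur == NULL) {
        free(prev);
        free(cur);
        return -1;
    }

    /* Empty pattern prefix: every text character counts as an insertion. */
    for (size_t j = 0; j <= n; j++)
        prev[j] = (int)j * c.ins;

    for (size_t i = 1; i <= m; i++) {
        /* Empty text prefix: every pattern character counts as a deletion. */
        cur[0] = (int)i * c.del;
        for (size_t j = 1; j <= n; j++) {
            int same = pattern[i - 1] == text[j - 1];
            cur[j] = min3(prev[j] + c.del,
                          cur[j - 1] + c.ins,
                          prev[j - 1] + (same ? 0 : c.subst));
        }
        int *tmp = prev;  /* the finished row becomes the previous row */
        prev = cur;
        cur = tmp;
    }

    int result = prev[n];
    free(prev);
    free(cur);
    return result;
}
```

With unit costs, weighted_distance("regular", "rogular", (typo_costs){1, 1, 1}) yields 1 (one substitution). With an insertion cost of 5, a deletion cost of 3 and a substitution cost of 2, comparing "expression" with the illustrative misspelling "ekspresson" yields 10: one substitution (x to k), one insertion (s) and one deletion (i).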
The project comes with a command-line utility, a reimplementation of agrep.
Though approximate matching requires some syntax extensions, when this feature is not used TRE works like most other regular expression matching engines. This means that
* it implements ordinary regular expressions written for strict matching;
* programmers familiar with POSIX-style regular expressions need little study to be able to use TRE.
Predictable time and memory consumption
The library's author states that the time spent on matching grows linearly with the length of the input text, while the memory requirement stays constant during matching and depends only on the pattern, not on the input.
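The claim above concerns TRE's own matcher, whose internals are not described here. As a loosely related illustration only, the sketch below uses a different, classic technique: Sellers' dynamic-programming column for approximate substring search with unit costs. It makes a single pass over the text and keeps one array whose size depends only on the pattern length; its work per text character is proportional to the pattern length. The function name best_substring_cost is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Smallest number of typos (unit-cost insertions, deletions and
 * substitutions) with which `pattern` matches some substring of `text`.
 * Returns -1 if memory allocation fails.
 */
int best_substring_cost(const char *pattern, const char *text) {
    assert(pattern != NULL && text != NULL);
    size_t m = strlen(pattern);

    /* The only working memory: one column of m + 1 counters. */
    int *col = malloc((m + 1) * sizeof *col);
    if (col == NULL)
        return -1;

    /* Before any text is read, matching i pattern characters costs i deletions. */
    for (size_t i = 0; i <= m; i++)
        col[i] = (int)i;
    int best = col[m];

    /* Single pass over the text; each character updates the column in place. */
    for (const char *t = text; *t != '\0'; t++) {
        int diag = col[0];  /* value of col[i - 1] from the previous column */
        col[0] = 0;         /* a match may start at any text position */
        for (size_t i = 1; i <= m; i++) {
            int old = col[i];
            int subst = diag + (pattern[i - 1] == *t ? 0 : 1);
            int ins = old + 1;         /* text character is extra */
            int del = col[i - 1] + 1;  /* pattern character is missing */
            int v = subst < ins ? subst : ins;
            col[i] = v < del ? v : del;
            diag = old;
        }
        if (col[m] < best)
            best = col[m];
    }

    free(col);
    return best;
}
```

For example, best_substring_cost("regular", "a rogular expression") returns 1, because the substring "rogular" differs from the pattern by one substitution.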
Other
Other features common to most regular expression engines can be looked up in comparison tables of regular expression engines or in the list of TRE features on its web page.
Usage example
Approximate matching directions are specified in curly brackets and must be distinguishable from repetition quantifiers (if necessary, by inserting a space after the opening bracket):
* (regular){~1}\s+(expression){~2} would match variants of the phrase "regular expression" in which "regular" has no more than one typo and "expression" no more than two; as in ordinary regular expressions, "\s+" means one or more space characters. For example, rogular ekspression would pass the test;
* (expression){ 5i + 3d + 2s < 11 } would match the word "expression" if the total cost of typos is less than 11, with the insertion cost set to 5, deletion to 3 and substitution of a character to 2. For example, one insertion, one deletion and one substitution together give a cost of 10.
Language bindings
Apart from C, TRE is usable through bindings for Perl, Python and Haskell. It is the default regular expression engine in R. However, if a project needs to be cross-platform, a separate interface is necessary for each target platform.
Disadvantages
Since other regular expression engines usually do not provide approximate matching, there is almost no competing implementation with which TRE could be compared. However, there are a few things that programmers may wish to see implemented in future releases:
* a replacement mechanism for substituting matched text fragments (as in the sed stream editor and many modern implementations of regular expressions, including those built into Perl or Java);
* the option to use an approximate matching algorithm other than Levenshtein's for better assessment of typos (for example Soundex), or at least an improvement of the current algorithm to allow typos of the "swap" type (see Damerau–Levenshtein distance).
See also
* Levenshtein automaton
* Comparison of regular expression engines
* Agrep
External links
TRE - The free and portable approximate regular expression matching library