In computer science, a trie (, ), also known as a digital tree or prefix tree,
is a specialized
search tree
In computer science, a search tree is a tree data structure used for locating specific keys from within a set. In order for a tree to function as a search tree, the key for each node must be greater than any keys in subtrees on the left, and les ...
data structure used to store and retrieve strings from a dictionary or set. Unlike a
binary search tree
In computer science, a binary search tree (BST), also called an ordered or sorted binary tree, is a Rooted tree, rooted binary tree data structure with the key of each internal node being greater than all the keys in the respective node's left ...
, nodes in a trie do not store their associated key. Instead, each node's ''position'' within the trie determines its associated key, with the connections between nodes defined by individual
characters rather than the entire key.
Tries are particularly effective for tasks such as autocomplete, spell checking, and IP routing, offering advantages over
hash table
In computer science, a hash table is a data structure that implements an associative array, also called a dictionary or simply map; an associative array is an abstract data type that maps Unique key, keys to Value (computer science), values. ...
s due to their prefix-based organization and lack of hash collisions. Every child node shares a common
prefix
A prefix is an affix which is placed before the stem of a word. Particularly in the study of languages, a prefix is also called a preformative, because it alters the form of the word to which it is affixed.
Prefixes, like other affixes, can b ...
with its parent node, and the root node represents the
empty string
In formal language theory, the empty string, or empty word, is the unique String (computer science), string of length zero.
Formal theory
Formally, a string is a finite, ordered sequence of character (symbol), characters such as letters, digits ...
. While basic trie implementations can be memory-intensive, various optimization techniques such as compression and bitwise representations have been developed to improve their efficiency. A notable optimization is the
radix tree, which provides more efficient prefix-based storage.
While tries commonly store character strings, they can be adapted to work with any ordered sequence of elements, such as
permutation
In mathematics, a permutation of a set can mean one of two different things:
* an arrangement of its members in a sequence or linear order, or
* the act or process of changing the linear order of an ordered set.
An example of the first mean ...
s of digits or shapes. A notable variant is the bitwise trie, which uses individual
bits from fixed-length binary data (such as
integers
An integer is the number zero (0), a positive natural number (1, 2, 3, ...), or the negation of a positive natural number (−1, −2, −3, ...). The negations or additive inverses of the positive natural numbers are referred to as negative in ...
or
memory addresses) as keys.
History, etymology, and pronunciation
The idea of a trie for representing a set of strings was first abstractly described by
Axel Thue
Axel Thue (; 19 February 1863 – 7 March 1922) was a Norwegian mathematician, known for his original work in diophantine approximation and combinatorics.
Work
Thue published his first important paper in 1909.
He stated in 1914 the so-called w ...
in 1912.
[ Cited by Knuth.][ Tries were first described in a computer context by René de la Briandais in 1959.]
The idea was independently described in 1960 by Edward Fredkin,[ who coined the term ''trie'', pronouncing it (as "tree"), after the middle syllable of ''retrieval''.] However, other authors pronounce it (as "try"), in an attempt to distinguish it verbally from "tree".
Overview
Tries are a form of string-indexed look-up data structure, which is used to store a dictionary list of words that can be searched on in a manner that allows for efficient generation of completion lists. A prefix trie is an ordered tree data structure used in the representation of a set of strings over a finite alphabet set, which allows efficient storage of words with common prefixes.
Tries can be efficacious on string-searching algorithms such as predictive text, approximate string matching, and spell checking in comparison to binary search trees. A trie can be seen as a tree-shaped deterministic finite automaton.
Operations
Tries support various operations: insertion, deletion, and lookup of a string key. Tries are composed of nodes that contain links, which either point to other suffix child nodes or ''null''. As for every tree, each node but the root is pointed to by only one other node, called its ''parent''. Each node contains as many links as the number of characters in the applicable alphabet
An alphabet is a standard set of letter (alphabet), letters written to represent particular sounds in a spoken language. Specifically, letters largely correspond to phonemes as the smallest sound segments that can distinguish one word from a ...
(although tries tend to have a substantial number of null links). In some cases, the alphabet used is simply that of the character encoding
Character encoding is the process of assigning numbers to graphical character (computing), characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using computers. The numerical v ...
—resulting in, for example, a size of 256 in the case of (unsigned) ASCII
ASCII ( ), an acronym for American Standard Code for Information Interchange, is a character encoding standard for representing a particular set of 95 (English language focused) printable character, printable and 33 control character, control c ...
.
The null links within the children of a node emphasize the following characteristics:
# Characters and string keys are implicitly stored in the trie, and include a character sentinel value
In computer programming, a sentinel value (also referred to as a flag value, trip value, rogue value, signal value, or dummy data) is a special value in the context of an algorithm which uses its presence as a condition of termination, typically ...
indicating string termination.
# Each node contains one possible link to a prefix
A prefix is an affix which is placed before the stem of a word. Particularly in the study of languages, a prefix is also called a preformative, because it alters the form of the word to which it is affixed.
Prefixes, like other affixes, can b ...
of strong keys of the set.
A basic structure type of nodes in the trie is as follows; may contain an optional , which is associated with each key stored in the last character of string, or terminal node.
Searching
Searching for a value in a trie is guided by the characters in the search string key, as each node in the trie contains a corresponding link to each possible character in the given string. Thus, following the string within the trie yields the associated value for the given string key. A null link during the search indicates the inexistence of the key.
The following pseudocode implements the search procedure for a given string in a rooted trie .
In the above pseudocode, and correspond to the pointer of trie's root node and the string key respectively. The search operation, in a standard trie, takes time, where is the size of the string parameter , and corresponds to the alphabet size. Binary search trees, on the other hand, take in the worst case, since the search depends on the height of the tree () of the BST (in case of balanced trees), where and being number of keys and the length of the keys.
The trie occupies less space in comparison with a BST in the case of a large number of short strings, since nodes share common initial string subsequences and store the keys implicitly. The terminal node of the tree contains a non-null value, and it is a search ''hit'' if the associated value is found in the trie, and search ''miss'' if it is not.
Insertion
Insertion into trie is guided by using the character sets
Character encoding is the process of assigning numbers to graphical character (computing), characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using computers. The numerical v ...
as indexes to the children array until the last character of the string key is reached. Each node in the trie corresponds to one call of the radix sorting routine, as the trie structure reflects the execution of pattern of the top-down radix sort.
If a null link is encountered prior to reaching the last character of the string key, a new node is created (line 3). The value of the terminal node is assigned to the input value; therefore, if the former was non-null at the time of insertion, it is substituted with the new value.
Deletion
Deletion of a key–value pair from a trie involves finding the terminal node with the corresponding string key, marking the terminal indicator and value to ''false'' and null correspondingly.
The following is a recursive procedure for removing a string from rooted trie ().
The procedure begins by examining the ; null denotes the arrival of a terminal node or end of a string key. If the node is terminal it has no children, it is removed from the trie (line 14). However, an end of string key without the node being terminal indicates that the key does not exist, thus the procedure does not modify the trie. The recursion proceeds by incrementing 's index.
Replacing other data structures
Replacement for hash tables
A trie can be used to replace a hash table
In computer science, a hash table is a data structure that implements an associative array, also called a dictionary or simply map; an associative array is an abstract data type that maps Unique key, keys to Value (computer science), values. ...
, over which it has the following advantages:
* Searching for a node with an associated key of size has the complexity of , whereas an imperfect hash function may have numerous colliding keys, and the worst-case lookup speed of such a table would be , where denotes the total number of nodes within the table.
* Tries do not need a hash function for the operation, unlike a hash table; there are also no collisions of different keys in a trie.
* Buckets in a trie, which are analogous to hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value.
* String keys within the trie can be sorted using a predetermined alphabetical ordering.
However, tries are less efficient than a hash table when the data is directly accessed on a secondary storage device such as a hard disk drive that has higher random access
Random access (also called direct access) is the ability to access an arbitrary element of a sequence in equal time or any datum from a population of addressable elements roughly as easily and efficiently as any other, no matter how many elemen ...
time than the main memory
Computer data storage or digital data storage is a technology consisting of computer components and recording media that are used to retain digital data. It is a core function and fundamental component of computers.
The central processin ...
. Tries are also disadvantageous when the key value cannot be easily represented as string, such as floating point numbers where multiple representations are possible (e.g. 1 is equivalent to 1.0, +1.0, 1.00, etc.), however it can be unambiguously represented as a binary number
A binary number is a number expressed in the Radix, base-2 numeral system or binary numeral system, a method for representing numbers that uses only two symbols for the natural numbers: typically "0" (zero) and "1" (one). A ''binary number'' may ...
in IEEE 754
The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for floating-point arithmetic originally established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard #Design rationale, add ...
, in comparison to two's complement
Two's complement is the most common method of representing signed (positive, negative, and zero) integers on computers, and more generally, fixed point binary values. Two's complement uses the binary digit with the ''greatest'' value as the ''s ...
format.
Implementation strategies
Tries can be represented in several ways, corresponding to different trade-offs between memory use and speed of the operations. Using a vector of pointers for representing a trie consumes enormous space; however, memory space can be reduced at the expense of running time if a singly linked list is used for each node vector, as most entries of the vector contains .
Techniques such as ''alphabet reduction'' may reduce the large space requirements by reinterpreting the original string as a longer string over a smaller alphabet i.e. a string of bytes can alternatively be regarded as a string of four-bit units and stored in a trie with 16 instead of 256 pointers per node. Although this can reduce memory usage by up to a factor of eight, lookups need to visit twice as many nodes in the worst case. Other techniques include storing a vector of 256 ASCII pointers as a bitmap of 256 bits representing ASCII alphabet, which reduces the size of individual nodes dramatically.
Bitwise tries
Bitwise tries are used to address the enormous space requirement for the trie nodes in a naive simple pointer vector implementations. Each character in the string key set is represented via individual bits, which are used to traverse the trie over a string key. The implementations for these types of trie use vectorized CPU instructions to find the first set bit in a fixed-length key input (e.g. GCC's __builtin_clz()
intrinsic function
In computer software, in compiler theory, an intrinsic function, also called built-in function or builtin function, is a function ( subroutine) available for use in a given programming language whose implementation is handled specially by the com ...
). Accordingly, the set bit is used to index the first item, or child node, in the 32- or 64-entry based bitwise tree. Search then proceeds by testing each subsequent bit in the key.
This procedure is also cache-local and highly parallelizable due to register independency, and thus performant on out-of-order execution
In computer engineering, out-of-order execution (or more formally dynamic execution) is an instruction scheduling paradigm used in high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In t ...
CPUs.
Compressed tries
Radix tree, also known as a compressed trie, is a space-optimized variant of a trie in which any node with only one child gets merged with its parent; elimination of branches of the nodes with a single child results in better metrics in both space and time. This works best when the trie remains static and set of keys stored are very sparse within their representation space.
One more approach is to "pack" the trie, in which a space-efficient implementation of a sparse packed trie applied to automatic hyphenation, in which the descendants of each node may be interleaved in memory.
Patricia trees
Patricia trees are a particular implementation of the compressed binary trie that uses the binary encoding of the string keys in its representation. Every node in a Patricia tree contains an index, known as a "skip number", that stores the node's branching index to avoid empty subtrees during traversal. A naive implementation of a trie consumes immense storage due to larger number of leaf-nodes caused by sparse distribution of keys; Patricia trees can be efficient for such cases.
A representation of a Patricia tree is shown to the right. Each index value adjacent to the nodes represents the "skip number"—the index of the bit with which branching is to be decided. The skip number 1 at node 0 corresponds to the position 1 in the binary encoded ASCII where the leftmost bit differed in the key set . The skip number is crucial for search, insertion, and deletion of nodes in the Patricia tree, and a bit masking operation is performed during every iteration.
Applications
Trie data structures are commonly used in predictive text or autocomplete
Autocomplete, or word completion, is a feature in which an application software, application predicts the rest of a word a user is typing. In Android (operating system), Android and iOS smartphones, this is called predictive text. In graphical us ...
dictionaries, and approximate matching algorithms. Tries enable faster searches, occupy less space, especially when the set contains large number of short strings, thus used in spell checking, hyphenation applications and longest prefix match algorithms. However, if storing dictionary words is all that is required (i.e. there is no need to store metadata associated with each word), a minimal deterministic acyclic finite state automaton (DAFSA) or radix tree would use less storage space than a trie. This is because DAFSAs and radix trees can compress identical branches from the trie which correspond to the same suffixes (or parts) of different words being stored. String dictionaries are also utilized in natural language processing
Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...
, such as finding lexicon
A lexicon (plural: lexicons, rarely lexica) is the vocabulary of a language or branch of knowledge (such as nautical or medical). In linguistics, a lexicon is a language's inventory of lexemes. The word ''lexicon'' derives from Greek word () ...
of a text corpus
In linguistics and natural language processing, a corpus (: corpora) or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated.
Annotated, they have been used in corp ...
.
Sorting
Lexicographic sorting of a set of string keys can be implemented by building a trie for the given keys and traversing the tree in pre-order fashion; this is also a form of radix sort. Tries are also fundamental data structures for burstsort, which is notable for being the fastest string sorting algorithm as of 2007, accomplished by its efficient use of CPU cache.
Full-text search
A special kind of trie, called a suffix tree, can be used to index all suffix
In linguistics, a suffix is an affix which is placed after the stem of a word. Common examples are case endings, which indicate the grammatical case of nouns and adjectives, and verb endings, which form the conjugation of verbs. Suffixes can ca ...
es in a text to carry out fast full-text searches.
Web search engines
A specialized kind of trie called a compressed trie, is used in web search engine
A search engine is a software system that provides hyperlinks to web pages, and other relevant information on World Wide Web, the Web in response to a user's web query, query. The user enters a query in a web browser or a mobile app, and the sea ...
s for storing the indexes
Index (: indexes or indices) may refer to:
Arts, entertainment, and media Fictional entities
* Index (A Certain Magical Index), Index (''A Certain Magical Index''), a character in the light novel series ''A Certain Magical Index''
* The Index, a ...
- a collection of all searchable words. Each terminal node is associated with a list of URLs—called occurrence list—to pages that match the keyword. The trie is stored in the main memory, whereas the occurrence is kept in an external storage, frequently in large clusters, or the in-memory index points to documents stored in an external location.
Bioinformatics
Tries are used in Bioinformatics
Bioinformatics () is an interdisciplinary field of science that develops methods and Bioinformatics software, software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, ...
, notably in sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural biology, structural, or evolutionary relationships between ...
software applications such as BLAST, which indexes all the different substring of length ''k'' (called k-mers) of a text by storing the positions of their occurrences in a compressed trie sequence databases.
Internet routing
Compressed variants of tries, such as databases for managing Forwarding Information Base (FIB), are used in storing IP address prefixes within routers and bridges
A bridge is a structure built to span a physical obstacle (such as a body of water, valley, road, or railway) without blocking the path underneath. It is constructed for the purpose of providing passage over the obstacle, which is usually somet ...
for prefix-based lookup to resolve mask-based operations in IP routing
IP routing is the application of traffic routing methodologies to IP networks. This involves technologies, protocols, structure, administrations, and policies of the worldwide Internet infrastructure. In each IP network node, IP routing involv ...
.
See also
* Suffix tree
* Hash trie
* Hash array mapped trie
* Prefix hash tree
* Ctrie
* HAT-trie
* Aho–Corasick algorithm
References
External links
NIST's Dictionary of Algorithms and Data Structures: Trie
{{Strings
Trees (data structures)
Finite-state machines
Articles with example Python (programming language) code