HOME

TheInfoList



OR:

The PH-tree is a
tree data structure In computer science, a tree is a widely used abstract data type that represents a hierarchical tree structure with a set of connected nodes. Each node in the tree can be connected to many children (depending on the type of tree), but must be con ...
used for spatial indexing of multi-dimensional data (keys) such as geographical coordinates, points, feature vectors, rectangles or bounding boxes. The PH-tree is space partitioning index with a structure similar to that of a
quadtree A quadtree is a tree data structure in which each internal node has exactly four children. Quadtrees are the two-dimensional analog of octrees and are most often used to partition a two-dimensional space by recursively subdividing it into four ...
or
octree An octree is a tree data structure in which each internal node has exactly eight children. Octrees are most often used to partition a three-dimensional space by recursively subdividing it into eight octants. Octrees are the three-dimensional a ...
. However, unlike quadtrees, it uses a splitting policy based on tries and similar to Crit bit trees that is based on the bit-representation of the keys. The bit-based splitting policy, when combined with the use of different internal representations for nodes, provides scalability with high-dimensional data. The bit-representation splitting policy also imposes a maximum depth, thus avoiding degenerated trees and the need for rebalancing.


Overview

The basic PH-tree is a spatial index that
maps A map is a symbolic depiction emphasizing relationships between elements of some space, such as objects, regions, or themes. Many maps are static, fixed to paper or some other durable medium, while others are dynamic or interactive. Althoug ...
keys, which are -dimensional vectors with integers, to user defined values. The PH-tree is a multi-dimensional generalization of a Crit bit tree in the sense that a Crit bit tree is equivalent to a PH-tree with 1-dimensional keys. Like the Crit bit tree, and unlike most other spatial indexes, the PH-tree is a '' map'' rather than a ''
multimap In computer science, a multimap (sometimes also multihash, multidict or multidictionary) is a generalization of a map or associative array abstract data type in which more than one value may be associated with and returned for a given key. Both m ...
''. A -dimensional PH-tree is a tree of nodes where each node partitions space by subdividing it into 2^d ''quadrants'' (see
below Below may refer to: *Earth * Ground (disambiguation) * Soil * Floor * Bottom (disambiguation) * Less than *Temperatures below freezing * Hell or underworld People with the surname * Ernst von Below (1863–1955), German World War I general * Fr ...
for how potentially large nodes scales with high dimensional data). Each ''quadrant'' contains at most one ''entry'', either a key-value pair (leaf quadrant) or a key-subnode pair. For a key-subnode pair, the key represents the center of the subnode. The key is also the common prefix (bit-representation) of all keys in the subnode and its child subnodes. Each node has at least two entries, otherwise it is merged with the parent node. Some other structural properties of PH-trees are: * They are 2^n-ary trees. * They are inherently unbalanced but imbalance is limited due to their depth being limited to the bit width of the keys, e.g. to 32 for a d-dimensional key with 32bit integers. * Insertion or removal operations cause exactly one node to be modified and potentially a second node to be added or removed. This can be useful for concurrent implementations. This also means little variation in modification cost. * Their structure is independent from insertion/removal order.


Splitting strategy

Similar to most quadtrees, the PH-tree is a hierarchy of nodes where every node splits the space in all dimensions. Thus, a node can have up to 2^d subnodes, one for each quadrant.


Quadrant numbering

The PH-tree uses the bits of the multi-dimensional keys to determine their position in the tree. All keys that have the same leading bits are stored in the same branch of the tree. For example, in a node at level , to determine the quadrant where a key should be inserted (or removed or looked up), it looks at the 's bit of each dimension of the key. For a 3D node with 8 quadrants (forming a cube) the 's bit of the first dimension of the key determines whether the target quadrant is on the left or the right of the cube, the 's bit of the second dimension determines whether it is at the front or the back, and the 's bit of the third dimension determines bottom vs top, see picture.


1D example

Example with three 1D keys with 8bit values: k_0 = \_ = \_, k_1 = \_ = \_ and k_2 = \_ = \_. Adding k_0 and k_1 to an empty tree results in a single node. The two keys first differ in their 6th bit so the node has a level L=5 (starting with 0). The node has a 5bit prefix representing the common 5 bits of both keys. The node has two quadrants, each key is stored in one quadrant. Adding a third key k_3 results in one additional node at L=2 with one quadrant containing the original node as subnode and the other quadrant containing the new key k_2.


2D example

With 2D keys every node has 2^d=4 quadrants. The position of the quadrant where a key is stored is extracted from the respective bits of the keys, one bit from each dimension. The four quadrants of the node form a 2D hypercube (quadrants may be empty). The bits that are extracted from the keys form the hypercube address h, for k_0 \rarr h=\_2 and for k_1 \rarr h=\_2. h is effectively the position of the quadrant in the node's hypercube.


Node structure

The ordering of the entries in a node always follows Z-ordering. Entries in a node can, for example, be stored in fixed size arrays of size 2^d. is then effectively the array index of a quadrant. This allows lookup, insert and remove with O(1) and there is no need to store . Space complexity is however O(2^d) per node, so it is less suitable for high dimensional data. Another solution is to store entries in a sorted collection, such as dynamic arrays and/or
B-trees In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for ...
. This slows down lookup operations to O(\log) but reduces memory consumption to O(n_). The original implementation aimed for minimal memory consumption by switching between fixed and dynamic array representation depending on which uses less memory. Other implementatio

https://github.com/improbable-eng/phtree-cpp] do not switch dynamically but use fixed arrays for d \lesssim 4, dynamic arrays for d \lesssim 8 and B-trees for high dimensional data.


Operations

#Lookup, Lookup, insertion and removal operations all work very similar: find the correct node, then perform the operation on the node.
Window queries A window is an opening in a wall, door, roof, or vehicle that allows the exchange of light and may also allow the passage of sound and sometimes air. Modern windows are usually glazed or covered in some other transparent or translucent mater ...
and -nearest-neighbor searches are more complex.


Lookup

The ''Lookup'' operation determines whether a key exists in the tree. It walks down the tree and checks every node whether it contains a candidate subnode or a user value that matches the key. function lookup(key) is entry ← get_root_entry() // if the tree is not empty the root entry contains a root node while entry != NIL && entry.is_subnode() do node ← entry.get_node() entry ← node.get_entry(key) repeat return entry // entry can be NIL function get_entry(key) is node ← current node h ← extract_bits_at_depth(key, node.get_depth()} entry ← node.get_entry_at(h) return entry // entry can be NIL


Insert

The ''Insert'' operation inserts a new key-value pair into the tree unless they key already exists. The operation traverses the tree like the ''Lookup'' function and then inserts the key into the node. There are several cases to consider: # The quadrant is empty and we can simply insert a new entry into the quadrant and return. # The quadrant contains a user entry with a key that is identical to the new entry. One way to deal with such a ''collision'' is to return a flag that indicates failed insertion. If the tree is implemented as multi-map with a collection as the node's entry, the new value is added to that collection. # The quadrant contains an entry (user entry or subnode entry) with a different key. This case requires replacing the existing entry with a new subnode that holds the old and the new entry. function insert(node, key, value) level ← node.get_level() // Level is 0 for root h ← extract_bits_at_level(key, level) entry ← node.get_entry(h) if entry

NIL then // Case 1. entry_new ← create_entry(key, value) node.set_entry(h, entry_new) else if !entry.is_subnode() && entry.get_key()

key then // Case 2. Collision, there is already an entry return ← failed_insertion else // Case 3. level_diff ← get_level_of_difference(key, entry.get_key()) entry_new ← create_entry(key, value) // new subnode with existing entry and new entry subnode_new ← create_node(level_diff, entry, entry_new) node.set_entry(h, subnode_new) end if return


Remove

Removal works inversely to insertion, with the additional constraint that any subnode has to be removed if less than two entries remain. The remaining entry is moved to the parent node.


Window queries

Windows queries are queries that return all keys that lie inside a rectangular axis-aligned hyperbox. They can be defined to be two -dimensional points min and max that represent the "lower left" and "upper right" corners of the query box. A trivial implementation traverses all entries in a node (starting with the root node) and if an entry matches it either adds it to the result list (if it is a user entry) or recursively traverses it (if it is a subnode). function query(node, min, max, result_list) is foreach entry ← node.get_entries() do if entry.is_subnode() then if entry.get_prefix() >= min and entry.get_prefix() <= max then query(entry.get_subnode(), min, max, result_list) end if else if entry.get_key() >= min and entry.get_key() <= max then result_list.add(entry) end if end if repeat return In order to accurately estimate query time complexity the analysis needs to include the dimensionality d. Traversing and comparing all n_ entries in a node has a time complexity of O(d * n_) because each comparison of d-dimensional key with min/max takes O(d) time. Since nodes can have up to 2^d entries, this does not scale well with increasing dimensionality d. There are various ways how this approach can be improved by making use of the hypercube address .


Min & max

The idea is to find minimum and maximum values for the quadrant's addresses h such that the search can avoid some quadrants that do not overlap with the query box. Let C be the center of a node (this is equal to the node's prefix) and h_ and h_ be two bit strings with d bits each. Also, let subscript i with 0 \leq i < d indicate the i's bit of h_ and h_ and the i'th dimension of min, max and C. Let h_ = (min_i \leq C_i) and h_ = (max_i \geq C_i). h_ then has a `1` for every dimension where the "lower" half of the node and all quadrants in it does not overlap with the query box. Similarly, h_ has a `0` for every dimension where the "upper" half does not overlap with the query box. h_ and h_ then present the lowest and highest h in a node that need to be traversed. Quadrants with h < h_ or h > h_ do not intersect with the query box. A proof is available in. With this, the above query function can be improved to: function query(node, min, max, result_list) is h_min ← calculate h_min h_max ← calculate h_max for each entry ← node.get_entries_range(h_min, h_max) do ... repeat return Calculating h_ and h_ is O(2*d) = O(d). Depending on the distribution of the occupied quadrants in a node this approach will allow avoiding anywhere from no to almost all key comparisons. This reduces the average traversal time but the resulting complexity is still O(d + d * n_).


Check quadrants for overlap with query box

Between h_ and h_ there can still be quadrants that do not overlap with the query box. Idea: h_ and h_ each have one bit for every dimensions that indicates whether the query box overlaps with the lower/upper half of a node in that dimension. This can be used to quickly check whether a quadrant h overlaps with the query box without having to compare d-dimensional keys: a quadrant h overlaps with the query box if for every `0` bit in h there is a corresponding `0` bit in h_ and for every `1` bit in h there is a corresponding `1` bit in h_. On a CPU with 64bit registers it is thus possible to check for overlap of up to 64-dimensional keys in O(1). function is_overlap(h, h_min, h_max) is return (h , h_min) & h_max

h // evaluates to 'true' if quadrant and query overlap. function query(node, min, max, result_list) is h_min ← calculate h_min h_max ← calculate h_max for each entry ← node.get_entries_range(h_min, h_max) do h ← entry.get_h(); if (h , h_min) & h_max

h then // evaluates to 'true' if quadrant and query overlap. ... end if repeat return The resulting time complexity is O(d + n_) compared to the O(d * n_) of the full iteration.


Traverse quadrants that overlap with query box

For higher dimensions with larger nodes it is also possible to avoid iterating through all h and instead directly calculate the next higher h that overlaps with the query box. The first step puts `1`-bits into a given h_ for all quadrants that have no overlap with the query box. The second step increments the adapted h and the added `1`-bits trigger an overflow so that the non-overlapping quadrants are skipped. The last step removes all the undesirable bits used for triggering the overflow. The logic is described in detail in. The calculation works as follows: function increment_h(h_input, h_min, h_max) is h_out = h_input , (~ h_max ) // pre - mask h_out += 1 // increment h_out = ( h_out & h_max ) , h_min // post - mask return h_out Again, for d \leq 64 this can be done on most CPUs in O(1). The resulting time complexity for traversing a node is O(d + n_). This works best if most of the quadrants that overlap with the query box are occupied with an entry.


-nearest neighbors

nearest neighbor searches can be implemented using standard algorithms.


Floating point keys

The PH-tree can only store integer values. Floating point values can trivially be stored as integers
casting Casting is a manufacturing process in which a liquid material is usually poured into a mold, which contains a hollow cavity of the desired shape, and then allowed to solidify. The solidified part is also known as a ''casting'', which is ejected ...
it them an integer. However, the authors of also propose an approach without loss of precision.


Lossless conversion

Lossless converting of a floating point value into an integer value (and back) without loss if precision can be achieved by simply interpreting the 32 or 64 bits of the floating point value as an integer (with 32 or 64 bits). Due to the way that
IEEE 754 The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found i ...
encodes floating point values, the resulting integer values have the same ordering as the original floating point values, at least for positive values. Ordering for negative values can be achieved by inverting the non-sign bits. Example implementations in Java: long encode(double value) Example implementations in C++: std::int64_t encode(double value) Encoding (and the inverse decoding) is lossless for all floating point values. The ordering works well in practice, including \pm\infty and -0.0. However, the integer representation also turns NaN into a normal comparable value (smaller than infinity), infinities are comparable to each other and 0.0 is larger than -0.0. That means that, for example, a query range .0, 10.0/math> will ''not'' match a value of -0.0. In order to match -0.0 the query range needs to be 0.0, 10.0/math>.


Hyperbox keys

In order to store volumes (axis-aligned hyper-boxes) as keys, implementations typically use ''corner representation'' which converts the two d-dimensional minimum and maximum corners of a box into a single key with 2*d dimensions, for example by interleaving them: k = \. This works trivially for lookup, insert and remove operations. Window queries need to be converted from d-dimensional vectors to 2*d-dimensional vectors. For example, for a window query that matches all boxes that are completely ''inside'' the query box, the query keys are: k_ = \ k_ = \ For a window query operation that matches all boxes that ''intersect'' with a query box, the query keys are: k_ = \ k_ = \


Scalability

In high dimensions with less than 2^d entries, a PH-tree may have only a single node, effectively “degenerating” into a
B-Tree In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for ...
with
Z-order curve In mathematical analysis and computer science, functions which are Z-order, Lebesgue curve, Morton space-filling curve, Morton order or Morton code map multidimensional data to one dimension while preserving locality of the data points. It ...
. The add/remove/lookup operations remain O(\log) and window queries can use the quadrant filters. However, this cannot avoid the
curse of dimensionality The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. Th ...
, for high dimensional data with d=50 or d=100 a PH-tree is only marginally better than a full scan.


Uses

Research has reported fast add/remove/exact-match operations with large and fast changing datasets.. Window queries have been shown to work well especially for small windows or large dataset The PH-tree is mainly suited for in-memory use. The size of the nodes (number of entries) is fixed while persistent storage tends to benefit from indexes with configurable node size to align node size with page size on disk. This is easier with other spatial indexes, such as R-Trees.


Implementations

* Java
GitHub repository by original inventor
* C++
GitHub repository by original inventor
* C++
GitHub repository
* C++
GitHub repository


See also

*
Binary space partitioning In computer science, binary space partitioning (BSP) is a method for space partitioning which recursively subdivides a Euclidean space into two convex sets by using hyperplanes as partitions. This process of subdividing gives rise to a represe ...
*
Binary tiling In geometry, the binary tiling (sometimes called the Böröczky tiling) is a tiling of the hyperbolic plane, resembling a quadtree over the Poincaré half-plane model of the hyperbolic plane. It was first studied mathematically in 1974 by . Howev ...
*
Kd-tree In computer science, a ''k''-d tree (short for ''k-dimensional tree'') is a space-partitioning data structure for organizing points in a ''k''-dimensional space. ''k''-d trees are a useful data structure for several applications, such as searc ...
*
Octree An octree is a tree data structure in which each internal node has exactly eight children. Octrees are most often used to partition a three-dimensional space by recursively subdividing it into eight octants. Octrees are the three-dimensional a ...
*
Quadtree A quadtree is a tree data structure in which each internal node has exactly four children. Quadtrees are the two-dimensional analog of octrees and are most often used to partition a two-dimensional space by recursively subdividing it into four ...
*
R-tree R-trees are tree data structures used for spatial access methods, i.e., for indexing multi-dimensional information such as geographical coordinates, rectangles or polygons. The R-tree was proposed by Antonin Guttman in 1984 and has found s ...
*
UB-tree The UB-tree as proposed by Rudolf Bayer and Volker Markl is a balanced tree for storing and efficiently retrieving multidimensional data. It is basically a B+ tree A B+ tree is an m-ary tree with a variable but often large number of childre ...
*
Spatial database A spatial database is a general-purpose database (usually a relational database) that has been enhanced to include spatial data that represents objects defined in a geometric space, along with tools for querying and analyzing such data. Most s ...


External links


PH-tree website with detailed description, examples and performance comparison


References

{{CS-Trees Trees (data structures) Database index techniques Geometric data structures