In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.
The construction of such a tree for the string
S
S
S
The concept was first introduced by .Rather than the suffix
S[i..n]
i
S
S[k+1..n]
S[k..n]
S[n..n]
S[1..n]
n-1
O(n2)
O(n2)
S=anbnanbn\$.
was the first to build a (compressed) trie of all suffixes of
S
i
further simplified the construction. He provided the first online-construction of suffix trees, now known as Ukkonen's algorithm, with running time that matched the then fastest algorithms.These algorithms are all linear-time for a constant-size alphabet, and have worst-case running time of
O(nlogn)
gave the first suffix tree construction algorithm that is optimal for all alphabets. In particular, this is the first linear-time algorithm for strings drawn from an alphabet of integers in a polynomial range. Farach's algorithm has become the basis for new algorithms for constructing both suffix trees and suffix arrays, for example, in external memory, compressed, succinct, etc.
The suffix tree for the string
S
n
1
n
S
i
S[i..n]
i
1
n
If a suffix of
S
S
$
). This ensures that no suffix is a prefix of another, and that there will be n
n
S
n-1
n+(n-1)+1=2n
n
n-1
Suffix links are a key feature for older linear-time construction algorithms, although most newer algorithms, which are based on Farach's algorithm, dispense with suffix links. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string
\chi\alpha
\chi
\alpha
\alpha
ANA
to the node for NA
in the figure above. Suffix links are also used in some algorithms running on the tree.A generalized suffix tree is a suffix tree made for a set of strings instead of a single string. It represents all suffixes from this set of strings. Each string must be terminated by a different termination symbol.
A suffix tree for a string
S
n
\Theta(n)
O(n)
O(nlogn)
Assume that a suffix tree has been built for the string
S
n
D=\{S1,S2,...,SK\}
n=n1+n2+ … +nK
P
m
O(m)
P1,...,Pq
m
O(m)
z
P1,...,Pq
m
O(m+z)
n
P
P[i...m]
D
\Theta(m)
P
Si
Sj
\Theta(ni+nj)
\Theta(n+z)
\Theta(n)
\Theta(n)
\Theta(n)
\Sigma
D
O(n+z)
z
\Theta(n)
i
Si
D
\Theta(n)
The suffix tree can be prepared for constant time lowest common ancestor retrieval between nodes in
\Theta(n)
Si[p..ni]
Sj[q..nj]
\Theta(1)
O(kn+z)
z
\Theta(n)
\Theta(gn)
g
\Theta(kn)
k
z
O(nlogn+z)
O(knlog(n/k)+z)
k
D
k=2,...,K
\Theta(n)
Suffix trees can be used to solve a large number of string problems that occur in text-editing, free-text search, computational biology and other application areas. Primary applications include:[20]
Suffix trees are often used in bioinformatics applications, searching for patterns in DNA or protein sequences (which can be viewed as long strings of characters). The ability to search efficiently with mismatches might be considered their greatest strength. Suffix trees are also used in data compression; they can be used to find repeated data, and can be used for the sorting stage of the Burrows–Wheeler transform. Variants of the LZW compression schemes use suffix trees (LZSS). A suffix tree is also used in suffix tree clustering, a data clustering algorithm used in some search engines.[21]
If each node and edge can be represented in
\Theta(1)
\Theta(n)
O(n2)
\Theta(n)
2n
An important choice when making a suffix tree implementation is the parent-child relationships between nodes. The most common is using linked lists called sibling lists. Each node has a pointer to its first child, and to the next node in the child list it is a part of. Other implementations with efficient running time properties use hash maps, sorted or unsorted arrays (with array doubling), or balanced search trees. We are interested in:
Let be the size of the alphabet. Then you have the following costs:
Lookup | Insertion | Traversal | ||
---|---|---|---|---|
Sibling lists / unsorted arrays | ||||
Bitwise sibling trees | ||||
Hash maps | ||||
Balanced search tree | ||||
Sorted arrays | ||||
Hash maps + sibling lists |
The insertion cost is amortised, and that the costs for hashing are given for perfect hashing.
The large amount of information in each edge and node makes the suffix tree very expensive, consuming about 10 to 20 times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of 8 (for array including LCP values built within 32-bit address space and 8-bit characters.) This factor depends on the properties and may reach 2 with usage of 4-byte wide characters (needed to contain any symbol in some UNIX-like systems, see wchar_t) on 32-bit systems. Researchers have continued to find smaller indexing structures.
Various parallel algorithms to speed up suffix tree construction have been proposed.Recently, a practical parallel algorithm for suffix tree construction with
O(n)
O(log2n)
Though linear, the memory usage of a suffix tree is significantly higherthan the actual size of the sequence collection. For a large text,construction may require external memory approaches.
There are theoretical results for constructing suffix trees in externalmemory.The algorithm by is theoretically optimal, with an I/O complexity equal to that of sorting.However the overall intricacy of this algorithm has prevented, so far, itspractical implementation.
On the other hand, there have been practical works for constructingdisk-based suffix treeswhich scale to (few) GB/hours.The state of the art methods are TDD,[22] TRELLIS,[23] DiGeST,[24] andB2ST.[25]
TDD and TRELLIS scale up to the entire human genome resulting in a disk-based suffix tree of a size in the tens of gigabytes. However, these methods cannot handle efficiently collections of sequences exceeding 3 GB. DiGeST performs significantly better and is able to handle collections of sequences in the order of 6 GB in about 6 hours.
All these methods can efficiently build suffix trees for the case when thetree does not fit in main memory,but the input does.The most recent method, B2ST, scales to handleinputs that do not fit in main memory. ERA is a recent parallel suffix tree construction method that is significantly faster. ERA can index the entire human genome in 19 minutes on an 8-core desktop computer with 16 GB RAM. On a simple Linux cluster with 16 nodes (4 GB RAM per node), ERA can index the entire human genome in less than 9 minutes.