Succinct data structure explained

In computer science, a succinct data structure is a data structure which uses an amount of space that is "close" to the information-theoretic lower bound, but (unlike other compressed representations) still allows for efficient query operations. The concept was originally introduced by Jacobson to encode bit vectors, (unlabeled) trees, and planar graphs. Unlike general lossless data compression algorithms, succinct data structures retain the ability to use them in-place, without decompressing them first. A related notion is that of a compressed data structure, insofar as the size of the stored or encoded data similarly depends upon the specific content of the data itself.

Suppose that

Z

is the information-theoretical optimal number of bits needed to store some data. A representation of this data is called:

Z+O(1)

bits of space,

Z+o(Z)

bits of space, and

O(Z)

bits of space.

For example, a data structure that uses

2Z

bits of storage is compact,

Z+\sqrt{Z}

bits is succinct,

Z+lgZ

bits is also succinct, and

Z+3

bits is implicit.

Implicit structures are thus usually reduced to storing information using some permutation of the input data; the most well-known example of this is the heap.

Succinct indexable dictionaries

Succinct indexable dictionaries, also called rank/select dictionaries, form the basis of a number of succinct representation techniques, including binary trees,

k

-ary trees and multisets, as well as suffix trees and arrays. The basic problem is to store a subset

S

of a universe

U= [0...n)=\{0,1,...,n-1\}

, usually represented as a bit array

B[0...n)

where

B[i]=1

iff

i\inS.

An indexable dictionary supports the usual methods on dictionaries (queries, and insertions/deletions in the dynamic case) as well as the following operations:

rankq(x)=|\{k\in[0...x]:B[k]=q\}|

selectq(x)=min\{k\in[0...n):rankq(k)=x\}

for

q\in\{0,1\}

.

In other words,

rankq(x)

returns the number of elements equal to

q

up to position

x

while

selectq(x)

returns the position of the

x

-th occurrence of

q

.

There is a simple representation which uses

n+o(n)

bits of storage space (the original bit array and an

o(n)

auxiliary structure) and supports rank and select in constant time. It uses an idea similar to that for range-minimum queries; there are a constant number of recursions before stopping at a subproblem of a limited size. The bit array

B

is partitioned into large blocks of size

l=lg2n

bits and small blocks of size

s=lgn/2

bits. For each large block, the rank of its first bit is stored in a separate table

Rl[0...n/l)

; each such entry takes

lgn

bits for a total of

(n/l)lgn=n/lgn

bits of storage. Within a large block, another directory

Rs[0...l/s)

stores the rank of each of the

l/s=2lgn

small blocks it contains. The difference here is that it only needs

lgl=lglg2n=2lglgn

bits for each entry, since only the differences from the rank of the first bit in the containing large block need to be stored. Thus, this table takes a total of

(n/s)lgl=4nlglgn/lgn

bits. A lookup table

Rp

can then be used that stores the answer to every possible rank query on a bit string of length

s

for

i\in[0,s)

; this requires

2sslgs=O(\sqrt{n}lgnlglgn)

bits of storage space. Thus, since each of these auxiliary tables take

o(n)

space, this data structure supports rank queries in

O(1)

time and

n+o(n)

bits of space.

To answer a query for

rank1(x)

in constant time, a constant time algorithm computes:

rank1(x)=Rl[\lfloorx/l\rfloor]+Rs[\lfloorx/s\rfloor]+Rp[x\lfloorx/s\rfloor,xmods]

In practice, the lookup table

Rp

can be replaced by bitwise operations and smaller tables that can be used to find the number of bits set in the small blocks. This is often beneficial, since succinct data structures find their uses in large data sets, in which case cache misses become much more frequent and the chances of the lookup table being evicted from closer CPU caches becomes higher. Select queries can be easily supported by doing a binary search on the same auxiliary structure used for rank; however, this takes

O(lgn)

time in the worst case. A more complicated structure using

3n/lglgn+O(\sqrt{n}lgnlglgn)=o(n)

bits of additional storage can be used to support select in constant time. In practice, many of these solutions have hidden constants in the

O()

notation which dominate before any asymptotic advantage becomes apparent; implementations using broadword operations and word-aligned blocks often perform better in practice.

Entropy-compressed solutions

The

n+o(n)

space approach can be improved by noting that there are

style\binom{n}{m}

distinct

m

-subsets of

[n)

(or binary strings of length

n

with exactly

m

1’s), and thus

stylel{B}(m,n)=\lceillg\binom{n}{m}\rceil

is an information theoretic lower bound on the number of bits needed to store

B

. There is a succinct (static) dictionary which attains this bound, namely using

l{B}(m,n)+o(l{B}(m,n))

space. This structure can be extended to support rank and select queries and takes

l{B}(m,n)+O(m+nlglgn/lgn)

space. Correct rank queries in this structure are however limited to elements contained in the set, analogous to how minimal perfect hashing functions work. This bound can be reduced to a space/time tradeoff by reducing the storage space of the dictionary to

l{B}(m,n)+O(ntt/lgtn+n3/4)

with queries taking

O(t)

time.

It is also possible to construct a indexible dictionary supporting rank (but not select) that uses fewer than

stylel{B}(m,n)

bits. Such a dictionary is called a monotone minimal perfect hash function, and can be implemented using as few as

O(mlogloglogn)

bits.[1]

Succinct hash tables

A succinct hash table, also known as a succinct unordered dictionary, is a data structure that stores

m

keys from a universe

\{0,1,...,n-1\}

using space

(1+o(1))l{B}(m,n)

bits, and while supporting membership queries in constant expected time. If a succinct hash table also supports insertions and deletions in constant expected time, then it is referred to as dynamic, and otherwise it is referred to as static.

The first dynamic succinct hash table was due to Raman and Rao in 2003. In the case where

n=poly(m)

, their solution uses space

l{B}(m,n)+O(mloglogm)

bits. Subsequently, it was shown that this space bound could be improved to

l{B}(m,n)+O(mloglogloglogm)

bits for any constant number of logarithms[2] and a little after that this bound was also optimal.[3] [4] The latter solution supports all operations in worst-case constant time with high probability.

The first static succinct hash table was due to Pagh in 1999.[5] [6] In the case where

n=poly(m)

, their solution uses space

l{B}(m,n)+O(m(loglogm)2/logm)

bits, and supports worst-case constant-time queries. This bound was subsequently improved to

l{B}(m,n)+m/polylogm

bits,[7] and then to

l{B}(m,n)+polylogm

bits.[8] Whereas the first two solutions support worst-case constant-time queries, the final one supports constant expected-time queries. The final solution also requires access to a lookup table of size

n\epsilon

, but this lookup table is independent of the set of elements being stored.

Other examples

A string with an arbitrary length (Pascal string) takes Z + log(Z) space, and is thus succinct. If there is a maximum length – which is the case in practice, since 232 = 4 GiB of data is a very long string, and 264 = 16 EiB of data is larger than any string in practice – then a string with a length is also implicit, taking Z + k space, where k is the number of data to represent the maximum length (e.g., 64 bits).

When a sequence of variable-length items (such as strings) needs to be encoded, there are various possibilities. A direct approach is to store a length and an item in each record – these can then be placed one after another. This allows efficient next, but not finding the kth item. An alternative is to place the items in order with a delimiter (e.g., null-terminated string). This uses a delimiter instead of a length, and is substantially slower, since the entire sequence must be scanned for delimiters. Both of these are space-efficient. An alternative approach is out-of-band separation: the items can simply be placed one after another, with no delimiters. Item bounds can then be stored as a sequence of length, or better, offsets into this sequence. Alternatively, a separate binary string consisting of 1s in the positions where an item begins, and 0s everywhere else is encoded along with it. Given this string, the

select

function can quickly determine where each item begins, given its index.[9] This is compact but not succinct, as it takes 2Z space, which is O(Z).

Another example is the representation of a binary tree: an arbitrary binary tree on

n

nodes can be represented in

2n+o(n)

bits while supporting a variety of operations on any node, which includes finding its parent, its left and right child, and returning the size of its subtree, each in constant time. The number of different binary trees on

n

nodes is

{\tbinom{2n}{n}}

/(n+1)

. For large

n

, this is about

4n

; thus we need at least about
n)=2n
log
2(4
bits to encode it. A succinct binary tree therefore would occupy only

2

bits per node.

See also

References

  1. Belazzougui . Djamal . Boldi . Paolo . Pagh . Rasmus . Vigna . Sebastiano . 2009-01-04 . Monotone Minimal Perfect Hashing: Searching a Sorted Table with O(1) Accesses . Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms . 785–794 . Philadelphia, PA . Society for Industrial and Applied Mathematics . 10.1137/1.9781611973068.86. 978-0-89871-680-1 . free .
  2. Book: Bender . Michael A. . Farach-Colton . Martín . Kuszmaul . John . Kuszmaul . William . Liu . Mingmou . Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing . On the optimal time/Space tradeoff for hash tables . 2022-06-09 . http://dx.doi.org/10.1145/3519935.3519969 . 1284–1297 . New York, NY, USA . ACM . 10.1145/3519935.3519969. 2111.00602 . 1721.1/146419 . 9781450392648 . 240354692 .
  3. Li . Tianxiao . Liang . Jingxun . Yu . Huacheng . Zhou . Renfei . 2023 . Tight Cell-Probe Lower Bounds for Dynamic Succinct Dictionaries . cs.DS . 2306.02253.
  4. Web site: Nadis . Steve . 8 February 2024 . Scientists Find Optimal Balance of Data Storage and Time . Quanta Magazine.
  5. Pagh . Rasmus . 1998-01-28 . Low Redundancy in Dictionaries with O(1) Worst Case Lookup Time . BRICS Report Series . 5 . 28 . 10.7146/brics.v5i28.19434 . 1601-5355. free .
  6. Pagh . Rasmus . January 2001 . Low Redundancy in Static Dictionaries with Constant Query Time . SIAM Journal on Computing . 31 . 2 . 353–363 . 10.1137/s0097539700369909 . 0097-5397.
  7. Book: Patrascu, Mihai . 2008 49th Annual IEEE Symposium on Foundations of Computer Science . Succincter . October 2008 . 305–313 . http://dx.doi.org/10.1109/focs.2008.83 . IEEE . 10.1109/focs.2008.83. 978-0-7695-3436-7 . 257721481 .
  8. Book: Yu, Huacheng . Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing . Nearly optimal static Las Vegas succinct dictionary . 2020-06-22 . http://dx.doi.org/10.1145/3357713.3384274 . 1389–1401 . New York, NY, USA . ACM . 10.1145/3357713.3384274. 1911.01348 . 9781450369794 . 207780523 .
  9. Web site: Hash, displace, and compress. Belazzougui. Djamal.

[10] [11] [12] [13] [14] [15] [16] [17] [18]