Gestalt pattern matching, also Ratcliff/Obershelp pattern recognition, is a string-matching algorithm for determining the similarity of two strings. It was developed in 1983 by John W. Ratcliff and John A. Obershelp and published in the Dr. Dobb's Journal in July 1988.[1]
The similarity of two strings
S1
S2
Km
Dro=
2Km | |
|S1|+|S2| |
0\leqDro\leq1
S1 | W | I | K | I | M | E | D | I | A |
---|---|---|---|---|---|---|---|---|---|
S2 | W | I | K | I | M | A | N | I | A |
WIKIM
(light grey) with 5 characters. There is no further substring on the left. The non-matching substrings on the right side are EDIA
and ANIA
. They again have a longest common substring IA
(dark gray) with length 2.The similarity metric is determined by:2Km | |
|S1|+|S2| |
=
2 ⋅ (|“WIKIM”|+|“IA”|) | |
|S1|+|S2| |
=
2 ⋅ (5+2) | |
9+9 |
=
14 | |
18 |
=0.\overline{7}
The Ratcliff/Obershelp matching characters can be substantially different from each longest common subsequence of the given strings. For example
S1=q ccccc r ddd s bbbb t eee u
S2=v ddd w bbbb x eee y ccccc z
ccccc
Km=5
S1
S2
(ddd) (bbbb) (eee)
10
The execution time of the algorithm is
O(n3)
O(n2)
The Python library implementation of the gestalt pattern matching algorithm is not commutative:[4]
Dro(S1,S2) ≠ Dro(S2,S1).
S1=GESTALTPATTERNMATCHING
S2=GESTALTPRACTICE
Dro(S1,S2)
24 | |
40 |
GESTALT P
, A
, T
, E
and forDro(S2,S1)
26 | |
40 |
GESTALT P
, R
, A
, C
, I
.The Python difflib
library, which was introduced in version 2.1,[5] implements a similar algorithm that predates the Ratcliff-Obershelp algorithm. Due to the unfavourable runtime behaviour of this similarity metric, three methods have been implemented. Two of them return an upper bound in a faster execution time. The fastest variant only compares the length of the two substrings:[6]
Drqr=
2 ⋅ min(|S1|,|S2|) | |
|S1|+|S2| |
The second upper bound calculates twice the sum of all used characters
S1
S2
Dqr=
2 ⋅ |\{\vertS1\vert\ | |
\cap |
\{\vertS2\vert\} |}{|S1|+|S2|}
import collections
def quick_ratio(s1: str, s2: str) -> float: """Return an upper bound on ratio relatively quickly.""" length = len(s1) + len(s2)
if not length: return 1.0
intersect = (collections.Counter(s1) & collections.Counter(s2)) matches = sum(intersect.values) return 2.0 * matches / length
Trivially the following applies:
0\leqDro\leqDqr\leqDrqr\leq1
0\leqKm\leq|\{\vertS1\vert\}\cap\{\vertS2\vert\} |\leqmin(|S1|,|S2|)\leq
|S1|+|S2| | |
2 |