Levenshtein distance

In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. It is named after Soviet mathematician Vladimir Levenshtein, who defined the metric in 1965.^[1]

Levenshtein distance may also be referred to as edit distance, although that term may also denote a larger family of distance metrics known collectively as edit distance.^[2]^: 32 It is closely related to pairwise string alignments.

It is at least the absolute value of the difference of the sizes of the two strings.

It is at most the length of the longer string.

It is zero if and only if the strings are equal.

If the strings have the same size, the is an upper bound on the Levenshtein distance. The Hamming distance is the number of positions at which the corresponding symbols in the two strings are different.

Hamming distance

The Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string ().

triangle inequality

Applications[edit]

In approximate string matching, the objective is to find matches for short strings in many longer texts, in situations where a small number of differences is to be expected. The short strings could come from a dictionary, for instance. Here, one of the strings is typically short, while the other is arbitrarily long. This has a wide range of applications, for instance, spell checkers, correction systems for optical character recognition, and software to assist natural-language translation based on translation memory.

The Levenshtein distance can also be computed between two longer strings, but the cost to compute it, which is roughly proportional to the product of the two string lengths, makes this impractical. Thus, when used to aid in fuzzy string searching in applications such as record linkage, the compared strings are usually short to help improve speed of comparisons.

In linguistics, the Levenshtein distance is used as a metric to quantify the linguistic distance, or how different two languages are from one another.^[3] It is related to mutual intelligibility: the higher the linguistic distance, the lower the mutual intelligibility, and the lower the linguistic distance, the higher the mutual intelligibility.

the allows the transposition of two adjacent characters alongside insertion, deletion, substitution;

Damerau–Levenshtein distance

the (LCS) distance allows only insertion and deletion, not substitution;

longest common subsequence

the allows only substitution, hence, it only applies to strings of the same length.

Hamming distance

the allows only transposition.

Jaro distance

There are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. For instance,

Edit distance is usually defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.

Computation[edit]

Recursive[edit]

This is a straightforward, but inefficient, recursive Haskell implementation of a lDistance function that takes two strings, s and t, together with their lengths, and returns the Levenshtein distance between them:

Black, Paul E., ed. (14 August 2008), "Levenshtein distance", , U.S. National Institute of Standards and Technology, retrieved 2 November 2016

Levenshtein distance

Hamming distance

triangle inequality

Applications[edit]

Damerau–Levenshtein distance

longest common subsequence

Hamming distance

Jaro distance

Computation[edit]

Recursive[edit]

Dictionary of Algorithms and Data Structures [online]

Rosetta Code implementations of Levenshtein distance