Of course, fuzzy searches don't work by magic. Instead, they use several different algorithms to determine the words you meant to search for. The algorithms use different techniques to create a metric that looks for similar words, which then determines the words you meant to use.
These algorithms are called Levenshtein distance, Damerau-Levenshtein distance, and Longest common subsequence (LCS). Let's go over them one by one.
Levenshtein Distance is one of the first algorithms used by search engines of all types. This algorithm is designed to look at three different metrics, insertion, deletion, and substitution, in order to find the best fit for the potential search terms.
It looks at the smallest possible changes that can be made, assuming that the misspelled word or unknown phrase is something close to a typo or minor grammatical error.
Here are a few examples of each type of metric:
Insertion: The user typed BAT but meant to type BOAT. The search term becomes clear, and the user receives results for boats, not bats.
Deletion: The opposite of insertion, the user typed a longer word than they meant to. For example, they entered COAT but wanted to search for COT. Removing a letter provides them with the results that they want.
Substitution: In this metric, a single letter is swapped out, or substituted, for another one, changing the results to something that the user wants. For example, the user typed in COAT but actually wanted to find the COST of something.
Of course, in order for the Levenshtein Distance to work properly, the searches need to include multiple words. This helps further determine what the keywords were supposed to be.
As the name implies, not only does the Damerau-Levenshtein Distance algorithm include everything that the Levenshtein Distance algorithm does on its own (the insertion, deletion, and substitution metrics), but it also has an additional metric: transposition.
This algorithm also assumes that searchers tend to switch up letters in a word accidentally.
Here's a good example:
- Transposition: The user switched around a few letters, entering CAOT instead of COAT. The algorithm determined that the center letters were transposed and changed them into the correct term.
As with the Levenshtein Distance algorithm, the Damerau-Levenshtein Distance algorithm works best when a keyword string is entered. However, since it can fix words with transposed letters, it also performs well with a single keyword search term.
Longest Common Subsequence (LCS)
The Longest Common Subsequence algorithm works similarly to the Levenshtein Distance algorithm. It looks at the entered words to see what can be inserted or deleted to determine the preferred keyword search terms.
However, unlike the Levenshtein Distance algorithm, which looks for the least number of changes that need to be made, the Longest Common Subsequence algorithm lives up to its name and looks for the longest subsequence or highest number of changes required to determine the search term.
For example, the LCS algorithm would make the following change:
- Looking For Common Strings: The user enters ABCD into the search engine. While looking for the common strings, the LCS algorithm finds a string that consists of the same three letters, turning the search term into ACBAD, an acronym. The two three-lettered strings are ABC and ACB.
Again, using a string of words would make the search parameters more accurate, allowing the Longest Common Subsequence algorithm to best do its job.