Most of us are familiar with the idea that there are several different methods of conducting an internet search: you can enter the exact keywords or keyword string that you're looking for, or you can enter something that sounds kind of like what you're looking for, but isn't exactly correct. The latter is an example of a fuzzy search.
How do fuzzy searches work? Is it possible to do a fuzzy search and find the information that you want? The answer to the latter question is yes, but it's still important to know and understand how a fuzzy search works.
What Exactly is a Fuzzy Search?
The key thing to note about a fuzzy search is the word "approximate." Rather than search for the terms that you know will pull up your desired results, you instead look for something that sounds approximately like them.
However, this isn't always done by accident. Sometimes, people purposely do fuzzy searches in order to find the information that they're looking for, usually because they can't be sure how a term was spelled or recorded. The best way to explain both scenarios is with a few examples.
Accidental Fuzzy Searches
Sometimes you type too fast and end up executing a fuzzy search by accident.
For example, you want to look for "restaurants near me" but misspell the words and instead type in "restrants near me."
Google and other search engines will recognize the misspelled word and provide you with the results you want. You'll even see Google's response under the search bar and above the results, pointing out the misspelling.
It will say something like, "Did you mean: restaurants near me?" and then produce the results you meant to search for. In this example, Google is using a fuzzy search algorithm to determine what you really want to find.
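As a rough illustration of the "Did you mean" behavior described above, here is a minimal sketch using Python's standard `difflib` module. It is not how Google does it; the word list `KNOWN_TERMS`, the function name `did_you_mean`, and the `cutoff` value are all assumptions made up for this example.

```python
from difflib import get_close_matches

# Hypothetical vocabulary of terms the search engine already knows.
KNOWN_TERMS = ["restaurants", "restrooms", "rentals", "residents", "near", "me"]


def did_you_mean(query, vocabulary=KNOWN_TERMS, cutoff=0.75):
    """Return a corrected query, or None if no word needed correcting."""
    words = query.lower().split()
    corrected = []
    for word in words:
        if word in vocabulary:
            corrected.append(word)  # spelled correctly, keep as-is
            continue
        # Ask difflib for the single closest known term above the cutoff.
        matches = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected) if corrected != words else None


suggestion = did_you_mean("restrants near me")
if suggestion:
    print(f"Did you mean: {suggestion}?")
```

A higher `cutoff` makes the sketch more conservative about offering corrections, while a lower one makes it more willing to guess.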
Purposeful Fuzzy Searches
On the other hand, sometimes fuzzy searches are done on purpose. Genealogists, in particular, end up doing them on the extensive databases designed to hold everything from birth and death records to past census information.
These databases are transcribed by humans or computers, and neither is completely accurate. It can be tricky to read old-fashioned handwriting and documents marred by time, so fuzzy searches are often the best way to find and retrieve information.
When someone goes into one of these genealogy databases to search, they have two main choices: an exact match search, which limits the information considerably, or a fuzzy "sounds like" or "approximate" search.
Choosing the fuzzy option lets the database look for names and dates that are close to the search parameters rather than identical to them.
As a result, the odds of being able to find what you're looking for go up quite a bit, as long as you sort through the retrieved records and look at the digital copies of the original documents.
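The difference between the two choices can be sketched in a few lines of Python. Everything here is hypothetical: the `RECORDS` list, the function names, the similarity threshold of 0.8, and the two-year window are assumptions chosen for illustration, not settings from any real genealogy database.

```python
from difflib import SequenceMatcher

# Invented transcribed records: (surname, birth_year).
RECORDS = [
    ("Andersen", 1872),
    ("Anderson", 1871),
    ("Andrews", 1890),
    ("Henderson", 1871),
]


def exact_search(surname, year, records=RECORDS):
    """Return only records that match the surname and year exactly."""
    return [record for record in records if record == (surname, year)]


def fuzzy_search(surname, year, records=RECORDS,
                 min_similarity=0.8, year_window=2):
    """Return records whose surname is similar and whose year is close."""
    hits = []
    for name, born in records:
        similarity = SequenceMatcher(None, surname.lower(), name.lower()).ratio()
        if similarity >= min_similarity and abs(born - year) <= year_window:
            hits.append((name, born))
    return hits


print(exact_search("Andersson", 1870))  # nothing is spelled exactly this way
print(fuzzy_search("Andersson", 1870))  # near spellings within two years
```

The fuzzy version returns more candidates, which is why the retrieved records still need to be checked against the original documents.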
Using Quotation Marks
Many search engine instructions tell you to place the terms that you're looking for (whether they are fuzzy or standard) in quotation marks. Doing this tells the search engine to treat the words as a single phrase, which is especially helpful when specific movie, band, and television show names are involved.
For example, if you were looking for information on the band REO Speedwagon, you'd place the name in quotation marks: "REO Speedwagon" to help the search engine further refine your results.
Why? We'll cover this more in a second, as it involves the metrics and coding used to make fuzzy searches happen.
How Do Fuzzy Searches Work?
Of course, fuzzy searches don't work by magic. Instead, they use several different algorithms to determine the words you meant to search for. Each algorithm measures how similar two pieces of text are, and the search engine uses that measurement to pick the words you most likely meant to use.
These algorithms are called Levenshtein distance, Damerau-Levenshtein distance, and Longest common subsequence (LCS). Let's go over them one by one.
Levenshtein Distance
Levenshtein Distance is one of the most common algorithms behind fuzzy searches of all types. It counts three kinds of single-character edits (insertion, deletion, and substitution) in order to find the best fit for the potential search terms.
It looks for the smallest number of edits needed to turn what was typed into a known term, assuming that the misspelled word or unknown phrase is something close to a typo or minor spelling error.
Here are a few examples of each type of edit:
- Insertion: The user typed BAT but meant to type BOAT. Inserting a single letter, O, produces the intended term, and the user receives results for boats, not bats.
- Deletion: This is the opposite of insertion, where the user typed a longer word than they meant to. For example, they entered COAT but wanted to search for COT. Removing a letter provides them with the results that they want.
- Substitution: Here, a single letter is swapped out, or substituted, for another one, changing the results to something that the user wants. For example, the user typed in COAT but actually wanted to find the COST of something.
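Of course, Levenshtein Distance on its own can't always tell which correction you intended, since BAT is just one edit away from BOAT, BAIT, and BAD alike. Searches that include multiple words give the search engine extra context, which helps further determine what the keywords were supposed to be.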
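The counting of insertions, deletions, and substitutions can be sketched with the standard dynamic-programming approach. This is a minimal illustration rather than the code any particular search engine runs, and the function name `levenshtein` is just a label chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the fewest single-character edits needed to turn a into b."""
    # previous[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


for typed, meant in [("BAT", "BOAT"), ("COAT", "COT"), ("COAT", "COST")]:
    print(typed, "->", meant, "=", levenshtein(typed, meant))
```

Each of the three pairs from the examples above is exactly one edit apart, so a search engine using this measure would treat them as very close matches.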
Damerau-Levenshtein Distance
As the name implies, not only does the Damerau-Levenshtein Distance algorithm include everything that the Levenshtein Distance algorithm does on its own (insertion, deletion, and substitution), but it also has an additional kind of edit: transposition.
This algorithm also assumes that searchers tend to switch up letters in a word accidentally.
Here's a good example:
- Transposition: The user switched two adjacent letters, entering CAOT instead of COAT. The algorithm determines that the center letters were transposed, counts the swap as a single edit, and changes CAOT into the correct term.
As with the Levenshtein Distance algorithm, the Damerau-Levenshtein Distance algorithm works best when a keyword string is entered. However, since it can fix words with transposed letters, it also performs well with a single keyword search term.
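Here is a minimal sketch of how transposition can be added to the same kind of calculation. Note the assumption: this is the restricted variant usually called optimal string alignment, which lets each pair of adjacent letters be swapped once and is simpler than the unrestricted Damerau-Levenshtein algorithm. The function name `osa_distance` is made up for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with insert, delete, substitute, and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition: the last two characters are the same pair, swapped.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("CAOT", "COAT"))
```

With plain Levenshtein Distance, CAOT and COAT are two substitutions apart. Counting the swap as one edit makes the typo look as minor as it really is.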
Longest Common Subsequence (LCS)
The Longest Common Subsequence algorithm works similarly to the Levenshtein Distance algorithm. It looks at the entered words to see what can be inserted or deleted to determine the preferred keyword search terms, but it does not use substitution.
However, unlike the Levenshtein Distance algorithm, which counts the fewest changes needed to turn one term into another, the Longest Common Subsequence algorithm lives up to its name: it looks for the longest sequence of characters that appear in the same order in both terms, even if other characters sit between them. The longer that shared subsequence, the closer the match.
For example, the LCS algorithm would make the following change:
- Looking For Common Strings: The user enters ABCD into the search engine, which also knows the acronym ACBAD. The LCS algorithm finds that the longest sequences of letters appearing in the same order in both terms are three letters long: ABD and ACD. Because three of the four letters typed line up, ACBAD is treated as a close match for the search term.
Again, using a string of words would make the search parameters more accurate, allowing the Longest Common Subsequence algorithm to best do its job.
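A short sketch of the longest common subsequence calculation, using the usual table-filling approach, is shown below. The function names `lcs` and `lcs_distance` are assumptions for this example. When several subsequences tie for longest, this sketch returns just one of them.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest subsequence shared by a and b."""
    # table[i][j] is the LCS length of the first i chars of a and j of b.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back through the table to recover the letters themselves.
    letters = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            letters.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))


def lcs_distance(a: str, b: str) -> int:
    """Count the insertions and deletions needed to turn a into b."""
    return len(a) + len(b) - 2 * len(lcs(a, b))


print(lcs("ABCD", "ACBAD"), lcs_distance("ABCD", "ACBAD"))
```

Because substitution is not available, swapping one letter for another costs two edits here (one deletion plus one insertion), which is the main practical difference from Levenshtein Distance.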
Fuzzy Searches are Imperfect
Since fuzzy searches use an algorithm or two (or three) to determine what the user tried to type in or wants to search for, they are not perfect. While the searches work out properly a good portion of the time, there are those instances where the algorithm sends you in the wrong direction.
In order to make fuzzy searches work for you, there are several different things that you, the searcher, can do, from looking for long-tail keywords instead of single words to using quotation marks around the words that you're searching for.
Both of these methods make it easier for the system to determine what you're looking for, resulting in more accurate results.
On the programming end, developers who put together the algorithms have several options to ensure that their search engines work properly and don't overly frustrate users.
For example, the more single-character edits the algorithm is allowed to make when matching a term, the trickier it will be for it to return relevant results.
If you allow five edits, your users might not find what they're looking for, because almost any short word can be turned into almost any other and your algorithms will send them off in the wrong direction.
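The trade-off above can be sketched by filtering a word list with a maximum edit distance. The vocabulary, the function name `candidates_within`, and the `max_edits` parameter are all hypothetical and exist only to show how a looser limit lets in more unrelated words.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the fewest single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


# Invented list of terms a small search index might contain.
VOCABULARY = ["coat", "cost", "cot", "boat", "bat", "goal", "cattle"]


def candidates_within(term, vocabulary=VOCABULARY, max_edits=2):
    """Return (distance, word) pairs no more than max_edits away, closest first."""
    scored = [(levenshtein(term, word), word) for word in vocabulary]
    return sorted(pair for pair in scored if pair[0] <= max_edits)


for limit in (1, 2, 5):
    print(limit, candidates_within("caot", max_edits=limit))
```

With a tight limit only the nearest spellings survive. With a limit of five, every word in this small list counts as a match, which is exactly the kind of noise that frustrates users.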
In general, a maximum of two edits per term is a widely used limit when setting up fuzzy searches.
Overall, fuzzy searches can be a great thing, as they make it more likely that a user will find what they're looking for. As long as the search algorithms are tuned sensibly, they will help far more often than they hurt.