Introduction:

FuzzyFinder is a popular feature available in decent editors to open files. The idea is to start typing partial strings from the full path and the list of suggestions will be narrowed down to match the desired file. 

Examples: 

Vim (Ctrl-P)

Sublime Text (Cmd-P)

This is an extremely useful feature and it’s quite easy to implement.

Problem Statement:

We have a collection of strings (filenames). We’re trying to filter down that collection based on user input. The user input can be partial strings from the filename. Let’s walk this through with an example. Here is a collection of filenames:

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters

[Show hidden characters]({{ revealButtonHref }})

»> collection = [‘django_migrations.py’,
‘django_admin_log.py’,
‘main_generator.py’,
‘migrations.py’,
‘api_user.doc’,
‘user_group.doc’,
‘accounts.txt’,
]

view raw file_list.py hosted with ❤ by GitHub

When the user types ‘djm’ we are supposed to match ‘django_migrations.py’ and ‘django_admin_log.py’. The simplest route to achieve this is to use regular expressions. 

Solutions:

Naive Regex Matching:

Convert ‘djm’ into ’d.*j.*m' and try to match this regex against every item in the list. Items that match are the possible candidates.

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters

[Show hidden characters]({{ revealButtonHref }})

»> import re # regex module from standard library.
»> def fuzzyfinder(user_input, collection):
suggestions = []
pattern = ‘.*'.join(user_input) # Converts ‘djm’ to ’d.*j.*m’
regex = re.compile(pattern) # Compiles a regex.
for item in collection:
match = regex.search(item) # Checks if the current item matches the regex.
if match:
suggestions.append(item)
return suggestions
»> print fuzzyfinder(‘djm’, collection)
[‘django_migrations.py’, ‘django_admin_log.py’]
»> print fuzzyfinder(‘mig’, collection)
[‘django_migrations.py’, ‘django_admin_log.py’, ‘main_generator.py’, ‘migrations.py’]

view raw naive_regex.py hosted with ❤ by GitHub

This got us the desired results for input ‘djm’. But the suggestions are not ranked in any particular order.

In fact, for the second example with user input ‘mig’ the best possible suggestion ‘migrations.py’ was listed as the last item in the result.

Ranking based on match position:

We can rank the results based on the position of the first occurrence of the matching character. For user input ‘mig’ the position of the matching characters are as follows:

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters

[Show hidden characters]({{ revealButtonHref }})

‘main_generator.py’ - 0
‘migrations.py’ - 0
‘django_migrations.py’ - 7
‘django_admin_log.py’ - 9

view raw position_of_match.py hosted with ❤ by GitHub

Here’s the code:

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters

[Show hidden characters]({{ revealButtonHref }})

»> import re # regex module from standard library.
»> def fuzzyfinder(user_input, collection):
suggestions = []
pattern = ‘.*'.join(user_input) # Converts ‘djm’ to ’d.*j.*m’
regex = re.compile(pattern) # Compiles a regex.
for item in collection:
match = regex.search(item) # Checks if the current item matches the regex.
if match:
suggestions.append((match.start(), item))
return [x for _, x in sorted(suggestions)]
»> print fuzzyfinder(‘mig’, collection)
[‘main_generator.py’, ‘migrations.py’, ‘django_migrations.py’, ‘django_admin_log.py’]

view raw ranked_by_matching_pos.py hosted with ❤ by GitHub

We made the list of suggestions to be tuples where the first item is the position of the match and second item is the matching filename. When this list is sorted, python will sort them based on the first item in tuple and use the second item as a tie breaker. On line 14 we use a list comprehension to iterate over the sorted list of tuples and extract just the second item which is the file name we’re interested in.

This got us close to the end result, but as shown in the example, it’s not perfect. We see ‘main_generator.py’ as the first suggestion, but the user wanted ‘migration.py’.

Ranking based on compact match:

When a user starts typing a partial string they will continue to type consecutive letters in a effort to find the exact match. When someone types ‘mig’ they are looking for ‘migrations.py’ or ‘django_migrations.py’ not ‘main_generator.py’. The key here is to find the most compact match for the user input.

Once again this is trivial to do in python. When we match a string against a regular expression, the matched string is stored in the match.group(). 

For example, if the input is ‘mig’, the matching group from the ‘collection’ defined earlier is as follows:

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters

[Show hidden characters]({{ revealButtonHref }})

regex = ‘(m.*i.*g)’
‘main_generator.py’ -> ‘main_g’
‘migrations.py’ -> ‘mig’
‘django_migrations.py’ -> ‘mig’
‘django_admin_log.py’ -> ‘min_log’

view raw match_group.py hosted with ❤ by GitHub

We can use the length of the captured group as our primary rank and use the starting position as our secondary rank. To do that we add the len(match.group()) as the first item in the tuple, match.start() as the second item in the tuple and the filename itself as the third item in the tuple. Python will sort this list based on first item in the tuple (primary rank), second item as tie-breaker (secondary rank) and the third item as the fall back tie-breaker. 

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters

[Show hidden characters]({{ revealButtonHref }})

»> import re # regex module from standard library.
»> def fuzzyfinder(user_input, collection):
suggestions = []
pattern = ‘.*'.join(user_input) # Converts ‘djm’ to ’d.*j.*m’
regex = re.compile(pattern) # Compiles a regex.
for item in collection:
match = regex.search(item) # Checks if the current item matches the regex.
if match:
suggestions.append((len(match.group()), match.start(), item))
return [x for _, _, x in sorted(suggestions)]
»> print fuzzyfinder(‘mig’, collection)
[‘migrations.py’, ‘django_migrations.py’, ‘main_generator.py’, ‘django_admin_log.py’]

view raw Compactness_ranking.py hosted with ❤ by GitHub

This produces the desired behavior for our input. We’re not quite done yet.

Non-Greedy Matching

There is one more subtle corner case that was caught by Daniel Rocco. Consider these two items in the collection [‘api_user’, ‘user_group’]. When you enter the word ‘user’ the ideal suggestion should be [‘user_group’, ‘api_user’]. But the actual result is:

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters

[Show hidden characters]({{ revealButtonHref }})

»> print fuzzyfinder(‘user’, collection)
[‘api_user.doc’, ‘user_group.doc’]

view raw corner_case.py hosted with ❤ by GitHub

Looking at this output, you’ll notice that api_user appears before user_group. Digging in a little, it turns out the search user expands to u.*s.*e.*r; notice that user_group has two rs, so the pattern matches user_gr instead of the expected user. The longer match length forces the ranking of this match down, which again seems counterintuitive. This is easy to change by using the non-greedy version of the regex (.*? instead of .*) on line 4. 

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters

[Show hidden characters]({{ revealButtonHref }})

»> import re # regex module from standard library.
»> def fuzzyfinder(user_input, collection):
suggestions = []
pattern = ‘.*?'.join(user_input) # Converts ‘djm’ to ’d.*?j.*?m’
regex = re.compile(pattern) # Compiles a regex.
for item in collection:
match = regex.search(item) # Checks if the current item matches the regex.
if match:
suggestions.append((len(match.group()), match.start(), item))
return [x for _, _, x in sorted(suggestions)]
»> fuzzyfinder(‘user’, collection)
[‘user_group.doc’, ‘api_user.doc’]
»> print fuzzyfinder(‘mig’, collection)
[‘migrations.py’, ‘django_migrations.py’, ‘main_generator.py’, ‘django_admin_log.py’]

view raw non_greedy_matching.py hosted with ❤ by GitHub

Now that works for all the cases we’ve outlines. We’ve just implemented a fuzzy finder in 10 lines of code.

Conclusion:

That was the design process for implementing fuzzy matching for my side project pgcli, which is a repl for Postgresql that can do auto-completion. 

I’ve extracted fuzzyfinder into a stand-alone python package. You can install it via ‘pip install fuzzyfinder’ and use it in your projects.

Thanks to Micah Zoltu and Daniel Rocco for reviewing the algorithm and fixing the corner cases.

If you found this interesting, you should follow me on twitter

Epilogue:

When I first started looking into fuzzy matching in python, I encountered this excellent library called fuzzywuzzy. But the fuzzy matching done by that library is a different kind. It uses levenshtein distance to find the closest matching string from a collection. Which is a great technique for auto-correction against spelling errors but it doesn’t produce the desired results for matching long names from partial sub-strings.