- How many character insertions are necessary
- How many character deletions are necessary
- How many character substitutions are necessary
More than that, a true Levenshtein function should return the numeric distance between the two strings, not a string value of this distance. Keeping this in mind, I started editing the algorithm to understand what was going on at a low level but quickly found that I would just be better off re-architecting the whole solution, not just to make it performant, but so that I would understand it completely. I followed the above 3 traits with some small additions for performance. I wanted to do the following:
- If either string is empty, return the length of the other string
- Find the longest string, and loop on that length.
- Compare each character and see if they are equal
- If they are not, then increment the difference counter
- See if an insertion will re-align the strings; increment the right index forward if true
- Else, see if a deletion will re-align the string; increment the left index forward if true
- Return the number of differences
So, how much does this help my performance?
Test Case: I was matching first and last names on a table with over 5,500 users and using Mr. Gilleland's approach with a user that was towards the end of the table, it took ~28 seconds. I optimized that call down to 14 seconds (I was using the function twice in the proc). Still, 14 seconds is a heck of a long time.
Original Query: 28 seconds
Optimized Original Query (removed a second Levenshtein function call, using lengths as a predeterminate instead): 14 seconds
My 'optimized' Levenshtein Algorithm using the Optimized Original Query: With the same table and same user with my 'improved' algorithm, I could now perform the same query in ~400 ms. That's a 35x improvement! All of these tests were run with MSSQL 2005.
Here's the code:
/****** Object: UserDefinedFunction [dbo].[LEVENSHTEIN] Script Date: 06/10/2009 09:36:44 ******/ SET ANSI_NULLS ON GO SET QUOTED_IDENTIFIER ON GO ALTER function [dbo].[LEVENSHTEIN]( @left varchar(100), @right varchar(100) ) returns int as BEGIN DECLARE @difference int, @lenRight int, @lenLeft int, @leftIndex int, @rightIndex int, @left_char char(1), @right_char char(1), @compareLength int SET @lenLeft = LEN(@left) SET @lenRight = LEN(@right) SET @difference = 0 If @lenLeft = 0 BEGIN SET @difference = @lenRight GOTO done END If @lenRight = 0 BEGIN SET @difference = @lenLeft GOTO done END GOTO comparison comparison: IF (@lenLeft >= @lenRight) SET @compareLength = @lenLeft Else SET @compareLength = @lenRight SET @rightIndex = 1 SET @leftIndex = 1 WHILE @leftIndex <= @compareLength BEGIN SET @left_char = substring(@left, @leftIndex, 1) SET @right_char = substring(@right, @rightIndex, 1) IF @left_char <> @right_char BEGIN -- Would an insertion make them re-align? IF(@left_char = substring(@right, @rightIndex+1, 1)) SET @rightIndex = @rightIndex + 1 -- Would an deletion make them re-align? ELSE IF(substring(@left, @leftIndex+1, 1) = @right_char) SET @leftIndex = @leftIndex + 1 SET @difference = @difference + 1 END SET @leftIndex = @leftIndex + 1 SET @rightIndex = @rightIndex + 1 END GOTO done done: RETURN @difference END