Wednesday, June 10, 2009

35x Improved T-SQL LevenShtein Distance Algorithm...at a cost

At work, we noticed a considerable performance hit using a T-SQL implementation of the Levenshtein algorithm created by Michael Gilleland found here. It seems to me that his approach was to replicate the C-code algorithm found in Wikipedia in T-SQL rather than taking a step back and re-conceptualizing the algorithm from a T-SQL standpoint. Feel free to correct me, but it seems that the algorithm focues on three things to find the shortest distance between two strings (what it takes to make them the same):
  1. How many character insertions are necessary
  2. How many character deletions are necessary
  3. How many character substitutions are necessary
These three traits are necessary to accommodate mis-aligned strings because of typos. If the algorithm is not designed with these constraints, one missing letter will result in inaccurate reporting because it throws the indexing off. The C-code example in Wikipedia uses a 2-D matrix approach, which is a natural, compact, and performant fit for C/C++, or probably any other modern programming language. However, in SQL, especially T-SQL, stuffing all these characters into one huge string and doing sub string comparisons is very costly.

More than that, a true Levenshtein function should return the numeric distance between the two strings, not a string value of this distance. Keeping this in mind, I started editing the algorithm to understand what was going on at a low level but quickly found that I would just be better off re-architecting the whole solution, not just to make it performant, but so that I would understand it completely. I followed the above 3 traits with some small additions for performance. I wanted to do the following:
  1. If either string is empty, return the length of the other string
  2. Find the longest string, and loop on that length.
  3. Compare each character and see if they are equal
  4. If they are not, then increment the difference counter
  5. See if an insertion will re-align the strings; increment the right index forward if true
  6. Else, see if a deletion will re-align the string; increment the left index forward if true
  7. Return the number of differences
Seems pretty simple when you think about it... I now have one 'WHILE' loop with at most 4 substring() calls per character.

This new algorithm comes at a cost of not checking for insertions/deletions along the entire length of the comparison string, so we can now only handle one insertion/deletion consecutively except at the end of the comparison string. This, in effect, makes it no longer follow the Levenshtein algorithm's end capabilities, but this is an acceptable cost for our needs.

So, how much does this help my performance?
  Test Case: I was matching first and last names on a table with over 5,500 users and using Mr. Gilleland's approach with a user that was towards the end of the table, it took ~28 seconds. I optimized that call down to 14 seconds (I was using the function twice in the proc). Still, 14 seconds is a heck of a long time.


Original Query: 28 seconds
Optimized Original Query (removed a second Levenshtein function call, using lengths as a predeterminate instead): 14 seconds

My 'optimized' Levenshtein Algorithm using the Optimized Original Query: With the same table and same user with my 'improved' algorithm, I could now perform the same query in ~400 ms. That's a 35x improvement! All of these tests were run with MSSQL 2005.

Here's the code:
/****** Object: UserDefinedFunction [dbo].[LEVENSHTEIN] Script Date: 06/10/2009 09:36:44 ******/ 
SET ANSI_NULLS ON GO 
SET QUOTED_IDENTIFIER ON GO  
ALTER function [dbo].[LEVENSHTEIN]( @left varchar(100), @right varchar(100) ) 
   returns int 
as 
BEGIN 
   DECLARE @difference int, @lenRight int, @lenLeft int, @leftIndex int, @rightIndex int,   @left_char char(1), @right_char char(1), @compareLength int  
   SET @lenLeft = LEN(@left) 
   SET @lenRight = LEN(@right) 
   SET @difference = 0  
   If @lenLeft = 0 
   BEGIN 
      SET @difference = @lenRight GOTO done 
   END 
   If @lenRight = 0 
   BEGIN 
      SET @difference = @lenLeft 
      GOTO done 
   END  
   GOTO comparison  

   comparison: 
   IF (@lenLeft >= @lenRight) 
      SET @compareLength = @lenLeft 
   Else 
      SET @compareLength = @lenRight  
   SET @rightIndex = 1 
   SET @leftIndex = 1 
   WHILE @leftIndex <= @compareLength 
   BEGIN 
      SET @left_char = substring(@left, @leftIndex, 1)
      SET @right_char = substring(@right, @rightIndex, 1)
      IF @left_char <> @right_char 
      BEGIN -- Would an insertion make them re-align? 
         IF(@left_char = substring(@right, @rightIndex+1, 1))    
            SET @rightIndex = @rightIndex + 1 
         -- Would an deletion make them re-align? 
         ELSE 
            IF(substring(@left, @leftIndex+1, 1) = @right_char)
               SET @leftIndex = @leftIndex + 1
               SET @difference = @difference + 1 
      END 
      SET @leftIndex = @leftIndex + 1 
      SET @rightIndex = @rightIndex + 1 
   END  
   GOTO done  

   done: 
      RETURN @difference 
END

11 comments:

  1. I have since edited this post to more correctly reflect the cost associated with the performance of my algorithm as it is not as powerful as the original Levenshtein algorithm. However, it does handle most real-world spelling errors that an admin would want to tolerate.

    ReplyDelete
  2. Hi Brent, can you show us a usage? I'd like to be able to query all records where a given field is less than the a given Levenshtein distance from a given string.

    Seems to me that if we know the maximum allowable distance, we can optimize further, ie, throw out a lot of records without doing the full calculation.

    ReplyDelete
    Replies
    1. I know it's been forever since you posted this reply Matt, but I was looking at this again today and I agree. If you add an argument to the function for the max difference allowed, you can short circuit the loop in many cases and it should speed things up even further.

      I would hazard a guess between 30 - 50% faster for a max difference of say 3 with an average string length of 10 or more. It would really help to stabilize the average overall speed with strings that could be much longer, too.

      I should have thought of that a long time ago (and you already did)!

      Delete
  3. Hi Brent,

    If you combine the info you find on the following websites, you can speed that up even further. From 45 seconds to 1 ms I kid you not, and you will have the full levenshtein functionality!

    http://www.stev.org/post/2011/02/05/MSSQL-Levenshtein.aspx

    and

    http://www.asp.net/data-access/tutorials/creating-stored-procedures-and-user-defined-functions-with-managed-code-cs

    From the first website I altered the function so it can be used straight away following step 13 from the second website.

    Here is the code:

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;

    public partial class StoredFunctions
    {
    [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = false)]
    public static SqlDouble Levenshtein(SqlString S1, SqlString S2)
    {
    if (S1.IsNull || S2.IsNull)
    throw (new ArgumentNullException());

    String S3 = S1.Value.ToUpper();
    String S4 = S2.Value.ToUpper();

    int n = S3.Length;
    int m = S4.Length;

    int[,] d = new int[n + 1, m + 1];
    int cost = 0;

    if (n + m == 0) {
    return 100;
    } else if (n == 0) {
    return 0;
    } else if (m == 0) {
    return 0;
    }

    for (int i = 0; i <= n; i++)
    d[i, 0] = i;

    for (int j = 0; j <= m; j++)
    d[0, j] = j;

    for (int i = 1; i <= n; i++)
    {
    for (int j = 1; j <= m; j++)
    {
    if (S3[i - 1] == S4[j - 1])
    cost = 0;
    else
    cost = 1;

    d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
    }
    }

    double percentage = System.Math.Round((1.0 - ((double)d[n, m]/(double)System.Math.Max(n,m))) * 100.0,2);

    return percentage;
    }
    };

    Cheers,
    Tom

    ReplyDelete
  4. A very interesting approach, Tom, and it appears to be much faster, and, of course, more powerful b/c of your C#/.NET approach.

    For the case where a user can build and use this code from a referenced dll in their SQL Server instance, this is definitely better.

    As an FYI for my approach, referencing a .NET dll from our SQL server instance was not feasible (as crazy as that may sound), so I didn't even bother traversing this path.

    So, if the software absolutely must go pure SQL and you don't mind the cost, my solution should work, but I pity your deployment environment if that is the case (as I pity my old deployment environment - switched jobs since then). :)

    ReplyDelete
  5. I do like the SQL approach though C# in a CLR will kick the crap out of it performance wise as soon as you encounter any kind of "loop" in SQL

    The Stev.Org Guy :)

    ReplyDelete
  6. Wow, this is amazing. I am able to use this to search our inventory of 150k products in under 1 second. Awesome :)

    ReplyDelete
  7. Thank you for sharing this function! It has worked with 100% reliability for my specific purpose, and at a better-than-expected speed. I also am in a SQL environment that does not support CLRs, so this was a PERFECT solution. Thanks again!

    ReplyDelete
  8. Thanks a lot man. This will help me out a lot.

    I just performance tested your TSQL implementation with 6 others and yours came out on top with 1000 calls in 420ms.

    The only reason I won't be using your implementation is that I want to use the full implementation of the algorithm.

    Amazing job sir

    ReplyDelete
  9. Found a bug:
    SELECT [dbo].[LEVENSHTEIN] (
    'moo'
    ,'mo77o')

    returns 3, should return 2 (delete 2 numeric characters from the 2nd line to get the 1st line)

    ReplyDelete
    Replies
    1. I wouldn't classify this as a bug b/c my solution has many limitations from a full-blown Levenshtein implementation. However, it is worth noting that this is one of them b/c it doesn't look at the whole string to figure out removals/additions, but one character at a time.

      So, the first 7 becomes an 'o', then the second '7' is removed, then the last 'o' is removed = 3 changes.

      Thanks for the note!

      Delete