How to Examine the Result of a Fuzzy Lookup?

The motivation for this post is to share my experience and understanding on examining or understanding the result of a fuzzy lookup. The project example is from the exercise projects in Chapter 20 in the book "Training Kit 70-463 Implementing a Data Warehouse with SQL Server 2012 by D. Sarka, M. Lah, and G. Jerkic" . When I reached to step 12 on page 762:

12.  Save the solution. Run your package to test it. Check the results of the Fuzzy Lookup
transformation. Check for rows for which the transformation didn’t find a match, and
for incorrect matches. Use the following code.
-- Not matched
SELECT * FROM FuzzyMatchingResults
WHERE CleanCustomerKey IS NULL;

-- Incorrect matches
SELECT * FROM FuzzyMatchingResults
WHERE CleanCustomerKey <> CustomerKey 
I am OK with the 1st query. But I had a hard time to understand the 2nd query as the result of "Incorrect matches". So I decide to dig it into as below:

Let's start with the control flow and the Execute T-SQL Statement Task:

This is the data flow task:

The CustomerDirty table above was built on the CustomerClean table. So let’s start with the CustomerClean table:

/**** Return 1849 rows

CustomerKey  FullName                                StreetAddress
11000              Jon Yang                                 3761 N. 14th St
11010              Jacquelyn Suarez                    7800 Corrinne Court
11020              Jordan King                            7156 Rose Dr.
SELECT * FROM dbo.CustomersClean

In Step 1, the dbo.CustomersDirty also contains 1849 ‘dirty’ data

  • All CustomerKey are the original key * (-1)
  • Some rows on FullName and/or StreetAddress has been modified based on some random sampling logic.
  • If the Updated column for a row is a value 3, it means that the row been modified 3 times.
  • CleanCustomerKey is to populate the table with the customer key from the clean table after identity mapping 

CustomerKey      FullName                            StreetAddress                    Updated                CleanCustomerKey
-29260                 Ricy ywrlson                      317zb7 yuqer Way            4                            NULL
-29250                 Anronyf Williams             4c51 fwow Rd.                   3                            NULL
-29240                 Madeline Allen                  7207 St. Andrews Way     0                            NULL

SELECT * FROM dbo.CustomersDirty

Step 2- Using DQS Cleansing Transformation to clean for the StreetAddress data against a DQS KB

Step 3- Using Lookup to do an exact match again the dbo.CustomersClean table on the FullName and StreetAddress columns, the CutomerKey in the clean table (the lookup table) replaces the CleanCustomerKey in the input table (CustomerDirty) with a name “CleanCustomerKey”. Additionally, the other two columns in the lookup table are outputted as well. The other column in the input table are passed through. Total 1002 rows are exactly matched on FullName and StreetAddress.

Step 4 and Step 5: These matched rows are inserted in the CustomersDirtyMatch table

Step 6 – 847 rows are not exactly matched.

Step 7 – These 847 rows are inserted in the CustomersDirtyNoMatch table

Step 8 – These 847 rows missing the exact match are undergone Fuzzy Lookup again the clean table on the FullName and StreetAddress columns. 

Similar to the exact lookup, all columns except for the CleanCustomerKey in the input table are passed through. The CleanCustomerKey was replaced with the CustomerKey in the lookup table with the same name as CleanCustomerKey. The other two columns (FullName ad StreetAddress) in the lookup table are outputted as well. 

Different from the Lookup, Fuzzy Lookup adds additional columns on similarity and confidence. The values on similarity and confidence vary based on the selection of the matching threshold. If the rows do not meet the minimum threshold for fuzzy loop, they are still included in the output, but with NULL as the CleanCustomerKey or 0 for similarity or confidence. In other words, all rows are passed through. If there is a fuzzy matching, the CleanCustomerKey is not NULL, and the similarity and confidence columns are not 0 either. In contrary, the rows with NULL as the CleanCustomerKey are missing the fuzzy matching.

Step 9: Union all of the 1849 rows from two sources (in three types)
  • Exactly matched (1002 rows),
  • Not exactly matched (847 rows)
    • Fuzzy matched (varies depending on threshold),
    • Fuzzy not-matched (the same as above).
The number of the rows in each category varies depending on the threshold. For example, for threshold 0.9, 176 rows for fuzzy matched and 671 rows for fuzzy no-match. For threshold 0.5, fuzzy match=740 rows, fuzzy no match = 107 rows

You need to use the T-SQL queries to find out the number of rows in each category:

-- Return the exact matches

-- Exactly matched rows have no updates and no value for similarity (it's for fuzzy lookup)
-- The CustomerKey=-11010 one is manually updated for testing the DQS Cleansing transformation.
-- Its address was corrected based on term-based relation, so it is considered as exact match with updated =3

SELECT * FROM FuzzyMatchingResults
WHERE (Updated=0 AND [_Similarity] IS NULL) OR CustomerKey=-11010

-- Return Fuzzy Matched rows

-- Fuzzy match has a value for similarity
SELECT * FROM FuzzyMatchingResults
WHERE [_Similarity]!=0

-- Return not fuzzy match

SELECT * FROM FuzzyMatchingResults
WHERE CleanCustomerKey IS NULL

  So the conclusion is that the 2nd query in the book is not right.