=============================================
12. Save the solution. Run your package to test it. Check the results of the Fuzzy Lookup
transformation. Check for rows for which the transformation didn’t find a match, and
for incorrect matches. Use the following code.
-- Not matched
SELECT * FROM FuzzyMatchingResults
WHERE CleanCustomerKey IS NULL;
-- Incorrect matches
SELECT * FROM FuzzyMatchingResults
WHERE CleanCustomerKey <>
CustomerKey
============================================I am OK with the 1st query. But I had a hard time to understand the 2nd query as the result of "Incorrect matches". So I decide to dig it into as below:
Let's start with the control flow and the Execute T-SQL Statement Task:
This is the data flow task:
The CustomerDirty table above was built on the CustomerClean
table. So let’s start with the CustomerClean table:
/**** Return 1849 rows
CustomerKey
FullName
StreetAddress
11000
Jon
Yang
3761 N. 14th St
11010
Jacquelyn Suarez
7800
Corrinne Court
11020
Jordan
King
7156 Rose Dr.
........
*/
SELECT * FROM dbo.CustomersClean
In Step 1, the dbo.CustomersDirty also contains 1849 ‘dirty’ data
/****
- All CustomerKey are the original key * (-1)
- Some rows on FullName and/or StreetAddress has been modified based on some random sampling logic.
- If the Updated column for a row is a value 3, it means that the row been modified 3 times.
- CleanCustomerKey is to populate the table with the customer key from the clean table after identity mapping
CustomerKey
FullName
StreetAddress
Updated CleanCustomerKey
-29260
Ricy
ywrlson
317zb7 yuqer Way
4
NULL
-29250
Anronyf
Williams
4c51 fwow
Rd.
3
NULL
-29240
Madeline
Allen
7207 St. Andrews Way
0
NULL
……………………
*/
Step 2- Using DQS Cleansing
Transformation to clean for the StreetAddress data against a DQS KB
Step 3- Using Lookup to do an
exact match again the dbo.CustomersClean table
on the FullName and StreetAddress columns, the CutomerKey in the clean table
(the lookup table) replaces the CleanCustomerKey in the input table (CustomerDirty)
with a name “CleanCustomerKey”. Additionally, the other two columns in the lookup
table are outputted as well. The other column in the input table are passed
through. Total 1002 rows are exactly matched on FullName and StreetAddress.
Step 4 and Step 5: These matched
rows are inserted in the CustomersDirtyMatch table
Step 6 – 847 rows are not
exactly matched.
Step 7 – These 847 rows are
inserted in the CustomersDirtyNoMatch table
Step 8 – These 847 rows missing
the exact match are undergone Fuzzy Lookup again the clean table on the
FullName and StreetAddress columns.
Similar to the exact lookup, all columns
except for the CleanCustomerKey in the input table are passed through. The
CleanCustomerKey was replaced with the CustomerKey in the lookup table with the
same name as CleanCustomerKey. The other two columns (FullName ad
StreetAddress) in the lookup table are
outputted as well.
Different from the Lookup, Fuzzy Lookup adds additional
columns on similarity and confidence. The values on similarity and confidence
vary based on the selection of the matching threshold. If the rows do not meet
the minimum threshold for fuzzy loop, they are still included in the output,
but with NULL as the CleanCustomerKey or 0 for similarity or confidence. In
other words, all rows are passed through. If there is a fuzzy matching, the CleanCustomerKey
is not NULL, and the similarity and confidence columns are not 0 either. In contrary,
the rows with NULL as the CleanCustomerKey are missing the fuzzy matching.
Step 9: Union all of the
1849 rows from two sources (in three types)
- Exactly matched (1002 rows),
- Not exactly matched (847 rows)
- Fuzzy matched (varies depending on threshold),
- Fuzzy not-matched (the same as above).
The number of the rows in each
category varies depending on the threshold. For example, for threshold 0.9, 176
rows for fuzzy matched and 671 rows for fuzzy no-match. For threshold 0.5,
fuzzy match=740 rows, fuzzy no match = 107 rows
You need to use the T-SQL
queries to find out the number of rows in each category:
-- Return the exact matches
-- Exactly matched rows have no updates and no value for similarity
(it's for fuzzy lookup)
-- The CustomerKey=-11010 one is manually updated for testing the
DQS Cleansing transformation.
-- Its address was corrected based on term-based relation, so it
is considered as exact match with updated =3
SELECT * FROM
FuzzyMatchingResults
WHERE (Updated=0 AND [_Similarity] IS NULL) OR CustomerKey=-11010
-- Return Fuzzy Matched rows
-- Fuzzy match has a value for similarity
SELECT * FROM
FuzzyMatchingResults
WHERE
[_Similarity]!=0
-- Return not fuzzy match
SELECT * FROM
FuzzyMatchingResults
WHERE
CleanCustomerKey IS NULL
So the conclusion is that the 2nd query in the book is not right.