Project
In this example project, the task was to detect matching records in two different product catalogs.
There are 1082 records in the "Abt" catalog and 1093 records in the "Buy" catalog. It is known that there are 1097 true matching pairs.
Several example solutions are included to demonstrate how varying the parameters influences the end result.
The project also demonstrates how dictionaries (lexemization) can help when fuzzy matching long strings, such as verbose product descriptions.
Input Datasets Import
Left Dataset Schema Definition
The "Abt" product catalog Excel file, containing 1082 records, was registered and imported as the "left" dataset for fuzzy matching. During file registration, it was automatically converted to a .tab text file and then imported into a PostgreSQL database.
Right Dataset Schema Definition
The "Buy" product catalog Excel file, containing 1093 records, was registered as the "right" dataset for fuzzy matching. During registration, it was automatically converted to a .tab text file and then imported into a PostgreSQL database.
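As a rough illustration of the intermediate .tab format, the sketch below parses tab-delimited text into records with Python's standard csv module. The column names and sample rows are hypothetical and are not taken from the actual catalogs. The tool's own conversion and PostgreSQL import steps are not shown here.

```python
import csv
import io

# Hypothetical sample of a tab-delimited export. The column names and rows
# are illustrative assumptions, not the real "Abt" or "Buy" schema.
SAMPLE_TAB = (
    "id\tname\tdescription\n"
    "1\tSony TV\t40-inch LCD television\n"
    "2\tCanon Camera\t12 MP digital camera\n"
)


def read_tab_records(text):
    """Parse tab-delimited text into a list of dicts keyed by header name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


records = read_tab_records(SAMPLE_TAB)
for record in records:
    print(record["id"], record["name"])
```

Each parsed record is a plain dictionary, so it can be handed to whatever database loader is in use.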
Fuzzy Match Solution
Identify Matching Records
We experimented with several definitions of fuzzy matching constraints, which you can explore on your own. In all solutions, applying a dictionary (i.e. lexemization) returned better results.
We were able to find a threshold that picks up all true matches (it is known that there are 1097 true matches).
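One way to think about choosing such a threshold, sketched under the assumption that every candidate pair has a similarity score and that the true pairs are known: the highest threshold that still keeps every true match is the lowest score among the true pairs. The function names, record identifiers and scores below are made up for illustration and do not come from the project.

```python
def highest_threshold_with_full_recall(scored_pairs, true_pairs):
    """Return the largest threshold that still selects every true pair.

    scored_pairs: dict mapping (left_id, right_id) -> similarity score
    true_pairs:   set of (left_id, right_id) pairs known to match
    """
    missing = set(true_pairs).difference(scored_pairs)
    if missing:
        raise ValueError(f"true pairs without a score: {sorted(missing)}")
    return min(scored_pairs[pair] for pair in true_pairs)


def evaluate(scored_pairs, true_pairs, threshold):
    """Return (precision, recall) for pairs scoring at or above threshold."""
    selected = {pair for pair, score in scored_pairs.items() if score >= threshold}
    hits = len(selected & set(true_pairs))
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Made-up scores for illustration only.
scores = {
    ("a1", "b1"): 0.92,
    ("a2", "b2"): 0.61,
    ("a3", "b3"): 0.75,
    ("a1", "b2"): 0.40,
    ("a2", "b3"): 0.65,
}
truth = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}

threshold = highest_threshold_with_full_recall(scores, truth)
precision, recall = evaluate(scores, truth, threshold)
print(threshold, precision, recall)
```

Lowering the threshold far enough to catch every true match usually admits some false matches as well, which is why precision is reported alongside recall.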
Under Solution Constraints / Fuzzy Match Relations, multiple combinations of field pairs are defined, with different weights assigned. Besides the main column pairs, additional, potentially useful combinations of column pairs are also included. Such additional combinations sometimes improve fuzzy matching, but in many cases they are counterproductive.
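The idea of weighted field pairs can be sketched as a weighted average of per-pair similarities. This is a minimal sketch, not the tool's actual scoring: it uses difflib.SequenceMatcher from the Python standard library as a stand-in similarity measure, and the field names, weights and records are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in string similarity in [0, 1]; the real tool may use another measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_score(left, right, relations):
    """Weighted average of similarities over (left_field, right_field, weight) triples."""
    total_weight = sum(weight for _, _, weight in relations)
    if total_weight <= 0:
        raise ValueError("relations must have a positive total weight")
    weighted_sum = sum(
        weight * similarity(left.get(left_field, ""), right.get(right_field, ""))
        for left_field, right_field, weight in relations
    )
    return weighted_sum / total_weight


# Hypothetical relations: one main column pair plus one additional cross-field pair.
RELATIONS = [
    ("name", "name", 3.0),
    ("name", "description", 1.0),
]

left = {"name": "Sony Bravia 40 LCD TV", "description": "Sony Bravia 40 inch LCD television"}
right_match = {"name": "Sony Bravia 40in LCD Television", "description": "Sony Bravia 40in LCD TV, black"}
right_other = {"name": "Canon PowerShot Camera", "description": "Canon 12 MP compact digital camera"}

match_score = weighted_score(left, right_match, RELATIONS)
other_score = weighted_score(left, right_other, RELATIONS)
print(round(match_score, 3), round(other_score, 3))
```

Setting the weight of an additional pair to zero, or removing it from the list, is a quick way to check whether that combination helps or hurts on a given dataset.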
This example demonstrates how dictionaries can improve fuzzy matching when the strings being compared are long, such as verbose descriptions.
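To make the effect concrete, here is a minimal sketch that assumes lexemization means mapping each word to a canonical dictionary form before comparison. The dictionary entries, the stopword list and the Jaccard similarity used below are illustrative assumptions, not the tool's actual dictionary or scoring method.

```python
import re

# Hypothetical dictionary: maps surface variants to one canonical lexeme.
LEXEMES = {
    "television": "tv",
    "televisions": "tv",
    "inch": "in",
    "inches": "in",
    "blk": "black",
}
# Hypothetical list of words that carry little matching signal.
STOPWORDS = {"the", "a", "with", "and", "for"}


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexemize(text, dictionary=LEXEMES):
    """Return the set of canonical lexemes in text, with stopwords dropped."""
    return {dictionary.get(token, token) for token in tokenize(text) if token not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


left = "Sony 40 inch LCD Television with remote, blk"
right = "Sony 40 in LCD TV black and remote"

raw_score = jaccard(set(tokenize(left)), set(tokenize(right)))
lexemized_score = jaccard(lexemize(left), lexemize(right))
print(round(raw_score, 3), lexemized_score)
```

In this made-up pair, the raw token sets share only 4 of 12 distinct tokens, while the lexemized sets are identical, so the dictionary lifts the score from about 0.33 to 1.0.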