Project

In this example project, the task is to detect matching records in two different product catalogs ("Amazon" and "Google"). There are 3227 records in the Google catalog and 1364 records in the Amazon catalog.
The Google products catalog is the larger of the two and is imported as the "left" dataset, while the Amazon catalog is imported as the "right" dataset. It is known that there are 1300 true matching pairs.
Several example solutions are included to demonstrate how varying the input parameters influences the end result.
The sample solutions demonstrate that simpler models usually give better results than complex ones. In particular, it is usually better to exclude fields with long strings, such as long product descriptions, from the fuzzy matching constraints: comparing long strings both degrades matching quality and increases execution time, making the model impractical for regular use.
Bottom line: always start with simple fuzzy matching models, and don't use long strings (such as long descriptions) in fuzzy matching constraints!
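
To illustrate why long strings increase execution time, here is a minimal sketch that uses Levenshtein edit distance as a stand-in string comparison. This is an assumption made for illustration only: the comparison functions actually used by the matching engine are not described here, and the function names and sample strings below are hypothetical. The sketch counts how many table cells the algorithm has to fill, which grows with the product of the two string lengths.

```python
from typing import Tuple


def levenshtein(a: str, b: str) -> Tuple[int, int]:
    """Return (edit distance, number of dynamic-programming cells evaluated)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    cells = 0
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
            cells += 1
        previous = current
    return previous[-1], cells


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    distance, _ = levenshtein(a, b)
    return 1.0 - distance / longest


if __name__ == "__main__":
    # Hypothetical values, not taken from the Amazon or Google catalogs.
    title_left, title_right = "photo editor 5 deluxe", "photo editor 5.0 deluxe"
    description_left = "a long marketing description of the product " * 20
    description_right = "another long description of the same product " * 20

    for label, left, right in [("title", title_left, title_right),
                               ("description", description_left, description_right)]:
        _, cells = levenshtein(left, right)
        print(label, "cells evaluated:", cells,
              "similarity:", round(similarity(left, right), 3))
```

With this kind of comparison, a pair of descriptions that are each about 40 times longer than a title needs roughly 1600 times more work per pair, and that cost is paid for every candidate pair of records.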

Input Datasets Import

Left Dataset Schema Definition

The Google products catalog is an Excel file containing 3227 records. It has been registered as the left dataset for fuzzy matching. During registration, the file was converted into a .tab format text file and then imported into the underlying PostgreSQL database, where the indexes were also created. Below is the schema description used for importing the input file.

Right Dataset Schema Definition

The Amazon products catalog is an Excel file containing 1364 records. It has been registered as the right dataset for fuzzy matching. During registration, the file was converted into a .tab format text file and then imported into the underlying PostgreSQL database, where the indexes were also created. Below is the schema description used for importing the input file.

Fuzzy Match Solution

Identify Matching Records

Several demo solutions with different fuzzy matching constraints have been tried. You can investigate them on your own.

We concluded that excluding description fields from the fuzzy match constraints greatly improves the quality of the model. It also decreases execution time. Therefore, we have chosen a relatively simple fuzzy match model as our model of choice.
If we wanted to find the best possible matching model, this model would be the starting point for searching for optimum parameters: the results are good, the distribution is well separated, and execution is fast.
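
Because the true matching pairs are known for this project, the quality of a candidate model can be quantified while searching for better parameters. The following is a minimal sketch, assuming matches are available as sets of (left id, right id) pairs; the function name and the record ids are hypothetical and are not part of the tool.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (left record id, right record id)


def evaluate_matches(predicted: Set[Pair], truth: Set[Pair]) -> Dict[str, float]:
    """Compare predicted matching pairs with the known true pairs."""
    true_positives = len(predicted & truth)
    # Precision: share of predicted pairs that are real matches.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of real matches that the model found.
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical ids for illustration; real runs would use the 1300 known pairs.
    truth = {("g1", "a1"), ("g2", "a2"), ("g3", "a3"), ("g4", "a4")}
    predicted = {("g1", "a1"), ("g2", "a2"), ("g5", "a9")}
    print(evaluate_matches(predicted, truth))
```

Running the same evaluation for each demo solution gives a like-for-like comparison, so that a parameter change is kept only when it improves precision or recall without an unacceptable increase in execution time.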