Project

In this project, the task is to de-duplicate a restaurants dataset containing 864 records.
It is known that the dataset contains 112 duplicate record pairs.

Difference Between a Record Linkage and a Data De-duplication Project

Both record linkage and data de-duplication projects are fuzzy matching projects. The difference is that in a data de-duplication project we first fuzzy-match a dataset with itself, excluding matching each record with itself in the process, and then in the next step we eliminate the duplicates found from the original dataset.

In the case of a data de-duplication project, you need to import the same input dataset twice: both as the "left" and the "right" dataset.

Notice that in order to exclude matching each record with itself, in data de-duplication solutions we always need to define at least one exact matching constraint of the "id" <> "id" pattern.
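
As an illustration of what this constraint accomplishes, here is a minimal sketch in Python (not the tool's actual implementation; the records, field names and threshold are hypothetical):

    # Minimal sketch of matching a dataset with itself while excluding
    # self-pairs, i.e. enforcing the "id" <> "id" exact matching constraint.
    # Records and field names ("id", "name") are hypothetical.
    from difflib import SequenceMatcher

    records = [
        {"id": 1, "name": "arnie morton's of chicago"},
        {"id": 2, "name": "arnie mortons of chicago"},
        {"id": 3, "name": "art's delicatessen"},
    ]

    # The "left" and "right" datasets are the same input dataset.
    left, right = records, records

    for a in left:
        for b in right:
            if a["id"] == b["id"]:      # the "id" <> "id" constraint:
                continue                # never match a record with itself
            score = SequenceMatcher(None, a["name"], b["name"]).ratio()
            if score >= 0.9 and a["id"] < b["id"]:  # report each pair once
                print(a["id"], b["id"], round(score, 2))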

When doing a data de-duplication project, you are advised to break it into two discrete steps, i.e. two consecutive solutions:

  1. Solution to identify duplicate records (fuzzy matching of the dataset with itself)
  2. Solution to perform de-duplication, according to parameters established in the first step

In the first solution you are searching for the optimum fuzzy matching parameters, aiming to capture all true matches while avoiding false matches. Once you have found the optimum parameters and obtained satisfactory matching results, copy the project into a new project, switch on the option "It is Deduplicate Type (instead of default Record Linkage Type) solution", and perform the actual de-duplication.
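
Because the true duplicate pairs are known for this benchmark dataset, the parameter search can be thought of as a precision/recall sweep over candidate thresholds. A minimal sketch, assuming hypothetical scored candidate pairs from the first solution and a set of known true pairs:

    # Sketch of searching for an optimum similarity threshold: sweep the
    # threshold and compare the resulting pairs against known true pairs.
    # scored_pairs and true_pairs are hypothetical stand-ins for the
    # matching output and the 112 known duplicate pairs.
    def precision_recall(scored_pairs, true_pairs, threshold):
        predicted = {pair for pair, score in scored_pairs if score >= threshold}
        tp = len(predicted & true_pairs)
        precision = tp / len(predicted) if predicted else 1.0
        recall = tp / len(true_pairs)
        return precision, recall

    scored_pairs = [((1, 2), 0.98), ((3, 9), 0.72), ((4, 7), 0.91)]
    true_pairs = {(1, 2), (4, 7)}

    for threshold in (0.7, 0.8, 0.9):
        p, r = precision_recall(scored_pairs, true_pairs, threshold)
        print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")

A threshold is satisfactory when recall stays at 1.0 (all true matches found) while precision is as close to 1.0 as possible (no false matches).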

Input Dataset Import

For a data de-duplication project, the same input dataset has to be imported twice, as both the "left" and the "right" dataset. This enables us to fuzzy-match the dataset with itself and find duplicate records based on string similarity.

The restaurants dataset is an Excel file containing 864 records. When it was registered as the "left" or "right" dataset, it was automatically converted to a tab-delimited (.tab) text file and then imported into the PostgreSQL database.
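
The tool performs this conversion automatically; for illustration only, the same two steps could be reproduced roughly as follows (the file name, table name, and connection string are assumptions, and the target table is assumed to already exist):

    # Sketch of the conversion the import step performs automatically:
    # Excel -> tab-delimited text file, then a bulk load into PostgreSQL.
    # File name, table name and connection string are hypothetical.
    import pandas as pd
    import psycopg2

    df = pd.read_excel("restaurants.xlsx")            # 864 records
    df.to_csv("restaurants.tab", sep="\t", index=False)

    # Assumes a "restaurants" table with matching columns already exists.
    conn = psycopg2.connect("dbname=fuzzymatch user=postgres")
    with conn, conn.cursor() as cur, open("restaurants.tab") as f:
        cur.copy_expert(
            "COPY restaurants FROM STDIN WITH (FORMAT csv, DELIMITER E'\\t', HEADER)",
            f,
        )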

Data De-Duplication Solution

Identify Duplicate Records

In the first step, we created a new fuzzy match solution in which we linked records from the "left" and "right" datasets (remember, these are actually the same input dataset!).

As in every data de-duplication project, we had to define an exact matching constraint of the "id" <> "id" pattern, in order to prevent fuzzy-matching each record with itself!

Additionally, because the data in the "city" column is complete, we were able to define an additional exact matching constraint on the "city" field. Defining an exact matching constraint improves model quality and increases execution speed, since only record pairs that agree exactly on that field are ever compared fuzzily.
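
To see why, here is a minimal sketch of the blocking effect of an exact constraint on a complete column (records and field values are hypothetical):

    # Sketch of why an exact constraint on a complete column speeds things up:
    # grouping ("blocking") on "city" means fuzzy comparisons only happen
    # within each city, not across the full 864 x 864 grid of pairs.
    from collections import defaultdict

    records = [
        {"id": 1, "city": "los angeles", "name": "arnie morton's of chicago"},
        {"id": 2, "city": "los angeles", "name": "arnie mortons of chicago"},
        {"id": 3, "city": "new york",    "name": "carmine's"},
    ]

    blocks = defaultdict(list)
    for r in records:
        blocks[r["city"]].append(r)     # exact match on "city"

    candidate_pairs = [
        (a, b)
        for group in blocks.values()
        for i, a in enumerate(group)
        for b in group[i + 1:]          # also skips self-pairs within a block
    ]
    print(len(candidate_pairs))         # 1 instead of 3 for this tiny example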

Of course, we had to define fuzzy matching constraints too.
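
For illustration, a fuzzy matching constraint amounts to a similarity threshold on one or more fields. The fields ("name", "address") and thresholds below are assumptions, and the similarity measure is a simple stand-in for whatever the tool actually uses:

    # Sketch of a fuzzy matching constraint: two records are considered a
    # candidate duplicate if their "name" and "address" fields are similar
    # enough. Field names and thresholds are assumptions for illustration.
    from difflib import SequenceMatcher

    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def is_fuzzy_match(rec_a, rec_b, name_min=0.85, addr_min=0.75):
        return (similarity(rec_a["name"], rec_b["name"]) >= name_min
                and similarity(rec_a["address"], rec_b["address"]) >= addr_min)

    a = {"name": "Arnie Morton's of Chicago", "address": "435 S. La Cienega Blvd."}
    b = {"name": "Arnie Mortons of Chicago",  "address": "435 S La Cienega Blvd"}
    print(is_fuzzy_match(a, b))   # True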