Introduction
A short summary of data deduplication in QDeFuZZiner
QDeFuZZiner provides the means to define a deduplication project, as opposed to a regular data matching and merging project involving two different datasets.
In essence, in each deduplication project QDeFuZZiner performs data matching of two separately imported copies of the same original source dataset. We therefore need to import the same source dataset as both the "left" and the "right" dataset.
Additional differences from regular data matching projects are:
1. We need to switch on the dedicated flag "Deduplication (instead of Matching)?":
2. We must define a not-equal (<>) exact constraint
While in regular record linkage projects it is also good practice to have an ID column in the input datasets, in deduplication projects it is a must: you must have a column containing a unique ID value for each row, and you must define an exact matching constraint with the "<>" (not-equal-to) operator on that column. For example:
This will ensure that a row is not matched with itself, but rather with other duplicate rows belonging to the same entity.
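As a rough illustration, here is a minimal SQL sketch of what such a not-equal constraint amounts to conceptually. The table and column names are illustrative, not QDeFuZZiner's actual generated SQL:

    -- Pairing a dataset with a copy of itself; the "<>" predicate on the
    -- unique ID column prevents a row from being paired with itself.
    SELECT l.id AS left_id, r.id AS right_id
    FROM dataset l
    JOIN dataset r
      ON l.id <> r.id;   -- only pairs of distinct rows survive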
3. Merging is the expected default behaviour
While you can use the data merging features in record linkage projects too, in deduplication projects you will almost certainly want to use them, because you will typically want to vertically stack, merge and consolidate data coming from duplicate rows belonging to the same entity.
Example of data merging definition in a data deduplication project:
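For a sense of what such consolidation produces, here is a hedged SQL sketch of one common approach: collapsing duplicates onto one row per entity. The column names and aggregation choices are illustrative assumptions, not QDeFuZZiner's actual merge logic:

    -- One consolidated row per person: pick a non-null representative value
    -- per column and stack multi-valued fields into a delimited list.
    SELECT uid,
           max(first_name)                  AS first_name,
           max(last_name)                   AS last_name,
           string_agg(DISTINCT email, '; ') AS emails
    FROM persons
    GROUP BY uid;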
Data deduplication based on exact matching constraints
As with regular data matching & merging projects, in deduplication projects we can also combine various exact matching and fuzzy matching constraints.
In this article we are going to analyze a special case in which the data deduplication model could be established by using exact matching constraints only.
We would normally expect exact matching constraints to be superior to fuzzy matching constraints in performance and execution time. However, when the input dataset is large, this might not be the case. Due to how PostgreSQL constructs the query plan in the background, it sometimes fails to utilize an index properly, and execution time can become extremely long.
We have experimentally arrived at a simple solution to this problem: just add an additional fuzzy matching constraint with a very high blocking similarity limit and a very high similarity threshold, and performance will be dramatically improved, i.e. runtime reduced!
Since this is all very counter-intuitive, we will demonstrate it with an example.
The subject of this deduplication is a dataset of 1.5 million rows and more than 500 columns, containing various information about persons. Each person has a unique value in the UID column.
Deduplication model
In this article we want to focus on an example of a deduplication project where we can utilize exact matching constraints and, in theory, would not need fuzzy matching constraints at all.
In this particular example we have two ID columns guaranteed to provide unique information: the "Global Row ID" column provides a unique ID for each row, and the "UID" column provides the unique identity of each person.
Regular deduplication model based on exact matching constraints
This is what we would normally define in exact matching constraints section for this deduplication project:
"Global Row ID" <> "Global Row ID" constraint ensures that a rows is not being matched with itself.
"UID" = "UID" constraint ensures that only rows with persons having same UID are being deduplicated and merged together.
We would normally expect this to be translated into a very efficient query utilizing standard B-tree indexes. But, unfortunately, it is not the case: when we ran this project, it was unable to finish execution within a one-week period! It seems as if the PostgreSQL database was first trying to establish the Cartesian product of all row combinations and only then to perform filtration according to the constraint definitions. Since 1.5 million multiplied by 1.5 million is an astronomical number (2.25 trillion combinations), this approach is destined to fail performance-wise. Not good.
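If you want to verify this kind of behaviour on your own database, PostgreSQL's EXPLAIN shows the planner's chosen strategy without executing the query. The statement below is just an illustrative self-join, not QDeFuZZiner's actual SQL:

    -- EXPLAIN prints the query plan; a nested-loop join over two full scans
    -- of a 1.5M-row table is a red flag for runaway runtime.
    EXPLAIN
    SELECT l."Global Row ID", r."Global Row ID"
    FROM persons l
    JOIN persons r
      ON  l."UID" = r."UID"
      AND l."Global Row ID" <> r."Global Row ID";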
Improved deduplication model based on exact matching constraints, with the addition of a dummy fuzzy matching constraint
Fortunately, we can improve this model with a simple trick. We just need to add an additional fuzzy matching constraint on the UID column pair. We can call it a dummy fuzzy matching constraint because we would normally argue that it is not really needed, since we already have an exact constraint on the very same column pair.
Of course, since we added a fuzzy matching constraint, we now also need to define the blocking and final similarity thresholds. Since this is just a dummy constraint, we don't need to deal with determining an optimum value; we can put an arbitrary, very high value (for example 99%) for both the blocking similarity and the final similarity threshold.
When we ran this version of the model, it took only 1.5 hours to complete!
That's because PostgreSQL utilized the special indexes used for fuzzy matching. The very high blocking and final similarity thresholds reduced the solution space to a small subset of candidate rows and thus dramatically increased performance.
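We don't show QDeFuZZiner's internal SQL here, but the effect can be sketched with PostgreSQL's pg_trgm extension, a common way to implement indexable fuzzy matching. The schema (UID stored as text) and the threshold wiring below are assumptions for illustration only:

    -- Trigram similarity support and an index that makes the % operator fast.
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX idx_persons_uid_trgm ON persons USING gin ("UID" gin_trgm_ops);

    -- A very high threshold (the "dummy" constraint) shrinks the candidate set.
    SET pg_trgm.similarity_threshold = 0.99;

    SELECT l."Global Row ID", r."Global Row ID"
    FROM persons l
    JOIN persons r
      ON  l."UID" % r."UID"                        -- index-assisted fuzzy filter
      AND l."UID" = r."UID"                        -- the original exact constraint
      AND l."Global Row ID" <> r."Global Row ID";  -- exclude self-pairs

With the trigram index in place, the planner can use the % filter to prune candidates before evaluating the remaining predicates, which mirrors the dramatic speedup observed above.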
Conclusion
Each deduplication project in QDeFuZZiner contains at least one exact matching constraint with the "<>" (not-equal-to) operator, to ensure that a row is not matched with itself.
If we are able to define all other constraints as exact matching constraints, with no need for fuzzy matching constraints, we would expect to see high performance.
However, this is not necessarily the case. Due to how PostgreSQL manages the query plan in the background, performance can be very bad.
Counter-intuitive as it is, there is fortunately a simple solution to improve performance: just add an additional fuzzy matching constraint based on the same column pairs you used for the exact matching constraints, and set very high blocking similarity and final similarity threshold values (for example, 99%).
Further Reading
Introduction To Fuzzy Data Matching
Data Matching Flow
Managing QDeFuZZiner Projects
Importing Input Datasets into QDeFuZZiner
Managing QDeFuZZiner Solutions
Demo Fuzzy Match Projects
Various Articles on QDeFuZZiner
Our Data Matching Services
Do you wish us to perform fuzzy data matching, de-duplication or cleansing of your datasets?
Check out our data matching service here: Data Matching Service