What is QDeFuZZiner Solution?

By term "Solution" we consider section of the QDeFuZZiner in which fuzzy matching specification is defined, fuzzy matching execution is performed and results are acquired and manipulated. Solution is the most important object in QDeFuZZiner software.

A solution is an entity on which you define all the parameters that will influence how fuzzy matching will be performed, provide specification of field pairs to be matched, along with exact, fuzzy and other constraints.

By executing solution specification, fuzzy data matching analysis is executed on the underlying PostgreSQL database. Upon completion of fuzzy matching, retrieved results are saved as new table and presented in a datagrid.

Retrieved resultset can be exported into a spreadsheet (.xlsx, .xls, .ods) or flatĀ plain text (.csv, .tab, .txt) files.

Each solution belongs to a certain project, i.e. one project can contain multiple solutions.

Solution Architecture (Underlying logic, workflow and database tables)

In order to understand how to set-up a solution, we need to understand basics of internal fuzzy match architecture and workflow.Data Matching Flow with QDeFuZZiner software

Solutions are saved as records of table of solutions, where each record represents a solution,Ā containing definition of parameters and constraints to be applied. Solution execution actually involves two separate phases, which we call "blocking phase" and "final resultset retrieval phase". Result of the first phase is so-called "solution base table", while result of the second phase is final resultset.

These two phases can be executed in one step or in two (or three) consecutive distinctive steps, by using corresponding buttons.

Screenshot from versions before v. 1.2.0:

Screenshot from version v. 1.2.0:

We recommend execution in two (or three) distinctive steps:

  1. Execute Blocking Phase (i.e. Prepare and execute Solution Base Table SQL Query) ---> solution base table creation
  2. Analyze Similarity Function Distribution (i.e. Open Similarity Distribution Estimator) (optinally) ---> visually assessing optimum similarity threshold value
  3. Get Final Resultset (i.e. Prepare and execute Result SQL Query) ----> resultset table creation and representation in datagrid

Blocking phase, i.e. creation of solution base table is a time-consuming operation, which, depending on the datasets size and number of columns included into fuzzy match comparison, can take anything from few minutes to few hoursĀ  to few days to finish! On the contrary, final reultset retrieval phase is executed in matter of seconds or minutes. Therefore, there is much sense to follow this recommended three-step approach: you first define solution parameters and constraints, then create solution base table (blocking phase), then open similarity distribution tool to visually determine area of optimum threshold values, then consecutively vary threshold values (inside previously determined optimum range) until getting satisfactory results.

Execution in one step is suitable for re-running a solution on update (re-imported) input datasets, when you expect that new imported data will not substantially change already defined data matching model.

Blocking Phase

"Blocking" phase is phase in which a subset of best matching candidate record pairs are chosen from the whole universe of all possible combinations.

Blocking phase is actually sequence of two distinct consecutive sub-phases:

1. Sub-phase of rough similarity filtration (blocking)

By using rough filtration on similarity, best candidates (those matching pairs which have string similarity greater than blocking similarity limit) are passed-through and saved into an intermediate table called "solution base table".

2. Detailed similarity calculation

After best candidates are saved, then detailed similarity calculation takes place for each passed-through record pair.

The most important parameter we need to define for blocking phase is called "blocking similarity limit". This value represents a similarity threshold that is used in so-called "blocking phase". Term "blocking" is used here to designate phase in which Cartesian product of all possible combinations of records from left and right dataset is constrained, i.e. narrowed down to a much smaller subset of combinations, according to some blocking similarity criteria. This is very important sub-phase, because for medium and big datasets, detailed fuzzy match calculation would become infeasible (extremely time consuming) if we would compare and analyze all possible combinations in detailed similarity calculation sub-phase.

The following screenshot shows parameters related to blocking phase:

The bigger is the blocking similarity limit value, the less number of record pairs will be saved in the solution table and consequently next phase (detailed fuzzy match phase) will be faster. However, if the blocking similarity values is too big, we risk to omit true matches.

WhenĀ a solution definition is executed, QDeFuZZiner creates a table for a solution, which we call "solution base table". This table is constructed as combination of records from left and right datasets, according toĀ exact and fuzzy matching constraints and blocking similarity limit. Solution base table thus contains subset ofĀ  left and right dataset records combinations, which satisfy condition of blocking similarity limit. Only combinations saved into the solution table are then analyzed in the detailed fuzzy match sub-phase and similarity values calculated for them.

Besides blocking similarity limit,Ā  parameter "Use dictionaries (yes/no)" also influences on the solution base table creation. If dictionary is used, strings used for blocking and detailed phase are lexemized into lexems, according to selected dictionary. Lexems are then used for similarity calculation instead of original words. This can be useful in cases of big strings, such as verbose product descriptions, because lexemization decreases variations in related words.

Of course, exact and fuzzy match constraints also influence blocking phase. Adding exact matching constraints can dramatically reduce time for execution of blocking phase and also improve accuracy of fuzzy matching model.

Immediately after rough filtration of candidate record pairs, detailed calculation of string similarity is executed on the passed-through records.

Overall result of the blocking phase is intermediary table stored in the database, called "solution base table", which contains record pairs with calculated string similarity values.

By using similarity distribution estimator tool, we can then visually assess optimum threshold to be used in resultset retrieval phase.

simialrity function distribution

Resultset Retrieval Phase

Following screenshot shows parameters important for final resultset retrieval phase:

In this phase,Ā value of the "similarity threshold" parameter is used to discern between matches and non-matches. Result of this phase is creation and saving of a resultset table which is then loaded into the datagrid, from which it can be exported into a spreadsheet or flat file.

Besides similarity threshold, this phase is also influenced by exact and fuzzy matching constraints. It is also influenced by the "Join Type" and "Return only best matching record (yes/no)" parameters.

Major Elements of a QDeFuZZiner Solutions Page

If you select a project in the Projects page and then click on the "Records Matching Analysis" tab, Solutions page will be shown.Ā