Details: Written by Super User; Category: Fuzzy Data Matching, Record Linkage and Data Deduplication; Published: 12 June 2020

Fuzzy Data Matching

Availability of a reliable master data for a supply chain can be crucial for success of a company. Supply chain performance is dependent on consistent definitions of customers, products, items, locations, and other master data objects. When data is poorly governed and inconsistent, supply chain become less competitive. Good data leads to efficient supply chains, allowing resources to be spent on innovation rather than on coping with problems.
However, business data is often messy, containing duplicate entries of same entities, missing information, having misspellings and various other errors.
In many cases, there is no unique identifier which can be used to unambiguously identify an entity. When such data sources need to be linked or deduplicated solely based on string similarity, it is called fuzzy data matching.

Fuzzy data matching is set of techniques to determine same identities, i.e. related records in two different datasets, when there is no unique identifier to identify entities. In such cases, records are compared based on approximate string similarity. Records that have calculated string similarity above a custom threshold are considered to be same entities.

Fuzzy Match

Term “fuzzy match” refers to methods of identifying related records by measuring how similar they are. It is used in cases where no unique identifier or exact match relation exists between two sets of data.

Fuzzy matching uses weights to calculate the probability that two given records refer to the same entity. Record pairs with probabilities above a certain threshold are considered to be matches, while pairs with probabilities below threshold are considered to be non-matches.

Fuzzy matching attempts to find a match which, although not a 100 percent match, is above the threshold matching percentage set by the application.

Record Linkage

Record linkage refers to the task of finding records in a data set that refer to the same entity across different data sources, i.e. to identify related records in two separate data sets.

Record linkage is necessary when joining data sets is based on entities that may or may not share a common identifier, as may be the case due to differences in record shape, storage location, and/or curator style or preference.

There are many business cases where record linkage has to be performed. Some typical examples are product price lists, partner lists, book and movie catalogs, customer loyalty databases, medical records etc.

Data Deduplication

Data deduplication refers to identifying duplicate records in a dataset and cleansing datasets from redundant information.

Entity Resolution

Fuzzy data matching of same entities in different data sources is often referred as "entity resolution".

Use Cases for Fuzzy Data Matching

There are many business cases where record linkage has to be performed.

Some typical examples are: merging or deduplication of product catalogs and price lists, partner lists, book and movie catalogs, customer loyalty databases, medical records etc.

Fuzzy data matching finds similarities between data elements, such as names and addresses, enabling effective data cleansing and consolidation.

Fuzzy data matching identifies linked or similar records that contain keyboard errors, missing words, extra words, nicknames, changed surnames, or multicultural name variations.

Inherent Complexity of Fuzzy Data Matching

Fuzzy data matching is non-trivial, inherently difficult task to accomplish, due to various technical and other reasons. This is especially true for fuzzy matching huge datasets. I will try to explain why fuzzy matching is so incredibly difficult trait.

Let's first talk about human capabilities in determining same entities accross various datasets. Human beings are capable to immediately recognize same entity in two different records. We are capable to ignore typographical errors, abbreviations, fill gaps in data and use our intuition to determine same entity. It is easy for humans if there are few dozens of records at hand. However, how would you match records in two product catalogs if there are 10 thousand records in one dataset and 30 thousand records in another? Scrolling and searching of such big datasets is very tedious job and becomes unfeasible in real life. Obviously, we need computer program to do it for us.

However, using computers raises another types of problems involved. First of all, how computer can know whether two records are similar or not? Obviously, we need to provide some mathematical functions to calculate string similarities between records and some kind of threshold to discern between matches and non-matches. Such string similarity calculation functions are inherently time-consuming and require substantial computer power. Defining appropriate threshold value is another difficulty. How do you know what is the optimum value to discern matches from non-matches? Obviously, we need to experiment with various values and compare acquired results in order to assess which threshold value provides best ratio between true matches and false matches. It is almost certainly that we will always get some false positives and/or false negatives. Post-hoc manual cleansing of a resultset is almost certainly unavoidable.

When fuzzy matching huge datasets, another inherent problem arises. Try to comprehend this: if we are matching 10 records with 10 records, we are talking about 10x10=100 combinations, if we are comparing 100 records with 100 records, we are talking about 10000 combinations, if we are comparing 1000 records with 1000 records, we are now dealing with 1000000 combinations, etc. You can see the problem, with real life datasets we can easily come into situations that we need to calculate and compare string similarity for millions or billions combinations. Additionally, if our datasets contain more than one column, we need to compare different columns separately and this multiplies complexity. Calculating and comparing string similarity for each and every possible combinations, therefore, easily becomes infeasible even for supercomputers. Thus, we need another approach, we need to somehow limit number of combinations prior calculating detailed string similarity. This is called "blocking", we need to use some kind of function to narrow-down number of record combination candidates before going into detailed fuzzy matching.

It is not my intent to go into all technical details, I just wanted to emphasize inherent complexity of any fuzzy matching project and to point out that fuzzy matching is always associated with inherent difficulties which makes it both challenging and exciting.

QDeFuZZiner - Fuzzy Data Matching, Record Linkage and Data Deduplication Software

Our software called "QDeFuZZiner" is here to help you with fuzzy data matching, i.e. record linkage and data deduplication tasks.

Due to its inherent complexity, fuzzy match analysis is a popular subject of scientific research and academic papers. Some of the researchers even tend to build their own software, but those programs suffer from their complexity and necessity to understand advanced mathematics and algorithms, in order to be able to use it. This is not something that can be expected from an average user facing data linkage problem in urge to be able to solve it in matter of hours or days.

On the other hand, there are huge corporate entity resolution framework solutions, produced by big software companies, oriented towards huge corporate customers. These solutions are often very complex and affordable only to big companies and corporate users.

QDeFuZZiner places itself in the middle and provides powerful fuzzy match records linkage solution for mere mortals and regular office users.

Data Matching Flow with QDeFuZZiner software

By allowing users to define exact matching constraints, fuzzy matching constraints and all other constraints in visual and intuitive way, all the complexity of the fuzzy match analysis is hidden from the user and he/she can focus on the business case, rather than technical issues. That is where QDeFuZZiner software really shines and clearly distinguishes itself from competition.

Key Benefits

QDeFuZZiner is an invaluable tool for anyone looking to perform data matching, merging or de-duplication. Whether you're a data scientist, business analyst, or simply someone looking to make sense of complex data sets, QDeFuZZiner can help you achieve your goals.

QDeFuZZiner is the best fuzzy data matching software available because it is a powerful, yet intuitive software that can identify linked or similar records that contain keyboard errors, missing words, extra words, nicknames, changed surnames, or multicultural name variations. It can also help you to merge and consolidate product and customer lists, from multiple sources, and to identify and link together same entities, such as same customers or products, from two different datasets. Additionally, it can be used to minimize duplicate customer data and accurately link each data record to one customer identity. It also offers a free version called QDeFuZZiner Lite which has all features of the full commercial version, with only limitation of importing maximum 10000 rows per dataset.

The main features of the QDeFuZZiner software include a robust back-end PostgreSQL database, capable of storing, indexing and processing heavy input datasets; an intuitive and interactive front-end desktop GUI application; the ability to import input datasets from spreadsheet and flat (csv) files; intuitive organization of fuzzy data matching projects; intuitive creation of multiple solutions inside each project; interactive user interface for definition of various fuzzy matching parameters; definition of exact matching constraints, fuzzy matching constraints, other constraints; graphical tool for visualization of similarity distribution of matches and non-matches in a solution table; interactive datagrids with integrated searching, filtering, sorting and customization capabilities; integrated spreadsheet software "Spready" for analyzing input datasets and resultsets; and the ability to export resultsets into spreadsheet files (.xlsx, .xls, .ods) or flat files (.csv, .txt, .tab).

The major benefits of using QDeFuZZiner software include lower cost, faster time to market, ability to identify linked or similar records, ability to merge and consolidate product and customer lists, and ability to minimize duplicate customer data and accurately link each data record to one customer identity.

QDeFuZZiner is considered to be one of the best fuzzy data matching software for several reasons:

Advanced Algorithms: QDeFuZZiner uses advanced algorithms that are specifically designed for fuzzy data matching. These algorithms are able to accurately match data even when there are variations in spelling, format, or other inconsistencies.

High Accuracy: QDeFuZZiner delivers high accuracy results, which is essential when working with fuzzy data. This means that users can trust the results produced by the software, which in turn increases the efficiency of their data analysis and decision making.

Intuitive Interface: QDeFuZZiner has an intuitive interface that is easy to use, even for non-technical users. This means that users do not need to have a background in computer science or data analysis to take advantage of the software's capabilities.

Customizability: QDeFuZZiner allows users to customize the data matching process to meet their specific needs. This includes the ability to set various parameters for data matching or de-duplication, with merging capabilities.

Scalability: QDeFuZZiner is able to handle large data sets, making it an ideal solution for organizations of all sizes.

Wide range of industries: QDeFuZZiner offers a number of features that can be used to quickly and easily analyze data in a wide range of industries, including finance, healthcare, retail and more.

All these factors combined make QDeFuZZiner the best fuzzy data matching software in the market, its ability to handle complex data sets with ease, its high accuracy, its easy-to-use interface, its customizability, its scalability and its wide range of industries support make it an ideal solution for organizations and individuals looking to extract insights from data.

QDeFuZZiner Features

Main features of the QDeFuZZiner software:

Robust back-end PostgreSQL database, capable of storing, indexing and processing heavy input datasets
Intuitive and interactive front-end desktop GUI application
Importing input datasets from spreadsheet and flat (csv) files through interactive GUI
Intuitive organization of fuzzy data matching projects
Intuitive creation of multiple solutions inside each project
Each solution provides interactive user interface for definition of various fuzzy matching parameters (blocking similarity threshold, similarity threshold, optional usage of lexemization and dictionaries)
Each solution provides definition of exact matching constraints, fuzzy matching constraints, other constraints
"Merged Columns" feature, which enables creation of new, combined columns, with horizontal and vertical data aggregation, as well as provides mean for updating existing columns from merged values or from opposite table.
Merging feature is versatile tool which can be used for various data consolidation scenarios, such as for merge/consolidation of data during data matching project or for data enriching od surviving row from duplicate rows, during de-duplication.
You can define merged columns in a resultset, multiple options of merging types are available
Graphical tool for visualization of similarity distribution of matches and non-matches in a solution table, providing help in determining otimum threshold value for a solution
Automatic determination of optimum weights for fuzzy matching field pairs
Interactive datagrids with integrated searching, filtering, sorting and customization capabilities
Interactive splitters between form sections, helping in optimum utilization of screen space
Integrated spreadsheet software "Spready" for analyzing input datasets and resultsets
Seamlessly integrated with your default spreadsheet software, for analyzing input datasets and resultsets
Exporting of resultsets into spreadsheet files (.xlsx, .xls, .ods) or flat files (.csv, .txt, .tab)

Screenshots

Projects organization

Input Data Import

Solutions Organization

Solution parameters related to blocking phase:

Solution parameters related to final resultset retrieval:

Picking Column Pairs for adding into Solution Constraints

Definition of Exact Matching Constraints

Definition of Fuzzy Match Constraints

Merged Columns

Similarity Distribution Estimator

Fuzzy Matching Results

Our Data Matching Services

Do you wish us to perform fuzzy data matching, de-duplication or cleansing of your datasets?

Check-out our data matching service here: Data Matching Service

Introduction to Fuzzy Data Matching