Fuzzy Data Matching
The availability of reliable master data can be crucial for a company's supply chain success. Supply chain performance depends on consistent definitions of customers, products, items, locations, and other master data objects. When data is poorly governed and inconsistent, the supply chain becomes less competitive. Good data leads to efficient supply chains, allowing resources to be spent on innovation rather than on coping with problems.
However, business data is often messy: it contains duplicate entries for the same entities, missing information, misspellings, and various other errors.
In many cases, there is no unique identifier that can be used to unambiguously identify an entity. When such data sources need to be linked or deduplicated solely on the basis of string similarity, the task is called fuzzy data matching.
Fuzzy data matching is a set of techniques for determining which records refer to the same entities, i.e. for finding related records in two different datasets, when there is no unique identifier. In such cases, records are compared by approximate string similarity, and records whose calculated similarity exceeds a user-defined threshold are considered to represent the same entity.
The term “fuzzy match” refers to methods of identifying related records by measuring how similar they are. It is used where no unique identifier or exact-match relation exists between two sets of data.
Fuzzy matching uses weights to calculate the probability that two given records refer to the same entity. Record pairs with probabilities above a chosen threshold are considered matches, while pairs with probabilities below the threshold are considered non-matches.
In other words, fuzzy matching accepts a match which, although not a 100 percent match, scores above the matching threshold set by the application.
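As a rough illustration of weighted matching against a threshold, here is a minimal Python sketch using the standard library's difflib. The field names, weights, and the 0.85 threshold are invented for this example, not taken from any particular product.

```python
# Hypothetical sketch: combine per-field string similarities into a weighted
# match score and classify a record pair against an assumed threshold.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Weighted average of per-field similarities."""
    total = sum(weights.values())
    return sum(w * similarity(rec_a[f], rec_b[f])
               for f, w in weights.items()) / total

weights = {"name": 0.6, "city": 0.4}            # assumed relative importance
a = {"name": "Jon Smith",  "city": "New York"}
b = {"name": "John Smith", "city": "New York"}

score = match_score(a, b, weights)
is_match = score >= 0.85                        # assumed threshold
```

Real-world products typically use more sophisticated similarity measures and tuned field weights, but the classify-by-weighted-score-and-threshold structure is the same.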
Record linkage refers to the task of finding records in a data set that refer to the same entity across different data sources, i.e. to identify related records in two separate data sets.
Record linkage is necessary when data sets must be joined on entities that may or may not share a common identifier, for example because of differences in record shape, storage location, or curator style and preference.
Data deduplication refers to identifying duplicate records in a dataset and cleansing the dataset of redundant information.
Fuzzy data matching of the same entities across different data sources is often referred to as "entity resolution".
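A naive deduplication pass can be sketched in a few lines of Python; the 0.9 similarity threshold and the sample company names below are assumptions made purely for illustration.

```python
# Minimal deduplication sketch: keep a record only if it is not
# sufficiently similar to a record already kept.
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """True when two strings exceed the assumed similarity threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def deduplicate(records):
    kept = []
    for rec in records:
        if not any(similar(rec, k) for k in kept):
            kept.append(rec)
    return kept

rows = ["ACME Corp.", "Acme Corp", "Globex Inc.", "ACME  Corp."]
print(deduplicate(rows))   # the near-duplicate ACME variants collapse into one
```

Note that this quadratic pairwise scan only works for small inputs; large datasets need the blocking techniques discussed later.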
Use Cases for Fuzzy Data Matching
There are many business cases where record linkage has to be performed.
Some typical examples are the merging or deduplication of product catalogs and price lists, partner lists, book and movie catalogs, customer loyalty databases, medical records, etc.
Fuzzy data matching finds similarities between data elements, such as names and addresses, enabling effective data cleansing and consolidation.
Fuzzy data matching identifies linked or similar records that contain keyboard errors, missing words, extra words, nicknames, changed surnames, or multicultural name variations.
Inherent Complexity of Fuzzy Data Matching
Fuzzy data matching is a non-trivial, inherently difficult task, for various technical and other reasons. This is especially true when fuzzy matching huge datasets. Let me try to explain why fuzzy matching is so difficult.
Let's first consider human capabilities in recognizing the same entities across datasets. Human beings can immediately recognize the same entity in two different records: we ignore typographical errors and abbreviations, fill gaps in the data, and use intuition. This is easy for a human when there are a few dozen records at hand. But how would you match records in two product catalogs when one dataset has 10 thousand records and the other 30 thousand? Scrolling through and searching such big datasets is a tedious job and becomes unfeasible in real life. Obviously, we need a computer program to do it for us.
However, using computers raises other kinds of problems. First of all, how can a computer know whether two records are similar? We need mathematical functions to calculate string similarity between records, and some kind of threshold to discern matches from non-matches. Such string similarity functions are inherently time-consuming and require substantial computing power. Defining an appropriate threshold is another difficulty: how do you know the optimum value for separating matches from non-matches? We need to experiment with various values and compare the results to assess which threshold provides the best ratio of true matches to false matches. It is almost certain that we will get some false positives and/or false negatives, so post-hoc manual cleansing of the resultset is practically unavoidable.
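The threshold experimentation described above can be sketched as follows; the hand-labelled record pairs and the candidate thresholds are invented for this example.

```python
# Sketch of threshold tuning on a small hand-labelled sample: for each
# candidate threshold, count false positives and false negatives.
from difflib import SequenceMatcher

# (left value, right value, true match?) -- invented sample data
labelled_pairs = [
    ("Main St. 12", "Main Street 12", True),
    ("Anna Schmidt", "Ana Schmid",    True),
    ("Oak Avenue",   "Elm Avenue",    False),
    ("IBM",          "HP",            False),
]

def ratio(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for threshold in (0.6, 0.7, 0.8, 0.9):
    fp = sum(1 for a, b, truth in labelled_pairs
             if ratio(a, b) >= threshold and not truth)
    fn = sum(1 for a, b, truth in labelled_pairs
             if ratio(a, b) < threshold and truth)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

Running such a sweep on a labelled sample shows the trade-off directly: low thresholds let false positives through, high thresholds start discarding true matches.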
When fuzzy matching huge datasets, another inherent problem arises. Consider this: if we match 10 records against 10 records, there are 10x10=100 combinations; 100 records against 100 records gives 10,000 combinations; 1,000 records against 1,000 records already gives 1,000,000 combinations, and so on. You can see the problem: with real-life datasets we easily reach situations where string similarity must be calculated and compared for millions or billions of combinations. Additionally, if our datasets contain more than one column, different columns must be compared separately, which multiplies the complexity. Calculating and comparing string similarity for every possible combination therefore quickly becomes infeasible, even for supercomputers. We need another approach: we must somehow limit the number of combinations before calculating detailed string similarity. This is called "blocking": we use some inexpensive function to narrow down the number of candidate record combinations before going into detailed fuzzy matching.
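Here is a minimal blocking sketch, assuming a deliberately naive blocking key (the first letter of each record); real blocking keys are usually built from phonetic codes, prefixes, or selected columns.

```python
# Blocking sketch: group records by a cheap key and compare only pairs
# within the same block, instead of the full n*m cross product.
from collections import defaultdict
from itertools import product

left  = ["apple pie", "apricot jam", "banana split", "blueberry tart"]
right = ["aple pie", "banana splitt", "cherry cake"]

def block_key(s: str) -> str:
    return s[0].lower()   # naive key, chosen only for illustration

blocks = defaultdict(lambda: ([], []))
for rec in left:
    blocks[block_key(rec)][0].append(rec)
for rec in right:
    blocks[block_key(rec)][1].append(rec)

# Only pairs sharing a blocking key survive as candidates.
candidate_pairs = [pair
                   for l_side, r_side in blocks.values()
                   for pair in product(l_side, r_side)]

full = len(left) * len(right)
print(f"full cross product: {full} pairs, after blocking: {len(candidate_pairs)}")
```

Even on this toy input, blocking cuts 12 candidate comparisons down to 4; on datasets with millions of rows, the reduction is what makes fuzzy matching feasible at all. The price is that a poorly chosen blocking key can split true matches into different blocks and miss them entirely.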
It is not my intent to go into all the technical details; I just wanted to emphasize the inherent complexity of any fuzzy matching project, and to point out that fuzzy matching always comes with inherent difficulties, which makes it both challenging and exciting.
QDeFuZZiner - Fuzzy Data Matching, Record Linkage and Data Deduplication Software
Our software, "QDeFuZZiner", is here to help you with fuzzy data matching, i.e. record linkage and data deduplication tasks.
Due to its inherent complexity, fuzzy match analysis is a popular subject of scientific research and academic papers. Some researchers even build their own software, but those programs suffer from complexity and require an understanding of advanced mathematics and algorithms. This is not something that can be expected from an average user who faces a data linkage problem and needs to solve it in a matter of hours or days.
On the other hand, there are huge corporate entity resolution frameworks, produced by big software companies and oriented towards large corporate customers. These solutions are often very complex and affordable only to big companies and corporate users.
QDeFuZZiner places itself in the middle, providing a powerful fuzzy match record linkage solution for mere mortals and regular office users.
By allowing users to define exact matching constraints, fuzzy matching constraints and all other constraints in a visual and intuitive way, all the complexity of fuzzy match analysis is hidden from the user, who can focus on the business case rather than on technical issues. That is where QDeFuZZiner really shines and clearly distinguishes itself from the competition.
Main features of the QDeFuZZiner software:
- Robust back-end PostgreSQL database, capable of storing, indexing and processing heavy input datasets
- Intuitive and interactive front-end application, deployed as an on-line web application accessible via any browser; no installation is needed on the customer side
- Uploading of input files for importing and downloading of exported results via an FTP server; an integrated web FTP client is available
- Importing of input datasets from spreadsheet and flat (CSV) files through an interactive GUI
- Intuitive organization of fuzzy data matching projects
- Intuitive creation of multiple solutions inside each project
- Each solution provides an interactive user interface for defining various fuzzy matching parameters (blocking similarity threshold, similarity threshold, optional usage of lexemization and dictionaries)
- Each solution provides definition of exact matching constraints, fuzzy matching constraints and other constraints
- You can define merged columns in a resultset; multiple merging types are available
- Graphical tool for visualization of the similarity distribution of matches and non-matches in a solution table, helping to determine the optimum threshold value for a solution
- Interactive datagrids with integrated searching, filtering, sorting and customization capabilities
- Interactive splitters between form sections, helping in optimum utilization of screen space
- Integrated spreadsheet software "Spready" for analyzing input datasets and resultsets
- Integrated LibreOffice Calc spreadsheet software for analyzing input datasets and resultsets
- Exporting of resultsets into spreadsheet files (.xlsx, .xls, .ods) or flat files (.csv, .txt, .tab)
- QDeFuZZiner comes with several demo projects