Fuzzy Data Matching, Record Linkage and Data Deduplication
- Details
- Written by Super User
- Category: Fuzzy Data Matching, Record Linkage and Data Deduplication
Unlock the Power of Precise Data Matching with QDeFuZZiner!
Are you struggling with messy datasets, duplicate records, or inconsistent customer information? QDeFuZZiner is your solution for powerful, intuitive fuzzy data matching, record linkage, and data deduplication.
Key features of QDeFuZZiner data matching software
Key Features:
• Advanced fuzzy matching engine for identifying similar records
• Robust PostgreSQL database for handling large datasets
• User-friendly GUI for easy project management
• Flexible importing from spreadsheets and CSV files
• Interactive data visualization tools
• Customizable matching parameters and constraints
Benefits of using QDeFuZZiner entity resolution software
Benefits:
• Consolidate customer and product lists from multiple sources
• Identify and link identical entities across different datasets
• Cleanse master data by removing duplicates
• Improve data quality for better business decisions
• Save time and resources on manual data cleaning
For whom QDeFuZZiner fuzzy matching software is built
Perfect for:
Data Analysts, Data Scientists, Business Analysts, Marketing Teams, Sales Departments, HR Professionals, and more!
QDeFuZZiner Lite - a perfect freeware for recurrent data matching tasks!
Try QDeFuZZiner Lite for FREE!
Experience all features with up to 10,000 rows per dataset.
QDeFuZZiner - powerful data matching and entity resolution software
Don't let messy data hold your business back. Harness the power of QDeFuZZiner for faster, more accurate data matching and consolidation.
QDeFuZZiner: Bringing Clarity to Your Data Chaos
#DataMatching #FuzzyMatching #DataCleansing #MasterDataManagement #BusinessIntelligence
Further Reading
Introduction To Fuzzy Data Matching
Managing QDeFuZZiner Projects
Importing Input Datasets into QDeFuZZiner
Managing QDeFuZZiner Solutions
Demo Fuzzy Match Projects
- Details
- Written by Super User
- Category: Fuzzy Data Matching, Record Linkage and Data Deduplication
Data Matching Flow with QDeFuZZiner software
In order to be able to use QDeFuZZiner software successfully, we need to understand general flow of a data matching project.
The same general principles apply to a data de-duplication project as well, which differs only in importing the same original input dataset twice, as both left and right dataset, and setting flag "Deduplication (instead of Matching".
Here is the graphical presentation of a typical data matching project:
Description of major phases involved in data matching or de-duplication project:
Project Creation
First step is to create a new project record.
Input Data Importing
Each project is dealing with matching of two input datasets, called "left dataset" and "right dataset", being imported from .csv files.
In this step you need to register both input datasets and then trigger procedure of their import into QDeFuZZiner database, where further data processing will take place.
In case of a data-deduplication project, the same input dataset has to be registered and imported as both left and right dataset.
QDeFuZZiner software imports only .csv files directly, so if you have your input dataset in other formats, such as Excel spreadsheets, you will need first to export them into corresponding .csv files, in UTF-8 format. Fortunately, all spreadsheet softwares has such option of exporting into .csv files. Our recommendation is to use LibreOffice Calc, which has most versatile options for data exporting.
As a good practice, it is recommended that before importing, you do basic preprocessing of input datasets, such as trimming whitespaces, doing proper capitalization (small and big letters), unified formatting of dates etc. Such data preparations will increase quality of fuzzy data matching.
Also, it is advisable to add a column with unique row identifiers, if not already present. It is always recommended, but for data de-duplication it is in fact a must, because you will need to set-up "<>" operator in exact matching constraints for ID columns of left and right dataset.
Solution Creation and Definition (i.e. setting up data matching model)
After input datasets are imported into QDeFuZZiner database, next step is to create a new Solution and define initial data matching model, which we will polish later.
Adding Columns into Data Matching Constraints
By using Fields Picker tool, we need to add column pairs from left and right datasets into applicable sections, for building our data matching model.
Available sections for adding column pairs are: Exact Matching Relations, Fuzzy Matching Relations, Other Constraints and Merged Columns.
After we added data matching constraints into applicable sections, we are ready to fine-tune our model.
Setting Up Exact Matching Constraints
By default, column pairs added to Exact Matching Constraints will have "=" (equal) operator assigned. However, if we are dealing with data de-duplication project, we need to use "<>" (not equal) operator instead, on ID columns from left and right dataset. That is important, because we don't want to compare a row from original dataset with itself (remember that for data de-duplication project we are importing the same original dataset twice, as both left and right dataset).
Setting Up Fuzzy Matching Constraints
In this section, we need to define relative weights for each columns pair. By default, each column pair gets the same relative weight, i.e. the same importance, which is not optimum.
QDeFuZZiner provide two alternative tools for automatic setting-up recommended relative weights. However, these tools are not perfect and you will need to judge it critically and manually adjust relative weights afterwards. Setting-up perfect relative weights is typically matter of trial and error - you will typically experiment with slight variations of the model, until you get satisfactory result.
Setting Up Other Constraints
This section is used to define additional exact matching constraints on individual columns from left or right dataset.
You will use it if you wish to constrain data model to certain custom sub-range, for example to certain town or gender, etc.
Such constraints must be manually defined.
Setting Up Merged Columns
"Merged Columns" is a very powerful, but complex section, with many parameters and options available, which you can use for creation of additional merged columns in final resultset, but also for merge/consolidation of duplicate rows.
It is important to understand that merging is performed not only horizontally (i.e. accross a matching row), but also vertically (i.e. accross all matching rows for the same matched entity). This is especially important in case of deduplication, where thus you can de-duplicate, while preserving data of all duplicate records, through consolidation options. In other words, you can enrich surviving rows from duplicate rows, during de-duplication process.
Solution Execution
After we defined our initial data matching model, we are ready for execution of the model, in order to retrieve resultset.
Typically, it is a cycle of multiple executions, resultset inspections, data model adjustments and fine-tuning, until you get perfect result.
Once data model is optimized, you can use it for repetitive executions of the same data matching model, with fresh imported data. You just need to re-import new data and execute already saved data matching model.
A) Solution Execution in 3 Consecutive Steps
For initial data model adjustments and fine-tunings, you will use this 3-step approach.
Solutions are saved as records of table of solutions, where each record represents a solution, containing definition of parameters and constraints to be applied. Solution execution actually involves two separate sub-phases, which we call "blocking phase" and "detailed fuzzy match phase". Result of the first phase is so-called "solution base table", while result of the second phase is final resultset.
Blocking phase, i.e. creation of solution base table is a time-consuming operation, which, depending on the datasets size and number of columns included into fuzzy match comparison, can take anything from few minutes to few hours to few days to finish! On the contrary, final resultset creation is executed in matter of seconds or minutes. Therefore, there is much sense to follow this recommended three-step approach: you first define solution parameters and constraints, then create solution base table (blocking phase), then open similarity distribution tool to visually determine area of optimum threshold values, then consecutively vary threshold values (inside previously determined optimum range) and execute detailed fuzzy match phase until getting satisfactory results.
1. Execute Blocking Phase
"Blocking" phase is phase in which a subset of best matching candidate record pairs are chosen from the whole universe of all possible combinations.
Blocking phase is actually sequence of two distinct consecutive sub-phases:
a) Sub-phase of rough similarity filtration (blocking)
By using rough filtration on similarity, best candidates (those matching pairs which have string similarity greater than blocking similarity limit) are passed-through and saved into an intermediate table called "solution base table".
b) Detailed similarity calculation
After best candidates are saved, then detailed similarity calculation takes place for each passed-through record pair.
The most important parameter we need to define for blocking phase is called "blocking similarity limit". This value represents a similarity threshold that is used in so-called "blocking phase". Term "blocking" is used here to designate phase in which Cartesian product of all possible combinations of records from left and right dataset is constrained, i.e. narrowed down to a much smaller subset of combinations, according to some blocking similarity criteria. This is very important sub-phase, because for medium and big datasets, detailed fuzzy match calculation would become infeasible (extremely time consuming), if we would compare and analyze all possible combinations in detailed similarity calculation sub-phase.
The bigger is the blocking similarity limit value, the less number of record pairs will be saved in the solution table and consequently next phase (detailed fuzzy match phase) will be faster. However, if the blocking similarity values is too big, we risk to omit true matches.
When a solution definition is executed, QDeFuZZiner creates a table for a solution, which we call "solution base table". This table is constructed as combination of records from left and right datasets, according to exact and fuzzy matching constraints and blocking similarity limit. Solution base table thus contains subset of left and right dataset records combinations, which satisfy condition of blocking similarity limit. Only combinations saved into the solution table are then analyzed in the detailed fuzzy match sub-phase.
Besides blocking similarity limit, parameter "Use dictionaries (yes/no)" also influences on the solution base table creation. If dictionary is used, strings used for blocking and detailed phase are lexemized into lexems, according to selected dictionary. Lexems are then used for similarity calculation instead of original words. This can be useful in cases of big strings, such as verbose product descriptions, because lexemization decreases variations in related words.
Of course, exact and fuzzy match constraints also influence blocking phase. Adding exact matching constraints can dramatically reduce time for execution of blocking phase and also improve accuracy of fuzzy matching model.
Immediately after rough filtration of candidate record pairs, detailed calculation of string similarity is executed on the passed-through records.
Overall result of the blocking phase is intermediary table stored in the database, called "solution base table", which contains record-pairs with calculated string similarity values.
2. Analyze Similarity Function Distribution
After blocking phase is executed and solution base table is saved in the database, we can investigate similarity function distribution visually, in order to determine appropriate similarity threshold to discern matches from non-matches. QDeFuZZiner provide and advanced tool for similarity function distribution graphical representation, along with mathematical functions trying to provide a clue what would be the optimal threshold.
3. Get Final Resultset
In this phase, value of the "similarity threshold" parameter is used to discern between matches and non-matches. Result of this detailed fuzzy match phase is creation and saving of a resultset table which is then loaded into the datagrid, from which it can be exported into a spreadsheet or flat file.
Besides similarity threshold, this phase is also influenced by exact and fuzzy matching constraints. It is also influenced by the "Join Type" and "Return only best matching record (yes/no)" parameters.
B) Solution Execution in 1 Step
Execution in one step is suitable for re-running a solution on updated (re-imported) input datasets, when you expect that new imported data will not substantially change already defined data matching model.
Resultset Exporting
After a solution is executed, resultset will be saved as a new table in the database and will be presented in a datgrid, from which we can filter, sort, search and export results into a spreadsheet.
Further Reading
Introduction To Fuzzy Data Matching
Managing QDeFuZZiner Projects
Importing Input Datasets into QDeFuZZiner
Managing QDeFuZZiner Solutions
Demo Fuzzy Match Projects
Various Articles on QDeFuZZiner
Our Data Matching Services
Do you wish us to perform fuzzy data matching, de-duplication or cleansing of your datasets?
Check-out our data matching service here: Data Matching Service
- Details
- Written by Super User
- Category: Fuzzy Data Matching, Record Linkage and Data Deduplication
Fuzzy Data Matching
Availability of a reliable master data for a supply chain can be crucial for success of a company. Supply chain performance is dependent on consistent definitions of customers, products, items, locations, and other master data objects. When data is poorly governed and inconsistent, supply chain become less competitive. Good data leads to efficient supply chains, allowing resources to be spent on innovation rather than on coping with problems.
However, business data is often messy, containing duplicate entries of same entities, missing information, having misspellings and various other errors.
In many cases, there is no unique identifier which can be used to unambiguously identify an entity. When such data sources need to be linked or deduplicated solely based on string similarity, it is called fuzzy data matching.
Fuzzy data matching is set of techniques to determine same identities, i.e. related records in two different datasets, when there is no unique identifier to identify entities. In such cases, records are compared based on approximate string similarity. Records that have calculated string similarity above a custom threshold are considered to be same entities.
Fuzzy Match
Term “fuzzy match” refers to methods of identifying related records by measuring how similar they are. It is used in cases where no unique identifier or exact match relation exists between two sets of data.
Fuzzy matching uses weights to calculate the probability that two given records refer to the same entity. Record pairs with probabilities above a certain threshold are considered to be matches, while pairs with probabilities below threshold are considered to be non-matches.
Fuzzy matching attempts to find a match which, although not a 100 percent match, is above the threshold matching percentage set by the application.
Record Linkage
Record linkage refers to the task of finding records in a data set that refer to the same entity across different data sources, i.e. to identify related records in two separate data sets.
Record linkage is necessary when joining data sets is based on entities that may or may not share a common identifier, as may be the case due to differences in record shape, storage location, and/or curator style or preference.
There are many business cases where record linkage has to be performed. Some typical examples are product price lists, partner lists, book and movie catalogs, customer loyalty databases, medical records etc.
Data Deduplication
Data deduplication refers to identifying duplicate records in a dataset and cleansing datasets from redundant information.
Entity Resolution
Fuzzy data matching of same entities in different data sources is often referred as "entity resolution".
Use Cases for Fuzzy Data Matching
There are many business cases where record linkage has to be performed.
Some typical examples are: merging or deduplication of product catalogs and price lists, partner lists, book and movie catalogs, customer loyalty databases, medical records etc.
Fuzzy data matching finds similarities between data elements, such as names and addresses, enabling effective data cleansing and consolidation.
Fuzzy data matching identifies linked or similar records that contain keyboard errors, missing words, extra words, nicknames, changed surnames, or multicultural name variations.
Inherent Complexity of Fuzzy Data Matching
Fuzzy data matching is non-trivial, inherently difficult task to accomplish, due to various technical and other reasons. This is especially true for fuzzy matching huge datasets. I will try to explain why fuzzy matching is so incredibly difficult trait.
Let's first talk about human capabilities in determining same entities accross various datasets. Human beings are capable to immediately recognize same entity in two different records. We are capable to ignore typographical errors, abbreviations, fill gaps in data and use our intuition to determine same entity. It is easy for humans if there are few dozens of records at hand. However, how would you match records in two product catalogs if there are 10 thousand records in one dataset and 30 thousand records in another? Scrolling and searching of such big datasets is very tedious job and becomes unfeasible in real life. Obviously, we need computer program to do it for us.
However, using computers raises another types of problems involved. First of all, how computer can know whether two records are similar or not? Obviously, we need to provide some mathematical functions to calculate string similarities between records and some kind of threshold to discern between matches and non-matches. Such string similarity calculation functions are inherently time-consuming and require substantial computer power. Defining appropriate threshold value is another difficulty. How do you know what is the optimum value to discern matches from non-matches? Obviously, we need to experiment with various values and compare acquired results in order to assess which threshold value provides best ratio between true matches and false matches. It is almost certainly that we will always get some false positives and/or false negatives. Post-hoc manual cleansing of a resultset is almost certainly unavoidable.
When fuzzy matching huge datasets, another inherent problem arises. Try to comprehend this: if we are matching 10 records with 10 records, we are talking about 10x10=100 combinations, if we are comparing 100 records with 100 records, we are talking about 10000 combinations, if we are comparing 1000 records with 1000 records, we are now dealing with 1000000 combinations, etc. You can see the problem, with real life datasets we can easily come into situations that we need to calculate and compare string similarity for millions or billions combinations. Additionally, if our datasets contain more than one column, we need to compare different columns separately and this multiplies complexity. Calculating and comparing string similarity for each and every possible combinations, therefore, easily becomes infeasible even for supercomputers. Thus, we need another approach, we need to somehow limit number of combinations prior calculating detailed string similarity. This is called "blocking", we need to use some kind of function to narrow-down number of record combination candidates before going into detailed fuzzy matching.
It is not my intent to go into all technical details, I just wanted to emphasize inherent complexity of any fuzzy matching project and to point out that fuzzy matching is always associated with inherent difficulties which makes it both challenging and exciting.
QDeFuZZiner - Fuzzy Data Matching, Record Linkage and Data Deduplication Software
Our software called "QDeFuZZiner" is here to help you with fuzzy data matching, i.e. record linkage and data deduplication tasks.
Due to its inherent complexity, fuzzy match analysis is a popular subject of scientific research and academic papers. Some of the researchers even tend to build their own software, but those programs suffer from their complexity and necessity to understand advanced mathematics and algorithms, in order to be able to use it. This is not something that can be expected from an average user facing data linkage problem in urge to be able to solve it in matter of hours or days.
On the other hand, there are huge corporate entity resolution framework solutions, produced by big software companies, oriented towards huge corporate customers. These solutions are often very complex and affordable only to big companies and corporate users.
QDeFuZZiner places itself in the middle and provides powerful fuzzy match records linkage solution for mere mortals and regular office users.
By allowing users to define exact matching constraints, fuzzy matching constraints and all other constraints in visual and intuitive way, all the complexity of the fuzzy match analysis is hidden from the user and he/she can focus on the business case, rather than technical issues. That is where QDeFuZZiner software really shines and clearly distinguishes itself from competition.
Key Benefits
QDeFuZZiner is an invaluable tool for anyone looking to perform data matching, merging or de-duplication. Whether you're a data scientist, business analyst, or simply someone looking to make sense of complex data sets, QDeFuZZiner can help you achieve your goals.
QDeFuZZiner is the best fuzzy data matching software available because it is a powerful, yet intuitive software that can identify linked or similar records that contain keyboard errors, missing words, extra words, nicknames, changed surnames, or multicultural name variations. It can also help you to merge and consolidate product and customer lists, from multiple sources, and to identify and link together same entities, such as same customers or products, from two different datasets. Additionally, it can be used to minimize duplicate customer data and accurately link each data record to one customer identity. It also offers a free version called QDeFuZZiner Lite which has all features of the full commercial version, with only limitation of importing maximum 10000 rows per dataset.
The main features of the QDeFuZZiner software include a robust back-end PostgreSQL database, capable of storing, indexing and processing heavy input datasets; an intuitive and interactive front-end desktop GUI application; the ability to import input datasets from spreadsheet and flat (csv) files; intuitive organization of fuzzy data matching projects; intuitive creation of multiple solutions inside each project; interactive user interface for definition of various fuzzy matching parameters; definition of exact matching constraints, fuzzy matching constraints, other constraints; graphical tool for visualization of similarity distribution of matches and non-matches in a solution table; interactive datagrids with integrated searching, filtering, sorting and customization capabilities; integrated spreadsheet software "Spready" for analyzing input datasets and resultsets; and the ability to export resultsets into spreadsheet files (.xlsx, .xls, .ods) or flat files (.csv, .txt, .tab).
The major benefits of using QDeFuZZiner software include lower cost, faster time to market, ability to identify linked or similar records, ability to merge and consolidate product and customer lists, and ability to minimize duplicate customer data and accurately link each data record to one customer identity.
QDeFuZZiner is considered to be one of the best fuzzy data matching software for several reasons:
Advanced Algorithms: QDeFuZZiner uses advanced algorithms that are specifically designed for fuzzy data matching. These algorithms are able to accurately match data even when there are variations in spelling, format, or other inconsistencies.
High Accuracy: QDeFuZZiner delivers high accuracy results, which is essential when working with fuzzy data. This means that users can trust the results produced by the software, which in turn increases the efficiency of their data analysis and decision making.
Intuitive Interface: QDeFuZZiner has an intuitive interface that is easy to use, even for non-technical users. This means that users do not need to have a background in computer science or data analysis to take advantage of the software's capabilities.
Customizability: QDeFuZZiner allows users to customize the data matching process to meet their specific needs. This includes the ability to set various parameters for data matching or de-duplication, with merging capabilities.
Scalability: QDeFuZZiner is able to handle large data sets, making it an ideal solution for organizations of all sizes.
Wide range of industries: QDeFuZZiner offers a number of features that can be used to quickly and easily analyze data in a wide range of industries, including finance, healthcare, retail and more.
All these factors combined make QDeFuZZiner the best fuzzy data matching software in the market, its ability to handle complex data sets with ease, its high accuracy, its easy-to-use interface, its customizability, its scalability and its wide range of industries support make it an ideal solution for organizations and individuals looking to extract insights from data.
QDeFuZZiner Features
Main features of the QDeFuZZiner software:
- Robust back-end PostgreSQL database, capable of storing, indexing and processing heavy input datasets
- Intuitive and interactive front-end desktop GUI application
- Importing input datasets from spreadsheet and flat (csv) files through interactive GUI
- Intuitive organization of fuzzy data matching projects
- Intuitive creation of multiple solutions inside each project
- Each solution provides interactive user interface for definition of various fuzzy matching parameters (blocking similarity threshold, similarity threshold, optional usage of lexemization and dictionaries)
- Each solution provides definition of exact matching constraints, fuzzy matching constraints, other constraints
- "Merged Columns" feature, which enables creation of new, combined columns, with horizontal and vertical data aggregation, as well as provides mean for updating existing columns from merged values or from opposite table.
Merging feature is versatile tool which can be used for various data consolidation scenarios, such as for merge/consolidation of data during data matching project or for data enriching od surviving row from duplicate rows, during de-duplication. - You can define merged columns in a resultset, multiple options of merging types are available
- Graphical tool for visualization of similarity distribution of matches and non-matches in a solution table, providing help in determining otimum threshold value for a solution
- Automatic determination of optimum weights for fuzzy matching field pairs
- Interactive datagrids with integrated searching, filtering, sorting and customization capabilities
- Interactive splitters between form sections, helping in optimum utilization of screen space
- Integrated spreadsheet software "Spready" for analyzing input datasets and resultsets
- Seamlessly integrated with your default spreadsheet software, for analyzing input datasets and resultsets
- Exporting of resultsets into spreadsheet files (.xlsx, .xls, .ods) or flat files (.csv, .txt, .tab)
Screenshots
Projects organization
Input Data Import
Solutions Organization
Solution parameters related to blocking phase:
Solution parameters related to final resultset retrieval: