Working with large datasets

Because it utilizes the robust PostgreSQL database engine, QDeFuZZiner is capable of processing huge datasets. However, fuzzy matching of large datasets is an inherently complex and demanding task for computers, so we need to pay attention and plan the project carefully if we want to get results efficiently and in a timely manner.

When working with small datasets, the fuzzy matching process is fast and we can easily experiment with various parameters until we get satisfactory results. With large datasets, however, we need to choose parameters carefully, because execution might consume a lot of time and resources.

When working with large input datasets, we need to be aware that:

- importing datasets into the underlying PostgreSQL database can take a lot of time

- lower values of the blocking similarity limit can cause long runtimes

- complexity (and consequently runtime) is heavily influenced not only by the number of rows in the datasets, but also by the number of column pairs included in the fuzzy matching constraints

- post-processing with TF-IDF is slow

- "Optimize fuzzy matching weights (Slow)" is indeed slow

In this article we propose several tricks that can be useful when fuzzy matching or deduplicating large files with the QDeFuZZiner software.

Tip No. 1: Read and understand data matching flow in QDeFuZZiner

In order to properly configure a fuzzy matching model, you first need to understand the data matching flow in QDeFuZZiner.

Please read thoroughly the following article: https://matasoft.hr/QTrendControl/index.php/qdefuzziner-fuzzy-data-matching-software/data-matching-flow

Data Matching Flow with QDeFuZZiner software

Tip No. 2: Develop fuzzy match model with random sample of dataset rows

When setting up a fuzzy matching project, you will typically need to iterate the first two steps of the overall three-step process multiple times, until you find the parameters of the optimum fuzzy match model.

If you perform this iterative model optimization with the full datasets, you will waste a lot of time waiting for the results of the blocking phase.

Therefore, the advice is to work with a 10% or 1% random sample of the input dataset(s), depending on their size.

Once you have developed the optimum fuzzy matching model, you can import the full input datasets and execute the solution to get the final resultset.

Here is a Python script that you can use to generate a random sample from your input CSV file:

# -*- coding: utf-8 -*-
"""
Created on Thu Sep 22 18:31:33 2022

@author: Zlatko Matić
"""

# pip install dask

# import required modules
import time
from dask import dataframe as dd

mydataset = 'full_dataset1.csv'
sampleddatasetname = "sampled_" + mydataset
sample_fraction = 0.01

myseparator = ","
mydecimal = "."
mythousands = " "
myindex = True

# time taken to read data
s_time_dask = time.time()
print("Opening dataset:", mydataset)

# extra keyword arguments are passed through to pandas.read_csv;
# on_bad_lines="skip" requires pandas >= 1.3 (older pandas used error_bad_lines=False)
dask_df = dd.read_csv(mydataset, sep=myseparator, decimal=mydecimal, thousands=mythousands,
                      assume_missing=True, dtype=str, on_bad_lines="skip", engine="c")

e_time_dask = time.time()
print("Read with dask:", (e_time_dask - s_time_dask), "seconds")

# time taken to sample data
s_time_dask = time.time()

# sampling
print("Sample fraction:", str(sample_fraction))

sampled_dataset = dask_df.sample(frac=sample_fraction)

e_time_dask = time.time()
print("Sampling with dask:", (e_time_dask - s_time_dask), "seconds")

# time taken to save sampled data
s_time_dask = time.time()

sampled_dataset.to_csv(sampleddatasetname, sep=myseparator, decimal=mydecimal,
                       index=myindex, index_label="myID", single_file=True)

print("Sampled dataset:", sampleddatasetname)

e_time_dask = time.time()
print("Saving sampled dataset:", (e_time_dask - s_time_dask), "seconds")

Just modify the following variables to suit your needs: mydataset, sample_fraction, myseparator, mydecimal, mythousands.

Tip No. 3: Use higher values of the Blocking Similarity Limit parameter

"Blocking similarity limit" represents a similarity threshold that is used in so-called "blocking phase". Term "blocking" is used here to designate phase in which Cartesian product of all possible combinations of records from left and right dataset is constrained, i.e. narrowed down to a much smaller subset of combinations, according to some blocking similarity criteria. This is very important sub-phase, because for medium and big datasets, detailed fuzzy match calculation would become infeasible (extremely time consuming), if we would compare and analyze all possible combinations in detailed similarity calculation sub-phase.

Obviously, for large datasets it is of crucial importance to set an appropriate blocking similarity limit. The advice is to use higher values, especially at the beginning of the iterative search for optimum parameters.

Always start with higher values of the blocking similarity limit and then gradually lower it until the value produces a balanced similarity function distribution chart.

In the ideal situation, the similarity function distribution chart is balanced when you see only two major hills, one representing the distribution of false matches and the other representing the distribution of true matches, with a clearly visible breaking point between the two. Ideally, the hills should also be of approximately the same height.

This is what an ideal similarity distribution looks like:

Please read the full explanation of how to use similarity distribution charts here: https://matasoft.hr/QTrendControl/index.php/qdefuzziner-fuzzy-data-matching-software/various-articles/47-using-similarity-distribution-chart-to-assess-data-matching-model
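If you want to get a feel for such a bimodal shape outside the application, the following sketch plots a synthetic similarity distribution with two separated hills. The numbers are made up purely for illustration and the sketch assumes NumPy and matplotlib are installed:

# Illustrative only: a synthetic bimodal similarity distribution, with one hill for
# false matches (low similarity) and one for true matches (high similarity).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
false_matches = rng.normal(loc=0.35, scale=0.05, size=5000)   # made-up scores
true_matches = rng.normal(loc=0.80, scale=0.05, size=5000)    # made-up scores
scores = np.clip(np.concatenate([false_matches, true_matches]), 0.0, 1.0)

plt.hist(scores, bins=50)
plt.xlabel("Similarity score")
plt.ylabel("Number of record pairs")
plt.title("Balanced similarity distribution (illustrative)")
plt.show()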

Tip No. 4: Do not use the "Optimize fuzzy matching weights (Slow)" option

If you did not follow the advice to prepare the fuzzy matching model on a random sample rather than on the full input datasets, then at least do not use the "Optimize fuzzy matching weights (Slow)" option; use the "Optimize fuzzy matching weights (Fast)" option instead.

Tip No. 5: Reduce complexity of fuzzy matching constraints

When dealing with datasets having many columns, it is always a temptation to include as many columns as possible in the fuzzy matching model, in the hope that the model will recognize hidden relations and use them to increase matching quality.

However, the complexity of the fuzzy matching model also significantly increases the time needed for the string similarity calculation.
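As a rough, back-of-the-envelope illustration (the real cost also depends on the similarity algorithms chosen for each column pair), the detailed phase performs roughly one string-similarity calculation per candidate pair and per column pair, so the work grows multiplicatively:

# Illustrative estimate only: detailed-phase work grows with both the number of
# candidate pairs surviving blocking and the number of column pairs in the constraints.
candidate_pairs = 2_000_000   # example: pairs surviving the blocking phase

for column_pairs in (1, 3, 6):
    comparisons = candidate_pairs * column_pairs
    print(f"{column_pairs} column pair(s): ~{comparisons:,} string-similarity calculations")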

Tip No. 6: Don't include too many column pairs in the blocking phase

The blocking phase relies on a special type of index. Creation of such indexes can take considerable time for large datasets.

Thus the advice is to minimize the number of column pairs included in blocking, ideally down to a single column pair. For example, notice how in this fuzzy match model only the full address is used for blocking:

In some similar situations it would be suitable to block only on town, but the principle is always the same: you need to recognize which columns are best suited for narrowing down the number of rows included in the detailed similarity calculation. This should be a column that can separate meaningful candidates from totally useless ones.

In the following example, blocking has been done on company name only:

Notice, however, that this tip always has to be taken with a grain of salt. Sometimes it is useful to use several column pairs (but not too many) to reduce the number of records in the detailed calculation. For example, in this model we are blocking on company name and city name in order to narrow down the number of rows processed in the detailed phase, because the same company (a bank, for example) can be present in multiple towns.

So, you need to understand your dataset and consider the minimum number of column pairs needed to separate the wheat from the chaff.

On one hand it is desirable to include fewer columns, in order to speed up index creation; on the other hand it is desirable to include all columns that can filter out combinations that make no sense at all.
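To see why a second blocking column pair can pay off here, consider this small, made-up sketch (again purely illustrative, not QDeFuZZiner's implementation): blocking on the company name alone keeps every cross-town combination of the bank's branches, while blocking on name and city together keeps only the combinations within the same town:

# Illustrative only: candidate pairs when blocking on one column pair vs. two.
left = [("Blue Bank", "Zagreb"), ("Blue Bank", "Split"), ("Blue Bank", "Rijeka")]
right = [("Blue Bank d.d.", "Zagreb"), ("Blue Bank d.d.", "Split"), ("Blue Bank d.d.", "Osijek")]

def similar_name(a, b):
    # Stand-in for a fuzzy company-name comparison.
    return a.lower().startswith("blue bank") and b.lower().startswith("blue bank")

def same_city(a, b):
    return a.lower() == b.lower()

name_only = [(l, r) for l in left for r in right if similar_name(l[0], r[0])]
name_and_city = [(l, r) for l in left for r in right
                 if similar_name(l[0], r[0]) and same_city(l[1], r[1])]

print("Blocking on company name only:", len(name_only), "candidate pairs")          # 9
print("Blocking on company name and city:", len(name_and_city), "candidate pairs")  # 2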

Tip No. 7: Avoid using TF-IDF post-processing with large datasets

Although TF-IDF is a very powerful option that can increase data matching quality when we need to distinguish entities with long names differing in only a few letters (often the case with company names), you need to be aware that this algorithm takes much more time to execute than the core matching algorithms.

Thus, the advice is to avoid TF-IDF post-processing for very large datasets.
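For a feel of what TF-IDF post-processing brings, here is a small sketch using scikit-learn (an assumption for illustration only; it is not how QDeFuZZiner implements the feature). Character n-gram TF-IDF de-emphasizes the long shared part of two names and highlights the few letters that differ, but every surviving pair needs vectorization and a cosine-similarity computation, which is where the extra runtime comes from:

# Illustrative only: character-level TF-IDF comparison of two long, similar company names.
# Assumes scikit-learn is installed; this is not QDeFuZZiner's internal implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = [
    "International Trading Company Alpha d.o.o.",
    "International Trading Company Altea d.o.o.",   # differs in only a few letters
]

# Character n-grams that appear in only one of the names get more weight than the
# long shared prefix, so the small difference is not drowned out.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
vectors = vectorizer.fit_transform(names)

similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"TF-IDF cosine similarity: {similarity:.3f}")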

Tip No. 8: Determine the similarity threshold on sample datasets with a lower blocking similarity limit, then increase the blocking similarity limit for execution with the full datasets

As already explained, during the iterative process of executing the blocking phase and inspecting the similarity distribution chart, we adjust the parameters of the model until the final model clearly separates matches from non-matches. We assess the model by visually inspecting the similarity function distribution: ideally, we will see two clearly separated hills of approximately the same magnitude. Once we have good visual separation, we also need to inspect the resultset.

If the resultset is good, then we know that our model is capable of separating matches from non-matches.

When dealing with large datasets, it is advisable to increase the blocking similarity limit a little (for example from 0.55 to 0.6) and then execute the model with the full datasets. This reduces execution time.

Since we are sure that the separation is good, we can afford to reduce the number of candidates passed into the detailed calculation.

Further Reading

Introduction To Fuzzy Data Matching

Data Matching Flow
Uploading and Downloading Input Files
Managing QDeFuZZiner Projects
Importing Input Datasets into QDeFuZZiner
Managing QDeFuZZiner Solutions
Demo Fuzzy Match Projects

Various Articles on QDeFuZZiner

Get QDeFuZZiner Software

Purchase QDeFuZZiner Premium License

Our Data Matching Services

Do you wish us to perform fuzzy data matching, de-duplication or cleansing of your datasets?

Check out our data matching service here: Data Matching Service