Introduction to Data Importing into QDeFuZZiner

QDeFuZZiner provides a mechanism for importing raw input datasets, stored as spreadsheets or flat text files, into corresponding tables on the PostgreSQL database server.

QDeFuZZiner does not operate on the original data sources directly. Data must first be imported from source spreadsheets (.xlsx, .xls, .ods) or CSV (comma-separated values) flat files into the server database, where the corresponding left and right database tables are then created, indexed and processed.

During import, and later during solution creation and execution, QDeFuZZiner creates various indexes on the underlying PostgreSQL database to facilitate fuzzy data matching.

Concept of "Left" and "Right" Dataset

Throughout the QDeFuZZiner application and this tutorial, we use the terms “left” and “right” dataset or table.

In every fuzzy match project, we always compare two tables, i.e. two datasets, and inspect the similarity of their rows. For convenience, we call them the “left” and “right” table. The purpose of entity resolution software such as QDeFuZZiner is to identify which records from the “left” dataset correspond to which records from the “right” dataset.

Encoding Considerations

The PostgreSQL database underlying QDeFuZZiner stores and processes data in Unicode UTF-8 encoding.

If you import data in an encoding other than UTF-8, QDeFuZZiner will try to convert your files before uploading them. It will try to detect the original encoding and then perform the conversion. However, encoding detection is a guess and is not always reliable, so the program can get it wrong. In such cases, the imported text will be corrupted and contain strange characters.

Therefore, please ensure that your input spreadsheet or flat files are UTF-8 encoded before uploading them via FTP and importing them into QDeFuZZiner!

Spreadsheet editors have built-in tools for saving in UTF-8 format. For flat text files (in .csv, .tab or .txt format), you can use the tools embedded in QDeFuZZiner for encoding detection and conversion, but you can also use the popular Notepad++ (https://notepad-plus-plus.org/), CudaText (http://uvviewsoft.com/cudatext/) and other powerful text editors that can detect and convert file encodings.
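
If you already know the encoding of a flat file, the conversion can also be scripted. The following is a minimal sketch in Python, not part of QDeFuZZiner; the function names are made up for illustration, and the source encoding must be supplied by you, because the sketch deliberately does not try to guess it.

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(src_path, dst_path, source_encoding):
    """Re-encode a text file to UTF-8 from a known source encoding.

    Decoding is strict, so a wrong source_encoding that produces
    undecodable bytes raises UnicodeDecodeError instead of silently
    writing corrupted text.
    """
    text = Path(src_path).read_bytes().decode(source_encoding)
    Path(dst_path).write_bytes(text.encode("utf-8"))
```

For example, a file saved in the Windows-1250 code page could be converted with convert_to_utf8("input.csv", "input_utf8.csv", "cp1250"), where both file names are placeholders. Note that some wrong encodings still decode without an error, so it is worth opening the result and checking accented characters by eye.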

Input Data Preprocessing

Bear in mind that the content and structure of your input files limit what the fuzzy matching process can achieve. It is very important to consider whether you can improve the chances of successful fuzzy matching by pre-processing your input files before importing them into QDeFuZZiner.

For example, if you can extract certain parts of long strings into separate columns, do it. Fuzzy matching works better on multiple columns containing short strings.
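
As an illustration of splitting a long string, here is a small Python sketch that breaks a combined address into three columns. It assumes a hypothetical input column named "address" in the form "street, city, postal code"; real data will usually need rules adapted to its own layout.

```python
def split_address(row, source_column="address"):
    """Split a 'street, city, postal code' string into three columns.

    `row` is a dict of column name -> string value, such as the rows
    produced by csv.DictReader. The column names are illustrative only.
    """
    # Split on the first two commas only, so at most three parts result.
    parts = [part.strip() for part in row[source_column].split(",", 2)]
    # Pad with empty strings when the address has fewer than three parts.
    parts += [""] * (3 - len(parts))

    out = {key: value for key, value in row.items() if key != source_column}
    out["street"], out["city"], out["postal_code"] = parts
    return out
```

Applied to a row whose address is "12 Main Street, Springfield, 62701", this yields separate street, city and postal_code values that can each be matched on its own.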

Sparse data is a huge problem for fuzzy matching. If rows are missing data in some columns, the quality of fuzzy matching deteriorates. If you can fill in missing data, definitely do it!
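
Before deciding which columns to fill in, it can help to measure how sparse each column is. The sketch below is a hypothetical helper, not a QDeFuZZiner feature; it assumes rows are dicts of string values (for example from csv.DictReader) and treats empty, whitespace-only and absent values as missing.

```python
def empty_fraction_per_column(rows):
    """Return {column: fraction of rows whose value is empty or missing}."""
    rows = list(rows)
    if not rows:
        return {}

    # Collect column names in first-seen order across all rows.
    columns = []
    for row in rows:
        for column in row:
            if column not in columns:
                columns.append(column)

    result = {}
    for column in columns:
        empty = sum(1 for row in rows if not (row.get(column) or "").strip())
        result[column] = empty / len(rows)
    return result
```

Columns with a high fraction of missing values are the first candidates for filling in from other sources, or for leaving out of the matching criteria.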

In many cases, different rows use different terms for the same thing. For example, you might find both "M"/"F" and "Male"/"Female" designations for gender in the same dataset. In such cases, standardize the terms before import. Standardizing terms greatly improves fuzzy matching. Another example is standardizing date and time formats.
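
A standardization pass of this kind might look like the following Python sketch. The mapping table and the list of accepted date formats are assumptions chosen for illustration; in particular, "%m/%d/%Y" assumes month-first dates, which you should verify against your own data.

```python
from datetime import datetime

# Illustrative mapping; extend it with the variants found in your data.
GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}

# Assumed input formats, tried in order. Output is always ISO 8601.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def standardize_gender(value):
    """Map known gender spellings to 'M'/'F'; leave unknown values as-is."""
    cleaned = value.strip()
    return GENDER_MAP.get(cleaned.lower(), cleaned)


def standardize_date(value):
    """Convert a date in one of DATE_FORMATS to YYYY-MM-DD.

    Values that match none of the formats are returned unchanged
    (apart from trimming), so nothing is silently lost.
    """
    cleaned = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return cleaned
```

Returning unrecognized values unchanged is a deliberate choice here: it keeps the pass safe to run on messy data, at the cost of leaving some values for manual review.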

Record Linkage vs. Data Deduplication Considerations

In QDeFuZZiner, there is no fundamental difference between data deduplication and record matching projects. In both cases we compare two datasets, trying to infer which records from the “left” dataset correspond to which records in the “right” dataset.

The only difference between the two is that in a record matching project we compare two different input datasets, while in a data deduplication project we compare a dataset with itself in order to identify duplicate records within it.

Since QDeFuZZiner always compares two datasets, left and right, a data deduplication project requires importing the same original CSV file twice: first as the left dataset and then as the right dataset. QDeFuZZiner will thus create two identical tables with different names in the underlying database.

Converting Spreadsheet File into Tab-Separated CSV File

When you import a source file in spreadsheet format, you will notice that QDeFuZZiner first converts the spreadsheet into a corresponding tab-separated flat file and then performs the actual import.
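
To show what such a tab-separated flat file looks like, here is a minimal Python sketch that writes rows of cell values as tab-separated text using the standard csv module. It is not QDeFuZZiner's own converter, whose internals are not described here, and it assumes the rows have already been read from the spreadsheet by some other tool.

```python
import csv
import io


def rows_to_tsv(rows):
    """Serialize rows of cell values as tab-separated text.

    Cells containing a tab, a newline or a double quote are quoted
    by the csv module, so they survive a later round trip.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()
```

One practical benefit of the tab delimiter is that values containing commas, such as "Smith, John", need no quoting.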

Import Page Organization

On the "Projects" page, select the project for which you want to import input data.