Details: Written by Super User; Category: QDeFuZZiner - Input Datasets Importing; Published: 14 June 2020

Introduction to Data Importing into QDeFuZZiner

QDeFuZZiner provides a mechanism for importing raw input datasets, in spreadsheet or flat text file format, into corresponding tables on the PostgreSQL database server.

QDeFuZZiner does not operate on the original data sources directly. Data must first be imported from source spreadsheet files (.xlsx, .xls, .ods) or CSV (comma-separated values) flat files into the server database, where the corresponding left and right database tables are then created, indexed and processed.

During import, and later during solution creation and execution, QDeFuZZiner creates various indexes in the underlying PostgreSQL database, which facilitate fuzzy data matching.

Concept of "Left" and "Right" Dataset

Throughout the QDeFuZZiner application and this tutorial, we use the terms “left” and “right” dataset or table.

In every fuzzy match project, we always compare two tables, i.e. two datasets, inspecting the similarity of their rows. For convenience, we call them the “left” and “right” table. The purpose of entity resolution software, such as QDeFuZZiner, is to identify which records from the “left” dataset correspond to which records from the “right” dataset.
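
To make the left/right idea concrete, here is a minimal Python sketch that pairs each "left" value with its most similar "right" value. It is only an illustration: the function name best_matches, the threshold value and the use of Python's difflib are assumptions made for this example, and they do not describe how QDeFuZZiner computes similarity internally.

```python
from difflib import SequenceMatcher


def best_matches(left_rows, right_rows, threshold=0.8):
    """For each left value, return the most similar right value above the threshold.

    Hypothetical helper for illustration only; not QDeFuZZiner's algorithm.
    """
    matches = []
    for left in left_rows:
        best_score, best_right = 0.0, None
        for right in right_rows:
            # Case-insensitive similarity ratio between 0.0 and 1.0
            score = SequenceMatcher(None, left.lower(), right.lower()).ratio()
            if score > best_score:
                best_score, best_right = score, right
        if best_right is not None and best_score >= threshold:
            matches.append((left, best_right, round(best_score, 2)))
    return matches


left = ["Jon Smith", "Acme Corp."]
right = ["John Smith", "ACME Corporation", "Zenith Ltd"]
for pair in best_matches(left, right, threshold=0.6):
    print(pair)
```

Comparing every left row with every right row, as this sketch does, becomes slow on large tables, which is one reason a database-backed tool with indexes is preferable for real projects.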

Encoding Considerations

The PostgreSQL database underlying QDeFuZZiner stores data in Unicode UTF-8 encoding.

If you import data in an encoding other than UTF-8, QDeFuZZiner will try to convert your files prior to uploading. It will try to detect the original encoding and then perform the conversion. However, encoding detection is not always reliable, and the program can guess wrongly. In such cases, the imported text will be garbled, with strange characters.

Therefore, please ensure that your input spreadsheets or flat files are UTF-8 encoded prior to uploading via FTP and importing into QDeFuZZiner!

Spreadsheet editors have built-in tools for saving in UTF-8 format. For flat text files (in .csv, .tab or .txt format), you can use the QDeFuZZiner built-in tools for encoding detection and conversion, but you can also use the well-known Notepad++ (https://notepad-plus-plus.org/), CudaText (http://uvviewsoft.com/cudatext/) and other powerful text editors capable of detecting and converting file encodings.
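
If you already know the original encoding of a flat file, the conversion can also be scripted. The following Python sketch assumes the source encoding is known in advance (here cp1250, chosen only as an example). The function names convert_to_utf8 and is_valid_utf8 are hypothetical helpers written for this tutorial, not part of QDeFuZZiner.

```python
import tempfile
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(src_path, dst_path, source_encoding):
    """Re-encode a text file from a known source encoding to UTF-8."""
    text = Path(src_path).read_bytes().decode(source_encoding)
    Path(dst_path).write_bytes(text.encode("utf-8"))


# Small demonstration with a temporary file (example data, assumed cp1250)
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "left_cp1250.csv"
    dst = Path(tmp) / "left_utf8.csv"
    src.write_bytes("name;city\nŠime Žužić;Zagreb\n".encode("cp1250"))
    print("source is UTF-8:", is_valid_utf8(src))
    convert_to_utf8(src, dst, "cp1250")
    print("converted is UTF-8:", is_valid_utf8(dst))
```

Note that a successful UTF-8 decode only shows that the bytes are valid UTF-8. It does not prove that the text was interpreted with the correct original encoding, so it is still worth opening the converted file and checking a few rows by eye.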

Input Data Preprocessing

Bear in mind that the content and structure of your input files put limits on the fuzzy matching process. It is very important to consider whether you can improve the chances of successful fuzzy matching by pre-processing your input files prior to importing them into QDeFuZZiner.

For example, if you can extract certain parts of long strings into separate columns, do it. Fuzzy matching works better with multiple columns containing short strings.
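
As an illustration of extracting parts of a long string into separate columns, the sketch below splits a combined name column into first and last name before import. The column name full_name, the sample data and the helper functions are assumptions made for this example; real data will usually need rules tailored to its own quirks.

```python
import csv
import io


def split_full_name(full_name):
    """Split 'Last, First' or 'First Last' into a (first, last) tuple."""
    if "," in full_name:
        last, first = [part.strip() for part in full_name.split(",", 1)]
    else:
        parts = full_name.strip().rsplit(" ", 1)
        first, last = (parts[0], parts[1]) if len(parts) == 2 else ("", parts[0])
    return first, last


def add_name_columns(csv_text, name_column="full_name"):
    """Return rows (as dicts) with extra first_name and last_name columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["first_name"], row["last_name"] = split_full_name(row[name_column])
        rows.append(row)
    return rows


sample = 'full_name,city\n"Smith, John",Zagreb\nJane Doe,Split\n'
for row in add_name_columns(sample):
    print(row)
```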

Sparse data is a huge problem for fuzzy matching. If rows are missing data in some columns, the quality of fuzzy matching will deteriorate. If you can fill in missing data, definitely do it!

In many cases, different rows use different terms for the same thing. For example, you might find both "M"/"F" and "Male"/"Female" designations for gender in the same dataset. In such cases, standardize the terms prior to import. Standardization of terms greatly improves fuzzy matching. Another example is standardizing date and time formats.
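
A minimal Python sketch of such standardization is shown below. The mapping table, the list of accepted date formats and the function names are assumptions chosen for illustration. In particular, the sketch assumes day-first dates; adjust the formats to match your own data.

```python
from datetime import datetime

# Assumed mapping of gender designations to one standard form
GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}

# Assumed input date formats (day-first); tried in this order
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")


def standardize_gender(value):
    """Map known gender designations to 'M' or 'F'; leave others unchanged."""
    cleaned = value.strip()
    return GENDER_MAP.get(cleaned.lower(), cleaned)


def standardize_date(value):
    """Convert a date in any accepted format to ISO 8601 (YYYY-MM-DD)."""
    cleaned = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return cleaned  # unrecognized values are returned unchanged for review


print(standardize_gender(" Male "), standardize_date("14.06.2020"))
```

Values that the functions do not recognize are returned unchanged rather than guessed, so that they can be reviewed manually before import.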

Record Linkage vs. Data Deduplication Considerations

In QDeFuZZiner, there is no fundamental difference between data deduplication and record matching projects. In both cases we compare two datasets, trying to infer which records from the “left” dataset correspond to which records in the “right” dataset.

The only difference is that in a record matching project we have two different input datasets to compare, while in a data deduplication project we compare a dataset with itself in order to identify duplicate records.

Since QDeFuZZiner always compares two datasets, left and right, in a data deduplication project we need to import the same original CSV file twice: first as the left dataset and then as the right dataset. QDeFuZZiner will thus create two identical tables with different names in the underlying database.

Converting Spreadsheet File into Tab-Separated CSV File

You will notice that when you import a source file in spreadsheet format, QDeFuZZiner first converts the spreadsheet file into a corresponding tab-separated flat file and then performs the actual import.
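
QDeFuZZiner performs this conversion itself, so no scripting is required. Purely to show what a tab-separated flat file looks like, the following Python sketch converts comma-separated text into tab-separated text using the standard csv module. The function names are hypothetical, and the sketch does not read spreadsheet formats such as .xlsx.

```python
import csv
import io


def rows_to_tsv(rows):
    """Serialize an iterable of rows (lists of strings) as tab-separated text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()


def csv_to_tsv(csv_text):
    """Convert comma-separated text to tab-separated text."""
    return rows_to_tsv(csv.reader(io.StringIO(csv_text)))


sample = 'name,city\n"Smith, John",Zagreb\n'
print(csv_to_tsv(sample))
```

One practical advantage of the tab separator is that values containing commas, such as "Smith, John", no longer need to be quoted.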

Import Page Organization

On the "Projects" page, select the project for which you wish to import input data.

Click on the "Data Import" tab. The data import page will be shown for the currently selected project. Here you can specify the source file path and structure for the "left" and "right" input datasets and trigger the actual data import process.

Notice that the data import page consists of three sub-pages: "Left Dataset Specification", "Right Dataset Specification" and "Import Log".

The "Left Dataset Specification" and "Right Dataset Specification" sub-pages look exactly the same; they provide the specification for the left and right dataset respectively.

The "Import Log" sub-page provides progress information during the data import process.

The upper part of the form contains the input file registration section, where you can pick a previously uploaded source file from the Documents folder (where uploaded input files reside) and register it as the left or right source dataset for the currently selected project.

Notice the "Register input file" button. If you click it, QDeFuZZiner will register the selected input file as the left or right dataset for the project.

The middle portion of the form contains schema information for the source file being imported. Here you can also choose whether a given column will be included in the solution resultset.

Notice the "Get Fields Schema" button. When you click it, QDeFuZZiner will analyse the source file and list its schema information in the datagrid.

This portion of the import form also contains buttons for opening the source file in either the internal ("Spready") or external ("LibreOffice Calc") spreadsheet editor.

The bottom portion of the import form contains buttons for triggering the actual data import, as well as buttons for detecting and converting the encoding of input datasets prior to import.

Importing Process Workflow

Here we describe a typical workflow for importing datasets for fuzzy matching into QDeFuZZiner.

Pre-process input files (standardization, filling in missing data, extracting data into additional columns, etc.).
If the input file is a spreadsheet file, export it to a CSV file.
Ensure that source files are encoded as UTF-8.
In QDeFuZZiner, select the project on the Projects page.
Go to the "Data Import" page.
Register the left and right datasets in the corresponding sub-pages (if you are doing a data deduplication project, register the same source file as both the left and right dataset).
1. Browse for the file in the Documents folder (on the server, not on your local computer!).
2. Click the "Register input file" button to register the file as the left or right dataset, along with the corresponding fields schema information.
3. Review the fields list and optionally exclude columns that you don't need from the resultset.
4. Optionally, check the source file encoding and convert it to UTF-8.
Start the data import. You can import both datasets in one step by clicking the "Import both input files to server" button, or separately, one by one, by clicking the "Import left dataset input file" and "Import right dataset input file" buttons.

Uploading and Downloading Files (QDeFuZZiner in Cloud)

If you are using the QDeFuZZiner front-end application deployed in the cloud, rather than a locally installed front-end application, you need to upload files to the cloud first in order to be able to import them into QDeFuZZiner.

Options for uploading and downloading files in the cloud are described here: Uploading and Downloading Files in QDeFuZZiner Cloud

Our Data Matching Services

Would you like us to perform fuzzy data matching, de-duplication or cleansing of your datasets?

Check out our data matching service here: Data Matching Service
