By Siddhartha Chandra


What is Data Deduplication

Types of deduplication problems

There are a lot of names that float around when talking about problems in this domain. Even though they are often used loosely to denote the general problem of detecting duplicates, there are fine differences between them based on the type of problem being solved. Let’s take a look at each of them:

  1. Deduplication: This refers to the process of detecting duplicate records (duplicate detection), followed by creating a unique representative record for each group of duplicates (canonicalization). This applies to records present in a single database (a minimal sketch follows this list).
  2. Record Linkage: When multiple datasets are in consideration, the problem translates into ‘Record Linkage’. This involves performing deduplication within each dataset as well as linking records from one dataset to another. As a special case, when a record from one dataset can be matched with at most one record from any other dataset, the problem is called Bipartite Record Linkage or maximum one-to-one linkage.
  3. Entity Resolution/Unification: This refers to simultaneously merging multiple datasets and removing duplicate records both across and within datasets. This is essentially record linkage plus one more step: in record linkage, records across datasets are only linked; in entity resolution, the linked records are also canonicalized.

Source: (Almost) All of Entity Resolution
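To make the two steps of deduplication concrete, here is a minimal, self-contained Python sketch of duplicate detection followed by canonicalization on a toy dataset. The records, the similarity measure, the 0.85 threshold, and the “keep the longest value” survivorship rule are all illustrative assumptions, not a prescribed method; real systems typically add blocking, trained matchers, and richer merge rules.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Toy "people" records; names and emails are made up for illustration.
records = [
    {"id": 1, "name": "Jon Smith",  "email": "jon.smith@example.com"},
    {"id": 2, "name": "John Smith", "email": "jon.smith@example.com"},
    {"id": 3, "name": "Jane Doe",   "email": "jane.doe@example.com"},
]

def similarity(a, b):
    """Crude pairwise similarity: average of name and email string similarity."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    email_sim = SequenceMatcher(None, a["email"].lower(), b["email"].lower()).ratio()
    return (name_sim + email_sim) / 2

# Step 1: duplicate detection -- flag pairs whose similarity clears a threshold.
THRESHOLD = 0.85  # assumed cut-off, would be tuned in practice
duplicate_pairs = [
    (a["id"], b["id"])
    for a, b in combinations(records, 2)
    if similarity(a, b) >= THRESHOLD
]

# Step 2: canonicalization -- group duplicates with a tiny union-find, then
# build one representative record per group by keeping the longest value
# seen for each field (one possible survivorship rule).
parent = {r["id"]: r["id"] for r in records}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

for a, b in duplicate_pairs:
    parent[find(a)] = find(b)

groups = {}
for r in records:
    groups.setdefault(find(r["id"]), []).append(r)

canonical = [
    {field: max((r[field] for r in group), key=len) for field in ("name", "email")}
    for group in groups.values()
]
print(canonical)
# -> one record for "John Smith" and one for "Jane Doe"
```

Extending this sketch to record linkage would mean comparing records across two datasets instead of within one, and entity resolution would additionally canonicalize the linked groups, as described above.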

Why we need Data Deduplication

The motivation for data deduplication depends on how the data is being used. For scenarios where data is used only for viewing, the motivation is mainly data hygiene. In situations where the data feeds other processes that produce results, the cost of duplicates can be significantly higher.

For example, let’s assume that we live in a completely digitised society. In this society, say there is a dataset that contains the details of all people residing in a state. The cost of having duplicates is low if the dataset is used only as a reference to look up people residing in that state. The cost increases significantly if the same dataset is consumed by more parties. Say the Department of Motor Vehicles (DMV) uses it to keep a record of people who are authorised to drive a vehicle, and the Department of Labor uses it to keep track of people’s employment. Additionally, the DMV dataset could be referenced by the Police Department to aid decision making when issuing tickets. As the dependency on a given dataset increases, the cost of having duplicates in it increases steadily, possibly in a non-linear fashion.

When not exploited, duplicated data can cause people who consume it directly or indirectly to receive incorrect information. When exploited as a vulnerability, duplicated data can open the door to identity theft, fraud and many other malpractices.

Dealing with this darker side is generally referred to as identity resolution, which involves preventing fraud and catching criminals, terrorists and other bad actors.