Types of deduplication problems
There are a lot of names that float around when talking about problems in this domain. Even though they are often used loosely to denote the general problem of detecting duplicates, there are some fine differences between them based on the type of problem being solved. Let’s take a look at each of them:
[Figure: taxonomy of deduplication problems. Source: (Almost) All of Entity Resolution]
The motivation for data deduplication depends on how the data is being used. For scenarios where data is used only for viewing, the motivation is data hygiene. On the other hand, in situations where the data is consumed by other processes to produce results, the cost of duplicates can be significantly higher.
For example, let’s assume that we are living in a completely digitised society. In this society, say, there is a dataset that contains the details of all people residing in a state. The cost of having duplicates would be low if it is used only as a reference to look up people residing in that state. The cost increases significantly if the same dataset is used by more consumers. Say the Department of Motor Vehicles (DMV) uses it to keep a record of people who are authorised to drive a vehicle. The Department of Labor could use it to keep track of people’s employment. Additionally, the DMV dataset could be referenced by the Police Department to aid decision-making with regard to issuing tickets. As the dependency on a given dataset increases, the cost of having duplicates in it increases steadily, possibly in a non-linear fashion.
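To make the idea of duplicate detection concrete, here is a minimal sketch of how near-duplicate records in such a people dataset might be flagged. The field names (`name`, `dob`) and the normalization rule are purely illustrative assumptions, not from any real DMV schema; real systems use far more sophisticated matching.

```python
from collections import defaultdict

def normalize(record):
    # Build a naive matching key: lowercased name stripped of spaces and
    # punctuation, plus date of birth. (Illustrative only.)
    name = "".join(ch for ch in record["name"].lower() if ch.isalnum())
    return (name, record["dob"])

def find_duplicates(records):
    # Group records sharing the same normalized key; any group with more
    # than one record is a potential duplicate cluster.
    groups = defaultdict(list)
    for rec in records:
        groups[normalize(rec)].append(rec)
    return [grp for grp in groups.values() if len(grp) > 1]

people = [
    {"name": "John Smith",  "dob": "1980-04-12"},
    {"name": "john  smith", "dob": "1980-04-12"},  # same person, messy entry
    {"name": "Jane Doe",    "dob": "1975-09-30"},
]

print(find_duplicates(people))  # the two John Smith records form one cluster
```

Even this toy version shows why downstream consumers care: a lookup keyed on the raw name would treat the two John Smith rows as different people.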
When not exploited, duplicated data can cause people who consume it directly or indirectly to receive incorrect information. When exploited as a vulnerability, duplicated data can enable identity theft, fraud and many other malpractices.
This darker side of the problem is generally addressed by identity resolution, which involves preventing fraud and catching criminals, terrorists and other villains.