One of the most important goals of data cleaning is to detect and process duplicates. Duplication here does not mean only that two records are identical; the same record can also appear three or more times in the database. Strictly speaking, the correct term would therefore be multiplication, but duplication is used because it is in widespread use and easier to interpret.
The better the quality of the data used for identification, the more effective duplicate detection will be. For example, name, place of birth, and date of birth are best suited for identifying natural persons, and tax number and company number for identifying legal entities. If these data are of poor quality, it is strongly recommended to clean them before duplicate detection. Of course, the data will never be flawless, so the duplicate search will ultimately run on data that are incomplete or incorrect to some degree. This means that the search cannot rely on exact equality of the corresponding field values; similarity criteria have to be formulated instead. With good duplicate search algorithms, probable duplicate groups can be found even among moderately defective data, and most of them are confirmed by human review.
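A similarity criterion of this kind can be sketched as follows. This is only a minimal illustration, not a prescribed method: the field names (`name`, `birth_place`, `birth_date`), the weights, and the 0.85 threshold are all assumptions chosen for the example, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever string similarity measure a real project would use.

```python
from difflib import SequenceMatcher

# Hypothetical field weights for natural persons; they are assumptions
# for this sketch and would need tuning on real data.
WEIGHTS = {"name": 0.5, "birth_place": 0.2, "birth_date": 0.3}


def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(str(value).lower().split())


def field_similarity(a, b):
    """Return a similarity in [0, 1]; a missing value counts as no evidence."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def record_similarity(r1, r2, weights=WEIGHTS):
    """Weighted average of the per-field similarities."""
    return sum(
        weight * field_similarity(r1.get(field), r2.get(field))
        for field, weight in weights.items()
    )


def is_probable_duplicate(r1, r2, threshold=0.85):
    """Flag a pair as a probable duplicate; the threshold is an assumption."""
    return record_similarity(r1, r2) >= threshold


a = {"name": "John Smith", "birth_place": "Budapest", "birth_date": "1980-01-02"}
b = {"name": "Jon Smith", "birth_place": "budapest ", "birth_date": "1980-01-02"}
c = {"name": "Mary Jones", "birth_place": "Vienna", "birth_date": "1975-06-30"}

print(is_probable_duplicate(a, b))  # misspelled name, same birth data
print(is_probable_duplicate(a, c))  # a different person
```

Pairs flagged this way are only candidates; as noted above, they would still go to human review.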
Search for duplicates
Duplicate searches are run algorithmically to detect and list records that describe the same entity, such as a customer or a product. To do this, we create duplicate groups: each group is the set of records that belong to the same individual.
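One possible way to build duplicate groups from pairwise comparisons is a union-find structure, sketched below. The function `group_duplicates` and the matching function passed to it are hypothetical names introduced for this example. The sketch compares every pair of records, which is only practical for small data sets; larger ones usually need some way of limiting the comparisons.

```python
from itertools import combinations


def group_duplicates(records, is_match):
    """Return lists of record indices, one list per duplicate group.

    Records are linked transitively: if A matches B and B matches C,
    all three end up in the same group.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group representative,
        # shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if is_match(records[i], records[j]):
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)

    # Only groups with more than one record are duplicate groups.
    return [members for members in groups.values() if len(members) > 1]


def same_name(x, y):
    """Toy matching rule, assumed for the example only."""
    return x.strip().lower() == y.strip().lower()


names = ["Acme Ltd", "ACME LTD", "Bolt Inc", "acme ltd ", "Cargo Co"]
print(group_duplicates(names, same_name))
```

In practice the toy matching rule would be replaced by a similarity criterion on the identifying fields.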
Creating a master record
If complete deduplication is not possible, or is not the goal (for example, because several members of the duplicate groups must remain in other systems), it is expedient to create a master record for each group. The master record is based on the record from the highest-priority system and can be supplemented, where needed and justified, with data from other systems. The master record generally contains only the most important customer (identifier) data, and possibly some of the most important data related to the business activity.
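The construction of a master record can be illustrated with a short sketch. The system names, their priority order, and the record layout (`system`, `id`, plus data fields) are assumptions made for the example; the rule it applies is the one described above: take each value from the highest-priority system and fall back to other systems only when that value is missing.

```python
# Hypothetical source systems; a lower number means a higher priority.
SYSTEM_PRIORITY = {"crm": 0, "billing": 1, "webshop": 2}


def build_master_record(group, fields, priority=SYSTEM_PRIORITY):
    """Build one master record for a duplicate group.

    Each field takes its value from the highest-priority system that
    has a non-empty value for it. The master record also keeps
    references to all source records, which stay in their systems.
    """
    ordered = sorted(group, key=lambda record: priority[record["system"]])
    master = {"sources": [(r["system"], r["id"]) for r in ordered]}
    for field in fields:
        master[field] = next((r[field] for r in ordered if r.get(field)), None)
    return master


group = [
    {"system": "billing", "id": 7, "name": "J. Smith", "tax_number": "123"},
    {"system": "crm", "id": 1, "name": "John Smith", "tax_number": None},
]
master = build_master_record(group, ["name", "tax_number"])
print(master)
```

Here the name comes from the higher-priority system, while the tax number, which that system lacks, is supplemented from the other one.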
After the duplicates have been detected, the next step is deduplication. In every duplicate group, the records to be retained are selected and the rest are deleted. The record to be retained can be chosen on the basis of the best data quality, but sometimes the task is more complicated. Entities such as products that are linked to records in the duplicate group must be detached from the records to be deleted and linked to the remaining record. In some cases this runs into technical problems. For example, there may be a product that cannot be moved to a different customer record, in which case the record to which this product is linked has to be retained.
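The deduplication step might look like the sketch below. Everything in it is an assumption for illustration: the record and product layouts, the use of the number of filled fields as a stand-in for "best quality data", and the `movable` callback that says whether a product can be re-linked to another customer record.

```python
def completeness(record):
    """Crude data-quality score: the number of non-empty fields."""
    return sum(1 for value in record.values() if value not in (None, ""))


def deduplicate_group(group, products, movable=lambda product: True):
    """Pick a surviving record and re-link products to it.

    Products are modified in place. A record that still owns a product
    which cannot be moved is retained as well. Returns the id of the
    survivor and the sorted ids of the records that can be deleted.
    """
    survivor = max(group, key=completeness)
    group_ids = {record["id"] for record in group}
    keep = {survivor["id"]}

    for product in products:
        owner = product["customer_id"]
        if owner in group_ids and owner != survivor["id"]:
            if movable(product):
                product["customer_id"] = survivor["id"]
            else:
                keep.add(owner)  # technical obstacle: this record must stay

    return survivor["id"], sorted(group_ids - keep)


group = [
    {"id": 1, "name": "John Smith", "tax_number": None},
    {"id": 2, "name": "John Smith", "tax_number": "123"},
    {"id": 3, "name": "J Smith", "tax_number": None},
]
products = [
    {"id": "p1", "customer_id": 1, "locked": False},
    {"id": "p2", "customer_id": 3, "locked": True},
]
result = deduplicate_group(group, products, movable=lambda p: not p["locked"])
print(result)
```

In this example record 2 survives because it has the most filled fields, product `p1` is moved to it, and record 3 is kept because its product cannot be moved.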