Data cleaning in practice

Name cleaning

The purpose of name cleaning is to detect incorrect names in a list and to correct or remove them. Each full name is broken down into its components (prefix, family name, first name), and each component is corrected separately. Joining the corrected components gives the corrected full name, but the components can also be used individually. A corrected first name makes it possible to determine a person's gender and name day. A prefix or suffix may indicate an academic degree, which can be valuable information. Having all the components of a name is also a requirement for loading the data into a modern IT system.

Prefixes, suffixes and first names are corrected using reference dictionaries. For Hungarian names, first names are checked against the list of first names approved by the Institute of Linguistics of the Hungarian Academy of Sciences. For foreign first names, we use reference lists from authoritative sources. When examining the structure of full names, we follow the naming provisions of Act No 17 of 1982.
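The decomposition described above can be illustrated with a minimal sketch. It assumes the Hungarian name order (prefix, family name, first name) and uses two tiny, hypothetical dictionaries, PREFIXES and FIRST_NAMES, in place of the real reference dictionaries; the split_name helper and the ParsedName type are likewise invented for this illustration and are not the actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for the reference dictionaries; real lists are far larger.
PREFIXES = {"dr": "Dr.", "ifj": "ifj.", "özv": "özv."}
FIRST_NAMES = {
    "anna": ("Anna", "F"),
    "katalin": ("Katalin", "F"),
    "péter": ("Péter", "M"),
    "istván": ("István", "M"),
}


@dataclass
class ParsedName:
    prefix: Optional[str]
    family_name: str
    first_name: Optional[str]
    gender: Optional[str]


def split_name(raw: str) -> ParsedName:
    """Split a full name into components and correct each one separately."""
    tokens = raw.split()

    # A leading token found in the prefix dictionary is replaced by its canonical form.
    prefix = None
    if tokens and tokens[0].rstrip(".").lower() in PREFIXES:
        prefix = PREFIXES[tokens.pop(0).rstrip(".").lower()]

    # A trailing token found in the first-name dictionary also yields the gender.
    first_name = gender = None
    if tokens and tokens[-1].lower() in FIRST_NAMES:
        first_name, gender = FIRST_NAMES[tokens.pop().lower()]

    # Whatever remains is treated as the family name.
    family_name = " ".join(t.capitalize() for t in tokens)
    return ParsedName(prefix, family_name, first_name, gender)


print(split_name("dr kovács PÉTER"))
```

In this sketch a first name that is missing from the dictionary stays inside family_name and leaves first_name empty, which is one simple way to flag a record for manual review.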

Address cleaning

During address cleaning, addresses are converted to a uniform format: typing errors and other mistakes are corrected, and old street names are replaced with the current ones. This service can remedy gaps and outdated entries in address lists. Expanding abbreviations further increases the reliability of the address data.

When addresses are corrected, each one is broken down into its components, which are repaired one by one. Once the components (zip code, settlement, public area, etc.) have been corrected, joining them gives the corrected full address. Breaking addresses down into components is now an indispensable requirement when migrating them into IT systems: newly developed systems store the components in separate fields, whereas older systems stored each address as a single block of text. Storing the components separately also makes it possible to identify “target groups” among clients according to place of residence or business site.

A reference database is used to determine the correctness of the addresses and to improve address data.
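A minimal sketch of these steps follows: splitting an address into components, expanding abbreviations and replacing an old street name. It assumes input of the form "zip settlement, street number"; the ABBREVIATIONS and RENAMED tables are tiny illustrative stand-ins for a real reference database, and the clean_address helper is hypothetical.

```python
import re
from typing import Dict, Optional

# Illustrative stand-ins for the reference database (assumptions, not real data sets).
ABBREVIATIONS = {"u.": "utca", "u": "utca", "krt.": "körút", "krt": "körút"}
RENAMED = {("Budapest", "Moszkva tér"): "Széll Kálmán tér"}

# Assumed layout: 4-digit zip code, settlement, comma, street, house number.
ADDRESS_RE = re.compile(
    r"^\s*(?P<zip>\d{4})\s+(?P<settlement>[^,]+?)\s*,\s*"
    r"(?P<street>\S.*?)\s+(?P<number>\d+\S*?)\.?\s*$"
)


def clean_address(raw: str) -> Optional[Dict[str, str]]:
    """Return the corrected address components, or None if the layout is not recognised."""
    m = ADDRESS_RE.match(raw)
    if m is None:
        return None

    settlement = m["settlement"].strip().title()

    # Expand abbreviations token by token, then normalise capitalisation:
    # the street name is capitalised, the public-area type is lower case.
    tokens = [ABBREVIATIONS.get(t.lower(), t) for t in m["street"].split()]
    street = " ".join([t.capitalize() for t in tokens[:-1]] + [tokens[-1].lower()])

    # Replace an old street name with the current one where the table knows it.
    street = RENAMED.get((settlement, street), street)

    return {
        "zip": m["zip"],
        "settlement": settlement,
        "street": street,
        "number": m["number"],
    }


print(clean_address("1051 Budapest, Nádor u. 7."))
print(clean_address("1024 budapest, moszkva tér 3"))
```

Returning the components as separate fields mirrors the way newer systems store addresses; the full address can be rebuilt by joining the fields when needed.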

Cleaning e-mail addresses

E-mail addresses stored in databases often contain recording or typing errors, and such addresses cannot be used for sending e-mail. Many of these errors, especially the typical ones, can be corrected, which turns the addresses into valid, usable data.

When repairing e-mail addresses, both algorithmic processing and comparison with a reference database can be performed.
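Both approaches can be combined, as in the following minimal sketch. The algorithmic part removes whitespace, lowercases the address and replaces commas with dots; the dictionary part looks up the domain in DOMAIN_FIXES, a small hypothetical table standing in for a reference database. The clean_email helper and the deliberately simple validation pattern are assumptions made for illustration, and lowercasing the local part is a practical simplification rather than a rule.

```python
import re
from typing import Optional

# Hypothetical table of typical domain typos and their corrections.
DOMAIN_FIXES = {
    "gmial.com": "gmail.com",
    "gmail.con": "gmail.com",
    "freemial.hu": "freemail.hu",
}

# Deliberately simple syntax check; it does not cover every valid address form.
EMAIL_RE = re.compile(r"^[a-z0-9._%+-]+@[a-z0-9-]+(\.[a-z0-9-]+)+$")


def clean_email(raw: str) -> Optional[str]:
    """Return a corrected e-mail address, or None if it cannot be repaired."""
    # Algorithmic repairs: drop whitespace, lowercase, turn commas into dots.
    email = re.sub(r"\s+", "", raw).lower().replace(",", ".")
    if email.count("@") != 1:
        return None

    # Dictionary repair: replace a known mistyped domain with the correct one.
    local, domain = email.split("@")
    domain = DOMAIN_FIXES.get(domain, domain)

    candidate = f"{local}@{domain}"
    return candidate if EMAIL_RE.match(candidate) else None


print(clean_email(" Kiss.Anna@Gmial,com "))  # kiss.anna@gmail.com
print(clean_email("no-at-sign.hu"))          # None
```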

Cleaning phone numbers

When cleaning phone numbers, we check whether each number conforms to the formats used in Hungary, which distinguish between Budapest, provincial, mobile, public and special numbers. In addition to this content check, we validate the area codes against the official county code reference database in Hungary and convert the numbers into a standard format.
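A minimal sketch of such a check is shown below. It assumes the country code +36, the domestic prefix 06, the Budapest area code 1 with seven-digit subscriber numbers, two-digit mobile codes with seven-digit numbers and two-digit provincial area codes with six-digit numbers. The MOBILE_CODES and GEOGRAPHIC_CODES sets are small illustrative samples, not the official reference list, the normalize_phone helper is hypothetical, and public and special numbers are left out.

```python
import re
from typing import Optional, Tuple

# Illustrative samples only; a real check would load the full official code list.
MOBILE_CODES = {"20", "30", "70"}
GEOGRAPHIC_CODES = {"52", "62", "72", "96"}


def normalize_phone(raw: str) -> Optional[Tuple[str, str]]:
    """Return (kind, number in '+36 area subscriber' form), or None if invalid."""
    # Keep only digits and the plus sign, then strip the international or domestic prefix.
    compact = re.sub(r"[^\d+]", "", raw)
    for prefix in ("+36", "0036", "06"):
        if compact.startswith(prefix):
            compact = compact[len(prefix):]
            break
    if not compact.isdigit():
        return None

    # Classify by area code and expected total length of the national number.
    if compact.startswith("1") and len(compact) == 8:
        area, kind = "1", "budapest"
    elif compact[:2] in MOBILE_CODES and len(compact) == 9:
        area, kind = compact[:2], "mobile"
    elif compact[:2] in GEOGRAPHIC_CODES and len(compact) == 8:
        area, kind = compact[:2], "geographic"
    else:
        return None

    return kind, f"+36 {area} {compact[len(area):]}"


print(normalize_phone("06-1/234-5678"))    # ('budapest', '+36 1 2345678')
print(normalize_phone("+36 30 123 4567"))  # ('mobile', '+36 30 1234567')
print(normalize_phone("12345"))            # None
```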

Cleaning of document numbers and identifiers

In Hungary, the numbers of various documents can be checked algorithmically (such as ID number, passport number), as can other personal identifiers (e.g. tax identification number, social security number) and business identifiers (e.g. tax number, business number, NACE code). Some of these are also verified by comparison with a reference dictionary.
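Algorithmic verification of an identifier typically means recomputing a check digit from the other digits. The following minimal sketch assumes a weighted-sum scheme; the weights, the modulus and the sample values are hypothetical and do not reproduce the official rule of any particular Hungarian identifier, which has to be taken from the relevant specification. The check_digit and is_valid helpers are invented for this illustration.

```python
from typing import Sequence


def check_digit(payload: str, weights: Sequence[int], modulus: int = 10) -> int:
    """Compute the check digit of a digit string using a weighted sum."""
    if len(payload) != len(weights) or not payload.isdecimal():
        raise ValueError("payload must consist of exactly len(weights) digits")
    total = sum(int(d) * w for d, w in zip(payload, weights))
    return (modulus - total % modulus) % modulus


def is_valid(identifier: str, weights: Sequence[int], modulus: int = 10) -> bool:
    """Check that the last digit of the identifier matches the computed check digit."""
    # Separators are ignored so that '1234567-6' and '12345676' are treated alike.
    digits = identifier.replace("-", "").replace(" ", "")
    if len(digits) != len(weights) + 1 or not digits.isdecimal():
        return False
    return check_digit(digits[:-1], weights, modulus) == int(digits[-1])


# Hypothetical weights for a 7-digit payload followed by one check digit.
WEIGHTS = (9, 7, 3, 1, 9, 7, 3)

print(check_digit("1234567", WEIGHTS))  # 6
print(is_valid("1234567-6", WEIGHTS))   # True
print(is_valid("12345670", WEIGHTS))    # False
```

A check of this kind only catches typing errors; whether the identifier actually exists still has to be confirmed against a reference dictionary where one is available.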

DSS Consulting