Data cleaning in general

Every company has to collect and store data. Which data need to be collected depends on the company’s field of activity and profile, but some form of data collection is part of every company’s operation. Every company works with its customers’ personal or company data, their contact information, data on the products or services it provides, and some characteristics of its stock.

The need for accurate and complete data is evident, but experience shows that data are often incomplete and inaccurate. In other words, data quality is not impeccable. What does data quality mean? Data quality is a set of requirements defined by the extent of the difference between a real-world object and the data about that object that can be retrieved from the IT system. If there are no differences, or the differences are minor, then data quality is good.

Besides mapping the real world, the requirements that shape data quality also reflect needs arising from the operation of the data-collecting company (which data need to be stored) and regulatory requirements prescribed by the environment (e.g. the name structure defined by law, the tax number algorithm).

What can we do if our data do not meet our purposes? We can improve data quality by using data cleaning tools and services. Two important approaches exist for cleaning inconsistent, inaccurate or incomplete data stored in databases:
• An algorithmic test, where a mathematical algorithm is used to check the data and, where possible, to infer the correct value (e.g. CVD check, checking the consistency of a field)
• A reference database, where the examined data are compared with a valid reference database containing correct values, and the correct value is determined from it (e.g. a database of first names, a database of phone area codes, a database of mistyped or shortened settlement names).
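
The two approaches above can be illustrated with a minimal Python sketch. It is not the method of any particular tool. The Luhn checksum stands in for whatever check-digit rule applies to a given field (a national tax number uses its own algorithm), and the function names, the short settlement list and the 0.8 similarity cutoff are all assumptions made for illustration.

```python
from difflib import get_close_matches


def luhn_is_valid(number: str) -> bool:
    """Algorithmic test: True if the digits pass the Luhn check-digit rule.

    Luhn is used here only as a well-known example of a check-digit
    algorithm; real identifiers may use a different rule.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


# Hypothetical reference database of valid settlement names.
REFERENCE_SETTLEMENTS = ["Budapest", "Debrecen", "Szeged", "Miskolc"]


def correct_settlement(value: str, cutoff: float = 0.8):
    """Reference-database test: return the closest valid name, or None.

    The cutoff (an assumed value) controls how similar the input must be
    to a reference entry before it is accepted as a correction.
    """
    candidate = value.strip().title()
    matches = get_close_matches(candidate, REFERENCE_SETTLEMENTS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    print(luhn_is_valid("79927398713"))
    print(correct_settlement("budapset"))
```

In practice the two checks complement each other: the algorithmic test can only say whether a value is internally consistent, while the reference database can also propose the correct value, provided the reference itself is kept valid.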

DSS Consulting