In 2013, Gartner estimated that poor-quality data costs organizations, on average, $14.2 million annually. Poor-quality data are often hidden from the business under layers of cleansing code, business logic, mapping tables, and other data-fixing techniques. While extract-transform-load (ETL) is valuable, applicable, and appropriate in many instances, it is also important to assess the scope, complexity, risk, and time-to-value of data-fixing algorithms, along with the technical debt incurred in implementing and maintaining them.
Businesses typically begin a data discovery initiative with a goal in mind: increasing sales, retaining customers, demonstrating adherence to a business process, and so on. It is during this discovery process that the team begins to assess and understand the quality of its source data. All too often, teams attempt to resolve the anomalies in data-fixing ETL processes, tasking developers with resolving data quality issues through code.
The effort of developing, maintaining, testing, and scaling ETL code should be accounted for and incorporated into the value equation for a new measure, alongside factors such as actionability, usability, and strategic influence. In a number of cases, I’ve found that the cost to produce and maintain a given measure far exceeds its value; for others, it’s worth it.
It is often possible to minimize the cost of ETL by improving data quality at the source – employing features such as validation and strongly typed fields, and assessing the fitness of the source system’s data model. The cost of optimizing, or in some cases rebuilding, the source system is often less than the cost and risk of cleansing and maintaining poor-quality data.
Talavant recommends the following approach to improve data quality:
Implement validation and data quality measures at the source
Business systems should validate data on input. If a field is designed to accept a date, it should reject anything else. Fields that require a value shouldn't accept a single space as a valid entry. A postal code field should reference a lookup table of valid postal codes and notify the user of an invalid entry.
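The checks above can be sketched in a few lines. This is a minimal illustration, not a prescription for any particular system; the field names and the `VALID_POSTAL_CODES` set are hypothetical stand-ins for the system's own reference data.

```python
from datetime import date

# Hypothetical reference data; in practice this would be a lookup
# table maintained in the business system.
VALID_POSTAL_CODES = {"53703", "53704", "53711"}

def validate_entry(entered_date: str, customer_name: str, postal_code: str) -> list:
    """Return a list of validation errors; an empty list means the input passes."""
    errors = []

    # A date field should reject anything that is not a date.
    try:
        date.fromisoformat(entered_date)
    except ValueError:
        errors.append("'%s' is not a valid date" % entered_date)

    # A required field should not accept whitespace as a value.
    if not customer_name.strip():
        errors.append("customer name is required")

    # A postal code should exist in the reference lookup table.
    if postal_code not in VALID_POSTAL_CODES:
        errors.append("'%s' is not a recognized postal code" % postal_code)

    return errors
```

Rejecting bad input at the point of entry means no downstream ETL code ever has to guess what the user meant.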
Reduce the amount of human-entered data
Instead of depending on a human to manually count and enter the number of items redeemed, implement a barcode scanning/reconciliation system.
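The reconciliation step might look like the following sketch: scanned barcodes are tallied automatically and compared against the expected redemption counts, so a human never keys in a number. The function name and data shapes are illustrative assumptions, not a specific product's API.

```python
from collections import Counter

def reconcile(scanned_barcodes, expected_counts):
    """Compare item counts captured by a barcode scanner against
    expected redemption counts; return items whose totals disagree."""
    scanned = Counter(scanned_barcodes)
    discrepancies = {}
    for item, expected in expected_counts.items():
        observed = scanned.get(item, 0)
        if observed != expected:
            discrepancies[item] = {"expected": expected, "scanned": observed}
    return discrepancies
```

An empty result confirms the redemption records without any manual counting.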
Assess the fitness of the business system’s data model
Instead of having users enter data into a ‘notes’ field, or record a warranty code in an unused field, consider designing and implementing a data model that fits the structure of data being captured.
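As a rough sketch of the difference, consider warranty details that today live in a free-text notes field. A model whose fields match the structure of the data makes the warranty code and expiration date queryable instead of buried in prose. The class and field names here are hypothetical examples, not a recommended schema.

```python
from dataclasses import dataclass
from datetime import date

# Before: the data model forces everything into free text, e.g.
#   notes = "warranty W-123 expires 2025-06-01, customer called twice"
# After: structured fields for structured data; notes remain for true notes.

@dataclass
class Warranty:
    warranty_code: str        # hypothetical field, formerly buried in notes
    expiration_date: date     # typed, so invalid dates cannot be stored
    notes: str = ""           # free text reserved for genuinely unstructured remarks
```

With this shape, questions like "which warranties expire this quarter?" become simple queries rather than string-parsing exercises in ETL.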
Design and document
Make sure the processes that occur within a business system are well understood and documented, lest measures be built on a foundation with unknown cracks.
When a business system is not well-aligned to its users or the analytical goals of the business, it’s time to consider architecting and building (or buying) a new solution.