Improving Data Quality
Our customer address information is inaccurate, and it contributes to delayed deliveries and frustrated customers. What general resources are available to improve the quality of our data?
There is a rather elegant solution to the problem of address standardization that is described later in this article. But I’d like to take the opportunity to talk more generally about data quality.
Data quality is one of those quirky things that you notice most when it is not there. Lousy data introduces all manner of inefficiencies and costs within an organization. Yet the time and expense to clean up data can seem overwhelming.
Achieve Data Quality Through Early Validation
The best way to achieve quality data is to validate it as early in the process as possible. Some examples:
- Confirm an address's validity with the USPS while the customer is still on the phone with you.
- Cross-reference other data already in the system (and leverage the relational aspect of your database design).
- Build as much data validation into the data fields of an application as you can (a brief sketch follows this list), through the use of:
- Data types (e.g., an invoice total is numeric; free text should not be allowed)
- Range validations (e.g., an invoice total must be zero or a positive number)
- Programmatic triggers (e.g., calculate the invoice total from its component line items)
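To make these concrete, here is a minimal T-SQL sketch of all three techniques. The tables and column names are hypothetical, purely for illustration:

```sql
-- Hypothetical customer and invoice tables illustrating catch-it-early validation.
CREATE TABLE dbo.Customer (
    CustomerID int IDENTITY(1,1) PRIMARY KEY,
    Name       nvarchar(100) NOT NULL
);

CREATE TABLE dbo.Invoice (
    InvoiceID    int IDENTITY(1,1) PRIMARY KEY,
    CustomerID   int NOT NULL REFERENCES dbo.Customer (CustomerID), -- cross-reference existing data
    InvoiceTotal decimal(12,2) NOT NULL DEFAULT (0),                -- data type: numeric only
    CONSTRAINT CK_Invoice_Total CHECK (InvoiceTotal >= 0)           -- range: zero or positive
);

CREATE TABLE dbo.InvoiceLine (
    InvoiceLineID int IDENTITY(1,1) PRIMARY KEY,
    InvoiceID     int NOT NULL REFERENCES dbo.Invoice (InvoiceID),
    LineAmount    decimal(12,2) NOT NULL CHECK (LineAmount >= 0)
);
GO

-- Programmatic trigger: recalculate each affected invoice total from its line items.
CREATE TRIGGER dbo.trg_InvoiceLine_Total
ON dbo.InvoiceLine
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE i
    SET InvoiceTotal = ISNULL(t.LineSum, 0)
    FROM dbo.Invoice AS i
    LEFT JOIN (SELECT InvoiceID, SUM(LineAmount) AS LineSum
               FROM dbo.InvoiceLine
               GROUP BY InvoiceID) AS t
           ON t.InvoiceID = i.InvoiceID
    WHERE i.InvoiceID IN (SELECT InvoiceID FROM inserted
                          UNION
                          SELECT InvoiceID FROM deleted);
END;
```

With constraints like these in place, the database itself rejects garbage before it is ever stored, rather than leaving the cleanup for later.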
In a “catch-it-early” approach, you ensure quality by not allowing garbage data in the first place.
Correcting After the Fact
Validating data early is not always an option, in which case you are resigned to cleaning up after the fact:
- Provide reports that show data outliers so that staff can manually review and clean them up.
- Perform brute-force cleanup of bulk data using scripting (SQL or regular expressions, for those familiar with them); a sketch of both approaches follows this list.
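Assuming a hypothetical dbo.CustomerAddress table, the two approaches might look like this in T-SQL:

```sql
-- Outlier report: surface suspect postal codes for manual review.
SELECT CustomerID, StreetAddress, City, StateCode, PostalCode
FROM dbo.CustomerAddress
WHERE PostalCode NOT LIKE '[0-9][0-9][0-9][0-9][0-9]'                       -- not a 5-digit ZIP
  AND PostalCode NOT LIKE '[0-9][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'; -- not ZIP+4

-- Brute-force bulk cleanup: standardize one common street-suffix variant.
UPDATE dbo.CustomerAddress
SET StreetAddress = REPLACE(StreetAddress, ' Avenue', ' Ave')
WHERE StreetAddress LIKE '% Avenue%';
```

The report catches what scripting cannot safely fix, while the UPDATE handles the patterns that are predictable enough to correct in bulk.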
As we talk about data cleanup, there are a couple of clever tools that Microsoft SQL Server provides.
Data Quality Services (DQS) is a framework for defining business rules to be applied to your data. It's a valuable tool for identifying (and removing) duplicates, for enforcing adherence to a particular vocabulary, and for a range of other data-cleansing tasks.
A unique capability of Data Quality Services is the ability to draw on external reference data sources. These include address and phone validity checks from MelissaData, or validation against Dun & Bradstreet's database of 225+ million businesses. Most of these services are free for a certain number of records per month, with a sliding scale per request after that allotment is exceeded.
About Those Street Addresses…
Back to the topic of Postal Service address verification: you can programmatically validate data against MelissaData geocoding files, and the service can alert you to problems or automatically improve the quality of the address provided. These checks can be "baked into" your application to validate addresses as they are entered. They can also be applied to bulk data using an interactive data screen or through SQL Server Integration Services (SSIS) packages. To learn more about available reference sources, visit the Microsoft DataMarket.
Microsoft provides yet another tool, Master Data Services (MDS), that helps you harmonize master data across different transactional or operational systems. When you try to blend disparate data systems into a data warehouse, the ability to define formal hierarchies of data becomes important. The master data topic goes a bit beyond the scope of this article, but the key takeaway is this: all of these tools and extensions are part of the core Microsoft SQL Server offering, and they can be leveraged within your application to validate or improve the quality of your data.
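As a brief illustration of how MDS fits into a SQL Server workflow: for each entity you define, MDS generates an entity-based staging table and a matching stored procedure that processes staged rows into a model version. The Customer entity, attribute values, and batch tag below are hypothetical:

```sql
-- Stage a new leaf member for a hypothetical Customer entity.
-- ImportType 0 = create new members; ImportStatus_ID 0 = ready to be processed.
INSERT INTO stg.Customer_Leaf (ImportType, ImportStatus_ID, BatchTag, Code, Name)
VALUES (0, 0, N'CRM_Load_001', N'CUST-1001', N'Contoso Ltd.');

-- Process the staged batch into the model version.
EXEC stg.udp_Customer_Leaf
     @VersionName = N'VERSION_1',
     @LogFlag     = 1,
     @BatchTag    = N'CRM_Load_001';
```

Once members are in MDS, subscription views can expose the harmonized data back to downstream systems such as a data warehouse.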
There are many ways of tackling the data quality challenge, some of which are reviewed in this article. Please let us know if we can help your organization improve the quality of your data.