How bad data can lead to costly downtime

Published: Sep 15, 2020 by Greg Woodbury

ETL tools are really good at quickly getting data into a system. Where they struggle is in building robust error handling processes. Data representations tend to change over time as the business evolves, and when data in a source system changes unexpectedly in either type or representation, you may find yourself fighting costly downtime unless you have well-built error handling. Take, for example, dollar values in a file that suddenly arrive with a leading dollar sign where it used to be omitted. That is typically a pretty easy fix: update the source definition to account for the new dollar sign and you are off and running. But what if the change doesn't affect the entire data source, only a single record or a handful of records? While many tools have the ability to kick out bad data for additional error handling, in our experience people tend to default to either allowing errors to stop further processing, or, if they are handling the errors, sending email alerts to a single technology team with only vague details about the error that has occurred.
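
As a rough sketch of what per-record handling can look like (the field name and the normalization rule below are assumptions for illustration, not taken from any particular tool), the idea is to attempt a sensible adjustment such as stripping the new leading dollar sign, and to set aside only the records that still fail rather than stopping the whole load:

    from decimal import Decimal, InvalidOperation

    def parse_amount(raw: str) -> Decimal:
        """Parse a dollar amount, tolerating a leading '$' the feed did not previously send."""
        cleaned = raw.strip()
        if cleaned.startswith("$"):  # sensible on-the-fly adjustment for the new format
            cleaned = cleaned[1:]
        return Decimal(cleaned.replace(",", ""))

    def load_records(records):
        """Split incoming rows into loadable records and rejects that carry their own context."""
        good, rejected = [], []
        for row in records:
            try:
                row["amount"] = parse_amount(row["amount"])  # 'amount' is a hypothetical field name
                good.append(row)
            except (InvalidOperation, KeyError) as exc:
                # Kick out only the bad record, with enough detail to diagnose it later,
                # instead of halting processing for the entire file.
                rejected.append({"row": row, "error": repr(exc)})
        return good, rejected

The point is not this particular cleanup rule, but that a single malformed record ends up in a reject pile with context instead of halting the load or triggering a vague alert.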

When an error occurs in the system and an email alert is sent, the most common things we see happen are:

  • Alert is sent to technology team
  • Technology team investigates and confirms it is a problem with the data (5-10 min)
  • Technology team contacts operations team to contact vendor or fix issue internally (5-10 min to explain and hand over issue)
  • Errors are patched or data is resent (variable; a few minutes to a few hours)
  • System processes data (variable)
  • Data correctness is confirmed by the team (5-10 min)
  • Data becomes available to firm

The complexity of a system really shows in the steps a team goes through to resolve an issue with bad data: even a simple fix can take upward of 30 minutes.

While building data ingestion workflows, it is good to keep in mind the process the business currently goes through and, at a minimum, try to route meaningful errors to the appropriate team. Ideally, the system would do some baseline troubleshooting and make sensible on-the-fly adjustments to the data to try to process it, or, in the event of total failure, provide good tooling and context about the error so the team can quickly research the issue and make the necessary edits.
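
One way to picture that routing (the team names, error categories, and notification helper below are assumptions for the example, not prescriptions) is to categorize errors at the point of failure and send them, with their full context, to the team best placed to act on them:

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    # Hypothetical routing table: which team should see each category of error first.
    ROUTES = {
        "vendor_format_change": "operations",
        "schema_mismatch": "technology",
    }

    @dataclass
    class DataError:
        source: str        # feed or file the record came from
        record_id: str     # identifier of the offending record
        field_name: str    # field that failed validation
        raw_value: str     # the value exactly as received
        category: str = "unknown"
        occurred_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def route_error(err: DataError, notify) -> None:
        """Send the error, with its context, to the appropriate team.

        `notify` stands in for whatever email or chat integration is actually in use.
        """
        team = ROUTES.get(err.category, "technology")
        notify(
            team,
            subject=f"[{err.source}] bad value in {err.field_name}",
            body=(
                f"Record {err.record_id} had value {err.raw_value!r} "
                f"at {err.occurred_at.isoformat()} (category: {err.category})."
            ),
        )

An alert like this that goes straight to the operations team, with the offending record and raw value attached, removes the investigate-and-hand-over steps in the list above from the critical path.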

Often, whether these kinds of workflows get built comes down to the time cost. "Handling errors" is a vague idea, and many different types of errors may occur, so it is difficult to define what should be handled and how. We see many teams opt to forgo building error handling altogether because the relative incremental cost of a single failure can be small. That cost does add up over time, however.

With the combination of some simple analysis of the current business process and some logical, sensible, systematic checks on the data, you can cut down on costly downtime due to unexpected data.