When designing an enterprise extract-transform-load (ETL) pipeline for moving data, the most critical and complex part is the transformation. Extracting data from a source and loading the transformed result to the destination are both straightforward operations, but the tasks that take place in between are where the real work happens. Data cleansing, deduplication, reshaping, and validation may all take place during the transformation phase.

As part of the transformation, it’s inevitable that some data will be discovered to be suspect. Whether it is a possible duplicate, a failed validation, or a data type error, there will be some deficiencies that cannot reliably be corrected within the ETL process. When this happens, you are left with one of three options:

  • Ignore the deficiency and load the data, as is, to the destination
  • Discard the suspect data
  • Send the suspect data to a triage file or table for further analysis and manual cleanup

Any of these options may be valid, depending on the nature of the data, and I have used all three patterns in the past. As an ETL professional, however, my strongest preference is to triage the suspect data. Doing so brings several benefits, including:

  • Allows for a closer inspection of the issue to see whether it merits a change in the ETL logic or business rules
  • Permits a manual edit for cases that can’t easily be corrected programmatically
  • Establishes an audit trail to show which data was removed from the ETL pipeline, and why

In this article, I’ll share the principles of using a triage pipeline in your ETL process.

Sending Suspect Data to Triage

As shown below, using a triage system for capturing suspect data involves adding another output to your data transformation layer. The transformation logic is responsible for bifurcating the stream to separate the good data from the suspect data.
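
To make the bifurcation step concrete, here is a minimal sketch in Python. It assumes rows arrive as dictionaries, and the column names and validate_row rules are placeholders for your own business rules; treat it as an illustration of the pattern, not a finished implementation.

    # Hypothetical validation rules; swap in your own business rules here.
    def validate_row(row):
        """Return None if the row is good, or a reason string if it is suspect."""
        if row.get("customer_id") is None:
            return "missing customer_id"
        if not isinstance(row.get("amount"), (int, float)):
            return "non-numeric amount"
        return None

    def bifurcate(rows):
        """Split the incoming stream into (good_rows, triage_rows)."""
        good, triage = [], []
        for row in rows:
            reason = validate_row(row)
            if reason is None:
                good.append(row)
            else:
                # Keep the original row intact and record why it was rejected.
                triage.append({**row, "triage_reason": reason})
        return good, triage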

When building an ETL architecture that includes a triage output, you will need to add three major elements to your process:

  • The triage table(s) or output location to use for this process
  • Additional transformation logic to filter the suspect data from the good data
  • A notification mechanism to alert the ETL administrator or data steward that there is data in triage to be reviewed (see the sketch following this list)
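
For the notification piece, something as simple as a scheduled check against the triage table can do the job. The sketch below assumes a hypothetical etl_triage table in SQLite with a reviewed flag, plus placeholder addresses and host names; in practice the alert might be an email, a chat message, or a notification from your monitoring platform.

    import sqlite3
    import smtplib
    from email.message import EmailMessage

    def notify_if_triage_pending(db_path="warehouse.db",
                                 recipient="data.steward@example.com",
                                 smtp_host="localhost"):
        # Count the triage rows that no one has reviewed yet.
        with sqlite3.connect(db_path) as conn:
            pending = conn.execute(
                "SELECT COUNT(*) FROM etl_triage WHERE reviewed = 0"
            ).fetchone()[0]

        if pending == 0:
            return  # nothing in triage, no alert needed

        msg = EmailMessage()
        msg["Subject"] = f"ETL triage: {pending} row(s) awaiting review"
        msg["From"] = "etl@example.com"
        msg["To"] = recipient
        msg.set_content("Suspect rows are waiting in the triage table for review.")

        with smtplib.SMTP(smtp_host) as smtp:
            smtp.send_message(msg)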

The above flowchart shows the bare minimum that one must do to build a triage pipeline. In this scenario, any data sent to triage would need to be manually moved to the primary output table. This pattern is useful when the volume of data sent to triage is very low, or when the triaged data requires a significant amount of manual work to fix.

For higher-volume operations that regularly route data into triage, you’ll want to remove as much of the tedious, manual work as possible. In those cases, there should be a path that allows data to flow back into the data integration pipeline after cleansing. As shown below, a secondary process using triage as a source (rather than a destination) loads the manually cleansed data back through the data transformation process.
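
Here is one way that secondary process might look, continuing the earlier sketch. It reuses the bifurcate() function from above and assumes the same hypothetical etl_triage table plus a sales destination table; the column list, flags, and table names would all change to fit your own schema.

    import sqlite3

    def reprocess_triage(db_path="warehouse.db"):
        with sqlite3.connect(db_path) as conn:
            # Pull only rows a human has reviewed and that haven't been reprocessed.
            cursor = conn.execute(
                "SELECT triage_id, customer_id, amount FROM etl_triage "
                "WHERE reviewed = 1 AND reprocessed = 0"
            )
            rows = [
                {"triage_id": tid, "customer_id": cid, "amount": amt}
                for tid, cid, amt in cursor
            ]

            # Run the corrected rows back through the normal transformation logic.
            good, still_suspect = bifurcate(rows)

            # Load the rows that now pass validation, and flag them in triage so
            # they aren't picked up again. Anything still suspect simply stays in
            # triage for another round of cleanup.
            conn.executemany(
                "INSERT INTO sales (customer_id, amount) VALUES (?, ?)",
                [(r["customer_id"], r["amount"]) for r in good],
            )
            conn.executemany(
                "UPDATE etl_triage SET reprocessed = 1 WHERE triage_id = ?",
                [(r["triage_id"],) for r in good],
            )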

As an alternative to the above pattern, the cleansed data could be loaded directly into the final destination table. Keep in mind that you may be bypassing some of the business logic here, so make sure that short-circuiting the process like this doesn’t cause other issues with the data.

Caring for the Triaged Data

The triage output isn’t somewhere that data should go to die. For this to work properly, there must be some curation of the data that lands in triage. Even if the triaged data is used only for auditing purposes, there should be some periodic review to make sure that the data transformation logic is sending only the suspect data to that output. Such a review can also be used to check for any needed improvements in the data transformation logic, to reduce the amount of data falling into that bucket.

In many cases, caring for this data will be a manual process. After all, this is a bin for data that could not be cleaned up programmatically. However, when you choose to address this triaged data, make sure that your approach satisfies the needs of the business. As is true in so many data movement operations, these decisions should be driven by the business need rather than the technical components.

Final Thoughts

In using a triage pattern to identify and store suspect data, here are a few final thoughts to keep in mind:

  • The triage pattern is not going to be valuable for every load. In some cases, you’ll want to fail the entire load if any data fails validation (such as when loading general ledger data, which should succeed or fail as a whole); in other cases, you’ll simply let the suspect data load as-is.
  • While it is possible to store the triage data in a file, you’ll have an easier time managing and cleaning it up if you store the data in a relational database table.
  • When loading data to a triage output, make sure you capture the reason it’s being sent to triage. The curator of that triage table could deduce the reason through trial and error, but it’s far easier to add a comment or other marker on each row indicating why it was marked as suspect (see the sketch following this list).
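
Putting those last two points together, here is a rough sketch of what a relational triage table might look like, again using SQLite and the same hypothetical column names from the earlier examples. The schema is illustrative only; the important parts are the triage_reason column and the housekeeping columns used for review and reprocessing.

    import sqlite3

    TRIAGE_DDL = """
    CREATE TABLE IF NOT EXISTS etl_triage (
        triage_id     INTEGER PRIMARY KEY,
        customer_id   TEXT,
        amount        TEXT,            -- kept as text in case the value failed type checks
        triage_reason TEXT NOT NULL,   -- why the row was routed to triage
        source_system TEXT,
        triaged_at    TEXT DEFAULT CURRENT_TIMESTAMP,
        reviewed      INTEGER DEFAULT 0,
        reprocessed   INTEGER DEFAULT 0
    )
    """

    def write_triage(db_path, triage_rows, source_system="orders_feed"):
        # triage_rows are the suspect rows produced by bifurcate(), each carrying
        # a triage_reason value alongside the original columns.
        with sqlite3.connect(db_path) as conn:
            conn.execute(TRIAGE_DDL)
            conn.executemany(
                "INSERT INTO etl_triage "
                "(customer_id, amount, triage_reason, source_system) "
                "VALUES (?, ?, ?, ?)",
                [
                    (row.get("customer_id"), row.get("amount"),
                     row["triage_reason"], source_system)
                    for row in triage_rows
                ],
            )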

Conclusion

Building a triage output pipeline is useful for addressing row-by-row instances of bad or suspect data. While not appropriate for every ETL process, this design pattern can help improve the quality of your data by routing otherwise unusable data to a table or file where it can be reviewed and potentially corrected.