If you’ve made it to this blog you’ve probably heard the term “persistent” thrown around with ETL, and are curious about what they really mean together. Extract, Transform, Load (ETL) is the generic concept of taking data from one or more systems and placing it in another system, often in a different format. Persistence is just a fancy word for storing data. Simply put, persistent ETL is adding a storage mechanism to an ETL process. That pretty much covers the what, but the why is much more interesting…
ETL processes have been around forever. They are a necessity for organizations that want to view data across multiple systems. This is all well and good, but what happens if that ETL process gets out of sync? What happens when the ETL process crashes? What about when one of the end systems updates? These are all very real possibilities when working with data storage and retrieval systems. Adding persistence to these processes can help ease or remove many of these concerns.
With a persistent ETL process, a full historical view is recorded as the data is moved. Direct copies of the source data are saved, and depending on the implementation, the output can be saved as well. Holding a full history of transactions enables developers and architects to investigate and troubleshoot issues as they arise. As someone who has done his fair share of troubleshooting, I can assure you that it’s easier to diagnose an issue if you can recreate or walk through the steps that caused it. Additionally, capturing history creates audit capabilities, a key requirement for most organizations.
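To make the idea of recording history as data moves a bit more concrete, here is a minimal sketch in Python. It assumes a local SQLite database stands in for the persistence layer; the `etl_history` table, the `transform` placeholder, and the function names are all hypothetical, not part of any particular ETL product.

```python
import json
import sqlite3
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Create the (hypothetical) history table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS etl_history (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               recorded_at TEXT NOT NULL,
               source TEXT NOT NULL,
               raw TEXT NOT NULL,   -- direct copy of the source record
               output TEXT          -- transformed record, if saved
           )"""
    )
    return conn


def transform(record):
    """Placeholder transformation: tidy up the name field."""
    return {"id": record["id"], "name": record["name"].strip().title()}


def move_record(conn, source, record, target):
    """Persist the raw record and its output, then load into the target."""
    output = transform(record)
    conn.execute(
        "INSERT INTO etl_history (recorded_at, source, raw, output) "
        "VALUES (?, ?, ?, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            source,
            json.dumps(record),
            json.dumps(output),
        ),
    )
    conn.commit()
    target.append(output)  # the "load" step; a list stands in for the target system
    return output


def history_for(conn, source):
    """Return (raw, output) pairs for one source, oldest first."""
    rows = conn.execute(
        "SELECT raw, output FROM etl_history WHERE source = ? ORDER BY id",
        (source,),
    )
    return [(json.loads(raw), json.loads(out)) for raw, out in rows]
```

Calling `history_for(conn, "crm")` after a few `move_record` calls returns each raw record next to what the transformation produced from it, which is the kind of step-by-step replay that makes troubleshooting and auditing easier.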
If set up properly, a persistent ETL tool could serve as the source of record for auditing all data in an organization.
One major concern with adding persistence to an ETL process is performance. While it may not seem intuitive, adding persistence can often improve performance. Standard ETL tools rely on in-memory transformations, which often limit the quantity or complexity of records that can be migrated at any one time. By persisting data instead, the system can execute the process incrementally. Multiple extracts, step-by-step transformations, and multiple loads all become possible once intermediate data is persisted. As long as the system is tuned properly, adding persistence should provide the means to improve both functionality and performance.
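The incremental approach can be sketched roughly as follows. This is an illustration under stated assumptions, not how any specific tool works: a SQLite `staging` table (hypothetical) holds extracted rows, and each stage processes a small batch and records its progress, so no stage ever needs the full data set in memory. The upper-casing transform and the dictionary target are placeholders.

```python
import sqlite3


def open_staging(path=":memory:"):
    """Create the (hypothetical) staging table that persists pipeline state."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS staging (
               id INTEGER PRIMARY KEY,
               raw TEXT NOT NULL,
               transformed TEXT,
               loaded INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def extract(conn, records):
    """Stage (id, raw) pairs; rows already staged are skipped, so re-runs are safe."""
    conn.executemany(
        "INSERT OR IGNORE INTO staging (id, raw) VALUES (?, ?)", records
    )
    conn.commit()


def transform_batch(conn, batch_size):
    """Transform up to batch_size untransformed rows; return how many were done."""
    rows = conn.execute(
        "SELECT id, raw FROM staging WHERE transformed IS NULL "
        "ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    conn.executemany(
        "UPDATE staging SET transformed = ? WHERE id = ?",
        [(raw.upper(), row_id) for row_id, raw in rows],  # placeholder transform
    )
    conn.commit()
    return len(rows)


def load_batch(conn, target, batch_size):
    """Load up to batch_size transformed rows into target; return how many."""
    rows = conn.execute(
        "SELECT id, transformed FROM staging "
        "WHERE transformed IS NOT NULL AND loaded = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for row_id, value in rows:
        target[row_id] = value  # a dict stands in for the target system
    conn.executemany(
        "UPDATE staging SET loaded = 1 WHERE id = ?",
        [(row_id,) for row_id, _ in rows],
    )
    conn.commit()
    return len(rows)
```

Because progress lives in the staging table rather than in memory, a crashed run can pick up where it left off: calling `transform_batch` and `load_batch` in a loop until each returns 0 finishes whatever work remains.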
Persistent ETL for the Edge
Here at HarperDB we are major proponents of edge computing, so I should mention how this applies to an edge database and IoT sensor data. An edge database typically runs on a small microcomputer with minimal storage, while being bombarded with IoT sensor data at incredibly high throughput. This means that an edge device will hold a limited amount of historical data before it fills up, causing the device to flush older data. With persistent ETL, as the data is extracted from edge devices, it is stored elsewhere. The ETL tool can effectively become the source of record for aging edge data, which will free up storage on the edge devices.
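As a rough sketch of that hand-off, the Python below copies aging readings from an edge store to a central archive and only then deletes them from the edge. Two in-memory SQLite databases stand in for the edge device and the central store; the `readings` and `archive` tables and the `archive_older_than` function are hypothetical and do not represent HarperDB's API. It also assumes a single edge device, since row ids from several devices could collide.

```python
import sqlite3


def make_db(table):
    """Create an in-memory database with one sensor-readings table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE {table} ("
        "id INTEGER PRIMARY KEY, sensor TEXT, ts INTEGER, value REAL)"
    )
    return conn


def archive_older_than(edge, central, cutoff_ts):
    """Copy readings older than cutoff_ts to the archive, then free edge space."""
    rows = edge.execute(
        "SELECT id, sensor, ts, value FROM readings WHERE ts < ? ORDER BY id",
        (cutoff_ts,),
    ).fetchall()
    if not rows:
        return 0
    # Persist centrally first; the connection context manager commits on success.
    with central:
        central.executemany(
            "INSERT OR IGNORE INTO archive (id, sensor, ts, value) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    # Delete from the edge only after the archive write has been committed.
    with edge:
        edge.executemany(
            "DELETE FROM readings WHERE id = ?", [(row[0],) for row in rows]
        )
    return len(rows)
```

Writing to the archive before deleting from the edge is the important design choice here: if the process fails between the two steps, the worst case is a reading that exists in both places, and `INSERT OR IGNORE` makes the retry harmless.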
While adding a persistence layer to your ETL processes may seem complicated, it can provide a solid foundation for developers and architects to unify systems and solve some of the age-old ETL problems. Historical transaction records provide full context throughout the organization for audit and troubleshooting purposes. Performance improvements can be gained through incremental processing. Finally, edge data can be extracted and stored permanently, freeing up limited edge storage. Despite the perceived complexity, persistent ETL may just be worth the investment.