According to Wikipedia, “Polyglot persistence is the concept of using different data storage technologies to handle different data storage needs within a given software application.” James Serra writes in his blog, “Polyglot Persistence is a fancy term to mean that when storing data, it is best to use multiple data storage technologies, chosen based upon the way data is being used by individual applications or components of a single application. Different kinds of data are best dealt with different data stores.”
The logic behind this methodology, according to Wikipedia and most other sources, is: “There are numerous databases available to solve different problems. Using a single database to satisfy all of a program's requirements can result in a non-performant, "jack of all trades, master of none" solution. Relational databases, for example, are good at enforcing relationships that exist between various data tables. To discover a relationship or to find data from different tables that belong to the same object, a SQL join operation can be used. This might work when the data is smaller in size, but becomes problematic when the data involved grows larger. A graph database might solve the problem of relationships in the case of Big Data, but it might not solve the problem of database transactions, which are provided by RDBM systems. Instead, a NoSQL document database might be used to store unstructured data for that particular part of the problem. Thus, different problems are solved by different database systems, all within the same application.”
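To make the quoted idea concrete, here is a minimal sketch of polyglot persistence inside a single application, assuming a PostgreSQL driver (psycopg2) for the relational side and a MongoDB driver (pymongo) for the document side; the connection strings, table, and collection names are hypothetical placeholders.

```python
import psycopg2                  # relational store for structured, transactional facts
from pymongo import MongoClient  # document store for unstructured payloads

pg = psycopg2.connect("dbname=shop user=app")      # placeholder DSN
mongo = MongoClient("mongodb://localhost:27017")   # placeholder URI

def record_order(order_id, customer_id, total, raw_payload):
    # The structured, relational facts go to the RDBMS...
    with pg, pg.cursor() as cur:
        cur.execute(
            "INSERT INTO orders (id, customer_id, total) VALUES (%s, %s, %s)",
            (order_id, customer_id, total),
        )
    # ...while the unstructured document goes to the NoSQL store.
    mongo.shop.order_payloads.insert_one({"order_id": order_id, "doc": raw_payload})
    # Nothing here makes the two writes succeed or fail together across both stores.
```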
As James Serra notes, this is a “fancy term”, and it certainly sounds smart. It has clearly become the reigning paradigm for most large-scale data management implementations; however, I would argue that it is a terrible idea. Why is it such a bad idea? The answer is pretty simple: consistency, cost, and complexity.
The Three Cs: Consistency, Cost, and Complexity
The idea that you should adopt the right tool for the job seems sound, and when it comes to implementation in the software world it often is. Windows and OS X are great operating systems for end-user interfaces on laptops, mobile devices, and desktops, but far from ideal for server environments. Conversely, I wouldn’t want to support my sales team on Linux; I’ve been there and done that during my days at Red Hat, and it was awful.
I think at this point it’s pretty clear that data is the lifeblood of any organization, regardless of size or industry. One could argue this belief has gone too far. I recently listened to a podcast where the CEO of one of the world’s largest auto manufacturers claimed they weren’t a car company anymore, but a “data platform company”. That made me roll my eyes; my firm belief is that car companies should build cars. Still, data is a vital asset to any organization, and we all know it.
So, going back to my three Cs - consistency, cost, and complexity - let’s examine how the pervasive concept of polyglot persistence threatens each of those areas of an organization’s data management strategy.
While my career has taken many twists and turns, I have spent basically the entirety of it trying to achieve one single goal for organizations like Red Hat, Nissan North America, The Charlotte Hornets, Parkwood Entertainment, and many others - getting a single view of their data. I have learned an enormous amount on the hundreds of projects I have worked on trying to provide a single view of the truth, and I have fought one battle time and time again: consistency.
By introducing many different databases into their technology ecosystems, as the Wikipedia article above suggests, companies inevitably create a situation where their data is inconsistent. Add to that the fact that we are consuming data at a frequency never before seen, and it becomes nearly impossible to keep the data in sync. The very nature of polyglot persistence ensures this, as it states that certain types of data should live in certain types of systems. That would be fine if there were an easy way to access it holistically across these systems, but that simply doesn’t exist. A year or two ago, many folks argued with me that the solution to this problem was data lakes like Hadoop and other technologies, but I haven’t heard that argument very often in the last 6 to 12 months. Why? Because data goes to data lakes to die. They are slow, expensive, difficult to maintain, and make it challenging to get a near real-time view of your data.
The issue is that this model requires a significant reliance on memory and CPU in each data silo to perform on-read transformations and calculations of the data. These systems are being asked to do double duty: the primary function they have been designated for in the polyglot model, plus a second role as a worker for a data lake. This overtaxes these systems, adds latency and data corruption, and creates a lot of complexity.
I fully agree that RDBMSs are ideal for relationships and transactions but fail at massive scale. That said, what you end up with in a polyglot persistence paradigm is an inability to get a consistent view of your data across your entire organization.
A Database for IoT and the Convergence of OT and IT
All data is valuable because of its relationships. To truly achieve the promise of Industry 4.0, it will be essential to drive a convergence of OT and IT - that is, to combine operational technology (OT) data with IT data. OT data comes at a very high frequency; a single electrical relay can put out 2,000 records a second. One project we are working on, which is smaller in scale, has 40 electrical relays - that’s 80,000 records a second. To be valuable, this power consumption data needs to be combined with production data in other systems like ERPs. Those relationships will drive the value of the data. For example, understanding in real time what it costs in power to produce a unit of inventory is a question that needs to be answered, as in the sketch below. This requires a database for IoT as well as a database that can functionally handle IT data.
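As a rough illustration of that calculation, here is a minimal sketch that reduces one time window of relay readings to a power cost per unit, joined against production counts from the ERP; the field names and the price per kWh are illustrative assumptions, not figures from a real project.

```python
from collections import defaultdict

KWH_PRICE = 0.12  # $/kWh -- illustrative only

def power_cost_per_unit(relay_readings, units_produced):
    """relay_readings: (production_line, kwh) samples streamed from the relays
    during one time window; units_produced: {production_line: units} reported
    by the ERP for the same window. Returns {production_line: power cost per unit}."""
    kwh_by_line = defaultdict(float)
    for line, kwh in relay_readings:
        kwh_by_line[line] += kwh
    return {
        line: (kwh_by_line[line] * KWH_PRICE) / units
        for line, units in units_produced.items()
        if units > 0
    }

# Tens of thousands of readings per second get reduced to one window,
# then joined with the ERP's production counts for that same window:
print(power_cost_per_unit([("line-1", 0.4), ("line-1", 0.5)], {"line-1": 3}))
```

The calculation itself is trivial; the hard part, as the next paragraph shows, is where the two sides of the join live and how they stay consistent.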
Most folks would use a polyglot persistence model to achieve this. They would use a highly scalable streaming data solution, or an edge database, to consume the OT power data. They would then use an RDBMS to consume the inventory data. How, then, do we correlate those in real time? Most likely by sending them both to a third system. Things will get lost in transit, integrations will break, and consistency is lost.
The True Cost of Polyglot Persistence
Furthermore, this is highly complex. As we add additional systems for each of these data types, we need additional specialized resources, in both people and hardware, to maintain them. We also need multiple integration layers that often lack a dynamic nature and ultimately become the failure points in these architectures. The more layers we add, the more challenging it becomes to verify consistency and to manage the complexity. Housing the same data in multiple places also adds significant storage costs, on top of increased compute costs.
More than anywhere else, though, we are paying in lost productivity. How long does it take to triage an issue in your operational data pipeline when you have 5 to 7 different sources of the truth? How do you determine what is causing data corruption?
There is also a major risk in terms of compliance. If we look at the various data breaches across social media companies, credit bureaus, financial institutions, and so on, how much time has it taken for them to diagnose the real effect of those breaches? Why is that? The answer is pretty simple: they don’t have a holistic picture of their data, nor a unified audit trail on that data. This is becoming more and more dangerous as these breaches increasingly affect personal data and as more and more things become connected.
What is the solution?
I am not suggesting we go back to the days of monolithic RDBMS environments; I think it’s clear that paradigm is over. Nor am I suggesting that we abandon many of the products we are currently using. Many of these developer tools are awesome for different uses. Search tools like Elasticsearch have become vital parts of the technology ecosystem, and in-memory databases play important roles where very high-speed search on relational data is needed.
What I am suggesting is that we need to look at data architectures that provide a single persistence model across all these tools, providing those tools with the stability and consistency that they require. Creating data planes with persistence that can function as middleware layers as well as stream processing engines will be key to reducing complexity.
If we stop relying on each of these individual tools for long-term persistence, and instead view the data inside them as transient, we can accept that their version of the data might be out of sync and that they might crash. If we put persistence in a stream-processing layer with ACID compliance and very high stability, we can then rely on that layer to provide a holistic view of our data. Rather than overtaxing these systems with data lakes, whose storage algorithms make performant transformation and aggregation impossible, we should let these end-point data stores do their jobs and provide that functionality in a layer that can be used as an operational data pipeline.
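To close, here is a toy sketch of that division of responsibility, using an append-only file as a stand-in for the durable, ACID stream layer; the event shape and helper names are illustrative assumptions rather than any particular product.

```python
import json
import os

LOG_PATH = "events.log"  # stand-in for a durable, ACID stream/persistence layer

def append_event(event):
    # The log is the only place a write must land to be considered "true".
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()
        os.fsync(f.fileno())

def rebuild_view(project):
    # Any downstream store (search index, cache, analytics table) can be
    # dropped and repopulated by replaying the log, so its copy of the data
    # is transient rather than a second source of truth.
    view = {}
    with open(LOG_PATH) as f:
        for line in f:
            project(view, json.loads(line))
    return view

append_event({"order_id": 1, "status": "created"})
append_event({"order_id": 1, "status": "shipped"})
# Latest state per order, derived entirely from the log:
print(rebuild_view(lambda view, e: view.update({e["order_id"]: e})))
```

In a real architecture the log would be a proper stream-processing platform with the ACID guarantees described above, but the division of responsibility is the same: the persistence layer holds the truth, and every end-point data store is a disposable, rebuildable view of it.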