Too many IoT sensors, not enough disks.
You read the GE report and got leadership on board with big data and Industrial IoT. You enabled all your edge computing IoT sensors and monitors. You painstakingly designed and implemented your data warehouse for analytics with custom ETL solutions. You turned on the data fire hose.
Then you waited a week.
Wow, that’s a lot of data. Wow, that’s a LOT of data. Wow, where are we going to put all that data, and how much is it going to cost? They say “storage is cheap.” This is true until you have to massage and duplicate all of this unstructured data into a SQL data warehouse so someone can run reports on it. The escalating monthly cloud storage bill might bring your accountants to tears. And the spending is only growing: IDC estimates that global spending on IIoT will reach $1.4 trillion by 2021.
So, how can we solve this?
Possible Remedies:
Selective data cleansing
If you have very detailed technical specifications and can definitively decide what data needs to be retained and what can be removed, you can exclude the unneeded data from replication. This will work in some cases, but data you dropped could suddenly become relevant to a business decision down the road. It also constrains later technical decisions, because the structure of the data warehouse and the ETL solutions that feed it are typically tightly coupled.
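As a rough illustration of what selective retention could look like inside an ETL step, here is a minimal Python sketch. The field names and the `RETAINED_FIELDS` allow-list are hypothetical; in practice they would come from your own specification.

```python
from typing import Any, Dict, Iterable, Iterator, Set

# Hypothetical allow-list: the fields an (assumed) specification says to keep.
RETAINED_FIELDS: Set[str] = {"device_id", "timestamp", "temperature_c"}


def select_fields(
    readings: Iterable[Dict[str, Any]],
    retained: Set[str] = RETAINED_FIELDS,
) -> Iterator[Dict[str, Any]]:
    """Yield each reading with only the retained fields; everything else is dropped."""
    for reading in readings:
        yield {key: value for key, value in reading.items() if key in retained}


# Made-up sample readings, as a sensor might emit them before replication.
raw = [
    {"device_id": "pump-1", "timestamp": 1, "temperature_c": 71.5,
     "debug_flags": 3, "firmware": "1.0.2"},
    {"device_id": "pump-2", "timestamp": 1, "temperature_c": 68.0,
     "debug_flags": 0, "firmware": "1.0.2"},
]

slim = list(select_fields(raw))
```

Note that the dropped fields (`debug_flags` and `firmware` here) are gone for good once the filter runs, which is exactly the risk described above.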
Move historical data onto cold storage
It’s possible that historical data is no longer useful. If you are not performing long-term trend analysis, you could move data out of your warehouse and onto offline storage. This dark data would still be available when needed, with the help of an admin with a lot of patience. This solution shifts the cost from storage to a person, who will need to manually move the data back into a database that may have a different structure than it had when the data was originally archived.
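A minimal sketch of such an archival step, assuming a hypothetical `readings` table and using SQLite from Python's standard library as a stand-in for the warehouse. Rows older than a cutoff are written to a compressed JSON-lines file and then deleted from the live table.

```python
import gzip
import json
import os
import sqlite3
import tempfile


def archive_old_rows(conn: sqlite3.Connection, cutoff_ts: int, archive_path: str) -> int:
    """Write rows with ts < cutoff_ts to a gzip JSON-lines file, then delete them."""
    rows = conn.execute(
        "SELECT device_id, ts, value FROM readings WHERE ts < ? ORDER BY ts",
        (cutoff_ts,),
    ).fetchall()
    with gzip.open(archive_path, "wt", encoding="utf-8") as fh:
        for device_id, ts, value in rows:
            # Field names travel with each record, so the archive stays
            # readable even if the table layout changes later.
            fh.write(json.dumps({"device_id": device_id, "ts": ts, "value": value}) + "\n")
    with conn:  # commits the delete only after the archive file is written
        conn.execute("DELETE FROM readings WHERE ts < ?", (cutoff_ts,))
    return len(rows)


# Made-up data in an in-memory database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device_id TEXT, ts INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("pump-1", 100, 1.0), ("pump-1", 200, 2.0), ("pump-2", 300, 3.0)],
)
conn.commit()

archive_path = os.path.join(tempfile.mkdtemp(), "readings_archive.jsonl.gz")
archived = archive_old_rows(conn, cutoff_ts=250, archive_path=archive_path)
remaining = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

Storing self-describing records is one way to soften the schema-drift problem on restore, but it does not remove the manual work of loading the data back in.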
Ditch the data warehouse for database as a microservice
It turns out storage is still cheap, but not in the traditional sense. Storage on many IoT devices is cheap and mostly unused. What if we could use an HTAP NoSQL IoT database on each device to leverage that storage? If that database had clustering enabled and supported ANSI SQL, each device could independently answer analytical queries. We could eliminate the data warehouse completely, because the performance bottleneck of running reports on a production SQL server is replaced by a distributed mesh of IoT database hosts. You can read more about this concept in our blog post Database As a Microservice.
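To make the idea concrete, here is a small scatter-gather sketch in Python. It is not any particular product's API: in-memory SQLite databases stand in for the on-device databases, and the `readings` table and function names are hypothetical. Each "device" answers a SQL aggregate over its own data, and a coordinator merges the partial results.

```python
import sqlite3
from typing import List, Optional, Tuple


def make_device_db(readings: List[Tuple[int, float]]) -> sqlite3.Connection:
    """Create an in-memory database standing in for one device's local store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (ts INTEGER, temperature_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", readings)
    conn.commit()
    return conn


def local_partial(conn: sqlite3.Connection) -> Tuple[float, int]:
    """Run the aggregate on the device itself; only (sum, count) leaves the device."""
    total, count = conn.execute(
        "SELECT COALESCE(SUM(temperature_c), 0.0), COUNT(*) FROM readings"
    ).fetchone()
    return total, count


def fleet_average(devices: List[sqlite3.Connection]) -> Optional[float]:
    """Merge per-device partial aggregates into one fleet-wide average."""
    partials = [local_partial(conn) for conn in devices]
    total = sum(t for t, _ in partials)
    count = sum(n for _, n in partials)
    return total / count if count else None


# Two made-up devices with a few readings each.
devices = [
    make_device_db([(1, 20.0), (2, 22.0)]),
    make_device_db([(1, 30.0)]),
]
avg = fleet_average(devices)
```

The point of the design is that raw readings never move: each host ships back two numbers instead of its whole table, so no central copy of the data is needed to answer the query.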
The term ‘big data’ is not meant to be flippant. IoT data bloat is a very real problem that typically shows up after implementation. There are solutions, all of which require some forethought and an investment of architectural time. Choosing a performant and flexible architecture up front is critical. Consuming big data at scale is truly like drinking from a fire hose. Once you are in production, it becomes difficult to pivot without a flexible and performant solution, and you can end up drowning in data.