A Historical View of Data Storage and Retrieval

Histories of databases all start at roughly the same stage, technologically speaking, apart from the early storage of data on everything from rocks to paper ledgers.  Each history describes the evolution of the data model: hierarchical, then network, and then the relational model introduced by the transformational paper “A Relational Model of Data for Large Shared Data Banks” by Edgar F. Codd. The different syntaxes used to query these systems, up to the creation of the Structured Query Language, better known as SQL, and on into the present era of NoSQL, are key parts of that history.

Regardless of how it was thought up or designed, each phase of the history up to the current moment has been limited by hardware. Each storage and retrieval mechanism was developed out of necessity, and each discovery stood on the shoulders of its predecessor.

Databases Evolve with Technology

Computers were expensive, transistors were huge, and programming was difficult and proprietary.  I am grateful for, and respect, the work of those pioneers; the spinning hard disk is an incredible feat of engineering and programming.  However, no matter where you put the data, how the data is structured, or how much data is spread across disparate data stores, it is the hardware that does the work of writing, copying, moving, updating, sorting and searching. Of course, the hardware needs software drivers and application programming interfaces to allow programs to use it. Humans have developed clever algorithms for each of these operations, and all of them were designed relative to an underlying layer: the hardware.

Computer science courses drill students on big O notation, the mechanism for measuring how the time it takes to sort and/or search grows with the size of the data.  Search speed can depend heavily on the state of the data: whether or not it is sorted.  The thing I remember most from my algorithms class is that the fastest way to search through N objects is to have N operations doing the search at once.  At a time when central processing units had 2 cores and HDD data transfer rates averaged 30-50 megabytes per second on the wire, this seemed nearly impossible.
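
To make the point about sorted data and parallel search concrete, here is a minimal Python sketch. The function names (linear_search, binary_search, parallel_search) and the workers parameter are illustrative assumptions, not taken from any particular database. It compares an O(N) scan of unsorted data, an O(log N) search of sorted data, and a scan split into chunks that are searched concurrently.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor


def linear_search(items, target):
    """O(N): check every item in turn; works on unsorted data."""
    for index, value in enumerate(items):
        if value == target:
            return index
    return -1


def binary_search(sorted_items, target):
    """O(log N): only valid when the data is already sorted."""
    index = bisect_left(sorted_items, target)
    if index < len(sorted_items) and sorted_items[index] == target:
        return index
    return -1


def parallel_search(items, target, workers=4):
    """Split the data into chunks and search the chunks concurrently.

    Hypothetical helper: returns the lowest matching index, or -1.
    """
    chunk_size = max(1, -(-len(items) // workers))  # ceiling division

    def search_chunk(start):
        # Each worker scans only its own slice of the data.
        local = linear_search(items[start:start + chunk_size], target)
        return -1 if local == -1 else start + local

    starts = range(0, len(items), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = [hit for hit in pool.map(search_chunk, starts) if hit != -1]
    return min(hits) if hits else -1
```

This only sketches how the work is divided. In standard CPython, threads do not run Python code on several cores at the same time, so a real speedup would need separate processes, a GPU, or multiple machines; with N workers for N items, each worker would do a single comparison.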

Fast forward to the current epoch: my mobile phone has a central processing unit with 8 cores and 6 gigabytes of random access memory (RAM), and it transfers data over a wireless network at rates up to 12 megabits a second. On the wire, the device uses a flash disk with internal transfer rates in the range of 150+ megabytes a second. The vertical scaling of hardware is impressive, but the horizontal scaling of hardware today is extraordinary.

Graphics processing units have thousands of cores. Whereas CPU cores are optimized for sequential, serial processing, GPU cores are designed to do many tasks simultaneously.  Imagine building a network of local GPUs communicating over a fast interconnect at potentially 40+ gigabytes a second, simultaneously searching a distributed file system. Or imagine a few million Internet of Things (IoT) devices querying a distributed file system, or even local metadata with pointers to the particular data store where the data actually lives. That may not be enough to run the biggest social media services today; however, it is probably sufficient for many use cases of storing data and retrieving it through a search interface (think HarperDB).

What is the Future of Databases?

We are standing at the edge of the future. Devices are everywhere, with processing capabilities that far surpass those of computers in the days when the paper by E.F. Codd was published. There was a paradigm in which programming, and the technology that could run the programs, was scarce; it was a time that begged the mantra reuse, reuse, reuse.  At the same time, there was a race to be “The One.” Software intellectual property was important, and still is, but if you did not know how to write a database storage application you used the existing one, even if it was expensive.

The paradigm is no longer as dichotomous as it was: programming languages are abundant, and the communities around them are actively developing all possible layers.  In today’s ecosystem the phrase “roll your own” is sometimes preferred, if not only for the developer’s curiosity then out of necessity, as the go-to application or module may carry too much unnecessary functionality and be considered bloat.

The future is even cooler if the bottleneck of your data storage application is its ability to search “Big Data” across disparate systems or to migrate data between multiple systems.  There is work being done to shorten that data value chain.  Databases that handle high transaction volumes and analytical processing together are the future.  Parallel processing for search is here. Node.js provides non-blocking, asynchronous I/O, which supports high transaction throughput, and it can spread work across multiple processes or worker threads for parallel processing.  HarperDB was built with these ideas in mind: it uses its own patent-pending data storage topology that separates each attribute and stores each one individually.  This is different from most databases, which store documents or rows in a single file.  Some people build things around existing limitations, but more often than not I have seen that, in parallel and without each other’s knowledge, someone has already built a solution; the two just have not found each other yet.  HarperDB was built with an understanding of the current limitations; however, new technologies like persistent memory (pmem) and 3D XPoint memory are blurring these lines.
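
The general idea of storing each attribute separately, rather than whole rows or documents together, can be sketched in a few lines of Python. This is a toy, in-memory illustration under my own assumptions; the AttributeStore class and its methods are hypothetical and do not represent HarperDB's actual patent-pending topology or its API.

```python
from collections import defaultdict


class AttributeStore:
    """Toy in-memory store that keeps every attribute in its own mapping."""

    def __init__(self):
        # attribute name -> {record id -> value}
        self._attributes = defaultdict(dict)

    def insert(self, record_id, record):
        """Split a record apart and file each attribute value separately."""
        for name, value in record.items():
            self._attributes[name][record_id] = value

    def search(self, name, value):
        """Return ids whose attribute equals value, reading only that attribute."""
        column = self._attributes.get(name, {})
        return sorted(rid for rid, stored in column.items() if stored == value)

    def get(self, record_id):
        """Reassemble a full record from the per-attribute mappings."""
        return {name: column[record_id]
                for name, column in self._attributes.items()
                if record_id in column}


store = AttributeStore()
store.insert(1, {"name": "Ada", "city": "London"})
store.insert(2, {"name": "Edgar", "city": "London"})
store.insert(3, {"name": "Grace"})
london_ids = store.search("city", "London")  # only the "city" mapping is read
```

The trade-off shown here is that a search on one attribute never touches the others, so each attribute could in principle be searched by a separate worker, while reading a whole record back requires visiting every attribute.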