HarperDB’s Exploded Data Model

If you’ve looked around our website, talked to us at a trade show, or read an article about us, you’ve probably heard about our exploded data model. This is the keystone of HarperDB’s innovative database solution. Our exploded data model is the reason we are fully indexed with no data duplication, it’s why we can support full SQL and NoSQL within a single model, and it’s the basis for even more innovative features to come. I know, I know, that’s all well and good, but how does it work?

It’s certainly difficult to wrap your head around at first. It finally clicked for me when I stopped comparing it to how any other database, SQL or NoSQL, stores data. Instead, I started fresh and thought outside the box, just like the founders did when they first architected HarperDB. Let’s step back all the way and think about a grouping of data. It can be SQL records, NoSQL objects, CSV rows, or whatever you prefer. In the simplest terms, we have a record which consists of fields, each containing a single data point. When HarperDB ingests a record, it immediately splits that record up into individual attributes, storing the attributes and their values discretely on disk. We use the record’s required hash value, which acts as its primary key, to link the attributes together. This is what we mean when we say exploded.
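To make the idea concrete, here is a minimal in-memory sketch of "exploding" a record. It is only an illustration, not HarperDB's actual code: the `explode` and `reassemble` helpers, the `store` dictionary, and the `id` hash attribute are all names I made up for this example, and a Python dictionary stands in for what HarperDB keeps on disk.

```python
from collections import defaultdict

# Hypothetical stand-in for the on-disk layout:
# attribute name -> {hash value: attribute value}
store = defaultdict(dict)


def explode(store, record, hash_attribute="id"):
    """Split one record into per-attribute entries keyed by its hash value."""
    hash_value = record[hash_attribute]
    for attribute, value in record.items():
        store[attribute][hash_value] = value
    return hash_value


def reassemble(store, hash_value):
    """Pull a record back together by collecting every attribute for one hash."""
    return {
        attribute: values[hash_value]
        for attribute, values in store.items()
        if hash_value in values
    }


explode(store, {"id": 2, "dog_name": "Harper", "owner_name": "Stephen"})
explode(store, {"id": 8, "dog_name": "Gemma", "owner_name": "Stephen"})

record = reassemble(store, 8)
```

After the two calls to `explode`, nothing in `store` looks like a row anymore. Each attribute holds its own values, and the hash value is the only thing tying them back together.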

Now, because we’re storing the attributes individually, they immediately become an index on write, no duplication necessary. If we’re searching on name, we simply go to the name attribute and look for matching attribute values. Then, to pull the exploded record back together, we use parallelization to retrieve all required attributes associated with the matching hash values. An additional benefit here is that joins are as performant as single-table searches: HarperDB is simply coalescing a collection of attributes, no matter where they reside. This method of storage allows concurrent reads and writes, all with high throughput. Additionally, because everything is stored directly on disk, there is no row locking or in-memory transformation.
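Here is a rough sketch of that search-then-coalesce flow. Again, this is my own illustration under simplifying assumptions: the `dog` dictionary, the `search` function, and the sample rows (including Penny and Kyle) are hypothetical, and a thread pool merely stands in for whatever parallelization HarperDB really uses.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical exploded table: attribute -> {hash value: attribute value}
dog = {
    "dog_name": {2: "Harper", 8: "Gemma", 3: "Penny"},
    "owner_name": {2: "Stephen", 8: "Stephen", 3: "Kyle"},
}


def search(table, attribute, value, wanted):
    # Step 1: look only at the searched attribute to find matching hash values.
    hashes = sorted(h for h, v in table[attribute].items() if v == value)

    # Step 2: fetch each requested attribute for those hashes, one task each.
    def fetch(attr):
        return attr, [table[attr].get(h) for h in hashes]

    with ThreadPoolExecutor() as pool:
        columns = dict(pool.map(fetch, wanted))

    # Step 3: coalesce the per-attribute columns back into records.
    return [
        {attr: columns[attr][i] for attr in wanted}
        for i in range(len(hashes))
    ]


results = search(dog, "owner_name", "Stephen", ["dog_name", "owner_name"])
```

Notice that step 2 does not care which table an attribute belongs to. That is the intuition behind the join claim: once you have hash values, collecting attributes looks the same wherever they live.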

That’s the high-level explanation, but there is a bit more secret sauce to it than that. HarperDB utilizes age-old, established file system techniques to ensure data consistency. What I haven’t mentioned yet is that HarperDB actually uses two types of indexing. First, we have primary key indexing, where each attribute’s values are stored by hash value. Then, we have secondary indexing, where each attribute is indexed by value. The trick is that HarperDB uses hard links within the file system to connect the two indexes on disk. What that means is that on the physical storage, both indexes are actually pointing to the same storage location, meaning no data duplication. One neat side effect of storing data directly on the file system is that it’s completely human readable through your standard file browser. Let’s explore the file system using the dog tables found in the HarperDB Examples.
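If hard links are new to you, this small sketch shows the idea using a temporary directory. The `write_attribute` helper is hypothetical, and I am assuming each hdb file simply holds the attribute value as text; the folder names mirror the dev schema and dog table from the examples. It needs a file system that supports hard links.

```python
import os
import tempfile

# Throwaway directory standing in for the dev schema and dog table.
table = os.path.join(tempfile.mkdtemp(), "dev", "dog")


def write_attribute(table, attribute, hash_value, value):
    """Write one value once, then expose it under both indexes via a hard link."""
    primary_dir = os.path.join(table, "__hdb_hash", attribute)
    secondary_dir = os.path.join(table, attribute, str(value))
    os.makedirs(primary_dir, exist_ok=True)
    os.makedirs(secondary_dir, exist_ok=True)

    primary = os.path.join(primary_dir, f"{hash_value}.hdb")
    secondary = os.path.join(secondary_dir, f"{hash_value}.hdb")

    # Assumption: the file content is just the value as text.
    with open(primary, "w") as f:
        f.write(str(value))

    # A hard link is a second directory entry for the same data on disk.
    os.link(primary, secondary)
    return primary, secondary


primary, secondary = write_attribute(table, "dog_name", 2, "Harper")

same_file = os.path.samefile(primary, secondary)  # True: one storage location
link_count = os.stat(primary).st_nlink            # two names, one file
```

Both paths are full citizens. Neither one is a copy or a shortcut to the other, which is why the value can be found by hash or by value without being stored twice.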

The image below shows my file system. Now if we work left to right we see that we’re in the dev schema, then the dog table. Within the dog table we see all of the attributes associated with a dog, as well as the __hdb_hash attribute/folder, which we’ll get to in a moment. First let’s examine the owner_name attribute/folder. Within it we see the names of all the dog owners at HarperDB. (I know, I know, I need to get a dog.) Now we can look in the Stephen folder and see that Stephen owns two dogs, with hash values 2 and 8, denoted by the hdb files.

If we want to retrieve other attributes associated with those hash values, we can now move over to the __hdb_hash folder, the primary key index. Let’s go find Stephen’s dogs’ names. In the image below we’re going to go up to the __hdb_hash folder, navigate to dog_name, and open the files associated with 2 and 8.

Opening these files we get the results seen below. Two files, each containing the dog_name value for its record. The key here is the hard links. The files we opened, 2.hdb and 8.hdb, under dog_name in the __hdb_hash folder point to the exact same storage location as the files under dog/dog_name/Harper/2.hdb and dog/dog_name/Gemma/8.hdb. This is the exploded data model.
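The same walk through the folders can be scripted. The sketch below first builds a tiny copy of the layout described above in a temporary directory and then repeats the two steps: list the files under owner_name/Stephen, then open the matching files under __hdb_hash/dog_name. The `put` helper is hypothetical, and I am assuming each hdb file contains only the plain value.

```python
import os
import tempfile

table = os.path.join(tempfile.mkdtemp(), "dev", "dog")


def put(attribute, hash_value, value):
    """Store a value under the hash index and hard-link it under the value index."""
    primary = os.path.join(table, "__hdb_hash", attribute, f"{hash_value}.hdb")
    secondary = os.path.join(table, attribute, value, f"{hash_value}.hdb")
    os.makedirs(os.path.dirname(primary), exist_ok=True)
    os.makedirs(os.path.dirname(secondary), exist_ok=True)
    with open(primary, "w") as f:
        f.write(value)
    os.link(primary, secondary)


for hash_value, dog_name in [(2, "Harper"), (8, "Gemma")]:
    put("dog_name", hash_value, dog_name)
    put("owner_name", hash_value, "Stephen")

# Step 1: the Stephen folder tells us which hash values belong to Stephen.
hash_files = sorted(os.listdir(os.path.join(table, "owner_name", "Stephen")))

# Step 2: open the same file names under __hdb_hash/dog_name to read the names.
names = []
for file_name in hash_files:
    with open(os.path.join(table, "__hdb_hash", "dog_name", file_name)) as f:
        names.append(f.read())
```

Here `hash_files` ends up as the two hdb files for hashes 2 and 8, and `names` holds Harper and Gemma, the same answer we reached by clicking through the folders.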

Under the hood, HarperDB is unlike any other database. It operates using our original design, enabling us to have these cutting-edge features, all in a small footprint. We’re constantly finding new and innovative ways to advance our design and capitalize on the power of the exploded data model. We have some pretty cool stuff coming up on our roadmap, but you’ll just have to wait and see.

In the meantime, I encourage you to download our community edition and give it a try yourself. Dig into the file system and open some hdb files. The best way to understand our data model is to play around with it and examine it. It’s human readable, so you might as well see it for yourself!