LMDB Deep Dive: Interview with Kyle, HarperDB CTO

Kaylan: That’s a good analogy. So obviously, from what we’ve already talked about, it’s been easy. But can you tell me more about the implementation process, what that was like? Was there anything that you disliked about implementing LMDB? 

Kyle: Yeah, if you don’t mind, I think it also goes in line with why we selected LMDB, because it really is about the implementation. These are all very tied together, so I’ll probably jump to that. Like I said, this is something that I’d been thinking about for almost two years. And it really came down to this: we had a couple of POCs that we were working on, and the feedback that we got from the POCs was [they] really enjoyed the product, but the read performance was not what they needed.

Around the same time we were also getting some feedback from investors and other people in the tech community, saying databases like MongoDB use multiple data stores. MySQL has different ways of using data stores. It’s pretty common that the underlying place that you store the data is ultimately decoupled from the database itself. And so it’s kind of smart to swap things in and out and give options because depending on use cases, someone may want to use a different data storage mechanism. 

I spent the end of 2019 evaluating different key value stores with all the things that we’d learned, and also making sure that whatever new technology we implemented wouldn’t break our core mission, which is simplicity, a dynamic data model, a single data store with SQL/NoSQL… All the things that we always talked about were very important for making it very easy for developers to, you know, put data in, get data out, and do simple and complex querying, and analytics. So I evaluated a lot of different products very quickly, and while a lot of them washed out for various reasons, LMDB held up through the evaluation process.

There was a great node module built for it. I could build a dynamic schema around it. It was just very lightweight, and it didn’t put a lot of constraints on us in order to implement it. So once I did a quick bake-off, I started digging in through the month of December: could we mimic HarperDB’s existing data model (so not implementing it into HarperDB, but creating a very similar data model as if it was HarperDB) and then run workloads through that sample?

Then I did a series of tests running these high-scale workloads, doing inserts and different SQL queries and searches and all these things, comparing our file system data model to LMDB. And [for] some workloads LMDB was, I think, six hundred times faster; that was on bulk CSV loads. Even for queries it was, oh man, I think around 60 times faster or something like that. And on all workloads, it was better than what we currently had. So I did a write-up and disseminated that to the team. Because it was also a big level of effort, at the same exact time we also determined that we needed to roll out a cloud product.

And so, Stephen and I sat down and discussed what resources were needed and ultimately decided I would work on this solo so the rest of the team could focus on HarperDB Cloud. The approach I took to the implementation was a modular one. Two of our engineers, Sam Johnson and David Cockerill, while working on a failed co-development project earlier in 2019 with another company’s key value store, had created a mechanism for us to decouple the data storage from our core logic. So there was already a pattern in place; they saved me probably months of work. So I started with that modular design: core functions that just do the create, read, update, and delete operations.
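The decoupling Kyle describes can be sketched as a small CRUD interface that the core database logic programs against, so the backing key value store can be swapped without touching anything above it. This is an illustrative sketch, not HarperDB's actual internals; all names here are hypothetical, and the in-memory Map simply stands in for any pluggable backend such as LMDB.

```javascript
// Hypothetical sketch of a decoupled storage layer: the core logic talks
// only to these four CRUD methods, so the backing store (file system,
// LMDB, etc.) can be swapped out without changing the logic above it.
class InMemoryStore {
  constructor() {
    this.records = new Map();
  }

  create(id, record) {
    if (this.records.has(id)) throw new Error(`record ${id} already exists`);
    this.records.set(id, { ...record });
    return id;
  }

  read(id) {
    const record = this.records.get(id);
    return record ? { ...record } : null;
  }

  update(id, changes) {
    const existing = this.records.get(id);
    if (!existing) throw new Error(`record ${id} not found`);
    this.records.set(id, { ...existing, ...changes });
  }

  delete(id) {
    return this.records.delete(id);
  }
}

// Any backend exposing the same four methods could be dropped in instead.
const store = new InMemoryStore();
store.create('1', { name: 'Harper', breed: 'dog' });
store.update('1', { breed: 'Labrador' });
console.log(store.read('1')); // { name: 'Harper', breed: 'Labrador' }
```

The point of the pattern is that an LMDB-backed implementation with the same method signatures slots in behind the interface, which is why the earlier decoupling work saved so much time.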

And also managing the tables, which LMDB calls environments. So when you create a table, we need to create this environment, and when you create an attribute, we need to create a separate key value store inside there. How do we track all that? So there was a lot of wiring that was specific to us so that it would then bubble up and make sense for HarperDB. Creating these foundational modules and functions is where I started, and then I built unit tests on top of that. So for everything I built, I made sure there was testing for all the edge cases, because I would test something and realize, oh, that doesn’t work.
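The wiring Kyle describes can be sketched as a small bookkeeping layer. In this toy version, plain Maps stand in for the real LMDB structures: in an actual implementation, creating a table would open an LMDB environment on disk, and creating an attribute would open a separate key value store inside it. All class and method names here are illustrative, not HarperDB's.

```javascript
// Toy sketch of the table/attribute bookkeeping described above. Plain Maps
// stand in for real LMDB environments and per-attribute key value stores.
class SchemaManager {
  constructor() {
    this.tables = new Map(); // table name -> { attributes: Map }
  }

  createTable(tableName) {
    if (this.tables.has(tableName)) throw new Error(`table ${tableName} exists`);
    // Stand-in for opening a new LMDB environment for this table.
    this.tables.set(tableName, { attributes: new Map() });
  }

  createAttribute(tableName, attrName) {
    const table = this.tables.get(tableName);
    if (!table) throw new Error(`table ${tableName} not found`);
    // Stand-in for opening a separate key value store for this attribute,
    // e.g. an index mapping record ids to attribute values.
    if (!table.attributes.has(attrName)) {
      table.attributes.set(attrName, new Map());
    }
  }

  // Dynamic schema: index a record's value for every attribute it carries,
  // creating attribute stores on the fly when new attributes appear.
  indexRecord(tableName, id, record) {
    for (const [attr, value] of Object.entries(record)) {
      this.createAttribute(tableName, attr);
      this.tables.get(tableName).attributes.get(attr).set(id, value);
    }
  }
}

const schema = new SchemaManager();
schema.createTable('dog');
schema.indexRecord('dog', '1', { name: 'Harper', age: 5 });
console.log([...schema.tables.get('dog').attributes.keys()]); // [ 'name', 'age' ]
```

Creating attribute stores lazily, as records arrive, is one way a dynamic data model can sit on top of a key value store without a predeclared schema.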

The most complicated thing I had to spend time on was search, which is no surprise. But, you know, it’s just adding layers on top of layers with this modular design. It took three months, and then the testing took about another month. It was hard, but by that time the managed service was ready to deploy. So the timing all worked out really well, and Kaylan, you project managed me on that.

Kaylan: Of course! I’m just thinking back to that implementation process, when at one point you said you had a hundred errors when you did a test, and then you fixed, like, one thing and it just fixed [everything]… that was crazy to me.

Kyle: Yeah, I can’t remember what I fixed there… it’s all a blur now. As far as implementing LMDB itself, it’s very easy. Overall it was more that I had all my requirements laid out and was just cranking through them. I was like, I know what I need to accomplish, I know what I need to do; it’s just the doing of it and getting through that process.

Kaylan: Yeah, definitely. I think it speaks a lot to how easy it was to implement and use that you did it alone. And while you were in it, it felt like a long time, but three months just doesn’t seem to me like that long of a time to completely implement this new tool. So, yeah, I mean, it’s awesome.

Kyle: And to give us that level of performance improvement. Yeah, it was huge. And you’re right, it is a very easy product to work with because, yeah, if it was complicated, I might still be working on it.