Event-driven architectures and change data capture are two popular paradigms for streaming data in modern applications. For example, to stream data from PostgreSQL databases, Debezium can read the write-ahead log (WAL) to detect changes and publish them to Kafka for further processing.
While HarperDB does not currently have native integrations for Debezium or Elasticsearch, we can leverage HarperDB’s application layer to periodically fetch data and index it in Elasticsearch. In this tutorial, we’ll go over a proof of concept to index new data from HarperDB into Elasticsearch.
Setup
We’ll use Docker Compose to run a local instance of HarperDB and Elasticsearch.
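A minimal `docker-compose.yml` for this might look like the sketch below. The image tags, admin credentials, and volume paths are assumptions for this proof of concept; adjust them for your environment.

```yaml
version: "3"
services:
  harperdb:
    image: harperdb/harperdb
    environment:
      - HDB_ADMIN_USERNAME=admin      # assumed credentials for this demo
      - HDB_ADMIN_PASSWORD=password
    ports:
      - "9925:9925"                   # HarperDB operations API
    volumes:
      - ./harperdb:/opt/harperdb/hdb  # local data directory (path may vary by image version)
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.10
    environment:
      - discovery.type=single-node    # single-node cluster for local development
    ports:
      - "9200:9200"
```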
This spins up a single-node instance of Elasticsearch as well as HarperDB. Note that I’m mapping a local directory called `harperdb` to persist data, but you can opt to skip that step.
We’ll create our standard `dev` schema and `dogs` table via curl:
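Using HarperDB's operations API on port 9925 (and the admin credentials assumed in the Docker Compose file above), the two calls look like this:

```bash
# create the dev schema
curl -s http://localhost:9925 -u admin:password \
  -H 'Content-Type: application/json' \
  -d '{"operation": "create_schema", "schema": "dev"}'

# create the dogs table, keyed on the id attribute
curl -s http://localhost:9925 -u admin:password \
  -H 'Content-Type: application/json' \
  -d '{"operation": "create_table", "schema": "dev", "table": "dogs", "hash_attribute": "id"}'
```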
We can seed some sample data:
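For example, using HarperDB's `insert` operation with a few illustrative records (the field values here are just sample data):

```bash
curl -s http://localhost:9925 -u admin:password \
  -H 'Content-Type: application/json' \
  -d '{
    "operation": "insert",
    "schema": "dev",
    "table": "dogs",
    "records": [
      {"id": 1, "dog_name": "Penny", "owner_name": "Kyle", "age": 7},
      {"id": 2, "dog_name": "Harper", "owner_name": "Stephen", "age": 5},
      {"id": 3, "dog_name": "Alby", "owner_name": "Kaylan", "age": 4}
    ]
  }'
```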
Finally, let’s create the index on Elasticsearch:
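A simple `PUT` against Elasticsearch creates the index; naming it `dogs` to mirror the table is an assumption of this sketch:

```bash
# create an empty index named dogs with default settings
curl -s -X PUT 'http://localhost:9200/dogs?pretty'
```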
Writing a Custom Connector
We will write a simple custom connector that periodically polls data from HarperDB and indexes any records we have not seen before into Elasticsearch. In this simple example, we will use the `id` field to determine what's new, but for other workloads, a timestamp field may be more appropriate.
We are using `axios` to request data from HarperDB, the `elasticsearch` client to connect to Elasticsearch, and finally `node-cron` to schedule our polling operation.
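Below is a minimal sketch of what the connector might look like, assuming the three dependencies are installed (`npm install axios elasticsearch node-cron`) and the hosts, credentials, and index name match the setup above:

```js
// connector.js — a minimal sketch; hosts, credentials, and the index name
// are assumptions that should match the Docker Compose setup above.
const axios = require('axios');
const elasticsearch = require('elasticsearch');
const cron = require('node-cron');

const HARPERDB_URL = 'http://localhost:9925';
const HARPERDB_AUTH = { username: 'admin', password: 'password' };

const client = new elasticsearch.Client({ host: 'localhost:9200' });

// highest id indexed so far — not persisted, so a restart re-reads everything
let lastId = 0;

async function fetchDataAndIndex() {
  try {
    // fetch only the rows we haven't seen, using the incrementing id field
    const { data } = await axios.post(
      HARPERDB_URL,
      { operation: 'sql', sql: `SELECT * FROM dev.dogs WHERE id > ${lastId}` },
      { auth: HARPERDB_AUTH }
    );

    if (!data.length) {
      console.log('No new records, skipping');
      return;
    }

    // index each new record and advance the last-seen id
    for (const record of data) {
      await client.index({ index: 'dogs', type: '_doc', body: record });
      lastId = Math.max(lastId, record.id);
    }
    console.log(`Indexed ${data.length} record(s); lastId is now ${lastId}`);
  } catch (err) {
    console.error('Polling failed:', err.message);
  }
}

// run the polling function every 5 seconds (6-field cron expression with seconds)
cron.schedule('*/5 * * * * *', fetchDataAndIndex);
```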
The code is relatively simple. We use `cron.schedule` to run the `fetchDataAndIndex` function every 5 seconds. Within that function, we run a SQL query to fetch all rows whose id is greater than the highest one we've seen. Each returned record is then indexed into Elasticsearch.
Testing the Connector
Once we run the connector, it will immediately index our existing records and advance the last-seen id to 3. Every poll after that will return no new data and be skipped.
We can run a curl command to search for our indexed data:
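For example, searching by one of the seeded dog names (the field and value below assume the sample records from earlier):

```bash
curl -s 'http://localhost:9200/dogs/_search?q=dog_name:Penny&pretty'
```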
That query will return one result, as we expect.
While the connector is running, we can insert more data into HarperDB. We can reuse the same dog names or other field values so that a search returns more than one result. For example:
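```bash
# insert another record reusing an existing dog name (id continues the sequence)
curl -s http://localhost:9925 -u admin:password \
  -H 'Content-Type: application/json' \
  -d '{
    "operation": "insert",
    "schema": "dev",
    "table": "dogs",
    "records": [{"id": 4, "dog_name": "Penny", "owner_name": "Riley", "age": 2}]
  }'
```

On the connector's next poll, this record is picked up and indexed, so repeating the earlier search should now return two results.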
Parting Thoughts
In this tutorial, we set up a proof of concept for polling data from HarperDB and continuously indexing it into Elasticsearch. Compared to more traditional change data capture architectures using Debezium, data isn't "pushed" but rather polled. We are also relying on the incrementing id field, but for some workloads a timestamp field would be more appropriate. Finally, we are not persisting the last indexed id, so restarting the connector would attempt to reindex from the start. These are all limitations that a production version of this connector would need to address, but until HarperDB comes out with a native Debezium or Elasticsearch connector, we can use this custom workaround to bridge the data gap.