HarperDB’s New Upsert Feature





In our new 2.3.0 release, we included an often requested NoSQL upsert operation to HarperDB*.  This new hybrid operation will insert new records, if they do not exist, or update them, if they do.

This new feature can be used in two different ways via HarperDB’s API – via a simple NoSQL operation or as the action for a bulk load operation.

*This new operation is only available in HDB instances utilizing a LMDB data store. While the File System (FS) data store is still configurable and supported in HDB, some new/more advanced features may not be implemented for FS moving forward.

NoSQL Upsert Operation

As noted above, HDB users can now utilize an upsert operation via our API which will insert new records and/or update existing records.  

A new record (to be inserted) is identified as a record that does not include a hash value or with a hash value that does not already exist on the table being upserted to.  

An existing record (to be updated) is identified by a valid table hash value and will be updated based on the attribute values included in the JSON to upsert for that record – i.e. as with update, any attributes not included in the record’s JSON will NOT be updated.

Example NoSQL Upsert Operation

Example Request

{
    "operation": "upsert",
    "schema": "dev",
    "table": "dog",
    "records": [
        {
            "id": 1,            
            "nickname": "Sammy"
        },
        {
            "name": "Harper",
            "nickname": "Good boy!"
            "breed": "Mutt",
            "age": 5,
            "weight_lbs": 155
        }
    ]
}

Example Response

{
    "message": "upserted 2 of 2 records",
    "upserted_hashes": [
        1,
        "6bca9762-ad06-40bd-8ac8-299c920d0aad"
    ]
}

In the above example:

  • The existing record with hash value equal to 1, will have the nickname attribute updated to equal “Sammy” and all other attribute values for that record will remain untouched. 
    Note: if there was no record with id equal to 1, a new record would be inserted with the provided nickname value.
  • The new record will be inserted as written and with a system generated hash value.  If a new, unused hash value had been provided for this record, we would have used that hash value when inserting the new record.

NoSQL Bulk Load Upsert

Similar to our NoSQL insert and update operations, upsert is also now available to specify as the action on a bulk load API operation. This will communicate to the bulk load job to run an upsert operation on the large data set provided.

Example NoSQL Bulk Load w/ Upsert Action

Example Request

{
    "operation":"csv_url_load",
    "action": "upsert",
    "schema":"dev",
    "table":"dogs",
    "csv_url":"https://s3.amazonaws.com/data/dogs.csv"
}

Example Response

{
    "message": "Starting job with id e047424c-5518-402f-9bd4-998535b65336"
}

Example Response from get_job operation for bulk load

[
    {
        "__createdtime__": 1607897781553,
        "__updatedtime__": 1607897784027,
        "created_datetime": 1607897781549,
        "end_datetime": 1607897784026,
        "id": "e047424c-5518-402f-9bd4-998535b65336",
        "job_body": null,
        "message": "successfully loaded 348 of 348 records",
        "start_datetime": 1607897781562,
        "status": "COMPLETE",
        "type": "csv_url_load",
        "user": "admin",
        "start_datetime_converted": "2020-12-13T22:16:21.562Z",
        "end_datetime_converted": "2020-12-13T22:16:24.026Z"
    }
]

In the above example:

  • A csv_url_load job is started using the linked data set.  All records included in the linked data will be upserted into the table identified using the logic described above.
  • Hitting the get_job endpoint with the job id will provide you with an updated status of the bulk load job and, when complete, confirm the number or records upserted from the linked data set.

A Note RE: Clustering

As with other database operations like insert, update, csv_file_load, etc. in HarperDB, an upsert operation to a table on a specific node will distribute to the other nodes subscribed to changes on that node’s table.  

A few things to keep in mind when thinking through how this will play out for your clustering architecture.

  1. In a scenario where you are upserting new records without hash values provided, the system generated hashes will be included in the transaction payload that is shipped to connected nodes – i.e. the auto-generated hashes for the new records will be mirrored on connected nodes
  2. In a clustered architecture, it is important to take a moment to consider the best NoSQL operation to use in every situation, while it may seem easy to just use upsert  even when you are only intending to insert or update  those records, there could be unintended consequences to your data integrity from that strategy.  

    For example, in a scenario where you have provided the hash values for upsert records, the upsert transaction will do one of the following on any connected nodes:
    • if no matching hash value is found on the subscribing table, a new record will be inserted on the table even if the operation on the publishing node was an update on the record
    • if there is a hash value match on the subscribing table, the record will be updated even if the operation on the publishing node was a record insert

To be specific, in some scenarios, using upsert could cause hash values for what you consider to be the same record to become out of sync across the cluster.

While this may not make a difference to the overall value or use of your data cluster – it could be the preferred outcome! – in others, your data cluster may be affected negatively so think through your use case carefully. Being explicit about the operation you want to transact will also make reviewing and understanding the transaction logs on your clustered nodes easier in the case where an issue arises and a rollback/fix is needed.

Happy upserting!

Do you have a new feature idea for HarperDB?
Our Feedback Board is a great place to vote and leave product suggestions, and you can always connect with our team in the community Slack Channel.

Leave comments or feedback on the original post here