Updating Documents With Arangoimp

Inspired by the feature request in GitHub issue #1298, we added update and replace support for ArangoDB’s import facilities.

This extends ArangoDB’s HTTP REST API for importing documents plus the arangoimp binary so they can not only insert new documents but also update existing ones.

Inserts and updates can also be mixed in a single import run. This blog post provides a few usage examples.

Traditional import

Previously, the HTTP REST API for importing documents and the arangoimp binary only supported document inserts, so they could not be used to update existing documents. Bulk-updating existing documents with data from a file, or mixing inserts with updates, required writing custom scripts or running multiple commands or queries.

I won’t show those workarounds in detail. Instead, I want to concentrate on what the import itself did, and I will only show arangoimp and not the HTTP import API.

Let’s assume there is already a collection named users containing the following documents:

data in collection before import

{ "_key" : "user1", "name" : "John Doe" }
{ "_key" : "user2", "name" : "Jane Smith" }

Now, importing the following data via arangoimp would produce errors for lines 1 and 2 (i.e. for keys user1 and user2) because these documents already exist in the target collection:

data to be imported

{ "_key" : "user1", "country" : "AU" }
{ "_key" : "user2", "country" : "UK" }
{ "_key" : "user3", "name" : "Joe Public", "country" : "ZA" }

Here’s what happened when importing the above data into the collection with the two existing documents:

> arangoimp --file data.json --collection users

2015-04-14T18:23:32Z [27441] WARNING at position 1: creating document failed with error 'unique constraint violated', offending document: {"_key":"user1","country":"AU"}
2015-04-14T18:23:32Z [27441] WARNING at position 2: creating document failed with error 'unique constraint violated', offending document: {"_key":"user2","country":"UK"}

created:          1
warnings/errors:  2

After the traditional import, the collection contained the following documents:

collection contents after traditional import

{ "_key" : "user1", "name" : "John Doe" }
{ "_key" : "user2", "name" : "Jane Smith" }
{ "_key" : "user3", "country" : "ZA", "name" : "Joe Public" }

As can be seen, the first two documents (user1 and user2) remain unmodified, and the third document (user3) was inserted because it did not yet exist in the target collection.

Using --on-duplicate

So what’s the change?

As announced, a single import run can now both insert new documents and update existing ones. What exactly will happen is configurable by setting arangoimp’s new command-line option --on-duplicate.

By default, even in the devel branch, errors will still be reported for the two already existing documents.

The good news is that this behavior can be changed by setting --on-duplicate to update, replace or ignore instead of the default error:

error: if a document with the specified _key already exists in the target collection, the import will not modify it and instead return an error. This is the default behavior and compatible with all previous versions of ArangoDB. We have seen the result above in the traditional import.

update: if a document with the specified _key already exists in the target collection, the import will (partially) update the existing document with the specified attributes. Only the attributes present in the import data will be updated, and all other attributes of the document present in the collection will be preserved.
```
> arangoimp --file data.json --collection users --on-duplicate update

created:          1
warnings/errors:  0
updated/replaced: 2
ignored:          0
```
The first two documents (user1 and user2) were updated (attribute country was added) and the third document (user3) was inserted because it did not exist in the target collection:
```
{ "_key" : "user1", "country" : "AU", "name" : "John Doe" }
{ "_key" : "user2", "country" : "UK", "name" : "Jane Smith" } 
{ "_key" : "user3", "country" : "ZA", "name" : "Joe Public" } 
```
replace: if a document with the specified _key already exists in the target collection, the import will fully replace the existing document with the specified attributes. Only the attributes present in the import data will be preserved, and all other attributes of the document present in the collection will be removed.
```
> arangoimp --file data.json --collection users --on-duplicate replace

created:          1
warnings/errors:  0
updated/replaced: 2
ignored:          0
```
The first two documents (user1 and user2) were replaced (attribute country was present in the import data, previously existing attribute name was removed). The third document (user3) was inserted because it did not exist in the target collection before:
```
{ "_key" : "user1", "country" : "AU" } 
{ "_key" : "user2", "country" : "UK" } 
{ "_key" : "user3", "country" : "ZA", "name" : "Joe Public" } 
```
ignore: if a document with the specified _key already exists in the target collection, the import will ignore and not modify it. The difference from error is that ignored documents will not be counted as errors. No errors/warnings will be reported for duplicate _key values, but the number of duplicate key occurrences will be reported in the ignored attribute of the result.
```
> arangoimp --file data.json --collection users --on-duplicate ignore

created:          1
warnings/errors:  0
updated/replaced: 0
ignored:          2
```
Collection contents are the same as in the error case.
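
To make the four modes easier to compare side by side, here is a minimal Python sketch that models them on an in-memory dictionary. It is not ArangoDB code: simulate_import and its counter names are hypothetical, it assumes every document carries a _key, and it models update as a shallow merge of top-level attributes only.

```python
VALID_MODES = ("error", "update", "replace", "ignore")


def simulate_import(existing, incoming, on_duplicate="error"):
    """Illustrative model of the duplicate-key modes (not ArangoDB's code)."""
    if on_duplicate not in VALID_MODES:
        raise ValueError(f"unknown on_duplicate mode: {on_duplicate!r}")

    # Index copies of the existing documents by _key.
    collection = {doc["_key"]: dict(doc) for doc in existing}
    stats = {"created": 0, "errors": 0, "updated": 0, "ignored": 0}

    for doc in incoming:
        key = doc["_key"]
        if key not in collection:
            # No duplicate: every mode inserts the new document.
            collection[key] = dict(doc)
            stats["created"] += 1
        elif on_duplicate == "update":
            # Partial update: keep old attributes, overwrite/add imported ones.
            collection[key].update(doc)
            stats["updated"] += 1
        elif on_duplicate == "replace":
            # Full replace: only the imported attributes survive.
            collection[key] = dict(doc)
            stats["updated"] += 1
        elif on_duplicate == "ignore":
            # Leave the existing document alone, count it as ignored.
            stats["ignored"] += 1
        else:
            # "error": leave the existing document alone, count an error.
            stats["errors"] += 1

    return collection, stats


# Sample data taken from the examples in this post.
existing = [
    {"_key": "user1", "name": "John Doe"},
    {"_key": "user2", "name": "Jane Smith"},
]
incoming = [
    {"_key": "user1", "country": "AU"},
    {"_key": "user2", "country": "UK"},
    {"_key": "user3", "name": "Joe Public", "country": "ZA"},
]

for mode in VALID_MODES:
    collection, stats = simulate_import(existing, incoming, mode)
    print(mode, stats)
```

Running the loop prints one counter dictionary per mode, which can be compared against the arangoimp summaries shown above.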

The above examples were for the arangoimp import binary, but the HTTP import API was adjusted as well. The duplicate key behavior can be controlled there by using the new onDuplicate URL parameter. As with arangoimp, the possible values are error, update, replace and ignore.
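
As a rough illustration of the HTTP variant, the following Python sketch builds (but does not send) an import request carrying the onDuplicate parameter. Only the onDuplicate parameter name comes from this post. The /_api/import path, the collection and type=documents parameters, the server address and the helper name build_import_request are assumptions made for the example, so please check them against the API documentation of your ArangoDB version.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_import_request(base_url, collection, docs, on_duplicate="error"):
    """Build a POST request for the import API (hypothetical helper)."""
    # Assumed query parameters; only onDuplicate is named in this post.
    query = urlencode({
        "collection": collection,
        "type": "documents",
        "onDuplicate": on_duplicate,
    })
    # One JSON document per line, as in the data.json file used above.
    body = "\n".join(json.dumps(doc) for doc in docs).encode("utf-8")
    return Request(
        f"{base_url}/_api/import?{query}",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


docs = [
    {"_key": "user1", "country": "AU"},
    {"_key": "user2", "country": "UK"},
]
# The server address is an assumption for a local default installation.
request = build_import_request("http://localhost:8529", "users", docs, "update")
print(request.get_method(), request.full_url)
```

Sending the request, for example with urllib.request.urlopen(request), requires a running server and possibly authentication, which this sketch leaves out.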

Caveats

All matching is done using document keys (i.e. _key attributes) and no other attributes. That means existing documents can only be updated if their _key attributes are present in the import data. When no _key attribute is present for a document in the import data, the import will try to insert a new document.
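
Since documents without a _key are always inserted, it can be worth checking an import file before running an update or replace import. The following Python sketch is one possible pre-check, assuming the file holds one JSON document per line as in the examples above. The helper name split_by_key is hypothetical and not part of arangoimp.

```python
import json


def split_by_key(lines):
    """Separate import documents into those with and without a _key."""
    with_key, without_key = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        if "_key" in doc:
            with_key.append(doc)  # can match an existing document
        else:
            without_key.append(doc)  # would be inserted as a new document
    return with_key, without_key


sample = [
    '{ "_key" : "user1", "country" : "AU" }',
    '{ "name" : "No Key", "country" : "NZ" }',
]
with_key, without_key = split_by_key(sample)
print(len(with_key), "keyed,", len(without_key), "without _key")
```

For a real file, the same function can be called with an open file object, because iterating over it yields one line at a time.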

The extended functionality is available in the devel branch, which will eventually turn into a stable 2.6 release.

Enjoy!

J@ArangoDB

