Inspired by the feature request in GitHub issue #1298, we added update and replace support for ArangoDB’s import facilities.
This extends ArangoDB’s HTTP REST API for importing documents plus the arangoimp binary so they can not only insert new documents but also update existing ones.
Inserts and updates can also be mixed in a single import run. This blog post provides a few usage examples.
Traditional import
Previously, the HTTP REST API for importing documents and the arangoimp binary only supported document inserts, so they could not be used to update existing documents. Bulk-updating existing documents with data from a file, or mixing inserts with updates, required writing custom scripts or running multiple commands or queries.
I won’t show those workarounds in detail and will concentrate solely on what the import did. I will only show arangoimp and not the HTTP import API.
Let’s assume there is already a collection named users containing the following documents:
{ "_key" : "user1", "name" : "John Doe" }
{ "_key" : "user2", "name" : "Jane Smith" }
Now, importing the following data via arangoimp would produce errors for lines 1 and 2 (i.e. for keys user1 and user2) because these documents already exist in the target collection:
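{ "_key" : "user1", "country" : "AU" }
{ "_key" : "user2", "country" : "UK" }
{ "_key" : "user3", "country" : "ZA", "name" : "Joe Public" }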
Here’s what happened when importing the above data into the collection with the two existing documents:
> arangoimp --file data.json --collection users
2015-04-14T18:23:32Z [27441] WARNING at position 1: creating document failed with error 'unique constraint violated', offending document: {"_key":"user1","country":"AU"}
2015-04-14T18:23:32Z [27441] WARNING at position 2: creating document failed with error 'unique constraint violated', offending document: {"_key":"user2","country":"UK"}
created: 1
warnings/errors: 2
After the traditional import, the collection contained the following documents:
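{ "_key" : "user1", "name" : "John Doe" }
{ "_key" : "user2", "name" : "Jane Smith" }
{ "_key" : "user3", "country" : "ZA", "name" : "Joe Public" }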
As can be seen, the first two documents (user1 and user2) remain unmodified, and the third document (user3) was inserted because it did not yet exist in the target collection.
Using --on-duplicate
So what’s the change?
As announced, a single import run can now both insert new documents and update existing ones.
What exactly will happen is configurable by setting arangoimp’s new command-line option --on-duplicate.
By default, even in the devel branch, errors will still be reported for the two already existing documents.
The good news is that this behavior can be changed by setting --on-duplicate to a value of update, replace or ignore. Together with the default, error, there are four possible values:
error: if a document with the specified _key already exists in the target collection, the import will not modify it and instead return an error. This is the default behavior and compatible with all previous versions of ArangoDB. We have seen the result above in the traditional import.
update: if a document with the specified _key already exists in the target collection, the import will (partially) update the existing document with the specified attributes. Only the attributes present in the import data will be updated, and all other attributes of the document present in the collection will be preserved.
> arangoimp --file data.json --collection users --on-duplicate update
created: 1
warnings/errors: 0
updated/replaced: 2
ignored: 0
The first two documents (user1 and user2) were updated (attribute country was added) and the third document (user3) was inserted because it did not exist in the target collection:
{ "_key" : "user1", "country" : "AU", "name" : "John Doe" }
{ "_key" : "user2", "country" : "UK", "name" : "Jane Smith" }
{ "_key" : "user3", "country" : "ZA", "name" : "Joe Public" }
replace: if a document with the specified _key already exists in the target collection, the import will fully replace the existing document with the specified attributes. Only the attributes present in the import data will be preserved, and all other attributes of the document present in the collection will be removed.
> arangoimp --file data.json --collection users --on-duplicate replace
created: 1
warnings/errors: 0
updated/replaced: 2
ignored: 0
The first two documents (user1 and user2) were replaced (attribute country was present in the import data, previously existing attribute name was removed). The third document (user3) was inserted because it did not exist in the target collection before:
{ "_key" : "user1", "country" : "AU" }
{ "_key" : "user2", "country" : "UK" }
{ "_key" : "user3", "country" : "ZA", "name" : "Joe Public" }
ignore: if a document with the specified _key already exists in the target collection, the import will ignore and not modify it. The difference from error is that ignored documents will not be counted as errors. No errors/warnings will be reported for duplicate _key values, but the number of duplicate key occurrences will be reported in the ignored attribute:
> arangoimp --file data.json --collection users --on-duplicate ignore
created: 1
warnings/errors: 0
updated/replaced: 0
ignored: 2
Collection contents are the same as in the error case.
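The four modes can also be summarized with a small in-memory model. The following Python sketch is not ArangoDB's implementation: the function name import_documents, the dict-based collection and the counter names are assumptions made purely for illustration, and the starting documents are inferred from the results shown above. It only handles top-level attributes and assumes every imported document carries a _key.

```python
def import_documents(collection, docs, on_duplicate="error"):
    """Apply docs to collection (a dict keyed by _key) and return counters.

    Hypothetical model of the four duplicate-handling modes; it does not
    talk to a server.
    """
    if on_duplicate not in ("error", "update", "replace", "ignore"):
        raise ValueError("unknown on_duplicate value: " + on_duplicate)
    stats = {"created": 0, "errors": 0, "updated": 0, "ignored": 0}
    for doc in docs:
        key = doc["_key"]  # assumption: every document has a _key
        if key not in collection:
            # No duplicate: all modes insert the document.
            collection[key] = dict(doc)
            stats["created"] += 1
        elif on_duplicate == "update":
            # Partial update: attributes not in the import data survive.
            collection[key].update(doc)
            stats["updated"] += 1
        elif on_duplicate == "replace":
            # Full replace: attributes not in the import data are dropped.
            collection[key] = dict(doc)
            stats["updated"] += 1
        elif on_duplicate == "ignore":
            # Existing document stays untouched, and no error is counted.
            stats["ignored"] += 1
        else:
            # "error": existing document stays untouched, error is counted.
            stats["errors"] += 1
    return stats


def example_collection():
    """Return the two documents that exist before the import."""
    return {
        "user1": {"_key": "user1", "name": "John Doe"},
        "user2": {"_key": "user2", "name": "Jane Smith"},
    }


data = [
    {"_key": "user1", "country": "AU"},
    {"_key": "user2", "country": "UK"},
    {"_key": "user3", "country": "ZA", "name": "Joe Public"},
]

users = example_collection()
print(import_documents(users, data, "update"))
print(users["user1"])
```

Run with "update", the sketch reports one created and two updated documents, which matches the counters arangoimp printed above. Switching the last argument to "replace", "ignore" or "error" shows how the same input leads to different collection contents.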
The above examples were for the arangoimp import binary, but the HTTP import API was adjusted as well. The duplicate key behavior can be controlled there by using the new onDuplicate URL parameter. Possible values are also error, update, replace and ignore, as shown for arangoimp.
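As a rough illustration of the HTTP side, the following Python sketch only builds the request URL and body and does not send anything. Only the onDuplicate parameter comes from this post. The /_api/import path, the collection and type=documents parameters, the server address and the helper names are assumptions here and should be checked against the HTTP import API documentation of the ArangoDB version in use.

```python
import json
from urllib.parse import urlencode

ALLOWED = ("error", "update", "replace", "ignore")


def build_import_url(base_url, collection, on_duplicate="error"):
    """Build an import URL carrying the onDuplicate parameter.

    The path and the collection/type parameters are assumptions, not taken
    from the post.
    """
    if on_duplicate not in ALLOWED:
        raise ValueError("onDuplicate must be one of " + ", ".join(ALLOWED))
    query = urlencode({
        "collection": collection,
        "type": "documents",  # assumed: one JSON document per line
        "onDuplicate": on_duplicate,
    })
    return base_url.rstrip("/") + "/_api/import?" + query


def build_import_body(docs):
    """Serialize documents as line-wise JSON, one document per line."""
    return "\n".join(json.dumps(doc) for doc in docs)


url = build_import_url("http://localhost:8529", "users", "update")
body = build_import_body([
    {"_key": "user1", "country": "AU"},
    {"_key": "user2", "country": "UK"},
])
print(url)
print(body)
```

The resulting URL and body could then be sent as a POST request with any HTTP client.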
Caveats
All matching is done using document keys (i.e. _key attributes) and no other attributes. That means existing documents can only be updated if their _key attributes are present in the import data. When no _key attribute is present for a document in the import data, the import will try to insert a new document.
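Because of this, it can be useful to check an import file for documents without a _key before running the import. The following Python sketch does that for line-wise JSON input. The function name split_by_key is hypothetical, and the check is a local convenience, not something arangoimp provides.

```python
import json


def split_by_key(jsonl_text):
    """Separate line-wise JSON documents into those with and without a _key.

    Documents with a _key can match existing documents; documents without
    one will always be treated as inserts.
    """
    keyed, unkeyed = [], []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        if "_key" in doc:
            keyed.append(doc)
        else:
            unkeyed.append(doc)
    return keyed, unkeyed


sample = """
{ "_key" : "user1", "country" : "AU" }
{ "country" : "NZ", "name" : "No Key" }
"""

keyed, unkeyed = split_by_key(sample)
print(len(keyed), "document(s) can match an existing document by _key")
print(len(unkeyed), "document(s) will always be inserted as new")
```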
The extended functionality is available in the devel branch, which will eventually turn into a stable 2.6 release.
Enjoy!