More Efficient Data Exports

I recently wrote about the performance improvements for the cursor API made in ArangoDB 2.6. The performance improvements are due to a rewrite of the cursor API’s internals.

As a byproduct of this rewrite, an extra API was created for exporting all documents from a collection to a client application. With this being its only use case, it is clear that the new API will not solve every data export problem. However, the API’s narrow scope allowed for a very efficient implementation, resulting in nice speedups and lower memory usage when compared to the alternative way of exporting all documents into a client application.

There was no official export API before, so users often ran AQL queries like the following to export all documents from a collection:

AQL query to export all documents

FOR doc IN collection 
  RETURN doc

While such AQL queries work for smaller result sets, they become problematic when results get bigger. The reason is that the AQL query will effectively create a snapshot of all the documents present in the collection. Creating the snapshot is required for data consistency. Once the snapshot is created, clients can incrementally fetch the data from the snapshot and will still get a consistent result even if the underlying collections get modified.

For smaller result sets, snapshotting is not a big issue. But when exporting all documents from a bigger collection, big result sets will be produced. In this case, the snapshotting can become expensive in terms of CPU time and also memory consumption.

We couldn’t get around the snapshotting completely, but we could take advantage of the fact that when exporting documents from a collection, all that can be snapshotted are documents. This is different to snapshotting arbitrary AQL queries, which can produce any kind and combination of JSON.

Dealing only with documents allowed us to take an efficiency shortcut: instead of copying the complete documents, the export API only copies pointers to the document revisions presently in the collection. Not only is this much faster than doing a full copy of the documents, but it also saves a lot of memory.
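To make the shortcut more tangible, here is a toy model in Python. It is only a sketch and not ArangoDB's actual implementation: the `Revision` and `ToyCollection` names are made up for illustration, and it assumes that an update creates a new revision while the old one stays in place for as long as a snapshot points to it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    """One immutable revision of a document (hypothetical toy type)."""
    key: str
    rev: int
    body: str


class ToyCollection:
    """Toy stand-in for a collection: maps each key to its current revision."""

    def __init__(self):
        self._current = {}

    def save(self, key, body):
        # An update never changes an existing revision, it adds a new one.
        old = self._current.get(key)
        rev = old.rev + 1 if old else 1
        self._current[key] = Revision(key, rev, body)

    def snapshot(self):
        # Copy only references to the current revisions, not the bodies.
        return list(self._current.values())


coll = ToyCollection()
coll.save("a", "first")
coll.save("b", "second")

snap = coll.snapshot()
coll.save("a", "changed")  # modification after the snapshot was taken

print([r.body for r in snap])  # prints ['first', 'second']
```

The snapshot stays consistent even though the collection was modified afterwards, and taking it cost one reference per document instead of one full copy per document.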

Invoking the API

While the invocation of the cursor API and the export API is slightly different, their result formats have intentionally been kept similar. This way client programs do not need to be adjusted much to consume the export API instead of the cursor API.

An example command for exporting via the cursor API is:

exporting all documents via the cursor API

curl -X POST \
     "http://127.0.0.1:8529/_api/cursor" \
     --data '{"query":"FOR doc IN docs RETURN doc"}'

A command for exporting via the new export API is:

exporting all documents via the export API

curl -X POST \
     "http://127.0.0.1:8529/_api/export?collection=docs"

In both cases, the result will look like this:

API results

{
  "result": [
    ...
  ],
  "hasMore":true,
  "id":"2221050516478"
}

The result attribute will contain the first few (1,000 by default) documents. The hasMore attribute will indicate whether there are more documents to fetch from the server. If so, the client can use the cursor id specified in the id attribute to fetch more results.
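A client-side loop over result, hasMore and id could look roughly like the following Python sketch. The names `export_all`, `http_json` and `send` are made up for this example. The shape of the follow-up request (a PUT to the export endpoint with the cursor id appended) is an assumption that is not spelled out above, so please check it against the HTTP API documentation of your ArangoDB version.

```python
import json
from urllib.request import Request, urlopen

BASE = "http://127.0.0.1:8529"


def http_json(method, url):
    """Send a request without a body and decode the JSON response."""
    with urlopen(Request(url, method=method)) as resp:
        return json.load(resp)


def export_all(collection, send=http_json):
    """Yield every document of `collection`, one batch at a time."""
    page = send("POST", f"{BASE}/_api/export?collection={collection}")
    while True:
        yield from page["result"]
        if not page.get("hasMore"):
            break
        # Assumption: the next batch is fetched via PUT <endpoint>/<cursor id>.
        page = send("PUT", f"{BASE}/_api/export/{page['id']}")
```

Against a running server, `for doc in export_all("docs"): ...` would then walk through the whole collection while only ever holding one batch in memory. The `send` parameter exists only so that the transport can be swapped out, for example in tests.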

The API can be invoked via any HTTP-capable client such as curl (as shown above).

I have also added bindings to the ArangoDB-PHP driver today (contained in the driver’s devel branch).

API performance

Now, what can be gained by using the export API?

The following table shows the execution times for fetching the first 1,000 documents from collections of different sizes, both via the cursor API and the export API. Figures for the cursor API are shown for ArangoDB 2.5 and 2.6 (the version in which it was rewritten):

execution times for cursor API and export API

# of documents    cursor API (2.5)    cursor API (2.6)      export API
--------------    ----------------    ----------------      ----------
       100,000               1.9 s               0.3 s          0.04 s
       500,000               9.5 s               1.4 s          0.08 s
     1,000,000              19.0 s               2.8 s          0.14 s
     2,000,000              39.0 s               7.5 s          0.19 s
     5,000,000               n/a                 n/a            0.55 s
    10,000,000               n/a                 n/a            1.32 s

Execution times are from my laptop, which only has 4 GB of RAM and a slow disk.

As can be seen, the rewritten cursor API in 2.6 is already much faster than the one in 2.5. However, for exporting documents from one collection only, the new export API is superior.
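To put numbers on "much faster", the ratios can be computed directly from the table. The short Python sketch below does that for the rows where all three measurements exist. The `rows` list simply repeats the figures from the table (reading 1,4 and 39,0 as 1.4 and 39.0), and `speedups` is a helper name made up for this example.

```python
# (documents, cursor API 2.5, cursor API 2.6, export API), times in seconds
rows = [
    (100_000, 1.9, 0.3, 0.04),
    (500_000, 9.5, 1.4, 0.08),
    (1_000_000, 19.0, 2.8, 0.14),
    (2_000_000, 39.0, 7.5, 0.19),
]


def speedups(rows):
    """Return (documents, 2.5 time / export time, 2.6 time / export time)."""
    return [(n, t25 / exp, t26 / exp) for n, t25, t26, exp in rows]


for n, vs25, vs26 in speedups(rows):
    print(f"{n:>9,} docs: {vs25:6.1f}x vs. 2.5, {vs26:5.1f}x vs. 2.6")
```

For fetching the first 1,000 documents, the export API comes out roughly 50 to 200 times faster than the cursor API in 2.5, and roughly 8 to 40 times faster than the rewritten cursor API in 2.6.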

The export API also uses a lot less memory for snapshotting, as can be nicely seen in the two bottom rows of the results. For these cases, the snapshots done by the cursor API were bigger than the available RAM and the OS started swapping heavily. Snapshotting didn’t complete within 15 minutes, so no results are shown above.

The good news is that this didn’t happen with the export API, because the snapshots it creates are much more compact.

Another nice side effect of the speedup is that the first results will arrive much earlier in the client application. This will help in reducing client connection timeouts in case clients are enforcing them on temporarily non-responding connections.

Summary

ArangoDB 2.6 provides a specialized export API for exporting all documents from a collection and shipping them to a client application. It is rather limited but faster than the general-purpose AQL cursor API and can store its snapshots using less memory.

Therefore, exporting all documents from bigger collections calls for using the new export API from 2.6 on. The new export API is present in the devel branch, which will eventually turn into a 2.6 release.

For other cases, when still using the cursor API, 2.6 will also provide significant performance improvements when compared to 2.5. This can be seen from the comparison table above and also from the observations made here.

J@ArangoDB

{ "subject" : "ArangoDB", "tags": [ "multi-model", "nosql", "database" ] }
