J@ArangoDB

{ "subject" : "ArangoDB", "tags": [ "multi-model", "nosql", "database" ] }

On Getting Unique Values

While paging through the issues in the ArangoDB issue tracker I came across issue #987, titled Trying to get distinct document attribute values from a large collection fails.

The issue was opened around 10 months ago when ArangoDB 2.2 was around. We improved AQL performance somewhat since then, so I was eager to see how the query would perform in ArangoDB 2.6, especially when comparing it to 2.2.

For reproduction I quickly put together some example data to run the query on:

setting up example data
1
2
3
4
5
6
var db = require("org/arangodb").db;
var c = db._create("test");
for (var i = 0; i < 4 * 1000 * 1000; ++i) {
  c.save({ _key: "test" + i, value: (i % 100) });
}
require("internal").wal.flush(true, true);

This produces a collection named test with 4 million documents. Each document has a numeric value attribute, which in total has 100 unique values. I remembered from a conversation with the guy that opened the issue that the number of distinct values was 100 or even slightly lower. I didn’t bother to create an index on the value attribute, which might have sped up the query.

With data available, it was time to run the query and measure its execution time:

running the query
1
2
3
4
var time = require("internal").time;
var start = time();
db._query("FOR doc IN test COLLECT value = doc.value RETURN value");
time() - start;

Running this in 2.2.7 took 3 minutes and 18 seconds before bursting with the following error message:

2.2.7 error message
1
2
3
4
#
# Fatal error in CALL_AND_RETRY_2
# Allocation failed - process out of memory
#

In the 2.2 branch AQL queries were fully transformed to JavaScript and executed using V8. Obviously that didn’t work well with large collections. That was one of the reasons why version 2.3 saw a major rewrite of the AQL engine.

As a consequence, running the query in 2.3 (2.3.5) worked fine. Execution took around 28 seconds. The same was true for 2.4 (2.4.8) and 2.5 (2.5.5).

Finally, running the query in 2.6.0 completed in just 3.2 seconds.

The reasons for the speedup are the optimizations done for COLLECT (see COLLECTing with a hash table, the Return value optimization for AQL and some minor optimizations within AQL that didn’t get a pretty working title.

Looks like in sum all the optimizations put into 2.6 really pay out.