While paging through the issues in the ArangoDB issue tracker I came across issue #987, titled Trying to get distinct document attribute values from a large collection fails.
The issue was opened around 10 months ago when ArangoDB 2.2 was around. We improved AQL performance somewhat since then, so I was eager to see how the query would perform in ArangoDB 2.6, especially when comparing it to 2.2.
For reproduction I quickly put together some example data to run the query on:
1 2 3 4 5 6 |
|
This produces a collection named test
with 4 million documents. Each document has a numeric value
attribute, which in total has 100 unique values. I remembered from a conversation with the guy that
opened the issue that the number of distinct values was 100 or even slightly lower. I didn’t bother
to create an index on the value
attribute, which might have sped up the query.
With data available, it was time to run the query and measure its execution time:
1 2 3 4 |
|
Running this in 2.2.7 took 3 minutes and 18 seconds before bursting with the following error message:
1 2 3 4 |
|
In the 2.2 branch AQL queries were fully transformed to JavaScript and executed using V8. Obviously that didn’t work well with large collections. That was one of the reasons why version 2.3 saw a major rewrite of the AQL engine.
As a consequence, running the query in 2.3 (2.3.5) worked fine. Execution took around 28 seconds. The same was true for 2.4 (2.4.8) and 2.5 (2.5.5).
Finally, running the query in 2.6.0 completed in just 3.2 seconds.
The reasons for the speedup are the optimizations done for COLLECT
(see
COLLECTing with a hash table, the
Return value optimization for AQL and some
minor optimizations within AQL that didn’t get a pretty working title.
Looks like in sum all the optimizations put into 2.6 really pay out.