While on a retreat in Belgium, we found some spare time to work on improvements for AQL. These will be shipped with ArangoDB version 2.4, and are already available in the devel version for testing from now on.
Here’s a short overview of the improvements:
COLLECT WITH COUNT
A common use case in query languages is to count the number of
documents returned by a query. The AQL solution for this has been
to use the LENGTH
function and a subquery:
1 2 3 4 5 6 |
|
This works but is probably unintuitive for people which have used SQL for years.
We therefore now allow using the following alternative syntax,
using the new COLLECT ... WITH COUNT INTO ...
clause:
1 2 3 4 |
|
This query returns just the total number of matches, but not the
matches themselves. As this query is made for counting only, it
can be executed more efficiently than the original query.
In the query with the COUNT INTO ...
clause, the documents found
by the filter condition will only be counted and then instantly
discarded. They will not be shipped around inside the query, from
the subquery to the top level into the LENGTH()
function.
This new variant will be drastically faster than the old variant if there is no filter condition at all. When there is a filter condition, evaluating the filter condition might be the most computationally expensive part of the query. But even then, the new variant should be faster than the old one and use less memory.
As a bonus, there is no need to use a subquery anymore, though the subquery variant is still fully supported and will be.
COLLECT ... WITH COUNT
also works for counting the number of items
per group:
1 2 3 |
|
This returns the number of matches for each distinct value
.
A quick unscientific benchmark reveals that the specialized
WITH COUNT
clause seems to be faster than the old variant.
The following results show the differences for a collection with
500,000 small documents:
The old variant that counts the number of documents per age runs in 4.75 seconds on my laptop:
1 2 3 4 |
|
The new variant produces the same result, but runs in 0.6 seconds locally:
1 2 3 |
|
A notable speedup can also be observed if only a fraction of the groups is built (here: 1/8). The old variant for this runs in 0.6 seconds:
1 2 3 4 |
|
The new variant runs in 0.12 seconds:
1 2 3 4 |
|
The absolute times may vary greatly depending on the type of documents and the hardware used, but in general the new variant should provide a speedup.
COLLECT with group expression
Finally, COLLECT ... INTO
has been extended to support just another variant
that can reduce the amount of copying inside a query.
Let’s have a look at this example query:
1 2 3 |
|
In the above query, for each distinct age
value, all documents are collected
into variable g
. When the collecting phase is over, there will be an iteration
over all the collected documents again, to extract their dateRegistered
value.
After that, the dateRegistered
values will be passed into the MAX()
function.
This query can be made more efficient now as follows:
1 2 3 |
|
The new thing about this variant is the expression following the INTO
.
Having an expression there allows controlling what values are collected for
each group. Using a projection expression here can greatly reduce the
amount of copying afterwards, and thus make the query more efficient than if
it had to copy the entire documents.
Removing filters covered by indexes
FILTER
conditions which are completely covered by indexes will
now be removed from the execution plan if it is safe to do so.
Dropping the FILTER
statements allows the optimizer to get rid
of not only the FilterNode, but also its corresponding CalculationNode.
This will save a lot of computation if the condition needs to be checked
for many documents.
For example, imagine the following query:
1 2 3 |
|
If there is a (skiplist) index on doc.value
, the optimizer may
decide to use this index. It will replace the query’s EnumerateCollectionNode
with an IndexRangeNode instead first. The IndexRangeNode will scan the index
on doc.value
for the range [-inf, 10).
Following that, the optimizer rule remove-filter-covered-by-index
should fire and detect that the FILTER
condition is already covered
by the IndexRangeNode alone. It can thus remove the FilterNode.
This also makes the CalculationNode of the FilterNode obsolete,
so these two nodes will be removed and computation is saved.
Removing brackets for subquery function call parameters
Since the beginning of AQL, the parser required the user the put subqueries that were used as function parameters inside two pairs of brackets.
For example, the following query did not parse in 2.3 and before:
1
|
|
Instead, it needed to be written as:
1
|
|
If you didn’t notice the difference, the latter version of the query had duplicate parentheses. The requirement to use duplicate parentheses has caused several support questions over time, and this can be taken as a proof that it was not intuitive.
The requirement for duplicate parentheses was an artifact required by the AQL parser grammar in order to parse the query correctly.
For 2.4, the AQL grammar has been cleaned up in this respect. Duplicate parentheses are still allowed and work fine in 2.4 but they are not required anymore. This should make the first steps with AQL a bit easier and more intuitive.
We’re 1.5 days into our retreat now. Maybe there’ll be some more AQL-related improvements in the end. Let’s see.