J@ArangoDB

{ "subject" : "ArangoDB", "tags": [ "multi-model", "nosql", "database" ] }

AQL Improvements for 2.4

While on a retreat in Belgium, we found some spare time to work on improvements for AQL. These will be shipped with ArangoDB version 2.4, and are already available in the devel version for testing from now on.

Here’s a short overview of the improvements:

COLLECT WITH COUNT

A common use case in query languages is to count the number of documents returned by a query. The AQL solution for this has been to use the LENGTH function and a subquery:

1
2
3
4
5
6
RETURN LENGTH((
  FOR doc IN collection 
    FILTER doc.someAttribute == someValue
    RETURN doc
  )
)

This works but is probably unintuitive for people which have used SQL for years.

We therefore now allow using the following alternative syntax, using the new COLLECT ... WITH COUNT INTO ... clause:

1
2
3
4
FOR doc IN collection
  FILTER doc.someAttribute == someValue
  COLLECT WITH COUNT INTO length
  RETURN length

This query returns just the total number of matches, but not the matches themselves. As this query is made for counting only, it can be executed more efficiently than the original query. In the query with the COUNT INTO ... clause, the documents found by the filter condition will only be counted and then instantly discarded. They will not be shipped around inside the query, from the subquery to the top level into the LENGTH() function.

This new variant will be drastically faster than the old variant if there is no filter condition at all. When there is a filter condition, evaluating the filter condition might be the most computationally expensive part of the query. But even then, the new variant should be faster than the old one and use less memory.

As a bonus, there is no need to use a subquery anymore, though the subquery variant is still fully supported and will be.

COLLECT ... WITH COUNT also works for counting the number of items per group:

1
2
3
FOR doc IN collection
  COLLECT value = doc.someAttribute WITH COUNT INTO length
  RETURN { value: value, length : length }

This returns the number of matches for each distinct value.

A quick unscientific benchmark reveals that the specialized WITH COUNT clause seems to be faster than the old variant. The following results show the differences for a collection with 500,000 small documents:

The old variant that counts the number of documents per age runs in 4.75 seconds on my laptop:

1
2
3
4
FOR doc IN collection
  FILTER doc.age < 20 
  COLLECT age = doc.age INTO g 
  RETURN { age: age, length: LENGTH(g) }

The new variant produces the same result, but runs in 0.6 seconds locally:

1
2
3
FOR doc IN collection 
  COLLECT age = doc.age WITH COUNT INTO length 
  RETURN { age: age, length: length }

A notable speedup can also be observed if only a fraction of the groups is built (here: 1/8). The old variant for this runs in 0.6 seconds:

1
2
3
4
FOR doc IN collection
  FILTER doc.age < 20 
  COLLECT age = doc.age INTO g 
  RETURN { age: age, length: LENGTH(g) }

The new variant runs in 0.12 seconds:

1
2
3
4
FOR doc IN collection 
  FILTER doc.age < 20 
  COLLECT age = doc.age WITH COUNT INTO length 
  RETURN { age: age, length: length }

The absolute times may vary greatly depending on the type of documents and the hardware used, but in general the new variant should provide a speedup.

COLLECT with group expression

Finally, COLLECT ... INTO has been extended to support just another variant that can reduce the amount of copying inside a query.

Let’s have a look at this example query:

1
2
3
FOR doc IN collection 
  COLLECT age = doc.age INTO g
  RETURN { age: age, maxDate: MAX(g[*].doc.dateRegistered) }

In the above query, for each distinct age value, all documents are collected into variable g. When the collecting phase is over, there will be an iteration over all the collected documents again, to extract their dateRegistered value. After that, the dateRegistered values will be passed into the MAX() function.

This query can be made more efficient now as follows:

1
2
3
FOR doc IN collection 
  COLLECT age = doc.age INTO g = doc.dateRegistered
  RETURN { age: age, maxDate: MAX(g) }

The new thing about this variant is the expression following the INTO. Having an expression there allows controlling what values are collected for each group. Using a projection expression here can greatly reduce the amount of copying afterwards, and thus make the query more efficient than if it had to copy the entire documents.

Removing filters covered by indexes

FILTER conditions which are completely covered by indexes will now be removed from the execution plan if it is safe to do so. Dropping the FILTER statements allows the optimizer to get rid of not only the FilterNode, but also its corresponding CalculationNode. This will save a lot of computation if the condition needs to be checked for many documents.

For example, imagine the following query:

1
2
3
FOR doc IN collection
  FILTER doc.value < 10 
  RETURN doc

If there is a (skiplist) index on doc.value, the optimizer may decide to use this index. It will replace the query’s EnumerateCollectionNode with an IndexRangeNode instead first. The IndexRangeNode will scan the index on doc.value for the range [-inf, 10).

Following that, the optimizer rule remove-filter-covered-by-index should fire and detect that the FILTER condition is already covered by the IndexRangeNode alone. It can thus remove the FilterNode. This also makes the CalculationNode of the FilterNode obsolete, so these two nodes will be removed and computation is saved.

Removing brackets for subquery function call parameters

Since the beginning of AQL, the parser required the user the put subqueries that were used as function parameters inside two pairs of brackets.

For example, the following query did not parse in 2.3 and before:

1
RETURN LENGTH(FOR doc IN collection RETURN doc)

Instead, it needed to be written as:

1
RETURN LENGTH((FOR doc IN collection RETURN doc))

If you didn’t notice the difference, the latter version of the query had duplicate parentheses. The requirement to use duplicate parentheses has caused several support questions over time, and this can be taken as a proof that it was not intuitive.

The requirement for duplicate parentheses was an artifact required by the AQL parser grammar in order to parse the query correctly.

For 2.4, the AQL grammar has been cleaned up in this respect. Duplicate parentheses are still allowed and work fine in 2.4 but they are not required anymore. This should make the first steps with AQL a bit easier and more intuitive.

We’re 1.5 days into our retreat now. Maybe there’ll be some more AQL-related improvements in the end. Let’s see.