J@ArangoDB

{ "subject" : "ArangoDB", "tags": [ "multi-model", "nosql", "database" ] }

Subquery Optimizations

This is another post demonstrating some of the AQL query performance improvements that can be expected in ArangoDB 2.6. Specifically, this post is about an optimization for subqueries. AQL queries with multiple subqueries will likely benefit from it.

The following example queries use the same users collection with 100,000 documents that was used in the previous post about return value optimizations. Again, the source data for the collection can be found here.

We’ll start with a query that uses a single subquery extracting all users from the collection into a flat result array:

example query with single subquery
1
2
3
RETURN [ 
  (FOR u IN users RETURN u) 
]

This query is equally fast with ArangoDB 2.5 and 2.6, no changes here.

Let’s ramp it up to using two subqueries, one for users with a gender attribute value of male, and one for users with gender attribute value female. No indexes were used for the extraction in 2.5 nor 2.6:

example query with two subqueries
1
2
3
4
RETURN [ 
  (FOR u IN users FILTER u.gender == 'male' RETURN u), 
  (FOR u IN users FILTER u.gender == 'female' RETURN u) 
]

The query takes 16.6 seconds to execute in 2.5, but only 2.95 seconds with ArangoDB 2.6. This 80 % reduction in execution time is due to ArangoDB 2.6 being a bit smarter about subqueries than 2.5 is.

In the above query, the two subqueries are independent, so not only can they be executed in any order, but they also do not rely on each other’s results. ArangoDB 2.6 will detect that and avoid copying variables and intermediate results into subqueries if they are actually not needed there. 2.5 copied all variables into subqueries unconditionally, even if variables were not needed there.

In 2.6, any AQL query with multiple subqueries will benefit from this optimization. The performance improvements will be greater if subqueries late in the execution pipeline have a lot of intermediate results created in front of them, but do not rely on these intermediate results.

Another nice example for a 2.6 speedup is extracting a single attribute per subquery, as is done for the name attribute in the following query:

extracting a single attribute in two subqueries
1
2
3
4
RETURN [ 
  (FOR u IN users FILTER u.gender == 'male' RETURN u.name), 
  (FOR u IN users FILTER u.gender == 'female' RETURN u.name) 
]

This takes 42 seconds to execute in 2.5, and only 0.86 seconds in 2.6. This is a more than 95 % reduction in execution time. It is caused by a mix of factors, one of them again being the subquery optimization that avoids copying unneeded intermediate results.

Enjoy!