
{ "subject" : "ArangoDB", "tags": [ "multi-model", "nosql", "database" ] }

A Tour Around the New AQL Query Optimizer

The major new feature in ArangoDB 2.3 is the shiny new AQL query optimizer and executor. These parts of ArangoDB have been rewritten in 2.3 to make AQL much better for our end users.

Since one of its initial releases, ArangoDB has shipped with AQL, the ArangoDB Query Language. AQL has since been ArangoDB’s most versatile way of executing simple and not-so-simple queries.

I’ll start with an overview of query execution in previous versions of ArangoDB, then describe the new engine and explain the differences.

History: query execution in pre-2.3

Previous versions of ArangoDB executed any AQL query in the following steps:

  • tokenize and parse query string into an abstract syntax tree (AST)
  • perform simple AST optimizations
  • collect filter conditions in AST and look for index usage opportunities
  • generate code
  • execute code

This approach was simple and has worked for a lot of queries, but it also had a few quirks:

First of all, most of the steps were carried out directly on the abstract syntax tree, with the AST nodes being modified in place. There was also just the one AST per query, so the old AQL executor could not generate multiple, potentially very different execution plan candidates for a given query.

The “old” optimizer was able to move AST nodes around during optimization and it was already able to consider multiple index candidates for a query, but it would not compare multiple plans and make a cost-based decision. It was also limited in the amount and scope of transformations it could safely apply to the AST.

When it came to code generation and execution, the “old” executor fully relied on V8 to execute the queries. Result sets were created using V8’s value objects. Documents from collections that queries needed to iterate over had to be made available to V8. While some optimizations were in place for this, the conversions could add up to significant overhead for certain kinds of queries.

The representation of queries via an AST also made it hard to generate code that supported lazy evaluation during query execution.

Finally, the AQL optimizer did not yet provide much support for queries that were to be executed in a distributed fashion across a cluster of servers.

Query execution in ArangoDB 2.3

We wanted to address all these issues with a rewrite of the AQL infrastructure. Starting with ArangoDB 2.3, AQL queries are executed in these steps:

  • tokenize and parse query string into an abstract syntax tree (AST)
  • perform simple AST optimizations
  • transform AST into execution plan
  • optimize and permute execution plans
  • estimate costs for execution plans and pick optimal plan
  • instantiate execution engine from optimal plan
  • (in cluster only) send execution plan parts to cluster nodes
  • execute query

Tokenization and parsing of AQL queries haven’t changed much in 2.3: query strings are still parsed using a Bison/Flex-based parser and lexer combo. The AST structure has proven to work well during the parsing stage, so the parser still creates an initial AST from the query string first.

After that, simple optimizations are performed directly on the AST, such as constant folding and constant propagation. Deterministic functions with constant operands are already evaluated at this stage, and their results are injected into the AST.
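For example, in the following query the constant sub-expression 2 * 21 can be folded into 42, and the deterministic function call LENGTH([ 1, 2, 3 ]), which only has constant operands, can already be evaluated and replaced by 3. The snippet below is a minimal sketch that can be run from arangosh to inspect the resulting plan; it assumes a collection named mycollection exists (explain() is covered in more detail in the “Analyzing plans” section below):

// constant folding: 2 * 21 can be replaced by 42 on the AST
// constant function evaluation: LENGTH([ 1, 2, 3 ]) can be replaced by 3
db._createStatement({
  query: "FOR doc IN mycollection " +
         "  FILTER doc.value == 2 * 21 " +
         "  RETURN { key: doc._key, len: LENGTH([ 1, 2, 3 ]) }"
}).explain();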

A major change in 2.3 is that no further transformations will be carried out on the AST. Instead, the AST will be transformed into an initial execution plan.

This execution plan is the starting point for the query optimizer. It will take the initial execution plan and apply transformations to it. A transformation will either update the existing plan in place or create a new, modified plan. The result of each transformation forms the input for further transformations that the query optimizer can carry out.

The result of the query optimization stage is one or more execution plans. For each plan, the optimizer will estimate a cost value and finally pick the plan with the lowest total estimated cost. This plan is considered to be the optimal plan. All other execution plans are discarded by the optimizer, as it considers them non-optimal.

The optimal execution plan is then executed by the execution engine. For a single-server AQL query, this is straightforward: for each step in the execution plan, a C++ object is created that is supposed to execute the particular step. Query execution is then started by asking the first of these objects for its results.

The objects for multiple processing steps are linked in a pipelined fashion with lazy evaluation. Pulling data from the first object will eventually trigger pulling data from the second object etc., until there are no more results to produce.

For a distributed query, this is a bit more complicated. The different execution steps will likely be shipped to different servers in the cluster, and the objects need to be instantiated on different servers, too. The different parts of the query may pull data from each other via HTTP calls between cluster nodes.

How execution plans work

An execution plan is a sequence of query execution steps. Let’s start with a very simple example:

FOR doc IN mycollection
  RETURN doc._key

This query will be transformed into the following execution plan:

  • SingletonNode: passes a single empty value to the following steps
  • EnumerateCollectionNode: iterates over all documents of a collection and provides the current document in an output variable. In our example, it will iterate over collection mycollection and provide each document in variable doc
  • CalculationNode: evaluates an expression and returns its result. In the example, it will calculate doc._key
  • ReturnNode: returns results to the caller
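Such a plan can also be inspected from arangosh without executing the query. This is a minimal sketch, assuming a collection named mycollection exists:

var stmt = db._createStatement({
  query: "FOR doc IN mycollection RETURN doc._key"
});
// returns the optimal execution plan as JSON, including the node types listed above
stmt.explain();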

When this plan is executed, the execution engine will start pulling data from the node at the bottom, that is, the ReturnNode. The ReturnNode cannot provide any data at this stage, so it will ask its predecessor node, which in the example is the CalculationNode. The CalculationNode does not have any data of its own yet either, so it must ask the node in front of it. The EnumerateCollectionNode will in turn first ask the SingletonNode for input data. So the execution flow has bubbled up from the bottom of the sequence to the top.

The SingletonNode will now produce a single empty return value. It will also internally set its processing status to done, so it will not produce any more values if asked again. This is all a SingletonNode will ever do. We’ll see later why such a node may still be useful.

The single empty value will be provided as input to the EnumerateCollectionNode. This node will now go through all the documents in the underlying collection, and return them once for each input value it got. As its input value was the singleton, it will return the documents of the collection just once.

Processing is carried out in blocks of 1,000 values by default. The EnumerateCollectionNode will thus not return all documents to its successor node at once, but only up to 1,000 at a time. The return value will be a vector of up to 1,000 documents, stored under the variable name doc.

The CalculationNode, still waiting for input data, can now execute its expression doc._key on this input value. It will execute this expression 1,000 times, once for each input value. The expression results will be stored in another variable. This variable is anonymous, as it hasn’t been named explicitly in the original query. The vector of results produced by the CalculationNode is then returned to the ReturnNode, which will then return it to the caller.

If the caller requests more documents, the procedure will repeat. Whenever a processing step cannot produce any more data, it will ask its predecessor step for more data. If the predecessor step already has status done, the current step will set itself to done as well, so a query will actually come to an end if there are no more results.

As can be seen, steps are executed with batches of values. We thought this would be a good way to improve efficiency and reduce the number of hops between steps.
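To illustrate the pull model, here is a purely illustrative JavaScript sketch. It is not ArangoDB’s actual C++ implementation, and the getSome interface and the EnumerateArrayNode name are made up for this example:

// each node pulls batches of up to batchSize values from its predecessor
var batchSize = 1000;

function EnumerateArrayNode (values) {
  var pos = 0;
  this.getSome = function () {
    if (pos >= values.length) {
      return null;                       // status "done"
    }
    var batch = values.slice(pos, pos + batchSize);
    pos += batch.length;
    return batch;
  };
}

function CalculationNode (dependency, expression) {
  this.getSome = function () {
    var batch = dependency.getSome();    // pull a batch from the predecessor
    if (batch === null) {
      return null;                       // predecessor is done, so are we
    }
    return batch.map(expression);        // evaluate the expression once per input value
  };
}

// build a tiny pipeline and let the caller pull until everything is done
var root = new CalculationNode(
  new EnumerateArrayNode([ { _key: "a" }, { _key: "b" } ]),
  function (doc) { return doc._key; }
);

var results = [], batch;
while ((batch = root.getSome()) !== null) {
  results = results.concat(batch);
}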

Joins

Let’s say we want to join documents from two collections, based on common attribute values. Let’s use users and logins, joined by their id and userId attributes:

FOR user IN users
  FOR login IN logins
    FILTER user.id == login.userId
    RETURN { user: user, login: login }

Provided that there are no indexes, the query may be turned into this execution plan by the optimizer:

  • SingletonNode: passes a single empty value to the following steps
  • EnumerateCollectionNode: will iterate over all documents in collection users and produce a variable named user
  • EnumerateCollectionNode: will iterate over all documents in collection logins and produce a variable named login
  • CalculationNode: will calculate the result of the expression user.id == login.userId
  • FilterNode: will let only documents pass that match the filter condition (calculated by the CalculationNode above it)
  • CalculationNode: will calculate the result of the expression { user: user, login: login }
  • ReturnNode: returns results to the caller

Now we can see why the SingletonNode is useful: it can be used as an input to another node, telling that node to execute just once. Having the SingletonNode ensures that the outermost EnumerateCollectionNode will iterate over the documents in its underlying collection users only once.

The inner EnumerateCollectionNode for collection logins is now fed by the outer EnumerateCollectionNode on users. Thus these two nodes will produce a cartesian product. This will be done lazily, as producing results will normally happen in chunks of 1,000 values each.

The results of the cartesian product are then post-filtered by the FilterNode, which will only let those documents pass that match the filter condition of the query. The FilterNode employs its predecessor, the CalculationNode, to determine which values satisfy the condition.

Using indexes

Obviously creating cartesian products is not ideal. The optimizer will try to avoid generating such plans if it can, but it has no choice if there are no indexes present.

If there are indexes on attributes that are used in FILTER conditions of a query, the optimizer will try to turn EnumerateCollectionNodes into IndexRangeNodes. The purpose of an IndexRangeNode is to iterate over a specific range in an index. This is normally more efficient than iterating over all documents of a collection.

Let’s assume there is an index on logins.userId. Then the optimizer might be able to generate a plan like this:

  • SingletonNode: passes a single empty value to the following steps
  • EnumerateCollectionNode: will iterate over all documents in collection users and produce a variable named user
  • IndexRangeNode: will iterate over the values in the index on logins.userId that match the value of user.id and produce a variable named login
  • CalculationNode: will calculate the result of the expression user.id == login.userId
  • FilterNode: will let only documents pass that match the filter condition (calculated by the CalculationNode above it)
  • CalculationNode: will calculate the result of the expression { user: user, login: login }
  • ReturnNode: returns results to the caller

To run this query, the execution engine must still iterate over all documents in collection users, but for each of those, it only needs to find the documents in logins that match the join condition. This most likely means far fewer lookups and thus much faster execution.
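In arangosh, such an index could be created and the resulting plan inspected like this. This is just a sketch, assuming the users and logins collections from the example exist:

// create a (non-unique) hash index on logins.userId
db.logins.ensureHashIndex("userId");

// with the index in place, the optimizer should be able to replace the inner
// EnumerateCollectionNode with an IndexRangeNode
db._createStatement({
  query: "FOR user IN users " +
         "  FOR login IN logins " +
         "    FILTER user.id == login.userId " +
         "    RETURN { user: user, login: login }"
}).explain();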

Permutation of loops

Now consider adding an extra FILTER statement to the original query so we end up with this:

FOR user IN users
  FOR login IN logins
    FILTER user.id == login.userId
    FILTER login.ts == 1415402319       /* added this one! */
    RETURN { user: user, login: login }

The optimizer is free to permute the order of FOR loops as long as this won’t change the results of a query. In our case, permutation of the two FOR loops is allowed (the query does not contain a SORT instruction so the order of results is not guaranteed).

If the optimizer exchanges the two loops, it can also pull the FILTER statement on login.ts out of the inner loop and move it up into the outer loop. It might come up with a plan like this, which may be more efficient if a lot of documents from logins can be filtered out early:

FOR login IN logins
  FILTER login.ts == 1415402319
  FOR user IN users
    FILTER user.id == login.userId
    RETURN { user: user, login: login }

Exchanging the order of FOR loops may also allow the optimizer to use additional indexes.

A last note on indexes: the optimizer in 2.3 is able to use (sorted) skiplist indexes to eliminate extra SORT operations. For example, if there is a skiplist index on login.ts, the SORT in the following query can be removed by the optimizer:

FOR login IN logins
  FILTER login.ts > 1415402319
  SORT login.ts
  RETURN login

The AQL optimizer in 2.3 can optimize away a SORT even if the sort order is backwards or if no FILTER statement is used in the query at all.
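A quick way to verify this from arangosh (a sketch, assuming the logins collection exists; explain() is described in the next section):

// create a skiplist index on logins.ts
db.logins.ensureSkiplist("ts");

// the optimizer can now use the sorted index to produce the documents in
// ts order, so the plan should not contain a SortNode anymore
db._createStatement({
  query: "FOR login IN logins " +
         "  FILTER login.ts > 1415402319 " +
         "  SORT login.ts " +
         "  RETURN login"
}).explain();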

Analyzing plans

One particular improvement over 2.2 is that in ArangoDB 2.3 the optimizer provides functionality for retrieving full execution plan information for queries without executing them. The execution plan information can be inspected by developers or DBAs, and, as it is JSON-encoded, can also be analyzed programmatically.

Retrieving the execution plan for a query is straightforward:

arangosh> db._createStatement({ query: <query> }).explain();

By default, the optimizer will return just the optimal plan, containing all the plan’s execution nodes with lots of extra information plus cost estimates.

The optimizer is also able to return the alternative plans it produced but considered to be non-optimal:

arangosh> db._createStatement({ query: <query> }).explain({ allPlans: true });

This will hopefully allow developers and DBAs to get a better idea of how an AQL query will be executed internally.
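For example, the node types of the optimal plan can be extracted programmatically. The following sketch assumes that the object returned by explain() exposes a plan attribute with a nodes array, each node carrying a type attribute:

var result = db._createStatement({
  query: "FOR login IN logins FILTER login.ts > 1415402319 RETURN login"
}).explain();

// print the type of each execution node of the optimal plan,
// e.g. to check whether an IndexRangeNode is used
result.plan.nodes.forEach(function (node) {
  require("internal").print(node.type);
});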

Additionally, simple execution statistics are returned by default when executing a query. These statistics can also be used to get an idea of the runtime costs of a query after execution.
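A sketch of retrieving the statistics from arangosh; it assumes that the cursor returned by execute() provides a getExtra() method, which should be the case in recent arangosh versions:

var cursor = db._createStatement({
  query: "FOR login IN logins FILTER login.ts > 1415402319 RETURN login"
}).execute();

// getExtra() returns extra information about the query, including the
// execution statistics collected while the query was running
require("internal").print(cursor.getExtra());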

Writing optimizer rules

The AQL optimizer itself is dumb. It will simply try to apply all transformations from its rulebook to each input execution plan it is fed with. This will produce output execution plans, on which further transformations may or may not be applied.

The more interesting part of the AQL optimizer stage is thus the rulebook. Each rule in the rulebook is a C++ function that is executed for an input plan.

Adding a new optimizer rule to the rulebook is intentionally simple. One of the design goals of the new AQL optimizer was to keep it flexible and extensible. All that is needed to add an optimizer rule is to implement a C++ function with the following signature:

(Optimizer*, ExecutionPlan*, Optimizer::Rule const*) -> int

and register it once in the Optimizer’s rulebook.

An optimizer rule function is called with an instance of the query optimizer (which it can use to register new plans), the current execution plan, and some information about the rule itself, taken from the rulebook.

The optimizer rule function can then analyze the input execution plan, modify it in place, and/or create additional plans. It must return a status code to the optimizer to indicate whether something went wrong.

Outlook

The AQL optimizer features described here are available in ArangoDB 2.3, which is currently in beta stage.

Writing a perfect query optimizer is a never-ending endeavour. Other databases provide new optimizer features and fixes even decades after the initial version.

Our plan is to ship 2.3 with several essential and useful optimizer rules. We will likely add more in future releases. We’re also open to contributions: if you can think of optimizer rules that are missing but that you would like to see in ArangoDB, please let us know. If you would like to contribute to the optimizer and write some rule code, consider sending a pull request or an email to hackers@arangodb.org.