Note: this post is about the ArangoDB 2.x series
This post is about some recent extensions for the AQL graph traversal functionality in ArangoDB.
These extensions allow invoking user-defined JavaScript code for filtering and results generation in AQL queries that contain traversals.
This should make AQL graph traversals much more powerful than before. Additionally, AQL graph traversals get more configurable, allowing to write traversal functions with control-flow logic and complex filtering. As a side-effect, this change facilitates writing specialized traversal functions with much higher efficiency than the general-purpose, cover-all-cases default ones.
The extensions are currently available in the devel branch of ArangoDB on in the 2.4 branch (with 2.4.2 being the first release to include them).
Example graph
For all following examples, I’ll be using a small example graph that can be set up by running this script from the ArangoShell.
I have chosen this small graph because it is easy to understand and still complex enough to demonstrate some common traversal use cases.
The example graph consists of the following two collections:
v
: a collection with vertex documentse
: an edge collection containing the connections between vertices inv
All vertices in the graph have a type
attribute, with types being either root
,
continent
, country
or capital
. The graph is a tree, so it has only one vertex
with type root
. The root vertex is named world. Below the root there are only
vertices of type continent
. These are also connected to some country
vertices.
Finally, country
vertices are also connected to capital
vertices:
1
|
|
In the examples, we’ll only look at the vertices and ignore what the connections look like.
Custom visitors
We know the graph is a tree, so let’s print its structure in a textual format using AQL. We’ll employ a custom visitor function for this. A custom visitor is a user-defined callback function that is called for every vertex that is encountered during a graph traversal. Custom visitor functions need to be written in JavaScript and be registered once before they can be used from an AQL query.
Custom visitors have the following function signature:
1
|
|
The function parameters have the following meanings:
config
: the traversal configurationresult
: the result already generated by the traversal. This is important only if the visitor is designed to modify an existing result in-placevertex
: the currently visited vertex documentpath
: the path from the start vertex to the currently visited vertex document. The path will contain an arrayvertices
and an arrayedges
Let’s register a custom visitor named myfunctions::structurePrinter
. This can done
by running the following code from the ArangoShell:
1 2 3 4 5 6 7 |
|
Processing vertex data with a function
The above function will be called for every vertex in the graph when we use it in a traversal. Let’s do it and run the AQL query to invoke the visitor function.
I suggest running the query from the web interface’s AQL editor:
1 2 3 4 5 6 |
|
The visitor will visit all vertices in the graph, starting at the vertex
v/world
as specified. It will then follow incoming connections using a
depth-first search (this was not specified in the query as it is the default).
As we started with the root vertex of the graph, the query will visit all vertices exactly once. Fortunately the example graph is a tree and does not contain any cycles, so we do not have to care about how to make the traversal terminate. The traversal will automatically terminate after it has visited all nodes.
The AQL query should produce something like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
|
This should provide a good overview of the graph’s contents.
Referring to elements in the path
To return the above result in a more structured manner, let’s overwrite the previous visitor function with one that returns the most interesting vertex attributes individually. Let’s include one that shows the nesting level of each vertex in the tree:
1 2 3 4 5 6 7 8 9 |
|
Running the same AQL query will now return something like:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 |
|
Adding control flow
Now let’s add some control flow to the visitor function. The following visitor function will also return information about each vertex’ parent – except for the root vertex, which does not have a parent:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
|
Running our AQL query will now produce a different type of result for the root vertex than for all the other vertices:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
|
Of course much more things can be achieved by peeking the path variable.
Filtering
Now let’s try to restrict the results of a graph traversal to just some vertices, for example, all European countries. As we know the structure of the graph is quite simple, the following naive approach will already do:
1 2 3 4 5 6 7 8 |
|
But the above is clearly not ideal.
First of all, the traversal will still visit every vertex in the graph, even though most vertices will not be returned. Ideally, one will want to restrict a traversal to visit as few vertices as possible, especially in a big graph or in a graph that contains cycles.
Second, the above visitor is looking into a vertex’ direct parent for filtering. This will work for graphs that have a rigid structure, but may not work in more complex setups.
We better use a dedicated function for filtering. Such function can control if a given vertex is going to be visited (via calling the visitor function) and if its connections should be followed. It can skip non-interesting vertices early, providing a good way to make traversals more efficient.
A filter function can be specified in the filterVertices attribute of the traversal options. If specified, filterVertices needs to contain the name of a custom AQL function. A filter function again needs to be written in JavaScript and has the following signature:
1
|
|
The filter function will be called for each vertex. It can return one of the following values:
"prune"
: visit the vertex, but do not descend into its connections"exclude"
: do not visit the vertex, but do descend into its connections[ "prune", "exclude" ]
: do not visit, and do not descendundefined
(default): visit and descend
The following filter function will return "exclude"
for the root vertex, leading
to the visitor not being called for it. However, the traversal will still descend
into the connections of the root node.
On the next level, all continents will be enumerated. The filter will return
[ "prune", "exclude" ]
for all continents but Europe, leading to the visitor not
being invoked for these continents, and their connections not being followed.
For the Europe vertex, it will return "exclude"
,
meaning the visitor will not be called, but the traversal will descend into the
connections of Europe
.
For all vertices of type country
, the visitor will be
called. This is ok because the filter previously prevented the traversal from
descending into any other country but Europe.
Finally, the filter will return "prune"
for all countries, meaning the traversal
will not descend into a country’s connections (in this case that would be the captial vertices).
This will make the traversal end at the country
level.
Here it is:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
|
Putting the logic into the filter function allows using a very simple visitor:
1 2 3 4 5 |
|
We must also slightly extend our AQL query and tell it to use our custom filter function:
1 2 3 4 5 6 7 |
|
Running the adjusted AQL query should produce something like:
1 2 3 4 5 6 7 8 |
|
Custom filter functions are a very general approach. Obviously much more complex tasks than shown here can be achieved with them. We have often been asked for ways to set up complex filter conditions in AQL traversals, and I hope these custom filter functions will cover most of that.
Special visitors
Traversal depths can be controlled with the general configuration parameters
minDepth
and maxDepth
. These parameters are helpful to make the traversal only
include vertices occurring after the specified distance from the start vertex, or
up to the specified distance away from the start vertex. This helps bounding
traversals, but is not flexible enough when handling graphs with very distinct
path lengths.
Returning only leaf nodes
For example, finding leaf nodes in a graph is quite hard using a default traversal. The filterVertices function cannot be used to find leaf nodes, because filterVertices is called before a vertex’ connections are determined. The same is true for visitor functions. There were not provided any information about whether the currently visited vertex has connections or not. All a visitor could previously do to find leaf nodes is to return each visited vertex along with the full path information. Some post-processing of the traversal result with regular AQL was then required to detect the leaf nodes in that result.
This could easily get inefficient, especially in a big graphs for which the intermediate results created by the default traversal visitor grew beyond reasonable sizes.
We therefore added a mechanism that can pass information about the vertex’ connections to the visitor. This allows writing new types of visitor functions. For example, it makes it easy to write visitors that can return only leaf nodes.
In order to have the traversal pass the currently visited vertex’ connections to the
visitor function, the traversal parameter order
must be set to a value of "preorder-expander"
.
The traversal’s visitor function will then be called with an additional fifth
parameter named connected
, which is an array of the connections of the current
vertex. This array will be empty if the traversal’s expander function did not
find any connections for the vertex.
Here’s a simple visitor that will make a traversal return only all leaf nodes:
1 2 3 4 5 6 7 |
|
And here’s an AQL query that shows how to use this type of visitor:
1 2 3 4 5 6 7 |
|
As a result, the above query will only print vertices of type capital
(because those are the only leaf nodes in the graph).
The nice thing when looking at the custom visitor function is that it only filters on the number of connections, but not on vertex type or anything else specific for this type of graph.
So it seems like the above function is general purpose and can be reused for other graphs, too.
Counting vertices
Let’s say we wanted to count the number of vertices in the graph, or the number of vertices that passed our filterVertices function.
Counting globally
This is easy to achieve with a custom visitor like this:
1 2 3 4 5 6 7 8 |
|
Note that the above visitor function does not return anything, but will modify an existing result in place. As a result, it will produce one counter value, which will be increased whenever the visitor is called for a vertex.
To invoke this visitor and retrieve the count value, we have to set the visitorReturnsResults
attribute in the AQL query to false. This will make the traversal code pass the existing
result into the visitor and does not expect it to return any results via a return
instruction.
Here’s how to run this visitor:
1 2 3 4 5 6 |
|
The result will simply be:
1 2 3 |
|
Counting by type
Let’s say we wanted to count vertices by type. This is similar, except that now one global counter value is insufficient and we instead need an object to keep track of the different counters. We can still get away with modifying an existing result in place:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
Our invocation AQL query does not change. The result of this visitor for the example graph will be:
1 2 3 4 5 6 7 8 |
|
Of course such visitors can be combined with custom filters.
You probably ask why writing custom code is required to achieve a simple task like counting. The main reason for this is that the traversal functionality is very general purpose and is not optimized for a specific use case like just counting. For example, the default traversal visitor will copy the complete vertex and path information into the result.
This can produce very big intermediate results if the graph is big or vertices contain lots of data. If all we want is to count the number of vertices globally or per type, we are better off with something more specialized.
The good news is that the most simple use case “count all vertices” there is a predefined
visitor named _AQL::COUNTINGVISITOR
that can directly be used from a query, without prior
registration of a custom function:
1 2 3 4 5 |
|
Accessing components from the path
Visitors are allowed to peek into the paths
variable to check how the currently visited
vertex is linked to others.
The paths variable is an object with a vertices
sub-attribute and an edges
sub-attribute.
vertices
is an array including all vertices in the path from the start vertex up to the
currently visited vertex. The currently visited vertex is included in this array.
edges
is an array including all connections (edges) between the start vertex and the
currently visited vertex. This array might be empty, in case the visitor is called for the
start vertex.
The following visitor function demonstrates how to peek into paths
: it will produce a
stringified version of the path for all leaf vertices:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
It can be invoked as follows:
1 2 3 4 5 6 7 |
|
It will return something like the following:
1 2 3 4 5 6 7 8 9 10 |
|
Note: the contents of the path
variable will change between calls to the
visitor function. Therefore is it not safe to reference arrays or objects from path
in the result for visitors that modify the result
variable in place (i.e. when
visitorReturnsResults is set to false). The safe way to put path components
into the result of such visitors is to clone the parts of the path before putting
them into result
.
Passing parameters into visitors and filters
It is often useful to pass own parameters into function to provide some sort of invocation context. For example, the purpose of the following visitor function is to return an object with only certain attributes of each visited vertex:
1 2 3 4 5 6 7 8 9 10 11 |
|
Which attributes the function will return can be configured by passing in an
array of attribute names in the config
parameter’s data
sub-attribute. Here’s
an AQL query that will configure the visitor to return _id
and type
:
1 2 3 4 5 6 7 8 9 |
|